Monday, 2020-08-03

*** ryohayakawa has joined #opendev		00:01
*** DSpider has quit IRC		00:12
openstackgerrit	Pierre-Louis Bonicoli proposed zuul/zuul-jobs master: Avoid to use 'length' filter with null value https://review.opendev.org/742316	01:28
openstackgerrit	Pierre-Louis Bonicoli proposed zuul/zuul-jobs master: Avoid to use 'length' filter with null value https://review.opendev.org/742316	01:32
*** mlavalle has quit IRC		03:31
*** mlavalle has joined #opendev		03:31
*** raukadah is now known as chkumar|rover		04:33
*** DSpider has joined #opendev		06:07
*** ryo_hayakawa has joined #opendev		06:12
*** ysandeep|away is now known as ysandeep		06:13
*** ryohayakawa has quit IRC		06:14
openstackgerrit	OpenStack Proposal Bot proposed openstack/project-config master: Normalize projects.yaml https://review.opendev.org/744096	06:28
*** rhayakaw__ has joined #opendev		06:32
*** ryo_hayakawa has quit IRC		06:34
*** frickler_pto is now known as frickler		07:26
*** ssaemann has joined #opendev		07:27
*** ssaemann has quit IRC		07:28
*** ssaemann has joined #opendev		07:28
*** tosky has joined #opendev		07:35
*** hashar has joined #opendev		07:38
*** ysandeep is now known as ysandeep|lunch		07:44
*** ssaemann has quit IRC		07:55
*** sshnaidm|afk is now known as sshnaidm		07:58
*** ttx has quit IRC		08:00
*** moppy has quit IRC		08:01
*** ttx has joined #opendev		08:01
*** moppy has joined #opendev		08:01
*** dtantsur|afk is now known as dtantsur		08:10
*** fressi has joined #opendev		08:14
*** ysandeep|lunch is now known as ysandeep		08:27
*** bolg has joined #opendev		08:36
*** hashar has quit IRC		09:28
*** ysandeep is now known as ysandeep|afk		09:41
*** fressi_ has joined #opendev		09:56
*** fressi has quit IRC		09:57
*** fressi_ is now known as fressi		09:57
*** fressi has quit IRC		10:01
*** fressi has joined #opendev		10:01
*** dtantsur is now known as dtantsur|brb		10:09
*** tkajinam has quit IRC		10:13
*** ysandeep|afk is now known as ysandeep		10:54
*** rhayakaw__ has quit IRC		11:05
*** chkumar|rover is now known as chandankumar		11:12
*** chandankumar is now known as chkumar|rover		11:13
openstackgerrit	Joshua Hesketh proposed opendev/bindep master: Allow underscores in profile names https://review.opendev.org/735142	11:23
openstackgerrit	Joshua Hesketh proposed opendev/bindep master: Allow uppercase letters in profiles https://review.opendev.org/735143	11:23
*** fressi has quit IRC		11:41
*** fressi_ has joined #opendev		11:41
*** fressi_ has quit IRC		11:46
*** dtantsur|brb is now known as dtantsur		11:58
frickler	infra-root: I think something is bad, big backlog in zuul, lots of failed node launch attempts since about 0600 today http://grafana.openstack.org/d/rZtIH5Imz/nodepool?orgId=1&from=now-24h&to=now	14:10
clarkb	if I had to guess it's some version of the grub update bugs that have hit various distros (oddly the debuntu version is apparently unrelated to uefi and the rhel/centos version is specific to uefi)	14:12
openstackgerrit	Liam Young proposed openstack/project-config master: Add Ceph iSCSI charm to OpenStack charms https://review.opendev.org/744479	14:12
clarkb	I think nodepool dumps nodepool console logs for failed launches, we should maybe start there if we have large numbers of failed launches	14:12
corvus	looking	14:12
corvus	oh, not a zuul event backlog -- a node request backlog?	14:13
fungi	if it's centos, then maybe we're using uefi more than we realized and are getting bit by that bug they introduced?	14:14
*** jhesketh has quit IRC		14:14
clarkb	that's how I parsed it given the failed node launch attempts portion of the message	14:15
clarkb	fungi: ya or maybe we're hitting the debuntu version which has to do with broken dpkg config for grub not updating the right device	14:15
frickler	hmm the launch attempts are a red herring maybe, that seems to be limestone http://grafana.openstack.org/d/WFOSH5Siz/nodepool-limestone?orgId=1&from=now-7d&to=now	14:15
*** jhesketh has joined #opendev		14:16
corvus	no errors in rax	14:16
corvus	it looks like limestone hasn't produced a working node in a while?	14:19
corvus	like, not since the end of june	14:20
fungi	logan-: ^ in case you weren't aware	14:20
fungi	so a little over a month i guess	14:21
corvus	but i agree, something is amiss -- we have a 600 node backlog and are only using 50% of our overall capacity	14:22
corvus	we're using 50% of our rax quota	14:24
corvus	nl01 (which is responsible for rax) is pegging a cpu	14:25
corvus	apparently that's not unusual	14:26
frickler	maybe the launch attempts on limestone (which if I read correctly take 2 min each to timeout) block the nodes from being scheduled on other providers	14:26
frickler	so likely we should disable limestone until we know what happened there?	14:26
logan-	looking into it	14:27
corvus	frickler: yeah, that will slow things down	14:27
*** zigo has joined #opendev		14:29
*** _mlavalle_1 has joined #opendev		14:37
*** mlavalle has quit IRC		14:39
corvus	http://grafana.openstack.org/d/8wFIHcSiz/nodepool-rackspace?orgId=1&from=now-14d&to=now	14:41
corvus	that node graph is kinda weird	14:41
corvus	like it starts a downward trend line at the beginning of last week that continues unseen over the weekend	14:42
corvus	it might be interesting to restart nl01 and see if behavior changes	14:42
corvus	(it's been running for 2 weeks, so the start of the trend doesn't correspond with the start of the process)	14:43
corvus	i'm going to sigusr2 it	14:44
clarkb	looking in the logs for nl01 (which talks to rax) we don't have a lot of launch attempt failures in the last 12 hours. Around 10 and those were all single failures (attempt 1/3 failed but no 2/3 or 3/3)	14:45
openstackgerrit	Thierry Carrez proposed opendev/system-config master: Redirect UC content to TC site https://review.opendev.org/744497	14:45
clarkb	estimatedNodepoolQuotaUsed() in the openstack driver has a bug in nl03	14:46
openstackgerrit	Liam Young proposed openstack/project-config master: Add Ceph iSCSI charm to OpenStack charms https://review.opendev.org/744479	14:46
openstackgerrit	Thierry Carrez proposed opendev/system-config master: Redirect UC content to TC site https://review.opendev.org/744497	14:46
* frickler needs to afk for a bit		14:48
clarkb	http://paste.openstack.org/show/796537/ I doubt that is causing the backlog but likely something we'll want to fix	14:48
corvus	i guess we don't have the profiler lib installed? because i did a second sigusr2 and don't see any profile info	14:49
clarkb	that would be determined by the upstream docker container contents	14:49
clarkb	(I don't know if nodepool installs them to the container or not)	14:50
fungi	ooh, yeah since they're not in the requirements list i bet we don't	14:50
corvus	nl01 is only launching about 24 nodes currently	14:52
clarkb	as a sanity check, grepping 'Predicted remaining' in nl01's log shows it thinks it has quota available	14:53
clarkb	unlikely to be a cloud side change to quotas then	14:53
clarkb	and we aren't really failing many launches according to the log either. Seems to be more performance related?	14:54
corvus	all 3 pool worker threads were in getMostRecentImageUpload both times i ran the sigusr2	14:55
corvus	hypothesis: that method is slow	14:56
corvus	i'm running 'nodepool image-list' from the cli and it is very slow	14:58
corvus	i suspect if it ever returns, we're about to find a runaway image upload loop	14:58
corvus	(incidentally, i had to run this inside the container because kazoo outside the container hasn't been updated to get the fix for large ssl packets)	14:59
clarkb	the outside container version should probably be removed as nothing is updating them now aiui	14:59
corvus	if we do that, we should set up an alias like we did with zuul	15:00
corvus	it returned; there's a lot that went past my scroll buffer; i'm re-running it with a redirect, but it's looking like a lot of ovh image upload failures	15:01
clarkb	nb01's build cleanup worker is hitting json decode errors	15:02
clarkb	(which I think may be an issue that swest had a patch up to start working around/debugging)	15:02
clarkb	ah ok ovh upload failures	15:02
corvus	clarkb: this one? https://review.opendev.org/738013	15:03
clarkb	corvus: ya	15:03
corvus	maybe we can identify the node, clear it out, then review swest's patch	15:04
clarkb	++ in theory if we clear those out then nodepool can cleanup the failed uploads itself	15:05
corvus	why aren't we seeing this error? https://review.opendev.org/716566	15:05
clarkb	we are seeing that error too	15:06
corvus	clarkb: what host/file?	15:07
corvus	i can't find it grepping	15:07
clarkb	getMostRecentBuilds() in nb01 seems to trip it for uploads and cleanups. nb01.opendev.org:/var/log/nodepool/builder-debug.log	15:07
corvus	grep "Error loading"	15:07
clarkb	oh I was looking at the commit message, you mean the stuff that was changed to further debug. /me checks that	15:08
corvus	i'm not getting any results for grep "Error loading" /var/log/nodepool/builder-debug.log	15:08
clarkb	maybe we haven't restarted the builders on that change	15:08
clarkb	though its from april so we should've	15:08
corvus	happen to remember which launcher is ovh?	15:09
corvus	nl04	15:09
corvus	doesn't show up there either	15:09
corvus	tobiash: we think we're hitting this bug but don't see the error output you added: https://review.opendev.org/716566	15:10
clarkb	that call, getImageUpload() isn't in the traceback for either cleanups or uploads when they fail	15:10
clarkb	possible we just don't get to that point because we're failing earlier?	15:10
logan-	corvus frickler fungi: limestone should be scheduling now. I had emptied the host aggregate for hypervisor reboots and forgot to re-add them afterwards, sorry!	15:11
fungi	logan-: no worries, thanks for taking a look!	15:11
fungi	(and for all the resources)	15:12
corvus	clarkb: ack. after breakfast i'll try to find it the old fashioned way	15:12
tobiash	corvus: weird	15:12
logan-	np!	15:12
clarkb	corvus: ok. I too am sorting out some breakfast noms	15:15
clarkb	and then I think I have a meeting in a few minutes. Let me know if I can help though and I can make room	15:16
fungi	i'm still sorting out the last of our flood prep, so unfortunately not much help, sorry	15:17
tobiash	corvus: if you have a runaway image upload loop you might want to check the zk size and if that gets too big stop all builders as a precaution	15:19
clarkb	tobiash: it seems to be a slow leak based on grafana data. I think the issue is more that we can't cleanup effectively than that we are adding too much data at once	15:20
clarkb	(though I've not confirmed that)	15:20
tobiash	ok, I just wanted to raise awareness about this, if it's a slow leak, it's probably ok	15:20
clarkb	but ya stopping the builders does seem reasonable and low impact	15:20
tobiash	this might also be interesting (once it works): https://review.opendev.org/743790 since upload workers cause a significant load on the builders	15:23
tobiash	(that's possibly the root cause you see getMostRecentBuilds in every thread dump)	15:23
tobiash	*getMostRecentImageUpload I meant	15:24
*** ysandeep is now known as ysandeep|afk		15:41
corvus	i wonder why the 'nodepool image-list' command works -- seems like it should be deserializing all of the uploads too	15:55
corvus	(so it should hit the empty znode)	15:55
*** dtantsur is now known as dtantsur|afk		15:58
corvus	i was able to deserialize all the uploads	16:09
corvus	oh it's failing deserializing a build	16:10
corvus	tobiash: that's why we're not seeing your log error	16:11
corvus	clarkb: and i think that may mean that swest's fix wouldn't fix this either	16:11
clarkb	ah	16:12
corvus	/nodepool/images/fedora-31/builds/0000011944	16:14
corvus	that's our culprit	16:14
corvus	#status log deleted corrupt znode /nodepool/images/fedora-31/builds/0000011944 to unblock image cleanup threads	16:16
openstackstatus	corvus: finished logging	16:16
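Editor's note: the search corvus describes above (deserialize every upload and build until one fails) can be sketched roughly as follows. This is a minimal illustration, not nodepool's code: the function name `find_undecodable` is hypothetical, the dictionary stands in for data that would really be read from ZooKeeper, and only the 0000011944 path comes from the log. The neighbouring path, the `state` field and the empty payload are made-up sample data.

```python
import json


def find_undecodable(znodes):
    """Return the paths whose payload is empty or is not valid JSON.

    `znodes` maps a znode path to its raw bytes; in a real check those
    bytes would come from a ZooKeeper client rather than a dict.
    """
    bad = []
    for path, raw in sorted(znodes.items()):
        try:
            if not raw:
                raise ValueError("empty znode")
            json.loads(raw.decode("utf-8"))
        except ValueError:
            # json.JSONDecodeError and UnicodeDecodeError are both
            # subclasses of ValueError, so this covers all three cases.
            bad.append(path)
    return bad


# Hypothetical sample data; the payloads are invented for illustration.
sample = {
    "/nodepool/images/fedora-31/builds/0000011943": b'{"state": "ready"}',
    "/nodepool/images/fedora-31/builds/0000011944": b"",
}
print(find_undecodable(sample))
```

Collecting the failing paths instead of stopping at the first error means a single pass would report every bad znode, which matters if more than one is corrupt.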
corvus	looks like that kicks off every 5m, so we may see cleanup start at 16:20	16:17
*** chkumar|rover is now known as raukadah		16:18
corvus	oh those are uploads every 5m	16:18
corvus	cleanup workers are a little less regular, but still ~5m	16:19
clarkb	corvus: did you connect with the python zk shell too? Does that take flags for ssl certs?	16:20
corvus	clarkb: i connected with python zk-shell and used the non-ssl port	16:20
corvus	clarkb: i don't think it supports ssl	16:20
clarkb	corvus: huh maybe we should keep non ssl open but firewall it off? then we can shell in via localhost if necessary?	16:21
corvus	we may need to either fix that, or use the zkCli.sh from zk itself, or continue to run a non-ssl port (but firewalled to localhost) for emergencies	16:21
corvus	clarkb: indeed :)	16:21
corvus	bunch of deletes happening on nb01 now	16:25
corvus	this will probably take quite a long time. maybe by eod we'll be back up to speed	16:26
clarkb	we probably can't make it go any faster out of band	16:27
clarkb	since we're limited by zk journal speeds	16:27
clarkb	?	16:27
corvus	clarkb: i'd guess the limit right now is the nodepool cleanup thread not being optimized for this case (and doing lots of "unnecessary" locking)	16:27
corvus	so we probably could speed it up if we did a custom routine	16:28
corvus	let me turn my test script into a node counter and estimate progress	16:28
corvus	oh, also the cleanup worker is now doing actual on-disk deletions as well (which are fighting local io with builds). otoh, i think all the builders should be contributing.	16:30
corvus	1822 uploads now	16:31
fungi	we could pause image builds to dedicate i/o to the deletions i suppose, if that's more important to get caught up	16:32
corvus	i don't think the backlog is dire at this point.	16:32
*** ysandeep|afk is now known as ysandeep|away		16:36
corvus	1617	16:39
corvus	logan-: limestone looks better now, thanks. we're using it at 50% of our max-nodes setting. that could be due to the bug we just started cleaning up after; let's check back in a bit and see how it looks	16:41
*** tosky has quit IRC		16:45
corvus	1464	16:49
corvus	clarkb, fungi: i feel like any attempt to speed this up will probably take long enough we'll be substantially through the backlog by then, and am inclined to leave it be. whadya think?	16:50
clarkb	ya seems to be moving along well enough	16:50
fungi	yeah, i'm not especially concerned	16:50
*** sshnaidm has quit IRC		16:55
corvus	1297	16:56
* corvus is a numbers station		16:57
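Editor's note: the counts corvus has been posting give a rough cleanup rate. The sketch below is only arithmetic on those four numbers. It assumes each count was taken at the minute its message was logged, and the helper name `deletions_per_minute` is hypothetical.

```python
from datetime import datetime

# (message time, remaining uploads) as posted in the channel.
samples = [("16:31", 1822), ("16:39", 1617), ("16:49", 1464), ("16:56", 1297)]


def deletions_per_minute(samples):
    """Average deletion rate between the first and last sample."""
    fmt = "%H:%M"
    (t0, n0), (t1, n1) = samples[0], samples[-1]
    elapsed = datetime.strptime(t1, fmt) - datetime.strptime(t0, fmt)
    return (n0 - n1) / (elapsed.total_seconds() / 60)


print(f"{deletions_per_minute(samples):.1f} uploads/minute")
```

With these samples the average works out to 21 uploads removed per minute (525 over 25 minutes). It says nothing about where the count will level off, since some of the remaining uploads are presumably still valid.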
*** auristor has quit IRC		17:20
*** auristor has joined #opendev		17:21
corvus	900	17:25
fungi	i was hoping it would be over 9000	17:25
corvus	rax utilization is significantly up	17:25
corvus	limestone is flat, so we may be seeing a quota < max servers, or some other constraint	17:26
openstackgerrit	Clark Boylan proposed opendev/system-config master: Use pip install -r not -f to install extras https://review.opendev.org/744531	17:28
clarkb	corvus: ^ I think that is the fix for the yappi and objgraph packages on the nodepool images	17:28
clarkb	infra-root https://review.opendev.org/#/c/744255/ is a cleanup from the gitea upgrade on friday that would be good to land so we don't worry about it next upgrade	17:37
mordred	clarkb: doh on the -r vs -f change	17:38
*** auristor has quit IRC		17:43
mordred	clarkb: since you were patching python-builder and have some of it paged in, mnaser has a patch: https://review.opendev.org/#/c/742249/	17:49
clarkb	mordred: mnaser left a comment, basically this should probably come with docs of some sort	17:59
*** auristor has joined #opendev		18:02
*** tosky has joined #opendev		18:08
*** sshnaidm has joined #opendev		18:14
corvus	i think we're stable around 620-630	18:20
corvus	backlog is headed down, and i think we're at our practical utilization limit	18:21
*** sdmitriev has joined #opendev		18:30
openstackgerrit	Merged opendev/system-config master: Increase gitea indexer startup timeout https://review.opendev.org/744255	18:31
*** hashar has joined #opendev		18:31
clarkb	corvus: logan- we're still using about half our limestone max-servers count. I wonder if there is a quota thing cloud side lowering that for us?	18:34
clarkb	not a big deal if so but would explain the decrease there	18:34
logan-	I'll check in a few. I think I may have dropped the quota a while back to see if we could mitigate some slow jobs due to IO congestion. I have a feeling that the SSDs in those nodes are beginning to feel the wear after several years of nodepool hammering on them. Long term I really need to get this cloud on bionic/focal (it is still xenial nodes), replace aged SSDs, and add a couple more nodes. But /time :/	18:37
clarkb	I know the feeling :)	18:38
clarkb	mordred: corvus if you've got a moment https://review.opendev.org/#/c/741277/ is another one I've had on a back burner for a bit. We'll want to land that then make a release, then we can update jeepyb to support the branch things	18:44
openstackgerrit	Merged opendev/system-config master: Use pip install -r not -f to install extras https://review.opendev.org/744531	19:03
*** redrobot has quit IRC		20:01
*** hashar has quit IRC		20:23
clarkb	fungi: re the sshfp records for review, what if we put port 22 on review01's record and port 29418 on review.o.o's record?	20:33
fungi	clarkb: yeah, that's what i suggested in e-mail	20:35
fungi	on the ml i mean	20:35
fungi	the challenge there is that right now we generate an ssl cert for review01 and include the other records as altnames, which works because they're cnames for it	20:35
fungi	we can't cname review.o.o to review01 if we want different sshfp records for them	20:36
fungi	so we'll also have to split how we're doing ssl cert renewals	20:36
fungi	(or more likely just not put review01 in the altnames and generate the cert for review.opendev.org and review.openstack.org cname'd to review.opendev.org)	20:37
clarkb	fungi: oh does ssh expect all fp records to resolve to the same value (or valid values I guess for all hosts) if they cname regardless of what you ssh to?	20:38
clarkb	fungi: but also I'm not sure why the https certs matter here? I think we can verify review01's http cert without the cname	20:39
clarkb	what LE's acme is looking for is that we can control dns for review01.opendev.org and review.opendev.org which is independent of the actual records aiui	20:39
clarkb	so we could drop the cname then verify the certs as is and have split sshfp records?	20:40
fungi	oh, right, i guess we just need to set a separate acme cname for review.opendev.org if there isn't one already	20:40
fungi	so if we switched review.opendev.org to a/aaaa instead of cname we could add new sshfp records for it, and then switch review.openstack.org to cname to review.opendev.org instead of to review01 like it does currently	20:41
clarkb	yup	20:41
clarkb	and possibly add new acme records if necessary	20:41
fungi	well, duplicate basically all the rrs from review01	20:42
fungi	so caa records and so on	20:42
fungi	confirmed, we already have "_acme-challenge.review IN CNAME acme.opendev.org." so that doesn't need to change	20:42
fungi	we'd just need to add caa rrs, looks like	20:43
fungi	so get rid of the "review IN CNAME review01" and duplicate the a, aaaa and two caa rrs from review01 to review, then generate the six new sshfp records for the gerrit api port	20:44
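Editor's note: for readers unfamiliar with SSHFP records, the sketch below shows how such records are derived from a host's public key. It is an illustration under assumptions rather than the procedure used here: the function name `sshfp_records` is hypothetical and the key blob is a placeholder, not a real key. The algorithm and fingerprint-type numbers are the ones assigned for SSHFP (RFC 4255, RFC 6594, RFC 7479). Six records would be consistent with three host keys times two fingerprint types, though the log does not spell that out.

```python
import base64
import hashlib

# SSHFP algorithm numbers keyed by the key type in a public key line.
ALGORITHMS = {
    "ssh-rsa": 1,
    "ssh-dss": 2,
    "ecdsa-sha2-nistp256": 3,
    "ecdsa-sha2-nistp384": 3,
    "ecdsa-sha2-nistp521": 3,
    "ssh-ed25519": 4,
}
# SSHFP fingerprint types: 1 is SHA-1, 2 is SHA-256.
FP_TYPES = {1: hashlib.sha1, 2: hashlib.sha256}


def sshfp_records(name, pubkey_line):
    """Build one SSHFP record per fingerprint type for a public key.

    `pubkey_line` has the usual "<key-type> <base64-blob> [comment]" form.
    The fingerprint is a hash of the decoded key blob.
    """
    key_type, b64_blob = pubkey_line.split()[:2]
    blob = base64.b64decode(b64_blob)
    algo = ALGORITHMS[key_type]
    return [
        f"{name} IN SSHFP {algo} {fp_type} {hasher(blob).hexdigest()}"
        for fp_type, hasher in sorted(FP_TYPES.items())
    ]


# Placeholder blob for illustration only; not a real host key.
fake_key = "ssh-ed25519 " + base64.b64encode(b"not-a-real-key").decode()
for record in sshfp_records("review", fake_key):
    print(record)
```

Because the records for `review` would describe the Gerrit API port's host keys and the ones for `review01` the system sshd's keys, the two names need separate record sets, which is why the CNAME has to go.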
clarkb	and that can all happen in a single opendev.org zone update?	20:47
clarkb	which will minimize any user facing impact	20:47
clarkb	with rax dns we'd risk an outage during the cname delete -> aaaa/a create period	20:47
fungi	yep for single opendev.org commit	20:48
clarkb	I discovered that elasticsearch05 and elasticsearch07 had stopped running elasticsearch. I've restarted them but then services on most of the workers seem to have died as a result so I'm rebooting those as a quick way to get them back up	20:49
fungi	the rax dns update is that we need to change the review.openstack.org cname from pointing to review01.opendev.org to just review.opendev.org or it will continue to return the wrong sshfp records for folks	20:49
clarkb	also the reboots are a decent sanity check that the recent grub updates won't affect at least our xenial hosts on rax	20:50
clarkb	(we may want to reboot a gitea backend soonish too to check those)	20:50
clarkb	fungi: ya but that should be less impactful now that that name is less used	20:50
clarkb	so far all the logstash workers are coming back just fine so I think we are good there re grub updates	20:51
fungi	agreed	20:51
clarkb	of course now that I've said that I have a slow to return host :/	20:51
clarkb	ah there it goes	20:51
clarkb	#status log Restarted elasticsearch on elasticsearch05 and elasticsearch07 as they had stopped. Rebooted logstash-worker01-20 as their logstash daemons had failed after the elasticsearch issues.	20:59
openstackstatus	clarkb: finished logging	20:59
openstackgerrit	Jeremy Stanley proposed opendev/zone-opendev.org master: Split review's resource records from review01's https://review.opendev.org/744557	21:01
fungi	clarkb: frickler: ^	21:02
clarkb	we'll also want to think it through from an ansible perspective but I expect we're good with that split	21:03
fungi	if someone wants to push a follow up to add sshfp rrs for the api's host key i'll be happy to review, but i figure that should at least solve the immediate issue	21:04
*** Eighth_Doctor is now known as Conan_Kudo		21:05
*** Conan_Kudo is now known as Eighth_Doctor		21:07
corvus	sergey voted on that?! :)	21:46
corvus	oh, i typod, sorry	21:47
corvus	ignore me	21:47
openstackgerrit	Merged opendev/zone-opendev.org master: Split review's resource records from review01's https://review.opendev.org/744557	21:50
openstackgerrit	Monty Taylor proposed zuul/zuul-jobs master: Add a job for publishing a site to netlify https://review.opendev.org/739047	21:56
*** tkajinam has joined #opendev		21:59
openstackgerrit	Clark Boylan proposed zuul/zuul-jobs master: Fix partial subunit stream logging https://review.opendev.org/744565	22:12
*** qchris has quit IRC		22:22
*** qchris has joined #opendev		22:35
openstackgerrit	Clark Boylan proposed zuul/zuul-jobs master: Fix partial subunit stream logging https://review.opendev.org/744565	22:52
*** _mlavalle_1 has quit IRC		23:01
corvus	ianw: remote: https://review.opendev.org/744574 Remove status-url from check start	23:05
*** DSpider has quit IRC		23:05
ianw	corvus: ok, i was going to look into that. it was linking to opendev.org when i looked at it	23:06
corvus	ianw: it's the same issue -- the method used in checks is better	23:06
ianw	ok, let's merge that and i'll run a recheck and we can see it live	23:07
*** tosky has quit IRC		23:11
ianw	we must be busy, it's still got noop queued	23:16
ianw	Failed to update check run pyca/check: 403 Resource not accessible by integration	23:19
corvus	are we missing a permission?	23:24
corvus	clarkb: sn5 hop test in 33m :)	23:24
clarkb	ooh	23:25
ianw	2020-08-03 23:18:39,178 DEBUG github3: POST https://api.github.com/repos/pyca/cryptography/check-runs with {"name": "pyca/check", "head_sha": "a46b22f283ab1c09a476cef8fe340ceefc0dd362", "details_url": "https://zuul.opendev.org/t/pyca/status/change/5341,a46b22f283ab1c09a476cef8fe340ceefc0dd362", "status": "in_progress", "output": {"title": "Summary", "summary": "Starting check jobs."}}, {'headers': {'Accept': 'application/vnd.github.antiope-preview+json'}}	23:26
ianw	it wanted to do the right thing	23:26
ianw	yeah i wonder if the app has settings to request permissions to the check api	23:27
corvus	ianw: https://zuul-ci.org/docs/zuul/reference/drivers/github.html#application	23:28
corvus	ianw: there's a "checks read/write" perm that i'd wager a nickel we don't have	23:28
ianw	ok let me get the login details and check	23:29
ianw	right, it does not have "Checks"	23:33
ianw	ok, i've added that. i think pyca will have to re-accept it now with the new permissions	23:34
ianw	and all the jobs failed with "msg": "Failed to update apt cache: " so i'm guessing the mirror isn't happy? ... :/ one thing at a time	23:35
ianw	https://github.com/pyca/cryptography/pull/5341#issuecomment-668292883	23:41