Monday, 2020-08-03

*** ryohayakawa has joined #opendev00:01
*** DSpider has quit IRC00:12
openstackgerritPierre-Louis Bonicoli proposed zuul/zuul-jobs master: Avoid to use 'length' filter with null value
openstackgerritPierre-Louis Bonicoli proposed zuul/zuul-jobs master: Avoid to use 'length' filter with null value
*** mlavalle has quit IRC03:31
*** mlavalle has joined #opendev03:31
*** raukadah is now known as chkumar|rover04:33
*** DSpider has joined #opendev06:07
*** ryo_hayakawa has joined #opendev06:12
*** ysandeep|away is now known as ysandeep06:13
*** ryohayakawa has quit IRC06:14
openstackgerritOpenStack Proposal Bot proposed openstack/project-config master: Normalize projects.yaml
*** rhayakaw__ has joined #opendev06:32
*** ryo_hayakawa has quit IRC06:34
*** frickler_pto is now known as frickler07:26
*** ssaemann has joined #opendev07:27
*** ssaemann has quit IRC07:28
*** ssaemann has joined #opendev07:28
*** tosky has joined #opendev07:35
*** hashar has joined #opendev07:38
*** ysandeep is now known as ysandeep|lunch07:44
*** ssaemann has quit IRC07:55
*** sshnaidm|afk is now known as sshnaidm07:58
*** ttx has quit IRC08:00
*** moppy has quit IRC08:01
*** ttx has joined #opendev08:01
*** moppy has joined #opendev08:01
*** dtantsur|afk is now known as dtantsur08:10
*** fressi has joined #opendev08:14
*** ysandeep|lunch is now known as ysandeep08:27
*** bolg has joined #opendev08:36
*** hashar has quit IRC09:28
*** ysandeep is now known as ysandeep|afk09:41
*** fressi_ has joined #opendev09:56
*** fressi has quit IRC09:57
*** fressi_ is now known as fressi09:57
*** fressi has quit IRC10:01
*** fressi has joined #opendev10:01
*** dtantsur is now known as dtantsur|brb10:09
*** tkajinam has quit IRC10:13
*** ysandeep|afk is now known as ysandeep10:54
*** rhayakaw__ has quit IRC11:05
*** chkumar|rover is now known as chandankumar11:12
*** chandankumar is now known as chkumar|rover11:13
openstackgerritJoshua Hesketh proposed opendev/bindep master: Allow underscores in profile names
openstackgerritJoshua Hesketh proposed opendev/bindep master: Allow uppercase letters in profiles
*** fressi has quit IRC11:41
*** fressi_ has joined #opendev11:41
*** fressi_ has quit IRC11:46
*** dtantsur|brb is now known as dtantsur11:58
fricklerinfra-root: I think something is bad, big backlog in zuul, lots of failed node launch attempts since about 0600 today
clarkbif I had to guess it's some version of the grub update bugs that have hit various distros (oddly the debuntu version is apparently unrelated to uefi and the rhel/centos version is specific to uefi)14:12
openstackgerritLiam Young proposed openstack/project-config master: Add Ceph iSCSI charm to OpenStack charms
clarkbI think nodepool dumps nodepool console logs for failed launches, we should maybe start there if we have large numbers of failed launches14:12
corvusoh, not a zuul event backlog -- a node request backlog?14:13
fungiif it's centos, then maybe we're using uefi more than we realized and are getting bit by that bug they introduced?14:14
*** jhesketh has quit IRC14:14
clarkbthat's how I parsed it given the failed node launch attempts portion of the message14:15
clarkbfungi: ya or maybe we're hitting the debuntu version which has to do with broken dpkg config for grub not updating the right device14:15
fricklerhmm the launch attempts are a red herring maybe, that seems to be limestone
*** jhesketh has joined #opendev14:16
corvusno errors in rax14:16
corvusit looks like limestone hasn't produced a working node in a while?14:19
corvuslike, not since the end of june14:20
fungilogan-: ^ in case you weren't aware14:20
fungiso a little over a month i guess14:21
corvusbut i agree, something is amiss -- we have a 600 node backlog and are only using 50% of our overall capacity14:22
corvuswe're using 50% of our rax quota14:24
corvusnl01 (which is responsible for rax) is pegging a cpu14:25
corvusapparently that's not unusual14:26
fricklermaybe the launch attempts on limestone (which if I read correctly take 2 min each to timeout) block the nodes from being scheduled on other providers14:26
fricklerso likely we should disable limestone until we know what happened there?14:26
logan-looking into it14:27
corvusfrickler: yeah, that will slow things down14:27
*** zigo has joined #opendev14:29
*** _mlavalle_1 has joined #opendev14:37
*** mlavalle has quit IRC14:39
corvusthat node graph is kinda weird14:41
corvuslike it starts a downward trend line at the beginning of last week that continues unseen over the weekend14:42
corvusit might be interesting to restart nl01 and see if behavior changes14:42
corvus(it's been running for 2 weeks, so the start of the trend doesn't correspond with the start of the process)14:43
corvusi'm going to sigusr2 it14:44
clarkblooking in the logs for nl01 (which talks to rax) we don't have a lot of launch attempt failures in the last 12 hours. Around 10 and those were all single failures (attempt 1/3 failed but no 2/3 or 3/3)14:45
openstackgerritThierry Carrez proposed opendev/system-config master: Redirect UC content to TC site
clarkbestimatedNodepoolQuotaUsed() in the openstack driver has a bug in nl0314:46
openstackgerritLiam Young proposed openstack/project-config master: Add Ceph iSCSI charm to OpenStack charms
openstackgerritThierry Carrez proposed opendev/system-config master: Redirect UC content to TC site
* frickler needs to afk for a bit14:48
clarkb I doubt that is causing the backlog but likely something we'll want to fix14:48
corvusi guess we don't have the profiler lib installed?  because i did a second sigusr2 and don't see any profile info14:49
clarkbthat would be determined by the upstream docker container contents14:49
clarkb(I don't know if nodepool installs them to the container or not)14:50
fungiooh, yeah since they're not in the requirements list i bet we don't14:50
corvusnl01 is only launching about 24 nodes currently14:52
clarkbas a sanity check, grepping 'Predicted remaining' in nl01's log shows it thinks it has quota available14:53
clarkbunlikely to be a cloud side change to quotas then14:53
clarkband we aren't really failing many launches according to the log either. Seems to be more performance related?14:54
corvusall 3 pool worker threads were in getMostRecentImageUpload both times i ran the sigusr214:55
corvushypothesis: that method is slow14:56
corvusi'm running 'nodepool image-list' from the cli and it is very slow14:58
corvusi suspect if it ever returns, we're about to find a runaway image upload loop14:58
corvus(incidentally, i had to run this inside the container because kazoo outside the container hasn't been updated to get the fix for large ssl packets)14:59
clarkbthe outside container version should probably be removed as nothing is updating them now aiui14:59
corvusif we do that, we should set up an alias like we did with zuul15:00
corvusit returned; there's a lot that went past my scroll buffer; i'm re-running it with a redirect, but it's looking like a lot of ovh image upload failures15:01
clarkbnb01's build cleanup worker has json decode errors15:02
clarkb(which I think may be an issue that swest had a patch up to start working around/debugging)15:02
clarkbah ok ovh upload failures15:02
corvusclarkb: this one?
clarkbcorvus: ya15:03
corvusmaybe we can identify the node, clear it out, then review swest's patch15:04
clarkb++ in theory if we clear those out then nodepool can cleanup the failed uploads itself15:05
corvuswhy aren't we seeing this error?
clarkbwe are seeing that error too15:06
corvusclarkb: what host/file?15:07
corvusi can't find it grepping15:07
clarkbgetMostRecentBuilds() in nb01 seems to trip it for uploads and cleanups.
corvusgrep "Error loading"15:07
clarkboh I was looking at the commit message, you mean the stuff that was changed to further debug /me checks that15:08
corvusi'm not getting any results for grep "Error loading" /var/log/nodepool/builder-debug.log15:08
clarkbmaybe we haven't restarted the builders on that change15:08
clarkbthough it's from april so we should've15:08
corvushappen to remember which launcher is ovh?15:09
corvusdoesn't show up there either15:09
corvustobiash: we think we're hitting this bug but don't see the error output you added:
clarkbthat call, getImageUpload() isn't in the traceback for either cleanups or uploads when they fail15:10
clarkbpossible we just don't get to that point because we're failing earlier?15:10
logan-corvus frickler fungi: limestone should be scheduling now. I had emptied the host aggregate for hypervisor reboots and forgot to re-add them afterwards, sorry!15:11
fungilogan-: no worries, thanks for taking a look!15:11
fungi(and for all the resources)15:12
corvusclarkb: ack.  after breakfast i'll try to find it the old fashioned way15:12
tobiashcorvus: weird15:12
clarkbcorvus: ok. I too am sorting out some breakfast noms15:15
clarkband then I think I have a meeting in a few minutes. Let me know if I can help though and I can make room15:16
fungii'm still sorting out the last of our flood prep, so unfortunately not much help, sorry15:17
tobiashcorvus: if you have a runaway image upload loop you might want to check the zk size and if that gets too big stop all builders as a precaution15:19
clarkbtobiash: it seems to be a slow leak based on grafana data. I think the issue is more that we can't cleanup effectively than that we are adding too much data at once15:20
clarkb(though I've not confirmed that)15:20
tobiashok, I just wanted to raise awareness about this, if it's a slow leak, it's probably ok15:20
clarkbbut ya stopping the builders does seem reasonable and low impact15:20
tobiashthis might also be interesting (once it works): since upload workers cause a significant load on the builders15:23
tobiash(that's possibly the root cause you see getMostRecentBuilds in every thread dump)15:23
tobiash*getMostRecentImageUpload I meant15:24
*** ysandeep is now known as ysandeep|afk15:41
corvusi wonder why the 'nodepool image-list' command works -- seems like it should be deserializing all of the uploads too15:55
corvus(so it should hit the empty znode)15:55
*** dtantsur is now known as dtantsur|afk15:58
corvusi was able to deserialize all the uploads16:09
corvusoh it's failing deserializing a build16:10
corvustobiash: that's why we're not seeing your log error16:11
corvusclarkb: and i think that may mean that swest's fix wouldn't fix this either16:11
corvusthat's our culprit16:14
corvus#status log deleted corrupt znode /nodepool/images/fedora-31/builds/0000011944 to unblock image cleanup threads16:16
openstackstatuscorvus: finished logging16:16
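[Editor's sketch] The hunt for the bad znode described above could look roughly like the following. It assumes the znode payloads are JSON (the log mentions json decode errors and an empty znode) and that they have already been fetched into a dict keyed by path. The function name find_corrupt_znodes and the sample paths are hypothetical; this is not nodepool's own code and not the script corvus used.

```python
import json


def find_corrupt_znodes(payloads):
    """Return the sorted paths whose payload is empty or not valid JSON.

    payloads maps a znode path (str) to its raw data (bytes), as one
    might collect by walking a tree such as /nodepool/images/*/builds.
    """
    corrupt = []
    for path, data in payloads.items():
        if not data:
            # An empty znode cannot be deserialized at all.
            corrupt.append(path)
            continue
        try:
            json.loads(data.decode("utf-8"))
        except (UnicodeDecodeError, json.JSONDecodeError):
            corrupt.append(path)
    return sorted(corrupt)


# Hypothetical sample data, for illustration only.
sample = {
    "/nodepool/images/example/builds/0000000001": b'{"state": "ready"}',
    "/nodepool/images/example/builds/0000000002": b"",
    "/nodepool/images/example/builds/0000000003": b"{truncated",
}
print(find_corrupt_znodes(sample))
```

Keeping the check separate from the ZooKeeper client means it can be tried on captured data before touching a live cluster.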
corvuslooks like that kicks off every 5m, so we may see cleanup start at 16:2016:17
*** chkumar|rover is now known as raukadah16:18
corvusoh those are uploads every 5m16:18
corvuscleanup workers are a little less regular, but still ~5m16:19
clarkbcorvus: did you connect with the python zk shell too? Does that take flags for ssl certs?16:20
corvusclarkb: i connected with python zk-shell and used the non-ssl port16:20
corvusclarkb: i don't think it supports ssl16:20
clarkbcorvus: huh maybe we should keep non ssl open but firewall it off? then we can shell in via localhost if necessary?16:21
corvuswe may need to either fix that, or use the from zk itself, or continue to run a non-ssl port (but firewalled to localhost) for emergencies16:21
corvusclarkb: indeed :)16:21
corvusbunch of deletes happening on nb01 now16:25
corvusthis will probably take quite a long time.  maybe by eod we'll be back up to speed16:26
clarkbwe probably can't make it go any faster out of band16:27
clarkbsince we're limited by zk journal speeds16:27
corvusclarkb: i'd guess the limit right now is the nodepool cleanup thread not being optimized for this case (and doing lots of "unnecessary" locking)16:27
corvusso we probably could speed it up if we did a custom routine16:28
corvuslet me turn my test script into a node counter and estimate progress16:28
corvusoh, also the cleanup worker is now doing actual on-disk deletions as well (which are fighting local io with builds).  otoh, i think all the builders should be contributing.16:30
corvus1822 uploads now16:31
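[Editor's sketch] The "node counter" progress estimate mentioned above can be approximated with a simple linear extrapolation from two counts. The function name, the target parameter, and the earlier sample value are hypothetical; only the 1822 figure comes from the log, and a constant deletion rate is an assumption that may not hold once on-disk deletions compete with builds for i/o.

```python
def estimate_seconds_remaining(samples, target=0):
    """Linearly extrapolate how long until the count reaches target.

    samples is a list of (seconds, remaining_count) tuples, oldest
    first. Returns None when no downward progress has been observed.
    """
    (t0, c0), (t1, c1) = samples[0], samples[-1]
    elapsed = t1 - t0
    removed = c0 - c1
    if elapsed <= 0 or removed <= 0:
        return None
    rate = removed / elapsed  # items removed per second
    return max(c1 - target, 0) / rate


# Hypothetical: 2000 uploads counted ten minutes before the 1822 reading.
eta = estimate_seconds_remaining([(0, 2000), (600, 1822)])
print(f"roughly {eta / 3600:.1f} hours left at the current rate")
```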
fungiwe could pause image builds to dedicate i/o to the deletions i suppose, if that's more important to get caught up16:32
corvusi don't think the backlog is dire at this point.16:32
*** ysandeep|afk is now known as ysandeep|away16:36
corvuslogan-: limestone looks better now, thanks.  we're using it at 50% of our max-nodes setting.  that could be due to the bug we just started cleaning up after; let's check back in a bit and see how it looks16:41
*** tosky has quit IRC16:45
corvusclarkb, fungi: i feel like any attempt to speed this up will probably take long enough we'll be substantially through the backlog by then, and am inclined to leave it be.  whadya think?16:50
clarkbya seems to be moving along well enough16:50
fungiyeah, i'm not especially concerned16:50
*** sshnaidm has quit IRC16:55
* corvus is a numbers station16:57
*** auristor has quit IRC17:20
*** auristor has joined #opendev17:21
fungii was hoping it would be over 900017:25
corvusrax utilization is significantly up17:25
corvuslimestone is flat, so we may be seeing a quota < max servers, or some other constraint17:26
openstackgerritClark Boylan proposed opendev/system-config master: Use pip install -r not -f to install extras
clarkbcorvus: ^ I think that is the fix for the yappi and objgraph packages on the nodepool images17:28
clarkbinfra-root is a cleanup from the gitea upgrade on friday that would be good to land so we don't worry about it next upgrade17:37
mordredclarkb: doh on the -r vs -f change17:38
*** auristor has quit IRC17:43
mordredclarkb: since you were patching python-builder and have some of it paged in, mnaser has a patch:
clarkbmordred: mnaser left a comment, basically this should probably come with docs of some sort17:59
*** auristor has joined #opendev18:02
*** tosky has joined #opendev18:08
*** sshnaidm has joined #opendev18:14
corvusi think we're stable around 620-63018:20
corvusbacklog is headed down, and i think we're at our practical utilization limit18:21
*** sdmitriev has joined #opendev18:30
openstackgerritMerged opendev/system-config master: Increase gitea indexer startup timeout
*** hashar has joined #opendev18:31
clarkbcorvus: logan- we're still using about half our limestone max-servers count. I wonder if there is a quota thing cloud side lowering that for us?18:34
clarkbnot a big deal if so but would explain the decrease there18:34
logan-I'll check in a few. I think I may have dropped the quota a while back to see if we could mitigate some slow jobs due to IO congestion. I have a feeling that the SSDs in those nodes are beginning to feel the wear after several years of nodepool hammering on them. Long term I really need to get this cloud on bionic/focal (it is still xenial nodes), replace aged SSDs, and add a couple more nodes. But /time :/18:37
clarkbI know the feeling :)18:38
clarkbmordred: corvus if you've got a moment is another one I've had on a back burner for a bit. We'll want to land that then make a release, then we can update jeepyb to support the branch things18:44
openstackgerritMerged opendev/system-config master: Use pip install -r not -f to install extras
*** redrobot has quit IRC20:01
*** hashar has quit IRC20:23
clarkbfungi: re the sshfp records for review, what if we put port 22 on review01's record and port 29418 on review.o.o's record?20:33
fungiclarkb: yeah, that's what i suggested in e-mail20:35
fungion the ml i mean20:35
fungithe challenge there is that right now we generate an ssl cert for review01 and include the other records as altnames, which works because they're cnames for it20:35
fungiwe can't cname review.o.o to review01 if we want different sshfp records for them20:36
fungiso we'll also have to split how we're doing ssl cert renewals20:36
fungi(or more likely just not put review01 in the altnames and generate the cert for and cname'd to
clarkbfungi: oh does ssh expect all fp records to resolve to the same value (or valid values I guess for all hosts) if they cname regardless of what you ssh to?20:38
clarkbfungi: but also I'm not sure why the https certs matter here? I think we can verify review01's http cert without the cname20:39
clarkbwhat LE's acme is looking for is that we can control dns for and which is independent of the actual records aiui20:39
clarkbso we could drop the cname then verify the certs as is and have split sshfp records?20:40
fungioh, right, i guess we just need to set a separate acme cname for if there isn't one already20:40
fungiso if we switched to a/aaaa instead of cname we could add new sshfp records for it, and then switch to cname to instead of to review01 like it does currently20:41
clarkband possible add new acme records if necessary20:41
fungiwell, duplicate basically all the rrs from review0120:42
fungiso caa records and so on20:42
fungiconfirmed, we already have "      IN  CNAME" so that doesn't need to change20:42
fungiwe'd just need to add caa rrs, looks like20:43
fungiso get rid of the "review          IN  CNAME   review01" and duplicate the a, aaaa and two caa rrs from review01 to review, then generate the six new sshfp records for the gerrit api port20:44
clarkband that can all happen in a single one update?20:47
clarkbwhich will minimize any user facing impact20:47
clarkbwith rax dns we'd risk an outage during the cname delete -> aaaa/a create period20:47
fungiyep for single commit20:48
clarkbI discovered that elasticsearch05 and elasticsearch07 had stopped running elasticsearch. I've restarted them but then services on most of the workers seem to have died as a result so I'm rebooting those as a quick way to get them back up20:49
fungithe rax dns update is that we need to change the cname from pointing to to just or it will continue to return the wrong sshfp records for folks20:49
clarkbalso the reboots are a decent sanity check that the recent grub updates won't affect at least our xenial hosts on rax20:50
clarkb(we may want to reboot a gitea backend soonish too to check those)20:50
clarkbfungi: ya but that should be less impactful now that that name is less used20:50
clarkbso far all the logstash workers are coming back just fine so I think we are good there re grub updates20:51
clarkbof course now that I've said that I have a slow to return host :/20:51
clarkbah there it goes20:51
clarkb#status log Restarted elasticsearch on elasticsearch05 and elasticsearch07 as they had stopped. Rebooted logstash-worker01-20 as their logstash daemons had failed after the elasticsearch issues.20:59
openstackstatusclarkb: finished logging20:59
openstackgerritJeremy Stanley proposed opendev/ master: Split review's resource records from review01's
fungiclarkb: frickler: ^21:02
clarkbwe'll also want to think it through from an ansible perspective but I expect we're good with that split21:03
fungiif someone wants to push a follow up to add sshfp rrs for the api's host key i'll be happy to review, but i figure that should at least solve the immediate issue21:04
*** Eighth_Doctor is now known as Conan_Kudo21:05
*** Conan_Kudo is now known as Eighth_Doctor21:07
corvussergey voted on that?! :)21:46
corvusoh, i typod, sorry21:47
corvusignore me21:47
openstackgerritMerged opendev/ master: Split review's resource records from review01's
openstackgerritMonty Taylor proposed zuul/zuul-jobs master: Add a job for publishing a site to netlify
*** tkajinam has joined #opendev21:59
openstackgerritClark Boylan proposed zuul/zuul-jobs master: Fix partial subunit stream logging
*** qchris has quit IRC22:22
*** qchris has joined #opendev22:35
openstackgerritClark Boylan proposed zuul/zuul-jobs master: Fix partial subunit stream logging
*** _mlavalle_1 has quit IRC23:01
corvusianw: remote: Remove status-url from check start23:05
*** DSpider has quit IRC23:05
ianwcorvus: ok, i was going to look into that.  it was linking to when i looked at it23:06
corvusianw: it's the same issue -- the method used in checks is better23:06
ianwok, let's merge that and i'll run a recheck and we can see it live23:07
*** tosky has quit IRC23:11
ianwwe must be busy, it's still got noop queued23:16
ianwFailed to update check run pyca/check: 403 Resource not accessible by integration23:19
corvusare we missing a permission?23:24
corvusclarkb: sn5 hop test in 33m :)23:24
ianw2020-08-03 23:18:39,178 DEBUG github3: POST with {"name": "pyca/check", "head_sha": "a46b22f283ab1c09a476cef8fe340ceefc0dd362", "details_url": ",a46b22f283ab1c09a476cef8fe340ceefc0dd362", "status": "in_progress", "output": {"title": "Summary", "summary": "Starting check jobs."}}, {'headers': {'Accept': 'application/vnd.github.antiope-preview+json'}}23:26
ianwit wanted to do the right thing23:26
ianwyeah i wonder if the app has settings to request permissions to the check api23:27
corvusianw: there's a "checks read/write" perm that i'd wager a nickel we don't have23:28
ianwok let me get the login details and check23:29
ianwright, it does not have "Checks"23:33
ianwok, i've added that.  i think pyca will have to re-accept it now with the new permissions23:34
ianwand all the jobs failed with "msg": "Failed to update apt cache: " so i'm guessing the mirror isn't happy?  ... :/  one thing at a time23:35
