Monday, 2021-11-08

opendevreviewMerged opendev/system-config master: Add LE config for zuul01  https://review.opendev.org/c/opendev/system-config/+/81690300:51
corvuszuul is still looking good; i think we should just continue to run it on master with both schedulers00:56
corvusnote that there is a minor consideration to be aware of re zuul: it is possible for the status json to be slightly out-of-date or flap between two states.  it's random which scheduler formats the json, and they can be slightly out of sync with each other.  i don't expect users to notice most of the time (i haven't even noticed so far), so i think it's okay to run in this configuration, and the benefits of early testing of 2 schedulers outweigh01:11
corvusthe downsides.  but i wanted to make sure folks are aware of that in case it is noticed.  (that's the next thing that's being worked on, so this isn't a long-term issue).01:11
opendevreviewMerged opendev/system-config master: refstack: don't chown db directory  https://review.opendev.org/c/opendev/system-config/+/81676102:00
ianwDigest:sha256:bc67ba409c0f02a4ee36bf855a40f511228aa4286009cd99cb3595cdee9cf3ac is the build from 81661802:03
ianw"opendevorg/gerrit@sha256:bc67ba409c0f02a4ee36bf855a40f511228aa4286009cd99cb3595cdee9cf3ac"02:04
ianwso we've got the right image on the server02:04
ianwi'm going to restart now to get the new plugin02:06
ianw#status log restarted gerrit to pick up plugin changes from #81661802:09
opendevstatusianw: finished logging02:10
opendevreviewMerged opendev/system-config master: Adjust RefStack build for osf->openinfra rename  https://review.opendev.org/c/opendev/system-config/+/80848002:20
fungii'm back, but looks like everything's okay02:51
*** diablo_rojo_phone_ is now known as diablo_rojo_phone04:06
*** diablo_rojo_phone is now known as Guest531104:06
*** gouthamr_ is now known as gouthamr04:07
*** Guest5311 is now known as diablo_rojo_phone04:08
opendevreviewIan Wienand proposed openstack/project-config master: Add Fedora 35  https://review.opendev.org/c/openstack/project-config/+/81693104:15
opendevreviewIan Wienand proposed openstack/project-config master: Add Fedora 35  https://review.opendev.org/c/openstack/project-config/+/81693104:20
opendevreviewIan Wienand proposed openstack/project-config master: Set Fedora 34 min-ready to zero  https://review.opendev.org/c/openstack/project-config/+/81693204:20
opendevreviewIan Wienand proposed openstack/project-config master: Remove Fedora 34  https://review.opendev.org/c/openstack/project-config/+/81693304:20
ianw2021-11-08 04:28:04.208 | ++ /opt/dib_tmp/dib_build.VHZvuUPn/hooks/root.d/08-containerfile:main:62            :   podman run -d dib-work-image /bin/sh04:29
ianw2021-11-08 04:28:04.487 | Error: error allocating lock for new container: no space left on device04:29
ianwthis is weird because there is space04:29
ianwso it must be talking about something else ...04:30
ianwnb0204:30
ianwsame thing on nb0104:31
opendevreviewMerged openstack/diskimage-builder master: tests: remove debootstrap install  https://review.opendev.org/c/openstack/diskimage-builder/+/81557104:59
ianw$ podman ps -a | wc -l06:33
ianw204906:33
ianwi think this has to do with it06:33
*** ysandeep|pto is now known as ysandeep06:36
ianw$ podman ps -a | wc -l06:37
ianw204906:37
ianwsame number on nb01 as nb02 ... that can not be a coincidence 06:37
ianwI've run "podman system reset"; the builders have daily queues going now, although i suspect that will fix it for now06:38
ianwi'd say we're leaking the dib-work-image.  i'll look into it more tomorrow06:38
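A minimal sketch of how the leaked containers could be confirmed and cleaned up on a builder; podman's default lock table holds 2048 entries, which would line up with the 2049 lines of "podman ps -a" output above (2048 containers plus the header):

    # list containers created from the leaked intermediate image
    podman ps -a --filter ancestor=dib-work-image --format '{{.ID}} {{.Status}}' | head
    podman ps -aq --filter ancestor=dib-work-image | wc -l
    # remove stopped containers and the image, or wipe all podman state as was done above
    podman container prune -f
    podman rmi --force dib-work-image
    podman system reset --force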
*** akahat is now known as akahat|rover07:00
*** pojadhav is now known as pojadhav|lunch07:47
*** sshnaidm|off is now known as sshnaidm08:04
*** pojadhav|lunch is now known as pojadhav08:19
*** ykarel is now known as ykarel|lunch08:53
*** ysandeep is now known as ysandeep|afk09:18
*** ysandeep|afk is now known as ysandeep09:51
*** ykarel|lunch is now known as ykarel10:21
opendevreviewMartin Kopec proposed opendev/system-config master: Update Interop doc  https://review.opendev.org/c/opendev/system-config/+/81698210:33
tobias-urdinianw: thanks for fixing the fedora image patches when it was removed from mirrors :)10:33
*** pojadhav is now known as pojadhav|afk10:51
*** tobias-urdin5 is now known as tobias-urdin11:06
*** lbragstad7 is now known as lbragstad11:41
*** ysandeep is now known as ysandeep|brb11:45
*** ysandeep|brb is now known as ysandeep11:58
*** jpena|off is now known as jpena13:24
fungiin the openstack tenant i see we've got some queue items waiting hours with a handful of their builds queued. no idea yet if it's related to the zuul updates over the weekend and dual schedulers, or maybe a recurrence of the problem we observed with dependent job sets stuck trying to get nodes from very constrained providers like airship-kna1. i'll start looking once my coffee kicks in a little13:36
fungialso i get the impression the periodic-stable pipeline e-mails did not get distributed to the stable-maint ml. i'll double-check it, but we may be sending e-mail differently (or not at all?) from the second scheduler13:42
*** ykarel is now known as ykarel|away14:36
corvusfungi: thanks, i'll start looking into those14:38
fungistable-maint ml may just be me imagining things. i see at least some failures posted there as recently as Mon Nov 8 03:32:26 UTC 202114:43
corvusfungi: oh, i believe i did not re-enqueue a batch of periodic jobs over the weekend; so they may not have gotten saturday/sunday but should have gotten sunday/monday14:44
fungiyep, seems like that's working as expected, thanks14:45
corvusoops, one of the log lines has accidentally changed to be less helpful, which will make tracking down this node issue a little harder14:46
fungii think check queue item 816807,3 in the zuul tenant may be exhibiting the same situation i'm seeing for a number of changes in the openstack tenant14:47
corvus"for job <zuul.model.FrozenJob object at 0x7ff3fe56dbb0>"  that used to be a job name :/14:47
*** pojadhav|afk is now known as pojadhav14:50
corvusi used the repl to get the node request id for the top check queued job: 'neutron-tempest-plugin-scenario-openvswitch': '300-0016042207'14:51
fungias for general exception roundup, i'm now spotting a couple of new ones in the scheduler debug logs aside from the previously observed "TypeError: 'NoneType' object is not subscriptable"14:51
fungithis hit a couple of times on zuul02: https://paste.opendev.org/show/810851/14:53
corvusthat could definitely cause problems; maybe even the one we're seeing now with unhandled node requests14:54
fungiand this on zuul01: https://paste.opendev.org/show/810852/14:54
fungialso zuul01 logged 13 occurrences of kazoo.exceptions.ConnectionLoss and zuul02 logged 54 of them14:55
fungino idea if any of that is related14:55
corvusthat one is probably harmless; i think it's a delete of an already deleted node, but could be a follow-on from the first error14:55
corvusthat's a lot of connection losses14:56
fungitrying to see if they're interspersed randomly or clustered14:56
corvuswould also be good to see what the rate was last week14:56
fungi/var/log/zuul/debug.log.6.gz on zuul02 contains none as far as i can see14:57
fungichecking a few other days14:58
fungias for clustering, it looks like the entries today on zuul02 came in bunches, around 11:38, 12:03-12:06, 12:10, 12:16, 12:27, 13:18, 13:36, 13:55, 14:16, 14:26, 14:29, 14:34, 14:5815:00
fungisome clusters were larger, some as few as one exception15:00
fungithere could be some ongoing/intermittent network disruption15:01
fungido we maybe only log kazoo.exceptions.ConnectionLoss as of some very recent commit? i've scanned back more than two weeks in the zuul02 debug log and found no hits before today15:04
corvuswow, zuul01 saw a multi-hour period it couldn't connect to zk: from 06:56:17 to 08:09:3415:05
corvusfungi: no that should be unchanged; something is likely very different with our zk connection, whether that's zuul or network issues is unclear15:05
fungiyeah, i'm continuing to grep farther and farther back, finally found one from 2021-10-22 07:55:08,90615:06
fungiso it's been almost 3 weeks15:07
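The kind of scan being described, as a sketch (assuming the usual /var/log/zuul/debug.log rotation on the schedulers):

    # count kazoo connection losses per rotated scheduler debug log
    for f in /var/log/zuul/debug.log /var/log/zuul/debug.log.*.gz; do
      printf '%s: %s\n' "$f" "$(zgrep -c 'kazoo.exceptions.ConnectionLoss' "$f")"
    done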
fungicorvus: this is likely related? http://cacti.openstack.org/cacti/graph.php?action=view&local_graph_id=70037&rra_id=all15:07
fungithe other two zk servers are only using 42% of their rootfs15:08
corvusthat is entirely possible15:09
fungi/var/zookeeper/datalog/version-2 contains 28gb of data15:10
fungiand that's around 9gb on the other two servers15:11
fungiso 3x larger on 0415:11
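A quick way to compare that across the cluster, sketched with assumed hostnames:

    # snapshot/transaction-log usage per zk member
    for h in zk04.opendev.org zk05.opendev.org zk06.opendev.org; do
      ssh "$h" 'hostname; df -h /; sudo du -sh /var/zookeeper/datalog/version-2'
    done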
corvusstarting at 02:00 the amount of data in zk spiked according to https://grafana.opendev.org/d/5Imot6EMk/zuul-status?orgId=1&from=now-24h&to=now15:11
fungiperiodic-stable jobs were going on around then15:11
fungifor the openstack tenant15:12
fungiperiodic-stable triggers at 02:01 utc daily15:12
fungiand periodic at 02:00 utc15:13
fungiso that could be related to queuing hundreds/thousands of builds in a matter of minutes15:14
corvusmakes sense15:17
corvusit looks like the other zk servers had their storage fill up as well, but they recovered?15:18
fungilooks like zk may have crashed on 04? i'm trying to see if there are clues in the docker-compose logs15:21
fungi2021-11-08 06:53:42,330 [myid:4] - ERROR [Snapshot Thread:ZooKeeperServer@329] - Severe unrecoverable error, exiting15:22
fungijava.io.IOException: No space left on device15:22
fungiseems like filling up the rootfs crashed it when trying to snapshot15:22
fungii wonder if it would recover if we freed up enough space to start it again15:23
fungialso i wonder if it should really be consuming that much disk, and if so we should probably attach dedicated volumes for them15:24
*** ysandeep is now known as ysandeep|dinner15:25
fungihappy to get started on that if it's the direction we want to go15:25
corvuswe have zk purge snapshots every 6 hours; that corresponds with the drop in this graph: http://cacti.openstack.org/cacti/graph.php?action=view&local_graph_id=70101&rra_id=all15:25
corvusi wonder if we're keeping too many snapshots and should reduce that15:26
corvusaiui, we retain 3 snapshots, but only delete the ones >3 every 6 hours15:26
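For context, that retention behaviour comes from ZooKeeper's autopurge settings in zoo.cfg; a sketch of checking them (the config path here is an assumption):

    grep '^autopurge' /var/zookeeper/conf/zoo.cfg
    # autopurge.snapRetainCount=3   keep the newest 3 snapshots plus matching txn logs
    # autopurge.purgeInterval=6     run the purge task every 6 hours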
fungithe disk utilization graph suggests that it doesn't normally use anywhere close to this much on previous days15:27
corvuson zk04 (which is helpfully frozen in time at max usage), we have around 75 snapshot files15:28
corvusand around 25 on zk0515:28
fungigrowth on all three zk servers seems to have increased by close to an order of magnitude15:29
corvusfungi: maybe before looking at expanding the disk, we should increase the snapshot cleanup frequency?15:29
corvusmaybe every 2 hours?15:30
fungiyeah, that seems worthwhile, though also it appears that letting zk fill up the rootfs destabilizes it, dedicating a filesystem would shield the system from that impact15:30
corvusyes, that's true.  want to do both then?15:30
fungii'm fine with focusing on cleanup improvements first, and then working on adding cinder volumes after we've stabilized the cluster15:31
corvusok.  i'll propose a patch to change the interval15:32
fungii'm curious what's come in after 4.10.4 though which causes it to create so much larger snapshots15:32
corvusqueue state :)15:32
fungiahh, okay, so that's the churn from moving queue state in15:32
corvusi have some thoughts on reducing the data size, but they depend on having the queue state in zk to start with15:33
corvusbasically, we should be able to share some data which we currently duplicate in the executor queue15:33
corvus(global repo state, job vars, etc; with queue state in zk now, executors can reach all of that data directly)15:34
fungiwhat's the safest way to get zk going again on 04? i could free up ~1gb by clearing /var/log/journal for instance15:34
fungior can we just start it as-is and expect it will clean up old snapshots before it tries to write anything?15:35
corvusfungi: i'm unsure; it might be okay to manually delete the snapshots15:36
Clark[m]Not really here yet but there might be old docker images to prune too if we aren't pruning them regularly15:36
corvusi think we should get the new interval in place first though15:36
corvus(that way we can do just one rolling restart to get it in use)15:37
fungithere's only one zookeeper image and two opendevorg/zookeeper-statsd images according to `docker image list`15:37
fungii could delete the older opendevorg/zookeeper-statsd which would free up 122mb15:38
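Roughly what freeing that space could look like, as a sketch (the amounts are the ones mentioned above):

    journalctl --vacuum-size=200M   # trims /var/log/journal (~1gb mentioned above)
    docker image prune -f           # drops dangling images; a still-tagged old zookeeper-statsd would need docker rmi <image>
    df -h /                         # confirm how much headroom that bought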
opendevreviewJames E. Blair proposed opendev/system-config master: Purge ZK snapshots more frequently  https://review.opendev.org/c/opendev/system-config/+/81703815:38
corvusi'm going to work on breakfast, then resume zk work15:55
fungii see what you mean about the status json being inconsistent, i was watching that change in the status page and saw a couple of builds transition from starting back to queued15:55
*** ysandeep|dinner is now known as ysandeep15:56
yuriysI noticed some errors for launch attempts on the inmotion cloud, and all of them were due to (raise exception.NoMoreFixedIps(net=network_id) , which shouldn't be happening, because that's static configuration, unless we're not 'releasing' ips fast enough during destroy+provision requests.15:57
yuriysalthough it did go from 0 to 100 at 13:58 UTC... hmm16:00
fungiwe have at times in the past seen floating ips leak due to pathological conditions in clouds preventing us from deleting them when we free up the ports/servers to which they were bound16:02
yuriysYeah looks like at 13:58 a call to destroy nearly all instances went out, shortly followed by rebuilding all of them; not quite sure how to combat that for the ip release delay vs the reprovision16:02
funginodepool expects to occasionally see errors for such things, it's fine as long as they don't persist and it's able to eventually start booting new server instances there16:03
fungiit's possible we could teach the launchers to track floating ip capacity similar to how it does other limits and refrain from making boot calls until there's capacity to assign new floating ips, i'm not quite sure what would be entailed in implementing that however16:05
yuriysyeah it looks good now, just need a delay between destroy and create calls16:06
Clark[m]Note this wasn't floating ips just normal IPs aiui16:06
fungioh, yep sorry i see now from the pasted error it was fixed ips16:07
fungiso i guess the "ip release delay" is something neutron does?16:08
fungiand likely doesn't expose it via the api such that we could query16:08
clarkbas soon as my glasses are defogged I can ACK the zk purge interval change16:18
clarkbLet me know if I can help with zk otherwise I'll stay out of the way as I've not been on top of it with y'all16:20
fungithere's not much to catch up on. basically latest zuul moved queue state into zk which creates a ton of new churn resulting in much larger zk snapshots, so much so that it's easy for them to overrun the available space on the rootfs within the 6-hour cleanup interval we'd configured16:21
fungiand that can also lead to crashing zk entirely if it runs out of space while writing a snapshot16:22
clarkbI think another option available to us is to increase the snapshot interval16:22
clarkbso that we produce snapshots less often16:22
clarkbthat is based on numbers of transactions iirc16:23
fungieither way, yes the goal is to keep the snapshot count to a manageable level, but also probably to relocate /var/zookeeper or at least /var/zookeeper/datalog to a dedicated filesystem/volume16:23
clarkbapparently I've got a kernel update waiting so first reboots, then acking that change16:25
fungialso we may need some manual intervention on zk04 as the zk service on it crashed ~10 hours ago leaving its rootfs full16:25
fungibut holding off there until the cleanup interval change deploys16:26
opendevreviewClark Boylan proposed opendev/bindep master: Try out PBR pep 517 support  https://review.opendev.org/c/opendev/bindep/+/81674116:43
clarkbthrough local testing I think maybe all we need to do is keep the setup.py script and then it will work? we'll find out16:43
clarkbif that does work I have a docs update that I need to push to pbr16:44
fungiif it winds up calling into setup.py even with latest build/pip/wheel, then that probably doesn't avoid the deprecation warning16:45
clarkbfungi: ya I'm not sure on that point yet. You'd think that the official method of doing this via setuptools wouldn't trip the deprecation warning. But I don't know yet16:46
fungithough i guess it's just calls to `setup.py install` and setup_requires/easy_install which trigger the warnings16:47
fungiso calling into setup.py on its own may not be a problem16:47
clarkbfungi: https://github.com/pypa/setuptools/blob/main/setuptools/build_meta.py#L158 ultimately that is what it does16:47
clarkboh ya and its doing setup.py sdist and setup.py bdist_wheel equivalents16:48
clarkband then those artifacts are handled directly by pip or poetry or whatever16:48
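A local way to exercise that PEP 517 path for bindep, as a sketch (this assumes the pyproject.toml in 816741 declares setuptools+pbr in [build-system] requires and the setuptools.build_meta backend):

    python3 -m pip install --upgrade build
    python3 -m build --sdist --wheel .   # drives setuptools.build_meta, i.e. the sdist/bdist_wheel equivalents above
    python3 -m pip install dist/*.whl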
*** marios is now known as marios|out17:03
*** ysandeep is now known as ysandeep|out17:04
corvusi think i have identified the issue that's caused some stuck 'queued' jobs.  details and a fix in #zuul.  i'd like to merge the fix, then do what i guess would be our first ever rolling scheduler restart to deploy it.17:08
fungiwoah, i'm game!17:14
corvuslooks like we're about 5m or so from the zk interval change landing.  perhaps once it does, we can go ahead and manually update it and roll zk17:15
fungishould we expect the rolling restart to get those stuck queue items moving again? (for whatever definition of "expect" we have at this point in regards to rolling scheduler restarts)17:19
fungilooking at the fix, i assume we'll need to manually dequeue/enqueue them17:20
clarkbya I think we would have to reenqueue so they get new node requests17:20
fungias they presumably correspond to incorrectly cleaned-up node requests17:20
fungior we'd have to wait for the scheduler's node request timeout, which if memory serves is fairly long (8 hours? 10?)17:21
corvuswe may not have a node request timeout.  and yeah, dequeue/enqueue.  we could go ahead and do that now.17:23
corvusi'll get started on that17:23
opendevreviewMerged opendev/system-config master: Purge ZK snapshots more frequently  https://review.opendev.org/c/opendev/system-config/+/81703817:24
fungithere are a couple of builds to go in the opendev-prod-hourly buildset, before it will be able to start infra-prod-service-zookeeper in deploy17:26
fungihuh, infra-prod-service-zuul just failed in opendev-prod-hourly17:28
fungioh, it's the lack of letsencrypting breaking apache reloads17:30
fungion zuul0117:30
clarkbyes, known problem with a proposed fix17:31
clarkbhttps://review.opendev.org/c/opendev/system-config/+/816903 hrm fix is merged maybe insufficient fix?17:31
clarkbACME records I bet. One sec17:31
fungiclarkb: following https://discuss.python.org/t/11810 it seems like we may still be forced to call into deprecated setuptools functions when doing editable installs17:31
opendevreviewClark Boylan proposed opendev/zone-opendev.org master: Add acme record for zuul01  https://review.opendev.org/c/opendev/zone-opendev.org/+/81706117:32
clarkbfungi: corvus  ^ re zuul01 le17:33
fungiand yeah, i agree with 81706117:33
fungi[Mon Nov  8 03:41:12 UTC 2021] zuul01.opendev.org:Verify error:DNS problem: NXDOMAIN looking up TXT for _acme-challenge.zuul01.opendev.org - check that a DNS record exists for this domain17:33
fungirather, acme.sh agrees ;)17:33
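Once 817061 is deployed, a quick sanity check that the challenge record resolves, as a sketch (iirc the _acme-challenge names are CNAMEd into the acme.opendev.org zone):

    dig +short CNAME _acme-challenge.zuul01.opendev.org
    dig +short TXT _acme-challenge.zuul01.opendev.org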
clarkbfungi: re editable installs sounds like setuptools is hoping to address that at some point17:34
fungiright, they simply haven't yet17:34
fungii should have said "...for now"17:34
corvusclarkb: thx, sorry17:35
corvusthere are many items that need re-enqueuing; i'm having emacs help me with the commands17:35
fungielisp to the rescue?17:36
corvusokay, it's 122 commands (half dequeue half enqueue)17:36
corvusthat's running.  it will take a long time.17:37
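Each generated pair is roughly of this shape, as a sketch (the project and change here are placeholders, not ones from the affected list):

    zuul dequeue --tenant openstack --pipeline check --project opendev.org/openstack/example --change 123456,1
    zuul enqueue --tenant openstack --pipeline check --project opendev.org/openstack/example --change 123456,1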
corvusit should be safe to manually edit the zoo.cfg files now, right?17:38
fungiinfra-prod-service-zookeeper failed in deploy17:38
*** jpena is now known as jpena|off17:38
clarkblikely due to the lack of disk space to ansible on zk04?17:38
corvusprobably because zk04 is out of space17:38
corvusyeah17:38
fungimkdir: cannot create directory ‘/root/.ansible/tmp/ansible-tmp-1636392894.8472807-6361-108486208652702’: No space left on device17:39
fungiindeed17:39
corvusi'd like to manually make those edits on all 3 servers; delete the snapshots from under zk04, then restart zk04.  how does that sound?17:40
clarkbthat seems reasonable.17:40
clarkb(I don't know how zk04 will handle the loss of snapshots, but in theory they are intended to be moved around for restoring stuff, so it should be safe to move them away/delete them)17:40
corvusyeah; i'm expecting it to fully resync from the other servers; we'll see17:41
fungisounds great to me, thanks17:41
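The rough shape of that recovery, sketched with assumed paths (the compose directory and the snapshot directory alongside the datalog one are guesses here):

    cd /etc/zookeeper-compose && docker-compose down   # stop the crashed member
    rm -f /var/zookeeper/datalog/version-2/log.*       # transaction logs
    rm -f /var/zookeeper/data/version-2/snapshot.*     # snapshots; leave myid alone
    docker-compose up -d                               # rejoin and resync from the quorum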
corvusi deleted the logs and snapshots.  that appears to have caused a zk outage :/17:47
corvuszuul is...recovering now?17:49
fungiwas that by deleting from 05 and/or 06 with 04 still down, or was 04 running again?17:50
corvusonly restarting 04 after deleting files from disk17:50
fungimaybe the cluster members got really busy trying to restore missing snapshots so stopped handling requests for a while?17:51
fungii guess zk doesn't like to answer requests if it has reason to believe the cluster is incoherent17:51
clarkbI think snapshots are per node and sort of forgotten about once written, aside from purging17:51
fungiahh, so maybe not then17:51
clarkbbut zk04 reentering the cluster will cause the cluster to bring it up to speed for the current dataset17:52
corvus05 crashed when i brought 04 up17:52
fungiouch17:52
corvusi think that explains the outage (but i don't know why 05 crashed)17:52
corvus06 stayed up17:53
fungisomething was clearly going on with zk05 around 17:3917:54
corvusmy dequeue command is stuck.  i wonder if the rpc handler thread is stuck17:54
corvushowever, both schedulers are still processing pipelines17:56
corvusi've restarted the dequeue/enqueue script17:56
corvusand it is proceeding (i'm unsure if only one scheduler is handling those commands now or both; it doesn't really matter since we're going to restart them)17:57
fungizk05 logged quite a few "java.io.IOException: ZK down" and immediately prior to those appearing there was this:17:57
fungi2021-11-08 17:39:42,051 [myid:5] - WARN  [QuorumPeer[myid=5](plain=[0:0:0:0:0:0:0:0]:2181)(secure=[0:0:0:0:0:0:0:0]:2281):Follower@96] - Exception when following the leader17:57
corvusthe zk graphs look reasonable now; i think it's time to update the config for 05 and restart it now17:57
corvusoh, zk05 disk is full17:58
fungid'oh!17:58
fungiyou'd think i would have checked that FIRST17:58
* fungi sighs17:58
corvusnow we know why it crashed17:58
corvusi did check it first, but i clicked on ze05 and not zk05 :(17:59
* corvus sighs17:59
corvusokay, so i'm going to repeat the process i did on zk0417:59
clarkbshould we check zk06 state first?17:59
clarkbthen see if restarting zk05 causes it to fill its disk?18:00
corvusthere's lots of space on zk06 right now18:00
fungi/dev/root        39G  7.9G   31G  21% /18:00
clarkblike becoming the leader causes an explosion of disk use maybe?18:00
corvusso i think we have good reason to think it might work this time18:00
clarkbcool18:00
corvusstarting 05 now18:02
fungidisk usage is climbing on 06, but not super rapidly18:03
opendevreviewMerged opendev/zone-opendev.org master: Add acme record for zuul01  https://review.opendev.org/c/opendev/zone-opendev.org/+/81706118:05
corvusi don't know if we get a confirmation that zk05 is fully synced or not.  it looks like it has the right amount of data, but there's no warm-fuzzy log message saying everything is okay.18:06
clarkbcorvus: I think with zookeeper it can't be part of the cluster without having synced fully18:06
clarkbbasically once it shows up as a follower to one of the others it is good. But ya no explicit indication of that18:06
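The four-letter-word commands are about as close to a warm fuzzy as zk offers, as a sketch (assumes the plaintext client port and that these commands are allowed via 4lw.commands.whitelist):

    echo srvr | nc localhost 2181 | grep -E 'Mode|Zxid'   # Mode: follower/leader
    echo mntr | nc localhost 2181 | grep -E 'synced_followers|outstanding_requests'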
corvusi think i'm ready to assume that everything is okay and we can proceed to zk06 now18:06
corvus++18:07
corvusi will shut down zk06 now.  it's the leader, so this will cause some churn18:07
corvusstarting 0618:08
corvusit's back up; all 3 servers are up; 05 is the leader now18:09
corvusi think that concludes zk work.  dequeue/enqueue script is still crawling along.18:10
clarkbthen next up is the update to fix the identified zuul bugs?18:11
corvusincidentally, i just observed the expected status page flap for the first time on the zuul tenant status page18:11
corvusyep18:11
fungii observed it a few times earlier today watching system-config changes in the status page for the openstack tenant18:12
corvusonce those changes merge, we'll do the rolling restart.18:12
clarkbcorvus: for the status page flap, that happens because zuul-web can load the state from the non-active scheduler and get stale info?18:12
corvusclarkb: yep (though i wouldn't characterize it as a non-active scheduler, but rather, a scheduler who may not have processed the pipeline most recently)18:13
clarkbthe one that won the election least recently :)18:14
corvus(since it's sort of random which scheduler processes which pipeline(s) at any given time); odds are they should eventually converge on the same data relatively quickly18:14
corvusthey're both active at the same time.  they just both race through the list of tenants and pipelines and try to lock and process each one individually18:15
corvusif they start at the same time, and one of them locks openstack/check the other will skip that one and lock openstack/gate18:15
clarkbgot it18:15
corvusthen when the first finishes openstack/check, it'll try to lock openstack/gate.  if the second has finished, it will succeed and process it again (which is wasteful but harmless and quick); if the second still has the lock, it'll leapfrog and go to openstack/post.18:16
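A toy shell illustration of that skip-if-locked / leapfrog pattern (flock stand-in only, not Zuul code):

    worker() {
      for p in check gate post; do
        (
          flock -n 9 || { echo "worker $1: $p locked elsewhere, leapfrogging"; exit 0; }
          echo "worker $1: processing $p"; sleep 1
        ) 9>"/tmp/demo-$p.lock"
      done
    }
    worker A & worker B & wait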
fungikeeping an eye on the filesystem utilization on the zk servers... based on the fairly linear trending i anticipate we'll keep them under 60% used just before they prune snapshots19:02
clarkbnote I'm trying to dequeue/enqueue things to address the fight between the schedulers over builds with retry results19:02
fungithis should also give us some idea on sizing for cinder volumes19:02
clarkbit is slow going.19:03
fungithough i think rackspace requires cinder volumes be a minimum of 100gb, which will probably be 4x what we end up using with the current churn and pruning19:04
clarkbI am unable to tell if the tripleo gate queue is thrashing in this way or not. It is hard for me to see with the flapping status page. It seems like tripleo might be ok?19:06
corvusi monkeypatched in a fix for the retry thrashing bug; actual fix is https://review.opendev.org/817067 (and that's == what i patched)19:26
clarkbI've approved it just now19:28
clarkbI have ~7 changes to reenqueue that were dequeued and that is in progress19:28
corvusclarkb: yeah, i don't see a way to dequeue that periodic item, but with the immediate retry thrashing issue resolved i think we can probably just leave it?19:28
corvus(if we really wanted to dequeue it i could probably come up with something in the repl, but danger seems to outweigh benefit)19:29
clarkbok19:29
corvusincidentally, my de/re-enqueue of nodepool-bug-affected changes is continuing19:30
fungiit will eventually time out the node requests, right?19:30
clarkba number of the jobs are actually running I think19:30
clarkbit should largely be harmless other than node consumption19:30
corvusfungi: at this point i don't think the periodic item is suffering from any bugs19:30
corvus(and i don't believe node requests have a timeout)19:31
corvusso i think the current state is: retry bug hotpatched; nodepool bug still present but rarely triggered; remediation of earlier nodepool bug fixes in progress; waiting for zuul changes to merge in order to rolling-restart with all known bugs fixed19:32
fungii thought the scheduler would resubmit a node request if not satisfied in something like 8 hours?19:32
fungi(i vaguely recall reviewing a change for that, at least)19:33
SpamapSOhai, trying to add an email identity on my opendev account.. just wondering if the email loop verification emails are working. I waited a whole 5 minutes. ;)19:33
clarkbdown to two reenqueues in progress and then I should be done recovering19:33
fungiSpamapS: i can check the mta on the server19:33
corvusi'm going to afk for a bit19:34
SpamapSfungi: TY, I don't mean to be urgent I'd just like to submit the patch before I have to task switch back to not-zuul things. :)19:35
fungiSpamapS: i see a message for what i assume to be your new work address accepted by aspmx.l.google.com at 19:29:21 utc with gsmtp queue id e18si38289043vsu.30619:35
fungimaybe check your junk mail?19:35
fungidefinitely wasn't rejected, at least19:36
SpamapSOh boo it was counted as spam wow that's rare.19:36
SpamapSTY, reported as not-spam :)19:36
fungiSpamapS: if you can help us determine why it was counted as spam we could try to remedy it or pass the reasons along to the gerrit upstream maintainers19:37
clarkbreenqueue complete for the things I dequeued. I think we're in a good spot now. But will keep an eye on the openstack check queue in particular to make sure things are rotating out of there19:38
SpamapSfungi: DM'd you the full text with headers. I don't see many clues in there. It's possible some creepy AI made a few assumptions.19:39
clarkbI want to say in the past we've had to force smtp over ipv4 for gmail addresses. Maybe we need to find a way to send email via ipv4 for any google hosted domain?19:40
fungiclarkb: in this case it went over ipv4 anyway from the looks of the headers and our logs19:46
clarkbgotcha likely unrelated to google assuming ipv6 mail is spam then19:47
fungii like that my personal mailserver's spamassassin is configured to inject a header line indicating which sa rules a given message matched so i can have some idea of why it was or wasn't considered spam19:47
SpamapSIt's also possible that it's just assuming nobody has ever sent @spotify.com an email from opendev and so it made me +1 the first one.19:47
fungiwish other mailservers did that19:47
clarkbI think the tripleo gate queue may have reset without a failing job. Indication of another bug? I don't have hard evidence of that though, as it's more that I was skimming things and noticed jobs seemed to go away and have now restarted, but I don't see a failed change19:52
fungizk04 pruned its snapshots around 50% utilization on the fs19:54
clarkbtooz runs a ridiculous number of jobs19:54
fungii expect 05 and 06 to also do so shortly, and they're at similar utilization levels19:54
fungiyep, they all cleaned up at around 20-21gb used on the rootfs20:28
fungihopefully that's enough to survive through tomorrow's periodic jobs20:29
clarkbwe are able to reduce the cleanup period to an hour too20:29
clarkba few more options available to us to trim it back further if we feel that is necessary20:30
fungiyep20:30
clarkbthe zuul fixes are landing now and over the next half hour should hopefully all be in20:31
clarkbI'm going to have to grab lunch soon but will be back after and can help with watching it post restart20:31
fungiyeah, i'll be context switching to dinner prep in roughly an hour, but after i've eaten i can help with restarts or whatever20:31
clarkbthe tripleo queue has reset but this time clearly due to a failing job so no concern there21:09
*** dviroel is now known as dviroel|out21:11
ianwanyone mind if i supervise f35 builds today?  https://review.opendev.org/c/openstack/project-config/+/81693121:13
ianwi don't want to spend time debugging f34 any more; if f35 still has issues booting i guess we/i can debug that21:13
ianwwe also have some sort of leak with containerfile builds + podman images; probably on an error path, but i'm not sure yet.  i'll look into that21:14
clarkbianw: is there a corresponding grafyaml update we need to do? But +2 from me21:15
ianwclarkb: hrm, looks like we haven't updated that in a while; i'll do that today21:17
ianwon the gmail spam thing -- prompted me to look, as all my @redhat.com mail is gmail and i have 30 days of history -- on nov 3, 5 & 6 i got one mail in there each day, and today every mail in the stack https://review.opendev.org/c/openstack/heat/+/816604 got put in there21:19
clarkbfungi: corvus: looks like we had a kazoo connection loss at ~20:03 UTC from zuul0221:20
ianwso, that would be another data point that perhaps in early nov something did change to make gmail think gerrit might be more spammy21:20
ianwfor all their machine learning skills, why can't they realise i've probably got tens of thousands of emails from this address and never marked one as spam, and have filters set up to apply labels ... maybe that's a pretty big flag that i do want to see these mails21:21
clarkbwe seem to have a number of these "connection closed due to Len error" messages in the recent zk05 docker container log21:26
clarkbside note we seem to write a new snapshot every 5 minutes or so21:26
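Those "Len error" lines are the zk server refusing packets bigger than jute.maxbuffer (about 1MB by default); a sketch of pulling them out of the container log (the container name here is a placeholder):

    docker logs zk-container 2>&1 | grep -c 'Len error'
    docker logs zk-container 2>&1 | grep 'Len error' | tail -3   # the reported length is the offending packet size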
ianwtobias-urdin: yeah, sorry i missed that use of the image.  although i'd strongly suggest that with the stable branches, fedora is not really a good choice.  i presume it's because something recent is required -- but perhaps 8-stream, or even 9-stream now, would work better?21:27
corvusmy de/re-enqueues from earlier are done21:28
corvusi think we may be writing a too-large object into zk which is triggering a connection loss21:31
clarkbah ok21:31
clarkbthe size is ~1.8MB and our limit is 1MB?21:31
clarkbthe size from the len error I mean is ~1.8MB21:32
corvushttps://paste.opendev.org/show/810857/21:32
corvuswhat's the len error?21:32
corvusoh that's the error from the zk server?21:32
clarkbyes if you docker logs the zk container you'll see the len error tracebacks21:32
clarkbwith what I presume is a byte length21:32
corvusthat sounds plausible21:33
clarkbI noticed it when looking into the "Unknown Configuration Error" messages for https://review.opendev.org/c/openstack/octavia/+/798151 the latest recheck seems to have enqueued though21:34
corvusit looks like we're writing a big serialized config error to zk21:34
corvusi don't quite understand why we're doing that in the format status json call21:35
corvusi think i want to assume the second traceback is somehow corrupt, and i only need to look at the first one.  it makes sense.21:36
clarkblooking at it I think the second one may be a side effect of the connection loss but not due to the specific write21:37
clarkbbasically first one explains the connection loss then second one is explained by the connection loss21:37
corvusyeah, just not sure why the second one is doing any writes; like, _loadDynamicLayout shouldn't be in the call stack for a status json get21:38
clarkboh I see, could be interleaved writes maybe?21:40
corvusi don't think we get that with the logging module21:41
clarkblooks like it is trying to get the event queue length to report in the status json. Maybe that loads more info than anticipated?21:43
corvushow do we go from "raise self._exception" to "item.setConfigErrors" in the next frame?21:44
corvusmaybe it has something to do with kazoo's internal async processing and retry handling21:45
corvusand the second thread has an async retry of the original exception in its call stack.  i dunno.  it's weird.  i'm still just going to try to fix the first error.  :)21:46
corvusunfortunately, i don't know which attribute of the ConfigurationError we're trying to serialize which is too big21:46
clarkbmy hunch would be the message content since that can be quite large iirc21:47
corvusright, but "error" or "short_error" ?21:48
corvusor is it that there are a large number of ConfigurationError objects21:48
clarkbgotcha21:49
clarkbcorvus: should we shard all the error objects? since they can easily get large? or are you still trying to track down the specific attribute(s)22:33
corvusclarkb: i'm working on a change to shard all the error objects22:34
clarkbok I look forward to reviewing it22:34
fungiokay, i'm back from cooking and eating foods22:49
clarkbinfra-root I've updated https://wiki.openstack.org/wiki/Meetings/InfraTeamMeeting with notes about multi zuul scheduler and the container user stuff I started looking at. I dropped fips stuff and fedora 34. Anything else we should add?22:50
fungiand caught up on the zk disconnect sitch22:50
clarkbianw: btw the new js theme looks great22:50
ianwthanks, i think that's it.  will be working on the 3.4 upgrade prep22:52
ianwi don't know if i just hadn't noticed before, but it was great to see on the polymer docs page i was looking at that it's all deprecated and now you should use "lit"22:56
clarkbhahaha22:56
ianwso just as we get polygerrit and i have some vague understanding of it ... it's already old news22:56
fungiweb frameworks will stabilize some day23:16
fungiand on that day, i'll begin to understand them23:16
clarkbcorvus: fungi  there are ~8 changes in openstack check that look like they might be stuck. They are the first 823:24
clarkbis this possibly related to the lost noderequests somehow?23:24
corvusclarkb: yes, likely so.23:24
clarkbok I can dequeue enqueue them if we think that is the best step for them23:25
corvusi think all the bugfixes so far have landed, except the one i just uploaded for the configerrors23:25
corvusclarkb: yeah, i'm leaning toward dequeue/enqueue and wait for the next bugfix to land before doing the rolling restart23:26
clarkbok I'll start on those now23:26
corvusok.  i'm going to afk so i can be around when it lands23:27
clarkbalright dequeue enqueue complete for the changes that looked suspect to me23:39
clarkbcorvus: when you return 815696,4 seems to be stuck but is also only just 2 hours old which means it should be more recent than the monkey patches. I'll hold off on dequeue/enqueuing that one in case that is a good one to examine for further debugging23:43
corvusclarkb: i only monkeypatched the retry fix, not nodepool23:47
corvusso i think you should just go ahead and re-enqueue that one23:48
clarkboh got it23:48
clarkbproceeding23:48
corvuswhew, i saw zuul tenant gate head with a queued job, but it just started :)23:54
clarkbI notice there are a few more changes that need to be dequeue/enqueued. Working on that now23:59
clarkbthey weren't all at the top of the queue23:59
