Wednesday, 2021-06-09

mordredBUT- if it starts growing its own set of dependencies - then it might want to have its own container image00:00
mordredircbot and statusbot are using the same libs right?00:00
clarkbno, one is limnoria, the other is python irclib or whatever it is called00:01
clarkbthere may be some overlap in other libs like ssl things00:01
ianwmordred: i mean not really; ircbot is really a limnoria container.  we'd *like* statusbot to be a limnoria plugin but it isn't00:01
mordredah. I mean - for my pov - that sounds like two different images. but- if it's not a problem to co-install right now, then shrug - wait until it's a problem, yeah?00:02
ianwi mean we do have to override the entrypoint, which is a bit ugly00:02
opendevreviewMerged opendev/system-config master: Remove special x/ handling patch in gerrit  https://review.opendev.org/c/opendev/system-config/+/79199500:03
mordredany reason to avoid making a new image? just the overhead of the dockerfiles and the jobs and whatnot?00:03
*** odyssey4me has quit IRC00:03
ianwyeah, it's mostly me wanting to be done with this post-haste.  but i think i will just make the separate image to avoid confusion00:04
mordredI hear you :)00:06
*** odyssey4me has joined #opendev00:12
clarkbianw: fungi re 791995 we should be covering that in testing with https://opendev.org/opendev/system-config/src/branch/master/testinfra/test_gerrit.py#L29-L3300:13
clarkband https://gerrit.googlesource.com/gerrit/+/b1f4115304a3820be434a6201da57e4508862f82 is the upstream commit to fix things00:14
ianw++00:15
clarkbI had to go and check tripleo to convince myself it is probably ok00:15
ianwi'll probably look at pulling/restarting in ~4 hours?00:16
clarkbwfm, but I'll not be around :) I'm also happy to try and get it done tomorrow if you end up busy00:16
fungii may be around then, but also may not be very useful if so00:18
clarkbianw: I would take note of the image that is currently running so that a rollback is straightforward if necessary00:18
clarkbI checked via our config mgmt stuff to see if we clean up images like we do with zuul and we don't appear to do that with gerrit, so we should be able to easily roll back if necessary by overriding the image in the docker-compose file then reverting the change above and updating our image promotion00:19
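A hedged sketch of the rollback clarkb describes, i.e. pinning the previously running image in the gerrit compose file so a pull cannot advance it; the file path, service name, and digest placeholder are illustrative, not the actual production values:

    # docker-compose.yaml on review (path and service name assumed)
    services:
      gerrit:
        # Pin to the digest recorded before the restart instead of the 3.2 tag;
        # <previous-digest> stands in for the value noted from "docker inspect".
        image: docker.io/opendevorg/gerrit@sha256:<previous-digest>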
ianw++00:24
opendevreviewIan Wienand proposed opendev/statusbot master: Add container image build  https://review.opendev.org/c/opendev/statusbot/+/79542800:37
*** ysandeep has joined #opendev00:41
*** ysandeep is now known as ysandeep|ruck00:41
ianwmnaser: ^ we know f34 containerfile boots but after that ... i'm sure you'll let us know of any issues :)  note that to use containerfile you'll want to make sure /var/lib/containers is mounted; see https://opendev.org/zuul/nodepool/src/branch/master/playbooks/nodepool-functional-container-openstack/templates/docker-compose.yaml.j200:42
ianwit's all extremely new so documentation and bug reports all welcome00:42
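A minimal docker-compose sketch of the mount ianw is describing, loosely modelled on the linked nodepool template; the service name, image tag, and privileged flag are assumptions rather than the exact production config:

    # Sketch only: bind-mount /var/lib/containers so the dib containerfile
    # element can run its podman/buildah image build inside the builder container.
    services:
      nodepool-builder:
        image: zuul/nodepool-builder:latest   # assumed image name/tag
        privileged: true                      # assumed; builders typically need it
        volumes:
          - /var/lib/containers:/var/lib/containers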
opendevreviewIan Wienand proposed opendev/statusbot master: Add container image build  https://review.opendev.org/c/opendev/statusbot/+/79542800:57
ianwTemporary failure resolving 'debian.map.fastlydns.net' Could not connect to deb.debian.org:80 (151.101.250.132), connection timed out01:25
ianwi wonder if there's still issues01:25
opendevreviewMerged opendev/statusbot master: Add container image build  https://review.opendev.org/c/opendev/statusbot/+/79542801:51
opendevreviewIan Wienand proposed opendev/system-config master: Run statusbot from eavesdrop01.opendev.org  https://review.opendev.org/c/opendev/system-config/+/79521302:19
*** ysandeep|ruck is now known as ysandeep|away02:53
*** ykarel|away has joined #opendev03:30
opendevreviewIan Wienand proposed opendev/system-config master: Run statusbot from eavesdrop01.opendev.org  https://review.opendev.org/c/opendev/system-config/+/79521303:36
ianwok the current container on review is "Image": "sha256:57df55aec1eb7835bf80fa6990459ed4f3399ee57f65b07f56cabb09f1b5e455",03:48
ianwthe latest docker.io/opendevorg/gerrit:3.2 is https://hub.docker.com/layers/opendevorg/gerrit/3.2/images/sha256-9cbb7c83155b41659cd93cf769275644470ce2519a158ff1369e8b0eebe47671?context=explore that was pushed a few hours ago, that lines up03:50
ianwi've pulled the latest image now03:54
ianwreporting itself as 3.2.10-21-g6ce7d261e1-dirty03:58
ianwopendevorg/gerrit                                         3.2                                    sha256:9cbb7c83155b41659cd93cf769275644470ce2519a158ff1369e8b0eebe47671   7a921f417d3c        4 hours ago         811MB03:59
ianw       "Image": "sha256:7a921f417d3cf7f9e1aa602e934fb22e8c0064017d3bf4f5694aafd3ed8d163c",04:00
ianwergo the container is running the image that has the same digest as the upstream tag.  i.e. we're done :)04:00
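For reference, the digest comparison above boils down to commands along these lines; the container name is an assumption, and the digests come from the log rather than from re-running anything:

    # Image the running container was started from
    docker inspect --format '{{ .Image }}' gerrit_gerrit_1   # container name assumed
    # Locally pulled tag plus its registry digest, to compare with hub.docker.com
    docker images --digests opendevorg/gerrit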
ianw#status restarted gerrit to pick up changes for https://review.opendev.org/c/opendev/system-config/+/79199504:01
opendevstatusianw: unknown command04:01
ianw#status log restarted gerrit to pick up changes for https://review.opendev.org/c/opendev/system-config/+/79199504:01
opendevstatusianw: finished logging04:01
*** ysandeep|away has quit IRC04:18
*** ysandeep|away has joined #opendev04:18
opendevreviewIan Wienand proposed opendev/statusbot master: Dockerfile: correct config command line  https://review.opendev.org/c/opendev/statusbot/+/79547304:29
*** odyssey4me has quit IRC04:31
*** ysandeep|away is now known as ysandeep|ruck04:36
opendevreviewMerged opendev/statusbot master: Dockerfile: correct config command line  https://review.opendev.org/c/opendev/statusbot/+/79547304:47
opendevreviewIan Wienand proposed opendev/system-config master: Run statusbot from eavesdrop01.opendev.org  https://review.opendev.org/c/opendev/system-config/+/79521305:06
*** marios has joined #opendev05:10
opendevreviewMerged openstack/project-config master: Move devstack-gate jobs list to in-tree  https://review.opendev.org/c/openstack/project-config/+/79537905:30
*** ykarel|away has quit IRC05:41
*** ykarel|away has joined #opendev05:41
*** ykarel|away is now known as ykarel05:41
opendevreviewMerged opendev/system-config master: Cleanup ask.openstack.org  https://review.opendev.org/c/opendev/system-config/+/79520705:42
*** ykarel is now known as ykarel|mtg06:07
*** whoami-rajat has joined #opendev06:12
*** ralonsoh has joined #opendev06:29
*** osmanlic- has joined #opendev06:45
*** osmanlicilegi has quit IRC06:45
*** odyssey4me has joined #opendev06:55
*** hashar has joined #opendev06:58
*** dklyle has quit IRC06:58
*** osmanlic- has quit IRC07:01
*** osmanlicilegi has joined #opendev07:05
*** amoralej|off is now known as amoralej07:13
*** andrewbonney has joined #opendev07:16
*** jpena|off is now known as jpena07:31
*** rav has joined #opendev07:49
ravHow do i rebase master branch to another branch in my repo?07:50
*** tosky has joined #opendev07:53
fricklerrav: what would be the use case for that? you usually rebase other branches onto master (likely before merging them into master), but rebasing master seems very unusual? also, is this a general git question or somehow specific to opendev?07:57
*** ysandeep|ruck is now known as ysandeep|lunch08:05
ravSo i did a clone to a branch.. then i did git checkout "stable/wallaby" then i made some changes, but when i pushed them the changes went to master instead of stable/wallaby. In general git i know how to merge to a branch but opendev seems to not work the same way for me08:08
*** lucasagomes has joined #opendev08:10
toskywhen you say "push", do you mean "git review"? And when you did git checkout stable/wallaby, did git really create a local stable/wallaby branch which points to the remote stable/wallaby branch?08:21
*** ykarel|mtg is now known as ykarel08:30
ravYes its git review08:39
ravI think its creating a local branch08:41
*** ysandeep|lunch is now known as ysandeep08:50
*** ysandeep is now known as ysandeep|ruck08:50
*** boistordu_ex has joined #opendev08:55
ysandeep|ruck#opendev is this the right channel to report if some jobs are stuck on zuul?09:21
*** rpittau|afk is now known as rpittau09:22
ysandeep|rucktop jobs in the tripleo queue are stuck even though they are not running09:24
ysandeep|rucki think i should try my luck on # zuul09:25
ravHow to delete a branch in opendev??09:28
tobiash[m]ysandeep|ruck: looks like something might be stuck in the queue processing there. I guess an infra-root needs to look at that. (however most are located in us timezone)09:29
ysandeep|rucktobiash[m]: thank you so much for sharing that, I will reping in few hours then..09:31
*** swest has joined #opendev09:32
*** hjensas has joined #opendev09:35
fricklertobiash[m]: ysandeep|ruck: we did a gerrit restart earlier, maybe that broke something in zuul processing09:48
ravCan someone tell me how to delete a branch in OpenDev.org? TIA09:51
*** hashar has quit IRC09:52
tobiash[m]ysandeep|ruck: if you need to unblock urgently you could abandon/restore the top change, but I guess it would be helpful for the analysis to leave it until someone had a chance to look at that if it's not urgent09:52
ysandeep|rucktobiash[m], frickler thanks! we can wait for few hours, so that infra-root can check and fix the issue permanently.09:54
fricklerrav: likely an admin would have to do that, which branch in particular are you referring to?09:54
ravstable/wallaby09:55
ravI have the admin rights09:55
ravin my own repo09:55
fricklerrav: sorry, I should have been more specific: which repo?09:55
ravhttps://opendev.org/x/networking-infoblox/src/branch/master09:55
ravfrickler: this is the branch09:58
fricklerrav: hmm, o.k., I'm actually not sure what our procedure for dealing with this type of issue is for the x/ tree, so I'll have to defer to some other infra-root, should be around in a couple of hours10:02
ravok10:02
fricklerinfra-root: checking zuul logs I found these errors, I see no direct relation to the tripleo stuckness, but likely worth looking at anyhow http://paste.openstack.org/show/806483/10:06
fricklerthese look more suspicous even: Exception while removing nodeset from build <Build 60b0ee70b6f34543a85719b673929635 of n10:07
fricklerova-tox-functional-py38 ... No job nodeset for nova-tox-functional-py3810:08
tobiash[m]frickler: can you paste the stack trace?10:08
fricklertobiash[m]: http://paste.openstack.org/show/806484/ I hope I selected the proper context10:11
fricklerto me it looks like zuul is in some weird broken state in general10:11
tobiash[m]do you see some recurring exceptions in the live log of the scheduler?10:12
fricklerI've also earlier wondered why https://review.opendev.org/c/openstack/devstack-gate/+/795383 doesn't go into gate10:12
fricklertobiash[m]: that seems to repeat every minute or so for multiple jobs, yes10:12
tobiash[m]hrm, that exception is cancel related, are there other recurring exceptions as well other than the one in removeJobNodeSet?10:13
tobiash[m]maybe something mentionung the run handler10:14
fricklertobiash[m]: well there a those about semaphores I posted before that http://paste.openstack.org/show/806483/10:14
tobiash[m]is there anything suspicious when grepping the log for 4ef4a09fbb50478f8c7f6bfee2fb3926 (which is the event id of the stuck item in tripleo)10:17
fricklertobiash[m]: not in today's log, checking yesterday now10:20
*** ysandeep|ruck is now known as ysandeep|afk10:33
fricklerso the semaphore errors have been present for some time and thus are likely unrelated. found nothing else yet, will check back later10:41
*** ysandeep|afk is now known as ysandeep|ruck11:06
*** jpena is now known as jpena|lunch11:44
fungirav: make sure the .gitreview file on that branch mentions the correct branch as its defaultbranch, or alternatively use the first command-line argument with git review to tell it what branch you're proposing for12:03
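In concrete terms, a sequence that avoids rav's problem looks like this; the repo is the one from the discussion and the commands are standard git / git-review usage:

    git clone https://opendev.org/x/networking-infoblox
    cd networking-infoblox
    # create a local branch that tracks the remote stable/wallaby branch
    git checkout -b stable/wallaby origin/stable/wallaby
    # ...edit, git commit...
    # either rely on defaultbranch=stable/wallaby in .gitreview, or name the
    # target branch explicitly when proposing the change:
    git review stable/wallaby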
fungirav: branch deletion is a permission which can be delegated in the project acl, but generally yes what frickler said, you want to be careful with branch deletions, and should probably tag them before deleting if they had anything useful on them12:04
fungiysandeep|ruck: frickler: tobiash[m]: the recent common cause i've seen for stuck items in pipelines is that one of the nodepool launchers has managed to not unlock the node request after thinking it's either satisfied or declined it12:05
fungiif you can find which launcher last tried to service those node requests, then restarting its container frees up the locks12:06
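For reference, a sketch of that unblock procedure; the log path, compose directory, and service name are assumptions about the launcher deployment layout rather than confirmed paths:

    # on the launcher that last handled the stuck node request
    grep '<node-request-id>' /var/log/nodepool/launcher-debug.log   # confirm it is the right host; path assumed
    cd /etc/nodepool-docker                                         # assumed compose directory
    docker-compose restart nodepool-launcher                        # service name assumed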
*** ysandeep|ruck is now known as ysandeep|mtg12:07
fungii have yet to be able to identify the cause of that behavior in nodepool, but it seems to happen most when there are lots of boot failures, so likely some sort of race around that12:07
fungithough this case looks different, i see the stuck builds aren't waiting for nodes12:18
*** amoralej is now known as amoralej|lunch12:19
fungithe "Exception: No job nodeset for" occurrences, while vast in number, are likely unrelated since they stretch back at least several days in the scheduler's logs at similar levels... 338913 so far today, 530287 last thursday12:26
*** artom has quit IRC12:28
fungiwhatever's happening seems to stall after result processing... looking at the in progress tripleo-ci-centos-8-scenario000-multinode-oooq-container-updates build for 795302,3 currently at the top of the check pipeline for the openstack tenant, this is the last thing the scheduler logged about it:12:30
fungi2021-06-09 01:10:26,804 DEBUG zuul.Scheduler: Processing result event <zuul.model.BuildStatusEvent object at 0x7f267ce9f940> for build fa45a5d15d404267b912be4bc14b373412:30
fungiso that build seems to have completed over 11 hours ago but hasn't been fully marked as such and so the buildset hasn't reported12:31
fungiinterestingly, a different build (7fbf525fe42b4f11849933cb606212a3) for that same nodeset completed a minute later and seems to have gone on to register a "Build complete, result ABORTED" in the log12:37
fungier, for that same buildset i mean12:37
fungitobiash[m]: as for your speculation in #zuul about lost result events, this does seem like it could be a case of it12:38
*** weshay|ruck has joined #opendev12:44
*** dviroel has joined #opendev12:47
opendevreviewBenjamin Schanzel proposed zuul/zuul-jobs master: Add a meta log upload role with a failover mechanism  https://review.opendev.org/c/zuul/zuul-jobs/+/79533612:48
*** jpena|lunch is now known as jpena12:49
*** dviroel is now known as dviroel|brb13:02
*** arxcruz|rover is now known as arxcruz13:03
*** amoralej|lunch is now known as amoralej13:07
*** ykarel has quit IRC13:19
fungii could try to manually dequeue the affected items, or worst case perform a scheduler restart, but would prefer to wait for corvus and clarkb to be around in case the running state provides us some clues as to the cause which we could otherwise lose13:23
fungialso hoping tobiash[m] has input, since he seemed to have suspicions as to the cause already (the affected build was seen as generating a result event but the scheduler never logged processing it, and the results queue is showing 0, suggesting the result was indeed lost somewhere along the way)13:25
*** rav has quit IRC13:26
tobiash[m]fungi: dequeue should be sufficient13:26
fungier, rather, the scheduler logged that it was processing it, but never continued to log what the result actually was13:26
tobiash[m]fungi: afaik the result events now go through zk so maybe there was a zk session loss during that time?13:27
fungiahh, good point, i'll check the logs for the zk cluster members13:27
tobiash[m]fungi: I'd start with grepping the scheduler logs for 'ZooKeeper connection'13:30
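That is, something along these lines on the scheduler; the debug log path is assumed to be the usual zuul location:

    grep 'ZooKeeper connection' /var/log/zuul/debug.log      # path assumed
    # kazoo-level connection warnings, if any, land in the same log:
    grep -i 'connection dropped' /var/log/zuul/debug.log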
fungiinterestingly we seem to mount /var/zookeeper/logs as /logs in the zk containers, but the directory is completely empty13:30
fungitobiash[m]: yeah, not finding any signs of a zk connection problem in the scheduler debug log, at least13:33
tobiash[m]ok, that's good, so there is maybe some edge case during result event processing13:33
fungii'm going to need to step away soon for a bit to take care of some errands, but hopefully shouldn't be gone more than an hour-ish13:35
*** ysandeep|mtg is now known as ysandeep13:43
corvuso/13:43
fungilooks like zk is just logging to stdout/stderr, so `docker-compose logs` has it, though quite verbose13:43
fungimornin' corvus!13:44
fungiwe have what looks like a lost result for build fa45a5d15d404267b912be4bc14b3734 and possibly others13:44
fungitrying to work out what could have precipitated it losing track of that13:45
fungiit's basically stuck in a permanent running state because the scheduler didn't evaluate the result reported back in gearman13:46
fungilooking in the zk cluster member logs, there was some sort of connection event around 01:28 which all of them logged, though that was 8 minutes after the build returned a result. i guess if the scheduler had a backlog in the results queue at that point due to a reconfiguration event or something then maybe the writes to zk were lost "somehow" ? (i'm a bit lost really)13:57
*** ykarel has joined #opendev14:00
fungiokay, taking a quick break now to run errands, but should return soon14:05
*** tcervi has joined #opendev14:13
*** tosky has quit IRC14:16
*** tosky has joined #opendev14:18
*** tcervi has quit IRC14:42
ysandeep#opendev are you still investigating the stuck tripleo gate? tobiash[m] earlier mentioned that if we abandon/restore the top patch in the gate queue it would clear the queue. Should we go that route?14:52
ysandeepweshay|ruck, fyi.. ^^14:53
weshay|ruckhappy to abandon all the patches in gate if that will help infra14:53
tobiash[m]you'd just need to abandon/restore the top item, that should cause a gate reset and re-run all the other items14:54
*** hashar has joined #opendev14:57
weshay|rucktobiash[m], ysandeep just confirming it was just the top patch that needs a reset? https://review.opendev.org/c/openstack/tripleo-heat-templates/+/794634/15:00
weshay|ruckjust FYI.. if I would have known that earlier.. I would have done that happily :)15:00
*** dviroel|brb is now known as dviroel15:01
ysandeepweshay|ruck, tobiash[m] suggested we wait for infra-root to debug first.. lets see what infra-root says15:01
tobiash[m]weshay|ruck: yes, that worked, feel free to restore it again15:02
ysandeeptobiash[m], thanks!15:02
fricklercorvus: ^^ since IIUC clarkb is familying, it would be up to you to decide whether you want to debug further or let folks try to get their queue unstuck15:03
*** dklyle has joined #opendev15:04
clarkbI'm sort of here but ya slow start today15:05
clarkblooks like they may have gone ahead and done the abandon and restore? the top of the tripleo queue appears to be moving at least15:10
ysandeepclarkb, yes weshay|ruck abandoned/restored the top one .. the gate queue was getting really long15:14
fungiysandeep: weshay|ruck: sorry, i had to take a break from investigating to run some errands, and was hoping we could keep it in that state to make it easier for others to possibly identify the cause so we can hopefully keep it from happening again in the future15:15
weshay|ruckah k15:15
fungibut maybe the logs we have will still lend a clue15:16
weshay|ruckroger that.. hope I didn't mess you up15:16
weshay|ruckthat was not clear to me15:16
fungiit's fine, wouldn't expect you to read dozens of lines of scrollback15:16
weshay|ruckya.. my friggin bouncer is down :(15:16
weshay|ruckhardware :(15:16
weshay|ruckfungi, so if it gets stuck again.. do you want us to hold it there?15:17
weshay|ruckfor investigation?15:17
fungiweshay|ruck: at least give us a heads up again before resetting it, yeah15:18
weshay|ruckk.. I'll make sure we're communicating that.. thanks for the help / support :)15:19
fungiyw15:21
fungiand thanks for bringing it to our attention earlier15:21
clarkbfungi: does the zk blip line up with the gerrit 500s reported in -infra time wise?15:22
clarkbI wonder if there was a disturbance in the force15:22
fungiit's possible there are more examples still stuck, i'll check for them once i get done eating15:22
clarkba netowrk flap in that region could explain a number of these issues people are reporting possibly15:22
clarkband even if the times don't line up maybe there was some ongoing or rolling set of network updates15:23
fricklerclarkb: fungi: another example to look at may be https://review.opendev.org/c/openstack/devstack-gate/+/795383 which doesn't start gating even after I rebased and re-workflowed it. different error at first, but possibly related15:25
fungiclarkb: no, the lost result came into the results queue at 01:10 utc, the zk logs i was looking at were from 01:28 utc, the reports of gerrit and gitea problems were around 10:10-10:15 utc15:25
fricklerand with that I'm mostly off for today15:25
fungithanks frickler!15:25
fungiand yeah, i started to look at that d-g change, the last thing zuul logged was that it could not merge15:25
fungiso possibly unrelated15:25
clarkbfungi: also I think the zk connection status goes into syslog if you need an easy way to check that in the future15:26
clarkbs/status/state changes/15:27
fungiit didn't seem to be in syslog that i could spot at least15:27
fungibut docker-compose was capturing it all15:27
clarkbthe +A on the d-g change is not from when tripleo had a sad either, looks like15:28
clarkbfungi: frickler  I reapproved 795383 and it is enqueued now15:38
*** ysandeep is now known as ysandeep|away15:50
opendevreviewHervĂ© Beraud proposed openstack/project-config master: Temporarly reverting project template to finalize train-em  https://review.opendev.org/c/openstack/project-config/+/79558315:57
clarkbfungi: frickler: is it possible zuul also lost connectivity to the gerrit event stream and missed that approval event?16:01
tobiash[m]fungi, clarkb, corvus : this is another (still) live item that looks stuck as well and doesn't block a gate pipeline: https://zuul.opendev.org/t/openstack/status/change/795302,316:01
tobiash[m]so if you need to live debug I guess this can be used as well16:01
tobiash[m]same goes for the first few items in check16:01
clarkbtobiash[m]: ya looks like everything that is older than ~14 hours in check?16:02
tobiash[m]so to me it looks like there was some kind of event 14-16h ago that led to a handful of stuck items16:02
clarkbtobiash[m]: we have evidence of network issues between gerrit and its database at a different time (about 6 hours ago)16:02
tobiash[m]clarkb: yes16:02
opendevreviewHervĂ© Beraud proposed openstack/project-config master: Dropping revert  https://review.opendev.org/c/openstack/project-config/+/79558616:03
clarkbI'm starting to think that there may have been widespread/rolling network problems and this is all fallout from that16:03
tobiash[m]maybe fallout from the fastly cdn issue16:03
clarkbapparently zoom is having trouble too16:04
tobiash[m]at least network issues would be the best case since that would mean that we didn't introduce a regression in event handling16:04
clarkbI'm also seeing incredibly slow downloads to my suse mirror for system updates16:06
corvusi'm back from breakfast and will continue looking into this now16:06
clarkbcorvus: thanks, note the changes in openstack's check pipeline if you want to see some in action16:06
*** ysandeep|away has quit IRC16:06
*** hjensas is now known as hjensas|afk16:06
clarkbI'll hold off on trying to reset those until you give a go ahead16:06
clarkb(the ones older than 14 hours are likely to be hit by this)16:06
corvusyes, thanks; especially when something is "stuck" it's helpful to have it stay stuck to try to figure out why16:07
*** ykarel has quit IRC16:07
corvusi'm going to ignore the previous gate issue for now and focus on 795302 instead16:07
fungithe current zk logs on zk04 logged a rather lengthy java.io.IOException backtrace at 2021-06-09 01:28:33,55016:11
fungiseems to be a "Connection reset by peer" event16:11
fungithe uptime of all 3 zk servers is a couple months though, so none of them rebooted16:11
clarkbfungi: zk will reset connections if there is an internal ping timeout16:12
clarkband those internal ping timeouts could be caused by network instability16:13
fungisimilar one on zk05 at 01:28:31,08316:13
fungi(though it also has a prior one in its log from 2021-06-04 13:33:34,950)16:13
*** lucasagomes has quit IRC16:14
*** marios is now known as marios|out16:15
funginone in the logs on zk06 however16:15
fungiso yes, more anecdotal evidence of "network instability" but no smoking gun16:15
*** rpittau is now known as rpittau|afk16:16
clarkbfungi: if you look at the client side (a lot more work since we have a few) we might see evidence that those talking to 04 and 05 disconnected but those connected to 06 were ok?16:17
fungigot it, so this could be client disconnects not intra-cluster16:20
corvuslooking at the zk watches graph, i don't see any shift happening around 13:3016:21
corvusif there were client disconnects i would expect that graph to rebalance16:21
corvusoh i'm looking at the wrong time; 13:30 is from a long time ago16:22
corvus1:30 is the relevant time16:22
fungiyep, seems it could be clients... immediately after the disconnects i see nb01.opendev.org and nb03.opendev.org16:23
*** marios|out has quit IRC16:23
fungireauthenticating16:23
fungialso nb02.opendev.org16:24
fungiso all the builders reconnected within seconds after the 3 disconnects. before the disruption one was connected to zk04 and two to zk05, but on reconnecting one of them moved from 05 to 0616:25
fungiat least that's what i infer from the zk logs anyway16:25
funginice that we have recognizable cn fields in the client auth certs16:26
corvuswe haven't seen any indication the scheduler had a problem connecting with zk though, right?16:27
fungii found none, no16:28
funginothing in the scheduler logs to indicate it, and i found only 3 connection reset exceptions in the zk server logs all of which are accounted for by the nb servers reconnecting16:29
clarkbthe build result events are recorded by executors now though right?16:29
clarkbmaybe the executor for those jobs failed to write to zk properly and the scheduler never saw things update?16:29
corvusyeah, i'm heading over to the executor to look now16:30
corvusfa45a5d15d404267b912be4bc14b3734 on ze0216:31
corvus 16:32:02 up 14:26,  1 user,  load average: 5.25, 4.70, 4.8816:32
corvusthat's suspicious16:32
corvuspossible executor hard reboot around the time16:32
corvuswe may be looking at a case of an improperly closed tcp connection, so gearman didn't see the disconnect?16:33
tobiash[m]I thought I had implemented a two way keep alive years ago16:34
corvus2021-06-09 01:28:42,748 is the last log entry on ze02 before:16:35
corvus2021-06-09 02:06:17,890 DEBUG zuul.Executor: Configured logging: 4.4.1.dev916:35
corvus(which is a startup log message)16:35
corvusalso there's some binary garbage between the two16:35
corvustobiash: me too16:36
tobiash[m]did it reboot or get stuck?16:37
clarkbthe uptime indicates a reboot16:37
clarkb16:32:02 up 14:2616:37
tobiash[m]I think I've seen seldom stuck executors that kept the gearman connection but didn't do anything, but that was long ago16:37
*** jpena is now known as jpena|off16:38
corvusdefinitely a reboot; and no syslogs preceding it, so i'm assuming a forced reboot from the cloud16:38
corvuslike crash+boot16:38
corvusi don't see any extra TCP connections between the scheduler and ze0216:39
corvusthe gearman server log is empty16:40
corvusgeard should have detected that and sent a work_fail packet to the scheduler16:45
corvusi don't know why it didn't, but the fact that we have zero geard logs will make that very hard to debug16:45
clarkbcorvus: did the geard logs possibly end up in the debug log?16:46
clarkbcorvus: fungi: I finally managed to check up on emails and both ze02 and the gerrit mysql had host problems16:48
clarkbI guess that means not widespread networking issues but hosting problems. Good to have that info to point at though showing it didn't happen in a vacuum16:49
fungiclarkb: ooh, good thinking, i hadn't checked that inbox yet today16:49
fungibut in the "widespread network disruption bucket" this failure just got mentioned in #openstack-release https://zuul.opendev.org/t/openstack/build/8f1a900189f547c688259da9fcafa71216:49
corvusclarkb: i don't see any geard logs there, but i did find that the loop that's supposed to detect and clean up this condition is broken:16:49
corvushttp://paste.openstack.org/show/806500/16:50
corvusthat is repeated X times in the log... I'm guessing X == the number of stuck jobs right now16:50
corvus11 times in the logs16:50
corvusi count 9 stuck jobs16:51
fungiseems like about the right order of magnitude, yeah16:51
fungione we know was reset by abandon/restore16:52
fungimaybe the other got a new patchset in the meantime16:52
corvusgood point; i don't know if the abandoned job is still in that list; could be16:52
*** amoralej is now known as amoralej|off16:54
corvusoh i see16:55
corvuswe're right between two v5 ZK changes:  we have result events in zk, but build requests in gearman16:55
corvusa client disconnect is a gearman result event16:56
corvusand they're effectively ignored16:56
fungioh neat!16:56
corvusso it's very likely that geard did send the WORK_FAIL packet; and the scheduler ignored it since "real" results come from zk now16:56
corvustobiash: ^16:57
fungiyeah, that seems like a pretty good explanation of exactly what we saw16:57
corvusthe cleanup in http://paste.openstack.org/show/806500/ should have caught it even so, but it has a fatal flaw16:59
corvuslet me see if i can patch that real quick16:59
opendevreviewMichael Johnson proposed opendev/system-config master: Removing openstack-state-management from statusbot  https://review.opendev.org/c/opendev/system-config/+/79559616:59
corvuspatch in #zuul17:03
fungithanks!17:04
*** artom has joined #opendev17:04
opendevreviewMichael Johnson proposed openstack/project-config master: Removing openstack-state-management from gerritbot  https://review.opendev.org/c/openstack/project-config/+/79559917:04
*** andrewbonney has quit IRC17:05
johnsomHi opendev neighbors. I have posted the follow up patches to retire the openstack-state-management channel in favor of using openstack-oslo as discussed on the discuss mailing list. I don't have op privilege on the openstack-state-management channel to update the topic.17:06
johnsomCan someone update the topic to something similar to "The channel is retired. Please join us in #openstack-oslo" ?17:07
fungiwe can take care of it, sure. i'll do that in a moment17:07
johnsomThank you!17:07
fungijohnsom: can you also take a look at https://docs.opendev.org/opendev/system-config/latest/irc.html#renaming-an-irc-channel and comment it out in the accessbot config?17:09
fungijohnsom: i've adjusted the topic now17:09
johnsomfungi Ok, I wasn't sure on that part. Updating...17:10
fungiit's new, capabilities on oftc differ from freenode so we had to adjust our channel renaming process, and decided to start aging out our channel registrations17:10
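The accessbot side of that retirement is roughly a one-line change in openstack/project-config; a sketch of the documented "comment it out" step, with the file layout reconstructed from memory and therefore an assumption:

    # accessbot/channels.yaml in openstack/project-config (layout assumed)
    channels:
      - name: openstack-oslo
      # retired 2021-06, kept commented so the registration can age out
      # per the channel renaming process:
      # - name: openstack-state-management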
opendevreviewMichael Johnson proposed openstack/project-config master: Removing openstack-state-management from the bots  https://review.opendev.org/c/openstack/project-config/+/79559917:11
johnsomUpdated17:11
fungiappreciated!17:12
opendevreviewMohammed Naser proposed opendev/system-config master: Add Fedora 34 mirrors  https://review.opendev.org/c/opendev/system-config/+/79560217:27
clarkbmnaser: question on ^17:29
mnaseryes, yes we do clarkb :)17:29
opendevreviewMohammed Naser proposed opendev/system-config master: Add Fedora 34 mirrors  https://review.opendev.org/c/opendev/system-config/+/79560217:30
mnaseri will update nodepool to build f3417:30
mnaserand then move f32 to f3417:31
*** ralonsoh has quit IRC17:34
opendevreviewMohammed Naser proposed openstack/project-config master: Build images for Fedora 34  https://review.opendev.org/c/openstack/project-config/+/79560417:35
*** amoralej|off has quit IRC17:42
*** whoami-rajat has quit IRC18:13
opendevreviewMohammed Naser proposed zuul/zuul-jobs master: Switch jobs to use fedora-34 nodes  https://review.opendev.org/c/zuul/zuul-jobs/+/79563618:15
opendevreviewMohammed Naser proposed opendev/base-jobs master: Switch fedora-latest to use fedora-34  https://review.opendev.org/c/opendev/base-jobs/+/79563918:18
opendevreviewMohammed Naser proposed openstack/project-config master: Stop launch fedora-32 nodes nodepool  https://review.opendev.org/c/openstack/project-config/+/79564318:30
opendevreviewMohammed Naser proposed openstack/project-config master: Remove fedora-32 disk image config  https://review.opendev.org/c/openstack/project-config/+/79564418:30
mnaserinfra-core: i just pushed up everything needed to get rid of f3218:30
mnasermight need some rechecks here and there because of the dependencies18:30
mnaserit's all covered here - https://review.opendev.org/q/hashtag:%22fedora-34%22+(status:open%20OR%20status:merged)18:31
clarkbmnaser: thanks, I'll try to review them18:32
*** amoralej|off has joined #opendev18:54
*** timburke_ is now known as timburke18:58
*** amoralej|off has quit IRC19:11
*** hashar has quit IRC19:39
*** amoralej|off has joined #opendev19:50
*** dviroel is now known as dviroel|brb20:23
*** amoralej|off has quit IRC20:53
*** dviroel|brb is now known as dviroel21:09
ianwmnaser: thanks!22:11
ianwone thing is that i think we'll need to update our builders to mount /var/lib/containers22:11
ianwto use the containerfile element22:11
clarkbianw: do we also need to mount in cgroup stuff from sysfs?22:12
clarkbor is that automagic since the host conatiner depends on it?22:12
ianwyeah i've not had to do that in local testing, and we don't do that in the gate test and it works22:13
ianwi should probably qualify that with "for now" :)22:13
mnaserianw, clarkb: i've not had to do any cgroup stuff either and once i mounted /var/lib/containers -- i got a clean build that boots22:13
ianwfungi/clarkb: not sure if you saw https://review.opendev.org/c/opendev/system-config/+/795213 but i went ahead and made a statusbot container and deploy it with that.  it's really just a mechanical patch now deploying the config file22:15
clarkbianw: I haven't yet22:16
ianwif you don't have any objections, i can try moving meetbot to limnoria maybe again this afternoon when it's quiet22:17
clarkbno objections from me. fungi ^ anything to consider when doing that? maybe double check the meeting schedule to make sure the quiet time doesn't coincide with a meeting?22:18
ianwbeforehand i'll sort out syncs of the logs22:18
ianwfrom e.openstack.org -> e.opendev.org22:18
fungiianw: go for it. sorry i'm not much help, internet outage here, i'm limping along on a wireless modem for the moment22:19
ianwno worries :)  i'm surprised i still have power/internet, it's crazy here22:21
ianwhttps://www.abc.net.au/news/2021-06-10/wild-weather-batters-victoria/10020353222:22
fungiyikes22:22
fungistay safe!22:22
ianwsettling down now but last night you could have thought you were on the Pequod at times :)22:22
opendevreviewIan Wienand proposed opendev/system-config master: nodepool-builder: add volume for /var/lib/containers  https://review.opendev.org/c/opendev/system-config/+/79570722:41
ianwmnaser / infra-root: ^ that mirrors what we do in nodepool gate for production22:41
*** tosky has quit IRC22:47
ianw2021-06-09 22:42:25.839331 | LOOP [build-docker-image : Copy sibling source directories]22:50
ianw2021-06-09 22:42:26.548311 | ubuntu-focal | cp: cannot stat 'opendev.org/opendev/meetbot': No such file or directory22:50
ianwi wonder why the ircbot build would fail in the gate like that :/22:50
ianwhttps://zuul.opendev.org/t/openstack/build/1529ed5e1b0242e39eabc2bc3c86b79a/log/job-output.txt#279 is the prepare-workspace from the check job22:55
ianwthe gate job only cloned system-config22:55
clarkbmissing required project?22:55
ianwmaybe but why would it work in check?22:56
ianwhttps://1aa2de7fd6bc4a6a901d-a3c1233a0305e644b60ccc0279f1954b.ssl.cf1.rackcdn.com/793704/24/gate/system-config-upload-image-ircbot/ec9c181/job-output.txt is the failed job22:56
ianwohhh, i guess it's the "upload-image" job ... not "build-image"22:57
ianwthey should be sharing required jobs via yaml tags, but something must have gone wrong22:57
ianwrequired projects i mean22:57
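For context, the sibling copy step in the image build only sees repositories the job declares, so the eventual fix is along these lines for the upload variant as well; this is a sketch of standard zuul job syntax, not the actual system-config change:

    # Sketch: the "Copy sibling source directories" task needs the sibling
    # repo checked out, which requires listing it in required-projects on
    # every variant that builds the image, including the upload/gate job.
    - job:
        name: system-config-upload-image-ircbot
        required-projects:
          - opendev/meetbot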
*** whoami-rajat has joined #opendev22:58
opendevreviewIan Wienand proposed opendev/system-config master: Create ircbot container  https://review.opendev.org/c/opendev/system-config/+/79370423:04
opendevreviewIan Wienand proposed opendev/system-config master: limnoria/meetbot setup on eavesdrop01.opendev.org  https://review.opendev.org/c/opendev/system-config/+/79371923:04
opendevreviewIan Wienand proposed opendev/system-config master: Move meetbot config to eavesdrop01.opendev.org  https://review.opendev.org/c/opendev/system-config/+/79500723:04
opendevreviewIan Wienand proposed opendev/system-config master: Cleanup eavesdrop puppet references  https://review.opendev.org/c/opendev/system-config/+/79501423:04
opendevreviewIan Wienand proposed opendev/system-config master: Run statusbot from eavesdrop01.opendev.org  https://review.opendev.org/c/opendev/system-config/+/79521323:04
ianwhttps://zuul.opendev.org/t/openstack/build/6804e9a6b18d44cd947f03280b9921be failed with POST_FAILURE -- WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED!23:30
clarkbwe've seen that if there is an arp fight for an IP23:31
clarkbusually it will get bad for a day or two and then go away as the cloud notices and cleans those up ?23:32
clarkber :/23:32
ianwrax-iad in that one23:32
