Wednesday, 2021-08-18

ianwthanks, yeah will need to debug00:00
*** ysandeep|out is now known as ysandeep04:30
*** ykarel|away is now known as ykarel05:02
*** iurygregory_ is now known as iurygregory06:42
*** rpittau|afk is now known as rpittau07:22
*** jpena|off is now known as jpena07:34
*** mgoddard- is now known as mgoddard08:20
*** ysandeep is now known as ysandeep|lunch08:26
opendevreviewchzhang8 proposed openstack/project-config master: bring tricircle under x namespaces  https://review.opendev.org/c/openstack/project-config/+/80496909:00
opendevreviewchzhang8 proposed openstack/project-config master: bring tricircle under x namespaces  https://review.opendev.org/c/openstack/project-config/+/80497009:04
*** ykarel is now known as ykarel|lunch09:04
opendevreviewchzhang8 proposed openstack/project-config master: bring tricircle under x namespaces  https://review.opendev.org/c/openstack/project-config/+/80497209:20
noonedeadpunkfolks, I started seeing weird issues when trying to pull changes from gerrit https://paste.opendev.org/show/808165/09:40
noonedeadpunkreally no idea what's wrong here...09:40
jssfrnoonedeadpunk, it works for me09:44
jssfrthe error sounds as if your filesystem may be corrupt09:44
noonedeadpunkhm... might be...09:45
jssfrdid git or your machine crash recently?09:45
noonedeadpunkwell, X crashed several days ago, but dunno, worth running fsck indeed.09:45
noonedeadpunkthanks anyway for checking that09:45
jssfr(fwiw, I ran `git init foobar && cd foobar && git fetch "https://review.opendev.org/openstack/openstack-ansible" refs/changes/68/804868/1` and that passed without error)09:47
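The ref in the fetch command above follows Gerrit's layout for change refs: `refs/changes/`, then the last two digits of the change number, then the change number, then the patchset. A minimal sketch of building such a ref follows; the function name `change_ref` is made up for illustration and is not part of any Gerrit tooling.

```python
def change_ref(change: int, patchset: int) -> str:
    """Build the Gerrit ref for one patchset of a change."""
    if change < 1 or patchset < 1:
        raise ValueError("change and patchset numbers start at 1")
    # Gerrit shards change refs by the last two digits of the change
    # number, zero-padded, so change 804868 lives under refs/changes/68/.
    shard = change % 100
    return f"refs/changes/{shard:02d}/{change}/{patchset}"


# The change jssfr fetched: change 804868, patchset 1.
print(change_ref(804868, 1))  # refs/changes/68/804868/1
```

The resulting string can be passed as the refspec to `git fetch <repo-url> <ref>`, as in the command quoted above.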
*** ysandeep|lunch is now known as ysandeep09:47
opendevreviewchzhang8 proposed openstack/project-config master: bring tricircle under x namespaces  https://review.opendev.org/c/openstack/project-config/+/80497709:51
*** odyssey4me is now known as Guest472210:37
opendevreviewAnanya proposed opendev/elastic-recheck rdo: Make elastic recheck compatible with rdo elasticsearch  https://review.opendev.org/c/opendev/elastic-recheck/+/80389710:38
opendevreviewAnanya proposed opendev/elastic-recheck rdo: Make elastic recheck compatible with rdo elasticsearch  https://review.opendev.org/c/opendev/elastic-recheck/+/80389710:52
*** ykarel|lunch is now known as ykarel10:58
opendevreviewAnanya proposed opendev/elastic-recheck rdo: Make elastic recheck compatible with rdo elasticsearch  https://review.opendev.org/c/opendev/elastic-recheck/+/80389711:00
*** dviroel|ruck|out is now known as dviroel|ruck11:12
*** jpena is now known as jpena|lunch11:34
*** sshnaidm|pto is now known as sshnaidm12:25
opendevreviewAnanya proposed opendev/elastic-recheck rdo: Make elastic recheck compatible with rdo elasticsearch  https://review.opendev.org/c/opendev/elastic-recheck/+/80389712:29
*** jpena|lunch is now known as jpena12:32
*** ykarel is now known as ykarel|away14:42
clarkbnoonedeadpunk: jssfr I too can clone the repo and fetch that ref.15:15
clarkbnoonedeadpunk: I would run fsck on the repo and on the filesystem. But also you may want to check SMART for your disks and memtest your memory15:15
fungicorvus: on the stale cache for the moved config, it looks like zuul did actually record processing the change-merged event:15:19
fungi2021-07-21 10:38:28,954 DEBUG zuul.Scheduler: [e: 9f2d4a1e191b4ebd86e908bb8c30cbe1] Processing trigger event <GerritTriggerEvent change-merged opendev.org/x/devstack-plugin-tobiko master 801436,4>15:19
fungithat's currently in /var/log/zuul/debug.log.28.gz on the scheduler15:19
fungiso not the same situation as what we found yesterday15:19
clarkbToday was the listed day for the gitea release (if their milestones are accurate in github), but no release yet15:30
clarkbfungi: might be worth making a copy of that file as it will rotate out in a couple of days15:35
clarkbjust in case debugging this takes longer15:35
fungiyep15:35
*** ysandeep is now known as ysandeep|away15:38
*** jpena is now known as jpena|off15:42
*** rpittau is now known as rpittau|afk16:06
clarkbwe restarted to actually use the cache after that point in time16:11
clarkbIs it possible there was some half implemented cache state we were running which would've populated the cache but not cleared it when the merge event occurred?16:11
clarkbgenerally removals of content seem to be working since the removal for the cloud launcher job in our hourly deploy pipeline has been reflected16:12
clarkbthough that wasn't the removal of a file16:13
clarkbfungi: maybe we should try and reproduce with the sandbox repo?16:13
clarkbcreate a new job in a new file. Merge that with pipeline use. Then remove it and see if zuul complains would be a pretty minimal reproducer if this is a consistent issue16:14
clarkbfungi: do you think you can review the stack at https://review.opendev.org/c/opendev/system-config/+/804925 to further improve the hourly run time? also I've just approved https://review.opendev.org/c/opendev/system-config/+/758594 to ensure we don't forget that for the next renames16:19
clarkbiurygregory: out of curiosity what is the location for the ironic midcycle? Will you be using meetpad? If so would be good to hear if you find any oddness between jitsi meet and etherpad since we upgraded etherpad last week (we did some testing and it seems fine)16:23
iurygregoryclarkb, hey! I think we will use the meetpad - https://meetpad.opendev.org/ironic16:24
iurygregoryI haven't checked with Julia since she is on PTO16:24
clarkbiurygregory: cool let us know if you see any weirdness but like I said we expect it will be fine based on tested we did16:25
clarkbs/tested/testing/16:25
iurygregoryclarkb, sure!16:26
fungiif anything, it seems to be working better recently than it did around the time of the last ptg16:28
fungias far as handling of the "shared document" etherpad embedding16:29
clarkbfungi: also let me know if you want me to look at logs or zk db for that cache thing. I'll be bootstrapping a lot of that from scratch as I didn't review that stack but happy to look if the extra eyes will be helpful16:32
clarkbfungi: I half expect that we may need to look at the zk db to see what we have cached then figure out why it didn't go away16:32
fungiclarkb: yeah, i liked your idea of testing a file move in the sandbox repo. i'm working my way back around to this problem and will give that a shot16:34
fungialso good theory on the "maybe we populated the cache via wip cache management which wasn't entirely clean, but were not actually reading from it until the restart yesterday"16:35
corvusclarkb, fungi: i'm around now16:41
clarkbcorvus: sounds like fungi may try to reproduce with the sandbox repo and I threw out a theory that maybe we had a half complete cache implementation that wouldn't properly purge things back on July 21 but does now (at least in simpler deletions of content that seems to work today)16:44
corvuskk.  i'll inspect the zk contents16:45
fungii also didn't follow the cache implementation closely. did it start out by writing a cache but not actually reading from it at restart? and then the restart yesterday was the first one where it read its config state in from the cache?16:45
corvusfungi: it's... complicated.  but we started fully relying on it last week.16:47
fungiokay, so doesn't necessarily explain why the tobiko changes would have just started breaking on stale configuration after the most recent restart16:47
corvusif it becomes important, i can go narrow down when each thing happened and in what order -- not trying to be evasive, just don't have that info handy right now16:48
corvusfungi: i agree, that's the bit of data that doesn't make sense to me.  there should be no difference between today's performance and 2 days ago.16:48
clarkbfungi: did the deletion happen before the most recent restart?16:49
clarkboh ya on the 21st duh16:49
corvuslet's start an etherpad for notes: https://etherpad.opendev.org/p/34eHDRUw0OH3IXn3grT416:49
fungithanks16:52
fungii'll copy some examples in there16:52
clarkbdo we know if tobiko was active between july 21 and ~now?16:52
clarkbthat could explain not noticing the issue if it was present all along16:53
fungiyes, there's a change brought to our attention in #openstack-infra where it was working yesterday16:53
fungiand then a minor edit to the change this morning couldn't be tested16:53
fungii'll get that in there16:54
clarkbok we can rule that out then16:54
corvuswe've had a lot of restarts since july 21; even if we assume that zuul worked correctly when it got the event, but is now using old cached data, it seems surprising that it worked yesterday.16:55
clarkbin the new startup process is there detection of stale cache and if so does merging happen again? Is it possible that something caused zuul to think an older repo state was current which would invalidate the cache and then cause it to update with the old version in the cache?17:00
clarkblike maybe a merger failed to get the HEAD of the remote repo so it treated its local HEAD as current?17:00
corvusclarkb: there is no detection of a stale cache on startup; we're operating under the assumption for now that merge events don't happen when zuul isn't watching17:03
corvus(if they do, press the reset button)17:04
fungii've added the details from the two observed symptoms to the pad17:04
fungifor the first symptom, we have a rough ~16 hour time window where it seems to have started, and the latest scheduler restart falls in that window17:05
fungifor the second symptom, i don't think we have a history for config-errors so hard to know if it was complaining before the restart17:05
fungiwith some additional research in open changes for tobiko and/or zuul logs we could probably narrow the window17:06
opendevreviewMerged opendev/system-config master: Add additional post project rename reindexing  https://review.opendev.org/c/opendev/system-config/+/75859417:20
*** dviroel|ruck is now known as dviroel|ruck|out19:07
Clark[m]I'm starting to page in some Gerrit 3.3 upgrade stuff. Does anyone understand what is meant by step 2 of the downgrade process at https://www.gerritcodereview.com/3.3.html#downgrade19:14
Clark[m]Do we need to hash an object whose content is 183 and then update the ref to that value? I guess they don't use a proper dag on that ref allowing us to revert a commit?19:14
Clark[m]Other than some confusion over that process I think this upgrade is straightforward. From a user noticeability standpoint the only comments toggle seems to go away so we need to see what the new behavior is from that. We also need to decide if we want to enable attention sets19:16
fungiyes, looks like it's saying to update the refs/meta/version to point at a hash of 18319:16
fungibut again, that's a manual step when downgrading19:16
fungii guess that's their equivalent of a schema serial19:17
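For reference, the "hash of 183" discussed above would be a Git blob ID, which Git computes as the SHA-1 of a `blob <length>\0` header followed by the content. The sketch below only computes that ID; it does not write the object or touch `refs/meta/version`. Whether the stored content is `183` with or without a trailing newline is an assumption to check against the Gerrit downgrade notes, so both candidates are printed. The helper name `git_blob_sha1` is hypothetical.

```python
import hashlib


def git_blob_sha1(content: bytes) -> str:
    """Return the object ID Git would assign to a blob with this content."""
    # Git hashes a header ("blob", a space, the byte length, a NUL byte)
    # followed by the raw content.
    header = b"blob %d\x00" % len(content)
    return hashlib.sha1(header + content).hexdigest()


# Assumption: the version blob holds "183", possibly with a newline.
for candidate in (b"183", b"183\n"):
    print(repr(candidate), git_blob_sha1(candidate))
```

On a real site the object would still need to be written to the repository (for example with `git hash-object -w --stdin`) before `git update-ref` can point `refs/meta/version` at it.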
fungiClark[m]: questions on two of the topic:hourly-run-optimizations changes19:17
Clark[m]corvus: ^ totally not urgent but I think if we do enable attention sets that gertty may want to set the state on comment responses more intentionally so that the attention is toggled properly19:17
Clark[m]fungi: thanks will take a look19:17
corvusClark: ack thx19:18
fungiClark[m]: have a (link to a) summary of "attention sets?"19:19
clarkbfungi: http://gerrit-documentation.storage.googleapis.com/Documentation/3.3.5/user-attention-set.html19:19
clarkbI'm thinking I may try to put together a change that tests a 3.2 to 3.3 upgrade on our test jobs. Then hold that and test the revert process on that test setup19:20
clarkbIf that isn't a terrible process we might consider doing this upgrade during a period of time we'd otherwise avoid since there is a revert19:21
clarkbbut I'm still looking at the change list. I note that hashar upgraded wikimedia to 3.3 recently and they updated their jgit.conf to include a setting we already set.19:21
clarkbhashar: ^ you may have other input on upgrading to 3.3?19:21
fungiat first i thought maybe attention sets could alleviate the need some folks saw in adding the reviewers plugin, but on reading that it looks like it might increase interest in adding it19:21
clarkbfungi: responded to your review comments19:24
fungiyep, saw the notifications, thanks19:24
clarkbfungi: ya I think the reviewers plugin may be complementary to attention sets. I suspect that some people may find attention sets to be annoying but I personally think they may be worth trying after interacting with them with my upstream gerrit changes19:25
fungiright, i foresee some considering the reviewers plugin as a way to make attention sets less annoying19:25
fungior at least more useful19:26
opendevreviewMerged opendev/system-config master: Run infra-prod-service-zuul-preview daily instaed of hourly  https://review.opendev.org/c/opendev/system-config/+/80492519:33
opendevreviewMerged opendev/system-config master: Run remote-puppet-else daily instead of hourly  https://review.opendev.org/c/opendev/system-config/+/80492619:37
opendevreviewMerged opendev/system-config master: Stop requiring puppet things for afs, eavesdrop, and nodepool  https://review.opendev.org/c/opendev/system-config/+/80492719:37
clarkbI think we also need to put all of the bot accounts in Non-Interactive Users (which becomes Service Users in 3.3) to prevent them from confusing attention set19:50
fungiwhich also means making sure we don't delegate any special privileges to that group19:54
clarkbyup19:55
clarkbhttps://etherpad.opendev.org/p/gerrit-3.3-upgrade-prep I'm going to start putting notes in there19:55
mordredin theory the idea of attention sets makes me happy. like - Important Changes was an attempt to answer "what should I be looking at"20:28
mordredso I'll be interested to see how it is20:28
*** timburke_ is now known as timburke21:00
clarkbfungi there is a netflix documentary called Fantastic Fungi21:03
clarkbit comes with a health disclaimer.21:03
corvusif you've ever been drinking with fungi, you'll know he should come with a health disclaimer too21:28
Unit193fungi: I was able to get in contact with the pastebinit dev, he's at least read what I said and ACK'd it.  Hopefully we'll see changes soon™.  Last time I backported pastebinit to buster, if it's fixed I'll likely do it again for bullseye.21:45
corvusi'd like to restart zuul to pick up a bugfix22:33
clarkbthe release queues look empty22:33
corvusi'm assuming that tobiko will be stable after the full-reconfig, since they merged a valid config change22:35
corvusi'm running the docker pull now; it's doing work, so will be a minute22:36
clarkbcorvus: is it still going?22:51
corvusdone; sorry task switched momentarily22:51
corvusrestarting now22:52
clarkbno problem, wanted to make sure there wasn't an issue with the images22:52
corvusthe light's a little orange here today22:54
clarkbdoes the air remind you of a campfire?22:55
corvusnot yet -- so far i think it's all upper atmosphere22:55
corvusapparently that may start to change soon22:56
corvusit's sort of weird seeing the tenants come on-line one-by-one now22:57
corvusit's up; i'm going to run full-reconfigure now23:02
corvuscat jobs are being dispatched23:03
corvusi think i have spotted a case where the fix isn't completely thorough -- i don't think it's wrong, but it may be only 99% complete -- i'm going to look into that real quick23:13
clarkbcorvus: we reenqueue after the full reconfigure?23:13
corvusclarkb: that's my plan; i felt that would produce fewer errors23:14
clarkbmakes sense23:14
fungiclarkb: yep, i've witnessed that documentary, however i did neither participate in its creation nor suggest its fantastic name23:19
fungiUnit193: thanks! ianw and i are debating in https://review.opendev.org/804539 over the most reasonable compromise to still have a redirect but not break pastebinit users23:20
fungicorvus: thanks for the fix and restart. in theory the x/devstack-plugin-tobiko entries in openstack tenant config-errors should disappear23:21
corvusokay, good news and bad news!23:23
corvusgood news 1: the files we wanted to be deleted from the cache have been!23:23
corvusbad news 1: the extra debug line i added has indicated that we have a latent bug in that code, which could have caused a cache corruption problem once the cache has more than one user:23:24
corvus2021-08-18 23:20:05,346 DEBUG zuul.TenantParser: Removing file from cache for project gerrit.googlesource.com/zuul/jobs @master: zuul.d/devstack-tobiko.yaml23:24
corvusthe project name is wrong there -- it's not used in the cache cleanup, so i'm confident that part is correct.  but it is used to lock the cache, so the wrong part of the cache is being locked23:25
clarkbthat is an interesting mixup23:26
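To illustrate why locking under the wrong project name matters, here is a small sketch using in-process `threading` locks keyed by project name. This is not Zuul's implementation; the names `lock_for` and `other_writer_can_enter` are invented for the example, and only the two project names are taken from the log lines above.

```python
import threading
from collections import defaultdict

# Hypothetical stand-in for per-project cache locks: one lock per name.
_locks = defaultdict(threading.Lock)


def lock_for(project: str) -> threading.Lock:
    """Return the lock guarding the cache entries of one project."""
    return _locks[project]


def other_writer_can_enter(held_key: str, target_key: str) -> bool:
    """Hold the lock for held_key, then report whether a second writer
    could still take the lock for target_key."""
    with lock_for(held_key):
        target = lock_for(target_key)
        acquired = target.acquire(blocking=False)
        if acquired:
            target.release()
        return acquired


right = "opendev.org/x/devstack-plugin-tobiko"
wrong = "gerrit.googlesource.com/zuul/jobs"

# Locking under the wrong name leaves the real project unprotected.
print(other_writer_can_enter(wrong, right))  # True
# Locking under the right name excludes the second writer.
print(other_writer_can_enter(right, right))  # False
```

With a single cache user the mistake is harmless, which matches the observation that it would only have caused corruption once the cache had more than one user.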
corvusbad news 2: due to the way the merger returns files, the fix is incomplete and doesn't cover the case where someone deletes a zuul.yaml file (ie, zuul.yaml -> .zuul.d/foo.yaml).23:26
corvusessentially, the merger always returns specifically requested file paths, whether they exist or not23:26
corvusi'm hoping the value is None or similar; will check on that in a bit23:27
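If the merger does report a requested but missing path with a value of `None` (an assumption; the discussion above only hopes so), the cache update could treat that value as a deletion. A minimal sketch under that assumption, with a hypothetical function `apply_merger_files` and plain dicts standing in for the real cache:

```python
from typing import Dict, Optional


def apply_merger_files(
    cache: Dict[str, str],
    returned: Dict[str, Optional[str]],
) -> None:
    """Update the cached files for one project branch in place.

    Assumption: a requested path that no longer exists comes back
    with the value None rather than being omitted.
    """
    for path, content in returned.items():
        if content is None:
            # The file was deleted upstream; drop any stale cached copy.
            cache.pop(path, None)
        else:
            cache[path] = content


# Example: zuul.yaml was replaced by .zuul.d/foo.yaml.
cache = {"zuul.yaml": "old config"}
apply_merger_files(cache, {"zuul.yaml": None, ".zuul.d/foo.yaml": "new config"})
print(cache)  # {'.zuul.d/foo.yaml': 'new config'}
```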
corvuswe're still waiting on cat jobs23:27
Unit193fungi: FWIW, I passed him https://paste.opendev.org/show/bFVbXF44VrHbyS1Fxpd023:27
clarkbcorvus: hrm seems like cat jobs took about 12 minutes to complete before we switched to the cache?23:28
clarkbmaybe that was different than a full reconfigure though23:28
fungiUnit193: thanks! that will eventually help as the old versions age out, but we also want to make sure we accommodate users of old versions in various distros for however many years that takes to happen23:29
Unit193Indeed.23:29
ianwsorry yes i will get back to that in a minute, just got my head deep in a treeview23:32
clarkbI see jobs have started and queue lengths went to zero23:33
clarkbI think that means the full reconfigure is done?23:33
corvusyes, re-enqueueing23:33
corvusit looks like there were tobiko config errors but they are gone now23:34
fungithat's a good sign at least23:34
corvusso i think the fix worked (modulo the above caveat)23:34
corvus#status log restarted all of zuul on 598db8a78ba8fef9a29c35b9f86c9a62cf144f0c to correct tobiko config error23:43
opendevstatuscorvus: finished logging23:43
clarkb           23:54
clarkboops23:55
