Friday, 2021-07-02

opendevreview Ian Wienand proposed opendev/lodgeit master: Add mariadb connector to container  https://review.opendev.org/c/opendev/lodgeit/+/798411 00:33
*** odyssey4me is now known as Guest1231 01:12
opendevreview Ian Wienand proposed opendev/lodgeit master: Add mariadb connector to container  https://review.opendev.org/c/opendev/lodgeit/+/798411 01:16
*** ysandeep|out is now known as ysandeep 01:48
*** ysandeep is now known as ysandeep|afk 02:11
opendevreview Ian Wienand proposed openstack/diskimage-builder master: [wip] test centos8-stream with ro /sys  https://review.opendev.org/c/openstack/diskimage-builder/+/799126 03:22
*** ysandeep|afk is now known as ysandeep 03:59
opendevreview Ian Wienand proposed openstack/diskimage-builder master: [wip] test centos8-stream with ro /sys  https://review.opendev.org/c/openstack/diskimage-builder/+/799126 04:19
*** ykarel|away is now known as ykarel 05:34
kopecmartin ianw: oh, I thought you already did, because I noticed yesterday that the server stopped downloading guidelines and was throwing errors so I merged the interop change - https://review.opendev.org/c/osf/interop/+/796413 06:22
kopecmartin everything is working now which is very weird if the container hasn't been pulled yet 06:23
*** gthiemon1e is now known as gthiemonge 06:32
*** jpena|off is now known as jpena 06:52
*** amoralej|off is now known as amoralej 06:56
*** ysandeep is now known as ysandeep|lunch 08:30
*** ykarel is now known as ykarel|lunch 08:31
*** ysandeep|lunch is now known as ysandeep 09:34
*** ykarel|lunch is now known as ykarel 09:52
ricolin ianw, fungi clarkb found this error in https://nb03.opendev.org/debian-bullseye-arm64-0000029245.log 10:11
ricolin Exit code: 1 10:11
ricolin "/usr/local/lib/python3.7/site-packages/diskimage_builder/lib/disk-image-create: line 145: cannot create temp file for here-document: No space left on device" 10:13
ricolin currently all debian-bullseye-arm64 jobs are queued for days 10:14
*** frenzy_friday is now known as frenzyfriday|afk 11:00
*** ysandeep is now known as ysandeep|afk 11:01
*** dviroel|out is now known as dviroel 11:34
*** jpena is now known as jpena|lunch 11:36
*** bhagyashris_ is now known as bhagyashris|ruck 12:15
*** ysandeep|afk is now known as ysandeep 12:26
*** ysandeep is now known as ysandeep|mtg 12:29
*** jpena|lunch is now known as jpena 12:36
*** ysandeep|mtg is now known as ysandeep 12:38
*** amoralej is now known as amoralej|lunch 12:45
*** ysandeep is now known as ysandeep|mtg 13:00
*** amoralej|lunch is now known as amoralej 13:41
fungi ricolin: thanks for the heads up, i wonder if we're having growroot issues on those specific images 13:46
fungi ricolin: oh! that's in the build log, so we've likely filled up the disk on that builder, i'll check it 13:47
fungi we may need to shut down the builder container on it and clean up the disk 13:47
fungi /dev/mapper/main-main  787G  787G     0 100% /opt 13:47
fungi bingo 13:47
fungi we basically haven't been building any new arm64 images 13:48
fungi ricolin: the backlog may be unrelated, i know we were also waiting on the linaro-us cloud to fix an expired ssl cert, i need to see if it's been replaced yet 13:49
fungi the ssl cert for the api endpoint expired some days ago 13:49
fungi the full disk on nb03 might actually be related to that if it's been struggling and failing to upload new images there 13:50
fungi i've downed the nodepool-builder container on nb03.opendev.org now 13:50
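The full-disk diagnosis above (the df line showing /opt at 100%) could be automated with a small check. This is a minimal sketch only: the function names and the 95% threshold are illustrative assumptions, not part of any OpenDev tooling, and the percentage is computed as used/total, which can differ slightly from df's Use% column.

```python
import shutil


def percent_used(path: str) -> float:
    """Return the percentage of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return 100.0 * usage.used / usage.total


def is_nearly_full(path: str, threshold: float = 95.0) -> bool:
    """True when usage meets or exceeds `threshold` percent (assumed default)."""
    return percent_used(path) >= threshold


if __name__ == "__main__":
    # "/opt" is the mount point from the log; any existing path works.
    print(f"/opt nearly full: {is_nearly_full('/opt')}")
```

Run periodically, a check like this might have flagged the builder before image builds started failing.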
corvus i'd like to restart zuul to see how the zk executor api changes perform 14:04
fungi corvus: seems like a good day for it. also we'll get the zuul vars back in the inventory.yaml file after that 14:15
*** ysandeep|mtg is now known as ysandeep 14:17
corvus ya 14:17
corvus restarting now 14:21
corvus #status log restarted all of zuul on commit cc3ab7ee3512421d7b2a6c78745ca618aa79fc52 (includes zk executor api and zuul vars changes) 14:22
opendevstatus corvus: finished logging 14:22
fungi i let the openstack release team know, they were about to start approving some patches in their meeting 14:28
corvus oh sorry, i thought they were typically idle on friday; i will re-evaluate my assumptions 14:29
corvus it's up again, and jobs are running 14:29
fungi no worries, i told them i would give them a heads up when we were starting, but no harm done 14:29
corvus re-enqueue in progress 14:29
fungi thanks! 14:30
corvus jobs seem to be running, so that's a good sign 14:30
corvus there are significantly more ephemeral nodes in zk 14:32
corvus also significantly smaller data size (probably compression) 14:33
corvus we've added about 2k nodes (for a total of 39k) but dropped from 21.5mb to 14.9mb 14:34
corvus oh, interesting, the data has gone back up and increased; i guess that metric lagged a bit? 14:35
tobiash[m] has the scheduler startup time been impacted (due to mergers via zk)? 14:36
corvus tobiash: it didn't seem significant; let me see if i can get a number 14:36
corvus tobiash: almost exactly average.  our mean of 4 reconfigurations in the last month was 378 seconds (range from 357-403), today's was 375 14:40
tobiash[m] great 14:41
fungi that's great news 14:42
corvus the executors seem to have reached their nominal capacity for builds fairly quickly 14:43
corvus i wonder if we need a stats adjustment for the executors and executor queue though; those graphs appear to have flatlined 14:43
fungi okay, release team meeting has wrapped up and i'm back to looking at nb03 to see what we need to clean up 14:45
tobiash[m] the queued jobs still counts the gearman queue 14:46
fungi i expect the contents of /opt/dib_tmp are all leaked trash at this point 14:46
tobiash[m] as it looks like 14:46
*** ysandeep is now known as ysandeep|dinner 14:46
corvus i think for the executors graph we need to add "unzoned" 14:46
fungi ooh, yep, none of our executors are zoned 14:47
fungi so if it's treed by zone now that would make sense we'd have to adjust the stat we're polling 14:47
tobiash[m] corvus: I wonder why the running jobs graph still works given that the stats seem to still count the gearman queue: https://opendev.org/zuul/zuul/src/branch/master/zuul/scheduler.py#L363 14:49
corvus i'm confused, i think the plain zuul.executors.accepting stat should work; we shouldn't need to switch to unzoned yet 14:50
tobiash[m] the accepting should work 14:51
*** dviroel is now known as dviroel|lunch 14:51
corvus yet it doesn't; and the running should not work, yet it does 14:52
tobiash[m] that's weird 14:52
tobiash[m] the running might be taken from the per executor metric 14:54
tobiash[m] which should not have changed 14:54
tobiash[m] ah I think I got it, the "Executor Queue" graph is taken from the queue metrics from the scheduler which are broken now, so it has flatlined 14:56
tobiash[m] the "Running Builds" graph uses the executor stats and works 14:56
corvus ah yep, that's it 14:57
tobiash[m] which leaves the "Executors" graph to be checked 14:57
tobiash[m] which I think should continue to work 14:57
corvus though, we still have the mystery of why zuul.executors.accepting isn't working but zuul.executors.unzoned.accepting is 14:59
fungi apparently nl02 got caught up in a hypervisor host problem earlier in the week and was rebooted, per a ticket from rackspace 15:01
fungi but looks like it's running okay currently 15:01
tobiash[m] corvus: there is a bug: https://opendev.org/zuul/zuul/src/branch/master/zuul/scheduler.py#L329 15:04
tobiash[m] that's not taking the accepting into account 15:04
corvus aha 15:05
*** ysandeep|dinner is now known as ysandeep 15:10
*** ysandeep is now known as ysandeep|away 15:24
*** ykarel is now known as ykarel|away 15:26
* clarkb is catching up 15:35
clarkb Sounds like things are working other than stats reporting? not bad considering 15:37
clarkb fungi: re nb03 the builders all do that. I suspect it is partially related to us updating the docker container images forcefully. But ianw thought that the issue on the dib side that let that happen had been addressed 15:38
clarkb fungi: one thought I had after the last cleanup was that we could run a simple find in cron to clean those up based on what the current build is (basically find a way to ignore the current build) 15:38
fungi i suppose we could hold a lock on a file in the tempdir and then check that known filename for open handles before removing the containing directory? 15:46
clarkb fungi: ya that should probably do it. I think you can also find the random string in the current build log or in the process tree (I suppose your idea is to look it up from the process tree) 15:47
fungi nah, i mean actually stat the known filename inside each tempdir and then if it has no open file handles we know it's been leaked... but that assumes the process grows or is wrapped in a script with the feature to hold that lock until the process terminates 15:50
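The lock-file idea discussed above can be sketched roughly as follows. This is an illustration under stated assumptions, not how diskimage-builder behaves today: the marker name `.in-use.lock` and all function names are hypothetical, it swaps the "check for open handles" step for an attempt to take a non-blocking `flock`, and it relies on Linux `flock` semantics. Directories without a marker are left alone, on the conservative assumption that they are not managed by this scheme.

```python
import fcntl
import os
import shutil

LOCK_NAME = ".in-use.lock"  # hypothetical marker file name


def hold_lock(tmpdir: str) -> int:
    """Builder side: lock the marker and return the fd.

    The caller keeps the fd open for the life of the build; the kernel
    drops the lock automatically when the process exits or the fd closes.
    """
    path = os.path.join(tmpdir, LOCK_NAME)
    fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o600)
    fcntl.flock(fd, fcntl.LOCK_EX)
    return fd


def is_leaked(tmpdir: str) -> bool:
    """Cleaner side: a marker nobody holds a lock on means a leaked dir."""
    try:
        fd = os.open(os.path.join(tmpdir, LOCK_NAME), os.O_RDWR)
    except FileNotFoundError:
        return False  # no marker: not managed by this scheme, leave it
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        return False  # a live build still holds the lock
    else:
        return True
    finally:
        os.close(fd)  # also releases any lock taken just above


def remove_leaked(parent: str) -> list:
    """Delete leaked subdirectories of `parent`; return their names."""
    removed = []
    for entry in list(os.scandir(parent)):
        if entry.is_dir(follow_symlinks=False) and is_leaked(entry.path):
            shutil.rmtree(entry.path)
            removed.append(entry.name)
    return removed
```

A cron job could then call something like `remove_leaked("/opt/dib_tmp")` without needing to know which build is current, since the running build's directory stays locked.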
clarkb fungi: also I started thinking about the gerrit account cleanup and realized that the last set of data was generated long enough ago that if I disabled accounts today that suddenly started being active again in the last 2 months that would be sadness. I don't expect a large delta but I think I should regenerate all the outputs of our scripts around this (redo the config check in 15:52
clarkb gerrit, feed that into the audit, compare the audit from nowish to a couple of months ago) before retiring accounts 15:52
clarkb I suspect we'll get zero delta and we can proceed without much extra checking beyond that, but if there is a delta it should be small and we can accommodate it 15:53
*** jpena is now known as jpena|off 15:53
*** amoralej is now known as amoralej|off 15:55
fungi yeah, that's a great point 15:58
fungi trying to run du over /opt/dib_tmp on nb03 is taking a very long time to return 16:00
opendevreview Clark Boylan proposed opendev/system-config master: Update gerrit image to v3.2.11  https://review.opendev.org/c/opendev/system-config/+/799225 16:01
clarkb melwitt: fungi: ^ re gerrit update 16:01
*** dviroel|lunch is now known as dviroel 16:16
fungi clarkb: i'm beginning to think du is never going to finish counting the contents of /opt/dib_tmp, is it safe just to empty that while the builder is stopped? 16:25
clarkb fungi: yes all of the data there is temporary. One suggestion though is that you down the builder container, then reboot to clear out any stale mounts that may exist for those entries (hopefully would only be for the running build that dies due to the stop), then cleanup and start the process again 16:26
fungi clarkb: there is nothing mounted currently anyway, at least not according to df/mount commands 16:27
fungi just the normal system mounts and a /run/user mount for my session 16:28
clarkb in that case should be totally fine without a reboot 16:28
fungi okay, wiping out everything inside /opt/dib_tmp in that case 16:29
fungi it's been an hour of deleting and freed ~120GiB so far, but i have a feeling it's still going to be deleting for a while 17:26
JayF I'm pretty reliably getting 400 errors from storyboard trying to submit a new story. error is a red box popup saying "400: POST /api/v1/stories/2009026: Invalid input for field/attribute story. Value: '2009026'. unable to convert to Story" 17:26
clarkb JayF: https://storyboard.openstack.org/#!/story/2009026 I think that is because it was already created 17:28
JayF Got a new browser window and it... oh 17:28
clarkb I suspect you had a non fatal error on the initial creation then subsequent attempts result in that error you posted above 17:28
JayF Well, it worked in a new browser window. Now it's obvious as to why. 17:28
clarkb The timestamps for creation are from about 7 minutes ago 17:29
JayF yeah, it matches 17:29
JayF weird but glad it's all fine, I'll clean up my dupe 17:29
fungi JayF: also it can do that if you try to add two initial tasks in the story creation dialog, known bug 17:40
JayF That is /exactly/ what I did. 17:40
fungi the task creations seem to try to happen in overlapping transactions 17:41
JayF Thanks for closing the loop on it, that matches b/c I got a different error the first time (but didn't recall it) and then got this one every other step 17:41
fungi and the api call to add the second task fails on a lock 17:41
fungi 2.5 hours into cleanup we've deleted 300GiB from /opt/dib_tmp on nb03 18:58
clarkb infra-root I'm going to run a gerrit config consistency check now to get an up to date list of conflicts that I can use to rerun an audit with. Though at this rate I probably won't get to that today as I think the zuul stuff is going to take priority 19:16
clarkb consistency check hasn't changed since we last ran it (good that is expected). Now I need to run the audit to see if user interactions have changed 19:31
opendevreview Goutham Pacha Ravi proposed openstack/project-config master: Add feature branch notifications to openstack-sdks  https://review.opendev.org/c/openstack/project-config/+/799323 19:32
clarkb I have a user audit running now 19:45
fungi /opt/dib_tmp on nb03 is finally empty. 356GiB available now. is there anything else i should clean up before starting the nodepool-builder container there again? 19:46
fungi it has 22 base images plus their variants and checksums in /opt/nodepool_dib which is probably reasonable 19:47
clarkb fungi: you can check if we have leaked images in /opt/nodepool_dib but you can clean those up safely while the process is running 19:47
clarkb fungi: on the x86 builders you occasionally see the intermediate vhd file get lost 19:48
clarkb but we don't build vhds for arm64 so that shouldn't happen. I suspect that that cleanup is fairly complete 19:48
fungi yeah, no vhd files in there 19:48
fungi starting the container again in that case 19:48
*** dviroel is now known as dviroel|brb 19:50
clarkb I'm glad I decided to rerun the audit. There is at least one account that had gone from inactive for the last three years to active (not sure it was on the cleanup list yet, but there was certainly enough churn to make double checking a good idea) 20:13
fungi yep 20:14
clarkb doesn't look like it was on the chopping block (good, means that my methods are not completely terrible) 20:15
clarkb but now I have a pretty good indication I can put the other account related to this user on the chopping block. However I was going to save those for when we got to the ~80 I think we have remaining and reach out to people about it first 20:16
*** dviroel|brb is now known as dviroel 20:34
clarkb fungi: fwiw I'm going through the existing proposals for my own peace of mind. I'm flagging any that seem more dangerous than others and I may ask you to take a look at those and double check them. If we want we can trim them out or if they look safe we can proceed with them. 20:36
clarkb Once I've gotten through this I'll push up files like we already have on review but with newer timestamps 20:37
fungi thanks 20:55
*** dviroel is now known as dviroel|out 21:03
opendevreview James E. Blair proposed zuul/zuul-jobs master: Enable ZooKeeper 4 letter words  https://review.opendev.org/c/zuul/zuul-jobs/+/799334 21:24
opendevreview Merged zuul/zuul-jobs master: Enable ZooKeeper 4 letter words  https://review.opendev.org/c/zuul/zuul-jobs/+/799334 21:45
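For context on the change above: ZooKeeper's "four letter words" are short admin commands such as `mntr` that are sent over a plain TCP connection to the client port, and recent ZooKeeper versions only answer the ones that have been explicitly enabled. Below is a minimal sketch of querying `mntr` and parsing its tab-separated output, which includes counters such as `zk_znode_count`, `zk_ephemerals_count` and `zk_approximate_data_size`. The host, port, and function names are assumptions for illustration; a deployment that only exposes a TLS port would need a different client.

```python
import socket


def four_letter_word(host: str, port: int, word: str = "mntr",
                     timeout: float = 5.0) -> str:
    """Send a four letter word and return the server's full reply as text."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(word.encode("ascii"))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # server closes the connection after replying
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def parse_mntr(text: str) -> dict:
    """Parse `mntr` output (one key<TAB>value pair per line) into a dict.

    Lines without a tab, such as a "not in the whitelist" notice, are skipped.
    """
    stats = {}
    for line in text.splitlines():
        key, sep, value = line.partition("\t")
        if sep:
            stats[key.strip()] = value.strip()
    return stats


if __name__ == "__main__":
    # "localhost" and 2181 are placeholder values, not taken from the log.
    reply = four_letter_word("localhost", 2181)
    print(parse_mntr(reply).get("zk_znode_count"))
```

Numbers like the node counts and data sizes quoted earlier in the day are the kind of values this output exposes.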
clarkb fungi: there are three files in review:~clarkb/gerrit_user_cleanups/notes/ audit-results.yaml.20210702 is the output of the audit results which you can refer to to see what data was used to make decisions. proposed-cleanups.20210702 is the list of accounts that we will retire, then later the email associated with the external id conflicts that will be deleted on the retired accounts. 22:05
clarkb And finally doublecheck.20210702 a subset of those in the previous file which I have identified as riskier because the other side of the conflict was somewhat recently used 22:05
clarkb if you can take a look at those files and doublecheck the double check list I think we're just about ready to retire the accounts identified in the proposed-cleanups.20210702 file 22:06
fungi lookin' 22:06
clarkb I probably won't do that today because the way that script is set up it takes a long time and I have to acknowledge use of my ssh key (though I could temporarily turn that off). But Definitely should be able to run that tuesday 22:06
fungi so 36 high-risk 22:08
clarkb ya and even then I think those are relatively low risk because for each of them it's pretty clear which is used more recently 22:08
clarkb but if we are going to run into problems I suspect it would be with that set. Maybe they are using the second account in some way that is harder to measure for example 22:09
fungi low-high-risk ;) 22:09
fungi huh... i only just noticed that the poetry readme uses oslo.utils as its example of a challenging dep solver problem: https://pypi.org/project/poetry/ 22:55
opendevreview arkady kanevsky proposed opendev/irc-meetings master: Changed Interop WG meeting time for the summer 2 hours earlier.  https://review.opendev.org/c/opendev/irc-meetings/+/799337 23:44
