Wednesday, 2022-05-25

*** dviroel|out is now known as dviroel00:21
*** rlandy is now known as rlandy|out00:31
opendevreviewIan Wienand proposed opendev/glean master: testing: Add ipv6 details to OVH  https://review.opendev.org/c/opendev/glean/+/84322500:53
*** dviroel is now known as dviroel|out01:24
opendevreviewIan Wienand proposed opendev/glean master: Revert "Add option to ignore config drive interfaces info"  https://review.opendev.org/c/opendev/glean/+/84322502:03
*** rcastillo_ is now known as rcastillo02:56
*** ysandeep|out is now known as ysandeep|rover04:35
*** ysandeep|rover is now known as ysandeep|rover|brb05:25
*** ysandeep|rover|brb is now known as ysandeep|rover05:32
opendevreviewIan Wienand proposed opendev/glean master: write_redhat_interfaces: refactor to walk interfaces first  https://review.opendev.org/c/opendev/glean/+/84324106:57
opendevreviewIan Wienand proposed opendev/glean master: write_redhat_interfaces: pass multiple networks to output functions  https://review.opendev.org/c/opendev/glean/+/84324206:57
opendevreviewIan Wienand proposed opendev/glean master: [wip] write out ipv6  https://review.opendev.org/c/opendev/glean/+/84324306:57
ianwclarkb/corvus: ^ i now understand why we just skipped ipv6.  the whole thing is written to more or less assume that one network entry == one ifcfg-* config file.  this fails when two network entries (ipv4 & ipv6) == one ifcfg-* 07:00
ianwi think that maps out a path forward, but it doesn't work yet.  i'll keep at it, but fyi07:00
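
The "walk interfaces first" idea above can be sketched as follows. This is only an illustration, not glean's actual code: the function name and the sample records are hypothetical, and the `link` and `type` keys are assumed to follow the config-drive network metadata layout.

```python
from collections import defaultdict


def group_networks_by_link(networks):
    """Collect every network entry that shares an interface ("link")."""
    grouped = defaultdict(list)
    for net in networks:
        grouped[net["link"]].append(net)
    return dict(grouped)


# Hypothetical metadata: one interface carrying both an IPv4 and an IPv6 network.
networks = [
    {"id": "network0", "link": "eth0", "type": "ipv4", "ip_address": "203.0.113.10"},
    {"id": "network1", "link": "eth0", "type": "ipv6", "ip_address": "2001:db8::10"},
]

by_link = group_networks_by_link(networks)
for link, nets in by_link.items():
    # One ifcfg-<link> file would be rendered from all of that link's networks.
    print(f"ifcfg-{link}: {[n['type'] for n in nets]}")
```

Grouping first means the output step receives a list of networks per interface, so two entries no longer imply two config files.
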
*** jpena|off is now known as jpena07:33
*** ysandeep|rover is now known as ysandeep|rover|lunch07:44
fricklernot sure if related, but I'm seeing retries on ovh like https://zuul.opendev.org/t/openstack/build/3af04b3c4db24a01b88605924e5a2f1c07:51
frickleractually only a one off it seems, so possibly just coincidence07:54
*** ysandeep|rover|lunch is now known as ysandeep|rover08:33
*** rlandy|out is now known as rlandy10:23
*** arxcruz is now known as arxcruz|off10:52
*** dviroel|out is now known as dviroel11:30
fungithe up side is, that's exactly why we put that check in validate-host, so it would be caught as early as possible and retry rather than fail outright11:44
fungithough the traceroute is actually breaking because of dns resolution issues, according to the output from the task11:45
fungi"opendev.org: Temporary failure in name resolution"11:45
fungiboth the v4 and v6 traceroute failed trying to trace to opendev.org because of that11:45
fungiwe didn't collect the log from unbound though, so not sure if it was struggling11:47
frickleryes, I got red-herringed by ian's glean patches earlier and then the console only telling something about "no valid v4/v6 interface found"11:48
fricklerseems there are some more hits, but not too many. https://zuul.opendev.org/t/openstack/build/a5a754408ce2485c9a27f50823a940c5 for example11:49
fricklermight be that ovh is a common factor though11:50
*** rlandy is now known as rlandy|biab12:36
*** ysandeep|rover is now known as ysandeep|rover|afk12:41
*** rlandy|biab is now known as rlandy12:53
*** ysandeep|rover|afk is now known as ysandeep|rover13:31
opendevreviewJeremy Stanley proposed opendev/glean master: write_redhat_interfaces: pass multiple networks to output functions  https://review.opendev.org/c/opendev/glean/+/84324213:32
opendevreviewJeremy Stanley proposed opendev/glean master: [wip] write out ipv6  https://review.opendev.org/c/opendev/glean/+/84324313:32
opendevreviewMerged zuul/zuul-jobs master: ensure-podman: Remove kubic from Ubuntu 18.04 and drop 20.04  https://review.opendev.org/c/zuul/zuul-jobs/+/84309313:48
opendevreviewMerged zuul/zuul-jobs master: buildset registry: run socat in new session  https://review.opendev.org/c/zuul/zuul-jobs/+/84303815:00
*** dviroel is now known as dviroel|lunch15:09
*** rlandy is now known as rlandy|mtg15:25
opendevreviewJulia Kreger proposed openstack/diskimage-builder master: Check and mount boot volume for data extraction with nouuid  https://review.opendev.org/c/openstack/diskimage-builder/+/84329715:37
clarkbThe gerrit 3.6.0 upgrade is going to be more difficult than the more recent ones we've done due to the "copy-approvals" command that needs to be run.15:41
clarkbDetails in a commit message for updated gerrit images shorty15:41
clarkb*shortly15:41
fungiso they're reworking how votes/labels are recorded and stored then?15:43
opendevreviewClark Boylan proposed opendev/system-config master: Update Gerrit images to 3.4.5 and 3.5.2  https://review.opendev.org/c/opendev/system-config/+/84329815:45
clarkbyup15:45
clarkbor at least that is the implication I haven't dug into the specifics yet15:45
clarkbfungi: I think older gerrit would look at old patchsets to see if any votes needed to be forward ported to current patchsets. But starting in 3.6.0 they don't do that so you have to forward port manually which can be slow (their warning)15:45
clarkbBut we should be able to punt on all of that until we're ready to start looking at the 3.6 upgrade15:46
clarkbfor now we just get up to date on our images and be aware of that larger upgrade process to 3.6 when we get there15:46
fungigot it, thanks for flagging15:48
opendevreviewMerged openstack/diskimage-builder master: Make centos reset-bls-entries behave the same as rhel  https://review.opendev.org/c/openstack/diskimage-builder/+/83983016:04
*** marios is now known as marios|out16:08
*** dviroel|lunch is now known as dviroel16:16
clarkbcorvus: do schedulers and web components have a graceful stop mechanism? It looks like no but double checking16:19
opendevreviewClark Boylan proposed opendev/system-config master: Add playbook to gracefully stop and reboot the zuul cluster  https://review.opendev.org/c/opendev/system-config/+/84331716:29
clarkbinfra-root ^ I think I've got most of the details there but left TODOs where I wanted feedback16:29
clarkbIf you can take a look and leave comments on what you think would be appropriate that would be great16:29
*** rlandy|mtg is now known as rlandy16:34
clarkbonce we've gotten those bits cleaned up the next step would be to run it manually in a screen session from bridge. Then if that works we can automate it16:34
corvusclarkb: are you aware of https://review.opendev.org/828176 ?16:41
clarkboh no I had completely missed that16:42
clarkbI can rebase on that16:42
corvusclarkb: i think i like my approach for waiting better -- but maybe we should add a 'down' like you add in yours16:43
clarkbcorvus: my only concern with that approach to waiting is that I'm not sure it will ever timeout? Or if it does can we control the timeout length?16:43
corvus(i like that the wait is a shell one-liner that can be copy/pasted, and is a single task)16:43
clarkblooks like you can set a per task timeout regardless of what the task is so we can do that if we want a timeout16:44
corvusyeah, though i'm ambivalent on that -- i don't think it's been a problem so far16:45
clarkbya I think we can add it later if it becomes a problem.16:45
clarkbAdding the down is probably a good idea though since docker can restart containers in some situations otherwise16:45
clarkbcorvus: do you want me to rebase and add that or do you want to add it?16:46
corvusi can add16:47
clarkbcorvus: and I guess the ps -q thing avoids any races because we'll wait on no containers if they exit immediately?16:48
corvusi believe so16:48
opendevreviewJames E. Blair proposed opendev/system-config master: Add the start of a Zuul rolling restart playbook  https://review.opendev.org/c/opendev/system-config/+/82817616:49
opendevreviewClark Boylan proposed opendev/system-config master: Add playbook to gracefully stop and reboot the zuul cluster  https://review.opendev.org/c/opendev/system-config/+/84331716:51
*** ysandeep|rover is now known as ysandeep|out16:59
corvusclarkb: re your question -- yes, i'd use the ansible uri module against the api17:08
clarkbcorvus: one thing I'm quickly noticing is that since we do lists of dicts the parsing is super clunky in ansible17:12
clarkbvs if it was a dict of dicts17:12
corvusclarkb: well, if we had done dict of dicts, it actually would have had to be dict of lists of dicts -- because we can and do have multiple entries for the same host17:13
corvus(so in our playbook, we should probably wait until there is exactly one entry for a host and it is running)17:14
corvus(by design, zuul will happily allow you to run more than one component on a host; that doesn't happen in our deployment, but we do still see multiple entries when we stop and start depending on how cleanly and recently the component shut down)17:15
clarkbright and ansible's expression on conditions and looping make that super weird. I'm sorting through it but wish ansible had a better way to express waiting on blocks of information17:15
corvusclarkb: we could muddle through doing it in jinja -- or we could make a quick ansible module17:16
clarkbI don't think zuul needs to change. I think ansible needs to change, but the PR to address this has apparently been open for years :/17:16
corvusjust a quick python function that takes the json, a hostname, and a status and returns true when those conditions are met17:16
corvus(why can't we just inline a python function in ansible yaml?)17:17
clarkbya there is json_query which is a level above that which I'm currently trying to use to express this17:17
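
A minimal sketch of the helper corvus describes (takes the JSON, a hostname, and a status). It assumes the components API returns a mapping of component kind to a list of entries, each with `hostname` and `state` keys; the function name and sample data are hypothetical.

```python
def component_ready(components, hostname, state="running"):
    """Return True when exactly one entry exists for hostname and it is in `state`."""
    entries = [
        entry
        for kind_entries in components.values()
        for entry in kind_entries
        if entry.get("hostname") == hostname
    ]
    return len(entries) == 1 and entries[0].get("state") == state


# Hypothetical response: zs01 still has a stale entry left over from before its restart.
components = {
    "scheduler": [
        {"hostname": "zs01", "state": "stopped"},
        {"hostname": "zs01", "state": "running"},
        {"hostname": "zs02", "state": "running"},
    ],
}

print(component_ready(components, "zs01"))  # False: two entries for the host
print(component_ready(components, "zs02"))  # True: exactly one, and running
```

Requiring exactly one entry covers the case mentioned above where a host briefly shows multiple entries after a stop and start.
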
opendevreviewClark Boylan proposed opendev/system-config master: Add playbook to gracefully stop and reboot the zuul cluster  https://review.opendev.org/c/opendev/system-config/+/84331717:27
clarkbcorvus: ^ something like that maybe17:27
opendevreviewClark Boylan proposed opendev/system-config master: Add playbook to gracefully stop and reboot the zuul cluster  https://review.opendev.org/c/opendev/system-config/+/84331717:29
clarkbI should actually test that json_query stuff locally really quickly17:32
corvusclarkb: quick comment on that17:36
opendevreviewJames E. Blair proposed opendev/system-config master: WIP: Add inline_python role  https://review.opendev.org/c/opendev/system-config/+/84332217:40
corvusjust spitballing here; but that ^ seems handy to me.17:40
corvuser, that should say module, not role, but you get the idea17:41
opendevreviewJames E. Blair proposed opendev/system-config master: WIP: Add inline_python module  https://review.opendev.org/c/opendev/system-config/+/84332217:41
clarkbthe only way I can get the jinja to work is to get ansible to emit a warning: "[WARNING]: conditional statements should not include jinja2 templating delimiters such as {{ }} or {% %}". If I don't then it explodes on the ? in the json_query query :/17:49
clarkbanyway I think I have something that works now just need to port it over17:49
opendevreviewClark Boylan proposed opendev/system-config master: Add playbook to gracefully stop and reboot the zuul cluster  https://review.opendev.org/c/opendev/system-config/+/84331717:53
clarkbthat seems to work locally and if I set 'running' to 'notrunning' it does retries17:54
corvusclarkb: i just left a comment suggesting we give the scheduler/web a few hours timeout on startup17:56
clarkbcorvus: sure. Do you think that 15 second delay between queries is too short too?17:57
corvus(i actually have some backlog tasks to look at startup improvements, but for now, it wouldn't surprise me if it took 30m to come online during a rolling restart, so 45m seems too close for a timeout that shouldn't be necessary)17:57
corvusclarkb: probably not necessary to check more than every 30-60 seconds?17:57
clarkbya I guess even if it takes 20 minutes to startup a check every 60 seconds isn't that big of a deal17:58
corvusit's a lightweight method, but it's not cached.  i'd probably settle at 30s?17:59
opendevreviewClark Boylan proposed opendev/system-config master: Add playbook to gracefully stop and reboot the zuul cluster  https://review.opendev.org/c/opendev/system-config/+/84331717:59
clarkboh heh I just pushed at 60 seconds. Updating to 30s17:59
clarkboh I did the wrong component anyway :/18:00
opendevreviewClark Boylan proposed opendev/system-config master: Add playbook to gracefully stop and reboot the zuul cluster  https://review.opendev.org/c/opendev/system-config/+/84331718:01
corvusclarkb: some timeout for web too18:02
corvuser -- same18:02
corvus(it has to do the same init work as scheduler)18:02
clarkbcorvus: I thought about that but we are waiting for the scheduler for up to 3 hours then just need to wait a bit for web afterwards since it should have the same three hour block to do what it needs to?18:03
clarkbif we want to be extra careful I can set it to 3 hours for web though18:03
clarkbI guess it doesn't hurt to be extra careful /me updates18:04
opendevreviewClark Boylan proposed opendev/system-config master: Add playbook to gracefully stop and reboot the zuul cluster  https://review.opendev.org/c/opendev/system-config/+/84331718:05
clarkb(because the scheduler could be really fast for some reason but not web and then web would potentially trip)18:05
corvusoh, yeah, you're right, it's probably not as important as i thought, but still, shouldn't hurt.18:09
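
The timeout and delay figures discussed above (a 3 hour ceiling, one check every 30 seconds) translate into a retry count. The sketch below shows the arithmetic and a generic polling loop; the helper names are hypothetical and this is not the playbook's actual implementation.

```python
import time


def retries_for(timeout_s, delay_s):
    """Number of checks needed to cover timeout_s at one check per delay_s."""
    return -(-timeout_s // delay_s)  # ceiling division


def wait_until(check, retries, delay_s, sleep=time.sleep):
    """Call check() up to `retries` times, sleeping delay_s between attempts."""
    for attempt in range(1, retries + 1):
        if check():
            return attempt
        if attempt < retries:
            sleep(delay_s)
    raise TimeoutError(f"condition not met after {retries} attempts")


print(retries_for(3 * 60 * 60, 30))  # 360
```

Passing `sleep` as a parameter keeps the loop testable without actually waiting.
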
opendevreviewClark Boylan proposed opendev/system-config master: Explicitly install jmespath alongside ansible on bridge  https://review.opendev.org/c/opendev/system-config/+/84333018:12
clarkb^ is something I noticed. The lib is already there so this should work as is but I can't sort out why it is there and this helps make it explicit18:13
opendevreviewJames E. Blair proposed opendev/system-config master: Add inline_python module  https://review.opendev.org/c/opendev/system-config/+/84332218:14
corvusi took the wip off of that.  we can take it or leave it, i don't feel strongly about it.  i at least wanted to explore the idea.18:15
clarkbcorvus: how does the nested exit_json work in that?18:16
corvusclarkb: i believe it literally calls sys.exit, which is why it works nested18:16
corvusso if you call exit_json or fail_json in the script, that happens; if you forget, then there's the fallback in the module itself.  but if you call the nested functions, control never reaches there.18:18
clarkbgotcha18:25
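
The nested-exit behaviour corvus describes can be demonstrated without Ansible. This sketch uses a stand-in class (`FakeModule`, hypothetical) whose `exit_json` calls `sys.exit`, as corvus believes the real one does; it is not the code from the proposed change.

```python
import json
import sys


class FakeModule:
    """Stand-in for an Ansible module object; not the real AnsibleModule."""

    def exit_json(self, **result):
        print(json.dumps(result))
        sys.exit(0)  # raises SystemExit, which unwinds every caller


def run_inline(script, module):
    """Run user-supplied code, then fall back to a default exit if it returns."""
    exec(script, {"module": module})
    # Reached only when the script did not call exit_json itself.
    module.exit_json(changed=False, fallback=True)


def main(script):
    try:
        run_inline(script, FakeModule())
    except SystemExit as exc:
        return exc.code


main("module.exit_json(changed=True)")  # script exits; fallback never runs
main("x = 1")                           # script returns; fallback exit is used
```

Because `sys.exit` raises `SystemExit`, a call made inside the inline script propagates out through `exec` and skips the fallback line.
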
fungiclarkb: is the idea to use that as-is manually working around the bits mentioned in the todo comments, then knock those out in a later change?18:38
clarkbfungi: I'm thinking it would probably be good to run that by hand in a screen in roughly its current state during a quiet time (over a weekend?) then look at the todos more closely if/when we choose to run that automatically18:39
clarkbmost of the todos at this point are related to running it automatically in the background and aren't a big concern for manual runs18:39
fungiyeah, agreed. just checking before i approve it in the current state18:40
clarkbthe other thought I had was hacking it up to only run a set of services at a time18:42
clarkbthen we could be sure the executors are happy before doing the mergers and so on18:42
opendevreviewMerged opendev/system-config master: Add the start of a Zuul rolling restart playbook  https://review.opendev.org/c/opendev/system-config/+/82817618:42
clarkbI guess I should try and make time friday to start that then check in on it over the weekend?18:43
fungior i can start it first thing in my day tomorrow18:52
clarkbdo we think thursday is quiet enough to run something like that? if so I'm game18:52
clarkbI do think that the risk is low until we get to the schedulers18:53
clarkbbefore then worst case is maybe jobs get killed unexpectedly then are restarted18:53
fungiyeah, i'm not too worried. things are also relatively quiet this week, or so it's seemed18:58
opendevreviewMerged opendev/system-config master: Add playbook to gracefully stop and reboot the zuul cluster  https://review.opendev.org/c/opendev/system-config/+/84331719:01
corvusHoliday in many European countries tomorrow and in USA on Monday 19:02
fungieven better19:13
fungii barely manage to keep track of holidays here any more19:13
corvusi'm now in the position of needing to know all of them :)19:14
fungiindeed19:14
opendevreviewJulia Kreger proposed openstack/diskimage-builder master: Check and mount boot volume for data extraction with nouuid  https://review.opendev.org/c/openstack/diskimage-builder/+/84329719:40
clarkbit's a nice day outside. Time to park myself in the yard with some code reviews20:00
corvusnice!  it's finally back to "not hot" here.20:02
*** dviroel is now known as dviroel|afk20:07
*** jpena is now known as jpena|off20:17
clarkbianw: I've reviewed the glean stack. It isn't clear to me if reverting the ignore interfaces change is necessary to simplify the ipv6 addition (I think it would just have your new code iterate over an empty dict which should be a noop?). But we may want to move it to a followup change instead of being at the base so that we can communicate the removal of that flag20:55
*** timburke__ is now known as timburke20:59
BlaisePabon[m]It's been hard for me to find resources to explain zuul and gating to my colleagues. This video is a real pleasure and I think it will get me get the point across. https://www.youtube.com/watch?v=apLHQ4DkIHU 21:46
BlaisePabon[m]I think that the presenter is probably a member of this community in fact (Ian Wienand ?)21:46
BlaisePabon[m]s/get/help/21:46
clarkbBlaisePabon[m]: yup ianw 21:46
*** tosky_ is now known as tosky21:49
*** rlandy is now known as rlandy|bbl21:58
corvusinfra-root: https://review.opendev.org/843034 to bump the zuul tenant default ansible version is ready22:34
ianwBlaisePabon[m]: yes, that was my talk -- happy to answer any questions.  For reasons unknown some of the overlays didn't work great in that video.  if you want the original openoffice presentation files happy to forward on23:04
clarkbI need to start converting my notes into slides for the summit. I only have 10-15 minutes though so the real trick is slimming things down sufficiently23:06
ianwthat is not long!  i just scraped that one in at 45m23:07
fungiyeah, harder to scale talks down than up ;)23:09
ianwclarkb (and fungi) : thanks for the review on the glean bits.  sometimes sleeping on it reveals new ideas but so far i don't really have any other than completely refactoring everything23:09
ianwthe networkd work prometheanfire did is a bit better way of parsing and writing out config file from the metadata23:09
fungii'm not opposed to merging that option removal change first, just think we need to make sure we communicate it ahead of release23:10
ianwyeah it was really mostly that i was quite confused as to why the things i was adding to OVH for testing weren't triggering.  it is no fault of the original patch, we've named all the testing scenarios a bit obscurely 23:12
fungiclarkb: so when i get settled in tomorrow, i'll plan to run `ansible-playbook /home/zuul/src/opendev.org/opendev/system-config/playbooks/zuul_reboot.yaml` in a root screen session on bridge.o.o23:24
fungishould i add -f10 or anything? not sure if that causes problems with the serialization23:25
clarkbfungi: I think you do want -f 20 for the pull. It shouldn't affect the serialization23:25
fungialso i guess i need to disable deployments before that to make sure nothing races?23:25
clarkb-f says "you can use up to this many forks" and serial: 1 says this runs one at a time regardless23:25
fungiokay, cool, so it won't force parallelization overriding what we require inside the playbook23:26
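
The interaction between `-f` and `serial` described above can be modelled in a few lines. This is a simplified illustration of the scheduling rule, not Ansible's implementation; the function name and host names are hypothetical.

```python
def plan_batches(hosts, serial, forks):
    """Split hosts into serial-sized batches; forks only caps workers within a batch."""
    plan = []
    for start in range(0, len(hosts), serial):
        batch = hosts[start:start + serial]
        plan.append((batch, min(forks, len(batch))))
    return plan


hosts = ["ze01", "ze02", "ze03"]

# serial: 1 with -f 20 still handles one host at a time.
print(plan_batches(hosts, serial=1, forks=20))

# A play with no serial limit (modelled here as one batch of all hosts) can use the forks.
print(plan_batches(hosts, serial=len(hosts), forks=20))
```

Under this model a high fork count speeds up plays that have no `serial` limit, such as the image pull, without changing the one-at-a-time restarts.
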
clarkbI think racing is ok since we can watch it and catch things up that might end up behind23:26
clarkbthe race would be pulling newer images halfway through23:26
fungiyeah23:26
clarkband ya you want it in screen because it might take a couple days23:26
clarkbmaybe `time` it so that we can get a sense for the runtime23:26
opendevreviewMerged openstack/project-config master: Set "zuul" tenant default Ansible version to 5  https://review.opendev.org/c/openstack/project-config/+/84303423:27
fungiokay, `time ansible-playbook -f20 /home/zuul/src/opendev.org/opendev/system-config/playbooks/zuul_reboot.yaml 2>&1 | tee zuul_reboot.log`23:27
fungii've got that ready to fire in the screen session23:27
