19:01:10 #startmeeting infra
19:01:10 Meeting started Tue Oct 19 19:01:10 2021 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:10 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:10 The meeting name has been set to 'infra'
19:01:17 #link http://lists.opendev.org/pipermail/service-discuss/2021-October/000292.html Our Agenda
19:01:22 #topic Announcements
19:01:37 As mentioned it is the PTG this week. There are a couple of sessions that might interest those listening in
19:01:42 OpenDev session Wednesday October 20, 2021 at 14:00 - 16:00 UTC in https://meetpad.opendev.org/oct2021-ptg-opendev
19:02:08 This is intended to be an opendev office hours. If this time is not good for you, you shouldn't feel obligated to stay up late or get up early. I'll be there and I should have it covered
19:02:35 If you've got opendev related questions, concerns, etc. feel free to add them to the etherpad on that meetpad and join us tomorrow
19:02:42 Zuul session Thursday October 21, 2021 at 14:00 UTC in https://meetpad.opendev.org/zuul-2021-10-21
19:02:58 This was scheduled recently and sounds like a birds of a feather session with a focus on the k8s operator
19:03:26 I'd like to go to this one but I think I may end up getting pulled into the openstack tc session around this time as well since they will be talking CI requirement related stuff (python versions and distros)
19:03:54 Also I've got dentistry wednesday and school stuff thursday so will have a few blocks of time where I'm not super available outside of ptg hours
19:04:36 also worth noting the zuul operator bof does not appear on the official ptg schedule
19:04:52 owing to there not being any available slots for the preferred time
19:05:40 #topic Actions from last meeting
19:05:45 #link http://eavesdrop.openstack.org/meetings/infra/2021/infra.2021-10-12-19.01.txt minutes from last meeting
19:05:50 We didn't record any actions that I could see
19:05:55 #topic Specs
19:06:24 The prometheus spec has landed if anyone is interested in looking at the implementation for that. I'm hoping that once I get past some of the zuul and gerrit stuff I've been doing I'll wind up with time for that myself
19:06:26 we'll see
19:06:31 #link https://review.opendev.org/810990 Mailman 3 spec
19:06:51 I think I've been the only reviewer on the mailman3 spec so far. Would be great to get some other reviewers' input on that too
19:07:00 further input welcome, but i don't think there are any unaddressed comments at this point
19:07:34 Would be good to have this up for approval next week if others think they can get reviews in between now and then
19:07:42 I guess we can make that decision in a week at our next meeting
19:07:59 yeah, hopefully i'll be nearer to being able to focus on it by then
19:08:35 #topic Topics
19:08:41 #topic Improving OpenDev's CD Throughput
19:09:13 ianw: this sort of stalled on the issue in zuul preventing this from merging. Those issues in zuul should be corrected now and if you recheck you should get an error message? Then we can fix the changes and hopefully land them?
19:09:25 ahh, yep, will get back to it!
19:09:30 thanks
19:09:40 #topic Gerrit Account Cleanups
19:10:00 same situation as the last number of weeks. Too many more urgent distractions. I may pull this off the agenda then add it back when I have time to make progress again
19:10:08 #topic Gerrit Project Renames
19:10:28 Wanted to do a followup on the project renames that fungi and I did last Friday. Overall things went well but we have noticed a few things
19:10:58 The first thing is that part of the rename process has us copy the zuul keys in zk for the renamed projects to their new names, then delete the content at the old names.
19:11:18 Unfortunately this left behind just enough zk db content that the daily key backups are complaining about the old names not having content
19:11:26 #link https://review.opendev.org/c/zuul/zuul/+/814504 fix for zuul key exports post rename
19:11:57 That change should fix this for future renames. Cleanup for the existing errors will likely require our intervention. Possibly by rerunning the delete-keys commands with my change landed.
19:12:10 Note that the backups are otherwise successful, the scary error messages are largely noise
19:12:50 Next up we accidentally updated all gitea projects with their current projects.yaml information. We had wanted to only update the subset of projects that are being renamed as this process was very expensive in the past.
19:13:16 Our accidental update to all projects showed that current gitea with the rest api isn't that slow at updating. And we should consider doing full updates anytime projects.yaml changes
19:13:27 #link https://review.opendev.org/c/opendev/system-config/+/814443 Update all gitea projects by default
19:13:42 I think it took something like 3-5 minutes to run the gitea-git-repos role against all giteas.
19:13:49 (we didn't directly time it since it was unexpected)
19:14:08 ++ i feel like i had a change in that area ...
19:14:11 if infra-root can weigh in on that and whether or not there are other concerns beyond just time costs please bring them up
19:14:16 this followup job is also failing https://review.opendev.org/c/opendev/system-config/+/808480 , some unrelated cert issue according to the apache log
19:14:26 separately, this raised the point that ansible doesn't like to rely on yaml's inferred data types, instead explicitly recasting them to str unless told to recast them to some other specific type. this makes it very hard to implement your own trinary field
19:15:11 i wanted a rolevar which could be either a bool or a list, but i think that's basically not possible
19:15:28 oh, the one i'm thinking of is https://review.opendev.org/c/opendev/system-config/+/782887 "gitea: switch to token auth for project creation"
19:15:34 you'd have to figure out ansible's serialization system and implement it in your module
19:15:54 ianw: I think gitea reverted the behavior that made normal auth really slow
19:15:57 pretty much, re-parse post ansible's meddling
19:16:05 enough other people complained about it and its memory cost iirc
19:16:20 frickler: looks like refstack reported a 500 error when adding results. Might need to look in refstack logs?
19:16:44 https://ef5d43af22af7b1c1050-17fc8f83c20e6521d7d8a3ccd8bca531.ssl.cf2.rackcdn.com/808480/1/check/system-config-run-refstack/a4825bb/refstack01.openstack.org/apache2/refstack-ssl-error.log
19:16:49 frickler: https://zuul.opendev.org/t/openstack/build/a4825bbef768406d940fcdd459cb92c6/log/refstack01.openstack.org/containers/docker-refstack-api.log I think that might be a genuine error?
19:17:24 frickler: ya I think the apache error there is a side effect of how we test with faked up LE certs but shouldn't be a breaking issue
19:17:35 ah, o.k.
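Picking up the rolevar discussion from 19:14–19:15 above: one way to apply the "re-parse post ansible's meddling" idea is to move the logic into a custom module, declare the parameter as type 'raw' so AnsibleModule does no casting, and normalize whatever arrives yourself. The sketch below is only an illustration under those assumptions, not the actual system-config module; the `projects` parameter name and its tri-state semantics (True = all, False = none, list = subset) are hypothetical.

```python
#!/usr/bin/python
# Minimal sketch of a custom Ansible module that accepts a tri-state
# parameter: a bool, a list, or the stringified form Ansible templating
# may have produced. Not the actual system-config code.

from ansible.module_utils.basic import AnsibleModule

try:
    import yaml  # PyYAML ships as a dependency of ansible-core
    HAS_YAML = True
except ImportError:
    HAS_YAML = False


def normalize(value):
    """Return True (all), False (none), or a list of project names."""
    if isinstance(value, (bool, list)):
        return value
    if isinstance(value, str) and HAS_YAML:
        # Re-parse strings like "true" or "[foo, bar]" back into real types.
        parsed = yaml.safe_load(value)
        if isinstance(parsed, (bool, list)):
            return parsed
    raise ValueError("expected a bool or a list, got %r" % (value,))


def main():
    module = AnsibleModule(
        argument_spec=dict(
            # type='raw' tells AnsibleModule not to coerce the value to str
            projects=dict(type="raw", default=True),
        ),
        supports_check_mode=True,
    )
    try:
        projects = normalize(module.params["projects"])
    except ValueError as exc:
        module.fail_json(msg=str(exc))
    module.exit_json(changed=False, projects=projects)


if __name__ == "__main__":
    main()
```

In a playbook the caller could then pass `projects: true`, `projects: false`, or an explicit list, and the module would see a usable value either way.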
19:18:03 that apache warning is expected i think? did you confirm it's not present on earlier successful builds?
19:18:18 The last thing I noticed during the rename process is that manage-projects.yaml runs the gitea-git-repos update then the gerrit jeepyb manage-projects. It updates project-config to master only after running gitea-git-repos. For extra safety it should do the project-config update before any other actions.
19:18:37 no, I didn't check. then someone with refstack knowledge needs to debug
19:19:11 One thing that makes this complicated is that the sync-project-config role updates project-config then copies its contents to /opt/project-config on the remote server like review. But we don't want it to do that for gitea. I think we want a simpler update process to run early. Maybe using a flag off of sync-project-config. I'll look at this in the next day or two hopefully
19:19:19 frickler: ++ seems refstack specific
19:19:23 jsonschema.exceptions.SchemaError: [{'type': 'object', 'properties': {'name': {'type': 'string'}, 'uuid': {'type': 'string', 'format': 'uuid_hex'}}}] is not of type 'object', 'boolean'
19:20:54 Anything else to go over on the rename process/results/etc
19:21:18 there was a rename overlap missed
19:21:57 rename overlap?
19:21:59 and the fix then also needed a pyyaml6 fix https://review.opendev.org/c/openstack/project-config/+/814401
19:23:03 thanks for fixing up the missed acl switch
19:23:06 ya when I rebased the ansible role change I missed a fixup
19:23:43 it's difficult to review those since we typically rely on zuul to check things for us. Maybe figure out how to run things locally
19:25:02 also still a lot of config errors
19:25:15 ya, but those are largely on the tenant to correct
19:25:35 Maybe we should send a reminder email to openstack/venus/refstack/et al to update their job configs
19:26:43 Maybe we wait and see another week and if progress isn't made send email to openstack-discuss asking that the related parties update their configs
19:27:01 I've just written a note in my notes file to do that
19:27:03 +1
19:27:39 in theory any changes they push should get errors reported to them by zuul, which ought to serve as a reminder
19:28:30 #topic Improving Zuul Restarts
19:28:48 Last week frickler discovered zuul in a sad state due to a zookeeper connectivity issue that appears to have dropped all the zk watches
19:29:12 To correct that the zuul scheduler was restarted, but frickler noticed we haven't kept our documentation around doing zuul restarts up to date
19:29:21 Led to questions around what to restart and how.
19:29:39 Also when to re-enqueue in the startup process, and the general debugging process
19:30:20 To answer what to restart I think we're currently needing to restart the scheduler and executors together. The mergers don't matter as much but there is a restart-zuul.yaml playbook in system-config that will do everything (which is safe to do if doing the scheduler and executors)
19:30:31 yes, it would be nice to have that written up in a way that makes it easy to follow in an emergency
19:30:34 If you run that playbook it will do everything for you except capture and restore queues.
19:31:27 I think the general process we want to put in the documentation is: 1) save queues 2) run restart-zuul.yaml playbook 3) wait for all tenants to load in zuul (can be checked at https://opendev.org/ and wait for all tenants to show up there) 4) run re-enqueue script (possibly after editing it to remove problematic entries?)
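On the refstack traceback quoted at 19:19:23 above: that SchemaError is what the jsonschema library raises when a schema is supplied as a list rather than an object. Below is a minimal reproduction of that class of failure, not the refstack code itself; the schema fragment is modeled on the one in the log, with the custom 'uuid_hex' format omitted.

```python
import jsonschema
from jsonschema.exceptions import SchemaError

# A schema fragment resembling the one in the traceback.
schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "uuid": {"type": "string"},
    },
}

# Wrapping the schema in a list is invalid: per the metaschema a schema must
# be an object (or a boolean in newer drafts), which produces the
# "is not of type 'object', 'boolean'" message seen in the log.
try:
    jsonschema.validate({"name": "example"}, [schema])
except SchemaError as exc:
    print(exc.message)

# The unwrapped schema validates as expected.
jsonschema.validate({"name": "example"}, schema)
```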
19:32:14 I can work on writing this up in the next day or two
19:32:31 (it's good to timestamp that re-enqueue dump ... says the person who has previously restored an old dump after restarting :)
19:32:34 regarding queues I wondered whether we should drop periodic jobs to save some time/load, since they'll run again soon enough
19:32:40 granted, the need to also restart the executors whenever restarting the scheduler is a very recent discovery related to in progress scale-out-scheduler development efforts, which should no longer be necessary once it gets farther along
19:33:17 fungi: yes though at other times we've similarly had to restart broader sets of stuff to accommodate zuul changes. I think the safest thing is always to do a full restart especially if you are already debugging an issue and want to avoid new ones :)
19:33:29 that's fair
19:33:39 in which case restarting the mergers too is warranted
19:33:51 yup, and the fingergw and web, and the playbook does all of that
19:33:51 in case there are changes in the mergers which also need to be picked up
19:34:43 fungi: and ya not having periodic jobs that just fail is probably a good idea too
19:35:23 As far as debugging goes it is a bit harder to describe a specific process. Generally when I debug zuul I try to find a specific change/merge/ref/tag event that I can focus on as it helps you to narrow the amount of logs that are relevant
19:36:07 what that means is if a queue entry isn't doing what I expect it to I grep the zuul logs for its identifier (change number or sha1 etc) then from that you find logs for that event which have an event id in the logs. Then you can grep the logs for that event id and look for ERROR or traceback etc
19:36:10 i guess we should go about disabling tobiko's periodic jobs for them?
19:36:30 they seem to run al day and end up with retry_limit results
19:36:33 er, all day
19:36:34 fungi: ya or figure out why zuul is always so unhappy with them
19:36:43 when we reenqueue them they go straight to error
19:37:03 then eventually when they actually run they break too. I suspect something is really unhappy with it and we did do a rename involving tobiko or something
19:37:25 Error: Project opendev.org/x/devstack-plugin-tobiko does not have the default branch master
19:37:43 hrm but it does
19:37:56 I wonder if we need to get zuul to reclone it
19:37:58 yeah, those are the insta-error messages
19:38:09 could be broken repo caches i guess
19:38:34 Anyway I'll take this todo to update the process docs for restarting and try to give hints about my debugging process above even though it may not be applicable in all cases
19:38:38 but when they do actually run, they're broken on something else i dug into a while back which turned out to be an indication that they're probably just abandoned
19:38:48 it should give others an idea of where to start when digging into zuul's very verbose logging
19:40:13 #topic Open Discussion
19:40:51 https://review.opendev.org/c/opendev/system-config/+/813675 is the last sort of trailing gerrit upgrade related change. It scares me because it is difficult to know for sure the gerrit group wasn't used anywhere but I've checked it a few times now.
19:41:11 I'm thinking that maybe I can land that on thursday when I should have most of the day to pay attention to it and restart gerrit and make sure all is still happy
19:42:16 i've been focusing on getting rid of debian-stable, to free up space, afs is looking pretty full
19:42:37 debian-stable == stretch?
19:42:58 the horribly misnamed "debian-stable" nodeset
19:43:00 yeah, sorry, stretch from the mirrors, and the "debian-stable" nodeset which is stretch
19:43:01 aha
19:43:24 which is technically oldstable
19:43:31 but then there's a lot of interest in 9-stream, so that's what i'm trying to make room for
19:43:31 oldoldstable
19:43:41 oldoldstable in fact, buster is oldstable now and bullseye is stable
19:43:45 a test sync on that had it coming in at ~200gb
19:43:48 (9-stream)
19:44:13 For 9-stream we (opendev) will build our own images on ext4 as we always do I suppose, but any progress with the image based update builds?
19:44:26 Also this reminds me that fedora boots are still unreliable in most clouds iirc
19:44:50 clarkb: yep, if you could maybe take a look over
19:44:56 #link https://review.opendev.org/q/topic:%22func-testing-bullseye%22+(status:open%20OR%20status:merged)
19:45:21 that a) pulls out a lot of duplication and old fluff from the functional testing
19:45:46 and b) moves our functional testing to bullseye, which has a 5.10 kernel which can read the XFS "bigtime" volumes
19:46:09 the other part of this is updating nodepool-builder to bullseye too; i have that
19:46:33 #link https://review.opendev.org/c/zuul/nodepool/+/806312
19:46:56 this needs a dib release to pick up fixes for running on bullseye though
19:47:06 i was thinking get that testing stack merged, then release at this point
19:47:21 ok I can try to review the func testing work today
19:47:46 it's really a lot of small single steps, but i wanted to call out each test we're dropping/moving etc. explicitly
19:49:10 sounds great
19:49:13 this should also enable 9-stream minimal builds
19:49:40 every time i think they're dead for good, somehow we find a way...
19:50:28 One thing the fedora boot struggles make me wonder is whether we can try to phase out fedora in favor of stream. We essentially do the same thing with ubuntu already since the quicker cadence is hard to keep up with and it seems like you end up fixing issues in a half baked distro constantly
19:51:36 yeah, i agree stream might be the right level of updatey-ness for us
19:52:14 fedora boot issues are about 3-4 down on my todo list ATM :) will get there ...
19:52:32 we continue to struggle with gentoo and opensuse-tumbleweed similarly
19:53:41 probably it's just that we expect to have to rework image builds for a new version of the distro, but for rolling distros instead of getting new broken versions at predetermined intervals we get randomly broken image builds whenever something changes in them
19:54:35 that is part of it, but also at least with ubuntu, and it seems like for stream, there is a lot more care in updating the main releases than the every 6 month releases
19:54:46 because finding bugs is part of the reason they do the every 6 month releases, but that isn't the case for the big releases
19:55:12 yeah, agreed, they do seem to be treated as a bit more... "disposable?"
19:55:59 anyway, it's the same reason i haven't pushed for debian testing or unstable image labels
19:56:18 tracking down spontaneous image build breakage eats a lot of our time
19:57:22 Yup. Also we are just about at time. Last call for anything else. Otherwise I'll end it at the change of the hour
19:57:28 oh that was one concern of mine with gentoo in that dib stack
19:57:50 it seems like it will frequently be in a state of taking 1:30 hours to time out
19:58:06 i still haven't worked out how to fix whatever's preventing us from turning gentoo testing back on for zuul-jobs changes either
19:58:09 which is a little annoying to hold up all the other jobs
19:58:50 though for a while it seemed to be staleness because the gentoo images hadn't been updated in something like 6-12 months
20:00:04 Alright we are at time. Thank you everyone.
20:00:07 #endmeeting
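As an aside on the debugging flow described at 19:35–19:36 above (grep the zuul logs for a change or sha to find its event id, then grep again for that event id and look for errors), here is a rough sketch of that workflow as a script. It is not a tool OpenDev ships; it assumes scheduler log lines carry an "[e: <event id>]" annotation, and the exact log format may differ between Zuul versions.

```python
import re
import sys

# Rough sketch of the "grep for the change, then grep for its event id"
# debugging flow. Adjust EVENT_RE if your Zuul log format differs.
EVENT_RE = re.compile(r"\[e: (?P<event>[0-9a-f]+)\]")


def find_event_ids(log_path, identifier):
    """Collect event ids from lines mentioning a change number or sha."""
    event_ids = set()
    with open(log_path, errors="replace") as log:
        for line in log:
            if identifier in line:
                match = EVENT_RE.search(line)
                if match:
                    event_ids.add(match.group("event"))
    return event_ids


def lines_for_events(log_path, event_ids, patterns=("ERROR", "Traceback")):
    """Yield lines belonging to those events, flagging likely errors."""
    with open(log_path, errors="replace") as log:
        for line in log:
            match = EVENT_RE.search(line)
            if match and match.group("event") in event_ids:
                flagged = any(p in line for p in patterns)
                yield ("!! " if flagged else "   ") + line.rstrip()


if __name__ == "__main__":
    path, change = sys.argv[1], sys.argv[2]
    events = find_event_ids(path, change)
    print("event ids:", ", ".join(sorted(events)) or "(none found)")
    for line in lines_for_events(path, events):
        print(line)
```

Usage would be along the lines of `python3 zuul_event_grep.py /var/log/zuul/debug.log 813675` (the script name and log path are illustrative), with flagged lines marking the ERROR/Traceback entries to read first.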