19:01:10 #startmeeting infra
19:01:10 Meeting started Tue Oct 19 19:01:10 2021 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:10 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:10 The meeting name has been set to 'infra'
19:01:17 #link http://lists.opendev.org/pipermail/service-discuss/2021-October/000292.html Our Agenda
19:01:22 #topic Announcements
19:01:37 As mentioned it is the PTG this week. There are a couple of sessions that might interest those listening in
19:01:42 OpenDev session Wednesday October 20, 2021 at 14:00 - 16:00 UTC in https://meetpad.opendev.org/oct2021-ptg-opendev
19:02:08 This is intended to be an opendev office hours. If this time is not good for you, you shouldn't feel obligated to stay up late or get up early. I'll be there and I should have it covered
19:02:35 If you've got opendev related questions, concerns, etc. feel free to add them to the etherpad on that meetpad and join us tomorrow
19:02:42 Zuul session Thursday October 21, 2021 at 14:00 UTC in https://meetpad.opendev.org/zuul-2021-10-21
19:02:58 This was scheduled recently and sounds like a birds of a feather session with a focus on the k8s operator
19:03:26 I'd like to go to this one but I think I may end up getting pulled into the openstack tc session around this time as well since they will be talking CI requirement related stuff (python versions and distros)
19:03:54 Also I've got dentistry wednesday and school stuff thursday so will have a few blocks of time where I'm not super available outside of ptg hours
19:04:36 also worth noting the zuul operator bof does not appear on the official ptg schedule
19:04:52 owing to there not being any available slots for the preferred time
19:05:40 #topic Actions from last meeting
19:05:45 #link http://eavesdrop.openstack.org/meetings/infra/2021/infra.2021-10-12-19.01.txt minutes from last meeting
19:05:50 We didn't record any actions that I could see
19:05:55 #topic Specs
19:06:24 The prometheus spec has landed if anyone is interested in looking at the implementation for that. I'm hoping that once I get past some of the zuul and gerrit stuff I've been doing I'll wind up with time for that myself
19:06:26 we'll see
19:06:31 #link https://review.opendev.org/810990 Mailman 3 spec
19:06:51 I think I've been the only reviewer on the mailman3 spec so far. Would be great to get some other reviewers' input on that too
19:07:00 further input welcome, but i don't think there are any unaddressed comments at this point
19:07:34 Would be good to have this up for approval next week if others think they can get reviews in between now and then
19:07:42 I guess we can make that decision in a week at our next meeting
19:07:59 yeah, hopefully i'll be nearer to being able to focus on it by then
19:08:35 #topic Topics
19:08:41 #topic Improving OpenDev's CD Throughput
19:09:13 ianw: this sort of stalled on the issue in zuul preventing this from merging. Those issues in zuul should be corrected now and if you recheck you should get an error message? Then we can fix the changes and hopefully land them?
19:09:25 ahh, yep, will get back to it!
19:09:30 thanks
19:09:40 #topic Gerrit Account Cleanups
19:10:00 same situation as the last number of weeks. Too many more urgent distractions. I may pull this off the agenda then add it back when I have time to make progress again
19:10:08 #topic Gerrit Project Renames
19:10:28 Wanted to do a followup on the project renames that fungi and I did last Friday. Overall things went well but we have noticed a few things
19:10:58 The first thing is that part of the rename process has us copy the zuul keys in zk for the renamed projects to their new names, then delete the content at the old names.
19:11:18 Unfortunately this left behind just enough zk db content that the daily key backups are complaining about the old names not having content
19:11:26 #link https://review.opendev.org/c/zuul/zuul/+/814504 fix for zuul key exports post rename
19:11:57 That change should fix this for future renames. Cleanup for the existing errors will likely require our intervention. Possibly by rerunning the delete-keys commands with my change landed.
19:12:10 Note that the backups are otherwise successful, the scary error messages are largely noise
19:12:50 Next up we accidentally updated all gitea projects with their current projects.yaml information. We had wanted to only update the subset of projects that are being renamed as this process was very expensive in the past.
19:13:16 Our accidental update to all projects showed that current gitea with the rest api isn't that slow at updating. And we should consider doing full updates anytime projects.yaml changes
19:13:27 #link https://review.opendev.org/c/opendev/system-config/+/814443 Update all gitea projects by default
19:13:42 I think it took something like 3-5 minutes to run the gitea-git-repos role against all giteas.
19:13:49 (we didn't directly time it since it was unexpected)
19:14:08 ++ i feel like i had a change in that area ...
19:14:11 if infra-root can weigh in on that and whether or not there are other concerns beyond just time costs please bring them up
19:14:16 this followup job is also failing https://review.opendev.org/c/opendev/system-config/+/808480 , some unrelated cert issue according to the apache log
19:14:26 separately, this raised the point that ansible doesn't like to rely on yaml's inferred data types, instead explicitly recasting them to str unless told to recast them to some other specific type. this makes it very hard to implement your own trinary field
19:15:11 i wanted a rolevar which could be either a bool or a list, but i think that's basically not possible
19:15:28 oh, the one i'm thinking of is https://review.opendev.org/c/opendev/system-config/+/782887 "gitea: switch to token auth for project creation"
19:15:34 you'd have to figure out ansible's serialization system and implement it in your module
19:15:54 ianw: I think gitea reverted the behavior that made normal auth really slow
19:15:57 pretty much, re-parse post ansible's meddling
19:16:05 enough other people complained about it and its memory cost iirc
19:16:20 frickler: looks like refstack reported a 500 error when adding results. Might need to look in refstack logs?
19:16:44 https://ef5d43af22af7b1c1050-17fc8f83c20e6521d7d8a3ccd8bca531.ssl.cf2.rackcdn.com/808480/1/check/system-config-run-refstack/a4825bb/refstack01.openstack.org/apache2/refstack-ssl-error.log
19:16:49 frickler: https://zuul.opendev.org/t/openstack/build/a4825bbef768406d940fcdd459cb92c6/log/refstack01.openstack.org/containers/docker-refstack-api.log I think that might be a genuine error?
19:17:24 frickler: ya I think the apache error there is a side effect of how we test with faked up LE certs but shouldn't be a breaking issue
19:17:35 ah, o.k.
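Picking up the rolevar discussion from 19:14–19:15 above: one way to apply the "re-parse post ansible's meddling" idea is to move the logic into a custom module, declare the parameter as type 'raw' so AnsibleModule does no casting, and normalize whatever arrives yourself. The sketch below is only an illustration under those assumptions, not the actual system-config module; the `projects` parameter name and its tri-state semantics (True = all, False = none, list = subset) are hypothetical.

```python
#!/usr/bin/python
# Minimal sketch of a custom Ansible module that accepts a tri-state
# parameter: a bool, a list, or the stringified form Ansible templating
# may have produced. Not the actual system-config code.

from ansible.module_utils.basic import AnsibleModule

try:
    import yaml  # PyYAML ships as a dependency of ansible-core
    HAS_YAML = True
except ImportError:
    HAS_YAML = False


def normalize(value):
    """Return True (all), False (none), or a list of project names."""
    if isinstance(value, (bool, list)):
        return value
    if isinstance(value, str) and HAS_YAML:
        # Re-parse strings like "true" or "[foo, bar]" back into real types.
        parsed = yaml.safe_load(value)
        if isinstance(parsed, (bool, list)):
            return parsed
    raise ValueError("expected a bool or a list, got %r" % (value,))


def main():
    module = AnsibleModule(
        argument_spec=dict(
            # type='raw' tells AnsibleModule not to coerce the value to str
            projects=dict(type="raw", default=True),
        ),
        supports_check_mode=True,
    )
    try:
        projects = normalize(module.params["projects"])
    except ValueError as exc:
        module.fail_json(msg=str(exc))
    module.exit_json(changed=False, projects=projects)


if __name__ == "__main__":
    main()
```

In a playbook the caller could then pass `projects: true`, `projects: false`, or an explicit list, and the module would see a usable value either way.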
19:18:03 that apache warning is expected i think? did you confirm it's not present on earlier successful builds?
19:18:18 The last thing I noticed during the rename process is that manage-projects.yaml runs the gitea-git-repos update then the gerrit jeepyb manage-projects. It updates project-config to master only after running gitea-git-repos. For extra safety it should do the project-config update before any other actions.
19:18:37 no, I didn't check. then someone with refstack knowledge needs to debug
19:19:11 One thing that makes this complicated is that the sync-project-config role updates project-config then copies its contents to /opt/project-config on the remote server like review. But we don't want it to do that for gitea. I think we want a simpler update process to run early. Maybe using a flag off of sync-project-config. I'll look at this in the next day or two hopefully
19:19:19 frickler: ++ seems refstack specific
19:19:23 jsonschema.exceptions.SchemaError: [{'type': 'object', 'properties': {'name': {'type': 'string'}, 'uuid': {'type': 'string', 'format': 'uuid_hex'}}}] is not of type 'object', 'boolean'
19:20:54 Anything else to go over on the rename process/results/etc
19:21:18 there was a rename overlap missed
19:21:57 rename overlap?
19:21:59 and the fix then also needed a pyyaml6 fix https://review.opendev.org/c/openstack/project-config/+/814401
19:23:03 thanks for fixing up the missed acl switch
19:23:06 ya when I rebased the ansible role change I missed a fixup
19:23:43 it's difficult to review those since we typically rely on zuul to check things for us. Maybe figure out how to run things locally
19:25:02 also still a lot of config errors
19:25:15 ya, but those are largely on the tenant to correct
19:25:35 Maybe we should send a reminder email to openstack/venus/refstack/et al to update their job configs
19:26:43 Maybe we wait and see another week and if progress isn't made send email to openstack-discuss asking that the related parties update their configs
19:27:01 I've just written a note in my notes file to do that
19:27:03 +1
19:27:39 in theory any changes they push should get errors reported to them by zuul, which ought to serve as a reminder
19:28:30 #topic Improving Zuul Restarts
19:28:48 Last week frickler discovered zuul in a sad state due to a zookeeper connectivity issue that appears to have dropped all the zk watches
19:29:12 To correct that the zuul scheduler was restarted, but frickler noticed we haven't kept our documentation around doing zuul restarts up to date
19:29:21 Led to questions around what to restart and how.
19:29:39 Also when to re-enqueue in the startup process, and the general debugging process
19:30:20 To answer what to restart I think we're currently needing to restart the scheduler and executors together. The mergers don't matter as much but there is a restart-zuul.yaml playbook in system-config that will do everything (which is safe to do if doing the scheduler and executors)
19:30:31 yes, it would be nice to have that written up in a way that makes it easy to follow in an emergency
19:30:34 If you run that playbook it will do everything for you except capture and restore queues.
19:31:27 I think the general process we want to put in the documentation is: 1) save queues 2) run restart-zuul.yaml playbook 3) wait for all tenants to load in zuul (can be checked at https://opendev.org/ and wait for all tenants to show up there) 4) run re-enqueue script (possibly after editing it to remove problematic entries?)
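On the refstack traceback quoted at 19:19:23 above: that SchemaError is what the jsonschema library raises when a schema is supplied as a list rather than an object. Below is a minimal reproduction of that class of failure, not the refstack code itself; the schema fragment is modeled on the one in the log, with the custom 'uuid_hex' format omitted.

```python
import jsonschema
from jsonschema.exceptions import SchemaError

# A schema fragment resembling the one in the traceback.
schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "uuid": {"type": "string"},
    },
}

# Wrapping the schema in a list is invalid: per the metaschema a schema must
# be an object (or a boolean in newer drafts), which produces the
# "is not of type 'object', 'boolean'" message seen in the log.
try:
    jsonschema.validate({"name": "example"}, [schema])
except SchemaError as exc:
    print(exc.message)

# The unwrapped schema validates as expected.
jsonschema.validate({"name": "example"}, schema)
```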
19:32:14 I can work on writing this up in the next day or two
19:32:31 (it's good to timestamp that re-enqueue dump ... says the person who has previously restored an old dump after restarting :)
19:32:34 regarding queues I wondered whether we should drop periodic jobs to save some time/load, since they'll run again soon enough
19:32:40 granted, the need to also restart the executors whenever restarting the scheduler is a very recent discovery related to in progress scale-out-scheduler development efforts, which should no longer be necessary once it gets farther along
19:33:17 fungi: yes though at other times we've similarly had to restart broader sets of stuff to accommodate zuul changes. I think the safest thing is always to do a full restart especially if you are already debugging an issue and want to avoid new ones :)
19:33:29 that's fair
19:33:39 in which case restarting the mergers too is warranted
19:33:51 yup, and the fingergw and web, and the playbook does all of that
19:33:51 in case there are changes in the mergers which also need to be picked up
19:34:43 fungi: and ya not having periodic jobs that just fail is probably a good idea too
19:35:23 As far as debugging goes it is a bit harder to describe a specific process. Generally when I debug zuul I try to find a specific change/merge/ref/tag event that I can focus on as it helps you to narrow the amount of logs that are relevant
19:36:07 what that means is if a queue entry isn't doing what I expect it to I grep the zuul logs for its identifier (change number or sha1 etc) then from that you find logs for that event which have an event id in the logs. Then you can grep the logs for that event id and look for ERROR or traceback etc
19:36:10 i guess we should go about disabling tobiko's periodic jobs for them?
19:36:30 they seem to run al day and end up with retry_limit results
19:36:33 er, all day
19:36:34 fungi: ya or figure out why zuul is always so unhappy with them
19:36:43 when we reenqueue them they go straight to error
19:37:03 then eventually when they actually run they break too. I suspect something is really unhappy with it and we did do a rename involving tobiko or something
19:37:25 Error: Project opendev.org/x/devstack-plugin-tobiko does not have the default branch master
19:37:43 hrm but it does
19:37:56 I wonder if we need to get zuul to reclone it
19:37:58 yeah, those are the insta-error messages
19:38:09 could be broken repo caches i guess
19:38:34 Anyway I'll take this todo to update the process docs for restarting and try to give hints about my debugging process above even though it may not be applicable in all cases
19:38:38 but when they do actually run, they're broken on something else i dug into a while back which turned out to be an indication that they're probably just abandoned
19:38:48 it should give others an idea of where to start when digging into zuul's very verbose logging
19:40:13 #topic Open Discussion
19:40:51 https://review.opendev.org/c/opendev/system-config/+/813675 is the last sort of trailing gerrit upgrade related change. It scares me because it is difficult to know for sure the gerrit group wasn't used anywhere but I've checked it a few times now.
19:41:11 I'm thinking that maybe I can land that on thursday when I should have most of the day to pay attention to it and restart gerrit and make sure all is still happy
19:42:16 i've been focusing on getting rid of debian-stable, to free up space, afs is looking pretty full
19:42:37 debian-stable == stretch?
19:42:58 the horribly misnamed "debian-stable" nodeset
19:43:00 yeah, sorry, stretch from the mirrors, and the "debian-stable" nodeset which is stretch
19:43:01 aha
19:43:24 which is technically oldstable
19:43:31 but then there's a lot of interest in 9-stream, so that's what i'm trying to make room for
19:43:31 oldoldstable
19:43:41 oldoldstable in fact, buster is oldstable now and bullseye is stable
19:43:45 a test sync on that had it coming in at ~200gb
19:43:48 (9-stream)
19:44:13 For 9-stream we (opendev) will build our own images on ext4 as we always do I suppose, but any progress with the image based update builds?
19:44:26 Also this reminds me that fedora boots are still unreliable in most clouds iirc
19:44:50 clarkb: yep, if you could maybe take a look over
19:44:56 #link https://review.opendev.org/q/topic:%22func-testing-bullseye%22+(status:open%20OR%20status:merged)
19:45:21 that a) pulls out a lot of duplication and old fluff from the functional testing
19:45:46 and b) moves our functional testing to bullseye, which has a 5.10 kernel which can read the XFS "bigtime" volumes
19:46:09 the other part of this is updating nodepool-builder to bullseye too; i have that
19:46:33 #link https://review.opendev.org/c/zuul/nodepool/+/806312
19:46:56 this needs a dib release to pick up fixes for running on bullseye though
19:47:06 i was thinking get that testing stack merged, then release at this point
19:47:21 ok I can try to review the func testing work today
19:47:46 it's really a lot of small single steps, but i wanted to call out each test we're dropping/moving etc. explicitly
19:49:10 sounds great
19:49:13 this should also enable 9-stream minimal builds
19:49:40 every time i think they're dead for good, somehow we find a way...
19:50:28 One thing the fedora boot struggles make me wonder is whether we can try to phase out fedora in favor of stream. We essentially do the same thing with ubuntu already since the quicker cadence is hard to keep up with and it seems like you end up fixing issues in a half baked distro constantly
19:51:36 yeah, i agree stream might be the right level of updatey-ness for us
19:52:14 fedora boot issues are about 3-4 down on my todo list ATM :) will get there ...
19:52:32 we continue to struggle with gentoo and opensuse-tumbleweed similarly
19:53:41 probably it's just that we expect to have to rework image builds for a new version of the distro, but for rolling distros instead of getting new broken versions at predetermined intervals we get randomly broken image builds whenever something changes in them
19:54:35 that is part of it, but also at least with ubuntu, and it seems like for stream, there is a lot more care in updating the main releases than the every 6 month releases
19:54:46 because finding bugs is part of the reason they do the every 6 month releases, but that isn't the case for the big releases
19:55:12 yeah, agreed, they do seem to be treated as a bit more... "disposable?"
19:55:59 anyway, it's the same reason i haven't pushed for debian testing or unstable image labels
19:56:18 tracking down spontaneous image build breakage eats a lot of our time
19:57:22 Yup. Also we are just about at time. Last call for anything else. Otherwise I'll end it at the change of the hour
19:57:28 oh that was one concern of mine with gentoo in that dib stack
19:57:50 it seems like it will frequently be in a state of taking 1:30 hours to time out
19:58:06 i still haven't worked out how to fix whatever's preventing us from turning gentoo testing back on for zuul-jobs changes either
19:58:09 which is a little annoying to hold up all the other jobs
19:58:50 though for a while it seemed to be staleness because the gentoo images hadn't been updated in something like 6-12 months
20:00:04 Alright we are at time. Thank you everyone.
20:00:07 #endmeeting
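As an aside on the debugging flow described at 19:35–19:36 above (grep the zuul logs for a change or sha to find its event id, then grep again for that event id and look for errors), here is a rough sketch of that workflow as a script. It is not a tool OpenDev ships; it assumes scheduler log lines carry an "[e: <event id>]" annotation, and the exact log format may differ between Zuul versions.

```python
import re
import sys

# Rough sketch of the "grep for the change, then grep for its event id"
# debugging flow. Adjust EVENT_RE if your Zuul log format differs.
EVENT_RE = re.compile(r"\[e: (?P<event>[0-9a-f]+)\]")


def find_event_ids(log_path, identifier):
    """Collect event ids from lines mentioning a change number or sha."""
    event_ids = set()
    with open(log_path, errors="replace") as log:
        for line in log:
            if identifier in line:
                match = EVENT_RE.search(line)
                if match:
                    event_ids.add(match.group("event"))
    return event_ids


def lines_for_events(log_path, event_ids, patterns=("ERROR", "Traceback")):
    """Yield lines belonging to those events, flagging likely errors."""
    with open(log_path, errors="replace") as log:
        for line in log:
            match = EVENT_RE.search(line)
            if match and match.group("event") in event_ids:
                flagged = any(p in line for p in patterns)
                yield ("!! " if flagged else "   ") + line.rstrip()


if __name__ == "__main__":
    path, change = sys.argv[1], sys.argv[2]
    events = find_event_ids(path, change)
    print("event ids:", ", ".join(sorted(events)) or "(none found)")
    for line in lines_for_events(path, events):
        print(line)
```

Usage would be along the lines of `python3 zuul_event_grep.py /var/log/zuul/debug.log 813675` (the script name and log path are illustrative), with flagged lines marking the ERROR/Traceback entries to read first.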