Tuesday, 2021-10-12

clarkbAnyone else here for our meeting?19:00
clarkbWe will get started momentarily19:00
ianwo/19:00
fungiahoy19:00
clarkb#startmeeting infra19:01
opendevmeetMeeting started Tue Oct 12 19:01:10 2021 UTC and is due to finish in 60 minutes.  The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.19:01
opendevmeetUseful Commands: #action #agreed #help #info #idea #link #topic #startvote.19:01
opendevmeetThe meeting name has been set to 'infra'19:01
clarkb#link http://lists.opendev.org/pipermail/service-discuss/2021-October/000288.html Our Agenda19:01
clarkb#topic Announcements19:01
clarkbI forgot to mention this in the agenda but next week is the PTG19:01
clarkbI requested a short amount of time for ourselves largely as office hours that other projects can jump into to discuss stuff with us19:01
clarkbI plan to be there and should get an etherpad together today. If the times work out for you feel free to join, otherwise I think we'll have it covered19:02
clarkbBut also keep in mind that it is happening next week and we should avoid changes to meetpad/etherpad if possible19:02
clarkb#topic Actions from last meeting19:03
clarkb#link http://eavesdrop.openstack.org/meetings/infra/2021/infra.2021-10-05-19.01.txt minutes from last meeting19:03
clarkbWe had actions last week but they were all related to specs. Let's just jump into specs discussion then :)19:03
clarkb#topic Specs19:03
clarkbFirst up I did manage to update the prometheus spec based on feedback on how to run it. Ended up settling on using the built binary to avoid docker weirdness and old versions in distros19:04
clarkb#link https://review.opendev.org/c/opendev/infra-specs/+/804122 Prometheus Cacti replacement19:04
clarkbcorvus brought up a good concern which is that we should ensure that it can run on old distros and I confirmed it seems to work on xenial19:04
clarkbfungi: I was going to work with you to check it on trusty when you have time19:04
fungiahh, yep19:05
clarkbI didn't want to touch the remaining trusty node without you being around19:05
fungii should have time tomorrow19:05
clarkbCool I'll ping you tomorrow then.19:05
clarkbMy plan is to approve this spec if nothing comes up by end of day Thursday for me19:05
clarkbif you haven't reviewed the spec and would like to, now is the time to do that19:05
clarkband we can note any trusty problems before landing it too19:05
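[Editor's note: a minimal sketch of the "run the upstream-built binary on an old distro" check described above; the release version and paths are illustrative, not the exact commands used.]

```
# fetch an upstream prometheus release and confirm the static binary runs;
# v2.30.3 is an illustrative release current around this meeting
wget https://github.com/prometheus/prometheus/releases/download/v2.30.3/prometheus-2.30.3.linux-amd64.tar.gz
tar xzf prometheus-2.30.3.linux-amd64.tar.gz
./prometheus-2.30.3.linux-amd64/prometheus --version
```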
clarkbNext up is the mailman 3 spec19:05
clarkb#link https://review.opendev.org/810990 Mailman 3 spec19:05
clarkbI've reviewed this and the plan seems straightforward. Essentially spin up a new machine running mm3. Then migrate existing mm2 vhosts into it as users are ready starting with opendev19:06
clarkbIf other infra-root can review this spec that would be much appreciated19:06
clarkbfungi: ^ anything else to add on mailman 3?19:07
funginah, the migration tools are fairly honed from what i understand, so other than new interfaces and probably some new message formatting in places, users shouldn't really be impacted19:07
clarkbthank you for putting that together. I'm excited to be able to use the new frontend19:08
clarkb#topic Topics19:08
clarkb#topic Improving OpenDev CD throughput19:08
fungii guess the actual steps for the cut-over could stand to be drilled down into a bit, deciding how we want to go about making sure deliveries to the list get queued up while we're copying things over19:08
fungibut we can work that out as we get closer19:08
clarkb++19:09
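[Editor's note: a hedged sketch of the standard Mailman 3 import tooling fungi references for migrating an mm2 list; the list address and file paths are illustrative.]

```
# create the list in mm3, import the mm2 list configuration, then import
# the pipermail archives into hyperkitty; names and paths are illustrative
mailman create service-discuss@lists.opendev.org
mailman import21 service-discuss@lists.opendev.org \
    /var/lib/mailman/lists/service-discuss/config.pck
django-admin hyperkitty_import -l service-discuss@lists.opendev.org \
    /var/lib/mailman/archives/private/service-discuss.mbox/service-discuss.mbox
```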
clarkbianw: you mentioned trying to pick this up again. I think the next step is largely in making that first change in the stack mergeable (by running jobs for it somehow?)19:09
clarkbnote that corvus pointed out the change in zuul you thought would fix it likely won't as the playbooks aren't changing for those jobs19:10
ianwyeah i added an update to a readme19:10
clarkbah ok, I should go back and rereview then19:11
ianw(late last night ... and i just realised that file doesn't trigger anything either :)19:11
ianwi'll try something else!  but yep, i did respond to all comments19:11
clarkb#link https://review.opendev.org/c/opendev/system-config/+/807672 is the first change in the sequence19:11
clarkbcool and I'll look at rereviewing things this afternoon19:11
clarkb#topic Gerrit account cleanups19:13
clarkbThis is something that has gone on the back burner with zuul updates, gerrit upgrades, gitea upgrades, openstack releases etc. I haven't forgotten about it and will try and pick it up next week if the PTG is quiet for me (I expect it to be but you never know with an event)19:13
clarkbReally just noting that I still intend on getting to this but it is an easy punt because it doesn't typically immediately affect stuff19:14
clarkb#topic Gerrit Project Renames19:14
clarkbWe announced last week that we would rename gerrit projects Friday October 15 at 18:00 UTC19:14
clarkbAll of our project renaming testing continues to function, so we should largely be mechanically ready for this19:14
clarkbThe one thing that has been noted is that we need to update project metadata in gitea after renames to update descriptions and urls and storyboard links19:15
clarkbI think the easiest way to do that would be to run the gitea project management with the full update flag set after we rename. Either using a subset of projects.yaml or just doing it for everything (which could take hours)19:15
clarkbfungi: ^ you were thinking about this too, did you have a sense for how you wanted to approach it?19:16
fungiit's still not clear to me why we can't update specific projects19:16
fungithough i suppose we do need to perform a full run at least once to catch up19:16
clarkbfungi: the fundamental issue is that the input to renaming and the input to setting project metadata are different. The rename playbook takes that simple yaml file with old and new names. The project metadata takes projects.yaml19:17
clarkbThis is why I think it is simpler to do it as two distinct steps.19:17
fungioh, it doesn't filter projects.yaml by specific entries?19:17
clarkbno projects.yaml is not referenced at all in the rename process19:18
clarkbThen to make things more complicated projects.yaml doesn't get updated until after the rename is done and we merge the associated changes19:18
fungioh, i see, we would normally update projects.yaml after renaming19:18
clarkband since it takes hours to do the force update we don't do those19:18
fungiso would need to run the metadata update after that19:18
clarkber we don't automatically do those19:18
clarkbyup19:19
fungibut we could tell it to only update the projects which had been renamed, once things are up and the projects.yaml update merges, right?19:19
fungirather than telling it to update every project listed in the file19:19
clarkbfungi: the current code does not support that. We could hack it in by running the update against an edited projects.yaml. But even that might be complicated since I think the playbook syncs project config directly19:20
clarkbthis is the bit I was hoping someone would have time to look at19:20
clarkbI think it is ok to do the metadata update as a separate step post rename, but ya we should devise a method of making it less expensive19:20
fungiahh, okay, so we need a way to tell the metadata update script to filter its actions to specific project names, i guess19:20
clarkbya that19:20
fungiseems like that shouldn't be too hard, we could probably have it use the rename file as the filter19:21
fungii'll see if i can figure out what we need to be able to do that bit19:21
clarkbessentially our rename process becomes: run rename playbook, restart things and land project-config updates, ensure zuul is happy, manually run the gitea repo management playbook hopefully not in the most expensive configuration possible19:21
clarkband it is that very last bit that we need someone to look at19:22
clarkbthanks!19:22
fungii guess we could even avoid doing a full sync this time by feeding it the historical rename files too19:22
clarkbfungi: yup, we could go through and pull out all the names that need updating as a subset of the whole19:23
clarkbsince we have those records19:23
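[Editor's note: a hedged sketch of the rename sequence clarkb outlines above, including the proposed filtered metadata refresh; the playbook paths and variable names are illustrative, not the exact system-config names.]

```
# run the rename playbook against the yaml file of old/new project names
ansible-playbook playbooks/rename_repos.yaml \
    -e repolist=/path/to/renames/20211015.yaml
# ...restart services, land the project-config updates, confirm zuul is happy...
# then refresh gitea project metadata, filtered to the renamed projects
# (plus prior rename files) instead of a multi-hour full projects.yaml run
ansible-playbook playbooks/service-gitea.yaml \
    -e gitea_project_filter=/path/to/renames/20211015.yaml
```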
clarkbI'll work on an etherpad process doc as well as reviewing the project-config proposal and writing up the record files for this rename tomorrow19:23
clarkbAnything else to cover on this topic?19:24
fungii don't think so19:25
clarkb#topic Gerrit 3.3 upgrade19:25
clarkbThis went exceptionally well. I'm still looking around wondering what happened and if it is too good to be true :)19:25
clarkbthe 2.13 -> 3.2 upgrade has scarred me19:25
clarkbI've been trying to add hashtag:gerrit-3.3 to changes related to the upgrade and the cleanup afterwards. Feel free to add this hashtag to your changes too19:26
clarkbAt this point the major change remaining is the 3.2 image cleanup change.19:26
fungii'm trying to run down what looks like a new comment-related crash in gertty, probably related to the upgrade to 3.319:27
clarkbDo we have opinions on when we'll feel comfortable dropping the 3.2 images?19:27
ianwdropping the jobs won't purge the images from dockerhub though will it?19:28
clarkbianw: it will not, however the images in docker hub will eventually get aged out and deleted on that end19:28
clarkb(I forget what the timing on that is with dockerhub's new policy)19:28
clarkbhttps://review.opendev.org/c/opendev/system-config/+/813074 is the change. I'm feeling more and more confident in 3.3 since we haven't had any issues yet that make me think revert19:29
fungiiirc, the image ageout has to do with when it last got a download19:29
clarkbmaybe we go ahead and land it and we can always restore it again later if necessary or just use the existing 3.2 tag until that ages out in docker hub19:29
ianwyep; we can always rebuild too, and have local copies.  so i think 813074 is probably gtg19:29
clarkbfungi: ya and our test jobs for 3.2 download that tag19:29
clarkbianw: ok I've removed the WIP19:30
fungii'm fine dropping them at any time, yeah. it's increasingly unlikely we'd try to roll back at this point19:30
fungiwe're nearing the 48-hour mark19:30
clarkbcool please review the change then :)19:30
fungiyeah, i was, but... gertty crashing on one of those changes distracted me ;)19:30
ianwthe attention set seems to be working19:31
clarkbI've also got https://review.opendev.org/c/opendev/system-config/+/813534/ up which is semi related in that the changes to update post upgrade hit a bug where we set the heap limit to 96gb in testing, which doesn't work if the jvm tries to allocate memory on an 8gb instance19:31
clarkbianw: ^ I made a new ps on that fixing an issue I caught when writing the followups we talked about yesterday. The followups are there too. I think the whole stack deserves careful review to ensure we don't accidentally remove anything.19:31
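[Editor's note: gerrit.config uses git-config syntax, so the heap setting discussed here can be inspected and resized with git config; a minimal sketch, with the site path and 8g value illustrative.]

```
# show the current jvm heap limit, then size it to fit the host
# (rather than the 96gb value that broke on an 8gb instance)
git config -f /home/gerrit2/review_site/etc/gerrit.config container.heapLimit
git config -f /home/gerrit2/review_site/etc/gerrit.config container.heapLimit 8g
```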
clarkboh it just occurred to me that the gerrit -> review group rename needs to be checked against the private group vars. Let me WIP that until that is done19:32
clarkbianw: oh neat it is telling me to review the CD Improvement change since you pushed a new ps19:33
clarkbI really think the attention set has the potential for being very powerful, just need to sort out how to make it work for us19:33
ianwyeah i've been careful to unclick people when voting, which isn't something that needs attention19:33
clarkbThe last thing I had on this topic is pointing out that we are already testing the 3.3 -> 3.4 upgrade in CI now :) We should start thinking about scheduling that upgrade next. Probably really look at that post PTG so in 2 weeks?19:34
clarkbianw: using the modify button at the bottom of the comment window thing?19:34
ianwi did add a note on that at19:34
ianw#link https://bugs.chromium.org/p/gerrit/issues/detail?id=1515419:34
ianwstill every bug seems to go into the polygerrit category there :/19:35
ianwi guess maybe this is polygerrit; i don't know who owns it19:35
clarkbianw: their bug tracker is broken and the normal issue type can't be submitted because no one is assigned to receive notifications for them or some such19:36
clarkbso it defaults to polygerrit and you have to hope it gets in front of the right people. But I agree this could be a polygerrit issue19:36
ianwclarkb: yep; i think that's going to be the major issue -- if everyone adds your attention when they +1/+2 your change, your attention list becomes less useful19:36
ianwalso if anyone pops up with dashboard issues see19:37
ianw#link https://groups.google.com/g/repo-discuss/c/565rD1Sjiag19:37
clarkbmight be worth an email to opendev-discuss calling out the modify action and the dashboard stuff19:37
ianwbasically; /#/dashboard/... doesn't work, /dashboard/... does.  unclear if this is a bug or feature19:37
ianwsure i can draft something 19:38
clarkbthanks19:38
fungiservice-discuss?19:38
clarkbfungi: yup sorry. Every other -discuss is name-discuss19:39
fungicool, just making sure you didn't mean some other-discuss19:39
fungi(like openstack-discuss)19:39
clarkbThanks again to everyone who helped make this upgrade happen. I think we're in a really good place as far as gerrit goes. We can upgrade with minimal impact and in some cases downgrade. Much of the things we do with gerrit like project creation, project renaming, etc are tested. We even have upgrade testing19:40
clarkbOh and the new server etc19:40
clarkbWe've come a long way since we were on 2.13 a year ago19:40
clarkb#topic Open Discussion19:42
clarkbThat was it for the agenda.19:42
clarkbAnything else?19:42
ianw#link https://review.opendev.org/c/opendev/system-config/+/81262219:42
ianwfungi: ^ maybe you could double check i didn't fat finger anything; that's from the lock issues i think we discussed last week19:43
clarkboh for some reason I thought that had landed and noticed we still had the error that it should fix19:43
clarkbbut I guess it hasn't landed and ++ to reviewing it and getting it in to fix those conflicts19:43
fungichecking19:43
ianwborg verify lock issues i mean19:43
fungiand yeah, i too saw the cron message and thought we had already fixed it, so good to know!19:44
clarkbalso gitea01 started failing backups again which makes me wonder about networking in vexxhost again :)19:44
ianwthe recent -devel job failure had me thinking about bridge upgrades too19:44
ianwat one time it seemed to be under a lot of pressure for its small size, but i don't recall any issues recently19:45
clarkbOne tricky thing I remembered about bridge updates is we have ssh rules around bridge connectivity iirc. We may need to spin up a new bridge, then update all the things to talk to it, then swap over?19:45
clarkbianw: the effort to parallelize the CD stuff could have us wanting a bigger server again19:45
clarkbianw: might be a good idea to get that work done first, monitor resource needs and size appropriately for a new server?19:45
ianwi thought it might be a good time to start thinking about using zuul-secrets in more places19:46
clarkbianw: to avoid needing the bastion?19:47
corvuswe could easily >4x the load on bridge before it's a problem.19:47
clarkbya the main issue that seems to affect bridge performance is having leaked ansible ssh connections pile up, which causes system load to grow19:49
clarkbwhen that isn't happening it doesn't need a ton of resources19:49
ianwclarkb: yep; it would be a different way of working but I think it moves us even more towards a "gitops" model19:50
clarkbya I think my biggest struggle there is still around manually running stuff. It seems super useful to be able to do that when things pop up and zuul doesn't have a great answer to that (yet?)19:51
corvusi think the cd story gets a lot better if we can remove the ansible module restrictions for untrusted jobs on the executor (but that's a v5+ idea)19:52
fungiand the "leaked" (really indefinitely hung) ansible ssh processes seem to crop up when we have pathological server failures where ssh authentication begins but the login never completes19:52
fungirebooting whatever server is stuck generally clears it up19:52
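[Editor's note: a minimal sketch for spotting the hung ansible ssh sessions fungi describes; the one-hour cutoff is an arbitrary illustrative threshold.]

```
# list ssh processes older than an hour that look like ansible's
# ControlPersist connections; review these before rebooting the stuck server
ps -eo pid,etimes,args | awk '$2 > 3600 && /ssh/ && /ControlPersist/'
```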
clarkbif we want to hold off a bit on updating bridge with the idea that zuul improves before we need to upgrade I'd be ok with that. But we should probably write down a concrete set of things we can go to zuul about making better for that. The modules thing is another good one19:53
fricklerbefore we end, I also want to mention the issue with suds-jurko, which is mostly masked in CI because we have an old wheel built for it19:54
ianwclarkb: yep, i agree, but i imagine we could have some sort of "escape-hatch" where we do something like replicate the key and have a way to manually run ansible that provides the decrypted secrets19:54
clarkbbut also "zuul please run this job now" bypass as sysadmins would also be nice alternative to the escape hatch bridge gives us19:54
clarkbianw: ya that could work too19:54
fungiworth noting, we'll be unable to upgrade to the upcoming ansible release until we're on focal, or else we need to use a nondefault python3 with it on bridge19:54
clarkbfrickler: oh ya thats a good call out19:54
fricklerso maybe we want to expire wheels after some time or have some other way to check whether they still can be built19:54
ianwfrickler: ahh, i have a spec for that i think :)19:55
corvusi'm unaware of the suds-jurko issue, is there a summary?19:55
clarkbwe built a wheel for suds-jurko some time ago with older setuptools and have that in our wheel mirror. But suds-jurko doesn't build with current setuptools. This means running openstacky things outside of our CI system is problematic as that package doesn't install19:55
clarkbcorvus: ^19:55
ianw#link https://review.opendev.org/#/c/703916/19:55
fricklerthe wheel was built with setuptools < 58, with 58 it fails19:55
corvusthx19:56
fungia sort of toctou problem, i guess19:56
clarkband ya doing a fresh wipe of the wheel mirrors periodically would be a good way to expose that stuff in CI19:56
ianwthat doesn't exactly cover this scenario, but related.  basically that we are an append-only copy that grows indefinitely19:56
frickleralso https://bugs.launchpad.net/cinder/+bug/194634019:56
fricklerwe first noticed it with fedora, because we don't seem to have a wheel for py3919:56
clarkbwe probably don't need daily rebuilds but weekly or monthly might be a good approach. And solve the indefinite growth problem too19:56
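[Editor's note: a hedged sketch of the periodic "can this wheel still be built from source" check suggested above; the package pin and output directory are illustrative.]

```
# rebuild from sdist, bypassing any cached wheels; a failure here (as with
# suds-jurko under setuptools>=58, which dropped use_2to3) flags a wheel
# that exists in the mirror but can no longer be rebuilt
pip wheel --no-binary :all: --no-deps -w /tmp/wheel-check suds-jurko==0.6
```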
clarkbseparately suds-jurko has been unmaintained for like 7 years...19:57
clarkband should be replaced too19:58
fungione of half a dozen (at least) dead dependencies which setuptools 58 turned up in various projects i know about19:58
frickleryes, in that specific case, there's suds-community as a replacement19:58
fricklerhttps://review.opendev.org/c/openstack/requirements/+/813302 lists what fails in u-c19:58
fricklerwith a job that doesn't use the pre-built wheels19:59
clarkbWe are at time. Feel free to continue conversation in #opendev or on the mailing list.20:00
clarkbThank you everyone for listening and participating. We'll probably be around next week since I don't expect a ton of direct PTG involvement. But if that changes I'll try to send an email about it20:00
clarkb#endmeeting20:00
opendevmeetMeeting ended Tue Oct 12 20:00:49 2021 UTC.  Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4)20:00
opendevmeetMinutes:        https://meetings.opendev.org/meetings/infra/2021/infra.2021-10-12-19.01.html20:00
opendevmeetMinutes (text): https://meetings.opendev.org/meetings/infra/2021/infra.2021-10-12-19.01.txt20:00
opendevmeetLog:            https://meetings.opendev.org/meetings/infra/2021/infra.2021-10-12-19.01.log.html20:00
fungithanks clarkb!20:01

Generated by irclog2html.py 2.17.2 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!