19:01:09 <clarkb> #startmeeting infra
19:01:09 <opendevmeet> Meeting started Tue Nov  9 19:01:09 2021 UTC and is due to finish in 60 minutes.  The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:09 <opendevmeet> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:09 <opendevmeet> The meeting name has been set to 'infra'
19:01:17 <clarkb> #link http://lists.opendev.org/pipermail/service-discuss/2021-November/000295.html Our Agenda
19:01:45 <clarkb> #topic Announcements
19:02:05 <clarkb> I was hoping I could link to gerrit user summit stuff but I can't find any details on that yet. They must be running into planning issues. I can sympathize with that
19:02:52 <ianw> o/
19:03:01 <clarkb> #topic Actions from last meeting
19:03:08 <clarkb> #link http://eavesdrop.openstack.org/meetings/infra/2021/infra.2021-11-02-19.01.txt minutes from last meeting
19:03:19 <clarkb> We didn't record any new actions and the prior actions are all done or in progress. \o/
19:03:25 <clarkb> #topic Specs
19:03:56 <clarkb> Just a quick note here that I approved fungi's mailman3 spec after a quick respin to address input from frickler. The updates were minor and didn't seem like anything that needed to be completely rereviewed
19:05:13 <clarkb> #topic Topics
19:05:20 <clarkb> #topic Improving OpenDev's CD throughput
19:05:31 <clarkb> this is still on my todo list. All the zuul and container fun has been distracting me :/
19:05:46 <clarkb> ianw: have you had a chance to look at why the child changes are failing CI? (I'm mostly just curious)
19:06:06 <ianw> no sorry, i've managed to be distracted on other things
19:06:14 <clarkb> I think we've all been in that boat recently
19:06:22 <clarkb> #topic Gerrit Account Cleanups
19:06:52 <clarkb> I was mostly going to skip this over except fungi found a story where someone had trouble with @ in their username. fungi has asked them to clarify if they were trying to use an email address as a username or if their username actually has an @ in it
19:07:07 <clarkb> Noting that here for the possibility there is another sort of cleanup we'll need to do in normalizing usernames
19:07:17 <clarkb> Not really actionable at this point but an interesting possibility
19:08:00 <clarkb> #topic Zuul Multi Scheduler Setup
19:08:41 <corvus> there are 2 schedulers
19:08:45 <clarkb> Zuul has made great progress on supporting multiple schedulers (removing the last remaining spof for a zuul install). Our OpenDev zuul is running two schedulers. One on zuul01.o.o and the other on zuul02.o.o. Zuul02 is the "primary"
19:08:55 <clarkb> What makes zuul02 the primary for us is that all web traffic hits it first
19:09:08 <corvus> and gearman
19:09:13 <corvus> and actually there's no web on zuul01
19:09:35 <clarkb> If things go really sideways I think we can stop zuul01's scheduler and restart the scheduler on zuul02 only
19:09:43 <corvus> (cause we don't have a load balancer for it)
19:09:56 <corvus> ++
19:10:00 <fungi> and, if necessary, clear out zk
19:10:10 <corvus> and if things go really badly, clearing the zk state would be a good idea
19:10:16 <clarkb> It is worth noting that we have run into problems but we've been trying to work through them as they show up. corvus has been a great help with that.
19:10:41 <clarkb> So far we've had issues with retried jobs not being handled properly, Nodepool requests getting stuck in a perpetually waiting state, and config errors not serializing properly
19:10:41 <fungi> i'm thrilled that we haven't needed to downgrade again
19:10:49 <corvus> i think we're at the point where the problems that have been cropping up have been minimal enough we can roll forward
19:11:33 <clarkb> There are a few more new issues showing up today that deserve followup after this meeting. Specifically johnsom's designate change error and elodilles' lack of a zuul.tag var on release jobs. There is also a job runtime estimate problem that corvus has a fix up for
19:11:43 <clarkb> corvus: ^ fyi details for both of those other things are in #opendev
19:13:13 <clarkb> Please do report any weird behavior. So far a lot of weird behavior I have noticed has been tracked back to the multi scheduler setup and reporting those things has been very helpful because now they are fixed :)
19:13:42 <clarkb> And ya the recovery at this point can probably be achieved by simply restarting onto one scheduler using zuul02 since the code seems quite stable in a single scheduler setup
19:14:36 <clarkb> Anything else to add to the zuul topic? or questions about the setup?
19:15:49 <clarkb> #topic User management on our systems
19:16:08 <clarkb> Last week I started pulling on a thread and noticed there were some improvements we could make to how we manage our users and uids
19:16:14 <clarkb> thank you to those who helped me make sense of it all
19:16:47 <clarkb> A number of changes have come out of that. Some potentially impactful if we aren't careful. I'd like to ask infra-root take a look over these changes so that we can start landing them when we are confident they are safe and are able to watch them go in
19:16:55 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/816869/ Be explicit about uid/gid ranges
19:17:24 <clarkb> This first change adjusts our configured ranges in adduser.conf and login.defs which different tools refer to when creating users
19:17:38 <clarkb> The rough layout is: 0-999 system, 1000-1999 unallocated, 2000-2999 for infra-root users, 3000-9999 host level users, 10k - 64k container users that need uids on the host as well for bind mounts.
19:17:40 <fungi> #link https://review.opendev.org/816869 Lower UID/GID range max to make way for containers
19:17:43 <fungi> also that one
19:17:48 <clarkb> That's the same one :)
19:18:04 <fungi> er, yep sorry. i guess it had a different title
19:18:20 <clarkb> the idea there is we've put a number of container services on high uids like 10001 but then when we create say the letsencrypt group it gets created as gid 10002
19:19:05 <clarkb> fungi and I are thinking it would be better to not assign specific values to stuff that actually belongs to the system like our users and letsencrypt group and so on, so we cap those at 9999. Then we can be explicit about container uids/gids and ensure those are non overlapping in the >10k space
19:19:35 <fungi> well, our users are already statically assigned uids and gids
19:19:40 <clarkb> yup
19:19:44 <fungi> "our" personal accounts i mean
19:19:59 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/816771 Clean up unused bootstrapping users
19:20:11 <clarkb> Is another user related change. This time to clean up users that I think we don't need.
19:20:36 <clarkb> This is mostly a belts and suspenders spring cleaning move and one that scares me slightly since we might've accidentally used one of those users for something functional but as far as I can tell this isn't the case
19:20:49 <fungi> well, it's also a security concern
19:20:55 <clarkb> right
19:21:05 <clarkb> I think it is important, but one we should take care with and review carefully
19:21:08 <fungi> as on some systems those accounts come with provider-supplied authorized keys
19:21:27 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/816769/ Give gerritbot and matrix-gerritbot a shared user
19:21:51 <clarkb> This is a followon to 816869, and I'd like to have us work our way through giving most of our other containers similar treatment
19:22:04 <clarkb> lodgeit, refstack, hound, some other irc bots, etc
19:22:21 <clarkb> But I figure start slow make sure we've got things laid out the way we want first before we do a bunch of work that needs updating again later
19:22:31 <clarkb> The gerritbots seemed like a good example case
19:23:03 <clarkb> Then when that has been worked through we should also look into updating the uid for our mariadb containers to something other than 999
19:23:20 <clarkb> The mariadb uid was what sparked this whole thing off in the first place and will likely be the last thing to be addressed :)
19:23:48 <clarkb> I think for mariadb we can probably run it as the user for the services it supports in most cases. For review run it as the gerrit user, for etherpad run it as the etherpad user and so on
19:24:02 <clarkb> Since we don't run shared mariadbs between services we can do that safely
19:24:53 <clarkb> If y'all can review with a very critical eye I would appreciate it. I'm happy to do additional testing (we already did some manual testing before settling on 816869)
19:25:29 <fungi> yeah, it *seems* to work as hoped
19:25:44 <fungi> part of the problem is that adduser and useradd rely on entirely different configs
19:26:05 <fungi> and package maintscripts and ansible roles use who knows which one
19:26:26 <clarkb> I think I found some evidence that package scripts do use both. Or rather one package uses one and another package uses the other
19:26:36 <fungi> right
19:26:43 <clarkb> the evidence for this is that one will do gidmax-1 and the other will do gidmin+1
19:26:47 <fungi> so at least having the two of them in sync should help
19:26:49 <clarkb> and we see evidence of both on our systems
19:28:02 <clarkb> Any other questions or concerns to bring up on this topic?
19:28:27 <ianw> thanks for digging into it and explaining, i'll take a look at the changes too
19:30:11 <clarkb> #topic Open Discussion
19:30:40 <clarkb> I wasn't sure which other topics we would want to discuss so decided to trim the agenda down and let this portion of the meeting cover anything else
19:31:01 <clarkb> Gerrit3.4 upgrade stuff and fedora 35 work seem to be progressing, but not sure there is anything to share yet
19:31:13 <clarkb> I think there is a dib change I should review for containerfile stuff that I haven't been able to get to
19:31:35 <clarkb> #link https://review.opendev.org/c/openstack/diskimage-builder/+/817139 dib handle containerfile errors better
19:31:35 <ianw> yeah, that was working but last night centos-9 mirrors were broken
19:31:41 <ianw> speaking of
19:31:44 <ianw> #link https://review.opendev.org/c/opendev/system-config/+/817136
19:32:13 <fungi> i could use some help on fixing bitrot in the storyboard-webclient builds, if anyone has tips for how to update a yarn.lock
19:32:14 <ianw> adds centos 9-stream mirrors, but i don't think we have space.  my plan is to continue to remove debian-stretch
19:32:15 <fungi> #link https://review.opendev.org/814053 [opendev/storyboard-webclient] Bindep cleanup
19:32:57 <clarkb> ianw: just left a comment on the centos-9-stream change
19:33:29 <clarkb> fungi: to update a yarn.lock you remove the lock and then reinstall iirc. That produces a new lock file and if testing succeeds with that you can merge it
19:33:53 <fungi> reinstall what?
19:34:00 <clarkb> reinstall the javascript stuff using yarn
19:34:08 <ianw> i've always just done "yarn upgrade" i think
19:34:17 <clarkb> https://classic.yarnpkg.com/en/docs/cli/install/ I think
19:34:19 <fungi> oh, i can give that a try, thanks
19:34:28 <fungi> oh, also i've had this up for a while, to hopefully make our system-config jobs a little more robust...
19:34:28 <clarkb> ianw: ah that might be the better method then. I guess upgrade ignores the lock and writes a new one?
19:34:30 <fungi> #link https://review.opendev.org/813880 [opendev/system-config] Retry acme.sh cloning
19:34:47 <ianw> but either way, the *real* problem is going to be every javascript library that has maintained the same name but rewritten itself completely (see prior discussions in #opendev yesterday :)
19:35:52 <fungi> yeah, in this case the reason i need to do it is because one of the js packages needs updating to support python3
19:36:08 <fungi> but i assume that will involve updates to a lot of other dependencies
19:36:23 <clarkb> ya you might have to fiddle with the requirements file equivalent to find something that produces a working set
19:36:32 <clarkb> when I've done this for zuul before it is a fun exercise
19:36:42 <fungi> and this for making the rejects in our iptables rules slightly more expressive...
19:36:44 <fungi> #link https://review.opendev.org/810013 [opendev/system-config] Switch IPv4 rejects from host-prohibit to admin
19:38:58 <clarkb> fungi: I've +2'd that one so you can approve it when you are able to watch it. Like the user changes, it has potential for widespread pain if somehow it goes wrong (though again I don't expect any issues)
19:39:25 <fungi> yep
19:39:38 <fungi> i did some spot testing and confirmed it works as expected, at least
19:40:03 <ianw> oh, i approved it in between, but yeah, i'll be around
19:40:08 <clarkb> cool
19:40:17 <clarkb> I'll give this meeting a few more minutes to bring anything else up
19:40:30 <clarkb> But then I need to review some zuul fixes and eat lunch :)
19:40:47 <fungi> i'll be around too, of course
19:40:59 <ianw> #link https://review.opendev.org/c/opendev/system-config/+/816766
19:41:09 <ianw> is a minor one to expose the db in gerrit testing
19:41:24 <ianw> i don't think we were noticing gerrit wasn't actually talking to the db correctly
19:41:26 <fungi> oh, right, i meant to look at that one, thanks
19:41:33 <clarkb> ++ that's a good update to our testing for gerrit
19:42:06 <clarkb> ianw: you might consider toggling the state too since we had the issue with the non unique keys thing in the past that was only hit on the change of an already created row
19:42:47 <clarkb> I approved it as is as this is clearly better than what we had before
19:43:06 <ianw> that's a good idea, can do that, just the same request with a DELETE
19:43:43 <ianw> I guess a loop of PUT DELETE PUT might work
19:44:24 <ianw> i wonder if a template to method: works
19:45:33 <fungi> seems like it should? it's just a string, right?
19:48:06 <clarkb> Sounds like that is it. Thanks everyone!
19:48:09 <clarkb> #endmeeting