19:01:02 <clarkb> #startmeeting infra
19:01:03 <opendevmeet> Meeting started Tue Nov 30 19:01:02 2021 UTC and is due to finish in 60 minutes.  The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:03 <opendevmeet> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:03 <opendevmeet> The meeting name has been set to 'infra'
19:01:08 <clarkb> #link http://lists.opendev.org/pipermail/service-discuss/2021-November/000303.html Our Agenda
19:01:14 <clarkb> We have an agenda.
19:01:19 <clarkb> #topic Announcements
19:01:37 <clarkb> Gerrit User Summit is happening Thursday and Friday this week from 8am-11am pacific time virtually
19:02:00 <clarkb> If you are interested in joining, registration is free. I think they will have recordings too if you prefer to catch up out of band
19:02:12 <fungi> also there was a new git-review release last week
19:02:18 <clarkb> I intend to join as there is a talk on gerrit updates that I think will be useful for us to hear
19:02:55 <clarkb> yup please update your git-review installation to help ensure it is working properly. I've already updated since my local git version changed, which forced me to upgrade
19:03:01 <clarkb> I haven't had any issues with new git review yet
19:03:12 <fungi> git-review 2.2.0
19:03:37 <fungi> i sort of rushed it through because an increasing number of people were upgrading to newer git, which the previous release was broken with
19:03:54 <clarkb> the delta to the previous release was small too so probably the right move
19:04:04 <fungi> but yeah, follow up on the service-discuss ml or in #opendev if you run into anything unexpected with it
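For anyone following along, upgrading is a one-liner; a minimal example, assuming a pip-based installation rather than a distro package:

    pip install --upgrade git-review   # picks up git-review 2.2.0
    git review --version               # confirm the upgrade took effect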
19:05:02 <clarkb> #topic Actions from last meeting
19:05:11 <clarkb> #link http://eavesdrop.openstack.org/meetings/infra/2021/infra.2021-11-16-19.01.txt minutes from last meeting
19:05:16 <clarkb> I don't see any recorded actions
19:05:27 <clarkb> We'll dive right into the fun stuff then
19:05:30 <clarkb> #topic Topics
19:05:40 <clarkb> #topic Improving CD Throughput
19:06:59 <clarkb> sorry small network hiccup
19:07:12 <clarkb> A number of changes have landed to make this better while keeping our serialized one job after another setup
19:07:39 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/807808 Update system-config once per buildset.
19:07:45 <clarkb> #link https://review.opendev.org/c/opendev/base-jobs/+/818297/ Reduce actions needed to be taken in base-jobs.
19:08:07 <ianw> yep, those are the last two
19:08:09 <clarkb> These are the last two updates to keep status quo but prepare for parallel ops
19:08:31 <clarkb> Once those go in we can start thinking about adding/updating semaphores to allow jobs to run in parallel. Very exciting. Thank you ianw for pushing this along
19:08:48 <ianw> yep i'll get to that change soon and we can discuss
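As a rough sketch of where the parallel work is heading, Zuul semaphores let a group of jobs run concurrently up to a fixed limit; the names and limit below are hypothetical, not the values system-config will actually use:

    # Hypothetical Zuul semaphore capping how many deploy jobs run at once
    - semaphore:
        name: infra-prod-deploy
        max: 2

    # Jobs that attach the semaphore can then run in parallel up to that cap
    - job:
        name: infra-prod-service-example
        semaphore: infra-prod-deploy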
19:09:33 <clarkb> #topic Zuul multi scheduler setup
19:09:51 <clarkb> Just a note that a number of bug fixes have landed to zuul since we last restarted
19:10:15 <clarkb> I expect that we'll be doing a restart at some point soon to check everything is happy before zuul cuts a new release
19:10:39 <clarkb> I'm not sure if that will require a full restart and clearing of the zk state, corvus would know. Basically it is possible that this won't be a graceful restart
19:10:44 <fungi> after our next restart, it would probably be helpful to comb the scheduler/web logs for any new exceptions getting raised
19:10:45 <clarkb> s/graceful/no downtime/
19:10:53 <corvus> yes we do need a full clear/restart
19:11:05 <clarkb> corvus: thank you for confirming
19:11:20 <fungi> i saw you indicated similar in matrix as well
19:11:33 <fungi> (for future 4.11 anyway)
19:11:36 <clarkb> and ya generally be on the lookout for odd behaviors, our input has been really helpful to the development process here and we should keep providing that feedback
19:11:37 <corvus> i'd like to do that soon, but maybe after a few more changes land
19:12:32 <corvus> we should probably talk about multi web
19:12:46 <corvus> it is, amusingly, now our spof :)
19:13:05 <clarkb> corvus: are we thinking run a zuul-web on zuul01 as well then dns round robin?
19:13:11 <corvus> (amusing since it hasn't ever actually been a spof except that opendev only ever needed to run 1)
19:13:24 <corvus> that's an option, or a LB
19:13:29 <clarkb> if we add an haproxy that might work better for outages and balancing but it would still be a spof for us
19:13:36 <corvus> we might want to think about the LB so we can have more frequent restarts without outages
19:14:09 <clarkb> I guess the idea is haproxy will need to restart less often than zuul-web and in many cases haproxy is able to keep connections open until they complete
19:14:12 <fungi> dns round-robin is only useful for (coarse) load distribution, not failover
19:14:41 <frickler> do we have octavia available? is that in vexxhost?
19:14:41 <corvus> i figure if it's good enough for gitea it's good enough for zuul; we know that we'll want to restart zuul-web frequently, and there's a pretty long window when a zuul-web is not fully initialized, so a lb setup could make a big difference.
19:15:23 <clarkb> frickler: I think it is available in vexxhost, but we don't host these services in vexxhost currently so that would add a large (~40ms?) rtt between the lb frontend and backend
19:15:46 <clarkb> corvus: good point re gitea
19:15:49 <fungi> on the other hand, if we need to take the lb down for an extended period, which is far less often, we can change dns to point directly to a single zuul-web while we work on the lb
19:16:07 <ianw> it's a bit old now, but https://review.opendev.org/c/opendev/system-config/+/677903 does the work to make haproxy a bit more generic for situations such as this
19:16:16 <fungi> or just build a new lb and switch dns to it, then tear down the old one
19:16:38 <ianw> (haproxy roles, not haproxy itself)
19:17:09 <clarkb> ianw: oh ya we'll want something like that if we go the haproxy route and don't aaS it
19:17:14 <corvus> ianw: is that for making a second haproxy server, or for using an existing one for more services?
19:17:35 <corvus> (i think it's option #1 from the commit msg)
19:17:40 <clarkb> corvus: I read the commit message as #1 as well
19:17:44 <ianw> corvus: iirc that was when we were considering a second haproxy server
19:17:58 <fungi> yeah, make it easier for us to reuse the system configuration, not the individual load balancer instances
19:18:53 <corvus> that approach seems good to me (but i don't feel strongly; if there's an aas we'd like to use that should be fine too)
19:18:54 <fungi> so that we don't end up with multiple almost identical copies of the same files in system-config for different load balancers
19:19:11 <clarkb> corvus: I think I have a slight preference for using our existing tooling for consistency
19:19:57 <clarkb> and separately if someone wants to investigate octavia we can do that and switch wholesale later (I'd be most concerned about using it across geographically distributed systems with disparate front and back ends)
19:20:35 <fungi> though for that we'd probably be better off with some form of dns-based global load balancing
19:20:48 <fungi> granted it can be a bit hard on the nameservers
19:21:23 <fungi> (availability checks driving additions and removals to a dedicated dns record/zone)
19:21:47 <fungi> requires very short ttls, which some caching resolvers don't play nicely with
19:22:38 <corvus> ok, i +2d ianw's change; seems like we can base a zuul-lb role on that
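A zuul-lb service modeled on the gitea load balancer might look roughly like the following haproxy snippet, doing TCP-mode passthrough with health checks; the backend hostnames and ports are assumptions for illustration only:

    # Sketch of haproxy balancing two zuul-web backends (hypothetical hosts)
    frontend zuul-web-https
        bind :::443 v4v6
        mode tcp
        default_backend zuul-web

    backend zuul-web
        mode tcp
        option tcp-check
        server zuul01 zuul01.opendev.org:443 check
        server zuul02 zuul02.opendev.org:443 check

With two backends behind the frontend, one zuul-web can be restarted (or sit in its long initialization window) while the other keeps serving requests.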
19:22:58 <clarkb> sounds good, anything else zuul related to go over?
19:23:15 <corvus> i'll put that on my list, but it's #2 on my opendev task list, so if someone wants to grab it first feel free :)
19:24:02 <corvus> (and that's all from me)
19:24:05 <clarkb> #topic User management on our systems
19:24:21 <clarkb> The update to irc gerritbot here went really well. The update to matrix-gerritbot did not.
19:24:45 <clarkb> It turns out that matrix-gerritbot needs a cache dir in $HOME/.cache to store its dhall intermediate artifacts
19:25:31 <clarkb> and that didn't play nicely with the idea of running the container as a different user as it couldn't write to $HOME/.cache. I had thought I had bind mounted everything it needed and that it was all read only but that wasn't the case. To make things a bit worse the dhall error log messages couldn't be written because the image lacked a utf8 locale and the error messages had
19:25:33 <clarkb> utf8 characters
19:25:54 <clarkb> tristanC has updated the matrix-gerritbot image to address these things so we can try again this week. I need to catch back up on that.
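For reference, the workaround amounts to giving the container a writable cache location while still running as the non-root user; the paths, UID, and image name below are hypothetical, not the actual deployment values:

    # Hypothetical docker run showing a writable .cache bind mount for the
    # dhall artifacts while running the container as a dedicated user
    docker run --rm \
      --user 11000:11000 \
      -e HOME=/var/lib/gerritbot \
      -v /var/lib/matrix-gerritbot/cache:/var/lib/gerritbot/.cache:rw \
      -v /etc/matrix-gerritbot/config.dhall:/config.dhall:ro \
      docker.io/example/matrix-gerritbot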
19:26:22 <clarkb> One thing I wanted to ask about is whether or not we'd like to build our own matrix-gerritbot images using docker instead of nix so that we can have a bit more fully featured image as well as understand the process
19:26:33 <clarkb> I found the nix stuff to be quite obtuse myself and basically punted on it as a result
19:27:51 <clarkb> (the image is really interesting: it sets a bash prompt but no bash is installed, there is no /tmp (I tried to override $HOME to /tmp to fix the issue and that didn't work), etc)
19:28:55 <clarkb> I don't need an answer to that in this meeting but wanted to call it out. Let me know if you think that is a good or terrible idea once you have had a chance to ponder it
19:28:59 <fungi> i agree, it's nice to have images which can be minimally troubleshot at least
19:29:01 <ianw> it wouldn't quite fit our usual python-builder base images, though, either?
19:29:19 <clarkb> ianw: correct, it would be doing very similar things but with haskell and cabal instead of python and pip
19:29:41 <clarkb> ianw: we'd do a build in a throwaway image/layer and then copy the resulting binary into a more minimal haskell image
19:29:47 <clarkb> s/haskell/ghc/ I guess
19:30:15 <clarkb> https://hub.docker.com/_/haskell is the image we'd probably use
19:30:35 <clarkb> I don't think we would need to maintain the base images, we could just FROM that image a couple of times and copy the resulting binary over
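A minimal sketch of that multi-stage approach, assuming the bot builds with cabal; the image tags, executable target, and paths are illustrative rather than taken from the real repo:

    # Build stage: compile the bot in a throwaway GHC/cabal layer
    FROM docker.io/library/haskell:9.0 AS builder
    WORKDIR /src
    COPY . /src
    RUN cabal update && \
        cabal install exe:matrix-gerritbot --install-method=copy --installdir=/out

    # Runtime stage: a smaller image with a UTF-8 locale so error messages
    # can actually be written, plus the compiled binary copied over
    FROM docker.io/library/debian:bullseye-slim
    RUN apt-get update \
        && apt-get install -y --no-install-recommends ca-certificates locales \
        && sed -i 's/^# en_US.UTF-8/en_US.UTF-8/' /etc/locale.gen \
        && locale-gen \
        && rm -rf /var/lib/apt/lists/*
    ENV LANG=en_US.UTF-8
    COPY --from=builder /out/matrix-gerritbot /usr/local/bin/matrix-gerritbot
    ENTRYPOINT ["/usr/local/bin/matrix-gerritbot"]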
19:31:28 <clarkb> We can move on. I wanted to call this out and get people thinking about it so that we can make a decision later. It isn't urgent to decide now as it isn't an operational issue at the moment
19:31:39 <clarkb> #topic UbuntuOne two factor auth
19:31:47 <clarkb> #link http://lists.opendev.org/pipermail/service-discuss/2021-November/000298.html Using 2fa with ubuntu one
19:31:58 <fungi> at the beginning of last week i started that ml thread
19:32:15 <fungi> i wanted to bring it up again today since i know a lot of people were afk last week
19:32:31 <fungi> so far there have been no objections to proceeding, and two new volunteers to test
19:32:46 <clarkb> I have no objections, if users are comfortable with the warning in the group description I think we should enroll those who are interested
19:32:50 <fungi> even though we haven't really made a call for volunteers yet
19:32:58 <ianw> (i think i approved one already, sorry, after not reading the email)
19:33:14 <fungi> no harm done ;)
19:33:26 <clarkb> ya was hrw, I think hrw was aware of the concerns after working at canonical previously
19:33:31 <clarkb> an excellent volunteer :)
19:33:36 <fungi> i just didn't want to go approving more volunteers or asking for volunteers until we seemed to have some consensus that we're ready
19:33:55 <clarkb> I think so, it's been about a year, I have yet to have a problem in that time
19:34:29 <fungi> i'll give people until this time tomorrow to follow up on the ml as well before i more generally declare that we're seeking volunteers to help try it out
19:34:57 <clarkb> sounds like a plan, thanks
19:35:26 <frickler> I guess I can't be admin for that group without being member?
19:35:33 <fungi> frickler: correct
19:35:35 <clarkb> frickler: I think that is correct due to how lp works
19:36:10 <fungi> i'm also happy to add more admins for the group
19:36:21 <frickler> o.k., not a blocker I'd think, but I'm not going to join at least for now
19:36:33 <clarkb> One thing we might need to clarify with canonical/lp/ubuntu is what happens if someone is removed from the group
19:36:38 <clarkb> and until then don't remove anyone?
19:37:00 <fungi> i'll make sure to mention that in the follow-up
19:37:19 <fungi> maybe hrw knows, even
19:37:35 <ianw> it does seem like from what it says it's a one-way ticket, i was treating it as such
19:37:42 <ianw> but good to confirm
19:37:53 <clarkb> ianw: yup, that is why I asked because if we add more admins they need to be aware of that and not remove people potentially
19:38:16 <clarkb> it may also be the case that the enrollment happens on the backend once and then never changes regardless of group membership
19:38:41 <clarkb> We have a couple more topics so lets continue on
19:38:44 <clarkb> #topic Adding a lists.openinfra.dev mailman site
19:38:49 <clarkb> #link https://review.opendev.org/818826 add lists.openinfra.dev
19:39:07 <clarkb> fungi: I guess you've decided it is safe to add the new site based on current resource usage on lists.o.o?
19:39:32 <clarkb> One thing I'll note is that I don't think we've added a new site since we converted to ansible. Just be on the lookout for anything odd due to that. We do test site creation in the test jobs though
19:39:35 <fungi> yeah, i've been monitoring the memory usage there and it's actually under less pressure after the ubuntu/python/mailman upgrade
19:40:17 <clarkb> you'll also need to update DNS over in the dns as a service but that is out of band and it's safe to land this before that happens
19:40:17 <fungi> for some summary background, as part of the renaming of the openstack foundation to the open infrastructure foundation, there's a desire to move the foundation-specific mailing lists off the openstack.org domain
19:40:57 <fungi> i'm planning to duplicate the list configs and subscribers, but leave the old archives in place
19:41:32 <clarkb> fungi: is there any concern for impact on the mm3 upgrade from this? I guess it is just another site to migrate but we'll be doing a bunch of those either way
19:41:35 <fungi> and forward from the old list addresses to the new ones of course
19:42:14 <fungi> yeah, one of the reasons i wanted to knock this out was to reduce the amount of list configuration churn we need to deal with shortly after a move to mm3 when we're still not completely familiar with it
19:42:43 <clarkb> makes sense. I think you've got the reviews you need so approve when ready I guess :)
19:42:46 <fungi> so the more changes we can make before we migrate, the more breathing room we'll have after to finish coming up to speed
19:42:56 <clarkb> Anything else on this topic?
19:43:17 <fungi> nope, thanks. i mainly wanted to make sure everyone was aware this was going on so there were few surprises
19:43:30 <clarkb> thank you for the heads up
19:43:32 <clarkb> #topic Proxying and caching Ansible Galaxy in our providers
19:43:52 <clarkb> #link https://review.opendev.org/818787 proxy caching ansible galaxy
19:44:08 <clarkb> This came up in the context of tripleo jobs needing to use ansible collections and having less reliable downloads
19:44:15 <fungi> right
19:44:21 <clarkb> I think we set them up with zuul github projects they can require on their jobs
19:44:38 <fungi> yes we added some of the collections they're using, i think
19:44:41 <clarkb> Is the proxy cache something we think we should move those ansible users to? or should we continue adding github projects?
19:44:54 <clarkb> or do we need some combo of both?
19:45:11 <fungi> that's my main question
19:45:23 <fungi> one is good for integration testing, the other good for deployment testing
19:45:49 <fungi> if you're writing software which pulls things from galaxy, you may want to exercise that part of it
19:45:52 <clarkb> corvus: from a zuul perspective I know we've struggled with the github api throttling during zuul restarts. Is that something you think we should try to optimize by reducing the number of github projects in our zuul config?
19:46:28 <clarkb> fungi: I think you'd still point galaxy at a local file dir url. And I'm not sure you gain much testing galaxy's ability to parse file:/// vs https:///
19:46:50 <corvus> clarkb: i don't know if that's necessary at this point; i think it's worth forgetting what we knew and starting a fresh analysis (if we think it's worthwhile or is/could-be a problem)
19:46:56 <corvus> much has changed
19:46:57 <clarkb> corvus: got it
19:47:46 <clarkb> At the end of the day adding the proxy cache is pretty low effort on our end. But the zuul required projects should be far more reliable for jobs. And since we are already doing that I sort of lean that direction
19:48:02 <clarkb> But considering the low effort to run the caching proxy I'm good with doing both and letting users decide which tradeoff is best for them
19:48:28 <fungi> yeah, the latter means we need to review every new addition, even if the project doesn't actually need to consume that dependency from arbitrary git states
19:49:10 <fungi> with the caching proxy, if they add a collection or role from galaxy they get the benefit of the proxy right away
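For the proxy side, the caching setup would presumably look much like our other mirror proxies; a rough apache sketch (the vhost name, port, and cache paths are illustrative, and it assumes mod_proxy, mod_proxy_http, mod_ssl, and mod_cache_disk are enabled):

    # Hypothetical caching reverse proxy for galaxy.ansible.com
    <VirtualHost *:8080>
        ServerName mirror.example.opendev.org

        CacheRoot /var/cache/apache2/proxy
        CacheEnable disk "/galaxy/"
        CacheDefaultExpire 3600

        SSLProxyEngine on
        ProxyPass "/galaxy/" "https://galaxy.ansible.com/" retry=0
        ProxyPassReverse "/galaxy/" "https://galaxy.ansible.com/"
    </VirtualHost>

Jobs would then point ansible-galaxy's server URL at the in-region mirror instead of reaching galaxy.ansible.com directly.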
19:49:13 <clarkb> good point. I'll add this to my review list for after lunch and we can roll forward with both while we sort out github connections in zuul
19:49:42 <clarkb> Anything else on this subject?
19:49:58 <fungi> but i agree that if the role or collection is heavily used then having it in the tenant config is going to be superior for stability
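For the tenant-config approach, a job simply lists the collection repositories it needs so Zuul checks them out ahead of time; the project and job names here are purely illustrative:

    # Hypothetical job consuming a collection from a required project
    # instead of downloading it from galaxy at run time
    - job:
        name: tripleo-collections-example
        required-projects:
          - name: github.com/ansible-collections/community.general
        # The playbooks can then install the collection from the prepared
        # checkout under src/github.com/... rather than reaching out to galaxy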
19:50:13 <fungi> i didn't have anything else on that one
19:50:19 <clarkb> #topic Open Discussion
19:50:26 <clarkb> We've got 10 minutes for any other items to discuss.
19:50:33 <fungi> you had account cleanups on the agenda too
19:50:47 <clarkb> ya but there isn't anything to say about them. I've been out and have had no time to dig into them
19:51:01 <fungi> for anyone reviewing storyboard, i have a couple of webclient fixes up
19:51:05 <clarkb> It's a bit aspirational at this point :/ I need to block off a solid day or three and just dive into it
19:51:10 <fungi> #link https://review.opendev.org/814053 Bindep cleanup and JavaScript updates
19:51:22 <fungi> that solves bitrot in the tests
19:51:27 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/819733 upgrade Gerrit to 3.3.8
19:51:27 <fungi> and makes it deployable again
19:51:36 <clarkb> Gerrit made new versions and ^ updates our image so that we can upgrade
19:51:45 <clarkb> Might want to do that during a zuul restart?
19:52:06 <fungi> yeah, since we need to clear zk anyway that probably makes sense
19:52:13 <fungi> #link https://review.opendev.org/814041 Update default contact in error message template
19:52:28 <fungi> that fixes the sb error message to point users to oftc now instead of freenode
19:52:42 <fungi> can't merge until the tests work again (the previous change i mentioned)
19:53:08 <ianw> oh i still have the 3.4 checklist to work through.  hopefully can discuss next week
19:53:34 <clarkb> ianw: 819733 does update the 3.4 image to 3.4.2 as well. We may want to refresh the test system on that once the above change lands
19:53:51 <clarkb> The big updates in these new versions are to reindexing, so that's something that might actually impact the upgrade
19:54:02 <clarkb> they added a bunch of performance improvements sounds like
19:54:33 <ianw> iceweasel ... there's a name i haven't heard in a while
19:54:47 <fungi> especially since it essentially no longer exists
19:54:50 <ianw> clarkb: ++
19:56:25 <clarkb> Last call, then we can all go eat $meal
19:56:50 <ianw> kids these days wouldn't even remember the trademark wars of ... 2007-ish?
19:57:09 <fungi> i had to trademark uphill both ways in the snow
19:57:14 <clarkb> ianw: every browser is Chrome now too
19:57:59 <clarkb> #endmeeting