19:01:02 #startmeeting infra
19:01:03 Meeting started Tue Nov 30 19:01:02 2021 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:03 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:03 The meeting name has been set to 'infra'
19:01:08 #link http://lists.opendev.org/pipermail/service-discuss/2021-November/000303.html Our Agenda
19:01:14 We have an agenda.
19:01:19 #topic Announcements
19:01:37 Gerrit User Summit is happening Thursday and Friday this week from 8am-11am pacific time virtually
19:02:00 If you are interested in joining, registration is free. I think they will have recordings too if you prefer to catch up out of band
19:02:12 also there was a new git-review release last week
19:02:18 I intend on joining as there is a talk on gerrit updates that I think will be useful for us to hear
19:02:55 yup please update your git-review installation to help ensure it is working properly. I've updated as my git version updated locally, forcing me to update
19:03:01 I haven't had any issues with the new git-review yet
19:03:12 git-review 2.2.0
19:03:37 i sort of rushed it through because an increasing number of people were upgrading to newer git, which it was broken with
19:03:54 the delta to the previous release was small too so probably the right move
19:04:04 but yeah, follow up on the service-discuss ml or in #opendev if you run into anything unexpected with it
19:05:02 #topic Actions from last meeting
19:05:11 #link http://eavesdrop.openstack.org/meetings/infra/2021/infra.2021-11-16-19.01.txt minutes from last meeting
19:05:16 I don't see any recorded actions
19:05:27 We'll dive right into the fun stuff then
19:05:30 #topic Topics
19:05:40 #topic Improving CD Throughput
19:06:59 sorry small network hiccup
19:07:12 A number of changes have landed to make this better while keeping our serialized one-job-after-another setup
19:07:39 #link https://review.opendev.org/c/opendev/system-config/+/807808 Update system-config once per buildset.
19:07:45 #link https://review.opendev.org/c/opendev/base-jobs/+/818297/ Reduce actions needed to be taken in base-jobs.
19:08:07 yep, those are the last two
19:08:09 These are the last two updates to keep the status quo but prepare for parallel ops
19:08:31 Once those go in we can start thinking about adding/updating semaphores to allow jobs to run in parallel. Very exciting. Thank you ianw for pushing this along
19:08:48 yep i'll get to that change soon and we can discuss
19:09:33 #topic Zuul multi scheduler setup
19:09:51 Just a note that a number of bug fixes have landed to zuul since we last restarted
19:10:15 I expect that we'll be doing a restart at some point soon to check everything is happy before zuul cuts a new release
19:10:39 I'm not sure if that will require a full restart and clearing of the zk state, corvus would know.
Basically it is possible that this won't be a graceful restart
19:10:44 after our next restart, it would probably be helpful to comb the scheduler/web logs for any new exceptions getting raised
19:10:45 s/graceful/no downtime/
19:10:53 yes we do need a full clear/restart
19:11:05 corvus: thank you for confirming
19:11:20 i saw you indicated similar in matrix as well
19:11:33 (for future 4.11 anyway)
19:11:36 and ya generally be on the lookout for odd behaviors, our input has been really helpful to the development process here and we should keep providing that feedback
19:11:37 i'd like to do that soon, but maybe after a few more changes land
19:12:32 we should probably talk about multi web
19:12:46 it is, amusingly, now our spof :)
19:13:05 corvus: are we thinking run a zuul-web on zuul01 as well then dns round robin?
19:13:11 (amusing since it hasn't ever actually been a spof except that opendev only ever needed to run 1)
19:13:24 that's an option, or a LB
19:13:29 if we add an haproxy that might work better for outages and balancing but it would still be a spof for us
19:13:36 we might want to think about the LB so we can have more frequent restarts without outages
19:14:09 I guess the idea is haproxy will need to restart less often than zuul-web and in many cases haproxy is able to keep connections open until they complete
19:14:12 dns round-robin is only useful for (coarse) load distribution, not failover
19:14:41 do we have octavia available? that in vexxhost?
19:14:41 i figure if it's good enough for gitea it's good enough for zuul; we know that we'll want to restart zuul-web frequently, and there's a pretty long window when a zuul-web is not fully initialized, so a lb setup could make a big difference.
19:15:23 frickler: I think it is available in vexxhost, but we don't host these services in vexxhost currently so that would add a large (~40ms?) rtt between the lb frontend and backend
19:15:46 corvus: good point re gitea
19:15:49 on the other hand, if we need to take the lb down for an extended period, which is far less often, we can change dns to point directly to a single zuul-web while we work on the lb
19:16:07 it's a bit old now, but https://review.opendev.org/c/opendev/system-config/+/677903 does the work to make haproxy a bit more generic for situations such as this
19:16:16 or just build a new lb and switch dns to it, then tear down the old one
19:16:38 (haproxy roles, not haproxy itself)
19:17:09 ianw: oh ya we'll want something like that if we go the haproxy route and don't aaS it
19:17:14 ianw: is that for making a second haproxy server, or for using an existing one for more services?
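For reference, a minimal sketch of the kind of haproxy setup discussed above, loosely modeled on the gitea load balancer approach; the hostnames, ports, and TCP-mode choice here are illustrative assumptions, not the actual zuul-lb configuration:

    # hypothetical haproxy.cfg fragment for balancing zuul-web
    frontend zuul-web-frontend
        bind *:443
        mode tcp
        default_backend zuul-web

    backend zuul-web
        mode tcp
        # health checks let haproxy stop sending traffic to a zuul-web
        # backend that is restarting or still initializing
        server zuul01 zuul01.opendev.org:443 check
        server zuul02 zuul02.opendev.org:443 check

TCP mode with TLS terminated on the zuul-web backends is just one possible arrangement; an HTTP-mode frontend terminating TLS at the load balancer would also work.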
19:17:35 (i think it's option #1 from the commit msg)
19:17:40 corvus: I read the commit message as #1 as well
19:17:44 corvus: iirc that was when we were considering a second haproxy server
19:17:58 yeah, make it easier for us to reuse the system configuration, not the individual load balancer instances
19:18:53 that approach seems good to me (but i don't feel strongly; if there's an aas we'd like to use that should be fine too)
19:18:54 so that we don't end up with multiple almost identical copies of the same files in system-config for different load balancers
19:19:11 corvus: I think I have a slight preference for using our existing tooling for consistency
19:19:57 and separately if someone wants to investigate octavia we can do that and switch wholesale later (I'd be most concerned about using it across geographically distributed systems with disparate front and back ends)
19:20:35 though for that we'd probably be better off with some form of dns-based global load balancing
19:20:48 granted it can be a bit hard on the nameservers
19:21:23 (availability checks driving additions and removals to a dedicated dns record/zone)
19:21:47 requires very short ttls, which some caching resolvers don't play nicely with
19:22:38 ok, i +2d ianw's change; seems like we can base a zuul-lb role on that
19:22:58 sounds good, anything else zuul related to go over?
19:23:15 i'll put that on my list, but it's #2 on my opendev task list, so if someone wants to grab it first feel free :)
19:24:02 (and that's all from me)
19:24:05 #topic User management on our systems
19:24:21 The update to irc gerritbot here went really well. The update to matrix-gerritbot did not.
19:24:45 It turns out that matrix-gerritbot needs a cache dir in $HOME/.cache to store its dhall intermediate artifacts
19:25:31 and that didn't play nicely with the idea of running the container as a different user as it couldn't write to $HOME/.cache. I had thought I had bind mounted everything it needed and that it was all read only but that wasn't the case. To make things a bit worse the dhall error log messages couldn't be written because the image lacked a utf8 locale and the error messages had
19:25:33 utf8 characters
19:25:54 tristanC has updated the matrix-gerritbot image to address these things so we can try again this week. I need to catch back up on that.
19:26:22 One thing I wanted to ask about is whether or not we'd like to build our own matrix-gerritbot images using docker instead of nix so that we can have a bit more fully featured image as well as understand the process
19:26:33 I found the nix stuff to be quite obtuse myself and basically punted on it as a result
19:27:51 (the image is really interesting: it sets a bash prompt but no bash is installed, there is no /tmp (I tried to override $HOME to /tmp to fix the issue and that didn't work), etc)
19:28:55 I don't need an answer to that in this meeting but wanted to call it out. Let me know if you think that is a good or terrible idea once you have had a chance to ponder it
19:28:59 i agree, it's nice to have images which can be minimally troubleshot at least
19:29:01 it wouldn't quite fit our usual python-builder base images, though, either?
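As an illustration of the docker-based build idea, a hypothetical multi-stage Dockerfile along the lines described in the following messages; the haskell image tag, the cabal invocation, the matrix-gerritbot target and binary name, the runtime base image, and the paths are all assumptions for illustration rather than the project's actual build:

    # build stage: compile in a throwaway layer (image tag is illustrative)
    FROM docker.io/library/haskell:9.2 AS builder
    WORKDIR /src
    COPY . /src
    # assumed cabal target/binary name, for illustration only
    RUN cabal update && cabal install exe:matrix-gerritbot --installdir=/output

    # runtime stage: copy only the resulting binary into a smaller image
    FROM docker.io/library/debian:bullseye-slim
    # a utf8 locale and /tmp come along with a distro base image like this
    ENV LANG=C.UTF-8
    # (runtime libraries such as libgmp may also need to be installed here)
    COPY --from=builder /output/matrix-gerritbot /usr/local/bin/matrix-gerritbot
    ENTRYPOINT ["/usr/local/bin/matrix-gerritbot"]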
19:29:19 ianw: correct, it would be doing very similar things but with haskell and cabal instead of python and pip
19:29:41 ianw: we'd do a build in a throwaway image/layer and then copy the resulting binary into a more minimal haskell image
19:29:47 s/haskell/ghc/ I guess
19:30:15 https://hub.docker.com/_/haskell is the image we'd probably use
19:30:35 I don't think we would need to maintain the base images, we could just FROM that image a couple of times and copy the resulting binary over
19:31:28 We can move on. I wanted to call this out and get people thinking about it so that we can make a decision later. It isn't urgent to decide now as it isn't an operational issue at the moment
19:31:39 #topic UbuntuOne two factor auth
19:31:47 #link http://lists.opendev.org/pipermail/service-discuss/2021-November/000298.html Using 2fa with ubuntu one
19:31:58 at the beginning of last week i started that ml thread
19:32:15 i wanted to bring it up again today since i know a lot of people were afk last week
19:32:31 so far there have been no objections to proceeding, and two new volunteers to test
19:32:46 I have no objections, if users are comfortable with the warning in the group description I think we should enroll those who are interested
19:32:50 even though we haven't really made a call for volunteers yet
19:32:58 (i think i approved one already, sorry, after not reading the email)
19:33:14 no harm done ;)
19:33:26 ya was hrw, I think hrw was aware of the concerns after working at canonical previously
19:33:31 an excellent volunteer :)
19:33:36 i just didn't want to go approving more volunteers or asking for volunteers until we seemed to have some consensus that we're ready
19:33:55 I think so, it's been about a year, I have yet to have a problem in that time
19:34:29 i'll give people until this time tomorrow to follow up on the ml as well before i more generally declare that we're seeking volunteers to help try it out
19:34:57 sounds like a plan, thanks
19:35:26 I guess I can't be admin for that group without being a member?
19:35:33 frickler: correct
19:35:35 frickler: I think that is correct due to how lp works
19:36:10 i'm also happy to add more admins for the group
19:36:21 o.k., not a blocker I'd think, but I'm not going to join at least for now
19:36:33 One thing we might need to clarify with canonical/lp/ubuntu is what happens if someone is removed from the group
19:36:38 and until then don't remove anyone?
19:37:00 i'll make sure to mention that in the follow-up
19:37:19 maybe hrw knows, even
19:37:35 it does seem like from what it says it's a one-way ticket, i was treating it as such
19:37:42 but good to confirm
19:37:53 ianw: yup, that is why I asked because if we add more admins they need to be aware of that and not remove people potentially
19:38:16 it may also be the case that the enrollment happens on the backend once and then never changes regardless of group membership
19:38:41 We have a couple more topics so let's continue on
19:38:44 #topic Adding a lists.openinfra.dev mailman site
19:38:49 #link https://review.opendev.org/818826 add lists.openinfra.dev
19:39:07 fungi: I guess you've decided it is safe to add the new site based on current resource usage on lists.o.o?
19:39:32 One thing I'll note is that I don't think we've added a new site since we converted to ansible. Just be on the lookout for anything odd due to that.
We do test site creation in the test jobs though
19:39:35 yeah, i've been monitoring the memory usage there and it's actually under less pressure after the ubuntu/python/mailman upgrade
19:40:17 you'll also need to update DNS over in the dns as a service but that is out of band and safe to land this before that happens
19:40:17 for some summary background, as part of the renaming of the openstack foundation to the open infrastructure foundation, there's a desire to move the foundation-specific mailing lists off the openstack.org domain
19:40:57 i'm planning to duplicate the list configs and subscribers, but leave the old archives in place
19:41:32 fungi: is there any concern for impact on the mm3 upgrade from this? I guess it is just another site to migrate but we'll be doing a bunch of those either way
19:41:35 and forward from the old list addresses to the new ones of course
19:42:14 yeah, one of the reasons i wanted to knock this out was to reduce the amount of list configuration churn we need to deal with shortly after a move to mm3 when we're still not completely familiar with it
19:42:43 makes sense. I think you've got the reviews you need so approve when ready I guess :)
19:42:46 so the more changes we can make before we migrate, the more breathing room we'll have after to finish coming up to speed
19:42:56 Anything else on this topic?
19:43:17 nope, thanks. i mainly wanted to make sure everyone was aware this was going on so there were few surprises
19:43:30 thank you for the heads up
19:43:32 #topic Proxying and caching Ansible Galaxy in our providers
19:43:52 #link https://review.opendev.org/818787 proxy caching ansible galaxy
19:44:08 This came up in the context of tripleo jobs needing to use ansible collections and having less reliable downloads
19:44:15 right
19:44:21 I think we set them up with zuul github projects they can require on their jobs
19:44:38 yes we added some of the collections they're using, i think
19:44:41 Is the proxy cache something we think we should move those ansible users to? or should we continue adding github projects?
19:44:54 or do we need some combo of both?
19:45:11 that's my main question
19:45:23 one is good for integration testing, the other good for deployment testing
19:45:49 if you're writing software which pulls things from galaxy, you may want to exercise that part of it
19:45:52 corvus: from a zuul perspective I know we've struggled with the github api throttling during zuul restarts. Is that something you think we should try to optimize by reducing the number of github projects in our zuul config?
19:46:28 fungi: I think you still point galaxy at a local file dir url. And I'm not sure you gain much testing galaxy's ability to parse file:/// vs https://
19:46:50 clarkb: i don't know if that's necessary at this point; i think it's worth forgetting what we knew and starting a fresh analysis (if we think it's worthwhile or is/could-be a problem)
19:46:56 much has changed
19:46:57 corvus: got it
19:47:46 At the end of the day adding the proxy cache is pretty low effort on our end. But the zuul required projects should be far more reliable for jobs.
And since we are already doing that I sort of lean that direction
19:48:02 But considering the low effort to run the caching proxy I'm good with doing both and letting users decide which tradeoff is best for them
19:48:28 yeah, the latter means we need to review every new addition, even if the project doesn't actually need to consume that dependency from arbitrary git states
19:49:10 with the caching proxy, if they add a collection or role from galaxy they get the benefit of the proxy right away
19:49:13 good point. I'll add this to my review list for after lunch and we can roll forward with both while we sort out github connections in zuul
19:49:42 Anything else on this subject?
19:49:58 but i agree that if the role or collection is heavily used then having it in the tenant config is going to be superior for stability
19:50:13 i didn't have anything else on that one
19:50:19 #topic Open Discussion
19:50:26 We've got 10 minutes for any other items to discuss.
19:50:33 you had account cleanups on the agenda too
19:50:47 ya but there isn't anything to say about them. I've been out and had no time to discuss them
19:51:01 for anyone reviewing storyboard, i have a couple of webclient fixes up
19:51:05 It's a bit aspirational at this point :/ I need to block off a solid day or three and just dive into it
19:51:10 #link https://review.opendev.org/814053 Bindep cleanup and JavaScript updates
19:51:22 that solves bitrot in the tests
19:51:27 #link https://review.opendev.org/c/opendev/system-config/+/819733 upgrade Gerrit to 3.3.8
19:51:27 and makes it deployable again
19:51:36 Gerrit made new versions and ^ updates our image so that we can upgrade
19:51:45 Might want to do that during a zuul restart?
19:52:06 yeah, since we need to clear zk anyway that probably makes sense
19:52:13 #link https://review.opendev.org/814041 Update default contact in error message template
19:52:28 that fixes the sb error message to point users to oftc now instead of freenode
19:52:42 can't merge until the tests work again (the previous change i mentioned)
19:53:08 oh i still have the 3.4 checklist to work through. hopefully can discuss next week
19:53:34 ianw: 819733 does update the 3.4 image to 3.4.2 as well. We may want to refresh the test system on that once the above change lands
19:53:51 The big updates in these new versions are to reindexing, so that is something that might actually impact the upgrade
19:54:02 they added a bunch of performance improvements, sounds like
19:54:33 iceweasel ... there's a name i haven't heard in a while
19:54:47 especially since it essentially no longer exists
19:54:50 clarkb: ++
19:56:25 Last call, then we can all go eat $meal
19:56:50 kids these days wouldn't even remember the trademark wars of ... 2007-ish?
19:57:09 i had to trademark uphill both ways in the snow
19:57:14 ianw: every browser is Chrome now too
19:57:59 #endmeeting