19:01:20 #startmeeting infra
19:01:21 Meeting started Tue Jan 5 19:01:20 2021 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:22 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:24 The meeting name has been set to 'infra'
19:01:37 hello everyone, welcome to the first meeting of 2021
19:01:56 Others indicated they would be delayed in joining so I'll give it a few minutes before we dive into the agenda I sent out
19:02:06 #link http://lists.opendev.org/pipermail/service-discuss/2021-January/000160.html Our Agenda
19:05:32 #topic Announcements
19:05:42 I didn't have any announcements. Were there others to share?
19:05:59 * corvus joins late
19:06:44 i've nothing to share
19:06:48 #topic Actions from last meeting
19:06:54 #link http://eavesdrop.openstack.org/meetings/infra/2020/infra.2020-12-08-19.01.txt minutes from last meeting
19:07:13 It has been a while since our last meeting. I don't see any actions registered there. I think we can just roll forward into 2021
19:07:22 #topic Priority Efforts
19:07:27 #topic Update Config Management
19:07:48 Over the holidays it appears that rax was doing a number of host migrations. A non-zero number of these failed, leaving servers unreachable
19:08:36 other than services like ethercalc, wiki, and elasticsearch going down as a result, one of the fallouts from this is that our ansible playbooks try to connect to the servers and never time out, piling up a number of stale ansible-playbook processes and their children on bridge
19:08:49 then subsequent runs time out because the server is slow due to load
19:09:06 We do set an ansible ssh connection timeout but it doesn't seem to be sufficient in these cases
19:09:18 fungi: ^ I think you had a theory for why that may be but I can't remember it right now?
19:09:18 because ssh doesn't time out connecting
19:09:31 ssh authenticates and hangs
19:09:50 I see, it's the next step that isn't being useful
19:10:05 I wonder if we can make that better in ansible or if ansible already has tooling to try and detect that.
19:10:15 basically the servers are in a pathological condition which i think ansible's timeout mechanism doesn't take into consideration but happens rather regularly for us
19:10:18 like maybe we can set a task timeout to some value like 2 hours
19:11:06 anyway we don't need to solve it here. I just wanted to call that out since we hit this problem multiple times on bridge over the holidays (and on our return)
19:11:19 unsure if this is on/off topic, but i made some changes to the root email alias, and it doesn't seem to have taken effect on many servers; is our periodic ansible run failing due to these issues?
19:11:19 it's either hanging the connection indefinitely during or immediately following authentication, i'm not sure which
19:11:42 corvus: base was failing, but should be running as of yesterday evening my local time
19:11:50 correction: base was timing out
19:12:06 ok, so i'll see if my inbox is full again tomorrow :)
19:12:08 yeah, so servers later in the sequence would have been repeatedly skipped
19:13:01 and if you notice servers are unresponsive, reboots seem to correct their issues
19:13:19 any other config management items to bring up?
that was all I had
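
As a side note on the timeout idea above, a rough sketch of the sort of mitigation that could help, assuming the problem really is connections that authenticate and then hang: ssh keepalives so dead peers get dropped, plus a hard bound on the overall run so stale processes cannot pile up on bridge. The option values, the idea of wrapping the playbook invocation rather than using Ansible's own task-level timeout keyword, and the playbook path are all illustrative, not what we currently deploy.

    # Keepalives make ssh abandon a peer that stops responding after auth.
    # Note: ANSIBLE_SSH_ARGS replaces Ansible's default ssh arguments
    # (ControlMaster and friends), so in practice these options would more
    # likely be merged into the [ssh_connection] section of ansible.cfg.
    export ANSIBLE_SSH_ARGS="-o ConnectTimeout=30 -o ServerAliveInterval=30 -o ServerAliveCountMax=6"

    # Hard cap on a whole run (the "task timeout to some value like 2 hours"
    # idea, approximated here with coreutils timeout around the playbook run;
    # the playbook path is only an example).
    timeout --kill-after=5m 2h ansible-playbook playbooks/base.yaml
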
19:14:12 #topic OpenDev
19:14:46 On the Gerrit tuning topic we enabled the git v2 protocol then updated our zuul images to enable it client side and that was the last gerrit tuning we did
19:15:11 it seems to be working from a functionality perspective (zuul and git review are happy etc) but probably too early to say if it has helped with the system load issues
19:15:51 yeah, we also scheduled holidays ;)
19:16:00 if the tuning doesn't work out, let's fall back on scheduling more holidays
19:16:19 yeah, i'll be more convinced next week or the week after when everyone's turning it up to 11 again
19:16:20 Other tuning ideas are the strong refs for jgit caches (potentially needs more memory and is scary for that reason), setting up service user and regular user thread counts to better balance CI and humans, and on the upstream mailing list there has been a ton of recent discussion from other users about tuning caches
19:17:16 corvus: do you know where ianw has gotten with the zuul results plugin work? I think you were helping to get that into an upstream plugin?
19:18:25 I expect we will be able to incorporate that into our images soon, but I've not yet caught up on the status of this work
19:18:28 i'll readily admit i ended up not finding time to work on the jeepyb fixes for update_bug/update_bp as other problems kept preempting my time
19:18:37 um... i haven't checked recently but last i remember is it exists in an upstream repo
19:18:47 corvus: cool so progress :)
19:19:07 the other thing ianw had brought up was using the built in WIP status for changes. In testing that we have found that Zuul doesn't understand WIP status changes as unmergeable
19:19:16 #link https://gerrit.googlesource.com/plugins/zuul-results-summary/
19:19:23 we mentioned this last time we had a meeting but we should discourage users from using that until Zuul does understand that status
19:19:46 i can add that feature
19:20:00 the preexisting WIP vote on the workflow should be used until zuul has been updated
19:20:14 corvus: thanks
19:20:23 #action corvus add wip support to zuul
19:20:49 The last Gerrit related topic I wanted to bring up was the 3.3 upgrade. guillaumec says that 3.3.1 incorporates the fix for zuul
19:21:13 this was the comments thing (that would break 'recheck' i think)
19:21:39 I think that means we can start looking at 3.3.1 upgrades if people have time. The upgrade does involve some changes like the Non-Interactive Users group being renamed to Service Users and I am sure there are other things to consider, so if we do that let's read release notes and test it (review-test can still be used for this I think)
19:21:44 corvus: yup
19:21:47 i haven't checked on what the final status of that is (ie, do we need to enable an option or is it transparently backwards compat)
19:22:13 oh good point, we should also double check this fix doesn't need settings to be effective
19:22:47 i think people were leaning towards not requiring that, but it was a suggestion, so we should verify
19:22:53 I don't know that I'll have time to drive a gerrit upgrade at the beginning of the year. I've got all the typical beginning of the year things distracting me. But I can help anyone else who may have time (if they don't also have beginning of the year items)
19:23:27 ianw was also working on improving our testing of gerrit in CI
19:23:49 it might be worth getting those improvements landed then relying on it to help verify the next upgrade.
I don't think we're in a rush so that may be a good idea
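
For anyone wanting to confirm the protocol v2 change is actually taking effect from a client, a quick check with standard git (the repository here is just a convenient example):

    # Opt the client in to wire protocol v2 (git 2.18+; newer git defaults to it).
    git config --global protocol.version 2

    # A packet trace of a cheap ref advertisement shows "git< version 2"
    # when the server negotiates v2.
    GIT_TRACE_PACKET=1 git ls-remote \
        https://review.opendev.org/opendev/system-config HEAD 2>&1 | grep "version 2"
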
19:24:51 The other opendev related upgrade is Gitea 1.13
19:25:01 #link https://review.opendev.org/c/opendev/system-config/+/769226
19:25:18 this upgrade seems to be a bigger leap than previous gitea upgrades. They have added new features like project management kanban boards
19:25:56 our testing is decent for api checking but maybe we should hold the run job for that change now and put a repo or three in it and confirm it is happy from a ui perspective?
19:25:56 oO
19:26:30 this version also adds elasticsearch support for indexing. It isn't the default and I think we should upgrade to it first without worrying about elasticsearch just to sort out the other changes. Then as a followon we can work to sort out elasticsearch
19:26:55 our manage-projects test loads repos into gitea, can we depends-on or something to just take advantage of that and hold it?
19:27:16 fungi: the gitea test creates all of the projects, but without git content
19:27:24 fungi: all you need to do is push the content in after holding it
19:27:35 ahh
19:27:36 we could potentially modify the job to push in content for some small repos too
19:27:51 that may be a good idea
19:27:53 or push some ourselves after setting up necessary credentials, yeah
19:29:13 ya why don't we do that. I'll WIP the change and suggest we hold it and check the ui since the upgrade is a bit more involved than ones we have done previously
19:30:04 Any other opendev topics to discuss or should we move on?
19:30:29 annual report?
19:30:39 that's next though I guess technically it fits under here
19:30:40 or did you have a separate topic for that?
19:30:46 ahh, no worries
19:30:52 ya I had it in general topics but it is the opendev project update. Let's talk about it here
19:30:54 * fungi should read meeting agendas
19:31:05 We have been asked to put together a project update for opendev in the foundation's annual report
19:31:15 #link https://etherpad.opendev.org/p/opendev-2020-annual-report
19:31:38 I have written a draft. But I'm happy to scrap that if others want to write one. Also happy for edits and suggestions
19:31:56 I believe we have a week from tomorrow to get it together so this isn't a huge rush but is also a near future item to figure out
19:34:00 i'm also putting some polish on our engagement metrics generator: https://review.opendev.org/729293
19:34:03 I've been planning to do periodic rereads and edits myself too. Basically want to reread it with fresher eyes and correct things as necessary
19:34:46 #topic General topics
19:34:54 #topic Bup and Borg Backups
19:35:16 I think we may be about ready to drop this entry from our agenda. I'll double check with ianw when holidays end.
19:35:30 tldr aiui is we're using borg now, bup should be disabled at least on some servers
19:35:54 we'll keep the old bup backups around on the old volumes like we've done with previous bup rotations
19:36:25 if you haven't yet had a chance to interact with borg and try out recovery methods that may be a good exercise. Should only take about half an hour I would expect
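
On the borg recovery exercise suggested above, the basic flow is roughly as follows; the repository URL and archive name are placeholders rather than our actual backup layout, so check the real per-host paths on the backup server first.

    # Placeholder repository location; substitute the real per-host repo.
    REPO=ssh://borg-myserver@backup01.example.org/opt/backups/borg-myserver/backup

    borg list "$REPO"                                  # list available archives
    borg list "$REPO::myserver-2021-01-05T12:00:00"    # list files in one archive
    # Restore a single path from that archive into the current directory.
    borg extract "$REPO::myserver-2021-01-05T12:00:00" etc/hosts
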
19:37:29 #topic InMotion Hosted Cloud
19:37:55 The other thing I've been working on this week is getting an account with inmotion bootstrapped so that we can spin up an openstack cloud there for nodepool resources when they are ready
19:38:28 I have created an account and the details for that as well as our contacts are in the usual location. There is no actual cloud yet though.
AIUI we are waiting on them to tell us they are ready to try bootstrapping the actual resources
19:39:45 this is the experiment where we're sort of on the hook as openstack cloud admins, right?
19:39:52 infracloud mk2?
19:40:03 yes, but I think we've decided that we are comfortable with a redeploy strategy using their provided management tools
19:40:14 in theory that means the actual overhead to us is low
19:40:28 okay, so basically hands-off and if it breaks we push a button and rebuild it all
19:40:36 exactly
19:40:43 so if it breaks or we need to upgrade, ^ that?
19:40:48 yup
19:41:13 that happens occasionally with our current providers too
19:41:46 they have also expressed interest in zuul and nodepool so maybe we can get them involved there too
19:41:55 openstack as a service. it'll be interesting
19:42:51 #topic Open Discussion
19:43:14 That was about all I had. There are some old agenda items that I should probably clean up after thinking about them for half a second
19:43:47 I've got meetings mon-wed next week that will have me distracted in the mornings (and maybe afternoons? I don't know if that has been sorted out yet)
19:43:57 I should be around for our meeting next week though
19:44:16 yeah, same here (same meetings)
19:44:43 but they're half-day if memory serves, so shouldn't be entirely distracting
19:46:05 Anything else? or should we call it here?
19:46:59 * fungi has nothing
19:47:22 sounds like that may be it then. Thanks everyone and we'll see you here next week
19:47:34 thanks clarkb!
19:47:38 feel free to bring up discussions in #opendev or on the mailing list and we can pick things up there if they were missed here
19:47:39 thanks!
19:47:41 #endmeeting