19:01:14 <clarkb> #startmeeting infra
19:01:15 <openstack> Meeting started Tue Oct 13 19:01:14 2020 UTC and is due to finish in 60 minutes.  The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:16 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:18 <openstack> The meeting name has been set to 'infra'
19:01:22 <diablo_rojo> o/
19:01:25 <clarkb> #link http://lists.opendev.org/pipermail/service-discuss/2020-October/000105.html Our Agenda
19:01:44 <fungi> ohai
19:01:54 <clarkb> I'm actually going to flip the order of this agenda around so that we can talk about gerrit last so that we can just talk about it until we are done or run out of time
19:02:03 <clarkb> #topic Announcements
19:02:13 <clarkb> The OpenStack release happens tomorrow
19:02:27 <clarkb> we should be slushy on things that impact that (I think we've been managing that so far so not super concerned)
19:02:53 <clarkb> Then next week we have the summit and the week after that the PTG
19:03:08 <clarkb> hope to see you all virtually there :)
19:03:26 <clarkb> #topic Actions from last meeting
19:03:34 <clarkb> #link http://eavesdrop.openstack.org/meetings/infra/2020/infra.2020-10-06-19.01.txt minutes from last meeting
19:03:38 <clarkb> No actions recordred
19:03:43 <clarkb> (and I can't type)
19:03:59 <clarkb> #topic General topics
19:04:05 <clarkb> #topic PTG Planning
19:04:18 <clarkb> #link https://etherpad.opendev.org/opendev-ptg-planning-oct-2020 October PTG planning happens here
19:04:52 <clarkb> if you haven't looked over this etherpad yet it would be a good idea to do a quick check to ensure we aren't missing any important items
19:05:16 <clarkb> Other than that ensure you've registered
19:05:20 <clarkb> #link https://www.openstack.org/ptg/
19:05:26 <clarkb> And we'll see you on meetpad in a couple weeks
19:05:38 <clarkb> #topic Bup and Borg Backups
19:05:56 <clarkb> ethercalc is are borg test unit
19:06:10 * fungi tries to parse
19:06:16 <clarkb> ianw: that landed yesterday, anything unexpected or exciting to mention on that?
19:06:25 <clarkb> s/are/our/
19:06:38 <ianw> well, yes, about ansible on bridge of course :)
19:06:51 <clarkb> oh ya the jinja thing. That would be good to recount here
19:06:59 <fungi> did manually upgrading jinja2 work?
19:07:13 <fungi> i sort of passed out around that time
19:07:17 <ianw> in a yak shaving adventure, i realised that bridge has jinja2 2.10, because ansible doesn't specify any lower bound
19:07:49 <ianw> and of course, i managed to find a place where it is incompatible with ~2.11, which is what gets installed in the gate testing
19:08:12 <ianw> i haven't tried with the manual update of it yet, will today
19:08:30 <ianw> in the mean time, i wrote up https://review.opendev.org/757670 to install ansible in a venv on bridge, so we can --update it
19:09:28 <clarkb> assuming that works we should have our first borg'd server ya?
19:09:31 <ianw> i'll clean that up today, but it seems to work
19:09:49 <ianw> clarkb: hopefully :)  anyway, progress is being made
19:10:26 <clarkb> #topic Splitting puppet else into specific infra-prod jobs
19:10:35 <clarkb> I don't think anyone has started this yet but thought I'd quickly double check
19:11:22 <fungi> i have not, no
19:11:45 <clarkb> #topic Priority Efforts
19:11:53 <clarkb> #topic Update Configuration Management
19:12:31 <clarkb> I think I saw ianw working on the reprepro ansiblification. There was also an update to add in some files that were missing in gerrit ansible
19:12:43 <clarkb> anything to add re ^ or any other config management updates?
19:12:55 <ianw> yeah i'm starting on that, so we can get rid of more puppet there
19:12:59 <fungi> oh, was there an update to that change? i'm happy to take a look
19:13:31 <ianw> fungi: still very much a wip.  i'm taking a less template-centric approach
19:13:43 <clarkb> this is an area we should be careful with the openstack release happening tomorrow but things like reprepro should be low impact if they break (due to how we vos release)
19:13:47 <fungi> probably wise. that was far too many templates
19:14:29 <ianw> yeah, there were about 3 different forms of templates, which made it more confusing than just looking at the files
19:14:44 <ianw> (it didn't start that way, of course :)
19:15:19 <fungi> shall i just abandon my topic:ansible-reprepro changes then? i guess you're working in a new change
19:15:51 <fungi> i'm entirely in favor of something with fewer templates
19:16:07 <fungi> that's what was so daunting about trying to convert the puppet to begin with
19:16:23 <fungi> i got as far as template conversion and stalled out
19:17:26 <ianw> fungi: you can leave it for now, i used some of it as reference :)
19:17:39 <fungi> by all means, happy it helped
19:19:01 <clarkb> #topic OpenDev
19:19:18 <clarkb> That takes us to the topic I was hoping to make room for (and we did yay)
19:19:26 <clarkb> specifically upgrading our gerrit server
19:19:54 <clarkb> fungi and I have worked through a gerrit 2.13 to 3.2 upgrade on review-test using a snapshot of production from october 1
19:20:17 <clarkb> That upgrade is looking to be about 2 days long (with gerrit offline for it)
19:20:38 <clarkb> The first step is to upgrade from 2.13 to 2.16 as we need 2.16 to do the notedb conversion
19:20:39 <fungi> which wouldn't be too terrible over a weekend
19:20:50 <fungi> two days over a weekend i mean
19:21:05 <clarkb> once we've upgraded to 2.16 I think we should checkpoint there so we don't have to fall back all the way to 2.13 if something goes wrong
19:21:16 <clarkb> then we run the notedb migration which will take about 8 hours
19:21:28 <clarkb> then the next day we can do the 3.0 through 3.2 upgrades
19:21:49 <corvus> by about 2 days, what do you mean?  like 8am one day to 5pm the next?  with idle time for when processes finish and no one is watching?
19:21:53 <corvus> or 48 hours straight?
19:22:03 <clarkb> 8am to 5pm the next
19:22:10 <corvus> cool
19:22:19 <fungi> with likely some idle time interspersed
19:23:14 <clarkb> roughly that process would look like: shut everything down and put up notices, backup reviewdb and git repos, do upgrade to 2.16, check it is happy, backup reviewdb and git repos (this is ~5pm day one), do notedb migration, at 8am next day do 3.0 to 3.2 upgrades which should finish around midday. Spend rest of day turning things back on and merging changes to catch up with our new state
19:23:15 <fungi> like the notedb conversion, but also to a lesser extent offline reindexing, database schema migrations, git gc passes...
19:24:07 <clarkb> and ya lots of idle time waiting for things to finish
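The overnight part of the plan above can be checked with a little arithmetic. This is a minimal sketch using only the numbers mentioned in the discussion (second backup done around 17:00 on day one, a notedb migration of about 8 hours, work resuming at 08:00 the next day). The function name and the placeholder calendar date are illustrative assumptions, and the real durations may differ from the test run.

```python
from datetime import datetime, timedelta

# Placeholder date only; the arithmetic depends on the clock times, not the day.
DAY_ONE = datetime(2000, 1, 1)


def notedb_window(day_one=DAY_ONE, start_hour=17, migration_hours=8, resume_hour=8):
    """Return (start, finish, idle) for the overnight notedb migration.

    start_hour, migration_hours and resume_hour are the rough figures quoted
    in the meeting, not measured guarantees.
    """
    start = day_one.replace(hour=start_hour)
    finish = start + timedelta(hours=migration_hours)
    resume = (day_one + timedelta(days=1)).replace(hour=resume_hour)
    return start, finish, resume - finish


start, finish, idle = notedb_window()
finish_day = (finish.date() - DAY_ONE.date()).days + 1
print(f"migration starts {start:%H:%M} on day 1")
print(f"migration finishes about {finish:%H:%M} on day {finish_day}")
print(f"idle time before the 3.0 to 3.2 upgrades: {idle}")
```

With those figures the migration would finish in the early hours of day two, leaving several hours of slack before the 3.0 through 3.2 upgrades begin.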
19:24:16 <clarkb> https://etherpad.opendev.org/p/gerrit-2.16-upgrade has timing data
19:24:52 <clarkb> From the testing side of things basic functionality seems to work
19:24:54 <fungi> which is about as accurately measured as we can manage. same server flavor, volume type, snapshot of production data, et cetera
19:25:17 <fungi> obviously though, clouds, no way to be sure about the timing for any of it
19:25:21 <clarkb> I can login, do git review -s, git review a change, review a change, search for changes, and otherwise interact with the web ui
19:25:27 <clarkb> fungi tested that ICLA signing works
19:25:35 <fungi> yup
19:25:49 <clarkb> I'm currently testing replication to a gitea99 from a held system-config-run-gitea job
19:25:59 <clarkb> that has been running for 25 hours now and is still not done replicating
19:26:05 <clarkb> but it is working
19:26:13 <fungi> for followup tasks (or could even be done beforehand), there are likely some zuul jobs to be written to replace some of our jeepyb gerrit hooks
19:26:25 <clarkb> my takeaway from that is we should be prepared to possibly stop replicating refs/changes again but we can make that decision if it becomes a problem?
19:26:52 <clarkb> yes, the next thing I want to test is project creation with manage-projects, then renaming a project, and finally use the delete-project plugin to test deleting a project
19:27:03 <corvus> why would we need to re-replicate refs/changes?
19:27:24 <clarkb> corvus: we'll be replicating all of the notedb content which is in refs/changes/XY/ABCXY/meta now
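For readers unfamiliar with the ref layout mentioned here: in refs/changes/XY/ABCXY/meta, XY is the last two digits of the change number, zero-padded. The sketch below builds those ref names; the helper names are made up for illustration and are not part of Gerrit or jeepyb.

```python
def change_shard(change_number: int) -> str:
    """Last two digits of the change number, zero-padded (the XY component)."""
    return f"{change_number % 100:02d}"


def change_meta_ref(change_number: int) -> str:
    """Ref holding the notedb metadata for a change."""
    return f"refs/changes/{change_shard(change_number)}/{change_number}/meta"


def patchset_ref(change_number: int, patchset: int) -> str:
    """Ref holding a specific patchset of a change."""
    return f"refs/changes/{change_shard(change_number)}/{change_number}/{patchset}"


print(change_meta_ref(757670))   # refs/changes/70/757670/meta
print(patchset_ref(757670, 1))   # refs/changes/70/757670/1
print(change_meta_ref(5))        # refs/changes/05/5/meta
```

Because the meta ref lives next to the patchset refs, anything that replicates refs/changes/* picks up the notedb content as well.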
19:27:53 <corvus> hrm.  but in your test, gitea99 doesn't have the bulk of the refs/changes content while prod gitea does
19:28:00 <clarkb> corvus: correct
19:28:19 <fungi> this isn't for timing the replication, but measuring the resultant system
19:28:25 <clarkb> it will be ~15GB of data to replicate I think based on df output before and after the notedb migration
19:29:08 <fungi> also the replication to gitea can happen outside the window, it's merely additive
19:29:26 <fungi> but we want to know if the added data is going to cause our gitea servers to topple over
19:29:37 <corvus> right, though it could cause replication to lag which could affect users
19:29:46 <fungi> this is true, yes
19:30:15 <clarkb> ya I think we should go in with the intention of replicating refs/changes and keep in mind we can disable it if we notice problems (so far the only problem is the speed at which a fresh server can be replicated to)
19:30:19 <fungi> we can re-test by importing a gitea production db backup and re-replicating the difference
19:30:35 <fungi> if we want to have an idea of how long it's going to actually take to catch up
19:31:05 <fungi> (for purposes of messaging to our users about replication lag)
19:31:14 <clarkb> on the jeepyb side of things the lp bug and spec integration as well as the welcome message hook all talk to reviewdb which will be stale after the notedb migration (and eventually we'll drop that db and it will break completely)
19:31:19 <corvus> we can also disable replicating refs/changes/XY/ABCXY/meta right?
19:31:27 <clarkb> corvus: I can't figure out how to do it
19:31:36 <corvus> is it a regex?  negative lookahead?
19:31:42 <clarkb> no it's just globs I think
19:31:47 <clarkb> it uses git's normal ref syntax
19:31:48 <corvus> oh. nm then.
19:31:52 <fungi> might be worth asking luca if we can't figure it out
19:32:04 <fungi> though yeah, sounds... like a glob
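The glob versus regex distinction in this exchange can be illustrated as follows. This is only a sketch of the concept: the matcher assumes a simplified single-wildcard pattern where `*` may span `/`, and it is not the replication plugin's actual code. Whether the plugin accepts anything beyond globs was not settled in the discussion.

```python
import re


def glob_ref_matches(pattern: str, ref: str) -> bool:
    """Simplified single-'*' ref pattern match (assumption: '*' may span '/')."""
    if "*" not in pattern:
        return pattern == ref
    prefix, suffix = pattern.split("*", 1)
    return (
        len(ref) >= len(prefix) + len(suffix)
        and ref.startswith(prefix)
        and ref.endswith(suffix)
    )


refs = [
    "refs/changes/70/757670/1",
    "refs/changes/70/757670/meta",
    "refs/heads/master",
]

# A plain glob cannot tell patchset refs and meta refs apart.
glob_hits = [r for r in refs if glob_ref_matches("refs/changes/*", r)]

# A regex with negative lookahead could exclude the meta refs.
no_meta = re.compile(r"^refs/changes/(?!.*/meta$).*$")
regex_hits = [r for r in refs if no_meta.match(r)]

print(glob_hits)
print(regex_hits)
```

The glob selects both the patchset ref and the meta ref, while the lookahead regex keeps only the patchset ref, which is the behaviour corvus was asking about.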
19:32:14 <corvus> i agree that we can probably assume it will be okay and disable if not and regroup
19:32:47 <clarkb> for jeepyb I'm wondering if people think we should proactively remove those gerrit hooks
19:33:02 <clarkb> or if anyone is interested in looking at them to see if they need to use the db or if they can just hit the rest api maybe
19:33:12 <clarkb> (I think the rest api would be the best way to interact with notedb)
19:34:04 <fungi> those hooks seem like something which could be replaced with zuul jobs in advance with little or no modification needed between 2.13 and 3.2
19:35:01 <ianw> can you point out what those hooks are, for those of us who might not know? :)
19:35:07 <clarkb> one sec
19:35:28 <ianw> getting something zuulified might be somewhere i can practically help :)
19:36:00 <clarkb> https://opendev.org/opendev/jeepyb/src/branch/master/jeepyb/cmd/update_bug.py
19:36:28 <clarkb> https://opendev.org/opendev/jeepyb/src/branch/master/jeepyb/cmd/update_blueprint.py
19:36:41 <clarkb> https://opendev.org/opendev/jeepyb/src/branch/master/jeepyb/cmd/welcome_message.py
19:37:05 <clarkb> then in system-config we have simple shim bash scripts that execute those jeepyb tools via gerrit hooks
19:37:56 <fungi> ianw: yeah, that's why i mentioned, they can probably be worked on in parallel and then that's one less thing we have to worry about afterward
19:38:21 <clarkb> related to this is storyboard integration which is currently done via the its-storyboard plugin. That plugin hasn't had much development in years
19:38:46 <clarkb> it's possible that it just works since storyboard and gerrit plugins are quite stable but I haven't set it up to test it
19:38:49 <fungi> also zuul and the storyboard api are both far more extensible than the its framework
19:39:08 <ianw> looks like the db parts there are for setting the uploader of a change to the owner of the bug in launchpad
19:39:11 <fungi> so there's a lot of opportunity for improvement as a zuul job anyway in my opinion
19:40:14 <clarkb> ianw maybe you can look at the jeepyb side and give a recommendation and based on that we decide if we need to test storyboard integration?
19:41:18 <ianw> ok, i can take a look and see if i can find the apis to replicate what's there
19:42:08 <ianw> i can make an etherpad
19:42:42 <clarkb> thank you
19:43:05 <clarkb> other things to note, draft changes will be converted to WIP changes
19:43:22 <clarkb> the UI will change as gerrit 3.2 is polygerrit only
19:43:58 <clarkb> the zuul commentlink stuff will be removed and we can figure out how to make them fancy again post upgrade
19:44:27 <fungi> also the custom summary table js overlay yeah?
19:44:44 <clarkb> yes that doesn't work either so in my change stack it is removed
19:44:49 <corvus> i'm still out of ideas for that other than a polygerrit plugin
19:45:23 <clarkb> https://review.opendev.org/#/c/757162/ is the end of my WIP stack if you want to take a look
19:46:45 <clarkb> mechanically I'm not quite sure yet how we land those post upgrade
19:47:03 <clarkb> assuming zuul is already running we won't want them to actually execute in sequence
19:47:17 <clarkb> we could squash them all together or perhaps remove the infra-prod job for gerrit
19:47:38 <clarkb> or land them pre zuul being started (one change in the sequence is a zuul config update to get it authing properly to gerrit)
19:47:57 <clarkb> I think we can do a proper review of the upgrade process from start to finish once I've written up a better change doc
19:48:00 <clarkb> and check on that there
19:48:17 <clarkb> Are there other gerrit features or functionality that people think will be critical to test pre upgrade?
19:49:16 <ianw> gertty?
19:49:47 <clarkb> ++ would existing gertty users like to point it at review-test or should I get a local install running again?
19:50:00 <clarkb> also possible that corvus already uses gertty with upstream gerrit and its fine?
19:50:02 <fungi> plenty of folks are also using gertty with newer gerrits, but can't hurt
19:50:17 <clarkb> ok add that to my local list
19:50:33 <corvus> i have used it with upstream gerrit and it works
19:50:37 <clarkb> excellent
19:50:42 <corvus> it doesn't fully support all the new features, but it functions
19:50:49 <corvus> (eg tag support is half-implemented)
19:50:55 <corvus> i mean hashtag
19:51:17 <clarkb> given all that do we think we should start working to schedule a downtime under the assumption we'll work through the remainder of our test items in the interim?
19:51:20 <corvus> gerrit devs are good api stewards :)
19:51:58 <clarkb> I think we should start a downtime window ~PDT friday morning and end it ~PDT sunday evening
19:52:04 <clarkb> where PDT may be PST due to time change
19:52:33 <clarkb> that gives us a large buffer over our less busy period of time
19:52:46 <clarkb> with the goal of being done saturday
19:53:05 <fungi> it will be pst by then, yes
19:53:16 <fungi> time change is coming up for the usa next week i think?
19:53:24 <clarkb> fungi: ya just before the PTG
19:54:13 <clarkb> with the summit and PTG coming up and then the US election likely to be distracting I think the earliest we could do it is 13th of november (friday the 13th!)
19:54:28 <clarkb> but maybe 20th is better as it gives us a buffer and is closer to a large us holiday (so likely to be quiet?)
19:54:30 <fungi> lucky 13
19:54:41 <clarkb> ianw: corvus: any opinions?
19:55:27 <corvus> checking
19:55:36 <fungi> i'm open to all of the above
19:55:40 <fungi> there's yet more pressure to update as we've just today learned that fedora 33's default openssl policy causes ssh to no longer accept our gerrit host key without additional overrides
19:55:57 <clarkb> I think my personal vote would be November 20, 21, 22
19:56:08 <clarkb> and keep working on testing things if something major comes up we can push it back
19:56:23 <clarkb> but so far testing has been mostly happy (as long as we accept things won't be perfect just functional)
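A quick sanity check of the candidate dates discussed above, as a small sketch. The dates come from the meeting; the variable names are illustrative only.

```python
from datetime import date, timedelta

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

# The two candidate start dates raised in the discussion.
candidates = [date(2020, 11, 13), date(2020, 11, 20)]
for day in candidates:
    print(day.isoformat(), DAY_NAMES[day.weekday()])

# The proposed three-day window starting on the 20th.
window = [date(2020, 11, 20) + timedelta(days=offset) for offset in range(3)]
print([f"{d.isoformat()} {DAY_NAMES[d.weekday()]}" for d in window])
```

Both candidates fall on a Friday, and the proposed window runs Friday through Sunday, matching the planned weekend outage.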
19:56:43 <ianw> afaik i can be around then, not that i've really done anything on the details of the upgrade
19:57:07 <clarkb> ya the only reason I've said PST timezone reference is I've been doing a lot of the work so figure I should be around to drive things
19:57:12 <clarkb> but I'd love as many eyeballs as possible :)
19:57:22 <fungi> additional hands are helpful for sorting out unanticipated issues after the upgrade too
19:57:35 <corvus> either of those sounds good.  i suspect due to covid, people may be generous in taking time off before or after thanksgiving this year.
19:57:54 <corvus> (eg, banked use-it-or-lose-it vacation days in usa)
19:58:00 <clarkb> ya
19:58:24 <fungi> or they may be like me and have an excuse to be anti-social and not worry about family obligations ;)
19:58:26 <clarkb> why don't we pencil in the 20th if nothing jeopardizing that comes up in the next week we can announce it with a full month of warning?
19:58:44 <corvus> that sounds good
19:58:46 <fungi> sounds great
19:59:10 <corvus> fungi: oh yeah, i'm not suggesting people would use vacation to meet with other people socially.  just use it period.
19:59:11 <clarkb> woo and we haven't quite run out of time yet either :)
19:59:27 <clarkb> Is there anything else related to the gerrit upgrade that we want to talk about before I end the meeting?
19:59:41 <clarkb> also thank you for all the help and willingness to do a weekend outage
19:59:50 <corvus> clarkb: thanks to you for driving it.  and fungi too.
20:00:11 <clarkb> and we're at time. Thanks again
20:00:15 <clarkb> #endmeeting