19:01:14 #startmeeting infra
19:01:15 Meeting started Tue Oct 13 19:01:14 2020 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:16 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:18 The meeting name has been set to 'infra'
19:01:22 o/
19:01:25 #link http://lists.opendev.org/pipermail/service-discuss/2020-October/000105.html Our Agenda
19:01:44 ohai
19:01:54 I'm actually going to flip the order of this agenda around so that we can talk about gerrit last so that we can just talk about it until we are done or run out of time
19:02:03 #topic Announcements
19:02:13 The OpenStack release happens tomorrow
19:02:27 we should be slushy on things that impact that (I think we've been managing that so far so not super concerned)
19:02:53 Then next week we have the summit and the week after that the PTG
19:03:08 hope to see you all virtually there :)
19:03:26 #topic Actions from last meeting
19:03:34 #link http://eavesdrop.openstack.org/meetings/infra/2020/infra.2020-10-06-19.01.txt minutes from last meeting
19:03:38 No actions recordred
19:03:43 (and I can't type)
19:03:59 #topic General topics
19:04:05 #topic PTG Planning
19:04:18 #link https://etherpad.opendev.org/opendev-ptg-planning-oct-2020 October PTG planning happens here
19:04:52 if you haven't looked over this etherpad yet it would be a good idea to do a quick check to ensure we aren't missing any important items
19:05:16 Other than that ensure you've registered
19:05:20 #link https://www.openstack.org/ptg/
19:05:26 And we'll see you on meetpad in a couple weeks
19:05:38 #topic Bup and Borg Backups
19:05:56 ethercalc is are borg test unit
19:06:10 * fungi tries to parse
19:06:16 ianw: that landed yesterday, anything unexpected or exciting to mention on that?
19:06:25 s/are/our/
19:06:38 well, yes, about ansible on bridge of course :)
19:06:51 oh ya the jinja thing. That would be good to recount here
19:06:59 did manually upgrading jinja2 work?
19:07:13 i sort of passed out around that time
19:07:17 in a yak shaving adventure, i realised that bridge has jinja2 2.10, because ansible doesn't specify any lower bound
19:07:49 and of course, i managed to find a place where it is incompatible with ~2.11, which is what gets installed in the gate testing
19:08:12 i haven't tried with the manual update of it yet, will today
19:08:30 in the mean time, i wrote up https://review.opendev.org/757670 to install ansible in a venv on bridge, so we can --update it
19:09:28 assuming that works we should have our first borg'd server ya?
19:09:31 i'll clean that up today, but it seems to work
19:09:49 clarkb: hopefully :) anyway, progress is being made
19:10:26 #topic Splitting puppet else into specific infra-prod jobs
19:10:35 I don't think anyone has started this yet but thought I'd quickly double check
19:11:22 i have not, no
19:11:45 #topic Priority Efforts
19:11:53 #topic Update Configuration Management
19:12:31 I think I saw ianw working on the reprepro ansiblification. There was also an update to add in some files that were missing in gerrit ansible
19:12:43 anything to add re ^ or any other config management updates?
19:12:55 yeah i'm starting on that, so we can get rid of more puppet there
19:12:59 oh, was there an update to that change? i'm happy to take a look
19:13:31 fungi: still very much a wip. i'm taking a less template-centric approach
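As a point of reference for the venv approach mentioned above (https://review.opendev.org/757670), here is a minimal sketch of installing Ansible into its own virtualenv so it can be upgraded independently of system packages. The install path and the jinja2 floor are illustrative assumptions, not necessarily what the change itself does.

    import subprocess
    import venv

    VENV_PATH = "/opt/ansible-venv"  # hypothetical location on bridge

    # Create an isolated virtualenv with pip available.
    venv.EnvBuilder(with_pip=True).create(VENV_PATH)

    # Install ansible there, with a jinja2 floor so templating on bridge matches
    # what the gate jobs exercise (~2.11 rather than the old 2.10).
    subprocess.run(
        [f"{VENV_PATH}/bin/pip", "install", "--upgrade", "ansible", "jinja2>=2.11"],
        check=True,
    )

Re-running the pip step with newer pins would then update the tooling on bridge without touching distro packages.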
19:13:43 this is an area we should be careful with given the openstack release happening tomorrow, but things like reprepro should be low impact if they break (due to how we vos release)
19:13:47 probably wise. that was far too many templates
19:14:29 yeah, there were about 3 different forms of templates, which made it more confusing than just looking at the files
19:14:44 (it didn't start that way, of course :)
19:15:19 shall i just abandon my topic:ansible-reprepro changes then? i guess you're working in a new change
19:15:51 i'm entirely in favor of something with fewer templates
19:16:07 that's what was so daunting about trying to convert the puppet to begin with
19:16:23 i got as far as template conversion and stalled out
19:17:26 fungi: you can leave it for now, i used some of it as reference :)
19:17:39 by all means, happy it helped
19:19:01 #topic OpenDev
19:19:18 That takes us to the topic I was hoping to make room for (and we did yay)
19:19:26 specifically upgrading our gerrit server
19:19:54 fungi and I have worked through a gerrit 2.13 to 3.2 upgrade on review-test using a snapshot of production from october 1
19:20:17 That upgrade is looking to be about 2 days long (with gerrit offline for it)
19:20:38 The first step is to upgrade from 2.13 to 2.16 as we need 2.16 to do the notedb conversion
19:20:39 which wouldn't be too terrible over a weekend
19:20:50 two days over a weekend i mean
19:21:05 once we've upgraded to 2.16 I think we should checkpoint there so we don't have to fall back all the way to 2.13 if something goes wrong
19:21:16 then we run the notedb migration which will take about 8 hours
19:21:28 then the next day we can do the 3.0 through 3.2 upgrades
19:21:49 by about 2 days, what do you mean? like 8am one day to 5pm the next? with idle time for when processes finish and no one is watching?
19:21:53 or 48 hours straight?
19:22:03 8am to 5pm the next
19:22:10 cool
19:22:19 with likely some idle time interspersed
19:23:14 roughly that process would look like: shut everything down and put up notices, back up reviewdb and the git repos, do the upgrade to 2.16, check it is happy, back up reviewdb and the git repos again (this is ~5pm day one), do the notedb migration, then at 8am the next day do the 3.0 to 3.2 upgrades which should finish around midday. Spend the rest of the day turning things back on and merging changes to catch up with our new state
19:23:15 like the notedb conversion, but also to a lesser extent offline reindexing, database schema migrations, git gc passes...
19:24:07 and ya lots of idle time waiting for things to finish
19:24:16 https://etherpad.opendev.org/p/gerrit-2.16-upgrade has timing data
19:24:52 From the testing side of things basic functionality seems to work
19:24:54 which is about as accurately measured as we can manage. same server flavor, volume type, snapshot of production data, et cetera
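As a rough sketch of how each checkpoint in that sequence could be sanity-checked, Gerrit's REST endpoint /config/server/version reports the running version. The review-test hostname and the expected version string below are placeholders, assuming whichever step of the 2.13 to 2.16 to 3.x sequence is being verified.

    import json
    import urllib.request

    BASE = "https://review-test.opendev.org"  # placeholder; point at the server being upgraded
    EXPECTED = "2.16"                         # placeholder for the checkpoint being verified

    with urllib.request.urlopen(f"{BASE}/config/server/version") as resp:
        body = resp.read().decode("utf-8")

    # Gerrit prefixes JSON responses with ")]}'"; drop that first line before parsing.
    version = json.loads(body.split("\n", 1)[1])
    print("gerrit reports version", version)
    assert version.startswith(EXPECTED), "server is not at the expected checkpoint"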
19:25:17 obviously though, clouds, no way to be sure about the timing for any of it
19:25:21 I can login, do git review -s, git review a change, review a change, search for changes, and otherwise interact with the web ui
19:25:27 fungi tested that ICLA signing works
19:25:35 yup
19:25:49 I'm currently testing replication to a gitea99 from a held system-config-run-gitea job
19:25:59 that has been running for 25 hours now and is still not done replicating
19:26:05 but it is working
19:26:13 for followup tasks (or could even be done beforehand), there are likely some zuul jobs to be written to replace some of our jeepyb gerrit hooks
19:26:25 my takeaway from that is we should be prepared to possibly stop replicating refs/changes again, but we can make that decision if it becomes a problem?
19:26:52 yes, the next thing I want to test is project creation with manage-projects, then renaming a project, and finally use the delete-project plugin to test deleting a project
19:27:03 why would we need to re-replicate refs/changes?
19:27:24 corvus: we'll be replicating all of the notedb content which is in refs/changes/XY/ABCXY/meta now
19:27:53 hrm. but in your test, gitea99 doesn't have the bulk of the refs/changes content while prod gitea does
19:28:00 corvus: correct
19:28:19 this isn't for timing the replication, but measuring the resultant system
19:28:25 it will be ~15GB of data to replicate I think based on df output before and after the notedb migration
19:29:08 also the replication to gitea can happen outside the window, it's merely additive
19:29:26 but we want to know if the added data is going to cause our gitea servers to topple over
19:29:37 right, though it could cause replication to lag which could affect users
19:29:46 this is true, yes
19:30:15 ya I think we should go in with the intention of replicating refs/changes and keep in mind we can disable it if we notice problems (so far the only problem is the speed at which a fresh server can be replicated to)
19:30:19 we can re-test by importing a gitea production db backup and re-replicating the difference
19:30:35 if we want to have an idea of how long it's going to actually take to catch up
19:31:05 (for purposes of messaging to our users about replication lag)
19:31:14 on the jeepyb side of things the lp bug and spec integration as well as the welcome message hook all talk to reviewdb, which will be stale after the notedb migration (and eventually we'll drop that db and it will break completely)
19:31:19 we can also disable replicating refs/changes/XY/ABCXY/meta right?
19:31:27 corvus: I can't figure out how to do it
19:31:36 is it a regex? negative lookahead?
19:31:42 no, it's just globs I think
19:31:47 it uses git's normal ref syntax
19:31:48 oh. nm then.
19:31:52 might be worth asking luca if we can't figure it out
19:32:04 though yeah, sounds... like a glob
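To get a feel for how many refs/changes/XY/ABCXY/meta refs a given repository would add to replication, a small sketch that counts the meta refs a remote advertises; the repository URL is only an example, and this says nothing about transfer speed, only volume of refs.

    import subprocess

    # Example repository only; run against whichever repo is being sized up.
    REPO = "https://review-test.opendev.org/opendev/system-config"

    out = subprocess.run(
        ["git", "ls-remote", REPO],
        capture_output=True, text=True, check=True,
    ).stdout

    # Each line is "<sha>\t<ref>"; notedb change metadata lives in .../meta refs.
    meta_refs = [line for line in out.splitlines()
                 if line.split("\t")[-1].startswith("refs/changes/")
                 and line.endswith("/meta")]
    print(len(meta_refs), "meta refs would be replicated for this repo")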
19:32:14 i agree that we can probably assume it will be okay and disable if not and regroup
19:32:47 for jeepyb I'm wondering if people think we should proactively remove those gerrit hooks
19:33:02 or if anyone is interested in looking at them to see if they need to use the db or if they can just hit the rest api maybe
19:33:12 (I think the rest api would be the best way to interact with notedb)
19:34:04 those hooks seem like something which could be replaced with zuul jobs in advance with little or no modification needed between 2.13 and 3.2
19:35:01 can you point out what those hooks are, for those of us who might not know? :)
19:35:07 one sec
19:35:28 getting something zuulified might be somewhere i can practically help :)
19:36:00 https://opendev.org/opendev/jeepyb/src/branch/master/jeepyb/cmd/update_bug.py
19:36:28 https://opendev.org/opendev/jeepyb/src/branch/master/jeepyb/cmd/update_blueprint.py
19:36:41 https://opendev.org/opendev/jeepyb/src/branch/master/jeepyb/cmd/welcome_message.py
19:37:05 then in system-config we have simple shim bash scripts that execute those jeepyb tools via gerrit hooks
19:37:56 ianw: yeah, that's why i mentioned, they can probably be worked on in parallel and then that's one less thing we have to worry about afterward
19:38:21 related to this is storyboard integration, which is currently done via the its-storyboard plugin. That plugin hasn't had much development in years
19:38:46 it's possible that it just works since storyboard and gerrit plugins are quite stable, but I haven't set it up to test it
19:38:49 also zuul and the storyboard api are both far more extensible than the its framework
19:39:08 looks like the db parts there are for setting the uploader of a change to the owner of the bug in launchpad
19:39:11 so there's a lot of opportunity for improvement as a zuul job anyway in my opinion
19:40:14 ianw: maybe you can look at the jeepyb side and give a recommendation, and based on that we decide if we need to test storyboard integration?
19:41:18 ok, i can take a look and see if i can find the apis to replicate what's there
19:42:08 i can make an etherpad
19:42:42 thank you
19:43:05 other things to note, draft changes will be converted to WIP changes
19:43:22 the UI will change as gerrit 3.2 is polygerrit only
19:43:58 the zuul commentlink stuff will be removed and we can figure out how to make them fancy again post upgrade
19:44:27 also the custom summary table js overlay yeah?
19:44:44 yes that doesn't work either so in my change stack it is removed
19:44:49 i'm still out of ideas for that other than a polygerrit plugin
19:45:23 https://review.opendev.org/#/c/757162/ is the end of my WIP stack if you want to take a look
19:46:45 mechanically I'm not quite sure yet how we land those post upgrade
19:47:03 assuming zuul is already running we won't want them to actually execute in sequence
19:47:17 we could squash them all together or perhaps remove the infra-prod job for gerrit
19:47:38 or land them pre zuul being started (one change in the sequence is a zuul config update to get it authing properly to gerrit)
19:47:57 I think we can do a proper review of the upgrade process from start to finish once I've written up a better change doc
19:48:00 and check on that there
19:48:17 Are there other gerrit features or functionality that people think will be critical to test pre upgrade?
19:49:16 gertty?
19:49:47 ++ would existing gertty users like to point it at review-test or should I get a local install running again?
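On the jeepyb hooks discussed above, the change metadata they read from reviewdb (owner, commit message) is also available from the standard Gerrit REST API, which keeps working after the notedb migration. A minimal sketch, assuming an example change number and a simplified bug-reference pattern rather than jeepyb's actual matching logic; authentication is omitted.

    import json
    import re
    import urllib.request

    BASE = "https://review-test.opendev.org"
    CHANGE = "757162"  # example change number only

    def get(path):
        with urllib.request.urlopen(BASE + path) as resp:
            # Strip Gerrit's ")]}'" anti-XSSI prefix before parsing the JSON body.
            return json.loads(resp.read().decode("utf-8").split("\n", 1)[1])

    detail = get(f"/changes/{CHANGE}/detail")
    commit = get(f"/changes/{CHANGE}/revisions/current/commit")

    owner = detail["owner"].get("name", "unknown")
    bugs = re.findall(r"(?:Closes|Partial|Related)-Bug:\s*#?(\d+)",
                      commit["message"], re.IGNORECASE)
    print(f"change owned by {owner}, bug references: {bugs or 'none'}")

Roughly this kind of lookup is what a zuul job or an updated hook could use in place of the reviewdb queries.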
19:50:00 also possible that corvus already uses gertty with upstream gerrit and it's fine?
19:50:02 plenty of folks are also using gertty with newer gerrits, but can't hurt
19:50:17 ok add that to my local list
19:50:33 i have used it with upstream gerrit and it works
19:50:37 excellent
19:50:42 it doesn't fully support all the new features, but it functions
19:50:49 (eg tag support is half-implemented)
19:50:55 i mean hashtag
19:51:17 given all that do we think we should start working to schedule a downtime under the assumption we'll work through the remainder of our test items in the interim?
19:51:20 gerrit devs are good api stewards :)
19:51:58 I think we should start a downtime window ~PDT friday morning and end it ~PDT sunday evening
19:52:04 where PDT may be PST due to the time change
19:52:33 that gives us a large buffer over our less busy period of time
19:52:46 with the goal of being done saturday
19:53:05 it will be pst by then, yes
19:53:16 time change is coming up for the usa next week i think?
19:53:24 fungi: ya just before the PTG
19:54:13 with the summit and PTG coming up and then the US election likely to be distracting I think the earliest we could do it is the 13th of november (friday the 13th!)
19:54:28 but maybe the 20th is better as it gives us a buffer and is closer to a large us holiday (so likely to be quiet?)
19:54:30 lucky 13
19:54:41 ianw: corvus: any opinions?
19:55:27 checking
19:55:36 i'm open to all of the above
19:55:40 there's yet more pressure to update as we've just today learned that fedora 33's default openssl policy causes ssh to no longer accept our gerrit host key without additional overrides
19:55:57 I think my personal vote would be November 20, 21, 22
19:56:08 and keep working on testing things; if something major comes up we can push it back
19:56:23 but so far testing has been mostly happy (as long as we accept things won't be perfect, just functional)
19:56:43 afaik i can be around then, not that i've really done anything on the details of the upgrade
19:57:07 ya the only reason I've said PST timezone reference is I've been doing a lot of the work so figure I should be around to drive things
19:57:12 but I'd love as many eyeballs as possible :)
19:57:22 additional hands are helpful for sorting out unanticipated issues after the upgrade too
19:57:35 either of those sounds good. i suspect due to covid, people may be generous in taking time off before or after thanksgiving this year.
19:57:54 (eg, banked use-it-or-lose-it vacation days in usa)
19:58:00 ya
19:58:24 or they may be like me and have an excuse to be anti-social and not worry about family obligations ;)
19:58:26 why don't we pencil in the 20th? if nothing jeopardizing that comes up in the next week we can announce it with a full month of warning
19:58:44 that sounds good
19:58:46 sounds great
19:59:10 fungi: oh yeah, i'm not suggesting people would use vacation to meet with other people socially. just use it period.
19:59:11 woo and we haven't quite run out of time yet either :)
19:59:27 Is there anything else related to the gerrit upgrade that we want to talk about before I end the meeting?
19:59:41 also thank you for all the help and willingness to do a weekend outage
19:59:50 clarkb: thanks to you for driving it. and fungi too.
20:00:11 and we're at time. Thanks again
20:00:15 #endmeeting