19:01:39 #startmeeting infra
19:01:41 Meeting started Tue Nov 10 19:01:39 2020 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:42 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:44 The meeting name has been set to 'infra'
19:01:55 o/
19:02:00 ohai
19:02:25 #link http://lists.opendev.org/pipermail/service-discuss/2020-November/000134.html Our Agenda
19:02:34 o/
19:03:18 #topic Announcements
19:03:27 o/
19:03:27 Wallaby cycle signing key has been activated https://review.opendev.org/760364
19:03:32 Please sign if you haven't yet https://docs.opendev.org/opendev/system-config/latest/signing.html
19:03:35 o/
19:03:36 I should find time to do that
19:04:26 as long as we have at least a few folks attesting to it, that should be fine. the previous key has also published a signature for it anyway
19:05:03 #topic Actions from last meeting
19:05:09 #link http://eavesdrop.openstack.org/meetings/infra/2020/infra.2020-11-03-19.01.txt minutes from last meeting
19:05:14 There were no recorded actions
19:05:23 #topic Priority Efforts
19:05:28 #topic Update Config Management
19:05:53 I believe we have an update on mirror-update.opendev.org from ianw and fungi? The reprepro stuff has been converted to ansible and the old puppeted server is no more?
19:06:10 that sounds right to me
19:06:25 yes, all done now, i've removed the old server so it's all opendev.org, all the time :)
19:06:37 excellent, thank you for working on that.
19:06:47 Has the change to do vos release via ssh landed?
19:07:05 yes, i haven't double checked all the runs yet this morning, but the ones i saw last night looked good
19:07:36 758695 merged and was deployed by 05:12:16
19:07:43 cool. Are there any other puppet conversions to call out?
19:07:56 so in theory any mirror pulses starting after that time should have used it
19:09:10 umm you saw the thing about the afs puppet jobs
19:09:20 i think they have just been broken for ... a long time?
19:09:33 ianw: yup I've pushed up a few changes/patchsets to try and fix the testing on that change
19:09:43 and yes I expect that has always been broken
19:09:51 just more noticeable now due to the symlink thing
19:10:09 ianw: if my patches don't work then maybe we should ignore e208 for now in order to get the puppetry happy
19:10:09 ok, i think afs is my next challenge to get updated
19:10:09 grafana indicates ~current state (all <4hr old) for our package mirrors
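As a rough illustration of the "vos release via ssh" flow discussed above (not the actual system-config implementation, which lives in Ansible), a minimal sketch might look like the following; the fileserver hostname, volume name, and key path are made-up placeholders:

```python
#!/usr/bin/env python3
"""Illustrative sketch of releasing an AFS read-write volume to its
read-only replicas by running vos release on the fileserver itself over
ssh. Using -localauth on the server side means the mirror-update host
does not need a long-lived admin Kerberos token that could expire during
a very large release. Hostname, volume name, and key path are assumptions.
"""

import subprocess
import sys


def vos_release(volume: str,
                fileserver: str = "afs01.example.org",
                identity: str = "/root/.ssh/id_vos_release") -> None:
    # Run vos release on the fileserver with -localauth so it uses the
    # server's own key material rather than a client token.
    cmd = [
        "ssh", "-i", identity,
        f"root@{fileserver}",
        "vos", "release", "-id", volume, "-localauth",
    ]
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        print(result.stderr, file=sys.stderr)
        raise RuntimeError(f"vos release of {volume} failed")


if __name__ == "__main__":
    # e.g. after a reprepro/rsync pass updates the read-write volume
    vos_release("mirror.ubuntu")
```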
19:11:11 #topic OpenDev
19:11:31 Preparations for a gerrit 3.2 upgrade are ramping up again
19:11:41 #link http://lists.opendev.org/pipermail/service-announce/2020-October/000012.html Our announcement for the November 20 - 22 upgrade window
19:12:00 fungi and I have got review-test upgraded from a ~november 5 prod state
19:12:15 The server should be up and usable for testing and other interactions
19:12:29 yep, fully upgraded to 3.2
19:12:34 ianw: we are hoping that will help with your jeepyb testing
19:12:36 also usable for demonstrating the ui
19:13:04 ahh yes, i can play with the api and see if we can replicate the jeepyb things
19:13:11 and it sounds like the 3.3 release is coming right about the time we planned to upgrade to 3.2, so we should probably plan a separate 3.3 release soon after?
19:13:33 fungi: I think once we've settled then ya 3.3 should happen quickly after
19:13:37 I think we've basically decided that the surrogate gerrit idea is neat, but introduces a bit of complexity in knowing what needs to be synced back and forth to end up with a valid upgrade path that way.
19:13:40 i suppose we can keep review-test around to also test the 3.3 upgrade if we want
19:13:56 fungi and I did discover that giving the notedb conversion more threads sped up that process. Still not short but noticeably quicker
19:14:03 we gave it 50% more threads and it ran 40% quicker
19:14:13 I think we plan to double the default thread count when we do the production upgrade
19:14:27 we might be able to speed it up a bit more still too, though i don't expect much below 4 hours to complete the notedb migration step
19:14:51 There are a couple of things that have popped up that I wanted to bring up for wider discussion.
19:15:02 The first is that we have confirmed that gerrit does not like updating accounts if they don't have an email set
19:15:08 #link https://bugs.chromium.org/p/gerrit/issues/detail?id=13654
19:15:08 if we budget ~4 hours on the upgrade plan, i guess we can see where that would leave us in possible timelines
19:15:42 oh, yeah, that email-less account behavior strikes me as a bug
19:15:53 I've filed that upstream bug due to that weird account management behavior. You can create an internal account just fine without an email address, but you cannot then update that account's ssh keys
19:16:08 you also can't use a duplicate email address across accounts
19:16:17 you're allowed to create accounts with no e-mail address, but adding an ssh key to one after the fact throws a weird backtrace into the logs and responds "unavailable"
19:16:22 so probably just a regression
19:16:28 this means that if we need to update our admin accounts we may need to set a unique email address on them :/
19:16:34 we can set "infra-root+foobar" as email for our admin accounts
19:16:53 yeah, that seems like a reasonable workaround
19:16:57 ah cool
19:17:04 ++
19:17:04 fungi: ^ we should probably test that gerrit treats those as unique?
19:17:09 or i guess probably our own email addresses depending on hosting provider
19:17:22 since rackspace's e-mail system we're using does support + addresses as automatic aliases
19:17:22 gmail supports it iirc
19:17:25 and hopefully newer gerrit will just fix the problem
19:17:36 (and my exim/cyrus does)
19:17:48 i'm happy to test that gerrit sees those addresses as unique, but i can pretty well guarantee it will
19:18:01 fungi: thanks
19:18:14 that seems like a quick and easy fix so we probably don't need to get into it much further
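For illustration, the workaround discussed above could be exercised against Gerrit's account REST API roughly as below; the server URL, credentials, account id, and alias address are assumptions, not real values, and the real change would go through our normal tooling rather than an ad hoc script:

```python
#!/usr/bin/env python3
"""Rough sketch: give an email-less admin account a unique "+"-suffixed
address so that later ssh-key updates stop failing. Host, account id,
address, and credentials are placeholders for illustration only.
"""

import requests
from requests.auth import HTTPBasicAuth

GERRIT = "https://review-test.example.org"
AUTH = HTTPBasicAuth("admin-user", "http-password")


def set_unique_email(account_id: str, email: str) -> None:
    # PUT /a/accounts/{account-id}/emails/{email-id} with an EmailInput.
    # no_confirmation skips the verification mail (requires admin
    # capability); preferred makes it the account's primary address.
    resp = requests.put(
        f"{GERRIT}/a/accounts/{account_id}/emails/{email}",
        json={"email": email, "preferred": True, "no_confirmation": True},
        auth=AUTH,
    )
    # Any 4xx/5xx is a problem, e.g. if another account already claims
    # the same address, which is why each alias has to be unique.
    resp.raise_for_status()


if __name__ == "__main__":
    set_unique_email("1000001", "infra-root+admin-example@example.org")
```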
19:18:29 The other thing I wanted to bring up was the set of changes that I've prepped in relation to the upgrade
19:18:39 #link https://review.opendev.org/#/q/status:open+topic:gerrit-upgrade-prep Changes for before and after the upgrade
19:18:49 A number of those should be safe to land today and they do not have a WIP
19:18:52 the main reason i would want to avoid having to put e-mail addresses on those accounts is it's just one more thing which can tab complete to them in the gerrit ui and confuse users
19:19:19 Another chunk reflect state after the upgrade and are WIP because we shouldn't land them yet
19:19:43 It would be great if we could get reviews on the lot of them to sanity check things as well as land as much as we can today
19:19:49 (or $day before the upgrade)
19:20:19 One specific concern I've got is there are ~4 system-config changes that sort of all need to land together because they reflect post upgrade system state, but zuul will run them in sequence
19:20:38 so I'm wondering how should we manipulate zuul during/after the upgrade to safely run those updates against the updated state
19:20:40 fungi: good point, we should probably avoid "corvus+admin"; infra-root+corvus is better due to tab-complete
19:21:17 https://review.opendev.org/#/c/757155/ https://review.opendev.org/#/c/757625/ https://review.opendev.org/#/c/757156/ https://review.opendev.org/#/c/757176/ are the 4 changes I've identified in this situation
19:21:18 clarkb: bracket with disable/enable jobs change?
19:22:04 corvus: ya so I think our options are: disable then enable the jobs entirely, force merge them all before zuul starts, squash them and set it up so that a single job running is fine
19:22:41 one concern with disabling the jobs then enabling them is I worry I won't manage to sufficiently disable the job since we trigger them in a number of places. But that concern may just be mitigated with sufficient grepping
19:23:08 i agree and force-merge or squashing means less time spinning wheels
19:23:10 just before the meeting I discovered that jeepyb wasn't running the gerrit 3.1 and 3.2 image builds as an example of where we've missed things like that previously
19:24:04 i'm good with squashing, those changes aren't massive
19:24:23 and they're all for the same repo
19:24:26 the changes in system-config that trail the ones I've listed above should all be safe to land as after the fact cleanups
19:24:56 Another concern I had was I expect gitea replication to take a day and a half or so based on testing, I don't think we rely on gitea state for our zuul jobs that run ansible, but if we do anywhere can you call that out?
19:25:09 because that is another syncing of the world step that may impact our automated deployments
19:25:44 but ya if people can review those changes and think about them from a perspective of how do we land them safely post upgrade that would be great. I'm open to feedback and ideas
19:26:14 I'm hoping to write up a concrete upgrade plan doc soon (starting tomorrow likely) and we can start to fill in those details
19:26:42 at this point I think my biggest concern with the upgrade revolves around how do we turn zuul back on safely :)
19:26:58 the gitea replication lag will probably confuse folks cloning or pulling changes (or using gertty)
19:27:17 but it's happened before, so i think if we include that in the announcement folks can deal
19:27:56 this is also why even if we can get stuff done on saturday we need to say the maintenance is through sunday
19:28:10 fungi: yup and we have done that
19:28:11 (or early monday as we've been communicating so far)
19:28:27 another thought that occurred to me when writing https://review.opendev.org/#/c/762191/1 earlier today is that it feels like we're effectively abandoning review-dev
19:28:56 Should we try to upgrade review-dev or decide it doesn't work well for us anymore and we need something like review-test going forward?
19:29:06 I'm hopeful that zuul jobs can fit in there too
19:29:17 i had assumed, perhaps incorrectly, that we wouldn't really need review-dev going forward
19:29:39 fungi: fwiw I don't think that is incorrect, mostly just me realizing today "Oh ya we still have review dev and these changes will make it sad"
19:29:47 I think that is ok if one of the todo items here is retire review-dev
19:29:57 we can put it in the emergency file in the interim
19:30:18 review-test with prod like data has been way more valuable imo
19:30:23 our proliferation of -dev servers predates our increased efficiency at standing up test servers on demand, or even as part of ci
19:30:48 and at some point they become more of a maintenance burden than a benefit
19:32:14 clarkb: ++
19:32:34 ok /me adds put review-dev in stasis to the list
19:33:04 The last thing on my talk about gerrit list is that storyboard is still an unknown
19:33:14 its-storyboard may or may not work is the more specific way of saying that
19:33:28 fungi: how terrible would it be to set up credentials for review-test against storyboard-dev now and test that integration?
19:33:41 we're building it into the images, adding credentials for it would be fairly trivial
19:34:01 i can give that a go later this week and test it
19:34:07 that would be great, thank you
19:34:29 anyone else have questions or concerns to bring up around the upgrade?
19:35:11 i think where it's likely yo fall apart is around commentlinks mapping to the its actions
19:35:19 er, likely to fall apart
19:37:39 (talking about its-storyboard plugin integration that is)
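For context on where that mapping could break: the its-storyboard integration depends on Gerrit recognizing story/task references in commit messages (via commentlinks and the plugin's association rules) and turning them into StoryBoard actions. The toy sketch below only illustrates that matching step, assuming the usual "Story:"/"Task:" footer convention; it is not the plugin's actual configuration, and the story/task numbers are invented:

```python
#!/usr/bin/env python3
"""Toy illustration of the footer matching that the its-storyboard
integration relies on. The regexes and sample commit message are
stand-ins, not the real commentlink configuration."""

import re

STORY_RE = re.compile(r"^Story:\s*#?(\d+)\s*$", re.MULTILINE)
TASK_RE = re.compile(r"^Task:\s*#?(\d+)\s*$", re.MULTILINE)

SAMPLE_COMMIT_MESSAGE = """\
Upgrade Gerrit to 3.2

Bump the image and adjust the init wrapper.

Story: 2008123
Task: 40815
"""


def extract_refs(message: str) -> dict:
    # These are the references the plugin would need to resolve in order
    # to comment on (or update) the right stories/tasks when a change
    # merges; if this matching changes behavior across the upgrade, the
    # integration silently stops doing anything.
    return {
        "stories": STORY_RE.findall(message),
        "tasks": TASK_RE.findall(message),
    }


if __name__ == "__main__":
    print(extract_refs(SAMPLE_COMMIT_MESSAGE))
    # -> {'stories': ['2008123'], 'tasks': ['40815']}
```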
19:38:29 #topic General topics
19:38:36 #topic PTG Followups
19:38:58 Just a note that I haven't forgotten these, but the time pressure for the gerrit upgrade has me focusing on that (the downside to having all the things happen in a short period of time)
19:39:27 I'm hoping tomorrow will be a "writing" day and I'll get an upgrade plan doc written as well as some of these ptg things and not look at failing jobs or code for a bit
19:39:41 #topic Meetpad not useable from some locations
19:40:01 I brought this up with Horace and he was willing to help us test it, then I completely spaced on it because last week had a very distracting event going on.
19:40:25 I'll try pinging horace this evening (my time) to see if there is a good time to test again
19:40:38 then hopefully we can narrow this down to corporate firewalls or the great firewall etc
19:41:21 #topic Bup and Borg Backups
19:41:30 Wanted to bring this up since there have been recent updates
19:41:48 In particular I think we declared bup bankruptcy on etherpad since /root/.bup was using significant disk
19:42:06 and out of that ianw has landed changes to start running borg on all the hosts we back up
19:42:19 ianw: were you happy with the results of those changes?
19:42:40 i was last night on etherpad
19:42:58 i haven't yet gone through all the other hosts but will today
19:43:26 sounds good
19:43:29 note per our discussion bup is now off on etherpad, because it was filling up the disk
19:43:52 I think the biggest change from what we were doing with bup is that borg requires a bit more opt in to what is backed up rather than backing up all of / with exclusions
19:44:04 (we could set borg to backup / then do exclusions too I suppose)
19:44:16 want to call that out as I tripped over it a few times when reasoning about exclusion list updates and the like
19:45:04 another thing is that the vexxhost backup server has 1tb attached, the rax one 3tb
19:45:08 i think if we set a good policy about where we expect important data/state to reside on our systems and then back up those paths, it's fine
19:45:32 ianw: also have we set the borg settings to do append only backups?
19:45:47 we had called that out as a desirable feature and now I can't recall if we're setting that or not
19:46:15 yes, we run the remote side with --append-only
19:47:05 great, thank you for working on this. Hopefully we end up freeing a lot of local disk that was consumed by /root/.bup as well as handle the python2-less world
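A minimal sketch of that model, with invented paths, hostnames, and repository location: the client explicitly opts in the paths it cares about, and the append-only guarantee comes from the backup server forcing connections through `borg serve --append-only` (e.g. via its authorized_keys command restriction):

```python
#!/usr/bin/env python3
"""Sketch of an opt-in borg backup run as discussed above. Paths, the
backup host, and the repository location are placeholders; the real
deployment is driven by Ansible rather than a standalone script.
"""

import subprocess
from datetime import datetime, timezone

# Opt-in list: only paths we have decided hold important state get
# backed up, instead of bup's "everything under / minus exclusions".
BACKUP_PATHS = ["/etc", "/var/lib/etherpad", "/home/gerrit2/review_site/etc"]

# ssh-style repository URL on the backup server (placeholder host/path).
REPO = "ssh://borg-backup@backup01.example.org/opt/backups/etherpad01"


def run_backup() -> None:
    # One archive per run, named by UTC timestamp.
    archive = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H-%M-%S")
    cmd = ["borg", "create", "--stats", f"{REPO}::{archive}", *BACKUP_PATHS]
    # Even if this client is compromised, it cannot rewrite history: the
    # server runs `borg serve --append-only`, so existing archives stay
    # intact until someone prunes the repository from the server side.
    subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run_backup()
```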
19:48:30 I had a couple other topics (openstackid.org and splitting puppet else up) but I don't think anything has happened on those subjects
19:48:36 #topic Open Discussion
19:49:10 tomorrow is a holiday in many parts of the world which is why I'm hoping I can get away with writing documents :)
19:49:17 if you've got the day off enjoy
19:49:29 ianw: there was some discussion in #zuul this morning related to your pypa zuul work; did you see that? is a tl;dr worthwhile?
19:49:59 corvus: sure, pypa have shown interest in zuul and i've been working to get a proof-of-concept up
19:50:21 oh sorry, i meant do you want me to summarize the #zuul chat? :)
19:50:21 the pull request doing some tox testing is @ https://github.com/pypa/pip/pull/9107
19:50:42 oh, haha, sure
19:51:52 it was suggested that if we pull more stuff out of the run playbook and put it into pre (eg, ensure-tox etc) it would make the console tab more accessible to folks. i think that's relevant in your pr since that job is being defined there. i think avass was going to leave a comment.
19:52:25 building on that, we thought we might look into having zuul default to the console tab rather than the summary tab. (this item is less immediately relevant)
19:53:17 oh right, yeah i pushed a change to do that in that pr
19:53:52 oh inmotionhosting has reached out to me about possibly providing cloud resources to opendev. I've got an introductory call with them tomorrow to start that conversation
19:53:59 the overall theme is if we focus on simplifying the run playbook and present the console tab to users, we can immediately present important information to users, increase the signal/noise ratio, and the output may start to seem a little more familiar to folks using other ci tools.
19:54:22 ianw: cool, then you're probably ahead of me on this, i had to duck out right after that convo. :)
19:54:35 corvus: this is true, as with travis or github, i forget, you get basically your yaml file shown to you in a "console" format
19:54:47 like you click to open up each step and see the logs
19:55:02 ianw: yeah, and we do too, it's just our yaml file is way bigger :)
19:55:50 (and the console tab hides pre/post playbooks by default, so putting "boring" stuff in those is a win for ux [assuming it's appropriate to put them there])
19:56:32 clarkb: neatoh
19:56:40 i'm pretty aware that just using zuul to run tox as 3rd party CI for github isn't a big goal for us ... but i do feel like there's some opportunity to bring pip a little further along here
19:57:03 the tasks like "Run tox testing" which are just role inclusion statements could also be considered noise, i suppose
19:57:19 fungi: yeah, that might be worth a ui re-think
19:57:37 (maybe we can ignore those?)
19:58:13 clarkb: are they a private cloud provider?
19:58:20 or maybe "expandable" task results could be more prominent in the ui somehow
19:58:29 corvus: yup, the brief intro I got was that they could run an openstack private cloud that we would use
19:58:30 besides just the leading > marker
19:59:32 We are just about to our hour time limit. Thank you everyone!
19:59:40 thanks clarkb!
19:59:45 clarkb: thx!
20:00:00 We'll see you here next week. Probably with another focus on gerrit as that'll be a few days before the planned upgrade
20:00:05 probably do a sanity check go no go then too
20:00:10 #endmeeting