19:01:39 <clarkb> #startmeeting infra
19:01:41 <openstack> Meeting started Tue Nov 10 19:01:39 2020 UTC and is due to finish in 60 minutes.  The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:42 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:44 <openstack> The meeting name has been set to 'infra'
19:01:55 <corvus> o/
19:02:00 <fungi> ohai
19:02:25 <clarkb> #link http://lists.opendev.org/pipermail/service-discuss/2020-November/000134.html Our Agenda
19:02:34 <ianw> o/
19:03:18 <clarkb> #topic Announcements
19:03:27 <diablo_rojo__> o/
19:03:27 <clarkb> Wallaby cycle signing key has been activated https://review.opendev.org/760364
19:03:32 <clarkb> Please sign if you haven't yet https://docs.opendev.org/opendev/system-config/latest/signing.html
19:03:35 <diablo_rojo__> o/
19:03:36 <clarkb> I should find time to do that
19:04:26 <fungi> as long as we have at least a few folks attesting to it, that should be fine. the previous key has also published a signature for it anyway
19:05:03 <clarkb> #topic Actions from last meeting
19:05:09 <clarkb> #link http://eavesdrop.openstack.org/meetings/infra/2020/infra.2020-11-03-19.01.txt minutes from last meeting
19:05:14 <clarkb> There were no recorded actions
19:05:23 <clarkb> #topic Priority Efforts
19:05:28 <clarkb> #topic Update Config Management
19:05:53 <clarkb> I believe we have an update on mirror-update.opendev.org from ianw and fungi? The reprepro stuff has been converted to ansible and the old puppeted server is no more?
19:06:10 <fungi> that sounds right to me
19:06:25 <ianw> yes, all done now, i've removed the old server so it's all opendev.org, all the time :)
19:06:37 <clarkb> excellent, thank you for working on that.
19:06:47 <clarkb> Has the change to do vos release via ssh landed?
19:07:05 <ianw> yes, i haven't double checked all the runs yet this morning, but the ones i saw last night looked good
19:07:36 <fungi> 758695 merged and was deployed by 05:12:16
19:07:43 <clarkb> cool. Are there any other puppet conversions to call out?
19:07:56 <fungi> so in theory any mirror pulses starting after that time should have used it
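[illustrative aside: a minimal sketch of the ssh-triggered "vos release" flow discussed above; the fileserver hostname, volume name, and key path are assumptions for the example, not the values actually used in system-config]

    #!/usr/bin/env python3
    """Sketch: run an AFS "vos release" over ssh after a mirror sync completes.

    Running the release on the fileserver itself lets it use -localauth,
    so the mirror-update host does not need an AFS admin keytab.
    """
    import subprocess
    import sys

    AFS_SERVER = "afs01.example.opendev.org"   # hypothetical fileserver hostname
    VOLUME = "mirror.ubuntu"                    # hypothetical mirror volume name

    def vos_release(volume: str) -> int:
        # ssh to the fileserver and release the read-only replicas there.
        cmd = ["ssh", "-i", "/root/.ssh/id_vos", AFS_SERVER,
               "vos", "release", volume, "-localauth"]
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            print(result.stderr, file=sys.stderr)
        return result.returncode

    if __name__ == "__main__":
        sys.exit(vos_release(VOLUME))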
19:09:10 <ianw> umm you saw the thing about the afs puppet jobs
19:09:20 <ianw> i think they have just been broken for ... a long time?
19:09:33 <clarkb> ianw: yup I've pushed up a few changes/patchsets to try and fix the testing on that change
19:09:43 <clarkb> and yes I expect that has always been broken
19:09:51 <clarkb> just more noticeable now due to the symlink thing
19:10:09 <clarkb> ianw: if my patches don't work then maybe we should ignore e208 for now in order to get the puppetry happy
19:10:09 <ianw> ok, i think afs is my next challenge to get updated
19:10:09 <fungi> grafana indicates ~current state (all <4hr old) for our package mirrors
19:11:11 <clarkb> #topic OpenDev
19:11:31 <clarkb> Preparations for a gerrit 3.2 upgrade are ramping up again
19:11:41 <clarkb> #link http://lists.opendev.org/pipermail/service-announce/2020-October/000012.html Our announcement for the November 20 - 22 upgrade window
19:12:00 <clarkb> fungi and I have got review-test upgraded from a ~november 5 prod state
19:12:15 <clarkb> The server should be up and usable for testing and other interactions
19:12:29 <fungi> yep, fully upgraded to 3.2
19:12:34 <clarkb> ianw: we are hoping that will help with your jeepyb testing
19:12:36 <fungi> also usable for demonstrating the ui
19:13:04 <ianw> ahh yes, i can play with the api and see if we can replicate the jeepyb things
19:13:11 <fungi> and it sounds like the 3.3 release is coming right about the time we planned to upgrade to 3.2, so we should probably plan a separate 3.3 upgrade soon after?
19:13:33 <clarkb> fungi: I think once we've settled then ya 3.3 should happen quickly after
19:13:37 <clarkb> I think we've basically decided that the surrogate gerrit idea is neat, but introduces a bit of complexity in knowing what needs to be synced back and forth to end up with a valid upgrade path that way.
19:13:40 <fungi> i suppose we can keep review-test around to also test the 3.3 upgrade if we want
19:13:56 <clarkb> fungi and I did discover that giving the notedb conversion more threads sped up that process. Still not short but noticeably quicker
19:14:03 <clarkb> we gave it 50% more threads and it ran 40% quicker
19:14:13 <clarkb> I think we plan to double the default thread count when we do the production upgrade
19:14:27 <fungi> we might be able to speed it up a bit more still too, though i don't expect much below 4 hours to complete the notedb migration step
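[illustrative aside: a rough sketch of the offline NoteDb migration step with a larger thread count, as discussed above; the paths and the thread count are assumptions, and the exact option name should be verified against the migrate-to-note-db help output for the Gerrit version being upgraded]

    #!/usr/bin/env python3
    """Sketch: run the offline NoteDb migration with more worker threads."""
    import multiprocessing
    import subprocess

    SITE_PATH = "/home/gerrit2/review_site"    # hypothetical Gerrit site path
    GERRIT_WAR = "/var/gerrit/bin/gerrit.war"  # hypothetical war location

    def migrate_to_notedb() -> None:
        # An assumed larger-than-default thread count for the example;
        # more threads noticeably shortened the hours-long migration in testing.
        threads = multiprocessing.cpu_count() * 2
        subprocess.run(
            ["java", "-jar", GERRIT_WAR, "migrate-to-note-db",
             "-d", SITE_PATH, "--threads", str(threads)],
            check=True,
        )

    if __name__ == "__main__":
        migrate_to_notedb()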
19:14:51 <clarkb> There are a couple of things that have popped up that I wanted to bring up for wider discussion.
19:15:02 <clarkb> The first is that we have confirmed that gerrit does not like updating accounts if they don't have an email set
19:15:08 <clarkb> #link https://bugs.chromium.org/p/gerrit/issues/detail?id=13654
19:15:08 <fungi> if we budget ~4 hours on the upgrade plan, i guess we can see where that would leave us in possible timelines
19:15:42 <fungi> oh, yeah, that email-less account behavior strikes me as a bug
19:15:53 <clarkb> I've filed that upstream bug due to that weird account management behavior. You can create an internal account just fine without an email address, but you cannot then update that account's ssh keys
19:16:08 <clarkb> you also can't use a duplicate email address across accounts
19:16:17 <fungi> you're allowed to create accounts with no e-mail address, but adding an ssh key to one after the fact throws a weird backtrace into the logs and responds "unavailable"
19:16:22 <fungi> so probably just a regression
19:16:28 <clarkb> this means that if we need to update our admin accounts we may need to set a unique email address on them :/
19:16:34 <corvus> we can set "infra-root+foobar" as email for our admin accounts
19:16:53 <fungi> yeah, that seems like a reasonable workaround
19:16:57 <clarkb> ah cool
19:17:04 <ianw> ++
19:17:04 <clarkb> fungi: ^ we should probably test that gerrit treats those as unique?
19:17:09 <corvus> or i guess probably our own email addresses depending on hosting provider
19:17:22 <fungi> since rackspace's e-mail system we're using does support + addresses as automatic aliases
19:17:22 <corvus> gmail supports it iirc
19:17:25 <clarkb> and hopefully newer gerrit will just fix the problem
19:17:36 <corvus> (and my exim/cyrus does)
19:17:48 <fungi> i'm happy to test that gerrit sees those addresses as unique, but i can pretty well guarantee it will
19:18:01 <clarkb> fungi: thanks
19:18:14 <clarkb> that seems like a quick and easy fix so we probably don't need to get into it much further
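[illustrative aside: a minimal sketch, assuming placeholder account IDs, hostname, and credentials, of setting unique plus-addressed emails on admin accounts via the Gerrit REST API (PUT /a/accounts/{id}/emails/{email})]

    #!/usr/bin/env python3
    """Sketch: add distinct plus-addressed emails to two admin accounts."""
    import requests

    GERRIT = "https://review-test.example.org"   # hypothetical test server
    AUTH = ("admin-user", "http-password")        # Gerrit HTTP credentials (placeholder)

    def add_email(account_id: int, email: str) -> int:
        # Admins can skip the confirmation mail with no_confirmation;
        # the '+' in the address may need percent-encoding depending on the client.
        resp = requests.put(
            f"{GERRIT}/a/accounts/{account_id}/emails/{email}",
            json={"email": email, "no_confirmation": True},
            auth=AUTH,
        )
        return resp.status_code

    if __name__ == "__main__":
        # Two accounts, two unique plus addresses pointing at the same mailbox.
        for account, email in [(1000001, "infra-root+alice@example.org"),
                               (1000002, "infra-root+bob@example.org")]:
            print(account, email, add_email(account, email))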
19:18:29 <clarkb> The other thing I wanted to bring up was the set of changes that I've prepped in relation to the upgrade
19:18:39 <clarkb> #link https://review.opendev.org/#/q/status:open+topic:gerrit-upgrade-prep Changes for before and after the upgrade
19:18:49 <clarkb> A number of those should be safe to land today and they do not have a WIP
19:18:52 <fungi> the main reason i would want to avoid having to put e-mail addresses on those accounts is it's just one more thing which can tab complete to them in the gerrit ui and confuse users
19:19:19 <clarkb> Another chunk reflects state after the upgrade; those are WIP because we shouldn't land them yet
19:19:43 <clarkb> It would be great if we could get reviews on the lot of them to sanity check things as well as land as much as we can today
19:19:49 <clarkb> (or $day before the upgrade)
19:20:19 <clarkb> One specific concern I've got is there are ~4 system-config changes that sort of all need to land together because they reflect post-upgrade system state, but zuul will run them in sequence
19:20:38 <clarkb> so I'm wondering how should we manipulate zuul during/after the upgrade to safely run those updates against the updated state
19:20:40 <corvus> fungi: good point, we should probably avoid "corvus+admin"; infra-root+corvus is better due to tab-complete
19:21:17 <clarkb> https://review.opendev.org/#/c/757155/ https://review.opendev.org/#/c/757625/ https://review.opendev.org/#/c/757156/ https://review.opendev.org/#/c/757176/ are the 4 changes I've identified in this situation
19:21:18 <corvus> clarkb: bracket with disable/enable jobs change?
19:22:04 <clarkb> corvus: ya so I think our options are: disable then enable the jobs entirely, force merge them all before zuul starts, squash them and set it up so that a single job running is fine
19:22:41 <clarkb> one concern with disabling the jobs then enabling them is I worry I won't manage to sufficiently disable the job since we trigger them in a number of places. But that concern may just be mitigated with sufficient grepping
19:23:08 <corvus> i agree and force-merge or squashing means less time spinning wheels
19:23:10 <clarkb> just before the meeting I discovered that jeepyb wasn't running the gerrit 3.1 and 3.2 image builds, as an example of where we've missed things like that previously
19:24:04 <fungi> i'm good with squashing, those changes aren't massive
19:24:23 <fungi> and they're all for the same repo
19:24:26 <clarkb> the changes in system-config that trail the ones I've listed above should all be safe to land as after the fact cleanups
19:24:56 <clarkb> Another concern I had: based on testing I expect gitea replication to take a day and a half or so. I don't think we rely on gitea state for our zuul jobs that run ansible, but if we do anywhere, can you call that out?
19:25:09 <clarkb> because that is another syncing of the world step that may impact our automated deployments
19:25:44 <clarkb> but ya if people can review those changes and think about them from a perspective of how do we land them safely post upgrade that would be great. I'm open to feedback and ideas
19:26:14 <clarkb> I'm hoping to write up a concrete upgrade plan doc soon (starting tomorrow likely) and we can start to fill in those details
19:26:42 <clarkb> at this point I think my biggest concern with the upgrade revolves around how do we turn zuul back on safely :)
19:26:58 <corvus> the gitea replication lag will probably confuse folks cloning or pulling changes (or using gertty)
19:27:17 <corvus> but it's happened before, so i think if we include that in the announcement folks can deal
19:27:56 <fungi> this is also why even if we can get stuff done on saturday we need to say the maintenance is through sunday
19:28:10 <clarkb> fungi: yup and we have done that
19:28:11 <fungi> (or early monday as we've been communicating so far)
19:28:27 <clarkb> another thought that occurred to me when writing https://review.opendev.org/#/c/762191/1 earlier today is that it feels like we're effectively abandoning review-dev
19:28:56 <clarkb> Should we try to upgrade review-dev or decide it doesn't work well for us anymore and we need something like review-test going forward?
19:29:06 <clarkb> I'm hopeful that zuul jobs can fit in there too
19:29:17 <fungi> i had assumed, perhaps incorrectly, that we wouldn't really need review-dev going forward
19:29:39 <clarkb> fungi: fwiw I don't think that is incorrect, mostly just me realizing today "Oh ya we still have review-dev and these changes will make it sad"
19:29:47 <clarkb> I think that is ok if one of the todo items here is retire review-dev
19:29:57 <clarkb> we can put it in the emergency file in the interim
19:30:18 <clarkb> review-test with prod like data has been way more valuable imo
19:30:23 <fungi> our proliferation of -dev servers predates our increased efficiency at standing up test servers on demand, or even as part of ci
19:30:48 <fungi> and at some point they become more of a maintenance burden than a benefit
19:32:14 <corvus> clarkb: ++
19:32:34 <clarkb> ok /me adds put review-dev in stasis to the list
19:33:04 <clarkb> The last thing on my talk about gerrit list is that storyboard is still an unknown
19:33:14 <clarkb> its-storyboard may or may not work is the more specific way of saying that
19:33:28 <clarkb> fungi: how terrible would it be to set up credentials for review-test against storyboard-dev now and test that integration?
19:33:41 <fungi> we're building it into the images, adding credentials for it would be fairly trivial
19:34:01 <fungi> i can give that a go later this week and test it
19:34:07 <clarkb> that would be great, thank you
19:34:29 <clarkb> anyone else have questions or concerns to bring up around the upgrade?
19:35:11 <fungi> i think where it's likely to fall apart is around commentlinks mapping to the its actions
19:37:39 <fungi> (talking about its-storyboard plugin integration that is)
19:38:29 <clarkb> #topic General topics
19:38:36 <clarkb> #topic PTG Followups
19:38:58 <clarkb> Just a note that I haven't forgotten these, but the time pressure for the gerrit upgrade has me focusing on that (the downside to having all the things happen in a short period of time)
19:39:27 <clarkb> I'm hoping tomorrow will be a "writing" day and I'll get an upgrade plan doc written as well as some of these ptg things and not look at failing jobs or code for a bit
19:39:41 <clarkb> #topic Meetpad not useable from some locations
19:40:01 <clarkb> I brought this up with Horace and he was willing to help us test it, then I completely spaced on it because last week had a very distracting event going on.
19:40:25 <clarkb> I'll try pinging horace this evening (my time) to see if there is a good time to test again
19:40:38 <clarkb> then hopefully we can narrow this down to corporate firewalls or the great firewall etc
19:41:21 <clarkb> #topic Bup and Borg Backups
19:41:30 <clarkb> Wanted to bring this up since there have been recent updates
19:41:48 <clarkb> In particular I think we declared bup bankruptcy on etherpad since /root/.bup was using significant disk
19:42:06 <clarkb> and out of that ianw has landed changes to start running borg on all the hosts we back up
19:42:19 <clarkb> ianw: were you happy with the results of those changes?
19:42:40 <ianw> i was last night on etherpad
19:42:58 <ianw> i haven't yet gone through all the other hosts but will today
19:43:26 <clarkb> sounds good
19:43:29 <ianw> note per our discussion bup is now off on etherpad, because it was filling up the disk
19:43:52 <clarkb> I think the biggest change from what we were doing with bup is that borg requires a bit more opt in to what is backed up rather than backing up all of / with exclusions
19:44:04 <clarkb> (we could set borg to backup / then do exclusions too I suppose)
19:44:16 <clarkb> want to call that out as I tripped over it a few times when reasoning about exclusion list updates and the like
19:45:04 <ianw> another thing is that the vexxhost backup server has 1tb attached, the rax one 3tb
19:45:08 <fungi> i think if we set a good policy about where we expect important data/state to reside on our systems and then back up those paths, it's fine
19:45:32 <clarkb> ianw: also have we set the borg settings to do append only backups?
19:45:47 <clarkb> we had called that out as a desirable feature and now I can't recall if we're setting that or not
19:46:15 <ianw> yes, we run the remote side with --append-only
19:47:05 <clarkb> great, thank you for working on this. Hopefully we end up freeing a lot of local disk that was consumed by /root/.bup as well as handling the python2-less world
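[illustrative aside: a minimal sketch of a per-host borg run against an append-only remote; the hostnames, repository path, and opt-in path list are assumptions for the example, the real values come from the backup roles in system-config]

    #!/usr/bin/env python3
    """Sketch: back up an opt-in list of paths to a remote borg repository."""
    import datetime
    import subprocess

    # The server side enforces append-only by forcing the client's ssh key to run
    # something like: borg serve --append-only --restrict-to-path /opt/backups/<host>
    REPO = "ssh://borg-backup@backup01.example.org/opt/backups/review01"
    PATHS = ["/etc", "/home/gerrit2/review_site/etc", "/var/backups"]  # opt-in paths

    def run_backup() -> None:
        # One archive per run, named by timestamp.
        archive = datetime.datetime.utcnow().strftime("%Y-%m-%dT%H:%M:%S")
        subprocess.run(
            ["borg", "create", "--compression", "lz4",
             f"{REPO}::{archive}", *PATHS],
            check=True,
        )

    if __name__ == "__main__":
        run_backup()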
19:48:30 <clarkb> I had a couple other topics (openstackid.org and splitting "puppet else" up) but I don't think anything has happened on those subjects
19:48:36 <clarkb> #topic Open Discussion
19:49:10 <clarkb> tomorrow is a holiday in many parts of the world which is why I'm hoping I can get away with writing documents :)
19:49:17 <clarkb> if you've got the day off enjoy
19:49:29 <corvus> ianw: there was some discussion in #zuul this morning related to your pypa zuul work; did you see that?  is a tl;dr worthwhile?
19:49:59 <ianw> corvus: sure, pypa have shown interest in zuul and i've been working to get a proof-of-concept up
19:50:21 <corvus> oh sorry, i meant do you want me to summarize the #zuul chat? :)
19:50:21 <ianw> the pull request doing some tox testing is @ https://github.com/pypa/pip/pull/9107
19:50:42 <ianw> oh, haha, sure
19:51:52 <corvus> it was suggested that if we pull more stuff out of the run playbook and put it into pre (eg, ensure-tox etc) it would make the console tab more accessible to folks.  i think that's relevant in your pr since that job is being defined there.  i think avass was going to leave a comment.
19:52:25 <corvus> building on that, we thought we might look into having zuul default to the console tab rather than the summary tab.  (this item is less immediately relevant)
19:53:17 <ianw> oh right, yeah i pushed a change to do that in that pr
19:53:52 <clarkb> oh inmotionhosting has reached out to me about possibly providing cloud resources to opendev. I've got an introductory call with them tomorrow to start that conversation
19:53:59 <corvus> the overall theme is if we focus on simplifying the run playbook and present the console tab to users, we can immediately present important information to users, increase the signal/noise ratio, and the output may start to seem a little more familiar to folks using other ci tools.
19:54:22 <corvus> ianw: cool, then you're probably ahead of me on this, i had to duck out right after that convo.  :)
19:54:35 <ianw> corvus: this is true, as with travis or github, i forget, you get basically your yaml file shown to you in a "console" format
19:54:47 <ianw> like you click to open up each step and see the logs
19:55:02 <corvus> ianw: yeah, and we do too, it's just our yaml file is way bigger :)
19:55:50 <corvus> (and the console tab hides pre/post playbooks by default, so putting "boring" stuff in those is a win for ux [assuming it's appropriate to put them there])
19:56:32 <corvus> clarkb: neatoh
19:56:40 <ianw> i'm pretty aware that just using zuul to run tox as 3rd party CI for github isn't a big goal for us ... but i do feel like there's some opportunity to bring pip a little further along here
19:57:03 <fungi> the tasks like "Run tox testing" which are just role inclusion statements could also be considered noise, i suppose
19:57:19 <corvus> fungi: yeah, that might be worth a ui re-think
19:57:37 <corvus> (maybe we can ignore those?)
19:58:13 <corvus> clarkb: are they a private cloud provider?
19:58:20 <fungi> or maybe "expandable" task results could be more prominent in the ui somehow
19:58:29 <clarkb> corvus: yup, the brief intro I got was that they could run an openstack private cloud that we would use
19:58:30 <fungi> besides just the leading > marker
19:59:32 <clarkb> We are just about at our hour time limit. Thank you everyone!
19:59:40 <fungi> thanks clarkb!
19:59:45 <corvus> clarkb: thx!
20:00:00 <clarkb> We'll see you here next week. Probably with another focus on gerrit as that'll be a few days before the planned upgrade
20:00:05 <clarkb> probably do a sanity check go no go then too
20:00:10 <clarkb> #endmeeting