19:01:05 <clarkb> #startmeeting infra
19:01:06 <openstack> Meeting started Tue Jul 14 19:01:05 2020 UTC and is due to finish in 60 minutes.  The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:08 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:10 <openstack> The meeting name has been set to 'infra'
19:01:13 <clarkb> #link http://lists.opendev.org/pipermail/service-discuss/2020-July/000056.html Our Agenda
19:01:19 <clarkb> #topic Announcements
19:01:27 <clarkb> OpenDev virtual event #2 happening July 20-22
19:01:50 <clarkb> calling this out as they are using etherpad, but the previous event didn't have any problems with etherpad. I plan to be around and support the service if necessary though
19:01:51 <zbr> o/
19:02:12 <clarkb> also if you are interested in baremetal management that is the topic and you are welcome to join
19:02:57 <clarkb> #topic Actions from last meeting
19:03:05 <clarkb> #link http://eavesdrop.openstack.org/meetings/infra/2020/infra.2020-07-07-19.00.txt minutes from last meeting
19:03:17 <clarkb> ianw: thank you for running last week's meeting when I was out. I didn't see any actions recorded in the minutes.
19:03:26 <clarkb> ianw: is there anything else to add or should we move on to today's topics?
19:03:52 <ianw> nothing, i think move on
19:04:14 <clarkb> #topic Specs approval
19:04:21 <clarkb> #link https://review.opendev.org/#/c/731838/ Authentication broker service
19:04:37 <clarkb> I believe this got a new patchset and I was going to review it, then things got busy before I took a week off
19:04:54 <clarkb> fungi: ^ other than needing reviews anything else to add?
19:05:23 <fungi> nope, there's a minor sequence numbering problem in one of the lists in it, but no major revisions requested yet
19:05:48 <clarkb> great, and a friendly reminder to the rest of us to try and review that spec
19:05:51 <fungi> or counting error i guess
19:05:59 <clarkb> #topic Priority Efforts
19:06:07 <clarkb> #topic Update Config Management
19:06:22 <clarkb> ze01 is running on containers again. We've vendored the gear lib into the ansible role that uses it
19:06:39 <fungi> no new issues seen since?
19:06:40 <clarkb> other than a small hiccup with the vendoring I haven't seen any additional issues related to this
19:07:07 <clarkb> maybe give it another day or two then we should consider updating the remaining executors?
19:07:11 <fungi> any feel for how long we should pilot it before redoing the other 11?
19:07:22 <fungi> ahh, yeah, another day or two sounds fine to me
19:07:49 <clarkb> most of the issues we've hit have been in jobs that don't run frequently, which is why giving it a few days to have those random jobs run on that executor seems like a good idea
19:07:56 <clarkb> but I don't think we need to wait for very long either
19:08:05 <ianw> umm, there is one
19:08:06 <ianw> https://review.opendev.org/#/c/740854/
19:08:21 <clarkb> ah that was related to the executor then?
19:08:30 <clarkb> (I saw the failures were happening but hadn't followed it that closely)
19:08:53 <ianw> yes, the executor writes out the job ssh key in the new openssh format, and it is more picky about whitespace
19:09:30 <clarkb> #link https://review.opendev.org/#/c/740854/ fixes an issue with containerized ze01. Should be landed and confirmed happy before converting more executors
19:09:35 <fungi> ahh, right, specifically because the version of openssh in the container is newer
19:09:37 <clarkb> I'll give that a review after the meeting if no one beats me to it
19:10:05 <fungi> fwiw the reasoning is sound and it's a very small patch, but disappointing default behavior from variable substitution
19:10:39 <fungi> i guess ansible or jinja assumes variables with trailing whitespace are a mistake unless you tell it otherwise
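For context, the default Jinja2 behavior fungi is describing can be reproduced directly. This is a minimal sketch, not necessarily the exact mechanism addressed in 740854, and it assumes the jinja2 package is installed:

```shell
# Jinja2 strips a single trailing newline from rendered template output by
# default, which newer OpenSSH private key formats are picky about:
python3 -c 'import jinja2; print(repr(jinja2.Environment().from_string("key material\n").render()))'
# -> 'key material'

# keep_trailing_newline=True preserves it:
python3 -c 'import jinja2; print(repr(jinja2.Environment(keep_trailing_newline=True).from_string("key material\n").render()))'
# -> 'key material\n'
```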
19:11:13 <clarkb> as far as converting the other 11 goes, I'm not entirely sure what the exact process is there. I think it's something like: stop zuul services, manually remove the systemd units for zuul services, run ansible, start the container. But we'll want to double check that if mordred isn't able to update us
19:11:37 <mordred> ohai - not really here - but here for a sec
19:11:50 <fungi> i'd also be cool waiting for mordred's return to move forward, in case he wants to be involved in the next steps
19:12:08 <mordred> yeah - I think the story when we're happy with ze01 is for each remaining ze to shut down the executor, run ansible to update to docker
19:12:26 <mordred> but I can totally drive that when I'm back online for real
19:12:38 <clarkb> cool that probably gives us a good burn in period for ze01 too
19:13:01 <mordred> yah
19:13:10 <ianw> yeah probably worth seeing if any other weird executor specific behaviour pops up
19:13:20 <fungi> sounds okay to me
19:13:24 <corvus> mordred: eta for your return?
19:14:19 <mordred> I'll be back online enough to work on this on Thursday
19:14:21 <fungi> "chapter 23: the return of mordred"
19:14:53 <mordred> I'll have electricians replacing mains power ... But I have a laptop and phone :)
19:14:54 <fungi> that also fits with the "couple of days" suggestion
19:15:00 <corvus> cool, no rush or anything, just thought if it was going to be > this week maybe i'd start to pick some stuff off, but "wait for mordred" sounds like it'll fit time-wise :)
19:15:13 <clarkb> ya I'm willing to help too, just let me know
19:15:17 <fungi> same
19:15:31 <mordred> Cool. It should be straightforward at this point
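As a rough sketch of the per-executor conversion mordred describes above, something like the following; the playbook name, compose path, and unit cleanup are assumptions rather than anything confirmed in the meeting, and mordred is driving the real process:

```shell
# On the executor being converted (ze02 used as an example):
sudo systemctl stop zuul-executor
sudo systemctl disable zuul-executor
# clarkb mentioned manually removing the old systemd unit as well:
sudo rm -f /etc/systemd/system/zuul-executor.service && sudo systemctl daemon-reload

# From the bastion, re-run the zuul service playbook limited to that host so
# it picks up the docker-compose based deployment (playbook name is a guess):
ansible-playbook -l ze02.openstack.org playbooks/service-zuul.yaml

# Then bring up the containerized executor and watch its logs (compose path
# is a guess at the layout):
sudo docker-compose -f /etc/zuul-executor/docker-compose.yaml up -d
sudo docker-compose -f /etc/zuul-executor/docker-compose.yaml logs -f
```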
19:15:40 <corvus> meanwhile, i'll continue work on the (tangentially related) multi-arch container stuff
19:15:57 <clarkb> corvus: that was the next item on my list of notes related to config management updates
19:16:03 <corvus> cool i'll summarize
19:16:32 <corvus> despite all of our reworking, we're still seeing the "container ends up with wrong arch" problem for the nodepool builder containers
19:16:52 <corvus> we managed to autohold a set of nodes exhibiting the problem reliably
19:17:03 <corvus> (and by reliably, i mean, i can run the build over and over and get the same result)
19:17:20 <corvus> so i should be able to narrow down the problem with that
19:17:37 <corvus> at this point, it's unknown whether it's an artifact of buildx, zuul-registry, or something else
19:17:45 <clarkb> is there any concern that if we were to restart nodepool builders right now they may fail due to a mismatch in the published artifacts?
19:18:32 <corvus> clarkb: 1 sec
19:18:41 <corvus> mordred, ianw: do you happen to have a link to the change?
19:19:35 <ianw> https://review.opendev.org/#/c/726263/ the multi-arch python-builder you mean?
19:19:39 <corvus> yep
19:19:45 <ianw> then yes :)
19:19:48 <corvus> #link https://review.opendev.org/726263 failing multi arch change
19:19:55 <corvus> that's the change with the held nodes
19:20:23 <corvus> (it was +3, but failed in gate with the error; i've modified it slightly to fail the buildset registry in order to complete the autohold)
19:21:16 <corvus> clarkb: i *think* the failure is happening reliably in the build stage, so i think we're unlikely to have a problem restarting with anything published
19:21:37 <clarkb> gotcha, basically if we make it to publishing things have succeeded which implies the arches are mapped properly?
19:21:44 <corvus> we do have published images for both arches, and, tbh, i'm not sure what's actually on them.
19:21:46 * mordred is excited to learn what the issue is
19:21:57 <corvus> are we running both arches in containers at this point?
19:22:15 <mordred> no - arm is still running non-container
19:22:30 <mordred> multi-arch being finished here should let us run the arm builder in containers
19:22:35 <mordred> and stop having differences
19:22:43 <corvus> okay.  then my guess would be that there is a good chance the arm images published may not be arm images.  but i honestly don't know.
19:23:07 <corvus> we should certainly not proceed any further with arm until this is resolved
19:23:15 <clarkb> ++
19:23:22 <mordred> ++
19:23:25 <fungi> noted
19:23:25 <mordred> well
19:23:34 <mordred> we haven't built arm python-base images
19:23:40 <mordred> so any existing arm layers for nodepool-builder are definitely bogus
19:23:55 <mordred> so definitely should not proceed further :)
19:23:58 <clarkb> mordred: even if those layers possibly don't do anything arch specific?
19:24:06 <mordred> they do
19:24:06 <clarkb> like our python-base is just python and bash right?
19:24:08 <clarkb> ah ok
19:24:15 <mordred> they install dumb-init
19:24:18 <mordred> which is arch-specific
19:24:23 <fungi> python is arch-specific
19:24:35 <clarkb> fungi: ya but cpython is on the lower layer
19:24:43 <mordred> yah - but it comes from the base image
19:24:44 <mordred> from docker.io/library/python
19:24:44 <mordred> and is properly arched
19:24:53 <clarkb> .py files in a layer won't care
19:24:59 <clarkb> unless they link to c things
19:25:00 <mordred> but we install at least one arch-specific package in docker.io/opendevorg/python-base
19:25:04 <clarkb> or we install dumb init
19:25:14 <mordred> yah
19:25:15 <fungi> oh, okay, when you said "our python-base is just python and bash right" you meant python scripts, not compiled cpython
19:25:20 <ianw> clarkb: but anything that builds does, i think that was where we saw some issues with gcc at least
19:25:21 <clarkb> fungi: yup
19:25:38 <fungi> i misunderstood, sorry. thought you meant python the interpreter
19:25:45 <mordred> ianw: the gcc issue is actually a symptom of arch mismatch
19:26:19 <mordred> the builder image builds wheels so the base image doesn't have to - but the builder and base layers were out of sync arch-wise - so we saw the base image install trying to gcc something (and failing)
19:26:59 <mordred> (yay for complex issues)
19:27:02 <ianw> i thought so, and the random nature of the return was why it passed check but failed in gate (iirc?)
19:27:15 <mordred> yup
19:27:27 <mordred> thank goodness corvus managed to hold a reproducible env
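One way to sanity-check what the published tags actually contain before any arm builders are switched over; the image named is the docker.io/opendevorg/python-base mentioned above, the commands are illustrative, and `docker manifest inspect` may need the experimental CLI enabled on older docker releases:

```shell
# Show which platforms the published manifest list claims to provide:
docker manifest inspect docker.io/opendevorg/python-base:latest

# Pull the arm64 variant explicitly and check what it reports; x86_64 output
# here would confirm the suspicion that the arm layers are bogus
# (needs a reasonably new docker CLI and qemu binfmt support on x86 hosts):
docker pull --platform linux/arm64 docker.io/opendevorg/python-base:latest
docker run --rm --platform linux/arm64 docker.io/opendevorg/python-base:latest uname -m
```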
19:27:46 <clarkb> anything else on the topic of config management?
19:28:07 <ianw> i have grafana.opendev.org rolled out
19:28:36 <ianw> i'm still working on graphite02.opendev.org and migrating the settings and data correctly
19:29:36 <fungi> might be worth touching on the backup server split, though we can cover that under a later topic if preferred
19:30:03 <clarkb> yup, it's a separate topic
19:30:20 <fungi> cool, let's just lump it into that then
19:30:30 <clarkb> #topic OpenDev
19:30:40 <clarkb> let's talk opendev things really quickly then we can get to some of the things that popped up recently
19:30:49 <clarkb> #link https://review.opendev.org/#/c/740716/ Upgrade to v1.12.2
19:31:21 <clarkb> That change upgrades us to latest gitea. Notably its changelog says it allows you to properly set the default branch on new projects to something other than manster
19:31:46 * fungi changes all his default branches to manster
19:31:48 <clarkb> this isn't something we're using yet, but figuring these things out was noted in https://etherpad.opendev.org/p/opendev-git-branches so upgrading sooner rather than later seems like a good idea
19:32:23 <fungi> yeah, i think it's a good idea to have that in place soon
19:32:26 <clarkb> my fix for the repo list pagination did merge upstream and some future version of gitea should include it. That said the extra check we've got seems like good belts and suspenders
19:32:29 <fungi> surprised nobody's asked for the option yet
19:32:34 <clarkb> that fix is not in v1.12.2
19:33:21 <clarkb> and finally I need to send an email to all of our advisory board volunteers and ask them to sub to service-discuss and service-announce if they haven't already, then see if I can get them to agree on a comms method (I've suggested service-discuss for simplicity)
19:34:19 <clarkb> #topic General Topics
19:34:40 <clarkb> #topic Dealing with Bup indexes and backup server volume migrations and our new backup server
19:34:57 <clarkb> this is the catch all for backup related items. Maybe we should start with what led us into discovering things?
19:35:16 <clarkb> My understanding of it is that zuul01's root disk filled up and this was tracked back to bup's local "git" indexes on zuul01
19:35:31 <clarkb> we rm'ed that entire dir in /root/ but then bup stopped working on zuul01
19:35:56 <clarkb> in investigating the fix for that we discovered our existing volume was nearing full capacity so we rotated out the oldest volume and made it latest on the old backup server
19:36:06 <fungi> probably the biggest things to discuss are that we've discovered it's safe to reinitialize ~root/.bup on backup clients, and that we're halfway through replacing the puppet-managed backups with ansible-managed backups but they use different servers (and redundancy would be swell)
19:36:08 <clarkb> after that ianw pointed out we have a newer backup server which is in use for some servers
19:36:19 <ianw> i had a think about how it ended up like that ...
19:36:34 <corvus> i kinda thought rm'ing the local index should not have caused a problem; it's not clear if it did or not; we didn't spend much time on that since it was time to roll the server side anyway
19:37:09 <clarkb> corvus: I think for review we may not have rotated its remote backup after rm'ing the local index because its remote server was the new server (not the old one). ianw and fungi can probably confirm that though
19:37:28 <clarkb> ianw: fungi ^ maybe let's sort that out first, then talk about the new server?
19:37:49 <fungi> corvus: i did see a (possibly related) problem when i did it on review01... specifically that it ran away with disk (spooling something i think but i couldn't tell where) on the first backup attempt and filled the rootfs and crashed
19:38:12 <corvus> oh i missed the review01 issue
19:38:25 <corvus> it filled up root01's rootfs?
19:38:28 <corvus> er
19:38:29 <fungi> and yeah, when i removed and reinitialized ~root/.bup on review01 i didn't realize we were backing it up to a different (newer) backup server
19:38:31 <corvus> review01's rootfs
19:39:00 <corvus> fungi: what did you do to correct that?
19:39:07 <fungi> then i started the backup without clearing its remote copy on the new backup server, and rootfs space available quickly drained to 0%
19:39:49 <ianw> fungi: is that still the state of things?
19:39:54 <fungi> bup crashed with a python exception due to the enospc, but it immediately freed it all, leading me to suspect it was spooling in unlinked tempfiles
19:40:14 <fungi> which would also explain why i couldn't find them
19:40:31 <clarkb> ya it basically fixed itself afterwards but cacti showed a spike during roughly the time bup was running
19:40:34 <clarkb> then a subsequent run of bup was fine
19:40:43 <fungi> after that, i ran it again, and it used a bit of space on / temporarily but eventually succeeded
19:41:30 <fungi> so really not sure what to make of that
19:41:32 <clarkb> it may be worth doing a test recovery off the current review01 backups (and zuul01?) just to be sure the removal of /root/.bup isn't a problem there
19:41:56 <fungi> it did not exhibit whatever behavior led zuul01 to have two hung/running bup processes started on successive days
19:43:08 <ianw> clarkb: ++ i can take an action item to confirm i can get some data out of those backups if we like
19:43:18 <clarkb> ianw: that would be great, thanks
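A sketch of the kind of spot check ianw volunteered for, run on the backup server; the repo path, save name, and indexed paths here are guesses at the layout, not verified:

```shell
BUP_DIR=/opt/backups/bup-review01/.bup

# List the saves bup knows about in that repo:
bup -d "$BUP_DIR" ls /

# Restore one known file into a scratch directory and eyeball it:
bup -d "$BUP_DIR" restore -C /tmp/restore-test /review01/latest/etc/hostname
cat /tmp/restore-test/hostname
```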
19:43:40 <clarkb> and I think otherwise we continue to monitor it and see if we have disk issues?
19:44:02 <clarkb> ianw: what are we thinking for the server swap itself?
19:44:12 <ianw> yeah, so some history
19:44:51 <ianw> i wrote the ansible roles to install backup users and cron jobs etc, and basically iirc the idea was that as we got rid of puppet everything would pull that in, everything would be on the new server and the old one could be retired
19:45:00 <ianw> however, puppet clearly has a long tail ...
19:45:23 <ianw> which is how we've ended up in a confusing situation for a long time
19:45:39 <ianw> firstly
19:45:42 <fungi> but also we're already ansibling other stuff on all those servers, so the fact that some also get unrelated services configured by puppet should be irrelevant now as far as that goes
19:46:07 <ianw> fungi: yes, that was my next point :)  i don't think that was true, or as true, at the time of the original backup roles
19:46:11 <ianw> so, for now
19:46:16 <fungi> if we can manage user accounts across all of them with ansible then seems like we could manage backups across all of them with ansible too
19:46:27 <ianw> #link https://review.opendev.org/740824 add zuul to backup group
19:46:31 <fungi> yeah, a year ago maybe not
19:46:45 <ianw> we should do that ^ ... zuul dropped the bup:: puppet bits, but didn't pick up the ansible bits
19:46:49 <ianw> then
19:47:04 <ianw> #link https://review.opendev.org/740827 backup all hosts with ansible
19:47:14 <ianw> that *adds* the ansible backup roles to all extant backup hosts
19:47:37 <ianw> so, they will be backing up to the vexxhost server (new, ansible roles) and the rax one (old, puppet roles)
19:47:41 <clarkb> gotcha, so we'll swap over the puppeted hosts too, that way it's less confusing
19:47:57 <ianw> once that is rolled out, we should clean up the puppet side, drop the bup:: bits from them and remove the cron job
19:48:04 <clarkb> oh we'll keep the puppet too? would it be better to have the ansible side configure both the old and new server?
19:48:05 <fungi> and once we do that, build a second new backup server and add it to the role?
19:48:08 <clarkb> and simply remove the puppetry?
19:48:21 <ianw> *then* we should add a second backup server in RAX, add that to the ansible side, and we'll have dual backups
19:48:31 <fungi> yeah, all that sounds fine to me
19:48:33 <clarkb> gotcha
19:48:36 <ianw> yes ... sounds like we agree :)
19:48:42 * fungi makes thumbs-up sign
19:48:47 <clarkb> basically add the second backup server with ansible rather than worry too much about continuing to use the puppeted side of things
19:48:49 <clarkb> wfm
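Once 740824/740827 land, confirming which hosts ansible will manage backups for and applying the roles might look something like the following; the playbook name and inventory layout are guesses:

```shell
# From the bridge/bastion system-config checkout:
ansible backup --list-hosts
ansible-playbook playbooks/service-backup.yaml --limit backup
```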
19:49:09 <clarkb> as a time check we have ~12 minutes and a few more items so I'll keep things moving here
19:49:21 <clarkb> #topic Retiring openstack-infra ML July 15
19:49:36 <clarkb> fungi: I haven't seen any objections to this, are we still a go for that tomorrow?
19:50:56 <fungi> yeah, that's the plan
19:51:04 <fungi> #link https://review.opendev.org/739152 Forward openstack-infra ML to openstack-discuss
19:51:18 <fungi> i'll be approving that tomorrow, preliminary reviews appreciated
19:51:34 <fungi> i've also got a related issue
19:52:37 <fungi> in working on a mechanism for analyzing mailing list volume/activity for our engagement statistics i've remembered that we'd never gotten around to coming up with a means of providing links to the archives for retired mailing lists
19:53:13 <fungi> and mailman 2.x doesn't have a web api really
19:53:28 <fungi> or more specifically pipermail which does the archive presentation
19:53:45 <clarkb> the archives are still there if you know the urls though iirc. Maybe a basic index page we can link to somewhere?
19:54:07 <fungi> basically once these are deleted, mailman no longer knows about the lists but pipermail-generated archives for them continue to exist and be served if you know the urls
19:54:46 <fungi> at the moment there are 24 (25 tomorrow) retired mailing lists on domains we host, and they're all on the lists.openstack.org domain so far but eventually there will be others
19:55:27 <fungi> i don't know if we should just manually add links to retired list archives in the html templates for each site (there is a template editor in the webui though i've not really played with it)
19:55:51 <clarkb> each site == mailman list?
19:55:52 <fungi> or if we should run some cron/ansible/zuul automation to generate a static list of them and publish it somewhere discoverable
19:56:13 <fungi> sites are like lists.opendev.org, lists.zuul-ci.org, et cetera
19:56:18 <clarkb> ah
19:56:27 <clarkb> that seems reasonable to me because it is where people will go looking for it
19:56:32 <clarkb> but I'm not sure how automatable that is
19:56:59 <fungi> yeah, i'm mostly just bringing this up now to say i'm open to suggestions outside the meeting (so as not to take up any more of the hour)
19:57:07 <clarkb> ++
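A back-of-the-envelope sketch of the "generate a static list" idea fungi floats above; the list names are examples of already-retired lists, the output path is arbitrary, and the real multi-site hosts and the source of truth for "retired" would differ:

```shell
# Pipermail keeps serving /pipermail/<list>/ even after the list is deleted,
# so an index page only needs the names of the retired lists:
{
  echo '<html><body><h1>Archives of retired mailing lists</h1><ul>'
  for list in openstack-dev openstack-operators openstack-infra; do
    echo "<li><a href=\"/pipermail/${list}/\">${list}</a></li>"
  done
  echo '</ul></body></html>'
} > /var/www/retired-lists/index.html
```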
19:57:14 <clarkb> #topic China telecom blocks
19:57:24 <fungi> i'll keep this short
19:57:27 <clarkb> AIUI we removed the blocks and did not need to switch to ianw's UA based filtering?
19:57:55 <fungi> we dropped the temporary firewall rules (see opendev-announce ml archives for date and time) once the background activity dropped to safe levels
19:58:34 <fungi> it could of course reoccur, or something like it, at any time. no guarantees it would be from the same networks/providers either
19:58:50 <clarkb> we've landed the apache filtration code now though right?
19:58:55 <fungi> so i do still think ianw's solution is a good one to keep in our back pocket
19:59:03 <clarkb> so our response in the future can be to switch to the apache port in haproxy configs?
19:59:24 <fungi> yes, the plumbing is in place we just have to turn it on and configure it
19:59:32 <ianw> yeah, i think it's probably good we have the proxy option up our sleeve if we need those layer 7 blocks
19:59:43 <ianw> touch wood, never need it
19:59:48 <clarkb> ++
19:59:52 <fungi> but of course if we don't exercise it, then it's at risk of bitrot as well so we should be prepared to have to fix something with it
19:59:56 <clarkb> are there any changes needed to finish that up so it is ready if we need it?
20:00:08 <clarkb> or are we in the state where it's in our attic and good to go when necessary?
20:00:09 <ianw> fungi: it is enabled and tested during the gate testing runs
20:00:32 <clarkb> (we are at time now but have one last thing to bring up)
20:00:34 <ianw> gate testing runs for gitea
20:00:44 <fungi> yeah, hopefully that mitigates the bitrot risk then
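For illustration only, a mod_rewrite rule of the general kind such a layer-7 filter could use, plus a way to exercise it so it doesn't bitrot. This is not the actual system-config implementation (which wires the apache port into the haproxy/gitea roles instead), and the User-Agent string is made up:

```shell
# Drop in a server-level rewrite rule returning 403 for a given User-Agent:
sudo tee /etc/apache2/conf-available/ua-filter.conf <<'EOF' >/dev/null
RewriteEngine On
RewriteCond "%{HTTP_USER_AGENT}" "ExampleBadBot" [NC]
RewriteRule "." "-" [F,L]
EOF
sudo a2enmod rewrite && sudo a2enconf ua-filter && sudo systemctl reload apache2

# A matching request against the affected vhost should now get a 403:
curl -sI -A 'ExampleBadBot/1.0' http://localhost/ | head -1
```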
20:01:14 <clarkb> #topic Project Renames
20:01:24 <clarkb> There are a couple of renames requested now.
20:01:56 <clarkb> I'm already feeling a bit swamped this week just catching up on things and making progress on items that I was pushing on
20:02:09 <clarkb> makes me think that July 24 may be a good option for rename outage
20:02:46 <clarkb> if I can get at least one other set of eyeballs for that I'll go ahead and announce it. We're at time so don't need to have that answer right now but let me know if you can help
20:02:55 <fungi> the opendev hardware automation conference finishes on the 22nd, so i can swing the 24th
20:02:57 <clarkb> (we've largely automated that whole process now which is cool)
20:03:01 <clarkb> fungi: thanks
20:03:13 <clarkb> Thanks everyone!
20:03:17 <clarkb> #endmeeting