19:01:05 #startmeeting infra
19:01:06 Meeting started Tue Jul 14 19:01:05 2020 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:08 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:10 The meeting name has been set to 'infra'
19:01:13 #link http://lists.opendev.org/pipermail/service-discuss/2020-July/000056.html Our Agenda
19:01:19 #topic Announcements
19:01:27 OpenDev virtual event #2 happening July 20-22
19:01:50 calling this out as they are using etherpad, but the previous event didn't have any problems with etherpad. I plan to be around and support the service if necessary though
19:01:51 o/
19:02:12 also if you are interested in baremetal management that is the topic and you are welcome to join
19:02:57 #topic Actions from last meeting
19:03:05 #link http://eavesdrop.openstack.org/meetings/infra/2020/infra.2020-07-07-19.00.txt minutes from last meeting
19:03:17 ianw: thank you for running last week's meeting when I was out. I didn't see any actions recorded on the minutes.
19:03:26 ianw: is there anything else to add or should we move on to today's topics?
19:03:52 nothing, i think move on
19:04:14 #topic Specs approval
19:04:21 #link https://review.opendev.org/#/c/731838/ Authentication broker service
19:04:37 I believe this got a new patchset and I was going to review it, then things got busy before I took a week off
19:04:54 fungi: ^ other than needing reviews anything else to add?
19:05:23 nope, there's a minor sequence numbering problem in one of the lists in it, but no major revisions requested yet
19:05:48 great, and a friendly reminder to the rest of us to try and review that spec
19:05:51 or counting error i guess
19:05:59 #topic Priority Efforts
19:06:07 #topic Update Config Management
19:06:22 ze01 is running on containers again. We've vendored the gear lib into the ansible role that uses it
19:06:39 no new issues seen since?
19:06:40 other than a small hiccup with the vendoring I haven't seen any additional issues related to this
19:07:07 maybe give it another day or two then we should consider updating the remaining executors?
19:07:11 any feel for how long we should pilot it before redoing the other 11?
19:07:22 ahh, yeah, another day or two sounds fine to me
19:07:49 most of the issues we've hit have been in jobs that don't run frequently, which is why giving it a few days to have those random jobs run on that executor seems like a good idea
19:07:56 but I don't think we need to wait for very long either
19:08:05 umm, there is one
19:08:06 https://review.opendev.org/#/c/740854/
19:08:21 ah that was related to the executor then?
19:08:30 (I saw the failures were happening but hadn't followed it that closely)
19:08:53 yes, the executor writes out the job ssh key in the new openssh format, and it is more picky about whitespace
19:09:30 #link https://review.opendev.org/#/c/740854/ fixes an issue with containerized ze01. Should be landed and confirmed happy before converting more executors
19:09:35 ahh, right, specifically because the version of openssh in the container is newer
19:09:37 I'll give that a review after the meeting if no one beats me to it
19:10:05 fwiw the reasoning is sound and it's a very small patch, but disappointing default behavior from variable substitution
19:10:39 i guess ansible or jinja assumes variables with trailing whitespace are a mistake unless you tell it otherwise
19:11:13 as far as converting the other 11 goes, I'm not entirely sure what the exact process is there. I think it's something like stop zuul services, manually remove systemd units for zuul services, run ansible, start container, but we'll want to double check that if mordred isn't able to update us
19:11:37 ohai - not really here - but here for a sec
19:11:50 i'd also be cool waiting for mordred's return to move forward, in case he wants to be involved in the next steps
19:12:08 yeah - I think the story when we're happy with ze01 is for each remaining ze to shut down the executor, run ansible to update to docker
19:12:26 but I can totally drive that when I'm back online for real
19:12:38 cool that probably gives us a good burn-in period for ze01 too
19:13:01 yah
19:13:10 yeah probably worth seeing if any other weird executor specific behaviour pops up
19:13:20 sounds okay to me
19:13:24 mordred: eta for your return?
19:14:19 I'll be back online enough to work on this on Thursday
19:14:21 "chapter 23: the return of mordred"
19:14:53 I'll have electricians replacing mains power ... But I have a laptop and phone :)
19:14:54 that also fits with the "couple of days" suggestion
19:15:00 cool, no rush or anything, just thought if it was going to be > this week maybe i'd start to pick some stuff off, but "wait for mordred" sounds like it'll fit time-wise :)
19:15:13 ya I'm willing to help too, just let me know
19:15:17 same
19:15:31 Cool. It should be straightforward at this point
19:15:40 meanwhile, i'll continue work on the (tangentially related) multi-arch container stuff
19:15:57 corvus: that was the next item on my list of notes related to config management updates
19:16:03 cool i'll summarize
19:16:32 despite all of our reworking, we're still seeing the "container ends up with wrong arch" problem for the nodepool builder containers
19:16:52 we managed to autohold a set of nodes exhibiting the problem reliably
19:17:03 (and by reliably, i mean, i can run the build over and over and get the same result)
19:17:20 so i should be able to narrow down the problem with that
19:17:37 at this point, it's unknown whether it's an artifact of buildx, zuul-registry, or something else
19:17:45 is there any concern that if we were to restart nodepool builders right now they may fail due to a mismatch in the published artifacts?
19:18:32 clarkb: 1 sec
19:18:41 mordred, ianw: do you happen to have a link to the change?
19:19:35 https://review.opendev.org/#/c/726263/ the multi-arch python-builder you mean?
19:19:39 yep
19:19:45 then yes :)
19:19:48 #link https://review.opendev.org/726263 failing multi arch change
19:19:55 that's the change with the held nodes
19:20:23 (it was +3, but failed in gate with the error; i've modified it slightly to fail the buildset registry in order to complete the autohold)
19:21:16 clarkb: i *think* the failure is happening reliably in the build stage, so i think we're unlikely to have a problem restarting with anything published
19:21:37 gotcha, basically if we make it to publishing things have succeeded, which implies the arches are mapped properly?
19:21:44 we do have published images for both arches, and, tbh, i'm not sure what's actually on them.
19:21:46 * mordred is excited to learn what the issue is
19:21:57 are we running both arches in containers at this point?
19:22:15 no - arm is still running non-container
19:22:30 multi-arch being finished here should let us run the arm builder in containers
19:22:35 and stop having differences
19:22:43 okay. then my guess would be that there is a good chance the arm images published may not be arm images. but i honestly don't know.
19:23:07 we should certainly not proceed any further with arm until this is resolved
19:23:15 ++
19:23:22 ++
19:23:25 noted
19:23:25 well
19:23:34 we haven't built arm python-base images
19:23:40 so any existing arm layers for nodepool-builder are definitely bogus
19:23:55 so definitely should not proceed further :)
19:23:58 mordred: even if those layers possibly don't do anything arch specific?
19:24:06 they do
19:24:06 like our python-base is just python and bash right?
19:24:08 ah ok
19:24:15 they install dumb-init
19:24:18 which is arch-specific
19:24:23 python is arch-specific
19:24:35 fungi: ya but cpython is on the lower layer
19:24:43 yah - but it comes from the base image
19:24:44 from docker.io/library/python
19:24:44 and is properly arched
19:24:53 .py files in a layer won't care
19:24:59 unless they link to c things
19:25:00 but we install at least one arch-specific package in docker.io/opendevorg/python-base
19:25:04 or we install dumb-init
19:25:14 yah
19:25:15 oh, okay, when you said "our python-base is just python and bash right" you meant python scripts, not compiled cpython
19:25:20 clarkb: but anything that builds does, i think that was where we saw some issues with gcc at least
19:25:21 fungi: yup
19:25:38 i misunderstood, sorry. thought you meant python the interpreter
19:25:45 ianw: the gcc issue is actually a symptom of arch mismatch
19:26:19 the builder image builds wheels so the base image doesn't have to - but the builder and base layers were out of sync arch-wise - so we saw the base image install trying to gcc something (and failing)
19:26:59 (yay for complex issues)
19:27:02 i thought so, and the random nature of the return was why it passed check but failed in gate (iirc?)
19:27:15 yup
19:27:27 thank goodness corvus managed to hold a reproducible env
19:27:46 anything else on the topic of config management?
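A minimal sketch of the trailing-whitespace behavior discussed around 19:10:39, assuming (as the discussion guesses) that Jinja's default template rendering is the culprit: Jinja strips a single trailing newline from a rendered template unless keep_trailing_newline is set, and new-format OpenSSH private keys are picky about ending with one. The template and key text below are illustrative, not the actual code from https://review.opendev.org/#/c/740854/.

```python
# Illustrative only -- not the actual executor/Ansible code from change 740854.
import jinja2

# A template that is meant to write out a private key followed by a final newline.
key_template = "{{ ssh_private_key }}\n"

# Placeholder key material standing in for a new-format OpenSSH private key.
key = "-----BEGIN OPENSSH PRIVATE KEY-----\n...\n-----END OPENSSH PRIVATE KEY-----"

default_env = jinja2.Environment()                           # keep_trailing_newline=False
strict_env = jinja2.Environment(keep_trailing_newline=True)  # preserves the final newline

rendered_default = default_env.from_string(key_template).render(ssh_private_key=key)
rendered_strict = strict_env.from_string(key_template).render(ssh_private_key=key)

print(rendered_default.endswith("\n"))  # False: Jinja silently dropped the trailing newline
print(rendered_strict.endswith("\n"))   # True: the key file ends the way openssh expects
```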
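And a rough way to answer the "i'm not sure what's actually on them" question from 19:21:44 is to pull an image on the host in question and check the architecture recorded in its config. A sketch only, assuming the docker CLI is available locally; docker.io/opendevorg/python-base is used here simply because it came up above, and any of the published images could be substituted.

```python
# Spot-check the architecture of a locally pulled image; a mismatch here on an
# arm64 builder host would confirm that the published arm layers are bogus.
import json
import subprocess

def image_architecture(image: str) -> str:
    """Return the Architecture field from `docker image inspect` for a local image."""
    out = subprocess.run(
        ["docker", "image", "inspect", image],
        check=True, capture_output=True, text=True,
    ).stdout
    return json.loads(out)[0]["Architecture"]

if __name__ == "__main__":
    # Assumes the image has already been pulled on this host.
    print(image_architecture("docker.io/opendevorg/python-base"))
```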
19:28:07 i have grafana.opendev.org rolled out
19:28:36 i'm still working on graphite02.opendev.org and migrating the settings and data correctly
19:29:36 might be worth touching on the backup server split, though we can cover that under a later topic if preferred
19:30:03 yup it's a separate topic
19:30:20 cool, let's just lump it into that then
19:30:30 #topic OpenDev
19:30:40 let's talk opendev things really quickly, then we can get to some of the things that popped up recently
19:30:49 #link https://review.opendev.org/#/c/740716/ Upgrade to v1.12.2
19:31:21 That change upgrades us to latest gitea. Notably its changelog says it allows you to properly set the default branch on new projects to something other than manster
19:31:46 * fungi changes all his default branches to manster
19:31:48 this isn't something we're using yet, but figuring these things out was noted in https://etherpad.opendev.org/p/opendev-git-branches so upgrading sooner rather than later seems like a good idea
19:32:23 yeah, i think it's a good idea to have that in place soon
19:32:26 my fix for the repo list pagination did merge upstream and some future version of gitea should include it. That said the extra check we've got seems like good belts and suspenders
19:32:29 surprised nobody's asked for the option yet
19:32:34 that fix is not in v1.12.2
19:33:21 and finally I need to send an email to all of our advisory board volunteers and ask them to sub to service-discuss and service-announce if they haven't already, then see if I can get them to agree on a comms method (I've suggested service-discuss for simplicity)
19:34:19 #topic General Topics
19:34:40 #topic Dealing with Bup indexes and backup server volume migrations and our new backup server
19:34:57 this is the catch-all for backup related items. Maybe we should start with what led us into discovering things?
19:35:16 My understanding of it is that zuul01's root disk filled up and this was tracked back to bup's local "git" indexes on zuul01
19:35:31 we rm'ed that entire dir in /root/ but then bup stopped working on zuul01
19:35:56 in investigating the fix for that we discovered our existing volume was nearing full capacity, so we rotated out the oldest volume and made it latest on the old backup server
19:36:06 probably the biggest things to discuss are that we've discovered it's safe to reinitialize ~root/.bup on backup clients, and that we're halfway through replacing the puppet-managed backups with ansible-managed backups but they use different servers (and redundancy would be swell)
19:36:08 after that ianw pointed out we have a newer backup server which is in use for some servers
19:36:19 i had a think about how it ended up like that ...
19:36:34 i kinda thought rm'ing the local index should not have caused a problem; it's not clear if it did or not; we didn't spend much time on that since it was time to roll the server side anyway
19:37:09 corvus: I think for review we may not have rotated its remote backup after rm'ing the local index because its remote server was the new server (not the old one). ianw and fungi can probably confirm that though
19:37:28 ianw: fungi ^ maybe let's sort that out first then talk about the new server?
19:37:49 corvus: i did see a (possibly related) problem when i did it on review01... specifically that it ran away with disk (spooling something i think but i couldn't tell where) on the first backup attempt and filled the rootfs and crashed
19:38:12 oh i missed the review01 issue
19:38:25 it filled up root01's rootfs?
19:38:28 er
19:38:29 and yeah, when i removed and reinitialized ~root/.bup on review01 i didn't realize we were backing it up to a different (newer) backup server
19:38:31 review01's rootfs
19:39:00 fungi: what did you do to correct that?
19:39:07 then i started the backup without clearing its remote copy on the new backup server, and rootfs space available quickly drained to 0%
19:39:49 fungi: is that still the state of things?
19:39:54 bup crashed with a python exception due to the ENOSPC, but it immediately freed it all, leading me to suspect it was spooling in unlinked tempfiles
19:40:14 which would also explain why i couldn't find them
19:40:31 ya it basically fixed itself afterwards, but cacti showed a spike during roughly the time bup was running
19:40:34 then a subsequent run of bup was fine
19:40:43 after that, i ran it again, and it used a bit of space on / temporarily but eventually succeeded
19:41:30 so really not sure what to make of that
19:41:32 it may be worth doing a test recovery off the current review01 backups (and zuul01?) just to be sure the removal of /root/.bup isn't a problem there
19:41:56 it did not exhibit whatever behavior led zuul01 to have two bup processes hung/running started on successive days
19:43:08 clarkb: ++ i can take an action item to confirm i can get some data out of those backups if we like
19:43:18 ianw: that would be great, thanks
19:43:40 and I think otherwise we continue to monitor it and see if we have disk issues?
19:44:02 ianw: what are we thinking for the server swap itself?
19:44:12 yeah, so some history
19:44:51 i wrote the ansible roles to install backup users and cron jobs etc in ansible, and basically iirc the idea was that as we got rid of puppet everything would pull that in, everything would be on the new server and the old could be retired
19:45:00 however, puppet clearly has a long tail ...
19:45:23 which is how we've ended up in a confusing situation for a long time
19:45:39 firstly
19:45:42 but also we're already ansibling other stuff on all those servers, so the fact that some also get unrelated services configured by puppet should be irrelevant now as far as that goes
19:46:07 fungi: yes, that was my next point :) i don't think that was true, or as true, at the time of the original backup roles
19:46:11 so, for now
19:46:16 if we can manage user accounts across all of them with ansible then it seems like we could manage backups across all of them with ansible too
19:46:27 #link https://review.opendev.org/740824 add zuul to backup group
19:46:31 yeah, a year ago maybe not
19:46:45 we should do that ^ ... zuul dropped the bup:: puppet bits, but didn't pick up the ansible bits
19:46:49 then
19:47:04 #link https://review.opendev.org/740827 backup all hosts with ansible
19:47:14 that *adds* the ansible backup roles to all extant backup hosts
19:47:37 so, they will be backing up to the vexxhost server (new, ansible roles) and the rax one (old, puppet roles)
19:47:41 gotcha, so we'll swap over puppeted hosts too; that way it's less confusing
19:47:57 once that is rolled out, we should clean up the puppet side, drop the bup:: bits from them and remove the cron job
19:48:04 oh we'll keep the puppet too? would it be better to have the ansible side configure both the old and new server?
19:48:05 and once we do that, build a second new backup server and add it to the role?
19:48:08 and simply remove the puppetry?
19:48:21 *then* we should add a second backup server in RAX, add that to the ansible side, and we'll have dual backups
19:48:31 yeah, all that sounds fine to me
19:48:33 gotcha
19:48:36 yes ... sounds like we agree :)
19:48:42 * fungi makes thumbs-up sign
19:48:47 basically add the second back in with ansible rather than worry too much about continuing to use the puppeted side of things
19:48:49 wfm
19:49:09 as a time check we have ~12 minutes and a few more items, so I'll keep things moving here
19:49:21 #topic Retiring openstack-infra ML July 15
19:49:36 fungi: I haven't seen any objections to this, are we still a go for that tomorrow?
19:50:56 yeah, that's the plan
19:51:04 #link https://review.opendev.org/739152 Forward openstack-infra ML to openstack-discuss
19:51:18 i'll be approving that tomorrow, preliminary reviews appreciated
19:51:34 i've also got a related issue
19:52:37 in working on a mechanism for analyzing mailing list volume/activity for our engagement statistics i've remembered that we'd never gotten around to coming up with a means of providing links to the archives for retired mailing lists
19:53:13 and mailman 2.x doesn't have a web api really
19:53:28 or more specifically pipermail, which does the archive presentation
19:53:45 the archives are still there if you know the urls though iirc. Maybe a basic index page we can link to somewhere?
19:54:07 basically once these are deleted, mailman no longer knows about the lists but pipermail-generated archives for them continue to exist and be served if you know the urls
19:54:46 at the moment there are 24 (25 tomorrow) retired mailing lists on domains we host, and they're all on the lists.openstack.org domain so far but eventually there will be others
19:55:27 i don't know if we should just manually add links to retired list archives in the html templates for each site (there is a template editor in the webui though i've not really played with it)
19:55:51 each site == mailman list?
19:55:52 or if we should run some cron/ansible/zuul automation to generate a static list of them and publish it somewhere discoverable
19:56:13 sites are like lists.opendev.org, lists.zuul-ci.org, et cetera
19:56:18 ah
19:56:27 that seems reasonable to me because it is where people will go looking for it
19:56:32 but I'm not sure how automatable that is
19:56:59 yeah, i'm mostly just bringing this up now to say i'm open to suggestions outside the meeting (so as not to take up any more of the hour)
19:57:07 ++
19:57:14 #topic China telecom blocks
19:57:24 i'll keep this short
19:57:27 AIUI we removed the blocks and did not need to switch to ianw's UA based filtering?
19:57:55 we dropped the temporary firewall rules (see opendev-announce ml archives for date and time) once the background activity dropped to safe levels
19:58:34 it could of course reoccur, or something like it, at any time. no guarantees it would be from the same networks/providers either
19:58:50 we've landed the apache filtration code now though right?
19:58:55 so i do still think ianw's solution is a good one to keep in our back pocket
19:59:03 so our response in the future can be to switch to the apache port in haproxy configs?
19:59:24 yes, the plumbing is in place, we just have to turn it on and configure it
19:59:32 yeah, i think it's probably good we have the proxy option up our sleeve if we need those layer 7 blocks
19:59:43 touch wood, never need it
19:59:48 ++
19:59:52 but of course if we don't exercise it, then it's at risk of bitrot as well, so we should be prepared to have to fix something with it
19:59:56 are there any changes needed to finish that up so it is ready if we need it?
20:00:08 or are we in the state where it's in our attic and good to go when necessary?
20:00:09 fungi: it is enabled and tested during the gate testing runs
20:00:32 (we are at time now but have one last thing to bring up)
20:00:34 gate testing runs for gitea
20:00:44 yeah, hopefully that mitigates the bitrot risk then
20:01:14 #topic Project Renames
20:01:24 There are a couple of renames requested now.
20:01:56 I'm already feeling a bit swamped this week just catching up on things and making progress on items that I was pushing on
20:02:09 makes me think that July 24 may be a good option for the rename outage
20:02:46 if I can get at least one other set of eyeballs for that I'll go ahead and announce it. We're at time so we don't need to have that answer right now, but let me know if you can help
20:02:55 the opendev hardware automation conference finishes on the 22nd, so i can swing the 24th
20:02:57 (we've largely automated that whole process now which is cool)
20:03:01 fungi: thanks
20:03:13 Thanks everyone!
20:03:17 #endmeeting
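On the retired mailing list archives question (19:52-19:56), the "generate a static list of them and publish it somewhere discoverable" option could be as small as a script fed a per-site list of retired list names that emits an HTML index of their pipermail archive URLs. A sketch only, under that assumption; the list names below are placeholders rather than the actual set of 24/25, and nothing here was decided in the meeting.

```python
# Hypothetical static index generator for retired list archives.
from html import escape

# Placeholder data; the real per-site lists would come from config management.
RETIRED_LISTS = {
    "lists.openstack.org": ["openstack-infra", "example-retired-list"],
    # other sites (lists.opendev.org, lists.zuul-ci.org, ...) added as lists retire
}

def render_index(retired):
    lines = ["<html><body><h1>Retired mailing list archives</h1>"]
    for site, names in sorted(retired.items()):
        lines.append("<h2>%s</h2><ul>" % escape(site))
        for name in sorted(names):
            # Pipermail keeps serving archives at this path after the list is deleted.
            url = "http://%s/pipermail/%s/" % (site, name)
            lines.append('<li><a href="%s">%s</a></li>' % (url, escape(name)))
        lines.append("</ul>")
    lines.append("</body></html>")
    return "\n".join(lines)

if __name__ == "__main__":
    print(render_index(RETIRED_LISTS))
```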