19:01:11 <clarkb> #startmeeting infra
19:01:13 <openstack> Meeting started Tue Sep 10 19:01:11 2019 UTC and is due to finish in 60 minutes.  The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:14 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:16 <openstack> The meeting name has been set to 'infra'
19:01:22 <clarkb> #link http://lists.openstack.org/pipermail/openstack-infra/2019-September/006478.html Our Agenda
19:01:32 <clarkb> #topic Announcements
19:02:11 <clarkb> no one objected to me holding the PTL baton for another cycle and I'm apparently PTL again :)
19:02:42 <fungi> condolences
19:02:44 <clarkb> as mentioned in my candidacy statement I'd like for this to be my last cycle. And will be more than happy to work with people to make such a transition easier
19:03:18 <clarkb> The other announcement is that there is an openstack foundation board meeting immediately after our meeting
19:03:27 <corvus> related -- we should probably talk about opendev governance at some point
19:03:32 <clarkb> corvus: ++
19:03:47 <corvus> (don't want to derail announcements, but i expect that's intertwined with the openstack-infra ptl issue)
19:03:50 <clarkb> board meeting details here https://wiki.openstack.org/wiki/Governance/Foundation/10September2019BoardMeeting if you want to listen in. I'll do my best to avoid this meeting going long too
19:03:57 <fungi> yes, i have a feeling if much of what we do became the jurisdiction of opendev, what's left and openstack-specific could maybe become a sig
19:04:02 <clarkb> corvus: ya there is definitely interconnectedness
19:04:24 <clarkb> fwiw I'd happily let someone else be opendev PTL (or whatever we end up calling it) too :)
19:04:40 <clarkb> corvus: I'll scribble a note to get that ball rolling
19:05:45 <clarkb> #topic Actions from last meeting
19:05:53 <clarkb> #link http://eavesdrop.openstack.org/meetings/infra/2019/infra.2019-09-03-19.01.txt minutes from last meeting
19:06:15 <clarkb> I don't see any actions there. Also not sure about you but the last week has been one of fighting fires for me
19:06:52 <clarkb> not much got done off of my normal TODO list. On the upside I think our log storage in swift should be much happier now and hopefully our jobs are slightly easier to debug when "networking" breaks
19:07:06 <clarkb> #topic Priority Efforts
19:07:16 <clarkb> #topic OpenDev
19:07:33 <clarkb> Maybe now is a good time to talk about what we think the next step on moving opendev governance forward is?
19:07:46 <clarkb> I assume it will involve a discussion with the openstack TC of some sort
19:07:57 <clarkb> fungi: ^ as our TC rep any thoughts on that?
19:08:38 <fungi> well...
19:08:48 <fungi> i'm technically no longer on the tc ;)
19:08:54 <fungi> but still happy to provide guidance there
19:08:55 <clarkb> oh right that also changed last week :)
19:09:12 <fungi> and yes, the sooner we bring it up with them, the better
19:09:56 <fungi> (for those who missed the openstack tc election, i stepped down to help make sure the election process went smoothly, since there were some significant tooling complexities due to scheduling constraints)
19:10:04 <clarkb> fungi: is there a formal agenda we have to add ourselves to or is there still informal office hours we can show up to and discuss at?
19:10:24 <corvus> you know me: the closer it resembles a self-organized collective, the better as far as i'm concerned :)
19:10:36 <fungi> i recommend a thread on the openstack-discuss ml, with [tc][infra] in the subject
19:10:57 <clarkb> fungi: ok I can draft an email for that on etherpad and we can decide when it is ready to be sent
19:11:01 <fungi> sounds great!
19:11:42 <fungi> but yeah, the tc formally meets infrequently to satisfy requirements in the osf bylaws, and treats those meetings mostly as a reporting outlet
19:11:50 <corvus> i think the main thing is it won't be *under* the tc (or any foundation project), but we want involvement from them; so figuring out what form that involvement should take is a big question i think.
19:11:59 <clarkb> corvus: ++
19:12:01 <fungi> exactly
19:12:24 <clarkb> and I'm not sure we need to have that answer before kicking off a thread to discuss it, though some ideas will likely help guide discussion in a productive manner
19:12:47 <corvus> yeah, for something like that "how would you like to be involved" is a fine open-ended question to ask i think
19:13:00 <fungi> i wholeheartedly concur
19:13:05 <corvus> maybe worth sending the same email to all the osf projects?
19:13:25 <fungi> or at least a similar e-mail, sure
19:13:40 <fungi> the degree to which we're involved in and assisting each of them and with what tends to differ
19:14:03 <fungi> though i expect what we want to offer all of them is roughly the same
19:14:18 <corvus> and our interest in their participation is probably roughly equal :)
19:14:24 <clarkb> ++
19:14:30 <fungi> absolutely
19:14:52 <fungi> or at least proportional
19:15:17 <corvus> non-zero
19:15:30 <clarkb> I think that gives us a really good starting point to being a draft and sort out details together there
19:15:37 <clarkb> s/being/begin/
19:15:44 <corvus> i'm a draft
19:16:00 <clarkb> on nitrous?
19:16:05 <fungi> sounds like we all are
19:16:13 <corvus> clarkb: yes, i'm also a drift
19:16:49 <clarkb> Alright any other opendev related business or should we move on?
19:17:09 <corvus> ++
19:17:52 <clarkb> #topic Update Config Management
19:17:53 <fungi> none here
19:18:22 <clarkb> One thing I managed to get off of my todo list was merging, applying, and monitoring mordred's config updates for Gerrit
19:18:34 <clarkb> Our gerrits should no longer do full replication on restart
19:18:48 * fungi celebrates
19:18:57 <corvus> yay!  and the ssh timeout too, right?
19:18:59 <clarkb> and they have hour-long ssh timeouts (I even tested this on review-dev with a stream-events connection. I don't think review.o.o is ever idle long enough to hit that, so it will depend on sad clients)
19:19:02 <clarkb> corvus: yup
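The meeting doesn't quote the exact settings, but the two behaviour changes described above most likely map to standard Gerrit options. A minimal sketch with illustrative values, assuming sshd.idleTimeout in gerrit.config and the replication plugin's replicateOnStartup flag are what was touched:

    # gerrit.config -- terminate idle ssh connections after an hour
    [sshd]
        idleTimeout = 1h

    # replication.config -- don't start a full replication pass when Gerrit boots
    [gerrit]
        replicateOnStartup = false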
19:19:04 <fungi> granted, the upcoming rename restart will still need a reindex
19:19:21 <corvus> fungi: yeah, we can target just the renamed projects though for that i think?
19:19:37 <fungi> oh, wait, replication
19:19:41 <fungi> right
19:19:47 <clarkb> ya we can do targeted replication
19:19:51 <clarkb> and reindexing is online
19:19:57 <fungi> i think the rename process doesn't need rereplication
19:20:02 <clarkb> shouldn't
19:20:14 <fungi> gerrit needs reindexing, but that only happens automatically on restarts for upgrades
19:20:21 <fungi> i was confusing the two. sorry!
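To make the "targeted replication" mentioned above concrete: the Gerrit replication plugin can be asked to replicate only specific projects via the admin ssh interface, rather than everything. A hedged example, run as a Gerrit admin, with a hypothetical project name:

    # replicate just the renamed project and wait for it to finish
    ssh -p 29418 review.openstack.org replication start openstack/renamed-project --wait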
19:20:37 <clarkb> Any other config management related business?
19:20:56 <fungi> nada
19:21:06 <clarkb> #topic Storyboard
19:21:14 <clarkb> fungi: diablo_rojo_phon any news to share?
19:21:35 <fungi> i've been entirely disconnected for the past week, so while i don't think so i really don't know
19:21:59 <clarkb> silly hurricanes interrupting work
19:22:39 <fungi> they should do a better job of scheduling those, yes
19:23:01 <fungi> diablo_rojo_phon: may also not be on hand at the moment
19:23:08 <fungi> SotK: ?
19:24:22 <fungi> guessing we can move on. i'll hopefully have things to talk about next week as i get back into gear
19:24:36 <clarkb> k
19:24:39 <clarkb> #topic General Topics
19:24:49 <clarkb> #link https://etherpad.openstack.org/p/201808-infra-server-upgrades-and-cleanup
19:24:55 <clarkb> First up is server cleanups
19:25:09 <clarkb> I don't expect much movement happened on the wiki what with hurricanes and other fires
19:25:36 <clarkb> for static.o.o I think we now have a relatively complete list of tasks to be able to delete that server
19:25:39 <clarkb> #link https://etherpad.openstack.org/p/static-services Sign up for tasks
19:25:48 <fungi> i've been pretty useless for the last week, yes, apologies
19:25:57 <clarkb> thank you ajaeger for jumping on that already. If the rest of us can grab one or two of those we should be able to make good progress on it
19:26:08 <clarkb> ianw: ^ did you have anything else to add re static.o.o and your auditing?
19:26:43 <ianw> no; we didn't really come to a conclusion on the redirects ... but we can defer that until the harder things like afs publishing are done
19:27:09 <clarkb> ok
19:27:29 <clarkb> That is a good transition into the next topic: AFS stability
19:27:59 <clarkb> we discovered that afs02.dfw.o.o was very unhappy last week. It looked like fallout from a live migration based on console content (I seem to recall similar messages from hosts that we were told had been migrated)
19:28:04 <clarkb> corvus rebooted it
19:28:24 <clarkb> unfortunately that left some volumes locked. ianw has since fixed all of those but the opensuse mirror volume (is that correct?)
19:28:37 <clarkb> I have also turned mirror-update.openstack.org back on but left mirror-update.opendev.org alone
19:28:39 <corvus> oh, huh, i thought i checked that after rebooting
19:28:52 <corvus> what steps should we have done after the reboot?
19:29:08 <clarkb> ianw: ^ can fill us in on his investigation and next steps
19:29:21 <ianw> yes the opensuse mirror had a release error; something we've seen before and made a thread about on openafs lists (see updates in #openstack-infra)
19:29:49 <ianw> i've tried the salvage recovery steps they suggest and am re-running the opensuse release
19:30:15 <clarkb> corvus: I believe that subsequent vos releases ended up timing out their auth (and this may have happened after the reboot)
19:30:34 <ianw> corvus: i think maybe everything needed to recreate r/o and everything timed out
19:30:43 <ianw> or what clarkb said :)
19:31:21 <corvus> oh. huh.  wonder why it needed to redo r/o
19:31:41 <clarkb> so ya immediately after the reboot I think it was fine
19:32:26 <corvus> so the short version of the recovery process is: "pause release activity before rebooting a server; after it boots, clean up any broken releases; perform a release on every volume manually on the db servers; unpause releases"?
19:32:44 <clarkb> corvus: yes I think that would do it more gracefully
19:33:09 <corvus> is there a doc update with that, or should i write that up real quick?
19:33:09 <ianw> yeah, possibly -- here's where ubuntu went wrong: http://paste.openstack.org/show/774655/
19:33:32 <clarkb> corvus: I am not aware of a doc update for that yet. I think you can write it
19:33:41 <ianw> so it's not exactly like "recreating r/o"
19:34:03 <ianw> i haven't audited the logs of all the other ones to see what happened, but we could
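A rough sketch of the recovery sequence corvus summarizes above, assuming standard OpenAFS vos tooling; the volume name is illustrative, and the authoritative write-up is the doc change linked a little further down (https://review.opendev.org/681338):

    # 1. pause release activity (hold the mirror-update jobs) before rebooting
    # 2. after the fileserver is back, look for volumes left locked mid-release
    vos listvldb -locked -localauth
    # 3. unlock and manually re-release each affected volume from a db server
    vos unlock mirror.opensuse -localauth
    vos release mirror.opensuse -localauth
    # 4. unpause releases (re-enable mirror-update)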
19:36:37 <clarkb> sounds like that may be it for this topic? let's move on as I want to finish before 2000 UTC
19:36:55 <clarkb> Next up is project renaming on Monday at 1400 UTC, that is 7am Pacific
19:37:01 <clarkb> #link https://etherpad.openstack.org/project-renames-2019-09-10 Planning document
19:37:09 <clarkb> I've just started editing ^ based on the one we did at the end of May
19:37:18 <clarkb> from that I need to write a change for the repos.yaml content
19:37:30 <clarkb> and need to get the project-config rename changes in order
19:37:48 <clarkb> One question I did have is whether or not the libification of gitea management affects our rename playbook at all
19:38:01 <clarkb> I don't think it does because rename playbook should call into that like the normal ansible runs do too
19:38:06 <clarkb> but I'll look at that as well
19:38:24 <corvus> #link https://review.opendev.org/681338 Add docs for recovering an OpenAFS fileserver
19:39:55 <clarkb> monty said he would be around Monday and early Pacific time would work for him. fungi I think you said that works too, does it still work post-hurricane?
19:40:09 <clarkb> (and anyone else should feel free to help too :) )
19:40:14 <fungi> yeah, it's fine. i'll be here with bells on
19:40:56 <clarkb> great I'll see you all there then
19:41:10 <clarkb> Next up is the volume of files in ara reports, which is making ceph sad
19:41:21 <clarkb> Wanted to bring this up to cover the changes we made really quickly
19:41:41 <clarkb> first corvus updated the zuul dashboard to largely cover our needs there via the job output json file
19:42:04 <clarkb> corvus: do we still need to get that fix for handlers merged? or do we just need an executor restart for that?
19:42:58 <corvus> er 1 sec; i'm unprepared :)
19:43:34 <corvus> https://review.opendev.org/680726 is unmerged, but not very important
19:44:11 <clarkb> interesting that zuul didn't enqueue it when its parent merged
19:44:17 <corvus> the current output is misleading if ansible "handler" tasks are used, but they don't get used that often in the main body of jobs
19:44:19 <clarkb> in any case we stopped running ara for every job
19:44:28 <clarkb> nested ara runs are still there but the root report is no longer a thing
19:44:36 <clarkb> that should cut back on our total object count for log files in swift/ceph
19:44:46 <corvus> clarkb: it did; but zuul's gate pipeline doesn't honor 'recheck'
19:44:59 <clarkb> the other major change we made was to shard logs into containers based on the first three chars of the build uuid
19:45:13 <clarkb> we should get 4096 containers in each cloud region to shard logs into as a result
19:45:19 <clarkb> hopefully that spreads things out well
19:45:20 <clarkb> corvus: ah
19:45:32 <clarkb> and as of this morning all swifts are back in service
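To illustrate the sharding scheme clarkb describes above: the container for a build's logs is derived from the first three hex characters of the build UUID, which is where the 4096-containers-per-region figure comes from (16^3 = 4096). A minimal sketch; the "logs_" prefix is illustrative, not necessarily the real container name:

    # pick a log container from the first three hex chars of the build uuid
    build_uuid=763a74d8c47c4a62b13a4c5ba05ae0ef
    echo "logs_${build_uuid:0:3}"   # -> logs_763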
19:45:40 <corvus> i still consider the current state an emergency fix... and we should figure out what to do about nested aras
19:45:46 <clarkb> corvus: k
19:45:55 <clarkb> vexxhost is the last remaining cloud we've kept out (it is ceph)
19:46:13 <clarkb> I think as soon as mnaser is ready for us to try using vexxhost again we can go ahead and do that
19:47:25 <corvus> we know that we have reduced the file count by turning off the zuul-ara, but we don't know what proportion of the files that was compared to nested-ara -- meaning we don't know how much of an impact the nested-aras will continue to be
19:47:57 <corvus> i suspect it's still significant, and if so, it would be good to come up with another option there
19:47:57 <clarkb> correct. We have also removed hot containers like periodic/ by replacing them with the build uuid sharded containers
19:48:04 <corvus> there was an etherpad with ideas, right?
19:48:12 <corvus> dmsimard: ping
19:48:26 <clarkb> #link https://etherpad.openstack.org/p/Vz5IzxlWFz ARA file count reduction ideas
19:49:04 <corvus> i feel like #2 isn't worth the gain
19:49:19 <fungi> yeah, i got the impression some of those nested aras reported on orders of magnitude more ansible tasks than our job ara
19:49:49 <corvus> if that's the case, then turning off zuul-ara may not change as much as we'd hoped.
19:49:51 <clarkb> fungi: but we run an order of magnitude fewer jobs for them (I think)
19:49:59 <corvus> maybe that cancels out?
19:50:06 <fungi> the number of tasks involved in, e.g., deploying a complete openstack is waaaay more than to run a typical zuul job
19:50:23 <fungi> but yes, hard to know due to the difference in frequency
19:50:24 <clarkb> nested ara is just system-config jobs, osa, and tripleo
19:50:30 <clarkb> (at least those are the ones I know of)
19:50:37 <corvus> it's "just tripleo" i'm worried about :)
19:51:30 <clarkb> ya that could be significant (though their use of resources is more due to running long multinode jobs than many many jobs I think)
19:51:34 <corvus> looking at the etherpad, broadly, i see only two ways to improve this: 1) use a static server; 2) do more in javascript
19:51:50 <dmsimard> hi, I'm here
19:51:58 <fungi> those were basically the two possible roads i was aware of
19:52:36 <fungi> do more server-side (so not in swift since we can't provide our own swift middleware), or do more on the client side (so probably with js)
19:52:53 <corvus> if we need to do something urgently, #1 (static server) is probably the only thing that will do.   if our effort so far has given us some breathing room, we can look into #2 (javascript development)
19:53:19 <clarkb> corvus: I think the current swift clouds have no problem with the existing setup (at least I've not heard complaints)
19:53:28 <clarkb> corvus: given that I think we do have breathing room
19:53:44 <dmsimard> using the sqlite middleware is my preferred option, it was designed to solve this exact problem
19:53:54 <dmsimard> I do not have the javascript skills to take on the other option
19:53:59 <fungi> we'll presumably know more for sure as we approach our retention
19:54:09 <fungi> about whether it's a strain on them
19:54:26 <clarkb> fungi: ya and donnyd can likely give us detailed info
19:54:45 <corvus> i'm not opposed to using a static server; i'm opposed to being personally responsible for sysadmining one
19:55:03 <clarkb> https://grafana.fortnebula.com/d/9MMqh8HWk/openstack-utilization?orgId=2&refresh=30s publishes swift info from FN
19:55:29 <corvus> i'm over-extended (as i think the team is), so i'm trying to simplify both the system and our responsibilities, so i'd still like to avoid that if possible
19:55:47 <donnyd> yea, and there is little in the way of load on those swift servers
19:55:47 <clarkb> corvus: ++ also we've found that there are problems with sysadmining such a system even if we have all the time to do it
19:55:59 <donnyd> I can also grab any other stats needed and publish them
19:56:15 <clarkb> particularly around device limits and volume attachments (one alternative we considered was ceph but if it has problems with the object store version of this will it have problems with the block/fs version?)
19:56:30 <donnyd> i am 2x on objects, so what you see there you can /2
19:57:00 <corvus> dmsimard: we talked about maybe zuul and ara sharing some react widgets... maybe we can find a way to combine zuul's "load data from json" approach with ara?
19:57:13 <donnyd> I am about to move swift to nvme so it really won't matter on my end at all
19:57:35 <clarkb> (as a timecheck we have just under 3 minutes left in the meeting but this was the agenda so we can talk about ara until then)
19:57:55 <clarkb> more than happy to have further discussion in #openstack-infra on this topic or any other afterwards too
19:58:01 <corvus> dmsimard: i don't know what that looks like, or how to do it, but i feel like there should be something we can do to start converging
19:58:32 <dmsimard> corvus: yes, I believe it was tristanC who came up with that. I'll be at the office with him tomorrow, will bring it up.
19:59:31 <corvus> dmsimard: cool -- why don't the 2 of you bat around some ideas, and then maybe the 3 of us can have a conference call or something afterwards?
20:00:01 <dmsimard> sounds like a plan
20:00:08 <clarkb> and we are at time. Thank you everyone!
20:00:10 <clarkb> #endmeeting