19:01:11 #startmeeting infra
19:01:13 Meeting started Tue Sep 10 19:01:11 2019 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:14 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:16 The meeting name has been set to 'infra'
19:01:22 #link http://lists.openstack.org/pipermail/openstack-infra/2019-September/006478.html Our Agenda
19:01:32 #topic Announcements
19:02:11 no one objected to me holding the PTL baton for another cycle and I'm apparently PTL again :)
19:02:42 condolences
19:02:44 as mentioned in my candidacy statement I'd like for this to be my last cycle. And will be more than happy to work with people to make such a transition easier
19:03:18 The other announcement is there is an openstack foundation board meeting immediately after our meeting
19:03:27 related -- we should probably talk about opendev governance at some point
19:03:32 corvus: ++
19:03:47 (don't want to derail announcements, but i expect that's intertwined with the openstack-infra ptl issue)
19:03:50 board meeting details here https://wiki.openstack.org/wiki/Governance/Foundation/10September2019BoardMeeting if you want to listen in. I'll do my best to avoid this meeting going long too
19:03:57 yes, i have a feeling if much of what we do became the jurisdiction of opendev, what's left and openstack-specific could maybe become a sig
19:04:02 corvus: ya there is definitely interconnectedness
19:04:24 fwiw I'd happily let someone else be opendev PTL (or whatever we end up calling it) too :)
19:04:40 corvus: I'll scribble a note to get that ball rolling
19:05:45 #topic Actions from last meeting
19:05:53 #link http://eavesdrop.openstack.org/meetings/infra/2019/infra.2019-09-03-19.01.txt minutes from last meeting
19:06:15 I don't see any actions there. Also not sure about you but the last week has been one of fighting fires for me
19:06:52 not much got done off of my normal TODO list. On the upside I think our log storage in swift should be much happier now and hopefully our jobs are slightly easier to debug when "networking" breaks
19:07:06 #topic Priority Efforts
19:07:16 #topic OpenDev
19:07:33 Maybe now is a good time to talk about what we think the next step on moving opendev governance forward is?
19:07:46 I assume it will involve a discussion with the openstack TC of some sort
19:07:57 fungi: ^ as our TC rep any thoughts on that?
19:08:38 well...
19:08:48 i'm technically no longer on the tc ;)
19:08:54 but still happy to provide guidance there
19:08:55 oh right that also changed last week :)
19:09:12 and yes, the sooner we bring it up with them, the better
19:09:56 (for those who missed the openstack tc election, i stepped down to help make sure the election process went smoothly, since there were some significant tooling complexities due to scheduling constraints)
19:10:04 fungi: is there a formal agenda we have to add ourselves to or is there still informal office hours we can show up to and discuss at?
19:10:24 you know me: the closer it resembles a self-organized collective, the better as far as i'm concerned :)
19:10:36 i recommend a thread on the openstack-discuss ml, with [tc][infra] in the subject
19:10:57 fungi: ok I can draft an email for that on etherpad and we can decide when it is ready to be sent
19:11:01 sounds great!
19:11:42 but yeah, the tc formally meets infrequently to satisfy requirements in the osf bylaws, and treats those meetings mostly as a reporting outlet
19:11:50 i think the main thing is it won't be *under* the tc (or any foundation project), but we want involvement from them; so figuring out what form that involvement should take is a big question i think.
19:11:59 corvus: ++
19:12:01 exactly
19:12:24 and I'm not sure we need to have that answer before kicking off a thread to discuss it, though some ideas will likely help guide discussion in a productive manner
19:12:47 yeah, for something like that "how would you like to be involved" is a fine open-ended question to ask i think
19:13:00 i wholeheartedly concur
19:13:05 maybe worth sending the same email to all the osf projects?
19:13:25 or at least a similar e-mail, sure
19:13:40 the degree to which we're involved in and assisting each of them and with what tends to differ
19:14:03 though i expect what we want to offer all of them is roughly the same
19:14:18 and our interest in their participation is probably roughly equal :)
19:14:24 ++
19:14:30 absolutely
19:14:52 or at least proportional
19:15:17 non-zero
19:15:30 I think that gives us a really good starting point to being a draft and sort out details together there
19:15:37 s/being/begin/
19:15:44 i'm a draft
19:16:00 on nitrous?
19:16:05 sounds like we all are
19:16:13 clarkb: yes, i'm also a drift
19:16:49 Alright any other opendev related business or should we move on?
19:17:09 ++
19:17:52 #topic Update Config Management
19:17:53 none here
19:18:22 One thing I managed to get off of my todo list was merging and monitoring and applying mordred's config updates for Gerrit
19:18:34 Our gerrits should no longer do full replication on restart
19:18:48 * fungi celebrates
19:18:57 yay! and the ssh timeout too, right?
19:18:59 and they have hour long ssh timeouts (I even tested this on review-dev with a stream-events connection. I don't think review.o.o is ever idle long enough to hit that so will depend on sad clients)
19:19:02 corvus: yup
19:19:04 granted, the upcoming rename restart will still need a reindex
19:19:21 fungi: yeah, we can target just the renamed projects though for that i think?
19:19:37 oh, wait, replication
19:19:41 right
19:19:47 ya we can do targeted replication
19:19:51 and reindexing is online
19:19:57 i think the rename process doesn't need rereplication
19:20:02 shouldn't
19:20:14 gerrit needs reindexing, but that only happens automatically on restarts for upgrades
19:20:21 i was confusing the two. sorry!
19:20:37 Any other config management related business?
19:20:56 nada
19:21:06 #topic Storyboard
19:21:14 fungi: diablo_rojo_phon any news to share?
19:21:35 i've been entirely disconnected for the past week, so while i don't think so i really don't know
19:21:59 silly hurricanes interrupting work
19:22:39 they should do a better job of scheduling those, yes
19:23:01 diablo_rojo_phon: may also not be on hand at the moment
19:23:08 SotK: ?
19:24:22 guessing we can move on.
i'll hopefully have things to talk about next week as i get back into gear
19:24:36 k
19:24:39 #topic General Topics
19:24:49 #link https://etherpad.openstack.org/p/201808-infra-server-upgrades-and-cleanup
19:24:55 First up is server cleanups
19:25:09 I don't expect much movement happened on the wiki what with hurricanes and other fires
19:25:36 for static.o.o I think we now have a relatively complete list of tasks to be able to delete that server
19:25:39 #link https://etherpad.openstack.org/p/static-services Sign up for tasks
19:25:48 i've been pretty useless for the last week, yes, apologies
19:25:57 thank you ajaeger for jumping on that already. If the rest of us can grab one or two of those we should be able to make good progress on it
19:26:08 ianw: ^ did you have anything else to add re static.o.o and your auditing?
19:26:43 no; we didn't really come to a conclusion on the redirects ... but we can defer that until the harder things like afs publishing are done
19:27:09 ok
19:27:29 That is a good transition into the next topic: AFS stability
19:27:59 we discovered that afs02.dfw.o.o was very unhappy last week. It looked like fallout from a live migration based on console content (I seem to recall similar messages from hosts that we were told had been migrated)
19:28:04 corvus rebooted it
19:28:24 unfortunately that left some volumes locked. ianw has since fixed all of those but the opensuse mirror volume (is that correct?)
19:28:37 I have also turned mirror-update.openstack.org back on but left mirror-update.opendev.org alone
19:28:39 oh, huh, i thought i checked that after rebooting
19:28:52 what steps should we have done after the reboot?
19:29:08 ianw: ^ can fill us in on his investigation and next steps
19:29:21 yes the opensuse mirror had a release error; something we've seen before and made a thread about on openafs lists (see updates in #openstack-infra)
19:29:49 i've tried the salvage recovery steps they suggest and am re-running the opensuse release
19:30:15 corvus: I believe that subsequent vos releases ended up timing out their auth (and this may have happened after the reboot)
19:30:34 corvus: i think maybe everything needed to recreate r/o and everything timed out
19:30:43 or what clarkb said :)
19:31:21 oh. huh. wonder why it needed to redo r/o
19:31:41 so ya immediately after the reboot I think it was fine
19:32:26 so the short version of the recovery process is "pause release activity before rebooting a server; after it boots, clean up any broken releases; perform a release on every volume manually on the db servers; unpause releases" ?
19:32:44 corvus: yes I think that would do it more gracefully
19:33:09 is there a doc update with that, or should i write that up real quick?
19:33:09 yeah, possibly -- here's where ubuntu went wrong : http://paste.openstack.org/show/774655/
19:33:32 corvus: I am not aware of a doc update for that yet. I think you can write it
19:33:41 so it's not exactly like "recreating r/o"
19:34:03 i haven't audited the logs of all the other ones to see what happened, but we could
19:36:37 sounds like that may be it for this topic?
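A minimal sketch of the recovery sequence discussed above, assuming the standard OpenAFS vos CLI run with -localauth on the servers; the pause mechanism and volume names are illustrative assumptions, not the exact commands used:

    #!/usr/bin/env python3
    # Rough outline of the post-reboot cleanup described in the meeting.
    # Assumes the OpenAFS "vos" CLI is on PATH; volume names are examples.
    import subprocess

    def vos(*args):
        """Run a vos subcommand with -localauth and return its output."""
        return subprocess.run(
            ("vos",) + args + ("-localauth",),
            check=True, capture_output=True, text=True,
        ).stdout

    # 0. before rebooting a fileserver: pause mirror-update release jobs
    #    (site specific, e.g. a lock file the update scripts honor)

    # 1. after boot: list VLDB entries still locked by interrupted releases
    print(vos("listvldb", "-locked"))

    # 2. unlock each affected volume and re-run its release by hand
    for volume in ("mirror.opensuse", "mirror.ubuntu"):  # example names
        vos("unlock", "-id", volume)
        vos("release", "-id", volume, "-verbose")

    # 3. unpause the release jobs once every volume releases cleanly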
lets move on as I want to finish before 2000UTC
19:36:55 Next up is project renaming on Monday at 1400UTC that is 7am Pacific
19:37:01 #link https://etherpad.openstack.org/project-renames-2019-09-10 Planning document
19:37:09 I've just started editing ^ based on the one we did at the end of may
19:37:18 from that I need to write a change for the repos.yaml content
19:37:30 and need to get the project-config rename changes in order
19:37:48 One question I did have is whether or not the libification of gitea management affects our rename playbook at all
19:38:01 I don't think it does because rename playbook should call into that like the normal ansible runs do too
19:38:06 but I'll look at that as well
19:38:24 #link https://review.opendev.org/681338 Add docs for recovering an OpenAFS fileserver
19:39:55 monty said he would be around monday and early pacific time would work for him. fungi I think you said that works too, does it still work post hurricane?
19:40:09 (and anyone else should feel free to help too :) )
19:40:14 yeah, it's fine. i'll be here with bells on
19:40:56 great I'll see you all there then
19:41:10 Next up is the volume of ara files in ara reports making ceph sad
19:41:21 Wanted to bring this up to cover the changes we made really quickly
19:41:41 first corvus updated the zuul dashboard to largely cover our needs there via the job output json file
19:42:04 corvus: do we still need to get that fix for handlers merged? or do we just need an executor restart for that?
19:42:58 er 1 sec; i'm unprepared :)
19:43:34 https://review.opendev.org/680726 is unmerged, but not very important
19:44:11 interesting that zuul didn't enqueue it when its parent merged
19:44:17 the current output is misleading if ansible "handler" tasks are used, but they don't get used that often in the main body of jobs
19:44:19 in any case we stopped running ara for every job
19:44:28 nested ara runs are still there but the root report is no longer a thing
19:44:36 that should cut back on our total object count for log files in swift/ceph
19:44:46 clarkb: it did; but zuul's gate pipeline doesn't honor 'recheck'
19:44:59 the other major change we made was to shard logs into containers based on the first three chars of the build uuid
19:45:13 we should get 4096 containers in each cloud region to shard logs into as a result
19:45:19 hopefully that spreads things out well
19:45:20 corvus: ah
19:45:32 and as of this morning all swifts are back in service
19:45:40 i still consider the current state an emergency fix... and we should figure out what to do about nested aras
19:45:46 corvus: k
19:45:55 vexxhost is the last remaining cloud we've kept out (it is ceph)
19:46:13 I think as soon as mnaser is ready for us to try using vexxhost again we can go ahead and do that
19:47:25 we know that we have reduced the file count by turning off the zuul-ara, but we don't know what proportion of the files that was compared to nested-ara -- meaning we don't know how much of an impact the nested-aras will continue to be
19:47:57 i suspect it's still significant, and if so, it would be good to come up with another option there
19:47:57 correct. We have also removed hot containers like periodic/ by replacing them with the build uuid sharded containers
19:48:04 there was an etherpad with ideas, right?
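For illustration, a sketch of the build-uuid sharding mentioned above: the first three hex characters of the uuid select one of 16^3 = 4096 containers per region. The "logs_" container prefix and the openstacksdk calls here are assumptions for the example, not necessarily what the upload roles actually do:

    # Sketch of sharding log uploads by build UUID (illustrative only).
    import openstack

    def container_for(build_uuid: str) -> str:
        """Map a build UUID to one of 16**3 = 4096 log containers."""
        return "logs_" + build_uuid[:3]          # e.g. "logs_3a9"

    conn = openstack.connect(cloud="logs")       # hypothetical clouds.yaml entry
    build_uuid = "3a9f0c2d4e5b6a7890c1d2e3f4a5b6c7"  # example value
    container = container_for(build_uuid)
    conn.create_container(container)
    conn.create_object(container,
                       name=f"{build_uuid}/job-output.txt",
                       filename="job-output.txt")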
19:48:12 dmsimard: ping
19:48:26 #link https://etherpad.openstack.org/p/Vz5IzxlWFz ARA file count reduction ideas
19:49:04 i feel like #2 isn't worth the gain
19:49:19 yeah, i got the impression some of those nested aras reported on orders of magnitude more ansible tasks than our job ara
19:49:49 if that's the case, then turning off zuul-ara may not change as much as we'd hoped.
19:49:51 fungi: but we run an order of magnitude fewer jobs for them (I think)
19:49:59 maybe that cancels out?
19:50:06 the number of tasks involved in, e.g., deploying a complete openstack is waaaay more than to run a typical zuul job
19:50:23 but yes, hard to know due to the difference in frequency
19:50:24 nested ara is just system-config jobs, osa, and tripleo
19:50:30 (at least those are the ones I know of)
19:50:37 it's "just tripleo" i'm worried about :)
19:51:30 ya that could be significant (though their use of resources is more due to running long multinode jobs than many many jobs I think)
19:51:34 looking at the etherpad, broadly, i see only two ways to improve this: 1) use a static server; 2) do more in javascript
19:51:50 hi, I'm here
19:51:58 those were basically the two possible roads i was aware of
19:52:36 do more server-side (so not in swift since we can't provide our own swift middleware), or do more on the client side (so probably with js)
19:52:53 if we need to do something urgently, #1 (static server) is probably the only thing that will do. if our effort so far has given us some breathing room, we can look into #2 (javascript development)
19:53:19 corvus: I think the current swift clouds have no problem with the existing setup (at least I've not heard complaints)
19:53:28 corvus: given that I think we do have breathing room
19:53:44 using the sqlite middleware is my preferred option, it was designed to solve this exact problem
19:53:54 I do not have the javascript skills to take on the other option
19:53:59 we'll presumably know more for sure as we approach our retention
19:54:09 about whether it's a strain on them
19:54:26 fungi: ya and donnyd can likely give us detailed info
19:54:45 i'm not opposed to using a static server; i'm opposed to being personally responsible for sysadmining one
19:55:03 https://grafana.fortnebula.com/d/9MMqh8HWk/openstack-utilization?orgId=2&refresh=30s publishes swift info from FN
19:55:29 i'm over-extended (as i think the team is), so i'm trying to simplify both the system and our responsibilities, so i'd still like to avoid that if possible
19:55:47 yea, and there is little in the way of load on those swift servers
19:55:47 corvus: ++ also we've found that there are problems with sysadmining such a system even if we have all the time to do it
19:55:59 I can also grab any other stats needed and publish them
19:56:15 particularly around device limits and volume attachments (one alternative we considered was ceph but if it has problems with the object store version of this will it have problems with the block/fs version?)
19:56:30 i am 2x on objects, so what you see there you can /2
19:57:00 dmsimard: we talked about maybe zuul and ara sharing some react widgets... maybe we can find a way to combine zuul's "load data from json" approach with ara?
19:57:13 I am about to move swift to nvme so it really won't matter on my end at all
19:57:35 (as a timecheck we have just under 3 minutes left in the meeting but this was the agenda so we can talk about ara until then)
19:57:55 more than happy to have further discussion in #openstack-infra on this topic or any other afterwards too
19:58:01 dmsimard: i don't know what that looks like, or how to do it, but i feel like there should be something we can do to start converging
19:58:32 corvus: yes, I believe it was tristanC who came up with that. I'll be at the office with him tomorrow, will bring it up.
19:59:31 dmsimard: cool -- why don't the 2 of you bat around some ideas, and then maybe the 3 of us can have a conference call or something afterwards?
20:00:01 sounds like a plan
20:00:08 and we are at time. Thank you everyone!
20:00:10 #endmeeting