19:01:40 #startmeeting infra
19:01:40 Meeting started Tue Nov 21 19:01:40 2017 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:41 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:43 The meeting name has been set to 'infra'
19:01:51 #link https://wiki.openstack.org/wiki/Meetings/InfraTeamMeeting#Agenda_for_next_meeting
19:01:54 o/
19:02:27 There was a bit of stuff left over from last meeting that I took the liberty of clearing off the agenda. Feel free to reraise items as we go through if I shouldn't have removed something
19:02:38 #topic Announcements
19:02:54 o/
19:03:03 This week is a major holiday in the US so expect those of us living there to be AFK starting Thursday
19:03:36 I will be picking up a turkey this afternoon so the fun starts early too
19:03:52 (/me lurk)
19:04:05 #topic Actions from last meeting
19:04:12 #link http://eavesdrop.openstack.org/meetings/infra/2017/infra.2017-10-31-19.01.txt Minutes from last meeting
19:04:22 er, that's the wrong link
19:04:25 #undo
19:04:25 Removing item from minutes: #link http://eavesdrop.openstack.org/meetings/infra/2017/infra.2017-10-31-19.01.txt
19:04:53 #link http://eavesdrop.openstack.org/meetings/infra/2017/infra.2017-11-14-19.00.txt
19:05:22 The only action there is for ianw to confirm backups are working properly. I saw a report that this had been confirmed yesterday, so \o/
19:05:45 fungi: you've also had an action to document secrets backup policy. Has that change been pushed/merged?
19:05:50 strangely, it looks like my pending action item got omitted
19:05:52 but yeah
19:05:55 #link https://review.openstack.org/520181 Add a section on secrets to the migration guide
19:06:12 needs reviewers
19:06:28 oh cool, thanks!
19:06:33 thanks
19:06:37 there was nowhere good to add the note about it, so i added a section to give us somewhere
19:08:06 that looks great; when it merges, we should send a note to the dev list pointing people at it, for the benefit of those who have already read the doc
19:08:39 dunno if that was too much or not enough detail about the feature, but felt it needed some context at least
19:08:39 There aren't any specs that need review/approval that I've seen
19:08:39 so going to skip specs approval
19:08:39 #topic Priority Efforts
19:08:39 did we lose the bot?
19:08:56 there it goes; maybe I'm lagging
19:09:01 clarkb: you seem laggy to me
19:09:10 fun!
19:09:16 * frickler is seeing lagging, too
19:09:21 #topic Zuul v3
19:09:35 ahh, yeah, i had an entry for this
19:09:43 Fungi asks if it is time to remove jenkins from the Gerrit CI group.
++ from me
19:09:58 we've gotten a few people showing up in irc asking us to delete stale -1 verify votes from "jenkins"
19:10:06 let's do it
19:10:07 because people aren't reviewing their changes
19:10:26 seems sane
19:10:36 and we did at one point before the rollout indicate that we would at some point remove jenkins from the group granting it verify voting permission
19:11:13 but didn't want to do it immediately because we were relying on zuul v3 trusting the verify votes left by v2 so people wouldn't need to recheck everything
19:11:16 right, so the concern is that removing the account from the group that is allowed to -1/-2 verify will effectively remove those votes from the UI
19:11:38 yup
19:11:52 figured now was a good time to revisit and see if we think we're just down to a long tail of relatively inactive changes where doing a recheck isn't too onerous
19:11:52 well - the votes would only go away if we deleted the user, right?
19:12:00 rather than deleting the user from the group
19:12:02 mordred: right, but they won't be visible
19:12:15 mordred: gerrit "hides" the votes on active changes if you lose permission for them
19:12:23 wow. what a GREAT idea
19:12:30 this of course will hide all the verified votes ever
19:12:34 so you can still see them in the db, but won't in the webui or api
19:13:04 jeblair: it only seems to do it for open changes. merged and abandoned were unaffected in the past by it
19:13:06 ya, so anything with a jenkins +1 will need to get zuul +1'd before it can gate
19:13:12 fungi: oh huh
19:13:18 that's less crazy
19:13:26 I think it's been long enough that we can make that jump now
19:13:32 i agree, i'm in favor
19:13:51 right. if i add myself to project bootstrappers, +2 verify and submit a change, then remove myself from that group, the verify +2 lingers on the merged change
19:13:53 yah. I think our stale check would be in effect by now anyway
19:13:54 let's send an email to the mailing list about this
19:14:14 (as a FYI - we've just done this)
19:14:15 i'll write up an announcement
19:14:17 mordred: do we still have that?
19:14:20 stale check
19:14:26 jeblair: oh - do we not?
19:14:32 I think we got rid of it
19:14:35 we do not do stale result checking
19:14:38 mordred: ages ago...
19:14:42 ah. well, silly me
19:14:54 yeah, seems to be gone
19:14:58 right, it's been a couple of years i think? my sense of time is pretty terrible anymore though
19:14:59 in any case, I think it's long enough that requiring a new +1 from zuul shouldn't be a super large burden
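To size that long tail, the number of open changes still carrying a jenkins -1 can be pulled from Gerrit's REST API. A minimal sketch, assuming the label:Verified=-1,jenkins query syntax works against review.openstack.org and capping results at 500:

    # Sketch: count open changes still carrying a -1 Verified vote from
    # the "jenkins" account via Gerrit's REST /changes/ endpoint. The
    # label query syntax and the result cap are assumptions.
    import json
    import urllib.parse
    import urllib.request

    query = urllib.parse.quote("status:open label:Verified=-1,jenkins")
    url = "https://review.openstack.org/changes/?q=" + query + "&n=500"

    with urllib.request.urlopen(url) as resp:
        body = resp.read().decode("utf-8")

    # Gerrit prefixes JSON responses with a ")]}'" line to defeat XSSI.
    changes = json.loads(body.split("\n", 1)[1])
    print(len(changes), "open changes with a stale jenkins -1")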
19:16:18 if i can slip in a late addition to the topic...
19:16:27 pabelanger proposed removing infra-check: https://review.openstack.org/521880
19:16:33 I think we have plenty of time
19:16:33 i think it's time for that as well
19:16:45 jeblair: ++
19:16:50 wfm
19:17:02 ++
19:17:10 Yay
19:17:24 i did not approve in case others were still reviewing
19:17:38 and zuul-env from DIB: https://review.openstack.org/514483/
19:18:38 aiui the shim should handle the removal of that transparently, but we should still be prepared to keep an eye out and roll back images if something about that fails
19:18:39 for ^ I think we just need to approve it when we are ready to field questions from anyone that may have used the zuul env in weird ways (let's hope there are zero cases of that :) )
19:18:46 jeblair: yup, that was my understanding as well
19:19:12 cool
19:19:37 clarkb: wait - you're saying that people may have used a thing in ways we didn't intend or foresee? I find that hard to imagine ...
19:19:56 :)
19:20:08 i suppose we could codesearch for /usr/zuul-env or whatever the path is
19:20:16 mordred: I expect you didn't find any such cases while working on releasenotes and sphinx ;)
19:20:37 do we want to have a more explicit list of transition cleanup changes somewhere that we can work through?
19:20:49 codesearch says manila is doing that
19:20:51 I know we had an etherpad for the VM instances; maybe tack something onto there
19:21:05 and oslo.messaging
19:21:06 * AJaeger approves the infra-check removal in a minute or two
19:21:14 and tooz...
19:21:18 there's a ton
19:21:34 another thing was moving the jenkins user stuff into its own element that third party CIs could use that we wouldn't
19:21:35 ahh, because they copied legacy playbooks in-repo
19:21:44 pabelanger: ^ do you know if the jenkins stuff has changes yet (or maybe is done?)
19:21:53 AJaeger: I'm certainly not STILL finding them
19:21:59 clarkb: yah, I have a change up, but need to rework it still into its own element
19:22:02 I can finish that today
19:22:37 so anyway, according to codesearch there's going to be a ton of cross-project work involved in removing use of zuul-env from copies of legacy jobs
19:22:56 likely also copied to stable branches codesearch isn't indexing
19:23:01 fungi: probably worth an email to the dev list with a link to the change where we want to delete it
19:23:10 yeah
19:23:15 similar to the run: foo -> run: foo.yaml email
19:23:42 https://review.openstack.org/521937/ could use a final +3 too, removes static wheel-builder slaves
19:23:58 once merged, i can delete servers and push on other static slave nodes
19:24:17 wait - removing zuul-env from the images won't break people using /usr/zuul-env/bin/zuul-cloner - that's where we put zuul-cloner in the legacy base job
19:24:31 mordred: right
19:24:37 it's only if they use any other content of the venv
19:24:41 so like the manila jobs that use it shouldn't be broken, they're in legacy jobs
19:24:44 yah.
19:24:53 it will break once we remove zuul-cloner from the base job, but that is another topic for another day
19:25:02 into base-legacy
19:25:29 it's already in base-legacy, no?
19:25:41 yes, but we haven't removed it from base yet
19:25:49 it's in both?
19:25:52 this: /usr.zuul-env(?!.bin.zuul-cloner)/ should be a regex for finding zuul-env uses other than zuul-cloner, yeah?
19:25:59 jeblair: yes
19:26:06 on purpose?
19:26:40 see https://review.openstack.org/513506/ for history
19:26:58 mordred: that looks right
19:27:10 I think once we remove zuul-env in DIB, we can circle back to 513506
19:27:45 clarkb: ok. according to that there are no uses of zuul-env that are not zuul-cloner
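A quick illustration of that pattern's negative lookahead (shown with Python's re module here; codesearch's regex dialect may differ slightly, and the dots stand in for the slashes in the real path):

    # Quick check of the codesearch pattern above: the negative lookahead
    # matches /usr/zuul-env references EXCEPT the zuul-cloner shim path.
    import re

    pattern = re.compile(r"/usr.zuul-env(?!.bin.zuul-cloner)")

    assert pattern.search("/usr/zuul-env/bin/pip install foo")
    assert pattern.search("tox -e venv /usr/zuul-env/bin/python")
    assert not pattern.search("/usr/zuul-env/bin/zuul-cloner git://...")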
19:27:58 okay.
19:28:11 i agree the way to proceed is to remove it from images, then 513506
19:28:16 oh, got it, so if they use zuul-cloner that's fine since we'll still put something executable at /usr/zuul-env/bin/zuul-cloner for the foreseeable future?
19:28:23 if 506 breaks people, it will be easy to revert
19:28:40 fungi: yes
19:28:48 fungi: there just won't be a virtualenv there around it
19:29:13 and to double check, the zuul-cloner shim doesn't rely on the virtualenv python or any libs there, right?
19:29:13 I agree with jeblair - I think 506 is good to land - there really shouldn't be jobs using zuul-cloner and not parented on legacy-base, and if there are we need to find them
19:29:16 it runs from system python?
19:29:19 clarkb: that's right
19:29:23 oh, i'm assuming we put the shim in base because just in case someone ended up (even accidentally as dmsimard says) using the cloner, we wanted them using the shim and not v2
19:29:24 that's less scary then. i thought we were talking about _also_ removing the zuul-cloner shim
19:29:38 clarkb: it's not installed in the venv - it's just copied there
19:29:44 to summarize:
19:29:55 1) remove zuul-env from images now. if that doesn't blow up:
19:30:14 mordred: ++
19:30:15 2) remove zuul-cloner shim from base immediately afterwards.
19:30:35 3) if that blows up, fix those jobs, or temporarily revert 506 to add the zuul-cloner shim back. repeat as necessary.
19:30:46 jeblair: ++
19:30:53 4) in the distant future, remove legacy-base (which will continue to install the shim as long as it exists)
19:30:55 [eol]
19:31:04 yeah, that seems sane
19:31:07 I have verified that fetch-zuul-cloner will create the directory if it's not there
19:31:07 sounds like a plan
19:31:13 so it should still work on images without the venv
19:31:34 lemme info this
19:31:42 #info plan for removing zuul-cloner shim:
19:31:48 #info 1) remove zuul-env from images now. if that doesn't blow up:
19:31:52 #info 2) remove zuul-cloner shim from base immediately afterwards.
19:31:56 #info 3) if that blows up, fix those jobs, or temporarily revert 506 to add the zuul-cloner shim back. repeat as necessary.
19:32:00 #info 4) in the distant future, remove legacy-base (which will continue to install the shim as long as it exists)
19:32:01 k
19:32:18 step 1 is https://review.openstack.org/#/c/514483
19:32:29 wfm
19:32:33 step 2 is https://review.openstack.org/#/c/513506/
19:32:50 jeblair: plan is fine. Who wants to +A 514483?
19:32:54 #info step 1 is https://review.openstack.org/#/c/514483
19:33:11 #info step 2 is https://review.openstack.org/#/c/513506/
19:33:16 I can watch step 1 now
19:34:06 ok, anything else related to zuulv3?
19:34:18 nak
19:34:27 it is awesome
19:34:30 :)
19:34:37 #topic General Topics
19:35:01 This is where I cleared out a whole bunch of stuff from the agenda that appeared stale, so speak up if I did so and shouldn't have
19:35:08 pabelanger: I +A'd 514483
19:35:19 mordred: ack
19:35:26 Worth mentioning again that ianw reported the new backup server is functioning and has the old backup server's volumes attached to it
19:35:38 thank you ianw for getting that sorted out
19:36:04 thanks a ton ian! that's been on our backlog longer than i want to think about
19:36:32 we should make a note to remove and delete the old volumes after an appropriate amount of time. Maybe in the new year?
19:37:07 that gives us just over a month or so of keeping old backups around
19:37:18 yeah, that seems long enough to me
19:37:36 agree
19:38:02 we've also never figured out how to rotate backups so we don't eat disk space indefinitely. i wonder if that's a good model (switch volumes, then eventually remove old volumes)
19:38:37 fungi: so rather than adding up to one 3TB filesystem, just swap out an old 1TB fs for a new 1TB fs?
19:39:04 wfm
19:39:23 (should check actual usage before committing to a specific size, but I like the idea of rotating rather than appending)
19:39:26 if memory serves, bup doesn't have a way to age out data, so we do incur a bunch of overhead re-priming the new volume under that model
19:39:43 fungi: correct
19:40:01 as we have to transfer a full copy of everything rather than just differential/incremental changes
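To make that re-priming overhead concrete: a swapped-in volume under the rotation model would need a full save into an empty repository, roughly along these lines. A sketch assuming bup's standard init/index/save workflow; the paths and save name are hypothetical:

    # Sketch: prime an empty bup repository on a freshly rotated volume.
    # Every object must be written again from scratch (no existing packs
    # to deduplicate against), which is the overhead mentioned above.
    import os
    import subprocess

    env = dict(os.environ, BUP_DIR="/mnt/backup-new/bup")  # hypothetical mount
    source = "/var/backups/staging"                         # hypothetical data

    subprocess.run(["bup", "init"], env=env, check=True)
    subprocess.run(["bup", "index", source], env=env, check=True)
    subprocess.run(["bup", "save", "-n", "nightly", source], env=env, check=True)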
19:40:53 i think it comes down to a question of how much we're backing up, and how much retention we want
19:41:04 ianw may also have thoughts, having just done it
19:41:11 and how much disk we can allocate, i guess
19:41:39 though this might also be a good thing to try moving to vexxhost?
19:41:58 ya, ceph may make this substantially easier
19:42:00 (not sure)
19:42:28 if nothing else, it's been suggested that getting available block devices of substantial size is much easier for us there than in rax
19:43:16 like, we could get a 25TB block device rather than having to stitch together a slew of 1TB volumes with lvm or raid0
19:43:19 we may also consider a different tool like borg, which has support for append-only and not-append-only backups. Not sure if you can switch between them in a way that makes sense given the reasons we have append-only backups in the first place
19:44:07 but that's likely significantly more work
19:44:12 yah
19:45:25 The other general item I wanted to bring up quickly was that we are mostly keeping up with the logstash job queue at this point. It's been steady around 130-150k jobs for a couple days now and the worker processes aren't crashing \o/
19:45:55 nice
19:46:11 I'd like to not add any significant load to that system (new files to index) until after the holiday, as I'd like to see if it catches up and drives to zero with the expected drop in job activity during the holiday
19:46:19 if it does that then I think we can slowly add things back in and see how we do
19:46:41 clarkb: https://review.openstack.org/520171 adds one file
19:46:51 clarkb: do you want to WIP? ^
19:46:57 AJaeger: ya, I can WIP it
19:48:04 sounds like a fine plan
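Assuming the job queue in question is the gearman queue feeding the log worker processes, its depth can be sampled with gearman's plain-text admin protocol. A minimal sketch; the host name is a placeholder and 4730 is gearman's conventional port:

    # Sketch: sample queue depth via gearman's admin protocol. "status"
    # returns one line per registered function:
    # name<TAB>queued<TAB>running<TAB>workers, terminated by "." alone.
    import socket

    def total_queued(host, port=4730):
        with socket.create_connection((host, port), timeout=10) as sock:
            sock.sendall(b"status\n")
            data = b""
            while not data.endswith(b".\n"):
                chunk = sock.recv(4096)
                if not chunk:
                    break
                data += chunk
        total = 0
        for line in data.decode().splitlines():
            if line == ".":
                break
            name, queued, running, workers = line.split("\t")
            total += int(queued)
        return total

    print(total_queued("logstash.example.org"))  # placeholder host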
NOT US" 19:53:39 I'd paid to see jeblair sloshed 19:53:41 jeblair: i can always make a point of getting sloshed regardless of the week 19:54:37 pabelanger: for the second thing, does it make sense to just merge v3 into master on nodepool at this point? 19:55:14 clarkb: maybe jeblair or mordred can answer that 19:55:18 then the install will update automagically to the new version? we just have ot update the config in sync right? 19:55:29 tbf, we could probably merge v3 to master on zuul too 19:55:43 clarkb: we'd need to update puppet for python3 support, but yat 19:55:59 I'm thinking that may be the best approach as it solves the underlying issue of having the two branches 19:56:11 its more work but gets us into a better state I think then we just go back to dev on master 19:56:17 yah, if we want to have that discussion, sure 19:56:37 * mordred is in favor of merging back to master 19:56:40 on both 19:56:45 it might be best to go ahead and make a plan to tell third-parties to freeze anything they need, then start updating the puppet modules and merge the branches in 19:57:11 we've made recentish tags on both that 3rd parties can use, yeah? 19:57:26 yeah, i think we need a plan for puppet modules though 19:57:44 we have v3 flags on the puppet modules today 19:57:48 we could invert it? 19:57:52 maybe someone could work through all the steps for that 19:57:59 so that v3 is default and if you are deploying v2 then set the flag? 19:58:25 clarkb: yeah, v2 + v2 release tag could be inputs to the puppet module to get a v2 system 19:58:28 and defaults could all be v3 19:58:57 totally missed out on the entire meeting, I had a topic but I guess we're out of time ? :/ 19:59:14 you have half a minute ;) 19:59:31 also I'm sure we'll mostly be around in -infra after the meeting 19:59:35 Don't want to overlap, I'll bring it up in #openstack-infra yeah 19:59:49 though i will be stuffing my face with a fried chicken sandwich 19:59:58 now i want one 20:00:02 thanks everyone! 20:00:05 #endmeeting