19:01:22 #startmeeting infra
19:01:23 Meeting started Tue Jun 11 19:01:22 2019 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:24 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:26 The meeting name has been set to 'infra'
19:01:45 #link http://lists.openstack.org/pipermail/openstack-infra/2019-June/006399.html
19:02:01 That is the agenda I sent out yesterday. We'll have a couple of extra items compared to that, which showed up late
19:02:11 #topic Announcements
19:02:34 I'm going to be out next week, so we will need a volunteer to run the meeting
19:02:52 Also, we started enforcing SPF failures on our mailman list server
19:03:18 I think fungi has decided that qq.com does have valid servers in their policy, so we should only be dropping the invalid sources now
19:03:28 yup
19:03:39 so far it's really knocked the spam volume waaaaay down
19:04:45 yay, here's hoping we can stop blackholing the list admin addresses
19:05:11 #topic Actions from last meeting
19:05:20 #link http://eavesdrop.openstack.org/meetings/infra/2019/infra.2019-06-04-19.01.txt minutes from last meeting
19:05:32 corvus was going to do something with the storyboard slow log data
19:06:01 corvus: ^ did that end up happening?
19:06:02 i ended up sticking the files in afs for the storyboard team to work with
19:06:24 awesome
19:06:29 i ran a simple report and mordred gave suggestions on using a tool to do more investigation if warranted
19:06:50 It sounded like SotK knew where to look in storyboard given what the slow log was telling us
19:06:53 i guess we should shut off the slow query log now though
19:06:58 before it fills the disk
19:07:26 or at least truncate the existing one and gather new data now that a few changes landed over the weekend to simplify a few queries
19:07:40 oh cool, given that, yeah
19:08:11 sounds like a plan, thanks
19:08:53 #topic Priority Efforts
19:09:07 #topic Update Config Management
19:09:35 ianw has some ansiblification work we'll talk about a bit later. Other than that, any progress on other portions of this effort?
19:11:13 sounds like no? Given that, let's move on to the next topic
19:11:26 #topic OpenDev
19:12:22 This is where I've spent a good chunk of my time recently, in a roundabout way. I've wanted to rebuild gitea06, which we wanted to build on our control plane images built by nodepool. This led me to digging into why nodepool wasn't reliably building images and then why it wasn't reliably uploading images
19:12:56 and now it is!
19:13:00 I ended up clearing out /opt/dib_tmp on the nodepool builders to free up disk space so that images could build again. In that process I rebooted the builders to clear out stale mounts as well, and that restarted nodepool, which updated openstacksdk
19:13:28 the new openstacksdk couldn't upload to rax, which was fixed by downgrading the sdk; then mordred figured it out and fixed it properly in the sdk, so the next release should be good
19:13:43 oh, I also discovered we had leaked a bunch of images in the process, which I've worked to clean out of the clouds
19:13:56 and now ya I think nodepool is a lot happier and we should be ready to boot a gitea06 :)
19:14:42 after which time we get to find out what's involved in transplanting a gitea database?
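For context on the "mysqldump and source" approach fungi mentions next, here is a minimal sketch of what transplanting the gitea database could look like, assuming a MySQL/MariaDB backend and hypothetical database and file names:

    # on the old gitea server (database name assumed to be "gitea")
    mysqldump --single-transaction -u root -p gitea > gitea.sql
    # on the new server, after creating an empty "gitea" database;
    # running "source gitea.sql" from the mysql prompt is equivalent
    mysql -u root -p gitea < gitea.sql

As noted in the discussion, the on-disk git content would likely be copied over separately rather than waiting on slow replication.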
19:14:50 fungi: yup
19:15:00 * fungi hopes it's just mysqldump and source
19:15:04 fungi: also possibly the git content as well, so that we don't have to rely on slow replication for that
19:15:07 clarkb: the patches landed, so I submitted a release request
19:15:27 mordred: cool, we should probably plan to restart nodepool services once the new release is installed on the nodepool nodes
19:15:44 ++
19:16:03 https://review.opendev.org/#/c/664585/ for anyone who is interested
19:16:16 hopefully there aren't more regressions hiding in the recent sdk release that we haven't found yet
19:16:57 nodepool was updated to exclude the broken releases as well
19:17:10 as i understand it, this one was hard to find since it's particularly challenging to deploy a "rackspace" in testing.
19:17:29 ya
19:17:47 and it ended up being a really obtuse thing too
19:17:53 yah
19:17:57 (the type was wrong in a json document and the server only responded with a 400)
19:19:04 Any other opendev-related updates?
19:21:09 #topic Storyboard
19:21:24 We talked about db slow log data collection and recent improvements to that already
19:21:32 Are there any other updates to bring up re storyboard?
19:21:55 new feature merged over the weekend to allow auto-assigning of groups to security stories
19:22:28 er, not assigning
19:22:38 so projects can have groups mapped to them, and if those groups are flagged as security type then they get automatically subscribed to any story marked "security"
19:23:05 there is also a distinction now between security stories and just normal private stories
19:23:14 That means people will be able to see them but won't get email notifications yet, right?
19:23:27 s/subscribed/added to the ACL for/ for clarity
19:23:35 er, that's right
19:23:50 if the security checkbox is checked on creation, the story automatically starts out private and can't be made public until after creation
19:24:03 which is a nice extra bit of safety
19:24:58 and we deploy from master too, right? So these db fixes and new features should show up as they merge?
19:25:10 yep. we have yet to exercise those changed features on the storyboard.openstack.org deployment but they should be available now
19:25:33 excellent
19:26:08 Anything else?
19:26:43 some of the changes related to that series cleaned up and refactored the queries used for private story matching
19:26:59 so hopefully at least some performance improvement there
19:27:34 there was also at least one unrelated change which merged that should have improved one of the db queries, i think, but now i forget which it was
19:28:26 Ok, let's move on
19:28:29 oh, and i approved the project-config change today where karbor is moving to sb
19:28:29 #topic General Topics
19:28:31 that's all i had
19:28:58 fungi: mordred: any updates or things we can help with on the wiki and status upgrades?
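Relating to the storyboard slow query log discussed above (shutting it off before it fills the disk, or truncating it to gather fresh data), a rough sketch of what that could look like on the database server; the log file path is a guess and the real deployment may differ:

    # stop collecting slow queries
    mysql -e "SET GLOBAL slow_query_log = 'OFF';"
    # rotate the existing log out of the way and let mysql reopen a fresh file
    mv /var/lib/mysql/mysql-slow.log /var/lib/mysql/mysql-slow.log.old
    mysql -e "FLUSH SLOW LOGS;"
    # re-enable only if we want new data after the recent query simplifications
    mysql -e "SET GLOBAL slow_query_log = 'ON';"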
19:29:05 #link https://etherpad.openstack.org/p/201808-infra-server-upgrades-and-cleanup
19:30:23 looks like puppet may not be updating on wiki-dev, i need to dig into why
19:31:02 no mention of puppet or ansible in its syslog
19:31:36 oh, right
19:31:46 i keep looking at the original server, not the replacement
19:32:14 clarkb: nope - I'm just failing at working hours at the moment - nothing about it will be difficult as best I can tell right now
19:32:49 i think i'm now into trying to reconcile some missing upstream git repos for various mw extensions
19:32:58 (something something selling a house is more work than moving out of an apartment something something)
19:33:28 mordred: I recommend you don't skip out on closing day (the people we bought our house from tried to)
19:33:52 alright, next up is the https opendev mirror situation
19:33:53 hah. yeah - nope, I'm ready to sign those papers and get that cash money
19:34:03 #link https://review.opendev.org/#/c/663854/ and parents to deploy more opendev mirrors
19:34:06 (they walk into the room with the entire purchase price in a box of $20s right?)
19:34:34 it can take a while if you have to go to a bunch of atms
19:34:42 we are still observing the afs failures (cache related?) against the dfw opendev mirror
19:34:57 I think ianw intended to deploy a small number of extras to see if we can reproduce the behavior on them
19:35:02 the change above and its parents should get us there
19:35:13 yeah, that's annoying; the hosts for that are up and ready
19:35:24 This is mostly just a call for reviews and a notification that the afs stuff is still happening
19:35:40 ianw: oh cool, are they in dns and everything?
19:36:08 i approved all of the stack except the one that puts them into production
19:36:09 clarkb: umm, have to check the reviews but it's all out there
19:36:23 k
19:36:37 Next up is github replication
19:36:39 i guess i'll have to get onto building openafs from upstream and we can roll that out to a host and see if it's more stable ...
19:36:50 ianw: ++
19:36:59 #link http://lists.openstack.org/pipermail/openstack-discuss/2019-April/005007.html
19:37:03 mostly i wanted to find out if our attempt to get out of the business of dealing with github was working? it seems that many folks are interested in replicating to github, and they seem to be asking for our help in doing it. are people happy with this status quo?
19:37:49 I think the biggest problem we've run into so far is people who want a repo org transfer but don't want us to have temporary admin on the remote side
19:38:06 I found a workaround for that which works
19:38:14 When we do get our user added as an admin, the process seems to go pretty smoothly
19:38:14 maybe we should update our docs to describe that?
19:38:28 well, and that people who want repo org transfers need our help to do it (unless we add volunteers to the gh admins group who want to take that over)
19:38:42 fungi: which I think we intend to do with openstack/ ?
19:38:48 yeah
19:38:55 which is - a) transfer repo to openstackadmin b) transfer repo to $otherhuman c) stop caring
19:39:02 $otherhuman can then transfer to their org
19:39:09 clarkb: i think that's the only one which matters in our case anyway
19:39:21 mordred: and all of the redirects still work, ya?
19:39:25 yes
19:39:44 cool. Given that, maybe we should enlist some openstack volunteers, write that process down, and then get out of the business entirely :)
19:39:50 ++
19:39:50 likely we need to ask the openstack tc to drum up some volunteers to take over management of the openstack org on gh now
19:40:08 in absolute terms, our interaction with github has increased dramatically, but it sounds like folks think this is a fairly time-limited thing still, and we should ramp down to zero soonish (like, this year)?
19:40:25 or what clarkb just said :)
19:40:26 corvus: ya, and for example with the repo renames we just did, we didn't have to touch github at all, which was great
19:40:33 i would like to ramp down to 0 much sooner than the end of this year
19:40:39 this month would be nice
19:41:15 okay, that mostly involves working with the tc to set that up. yeah?
19:41:16 and the opendevzuul app has moved to opendevorg - so there's no conflict there between that and a tc-delegated team managing the openstack gh org
19:41:16 Considering jroll's preexisting involvement, maybe we should hand over volunteer enlistment to jroll?
19:41:21 also we ought to figure out a timeline for getting gh out of our gerrit replication config
19:41:36 we should probably make an opendevadmin account and make the opendevorg org owned by opendevadmin
19:41:42 so that there is no remaining overlap
19:41:44 corvus: ya, I think jroll is our contact on the tc to do that
19:41:49 mordred: ++
19:42:00 okay, how about i take on the 'work with jroll' task
19:42:06 corvus: wfm
19:42:24 #action corvus work with jroll to get openstack github org self managed
19:42:42 i'll suggest that he enlist some volunteers to take over operation of the openstackadmin account, continuing the usage policy ianw & co established
19:42:54 sounds great to me
19:42:54 I can create an opendevadmin user - but I am not set up to do the 2fa setup any time in the next 2 weeks - so it might be an easier task for someone else - or else we can wait until I'm less snowed under and I'll volunteer to do the whole thing
19:43:26 whoever sets up that account, remember to make it an owner of the org(s)
19:43:58 mordred: my guess is 2 weeks is probably a fine timeline for that
19:43:59 k. i'll make sure to mention we need to retain access until the opendevadmin thing is set up :)
19:44:23 k. I'll take the task then
19:44:37 #action mordred set up github opendevadmin account for the opendevorg org and zuul app
19:44:51 that said - the people taking over don't have shared shell access to a vm with a shared secrets file that they can use to manage shared access to an admin account
19:45:17 mordred: could encrypt it with all of their gpg keys or something like that
19:45:29 yeah, the discussion may involve more work than just a simple handoff
19:45:37 yeah - just saying - the openstackadmin design as it is now works for the infra team ... but yeah ^^ that
19:45:47 This is our 15 minute time check. I think we need to keep moving to get through the agenda
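One way to address the shared-credentials concern raised just above (encrypting the secrets file to all of the new admins' gpg keys), sketched here with hypothetical file names and recipient addresses:

    # encrypt the shared github credentials to every admin's public key
    gpg --encrypt \
        --recipient admin1@example.org \
        --recipient admin2@example.org \
        --output github-creds.txt.gpg github-creds.txt
    # any listed recipient can decrypt with their own private key
    gpg --decrypt github-creds.txt.gpg

Adding or removing an admin would mean re-encrypting the file with an updated recipient list, which matches the "more work than just a simple handoff" caveat above.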
19:45:53 i'm good
19:45:54 and for the record - I'm fine with however the new owners want to do things
19:45:59 Next up we have fungi's topic for requesting spamhaus pbl exceptions when we boot new servers that send email
19:46:23 pretty simple reminder
19:46:51 rackspace has blanketed basically all of their ipv4 address space with spamhaus policy blocklist entries
19:47:04 we could put a print() in launch.py
19:47:07 Maybe we should add a note for doing that to launch node's ... ya, that
19:47:22 corvus: I think that reminder would help me remember :)
19:47:27 so when we boot a machine there which will be sending e-mail to a variety of people, we need to query the pbl for its v4 address
19:47:31 https://www.spamhaus.org/query/ip/104.239.149.165
19:47:46 I can write that change really quickly after lunch if others think it will be useful
19:47:54 ++
19:47:58 and then there is a dance you have to perform involving getting a one-time code sent to a responsible e-mail address
19:48:07 and feeding that back into the site
19:48:29 i usually use infra-root@o.o for that
19:48:43 k, I can add that info to the print output
19:49:17 probably any address would work, but if it winds up getting re-added for some reason, i don't know whether switching e-mail addresses may cause them to decide not to grant a new exception
19:49:46 well, clouds do reuse IPs, so we may run into that just in our normal operation of replacing servers
19:49:54 but we can figure that out if we get into that situation
19:50:12 (spam reporting back to spamhaus for messages from an address in the exception list can cause the exception to get removed)
19:50:43 Next up, I wanted to make a quick reminder that I'm going to replace certs that expire soon this week. I'll probably start the renewal process on thursday
19:50:57 and I'll be getting 2-year certs for those
19:51:01 i ran into it in the wake of the ask.o.o server replacement; folks were unable to create new accounts because their e-mail providers were discarding/rejecting the account confirmation e-mails from it
19:51:15 which will hopefully give us plenty of time to gradually switch over to LE
19:51:22 fungi: ah
19:51:56 And the last item we have is ianw's backup server and client config with ansible
19:51:58 #link https://review.opendev.org/#/c/662657/
19:52:22 i just wanted to float a few things past the peanut gallery
19:52:33 Looks like ianw plans to boot a new backup server running bionic and has questions about 1) whether we would like to keep not running config mgmt there once deployed and 2) whether we want 2 servers
19:52:51 i plan to bring up a new server once the reviews are in on that ... rax ord still the preferred home?
19:53:17 we might consider vexxhost, if mnaser is comfortable with that, to give us geographic and provider redundancy
19:53:53 but I think we do want at least geographic redundancy, so ord seems like a good spot
19:54:12 yeah, the ansible is intended to automate key and user management, so adding a server should just be an entry in the inventory
19:54:30 i don't think we want to deal with stuff at rest in two places ... but we could
19:54:34 the main reason not to run config management regularly is extra protection in case of an error or compromise. granted, i come from a *very* conservative sysadmin background when it comes to such things. :)
19:55:24 i second that preference though
19:56:10 being conservative with our backups seems reasonable
19:56:14 also, keeping backups in a different provider from where the bulk of our control plane currently resides seems like a prudent choice in case that provider has a sudden and unexpected change of heart
19:56:43 agreed. if we have to pick one, i'd pick vexxhost. then rax-ord as a second site.
19:56:46 ok, we can keep the disabled policy
19:56:47 hoping that would never happen, but you have to have contingency plans
19:57:07 our original backup hosts were rax-ord and hpcloud
19:57:16 ok, i can reach out to mnaser and figure out where we might do it
19:57:17 so, erm, that plan has already served its purpose once.
19:57:32 except the hpcloud backup server kept getting deleted
19:57:41 i mean before the last time it got deleted, of course
19:57:48 yeah. none of this is theoretical. :(
19:58:02 We are just about out of time
19:58:07 #topic Open Discussion
19:58:21 Anything else, really quickly, that we want to call out before we have lunch/breakfast/dinner?
19:58:47 clarkb: wow, someone's hungry! :)
19:58:52 #link https://review.opendev.org/664675 Replace the old Stein cycle signing key with Train
19:58:54 Very
19:59:15 would be nice to approve that today so i don't have to rewrite the date in the change i'm going to push to release docs
19:59:15 fungi: at some point we should figure out what that process means in the opendev world
19:59:21 i concur
19:59:29 but this one was slightly past due
19:59:33 yeah, now is not the time for that :)
19:59:52 sounds good for next week's agenda
19:59:55 i'll add it
20:00:05 and we are at time
20:00:08 Thank you everyone!
20:00:11 #endmeeting
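A footnote on the spamhaus pbl discussion above: the check that the proposed print() reminder in launch.py would point people at can also be scripted as a DNS lookup before visiting the web form. A sketch, using the example address from the meeting and the standard pbl.spamhaus.org DNSBL zone; note that spamhaus may not answer queries routed through large public resolvers:

    # reverse the octets of the new server's v4 address and query the PBL zone
    # 104.239.149.165 -> 165.149.239.104.pbl.spamhaus.org
    dig +short 165.149.239.104.pbl.spamhaus.org
    # an answer of 127.0.0.10 or 127.0.0.11 means the address is PBL-listed and an
    # exception should be requested via the web form
    # (https://www.spamhaus.org/query/ip/104.239.149.165, the one-time-code dance
    # described above); no answer means it is not listed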