19:01:10 #startmeeting infra
19:01:11 Meeting started Tue Oct 23 19:01:10 2018 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:12 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:15 The meeting name has been set to 'infra'
19:01:20 o/
19:01:21 #link https://wiki.openstack.org/wiki/Meetings/InfraTeamMeeting#Agenda_for_next_meeting
19:01:23 o/
19:01:30 #topic Announcements
19:01:50 Just another friendly reminder that the summit and forum fast approach and you may wish to double check the schedule
19:01:54 #link https://www.openstack.org/summit/berlin-2018/summit-schedule/#track=262 Berlin Forum schedule up for feedback
19:02:00 so fast
19:02:20 I think it is fairly solid at this point so mostly just a reminder to go look at the schedule if you haven't already. Unsure if changes can be made easily now
19:02:38 fungi: There is rumor that food awaits me after the meeting. Good motivation :)
19:03:01 oh, i was simply commenting on the fast approach of the summit and forum
19:03:06 oh that :)
19:03:07 but yes, fast food too
19:03:13 indeed it is like 3 weeks away
19:03:49 I'll be visiting the dentist tomorrow midday, but don't expect any drilling or drugs to happen so I should be around. Otherwise I have not seen any announcements
19:04:03 #topic Actions from last meeting
19:04:11 #link http://eavesdrop.openstack.org/meetings/infra/2018/infra.2018-10-16-19.01.txt minutes from last meeting
19:04:32 No explicit actions were called out but we did some work on things we talked about (which we'll get to later in the agenda)
19:05:30 #topic Specs approval
19:05:57 Any specs we should be aware of? I think we are mostly heads down on implementing work that is already captured by specs or general maintenance
19:06:09 I guess the storyboard spec is still outstanding. And I keep meaning to read it
19:06:23 you and a bunch of us
19:06:27 well, at least me anyway
19:06:52 #link https://review.openstack.org/#/c/607377/ Storyboard attachments spec. Not ready, but input would be helpful
19:07:13 frickler has review comments. diablo_rojo not sure if you've seen those
19:08:10 #topic Priority Efforts
19:08:16 #topic Storyboard
19:08:23 That makes a good transition to the storyboard topic
19:08:51 clarkb, I saw them, was waiting for some more before I did an update
19:09:04 just poked SotK for comments
19:09:06 diablo_rojo: ok, I'll try to take a look at it today after the nodepool builder work
19:09:14 clarkb, that would be suuuper helpful
19:09:49 fungi, too if you have some free time in between the house stuff.
19:09:52 mordred: not sure if you are here, but I think you started looking at using pbrx with storyboard to work on server upgrades?
19:09:54 yep!
19:10:15 * clarkb will try to be a stand in for mordred
19:10:40 mordred volunteered to work on the storyboard trusty server upgrades and seems to have taken the approach of using storyboard(-dev) as an early container deployment system
19:10:59 Seems like a good approach to me
19:11:01 The upside to this is that as a python + js application it is actually fairly similar to zuul, which means pbrx should work for it
19:11:05 it's a fairly self-contained service so might be a good test case
19:11:30 I want to say the early work has been in updating bindep.txt type files to have pbrx produce working container images for storyboard
19:11:52 if you are interested in this aspect of the config mgmt updates this is probably a good place to follow along and/or participate
19:11:57 the api server is straightforward and the webclient is just a wad of files that need to be splatted somewhere a webserver can find
19:12:03 ianw: ^ you may have thoughts in particular since you've looked at something similar with graphite
19:12:40 though there's also the need to have rabbitmq running, and also the worker process
19:12:48 so i guess not completely straightforward
19:13:06 fungi: it should be a good real world application though as it isn't trivial
19:13:17 while still tying into the tooling we've already built
19:13:19 clarkb: yeah, i need to get back to the actual "get docker on control plane server" bit, some prelim reviews out there that didn't quite work
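
An aside on the bindep.txt work mentioned above: bindep files declare the distro packages a project needs, with optional profile tags in brackets, and pbrx consults them when assembling container images. A minimal sketch of what such entries can look like; the specific packages here are illustrative, not storyboard's actual list:

    # Packages needed only when compiling Python C extensions;
    # tagging them [compile] lets an image build drop them afterwards.
    gcc [compile]
    libffi-dev [compile platform:dpkg]
    libffi-devel [compile platform:rpm]
    libmysqlclient-dev [compile platform:dpkg]
    mariadb-devel [compile platform:rpm]
    # Runtime library kept in the final image on Debian-family platforms
    libmysqlclient20 [platform:dpkg]
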
19:14:13 diablo_rojo: fungi any other storyboard topics to bring up before we move on to the config mgmt updates?
19:14:28 nah
19:14:28 I don't think so.
19:14:34 thanks!
19:14:41 Thank you :)
19:14:43 #topic Config Mgmt Update
19:15:14 As mentioned in the context of storyboard, the work of updating our config management processes continues to happen
19:16:05 i guess there are a couple base classes of work going on there? 1. containerizing things, and 2. automatic deployment/replacement of containers?
19:16:07 #link https://review.openstack.org/#/c/604925/ add zuul user to bridge.o.o for CD activities could use a second review
19:16:39 also https://review.openstack.org/609556
19:16:45 fungi: ya part of the spec is the idea that we'll build images regularly to pick up updates and to avoid being stuck on insecure software unexpectedly
19:16:50 i guess i should have used the topic on that
19:16:59 fungi: to make that useful we also need to update the running deployments with the new images
19:17:19 #link https://review.openstack.org/#/c/609556/ Install ansible 2.7.0 release on bridge.openstack.org
19:18:03 mostly I think the work here needs a few reviews. The topic:puppet-4 stuff is largely blocked on reviews as well
19:18:24 if we can find time to take a look at topic:puppet-4 and topic:update-cfg-mgmt we should be able to make some pretty big progress over the short term
19:18:42 FWIW I sent patches to see what it would look like to enable ara on bridge.o.o, would love feedback https://review.openstack.org/#/q/topic:ara-on-bridge
19:19:11 #link https://review.openstack.org/#/q/topic:ara-on-bridge changes to run ara on the bridge server to visualize ansible runs there
19:19:37 should we change the topic to update-cfg-mgmt?
19:19:49 i think ara is in scope for that
19:19:52 corvus: dmsimard ++ I think this is an aid for that spec and could use the topic
19:19:58 sure
19:20:35 updated
19:21:44 Other than "please review as you can" any other items for this topic?
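
For context on the ara-on-bridge changes: ara (in its 0.x releases, current at the time) hooks into Ansible as a callback plugin configured through ansible.cfg and records runs to a local sqlite database by default. Roughly, and with an illustrative path (the reviews linked above are the authority for what was actually proposed):

    [defaults]
    # Directory containing ara's callback, e.g. as printed by
    # `python -m ara.setup.callback_plugins`; the exact location
    # depends on how and where ara is installed.
    callback_plugins = /usr/lib/python2.7/site-packages/ara/plugins/callbacks

With no further configuration, runs land in a per-user sqlite file, which is what prompts the database question below.
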
19:22:06 re ara --
19:22:19 do we need a mysql db for that or are we going to stick with sqlite for now?
19:22:35 sqlite works well in the context of executors, each job is running on its own database
19:22:41 the problem with sqlite is write concurrency
19:22:41 sounded like there was a concurrent write issue with sqlite?
19:22:46 yeah, that
19:23:03 so we should have an infra-root set up a mysql db before we land those ara changes?
19:23:13 * mordred waves
19:23:28 do we want trove or just orchestrate a mysql service onto the bridge server?
19:23:29 or add that to the changes and run one locally?
19:23:45 corvus: the legwork is done in hostvars on bridge.o.o, pabelanger helped
19:24:12 dmsimard: meaning there is already a mysql db running somewhere?
19:24:24 yes, a trove instance
19:25:40 dmsimard: did he provide the connection info so we can update that change to use it?
19:25:41 We can definitely start with sqlite, I'm just not sure how things might behave with concurrency
19:25:55 let's not start with sqlite, let's use the trove which apparently exists :)
19:26:05 and it would be helpful if we remember to #status log changes like that that happen outside of config mgmt
19:26:16 clarkb: my default
19:26:19 fault*
19:26:44 corvus: ++
19:26:52 corvus: the connection information is secret though, so it needs to live on bridge.o.o ?
19:27:20 I guess we could add a node to the job, install mysql on it and use that for the testinfra tests
19:27:23 dmsimard: only the password i think; paul should have added that to private hiera and given you the key he put it under so the change can reference that key
19:27:46 i think sqlite is ok for testinfra
19:27:55 if it's a trove instance, we usually keep the hostname secret too
19:28:05 ok 2 things then :)
19:28:14 since there is no connection security there
19:28:16 corvus: I'm the one who committed the changes, they're in hostvars for bridge.o.o as well as ara.o.o
19:28:30 is there some potential for keys/passwords to make it into this? do we care?
19:28:43 any machine for any tenant in that same region can open sockets to the trove mysql service for it, after all
19:28:56 pabelanger helped me through the process of doing that, he didn't do it for me, sorry for the confusion
19:28:59 ianw: that was going to be my next question. Do we need to use no_log type attributes on tasks to avoid data exposure?
19:29:00 dmsimard: oh i thought you said paul did it. ok then :)
19:29:21 maybe to start we should run it locally only and we have to ssh port forward into bridge to see the web server?
19:29:27 to get a feel for what data is exposed
19:29:33 clarkb: wfm
19:29:34 clarkb: yes
19:29:58 ianw: and yes, data can be exposed
19:30:02 sounds fine
19:30:12 (not publishing the ara reports i mean)
19:30:25 ok let's start with a localhost only webserver to see what we are exposing, remediate it and learn from there, then maybe make it public
19:30:47 and go ahead and have bridge report to mysql
19:30:47 (goal should be making it open in some capacity though)
19:30:52 corvus: yup
19:31:07 oh, i wasn't really thinking of publishing the website, i was more back at "data at rest not on bridge.o.o" tbh
19:31:17 yeah, eventually it would be nice to get back what we lost with puppetboard being abandoned
19:31:20 clarkb: the original spec was to replace puppetboard
19:31:24 dmsimard: thanks for this, it's going to help a lot :)
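
Two concrete pieces behind the decisions above, both hedged sketches rather than the actual changes under review. Pointing ara 0.x at MySQL instead of sqlite is a one-line SQLAlchemy-style connection string in ansible.cfg; the hostname and credentials here are placeholders:

    [ara]
    database = mysql+pymysql://ara:PASSWORD@trove-host.example.org/ara

And the no_log question refers to Ansible's per-task switch that keeps task arguments and results out of callback output (and hence out of the ara database); a hypothetical task:

    - name: write a credential file without recording its contents
      copy:
        content: "{{ some_secret }}"
        dest: /etc/example/secret.conf
        mode: "0600"
      no_log: true
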
19:32:03 ianw: ah the actual db contents themselves
19:32:20 that's also a great point
19:32:38 are the trove instances on the public internet ?
19:32:43 dmsimard: no
19:32:47 possible we're compromising the security of bridge.o.o by putting the database somewhere not locally on that server
19:32:52 they are on the rax private network
19:32:56 dmsimard: they might as well be though
19:33:10 which is "private"
19:33:12 they're on the shared "private" network which all rackspace tenants can access
19:33:49 if this is a concern, putting mysql on bridge.o.o is another option which would also happen to reduce latency and improve performance (i.e. no roundtrip to put stuff in a remote database far away)
19:33:49 so you need access to a rackspace account/vm and to know (or guess) the database address and credentials
19:34:02 or a pre-authentication zero-day vulnerability in mysql
19:34:14 but then we might as well set up the web application there as well (instead of a dedicated ara.o.o server)
19:34:18 yeah, also possibly the communications aren't encrypted?
19:34:19 bridge is not sized correctly to even run ansible; there's no way we can add mysql there without a rebuild.
19:35:15 ianw: no possible about it. the database connections aren't encrypted, so someone with the ability to subvert that network could also sniff them maybe? i can't remember whether they're sent in the clear or as a challenge/response
19:35:47 another option would be to run a mysql off host ourselves
19:35:53 then add encryption and such
19:35:57 but it is also possible to inject into an established socket with some effort i guess
19:36:03 then we don't need to rebuild bridge.o.o for this
19:36:34 Need to drop momentarily to pick up kids
19:36:38 meh, we need to rebuild bridge anyway, so if we want to stick with on-host, i wouldn't count that as a strike against it
19:36:39 also possible trove has grown the feature to allow us to use tls/ssl for connections
19:36:44 corvus: fair enough
19:36:52 maybe we use trove initially with the understanding that when (not if) we resize bridge (because it's already not appropriately sized for ansible anyway) we add enough capacity to also put mysql on it?
19:36:58 or also what corvus said
19:37:03 why don't we pick this back up in #openstack-infra when dmsimard can rejoin? and we can continue with the meeting agenda?
19:37:06 it may also be that we consider it low risk enough that if we are leaking in the first place, it doesn't matter
19:37:13 or what fungi said :) ++
19:37:34 I do think being careful around this is worthwhile while we work to understand what we are exposing
19:37:36 yeah, we're not *supposed* to be putting private stuff in this db, the question is what happens if we accidentally do
19:37:54 we're all just repeating one another at this point anyway, so yeah we can pick it back up in #-infra later ;)
19:38:13 maybe we should pick it back up in #-infra later
19:38:19 heh
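
For reference, the "localhost only plus port forward" arrangement agreed to above amounts to binding the ara web server to loopback on bridge and viewing it through an ssh tunnel; a minimal sketch, assuming the server listens on 9191 (ara's historical default port):

    # Forward local port 9191 to the same port on bridge's loopback,
    # then browse http://localhost:9191/ from your workstation.
    ssh -L 9191:127.0.0.1:9191 bridge.openstack.org
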
19:38:21 #topic General topics
19:38:33 I'll take that as a cue
19:39:12 OpenDev is a thing and we now have some concrete early task items planned out. Step 0 is getting dns servers running for opendev.org (thank you corvus for getting this going)
19:39:35 before we can start using opendev.org everywhere we need to communicate what we mean by opendev and what our goals are
19:40:06 #link https://etherpad.openstack.org/p/infra-to-opendev-messaging is the document we started working on yesterday between the infra team and the foundation staff to try and capture some of that
19:40:43 I will be working with the foundation staff to write up the short message with a Q&A and use that to send a community wide email type thing. There is also a plan to talk about this at the summit, at least at the board meeting
19:41:10 My hunch is once we get through the summit we'll be able to start concretely using opendev and maybe etherpad becomes etherpad.opendev.org the week after summit
19:41:35 Still a lot of details to sort out, I expect we'll be reviewing documents in the near future to make sure they make sense from our perspective
19:41:58 and for anyone wondering the current working one liner is: "Community hosted tools and infrastructure for developing and maintaining open source software."
19:42:17 questions, concerns about ^?
19:43:38 i like the specificity of "free/libre open source software" but as a first go it's not bad
19:44:09 "community hosted and operated" might also be a good expansion
19:44:16 Feel free to reach out to me directly as well if this venue doesn't work for you
19:44:24 I'm reachable via email and irc
19:44:32 something to make it more obvious we're running these things as a community, not just hosting them for communities
19:44:34 fungi: you should add those thoughts to the etherpad :)
19:44:36 yup
19:44:46 ++
19:45:38 Next item is the trusty upgrade sprint. Last week ended up being a weird week for this as various individuals had less availability than initially anticipated
19:46:08 That said I did manage to upgrade logstash.o.o, etherpad-dev.o.o, and etherpad.o.o and am working with Shrews this week to delete nodepool.o.o and use our newer zookeeper cluster instead
19:46:17 Thank you to all of you that helped me with random reviews to make that possible
19:46:45 I'd like to keep pushing on this as a persistent thing because trusty EOL is about 6 months away
19:46:59 if we are able to do a handful of servers a week we should be done relatively quickly
19:47:26 #link https://etherpad.openstack.org/p/201808-infra-server-upgrades-and-cleanup list of trusty servers that need to be upgraded. Please put your name next to those you can help with
19:48:17 That brings up the last thing I wanted to talk about, which is the zookeeper cluster move
19:48:42 I got the zookeeper cluster up and running as a cluster yesterday. Shrews and I will be moving the nodepool builders to the cluster after this meeting
19:48:58 Those will then run for a day and a half or so to populate images in the new db
19:49:27 Then on thursday I'd like us to update the zookeeper config on the nodepool launchers and zuul scheduler to use the new cluster
19:49:47 We will likely implement this as a full cluster shutdown (may as well get everything running the latest code, and maybe zuul can make a release based on that)
19:50:24 hrm apparently thursday is the stein-1 milestone day though :/
19:50:26 sounds great, thanks for working on that
19:50:37 we might have to wait for the milestone stuff to happen first?
19:50:43 I'll talk to the release team after this meeting
19:50:53 milestone 1 is usually a non-event, especially now that openstack is no longer tagging milestones
19:51:03 oh in that case ya should be fine. But I will double check
19:51:29 though they may force releases for cycle-with-intermediary projects which haven't released prior to the milestone week
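
Mechanically, the thursday switch described above is a small config change on each consumer: nodepool lists its ZooKeeper cluster under zookeeper-servers in nodepool.yaml, and the zuul v3 scheduler takes a hosts line in its [zookeeper] section. A sketch with illustrative hostnames standing in for the new cluster:

    # nodepool.yaml
    zookeeper-servers:
      - host: zk01.openstack.org
        port: 2181
      - host: zk02.openstack.org
        port: 2181
      - host: zk03.openstack.org
        port: 2181

    # zuul.conf
    [zookeeper]
    hosts=zk01.openstack.org:2181,zk02.openstack.org:2181,zk03.openstack.org:2181
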
19:51:30 Once all that is done we'll want to go back through and clean up any images and nodepool nodes that are not automatically cleaned up
19:52:10 I'm volunteering myself to drive and do a lot of that because I do think this is an important switch and early cycle is the time to do it
19:52:20 you are now all warned and if you want to follow along or help let me know :)
19:52:25 #topic Open Discussion
19:52:34 And now ~7 minutes for anything else
19:52:45 we need to get them cleaned up since the image numbers will reset (and thus avoid name collisions), but we should have plenty of leeway on doing that
19:53:51 that's a good point, but ya ~1 year of nodepool with zk should give us a big runway for that
19:54:11 smallest sequence number i saw was 300 some, but still plenty
19:54:24 we will definitely. almost certainly. probably. get it all cleaned up in a year.
19:54:44 i like your optimism!
19:56:44 if anyone is wondering, zk has a weird way of determining which ip address to listen on
19:57:03 you configure a list of all the cluster members and for the entry that belongs to the current host it binds to the address in that entry
19:57:23 so if you use hostnames that resolve to localhost because of /etc/hosts then you only listen on localhost and can't talk to your cluster
19:58:15 looks/sounds like we are basically done here. Thank you everyone. Find us on the openstack-infra mailing list or in #openstack-infra for further discussion
19:58:23 I'll go ahead and end the meeting now
19:58:25 #endmeeting
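
To illustrate the ZooKeeper listen-address quirk described just before the close: cluster membership lives in zoo.cfg, and each server binds its quorum (2888) and leader-election (3888) ports to whatever address its own entry resolves to locally. A sketch with illustrative hostnames:

    # zoo.cfg on every member of a three node cluster
    server.1=zk01.openstack.org:2888:3888
    server.2=zk02.openstack.org:2888:3888
    server.3=zk03.openstack.org:2888:3888

    # Pitfall: if /etc/hosts on zk01 contains something like
    #   127.0.1.1  zk01.openstack.org  zk01
    # then zk01 binds 2888/3888 to loopback and the other members
    # cannot reach it. Making the local name resolve to the host's
    # real address (or using 0.0.0.0 in the local server.N entry)
    # avoids this.
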