19:03:18 <fungi> #startmeeting infra
19:03:19 <openstack> Meeting started Tue May 30 19:03:18 2017 UTC and is due to finish in 60 minutes.  The chair is fungi. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:03:21 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:03:23 <openstack> The meeting name has been set to 'infra'
19:03:29 <fungi> #link https://wiki.openstack.org/wiki/Meetings/InfraTeamMeeting#Agenda_for_next_meeting
19:03:32 <fungi> #topic Announcements
19:03:35 <fungi> #info OpenStack general mailing list archives from Launchpad (July 2010 to July 2013) have been imported into the current general archive on lists.openstack.org.
19:03:40 <fungi> #link http://lists.openstack.org/pipermail/openstack/ OpenStack general mailing list archives
19:03:44 <fungi> as always, feel free to hit me up with announcements you want included in future meetings
19:03:58 <fungi> #topic Actions from last meeting
19:04:04 <fungi> #link http://eavesdrop.openstack.org/meetings/infra/2017/infra.2017-05-23-19.02.html Minutes from last meeting
19:04:10 <fungi> bkero draft an upgrade plan for lists.o.o to xenial
19:04:14 <fungi> i saw you talking about this a lot last week in #openstack-infra
19:04:19 <fungi> have you firmed up any thoughts about it yet?
19:04:32 <fungi> got a new etherpad you can #link?
19:04:52 <bkero> Yes, I took the snapshot and did the update similar to the precise -> trusty update. The update went about as expected with the new version serving the content.
19:05:09 <bkero> I haven't had it send any test emails yet; I'll need to create some mailman accounts/admins to be able to do that.
19:05:25 <fungi> i saw one hiccup where the data was somewhere other than where the package expected it? did that get sorted out?
19:06:00 <bkero> The problem with that was that the lockfile directory didn't exist. I haven't nailed down the reason yet, although the service was running.
19:06:41 <bkero> We could manually create it as root:mailman with 2775, although I'd prefer to have that handled by the package. Maybe --reinstall would create it for us.
19:06:43 <fungi> could have to do with how we disabled services before taking the snapshot maybe?
19:06:58 * fungi is grasping at straws
19:07:06 <bkero> The dir is created by the package, although maybe with an upgrade something is different
19:07:24 <clarkb> but it existed on the lists.o.o server
19:07:36 <bkero> It likely existed on the snapshot before do-release-upgrade-ing as well
19:07:49 <fungi> so something cleaned it up... is it in a directory which doesn't persist between boots (e.g., /var/run)?
19:07:59 <bkero> jeblair had the theory that systemd manages /var/run and clobbered it for us
19:08:25 <jeblair> yeah, it doesn't persist across reboots.  normally the init script makes it when starting mm.
19:08:36 <fungi> yeah, i wouldn't be surprised (not actually bashing systemd here)
19:08:52 <jeblair> so only a problem if you upgrade a system which had not run the program since boot.
19:09:08 <fungi> right, that is the sort of direction i was going as well
19:09:15 <pabelanger> seems to make sense
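(For reference, a minimal sketch of the workaround being discussed, assuming the missing directory is mailman's lock dir under /var/run and following bkero's suggested root:mailman 2775 ownership; the tmpfiles.d entry is one way to have systemd recreate it on every boot:)

    sudo mkdir -p /var/run/mailman
    sudo chown root:mailman /var/run/mailman
    sudo chmod 2775 /var/run/mailman
    # or have systemd recreate it at boot via tmpfiles.d:
    echo 'd /run/mailman 2775 root mailman -' | sudo tee /etc/tmpfiles.d/mailman.conf
    sudo systemd-tmpfiles --create /etc/tmpfiles.d/mailman.conf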
19:09:44 <fungi> okay, i guess let's do what we can to test and maybe next week get a topic on the agenda to discuss a time to roll forward with the production upgrade window assuming it seems viable?
19:10:02 <bkero> Sounds good to me
19:10:22 <jeblair> ++
19:10:34 <fungi> we probably don't need quite so long of an outage as we took for precise->trusty since we're not doing the filesystem conversion as part of the maintenance again
19:10:53 <bkero> For a trail, things done on the snapshot for the update are listed on the trusty etherpad: https://etherpad.openstack.org/p/lists.o.o-trusty-upgrade
19:11:04 <bkero> Under "Things done on snapshot:". Should likely be moved to a new etherpad.
19:11:27 <fungi> #link https://etherpad.openstack.org/p/lists.o.o-trusty-upgrade has notes about the xenial upgrade for now
19:11:41 <clarkb> might be nice to have an etherpad for xenial upgrade without the unneeded noise from the trusty upgrade
19:11:47 <clarkb> just to make it clear when reviewing/executing
19:12:24 <fungi> agreed. also if we can get some rough runtime estimates for things like the system package upgrading step, that will help inform our maintenance duration for the announcement
19:12:53 <bkero> Good point. For that I might need another snapshot to time the tasks.
19:13:31 <fungi> just give one of us a heads up in #openstack-infra when you're ready for another instance booted from the original snapshot
19:13:41 <fungi> shouldn't need a new snapshot, just a new instance
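(A rough sketch of how the runtime estimates could be gathered on the test instance; the real steps live on the etherpad, these commands are only illustrative and the non-interactive frontend name is an assumption:)

    time sudo apt-get update
    time sudo apt-get -y dist-upgrade
    time sudo do-release-upgrade -f DistUpgradeViewNonInteractive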
19:14:01 <fungi> anyway... thanks bkero, jeblair, clarkb and everyone else who has helped with this so far. sounds like excellent progress
19:14:18 <fungi> #topic Specs approval
19:14:25 <fungi> #info APPROVED: "Nodepool Drivers" spec
19:14:28 <fungi> #link http://specs.openstack.org/openstack-infra/infra-specs/specs/nodepool-drivers.html "Nodepool Drivers" spec
19:14:47 <pabelanger> great
19:14:53 <jeblair> yay!  tristanC started hacking on this too
19:15:03 <fungi> this reminds me, we should consider adding a help-wanted section to the specs page where we keep specs that have no assignee (yet)
19:15:27 <fungi> though in this case it's getting picked up fairly quickly
19:15:32 <jeblair> fungi: good idea
19:15:59 <fungi> #action fungi propose a help-wanted section to our specs index for unassigned specs
19:16:21 <fungi> #topic Priority Efforts
19:16:45 <fungi> nothing called out specifically here, though the spec approved above is closely related to the "Zuul v3" priority
19:16:59 <fungi> #topic Use ansible for zuulv3.openstack.org (pabelanger)
19:17:07 <fungi> #link https://review.openstack.org/468113 WIP: Use ansible for zuulv3.openstack.org
19:17:13 <pabelanger> hello
19:17:25 <fungi> a nice segue from the zuul v3 priority too ;)
19:17:42 <pabelanger> I wanted to get the pulse of people on running puppet to bootstrap a server, then having ansible take over as cfgmgmt on that server
19:17:58 <pabelanger> I know we cannot hard cut over today, but wanted to see if dual stack makes sense
19:17:59 <fungi> that's definitely new territory
19:18:36 <fungi> in the past we've considered puppet good at configuration management (and mediocre at orchestration/automation) while ansible was held up as sort of the reverse of that
19:18:42 <jeblair> pabelanger: by bootstrap, do you mean users/mta/dns/iptables/etc?
19:18:56 <pabelanger> jeblair: yes, template::server.pp today is still puppet
19:18:58 <clarkb> my only real concern with this is that there seems to be the thought that using ansible will fix the config management issues with the zuulv3 deployment. I think the real issues are independent of puppet/ansible (lack of reporting, slow turnaround in the application loop, etc) and personally feel effort would be better spent on addressing those problems. But I don't have much time to address either so I
19:19:00 <clarkb> won't get in the way
19:19:14 <fungi> i will admit to being pretty uninformed on the current state of ansible as an idempotent/declarative configuration management solution
19:19:22 <jeblair> pabelanger: (if so, i'd assume you'd want puppet to continue doing that -- so i think it's more that puppet and ansible are both cfg-managing, just different areas of the system)
19:19:53 <jeblair> clarkb: i don't think this is intended to "fix the config management issues with the zuulv3 deployment" ?
19:20:06 <clarkb> jeblair: that was the impression I got from the discussion in #zuul the other day
19:20:11 <jeblair> huh, not what i got
19:20:19 <clarkb> it was basically "the puppet doesn't work and no one is updating it, lets just delete it"
19:20:22 <jeblair> i thought this was pabelanger likes writing ansible more than puppet.
19:20:31 <jeblair> again, not what i got at all
19:20:46 <pabelanger> jeblair: Well, that phase, user/mta/dns/iptables, is a larger change, since it affects more than 1 server. But I think we could do ansible for the long term. Currently, that logic is the only shared code between our servers. I think having that as puppet, then the ability to run ansible after, might be a good step toward migrating
19:20:46 <fungi> i think i missed the discussion. what the current wip change could certainly benefit from is rationale in the commit message
19:20:53 <jeblair> zuulv3 puppet will be no more complex than zuulv2 puppet which works just fine
19:21:01 <fungi> it does more to describe the what without really addressing the why
19:21:26 <pabelanger> right, I am writing way more ansible than puppet today. That doesn't mean we can't upgrade puppet-zuul to support zuulv3.
19:21:27 <jeblair> the only complication is if we want the same thing to support v2 and v3.  but that has nothing to do with the language.  you could make that as "easy" by just not supporting both in puppet
19:22:02 <pabelanger> Personally, I am hoping to support a 3rd party CI system in all ansible. To provide another option to puppet-openstackci
19:22:16 <clarkb> jeblair: the argument I heard was puppet was in the way and ansible would fix the problems. I just think the problems aren't really due to puppet (like lack of reporting, difficulty of local deployment, and slow turnaround in production)
19:22:40 * mordred shows up - sorry, last thing ran over
19:22:49 <jeblair> clarkb: okay, that's not what pabelanger just said
19:22:50 <clarkb> I'm fine with using either tool. I just think we are better served fixing those problems before changing tooling
19:22:56 <jeblair> clarkb: what problems?
19:23:09 <jeblair> clarkb: what does "lack of reporting" mean?
19:23:13 <clarkb> jeblair: lack of reporting, slow turnaround in production, and difficulty of local deployment
19:23:23 <clarkb> jeblair: people making changes to puppet do not know if/when/how their things get applied
19:23:25 <jeblair> yes you have said that 3 times, and i still don't know what those mean
19:23:28 <clarkb> which will also be true should we use ansible
19:23:50 <fungi> lack of reporting from the puppet apply invocation?
19:24:03 <clarkb> fungi: yes, and generally where the ansible loop is (e.g. "do I just need to wait longer?")
19:24:14 <jeblair> ara may help with that
19:24:34 <pabelanger> I'd like to get ARA working, but I think we need to do some dev to support our scale
19:24:37 <jeblair> is anyone working on setting up ara for "puppetmaster" ?
19:24:44 <pabelanger> I have looked at it
19:25:06 <jeblair> that would, at the very least, get us back to "last run timestamp" level of reporting
19:25:20 <pabelanger> Agree, I can first set that up if it is a requirement
19:25:27 <fungi> and success/fail output
19:25:34 <jeblair> ya
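(A minimal sketch of what enabling ARA for the puppetmaster ansible runs might look like, assuming the ARA 0.x callback-plugin setup; the port and playbook invocation are illustrative:)

    pip install ara
    export ANSIBLE_CALLBACK_PLUGINS="$(python -m ara.setup.callback_plugins)"
    ansible-playbook ...                  # existing run_all.sh invocations pick up the callback
    ara-manage runserver 127.0.0.1:9191   # browse per-task results and run timestamps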
19:25:44 <jeblair> clarkb: slow turnaround in production?
19:25:56 <clarkb> jeblair: it sometimes takes an hour for a config change to get applied
19:26:02 <clarkb> jeblair: even though we in theory run every 15 minutes
19:26:06 <fungi> the time between run_all.sh loops?
19:26:09 <clarkb> (and sometimes longer if ansible has gone out to lunch)
19:26:09 <fungi> aha
19:26:10 <mordred> jeblair: fwiw, I think cloudnull uses ara for managing openstack clouds - so it might not be too bad for our scale
19:26:11 <clarkb> fungi: ya
19:26:37 <clarkb> jeblair: basically slow turnaround makes it really hard for people to be around and address config changes more iteratively
19:27:38 <jeblair> clarkb: gotcha.  we might be able to do some refactoring now, but we can also consider having zuulv3 drive more specific tasks once it's ready.
19:27:55 <mordred> yah - I think some of that time issue stems from the fact that we're treating it as one single loop - my understanding of how people recommend running things in more ansible-ish ways is to have playbooks broken out a bit more, so that applying an update to system A doesn't depend on applying an update to system B
19:27:57 <mordred> jeblair: ++
19:28:01 <mordred> (specific tasks)
19:28:06 <fungi> and maybe we can find some way to be more selective about when we do targeted puppet apply for servers impacted by a specific change vs full runs
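(A hypothetical example of that targeted-run idea, limiting the existing puppet-apply playbook to only the hosts a change touches; the playbook name is made up for illustration:)

    ansible-playbook playbooks/puppet_apply.yaml --limit lists.openstack.org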
19:28:11 <pabelanger> agree, I think we could debug the execution more too. I know I haven't tried to see why things are slow on puppetmaster
19:28:14 <clarkb> http://eavesdrop.openstack.org/irclogs/%23zuul/%23zuul.2017-05-25.log.html#t2017-05-25T16:52:03 is why I have/had the impression I did re using ansible
19:29:02 <clarkb> and I just think that if we are going to say config management isn't working for us, those ~3 reasons are why, far more so than ansible vs puppet
19:29:06 <jeblair> i interpreted that more as mordred saying "if no one wants to write puppet then maybe we should switch"
19:29:10 <mordred> right
19:29:19 <mordred> I agree with clarkb that changing systems will not magically fix anything
19:29:46 <jeblair> clarkb: regardless of the different perceptions both of us had to that, i think "all of the above" are relevant, and they are all related
19:29:51 <mordred> but - if we're having issues with people fixing the current system because they just don't want to work on puppet any more, then I think that is relevant and a thing we should consider
19:29:55 <mordred> ++
19:31:18 <clarkb> we already use both tools so happy to use ansible more. I just didn't want us going down that rabbit hole with the idea it will fix the real issues with config management we have
19:31:29 <mordred> clarkb: ++
19:31:37 <mordred> totally agree
19:31:37 <jeblair> i think there are some things we can do with the hybrid system we have now to address the issues clarkb raised.  i think puppet continues to be workable in the long run.  i also think that if we did want to switch from puppet to ansible, it may further address clarkb's issues and also be more fun for some of us.  :)
19:31:53 <mordred> jeblair: ++
19:31:58 * mordred just ++ people today
19:32:02 <jeblair> mordred: ++
19:32:10 <fungi> i also think that we have a decent enough separation right now between ansible for orchestration and puppet for configuration management and are (mostly) using each to its strengths. if we want to switch that around, as we've said in the past, we need a coordinated plan
19:32:35 <jeblair> fungi: yeah, this would be a step over a line we have drawn.  it would be good to know what the steps following that are.  :)
19:32:43 <bkero> Having been at several projects that tried to switch configuration management systems over, I can tell you that there is no endgame there.
19:32:58 <bkero> Feature parity is expected, but it never comes
19:32:58 <mmedvede> side note from a third-party CI maintainer: a switch from puppet to ansible would be painful for us. It is more or less starting from scratch. This at least should be considered if a decision is made to go all in on ansible
19:33:11 <mordred> mmedvede: ++
19:34:14 <fungi> pabelanger: also, to reiterate, while discussion in the meeting is useful the commit message for 468113 really needs to summarize the reasons it's being proposed
19:34:27 <mordred> in fact, I think that's a really good reason why, even if we do make ansible playbooks to install/manage a zuulv3 - we also need puppet modules for the same for puppet-openstackci
19:35:02 <pabelanger> fungi: right, it was mostly a POC to have people look at code. I am thinking some document outlining how it might work is the next step?
19:35:39 <mordred> since we have a bunch of people depending on that
19:35:55 <fungi> pabelanger: sure, beefing up the commit message would be a start, but based on the discussion in here so far i'm pretty sure you'd need to start a spec
19:36:17 <pabelanger> okay, I can start iterating on that
19:36:21 <jeblair> i am open to moving openstack-infra to ansible, and at this point, i'd even be okay doing it piecemeal, but i'd like us to agree that's the point we want to get to so that we know the ultimate destination before we start.
19:36:54 <fungi> because the nuances of the decision will make for a very wordy commit message and burying an important decision there when it's broader-reaching isn't great for discoverability
19:37:09 <jeblair> ++spec
19:37:43 <fungi> i won't #action that as i have no idea what direction you want to take it
19:37:46 <mordred> spec++
19:38:08 <clarkb> spec+=1
19:38:41 <pabelanger> perfect, that is all I had on the topic. Thank you
19:38:44 <fungi> it should also take some of the open specs we have (including the ansible puppet apply spec which is still in our priority efforts list) into account too
19:38:57 <jeblair> meanwhile, i feel that we may need to deploy zuulv3 for openstack-infra before we resolve this issue
19:39:21 <jeblair> (unless we think we'll have a decision by next meeting)
19:39:35 <fungi> i also expect a number of downstream consumers who are more or less happily using our puppet module now would like to get the benefits of zuul v3 without having to switch up their configuration management significantly
19:40:19 <clarkb> can always make the config management turtles go deeper. ansible runs puppet which runs ansible >_> (I'm not serious)
19:40:20 <mordred> fungi: ++
19:40:22 <fungi> so for those reasons, fixing what we have now enough to be able to at least initially support zuulv3 seems pretty necessary regardless
19:40:29 <jeblair> fungi: agreed
19:40:51 <mordred> fungi: I definitely think we need to consider puppet-openstackci a first-class deliverable regardless of how we decide to run infra
19:41:01 <jeblair> fungi: if there's a longer term switch, we should give enough time for folks who depend on puppet-zuul to step up to maintain it before we abandon it.
19:41:02 <fungi> i'm inclined to agree
19:41:16 <fungi> thanks for the interesting topic pabelanger!
19:41:33 <pabelanger> thanks for chatting about it
19:41:33 <jeblair> mordred: sure, but realistically, if we move infra to ansible, it will need another maintainer (there may still be overlap) in the long run.
19:42:14 <mordred> jeblair: agree
19:42:52 <fungi> there's nothing else on the agenda, so we can go to open discussion for the last few minutes and still talk about this if people want
19:43:03 <fungi> #topic Open discussion
19:43:16 <fungi> pabelanger: you got a couple of replacement nodepool builders up on xenial, right? how'd that go?
19:43:27 <pabelanger> fungi: yes, we have 100% migrated
19:43:35 <fungi> looks like the service is running on them and stopped on the old builders now... time to delete the old ones yet?
19:43:44 <pabelanger> yup, I can delete them today
19:43:46 <clarkb> and zypper works on xenial (we have suse images booted in clouds now)
19:43:54 <pabelanger> yes, that too
19:44:02 <fungi> oh, right, we have opensuse nodes! (and jobs!)
19:44:14 <pabelanger> it should be much easier to bring on new operating systems now thanks to cmurphy's work
19:44:39 <fungi> cmurphy: awesome work!
19:44:39 <pabelanger> mordred: clarkb: did we want to push on citycloud for ipv6 networking?
19:44:49 <pabelanger> I haven't done anything since friday
19:45:06 <fungi> did they ever get back to us about the weird extra addresses in that one region?
19:45:13 <clarkb> pabelanger: maybe? perhaps we first attempt another region to see if it is more reliable there and possibly use stateless dhcp rather than just slaac?
19:45:22 <fungi> or has the problem maybe magically vanished?
19:45:30 <clarkb> fungi: I haven't heard from them on it beyond that they were looking into it. We worked around it in shade on our end
19:45:33 <pabelanger> fungi: I thought we pushed a fix in shade too
19:45:42 <pabelanger> clarkb: yes, I can look at la1 tomorrow
19:45:46 <fungi> oh, right there was that. did it also solve the reasons for the boot failures?
19:45:59 <pabelanger> they appear to have decreased
19:46:06 <pabelanger> but I see a few still
19:46:23 <fungi> maybe those are for different reasons too
19:46:39 <clarkb> the only servers in sto2 with multiple private IPs currently are the pabelanger-test server and the multinode set of servers I held for debugging
19:47:15 <clarkb> I will go ahead and delete the multinode servers, pabelanger's we gave to citycloud as an example so maybe keep that one around a little longer?
19:47:25 <fungi> maybe they found/fixed that issue and just didn't reply to us in that case
19:47:38 <clarkb> ya which is what happened with the flavor issue in la1
19:47:39 <fungi> but yeah, let's give them a little longer
19:48:54 <clarkb> pabelanger: but re ipv6 maybe let's try the other options available to us first (stateless ipv6 and another citycloud region) and then take what we learn to citycloud
19:49:02 <clarkb> pabelanger: so far they seem receptive to the feedback so thats good
19:49:16 <pabelanger> clarkb: wfm
19:49:33 <clarkb> er s/ipv6/dhcpv6/
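(For context, a stateless-DHCPv6 neutron subnet looks roughly like the following; the values are illustrative and this would be configured on the citycloud side:)

    openstack subnet create --ip-version 6 \
      --ipv6-ra-mode dhcpv6-stateless --ipv6-address-mode dhcpv6-stateless \
      --network infra-net --subnet-range 2001:db8::/64 infra-v6-subnet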
19:49:42 <fungi> yep, quite pleased with citycloud's generous assistance and donation
19:50:02 <pabelanger> ++
19:50:09 <fungi> seems like it went really smoothly, all things considered
19:50:41 <jeblair> yay!
19:50:53 <fungi> and quite a lot of capacity
19:51:17 <clarkb> I'm looking for feedback on https://review.openstack.org/468705 which is another d-g journald related change to help address SUSE runs
19:51:24 <fungi> did we hear any further news about the kolla job failures they (at least originally) thought were only manifesting in citycloud?
19:51:32 <clarkb> it puts a little more effort on the end user grabbing the logs but not significantly more
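(A rough sketch of that extra effort on the end user, assuming the job copies the raw journal files alongside the other logs; the paths and unit name are illustrative, not necessarily what 468705 does:)

    sudo cp -a /var/log/journal ./logs/journal       # in the job, next to the other logs
    # locally, after downloading logs/:
    journalctl --directory ./logs/journal --no-pager -u devstack@keystone.service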
19:51:43 <clarkb> probably need sdague though who appears out today
19:51:54 <clarkb> fungi: I haven't heard more on that, no
19:52:00 <fungi> sdague has the right idea ;)
19:52:06 <pabelanger> didn't somebody say there was a web interface for systemd logs?
19:52:17 <clarkb> fungi: however found a possibly related problem in neutron functional tests where they don't use our ipv4 safe range
19:52:27 <clarkb> pabelanger: there is but it would require work to make it useful for our current setup
19:52:42 <bkero> pabelanger: systemd-journal-gateway
19:52:44 <clarkb> fungi: so wondering if maybe kolla is also not using the safe range and the IPs are conflicting
19:53:06 <fungi> clarkb: which web interface was that? a search i just did turned up one called cockpit
19:53:08 <jeblair> clarkb: i think sdague is out this week
19:53:21 <clarkb> fungi: the one bkero named, its part of systemd
19:53:34 <clarkb> jeblair: ah so maybe we need to just decide without him :)
19:53:38 <fungi> oh, cool! and i guess cockpit is a more general management frontend, not just a log viewer
19:54:48 <fungi> looks like maybe it's systemd-journal-gatewayd (d on the end)
19:55:42 <fungi> "serves journal events over the network. Clients must connect using HTTP"
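(It is a socket-activated service that serves the local journal over HTTP on port 19531; a quick illustrative check:)

    sudo systemctl start systemd-journal-gatewayd.socket
    curl 'http://localhost:19531/entries?boot'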
19:56:28 <fungi> oh, right, discussion in the channel covered some missing features we've come to rely on in os-loganalyze
19:56:41 <fungi> especially hyperlinking to individual loglines
19:56:41 <clarkb> cockpit does look shiny though I expect it will have similar issues
19:56:50 <clarkb> basically everything is built assuming one giant journal
19:57:07 <bkero> Cockpit is a...big thing.
19:57:46 <fungi> more likely we'd just teach osla to run journalctl commands under the covers or something
19:57:58 <fungi> probably less work than trying to integrate something like those
19:58:27 <fungi> since we have lots of other logs we still need osla for
19:58:35 <fungi> so it's not like it's going anywhere
20:00:04 <fungi> and we're at time. find us in #openstack-infra for subsequent discussion! (need to make way for the no-tc-meeting now)
20:00:12 <fungi> #endmeeting