19:03:18 #startmeeting infra
19:03:19 Meeting started Tue May 30 19:03:18 2017 UTC and is due to finish in 60 minutes. The chair is fungi. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:03:21 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:03:23 The meeting name has been set to 'infra'
19:03:29 #link https://wiki.openstack.org/wiki/Meetings/InfraTeamMeeting#Agenda_for_next_meeting
19:03:32 #topic Announcements
19:03:35 #info OpenStack general mailing list archives from Launchpad (July 2010 to July 2013) have been imported into the current general archive on lists.openstack.org.
19:03:40 #link http://lists.openstack.org/pipermail/openstack/ OpenStack general mailing list archives
19:03:44 as always, feel free to hit me up with announcements you want included in future meetings
19:03:58 #topic Actions from last meeting
19:04:04 #link http://eavesdrop.openstack.org/meetings/infra/2017/infra.2017-05-23-19.02.html Minutes from last meeting
19:04:10 bkero draft an upgrade plan for lists.o.o to xenial
19:04:14 i saw you talking about this a lot last week in #openstack-infra
19:04:19 have you firmed up any thoughts about it yet?
19:04:32 got a new etherpad you can #link?
19:04:52 Yes, I took the snapshot and did the update similar to the precise -> trusty update. The update went about as expected, with the new version serving the content.
19:05:09 I haven't had it send any test emails yet; I'll need to create some mailman accounts/admins to be able to do that.
19:05:25 i saw one hiccup where the data was somewhere other than where the package expected it? did that get sorted out?
19:06:00 The problem there was that the lockfile directory didn't exist. I haven't nailed down the reason yet, although the service was running.
19:06:41 We could manually create it root:mailman with 2775, although I'd prefer to have that handled by the package. Maybe --reinstall will create it for us.
19:06:43 could have to do with how we disabled services before taking the snapshot maybe?
19:06:58 * fungi is grasping at straws
19:07:06 The dir is created by the package, although maybe with an upgrade something is different
19:07:24 but it existed on the lists.o.o server
19:07:36 It likely existed on the snapshot before do-release-upgrade-ing as well
19:07:49 so something cleaned it up... is it in a directory which doesn't persist between boots (e.g., /var/run)?
19:07:59 jeblair had the theory that systemd manages /var/run and clobbered it for us
19:08:25 yeah, it doesn't persist across reboots. normally the init script makes it when starting mm.
19:08:36 yeah, i wouldn't be surprised (not actually bashing systemd here)
19:08:52 so only a problem if you upgrade a system which had not run the program since boot.
19:09:08 right, that is the sort of direction i was going as well
19:09:15 seems to make sense
19:09:44 okay, i guess let's do what we can to test and maybe next week get a topic on the agenda to discuss a time to roll forward with the production upgrade window, assuming it seems viable?
19:10:02 Sounds good to me
19:10:22 ++
19:10:34 we probably don't need quite so long of an outage as we took for precise->trusty, since we're not doing the filesystem conversion as part of the maintenance again
19:10:53 For a trial, things done on the snapshot for the update are listed on the trusty etherpad: https://etherpad.openstack.org/p/lists.o.o-trusty-upgrade
19:11:04 Under "Things done on snapshot:". Should likely be moved to a new etherpad.
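The manual workaround bkero describes above might look roughly like the following. This is only a sketch: it assumes the missing lock directory is /var/run/mailman (the discussion only says the lockfile directory lives somewhere under /var/run), and it reuses the root:mailman ownership and 2775 mode mentioned in the meeting. A tmpfiles.d fragment is one way to have the directory recreated on every boot if the package's --reinstall doesn't take care of it:

    # recreate the mailman lock directory by hand (path is an assumption;
    # ownership/mode follow the root:mailman 2775 suggested above)
    sudo install -d -o root -g mailman -m 2775 /var/run/mailman

    # optionally have systemd-tmpfiles recreate it on every boot, since
    # /var/run is a tmpfs and does not persist across reboots
    echo 'd /var/run/mailman 2775 root mailman -' | sudo tee /etc/tmpfiles.d/mailman-locks.conf
    sudo systemd-tmpfiles --create mailman-locks.conf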
19:11:27 #link https://etherpad.openstack.org/p/lists.o.o-trusty-upgrade has notes about the xenial upgrade for now
19:11:41 might be nice to have an etherpad for the xenial upgrade without the unneeded noise from the trusty upgrade
19:11:47 just to make it clear when reviewing/executing
19:12:24 agreed. also, if we can get some rough runtime estimates for things like the system package upgrading step, that will help inform our maintenance duration for the announcement
19:12:53 Good point. For that I might need another snapshot to time the tasks.
19:13:31 just give one of us a heads up in #openstack-infra when you're ready for another instance booted from the original snapshot
19:13:41 shouldn't need a new snapshot, just a new instance
19:14:01 anyway... thanks bkero, jeblair, clarkb and everyone else who has helped with this so far. sounds like excellent progress
19:14:18 #topic Specs approval
19:14:25 #info APPROVED: "Nodepool Drivers" spec
19:14:28 #link http://specs.openstack.org/openstack-infra/infra-specs/specs/nodepool-drivers.html "Nodepool Drivers" spec
19:14:47 great
19:14:53 yay! tristanC started hacking on this too
19:15:03 this reminds me, we should consider adding a help-wanted section to the specs page where we keep specs that have no assignee (yet)
19:15:27 though in this case it's getting picked up fairly quickly
19:15:32 fungi: good idea
19:15:59 #action fungi propose a help-wanted section to our specs index for unassigned specs
19:16:21 #topic Priority Efforts
19:16:45 nothing called out specifically here, though the spec approved above is closely related to the "Zuul v3" priority
19:16:59 #topic Use ansible for zuulv3.openstack.org (pabelanger)
19:17:07 #link https://review.openstack.org/468113 WIP: Use ansible for zuulv3.openstack.org
19:17:13 hello
19:17:25 a nice segue from the zuul v3 priority too ;)
19:17:42 I wanted to get the pulse of people and see what they think of running puppet to bootstrap a server, then having ansible take over as cfgmgmt on the server
19:17:58 I know we cannot hard cut over today, but wanted to see if dual stack makes sense
19:17:59 that's definitely new territory
19:18:36 in the past we've considered puppet good at configuration management (and mediocre at orchestration/automation) while ansible was held up as sort of the reverse of that
19:18:42 pabelanger: by bootstrap, do you mean users/mta/dns/iptables/etc?
19:18:56 jeblair: yes, template::server.pp today is still puppet
19:18:58 my only real concern with this is that there seems to be the thought that using ansible will fix the config management issues with the zuulv3 deployment. I think the real issues are independent of puppet/ansible (lack of reporting, slow turnaround in the application loop, etc) and personally feel effort would be better spent on addressing those problems. But I don't have much time to address either, so
19:19:00 won't get in the way
19:19:14 i will admit to being pretty uninformed on the current state of ansible as an idempotent/declarative configuration management solution
19:19:22 pabelanger: (if so, i'd assume you'd want puppet to continue doing that -- so i think it's more both puppet and ansible cfg-managing, just different areas of the system)
19:19:53 clarkb: i don't think this is intended to "fix the config management issues with the zuulv3 deployment" ?
19:20:06 jeblair: that was the impression I got from the discussion in #zuul the other day
19:20:11 huh, not what i got
19:20:19 it was basically "the puppet doesn't work and no one is updating it, let's just delete it"
19:20:22 i thought this was pabelanger likes writing ansible more than puppet.
19:20:31 again, not what i got at all
19:20:46 jeblair: Well, that phase, user/mta/dns/iptables, is a larger change, since it affects more than 1 server. But I think we could do ansible for the long term. Currently, though, that logic is the only shared code between our servers. I think having that as puppet, then the ability to run ansible after, might be a good step toward migrating
19:20:46 i think i missed the discussion. what the current wip change could certainly benefit from is rationale in the commit message
19:20:53 zuulv3 puppet will be no more complex than zuulv2 puppet, which works just fine
19:21:01 it does more to describe the what without really addressing the why
19:21:26 right, I am writing way more ansible than puppet today. That doesn't mean we can upgrade puppet-zuul to support zuulv3.
19:21:27 the only complication is if we want the same thing to support v2 and v3. but that has nothing to do with the language. you could make that as "easy" by just not supporting both in puppet
19:22:02 Personally, I am hoping to support a 3rd party CI system in all ansible. To provide another option to puppet-openstackci
19:22:16 jeblair: the argument I heard was that puppet was in the way and ansible would fix the problems. I just think the problems aren't really due to puppet (like lack of reporting, difficulty of local deployment, and slow turnaround in production)
19:22:40 * mordred shows up - sorry, last thing ran over
19:22:49 clarkb: okay, that's not what pabelanger just said
19:22:50 I'm fine with using either tool. I just think we are better served fixing those problems before changing tooling
19:22:56 clarkb: what problems?
19:23:09 clarkb: what does "lack of reporting" mean?
19:23:13 jeblair: lack of reporting, slow turnaround in production, and difficulty of local deployment
19:23:23 jeblair: people making changes to puppet do not know if/when/how their things get applied
19:23:25 yes you have said that 3 times, and i still don't know what those mean
19:23:28 which will also be true should we use ansible
19:23:50 lack of reporting from the puppet apply invocation?
19:24:03 fungi: yes, and generally where the ansible loop is ("e.g. do I just need to wait longer")
19:24:14 ara may help with that
19:24:34 I'd like to get ARA working, but I think we need to do some dev to support our scale
19:24:37 is anyone working on setting up ara for "puppetmaster" ?
19:24:44 I have looked at it
19:25:06 that would, at the very least, get us back to "last run timestamp" level of reporting
19:25:20 Agree, I can first set that up if it is a requirement
19:25:27 and success/fail output
19:25:34 ya
19:25:44 clarkb: slow turnaround in production?
19:25:56 jeblair: it sometimes takes an hour for a config change to get applied
19:26:02 jeblair: even though we in theory run every 15 minutes
19:26:06 the time between run_all.sh loops?
19:26:09 (and sometimes longer if ansible has gone out to lunch)
19:26:09 aha
19:26:10 jeblair: fwiw, I think cloudnull uses ara for managing openstack clouds - so it might not be too bad for our scale
19:26:11 fungi: ya
19:26:37 jeblair: basically slow turnaround makes it really hard for people to be around and address config changes more iteratively
19:27:38 clarkb: gotcha. we might be able to do some refactoring now, but we can also consider having zuulv3 drive more specific tasks once it's ready.
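For reference, wiring ARA into the existing ansible runs on puppetmaster (the "last run timestamp" plus success/fail reporting discussed above) could look something like the sketch below. This is not a tested recipe for our deployment; the commands follow ARA's own documentation from around that time, and the web UI invocation in particular may differ between ARA releases:

    # install ARA next to ansible on puppetmaster (sketch only)
    pip install ara

    # point ansible at ARA's callback plugin; the helper module just
    # prints the directory containing the callback
    export ANSIBLE_CALLBACK_PLUGINS=$(python -m ara.setup.callback_plugins)

    # run the normal config management loop; every play/task result gets
    # recorded in ARA's database with timestamps and status
    ./run_all.sh

    # browse the recorded runs with ARA's bundled web interface
    ara-manage runserver -p 9191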
19:27:55 yah - I think some of that time issue stems from the fact that we're treating it as one single loop - my understanding of how people recommend running things in more ansible-ish ways is to have playbooks broken out a bit more, so that applying an update to system A doesn't depend on applying an update to system B
19:27:57 jeblair: ++
19:28:01 (specific tasks)
19:28:06 and maybe we can find some way to be more selective about when we do a targeted puppet apply for servers impacted by a specific change vs full runs
19:28:11 agree, I think we could debug the execution more too. I know I haven't tried to see why things are slow on puppetmaster
19:28:14 http://eavesdrop.openstack.org/irclogs/%23zuul/%23zuul.2017-05-25.log.html#t2017-05-25T16:52:03 is why I have/had the impression I did re using ansible
19:29:02 and I just think if we are going to say config management isn't working for us, those ~3 reasons are why, far more so than ansible vs puppet
19:29:06 i interpreted that more as mordred saying "if no one wants to write puppet then maybe we should switch"
19:29:10 right
19:29:19 I agree with clarkb that changing systems will not magically fix anything
19:29:46 clarkb: regardless of the different perceptions both of us had of that, i think "all of the above" are relevant, and they are all related
19:29:51 but - if we're having issues with people fixing the current system because they just don't want to work on puppet any more, then I think that is relevant and a thing we should consider
19:29:55 ++
19:31:18 we already use both tools so happy to use ansible more. I just didn't want us going down that rabbit hole with the idea it will fix the real issues with config management we have
19:31:29 clarkb: ++
19:31:37 totally agree
19:31:37 i think there are some things we can do with the hybrid system we have now to address the issues clarkb raised. i think puppet continues to be workable in the long run. i also think that if we did want to switch from puppet to ansible, it may further address clarkb's issues and also be more fun for some of us. :)
19:31:53 jeblair: ++
19:31:58 * mordred just ++s people today
19:32:02 mordred: ++
19:32:10 i also think that we have a decent enough separation right now between ansible for orchestration and puppet for configuration management and are (mostly) using each to its strengths. if we want to switch that around, as we've said in the past, we need a coordinated plan
19:32:35 fungi: yeah, this would be a step over a line we have drawn. it would be good to know what the steps following that are. :)
19:32:43 Having been at several projects that tried to switch configuration management systems over, I can tell you that there is no endgame there.
19:32:58 Feature parity is expected, but it never comes
19:32:58 side note from a third-party CI maintainer: a switch from puppet to ansible would be painful for us. It is more or less starting from scratch. This at least should be considered if a decision is made to go all in on ansible
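The "targeted puppet apply for servers impacted by a specific change" idea fungi raised a few minutes earlier (19:28) might look roughly like this, as opposed to waiting for the next full run_all.sh pass. The playbook name is made up for illustration; the real playbooks on puppetmaster may be split up differently:

    # apply only the puppet step, and only against the host a change
    # touches, instead of running the whole loop (playbook name is
    # hypothetical)
    ansible-playbook --limit zuulv3.openstack.org playbooks/remote_puppet.yaml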
19:33:11 mmedvede: ++
19:34:14 pabelanger: also, to reiterate, while discussion in the meeting is useful, the commit message for 468113 really needs to summarize the reasons it's being proposed
19:34:27 in fact, I think that's a really good reason why, even if we do make ansible playbooks to install/manage a zuulv3 - we also need puppet modules for the same for puppet-openstackci
19:35:02 fungi: right, it was mostly a POC to have people look at code. I am thinking some document outlining how it might work is the next step?
19:35:39 since we have a bunch of people depending on that
19:35:55 pabelanger: sure, beefing up the commit message would be a start, but based on the discussion in here so far i'm pretty sure you'd need to start a spec
19:36:17 okay, I can start iterating on that
19:36:21 i am open to moving openstack-infra to ansible, and at this point, i'd even be okay doing it piecemeal, but i'd like us to agree that's the point we want to get to, so that we know the ultimate destination before we start.
19:36:54 because the nuances of the decision will make for a very wordy commit message, and burying an important decision there when it's broader-reaching isn't great for discoverability
19:37:09 ++spec
19:37:43 i won't #action that as i have no idea what direction you want to take it
19:37:46 spec++
19:38:08 spec+=1
19:38:41 perfect, that is all I had on the topic. Thank you
19:38:44 it should also take some of the open specs we have (including the ansible puppet apply spec which is still in our priority efforts list) into account too
19:38:57 meanwhile, i feel that we may need to deploy zuulv3 for openstack-infra before we resolve this issue
19:39:21 (unless we think we'll have a decision by next meeting)
19:39:35 i also expect a number of downstream consumers who are more or less happily using our puppet module now would like to get the benefits of zuul v3 without having to switch up their configuration management significantly
19:40:19 can always make the config management turtles go deeper. ansible runs puppet which runs ansible >_> (I'm not serious)
19:40:20 fungi: ++
19:40:22 so for those reasons, fixing what we have now enough to be able to at least initially support zuulv3 seems pretty necessary regardless
19:40:29 fungi: agreed
19:40:51 fungi: I definitely think we need to consider puppet-openstackci a first-class deliverable regardless of how we decide to run infra
19:41:01 fungi: if there's a longer-term switch, we should give enough time for folks who depend on puppet-zuul to step up to maintain it before we abandon it.
19:41:02 i'm inclined to agree
19:41:16 thanks for the interesting topic, pabelanger!
19:41:33 thanks for chatting about it
19:41:33 mordred: sure, but realistically, if we move infra to ansible, it will need another maintainer (there may still be overlap) in the long run.
19:42:14 jeblair: agree
19:42:52 there's nothing else on the agenda, so we can go to open discussion for the last few minutes and still talk about this if people want
19:43:03 #topic Open discussion
19:43:16 pabelanger: you got a couple of replacement nodepool builders up on xenial, right? how'd that go?
19:43:27 fungi: yes, we have 100% migrated
19:43:35 looks like the service is running on them and stopped on the old builders now... time to delete the old ones yet?
19:43:44 yup, I can delete them today
19:43:46 and zypper works on xenial (we have suse images booted in clouds now)
19:43:54 yes, that too
19:44:02 oh, right, we have opensuse nodes! (and jobs!)
19:44:14 it should be much easier to bring on new operating systems now thanks to cmurphy's work
19:44:39 cmurphy: awesome work!
19:44:39 mordred: clarkb: did we want to push on citycloud for ipv6 networking?
19:44:49 I haven't done anything since friday
19:45:06 did they ever get back to us about the weird extra addresses in that one region?
19:45:13 pabelanger: maybe? perhaps we first attempt another region to see if it is more reliable there, and possibly use stateless dhcp rather than just slaac?
19:45:22 or has the problem maybe magically vanished?
19:45:30 fungi: I haven't heard from them on it beyond that they were looking into it. We worked around it in shade on our end
19:45:33 fungi: I thought we pushed a fix in shade too
19:45:42 clarkb: yes, I can look at la1 tomorrow
19:45:46 oh, right, there was that. did it also solve the reasons for the boot failures?
19:45:59 they appear to have decreased
19:46:06 but I see a few still
19:46:23 maybe those are for different reasons too
19:46:39 the only servers in sto2 with multiple private IPs currently are the pabelanger-test server and the multinode set of servers I held for debugging
19:47:15 I will go ahead and delete the multinode servers; pabelanger's we gave to citycloud as an example, so maybe keep that one around a little longer?
19:47:25 maybe they found/fixed that issue and just didn't reply to us in that case
19:47:38 ya, which is what happened with the flavor issue in la1
19:47:39 but yeah, let's give them a little longer
19:48:54 pabelanger: but re ipv6, maybe let's try the other options available to us first (stateless ipv6 and the other citycloud region) and then take what we learn to citycloud
19:49:02 pabelanger: so far they seem receptive to the feedback, so that's good
19:49:16 clarkb: wfm
19:49:33 er, s/ipv6/dhcpv6/
19:49:42 yep, quite pleased with citycloud's generous assistance and donation
19:50:02 ++
19:50:09 seems like it went really smoothly, all things considered
19:50:41 yay!
19:50:53 and quite a lot of capacity
19:51:17 I'm looking for feedback on https://review.openstack.org/468705 which is another d-g journald related change to help address SUSE runs
19:51:24 did we hear any further news about the kolla job failures they (at least originally) thought were only manifesting in citycloud?
19:51:32 it puts a little more effort on the end user grabbing the logs, but not significantly more
19:51:43 probably need sdague though, who appears to be out today
19:51:54 fungi: I haven't heard more on that, no
19:52:00 sdague has the right idea ;)
19:52:06 didn't somebody say there was a web interface for systemd logs?
19:52:17 fungi: however, found a possibly related problem in neutron functional tests where they don't use our ipv4 safe range
19:52:27 pabelanger: there is, but it would require work to make useful for our current setup
19:52:42 pabelanger: systemd-journal-gateway
19:52:44 fungi: so wondering if maybe kolla is also not using the safe range and the IPs are conflicting
19:53:06 clarkb: which web interface was that? a search i just did turned up one called cockpit
19:53:08 clarkb: i think sdague is out this week
19:53:21 fungi: the one bkero named, it's part of systemd
19:53:34 jeblair: ah, so maybe we need to just decide without him :)
19:53:38 oh, cool! and i guess cockpit is a more general management frontend, not just a log viewer
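For context, the web interface bkero named above is systemd's journal gateway. Enabling it and pulling entries over HTTP looks roughly like this; a sketch of the stock systemd tooling, not something we deploy today, and the unit name in the query is only an example:

    # enable the HTTP gateway for the local journal (it listens on 19531)
    sudo systemctl enable --now systemd-journal-gatewayd.socket

    # fetch journal entries over HTTP, filtered by a journal match field
    # such as the systemd unit (example unit name)
    curl 'http://localhost:19531/entries?_SYSTEMD_UNIT=zuul-scheduler.service'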
19:54:48 looks like maybe it's systemd-journal-gatewayd (d on the end)
19:55:42 "serves journal events over the network. Clients must connect using HTTP"
19:56:28 oh, right, discussion in the channel covered some missing features we've come to rely on in os-log-analyze
19:56:41 especially hyperlinking to individual loglines
19:56:41 cockpit does look shiny though I expect it will have similar issues
19:56:50 basically everything is built assuming one giant journal
19:57:07 Cockpit is a... big thing.
19:57:46 more likely we'd just teach osla to run journalctl commands under the covers or something
19:57:58 probably less work than trying to integrate something like those
19:58:27 since we have lots of other logs we still need osla for
19:58:35 so it's not like it's going anywhere
20:00:04 and we're at time. find us in #openstack-infra for subsequent discussion! (need to make way for the no-tc-meeting now)
20:00:12 #endmeeting
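As a postscript on the "teach osla to run journalctl commands under the covers" idea: one rough sketch (nothing that exists today; the paths and unit name are placeholders) is to archive a job's journal directory alongside the other logs and let the log server shell out to journalctl against that copy:

    # on the test node, archive the raw journal with the rest of the logs
    sudo cp -a /var/log/journal /opt/stack/logs/journal

    # on the log server, a wrapper around osla could render requested
    # slices on demand, e.g. one unit's messages with precise timestamps
    journalctl -D /opt/stack/logs/journal -u devstack@keystone.service -o short-precise --no-pager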