19:01:45 <clarkb> #startmeeting infra
19:01:46 <openstack> Meeting started Tue Feb  6 19:01:45 2018 UTC and is due to finish in 60 minutes.  The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:47 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:49 <openstack> The meeting name has been set to 'infra'
19:01:50 <ianw> yeah, if someone wants to write an app that dynamically moves alarms based on launch delays, that would be awesome
19:01:54 <AJaeger> o/
19:02:23 <clarkb> #link https://wiki.openstack.org/wiki/Meetings/InfraTeamMeeting#Agenda_for_next_meeting
19:02:46 <clarkb> word of warning: this morning's fires have me not very well prepared for the meeting, but we do have an agenda and we'll do our best
19:02:55 <clarkb> #topic Announcements
19:02:57 <hrw> o/
19:03:09 <clarkb> The summit CFP closes on the 8th (that is like 1.5 days from now)
19:03:28 <clarkb> #link https://etherpad.openstack.org/p/infra-rocky-ptg PTG Brainstorming
19:04:26 <clarkb> And finally it is PTL election season with nominations happening for another day or so
19:04:48 <clarkb> #topic Actions from last meeting
19:05:10 <clarkb> #link http://eavesdrop.openstack.org/meetings/infra/2018/infra.2018-01-30-19.01.txt Notes from last meeting
19:05:36 <clarkb> #action clarkb / corvus / everyone / to take pass through old zuul and nodepool master branch changes to at least categorize changes
19:05:52 <clarkb> I don't think we've really had time to do ^ with the various fires we've been fighting so we'll keep that on the board
19:05:56 <corvus> i *think* we've worked through the backlog there
19:06:19 <clarkb> oh cool. Should I undo it then?
19:06:26 <corvus> tobiash and i (maybe others?) did a bunch last week, and, at least, i think we're done
19:06:31 <clarkb> #undo
19:06:32 <openstack> Removing item from minutes: #action clarkb / corvus / everyone / to take pass through old zuul and nodepool master branch changes to at least categorize changes
19:06:42 <clarkb> thank you everyone for somehow managing to get through that despite the fires
19:06:51 <clarkb> I'm still on the hook for cleaning out old specs
19:06:55 <clarkb> #action clarkb clear out old infra specs
19:07:26 <clarkb> #topic Specs approval
19:07:38 <clarkb> Any specs stuff we need to go over that I have missed?
19:08:20 <clarkb> Sounds like no
19:08:20 <pabelanger> o/
19:08:30 <clarkb> #topic Priority Efforts
19:08:36 <clarkb> #topic Zuul v3
19:08:48 <clarkb> corvus: you wanted to talk about the zuul executor OOMing
19:09:08 <corvus> the oom-killer has a habit of killing the log streaming daemon
19:09:18 <corvus> we thought this was because it was running out of memory
19:09:20 <corvus> understandably
19:09:33 <corvus> but on closer inspection, i don't think it actually is.  i think it only thinks it is.
19:09:44 <fungi> a false-positive?
19:10:14 <corvus> now that the governors are improved, we can see that even with 50% +/- 10% of the physical memory "available", we've still seen oom-killer invoked
19:10:41 <corvus> the problem appears to be that the memory is not 'free', just 'reclaimable'.  especially 'reclaimable slab' memory.
19:11:12 <corvus> there have been some kernel bugs related to this, and a lot of changes, especially around the early 4.4 kernels
19:11:50 <corvus> i'm inclined to think that we're suffering from our workload patterns being suboptimal for whatever kernel memory management is going on
19:12:07 <fungi> ahh, that's believable
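A minimal sketch of the distinction corvus describes above, using only standard /proc/meminfo fields (the reading is illustrative; nothing here is specific to the executors):

    #!/usr/bin/env python3
    # Compare "free" vs "available" vs reclaimable slab memory. A host can
    # report plenty of MemAvailable while MemFree is low, because most of
    # the difference is reclaimable (e.g. SReclaimable slab) rather than
    # actually free pages.
    def meminfo():
        info = {}
        with open('/proc/meminfo') as f:
            for line in f:
                key, rest = line.split(':', 1)
                info[key] = int(rest.split()[0])  # values are reported in kB
        return info

    m = meminfo()
    for key in ('MemTotal', 'MemFree', 'MemAvailable', 'SReclaimable'):
        print('%-13s %10d kB' % (key + ':', m[key]))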
19:12:19 <clarkb> one suggestion I had was to try the hwe kernels ubuntu publishes. I've had reasonably good luck locally with them due to hardware needs but they get you a 4.13 kernel on xenial
19:12:38 <fungi> worth considering
19:12:39 <clarkb> considering these servers are mostly disposable and replaceable we can always rebuild them with older kernels if necessary as well
19:13:00 <corvus> i think our options include: a) throw more executors at the problem and therefore use less ram.  b) try to tune kernel memory management parameters.  c) open a bug report and try kernel bugfixes.  d) try a different kernel via hwe or os upgrade.
19:13:07 <fungi> you can easily switch between default boot kernels without rebuilding, for that matter
19:13:20 <dmsimard> hwe kernel is worth a shot
19:13:29 <corvus> yeah, i'm inclined to start with 4.13
19:13:52 <fungi> that seems like the easiest way to rule out old fixed bugs (at the risk of discovering newer ones)
19:13:53 <dmsimard> we can even try it "online".. as in, install it, reboot without rebuilding the instance from a new image
19:14:12 <fungi> dmsimard: yeah, that's what i was assuming we'd do
19:14:13 <corvus> i'm not enthused about debugging 4.4 at this point in its lifecycle
19:14:40 <pabelanger> is tuning OOM killer something we want to try too? So as to not kill an executor process before ansible-playbook?
19:14:42 <dmsimard> fungi: I forget if they have a PPA for them but I've installed them "manually" from http://kernel.ubuntu.com/~kernel-ppa/mainline/ before
19:14:48 <clarkb> we are also a few months away from a new ubuntu lts release which should in theory be a simpler upgrade since we've already done the systemd jump
19:15:09 <clarkb> so we may not have to live with the hwe kernel for long
19:15:11 <fungi> pabelanger: i think the concern with tuning is that the bug would still be present, and we'd just kill other things
19:15:12 <dmsimard> fungi: ah that's a ppa right in a URL
19:15:31 <clarkb> fungi: ya that should probably be a last resort especially if we actually do have memory available after all
19:15:34 <fungi> dmsimard: not a ppa, just a non-default kernel package (might be in the backports suite though)
19:15:55 <pabelanger> fungi: yah, thought is if we kill ansible-playbook, the job would be aborted and retried? Whereas if we lose the executor we have to restart everything. But agree, not best solution
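The per-process tuning pabelanger is describing would look roughly like the sketch below; the adjustment value and the idea of applying it to ansible-playbook children are assumptions for illustration, not something the executor does today:

    def prefer_oom_kill(pid, adj=500):
        # oom_score_adj ranges from -1000 (never kill) to +1000 (kill first);
        # raising it on an ansible-playbook child nudges the kernel to pick
        # that process over the executor or log streaming daemons.
        with open('/proc/%d/oom_score_adj' % pid, 'w') as f:
            f.write(str(adj))

    # e.g. a launcher could call prefer_oom_kill(child.pid) right after
    # forking a playbook, so a memory squeeze aborts one job instead of
    # taking out the whole daemon.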
19:15:59 <hrw> ubuntu hwe kernels work fine. my private server has run them since they started doing them
19:16:24 <clarkb> hrw: ya my home server runs it too
19:16:37 <clarkb> the only downside I've seen was that getting kpti patches took longer than for the base kernel
19:16:43 <corvus> we have had a bad experience with them in the past.  they broke openafs the last time we had one installed.
19:16:53 <corvus> the executors need openafs
19:16:56 <fungi> it's not so much whether they "work fine" but whether we turn up new and worse bugs than we're trying to rule out
19:17:00 <corvus> that's something to look out for
19:17:12 <fungi> all kernel versions have bugs. new versions usually have some new bugs too
19:18:12 <fungi> yes, the openafs lkm does complicate this a little too (not logistically, just surface area for previously unnoticed bugs/incompatibilities)
19:18:14 <clarkb> right if this was the gerrit server for example I'd be more cautious
19:18:22 <corvus> okay, so let's start with an in-place test of hwe kernel, then maybe there's some voodoo mm stuff we can try (like disabling cgroups memory management), then maybe we throw more machines at it until we get yet another newer kernel with the upgrade?
19:18:24 <clarkb> but we can fairly easily undo any kernel updates on these zuul executors
19:18:41 <pabelanger> wfm
19:18:45 <clarkb> corvus: sounds like a reasonable start
19:19:04 <ianw> yeah, seems like we are a/b testing basically, and it's easily reverted
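If option (b) from corvus's list is ever tried, the knobs are plain sysctls. One commonly cited one for reclaimable slab pressure (not the cgroup option corvus named, and the value 200 is only an example) is vm.vfs_cache_pressure; a sketch:

    def sysctl_path(name):
        return '/proc/sys/' + name.replace('.', '/')

    def read_sysctl(name):
        with open(sysctl_path(name)) as f:
            return f.read().strip()

    def write_sysctl(name, value):
        # needs root; persist across reboots with a drop-in under /etc/sysctl.d
        with open(sysctl_path(name), 'w') as f:
            f.write(str(value))

    # vfs_cache_pressure biases how aggressively dentry/inode slab caches
    # are reclaimed relative to page cache (default 100).
    print('vm.vfs_cache_pressure =', read_sysctl('vm.vfs_cache_pressure'))
    # write_sysctl('vm.vfs_cache_pressure', 200)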
19:19:04 <dmsimard> It seems we are using a fairly old version of bubblewrap? 0.1.8 which goes back to March 2017 -- there was 0.2.0 released in October 2017 but nothing after that. The reason I mention that is because there seems to be a history of memory leaks: https://github.com/projectatomic/bubblewrap/issues/224
19:20:14 <fungi> and yeah, the newer kernel packages are just in the main suite for xenial, so no backports suite or ppa addition to sources.list needed: https://packages.ubuntu.com/xenial/linux-image-4.13.0-32-generic
19:20:33 <fungi> #link https://packages.ubuntu.com/xenial/linux-image-4.13.0-32-generic xenial package page for linux-image-4.13.0-32-generic
19:20:57 <corvus> i'd recommend ze02 for a test machine, it oomed yesterday
19:21:33 <corvus> anyone want to volunteer for installing hwe on ze02 and rebooting?  i'm deep into the zk problem
19:21:46 <ianw> i can take that on if we like, i have all day :)
19:21:53 <clarkb> ianw: thanks!
19:21:55 <corvus> ianw: awesome, thx!
19:22:19 <clarkb> corvus: re zk I put an agenda item re losing zk connection requires scheduler restart. Is that what you are debugging?
19:22:26 <corvus> clarkb: yep
19:22:39 <clarkb> I wasn't sure if this was a known issue with planned fixes that just need effort or if its debugging and fixing that hasn't been done yet or is in progress
19:23:06 <corvus> it's a logic bug in zuul that should be correctible.  i'm working on fixing it now
19:23:40 <clarkb> ok so keep an eye out for change(s) to fix that and try to review
19:24:13 <clarkb> Anything else we want to talk about before moving on?
19:24:53 <dmsimard> I'll just link this pad which summarizes what we had today: https://etherpad.openstack.org/p/HRUjBTyabM
19:25:06 <dmsimard> (not sure if I can #link as non-chair)
19:25:09 <clarkb> #link https://etherpad.openstack.org/p/HRUjBTyabM for info on zuul outage today
19:25:13 <clarkb> dmsimard: I think you can but ^ got it
19:25:28 <clarkb> #topic General Topics
19:26:00 <clarkb> ianw hrw arm64/aarch64 (is one more correct than the other?) updates?
19:26:26 <ianw> oh, that was from ages ago, just haven't taken it out
19:26:31 <hrw> clarkb: aarch64 is architecture name, arm64 is popular alias
19:26:33 <ianw> but i can
19:26:59 <ianw> #link http://lists.openstack.org/pipermail/openstack-infra/2018-February/005817.html
19:27:00 <clarkb> ah wasn't sure if you wanted to give a new update since I'm guessing things have moved over the last week?
19:27:07 <ianw> i sent that update
19:27:24 <fungi> yeah, arm64 is simply what debian called their port (and a number of other popular distros followed suit)
19:27:39 <fungi> much like ppc64
19:27:49 <ianw> basically there's some AFS patches out which seem to work ... i have a custom build and need to puppet that into the mirror server
19:28:02 <ianw> gpt support for dib is done i think
19:28:30 <ianw> EFI is going pretty well; i can produce working efi images for amd64 & arm64
19:28:49 <ianw> next thing to test is the infra elements + base build on arm64
19:28:58 <ianw> once we have all that sorted, nothing to stop us launching nodes!
19:29:57 <clarkb> neat!
19:30:30 <hrw> I spoke with Andreas on sunday to mark those patches as priority ones ;D
19:31:25 <pabelanger> cool
19:32:00 <ianw> yeah, i still have the block-device-* element stuff marked as WIP, but i'm getting more convinced it's the way to go, no major problems seem to be revealing themselves
19:32:40 <fungi> that's excellent news
19:34:11 <clarkb> #topic Project Renames
19:34:47 <clarkb> I think we've got the pieces in place to actually be able to perform these sanely now
19:35:13 <clarkb> mordred: fungi I think you were going to try and write up a process that could be used?
19:35:34 <clarkb> it would probably be good to schedule this and try to get it done though as it has been quite a while
19:35:44 <clarkb> and in the process we can try ianw's fix for the nova specs repo replication
19:35:53 <fungi> i can give it a shot... though i'm not entirely sure the extent to which a project (or at least its jobs?) need to be entirely unwound in zuul before renaming
19:36:28 <fungi> i suppose we can merge a change which tells zuul to ignore the in-repo configuration temporarily until it gets patched
19:37:05 <clarkb> fungi: ya could just remove it from the zuul project list temporarily
19:37:06 <fungi> though unsure whether that's sufficient
19:37:23 <clarkb> it's going to have to be updated there regardless
19:37:36 <fungi> oh, true we could entirely remove it from zuul, bypass code review to merge the config changes in those repos, then readd with the new name
19:38:00 <clarkb> looking at a calendar release happens during ptg so probably the earliest we want to do renames is week after ptg
19:38:16 <fungi> i'd be cool with trying it then
19:38:43 <fungi> unless that's too soon after the main release due to the release-trailing projects
19:38:53 <corvus> i won't be around ptg+1 week
19:38:53 <mordred> clarkb: gah. sorry - was looking away from the computer with my head in code
19:39:12 <fungi> worth checking with the release team on when would be a good time for a scheduled gerrit outage
19:39:29 <clarkb> ok lets ping them and ask
19:39:34 <clarkb> I can do that
19:40:09 <fungi> cool. it could also just get discussed with them at the ptg if preferred
19:40:13 <clarkb> corvus: you are back for the ptg+2 week?
19:40:40 * mordred will be around both weeks
19:41:10 <clarkb> fungi: do you think if we corner them at ptg we'll have a better shot at getting the day we want? :P
19:41:24 <corvus> clarkb: yep, should be
19:41:41 <fungi> i think the answer will be the same either way. they're a surprisingly consistent lot ;)
19:41:50 <clarkb> alright then lets sync up with them and go from there
19:42:05 <corvus> (i don't think i'm a required participant, just adding info)
19:42:15 <clarkb> #topic Open Discussion
19:43:05 <fungi> the more infra-root admins we have on hand during the rename maintenance in case of emergencies, the better. also deep understanding of zuul job configuration will be helpful if we have to think on our feet this first time
19:43:15 <clarkb> re zuul networking outage this morning/last night (whatever it is relative to your timezone) it would be good to not forget to sort out the ipv6 situation on the executors
19:43:32 <clarkb> I think that if we end up leaving them in that state for more than a day we should consider removing AAAA records in DNS
19:43:57 <pabelanger> I'd like to maybe discuss how we can start testing migration to a zookeeper cluster, might even be a better topic for the PTG
19:43:59 <corvus> remove gerrit's quad a?
19:44:08 <ianw> clarkb: is that the current theory for why things dropped out of zk?
19:44:11 <fungi> yeah, as soon as the rest of the executors get service restarts to pick up newer zuul performance improvements, i'll be stopping ze09 and rebooting it with its zuul-executor service disabled so it can be used to open a ticket with rackspace
19:44:18 <clarkb> corvus: the zuul executors
19:44:40 <clarkb> corvus: because things talk to them and may resolve AAAA too
19:44:51 <clarkb> (specifically won't the console streaming do something like that from the scheduler?)
19:44:55 <corvus> clarkb: ok, but the big issue we saw was the executors not connecting to gerrit... very little is inbound to executors... just log streaming i think
19:45:31 <fungi> i favor building replacement executors rather than running some with no ipv6 longer-term
19:45:45 <clarkb> I'm just thinking if the lack of ipv6 is going to be longish term then not having dns records for those IPs would probably be good
19:45:53 <fungi> though at this point we have no reason to believe we won't end up with new instances exhibiting the same issues
19:46:34 <fungi> leaving some ipv6-less means inconsistencies between executors which seems equally bad unless we're doing it explicitly to test something
19:46:35 <corvus> clarkb: yeah... i'm worried that if the issue is 'ipv6 is broken in all directions' then those records are actually the least of our worries
19:46:49 <dmsimard> can we open a ticket with rax to see if there are any known issues?
19:46:56 <clarkb> corvus: ya I think we should work to get a ticket into rax too to make sure that they at least know about it
19:46:59 <dmsimard> It seems odd that only certain executors are affected
19:47:08 <fungi> dmsimard: to restate what i just said above "yeah, as soon as the rest of the executors get service restarts to pick up newer zuul performance improvements, i'll be stopping ze09 and rebooting it with its zuul-executor service disabled so it can be used to open a ticket with rackspace"
19:47:10 <clarkb> dmsimard: if I had to guess its nova cell related
19:47:21 <clarkb> knowing what little I know about rackspace's internals
19:47:23 <dmsimard> fungi: +1
19:49:17 <clarkb> corvus: maybe we wait for people to notice log streaming is broken?
19:49:29 <clarkb> (though I haven't confirmed that and it may just work due to failover to v4 magically)
19:50:21 <corvus> yeah, it's broken enough these days i wouldn't sweat it.
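The "failover to v4" clarkb mentions is ordinary getaddrinfo() behaviour for a client that walks every returned address; a sketch (the host and port in the example call are purely illustrative):

    import socket

    def connect_any(host, port, timeout=5):
        last_err = None
        # getaddrinfo typically lists AAAA results before A when the resolver
        # prefers IPv6, so a broken v6 path just costs one failed connect
        # before the v4 address is tried.
        for family, stype, proto, _, addr in socket.getaddrinfo(
                host, port, socket.AF_UNSPEC, socket.SOCK_STREAM):
            try:
                sock = socket.socket(family, stype, proto)
                sock.settimeout(timeout)
                sock.connect(addr)
                return sock
            except OSError as err:
                last_err = err
        raise last_err

    # connect_any('ze02.openstack.org', 7900)  # example host/port only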
19:51:39 <clarkb> As a heads up I'm doing a day trip to seattle on friday to talk about infra things in a bar
19:51:49 <clarkb> so I will probably be mostly afk doing that
19:52:02 <Shrews> clarkb: let's all do that
19:52:06 <ianw> oh, one thing that came up in our discussions over AFS was there might be a request for us to try out kafs
19:52:09 <fungi> it's what i do every day
19:52:36 <AJaeger> clarkb: nice ;)
19:52:43 <ianw> i suggested that we might be able to load-balance in a kafs based mirror server at some point as a testing environment
19:53:20 <clarkb> ianw: til about kafs. I think that would be worthwhile to test considering the afs incompatibility with ubuntu's hwe kernels on xenial
19:53:46 <corvus> does it support the sha256 based keys?
19:54:16 <fungi> i'm still struggling to find the official page for it
19:54:19 <ianw> i don't know details like that ATM :)
19:54:45 <ianw> #link https://wiki.openafs.org/devel/LinuxKAFSNotes/
19:54:48 <corvus> it's worth doing some manual local testing to make sure it works
19:54:51 <ianw> has some details
19:55:22 <fungi> thanks
19:55:23 <clarkb> looks like 4.15 included major improvements to kafs so we may need a very new kernel for it to be useable
19:55:26 <clarkb> but we can test that
19:55:30 <clarkb> also it supports ipv6 neat
19:56:07 <ianw> yeah, it's definitely a custom hand-held rollout for testing
19:56:40 <fungi> i'm wondering how kafs could effectively support ipv6 without a v6-supporting afs server
19:56:50 <fungi> since kafs just looks like a client cache manager
19:57:03 <clarkb> fungi: looks like maybe that is for auristor?
19:57:39 <fungi> also, i thought v4-isms in the protocol abounded and a newer afs-like protocol or protocol revision was needed to do it over v6
19:58:16 <fungi> so maybe auristor supports an afs-like protocol over v6, i suppose
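For anyone poking at kafs later, a quick sketch for checking whether a given kernel even ships the in-tree client (module name kafs; the path follows the usual /lib/modules layout, and a kernel with kafs built in rather than modular would not show up in the first check):

    import os
    import subprocess

    release = os.uname().release
    moddir = '/lib/modules/%s/kernel/fs/afs' % release
    print('kafs built as a module:', os.path.isdir(moddir))

    # modprobe -n resolves the module without actually loading it
    rc = subprocess.run(['modprobe', '-n', 'kafs']).returncode
    print('modprobe can find kafs:', rc == 0)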
19:58:56 <clarkb> alright we are just about at time. Thanks everyone! If you like rocket launches we are about 45 minutes away from the falcon heavy launch
19:59:07 <clarkb> I'm going to go find lunch now
19:59:11 <clarkb> #endmeeting