19:01:45 #startmeeting infra
19:01:46 Meeting started Tue Feb 6 19:01:45 2018 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:47 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:49 The meeting name has been set to 'infra'
19:01:50 yeah, if someone wants to write an app that dynamically moves alarms based on launch delays, that would be awesome
19:01:54 o/
19:02:23 #link https://wiki.openstack.org/wiki/Meetings/InfraTeamMeeting#Agenda_for_next_meeting
19:02:46 word of warning this morning's fires have me not very well prepared for the meeting but we do have an agenda and we'll do our best
19:02:55 #topic Announcements
19:02:57 o/
19:03:09 The summit CFP closes on the 8th (that is like 1.5 days from now)
19:03:28 #link https://etherpad.openstack.org/p/infra-rocky-ptg PTG Brainstorming
19:04:26 And finally it is PTL election season with nominations happening for another day or so
19:04:48 #topic Actions from last meeting
19:05:10 #link http://eavesdrop.openstack.org/meetings/infra/2018/infra.2018-01-30-19.01.txt Notes from last meeting
19:05:36 #action clarkb / corvus / everyone / to take pass through old zuul and nodepool master branch changes to at least categorize changes
19:05:52 I don't think we've really had time to do ^ with the various fires we've been fighting so we'll keep that on the board
19:05:56 i *think* we've worked through the backlog there
19:06:19 oh cool. Should I undo it then?
19:06:26 tobiash and i (maybe others?) did a bunch last week, and, at least, i think we're done
19:06:31 #undo
19:06:32 Removing item from minutes: #action clarkb / corvus / everyone / to take pass through old zuul and nodepool master branch changes to at least categorize changes
19:06:42 thank you everyone for somehow managing to get through that despite the fires
19:06:51 I'm still on the hook for cleaning out old specs
19:06:55 #action clarkb clear out old infra specs
19:07:26 #topic Specs approval
19:07:38 Any specs stuff we need to go over that I have missed?
19:08:20 Sounds like no
19:08:20 o/
19:08:30 #topic Priority Efforts
19:08:36 #topic Zuul v3
19:08:48 corvus: you wanted to talk about the zuul executor OOMing
19:09:08 the oom-killer has a habit of killing the log streaming daemon
19:09:18 we thought this was because it was running out of memory
19:09:20 understandably
19:09:33 but on closer inspection, i don't think it actually is. i think it only thinks it is.
19:09:44 a false-positive?
19:10:14 now that the governors are improved, we can see that even with 50% +/- 10% of the physical memory "available", we've still seen oom-killer invoked
19:10:41 the problem appears to be that the memory is not 'free', just 'reclaimable'. especially 'reclaimable slab' memory.
19:11:12 there have been some kernel bugs related to this, and a lot of changes, especially around the early 4.4 kernels
19:11:50 i'm inclined to think that we're suffering from our workload patterns being suboptimal for whatever kernel memory management is going on
19:12:07 ahh, that's believable
19:12:19 one suggestion I had was to try the hwe kernels ubuntu publishes. I've had reasonably good luck locally with them due to hardware needs but they get you a 4.13 kernel on xenial
19:12:38 worth considering
19:12:39 considering these servers are mostly disposable and replaceable we can always rebuild them with older kernels if necessary as well
19:13:00 i think our options include: a) throw more executors at the problem and therefore use less ram. b) try to tune kernel memory management parameters. c) open a bug report and try kernel bugfixes. d) try a different kernel via hwe or os upgrade.
19:13:07 you can easily switch between default boot kernels without rebuilding, for that matter
19:13:20 hwe kernel is worth a shot
19:13:29 yeah, i'm inclined to start with 4.13
19:13:52 that seems like the easiest way to rule out old fixed bugs (at the risk of discovering newer ones)
19:13:53 we can even try it "online".. as in, install it, reboot without rebuilding the instance from a new image
19:14:12 dmsimard: yeah, that's what i was assuming we'd do
19:14:13 i'm not enthused about debugging 4.4 at this point in its lifecycle
19:14:40 is tuning OOM killer something we want to try too? As to not kill an executor process before ansible-playbook?
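Editorial note: the distinction drawn above between memory that is 'free' and memory that is only 'available' or 'reclaimable slab' can be inspected from /proc/meminfo. The following is a minimal sketch, not something used in the meeting; the function names and the SAMPLE figures are hypothetical and chosen only to mirror the "about 50% available" situation described. MemTotal, MemFree, MemAvailable and SReclaimable are standard /proc/meminfo fields.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of field name -> value (kB)."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # skip anything that is not "Name: value"
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def memory_summary(fields):
    """Express free, available and reclaimable-slab memory as % of MemTotal."""
    total = fields["MemTotal"]
    return {
        "free_pct": 100.0 * fields["MemFree"] / total,
        "available_pct": 100.0 * fields["MemAvailable"] / total,
        "reclaimable_slab_pct": 100.0 * fields["SReclaimable"] / total,
    }


# Hypothetical figures (an assumption, not data from the executors):
# little memory is truly free, but half is "available", mostly as slab.
SAMPLE = """\
MemTotal:        8000000 kB
MemFree:          400000 kB
MemAvailable:    4000000 kB
SReclaimable:    3200000 kB
"""

if __name__ == "__main__":
    try:
        with open("/proc/meminfo") as f:
            text = f.read()
    except OSError:
        text = SAMPLE  # fall back to the sample on non-Linux systems
    for key, value in memory_summary(parse_meminfo(text)).items():
        print(f"{key}: {value:.1f}%")
```

A large gap between free_pct and available_pct, with most of it accounted for by reclaimable_slab_pct, would be consistent with the pattern described in the discussion, though it does not by itself show why the oom-killer fired.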
19:14:42 fungi: I forget if they have a PPA for them but I've installed them "manually" from http://kernel.ubuntu.com/~kernel-ppa/mainline/ before
19:14:48 we are also a few months away from new ubuntu lts release which should in theory be a simpler upgrade since we've already done the systemd jump
19:15:09 so we may not have to live with the hwe kernel for long
19:15:11 pabelanger: i think the concern with tuning is that the bug would still be present, and we'd just kill other things
19:15:12 fungi: ah that's a ppa right in a URL
19:15:31 fungi: ya that should probably be a last resort especially if we actually do have memory available after all
19:15:34 dmsimard: not a ppa, just a non-default kernel package (might be in the backports suite though)
19:15:55 fungi: yah, thought is if we kill ansible-playbook, job would be aborted and retried? Where if we lose executor we have to restart everything. But agree, not best solution
19:15:59 ubuntu hwe kernels work fine. my private server runs them since they started doing them
19:16:24 hrw: ya my home server runs it too
19:16:37 the only downside I've seen was getting kpti patches took longer than the base kernel
19:16:43 we have had a bad experience with them in the past. they broke openafs the last time we had one installed.
19:16:53 the executors need openafs
19:16:56 it's not so much whether they "work fine" but whether we turn up new and worse bugs than we're trying to rule out
19:17:00 that's something to look out for
19:17:12 all kernel versions have bugs. new versions usually have some new bugs too
19:18:12 yes, the openafs lkm does complicate this a little too (not logistically, just surface area for previously unnoticed bugs/incompatibilities)
19:18:14 right if this was the gerrit server for example I'd be more cautious
19:18:22 okay, so let's start with an in-place test of hwe kernel, then maybe there's some voodoo mm stuff we can try (like disabling cgroups memory management), then maybe we throw more machines at it until we get yet another newer kernel with the upgrade?
19:18:24 but we can fairly easily undo any kernel updates on these zuul executors
19:18:41 wfm
19:18:45 corvus: sounds like a reasonable start
19:19:04 yeah, seems like we are a/b testing basically, and it's easily reverted
19:19:04 It seems we are using fairly old version of bubblewrap ? 0.1.8 which goes back to March 2017 -- there was 0.2.0 released in October 2017 but nothing after that. The reason I mention that is because there seems to be a history of memory leak: https://github.com/projectatomic/bubblewrap/issues/224
19:20:14 and yeah, the newer kernel packages are just in the main suite for xenial, so no backports suite or ppa addition to sources.list needed: https://packages.ubuntu.com/xenial/linux-image-4.13.0-32-generic
19:20:33 #link https://packages.ubuntu.com/xenial/linux-image-4.13.0-32-generic xenial package page for linux-image-4.13.0-32-generic
19:20:57 i'd recommend ze02 for a test machine, it oomed yesterday
19:21:33 anyone want to volunteer for installing hwe on ze02 and rebooting? i'm deep into the zk problem
19:21:46 i can take that on if we like, i have all day :)
19:21:53 ianw: thanks!
19:21:55 ianw: awesome, thx!
19:22:19 corvus: re zk I put an agenda item re losing zk connection requires scheduler restart. Is that what you are debugging?
19:22:26 clarkb: yep
19:22:39 I wasn't sure if this was a known issue with planned fixes that just need effort or if its debugging and fixing that hasn't been done yet or is in progress
19:23:06 it's a logic bug in zuul that should be correctible. i'm working on fixing it now
19:23:40 ok so keep an eye out for change(s) to fix that and try to review
19:24:13 Anything else we want to talk about before moving on?
19:24:53 I'll just link this pad which summarizes what we had today: https://etherpad.openstack.org/p/HRUjBTyabM
19:25:06 (not sure if I can #link as non-chair)
19:25:09 #link https://etherpad.openstack.org/p/HRUjBTyabM for info on zuul outage today
19:25:13 dmsimard: I think you can but ^ got it
19:25:28 #topic General Topics
19:26:00 ianw hrw arm64/aarch64 (is one more correct than the other?) updates?
19:26:26 oh, that was from ages ago, just haven't taken it out
19:26:31 clarkb: aarch64 is architecture name, arm64 is popular alias
19:26:33 but i can
19:26:59 #link http://lists.openstack.org/pipermail/openstack-infra/2018-February/005817.html
19:27:00 ah wasn't sure if you wanted to give a new update since I'm guessing things have moved over the last week?
19:27:07 i sent that update
19:27:24 yeah, arm64 is simply what debian called their port (and a number of other popular distros followed suit)
19:27:39 much like ppc64
19:27:49 basically there's some AFS patches out which seem to work ... i have a custom build and need to puppet that into the mirror server
19:28:02 gpt support for dib is done i think
19:28:30 EFI is going pretty well; i can produce working efi images for amd64 & arm64
19:28:49 next thing to test is the infra elements + base build on arm64
19:28:58 once we have all that sorted, nothing to stop us launching nodes!
19:29:57 neat!
19:30:30 I spoke with Andreas on sunday to mark those patches as priority ones ;D
19:31:25 cool
19:32:00 yeah, i still have the block-device-* element stuff marked as WIP, but i'm getting more convinced it's the way to go, no major problems seem to be revealing themselves
19:32:40 that's excellent news
19:34:11 #topic Project Renames
19:34:47 I think we've got the pieces in place to actually be able to perform these sanely now
19:35:13 mordred: fungi I think you were going to try and write up a process that could be used?
19:35:34 it would probably be good to schedule this and try to get it down though as it has been quite a while
19:35:44 and in the process we can try ianw's fix for the nova specs repo replication
19:35:53 i can give it a shot... though i'm not entirely sure the extent to which a project (or at least its jobs?) need to be entirely unwound in zuul before renaming
19:36:28 i suppose we can merge a change which tells zuul to ignore the in-repo configuration temporarily until it gets patched
19:37:05 fungi: ya could just remove it from the zuul project list temporarily
19:37:06 though unsure whether that's sufficient
19:37:23 its going to have to be updated there regardless
19:37:36 oh, true we could entirely remove it from zuul, bypass code review to merge the config changes in those repos, then readd with the new name
19:38:00 looking at a calendar release happens during ptg so probably the earliest we want to do renames is week after ptg
19:38:16 i'd be cool with trying it then
19:38:43 unless that's too soon after the main release due to the release-rtailing projects
19:38:52 er, release-trailing
19:38:53 i won't be around ptg+1 week
19:38:53 clarkb: gah. sorry - was looking away from computer with heads in code
19:39:12 worth checking with the release team on when would be a good time for a scheduler gerrit outage
19:39:20 er, scheduled
19:39:29 ok lets ping them and ask
19:39:34 I can do that
19:40:09 cool. it could also just get discussed with them at the ptg if preferred
19:40:13 corvus: you are back the ptg +2 week?
19:40:40 * mordred will be around both weeks
19:41:10 fungi: do you think if we corner them at ptg we'll have a better shot at getting the day we want? :P
19:41:24 clarkb: yep, should be
19:41:41 i think the answer will be the same either way. they're a surprisingly consistent lot ;)
19:41:50 alright then lets sync up with them and go from there
19:42:05 (i don't think i'm a required participant, just adding info)
19:42:15 #topic Open Discussion
19:43:05 the more infra-root admins we have on hand during the rename maintenance in case of emergencies, the better. also deep understanding of zuul job configuration will be helpful if we have to think on our feet this first time
19:43:15 re zuul networking outage this morning/last night (whatever it is relative to your timezone) it would be good to not forget to sort out the ipv6 situation on the executors
19:43:32 I think that if we end up leaving them in that state for more than a day we should consider removing AAAA records in DNS
19:43:57 I'd like to maybe discuss how we can start testing migration to zookeeper cluster, might even be better topic for PTG
19:43:59 remove gerrit's quad a?
19:44:08 clarkb: is that the current theory for why things dropped out of zk?
19:44:11 yeah, as soon as the rest of the executors get service restarts to pick up newer zuul performance improvements, i'll be stopping ze09 and rebooting it with its zuul-executor service disabled so it can be used to open a ticket with rackspace
19:44:18 corvus: the zuul executors
19:44:40 corvus: because things talk to them and may resolve AAAA too
19:44:51 (specifically won't the console streaming do something like that from the scheduler?)
19:44:55 clarkb: ok, but the big issue we saw was the executors not connecting to gerrit... very little is inbound to executors... just log streaming i think
19:45:31 i favor building replacement executors rather than running some with no ipv6 longer-term
19:45:45 I'm just thinking if the lack of ipv6 is going to be longish term then not having dns records for those IPs would probably be good
19:45:53 though at this point we have no reason to believe we won't end up with new instances exhibiting the same issues
19:46:34 leaving some ipv6-less means inconsistencies between executors which seems equally bad unless we're doing it explicitly to test something
19:46:35 clarkb: yeah... i'm worried that if the issue is 'ipv6 is broken in all directions' then those records are actually the least of our worries
19:46:49 can we open a ticket with rax to see if there are any known issues?
19:46:56 corvus: ya I think we should work to get a ticket into rax too to make sure that they at least know about it
19:46:59 It seems odd that only certain executors are affected
19:47:08 dmsimard: to restate what i just said above "yeah, as soon as the rest of the executors get service restarts to pick up newer zuul performance improvements, i'll be stopping ze09 and rebooting it with its zuul-executor service disabled so it can be used to open a ticket with rackspace"
19:47:10 dmsimard: if I had to guess its nova cell related
19:47:21 knowing what little I know about rackspace's internals
19:47:23 fungi: +1
19:49:17 corvus: maybe we wait for people to notice log streaming is broken?
19:49:29 (though I haven't confirmed that and may just work due to failover to v4 magically)
19:50:21 yeah, it's broken enough these days i wouldn't sweat it.
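Editorial note: the AAAA question above comes down to what a client sees when it resolves an executor's name. The following is a minimal sketch under stated assumptions, not a tool used by the team: the helper names are hypothetical, and it only groups the tuples returned by Python's standard socket.getaddrinfo by address family so that a host still publishing IPv6 addresses is easy to spot.

```python
import socket


def addresses_by_family(addrinfo):
    """Group getaddrinfo()-style tuples into unique IPv4 and IPv6 addresses."""
    result = {"ipv4": [], "ipv6": []}
    for family, _socktype, _proto, _canonname, sockaddr in addrinfo:
        if family == socket.AF_INET:
            key = "ipv4"
        elif family == socket.AF_INET6:
            key = "ipv6"
        else:
            continue  # ignore any other address families
        address = sockaddr[0]
        if address not in result[key]:
            result[key].append(address)
    return result


def resolve(host, port=22):
    """Resolve host (performs a real DNS lookup) and group the results."""
    info = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    return addresses_by_family(info)
```

Calling resolve() on an executor's hostname (not done here, since it needs network access) and finding a non-empty "ipv6" list would indicate that clients may still attempt IPv6 first and depend on fallback to IPv4.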
19:51:39 As a heads up I'm doing a day trip to seattle on friday to talk about infra things in a bar
19:51:49 so I will probably be mostly afk doing that
19:52:02 clarkb: let's all do that
19:52:06 oh, one thing that came up in our discussions over AFS was there might be a request for us to try out kafs
19:52:09 it's what i do every day
19:52:36 clarkb: nice ;)
19:52:43 i suggested that we might be able to load-balance in a kafs based mirror server at some point as a testing environment
19:53:20 ianw: til about kafs. I think that would be worthwhile to test considering the afs incompatibility with ubuntu's hwe kernels on xenial
19:53:46 does it support the sha256 based keys?
19:54:16 i'm still struggling to find the official page for it
19:54:19 i don't know details like that ATM :)
19:54:45 #link https://wiki.openafs.org/devel/LinuxKAFSNotes/
19:54:48 it's worth doing some manual local testing to make sure it works
19:54:51 has some details
19:55:22 thanks
19:55:23 looks like 4.15 included major improvements to kafs so we may need a very new kernel for it to be useable
19:55:26 but we can test that
19:55:30 also it supports ipv6 neat
19:56:07 yeah, it's definitely a custom hand-held rollout for testing
19:56:40 i'm wondering how kafs could effectively support ipv6 without a v6-supporting afs server
19:56:50 since kafs just looks like a client cache manager
19:57:03 fungi: looks like maybe that is for auristor?
19:57:39 also, i thought v4-isms in the protocol abounded and a newer afs-like protocol or protocol revision was needed to do it over v6
19:58:16 so maybe auristor supports an afs-like protocol over v6, i suppose
19:58:56 alright we are just about at time. Thanks everyone! If you like rocket launches we are about 45 minutes away from the falcon heavy launch
19:59:07 I'm going to go find lunch now
19:59:11 #endmeeting