19:01:06 <clarkb> #startmeeting infra
19:01:07 <openstack> Meeting started Tue Feb 13 19:01:06 2018 UTC and is due to finish in 60 minutes.  The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:08 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:11 <openstack> The meeting name has been set to 'infra'
19:01:18 <clarkb> #link https://wiki.openstack.org/wiki/Meetings/InfraTeamMeeting#Agenda_for_next_meeting
19:01:29 <clarkb> #topic Announcements
19:01:35 <clarkb> #link https://etherpad.openstack.org/p/infra-rocky-ptg PTG Brainstorming
19:01:52 <clarkb> If we end up having time towards the end of the meeting we will swing back around on ^ to make sure we aren't missing anything critical
19:02:02 <clarkb> but please do skim it over when you have a chance
19:02:23 <clarkb> also PTL Election season is now, about one day left to vote in elections. I think kolla, QA, and one other project have elections this time around
19:02:37 <fungi> mistral
19:02:52 <fungi> voter turn-out is around 40% across all 3 teams
19:02:54 <clarkb> so go vote if you are eligible and haven't yet
19:03:25 <clarkb> #topic Actions from last meeting
19:03:32 <clarkb> #link http://eavesdrop.openstack.org/meetings/infra/2018/infra.2018-02-06-19.01.txt minutes from last meeting
19:03:54 <clarkb> I have yet to clean out old specs because either there is firefighting or I've come down with the latest plague it seems like
19:04:00 <clarkb> #action clarkb actually clean up specs
19:04:32 <clarkb> #topic Specs Approval
19:04:50 <clarkb> (sorry for moving fast but I would like to get through normal agenda then hopefully have time to talk about ptg things as it is coming up fast)
19:05:03 <clarkb> I am not aware of any specs that need review other than general cleanup
19:05:12 <clarkb> have I missed any that people want to call attention to?
19:05:36 <fungi> the topic of polling platforms has come up recently in a few conversations
19:05:53 <fungi> i wonder if we should ressurrect the proposed spec for that as a help-wanted thing
19:06:19 <fungi> might be a potential future internship opportunity much in the same way codesearch.o.o was
19:06:44 <clarkb> in the email I wrote last month I suggested that maybe we didn't need that anymore but since then yes more people have asked about polls
19:06:53 <clarkb> probably a good indication there is still a use for it
19:07:27 <clarkb> when going over the old specs I can look at resurrecting it
19:07:39 <fungi> cool, just a thought i had
19:08:16 <clarkb> #topic Priority Efforts
19:08:27 <clarkb> #topic Zuul v3
19:08:50 <clarkb> corvus: ianw Looks like you want to talk about executors OOMing and the status of ze02 with the new kernel
19:09:21 <corvus> good topic!
19:09:22 <ianw> looking at http://grafana01.openstack.org/dashboard/db/zuul-status you'd have to think ze02 was the lowest memory usage?
19:09:43 <corvus> it's the highest
19:09:50 <fungi> "used"
19:09:54 <corvus> i'm going to invert that metric :)
19:10:20 <fungi> yeah, available ram is sort of backwards from what people expect to see graphed
19:10:37 <ianw> ahhh, well ... the only difference there is the hwe kernel
19:10:46 <ianw> so is it working or not? :)
19:10:49 <corvus> anyway, the biggest thing i would take from that is apparently memory management is different
19:11:04 <corvus> that's probably a good thing since i believe our oom was caused by poor memory management in the kernel
19:11:37 <fungi> it doesn't seem to be running more builds than the others though
19:11:44 <corvus> (ie, my hypothesis is that it wasn't recovering slab memory correctly)
19:12:23 <corvus> it's run for several days without ooming, which is a good sign (other servers have oom'd since, but not all)
19:12:29 <dmsimard> There slabtop we can use to monitor that (I haven't used that in ages)
19:12:42 <fungi> but yeah, i wonder if it's simply more accurately reporting actual available memory while the others simply seem like they have more available than they do
19:13:05 <corvus> if we wanted to increase our confidence it was improving things, we could try to unwind some of the executor load improvements we made recently in an attempt to intentionally stress the executors more.
19:13:23 <pabelanger> o/
19:13:39 <corvus> or we could assume it's an improvement and roll it out to the others.  then we'll have full coverage and obviously any oom will disprove it.
19:13:56 <fungi> fwiw, no oom indicated in dmesg since booting a week ago
19:14:03 <clarkb> I'm somewhat partial to just rolling it out considering how easy it is to revert (just build new executors or switch kernels and reboot)
19:14:05 <dmsimard> If the HWE kernel hasn't made thing _worse_, I would roll it out to the others
19:14:13 <dmsimard> and what clarkb said
19:14:27 <clarkb> I think increasing the load to shake things out more will likely result in more pain for our users
19:14:29 <pabelanger> we have had 2 zuul-executor OOM, taking down streaming too
19:14:30 <corvus> ianw: what's the cost for the afs package?
19:14:37 <clarkb> especially since hwe kernel hasn't catastrophically failed
19:14:37 <pabelanger> ze04 / ze09, IIRC
19:14:38 <ianw> ok; the only trick is i've got some custom afs .debs that need installing
19:14:59 <pabelanger> ianw: could we add them to a PPA?
19:15:00 <ianw> we can either puppet that in, or just do it manually
19:15:03 <fungi> we could set those up in our ppa?
19:15:11 <pabelanger> +1
19:15:18 <fungi> we already have bwrap in there
19:15:32 <fungi> er, set up a ppa for that in our ppa space i suppose
19:16:02 <corvus> i'm assuming we'll revisit this in 2-3months as well.
19:16:09 <ianw> i can just stick them on tarballs like https://tarballs.openstack.org/package-afs-aarch64/
19:17:16 <corvus> i sort of like the ppa aspect for .debs -- it seems more accessible and repeatable if someone else has to make an update
19:17:39 <pabelanger> agree
19:18:12 <pabelanger> I think there is also arm64 PPA support too?
19:18:18 <dmsimard> ianw: so afs is kernel-specific ? I mean, would that package work against a non-HWE kernel ? I mostly wonder about how this translates for people potentially uses our puppet modules/PPA
19:18:47 <clarkb> dmsimard: yes the drawback to being an out of kernel driver
19:18:48 <pabelanger> You may also request builds for arm64, armhf, and/or ppc64el. Use the "Change details" page for the PPA to enable the architectures you want.
19:18:50 <pabelanger> https://help.launchpad.net/Packaging/PPA
19:18:53 <clarkb> apis change and then you break against newer kernels
19:18:58 <ianw> dmsimard: the modules are pretty kernel specific, since kernel has no abi ... it just needs a later version to work against later kernels
19:19:18 <clarkb> package dependencies should be able to express that at least
19:19:21 <fungi> #link https://launchpad.net/~openstack-ci-core/+activate-ppa Activate a New “OpenStack CI Core” team Personal Package Archive
19:19:26 <fungi> for reference
19:19:29 <corvus> clarkb: well, to be fair *ubuntu's* afs package is broken against *ubuntu's* kernel.
19:19:30 <clarkb> ianw's package can depend on the hwe kernel
19:19:42 <dmsimard> ok, it's been too long since I've done debian packaging but clarkb beat me to it -- it should be possible for the newer package to require the hwe kernel to make sure it doesn't get installed on non-hwe
19:19:47 <pabelanger> fungi: FWIW: I don't have permission for that
19:19:59 <ianw> it won't really matter if it does anyway, it's backwards compat
19:20:06 <ianw> just not fowards
19:20:11 <dmsimard> ianw: oh, that's what I was asking -- if it's backwards compatible it's fine then
19:20:42 <clarkb> I like the idea of a ppa for this too fwiw. especially if we have to do it for arm too
19:20:42 <pabelanger> fungi: looks like need to be added to https://launchpad.net/~openstack-admins/+members
19:21:01 <corvus> pabelanger: done
19:21:21 <pabelanger> corvus: thankyou
19:21:28 <corvus> ianw: added you
19:21:37 <clarkb> so maybe we continue to monitor it while we get that set up and if still looking good do the update across the executors?
19:21:41 <ianw> so is this one ppa with "everything" we build?
19:21:53 <fungi> yeah, i expect any infra-root person who wants to help maintain that ppa can be added to the group there
19:22:09 <ianw> i just mean, will it drag in other stuff we've built
19:22:18 <pabelanger> ianw: yah, vhd-util and bwrap right now
19:22:25 <pabelanger> oh, and python3 testing :)
19:22:41 <clarkb> you have to explicitly install those packages but if you do ya should come from the ppa
19:22:49 <fungi> note that they're individual ppas. i expect this would be an openafs-hwe-kernel (or similar) ppa adjacent to the bwrap and vhdutil ones
19:22:59 <clarkb> ++
19:23:01 <ianw> oh right, that's what i meant
19:23:21 <fungi> as opposed to, say, a single ppa with all those unrelated packages dumped in
19:23:23 <ianw> ok, maybe we should discuss in infra.  i'll take an action item to get the ppa setup with packages and then we can do a few upgrades from that
19:23:35 <fungi> thanks ianw!
19:23:40 <corvus> dmsimard: added you
19:23:41 <clarkb> #action ianw set up afs ppa for arm and hwe kernels
19:23:43 <pabelanger> ianw: happy to help
19:23:56 <ianw> one other quick thing was that ipv6 is broken to ze02 for undeterminted reasons
19:24:10 <clarkb> ianw: separately from the larger scale ipv6 breakage we had right?
19:24:17 <corvus> i guess we should just replace that server?
19:24:23 <ianw> rax said they didn't know what's wrong and wanted access to the hosts on either end of the issue (cacti and ze02) which i wasn't too sure about
19:24:27 <fungi> where "broken" means some significant % packet loss, but not 100%
19:24:45 <ianw> yeah, so should i try 1) a hard reboot (maybe ends up on a different node) and 2) a rebuild
19:24:51 <fungi> do any other instances also get ipv6 packet loss communicating with ze02?
19:25:07 <ianw> i haven't noticed any.  a sure fire way is if cacti has no/spotty logs
19:25:19 <corvus> i'm not comfortable sharing access to ze02.
19:25:44 <clarkb> maybe make new server then nova rebuild old one with a new image?
19:26:00 <clarkb> I have a feeling nova actually reschedules you to a new hypervisor when you do that which may invalidate the testing
19:26:04 <corvus> clarkb: yeah, and if that works, share that with rax.  and if it doesn't, "oh well"
19:26:24 <corvus> i guess we could try to clean the server then give them access
19:26:50 <ianw> i don't think they *want* to investigate, but probably would if we asked, if you know what i mean
19:26:56 <fungi> we wouldn't want to reuse it after they log in, at any rate
19:27:10 <clarkb> ianw: in that case maybe we just make a new server
19:27:12 <fungi> and wouldn't want to leave keys on it
19:27:14 <corvus> wfm
19:27:16 <ianw> so basically ok to hard reboot/rebuild over the next day or so and see what happens?
19:27:22 <pabelanger> wfm
19:27:23 <clarkb> ianw: wfm
19:27:40 <fungi> sounds good
19:27:46 <clarkb> maybe do that before we commit hard to the hwe kernel. It is possible that breaks ipv6 somehow
19:27:54 <clarkb> however I think the server had similar problems before we updated the kernel
19:28:02 <ianw> true also, but the access (from cacti) has been bad for ages
19:28:02 <fungi> worth noting we still have cacti01 in dns
19:28:25 <pabelanger> fungi: oh, it is possible I didn't remove it
19:29:41 <clarkb> The other zuul item on the agenda is handling zk disconnects. Which I think should be fixed at this point, but wanted to double check on that before removing it from the agenda.
19:30:15 <pabelanger> do we want to consider migrating to new zookeeper at PTG?
19:30:25 <pabelanger> zk01 / zk02 / zk03
19:30:30 <pabelanger> from nodepool.o.o
19:30:37 <clarkb> pabelanger: probably not a bad time to do it considering load will likely be low during ptg
19:30:40 <fungi> how disruptive do we anticipate that being?
19:30:56 <pabelanger> I believe we need to stop / start everything to pick up new configuration
19:31:09 <fungi> it requires a full outage for nodepool because we can't expand/contract zk clusters i guess?
19:31:19 <clarkb> fungi: yes
19:31:22 <pabelanger> yes, and zuul too
19:31:30 <clarkb> at least my reading of docs is that you need a very new version of zk to do expansion safely
19:31:56 <fungi> oh, i suppose zuul does need to know to connect to the new cluster as well yes ;)
19:32:09 <dmsimard> Have we ever tested connecting to a cluster in the first place ?
19:32:12 <clarkb> process would be something like stop all services including zk, copy zk data files to 3 new servers. Start 3 new servers. Check quorum and data is present, start nodepool and zuul
19:32:24 <pabelanger> dmsimard: not yet, but we likely should
19:32:26 <dmsimard> I'm admittedly not knowledgeable about Zookeeper but maybe there's a few gotchas around connecting to a cluster
19:32:33 <pabelanger> would be good to confirm replication is working correctly
19:33:05 <dmsimard> For example, some client bindings require listing all the nodes in the configuration and the client decides where to go, others it's handled server side, etc.
19:33:19 <clarkb> ya its worth testing. We even have a cluster you can use :)
19:33:21 <dmsimard> pabelanger: not just replication but actual usage/connection
19:33:31 <tobiash> dmsimard, pabelanger: I connect to a cluster since day one ;)
19:33:39 <pabelanger> oh
19:33:40 <pabelanger> :)
19:33:44 <clarkb> tobiash wins. Can double check our config for us :P
19:33:45 <dmsimard> tobiash: well isn't that awesome
19:33:59 <dmsimard> a wild tobiash has appeared :D
19:34:14 <tobiash> didn't notice any problems related to this so far ;)
19:34:21 <clarkb> alright anything else related to zuul?
19:34:30 <pabelanger> okay, so we still need to figure out how to move data between servers, but i can maybe look into that and start an etherpad
19:34:40 <clarkb> pabelanger: its just file copies aiui
19:34:47 <pabelanger> okay
19:34:53 <clarkb> pabelanger: so stop zk everywhere, copy the files out, start zk
19:35:15 <clarkb> we may even be able to test it on the new cluster by just copying what is on disk at any point in time of existing server
19:35:43 <clarkb> (basically don't stop the source side)
19:35:52 <pabelanger> clarkb: yah, that's what I'm thinking for first test
19:36:19 <clarkb> #topic General Topics
19:36:28 <clarkb> ianw: gema aarch64 udpate time?
19:36:53 <ianw> i got what would be testing-nodes building a booting
19:37:17 <pabelanger> nice
19:37:20 <ianw> it's a bit janky but gets there -> https://review.openstack.org/542591
19:38:08 <gema> ianw: the hw_firmware_type is not needed anymore, the patch was not fully applied everywhere, should be redundant
19:38:18 <ianw> actually it's up if you have access @ root@211.148.24.197
19:38:19 <gema> ianw: but we can clean that up when we upgrade the clouds to queens
19:38:56 <ianw> cool.  so on my todo list is figuring out how to get the hwe kernel installed by default during the build ... it seems to work better, and just finishing off the efi stack
19:39:11 <ianw> the dib gate has been broken so took me on a journey of fixing jobs last few days
19:39:19 <gema> ianw: if HWE kernel works better on arm64 I can talk to the ubuntu folks to make it default on arm cloud images
19:39:43 <clarkb> ianw: worst case you should be able to just have a package install for it right?
19:39:49 <gema> ianw: we have a conversation pending for PTG, you can join if you'd like
19:39:53 <clarkb> small element that checks arch and installs hwe if arm
19:40:06 <ianw> yep, there's always a way :)
19:40:17 <gema> ianw: this is awesome work, thank you
19:40:21 <pabelanger> infra-needs-element :D
19:41:17 <ianw> if we get this ppa going, i was also going to update puppet to make sure we can build the mirror automatically
19:41:29 <fungi> much appreciated
19:41:31 <ianw> and we should probably mirror ports.ubuntu
19:41:48 <pabelanger> yah, that would be awesome
19:41:56 <fungi> is that where ubuntu sticks non-x86 architecture builds?
19:42:15 <ianw> yep, so we'd want to prune that to arm64
19:42:16 * fungi is too used to debian just putting all architectures in one place
19:43:57 <ianw> so i think it's all moving forward ... if anyone wants to review the dib stack at https://review.openstack.org/#/c/539731/ feel free :)
19:44:14 <gema> clarkb: from our side we are moving hw to a new lab in london next week and after that we'll bring up a new cloud, with some resources for infra also
19:44:56 <corvus> is this a second cloud or a replacement?
19:45:01 <gema> a second cloud
19:45:05 <clarkb> ianw: gema sounds like good progress, exciting!
19:45:13 <gema> we have another one in the US that needs updating and will also be added in due time
19:45:34 <fungi> most awesome
19:45:39 <gema> corvus: the plan is to have the three clouds running queens during the coming months
19:45:47 <gema> and have resources for infra on the three of them
19:45:47 <corvus> groovy!
19:46:13 <pabelanger> Yay
19:46:26 * clarkb is now going to try and quickly get through a couple of things leaving us ~10 minutes to talk PTG
19:46:28 <gema> and our members are super excited and talking about contributing more hw
19:46:59 <gema> clarkb: sorry, go for it :D
19:47:02 <clarkb> I've switched jjb's core review group in Gerrit to be self managed as they are already operating mostly independently. I do not think this is at odds with them remaining as an infra project
19:47:15 <clarkb> if anyone has concerns about this let me know but reality is they've been operating independently for some time
19:47:32 <clarkb> I did speak with the release team last week about project renames and the date I got back was March 16
19:47:52 <clarkb> cycle trailing releases should just be about done then. we will want to dobuel check with release team before taking gerrit down though
19:48:05 <fungi> yeah, in my tenure as ptl i basically just acted on whatever recommendations their core team had for adding new core members
19:48:12 <fungi> (jjb that is)
19:48:35 <pabelanger> I don't believe I'll be around that week to help
19:48:47 <clarkb> we can sort out process for renames after hte PTG I expect
19:48:50 <fungi> i'll be available
19:48:50 <clarkb> pabelanger: thats ok, I think several of us will be around we should have enough people to do it
19:49:27 <clarkb> Alright PTG planning
19:49:27 <fungi> that is the night before a major drinking holiday, but shouldn't pose any problems
19:49:38 <clarkb> fungi: ya its day after you have to wrry about
19:49:43 <fungi> ;)
19:49:58 <clarkb> Looks like we have a good collection of stuff for wed-fri
19:50:39 <clarkb> do we want to try and schedule that loosely say put all the Zuul stuff on wednesday/thursday and the rest of it on friday? Or do it more unconference style and let people work through what they are interested in as they are able?
19:51:03 <clarkb> I'm not sure if one of the other helps or hurts people's ability to attend based on travel or other scheduling demands
19:51:39 <corvus> i think attempting scheduling would be helpful
19:51:49 <gema> for the arm64 bit it'd help if we had a slot, so I know when we'll be discussing it. We will be splitting time between kolla and infra
19:51:59 <dmsimard> I'll stay in the cold land of Montreal to help regular infra-root things along during North American timezones :)
19:52:28 <clarkb> ok thats two votes for scheduling. I'll work up a proposed schedule then. Probably put it in the etherpad then send mail to the infra list letting people know they should look it over
19:52:43 <pabelanger> wfm
19:52:55 <gema> wfm
19:52:56 <clarkb> it probably won't be super specific and instead be rough like wednesday morning vs friday after noon
19:53:06 <fungi> yeah, there are a few non-infra discussions i need to split my time between too, so knowing when would be convenient helps me
19:53:07 <corvus> we still have an ethercalc, right?
19:53:17 <clarkb> corvus: oh good point I should use that
19:53:17 <fungi> indeed we do!
19:53:59 <fungi> there's no ptg-wide ethercalc integrated with ptgbot any longer, but doesn't preclude us from having our own spreadsheet for this
19:54:08 <clarkb> thinking about it that might be a good way to do "appointments" with the help days too
19:54:21 <clarkb> just block out each hour of the day monday/tuesday and let projects sign up
19:54:30 <pabelanger> great idea
19:54:49 <corvus> ++
19:55:07 <clarkb> ok I will work on getting that set up
19:55:37 <clarkb> and the last thing is do we want to try and do a team dinner (I think we should I've just been sick and otherwise distracted by family obligations haven't had much time to look into it)
19:55:47 <clarkb> worst case I imagine we can roll into a pub and sit at a couple tables
19:56:26 <clarkb> but maybe a quick poll of what night works for people would be good too
19:56:31 <corvus> i like eating with people
19:56:35 <corvus> also drinking
19:56:42 <clarkb> pabelanger: ^ were you maybe still interested in that?
19:56:54 <pabelanger> sure, I can look into it more
19:57:09 <pabelanger> get up etherpad for people to vote on a datetime
19:57:21 <clarkb> #topic open discussion
19:57:21 <fungi> dublin has a preponderance of venues where a large group can just wander in and get served. we may have to spread out across multiple tables, but i doubt we'll have trouble finding somewhere
19:57:28 <clarkb> few minutes for anything else that we've missed
19:57:41 <clarkb> fungi: good to know thanks
19:57:45 <AJaeger> One request: I have two stacks up that will remove together /usr/local/jenkins from our nodes, please review and comment on naming/location. https://review.openstack.org/541868 and https://review.openstack.org/543139/
19:58:39 <pabelanger> https://review.openstack.org/542422/ is something nodepool related to help better manage our min-ready nodes. If people are maybe interested in reviewing
19:58:41 <corvus> AJaeger: why don't we use the ansible script module ?
19:59:29 <AJaeger> corvus: where exactly? Happy to make changes...
19:59:43 <corvus> pabelanger: min-ready: 0 is now the default, but we may need launcher restarts to pick that up
20:00:10 <pabelanger> corvus: k, will update
20:00:13 <AJaeger> corvus: let's continue in #openstack-infra
20:00:28 <corvus> AJaeger: well, it's more of a question of approach -- rather than have jobs which expect scripts to already exist on a node, the ansible 'script' module copies and runs the scripts
20:01:11 <clarkb> ya discussion will have to move to -infra. We are out of time! thank you everyone.
20:01:17 <fungi> thanks!
20:01:18 <clarkb> #endmeeting