19:01:06 #startmeeting infra
19:01:07 Meeting started Tue Feb 13 19:01:06 2018 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:08 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:11 The meeting name has been set to 'infra'
19:01:18 #link https://wiki.openstack.org/wiki/Meetings/InfraTeamMeeting#Agenda_for_next_meeting
19:01:29 #topic Announcements
19:01:35 #link https://etherpad.openstack.org/p/infra-rocky-ptg PTG Brainstorming
19:01:52 If we end up having time towards the end of the meeting we will swing back around on ^ to make sure we aren't missing anything critical
19:02:02 but please do skim it over when you have a chance
19:02:23 also PTL election season is on now, about one day left to vote in elections. I think kolla, QA, and one other project have elections this time around
19:02:37 mistral
19:02:52 voter turn-out is around 40% across all 3 teams
19:02:54 so go vote if you are eligible and haven't yet
19:03:25 #topic Actions from last meeting
19:03:32 #link http://eavesdrop.openstack.org/meetings/infra/2018/infra.2018-02-06-19.01.txt minutes from last meeting
19:03:54 I have yet to clean out old specs because either there is firefighting or I've come down with the latest plague it seems like
19:04:00 #action clarkb actually clean up specs
19:04:32 #topic Specs Approval
19:04:50 (sorry for moving fast but I would like to get through the normal agenda then hopefully have time to talk about ptg things as it is coming up fast)
19:05:03 I am not aware of any specs that need review other than general cleanup
19:05:12 have I missed any that people want to call attention to?
19:05:36 the topic of polling platforms has come up recently in a few conversations
19:05:53 i wonder if we should resurrect the proposed spec for that as a help-wanted thing
19:06:19 might be a potential future internship opportunity much in the same way codesearch.o.o was
19:06:44 in the email I wrote last month I suggested that maybe we didn't need that anymore but since then yes more people have asked about polls
19:06:53 probably a good indication there is still a use for it
19:07:27 when going over the old specs I can look at resurrecting it
19:07:39 cool, just a thought i had
19:08:16 #topic Priority Efforts
19:08:27 #topic Zuul v3
19:08:50 corvus: ianw Looks like you want to talk about executors OOMing and the status of ze02 with the new kernel
19:09:21 good topic!
19:09:22 looking at http://grafana01.openstack.org/dashboard/db/zuul-status you'd have to think ze02 was the lowest memory usage?
19:09:43 it's the highest
19:09:50 "used"
19:09:54 i'm going to invert that metric :)
19:10:20 yeah, available ram is sort of backwards from what people expect to see graphed
19:10:37 ahhh, well ... the only difference there is the hwe kernel
19:10:46 so is it working or not? :)
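[editor's note: a minimal sketch of the kind of spot-check behind the "is it working or not?" question, run on an executor such as ze02. These are standard tools rather than anything cited in the meeting; the grep patterns and line counts are illustrative.]

    # confirm the HWE kernel is actually the running kernel
    uname -r

    # any OOM-killer activity since boot?
    dmesg -T | grep -i 'out of memory'

    # snapshot of slab usage, to see whether slab memory is being reclaimed
    slabtop -o | head -n 20

    # the raw numbers behind the "available memory" graph
    grep -E 'MemTotal|MemAvailable' /proc/meminfo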
19:10:49 anyway, the biggest thing i would take from that is apparently memory management is different
19:11:04 that's probably a good thing since i believe our oom was caused by poor memory management in the kernel
19:11:37 it doesn't seem to be running more builds than the others though
19:11:44 (ie, my hypothesis is that it wasn't recovering slab memory correctly)
19:12:23 it's run for several days without ooming, which is a good sign (other servers have oom'd since, but not all)
19:12:29 There is slabtop we can use to monitor that (I haven't used it in ages)
19:12:42 but yeah, i wonder if it's simply more accurately reporting actual available memory while the others simply seem like they have more available than they do
19:13:05 if we wanted to increase our confidence it was improving things, we could try to unwind some of the executor load improvements we made recently in an attempt to intentionally stress the executors more.
19:13:23 o/
19:13:39 or we could assume it's an improvement and roll it out to the others. then we'll have full coverage and obviously any oom will disprove it.
19:13:56 fwiw, no oom indicated in dmesg since booting a week ago
19:14:03 I'm somewhat partial to just rolling it out considering how easy it is to revert (just build new executors or switch kernels and reboot)
19:14:05 If the HWE kernel hasn't made things _worse_, I would roll it out to the others
19:14:13 and what clarkb said
19:14:27 I think increasing the load to shake things out more will likely result in more pain for our users
19:14:29 we have had 2 zuul-executor OOM, taking down streaming too
19:14:30 ianw: what's the cost for the afs package?
19:14:37 especially since the hwe kernel hasn't catastrophically failed
19:14:37 ze04 / ze09, IIRC
19:14:38 ok; the only trick is i've got some custom afs .debs that need installing
19:14:59 ianw: could we add them to a PPA?
19:15:00 we can either puppet that in, or just do it manually
19:15:03 we could set those up in our ppa?
19:15:11 +1
19:15:18 we already have bwrap in there
19:15:32 er, set up a ppa for that in our ppa space i suppose
19:16:02 i'm assuming we'll revisit this in 2-3 months as well.
19:16:09 i can just stick them on tarballs like https://tarballs.openstack.org/package-afs-aarch64/
19:17:16 i sort of like the ppa aspect for .debs -- it seems more accessible and repeatable if someone else has to make an update
19:17:39 agree
19:18:12 I think there is arm64 PPA support too?
19:18:18 ianw: so afs is kernel-specific ? I mean, would that package work against a non-HWE kernel ? I mostly wonder about how this translates for people potentially using our puppet modules/PPA
19:18:47 dmsimard: yes, the drawback of being an out-of-tree driver
19:18:48 You may also request builds for arm64, armhf, and/or ppc64el. Use the "Change details" page for the PPA to enable the architectures you want.
19:18:50 https://help.launchpad.net/Packaging/PPA
19:18:53 apis change and then you break against newer kernels
19:18:58 dmsimard: the modules are pretty kernel specific, since the kernel has no abi ... it just needs a later version to work against later kernels
19:19:18 package dependencies should be able to express that at least
19:19:21 #link https://launchpad.net/~openstack-ci-core/+activate-ppa Activate a New “OpenStack CI Core” team Personal Package Archive
19:19:26 for reference
19:19:29 clarkb: well, to be fair *ubuntu's* afs package is broken against *ubuntu's* kernel.
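[editor's note: a rough sketch of how an executor might consume the proposed PPA; the PPA and openafs package names below are placeholders, not anything that was actually published.]

    # add the (hypothetical) PPA and pull in the HWE kernel plus rebuilt openafs modules
    sudo add-apt-repository -y ppa:openstack-ci-core/openafs-hwe   # PPA name assumed
    sudo apt-get update
    sudo apt-get install -y linux-generic-hwe-16.04 openafs-modules-hwe   # package name assumed
    sudo reboot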
19:19:30 ianw's package can depend on the hwe kernel
19:19:42 ok, it's been too long since I've done debian packaging but clarkb beat me to it -- it should be possible for the newer package to require the hwe kernel to make sure it doesn't get installed on non-hwe
19:19:47 fungi: FWIW: I don't have permission for that
19:19:59 it won't really matter if it does anyway, it's backwards compat
19:20:06 just not forwards
19:20:11 ianw: oh, that's what I was asking -- if it's backwards compatible it's fine then
19:20:42 I like the idea of a ppa for this too fwiw. especially if we have to do it for arm too
19:20:42 fungi: looks like I need to be added to https://launchpad.net/~openstack-admins/+members
19:21:01 pabelanger: done
19:21:21 corvus: thank you
19:21:28 ianw: added you
19:21:37 so maybe we continue to monitor it while we get that set up and if still looking good do the update across the executors?
19:21:41 so is this one ppa with "everything" we build?
19:21:53 yeah, i expect any infra-root person who wants to help maintain that ppa can be added to the group there
19:22:09 i just mean, will it drag in other stuff we've built
19:22:18 ianw: yah, vhd-util and bwrap right now
19:22:25 oh, and python3 testing :)
19:22:41 you have to explicitly install those packages but if you do they should come from the ppa
19:22:49 note that they're individual ppas. i expect this would be an openafs-hwe-kernel (or similar) ppa adjacent to the bwrap and vhdutil ones
19:22:59 ++
19:23:01 oh right, that's what i meant
19:23:21 as opposed to, say, a single ppa with all those unrelated packages dumped in
19:23:23 ok, maybe we should discuss in infra. i'll take an action item to get the ppa set up with packages and then we can do a few upgrades from that
19:23:35 thanks ianw!
19:23:40 dmsimard: added you
19:23:41 #action ianw set up afs ppa for arm and hwe kernels
19:23:43 ianw: happy to help
19:23:56 one other quick thing was that ipv6 is broken to ze02 for undetermined reasons
19:24:10 ianw: separately from the larger scale ipv6 breakage we had right?
19:24:17 i guess we should just replace that server?
19:24:23 rax said they didn't know what's wrong and wanted access to the hosts on either end of the issue (cacti and ze02) which i wasn't too sure about
19:24:27 where "broken" means some significant % packet loss, but not 100%
19:24:45 yeah, so should i try 1) a hard reboot (maybe ends up on a different node) and 2) a rebuild
19:24:51 do any other instances also get ipv6 packet loss communicating with ze02?
19:25:07 i haven't noticed any. a sure-fire way to tell is if cacti has no/spotty logs
19:25:19 i'm not comfortable sharing access to ze02.
19:25:44 maybe make a new server then nova rebuild the old one with a new image?
19:26:00 I have a feeling nova actually reschedules you to a new hypervisor when you do that which may invalidate the testing
19:26:04 clarkb: yeah, and if that works, share that with rax. and if it doesn't, "oh well"
19:26:24 i guess we could try to clean the server then give them access
19:26:50 i don't think they *want* to investigate, but probably would if we asked, if you know what i mean
19:26:56 we wouldn't want to reuse it after they log in, at any rate
19:27:10 ianw: in that case maybe we just make a new server
19:27:12 and wouldn't want to leave keys on it
19:27:14 wfm
19:27:16 so basically ok to hard reboot/rebuild over the next day or so and see what happens?
19:27:22 wfm
19:27:23 ianw: wfm
19:27:40 sounds good
19:27:46 maybe do that before we commit hard to the hwe kernel. It is possible that breaks ipv6 somehow
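[editor's note: a sketch of the hard-reboot-then-rebuild sequence agreed on for ze02; the image name is an assumption and the ping is just to quantify the reported packet loss.]

    # measure the ipv6 packet loss first
    ping6 -c 100 ze02.openstack.org

    # step 1: hard reboot (may or may not land on a different hypervisor)
    openstack server reboot --hard ze02.openstack.org

    # step 2: if loss persists, rebuild the instance in place from a fresh image
    openstack server rebuild --image ubuntu-xenial ze02.openstack.org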
19:27:54 however I think the server had similar problems before we updated the kernel
19:28:02 true also, but the access (from cacti) has been bad for ages
19:28:02 worth noting we still have cacti01 in dns
19:28:25 fungi: oh, it is possible I didn't remove it
19:29:41 The other zuul item on the agenda is handling zk disconnects. Which I think should be fixed at this point, but wanted to double check on that before removing it from the agenda.
19:30:15 do we want to consider migrating to new zookeeper at PTG?
19:30:25 zk01 / zk02 / zk03
19:30:30 from nodepool.o.o
19:30:37 pabelanger: probably not a bad time to do it considering load will likely be low during ptg
19:30:40 how disruptive do we anticipate that being?
19:30:56 I believe we need to stop / start everything to pick up new configuration
19:31:09 it requires a full outage for nodepool because we can't expand/contract zk clusters i guess?
19:31:19 fungi: yes
19:31:22 yes, and zuul too
19:31:30 at least my reading of the docs is that you need a very new version of zk to do expansion safely
19:31:56 oh, i suppose zuul does need to know to connect to the new cluster as well yes ;)
19:32:09 Have we ever tested connecting to a cluster in the first place ?
19:32:12 process would be something like stop all services including zk, copy zk data files to 3 new servers. Start 3 new servers. Check quorum and data is present, start nodepool and zuul
19:32:24 dmsimard: not yet, but we likely should
19:32:26 I'm admittedly not knowledgeable about Zookeeper but maybe there are a few gotchas around connecting to a cluster
19:32:33 would be good to confirm replication is working correctly
19:33:05 For example, some client bindings require listing all the nodes in the configuration and the client decides where to go, with others it's handled server side, etc.
19:33:19 ya it's worth testing. We even have a cluster you can use :)
19:33:21 pabelanger: not just replication but actual usage/connection
19:33:31 dmsimard, pabelanger: I have connected to a cluster since day one ;)
19:33:39 oh
19:33:40 :)
19:33:44 tobiash wins. Can double check our config for us :P
19:33:45 tobiash: well isn't that awesome
19:33:59 a wild tobiash has appeared :D
19:34:14 didn't notice any problems related to this so far ;)
19:34:21 alright anything else related to zuul?
19:34:30 okay, so we still need to figure out how to move data between servers, but i can maybe look into that and start an etherpad
19:34:40 pabelanger: it's just file copies aiui
19:34:47 okay
19:34:53 pabelanger: so stop zk everywhere, copy the files out, start zk
19:35:15 we may even be able to test it on the new cluster by just copying what is on disk at any point in time from the existing server
19:35:43 (basically don't stop the source side)
19:35:52 clarkb: yah, that's what I'm thinking for the first test
19:36:19 #topic General Topics
19:36:28 ianw: gema aarch64 update time?
19:36:53 i got what would be testing-nodes building and booting
19:37:17 nice
19:37:20 it's a bit janky but gets there -> https://review.openstack.org/542591
19:38:08 ianw: the hw_firmware_type is not needed anymore, the patch was not fully applied everywhere, should be redundant
19:38:18 actually it's up if you have access @ root@211.148.24.197
19:38:19 ianw: but we can clean that up when we upgrade the clouds to queens
19:38:56 cool. so on my todo list is figuring out how to get the hwe kernel installed by default during the build ... it seems to work better, and just finishing off the efi stack
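[editor's note: a sketch of the "small element that checks arch and installs hwe if arm" idea that comes up just below; the element and file names are made up, and the metapackage shown is the xenial one.]

    #!/bin/bash
    # hypothetical DIB element script, e.g. elements/infra-hwe-kernel/install.d/75-hwe-kernel
    set -eu

    # only switch to the HWE kernel on arm64 images
    if [ "$(dpkg --print-architecture)" = "arm64" ]; then
        apt-get update
        apt-get install -y linux-generic-hwe-16.04
    fi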
19:39:11 the dib gate has been broken, so that took me on a journey of fixing jobs the last few days
19:39:19 ianw: if the HWE kernel works better on arm64 I can talk to the ubuntu folks to make it default on arm cloud images
19:39:43 ianw: worst case you should be able to just have a package install for it right?
19:39:49 ianw: we have a conversation pending for PTG, you can join if you'd like
19:39:53 small element that checks arch and installs hwe if arm
19:40:06 yep, there's always a way :)
19:40:17 ianw: this is awesome work, thank you
19:40:21 infra-needs-element :D
19:41:17 if we get this ppa going, i was also going to update puppet to make sure we can build the mirror automatically
19:41:29 much appreciated
19:41:31 and we should probably mirror ports.ubuntu
19:41:48 yah, that would be awesome
19:41:56 is that where ubuntu sticks non-x86 architecture builds?
19:42:15 yep, so we'd want to prune that to arm64
19:42:16 * fungi is too used to debian just putting all architectures in one place
19:43:57 so i think it's all moving forward ... if anyone wants to review the dib stack at https://review.openstack.org/#/c/539731/ feel free :)
19:44:14 clarkb: from our side we are moving hw to a new lab in london next week and after that we'll bring up a new cloud, with some resources for infra also
19:44:56 is this a second cloud or a replacement?
19:45:01 a second cloud
19:45:05 ianw: gema sounds like good progress, exciting!
19:45:13 we have another one in the US that needs updating and will also be added in due time
19:45:34 most awesome
19:45:39 corvus: the plan is to have the three clouds running queens during the coming months
19:45:47 and have resources for infra on the three of them
19:45:47 groovy!
19:46:13 Yay
19:46:26 * clarkb is now going to try and quickly get through a couple of things leaving us ~10 minutes to talk PTG
19:46:28 and our members are super excited and talking about contributing more hw
19:46:59 clarkb: sorry, go for it :D
19:47:02 I've switched jjb's core review group in Gerrit to be self managed as they are already operating mostly independently. I do not think this is at odds with them remaining as an infra project
19:47:15 if anyone has concerns about this let me know but the reality is they've been operating independently for some time
19:47:32 I did speak with the release team last week about project renames and the date I got back was March 16
19:47:52 cycle trailing releases should just about be done then. we will want to double check with the release team before taking gerrit down though
19:48:05 yeah, in my tenure as ptl i basically just acted on whatever recommendations their core team had for adding new core members
19:48:12 (jjb that is)
19:48:35 I don't believe I'll be around that week to help
19:48:47 we can sort out the process for renames after the PTG I expect
19:48:50 i'll be available
19:48:50 pabelanger: that's ok, I think several of us will be around; we should have enough people to do it
19:49:27 Alright PTG planning
19:49:27 that is the night before a major drinking holiday, but shouldn't pose any problems
19:49:38 fungi: ya it's the day after you have to worry about
19:49:43 ;)
19:49:58 Looks like we have a good collection of stuff for wed-fri
19:50:39 do we want to try and schedule that loosely, say put all the Zuul stuff on wednesday/thursday and the rest of it on friday? Or do it more unconference style and let people work through what they are interested in as they are able?
19:51:03 I'm not sure if one or the other helps or hurts people's ability to attend based on travel or other scheduling demands
19:51:39 i think attempting scheduling would be helpful
19:51:49 for the arm64 bit it'd help if we had a slot, so I know when we'll be discussing it. We will be splitting time between kolla and infra
19:51:59 I'll stay in the cold land of Montreal to help regular infra-root things along during North American timezones :)
19:52:28 ok that's two votes for scheduling. I'll work up a proposed schedule then. Probably put it in the etherpad then send mail to the infra list letting people know they should look it over
19:52:43 wfm
19:52:55 wfm
19:52:56 it probably won't be super specific and will instead be rough like wednesday morning vs friday afternoon
19:53:06 yeah, there are a few non-infra discussions i need to split my time between too, so knowing when would be convenient helps me
19:53:07 we still have an ethercalc, right?
19:53:17 corvus: oh good point I should use that
19:53:17 indeed we do!
19:53:59 there's no ptg-wide ethercalc integrated with ptgbot any longer, but that doesn't preclude us from having our own spreadsheet for this
19:54:08 thinking about it that might be a good way to do "appointments" with the help days too
19:54:21 just block out each hour of the day monday/tuesday and let projects sign up
19:54:30 great idea
19:54:49 ++
19:55:07 ok I will work on getting that set up
19:55:37 and the last thing is do we want to try and do a team dinner (I think we should; I've just been sick and otherwise distracted by family obligations so haven't had much time to look into it)
19:55:47 worst case I imagine we can roll into a pub and sit at a couple tables
19:56:26 but maybe a quick poll of what night works for people would be good too
19:56:31 i like eating with people
19:56:35 also drinking
19:56:42 pabelanger: ^ were you maybe still interested in that?
19:56:54 sure, I can look into it more
19:57:09 get an etherpad up for people to vote on a date/time
19:57:21 #topic open discussion
19:57:21 dublin has a preponderance of venues where a large group can just wander in and get served. we may have to spread out across multiple tables, but i doubt we'll have trouble finding somewhere
19:57:28 few minutes for anything else that we've missed
19:57:41 fungi: good to know thanks
19:57:45 One request: I have two stacks up that together will remove /usr/local/jenkins from our nodes; please review and comment on naming/location. https://review.openstack.org/541868 and https://review.openstack.org/543139/
19:58:39 https://review.openstack.org/542422/ is something nodepool related to help better manage our min-ready nodes. If people are maybe interested in reviewing
19:58:41 AJaeger: why don't we use the ansible script module ?
19:59:29 corvus: where exactly? Happy to make changes...
19:59:43 pabelanger: min-ready: 0 is now the default, but we may need launcher restarts to pick that up
20:00:10 corvus: k, will update
20:00:13 corvus: let's continue in #openstack-infra
20:00:28 AJaeger: well, it's more of a question of approach -- rather than have jobs which expect scripts to already exist on a node, the ansible 'script' module copies and runs the scripts
20:01:11 ya discussion will have to move to -infra. We are out of time! thank you everyone.
20:01:17 thanks!
20:01:18 #endmeeting
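[editor's note: the ansible 'script' module point at the end, illustrated with an ad-hoc invocation; the inventory and script path are placeholders. The module copies the local script to the node and runs it there, so nothing needs to be pre-baked into the image under /usr/local/jenkins.]

    # copy a local script to each node and execute it there
    ansible all -i inventory.ini -m script -a "tools/test-setup.sh"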