19:02:50 #startmeeting infra
19:02:51 Meeting started Tue May 23 19:02:50 2017 UTC and is due to finish in 60 minutes. The chair is fungi. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:02:52 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:02:55 The meeting name has been set to 'infra'
19:03:00 #link https://wiki.openstack.org/wiki/Meetings/InfraTeamMeeting#Agenda_for_next_meeting
19:03:07 #topic Announcements
19:03:18 #info Many thanks to AJaeger (Andreas Jaeger) for agreeing to take on core reviewer duties for the infra-manual repo!
19:03:30 #info Many thanks to SpamapS (Clint Byrum) for agreeing to take on core reviewer duties for the nodepool and zuul repos!
19:03:42 as always, feel free to hit me up with announcements you want included in future meetings
19:03:51 #topic Actions from last meeting
19:04:03 #link http://eavesdrop.openstack.org/meetings/infra/2017/infra.2017-05-02-19.05.html Minutes from last meeting
19:04:14 pabelanger Open an Ubuntu SRU for bug 1251495
19:04:15 bug 1251495 in mailman (Ubuntu Trusty) "Lists with topics enabled can throw unexpected keyword argument 'Delete' exception." [High,Triaged] https://launchpad.net/bugs/1251495
19:04:23 guessing this one's still pending, not seeing one there yet
19:04:45 fungi: yes, at this point feel free to take it off the action list; I am trying to work with the Ubuntu community to get it done
19:04:56 but since the layoffs it has been difficult finding people
19:05:17 was going to reach out to a few openstack ubuntu members and see how best to move forward
19:05:24 okay, no sweat. we had a fallback plan for now anyway, i think?
19:05:29 I just don't want to open up an SRU bug and leave it to bitrot
19:05:44 ya, we are running a manual patch on lists.o.o today
19:05:55 as a fallback, we could do our own PPA if needed
19:06:02 or move to xenial
19:06:05 and i think it's fixed in xenial
19:06:12 yes, xenial is okay
19:06:15 does that get reverted if unattended-upgrades updates the mailman package on us (for a security fix or the like)?
19:06:29 fungi: I believe so
19:06:29 yes
19:06:46 okay, so this steps up the priority to update the listserv from trusty to xenial i guess
19:06:51 fungi: ++
19:07:28 i'm on board to reprise my role from last time
19:07:51 I guess security fixes bypass the sru process, ya?
19:07:54 jeblair: want someone else to draft the server upgrade plan this time?
19:07:56 cool, upgrade to xenial is likely our best option I think
19:08:07 (eg it is possible for a patch like that to come in that doesn't include pabelanger's patch to fix the bug)
19:08:09 clarkb: yes, it looks that way
19:08:50 fungi: either way; given my etherpad from last time, it's probably not too hard for someone else to do it. if no one else wants to, i can.
19:09:15 i'm resisting the temptation to volunteer myself for yet one more thing... any takers?
19:10:13 I can give the upgrade a shot, although will likely have many questions.
19:10:30 Unless this would be something better for an infra-core to handle.
19:10:48 bkero: want to start by adapting jeblair's previous etherpad contents?
19:11:14 #link https://etherpad.openstack.org/p/lists.o.o-trusty-upgrade
19:11:23 fungi: Sure
19:11:29 ya, some things like the ext4 upgrade won't be necessary
19:11:44 #action bkero draft an upgrade plan for lists.o.o to xenial
19:11:48 do we want to give bkero a snapshot of lists.o.o to work from?
19:12:18 i would have no problem snapshotting the current server and adding an account/sshkey for him on that
19:12:47 wfm
19:12:56 I can help get a snapshot done since I did that last time
19:13:09 we'll obviously want to follow our earlier disable/enable steps around the snapshot creation so the snapshot isn't booted spewing duplicate e-mails
19:13:11 jeblair: we want to doctor the base image before doing that, right?
19:13:16 to disable services?
19:13:21 ya, that
19:13:25 ++
19:13:40 though likely not today, as I am going to try and get shade out the door and nodepool upgraded so that citycloud can run multinode jobs
19:13:46 but tomorrow I can likely do this
19:14:06 Sounds good. I'll have time to tackle this tomorrow and Thursday evening.
19:14:07 yes. we're going to get that out the door if it kills me
19:14:26 (we got a bug report this morning that I'm squeezing the fix in for because I'm a bad person)
19:15:37 Bad mordred! Bad dog!
19:15:43 * mordred hides
19:15:46 mordred: always trying to get the fix in
19:16:07 mordred: and just approved the last of those
19:16:11 Shrews: thank you
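For reference, the snapshot step discussed above could be driven with shade; this is a minimal sketch only, in which the clouds.yaml entry name and snapshot name are illustrative, and it assumes the list services have already been stopped on the server per the disable/enable steps mentioned in the discussion.

    # Rough sketch: snapshot lists.o.o so there is an image to test the
    # xenial upgrade against. The cloud name 'openstackci-rax' and the
    # snapshot name are illustrative placeholders; mailman/exim are assumed
    # stopped first so a booted copy can't spew duplicate e-mails.
    import shade

    cloud = shade.openstack_cloud(cloud='openstackci-rax')
    server = cloud.get_server('lists.openstack.org')
    snapshot = cloud.create_image_snapshot(
        'lists.o.o-trusty-pre-xenial', server, wait=True, timeout=7200)
    print(snapshot['id'])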
19:17:07 clarkb to add citycloud to nodepool
19:17:09 #link http://grafana.openstack.org/dashboard/db/nodepool-city-cloud City Cloud nodepool utilization graphs
19:17:14 (and there was much rejoicing)
19:17:18 Yay
19:17:55 there are/were two hiccups with this
19:18:04 one minor but annoying issue still outstanding in one of their regions which looks like a bug/misconfiguration at this point i guess
19:18:06 first one was a missing flavor in the La1 region which I sent email about and they corrected
19:19:06 the second is we sometimes get multiple private IP addrs assigned to instances, which breaks in two different ways. The first is if nodepool writes the non-working private ip to the private ip list on the instance, then multinode breaks. The shade stuff I mention above should address this by using the private ip address associated with the floating IP address
19:19:26 the second way this breaks is if the floating IP address is attached to the non-working private IP, then nodepool fails to ssh in and deletes the node and tries again
19:19:42 which is probably accounting for most of the hits on that error node launch attempts graph too, i would assume
19:19:44 I've sent email to citycloud with example instances and info on this in hopes they can track down why this is happening
19:19:53 pabelanger helped track down the second way this breaks, so thanks
19:20:01 fungi: ya
19:20:16 2.1 and 2.2 both seem like either one or two openstack bugs, yes? i guess we're expecting them to confirm that?
19:20:18 Ya, nodepool debug script wins again
19:20:39 jeblair: I think it's the same underlying bug in openstack or their deployment, yes. Hoping they can fix/confirm
19:21:13 does shade pick the private IP to attach the FIP to?
19:21:18 or is that openstack
19:21:28 pabelanger: yes
19:21:35 pabelanger: (it depends)
19:21:45 being a private IP though, shade has no way of knowing which one is "correct"
19:21:59 right
19:22:01 so there isn't much it can do there other than assume the floating IP one will work
19:22:01 pabelanger: if there is only one fixed ip on a server, openstack picks. if there is more than one, the user (or shade) has to tell openstack which to use
19:22:13 there are _some_ ways to infer the correct one, which shade does
19:22:22 but if those don't work there is an occ config option the user can set
19:22:28 or be explicit in the create_server call
19:22:37 in this case they are both on the same network though
19:22:59 I wonder if it is always the 2nd private IP that ends up working
19:23:02 so the only thing differentiating them is the ip address, and that's a toss-up without actually seeing which got dhcp'd on the VM
19:23:20 pabelanger: it could be, though I don't know how they are ordered, if at all
19:23:31 clarkb: but so far the one with the same mac as the fip seems to be correct, right?
19:23:56 mordred: for the multinode job case yes, because in order to get that far the fip had to work
19:23:57 the fact that the first one has a different mac from the mac on the server is the thing that makes me think something is extra broke
19:24:19 mordred: but we have ssh failures in nodepool that pabelanger has tracked back to the fip being attached to the wrong private ip
19:24:28 clarkb: oh - ah - right
19:24:42 clarkb: ya, it seems when the FIP attaches to the 2nd private IP, it works
19:24:42 also we don't get 2 private IPs on every server
19:24:44 so yah - nothing we can do about that shade-side
19:24:55 all the working servers now are on private IP2
19:24:58 it's all very weird, hoping the cloud can clarify
19:25:02 pabelanger: interesting
19:25:25 and it looked like they were always adjacent addresses (at least in the examples i saw)
19:25:37 anyways, we don't have to spend much more time on this. Cloud is in use; once servers boot and nodepool sshes in we should be fine (especially after shade is released)
19:25:44 ++
19:25:50 clarkb: thanks, and thanks for the update
19:25:52 thanks clarkb, pabelanger!
19:25:54 pabelanger: ^
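To make the "be explicit in the create_server call" option above concrete, here is a minimal sketch; the nat_destination argument name is recalled from memory and should be treated as an assumption, and the cloud, region, image, flavor, and network names are illustrative. As noted in the discussion, this hint picks between networks, so it does not disambiguate two fixed IPs on the same network, which is the citycloud situation.

    # Sketch of pinning which network's fixed IP gets the floating IP when
    # a server has ports on more than one network. Names are illustrative;
    # the same hint can reportedly be set per-network in occ's clouds.yaml.
    import shade

    cloud = shade.openstack_cloud(cloud='citycloud', region_name='La1')
    server = cloud.create_server(
        name='fip-test',
        image=cloud.get_image('Ubuntu 16.04'),        # illustrative image name
        flavor=cloud.get_flavor('8C-8GB'),             # illustrative flavor name
        auto_ip=True,
        nat_destination='internal-net',  # network whose fixed IP should get the FIP
        wait=True)
    print(server['interface_ip'])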
19:26:12 #topic Specs approval: PROPOSED Add nodepool drivers spec (jeblair)
19:26:16 #link https://review.openstack.org/461509 "Nodepool Drivers" spec proposal
19:26:58 i proposed a thing
19:27:05 this is pretty high level
19:27:14 looks like it's gone through some review/iteration at this point
19:27:29 basically, some folks showed up and wanted to start working on the static node support in nodepool
19:27:57 this spec lays out an approach for doing that as well as laying the groundwork for future expansion for other non-openstack providers
19:28:21 i think we've circulated it around the folks interested in that area, so i think it's ready for a vote
19:29:19 awesome
19:30:07 #info The "Nodepool Drivers" spec is open for Infra Council voting until 19:00 UTC Thursday, May 25
19:30:18 that cool?
19:30:40 it being that late in the year is not cool
19:30:42 cool. cool.
19:30:47 where did the last 5 months go
19:31:01 clarkb: yeah, i don't know where the first half of the year went
19:31:24 thanks jeblair! this will be awesome to have working
19:31:43 #topic Priority Efforts
19:32:03 nothing called out specifically here, though the spec above is related to the zuulv3 work
19:32:23 #topic Old general ML archive import (fungi)
19:32:27 #link https://etherpad.openstack.org/p/lists.o.o-openstack-archive-import Mainte
19:32:35 #undo
19:32:36 Removing item from minutes: #link https://etherpad.openstack.org/p/lists.o.o-openstack-archive-import
19:32:47 #link https://etherpad.openstack.org/p/lists.o.o-openstack-archive-import Maintenance plan for old general ML archive import
19:32:59 (not sure where that stray newline came from)
19:33:26 anyway, a repeat from about a month ago when i said i'd punt this maintenance until after the summit
19:34:23 we're in the middle of a few dead weeks in the release schedule
19:34:30 lgtm. i think this is fine to do either before or after the xenial upgrade. just not during. :)
19:34:33 #link https://releases.openstack.org/pike/schedule.html Pike Release Schedule
19:34:50 so, yeah, this seems like a good time to go ahead with it
19:34:59 just means some (brief) downtime for the listserv
19:35:11 count me in as standby help.
19:35:23 and to have some volunteers on hand to visually inspect the archive before we start allowing new messages into the list
19:35:31 thanks jeblair!
19:36:06 i'm probably fine doing it late utc on friday (20:00 utc or later)
19:36:24 have an appointment on the mainland earlier in the day so won't be around until then
19:36:25 should be able to help also
19:36:36 ok, I'll be around on friday as well
19:36:53 2000 fri wfm
19:37:01 awesome, i'll send an announcement after the tc meeting
19:37:39 #info The mailman services on lists.openstack.org will be offline for about an hour on Friday, May 26 starting at 20:00 UTC
19:38:19 refreshing the agenda and seeing no other last-minute additions...
19:38:22 #topic Open discussion
19:39:23 don't all talk at once now ;)
19:39:27 nb03.o.o is online (and xenial). However, we are at volume quota for vexxhost; waiting for feedback from vexxhost on how to proceed
19:39:33 osic is running at a max-servers of zero right now
19:39:53 we are told that we should get access to the cloud once dns and ssl are sorted
19:39:53 folks should expect to see another zuulv3 email update soon
19:40:26 pabelanger: how is volume quota related?
19:40:48 pabelanger: (did you mean image quota?)
19:40:57 we have a 1tb cinder volume attached to nb01 and nb02
19:41:00 we only have a 200GB HDD for the server, we usually mount a 1TB volume for diskimage-builder
19:41:03 at /opt
19:41:03 ooooh got it
19:41:40 since we have multiple copies of images and multiple images and multiple formats, that all adds up
19:41:45 also the scratch space and cache for dib
19:41:48 we could _probably_ get by with 0.5tb in there... nb01 is using 261gb and nb02 185gb
19:42:11 ya, since the feature/zuulv3 branch, the storage use has been much lower
19:42:27 mostly because we are not leaking things :)
19:42:41 and we've split the image storage across multiple nodes
19:42:47 that too
19:43:13 also we stopped keeping raw images
19:43:20 so lots of good improvements
19:43:59 so maybe 0.5tb is plenty, but regardless we have little/no volume quota in that tenant currently?
19:44:28 ya, we are over quota atm
19:44:54 what else do we have in there besides planet01?
19:45:08 old mirror
19:45:10 for nodepool
19:45:14 (which isn't using cinder afaict)
19:45:22 it may actually be using it
19:45:30 it's possible you could remove the mirror and its volume to reclaim that quota
19:45:32 i meant planet01 isn't
19:45:39 but yeah, we can delete that mirror server
19:46:04 since we stopped trying to use vexxhost for nodepool nodes
19:46:06 k
19:46:20 I was using it to debug some apache proxy caching stuff
19:46:25 and hopefully that frees you up to push forward on the nb03 build
19:46:42 oh, well you could also just unmount and detach the cinder volume in that case
19:46:44 pabelanger: we can always redeploy one without a volume for that sort of testing
19:46:48 or that
19:46:56 the issue is planet is using a 200GB volume
19:46:59 and then just delete the volume, not the server
19:47:00 and our quota is 100GB
19:47:16 sorry, have to run openstack commands again to confirm
19:47:30 VolumeSizeExceedsAvailableQuota: Requested volume or snapshot exceeds allowed gigabytes quota. Requested 1024G, quota is 1000G and 200G has been consumed. (HTTP 413) (Request-ID: req-2ea6e088-f9cd-4e71-b635-901ee212f7f8)
19:47:46 yeah, that's a 200gb volume on the mirror server
19:49:03 sorry, I am not sure what our quota is
19:49:15 /faceplam
19:49:19 100GB :)
19:49:26 so, we could do a 500GB volume
19:49:32 1000*
19:49:39 I am going to stop typing now
19:49:43 yeah, that ought to be plenty for now
19:50:31 yup, 500GB seems fine.
19:51:06 k, I'll do 500GB
19:51:38 and as for the other 200gb, confirmed as suspected:
19:51:40 | 17eeb39d-c6de-4e32-8c08-26cf3592a22c | mirror.ca-ymq-1.vexxhost.openstack.org/main02 | in-use | 200 | Attached to mirror.ca-ymq-1.vexxhost.openstack.org on /dev/vdc |
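The volume shuffle agreed above could look roughly like the sketch below, again using shade; the clouds.yaml entry name, the new volume's name, and the exact argument names are assumptions, and dropping the mirror's volume presumes the mirror server has been deleted or the volume detached first.

    # Sketch: reclaim the old mirror's 200GB volume, then create a 500GB
    # volume for nb03 and attach it (to be mounted at /opt for dib).
    import shade

    cloud = shade.openstack_cloud(cloud='vexxhost', region_name='ca-ymq-1')

    # See what is currently counting against the gigabytes quota.
    for vol in cloud.list_volumes():
        print(vol['name'], vol['size'], vol['status'])

    # Free the 200GB once the mirror server is gone or the volume detached.
    cloud.delete_volume('mirror.ca-ymq-1.vexxhost.openstack.org/main02',
                        wait=True)

    # 500GB should be plenty now that builders no longer keep raw copies.
    volume = cloud.create_volume(size=500, name='nb03.openstack.org/main01',
                                 wait=True)
    cloud.attach_volume(cloud.get_server('nb03.openstack.org'), volume,
                        wait=True)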
19:55:35 okay, well if there's nothing else, that concludes this week's installment!
19:55:38 thanks everyone
19:55:46 #endmeeting