19:01:05 #startmeeting infra
19:01:05 Meeting started Tue Feb 10 19:01:05 2015 UTC and is due to finish in 60 minutes. The chair is jeblair. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:06 o/
19:01:07 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:10 The meeting name has been set to 'infra'
19:01:13 present
19:01:17 o/
19:01:19 Howdy
19:01:20 o/
19:01:29 #link agenda https://wiki.openstack.org/wiki/Meetings/InfraTeamMeeting#Agenda_for_next_meeting
19:01:30 o/
19:01:30 o/
19:01:38 o/
19:01:40 o/
19:01:40 #link previous meeting http://eavesdrop.openstack.org/meetings/infra/2015/infra.2015-02-03-19.01.html
19:01:42 O/
19:01:49 #topic Actions from last meeting
19:01:56 pleia2 draft summary email about virtual sprints
19:01:57 o/
19:02:00 and she sent it too!
19:02:05 I saw it
19:02:07 it was a great read
19:02:08 it was very good
19:02:20 pleia2: thank you!
19:02:21 +1
19:02:31 ttx's response was good too, these sprints are good for well-defined tasks
19:02:42 yes
19:02:55 openstack needs more well-defined tasks
19:02:55 which I think some teams struggle with, I think the friday hack day at summit really helped us solidify some things
19:03:29 some teams struggle with them because it isn't until they all get in the same room that they finally realize they agree
19:03:39 when for 4 months they were convinced they didn't
19:04:20 #topic New infra-core team member
19:04:41 so much suspense
19:04:41 pleia2 will find that she has some extra buttons in gerrit now :)
19:04:47 woooo
19:04:53 congratulations pleia2
19:04:53 thanks everyone!
19:04:56 congrats!
19:04:57 gratz!
19:04:59 I'll try not to break openstack
19:05:06 pleia2: that part's covered
19:05:07 just fix whatever you break
19:05:21 it's a good day if i don't fix more than i break
19:05:26 er, if i do
19:05:29 something
19:05:30 fungi: hehe, noted
19:05:31 :)
19:05:37 pleia2: congrats :-)
19:05:51 pleia2: so now, you get to propose your own addition to infra-root by changing some stuff in puppet
19:05:58 pleia2: and to be honest, i don't even know where that lives anymore
19:06:00 pleia2: so good luck! :)
19:06:07 hahaha
19:06:07 left as an exercise for the reader
19:06:07 fungi pointed me in the right direction earlier
19:06:40 so I'll take care of that soon, thanks
19:06:47 thank you!
19:06:55 o/
19:06:57 #topic Priority Efforts (Swift logs)
19:06:59 (sorry late)
19:07:09 o/
19:07:23 So logs are ticking along.. I think our current challenge is looking into why they take ~5-10min for devstack logs
19:07:34 it's most likely bandwidth
19:07:41 possibly when coming from hpcloud
19:07:57 might be good to compare some samples between providers
19:08:12 but other than that, I think we can start to move some other jobs across
19:08:18 especially ones with smaller log sets
19:08:27 do we have any of those?
19:08:40 but even when using scp, the data goes from node(hpcloud) -> jenkins master(rax) -> static.o.o(rax)
19:08:46 well, smaller than devstack isn't hard, as you don't have all the various service logs
19:08:57 fair enough
19:09:11 hi
19:09:14 jeblair: to be honest, I haven't compared how long the scp takes
19:09:21 probably something worth poking at
19:09:25 yeah
19:09:46 o/
19:09:57 so a couple of actions for me there (compare times + move more jobs over)
19:10:06 #action jhesketh look into log copying times
19:10:11 #action jhesketh move more jobs over
19:10:17 cheers :-)
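[Editorial note: a minimal sketch of the kind of timing comparison jhesketh's action item describes, uploading the same log set via swift and via scp so numbers can be compared per provider. It assumes the python-swiftclient CLI is installed and authenticated via the usual OS_* environment variables; the log directory, container name, and scp destination below are placeholders, not the actual job configuration.]

```python
#!/usr/bin/env python
# Rough timing harness for comparing log copy methods (sketch only).
# Assumes the "swift" CLI from python-swiftclient is installed and
# authenticated via OS_* environment variables. Names below are placeholders.
import subprocess
import time

LOG_DIR = "/opt/stack/logs"          # sample devstack log set (placeholder)
SWIFT_CONTAINER = "logs-test"        # placeholder container name
SCP_DEST = "jenkins@static.openstack.org:/srv/static/logs/test/"  # placeholder

def timed(cmd):
    """Run a command and return its wall-clock duration in seconds."""
    start = time.monotonic()
    subprocess.run(cmd, check=True)
    return time.monotonic() - start

if __name__ == "__main__":
    swift_secs = timed(["swift", "upload", SWIFT_CONTAINER, LOG_DIR])
    scp_secs = timed(["scp", "-r", LOG_DIR, SCP_DEST])
    print("swift upload: %.1fs, scp: %.1fs" % (swift_secs, scp_secs))
```

Run once from an hpcloud node and once from a rax node against the same log set to see whether the 5-10 minute uploads really are a bandwidth/provider issue.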
19:10:29 jhesketh: are we still doing scp + swift?
19:10:41 if so, are we ready to remove scp from any?
19:10:55 jeblair: for devstack, yes. I think it's only turned off for a set of project-config jobs
19:11:31 okay, so we can probably keep doing that for a little bit until we're happy with the timing
19:11:39 jeblair: I suspect so, but maybe we need to get somebody who works closely with devstack logs to do some user acceptance testing?
19:11:54 (eg sdague or jogo)
19:12:02 jhesketh: aren't we doing swift-first already?
19:12:05 * sdague pops up?
19:12:13 jhesketh: so, in other words, acceptance testing is already in progress? :)
19:12:15 jeblair: nope, disk-first
19:12:18 oooh
19:12:20 ok
19:12:30 which is dictated by apache serving its indexes
19:12:54 jhesketh: can you dig up a log set for sdague to look at?
19:13:05 yep, happy to
19:13:43 #action sdague look at devstack swift logs for usability
19:14:21 jhesketh, sdague: thanks
19:14:26 #topic Priority Efforts (Nodepool DIB)
19:14:48 stabstabstab
19:14:53 SO
19:14:59 I now have ubuntu working
19:15:31 am battling centos - not because we need centos - but because it's a thing we have in servers that uses systemd and I figure we should solve systemd before declaring victory
19:15:32 will do, thanks sdague
19:16:16 although it turns out that centos7 has a) an old version of systemd and b) not a consistent systemd
19:16:19 * mordred cries
19:16:58 * greghaynes hands mordred a fedora
19:17:31 mordred: if you want to sync me up with some of the details later, i can help out
19:17:33 anywho - I'm expecting to have that all sorted today so that I can go back to making the nodepool patch
19:17:37 ianw: oooh
19:17:52 ianw: I will do that
19:17:58 ianw: I'm assuming you grok all the systemds
19:18:44 i haven't set aside time to get very far with collapsing bare and devstack image types together nor job run-time database configuration so we can stop needing to have a dib solution for that part. hopefully later this week will be better than last was
19:19:38 #action mordred fix the systemd problem
19:19:38 optimist
19:19:41 (ha!)
19:19:59 #action fungi collapse image types
19:20:11 optimism all around!
19:20:18 :)
19:20:27 anything else nodepool dibby?
19:20:46 uhm ... things that don't do DHCP are bonghits?
19:20:51 clarkb may have things, but he's occupied
19:21:05 just my bugfix for image update change
19:21:07 any reviews need attention?
19:21:15 accommodates jeblair's image build fix
19:22:02 there are still outstanding reviews for f21 image builds
19:22:14 https://review.openstack.org/140901
19:22:25 https://review.openstack.org/138250
19:22:35 both were going in but hit merge conflicts
19:23:05 clarkb: i'm not sure which you're talking about?
19:23:16 oh
19:23:18 #link https://review.openstack.org/#/c/151749/
19:23:19 that one?
19:23:32 "Better image checking in update_image command"
19:23:41 yes
19:23:43 thanks
19:24:03 asselin_: had a comment on that
19:24:10 but yeah, we should take a look at that one
19:24:31 o/
19:24:57 #topic Priority Efforts ( Migration to Zanata )
19:25:04 * anteaya admires asselin_'s useful reviews
19:25:28 anteaya, thanks
19:25:55 so mrmartin has been helping me get my module into shape
19:26:03 https://review.openstack.org/#/c/147947/
19:26:25 helping dependencies make more sense (depending on services vs files for installation) and doing tests in vagrant
19:26:26 needs some work on zanata puppet modules, I set this up in vagrant, and had some dep problems that pleia2 solved
19:27:10 pleia2: doing local testing in vagrant?
19:27:19 I'll allocate some time this week and try to find out why the wildfly zanata app deployment fails
19:27:32 anteaya: mrmartin is, I'm using some snapshotted VMs
19:27:51 why do you need vagrant if you are using vms?
19:28:02 anteaya: we're both testing in our own ways
19:28:05 sorry if this was discussed before and I missed it in backscroll
19:28:27 he's using vagrant, I'm using VMs
19:28:33 oh sorry
19:28:34 double-check
19:28:36 :)
19:28:56 vagrant launching vm(s) anyway.
19:29:14 #link https://review.openstack.org/#/c/147947/
19:29:23 so progress is being made, not as fast as I'd like, but java is clunky
19:29:54 pleia2, mrmartin: groovy, thanks!
19:30:06 #topic Upgrading Gerrit (zaro)
19:30:46 ok. i think i have managed to fix the testing stack of review-dev, zuul-dev, and jenkins-dev
19:31:00 all things working now, so will be easier to test
19:31:16 review-dev.o.o is on trusty and on Gerrit 2.9.4
19:31:26 db is still a problem?
19:31:30 So if anybody wants to test anything do it there.
19:31:55 clarkb: yes, the issue about db disconnect is still a problem. but it's also in prod
19:32:02 zaro: did you find that we need 2.10 for wip plugin?
19:32:18 well, at least, we can't prove that it isn't in prod
19:32:32 and when we run all the same versions of things in dev, it happens
19:32:41 but of course the trove db server is actually different
19:32:49 jeblair: no, wip plugin will be a ways out. it needs fixes from master which won't show up until 2.11
19:32:54 right, zaro was able to reproduce the problem with the ubuntu and gerrit versions we're running in prod
19:33:01 #info WIP requires >= 2.11
19:33:06 #undo
19:33:07 Removing item from minutes:
19:33:13 #info WIP plugin requires >= 2.11
19:33:32 clarkb: though it does seem suspiciously similar to the db problem we were seeing with paste.o.o
19:33:37 fungi: ya
19:33:50 I think the trove dbs are partially to blame
19:34:00 * anteaya clicks the review-dev storyboard link in the commit message
19:34:25 i verified that the mysql timeout values are the default on the review-dev trove instance
19:34:32 so it's at least not that kind of misconfiguration
19:34:50 so i just finished validating zuul pipelines and the launchpad integration.
19:35:30 will probably see if there's anything to check in Gerrit ACLs next.
19:35:37 zaro: I don't see the ability to change the topic in the gui
19:36:09 zaro: would you be willing to spend some time with the trove folks and see if there's something they can do?
19:36:27 jeblair: yes, most definitely
19:36:33 anteaya: i thought we discovered that only change owners and gerrit admins could do in-ui topic edits?
19:36:37 i'll ask them to help debug
19:36:44 zaro: cool, thanks
19:36:44 fungi: perhaps I don't have the permissions then
19:36:47 iccha works on trove at rax now - might be a good contact too
19:37:02 anteaya: i don't think that's available in old screen UI, not even on review.o.o
19:37:05 mordred: not any more
19:37:05 #action zaro to chat with trove folks about review-dev db problems
19:37:12 anteaya: oh! well, don't listen to me
19:37:20 mordred: rax decided they can only work on trove in their own time
19:37:30 mordred: so only when she has time after work now
19:38:16 i'm going to test prod db migration next.
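[Editorial note: a minimal sketch of the timeout check mentioned above, confirming that the MySQL timeout variables on a trove instance are still at their defaults. PyMySQL and the connection details are assumptions; the host, user, and database names are placeholders, not the real review-dev settings, and the actual debugging goes through the trove folks as actioned.]

```python
#!/usr/bin/env python
# Sketch: dump MySQL timeout-related variables on a trove instance to
# confirm they are at their defaults. Host and credentials are placeholders.
import pymysql

conn = pymysql.connect(
    host="review-dev-db.example.org",   # placeholder trove endpoint
    user="gerrit",                       # placeholder
    password="secret",                   # placeholder
    database="reviewdb",                 # placeholder
)
try:
    with conn.cursor() as cur:
        cur.execute("SHOW VARIABLES LIKE '%timeout%'")
        for name, value in cur.fetchall():
            print("%s = %s" % (name, value))
        # MySQL's defaults are 28800 seconds (8 hours) for both
        # wait_timeout and interactive_timeout.
finally:
    conn.close()
```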
19:39:00 So about moving review.o.o to trusty?
19:39:19 anybody against that? if not should we schedule something?
19:39:20 zaro: we need to do that to upgrade?
19:39:25 yes
19:39:40 no opposition from me on that
19:39:48 sounds good to me
19:39:49 I'm for scheduling something the week before summit like we did last year
19:40:04 or will this be less involved?
19:40:04 anteaya: this is why #link https://review.openstack.org/#/c/151368/
19:40:07 zaro: so you want to do os upgrade first, then gerrit upgrade? or both together?
19:40:34 bouncy castle again
19:40:36 best to do OS upgrade first
19:41:22 right, so we didn't get that far with this last week, but let's try again
19:41:39 nothing before feb 28
19:41:53 I know one problem is every time we change IP corps need to update firewall rules
19:42:02 we still can't floating ip in rax, right?
19:42:41 how is feb 28, mar 7, mar 21?
19:42:48 also, see: https://wiki.openstack.org/wiki/Kilo_Release_Schedule
19:43:24 can I vote for may 7th?
19:43:37 jeblair: all should work for me
19:43:37 I'm around all those days
19:44:13 clarkb: that sounds to me like a reason to do this ~monthly
19:44:17 fungi: ++
19:44:32 maybe they'd learn that outbound blocking is crazypants
19:44:33 eventually they'll get tired of having to maintain special egress rules for that port
19:44:39 ha
19:44:54 and https does work fwiw
19:44:55 all are good with me as well. i'm partial to feb 28.
19:45:03 to anteaya's point, do we want to wait until after the release?
19:45:10 anteaya: it should be very low impact
19:45:24 "should" and "is" can be miles apart
19:45:24 spin up new node side by side, easy switch, easy fallback
19:45:47 if there is a compelling reason to do it before may I'm all ears
19:45:48 anteaya: well I have done this before and it was easy (lucid to precise)
19:46:03 no blocking from corp firewalls?
19:46:15 no contributors unable to work?
19:46:25 happy to be wrong
19:46:26 anteaya: we will announce it well in advance, with the new ip.
19:46:30 there will likely be blocking on port 29418; they can use https
19:46:43 okay if I am in the minority so be it
19:46:49 the biggest issue is probably going to be finding a slow enough week at this point in the cycle that having several hours of downtime won't severely impact development momentum so we can do the maintenance carefully and if necessary roll back
19:47:01 fungi: yes
19:47:44 * krtaylor wonders if it will impact some third party ci systems
19:47:48 so this feb 28 is the saturday before feature proposal freeze. the week following is likely to be busy
19:48:04 clarkb, zuul doesn't https
19:48:10 i'm not certain that's a reason not to do it.
19:48:13 krtaylor: some may need a restart to reconnect to redo dns resolution and reconnect to the new ip address, yes
19:48:40 * fungi had redundant words in that last sentence
19:48:48 we'll need to have firewalls updated for thirdparty ci
19:49:06 possibly, yes
19:49:20 so we should spread word of the event far and wide
19:49:24 asselin_: or use a proxy
19:49:32 and there will still be questions :)
19:49:35 krtaylor: we always do :)
19:49:40 ohh, i forgot. the Toggle CI button doesn't work. is anyone willing to take a look at that? i've already taken a quick look but i don't know js so it's not apparent to me how it even works.
19:50:01 jeblair: for sake of information - I believe last time we swapped it took between 1 and 2 months to get the egress rules changed at HP
19:50:11 clarkb, not sure how. last time nothing worked for zuul. this was last summer.
19:50:14 mordred: i hope it goes faster this time.
19:50:20 jeblair: I do not believe that's possible
19:50:32 mordred, and they fat-fingered the rule for my site, so it took even longer
19:50:46 asselin_: you'll only need firewalls updated if your firewalls are for some reason configured to block _outbound_ connections from your systems
19:50:48 jeblair: it's a change to global security rules which goes through an IT process
19:50:49 to be clear, egress filtering is a bad idea. it's particularly bad for systems that rely on connecting to a system that runs in _a public cloud_
19:50:55 yes. this is all true
19:51:06 fungi, yes, we're blocked on outbound :(
19:51:13 so, i think what we can do is try to disseminate the information as soon as possible
19:51:14 I'm merely reporting on the state of the world for at least one of our constituencies
19:51:19 mordred: thank you
19:51:19 keep the old instance and ip and redirect the traffic with haproxy to the new one
19:51:31 so we don't need to change the ip
19:51:47 or you can keep it as a backup
19:51:58 but we cannot let this be a blocker
19:52:00 asselin_: you would likely need to do a port forward through a SOCKS proxy
19:52:05 asselin_: it should just work once you get it set up
19:52:19 jeblair, agreed, speaking for my system anyway, anytime is as bad as any other
19:52:21 mrmartin: that may cause its own problems and greatly increase the complexity
19:52:27 mrmartin: no, then we still have an old precise box around
19:52:36 this is where we all talk about some kind of ha proxy as a service thingy
19:52:37 and it increases the number of places where things can break as jeblair points out
19:52:46 mrmartin: if we do that, we'll either end up maintaining it indefinitely or ~50% of the people who are going to be impacted simply won't find out until we eventually take down the proxy
19:52:53 tchaypo: if only such proxy services were configurable in ways that made them useful :)
19:53:14 ok, but you can give a 2 month grace period, and everybody can migrate
19:53:57 poor-man's floating ip
19:54:20 on the up side, it only increases places where things can break for people who are stuck behind egress filters managed by people who need far too long to update them
19:54:23 so who's around on feb 28?
19:54:29 I can be
19:54:29 jeblair: me
19:54:32 I kinda think we should go the "wait longer" route so that we can spin up the new box and get the new IP info out to our various 3rd party testing folks and the corporations with idiotic network policies
19:54:48 jeblair: me
19:54:51 mordred: i believe that we can do that by the end of this week and provide 2 weeks of notice.
19:55:10 ok. I think that we know for a fact that will break a large portion of our user base
19:55:13 how do we get the new ip?
19:55:21 i'm out of town from the 25th to the 6th but don't let my absence stop you
19:55:23 tchaypo: we will send an email announcement
19:55:32 rephrase
19:55:34 mordred: how much notice do you want to provide?
19:55:47 how does the infra team find out the new ip to put in the email?
19:55:55 I think 2 months is probably the amount of time HP and IBM and Cisco are all likely to need
19:55:57 tchaypo: we spin up the server
19:56:02 tchaypo: by starting the new server.
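[Editorial note: for operators wondering whether their network will reach the replacement server, a small sketch of a reachability check against the ports discussed above (29418 for the Gerrit SSH API, 443 for HTTPS). The hostname is a placeholder for whatever address goes out in the eventual announcement.]

```python
#!/usr/bin/env python
# Sketch: verify outbound reachability to the Gerrit ports discussed above.
# Replace the host with the address from the migration announcement.
import socket

GERRIT_HOST = "review.openstack.org"   # placeholder; use the announced new IP
PORTS = {
    29418: "Gerrit SSH API (stream-events, git over ssh)",
    443: "HTTPS (REST API fallback if 29418 is filtered)",
}

for port, purpose in PORTS.items():
    try:
        socket.create_connection((GERRIT_HOST, port), timeout=5).close()
        print("ok   %5d  %s" % (port, purpose))
    except OSError as exc:
        print("FAIL %5d  %s (%s)" % (port, purpose, exc))
```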
19:56:05 mordred: that seems excessive
19:56:05 which is sad
19:56:08 and they should be ashamed
19:56:40 not sure we'd need 2 months, but 2 weeks is tight also
19:56:50 but given the number of corporate contributors we have AND the number of 3rd party testing rigs that exist - even though this is almost as broken as rackspace's not-dhcp - it is what it is
19:56:55 for third-party ci specifically i guess, since as clarkb points out everything besides stream-events should work fine via https api
19:57:17 fungi: yah - but 3rd party without streamevents is going to be kinda tough
19:57:22 yep
19:57:34 time check
19:57:41 ask.o.o
19:57:43 krtaylor: you need an egress change to connect to zuul? I was pretty sure the IBM corp network should just allow that
19:57:46 mordred: that puts us at march 21.
19:58:01 fungi: ya, https is how people are getting around china's great firewall now
19:58:03 anteaya, mrmartin: i think we're going to have to defer
19:58:06 fungi: so we know it works for the most part
19:58:07 I remember looking at the zuul code, and we may be able to update the socket it opens with a proxy configuration
19:58:11 jeblair: looks like it
19:58:27 asselin_: you just configure localhost:proxy_port as the gerrit location
19:58:31 jeblair: I'm around and available both days, fwiw
19:58:37 I agree with fungi's assessment though, ~50% aren't paying attention, 2 weeks should be enough to get the word out
19:58:38 and will help either day we choose
19:58:59 sdague, yes, no egress needed for us
19:59:32 we could also put gerrit ssh on port 22
19:59:37 i am happy to spin up the new server as soon as this meeting ends if someone wants to work on drafting an announcement i can plug the ip addresses into
19:59:43 but I think we should only do that if we can listen on 29418 as well
19:59:51 asselin_: have you worked through this before? how long did it take?
20:00:01 fungi, will announce in third-party meetings
20:00:22 okay, we'll continue this in the infra channel
20:00:24 thanks all
20:00:26 #endmeeting
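[Editorial footnote on the SOCKS workaround clarkb and asselin_ discuss above: one possible sketch of consuming Gerrit's event stream through a SOCKS proxy when direct outbound 29418 is filtered, assuming PySocks and paramiko are available. The proxy address, account name, and key path are placeholders; another variant of the same idea is a local port forward with zuul pointed at localhost:proxy_port as the gerrit location.]

```python
#!/usr/bin/env python
# Sketch: reach Gerrit's stream-events through a corporate SOCKS proxy when
# direct outbound 29418 is filtered. Proxy address, username, and key path
# are placeholders; PySocks and paramiko are assumed to be installed.
import paramiko
import socks  # PySocks

PROXY = ("socks-proxy.example.com", 1080)   # placeholder corporate proxy
GERRIT = ("review.openstack.org", 29418)

# Open a TCP connection to Gerrit via the SOCKS proxy...
sock = socks.socksocket()
sock.set_proxy(socks.SOCKS5, *PROXY)
sock.connect(GERRIT)

# ...then run the SSH session over that proxied socket.
client = paramiko.SSHClient()
client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
client.connect(GERRIT[0], port=GERRIT[1], username="third-party-ci",
               key_filename="/home/ci/.ssh/id_rsa", sock=sock)

# Print the Gerrit event stream line by line.
stdin, stdout, stderr = client.exec_command("gerrit stream-events")
for line in stdout:
    print(line.rstrip())
```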