19:01:09 #startmeeting infra
19:01:10 Meeting started Tue Feb 18 19:01:09 2014 UTC and is due to finish in 60 minutes. The chair is jeblair. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:11 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:14 The meeting name has been set to 'infra'
19:01:16 #link https://wiki.openstack.org/wiki/Meetings/InfraTeamMeeting
19:01:22 #link http://eavesdrop.openstack.org/meetings/infra/2014/infra.2014-02-11-19.01.html
19:01:37 so we have a packed agenda and i'm sure we won't get through it. :(
19:01:46 actions from last meeting are not very interesting
19:02:02 #topic trove testing
19:02:05 #link https://review.openstack.org/#/c/69501/
19:02:11 I think that's still the status of that
19:02:39 #topic Tripleo testing (lifeless, pleia2, fungi)
19:02:40 I'd like to have 2 mins for savanna testing/infra update at the end of meeting if possible
19:02:52 hi
19:03:18 so the news here is that we have pulled the tripleo cloud from nodepool and zuul
19:03:51 we've identified two specific improvements we need to make on the infra side to deal with a cloud that may not always be available
19:04:30 this week is the feature proposal freeze
19:04:36 I don't really have any specific updates otherwise, just chugging along on other pieces
19:04:37 and the week after that is the feature freeze
19:04:43 #link https://wiki.openstack.org/wiki/Icehouse_Release_Schedule
19:05:19 my option is to try to polish the cloud (HA is preferred) using 3rd party testing
19:05:29 jaypipes made a good doc about how to set it up
19:05:30 because it's a critical time for openstack development, i think we should wait until after i3 (march 6) before we add it back
19:05:50 we generally try to soft-freeze our systems around these times
19:06:24 which means now is the time when we're cramming to finalize scalability improvements to cope with the coming onslaught
19:06:38 yes, that's why i was working yesterday. :/
19:07:18 so there are two problems AIUI
19:07:24 so our tripleo sprint is that week, hopefully the high bandwidth time we have then (mar 3-7) will put us in a good position to help with further bugs
19:07:26 one is that there is a chicken and egg situation with stability
19:07:40 the second is that we're now without CI
19:07:44 for the feature freeze
19:07:51 which is a pretty bad time to be without
19:08:08 lifeless: what's the chicken and egg situation?
19:08:47 lifeless: i thought the cloud that you were plugging into nodepool was supposed to be stable (eg, not necessarily CD, at least, not to start)
19:09:11 jeblair: to be stable we have to have worked through any emergent issues from being used in the infra workload (e.g. https://bugs.launchpad.net/neutron/+bug/1271344 )
19:09:12 Launchpad bug 1271344 in tripleo "neutron-dhcp-agent doesn't hand out leases for recently used addresses" [Critical,Triaged]
19:09:21 jeblair: to be added back you want us to be stable.
19:09:52 jeblair: so it is as far as we know stable
19:09:56 jeblair: we're not changing it
19:10:02 jeblair: not upgrading, not reconfiguring.
19:10:16 jeblair: the two outages so far were a) I fucked up and deleted the damn thing
19:10:31 lifeless: so this isn't a case of deploying a new rev that was broken; but rather something that was thought to be stable was, after all, not; and that wasn't exposed except under load.
19:10:43 jeblair: and b) we encountered a love timebomb bug which we've now applied a workaround for so it won't come back
19:10:55 lifeless: okay, i certainly understand that. a lot of infra isn't testable except under load either.
19:10:56 jeblair: right
19:10:56 well, there was an outage a few weeks before the deletion too which lasted a couple days, when you replaced the previous test provider with the ci one
19:11:07 fungi: that was the 'robert deleted it'
19:11:37 fungi: the delay bringing it back was that I chose, given all the variables at the time, to delay bringing it back by hand and instead fix the automation to bring it back bigger
19:11:51 fungi: which is why it now has 10 hypervisor nodes (each with ~96G of ram, 2TB of local disk)
19:11:58 well, there was something a few weeks prior to the deletion too. anyway i recall it wasn't something likely to recur
19:12:08 fungi: that was the neutron bug I linked above
19:12:17 fungi: where you weren't getting IP addresses
19:12:23 ahh, yep
19:12:27 anyhow, point is - this is a static deployment
19:12:53 specifically because a moving target would be bad
19:13:02 lifeless: okay, thanks, that reassures me we are on the same page. and i'm more or less convinced that we're at the point that we should be experimenting with the tripleo cloud...
19:13:51 lifeless: but having said that, i think part of experimentation is realizing when something doesn't work and backing off...
19:14:22 jeblair: so, if it wasn't working for unknown reasons I'd totally agree with you
19:14:46 lifeless: so even when we get those two problems with nodepool and zuul sorted, is the period from now through i3 really a good time to be dealing with the churn from this, and finding the _next_ problem?
19:15:16 jeblair: the benefits to tripleo are substantial; we hope the benefits to other programs will be too
19:15:46 jeblair: there is a risk; perhaps we should talk about how we can mitigate it?
19:16:00 also, was there an updated status/eta on the rh-provided region?
19:16:16 lifeless: i'm mildly concerned about the infra load, but i'm more concerned with the potential impact to the operation of the gate during this time...
19:16:58 as an example, having 50 jobs stuck in the check queue is counter to the expectations of people monitoring the overall throughput, looking for problems, etc.
19:18:10 fungi: dprince assures me it's been escalated
19:18:13 jeblair, I remember that lifeless has a patch to move all this stuff to the experimental-tripleo pipeline
19:18:39 fungi: but realistically it will still take a little time to bring up a ci-overcloud region there and address multi-region layout etc.
19:18:55 fungi: I don't think we'll have multi-region live in the next two weeks.
19:19:29 okay, just curious
19:19:34 jeblair: yes, I can see that. Would making a tripleo-check queue specifically - same config etc, just only tripleo jobs in it - help with that?
19:19:42 or check-tripleo
19:19:48 SergeyLukjanov: i thought the experimental pipeline was for testing other projects (eg nova), but that tripleo would still want some check jobs
19:20:11 jeblair: not as a long term strategy, but as a reduce-cognitive-load *in the event* that something goes wrong ?
19:21:00 and there was an issue that nodepool couldn't start with an offline provider
19:21:29 SergeyLukjanov: yup, derekh was poking at that last night, I should have an update in a couple hours I expect
19:21:37 lifeless: a dedicated check queue may help with that and also work around the fact that the check queue is required for gate; which would help you in case of problems.
19:22:06 jeblair: I presume that's basically the same patch as mine adding experimental-tripleo, + move the existing jobs from check -> check-tripleo ?
19:22:49 lifeless: yes. mind you, i'm only addressing the technical aspects, not the question of whether we should do this.
19:22:55 also worth noting here while everyone is looking, there is a team of folk on the hook for supporting the ci-overcloud
19:23:05 http://git.openstack.org/cgit/openstack/tripleo-incubator/tree/tripleo-cloud/tripleo-cd-admins
19:23:13 broad time zone coverage
19:23:36 and all have access to every machine's console via IPMI etc; the only thing the non-HP folk can't do is file datacentre tickets
19:24:05 (but at that point the cloud is clearly not 'in a little trouble', so you'd be facing a big outage then regardless)
19:24:24 and the full tz coverage in the infra team to revert tripleo testing if it starts failing
19:24:31 lifeless: basically, i think that adding this right now is contrary to the soft-freeze that we try to do around milestones and releases. but it's a soft freeze and we can choose to waive it.
19:24:34 if it'll be needed
19:24:43 lifeless: i'd like to get some others to weigh in on this
19:25:10 lifeless: people who are likely to be affected. ttx, sdague, jog0, and perhaps some ptls.
19:25:42 just from the point of view of having gone through a few feature freezes
19:25:43 and, oh maybe some more people on the infra team :)
19:26:02 I am for anything that stabilizes the gate now and then introduces changes after ff
19:26:24 we _will_ encounter unpredicted circumstances in the next two weeks
19:26:25 the only major thing that bothers me is nodepool not being able to start properly when a cloud is gone
19:26:39 fwiw the fix for nodepool to work with an offline provider + the check-tripleo pipeline sound like they'll not affect other projects => it's ok
19:26:46 i'd mainly like to see the nodepool exceptions rooted out before re-adding providers, just because having everything stop when there's a provider outage (any provider) is sort of painful. the zuul job timeout patch seems less critical to bringing tripleo back online, as long as zuul's now able to drop those jobs when they're unconfigured
19:26:47 we need to have the personal stress minimized to survive it
19:26:51 I don't find the jobs that hang around to be too bothersome as it only really affects tripleo anyways
19:27:20 fungi: ya, agreed. If nodepool can be made more happy in the event of unexpected derp then it is fine from my end
19:29:32 so it seems like there's some consensus in infra that we'd be okay with the nodepool fix and dedicated tripleo pipelines as hard requirements; the zuul fix is something we should do soon, but not critical.
19:30:04 ttx isn't around today. i'd like to give jog0, sdague, and mtreinish a chance to weigh in since they would be affected by problems too.
19:30:06 agreed
19:30:34 ok. I'll put up the pipeline patch (merged with the experimental one I guess if that hasn't landed yet)
19:30:39 regardless
19:30:41 so let's see if we can catch up with them today, and if they don't jump up and down on their hats, we'll proceed with that.
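For reference, a minimal sketch of what the dedicated check-tripleo pipeline discussed above might look like in a Zuul (v2-era) layout.yaml. The pipeline name matches the discussion, but the trigger and reporting settings here are illustrative assumptions, not the actual patch lifeless offered to put up.

    pipelines:
      - name: check-tripleo
        description: Newly uploaded patchsets for tripleo projects enter this pipeline to run against the tripleo test cloud.
        manager: IndependentPipelineManager
        trigger:
          gerrit:
            - event: patchset-created
        # Voting behaviour is an assumption; a pipeline like this could
        # equally be configured to leave no Verified vote at all.
        success:
          gerrit:
            verified: 1
        failure:
          gerrit:
            verified: -1

Keeping these jobs out of the shared check pipeline means a tripleo cloud outage piles jobs up only in check-tripleo, which is the cognitive-load reduction being discussed.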
19:30:50 thank you
19:31:16 lifeless: thank you. i'm still really excited by this.
19:31:36 #topic Requested StackForge project rename (fungi, clarkb, zhiwei)
19:31:44 have we heard from zhiwei ?
19:32:05 yes, he's eager to have it happen as soon as we're able to do the rename
19:32:09 zhiwei has pinged at PST night time. I suggested we would bundle it with the next openstack related downtime (savanna?)
19:32:19 sounds like a plan
19:32:22 right zhiwei would like to get this done soon so they can cut an icehouse branch
19:32:44 but savanna is voting on stuff now so I expect that to move along at a good pace now
19:32:46 SergeyLukjanov: ^ ?
19:33:12 fungi, clarkb: can you update the wiki and indicate the old and new names of the project or projects that need renaming?
19:33:13 i thought the vote was scheduled to end yesterday?
19:33:20 clarkb, I hope that we'll have a couple of discussed options at the end of this week
19:33:22 jeblair: definitely
19:33:32 clarkb, then we'll wait for the foundation to check them
19:33:34 fungi: it got extended
19:33:44 jeblair: definitely (on the meeting agenda?)
19:33:45 oh, got it
19:34:05 clarkb: yeah, let's drop it to the bottom and collect projects there until we do a rename
19:34:13 #topic Ongoing new project creation issues (mordred)
19:34:21 jeblair: will do
19:34:27 anteaya, fungi: you've been working on this, what's the latest?
19:34:29 fungi, the initial vote will end today, but it's a first round to filter really bad options :)
19:34:43 SergeyLukjanov: filter the bad ones out or in? :)
19:35:02 jeblair, I mean filter out :)
19:35:04 mostly there is logging available, plus patches for more
19:35:06 jeblair: bug is updated with most recent findings, but in short we do capture tracebacks in the syslog when puppet tries to add projects which don't import an existing repository
19:35:18 jeblair, heh, we'd like the most bad name ever
19:35:25 to have*
19:35:29 and also we've spotted a race condition between when create-cgitrepos runs on the git servers and when gerrit is told to replicate
19:35:44 fungi: that pretty much needs to be solved with salt, right?
19:35:58 the latter, yes or something driven from the gerrit server anyway
19:36:03 looks like we could move on and approve more create-project patches with upstream
19:36:32 so the logging was needed to determine the next steps for solving, correct?
19:36:36 fungi: so is that the _only_ problem at this point?
19:36:39 SergeyLukjanov: yes it seems like the ones i approved which imported an existing repository worked fine
19:36:48 shall I move to working with salt, or is it still too early?
19:36:59 jeblair: the only two problems?
19:37:04 #link https://bugs.launchpad.net/openstack-ci/+bug/1242569
19:37:06 Launchpad bug 1242569 in openstack-ci "manage-projects error on new project creation" [Critical,In progress]
19:37:28 fungi: jeblair SergeyLukjanov maybe a configurable optional pre-replicate shell-out step in manage-projects
19:37:42 then have that trigger salt, or ssh in a for loop or puppet even
19:38:17 anteaya: i still have an open etherpad where utahdave was going to provide us with clearer examples of using reactors, if you wanted to take over and try to figure that part out
19:38:32 it should fix the non-upstream creation issue I hope
19:38:35 I can do that, yes
19:38:37 clarkb: yeah, it seems like having manage-projects run salt commands is a good architecture. it could even "import salt", right?
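A minimal sketch of the "import salt" idea being floated here: manage-projects talks to salt directly to run the cgit repo setup on the git backends before Gerrit replicates, closing the race mentioned above. This assumes manage-projects runs on the salt master and that the git servers match a 'git*' target; the target pattern, the helper name, and the command it runs are all illustrative assumptions, not the deployed configuration.

    # Hypothetical pre-replication helper inside manage-projects.
    import salt.client

    def prepare_cgit_repos():
        local = salt.client.LocalClient()
        # Run the existing create-cgitrepos step on every matching git
        # server before Gerrit is told to replicate to them.
        return local.cmd('git*', 'cmd.run', ['create-cgitrepos'])

Whether this needs a reactor, or just a direct LocalClient call from the master as sketched, is exactly the open question anteaya agrees to investigate below.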
19:38:37 #link https://etherpad.openstack.org/p/Salt-Event-System
19:38:47 would this be a candidate for a reactor do you think?
19:39:10 clarkb: but i don't understand reactors, so maybe that's better?
19:39:14 anteaya: apparently anything which needs to happen as a result of something else happening successfully requires a reactor, from what i'm led to believe
19:39:35 jeblair: yup import salt and talk directly
19:39:46 fungi: let me gather my thoughts on this and then get back to you
19:39:48 that at least i can understand and reason about. ;)
19:40:00 import salt and talk directly to what?
19:40:00 I don't want to offer an opinion before I am ready
19:40:15 fungi, to salt I think :)
19:40:17 have the gerrit server be a salt master and the git servers be salt minions?
19:40:20 what I am hearing is explore how salt can help manage-projects
19:40:25 anteaya: ya
19:40:30 that is what I will go on
19:40:41 fungi: i don't think it has to be a master, but we can let gerrit run the create-cgit-repos command via salt
19:40:50 anteaya: mostly I am leaning down that road because it might reduce the amount of additional infrastructure necessary to trigger the pre-replication steps
19:40:55 fungi: where gerrit in that sentence really means 'manage projects running on review.o.o'
19:41:04 clarkb: yes, I am leaning the same way
19:41:18 jeblair: oh, i see, just using salt as a proxy for "ssh to these machines and do this" (didn't realize you didn't need a salt master to be able to do that)
19:41:28 no
19:41:36 salt trigger can feed commands to master
19:41:44 if that is the best option
19:41:51 and we have salt trigger up and running
19:42:02 fungi: i think you need the master, but you can grant minions access to run specific commands; i think that was the idea of having the jenkins salt slave trigger something
19:42:14 yes, what jeblair said
19:42:40 anteaya: well, that's for config repository changes, and its design apparently depends on having a working reactor implemented on the master to get any cascading work done (update git repo in one place, run puppet apply in another)
19:43:06 fungi: I will have to look more deeply into the reactor part
19:43:22 this is what utahdave was working on getting us good examples for, because he said dependent activities aren't well covered in the documentation
19:43:23 it is for config changes since that is how we have it triggered
19:43:34 it can trigger on anything we decide to trigger on
19:43:41 fungi: so aside from 'cgit repos not created in the right order' what's the other bug?
19:44:20 projects which don't import an existing repository fail to get created, and spew a traceback from gerritlib trying to create-project through the ssh api
19:44:58 and i noted the resultant state of the jeepyb scratch repository, but without adding more logging to the script it's hard to know what else might have gone wrong
19:45:00 fungi: is that where this comes from? https://bugs.launchpad.net/openstack-ci/+bug/1242569/comments/13
19:45:02 Launchpad bug 1242569 in openstack-ci "manage-projects error on new project creation" [Critical,In progress]
19:45:03 is it really critical to be able to create projects w/o upstream?
19:45:07 jeblair: yes
19:45:50 ok. thanks.
19:46:17 jeblair: it seems that something (nice and vague, huh) is preventing the initial repository in the gerrit homedir from getting fully created. script dies somewhere between that and building the jeepyb scratch repo
19:46:29 that should be a recreatable and diagnosable problem with a local gerrit.
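A hedged sketch of what reproducing this against a local Gerrit might look like via gerritlib (the same path jeepyb's manage-projects takes over the ssh api). The host, user, key path, and project name are placeholders, and the exact createProject signature may vary between gerritlib releases.

    # Assumes a throwaway local Gerrit with an admin account and ssh key;
    # all values are placeholders, not the production configuration.
    from gerritlib import gerrit

    g = gerrit.Gerrit('localhost', 'admin', 29418, keyfile='/path/to/admin_id_rsa')
    # Create a project with no upstream repository to import, mirroring
    # the failing case from bug 1242569.
    g.createProject('stackforge/empty-test')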
19:46:40 weird that it only affects empty projects?
19:46:45 since all gerrit projects start that way
19:46:45 fungi: will my additional logging patches help identify the something?
19:46:53 then later on manage-projects force pushes into the blank repo
19:47:20 clarkb: yeah, i was thinking the same. that may both suggest a bug in jeepyb and help narrow the location.
19:47:29 i believe that the previous blind testing was inconclusive because we actually had (at least) two different bugs, so the results were hard to correlate
19:47:36 clarkb: but i don't have the jeepyb code loaded in my brain.
19:47:44 I still have gotten nowhere on setting up a local gerrit, I am not very good at sorting out the modules from config and implementing them
19:47:52 my failure here
19:49:05 #topic Discuss about using compute nodes in LXC container for multi-node setup (matrohon, jgallard)
19:49:12 hi
19:49:16 matrohon: what's up?
19:50:01 we would like to enhance neutron testing especially with ML2 and overlay networks
19:50:13 isn't this a non-starter for the reasons that pleia2 and jaypipes have discovered? I suppose we can test less cinder (but then we lose test coverage)
19:50:27 clarkb: and less nbd
19:50:41 I still have my notes from a few months back here: https://etherpad.openstack.org/p/tripleobaremetallxc2013 (it's tripleo-specific, but a lot carries over into more general multi-node)
19:50:55 clarkb: which means you have to have a working guestfish, because nova really likes mounting disks.
19:51:09 lifeless: which doesn't work on precise :/
19:51:13 right
19:51:16 I saw jaypipes' issues, and we could potentially have the same with ovs
19:51:27 but not with linuxbridge
19:51:33 matrohon: is the multi-node part the important part?
19:51:33 for multinode - I'd point folk at tripleo-gate personally.
19:51:34 matrohon: if you see the etherpad, ovs works ok if you load in the modules
19:51:41 or the lxc part?
19:51:52 pleia2 : thanks
19:52:11 the multi-node part is the most important for our gate jobs
19:52:13 if we can get lxc to work I think that would be great, but having poked at it with pleia2 I don't have high hopes
19:52:24 matrohon: yes, running jenkins slaves that need to install OVS or iscsi is a non-starter.
19:52:31 so what about going back to the multinode + openvpn l2 that we talked about at summit
19:52:33 iscsi was our stopping point
19:52:40 matrohon: okay, so I suggest you focus on that part and then let discussion unfold about how to do that
19:52:48 just start 2 actual cloud guests and build them a layer 2
19:52:55 then proceed from there
19:53:02 if anyone wants to fix iscsi.. :) https://bugs.launchpad.net/ubuntu/+source/lxc/+bug/1226855
19:53:03 Launchpad bug 1226855 in lxc "Cannot use open-iscsi inside LXC container" [Undecided,Confirmed]
19:53:04 anteaya : ok
19:53:07 pleia2: right, and OVS isn't installable in a shared-kernel VM either... at least, I've tried and can't do it..
19:53:25 matrohon: so lifeless has a proposal up, tripleo-gate
19:53:33 because I think lxc brings more problems than it is worth here, and I expect if this part isn't easy, we're going to find a ton of other issues down the road
19:53:59 jaypipes: lines 20-22 are how I got ovs working ok
19:54:04 sdague: yes. i like that approach. i think it will be moderately easier when we move from jenkins to non-jenkins workers....
19:54:10 sdague: but it is still probably doable with jenkins
19:54:14 jaypipes : but the idea is to use linux bridge instead; this would help us test the linuxbridge agent
19:54:28 jeblair: yeh, I think it's something that we could do today, with the cloud resources we have.
19:54:40 anteaya : we will look at tripleo-gate too
19:54:50 matrohon: try to stay focused on one thing at a time, testing linuxbridge is a sub requirement
19:55:02 I'm reluctant to just keep saying tripleo-gate will save the world, because it's not even self gating yet :)
19:55:05 it "just" needs additional automation around grouping workers or proxying them and being able to grant workers to other workers
19:55:07 matrohon: if you stay focused on multi-node you now have two suggestions
19:55:34 matrohon: yes, keep in mind tripleo-gate doesn't exist yet.
19:55:35 sdague: I know, right! what are those tripleo folk thinking :)
19:55:38 plus the tunneled networking implementation of course
19:56:15 fungi: so tunnelled networking should be easy enough with openvpn L2, even between cloud providers
19:56:16 fungi : and the live-migration
19:57:00 matrohon: so how much time do you have to devote here?
19:57:04 hi all, I would like to know what you think about adding LXC support in devstack as an extra hook?
19:57:14 matrohon: if you are willing to put in some work on the openvpn and how to get zuul and jenkins to assign multiple nodes to a task, i'd be happy to help point you at where to work on that.
19:57:23 jgallard is full time on it
19:57:28 matrohon: but it's going to be a good deal of infrastructure work, as none of that exists at the moment.
19:57:42 we started to work on that
19:57:47 jeblair: matrohon fungi sdague nodepool may be able to coordinate the openvpn setup with the features BobBall and friends are adding to it for xen image creation
19:58:07 clarkb: true, that could be another useful tool
19:58:39 jgallard: so lxc might solve a very limited test scenario, but with the issues that were already run into, it can't be the generic multinode case
19:58:41 jgallard, matrohon: chat with us further in #openstack-infra
19:58:46 #topic Savanna testing (SergeyLukjanov)
19:59:04 jeblair : ok thanks
19:59:08 ok, thanks a lot
19:59:16 SergeyLukjanov: real quick? :)
19:59:26 there are several small updates - we now have cli tests in tempest and so we'd like to gate savanna and its client together
19:59:29 jeblair, yup
19:59:38 and we're moving client docs to client
19:59:57 so, I'll really appreciate review/approval of https://review.openstack.org/#/c/74310/ and https://review.openstack.org/#/c/74470/
19:59:58 end
20:00:01 cool, and i don't feel bad about running into the tc timeslot because i'm sure they're interested to see that. :)
20:00:07 SergeyLukjanov: great, thanks
20:00:12 hey
20:00:14 thanks everyone!
20:00:16 roll up, roll up
20:00:17 #endmeeting