19:03:00 #startmeeting infra
19:03:00 Meeting started Tue Mar 29 19:03:00 2016 UTC and is due to finish in 60 minutes. The chair is fungi. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:03:02 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:03:04 The meeting name has been set to 'infra'
19:03:23 hi there
19:03:24 #link https://wiki.openstack.org/wiki/Meetings/InfraTeamMeeting#Agenda_for_next_meeting
19:03:33 #topic Announcements
19:03:54 #info Reminder: add summit agenda ideas to the Etherpad
19:03:59 #link https://etherpad.openstack.org/p/infra-newton-summit-planning Newton Summit Planning
19:04:04 let's plan to try to do a little voting on them at next week's meeting
19:04:14 #topic Actions from last meeting
19:04:16 #info happy second term PTL-ness to fungi :)
19:04:28 heh, thanks (i think?)!
19:04:34 #link http://eavesdrop.openstack.org/meetings/infra/2016/infra.2016-03-22-19.02.html
19:04:40 none last week
19:04:57 fungi: congrats!
19:05:04 o/
19:05:05 #topic Specs approval
19:05:11 #info APPROVED: "Nodepool: Use Zookeeper for Workers"
19:05:16 #link http://specs.openstack.org/openstack-infra/infra-specs/specs/nodepool-zookeeper-workers.html Nodepool: Use Zookeeper for Workers
19:05:22 #info APPROVED: "Stackviz Deployment"
19:05:27 #link http://specs.openstack.org/openstack-infra/infra-specs/specs/deploy-stackviz.html Stackviz Deployment
19:05:39 those urls _should_ be valid soon ;)
19:05:45 I have missed looking at Zookeeper / Nodepool. Would Zookeeper be optional?
19:05:48 o/
19:05:48 i approved, but jobs still need to run
19:06:02 as a third-party user of Nodepool I have only a single image to build/update. Just wondering really
19:06:36 hashar: i believe it would be required, however, it scales down very well, and i anticipate simply running it on the nodepool host will be fine
19:06:47 hashar: (and will be what we do for quite some time)
19:06:49 hashar: the spec proposal (mentioned last week) is at https://review.openstack.org/278777 while we wait for the jobs to finish running/publishing
19:07:08 if you want to read further
19:07:10 fungi: thanks
19:07:22 jeblair: yeah I guess I will survive and we have some Zookeeper instances already iirc
19:07:26 jeblair: thx ;)
19:07:44 also, there's still time to get involved! specs are not written in stone (more like silly putty really)
19:08:14 #topic Priority Efforts: Infra-cloud
19:08:30 added briefly to highlight cody-somerville's awesome weekly status reporting
19:08:49 o/
19:08:56 (sorry to be late)
19:09:00 #link http://lists.openstack.org/pipermail/openstack-infra/2016-March/004090.html latest infra-cloud status
19:09:17 note that the hardware from former "west" has arrived in houston now
19:09:43 that is good news
19:09:44 so maybe we'll have access back for our desired priority hardware shortly and can pick up where we left off at the sprint!
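(A hedged sketch of the single-node deployment jeblair describes above, where ZooKeeper "scales down very well" and runs directly on the nodepool host. It assumes Ubuntu's zookeeperd package and the default client port; how nodepool itself is pointed at it depends on the spec's eventual implementation.)

    # Minimal sketch, not from the meeting: a standalone ZooKeeper colocated
    # with nodepool, using Ubuntu's zookeeperd package and default port 2181.
    sudo apt-get install -y zookeeperd

    # Health-check with the "ruok" four-letter command; a healthy server
    # answers "imok".
    echo ruok | nc localhost 2181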
19:10:07 #topic Baremetal instances in Nodepool with Ironic (igorbelikov)
19:10:13 hey folks
19:10:21 this idea was brought up in chat by kozhukalov last week
19:10:39 i know this has been brought up many times in the past few years, so i'll let people reiterate the usual concerns
19:11:17 I wrote an email about it
19:11:22 we really want to integrate our fuel deployment tests with infra, the issue is - we use baremetal nodes to launch a bunch of VMs and deploy openstack
19:11:22 there are a bunch of limitations we can't overcome yet to span the VMs across multiple baremetal nodes
19:11:23 and a lot more issues come to mind if we imagine doing it on top of a bunch of VMs requested from Nodepool
19:11:23 so we discussed this internally and while we will still continue to move forward to overcome these limitations
19:11:23 the most realistic-looking way is to use Ironic and its baremetal driver in Nova, so Nodepool will be able to request baremetal nodes without any serious changes to Nodepool logic and current infra workflow
19:11:39 hard to dig up on phone but that should cover the details
19:12:08 and I wanted to get some input from infra on this general idea
19:12:13 clarkb: do you remember date or subject key words?
19:12:14 igorbelikov: how do we upload images to that?
19:12:27 just curious if your plans include glance
19:12:28 baremetal nodes will be able to use dib-images
19:12:53 fungi: so basically the current workflow for nodepool vms, but with baremetal
19:13:18 also running untrusted code on your servers has the opportunity to taint them if anyone proposes a patch which, say, uploads malicious firmware
19:14:10 fungi: fuel deployment tests will work just fine with restricted access from jenkins user. It doesn’t completely solve all security issues, but this can be discussed further
19:14:12 angdraug: was on infra list and adrian otto was on the thread
19:14:48 i think that thread was having to do with bare metal testing for magnum or something?
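(A hedged sketch of the image path being asked about above, where baremetal nodes would consume the same dib-built images and fungi asks whether glance is involved. The image names are illustrative; it assumes diskimage-builder's baremetal element, which emits a kernel and ramdisk alongside the qcow2.)

    # Hedged sketch, names illustrative: build a dib image suitable for
    # baremetal boot, then register the artifacts in glance.
    disk-image-create ubuntu-minimal baremetal -o fuel-test

    openstack image create --disk-format aki --container-format aki \
        --file fuel-test.vmlinuz fuel-test.kernel
    openstack image create --disk-format ari --container-format ari \
        --file fuel-test.initrd fuel-test.initramfs
    openstack image create --disk-format qcow2 --container-format bare \
        --file fuel-test.qcow2 fuel-test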
19:15:06 yup
19:15:17 anyway, yes it's a suggestion which has come up multiple times, as i've said, from multiple parties
19:15:22 but should cover general ironic + nodepool
19:15:47 and so far no one has given us a workable endpoint or attempted to
19:15:55 clarkb: thanks, I’ll dig up the thread, sadly I missed it
19:16:08 #link http://lists.openstack.org/pipermail/openstack-infra/2015-September/003138.html
19:16:12 from my perspective, once we have infra cloud up and running, there can be opportunity to start moving it
19:16:25 i can see how it would be possible to implement, but more generally the usual needs for multiple separate environments, making sure the clouds providing those resources are staffed and maintained to keep them running, et cetera are typical concerns we have over any special regions in nodepool as well
19:16:40 deploy nova + ironic, use dib images, and figure out how to deal with security problems
19:17:13 the current goal with infra-cloud is to provide virtual machines, not baremetal test nodes, but it's possible that could act as an additional available region for those tests if we decided that was something we should implement
19:17:31 "figure out how to deal with security problems" seems like a lot of handwaving to me
19:17:47 we’re ready to work on the required infra-cloud changes for that to happen
19:17:51 a spec would be needed of course
19:18:09 it's a hard problem, and i know the tripleo and ironic crowd have struggled with it for a few years already, so looping them in early in such a conversation would be wise
19:18:12 and i see that as next steps once we have a stable infra cloud
19:18:16 yolanda: a spec is a must for this, sure :)
19:18:37 I can understand the need for igorbelikov wanting bare metal nodes upstream, but I would be curious to see what else is needed to migrate more of ci.fuel-infra.org upstream personally.
19:19:07 pabelanger: http://lists.openstack.org/pipermail/openstack-dev/2015-November/079284.html
19:19:13 anyway, i guess my point is that "modify nodepool to support ironic" is the simplest part of this. having good answers for the _hard_ parts first is what we'll need to be able to decide if we should do it
19:19:40 fungi, from our experience downstream using baremetal, we focused on two things: code review is very important, to ensure that no code is malicious. And also, periodical redeploys of baremetal servers, to ensure they are clean
19:19:46 pabelanger: the only things non-migratable right now are deployment tests, we’re working on moving everything else upstream
19:20:21 nodepool supporting ironic should be a matter of using the right flavors. The complicated part should be the nova + ironic integration...
19:20:24 yolanda: reviewing code before jobs run is also a significant departure from our current workflow/tooling so that's not a solution to be taken lightly
19:20:34 yolanda: periodical redeploys fit perfectly in the picture
19:20:39 * crinkle would like to see infra-cloud turned back on and providing reliable nodepool resources before thinking about new uses for the hardware
19:20:48 angdraug: igorbelikov: thanks, will read up on it after meeting
19:20:59 crinkle: ++
19:21:09 i am in complete agreement there. let's table any discussion of what else infra-cloud could be useful for until we're using it for what we first wanted
19:21:27 thanks for the reality check, crinkle
19:21:31 :)
19:22:06 yep, that should be a next step once we have the hardware in place and redeploy again. But I think that we have this possibility for the mid-term
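(A hedged illustration of yolanda's remark above that Nodepool support "should be a matter of using the right flavors": a Nova flavor whose properties match the Ironic nodes, so Nodepool could request it like any other flavor. Sizes and names are made up.)

    # Hedged illustration, sizes and names are illustrative only.
    openstack flavor create --ram 65536 --disk 100 --vcpus 16 baremetal-general
    openstack flavor set --property cpu_arch=x86_64 baremetal-general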
19:22:07 btw there's a sizeable pool of hw behind ci.f-i.org, just saying )
19:22:35 okay, so it seems like this is a topic which would be better moved to a ml thread, we can loop in people with experience in the problem areas and determine if there's a good solution that fits our tools and workflow, and make sure concerns brought up in previous iterations of the same discussion are addressed to our satisfaction
19:23:16 angdraug: one sizable pool is insufficient. if it goes offline then any jobs which can only run there won't run, and projects depending on those jobs will be blocked
19:23:33 fungi: do you think it's too early to start a spec?
19:23:40 we've seen this many times already with tripleo and are strongly considering switching them to third-party ci
19:24:04 fungi: I recall you made the same point in the nfv ci thread on the -dev mailing list
19:24:07 because trying to showhorn their special cloud which only runs their tests and has no redundancy turns out not to be a great fit for our systems
19:24:09 good point, one more concern to address on ML/in spec
19:24:15 er, shoehorn
19:24:36 that's exactly what we want to avoid
19:24:43 right now we're using that hw in our own special way
19:24:53 we want this to become a generic pool of hw for any openstack ci jobs
19:24:53 so, yes you can start with a spec but i think it may be easier to have an ml thread to work out bigger questions before you bother settling on a set of solutions to propose in a spec
19:24:57 there are actually 2 pools in different geographical locations, but it’s still a good point
19:25:05 so, team up with tripleo and have two pools to share ;)
19:25:19 AJaeger: +1 :)
19:25:37 well, tripleo's environment would need a complete redesign from scratch to be generally usable anyway
19:25:46 so does ours
19:26:00 their model with brokers and precreated networks is very specific to the design of their jobs
19:26:08 ah ;(
19:26:24 AJaeger: fungi: I'm hoping to talk with the tripleo team in austin to see what can be done moving forward
19:26:39 okay, meeting's half over, 6 topics to go. need to move on
19:26:50 sorry, thanks for giving us the time!
19:26:51 ok, moving this to mail thread, thanks!
19:27:04 thanks angdraug, igorbelikov!
19:27:09 #topic Gerrit tuning (zaro)
19:27:25 zaro: saw you had more details on the ml thread this week!
19:27:27 anybody get a chance to read the link
19:27:28 ??
19:27:55 #link http://lists.openstack.org/pipermail/openstack-infra/2016-March/004077.html
19:28:00 anyways yeah, performance is way better after running git gc
19:28:08 zaro: yes, thanks for testing this!
19:28:21 so was wondering if anybody had further questions about it?
19:28:35 can we look closer at git push origin HEAD:refs/for/master
19:28:52 the stats for user
19:28:55 what do you mean look closer?
19:29:04 before is 5 seconds, the after is 11s
19:29:13 has anyone looked into server performance with the resulting repositories?
19:29:18 zaro: How can we run this? Is there a gerrit setting or is that manual? And while git gc runs, is the repo available for usage?
19:29:19 for user that looks like it takes twice as long to me
19:29:29 (not only gerrit, but cgit/git)
19:30:39 i haven't looked into it. sounds like zaro is the only one who's run comparative stats so far
19:30:48 anteaya: i didn't notice that, but that's a very odd result. i'm not sure why the discrepancy there.
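(A hedged sketch of the kind of before/after comparison being discussed: repack a scratch copy of the nova repo and time a clone and a push to refs/for/master. Paths, the "gerrit" remote and the review host are illustrative and assume a configured remote with review permissions.)

    # Run against a scratch copy of the repo, not a production server.
    cd nova

    # Baseline: object/pack counts and a timed local clone before any gc.
    git count-objects -v
    time git clone --no-local . /tmp/nova-before

    # Repack aggressively (destructive to the existing pack layout, which is
    # why server-side impact needs checking before doing this on gerrit).
    git gc --aggressive --prune=now

    # Repeat afterwards, including the push path anteaya asked about.
    git count-objects -v
    time git clone --no-local . /tmp/nova-after
    time git push gerrit HEAD:refs/for/master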
19:30:58 but i agree the server impact (gerrit and cgit) is still a missing piece
19:30:59 zaro: okay, I question it
19:31:00 zaro: we tried it in our downstream environment and it really made a difference for some of our projects. so thanks for that.
19:32:14 AJaeger: i ran it manually with the nova repo provided by jeblair
19:32:34 AJaeger: i only ran locally on my own machine.
19:32:41 abregman: if you are able to collect any statistics and share them as a reply to that mailing list post, that would be wonderful
19:32:42 since git gc (or jgit gc for that matter) is by definition a destructive action, it's not something we'll be easily able to recover from if we later discover an adverse impact somewhere, hence the need for thorough testing
19:33:01 I can't remember if I mentioned it on the list but Gerrit upload-pack ends up sending all refs/changes/* to the client doing a git fetch :(
19:33:03 anteaya: sure, np
19:33:06 AJaeger: i suppose you can do the same
19:33:09 abregman: thank you
19:33:24 zaro: I meant: Run on gerrit itself
19:33:26 abregman: include commands run and as much detail as you can
19:33:35 ack :)
19:34:29 #link https://tarballs.openstack.org/ci/nova.git.tar.bz2 a snapshot of the full nova repo from review.openstack.org's filesystem
19:34:31 and somehow the git fetch is way faster over https compared to ssh (on my setup and using Wikimedia Gerrit 2.8 ...). Long food: https://phabricator.wikimedia.org/T103990#2144157
19:35:15 #link https://phabricator.wikimedia.org/T103990#2144157
19:35:20 thanks for the details hashar
19:35:46 feel free to poke me in your mornings if you want me to elaborate
19:36:00 AJaeger: most things are cloning from git.o.o not review.o.o so i just tested directly.
19:36:08 at least on my setup using https for fetch solved it. I should try on your nova installation
19:36:53 zaro: so anyway, it sounds like we're a lot closer to seeing a performance benefit for this but more comfort about the potential impact to the server side of things is preferred before we decide it's entirely safe
19:37:22 what would provide more comfort?
19:37:49 server performance with the resulting repositories is what jeblair has asked for
19:38:02 zaro: indications that performance on git.o.o or review.o.o (on the servers) will improve or at least remain constant after a git gc (and definitely not get worse)
19:38:39 e.g. is it more work for git to serve these after than it was before
19:39:20 anyway, continuation on the ml
19:39:22 okay, need to continue pushing through as many of these topics as we can
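(A hedged sketch of the server-side check jeblair asks for above; paths and URLs are illustrative, not the actual server layout.)

    # Compare pack layout on the server before and after the gc'd repo is
    # put in place (path is illustrative).
    du -sh /var/lib/git/openstack/nova.git/objects/pack
    git -C /var/lib/git/openstack/nova.git count-objects -v

    # Rough proxy for upload-pack cost as seen by clients; hashar's point
    # about refs/changes/* shows up in the size of this listing.
    time git ls-remote https://review.openstack.org/openstack/nova | wc -l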
19:39:36 #topic Status of gerrit replacement node (anteaya, yolanda)
19:39:51 okay so on April 11th we committed to doing a thing: http://lists.openstack.org/pipermail/openstack-dev/2016-March/088985.html
19:40:02 and my understanding is that yolanda has a node up
19:40:06 i just wanted to confirm that nothing is pending for that node replacement
19:40:11 beyond that I don't know what the plan is
19:40:14 i created the node, it's in the ansible inventory, and it's disabled
19:40:14 this is just a quick check on the existing server replacement schedule, and making sure someone writes up the maintenance plan for it?
19:40:16 but I think there should be one
19:40:23 fungi: yes
19:40:30 I'm away next week
19:40:38 just want to hear someone is driving this
19:40:52 can be but doesn't have to be yolanda
19:40:56 basic process is stop review.o.o, copy git repos, index(es), start on new server
19:41:03 we likely also need a one-week warning e-mail followup to the previous announcement
19:41:05 do we have pre-existing maintenance plans for gerrit?
19:41:16 clarkb: git repos are in cinder now, i believe
19:41:24 oh right
19:41:37 so that's potentially even easier, unmount, unattach, attach, mount, win
19:41:53 and db is on trove, so that should be fast
19:41:53 zaro: did you run git gc --aggressively?
19:41:56 so detach volume from old server, attach to new server, update dns with a short ttl if it hasn't already and then swap dns records right at the start of the outage
19:42:07 i can write an etherpad for it
19:42:09 abregman: no.
19:42:12 if we don't have any
19:42:21 thanks yolanda!
19:42:24 abregman: we've changed topics, we can continue discussion in the -infra channel
19:42:40 #action yolanda draft a maintenance plan for the gerrit server replacement
19:42:51 thank you
19:42:55 did you also want to send the followup announcement around the one-week mark?
19:42:58 anteaya: oh k, sorry
19:43:01 i will also send a reminder on 4th april
19:43:03 abregman: no worries
19:43:21 #action yolanda send maintenance reminder announcement to the mailing list on April 4
19:43:25 thanks yolanda!
19:43:30 thanks yolanda
19:43:37 glad to help :)
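(A hedged sketch of the maintenance steps clarkb and fungi outline above: stop Gerrit, move the Cinder volume holding the git repos, swap DNS. Server, volume, mount and service names are illustrative; the authoritative plan is the etherpad yolanda volunteered to write.)

    # On the old server: stop gerrit and release the data volume.
    sudo service gerrit stop           # service name may differ
    sudo umount /home/gerrit2          # illustrative mount point

    # From a host with openstack CLI credentials for the hosting cloud.
    openstack server remove volume review-old.openstack.org gerrit-data
    openstack server add volume review-new.openstack.org gerrit-data

    # On the new server: mount the volume and start gerrit; the trove db
    # needs no move.
    sudo mount /dev/vdb /home/gerrit2  # device name may differ
    sudo service gerrit start

    # Finally, repoint DNS (with its pre-lowered TTL) at the new server.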
19:43:38 #topic Ubuntu Xenial DIBs (pabelanger)
19:43:45 ohai
19:43:49 awesome work on these pabelanger :)
19:43:53 #link https://review.openstack.org/#/q/topic:ubuntu-xenial+status:open
19:43:54 so ubuntu-xenial dibs are working
19:44:06 pabelanger: including the puppet runs?
19:44:08 even tested with nodepool launching to jenkins
19:44:09 clarkb: yup
19:44:11 #link https://etherpad.openstack.org/p/infra-operating-system-upgrades
19:44:12 nice
19:44:20 so, that link above has 1 review that needs merged
19:44:27 and we can then turn them on in nodepool
19:44:30 pabelanger: have you run a devstack-gate reproduce.sh script against one of the images yet?
19:44:30 oh, i see i skipped a topic a couple back, i'll thread that one in next (sorry zaro, hashar!)
19:44:34 surprisingly it was straightforward
19:44:42 clarkb: not yet.
19:44:46 clarkb: I can do that later today
19:44:50 fungi: or we can skip git-review and follow up on list
19:45:12 either way, our puppet manifests and dib elements work well
19:45:19 xenial isn't released until april 21st (beta 2 was last thursday), but I don't anticipate any ground-breaking changes between then and now
19:45:24 er, now and then
19:45:35 pabelanger, so not much changes needed right?
19:45:36 good work
19:45:48 yolanda: right, see the topic above for all the patches
19:45:50 pleia2: ya we can just avoid switching any jobs over until release has happened
19:46:01 python35 is the other potential place we will see issues
19:46:06 clarkb: yeah
19:46:20 xenial ships with 35?
19:46:25 anteaya: yes
19:46:28 great
19:46:30 i anticipate the py24-py35 transition will be similar to how we did py33-py34 last year
19:46:43 er, s/py24/py34/
19:46:49 fungi: well and we may need to decide if we want to do 34 and 35
19:46:55 but yes
19:47:37 right, we had the luxury last time of considering py3k testing a convenience and dropped it from stable branches so we could just cut over to py34
19:48:13 though in this case we're not maintaining special platforms just for py34 testing, so there's less incentive to drop it anyway
19:48:25 py33 testing was kinda hacky
19:48:38 pabelanger: do you have this in a project config somewhere that we could run an early devstack job on to shake out bugs?
19:49:08 sdague: once all changes by pabelanger merged, see above for review link: Yes
19:49:21 that would be extremely easy to add once we have the patches in to start building images/booting nodes
19:49:24 Yup, https://review.openstack.org/#/q/topic:ubuntu-xenial+status:open are the current patches needed to land
19:49:33 fungi: indeed
19:50:07 okay, any other questions we need to address on this topic in the meeting before i move on (or rather, back to the topic i unceremoniously skipped earlier)?
19:50:15 none here
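(A hedged sketch of the kind of local build pabelanger's work above enables; the element list here is illustrative, the authoritative one lives in the linked ubuntu-xenial patches.)

    # Hedged sketch: build an ubuntu-xenial image locally with
    # diskimage-builder; DIB_RELEASE selects the Ubuntu release.
    export DIB_RELEASE=xenial
    disk-image-create -o ubuntu-xenial ubuntu-minimal vm simple-init

    # Quick smoke test of the result before handing it to nodepool.
    qemu-img info ubuntu-xenial.qcow2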
19:50:34 #topic git-review release request (zaro, hashar)
19:50:40 FYI, Py35 works perfectly with everything right now.
19:50:41 #link http://lists.openstack.org/pipermail/openstack-infra/2016-March/004058.html git-review release request
19:50:50 (tested at build time in Sid)
19:51:10 so in short git-review's last release is from June 12th 2015
19:51:21 just wondering what needs to happen for a release?
19:51:37 whether i can help with that?
19:51:40 I could myself use the optional push url feature to be released. That lets you bypass the creation of an additional remote named "gerrit"
19:51:48 which causes folks to fetch from both origin and gerrit remotes whenever they do git remote update
19:51:48 A tag, then ping me to build the package in Debian, then I'll ping Ubuntu ppl?
19:51:57 so I guess a tag
19:52:04 i replied on the ml thread as well, but want to see someone get any remaining bug fixes or test improvements flushed from the review queue before we tag a new release. we should consider git-review feature frozen for the moment, but can figure out what new features make sense to add once we have the current master state polished and released
19:52:16 on the list one hinted at looking for open changes that one might want to get approved before tagging a release
19:52:24 Remember: we have a few days for a freeze exception so that it reaches the next LTS. Do we want that new version in 16.04?
19:52:29 I ran into a bug the other day
19:52:37 git review -d fetched a patch from a different git repo
19:52:44 I swear it used to fail on that
19:52:52 zigo: yup would be good to have it in before Ubuntu freeze
19:53:11 hashar: *IF* there's no regressions! :)
19:53:13 clarkb: that would definitely count as a regression, please get up with me later if you need help reproducing
19:53:17 clarkb: yes, i still use an old version and it fails on that
19:53:24 jesusaur: what version?
19:53:31 will help us bisect
19:53:44 hashar: Also, Ubuntu Xenial *IS* frozen, we just happen to have FFE for all OpenStack things until Mitaka is out.
19:53:46 clarkb: 1.23
19:54:10 (and I guess git-review could be included in the FFE)
19:54:14 zigo: hashar: well, git-review shouldn't be an openstack-specific thing
19:54:16 zigo: doh! :-) then it will be in the next stable or maybe we can push it via xenial-updates or similar
19:54:41 https://wiki.ubuntu.com/XenialXerus/ReleaseSchedule
19:55:26 okay, so it sounds like nothing significant to add to this topic aside from what is in the ml thread, so we should follow up there once someone has a chance to run back through the outstanding changes and make suggestions for missing fixes (including the regression clarkb spotted)
19:55:29 hashar: The question is, are there features we *must* have, or is the current version in Xenial just fine?
19:55:41 fungi: wikimedia community definitely uses git review
19:56:02 zigo: no clue :/
19:56:20 Let's switch topic then! :P
19:56:22 hashar: yep! i definitely want to look at git-review as something developed by the openstack community for anyone using gerrit, not just for people using _our_ gerrit deployment
19:56:43 #topic Infra cloud (pabelanger)
19:56:51 #link http://lists.openstack.org/pipermail/openstack-infra/2016-March/004045.html
19:57:01 This is a simple question, did we confirm we are doing 2 drops per server or 1?
19:57:01 fungi: git-review has received a wide range of contribs from the Wikimedia community for sure :-}
19:57:12 we talked about it at the mid-cycle, but haven't seen it brought up
19:57:32 if not, we should ask the HP NOC team about it
19:57:52 clarkb had mentioned that as a preferred deployment model so that we could skip the nasty bridge-on-bridge action we had in west
19:58:00 indeed
19:58:01 also
19:58:08 we had to do some stuff in glean
19:58:17 right but I think 10Gbe is probably more valuable than 2 drops
19:58:22 and we don't have that right now?
19:58:23 as it was not ready by the time to handle the vlans thingy we were using
19:58:26 right, glean support for bridge configuration post-dated the west design
19:58:27 I don't think they are giving us 10G
19:58:35 crinkle: :(
19:58:37 oh
19:58:47 (sad trombone)
19:58:52 and I don't think we impressed hard enough that we wanted 2 drops
19:59:03 right because with 10Gbe we would deal
19:59:05 like we did before
19:59:14 hmm, so now just 1 nic 1GB
19:59:14 ?
19:59:17 but if we are only getting gig then I think we need to impress on them that we need it
19:59:20 Is it too late to ask?
19:59:20 it looked like the servers from west all had at least two 1gbe interfaces, but some had only one 10gbe while a few seemed to have two
19:59:29 i don't think it will be hard for venu to accommodate two nics
19:59:30 for 2 drops
19:59:50 i mean
19:59:55 i've managed the gozer baremetal
19:59:57 fungi: it was 2x10Gbe with only one gbic installed and 2xgig iirc
20:00:01 and that's the setup i had
20:00:06 2 nics
20:00:17 yeah, cat4e patch cables are likely no problem for them at all. copper 10gb sfps/switches on the other hand...
20:00:20 looks like we are at time
20:00:21 and never had any hold up on that
20:00:27 from venu's confirmation email: "We checked with DC Ops and they said all the nodes have only 1G NICs on them. So nodes to TOR switch are 1G connections."
20:00:36 i can follow up with them
20:00:40 s/cat4e/cat5e/
20:00:40 crinkle: what, that's not true
20:00:45 not sure what's with my fingers today
20:00:50 oh, also we're out of time
20:00:51 heh
20:00:52 crinkle: I am almost 100% positive we had 10Gbe nics in every one of the machines
20:01:03 whee
20:01:12 pleia2: we'll get to your topic first on the agenda next week if that's okay
20:01:17 it was those silly nics that caused kernel issues constantly
20:01:18 thanks everyone!!!
20:01:22 thank you
20:01:23 #endmeeting
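(A post-meeting hedged sketch of the "a tag" step asked about in the git-review topic above; the version number is made up, and it assumes the tagger has release permissions and that publication jobs trigger on the pushed signed tag.)

    # Hedged illustration only; version and remote name are assumptions.
    git checkout master && git pull --ff-only
    git tag -s 1.26 -m "git-review 1.26"
    git push gerrit 1.26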