19:01:35 #startmeeting infra
19:01:36 Meeting started Tue Jul 28 19:01:35 2015 UTC and is due to finish in 60 minutes. The chair is jeblair. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:37 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:40 #link agenda https://wiki.openstack.org/wiki/Meetings/InfraTeamMeeting#Agenda_for_next_meeting
19:01:40 The meeting name has been set to 'infra'
19:01:40 #link previous meeting http://eavesdrop.openstack.org/meetings/infra/2015/infra.2015-07-21-19.02.html
19:01:48 Morning
19:01:51 o/
19:01:51 #topic Specs approval
19:01:55 #topic Specs approval: Zuul v3 (jeblair)
19:02:08 #link zuulv3 spec https://review.openstack.org/164371
19:02:17 i wrote a spec
19:02:38 i think we should do it
19:02:48 i confess that it does not hold all the answers
19:03:47 but i think it describes enough that several people can work on aspects of the problem in parallel
19:03:51 I'm a fan and I'm looking forward to it. I'm sure we can figure out what's left as we go
19:04:16 it's certainly been hammered on for a while
19:04:38 i'm willing to take on some bits of that spec
19:04:45 as a downstream consumer, i'm worried about others
19:04:56 especially about the jenkins replacement
19:05:40 yeah, that will be some work
19:05:49 but it beats hiring a bunch of people to restart jenkins all the time
19:06:00 +1000 as the person that did that yesterday
19:06:17 my upstream bug is still DOA fwiw
19:06:23 yes, we have jenkins problems as well
19:06:28 anyway, anyone think it's not ready for voting, or shall we vote on it?
19:07:19 * fungi is fashionably late
19:07:26 I think it's as far as it'll get in this phase, voting sounds good
19:07:27 apologies
19:07:41 #info zuulv3 spec voting open until 2015-07-30 19:00 UTC
19:07:47 yes, please open voting on the zuul v3 spec ;)
19:07:53 woot
19:07:56 #topic Priority Efforts
19:08:16 i'd like to take a break from our typical "never do status reports" and instead "do status reports" this week
19:08:53 mostly because a lot of us have been out and about recently, and it'll be good to try to resync on where we are now that most of us are back
19:09:04 and i'd love to remove some from the list :)
19:09:20 #topic Priority Efforts (Swift Logs)
19:09:29 who knows what's up here?
19:09:51 jhesketh ^ otherwise I can fill in what I know
19:10:21 the os-loganalyze changes to support configurable file passthrough from swift are in place
19:10:34 that gets us "host binary files" ?
19:10:41 Yep so now the next part is just changing more jobs over
19:10:45 Yes
19:10:47 so now we just need to deploy config that says pass through anything that isn't a log so that js and disk images etc work properly
19:11:06 I think pleia2 and mtreinish discovered a traceback today in the swift upload script which we should look at too
19:11:50 Yeah so part of changing the jobs over is figuring out what needs to be passed through or not too
19:11:55 yeah, I can dig up the link if needed
19:11:57 Oh?
19:12:05 That'd be handy
19:12:26 jhesketh: our existing apache mod_rewrite rules can be used as the basis
19:12:34 we already decide that for on-disk stuff
19:12:44 pleia2: and possibly copying that traceback into paste.o.o before jenkins expires it in the next day
19:12:57 clarkb: yep :-)
19:13:13 fungi: good idea, I can share this over in -infra since we don't need to debug during the meeting
19:13:21 jhesketh: do you want to write that change? I can give it a go if you are busy
19:13:21 thanks!
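The pass-through change discussed above amounts to deciding, per request, whether an object should go through os-loganalyze's log filtering or be served untouched (js, images, disk images). A minimal sketch of that decision in Python follows; the suffix list and function name are illustrative only, not the actual os-loganalyze configuration.

```python
# Hypothetical sketch of the "pass through anything that isn't a log" rule
# discussed above; os-loganalyze's real config and function names differ.
import os

# Suffixes we are willing to run through the log-filtering/HTMLification path.
LOG_SUFFIXES = ('.txt', '.log', '.txt.gz', '.log.gz', 'console.html')


def should_filter(path):
    """Return True if the object should go through the log filter,
    False if it should be passed through untouched (js, images, etc.)."""
    name = os.path.basename(path.rstrip('/'))
    return name.endswith(LOG_SUFFIXES)


if __name__ == '__main__':
    for p in ('gate/job/console.html', 'gate/job/logs/screen-n-api.txt.gz',
              'gate/job/images/foo.qcow2', 'gate/job/js/app.js'):
        print(p, '->', 'filter' if should_filter(p) else 'pass through')
```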
19:13:28 pleia2: okay thanks
19:13:40 #action pleia2 provide swift log upload traceback
19:14:19 clarkb: I can do the config change today but I don't think I'll have time to look at jobs until next week
19:14:26 jhesketh: ok
19:14:28 #action jhesketh update os-loganalyze to use pass-through rules
19:14:29 clarkb: yeah we did
19:14:53 it was the same one from before, about hitting a timeout
19:15:04 what's the status of the index problem we ran into? (where you can't explore jobs that have run for a change)
19:15:23 jhesketh, clarkb: https://jenkins03.openstack.org/job/gate-tempest-pep8/3690/console
19:15:34 (basically, there was no autogen index of things like logs.o.o/PATH_TO_CHANGE/gate/)
19:15:47 jeblair: that's still not addressed, it requires knowledge of all the jobs that have run or will run in order to upload an index.html
19:16:03 clarkb: or it requires os_loganalyze to ask swift, right?
19:16:05 we can probably have os-loganalyze generate it on the fly? do a swift list of that prefix
19:16:07 jeblair: ya
19:16:20 jeblair: ah I wasn't aware of that, but yes that's right
19:16:33 anyone want to write that?
19:16:42 Which makes os-loganalyze more complex
19:16:50 But yeah, I can take a look
19:17:30 (but again not this week unless somebody else wants it)
19:17:42 if I have time to poke at it I will let you know
19:18:00 I am still in a weird state of not knowing how much time I actually have to get things done
19:18:07 Also there were a few tricky parts which is why we generated the indexes at upload time rather than on the fly
19:18:08 anyone else want it? relatively standalone python hacking thing
19:18:35 jhesketh: yeah, i think that makes sense for the jobs, but outside of that, i'm pretty sure we need osla to do it
19:18:49 For example how to handle things that are half on disk and half in swift
19:18:50 at least, this is one of the big questions we had for swift folks early on -- would it be able to handle these kinds of queries
19:19:10 and i think the answer from them was "try it, it should work"
19:19:19 jhesketh: yep; i think osla can figure that out too
19:20:25 Yeah there's a few things we can do
19:20:30 #action jhesketh/clarkb have os-loganalyze generate indexes for directories which lack them
19:21:03 i'd still love it if someone else volunteered for that ^ :) talk to jhesketh and clarkb if you want it
19:21:36 #topic Priority Efforts (Nodepool DIB)
19:22:03 who knows about this one?
19:22:04 the bindep bits of this are basically ready i think
19:22:20 I'm not sure where the larger state of things is here but I want to note that I had a 100% image upload failure rate to rax when I tried to do that recently. I get an error 396 from the api in a json blob
19:22:25 though i discovered that diskimage-builder really, really doesn't like to preserve distro package caches
19:22:39 fungi: awesome!
19:22:46 fungi: has that been solved?
19:23:05 greghaynes apparently solved it a while back for the dpkg provider by giving us a toggle to turn off cache cleanup which i enabled a week or two ago, but this is still unsolved for the yum element
19:23:36 i haven't had time to refactor its avoid-package-caching-at-all-costs bits so that they can hide behind an envvar
19:23:52 but that would be consistent with the fix in the dpkg element at least
19:24:07 (I find this slightly funny because yum is so slow >_>)
19:24:17 fungi: do we have to do it for the new fedora package manager too?
19:24:23 fungi: do you still want to do that or see if anyone else wants it?
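The on-the-fly index idea discussed above ("do a swift list of that prefix") can be sketched with python-swiftclient: listing a container with a prefix and a '/' delimiter returns both objects and pseudo-directories, which is enough to render a simple index page. The container name, prefix, and auth handling below are placeholders, not the real os-loganalyze wiring.

```python
# Minimal sketch of on-the-fly index generation via a swift prefix listing.
# Assumes python-swiftclient; container name and auth are placeholders.


def generate_index(conn, container, prefix):
    """Return a simple HTML listing of everything under ``prefix``.

    ``conn`` is a swiftclient.client.Connection; using delimiter='/' makes
    swift return pseudo-directories as 'subdir' entries alongside objects.
    """
    _headers, objects = conn.get_container(
        container, prefix=prefix, delimiter='/', full_listing=True)
    rows = []
    for obj in objects:
        # Pseudo-directories come back as {'subdir': ...}, objects as {'name': ...}
        name = obj.get('subdir') or obj['name']
        rel = name[len(prefix):]
        rows.append('<li><a href="%s">%s</a></li>' % (rel, rel))
    return '<html><body><ul>\n%s\n</ul></body></html>' % '\n'.join(rows)


# Usage (auth details are placeholders):
#   from swiftclient import client
#   conn = client.Connection(authurl=AUTH_URL, user=USER, key=KEY)
#   print(generate_index(conn, 'logs', '123456/1/gate/'))
```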
19:24:32 we should get someone familiar with that to weigh in and make sure we don't have yum preserving caches just to have dnf nuke it
19:24:46 if someone else has time that would be a pretty trivial patch, it's all bourne shell
19:24:53 anyone ^?
19:25:13 i can take a look
19:25:19 well, s/trivial patch/neatly self contained task/
19:25:24 woo!
19:25:29 thanks ianw!
19:25:46 #action ianw update yum dib element to support disabling cache cleanup
19:26:02 anyway, i have nova unit tests passing with the bindep fallback list on our minimal ubuntu-trusty workers via an experimental job
19:26:03 fungi: what else do we need to do to start using bindep?
19:26:26 and am testing the addition of a custom bindep list in nova itself just to confirm that bit of the macro works
19:27:03 we still need minimal workers for anything we have bare-.* right now (precise is all that's left i think?)
19:27:04 #info fungi testing bindep machinery with nova
19:27:32 and should do some similar tests on one of our centos-6 dib-built workers
19:27:55 and then we need to migrate/replace the jobs
19:28:18 i believe mordred demonstrated an experimental job with devstack/tempest working on ubuntu-trusty instead of devstack-trusty
19:28:33 but last i heard multinode was still in a questionable state on those?
19:28:51 are the glean/hostname issues ironed out now?
19:28:53 iirc they got smoke jobs working
19:29:02 so we need to switch to the full tempest runs and retest, then we can switch
19:29:23 I can write the patch to do that switch
19:29:30 sorry, the switch to full tempest
19:30:04 o hai
19:30:05 what's the delta for us being fully-dib (regardless of whether we're using minimal or traditional images)
19:30:07 but yeah, moving jobs off devstack-.* is apparently a step for this now, unless someone can get dib-built devstack-.* images uploading into rax successfully
19:30:10 the glean/hostname issues should all be worked out as of yesterday (thank you pleia2 for approving that change)
19:30:30 fungi: oh i guess that's a partial answer to my question? :)
19:30:36 fungi: I recommend against that
19:30:38 the second thing
19:30:51 moving jobs off devstack-* and on to the others is the next step
19:30:54 so i think it's either get glean integration for our devstack-.* images or move devstack-.* based jobs to the new glean-using images
19:31:07 mordred: fungi: well the rax dib-built image issue is a problem in either case right?
19:31:13 we are uploading minimal images to rax, right?
19:31:14 clarkb: yes
19:31:19 jeblair: not reliably, no
19:31:30 i think the raw image size is biting us
19:31:30 jeblair: and by not reliably I mean when I tried it I got a 100% failure rate
19:31:53 okay, so we're getting way ahead of ourselves
19:32:03 step 1 is: get images uploaded to rax
19:32:19 and apparently transparent image conversion in glance was a fakeout. the facade is there but the plumbing was missing (i think that's in the process of being nailed down now?)
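Given the intermittent rax upload failures described above (the error 396 responses), one mitigation floated later in the meeting is simply retrying the glance import. A generic retry-with-backoff sketch follows; upload_image is a hypothetical stand-in for whatever call nodepool actually makes, not the real nodepool code.

```python
# Generic retry-with-backoff sketch for flaky image uploads; upload_image is
# a stand-in for the real provider call, not actual nodepool code.
import logging
import time

log = logging.getLogger(__name__)


def upload_with_retries(upload_image, image_path, attempts=3, delay=60):
    """Call upload_image(image_path), retrying on failure.

    Waits ``delay`` seconds between attempts, doubling each time, and
    re-raises the last exception if every attempt fails.
    """
    for attempt in range(1, attempts + 1):
        try:
            return upload_image(image_path)
        except Exception:
            log.exception('Upload attempt %d/%d failed', attempt, attempts)
            if attempt == attempts:
                raise
            time.sleep(delay)
            delay *= 2
```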
19:32:25 (that's the end of my list)
19:32:50 dib image uploads to rax _were_ working last i checked
19:33:00 we just need glean instead of cloud-init and nova-agent
19:33:22 yah - I'm confused as to what broke - but I've been on vacation - I'll try to engage on that again tomorrow
19:33:39 yeah, i see ubuntu-trusty images ready in rax as of ~4.5 hours ago
19:33:49 we are anywhere from one hour to 24 days old on those images
19:33:54 and centos-6 as well
19:33:57 fungi: it's highly unreliable
19:34:19 okay, i haven't been watching it closely so i don't know whether today's a fluke
19:34:32 centos6 is 1-3 days old
19:35:19 mordred: pretty sure it's error 396
19:35:24 which means rax failed
19:35:53 clarkb: oh, yep, i keep forgetting the age column format changed to include days
19:36:01 1) investigate problem uploading images to rax 2) ensure we have minimal images of all operating systems 3) move jobs to use those images 4) drop snapshot and non-minimal-dib images from nodepool
19:36:03 so these images are random numbers of days old
19:36:20 is that the plan ^
19:36:29 jeblair: sounds good to me
19:36:48 #info a plan: 1) investigate problem uploading images to rax 2) ensure we have minimal images of all operating systems 3) move jobs to use those images 4) drop snapshot and non-minimal-dib images from nodepool
19:37:01 and yeah, i think the last theory i heard is that we're possibly either overwhelming the nodepool.o.o interface when uploading them all in parallel or we're running afoul of some upload throttle on the provider end
19:37:13 #action mordred investigate problem uploading images to rax
19:37:21 fungi: i thought we uploaded in series
19:37:24 because when i test manually i can upload, but that's just one upload
19:37:31 oh
19:37:39 jeblair: we do
19:37:40 greghaynes was going to look into the nodepool dib worker....?
19:37:54 but it's the 16GB image uploaded in parallel chunks iirc
19:37:57 am i remembering correctly?
19:37:59 maybe we can force that to be serial too?
19:38:03 hrm, then i guess it's not parallel uploads, unless something's gone wrong with our upload serialization
19:38:06 jeblair: if it's error 396 - we just need to retry the glance import
19:38:15 jeblair: and I either wrote a patch for that or thought about it
19:38:17 jeblair: yes he is vacationing but has a working poc that runs the worker on the same host as the nodepool daemon
19:38:19 oh, right, chunk parallelization not image parallelization
19:38:27 jeblair: iirc the next step for that is to make it not depend on the database
19:38:35 clarkb: cool
19:38:42 it's funny how all these seemingly unrelated projects are so intertwined
19:38:45 #info greghaynes working on nodepool image build worker, poc in progress
19:39:10 okay, i think we've synced up enough on this, at least, enough to know where we need to sync up more later.
19:39:15 #topic Priority Efforts (Migration to Zanata)
19:39:36 how's it going with the i18n team?
19:39:55 the team is slowly signing up.
19:40:09 and they've been providing feedback that we're processing: https://etherpad.openstack.org/p/zanata-feedback
19:40:51 most of it is not really infra-specific, the service seems to run well, just evaluation of zanata itself and expectations around what it's supposed to do
19:41:23 pleia2: any (a) infra potential-blockers or (b) zanata-itself potential-blockers?
19:41:28 I think it's generally going well, I hope to sync up with people on list about the issues that have been raised to make sure we're closing all the feedback loops
19:41:40 there are two major infra things outstanding
19:42:09 1. the upstream wildfly module is in heavy development on github, we should look at upgrading ours before a solid launch of the service
19:42:09 #info zanata trial underway with i18n team
19:42:13 #link https://etherpad.openstack.org/p/zanata-feedback
19:42:42 pleia2: the upstream puppet module?
19:42:56 2. thanks to nibalizer we now have the puppet-community/archive module on all our servers, which makes handling download, unzip, install a less horrible process, so we should use this for installing the server and client
19:43:00 clarkb: yep
19:43:36 pleia2: those both sound tractable!
19:43:37 #1 just needs testing of our deployment with the newer version, #2 needs me to rewrite our module to use archive
19:43:54 totally, we're doing well
19:43:54 i wonder if we could reapply that method to some other modules we have installing things from tarballs
19:44:09 what's the status of the "admins can make languages but we want our scripts to do it and not have admins" issue?
19:44:10 fungi: yes
19:44:23 fungi: i read that as 'installing from terribles'
19:44:31 pleia2, nibalizer: should we take that archive work back out into our other things that do a similar process?
19:44:35 oh
19:44:37 fungi: jinx
19:44:47 jeblair: i thought of it as "memories of slackware"
19:45:00 jeblair: not "fixed" but our infra user now creates things and ultimately it's not a blocker, we're trying to just ignore it for now and tell the admins to behave
19:45:11 mordred: yes
19:45:14 and the infra user uses the api via scripts
19:45:18 pleia2: what do we still have admins for?
19:45:30 getting away from exec { 'curl something': unless => 'something' } is nice
19:45:33 much more robust code
19:45:36 worth noting I saw an email from transifex about some largish changes on their end. This may affect us. Does anyone know if we are affected?
19:45:50 jeblair: adding new users to translation teams; admins have all kinds of powers that aren't granular
19:46:08 the translations team doesn't yet have a coordinator for every language, so an admin is needed to get people added where they belong
19:46:21 ok
19:46:23 clarkb: I looked at it and it didn't look related to us at all. let's see
19:46:25 AJaeger just brought that up to Daisy in an email earlier today
19:46:34 pleia2: are those things that could conceivably be driven from configuration management eventually?
19:46:37 that == list of coordinators
19:46:53 AJaeger: they said paths were changing which I think may affect our use of the client and api?
19:46:54 fungi: not sure yet
19:47:18 clarkb: I thought it was only for the glossary which we do not use
19:47:27 ++ for archive instead of exec
19:47:30 AJaeger: oh gotcha
19:48:13 clarkb: but if *every* path changes, we might need a new transifex - since we only communicate with the client.
19:48:24 new transifex *client* I mean
19:48:37 i mean, we do need a new transifex, we just call it zanata
19:48:38 ;)
19:48:45 hehe
19:48:55 on that note
19:48:56 #topic Priority Efforts (Downstream Puppet)
19:49:12 nibalizer, i wanted to talk with you about the boilerplate changes
19:49:18 yolanda: okay
19:49:21 i see too much code repeated in each puppet project
19:49:26 jeblair: for sure!
19:49:28 can we do it in a better way?
19:49:30 asselin isn't here to talk about the openstackci part; but last i heard, it was pretty close and needed a sample manifest
19:49:40 jeblair: that's correct
19:49:50 there is a sample manifest up for review from asselin
19:50:11 the puppet-httpd stuff is done or almost done
19:50:21 and elsewhere, tons of changes in flight to refactor stuff out of system-config
19:50:24 I have a big set of simple patches https://review.openstack.org/#/q/status:open+puppet+message:beaker,n,z
19:50:37 for nodepool, there is a chain of patches pending to land as well
19:50:43 follow-on from the puppet-httpd stuff will remove the conditional logic added to make the transition possible
19:50:45 nibalizer: are we ready to start doing any public hiera stuff?
19:50:49 there is some -1 for puppet_os_client_config
19:50:49 jeblair: ya
19:51:02 and I'd like to enable beaker tests on all infra modules (nonvoting); right now they don't run
19:51:28 (did we get zuul cloner + beaker figured out?)
19:51:44 I wouldn't mind some help reviewing open patches for fedora22: https://review.openstack.org/#/c/203729
19:51:53 clarkb: we did
19:51:56 bumping puppet-ntp to a newer version
19:52:07 are the openstackci patches
19:52:08 er
19:52:16 are the openstackci beaker jobs working again?
19:52:32 jeblair: no I think we have NOT_REGISTERED again/still
19:52:46 I have fixes for some of that
19:52:49 pabelanger: jeblair an explicit goal for us should be to update our versions of common libraries: mysql, stdlib, ntp, postgres etc
19:53:05 nibalizer: they were registered, but they were failing because we were missing a build-dep; i made a patch for that which merged
19:53:21 the NOT_REGISTERED is for a centos / fedora node, which is just waiting for +A
19:53:35 nibalizer: why do we want to upgrade those libraries now?
19:53:51 https://review.openstack.org/#/c/205668/ is the review
19:53:55 nibalizer: can that wait until we accomplish some of the other outstanding work?
19:54:02 jeblair: yeah we can wait on it
19:54:04 jeblair: gate-infra-puppet-apply-devstack-centos7 is still NOT_REGISTERED
19:54:16 jeblair: I feel a need to be 'ready' for infra cloud
19:54:23 AJaeger: the patchset I linked fixes that
19:54:44 pabelanger: that's puppet-apply, not beaker
19:54:59 I expect at some point crinkle or spamaps will be like 'okay let's add the puppet-openstack stuff to system-config' and we'll run into module incompatibility on those libraries
19:55:00 which is still important
19:55:06 if we have masterless puppet working then there's less need to have modules updated for infra cloud
19:55:11 AJaeger: you linked the puppet-apply job
19:55:30 AJaeger: oops
19:55:31 now we've discussed this before, and a number of workarounds have been suggested, but I think the best thing is to just get our stuff current
19:55:35 and iirc mordred was hoping to have masterless puppet first/soonish
19:55:49 yes. I want masterless puppet. I got busy. I'm sorry!
19:55:52 jeblair: right. I looked at beaker today, seemed like it worked
19:56:24 nibalizer: even if we get modules updated there are bound to be some kind of incompatibility between the openstack modules and the infra modules
19:56:24 #action nibalizer add beaker jobs to modules
19:56:40 #action nibalizer make openstackci beaker voting if it's working (we think it is)
19:56:45 nibalizer: ^ i took a liberty there :)
19:56:53 nibalizer: Ya, I want to get them upgraded too.
19:57:09 While I'm doing a bad job of vacationing - yep, I've been messing with nodepool dib workers, I've gotten it pretty far along, with nodepool spawning workers when it launches and having upload and build triggered by gearman. I also noticed that our ubuntu-trusty images in rax are not resizing to use the full disk space, which was causing our devstack tests to fail on them.
19:57:10 jeblair: right now the testing is quite cowardly... I'm not sure if you'd rather a coward vote or a hero nonvote
19:57:10 nibalizer: maybe a periodic job to test master modules too (or something)
19:57:21 i don't think upgrading for the sake of upgrading is our highest priority
19:57:23 (sorry for the out of context)
19:57:36 jeblair: I agree
19:57:42 jeblair: that's fair
19:58:02 i'd much rather get things like masterless puppet, in-tree hiera, real testing, etc. in place, then upgrade either when we need to or run out of other things to do :)
19:58:12 yea
19:58:15 ++
19:58:36 nibalizer: coward vote and expand as it becomes less... cowardly
19:58:36 upgrades, for me, are to add fedora22 support. Mind you, only a few modules are lacking it right now
19:58:37 #action nibalizer create first in-tree hiera patchset
19:58:42 oh i can't do that
19:58:53 #undo
19:58:54 Removing item from minutes:
19:59:05 nibalizer: why not?
19:59:09 i could take it if i get some background
19:59:12 (that made me really happy)
19:59:24 oh i thought no-response from the bot meant I didn't have bot-acls to add it
19:59:25 (and undoing it made me really sad)
19:59:31 if mordred has the script going, I can help with the masterless puppet launcher.
19:59:35 i'm happy to do that work, it's only about ~5 minutes
19:59:40 nibalizer: oh, you can totally do that
19:59:44 #action nibalizer create first in-tree hiera patchset
19:59:56 it never responds to actions, links, infos
20:00:04 ahhh
20:00:14 cool, we're out of time, but i think we got through all the priority things
20:00:19 woot
20:00:19 backlog next week
20:00:31 thanks everyone, especially those i put on the spot today! :)
20:00:35 #endmeeting
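The gearman-triggered build/upload flow greghaynes describes at 19:57 would, on the worker side, look roughly like the sketch below. It assumes the python 'gear' library; the 'image-build' job name and the build_image() helper are made up for illustration and are not taken from the actual nodepool patches.

```python
# Rough sketch of a gearman-driven image worker, along the lines of the
# nodepool dib-worker POC mentioned above. Uses the python 'gear' library;
# the job name 'image-build' and the build_image() helper are hypothetical.
import json

import gear


def build_image(name):
    # Placeholder for the actual diskimage-builder invocation.
    return '/opt/nodepool_dib/%s.qcow2' % name


def main():
    worker = gear.Worker('dib-worker')
    worker.addServer('localhost')      # gearman default port 4730
    worker.registerFunction('image-build')
    while True:
        job = worker.getJob()          # blocks until a build is requested
        args = json.loads(job.arguments.decode('utf-8'))
        path = build_image(args['image-name'])
        job.sendWorkComplete(json.dumps({'path': path}).encode('utf-8'))


if __name__ == '__main__':
    main()
```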