Wednesday, 2016-11-02

*** smatzek has joined #openstack-powervm00:42
*** smatzek_ has joined #openstack-powervm00:44
*** tjakobs has joined #openstack-powervm00:45
*** seroyer has joined #openstack-powervm00:45
*** smatzek has quit IRC00:48
*** smatzek_ has quit IRC01:54
*** thorst has quit IRC01:55
*** thorst has joined #openstack-powervm01:56
*** thorst has quit IRC02:04
*** seroyer has quit IRC02:36
*** thorst has joined #openstack-powervm03:02
*** thorst has quit IRC03:10
*** thorst has joined #openstack-powervm04:09
*** thorst has quit IRC04:15
*** thorst has joined #openstack-powervm05:12
*** thorst has quit IRC05:20
*** thorst has joined #openstack-powervm06:19
*** thorst has quit IRC06:25
*** k0da has joined #openstack-powervm07:21
*** thorst has joined #openstack-powervm07:23
*** k0da has quit IRC07:29
*** thorst has quit IRC07:30
*** openstackgerrit has quit IRC07:48
*** openstackgerrit has joined #openstack-powervm07:48
*** thorst has joined #openstack-powervm08:27
*** apearson has joined #openstack-powervm08:29
*** k0da has joined #openstack-powervm08:34
*** thorst has quit IRC08:35
*** apearson has quit IRC08:45
*** thorst has joined #openstack-powervm09:32
*** thorst has quit IRC09:39
*** apearson has joined #openstack-powervm09:42
*** apearson has quit IRC10:27
*** thorst has joined #openstack-powervm10:37
*** apearson has joined #openstack-powervm10:39
*** thorst has quit IRC10:45
*** apearson has quit IRC10:45
*** apearson has joined #openstack-powervm10:55
*** apearson has joined #openstack-powervm11:02
*** seroyer has joined #openstack-powervm11:23
*** smatzek_ has joined #openstack-powervm11:29
*** thorst has joined #openstack-powervm11:49
*** wangqwsh has joined #openstack-powervm11:51
*** edmondsw has joined #openstack-powervm12:21
*** apearson has quit IRC12:24
*** apearson has joined #openstack-powervm12:52
*** wangqwsh has quit IRC13:06
*** wangqwsh_ has joined #openstack-powervm13:06
*** wangqwsh_ is now known as wangqwsh13:06
*** mdrabe has joined #openstack-powervm13:20
*** seroyer has quit IRC13:24
adreznecJeez thorst, an hour long scrum. Must have a lot of status to report13:31
thorstadreznec: I just love taking your time13:31
thorstefried esberglu wangqwsh: around for virtual scrum?13:31
* adreznec runs and hides13:32
thorsto yes, lets make this semi-official13:32
adreznecI wonder if the meeting controls work outside the meeting channels...13:32
efriedDo we need to re-summarize what broke the CI?13:33
thorstnah, I know13:33
thorstI borked it13:33
adreznec#startmeeting CI Scrum13:33
openstackMeeting started Wed Nov  2 13:33:44 2016 UTC and is due to finish in 60 minutes.  The chair is adreznec. Information about MeetBot at
efriedWith good intentions, of course.13:33
openstackUseful Commands: #action #agreed #help #info #idea #link #topic #startvote.13:33
openstackThe meeting name has been set to 'ci_scrum'13:33
adreznec#topic Overview of current status13:33
adreznecI'll turn the floor over to thorst13:34
thorstOK - so the CI runs on Tempest VMs.  They run remote to the NovaLink.  So the fact that I made them with a file path is what was broken13:34
*** tblakes has joined #openstack-powervm13:35
thorstunfortunately we didn't catch it because the CI was down, and now we can't get the CI running until we get a fix in13:35
*** smatzek_ has quit IRC13:35
thorsthere are my thoughts...13:35
thorstwe could add a config option to go back to the old way.  I think for the image size it's fine.  We needed this new path for when people like kriskend and seroyer deploy twenty 20-gig images13:36
thorstbut our CI is doing very small images, and most are linked clones (though not all)13:36
efriedSo a secret config option or a public one?13:36
efried(I'm down with the idea, btw.13:36
adreznecWe're open source, hard for it to be a secret config option :P13:36
thorstefried: I think it should be public13:36
esbergluthorst: We are moving to large images once we add in the OSA stuff13:37
thorstand I think that we need it public for our new in tree driver too.13:37
thorstesberglu: larger for the under cloud, but not the VMs that tempest deploys13:37
efriedSo we're talking about reinstating the IterableToFileAdapter and all that13:37
thorstthe under cloud would actually use the new path...because it is running on the novalink13:37
esbergluNo larger for the vms. OSA nodes need 50G free space13:37
thorstefried: well, I replaced it with a ChunkyFileIter in an old path...13:37
efriedin an old patch on the same change set?13:38
adreznecesberglu: yeah, but not for the guest VMs deployed as part of the CI run13:38
esbergluNM i'm dumb13:38
thorstesberglu: run...but that's all VMs that the under cloud provisions, not what the Tempest running in the VM provisions13:38
esbergluYeah I was confused by what you meant13:38
thorstrun -> right13:38
esbergluI get it now13:38
efriedNope.  In an abandoned change set?13:38
* adreznec can't help but laugh every time he sees ChunkFileIter13:39
thorstefried: yeah, its actually a previous version of what merged13:39
thorstbut what I'm not sure about is what to call this opt13:39
thorst"compat_upload_mode" or something13:39
adreznecAnyway. So basically this configopt would provide a way to revert back to that not-quite-as-old behavior13:39
adreznecI guess it depends on whether we want this configopt to be specific for toggling to the old upload function13:40
adreznecOr if we want it to be a development-only configopt for allowing the driver to work remotely in the future13:40
thorstthat's kinda neat.13:40
adreznecWhere we could have other things this configopt "fixes"13:40
thorstbut whatever it is, it will need to persist into the future in-tree driver too13:40
adreznecSomething like "remote_driver_dev"13:41
thorstadreznec: slippery thing you know we'll have config opts for remote ips13:41
thorstand look like a remote driver.13:41
efriedYeah, that's what I meant by a "private" option - basically either undocumented, or documented as "don't use this unless you're us"13:41
adreznecthorst: yeah13:41
thorstefried adreznec: another idea...13:42
thorstis there a way we could ... hide this somehow?13:42
thorstput a pypowervm patch in that says 'no, you don't get to do that'13:42
efriedI was looking at it, and I couldn't see a way to do it easily.13:42
thorstwe'd have to read in from the file path...and then pipe that into the REST layer13:42
efriedOr we put the IterableToFileAdapter (or the artist formerly known as) into pypowervm.13:43
thorstefried: well, really a FileToIterableAdapter13:43
efriedAnd then pypowervm detects remote and overrides the specified option, ignoring the function.13:43
thorstwell, you need the function...cause that's the only interlock into glance13:43
thorstI kinda prefer that cause we need to patch the pypowervm in OSA already...13:44
thorstand in our devstack...13:44
thorstits trickier but I think it prevents slipperiness13:44
efriedoh, you want this as part of local2remote patch, not a permanent fixture in pypowervm?13:44
efriedI guess that makes sense.13:44
thorstefried: right.13:44
adreznecBasically this would be another library tweak for CI13:45
thorstthose were the only two ways I could think of...  Config opt or hide it in pypowervm local2remote.  I prefer the second cause it will just work with everything and limit it to our CI13:45
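For context, the chunked-upload adapter being discussed can be sketched roughly like this. The class name comes from thorst's "ChunkyFileIter" above, but the shape and defaults are guesses, not the actual pypowervm code:

```python
class ChunkyFileIter(object):
    """Iterate over a file's contents in fixed-size chunks.

    A sketch of the adapter thorst mentions: it wraps a local file path
    in an iterable of byte chunks, which is the form a streaming REST
    upload wants. Illustrative only -- not the real pypowervm code.
    """
    def __init__(self, path, chunk_size=65536):
        self.path = path
        self.chunk_size = chunk_size

    def __iter__(self):
        with open(self.path, 'rb') as fh:
            while True:
                chunk = fh.read(self.chunk_size)
                if not chunk:
                    return
                yield chunk
```

The inverse direction (the IterableToFileAdapter efried mentions) would wrap a glance image iterator so it reads like a file, which is why the upload function remains the interlock into glance either way.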
efriedbtw, does local2remote become moot if we decide to support remote pypowervm officially?13:45
thorstefried: nah13:45
*** seroyer has joined #openstack-powervm13:45
thorstbecause its really a question of whether or not we support nova-powervm remotely13:46
thorstwhich we don't, except for CI (to allow scale)13:46
efriedOkay, so presumably...13:46
efried#action thorst to propose the local2remote patch to make this work13:46
adreznecAnd I can't see a compelling reason to want to outside development... kind of defeats the purpose of the driver13:47
thorstcan we swap the owner to efried?13:47
thorstcause I want a different action later in meeting  :-D13:47
thorst(I want to spend time updating the nova-powervm proposal rst)13:47
efriedYou'd be quicker at it, but if you show me this interim thing you mentioned (which I have not been able to find), I'll take it on.13:47
thorstOK - we'll work it together.13:48
thorstapearson would be the quickest, but he's decided to be in Europe and be afk13:48
efried#agreed thorst & efried to work the local2remote patch13:48
adreznec#action efried and thorst work together to bring harmony to the CI system13:48
adreznecOk, so that would get applied as part of our existing patch path then13:48
thorstso that brings to second point13:48
adreznecNo new code required from esberglu so far13:48
thorsthow are we going to do our CI for proper nova13:49
thorstI think this patch solves one aspect of it13:49
adreznec#topic In-tree driver CI discussion13:49
*** tblakes has quit IRC13:49
thorstbut the other was that I was planning on localdisk for in-tree.  I think we need SSP at a minimum for CI 'harmony'13:49
adreznecYeah, a bit of a wrinkle13:49
apearson@thorst - so I don't have to read through a ton (yeah, I'm lazy), is there a short summary I can look at to help?13:50
thorstbut...I think its not that awful?  We could lead with SSP...13:50
thorstapearson: don't worry about it - I was just poking on how you're supposedly afk13:50
apearsonoh fine - I know when I'm not wanted...13:50
adreznec #link <-- Driver blueprint13:51
adreznecthorst: would we really lead with SSP only?13:51
adreznecI think we'd also want localdisk in the mix13:51
thorstadreznec: I think we throw both in13:52
thorstsee what sticks13:52
adreznecThat would allow us to run the most basic case of the driver13:52
thorstbut we probably develop SSP first.13:52
efriedThere's a matter of staging, in any case.  We would probably want to ... yeah.13:52
thorstso that we get CI running ASAP13:52
adreznec#agreed on including both localdisk and SSP in the first pass of the in-tree driver13:53
thorstalright...amazing.  We have a plan on those.13:53
thorst#action thorst to update powervm blueprint13:53
thorstdoes that actually do anything?13:53
adreznecAre there any other things we need to decide in the blueprint?13:54
adreznecIt should in the meeting minutes13:54
thorstnot sure...probably, but I haven't looked.13:54
efriedI think it makes stuff appear in a different font in the meeting minutes.13:54
adreznec(if we get meeting minutes)13:54
thorstwell, haven't looked in depth.  E-mail hell.13:54
adreznecLooking at comments now13:54
adreznecFirst one up was about an overview of old powervm vs powervc vs powerkvm vs new powervm driver13:54
thorstyeah, I can put that stuff in.  None of this is really heart burn.13:55
adreznecI think a couple lines on that is fine13:55
efriedI responded to a few of the comments with links to the WIP change sets.13:55
thorstand some we already discussed in I think we're really OK here.13:56
thorstnext topic?13:56
efriedThere's probably only three or four comments that need some nontrivial text added.13:56
adreznec#topic Next steps on stabilizing CI13:56
esbergluI have a couple things for that13:56
adreznecSo esberglu once we land the updated local2remote patch, what's next?13:56
*** tblakes has joined #openstack-powervm13:56
esbergluadreznec: I just saw your comment about disabling stable/mitaka runs. stable/mitaka is not compat. with 16.04, which we have now moved to13:57
adreznecAh, right13:57
adreznecOk, I think I'm ok with dropping that from CI runs...13:57
esbergluBut also I think there is another issue. The run where we discover the above remote thing only took 1 hour. Some are still taking forever / timing out13:58
adreznecWe'd be stuck with it through the next cycle without CI going, but... I'm not sure it's a big deal13:58
esbergluI think there is a devstack config option to force runs even though devstack hasn't been tested on 16.0413:58
*** smatzek has joined #openstack-powervm13:59
esbergluIf we want to try that on staging at some point and see what happens13:59
adreznecYou can always force the run13:59
thorstadreznec esberglu: we need 16.04 because OSA, right?13:59
adreznecNot sure it's worth the headache down the road13:59
adreznecWell and for Ocata13:59
adreznecOcata isn't going to support trusty for most projects by the end of the cycle14:00
adreznecSo we'd be here in a month or two anyway14:00
thorstOK - yeah, I'm OK with that.14:01
thorstunfortunate, but OK.14:01
thorstcan't do something like that when in tree...14:01
efriedSo the timeouts appear to be related to our multiplexed image upload algorithm.14:01
thorstefried: when you say multiplexed...14:01
efriedWe need a deeper debug (I'm probably on the hook for that); but I think a broader design discussion may be in order.14:01
thorstdo you mean my code or your marker lu thing?14:01
adreznecWe'll need to figure out handling multiple image "flavors" for different branches down the road14:01
efriedprobably the wrong term.  I mean the marker LU thing.14:02
adreznecFortunately we have ~2 years to figure that out, probably14:02
thorstefried: and how much of that is due to marker lu or the fact that the file never actually uploads (my thing)14:02
efriedthorst, you mean the thing that _just_ happened?  Not related.  The marker-based upload stuff behaves properly in that scenario.14:03
efriedWhich is why this is kinda bizarre.14:03
esbergluadreznec: That multiple flavor thing will be a piece of cake once zuul v3 comes out14:03
efriedIt should be behaving the same on any other kind of failure.14:03
adreznecThat's why I don't think it's worth chasing now14:03
adreznecWhen we get more complex config (static nodes, etc) with zuulv314:03
thorstso revisit when we have things a bit more stable (patch landed)14:03
thorstready to move onto the issues that wangqwsh is hitting?14:04
adreznecOk, we'll need to have a deeper dive into this once we land the local2remote stuff14:04
efriedaren't we still discussing the upload hangs?14:05
* thorst waiting...14:05
efried1) I wonder if we need to move the marker LU *creation* inside the try/finally; 2) I wonder if we somehow need to handle the scenario where deleting the marker LU fails; but most profoundly, 3) should we consider a timeout of some kind, where I can delete a marker LU I didn't create if a certain amount of time has elapsed? (scary)14:05
efriedIt's possible 3a) we can detect whether the real image LU hasn't been created for "a while" and act then.14:06
thorsta timeout scares the hell out of me14:06
thorsta timeout where we see no progress being made doesn't scare me14:06
efriedYeah, there's no way we can really set expectation for the speed of the actual upload.14:06
thorstwell, if we see any bytes moving...then ok14:07
thorstbut do we even get that visibility?14:07
efriedYeah, I don't know if there's a way to detect how much of the upload has happened.14:07
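The "timeout only when no progress" idea thorst prefers could be sketched like this. Both callables are hypothetical placeholders -- as noted above, the REST schema only exposes LU capacity, so a real progress probe would need some other source of "bytes moved" visibility:

```python
import time

def wait_for_upload(is_done, progress, stall_timeout=600.0, interval=1.0):
    """Wait for an upload, failing only if no progress is observed.

    Sketch of a stall-based timeout: rather than capping total upload
    time (which can't be estimated), give up only when the progress
    probe stops changing for stall_timeout seconds.
    """
    last = progress()
    stalled = 0.0
    while not is_done():
        time.sleep(interval)
        cur = progress()
        if cur != last:
            # Bytes are moving -- reset the stall clock.
            last, stalled = cur, 0.0
        else:
            stalled += interval
            if stalled >= stall_timeout:
                raise TimeoutError(
                    'upload made no progress for %.0f seconds' % stalled)
```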
*** tjakobs has joined #openstack-powervm14:08
adreznecDo we really have visibility into the rate of data happening in the upload?14:08
efriedThe schema doesn't provide anything but the capacity as far as the LU itself is concerned.14:08
efriedAnd remember, the whole point is that the upload is happening from a different nvl that we can't talk to (except through the SSP).14:09
efriedSo... we could theoretically use the marker LU as a message bus.  This gets pretty complicated.14:09
efriedHave the owner of the marker LU write heartbeats of some kind, and the other guys read the heartbeats.14:09
efriedNow we have clock sync problems and everything; but we can get around that.14:10
thorstcan you see a last touched thing?14:10
thorstget a time that the marker was last touched and have the one uploading actually touch the marker14:10
efriedNot via REST.  Would at the very least have to map & mount it.14:10
thorstthough, we get into the same lock contention we'd be in otherwise14:10
thorstwhoa, no mounts14:10
adreznecEw ew ew14:11
efriedMaybe not mount.14:11
efriedWhat metadata does linux provide on a mapped device?14:11
efriedSo yeah, not map, but read.14:11
efriedBasically have raw, dd-able data written by the marker owner, read by the waiters.14:12
thorstshould we table that for further discussion?  I want to make sure we get to wangqwsh's item because it is late for him14:12
thorstwe can swing back to it?14:12
efriedIf I propose a patch for 1 & 2...14:12
adreznecesberglu: any other CI stabilizing topics?14:12
efriedCan it be tested without merging it?14:12
adreznecefried: that would probably be a good place to start discussion14:12
adreznecand I think we could?14:12
esbergluI think thats it14:13
adreznec#topic OpenStack-Ansible CI bring-up14:14
adreznecwangqwsh: thorst the floor is yours14:14
adreznecOh, right14:14
adreznec#action efried to start proposing discussion patches on marker LU enhancements14:14
adreznecAs you were14:14
efried(#1 is kinda dead in the water, alas)14:14
thorstalright.  wangqwsh I think you were seeing odd Configparser import issues due to the use of the local2remote patch in your OSA CI14:15
thorstas of last night when we discussed, we didn't really have a plan...14:15
adreznecI think we have two options14:16
thorstwangqwsh: did you make any progress on it or is that still the latest?14:16
*** efried_otm has joined #openstack-powervm14:16
wangqwshno progress...14:16
adreznecEither install configparser into the nova-master venv for the compute node14:16
adreznecOr make it so the local2remote patch doesn't require configparser14:16
thorstlet me look at that patch...14:17
thorstewww...the patch has a tab in it!14:17
wangqwshrepo container builds the wheels.14:17
adreznecwangqwsh: that doesn't really matter, we could patch it in post-build14:18
thorstadreznec efried: I feel like ConfigParser could easily be removed...for something more trivial.  Though its probably a few hours of work.14:18
thorstare we using the 'setup.ini'?14:19
adreznecHmm ok14:19
adreznecLet me look at the patch14:19
efried_otmI could rewrite the config parsing if I had to. not trivial.14:20
thorstyeah, but are we even using it...14:20
thorstit looks like it could fall back to nothing14:20
efried_otmother than in the patch?  I don't think so.14:20
efried_otmugh, and do the discovery every time, which is slow.14:21
thorstefried_otm: yeah, but14:21
thorstdiscovery when you start the adapter.14:21
efried_otmbut yeah, that's the easy path.14:21
thorstwhich is once.14:21
thorstmaybe twice.14:21
*** wangqwsh_ has joined #openstack-powervm14:23
*** wangqwsh has quit IRC14:23
*** wangqwsh_ is now known as wangqwsh14:23
thorstflip side...14:23
thorsthow hard is it to add that dependency to the container?14:23
thorstadreznec / esberglu?14:24
adreznecLets see14:24
adreznecSo the actual action of adding it to the container is really easy, right?14:24
adreznecThe path is consistent, so we'd source /path/to/nova/venv/bin/activate14:24
adreznecand pip install configparser == v.whatever there14:25
adreznectiming might be more complicated14:25
adreznecwangqwsh: at what point in the run were you seeing the failure?14:26
*** efried_otm has quit IRC14:26
adreznecWould it be enough to let the OSA AIO finish running, then before we kick tempest patch configparser into the venv, restart nova-compute, validate it comes up, then do the tempest run?14:26
wangqwshstart the nova-compute service, it printed14:26
thorstbut will OSA be OK if nova-compute just dies14:27
adreznecI think so14:27
adreznecEasy to test locally14:27
adreznecI'll just break my driver settings, kick off an AIO and find out :)14:27
adreznecI think it will though14:27
adreznecI don't think it checks service state for long enough to notice the failure14:28
adreznecOk, so 2 minutes left here14:28
adreznecwangqwsh: do you want to try patching configparser into the venv and seeing if nova-compute works?14:29
adreznecShould just need to run "source /openstack/venvs/nova-master/bin/activate" and "pip install configparser"14:29
wangqwshhow to install the pkg? via pip?14:29
adreznecThen restart nova-compute14:29
adreznecI'll test the driver breakage situation on my AIO14:30
adreznecTo see if that timing would work14:30
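The workaround adreznec outlines would look roughly like this on the AIO host. The venv path is the one quoted above; the `systemctl restart` command and service name are assumptions about the deployment's init setup:

```shell
# Patch configparser into the nova venv after the OSA AIO finishes,
# then bounce nova-compute before kicking off the tempest run.
source /openstack/venvs/nova-master/bin/activate
pip install configparser
deactivate
systemctl restart nova-compute
```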
wangqwshthe pip config was changed to repo container.14:30
adreznecAh, and configparser isn't there?14:30
wangqwshpip install would not find the pkg.14:30
adreznecHmm ok14:30
adreznecThat's inconvenient14:31
thorstI'm wondering if maybe we try to remove the dependency in pypowervm...  Maybe wangqwsh could drive that and efried could review?14:31
thorstjust seems...simpler...14:31
adreznecI wonder if we could just patch in configparser as an additional dependency for nova-powervm in CI runs14:31
adreznecand then it would just end up in the venv14:32
wangqwshrepo container builds the wheels using openstack requirement files.14:32
thorstwangqwsh: want to try those two approaches?  1) Change the nova-powervm requirements to include Configparser (I find that eww) and 2) work on the local2remote patch with efried to see how to remove that dependency14:33
adreznecwangqwsh: right, we could basically add configparser to the list of requirements needed for nova-powervm, but only for CI runs14:34
adreznecI'll take #114:34
adreznec#action adreznec to test patching nova-powervm requirements to include configparser in OSA CI runs14:34
adreznec#action wangqwsh and efried to evaluate removing configparser dependency from local2remote patch14:34
adreznec#topic Future meetings14:35
adreznecSo I think this has been pretty productive14:35
thorstI liked this.  We should do it again14:35
adreznecWhat do you guys think about doing this weekly14:35
adreznecI can get something scheduled14:35
adreznecefried: esberglu wangqwsh ^^ ??14:35
thorstwe should get a wiki out there too, like the nova meetings.14:35
adreznecthat was my plan14:36
adreznecFormalize this as a driver meeting14:36
esbergluYeah I think this was better than phone calls14:36
esbergluPlus there is now a chat history14:36
wangqwshif the #1 not work, i can try #214:36
wangqwsh1 question:14:36
adreznecDoes this time slot work for people?14:37
thorstworks for me14:37
thorstunless one of us has to SDB present...14:37
wangqwshhscipaddress issue14:37
adreznecWhich would be an issue in 2 weeks14:37
adreznecOk, I'll look at calendars14:37
wangqwshthorst: do you mean the hscipaddress works for you?14:39
thorstwangqwsh: I think that will go away with the ConfigParser dependency14:39
thorstI only saw that error once, so I think it was an anomaly14:39
adreznecI think that was a timing issue14:40
wangqwshi will try it again14:41
adreznec#action adreznec to schedule weekly driver team meeting14:41
adreznecAll right, I think we're done here?14:41
adreznecAnd we're over time14:41
adreznecThanks everyone!14:41
thorstdamn...almost made it14:41
openstackMeeting ended Wed Nov  2 14:41:43 2016 UTC.  Information about MeetBot at . (v 0.1.4)14:41
openstackMinutes (text):
adreznecesberglu: has no action items14:42
adreznecHow did that happen14:42
esbergluYeah I was just gonna ask if I could help with anything14:42
esbergluOtherwise if efried still needs help getting his devstack going14:42
adreznecYour job is to Make CI Great Again14:42
did we talk CI for 50 min and esberglu gets none  :-)14:43
esbergluI'm sneaky14:43
adreznecBecause the rest of us are all CI slackers14:43
adreznecAnd he already has work items14:43
thorstasking efried what he needs help with is best route14:43
thorsthe's got a lot on his plate.14:43
thorstand we just added more14:43
esbergluYeah just let me know14:44
esbergluefried ^^14:44
thorstI'm told efried's IRC is screwed up at the moment14:45
thorstso slack him if he's unresponsive14:45
esbergluThose meeting minutes are slick14:47
thorstthey really are14:48
adreznecDefinitely doing more of these15:06
efriedthorst, I'm looking at the get_or_upload_image_lu code, and I don't think my above suggestions #1 & #2 make sense.15:13
*** wangqwsh has quit IRC15:14
thorstlooking up that code15:14
thorstefried: what I just don't understand is what is causing the thing that claims the marker to not actually upload15:16
thorstis it just...failing/15:16
thorstcause if so...we delete the marker_lu15:16
efriedRight - that's what 'finally' is for.  I haven't been able to figure out why that's not happening in the CI scenarios we've seen.15:17
efriedPerhaps the thing to do is add some more debug logging and reproduce.15:17
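The pattern being debugged reduces to this skeleton; the callables are placeholders for the real pypowervm/REST operations in get_or_upload_image_lu, not the actual code:

```python
def upload_with_marker(create_marker, do_upload, delete_marker):
    """Guard an image upload with a marker LU, deleting it no matter what.

    Sketch of the try/finally under discussion: the 'finally' is what
    should guarantee the marker LU disappears even when the upload
    itself raises.
    """
    marker = create_marker()
    try:
        return do_upload()
    finally:
        # If this delete itself fails, a stale marker is left behind
        # and every other node waits on it forever.
        delete_marker(marker)
```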
thorstefried: my guess is that there is no winner15:17
thorstand they're all just waiting.15:17
efriedNo, that shouldn't be the case.15:17
efriedAnd actually, the logs disprove that.15:17
efriedCause we do see the second guy come in, bounce off the first one, delete his marker, and wait.15:18
efried...and wait, and wait, and wait...15:18
thorstbut we don't see anything from the first...15:18
*** mdrabe has quit IRC15:18
efriedWe see the first one create his marker LU and start the upload.  Then... nothing.15:19
efriedesberglu, do you happen to still have any logs around from runs manifesting this timeout behavior?15:20
thorstadreznec kriskend: that kinda sounds like what we saw with our localdisk uploads15:20
thorstand one of the reasons to move to this new model...15:20
efriedYeah, if the upload itself hangs, we really have no recourse.15:20
efriedas far as this algorithm is concerned.15:20
efriedesberglu, thanks.  thorst ^^ take a look with me.15:21
adreznecthat sounds really familiar15:22
efriedOne issue we have is that we turn off debug logging for pypowervm because it's so bloody verbose (we see all the REST requests & responses).15:22
efriedWhich is as it should be - those logs are WAY too much to deal with in nearly 100% of cases.15:23
efriedIn fact... I can't think of the last time we had to use 'em.15:23
efriedI wonder...15:23
adreznecIt's enough data as to make the logs kind of useless is most cases15:23
efriedIf only there was *another* log level.15:23
efriedIs that less than DEBUG?15:24
efriedIma propose that change RIGHT NOW.15:24
efriedesberglu, your action will be to change the local.confs to make pypowervm DEBUG (it's INFO right now).15:25
adreznecand then TRACE becomes the log level we all wish didn't exist, most of the time15:25
thorstefried: yeah.15:25
esbergluYou mean change to TRACE?15:26
esbergluWait which is the more verbose one, trace or debug15:26
*** mdrabe has joined #openstack-powervm15:26
efriedtrace is more verbose.15:26
esbergluOkay so debug15:26
esbergluis what we want15:26
efriedBackground: pypowervm reports all the REST payloads - request & response - under DEBUG currently.  This makes debug logging in pypowervm pretty much useless - which takes away our ability to use it for useful stuff - like this upload problem.15:27
efriedSo we always set pypowervm to INFO, which is manageable.15:27
efriedI'm going to move all the REST request/response stuff to TRACE so we can set the default level to DEBUG and have it be actually useful.15:27
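The change efried proposes can be sketched with stdlib logging. The numeric value 5 (below DEBUG's 10) is an assumption about how the level would be defined, not necessarily what pypowervm chose:

```python
import logging

# Register a TRACE level below DEBUG so full REST request/response
# payloads can be logged without drowning ordinary debug messages.
TRACE = 5
logging.addLevelName(TRACE, 'TRACE')

def _trace(self, msg, *args, **kwargs):
    if self.isEnabledFor(TRACE):
        self._log(TRACE, msg, args, **kwargs)

logging.Logger.trace = _trace

# At DEBUG, payload-level TRACE records are suppressed, but debug
# messages (like the upload diagnostics wanted here) still appear.
log = logging.getLogger('pypowervm')
log.setLevel(logging.DEBUG)
```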
efriedthorst, it sure would be nice if I didn't have to make all these changes in the to-be-removed caching code.15:28
efriedCan we get that guy approved?15:28
thorsto yeah15:29
thorstforgot about that15:29
esbergluadreznec: I like your idea of just commenting out the stable/mitaka stuff for now, and then trying to bring it back with zuul v3. I'm going to move forward with that15:36
adreznecesberglu: cool, I'll +2 an updated patch with that in it15:38
efriedthorst, re-review please - added a note to README.15:39
thorstgood removal of debt...15:40
efriedAre our jenkins builds affected by the network outage?15:40
efriedthorst, 4421 for debug=>trace.15:41
efriedConsidered leaving in the ones that don't have full body text, but figured that would still be a lot.  Whatcha think?15:42
adreznecyeah... going to be a tough day to merge patches internally15:42
*** tblakes has quit IRC15:45
adreznecesberglu: so looking at that patch again... we're only pulling 12.0.0, 13.0.0, etc15:45
adreznecWhat do we do when tempest tags point releases?15:46
adreznece.g. 12.1.0, 12.2.0?15:46
esbergluProbably need a discussion on that15:48
esbergluWould need to be tested on staging15:48
esbergluAnd changes made accordingly15:48
esbergluWhy doesn't tempest follow the openstack release model?15:48
adreznecGood question16:03
adreznecProbably because tempest is really a library16:04
adreznecand not an actual core project16:04
adrezneclibraries tend to have independent release schedules16:04
efriedthorst, esberglu: here's a thought.  Maybe I should generate the marker LU name to include some representation of the host name.  That would help with debugging.16:09
efriedWe run into problems with name length, unfortunately.16:11
esbergluadreznec: I set it up so I will get emails of future point releases on tempest. So when they come out I will know and can test at that point16:14
efriedesberglu, in that log you noted above -- did you at any point go in and delete marker LUs?16:22
efriedWell, that's....16:22
esbergluDidn't touch anything16:22
efriedesberglu (thorst) So several things jump out of that log.16:24
efriedRemember the "ssp_primer" instance?  This is the one we create from to prime the SSP with the image LU to make subsequent deploys faster.16:24
efriedFirst of all, doesn't create that guy if the image LU already exists in the SSP.  In this log, I see we're trying to create it - so must've thought we needed to.  As if this was the first run against this SSP.  Which clearly can't be true because...16:27
efried...the ssp_primer VM creation is finding (supposedly) in-progress uploads - that is, existing marker LUs.16:27
efriedBut not just one.  Like six.  That's really weird.  Unless half a dozen different nodes were all coming up at the exact same time.  esberglu, is that possible?16:28
efriedProbably not, because...16:29
thorst6 could come up....but with jitter I think16:30
esbergluMaybe? If it is right when I finish redeploying the CI env. Basically zuul starts the queue while the initial image builds. Then spawns 20+ nodes and start all of the tempest runs for all of the changes that came in while that stuff was going on at the same time16:30
esbergluIf that made any sense16:30
esbergluWhat time did this happen at?16:31
efried...several *other* uploads try to happen over the next seven minutes, but they *also* bounce off of (supposedly) in-progress uploads.  And those also show various large numbers of marker LUs.  Like more than ten.16:31
efried2016-11-01 16:47:17.489 is when the ssp_primer first bounces off the in-progress uploads.16:31
efriedsix marker LUs.16:31
thorstundercloud could have 6 hit at once...16:32
esbergluThat would have been right around the time that the first runs were going on yesterday I think16:32
thorstbut I thought we only had 4 hosts in a given SSP or something16:32
efriedSo let's assume for a second that we really did have six (or more) of the 20+ nodes trying their uploads at the same time.  What should be happening is that they should all see all the marker LUs, compare them, and whoever's "first" (lowest sort order of marker LU name) should "win" - but the rest ought to delete their markers and spin.16:33
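The intended behavior efried describes reduces to a simple election on marker LU names. This is an illustrative sketch (names hypothetical), not the real pypowervm tasks code:

```python
def election_winner(marker_names):
    """Return the marker LU name that 'wins' -- lowest sort order.

    Every node creates a marker, lists all markers, and only the node
    whose marker sorts first proceeds with the upload; the others are
    expected to delete theirs and poll until the image LU appears.
    """
    return min(marker_names)

def should_upload(my_marker, all_markers):
    """True if this node's marker won the election."""
    return election_winner(all_markers) == my_marker
```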
efriedSo at 2016-11-01 16:48:41.456 and 2016-11-01 16:48:44.270 we see two other threads bounce off the markers - now ten of 'em.16:34
thorstare we talking about an undercloud rebuild or a tempest run16:35
efriedEven if we managed to create six at once, no new marker LUs should have come into the picture after that, cause all the other guys should have just seen the existing ones and entered into a wait loop.16:35
efriedThen at 2016-11-01 16:54:16.053 things get _really_ weird.16:35
efriedThe ssp_primer thread CREATES A MARKER LU AND TRIES TO UPLOAD.16:36
efriedWhich must mean it came out of a sleep, looked around, and found a) no marker LUs, and b) NO completed image LU.16:36
efriedAt that point, he tries his upload and fails (because APINotLocal).16:36
efriedWhich brings us to the first potentially actionable item - it's possible we should actually wait for the ssp_primer to come up successfully before it allows testing to start.16:37
thorstefried: but even with the SSP primer...16:38
thorstthings are going to be uploaded16:38
thorstsnapshot and then deploy from snapshot16:38
efriedYeah, but those are different images.16:38
efriedThe snapshots need to be created from existing instances, which can't be created until the primer is there.16:39
efriedThe only thing this would buy us is the ability to bail out early and not run tempest if the ssp_primer fails to create.16:39
efriedIt wouldn't actually help the tests themselves.16:39
efriedCall it early error detection.16:39
efriedSo anyway, I'm scouring the get_or_upload_image_lu algorithm for anything that would allow a marker LU to be created if we already see marker LUs in the SSP.16:40
thorstyeah...but I think we hit this thing even in undercloud build out16:40
thorstare marker_lu's guaranteed to be unique?16:41
efriedAs unique as the first 8 chars of a randomly-generated UUID.16:41
efriedWhich is more guaranteed-to-be-unique than the frequency with which we're seeing this issue.16:41
thorstso HIGHLY unlikely16:41
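A back-of-envelope check on that "HIGHLY unlikely" (assuming the marker prefix really is the first 8 hex characters of a uuid4, i.e. 16**8 possible values; the helper name here is made up):

```python
# Birthday-bound estimate of the chance that any two of n concurrent
# uploaders draw the same 8-hex-char marker prefix. Assumption: prefixes
# are uniformly random over 16**8 (~4.29 billion) values.
SPACE = 16 ** 8

def collision_prob(n, space=SPACE):
    """Upper-bound probability that n random prefixes collide."""
    return n * (n - 1) / 2 / space

# Even 20 simultaneous uploaders are nowhere near a plausible collision:
print(collision_prob(20))   # ~4.4e-08
```

So if the same issue recurs run after run, random name collision is effectively ruled out as the cause.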
efriedwhat's the significant difference in the undercloud?16:42
thorstit deploys across 20 or so hosts16:42
thorstbut the SSP is shared across 4 hosts16:42
thorstso there are 5 or so SSPs16:42
thorstI think we've seen this issue there.16:42
efriedstill remote pypowervm?16:42
esberglu*deploys across 14 hosts16:42
thorstlocal pypowervm16:42
efriedokay.  I don't think that's related to the upload marker stuff - it just results in the primer failing ultimately.16:43
efried(remote, that is)16:43
efriedSo as I said, I'm scouring the get_or_upload_image_lu algorithm, and would welcome some more eyes.16:44
efriedThe obvious, but hopefully-impossible, thing that could cause this is our SSP GETs returning the wrong data.16:45
efriedLike we do a GET and it comes back empty, when it's really not.16:45
efriedI wonder if that can happen when a VIOS goes bad (busy, RMC dead, etc.)16:46
efriedOr the other way - maybe VIOS and/or REST is caching improperly and reporting the LU list in a state it's not in anymore.16:48
efriedBeyond that, trying to inspect the algorithm for holes where we could create a marker even if we find some already there.16:48
efriedSigh, this is pretty clear:16:49
efriedif _upload_in_progress(lus, luname, first):16:49
efried    first = False16:49
efried    _sleep_for_upload()16:49
efried    continue16:49
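For context, the surrounding loop works roughly like the following simplified paraphrase (stub helpers and names are stand-ins, not the real pypowervm code; the marker-creation branch is elided):

```python
# Hypothetical, simplified paraphrase of the get_or_upload_image_lu wait
# loop being scrutinized. Stub helpers; not the actual pypowervm code.
import time

MARKER_PREFIX = "part"  # assumption: markers share a recognizable prefix

def _find_lu(lus, luname):
    return next((lu for lu in lus if lu == luname), None)

def _upload_in_progress(lus, luname, first):
    return any(lu.startswith(MARKER_PREFIX) and lu.endswith(luname)
               for lu in lus)

def get_or_upload(get_lus, luname, poll=0.01):
    first = True
    while True:
        lus = get_lus()                 # fresh GET of the SSP's LU list
        image_lu = _find_lu(lus, luname)
        if image_lu:
            return image_lu             # another node finished the upload
        if _upload_in_progress(lus, luname, first):
            first = False
            time.sleep(poll)            # spin until the "winner" finishes
            continue
        # No image LU and no markers: this is where a marker would be
        # created and the upload attempted (elided here).
        return None

# Fake GETs: first a marker is visible, then the finished image LU:
snapshots = iter([["part1a2b3cimg"], ["img"]])
print(get_or_upload(lambda: next(snapshots), "img"))  # img
```

Note that the loop's correctness hinges entirely on each GET returning the true current LU list, which is exactly the cache suspicion raised below.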
thorstefried: I've heard rumors of cache issues in the SSP code16:49
efriedon VIOS or REST?16:49
thorstI think csky and shyama hit that.16:50
efriedThis is, of course, going to be tough as hell to nail down.16:50
esbergluefried: It would be weird if it was the algorithm right? Nothing has changed in that recently. If it was an algorithm thing we likely would have been hitting it at some point previously16:50
thorsthas the algorithm ever been 100%?16:51
efriedesberglu, have we not been hitting this more or less ever since we really cranked up the number of nodes?16:51
thorstwell, maybe the algorithm is...16:51
thorstbut the cache was always the issue...16:51
esbergluWe have been seeing issues with the marker lu stuff, but the underlying cause has always been something else.16:52
esbergluThen the underlying cause is fixed, we run fine for a while, then it manifests again16:52
efriedthorst, adreznec, esberglu: do me this, if you please.  Take 15 minutes and scrutinize this again for holes:
efriedWith its helper methods (all in the same file) it's only about 170 lines including docs, comments, and whitespace16:54
thorstefried: I have been  :-)16:54
efriedIf I got it to pass sonar's complexity check, it can't be too complicated, can it?  ;-)16:54
thorstwhile I'm reviewing...maybe reach out to Hsien to see if he knows of any issues with the cache there?16:55
thorstspecifically what Hsien saw was something with the tier...which we're using here.16:56
efriedOne thought: in the 'finally' clause, when we delete the marker, maybe we spin doing GETs until it really disappears?  (And warn like hell if we have to spin even once - dammit, when delete() returns, the thing should be GONE.)16:58
*** k0da has quit IRC16:58
efriedBut it wouldn't be super surprising if VIOS claims completion before the thing is really purged from the SSP.  They gave us that guff about LU mappings recently, if you recall.16:59
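efried's delete-and-verify idea could look something like this (hypothetical names and a fake backend purely for illustration; not the real pypowervm API):

```python
# Sketch of the idea above: after delete() returns, keep GETting until
# the marker LU really disappears, and warn loudly if even one extra
# poll is needed. All names here are hypothetical.
import logging
import time

LOG = logging.getLogger(__name__)

def delete_marker_and_verify(delete_fn, get_lus, marker,
                             poll=0.01, tries=10):
    delete_fn(marker)
    for attempt in range(tries):
        if marker not in get_lus():
            return True                 # really gone
        if attempt == 0:
            LOG.warning("Marker %s still present after delete()!", marker)
        time.sleep(poll)
    return False                        # never disappeared; give up loudly

# Fake backend where the deletion takes one extra GET to show up:
state = {"lus": ["part1a2b3cimg", "img"]}
def fake_delete(m): state["pending"] = m
gets = [lambda: state["lus"],
        lambda: [lu for lu in state["lus"] if lu != state["pending"]]]
def fake_get(): return gets.pop(0)()
print(delete_marker_and_verify(fake_delete, fake_get, "part1a2b3cimg"))
```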
thorstone thing I can possibly maybe see here...17:00
thorstthe crt_lu fails (but actually manages to succeed)17:00
thorstand ends up creating a marker lu.17:00
thorstbut we think it failed...17:00
*** apearson has quit IRC17:00
efriedYeah, I was looking at that.  That was #1 waay above.  The only way we could detect that would be, after that crt_lu "fails", do a GET and see if that LU really got created.17:01
thorstseems unlikely17:02
efriedOnce again, that's a bunch of extra code on the theme of essentially distrusting the atomicity/transactionalism of the API.17:02
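The verify-after-"failure" check would be small, though; something like this sketch (hypothetical names, fake create function; not the real crt_lu signature):

```python
# Sketch of the check discussed above: if the create call raises, do a
# GET and see whether the LU landed anyway (which would leave an orphan
# marker behind). Hypothetical names; not real pypowervm signatures.
def crt_lu_verified(crt_fn, get_lus, luname):
    try:
        return crt_fn(luname), False
    except Exception:
        # The POST "failed" - but did it actually take effect server-side?
        if luname in get_lus():
            return luname, True        # created despite the error
        raise

# Fake create that raises even though the LU lands in the SSP:
ssp = []
def flaky_crt(name):
    ssp.append(name)
    raise RuntimeError("timeout")      # e.g. connection dropped after POST
print(crt_lu_verified(flaky_crt, lambda: ssp, "partdeadbeefimg"))
```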
thorstefried: some extra logging in here feels like it could go a long way.17:06
thorstmaybe something in the _find_lus to log the names of the LU's returned.17:06
efriedThe 'Waiting for in-progress' log records the names of the LUs we're waiting on.  That's how I found out how many of those there were.17:08
efriedTurning on the debug log will produce those on every iteration instead of just the first one, which will be very helpful.17:08
*** esberglu has left #openstack-powervm17:12
thorstefried: true...17:14
*** tblakes has joined #openstack-powervm17:20
*** tblakes has quit IRC17:41
*** esberglu has joined #openstack-powervm18:11
*** openstackgerrit has quit IRC18:18
*** openstackgerrit has joined #openstack-powervm18:18
esbergluthorst: Any progress on the LU stuff while I was at lunch?18:23
*** k0da has joined #openstack-powervm18:45
*** tblakes has joined #openstack-powervm19:29
*** tblakes has quit IRC19:35
*** seroyer_ has joined #openstack-powervm19:36
*** seroyer has quit IRC19:37
*** seroyer_ is now known as seroyer19:37
thorstesberglu: not from my side...been looking at SDN stuff unfortunately19:50
*** openstack has joined #openstack-powervm19:59
*** dwayne_ has quit IRC20:11
*** openstackgerrit has quit IRC20:18
*** openstackgerrit has joined #openstack-powervm20:18
thorstman...Ubuntu is so nice20:25
thorstcan't say enough nice things about Ubuntu20:25
*** dwayne_ has joined #openstack-powervm20:27
*** smatzek has quit IRC20:29
*** tblakes has joined #openstack-powervm20:41
*** edmondsw has quit IRC20:50
*** esberglu has quit IRC21:03
*** esberglu has joined #openstack-powervm21:04
*** esberglu has quit IRC21:08
*** tblakes has quit IRC21:18
efriedthorst, and RHEL is HEL(l)21:21
thorstI can neither confirm nor deny21:21
thorstbut man, that ubuntu is classy21:21
efriedadreznec, queue up topic for our next scrummyscrum (or preferably sooner): we need pypi fixed for pypowervm so we can get going as a prereq for any serious driver integration.21:22
efriedIt needs to *happen*.21:22
adreznecI was just talking to esberglu about driver meetings/scrums/etc21:22
efriedWe can get away with the live migration object as soon as the blueprint is approved, but pypowervm is gonna block everything else.21:22
adreznecI'll set up a discussion with Dom21:23
adreznecSo we can talk getting a job created to do it21:23
adreznecHopefully for tomorrow if I can find calendar space21:23
adreznecthorst: you still here?21:23
thorstadreznec: kinda21:24
thorstabout to leave21:24
adreznecI want to cancel our existing OSA phone scrums21:24
adreznecand replace them with IRC discussions21:24
adreznecIt seemed more productive for me21:24
thorstmy only concern is with wangqing21:25
adreznecAnd wangqwsh21:25
thorstgiven timing21:25
adreznecWell right21:25
adreznecWe can discuss more when you have more time21:25
thorstbut I'm in support otherwise, lets just make sure it works with him21:25
thorstit was super duper awesome sauce21:25
adreznecBut it seemed easier for us to communicate that way21:25
adreznecEasier in text21:26
adreznecAnd it has logs and action items21:26
adreznecI'll poke at it a bit21:26
thorstadreznec: +121:27
thorstalright...I'm out21:27
adreznecsee ya21:28
*** thorst has quit IRC21:31
*** esberglu has joined #openstack-powervm21:34
*** tjakobs has quit IRC21:37
*** seroyer has quit IRC21:38
*** esberglu has quit IRC21:38
*** smatzek has joined #openstack-powervm21:57
*** tblakes has joined #openstack-powervm21:58
*** mdrabe has quit IRC22:09
*** smatzek has quit IRC22:30
*** apearson has joined #openstack-powervm22:32
*** tjakobs has joined #openstack-powervm22:38
*** tblakes has quit IRC22:58
*** apearson has quit IRC23:00
*** apearson has joined #openstack-powervm23:00
*** thorst has joined #openstack-powervm23:13
*** thorst has quit IRC23:18
*** seroyer has joined #openstack-powervm23:34
*** seroyer has quit IRC23:35
*** esberglu has joined #openstack-powervm23:53

Generated by irclog2html.py 2.14.0 by Marius Gedminas - find it at!