Thursday, 2016-08-18

*** thorst_ has joined #openstack-powervm00:19
*** thorst_ has quit IRC01:19
*** seroyer has joined #openstack-powervm01:20
*** thorst_ has joined #openstack-powervm01:20
*** thorst_ has quit IRC01:29
*** edmondsw has quit IRC01:40
*** seroyer has quit IRC02:01
*** thorst_ has joined #openstack-powervm02:22
*** thorst_ has quit IRC02:23
*** toan has joined #openstack-powervm02:23
*** thorst_ has joined #openstack-powervm02:24
*** thorst_ has quit IRC02:32
*** tsjakobs has joined #openstack-powervm02:33
*** tsjakobs has quit IRC02:38
*** thorst_ has joined #openstack-powervm03:30
*** thorst_ has quit IRC03:37
*** kotra03 has joined #openstack-powervm03:58
*** kotra03 has quit IRC04:01
*** thorst_ has joined #openstack-powervm04:34
*** kotra03 has joined #openstack-powervm04:35
*** thorst_ has quit IRC04:42
*** kotra03 has quit IRC04:51
*** kotra03 has joined #openstack-powervm05:18
*** Cartoon has joined #openstack-powervm05:22
*** thorst_ has joined #openstack-powervm05:40
*** thorst_ has quit IRC05:47
*** thorst_ has joined #openstack-powervm06:45
*** thorst_ has quit IRC06:52
*** kotra03 has quit IRC07:22
*** thorst_ has joined #openstack-powervm07:50
*** thorst_ has quit IRC07:56
*** k0da has joined #openstack-powervm08:13
*** kotra03 has joined #openstack-powervm08:35
*** thorst_ has joined #openstack-powervm08:55
*** thorst_ has quit IRC09:02
*** Cartoon_ has joined #openstack-powervm10:12
*** Cartoon has quit IRC10:15
*** smatzek has joined #openstack-powervm10:26
*** thorst_ has joined #openstack-powervm11:04
*** Cartoon_ has quit IRC11:06
openstackgerritMerged openstack/nova-powervm: Wrap console failure message
*** svenkat has joined #openstack-powervm11:52
*** edmondsw has joined #openstack-powervm12:15
*** mdrabe has joined #openstack-powervm12:30
openstackgerritDrew Thorstensen (thorst) proposed openstack/networking-powervm: Enforce limit of VLAN clean ups in each pass
openstackgerritDrew Thorstensen (thorst) proposed openstack/networking-powervm: Enforce limit of VLAN clean ups in each pass
*** burgerk has joined #openstack-powervm12:43
*** apearson has joined #openstack-powervm12:59
*** mdrabe has quit IRC13:21
*** mdrabe has joined #openstack-powervm13:33
*** esberglu has joined #openstack-powervm13:39
*** seroyer has joined #openstack-powervm13:42
*** tblakes has joined #openstack-powervm13:42
*** seroyer has quit IRC13:49
*** burgerk has quit IRC14:02
thorst_efried: when your SR-IOV change goes in, this can start to be a reality:
thorst_esberglu: 32 concurrent jobs...nice!14:06
adreznecdat silent pipeline14:08
esbergluthorst: Regarding the heal_and_optimize_interval, we have it increased for the compute nodes but not the aio nodes. What do you think would be reasonable for that? Default is 30 min.14:08
adreznecesberglu: thorst_ Once we get things stabilized with the new volume and things are passing regularly, how are we going to handle notifications for the nova/neutron pipeline failures?14:08
adreznecTurn emails back on?14:09
adreznecRight now we're running the jobs but we'd have to look manually to see failures14:09
thorst_esberglu: if that patch set works properly...I'm thinking we leave the heal and optimize intervals at the normal rate14:09
thorst_basically should be able to leave it as is.14:09
thorst_adreznec: nova/neutron pipeline failures...yeah, should e-mail to us14:10
thorst_though some are taking 3 hours?!14:10
adreznecYeah, I was just noticing that...14:10
esbergluTurn the compute nodes back down to default or leave them up?14:10
adreznecIt's still running tempest14:10
thorst_esberglu: no changes on your side basically14:10
thorst_just leave everything as is14:11
thorst_adreznec: we have hit a queue limit14:11
thorst_so maybe these jobs are just taking a while cause they were queued?14:11
adreznecmax jobs?14:11
adreznecThat's easy enough to determine14:12
thorst_yeah, the 3 hr 12 minute one has actually only been active for 2 hours 20 min14:12
thorst_still high...not ridiculous14:12
adreznecWe'll have to watch it14:12
esbergluthorst: Part of those longer times is the VIF failures. Each time it happened it timed out at 5 min waiting. Hit a handful of those and it can really increase the time14:12
thorst_30 as a parallel number is also too small14:12
thorst_esberglu: true...true14:12
adreznecthorst_: 30 parallel jobs?14:12
thorst_I have a patch for that!14:12
thorst_adreznec: we should bump that to 3x the system count...maybe 4x14:13
esbergluIt shouldn’t be maxing out at 30 it should be 5014:13
thorst_o, its set to 50?14:13
adreznecI thought that's what we were trying for14:13
esbergluMight just be spawning more nodes right now14:13
esbergluIf volume went up really quick14:13
adreznecWhich it probably did14:14
adreznecBecause U.S. morning14:14
thorst_yeah, we have no ready nodes14:14
adreznecFire off patches, rechecks, etc and grab coffee14:14
thorst_well, seems we need more ready nodes.14:14
adreznecHow many are we at now?14:14
esberglu Ughh. There are a bunch of nodes stuck deleting again. Which contribute to the 50.14:14
adreznecWe have 1.8 ready nodes?14:15
adreznecNo wonder we can't keep up14:15
thorst_I'm kidding.14:15
thorst_delete issues...14:15
thorst_that's a bad.14:15
thorst_esberglu: which server is having trouble deleting a devstack node...and what's the name of the instance?14:16
*** tsjakobs has joined #openstack-powervm14:19
esbergluneo19: PowerVM_CI-PowerVM_DevStacked-6868, PowerVM_CI-PowerVM_DevStacked-6862, PowerVM_CI-PowerVM_DevStacked-682214:20
esbergluthorst: Looks like it is only happening on a subset of the servers, but most of them have multiple cases. Neos 19, 21, 24, 25, 27, 28, and 30 all have at least 1 stuck14:22
efriedthorst_, confirmed vlan ID does indeed come through.14:24
efriedbinding:profile does not - haven't figured that out yet.14:24
thorst_efried: sweet...maybe we push yours through initially14:25
thorst_so we can do that fancy SEA removal thing14:25
thorst_esberglu: got time to look now...going in...14:25
esbergluCool. Looks like it started almost 48 hours ago, but I didn’t notice because we weren’t at high volume. And just random nodes failing intermittently since14:27
the logs are...neat14:29
thorst_-6868 was created today14:30
thorst_instance uuid: c9bf97c3-db3b-47fc-8275-634be9d06abb14:30
openstackgerritEric Fried proposed openstack/nova-powervm: VIF driver implementation for SR-IOV
thorst_esberglu: do you know if these instances actually built?14:33
thorst_or did they attempt to build, kinda get hung, and then we are now trying to delete them?14:33
esbergluThe latter. They are still in build state. But running the deleting task14:34
thorst_this looks like efried territory to be honest14:34
thorst_they're hung on the SSP upload.14:34
efriedLet me get this commit in place right quick and I can take a look.14:36
thorst_efried: thx14:37
openstackgerritEric Fried proposed openstack/networking-powervm: WIP: Mechanism driver & agent for powervm SR-IOV
thorst_efried: is that still WIP?14:39
thorst_I think that bit could potentially go in...14:39
efriedthorst_, need to finish UT.14:41
thorst_efried: ah14:41
*** burgerk has joined #openstack-powervm14:46
efriedthorst_, esberglu: So what is it that needs to be looked at SSP-wise?14:46
thorst_so go to neo19, just scp down /home/neo/n-cpu.log14:47
thorst_that's a snapshot14:47
thorst_filter on the instance-uuid: c9bf97c3-db3b-47fc-8275-634be9d06abb14:47
thorst_you can see it gets stuck in the SSP 'crt-disk' step14:47
thorst_here's what I suspect happened...we had a lock issue in the SSP upload14:48
thorst_and said lock has been hanging around all day14:48
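The filtering step thorst_ describes (pull the snapshot log, then keep only the lines for one instance UUID) can be sketched roughly as below. This is a minimal illustration, not the team's tooling: the function name `filter_by_instance` and the sample log lines are hypothetical, and real n-cpu.log entries will look different. Only the UUIDs are taken from the conversation.

```python
def filter_by_instance(lines, instance_uuid):
    """Return the log lines that mention the given instance UUID."""
    return [line for line in lines if instance_uuid in line]


# Hypothetical sample lines; real n-cpu.log entries will look different.
SAMPLE_LOG = [
    "INFO [instance: c9bf97c3-db3b-47fc-8275-634be9d06abb] starting spawn",
    "INFO [instance: 77770e51-5d82-4a60-9e20-56fefcbc54a9] starting spawn",
    "INFO [instance: c9bf97c3-db3b-47fc-8275-634be9d06abb] crt-disk running",
]

# Print only the entries for the instance quoted in the chat.
for line in filter_by_instance(SAMPLE_LOG, "c9bf97c3-db3b-47fc-8275-634be9d06abb"):
    print(line)
```

Against a real file, the same function could be fed an open file handle instead of the sample list, since it only iterates over lines.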
efriedthorst_, esberglu: sorry, been multitasking, which I suck at.  We should have REST investigate this:15:22
efried2016-08-18 03:24:49.638 41657 WARNING [req-94fbdad2-82e6-497c-8217-3545375dfd3a admin admin] HTTP error 500 for method DELETE on path /rest/api/uom/Tier/b0f7f3d0-41a8-336e-beee-4c1ceec29ae6/LogicalUnit/fe7646fe-dd69-33bb-b934-2b8be9f25b2e: Internal Server Error -- java.lang.NullPointerException15:22
efriedthorst_, esberglu: The upshot is that we've been bouncing off of this guy since 1:29 system time:15:27
efriedthat's a "marker LU" indicating an in-progress (probably hung/failed) upload from another process.15:27
thorst_I've got to step out...someone able to follow up with apearson?15:27
efriedDid we ungracefully kill a compute process at some point?15:27
efriedAt this point, if you want to get things moving, you can remove that LU.  Not sure how assiduously we want to pursue the root cause of this particular incident.15:28
thorst_efried: I've actually seen this a few times15:29
thorst_so I think we do want to root it out.15:29
esbergluThere are 16 nodes hanging on delete right now15:29
thorst_in the past month or so...yeah15:29
efriedhanging on delete?  That's weird - and likely unrelated.15:29
thorst_no no, the marker LU had a hiccup15:29
thorst_and things went ... bad15:29
thorst_ok...gotta run, back in an hour15:29
efriedShould still be able to delete the node.  Which should interrupt the build process.15:30
efriedSo esberglu, example of a node that's hung on delete?15:30
efriedvm, not node.15:30
esbergluWell they are stuck in the deleting task but in build status.15:32
*** k0da has quit IRC15:32
efriedHm, I would expect delete to interrupt the build.  Let's wait for thorst_ to get back and we can pursue that.15:33
efriedMeanwhile, I guess let's try to figure out why this upload failed, and what we might could have done about it.15:34
efriedesberglu: It is clear to me at this point that the marker LU in question was created from a different node.  What other nodes are sharing this SSP?15:50
esberglu19, 21, 24, 25. All of which have hanging vms15:51
efriedOffending LU was created on 25.  Tracking it down...16:04
efriedesberglu, found the culprit16:09
efried2016-08-18 01:34:19.786 8769 ERROR nova.compute.manager [instance: 77770e51-5d82-4a60-9e20-56fefcbc54a9] HttpError: HTTP error 500 for method DELETE on path /rest/api/uom/Tier/b0f7f3d0-41a8-336e-beee-4c1ceec29ae6/LogicalUnit/5a8b2d26-8dca-3881-8ef9-acd8787924ca: Internal Server Error -- java.lang.NullPointerException16:09
efriedAttempting to delete the marker LU.16:09
efriedIf we can't delete the marker LU, everything else trying to use the same image will hang.16:09
efriedWe need REST to investigate the above.16:09
esbergluThanks for the assist16:10
thorst_efried esberglu: got it sorted?16:24
efriedSee #novalink16:24
efried@changh is working it.16:24
efriedNeed to clear the env now and get things moving.  Stand by...16:25
esbergluthorst: Your networking_powervm change passed CI btw. With that change and fixing this issue we might be stable!16:26
thorst_ooo, I want to look at those logs16:26
thorst_will dig in after a bit16:26
efriedthorst_, esberglu: deleted the errant marker LU.  Things should proceed now, one hopes.16:27
efriedthorst_, question is: should we try to recover from this somehow?16:29
efriedAt the very least, I would expect an instance delete to interrupt the spawns that are wedged on that image upload.  Why doesn't that happen?16:30
efriedDo we need an interrupt handler in that loop?  Does the loop maybe need to check for instance state DELETING whenever it wakes up, and bail?16:31
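One way to picture efried's second suggestion (have the wait loop check the instance state each time it wakes up, and bail if the instance is being deleted) is the sketch below. It is an assumption-laden illustration, not the nova-powervm code: `wait_for_upload`, `SpawnInterrupted`, the two callables, and the `'deleting'` state string are all hypothetical stand-ins for whatever the real upload loop and instance lookup use.

```python
import time


class SpawnInterrupted(Exception):
    """Hypothetical error raised when the instance is deleted mid-wait."""


def wait_for_upload(upload_done, get_task_state, interval=1.0, sleep=time.sleep):
    """Poll until the in-progress upload finishes.

    upload_done:    callable returning True once the marker LU is gone
                    (hypothetical stand-in for the real check).
    get_task_state: callable returning the instance's current task state
                    (hypothetical; 'deleting' is assumed to mean delete
                    has been requested).
    """
    while not upload_done():
        # On every wake-up, check whether someone asked to delete the
        # instance; if so, stop waiting instead of staying wedged.
        if get_task_state() == "deleting":
            raise SpawnInterrupted("instance is being deleted; abandoning wait")
        sleep(interval)
```

The `sleep` parameter is injectable only so the loop can be exercised without real delays; a caller would normally leave it at the default.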
esbergluefried: This is also happening on the ssp with 27, 28, 30. Want to delete that marker too?16:38
efriedesberglu, or I could teach you to fish.16:38
efriedHave you found the log entry for the offending marker LU?16:39
efriedYou're looking for something like:16:40
efried Waiting for in-progress upload(s) to complete.  Marker LU(s): ['part65fd0cbbimage_template_PowerVM_Ubuntu_Base_1471493040_ab1d7cee5378f003d2749']16:40
esbergluYep. Looking at it now16:41
efriedThat'd have the same req ID as the rest of the entries for the spawn you're trying to unblock.16:42
efriedSo to get it moving, you delete the marker LU via e.g.:16:42
efriedpvmctl lu delete -i name=part65fd0cbbimage_template_PowerVM_Ubuntu_Base_1471493040_ab1d7cee5378f003d274916:42
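The two steps efried walks through (find the "Waiting for in-progress upload(s)" entry, then delete each named marker LU) can be sketched as below. This is only an illustration: the helper names `marker_lus` and `delete_commands` are hypothetical, the regular expression assumes the entry always has the shape quoted above, and the script only builds the `pvmctl lu delete -i name=...` command strings from the chat rather than running anything.

```python
import ast
import re

# Assumes the log entry has the shape quoted in the chat:
#   ... Marker LU(s): ['<name>', ...]
MARKER_RE = re.compile(r"Marker LU\(s\): (\[.*?\])")


def marker_lus(line):
    """Return the marker LU names listed in one log line (empty if none)."""
    match = MARKER_RE.search(line)
    if not match:
        return []
    # The list is printed as a Python literal, so literal_eval can read it.
    return ast.literal_eval(match.group(1))


def delete_commands(line):
    """Build (but do not run) one pvmctl delete command per marker LU."""
    return ["pvmctl lu delete -i name=%s" % name for name in marker_lus(line)]


LOG_LINE = (
    "Waiting for in-progress upload(s) to complete.  Marker LU(s): "
    "['part65fd0cbbimage_template_PowerVM_Ubuntu_Base_1471493040_"
    "ab1d7cee5378f003d2749']"
)

for command in delete_commands(LOG_LINE):
    print(command)
```

Printing the commands instead of executing them leaves a chance to confirm that the marker LU really belongs to a dead upload before removing it.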
thorst_esberglu: my change may have functionally worked but has a bug in the logging...16:45
thorst_will get another patch up16:45
esbergluefried: Cool thanks16:48
openstackgerritDrew Thorstensen (thorst) proposed openstack/networking-powervm: Enforce limit of VLAN clean ups in each pass
adreznecFYI thorst_ efried - Proposed Ocata schedule is up at, release date ~Feb 20th17:00
adreznecShorter cycle17:01
adreznecTo align with the new schedule discussed at the last summit17:01
thorst_yeah, crazy17:01
adreznecTake back to whoever internal as needed17:01
thorst_right right...17:01
adreznecWe'll have to decide development timeline17:01
thorst_well, certainly Ocata will have less content  :-)17:01
adreznecI mean no, thorst_17:02
adreznecSame content, just faster!17:02
thorst_but I'm already working as fast as I can!17:02
* adreznec cracks whip17:02
thorst_kinda wish we had 4 releases a year.17:04
*** k0da has joined #openstack-powervm17:05
thorst_esberglu: so...where we at?  Unwedged?17:10
esbergluYep. Nodes are spawning right now. I’m pretty sure we will hit capacity17:11
efriedthorst_, in light of -- is the code around needed??17:11
thorst_esberglu: hit capacity as in...hit our limit of number of nodes we can run at once?17:11
thorst_or just plain run out of capacity17:11
esbergluHit our limit of 5017:12
thorst_esberglu: phew17:12
esbergluthorst: What’s the status on those new systems?17:12
thorst_efried: yeah, totally.  Notice that in 357239 the device up is in the sea_agent.  It does it after it makes sure the SEA has the VLAN.  In the SR-IOV agent you do the device up when the request comes in17:12
thorst_esberglu: that's on my todo next list.  :-)17:13
thorst_just finishing easier stuff first.17:13
efriedthorst_, I'll pretend I understood that, and leave the code alone?17:15
thorst_efried: sounds good to me17:15
thorst_also, my code can't go in until some nova changes go in and 34342317:15
*** esberglu has quit IRC17:17
efriedthorst_, working up UT for the networking-powervm side now.17:21
efriedLots of work pending there.17:21
thorst_efried: awesome...we'll be able to make things much simpler then17:22
*** esberglu has joined #openstack-powervm17:31
efriedthorst_, gah!, I can never remember how you're supposed to override conf options in a unit test.  Remind me?17:31
thorst_efried: I always forget too17:32
thorst_I think its the 'flags' thing in nova17:32
thorst_not sure about neutron.17:32
efriedah, flags sounds familiar.  But I'm in neutron, so...17:33
thorst_may still be flags...17:33
efriedthorst_, should the sriov_agent be using the CNAEventHandler?18:01
thorst_efried: I'd say no to start...18:02
thorst_1) it's not CNA, it would be a different event type18:02
thorst_2) I'm not sure you don't have to do anything18:02
thorst_3) with the latest stuff...I'm not sure I care anymore in the sea one...18:02
*** k0da has quit IRC18:04
efriedthorst_, 3) as in, gonna rip it out of the SEA agent too?18:04
thorst_efried: Not yet...I need to think it through.18:04
thorst_I'm not convinced I need it18:04
thorst_maybe...not 100% sure.18:04
thorst_the maybe is for live migration.18:05
efriedthorst_, okay - right now the setup for that guy is in agent_base.  So I'm going to need to move it to sea_agent.  Does the order matter?18:05
thorst_efried: I don't think so18:05
efriedCan I set it up after rpc_setup?18:05
thorst_adreznec: is this one still needed?
adreznecBut we can't merge it until the dep goes in18:10
adreznecIt can't pass jenkins before that...18:10
thorst_right right...just thought we only had one to go.  But then I remembered Ashana's18:20
adreznecI forgot about that one18:24
*** k0da has joined #openstack-powervm18:25
*** kotra03 has quit IRC18:31
*** catintheroof has joined #openstack-powervm18:49
thorst_esberglu: this hardware stuff is so time consuming...19:10
*** apearson_ has joined #openstack-powervm19:11
efriedthorst_ mdrabe: 3764 ready for y'all.19:13
*** apearson has quit IRC19:14
esbergluthorst_: What do you have to do for it now?19:16
thorst_I have to rewire 4 ethernet switches, 4 san switches, etc... before I can even start talking about wiring the servers19:17
thorst_taking apart an old cloud19:17
thorst_its really unraveled into something amazing.19:17
*** tsjakobs has quit IRC19:28
*** tjakobs has joined #openstack-powervm19:30
*** apearson__ has joined #openstack-powervm19:54
*** apearson_ has quit IRC19:58
*** thorst_ has quit IRC21:16
*** smatzek has quit IRC21:18
*** catintheroof has quit IRC21:21
*** svenkat has quit IRC21:25
*** edmondsw has quit IRC21:25
*** tblakes has quit IRC21:38
*** thorst_ has joined #openstack-powervm22:06
*** burgerk has quit IRC22:10
*** thorst_ has quit IRC22:10
*** thorst_ has joined #openstack-powervm22:14
*** apearson__ has quit IRC22:14
*** tjakobs has quit IRC22:15
*** esberglu has quit IRC22:16
*** mdrabe has quit IRC22:19
*** thorst_ has quit IRC22:31
*** thorst_ has joined #openstack-powervm22:32
*** smatzek has joined #openstack-powervm22:38
*** thorst_ has quit IRC22:40
*** smatzek has quit IRC22:44
*** svenkat has joined #openstack-powervm22:50
*** k0da has quit IRC22:57
*** svenkat has quit IRC23:01
*** thorst_ has joined #openstack-powervm23:31
openstackgerritMerged openstack/networking-powervm: Enforce limit of VLAN clean ups in each pass
*** thorst_ has quit IRC23:46
*** thorst_ has joined #openstack-powervm23:47
*** thorst_ has quit IRC23:55
