Monday, 2017-04-10

openstackgerritShyama proposed openstack/nova-powervm master: FileIO adapter does not remove mappings on detach
thorstefried: are we still held up on CI?  I don't see esberglu on, but I'm 99% sure that's where we left off on Friday13:28
efriedthorst I posted a recheck last night...13:29
efriedbut it failed.13:29
thorstgoing to look it up...13:29
efriedIn a different way than last time.13:29
efriedI'll recheck it again now, but this may be something esberglu needs to look at.13:30
thorstyeah...recheck that again.  Possibly new keystone specific tests13:30
thorstsigh...I feel like even OOT may want to go whitelist13:30
thorsthowever, you have some legit failures in there too13:30
thorstHTTP error 400 for method PUT on path /rest/api/web/File/contents/a2335cef-4520-4657-87e3-7dd7eee1cc3d: Bad Request -- REST002C Content-Length specified in header does not match that of the meta file: 104,857,60013:32
efriedSame as before.13:34
thorstefried: do we handle that with the Iterable facade thing up in nova-powervm13:34
thorstwhile we work a proper fix for pypowervm 1.1.213:34
efriedWell, theoretically.13:35
efriedThat is, we never saw this problem when using IterableToFileAdapter.13:35
thorstmight as well try it...13:35
efriedI'm buggered if I understand why it's different from the façade I have in pypowervm.13:35
efriedIt also appears as though that pypowervm patch is still NOT in play here.13:36
efriedBecause the File element still has the ExpectedFileSizeInBytes field in it.13:36
thorstefried: yeah...we were trying a pypowervm patch that would remove that...but I think that we really want that facade anyway so we support older versions of nova-powervm13:37
efriedthorst Does not compute.13:37
thorstyeah...ok.  Let me rephrase13:38
efriedIf the façade is in nova-powervm, how are we supporting older versions of nova-powervm?13:38
efriedYou mean older versions of pypowervm?13:38
thorstwe can backport the facade into ocata and newton...which should work with the older versions of pypowervm13:38
efriedthorst That's already done.13:38
thorstI thought the older ones were still using coordinated?13:39
thorstthey aren't?13:39
efried (ocata) and (newton) are using IO_STREAM instead of FUNC.13:39
efriedAnd will use coordinated or streaming purely depending on which pypowervm is under 'em.13:40
thorstI'll recheck those...13:41
efriedtjakobs__ and I tested both of those last week.  Coordinated was a hair slower, but both worked.  Right tjakobs__ ?  I can't remember which one(s) hung.13:41
tjakobs__newer pypowervm with older nova_powervm was hanging. With backport/new nova_powervm patch I haven't seen it hang, but the deploys took an extra 15-20 ish seconds (of like 20-30 before)13:43
tjakobs__of a 2G image13:43
thorstthe previous runs (April 7th) timed out.  I'm re-running now13:43
thorstefried: could those be stuck because of a stale marker lu?13:46
thorstesberglu: did you just wipe out the current runs?13:47
thorsto, they all just came back as 'aborted'13:48
thorstall the current rechecks13:48
thorstalso, it does look like we have stale marker LU's in some of the SSPs13:48
esbergluStill having net issues?13:48
thorstthey said we were...but I'm not seeing any13:48
esbergluHmm idk, I guess just kick off another round of checks13:49
thorstcan you check for stale marker LU's?13:49
efriedthorst esberglu FYI, I opened a screen session on my pok victim on Friday evening and it's still alive now.  So the network couldn't have been TOO bad since then.13:53
thorstk.  based on logs I'm betting there are some.  I can then recheck them after we clean them out (unless I'm wrong)13:53
efriedI want to get this resolved so I can bask in the glow of the TaskFlow change I made.13:54
efriedReally chuffed about that.13:54
thorstefried: I want to get this resolved so that we can fix the 10+ merges we have in the backlog13:54
efriedYeah yeah, that too.13:54
adreznecefried: Interesting, all my SSH connections from RCH->POK died over the weekend13:54
efriedadreznec SSH connections seem to die when idle for too long.  Screen session was continuously posting changes, so stayed alive.13:55
efriedOr perhaps rch-pok conn had separate problems.13:55
thorstefried adreznec: Yeah, I think connections within POK are OK13:56
thorstbut from outside in weren't13:56
efriedI'm connecting from aus13:56
adreznecefried: Yeah, I have tweaks to my SSH config that usually stop connections from dropping13:56
efriedadreznec You'll have to share those with me.  That's annoying.13:56
efriedSeems like it didn't used to do that.  Not sure what changed.13:56
efriedNever looked into it.13:56
efriedthorst Reinstating the stupid IterableToFileAdapter in master...13:57
adreznecefried: It's mostly just setting the ServerAliveInterval/ClientAliveInterval and setting TCPKeepAlive13:58
efriedadreznec Yeah, okay, I don't even know where that config file lives.  Never used it before.13:58
adreznecUsually /etc/ssh/sshd_config on the server and ~/.ssh/config on the client13:59
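
The keepalive options adreznec names live on different sides of the connection: ServerAliveInterval is a client option (~/.ssh/config), ClientAliveInterval is a server option (/etc/ssh/sshd_config), and TCPKeepAlive exists on both. A minimal sketch that renders the corresponding lines; the host pattern and the 60-second interval are illustrative assumptions, not values taken from this conversation.

```python
def render_client_stanza(host_pattern, interval_seconds):
    """Return an ssh_config Host block that sends keepalive probes."""
    lines = [
        "Host {}".format(host_pattern),
        "    ServerAliveInterval {}".format(interval_seconds),
        "    TCPKeepAlive yes",
    ]
    return "\n".join(lines) + "\n"


def render_server_lines(interval_seconds):
    """Return sshd_config lines that probe idle clients."""
    lines = [
        "ClientAliveInterval {}".format(interval_seconds),
        "TCPKeepAlive yes",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # "*" and 60 are assumed example values.
    print(render_client_stanza("*", 60))
    print(render_server_lines(60))
```

The client stanza alone is usually enough for a user who cannot edit the server's sshd_config.
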
esbergluthorst: Stale LUs are cleaned up14:04
thorstrunning a recheck14:04
thorstesberglu: can you send a note with steps on how we clean up dead jenkins nodes so that we can back you up a bit while we're debugging14:05
esbergluthorst: All I have been doing is logging onto the mgmt node14:09
esbergluAnd running14:09
esberglusudo nodepool delete <node number>14:09
esbergluThe node number is the number at the end of the instance name14:09
esbergluSo 129014:10
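
The cleanup step esberglu describes (take the number at the end of the instance name, pass it to nodepool delete) can be sketched as below. The instance name used here is made up for illustration, and the helper names are hypothetical; the sketch only builds the command and does not run it.

```python
import re


def node_number(instance_name):
    """Return the trailing run of digits in an instance name, as a string."""
    match = re.search(r"(\d+)$", instance_name)
    if match is None:
        raise ValueError("no trailing number in {!r}".format(instance_name))
    return match.group(1)


def delete_command(instance_name):
    """Build (but do not run) the nodepool delete command for a node."""
    return ["sudo", "nodepool", "delete", node_number(instance_name)]


if __name__ == "__main__":
    # "example-node-1290" is an assumed name; only the trailing number matters.
    print(" ".join(delete_command("example-node-1290")))
```
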
thorstahh, easy enough14:10
openstackgerritEric Fried proposed openstack/nova-powervm master: create_disk_from_image: IO_STREAM instead of FUNC
efriedthorst esberglu ^^ with IterableToFileAdapter.14:14
thorstlets see how stable/ocata goes first14:14
efriedShrug, the patch set should kick off its own check.14:15
thorstefried esberglu: the jobs are failing again14:48
thorstlooks like it failed while running testr...14:49
esbergluBy failing do you mean it aborted again?14:49
thorstyeah, look at run 14414:49
esbergluLooks like the master patch made it all the way through but failed everything14:55
esbergluNot sure what is going on for 14414:55
thorstI'm watching 145 now...14:56
thorstbut meh14:56
esbergluefried: Why don't we need 5109 anymore?14:58
efriedesberglu Well, the error we were seeing - the one about file size mismatch in Content-Length header - that check is only done if a) the File object is created with the ExpectedFileSizeInBytes set, and b) the Content-Length header is set in the upload HTTP request.14:59
efriedFor some reason, when we use IterableToFileAdapter, Content-Length doesn't get set by the requests python module, so it's aaight.15:00
efriedBut when we use the façade in pypowervm that does substantially the same thing, the requests module sets Content-Length to zero.15:00
mdrabeefried: Is gonna fix what you're talking about?15:01
efriedI just updated the master branch change set - the one that fixes the hang, where we go from FUNC to IO_STREAM - to use IterableToFileAdapter again.15:01
efriedSo - if the theory holds - we won't see that error now, because the requests module won't set Content-Length.15:01
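
For readers unfamiliar with the adapter idea under discussion, here is a minimal sketch of wrapping an iterable of byte chunks in a file-like object. This is an illustration written for this log, not the nova-powervm IterableToFileAdapter itself; the class name is hypothetical. The object offers nothing but read(), so a caller has no way to ask it for a length, which may be relevant to the Content-Length behaviour efried describes, though the log does not establish the cause.

```python
class IterableToFileSketch(object):
    """Expose an iterable of byte chunks through a file-like read(size)."""

    def __init__(self, iterable):
        self._iterator = iter(iterable)
        self._leftover = b""

    def read(self, size):
        """Return up to `size` bytes, or b"" once the iterable is exhausted."""
        chunk = self._leftover
        try:
            # Skip empty chunks so b"" is only returned at end of stream.
            while not chunk:
                chunk = next(self._iterator)
        except StopIteration:
            return b""
        # Keep whatever did not fit for the next call.
        self._leftover = chunk[size:]
        return chunk[:size]


if __name__ == "__main__":
    stream = IterableToFileSketch([b"abcd", b"", b"ef"])
    while True:
        data = stream.read(3)
        if not data:
            break
        print(data)
```

Note that read() may return fewer bytes than requested before the stream ends, because it never joins two chunks; callers should loop until they get an empty result.
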
efriedmdrabe No.15:02
efriedmdrabe It ought to fix a bug we have where we don't want to do retries on a partially-consumed stream.15:02
efriedBut otherwise I don't think it affects anything.15:03
efriedWe still need to finish that dialog with thorst though.15:03
mdrabeYea I'd like to get that through, forgot about it for a bit15:04
esbergluefried: We are still seeing that error on patch set 4 of the master changeset15:06
esbergluThe newest patchset15:07
efriedesberglu The Content-Length mismatch error??15:07
esbergluI was looking at the wrong results15:08
esbergluefried: It doesn't look good though15:09
thorstwell, are they all identity exceptions?15:09
efriedesberglu Yeah, this doesn't look like problems in our code.15:09
thorstthen that looks GREAT15:09
efriedThis is what we needed - bugs in other projects piled on top.15:10
thorstnot all identity15:10
esbergluLooks like it can't connect to the mysql db15:10
thorstesberglu: do we capture any database logs?15:12
*** tjakobs__ has joined #openstack-powervm15:14
thorstdo we know if other CI's are hitting this?15:15
thorstI suspect this is a root for many things15:15
thorstI'm wondering if we're running out of memory15:15
esbergluI think that this all ties back to the production env. issues15:16
esbergluI kicked off another recheck on the master set15:16
esbergluAnd will look at the results on staging15:16
thorstesberglu: PM us the IP its running on15:17
thorstwe will want to look at two things15:17
thorst1) Disk spaces15:17
thorst2) Memory capacity15:17
thorstthe dstat-csv gives us insight into memory...15:17
thorstyeah...I think we're out of memory on these things15:22
thorstcan we bump them up to 12 GB memory?15:22
thorstmaybe at the start of a run...and see if that helps15:22
thorstthe 8 GB may not cut it anymore...15:22
adreznecI think last month there were issues in the gate with memory consumption15:23
thorstbuff is down to 4 MB at end, cache is down to 166 MB at end, and free is down to 200 MB at end...15:23
thorstthat seems...too low to me15:24
adreznecThey tried constraining mysql/rabbit memory, but iirc it didn't go well15:24
thorstand disk ops just skyrocket at 9:40 (on this run)15:24
thorstwell, we have enough memory now...we just upgraded all nodes15:25
thorstso 12 or hell, even 16 to see if that fixes it...15:25
thorstcrap...I need to run to a call now.  efried / esberglu - if you want to hop on a node and try bumping for the next run, it seems worthwhile IMO to try that15:26
efriedI don't know how to do that.  adreznec ?15:32
thorstbleh...we can't update the max memory15:32
thorstI just tried it while I'm on hold.15:32
thorstwe'll need to update the flavor and rebuild the ready nodes...15:32
thorstI'm hopping into the under cloud to see if I can't update the flavor...then we'll need to clean out some nodes.15:33
thorsthorizon so slow....15:33
adreznecYeah, you should just be able to edit the flavor directly, then delete all the nodes and they'll get recreated15:33
thorstundercloud is basically out of disk space.15:34
adreznecthe SSP is out of disk space15:34
adreznecOr the controller is out of disk space15:34
adreznecLet me ping the guy in control of disk space for that node15:35
thorstI have to jump... adreznec...could you take a peek15:35
adreznecthorst: The controller is out of disk space15:35
thorstI think it just needs cleanup15:35
adreznecI have like 20 minutes here, I'll look and see what's up15:35
thorst_afkthx dude15:36
adreznecYup... /dev/mapper/novalink--ci--vg-root  181G  166G  6.4G  97% /15:37
adreznecWhatever I fix we'll have to make sure to get a cleanup utility for it into the CI deployment15:37
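
A cleanup utility of the kind adreznec mentions would first need to decide when the disk is too full. A minimal sketch of that check, using only the standard library; the function names and the 90 percent threshold are assumptions for illustration, and nothing here deletes anything.

```python
import math
import shutil


def percent_used(used, free):
    """Percent of usable space in use, rounded up to a whole percent."""
    return int(math.ceil(100.0 * used / (used + free)))


def needs_cleanup(path, threshold_percent):
    """Return True when the filesystem holding `path` is at or past the threshold."""
    usage = shutil.disk_usage(path)
    return percent_used(usage.used, usage.free) >= threshold_percent


if __name__ == "__main__":
    # Figures from the df line above: 166G used, 6.4G available.
    print(percent_used(166, 6.4))
    # 90 is an assumed threshold.
    print(needs_cleanup("/", 90))
```

With the figures from the df output (166G used, 6.4G available), percent_used gives 97, matching the Use% column.
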
thorst_afklogs didn't seem too gross15:37
thorst_afkso I'm wondering if its glance images15:38
thorst_afkhorizon UI still won't boot up for me15:38
adreznecthorst_afk: Yeah, it's all images15:53
adreznecWe had 6 15G images on there15:54
adreznecNot sure why nodepool isn't cleaning up the old images...15:55
adreznecMaybe something to do with all the rebuilds15:55
*** shyama has joined #openstack-powervm15:59
thorst_afkadreznec: did you update the flavor?16:07
thorst_afkI just updated it16:10
adreznecthorst_afk: Sorry, people stopped in my office16:11
adreznecthorst_afk: did you bump it to 12 or 16?16:12
adreznecnvm, looks like 1616:12
thorst_afkI also bumped disk16:18
thorst_afkI think I'm going to wipe the nodepool VMs now16:18
thorst_afkso that they rebuild at the new size16:18
thorst_afkadreznec: now its complaining that it can't find the flavor16:27
adreznecthorst_afk: Did the flavor id change or something? Not sure how nodepool is actually looking for the flavor16:28
thorst_afkI think nodepool just needs to be restarted16:29
thorst_afkcan I just do a nodepool service restart?  I seem to remember there was funkiness with that16:29
adreznecthorst_afk: I think so... but not 100% sure. Guess you could always manually stop/start if it fails16:30
thorst_afkadreznec: I think it just needed a restart in order to cache a new flavor16:34
adreznecthorst_afk: Did you actually delete and recreate the flavor?16:34
thorst_afkno, I hit edit...16:34
thorst_afkbut if I remember right, you can't edit flavors16:34
thorst_afkthey actually create a new one under the covers and soft delete the old16:34
adreznecHmm yeah, that sounds right16:34
thorst_afkas there are VMs using the old16:34
thorst_afkhuh...still stuck in building16:54
thorst_afkefried: I learned something interesting.17:03
thorst_afkduring a long upload, the compute service stops responding17:04
thorst_afkat least with the undercloud17:04
thorst_afkthough that looks like its coordinated.17:05
*** thorst_afk is now known as thorst17:16
*** shyama has joined #openstack-powervm17:22
thorstefried: new runs actually have VMs...running17:51
esberglu_thorst: efried: Have you guys been manually cleaning out the nodepool nodes?18:08
esberglu_Or are they somehow magically deleting themselves again18:08
*** esberglu_ is now known as esberglu18:15
thorstesberglu: well, I did a mass delete18:18
thorstalso it looks like at least ONE went through as a success18:18
thorstadreznec did clear out some images18:19
esbergluYeah ocata and newton runs both passed18:19
thorstits not clear to me if the runs that just finished cleaned themselves up afterwards18:19
thorstbut hey - 150 and 151 went through ok!18:19
esbergluThey did not clean up18:19
thorstesberglu: yeah, looks like we still have the offline issue18:19
thorstso 'derp'18:20
thorstI'm still of mindset maybe in production we hammer this and add something that calls off to the nodepool and does a nodepool delete18:20
thorstas the final part of the jenkins job18:20
esbergluAborted connection 2518 to db: 'nodepool' user: 'nodepool' host: 'localhost' (Got an error reading communication packets)18:22
esbergluThat's all over the mysql error log18:22
esbergluThat might have something to do with it18:23
thorstesberglu: to get the successful runs, we did have to change two things18:23
thorst1) Increased memory to 16 GB18:23
thorst2) Increased disk to 30 GB.18:23
thorstso I think you'll want to push a change to neo-os-ci that reflects that18:23
esbergluI was planning on doing 1 anyways since we got the mem upgrades18:23
esbergluJust hadn't gotten to it yet18:24
esbergluI will push that up18:24
*** shyama has quit IRC18:28
thorstefried: you back yet?19:16
efriedthorst Sorry, just saw this.  Sup?19:24
thorstbasically, runs are going through...we'll want to make sure we're primed to merge them19:24
thorststill some hokiness to them19:25
efriedthorst "compute service stops responding" smells like that threading issue.19:27
thorstyeah, once we get these patches in...we will want to update the undercloud with your changes19:27
efriedthorst So should we merge the ocata & newton changes?19:29
thorstlets wait for master to finish19:29
thorstthen bust through all three at once19:29
thorstthen bust through some ceilometer-powervm ones19:30
esberglumaster just passed19:51
esbergluefried: thorst: ^19:52
efriedMerge, quick!19:52
thorstefried: I already +2'd19:52
thorstyou want the W+1 glory?19:52
thorstafter that, I'll hit up the ceilometer-powervm rechecks19:53
efriedthorst I'm going to need a consult on some of mriedem's comments on #2.19:55
efriedGreat, a merge conflict too.19:57
efriedcrap! on #1 too!19:57
efriedstupid cursive19:58
openstackgerritMerged openstack/nova-powervm master: create_disk_from_image: IO_STREAM instead of FUNC
thorstefried: put some comments in20:08
efriedthorst thx20:09
thorsthe totally gets all the frustrations I had20:09
thorstxag - wtf is that20:09
thorstadapter.helpers - totally not clear that it returns a copy20:09
thorstefried: you have a chance to help Jay?20:17
thorst*do you have time to help him20:17
thorstwith devstack issues20:17
efriedthorst I can try.  Kind of the blind leading the blind.20:18
efriedWhat does he need help with?20:18
thorstlet me PM you the error...its large  :-)20:18
thorstesberglu: I just kicked off a bunch of rechecks20:19
esbergluthrost: try again, none of them got picked up....20:24
esbergluzuul died20:24
esbergluthorst: ^20:25
thorstwhy did zuul die?  I see you added a recheck comment already to at least one20:26
esbergluThe production management seems to be hosed20:26
thorstdisk space?  I wonder if we should just scrap it and let it rebuild from scratch...20:27
esbergluthorst: I wouldn't be against that at this point20:27
thorstlets get this recheck through and then maybe that's what we bite off tomorrow20:27
esbergluIt has plenty of disk20:28
thorst --> Does that need a recheck again?20:28
esberglu160G available20:28
esbergluNope that went through on my recheck20:29
esbergluThat's the only one going20:29
esbergluSo if there are other ones you kicked off, they will need another recheck comment20:29
thorstk.  I just did 3 more20:30
esbergluYep I see them20:30
esbergluthorst: 448134 failed because it couldn't resolve github.com21:00
esbergluKicked off another one21:00
jay1_efried: I have been getting the error "Error [Errno 2] No such file or directory: '/opt/stack/pypowervm/' while executing command python" while stacking with Install_Pypowervm=false21:23
jay1_any idea how to bypass this?21:24
efriedjay1_ I'm working on it.21:31
efriedSomething is causing pvm-novalink stuff to get uninstalled during stacking.21:31
efriedThat's gonna break the world.21:31
efriedadreznec You around to help diagnose deb pkg dependency boggle?21:31
efriedoh, but wait, stacking actually succeeded this time around.21:32
jay1_efried: sure..21:32
efriedNone of the novalink services started properly because pvm-rest-app ain't there.21:33
efriedIma see if they'll start after I install that.21:33
efriedBut mebbe not cause not sure if the pvm_admin group is gonna show up right.21:33
jay1_user neo is already member of pvm_admin now.21:36
efriedI was able to start the compute and networking services.  But several of the services are showing RPC timeouts - something still wrong with rabbitmq I think.21:37
jay1_I can see the current status of rabbitmq-server as Active21:39
efriedI suspect there's a dep conflict between arping and iputils-arping.  The former seems to be a pkg req of pvm-novalink; the latter of... something stacky.21:39
*** smatzek has quit IRC22:06
jay1_efried: pls let me know once the issue is fixed.22:08
efriedjay1_ It's not going to be a simple fix.22:08
*** esberglu has quit IRC22:09
*** esberglu has joined #openstack-powervm22:09
*** esberglu has quit IRC22:13
*** esberglu has joined #openstack-powervm22:22
*** esberglu has quit IRC22:26
*** apearson has quit IRC22:29
*** jay1_ has quit IRC22:32
*** edmondsw has joined #openstack-powervm22:53
*** edmondsw has quit IRC22:57
*** thorst has joined #openstack-powervm23:02
openstackgerritMerged openstack/networking-powervm master: Remove INSTALL_PYPOWERVM
openstackgerritMerged openstack/nova-powervm master: FileIO adapter does not remove mappings on detach

