Sunday, 2014-01-19

00:02 <fungi> i was able to launch one myself in az2 (and delete it)
00:33 <fungi> oh, i see one issue... we have name-filter: 'Performance' on the bare-precise entries for hpcloud
00:37 <fungi> that explains why we're getting the flavor error
00:37 <clarkb> odd that it seems to have recently stopped building slaves though
00:38 <fungi> yeah, however i can launch one myself manually, so something's up with nodepool i'm thinking
00:38 <fungi> still digging
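The name-filter fungi found restricts which flavor nodepool will pick for those nodes; a minimal sketch of the mismatch, using hypothetical field names and flavor names rather than the actual openstack-infra nodepool.yaml:

    # A name-filter of 'Performance' only matches flavors whose names contain
    # that string; if the hpcloud az1-3 regions expose no such flavors, every
    # launch for that image fails with a flavor error.
    import yaml

    config = yaml.safe_load("""
    providers:
      - name: hpcloud-az2
        images:
          - name: bare-precise
            min-ram: 8192
            name-filter: 'Performance'   # too restrictive for this provider
    """)

    flavor_names = ['standard.large', 'standard.xlarge']  # assumed example flavors

    for provider in config['providers']:
        for image in provider['images']:
            name_filter = image.get('name-filter')
            if name_filter and not any(name_filter in f for f in flavor_names):
                print('%s/%s: name-filter %r matches no available flavor'
                      % (provider['name'], image['name'], name_filter))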
00:41 <clarkb> we should fix the performance thing. if you edit the file locally nodepool should just pick it up
00:41 <fungi> yeah, i'm going to
00:41 <fungi> i'm uploading a patch too while i'm thinking about it so we don't forget
00:42 <mordred> clarkb, fungi: we should also figure out how to get off of az[1-3] and on to the 1.1 cloud at HP - because the old azs are going to go away at some point
00:42 <mordred> sdague: yes. we should put that in the check queue - but that requires slightly more tooling - which I'm working on
00:42 <clarkb> mordred "figure out"
00:43 <mordred> clarkb: yeah.
00:43 <clarkb> mordred I'm not sure there is anything we can do. we just increased test time by a major factor
00:43 <mordred> clarkb: we still need to test nodes that are twice the size
00:43 <clarkb> mordred we can use rax for all those tests and hpcloud for single use unittesters
00:43 <mordred> since that's supposed to give us twice the cpu throttling allocation
00:46 <mordred> oh wow: min-ram: 30720
00:46 <mordred> we're asking for pretty large nodes in region-b
00:46 <clarkb> mordred: initial testing of that had very poor results
00:46 <mordred> spectacular
00:46 <clarkb> yes and it didn't help
00:46 <clarkb> well it helped a tiny bit but not 2x
00:46 <mordred> well - maybe lifeless' cloud will save us
00:47 <clarkb> mt rainier is out!
00:47 <mordred> nice!
00:51 <openstackgerrit> Jeremy Stanley proposed a change to openstack-infra/config: Remove incorrect name filters from nodepool config  https://review.openstack.org/67684
00:54 <fungi> aha, okay so manually launching a node in az2 from the webui seems to work, but launching one using novaclient hangs at "Instance building... 0% complete"
00:54 <clarkb> mordred: mattoliverau: I have to say the tmpfs/eatmydata zuul idea was really good. seems to still be humming along
00:55 <clarkb> fungi: weird, nova api version trouble maybe?
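The hang fungi describes is novaclient polling a server that never leaves BUILD. A minimal reproduction outside nodepool might look like this, assuming the python-novaclient 1.1 bindings of the time; the region name, image, and flavor placeholders are illustrative:

    import os
    import time

    from novaclient.v1_1 import client

    nova = client.Client(os.environ['OS_USERNAME'],
                         os.environ['OS_PASSWORD'],
                         os.environ['OS_TENANT_NAME'],
                         os.environ['OS_AUTH_URL'],
                         region_name='az-2.region-a.geo-1')  # assumed region name

    server = nova.servers.create(name='az2-boot-test',
                                 image='IMAGE_UUID',    # fill in a real image
                                 flavor='FLAVOR_ID')    # and flavor before running

    for _ in range(60):                   # poll for ~10 minutes
        server = nova.servers.get(server.id)
        print(server.status, getattr(server, 'progress', None))
        if server.status in ('ACTIVE', 'ERROR'):
            break
        time.sleep(10)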
00:58 <sdague> clarkb: so fwiw, people are still approving stable/havana changes into the gate
00:58 <sdague> there are a few still in there
00:58 <sdague> I'm very tempted to bulk -2 all of stable/havana
00:58 <sdague> to prevent more of that
00:59 <clarkb> sdague: or, and this is eviler, delete the stable havana branch permissions :)
00:59 <sdague> actually, that might be less evil. Then I don't have to bulk unset it
00:59 <fungi> we *could* just remove approve from the all-projects acl entry for refs/stable/*
01:00 <fungi> that way stable release managers can still +2, just can't approve
01:00 <sdague> fungi: sounds good to me
01:00 <fungi> lemme finish formulating this support case with hpcloud first while you discuss amongst yourselves
01:01 <sdague> I also think we should remove reverify completely
01:01 <clarkb> my reason for it being eviler is it is a bit like a coup
01:02 <sdague> because what happens is patch authors will reverify their code because they want it in, and are never reading the ML about things that will or will not work
01:02 <sdague> if it's only cores that can toggle the approved, then it should be a responsible set
01:03 <fungi> how about this... we unset any approval votes on stable changes and send an e-mail to the stable management ml pleading with them not to approve until the bug(s) linked in that message are resolved (and to please pitch in if they can)
01:03 <clarkb> removing reverify entirely has long been my stance; definite +2 for that from me
01:03 <sdague> fungi: so emails aren't helping, that was a set of approves
01:04 <fungi> did e-mails about it go to the -dev ml or the stable branch ml?
01:04 <sdague> -dev
01:04 <fungi> i wonder if some of them don't read -dev as regularly
01:05 <sdague> well they should be
01:05 <fungi> 'course they might just not read any lists regularly, the stable branch ml included
01:05 <sdague> seriously, we can't be going and tracking down every freaking bad actor
01:07 <fungi> oh, my boot test finally went to 100% but now any ssh attempt to the resulting vm gets an immediate connection closed
01:07 <mordred> clarkb: yay re: tmpfs
01:09 <fungi> clarkb: it was pointed out last night that we missed one more thing... i got the rechecks page working again by stopping recheckwatch on the old zuul, copying the pickle and report from it to new zuul and starting the service there
01:09 <mordred> I approve removing +A from refs/stable/*
01:09 <mordred> and then we can let bad actors declare themselves
01:09 <mordred> when they complain
01:12 <sdague> you also need to reset all the approved bits, so reverifies don't happen
01:12 <sdague> or kill reverify
01:12 <mordred> we could block reverify on stable/* too
01:13 <sdague> ok, well, I have a bulk -2 script I can loop on now. Or someone else can take those on
01:13 <openstackgerrit> Derek Higgins proposed a change to openstack-infra/config: Add some dependencies required by toci  https://review.openstack.org/67685
01:14 <mordred> fungi: do we have any example negative lookahead regexes in zuul anywhere?
01:16 <mordred>           ref: ^(?!(refs/stable/.*)).*$
01:16 <mordred> perhaps?
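The pattern is a negative lookahead: it matches any ref except stable branches. A quick check of the exact expression mordred pastes, in plain Python re rather than a zuul layout:

    import re

    ref_filter = re.compile(r'^(?!(refs/stable/.*)).*$')

    assert ref_filter.match('refs/heads/master')       # ordinary branches pass
    assert ref_filter.match('refs/tags/2014.1')        # tags pass
    assert not ref_filter.match('refs/stable/havana')  # stable refs are excluded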
01:18 <sdague> actually, I'm going to bulk recheck all the stable havana jobs
01:18 <sdague> that don't already have a -1
01:18 <mordred> ++
01:18 <sdague> then they'll get a -1 and hopefully the reviewers won't be silly
01:21 <sdague> though with the node starvation, it will slow down the rest of things. But best idea I have.
01:21 <mordred> sdague: why not remove +A?
01:21 <sdague> doesn't solve reverify
01:21 <clarkb> it does
01:21 <sdague> also, I don't have those permissions.
01:21 <clarkb> zuul won't reverify without the votes
01:22 <sdague> ok, well, I already fired off the bulk recheck
01:23 <mordred> well - I just removed +A on stable/* from stable-maint
01:24 <mordred> between the two, let's see how it goes
01:24 <sdague> mordred: cool
01:24 <mordred> I'm going to announce that I've done that too
01:24 <sdague> mordred: or don't, and see who complains :)
01:24 <sdague> who didn't keep up with the list
01:26 <mordred> :)
01:27 <mordred> sdague: so how do we work on fixing the problem if stable/* is blocked for +A?
01:28 <sdague> mordred: a possible patch is in the queue
01:28 <sdague> https://review.openstack.org/#/c/67425/
01:29 <sdague> though I have not tested a stable/havana change behind it
01:29 <sdague> so we could promote that to see
01:34 <sdague> I found and pulled out two more stable/havana changes from the queue
01:34 <sdague> and getting called to dinner, night all
01:38 <clarkb> ok, into areas of I5 with poor service; going to afk now too
01:47 <fungi> that failing glance change in the gate hit a socket timeout pip-installing sqlalchemy on both its pep8 and python27 jobs
01:48 <clarkb> fungi I think rax has had network blips today
01:48 <clarkb> I did not check their status page though
01:49 <fungi> yeah, two different bare-precise nodes in iad
01:50 <fungi> status.r.c says investigating potential issue for next-gen cloud servers in london, but otherwise green
02:00 <fungi> actually, playing around with the az2 problem, i think it may just be our old friend ssh timeout
02:02 <clarkb> 120 seconds not long enough? this edge connection is surprisingly usable for irc
02:09 <fungi> yeah, i think it's taking longer. i was able to get into an az2 i launched after waiting a few minutes
02:28 <fungi> yep, bumping my test script to sleep 300 seconds allows me to continue building...
02:31 <fungi> in fact, we've already got it set to 180
02:31 <fungi> for hpcloud
02:42 <fungi> actually 180 seems to be enough for my tests too
03:16 <fungi> mmm, i'm starting to think nodepoold might be in a bad way with respect to az2, because all those "building" status nodes are from 4-7 hours ago, don't show up at all in nova list, and can't be nodepool delete'd
03:16 <fungi> http://paste.openstack.org/show/61504
03:17 <fungi> sqlalchemy.exc.OperationalError: (OperationalError) (1205, 'Lock wait timeout exceeded; try restarting transaction') 'UPDATE node SET state=%s, state_time=%s WHERE node.id = %s' (4, 1390101192, 1131534L)
03:19 <fungi> i don't think i'll be able to cleanly stop nodepool, but will get it restarted and see whether that helps
03:31 <fungi> huh, actually it stopped cleanly
03:33 <fungi> seems to have fixed the inability to delete at least
03:41 <fungi> yeah, i think it's okay now. i'm deleting all the remaining stale nodes now
05:01 <clarkb> fungi weird, glad to know all is better now
05:03 <fungi> still keeping an eye on it, but probably passing out soon
07:29 <openstackgerrit> Monty Taylor proposed a change to openstack-infra/storyboard-webclient: Allow use of node from packages  https://review.openstack.org/67604
12:53 <sdague> so I'm starting to feel that we need to take the jenkins outage and get the logs fixed, because our fail rate is as high as it was before russellb's concurrency patch, but we're pretty blind on what's causing it without console logs
13:03 <openstackgerrit> Andreas Jaeger proposed a change to openstack-infra/config: Remove non-voting documentation gate job  https://review.openstack.org/67702
13:44 <openstackgerrit> Sean Dague proposed a change to openstack-infra/config: add in elastic-recheck-unclassified report  https://review.openstack.org/67591
14:13 <openstackgerrit> Sean Dague proposed a change to openstack-infra/devstack-gate: Timestamp setup logs  https://review.openstack.org/67086
14:56 <sdague> there also seems to be a zuul issue that if a failing change makes it to the top of queue, it's rerun - http://ubuntuone.com/49l5sd2U7JLrMzx8RGavPU
15:52 <mordred> sdague: I agree with all of the things in your email
15:54 <sdague> mordred: great, now we just need to implement them :)
15:54 <sdague> the early kick-out is something I'd like to understand how we handle
15:55 <sdague> mostly how do we signal back about it
15:57 <sdague> basically, under the current state of things, I don't think icehouse-2 is possible next week
15:58 <sdague> we've got comment typo fixes in the gate queue that have been there for 40 hrs
15:58 <mordred> sdague: early kick is tricky (responded to email)
15:58 <sdague> cool
15:59 <mordred> sdague: the theory is that we should start streaming subunit results back to zuul, I believe
15:59 <mordred> hrm
15:59 <sdague> the gate is at a new level of bad
15:59 <mordred> and/or have the thing running/processing the tests be attached to the gearman bus so that it could read the stream and return a gearman status on detected fail
16:00 <sdague> and we're actually completely blind as to why, because we're losing at least 3/4 of the console logs from elastic search
16:00 <mordred> the work towards getting rid of jenkins was work towards being able to do early kick
16:00 <mordred> I wonder if maybe we should think about how to implement it without not-jenkins
16:00 <mordred> sdague: I _think_ zaro has the scp-plugin fix
16:01 <sdague> yeh, I think it needs implementing by march 1st, otherwise i3 will not be possible
16:01 <sdague> mordred: right, I keep hearing that :)
16:01 <mordred> sdague: :)
16:02 <mordred> sdague: is our incoming velocity higher than usual? or have things just gotten worse in the racy department?
16:05 <openstackgerrit> Monty Taylor proposed a change to openstack-infra/config: Remove reverify entirely  https://review.openstack.org/67708
16:05 <sdague> so we've merged 48 changes since friday
16:06 <sdague> means we're merging only 30 changes / day right now
16:06 <sdague> git log --since=2014-01-17 --author=jenkins | grep '^commit' | wc -l    on openstack/openstack
16:06 <mordred> wow
16:06 <notmyname> wow
16:07 <bknudson> how many could merge in a day?
16:07 <sdague> bknudson: if we weren't resetting... hundreds
16:08 <sdague> but the biggest issue in my mind right now is we're actually completely blind to *why* we are failing, and largely have been for the last 2 weeks
16:08 <openstackgerrit> Monty Taylor proposed a change to openstack-infra/config: Early fail on pep8 in the check pipeline  https://review.openstack.org/67709
16:08 <mordred> sdague: ok. there's two of your things
16:08 <sdague> as we're seeing huge loss of logs going into elastic search
16:08 <sdague> mordred: awesome
16:09 <mordred> fungi: you awake? ^^
16:09 <sdague> also, out of those 48, at least 2 were ninja merges that I had fungi do to relieve some of the fails
16:09 <mordred> sdague: https://github.com/jenkinsci/scp-plugin/pull/8
16:09 <mordred> anybody else who wants to review some java ^^
16:11 <sdague> mordred: we could also post-populate the missing log files not from jenkins. We don't lose them to the log server, just to ES
16:11 <mordred> agree
16:11 <sdague> that would at least let the bulk query for ES actually help us sort issues
16:12 <sdague> there also seems to be a new zuul bug, where if a failing change hits the top of the gate, it gets restarted again instead of thrown out
16:13 <notmyname> mordred: why not run all the project-specific tests before the common integration tests (instead of just pep8)?
16:15 <sdague> notmyname: I think it's a judgement call of bang vs. buck. Throwing out on pep8 takes very little time, and saves a bunch of nodepool resources.
16:16 <mordred> notmyname: I agree with you on that too - but it would require adding some new logic to zuul's config processing, where I can do the pep8 early right now
16:16 <notmyname> ok
16:17 <mordred> notmyname: specifically, I don't have a way of saying "run these _three_ things and then when all are done run this additional thing"
16:17 <mordred> I think it's a feature we need, tbh
16:17 <notmyname> I was looking at the neutron one that just fell off the top of the gate queue (of course causing a gate flush). https://review.openstack.org/#/c/67475/ but I didn't realize those neutron tests were taking longer than the integration ones
16:17 <notmyname> mordred: isn't a feature of zuul the composability of the jobs? run set one, then set two
16:19 <zaro> clarkb: i've added the additional logging to scp-plugin.  new build is on review-dev.o.o
16:19 <sdague> notmyname: well, actually neutron at top was the other issue
16:19 <sdague> where that thing failed deeper in the queue
16:19 <fungi> zaro: jenkins-dev?
16:19 <sdague> the stuff above it merged
16:19 <sdague> and zuul then reconnected the pipeline to it
16:19 <zaro> ohh yeah, jenkins-dev.
16:20 <sdague> <sdague> there also seems to be a zuul issue that if a failing change makes it to the top of queue, it's rerun - http://ubuntuone.com/49l5sd2U7JLrMzx8RGavPU
16:21 <bknudson> I've seen the failure from 67475 in another review, since I was just looking into why the other one failed ... "tempest.scenario.test_cross_tenant_connectivity.TestNetworkCrossTenant.test_cross_tenant_traffic"
16:23 <sdague> yeh, so that's just which test failed. That's not why it failed. We need to know why that was expected to work and did not
16:23 <notmyname> sdague: when you added the timer to the zuul queue (you did that right?), did you by any chance add in a statsd timing metric for graphite? I'd love to graph the average time a patch spends in the gate queue
16:24 <sdague> notmyname: nope, didn't touch graphite at all
16:25 <sdague> I actually think the graphite metrics in this space are kind of broken. The jenkins interrupt which happens on resetting nodes is often classified as failure
16:25 <sdague> which it isn't
16:25 <sdague> so all the graphite numbers are worse that reality
16:26 <sdague> s/that/than/
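The metric notmyname is asking about would be a statsd timer emitted when a change leaves a pipeline. A minimal sketch, assuming the python statsd client; the metric name and call site are illustrative, not something zuul was doing at this point:

    import time
    import statsd

    stats = statsd.StatsClient('graphite.example.org', 8125, prefix='zuul')

    def report_resident_time(pipeline, enqueue_time):
        # statsd timers are in milliseconds; graphite can then average them
        elapsed_ms = int((time.time() - enqueue_time) * 1000)
        stats.timing('pipeline.%s.resident_time' % pipeline, elapsed_ms)

    # e.g. called when a change leaves the gate pipeline after four hours:
    report_resident_time('gate', enqueue_time=time.time() - 4 * 3600)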
16:26 <bknudson> looks like the failure from 67475 and my own review is a known problem -- https://bugs.launchpad.net/tempest/+bug/1262613
16:26 <jkt> hi there, I'm reading through your openstack-infra/config repo, and have noticed the remark about an ongoing transition towards puppet modules "straight from puppetforge"
16:26 <fungi> sdague: your screen capture doesn't look to me like it's showing what your comment implies
16:27 <sdague> fungi: so I don't have the capture from before
16:27 <sdague> that was already in a fail state
16:27 <jkt> I'm trying to deploy an openstack setup at $job, to be managed by puppet, and I've never done a green-field deployment of puppet before
16:27 <sdague> got moved to the head
16:27 <bknudson> there's an e-r check for bug/1262613 already
16:27 <sdague> then got the entire stream behind it
16:27 <jkt> I'm wondering about the security implications of using a "random" version of code from a "random" site on the web
16:27 <bknudson> the e-r check says it was resolved.
16:27 <sdague> fungi: you kind of have to be watching zuul to see these happen
16:28 <jkt> I mean, I'm OK with using their packages for puppet itself, but blindly loading modules from forge makes me a bit uneasy
16:28 <fungi> sdague: that screen capture says that the failing head was severed and still has 19 minutes remaining until its other tests complete, but a recalculation of all the rest of the gate is underway so the following changes don't have all their jobs started yet
16:28 <jkt> is that really the plan, and do you have something for version management of these?
16:28 <bknudson> oh, I guess I got the wrong bug.
16:28 <mordred> jkt: hi!
16:29 <mordred> jkt: do you mean an install of the openstack-infra testing stuff? or of openstack itself?
16:29 <sdague> fungi: I'm pretty sure that change was off failing, the jobs behind were running, after merge they all reset again
16:29 <sdague> I've seen this twice this morning
16:29 <jkt> mordred: in the end, I'd like to install and configure openstack, but I'm not that far yet. What I'm doing now is getting familiar with puppetizing the infrastructure from day zero
16:29 <fungi> sdague: oh, okay. in that case i'll keep an eye out and see if i can catch it doing what you're suggesting
16:30 <jkt> mordred: and I'm asking here because I'm more or less copying the setup you have documented in that repo
16:30 <mordred> jkt: gotcha. so, those of us in here don't know anything about installing openstack itself via puppet - so I wanted to be clear on expectations :)
16:30 <mordred> jkt: for the other stuff - yeah, it's still in the plans to use what we can from puppet forge
16:31 <mordred> but we don't really do it blindly - we take one module at a time as we can - and already use several key ones - like the puppet-mysql module
16:31 <mordred> that said - one of the other goals of that is to break that repo apart and treat several of the modules like they're forge modules
16:31 <mordred> even though we wrote them
16:31 <clarkb> zaro: thanks
16:31 <mordred> for better lifecycle and composability
16:32 <jkt> mordred: so essentially specifying a version beforehand, in that install_modules.sh script, and running it every now and then on the master?
16:32 <mordred> clarkb, fungi: could you +2/+A the two config changes I posted above - I agree with sdague that we should do them
16:32 <mordred> jkt: that's right
16:32 <fungi> mordred: will 67709 deal more sanely now with the situation which caused us to turn it off before (changes to requirements result in very obscure failures on the pep8 jobs which are hard for devs to diagnose)?
16:32 <jkt> what I like a *lot* in your setup is that everything is in one repo; and having to update modules manually is something which, to me, looks a bit against that goal
16:33 <mordred> fungi: I believe so - but even if it doesn't, I think the tradeoff is worth it for the next couple of weeks
16:33 <mordred> jkt: the things we want in external modules are the things that don't change much - or that _we_ don't change that much
16:33 <jkt> I've just learned about `git subtree add --prefix modules/... ... --squash`, and I have to admit I like it a lot
16:33 <mordred> hehe
16:34 <mordred> we don't use submodules at all :)
16:34 <jkt> subtree != submodule, that's the catch
16:34 <mordred> $ git subtree --help
16:34 <mordred> No manual entry for gitsubtree
16:34 <jkt> it's essentially "get a checkout of that remote ref I specify as ..., squash it as a single commit, and merge it as a subdirectory under the --prefix"
16:35 <jkt> http://blogs.atlassian.com/2013/05/alternatives-to-git-submodule-git-subtree/
16:35 <mordred> ah. interesting. so like a way to keep a super-repo for composability - but not attempt to treat the subtrees as things you'd do dev on in that context
16:36 <jkt> it's also 100% client-side; what gets pushed is a boring old commit
16:37 <mordred> it's also not installed in the version of git in debian
16:37 <mordred> :)
16:37 <mordred> anyway - looks like interesting evening reading
16:38 <jkt> yeah, looks like it only got added in 2012
16:38 <clarkb> subtree is bad for other reasons :)
16:38 <fungi> mordred: oh, i see... 67709 only skips py26/27 and docs, but not other things like requirements checks, dsvm jobs, py33, one-offs and so on
16:38 <jkt> clarkb: I would love to listen to them
16:38 <mordred> fungi: oh! piddle. that's a bug and you're right
16:38 <mordred> hrm. in fact
16:38 <clarkb> biggest problem for us immediately would be no gerrit support
16:38 <mordred> with the new template org - I do not think it's possible to do what I was trying to do
16:38 <fungi> mordred: well, i already approved
16:38 <mordred> clarkb: it doesn't need it
16:38 <fungi> but i can -2
16:38 <mordred> fungi: well, it'll be _something_
16:39 <jkt> mordred: and btw, on that Atlassian page, they even show how to push commits back to the original repo (which provided the subtree contents)
16:39 <mordred> fungi: but yeah - go ahead and -2 and I'll rework
16:39 <clarkb> second problem is you have to manually know to write commits that don't span trees iirc
16:39 <mordred> that's the main problem I'd see
16:39 <jkt> clarkb: it's "subtree", not "submodule", so what you'll see in Gerrit (or any other git client) is a simple commit adding the whole directory at once
16:40 <clarkb> and humans fail at things like that
16:40 <zaro> clarkb, sdague: i tried disabling drafts on review-dev.o.o but was _not_ able to.
16:40 <clarkb> jkt i know
16:40 <sdague> zaro: ok, thanks
16:40 <mordred> zaro: darn
16:40 <clarkb> jkt but you can't have commits that span trees
16:40 <jkt> clarkb: how come?
16:40 <sdague> ok, I need to get away from the computer for a bit. I'll check back in during football later.
16:40 <clarkb> without gerrit support you can't enforce that easily
16:41 <mordred> clarkb: so, early fail - how do we do that without getting rid of jenkins first?
16:41 <clarkb> jkt because then you can't split trees iirc (or can but need filter branching)
16:41 <clarkb> it's a matter of sanity
16:42 <clarkb> mordred our test runner jenkins side (e.g. run-testr) needs to do testr run --subunit and return 1 as soon as a fail happens
16:42 <jkt> clarkb: https://github.com/git/git/blob/master/contrib/subtree/git-subtree.txt shows the split feature (even with stable, i.e. deterministic and non-changing, commit IDs), but I have no experience with them
16:42 <mordred> clarkb: but won't that abort the rest of the test run?
16:42 <clarkb> mordred not hard to do but would require a special subunit parser unless lifeless has a flag for that
16:43 <clarkb> mordred yes
16:43 <clarkb> without changing zuul to understand fail without return 1 I think that is our option
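The "special subunit parser" clarkb mentions could be a small filter on the testr --subunit stream that turns the first failure into a non-zero exit. A sketch assuming the python-subunit v1 ProtocolTestCase API; a true early abort would also need to stop consuming the stream and kill the run, which this does not attempt:

    # usage:  testr run --subunit | python fail-fast-filter.py   (script name is made up)
    import sys
    import unittest

    import subunit

    class FirstFailureResult(unittest.TestResult):
        def addFailure(self, test, err):
            unittest.TestResult.addFailure(self, test, err)
            sys.stderr.write('FAILED: %s\n' % test.id())

        def addError(self, test, err):
            unittest.TestResult.addError(self, test, err)
            sys.stderr.write('ERROR: %s\n' % test.id())

    def main():
        result = FirstFailureResult()
        subunit.ProtocolTestCase(sys.stdin).run(result)
        return 0 if result.wasSuccessful() else 1

    if __name__ == '__main__':
        sys.exit(main())

A wrapper like run-testr could then propagate this filter's exit status rather than waiting on testr's own summary.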
16:44 <mordred> clarkb: yeah - I don't think there's any quick and dirty way to do it - I'm just trying to figure out if there is any conceivable way at all to get there that I could parcel some tasks out to achieve
16:44 <mordred> also - anybody know when we get jeblair back? today? tomorrow? tuesday?
16:45 <clarkb> either tomorrow or tuesday
16:46 <jkt> anyway, thanks mordred and clarkb, your experience is appreciated
16:46 <mordred> jkt: you're welcome - thanks for the pointer to the blog - I'm less unhappy about it than clarkb is - although I've got a really narrow usecase for it I'd like to poke at
16:47 <clarkb> mordred because I have seen people use subtree thinking it fixes the world but really it just changes the problems :)
16:48 <mordred> clarkb: yeah - but it seems like it could be a specific solution for module composability for _us_ instead of puppet librarian or whatnot
16:48 <clarkb> oh and since subtree smashes trees together you have an extra element of license stuff to consider
16:48 <clarkb> mordred no that is the case people thought it would fix
16:48 <clarkb> they wrote r10k instead
16:48 <mordred> but purely for deployment mechanics - not as something I'd expect us to ever check out ourselves
16:49 <mordred> clarkb: I betcha they thought they could use it for composability and development
16:51 <clarkb> mordred: so you are thinking some post-merge step that builds a new tree that only deployments use?
16:51 <mordred> clarkb: yes
16:52 <mordred> and that way we'd only do commits to that repo ourselves to update the commit tracking the external module
16:52 <mordred> it's probably a bad idea still and I should probably figure out r10k
16:53 <clarkb> I think r10k is simpler and it can consume items not in git as well
16:53 <mordred> yeah
16:55 <clarkb> back to jenkins. should I shut down a jenkins and try the new scp plugin?
16:55 <mordred> clarkb: yes
16:55 <clarkb> ok starting that shortly
16:56 <mordred> clarkb: is there any specific reason why the dsvm jobs don't have a template?
16:56 <mordred> or just not gotten to?
16:58 <clarkb> not gotten to
16:58 <mordred> k
16:59 <clarkb> we have been doing that refactor with small deltas to make it easier to review and help prevent massive test breakage
16:59 <mordred> gotcha
16:59 <mordred> I LOVE that work, btw
16:59 <clarkb> mordred: what we need is a template that uses the envinject plugin to set all of the various d-g flags without changing the actual script calls
17:00 <clarkb> since the env vars are what vary test to test
17:00 <mordred> clarkb: we need many things
17:00 <clarkb> I am going to put the scp plugin on jenkins02 because it runs the largest variety of tests
17:00 <mordred> okie
17:00 <clarkb> well 01 does too but 01 is old jenkins and old scp plugin so it's fine
17:01 <clarkb> is fungi still around?
17:01 <clarkb> fungi: any opinions on ^
17:03 <clarkb> 02 is in shutdown mode
17:07 <clarkb> estimated time remaining 54 minutes :(
17:07 <mordred> clarkb: I'm too dumb to have actually followed the whole thing - can you give me the tl;dr on why we have different jenkins versions?
17:07 <clarkb> mordred: we upgraded one jenkins host (02) then went on holidays
17:07 <clarkb> spun up 03 and 04 on the new version but haven't upgraded jenkins.o.o and jenkins01 yet
17:08 <clarkb> we were being conservative
17:08 <openstackgerrit> Khai Do proposed a change to openstack-infra/jenkins-job-builder: Fix references to examples in api documentation  https://review.openstack.org/67712
17:08 <mordred> gotcha
17:10 <Mithrandir> (that jjb change)> what?  Literalinclude looks pretty broken then
17:10 <Mithrandir> and should rather be fixed
17:11 <clarkb> Mithrandir: that is kind of funny
17:11 <clarkb> would ./ work too? that might be more intuitive
17:11 <Mithrandir> I'd be fine with ./, but / meaning "start looking from where this file is located" is.. not how paths work.
17:12 <clarkb> Mithrandir: yup
17:12 <Mithrandir> The file name is usually relative to the current file's path. However, if it is absolute (starting with /), it is relative to the top source directory.
17:12 <Mithrandir> is what the docs say
17:12 <Mithrandir> so the current one should work, barring bugs
17:16 <zaro> the './' didn't work, got the same warning.
17:17 <Mithrandir> can we use a custom sphinx tag instead?  nonbrokenliteralinclude?
17:17 <clarkb> mordred: I am going to afk for a short time while I wait for tests to finish on 02
17:17 <Mithrandir> or override the built-in
17:18 <mordred> clarkb: kk
17:20 <zaro> Mithrandir: how would a custom tag work?
17:20 <openstackgerrit> Brant Knudson proposed a change to openstack-infra/elastic-recheck: Add check for bug 1270608  https://review.openstack.org/67713
17:21 <Mithrandir> zaro: http://sphinx-doc.org/extensions.html, apparently
17:21 * fungi is back... reading now
17:24 <zaro> Mithrandir: i'm not following, is literalinclude an extension?
17:25 <Mithrandir> zaro: it's a directive, I'm not sure if it's core or not.
17:25 <Mithrandir> and we could have our own that works like literalinclude, but with non-crazy semantics.
17:25 <fungi> clarkb: on getting jeblair back, keep in mind that he, i and the rest of the foundation staff will be in utah or in transit most of the week. i'll be working from airline seats and airport lounges most of tuesday and friday
17:26 <clarkb> gah
17:26 <clarkb> it's like the conference madness never ends
17:26 <fungi> also, r10k was the cpu for the sgi o2. what else is it in this context?
17:26 <mordred> Mithrandir: perhaps it has to do with how we're running sphinx
17:26 <clarkb> and we are rotating batters
17:26 <mordred> Mithrandir: and what it thinks the top of our source dir is
17:26 <mordred> rather than being a bug in literalinclude itself
17:26 <clarkb> fungi puppet librarian that actually works
17:27 <mordred> fungi: jesus, really/
17:27 <mordred> ?
17:27 <Mithrandir> according to the docs, if it starts with something other than /, it should be relative to the file.
17:27 <mordred> oh - salt conf
17:27 <Mithrandir> mordred: so either it's a bug in the implementation or the docs.
17:27 <mordred> hrm
17:27 <fungi> clarkb: upgrading scp plugin on 02 sounds good
17:27 <clarkb> fungi great. currently waiting for tests to finish there
17:28 <mordred> Mithrandir: but we're doing extraction with an extension of our own
17:28 <clarkb> I copied the scp.jpi from -dev to my homedir on 02
17:28 <mordred> Mithrandir: so the file that it's relative to might not be the file we think it is
17:28 <Mithrandir> mordred: oh, that might make for extra fun.
17:29 <Mithrandir> mordred: maybe that extension should adjust the paths or something, then?
17:30 <mordred> Mithrandir: so, while I do think that it's a bug somewhere ... yeah - we might want to investigate the yaml extension we're using, and/or the results of pulling docstrings and generating sphinx from them
17:30 <fungi> mordred: well, we happen to be there coincident with saltconf, but it's mainly a couple days mid-cycle for the staff to get face time not at a summit
17:30 <mordred> fungi: wait - so you're saying we don't REALLY get jeblair back, AND we lose you?
17:30 <clarkb> mordred: yes
17:30 <clarkb> you and I are batting next
17:30 <bknudson> there's an e-r check for 1269940 that hit on https://review.openstack.org/#/c/66209/
17:30 <bknudson> but the string in the yaml doesn't match anything in console.html
17:31 <mordred> fungi: just so you know, I don't think it's valuable to anyone for the foundation staff to have face time
17:31 <fungi> mordred: well, i'll probably be worthless wednesday/thursday, but will be working with lossy/high-latency network access on tuesday and friday
17:31 <mordred> fungi: but I have no leg to stand on as I've been afk for several weeks
17:31 <mordred> and i'll be going to brussels week after next
17:31 <clarkb> jeblair is going to brussels too
17:32 <fungi> mordred: annual performance reviews and whatnot. i guess there's some perceived benefit to do face-to-face team building
17:32 <clarkb> we need to clone a few fungis
17:32 <mordred> fungi: but your team is all of openstack - not each other
17:32 * fungi reproduces asexually through spore propagation, so should be doable
17:33 <clarkb> fungi also north carolinians are all robots
17:33 <mordred> fungi: you have ZERO goals separate from the project's goals, or at least you _shouldn't_ have any goals separate from the project's goals
17:33 <mordred> clarkb: ++
17:33 <fungi> mordred: that i agree with. i'd rather see my performance review come from random cross-sections of the project ;)
17:33 <mordred> fungi: ++
17:33 <clarkb> we just cp your AI from one machine to another :P
17:33 <mordred> fungi: in fact, seriously - performance review for foundation staff should be done by the project - maybe using condorcet
17:33 <fungi> clarkb: i haven't figured out how to upload my consciousness yet, but once i get that working we should be able to make copies just fine
17:33 <fungi> we have clouds
17:34 <fungi> mordred: sounds like a motion for the board
17:34 <openstackgerrit> A change was merged to openstack-infra/storyboard-webclient: Allow use of node from packages  https://review.openstack.org/67604
17:35 <mordred> clarkb: oh! you know what - if both fungi and jeblair are afk, we can finally push those HP-specific goals we've been hiding caring about!!!
17:35 <fungi> heh
17:51 <openstackgerrit> Khai Do proposed a change to openstack-infra/jenkins-job-builder: Fix references to examples in api documentation  https://review.openstack.org/67712
17:54 <openstackgerrit> Monty Taylor proposed a change to openstack-infra/storyboard-webclient: Allow people to source bin/setenv.sh  https://review.openstack.org/67714
18:00 <clarkb> jenkins02 is idle now
18:00 <clarkb> fungi: mordred: I am going to turn it off and start it with zaro's scp plugin build
18:03 <clarkb> jenkins02 is starting again
18:03 <bknudson> is there a good way to download logs in http://logs.openstack.org/47/66247/4/check/check-tempest-dsvm-full/0d6e9cc/logs/ ?
18:03 <bknudson> when I wget screen-n-cpu.txt.gz it's downloading forever
18:04 <bknudson> the logs page shows 10M but I canceled the download after 50M
18:04 <clarkb> bknudson: it is a massive file :), if you set the encoding type to gzip you will get gzipped files instead of uncompressed data
18:04 <clarkb> bknudson: it is on disk as 10MB compressed, wget requests uncompressed data though
18:05 <bknudson> I can uncompress it after I download if I can figure out how to get it compressed.
18:05 <clarkb> bknudson: right, you set the encoding header to gzip to get it compressed
18:05 <dims> http://www.commandlinefu.com/commands/view/7180/get-gzip-compressed-web-page-using-wget.
18:08 <clarkb> a quick spot check of console logs in logstash after the new scp plugin went in looks good
18:08 <clarkb> sdague: fungi zaro I figure we let that burn in a bit today then do the others
18:08 <clarkb> I also need to run errands shortly so letting it run for a bit is my excuse to let me do that :)
18:10 <mordred> clarkb: awesome!
18:12 <bknudson> dims: Thanks,  --header="Accept-Encoding: gzip" worked for me.
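The wget flag bknudson lands on just sends an Accept-Encoding: gzip header, so the log server returns the ~10MB compressed file instead of expanding it on the fly. The same request as a small Python sketch (urllib does not transparently decompress, so the bytes written out stay gzipped):

    import urllib.request

    url = ('http://logs.openstack.org/47/66247/4/check/'
           'check-tempest-dsvm-full/0d6e9cc/logs/screen-n-cpu.txt.gz')
    req = urllib.request.Request(url, headers={'Accept-Encoding': 'gzip'})

    with urllib.request.urlopen(req) as resp, open('screen-n-cpu.txt.gz', 'wb') as out:
        out.write(resp.read())  # still compressed; gunzip it afterwards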
18:15 <zaro> clarkb: sounds good.
18:20 <openstackgerrit> Felipe Reyes proposed a change to openstack-infra/jenkins-job-builder: Adds Mercurial SCM support  https://review.openstack.org/61547
18:27 <fungi> clarkb: awesome. i'm happy to iterate over the other jenkins masters later if we want
18:41 <notmyname> if someone wants to abort the jobs running for 65399,3 I'm ok with that. I've got the log I need (https://jenkins02.openstack.org/job/gate-swift-python26/3523/console). Looks like a timing thing, but I'll make sure there is a LP bug for it in case it comes up again
18:45 <notmyname> bug 1224208
18:48 <notmyname> well, actually since it's not at the top of the queue, some failure ahead of it could re-enqueue it and it would pass next time
18:54 <fungi> yep
18:54 <fungi> unless you expect that to be a consistent failure, better to just let it ride
18:55 <openstackgerrit> Monty Taylor proposed a change to openstack-infra/storyboard-webclient: Add tox.ini file to run things via tox  https://review.openstack.org/67721
18:55 <mordred> ok.
18:55 <mordred> that's the craziest thing I've written in quite some time
18:56 <mordred> ttx, fungi, clarkb, sdague ^^ that, I'm not even kidding, allows you to run javascript toolchain stuff via tox and virtualenvs without having to install any javascript toolchain globally
18:56 <openstackgerrit> Andreas Jaeger proposed a change to openstack-infra/config: Early abort documentation builds  https://review.openstack.org/67722
18:56 <notmyname> fungi: nope. I don't expect it to be constant. thanks
18:58 <openstackgerrit> Jeremy Stanley proposed a change to openstack-infra/nodepool: Delete nodes more aggressively  https://review.openstack.org/67723
19:01 <notmyname> fungi: yup. just reset
19:08 <lifeless> o/
19:08 <lifeless> clarkb: flag for what?
19:17 <clarkb> lifeless fail fast which we can do in the runner now that I think of it
19:19 <fungi> clarkb: if we do it in the runner, will that still let it run to completion and simply signal zuul earlier that the job is not going to succeed so it can get a head start on the ensuing gate reset?
19:20 <fungi> or would we be sacrificing remaining test results for the failing job?
19:22 <mordred> fungi: I almost think that sacrificing remaining test results at this point might be acceptable sacrifice
19:23 <mordred> fungi: in fact, given the state of the gate right now - I'd say that sacrificing remaining test results is almost certainly acceptable sacrifice
19:23 <mordred> sdague: ^^ ?
19:23 <fungi> i'm inclined to agree, just curious of the implications
19:23 <mordred> yeah
19:23 <mordred> I mean, I think ultimately it's not what we want
19:24 <mordred> ultimately we want all of the tests to run to completion and we want to fail fast
19:24 <mordred> but perfect might be getting in the way of good here, and just having the test runner hard fail and exit on first subunit stream failure might be what we want today
19:25 <fungi> also, 67723 is intended to relieve me from scraping nodepool list for nodes in a delete state for more than 20 minutes and manually retrying to delete them to get available resources back
19:26 <fungi> left unchecked, they end up accounting for more than 50% of our aggregate quota
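The manual sweep fungi describes, roughly sketched: re-issue nodepool delete for anything stuck in the delete state for more than 20 minutes. The column positions and hours-based age assumed when parsing nodepool list output are guesses, not a verified format:

    import subprocess

    MAX_AGE_HOURS = 20 / 60.0  # 20 minutes, assuming nodepool reports age in hours

    listing = subprocess.check_output(['nodepool', 'list']).decode()
    for line in listing.splitlines():
        fields = [f.strip() for f in line.split('|')]
        if len(fields) < 13 or not fields[1].isdigit():
            continue  # skip table borders and the header row
        node_id, state, age_hours = fields[1], fields[11], fields[12]
        if state == 'delete' and float(age_hours) > MAX_AGE_HOURS:
            # ask nodepool to try the delete again
            subprocess.call(['nodepool', 'delete', node_id])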
19:26 <mordred> fungi: lgtm
19:30 <sdague> mordred: reading scroll back...
19:31 <mordred> sdague: most pressing thing is the idea of doing hard-fail with no run continuation on first fail
19:32 <sdague> so if we can get it out of zuul faster, that's a clear win. Losing the rest of the tests might not be.
19:34 <openstackgerrit> Monty Taylor proposed a change to openstack-infra/storyboard-webclient: Add tox.ini file to run things via tox  https://review.openstack.org/67721
19:37 <mordred> sdague: that's the question - how much of a not-win would it be?
19:38 <mordred> sdague: in the balance between early ejection and keeping the remaining tests after the fail
19:38 <mordred> do we, in our current issues, get a lot of good data from the stuff that happens on tests after the first fail?
19:40 <sdague> well if we end up aborting, we might abort too early for the service logs to continue
19:40 <sdague> I know, for instance, the issue on some of the network tests was the fact that the allocation was taking too long
19:41 <sdague> it actually shows up later in the logs
19:41 <sdague> that would be missed
*** DinaBelova_ is now known as DinaBelova19:42
*** senk has joined #openstack-infra19:42
19:42 <sdague> so basically we can't get fail fast but keep running?
19:43 <sdague> zuul also failed to allocate a dsvm node on job #2, given starvation, I wonder when that is going to show up
19:44 <lifeless> clarkb: fungi: mordred: I've said before :) - don't make zuul parse subunit
19:44 <mordred> sdague: fast fail but keep running is a harder problem. fast fail and stop running is easy and can be done now
19:44 <lifeless> it's a central non-distributed process
19:45 <mordred> lifeless: yeah - I don't mean zuul parse subunit
19:45 <mordred> I mean SOMETHING needs to parse subunit, and that something needs to be able to talk back to the gearman status
19:45 <lifeless> testr can be taught to raise a signal (e.g. run a script that calls something over gear) on detecting a failure
19:45 <lifeless> mordred: so testr is the thing that parses subunit here; I don't see why testr *itself* needs to talk gearman
19:45 <lifeless> I mean it could, but it seems overly intimate coupling
19:46 <mordred> I agree. and in fact, it's problematic for testr to be the one talking gear
19:46 <mordred> because that means that we've violated the trust boundary
19:46 <mordred> we need the thing that is in the context to talk to gearman to be able to peer into what's going on as it happens
19:47 <lifeless> mordred: so move testr to a different context
19:47 <lifeless> mordred: it's not part of the code under test
19:47 <lifeless> mordred: and it can run things remotely, on multiple machines etc
19:47 <mordred> possibly - but now we're once again talking about massive changes that will not happen soon
19:48 <lifeless> so, the reason I'm pushing back on zuul (or even jenkins) parsing the full subunit stream is because 400 test machines will overwhelm 5 jenkins or 1 zuul
19:48 <lifeless> many MB of data to look for one bit
19:48 <mordred> sure
19:48 <mordred> but the reason that I'm pushing back on testr being involved with distributing work across workers
19:48 <lifeless> we could use subunit as the encoding but have testr only send failure signals
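The "call something over gear" idea could be as small as a notifier the test wrapper invokes on the first failure it sees. A sketch assuming the gear client library's Client/Job interface; the registered function name (build:early_failure) and whatever consumes it on the zuul side are hypothetical:

    import json
    import sys

    import gear

    def signal_failure(build_uuid, test_id):
        client = gear.Client('early-failure-notifier')
        client.addServer('zuul.example.org')   # the gearman server zuul already runs
        client.waitForServer()
        payload = json.dumps({'build': build_uuid, 'failed_test': test_id})
        client.submitJob(gear.Job(b'build:early_failure', payload.encode('utf8')))

    if __name__ == '__main__':
        signal_failure(sys.argv[1], sys.argv[2])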
19:49 <mordred> is that there are too few devs hacking on testr and our ability to fix ui issues in it this past year has been very poor
19:49 <mordred> so placing it in a position of more operational complexity at this point would be a bad idea
19:49 <fungi> sdague: i don't think it actually failed to allocate a node. we get that behavior when jenkins fails to start a job i think or screws it up in certain ways which zuul recognizes as a need to restart that job
19:49 <lifeless> mordred: I've merged every patch I've been given :(, but I ack your point
19:49 <sdague> yeh, testr being locked in bzr means testr changes to fix this are basically a non-starter
19:49 <lifeless> sdague: so it's not locked in bzr
19:49 <lifeless> sdague: it just needs tuits to move it
19:49 <clarkb> fungi: yup should reschedule on another node
19:50 <sdague> tuits?
19:50 <fungi> sdague: the round variety
19:50 <lifeless> sdague: http://2.bp.blogspot.com/-op8uJYMwdfI/TYpTkmur5hI/AAAAAAAAAd0/ty8WqHjiS58/s1600/A_Round_Tuit_Picture.jpg
19:50 <mordred> sdague: "round-to-it == tuit"
19:51 <fungi> well, "a round tuit" (you generally only need one)
19:51 <sdague> right. well, we're basically at the point of having to start a separate project to work around testr ui for tempest.
19:51 <mordred> lifeless: in any case - the things you and I are both talking about here are implementation details of the fact that the system overall needs to be designed to handle detect-fail-and-keep-running
19:51 <sdague> so it would be good if the tuits got prioritized
19:52 <lifeless> sdague: I'd like to talk in detail before such a project is started; it might be the right answer, but I suspect not.
19:53 <sdague> it was really clear at the neutron code sprint that we are inflicting massive pain on our developers by them needing to manually dig through the test layers to do the things they need to do
19:53 <lifeless> sdague: is that this discussion, or a different one?
19:53 <sdague> it was a separate one, which sort of joined on the first one
19:54 <sdague> lifeless: I 100% agree it's a less than ideal solution. But testr is locked in bzr in a corner. So going to be pragmatic about it and just having to stop copying and pasting workarounds between projects.
19:55 <lifeless> sdague: you said that already; I got that much.
19:56 <lifeless> sdague: I am happy to prioritise the tuits if they are the actual blocker, but until I understand the problem I won't know if even after doing a move to git a separate project might be the right thing
19:56 <lifeless> sdague: so when I say I want to talk about it, I mean I want to talk about it :)
19:56 <lifeless> sdague: but since it's a different discussion, let's not distract the fix-the-gate thread
19:56 <sdague> sure, fair
19:57 <lifeless> I will have to go out in ~ 10m to do some minor errands, including picking up my pushbike after repair, should be < 1 hr
19:57 <mordred> I do not think that we're going to magic in the real solution to fast-fail-continue-running this week or next or the week after
19:58 <mordred> like, we've deep dived into what needs to happen for it already, and it's not quick, or it would be done already
19:58 <mordred> especially with jeblair largely out for more time the next two weeks
19:58 <lifeless> fail fast and stop is fairly straightforward
19:58 <mordred> yes
19:59 <lifeless> what proposed impl is on the table?
19:59 <sdague> yeh, I'm not sure fail fast is a win at this point if we have to stop
19:59 <mordred> fail fast and stop is straightforward and actionable
19:59 <mordred> but it may not be a win
19:59 <mordred> sdague: would it be worth trying? or are you pretty sure about that
19:59 * mordred is fine either way - just wants to be presenting options when we have them
20:00 <sdague> mordred: given what I know of the delayed network allocations, I think it would have completely masked that issue and made it undebuggable
20:00 <sdague> and I'm going to assume there are other such issues
20:00 <mordred> sdague: so that I understand - you're saying the appropriate logging happened AFTER the test timed out?
20:00 <bknudson> seems like we need the bugs fixed more than we need workarounds
20:00 <sdague> mordred: correct
20:01 <mordred> sdague: ok. I grok
20:01 <mordred> we could put a delay on returning the error to jenkins perhaps
20:01 <lifeless> what's the proposed implementation?
20:01 <sdague> mordred: because 2 minutes later we'd get a message from a network service that it allocated the network
20:01 <lifeless> it will help a little if I know that ;)
20:01 <mordred> wow. gotcha
20:02 <sdague> that was one of the things that russellb saw, that had us bring down the concurrency
20:02 <mordred> lifeless: proposed impl was just to have our testr runner scan for failures and exit 1 if it sees one
20:02 <lifeless> righto
20:02 <lifeless> so a small patch to testr - ack
20:02 <mordred> sdague: maybe we only fail-fast in the gate, and we leave it on run-all-the-way in check?
20:02 <lifeless> ^ this
20:02 <lifeless> I don't think the gate should be able diagnostics
20:03 <lifeless> except for things that *only* run in the gate
20:03 <lifeless> s/able/about/
20:03 <sdague> mordred: so if we don't keep enough of those logs, we lose our fingerprinting
20:03 <sdague> or potentially lose our fingerprinting
20:03 <mordred> that's true - but if we're ramping up auto-recheck, we should still have tons of check-level fails
20:04 <sdague> sure, but then we don't know what's *actually* failing us in the gate
20:04 <mordred> just ones that aren't also wrecking-balls
20:04 <mordred> nod
20:04 <sdague> the check queue is super noisy
20:04 <lifeless> because bad code + flake
20:04 <sdague> because there is tons of bad code in it
20:04 * mordred has to run - back in a bit...
20:04 <lifeless> how do you determine ignal
20:04 <lifeless> *signal*
20:05 <sdague> lifeless: basically job fail rate check vs. gate
20:05 <lifeless> on the fingerprint of the failure, right?
20:06 <sdague> this is overall fails. We don't have things broken down yet per job the way I'd like. That's blocked by ES having lost data, and just time.
20:07 <lifeless> tuits again :P
20:07 <lifeless> so
20:07 <lifeless> the question is, if we didn't have full gate logs
20:07 <lifeless> it seems like that will limit the ER evolution
20:07 <sdague> yes
20:07 <lifeless> because gate signal will be extremely hard to correlate to checks
20:07 <sdague> correct
20:08 <lifeless> unless we have some canaries - no-op changes that we use to probe for flakiness
lifelessthat run on the gate machines, in the gate trust (thinking ahead to baremetal stuff), but with no changes merged20:08
lifelessand more often than 1/day20:08
sdagueyep, we need probably 100 / day to have a solid baseline20:09
lifelessI'm thinking whilte true: do20:09
lifelessas a starting point20:09
lifelesswhich would get 24 odd20:09
lifelessok, bb in an hour20:09
sdagueyep20:09
fungithat plays back into the idle pool jobs idea20:10
fungibut we'd need a dedicated idle pool if we compromise gate results in that manner20:10
sdagueyes20:11
fungirather than just available nodes to round out the count when the gate is less busy20:11
sdaguecorrect20:12
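
For concreteness, the canary probing lifeless and sdague sketch above could start as little more than a loop that re-runs the gate job against an unchanged tree and logs the outcome. The job entry point and results file below are illustrative assumptions, not an actual infra job definition.

    # Hedged sketch of a canary loop: run the same job repeatedly against an
    # unchanged tree and record pass/fail, building a flakiness baseline that
    # is independent of incoming changes.
    import datetime
    import subprocess

    JOB_CMD = ['tox', '-e', 'full']          # assumed job entry point
    RESULTS = 'canary-results.log'           # assumed results file

    def run_canary_once():
        started = datetime.datetime.utcnow().isoformat()
        rc = subprocess.call(JOB_CMD)
        with open(RESULTS, 'a') as log:
            log.write('%s %s\n' % (started, 'PASS' if rc == 0 else 'FAIL'))
        return rc

    if __name__ == '__main__':
        while True:                          # the "while true: do" starting point
            run_canary_once()
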
*** sarob has joined #openstack-infra20:13
*** salv-orlando has quit IRC20:15
*** luqas has joined #openstack-infra20:15
sdaguehmmm... I really wish we had the data over in ES. I think basically neutron jobs are 100% failing right now in gate, but it's hard to see20:16
*** sarob has quit IRC20:18
fungisdague: well, jobs run via jenkins01 or jenkins02 should be getting their console logs into elasticsearch currently20:18
fungiit's just 03 and 04 which aren't20:18
fungi(well, and jenkins.o.o but it's special-purpose jobs only anyway)20:19
*** afazekas_ has quit IRC20:20
sdaguehttp://logstash.openstack.org/#eyJzZWFyY2giOiJmaWxlbmFtZTpjb25zb2xlLmh0bWwgQU5EIGJ1aWxkX25hbWU6Z2F0ZS10ZW1wZXN0LWRzdm0tbmV1dHJvbi1pc29sYXRlZCBBTkQgbWVzc2FnZTpcIkZpbmlzaGVkOiBcIiBBTkQgYnVpbGRfcXVldWU6Z2F0ZSIsImZpZWxkcyI6W10sIm9mZnNldCI6MCwidGltZWZyYW1lIjoiMTcyODAwIiwiZ3JhcGhtb2RlIjoiY291bnQiLCJ0aW1lIjp7InVzZXJfaW50ZXJ2YWwiOjB9LCJzdGFtcCI6MTM5MDE2Mjk5MTgyM30=20:23
sdagueso of the logs we have, the isolated job in neutron has a 66% failure rate in the gate for the last 48hrs20:23
sdaguebased on that, I think we should remove all neutron from the gate20:24
*** luqas has quit IRC20:25
sdaguethe regular neutron job is at 45% failure - http://logstash.openstack.org/#eyJzZWFyY2giOiJmaWxlbmFtZTpjb25zb2xlLmh0bWwgQU5EIGJ1aWxkX25hbWU6Z2F0ZS10ZW1wZXN0LWRzdm0tbmV1dHJvbiBBTkQgbWVzc2FnZTpcIkZpbmlzaGVkOiBcIiBBTkQgYnVpbGRfcXVldWU6Z2F0ZSIsImZpZWxkcyI6W10sIm9mZnNldCI6MCwidGltZWZyYW1lIjoiMTcyODAwIiwiZ3JhcGhtb2RlIjoiY291bnQiLCJ0aW1lIjp7InVzZXJfaW50ZXJ2YWwiOjB9LCJzdGFtcCI6MTM5MDE2MzA5Mzk5NiwibW9kZSI6InRlcm1zIiwiYW5hbHl6ZV9maWVsZCI6ImJ1aWxkX3N0YXR1cyJ920:25
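
The failure rates sdague quotes come from queries like the two logstash links above. A sketch of the same calculation against the logstash Elasticsearch backend might look like the following; the _count endpoint path is an assumption about the logstash.openstack.org API, and the 48-hour time window is omitted for brevity.

    # Hedged sketch: count SUCCESS vs FAILURE console.html records for a job
    # in the gate queue and report the failure rate, roughly what the
    # logstash.openstack.org links above show.
    import requests

    LOGSTASH_COUNT = 'http://logstash.openstack.org/elasticsearch/_count'  # assumed path

    def count(build_name, status):
        query = ('filename:"console.html" AND build_queue:"gate" AND '
                 'build_name:"%s" AND build_status:"%s"' % (build_name, status))
        body = {'query': {'query_string': {'query': query}}}
        return requests.post(LOGSTASH_COUNT, json=body).json().get('count', 0)

    if __name__ == '__main__':
        job = 'gate-tempest-dsvm-neutron-isolated'
        fails = count(job, 'FAILURE')
        total = fails + count(job, 'SUCCESS')
        rate = 100.0 * fails / total if total else 0.0
        print('%s: %.0f%% failure (%d of %d runs)' % (job, rate, fails, total))
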
*** thuc has quit IRC20:26
*** thuc has joined #openstack-infra20:27
*** yolanda has quit IRC20:29
mattoliverauGood morning infra peeps!20:30
*** thuc has quit IRC20:32
fungisdague: what are the chances for the improvements from the sprint? i gathered a couple of them needed updated patchsets to pass tests, but that combined they should provide drastic improvement. any idea if they're ready to go yet?20:36
fungimorning mattoliverau20:36
mikalMorning20:36
sdaguefungi: honestly, I don't know.20:37
clarkbfungi they didn't pass testing20:39
clarkband were rechecked with "recheck no bug" so there's no debug context20:39
fungiclarkb: right, just didn't know if their respective owners had been working on them since20:39
*** SergeyLukjanov is now known as SergeyLukjanov_20:39
sdaguefungi: I don't know20:39
sdagueI think there is a base failure rate that's crept back up in the isolated jobs now20:39
sdaguepart of the challenge for the week was it was taking 4 - 5 hrs to get check results back20:40
sdagueso the timing of the zuul overload was unfortunate20:40
*** pcrews has quit IRC20:40
fungior the timing of the sprint coincided with the timing of other sprints and general pre-milestone patch rush20:40
sdagueyeh, agreed20:41
sdaguei2 runup, definitely an issue20:41
fungii know of at least one other major openstack project which had a sprint the same week20:41
*** pcrews has joined #openstack-infra20:41
fungibut there may have been more20:41
sdaguebut I think we're basically at a point now where there no longer is a "good" time during the cycle to get together because activity is always so high20:41
lifelessback20:45
*** elasticio has quit IRC20:48
*** mrodden has joined #openstack-infra20:50
*** Ajaeger has quit IRC20:52
*** dcramer_ has quit IRC20:52
*** rakhmerov has joined #openstack-infra20:53
*** mrodden1 has joined #openstack-infra20:53
sdaguefungi: so we are about to get a merge20:55
*** mrodden has quit IRC20:55
sdaguewe can see if the change after the neutron fail is made to start over20:55
sdaguenope, seemed to do the right thing20:56
fungiyeah, i've watched several such incidents since you mentioned it, and haven't seen it happen yet20:57
fungiso must be an odd combo of circumstances triggering it20:58
sdagueI think in those cases where I saw it the jobs were still running on the failed job21:02
sdagueI'm now cron grabbing zuul every 60 seconds so I can reconstruct some of these21:02
fungigood idea21:02
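
The periodic grab sdague mentions could be a one-minute cron entry around a small script like this; the status URL and destination directory are assumptions about where zuul publishes its status JSON and where the snapshots land.

    # Hedged sketch of a per-minute zuul status capture (run from cron):
    # fetch the status JSON once and save it to a timestamped file so queue
    # resets can be reconstructed afterwards.
    import datetime
    import json
    import os
    import urllib.request

    STATUS_URL = 'http://zuul.openstack.org/status.json'   # assumed endpoint
    DEST_DIR = os.path.expanduser('~/zuul-snapshots')       # assumed destination

    def snapshot():
        os.makedirs(DEST_DIR, exist_ok=True)
        stamp = datetime.datetime.utcnow().strftime('%Y%m%dT%H%M%SZ')
        with urllib.request.urlopen(STATUS_URL) as resp:
            data = json.load(resp)
        path = os.path.join(DEST_DIR, 'zuul-status-%s.json' % stamp)
        with open(path, 'w') as out:
            json.dump(data, out)
        return path

    if __name__ == '__main__':
        print(snapshot())
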
clarkbwas it just continuing to run tests detached from the rest of the queue?21:03
clarkbit will do that21:03
sdagueclarkb: it looked like it reset the job below it21:05
*** flaper87|afk is now known as flaper8721:05
*** mrda has joined #openstack-infra21:07
fungisdague: it does that when the change fails, because the change behind it needs to be retested against the branch tip as the new head of the gate, rather than on top of the failing change21:08
fungibut it sounded like you were describing something else, like a second gate reset21:09
openstackgerritEli Klein proposed a change to openstack-infra/jenkins-job-builder: Added rbenv-env wrapper  https://review.openstack.org/6535221:09
sdagueyeh, that's what it looked like21:11
lifelessmordred: on fail-early-keep-running; what if we added a second untrusted geard that can *only* signal failure21:13
*** sarob has joined #openstack-infra21:13
*** sarob has quit IRC21:17
sdagueso I'm basically manually pulling all of neutron out right now21:21
sdagueany neutron or python neutron client job has something like a 5% chance of passing at the moment21:22
sdagueand there were 5 of them in a row in there21:22
mikalWhat's the state of stable at the moment?21:24
mordredlifeless: right. so, honestly I can't remember the state of the design for that - but jeblair has a plan for implementing the complex version of this21:24
mikalRechecks are ok but the gate is still busted?21:24
mordredlifeless: but with all things being perfect, I would expect it to take us at least a month to get all of the various pieces landed21:24
mordredlifeless: the issue there isn't figuring it out - it's just working through the steps to do it21:25
lifelessok21:25
mordredlifeless: today's question is more "are there any less-ideal shortcuts we can take to help the current gate-slam"21:25
*** salv-orlando has joined #openstack-infra21:25
lifelesssure21:25
mordredlifeless: btw - not related to this, but just because it's the other thing I'm hacking on right now ... apparently it's possible to comingle js tooling in a python venv21:26
fungimikal: all of stable is still broken because of some exercises not passing on grizzly (which in turn means grenade can't test upgrading to havana changes)21:31
fungimikal: sdague mass-rechecked all outstanding stable changes so that they would get an obvious -1 from jenkins, to prevent stable cores from approving any more of them21:32
sdagueyep21:32
sdaguefungi: it's actually because cinder can't run because of stevedore version checking explosion21:33
fungiright, but that's where it manifests in the jobs21:34
sdagueyeh21:34
sdaguewell it also manifests in *all* grizzly being broken21:34
sdaguebut stable maint didn't seem to care on that one :)21:34
fungii thought it was only devstack/tempest failures for grizzly, but regardless yeah21:35
sdaguesure, but that means you couldn't land any changes21:35
fungiif it was also failing grizzly changes on non-integration jobs i missed that21:35
*** dizquierdo has joined #openstack-infra21:38
openstackgerritMonty Taylor proposed a change to openstack-infra/config: Use nodeenv via tox to do javascript testing  https://review.openstack.org/6772921:39
mordredthere we go. generic tox-based js-unittest job template that has docs-draft-like functionality21:39
notmyname"BuildErrorException: Server %(server_id)s failed to build and is in ERROR status" seems familiar to me, but I can't seem to find anything in LP21:43
notmynamering a bell with anyone else or is it something new?21:44
bknudsonhttps://bugs.launchpad.net/nova/+bug/126674021:44
notmynamehmm...this is in test_volume_boot_pattern21:45
notmynamesame bug or should it be filed as something new in LP?21:45
notmynamelogs at http://logs.openstack.org/16/66916/1/gate/gate-tempest-dsvm-full/b7f51bb/console.html21:45
*** gokrokve has quit IRC21:46
*** senk has quit IRC21:47
bknudsonnotmyname: I opened this bug https://bugs.launchpad.net/nova/+bug/127060821:47
bknudsonwhich has the same log message from n-cpu.21:47
notmynamebknudson: thanks. I'll use that one21:47
bknudsonnotmyname: I added a e-r check for it https://review.openstack.org/#/c/67713/21:48
portantenotmyname: I filed https://bugs.launchpad.net/cinder/+bug/1270350 so that I could find it searching LP21:55
portantebknudson: shall I close that one in favor of 1270608?21:56
mikalsdague: I think you missed at least one, because my script just rechecked it21:56
mikalsdague: so, same outcome...21:56
mikalOh, no I see.21:57
mikalsdague rechecked it, jenkins passed21:57
mikalThis is yesterday21:57
mikalhttps://review.openstack.org/#/c/6220621:57
openstackgerritMonty Taylor proposed a change to openstack-infra/config: Genericize javascript release artifact creation  https://review.openstack.org/6773121:57
mikalAhhh, because grenade is non-voting for neutron21:58
sdagueyeh21:58
sdagueneutron, ceilometer, and oslo still pass on havana21:58
sdaguebecause they don't attempt to upgrade21:58
mikalThat's good. Users never upgrade those things.21:58
sdaguenope21:59
sdaguebknudson: before putting this through - https://review.openstack.org/#/c/67713/ question in line21:59
notmynamesdague: on the gate timings, how feasible is adding the "time in gate" to the log message (both success and failure)? without a statsd timing metric, it would at least give the ability to track how long a piece of code stays in the gate21:59
sdaguenotmyname: to what log message?22:00
notmynamesdague: the jenkins message in gerrit22:00
sdaguedon't know22:01
sdagueyou could dive into the zuul code to see22:01
*** gokrokve has joined #openstack-infra22:01
notmynamesdague: ie I'm now looking at another many hours to get https://review.openstack.org/#/c/66916/ to the top of the gate (which has a 61% chance of passing). and the zuul status page has now been reset to 1 minute22:01
notmynamesdague: got a starting point to look at?22:01
sdaguenope, I don't know zuul code very well, I just dove in to do the enqueue time stuff22:02
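
Short of touching zuul itself, the time-in-gate number notmyname wants could be approximated from the same status JSON the status page uses. The item layout and the enqueue_time field (assumed to be a millisecond epoch timestamp) below are guesses for illustration, not a documented zuul interface.

    # Hedged sketch: walk the gate pipeline in zuul's status JSON and report
    # how long each item has been enqueued.
    import json
    import time
    import urllib.request

    STATUS_URL = 'http://zuul.openstack.org/status.json'   # assumed endpoint

    def gate_ages():
        with urllib.request.urlopen(STATUS_URL) as resp:
            status = json.load(resp)
        now_ms = time.time() * 1000
        for pipeline in status.get('pipelines', []):
            if pipeline.get('name') != 'gate':
                continue
            for queue in pipeline.get('change_queues', []):
                for head in queue.get('heads', []):
                    for item in head:
                        enqueued = item.get('enqueue_time')  # assumed ms epoch
                        if enqueued:
                            yield item.get('id'), (now_ms - enqueued) / 3600000.0

    if __name__ == '__main__':
        for change, hours in gate_ages():
            print('%s: %.1f hours in gate' % (change, hours))
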
openstackgerritA change was merged to openstack-infra/elastic-recheck: Add query for bug 1270309  https://review.openstack.org/6759422:02
*** gokrokve_ has joined #openstack-infra22:02
*** gokrokve has quit IRC22:06
*** gokrokv__ has joined #openstack-infra22:06
*** gokrokve_ has quit IRC22:06
openstackgerritA change was merged to openstack-infra/elastic-recheck: Add noirc option to bot  https://review.openstack.org/6752522:07
sdagueso the stable/grizzly fix made it to top of queue now, with any luck22:09
bknudsonportante: is it the same problem? see the e-r query22:10
bknudsonportante: if the query for the e-r works for  https://bugs.launchpad.net/cinder/+bug/1270350 then close https://bugs.launchpad.net/nova/+bug/1270608 as a dup22:11
bknudsonand I'll update the e-r change to use https://bugs.launchpad.net/cinder/+bug/127035022:11
bknudsonportante: I just want there to be an e-r query for it.22:11
portantebknudson: agreed, looking22:13
*** sarob has joined #openstack-infra22:13
*** sarob has quit IRC22:18
*** sarob has joined #openstack-infra22:19
*** salv-orlando has quit IRC22:20
*** sarob has quit IRC22:25
bknudsonportante: logstash query with 'message:"BuildErrorException: Server %(server_id)s failed to build and is in ERROR status" AND filename:"console.html"'22:26
bknudsongets more hits than 'filename:"logs/screen-n-cpu.txt" AND message:"Error: iSCSI device not found at /dev/disk/by-path/"'22:26
bknudsonbut they're all failures either way.22:27
portanteyes, and I made 1270350 a dupe of 1270608 since it is more specific22:27
sdaguebknudson: the n-cpu.txt message is better, as it's specific to the underlying error, not just the symptom it causes22:29
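
To make that point concrete, the two candidate fingerprints can be compared directly by hit count: the console.html query matches the visible symptom, while the screen-n-cpu.txt query pins the underlying error. A hedged sketch, reusing the same assumed _count endpoint as the earlier example:

    # Hedged sketch comparing the two candidate elastic-recheck fingerprints
    # discussed above by raw hit count against logstash.
    import requests

    LOGSTASH_COUNT = 'http://logstash.openstack.org/elasticsearch/_count'  # assumed path

    CANDIDATES = {
        'symptom (console.html)':
            'filename:"console.html" AND message:"BuildErrorException: '
            'Server %(server_id)s failed to build and is in ERROR status"',
        'cause (screen-n-cpu.txt)':
            'filename:"logs/screen-n-cpu.txt" AND '
            'message:"Error: iSCSI device not found at /dev/disk/by-path/"',
    }

    for label, query in CANDIDATES.items():
        body = {'query': {'query_string': {'query': query}}}
        hits = requests.post(LOGSTASH_COUNT, json=body).json().get('count', 0)
        print('%-24s %d hits' % (label, hits))
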
sdaguemordred: so thinking about this more, while we are still at starvation, fast fail doesn't really help all that much, we're still going to be waiting around for nodes to tear down and rebuild22:31
*** dizquierdo has quit IRC22:31
sdaguethat's another bit of why we are hurting right now. We can't restart the changes behind the fail point very quickly22:32
*** salv-orlando has joined #openstack-infra22:39
*** 45PAA4WSM has joined #openstack-infra22:41
*** 45PAA4WSM is now known as jhesketh22:41
*** dcramer_ has joined #openstack-infra22:43
*** yassine has joined #openstack-infra22:43
*** yassine has quit IRC22:43
fungii've got a few things in play to help with some of the starvation... manually removed nodepool tracking for several nodes which are hung deleting at the provider or for a nonexistent provider, cleaning up some stray "alien" vms which nodepool has forgotten it created through unclean daemon restarts, and going to give 67723 a whirl to see if we reclaim some deleted nodes faster22:52
*** gokrokv__ has quit IRC22:52
fungiwe'll still be resource-starved, but at least the available resources should be increased a bit22:52
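
The "alien" cleanup fungi describes boils down to a set difference between what the provider reports and what nodepool still tracks. A toy sketch, assuming each list has been exported to a file with one hostname per line (the file names are illustrative; this is not the actual nodepool tooling):

    # Hedged sketch of alien-VM detection: anything the provider reports that
    # nodepool no longer tracks is a candidate stray left behind by an
    # unclean daemon restart.
    def read_names(path):
        with open(path) as f:
            return {line.strip() for line in f if line.strip()}

    def aliens(provider_list='provider-servers.txt', nodepool_list='nodepool-nodes.txt'):
        # File names are illustrative assumptions.
        return sorted(read_names(provider_list) - read_names(nodepool_list))

    if __name__ == '__main__':
        for name in aliens():
            print('possible alien:', name)
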
*** rakhmerov has quit IRC23:03
*** praneshp has joined #openstack-infra23:05
*** gokrokve has joined #openstack-infra23:08
*** jamielennox|away is now known as jamielennox23:12
sdagueyeh, I've had to walk away from beating my head on the gate for a while. I'm off trying to clean up nova request logs now23:14
*** sarob has joined #openstack-infra23:20
*** rakhmerov has joined #openstack-infra23:36
*** flaper87 is now known as flaper87|afk23:47
openstackgerritlifeless proposed a change to openstack-infra/nodepool: Don't load system host keys.  https://review.openstack.org/6773823:54
openstackgerritlifeless proposed a change to openstack-infra/nodepool: Ignore vim editor backup and swap files.  https://review.openstack.org/6765123:54
openstackgerritlifeless proposed a change to openstack-infra/nodepool: Add some debugging around image checking.  https://review.openstack.org/6765023:54
openstackgerritlifeless proposed a change to openstack-infra/nodepool: Only attempt to copy files when bootstrapping.  https://review.openstack.org/6767823:54
openstackgerritlifeless proposed a change to openstack-infra/nodepool: Document that fake.yaml isn't usable.  https://review.openstack.org/6767923:54
*** sarob has quit IRC23:55
