Friday, 2014-01-17

*** melwitt has joined #openstack-infra00:03
openstackgerritMichael Krotscheck proposed a change to openstack-infra/storyboard-webclient: Customise Bootstrap  https://review.openstack.org/6733700:04
*** zz_ewindisch is now known as ewindisch00:08
*** vipul-away is now known as vipul00:08
*** CaptTofu has joined #openstack-infra00:09
*** ewindisch is now known as zz_ewindisch00:10
*** sarob has joined #openstack-infra00:11
*** MarkAtwood has quit IRC00:13
pabelangerA few weeks / month ago somebody was suggesting a graphic rendering lib for rst docs... it wasn't graphviz but something else00:14
pabelangerthere was some talk about maybe using it for -infra documentation00:14
*** rnirmal has quit IRC00:14
clarkbpabelanger: I think it was hashar, but I forget what the lib was called00:15
pabelangerclarkb: Ya, I thought it was hashar too00:15
zaroclarkb: hey, i just got back.  i'm just finishing up the gerrit testing, was gonna put it aside to start hacking on the scp-plugin tomorrow.00:16
clarkbzaro: great, thanks00:16
pabelangerhttp://blockdiag.com/00:19
pabelangereavesdrop.o.o to the rescue00:19
openstackgerritMichael Krotscheck proposed a change to openstack-infra/storyboard-webclient: Customise Bootstrap  https://review.openstack.org/6733700:21
*** wenlock has quit IRC00:21
mattoliverauI haven't read the email, and this is just me thinking out loud, but in regards to rate limiting how about doing something similar to TCP windowing. Pick a low point that the queue will never be smaller than, say 20. Then every time a patch is merged increase the queue by X, say 1. Every time there needs to be a reset be brutal, like halve the queue size. Requeue the stuff taken off in a high-priority00:22
mattoliverauqueue. This would mean when there are lots of resets the queue will be smaller and smaller, so less ref re-pointing and hopefully push through all the congestion. When working again the queue will continue to build up.00:22
mattoliverauAgain, i'm new, so just my 2 cents.00:22
clarkbmattoliverau: yup that was my thinking00:22
clarkbtcp slow start00:22
clarkbit has its faults, you almost never hit peak efficiency, but it does work at protecting you00:22
mattoliverauclarkb: Lol, missed your comment with the name slow start :)00:23
mattoliverauclarkb: of course, but it's somewhere in between: better than a fixed queue length, but without the problem of a huge queue when zuul is needed most.00:23
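A minimal sketch of the slow-start-style window being discussed, assuming a hypothetical GateWindow helper; none of these names come from zuul itself, and real queue management would live inside the scheduler.

    # Additive-increase / multiplicative-decrease window for a dependent
    # (gate) queue: grow by one on every merge, halve on every reset, and
    # never drop below a floor. Purely illustrative names.
    class GateWindow(object):
        def __init__(self, floor=20, increase=1):
            self.floor = floor
            self.increase = increase
            self.size = floor

        def on_merge(self):
            # a change merged successfully: slowly open the window
            self.size += self.increase

        def on_reset(self):
            # a gate reset: be brutal and halve the window, keeping the floor
            self.size = max(self.floor, self.size // 2)

        def active(self, queue):
            # only the first `size` changes get jobs; the rest stay backlogged
            return queue[:self.size]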
*** vipul is now known as vipul-away00:24
*** mrodden1 has quit IRC00:31
*** vipul-away is now known as vipul00:34
*** fifieldt has joined #openstack-infra00:36
*** ok_delta has quit IRC00:37
*** odyssey4me has quit IRC00:37
*** sarob has quit IRC00:40
*** wenlock has joined #openstack-infra00:40
*** nati_uen_ has joined #openstack-infra00:40
*** smurugesan has quit IRC00:40
*** gokrokve has joined #openstack-infra00:41
*** yamahata has quit IRC00:42
*** michchap_ has quit IRC00:43
*** michchap has joined #openstack-infra00:43
*** nati_ueno has quit IRC00:43
*** odyssey4me has joined #openstack-infra00:46
clarkbthis is interesting: the "run handler sleeping" / "run handler awake" log messages haven't happened for 15 minutes, so that is what is starving us00:46
clarkbsomething is spending a lot of time in the middle of that loop00:46
clarkbsdague: ^00:46
openstackgerritA change was merged to openstack-dev/hacking: Move hacking guide to root directory  https://review.openstack.org/6213200:47
openstackgerritA change was merged to openstack-dev/hacking: Cleanup HACKING.rst  https://review.openstack.org/6213300:47
openstackgerritA change was merged to openstack-dev/hacking: Re-Add section on assertRaises(Exception  https://review.openstack.org/6213400:47
openstackgerritA change was merged to openstack-dev/hacking: Turn Python3 section into a list  https://review.openstack.org/6213500:47
openstackgerritA change was merged to openstack-dev/hacking: Add Python3 deprecated assert* to HACKING.rst  https://review.openstack.org/6213600:47
*** mrodden has joined #openstack-infra00:48
openstackgerritMichael Krotscheck proposed a change to openstack-infra/storyboard-webclient: Moved homepage content to about page.  https://review.openstack.org/6734400:50
clarkbsdague: I am digging through logs now to see if I can determine where it is starving itself00:50
*** hogepodge has quit IRC00:50
*** harlowja is now known as harlowja_away00:50
*** CaptTofu has quit IRC00:53
*** melwitt has quit IRC00:58
*** melwitt1 has joined #openstack-infra00:59
clarkbit looks like it takes that long to submit all of the gearman jobs after a gate reset00:59
*** melwitt1 has quit IRC01:03
*** sarob has joined #openstack-infra01:05
*** sarob has quit IRC01:05
*** CaptTofu has joined #openstack-infra01:05
*** sarob has joined #openstack-infra01:06
*** dkranz has joined #openstack-infra01:07
*** harlowja_away is now known as harlowja01:08
*** melwitt has joined #openstack-infra01:08
clarkbthe bulk of the time was spent resetting the gate01:08
clarkb2014-01-17 00:31:52,791 DEBUG zuul.DependentPipelineManager: Starting queue processor: gate01:08
clarkb2014-01-17 00:47:17,732 DEBUG zuul.DependentPipelineManager: Finished queue processor: gate (changed: True)01:08
*** sarob_ has joined #openstack-infra01:08
clarkbthat is ~15 minutes of just dealing with a gate reset, which is bad considering how often the gate resets01:09
openstackgerritEric Guo proposed a change to openstack/requirements: Have tox install via setup.py develop  https://review.openstack.org/6654901:09
mordredclarkb: wow01:10
*** sarob has quit IRC01:10
*** sarob has joined #openstack-infra01:12
clarkbit is taking 9-11 seconds to do git reset, git remote update, git reset --hard $BRANCH, git merge $patchset, then create a ref that zuul can advertise to the testers01:13
openstackgerritMichael Krotscheck proposed a change to openstack-infra/storyboard-webclient: Added apache license to footer  https://review.openstack.org/6734701:13
dkranzScrolling back, this might be a bad time to say this but I did a reverify with bug number on https://review.openstack.org/#/c/63934/ which closes the error-in-log-file hole.01:14
clarkb90 * 9 seconds = ~13 minutes01:14
clarkbso that accounts for the bulk of the reset time01:14
clarkbknowing that, I think jeblair's farm of zuul workers plan is a really good one01:14
*** sarob has quit IRC01:15
clarkbif we can distribute that work instead of doing it serially we should be able to get that number much smaller01:15
*** sarob_ has quit IRC01:15
*** sarob has joined #openstack-infra01:15
clarkbnow it is also possible that the git repos themselves are degrading and are usually faster01:16
clarkbwhich isn't that far fetched as sdague indicated zuul had much better performance previously. Restarting zuul won't fix the problem but clearing out the git repos or otherwise repairing them might01:16
*** sarob has quit IRC01:18
clarkbhttp://paste.openstack.org/show/61413/ I have filtered out everything but the git checkouts there. This shows the amount of time between each git checkout, which is roughly the amount of time it takes to do a checkout, reset, merge, etc.01:19
mordredclarkb: I wonder if git remote update is potentially too heavy of a hammer too. (although the farm of workers is better)01:20
clarkbwe might also try using a newer version of git on the zuul box01:20
clarkbwe can use https://launchpad.net/~git-core/+archive/ppa to get newer git on zuul.o.o01:21
*** zhiwei has joined #openstack-infra01:21
*** sarob has joined #openstack-infra01:21
clarkbmordred: it may be01:21
mordredclarkb: git fetch remotes/origin/$BRANCH ; git reset --hard FETCH_HEAD might do slightly less work01:21
*** zhiwei has quit IRC01:22
clarkbmordred: there is a big time delta between updating repository and the next step01:22
* clarkb looks at that code01:22
mordredclarkb: as in, the remote update step is taking a long time?01:22
clarkbya01:23
*** sarob has quit IRC01:24
*** sarob has joined #openstack-infra01:24
clarkbyup looks like the vast majority of time is in the remote update step01:25
clarkbit is happening in GitPython though. need to read up on it to see if we can make that smarter01:25
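A rough sketch of the per-change work being described, with timing around each step, using GitPython's shell-out interface (repo.git.<command>); the function name build_ref and the exact command sequence are illustrative approximations, not zuul's actual merger.py.

    import time
    import git

    def build_ref(path, branch, change_ref, zuul_ref):
        repo = git.Repo(path)
        timings = {}

        def timed(name, fn, *args):
            start = time.time()
            fn(*args)
            timings[name] = time.time() - start

        # the expensive step seen in the logs: refresh all remote refs
        timed('remote-update', repo.git.remote, 'update')
        # mordred's lighter-weight alternative would be roughly:
        #   repo.git.fetch('origin', branch)
        #   repo.git.reset('--hard', 'FETCH_HEAD')
        timed('checkout', repo.git.checkout, branch)
        timed('reset', repo.git.reset, '--hard', 'origin/%s' % branch)
        timed('fetch-change', repo.git.fetch, 'origin', change_ref)
        timed('merge', repo.git.merge, 'FETCH_HEAD')
        # advertise the speculative state to the test workers
        timed('make-ref', repo.git.update_ref, zuul_ref, 'HEAD')
        return timings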
*** pcrews_ has quit IRC01:27
*** melwitt has quit IRC01:27
openstackgerritA change was merged to openstack-infra/config: Increase timeouts for jobs doing tempest runs  https://review.openstack.org/6637901:28
*** sarob has quit IRC01:29
mordredclarkb: will you point me to the part of the code you're looking at?01:29
clarkbmordred: I am digging through zuul/merger.py. mergeChanges() is the function that seems to do the work01:30
clarkbmordred: the repo update only happens once per project:branch relationship during a reset01:32
clarkbso while it is costly when it happens it isn't the biggest cost. the git checkouts seem to be most painful01:32
mordredreally?01:32
clarkbya checkout happens for each change so * 9001:32
*** xchu has joined #openstack-infra01:32
clarkband takes about as much time as an update01:32
mordredis that just because it's modifying a working tree?01:33
*** sdake has quit IRC01:33
clarkboh possibly as git has to reflect the changes on disk01:34
mordredcan I make a REALLY stupid suggestion?01:34
mordredwhat if we ran it under eatmydata?01:34
*** zhiwei has joined #openstack-infra01:35
mattoliverauCan you wrap the git checkout + other git reset stuff into some python threads so they can be done in parallel? That way it shouldn't be 90*901:35
clarkbhmm that is an interesting question. my first thought was "are you crazy", my second thought is that may just be an incredible idea01:35
clarkbmattoliverau: we can, that is what jeblair's make workers do the work idea gets at01:36
clarkbmattoliverau: I think we will end up doing that regardless, but we need a short term solution01:36
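A sketch of that parallelization idea, assuming the build_ref helper sketched earlier and at most one entry per repository so two threads never touch the same clone; on python 2 this would need the futures backport, and none of this is zuul's actual code.

    from concurrent.futures import ThreadPoolExecutor

    def build_all(items, max_workers=8):
        # items: list of (repo_path, branch, change_ref, zuul_ref) tuples,
        # one tuple per repo_path, so independent repos are rebuilt
        # concurrently instead of serially
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            futures = [pool.submit(build_ref, *item) for item in items]
            return [f.result() for f in futures]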
*** afazekas has quit IRC01:36
clarkbmordred: eatmydata disables fsync? does that mean no data will ever get synced or it will sync whenever the OS feels like it?01:36
clarkbmordred: my biggest concern now is that zuul relies on disk persistence to do graceful restarts01:37
clarkbmordred: I am pretty sure that will break if we put zuul under eatmydata01:37
mattoliverauso how about tmpfs then? no disk IO then, only ram.01:38
mordredclarkb: hrm. good point01:38
mordredyeah - tmpfs would be the next question - but I don't think we have the ram to handle all of the repo size01:38
mordredI lied01:39
*** gokrokve has quit IRC01:39
mordredmordred@zuul:~$ sudo du -chs  /var/lib/zuul/git/01:39
mordred2.8G    /var/lib/zuul/git/01:39
*** gokrokve has joined #openstack-infra01:40
clarkbtmpfs sounds like a great idea01:40
mattoliverauso we may be able to put /var/lib/zuul/git under tmpfs, and bypass disk altogether. If it doesn't work then it just means it isn't disk io causing the issues.01:40
clarkbmordred: I think if we stop zuul, overlay a mount on /var/lib/zuul/git then start zuul it will just reclone everything01:40
*** pcrews has joined #openstack-infra01:41
clarkbcurrently git has about 4GB of cached and buffered memory01:41
*** thuc has quit IRC01:41
clarkbso a 2.8GB filesystem may eat into that in unhappy ways, though I bet a good chunk of that cache is for the git stuff01:41
*** thuc has joined #openstack-infra01:42
mattoliverauHow much RAM is the system using for everything else? is the server's RAM under-utilised? I guess I could just go check out cacti :)01:42
clarkbhttp://cacti.openstack.org/cacti/graph.php?action=view&local_graph_id=392&rra_id=all01:43
*** gokrokve has quit IRC01:44
clarkbzuul is 4GB virt, 1.3GB resident, geard is 1GB virt, 760MB resident01:44
clarkbthen recheckwatch, puppet, and apache processes hang around with about 100MB apiece01:44
mordredwe could bump it to 8G if putting the temp merge location in ram would be good, ya know?01:46
*** thuc has quit IRC01:46
clarkbit is 8GB now01:46
clarkbit is an 8vcpu 8GB rackspace performance node01:47
mordredmeh01:47
mattoliverauYeah, the rest is mainly in cached memory. So the question is, how much will the kernel actually give back to us.. what is the real free figure.01:47
mattoliverauclarkb: there was a talk at LCA about this during the sysadmin miniconf.01:48
clarkbI missed it :(01:49
dstufftmordred: clarkb fungi So pip 1.5.1rc1 and 1.11.1rc2 just dropped, if you're at all able to run them through the paces in the openstack infra to make sure we fixed all your issues that would be really really awesome01:49
*** yamahata has joined #openstack-infra01:50
*** jp_at_hp has quit IRC01:51
mattoliverauif I remember correctly, we can check meminfo: cat /proc/meminfo |grep -i active01:51
clarkb4836276 kB01:52
mattoliverauwhatever the figure is for inactive should be currently what the kernel can dump (and thus give back at this point in time)01:52
mordreddstufft: I'd love to - the gate is so slammed though I don't think we're likely able to run anything with a difference - but I'll see what I can cook up01:52
clarkbmattoliverau: inactive is 2269672 kB01:52
*** wenlock has quit IRC01:52
dstufftmordred: ok I totally understand if you can't fwiw :) Mostly I want to avoid another upgrade apocalypse01:52
clarkboh there are several inactive categories, are they distinct or subsets?01:53
mattoliverauclarkb: so if I am correct, we may only get about 2.2 G back.01:53
*** locke105 has joined #openstack-infra01:53
clarkblooks like they add up so we only need that value above01:53
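A small helper matching the meminfo discussion above: it reads /proc/meminfo and reports the Inactive total as a rough upper bound on what the kernel could hand back for a tmpfs. Illustrative only.

    def meminfo():
        values = {}
        with open('/proc/meminfo') as f:
            for line in f:
                key, rest = line.split(':', 1)
                values[key] = int(rest.strip().split()[0])  # values are in kB
        return values

    info = meminfo()
    print('Inactive: %d kB (~%.1f GB potentially reclaimable)'
          % (info['Inactive'], info['Inactive'] / 1024.0 / 1024.0))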
mordreddstufft: well, we're blocking >=1.5 anyway - so I think we can test upgrading to 1.5.1 at our leisure01:53
* mordred is still excited for his new 1.5 overlord01:53
mattoliverauI might go find the talk in question, the videos are up and it only went for 10 minutes or so.01:54
clarkbif we want to keep 8vcpu we can go up to a 30GB perf node01:55
clarkbthat will give us plenty of room for a ~16GB tmpfs01:55
mattoliverauYeah, that might be a good idea, that would give us room to grow.01:56
mordredclarkb: use 30G perf nodes for all the things!!!01:56
clarkbI think that is a not so crazy idea, but it is also late on thursday01:56
mattoliverauhttp://is.gd/U9kBon01:56
clarkbwould be curious to get fungi's input01:56
mordredclarkb: we should make the build farm use 30G perf nodes01:56
mattoliverauthe talk in question ^^01:57
mattoliverauI think01:57
mordredcan you imagine just how quickly pvo would show up and scold us?01:57
mattoliveraumordred: lol01:57
clarkbI also think newer git is worth a shot, the version of git we are running is pretty old01:57
mordredclarkb: ++01:57
mattoliveraucan't hurt, in theory the code has to be more efficient.. unless linus broke something :P02:00
clarkbya I am doing some quick unscientific tests locally02:01
mattoliverauLol the best kind of test ;)02:01
*** nosnos has joined #openstack-infra02:04
*** senk has quit IRC02:04
clarkbgit checkouts weren't any better. git clone was about 20 seconds faster for nova02:05
clarkbalso I just realized this is GitPython so it may be doing some stuff in python02:05
mattoliverauthat's true, makes it hard to determine the bottleneck. were those times from gitpython or git?02:06
dstufftI think GitPython just shells out02:06
dstufftbut I might be thinking of a different project02:06
clarkbdstufft: it does for some stuff and not for others iirc02:06
*** jhesketh__ has joined #openstack-infra02:06
clarkbalso they use tabs in their source so now I don't want to read it02:06
dstufftclarkb: I've learned to avoid reading other people's source code unless I really want to be caremad02:07
dstufft(it's too late not to read my own :( )02:07
dimsjust peeked at gate queue, looks like it crept up to 10402:08
clarkbdstufft: it appears to shell out for checkout02:09
*** gyee is now known as gyee_nothere02:09
*** adrian_otto has joined #openstack-infra02:10
*** CaptTofu has quit IRC02:10
*** gokrokve has joined #openstack-infra02:10
dstufftclarkb: also question, is this cloning stuff to run tests on it?02:10
adrian_ottoare our Zuul workers clogged up? I have 4 Solum gerrit reviews that have no votes on them from jenkins, dating back over the past  ~4 hours.02:11
dstuffte.g. is it a read only clone and are you or can you use a shallow clone to make it go faster?02:11
clarkbadrian_otto: no zuul is clogged up02:11
clarkbdstufft: we can't shallow clone for reasons. this is the repo zuul is using to build the refs that get tested02:11
clarkbiirc it needs all the refs in order to build the zuul refs which a shallow clone won't give you02:12
adrian_ottoclarkb: no zuul is clogged, or no it is not?02:12
dstufftclarkb: ok!02:12
dstufftI don't know much about zuul so :(02:12
clarkbadrian_otto: no, zuul is clogged02:12
clarkbthe workers themselves are fine02:12
adrian_ottoclarkb: ok, thanks02:12
*** slong- is now known as slong-afk02:15
*** gothicmindfood has joined #openstack-infra02:15
*** pballand has quit IRC02:21
*** julim has joined #openstack-infra02:22
*** yaguang has joined #openstack-infra02:23
clarkbadrian_otto: long story short is that the longer the gate queue gets, the more time zuul spends resetting it (currently a full gate reset takes more than 15 minutes), and while it is doing that reset the zuul scheduler does nothing else. There are plans to make that better (farming the expensive git work out to workers to allow massive scale out, and we have been fiddling with using a tmpfs as the cost of02:23
clarkbdisk seems to hurt quite a bit)02:23
*** julim has quit IRC02:24
*** portante_ is now known as portante02:24
*** gothicmindfood has quit IRC02:27
adrian_ottoclarkb: thanks for the detail. Can you help me understand what a gate reset is, and why it happens?02:29
clarkbadrian_otto: the gate pipeline is where we test serialized changes in parallel. change A gets approved first and goes onto the head of the queue, then change B gets approved and gets added behind A. Instead of waiting for A to merge before testing B we test B with A assuming A will pass and merge02:30
clarkbadrian_otto: when A does not pass and merge we have to retest B without A as the previous scenario is no longer valid02:31
clarkbthat is a gate reset.02:31
clarkbwhen you have 102 changes in the pipeline something failing at the head of the queue means we have to cancel jobs for 101 changes, then completely rebuild the git refs to test 101 changes (the 102nd is removed as it failed) then restart all of the tests02:32
fungiexcept in the current gate it's a plus b plus c plus... plus z and then repeat the alphabet several more times02:32
adrian_ottook, so that sounds like a definite design weakness in zuul02:32
clarkbadrian_otto: its not a design weakness in zuul, it is a problem with speculative merging and testing02:33
adrian_ottoisn't that the key feature that makes zuul compelling?02:33
clarkbyes02:33
clarkbadrian_otto: in the best case you merge all 102 changes at one time and your time to test is O(1)02:34
fungiadrian_otto: more to the point, consider the integrated projects to basically be one software project with more than a thousand developers approving a hundred changes a day and trying to make sure every change passes the entire integration test suite prior to letting it merge02:34
clarkbwhen you are consistently failing that goes to O(n)02:34
clarkbin the previous state you were in O(n)02:34
clarkbso this is a win over the old state, but in the worst case is still bad02:34
adrian_ottoindeed02:35
fungithe alternative, which a lot of projects settle for, is merge first, then test periodically and see if the published software is obviously broken, then try to bisect and hope you can narrow down which commit to revert02:35
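A toy worked example of the speculative gating just described: each change is tested on top of everything ahead of it, and when one fails everything behind it has to be rebuilt and retested without it. Not zuul code, just the shape of the problem.

    def speculative_states(queue):
        # ['A', 'B', 'C'] -> A tested alone, B tested with A, C tested with A+B
        return [queue[:i + 1] for i in range(len(queue))]

    def reset(queue, failed):
        # the failed change is ejected; every change behind it gets new states
        remaining = [c for c in queue if c != failed]
        return speculative_states(remaining)

    queue = ['A', 'B', 'C', 'D']
    print(speculative_states(queue))  # [['A'], ['A', 'B'], ['A', 'B', 'C'], ...]
    print(reset(queue, 'A'))          # B, C, D all rebuilt without A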
adrian_ottoso might it make sense to use an admission control strategy?02:35
adrian_ottoso the queue is limited?02:36
clarkbadrian_otto: see scrollback :)02:36
adrian_ottothat might speed up the reset case, at the cost of some concurrency in the best case02:36
*** nati_uen_ has quit IRC02:37
clarkbjeblair has historically been opposed to rate limiting the size of a zuul queue. I have argued for the feature in the past. I think something simple like tcp's slow start would help quite a bit02:37
adrian_ottothanks for the additional detail!02:37
*** nati_ueno has joined #openstack-infra02:37
clarkbat LCA jeblair seemed to be more onboard with adding something like that to zuul02:38
adrian_ottoyou can still have a backlog that's not part of the active queue02:38
clarkbyup02:38
mattoliverauIt was the sun and the Aussie beer ;P02:38
adrian_ottoand spoon feed the active queue so it remains a more optimal length02:39
*** yamahata has quit IRC02:39
clarkbadrian_otto: exactly, just like a tcp connection02:39
adrian_ottoyep02:39
clarkbwell tcp rarely if ever hits optimal state, but it is consistently not worst case02:39
*** dstanek has joined #openstack-infra02:46
lifelessclarkb: mmm02:46
lifelessclarkb: you could argue that tcp is nothing but worst case :)02:46
clarkblifeless: maybe when your latency is NZ bad02:46
clarkb:)02:46
StevenKclarkb: Well, it's a sliding window, and also best effort.02:48
StevenKclarkb: However, I agree with you -- I think checking a queue of 90-100 all the time is bong, and we should limit it to a window02:48
*** jishaom has joined #openstack-infra02:49
*** odyssey4me has quit IRC02:53
*** carl_baldwin has joined #openstack-infra02:54
*** AaronGr is now known as AaronGr_Zzz02:57
notmynameadrian_otto: since you were asking about stuff, I threw together a quick graph for you http://not.mn/solum_gate_status.html03:01
notmynameadrian_otto: if that's not the right jobs, let me know (or open a pull request--the repo link is at the bottom)03:02
*** odyssey4me has joined #openstack-infra03:02
*** jhesketh__ has quit IRC03:02
sdagueclarkb: so can you promote this now - https://review.openstack.org/#/c/65805/ ?03:02
*** rakhmerov has quit IRC03:02
*** jhesketh_ has quit IRC03:03
sdagueif the theory on load is correct, that should level things out a bunch03:03
*** jhesketh_ has joined #openstack-infra03:03
clarkbsdague: it has been promoted, should see it in a bit03:04
*** rossella_s has joined #openstack-infra03:05
sdagueok, just looked at the queue and it was still at the bottom03:05
sdaguebut I guess we're just processing the events still?03:05
*** jhesketh__ has joined #openstack-infra03:05
clarkbya the promotion takes ~15 minutes according to fungi03:05
clarkbsdague: see scrollback for the long explanation of zuul slowness03:06
sdagueyep, just read it03:06
clarkbi hunted it down. tl;dr: a really long gate is expensive03:06
sdagueclarkb: right, especially as it starves out the other events03:06
*** pballand has joined #openstack-infra03:06
sdaguethe tmpfs approach looks promising?03:06
*** yaguang has quit IRC03:09
clarkbya, walking home now, was hoping to chat with fungi about that when I get back03:10
*** yaguang has joined #openstack-infra03:10
sdaguecool03:10
*** HenryG has joined #openstack-infra03:10
*** krotscheck has quit IRC03:10
*** ArxCruz has quit IRC03:11
*** zhiyan has joined #openstack-infra03:12
sdagueso I guess the other question is if we're taking forever to reset with the change that we think will make this better, would it make sense to just dump the gate queue at this point?03:13
sdaguethe d-g just popped to the head03:14
*** salv-orlando has joined #openstack-infra03:15
notmynamesdague: if stuff is getting promoted, then dumping the gate feels like something to do just to do something03:16
sdaguenotmyname: sure, though given that we can't allocate devstack nodes to jobs until the gate reset finishes, it's still adding 15 minutes additional friction on each hit. Which while small, adds up.03:18
sdagueclarkb / fungi: looks like a bad py26 node - https://jenkins01.openstack.org/job/gate-nova-python26/17060/console03:19
notmynameyes, but I've been working on getting a patch through for the past 12 hours, and I've got another dependency that's been over 50 hours in the gate with over 13 resets. an extra 15 minutes really isn't much03:19
sdagueit's not one extra 15 minutes, it's 15 * failing tests in gate (and right now there are at least 2 py 2.6 unit test failures that I see)03:21
fungii've taken centos6-1 offline03:22
fungithanks sdague03:22
sdague7 py26 unit tests fails... at least03:22
fungii'm also caught back up on scrollback since dinner now. i am a dismally slow reader03:22
sdagueyeh, about 40% of the gate jobs right now are in a fail state because of that py26 node03:23
sdaguezuul hasn't noticed yet because it's still processing the first promote03:23
fungii agree, in light of the performance breakdown, that saving the state of the pipelines and gracefully stopping zuul, mounting a suitably large tmpfs on /var/lib/zuul/git, starting zuul and restoring the changes would likely help performance03:24
*** rakhmerov has joined #openstack-infra03:25
fungithe +/- buffers/cache amount is a good bit larger than a du of that dir03:26
fungiand zuul has a ton of swap for spillover if that ends up being an underestimate03:26
*** adrian_otto has quit IRC03:28
*** dcramer_ has quit IRC03:28
fungi4g tmpfs should be doable looking at the present state of the server03:29
mattoliveraufungi: you need to check the active and inactive memory in meminfo to see how much the kernel will really give back to you; +/- buffers is a bit of a lie.03:30
mattoliveraubut yeah there is swap.. so long as it swaps out something it doesn't need again :)03:30
fungiyeah, but some of what's currently resident is safe to page out03:31
fungimore a question of how much03:31
*** pballand has quit IRC03:31
*** nati_uen_ has joined #openstack-infra03:31
fungiactive(anon) is under 3g03:33
mattoliverauwhat is inactive03:33
fungithere's a fair amount of active(file) but i anticipate that being git03:33
clarkbfungi: I am willing to give tmpfs on current zuul a shot, we will need probably at least a 3GB filesystem03:33
fungiinactive is about 1.5g03:33
clarkbbut only about 2 gb was inactive03:33
mattoliverauyou you may get your 2g + what ever is actually free, and then everything else will be swaped.03:34
*** nati_ueno has quit IRC03:35
mattoliverauthere should be an inactive, inactive(anon) and inactive(file). Use the first as it is the total.03:35
mattoliveraubut i don't have access to the server so I don't actually know what the current value is.03:35
*** jerryz has quit IRC03:35
fungiright, inactive is roughly 2g03:35
clarkbfungi: I mentioned to mattoliverau earlier that we could go to a 30GB perf node to keep our vcpu count that will give us plenty of room for a massive tmpfs03:36
mattoliverauSo from my understanding, that is as much as the kernel can actually give you.03:36
fungiclarkb: yeah, i'm hesitant since the downtime to swap nodes would be a bit greater. how long did it take you the other day?03:37
clarkbit wasn't too bad, you basically prestage the node completely then do the swap. making sure firewall rules are correct everywhere was the biggest hurdle03:38
fungii'm rapidly running out of steam for the night but can probably squeeze in another hour or so03:38
clarkbwe can probably get it done in well under half an hour03:38
StevenKmattoliverau: For a tmpfs? tmpfs are swapped-back03:38
clarkbfungi: I don't think we should do anything tonight unless you really really want to03:38
StevenKswap-backed, even03:39
mattoliverauStevenK: tmpfs is just a ramdisk, so yes, it'll be swapped out.. in theory.03:39
clarkbfungi: maybe fire off a 30GB node build tonight and plan for swap tomorrow?03:39
clarkbfungi: or, put tmpfs in place on existing zuul and see what happens03:39
fungiclarkb: i can get a new-new-zuul spinning up now. we'll hang our hopes on the tempest parallelism reduction to make some stability headway in the meantime03:41
clarkb++ I think that is path of most sanity03:41
*** weshay has quit IRC03:42
clarkbfungi: zuuls A record ttl is already 5 minutes so that is covered03:42
fungiawesome03:42
clarkbthen tomorrow we grab the pipeline state, stop zuul, update dns, make sure firewalls update (which the more I think of it may not be a problem, since most connections are to zuul so only zuul's firewall matters) and start zuul on new server03:43
clarkbif anything goes uber terrible we put old server back in use03:43
*** amotoki has joined #openstack-infra03:43
mattoliverauSounds like a plan! And on that note then I'm going to go to lunch, ttyl.03:44
clarkbI am no longer convinced new git will make much of an impact03:46
fungiheh... "120 GB Performance"03:48
* fungi resists temptation03:49
fungiso we want 30 not 15?03:49
StevenKfungi: And then put / on a tmpfs? :-P03:49
fungiStevenK: bitcoins aplenty03:49
clarkbfungi: 15 has 4vcpu03:50
fungioh weird03:50
clarkbfungi: the current 8gb have 8vcpu03:50
clarkbI think we should go 30 just to keep the vcpu value constant03:50
fungiso new-zuul was non-performance?03:50
fungior 15g perf have fewer cpus than 8 and 30?03:51
clarkbnew zuul was performance, 8gb 8vcpu03:52
clarkbbut the flavors are weird, 8gb gives you 8vcpu but 15 gives you 4vcpu03:52
clarkbdouble check that with nova flavor-list but pretty sure those were the values I saw earlier today03:52
fungiyou're right03:53
fungistrange but true03:53
clarkbthe other nice thing about 30GB is we can make the tmpfs pretty large and not worry too much about it filling unexpectedly03:54
fungiyup03:54
clarkbeg 16GB :)03:55
sdaguethey basically seem to have created a high memory set of perf nodes03:55
clarkbon my way home I was also thinking that zuul could do a better job in its scheduler of handling more than one discrete item at once03:57
clarkbat the very least it should be able to process different pipelines independently03:57
clarkbthe nice thing about the serial way it does things now is it makes it very predictable about the order jobs run in and so on03:58
clarkbbut gate being slow doesn't have to affect check for example03:58
clarkbbut I think making changes like that probably won't have large benefits when 99% of your time is waiting for a forked git process to do its thing03:58
*** coolsvap has joined #openstack-infra04:00
fungialso, you'd need multiple git workspaces to avoid collisions04:01
*** harlowja is now known as harlowja_away04:01
clarkboh right good point04:01
fungidon't want to be building two nova refs in one git clone at the same moment04:01
*** praneshp has quit IRC04:04
*** sarob has joined #openstack-infra04:08
*** sarob_ has joined #openstack-infra04:10
*** CaptTofu has joined #openstack-infra04:11
*** sarob has quit IRC04:13
*** CaptTofu has quit IRC04:15
*** sdake has joined #openstack-infra04:16
mikalzuul hates me04:17
notmynamemikal: don't worry. zuul hates everybody today ;-)04:17
mikalYay!04:17
mikalOn the performance nodes front, there are two types04:18
fungiclarkb: we haven't merged the change yet that autopartitions the secondary block device on these performance nodes, have we?04:18
mikalWhich might not be obvious from flavour list04:18
sdagueso it's not incredibly helpful for people to "reverify bug 123456789" - https://review.openstack.org/#/c/61714/204:18
sdaguebecause that patch can't pass right now, due to grizzly devstack issues04:18
mikalOMG, who did that?04:18
*** _ruhe is now known as ruhe04:18
mikalPerformance 1 has its biggest at 8 vcpus, 8gb ram04:19
mikalPerformance 2 has its biggest at 32 vcpus, 120gb ram04:19
*** coolsvap_away has joined #openstack-infra04:20
*** coolsvap has quit IRC04:21
*** coolsvap_away is now known as coolsvap04:21
*** vkozhukalov has joined #openstack-infra04:21
sdaguefungi: so given that the grizzly devstack issues are out there, could you kick out all the stable/havana patches in the queue? because they are all just time bombs04:22
*** SergeyLukjanov_ is now known as SergeyLukjanov04:22
notmynameis there a single job that is run for _every_ gate job that isn't run for check jobs? I'm looking for a graphite metric04:23
notmynameeg maybe gate-grenade-dsvm04:23
fungisdague: i'm not sure how to "kick them out" aside from uploading trivial new patchsets to each of them04:24
clarkbfungi: I think that is the only way04:24
sdaguefungi: yeh, that would be the only way04:24
fungibut a 'zuul eject' command would make for a good future addition04:24
sdagueyeh04:24
sdaguenotmyname: gate-tempest-dsvm-full is the best approximation of the gate04:25
notmynamesdague: thanks04:25
sdaguehowever, it's dynamic04:25
sdagueso not exact04:25
notmynamedynamic?04:25
sdaguethe integrated queue is assembled based on overlapping jobs04:25
sdagueso if change A runs tests 1 2 3, and change B runs tests 3 4 5, and change C runs tests 5 6 704:26
sdaguethey will be in a single queue04:26
sdagueeven though A doesn't overlap with C04:26
clarkband only one job of that entire set needs to fail to create a reset04:26
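A toy sketch of the overlapping-jobs grouping sdague describes: changes whose job sets overlap, even only transitively (A-B-C), end up sharing one queue. Function and data names are made up for illustration.

    def shared_queues(job_sets):
        # job_sets maps a change to the set of jobs it runs
        queues = []
        for change, jobs in job_sets.items():
            changes, jobs = [change], set(jobs)
            for q in queues[:]:
                if jobs & q['jobs']:        # any overlap merges the queues
                    changes.extend(q['changes'])
                    jobs |= q['jobs']
                    queues.remove(q)
            queues.append({'changes': changes, 'jobs': jobs})
        return queues

    print(shared_queues({'A': {1, 2, 3}, 'B': {3, 4, 5}, 'C': {5, 6, 7}}))
    # one queue holding A, B and C even though A and C share no jobs directly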
*** dcramer_ has joined #openstack-infra04:26
clarkbso I was thinking about this a bit more after LCA, and I think what I would like to do is expose the zuul logs more. There shouldn't be any privileged info in them so it should be safe to just logstash them or whatever, but it will give clear data on 'this was a gate reset' and so on04:27
openstackgerritA change was merged to openstack-infra/storyboard-webclient: Customise Bootstrap  https://review.openstack.org/6733704:28
openstackgerritA change was merged to openstack-infra/storyboard-webclient: Moved homepage content to about page.  https://review.openstack.org/6734404:28
SergeyLukjanovevening guys!04:28
clarkbSergeyLukjanov: ohai04:28
SergeyLukjanovclarkb, it'll be awesome to be able to read zuul logs :)04:28
clarkbSergeyLukjanov: ya, I want to double check with jeblair to see if there are any known gotchas with that, but we can pipe it into the test log logstash too and get overlapping data04:29
*** rossella_s has quit IRC04:29
mikalclarkb: noting that if you turn on our swift reporter and debug logging, it logs the swift password04:29
clarkbsdague: also did you see I diagnosed the missing console.html in logstash problem? zaro is going to work on a fix04:29
clarkbmikal: is swift reporter a thing?04:30
clarkbmikal: in any case we should sanitize that logging imo04:30
sdagueclarkb: cool04:30
mikalclarkb: it is for us. I think it's meant to be for you in the future.04:30
mikalclarkb: but perhaps I am mis-representing jhesketh__ and jblair's plan04:30
clarkbsdague: what happens there is jenkins hasn't even touched the file on logs.o.o by the time logstash processes it, logstash gets a 404 and moves on04:31
sdaguegreat04:31
clarkbsdague: so we will update the scp plugin to not finish the job until that file has at least been touched04:31
clarkbshould be a simple wait on a thread sync event04:31
sdaguecool04:32
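The fix being discussed lands in the Jenkins scp plugin, which is Java; as a hedged illustration of the same "wait on a thread sync event" pattern in Python: the upload path sets an event once console.html exists, and job completion blocks on it so logstash never sees a 404. Names here are invented for the example.

    import threading

    console_uploaded = threading.Event()

    def upload_console_log():
        # ... copy console.html to the log server ...
        console_uploaded.set()          # signal that the file now exists

    def finish_job(timeout=60):
        # don't report the job finished until the log file has been touched
        if not console_uploaded.wait(timeout):
            raise RuntimeError('console.html never appeared')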
*** chandankumar has joined #openstack-infra04:32
clarkbsdague: how is neutron thing going?04:33
fungiclarkb: yeah, we haven't approved https://review.openstack.org/63190 apparently, so that explains the lack of swap04:33
sdaguegood, though slowed down by the current gate backup. The lower concurrency patch is looking promising in the gate right now though.04:34
clarkbsdague: yeah04:34
clarkbfungi: are you just going to manually add swap then? or should we merge 63190 and rebuild?04:34
fungii'll delete my first launch and build another with that patch added04:35
sdagueI need to go to bed, but I think if we kick out the stable branch changes in the gate the gate will empty by morning04:35
fungiunless we want to merge it first04:35
clarkbfungi: ok, I am reviewing that change now too04:35
clarkbsdague: noted04:35
fungiclarkb: thanks. we might as well approve it rather than continue manually applying it on every launch environment ;)04:35
sdaguethat swift one that was reverify 123456789 is reset #2 in there, but that's because it can't pass04:35
sdaguethere is a nova unit test fail as well, above it04:36
clarkbsdague: is it just the two? I will look and figure it out don't stay up04:36
sdagueso that's what's failing now, though they are off to the side04:36
sdaguethe rest is being computed04:36
sdaguehowever, stable/havana patches will fail on grenade04:36
sdaguebecause of grizzly04:36
sdagueso they take a while to show up after a reset04:37
clarkbok so all stable/havana changes should be kicked out04:37
sdagueyes04:37
clarkbgot it04:37
*** carl_baldwin has quit IRC04:37
*** esker has quit IRC04:37
sdaguechmouel and I were looking at stable grizzly devstack today, will do so again in the morning04:38
clarkbok04:38
sdagueI think it's fundamentally the pip 1.5 thing04:38
*** esker has joined #openstack-infra04:38
sdagueanyway, bed time. Talk to you later.04:38
clarkbwe aren't using 1.5 anymore though right? or did we deal with that differently?04:38
clarkbfungi: ^04:38
clarkbfungi: oh I remember, I asked for that change not to be in install_puppet.sh04:39
clarkbfungi: because well it is doing something completely different and potentially harmful instead of simply installing puppet04:39
clarkbfungi: why don't you use a local checkout and we can figure out how to deal with that properly when jeblair is back04:39
fungiclarkb: ahh, right. you should note that in the review04:39
clarkbyup sorry I didn't do that before, my bad04:40
openstackgerritA change was merged to openstack-infra/devstack-gate: Cut tempest concurrency in half  https://review.openstack.org/6580504:40
*** fifieldt has quit IRC04:41
HenryGIn gerrit, is there a way to search for any reviews in progress that touch a particular file?04:42
*** fifieldt has joined #openstack-infra04:42
*** emagana has joined #openstack-infra04:42
jhesketh__clarkb: so (reading back...) I suggested on the infra mailing list that we run a zuul per pipeline to ease the load on the gate04:42
clarkbjhesketh__: that won't ease the load on the gate but would help the other pipelines04:43
jhesketh__jblair didn't think it was necessary with the move to a performance node and also his future plan of sending git methods to workers04:43
notmynameclarkb: fungi: thanks for the help with the CVE patch today04:43
clarkbHenryG: if you have watched the projects and use the ssh query api then I think the answer is yes04:43
jhesketh__well zuul will be able to do its git magic faster if it doesn't have to fight other pipelines04:43
clarkbjhesketh__: there is no fighting though, they are all dealt with serially04:44
clarkbthe problem is that the gate pipeline takes 15 minutes to handle a reset, and nothing else in zuul runs04:44
funginotmyname: of course, it's my pleasure04:44
clarkbwe need to make that faster, the worker idea should help there as it distributes the expensive git work across nodes04:44
jhesketh__clarkb: if zuul is pulling in a patch for nova in the check pipeline doesn't that block any merge it might be wanting to try on the gate pipeline?04:45
jhesketh__right okay04:45
clarkbjhesketh__: not really because it will handle those one at a time04:45
clarkbthis interim idea is use tmpfs to speed up git operations04:45
fungijhesketh__: zuul's output is a constructed git ref, in the end, so the state of its work tree doesn't have to hang around. just a git object04:46
clarkbas that requires no code changes and should help quite a bit04:46
jhesketh__clarkb: so it does block, it's just not significant?04:46
jhesketh__(the check pipeline that is)04:46
clarkbjhesketh__: ya because the check pipeline work is once and done04:46
clarkb~10 seconds of work04:47
*** praneshp has joined #openstack-infra04:47
jhesketh__sure, but if somebody commits a dozen patches at once that's still a delay04:47
clarkbbut for dependent pipelines it processes the entire queue before being done. which is ~10 seconds multiplied by the number of changes04:47
fungijhesketh__: it blocks, but insofar as it all blocks because git operations are not happening in parallel04:47
jhesketh__yep04:47
clarkbjhesketh__: but it allows other work to happen between those changes04:47
HenryGclarkb: yes I have "watched" the project (tempest, in this case). Do you have a ptr handy to the ssh query api for a noob to get started?04:48
clarkbso the total work is 10*10 seconds but it doesn't starve the other queues04:48
jhesketh__sure,04:48
clarkbwith the gate it literally stops everything else for that 15 minute period04:48
mikalI can assume that my stackforge approval from an hour ago isn't lost, right?04:48
mikalJust slow?04:48
clarkbmikal: yes just very very slow04:48
StevenK515 events, wheee04:48
clarkbthe compounding problem with the gate is on a failure it does all of the work again04:49
clarkbthen you fail and it does it all again04:49
clarkband on and on04:49
*** praneshp_ has joined #openstack-infra04:52
clarkbfinding trivial patchset content is non trivial04:53
clarkbfungi: just update commit message?04:53
sdagueclarkb: ok, not quite asleep yet04:53
* mikal promises not to approve anything for a while04:53
sdaguebut it looks like there are 6 - 8 stable/havana patches in the gate04:53
sdagueso if you nuke them now, I think the gate will clear out by morning04:53
clarkbsdague: I found 504:53
sdaguelots of keystone with month old test results04:53
sdagueI went through and started -2ing a ton of stuff04:54
*** praneshp has quit IRC04:54
*** praneshp_ is now known as praneshp04:54
mikalOh, we still have that "old checks" problem?04:54
sdagueapparently, I have -2 on havana04:54
fungiclarkb: yeah, update commit message will work04:54
sdaguemikal: yes04:54
mikalWould it be meaningful to have that quick and dirty rechecker turned on04:54
StevenKsdague: But that turns into an event, and zuul isn't really getting around to that ...04:54
sdaguemikal: probably04:54
mikalI didn't do it because I was told that we'd have gerrit doing it soon04:54
sdagueStevenK: sure04:54
mikalBut if it would help, I'll get it done today04:54
sdaguehowever it will signal04:54
sdaguemikal: yes, it would be helpful, have it have a variable for # of days that we consider something stale04:55
sdaguethat we could set in infra04:55
sdagueit would be awesome04:55
mikalsdague: as in projects.yaml?04:55
* mikal pulls out that code and dusts it off04:55
jhesketh__mikal: is this the turbo-hipster gerrit rechecker?04:55
mikaljhesketh__: yeah04:55
sdaguemikal: wherever clarkb and fungi think it should live04:55
sdaguejust want to make it configurable04:56
mikalIt will reduce the number of merge fails04:56
mikalWell, what you get today is quick and dirty04:56
jhesketh__mikal: unless you set up turbo-hipster on infra the config will have to be in our cloud04:56
sdaguemikal: this actually isn't a merge fail problem04:56
jhesketh__well I guess you could hit a url for it04:56
mikalAnd then we do something less shit sometime real soon04:56
sdagueit's the fact that tox or deps changed in a month04:56
sdagueso the passing results aren't valid at all04:56
clarkbsdague: some of these do actually fail to merge04:56
clarkbsdague: its fun...04:56
mikalYeah, so a recheck of checks older than a week would have covered this, right?04:56
sdagueclarkb: ok04:56
sdaguemikal: yes04:56
clarkbsdague: I am pushing patchsets though to make it clear04:57
jhesketh__sdague: sure, so this code mikal whacked together is a turbo-hipster plugin.. so it'll probably not be configurable today if you want quick and dirty04:57
mikalOk, cool04:57
mikalI shall do a thing04:57
mikaljhesketh__: I think that's ok04:57
sdaguemikal: you are my hero :)04:57
mikalWe can make it suck less tomorrow04:57
jhesketh__mikal: oh yeah, I agree. Just letting others know04:57
mikalI need theme music04:57
fungiclearly i can't work on things and keep up with irc at the same time04:57
fungii'm sure you're all discussing exciting things04:58
mikalLOL04:58
mikalJust robots of doom04:58
mikaljhesketh__: is testzuul free at the moment?04:59
mikaljhesketh__: I might run this there04:59
jhesketh__mikal: go for it... I think it's in an okay state04:59
mikaljhesketh__: cool04:59
clarkbsdague: lol bugs are getting assigned to me because I am writing those patchsets :)04:59
fungiclarkb: so, new-new-zuul is 2001:4800:7815:0101:3bc3:d7f6:ff04:e07f05:00
fungi15g tmpfs on the git dir05:00
fungizuul daemon seems to properly recreate the contents of that directory when it's started05:00
clarkbfungi: noice05:00
fungii've also started the puppet agent on it05:00
clarkbfungi: is it accepting jobs though?05:00
clarkboh I know where we need to update firewalls, on the jenkins masters05:01
clarkber wait no05:01
clarkbwe just need to make sure the jenkins masters connect to new new zuul's geard05:01
fungiyeah. but i've stopped the zuul daemon again just to be safe05:01
clarkbcool05:02
*** chandankumar has quit IRC05:03
clarkbfungi: so ya, I think we plan to do a switcheroo early tomorrow and see if tmpfs helps a bunch05:03
*** mrda has quit IRC05:03
clarkbI will attempt to wake up early05:03
*** resker has joined #openstack-infra05:03
fungii'll be around and ready05:04
clarkbsdague: I have killed two keystone changes and one swift, there appear to be 3 more changes05:04
clarkbsdague: slowly getting through them05:04
notmynameclarkb: https://review.openstack.org/#/c/67186/ and https://review.openstack.org/#/c/67187/ are backports for the CVE bug05:05
notmynamefor grizzly and havana05:06
clarkbnotmyname: ok, neither will pass the gate until grenade is working for grizzly and havana05:06
clarkbnotmyname: sdague and chmouel are working on that as a priority05:06
notmynameclarkb: right. I just thought you were working on making sure those don't get into the queue. they were/are marked as approved05:07
*** esker has quit IRC05:07
clarkbnotmyname: I didn't see them in the queue05:07
notmynameah ok05:07
*** ruhe is now known as _ruhe05:07
*** krtaylor has joined #openstack-infra05:08
clarkbI think I got all of them according to a gerrit search05:09
clarkbjhesketh__: going back to zuul slowness. I probably wasn't entirely clear, but in zuul's main loop it processes all results then processes events05:11
*** yamahata has joined #openstack-infra05:12
clarkbjhesketh__: results cause gate resets (if a job result was a failure); this causes zuul to cancel all jobs in the gate behind it, then remerge the new state of the proposed git merges, then start jobs for all of those changes. That process takes 15 minutes or more with 90 changes in the queue05:12
fungiright. pragmatic ordering since results have a chance of reducing the complexity05:12
clarkbjhesketh__: that entire process is one iteration through the loop so no other results or events are processed during that time05:12
clarkbjhesketh__: because of that zuul per pipeline won't fix the problem but it will decouple it from check and post and so on05:12
*** zz_ewindisch is now known as ewindisch05:13
*** mrda has joined #openstack-infra05:13
clarkbjhesketh__: zuul per pipeline will still result in really slow gate processing. The way to fix that is to make git operations quicker. git worker nodes and git repos in tmpfs should make that better. And honestly after reading through logs I think if we solve that problem then zuul per pipeline isn't necessary05:13
clarkbwe are literally spending minutes running git remote update and git checkout foo and git merge05:14
*** resker has quit IRC05:14
jhesketh_clarkb: okay, thanks for the clarification, makes sense05:14
fungiclarkb: it might also have the effect of interleaving workers between pipelines, unlike the broad swing we see now (gate resets, all pending check changes get workers, then attempts are made on the gate changes, repeat)05:15
clarkbfungi: yup05:15
fungisince there would be more than one gearman server for a jenkins master to listen to05:15
clarkbjhesketh_: I do think another thing that would help but would require massive rewrites of zuul is to do everything in a non blocking manner. fire off hundreds of git merges at once and wait for IO to happen. Using the git gearman workers approximates this but could probably just be done in process too05:16
*** sarob_ has quit IRC05:18
clarkblifeless: https://jenkins02.openstack.org/job/gate-neutron-python27/6117/console is that a limitation of testtools matchers?05:18
clarkbjhesketh_: the whole situation has led me to drinking heavily05:18
*** amotoki_ has joined #openstack-infra05:18
*** sarob has joined #openstack-infra05:18
clarkbjhesketh_: :)05:18
*** SergeyLukjanov is now known as SergeyLukjanov_05:20
lifelessclarkb: no05:20
jhesketh_clarkb: heh, okay05:20
*** SergeyLukjanov_ is now known as SergeyLukjanov05:21
fungiclarkb: the whole situation has gotten in the way of my usual heavy drinking. opposite of the expected effect05:21
clarkbfungi: I'm sorry, I found this IPA to help tremendously05:21
lifelessthe matcher api doesn't assume strings etc05:21
*** sarob_ has joined #openstack-infra05:21
*** SergeyLukjanov is now known as SergeyLukjanov_05:21
mikalclarkb: is there a way to specify a wildcard project name in layout.yaml?05:21
fungiclarkb: as long as it's a v6 ipa05:21
clarkblifeless: I didn't think so, but figured I would ask anyways05:21
mikali.e. I want this to match more than one project05:21
clarkbmikal: no, but you can have templates that you apply to many projects05:21
mikalBut I still need to list the projects, right?05:22
clarkbmikal: yup05:22
mikal:(05:22
clarkbmikal: actually wait05:22
clarkbmikal: the thing that does event matching may do regexes everywhere /me examines code05:22
*** amotoki has quit IRC05:22
*** sarob has quit IRC05:23
clarkbmikal: best I can tell project is a magical key and doesn't05:24
clarkbsdague: russellb: fungi: the spice flows. I think that d-g change helped05:24
*** esker has joined #openstack-infra05:25
fungiclarkb: awesome. instead of ipa, i think i'm going to settle in for a nap05:25
*** sarob_ has quit IRC05:25
fungimaybe after the zuul upgrade tomorrow i'll actually find some time to start catching up on e-mail and code review05:26
mikalfungi: better code review, or we'll kick you out of core!05:26
fungimikal: somehow i think my current code review stats would let me kick everyone else out05:26
mikalLOL05:27
mikalProject of one05:27
fungibut that's holidays for you05:27
fungilast month shouldn't really count05:27
clarkblast month was a lie05:28
*** nicedice has quit IRC05:29
fungibut there *was* cake, at least05:29
clarkbcode review is high on the list of things now that we seem to have a handle on gate badness05:29
clarkband by have a handle on I mean understand05:29
fungicower in ph33r of05:30
fungi+++ATH05:31
fungiNO CARRIER05:32
clarkbfungi: is the zuul tmpfs in fstab?05:34
fungiclarkb: yup05:34
clarkbawesome, it occurred to me that a reboot may result in weird things if it wasn't05:35
funginone /var/lib/zuul/git tmpfs defaults,size=15G       0  005:35
fungiwhat kinda sysadmin do you take me for? ;)05:35
clarkb:P I am just double checking05:35
fungiyeah, good to confirm that05:35
fungii just double-checked too because i'm running on fumes and no longer trust myself05:36
openstackgerritA change was merged to openstack-infra/elastic-recheck: Add query for bug 1269940  https://review.openstack.org/6730305:36
uvirtbotLaunchpad bug 1269940 in openstack-ci "[EnvInject] - [ERROR] - SEVERE ERROR occurs:" [Undecided,New] https://launchpad.net/bugs/126994005:36
openstackgerritA change was merged to openstack-infra/elastic-recheck: Add query for bug 1260311  https://review.openstack.org/6731405:37
uvirtbotLaunchpad bug 1260311 in openstack-ci "hudson.Launcher exception causing build failures" [Low,Triaged] https://launchpad.net/bugs/126031105:37
openstackgerritA change was merged to openstack-infra/elastic-recheck: Add e-r query for bug 1266611  https://review.openstack.org/6534405:37
uvirtbotLaunchpad bug 1266611 in nova "test_create_image_with_reboot fails with InstanceInvalidState in gate-nova-python*" [Undecided,New] https://launchpad.net/bugs/126661105:37
clarkbfungi: I trust you05:38
*** odyssey4me has quit IRC05:38
fungieh, i don't recommend it. counterindicated by my operating manual05:39
* fungi is covered in warning labels05:40
clarkbfungi: I have a thing at ~10am PST, will try to be up early maybe we can attempt zuul stuff around 8am PST05:40
fungisounds great05:40
clarkbalso watch the gate, it may merge a ton of things all at once over the next 10 minutes05:41
fungii saw05:41
fungithough the longest-running changes have had a tendency to be the ones that fail, so it's always a major fake-out05:41
clarkb:/ we did just increase test time by a non trivial factor05:42
fungiplus, job run times are longer than jenkins expects now, so its estimates are a bit optimistic05:42
* clarkb hopes it is just that05:42
clarkbNNOOOOOO a job just failed05:42
clarkboh it was just a test timeout for grenade, let's bump that timeout too05:43
* clarkb proposes that change05:43
fungii'll stick around to approve it if you propose05:43
openstackgerritClark Boylan proposed a change to openstack-infra/config: Double grenade test timeouts  https://review.openstack.org/6737405:46
clarkbfungi: ^05:46
clarkbwith that in place I feel confident that the queue will move05:47
fungiit's in05:47
clarkbdanke05:47
fungiwell, approved. will take time to get through the event queue05:47
clarkbya I figure we don't worry too much about that :)05:48
*** slong has joined #openstack-infra05:50
*** slong-afk has quit IRC05:51
*** HenryG has quit IRC05:52
*** DinaBelova has joined #openstack-infra05:53
*** SergeyLukjanov_ is now known as SergeyLukjanov05:53
*** pballand has joined #openstack-infra05:56
clarkbfungi: anyways don't stay up anymore, things should settle down overnight (I hope) and we can hit this with a big hammer tomorrow05:57
*** zhiwei has quit IRC05:59
openstackgerritRuslan Kamaldinov proposed a change to openstack-infra/storyboard: Fixed doc build  https://review.openstack.org/6737606:02
openstackgerritGuido Günther proposed a change to openstack-infra/jenkins-job-builder: tests: Allow to test project parameters  https://review.openstack.org/6726506:04
openstackgerritGuido Günther proposed a change to openstack-infra/jenkins-job-builder: project_maven: Don't require artifact-id and group-id  https://review.openstack.org/6603606:04
*** reed has quit IRC06:07
*** odyssey4me has joined #openstack-infra06:08
*** CaptTofu has joined #openstack-infra06:12
*** chandankumar has joined #openstack-infra06:15
*** CaptTofu has quit IRC06:16
*** pballand has quit IRC06:17
*** praneshp is now known as praneshp_afk06:18
*** denis_makogon has joined #openstack-infra06:26
*** pelix has left #openstack-infra06:31
*** SergeyLukjanov is now known as SergeyLukjanov_06:34
*** SergeyLukjanov_ is now known as SergeyLukjanov06:36
*** afazekas_ has quit IRC06:37
*** gokrokve has quit IRC06:38
*** gokrokve has joined #openstack-infra06:38
mikalI think I just realized my approach wont work06:39
mikalThe extra text zuul puts in the review comment will stop the recheck from triggering06:39
clarkbmikal: oh right, because the regex is very restrictive :06:40
mikalYeah06:41
mikalI'm going to write a crappy daemon for now06:41
mikalBut its a shame I can't use zuul06:41
*** sHellUx has joined #openstack-infra06:41
*** gokrokve has quit IRC06:42
*** sHellUx has quit IRC06:42
*** SergeyLukjanov_ has joined #openstack-infra06:44
*** SergeyLukjanov_ has quit IRC06:45
*** DinaBelova_ has joined #openstack-infra06:46
*** vkozhukalov has quit IRC06:52
*** ewindisch is now known as zz_ewindisch06:55
*** DinaBelova has quit IRC06:56
*** DinaBelova_ is now known as DinaBelova06:56
*** SergeyLukjanov is now known as SergeyLukjanov_06:58
*** DinaBelova is now known as DinaBelova_06:58
*** mrda has quit IRC07:01
*** odyssey4me has quit IRC07:04
*** yolanda has joined #openstack-infra07:07
*** nati_uen_ has quit IRC07:11
*** odyssey4me has joined #openstack-infra07:12
*** afazekas_ has joined #openstack-infra07:25
*** jcoufal has joined #openstack-infra07:27
clarkbanteaya: can you check if https://review.openstack.org/#/c/66490/ is just broken? it is flapping in the gate and I think the patch itself doesn't work07:33
clarkbanteaya: and if so can you make sure someone proposes a new patchset to it to remove it from the gate if it is still in the gate when you see this?07:33
openstackgerritA change was merged to openstack-infra/config: Double grenade test timeouts  https://review.openstack.org/6737407:41
clarkboh good now I can go to bed07:42
openstackgerritAndreas Jaeger proposed a change to openstack-infra/config: Add gates for API projects and operations-guide  https://review.openstack.org/6739407:47
*** dizquierdo has joined #openstack-infra07:51
*** jamielennox is now known as jamielennox|away07:54
*** flaper87|afk is now known as flaper8707:55
*** DinaBelova_ is now known as DinaBelova07:58
*** SergeyLukjanov_ is now known as SergeyLukjanov07:58
*** SergeyLukjanov is now known as SergeyLukjanov_08:01
*** odyssey4me has quit IRC08:01
*** fifieldt has quit IRC08:05
*** fifieldt has joined #openstack-infra08:07
*** odyssey4me has joined #openstack-infra08:09
*** CaptTofu has joined #openstack-infra08:12
*** bookwar has quit IRC08:14
*** bookwar has joined #openstack-infra08:16
*** CaptTofu has quit IRC08:17
*** jcoufal has quit IRC08:21
*** SergeyLukjanov_ is now known as SergeyLukjanov08:24
*** mancdaz_away is now known as mancdaz08:25
*** mancdaz is now known as mancdaz_away08:25
*** vkozhukalov has joined #openstack-infra08:28
*** jcoufal has joined #openstack-infra08:31
*** luqas has joined #openstack-infra08:32
*** mancdaz_away is now known as mancdaz08:34
*** coolsvap has quit IRC08:35
*** coolsvap has joined #openstack-infra08:35
*** odyssey4me has quit IRC08:36
*** fifieldt has quit IRC08:37
*** NikitaKonovalov has joined #openstack-infra08:42
*** odyssey4me has joined #openstack-infra08:44
*** dpyzhov has joined #openstack-infra08:47
*** talluri has joined #openstack-infra08:48
*** odyssey4me has quit IRC08:49
*** mrmartin has joined #openstack-infra08:50
*** ogelbukh has quit IRC08:55
*** odyssey4me has joined #openstack-infra08:56
*** hashar has joined #openstack-infra08:57
*** lyle has joined #openstack-infra08:58
*** mrmartin has quit IRC08:58
*** david-lyle has quit IRC08:58
*** emagana has quit IRC08:59
*** mdenny has quit IRC09:01
*** mdenny has joined #openstack-infra09:01
*** vkozhukalov has quit IRC09:03
*** mrmartin has joined #openstack-infra09:04
*** mrmartin has quit IRC09:08
*** kruskakli has quit IRC09:11
*** fbo_away is now known as fbo09:12
*** praneshp_afk has quit IRC09:12
*** mrmartin has joined #openstack-infra09:13
*** _ruhe is now known as ruhe09:17
*** vkozhukalov has joined #openstack-infra09:18
*** yassine has joined #openstack-infra09:20
*** IvanBerezovskiy has joined #openstack-infra09:20
*** JohanH has joined #openstack-infra09:21
*** markmc has joined #openstack-infra09:22
*** max_lobur_afk is now known as max_lobur09:23
*** pblaho has joined #openstack-infra09:26
JohanHHi, we are trying to get Zuul to work in our own project and we are running into some issues: we cannot get several concurrent gate checks to execute in parallel. The first job starts but all the other changes in the queue are skipped. Does anyone know what the problem might be? We would like to run as many parallel jobs as possible, utilizing all our jenkins slave workers09:28
*** luqas has quit IRC09:38
*** ruhe is now known as ruhe_away09:41
*** ruhe_away is now known as ruhe09:42
*** denis_makogon has quit IRC09:44
SergeyLukjanovJohanH, which dependency manager are you using?09:45
SergeyLukjanovJohanH, if you're setting up zuul for gerrit.o.o then you need to use the 'check' pipeline instead of 'gate', because zuul.o.o will merge the files instead of yours09:46
*** jooools has joined #openstack-infra09:47
*** luqas has joined #openstack-infra09:47
*** odyssey4me has quit IRC09:54
*** yamahata has quit IRC09:56
JohanHHi SergeyLukjanov, we are using the gate pipeline and then I guess that it is the dependent pipeline manager. According to the zuul documentation and the description for the DependentPipelineManager: In order to achieve parallel testing of changes, the dependent pipeline manager performs speculative execution on changes. It orders changes based on their entry into the pipeline. It begins testing all changes in parallel, assuming that each change ahead in the pipeline09:58
JohanHwill pass its tests. If they all succeed, all the changes can be tested and merged in parallel.09:58
*** jishaom has quit IRC09:59
flaper87fungi: any way I can ssh into a box running this test? http://logs.openstack.org/99/65499/4/check/gate-glance-python27/ff2cac8/nose_results.html09:59
JohanHSo, wouldn't it start testing the changes in parallel09:59
flaper87fungi: I've no idea what's going on there and tests pass in my box09:59
*** xchu has quit IRC09:59
*** odyssey4me has joined #openstack-infra10:03
*** SergeyLukjanov is now known as SergeyLukjanov_a10:10
*** SergeyLukjanov_a is now known as SergeyLukjanov_10:11
*** dpyzhov has quit IRC10:11
*** CaptTofu has joined #openstack-infra10:13
*** jp_at_hp has joined #openstack-infra10:14
*** CaptTofu has quit IRC10:18
*** SergeyLukjanov_ is now known as SergeyLukjanov10:21
*** pblaho has quit IRC10:21
*** rakhmerov has quit IRC10:22
openstackgerritGuido Günther proposed a change to openstack-infra/jenkins-job-builder: tests: Allow to test project parameters  https://review.openstack.org/6726510:25
openstackgerritGuido Günther proposed a change to openstack-infra/jenkins-job-builder: project_maven: Don't require artifact-id and group-id  https://review.openstack.org/6603610:25
*** talluri has quit IRC10:29
*** mrda has joined #openstack-infra10:29
*** talluri has joined #openstack-infra10:30
mikalIt is scary how often the stale recheck bot fires10:32
mikalIt's like... really common10:33
*** dpyzhov has joined #openstack-infra10:35
*** jooools has quit IRC10:40
openstackgerritSlickNik proposed a change to openstack-infra/config: Update devstack-gate jobs for Trove tempest tests  https://review.openstack.org/6506510:40
openstackgerritSlickNik proposed a change to openstack-infra/devstack-gate: Add Trove testing support  https://review.openstack.org/6504010:42
*** zhiyan has left #openstack-infra10:43
SlickNik^^ jeblair / mordred / fungi / clarkb Please review when you get a chance. Thanks!10:43
mikalclarkb: I have a simple bot which does rechecks, I'm not going to leave it running overnight though, as it scares me that it might recheck the world without permission10:44
mikalAlso, the check queue is pretty long at the moment10:44
*** jooools has joined #openstack-infra10:46
*** vkozhukalov has quit IRC10:46
*** nosnos has quit IRC10:53
SergeyLukjanovJohanH, it should start in parallel10:54
SergeyLukjanovJohanH, do you have enough slaves?10:54
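For a setup like JohanH describes, the relevant pieces are a gate pipeline driven by the DependentPipelineManager plus enough Jenkins slaves for Zuul to launch jobs for several changes at once; changes only share a dependent queue (and so get tested speculatively together) when their projects share jobs. A minimal layout.yaml sketch, assuming a Gerrit approval vote as the trigger; reporter syntax differs between Zuul versions, so treat the details as illustrative:

    pipelines:
      - name: gate
        manager: DependentPipelineManager
        trigger:
          gerrit:
            - event: comment-added
              approval:
                - approved: 1
        success:
          verified: 2
          submit: true
        failure:
          verified: -2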
*** mrmartin has quit IRC10:54
anteayamikal: thank you for holding off on the recheck bot10:55
anteayawe would never climb out of the current situation10:55
anteayayay down to 64 events, progress10:55
anteayawe started off yesterday with over 1000 events but never got below 600 by the end of my day yesterday10:56
mikalanteaya: so, the thinking is a recheck is a lot cheaper than a gate merge flush10:58
mikalSo, we were hoping doing rechecks on ancient check runs would make the gate queue a bit less horrible10:58
*** vkozhukalov has joined #openstack-infra10:58
*** yaguang has quit IRC10:58
mikalThe bot only does a recheck if someone comments on a review with an ancient check, so it's also not a blanket thing10:58
mikalBut I will stop it overnight and keep an eye on it while it's running10:58
*** SergeyLukjanov is now known as SergeyLukjanov_11:00
*** tma996 has joined #openstack-infra11:02
*** talluri has quit IRC11:05
*** amotoki has joined #openstack-infra11:05
*** derekh has joined #openstack-infra11:05
anteayamikal: hmmm okay, let's keep an eye on the amount of events11:06
anteayaif you have been running it on the system for the past 8 hours, it might partly account for the > 500 event decrease I see on the zuul status page11:07
*** amotoki_ has quit IRC11:07
*** SergeyLukjanov_ is now known as SergeyLukjanov11:07
*** SergeyLukjanov is now known as SergeyLukjanov_11:08
anteayaclarkb: salv-orlando beat me to it with a big -2 on 66490, thanks for alerting us and sorry for causing a problem11:09
kiallSo - Just noticed a change that merged yesterday https://review.openstack.org/#/c/67143/ never got pushed to github, but did make it to git.o.o ..11:09
kiallI'm assuming the next merge will "fix" it .. But might be a problem11:09
*** NikitaKonovalov has quit IRC11:10
*** rakhmerov has joined #openstack-infra11:10
anteayahe sniped it with a new patchset11:10
sdaguemorning folks11:13
anteayamorning sdague11:14
*** rakhmerov has quit IRC11:14
anteayamikal: I just read part of the backscroll, clarkb and fungi were casting incantations last night and some of them seemed to be working11:15
anteayaso that might be part of the source of the > 500 decrease in events11:15
sdagueyeh, jenkins is still blowing us up it looks like11:16
sdaguewhich actually seems to be the root cause of the problem right now11:16
anteayaclarkb and fungi are planning a zuul upgrade at 11am this morning11:17
anteayaall things being equal11:17
sdaguehttp://status.openstack.org/elastic-recheck/ - graphs 1, 2, and 3 are jenkins errors11:17
sdague#2 isn't affecting us, but the others are11:17
anteayagoodness we didn't fare well yesterday afternoon11:18
anteayagrenade test timeouts have been doubled: https://review.openstack.org/#/c/67374/11:18
anteayaand I think there was another d-g change but I didn't get far enough back in the backscroll to id the url for it11:19
*** SergeyLukjanov_ is now known as SergeyLukjanov11:22
*** ArxCruz has joined #openstack-infra11:22
*** mrda has quit IRC11:26
openstackgerritSean Dague proposed a change to openstack-infra/elastic-recheck: only run on openstack gate projects  https://review.openstack.org/6727311:27
openstackgerritSean Dague proposed a change to openstack-infra/elastic-recheck: expose on channel when we timeout on logs  https://review.openstack.org/6656511:27
openstackgerritSean Dague proposed a change to openstack-infra/elastic-recheck: move to static LOG  https://review.openstack.org/6656411:27
openstackgerritSean Dague proposed a change to openstack-infra/elastic-recheck: create more sane logging for the er bot  https://review.openstack.org/6643511:27
anteayatimeouts for tempest runs have also been increased: https://review.openstack.org/6637911:30
anteayaI think that was the other change I saw referenced11:30
*** vipul is now known as vipul-away11:31
anteayamordred and clarkb: jog0 had done some evaluation of times using eatmydata yesterday and I believe the conclusion he and fungi had reached was it was not a significant time savings11:32
anteayaif I recall they were both rather disappointed by the outcome11:32
anteayaping jog0 for exact details as I might be incorrect in the application of what was being evaluated11:33
anteayait's early11:33
*** ruhe is now known as _ruhe11:33
*** rfolco has joined #openstack-infra11:33
*** NikitaKonovalov has joined #openstack-infra11:39
openstackgerritA change was merged to openstack-infra/elastic-recheck: only run on openstack gate projects  https://review.openstack.org/6727311:40
*** DinaBelova is now known as DinaBelova_11:41
openstackgerritA change was merged to openstack-infra/elastic-recheck: create more sane logging for the er bot  https://review.openstack.org/6643511:41
*** SergeyLukjanov is now known as SergeyLukjanov_11:41
openstackgerritA change was merged to openstack-infra/elastic-recheck: move to static LOG  https://review.openstack.org/6656411:41
openstackgerritA change was merged to openstack-infra/elastic-recheck: expose on channel when we timeout on logs  https://review.openstack.org/6656511:43
*** DinaBelova_ is now known as DinaBelova11:47
*** smarcet has joined #openstack-infra11:51
*** _ruhe is now known as ruhe11:52
*** dpyzhov has quit IRC11:52
*** dpyzhov has joined #openstack-infra11:53
*** jcoufal has quit IRC11:56
*** mrmartin has joined #openstack-infra11:59
*** DinaBelova is now known as DinaBelova_12:00
*** vkozhukalov has quit IRC12:00
*** hashar has quit IRC12:03
*** dstanek has quit IRC12:06
*** talluri has joined #openstack-infra12:10
*** lcestari has joined #openstack-infra12:10
*** rakhmerov has joined #openstack-infra12:11
*** vkozhukalov has joined #openstack-infra12:12
*** pblaho has joined #openstack-infra12:12
*** CaptTofu has joined #openstack-infra12:14
*** rakhmerov has quit IRC12:15
dimssdague, i had a suggestion in https://bugs.launchpad.net/openstack-ci/+bug/1260311/comments/312:15
uvirtbotLaunchpad bug 1260311 in openstack-ci "hudson.Launcher exception causing build failures" [Low,Triaged]12:15
dimsfor the jenkins troubles12:15
sdaguesure12:16
sdaguehonestly, that's suspiciously high to me12:16
sdagueI need to talk with fungi when he gets up12:16
dimswe are on 1.525 of jenkins12:16
sdaguebecause it might be one of the things that there is retry logic around, but we still count it as a fail12:17
*** vkozhukalov has quit IRC12:17
sdaguewhich would totally skew things in graphite12:17
dimsy12:17
*** CaptTofu has quit IRC12:19
*** dpyzhov has quit IRC12:19
*** talluri has quit IRC12:21
*** vkozhukalov has joined #openstack-infra12:32
*** jcoufal has joined #openstack-infra12:33
dimssdague, bit more looking around and new recommendation on the version # for jenkins (https://bugs.launchpad.net/openstack-ci/+bug/1260311/comments/4)12:38
uvirtbotLaunchpad bug 1260311 in openstack-ci "hudson.Launcher exception causing build failures" [Low,Triaged]12:38
*** chandankumar has quit IRC12:42
*** hashar has joined #openstack-infra12:43
*** derekh has quit IRC12:46
openstackgerritDavanum Srinivas (dims) proposed a change to openstack-infra/elastic-recheck: Better query for bug 1260311  https://review.openstack.org/6744612:49
uvirtbotLaunchpad bug 1260311 in openstack-ci "hudson.Launcher exception causing build failures" [Low,Triaged] https://launchpad.net/bugs/126031112:49
*** dpyzhov has joined #openstack-infra12:51
*** emagana has joined #openstack-infra12:52
*** talluri has joined #openstack-infra12:53
*** dstanek has joined #openstack-infra12:53
*** CaptTofu has joined #openstack-infra12:55
*** emagana has quit IRC12:56
*** dstanek has quit IRC12:59
*** salv-orlando has quit IRC13:02
*** zz_ewindisch is now known as ewindisch13:02
*** coolsvap has quit IRC13:09
*** ewindisch is now known as zz_ewindisch13:09
*** mrmartin has quit IRC13:09
*** markmc has quit IRC13:11
*** rakhmerov has joined #openstack-infra13:12
*** zz_ewindisch is now known as ewindisch13:14
*** rakhmerov has quit IRC13:16
*** ewindisch is now known as zz_ewindisch13:18
*** amotoki_ has joined #openstack-infra13:18
*** amotoki has quit IRC13:20
*** jcoufal has quit IRC13:21
*** dizquierdo has quit IRC13:26
*** mfink has quit IRC13:26
*** dstanek has joined #openstack-infra13:29
*** thomasem has joined #openstack-infra13:31
*** hashar has quit IRC13:31
chmouelsdague: i was wondering if you were working on stable/grizzly issues as well?13:31
*** DinaBelova_ is now known as DinaBelova13:33
*** dstanek has quit IRC13:34
openstackgerritA change was merged to openstack-infra/elastic-recheck: Better query for bug 1260311  https://review.openstack.org/6744613:34
uvirtbotLaunchpad bug 1260311 in openstack-ci "hudson.Launcher exception causing build failures" [Low,Triaged] https://launchpad.net/bugs/126031113:34
sdaguechmouel: trying to get your patch up now in a test env to try to help13:35
chmouelsdague: i think there is a bit more than that, at least with euca2ools and boto being incompatible13:35
*** dims has quit IRC13:35
sdaguechmouel: but we aren't running those anyway, right?13:36
chmouelsdague: i think we still have failures in tempest.tests.boto.test_ec2_volumes.EC2VolumesTest.test_create_volume_from_snapshot13:37
*** pblaho has quit IRC13:37
*** pblaho has joined #openstack-infra13:37
chmouelsdague: from https://review.openstack.org/#/c/67311/13:37
chmouelsdague: if i just rm -rf /usr/local/lib/**/*boto and rerun tempest it seems to work13:38
*** mfink has joined #openstack-infra13:38
*** carl_baldwin has joined #openstack-infra13:40
*** markmc has joined #openstack-infra13:40
sdaguechmouel: so in that review I'm seeing volumes fails unrelated to ec213:40
sdaguechmouel: http://logs.openstack.org/11/67311/2/check/check-tempest-dsvm-full/779c8f6/logs/screen-c-sch.txt.gz13:40
*** hashar has joined #openstack-infra13:41
chmouelsdague: oh yeah right, the ec2 runs but fails as you say due to the issue with cinder http://ep.chmouel.com:8080/Screenshots/2014-01-17__14-41-56.png13:42
*** nati_ueno has joined #openstack-infra13:42
*** nati_ueno has quit IRC13:42
russellbso, based on the failure rates graph here, looks like failure rates are down a good bit today?  http://status.openstack.org/elastic-recheck/13:43
sdaguerussellb: yes, I definitely think the concurrency reduction helped13:43
russellbok cool13:43
*** jcoufal-m has joined #openstack-infra13:43
russellbmay take the weekend for the queues to recover a bit it seems13:44
*** nati_ueno has joined #openstack-infra13:44
*** DinaBelova is now known as DinaBelova_13:44
sdagueyeh, there are still other kinds of fails going on, which we'll need to figure out13:44
*** julim has joined #openstack-infra13:44
sdaguealso need to get the word out that stable bits can't be put in the gate right now until we address the pip 1.5 issue on grizzly devstack13:44
*** jcoufal-m_ has joined #openstack-infra13:45
sdaguewhich will kill a stable/havana change because of grenade13:45
*** jcoufal-m_ has quit IRC13:45
*** jcoufal-m_ has joined #openstack-infra13:45
sdaguechmouel: so the log for that run is confusing13:45
*** emagana has joined #openstack-infra13:45
russellbalrighty13:45
russellbon to some other bugs then13:45
sdaguerussellb: yep, and thanks for getting to the bottom of the load thing13:46
chmouelsdague: yeah with my patch on my just rekicked test vm i definitely get netaddr updated properly:13:46
chmouelubuntu@devstack:~$ pip freeze|grep netaddr13:46
chmouelWarning: cannot find svn location for distribute==0.6.24dev-r013:46
chmouelnetaddr==0.7.1013:46
sdagueright, but something isn't right13:47
russellbsdague: np13:47
fungimmm, dims is gone, but what he doesn't realize is that we're actually only on 1.525 for jenkins01, but we're also seeing the same java stack trace (the missing class master one) on jenkins02 which runs 1.54313:47
sdaguechmouel: the fact that we pip install netaddr 6 times over the course of the console13:47
sdaguemeans pip keeps thinking there is a 0.7.5 to remove13:47
sdaguewhich is why cinder explodes13:47
*** dims has joined #openstack-infra13:48
sdaguefungi: right, so we started classifying infra bugs in er yesterday (because our classification rate was down to 30%)13:48
*** nati_ueno has quit IRC13:48
*** jcoufal-m has quit IRC13:49
sdaguefungi: http://status.openstack.org/elastic-recheck/ - Bug 126031113:49
uvirtbotLaunchpad bug 1260311 in openstack-ci "hudson.Launcher exception causing build failures" [Low,Triaged] https://launchpad.net/bugs/126031113:49
*** DinaBelova_ is now known as DinaBelova13:49
sdagueit's the 3rd graph down13:49
sdagueof the er graphs13:49
sdagueit's so high and so frequent, I feel like we must be misunderstanding something13:49
chmoueldo we need to install python-netaddr from the packages first?13:50
*** emagana has quit IRC13:50
*** emagana has joined #openstack-infra13:50
fungiright. it's like i was explaining to jog0, we can *either* have catchall buckets like "jenkins it breakybadz" or we can track specific problems, but please let's not try to use a bug with "we gots stack traces" to diagnose actual failures13:50
sdaguefungi: so, we can do it however you'd like to13:51
sdaguebut realize that those are failure events in graphite13:51
sdagueso right now ~ 40% of graphite failures for gate jobs are infra13:51
sdaguefor the last week13:51
*** salv-orlando has joined #openstack-infra13:52
*** zul has quit IRC13:52
fungiwell, i'm okay with catchall bucket bugs for that. and i'm fine with "jenkins stack trace" as an elastic-recheck pattern, but keep in mind that it's not going to assist much in diagnosing the underlying problem and the moment other devs start jumping in and trying to use the bug to that end, we're going to be running in circles chasing our tails13:52
*** dkliban has joined #openstack-infra13:52
*** jcoufal-m_ has quit IRC13:53
sdaguefungi: sure13:53
sdaguefungi: the point I'm trying to ask is: is that issue, which looks like a failure to launch at all, something that we already recover from?13:53
*** dcramer_ has quit IRC13:53
fungithat bug you linked has already collected stack trace details for two almost certainly unrelated issues, and dims was trying to use it to track down upstream bugs in jenkins. that's going to waste a lot of people's time13:53
*** yamahata has joined #openstack-infra13:53
*** SergeyLukjanov_ is now known as SergeyLukjanov13:54
fungithe *first* stack trace in that bug, from what we've seen, is the vm going missing between when it first talks to the jenkins master and when it gets assigned a job13:54
*** dprince has joined #openstack-infra13:55
sdaguefungi: so we can work on getting these broken out, which is fine, this is a process13:55
fungithe second stack trace in that bug is deeper in the slave agent, causing some manner of miscommunication with the master13:55
chmouelit's a bit annoying that i can't reproduce on clean precise vm :( the tempest runs fine after with my patch13:55
*** emagana has quit IRC13:55
*** zul has joined #openstack-infra13:55
*** emagana has joined #openstack-infra13:55
fungisdague: we already had two separate bugs. i referred that comment back to the other bug13:56
*** markmc has quit IRC13:56
sdaguefungi: ok, so we'll refine this. What I really want to know is are these gate resetting bugs, or are we actually autorecovering in zuul13:57
*** herndon_ has joined #openstack-infra13:57
fungiwell, we have seen both those stack traces associated with job failures. that's not to say that they don't also appear when a job gets aborted/cancelled and we tear down the vm before jenkins is done processing the abort/cancellation13:58
sdaguefungi: so the current rates on those makes those the biggest cause of resets right now13:59
fungibut i think in those cases we don't get logs into logstash, so if you're finding them there then these are likely jobs which did fail at some level13:59
sdaguefungi: this is datamining logstash13:59
fungiright. that's what i figured13:59
sdagueso only if it gets to logstash, and is marked as FAILURE13:59
*** markmc has joined #openstack-infra13:59
fungiwas there a job status of failure associated with those?13:59
sdaguebuild_status:FAILURE14:00
fungithis keyboard is annoying me14:00
*** CaptTofu has quit IRC14:00
*** CaptTofu has joined #openstack-infra14:00
*** jcoufal has joined #openstack-infra14:01
fungii do think it's probably not the biggest cause of actual gate resets though. the majority are going to be the one where the persistent slave is eaten by bug 1267364 and kills a lot of jobs at once, but we fix it by the time it's ejected one or two changes out of the gate (and the rest end up testing clean when the gate reset is done processing)14:02
uvirtbotLaunchpad bug 1267364 in openstack-ci "Recurrent jenkins slave agent failures" [Critical,In progress] https://launchpad.net/bugs/126736414:02
fungithe continuing work to move our testing off persistent slaves is our current solution to that14:02
*** mfer has joined #openstack-infra14:03
fungithe incidence of it has gone way down in the past week from what i've seen (i've only had to offline one persistent slave in several days even under the heaviest load we've been seeing)14:03
fungiit does still crop up for nonpersistent slaves, but they get torn down after impacting a single job rather than taking out dozens in a shooting-spree14:04
*** CaptTofu has quit IRC14:05
*** annegent_ has joined #openstack-infra14:05
*** smarcet has left #openstack-infra14:05
sdaguefungi: http://logstash.openstack.org/#eyJmaWVsZHMiOltdLCJzZWFyY2giOiJtZXNzYWdlOlwiamF2YS5pby5JbnRlcnJ1cHRlZElPRXhjZXB0aW9uXCIgQU5EIGZpbGVuYW1lOlwiY29uc29sZS5odG1sXCIgIEFORCBtZXNzYWdlOlwiaHVkc29uLkxhdW5jaGVyJFJlbW90ZUxhdW5jaGVyLmxhdW5jaFwiIEFORCBidWlsZF9xdWV1ZTpnYXRlIiwidGltZWZyYW1lIjoiODY0MDAiLCJncmFwaG1vZGUiOiJjb3VudCIsIm9mZnNldCI6MCwidGltZSI6eyJ1c2VyX2ludGVydmFsIjowfSwic3RhbXAiOjEzODk5Njc1MzI3MDl914:05
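The base64 blob in that logstash URL decodes to the query itself: message:"java.io.InterruptedIOException" AND filename:"console.html" AND message:"hudson.Launcher$RemoteLauncher.launch" AND build_queue:gate, over an 86400-second (24-hour) window. Expressed as an elastic-recheck fingerprint it would look roughly like this; the per-bug query file layout is an assumption about the repository's conventions:

    # queries/1260311.yaml (sketch)
    query: >
      message:"java.io.InterruptedIOException" AND
      message:"hudson.Launcher$RemoteLauncher.launch" AND
      filename:"console.html"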
fungithe combination of it mostly only cropping up when the jenkins masters are under heavy strain and the accompanying gate dynamics when we're under that sort of load make the ratio of full gate resets to individual job failures for that bug abnormally high (probably by orders of magnitude)14:06
IvanBerezovskiyfungi, hi. Can I ask you  a question about Cassandra and Hbase installation on CI nodes?14:06
sdague231 gate errors in the last 24 hrs14:06
fungiIvanBerezovskiy: are you the one working to get it supported in ubuntu lts?14:06
*** jhesketh__ has quit IRC14:06
fungisdague: how many were from centos6-1?14:06
*** markmcclain has joined #openstack-infra14:07
fungithat's the one which went wild last night while i was at dinner, and i had to put it down when i got back to the computer14:07
fungisdague: but i agree, we should take this as a sign to continue prioritizing a move to nonpersistent slaves for all non-privileged jobs14:09
sdaguesure14:09
sdague25 were tempest-dsvm-full14:09
sdagueso it's not just the unit test nodes14:09
fungigood to know. those hopefully should have been only one job affected per slave experiencing that error14:10
dimsfungi, when you get a chance can i please have a stack trace from the 1.543 install for the JENKINS-19453 bug so i can try to match it to jenkins source to see if i can find something (per your comment #5 in bug 1260311)14:10
uvirtbotLaunchpad bug 1260311 in openstack-ci "hudson.Launcher exception causing build failures" [Low,Triaged] https://launchpad.net/bugs/126031114:10
fungisdague: and that's for the stacktrace in 1267364, not the other one?14:10
fungidims: it's in the bug i linked14:10
*** prad has joined #openstack-infra14:11
*** jaypipes has joined #openstack-infra14:11
fungidims: oh, actually i guess it's not14:12
fungiwe only had them from the jenkins console, which expires out after 24 hours14:12
*** rakhmerov has joined #openstack-infra14:13
*** NikitaKonovalov has quit IRC14:13
sdaguefungi: http://logs.openstack.org/84/65184/4/gate/gate-tempest-dsvm-postgres-full/7c3f2bc/console.html  is being classified as Bug 1260311 by jog0's query14:13
uvirtbotLaunchpad bug 1260311 in openstack-ci "hudson.Launcher exception causing build failures" [Low,Triaged] https://launchpad.net/bugs/126031114:13
*** NikitaKonovalov has joined #openstack-infra14:13
IvanBerezovskiyfungi, As it was said here https://review.openstack.org/#/c/66884/ we can't use non-ubuntu mirrors. So i want to find another way to install these packages. My suggestion is to create a job for a single-use node like https://git.openstack.org/cgit/openstack-infra/config/tree/modules/openstack_project/files/jenkins_job_builder/config/storyboard.yaml . So it'll be a job with a shell script that'll install cassandra and hbase. What do you think?14:13
sdaguewhich, we can figure out if that's wrong14:13
*** yaguang has joined #openstack-infra14:14
fungisdague: so there may be several different issues there14:14
jog0that query was taken straight from dims comment in the bug14:14
*** nati_ueno has joined #openstack-infra14:14
sdaguefungi: sure, so we should narrow that out14:14
fungidims: the stacktrace we were seeing in both 1.525 and 1.543 is the java.lang.NoClassDefFoundError: Could not initialize class jenkins.model.Jenkins$MasterComputer one http://paste.openstack.org/show/60883/14:15
sdagueI'm just concerned that we've had 40+ dsvm hits on that in the last 24hrs so resetting that way every 35 minutes seems very bad14:15
sdagueand could be a contributing factor to the zuul load14:15
fungisdague: right, this takes us back to "do we want a catchall bucket for people to recheck these against or should we have separate bugs for the different causes/events"14:15
sdaguefungi: what would you like?14:16
dimsfungi, the line numbers will be different between 1.525 and 1.543 - trying to figure out which stack trace came from which version14:16
*** jhesketh_ has quit IRC14:16
fungiIvanBerezovskiy: if it's for non-openstack jobs, that's fine. for openstack projects, all those jobs would fail any time that remote repository is unreachable/broken14:16
sdaguethe er bug reporting is an art not a science, so we just want rules in there on how to categorize it.14:16
fungidims: ahh, i may not have captured the exact line numbers for one triggered from jenkins02 in that bug. we'd need to find a new slave exhibiting that failure from jenkins02 and get those details14:17
*** nati_ueno has quit IRC14:17
*** rakhmerov has quit IRC14:17
dimsthanks fungi i'll look for it as well14:18
sdaguefungi: is there better metadata in ES that we need to bin these?14:18
*** nati_ueno has joined #openstack-infra14:18
jog0fungi: I am happy to split the bugs as you want14:18
*** zz_ewindisch is now known as ewindisch14:18
jog0as long as we are categorizing them under something I am happy14:18
fungisdague: i'm fine with catch-all bugs for elastic-recheck to use for infra problems, but we would still use separate infra bugs to work through the actual causes. in many cases, the bugs themselves will be solved before someone can add an accurate e-s pattern to match them14:19
ruhefungi: (on the topic started by IvanBerezovskiy), so the only option to test ceilometer backends, which aren't present in stable mirrors -  is to get them (hbase and cassandra) supported in ubuntu lts?14:19
sdaguefungi: well that's not the case for at least 3 infra bugs right now14:19
jog0fungi: you won't like this query then: bug 126994014:19
uvirtbotLaunchpad bug 1269940 in openstack-ci "[EnvInject] - [ERROR] - SEVERE ERROR occurs:" [Undecided,New] https://launchpad.net/bugs/126994014:19
fungiruhe: how do you expect people running plain ubuntu to test that on their own systems (particularly if they can't/won't install unvetted/insecure third-party packages)?14:20
*** sandywalsh has quit IRC14:20
fungisdague: agreed. it ends up being the case for other infra bugs however14:21
*** rossella_s has joined #openstack-infra14:21
fungijog0: i think it's like matching on "python traceback"14:21
jog0fungi: haha yup14:21
* dims realizes we need the jenkins01/02 info in logstash as well :)14:22
jog0that is a catch all as a stop gap for classifiying things14:22
fungi"bug: we seem to be using python"14:22
*** amotoki_ has quit IRC14:22
jog0so yes I agree it's a really vague, somewhat useless bug. so as we know more we can split the bug up14:23
ruhefungi: i understand your concern. the problem with these storage backends is they only have vendor-managed repositories and no one wants to maintain them since they're complex software. i guess this topic should be discussed in email14:23
fungianyway, i need to step away for a few. i should learn not to start checking work e-mail and irc when i first wake up... it leads to me working half the morning from my bedroom and skipping breakfast as a result14:23
sdaguefungi: so I think that, given the windows of time where there aren't infra folks online, using er for real has value. Because bugs don't get fixed immediately14:23
*** yamahata has quit IRC14:23
sdague:)14:23
sdagueyeh, sorry about that14:23
*** yamahata has joined #openstack-infra14:23
dimsfungi, :)14:23
chmouelEmilienM: ping?14:24
EmilienMchmouel: pong14:25
EmilienMchmouel: here is good too, i use to talk about devstack on #openstack-qa though :-)14:25
*** dstanek has joined #openstack-infra14:25
fungiruhe: i would argue that makes them immature software projects, and we should seek to help them improve that situation so that we *can* use them rather than just accepting that situation14:25
* fungi will bbiab14:25
sdaguechmouel: yeh, lets take the grizzly devstack over to -qa14:26
EmilienMchmouel: i was wondering the cinder issue in devstack/havana and it's WIP by you and sdague, right?14:26
*** eharney has joined #openstack-infra14:28
*** ryanpetrello has joined #openstack-infra14:29
openstackgerritNikita Konovalov proposed a change to openstack-infra/storyboard: Introducing basic REST API  https://review.openstack.org/6311814:30
*** herndon_ has quit IRC14:31
*** nprivalova has joined #openstack-infra14:33
*** sandywalsh has joined #openstack-infra14:33
openstackgerritSean Dague proposed a change to openstack-infra/elastic-recheck: add uncategorized failure generation code  https://review.openstack.org/6726714:35
*** mrmartin has joined #openstack-infra14:36
*** pblaho has quit IRC14:36
*** mrodden has quit IRC14:38
*** dcramer_ has joined #openstack-infra14:39
*** dansmith is now known as damnsmith14:40
openstackgerritA change was merged to openstack-infra/reviewstats: Add --csv-rows option  https://review.openstack.org/6011514:42
openstackgerritA change was merged to openstack-infra/elastic-recheck: add uncategorized failure generation code  https://review.openstack.org/6726714:42
*** SergeyLukjanov is now known as SergeyLukjanov_a14:43
*** SergeyLukjanov_a is now known as SergeyLukjanov_14:44
openstackgerritMax Lobur proposed a change to openstack/requirements: Add futures library to global requirements  https://review.openstack.org/6634914:45
*** dizquierdo has joined #openstack-infra14:45
openstackgerritMax Lobur proposed a change to openstack/requirements: Add futures library to global requirements  https://review.openstack.org/6634914:47
*** thuc has joined #openstack-infra14:49
*** thuc_ has joined #openstack-infra14:49
jog0was a bug filed for 'No distributions at all found for oslo.messaging>=1.2.0a11' ?14:50
jog0example: http://logs.openstack.org/82/64682/1/gate/gate-glance-pep8/f1dce31/console.html.gz14:50
*** beagles is now known as beagles_brb14:50
*** mrodden has joined #openstack-infra14:51
openstackgerritTom Fifield proposed a change to openstack-infra/config: Add build job for Japanese Install Guide  https://review.openstack.org/6748114:51
jgriffithEmilienM: Cinder issue in devstack/havana?14:51
*** fifieldt has joined #openstack-infra14:51
EmilienMjgriffith: yeah, the stuff you were talking about yesterday14:53
*** coolsvap has joined #openstack-infra14:53
*** thuc has quit IRC14:53
*** annegent_ has quit IRC14:53
jgriffithEmilienM: oh, but interesting that it's only affecting Cinder now, which leads me to believe there's been a patch for other projects to address this?14:53
*** emagana_ has joined #openstack-infra14:54
*** senk has joined #openstack-infra14:55
*** russellb is now known as rustlebee14:55
*** mrmartin has quit IRC14:55
*** rakhmerov has joined #openstack-infra14:56
*** jog0 is now known as flashgordon14:56
*** oubiwann_ has joined #openstack-infra14:56
*** emagana has quit IRC14:56
*** marun has joined #openstack-infra14:57
*** SergeyLukjanov_ is now known as SergeyLukjanov14:57
*** talluri has quit IRC14:57
flashgordonlooks like this is the closest bug 126125314:59
uvirtbotLaunchpad bug 1261253 in tripleo "oslo.messaging 1.2.0a11 is outdated and problematic to install" [High,Triaged] https://launchpad.net/bugs/126125314:59
*** dims is now known as dimsum15:00
*** burt1 has joined #openstack-infra15:01
*** Ajaeger has joined #openstack-infra15:01
*** pblaho has joined #openstack-infra15:02
fungiaww, we lost jog0 now15:03
fungioh, wait, flashgordon15:03
fungiflashgordon: the No distributions at all found for oslo.messaging>=1.2.0a11 is an interesting one15:03
fungiflashgordon: that looks like pip 1.5 ignoring the -f15:04
fungii wish we had a pip --version and/or pip freeze at the end of that job15:05
*** talluri has joined #openstack-infra15:05
*** esker has quit IRC15:06
*** esker has joined #openstack-infra15:06
*** esker has quit IRC15:06
*** nicedice has joined #openstack-infra15:07
openstackgerritJoe Gordon proposed a change to openstack-infra/config: Don't run non-voting gate-grenade-dsvm-neutron  https://review.openstack.org/6748515:08
flashgordonfungi: casual nick friday  in nova land15:08
flashgordonsdague: ^15:08
*** thedodd has joined #openstack-infra15:09
flashgordonfungi: logstash query   message:"No distributions at all found for oslo.messaging>=1.2.0a11"   AND filename:"console.html"15:09
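A matching fingerprint for this one, pointed at bug 1261253 (the closest existing bug found above), would follow the same shape; hypothetical sketch, since the bug may yet be split or reassigned:

    # queries/1261253.yaml (hypothetical)
    query: >
      message:"No distributions at all found for oslo.messaging>=1.2.0a11" AND
      filename:"console.html"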
flaper87fungi: another case where it fails in the gate and not locally: https://review.openstack.org/#/c/65499/ :( Do you think I can get access to one box?15:09
fungiflashgordon: i got it, just slow. i've lost track of which days are which any more15:09
flaper87FWIW, I'm setting up an ubuntu saucy to test it too15:09
flashgordonfungi: heh I am amazed you're still alive after this week15:09
*** nati_uen_ has joined #openstack-infra15:12
*** jergerber has joined #openstack-infra15:12
fungiflaper87: which one? the py26 and py27 unit tests fail in entirely different ways (though also, no, can't really grant you access to the long-running 26 slave for infra policy reasons unless i completely tear down and replace it, and the 27 slave is a single-use node which was automatically deleted after it ran)15:12
flaper87fungi: py27 would've been enough.15:13
flaper87fungi: I'll set it up in my vm and see if I can replicate it15:14
*** nati_uen_ has quit IRC15:14
dstufftfungi: adding various --version invocations to things you're using is the best thing I learned from travis-ci tbh15:14
dstufftit makes debugging things massively better15:14
*** nati_uen_ has joined #openstack-infra15:15
*** nati_ueno has quit IRC15:15
openstackgerritSergey Kraynev proposed a change to openstack/requirements: Update python-neutronclient version  https://review.openstack.org/6748715:16
fungidstufft: yep, we do that in a lot of places15:17
fungijust not ever enough places ;)15:17
openstackgerritRuslan Kamaldinov proposed a change to openstack-infra/config: Extracted ci docs jobs to a template  https://review.openstack.org/6748915:19
*** emagana_ has quit IRC15:21
openstackgerritSergey Kraynev proposed a change to openstack/requirements: Update python-neutronclient version to 2.3.3  https://review.openstack.org/6748715:22
*** emagana has joined #openstack-infra15:22
*** HenryG has joined #openstack-infra15:22
flashgordonfungi: what file do I touch to add branch name to logstash?15:23
flashgordonre: master or stable/havana15:23
*** annegent_ has joined #openstack-infra15:24
*** bookwar has left #openstack-infra15:24
openstackgerritSergey Kraynev proposed a change to openstack/requirements: Update python-neutronclient version to 2.3.2  https://review.openstack.org/6749115:24
*** CaptTofu has joined #openstack-infra15:24
*** rnirmal has joined #openstack-infra15:26
fungiflashgordon: do we not already index the zuul parameters for jobs in logstash?15:27
flashgordonwe have the build_ref15:27
*** gokrokve has joined #openstack-infra15:28
*** IvanBerezovskiy has left #openstack-infra15:28
*** annegent_ has quit IRC15:28
flashgordonwould it be zuul_branch?15:29
fungiflashgordon: i think that's probably what you want. remember in the context of our various integration tests there are multiple branches in play15:30
flashgordonohh nice zuul has docs15:30
fungiand yes, zuul has very nice docs15:31
*** carl_baldwin has quit IRC15:31
flashgordon'  The target branch for the change that triggered this build15:31
flashgordonfungi: if there is no zuul_change is there zuul_branch?15:32
*** carl_baldwin has joined #openstack-infra15:32
clarkbanteaya: salv-orlando: a -2 doesn't kick the change out of the gate. has a new patchset been pushed to it to kick it out of the gate?15:33
*** _NikitaKonovalov has joined #openstack-infra15:33
clarkbanteaya: salv-orlando: at this point it probably doesn't matter much as fungi and I are going to fork lift zuul and can simply not reverify that change15:33
fungiflashgordon: i believe there is always a zuul_branch, yes (periodic bitrot jobs for example have no zuul_change but would still have a zuul_branch)15:34
*** mancdaz is now known as mancdaz_away15:34
flashgordonfungi: thanks15:34
*** kmartin has quit IRC15:34
fungiflashgordon: i'm going to double-check that though15:34
*** NikitaKonovalov has quit IRC15:34
*** _NikitaKonovalov is now known as NikitaKonovalov15:34
fungibecause now that i say it, i start to doubt myself15:34
flashgordonheh thanks15:34
*** mancdaz_away is now known as mancdaz15:35
*** kmartin has joined #openstack-infra15:35
*** talluri has quit IRC15:35
fungiand that reminds me, some of the periodic jobs are still broken... need to track down where /opt/stack/new/devstack-gate/devstack-vm-gate.sh went: http://logs.openstack.org/periodic-qa/periodic-tempest-dsvm-all-havana/037442e/console.html15:36
*** marun has quit IRC15:36
*** marun has joined #openstack-infra15:36
*** jgrimm has joined #openstack-infra15:37
*** annegent_ has joined #openstack-infra15:38
mordredmorning fungi15:38
mordredmorning flashgordon clarkb15:38
fungimorning mordred15:39
openstackgerritJoe Gordon proposed a change to openstack-infra/config: Record build_branch in logstash  https://review.openstack.org/6749815:39
*** wenlock has joined #openstack-infra15:39
flashgordonfungi: ^15:39
clarkbmorning15:39
flashgordonsdague: ^15:39
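Assuming 67498 lands, branch-scoped digging becomes a one-line logstash query; build_branch is the field name implied by the change's title, so treat it as an assumption until the change merges:

    build_status:"FAILURE" AND build_queue:"gate" AND build_branch:"stable/havana"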
*** emagana has quit IRC15:39
*** emagana has joined #openstack-infra15:40
clarkbfungi: I am mostly booted at this point and ready to do the zuul dance if you still think we should do that15:40
*** rcleere has joined #openstack-infra15:40
*** davidhadas has joined #openstack-infra15:41
fungidimsum: to your earlier question about identifying which jenkins master a job ran on, you can actually mine that out of the console log (though having it as a parameter would definitely be nice). the "Building remotely on" line hyperlinks to the appropriate jenkins master's webui15:41
fungiclarkb: sure thing15:41
clarkbzuul just merged a bunch of changes by the way. I think the d-g tempest concurrency change did have a drastic effect15:41
*** herndon has joined #openstack-infra15:41
dimsumfungi, y, just can't build a query that has the name of the jenkins host and snippet from hudson stack trace15:42
clarkbhttps://jenkins02.openstack.org/job/gate-tempest-dsvm-full/6416/console seems to be a relatively common failure causing resets (but I haven't even looked at e-r just noticed that 404 is common to several test failures last night and this morning)15:42
*** esker has joined #openstack-infra15:43
*** NikitaKonovalov is now known as NikitaKonovalov_15:44
*** bnemec is now known as beekneemech15:45
openstackgerritTom Fifield proposed a change to openstack-infra/config: Add build job for Japanese Install Guide  https://review.openstack.org/6748115:45
fungidimsum: in the meantime, the next rogue persistent slave i get failing jobs with that stack trace from jenkins02, i'll get the exact text including line numbers15:45
dimsumfungi, cool15:45
openstackgerritSergey Kraynev proposed a change to openstack/requirements: Update python-neutronclient version to 2.3.3  https://review.openstack.org/6749115:47
fungiclarkb: so what's the zuul swap operation here? we snapshot the pipelines, kill zuul ungracefully, update a/aaaa records, copy over the queue dump to new-new-new-zuul, start the zuul service there, wait as necessary for the dns propagation, make sure jenkins masters are connecting to it, load the queue dumps and we're off to the races?15:48
*** carl_baldwin has quit IRC15:48
*** carl_baldwin has joined #openstack-infra15:48
clarkbbasically15:48
clarkbwe also need to check nodepool has connected to new zuul15:49
fungiright, jenkins masters *and* nodepool15:49
fungigood reminder15:49
openstackgerritBen Nemec proposed a change to openstack-dev/hacking: Enforce import group ordering  https://review.openstack.org/5440315:50
clarkb2 more changes can merge and there are a few check tests that can be reported but I am less concerned about the check tests15:51
*** JohanH has quit IRC15:51
clarkbbut I think right around now is a decent time to do it as 2 changes will be merging and the gate is reseting otherwise15:52
*** senk has quit IRC15:52
*** adrian_otto has joined #openstack-infra15:52
fungioh, though the rather large queue lengths mean maybe we should gracefully stop it and wait for it to finish processing those?15:52
clarkbfungi: that requires it fully processing everything in those queues which could take days15:53
clarkb>_>15:53
fungiwe won't be able to copy over the event and result queues, right?15:53
clarkbfungi: right15:53
fungiit was down to 0/0 earlier15:53
clarkbI suppose we can wait to see if those numbers fall shortly15:53
fungibut it's started picking up now15:53
clarkbit picked up during the last gate reset where the zuul main loop does nothing15:54
fungiwe caught a nova fail a couple changes from the head of the gate an hour or two ago and the delay that caused allowed the events/results to pile up15:54
fungiyeah15:54
clarkbnormally that loop has a few iterations per second. during a gate reset it is one iteration every 15 or so minutes15:55
clarkbanother thing that occurred to me with back of napkin maths is that we only have enough slaves to run tests for ~64 changes concurrently15:55
salv-orlandoclarkb: I did first put a new patch set and then -2 it to ensure people did not approve it15:55
clarkbsalv-orlando: awesome, I missed that thanks15:55
clarkbso we are battling the resets but also having only about 1/3 of the test resources we need to get out of the hole15:56
*** jcoufal has quit IRC15:56
*** adrian_otto has quit IRC15:56
salv-orlandoBut we've probably found out that all those unit test failures are related to an oslo change that went in yesterday15:56
clarkbfungi: results queue is falling, under 100 now. I say we wait a handful of minutes to see if the events queue falls too15:56
*** pblaho has quit IRC15:56
fungik15:57
mordredfungi: from an hour ago, I would argue that it might also mean that distros haven't adapted to how some newer software operates and are trying to perpetuate a model that is more beneficial to their own processes than it is to solving today's problems15:57
fungiclarkb: also i think 67186,1 and 67187,1 there are probably contributing to gate churn15:57
fungiclarkb: since they're both stable branch changes15:58
clarkbfungi: they would be then, we should omit them from the zuul reenqueue15:58
clarkbfungi: oh other thing to do after we stop zuul, is to manually stop jobs in jenkinses so that nodepool can create new nodes15:58
*** annegent_ has quit IRC15:58
clarkbfungi: do you want to grab queue state, stop zuul, and update DNS while I kill jobs in jenkinses as quickly as I can?15:59
fungimordred: entirely possible, but in that case we need some serious reevaluation of our security support model15:59
mordredfungi: I think we might need some serious reevaluation of our security support model16:00
clarkbfungi: mordred: I am not seeing the context to security and distros16:00
clarkbhave a timestamp?16:00
mordredbecause I'm not sure that the distro approach which may involve staying on an old version of a piece of software that the otherwise very active upstream has stopped caring about is the right thing to do16:00
fungiclarkb: 14:13 utc16:00
*** dpyzhov has quit IRC16:01
fungiclarkb: our previous decisions not to install software from random third-party package repositories16:01
fungifor testing official openstack projects16:01
mordredcassandra has consistently not been a thing you really want to include in a distro - but I would not call it immature, even though I personally dislike many of their core devs16:01
clarkbI see thanks16:01
mordredthey produce software intended for continuous deployment - and people who use it use it in those contexts - so manufacturing a 3-year stable release is just silly16:02
openstackgerritTom Fifield proposed a change to openstack-infra/config: Add build job for Japanese Install Guide  https://review.openstack.org/6748116:03
*** yolanda has quit IRC16:03
mordredin fact - new thing from the CEO of redhat ... https://enterprisersproject.com/article/death-2016:03
fungimordred: well, i didn't mean immature in a negative connotation. i meant the reasons free software is usually not packaged at all is one of 1. it's so new not enough people have interest in it yet, 2. it's not interesting in general or, 3. there are design issues with the software which make it too hard to package reliably/consistently16:03
mordredtalks about how even 6-month releases are getting to be too much16:03
*** nati_ueno has joined #openstack-infra16:03
mordredfungi: indeed. I'm mainly saying that I think that one of the design pieces of 3 might not be something you want to fix in some cases16:04
anteayaclarkb: yes, salv-orlando submitted a new change to remove it from the gate, sorry I wasn't clear16:04
mordredsuch as "the delivery model is intended for continual consumption" - which is actually more likely to be able to be dealt with at scale than a periodic release model16:04
anteayasorry salv-orlando already answered you16:05
clarkbmordred: we should all start running arch16:05
fungimordred: i agree that some software stays well-tested enough that you can be reasonably assured of its reliability when drinking from the firehose. but there's also enough out there which still isn't that the linux distributions play a useful role in shielding admins who don't want to discover yet another new software bug every morning when they get to work16:05
*** gokrokve has quit IRC16:05
*** marun has quit IRC16:05
mordredfungi: totally16:05
mordredI think that the distros can and do play a very useful role16:05
*** marun has joined #openstack-infra16:05
*** gokrokve has joined #openstack-infra16:06
*** nati_uen_ has quit IRC16:06
clarkbfungi: event queue isn't falling very quickly. I figure we give it a few more minutes but otherwise I feel like we should take a hatchet to it16:06
mordredI'm just saying that a strict adherence to distro-packaged software may not necessarily be the right choice every time - which is a reversal of my traditional position16:06
mordredfungi: I think that some things have changed in the high-volume/high-scale world and I don't think distro-world has caught up16:07
*** reed has joined #openstack-infra16:08
clarkbmordred: I agree, but I also think that projects need to provide something, e.g. a pip-installable thing from pypi (we fail at this), because putting up a jar file behind http without a sha1, like logstash does, and our tarballs with similar problems, aren't very friendly16:08
fungimordred: well, i do agree, particularly since we're part of that ;)16:08
mordredclarkb: +10016:09
mordredfungi: hehe16:09
mordredI had the idea the other day that someone should upgrade apt-get so that it understood pip and mvn and npm and gem16:09
clarkbfungi: I am going to step away for ~3 minutes then I say we go for it16:09
*** pasquier-s_ has quit IRC16:09
mordredso that you could perhaps do "apt-get install pip:python-novaclient" and it would do the right thing16:10
fungiclarkb: sounds good. i need a quick coffee refill anyway16:10
clarkbnote we should reverify savanna changes first and omit stable/* changes16:10
fungimordred: you mean pip install apt:mysql-client16:11
*** gokrokve has quit IRC16:11
fungi;)16:11
fungiclarkb: which savanna changes?16:11
fungii probably missed them in scrollbackl16:11
*** vipul-away is now known as vipul16:12
openstackgerritDavanum Srinivas (dims) proposed a change to openstack-infra/config: Add jenkins host name to the logstash records  https://review.openstack.org/6750816:12
*** tangestani has joined #openstack-infra16:12
dkranzfungi: Any chance we can move https://review.openstack.org/#/c/63934/ (restoring fail on log errors) up in the queue?16:13
dkranzI really don't want to see this fail because another log error crept in.16:13
fungidkranz: clarkb: let's move 63934 to the top of the gate list before we import it on the replacement zuul16:14
flashgordonclarkb: btw only 5 hits on gate for the 404 issue you found16:14
*** afazekas_ has quit IRC16:14
flashgordonin last 7 days16:14
*** thuc has joined #openstack-infra16:14
*** tangestani has quit IRC16:15
fungiflashgordon: for the console logs which got indexed anyway (scp plugin bug still lurking)16:15
flashgordonfungi: ack, that's implied for everything16:15
flashgordon196 hits with check queue16:15
* fungi nods16:15
* flashgordon files a bug16:15
dimsumAdded a couple of reviews to grab the jenkins host name for logstash (https://review.openstack.org/#/c/67495/ https://review.openstack.org/#/c/67508/ )16:15
* SergeyLukjanov triggered by savanna word used :)16:16
fungidimsum: yep, saw those just now16:16
clarkbfungi: in the gate queue16:16
clarkbfungi: one last thing to check before we dive in, we should make sure that the zuul ref replication is disabled on new zuul and new new zuul16:17
clarkbpretty sure jeblair dealt with that a week ago so all should be well16:17
*** thuc_ has quit IRC16:18
fungiclarkb: right, that was reverted in the zuul source. i'll check the clone on it16:18
clarkbfungi: was it reverted in zuul source or just the config?16:18
*** thuc has quit IRC16:19
fungioh... hrm16:19
fungiright, it was the config16:19
*** lyle is now known as david-lyle16:19
clarkbI am logged into all 5 jenkins masters and ready to kill jobs16:20
clarkbfungi: basically ready when you are16:20
fungii'm looking for the revert16:20
*** dizquierdo has quit IRC16:21
*** anteaya is now known as tired16:21
*** tired is now known as very_tired16:22
openstackgerritSean Dague proposed a change to openstack-infra/elastic-recheck: add bug metadata to graph list  https://review.openstack.org/6751016:22
clarkbfungi: 0c8845494d308e8fedfd6e9890c5ea6cd2f85bdb16:22
clarkbin config16:22
fungiright, why couldn't i find that in the commit log?16:23
fungitrying to do too many things at once16:23
clarkbI did git log -p manifests/site.pp because I remembered it getting piped through there16:23
fungii don't see any reference to the git replication urls in zuul.conf on the new server16:24
fungihold on16:24
fungiokay, sorry. local distraction16:25
fungiso i missed why we need to reverify savanna changes if they're already in the gate16:26
*** esker has quit IRC16:26
clarkbfungi: isn't that how we restore the gate?16:26
*** BobBall is now known as BobBallAway16:26
fungiyeah, but don't we want to restore the whole gate, not just the savanna changes?16:26
*** vipul is now known as vipul-away16:27
fungii'm clearly confused on some point16:27
clarkbfungi: we do, just pointing out we want to reverify them first16:27
clarkbso that their jobs queue up first as they are currently running16:27
fungioh, so they were causing some sort of disruption16:27
SergeyLukjanovcould I ask why savanna changes are so prio now? :)16:27
fungier, fixing some sort of disruption?16:27
clarkbSergeyLukjanov: simply because they managed to run tests for half an hour and we are about to kill them16:27
clarkbSergeyLukjanov: fungi: there is nothing special about those changes beyond their current position in the queue16:28
fungiahh, i see, you mean because they're in a different gate queue, so don't want to make them wait on available nodes16:28
clarkbexactly16:28
*** Ajaeger has quit IRC16:29
SergeyLukjanovoh, see it too ;)16:29
SergeyLukjanovthanks16:29
*** gyee_nothere has quit IRC16:29
fungiclarkb: and also prioritize 63934,3 so that we reduce the risk of more errors getting introduced before that merges16:29
clarkbyup16:30
*** Ajaeger has joined #openstack-infra16:30
clarkbI am actually less worried about the stable/* jobs, I can push new patchsets to them in order to make an impression on the change approvers :)16:30
fungigetting logged into rackspace and jenkins masters now16:30
*** gyee has joined #openstack-infra16:30
clarkbfungi: s/rackspace/nodepool/ ?16:31
*** marun has quit IRC16:31
fungilet's at least leave out 67186,1 and 67187,1 since we know about them and they're already relatively high up in the gate16:31
clarkbfungi: k16:31
fungirackspace to make dns changes16:31
*** mrodden has quit IRC16:31
clarkboh that16:31
*** vipul-away is now known as vipul16:31
fungitrying to reduce the zuul outage window as much as possible so we miss fewer patchset and approve events16:32
clarkb++16:32
openstackgerritAndreas Jaeger proposed a change to openstack-infra/config: Add build job for Japanese Install Guide  https://review.openstack.org/6748116:32
*** marun has joined #openstack-infra16:32
*** krotscheck has joined #openstack-infra16:32
clarkbmordred: any chance you can statusbot us?16:33
*** MarkAtwood has joined #openstack-infra16:34
fungithe rackspace dns interface needs a filter16:34
clarkboh ya, otherwise it is tons of scrolling16:35
*** mrodden has joined #openstack-infra16:35
fungiokay, logged into the jenkins webuis, rackspace dashboard at the dns entries, cli on nodepool and both old and new zuul16:36
*** nati_uen_ has joined #openstack-infra16:36
fungiare you doing the zuul pipeline dump/restore, clarkb?16:36
clarkbfungi: I thought you were :P I was going to kill jenkins jobs16:36
fungiahh, okay16:36
fungigimme a sec to referesh my memory on how that works16:37
clarkbnp, I believe the script is in zuuls tools dir16:37
openstackgerritJoe Gordon proposed a change to openstack-infra/config: Record short_build_uuid in logstash/ElasticSearch  https://review.openstack.org/6751616:37
*** nati_uen_ has quit IRC16:37
*** markwash has quit IRC16:37
fungithere was also a ~root/zuul-changes2.py left over from the last round16:38
clarkbflashgordon: re ^ I am pretty sure you can match on the short uuid16:38
*** nati_uen_ has joined #openstack-infra16:38
flashgordonclarkb: sample query?16:38
clarkbflashgordon: just search for build_uuid:someshortuuid16:38
clarkbnotice the lack of quotes16:38
*** markwash has joined #openstack-infra16:39
mferfungi is there a place I can "subscribe" to get an update on the 'openstack in an sdk' name? i don't want to bug you but I'm so darn curious.16:39
clarkbfungi: oh right, you want that one as it uses the zuul rpc cli16:39
*** nati_ueno has quit IRC16:39
clarkbfungi: but the old one will work using the reverifies too (if you give reverify a bug)16:40
*** Ajaeger has quit IRC16:40
flashgordonbuild_uuid:2123b9a16:40
flashgordonvs: build_uuid:2123b9a6a1464d41864e8436d5bf439716:41
flashgordonshort has no hits16:41
flashgordonclarkb: ^16:41
clarkbflashgordon: sorry you need build_uuid:2123b9a*16:41
*** SergeyLukjanov is now known as SergeyLukjanov_16:42
flashgordonclarkb: sweet!16:42
flashgordonthanks16:42
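A rough sketch of how that short-uuid lookup works when aimed directly at the logstash Elasticsearch backend instead of the search box; the endpoint and index pattern are assumptions, but the query_string body uses the same Lucene syntax, with the trailing * doing the prefix match:

    import requests

    # Assumed endpoint/index pattern; point this at the real logstash cluster.
    ES_URL = "http://logstash.example.org:9200/logstash-*/_search"

    # Same Lucene query_string syntax as the search box: the trailing *
    # lets the short uuid match the full stored build_uuid values, which
    # is why the bare short form returned no hits.
    query = {"query": {"query_string": {"query": "build_uuid:2123b9a*"}},
             "size": 5}

    resp = requests.post(ES_URL, json=query)
    resp.raise_for_status()
    print(resp.json()["hits"]["total"])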
*** adrian_otto has joined #openstack-infra16:43
flashgordonclarkb: here is another one https://review.openstack.org/#/c/67498/16:43
fungiclarkb: okay, so it's... for pipeline in check gate post ; do python zuul-changes2.py http://zuul.openstack.org $pipeline > $pipeline.sh ; done16:43
*** markmcclain has quit IRC16:44
*** markmcclain has joined #openstack-infra16:44
*** gothicmindfood has joined #openstack-infra16:44
clarkbfungi: k16:44
*** senk has joined #openstack-infra16:44
*** AaronGr_Zzz is now known as AaronGr16:45
*** adrian_otto has left #openstack-infra16:45
fungioh, it won't dump post16:45
fungibecause those aren't changes16:45
clarkboh right, I think we can get away with that here16:45
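For context, the dump side is little more than a status.json scrape that prints re-enqueue commands; a minimal sketch of the idea (the status.json field names and the zuul enqueue flags are assumptions to check against the real tools/zuul-changes2.py, and refs in post have no change ids, which is why that pipeline can't be dumped this way):

    import sys
    import requests

    # Usage sketch: python dump.py http://zuul.openstack.org gate > gate.sh
    url, pipeline = sys.argv[1], sys.argv[2]
    status = requests.get(url + "/status.json").json()

    for p in status["pipelines"]:
        if p["name"] != pipeline:
            continue
        for queue in p["change_queues"]:
            for head in queue["heads"]:
                for change in head:
                    # Emit one zuul RPC client invocation per queued change.
                    print("zuul enqueue --trigger gerrit "
                          "--pipeline %s --project %s --change %s" % (
                              pipeline, change["project"], change["id"]))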
fungilooking through a sample real quick so i can confirm the reordering/filtering we want to do on the gate16:46
clarkbflashgordon: that change is technically fine. question about why it is necessary though. A bug fingerprint should indicate a bug regardless of branch, and a false positive due to branch should itself be a bug, correct?16:47
*** davidhadas_ has joined #openstack-infra16:47
flashgordontwo fold16:47
flashgordonone is its easier when digging through logstash16:47
*** davidhadas has quit IRC16:47
flashgordonand two, if we *know* a bug is stable only we can prevent false positives16:48
*** senk has quit IRC16:49
clarkbpreventing false positives that way masks other bugs though16:49
*** markwash has quit IRC16:49
*** DennyZhang has joined #openstack-infra16:49
*** markmcclain has quit IRC16:50
clarkbfungi: how does the sample work? I suppose you can just comment out the lines for changes we want to ignore16:50
fungiclarkb: yep, i'm getting the reordering into the final command line too though16:50
clarkboh right for savanna :)16:50
flashgordonclarkb: it won't mask bugs it will leave them as unclassified16:50
clarkband dkranz's change16:50
fungiand the error filtering fix16:50
*** senk has joined #openstack-infra16:51
clarkbflashgordon: if you didn't filter on branch it would match all branches16:51
*** cp16net is now known as goofy-nick-frida16:51
flashgordoneither way, having the data makes understanding logstash data easier.16:51
flashgordonbefore writing the fingerprint16:51
*** goofy-nick-frida is now known as goofy-nic-friday16:51
fungiokay, have it the way i want it. making sure i can copy it quickly now16:52
clarkbflashgordon: gotcha16:54
openstackgerritDavanum Srinivas (dims) proposed a change to openstack-infra/config: Add jenkins master name to the logstash records  https://review.openstack.org/6750816:54
*** NayanaD has joined #openstack-infra16:55
*** NayanaD is now known as San_D16:55
fungiall set. so dumping the check/gate pipelines and immediately stopping zuul16:56
fungiready?16:56
clarkbI am ready16:56
*** sgrasley has joined #openstack-infra16:56
fungidone16:57
fungiupdating dns now16:57
openstackgerritMichael Krotscheck proposed a change to openstack-infra/config: Added artifact upload of storyboard.  https://review.openstack.org/6752016:57
clarkbok killing jenkins jobs now16:57
*** coolsvap has quit IRC16:57
*** tma996 has quit IRC16:58
openstackgerritMichael Krotscheck proposed a change to openstack-infra/config: Added artifact upload of storyboard.  https://review.openstack.org/6752016:58
krotscheckMy bad, sorry16:58
krotscheckThat one's good16:58
fungiclarkb: whups. the aaaa record had a one-hour ttl on it16:58
fungii should have double-checked that last night16:59
clarkbfungi: that'll teach me16:59
*** coolsvap has joined #openstack-infra16:59
clarkb:/16:59
zaromorning17:00
*** mancdaz is now known as mancdaz_away17:00
*** vkozhukalov has quit IRC17:00
fungiclarkb: should nodepool get restarted to connect to the new zuul?17:01
clarkbfungi: I believe the gear lib should do automatic reconnection17:01
fungiand is it safe to start new zuul and reenqueue changes now even though the jenkins masters aren't connected to it yet?17:02
*** davidhadas_ has quit IRC17:02
clarkbjenkins masters cannot connect to it until it has started, the geard lib is embedded17:03
zaroclarkb: you in today?17:03
clarkbI think you need to wait for at least one master to advertise its job list before reenqueuing17:03
clarkbzaro: after the zuul stuff is done I had planned on trying to make it in17:03
zaroclarkb: office i mean17:03
clarkbyes17:03
fungiclarkb: more to the point, i meant is it okay to reenqueue changes before the jenkins masters are connecting to the new zuul. i assume so17:04
*** yaguang has quit IRC17:04
clarkbfungi: I don't think so17:04
*** markwash has joined #openstack-infra17:04
clarkbfungi: zuul may report those jobs as lost since gearman won't know how to run those jobs17:04
fungiahh, right, jobs won't be registered17:05
clarkbso I think we start new new zuul, then get at least one master to connect to it, then reenqueue17:05
zaroclarkb: Azher asked for a meeting to help him get setup with zuul and jjb today.  didn't know if you were interested to be on the call.17:05
zaroclarkb: meeting will be at 11am pst17:06
clarkbzaro: we'll see...17:06
*** hashar has quit IRC17:07
clarkbfungi: ok, jenkins masters have had their jobs killed17:07
fungijenkins01 seems to have established sockets to the gearman port on new zuul's ipv4 address. that's a good sign17:09
clarkbfungi: nodepool is connected to 162.242.150.96:473017:09
clarkbwhich I think is new zuul17:09
fungiyep, checking the other masters still, but good so far17:09
*** sarob has joined #openstack-infra17:10
fungijenkins.o.o has no gearman connections according to netstat17:10
fungithe other masters are connected to new zuul though17:10
clarkbcool /me looks at jenkins.o.o17:10
*** hashar has joined #openstack-infra17:11
clarkbfungi: I am going to try disabling then enabling a job on that host as that kicks the gearman plugin17:11
fungik17:12
*** fifieldt has quit IRC17:12
clarkbthat hasn't appeared to help17:13
*** sarob_ has joined #openstack-infra17:13
clarkbI lied I think it worked17:13
fungiwe could just restart jenkins service entirely17:13
clarkboh it is talking to old gearman17:13
clarkbyeah lets do that17:13
funginetstat -nt|grep 4730 shows nothing on jenkins.o.o17:13
*** obondarev_ has joined #openstack-infra17:13
fungistopping it now17:14
clarkbfungi: jenkins log shows it trying to talk to the .88 address17:14
fungistarting17:14
fungiright, i suspected that was why there were no established sockets17:14
ttxWhy oh WHY is Gerrit asking me to rebase17:14
fungithere it goes17:14
*** sarob has quit IRC17:14
ttxhttps://review.openstack.org/#/c/67422/17:15
ttxand I'm rebasing and it doesn't really help17:15
notmynamegate status graphs for common gate jobs + several projects http://not.mn/all_gate_status.html17:15
fungii see 8 connections to the right gearman server now17:15
clarkbttx: hold on17:15
fungiclarkb: ready for me to reenqueue all the things then?17:15
clarkbfungi: I think we should try reenqueing one thing first17:15
* ttx holds (and drinks more)17:15
clarkbfungi: see ttx's question17:15
*** thuc has joined #openstack-infra17:15
fungiclarkb: will do17:16
clarkbfungi: because something seems off but that may just be that he got zuul when it had no workers17:16
*** thuc_ has joined #openstack-infra17:16
fungienqueued 63934,3 into the gate17:17
*** jooools has quit IRC17:17
fungiclarkb: zuul hasn't cloned any repos in /var/lib/zuul/git yet17:18
clarkbfungi: it should do that automagically17:18
fungigit clone -v ssh://jenkins@review.openstack.org:29418/openstack/neutron /var/lib/zuul/git/openstack/neutron' returned exit status 128: Host key verification failed.17:18
clarkboh that :)17:18
fungii guess it's not puppeted?17:18
clarkbapparently not17:18
fungiwhat file(s) do i need?17:18
*** rakhmerov has quit IRC17:19
fungii'll grab them from old zuul17:19
fungiahh, right, i can just accept the host key17:19
clarkbfungi: it would be for the zuul user's known_hosts file17:19
fungiadded17:20
fungishould i restart zuul>?17:20
clarkbno, just try reenqueing that one change17:20
clarkbzuul will do clones on the fly if necessary17:20
*** thuc has quit IRC17:20
fungiworked17:21
*** sarob_ has quit IRC17:21
*** beagles_brb is now known as beagles17:21
clarkbbut still failed to merge17:21
*** kiall has quit IRC17:21
fungithough i seem to have the old ipv6 address lodged deep within my browser17:21
*** sarob has joined #openstack-infra17:21
fungiit did check out the project for that change though17:21
clarkbyup17:22
fungiUnboundLocalError: local variable 'repo' referenced before assignment17:22
fungizuul bug?17:22
clarkbyup must be17:23
fungii assume restarting zuul daemon is the best course of action for now?17:23
*** kiall has joined #openstack-infra17:24
clarkbya why don't we do that17:24
*** jp_at_hp has quit IRC17:24
clarkbfungi: oh wait17:24
*** senk has quit IRC17:25
clarkbgit doesn't know who we are17:25
*** yolanda has joined #openstack-infra17:25
clarkbthat should really be puppeted. on old zuul the zuul user's gitconfig was set to set the name and email17:25
clarkbwe should do that by hand on new new zuul17:25
fungifixing17:25
*** sarob_ has joined #openstack-infra17:25
clarkbthen document it needs puppeting17:25
*** gokrokve has joined #openstack-infra17:26
openstackgerritRuslan Kamaldinov proposed a change to openstack-infra/config: Extracted ci docs jobs to a template  https://review.openstack.org/6748917:26
clarkblooks like new zuul has a ~zuul/.gitconfig as well17:26
fungiit does now17:26
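Until those two fixes land in puppet, they amount to a small by-hand bootstrap on the new host; a hedged sketch, meant to run as the zuul user, with the gerrit host/port taken from the clone error above and the git identity values as placeholder assumptions:

    import os
    import subprocess

    GERRIT_HOST, GERRIT_PORT = "review.openstack.org", "29418"
    known_hosts = os.path.expanduser("~/.ssh/known_hosts")

    # Pre-seed known_hosts so the merger's ssh clones stop failing
    # host key verification.
    keys = subprocess.check_output(
        ["ssh-keyscan", "-p", GERRIT_PORT, GERRIT_HOST])
    with open(known_hosts, "ab") as fh:
        fh.write(keys)

    # Give git an identity so zuul can create its test merge commits;
    # the name/email values here are placeholders.
    for key, value in (("user.name", "OpenStack Jenkins"),
                       ("user.email", "jenkins@openstack.org")):
        subprocess.check_call(["git", "config", "--global", key, value])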
clarkbok now try reenqueing 6393417:26
*** pballand has joined #openstack-infra17:27
*** sarob__ has joined #openstack-infra17:27
mordredclarkb: ++ to puppeting17:28
*** sarob has quit IRC17:28
clarkbfungi: zuul is cloning all the things17:28
clarkbwhich is something to note about using a tmpfs: if we don't prepopulate it, zuul startup will be a bit slower than before17:29
*** markwash has quit IRC17:29
fungiyeah, i expected that17:30
fungibut that's just on reboot of the server17:30
clarkbyup17:30
* mordred is excited about our new tmpfs overlord17:30
*** sarob_ has quit IRC17:30
fungihmmm, my bsd firewall here is segfaulting my shell17:30
clarkbok stuff is queueing17:30
openstackgerritJoe Gordon proposed a change to openstack-infra/elastic-recheck: Add noirc option to bot  https://review.openstack.org/6752517:31
fungii might not be long for this internet if my dhclient segfaults too17:31
fungiready for me to enqueue everything else then?17:31
sdaguecool, check queue going to refilll automatically?17:31
clarkbfungi: :( where did you stash your preserved queues?17:31
clarkbsdague: yup, fungi grabbed check and gate queue state we just need to apply them now17:31
fungimy homedir, though the first entry in the gate.sh is redundant now. fixing17:31
sdagueclarkb: do you have a list of promote bits from markmcclain17:31
ttxfungi: any idea why i'm asked to rebase stuff  ?17:32
clarkbsdague: no I do not17:32
ttxI rebased on HEAD and that doesn't work either17:32
*** nati_ueno has joined #openstack-infra17:32
ttxhttps://review.openstack.org/#/c/67422/17:32
fungiclarkb: ready for me to requeue all the things?17:32
clarkbttx: yes, we just moved zuul to a new host with a tmpfs /var/lib/zuul/git to speed up the zuul git operations. When we did that we discovered that puppet did not configure git for zuul properly17:32
ttxhah.17:32
clarkbfungi: I think so17:32
clarkbttx: we fixed that by hand and have noted that we need to automate it, you should not be asked to rebase anymore17:33
ttxclarkb: any ETA on fix ? Should I stay online for the next 5 min or come back in two hours ?17:33
clarkbttx: we just fixed it17:33
fungiclarkb: it's running under a screen session for the root user now17:33
fungiin case i disappear17:33
ttxclarkb: hmm, but how do I push the change AGAIN17:33
fungittx: recheck or reverify17:33
fungittx: or reapprove17:34
clarkbttx: you shouldn't need to, the existing patchset should be fine17:34
clarkbfungi: looks like python26 slaves/jobs are having trouble ;(17:34
*** fbo is now known as fbo_away17:34
sdaguehttps://etherpad.openstack.org/p/montreal-code-sprint - under Parallel17:34
fungiclarkb: i still can't see the new status page because of my resolver cache17:34
clarkbfungi: I am going to disable then enable jobs on jenkins01 and 02 to rekick gearman17:34
ttxclarkb: except Jenkins-2ed it already. I reverified it. We'll see how it goes. Thanks!17:34
fungitrying to call rndc flushname was how i discovered my firewall is in trouble17:34
sdaguebut unfortunately markmcclain isn't here at the moment17:34
fungiclarkb: okay17:35
sdagueI guess we'll just wait until he builds a list when he gets back17:35
fungiclarkb: do we think gearman plugin didn't reconnect to zuul properly when we restarted the service?17:35
*** nati_uen_ has quit IRC17:35
clarkbfungi: I think there is a bug in gearman client where it doesn't register all of its jobs17:35
fungiahh17:36
clarkber gearman plugin not client17:36
*** nati_ueno has quit IRC17:36
clarkbfungi: you should edit your /etc/hosts :P to get zuul status17:36
fungiclarkb: i'm going to17:36
*** nati_ueno has joined #openstack-infra17:36
clarkbfungi: https://jenkins01.openstack.org/job/gate-cinder-python27/5053/console17:36
clarkbnot sure why that is happening17:37
fungiclarkb: oh, wait, i'm not resolving zuul incorrectly. the status page just seems to be broken for some reason17:37
fungioh, or maybe i am17:38
*** rakhmerov has joined #openstack-infra17:38
clarkboh I wonder if the test slaves have the ipv6 address cached17:38
clarkbI can fetch the ref that the gate-cinder-python27 job failed to fetch17:38
*** rakhmerov has joined #openstack-infra17:38
fungithere we go. had to clear my browser cache too17:39
*** yassine has quit IRC17:39
*** senk has joined #openstack-infra17:39
*** yassine has joined #openstack-infra17:40
fungiclarkb: hmmm, you mean like maybe a local dnscache daemon on the slaves?17:40
clarkbya17:40
fungithat might be a centos thing, agreed17:40
*** DennyZha` has joined #openstack-infra17:40
clarkbthat is a python27 job17:40
*** DennyZhang has quit IRC17:41
*** tjones has joined #openstack-infra17:41
clarkbI think we are mostly good now, just need to ride out the hiccups17:42
*** ruhe is now known as _ruhe17:42
*** yassine has quit IRC17:42
*** yassine has joined #openstack-infra17:43
*** praneshp has joined #openstack-infra17:43
*** yassine has quit IRC17:43
*** hashar has quit IRC17:43
clarkbthough the enqueue seems to not update zuul status? debug.log shows many jobs starting implying the enqueue is working but status doesn't reflect that for me17:44
fungimaybe those are still in the event queue?17:44
clarkblooks like the Run handler has only woken twice in the last 10 minutes, I think using the rpc to enqueue may do like a gate reset and hold everything up while it does its work17:45
*** yassine has joined #openstack-infra17:45
*** yassine has quit IRC17:45
*** sarob has joined #openstack-infra17:45
*** DennyZha` has quit IRC17:45
*** sarob has quit IRC17:45
*** sarob has joined #openstack-infra17:45
openstackgerritA change was merged to openstack-infra/storyboard-webclient: Added apache license to footer  https://review.openstack.org/6734717:46
*** mattray has joined #openstack-infra17:47
*** yamahata has quit IRC17:48
*** sarob__ has quit IRC17:48
fungiworth noting, the server is basically idle cpu-wise17:48
fungiso this has to be network-related delays, right?17:48
clarkbor the enqueue isn't doing what we expect17:49
fungi2014-01-17 17:49:27,357 INFO zuul.Gerrit: Updating information for 67333,417:50
*** sarob_ has joined #openstack-infra17:50
fungimaybe gerrit's getting firebombed17:50
*** talluri has joined #openstack-infra17:50
mordredload is fine on gerrit17:50
fungiyep17:50
clarkbhttp://paste.openstack.org/show/61460/17:50
clarkbI think gearman function registering is not working so well. I will enable disable on all jenkins masters17:51
fungiokay17:51
*** harlowja_away is now known as harlowja17:51
*** mrodden1 has joined #openstack-infra17:52
clarkbhave done 1 2 3 and 4 doing jenkins.o.o now17:52
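One way that disable/enable nudge could be scripted across the masters, assuming the python-jenkins library and admin credentials; the master list, credentials and job name are placeholders, and the point is only that touching a job config makes the gearman plugin re-register its functions:

    import time
    import jenkins  # python-jenkins

    MASTERS = ["https://jenkins%02d.openstack.org" % i for i in range(1, 5)]
    MASTERS.append("https://jenkins.openstack.org")
    JOB = "gate-noop"  # any job will do; placeholder name

    for url in MASTERS:
        server = jenkins.Jenkins(url, username="admin", password="secret")
        # Toggling a job forces the gearman plugin to re-register all of
        # its functions with the (new) geard.
        server.disable_job(JOB)
        time.sleep(2)
        server.enable_job(JOB)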
fungiclarkb: you also have a thing of some kind in 8 minutes, right? if you need to stop, i can work through the rest of this17:52
*** mrodden has quit IRC17:52
*** sarob has quit IRC17:52
clarkbfungi: well its a meeting thing. I should be able to give you a bit of time17:53
*** rnirmal has quit IRC17:53
fungik17:53
clarkball jenkinses should have reregistered their gearman functions17:53
fungiload on gerrit is spiking, so we did something17:53
clarkbok, going to watch tail -f /var/log/zuul/debug.log | grep ERROR17:54
fungisame thing i'm doing17:54
*** sarob_ has quit IRC17:55
*** sarob has joined #openstack-infra17:55
clarkbwe seem to still be hitting ERROR zuul.Gearman: Exception while checking functions17:56
clarkbfor that same set_description job17:56
clarkbzaro: any idea why that is happening?17:56
clarkbFunction set_description:jenkins01.openstack.org is not registered17:56
zaroclarkb: i think i've got the scp-plugin patch ready.  but i have a few meetings now, so will discuss with you after 1pm.17:57
*** jerryz has joined #openstack-infra17:57
clarkbzaro: sure, can you take a quick look at ^17:57
*** jerryz has quit IRC17:57
*** jerryz has joined #openstack-infra17:57
zaroclarkb: yeah. let me find that in the code during my meeting.17:57
openstackgerritJoe Gordon proposed a change to openstack-infra/elastic-recheck: Add query for bug 1261253  https://review.openstack.org/6753917:58
uvirtbotLaunchpad bug 1261253 in tripleo "oslo.messaging 1.2.0a11 is outdated and problematic to install" [High,Triaged] https://launchpad.net/bugs/126125317:58
*** sarob_ has joined #openstack-infra17:59
*** yolanda has quit IRC17:59
fungiseen a couple timeout errors since... gate-tempest-dsvm-neutron-large-ops and gate-ceilometer-pep817:59
*** sarob has quit IRC18:00
fungier, the jobs were probably unrelated18:01
fungiException while checking functions18:01
*** sarob has joined #openstack-infra18:01
openstackgerritMatthew Treinish proposed a change to openstack-infra/elastic-recheck: Add multi-project irc support to the bot  https://review.openstack.org/6754018:01
clarkbfungi: ya, those exceptions seem to be timeout errors18:01
zaroclarkb: is stop function registered?18:02
fungiin connection.sendAdminRequest18:02
clarkbzaro: fungi: I am not sure if the stop function is registered but /var/log/zuul/gearman-server.log shows errors around getting its status18:03
clarkbzaro: fungi: that looks like a possible geard bug18:03
*** sarob__ has joined #openstack-infra18:04
mordredclarkb, fungi: I've been floating in and out - please ping me if I can be useful to your brains18:04
*** odyssey4me has quit IRC18:04
clarkbfungi: zaro: I think zuul slowness may be due to those timeouts, it is waiting and waiting and well waiting18:05
clarkbshould we possibly try restarting zuul to begin a new geard?18:05
*** sarob_ has quit IRC18:05
openstackgerritMatthew Treinish proposed a change to openstack-infra/elastic-recheck: Add multi-project irc support to the bot  https://review.openstack.org/6754018:05
fungiclarkb: i can do that and reenqueue it all again18:05
fungiclarkb: should we include a brief wait for jenkins masters to reconnect to the gearman service?18:06
*** sarob has quit IRC18:06
clarkbfungi: yes, I think so18:06
fungiokay, killing zuul now18:06
clarkbfungi: well a wait before reenqueing18:06
fungiyeah, that's what i meant18:06
clarkbfungi: the gearman service is a child of the zuul service so you start them both with the zuul init script18:07
fungihow long do you think is sane?18:07
clarkbhalf a minute is probably plenty18:07
fungik18:07
clarkbfungi: you can telnet localhost 4730 and run send status to the socket18:07
clarkbthat should return a giant list of everything ever18:07
fungiright now it returns nothing18:08
clarkbjust 'status' returns nothing?18:08
fungioh you said run send status18:08
clarkbgah my bad18:08
fungiyeah, status returns a ton18:09
clarkbthe command is just 'status'18:09
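A minimal way to script that same check against geard's text admin protocol, useful for confirming build:* functions are actually registered before re-enqueuing anything; host and port match the telnet invocation above:

    import socket

    def gearman_status(host="localhost", port=4730):
        # Send the admin "status" command and read lines until the
        # terminating "." line.
        sock = socket.create_connection((host, port), timeout=10)
        sock.sendall(b"status\n")
        lines = []
        for raw in sock.makefile("rb"):
            line = raw.decode().rstrip("\n")
            if line == ".":
                break
            lines.append(line)
        sock.close()
        return lines

    registered = [l.split("\t")[0] for l in gearman_status()
                  if l.startswith("build:")]
    print("%d build:* functions registered" % len(registered))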
fungithough it picked up a nova change in the check pipeline already and marked a gate-nova-python26 as lost18:09
clarkbfungi: then before reenqueing the world I think we try enqueing one change again. and tail zuul/debug.log and zuul/gearman-server.log18:09
*** galstrom has joined #openstack-infra18:09
clarkbfungi: :(18:09
clarkbfungi: I wonder if that means jenkins* but not jenkins01 and jenkins02 have registered their functions18:10
clarkbas only 01 and 02 can run the python26 jobs18:10
fungiwell, i've reenqueued the devstack-gate change we had at the top before18:11
fungibut it has no py26 jobs18:11
*** sarob__ has quit IRC18:11
fungiERROR zuul.Gearman: Job <gear.Job 0x7fbe68147690 handle: None name: build:gate-trove-python27 unique: 247c5ef1806f4581ac54f8b7cb31e8b3> is not registered with Gearman18:11
clarkbfungi: how does zuul/gearman-server.log look? are there any recent tracebacks for the stop job?18:11
*** sarob has joined #openstack-infra18:11
clarkbwhy is gearman so cranky18:12
*** pballand has quit IRC18:12
fungi2014-01-17 18:06:23 [...] KeyError: 'stop:jenkins01.openstack.org'18:12
clarkbso that is from before the restart correct?18:12
fungichecking18:13
fungi18:06 was the start18:13
*** Ajaeger has joined #openstack-infra18:14
fungiahh, stopped at 18:06:3418:15
zarofungi: is that from jenkins gearman plugin?18:15
AjaegerWhat is a "LOST" failure for a gate? https://review.openstack.org/#/c/67493/18:15
fungistarted at 18:06:4918:15
fungiAjaeger: us18:15
fungiso that keyerror was from before i stopped it18:15
Ajaegerfungi: ok, I'll let you fix it ;)18:15
*** herndon has quit IRC18:16
*** sarob has quit IRC18:16
*** yamahata has joined #openstack-infra18:16
clarkbfungi: in that status listing does jenkins01 or jenkins02 show up at all?18:17
*** hogepodge has joined #openstack-infra18:17
fungiclarkb: i just tried reenqueuing a savanna change and got this in the log...18:18
fungi2014-01-17 18:16:52,521 WARNING zuul.Scheduler: Build <Build 8363618cf6394cf4bfc5e2596c900e09 of gate-savanna-python26> not found by any queue manager18:18
clarkbfungi: ya that is resulting in LOST builds18:18
fungiERROR zuul.DependentPipelineManager: Exception while canceling build <Build 8363618cf6394cf4bfc5e2596c900e09 of gate-savanna-python26> for change <Change 0x7fbe60456410 66554,4>18:18
clarkbit couldn't cancel it because there was no job I bet18:18
fungioh, wait, i need the non-cancel errors18:19
fungithere18:19
fungi2014-01-17 18:16:52,401 ERROR zuul.Gearman: Job <gear.Job 0x7fbe602e1210 handle: None name: build:gate-savanna-python26 unique: 8363618cf6394cf4bfc5e2596c900e09> is not registered with Gearman18:19
clarkbya, that means the jenkins masters never registered that function with the geard daemon18:19
clarkbfungi: perhaps look at jenkins logs on 01 and 02 to see if the gearman plugin is puking?18:20
salv-orlandofungi, clarkb: sorry for the interruption - I assume it's not yet ok to start approving again patches?18:20
clarkbsalv-orlando: ya not quite yet, we have run into unexpected trouble with gearman18:20
fungiand now status on port 4730 returns nothing18:20
salv-orlandoclarkb: will keep lurking waiting for a go-ahead18:20
clarkbfungi: o_O how does the gearman-server.log look?18:21
fungi2014-01-17 18:20:54,214 ERROR gear.BaseClientServer: Exception in poll loop18:21
fungiKeyError: 'stop:jenkins03.openstack.org'18:21
*** salv-orlando has quit IRC18:22
*** marun has quit IRC18:22
*** marun has joined #openstack-infra18:22
fungiquite a few, but all for jenkins0218:22
fungier, for jenkins0318:22
clarkbfungi: out of curiousity how does the version of gear compare on new zuul and new new zuul18:23
fungioh crap, this is what you ran into last time18:23
clarkbya18:23
clarkbso restarting it didn't help18:23
fungipip freeze says gear==0.5.018:23
*** herndon has joined #openstack-infra18:24
fungisame as on old zuul18:24
fungialso, we have newer statsd on new zuul18:25
fungiseparate problem18:25
fungii've downgraded statsd while i'm thinking about it18:26
clarkbnext crazy idea, stop jenkinses, bring up one at a time in a relatively slow manner allowing each to register with gearman without thrash18:26
fungiokay, doing18:26
fungisounds sane enough to me18:26
*** SergeyLukjanov_ is now known as SergeyLukjanov18:27
*** hogepodge has quit IRC18:28
*** aude has joined #openstack-infra18:30
clarkbfungi: and check the gearman plugin versions are consistent across jenkinses, pretty sure jeblair ran into that though and made them consistent18:30
*** max_lobur is now known as max_lobur_afk18:30
*** hogepodge has joined #openstack-infra18:30
*** nati_uen_ has joined #openstack-infra18:31
fungiwill do. also deleting offline slaves, including long-running ones, so they don't get brought back online when jenkins restarts. i'll note them here18:31
*** CaptTofu has quit IRC18:31
*** smurugesan has joined #openstack-infra18:31
*** kgriffs has joined #openstack-infra18:33
*** nati_ueno has quit IRC18:33
*** luqas has quit IRC18:35
*** marun has quit IRC18:35
*** marun has joined #openstack-infra18:35
*** jaypipes has quit IRC18:36
fungicentos6-1, precise{1,11,13,17,19,21,27,29,3,37,39,7,14,16,34,38,4,40,8}18:37
clarkbwow that is a lot of slaves18:37
*** hogepodge has quit IRC18:37
clarkbI spot checked gearman plugin versions and they all look consistent and are 0.0.4.2.ad75b7e18:37
fungiyeah, jenkins masters have been so loaded they're failing out slaves right and left18:37
fungii'm still deleting offline nodepool nodes on 03 and 04, but i'll begin restarting jenkins services one at a time on the other masters18:39
clarkbk18:39
clarkbfungi: if you tail the jenkins.log for the masters as they come up you should see it registering gearman functions. you can use that to get a sense for what is being registered and how long it takes18:39
fungiINFO: ---- Worker pypi.slave.openstack.org_exec-0 registering 184 functions18:43
fungiclarkb: ^ that?18:43
clarkbyeah18:43
clarkbit should happen for all the workers and go on and on. the lists are fairly large, which is why I wonder if geard or the gearman plugin may not keep up18:43
fungiso, status is still returning absolutely nothing from the gear socket on new zuul, fwiw18:43
clarkbreally18:43
fungia few minutes after starting jenkins on jenkins.o.o18:44
fungimaking me wonder if the geard is kaput18:44
clarkbya18:44
clarkboh I bet status fails due to that keyerror18:44
clarkband once that happens geard is kaput18:44
fungiso stop jenkins.o.o again, restart zuul, then start jenkins again?18:45
clarkbsure?18:45
*** lucasagomes has joined #openstack-infra18:46
*** lucasagomes has left #openstack-infra18:46
fungistatus is working now18:47
*** _ruhe is now known as ruhe18:48
*** herndon has quit IRC18:50
clarkbermagerd 67025 is running python26 job18:50
*** smarcet has joined #openstack-infra18:50
clarkbfungi: I wonder, could the reenqueue thing that speaks rpc be breaking zuul/geard because of some bug?18:51
fungiclarkb: maybe. though i gathered that's how stuff was reenqueued on the last zuul too18:52
clarkbfungi: k, probably worth retrying with the reenqueue rpc and if it fails *AGAIN* then fall back on reverify/recheck18:52
*** salv-orlando has joined #openstack-infra18:53
fungiyep, confirmed that all the jenkins masters are restarted and gear status is still responding18:53
*** vkozhukalov has joined #openstack-infra18:53
*** markwash has joined #openstack-infra18:53
clarkbyay!18:54
fungireenqueued the savana change which was bailing on us before18:54
clarkbI think that is a real bug in geard, when the dust settles we should grab relevant logs, and submit a bug18:54
SergeyLukjanovfungi, sorry for our naughty jobs :)18:55
fungireenqueued the devstack-gate change18:55
fungigeard status is still fine18:55
*** markmcclain has joined #openstack-infra18:55
fungiSergeyLukjanov: your jobs were fine. our servers were not18:55
*** markmcclain has quit IRC18:55
*** markmcclain has joined #openstack-infra18:56
fungisomeone snuck a neutron change in, but it's looking fine too18:56
fungiso far everything has workers and no "lost" results18:57
*** thuc_ has quit IRC18:57
clarkbfungi: yup looks good from my end too18:57
fungitrying the mass reenqueue again now18:57
*** thuc has joined #openstack-infra18:57
fungievent queue is spiking, of course18:57
clarkbfungi: pretty sure the registration and starting of jobs is racy in the zuul, geard, gearman-plugin stack and if you catch it just right it causes geard to crash18:57
*** markwash_ has joined #openstack-infra18:58
fungireenqueue scripts have returned18:59
clarkbnice18:59
fungizuul seems to be tearing through the event queue now18:59
clarkbfungi: now, one last quick sanity check. if you grep for 'zuul.Repo' in the debug.log you will get timestamps for all of the git operations18:59
clarkbit used to take 9-15 seconds per change, but tmpfs should make that faster19:00
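A rough way to eyeball that from the log, assuming the timestamp format shown in the snippets above; this just measures the gaps between consecutive zuul.Repo lines as a proxy for how long the git operations take:

    import datetime
    import re

    STAMP = re.compile(
        r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}).*zuul\.Repo")
    FMT = "%Y-%m-%d %H:%M:%S,%f"

    times = []
    with open("/var/log/zuul/debug.log") as log:
        for line in log:
            m = STAMP.match(line)
            if m:
                times.append(datetime.datetime.strptime(m.group(1), FMT))

    # Largest gaps between merger log lines; on the old host these were
    # in the 9-15 second range per change.
    gaps = [(b - a).total_seconds() for a, b in zip(times, times[1:])]
    for gap in sorted(gaps, reverse=True)[:10]:
        print("%.2fs" % gap)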
*** markwash has quit IRC19:00
*** markwash_ is now known as markwash19:00
fungiload on zuul is not at all heavy thus far19:01
clarkbjust looking at the status we are up to 80 something changes in the gate pipeline and it only took a few minutes, much better than the 15-20 it took before19:02
fungigerrit's really not breaking a sweat either19:02
*** thuc has quit IRC19:02
SergeyLukjanovclarkb, have you already proved that the problem is in IO?19:02
*** rfolco has quit IRC19:02
*** azherkhna has joined #openstack-infra19:02
clarkb'checking out master' is now a subsecond operation19:03
clarkbSergeyLukjanov: 'proved'. preliminary results look very very good19:03
fungiSergeyLukjanov: we suspect there was a lot of write delay/contention based on the system profiling stats, but i think we need to watch this go for a while under constant load to be certain it's improved significantly (we'll get that opportunity)19:03
*** galstrom is now known as galstrom_zzz19:04
SergeyLukjanovk, see it19:04
*** marun has quit IRC19:04
*** marun has joined #openstack-infra19:05
*** jaypipes has joined #openstack-infra19:05
*** julim has quit IRC19:05
*** julim has joined #openstack-infra19:06
*** vipul is now known as vipul-away19:09
fungi#status notice zuul.openstack.org underwent maintenance today from 16:50 to 19:00 UTC, so any changes approved during that timeframe should be reapproved so as to be added to the gate. new patchsets uploaded for those two hours should be rechecked (no bug) if test results are desired19:09
*** vipul-away is now known as vipul19:09
fungidid we lose statusbot?19:09
clarkbapparently19:09
fungiyup. fixing19:09
*** openstackstatus has joined #openstack-infra19:11
mordredclarkb: nice!19:12
fungi#status notice zuul.openstack.org underwent maintenance today from 16:50 to 19:00 UTC, so any changes approved during that timeframe should be reapproved so as to be added to the gate. new patchsets uploaded for those two hours should be rechecked (no bug) if test results are desired19:13
openstackstatusNOTICE: zuul.openstack.org underwent maintenance today from 16:50 to 19:00 UTC, so any changes approved during that timeframe should be reapproved so as to be added to the gate. new patchsets uploaded for those two hours should be rechecked (no bug) if test results are desired19:13
fungithe event/result queues are back to trivial levels already, and enormous pipeline lengths are active19:14
fungistatsd is still broken though, even though i downgraded the new zuul's statsd package to be the same as the old one's19:14
clarkbfungi: is statsd erroring?19:14
fungigood question19:15
clarkboh we just got our first gate reset19:15
clarkblets see how long it takes to clear19:15
fungiand then snipe it out, because outdated sample config19:16
fungino, wait, i misread the log. wrong job entirely for that anyway19:16
clarkbreset processed19:16
clarkbin ~1.75 minutes? not bad :)19:17
*** pballand has joined #openstack-infra19:17
zaroclarkb: do you want to review scp-plugin on github?19:17
clarkbzaro: I don't see a new pull request19:18
*** nati_ueno has joined #openstack-infra19:18
*** nati_ueno has quit IRC19:18
*** nati_ueno has joined #openstack-infra19:19
*** CaptTofu has joined #openstack-infra19:19
clarkbzaro: I am going to head into the office around lunch, if you are in today we can go over it there19:20
*** yolanda has joined #openstack-infra19:20
zarook. i'll just wait for you. see you later.19:21
*** sarob has joined #openstack-infra19:22
clarkbfungi: I think I know the statsd problem19:22
clarkbfungi: that is one place where the firewall rules on the remote end may need updating19:22
clarkbfungi: if you start the iptables persistent service it should redig DNS records and update the ruleset19:22
*** nati_uen_ has quit IRC19:23
fungiright, it's updated by dns name!19:23
fungifixing19:23
*** tjones has quit IRC19:23
fungiwow the graphite server is running at a crawl too19:24
*** sarob has quit IRC19:27
*** thuc has joined #openstack-infra19:27
fungiclarkb: good call. stats are updating again19:28
mordredyay stats19:28
*** marun has quit IRC19:28
*** marun has joined #openstack-infra19:29
fungithere's another gate reset19:29
fungiBadRequest: Multiple possible networks found, use a Network ID to be more specific. (HTTP 400)19:30
*** tjones has joined #openstack-infra19:30
fungioh, that's the one which snuck into the gate behind my one test reenqueue19:31
clarkbhahahahaha19:31
openstackgerritA change was merged to openstack-infra/elastic-recheck: add bug metadata to graph list  https://review.openstack.org/6751019:31
*** denis_makogon has joined #openstack-infra19:31
*** tjones has quit IRC19:31
fungilooks like the last two patchsets were uploaded while zuul was offline, and then it was approved with no check results19:31
clarkbwell it has been taken care of now :)19:31
fungiindeed19:31
clarkbfungi: was statsd the last remaining major issue?19:32
*** tjones has joined #openstack-infra19:32
fungiclarkb: my home firewall is my next major remaining issue19:32
fungii worry when a 15-year-old sparc64 server starts randomly segfaulting running processes19:33
clarkbnotes from switchover, should puppet known_hosts file for zuul ssh, should puppet zuul .gitconfig, gearman-plugin + geard + zuul is not happy with registering our jobs and needs handholding currently (believe this is a bug in geard)19:33
*** azherkhna has quit IRC19:33
clarkbfungi: you know you can buy dirt cheap power sipping boxes that work as great routers right?19:33
fungiserver comes up with too-new statsd, need to reload firewall rules on graphite server19:33
fungiclarkb: yes, i know this. i even have the hardware spec'd out and everything but... so little available free time lately19:34
clarkbI am going to afk now and catch up on my morning. If no one beats me to it I will write bugs up for what we learned today19:34
fungiclarkb: sounds good19:34
clarkbalso scp plugin, and lca expense reports19:34
*** vipul is now known as vipul-away19:35
*** vipul-away is now known as vipul19:35
*** vipul is now known as vipul-away19:35
clarkbfungi: when I get back you should just stop working for the rest of the afternoon19:36
clarkbbecause EWHENDOYOUSLEEP?19:36
fungiclarkb: that would be appreciated. i have the gf's folks in town visiting one more night and should at least pretend i enjoy their company19:36
*** mgagne has quit IRC19:37
fungiso will probably be disappearing for dinner again maybe 2300utc-ish19:37
clarkbfungi: yup no worries. ok really afking now so that I am able to cover the afternoon19:37
sdaguefungi: puppet question ...19:37
fungisdague: sure19:37
sdagueso we're going to add another elastic recheck program that runs on cron19:38
* fungi nods19:38
sdagueand what I'd also like to do is trigger these jobs after CD19:38
*** hogepodge has joined #openstack-infra19:39
*** mattray has left #openstack-infra19:39
sdaguebecause we might be landing a change, and we'd like to trigger that output19:39
sdaguebut right now the cron jobs are defined on the status site19:40
*** sarob has joined #openstack-infra19:40
sdaguewhich is done because the state dir is set there19:41
fungiokay, so you want a script which is called from a cron entry and from an exec, and wrap them both in lockfile (or implement a locking mechanism within the script) presumably, then subscribe the exec to the vcsrepo object19:41
*** marun has quit IRC19:41
*** oubiwann_ has quit IRC19:41
fungiam i answering the right question?19:41
*** marun has joined #openstack-infra19:41
sdagueI think so19:41
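A sketch of the locking piece: one script that both the cron entry and the puppet exec can call, with a non-blocking flock so overlapping runs just bail out; the lock path and the wrapped command are placeholders:

    import fcntl
    import subprocess
    import sys

    LOCK_PATH = "/var/run/elastic-recheck.lock"          # placeholder
    COMMAND = ["er-graph", "/var/lib/elastic-recheck"]    # placeholder command

    with open(LOCK_PATH, "w") as lock:
        try:
            fcntl.flock(lock, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except IOError:
            # Another invocation (cron or the puppet exec) already holds
            # the lock; skip quietly.
            sys.exit(0)
        sys.exit(subprocess.call(COMMAND))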
sdagueI am wondering if we could define the command as a var in the elastic_recheck/init.pp19:42
*** oubiwann_ has joined #openstack-infra19:42
fungialmost certainly19:42
sdaguecan we get vars from one pp to another easily?19:42
sdagueyou have a call example for something like that?19:42
*** sarob has quit IRC19:43
fungioh, hrm... class scope lookup19:43
sdagueyeh19:43
*** sarob has joined #openstack-infra19:43
fungii know how to do it in an erb template...19:43
dkranzfungi: Grr. So the error log gate run is being bitten by https://bugs.launchpad.net/tempest/+bug/126053719:43
uvirtbotLaunchpad bug 1260537 in tempest "Generic catchall bug for non triaged bugs where a server doesn't reach it's required state" [High,Confirmed]19:43
fungitrying to remember if i've seen it in a puppet manifest19:43
dkranzfungi: Do I just do a reverify now or is some other action appropriate?19:43
fungidkranz: reverify once it dies (i can abort the remaining running jobs) and then when it gets into the queue i'll promote it19:44
dkranzfungi: Will reverify kill the current faliing build?19:44
dkranzfungi: ok19:44
fungidkranz: nope, that's why i need to abort the jobs19:44
fungiokay, it's out of the gate now. should be safe to reverify19:45
fungidkranz: ^19:46
*** mrmartin has joined #openstack-infra19:46
*** yassine has joined #openstack-infra19:46
*** vipul-away is now known as vipul19:47
dkranzfungi: Thanks, I did the reverify.19:47
fungii see it19:47
fungipromoting now19:47
*** reed has quit IRC19:47
fungibam. there it is19:47
fungisnappy, snappy new zuulie19:47
*** sarob has quit IRC19:48
fungioh zuulie you nut19:48
openstackgerritMatthew Treinish proposed a change to openstack-infra/elastic-recheck: Add multi-project irc support to the bot  https://review.openstack.org/6754019:49
*** AaronGr is now known as aarongr_afk19:49
*** vipul is now known as vipul-away19:50
mrmartinre19:50
*** denis_makogon_ has joined #openstack-infra19:52
mrmartinfungi: if you have 5 minutes during this day, please comment this review request: https://review.openstack.org/#/c/67443/ This contains the gating / distro tarball task required for community portal.19:52
salv-orlandoI might be stating the obvious but since I see still a consistent number of failures in unit test jobs, perhaps there is a case for bumping up patches for bug 127021219:52
uvirtbotLaunchpad bug 1270212 in oslo "regression: multiple calls to Message.__mod__ trigger exceptions" [Critical,In progress] https://launchpad.net/bugs/127021219:52
*** pballand has quit IRC19:52
openstackgerritMatthew Treinish proposed a change to openstack-infra/elastic-recheck: Add multi-project irc support to the bot  https://review.openstack.org/6754019:52
openstackgerritSean Dague proposed a change to openstack-infra/elastic-recheck: fix css style to make page more readable  https://review.openstack.org/6756019:54
*** Ajaeger has quit IRC19:55
*** yassine has quit IRC19:55
*** smarcet has quit IRC19:55
*** denis_makogon has quit IRC19:55
*** kgriffs has left #openstack-infra19:56
clarkbsalv-orlando: are there fixes for that change yet?19:57
clarkber for that bug19:57
fungisdague: i think you want http://docs.puppetlabs.com/puppet/2.7/reference/lang_scope.html#accessing-out-of-scope-variables19:57
clarkbfungi: sdague: you can reference variables in manifests like $::somescope::innerscope::variablename19:57
sdagueok19:58
clarkbyou do need to make sure you have previously included that class that defines the variable19:58
*** denis_makogon_ is now known as denis_makogon19:58
sdaguecool19:58
fungimrmartin: i don't see a change 67443 at all. did you maybe experiment with gerrit's drafts option, or is that a typo?19:59
clarkbzaro: I am on my way in now19:59
mrmartinit was a draft :D19:59
mrmartinhow can I share this draft review with you?20:00
sdaguecool, i'll see if I can figure it out20:00
fungimrmartin: just set them work-in-progress in the future. drafts are implemented in gerrit in a fairly broken fashion20:00
*** markmc has quit IRC20:00
mrmartingood to know that.20:00
*** herndon has joined #openstack-infra20:00
*** marun has quit IRC20:00
fungimrmartin: in the interim, you can add me as a reviewer (just add "fungi" in the requested reviewers line)20:00
sdaguealso - http://status.openstack.org/elastic-recheck/ - shift reload, and we have descriptions on bugs now20:00
*** marun has joined #openstack-infra20:00
*** ruhe is now known as _ruhe20:01
fungimrmartin: it will resolve it to my name and e-mail address when you do that20:01
mrmartinfungi: I did it20:01
fungisdague: great!20:01
fungimrmartin: i can see it now20:01
clarkbsdague: flashgordon: fwiw I think some of the jenkins errors will be false positives. When zuul aborts a job occasionally that manifests as an uncaught exception (I forget which) and the job fails20:03
mrmartinfungi: ok add as many comments as you can, so if anything missing, I can correct the patch. thnx!20:03
clarkbbut zuul aborting jobs is perfectly normal20:03
clarkbthat said the vast majority are likely slaves falling over and running tests to failure as quickly as they can20:03
fungimrmartin: will do20:03
*** mrodden1 is now known as mrodden20:04
notmyname...and I thought a 100+ jobs inthe check queue yesterday were a lot20:05
*** galstrom_zzz is now known as galstrom20:05
funginotmyname: yeah, i'm hoping they go far faster now that zuul is on an even bigger server and is doing all its git scratch work on tmpfs20:06
fungias of an hour ago20:07
*** nati_uen_ has joined #openstack-infra20:07
clarkbit definitely seems to have made the gate reset cost much lower20:07
clarkbwhich was putting the brakes on everything20:07
*** SergeyLukjanov is now known as SergeyLukjanov_20:08
fungithe event/result queue pileup is completely resolved20:08
clarkbnow we suffer from having about 1/3 to 1/4 of the test infra needed to run all of the tests20:08
*** nati_uen_ has quit IRC20:08
notmynameis that a matter of getting more workers in the nodepool?20:09
*** nati_uen_ has joined #openstack-infra20:09
fungiwell, the nodepool capacity is driven somewhat by gate resets still, since a gate reset near the front of the gate will decimate the entire quota and need them all rebuilt20:10
clarkbnotmyname: sort of. we need more cloud quota to do that and we have to be careful that adding more nodes doesn't make jenkins flakier20:10
clarkband we just saw geard get cranky...20:10
*** mfink has quit IRC20:10
clarkbfor now I think we are better off working to make jenkins and geard happier then ramp up nodepool20:10
fungiat our current aggregate quota i saw things moving fairly smoothly even with a modest reset rate when the gate was around 25-30 changes deep20:10
*** nati_ueno has quit IRC20:10
fungionce it got bigger than that, it got into a decimate/rebuild pendulum swing20:11
*** smarcet has joined #openstack-infra20:11
fungiwhich makes me think that if we do decide to arbitrarily limit the number of testable changes at the front of an integrated queue, the sweet spot is currently somewhere around there20:12
clarkbfungi: I don't think we arbitrarily limit the number of testable changes, I think we let it scale a window based on performance20:13
fungiclarkb: i agree that makes more sense20:13
*** hashar has joined #openstack-infra20:13
fungii liked the slow-start/backoff idea, as much as i can like any pessimistic model for this20:13
clarkbI don't think it will be too hard to implement either as zuul basically takes a list and iterates over it until done. we can slice that list first20:14
openstackgerritDavanum Srinivas (dims) proposed a change to openstack-infra/devstack-gate: Temporary HACK : Enable UCA  https://review.openstack.org/6756420:14
clarkbthe trickier bits will be in presenting it to users so that folks know they are in the queue but not being tested20:14
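A hedged sketch of the windowing idea, not zuul's real internals: the gate stays an ordered list, only a leading slice gets jobs launched, and the slice grows additively on merges and shrinks multiplicatively on resets (constants and names here are illustrative):

    class WindowedQueue(object):
        def __init__(self, floor=3, growth=1):
            self.items = []      # ordered changes, head of the gate first
            self.floor = floor
            self.window = floor
            self.growth = growth

        def active(self):
            # Only these items have jobs launched; the rest are queued
            # but visibly "not being tested yet".
            return self.items[:self.window]

        def on_merge(self):
            # Additive increase when the head merges cleanly.
            self.window += self.growth

        def on_reset(self):
            # Multiplicative decrease when the head fails and everything
            # behind it has to be rebuilt.
            self.window = max(self.floor, self.window // 2)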
clarkbdimsum: re ^ do we expect libvirt to work now?20:15
*** marun has quit IRC20:15
*** marun has joined #openstack-infra20:15
sdagueclarkb: so if that's the case, realize that it's being reported as a FAILURE to ES and graphite20:15
sdaguewhich means it will make the gate look worse than it is20:16
sdaguewhen you run stats on it. So it would be good if those could be classified as a different status20:16
dimsumclarkb, i have a vm with UCA and don't see the problems reported hence trying to run it in d-g20:16
clarkbsdague agree but it is a jenkins limitation20:17
clarkbsdague the way they implement job aborts is by raising an exception. if not caught cleanly you lose20:17
*** galstrom has left #openstack-infra20:17
clarkbdimsum: did you run nova unittests too?20:17
salv-orlandoclarkb: neutron fix is up for review. I can prepare patches for other projects if you're ok to bump them ahead of the queue20:18
salv-orlandoclarkb: neutron patch --> https://review.openstack.org/#/c/67537/20:18
sdagueclarkb: so the abort job exception is a different exception20:18
*** tjones has quit IRC20:18
fungiclarkb: does jenkins report it as "FAILURE" state though in that case rather than "ABORT"?20:18
sdaguefrom what I can tell20:18
clarkbfungi in some corner cases yes20:18
sdagueI've definitely seen ABORT20:18
clarkbya abort is the 99% case20:19
sdagueclarkb: right, that's one of the reasons I wanted to raise the question20:19
dimsumclarkb, yep20:19
clarkbbut when jenkins doesnt cleanly catch the abort exception it looks like failure20:19
notmynameI'm not sure who to direct this at, so I'm throwing it in here:20:19
fungidimsum: interesting idea. i was trying to test it myself using d-g on a vm, but our recent refactor moved some repos around from where my script/instructions expect them20:19
notmynameI'm currently working on the Swift 1.12.0 release. I consider this somewhat of a test run for the gates for next week's i2 stuff.20:20
*** goofy-nic-friday is now known as cp16net20:20
notmynamemy plan is to get the last patches through the gate for an RC (today or when stuff lands, whichever is last)20:20
notmynameI'm currently looking at these patches: https://review.openstack.org/#/q/branch:master+AND+Approved%253D1+AND+status:open+AND+project:openstack/swift,n,z20:20
notmynameother patches would be whatever else is approved today, including one for the release notes update20:21
notmynameI don't think I need anything specific from -infra (beyond the hard work you're already doing). I wanted to give you a status update, especially because of the milestone next week (this is sort of a trial run, I'd think)20:22
funginotmyname: makes sense. as far as i know we're done with emergency disruptions. we spent this morning doing what we can to try to beef up gating performance/throughput in preparation for the bigger rush next week20:22
dimsumfungi, don20:22
dimsumfungi, don't know if this will work - https://review.openstack.org/#/c/67564/ - taking a shot20:23
fungidimsum: it looks like i would expect it to, but set that to wip because we won't actually put that change as it stands into production. we'd want to do that in nodepool prep scripts instead and/or in puppet configuration (but it may make for a worthwhile proof-of-concept)20:24
fungidimsum: the other place you could try testing it would be with a change to devstack (before it starts installing packages)20:26
dimsumah. right20:26
dimsumwill do20:26
fungibut either way will probably work20:26
*** prad_ has joined #openstack-infra20:26
*** salv-orlando has quit IRC20:28
*** herndon has quit IRC20:28
*** prad has quit IRC20:28
*** prad_ is now known as prad20:28
openstackgerritEvgeny Fadeev proposed a change to openstack-infra/askbot-theme: made launchpad importer read and write data separately  https://review.openstack.org/6756720:30
sdagueclarkb, fungi: easy change - gate status to dedicated page, so we can pull it off er - https://review.openstack.org/#/c/65700/20:30
sdagueif anyone's up for walking away from fire :)20:30
*** DinaBelova is now known as DinaBelova_20:36
*** Ryan_Lane has quit IRC20:36
*** Ryan_Lane has joined #openstack-infra20:36
*** mrmartin has quit IRC20:36
*** salv-orlando has joined #openstack-infra20:38
*** herndon has joined #openstack-infra20:38
*** yolanda has quit IRC20:38
*** markwash has quit IRC20:39
*** markwash has joined #openstack-infra20:41
*** marun has quit IRC20:41
*** marun has joined #openstack-infra20:41
notmynamewow. I am noticing that zuul is picking up approved changes _much_ more quickly now20:44
*** carl_baldwin has quit IRC20:46
*** senk has quit IRC20:47
*** carl_baldwin has joined #openstack-infra20:47
*** markmcclain has quit IRC20:47
*** vipul-away is now known as vipul20:47
*** jaypipes has quit IRC20:48
*** jaypipes_ has joined #openstack-infra20:48
*** jaypipes_ has quit IRC20:48
*** dprince has quit IRC20:49
*** pballand has joined #openstack-infra20:49
funginotmyname: that's thanks to the event queue no longer being backlogged20:49
rustlebeequeues are huge :)20:50
fungirustlebee: yep, i expect them to start dropping once the check pipeline catches up on worker assignments now20:51
rustlebeecool20:52
fungirustlebee: without your awesome collapseypatch, my browser would have choked on the current status page i think20:52
rustlebeeheh20:52
openstackgerritEmilien Macchi proposed a change to openstack-infra/config: gerritbot: Add API doc git notifications on #openstack-doc  https://review.openstack.org/6757320:53
dimsumrustlebee, ya, very handy!20:55
* rustlebee clicks expand all ... poor chrome20:56
fungi*boom*20:56
*** tjones has joined #openstack-infra20:58
*** hashar has quit IRC20:59
*** herndon has quit IRC20:59
*** thomasem has quit IRC21:01
*** marun has quit IRC21:01
*** marun has joined #openstack-infra21:01
*** nati_ueno has joined #openstack-infra21:03
*** smarcet has left #openstack-infra21:05
very_tiredrustlebee: yes thanks for the collapsy patc21:05
very_tiredh21:05
*** herndon has joined #openstack-infra21:06
rustlebeeyou're welcome :)21:06
rustlebeeit was fun.21:06
*** herndon has quit IRC21:06
rustlebeeanything web related is out of my normal comfort zone21:06
very_tiredfungi: email alert, I just sent this: http://lists.openstack.org/pipermail/openstack-infra/2014-January/000661.html21:06
*** herndon has joined #openstack-infra21:07
*** nati_uen_ has quit IRC21:07
very_tiredwill ping at 8pm and if they haven't responded, no voting for them21:07
*** herndon has quit IRC21:07
very_tiredrustlebee: you did a nice job of it21:07
fungivery_tired: sounds good. in the meantime, get some rest21:07
very_tiredfungi: :D21:07
very_tiredcode sprint winding down21:07
very_tiredwe have patches to gate21:07
*** herndon_ has joined #openstack-infra21:08
clarkbfungi: I am back at a different desk now21:08
very_tiredfungi: https://etherpad.openstack.org/p/montreal-code-sprint21:08
very_tiredunder the to be promoted section21:08
fungivery_tired: more stability fixes?21:08
sdaguefungi: yes, these should decrease load on the neutron side21:09
sdaguewhich should make it more likely to pass21:09
very_tiredstill working on getting +A on all the neutron patches, marun is going through them21:09
very_tiredso is nati_ueno21:10
fungisdague: very_tired: if you could work up a preferred sequence, we can promote the whole batch. more stable gate means more faster gate21:10
fungineed to know changenum,psnum21:11
very_tiredfungi mtreinish is double checking that now21:11
*** jgrimm has quit IRC21:12
mtreinishfungi: I just reordered the tempest test list21:13
fungiclarkb: do you think we have a chance of being able to sanely quiesce zuul tomorrow for that project rename maintenance?21:13
*** oubiwann_ has quit IRC21:14
*** nati_ueno has quit IRC21:14
*** marun has quit IRC21:14
*** marun has joined #openstack-infra21:14
*** oubiwann_ has joined #openstack-infra21:15
*** nati_ueno has joined #openstack-infra21:15
very_tiredfungi: they responded to my email, so you might not need to do anything21:16
mikalMorning21:16
very_tiredmikal: morning21:16
very_tiredhappy saturday21:16
clarkbfungi: maybe? but it is looking less likely21:16
*** nati_ueno has quit IRC21:17
fungiclarkb: i try to look at it as we're load-testing the new zuul ;)21:18
*** nati_ueno has joined #openstack-infra21:18
*** oubiwann_ has quit IRC21:19
mikalvery_tired: you're anteaya?21:19
openstackgerritAndreas Jaeger proposed a change to openstack-infra/config: Add build job for Japanese Install Guide  https://review.openstack.org/6748121:20
openstackgerritMichael Krotscheck proposed a change to openstack-infra/storyboard-webclient: [WIP] Storyboard API Interface and basic project management  https://review.openstack.org/6758221:22
very_tiredmikal: I am21:23
mikalvery_tired: so, I don't think I caused the recheck backlog... The script didn't run for that long.21:24
fungimikal: oh, the thing to recheck stale patches?21:25
mikalfungi: yeah21:25
fungianyway, no, the check volume is from us dumping the state of the zuul check and gate pipelines, moving to a bigger badder zuul and restoring them... so they all needed fresh workers and then new patchsets came in on top of that21:26
clarkbbut, bigger badder zuul is pretty awesome21:26
*** marun has quit IRC21:26
*** marun has joined #openstack-infra21:27
fungibigger badder zuul will eat your spleen for breakfast it's so awesome21:27
*** UtahDave has joined #openstack-infra21:28
fungior at least, according to our design specs it has a taste for spleen. more testing required21:28
*** pcrews has quit IRC21:28
mikalSo, are gate flushes still hurting us?21:28
*** krotscheck has quit IRC21:29
*** pcrews has joined #openstack-infra21:29
clarkbmikal: yes in that they force us to retest stuff, no they don't cause zuul to stop for forever to process them21:29
fungimikal: they will still severely deplete our available job workers for prolonged periods21:29
*** NikitaKonovalov_ is now known as NikitaKonovalov21:30
mikalOk, so I got to the point with my rechecker where it would run until it found something to recheck, recheck that, and then exit. I would then go and hand verify the recheck. I hadn't found any incorrect rechecks in a while.21:30
fungithough apparently the neutron+qa testing/stability sprint has a stack of patches which they think will make a big improvement on reset frequency21:30
*** NikitaKonovalov is now known as NikitaKonovalov_21:30
mikalI'm wondering if I should turn it back on this morning, or if the queues are so long I should just let it rest for a day21:31
mikalThe queues do look pretty long...21:31
*** vipul is now known as vipul-away21:31
fungimikal: i see it as a tradeoff there. at least some of the more persistent gate resets we're getting are actually from stale changes getting approved after bit-rotting in review for too long21:32
mikalfungi: I was surprised by how many stale checks there were last night21:32
fungiso catching those early might help keep cores from approving them21:32
mikalIt was a non-trivial percentage of reviews21:32
mikalNoting that sdague doesn't want checks on stable at the moment because of pip21:32
mikal(wow, the nova check fail rate at the moment is really high)21:33
*** SumitNaiksatam has joined #openstack-infra21:37
*** marun has quit IRC21:37
*** marun has joined #openstack-infra21:38
mordredmikal: I, for one, support your rechecker21:38
openstackgerritA change was merged to openstack-infra/elastic-recheck: fix css style to make page more readable  https://review.openstack.org/6756021:40
mikalI just don't want to break the world with my well meaning flailing21:41
mikalThere's only so much kermit arms can do21:41
portantemordred: do we run the devstack environments with GRO turned on (the generic receive off-load stuff)?21:41
portanteI am guessing it is not a concern, but just checking21:41
*** vipul-away is now known as vipul21:43
*** aarongr_afk is now known as AaronGr21:45
very_tiredheh, kermit arms21:45
fungithe muppet geek in me knew exactly what he meant21:46
*** vipul is now known as vipul-away21:47
*** rustlebee is now known as russellb21:52
*** sdake has quit IRC21:52
very_tiredfungi: the patches in the "to be promoted" section are all +A'd and in the order they need to go into the gate: https://etherpad.openstack.org/p/montreal-code-sprint21:52
very_tiredfungi: let me know if you need more21:52
*** beekneemech has quit IRC21:52
very_tiredmore as in more information, not more as in more work to do21:53
fungivery_tired: they're separated by project... is that the order you want them in? (neutron block first, then that standalone "please also this" change, then the tempest changes)?21:53
*** herndon_ has quit IRC21:53
*** derekh has joined #openstack-infra21:54
fungilooks like that standalone 67537 isn't approved anyway21:54
*** thedodd has quit IRC21:54
*** carl_baldwin has quit IRC21:54
very_tiredfungi: this one goes first please: Please also this: https://review.openstack.org/#/c/67537/21:54
*** carl_baldwin has joined #openstack-infra21:55
*** marun has quit IRC21:55
*** sarob has joined #openstack-infra21:55
very_tiredfungi: yes, salv-orlando is getting a +A on that, sorry I thought we were ready on our end21:55
fungino problem21:55
*** sarob has quit IRC21:55
*** marun has joined #openstack-infra21:55
*** sandywalsh has quit IRC21:56
*** herndon_ has joined #openstack-infra21:56
fungivery_tired: though you may need to get clarkb's help on those. i'm about to disappear to go out for food21:57
clarkbfungi: go disappear, I will be mostly here in a few minutes21:57
*** UtahDave has quit IRC21:57
very_tiredfungi: happy food, I will work with clarkb21:58
very_tiredthanks21:58
fungivery_tired: also, kudos to you and the attendees at the sprint--that's an impressive list of stability and debugging fixes21:58
very_tiredfungi thanks, it was very beneficial on many levels21:59
openstackgerritA change was merged to openstack-infra/elastic-recheck: Add query for bug 1261253  https://review.openstack.org/6753921:59
uvirtbotLaunchpad bug 1261253 in tripleo "oslo.messaging 1.2.0a11 is outdated and problematic to install" [High,Triaged] https://launchpad.net/bugs/126125321:59
very_tiredwe had a good group here21:59
*** rnirmal has joined #openstack-infra22:02
very_tiredclarkb: all the tempest patches in the "to be promoted" section can go in22:02
very_tiredhttps://etherpad.openstack.org/p/montreal-code-sprint22:02
*** tjones has left #openstack-infra22:03
very_tiredwe are waiting on a +A on 67537 so it can go first; once we have that, it's 67537 followed by the neutron block22:03
fungiclarkb: though keep in mind that anything you promote now will mean all the remaining changes in the check queue which have accumulated since the last gate reset will also be serviced before zuul takes a crack at what's in the gate (including 63934,3 which we intentionally placed at the front)22:04
*** vipul-away is now known as vipul22:04
clarkbvery_tired: I would like to do all of them at once as promotion requires a reset22:04
*** vipul is now known as vipul-away22:04
fungiclarkb: i agree that's probably the best choice22:04
*** melwitt has joined #openstack-infra22:05
clarkbvery_tired: so once everything has been approved and queued ping me and we will promote22:05
very_tiredclarkb: will do22:05
very_tiredclarkb: good to go22:06
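For reference, the promotion clarkb is batching up is done through zuul's command-line client, and every promotion forces a gate reset, which is why doing them one at a time is so costly. A rough sketch, assuming the zuul CLI's promote subcommand; the change,patchset identifiers below are placeholders:

    # Move a batch of approved changes to the front of the gate in one shot,
    # paying the reset cost once instead of once per promotion.
    zuul promote --pipeline gate --changes 12345,1 12346,2 12347,1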
flashgordonyou guys ever see this bug: http://logs.openstack.org/21/65121/2/gate/gate-grenade-dsvm/efd816b/console.html22:06
flashgordonSCPRepositoryPublisher aborted due to exception22:06
*** carl_baldwin has quit IRC22:06
mordredflashgordon: it means that java hates us22:07
*** carl_baldwin has joined #openstack-infra22:07
flashgordonmordred: yup22:08
flashgordonbut about to file a bug if we don't have one22:08
flashgordon263 hits in logstash22:08
*** sandywalsh has joined #openstack-infra22:08
fungiflashgordon: that log you linked doesn't seem to have been associated with a result posted to the associated change22:09
fungiflashgordon: i wonder if that job got intentionally killed when a job on a change ahead of it failed in the gate22:09
*** jcooley_ has joined #openstack-infra22:10
very_tiredclarkb: problem22:10
clarkbvery_tired: ?22:10
very_tiredhttps://review.openstack.org/#/c/67537/ never passed check22:10
openstackgerritSean Dague proposed a change to openstack-infra/config: add in elastic-recheck-unclassified report  https://review.openstack.org/6759122:10
very_tiredso salv-orlando says go with the tempest block22:10
very_tiredlet https://review.openstack.org/#/c/67537/ come back with check22:11
*** CaptTofu has quit IRC22:11
fungiflashgordon: there are only two grenade failures on the change for that log, and neither of them refer to that particular job run22:11
very_tiredand then if it does, promote it and the rest of the neutron block22:11
openstackgerritMonty Taylor proposed a change to openstack-infra/storyboard: Fix the initial db migration  https://review.openstack.org/6759222:11
very_tireddoes that sound reasonable?22:11
clarkbvery_tired: I am not doing two promotions22:11
*** gema has quit IRC22:11
*** nati_uen_ has joined #openstack-infra22:11
clarkbpromotions are very expensive22:11
flashgordonfungi: hmm22:12
*** MarkAtwood has quit IRC22:13
fungiflashgordon: i'm guessing java.lang.InterruptedException is something akin to sigint22:13
very_tiredclarkb: I understand22:13
flashgordonfungi: that makes sense22:14
fungiflashgordon: "Thrown when a thread is waiting, sleeping, or otherwise occupied, and the thread is interrupted, either before or during the activity."22:14
fungi(from oracle's language doc reference)22:14
*** med_ has quit IRC22:14
*** UtahDave has joined #openstack-infra22:14
flashgordonfungi: that makes a lot of sense22:15
fungiflashgordon: so i think you have a cancelled/aborted job there that jenkins reported as a failure22:15
flashgordonyup22:15
fungibecause EJENKINS22:15
very_tiredclarkb: this is our fault and we will wear it22:15
flashgordonso I will add an elastic-recheck fingerprint for that so we can ignore those and get better classification rate numbers22:15
*** CaptTofu has joined #openstack-infra22:16
flashgordonif that sounds good to you22:16
flashgordonwhich means add a bug marked as resolved22:16
fungiflashgordon: sounds like a good call22:16
*** nati_ueno has quit IRC22:16
fungianyway, really disappearing for several hours starting now... back later for more fun22:16
clarkbfungi: have fun22:17
*** mfer has quit IRC22:17
*** reed has joined #openstack-infra22:18
very_tiredfungi: enjoy22:18
*** thedodd has joined #openstack-infra22:19
flashgordonfungi: so this happens in the gate queue only which fits your hypothesis22:19
*** ewindisch is now known as zz_ewindisch22:19
dimsumflashgordon, i've seen many stack traces that finally end up in the wait interrupt at line hudson.remoting.Request.call(Request.java:146)22:20
flashgordondimsum: link?22:22
flashgordondimsum: I am using this query: message:"java.lang.InterruptedException" AND filename:"console.html"22:22
*** salv-orlando has quit IRC22:22
*** nati_uen_ has quit IRC22:22
*** med_ has joined #openstack-infra22:23
*** nati_ueno has joined #openstack-infra22:23
*** vkozhukalov has quit IRC22:24
*** eharney has quit IRC22:24
flashgordonfungi: https://bugs.launchpad.net/openstack-ci/+bug/127030922:24
uvirtbotLaunchpad bug 1270309 in openstack-ci "jenkins java.lang.InterruptedException" [Undecided,New]22:24
flashgordoncan you triage that, I think won't fix makes sense but your call22:24
flashgordonbut something closed22:24
*** bnemec has joined #openstack-infra22:24
*** rossella_s has quit IRC22:25
*** carl_baldwin has quit IRC22:26
*** carl_baldwin has joined #openstack-infra22:26
dimsum"hudson.remoting.Request.call(Request.java"22:28
very_tiredI'm out for the weekend and Monday, I expect to be online again on Tuesday22:28
*** sarob has joined #openstack-infra22:28
mordredhave a great weekend very_tired22:28
very_tiredclarkb and fungi thanks for all your help22:28
very_tiredthanks22:28
*** marun has quit IRC22:28
very_tired:D22:28
*** very_tired is now known as anteaya22:28
*** marun has joined #openstack-infra22:29
*** gema has joined #openstack-infra22:30
*** carl_baldwin has quit IRC22:30
*** lcestari has quit IRC22:31
flashgordonfungi: I think it is a valid infra bug actually, these shouldn't be marked as failures22:32
*** obondarev_ has quit IRC22:32
*** reed_ has joined #openstack-infra22:32
*** emagana has quit IRC22:32
flashgordondimsum: I think that is the same issue, that is part of the InterruptedException stacktrace22:32
flashgordondimsum: see https://bugs.launchpad.net/openstack-ci/+bug/127030922:33
uvirtbotLaunchpad bug 1270309 in openstack-ci "jenkins java.lang.InterruptedException" [Undecided,New]22:33
*** nati_ueno has quit IRC22:33
*** reed__ has joined #openstack-infra22:34
notmynamewhy would this change https://review.openstack.org/#/c/67538/ be marked as SKIPPED in zuul?22:34
notmynameit's towards the bottom of the gate queue22:35
*** reed has quit IRC22:35
*** reed__ has quit IRC22:35
*** senk has joined #openstack-infra22:36
*** reed_ has quit IRC22:37
*** HenryG has quit IRC22:37
openstackgerritJoe Gordon proposed a change to openstack-infra/elastic-recheck: Add query for bug 1270309  https://review.openstack.org/6759422:39
uvirtbotLaunchpad bug 1270309 in openstack-ci "jenkins java.lang.InterruptedException" [Undecided,New] https://launchpad.net/bugs/127030922:39
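elastic-recheck fingerprints are stored one YAML file per bug under queries/ in the elastic-recheck repo, keyed by the Launchpad bug number. The change above adds such a file for bug 1270309; its exact contents aren't shown in the log, but based on the query flashgordon quoted earlier it would look roughly like this:

    # queries/1270309.yaml (sketch; the actual file is what review 67594 proposes)
    query: >
      message:"java.lang.InterruptedException" AND filename:"console.html"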
clarkbnotmyname: probably a merge conflict, if you hover over the red bubble it will tell you22:41
openstackgerritJoe Gordon proposed a change to openstack-infra/elastic-recheck: Use short build_uuids in elasticSearch queries  https://review.openstack.org/6759622:45
zaroclarkb: new scp plugin is on jenkins-dev.o.o22:45
*** ArxCruz has quit IRC22:49
*** flashgordon is now known as jog022:50
*** marun has quit IRC22:50
*** mrda has joined #openstack-infra22:53
*** dstanek has quit IRC22:56
*** prad has quit IRC22:57
*** mrda has quit IRC22:57
*** thedodd has quit IRC22:59
*** radix has joined #openstack-infra23:01
radixjenkins seems to be ignoring one of my patches, https://review.openstack.org/#/c/67006/3 , is there something wedged?23:01
radixor is there something messed up with my patch because I've done something wrong, maybe23:02
*** rcleere has quit IRC23:02
clarkbradix: it is being rechecked23:02
radixoh ok cool :)23:03
clarkbradix: looks like it was a draft at one point though23:03
radixyep, started out as one23:03
clarkbdrafts are evil and don't work at all in the CI systems23:03
clarkbyou can use Work in progress instead23:03
radixwell, I assumed jenkins would notice the first non-draft I posted23:03
clarkbdepends on how the non draft is posted23:03
*** dcramer_ has quit IRC23:04
clarkbif it is just published jenkins won't notice23:04
clarkbif it is pushed as a fresh non-draft patchset jenkins should notice; in that case jenkins may have missed it because we have been hitting zuul with a hammer to make it go quicker23:04
radixah, ok23:04
radixyeah, I just pushed a new rev as a non-draft, so it was probably that23:04
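The distinction clarkb is drawing comes down to which ref the patchset lands on: publishing an existing draft doesn't produce the patchset-created event the CI system listens for, while pushing a fresh non-draft patchset to refs/for/* does. A sketch of the two uploads; the remote name "gerrit" is a placeholder for however the remote is configured locally:

    # Draft upload -- the CI system will not pick this up:
    git push gerrit HEAD:refs/drafts/master
    # Normal upload (what `git review` does) -- this triggers check jobs:
    git push gerrit HEAD:refs/for/master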
*** zz_ewindisch is now known as ewindisch23:05
radixI'll point out that https://wiki.openstack.org/wiki/Gerrit_Workflow explains how to use drafts, and doesn't discourage them23:05
clarkbgah23:06
* clarkb goes on a bug filing spree23:06
radixhehe :)23:06
clarkbsince the chances I get all of this done today are slim23:06
*** burt1 has quit IRC23:07
*** sarob has quit IRC23:10
*** sarob has joined #openstack-infra23:10
openstackgerritA change was merged to openstack-infra/devstack-gate: comparison to stable/grizzly is not numeric  https://review.openstack.org/6393423:11
*** jergerber has quit IRC23:11
*** thuc has quit IRC23:12
*** thuc has joined #openstack-infra23:12
sdagueyay, the non numeric patch finally landed!23:14
*** senk has quit IRC23:14
*** reed__ has joined #openstack-infra23:14
sdaguealso, there is a fix for stable/grizzly devstack in the gate now23:14
sdagueno need to promote it, it's fine if it churns through the weekend23:14
sdaguebut that should be handy23:14
*** sarob has quit IRC23:15
*** thuc_ has joined #openstack-infra23:16
clarkbsdague: woot23:16
clarkbsdague: what was the fix?23:16
sdaguehttps://review.openstack.org/#/c/67425/23:16
*** markmcclain has joined #openstack-infra23:16
sdaguebasically, we were so wrapped up in the pip 1.5 thing, we forgot about the broken workarounds on pip 1.423:16
sdaguethat never got backported23:17
clarkb:(23:17
*** thuc has quit IRC23:17
sdaguehowever, it passed23:17
sdagueso I think it will fix things23:17
sdaguechmouel has additional good backports and fixes for grizzly23:17
sdaguebut that one should be sufficient to get stable/havana working23:17
*** soleblaze has quit IRC23:18
*** markmcclain1 has joined #openstack-infra23:19
clarkbbugs 1270321 1270319 and 1270320 submitted to cover the stuff we ran into today23:19
uvirtbotLaunchpad bug 1270321 in openstack-ci "Puppet manifests for zuul install too new statsd." [Medium,Triaged] https://launchpad.net/bugs/127032123:19
clarkbradix: I think I am just going to update the wiki now23:19
*** mrodden has quit IRC23:20
radixthanks :)23:20
*** denis_makogon has quit IRC23:20
*** markmcclain has quit IRC23:21
sdagueclarkb: was your hack to disable draft perms ever something that worked?23:21
*** herndon_ has quit IRC23:22
*** soleblaze has joined #openstack-infra23:23
*** CaptTofu has quit IRC23:24
*** sarob has joined #openstack-infra23:24
*** CaptTofu has joined #openstack-infra23:25
*** reed__ has quit IRC23:25
*** carl_baldwin has joined #openstack-infra23:27
sdaguenow that we did a zuul restart with the durable enqueue times in it - https://review.openstack.org/#/q/status:open+project:openstack-infra/config+branch:master+topic:status_ui,n,z could land any time, which displays enqueue duration in jobs23:28
sdagueand makes the merge conflict changes black, so they are easier to distinguish23:29
*** CaptTofu has quit IRC23:29
clarkbsdague: haven't had a chance to test that23:30
clarkbzaro: is that something you can test on review-dev? disable push rights to refs/drafts/* for all projects23:31
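What clarkb is asking zaro to try is essentially an ACL change. A rough sketch of what it could look like in a project.config, assuming a Gerrit version whose ACLs support block rules (whether review-dev's version behaves this way is exactly what the test would show); the group name is an assumption:

    # Sketch: refuse draft uploads for everyone (assumed syntax and group).
    [access "refs/drafts/*"]
        push = block group Registered Users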
*** jcooley_ has quit IRC23:32
openstackgerritSean Dague proposed a change to openstack-infra/config: add in elastic-recheck-unclassified report  https://review.openstack.org/6759123:35
sdagueclarkb: there is actually a url in the review that shows it in action23:35
clarkbsdague: cool, I will take a look momentarily23:35
sdagueit's all just status ui on the zuul json23:35
sdagueso you can just gvfs-open it locally actually23:35
sdaguecd config/modules/openstack_project/files/status && gvfs-open index.html23:36
*** emagana has joined #openstack-infra23:38
*** mfink has joined #openstack-infra23:39
*** jcooley_ has joined #openstack-infra23:39
mordredsdague: looks good to me23:40
sdaguemordred: cool23:40
sdaguemordred: so the grizzly devstack thing23:40
mordredyeah?23:40
sdagueapparently you pushed a fix for that in august23:40
sdaguewhich got lost23:40
sdagueand someone found it23:40
mordredAWESOME23:40
sdaguehttps://review.openstack.org/#/c/67425/23:40
sdaguewhy it only started screwing us now... I don't know23:41
mordredso broken23:41
sdagueso anyway, once that gets through the gate, havana patches can land again23:42
sdagueI think23:42
*** vipul-away is now known as vipul23:42
*** jcooley_ has quit IRC23:44
*** boris-42 has quit IRC23:45
*** rnirmal has quit IRC23:45
mordredsdague, clarkb: perhaps we should make some of the different colors different shapes too - for people with colorblindness23:46
zaroclarkb, sdague : i'll give disabling drafts a try.23:46
clarkbzaro: thank you23:46
*** salv-orlando has joined #openstack-infra23:47
sdaguemordred: yeh, I think that would be good. Honestly, we should probably do the shape draws with svg anyway.23:47
sdaguemaybe after turning status.js into templates I'll do that23:47
*** jerryz has quit IRC23:47
*** obondarev_ has joined #openstack-infra23:48
mordredsdague: yeah. and on your plane - def look at the bower/grunt stuff for that - if we're going to get fancier, I think we should consider not just being files in the config repo23:49
sdagueyep, I'd be fine with that23:49
mordredit also may be way overkill - which is why you should look at it and not me23:49
sdagueheh23:49
portantesdague, mouse of the circle was a well hidden feature in zuul for me23:49
portantemouse over23:49
portantethanks for pointing that out23:50
*** krotscheck has joined #openstack-infra23:50
clarkbok fixing the wiki article finally :)23:51
*** markmcclain1 has quit IRC23:52
*** flaper87 is now known as flaper87|afk23:53
*** carl_baldwin has quit IRC23:53
*** jerryz has joined #openstack-infra23:56
clarkbhttps://wiki.openstack.org/wiki/Gerrit_Workflow#Work_in_Progress how does that look?23:57
*** pballand has quit IRC23:57
