Monday, 2017-10-02

*** jeblair has joined #openstack-infra-incident15:21
*** dmsimard has joined #openstack-infra-incident15:21
jeblairinfra-root: hi, who's here?15:21
dmsimardI'm not root but I'm here15:21
*** kiennt26 has joined #openstack-infra-incident15:22
fungii am15:22
fungihad to re-find the right buffer for this one15:22
* mordred waves to jeblair15:22
*** pabelanger has joined #openstack-infra-incident15:22
*** Shrews has joined #openstack-infra-incident15:23
*** clarkb has joined #openstack-infra-incident15:24
clarkbo/15:24
pabelangero/15:25
jeblairokay, i think all of us who are awake are here15:25
jeblairi think we should plan how we're going to deal with the flood of problems15:26
fungiagreed15:26
jeblairif i'm going to work on memory usage, i won't be able to deal with anything else.  probably for days.  maybe the whole week.15:26
jeblairso, assuming that doesn't just immediately convince everyone that we should roll back...15:27
fungii figured as much, and am trying not to raise your attention on anything unrelated to the performance/memory stuff15:27
fungii expect the rest of us can handle job configuration related issues15:27
jeblairi think we'll need folks to jump into debugging problems, and if you find something you need me to dig into, put it on a backlog for me15:27
jeblairso maybe we should have an etherpad with all the issues being raised, and who's working on them, and their resolution, and then we can have a section for problems that need deeper debugging15:28
mordredjeblair: works for me15:29
Shrewswfm15:29
pabelangeryah, that works well15:29
jeblairhow about https://etherpad.openstack.org/p/zuulv3-issues ?15:29
clarkbsounds good15:29
mordredmaybe three sections- problems/who's working them - need deeper debugging - and need jeblair15:29
fungii'm also prioritizing review on jeblair's zuul patches since that's basically the one thing which would make me strongly consider rollback at this point, and the sooner performance improves the faster job configuration changes get tested/merged15:29
mordredbecause some deeper debugging is stuff various of us can dig in to15:29
mordredbut there are times when the answer is 'need jeblair to look at XXX' - and we should prioritize which things we raise the jeblair flag on15:30
fungii would argue that unless the problem is more severe than the current performance/resource consumption situation with zuul, we should just find ways to not disturb jeblair15:31
pabelanger++15:31
pabelangerI'm happy to focus on job failures15:31
fungitriage existing problems, and also attempt to help others in the community come up to speed on fixing their own jobs as much as possible to free up more bandwidth for all of us15:32
dmsimardcan we use a review topic for fixes that need to be prioritized ?15:33
dmsimardlike jeblair's performance/memory patches, or other important things15:33
fungii was wondering the same earlier. we could use topic:zuulv3 but there's a bunch of open stuff under that topic already15:34
jeblairyeah, maybe something new; zuulv3-fixes ?15:34
fungithough i expect most of jeblair's critical patches will be project:openstack-infra/zuul branch:feature/zuulv3 and there's not a ton of those15:34
jeblairalso, i'm not expecting a flood of patches related to memory use :)15:34
dmsimardyeah I feel zuulv3 might be overloaded right now (especially with other projects starting to use it for their own purposes)15:35
mordredyah15:35
fungimore considering how to prioritize any random changes which are fixing broken jobs15:35
fungithose could certainly use some topic coordination15:35
dmsimard^15:35
mordredI used 'unbreak-zuulv3' for some over the weekend, but zuulv3-fixes seems less negative15:35
mordredfungi: what about 'zuulv3-jobs' for things fixing a job, 'zuulv3-fixes' for things that are fixing wider issues (like a patch to zuul-cloner, for instance)15:36
*** kiennt26 has quit IRC15:36
fungiyes, let's avoid creating our own negative imagery. we have enough of a public relations challenge just getting the community to not revolt over all the currently broken jobs15:36
fungiso far they've been amazingly tolerant15:37
fungibut i only expect that to last for so long15:37
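For context, the topic names floated above (zuulv3-jobs, zuulv3-fixes) are just Gerrit change topics; a hedged example of how a contributor would tag and find such fixes, assuming the git-review and Gerrit ssh tooling in use at the time:

```shell
# Push a job fix under the proposed topic (topic names here are only
# the ones suggested in this discussion, not a settled convention).
git review -t zuulv3-jobs

# List open changes under a topic, e.g. via the Gerrit ssh query API:
ssh -p 29418 review.openstack.org gerrit query --current-patch-set \
    "status:open topic:zuulv3-fixes"
```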
jeblairi also added a fixed issues section we can move things to, to help folks know when a known issue has been fixed15:37
fungithat'll help, thanks15:38
mordredinfra-manual publishing works again too - so we can also start shifting FAQ content into infra-manual15:38
jeblairwho's working on neutron stadium?15:40
pabelangerit looks like I am focusing on our zuul-executors right now, I am seeing large load on some of them15:41
Shrewsi'll attempt to poke at the branch matcher failure for us. If we don't have a test for that, that's where I'll start with it.15:41
jeblairi moved the disk-filling line from issues with jobs to deeper debugging; looks like we have 2 instances of that now15:42
jeblairoh ha that's the same instance15:42
jeblairShrews: thanks, can you put your name on it in the etherpad?15:42
mordredpabelanger: tobiash submitted a patch which was +A'd this morning (may not have landed yet) which should reduce executor load15:43
ShrewsI can and shall!15:43
fungipabelanger: SpamapS has a feature proposed to limit picking up new jobs when load exceeds a given threshold (2.5 x cpu count currently)15:43
jeblairfungi: oh, maybe i should review that now?15:43
pabelangermordred: ya, we might also need more executors, currently ze04.o.o is running 400 ansible-playbook processes, with load of 8115:43
mordredjeblair: ++15:43
fungijeblair: i'm just now pulling it back up to see what state it's in, but he was working through it over the weekend15:43
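The load-based governor fungi describes (508649) isn't quoted in the log; as a rough sketch of the idea only — class and method names and the numbers are illustrative, not the actual patch — an executor can stop taking new jobs once load average passes a multiple of the CPU count and resume once it drops back:

```python
import multiprocessing
import os
import threading


class LoadGovernor:
    """Illustrative sketch: pause accepting new work under high load.

    'accept_work'/'pause_work' stand in for however the executor
    registers and unregisters itself with the job queue.
    """

    def __init__(self, executor, load_multiplier=2.5, interval=30):
        self.executor = executor
        self.max_load = load_multiplier * multiprocessing.cpu_count()
        self.interval = interval
        self._stop = threading.Event()

    def run(self):
        # Check the 1-minute load average every `interval` seconds.
        while not self._stop.wait(self.interval):
            load_1min = os.getloadavg()[0]
            if load_1min > self.max_load:
                self.executor.pause_work()   # stop taking new jobs
            else:
                self.executor.accept_work()  # resume taking new jobs

    def stop(self):
        self._stop.set()
```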
pabelangermordred: ze03 seems stopped, trying to see why15:43
mordredpabelanger: cool15:44
mordredinfra-root: if you didn't see - simple patch from tobiash which increases the ansible check interval from 0.001 to 0.01 which in his testing significantly reduces CPU load on executors https://review.openstack.org/#/c/508805/ - it has landed, so we should restart executors with it applied15:45
fungijeblair: yeah, i reviewed it and +2'd yesterday, as did tobiash: https://review.openstack.org/50864915:45
Shrewsmordred: cool. but why not an even larger value?15:46
Shrews(not sure what that affects, tbh)15:46
jeblairShrews: it increases the amount of time between tasks, which can become noticeable with lots of small tasks15:47
Shrewsah15:47
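The tradeoff jeblair describes is the usual one for a polling loop: a longer sleep burns less CPU but adds up to that much extra latency on every task transition. A minimal illustration (the loop structure and names are assumptions, not Zuul's actual code; only the 0.001 → 0.01 value comes from the patch discussed above):

```python
import time


def wait_for_task(is_finished, check_interval=0.01):
    """Poll until a task reports completion.

    With check_interval=0.001 this loop wakes 1000 times per second per
    task, which is what was loading the executors; 0.01 cuts that to
    100 wakeups per second at the cost of up to ~10ms added latency per
    task, which becomes noticeable when a playbook runs many tiny tasks.
    """
    while not is_finished():
        time.sleep(check_interval)
```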
* AJaeger joins and reads backscroll15:48
pabelangerokay ze03.o.o started again15:48
jeblairpabelanger: what was wrong with ze03?15:48
fungiwe may also want to consider using promote to get things like 508344 through faster15:48
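"Promote" here is the Zuul client command that reorders a shared change queue so a given change is tested first; a hedged example of the invocation (exact client syntax may differ between versions, and the patchset number is only for illustration):

```shell
# Run on the scheduler host; moves change 508344 to the front of the
# gate pipeline's queue.
zuul promote --pipeline gate --changes 508344,1
```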
pabelangerjeblair: not sure, it was stopped15:49
jeblairpabelanger: like, no zuul process running?15:49
pabelangerjeblair: yes, i think because somebody stopped it15:49
jeblairpabelanger: can you look into that?  or do you need me to?15:50
pabelangerI see the following as last lines: http://paste.openstack.org/show/622478/15:50
pabelangerjeblair: yup, looking into history now15:50
pabelangerbut across all zuul-executors, we are having load issues, which is resulting in ansible-playbook timeouts15:50
clarkbpabelanger: that probably makes tobiash's patch an important one15:51
dmsimardclarkb: I was about to say that15:51
pabelangerclarkb: yah, looking at it now15:51
jeblairpabelanger: please put 'ze03 was stopped' on the etherpad list and investigate the cause15:51
pabelangerok15:52
jeblairzuul components don't just stop -- if one does, that's a pretty critical bug15:52
jeblairpabelanger, mordred, fungi: SpamapS change lgtm though i'd like to change the config param.  however, we can land it now if we think it's important.15:53
mordredshould we go ahead and start working on adding more executors? Also - we currently run zuul-web co-located with zuul-scheduler - since zuul-scheduler wants all the memory and CPU at the moment- should we spin up a zuul-web server?15:53
jeblair(we can change the tunable in a followup)15:53
jeblairmordred: afaict, zuul-web uses almost nothing15:53
jeblairmordred: we have 7 idle cpus on zuulv3.o.o, so cpu is not a problem15:53
jeblair(remember, the gil means that the scheduler only gets to use one)15:54
fungipabelanger: the auth log on ze03 doesn't indicate anyone running sudo over the past few days until a few minutes ago when you started working on restarting the service15:54
fungialso, no oom killer events in dmesg15:54
mordredjeblair: I'm fine changing the tunable in the followup - if we can limit the number of jobs a single executor tries to run, that should hopefully at least reduce the instances of playbook timeout because of executor resource starvation15:54
mordredjeblair: ok - cool15:54
jeblairokay, i'll land that now15:55
jeblairand yeah, we should spin up more executors15:55
pabelangerfungi: http://git.openstack.org/cgit/openstack-infra/zuul/tree/zuul/cmd/executor.py?h=feature/zuulv3#n76 seems to be the last log entry.  The line 'Keep running until the parent dies' piqued my interest15:55
Shrewspabelanger: yeah, the log streamer is a forked process of the executor. when the executor goes, we want the streamer to quit too15:57
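The "Keep running until the parent dies" line pabelanger found is the usual fork-plus-pipe trick: the child blocks reading a pipe whose write end only the parent holds, so a read returning EOF means the parent has gone away. A stripped-down sketch of the pattern (not the actual executor code):

```python
import os


def start_child_service(run_service):
    """Fork a helper process that exits when its parent does."""
    pipe_read, pipe_write = os.pipe()
    child_pid = os.fork()
    if child_pid == 0:
        # Child: keep only the read end. The parent holds the write end
        # open for its whole lifetime, so this read() only returns (EOF)
        # once the parent has exited for any reason.
        os.close(pipe_write)
        service = run_service()
        with os.fdopen(pipe_read) as pipe:
            pipe.read()          # blocks until the parent dies
        service.stop()
        os._exit(0)              # skip normal interpreter teardown
    else:
        # Parent: keep only the write end and remember the child pid.
        os.close(pipe_read)
        return child_pid
```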
fungipabelanger: any chance you tried to `journalctl -t zuul-executor` before restarting? right now i only see a traceback from 15:46z which i expect is when you were trying to start it back up15:57
mordredjeblair: should we consider landing tristanC's move-status-to-zuul-web since those status requests will be potential thread contention on the scheduler? or do you think that's unlikely to be worth the distraction?15:57
pabelangerfungi: the traceback is from me when I did service zuul-executor stop; systemd thought the process was still running15:58
mordredAJaeger: ^^ also - see scrollback if you didn't notice that we had conversation in here - especially https://etherpad.openstack.org/p/zuulv3-issues15:59
pabelangerwow, 282 playbooks on ze03 already15:59
pabelanger150 was the average for zuul-launchers15:59
clarkbis spamaps patch for limiting job grabbing up yet?16:00
clarkbbecause that will also help with ^16:00
jeblairclarkb: yeah we were just discussing that; 508649 is +316:01
clarkbawesome so we have a couple of load limiting options going in16:01
AJaegermordred: I read scrollback - thanks16:01
jeblairokay, i think we have a plan.  i'd say that once 508649 lands and is deployed, we should restart the executors.  i have no idea what the status of either graceful or stop are in zuulv3.  restarting the executors at this point may very well be disruptive.16:02
pabelangerokay, ze03 should have the fix from tobiash but up to 110.0 load16:02
clarkbI have a hunch that 8649 will also assist with potential disk consumption concerns16:02
pabelangerwith zuul-executor process taking 115% CPU, I wonder if we should think about adding another zuul-executor or two16:03
clarkbpabelanger: yes was mentioned above we should add more16:03
jeblairclarkb: i am very suspicious about disk consumption.  i think we should avoid assuming that the problem is merely that the disk is full.  the executor for the job you linked had something like 32G available.16:04
pabelangerclarkb: okay, I'll start doing that16:04
clarkbjeblair: ya, I've been watching it on ze04 and seen swings of 10s of GB but not to full yet16:04
clarkbtrying to sort out why it's not in cacti for proper data collection16:04
pabelangerclarkb: I'll do 2 at a time16:05
jeblairpabelanger: to be clear, you are going to look into why ze03 was stopped, right?16:05
jeblairpabelanger: you put "pabelanger started again" on the etherpad, but it's not clear you took the task16:06
pabelangerjeblair: yes, I also added it to the list. Would you like me to do that now?16:06
jeblairpabelanger: it can wait, just wanted to clarify :)16:06
pabelangerjeblair: At first glance, i think our start_log_streamer function finished, which has os._exit(0) in it. But I need to read up on why that would be; based on the comments, our pipe may have died16:07
jeblairokay, i'm going to go back to my hole and look at memory stuff.  i'm not going to be following irc, so ping me if you need me16:07
fungipabelanger: based on the fact that systemd recorded an executor traceback at stop, are you sure there was no running executor process (vs a running process which was simply failing to do anything useful)?16:14
pabelangerfungi: Yes, the traceback is because systemd tried to use zuul stop (via socket) and failed.16:15
fungiokay, got it16:16
fungiyeah, i see now the traceback is for the z-e cli not the daemon16:16
fungiso this isn't another case of hung executors like we had last week16:17
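For reference, the checks fungi walks through above amount to something like the following (the zuul-executor identifier comes from this incident; flags and paths are otherwise ordinary systemd/syslog tooling and may need adjusting per host):

```shell
# Any OOM-killer activity?
dmesg -T | grep -i -E 'oom|killed process'

# What did the unit itself log around the time it stopped?
journalctl -t zuul-executor --since "2017-10-01" --until "2017-10-02 16:00"
systemctl status zuul-executor

# Did anyone stop it by hand?
grep -E 'sudo|systemctl|service zuul' /var/log/auth.log
```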
AJaegerI added reviews for changes that have fixes to the etherpad - releasenotes and neutron. For neutron, it's not clear which approach to use and I would love if pabelanger or mordred could check lines 15/16 in the etherpad16:23
mordredAJaeger: I'm not sure I understand the difference - they both look the same to me for neutron16:31
AJaegermordred: https://review.openstack.org/#/c/508822/ creates new jobs that include neutron - and that is then used everywhere. Alternative is: Adding the requires everywhere with the default job17:15
AJaegerdid I link to the wrong changes?17:15
mordredAJaeger: maybe? I think we need to make the new jobs / project-template - as otherwise we're going to have to just add required-projects to the non-template version and it'll be harder to go back and clean up once we figure out a better strategy for the neutron jobs that need this17:16
AJaegermordred: I'm not following - and tired...17:19
AJaegermordred: so, you propose to follow 508822 and https://review.openstack.org/#/c/508775/3/zuul.d/projects.yaml which uses the new jobs?17:20
AJaegerand suggest to -1 the other changes and ask them to do it the same way?17:20
AJaegermordred: https://review.openstack.org/#/c/508785/16/zuul.d/projects.yaml is the alternative way of doing it - I had wrong links in etherpad. will move around17:28
mordredAJaeger: ah! thanks - that's helpful17:28
mordredAJaeger: maybe a mix of the two - make a new project-template that just makes variants adding the neutron repo - then apply that to networking-* projects - some of them still may need to do things like 508785 did (in that case it's also adding openstack/networking-sfc)17:30
AJaegermordred: So, 508822 as basis - and some others might still need adding more repos. Yeah, works for me.17:31
mordredAJaeger: yah - we could further update 508822 to not actually create new jobs but just do the thing 508785 is doing but in the project-template definition of openstack-python-jobs-neutron17:33
AJaegermordred: do you want to comment and tell frickler about it? Perhaps discuss further on #openstack-infra?17:33
mordredAJaeger: yah - I'll go ping frickler in #openstack-infra?17:35
AJaegerthat's best - thanks17:35
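What the exchange above converges on, roughly, is a project-template whose job variants add openstack/neutron via required-projects rather than defining whole new jobs. A hedged sketch of that shape follows; only the template name openstack-python-jobs-neutron and the networking-sfc example come from the discussion, everything else is illustrative and not the final content of 508822:

```yaml
# Hypothetical sketch only - not the contents of 508822.
- project-template:
    name: openstack-python-jobs-neutron
    check:
      jobs:
        - openstack-tox-py27:
            required-projects:
              - openstack/neutron
    gate:
      jobs:
        - openstack-tox-py27:
            required-projects:
              - openstack/neutron

# A consuming project can still layer on extra repos (as 508785 did with
# openstack/networking-sfc) in its own variant:
- project:
    name: openstack/networking-midonet   # illustrative project name
    templates:
      - openstack-python-jobs-neutron
    check:
      jobs:
        - openstack-tox-py27:
            required-projects:
              - openstack/networking-sfc
```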
pabelangerfungi: clarkb: mordred: all zuul-executors restarted to pickup latest zuul fixes.  And we also have ze09.o.o and ze10.o.o online18:13
clarkbpabelanger: 9 and 10 had the fixes from the start right?18:13
pabelangerclarkb: yes18:13
pabelangerI am hoping all jobs were aborted properly, however ze07.o.o was in a weird state18:14
pabelangerwas running 3 zuul-executor processes18:14
mordredpabelanger: we have not yet figured out why that happens sometimes18:18
pabelangermordred: I'd like to first replace our init.d script with proper systemd scripts (not today) then see if it is still an issue18:19
mordredpabelanger: yes - I imagine that will either have an impact or be one of the things we wind up doing - but a) totally agree, not today and b) I mostly want to make sure we keep track of the main issue - which is that sometimes start/stop is weird, and that the sysvinit/systemd overlap right now may be a cause, a symptom, or something else (and I'd REALLY like to understand the third process18:21
mordredwe sometimes see)18:21
pabelanger++18:22
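For the record, the kind of native unit pabelanger means would look something like this; a hypothetical sketch, not what was deployed, and the paths, user, and flags are assumptions:

```ini
# /etc/systemd/system/zuul-executor.service (hypothetical sketch)
[Unit]
Description=Zuul Executor
After=network.target

[Service]
Type=simple
User=zuul
Group=zuul
# -d assumed here to keep the process in the foreground for systemd
ExecStart=/usr/local/bin/zuul-executor -d
ExecStop=/usr/local/bin/zuul-executor stop
Restart=on-failure

[Install]
WantedBy=multi-user.target
```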
clarkblooks like scheduler ran into swap with the latest reconfig? but it's starting new jobs so still functioning18:37
clarkbalso implies that the schedulers are still happy with the changes made to them18:37
clarkber executors18:37
clarkbwhat do people think of having a queue of changes that are ready for review in the etherpad? for example my tripleo fixes aren't ready for review because we need to collect more data on them and whether or not the fs detection fixes cp. But I am sure there are other changes that are ready and otherwise lost in the shuffle?18:40
clarkbmaybe if you see one of your changes is ready just add it to the list then remove it once it merges?18:40
fungiwfm18:52
pabelangermordred: clarkb: fungi: okay, I am stepping away for the next hour or so. I'll catch up on backscroll then19:44
fungithanks for the heads up, pabelanger19:44
mordredclarkb: wfm20:14
fungisimilarly, if you find a change which is ready and you're going to +1/+2 but it needs another +2 to approve, go ahead and add it20:32
fungito the list in the etherpad20:32
