Monday, 2017-10-02

*** jeblair has joined #openstack-infra-incident15:21
*** dmsimard has joined #openstack-infra-incident15:21
jeblairinfra-root: hi, who's here?15:21
dmsimardI'm not root but I'm here15:21
*** kiennt26 has joined #openstack-infra-incident15:22
fungii am15:22
fungihad to re-find the right buffer for this one15:22
* mordred waves to jeblair15:22
*** pabelanger has joined #openstack-infra-incident15:22
*** Shrews has joined #openstack-infra-incident15:23
*** clarkb has joined #openstack-infra-incident15:24
clarkbo/15:24
pabelangero/15:25
jeblairokay, i think all of us who are awake are here15:25
jeblairi think we should plan how we're going to deal with the flood of problems15:26
fungiagreed15:26
jeblairif i'm going to work on memory usage, i won't be able to deal with anything else.  probably for days.  maybe the whole week.15:26
jeblairso, assuming that doesn't just immediately convince everyone that we should roll back...15:27
fungii figured as much, and am trying not to raise your attention on anything unrelated to the performance/memory stuff15:27
fungii expect the rest of us can handle job configuration related issues15:27
jeblairi think we'll need folks to jump into debugging problems, and if you find something you need me to dig into, put it on a backlog for me15:27
jeblairso maybe we should have an etherpad with all the issues being raised, and who's working on them, and their resolution, and then we can have a section for problems that need deeper debugging15:28
mordredjeblair: works for me15:29
Shrewswfm15:29
pabelangeryah, that works well15:29
jeblairhow about https://etherpad.openstack.org/p/zuulv3-issues ?15:29
clarkbsounds good15:29
mordredmaybe three sections- problems/who's working them - need deeper debugging - and need jeblair15:29
fungii'm also prioritizing review on jeblair's zuul patches since that's basically the one thing which would make me strongly consider rollback at this point, and the sooner performance improves the faster job configuration changes get tested/merged15:29
mordredbecause some deeper debugging is stuff various of us can dig in to15:29
mordredbut there are times when the answer is 'need jeblair to look at XXX' - and we should prioritize which things we raise the jeblair flag on15:30
fungii would argue that unless the problem is more severe than the current performance/resource consumption situation with zuul, we should just find ways to not disturb jeblair15:31
pabelanger++15:31
pabelangerI'm happy to focus on job failures15:31
fungitriage existing problems, and also attempt to help others in the community come up to speed on fixing their own jobs as much as possible to free up more bandwidth for all of us15:32
dmsimardcan we use a review topic for fixes that need to be prioritized ?15:33
dmsimardlike jeblair's performance/memory patches, or other important things15:33
fungii was wondering the same earlier. we could use topic:zuulv3 but there's a bunch of open stuff under that topic already15:34
jeblairyeah, maybe something new; zuulv3-fixes ?15:34
fungithough i expect most of jeblair's critical patches will be project:openstack-infra/zuul branch:feature/zuulv3 and there's not a ton of those15:34
jeblairalso, i'm not expecting a flood of patches related to memory use :)15:34
dmsimardyeah I feel zuulv3 might be overloaded right now (especially with other projects starting to use it for their own purposes)15:35
mordredyah15:35
fungimore considering how to prioritize any random changes which are fixing broken jobs15:35
fungithose could certainly use some topic coordination15:35
dmsimard^15:35
mordredI used 'unbreak-zuulv3' for some over the weekend, but zuulv3-fixes seems less negative15:35
mordredfungi: what about 'zuulv3-jobs' for things fixing a job, 'zuulv3-fixes' for things that are fixing wider issues (like a patch to zuul-cloner, for instance)15:36
*** kiennt26 has quit IRC15:36
fungiyes, let's avoid creating our own negative imagery. we have enough of a public relations challenge just getting the community to not revolt over all the currently broken jobs15:36
fungiso far they've been amazingly tolerant15:37
fungibut i only expect that to last for so long15:37
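For context, the topic names floated above (zuulv3-jobs, zuulv3-fixes) are just Gerrit change topics; a hedged example of how a contributor would tag and find such fixes, assuming the git-review and Gerrit ssh tooling in use at the time:

```shell
# Push a job fix under the proposed topic (topic names here are only
# the ones suggested in this discussion, not a settled convention).
git review -t zuulv3-jobs

# List open changes under a topic, e.g. via the Gerrit ssh query API:
ssh -p 29418 review.openstack.org gerrit query --current-patch-set \
    "status:open topic:zuulv3-fixes"
```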
jeblairi also added a fixed issues section we can move things to, to help folks know when a known issue has been fixed15:37
fungithat'll help, thanks15:38
mordredinfra-manual publishing works again too - so we can also start shifting FAQ content into infra-manual15:38
jeblairwho's working on neutron stadium?15:40
pabelangerit looks like I am focusing on our zuul-executors right now, I am seeing large load on some of them15:41
Shrewsi'll attempt to poke at the branch matcher failure for us. If we don't have a test for that, that's where I'll start with it.15:41
jeblairi moved the disk-filling line from issues with jobs to deeper debugging; looks like we have 2 instances of that now15:42
jeblairoh ha that's the same instance15:42
jeblairShrews: thanks, can you put your name on it in the etherpad?15:42
mordredpabelanger: tobiash submitted a patch which was +A'd this morning (may not have landed yet) which should reduce executor load15:43
ShrewsI can and shall!15:43
fungipabelanger: SpamapS has a feature proposed to limit picking up new jobs when load exceeds a given threshold (2.5 x cpu count currently)15:43
jeblairfungi: oh, maybe i should review that now?15:43
pabelangermordred: ya, we might also need more executors, currently ze04.o.o is running 400 ansible-playbook processes, with load of 8115:43
mordredjeblair: ++15:43
fungijeblair: i'm just now pulling it back up to see what state it's in, but he was working through it over the weekend15:43
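The load-based governor fungi describes (508649) isn't quoted in the log; as a rough sketch of the idea only — class and method names and the numbers are illustrative, not the actual patch — an executor can stop taking new jobs once load average passes a multiple of the CPU count and resume once it drops back:

```python
import multiprocessing
import os
import threading


class LoadGovernor:
    """Illustrative sketch: pause accepting new work under high load.

    'accept_work'/'pause_work' stand in for however the executor
    registers and unregisters itself with the job queue.
    """

    def __init__(self, executor, load_multiplier=2.5, interval=30):
        self.executor = executor
        self.max_load = load_multiplier * multiprocessing.cpu_count()
        self.interval = interval
        self._stop = threading.Event()

    def run(self):
        # Check the 1-minute load average every `interval` seconds.
        while not self._stop.wait(self.interval):
            load_1min = os.getloadavg()[0]
            if load_1min > self.max_load:
                self.executor.pause_work()   # stop taking new jobs
            else:
                self.executor.accept_work()  # resume taking new jobs

    def stop(self):
        self._stop.set()
```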
pabelangermordred: ze03 seems stopped, trying to see why15:43
mordredpabelanger: cool15:44
mordredinfra-root: if you didn't see - simple patch from tobiash which increases the ansible check interval from 0.001 to 0.01 which in his testing significantly reduces CPU load on executors https://review.openstack.org/#/c/508805/ - it has landed, so we should restart executors with it applied15:45
fungijeblair: yeah, i reviewed it and +2'd yesterday, as did tobiash: https://review.openstack.org/50864915:45
Shrewsmordred: cool. but why not an even larger value?15:46
Shrews(not sure what that affects, tbh)15:46
jeblairShrews: it increases the amount of time between tasks, which can become noticeable with lots of small tasks15:47
Shrewsah15:47
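The tradeoff jeblair describes is the usual one for a polling loop: a longer sleep burns less CPU but adds up to that much extra latency on every task transition. A minimal illustration (the loop structure and names are assumptions, not Zuul's actual code; only the 0.001 → 0.01 value comes from the patch discussed above):

```python
import time


def wait_for_task(is_finished, check_interval=0.01):
    """Poll until a task reports completion.

    With check_interval=0.001 this loop wakes 1000 times per second per
    task, which is what was loading the executors; 0.01 cuts that to
    100 wakeups per second at the cost of up to ~10ms added latency per
    task, which becomes noticeable when a playbook runs many tiny tasks.
    """
    while not is_finished():
        time.sleep(check_interval)
```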
* AJaeger joins and reads backscroll15:48
pabelangerokay ze03.o.o started again15:48
jeblairpabelanger: what was wrong with ze03?15:48
fungiwe may also want to consider using promote to get things like 508344 through faster15:48
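"Promote" here is the Zuul client command that reorders a shared change queue so a given change is tested first; a hedged example of the invocation (exact client syntax may differ between versions, and the patchset number is only for illustration):

```shell
# Run on the scheduler host; moves change 508344 to the front of the
# gate pipeline's queue.
zuul promote --pipeline gate --changes 508344,1
```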
pabelangerjeblair: not sure, it was stopped15:49
jeblairpabelanger: like, no zuul process running?15:49
pabelangerjeblair: yes, i think because somebody stopped it15:49
jeblairpabelanger: can you look into that?  or do you need me to?15:50
pabelangerI see the following as last lines: http://paste.openstack.org/show/622478/15:50
pabelangerjeblair: yup, looking into history now15:50
pabelangerbut across all zuul-executors, we are having load issues, which is resulting in ansible-playbook timeouts15:50
clarkbpabelanger: that probably makes tobiash's patch an important one15:51
dmsimardclarkb: I was about to say that15:51
pabelangerclarkb: yah, looking at it now15:51
jeblairpabelanger: please put 'ze03 was stopped' on the etherpad list and investigate the cause15:51
pabelangerok15:52
jeblairzuul components don't just stop -- if one does, that's a pretty critical bug15:52
jeblairpabelanger, mordred, fungi: SpamapS change lgtm though i'd like to change the config param.  however, we can land it now if we think it's important.15:53
mordredshould we go ahead and start working on adding more executors? Also - we currently run zuul-web co-located with zuul-scheduler - since zuul-scheduler wants all the memory and CPU at the moment- should we spin up a zuul-web server?15:53
jeblair(we can change the tunable in a followup)15:53
jeblairmordred: afaict, zuul-web uses almost nothing15:53
jeblairmordred: we have 7 idle cpus on zuulv3.o.o, so cpu is not a problem15:53
jeblair(remember, the gil means that the scheduler only gets to use one)15:54
fungipabelanger: the auth log on ze03 doesn't indicate anyone running sudo over the past few days until a few minutes ago when you started working on restarting the service15:54
fungialso, no oom killer events in dmesg15:54
mordredjeblair: I'm fine changing the tunable in the followup - if we can limit the number of jobs a single executor tries to run, that should hopefully at least reduce the instances of playbook timeout because of executor resource starvation15:54
mordredjeblair: ok - cool15:54
jeblairokay, i'll land that now15:55
jeblairand yeah, we should spin up more executors15:55
pabelangerfungi: http://git.openstack.org/cgit/openstack-infra/zuul/tree/zuul/cmd/executor.py?h=feature/zuulv3#n76 seems to be the last log entry.  The line 'Keep running until the parent dies' piqued my interest15:55
Shrewspabelanger: yeah, the log streamer is a forked process of the executor. when the executor goes, we want the streamer to quit too15:57
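The "Keep running until the parent dies" line pabelanger found is the usual fork-plus-pipe trick: the child blocks reading a pipe whose write end only the parent holds, so a read returning EOF means the parent has gone away. A stripped-down sketch of the pattern (not the actual executor code):

```python
import os


def start_child_service(run_service):
    """Fork a helper process that exits when its parent does."""
    pipe_read, pipe_write = os.pipe()
    child_pid = os.fork()
    if child_pid == 0:
        # Child: keep only the read end. The parent holds the write end
        # open for its whole lifetime, so this read() only returns (EOF)
        # once the parent has exited for any reason.
        os.close(pipe_write)
        service = run_service()
        with os.fdopen(pipe_read) as pipe:
            pipe.read()          # blocks until the parent dies
        service.stop()
        os._exit(0)              # skip normal interpreter teardown
    else:
        # Parent: keep only the write end and remember the child pid.
        os.close(pipe_read)
        return child_pid
```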
fungipabelanger: any chance you tried to `journalctl -t zuul-executor` before restarting? right now i only see a traceback from 15:46z which i expect is when you were trying to start it back up15:57
mordredjeblair: should we consider landing tristanC's move-status-to-zuul-web since those status requests will be potential thread contention on the scheduler? or do you think that's unlikely to be worth the distraction?15:57
pabelangerfungi: the traceback is from me when I did service zuul-executor stop; systemd thought the process was still running15:58
mordredAJaeger: ^^ also - see scrollback if you didn't notice that we had conversation in here - especially https://etherpad.openstack.org/p/zuulv3-issues15:59
pabelangerwow, 282 playbooks on ze03 already15:59
pabelanger150 was the average for zuul-launchers15:59
clarkbis spamaps patch for limiting job grabbing up yet?16:00
clarkbbecause that will also help with ^16:00
jeblairclarkb: yeah we were just discussing that; 508649 is +316:01
clarkbawesome so we have a couple of load limiting options going in16:01
AJaegermordred: I read scrollback - thanks16:01
jeblairokay, i think we have a plan.  i'd say that once 508649 lands and is deployed, we should restart the executors.  i have no idea what the status of either graceful or stop are in zuulv3.  restarting the executors at this point may very well be disruptive.16:02
pabelangerokay, ze03 should have the fix from tobiash but up to 110.0 load16:02
clarkbI have a hunch that 8649 will also assist with potential disk consumption concerns16:02
pabelangerwith zuul-executor process taking 115% CPU, I wonder if we should think about adding another zuul-executor or two16:03
clarkbpabelanger: yes was mentioned above we should add more16:03
jeblairclarkb: i am very suspicious about disk consumption.  i think we should avoid assuming that the problem is merely that the disk is full.  the executor for the job you linked had something like 32G available.16:04
pabelangerclarkb: okay, I'll start doing that16:04
clarkbjeblair: ya, I've been watching it on ze04 and seen swings of 10s of GB but not to full yet16:04
clarkbtrying to sort out why it's not in cacti for proper data collection16:04
pabelangerclarkb: I'll do 2 at a time16:05
jeblairpabelanger: to be clear, you are going to look into why ze03 was stopped, right?16:05
jeblairpabelanger: you put "pabelanger started again" on the etherpad, but it's not clear you took the task16:06
pabelangerjeblair: yes, I also added it to the list. Would you like me to do that now?16:06
jeblairpabelanger: it can wait, just wanted to clarify :)16:06
pabelangerjeblair: At first glance, i think our start_log_streamer function finished, which has os._exit(0) in it. But I need to read up on why that would be; based on the comments, our pipe may have died16:07
jeblairokay, i'm going to go back to my hole and look at memory stuff.  i'm not going to be following irc, so ping me if you need me16:07
fungipabelanger: based on the fact that systemd recorded an executor traceback at stop, are you sure there was no running executor process (vs a running process which was simply failing to do anything useful)?16:14
pabelangerfungi: Yes, the traceback is because systemd tried to use zuul stop (via socket) and failed.16:15
fungiokay, got it16:16
fungiyeah, i see now the traceback is for the z-e cli not the daemon16:16
fungiso this isn't another case of hung executors like we had last week16:17
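For reference, the checks fungi walks through above amount to something like the following (the zuul-executor identifier comes from this incident; flags and paths are otherwise ordinary systemd/syslog tooling and may need adjusting per host):

```shell
# Any OOM-killer activity?
dmesg -T | grep -i -E 'oom|killed process'

# What did the unit itself log around the time it stopped?
journalctl -t zuul-executor --since "2017-10-01" --until "2017-10-02 16:00"
systemctl status zuul-executor

# Did anyone stop it by hand?
grep -E 'sudo|systemctl|service zuul' /var/log/auth.log
```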
AJaegerI added reviews for changes that have fixes to the etherpad - releasenotes and neutron. For neutron, it's not clear which approach to use and I would love if pabelanger or mordred could check lines 15/16 in the etherpad16:23
mordredAJaeger: I'm not sure I understand the difference - they both look the same to me for neutron16:31
AJaegermordred: https://review.openstack.org/#/c/508822/ creates new jobs that include neutron - and that is then used everywhere. Alternative is: Adding the requires everywhere with the default job17:15
AJaegerdid I link to the wrong changes?17:15
mordredAJaeger: maybe? I think we need to make the new jobs / project-template - as otherwise we're going to have to just add required-projects to the non-template version and it'll be harder to go back and clean up once we figure out a better strategy for the neutron jobs that need this17:16
AJaegermordred: I'm not following - and tired...17:19
AJaegermordred: so, you propose to follow 508822 and https://review.openstack.org/#/c/508775/3/zuul.d/projects.yaml which uses the new jobs?17:20
AJaegerand suggest to -1 the other changes and ask them to do it the same way?17:20
AJaegermordred: https://review.openstack.org/#/c/508785/16/zuul.d/projects.yaml is the alternative way of doing it - I had wrong links in etherpad. will move around17:28
mordredAJaeger: ah! thanks - that's helpful17:28
mordredAJaeger: maybe a mix of the two - make a new project-template that just makes variants adding the neutron repo - then apply that to networking-* projects - some of them still may need to do things like 508785 did (in that case it's also adding openstack/networking-sfc)17:30
AJaegermordred: So, 508822 as basis - and some others might still need adding more repos. Yeah, works for me.17:31
mordredAJaeger: yah - we could further update 508822 to not actually create new jobs but just do the thing 508785 is doing but in the project-template definition of openstack-python-jobs-neutron17:33
AJaegermordred: do you want to comment and tell frickler about it? Perhaps discuss further on #openstack-infra?17:33
mordredAJaeger: yah - I'll go ping frickler in #openstack-infra?17:35
AJaegerthat's best - thanks17:35
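What the exchange above converges on, roughly, is a project-template whose job variants add openstack/neutron via required-projects rather than defining whole new jobs. A hedged sketch of that shape follows; only the template name openstack-python-jobs-neutron and the networking-sfc example come from the discussion, everything else is illustrative and not the final content of 508822:

```yaml
# Hypothetical sketch only - not the contents of 508822.
- project-template:
    name: openstack-python-jobs-neutron
    check:
      jobs:
        - openstack-tox-py27:
            required-projects:
              - openstack/neutron
    gate:
      jobs:
        - openstack-tox-py27:
            required-projects:
              - openstack/neutron

# A consuming project can still layer on extra repos (as 508785 did with
# openstack/networking-sfc) in its own variant:
- project:
    name: openstack/networking-midonet   # illustrative project name
    templates:
      - openstack-python-jobs-neutron
    check:
      jobs:
        - openstack-tox-py27:
            required-projects:
              - openstack/networking-sfc
```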
pabelangerfungi: clarkb: mordred: all zuul-executors restarted to pickup latest zuul fixes.  And we also have ze09.o.o and ze10.o.o online18:13
clarkbpabelanger: 9 and 10 had the fixes from the start right?18:13
pabelangerclarkb: yes18:13
pabelangerI am hoping all jobs were aborted properly, however ze07.o.o was in a weird state18:14
pabelangerwas running 3 zuul-executor processes18:14
mordredpabelanger: we have not yet figured out why that happens sometimes18:18
pabelangermordred: I'd like to first replace our init.d script with proper systemd scripts (not today) then see if it is still an issue18:19
mordredpabelanger: yes - I imagine that will either have an impact or be one of the things we wind up doing - but a) totally agree, not today and b) I mostly want to make sure we keep track of the main issue - which is that sometimes start/stop is weird, and that the sysvinit/systemd overlap right now may be a cause, a symptom, or something else (and I'd REALLY like to understand the third process18:21
mordredwe sometimes see)18:21
pabelanger++18:22
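For the record, the kind of native unit pabelanger means would look something like this; a hypothetical sketch, not what was deployed, and the paths, user, and flags are assumptions:

```ini
# /etc/systemd/system/zuul-executor.service (hypothetical sketch)
[Unit]
Description=Zuul Executor
After=network.target

[Service]
Type=simple
User=zuul
Group=zuul
# -d assumed here to keep the process in the foreground for systemd
ExecStart=/usr/local/bin/zuul-executor -d
ExecStop=/usr/local/bin/zuul-executor stop
Restart=on-failure

[Install]
WantedBy=multi-user.target
```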
clarkblooks like scheduler ran into swap with the latest reconfig? but it's starting new jobs so still functioning18:37
clarkbalso implies that the schedulers are still happy with the changes made to them18:37
clarkber executors18:37
clarkbwhat do people think of having a queue of changes that are ready for review in the etherpad? for example my tripleo fixes aren't ready for review because we need to collect more data on them and whether or not the fs detection fixes cp. But I am sure there are other changes that are ready and otherwise lost in the shuffle?18:40
clarkbmaybe if you see one of your changes is ready just add it to the list then remove it once it merges?18:40
fungiwfm18:52
pabelangermordred: clarkb: fungi: okay, I am stepping away for the next hour or so. I'll catch up on backscroll then19:44
fungithanks for the heads up, pabelanger19:44
mordredclarkb: wfm20:14
fungisimilarly, if you find a change which is ready and you're going to +1/+2 but it needs another +2 to approve, go ahead and add it20:32
fungito the list in the etherpad20:32
