Tuesday, 2021-02-02

*** yamamoto has joined #openstack-infra00:26
*** rlandy has quit IRC00:30
*** tjgresha has joined #openstack-infra00:49
-openstackstatus- NOTICE: The Gerrit service on review.opendev.org is being quickly restarted to apply a new security patch00:56
*** gyee has quit IRC01:09
*** tjgresha has quit IRC01:30
*** __ministry1 has joined #openstack-infra01:38
*** rcernin has quit IRC01:44
*** dviroel has quit IRC02:04
*** zzzeek has quit IRC02:07
*** rcernin has joined #openstack-infra02:09
*** zzzeek has joined #openstack-infra02:10
*** tjgresha has joined #openstack-infra02:13
*** zzzeek has quit IRC02:24
*** zzzeek has joined #openstack-infra02:25
*** tjgresha has quit IRC02:30
*** hamalq has quit IRC02:36
*** yonglihe has joined #openstack-infra02:47
*** zzzeek has quit IRC02:48
*** zzzeek has joined #openstack-infra02:52
*** yamamoto has quit IRC03:26
*** yamamoto_ has joined #openstack-infra03:26
*** irclogbot_0 has quit IRC03:27
*** dchen has quit IRC03:34
*** dchen has joined #openstack-infra03:34
*** ykarel has joined #openstack-infra04:11
*** ricolin_ has joined #openstack-infra04:13
*** ykarel_ has joined #openstack-infra04:15
*** ykarel has quit IRC04:17
*** jfan has quit IRC04:26
*** yamamoto has joined #openstack-infra04:26
*** yamamoto_ has quit IRC04:29
*** ramishra has quit IRC04:40
*** irclogbot_0 has joined #openstack-infra04:41
*** irclogbot_0 has quit IRC04:54
*** irclogbot_2 has joined #openstack-infra04:58
*** ramishra has joined #openstack-infra05:04
*** vishalmanchanda has joined #openstack-infra05:07
*** ociuhandu has joined #openstack-infra05:07
*** ociuhandu has quit IRC05:12
*** ykarel_ is now known as ykarel05:16
*** ricolin_ has quit IRC05:35
*** ricolin has joined #openstack-infra05:39
*** priteau has quit IRC05:47
*** ykarel_ has joined #openstack-infra05:51
*** ykarel has quit IRC05:53
*** ykarel_ is now known as ykarel06:22
*** lbragstad_ has joined #openstack-infra06:24
*** lbragstad has quit IRC06:24
*** ysandeep|away is now known as ysandeep06:43
*** ralonsoh has joined #openstack-infra06:49
*** slaweq has joined #openstack-infra06:55
*** jcapitao has joined #openstack-infra07:00
*** sboyron has joined #openstack-infra07:02
*** amoralej|off is now known as amoralej07:05
*** sboyron_ has joined #openstack-infra07:19
*** sboyron has quit IRC07:22
*** eolivare has joined #openstack-infra07:32
*** rcernin has quit IRC07:37
*** slaweq has quit IRC07:40
*** slaweq has joined #openstack-infra07:42
*** xek has joined #openstack-infra07:48
*** ralonsoh has quit IRC07:54
*** dklyle has quit IRC07:59
*** ralonsoh has joined #openstack-infra08:01
*** ralonsoh has quit IRC08:03
*** ralonsoh has joined #openstack-infra08:05
*** dchen has quit IRC08:11
*** hashar has joined #openstack-infra08:13
*** rcernin has joined #openstack-infra08:14
*** rpittau|afk is now known as rpittau08:25
*** andrewbonney has joined #openstack-infra08:27
*** rcernin has quit IRC08:31
*** dtantsur|afk is now known as dtantsur08:36
*** gfidente has joined #openstack-infra08:44
*** kopecmartin has quit IRC08:48
*** kopecmartin has joined #openstack-infra08:50
*** lxkong has quit IRC08:52
*** lxkong has joined #openstack-infra08:53
*** lxkong has quit IRC08:53
*** lxkong has joined #openstack-infra08:54
*** jpena|off is now known as jpena08:56
*** priteau has joined #openstack-infra08:57
*** rcernin has joined #openstack-infra09:00
*** tosky has joined #openstack-infra09:02
*** lucasagomes has joined #openstack-infra09:06
*** hberaud has joined #openstack-infra09:13
*** rcernin has quit IRC09:18
*** rcernin has joined #openstack-infra09:23
*** ociuhandu has joined #openstack-infra09:29
*** d34dh0r53 has quit IRC09:39
*** derekh has joined #openstack-infra09:39
*** d34dh0r53 has joined #openstack-infra09:39
*** ociuhandu has quit IRC09:44
*** ociuhandu has joined #openstack-infra09:44
*** d34dh0r53 has quit IRC09:48
*** d34dh0r53 has joined #openstack-infra09:49
*** rcernin has quit IRC10:08
*** rcernin has joined #openstack-infra10:19
*** tosky has quit IRC10:33
*** tosky has joined #openstack-infra10:34
*** rcernin has quit IRC11:13
*** zbr1 has joined #openstack-infra11:14
*** dviroel has joined #openstack-infra11:15
*** zbr has quit IRC11:16
*** zbr1 is now known as zbr11:16
*** rcernin has joined #openstack-infra11:35
*** gfidente has quit IRC11:35
*** sshnaidm|ruck is now known as sshnaidm|afk11:38
*** ysandeep is now known as ysandeep|afk11:44
*** gfidente has joined #openstack-infra11:47
*** lpetrut has joined #openstack-infra11:51
*** rcernin has quit IRC12:04
*** jcapitao is now known as jcapitao_lunch12:08
*** iurygregory_ has joined #openstack-infra12:09
*** iurygregory has quit IRC12:09
*** piotrowskim has joined #openstack-infra12:12
*** ysandeep|afk is now known as ysandeep12:14
*** yamamoto has quit IRC12:15
<noonedeadpunk> fungi: hi! returning to the question with citycloud. there's some mess in the ticket I created. Is the floating IP https://opendev.org/opendev/system-config/src/branch/master/inventory/base/hosts.yaml#L532-L538 still assigned to the mirror inside your project?  12:19
<noonedeadpunk> which should be https://opendev.org/opendev/system-config/src/branch/master/playbooks/templates/clouds/nodepool_clouds.yaml.j2#L155-L164 right?  12:20
<noonedeadpunk> can you also get the network id or vm id so folks could double check that we're looking at the right thing...  12:21
*** eolivare_ has joined #openstack-infra12:24
*** eolivare has quit IRC12:26
*** rlandy has joined #openstack-infra12:26
<frickler> noonedeadpunk: the old floating IP currently isn't being used by us anymore. there are new ones listed here along with the IDs involved: http://paste.openstack.org/show/xkQX8wH09PsR4JzP7fpr/  12:29
*** hashar is now known as hasharLunch12:29
<noonedeadpunk> aha, gotcha  12:30
*** yamamoto has joined #openstack-infra12:31
*** jpena is now known as jpena|lunch12:36
*** sshnaidm|afk is now known as sshnaidm|ruck12:36
*** ysandeep is now known as ysandeep|mtg12:37
*** yamamoto has quit IRC12:39
*** tbachman has quit IRC12:44
*** eolivare_ has quit IRC12:46
*** tbachman has joined #openstack-infra12:48
*** yamamoto has joined #openstack-infra12:49
*** yamamoto has quit IRC12:50
<frickler> noonedeadpunk: if I look at the router, I see a completely different address there, not sure if that is as designed or may be part of the issue: {"subnet_id": "0cff86a9-a33a-4550-b2ee-f2c909dee4d2", "ip_address": "77.81.6.17"}  12:58
*** amoralej is now known as amoralej|lunch12:59
*** iurygregory_ is now known as iurygregory13:04
*** yamamoto has joined #openstack-infra13:07
*** yamamoto has quit IRC13:07
*** yamamoto has joined #openstack-infra13:08
*** Tengu has quit IRC13:09
*** Tengu has joined #openstack-infra13:10
*** Tengu has quit IRC13:10
*** Tengu has joined #openstack-infra13:10
*** Tengu has quit IRC13:10
*** Tengu has joined #openstack-infra13:11
*** Tengu has quit IRC13:11
*** yamamoto has quit IRC13:12
*** jcapitao_lunch is now known as jcapitao13:14
*** Tengu has joined #openstack-infra13:18
*** hasharLunch is now known as hashar13:18
*** eolivare_ has joined #openstack-infra13:18
<noonedeadpunk> frickler: yeah, folks have moved the router. can you check if this has solved the issue?  13:23
*** jpena|lunch is now known as jpena13:25
<noonedeadpunk> at least they're reachable for me now  13:26
<frickler> noonedeadpunk: I can ping both, too, and log into mirror1, so this seems fine again, thanks for your help  13:50
<frickler> infra-root: I don't have time today to do the followup work of changing the address everywhere, maybe one of you can do that? also not sure what the idea with the second mirror based on focal was? seems it doesn't have all users deployed, likely due to lack of connectivity?  13:52
*** tbachman has quit IRC13:52
*** ociuhandu has quit IRC13:53
*** tbachman has joined #openstack-infra13:55
<dansmith> clarkb: are you able to generate me an updated paste of the percentage of gate resources used by each of the projects? now that neutron has dropped the tripleo jobs I'm curious what the new numbers are  13:58
*** amoralej|lunch is now known as amoralej14:02
*** ociuhandu has joined #openstack-infra14:10
*** sreejithp has joined #openstack-infra14:14
<openstackgerrit> Akihiro Motoki proposed openstack/openstack-zuul-jobs master: translation: Handle renaming of Chinese locales in Django  https://review.opendev.org/c/openstack/openstack-zuul-jobs/+/773689  14:21
<openstackgerrit> Akihiro Motoki proposed openstack/openstack-zuul-jobs master: translation: Handle renaming of Chinese locales in Django  https://review.opendev.org/c/openstack/openstack-zuul-jobs/+/773689  14:24
*** ociuhandu has quit IRC14:30
*** ociuhandu has joined #openstack-infra14:31
*** ociuhandu has quit IRC14:36
*** __ministry1 has quit IRC14:38
*** ociuhandu has joined #openstack-infra14:48
*** ociuhandu has quit IRC14:58
*** dwalt has joined #openstack-infra15:01
*** slaweq has quit IRC15:14
*** slaweq has joined #openstack-infra15:14
*** aarents has quit IRC15:16
*** ociuhandu has joined #openstack-infra15:18
*** ociuhandu has quit IRC15:22
*** ociuhandu has joined #openstack-infra15:23
*** ociuhandu has quit IRC15:23
*** ociuhandu has joined #openstack-infra15:25
<fungi> noonedeadpunk: frickler: yes, after my reboots didn't work, ianw tried replacing the floating ip with a new one (so new address, not updated in dns yet), and when that didn't work he tried to boot a new instance there which also didn't work, but if it had he was considering using it as an opportunity to get the mirror server upgraded  15:26
<fungi> i can work on correcting dns and booting nodes there again for now, to get things back in operation, since i think that's currently the only region supplying some specific node types  15:27
*** hashar is now known as hasharAway15:27
*** lpetrut has quit IRC15:27
*** ociuhandu has quit IRC15:31
*** ociuhandu has joined #openstack-infra15:31
<openstackgerrit> Mohammed Naser proposed openstack/project-config master: Switch to using v3-standard-8 flavors  https://review.opendev.org/c/openstack/project-config/+/773710  15:33
*** ysandeep|mtg is now known as ysandeep15:41
*** ociuhandu has quit IRC15:41
*** gfidente has quit IRC15:47
*** gfidente has joined #openstack-infra15:49
*** ysandeep is now known as ysandeep|away15:54
<openstackgerrit> Merged openstack/project-config master: Switch to using v3-standard-8 flavors  https://review.opendev.org/c/openstack/project-config/+/773710  15:56
<clarkb> dansmith: yes I can regenerate that after meetings today  15:58
<dansmith> clarkb: thanks  15:58
*** dklyle has joined #openstack-infra15:58
*** dklyle has quit IRC15:59
*** david-lyle has joined #openstack-infra15:59
*** amoralej is now known as amoralej|off16:00
*** jamesmcarthur has joined #openstack-infra16:02
*** jamesmcarthur has quit IRC16:03
*** jamesmcarthur has joined #openstack-infra16:03
*** hasharAway is now known as hashar16:04
*** yamamoto has joined #openstack-infra16:17
*** ykarel has quit IRC16:20
*** lbragstad_ is now known as lbragstad16:21
*** david-lyle is now known as dklyle16:24
*** yamamoto has quit IRC16:26
<dansmith> mnaser: do you know if the gerrit instance is running in one of the same flavors that is io-restricted?  16:28
<mnaser> dansmith: gerrit runs at rax afaik  16:28
<dansmith> okay  16:29
*** ociuhandu has joined #openstack-infra16:34
*** ociuhandu has quit IRC16:35
*** ociuhandu has joined #openstack-infra16:35
*** jcapitao has quit IRC16:52
*** jamesmcarthur has quit IRC16:52
*** jamesmcarthur has joined #openstack-infra16:54
*** jamesmcarthur has quit IRC16:57
<fungi> and is on a 64gb ram 16vcpu instance with the data on a cinder-attached ssd volume  16:59
*** jamesmcarthur has joined #openstack-infra17:00
*** lucasagomes has quit IRC17:04
*** ociuhandu has quit IRC17:06
*** ociuhandu has joined #openstack-infra17:07
*** ociuhandu has quit IRC17:13
*** zbr1 has joined #openstack-infra17:16
<clarkb> dansmith: http://paste.openstack.org/show/MIfg7ByqwceE1rFgu8gw/  17:17
<dansmith> clarkb: wow, no real change there  17:17
*** zbr has quit IRC17:18
*** zbr1 is now known as zbr17:18
*** ociuhandu has joined #openstack-infra17:19
<clarkb> dansmith: each report covers a month of logs so we may not see shifts until we roll enough logs over. If we really need it I can modify the script to look at say only the last 7 days instead  17:24
*** ociuhandu has quit IRC17:25
*** ociuhandu has joined #openstack-infra17:36
<dansmith> clarkb: okay, I was on a call when I looked, I see yeah it goes back to like beginning of jan, so fair enough  17:38
*** rlandy is now known as rlandy|biab17:40
*** ociuhandu has quit IRC17:40
*** d34dh0r53 has quit IRC17:42
*** ralonsoh has quit IRC17:45
*** jamesmcarthur has quit IRC17:45
*** d34dh0r53 has joined #openstack-infra17:50
<clarkb> dansmith: ~last week http://paste.openstack.org/show/C4pwUpdgwUDrpW6V6vnC/  17:50
<dansmith> clarkb: ah thanks, sorry you had to do that,  17:51
<dansmith> but that's what I expected to see.. the tripleo number swell to fill the void left by neutron  17:51
<clarkb> like a seesaw  17:52
*** gfidente is now known as gfidente|afk17:52
<dansmith> frustrating  17:52
*** ociuhandu has joined #openstack-infra17:52
*** jpena is now known as jpena|off17:55
<fungi> it's not unexpected. basically the long queue times create backpressure reducing the amount of activity below what it would be if unbounded, so shrinking one source allows the others to naturally expand into the void it leaves  17:55
<fungi> when people see things are merging more readily, they're likely to approve more changes than they would if things were backed up more and taking forever  17:57
<fungi> also when developers are getting feedback from builds more quickly, they can iterate faster and push new revisions with greater frequency  17:58
*** jamesmcarthur has joined #openstack-infra17:59
*** ociuhandu has quit IRC17:59
*** rcernin has joined #openstack-infra17:59
<dansmith> fungi: all the numbers have to add up to 100%, so obviously things have to swell  18:00
<dansmith> without normalizing this against number of commits it's kinda guessing anyway  18:00
<clarkb> ya the original purpose of the script was to determine if the fear that all these new projects were killing the zuul queues was valid  18:02
*** ociuhandu has joined #openstack-infra18:02
<clarkb> it needs work if we want to improve it to do more detailed analysis  18:02
<dansmith> yeah  18:03
<fungi> sort of, if you look at a full week the system is not running at 100% for the duration. it does catch up here and there, mostly on weekends, but the amount of time it spends caught up is where the total work volume becomes apparent  18:03
<dansmith> I think we pretty much need to set some goals for individual job runtime, and maximum number of heavy jobs per commit  18:03
<dansmith> and try to get projects to move things to periodic or experimental outside of that  18:03
*** derekh has quit IRC18:04
*** rcernin has quit IRC18:04
<fungi> but beyond the work volume the system manages to handle in a given week, i'm suggesting there are psychological and logistical effects at play too, that reductions in work volume elsewhere will also be pitted against  18:04
<dansmith> the tripleo guys asked today for what is "reasonable usage" so we should probably just try to define that  18:05
<clarkb> dansmith: In the past I've suggested that using OSA and kolla as comparisons as similar deployment projects may be a valid way of defining things  18:05
<dansmith> yeah,  18:05
*** eolivare_ has quit IRC18:05
<dansmith> I was going to say something like figure out how long a tempest run takes, plus the devstack setup time plus some slack, and use that  18:06
<dansmith> maybe kolla is a better metric  18:06
<fungi> average node-hours per change merged might be a good metric, because efficient developing and reviewing practices can lead to fewer revisions, more resilient/robust jobs mean fewer rechecks, and so on  18:06
*** ociuhandu has quit IRC18:06
<zbr> fungi: do you know that https://opendev.org/opendev/system-config/src/branch/master/playbooks/apply-package-updates.yaml#L1 is a syntax error?  18:06
<dansmith> fungi: sure, just like if tripleo is 50% of all the changes in openstack and using 40% of the gate, then that's not as bad as we think it is  18:07
<fungi> if we're going to have a measuring stick, i'd like it to be one which encourages good practices and efficient use of the system, because that's where perverse incentives will lead people  18:07
<zbr> use of a variable in hosts, not quite something to make ansible happy  18:07
<fungi> zbr: hah, that's amusing  18:07
<clarkb> I'm guessing that was intended to be used with -e target=foo ?  18:08
<zbr> it does pass if you define the variable but  18:08
<zbr> ansible-playbook --syntax-check playbooks/apply-package-updates.yaml  18:08
<zbr> is still a failure  18:08
<clarkb> I mean if it's only ever used in that manner then is it a problem?  18:08
<zbr> the funny bit is that it is not used like this, it is used with --limit, which is the better way of doing it  18:08
<clarkb> zbr: it is used exactly the way I describe it  18:09
<clarkb> in launch node  18:09
<clarkb> with -e target=foo  18:09
<zbr> found it while testing the new linter, which does syntax checking using ansible.  18:09
<zbr> workaround: - hosts: "{{ target | default([]) }}"  18:10
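For readers following along, a minimal sketch of the pattern under discussion — the task body is illustrative and not the actual contents of apply-package-updates.yaml, but the hosts line shows zbr's default() workaround, which lets ansible-playbook --syntax-check resolve hosts while -e target=... still selects the real hosts at run time:

    # Illustrative playbook; assumes invocation like:
    #   ansible-playbook apply-package-updates.yaml -e target=somehost
    # default([]) makes --syntax-check pass when target is not supplied,
    # because the play then simply matches no hosts.
    - hosts: "{{ target | default([]) }}"
      tasks:
        - name: Apply pending package updates (hypothetical task)
          become: true
          ansible.builtin.package:
            name: "*"
            state: latest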
<clarkb> zbr: will it fail if target is defined with -e target?  18:10
<clarkb> if not then it isn't a syntax error as used  18:10
<zbr> clarkb: tell this to syntax check. imho a playbook that crashes when not given some magic inputs is still a syntax error.  18:11
<zbr> it is easy to add a localhost fail-if-not-defined check to avoid it.  18:11
<clarkb> well the whole point is it should only run against the remote, we don't want it to run against localhost  18:12
<clarkb> and no, that wouldn't be a syntax error, it would be a runtime error due to invalid inputs  18:12
<zbr> clarkb: the test for undefined would run on localhost, not the task.  18:12
<zbr> i can show you. give me a few minutes.  18:12
<fungi> it sounds like you're basically saying the syntax checker is broken  18:13
<clarkb> zbr: well I'm asking if there is actually anything to fix here  18:13
<clarkb> can we save 15 minutes and accept that it is correct as used and move on?  18:13
<fungi> because it lacks context for how the playbook is invoked  18:13
<zbr> imho it is not broken, it detects code that is not well written. it is like writing python code and assuming a variable is defined, without checking for it.  18:14
<zbr> yes, as used it works, but that does not make the code good.  18:14
<clarkb> zbr: functions can have required arguments  18:14
<clarkb> if you want to think of the playbook in that way, target is a required parameter  18:15
*** ociuhandu has joined #openstack-infra18:15
<clarkb> not providing it is an error just as calling a python function without the required arguments  18:15
*** ociuhandu has quit IRC18:15
*** ociuhandu has joined #openstack-infra18:15
<zbr> see https://review.opendev.org/c/opendev/system-config/+/773782/1/playbooks/apply-package-updates.yaml  18:17
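The linked change isn't reproduced in the log, but a guard play of the kind zbr describes — fail on localhost before the real play runs — might look something like this (a hypothetical sketch, not the literal review contents):

    # Hypothetical pre-check play; runs on localhost before the real play.
    - hosts: localhost
      gather_facts: false
      tasks:
        - name: Fail early with a clear message if no target was supplied
          ansible.builtin.fail:
            msg: "Specify the hosts to update with -e target=<host-or-group>"
          when: target is not defined
    # ...followed by the original play using the now-validated target variable.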
<clarkb> zbr: but why is that better?  18:17
<clarkb> the end result is the same in both cases, an error because target is not defined  18:17
<clarkb> but one requires you to write a ton of error checking code  18:18
<zbr> because it avoids a crash  18:18
<zbr> try to compare them with python code, usually any piece of encapsulated code should check its inputs, it is just good practice.  18:18
<clarkb> most python functions assume that their required arguments are provided  18:19
<clarkb> and it is up to the caller to get it right, just as is the case here  18:19
<zbr> in that particular case the benefit is minor, but think about other playbooks that may use lots of vars that are needed or not.  18:19
*** jamesmcarthur has quit IRC18:24
*** ociuhandu has quit IRC18:27
*** ociuhandu has joined #openstack-infra18:30
*** ociuhandu has quit IRC18:30
<clarkb> zbr: so the linter is running ansible-lint's syntax checker and the syntax check expects hosts: to always be defined even though it is valid to have a variable there?  18:31
<clarkb> (I'm looking at the change and trying to understand the concern within that context)  18:31
*** sshnaidm|ruck is now known as sshnaidm|afk18:34
*** dtantsur is now known as dtantsur|afk18:35
*** rpittau is now known as rpittau|afk18:39
*** hashar is now known as hasharDinner18:41
*** tdasilva has joined #openstack-infra18:44
<zbr> ansible syntax check expects to be able to resolve hosts, if it fails it will give a syntax error.  18:47
<clarkb> zbr: does that mean it needs an inventory file too?  18:47
<zbr> nope  18:47
<zbr> putting hosts: dskfndlgnlf is perfectly valid, but using jinja2 can produce an error. depends on how you write it.  18:49
<fungi> definitely sounds broken then  18:50
<fungi> if it doesn't care that the hosts value has any meaning, then it should just ignore if it contains variable expansion  18:51
<zbr> the same kind of broken as writing a python function that receives an argument and not checking that its type is ok  18:51
<zbr> imho, it is quite good that they did it like this.  18:51
<fungi> or perhaps ansible-lint needs to be fed whatever variables ansible itself would be supplied on invocation  18:51
*** jamesmcarthur has joined #openstack-infra18:52
<fungi> well, this is more like complaining that a python function requires an argument, without knowing whether the caller will supply that argument  18:52
<zbr> you can take the linter out of this debate, now it is between you and ansible-playbook --syntax-check, something that is already used on many repos.  18:53
<fungi> got it, either way, the idea is that you should avoid valid constructs if they're hard to test/evaluate out of context  18:54
<fungi> there are reasonable points on both sides  18:54
<fungi> is a checker which lacks context a suitable tool to use in every situation? is it worth the effort to alter a correctly working implementation to make it easier to check for correctness?  18:56
*** diablo_rojo_phon has joined #openstack-infra18:56
<fungi> where "correctness" may also be someone's opinion  18:56
<zbr> fungi: take a look at https://docs.ansible.com/ansible/latest/dev_guide/developing_collections.html -- and see playbooks/tasks/ -- i am a bit surprised to see that I need to explain why mixing tasks and playbooks inside a folder is a bad idea.  19:00
<zbr> and this has nothing to do with collections, it is about laying out code.  19:01
<fungi> zbr: it may not be a good idea, but it also may not be worth the time it takes to debate, review and improve if it's already working  19:02
<fungi> it might be worth avoiding doing the same thing in the future, sure  19:02
*** sboyron_ has quit IRC19:03
<zbr> fungi: tbh: system-config is in very good shape by ansible standards, i would refrain from naming other more messy cases i have to deal with ;)  19:03
*** jamesmcarthur has quit IRC19:03
<zbr> and that mixing of tasks/vars/playbooks is quite a common mistake, but now the linter complains about it.  19:04
<zbr> in fact we can blame ansible a little bit for that, with the generic "include" that was deprecated as being so confusing.  19:05
<zbr> i've seen people wondering why they cannot include a playbook from inside a tasks file, again and again.  19:05
<zbr> FYI, the filetype detection uses patterns from https://github.com/ansible-community/ansible-lint/blob/master/src/ansiblelint/config.py#L5-L20 -- the list is not hardcoded and subject to change based on feedback.  19:07
*** jamesmcarthur has joined #openstack-infra19:15
*** andrewbonney has quit IRC19:15
*** rlandy|biab is now known as rlandy19:35
*** hasharDinner is now known as hashar19:39
<openstackgerrit> Merged openstack/project-config master: Revert "Temporarily stop booting nodes in citycloud-kna1"  https://review.opendev.org/c/openstack/project-config/+/773240  19:40
*** tdasilva_ has joined #openstack-infra19:44
*** tdasilva has quit IRC19:46
<dansmith> clarkb: fungi: do you know why this job was paused for 3ish hours? https://zuul.opendev.org/t/openstack/build/8af7cfabcaff4f2b83d26395d6a9b19f/log/job-output.txt#4160  19:55
<clarkb> dansmith: yes, that is the tripleo job that builds all their container images, then it sits around serving them for the child jobs  19:56
<clarkb> I think the breakdown is something like an hour of building and 2 hours of serving  19:56
<fungi> dansmith: looks like it's running a server which other builds are pulling content from  19:56
<fungi> so it has to pause until those builds complete  19:56
<clarkb> it could potentially stop sooner, though I'm not sure if zuul makes that easy (wait for all child jobs to say "we have the data you are serving you can go away now")  19:57
<clarkb> ya it starts at ~11:30 then pauses at ~12:28 after building the images  19:59
<clarkb> then the remaining ~2 hours is spent serving those images to the downstream consuming jobs  19:59
*** rcernin has joined #openstack-infra20:00
*** tdasilva_ has quit IRC20:03
*** tdasilva_ has joined #openstack-infra20:03
*** rcernin has quit IRC20:04
*** jamesmcarthur has quit IRC20:05
*** jamesmcarthur has joined #openstack-infra20:06
*** jamesmcarthur has quit IRC20:07
*** jamesmcarthur has joined #openstack-infra20:12
*** yamamoto has joined #openstack-infra20:23
<dansmith> fungi: clarkb: sorry for the delayed response.. so there's other jobs running that use that worker or something so there's just no output during that time?  20:24
<openstackgerrit> Jeremy Stanley proposed openstack/project-config master: Move bindep to opendev tenant  https://review.opendev.org/c/openstack/project-config/+/773793  20:24
<dansmith> we were wondering if that was zuul pausing a job for a reschedule or something like that  20:24
*** rcernin has joined #openstack-infra20:25
<fungi> dansmith: correct, the "job" actually starts a server which serves content to other builds running as part of the same buildset  20:27
*** yamamoto has quit IRC20:27
<dansmith> fungi: okay, buildset is zuulv3 lingo that does not mean "multiple nodes" right?  20:27
<fungi> when a ref (e.g. a change) is enqueued into a pipeline, builds for each of the selected jobs are started. that collection of builds is a buildset  20:28
<fungi> they get reported together once all builds within the buildset complete  20:29
<dansmith> oh, so one job can serve stuff to another job?  20:29
<dansmith> like, the jobs depend on each other?  20:29
<fungi> so a buildset might be the set of linters, unit tests and functional test jobs which ran  20:29
<fungi> they can depend on each other, yes  20:29
<fungi> and can even interact  20:30
<dansmith> okay, I've never known such a thing, other than multinode jobs  20:30
<fungi> we started doing it initially with our container image testing workflow, where one job sets up a registry server and then other jobs depending on it build and push images into that registry and then yet still other jobs can pull those images and exercise or publish them to a durable location  20:31
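As a rough sketch of how such a dependency chain is declared, a Zuul project pipeline can mark one job as depending on another, and the dependent jobs only start once the parent has succeeded or paused (the job names here are hypothetical, not the actual opendev jobs):

    # Hypothetical project-pipeline config sketching Zuul job dependencies.
    - project:
        check:
          jobs:
            - build-container-images
            - test-with-images:
                dependencies:
                  - build-container-images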
<dansmith> certainly that ends up with workers waiting around for another worker to get to the usable point right?  20:32
<fungi> correct  20:33
<fungi> which, depending on how the jobs are written and what they need to do, could just be a few minutes  20:33
<dansmith> is there anything easy to grep for to figure out how long a worker waited?  20:33
<dansmith> and once a worker has completed its job in a buildset, presumably it goes on to do something else, we don't need a py27 worker hanging around until the end of the devstack worker just because it's the same buildset...  20:34
<dansmith> the reason I ask about the grep'able thing is just curious if there is a way to spot inefficient configurations where one worker ends up waiting 45 minutes for another to get to a usable place  20:35
<fungi> i'd have to get much more familiar with the pausing mechanism, i'm not sure if there's visible evidence of it in the task output  20:36
<dansmith> okay  20:36
<clarkb> the job paused/job resumed lines that you linked are produced by the zuul_return pause thing iirc  20:37
*** jamesmcarthur has quit IRC20:40
<dansmith> right, so the job has some way of entering a "while true: sleep" loop so it can serve, yeah?  20:41
<dansmith> and presumably the dependent jobs need to poll for readiness or be told by zuul that the other job is at the sync point so they can start using it right?  20:42
<clarkb> dansmith: yes, zuul provides an ansible module called zuul_return which allows a job to provide state back to the scheduler  20:42
<clarkb> I think in this case zuul won't start the child jobs until the parent either exits successfully or pauses, so it is quite simple  20:43
<clarkb> and the parent won't stop after being paused until the children are all done  20:43
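Concretely, the pause clarkb mentions is requested through zuul_return; a sketch of what the serving job might run once its content is ready (the surrounding task names are assumptions, but zuul.pause is the documented mechanism):

    # Once pause is set, the job stops executing its run phase here and
    # holds its node until all child jobs in the buildset have finished.
    - hosts: localhost
      tasks:
        - name: Pause this job so child jobs can consume the served content
          zuul_return:
            data:
              zuul:
                pause: true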
<dansmith> oh jeez  20:43
<dansmith> so this thing might be 45 minutes in, hit the pause, and then we start building the thing that is going to need this, which could take 45 minutes on its own such that we're sitting idle for that long?  20:44
<clarkb> dansmith: yes, though in this case it's about 60 minutes then 120 minutes  20:44
<dansmith> 120 minutes for the dependent child thing to build  20:45
<dansmith> ?  20:45
<clarkb> yes  20:45
<dansmith> uh  20:45
<dansmith> am I missing how that's not a super bad waste of resources?  20:45
<clarkb> oh actually it might be 180, how have they managed that? I guess paused jobs aren't subject to timeouts in normal ways  20:46
<clarkb> dansmith: well its intended use is to avoid needing to perform duplicate work in many jobs  20:46
<dansmith> sure, it's a tool, but in this case, it could be working against us it seems like  20:46
<clarkb> basically that first hour is performed once rather than say 5 times in jobs that are all multinode. So we save 4 hours in my contrived example  20:46
<clarkb> but ya it is possible to set it up such that we don't win in the final tally balance  20:47
<dansmith> clarkb: but did I read you right that you can tell that it took 120 minutes to build the child worker and all the time we were sitting idle?  20:47
<dansmith> and if so, can you show me how to figure that math?  20:48
<clarkb> dansmith: https://zuul.opendev.org/t/openstack/build/8af7cfabcaff4f2b83d26395d6a9b19f/log/job-output.txt#4160-4161 shows you the time paused (it was actually almost 3 hours not two). I assumed 2 hours because we have a 3 hour job timeout and it had already spent an hour building images at that point. But I think zuul must not do timeouts in paused jobs in a normal way  20:48
<clarkb> dansmith: look at the timestamps on the left side of the text there  20:48
<dansmith> clarkb: right, I thought those times between the paused and resumed were when this node was busy serving images  20:49
<dansmith> are you saying that's all idle wait time?  20:49
<clarkb> well it is idle from Zuul's perspective.  20:49
<clarkb> logging for any active period while idle from zuul's perspective will depend on the job itself  20:50
<dansmith> heh, okay sure, I'm just wondering how to connect this to the thing that is dependent on it, to figure out if this thing is sitting around longer than it needs  20:50
<dansmith> but maybe the answer is "it's totally dependent on the config of the job"  20:50
<fungi> keep in mind that while it's one node waiting and serving content to other nodes for several hours, all that time there are at least several multi-node jobs *running* and using the content it's serving, so that one node is a fairly small percentage of what's in use  20:50
<dansmith> fungi: well, that's what I was trying to understand,  20:51
<clarkb> dansmith: I don't know where tripleo logs their "idle" workload  20:51
<dansmith> I thought clarkb was saying we don't even start to build the child jobs until this job gets here  20:51
<dansmith> i.e. not parallelizing the builds of the parent and children  20:51
<clarkb> dansmith: correct  20:51
<fungi> and yeah, as clarkb points out, it's not necessarily "idle" in the usual sense, it's not running job tasks but it's doing something (serving content to nodes for other running builds)  20:52
<dansmith> clarkb: okay but you don't know how much of that three hours was build vs serving  20:52
<clarkb> dansmith: I know the build was 1 hour, that all happened before the pause. The serving all happens during the 3 hour pause  20:52
<dansmith> I gotcha, I thought it was asserted that the time between those two markers was just the waiting for build  20:52
<dansmith> clarkb: yeah, I'm talking about the building of the things that depend on this  20:52
<fungi> the three hours pause was the amount of time it took the other builds which say they rely on that to complete  20:53
<dansmith> right, okay got it  20:53
<fungi> once they were done, that build serving the content for them resumed, cleaned up and finished  20:53
<dansmith> so that could be two hours of not using this and one hour of using it, or much worse or much better  20:53
<clarkb> what this setup has done is avoid needing all of those child jobs to spend an hour doing image builds. So we save roughly 1 hour * num_child_jobs  20:54
<dansmith> clarkb: presumably yes I get that  20:54
<fungi> minus the node which is doing the serving of course  20:54
<clarkb> dansmith: yes, and if all it is doing is serving docker images it is possible that those get pulled like 20 minutes into the pause and then the idle node goes properly idle  20:54
<clarkb> the zuul pause mechanism isn't rich enough to say we're done early you can go away now  20:55
<dansmith> clarkb: right, but it consumes a worker for the full period until those jobs (which were done with this thing in 20 minutes) have finished three hours later  20:55
<dansmith> yeah  20:55
<fungi> if that node spends too much time sitting around because the jobs which pulled images from it in their first five minutes take hours to complete, then it's possible that we still end up using more node-hours overall than if each job had done redundant activity  20:55
<clarkb> dansmith: right, it can likely be optimized further, but I believe this is still an improvement on the simple alternative  20:55
<clarkb> particularly for tripleo which has long image builds  20:55
<dansmith> clarkb: yeah, I'm sure in a lot of cases it is  20:56
<dansmith> I'm just trying to understand what we're looking at  20:56
<clarkb> (and many multinode jobs that need the images)  20:56
<dansmith> it also seems like the kind of thing that could easily be done for convenience when it's just lines in a yaml file, but without realizing the impact  20:56
<dansmith> presumably the three hours could also be the time it takes to build three children, only the last of which actually needs this, all of which are serialized  20:57
<fungi> napkin math, a 4-hour single-node content serving job (1 hour creating the content + 3 hours serving it) which is providing content to three two-node jobs which run for up to three hours saves us 2 node-hours  20:57
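Working fungi's napkin math through (assuming each child job would otherwise have to repeat the 1-hour content build itself):

    shared build:    1 node * 4 h           =  4 node-hours
                     3 jobs * 2 nodes * 3 h = 18 node-hours  -> 22 total
    redundant build: 3 jobs * 2 nodes * 4 h = 24 node-hours
    saving:          24 - 22                =  2 node-hours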
<clarkb> dansmith: yes, if the child jobs don't actually need those resources then we've optimized for a nonexistent problem and likely made things worse  20:57
<dansmith> I'm really not trying to say this is not a gain, it totally is, I'm just saying I can see writing a job with dependencies and not realizing I have created a 4-node serialization that didn't necessarily need to happen, or for which there is a better optimization  20:58
<dansmith> like, if you don't understand all these nuances  20:58
<dansmith> one way to maybe spot that is if you know there's not three hours of work that depends on this worker, that'd be a sign that maybe you've created a monster  20:59
<clarkb> yup I think that is a possibility. My understanding of the tripleo situation is that they do actually need those images, but it is possible there are better ways to get them (like quay or something)  20:59
<dansmith> clarkb: well, right, in the tripleo case it probably is perfect for what they need, but if a four hour job seems longer than necessary, then it might help point to somewhere that you've done something bad  21:00
<fungi> right, my napkin math example shows the break-even point is probably if you have at least 5-6 nodes occupied with redundant work (if it's only 4, then separating it out and serving it costs you more than it nets you), but that will depend to a great extent on the durations  21:02
<dansmith> fungi: I was going to say "or rearranging the dependencies might be more efficient"  21:02
<fungi> yeah  21:02
<dansmith> I dunno how zuul reserves workers, so maybe not,  21:02
<dansmith> but it would be helpful to be able to visualize that somehow  21:03
<clarkb> I think it is fair to say that depending on the situation this tool can make things better or worse from a throughput perspective. A lot of that will depend on the actual job workload. It is also a good indication that something might be worth reviewing if it takes a very long time  21:03
<fungi> workers don't really get reserved, they're satisfied from an available pool  21:03
<clarkb> https://grafana.opendev.org/d/9XCNuphGk/zuul-status?orgId=1 can help visualize some of that (there are also nodepool specific dashboards there as well)  21:04
<fungi> so zuul puts out a node request for a given build and waits for a nodepool launcher to fulfil that request from available resources  21:04
<dansmith> fungi: I would never *dare* to suggest a zuul feature, but presumably there'd also be an optimization where you have a named sync point and zuul can build both in parallel until the point at which they need to both be at a certain state of readiness,  21:05
<dansmith> plus the "okay I'm done with you now" thing clarkb mentioned  21:05
<dansmith> but I'd much rather hear a job author say "I'd like to be able to do X but can't" for one of those situations  21:05
<fungi> which gets a little complicated since dependent builds need their node requests fulfilled from the same nodepool region  21:05
<fungi> dansmith: yeah, that sounds like a potentially useful evolution of the job dependency handling, i don't know what would be involved in implementing it  21:08
<dansmith> yeah, I'm saying I wouldn't even consider it until some job optimizer claims no more can be done without something like that :)  21:08
<fungi> the "i'm done with you now" mechanism could be essentially the same as the sync point mechanism  21:10
<fungi> traffic control, in a more general sense  21:10
<clarkb> another aspect here is that you may be paused and waiting some time for the child jobs to start due to contention or cloud flakiness  21:11
<fungi> everyone stop here, when this point is reached you go but you stay, et cetera  21:11
<clarkb> we should be able to measure that without any job changes, but I'm not sure zuul/nodepool expose that info  21:11
<clarkb> having something like a "time waiting for this to boot" in graphite would be nice though  21:11
<clarkb> corvus: ^ do you know if that is something we already expose?  21:11
<clarkb> we capture a boot time but I'm pretty sure that clock starts once we believe we've got free quota available so it would ignore the time waiting for quota to become available?  21:12
<clarkb> https://grafana.opendev.org/d/4JjHXp2Gk/nodepool?orgId=1 time to ready is the boot time I'm thinking of  21:13
<clarkb> a zuul level time from node request being sent to filled is what I think I'm talking about as being useful  21:14
*** thiago__ has joined #openstack-infra21:14
*** vishalmanchanda has quit IRC21:15
*** tdasilva_ has quit IRC21:17
*** thiago__ has quit IRC21:18
*** thiago__ has joined #openstack-infra21:19
*** tbachman has quit IRC21:19
*** dwalt has quit IRC21:21
*** rcernin has quit IRC21:22
*** tbachman has joined #openstack-infra21:26
*** gfidente|afk has quit IRC21:33
*** thiago__ has quit IRC21:33
*** thiago__ has joined #openstack-infra21:33
*** hashar has quit IRC21:40
*** jamesmcarthur has joined #openstack-infra21:41
*** rcernin has joined #openstack-infra21:52
*** ociuhandu has joined #openstack-infra22:01
*** thiago__ is now known as tdasilva22:04
*** ociuhandu has quit IRC22:05
*** rcernin has quit IRC22:08
*** rcernin has joined #openstack-infra22:09
*** yamamoto has joined #openstack-infra22:12
openstackgerritMerged openstack/project-config master: Move bindep to opendev tenant  https://review.opendev.org/c/openstack/project-config/+/77379322:14
<corvus> ohai, reading  22:19
*** xek has quit IRC22:25
<corvus> clarkb: node request timing is sent to graphite under zuul.nodepool.requests.fulfilled and zuul.nodepool.requests.fulfilled.label.$LABEL  22:29
<corvus> so you can get stats on how long, say, a "centos7" node request takes to fill with the second, or any node request takes to fill with the first  22:30
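For anyone chasing this later: statsd timers conventionally land in Graphite under a stats.timers prefix, so a query for one label's fill time might look like the line below (the prefix and the ubuntu-focal label are assumptions about the deployment, not confirmed in this discussion):

    stats.timers.zuul.nodepool.requests.fulfilled.label.ubuntu-focal.mean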
<clarkb> corvus: oh cool  22:32
<clarkb> and that is the time from request sent to fulfilled from zuul's perspective  22:32
<clarkb> dansmith: ^ so I think you can use that to answer (on average or in typical cases) how long the nodes will be "booting" while the parent job is paused  22:33
*** jamesmcarthur has quit IRC22:49
*** slaweq has quit IRC22:55
*** tdasilva_ has joined #openstack-infra22:57
*** tdasilva has quit IRC22:59
*** yamamoto has quit IRC23:03
*** yamamoto_ has joined #openstack-infra23:03
*** JayF has quit IRC23:13
<dansmith> clarkb: sorry I got distracted.. I think getting it from graphite won't actually answer the real question I have, which is "for what percentage of this 4h was this thing useful to children"  23:13
<dansmith> because the answer is really in the job and how it's used  23:13
<dansmith> so I should just pick apart a couple runs and compare timestamps I think  23:13
<dansmith> because even if it's 30m to find the next node, one image pull or whatever it does before 2h of idle time is what I really want to know :)  23:14
<clarkb> dansmith: ya if you found a representative sample the timestamps in the logs should give you that info too (when did each job start and end and how does that compare with the pause time)  23:14
<dansmith> aye  23:15
<dansmith> someone that knows how that job works may also tell me "oh it's pulling images all the damn time"  23:15
*** JayF has joined #openstack-infra23:17
*** thiago__ has joined #openstack-infra23:18
*** tdasilva_ has quit IRC23:20
*** tdasilva_ has joined #openstack-infra23:20
*** thiago__ has quit IRC23:23
*** calbers has quit IRC23:28
*** dchen has joined #openstack-infra23:32
*** calbers has joined #openstack-infra23:37
<openstackgerrit> Akihiro Motoki proposed openstack/openstack-zuul-jobs master: translation: Handle renaming of Chinese locales in Django  https://review.opendev.org/c/openstack/openstack-zuul-jobs/+/773689  23:53
*** rlandy has quit IRC23:55
