Tuesday, 2021-02-02

*** yamamoto has joined #openstack-infra00:26
*** rlandy has quit IRC00:30
*** tjgresha has joined #openstack-infra00:49
-openstackstatus- NOTICE: The Gerrit service on review.opendev.org is being quickly restarted to apply a new security patch00:56
*** gyee has quit IRC01:09
*** tjgresha has quit IRC01:30
*** __ministry1 has joined #openstack-infra01:38
*** rcernin has quit IRC01:44
*** dviroel has quit IRC02:04
*** zzzeek has quit IRC02:07
*** rcernin has joined #openstack-infra02:09
*** zzzeek has joined #openstack-infra02:10
*** tjgresha has joined #openstack-infra02:13
*** zzzeek has quit IRC02:24
*** zzzeek has joined #openstack-infra02:25
*** tjgresha has quit IRC02:30
*** hamalq has quit IRC02:36
*** yonglihe has joined #openstack-infra02:47
*** zzzeek has quit IRC02:48
*** zzzeek has joined #openstack-infra02:52
*** yamamoto has quit IRC03:26
*** yamamoto_ has joined #openstack-infra03:26
*** irclogbot_0 has quit IRC03:27
*** dchen has quit IRC03:34
*** dchen has joined #openstack-infra03:34
*** ykarel has joined #openstack-infra04:11
*** ricolin_ has joined #openstack-infra04:13
*** ykarel_ has joined #openstack-infra04:15
*** ykarel has quit IRC04:17
*** jfan has quit IRC04:26
*** yamamoto has joined #openstack-infra04:26
*** yamamoto_ has quit IRC04:29
*** ramishra has quit IRC04:40
*** irclogbot_0 has joined #openstack-infra04:41
*** irclogbot_0 has quit IRC04:54
*** irclogbot_2 has joined #openstack-infra04:58
*** ramishra has joined #openstack-infra05:04
*** vishalmanchanda has joined #openstack-infra05:07
*** ociuhandu has joined #openstack-infra05:07
*** ociuhandu has quit IRC05:12
*** ykarel_ is now known as ykarel05:16
*** ricolin_ has quit IRC05:35
*** ricolin has joined #openstack-infra05:39
*** priteau has quit IRC05:47
*** ykarel_ has joined #openstack-infra05:51
*** ykarel has quit IRC05:53
*** ykarel_ is now known as ykarel06:22
*** lbragstad_ has joined #openstack-infra06:24
*** lbragstad has quit IRC06:24
*** ysandeep|away is now known as ysandeep06:43
*** ralonsoh has joined #openstack-infra06:49
*** slaweq has joined #openstack-infra06:55
*** jcapitao has joined #openstack-infra07:00
*** sboyron has joined #openstack-infra07:02
*** amoralej|off is now known as amoralej07:05
*** sboyron_ has joined #openstack-infra07:19
*** sboyron has quit IRC07:22
*** eolivare has joined #openstack-infra07:32
*** rcernin has quit IRC07:37
*** slaweq has quit IRC07:40
*** slaweq has joined #openstack-infra07:42
*** xek has joined #openstack-infra07:48
*** ralonsoh has quit IRC07:54
*** dklyle has quit IRC07:59
*** ralonsoh has joined #openstack-infra08:01
*** ralonsoh has quit IRC08:03
*** ralonsoh has joined #openstack-infra08:05
*** dchen has quit IRC08:11
*** hashar has joined #openstack-infra08:13
*** rcernin has joined #openstack-infra08:14
*** rpittau|afk is now known as rpittau08:25
*** andrewbonney has joined #openstack-infra08:27
*** rcernin has quit IRC08:31
*** dtantsur|afk is now known as dtantsur08:36
*** gfidente has joined #openstack-infra08:44
*** kopecmartin has quit IRC08:48
*** kopecmartin has joined #openstack-infra08:50
*** lxkong has quit IRC08:52
*** lxkong has joined #openstack-infra08:53
*** lxkong has quit IRC08:53
*** lxkong has joined #openstack-infra08:54
*** jpena|off is now known as jpena08:56
*** priteau has joined #openstack-infra08:57
*** rcernin has joined #openstack-infra09:00
*** tosky has joined #openstack-infra09:02
*** lucasagomes has joined #openstack-infra09:06
*** hberaud has joined #openstack-infra09:13
*** rcernin has quit IRC09:18
*** rcernin has joined #openstack-infra09:23
*** ociuhandu has joined #openstack-infra09:29
*** d34dh0r53 has quit IRC09:39
*** derekh has joined #openstack-infra09:39
*** d34dh0r53 has joined #openstack-infra09:39
*** ociuhandu has quit IRC09:44
*** ociuhandu has joined #openstack-infra09:44
*** d34dh0r53 has quit IRC09:48
*** d34dh0r53 has joined #openstack-infra09:49
*** rcernin has quit IRC10:08
*** rcernin has joined #openstack-infra10:19
*** tosky has quit IRC10:33
*** tosky has joined #openstack-infra10:34
*** rcernin has quit IRC11:13
*** zbr1 has joined #openstack-infra11:14
*** dviroel has joined #openstack-infra11:15
*** zbr has quit IRC11:16
*** zbr1 is now known as zbr11:16
*** rcernin has joined #openstack-infra11:35
*** gfidente has quit IRC11:35
*** sshnaidm|ruck is now known as sshnaidm|afk11:38
*** ysandeep is now known as ysandeep|afk11:44
*** gfidente has joined #openstack-infra11:47
*** lpetrut has joined #openstack-infra11:51
*** rcernin has quit IRC12:04
*** jcapitao is now known as jcapitao_lunch12:08
*** iurygregory_ has joined #openstack-infra12:09
*** iurygregory has quit IRC12:09
*** piotrowskim has joined #openstack-infra12:12
*** ysandeep|afk is now known as ysandeep12:14
*** yamamoto has quit IRC12:15
<noonedeadpunk> fungi: hi! returning to the question with citycloud. there's some mess in the ticket I created. Is the floating IP https://opendev.org/opendev/system-config/src/branch/master/inventory/base/hosts.yaml#L532-L538 still assigned to the mirror inside your project?  12:19
<noonedeadpunk> which should be https://opendev.org/opendev/system-config/src/branch/master/playbooks/templates/clouds/nodepool_clouds.yaml.j2#L155-L164 right?  12:20
<noonedeadpunk> can you also get the network id or vm id so folks could double check that we're looking at the right thing...  12:21
*** eolivare_ has joined #openstack-infra12:24
*** eolivare has quit IRC12:26
*** rlandy has joined #openstack-infra12:26
<frickler> noonedeadpunk: the old floating IP currently isn't being used by us anymore. there are new ones listed here along with the IDs involved: http://paste.openstack.org/show/xkQX8wH09PsR4JzP7fpr/  12:29
*** hashar is now known as hasharLunch12:29
<noonedeadpunk> aha, gotcha  12:30
*** yamamoto has joined #openstack-infra12:31
*** jpena is now known as jpena|lunch12:36
*** sshnaidm|afk is now known as sshnaidm|ruck12:36
*** ysandeep is now known as ysandeep|mtg12:37
*** yamamoto has quit IRC12:39
*** tbachman has quit IRC12:44
*** eolivare_ has quit IRC12:46
*** tbachman has joined #openstack-infra12:48
*** yamamoto has joined #openstack-infra12:49
*** yamamoto has quit IRC12:50
<frickler> noonedeadpunk: if I look at the router, I see a completely different address there, not sure if that is as designed or may be part of the issue: {"subnet_id": "0cff86a9-a33a-4550-b2ee-f2c909dee4d2", "ip_address": "77.81.6.17"}  12:58
*** amoralej is now known as amoralej|lunch12:59
*** iurygregory_ is now known as iurygregory13:04
*** yamamoto has joined #openstack-infra13:07
*** yamamoto has quit IRC13:07
*** yamamoto has joined #openstack-infra13:08
*** Tengu has quit IRC13:09
*** Tengu has joined #openstack-infra13:10
*** Tengu has quit IRC13:10
*** Tengu has joined #openstack-infra13:10
*** Tengu has quit IRC13:10
*** Tengu has joined #openstack-infra13:11
*** Tengu has quit IRC13:11
*** yamamoto has quit IRC13:12
*** jcapitao_lunch is now known as jcapitao13:14
*** Tengu has joined #openstack-infra13:18
*** hasharLunch is now known as hashar13:18
*** eolivare_ has joined #openstack-infra13:18
<noonedeadpunk> frickler: yeah, folks have moved the router. can you check if this has solved the issue?  13:23
*** jpena|lunch is now known as jpena13:25
<noonedeadpunk> at least they're reachable for me now  13:26
<frickler> noonedeadpunk: I can ping both, too, and log into mirror1, so this seems fine again, thanks for your help  13:50
<frickler> infra-root: I don't have time today to do the followup work of changing the address everywhere, maybe one of you can do that? also not sure what the idea with the second mirror based on focal was? seems it doesn't have all users deployed, likely due to lack of connectivity?  13:52
*** tbachman has quit IRC13:52
*** ociuhandu has quit IRC13:53
*** tbachman has joined #openstack-infra13:55
<dansmith> clarkb: are you able to generate me an updated paste of the percentage of gate resources used by each of the projects? now that neutron has dropped the tripleo jobs I'm curious what the new numbers are  13:58
*** amoralej|lunch is now known as amoralej14:02
*** ociuhandu has joined #openstack-infra14:10
*** sreejithp has joined #openstack-infra14:14
<openstackgerrit> Akihiro Motoki proposed openstack/openstack-zuul-jobs master: translation: Handle renaming of Chinese locales in Django  https://review.opendev.org/c/openstack/openstack-zuul-jobs/+/773689  14:21
<openstackgerrit> Akihiro Motoki proposed openstack/openstack-zuul-jobs master: translation: Handle renaming of Chinese locales in Django  https://review.opendev.org/c/openstack/openstack-zuul-jobs/+/773689  14:24
*** ociuhandu has quit IRC14:30
*** ociuhandu has joined #openstack-infra14:31
*** ociuhandu has quit IRC14:36
*** __ministry1 has quit IRC14:38
*** ociuhandu has joined #openstack-infra14:48
*** ociuhandu has quit IRC14:58
*** dwalt has joined #openstack-infra15:01
*** slaweq has quit IRC15:14
*** slaweq has joined #openstack-infra15:14
*** aarents has quit IRC15:16
*** ociuhandu has joined #openstack-infra15:18
*** ociuhandu has quit IRC15:22
*** ociuhandu has joined #openstack-infra15:23
*** ociuhandu has quit IRC15:23
*** ociuhandu has joined #openstack-infra15:25
<fungi> noonedeadpunk: frickler: yes, after my reboots didn't work, ianw tried replacing the floating ip with a new one (so new address, not updated in dns yet), and when that didn't work he tried to boot a new instance there which also didn't work, but if it had he was considering using it as an opportunity to get the mirror server upgraded  15:26
<fungi> i can work on correcting dns and booting nodes there again for now, to get things back in operation, since i think that's currently the only region supplying some specific node types  15:27
*** hashar is now known as hasharAway15:27
*** lpetrut has quit IRC15:27
*** ociuhandu has quit IRC15:31
*** ociuhandu has joined #openstack-infra15:31
<openstackgerrit> Mohammed Naser proposed openstack/project-config master: Switch to using v3-standard-8 flavors  https://review.opendev.org/c/openstack/project-config/+/773710  15:33
*** ysandeep|mtg is now known as ysandeep15:41
*** ociuhandu has quit IRC15:41
*** gfidente has quit IRC15:47
*** gfidente has joined #openstack-infra15:49
*** ysandeep is now known as ysandeep|away15:54
<openstackgerrit> Merged openstack/project-config master: Switch to using v3-standard-8 flavors  https://review.opendev.org/c/openstack/project-config/+/773710  15:56
<clarkb> dansmith: yes I can regenerate that after meetings today  15:58
<dansmith> clarkb: thanks  15:58
*** dklyle has joined #openstack-infra15:58
*** dklyle has quit IRC15:59
*** david-lyle has joined #openstack-infra15:59
*** amoralej is now known as amoralej|off16:00
*** jamesmcarthur has joined #openstack-infra16:02
*** jamesmcarthur has quit IRC16:03
*** jamesmcarthur has joined #openstack-infra16:03
*** hasharAway is now known as hashar16:04
*** yamamoto has joined #openstack-infra16:17
*** ykarel has quit IRC16:20
*** lbragstad_ is now known as lbragstad16:21
*** david-lyle is now known as dklyle16:24
*** yamamoto has quit IRC16:26
<dansmith> mnaser: do you know if the gerrit instance is running in one of the same flavors that is io-restricted?  16:28
<mnaser> dansmith: gerrit runs at rax afaik  16:28
<dansmith> okay  16:29
*** ociuhandu has joined #openstack-infra16:34
*** ociuhandu has quit IRC16:35
*** ociuhandu has joined #openstack-infra16:35
*** jcapitao has quit IRC16:52
*** jamesmcarthur has quit IRC16:52
*** jamesmcarthur has joined #openstack-infra16:54
*** jamesmcarthur has quit IRC16:57
<fungi> and is on a 64gb ram 16vcpu instance with the data on a cinder-attached ssd volume  16:59
*** jamesmcarthur has joined #openstack-infra17:00
*** lucasagomes has quit IRC17:04
*** ociuhandu has quit IRC17:06
*** ociuhandu has joined #openstack-infra17:07
*** ociuhandu has quit IRC17:13
*** zbr1 has joined #openstack-infra17:16
<clarkb> dansmith: http://paste.openstack.org/show/MIfg7ByqwceE1rFgu8gw/  17:17
<dansmith> clarkb: wow, no real change there  17:17
*** zbr has quit IRC17:18
*** zbr1 is now known as zbr17:18
*** ociuhandu has joined #openstack-infra17:19
<clarkb> dansmith: each report covers a month of logs so we may not see shifts until we roll enough logs over. If we really need it I can modify the script to look at say only the last 7 days instead  17:24
*** ociuhandu has quit IRC17:25
*** ociuhandu has joined #openstack-infra17:36
<dansmith> clarkb: okay, I was on a call when I looked, I see yeah it goes back to like beginning of jan, so fair enough  17:38
*** rlandy is now known as rlandy|biab17:40
*** ociuhandu has quit IRC17:40
*** d34dh0r53 has quit IRC17:42
*** ralonsoh has quit IRC17:45
*** jamesmcarthur has quit IRC17:45
*** d34dh0r53 has joined #openstack-infra17:50
<clarkb> dansmith: ~last week http://paste.openstack.org/show/C4pwUpdgwUDrpW6V6vnC/  17:50
<dansmith> clarkb: ah thanks, sorry you had to do that,  17:51
<dansmith> but that's what I expected to see.. the tripleo number swell to fill the void left by neutron  17:51
<clarkb> like a seesaw  17:52
*** gfidente is now known as gfidente|afk17:52
<dansmith> frustrating  17:52
*** ociuhandu has joined #openstack-infra17:52
*** jpena is now known as jpena|off17:55
<fungi> it's not unexpected. basically the long queue times create backpressure reducing the amount of activity below what it would be if unbounded, so shrinking one source allows the others to naturally expand into the void it leaves  17:55
<fungi> when people see things are merging more readily, they're likely to approve more changes than they would if things were backed up more and taking forever  17:57
<fungi> also when developers are getting feedback from builds more quickly, they can iterate faster and push new revisions with greater frequency  17:58
*** jamesmcarthur has joined #openstack-infra17:59
*** ociuhandu has quit IRC17:59
*** rcernin has joined #openstack-infra17:59
<dansmith> fungi: all the numbers have to add up to 100%, so obviously things have to swell  18:00
<dansmith> without normalizing this against number of commits it's kinda guessing anyway  18:00
<clarkb> ya the original purpose of the script was to determine if the fear that all these new projects were killing the zuul queues was valid  18:02
*** ociuhandu has joined #openstack-infra18:02
<clarkb> it needs work if we want to improve it to do more detailed analysis  18:02
<dansmith> yeah  18:03
<fungi> sort of, if you look at a full week the system is not running at 100% for the duration. it does catch up here and there, mostly on weekends, but the amount of time it spends caught up is where the total work volume becomes apparent  18:03
<dansmith> I think we pretty much need to set some goals for individual job runtime, and maximum number of heavy jobs per commit  18:03
<dansmith> and try to get projects to move things to periodic or experimental outside of that  18:03
*** derekh has quit IRC18:04
*** rcernin has quit IRC18:04
<fungi> but beyond the work volume the system manages to handle in a given week, i'm suggesting there are psychological and logistical effects at play too, that reductions in work volume elsewhere will also be pitted against  18:04
<dansmith> the tripleo guys asked today for what is "reasonable usage" so we should probably just try to define that  18:05
<clarkb> dansmith: In the past I've suggested that using OSA and kolla as comparisons as similar deployment projects may be a valid way of defining things  18:05
<dansmith> yeah,  18:05
*** eolivare_ has quit IRC18:05
<dansmith> I was going to say something like figure out how long a tempest run takes, plus the devstack setup time plus some slack, and use that  18:06
<dansmith> maybe kolla is a better metric  18:06
<fungi> average node-hours per change merged might be a good metric, because efficient developing and reviewing practices can lead to fewer revisions, more resilient/robust jobs mean fewer rechecks, and so on  18:06
*** ociuhandu has quit IRC18:06
<zbr> fungi: do you know that https://opendev.org/opendev/system-config/src/branch/master/playbooks/apply-package-updates.yaml#L1 is a syntax error?  18:06
<dansmith> fungi: sure, just like if tripleo is 50% of all the changes in openstack and using 40% of the gate, then that's not as bad as we think it is  18:07
<fungi> if we're going to have a measuring stick, i'd like it to be one which encourages good practices and efficient use of the system, because that's where perverse incentives will lead people  18:07
<zbr> use of a variable in hosts, not quite something to make ansible happy  18:07
<fungi> zbr: hah, that's amusing  18:07
<clarkb> I'm guessing that was intended to be used with -e target=foo ?  18:08
<zbr> it does pass if you define the variable but  18:08
<zbr> ansible-playbook --syntax-check playbooks/apply-package-updates.yaml  18:08
<zbr> is still a failure  18:08
<clarkb> I mean if it's only ever used in that manner then is it a problem?  18:08
<zbr> the funny bit is that it is not used like this, it is used with --limit, which is the better way of doing it  18:08
<clarkb> zbr: it is used exactly the way I describe it  18:09
<clarkb> in launch node  18:09
<clarkb> with -e target=foo  18:09
<zbr> found it while testing the new linter, which does syntax checking using ansible.  18:09
<zbr> workaround: - hosts: "{{ target | default([]) }}"  18:10
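For readers following along, a minimal sketch of the pattern under discussion — the task body is illustrative and not the actual contents of apply-package-updates.yaml, but the hosts line shows zbr's default() workaround, which lets ansible-playbook --syntax-check resolve hosts while -e target=... still selects the real hosts at run time:

    # Illustrative playbook; assumes invocation like:
    #   ansible-playbook apply-package-updates.yaml -e target=somehost
    # default([]) makes --syntax-check pass when target is not supplied,
    # because the play then simply matches no hosts.
    - hosts: "{{ target | default([]) }}"
      tasks:
        - name: Apply pending package updates (hypothetical task)
          become: true
          ansible.builtin.package:
            name: "*"
            state: latest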
<clarkb> zbr: will it fail if target is defined with -e target?  18:10
<clarkb> if not then it isn't a syntax error as used  18:10
<zbr> clarkb: tell this to syntax check. imho a playbook that crashes when not given some magic inputs is still a syntax error.  18:11
<zbr> it is easy to add a localhost fail-if-not-defined check to avoid it.  18:11
<clarkb> well the whole point is it should only run against the remote, we don't want it to run against localhost  18:12
<clarkb> and no, that wouldn't be a syntax error, it would be a runtime error due to invalid inputs  18:12
<zbr> clarkb: the test for undefined would run on localhost, not the task.  18:12
<zbr> i can show you. give me a few minutes.  18:12
<fungi> it sounds like you're basically saying the syntax checker is broken  18:13
<clarkb> zbr: well I'm asking if there is actually anything to fix here  18:13
<clarkb> can we save 15 minutes and accept that it is correct as used and move on?  18:13
<fungi> because it lacks context for how the playbook is invoked  18:13
<zbr> imho it is not broken, it detects code that is not well written. it is like writing python code and assuming a variable is defined, without checking for it.  18:14
<zbr> yes, as used it works, but that does not make the code good.  18:14
<clarkb> zbr: functions can have required arguments  18:14
<clarkb> if you want to think of the playbook in that way, target is a required parameter  18:15
*** ociuhandu has joined #openstack-infra18:15
<clarkb> not providing it is an error just as calling a python function without the required arguments  18:15
*** ociuhandu has quit IRC18:15
*** ociuhandu has joined #openstack-infra18:15
<zbr> see https://review.opendev.org/c/opendev/system-config/+/773782/1/playbooks/apply-package-updates.yaml  18:17
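The linked change isn't reproduced in the log, but a guard play of the kind zbr describes — fail on localhost before the real play runs — might look something like this (a hypothetical sketch, not the literal review contents):

    # Hypothetical pre-check play; runs on localhost before the real play.
    - hosts: localhost
      gather_facts: false
      tasks:
        - name: Fail early with a clear message if no target was supplied
          ansible.builtin.fail:
            msg: "Specify the hosts to update with -e target=<host-or-group>"
          when: target is not defined
    # ...followed by the original play using the now-validated target variable.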
<clarkb> zbr: but why is that better?  18:17
<clarkb> the end result is the same in both cases, an error because target is not defined  18:17
<clarkb> but one requires you to write a ton of error checking code  18:18
<zbr> because it avoids a crash  18:18
<zbr> try to compare them with python code, usually any piece of encapsulated code should check its inputs, it is just good practice.  18:18
<clarkb> most python functions assume that their required arguments are provided  18:19
<clarkb> and it is up to the caller to get it right, just as is the case here  18:19
<zbr> in that particular case the benefit is minor, but think about other playbooks that may use lots of vars that are needed or not.  18:19
*** jamesmcarthur has quit IRC18:24
*** ociuhandu has quit IRC18:27
*** ociuhandu has joined #openstack-infra18:30
*** ociuhandu has quit IRC18:30
<clarkb> zbr: so the linter is running ansible-lint's syntax checker and the syntax check expects hosts: to always be defined even though it is valid to have a variable there?  18:31
<clarkb> (I'm looking at the change and trying to understand the concern within that context)  18:31
*** sshnaidm|ruck is now known as sshnaidm|afk18:34
*** dtantsur is now known as dtantsur|afk18:35
*** rpittau is now known as rpittau|afk18:39
*** hashar is now known as hasharDinner18:41
*** tdasilva has joined #openstack-infra18:44
<zbr> ansible syntax check expects to be able to resolve hosts, if it fails it will give a syntax error.  18:47
<clarkb> zbr: does that mean it needs an inventory file too?  18:47
<zbr> nope  18:47
<zbr> putting hosts: dskfndlgnlf is perfectly valid, but using jinja2 can produce an error. depends on how you write it.  18:49
<fungi> definitely sounds broken then  18:50
<fungi> if it doesn't care that the hosts value has any meaning, then it should just ignore if it contains variable expansion  18:51
<zbr> the same kind of broken as writing a python function that receives an argument and not checking that its type is ok  18:51
<zbr> imho, it is quite good that they did it like this.  18:51
<fungi> or perhaps ansible-lint needs to be fed whatever variables ansible itself would be supplied on invocation  18:51
*** jamesmcarthur has joined #openstack-infra18:52
<fungi> well, this is more like complaining that a python function requires an argument, without knowing whether the caller will supply that argument  18:52
<zbr> you can take the linter out of this debate, now it is between you and ansible-playbook --syntax-check, something that is already used on many repos.  18:53
<fungi> got it, either way, the idea is that you should avoid valid constructs if they're hard to test/evaluate out of context  18:54
<fungi> there are reasonable points on both sides  18:54
<fungi> is a checker which lacks context a suitable tool to use in every situation? is it worth the effort to alter a correctly working implementation to make it easier to check for correctness?  18:56
*** diablo_rojo_phon has joined #openstack-infra18:56
<fungi> where "correctness" may also be someone's opinion  18:56
<zbr> fungi: take a look at https://docs.ansible.com/ansible/latest/dev_guide/developing_collections.html -- and see playbooks/tasks/ -- i am a bit surprised to see that I need to explain why mixing tasks and playbooks inside a folder is a bad idea.  19:00
<zbr> and this has nothing to do with collections, it is about laying out code.  19:01
<fungi> zbr: it may not be a good idea, but it also may not be worth the time it takes to debate, review and improve if it's already working  19:02
<fungi> it might be worth avoiding doing the same thing in the future, sure  19:02
*** sboyron_ has quit IRC19:03
<zbr> fungi: tbh: system-config is in very good shape by ansible standards, i would refrain from naming other more messy cases i have to deal with ;)  19:03
*** jamesmcarthur has quit IRC19:03
<zbr> and that mixing of tasks/vars/playbooks is quite a common mistake, but now the linter complains about it.  19:04
<zbr> in fact we can blame ansible a little bit for that, with the generic "include" that was deprecated as being so confusing.  19:05
<zbr> i've seen people wondering why they cannot include a playbook from inside a tasks file, again and again.  19:05
<zbr> FYI, the filetype detection uses patterns from https://github.com/ansible-community/ansible-lint/blob/master/src/ansiblelint/config.py#L5-L20 -- the list is not hardcoded and subject to change based on feedback.  19:07
*** jamesmcarthur has joined #openstack-infra19:15
*** andrewbonney has quit IRC19:15
*** rlandy|biab is now known as rlandy19:35
*** hasharDinner is now known as hashar19:39
<openstackgerrit> Merged openstack/project-config master: Revert "Temporarily stop booting nodes in citycloud-kna1"  https://review.opendev.org/c/openstack/project-config/+/773240  19:40
*** tdasilva_ has joined #openstack-infra19:44
*** tdasilva has quit IRC19:46
<dansmith> clarkb: fungi: do you know why this job was paused for 3ish hours? https://zuul.opendev.org/t/openstack/build/8af7cfabcaff4f2b83d26395d6a9b19f/log/job-output.txt#4160  19:55
<clarkb> dansmith: yes, that is the tripleo job that builds all their container images, then it sits around serving them for the child jobs  19:56
<clarkb> I think the breakdown is something like an hour of building and 2 hours of serving  19:56
<fungi> dansmith: looks like it's running a server which other builds are pulling content from  19:56
<fungi> so it has to pause until those builds complete  19:56
<clarkb> it could potentially stop sooner, though I'm not sure if zuul makes that easy (wait for all child jobs to say "we have the data you are serving you can go away now")  19:57
<clarkb> ya it starts at ~11:30 then pauses at ~12:28 after building the images  19:59
<clarkb> then the remaining ~2 hours is spent serving those images to the downstream consuming jobs  19:59
*** rcernin has joined #openstack-infra20:00
*** tdasilva_ has quit IRC20:03
*** tdasilva_ has joined #openstack-infra20:03
*** rcernin has quit IRC20:04
*** jamesmcarthur has quit IRC20:05
*** jamesmcarthur has joined #openstack-infra20:06
*** jamesmcarthur has quit IRC20:07
*** jamesmcarthur has joined #openstack-infra20:12
*** yamamoto has joined #openstack-infra20:23
<dansmith> fungi: clarkb: sorry for the delayed response.. so there's other jobs running that use that worker or something so there's just no output during that time?  20:24
<openstackgerrit> Jeremy Stanley proposed openstack/project-config master: Move bindep to opendev tenant  https://review.opendev.org/c/openstack/project-config/+/773793  20:24
<dansmith> we were wondering if that was zuul pausing a job for a reschedule or something like that  20:24
*** rcernin has joined #openstack-infra20:25
<fungi> dansmith: correct, the "job" actually starts a server which serves content to other builds running as part of the same buildset  20:27
*** yamamoto has quit IRC20:27
<dansmith> fungi: okay, buildset is zuulv3 lingo that does not mean "multiple nodes" right?  20:27
<fungi> when a ref (e.g. a change) is enqueued into a pipeline, builds for each of the selected jobs are started. that collection of builds is a buildset  20:28
<fungi> they get reported together once all builds within the buildset complete  20:29
<dansmith> oh, so one job can serve stuff to another job?  20:29
<dansmith> like, the jobs depend on each other?  20:29
<fungi> so a buildset might be the set of linters, unit tests and functional test jobs which ran  20:29
<fungi> they can depend on each other, yes  20:29
<fungi> and can even interact  20:30
<dansmith> okay, I've never known such a thing, other than multinode jobs  20:30
<fungi> we started doing it initially with our container image testing workflow, where one job sets up a registry server and then other jobs depending on it build and push images into that registry and then yet still other jobs can pull those images and exercise or publish them to a durable location  20:31
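As a rough sketch of how such a dependency chain is declared, a Zuul project pipeline can mark one job as depending on another, and the dependent jobs only start once the parent has succeeded or paused (the job names here are hypothetical, not the actual opendev jobs):

    # Hypothetical project-pipeline config sketching Zuul job dependencies.
    - project:
        check:
          jobs:
            - build-container-images
            - test-with-images:
                dependencies:
                  - build-container-images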
<dansmith> certainly that ends up with workers waiting around for another worker to get to the usable point right?  20:32
<fungi> correct  20:33
<fungi> which, depending on how the jobs are written and what they need to do, could just be a few minutes  20:33
<dansmith> is there anything easy to grep for to figure out how long a worker waited?  20:33
<dansmith> and once a worker has completed its job in a buildset, presumably it goes on to do something else, we don't need a py27 worker hanging around until the end of the devstack worker just because it's the same buildset...  20:34
<dansmith> the reason I ask about the grep'able thing is just curious if there is a way to spot inefficient configurations where one worker ends up waiting 45 minutes for another to get to a usable place  20:35
<fungi> i'd have to get much more familiar with the pausing mechanism, i'm not sure if there's visible evidence of it in the task output  20:36
<dansmith> okay  20:36
<clarkb> the job paused/job resumed lines that you linked are produced by the zuul_return pause thing iirc  20:37
*** jamesmcarthur has quit IRC20:40
<dansmith> right, so the job has some way of entering a "while true: sleep" loop so it can serve, yeah?  20:41
<dansmith> and presumably the dependent jobs need to poll for readiness or be told by zuul that the other job is at the sync point so they can start using it right?  20:42
<clarkb> dansmith: yes, zuul provides an ansible module called zuul_return which allows a job to provide state back to the scheduler  20:42
<clarkb> I think in this case zuul won't start the child jobs until the parent either exits successfully or pauses, so it is quite simple  20:43
<clarkb> and the parent won't stop after being paused until the children are all done  20:43
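Concretely, the pause clarkb mentions is requested through zuul_return; a sketch of what the serving job might run once its content is ready (the surrounding task names are assumptions, but zuul.pause is the documented mechanism):

    # Once pause is set, the job stops executing its run phase here and
    # holds its node until all child jobs in the buildset have finished.
    - hosts: localhost
      tasks:
        - name: Pause this job so child jobs can consume the served content
          zuul_return:
            data:
              zuul:
                pause: true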
<dansmith> oh jeez  20:43
<dansmith> so this thing might be 45 minutes in, hit the pause, and then we start building the thing that is going to need this, which could take 45 minutes on its own such that we're sitting idle for that long?  20:44
<clarkb> dansmith: yes, though in this case it's about 60 minutes then 120 minutes  20:44
<dansmith> 120 minutes for the dependent child thing to build  20:45
<dansmith> ?  20:45
<clarkb> yes  20:45
<dansmith> uh  20:45
<dansmith> am I missing how that's not a super bad waste of resources?  20:45
<clarkb> oh actually it might be 180, how have they managed that? I guess paused jobs aren't subject to timeouts in normal ways  20:46
<clarkb> dansmith: well its intended use is to avoid needing to perform duplicate work in many jobs  20:46
<dansmith> sure, it's a tool, but in this case, it could be working against us it seems like  20:46
<clarkb> basically that first hour is performed once rather than say 5 times in jobs that are all multinode. So we save 4 hours in my contrived example  20:46
<clarkb> but ya it is possible to set it up such that we don't win in the final tally balance  20:47
<dansmith> clarkb: but did I read you right that you can tell that it took 120 minutes to build the child worker and all the time we were sitting idle?  20:47
<dansmith> and if so, can you show me how to figure that math?  20:48
<clarkb> dansmith: https://zuul.opendev.org/t/openstack/build/8af7cfabcaff4f2b83d26395d6a9b19f/log/job-output.txt#4160-4161 shows you the time paused (it was actually almost 3 hours not two). I assumed 2 hours because we have a 3 hour job timeout and it had already spent an hour building images at that point. But I think zuul must not do timeouts in paused jobs in a normal way  20:48
<clarkb> dansmith: look at the timestamps on the left side of the text there  20:48
<dansmith> clarkb: right, I thought those times between the paused and resumed were when this node was busy serving images  20:49
<dansmith> are you saying that's all idle wait time?  20:49
<clarkb> well it is idle from Zuul's perspective.  20:49
<clarkb> logging for any active period while idle from zuul's perspective will depend on the job itself  20:50
<dansmith> heh, okay sure, I'm just wondering how to connect this to the thing that is dependent on it, to figure out if this thing is sitting around longer than it needs  20:50
<dansmith> but maybe the answer is "it's totally dependent on the config of the job"  20:50
<fungi> keep in mind that while it's one node waiting and serving content to other nodes for several hours, all that time there are at least several multi-node jobs *running* and using the content it's serving, so that one node is a fairly small percentage of what's in use  20:50
<dansmith> fungi: well, that's what I was trying to understand,  20:51
<clarkb> dansmith: I don't know where tripleo logs their "idle" workload  20:51
<dansmith> I thought clarkb was saying we don't even start to build the child jobs until this job gets here  20:51
<dansmith> i.e. not parallelizing the builds of the parent and children  20:51
<clarkb> dansmith: correct  20:51
<fungi> and yeah, as clarkb points out, it's not necessarily "idle" in the usual sense, it's not running job tasks but it's doing something (serving content to nodes for other running builds)  20:52
<dansmith> clarkb: okay but you don't know how much of that three hours was build vs serving  20:52
<clarkb> dansmith: I know the build was 1 hour, that all happened before the pause. The serving all happens during the 3 hour pause  20:52
<dansmith> I gotcha, I thought it was asserted that the time between those two markers was just the waiting for build  20:52
<dansmith> clarkb: yeah, I'm talking about the building of the things that depend on this  20:52
<fungi> the three hours pause was the amount of time it took the other builds which say they rely on that to complete  20:53
<dansmith> right, okay got it  20:53
<fungi> once they were done, that build serving the content for them resumed, cleaned up and finished  20:53
<dansmith> so that could be two hours of not using this and one hour of using it, or much worse or much better  20:53
<clarkb> what this setup has done is avoid needing all of those child jobs to spend an hour doing image builds. So we save roughly 1 hour * num_child_jobs  20:54
<dansmith> clarkb: presumably yes I get that  20:54
<fungi> minus the node which is doing the serving of course  20:54
<clarkb> dansmith: yes, and if all it is doing is serving docker images it is possible that those get pulled like 20 minutes into the pause and then the idle node goes properly idle  20:54
<clarkb> the zuul pause mechanism isn't rich enough to say we're done early you can go away now  20:55
<dansmith> clarkb: right, but it consumes a worker for the full period until those jobs (which were done with this thing in 20 minutes) have finished three hours later  20:55
<dansmith> yeah  20:55
<fungi> if that node spends too much time sitting around because the jobs which pulled images from it in their first five minutes take hours to complete, then it's possible that we still end up using more node-hours overall than if each job had done redundant activity  20:55
<clarkb> dansmith: right, it can likely be optimized further, but I believe this is still an improvement on the simple alternative  20:55
<clarkb> particularly for tripleo which has long image builds  20:55
<dansmith> clarkb: yeah, I'm sure in a lot of cases it is  20:56
<dansmith> I'm just trying to understand what we're looking at  20:56
<clarkb> (and many multinode jobs that need the images)  20:56
<dansmith> it also seems like the kind of thing that could easily be done for convenience when it's just lines in a yaml file, but without realizing the impact  20:56
<dansmith> presumably the three hours could also be the time it takes to build three children, only the last of which actually needs this, all of which are serialized  20:57
<fungi> napkin math, a 4-hour single-node content serving job (1 hour creating the content + 3 hours serving it) which is providing content to three two-node jobs which run for up to three hours saves us 2 node-hours  20:57
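Working fungi's napkin math through (assuming each child job would otherwise have to repeat the 1-hour content build itself):

    shared build:    1 node * 4 h           =  4 node-hours
                     3 jobs * 2 nodes * 3 h = 18 node-hours  -> 22 total
    redundant build: 3 jobs * 2 nodes * 4 h = 24 node-hours
    saving:          24 - 22                =  2 node-hours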
<clarkb> dansmith: yes, if the child jobs don't actually need those resources then we've optimized for a nonexistent problem and likely made things worse  20:57
<dansmith> I'm really not trying to say this is not a gain, it totally is, I'm just saying I can see writing a job with dependencies and not realizing I have created a 4-node serialization that didn't necessarily need to happen, or for which there is a better optimization  20:58
<dansmith> like, if you don't understand all these nuances  20:58
<dansmith> one way to maybe spot that is if you know there's not three hours of work that depends on this worker, that'd be a sign that maybe you've created a monster  20:59
<clarkb> yup I think that is a possibility. My understanding of the tripleo situation is that they do actually need those images, but it is possible there are better ways to get them (like quay or something)  20:59
<dansmith> clarkb: well, right, in the tripleo case it probably is perfect for what they need, but if a four hour job seems longer than necessary, then it might help point to somewhere that you've done something bad  21:00
<fungi> right, my napkin math example shows the break-even point is probably if you have at least 5-6 nodes occupied with redundant work (if it's only 4, then separating it out and serving it costs you more than it nets you), but that will depend to a great extent on the durations  21:02
<dansmith> fungi: I was going to say "or rearranging the dependencies might be more efficient"  21:02
<fungi> yeah  21:02
<dansmith> I dunno how zuul reserves workers, so maybe not,  21:02
<dansmith> but it would be helpful to be able to visualize that somehow  21:03
<clarkb> I think it is fair to say that depending on the situation this tool can make things better or worse from a throughput perspective. A lot of that will depend on the actual job workload. It is also a good indication that something might be worth reviewing if it takes a very long time  21:03
<fungi> workers don't really get reserved, they're satisfied from an available pool  21:03
<clarkb> https://grafana.opendev.org/d/9XCNuphGk/zuul-status?orgId=1 can help visualize some of that (there are also nodepool specific dashboards there as well)  21:04
<fungi> so zuul puts out a node request for a given build and waits for a nodepool launcher to fulfil that request from available resources  21:04
<dansmith> fungi: I would never *dare* to suggest a zuul feature, but presumably there'd also be an optimization where you have a named sync point and zuul can build both in parallel until the point at which they need to both be at a certain state of readiness,  21:05
<dansmith> plus the "okay I'm done with you now" thing clarkb mentioned  21:05
<dansmith> but I'd much rather hear a job author say "I'd like to be able to do X but can't" for one of those situations  21:05
<fungi> which gets a little complicated since dependent builds need their node requests fulfilled from the same nodepool region  21:05
<fungi> dansmith: yeah, that sounds like a potentially useful evolution of the job dependency handling, i don't know what would be involved in implementing it  21:08
<dansmith> yeah, I'm saying I wouldn't even consider it until some job optimizer claims no more can be done without something like that :)  21:08
<fungi> the "i'm done with you now" mechanism could be essentially the same as the sync point mechanism  21:10
<fungi> traffic control, in a more general sense  21:10
<clarkb> another aspect here is that you may be paused and waiting some time for the child jobs to start due to contention or cloud flakiness  21:11
<fungi> everyone stop here, when this point is reached you go but you stay, et cetera  21:11
<clarkb> we should be able to measure that without any job changes, but I'm not sure zuul/nodepool expose that info  21:11
<clarkb> having something like a "time waiting for this to boot" in graphite would be nice though  21:11
<clarkb> corvus: ^ do you know if that is something we already expose?  21:11
<clarkb> we capture a boot time but I'm pretty sure that clock starts once we believe we've got free quota available so it would ignore the time waiting for quota to become available?  21:12
<clarkb> https://grafana.opendev.org/d/4JjHXp2Gk/nodepool?orgId=1 time to ready is the boot time I'm thinking of  21:13
<clarkb> a zuul level time from node request being sent to filled is what I think I'm talking about as being useful  21:14
*** thiago__ has joined #openstack-infra21:14
*** vishalmanchanda has quit IRC21:15
*** tdasilva_ has quit IRC21:17
*** thiago__ has quit IRC21:18
*** thiago__ has joined #openstack-infra21:19
*** tbachman has quit IRC21:19
*** dwalt has quit IRC21:21
*** rcernin has quit IRC21:22
*** tbachman has joined #openstack-infra21:26
*** gfidente|afk has quit IRC21:33
*** thiago__ has quit IRC21:33
*** thiago__ has joined #openstack-infra21:33
*** hashar has quit IRC21:40
*** jamesmcarthur has joined #openstack-infra21:41
*** rcernin has joined #openstack-infra21:52
*** ociuhandu has joined #openstack-infra22:01
*** thiago__ is now known as tdasilva22:04
*** ociuhandu has quit IRC22:05
*** rcernin has quit IRC22:08
*** rcernin has joined #openstack-infra22:09
*** yamamoto has joined #openstack-infra22:12
openstackgerritMerged openstack/project-config master: Move bindep to opendev tenant  https://review.opendev.org/c/openstack/project-config/+/77379322:14
<corvus> ohai, reading  22:19
*** xek has quit IRC22:25
<corvus> clarkb: node request timing is sent to graphite under zuul.nodepool.requests.fulfilled and zuul.nodepool.requests.fulfilled.label.$LABEL  22:29
<corvus> so you can get stats on how long, say, a "centos7" node request takes to fill with the second, or any node request takes to fill with the first  22:30
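For anyone chasing this later: statsd timers conventionally land in Graphite under a stats.timers prefix, so a query for one label's fill time might look like the line below (the prefix and the ubuntu-focal label are assumptions about the deployment, not confirmed in this discussion):

    stats.timers.zuul.nodepool.requests.fulfilled.label.ubuntu-focal.mean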
<clarkb> corvus: oh cool  22:32
<clarkb> and that is the time from request sent to fulfilled from zuul's perspective  22:32
<clarkb> dansmith: ^ so I think you can use that to answer (on average or in typical cases) how long the nodes will be "booting" while the parent job is paused  22:33
*** jamesmcarthur has quit IRC22:49
*** slaweq has quit IRC22:55
*** tdasilva_ has joined #openstack-infra22:57
*** tdasilva has quit IRC22:59
*** yamamoto has quit IRC23:03
*** yamamoto_ has joined #openstack-infra23:03
*** JayF has quit IRC23:13
<dansmith> clarkb: sorry I got distracted.. I think getting it from graphite won't actually answer the real question I have, which is "for what percentage of this 4h was this thing useful to children"  23:13
<dansmith> because the answer is really in the job and how it's used  23:13
<dansmith> so I should just pick apart a couple runs and compare timestamps I think  23:13
<dansmith> because even if it's 30m to find the next node, one image pull or whatever it does before 2h of idle time is what I really want to know :)  23:14
<clarkb> dansmith: ya if you found a representative sample the timestamps in the logs should give you that info too (when did each job start and end and how does that compare with the pause time)  23:14
<dansmith> aye  23:15
<dansmith> someone that knows how that job works may also tell me "oh it's pulling images all the damn time"  23:15
*** JayF has joined #openstack-infra23:17
*** thiago__ has joined #openstack-infra23:18
*** tdasilva_ has quit IRC23:20
*** tdasilva_ has joined #openstack-infra23:20
*** thiago__ has quit IRC23:23
*** calbers has quit IRC23:28
*** dchen has joined #openstack-infra23:32
*** calbers has joined #openstack-infra23:37
<openstackgerrit> Akihiro Motoki proposed openstack/openstack-zuul-jobs master: translation: Handle renaming of Chinese locales in Django  https://review.opendev.org/c/openstack/openstack-zuul-jobs/+/773689  23:53
*** rlandy has quit IRC23:55
