Tuesday, 2021-09-28

01:11 <ianw> clarkb: should we merge 811233 and restart with it?
01:26 <Clark[m]> Ya I think we should. If you approve it I can sort out the restart tomorrow
01:26 <Clark[m]> Or if you want to get it done feel free
01:36 <ianw> i'll get it in and see how we go
02:26 <opendevreview> Merged opendev/system-config master: Properly copy gerrit static files  https://review.opendev.org/c/opendev/system-config/+/811233
02:46 <opendevreview> 赵晨凯 proposed openstack/project-config master: add taibai namesapce and base project  https://review.opendev.org/c/openstack/project-config/+/811290
02:50 <opendevreview> NMG-K proposed openstack/project-config master: add taibai namesapce and base project  https://review.opendev.org/c/openstack/project-config/+/811290
03:17 <opendevreview> NMG-K proposed openstack/project-config master: add taibai namesapce and base project  https://review.opendev.org/c/openstack/project-config/+/811290
03:32 <frickler> ehm, did the latest stuff clean up autoholds? my hold from yesterday evening seems to be gone and there is only one currently which has an id of 0000000000
03:32 <frickler> ah, I should've read all backlog
04:03 <opendevreview> fupingxie proposed openstack/project-config master: test  https://review.opendev.org/c/openstack/project-config/+/811295
04:48 <opendevreview> Ian Wienand proposed opendev/system-config master: Refactor infra-prod jobs for parallel running  https://review.opendev.org/c/opendev/system-config/+/807672
04:48 <opendevreview> Ian Wienand proposed opendev/system-config master: infra-prod: clone source once  https://review.opendev.org/c/opendev/system-config/+/807808
05:51 *** ysandeep|out is now known as ysandeep
07:03 <ianw> clarkb: i doubt i will make the meeting, but i added some notes on the parallel job changes.  i think they're ready for review now
07:05 <ianw> https://hub.docker.com/layers/opendevorg/gerrit/3.2/images/sha256-8d847be97aea80ac1b395819b1a3197ff1e69c5dcb594bec2a16715884b540cc?context=explore
07:05 <ianw> is the latest gerrit image
07:06 <ianw> that matches what we have for gerrit 3.2 tag ... "opendevorg/gerrit@sha256:8d847be97aea80ac1b395819b1a3197ff1e69c5dcb594bec2a16715884b540cc"
07:06 <ianw> so i'll just do a quick restart to pick up the fixed static content changes
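    [Editor's note: a minimal sketch of how the image-digest comparison above can be reproduced on the gerrit host; the image name comes from the log, but running this locally and relying on RepoDigests is an assumption about setup, not a record of what was actually typed.]
        # show the digest of the locally pulled 3.2 tag; it should match the
        # sha256 shown on the hub.docker.com layers page linked above
        docker image inspect --format '{{index .RepoDigests 0}}' opendevorg/gerrit:3.2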
07:08 <ianw> #status log restarted gerrit to pickup https://review.opendev.org/c/opendev/system-config/+/811233
07:08 <opendevstatus> ianw: finished logging
07:09 <opendevreview> NMG-K proposed openstack/project-config master: add taibai namesapce and base project  https://review.opendev.org/c/openstack/project-config/+/811290
07:21 *** ianw is now known as ianw_pto
07:31 *** jpena|off is now known as jpena
09:02 *** ykarel is now known as ykarel|lunch
10:15 *** ykarel|lunch is now known as ykarel
10:44 <opendevreview> Alfredo Moralejo proposed openstack/diskimage-builder master: [WIP] Add support for CentOS Stream 9 in DIB  https://review.opendev.org/c/openstack/diskimage-builder/+/811392
10:47 *** ysandeep is now known as ysandeep|brb
10:59 <opendevreview> Yuriy Shyyan proposed openstack/project-config master: Adjusting tenancy limits for this cloud.  https://review.opendev.org/c/openstack/project-config/+/811395
11:06 <yuriys> ianw: clarkb: just saw from yesterday, unlucky... adjusting limit today. We use the libvirt driver, and libvirt uses the kvm virt_type, not qemu. The biggest issue I'm trying to figure out is why, when a codeset requires multiple instances for testing, they all seem to get started/pushed on the same baremetal node
11:08 <yuriys> it has an explosive effect and doesn't naturally balance out.
11:13 *** ysandeep|brb is now known as ysandeep
11:16 *** bhagyashris_ is now known as bhagyashris|rover
11:16 <yuriys> in the nl files, what does rate: do? can't find that info in the zuul-ci.org docs
11:24 *** jpena is now known as jpena|lunch
11:27 *** dviroel|out is now known as dviroel
12:14 <opendevreview> Alfredo Moralejo proposed openstack/diskimage-builder master: [WIP] Add support for CentOS Stream 9 in DIB  https://review.opendev.org/c/openstack/diskimage-builder/+/811392
12:17 *** jpena|lunch is now known as jpena
12:23 *** ysandeep is now known as ysandeep|brb
12:31 *** ysandeep|brb is now known as ysandeep
14:12 <fungi> yuriys: the rate there is a throttle for how quickly nodepool will make api calls to the provider
14:13 <fungi> some providers have rate limiters in front of their apis and will return errors if nodepool makes too many calls in rapid succession
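    [Editor's note: a minimal sketch of where rate sits in a nodepool launcher config (the nl*.yaml files asked about above); the provider name and value are illustrative, not opendev's actual settings.]
        providers:
          - name: example-provider
            # seconds to wait between API operations against this provider;
            # raise it if the cloud's rate limiter starts returning errors
            rate: 1.0
            # (cloud credentials, diskimages, pools, etc. omitted)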
14:23 <yuriys> I'm just concerned that placement may not be instant, still validating that part. So if it's returning the same node/availability zone to nova, this may be causing instances to be scheduled on the same infra nodes.
14:24 <opendevreview> Merged openstack/project-config master: Adjusting tenancy limits for this cloud.  https://review.opendev.org/c/openstack/project-config/+/811395
14:24 <yuriys> Although there is a whole message queuing system inside, so that part isn't up to your rate limit, but rather how placement handles its queue I suppose.
14:27 <yuriys> Basically the problem I'm trying to solve is that sometimes I'll see a node with, let's say, 11 instances on it while another one only has 5 or 6, so it's not 'balanced'. I've been thinking maybe tweaking overcommits to force the correct balancing behavior, so that let's say once one node gets to 8 or so, it has to be provisioned on some other node and won't be in the node list placement returns.
14:27 <fungi> i guess placement tries to follow something like a round-robin or least-loaded scheme?
14:28 <yuriys> From everything I have seen that is not the case, or maybe we didn't configure that part properly.
14:28 <fungi> least-loaded could work at cross purposes since that might allow one under-utilized host to suddenly get a lot of instances placed
14:28 <yuriys> imo it should always return the least loaded node.
14:28 <fungi> depending on how racy the determination is
14:29 <yuriys> Yeah, when I see a subset of tests return this:
14:29 <yuriys> (victoria) [root@lucky-firefox ~]# for i in 682b7f82-3433-474a-a4c1-76c8a8316abd 64f48d2c-9cf8-4c3d-86f7-017a4f7f6ad8 aaf52bf4-e0a9-41b8-a307-1b0e637bcb69; do openstack server show $i -c OS-EXT-SRV-ATTR:host -f shell; done
14:29 <yuriys> os_ext_srv_attr_host="dashing-tiglon.local"
14:29 <yuriys> os_ext_srv_attr_host="dashing-tiglon.local"
14:29 <yuriys> os_ext_srv_attr_host="dashing-tiglon.local"
14:29 <yuriys> I go full /reeeeee
14:30 <yuriys> And I'm not sure throwing more hardware would solve this problem, unless I figure out why placement is misbehaving basically.
14:31 <yuriys> Otherwise it looks like you'll be like 'hey cloud, give me 3 instances for this test', and it will just create 3 instances on 1 node regardless of how many there are in the cloud.
14:34 <yuriys> Yeah looks like we don't really customize placement, womp womp. It's all like defaults.
14:34 <fungi> i see placement.randomize_allocation_candidates is false by default, i wonder if that would help
14:35 <fungi> just looking through the config sample and docs for it now, i'm unfortunately not particularly familiar with it
14:35 <yuriys> maybe enabling randomize_allocation_candidates would help, idk, worth a try
14:36 *** ykarel is now known as ykarel|away
14:37 <fungi> https://docs.openstack.org/placement/latest/configuration/config.html#placement.randomize_allocation_candidates
14:37 <fungi> yeah, it seems like a long shot
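    [Editor's note: for reference, the option linked above lives in the [placement] section of placement.conf; a minimal sketch showing the non-default setting under discussion, not what this cloud actually runs.]
        [placement]
        # default is false; when true, allocation candidates are shuffled rather
        # than returned in the same order for every identical request
        randomize_allocation_candidates = true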
14:38 <yuriys> that docstring is i think what i'm having happen sometimes: "That is, all things being equal, two requests for allocation candidates will return the same results in the same order"
14:40 <fungi> it looks like more advanced load distribution would maybe have to be done in nova's scheduler still? i'm trying to quickly digest the docs
14:45 <fungi> ahh, maybe what i was expecting to be static configuration is actually behaviors set through the placement api, via creation of "resource providers"?
14:45 <fungi> anyway, i need to switch gears, more meetings on the way
14:46 <yuriys> yup, ty for taking a peek, we'll see how we do with the new limit, tired of ianw yelling at me!
14:57 *** ysandeep is now known as ysandeep|out
15:10 <clarkb> yuriys: fungi: melwitt helped with placement things when we had the leaks and may have input too. Though I think this morning everyone is still trying to sort through the devstack apache issue
15:50 *** marios is now known as marios|out
16:14 <corvus> clarkb: i believe the change at the head of the starlingx gate queue is stuck due to the zuul issue.
16:15 <opendevreview> Alfredo Moralejo proposed openstack/project-config master: Add support for CentOS Stream 9 in nodepool elements  https://review.opendev.org/c/openstack/project-config/+/811442
16:17 <fungi> corvus: clarkb and i are both on a call at the moment but i can try to take a look
16:18 <corvus> fungi: no need, i'm looking into the zuul bug.  at some point we may want to dequeue/enqueue to see if it fixes it, but for now i'd appreciate the opportunity to learn more in situ
16:19 <fungi> corvus: oh, thanks, no problem i can try to get the tests going for that change again once you're done looking at it
16:20 <fungi> i'd probably try a promote on it first just to "reorder" the queue in the same order and see if that would be less disruptive
16:26 <corvus> i have a suspicion that neither would work and we would need to dequeue it and not touch it for 2 hours to fix it (absent external zk intervention)
16:28 <opendevreview> Alfredo Moralejo proposed openstack/diskimage-builder master: [WIP] Add support for CentOS Stream 9 in DIB  https://review.opendev.org/c/openstack/diskimage-builder/+/811392
16:31 *** jpena is now known as jpena|off
16:35 *** artom_ is now known as artom
17:00 <clarkb> corvus: my call is done. Let me know if I can help, but from what I can tell you've got it under control and possibly need reviews for changes in the near future
17:01 <corvus> clarkb: yep, making progress. will update soon.
17:01 <clarkb> thanks
17:06 <corvus> clarkb, fungi: i think i'm done inspecting the state.  i suspect now that a dequeue/enqueue may actually fix the immediate issue (that is, if the dequeue manages to complete).  if you want to try a promote (but i'm 80% confident that won't work), and then dequeue/enqueue on 810014,2 i think that's appropriate.
17:07 <fungi> thanks, i'll try in that sequence
17:12 <fungi> as anticipated, the promote seems to have done nothing
17:14 <fungi> zuul dequeue also doesn't seem to have done anything
17:22 *** slaweq_ is now known as slaweq
17:23 <corvus> fungi: hrm i don't see the dequeue command in the log :/
17:23 <fungi> yeah, i was trying to find it
17:24 <fungi> ran as `sudo docker-compose -f /etc/zuul-scheduler/docker-compose.yaml exec zuul-scheduler zuul dequeue --tenant=openstack --pipeline=gate 810762,6` but that exited 1 so it may not have worked
17:24 <corvus> fungi: wrong change number
17:25 <clarkb> fungi: sudo docker exec zuul-scheduler_scheduler_1 zuul dequeue --tenant openstack --pipeline gate --project openstack/placement --change 809366,1 is what i used last week
17:25 <corvus> oh you were promoting the change behind it
17:25 <fungi> i tried both
17:26 <corvus> adding a '--change' argument like clarkb may help
17:26 <fungi> oh, yep
17:26 <fungi> --help also returns nothing though
17:27 <fungi> sudo docker-compose -f /etc/zuul-scheduler/docker-compose.yaml exec zuul-scheduler zuul dequeue --tenant=openstack --pipeline=gate --change=810014,2
17:27 <fungi> is what i ran just now
17:27 <fungi> i'll try via docker exec instead
17:27 <corvus> fungi: sudo docker-compose -f /etc/zuul-scheduler/docker-compose.yaml exec scheduler zuul --help
17:27 <fungi> oh, probably need --project too
17:27 <corvus> service name is 'scheduler' not 'zuul-scheduler'
17:28 <fungi> aha, yes thank you
17:29 <fungi> okay, now the promote try first
17:29 <fungi> all the builds for 810014,2 are back to a waiting state
17:29 <fungi> i guess we'll know in a moment whether they get nodes assigned
17:30 <fungi> i see builds starting
17:30 <fungi> corvus: the promote seems to have done the trick
17:31 <corvus> fungi: i believe a re-enqueue of 810014,2 should be okay
17:31 <corvus> oh nm
17:31 <corvus> that's the one you promoted :)
17:31 <corvus> so we're all done
17:31 <fungi> yeah
17:32 <fungi> i first tried promoting the stuck change to "reorder" it in the same order
17:32 <fungi> it was just a matter of getting the docker-compose command plumbing correct, thanks!
17:32 <corvus> lemme check the change object ids real quick and see if i can anticipate further problems or not
17:32 <fungi> for posterity i did this:
17:32 <fungi> sudo docker-compose -f /etc/zuul-scheduler/docker-compose.yaml exec scheduler zuul promote --tenant=openstack --pipeline=gate --changes=810014,2fungi@zuul02:~
17:33 <fungi> er, my cursor also seems to have grabbed the prompt on the next line
17:33 <fungi> sudo docker-compose -f /etc/zuul-scheduler/docker-compose.yaml exec scheduler zuul promote --tenant=openstack --pipeline=gate --changes=810014,2
17:33 <fungi> that
17:34 <corvus> fungi: unfortunately, i think that this will not work, it's still using the outdated change object.
17:34 <fungi> status page suggests an eta of 10 minutes 'til merge for 810014,2
17:34 <corvus> fungi: i think it's going to require a dequeue/enqueue (and optionally promote)
17:35 <fungi> i guess it would get stuck at completion?
17:35 <corvus> yep
17:35 <fungi> okay, dequeuing it now
17:36 <fungi> and enqueuing
17:36 <fungi> and promoting
17:36 <corvus> fungi: great, that looks like it's using the new change object, so we should be good
17:36 <fungi> okay, it's at the top of the queue again
17:37 <fungi> thanks corvus!
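    [Editor's note: a consolidated sketch of the recovery sequence used above, for posterity; the compose file path, tenant, pipeline and change are the ones shown in the log, but the --project value was never stated in the channel and the enqueue syntax is assumed to mirror dequeue's.]
        # dequeue the stuck change, then re-enqueue it so zuul builds a fresh change object
        sudo docker-compose -f /etc/zuul-scheduler/docker-compose.yaml exec scheduler \
            zuul dequeue --tenant=openstack --pipeline=gate --project=<project> --change=810014,2
        sudo docker-compose -f /etc/zuul-scheduler/docker-compose.yaml exec scheduler \
            zuul enqueue --tenant=openstack --pipeline=gate --project=<project> --change=810014,2
        # optionally restore its position at the head of the shared queue
        sudo docker-compose -f /etc/zuul-scheduler/docker-compose.yaml exec scheduler \
            zuul promote --tenant=openstack --pipeline=gate --changes=810014,2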
17:37 <fungi> so the bug has to do with outdated change objects in zk?
17:37 <corvus> fungi, Clark: we are highly susceptible to this error; basically, any network issue between zuul<->gerrit could cause this.
17:38 <corvus> fungi: outdated in memory actually; full explanation in commit msg on https://review.opendev.org/811452
17:38 <fungi> oh, cool looking
17:38 <corvus> i think we should restart with that asap.
17:38 <clarkb> I've approved the fix
17:39 <fungi> yes, restart as soon as there are new images sounds prudent
17:39 <corvus> https://zuul.opendev.org/t/openstack/status/change/805981,3 is a lot of jobs
17:41 <fungi> oh wow
17:43 <fungi> i guess they're running all their molecule jobs because of the ansible bump
17:45 <clarkb> I'm amazed they all succeeded
17:52 <clarkb> corvus: any idea why some changes have ended up in zuul's periodic pipeline that don't appear to belong there? https://zuul.opendev.org/t/zuul/status
17:53 <corvus> clarkb: it may be related to the other traceback i haven't started digging into yet.  that was hitting periodic pipelines.
17:53 <corvus> i'm going to afk for a bit, then resume work on that
17:54 <clarkb> ok
18:46 <melwitt> yuriys: re: your placement query from earlier, default behavior of the nova scheduler is to "stack" instances/maximize efficiency. if you want to "spread" instances you can adjust configuration,
18:48 <melwitt> https://docs.openstack.org/nova/latest/configuration/config.html#filter_scheduler.host_subset_size is the main one. increasing it will increase the spread by picking randomly from a subset of hosts that can fit the instance
18:52 <melwitt> yuriys: there is also https://docs.openstack.org/nova/latest/configuration/config.html#filter_scheduler.shuffle_best_same_weighed_hosts which will randomly shuffle hosts that have the same weight to get more spread. this one says it's particularly well suited for ironic deployments
19:00 <melwitt> yuriys: and finally, as fungi mentioned https://docs.openstack.org/placement/latest/configuration/config.html#placement.randomize_allocation_candidates is useful when you have more compute nodes than https://docs.openstack.org/nova/latest/configuration/config.html#scheduler.max_placement_results it will shuffle hosts before truncating at the max results which will allow spread placement
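    [Editor's note: a minimal nova.conf sketch of the "spread" options melwitt lists above; the values are illustrative only, not recommendations for this cloud.]
        [filter_scheduler]
        # choose randomly among the N best-weighted hosts instead of always the top one
        host_subset_size = 5
        # shuffle hosts that end up with identical weights
        shuffle_best_same_weighed_hosts = true

        [scheduler]
        # placement results are truncated at this many candidates (pairs with
        # placement.randomize_allocation_candidates on the placement side)
        max_placement_results = 1000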
19:00 <clarkb> melwitt: thanks for all the pointers!
19:01 <fungi> thanks melwitt! and yeah, i realized after digging deeper that most of the control over that was from the nova side rather than the placement side
19:01 <melwitt> np
19:01 <fungi> clarkb: corvus: looks like promote on 811452 completed roughly 20 minutes ago
19:12 <yuriys> melwitt: clarkb: tyty! that makes big sense!
19:13 <yuriys> I did find the weight docs, and was probably going that way as well. ideally we pick the best suitable host per instance, per create call.
20:10 <clarkb> fungi: if you have time can you weigh in on https://review.opendev.org/c/opendev/system-config/+/810284 I think the replication issue being corrected by network updates shows this isn't necessary, though it may still help improve things
20:10 <clarkb> curious what you think about it given what we've learned
20:12 <corvus> fungi, clarkb: how about i restart zuul now?
20:13 <clarkb> let me see what queues look like
20:13 <clarkb> I don't see any openstack release jobs
20:13 <clarkb> I think we're good and fungi gave them notice a bit earlier
20:13 <clarkb> there is a stack of tripleo changes that may be mergeable in ~17 minutes
20:14 <corvus> cool, restarting now
20:14 <corvus> oh :(
20:14 <clarkb> I don't think that is very critical
20:14 <clarkb> they also release after openstack does
20:14 <corvus> i had just hit enter when i got that msg; so restart is proceeding
20:14 <clarkb> no worries
20:15 <clarkb> The bug is bad enough that we should get it fixed
20:27 <fungi> yeah, sorry, stepped away to do dinner prep but now seems like a good enough time to restart
20:28 <fungi> i'll approve 810284 once zuul's running again
20:32 <corvus> re-enqueuing
20:36 <fungi> clarkb: once 810284 is in for a few days or a week we can check cacti graphs for the gitea servers and see if maybe it helps or worsens cpu, memory, i/o, et cetera
20:36 <clarkb> ++
20:36 <fungi> with the level of churn some projects like nova see, i wouldn't be surprised if a week is a long time to go between repacks
20:36 <clarkb> ya
20:37 <clarkb> one reason I suspected that was it seemed like projects like nova, cinder, ironic, etc were more likely to hit the replication issues. That could be because they are more active or because they are larger (or both). In this case it's because they are more active, but when I wrote that change I was trying to hedge against the various concerns
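    [Editor's note: a rough sketch of what the periodic repack in 810284 boils down to on a gitea backend; the repository path and the choice of git gc --auto are assumptions for illustration, not what the system-config change actually runs.]
        # walk the bare repos gitea serves and let git decide whether a repack is worthwhile
        for repo in /var/gitea/data/git/repositories/*/*.git; do
            git -C "$repo" gc --auto
        done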
20:40 <corvus> re-enqueue complete
20:40 <fungi> thanks corvus!
20:41 <corvus> #status log restarted all of zuul on commit 29d0534696b3b541701b863bef626f7c804b90f2 to pick up change cache fix
20:41 <opendevstatus> corvus: finished logging
20:44 <priteau> Do we need to recheck any change submitted during the restart?
20:44 <clarkb> priteau: if they don't show up in the status dashboard then yes
20:45 <priteau> Thanks, a recheck has put it in the queue
20:52 *** elodilles is now known as elodilles_pto
21:35 <opendevreview> Merged opendev/system-config master: GC/pack gitea repos every other day  https://review.opendev.org/c/opendev/system-config/+/810284

Generated by irclog2html.py 2.17.2 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!