Tuesday, 2021-09-21

Clark[m]fungi: TheJulia: we're over subscribed in that cloud and have that ability to control that to an extent00:45
Clark[m]I think the idea is we don't know what the right balance is so we have to fiddle with it00:46
fungiyep00:47
yuriysWe did scale down a bit as of 10 hours ago. But yeah, performance over availability there.00:47
yuriysHow can I get runtime data for the 809895 run?00:48
Clark[m]yuriys I can get links in a bit. Finishing dinner first00:50
ianwyuriys: i'd suspect it was https://zuul.opendev.org/t/openstack/build/626c1caaf4e34e91b4a1b961e3a2a21d/00:51
fungiyeah, i think that's likely it00:53
fungiit's the only voting job to be reported on change 809895 with a failure state in the most recent buildset00:53
clarkbok just sent out tomorrow's meeting agenda. Sorry that was late01:24
yuriysWe can probably scale down a bit more, maybe 32-36 limit. I still saw some ooms after we went down to 40. We had 1 launch error in the last 10ish hours, but I'm assuming that test runtime is outside of that.01:36
yuriysIf test duration is a good show of performance, I'm interested in that as well, longterm.01:37
clarkbthere are likely tests that are a decent indicator of that, but I'm not sure what those might be.01:38
clarkband ya the test runtime should be orthogonal to launch failures. This would be after we have a VM, how quickly can it run the job content01:38
fungibut also, as we've observed, performance on one vm in isolation is often far better than performance on a vm when the same hosts are also running a bunch of other test instances flat out all at the same time01:40
fungithe "noisy neighbor" effect01:41
clarkbya and we see that across clouds01:42
yuriysI want to say this 'expansion' created a lot of internal tickets for us, from how we approach overcommitting resources, to memory optimizations, to even ceph optimizations. Like one of the OSDs is caching at an insane rate, we're at 38GB for the 3 OSDs, which is hilarious since we set osd_memory_target at 4G...01:42
yuriysAlso in all brutal honesty, for your use case ceph is actually bad. We'd want to provision LVM on these NVMes, which we're testing in-house now.01:43
fungiyeah, donnyd had observed that local storage on the compute nodes performed far better01:46
fungithat's what he ended up doing in the fortnebula cloud01:46
clarkbhe might have input on over subscription ratios too01:46
yuriysI'm calling it a night, if you guys see performance issues, feel free to scale down (maybe to 32). We'll eventually find the right tenant maximum that doesn't impact any testing negatively, and that will help us later making correct calculations when scaling hardware. I'm not concerned over getting big numbers here, just finding the right numbers, and getting some experience with what bad numbers end up doing to infra.02:00
yuriysInterestingly enough I ended up watching Deploy Friday: E50 during this chat, which is heavily Zuul focused, many shoutouts to what you guys do there.02:01
clarkbhttps://www.youtube.com/watch?v=2c3qJ851QVI neat02:04
fungiwoah02:06
TheJuliaIs it bad I didn't remember which video that was until I saw myself and the background?03:46
*** ysandeep|away is now known as ysandeep05:38
*** rpittau|afk is now known as rpittau07:24
*** jpena|off is now known as jpena07:28
*** ykarel is now known as ykarel|away07:34
opendevreviewdaniel.pawlik proposed opendev/puppet-log_processor master: Add capability with python3; add log request cert verify  https://review.opendev.org/c/opendev/puppet-log_processor/+/80942407:38
*** ysandeep is now known as ysandeep|lunch07:48
opendevreviewBalazs Gibizer proposed opendev/irc-meetings master: Add Sylvain as nova meeting chair  https://review.opendev.org/c/opendev/irc-meetings/+/81016508:23
opendevreviewBalazs Gibizer proposed opendev/irc-meetings master: Add Sylvain as nova meeting chair  https://review.opendev.org/c/opendev/irc-meetings/+/81016508:26
*** ysandeep|lunch is now known as ysandeep09:09
*** ykarel|away is now known as ykarel10:26
*** frenzy_friday is now known as anbanerj|ruck10:36
*** jpena is now known as jpena|lunch11:24
*** dviroel|out is now known as dviroel11:31
*** ysandeep is now known as ysandeep|afk11:47
opendevreviewYuriy Shyyan proposed openstack/project-config master: Scaling down InMotion nodepool resource.  https://review.opendev.org/c/openstack/project-config/+/81021312:04
*** jpena|lunch is now known as jpena12:22
*** ysandeep|afk is now known as ysandeep12:39
opendevreviewMerged opendev/irc-meetings master: Add Sylvain as nova meeting chair  https://review.opendev.org/c/opendev/irc-meetings/+/81016512:53
*** tristanC_ is now known as tristanC13:16
fungiTheJulia: you did a good job in it, i watched it all the way through last night13:21
TheJulia\o/13:22
yuriysNothing like waking up to a fire. fungi can you approve|c/r the scale down please.13:22
fungiyuriys: just did, sorry didn't spot it until moments before you asked13:23
yuriysAwesome ty. I already adjusted overcommit stuff on the cloud itself.13:23
fungiyuriys: also nodepool will pay attention to quotas, so if the cloud side scales down the ram, cpu or disk quota it will adjust its expectations accordingly13:24
yuriysVery cool, did not know. Will probably tackle via quotas as well then.13:26
fungiyeah, we have some public cloud providers who burst our activity by simply adjusting the ram quota on their side, so when they know they have extra capacity they ramp it up temporarily and then when they expect to be under additional load for other reasons they scale it way down, maybe even to 013:31
fungia more dynamic way to make adjustments, and faster than going through configuration management13:32
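(A minimal sketch of the quota-based approach fungi describes above, assuming the OpenStackClient is available on the cloud side; the project name and limits are placeholders. Per fungi's note, nodepool should then adjust how many nodes it launches to whatever the quota allows.)
    # hypothetical example; project name and values are placeholders
    openstack quota set --cores 40 --ram 163840 --instances 40 opendev-ci
    # nodepool re-reads the quota and scales its launch expectations accordingly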
opendevreviewMerged openstack/project-config master: Scaling down InMotion nodepool resource.  https://review.opendev.org/c/openstack/project-config/+/81021313:36
*** artom_ is now known as artom13:41
opendevreviewJeremy Stanley proposed opendev/system-config master: Use Apache to serve a local OpenDev logo on paste  https://review.opendev.org/c/opendev/system-config/+/81025314:21
opendevreviewDanni Shi proposed openstack/diskimage-builder master: Update keylime-agent and tpm-emulator elements Story: #2002713 Task: #41304  https://review.opendev.org/c/openstack/diskimage-builder/+/81025414:23
opendevreviewDanni Shi proposed openstack/diskimage-builder master: Update keylime-agent and tpm-emulator elements  https://review.opendev.org/c/openstack/diskimage-builder/+/81025414:25
opendevreviewMerged opendev/system-config master: lodgeit: use logo from system-config assets  https://review.opendev.org/c/opendev/system-config/+/80951014:28
opendevreviewMerged opendev/system-config master: gerrit: copy theme plugin from plugins/  https://review.opendev.org/c/opendev/system-config/+/80951115:13
clarkbdigging into the replication leaks a bit more: there is a current task to replicate tobiko to gitea02 from just over an hour ago. On gitea02 there is no receive pack process but there are processes for a couple of other replications that are happening15:14
clarkbLooking at netstat -np | grep 222 I see three ssh connections that correspond to the three receive packs that are present15:15
clarkbAll that to say it really does seem like we aren't properly connecting to the remote end when this happens15:16
yuriysI noticed a couple instances are in Shut Down state. Is that normal? Is that the 'Available' state?15:17
clarkbyuriys: it is possible for test jobs to request a reboot. But typically I'm not sure that is normal15:19
fungireboots also don't generally enter shutdown state, as they're just performed soft from within the guest and not via the nova api15:21
clarkbI think libvirt may detect that through the acpi stuff though15:22
fungiahh, in any case they shouldn't remain in that state for more than a few seconds if so15:22
clarkbinfra-root does anyone else want to look at gitea02 before I kill and restart the tobiko replication task?15:22
*** dviroel is now known as dviroel|lunch15:23
fungiwhen you say "Looking at netstat..." do you mean on gitea02 or review?15:23
*** ysandeep is now known as ysandeep|dinner15:23
clarkbgitea0215:24
fungii guess the gitea side since netstat isn't installed on review15:24
fungii went ahead and installed net-tools on review15:26
clarkbfungi: I think you are expected to use ss on newer systems like focal15:26
clarkbbut net-tools shouldn't hurt either15:27
funginever heard of ss, thanks15:27
clarkbfungi: ss is to netstat what ip is to ifconfig aiui15:27
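(A rough ss equivalent of the netstat invocation above, as a sketch; the filter narrows to the :222 replication port mentioned earlier.)
    # TCP sockets, numeric, with owning process, matching port 222 on either end
    ss -tnp '( dport = :222 or sport = :222 )'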
fungiinterestingly, there is an ssh socket to gitea02 which exists only on the review side and has no corresponding socket tracked on gitea0215:29
fungi199.204.45.33:32798 -> 38.108.68.23:22215:29
opendevreviewClark Boylan proposed opendev/system-config master: GC/pack gitea repos every other day  https://review.opendev.org/c/opendev/system-config/+/81028415:38
clarkbI'm less confident ^ will help but it also shouldn't hurt15:38
fungii have a feeling it's something network related, like a pmtud blackhole impacting one random router which only gets a small subset of the flow distribution or some stateful middlebox dropping a small percentage of tracked states at random15:39
clarkbfungi: ya the network timeout that we need to restart gerrit for is likely the best bet15:40
clarkbfungi: any objection to me stopping and restarting that tobiko replication task now?15:42
fungiclarkb: nah, go for it. i'm curious to see whether these connections clear up, and on which ends15:45
*** ykarel is now known as ykarel|away15:49
fungihuh, a `git remote update` in zuul/zuul-jobs is taking forever on my workstation at "Fetching origin" (which should be one of the gitea servers)15:49
fungifatal: unable to access 'https://opendev.org/zuul/zuul-jobs/': GnuTLS recv error (-110): The TLS connection was non-properly terminated.15:49
clarkbthat would be to the haproxy not the backends ?15:49
clarkbbut maybe something is busy (gitea02 was fairly quiet with a system load of 1)15:50
fungimaybe? we terminate ssl on the backends15:50
fungihaproxy is just a plain layer 4 proxy15:50
clarkbhttp://cacti.openstack.org/cacti/graph.php?action=view&local_graph_id=66611&rra_id=all15:50
fungiat least in our config15:50
clarkbsomeone decided to update their openstack ansible install?15:50
clarkb:P15:50
fungilooks like it15:50
fungisupposedly osad now sets a unique user agent string when it pulls from git servers15:51
clarkblooks like 02 did suffer some of that15:51
clarkbsmall bumps on some other servers but it was largely 0215:52
fungii'm currently being sent to CN = gitea06.opendev.org15:52
clarkbseems like it may be recovering now? possibly because haproxy told things to go away15:52
clarkblooks like it did OOM though15:53
*** ysandeep|dinner is now known as ysandeep15:56
fungilooks like if it's openstack-ansible it'll have a ua of something along the lines of "git/unknown (osa/X.Y.Z/component)"15:57
*** marios is now known as marios|out15:59
clarkbI don't see 'osa' in any of the UA strings during the period of time it appears to have been busy16:00
fungi[21/Sep/2021:15:58:03 +0000] "POST /openstack/zun-tempest-plugin/git-upload-pack HTTP/1.1" 200 224778 "-" "git/2.27.0 (osa/23.1.0.dev43/aio)"16:00
fungias a sample16:00
jrosserthat's a test / CI run because it has 'aio' in the string16:01
fungithanks, i haven't tried to put any numbers together yet, just confirming whether i can find those16:02
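(A sketch of the kind of per-user-agent tally being discussed, assuming a combined-format Apache access log on the gitea backend; the log path is a placeholder.)
    # count git fetch/clone requests per user agent
    grep 'git-upload-pack' /var/log/apache2/gitea-access.log \
      | awk -F'"' '{print $6}' | sort | uniq -c | sort -rn | head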
clarkbjrosser: is there a reason the CI runs don't use the local git repos on the node?16:02
jrosserthey try to as far as possible16:02
clarkbI think gitea06 may have become collateral damage in whatever this is. I can't reach it16:03
fungiso at least looking for osa ua strings, there's not a substantial spike in those on gitea02's access log around the time things started to go sideways16:05
clarkb06 does ping, but maybe OOMkiller hit it in an unrecoverable manner? I guess we can wait and see for a bit16:06
fungifiltering out any with /aio as the component, i can see some definite bursts but no clear correlation to spikes on the haproxy established connections graph16:07
fungiclarkb: unfortunately 06 is not so broken that haproxy has taken it out of the pool16:08
fungii'm still getting my v6 connections balanced to 0616:08
clarkbfungi: are you sure? the haproxy log shows it as down16:09
fungi`echo | openssl s_client -connect opendev.org:https | openssl x509 -text | grep CN`16:09
clarkboh then it flipped it back UP again weird16:09
fungiSubject: CN = gitea06.opendev.org16:09
clarkbit flip flopped 05 and 0616:10
opendevreviewElod Illes proposed openstack/project-config master: Add stable-only tag to generated stable patches  https://review.opendev.org/c/openstack/project-config/+/81028716:10
clarkbyesterday it properly detected the update to the images which do a rolling restart of the services (kind of cool to see that in the log as happening properly)16:10
fungiit's definitely hosed to the point where cacti can't poll snmpd though16:10
clarkbyup and no ssh either16:10
fungibut i guess apache is still semi-responsive, enough to complete tls handshakes16:11
fungilooks like gitea05 has also been knocked offline, yeah16:12
fungii was able to ssh into 05 but it took a while to do the login16:13
clarkbsame here16:14
fungisystem load average is around 9016:14
fungiit's heavy into swap thrash16:14
fungiout of swap altogether in fact16:14
clarkbeverytime this happens I seriously wonder if we shouldn't go back to cgit again16:14
clarkb(it was far more resilient to connection count based load balancing rather than source)16:15
fungior use apache to handle the git interactions and just rely on gitea as a browser16:15
fungiyeah, there's a gitea process using almost 12gb of virtual memory16:16
clarkbfor 05 and 06 should we ask the cloud to reboot them? and/or manually remove them from the haproxy pool?16:16
fungiwell, at the moment i expect they're acting as a tarpit for whatever's generating all this load16:16
fungiif we take them out of the pool, those connections will get balanced to another backend and knock it offline quickly as well16:17
fungithere was a sizeable spike in non-aio osa ua strings in requests to gitea05 at 14:33 and again at 14:4416:19
clarkbyup, but without access to the web server logs on those hosts it is hard to figure out what is causing this, but I'm looking at haproxy to see if there are any clues16:19
fungii'll see if i can isolate those and possibly map them back to a client16:19
clarkbfungi: those happen well before the spike we see in cacti fwiw16:20
clarkbwe are looking at a start of ~15:30 according to cacti16:20
fungioh, yep, you're right. i'm looking at the wrong hour16:20
fungi/openstack/nova/info/refs?service=git-upload-pack is a clone, right?16:22
fungior could be a fetch too16:22
fungiPOST /openstack/nova/git-upload-pack16:23
fungii think that's what i'm looking for actually16:23
clarkbya that sounds right16:24
fungiyeah, a spike of 42 of those within the 15:35 minute on gitea0516:24
clarkbfwiw on the load balancer: `grep 'Sep 21 15:[345].* ' syslog | grep gitea06 | cut -d' ' -f 6 | sed -e 's/\(.*\):[0-9]\+/\1/' | sort | uniq -c | sort` shows a couple of interesting things16:24
funginormally we see around 1-5 of them in a minute during the surrounding timeframe16:25
clarkbif I add 05 to that list then there is strong correlation between two IPs16:25
fungii'm going to try to identify the client addresses associated with those nova clones at 15:3516:25
*** jpena is now known as jpena|off16:25
*** rpittau is now known as rpittau|afk16:25
clarkbfungi: I just PM'd you the IP I expect it to be based on the haproxy data16:25
fungiyep, thanks, i'll see if they like nova a lot16:26
*** dviroel|lunch is now known as dviroel16:28
fungithese were the source ports on the haproxy side of the connection for those nova clones at 15:35:16:29
fungi60478 60452 60702 60652 60588 60692 60660 60442 60466 60610 60464 60566 60460 60650 60616 60958 60916 60400 60730 32862 60992 60802 60838 32808 60910 32818 60882 60834 60954 60852 60932 60972 60746 60694 60978 32848 60918 60962 60876 32776 60844 3331416:29
fungito gitea0516:29
clarkb60478 maps to the IP I shared with you16:30
clarkb60852 does as well16:31
clarkbthe correlation is starting to get stronger :)16:31
clarkbfungi: though it looks like that IP did end up stopping about half an hour ago16:33
clarkbfungi: maybe if we restart things we'll be ok16:33
clarkb?16:33
clarkbbased on that correlation and the lack of that IP showing up for the last bit that is my suggestion16:34
clarkbI suspect what happened according to the log is that 6 went down then they were sent to 5. Then 6 came up and 5 went down and that happened back and forth16:35
clarkband UP here isn't a very strong metric apparently :)16:36
fungiso 100% of those 42 nova clone operations logged during the 15:35 minute on gitea05 came from the same ip address you noted as having a surge in connections through the proxy16:38
fungithough they showed up as being during the 15:37 minute on the haproxy log16:38
clarkbfungi: I think that is because it takes a few minutes for haproxy to disconnect the connections16:39
clarkbfungi: note they all have status of cD or sD in the log lines which is an exceptional state from haproxy aiui16:39
clarkb-- is normal16:39
fungicacti is starting to be able to get through to gitea05 again16:39
fungiand gitea06 seems like it wants to finish logging me in... it did print the motd just no shell prompt yet16:40
clarkbprogress!16:40
fungispike in nova clones on gitea06 was logged at 15:32-15:33 but it was hit much harder too, and stopped really doing anything according to its logs after 15:3416:43
clarkbfungi: ya then I think it flipped over to 05 when 06 was noted as down16:44
fungi115 nova clones in that 120 second timeframe16:44
clarkbSep 21 15:35:27 gitea-lb01 docker-haproxy[786]: [WARNING]  (9) : Server balance_git_https/gitea06.opendev.org is DOWN, reason: Layer4 timeout16:44
clarkbSep 21 15:40:14 gitea-lb01 docker-haproxy[786]: [WARNING]  (9) : Server balance_git_https/gitea05.opendev.org is DOWN, reason: Layer4 timeout16:44
clarkbbasically it went from 06 to 0516:44
fungirunning the cross-log analysis with haproxy now16:45
clarkband for some reason didn't continue on to 01 02 03 04 etc16:45
clarkbpossibly because 06 went back up Sep 21 15:37:47 gitea-lb01 docker-haproxy[786]: [WARNING]  (9) : Server balance_git_https/gitea06.opendev.org is UP, reason: Layer4 check passed and so it stuck to only 05 and 0616:45
clarkbload average on 05 doesn't seem to be getting better16:49
clarkbfungi: are you good with attempting to reboot 05 and 06 now?16:50
fungiyeah, let's16:50
clarkbI can't get to 06, if you are still on it did you want to try sudo rebooting both of them?16:50
clarkbthen if that doesn't work we can ask the cloud to do it for us16:50
fungii'm logged into both so yep, can do16:54
fungiand done16:54
fungiwe'll see if they manage to shut themselves down cleanly and reboot16:54
fungi05 is back up again16:56
clarkbgitea05 closed my connection at least16:56
fungi06 is still booting i think16:56
fungior might still be shutting down, but it closed my connection at least16:56
clarkbload is a bit high on 0516:56
clarkbbut not as high as before16:56
clarkbis there a secondary dos?16:56
fungijrosser: user agent on these nova clones we observed was just "git/1.8.3.1" so i have a feeling it still could be an osa site16:57
jrosserit could easily be16:58
fungiin a 3-minute span we saw >150 clones of nova from a single ip address, so likely behind a nat16:58
jrosseri think we backported that user agent stuff all the way back to T16:58
jrosserbut it does require them to have moved to a new tag16:58
clarkblooks like load is falling back down again on 05 I guess it just had to catch up16:59
clarkbalso gerrit replication shows retries enqueued for pushing to 05 (we want to see that so good to confirm it happens)16:59
fungigit 1.8.3.1 is fairly old... is that the default git on centos 7 maybe?17:00
clarkbfungi: I think it is17:00
clarkb05 looks normal now17:01
clarkbplenty of memory, reasonable system load etc17:01
fungihttps://centos.pkgs.org/7/centos-x86_64/git-svn-1.8.3.1-23.el7_8.x86_64.rpm.html17:01
fungiyeah, centos 7 seems likely17:01
fungii'm still getting a "no route to host" for gitea0617:02
clarkbI'm going to go find some breakfast since we are just waiting on 06 to restart and 05 showed a restart seems to make things happier and replication handles it properly17:03
clarkbI was hoping to do more zuul reviews this morning :) maybe I can do those this afternoon and the gerrit account emails can happen tomorrow17:03
fungii was trying to start on another zuul-jobs change which is what caused me to notice things weren't working right17:03
clarkbya I noticed because I was trying to look at gitea06 like I had looked at gitea02 to debug the replication slowness17:04
clarkboh and at some point we really should restart gerrit to pickup the timeout change17:05
*** ysandeep is now known as ysandeep|away17:13
clarkbfungi: looks like 06 has been up for about 5 minutes17:13
fungiyeah, i'm able to ssh into it now17:13
fungigot sidetracked trying to write docs17:14
fungisystem load average is nice and low17:14
clarkbthere are three older replication tasks to 06 one each for cinder ironic and neutron. If they don't complete soon they may need to be restarted as well17:14
clarkbThere are many retry tasks for replication to 06 though so generally seems to have detected it needs to try again17:15
clarkbI only see a push for keystone and not the other three (implying they are in a similar state to the previous tobiko replication task. gitea knows nothing about them)17:16
clarkbI'll give them 5 more minutes then manually intervene17:16
*** sshnaidm is now known as sshnaidm|off17:22
clarkband done. Will check queue statuses in a bit but I expect we'll be recovering to a more normal situation soon17:23
clarkbfungi: thinking out loud here maybe we can run iperf3 tests between rax dfw and gitea0X and compare to a similar run between review02 and gitea0X17:26
clarkbthe giteas were never in the same cloud region but it seems that replication might be a fair bit slower now?17:26
clarkbmnaser: ^ fyi if that is something you might already know about17:27
clarkbdon't have hard data on it but possibly also having started since we migrated the review server to the new dc?17:27
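(A sketch of the proposed iperf3 comparison, assuming iperf3 is installed on both ends; the hostnames are just the hosts discussed here.)
    # on a gitea backend, run the server side
    iperf3 -s
    # from review02, then again from a rax dfw host for comparison
    iperf3 -c gitea05.opendev.org -t 30       # forward direction
    iperf3 -c gitea05.opendev.org -t 30 -R    # reverse direction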
fungialso can't rule out that these hangs are related to activity spikes and oom events on the gitea side17:28
mnaserclarkb: that's strange, the hardware should be way quicker and the ceph systems should be faster.  i wonder if it has something to do with the kernel version in your vm vs the host (as that is significantly newer)17:28
clarkbfungi: well I have checked some of the hosts and some don't have recent OOMs17:29
clarkbfungi: gitea03 for example is quite clean but also has had issues17:29
clarkbthe other odd thing is it seems to happen between 14:00UTC and 18:00UTC17:31
fungiokay, so yes that does seem like occasional problems crossing the internet17:31
clarkbour cacti data shows the last few days during this period of time has been clean except for today17:31
clarkbnote it is possible that is observer bias as I tend to check in the morning. It could be happening at other times but some timeout is finally occurring and cleaning them up so we don't see them queued the next morning17:33
opendevreviewJeremy Stanley proposed zuul/zuul-jobs master: Deprecate EOL Python releases and OS versions  https://review.opendev.org/c/zuul/zuul-jobs/+/81029917:33
clarkb236edacc              17:05:46.304      [43642e93] push ssh://git@gitea03.opendev.org:222/openstack/releases.git17:34
clarkbThat one appears to have "leaked"17:34
clarkbgitea03 has not OOM'd17:34
fungicould they be getting cleaned up after 2 hours? 3? what's the oldest you observed?17:34
clarkbfungi: I've observed tasks from ~14:00ish still present at ~18:0017:34
fungiokay, so at least 4 hours i guess17:35
clarkbdo we want to see what happens with 236edacc ?17:35
fungisure, if someone complains about an old ref there we can always abort the experiment and kill that task so it catches up17:35
clarkbok17:36
clarkbf54021d3 and 1c9c079d appear to have leaked on 0517:42
clarkbthose are both post reboot tasks so no OOM there either17:42
*** ysandeep|away is now known as ysandeep17:44
clarkbtwo things I'll note. the giteas don't appear to have AAAA records but do have configured ipv6 addresses. This means gerrit is going to talk to them over ipv4 only17:44
clarkbpinging from gitea05 to review02 over ipv6 results in no route to host17:44
fungiyeah, i want to say the original kubernetes deployment design limited us to only using ipv4 addresses, but since the lb is proxying to them anyway it was irrelevant for end users17:48
fungisince we ended up not sticking with kubernetes there, we've got ipv6 addresses, just never added any aaaa records17:49
clarkbin this case it is a good thing because it seems the ipv6 cannot route17:57
clarkbI did a ping -c 100 from gitea05 to review02 and vice versa and both had a 2% loss17:57
opendevreviewDanni Shi proposed openstack/diskimage-builder master: Update keylime-agent and tpm-emulator elements  https://review.opendev.org/c/openstack/diskimage-builder/+/81025418:06
clarkbrerunning the ping -c 100 test to see how consistent that is18:07
opendevreviewMerged openstack/project-config master: Update neutron-lib grafana dasboard  https://review.opendev.org/c/openstack/project-config/+/80613818:07
fungiclarkb: ianw: not urgent, but related to recently approved changes and i'm looking for a suggestion as to the best way to tackle it: https://review.opendev.org/81025318:15
clarkbyuriys: we're noticing some connectivity issues to https://registry.yarnpkg.com/@patternfly/react-tokens/-/react-tokens-4.12.15.tgz from 173.231.255.74 and 173.231.255.246 in the inmotion cloud. Currently I can fetch that url with wget from the hosts that have those IPs assigned to them.18:16
clarkbyuriys: I guess I'm wondering if there are potential routing issues with those IPs or maybe the neutron routers/NAT might be struggling?18:16
clarkboh there wouldn't be NAT18:16
clarkbjust the neutron router I think18:16
clarkbno packet loss on second pass of ping -c between gitea05 and review0218:17
clarkbfungi: left a response to your question on that change. I'm not 100% sure of that but maybe 90% sure18:19
opendevreviewJeremy Stanley proposed opendev/system-config master: Switch IPv4 rejects from host-prohibit to admin  https://review.opendev.org/c/opendev/system-config/+/81001318:19
fungithanks!18:19
opendevreviewJeremy Stanley proposed opendev/system-config master: Use Apache to serve a local OpenDev logo on paste  https://review.opendev.org/c/opendev/system-config/+/81025318:24
fungialso the screenshot for that in the run job was a huge help, made it quite obvious my naive first attempt was worthless18:25
opendevreviewClark Boylan proposed opendev/system-config master: Upgrade gitea to 1.15.3  https://review.opendev.org/c/opendev/system-config/+/80323118:29
opendevreviewClark Boylan proposed opendev/system-config master: DNM force gitea failure for interaction  https://review.opendev.org/c/opendev/system-config/+/80051618:29
opendevreviewClark Boylan proposed opendev/system-config master: Upgrade gitea to 1.14.7  https://review.opendev.org/c/opendev/system-config/+/81030318:29
clarkbinfra-root ^ I put a hold on the last change in that stack to verify 1.15.3. I think we should consider going ahead and landing the 1.14.7 upgrade to keep up to date there. Then for the 1.15.3 update I'd like to do that after the gerrit theme logo stuff that ianw has pushed is done18:29
clarkband I'm beginning to think maybe we do a combo restart of gerrit for the theme update and the replication timeout config change18:30
fungia reasonable choice18:30
clarkbThen after all that we can do the buster -> bullseye updates for those images (I have changes up for those as well)18:31
*** ysandeep is now known as ysandeep|out18:32
clarkbI have +2'd the two gerrit changes at the end of the logo stack but didn't approve them as they are a bit more involved than the previous changes. I figure we can double check the above plan with ianw then proceed from there with approving those updates?18:39
fungiwfm18:40
fungiand hopefully 810253 will work as intended now18:42
opendevreviewJeremy Stanley proposed opendev/system-config master: Use Apache to serve a local OpenDev logo on paste  https://review.opendev.org/c/opendev/system-config/+/81025319:05
clarkbinfra-root https://review.opendev.org/c/opendev/system-config/+/810303 has been +1'd by zuul. I'm around all afternoon if we want to proceed with that. I do have to pick up kids from school though for a shortish gap in keyboard time19:50
clarkbthat is the gitea 1.14.7 update19:50
ianwfungi: the paste not showing the logo in the screenshot is weird19:52
ianwespecially when it seems like the wget returned it correctly19:52
fungiianw: well, my test also failed and so there should be a held node for it now19:53
fungithe get returned a 5xx error19:53
clarkbfungi: have you ever seen anything like the error in https://zuul.opendev.org/t/openstack/build/e665dbc7368e44caa398e8c130c4151a ? seems apt had problems?19:54
clarkbmaybe we fetched an incomplete file? but hash verification should catch that first?19:54
ianwfungi: oh indeed https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_d26/810253/3/check/system-config-run-paste/d266360/bridge.openstack.org/test-results.html19:55
fungiclarkb: cannot copy extracted data for './usr/bin/dockerd' to '/usr/bin/dockerd.dpkg-new': unexpected end of file or stream19:56
fungimy guess is it was truncated, yeah19:56
clarkbI'll recheck that change once it reports I guess19:57
ianw"[pid: 14|app: -1|req: -1/2] 127.0.0.1 () {32 vars in 435 bytes} [Tue Sep 21 19:44:07 2021] GET /assets/opendev.svg" <- so the request made it to lodgeit, which it shouldn't have you'd think19:57
fungiclarkb: i'm betting the working run will show a larger file size for docker-ce than 21.2 MB19:58
ianwbut also, it looks like mysql wasn't ready -> https://zuul.opendev.org/t/openstack/build/d266360944434e288db1880729d809dc/log/paste01.opendev.org/containers/docker-lodgeit.log#14419:58
fungiianw: possible my location section in the vhost config isn't right. i can fiddle with it on the held node when i get a moment19:58
fungiaccording to the apache 2.4 docs, location /assets/ should cover /assets/opendev.svg19:59
fungiand therefore be excluded from the proxy20:00
ianwhttps://opendev.org/opendev/system-config/src/branch/master/playbooks/roles/lodgeit/templates/docker-compose.yaml.j2#L28 -> we sleep for 30 seconds for mariadb to be up20:00
ianwSep 21 19:42:03 -> Sep 21 19:42:39 paste01 docker-mariadb[10998]: 2021-09-21 19:42:39 0 [Note] mysqld: ready for connections.20:01
ianwthat's ... 36 seconds from start to ready?20:01
fungithat provider may still not be running at consistent performance levels to our others20:07
ianwyeah it was inmotion20:09
ianwit should use a proper polling wait; the sleep was just expedience but we could include the wait-for-it script 20:10
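(A minimal sketch of the polling wait ianw suggests, assuming a bash entrypoint and the compose service name "mariadb"; the real fix may simply be to vendor the wait-for-it script.)
    # wait until the database accepts TCP connections instead of sleeping 30s
    until (echo > /dev/tcp/mariadb/3306) 2>/dev/null; do
        echo "waiting for mariadb..."
        sleep 1
    done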
fungihuh, fun, my ssh to that held node seems to have just hung20:15
funginevermind, it resumed20:16
fungibizarre, i tried treating opendev.svg exactly like robots.txt in the apache config on the held node, and it's still getting proxied to lodgeit20:20
fungioh, duh20:25
opendevreviewJeremy Stanley proposed opendev/system-config master: Use Apache to serve a local OpenDev logo on paste  https://review.opendev.org/c/opendev/system-config/+/81025320:28
fungiianw: ^ turns out the mistake was a surprisingly simple one20:28
fungii only added the logo to the http vhost, not the https one20:29
* fungi sighs audibly20:29
ianwoh, doh.  that's right we allowed the http for the old config file20:32
clarkbit's been just over 3 hours on those leaked replication tasks and they are still present20:38
fungiwell, we guessed the timeout is at least 4 hours there20:42
clarkbya or at least 4 hours20:43
clarkbjust calling out the data doesn't contradict this yet20:43
clarkbfollowing up on the nodepool zk data backups that appears to be working as expected20:46
clarkbgitea 1.15.3 continues to look good https://158.69.73.109:3081/opendev/system-config20:56
mnaserclarkb: i think you caught network in an odd time where we were flipping some bits, let me know if you continue to see some instability-ish21:00
clarkbmnaser: will do21:00
clarkbmnaser: fwiw the ipv6 ping from gitea05 to review02 still says Destination unreachable: No route21:04
mnaserclarkb: yes, that's still a 'working on fixing it' :(21:04
clarkbgot it21:04
*** dviroel is now known as dviroel|out21:07
yuriysJust caught up on chat. Okay, looks like we need to scale down once more to improve the CI performance, 36 sec to start MySQL is awful lol.21:35
yuriysQuick question on workload distribution, when a test is queued does it pull a 'worker' at random from a list of available instances, or does it pull an instance out of a serial list of available instances for testing?21:38
yuriysThe reason I ask is that one of the nodes in the inmotion cloud was heavily under used while the other exploded, which is where I'm guessing ianw's test ran, but it looks like there may have been multiple instances used at the same time, which is fine, but looking to optimize for better/more responsive load distribution.21:43
ianwyuriys: umm, the nodes are up and in a "ready" state before they are assigned to run tests21:47
yuriysYeah, what I've seen so far is: an instance is created ("Launch Attempt"), then they are shut off and go to an "Available" state, then if they are selected they go to an "In Use" state. This is the stuff that gets pushed to grafana so I'm going from that.21:49
ianwif you were just looking at the cloud side, you see VMs come up that are lightly used; they will sit for an indeterminate amount of time (not very long when we're under load) before being assigned as workers, when they start doing stuff21:49
ianwi need to look at why https://grafana.opendev.org/d/4sdNjeXGk/nodepool-inmotion?orgId=1 is not showing openstackapi stats21:50
yuriysNo other providers have that info.21:51
yuriysI thought it was just 'permabroken' lol.21:51
Guest490ianw: i haven't looked into it but https://zuul-ci.org/docs/nodepool/releasenotes.html#relnotes-3-6-0-upgrade-notes may be relevant to missing stats21:54
*** Guest490 is now known as corvus21:55
*** corvus is now known as _corvus21:56
*** _corvus is now known as corvus21:56
ianwi think it might be related to https://review.opendev.org/c/zuul/nodepool/+/78686221:57
ianwwe're graphing : stats.timers.nodepool.task.$region.ComputePostServers.mean22:07
ianwi think that should be compute.POST.servers now22:09
opendevreviewYuriy Shyyan proposed openstack/project-config master: Improve CI performance and reduce infra load.  https://review.opendev.org/c/openstack/project-config/+/81032622:11
opendevreviewIan Wienand proposed openstack/project-config master: grafana: fix openstack API stats for providers  https://review.opendev.org/c/openstack/project-config/+/81032922:25
fungiokay, back now. dinner ended up slightly more involved than i anticipated22:27
corvusi think everything we were interested in getting into zuul has landed, so i'd like to start working on a restart now22:32
fungii'm happy to help22:32
corvusfungi: you want to establish if now is okay time wrt openstack?22:33
corvusi'll run pull meanwhile22:33
fungiyeah, i'm checking in with the release team22:33
fungiour gerrit logo changes haven't been approved yet, so we can just do gerrit restart separately later22:33
corvusmost recent promote succeeded, and i've pulled images, so we're up to date now22:34
clarkbfungi: I think that is fine. The two restarts are sufficiently quick compared to each other that we don't need to try and squash them together I don't think22:34
opendevreviewMerged openstack/project-config master: Improve CI performance and reduce infra load.  https://review.opendev.org/c/openstack/project-config/+/81032622:34
fungii've let the openstack release team know we're restarting zuul, and there are no changes in any of their release-oriented zuul pipelines right now so should be non-impacting there22:35
fungishould be all clear to start22:36
corvusthanks, i'll save qs and run the restart playbook22:36
corvusstarting up now22:37
TheJuliaI was just about to ask....22:37
fungisee, no need to ask! ;)22:38
TheJulialol22:38
fungiwe promise to try to avoid unnecessary restarts next week when we expect things to get more frantic for openstack ;)22:39
TheJuliaWell, I actually had ironic's last change before releasing in the check queue.... :)22:39
fungi(not that this restart is unnecessary, it fixes at least one somewhat nasty bug, and we don't like to release new versions of zuul without making sure opendev's happy on them)22:40
TheJuliaI *also* found a find against /opt in our devstack plugin which I'm very promptly ripping out because that makes us bad opendev citizens22:40
fungioof22:40
TheJuliafungi: that was my reaction when I spotted it22:40
fungithank you for helping take out the trash ;)22:41
corvusour current zuul effort is in making it so no one notices downtime again... so every restart now is an infinite number of restarts avoided in the future :)22:42
corvusyou can't argue with that math.  ;)22:42
fungithat too!22:42
clarkbya we've been doing a lot of incremental improvements to get closer to removing the spof22:42
fungirestartless zuul22:42
fungiit's nearing everpresence22:42
clarkbThis is one of the things that has motivated me to do all this code review :)22:43
corvusmuch appreciated :)22:43
clarkblooks like it is done reloading configs?22:49
corvusi think it's reloading something  (still?  or again?)22:50
corvusre-enqueing now22:51
clarkbchanges and jobs are showing up as queued22:51
fungilgtm22:52
yuriyshmmm how do you guys identify which provider gets selected for a particular task?22:53
clarkbyuriys: every job records an inventory and in that inventory are hostnames that indicate the cloud provider22:54
clarkbyuriys: the beginning of the job-output.txt also records a summary of that info (so you can see it in the live stream console)22:54
yuriysthank you, found localhost | Provider: xxxx in one of the logs22:55
clarkbyuriys: a single job will always have all of its nodes provided by the same provider too22:56
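(A sketch of pulling that provider information out of a finished build's logs, assuming the standard zuul-info artifacts uploaded by the base job; the log URL is a placeholder.)
    # the recorded ansible inventory includes the nodepool provider per node
    curl -s https://<log-site>/<build>/zuul-info/inventory.yaml | grep -i provider
    # the first lines of job-output.txt carry a similar per-host summary
    curl -s https://<log-site>/<build>/job-output.txt | head -n 20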
yuriyswhen change is successfully merged by zuul, what triggers a build?22:56
clarkb*a single build of a job22:56
clarkbyuriys: zuul's gerrit driver will see the merge event sent by gerrit then the pipeline configs in zuul can match that and then trigger their jobs22:57
yuriys> a single job will always have all of its nodes provided by the same provider22:57
yuriysThis explains the explosions!!!22:57
fungiunless you're talking about speculative merges, rather than merging changes which have passed gating22:57
clarkbyuriys: basically zuul has an event stream open to gerrit and for every event that gerrit emits it evaluates against its pipelines22:57
yuriysso if you guys just restarted things22:58
fungi"merge" is used in multiple contexts, so it's good to be clear which scenario you're asking about22:58
yuriyswhat are the odds that stream got cut22:58
yuriyshttps://review.opendev.org/c/openstack/project-config/+/81032622:58
yuriysno build22:58
clarkbyuriys: the deploy job for that is currently running (zuul got restarted so we had to restore queues)22:59
fungicheck zuul's status page for the openstack tenant, all the builds for that change got re-added to pipelines22:59
clarkbhttps://zuul.opendev.org/t/openstack/status and look for 81032622:59
yuriysAh I saw stuff under [check] but deploy was empty for a bit22:59
yuriysi see it now22:59
corvusre-enqueue complete23:00
clarkbya it isn't instantaneous as each one of those enqueue actions after a restart has to requery git repos23:00
fungiyeah, the re-enqueuing was scripted so it doesn't all show back up at once23:00
corvus#status log restarted all of zuul on commit 0c26b1570bdd3a4d4479fb8c88a8dca0e9e38b7f23:00
opendevstatuscorvus: finished logging23:00
fungithanks corvus!23:00
clarkbfungi: it's been almost 6 hours on those leaked replications. I guess maybe we wait ~8 hours and then manually clean them up or do we want to leave them until tomorrow?23:01
clarkbthe mass of failures on some check changes seem to be legit (pip dep resolution problems)23:03
fungiclarkb: i think it's safe to assume the queue times you were observing for those replication tasks weren't particularly biased by the time you were checking them, so i'd be fine just cleaning them up at this point23:04
clarkbya I'm beginning to suspect there is something interesting about the time period they show up in. Network instability during those periods of time for example23:05
clarkbrather than it being a side effect of some sort of long timeout23:05
clarkbI'll give them a bit longer. I don't have to make dinner for a bit23:06
yuriysbuild failed : (23:06
clarkbneat let me go see why23:06
yuriysis it waiting on logger?23:07
yuriyshttps://zuul.opendev.org/t/openstack/build/a9c7f49c293f4659befe7ae1e3353ca5/log/job-output.txt23:07
clarkbno that is the bit I was telling you about where we don't let zuul stream those logs out. We keep the logs on the bastion to avoid unexpected leakages of sensitive info23:07
clarkblooking at the log on the bastion it failed because nb01 and nb02 had some issue. I think your change is only needed on nl02 and so we should be good from the scale down perspective23:08
clarkbya https://grafana.opendev.org/d/4sdNjeXGk/nodepool-inmotion?orgId=1 reflects the change23:08
fungithat bit of log redaction is specific to our continuous deployment jobs, not typical of test jobs23:08
yuriyskk23:08
fungiwe just want to make sure that the ansible which pushes production configs doesn't inadvertently log things like credentials if it breaks23:09
clarkbnb01 and nb02 failed to update project-config which goes in /opt because /opt is full23:09
yuriysyeah i got that part, hard to track what failed though lol23:09
clarkbI'll stop builders on them now and then work on cleaning them up23:09
yuriysCool, well, hopefully this is the last one, might have to fiddle with placement distribution limits, our weakness here is just the quantity of nodes.23:11
ianw... we just had an earthquake!23:17
yuriyswoah23:18
fungieverything okay there?23:19
ianwyep, well the internet is still working! :)  but wow, that got the heartrate up23:20
yuriyseasy calorie burn23:20
ianwi felt a few in bay area when i lived there, but this was bigger bumps than them23:21
clarkbwow23:21
artom"easy calorie burn" is it though? Feels like a lot of trouble for some cardio ;)23:22
ianwit wasn't knock things off shelves level.  still, why not add something else to worry about in 2021 :)23:25
yuriysdon't worry, 2021 not over yet23:26
yuriyssorry, correction, worry, 2021 not over yet23:26
clarkbany idea why we seem to have a ton of fedora-34 images?23:27
clarkbthat seems to be at least part of the reason that nb01 and nb02 have filled their disks23:27
clarkbI have cleaned out their /opt/dib_tmp as well as stale intermediate vhd images and that helped a bit23:27
opendevreviewMerged opendev/system-config master: Use Apache to serve a local OpenDev logo on paste  https://review.opendev.org/c/opendev/system-config/+/81025323:28
ianwwe should just have the normal amount (2).  but it is the only thing using containerfile to build so might be a bug in there23:28
clarkbianw: hrm I cross checked against focal as a sanity check and it has 2 for x86 and 2 ready + 1 deleting for arm6423:28
clarkbbut fedora-34 has many many23:28
clarkboh you know what23:28
clarkbone sec23:28
clarkb2021-09-21 23:29:03,556 ERROR nodepool.zk.ZooKeeper: Error loading json data from image build /nodepool/images/fedora-34/builds/000000738823:29
clarkbI suspect that issue in the zk db is preventing nodepool from cleaning up the older images23:29
clarkbcorvus: ^ is that something you think you'd like to look at or should we just rm the node or?23:29
clarkbfwiw I think I cleaned enough disk that we can look at this tomorrow23:29
clarkbbut probably won't want to wait much longer than that23:30
opendevreviewMerged openstack/project-config master: grafana: fix openstack API stats for providers  https://review.opendev.org/c/openstack/project-config/+/81032923:31
clarkb/nodepool/images/fedora-34/builds/0000007388 is empty and is the oldest build23:35
clarkbI don't see the string 7388 in /opt/nodepool_dib on either builder23:36
clarkbI suspect some sort of half completed cleaning of the zk db and we should go ahead and rm that znode23:36
clarkbHowever, I'll let corvus confirm there isn't further debugging that wants to happen first23:37
clarkbI cleaned up the replication queue as it is getting close to dinner23:39
clarkbthe replication queue is now empty even after I reenqueued the tasks23:39
funginice, thanks23:39
corvusclarkb: i don't feel a compelling need to debug that right now, so if you want to manually clean up that's great23:41
fungiif it's an actual persistent or intermittent issue, i'm sure we'll have more samples soon enough23:41
clarkbok I will rm that single entry then I expect nodepool will clean up after itself from there23:42
clarkboh wait it won't let me rm it because it has subnodes and shows ovh-gra1 has the image?23:43
clarkblet me check on what ovh-gra1 sees23:43
clarkbnodepool image list didn't show it, and its subnode for images/ (where I think it records that) was empty, so I went ahead and cleaned up everything below 7388 as well as 738823:46
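(A sketch of the kind of manual znode cleanup described here, assuming access to a zookeeper CLI; the server name is a placeholder and deleteall needs ZooKeeper 3.5+.)
    # inspect the suspect build record and its children
    zkCli.sh -server zk01.opendev.org:2181 ls /nodepool/images/fedora-34/builds/0000007388
    zkCli.sh -server zk01.opendev.org:2181 get /nodepool/images/fedora-34/builds/0000007388
    # recursively remove the empty/corrupt build znode
    zkCli.sh -server zk01.opendev.org:2181 deleteall /nodepool/images/fedora-34/builds/0000007388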
clarkbthe exception listing dib images is gone23:47
ianwhttps://earthquakes.ga.gov.au/event/ga2021sqogij23:47
ianwclarkb: i'll let you poke at it, take me longer to context switch in i imagine23:48
clarkbianw: ya, I think this may be all that was necessary then the next time nodepool's cleanup routines run it will clean up the 460something extra records23:48
clarkbheh now /nodepool/images/fedora-34/builds/0000007392 is sad but we went from 467 to 460 :)23:50
clarkbthat one is in the same situation so I'll give it the same treatment23:51
corvuswow M6 is not nothing23:52
clarkb/nodepool/images/fedora-34/builds/0000007405 now23:56
