Friday, 2021-05-14

clarkbbah my ssh key agent just expired my keys00:02
fungiprobably a sign you woke up too early00:02
clarkbthe one thing I wanted to check was sudo docker image list on ze01 as that was yellow during the prune (but I didn't think there were any images to prune)00:02
fungilooking00:02
corvusre-enqueue is done00:03
clarkbfungi: I would expect latest and 4.2.0 to be present00:03
clarkbI'll just reload my keys for a short time00:03
fungiclarkb: yep, just those two00:03
fungii don't see any others, and both are present00:03
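(A minimal sketch of the check being described here, assuming the usual docker CLI on ze01; the image repository name below is an assumption, not taken from the conversation:)
    # list all images on the executor; after the prune only the expected tags should remain
    sudo docker image list
    # or narrow the output to a single repository, e.g. the executor image
    sudo docker image list zuul/zuul-executor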
clarkbcool I got on and checked the zuul.conf and docker-compose.yaml as well on ze01 those look good00:05
clarkbzm01 looks good too. And docker compose on scheduler is all using latest00:06
clarkbI think we should be in a steady state now00:06
clarkbwe can leave zuul01 up for now. services are stopped on it and it is in the emergency file. I want to unenroll it from esm before I delete it too00:06
clarkbinfra-root ^ maybe double check you don't want to preserve anything you've got on that server?00:06
clarkband I can aim to clean it up tomorrow or monday?00:06
corvusclarkb: all clear from me; we've done everything on zuul02 i'd want to do on zuul0100:07
clarkbcool00:07
corvusunless we want to copy the logs to zuul02 first?00:08
corvusbut it's pretty rare we need to go back far in scheduler debug logs00:08
corvusso i'm okay rolling the dice on that00:08
clarkbprobably not a bad idea. I'm running out of steam today (as evidenced by my keys expiring) and can do that tomorrow if we think it is a good idea00:08
clarkbcorvus: re queue dumping I wonder if the background queue dumps are working00:09
clarkbthat just fetches the json file right? so that should still work on the new server but let me see00:09
corvusi don't see zuul02 in cacti00:10
corvusdid the script to add hosts bitrot?00:10
clarkbcorvus: I think it's a more subtle issue, it's that the list on the left doesn't update for some reason00:11
clarkbif you go to http://cacti.openstack.org/cacti/graph_view.php?action=list and search zuul02 stuff shows up00:11
corvusah00:11
clarkbfrom that you can get things like http://cacti.openstack.org/cacti/graph.php?local_graph_id=70200&rra_id=all00:11
clarkbconfirmed the json status backups don't seem to be working at the moment00:12
clarkbI don't think those are critical though and I can followup with that tomorrow00:13
clarkbI've disconnected from the bridge screen but left it running for now in case we want to refer to anything tomorrow. Anything else you can think of that I should be checking on before i call it a day?00:13
corvusi see the cacti prob00:14
clarkbthe deploy pipeline base job failed due to the apt-get autoremove issue I mentioned earlier; it seems to have hit a couple of mirrors00:16
corvusthe next time the create graphs job runs it should work (the name of the tree in cacti didn't match the name in the script)00:16
clarkbianw: fungi: ^ it seems to be shim-signed and other related problems with dpkg00:16
clarkbcorvus: thanks!00:17
clarkbbut ya I think we are sufficiently steady state now that I can go help with dinner and stuff. I'll followup on the json backups and log copies tomorrow00:17
clarkbthanks for all the help today00:17
corvus++00:17
clarkb#status Log swapped out zuul01.openstack.org for zuul02.opendev.org. The entire zuul + nodepool + zk cluster is now running on focal00:18
openstackstatusclarkb: finished logging00:18
clarkboh good it accepted the Log instead of log00:18
clarkbone last thought before my day ends: the json status backups may still work on zuul01 if we need them in the near future00:21
ianwi can log into cacti and remove all the old .openstack.org hosts, that's the only way i've found to do it01:23
ianwthere's a few old mirrors too01:23
*** ysandeep|away is now known as ysandeep01:35
ianwafs01.dfw.openstack.org 99 48 62 Down 70d 4h 40m01:48
ianwi wonder why01:48
openstackgerrityang yawei proposed openstack/project-config master: setup.cfg: Replace dashes with underscores  https://review.opendev.org/c/openstack/project-config/+/79134301:51
ianwsnmpwalk -v1 -c public afs01.dfw.openstack.org from cacti doesn't return anything01:55
ianwudp6       0      0 ::1:161                 :::*                                957/snmpd01:56
ianwseems like it should be listening01:57
ianwok, the server is *getting* snmp requests 01:58:07.218502 IP cacti02.openstack.org.38162 > afs01.dfw.openstack.org.snmp:  GetNextRequest(25)01:58
openstackgerritSteve Baker proposed openstack/diskimage-builder master: WIP Add a growvols utility for growing LVM volumes  https://review.opendev.org/c/openstack/diskimage-builder/+/79108301:59
ianwexcellent, i restarted snmpd and it "just works" ... sigh02:01
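(A rough recap of the SNMP debugging steps above as commands; the interface name and community string below are assumptions:)
    # from the cacti host: does the target answer SNMP v1 queries at all?
    snmpwalk -v1 -c public afs01.dfw.openstack.org
    # on the target: is snmpd running, and which sockets is it bound to?
    sudo netstat -lunp | grep snmpd
    # on the target: are the requests actually arriving on the wire?
    sudo tcpdump -ni eth0 udp port 161
    # if requests arrive but are never answered, restart the daemon
    sudo systemctl restart snmpd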
*** timburke_ has quit IRC02:10
*** timburke__ has joined #opendev02:10
*** brinzhang0 has quit IRC03:46
*** hemanth_n has joined #opendev03:48
*** hemanth_n has quit IRC04:08
*** brinzhang0 has joined #opendev04:25
ianw#status log cleared out a range of old hosts on cacti.openstack.org04:27
openstackstatusianw: finished logging04:27
ianwi've restarted a bunch of snmpd's that seemed to have stopped working, although i have no root cause04:28
*** ykarel has joined #opendev04:44
*** marios has joined #opendev04:55
*** ykarel_ has joined #opendev05:36
*** ykarel has quit IRC05:38
*** lpetrut has joined #opendev06:01
*** darshna has joined #opendev06:03
*** slaweq has joined #opendev06:06
*** brinzhang_ has joined #opendev06:31
*** whoami-rajat_ has joined #opendev06:32
*** brinzhang0 has quit IRC06:34
*** fressi has joined #opendev06:35
*** amoralej|off is now known as amoralej07:11
*** ykarel_ is now known as ykarel07:12
*** andrewbonney has joined #opendev07:13
*** jpena|off is now known as jpena07:30
*** tosky has joined #opendev07:47
*** DSpider has joined #opendev07:48
*** sshnaidm|afk is now known as sshnaidm|pto08:00
*** lucasagomes has joined #opendev08:00
*** yoctozepto6 is now known as yoctozepto08:03
*** ysandeep is now known as ysandeep|lunch08:08
*** ykarel is now known as ykarel|lunch08:25
openstackgerritMerged opendev/glean master: Remove Fedora 32 job  https://review.opendev.org/c/opendev/glean/+/79036808:51
*** ysandeep|lunch is now known as ysandeep09:12
*** prometheanfire has quit IRC09:24
*** prometheanfire has joined #opendev09:24
*** ykarel|lunch is now known as ykarel09:34
openstackgerritLucas Alvares Gomes proposed zuul/zuul-jobs master: [dnm] testing devstack 791085  https://review.opendev.org/c/zuul/zuul-jobs/+/79111709:37
*** ysandeep is now known as ysandeep|brb11:17
fricklerclarkb: seems logrotate config is broken for zuul02, is that a known issue? e.g. I see config for /var/log/zuul/zuul-debug.log while our log is named /var/log/zuul/debug.log11:25
*** jpena is now known as jpena|lunch11:34
*** whoami-rajat_ is now known as whoami-rajat11:51
*** mlavalle has joined #opendev12:08
*** ysandeep|brb is now known as ysandeep12:13
*** brinzhang0 has joined #opendev12:13
*** brinzhang_ has quit IRC12:16
*** jpena|lunch is now known as jpena12:31
*** lpetrut has quit IRC12:36
lucasagomeshi, does anyone know how I can test https://review.opendev.org/c/zuul/zuul-jobs/+/791117/ ? Apparently the Depends-On is not honored in the "zuul-jobs-test-ensure-devstack" test run12:42
*** lpetrut has joined #opendev12:48
*** amoralej is now known as amoralej|lunch13:04
*** brinzhang_ has joined #opendev13:06
*** brinzhang0 has quit IRC13:09
*** ysandeep is now known as ysandeep|away13:16
*** brinzhang0 has joined #opendev13:18
*** brinzhang_ has quit IRC13:21
*** d34dh0r53 has joined #opendev13:38
*** amoralej|lunch is now known as amoralej13:38
openstackgerritLucas Alvares Gomes proposed zuul/zuul-jobs master: [dnm] testing devstack 791085  https://review.opendev.org/c/zuul/zuul-jobs/+/79111713:39
*** ysandeep|away is now known as ysandeep13:56
dmsimardbtw: https://news.ycombinator.com/item?id=27153338 "I am resigning along with most other Freenode staff"14:09
toskydmsimard: it seems it's still under discussion14:11
dmsimardtosky: yeah, it doesn't seem like it's a done deal but concerning in any case14:12
dmsimardjust sharing for visibility14:12
fungilucasagomes: according to this it did checkout the depends-on change into src/opendev.org/openstack/devstack: https://zuul.opendev.org/t/zuul/build/7263ee5c71c84cc581deb26b4657dfc9/log/zuul-info/inventory.yaml#60-6914:15
fungiit's possible the zuul-jobs-test-ensure-devstack doesn't install devstack the way a normal devstack job would14:15
gmannfungi: yeah it needs to be mentioned in ensure_devstack_git_refspec https://review.opendev.org/c/zuul/zuul-jobs/+/791117/4/zuul-tests.d/cloud-roles-jobs.yaml#914:18
fungithough it looks like it cloned from there into /opt/devstack and then changed to that directory and ran ./stack.sh14:18
gmannwith new PS it should pickup14:18
fungiahh, okay14:18
fungiahh, i see it got discussed over in #openstack-infra too14:20
*** mlavalle has quit IRC14:41
clarkbfrickler: no that isn't a known issue14:59
clarkbfrickler: the log file was also /var/log/zuul/debug.log on zuul01. I suspect that we added rules for zuul-debug.log in the ansible transition and it was wrong but the old puppet config remained15:00
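(A minimal sketch of the kind of logrotate stanza that would match the actual filename; the rotation options here are illustrative, not the real opendev config:)
    /var/log/zuul/debug.log {
        daily
        rotate 7
        missingok
        notifempty
        compress
        delaycompress
        copytruncate
    }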
clarkbfrickler: I'll take a look at that along with the status json backups today15:00
fungiyeah we've generally failed to clean up old cronjobs created by puppet when switching to ansible15:01
lucasagomesfungi, sorry for the delay, yeah I need to set those ensure_devstack_git_{refspec,version}. Now it seems to be working... before I had only set the depends-on15:01
fungiso that explanation wouldn't surprise me15:01
lucasagomesthanks15:01
fungilucasagomes: no worries, i honestly wasn't sure what the fix was, i just knew that the ensure-devstack tests didn't do quite what the devstack abstract jobs do15:02
fungibecause they're targeted primarily at use in jobs which just need "a devstack" present to interact with, and not focused on testing any of the components which go into devstack itself15:03
funginamely, testing nodepool, where we need some functional openstack as a fixture to test interactions in the openstack provider driver15:03
clarkbinfra-root are we generally happy with zuul's operation other than the log rotation and status backups? Should I give the openstack release team the all clear?15:05
fungiyeah, things seem fine so far this morning. i caught up on all the irc channels and mailing lists i monitor and see no alarms raised15:06
clarkbcool I'll let them know15:06
openstackgerritLucas Alvares Gomes proposed zuul/zuul-jobs master: [dnm] testing devstack 791085  https://review.opendev.org/c/zuul/zuul-jobs/+/79111715:08
clarkbfwiw I think I see the issues on the config management side for both logrotate and status json backups. But I need to load ssh keys and verify against hosts before I push a change up15:09
clarkbalso I need tea15:09
*** mlavalle has joined #opendev15:11
*** tkajinam has quit IRC15:14
openstackgerritClark Boylan proposed opendev/system-config master: Fixup small issues on new zuul scheduler  https://review.opendev.org/c/opendev/system-config/+/79150815:23
clarkbinfra-root ^ I think that will address the issues we've identified so far15:23
*** gothicserpent has quit IRC15:31
*** marios is now known as marios|out15:34
*** amoralej is now known as amoralej|off15:48
*** marios|out has quit IRC15:49
*** ykarel has quit IRC15:52
*** lpetrut has quit IRC16:10
openstackgerritLucas Alvares Gomes proposed zuul/zuul-jobs master: [dnm] testing devstack 791436  https://review.opendev.org/c/zuul/zuul-jobs/+/79111716:11
clarkbfungi: can I get a review on https://review.opendev.org/c/opendev/system-config/+/791508 as zuul is happy with it now?16:15
fungiyeah, can do16:15
clarkbI'll double check things after that lands then start looking at copying log files from the old server16:16
clarkbThen I guess plan to cleanup the old server monday16:18
fungiapproved it, but left comments... i don't see fingergw creating any logs on the new server16:18
clarkbfungi: ya I saw that too and haven't had a chance to look at it. Same situation on the old server too16:18
clarkbI suspect that we don't provide a logging config and it is just going to stdout/stderr?16:18
fungithat's what i assumed16:19
fungibut the logrotate entries are good for when we decide to change that16:19
clarkbyup exactly16:20
clarkbif we fix the logging we don't want to miss the rotation because we helpfully cleaned it up :)16:20
clarkbif I copy the zuul01 log files over and keep them in logrotate .1.gz .2.gz etc will that confuse logrotate when it runs on zuul02?16:22
clarkbI guess I can also manually run logrotate by hand and see what happens16:23
clarkbone thing at a time, first get logrotate into the correct config16:23
*** gothicserpent has joined #opendev16:25
*** gothicserpent has quit IRC16:25
fungiit shouldn't confuse them as long as the names are what it expects. logrotate won't know the difference16:25
clarkbcool16:25
fungilogrotate just looks at filenames, after all16:25
fungiwell, and file size when determining whether to rotate under certain configurations16:26
clarkbwell I know it has to run at least once before it starts rotating because it keeps a record of some sort16:26
fungibut that generally only matters for the active log16:26
*** lucasagomes has quit IRC16:26
clarkbLooking ahead to next week I think it would be good to try and land the mailman ansiblification too before all that context goes away16:27
fungioh, yes absolutely16:27
fungialso i have the base nodeset change to ubuntu-focal scheduled for tuesday, planning to approve that an hour before the meeting16:27
*** jpena is now known as jpena|off16:28
clarkbthe way the changes are stacked is the first change should stop automatic management of the list servers. We can then run it manually against each server (probably with lists.kc.io first as it is simpler) check the results, then land the followup which will add the job to the periodic list16:28
clarkbfungi: ++16:28
clarkbThere is also a hostvars update that needs to be done with that mailman change16:28
clarkbsmall one not a big deal. Just need to remember to do it16:29
clarkbdoes ansible have a noop mode? that would probably be useful in this scenario16:29
fungiansible-test?16:30
clarkbansible-playbook --check looks like16:30
fungiahh, no ansible-test is for conformance testing of collections16:31
*** gothicserpent has joined #opendev16:31
clarkbcheck doesn't provide a way to simulate registered command output though so may not work well for us16:31
fungiand yes, ansible-playbook --help indicates --check is what you're looking for16:31
fungiwell, it does pretty clearly indicate that it only tries to predict what changes might occur when running a playbook16:32
clarkbit may still be useful to see that all the file and directory changes noop16:32
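(A sketch of what a check-mode run from bridge might look like; the playbook path and host limit below are illustrative assumptions, not the real job:)
    # --check predicts changes without applying them; --diff shows would-be file edits.
    # note: tasks driven by registered command output may still behave oddly in check mode.
    ansible-playbook --check --diff \
        -l lists.katacontainers.io \
        playbooks/service-lists.yaml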
openstackgerritMerged opendev/system-config master: Fixup small issues on new zuul scheduler  https://review.opendev.org/c/opendev/system-config/+/79150816:54
fungiclarkb: ^ now we just need it to deploy16:54
fungii'm going to self-approve 791176 so i can proceed with some dnm testing of that16:55
clarkbfungi: sounds good and ya I'll wait for deploy to get zuul updated then take a look at cleaning stuff up and making sure it is happy now16:56
*** timburke_ has joined #opendev16:57
*** timburke__ has quit IRC16:59
clarkbok zuul scheduler has status.json backups now and logrotate updated17:04
clarkbI'll remove the zuul-debug.log config17:04
fungicool17:06
clarkbfungi: `/usr/sbin/logrotate /etc/logrotate.conf` seems to be the command logrotate's systemd timer/service runs do you think it is worth running that by hand now? or just copy the old logs and let it sort it out on its own?17:09
fungii'm indifferent. if you're impatient or don't want to have to sort it out later, then sure run it manually and make sure it's working as intended17:10
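(A hedged example of running it manually; the -d pass is a dry run, and the per-service config path is an assumption:)
    # dry run first: -d (debug) reports what would be rotated without doing anything
    sudo /usr/sbin/logrotate -d /etc/logrotate.conf
    # then a real run, or force rotation of just the zuul entries
    sudo /usr/sbin/logrotate /etc/logrotate.conf
    sudo /usr/sbin/logrotate -f /etc/logrotate.d/zuul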
openstackgerritMerged opendev/base-jobs master: Test VERSION_INFO default for mirror-info role  https://review.opendev.org/c/opendev/base-jobs/+/79117617:10
clarkbwe have plenty of disk there so I don't think it's urgent, I'll just do the file copies for now17:11
clarkbalso I won't bother with the non-debug logs as the debug log should be a superset of the non-debug17:12
clarkbinfra-root logs for scheduler and web have been moved over17:23
fungiawesome, thanks!17:23
fungijrosser: what's an example of a job which was hitting broken version info on bullseye? i'll do some do-not-merge tests of it reparented to base-test now that 791176 has merged and make sure it fixes things there17:25
openstackgerritClark Boylan proposed opendev/system-config master: Sync zuul status json backup list with current tenants  https://review.opendev.org/c/opendev/system-config/+/79152117:33
clarkbthat is another cleanup / sync up I noticed17:33
clarkbI'll cleanup the root screen we used yesterday now17:36
fungisounds good17:38
clarkbinfra-root for gerrit_ssh_rsa_pubkey_contents should we just update the all.yaml value to be what is in private host and group vars? then we can clean up the private host and group vars?17:39
clarkbthen everything should be in sync and far less confusing17:39
clarkbI wonder if one reason we don't do that is gerrit testing?17:39
clarkbwe'd end up writing out the wrong ssh host key for the private key and then things won't be happy?17:39
clarkbwe don't set that var as a test specific var17:40
*** andrewbonney has quit IRC17:41
jrosserfungi: the patch which triggered the bullseye version trouble was https://review.opendev.org/c/openstack/openstack-ansible/+/78360617:41
clarkbya I think that may be the reason it is the way it is17:41
jrosserfungi: though i did add a temporary hack to that so i could keep working on the rest of it https://review.opendev.org/c/openstack/openstack-ansible/+/783606/14/scripts/bootstrap-ansible.sh17:42
jrosserfeel free to adjust that patch to test base-test17:43
clarkbNow I'm thinking the right fix for this is to put the public key for testing in the zuul specific group vars then we can put our prod value in all.yaml. I need to look more closely at stuff before I feel confident in that though17:44
*** timburke_ is now known as timburke17:44
clarkbhrm I bet it is more than just review that needs that in testing though. I bet that is part of the struggle17:48
clarkbhowever, if the current value is only valid in testing and only valid for test gerrit maybe we can address any of those problems as they pop up17:49
*** ysandeep is now known as ysandeep|away18:10
mordredclarkb: if you have a sec - https://review.opendev.org/c/openstack/openstacksdk/+/791023 ... there is a feature/r1 branch for openstacksdk but it's not running functional tests. that patch is an attempt from gtema to fix it - which is an obviously wrong patch. but looking at the branch I can't see why they wouldn't be running18:35
mordredclarkb:  I feel like there is something obvious I'm not seeing18:35
fungibranch restrictions placed on the same job in a master branch?18:37
clarkbyou expect the -ironic job to run?18:50
clarkbI think maybe the problem is actually devstack not having a feature/r1 branch18:53
clarkbiirc with grenade if you want all the child jobs to stop running on $stablebranch in openstack you just delete the job/branch from grenade?18:53
clarkbdo you need a pragma that tells it that this is mapped onto master everywhere else?18:53
fungiyeah, i forget which of branch-override or override-checkout that is18:56
fungithough it should fall back to master18:56
clarkbdo job definitions fall back to master though? I thought they didn't which is why this works for grenade to simply remove the jobs from the old branches18:57
fungioh, maybe that's only true for checkouts and not for job inheritance18:57
clarkbya I think there is a pragma directive that you can use to avoid this problem18:58
clarkbhttps://zuul-ci.org/docs/zuul/reference/pragma_def.html#attr-pragma.implied-branches18:59
clarkb"This may be useful if two projects share jobs but have dissimilar branch names."18:59
clarkbmordred: ^ fyi18:59
fungiaha, i wasn't familiar with that one18:59
openstackgerritMerged opendev/system-config master: Sync zuul status json backup list with current tenants  https://review.opendev.org/c/opendev/system-config/+/79152119:03
clarkboh cool I'll remove the kata crontab entries once ^ has had a chance to apply19:04
openstackgerritClark Boylan proposed opendev/system-config master: Double the default number of ansible forks  https://review.opendev.org/c/opendev/system-config/+/79152819:15
clarkbafter forgetting to use -f 50 on the base playbook I wonder if part of the deploy job throughput slowness is simply using 5 forks by default19:15
clarkbthat change doubles it to 1019:15
fungia worthwhile experiment. do we have a good baseline for the current jobs so we can compare?19:16
clarkbfungi: we have the logs on bridge we can compare19:17
clarkband also job runtimes in zuul19:17
fungiwfm19:17
clarkbI have removed the kata containers status json cron job entries on zuul0219:17
fungiclarkb: i wonder if we can extend the current export script to fetch the zuul tenant list and iterate over it? then we don't have to remember to add and remove cronjobs19:19
clarkbfungi: ya we probably can19:19
clarkbI assume the current dump script does similar but then also does the conversion to reenqueue commands19:20
clarkbwe'd basically want the everything before reenqueue?19:20
fungihttps://zuul.opendev.org/api/tenants seems to give us json we can parse19:20
clarkbthe dump script also needs updates to spit out docker exec commands so maybe we can sort something out in there19:20
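(A rough sketch of that loop, fetching the tenant list from the API and dumping each tenant's status; the backup directory is an assumption:)
    #!/bin/bash
    # illustrative only: back up status.json for every tenant zuul knows about
    backup_dir=/var/lib/zuul/backup
    for tenant in $(curl -s https://zuul.opendev.org/api/tenants | jq -r '.[].name'); do
        curl -s "https://zuul.opendev.org/api/tenant/${tenant}/status" \
            > "${backup_dir}/${tenant}_status.json"
    done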
fungiyeah, though thinking about it, this is all soon moot19:21
clarkboh ya because zuulv519:21
clarkbso ya maybe better to just leave this as is and clean it up when we get to v519:21
fungiso maybe better we just leave it as is. i doubt we'll add or remove tenants before we get to the point that the queues are persisted in zk19:21
*** whoami-rajat has quit IRC19:21
fungiyep, totes19:22
mordredclarkb: hrm. I mean - the issue is that none of the functional jobs are being triggered on patches to that branch: https://review.opendev.org/c/openstack/openstacksdk/+/791527 is running right now19:43
clarkbmordred: ya because they all parent to a job in devstack and devstack doesn't define that job on that branch I think19:44
mordredand it's running dib-nodepool-functional-openstack-centos-8-stream-src but no other functional19:44
mordredOH19:44
clarkbmordred: it's the same situation in reverse when we delete a job in old grenade to stop that running everywhere19:44
mordredit's the parenting19:44
clarkbyes19:44
clarkbto devstack-tox-functional I think19:44
mordredso we want the pragma in the sdk repo to point at master?19:44
clarkbI think so or maybe in devstack-tox-functional to include the r1 branch? I'm not sure which direction would be better19:45
mordreddoesn't really make much sense for the devstack repo to know anything about the feature/r1 branch in sdk19:45
clarkbinfra-root most of us have a few things in our zuul01 homedirs. I'd like to delete the server on Monday if possible. Can you check and make sure you don't have anything in there you want to keep?19:46
clarkbcorvus: ^ you mentioned you have what you want, but not sure if you looked in your homedir? it has a fair bit of stuff19:46
mordrednope. pragma in sdk repo doesn't do anything19:46
mordred:(19:46
mordredis the other option to just delete the stuff in the branch .zuul.yaml and let the master definitions pick up implied branch matchers?19:47
clarkbI don't know if it will fallback that way.19:48
mordredit doesn't19:49
mordredit did not work :)19:49
mordredI'm stumped19:50
mordredlet me try adding the pragma to devstack just to see19:50
mordredI don't think that's the right thing to do - but let's test the hypothesis19:50
fungitried turning on debug in the pipeline?19:52
fungiin the project pipeline i mean19:52
mordredactually - ...19:52
mordrednodepool-build-image-siblings is being run19:53
mordredand it also doesn't have a feature/r1 in the nodepool repo19:53
mordredok. adding the pragma to the devstack repo worked19:53
fungiwacky19:53
mordredoh - wait - is it because devstack has branches ?19:56
mordredwhile nodepool doesn't?19:57
mordred"In the case of an untrusted-project, if the project has only one branch, no implied branch specifier is applied to Job definitions. If the project has more than one branch, the branch containing the job definition is used as an implied branch specifier."19:57
fungiyeah, that would make sense19:57
fungiand matches what clarkb was indicating19:57
clarkbya it is unambiguous in the single branch case19:57
clarkbin the multi branch case it doesn't know what is correct so does the more conservative thing with an outlet to bypass19:58
mordredbut there's no mechanism within the sdk repo to steer this in the right direction? Or would adding an explicit branch matcher to the sdk child job help do you think?19:59
clarkbI half expected the pragma on the child jobs in sdk to do it, but I guess not19:59
clarkbI don't know that an explicit branch matcher would help since it should already implicitly match feature/r1 and setting it to master would do the wrong thing20:00
clarkbthat's interesting, I've just realized that for whatever reason the swap device that was created by launch node on zuul02 is only about 7MB20:43
clarkbI wonder if make_swap.sh isn't working properly on focal when memory is quite large?20:44
clarkbI noticed because i looked at cacti20:44
clarkbI'm not sure what the best approach to fixing that is. Maybe a swapfile on / ? or we can probably schedule a zuul downtime, copy logs off /var/log/zuul, reformat it the way we want, put the logs back, remount and start zuul again?20:45
*** fressi has left #opendev20:46
clarkbianw: ^ review02 has done the same thing20:46
clarkbI suspect this is a bug with focal and large memory hosts20:47
clarkbze01 which is also focal but has less memory looks the way I would expect it20:48
clarkboh interesting zk04 is like these other servers though20:48
clarkbugh20:48
fungiall the new zk hosts, or just 04?20:49
clarkball of them. Looks like zk04-zk06, zuul02, and review02 exhibit this. zm*, ze, nl, nb seem ok20:50
fungiokay, so basically anything we've tried to build on focal recently, i guess20:51
clarkbno that is what is confusing. ze, zm and nl are recent too20:52
clarkbreview02 is older than all of these20:52
clarkbreview02 uses a swapfile not a swapdevice so I suspect this is related to maths in make_swap.sh20:53
clarkbthinking out loud here: since this is a bigger problem than just zuul02 and zuul02 has plenty of memory for the moment I think we should debug the script through the use of asking it to make swapfiles on a test node. Fix the script, then swing around and either add/enlarge swapfiles to zuul02 and zk04-06 and review02 or redo the xvde partitioning on zuul02 and zk04-06 and enlarge the swapfile on20:55
clarkbreview0220:55
clarkbI think redoing the partitioning on zk04-06 will be much easier than zuul02 since it is just mounted as a tiny swap and /opt there20:56
clarkbmaybe other infra-root can take a look at that and we can dig into making those changes next week?20:56
clarkbfungi: my hunch is that the output of some tool has changed to make things more human readable on newer distros and now make_swap.sh works depending on the size of available memory20:57
fungiand calculated on several orders of magnitude less than it should have20:58
clarkbhttps://opendev.org/opendev/system-config/commit/2e629bfb969c444a345503e5bcb0842f2f467f2d I think that did it21:01
clarkbwe want MB not GB so the min of 8 should be min of 819221:02
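(The arithmetic being described, as a sketch; variable names are illustrative and not the actual contents of make_swap.sh:)
    # parted mkpart is being fed megabytes, so the cap has to be in MB too:
    # min(memory, 8) capped swap at 8MB; min(memory, 8192) caps it at 8GB.
    MEMORY_MB=$(free -m | awk '/^Mem:/ {print $2}')
    SWAP_MB=$(( MEMORY_MB < 8192 ? MEMORY_MB : 8192 ))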
clarkbI think the reason ze's and zm's are ok is that I ran launch out of an older checkout in my homedir21:03
clarkbsomething like that21:03
clarkbI'm just trying to double check that parted mkpart wants MB values by default21:07
clarkband I'll push a fix for make_swap.sh after21:07
clarkbthe manpage is completely useless21:07
clarkbhttps://www.gnu.org/software/parted/manual/parted.html#mkpart implies that megabytes are the default21:08
openstackgerritClark Boylan proposed opendev/system-config master: Fix min swap value in make_swap.sh  https://review.opendev.org/c/opendev/system-config/+/79155421:10
clarkbI think that fixes it21:10
clarkbwell for new boots21:10
clarkbanyone know why zk servers get a mostly empty /opt/containerd dir?21:11
clarkb`sudo lsof | grep /opt/containerd` doesn't show any results there so I suspect we can simply copy the contents of /opt to another fs, unmount and repartition xvde, remount and copy /opt back again21:12
clarkbit's trickier with zuul02 because we write the logs to that partition so we have size constraints (may need to trim logs prior to doing this) as well as active services using the device21:12
clarkbconsidering the zk servers have been up for a while now with no apparent issues and zuul02 has significant memory overhead I think I'm going to pause here, let others take a look and make sure I'm not missing anything obvious then we can dive into fixing them when it isn't beer thirty on a friday :)21:14
clarkbBut assuming no one finds anything different I guess I'll try starting with one of the zks on monday21:15
clarkb(and we should probably do a more thorough audit)21:15
fungiyeah, this doesn't seem urgent enough for a friday evening21:16
fungibut i agree we should regroup on monday and not lose track of it21:17
fungii'm happy to help swizzle partitions around on servers next week21:17
clarkblooking at the logs in our hosts file the reason the ze, zm, nl servers are good is that they were launched before the above change21:17
clarkblooking there I've discovered we have two mirror nodes that also exhibit this problem. I think that list is complete21:18
clarkbtwo mirrors, zuul02, zk04-06, and review02 but having a second set of eyes double check would be appreciated21:18
clarkbof those I suspect the only one that really poses a problem is zuul0221:18
fungithe mirrors might21:18
clarkbthe mirrors and review02 should all be swapfiles, we can simply make a new bigger swapfile21:19
clarkbfungi: well the mirrors aren't in rax so don't have an xvde so should use swapfiles21:19
fungiahh, yeah if they're not partitions of the ephemeral disk then it's easy21:19
clarkbwe can just swapoff, rm the swapfile, make a new larger swapfile, swapon21:19
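(Those steps as a sketch; the swapfile path and size are assumptions:)
    sudo swapoff /swapfile
    sudo rm /swapfile
    sudo dd if=/dev/zero of=/swapfile bs=1M count=8192
    sudo chmod 600 /swapfile
    sudo mkswap /swapfile
    sudo swapon /swapfile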
fungiagreed, simple as long as memory isn't exhausted at the time21:19
clarkbso maybe we do those first next week, then try fixing xvde on a zk since they are redundant, if that goes well enough we do all the zks and then plan for zuul02 outage21:20
clarkbianw: can you please review https://review.opendev.org/c/opendev/system-config/+/791554 and read the scrollback about make_swap.sh when your weekend ends?21:20
fungifor zuul02 we could do it with two scheduler restarts and a temporary cinder volume21:21
clarkbfungi: I think we may just have enough space on / to copy the logs over21:21
fungior even just one restart if we pause long enough to move the active logfile over and back. if we force a logrotate before we start that could even be so small it requires very little outage21:21
clarkbcurrently need 16GB for all the zuul logs (this will probably grow a bit as the non debug logs grow since I didn't copy those over) and we have about 35GB free on /21:22
fungiahh, yeah that'll hold through monday at least21:22
clarkbfungi: it didn't take super long to copy the logs from one fs to the other on zuul02 after I copied them from 01 to 0221:22
clarkbI think we can probably stop the scheduler and web stuff on 02, copy the logs to a staging dir on /, unmount, partition, format, remount, copy logs back then start zuul again21:23
fungiso we can force a logrotate, copy all the compressed logs to the rootfs, stop the scheduler, copy the active log to the rootfs, redo partitioning on the ephemeral disk, move the active log back to /opt, start the scheduler, then move all the compressed logs back to /opt21:23
clarkbnote it isn't /opt on zuul02 it is /var/log/zuul, but ya21:24
fungier, right21:24
fungiprepping the repartitioning/formatting commands would also help shorten the outage21:24
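(A rough sketch of what those prepped commands might look like for the rackspace ephemeral disk named above; the partition sizes, swap split, and mount point are assumptions, not a vetted maintenance plan:)
    # relabel and repartition the ephemeral disk (destructive!)
    sudo parted -s /dev/xvde mklabel msdos
    sudo parted -s /dev/xvde mkpart primary linux-swap 1 8192
    sudo parted -s /dev/xvde mkpart primary ext4 8192 100%
    # format and bring the new layout into service
    sudo mkswap /dev/xvde1
    sudo mkfs.ext4 /dev/xvde2
    sudo swapon /dev/xvde1
    sudo mount /dev/xvde2 /var/log/zuul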
clarkbthe two mirrors with this problem are osuosl and inmotion fwiw21:25
fungihappy to put together a maintenance plan on an etherpad on monday, fires permitting21:25
clarkbfungi: that would be great! I suspect that for the swapfile hosts we don't need such a thing but something like that for the zks and zuul02 would be great21:25
fungifor the zks i wouldn't even bother, we can take one node out of rotation at a time since it's a redundant cluster?21:26
clarkbwe can and good point21:26
fungimainly concerned with the zuul scheduler host21:26
fungibut it shouldn't be hard to shorten that one to almost as quick as a straight up scheduler restart21:27
clarkb++21:27
clarkbfwiw zuul01 has a 30GB swap partition but the min change made to make_swap.sh intended for it to have an 8GB partition21:28
clarkbI think I'm ok with 8GB in that case21:28
fungiyeah, we also dropped the ram for 02 anyway right?21:28
clarkbwe did not. I had planned to but corvus requested that we don't21:29
fungiahh, okay21:29
fungionce we have redundant schedulers we can change that fairly easily though21:29
clarkbyup21:31
clarkbseems like while this is annoying none of the services that got hit by it are immediately having trouble from it21:32
clarkbI'm going to step out for a bit now and enjoy some sunshine. Back in a bit21:32
fungigo enjoy, i sat on the patio and grilled hamburgers and corn21:33
fungiit was lovely21:33
clarkbnice21:34
fungiwaiting for the hardware store to tell me my chopsaw is ready for pickup21:34
*** timburke has quit IRC22:08
*** timburke_ has joined #opendev22:08
*** dpawlik has quit IRC22:45
*** dpawlik7 has joined #opendev22:52
*** tosky has quit IRC23:11
*** mlavalle has quit IRC23:46
