Thursday, 2021-05-13

clarkbI suspect it is the first name in the name list that determines the output file. I'm not sure https://review.opendev.org/c/opendev/system-config/+/791060 will fix it?00:13
clarkbmaybe instead we should just swap the order of the names in the zuul02 list and have zuul.opendev.org come first?00:13
ianwohh, you know what, i think you're right00:21
fungibut also interpolating the filename in the vhost configs makes sense00:23
*** openstackgerrit has joined #opendev00:26
openstackgerritIan Wienand proposed opendev/zone-opendev.org master: Add acme challenge for zuul01  https://review.opendev.org/c/opendev/zone-opendev.org/+/79106900:26
openstackgerritIan Wienand proposed opendev/system-config master: zuul-web : use hostname for LE cert  https://review.opendev.org/c/opendev/system-config/+/79106000:26
ianwit might be worth doing ^ anyway just for consistency of having the cert cover the hostname as well as the CNAMEs00:27
openstackgerritIan Wienand proposed openstack/diskimage-builder master: bootloader: remove extlinux/syslinux path  https://review.opendev.org/c/openstack/diskimage-builder/+/54112900:34
openstackgerritIan Wienand proposed openstack/diskimage-builder master: Futher bootloader cleanups  https://review.opendev.org/c/openstack/diskimage-builder/+/79087800:34
openstackgerritIan Wienand proposed openstack/diskimage-builder master: Add fedora-containerfile element  https://review.opendev.org/c/openstack/diskimage-builder/+/79036500:44
*** whoami-rajat has quit IRC01:13
*** brinzhang has joined #opendev02:03
openstackgerritSteve Baker proposed openstack/diskimage-builder master: Add element block-device-efi-lvm  https://review.opendev.org/c/openstack/diskimage-builder/+/79019202:46
openstackgerritSteve Baker proposed openstack/diskimage-builder master: WIP Add a growvols utility for growing LVM volumes  https://review.opendev.org/c/openstack/diskimage-builder/+/79108302:46
*** hemanth_n has joined #opendev02:52
*** brinzhang_ has joined #opendev03:21
*** brinzhang has quit IRC03:24
openstackgerritIan Wienand proposed openstack/diskimage-builder master: [WIP] test devstack  https://review.opendev.org/c/openstack/diskimage-builder/+/79109104:07
*** ralonsoh has joined #opendev04:31
*** ykarel has joined #opendev04:38
*** brinzhang0 has joined #opendev04:50
*** brinzhang_ has quit IRC04:54
*** marios has joined #opendev04:58
*** hemanth_n has quit IRC05:00
*** hemanth_n has joined #opendev05:00
*** vishalmanchanda has joined #opendev05:06
*** darshna has joined #opendev05:08
jrosseri think /etc/ci/mirror_info.sh might be broken for bullseye due to missing VERSION_ID05:22
*** slaweq has joined #opendev06:26
*** ykarel has quit IRC06:46
*** jpena|off is now known as jpena06:48
*** zbr has quit IRC06:49
*** zbr has joined #opendev06:51
openstackgerritIan Wienand proposed zuul/zuul-jobs master: ensure-devstack: allow for minimal configuration of pull location  https://review.opendev.org/c/zuul/zuul-jobs/+/79111606:57
openstackgerritIan Wienand proposed zuul/zuul-jobs master: [dnm] testing devstack 791085  https://review.opendev.org/c/zuul/zuul-jobs/+/79111707:00
openstackgerritIan Wienand proposed zuul/zuul-jobs master: ensure-devstack: allow for minimal configuration of pull location  https://review.opendev.org/c/zuul/zuul-jobs/+/79111607:03
openstackgerritIan Wienand proposed zuul/zuul-jobs master: [dnm] testing devstack 791085  https://review.opendev.org/c/zuul/zuul-jobs/+/79111707:03
*** lucasagomes has joined #opendev07:04
*** amoralej|off is now known as amoralej07:06
*** andrewbonney has joined #opendev07:25
*** tosky has joined #opendev07:47
*** jaicaa has quit IRC08:36
*** jpena is now known as jpena|lunch11:32
*** jhesketh has quit IRC11:43
fungijrosser: if it's the same problem as before, it's because base-files 11 is missing version information which base-files 11.1 will provide once it migrates to bullseye11:56
fungiso ansible can't find a version and substitutes "n/a"11:56
jrosserhrrm, is there a workaround anyone has been using for this?11:58
funginot sure, to be honest. i mean, bullseye isn't released yet so it makes some sense that ansible doesn't recognize it12:15
fungii expect the ansible community considers it to be working as designed12:16
jrosseri've tried to patch stuff to insert VERSION_ID=11 into /etc/os-release12:17
fungiyou could try diffing /etc/os-release between base-files 11 and 11.1 and see if it's one of the other missing values which is needed12:18
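A quick way to see the symptom on a bullseye node is to source os-release the same way mirror_info.sh does; with base-files 11 (pre-release) VERSION_ID simply is not there:

    # /etc/os-release is plain shell assignments, so source it directly
    . /etc/os-release
    # prints "unset" on base-files 11, a number ("10", "11.1", ...) elsewhere
    echo "${VERSION_ID:-unset}"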
jrosserhttps://zuul.opendev.org/t/openstack/build/fa59a627370949be96e3982d31683837/log/job-output.txt#284412:19
*** amoralej is now known as amoralej|lunch12:21
*** jhesketh has joined #opendev12:21
fungijrosser: the source for that is here: https://opendev.org/opendev/base-jobs/src/branch/master/roles/mirror-info/templates/mirror_info.sh.j2#L26-L3412:24
fungiis there a fallback value we could grab when VERSION_ID is unset, do you think?12:25
fungii'm setting up a bullseye machine now to see if i can get any ideas12:25
fungiall mine are either sid (which has had base-files 11.1 for months) or buster12:25
*** jpena|lunch is now known as jpena12:33
jrosserfungi: there is potential fallback information in the node info | localhost | Distro: Debian 11.012:33
jrosserbut i guess there are two slightly different things, making mirror_info.sh robust and then, somewhat separately, the ansible n/a version12:34
fungijrosser: yeah, so this is the diff between base-files 11 and 11.1 os-release files: http://paste.openstack.org/show/805351/12:39
fungiand this is the diff of /etc/debian_version as a possibility: http://paste.openstack.org/show/805353/12:41
jrosseroh, hmm https://zuul.opendev.org/t/openstack/build/5132e46c48f64f4ba324b70f94d86eab/log/zuul-info/host-info.debian-bullseye.yaml#126-13212:41
jrosserso it might not be unreasonable for VERSION_ID to fall back to ansible_distribution_major_version12:43
fungiit does mix contexts a bit, but yeah we could essentially use ansible jinja interpolation to "hard code" a fallback value into the script12:44
jrossergiven that the script is a template that should be doable12:44
fungiwe could even switch to setting those variables with ansible and just using the values from /etc/os-release as fallbacks, though that's more likely to introduce regressions12:46
jrosserindeed - that's why i thought it was a good question for here as i'm sure there's good reason for how it is now12:47
fungiwell, lots of this grew out of shell scripts we ran with jenkins in the long-long ago, in the beforetime12:48
jrosserreally the motivation here is to get bullseye working ASAP even though it's unreleased, as that lets us drop a decent %age of CI jobs maybe a cycle earlier12:48
jrosserwhat with bionic/focal centos8/stream buster/bullseye the support matrix is really full right now12:48
fungiyep, i totally get that. also worth noting the only thing we ultimately use VERSION_ID in is the wheel mirror url, so that could be reworked as well maybe12:51
fungii feel like templating in a fallback string for the VERSION_ID assignment when it's unassigned or maybe even just preassigning before sourcing /etc/os-release would be the safest solution12:54
jrosserjust to make it super obvious there could be an ANSIBLE_DISTRIBUTION_MAJOR_VERSION={{ ansible_distribution_major_version }} so it's really clear when someone looks at the generated script12:56
fungiwell, code comments in the script work too12:57
jrosserof course :)12:57
fungii'll push up a prototype and we can hash it out in review12:57
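The shape of the fallback being discussed is roughly the following (a sketch only; the prototype change pushed below may do it differently, and the exact Jinja expression is an assumption):

    # mirror_info.sh.j2: pre-seed VERSION_ID from the Ansible fact, then let
    # /etc/os-release override it once released base-files provides a value
    VERSION_ID={{ ansible_distribution_major_version }}
    source /etc/os-release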
openstackgerritJeremy Stanley proposed opendev/base-jobs master: Test VERSION_INFO default for mirror-info role  https://review.opendev.org/c/opendev/base-jobs/+/79117613:02
openstackgerritJeremy Stanley proposed opendev/base-jobs master: Revert "Test VERSION_INFO default for mirror-info role"  https://review.opendev.org/c/opendev/base-jobs/+/79117713:02
fungijrosser: for changes to base-jobs content, because it's a trusted repo where we don't get to take advantage of speculative job config changes, we change the base-test job first and then we can try do-not-merge changes in untrusted repos which set base-test as the parent for some obvious jobs13:04
fungiif dnm changes parenting jobs to base-test work after 791176 merges, then we would merge the revert and propose a similar change to the normal mirror-info role used by the base job13:05
jrosserah ok13:06
*** amoralej|lunch is now known as amoralej13:11
*** DSpider has joined #opendev13:12
*** hemanth_n has quit IRC13:23
mnasiadkaI started to notice ntp not being started on debian nodepool instances, timedatectl says "NTP service: inactive" - not all the time, but every 2nd-3rd CI run in kolla-ansible - any idea what might be wrong?13:36
fungido you collect the system journal on those builds? or maybe syslog? it will probably have some indication13:41
mnasiadkantpd claims it's running, but not synchronized - so maybe it's just a timedatectl flaw that it has problems detecting ntpd running13:42
fungicould it be that ntpd simply hasn't settled yet by the time you're checking it?13:43
mnasiadkawell, I'm fine with unsynchronized, I'm not really fine with timedatectl saying NTP service: inactive - but maybe timedatectl has some problems checking ntpd (it's rather tied to systemd-timesyncd)13:45
fungiyeah, seems like a very systemd-centric tool13:47
fungiwhich i suppose is fine if you're running systemdos13:47
fungiand also don't care that much about precision and discipline in your time sources13:52
fungimnasiadka: do you have an example build with that i can look at?13:53
fungii would expect it to say something like "NTP synchronized: no"13:54
mnasiadkafungi: I think it's really that Kolla-Ansible prechecks rely on timedatectl (which ignores ntpd - it only checks for systemd-timesyncd) - but I don't think we've seen them fail in the past on Debian. recent build: https://935f2aace51477baa019-09dce2ec9ab39d19fdc97cba82216d08.ssl.cf2.rackcdn.com/787701/6/check/kolla-ansible-debian-source/b491ad2/primary/logs/ansible/deploy-prechecks13:55
fungithe syslog it collected from that node claimed ntpd started at 12:35:16 and was instructed to accept large time jumps to get the clock in sync13:57
fungiMay 13 12:35:17 debian-buster-rax-ord-0024662430 ntpd[649]: error resolving pool 0.debian.pool.ntp.org: Temporary failure in name resolution (-3)13:58
fungiso it was having trouble with dns resolution there13:58
fungibecause ntpd started before unbound13:58
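For anyone retracing this on a held node rather than from collected logs, the checks amount to something like the following (assuming the Debian ntp package, whose unit is ntp.service):

    systemctl status ntp     # is ntpd actually running?
    ntpq -pn                 # does it have any reachable peers yet?
    timedatectl status       # what systemd thinks of the clock
    journalctl -u ntp | grep -i resolv   # the name resolution failures at startup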
mnasiadkaoops13:59
openstackgerritClark Boylan proposed opendev/system-config master: Add zuul02 to inventory  https://review.opendev.org/c/opendev/system-config/+/79048113:59
openstackgerritClark Boylan proposed opendev/system-config master: Clean up zuul01 from inventory  https://review.opendev.org/c/opendev/system-config/+/79048413:59
clarkbfungi: ianw  I left a review on https://review.opendev.org/c/opendev/system-config/+/791060 which is what triggered my updates above13:59
clarkbI'm thinking we do the swap then we can improve the cert configs after? I think that simplifies stuff as it's one less system to worry about LE on14:00
fungimnasiadka: well, it's built to handle that, it starts polling timeservers just after unbound starts, according to syslog14:00
fungithough as of 12:40:35 it still complains "kernel reports TIME_ERROR: 0x41: Clock Unsynchronized"14:00
clarkbfungi: ianw if you agree I think we can probably proceed to try and do https://etherpad.opendev.org/p/opendev-zuul-server-swap today. I'm feeling much better14:00
fungiclarkb: yeah, that sounds fairly straightforward14:00
fungithough systemd reported "Reached target System Time Synchronized.14:02
fungiat 12:35:1214:02
fungiwhich was before ntpd started?14:02
fungithe "Clock Unsynchronized" errors are apparently endemic of a system clock which is too erratic for ntpd to properly discipline14:04
fungiso maybe that's what's going on14:05
mnasiadkamight be, will add some debug and check14:15
clarkbfungi: I also want to test my playbook at https://review.opendev.org/c/opendev/system-config/+/790487 before we do the swap over so will look at that after bootstrapping my morning. If that looks good and zuul02 looks good I think we can proceed with the swap whenever others are ready14:21
*** sshnaidm is now known as sshnaidm|afk15:00
*** lucasagomes has quit IRC15:03
fungiclarkb: any opinion on the approach in 79117615:33
fungi?15:33
clarkbfungi: basically use the ansible value first and then let os-release override. If os-release doesn't override then at least we have something? that should work15:36
fungiyeah, i mean, i expect it to work. mainly that there are several ways we could go about solving it and i was aiming for the one with the least chance to cause regressions in behavior15:37
clarkbfungi: I think my only concern would be that the ansible fact and the os-release value may have different value types? like one could be a number and the other a string in some situations? But for this weird pre-release debian situation it should be fine?15:40
fungii think it's always going to be a string in the end because it's a text file template to a shell script15:41
fungiso even numbers are strings15:41
*** marios is now known as marios|out15:43
clarkbright I meant more at a high level like for ubuntu can one be 20.04 and the other Focal Fossa15:45
clarkb11 vs bullseye etc15:45
clarkbI'm not worried about that to the point where we can't make the change though15:47
*** marios|out has quit IRC15:47
clarkbos-release is the winner and that will preserve existing behavior for us which should be sufficient15:47
openstackgerritMerged opendev/system-config master: Add zuul02 to inventory  https://review.opendev.org/c/opendev/system-config/+/79048115:52
clarkbI'm ssh'd into zuul02 ^ and running a tail on syslog watching for ansible15:53
clarkbit will probably be a few minutes before it gets there though. I'll try to keep an eye on it15:53
*** jpena is now known as jpena|off16:01
clarkbI think there is an ssh host key problem with new zuul02 and bridge16:16
clarkbI ran the script to scan the dns ssh key records and that usually populates things properly16:18
clarkbnot sure what is going on yet16:18
clarkband now I'm grumpy that ssh reports errors using a sha256 hash of the key and keyscan gives you the base64 encoding of the key16:21
clarkbyou'd think having an option to make those line up would be done by now16:22
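There is in fact a way to line those up: feed the keyscan output to ssh-keygen, which prints the same SHA256 fingerprints the client error message uses (the known_hosts path here is assumed to be root's on bridge):

    # fingerprints of whatever the server is currently offering
    ssh-keyscan zuul02.opendev.org 2>/dev/null | ssh-keygen -lf -
    # fingerprints of every entry bridge already has on record
    ssh-keygen -lf /root/.ssh/known_hosts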
clarkbbase has already failed and LE should fail next16:23
clarkbfungi: ^ do you see what may have happened there?16:23
clarkbssh keys on the host were generated may 10 which is when I booted it16:25
clarkbI found my sudo sshfp.py command on bridge and that looks correct16:25
clarkbok I think I see what happened, there must've been IP reuse16:28
clarkbwe have an earlier entry in known_hosts with a different key but the file was last updated around when I booted the new instance and the last entry in the file matches what I see when running keyscan on localhost16:28
clarkbI'll just remove the older entry and that should make things work16:28
clarkbhrm it is still asking me for a key16:29
clarkbok the current entry seems to be for the ipv6 address but ansible uses ipv416:31
clarkbnow they are both in there with what appears to be the correct key based on on host ssh keyscanning16:32
clarkbLooks like we bailed out of the runs for that change merge16:32
clarkbfungi: ^ should I run base, LE, and then zuul by hand?16:32
fungiclarkb: yeah, sorry didn't get a chance to look yet but i agree we fail to clean up stale records for hostkeys of deleted servers in the known_hosts file so i can see where it would cause problems. in the future we might consider generating known_hosts instead16:34
fungiand yes running those playbooks seems safe16:35
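For the manual cleanup itself, openssh's own tooling covers both halves; the addresses below are placeholders rather than the real ones:

    # drop any stale entries for the reused v4/v6 addresses
    ssh-keygen -R 203.0.113.10 -f /root/.ssh/known_hosts
    ssh-keygen -R 2001:db8::10 -f /root/.ssh/known_hosts
    # re-add the new server's current keys for both addresses
    ssh-keyscan 203.0.113.10 2001:db8::10 >> /root/.ssh/known_hosts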
clarkbok I'll start base now16:35
clarkber let me wait for the puppet else run to finish to avoid any conflicts16:35
clarkbbase is running now16:39
*** amoralej is now known as amoralej|off16:40
clarkbI forgot to do -f 5016:41
clarkbthis might be a while. I'll touch the ansible stoppage file on bridge if it gets closer to 1700 UTC to avoid conflict with the hourly runs16:42
*** timburke_ has joined #opendev16:45
clarkb#status log Ran disable-ansible on bridge to avoid conflicts with reruns of playbooks to configure zuul0216:46
openstackstatusclarkb: finished logging16:46
*** timburke has quit IRC16:48
clarkbif anyone is wondering you really do not want to forget the -f 5016:58
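The difference is just the fork count on the invocation; ansible-playbook defaults to 5 forks, so the usual run from the system-config checkout on bridge looks something like this (playbook path assumed):

    ansible-playbook -f 50 playbooks/service-base.yaml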
fungicereal execution17:04
fungias in go eat a bowl of some and check back later17:05
fungior watch the zuul episode on openshift.tv17:05
clarkblooks like I also need to run service-borg-backup.yaml in my list of playbooks17:10
clarkbbase is done. There were a few issues on other hosts like the rc -13 and apt-get autoremove -y on a few hosts being unhappy. I don't think those affect zuul so will proceed17:18
clarkbletsencrypt playbook is running now17:19
fungithese would also have been rerun in the daily job, right?17:24
clarkbyes, but that doesn't happen for another 12 hours or something17:24
clarkband I want to maybe get a zuul swap in today17:24
clarkbI'm hoping I can get zuul02 all configured, we can double check it, eat lunch, then come back and run through the plan on the etherpad17:25
clarkbdepending on how this goes maybe tomorrow we can land mailman updates too17:26
clarkbwe'll see :)17:26
fungiyeah, i wasn't suggesting we wait, just checking that it would have been run within a day under normal circumstances17:27
clarkbyup they would be17:27
*** andrewbonney has quit IRC17:28
clarkbborg backup is now done. Next is running the zuul playbook17:28
clarkbactually I may need to run zookeeper first to update the firewall rules there17:29
clarkbdouble checking on that17:29
clarkbya zk playbook comes before zuul playbook17:30
clarkboh nevermind base runs the iptables role too so this is already done (but doesn't hurt to run service-zookeeper.yaml anyway)17:31
fungisure17:31
clarkbI realized this after I started it and it reported a bunch of noops :)17:31
clarkbok that's done. I'm going to run service-zuul.yaml now. Remember we don't expect this to cause problems because we shouldn't start zuul services on the new scheduler. But keep an eye open :)17:33
clarkbI notice that we may install apt-transport-https on newer systems that no longer need it17:36
fungiyeah, i think it's supported directly on focal?17:41
clarkbthe start containers task was skipped on zuul02 for the scheduler (we expected and wanted this)17:42
clarkbfungi: ya17:42
clarkbreloading the scheduler failed (I guess I should've expected this too, going to check if there are any tasks we want that run after that)17:43
clarkbotherwise looks good from the ansible side17:43
*** ralonsoh has quit IRC17:43
clarkbah that is a handler so it should happen after everything else17:44
clarkbI think that means we are good17:44
clarkbzuul-web has a handler too that I don't see firing to reload apache2 so maybe I'll just do that by hand to be double sure17:44
clarkbthat is done. infra-root can you look over zuul02.opendev.org and see if it looks put together to you?17:45
clarkbNote: we do not want zuul containers to be running there yet ( and they are not according to docker ps -a )17:45
clarkbI'm going to test my gearman server config update playbook against ze01 and zm01 next17:46
clarkbthat's done and looks good. I have restored the zuul.conf states on those two hosts to what they should be for now17:50
clarkbI will remove the disable ansible file now17:50
clarkbcorvus: ^ you've probably got the best sense for what a zuul scheduler should look like. Any chance you may have time to look at zuul02.opendev.org? (Not sure when your openshift.tv thing ends)17:57
clarkbspecifically the things I'm less sure of are the zk and gearman certs/keys/ca17:58
clarkbI'm going to take a break and start heating up some lunch.  https://review.opendev.org/c/opendev/zone-opendev.org/+/790482 is the DNS update change that we will need to manually merge during the swap, reviews on that would be great. Also looking over https://etherpad.opendev.org/p/opendev-zuul-server-swap if you haven't yet and double checking zuul02 looks happy18:03
clarkbone difference I notice is that /opt/zuul/ doesn't exist on zuul02. We run the queue dumping script out of that so it isn't strictly necessary to have on zuul02 for this swap (and we could clone it to one of our homedirs if we need to)18:06
clarkband now really finding food18:07
corvusclarkb: are you having second breakfast?18:14
corvusoh lunch18:14
clarkbcorvus: the kids break from school at ~11am for lunch so I end up eating late breakfast/early lunch often18:15
clarkbI'm waiting for the oven to preheat and just found a problem with zuul02: cannot ssh to review.o.o due to host key problems18:15
corvusclarkb: yeah, i assume we just cloned /opt/zuul at some point.  it's not kept updated.  we could just manually do that again.18:16
corvus/var/log/zuul looks good (sufficient space)18:17
clarkbwe do write out a known_hosts at /home/zuuld/.ssh/known_hosts with our gerrit and the opendaylight gerrit host keys in it but at least ours doesn't seem to work for some reason (I've not cross checked the values yet)18:17
clarkbI'll keep looking at this after food if no one else beats me to it18:18
fungicould be that openssh on the new ubuntu is expecting a different key format?18:18
fungithough i would have expected our integration jobs to find that if so18:19
clarkbfungi: it reported it used rsa in the error18:19
clarkb`ssh -i /var/lib/zuul/ssh/id_rsa -p 29418 zuul@review.opendev.org gerrit ls-projects` is the command I ran as zuuld on zuul02 if anyone else wants to check it18:19
corvusi see a different value reported by the server18:19
corvusssh-keyscan != known_hosts18:19
corvusknown_hosts on zuul01 == zuul0218:20
clarkbI was just going to say I wonder how zuul01 works, is it possible we bind mount some other bath?18:20
clarkbs/bath/path/18:20
clarkbanyway oven needs attention. Back in a bit18:20
corvusssh test on 01 fails too18:20
corvusso afaict, they are ==18:21
corvusmaybe client.set_missing_host_key_policy(paramiko.WarningPolicy()) actually also means "don't fail if they differ"18:24
fungicomparing /home/zuuld/.ssh/known_hosts right?18:25
fungithat seems to be what we bindmount into the container18:26
*** d34dh0r53 has quit IRC18:32
fungiyeah, seems that's the one18:33
fungiclarkb: oh! it's sha2-256 rather than sha118:35
fungii think that's the problem?18:36
fungias for why it's not breaking for the current server, corvus's explanation seems reasonable18:36
fungior maybe paramiko is still using sha118:36
clarkbfungi: gerrit doesn't serve the sha2-256 though iirc18:38
clarkbso that would all have to be client side for user verification18:38
clarkbbasically that shouldn't have any impact on the known_hosts file, its purely a wire thing18:39
clarkband gerrit shouldn't even attempt it becuase its sshd doesn' support it18:39
fungiahh, maybe i'm just thrown off by the sha2 fingerprint18:40
clarkbcorvus: does zuul set that client policy? I suspect that may be it if we somehow magically work18:40
clarkbalso I feel like we've fixed this before (it was a port 22 vs 29418 mixup or something along those lines). I wonder if the changes never merged18:40
corvusclarkb: yeah that's a paste from zuul code18:41
fungiit does seem like the ssh hostkey hash in the known_hosts there doesn't match what i get for either 22 or 29418 on the current server18:43
fungithe one i get for 29418 matches this: https://opendev.org/opendev/system-config/src/branch/master/inventory/service/host_vars/review01.openstack.org.yaml#L7318:45
fungicould we be prepopulating with something other than gerrit_self_hostkey?18:45
clarkbfungi: its https://opendev.org/opendev/system-config/src/branch/master/inventory/service/group_vars/all.yaml#L118:46
fungilooks like we're using https://opendev.org/opendev/system-config/src/branch/master/inventory/service/group_vars/all.yaml#L118:46
clarkbthe gerrit vars aren't necessarily exposed to the zuul hosts in ansible18:47
clarkbmaybe the easiest thing to do is update all.yaml to match the gerrit value for now?18:47
fungiyeah18:47
clarkbok I'm still juggling food. I can write that change after I eat (or feel free to push it I won't care too much :) )18:48
fungii'm double-checking that gerrit_ssh_rsa_pubkey_contents isn't also used for something different18:48
*** amoralej|off is now known as amoralej18:51
*** amoralej is now known as amoralej|off18:52
fungithis supposedly writes it out on the gerrit server: https://opendev.org/opendev/system-config/src/branch/master/playbooks/roles/gerrit/tasks/main.yaml#L107-L11318:52
*** DSpider has quit IRC18:55
fungibut the value in ~gerrit2/review_site/etc/ssh_host_rsa_key.pub doesn't match the gerrit_ssh_rsa_pubkey_contents value, it matches the key part from gerrit_self_hostkey18:56
fungisame goes for review02... this is most perplexing18:57
clarkbfungi: do we override them in private vars?18:57
fungioh, could be18:58
clarkbya looks like we do though I'm not quite in a spot to cross check values. One thing I notice is we don't set it for the zuul schedulers but do for mergers and executors18:59
fungiyes, gerrit_ssh_rsa_pubkey_contents is overridden in 7 different places in private host_vars and group_vars :/18:59
clarkbI think that is likely how we end up with the wrong value on the scheduler18:59
clarkbI wonder if that all.yaml value in system-config is largely there for testing18:59
clarkbI'm thinking maybe we update the zuul-scheduler.yaml group var to include this, rerun service-zuul.yaml, double check ssh works, then make a todo to clean this up?19:00
clarkbfungi: if that sounds reasonable I can do the group var update for zuul-scheduler.yaml now19:01
clarkbactually I'm beginning to wonder if the wires are super crossed here19:01
fungiit feels like we're reusing pubkey file contents as known hosts entries, but not very well19:02
clarkbin the zuul executor file it almost feels like this is the value for zuul's ssh pubkey but I need to read that ansible19:03
clarkbno nevermind I think we do the right thing we just don't set this value at all on the zuul-scheduler group. However, known_hosts is written by the base zuul role which we run against the zuul group19:05
clarkbI think the short term fix here is to set that var in group_vars/zuul.yaml then rerun service-zuul.yaml19:06
fungiyes, agreed, it would at least be consistent with how we're doing the other zuul servers19:06
clarkband then push up a change to all.yaml undefining it and sprinkle it into the testing vars as necessary?19:06
clarkbI'll do the bridge update now19:06
fungilonger term we should probably do something about the divergence between gerrit_ssh_rsa_pubkey_contents and gerrit_self_hostkey in system-config19:07
fungiyeah, maybe that19:07
clarkbok that's done, want to check the git log really quickly? then I'll rerun service-zuul.yaml19:08
clarkbfungi: ^19:09
fungiyeah19:10
fungiyep, that looks like the correct value19:11
clarkbok running service-zuul.yaml now19:11
clarkbAssuming that works are there other sanity checks people think we should run?19:13
fungimaybe make sure you can reach the geard port from some other servers?19:14
clarkbfungi: ya, I'm not sure how to do that with the ssl stuff but I guess we can try to sort that out. Similarly ensure that zuul02 can connect to the zk cluster19:15
fungioh, though zuul won't be running so, right19:15
clarkbya you'd need to bootstrap some stuff19:16
fungiyou'd need a fake geard listening on the port19:16
clarkbprobably doable, possibly a lot of work19:16
clarkbfungi: yes and one that does ssl19:16
fungiif it's not working at cut-over i guess we can sort it out then19:16
clarkbls-projects works now. There is a blank line between the two known hosts entries now though. I don't think this is an issue but maybe it is?19:17
clarkbI can ls-projects on the other gerrit in the known hosts so ya seems to not be a problem19:19
clarkbI can connect to zk04 from zuul02 using nc. Not sure how to set it up to use the ssl and auth stuff19:21
clarkbiptables and ip6tables report that port 4730 is open for gearman connections at least19:21
clarkbwithout doing a ton of additional faked-out bootstrapping and learning to drive a zk client with ssl by hand I'm not sure there is more connectivity checking we can do19:22
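Short of a real client, openssl can at least confirm that the TLS listeners are up and that the CA signs the server certs; port 4730 is from this conversation, while the zookeeper port and cert paths are assumptions:

    # gearman with TLS on the new scheduler
    openssl s_client -connect zuul02.opendev.org:4730 \
        -CAfile /etc/zuul/ssl/ca.pem </dev/null
    # zookeeper's TLS client port
    openssl s_client -connect zk04.opendev.org:2281 \
        -CAfile /etc/zuul/ssl/zk-ca.pem </dev/null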
clarkbhttps://etherpad.opendev.org/p/opendev-zuul-server-swap I think I'm fast approaching step 7's "when happy" assertion19:23
clarkbs/step/line/19:23
clarkbI also gave the openstack release team a heads up a little while ago19:23
clarkbzuul seems fairly idle too. Maybe let people look at 02 for another hour or so and then plan to proceed with the swap? If I can get a second set of hands to help with things like manually merging the dns update and updating rax dns that would be great. Then I can focus on the ansible-playbook stuff19:25
fungiyeah, sounds fine. i'll be around19:50
fungiwas hoping to go out for a walk this afternoon, but our pest control people were due to come today and they still haven't shown up, so probably not getting out for a walk19:51
corvusclarkb: when do you want to start?  any chance it's ~now?20:04
*** clayg has quit IRC20:09
*** fresta has quit IRC20:09
*** jonher has quit IRC20:09
*** clayg has joined #opendev20:09
*** jonher has joined #opendev20:09
clarkbcorvus: in about 15-20 minutes?20:10
clarkbbut I guess I can speed up and do ~now :)20:10
*** mhu has joined #opendev20:11
clarkbI've started a root screen on bridge20:11
corvusclarkb: i'm ready to help whenever you are :)20:11
clarkbthanks!20:11
corvus(and my next task is a power-out ups maintenance, so i can't really overlap :)20:12
clarkbcorvus: can you review https://review.opendev.org/c/opendev/zone-opendev.org/+/790482 but don't approve it yet? one of the steps in the plan is to manually merge that in gerrit after we stop zuul20:13
clarkb(and maybe you want to get ready to be able to manually merge that when the time is appropriate?)20:13
clarkbI'll tell the openstack release team we will proceed shortly20:14
corvuslgtm20:14
clarkbok I ran the disable-ansible script to prevent the hourly jobs from getting in our way20:15
clarkbI've notified the openstack release team. I think I'm ready to dump queues and stop zuul. Once zuul is stopped someone can submit https://review.opendev.org/c/opendev/zone-opendev.org/+/790482 in gerrit20:16
corvusi'll look up how to do that :)20:16
clarkbcorvus: I think you do the promotion thing via ssh with your admin creds to promote your regular user, then unpromote after being done, or you can try to do it as the admin user alone20:17
clarkbcorvus: do you think I should proceed with dumping queues and stopping zuul? nothing to wait on on your end?20:17
corvusyeah pulled up the docs at https://docs.opendev.org/opendev/system-config/latest/sysadmin.html#gerrit-admins20:17
corvusclarkb: i think you're gtg20:17
*** fressi has joined #opendev20:18
clarkbok queues dumped. Stopping zuul next20:18
clarkbzuul is stopped I think you can approve the dns update whenever you are ready20:19
clarkbs/approve/submit/20:20
corvusack20:20
openstackgerritMerged opendev/zone-opendev.org master: Swap zuul.opendev.org CNAME to zuul02.opendev.org  https://review.opendev.org/c/opendev/zone-opendev.org/+/79048220:21
*** fressi has quit IRC20:21
clarkbonce that shows up on gitea servers I'll run the nameserver playbook20:21
corvussubmitted20:21
clarkblooks like it is there20:21
clarkbrunning nameserver playbook now20:21
clarkbmy record ttl is under a minute now. Will check it resolves properly before proceeding20:23
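The check here is just a CNAME lookup, against the local resolver and against one of the authoritative servers directly (nameserver name assumed from the opendev.org zone):

    dig +short CNAME zuul.opendev.org
    dig +short CNAME zuul.opendev.org @ns1.opendev.org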
corvuszuul.opendev.org. 300 IN CNAME zuul02.opendev.org.20:23
clarkbzuul.opendev.org. 300 IN CNAME zuul02.opendev.org.20:24
corvusthat's from my local resolver20:24
clarkbyup I see the same thing. Proceeding with updating gearman config on executors and mergers20:24
clarkbcorvus: maybe you want to do a status notice? also I think the rax update for zuul.openstack.org is less urgent but that needs doing too20:24
* fungi joins the screen session late-ish20:24
corvusfungi: ^ rax updates maybe?20:25
fungiyeah, i can get that now20:25
corvusinfra-root calls "not it" on ttw dns updates ;)20:25
clarkbok the gearman config update lgtm on zm04 so I'll proceed to the next step whcih is starting zuul again20:26
clarkbcorvus: you ready for ^?20:26
corvusclarkb: yep20:26
clarkbthings are started20:27
clarkblooks like zuul01 was properly ignored20:27
clarkbs/started/starting/20:27
corvusstatus notice Zuul has been migrated to a new VM.  It should be up and operating now with no user visible changes to the service or hostname, but you may need to reload the status page.20:27
corvusclarkb: how's that for sending in a minute or so?20:27
clarkblgtm20:27
clarkbI'm going to sort out copying the queues.sh script now while we wait for it to come up20:28
corvusit's saving keypairs.  i'll double check some shasums20:28
fungiokay, zuul.openstack.org cname has been updated to point at zuul02.opendev.org, ttl was already 5min20:28
corvusfungi: what about updating it to point to zuul.opendev.org?20:29
fungioh, could do that too20:29
clarkbfungi: ya I checked its ttl a few days ago and its was already low20:29
corvusit's a double cname, but at this point we're probably not worried about network efficiency for folks still using it20:29
funginsd should be smart enough to return both records when asked for the zuul.opendev.org cname, at least i know bind does it that way20:29
corvusspot checks of some keys match on both hosts20:29
corvusgood point20:30
clarkbok I've just realized that to restore queues I need the zuul enqueue command to be present. I suspect this may be easiest if I start a shell on the scheduler container and run it there?20:30
clarkbI assume we don't want a global install anymore?20:30
corvusclarkb: ++20:30
fungiokay, zuul.openstack.org is now updated to be a cname to zuul.opendev.org20:30
clarkbfungi: thanks20:30
fungisaves us a step on future server changes20:30
corvusclarkb, fungi: did we not make a "zuul" alias?20:31
corvusi think we did that with nodepool20:31
corvusanyway, container shell for now, then later we can make a one-liner to have "zuul" do "docker-compose exec ...."20:31
fungioh, shell command alias/wrapper?20:31
corvusfungi: ya20:31
clarkbcorvus: I don't see any using `which zuul`20:32
corvushuh, i can't find that on nodepool either20:32
corvusprobably a change from mordred sitting in review20:32
clarkbcorvus: zuul02:/root/queues.sh has been edited to do docker exec can you check that and see if it looks right?20:32
fungilooks like /usr/local/bin/zuul on the old server is the old style entrypoint consolescript20:32
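The one-liner wrapper corvus mentions could be as small as this; the container name matches the docker exec commands used later in this log, though nothing like it was actually deployed here:

    #!/bin/sh
    # /usr/local/bin/zuul: run the zuul CLI inside the scheduler container
    exec docker exec zuul-scheduler_scheduler_1 zuul "$@"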
clarkbI think zuul is up now20:33
clarkbI'm ready to run the queues.sh script if it looks good to yall20:33
corvusclarkb: lgtm.  will be slow, but script isn't long.20:33
clarkbok running it now20:33
fungifrom here i get to the webui20:33
funginothing enqueued yet20:33
fungii take that back, four changes enqueued20:34
openstackgerritJames E. Blair proposed opendev/system-config master: Fix typo in gerrit sysadmin doc  https://review.opendev.org/c/opendev/system-config/+/79131420:34
clarkbthey should be showing up now ya20:34
corvusthere's another for ya :)20:34
fungimake that five ;)20:34
corvusretry_limit20:35
clarkbya but that's an airship job that does that a lot? may or may not indicate a problem20:35
clarkboh I'm seeing a number of other retries now20:35
corvusother attempts20:35
clarkbprobably not airship specific20:35
clarkbyaml.constructor.ConstructorError: could not determine a constructor for the tag '!encrypted/pkcs1-oaep' I see that on ze0120:37
corvusclarkb: we may be out of version sync20:37
corvusprobably need to run the pull playbook then a full restart20:37
clarkbcorvus: ok20:37
fungioh, yes that makes sense, that work did just merge20:37
corvus(my guess is scheduser version is > executor)20:37
fungiso executors/mergers need restarting20:38
clarkbrunning pull now20:38
corvus#status notice Zuul is in the process of migrating to a new VM and will be restarted shortly.20:38
openstackstatuscorvus: sending notice20:38
corvusclarkb: in the mean time, can you save queues again (to a second file)?20:38
-openstackstatus- NOTICE: Zuul is in the process of migrating to a new VM and will be restarted shortly.20:38
clarkbya though I may need to do it on 0120:39
clarkbcorvus: the pulls seem to report they didn't do updatse?20:39
corvusclarkb: probably ok20:39
corvus(probably images were already local)20:39
clarkbbut why would we be out of sync in that case?20:39
corvus(probably only restarts are needed, but good to double check)20:39
clarkbnot sure I understand that last message20:40
fungiclarkb: executors weren't restarted when the images updated20:40
corvusoh hrm, if there really was a global restart with everything up to date then...20:40
fungiwere they?20:40
clarkbfungi: the output from the zuul_pull.sh implies this is the case20:40
clarkbPulling executor ... status: image is up to date for z...20:41
fungi20:27 start time, so yeah i guess they were20:41
corvusclarkb: let's make sure the image on zuul02 is up to date -- did the pull playbook do that?20:41
clarkbchecking20:41
openstackstatuscorvus: finished sending notice20:41
clarkbPulling scheduler ... status: image is up to date for z...20:42
clarkblooks like it20:42
corvusclarkb: can we do one more full stop / start just to make sure we got everything?20:42
corvusthen if it happens again, we'll call it a bug20:42
clarkbyes, I'll do the dump then stop20:42
*** vishalmanchanda has quit IRC20:43
fungidefinitely worrisome that executors couldn't parse the secrets from zk20:43
clarkbstopped, running start now20:43
corvusfungi: that was parsing the secrets over gearman20:43
fungiohh20:44
fungiright that's still going through gearman20:44
clarkbwe should be coming back up again20:44
corvuswe ship secret ciphertext over gearman now, then executors decrypt, we only got to the "decode off the wire" stage on the executor, not quite as far as the decrypt step20:44
corvusfungi: if they got to the decrypt step, they would get the keys from zk20:44
fungiyeah i forgot the secrets hadn't moved to zk as part of the serialization work20:45
corvusthe latest scheduler and executor images were built from the same change20:45
corvus(i checked docker image inspect on zuul02 and ze10)20:45
corvus"org.zuul-ci.change": "788376", on both20:45
clarkbI've prepped zuul02:/root/queues.new.sh20:46
fungii'm being hovered over to start heating a wok, but will try to be quick20:48
*** slaweq has quit IRC20:49
clarkbweb loads again20:49
clarkbcorvus: should I re-enqueue a change maybe?20:50
corvusthere's one already20:50
clarkbyup see a couple now actually20:50
clarkblooks like they are retrying20:50
clarkbI see the same traceback on ze01 concurrent with the newer restart20:51
corvuswell, that's fascinating20:51
corvusi wonder if yaml is different on our image builds, or if there's a path we're not testing20:51
clarkbI6d94c1d8da8b68e5fb60c27e73039155a02fb485 maybe?20:52
corvusoh that's certainly the change that broke it, but i don't see how20:53
clarkbgotcha20:53
corvusthere's *extensive* testing of secrets20:53
clarkbI suspect that the 13 day old executor image on ze01 isn't the one we want to run with as a fallback20:55
corvuswe need a sync'd executor and scheduler image20:55
corvusand unfortunately i don't think we tagged our last restart20:55
corvusbut i think we should be able to fall all the way back to 4.2.0?20:56
clarkboh that's an idea20:56
corvus(especially now that the keys are on disk)20:56
corvusclarkb: i think that's my vote20:56
clarkbok let me see what that looks like as far as changes we need to make20:57
corvusoh... 1 thing20:57
corvusthat might have the old repo layout20:57
corvusyep20:58
clarkbyes it does20:58
corvusthat will use extra space and cause a longer restart, but not a blocker20:59
corvus(we might as well just rm -rf before restarting to clean up the extra space)20:59
clarkbok so we're still good to proceed. How do we want to do the modification to zuul's docker-compose configs? Should I just manually do that on my checkout of system-config on bridge?21:00
corvusclarkb: yeah, why don't you do that21:00
corvusclarkb: i'll run the stop playbook now, and delete the git caches21:00
clarkbok my computer has decided now is a fine time to swap or something21:00
clarkbI'll work on the docker-compose updates though21:01
clarkbimage: docker.io/zuul/zuul-scheduler:4.2.0 that look right?21:03
corvusyep21:03
clarkband I'm doing that for all the zuul services21:04
corvus++21:04
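A quick way to double-check that every service got the same pin after the manual edit (assuming the compose files live under the zuul roles' templates in the system-config checkout):

    grep -rn 'image: docker.io/zuul/zuul-' playbooks/roles/zuul*/templates/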
fungiokay, dinner's cooked and i'm back21:05
clarkbcorvus: ok that's done. Are you ready for me to run service-zuul.yaml and then rerun the gearman config updater?21:05
corvusoh because the playbook will write old config files?21:05
openstackgerritAde Lee proposed zuul/zuul-jobs master: Add role to enable FIPS on a node  https://review.opendev.org/c/zuul/zuul-jobs/+/78877821:05
corvusclarkb: yes i am ready21:05
clarkbok running it in the screen, and yup, because it will update the config file to point at old zuul21:06
clarkbso we just rerun the fixer playbook after too21:06
corvusclarkb: in fact, i'm now ready for you to proceed all the way to restart21:06
corvus(i have stopped everything and cleared out the git repo caches)21:06
fungicaught up again, so looks like we did a rollback to 4.2.0 everywhere21:06
fungior are in progress with that21:07
clarkbfungi: we're doing that21:07
fungiyep, don't let me interrupt21:07
fungibut lmk if there's something i can help with21:08
clarkbthe mergers are pulling 4.2.0 now21:11
clarkbthen it will be executors then scheduler images21:11
corvuswhile we're waiting, i have a revert staged locally; i'd like to merge that today and restart into it, verify it works, then tag it (timing on that will obviously depend on what happens next, but that's the process i'd like to do)21:13
fungimakes sense21:13
clarkbsounds good to me21:14
fungii agree i'd have expected the rather extensive testing to pick that up before the change merged though, so the fix is guaranteed to be an interesting one21:15
clarkbok service-zuul is done no errors and rc is 021:16
clarkbrunning the config fixup next21:16
clarkbcorvus: that is done. I think we are ready to start again if you are ready?21:16
clarkbdo you want to run the start playbook or should I?21:17
corvusclarkb: go for it21:17
clarkbok running zuul_start.yaml now21:17
clarkband done21:18
clarkbthis startup will take longer bceause repos need to be cloned right?21:18
corvusyes21:19
corvusthis should get a lot faster soon because the files that the cat jobs are asking for are going to be persisted in zk21:20
fungithat'll be nice21:21
corvuswe're actually almost at the point of doing that (i think those changes are just coming ready for review now)21:21
corvusaside from the fact that they now have 2 things ahead of them instead of just one :)21:22
clarkbis there a way to see progress? it seems idler than I would expect21:22
clarkbwhen tailing the scheduler debug log21:22
corvusexecutor/merger logs and grafana21:22
clarkbthe merger queue is supposedly 0?21:23
corvusthe zuul job queue is not high which means we haven't sent the cat jobs out yet21:23
clarkbzm01 hasn't merged anything in about 5 minutes21:23
corvuswe're... ratelimited on github?21:24
corvusit's possible we're sitting in a sleep in the github driver waiting to query more branches21:25
fungimaybe the successive restarts pushed us over query quota there21:25
corvusyeah, i'm like 90% sure that's what's happening21:26
clarkbhrm I see we're submitting a small number of cat jobs since the most recent start and they seem to all be things hosted on opendev?21:26
clarkbbut maybe I'm looking at the wrong logs stuff21:26
corvus2021-05-13 21:20:52,327 WARNING zuul.GithubRateLimitHandler: API rate limit reached, need to wait for 386 seconds21:27
corvusopendev is the first tenant, openstack is after that21:27
corvusand the github projects are in openstack21:27
clarkbgot it21:27
corvusthe 5m has expired and it's running again21:27
clarkband ya I see it doing a lot of work now21:28
clarkband zm01 is happily busy21:28
corvusshould be able to watch progress on 'zuul job queue' in grafana21:28
corvushttps://grafana.opendev.org/d/5Imot6EMk/zuul-status?viewPanel=19&orgId=121:28
clarkbthanks21:28
rm_workso, zuul good again? :D21:29
fungirm_work: close21:29
corvusabout 4,000 git clones away, we hope :)21:29
fungiat least we think so. hard to say for sure until we see it actually run some jobs successfully21:29
rm_workheh21:29
rm_worksitting here, finger hovering the return key on a `git review`21:30
fungisince this is a totally new server for the scheduler, there's plenty which could go wrong21:30
clarkbassuming this fixes things I would like to continue to land the followup changes, but can leave DISABLE-ANSIBLE in place and do the dns and service-zuul.yaml playbook runs by hand after rebasing my 4.2.0 update onto the zuul01 cleanup change.21:34
clarkbthen we can remove DISABLE-ANSIBLE when we start corvus' plan for revert and applying the revert and all that21:34
clarkbnova clones have started. Hopeflly we move quickly after that21:35
*** darshna has quit IRC21:38
corvus1200 bottles of beer on the wall....21:42
fungiclone one down, checkouts abound...21:43
mordred1500 bottles of beer on the wall ...21:45
corvusnearly done21:45
clarkbits up21:45
clarkbshould I enqueue a change?21:45
corvusclarkb: i just did in zuul21:46
clarkbk switching tenants on the dashboard21:46
clarkbI see a console log21:47
clarkbhttps://zuul.opendev.org/t/zuul/stream/e9f59ab4d01f4f729ec844cba722456b?logfile=console.log21:47
corvusand it's actually running playbooks21:47
corvusclarkb: maybe re-enqueue now?21:47
clarkbcorvus: will do21:47
clarkbin progress now21:48
clarkband actually now that I think about it I think we're ok to keep the 300 ttl and leave zuul01.openstack.org in the emergency file if we just want to remove DISABLE-ANSIBLE and proceed with revert stuff21:49
clarkbit's a fairly safe steady state, just with a low ttl; we can clean up in the further-out near future21:49
fungiyep21:49
clarkbthe only issue is the gearman server directive on zm* and ze*21:50
clarkbI can push a new change that only updates that21:50
clarkbwe have a successful tox-linters job against zuul21:50
corvuson a change uploaded by me, no less.  shocking!21:51
clarkbfungi: corvus: opinions on fixing the gearman server config? do we want to blaze ahead and land the existing changes to do that or would you prefer we stay somewhat nimble with changing ttls and cleaning up zuul01 and I can push a change that only updates the gearman config21:51
clarkb(we're steady state right now due to DISABLE-ANSIBLE)21:52
clarkbrm_work: I think we're cautiously optimistic at this point if you want to push21:52
corvusclarkb: i say go all the way with existing changes21:53
rm_work:P21:53
rm_workthanks21:53
clarkbcorvus: ok wfm.21:53
clarkbfungi: corvus: can you review https://review.opendev.org/c/opendev/zone-opendev.org/+/790483/ ?21:53
clarkbthen next is https://review.opendev.org/c/opendev/system-config/+/79048421:53
clarkbthat will swap us back to latest but we want that for revert testing anyway21:54
corvusclarkb: was already doing that, +2 on both21:54
clarkbthanks!21:54
corvusclarkb: mind if i go offline for a little while?  maybe 30m?21:54
clarkbcorvus: sure things seem happy enough21:55
clarkbcorvus: before you go any reason to not remove DISABLE-ANSIBLE?21:55
ianwo/ ... just reading the excitement in scrollback ...21:55
corvusi'd like to squeeze this ups work in while building/updating is happening21:55
clarkbit will revert our gearman config and our docker-compose updates21:55
clarkbI'm ok with manually rerunning those playbooks again if necessary21:55
corvusclarkb: don't think that's a problem; i don't think we're going to auto restart anything21:55
clarkbcorvus: ok I'll do that now so that merging those changes can automatically do the right thing21:55
openstackgerritAde Lee proposed zuul/zuul-jobs master: Add role to enable FIPS on a node  https://review.opendev.org/c/zuul/zuul-jobs/+/78877821:56
clarkband done21:56
fungiclarkb: what was in need of fixing with the gearman server config?21:56
corvusclarkb, mordred, fungi: maybe you can go ahead and w+1 the revert?21:56
fungiyup21:56
corvusso that we get new images asap21:56
clarkbfungi: the old config points mergers and executors at zuul01.openstack.org. As part of the upgrade I ran an out of band playbook to set it to zuul02.opendev.org instead so that we could control when they swapped over21:57
clarkbI have approved the dns ttl cleanup21:57
clarkbfungi: the zuul01 cleanup change makes that value zuul02.opendev.org permanently21:57
clarkbcorvus: yup looking at that change now21:58
fungiahh, yep21:58
corvusbiab.21:58
clarkbianw: tldr is we swapped zuul01 for zuul02 and discovered some recent changes to zuul executors were not happy. We have since rolled zuul02 back to zuul 4.2.0 which seems to be working21:58
*** corvus has quit IRC21:58
clarkbianw: we'll roll forward again with a revert of the zuul executor changes to see that that properly addresses the issue then zuulians can work on fixes21:59
clarkblooks like fungi got the zuul revert change so that is in the pipeline21:59
fungiwe rolled everything back to 4.2.0 right, not just the zuul02 scheduler?22:00
clarkbfungi: correct22:00
clarkbalso to be clear zuul01 is not in use and is in the emergency file (this makes running the zuul stop/start/etc playbooks safe)22:01
fungianyway, cleanup changes are approved22:01
clarkbshould we do a #status log We are cautiously optimistic that Zuul is functional now on the new server. We ran into some unexpected problems and want to do another restart in the near future to ensure a revert addresses the source of that problem.22:02
clarkber should that be #status notice?22:02
clarkboh you know what we didn't copy is the timing dbs but meh22:02
fungia bit wordy, but sure it works22:02
clarkb#status notice We are cautiously optimistic that Zuul is functional now on the new server. We ran into some unexpected problems and want to do another restart in the near future to ensure a revert addresses the source of that problem.22:03
openstackstatusclarkb: sending notice22:03
-openstackstatus- NOTICE: We are cautiously optimistic that Zuul is functional now on the new server. We ran into some unexpected problems and want to do another restart in the near future to ensure a revert addresses the source of that problem.22:03
openstackgerritMerged opendev/zone-opendev.org master: Reset zuul.o.o CNAME TTL to default  https://review.opendev.org/c/opendev/zone-opendev.org/+/79048322:03
openstackstatusclarkb: finished sending notice22:06
ianwok cool thanks.  i guess the zuul gate isn't tied up in devstack issues at least22:06
clarkbianw: yup and re devstack it sounds like gmann wants to revert and do the ovn switch properly22:10
clarkbrather than try and fix every new random issue that was masked by using the CI jobs to do the switch and not the devstack configs22:10
clarkbservice-nameserver job has started22:11
clarkbI had to reapprove 790484 because fungi approved it before its dependency merged. But that is done now22:13
ianwclarkb: yeah, i think i'm on board with the revert too, because it's not quite as simple to just stick that new file in and be done with it22:13
clarkbianw: ya it basically entirely relied on changing the job config to work in the job and won't work anywhere else22:14
ianwi think it's probably worth making ensure-devstack install devstack from the zuul checkout and running that in the devstack gate as a check on this22:14
clarkbI see the new TTL for zuul.opendev.org so that looks good22:15
clarkbianw: ya devstack could run a job that uses it with an up to date checkout so that pre merge testing works but otherwise its not doing much22:15
clarkbat this point I think we are just waiting on the zuul01 cleanup change and corvus' revert in zuul. I haven't seen anything from zuul jobs that have run to indicate any major problems22:16
clarkbthe one problem I've identified is we didn't copy the timing dbs over to the new server so we don't have that data but it's not the end of the world22:16
ianwi get that the ensure-devstack role is explicitly about using devstack, not testing it, but "can i use devstack" is a pretty good devstack test as well :)22:16
clarkb++22:17
clarkbthe hourly service-zuul.yaml run is likely to run before my zuul01 removal change lands. What this means is our docker-compose.yaml files on scheduler+web, mergers, and executors will be updated as well as the zuul.conf on mergers and executors to remove manual changes we have made22:24
clarkbthis should be fine as long as we put those changes back again before doing a restart22:24
clarkbin fact I may just run the gearman config fixup playbook after that run finishes as we want the docker-compose changes to go away to do the revert22:25
fungiwhich may be easier than temporarily adding them all to the emergency list22:25
clarkbyup22:25
clarkband when we get to the restart point we only need the gearman fixes in place which means as long as I've put those back we could potentially restart things on the revert before zuul01 cleanup has fully applied (then I can just manually go through it after the fact or let periodic catch it tonight)22:26
clarkbservice-zuul.yaml is done now. I'll rerun the out of band gearman config fix now22:53
clarkband that is done. We should be able to restart zuul safely on the revert now (docker-compose is back to latest) as long as 790484 runs before the next hourly pass22:54
*** corvus has joined #opendev22:56
corvuso/22:57
clarkbcorvus: all seems to be going well so far22:58
clarkbcorvus: still waiting on the zuul01 removal to land, but I went ahead and reran the gearman config fixup after the previous hourly service-zuul.yaml ran22:58
corvuscool, looking at eavesdrop now...22:58
clarkbcorvus: I think that means we're in an ok spot right now to restart on the zuul revert (docker-compose should point to latest now and gearman configs are correct)22:58
corvusi *think* we can copy over the timing db if we want22:59
corvusbut also, it'll sort itself out soon :)22:59
clarkbcorvus: I'm not super worried about it because ya it will handle itself22:59
fungii may even clean itself up a bit22:59
fungier, it may22:59
clarkbif we are going to do a restart maybe wait for 790484 to land first so that we can get that in without waiting again (it should be soon I think)22:59
corvuslooks like the revert just landed23:00
clarkbis this restart going to be another slow one because it will clone into the other repo paths?23:00
corvusclarkb: yep :(23:00
clarkb(wondering if we should keep the old repos around or if that complicates stuff somehow)23:00
corvusno choice, zuul is going to delete them this time23:00
clarkbgot it23:00
corvus(i thought about leaving the other ones around, but it would have been more surgery to delete the current ones when we switch back because zuul *wouldn't* do it in that case)23:01
clarkbso ya my only suggestion then is maybe wait for 790484 to land I think it may only be ~10 minutes away? then do restart?23:01
corvus(because it would have thought it already did the delete of the old scheme)23:01
corvuscool, i have some screws i still need to attach, i'll go do that :)23:01
clarkbianw: oh btw I ended up just changing the order of the altnames in the cert to fix the issue you were looking at. I was trying to keep moving parts to a minimum but if we still like those improvements (the log capture for sure) we can do followups to land them23:05
clarkband this way we don't need to add more acme records to openstack.org23:05
clarkbthat will just get deleted next week :)23:05
ianwyeah that's cool.  i think i had in my head the idea that we'd mostly use the inventory_hostname for consistency in general, but whatever works23:07
openstackgerritMerged opendev/system-config master: Clean up zuul01 from inventory  https://review.opendev.org/c/opendev/system-config/+/79048423:09
clarkbcorvus: ^ my timing was really good too :)23:10
corvusapproximately 10 minutes :)23:10
corvusclarkb: so what's next?23:10
clarkbcorvus: I just looked at zm01 and ze01 and they both have the correct docker-compose (latest) config as well as gearman configs23:11
clarkbcorvus: maybe run the pull script and double check that image looks like the one we want on the hosts and we can restart?23:11
corvusrunning pull now23:12
clarkbfor dumping queues on zuul02 we may need extra tooling. But previously I ran the dump on zuul01 and copied it to zuul02 after zuul01 was stopped with no problem23:12
clarkbcan probably just use zuul01 for the queue dump still23:12
corvusi'll try on 223:13
corvuspython3 ~corvus/zuul/tools/zuul-changes.py https://zuul.opendev.org23:14
corvusthat works23:14
clarkbcool23:14
corvusclarkb: then we need to mutate the output right?23:14
clarkbcorvus: yes you need to prefix the entries with the docker exec command one sec23:14
clarkbdocker exec zuul-scheduler_scheduler_1 prefix the data with that23:15
clarkbon each line23:15
corvusi'm going to make a custom zuul-changes for now23:15
clarkb/root/queues.sh is an example23:15
clarkbif you run that as not root you will also need a sudo at the front23:15
corvusi'll add it in so it works either way23:16
corvus ~root/zuul-changes.py https://zuul.opendev.org23:16
corvussudo docker exec zuul-scheduler_scheduler_1 zuul enqueue --tenant openstack --pipeline gate --project opendev.org/openstack/neutron-lib --change 791134,123:16
corvusproduces output like that ^23:16
clarkbthat looks right to me23:17
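Equivalent to the modified script, the prefixing can also be done with sed over the stock tool's output (assuming its usual plain "zuul enqueue ..." lines), using the script path corvus quotes above:

    python3 ~corvus/zuul/tools/zuul-changes.py https://zuul.opendev.org \
        | sed 's|^|sudo docker exec zuul-scheduler_scheduler_1 |' \
        > /root/queues.new.sh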
corvusclarkb: zuul-promote-image failed for the revert23:17
corvuslooks like it may have failed after the retag though23:17
corvus(failed deleting the tag)23:17
clarkbah ya that races sometimes iirc23:18
corvusi'll check dockerhub and see if they look ok23:18
clarkb++23:18
fungiso long as that was all that failed, i guess we should still be clear to pull23:18
corvushrm, not looking good23:20
corvusi think it promoted zuul but not zuul-executor23:20
corvusi'll try re-enqueing that23:20
clarkbcorvus: ok23:21
*** tosky has quit IRC23:23
clarkbone of the jobs failed but I think it was a js one?23:23
clarkb(on the reenqueue)23:23
fungiif it's the tarball publication, yeah that's been broken23:24
corvusyeah, image promote looks good now23:24
corvusi'll pull again23:24
corvusdone23:27
corvusclarkb: i guess we save/restart/reenqueue now?23:28
clarkbcorvus: I think so.23:28
corvusrunning that now23:28
clarkbthe gearman config still looks good23:28
corvuscat jobs are going; we can watch the graph again23:33
clarkbseems like it went faster this time, but still waiting for the layout to be loaded23:38
corvusyeah, we may have one job still running?23:39
corvusand it may time out, which may mean we have to restart :?23:39
corvusi'm not sure why it went faster23:39
corvusit was slower than the earlier restarts, but faster than the last one23:40
corvusthat last job finished23:40
clarkbI think zm01 didn't have its old repos cleaned up23:40
clarkbwhich may explain the speed23:41
corvusclarkb: oh yep, i forgot to do that on the mergers23:41
corvusi'll shut them down and clean them up in a bit23:41
corvusloaded now23:41
clarkbok. opendev%2Fsystem-config shows an older system-config head but it should be updated if it actually gets those merge jobs so I don't think that is a problem23:41
clarkblooks like web is responding again23:42
corvusre-enqueing23:42
corvusbtw, mouse over the 'waiting' labels on the tripleo-ci jobs23:43
clarkbhttps://zuul.opendev.org/t/openstack/stream/ed1ad277335145788067faf26d412417?logfile=console.log that seems to be actually doing something?23:44
clarkboh nice on the mouse over23:44
clarkbspot checking some jobs this looks happy23:45
clarkbThe next thing to do is probably to manually run service-zuul in the foreground and ensure that we noop? We can wait for zuul to be a bit further along in its happiness before we do that though23:46
corvusgimme a sec to manually clean up the mergers23:47
clarkbyup no rush23:47
clarkbhrm a number are stuck at "updating repositories" but I suspect that is because they need to fetch/clone repos/changes23:48
corvusyep23:48
clarkbah yup just saw at least one proceed from that state23:48
corvussince the cat jobs didn't do much priming23:48
corvusre-enqueue finished23:48
fungioh, is it on-demand clones now?23:49
fungiahh, for the executors23:49
corvusooh, check out the waiting mouseovers on the system-config deploy queue item23:50
clarkbI was just doing that :)23:50
corvusmergers are back up and running23:51
corvus#status log restarted zuul on commit ddb7259f0d4130f5fd5add84f82b0b9264589652 (revert of executor decrypt)23:51
openstackstatuscorvus: finished logging23:51
clarkbI've got the service-zuul.yaml command queued up in the screen if anyone wants to look at it. I think we're ok to run that now and get ahead of the deploy jobs and just ensure that it noops23:53
clarkbI'll run that at 00:00 if I don't hear any objections23:54
corvusclarkb: lgtm23:54
fungiyep, ready23:55
clarkbok I'll go ahead and run it then23:55
clarkb(now I mean)23:55
corvuswe have succeeding jobs23:55
fungiso much green23:55
fungiboth in the screen session and the status page i guess23:55
ianwclarkb: ++ be good to know it's in a steady state23:59
