Thursday, 2021-12-02

clarkbfungi: completely unrelated to ^ at 00:00 I always get exim panic log emails from lists. I assume those are actually old panics and we might be able to clear them out? But I'm not clued into the dark arts of email to know for sure00:05
*** rlandy|ruck|biab is now known as rlandy|ruck00:06
*** rlandy|ruck is now known as rlandy|out00:30
fungiclarkb: the one from lists.o.o looks like something was going on around 07:05:52-07:07:34 tuesday and again at 00:01:59 wednesday which caused contention for access to /var/spool/exim4/db/retry.lockfile, possibly just collisions between deliveries for different mailman sites?00:38
fungier, no would have to be between mailman processes i suppose00:39
fungimaybe something else was locking it temporarily, but i can't imagine what00:39
fungis/mailman processes/exim processes/ i meant00:39
fungias for new debugging info in the zuul build inventories, is that the playbook context stuff?00:40
Clark[m]Yup the new playbook context00:43
corvusyeah, that's a bunch of info that was only in the executor logs earlier; should help advanced users figure out what zuul did in complex situations00:45
fungiawesome, thanks!00:48
corvuswe should be good to do a rolling restart of schedulers+web whenever convenient to pick up the bugfix01:57
corvusi'll start on that now02:38
corvuszuul02 scheduler is restarting02:41
corvusthis time i'm just doing: docker-compose down; docker-compose up -d02:41
corvusthat seems to be working well so far02:41
corvus02 is done; restarting 01 now02:52
corvusah, this time zuul01 took too long to shut down and docker killed it; so i think we still need to tune that.02:54
corvusi think that means i'm a chaos monkey and we just tested "kill a scheduler while it's in the middle of re-enqueuing all changes in a pipeline".  that appears to have worked fine.02:58
ianwhaha i've been called worse03:01
corvusi'm going to restart the web server now; so expect status page outage03:03
corvuslooks like everything is up now03:12
opendevreviewMerged opendev/system-config master: Upgrade to gerrit 3.3.8  https://review.opendev.org/c/opendev/system-config/+/81973303:14
ianwi was going to sneak ^ in but you beat me to it :)03:16
corvusoh sorry...03:21
corvusit looks like there's a problem with the periodic-stable pipeline; it may be a result of my chaos-monkey action03:22
corvusi'm going to see if i can manually correct it; otherwise we may need a full shutdown/start03:22
opendevreviewMerged openstack/diskimage-builder master: Fix BLS based bootloader installation  https://review.opendev.org/c/openstack/diskimage-builder/+/81885103:26
corvusokay, i performed zk surgery to completely empty the periodic-stable pipeline and am now re-enqueuing it.  i'll try to figure out what went wrong from the log files tomorrow03:30
corvusthere are a lot of failures in that pipeline now; i can't tell if they're legitimate, or if it has something to do with the 00000 commit sha they are all enqueued with03:34
corvusi think it's too uncertain and we should just drop the queue03:35
corvuswhich is unfortunate since we have no way to restore it03:35
corvusi've done that now.03:36
corvusstatus summary: everything is up and running, but we won't have periodic-stable results for today03:37
corvusi'm out for the night03:44
ianwthanks for looking after it!  i'm sure i would have got it helplessly tangled up :)03:45
*** ysandeep|out is now known as ysandeep|ruck04:33
*** pojadhav- is now known as pojadhav05:22
*** ysandeep|ruck is now known as ysandeep|afk05:52
*** ysandeep|afk is now known as ysandeep|ruck06:15
*** raukadah is now known as chandankumar06:51
*** ykarel__ is now known as ykarel07:08
fricklerclarkb: fungi: I'm still trying to clean up exim paniclogs on other servers, but I didn't get mail from lists.o.o, likely because the aliases there were never updated. also excluding ianw, not sure if intentional or not07:14
fricklermost of the locking errors seem to be happening at logrotate time, which with the focal upgrade seems to have moved from 06:25 to 00:00?07:17
fricklerI'll see whether one can tune the timeout07:18
*** ysandeep|ruck is now known as ysandeep|lunch07:23
*** ysandeep|lunch is now known as ysandeep08:28
*** ysandeep is now known as ysandeep|ruck08:28
ianwfrickler: i didn't intentionally not update, i think just never got around to it!09:05
Unit193fungi: Well, not quite what we were hoping for, but at least https://launchpad.net/ubuntu/+source/pastebinit/1.5.1-1ubuntu1 is a start...10:05
ykarelIs there some issue with zuul.openstack.org? it's not loading10:06
ykarelhttps://zuul.opendev.org/t/openstack/status working though10:07
ykarelinspecting returns: TypeError: "r is undefined"10:08
fricklerianw: not sure if we were talking about the same thing. I meant to say that you are missing in the list of aliases to send root mail to on lists.o.o10:19
fricklerykarel: I can confirm that, best use zuul.opendev.org for now. will need to wait for corvus to dig deeper I guess10:23
ykarelfrickler, ack and Thanks for check10:24
opendevreviewArx Cruz proposed opendev/elastic-recheck rdo: Fix ER bot to report back to gerrit with bug/error report  https://review.opendev.org/c/opendev/elastic-recheck/+/80563810:38
*** rlandy|out is now known as rlandy|ruck11:12
opendevreviewMarios Andreou proposed opendev/base-jobs master: Fix NODEPOOL_CENTOS_MIRROR for 9-stream  https://review.opendev.org/c/opendev/base-jobs/+/82001811:33
mariosfungi: whenever you next have some review time please add to your queue ^^^ i updated to use bash instead of jinja per comment thanks for looking11:34
opendevreviewMarios Andreou proposed opendev/base-jobs master: Fix NODEPOOL_CENTOS_MIRROR for 9-stream  https://review.opendev.org/c/opendev/base-jobs/+/82001811:58
opendevreviewMarios Andreou proposed opendev/base-jobs master: Fix NODEPOOL_CENTOS_MIRROR for 9-stream  https://review.opendev.org/c/opendev/base-jobs/+/82001812:01
*** pojadhav is now known as pojadhav|afk12:01
*** pojadhav|afk is now known as pojadhav12:54
*** pojadhav is now known as pojadhav|afk13:48
*** ysandeep|ruck is now known as ysandeep|afk14:14
dtantsurcan confirm the zuul.o.o problem14:17
*** ysandeep|afk is now known as ysandeep14:20
*** ysandeep is now known as ysandeep|afk14:27
*** ysandeep|afk is now known as ysandeep15:12
corvusthat should be fixed by https://review.opendev.org/82018415:27
*** ysandeep is now known as ysandeep|out15:34
clarkbas far as we know the system-config deploy jobs are running again right? I'll plan to approve the matrix-gerritbot update after gerrit user summit if so15:41
fungii believe so, yes. i haven't approved the lists.openinfra.dev addition yet though16:08
fungiwant to wait until i'm less distracted by meetings16:08
*** chandankumar is now known as raukadah16:12
*** tosky_ is now known as tosky16:17
*** marios is now known as marios|out16:35
*** priteau is now known as Guest738816:38
*** priteau_ is now known as priteau16:38
clarkbmaking this note here so I don't forget. Gerrit 3.4 (or is it 3.5?) allows usernames to be case insensitive. Existing installations remain case sensitive by default. We should check in our 3.3 to 3.4 test jobs that we don't break usernames16:45
clarkbwe can create a zuul and Zuul user or similar and then going forward we should catch problems automatically16:45
clarkbexcept we may need to toggle the config explicitly to avoid the default on new installs being insensitive. Anyway the testing we've got should cover this well, just need to update the system a bit16:47
fungiwe could in theory check for collisions, but i expect they're many16:48
clarkbyes I know we have collisions just from the user cleanups I've done for the conflicting external ids problem16:53
clarkbwhen people end up with a second user they often make their username a variant of the original16:54
clarkboften by changing case of a character or three16:54
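A minimal sketch of the collision check mentioned above, assuming the usernames have already been exported into a plain list. The function name `find_case_collisions` is hypothetical, and using `casefold()` as the comparison is an assumption; Gerrit's own case-insensitive matching may differ in detail.

```python
from collections import defaultdict


def find_case_collisions(usernames):
    """Group usernames that differ only by letter case.

    Returns a dict mapping the case-folded name to the list of
    distinct original spellings, only for names with a collision.
    """
    groups = defaultdict(list)
    for name in usernames:
        spellings = groups[name.casefold()]
        if name not in spellings:
            spellings.append(name)
    return {key: names for key, names in groups.items() if len(names) > 1}


# Illustrative data only, not real accounts.
print(find_case_collisions(["zuul", "Zuul", "clarkb", "fungi"]))
```

Running something like this against the real account list before the upgrade would give a rough count of how many usernames would conflict if matching became case insensitive.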
fungiclarkb: any opinion on whether we should be using base-test to vet https://review.opendev.org/820018 before approving?17:26
clarkbfungi: it's probably sufficient to run the script locally if you want to avoid that dance17:28
fungii've approved 818826 to create lists.openinfra.dev and will keep an eye on it17:28
clarkbbut I think we should test it since the mirror config affects a lot of jobs17:28
fungiyes, i looked at it very closely in order to spot obvious syntax or logic issues which could have broader fallout, but i'm not confident in my skills as a shell parser17:28
clarkbBut also, that config is long since deprecated iirc17:28
fungiyes17:29
clarkbwe might suggest that starting with centos stream people use the proper mirror configuration tooling17:29
clarkbbut I'm indifferent to that as a shell script vars are useful in various contexts17:29
fungithat's not a bad idea, it would be starting with stream 9 specifically though17:30
fungistream 8 didn't need changes to the mirroring17:30
clarkbah17:30
clarkbya the -ge 917:30
fungicentos changed up their mirror path for stream 917:30
clarkbtl;dr if the script as proposed runs locally I think we can approve it17:33
opendevreviewMerged opendev/system-config master: Create a new lists.openinfra.dev mailing list site  https://review.opendev.org/c/opendev/system-config/+/81882617:57
clarkbone thing I notice is that the order of jobs isn't quite what I expected but that must be an artifact of actually writing down our dependencies :)18:29
fungiour dependencies aren't quite what we expected18:31
fungifwiw, looks like the periodic puppet-else job ran again, but /var/lib/storyboard/www/js/templates.js on storyboard.o.o did not get updated18:34
clarkbI think the source isn't updated the way we think it is18:35
clarkb/home/zuul/src/opendev.org/opendev/system-config git log -1 shows Merge "Cache Ansible Galaxy on CI mirror servers"18:36
clarkbwe should probably hold off on making updates otherwise we'll have a giant pile of them that all apply at once when we fix that18:36
clarkbalso before manage-projects runs do we need to stop ansible?18:36
clarkb(I don't know if we've changed projects.yaml in the last few days)18:37
clarkbhttps://zuul.opendev.org/t/openstack/build/d72175e06a8c4b5999b058a77e984755 is the build that should've updated the source and looking at the logs I think we did18:38
clarkbbut then later jobs must've reset it or something?18:38
clarkbI'm confused, and can't really debug right now as I'm trying to pay attention to gerrit user summit18:38
fungiyeah, should i disable ansible on bridge for now?18:39
clarkbprobably?18:40
clarkbthe problem with the ansible disable is that we retry every job 3 times :/18:40
clarkbbut I haven't come up with a better idea than that other than emergency filing everything but that is problematic for other reasons. I think ansible disable is probably warranted until we can understand this better18:40
clarkbhttps://zuul.opendev.org/t/openstack/build/d72175e06a8c4b5999b058a77e984755/log/job-output.txt#225-229 is not what is reflected on the system18:41
fungi#status log Temporarily disabled ansible deployment through bridge.o.o while we troubleshoot system-config state there18:41
opendevstatusfungi: finished logging18:41
clarkbit synced to a different host18:42
clarkbhttps://zuul.opendev.org/t/openstack/build/d72175e06a8c4b5999b058a77e984755/log/job-output.txt#214 is not bridge18:42
clarkbI think it was a single use test node?18:42
fungioho18:43
clarkbso basically we're not updating system-config on bridge then running things. I think that we're likely ok except for potentially recreating an old project on gerrit if we had done renames but we haven't done renames so should be fine18:43
clarkbanyway back to gerrit user summit now that I've largely convinced myself we aren't breaking anything, just not updating the way we expected18:44
fungiyeah, our deployments are basically just being deferred18:44
clarkbianw's day should be starting soon and may understand this18:44
clarkbthis is almost certainly a result of the switch to a single job to update system-config at the beginning of a buildset18:45
fungiyeah, the zuul inventory for that build indicates there's an ubuntu-focal node18:45
clarkbfungi: note that infra-prod-service-lists is running now (it must've started before you put the prevention in place, or we've broken the prevention in the CD refactor) but as mentioned previously I think this will just apply tuesday's state and we should be ok18:48
clarkb(sidenote the thing that tipped me off to updating a different host was I checked the reflog on system-config and didn't see the refs shown in the job log)18:48
fungiyeah18:51
fungi6bcf28b from 21:03:56 tuesday was the last update to ~zuul/src/opendev.org/opendev/system-config on bridge18:52
fungic663d9b from 00:50:45 wednesday was the next change which should have been updated there18:54
fungiso the breakage started in that ~3.75hr timespan18:54
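The ~3.75hr figure can be checked with a quick calculation. This is only a sketch: the calendar dates are inferred from this log being dated Thursday, 2021-12-02, and both timestamps are assumed to be in the same timezone.

```python
from datetime import datetime

# Last successful update of system-config on bridge (6bcf28b, Tuesday).
last_update = datetime(2021, 11, 30, 21, 3, 56)
# First change that should have been applied but was not (c663d9b, Wednesday).
first_missed = datetime(2021, 12, 1, 0, 50, 45)

gap = first_missed - last_update
gap_hours = gap.total_seconds() / 3600
print(gap, round(gap_hours, 2))
```

The gap comes out at 3 hours 46 minutes 49 seconds, so the breakage landed somewhere in a window of a little under four hours.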
clarkbDISABLE-ANSIBLE is only evaluated in the setup src job18:55
clarkbsince we put the file in place after that job the other jobs are free to continue18:55
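To illustrate why a flag file created mid-buildset has no effect when only the first job checks it, here is a small simulation. It is a sketch, not the real job logic: `run_buildset` is a hypothetical helper, the flag lives in a temporary directory rather than its real location, and the job order is only illustrative.

```python
import tempfile
from pathlib import Path


def run_buildset(jobs, flag, check_every_job, create_flag_after=None):
    """Return the list of jobs that actually ran.

    The flag file is checked before the first job, and before every
    later job only when check_every_job is True.
    """
    ran = []
    for index, job in enumerate(jobs):
        should_check = check_every_job or index == 0
        if should_check and flag.exists():
            break
        ran.append(job)
        if job == create_flag_after:
            flag.touch()  # an operator creates the flag mid-buildset
    return ran


JOBS = ["infra-prod-setup-src", "infra-prod-install-ansible", "infra-prod-service-lists"]

with tempfile.TemporaryDirectory() as tmp:
    flag = Path(tmp) / "DISABLE-ANSIBLE"
    # Only the first job checks: later jobs keep running.
    print(run_buildset(JOBS, flag, check_every_job=False, create_flag_after=JOBS[0]))
    flag.unlink()
    # Every job checks: the buildset stops after the first job.
    print(run_buildset(JOBS, flag, check_every_job=True, create_flag_after=JOBS[0]))
```

With the check in every job, the flag takes effect at the next job boundary instead of the next buildset.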
clarkbfungi: it was almost certainly 9cccb02bb09671fc98e42b335e649589610b33cf/42df57b545d6f8dd314678174c281c249171c1d018:57
fungiin theory 42df57b from 13:48:44 wednesday would have switched to running the correct job18:58
fungiand that much it seems to have done18:58
fungibut the job itself is not yet doing the right thing18:58
clarkbwell the key is we stopped updating system-config in the other jobs18:58
clarkband then started running a job that wasn't updating properly18:59
fungiyep18:59
clarkbWe might get away with a simple revert for now. Then reevaluate from there18:59
clarkbbut might be good to see if ianw has an opinion first19:00
clarkbIt's still a bit early there though19:00
fungiyeah, once he's around he may already have a clearer picture of what it was supposed to be doing vs what it's actually doing19:00
clarkbopendev-infra-prod-base <- that job still seems to exist and the changes linked above switched us off of that. I think if we revert we'll go back to using this job and should work? maybe? I hope?19:02
clarkbheh19:02
clarkbthe hourly job runs are not running the source update job19:03
clarkbso we've got another layer of problem where once we get things working if we reenqueue stuff we'll apply updates then hourly will undo them19:03
clarkbI'm wondering if we shouldn't consider disabling ssh access since DISABLE-ANSIBLE is non functional19:03
clarkbya I think we need to revert for that reason either way19:04
clarkbwe can't safely roll forward without adding pipeline edits in addition to fixing the setup-src job19:05
fungiso squash a revert of 9cccb02+42df57b i guess19:05
fungii can push that up19:05
clarkbyes I think so. But I'm leaning towards lets disable ssh access, push the revert then wait for ianw to help untangle19:06
fungihow do we globally disable ssh access to our servers?19:07
fungior do you mean just disable ssh access for zuul@bridge19:07
clarkbfungi: you only need to disable it for zuul@bridge19:07
clarkbmove the authorized_keys file aside?19:07
fungiwe have a zuul-zone-zuul-ci.org-20200401 key and a zuul-opendev.org-20200401 authorized, i guess it's the latter?19:09
fungiahh, yeah the first is for dns i guess19:09
fungiokay, i've commented out the zuul-opendev.org-20200401 key19:09
clarkbok I think I'm understanding what the setup-src job is doing that is wrong. Because it has a regular node (no nodes: []) we run the normal repo setup against the remote host19:12
opendevreviewJeremy Stanley proposed opendev/system-config master: Revert "infra-prod: clone source once"  https://review.opendev.org/c/opendev/system-config/+/82025019:12
clarkbThen our tasks that run against bridge.openstack.org are completely skipped because it isn't in the inventory19:12
fungi820250 is a squashing of reverts for commits 42df57b and 9cccb0219:12
clarkb70827542adfaf5816fdf396e61c5d021b0fa3769 is a flawed change19:14
clarkbthe assertion in the commit message is only half true19:15
clarkbfungi: we need to revert ^ as well19:15
clarkbbecause the inventory add in setup-keys is what was allowing setup-src.yaml to find bridge and update the system-config repo19:16
fungiokay19:16
clarkbwhen we dropped the inventory add from setup-keys we dropped the ability to update system-config19:16
fungii can't find that commit19:17
clarkbfungi: it is in opendev/base-jobs19:17
fungioh, got it19:17
clarkbI think the order is revert 70827542adfaf5816fdf396e61c5d021b0fa3769 then do 82025019:17
clarkbif we do it in the other order we'll still be broken19:17
opendevreviewJeremy Stanley proposed opendev/base-jobs master: Revert "infra-prod-setup-keys: drop inventory add"  https://review.opendev.org/c/opendev/base-jobs/+/82025119:18
opendevreviewJeremy Stanley proposed opendev/system-config master: Revert "infra-prod: clone source once"  https://review.opendev.org/c/opendev/system-config/+/82025019:19
fungidepends-on added19:19
clarkbOnce we're reverted I think the plan forward is to update the setup-src job to not run with nodes first, then update our pipeline config updates as before but ensure the src update job is in all the pipelines and that all the jobs hard depend on that setup src job. We want them to fail if setup src fails.19:20
clarkbBut maybe we get back to where each job is updating system-config today so that we can reenqueue stuff (we have to be careful doing this because reenqueuing to deploy will use the exact change state which means if we reenqueue out of order or whatever we can have problems)19:21
clarkbthen pick up the break out again next week?19:21
fungiwfm19:22
clarkbre reenqueuing stuff a safer approach may be to let something update system-config (the hourly deploy jobs most likely) then manually run other playbooks that we want to pick up that stuff19:23
fungisure. the daily will also kick off in a few hours19:23
fungiwell, ~6.5 i think19:23
clarkbthe last thing we need to sort out is where DISABLE-ANSIBLE got broken. That might also need a (partial) revert19:24
clarkbok I think the existing revert to go back to the old base job will return DISABLE-ANSIBLE before19:27
clarkbs/before/behavior19:27
clarkbI've +2'd both changes and left notes about what I've found in my debugging. I guess we can wait another hour or so to see what ianw thinks?19:29
clarkbin the meantime infra-root please do not approve any system-config changes19:29
clarkbfungi: we should make a list of changes to system-config and project-config to audit and rerun as necessary once happy again19:30
clarkbfor system-config f29aa2da1688ab445d78d3c6596467bae9281f48 3c993c317b79640c2f86d91559f6d2b7ec83d17a 4285b4092839daea4bb7d2574f2a8923310d8278 33fc2a4d4e0628f1580893579c275f0095ce7eec19:31
clarkbof those the lists update is probably the most scary one. I think the gerrit image update wouldn't have really affected prod since all we'd do is pull the image maybe19:31
clarkbthe haproxy changes might update haproxy in production.19:32
fungii've got to step away to cook dinner (christine has something pressing at 21:00) but i can take a look once we eat19:32
clarkbfor project-config 9d2f65a663df801beae4385368c86a21fca83c8e is the only one we need to check but I think it landed early enough to not be a problem19:33
fungii can probably scrape a list of changes reported in here by gerritbot as a cross-check19:33
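A rough sketch of scraping merged changes out of this log, assuming lines shaped like the gerritbot announcements seen here, with the HH:MM timestamp directly following the change number. The regex and the `merged_changes` helper are illustrative, not an existing tool.

```python
import re

# Matches lines such as:
# opendevreviewMerged <project> <branch>: <subject>  <url ending in change number><HH:MM>
MERGED = re.compile(
    r"^opendevreviewMerged (?P<project>\S+) (?P<branch>\S+): (?P<subject>.*?)\s+"
    r"https://review\.opendev\.org/c/\S+?/\+/(?P<change>\d+?)(?P<time>\d{2}:\d{2})$"
)


def merged_changes(lines, project):
    """Return (change number, subject, time) for merges to one project."""
    found = []
    for line in lines:
        match = MERGED.match(line.strip())
        if match and match.group("project") == project:
            found.append(
                (int(match.group("change")), match.group("subject"), match.group("time"))
            )
    return found


sample = [
    "opendevreviewMerged opendev/system-config master: Upgrade to gerrit 3.3.8  "
    "https://review.opendev.org/c/opendev/system-config/+/81973303:14",
    "ianwi was going to sneak ^ in but you beat me to it :)03:16",
]
print(merged_changes(sample, "opendev/system-config"))
```

The resulting list can then be compared by hand against the commit hashes collected from the repository.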
clarkbso really just the system-config commits above and of those only the lists one is concerning. I think once we think we're fixed we manually update system-config and manually run the gitea load balancer, lists and gerrit playbooks19:34
clarkbThen we can fix ssh for zuul on bridge and see if gerrit does the right thing? I guess the fear there is it might revert our checkout somehow but I think the risk of that is low19:34
clarkbya I'm going to need lunch soon so this is probably all fine to pause a bit until ianw is awake and can review what we've found and decide if the plan is good19:36
*** artom__ is now known as artom19:36
Clark[m]I've switched to lunch mode but just realized that maybe landing the system-config revert will trigger all the things to run? And maybe that is better than trying to manually run stuff? If we choose to manually run stuff we should do that before approving the revert I guess19:47
fungiyeah, might make the most sense to put ssh key back and enable ansible when approving the system-config revert?19:48
Clark[m]Ya possibly19:51
fungifor visibility, should the disable-ansible check be its own role even? easier to see when and where we include it in each job that way20:03
Clark[m]++20:05
opendevreviewJeremy Stanley proposed opendev/base-jobs master: Make the disable-ansible check into its own role  https://review.opendev.org/c/opendev/base-jobs/+/82025820:31
fungithat's the role, we can switch to it where convenient i guess20:31
corvusi'd like to rolling restart zuul scheduler and web... any thoughts on timing?20:46
corvusi mean, should be non-disruptive, but also non-zero-risk20:47
clarkbcorvus: well we're hoping to untangle the system-config breakage when ianw's day starts. Might be good to get through that first just so that we're not debugging zuul and system-config?20:50
clarkbI think we've got the two changes necessary to do that proposed above https://review.opendev.org/c/opendev/base-jobs/+/820251 https://review.opendev.org/c/opendev/system-config/+/820250 but was hoping ianw could weigh in as he was driving that work20:50
corvusyep, can wait.20:51
clarkbI'm not sure how long we should wait on the off chance that ianw isn't around today. The base-jobs change should be super straightforward to land. It is the system-config change that is a bit more intertwined, but from what I can see that change is safe too21:13
opendevreviewJames E. Blair proposed opendev/system-config master: Add a keycloak server  https://review.opendev.org/c/opendev/system-config/+/81992321:14
corvusi expect that to pass tests and ready for review now21:15
clarkbthat is unexpected, the hourly jobs are still managing to run somehow21:18
fungicould they be authenticating with one of the other keys?21:19
clarkbor a # doesn't do what we think it does in that file?21:20
clarkboh yup it's the wrong key21:20
clarkbthe system-config jobs use the system-config key21:20
clarkbthe key you commented out is for the opendev.org zone I think21:20
fungioh, is that entry misnamed?21:21
clarkbno it isn't misnamed, we just misinterpreted what it meant21:21
fungithe zuul-ci.org one has a comment of zuul-zone-zuul-ci.org-2020040121:21
fungithe zuul-opendev.org-20200401 doesn't say "zone" in it21:22
clarkbsystem-config/inventory/base/group_vars/all.yaml sets the value. I think it was just recorded that way21:22
fungii guess we should have called it zuul-zone-opendev.org-20200401 for consistency21:22
clarkbyes. But also maybe we should move the file aside as we don't really want anything running until we're happy with the fixups?21:22
fungidone, moved it temporarily to ~zuul/.ssh/disabled.authorized_keys21:23
clarkbin the meantime should we go ahead and approve the base-jobs revert?21:24
clarkbI'm going to rereview the system-config revert now with some fresh eyes to make sure we aren't missing anything21:24
fungiyeah, i can approve the one for base-jobs21:24
clarkbhttps://review.opendev.org/c/opendev/base-jobs/+/807807 was the last change to opendev-infra-prod-base. Which means we ran with that in place for about a week and seemed to be working. The system-config revert switches us back to using that job21:26
clarkbnow to double check the contents of that job for changes21:27
clarkbthe two changes to the playbooks that jobs run are the one to remove the inventory entry which we are reverting and the other renames a playbook which I think is fine because it appears to have been 1:1 just a file name change for consistency with job names21:29
clarkband ya the git log for the rename shows no delta in the file itself21:29
clarkbso ya I think the system-config revert is also safe.21:29
clarkbfungi: once base-jobs lands should we approve the system-config revert and plan to move ssh authorized_keys back and also remove DISABLE-ANSIBLE?21:30
clarkbthen figure out if we need to run any playbooks by hand after it runs its jobs?21:30
clarkbbasically in my rereview I can't find anything that would indicate going back to the old situation of running the repo update for each job would be a problem21:31
opendevreviewMerged opendev/base-jobs master: Revert "infra-prod-setup-keys: drop inventory add"  https://review.opendev.org/c/opendev/base-jobs/+/82025121:34
clarkbI guess give it a little longer in case ianw's day is still booting up and then plan to approve the other revert at the top of the hour otherwise?21:36
clarkbunrelated gitea just made a new 1.15 release with a bunch of bugfixes21:40
opendevreviewClark Boylan proposed opendev/system-config master: Update gitea to 1.15.7  https://review.opendev.org/c/opendev/system-config/+/82026721:47
clarkbunlikely to land that today, but we can start the CI process on it21:47
fungiyeah, top of the hour wfm... put ssh keys back, undo the disable-ansible, approve the change21:53
clarkbI think we might want to approve first so that the hourly jobs can quickly cycle out21:54
clarkbbut then ya reenable things with the plan being the change will land and have a go21:55
ianwsorry, here now!21:56
ianwjust reading21:57
clarkbianw: oh hi! so basically there are a few issues we discovered with the CD refactors that landed most recently. The main issue is system-config on bridge isn't being updated by the -src job21:57
clarkbianw: the reason for this is that we aren't adding bridge to the inventory anymore since we removed that from the keys playbook. But even if we fix that we also noticed that we aren't running the update job on hourly deploy or the daily periodic pipeline21:58
clarkbseparately we also found that only the -src job was checking DISABLE-ANSIBLE which means you can't really get ahead of the next job only the next buildset21:58
clarkbfungi pushed up two revert changes the first of which has landed and restores the inventory stuff to the setup-keys playbook. The other revert has us going back to the every job updates system-config state so that we can roll forward addressing the whole set of issues21:59
ianwok, i thought it all seemed to be going too easily :)22:00
clarkbianw: I tried to leave comments on the revert changes to serve as hints for the future fixups but right now the priority is getting things working around as we are building up a delta (gitea haproxy, gerrit image update, and lists.openinfra.dev changes) that haven't applied fully22:00
clarkbWe suspect that if we land the system-config revert that a bunch of those jobs will run so we can reenable zuul access to bridge and approve that if you are happy with that plan22:01
clarkbwe disabled ansible so that we could figure out what was going on. I think at this point I'm reasonably well convinced it wasn't doing anything bad just not doing anything new. We can probably reenable whenever I suppose22:02
clarkbfungi: in https://review.opendev.org/c/opendev/base-jobs/+/820258 I think you can go ahead and add that role to the base job playbooks?22:03
fungiis that safe? i suppose it is22:04
clarkbfungi: ya it should be22:04
clarkbwith the usual caveats that updating base jobs is tricky and we should monitor22:04
fungidoes it need to be scoped to a specific inventory host?22:04
clarkbfungi: yes it needs to only check on bridge22:04
clarkbfungi: I think you can put that in the setup-keys playbook that adds bridge to the inventory22:05
clarkbsomething like that should work well. And we can land it later when we are able to monitor and out of the unhappy current state22:05
ianwthanks, 820250 is approved so we can get things moving22:05
fungiahh, okay, i assumed we'd want to explicitly add it to other jobs, but i guess if it's in base then it's implicitly added to all jobs without us needing to do anything22:05
clarkbfungi: exactly22:06
fungiwith 820250 approved i should put back the ssh keys and undo the disable-ansible now?22:06
clarkbfungi: if you do that the hourly jobs will run which will delay when the 820250 jobs start. I think if we can wait for hourly to finish and then reenable that would be best22:06
clarkbbut that only works if 820250 doesn't merge first :)22:06
fungigot it22:07
fungii'll try to keep an eye on the screen22:07
clarkbI think the hourly jobs need about 4-5 more minutes to cycle out. 820250 hasn't started all jobs yet so we should have some time22:07
clarkboh it just started and zuul says 26 minutes so ya we should be good to wait on the hourlies to finish first22:07
ianwi do wonder if we want every job checking DISABLE-ANSIBLE22:10
ianwi did totally overlook the other pipelines22:11
clarkbfor me at least its nice to be able to recognize there is an issue and then hit the off switch. I suppose if we want to keep things more fine grained we could say the ssh keys are the big red button and DISABLE-ANSIBLE is more graceful22:12
ianwi guess you're saying you might want to stop things between the end of the src job and the other jobs starting?22:13
clarkbyes or between some other job in the list and the next one if we realize something is off22:13
ianwi was mostly thinking that cloning the source would be the place it stops; i don't have  a problem with the flag as such22:14
ianwhmm, fair enough.  does the new zuul authentication bits give the option to cancel a buildset too?22:14
clarkbmaybe? we can dequeue with gearman as long as that still exists too22:15
clarkbfungi the last job in the hourly buildset is about to timeout once that is done I think we can restore the ssh keys and remove DISABLE-ANSIBLE22:16
clarkbfungi: it's done we can reenable now. Were you going to do that or should I?22:17
corvusyes you can dequeue an item22:18
clarkbI went ahead and removed DISABLE-ANSIBLE and put the authorized_keys file back22:19
clarkbwe're making CD omelets22:20
opendevreviewMerged opendev/system-config master: Revert "infra-prod: clone source once"  https://review.opendev.org/c/opendev/system-config/+/82025022:23
clarkbre Gerrit User Summit I did try to take a bunch of notes which I'll try to curate and post up somewhere. I think the big thing for us to think about is case sensitive username settings in 3.4 before we upgrade. Just to be sure that doesn't bite us later22:23
clarkbbut I also understand how the new check stuff works22:24
clarkbFor the new checks stuff you write a plugin that queries some CI endpoint for a change (in our case it would hit the zuul rest api I think). Then the plugin emits data in their standard format to the central checks UI system22:25
clarkbthen they handle all the rendering for you22:25
clarkbthat was anti climactic it decided to not run any jobs22:27
clarkbI guess because no jobs trigger on the base job updating?22:27
clarkbjust when you think you understand how computers work they remind you that no no you do not :)22:27
clarkbShould we just wait for the hourly runs to happen then we can manually run the gitea-lb playbook and the lists playbook?22:27
fungithanks, i got back to the keyboard too late22:28
clarkbMy one concern with manually updating the system-config checkout is that we won't know that the jobs are doing it properly22:28
clarkbI think I've decided we don't need to do the review playbook as all we did was update the image and those did promote to docker hub properly22:28
clarkbor we can enqueue the lists change to deploy22:29
clarkbthat was the last system-config change to land. I don't think we should enqueue any older changes as that will create confusion22:29
fungii'm okay waiting for the hourly deploy22:30
clarkbcool that wfm too then22:30
fungislightly worried that we've picked apart our deploy jobs enough that reenqueuing a particular change may not run everything anyway22:30
clarkbfungi: ya it would only run whatever jobs it enqueued previously22:32
clarkbthough will it use the old state of the jobs too? I don't think so22:32
ianwfungi: why do you think the deploy jobs won't run?22:34
clarkbianw: well the lists addition change won't run jobs for haproxy on gitea for example22:35
clarkbbut it will run some jobs related to lists22:35
ianwoh right, yes i see what you mean22:36
clarkbbut we can manually run those playbooks once we're happy the automated jobs are updating commits properly22:37
fungihence the list of missing commits from the dark time22:38
fungiso we know what needs to be rerun22:38
ianwdo we need infra-prod-setup-src, or should it just be part of infra-prod-install-ansible?22:53
clarkbianw: hrm thats a good question. I think if we're hard depending on the source update job and there is another job we always want to run it could pull double duty22:54
clarkbcall it prep-bridge or similar?22:54
ianwmaybe bootstrap-bridge?22:57
clarkb++22:59
clarkbhourly jobs are starting now23:00
fungigood, i'm mostly back around again now23:01
clarkbwoot it just updated system-config23:02
clarkbI think we're good. And can proceed with running the lists and gitea haproxy playbooks when we like (I don't think either of those playbooks conflicts with the jobs that hourly runs)23:02
clarkbservice-gitea-lb.yaml <- that is the playbook we run for the gitea lb. I'll go ahead and run it now23:04
clarkbthat is done. It updated the docker compose file to set the ro flag on the config bind mount and restarted the container23:06
clarkbI can still reach https://opendev.org23:07
fungisame23:07
clarkbI think we're good23:07
ianwthanks!23:07
clarkbservice-lists.yaml is the lists playbook. Fungi did you want to run that one?23:07
clarkb`sudo ansible-playbook -f 20 -v /home/zuul/src/opendev.org/opendev/system-config/playbooks/service-gitea-lb.yaml` is the command I used for the gitealb23:07
fungii can, just a sec23:08
clarkbjust need to swap out the playbook name23:08
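A minimal sketch of the "same command, different playbook name" idea, assuming only the command line quoted above. The helper name `playbook_command` and the constant `PLAYBOOK_DIR` are hypothetical and not part of system-config; the sketch only builds and prints the command, it does not run anything.

```python
import shlex

# Playbook directory taken from the command quoted in the log above.
PLAYBOOK_DIR = "/home/zuul/src/opendev.org/opendev/system-config/playbooks"


def playbook_command(playbook, forks=20):
    """Build the ansible-playbook argv for one playbook (hypothetical helper)."""
    return [
        "sudo",
        "ansible-playbook",
        "-f", str(forks),   # number of parallel forks, 20 in the quoted command
        "-v",               # verbose output
        f"{PLAYBOOK_DIR}/{playbook}",
    ]


# Swap only the playbook name to get the lists variant of the command.
print(shlex.join(playbook_command("service-lists.yaml")))
```

Returning an argv list rather than a string keeps the pieces separate, so the same list could be handed to `subprocess.run` without shell quoting concerns.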
clarkbI'm happy to join a screen if you want to run it in screen too23:08
fungicueued up in a root screen session now23:08
fungier, cued23:09
clarkbI'm in the screen and that command looks right to me23:09
fungiwell, also queued23:09
fungiokay, running23:09
clarkbinterestingly infra-prod-service-bridge needs to be retried?23:10
clarkbthere doesn't appear to be a new playbook log file from that job in our ansible log dir23:11
fungiit's working on adding the new site now23:11
clarkbcorvus: is there a good way to see those logs from a failed but will be retried job somewhere?23:11
clarkbfungi: considering how long this command is taking I wonder if it is stuck on a read like we had before23:12
fungiyeah, looking23:12
fungiit's in an epoll loop23:13
fungiepoll_wait(5, [], 2, 1000)              = 023:13
fungiwait4(2537565, 0x7ffdcda5976c, WNOHANG, NULL) = 023:13
fungiclock_gettime(CLOCK_MONOTONIC, {tv_sec=1398764, tv_nsec=118518927}) = 023:13
fungii think23:13
clarkbthat task is the one we fixed for the read23:14
clarkbby setting stdin: ''23:14
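As an illustration of why giving a command an empty stdin avoids this kind of hang, here is a small sketch using Python's `subprocess` module. It is an analogy only and makes no claim about how Ansible implements its `stdin` option; the child process is a hypothetical stand-in, not `newlist`.

```python
import subprocess
import sys


def run_without_stdin(argv, timeout=10):
    """Run argv with stdin attached to the null device.

    Any read from stdin in the child returns end-of-file immediately,
    so the child cannot block waiting for input that never arrives.
    """
    return subprocess.run(
        argv,
        stdin=subprocess.DEVNULL,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        timeout=timeout,   # safety net in case the child hangs for another reason
        text=True,
    )


# Stand-in child: reads all of stdin and prints how many characters it received.
CHILD = [
    sys.executable,
    "-c",
    "import sys; data = sys.stdin.read(); print(len(data))",
]

result = run_without_stdin(CHILD)
print(result.returncode, result.stdout.strip())
```

With an inherited terminal or an open pipe that nobody writes to, the same child would sit in `read` on fd 0 until input or EOF arrives, which matches the strace observation discussed in the log.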
fungii don't see any child processes of that AnsiballZ_command.py anyway23:14
clarkbfungi: ps shows it `ps -elf | grep newlist`23:15
fungioh, yup, my ps afuxww wrapped at an inconvenient column23:15
fungithat newlist command looked like it wasn't a child so i skimmed past23:16
fungiso i wonder why newlist would hang23:16
clarkbstrace says a read on fd 023:17
fungiyes, it does23:17
clarkbwhich seems like the same issue as before23:17
fungiso waiting on a pipe23:17
* fungi sighs23:17
clarkbwell fd 0 is stdin23:17
fungiright, waiting on something to pipe into it i meant23:17
clarkbwhats weird is we fixed this and made sure the fix worked I thought23:17
fungii thought so too23:18
clarkbis there something special about newlisting the mailman list?23:18
fungiit was prompting for confirmation last time, right?23:18
clarkbfungi: prompting to send confirmation emails iirc ya23:18
clarkbto the list admin23:18
fungiansible was making it look like a tty which caused it to go interactive23:18
clarkbwe didn't catch it in testing because testing sets the flag to not send notifications23:19
clarkbbut we do want those notifications in production :/23:19
fungii guess we should kill the hanging newlist or wait for the task to timeout23:19
clarkbya I think killing the newlist is probably best. Then we can put lists.o.o and lists.kc.io in the emergency file and go try and reproduce in testing?23:20
fungiwell, emergency file shouldn't be necessary for lists.k.i unless we try to add a list to it23:20
fungibut may as well23:20
clarkbgood point23:21
fungii'll add them both and then kill the newlist process23:21
clarkbsounds like a plan. We may need to kill a few more newlists if it continues to try after the failed attempt (I think it will short circuit though and we should have a half configured site that we can ignore?)23:21
clarkbya appears to have short circuited23:22
fungii didn't initially add any other lists so it was only trying to create the default metalist23:22
clarkbyup and I'm wondering if that metalist has additional prompts from newlist?23:22
clarkbsince we know that adding a normal list seems to work fine we have done a few of those iirc23:22
fungii suppose it might23:23
clarkbbut we should be able to work through it via held test nodes23:23
fungianyway, it's in emergency disable now, i can probably try to debug more tomorrow23:23
clarkbyup thanks23:23
clarkbI need to take a break to get some stuff done while the sun is still up23:23
clarkbThe other thing on my list was to restart gerrit on the new image. But will see where we are at later and if I've got brain space for that23:24
ianwi'll be happy to do that when it's a bit quieter in a few hours23:28
corvusclarkb: yes, you can find logs of retried builds by going to the buildset, and you can get to the buildset by clicking on any completed job in the buildset to get the build page for that build, then click the buildset link.  example: Bearer23:29
corvuseyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJpYXQiOjE2Mzg0ODc0MDEuNjMwOTM5MiwiZXhwIjoxNjM4NDg4MDAxLjYzMDkzOTIsImlzcyI6Inp1dWxfb3BlcmF0b3IiLCJhdWQiOiJ6dXVsLmV4YW1wbGUuY29tIiwic3ViIjoicm9vdCIsInp1dWwiOnsiYWRtaW4iOlsibm9uZSJdfX0.ONXqLWPTlGEUa-rKkjYHnclbtsS2sxsD9FIPY7kjV3M23:29
corvusoh dear that's not the right example :)23:30
ianwi'll rework the parallel changes into a another series of "noop" jobs23:30
corvusexample: https://zuul.opendev.org/t/openstack/buildset/6cb2b00359e349ba954be34c2f06904a23:30
corvus(that is not an important token, ftr)23:30
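The build and buildset pages mentioned here follow a visible URL pattern. A minimal sketch that composes those links, assuming only the pattern seen in the URLs quoted in this log; the helper names are hypothetical and this is not a Zuul client API.

```python
from urllib.parse import quote

# Base URL as it appears in the links quoted in the log.
ZUUL_WEB = "https://zuul.opendev.org"


def buildset_url(tenant, buildset_id):
    """Link to a buildset page, where retried builds are listed (hypothetical helper)."""
    return f"{ZUUL_WEB}/t/{quote(tenant, safe='')}/buildset/{quote(buildset_id, safe='')}"


def build_url(tenant, build_id):
    """Link to a single build page (hypothetical helper)."""
    return f"{ZUUL_WEB}/t/{quote(tenant, safe='')}/build/{quote(build_id, safe='')}"


# Reproduces the example buildset link given in the log.
print(buildset_url("openstack", "6cb2b00359e349ba954be34c2f06904a"))
```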
ianwhopefully that meets the definition of noop this time23:31
opendevreviewMerged opendev/base-jobs master: Fix NODEPOOL_CENTOS_MIRROR for 9-stream  https://review.opendev.org/c/opendev/base-jobs/+/82001823:38
opendevreviewJames E. Blair proposed opendev/system-config master: Add local auth provider to zuul  https://review.opendev.org/c/opendev/system-config/+/82027623:39
ianwi'm keeping an eye on ^^.  it's a very quick revert, but it was only an if conditional23:40
ianw(i mean, if it does go wrong, it can be a quick revert)23:40
opendevreviewJames E. Blair proposed openstack/project-config master: Add REST api auth rules  https://review.opendev.org/c/openstack/project-config/+/82027723:43
corvusinfra-root: the ansible hostvars file group_vars/grafana_opendev.yaml is not checked into git.  should it be?23:44
corvusinfra-root: (also there are several *.old files which seem redundant for content that's in a git repo, should they be deleted?)23:45
fungiianw: ^ is that something you were working on?23:45
ianwyeah, looking, it might be something i've left behind23:45
fungicorvus: i'd delete old/backup copies yes23:45
corvusi'll wait for ianw to clear before i do anything23:45
ianwyeah it was from the swizzle time; that group went with https://review.opendev.org/c/opendev/system-config/+/73962523:46
ianwi'll rm it23:46
ianw.. done23:47
corvusthx.  i'm going to rm emergency.yaml.old   groups.yaml.old  openstack.yaml.old23:48
ianw++23:52
opendevreviewJames E. Blair proposed openstack/project-config master: Add REST api auth rules  https://review.opendev.org/c/openstack/project-config/+/82027723:54
clarkbthanks for doing that cleanup. I'm back at the computer and will try to be useful again23:58
clarkbfirst up understanding why the bridge job retried23:58
corvusat this point in the day, i don't think i have time to do the rolling zuul restart i asked about earlier... if someone wants to do that once things settle down, feel free, otherwise i'll ask again tomorrow.  meanwhile, https://review.opendev.org/819923 https://review.opendev.org/820276 and https://review.opendev.org/820277 are all ready to merge.  we should merge the latter two soon.  like, before the gearman removal happens.23:58
clarkbhttps://zuul.opendev.org/t/openstack/build/317db45bca0a45ba8d79e491b74b1f5c it hit the exact time the haproxy was not working23:58
clarkbI can review those. I've already reviewed the keycloak change, but really the other two seem urgent and worth a check23:59
