Sunday, 2022-06-19

fricklerlp changed their favicon, my inner monk is upset11:50
Clark[m]Gerrit 3.5 upgrade begins at 20:00 utc today. I'll pop back in around then to help out if necessary14:45
fungii'll try to be around at that time as well18:44
fungi~1.25 hours from now18:44
ianwo/19:59
clarkbgood morning19:59
fungiohai!20:00
ianw#status notice "Gerrit will be unavailable for a short time as it is upgraded to the 3.5 release"20:00
opendevstatusianw: sending notice20:00
-opendevstatus- NOTICE: "Gerrit will be unavailable for a short time as it is upgraded to the 3.5 release"20:00
ianwhttps://etherpad.opendev.org/p/gerrit-upgrade-3.5 is the checklist20:01
clarkbianw: just rereviewing the checklist again we don't seem to have an explicit reindex step. Are we relying on online reindexing then?20:02
ianwi think so, must be what we did last time too?20:03
opendevstatusianw: finished sending notice20:03
clarkboh yes I think that is correct. I'm thinking of the init step not the reindex though20:03
clarkbrereading the upgrade notes there is no schema change and an offline reindex is only necessary if upgrading from 3.3 or older20:04
clarkbI've added a note to step 13 to ensure online reindexing is completed20:05
clarkbthat should cover all my concern here20:05
ianwoh, running the system backup with the mariadb container down doesn't work, doh20:08
ianwthat step should be before shutting down containers20:08
clarkbah right because it uses mysqldump20:09
clarkband that needs a running mysql server20:09
ianwi just restarted mariadb and did another run, so we have the full backup now20:10
clarkbianw: did you stop the db again?20:11
ianwyep20:11
clarkback20:11
ianwthe 645dc2 image is still the latest, and everything lines up there20:14
ianwi agree on waiting for the reindex.  i think we should still do the mariadb update in a separate step, just to make sure gerrit 3.5 is happy first20:15
clarkbwfm the mariadb upgrade is also much lower priority we can skip it if necessary20:16
clarkbany idea what those key exchange errors are about?20:17
ianwhrm, nothing too much in the error log, but there are two hosts that seem to be looping around failing to authenticate20:18
ianwit's weird that the id is null@<ip>20:18
clarkbianw: it may be the user isn't passed until after kex happens?20:19
clarkbI am able to ssh and gerrit ls-projects so ssh seems to work generally20:19
clarkbone host appears to be an opensuse host and another an IBM host? I think we can likely proceed and try to follow up with them later20:20
clarkband possibly block them via iptables if the logging becomes too much20:21
ianwyeah, it's also been happening well before this, at least since 05-3120:21
clarkbah ok20:22
ianwmaybe we should iptables block them; perhaps whatever is doing it doesn't handle kex errors well, but might raise an error to its owners if it's cut off?  clearly nobody is looking at whatever it's doing too closely20:22
clarkbbrowsing random changes I see that a little "login is required to perform this action" popup occurs in the bottom left when opening file diffs20:22
clarkbI suspect this is a regression in the UI not handling anonymous users properly. I don't think that is fatal enough to rollback either20:23
clarkbthats the sort of thing I can dig into later this week and probably push a fix for if no one else is interested 20:23
ianw++ i agree i see that on an anonymous browse too20:24
clarkbif I had to guess it is trying to mark the files as reviewed20:24
clarkbbut it can't do that unless logged in20:24
clarkb1717 tasks remaining down from 1882 a few minutes ago. Reindexing seems to be progressing20:25
ianwan initial watch of the network requests when it pops up doesn't show anything incredibly obvious.  so yeah agree we can work on it after20:26
fungilgtm so far20:30
fungi(sorry, had to step away for a few minutes)20:30
clarkbianw: I'll let you drive step 13, but let me know if I can help with any of those sub tasks20:31
clarkbif someone pushes a change that will check zuul and gitea replication transitively20:32
clarkbweb response I think looks good20:32
opendevreviewIan Wienand proposed opendev/system-config master: [dnm] trigger bunch of jobs to test gerrit 3.5  https://review.opendev.org/c/opendev/system-config/+/84651020:34
clarkbthat also checks that gerritbot is happy :)20:34
clarkbzuul has queued up jobs for that change.20:34
fungiyay!20:34
clarkbthe replication logs for that look good too, now to check I can fetch the ref20:35
clarkbI can fetch refs/changes/10/846510/1 from at least one of the giteas using the load balanced frontend20:36
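
The ref fetched above follows Gerrit's usual change-ref layout: the last two digits of the change number, then the change number, then the patchset. A minimal sketch of that mapping follows; the helper name `change_ref` is hypothetical and not part of any Gerrit tooling.

```python
def change_ref(change: int, patchset: int) -> str:
    """Build the Gerrit ref name for one patchset of a change."""
    # The first path component is the change number modulo 100, zero-padded
    # to two digits.
    shard = change % 100
    return f"refs/changes/{shard:02d}/{change}/{patchset}"


# Change 846510, patchset 1, as pushed during the upgrade check.
print(change_ref(846510, 1))
```

The resulting name can be passed to `git fetch <remote> <ref>` against whichever remote is being checked for replication.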
clarkbhave we rechecked any changes?20:37
clarkbdown to 1085 tasks20:37
ianwhttps://review.opendev.org/844912 is in the queue from a recheck20:37
clarkbagreed that lgtm too20:37
fungiyep, seems to be working20:43
clarkbunder 500 tasks now20:44
clarkbseems to be moving very quickly20:44
ianwhttps://twitter.com/opendevinfra/status/1538613440511713281 also put a pin in the right place.  i think that's the first time it's seen a notice level20:44
clarkb50 now20:45
clarkberror log reports it is done reindexing20:46
ianwi see it all done20:46
clarkbReindex changes to version 71 complete then Using changes schema version 7120:46
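
The completion check described above could be scripted roughly as follows. This is only a sketch: the pattern is modeled on the two messages quoted in the log, the real Gerrit log line format is an assumption, and `reindex_completions` is a hypothetical helper.

```python
import re

# Pattern modeled on the message quoted in the chat; the exact format of
# real Gerrit error_log lines is an assumption.
REINDEX_DONE = re.compile(r"Reindex (\w+) to version (\d+) complete")


def reindex_completions(lines):
    """Return {index_name: version} for each completion message found."""
    done = {}
    for line in lines:
        match = REINDEX_DONE.search(line)
        if match:
            done[match.group(1)] = int(match.group(2))
    return done


sample = [
    "Reindex changes to version 71 complete",
    "Using changes schema version 71",
]
print(reindex_completions(sample))
```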
clarkbmaybe let it steady state for a few minutes then proceed with the db work? though it should be fine to proceed at this point20:46
ianw++20:48
fungiagreed20:52
fungiwe're still well within the hour estimate20:52
funginot that it's a big deal if we go longer20:53
fungithat was an estimate for the outage anyway, which was over in a few minutes20:53
clarkbyup though looks like ianw is proceeding20:53
clarkbAnd gerrit is up again. Time for me to login and review the changes that update our configs20:55
clarkbin the process of doing ^ I checked that I could mark files unreviewed and then review them and have them get marked reviewed again. That bit all looked fine to me20:57
ianwmark reviewed wfm20:57
clarkband sudo docker ps -a confirms we're running a 10.6 mariadb image (at least it is tagged that way)20:58
clarkbfungi: https://review.opendev.org/c/opendev/system-config/+/844362/1 and child are the two changes we need to land to reflect the new upgraded state if you are happy with the results20:58
ianwCode-Review 0 (vote reset) -- that feels new, that it points out this is a vote reset21:00
fungiyep, both of those lgtm21:00
clarkbyes I think that is new21:00
clarkbfungi has approved both changes. I think thats it for now? we wait for changes to land and remove the host from the emergency file?21:01
ianw++21:01
clarkbianw: I think you can remove the hosts from the emergency file nowish? since we don't run hourly service-review.yaml? But I'm happy to let you coordinate that as I'm likely to pay less attention to it than you are21:01
ianwprobably should have squished those actually, to avoid small window of us rewriting it back to mariadb 10.421:02
clarkbianw: since we don't auto restart things I think it will be ok. But agreed21:02
ianwi think for ^ above leave it in emergency until both merge, then we'll just write the latest config21:02
clarkbwfm21:02
clarkbthe gerrit image build jobs against 846510 are being retried21:03
clarkbthere don't appear to be logs for the first build. Not super concerned about that but something to followup on if we've got less reliable image builds for some reason21:03
clarkbthe gate jobs for the changes that matter will use the already built images21:04
ianw#status ok Gerrit 3.5 upgrade is complete.  Please reach us in #opendev if you see any issues21:07
opendevstatusianw: sending ok21:07
-opendevstatus- NOTICE: Gerrit 3.5 upgrade is complete. Please reach us in #opendev if you see any issues21:07
ianwi'm not sure if that works if not in alert21:07
clarkbthe 3.6 upgrade process is a bit more involved. Will have to look at the upgrade job to see how to incorporate the extra bits for that21:07
ianwi guess it does :)21:07
clarkbthe second attempt at those image builds for 846510 succeeded21:08
clarkbalso I think this cadence where we end up doing a major release update after that release has had a couple of bug fix releases is working out for us. Lots of people had problems with 3.6 initially after the upgrade21:10
clarkbBut at the same time if we get close enough to master then our CI can maybe help catch those problems before anyone upgrades and that would be a big win too21:10
opendevstatusianw: finished sending ok21:12
clarkbianw: the green check mark has some trailing text on twitter for that ok message21:13
clarkb✅\efe0fGerrit 3.5 upgrade is complete21:13
ianwyeah, i'll look into that :)21:14
fungiquoting of the ok notice on twitter looks odd21:14
ianwyeah we don't need the "" on the alert21:14
ianwoh, yeah also the OK is a bit borked.  i think it's the first time we used it21:15
fungiyeah, the "\efe0f" looks like some encoding hork-up21:15
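
One plausible reading of the stray text, offered as an assumption rather than a diagnosis of the statusbot code: the escape was meant to be `\ufe0f` (variation selector-16, which asks for emoji presentation), and a mistyped `\efe0f` is not a recognized escape, so its six characters are emitted literally. The sketch below reproduces both strings; a raw string stands in for the mistyped escape.

```python
import unicodedata

CHECK = "\u2705"  # the check mark seen in the tweet
VS16 = "\ufe0f"   # assumed intended escape: variation selector-16

intended = CHECK + VS16
# The raw string keeps the backslash and the following letters as six
# literal characters, matching the text that trailed the check mark.
typoed = CHECK + r"\efe0f"

print(unicodedata.name(CHECK))
print(unicodedata.name(VS16))
print(len(intended), len(typoed))
print(typoed)
```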
clarkbhttps://review.opendev.org/c/openstack/tripleo-heat-templates/+/841207 just merged21:31
clarkba good sign that zuul is happy21:31
fungiexcellent21:31
clarkbianw: re the emergency file I think the infra-prod-service-review job will only run when triggered by the changes merging. Or we wait for the periodic run later today or we trigger it manually.21:36
clarkbMaybe the idea is to update the emergency file just before the second change starts running its job but after the first one is complete?21:37
clarkbIn any case I'm not too concerned about it since it is a simple template update and we can followup later if necessary21:37
opendevreviewMerged opendev/system-config master: gerrit: Update mariadb to 10.6  https://review.opendev.org/c/opendev/system-config/+/84436221:39
opendevreviewMerged opendev/system-config master: gerrit: Update to 3.5 for production  https://review.opendev.org/c/opendev/system-config/+/84436321:40
clarkbThe deploy job for that first one is running already.21:40
clarkbianw: fungi: should I go ahead and remove review02 from the emergency file as soon as that first job is done running ansible?21:41
clarkbI think it just finished21:42
clarkbI'm going to go ahead and remove it from the emergency file21:42
clarkbthats done21:43
clarkbok first job is done. Second one should start running shortly and it should noop apply the update21:44
clarkbthis one will run manage-projects too fwiw21:44
clarkb(because it updated the vars for gerrit not just the docker compose template)21:44
fungiyeah, that seems safe21:45
clarkbit said changed false on the put docker compose file in place task21:47
clarkbsame for the various gerrit config files21:47
clarkbok service-review is done and it lgtm21:49
clarkbI checked the docker compose file and it appears as I expect and the containers weren't restarted (also as expected)21:49
clarkbinfra-prod-manage-projects will run shortly21:50
clarkbit is starting to run its playbook now21:52
clarkbI think manage projects nooped as much as it normally does. Lots of reports that it is skipping management of projects because ACLs match21:58
clarkbI've marked off the last two items from the etherpad, but feel free to review the logs for the deployment for idempotency21:59
ianwthanks for updating that22:10
opendevreviewIan Wienand proposed opendev/statusbot master: Fix typo on OK tick  https://review.opendev.org/c/opendev/statusbot/+/84653323:42
opendevreviewIan Wienand proposed opendev/statusbot master: twitter: Fix typo on OK tick  https://review.opendev.org/c/opendev/statusbot/+/84653323:42