12:59:28 <fungi> #startmeeting opendev-maint
12:59:29 <openstack> Meeting started Fri Nov 20 12:59:28 2020 UTC and is due to finish in 60 minutes.  The chair is fungi. Information about MeetBot at http://wiki.debian.org/MeetBot.
12:59:30 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
12:59:32 <openstack> The meeting name has been set to 'opendev_maint'
13:01:23 <fungi> #status notice The Gerrit service at review.opendev.org will be offline starting at 15:00 UTC (roughly two hours from now) for a weekend upgrade maintenance: http://lists.opendev.org/pipermail/service-announce/2020-October/000012.html
13:01:23 <openstackstatus> fungi: sending notice
13:04:35 <openstackstatus> fungi: finished sending notice
13:59:34 <fungi> #status notice The Gerrit service at review.opendev.org will be offline starting at 15:00 UTC (roughly one hour from now) for a weekend upgrade maintenance: http://lists.opendev.org/pipermail/service-announce/2020-October/000012.html
13:59:34 <openstackstatus> fungi: sending notice
14:02:45 <openstackstatus> fungi: finished sending notice
14:25:39 <clarkb> morning!
14:28:24 <clarkb> fungi: I think I'll go ahead and put gerrit and zuul in the emergency file now.
14:31:24 <clarkb> and that's done. Please double check I got all the hostnames correct (digits and openstack vs opendev etc)
14:31:47 <clarkb> and when you're done with that do you think we should do the belts and suspenders route of disabling the ssh keys for zuul there too?
14:38:58 <clarkb> fungi: also do you want to start a root screen on review? maybe slightly wider than normal :P
14:40:09 <fungi> done, `screen -x 123851`
14:41:03 <clarkb> and attached
14:41:03 <fungi> not sure what disabling zuul's ssh keys will accomplish, can you elaborate?
14:41:25 <clarkb> it will prevent zuul jobs from ssh'ing into bridge and making unexpected changes to the system should something "odd" happen
14:41:42 <clarkb> I think gerrit being down will effectively prevent that even if zuul managed to turn back on again though
14:41:43 <fungi> oh, there, i guess we can
14:41:52 <fungi> i thought you meant its ssh key into gerrit
14:42:05 <clarkb> sorry no ~zuul on bridge
14:44:06 <fungi> sure, i can do that
14:44:21 <clarkb> fungi: just move aside authorized_keys is probably easiest?
14:45:05 <fungi> as ~zuul on bridge i did `mv .ssh/{,disabled_}authorized_keys`
14:45:47 <clarkb> fungi: can you double check the emergency file contents too (just making sure we've got this correct on both sides then that way if one doesn't work as expected we've got a backup)
14:45:59 <clarkb> my biggest concern is mixing up a digit, e.g. 01 instead of 02, or openstack vs opendev in hostnames
14:46:03 <clarkb> I think I got it right though
14:46:41 <fungi> hostnames in the emergency file look correct, yes, was just checking that
14:47:25 <clarkb> thanks
14:47:41 <clarkb> I've just updated the maintenance file that apache will serve from the copy in my homedir
14:47:42 <fungi> checked them against our inventory in system-config
14:53:04 <clarkb> I plan to make my first cup of tea during the first gc pass :)
14:58:03 <fungi> yeah, i'm switching computers now and will get more coffee once that's underway
14:58:45 <clarkb> fungi: I've edited the vhost file on review. When you're at the other computer I think we check that then restart apache at 1500?
14:58:54 <clarkb> then we can start turning off gerrit and zuul
15:00:43 <fungi> lgtm
15:00:49 <fungi> ready for me to reload apache?
15:00:53 <clarkb> my clock says 1500 now I think so
15:01:04 <fungi> done
15:01:17 <fungi> maintenance page appears for me
15:01:20 <clarkb> me too
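[The maintenance page setup referenced above amounts to a small vhost tweak plus a reload; a minimal sketch, with the paths and rewrite details assumed rather than copied from the real config:]
    # in the review.opendev.org vhost, short-circuit every request to the static notice
    RewriteEngine On
    RewriteCond %{REQUEST_URI} !^/maintenance.html$
    RewriteRule ^ /maintenance.html [L]
    # then apply it without a full restart:
    #   sudo apachectl configtest && sudo systemctl reload apache2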
15:01:46 <clarkb> next we can stop zuul and gerrit. I don't think the order matters too much
15:01:48 <fungi> status notice or status alert? wondering if we want to leave people's irc topics altered all weekend given there's also a maintenance page up
15:02:05 <clarkb> ya lets not change the topics
15:02:17 <clarkb> if we get too many questions we can flip to topic swapping
15:02:45 <fungi> #status notice The Gerrit service at review.opendev.org is offline for a weekend upgrade maintenance, updates will be provided once it's available again: http://lists.opendev.org/pipermail/service-announce/2020-October/000012.html
15:02:45 <openstackstatus> fungi: sending notice
15:03:21 <clarkb> if you get the gerrit docker compose down I'll do zuul
15:03:25 <fungi> i guess we should save queues in zuul?
15:03:31 <clarkb> eh
15:03:35 <fungi> and restore at the end of the maintenance? or no?
15:03:42 <clarkb> I guess we can?
15:04:00 <clarkb> I hadn't planned on it
15:04:13 <clarkb> given the long period of time between states I wasn't entirely sure if we wanted to do that
15:04:21 <fungi> i guess don't worry about it. we can include messaging reminding people to recheck changes with no zuul feedback on them
15:04:40 <fungi> gerrit is down now
15:05:57 <fungi> i'll comment out crontab entries on gerrit next
15:05:58 <openstackstatus> fungi: finished sending notice
15:06:16 <clarkb> `sudo ansible-playbook -v -f 50  /home/zuul/src/opendev.org/opendev/system-config/playbooks/zuul_stop.yaml` <- is what I'll run on bridge to stop zuul
15:06:36 <clarkb> actually I'll start a root screen there and run it there without the sudo
15:07:18 <fungi> i've got one going on bridge now
15:07:24 <fungi> if you just want to join it
15:07:32 <clarkb> oh I just started one too. I'll join yours
15:07:53 <fungi> ahh, okay, screen -list didn't show any yet when i created this one, sorry
15:08:26 <clarkb> hahaha we put them in the emergency file so the playbook doesn't work
15:08:32 <clarkb> I'll manually stop them
15:08:46 <fungi> oh right
15:08:49 <fungi> heh
15:09:48 <clarkb> scheduler and web are done. Now to do a for loop for the mergers and executors
15:11:25 <fungi> i'll double-check the gerrit.config per step 1.6
15:12:45 <corvus> clarkb: could probably still do "ansible -m shell ze*"; or edit the playbook to remove !disabled
15:12:54 <fungi> serverId, enableSignedPush, and change.move are still in there, though you did check them after we restarted gerrit earlier in the week too
15:12:54 <corvus> but i bet you already started the loop
15:13:43 <clarkb> yup looping should be done now if anyone wants to check
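[Roughly what the manual merger/executor stop looks like from bridge, per corvus's ad-hoc ansible suggestion above; the per-host stop command and compose paths here are assumptions, not taken from the log:]
    # bypass the playbook's inventory filtering and hit the hosts directly
    ansible 'ze*' -f 20 -m shell -a 'cd /etc/zuul-executor && docker-compose down'
    ansible 'zm*' -f 20 -m shell -a 'cd /etc/zuul-merger && docker-compose down'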
15:13:59 <fungi> i'll go ahead and start the db dump per step 1.7.1, estimated time is 10 minutes
15:13:59 <clarkb> fungi: ya I expected that one to be fine after our test but didn't remove it as it seemed like a good sanity check
15:14:39 <fungi> mysqldump command is currently underway in the root screen session on review.o.o
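[The step 1.7.1 dump is along these lines; the trove host placeholder and exact options are assumptions, the file naming follows the older backups mentioned below:]
    # dump the 2.13 reviewdb from the trove instance into gerrit2's homedir, compressed
    time mysqldump --opt --single-transaction -h "$TROVE_HOST" -u gerrit2 -p reviewdb \
        | gzip > /home/gerrit2/2.13-backup-$(date +%s).sql.gz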
15:15:20 <fungi> in parallel i'll start the rsync update for our 2.13 ~gerrit2 backup in a second screen window
15:15:45 <clarkb> fungi: we don't want to start that until the db dump is done?
15:15:53 <clarkb> that way the db dump is copied properly too
15:16:02 <fungi> oh, fair, since we're dumping into the homedir
15:16:05 <fungi> yeah, i'll wait
15:16:25 <fungi> i guess we could have dumped into the /mnt/2020-11-20_backups volume instead
15:16:48 <clarkb> oh good point
15:16:53 <clarkb> oh well
15:22:10 <fungi> it'll be finished any minute now anyway, based on my earlier measurements
15:23:45 <fungi> mysqldump seems to have completed fine
15:23:53 <clarkb> ya I think we can rsync now
15:23:58 <fungi> 1.7gb compressed
15:24:18 <clarkb> is that size in line with our other backups?
15:24:32 <fungi> rsync update is underway now, i'll compare backup sizes in a second window
15:24:40 <clarkb> yes it is
15:24:50 <clarkb> I checked outside of the screen
15:25:26 <fungi> yeah, they're all roughly 1.7gb except the old 2.13-backup-1505853185.sql.gz from 2017
15:25:35 <fungi> which we probably no longer need
15:26:03 <fungi> in theory this rsync should be less than 5 minutes
15:26:39 <fungi> though could be longer because of the db dump(s)/logrotate i suppose
15:27:27 <clarkb> even if it was a full sync we'd still be on track for our estimated time target
15:27:55 <fungi> yeah, fresh rsync starting with nothing took ~25 minutes
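[The homedir checkpoint rsync is essentially this shape; the destination uses the /mnt/2020-11-20_backups volume mentioned above, the directory name under it is assumed:]
    # refresh the pre-upgrade copy of the gerrit2 homedir on the backup volume
    time rsync -a --delete /home/gerrit2/ /mnt/2020-11-20_backups/gerrit2/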
15:34:44 <clarkb> I think the gerrit caches and git dirs change a fair bit over time
15:34:51 <clarkb> in addition to the db and log cycling
15:35:06 <fungi> and it's done
15:35:15 <fungi> yeah periodic git gc probably didn't help either
15:36:05 <fungi> anybody want to double-check anything before we start the aggressive git gc (step 2.1)?
15:36:25 <clarkb> echo $? otherwise no I can't think of anything
15:37:15 <fungi> yeah, i don't normally expect rsync to silently fail
15:37:29 <fungi> but it exited 0
15:37:32 <clarkb> yup lgtm
15:37:36 <clarkb> I think we can gc now
15:37:42 <fungi> i have the gc staged in the screen session now
15:37:51 <fungi> and it's running
15:38:05 <clarkb> after the gc we can spot check that everything is still owned by gerrit2
15:38:08 <fungi> estimated time at this step is 40 minutes, so you can go get your tea
15:38:20 <clarkb> yup I'm gonna go start the kettle now. thanks
15:38:31 <fungi> i don't see any obvious errors streaming by anyway
15:38:56 <clarkb> keeping timing notes on the etherpad too because I'm curious to see how close the estimates particularly for today are
15:39:46 <fungi> good call, and yeah that's more or less why i left the time commands in most of these
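[For reference, the step 2.1 parallel aggressive gc is the same find/xargs form quoted later in this log for step 4.3:]
    # gc every bare repo, 16 at a time, running as the gerrit2 user
    time find /home/gerrit2/review_site/git/ -type d -name "*.git" -print0 \
        | xargs -t -0 -P 16 -n 1 -IGITDIR sudo -H -u gerrit2 git --git-dir="GITDIR" gc --aggressive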
16:01:15 <fungi> probably ~15 minutes remaining
16:01:54 <clarkb> I'm back fwiw just monitoring over tea and toast
16:13:08 <fungi> estimated 5 minutes remaining on this step
16:13:09 <clarkb> it is down to 2 repos
16:13:23 <clarkb> of course one of them is nova :)
16:13:41 <fungi> the other is presumably either neutron or openstack-manuals
16:13:46 <clarkb> it was airshipctl
16:13:51 <fungi> oh
16:13:52 <fungi> wow
16:13:52 <clarkb> I think it comes down to how find and xargs sort
16:14:10 <clarkb> I think openstack manuals was the third to last
16:14:19 <fungi> looks like we're down to just nova now
16:15:50 <fungi> here's hoping these rebuilt gerrit images which we haven't tested upgrading with are still fine
16:16:21 <clarkb> I'm not too worried about that, I did a bunch of local testing with our images over the last few months and the images moved over time and were always fine
16:17:00 <fungi> yeah, the functional exercises we put them through should suffice for catching egregious problems with them, at the very least
16:17:25 <clarkb> then ya we also put them through the fake prod marathons
16:20:02 <clarkb> before we proceed to the next step it appears that the track upstream cron fired?
16:20:11 <clarkb> fungi: did that one get disabled too?
16:20:12 <fungi> and done
16:20:25 <fungi> i thought i disabled them both, checking
16:21:05 <fungi> oh... it's under root's crontab not gerrit2's
16:21:24 <clarkb> we should disable that cron then kill the running container for it
16:22:12 <clarkb> I think the command is kill
16:22:15 <fungi> like that? or is it docker kill?
16:22:17 <clarkb> to line up with ps
16:22:27 <fungi> yup
16:22:30 <fungi> okay, it's done
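[The track-upstream cleanup described here boils down to something like the following; the crontab entry and container name are assumptions:]
    # comment out the track-upstream entry in root's crontab
    sudo crontab -e
    # then find and stop the copy that already started
    docker ps | grep track-upstream
    docker kill <container-id>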
16:22:54 <clarkb> we should keep an eye on those things because they use the explicit docker image iirc
16:23:06 <clarkb> the change updates the docker image version in hiera which will apply to all those scripts
16:23:22 <clarkb> granted they don't really run gerrit things, just jeepyb in gerrit, so it's probably fine for them to use the old image accidentally
16:23:37 <fungi> the only remaining cronjobs for root are bup, mysqldump, and borg(x2)
16:23:38 <clarkb> ok I think we can proceed?
16:24:09 <fungi> and confirmed, the cronjobs for gerrit2 are both disabled still
16:24:20 <fungi> we were going to check ownership on files in the git tree
16:24:25 <clarkb> ++
16:25:18 <fungi> everything looks like it's still gerrit2, even stuff with timestamps in the past hour
16:25:19 <clarkb> that spot check looks good to me
16:25:45 <fungi> so i think we're safe (but also we change user to gerrit2 in our gc commands so it shouldn't be a problem any longer)
16:26:09 <clarkb> ya just a double check since we had problems with that on -test before we updated the gc commands
16:26:13 <clarkb> I think it's fine and we can proceed
16:26:27 <fungi> does that look right?
16:26:52 <clarkb> yup updated to opendevorg/gerrit:2.14
16:27:03 <clarkb> on both entries in the docker compose file
16:27:03 <fungi> okay, will pull with it now
16:27:27 <fungi> how do we list them before running with them?
16:27:42 <clarkb> docker image list
16:27:51 <fungi> i need to make myself a cheatsheet for container stuff, clearly
16:28:22 <fungi> opendevorg/gerrit   2.14                39de77c2c8e9        22 hours ago   676MB
16:28:33 <fungi> that seems right
16:28:37 <clarkb> yup
16:28:59 <fungi> ready to init?
16:29:07 <clarkb> I guess so :)
16:29:14 <fungi> and it's running
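[Each upgrade hop in this log (2.14, 2.15, 2.16, 3.0, 3.1, 3.2) repeats the same pattern; a sketch, with the compose service name and init invocation assumed rather than copied from the real plan:]
    # 1. point both services in docker-compose.yaml at the next image, e.g. opendevorg/gerrit:2.14
    # 2. pull it and confirm what was fetched
    docker-compose pull
    docker image list | grep opendevorg/gerrit
    # 3. run the schema init for the new version against the existing site
    docker-compose run --rm gerrit init -d /var/gerrit --batch --no-auto-start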
16:37:21 <clarkb> around now is when we would expect this one to finish, but also this was the one with the least consistent timing
16:37:36 <fungi> taking longer than our estimate, yeah
16:37:52 <clarkb> we theorized its due to hashing the http passwds
16:38:01 <clarkb> and the input for that has changed a bit recently
16:38:07 <clarkb> (but maybe we also need entropy? I dunno)
16:38:08 <fungi> should be far fewer of those now though
16:39:22 <corvus> it seems pretty idle
16:39:42 <clarkb> ya top isn't showing it be busy
16:40:08 <clarkb> the first time we ran it it took just under 30 minutes
16:40:33 <fungi> could also be that the server instance or volume or (more likely?) trove instance we used on review-test performed better for some reason
16:41:02 <fungi> the idleness of the server suggests to me that maybe this is the trove instance being sluggish
16:41:39 <corvus> | 460106 | gerrit2 | 10.223.160.46:56540 | reviewdb | Query   |  716 | copy to tmp table | ALTER TABLE change_messages ADD real_author INT |
16:41:51 <corvus> | Id     | User    | Host                | db       | Command | Time | State             | Info                                            |
16:41:54 <corvus> ^ column headers
16:42:02 <clarkb> ah ok so it is the db side then?
16:42:03 <corvus> fungi: so yeah, looks like
16:42:11 <corvus> yep that's "show full processlist"
16:42:15 <corvus> in mysql
16:42:15 <mordred> yeah - sounds like maybe the old db is tuned/sized differently
16:42:33 <mordred> or just on an old host or something
16:42:38 * fungi blames mordred since he created the trove instance for review-test ;)
16:42:53 <mordred> totally fair :)
16:42:55 <clarkb> this is one reason why we allocated tons of extra time :)
16:43:02 <fungi> s/blames/thanks/
16:43:11 <clarkb> as long as we can explain it (and sounds like we have) I'm happy
16:43:39 <clarkb> though its a bit disappointing we're investing in the db when we're gonna discard it shortly :)
16:44:02 <mordred> right?
16:44:18 <fungi> i'll just take it as an opportunity to catch up on e-mail in another terminal
16:46:09 <corvus> there should be a word for blame/thanks
16:46:47 <fungi> the germans probably have one
16:47:04 <corvus> mordred: _____ you very much for setting up that trove instance!
16:47:06 <fungi> deutsche has all sorts of awesome words english is missing
16:48:02 <mordred> schadendanke perhaps? (me making up new words)
16:48:48 <fungi> doch (the positive answer to a negative question) is in my opinion the greatest example of potentially solvable vagueness in english
16:48:58 <mordred> yup
16:49:05 <corvus> omg i need that in my life
16:49:25 <mordred> it fills the "no, yes it is"
16:49:30 <mordred> role
16:49:42 <fungi> somehow english, while a germanic language, decided to just punt on that
16:49:48 <mordred> yup
16:49:58 <mordred> I blame the normans
16:50:10 <corvus> | 460106 | gerrit2 | 10.223.160.46:56540 | reviewdb | Query   | 1227 | rename result table | ALTER TABLE change_messages ADD real_author INT |
16:50:20 <fungi> mordred: sshhhh, ttx might be listening
16:50:26 <corvus> changed from "copy" to "rename"  sounds like progress
16:50:44 <corvus> | 460106 | gerrit2 | 10.223.160.46:56540 | reviewdb | Query   |    5 | copy to tmp table | ALTER TABLE patch_comments ADD real_author INT |
16:50:48 <corvus> new table
16:51:13 <corvus> i wonder what the relative sizes of those 2 tables are
16:51:28 <mordred> also - in newer mysql that should be able to be an online operation
16:51:36 <mordred> but apparently not in the version we're running
16:52:13 <mordred> so it's doing the alter by making a new table with the new column added, copying all the data to the new table and deleting the old
16:52:16 <mordred> yay
16:52:18 <clarkb> ya our mysql is old. we used old mysql on review-test and it was fine so I didn't think we should need to upgrade first
16:52:23 <fungi> maybe the mysql version for the review-test trove instance was newer than for review?
16:52:30 <clarkb> fungi: I'm 99% sure I checked that
16:52:32 <clarkb> and they matched
16:52:34 <fungi> ahh, so that did get checked
16:52:39 <clarkb> but maybe I misread the rax web ui or something
16:52:45 <mordred> maybe they both did the copy and the new one is just on a better hypervisor
16:53:20 <fungi> or the dump/src process optimizes the disk layout a lot compared to a long-running server
16:53:39 <clarkb> I'm trying to identify which schema makes this change but the way they do migrations doesn't make that easy for all cases
16:53:58 <clarkb> they guice inject db specific migrations from somewhere
16:54:00 <clarkb> I can't find the somewhere
16:54:31 <clarkb> anyway its proceeding I'll chill
16:54:31 <mordred> fungi: yeah - that's also potentially the case
16:54:37 <mordred> clarkb: they guice inject *everything*
16:54:57 <clarkb> I don't think the notedb conversion will be very affected by that either since its all db reads
16:55:12 <clarkb> so hopefully the very long portion of the upgrade continues to just be long and not longer
16:55:29 <corvus> oof, it also looks like they're doing one-at-a-time
16:55:34 <corvus> | 460106 | gerrit2 | 10.223.160.46:56540 | reviewdb | Query   |   15 | copy to tmp table | ALTER TABLE patch_comments ADD unresolved CHAR(1) DEFAULT 'N' NOT NULL  CHECK (unresolved IN ('Y','N')) |
16:55:39 <corvus> second update to same table
16:55:57 <corvus> which, to be fair, is the way we usually do it too
16:56:09 <corvus> but now i feel more pressure to do upgrade rollups :)
16:56:19 <mordred> yah - to both
16:56:25 <fungi> "we" being zuul/nodepool?
16:56:42 <fungi> er, i guess not nodepool as it doesn't use an rdbms
17:04:15 <clarkb> ya still having no luck figuring out where the Schema_13X.java files map to actual sql stuff
17:04:32 <clarkb> I wonder if it's automagic based on their table defs somewhere
17:04:37 <corvus> fungi: yes (also openstack)
17:05:51 <clarkb> I'm just trying to figure out what sort of progress we're making relative to the stack of schema migrations. Unfortunately it prints out all the ones it will do at the beginning then does them so you don't get that insight
17:05:58 <fungi> i would not be surprised if these schema migrations aren't somehow generated at runtime
17:06:04 <mordred> corvus: I think nova decided to do rollups when releases are cut - so if you upgrade from icehouse to juno it would be a rollup, but if you're doing CD between icehouse and juno it would be a bunch of individual ones
17:06:41 <mordred> which seems sane - I'm not sure how that would map into zuul - but maybe something to consider in the v4/v5 boundaries
17:07:18 <corvus> mordred: ++
17:07:45 <fungi> yay!
17:07:58 <fungi> it's doing the data migrations now
17:08:32 <clarkb> ok cool
17:08:39 <fungi> looks like it's coming in around 40 minutes?
17:08:40 <clarkb> seems like things may be slower but not catastrophically so
17:08:51 <fungi> (instead of 8)
17:09:04 <clarkb> 142 is the hashing schema change iirc
17:11:32 <clarkb> yup confirmed that one has content in the schema version java file because they hash java side
17:19:14 <clarkb> corvus: is it doing interesting db things at the moment? I wonder if it is also doing some sort of table update for the hashed data
17:19:22 <clarkb> rather than just inserting records
17:19:23 <fungi> looks like there's a borg backup underway, that could also be stealing some processor time... though currently the server is still not all that busy
17:19:32 <clarkb> ya I think it must be busy with mysql again
17:19:54 <mordred> db schema upgrades are the boringest
17:20:38 <clarkb> also note that we had originally thought that the notedb conversion would run overnight. Based on how long this is taking that may be the case again, but we've already built in that buffer so I don't think we need to roll back or anything like that yet
17:21:09 <clarkb> just need to be patient I guess (something I am terrible at accomplishing)
17:21:31 <corvus> clarkb: "UPDATE account_external_ids SET"
17:21:47 <fungi> that looks like what we expect, yeah
17:21:55 <corvus> then some personal info; it's doing lots of those individually
17:21:55 <clarkb> yup
17:22:18 <clarkb> db.accountExternalIds().upsert(newIds); <- is the code that should line up to
17:22:28 <clarkb> oh you know what
17:22:34 <fungi> yeah this is the stage where we believe it's replacing plaintext rest api passwords with bcrypt2 hashes
17:22:48 <clarkb> its updating every account even if they didn't have a hashed password
17:22:53 <corvus> yes
17:22:57 <corvus> i just caught it doing one :)
17:23:04 <clarkb> List<AccountExternalId> newIds = db.accountExternalIds().all().toList();
17:23:14 <corvus> password='bcrypt:...
17:23:17 <clarkb> rather than finding the ones with a password and only updating them
17:23:29 <clarkb> I guess that explains why this is slow
17:23:29 <fungi> is it hashing null for 99.9% of the accounts?
17:23:40 <clarkb> no it only hashes if the previous value was not null
17:23:41 <fungi> or just skipping them once it realizes they have no password?
17:23:49 <clarkb> but it is still upserting them back again
17:23:52 <clarkb> rather than skipping them
17:23:54 <corvus> it's doing an update to set them to null
17:23:58 <fungi> ahh, okay that's better than, you know, the other thing
17:24:08 <corvus> (which mysql may optimize out, but it'll at least have to go through the parser and lookup)
17:24:29 <clarkb> corvus: do you see sequential ids? if so that may give us a sense for how long this will take. I think we have ~36k ids
17:24:39 <corvus> ids seem random
17:25:00 <corvus> may be sorted by username though: it's at "mt.."
17:25:11 <corvus> now p..
17:25:27 <fungi> so maybe ~halfway
17:26:03 <corvus> hah, i saw 'username:rms...' and started, then moved the rest of the window in view to see 'username:rmstar'
17:26:36 <corvus> mysql is idle
17:26:53 <fungi> and done
17:26:54 <clarkb> it reports done on the java side
17:27:07 <fungi> exited 0
17:27:24 <clarkb> yup from what we can see it lgtm
17:27:36 <fungi> anything we need to check before proceeding with 2.15?
17:27:46 <clarkb> I think we can proceed and just accept these will be slower. Then expect notedb to run overnight again
17:27:53 <fungi> 57m11.729s was the reported runtime
17:28:01 <clarkb> ya I put that on the etherpad
17:28:38 <fungi> updated compose file for 2.15, shall i pull?
17:28:51 <clarkb> yes please pull
17:29:20 <fungi> opendevorg/gerrit   2.15                bfef80bd754d        23 hours ago        678MB
17:29:26 <fungi> looks right
17:29:29 <clarkb> yup
17:29:46 <fungi> ready to init 2.15?
17:29:52 <clarkb> I'm ready
17:30:04 <fungi> it's running
17:31:34 <clarkb> schema 144 is the one writing external ids into all-users
17:31:48 <clarkb> 143 is opaque due to guice
17:32:01 <clarkb> anyway I shall continue to practice patience
17:32:14 * fungi finds a glass full of opaque juice
17:33:13 <clarkb> the java is very busy on 144
17:33:20 <clarkb> (as expected given its writing back to git)
17:34:15 <fungi> huh, it's doing a git gc now
17:34:24 <clarkb> only on all-users
17:34:24 <fungi> of all-users i guess
17:34:26 <clarkb> ya
17:34:27 <mordred> busy busy javas
17:34:45 <clarkb> you still need it for everything else to speed up the reindexing aiui
17:35:19 <fungi> sure
17:38:20 <fungi> this one's running long too, compared to our estimate
17:38:46 <fungi> but i have a feeling we're still going to wind up on schedule when we get to the checkpoint
17:39:44 <clarkb> 151 migrates groups into notedb I think
17:40:09 <fungi> we baked in lots of scotty factor
17:40:58 <clarkb> ya I think it "helps" that there was no way we thought we'd get everything done in one ~10 hour period. So once we assume an overnight run, being able to slot a very slow process in there makes for a lot of wiggle room
17:42:10 <clarkb> mordred: you've just reminded me that mandalorian has a new episode today. I know what I'm doing during the notedb conversion
17:42:21 <clarkb> busy busy jawas
17:42:38 <mordred> haha. I'm waiting until the whole season is out
17:42:48 <fungi> and done
17:42:56 <clarkb> just under 13 minutes
17:43:04 <fungi> 12m47.295s
17:43:24 <fungi> anybody want to check anything before i work on the 2.16 upgrade?
17:43:34 <clarkb> I don't think so
17:44:02 <fungi> proceeding
17:44:31 <fungi> good to pull images?
17:44:34 <clarkb> 2.16 lgtm I think you should pull
17:44:56 <fungi> opendevorg/gerrit   2.16                aacb1fac66de        24 hours ago        681MB
17:44:59 <fungi> also looks right
17:45:02 <clarkb> yup
17:45:14 <fungi> ready to init 2.16?
17:45:29 <clarkb> ++
17:45:36 <fungi> running
17:45:57 <fungi> time estimate is 7 minutes, no idea how accurate that will end up being
17:46:57 * mordred is excited
17:47:18 <fungi> after this we have another aggressive git gc followed by an offline reindex, then we'll checkpoint the db and homedir in preparation for the notedb migration
17:47:53 <fungi> this theoretically gives us a functional 2.16 pre-notedb state we can roll back to in a pinch
17:47:54 <clarkb> then depending on what time it is we'll do 3.0, 3.1, and 3.2 this evening or tomorrow
17:48:05 <fungi> yup
17:49:10 <clarkb> sort of related, I feel like notedb is sort of a misleading name. None of the db stuff lives in what git notes thinks are notes as far as I can tell
17:49:12 <clarkb> its just special refs
17:49:26 <clarkb> this had me very confused when I first started looking at the upgrade
17:50:22 <fungi> yeah, i expect that was an early name which stuck around long after they decided using actual git notes for it was suboptimal
17:53:53 <fungi> i think we'll make up some of the lost time in our over-estimate of the checkpoint steps
17:54:54 <fungi> glad we weren't late starting
17:56:14 <clarkb> ++ I never want to wake up early but having the extra couple of hours tends to be good for buffering ime
17:56:39 <fungi> happy to anchor the early hours while your tea and toast kick in
17:57:17 <fungi> in exchange, it's your responsibility to take up my slack later when my beer starts to kick in
17:58:33 <clarkb> ha
17:59:37 <fungi> sporadic java process cpu consumption at this stage
18:01:54 <clarkb> migration 168 and 170 are opaque due to guice. 169 is more group notedb stuff
18:02:07 <clarkb> not sure which one we are on now as things scrolled by
18:02:17 <clarkb> oh did it just finish?
18:02:27 <clarkb> oh interesting
18:02:39 <clarkb> the migrations are done but now it is reindexing?
18:02:45 <fungi> no, i was scrolling back in the screen buffer to get a feel for where we are
18:03:03 <fungi> it's been at "Index projects in version 4 is ready" for a while
18:03:15 <clarkb> ya, worrying about what it may be doing since it said 170 was done, right?
18:03:19 <fungi> though maybe it's logging
18:03:32 <fungi> yeah, it got through the db migrations
18:03:45 <fungi> and started an offline reindex apparently
18:03:56 <fungi> there it goes
18:03:58 <fungi> done finally
18:03:59 <clarkb> ya that was expected for projects and accounts and groups
18:04:06 <clarkb> because accounts and groups and project stuff go into notedb but not changes
18:04:18 <fungi> 18m19.111s
18:04:29 <clarkb> yup etherpad updated
18:04:39 <clarkb> exit code is zero I think we can reindex
18:04:43 <fungi> ready to do a full aggressive git gc now?
18:04:48 <clarkb> er sorry not reindex
18:04:50 <clarkb> gc
18:04:57 <clarkb> getting ahead of myself
18:05:00 <fungi> yup
18:05:06 <fungi> okay, running
18:05:21 <fungi> 41 minutes estimated
18:05:33 <clarkb> the next reindex is a full reindex because we've done skip level upgrades
18:05:40 <clarkb> with no intermediate online reindexing
18:05:42 <fungi> should be a reasonably accurate estimate since no trove interaction
18:06:49 <clarkb> and we did one prior to the upgrades which was close in time too
18:07:34 <fungi> yup
18:30:28 <clarkb> one thing the delete plugin lets you do which I didn't manage to have time to test is to archive repos
18:30:46 <clarkb> it will be nice to test that a bit more for all of the dead repos we've got and see if that improves things like reindexing
18:40:28 <clarkb> down to nova and all users now
18:41:55 <fungi> yup
18:48:25 <fungi> done
18:48:32 <clarkb> looks happy
18:48:35 <clarkb> time for the reindex now?
18:48:39 <fungi> anything we should check before starting the offline reindex?
18:48:50 <clarkb> I don't think so. Unless you want to check file perms again
18:48:53 <fungi> we want to stick with 16 threads?
18:49:06 <clarkb> yes
18:49:14 <clarkb> I think so anyway
18:49:25 <fungi> file perms look okay still
18:49:31 <clarkb> one of the things brought up on the gerrit mailing list is that threads for these things use memory, and if you overdo the threads you oom
18:49:39 <clarkb> so sticking with what we know shouldn't oom seems like a good idea
18:49:48 <clarkb> its 24 threads on the notedb conversion but 16 on reindexing
18:49:48 <fungi> yeah, i'm fine with sticking with the count we tested with
18:50:01 <fungi> okay, it's running
18:50:12 <fungi> estimated time to completion is 35 minutes
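[The offline reindex being started here is gerrit's standard reindex program, roughly; the wrapper around it is assumed, the thread count is the one agreed above:]
    # full offline reindex of changes, accounts, groups and projects with 16 worker threads
    # (run from inside the gerrit 2.16 container in practice)
    time java -jar /var/gerrit/bin/gerrit.war reindex -d /var/gerrit --threads 16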
18:51:39 <fungi> gc time was ~43 minutes so close to our estimate. i didn't catch the actual time output
18:52:19 <clarkb> oh I didn't look, I should've
18:52:25 <fungi> for those watching the screen session, the java exceptions are about broken changes which are expected
18:52:46 <clarkb> ya we reproduced the unhappy changes on 2.13 prod
18:52:52 <clarkb> its just that newer gerrit complains more
18:53:08 <fungi> stems from some very old/early history lingering in the db
19:01:29 <clarkb> it is about a quarter of the way through now so on track for ~40 minutes
19:01:47 <fungi> fairly close to our estimate in that case
19:14:09 <clarkb> just over 50% now
19:32:45 <clarkb> just crossed 90%
19:37:39 <clarkb> down to the last hundred or so changes to index now
19:37:54 <fungi> and done
19:38:05 <clarkb> ~48minutes
19:38:24 <fungi> 47m51.046s yeah
19:38:34 <clarkb> 2.16 db dump now?
19:38:43 <fungi> yup, ready for me to start it?
19:38:49 <clarkb> yes I am
19:39:02 <fungi> and it's running
19:39:37 <clarkb> then we backup again, then start the notedb offline transition
19:39:43 <clarkb> such excite
19:42:30 <fungi> it's all over my screen
19:42:35 <fungi> (literally)
19:43:30 <ianw> o/
19:43:36 <ianw> sounds like it's going well
19:43:44 <clarkb> ianw: slower than expected but no major issues otherwise
19:43:52 * fungi hands everything off to ianw
19:44:02 <fungi> [just kidding!]
19:44:12 <clarkb> we're at our 2.16 checkpoint step. backing up the db then copying gerrit2 homedir aside
19:44:22 <clarkb> the next step after the checkpoint is to run the offline notedb migration
19:44:29 * ianw recovers after heart attack
19:44:34 <fungi> yeah, i think we're basically on schedule, thanks to minor miracles of planning
19:44:58 <clarkb> which is good because I'm getting hungry for lunch and the notedb migration step is a perfect time for that :)
19:45:42 <fungi> other than the trove instance being slower than what we benchmarked with review-test, it's been basically uneventful. no major issues, just tests of patience
19:45:44 <ianw> clarkb: one good thing about being in .au is the mandalorian comes out at 8pm
19:45:52 <clarkb> ianw: hacks
19:46:08 * fungi relocates to a different hemisphere
19:47:08 <fungi> i hear there are plenty of island nations on that side of the equator which would be entirely compatible with my lifestyle
19:48:42 <clarkb> internet connectivity tends to be the biggest issue
19:48:59 <fungi> i can tolerate packet loss and latency
19:49:06 <fungi> okay, db dump is done
19:49:10 <fungi> rsync next
19:49:33 <fungi> ready to run?
19:49:44 <clarkb> let me check the filesize
19:50:04 <clarkb> still 1.7gb lgtm
19:50:07 <clarkb> I think you can run the rsync now
19:50:09 <fungi> oh, good call, thanks
19:50:15 <fungi> running
19:50:49 <fungi> the 10 minute estimate there is very loose. could be more like 20, who knows
19:50:57 <clarkb> we'll find out :)
19:51:04 <fungi> if it's accurate, puts us right on schedule
20:01:16 <fungi> and done!
20:01:22 <fungi> 10m56.072s
20:01:26 <fungi> reasonably close
20:01:28 <corvus> \o/
20:01:33 <clarkb> only one minute late
20:01:49 <corvus> hopefully not 10% late
20:02:19 <clarkb> well one minute against the estimated 10 minutes, but also ~20:00 UTC was when I guessed we would start the notedb transition
20:02:34 <fungi> okay, notedb migration
20:03:10 <fungi> anything we need to check now, or ready to start?
20:03:37 <clarkb> just that the command has the -Xmx value which it does and the threads are overridden and they are. I can't think of anything else to check since we aren't starting 2.16 and interacting with it
20:03:43 <clarkb> I think we are ready to start notedb migration
20:04:03 <fungi> okay, running
20:04:17 <fungi> eta for this is 4.5 hours
20:04:32 <fungi> no idea if it will be slower, but seems likely?
20:05:06 <fungi> that will put us at 00:35 utc at the earliest
20:05:19 <clarkb> we should check it periodically too  just to be sure it hasn't bailed out
20:05:24 <fungi> i can probably start planning friday dinner now
20:05:35 <clarkb> ++ I'm going to work on lunch as well
20:05:47 <clarkb> also the docs say this process is resumable should we need to do that
20:05:50 <clarkb> I don't think we tested that though
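[The migration command being checked above is gerrit 2.16's offline migrator, roughly of this shape; the heap size and the exact thread option spelling are assumptions, the 24-thread override is the one mentioned earlier:]
    # offline reviewdb -> notedb change migration with a larger JVM heap and 24 threads
    # (run from inside the gerrit 2.16 container in practice)
    time java -Xmx20g -jar /var/gerrit/bin/gerrit.war migrate-to-note-db -d /var/gerrit --threads 24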
20:06:01 <ianw> is this screen logging to a file?
20:06:05 <fungi> yeah, it always worked in the tests i observed
20:06:16 <fungi> ianw: no
20:06:36 <fungi> i can try to ask screen to start recording if you think that would be helpful
20:06:56 <ianw> might be worth a ctrl-a h if you like, ... just in case
20:07:20 <clarkb> what does that do?
20:07:29 <clarkb> (I suspect I'll learn something new about screen)
20:07:31 <ianw> actually it's a capital-H
20:07:31 <fungi> done. ~root/hardcopy.0 should have it
20:07:47 <ianw> clarkb: just keeps a file of what's going on
20:07:52 <fungi> okay, ~root/screenlog.0 now
20:08:02 <clarkb> TIL
20:08:08 <clarkb> alright I'm going to find lunch now then will check in again
20:08:45 <fungi> it's mostly something i've accidentally hit in the past and then later had to delete, though i appreciate the potential usefulness
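[For the record, both ways to get that log; these are stock screen features, nothing opendev-specific:]
    # inside a running session: C-a H toggles logging of the current window to ./screenlog.<n>
    # or start a session with logging on from the beginning:
    screen -L -S gerrit-upgrade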
20:35:52 <fungi> for folks who haven't followed closely, this is the "overnight" step, though if it completes at the 4.5 hour estimate (don't count on it) i should still be around to try to complete the upgrades
20:36:40 <fungi> the git gc which follows it is estimated at 1.5 hours as well though, which will be getting well into my evening at that point
20:37:21 <clarkb> ya as noted on the etherpad I kind of expected we'd finish with the gc then resume tomorrow
20:37:34 <clarkb> that gc is longer because it packs all the notedb stuff
20:37:42 <fungi> if both steps finish on schedule somehow, i should still be on hand to drive the rest assuming we don't want to break until tomorrow
20:38:04 <clarkb> ya I can be around to push further if you're still around
20:38:11 <fungi> the upgrade steps after the git gc should be fast
20:38:41 <fungi> the real risk is that we turn things back on and then there are major unforeseen problems while most of us are done for the day
20:38:43 <corvus> clarkb, fungi: etherpad link?
20:38:55 <fungi> https://etherpad.opendev.org/p/opendev-gerrit-3.2-upgrade-plan
20:39:10 <corvus> #link https://etherpad.opendev.org/p/opendev-gerrit-3.2-upgrade-plan
20:39:33 <clarkb> ya I don't think we turn it on even if we get to that point
20:39:38 <fungi> ooh, thanks for remembering meetbot is listening!
20:39:47 <clarkb> because we'll want to be around for that
20:40:50 <fungi> i definitely don't want to feel like i've left a mess for others to clean up, so am all for still not starting services up again until some point where everyone's around and well-rested
20:41:10 <corvus> we might be able to get through the 3.2 upgrade tonight and let it sit there until tomorrow
20:41:32 <fungi> that seems like the ideal, yes
20:41:41 <corvus> like stop at 5.17
20:41:55 <fungi> sgtm
20:42:20 <corvus> (i totally read that as "stop at procedure five decimal one seven")
20:43:32 <clarkb> ya I think that would be best.
20:43:54 <clarkb> fun fact: this notedb migration is running with the "make it faster" changes too
20:44:08 <fungi> i couldn't even turn on the kitchen tap without filling out a twenty-seven b stroke six, bloody paperwork
20:44:18 <clarkb> I got really excited about those changes then realized we were already testing with it
21:17:09 <clarkb> hrm the output indicates we may be closer to finishing than I would've expected?
21:17:29 <clarkb> Total number of rebuilt changes 757000/760025 (99.6%)
21:17:30 <fungi> i'm not falling for it
21:18:02 <clarkb> ya it's possible there are multiple passes to this or something
21:18:16 <clarkb> the log says its switching primary to notedb now
21:18:50 <clarkb> I will continue to wait patiently but act optimistic
21:21:20 <clarkb> oh ya it is a multipass thing
21:21:25 <clarkb> I remember now that it will do another reindex
21:21:39 <clarkb> built in to the migrator
21:21:50 <clarkb> got my hopes up :)
21:22:59 <clarkb> [2020-11-20 21:21:59,798] [RebuildChange-15] WARN  com.google.gerrit.server.notedb.PrimaryStorageMigrator : Change 89432 previously failed to rebuild; skipping primary storage migration
21:23:03 <clarkb> that is the cause of the traceback we see
21:23:13 <clarkb> (this was expected for a number of changes in the 10-20 range)
21:40:07 <ianw> don't know why kswapd0 is so busy
21:40:13 <clarkb> ya was just going to mention that
21:40:28 <clarkb> we're holding steady at ~500mb swap use and have ~36gb memory available
21:40:48 <clarkb> but free memory is only ~600mb
21:40:49 <ianw> i've seen this before and a drop_caches sometimes helps
21:41:10 <clarkb> drop_caches?
21:41:11 <ianw> echo 3 > /proc/sys/vm/drop_caches
21:42:50 <fungi> dope caches dude
21:43:09 <clarkb> "This is a non-destructive operation and will only free things that are completely unused. Dirty objects will continue to be in use until written out to disk and are not freeable. If you run "sync" first to flush them out to disk, these drop operations will tend to free more memory. " says the internet
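[Had they gone that route, the quoted advice translates to the following; generic kernel knobs, not something actually run during this maintenance:]
    # flush dirty pages to disk first so the drop can free more
    sync
    echo 3 | sudo tee /proc/sys/vm/drop_caches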
21:43:40 * fungi goes back to applying heat to comestible wares
21:43:42 <corvus> do we want to clear the caches?
21:44:24 <clarkb> presumably gerrit/java/jvm will just reread what it needs back itno the kernel caches when it needs them?
21:44:29 <clarkb> whether or not that will be a problem I don't know
21:44:45 <corvus> i guess that might avoid having the vmm write out unused pages to disk because more ram is avail?
21:44:57 <ianw> yeah, this has no effect on userspace
21:45:03 <ianw> well, other than temporal
21:45:04 <corvus> except indirectly
21:45:57 <corvus> (iow, if we're not using caches because sizeof(git repos)>sizeof(ram) and it's just churning, then this could help avoid it making bad decisions; but we'd probably have to do it multiple times.)
21:46:20 <corvus> (if we are using caches, then it'll just slow us down while it rebuilds)
21:47:02 <ianw> 2019-08-27
21:47:11 <ianw> * look into afs server performace; drop_caches to stop kswapd0,
21:47:11 <ianw> monitor
21:47:37 <ianw> that was where i saw it going crazy before
21:48:07 <corvus> ianw, clarkb: i think with near zero iowait and low cpu usage i would not be inclined to drop caches
21:48:15 <clarkb> the buff/cache value is going up slowly as the free value goes down slowly. But swap usage is stable
21:48:28 <clarkb> corvus: that makes sense to me
21:48:52 <corvus> could this be accounting for the cpu time spent performing cache reads?
21:49:24 <clarkb> I'm not sure I understand the question
21:49:38 <corvus> i don't know what actions are actually accounted for under kswapd
21:50:03 <ianw> https://www.suse.com/support/kb/doc/?id=000018698 ; we shoudl check the zones
21:50:10 <corvus> so i'm wondering if gerrit performing a bunch of io and receiving cache hits might cause cpu usage under kswapd
21:50:50 <corvus> ianw: If the system is under memory pressure, it can cause the kswapd scanning the available memory over and over again. This has the effect of kswapd using a lot of CPU cycles.
21:50:53 <corvus> that sounds plausible
21:53:06 <clarkb> my biggest concern is that the "free" number continues to fall slowly
21:53:41 <clarkb> do we think the cache value may fall on its own if we start to lose even more "free" space?
21:53:49 <corvus> clarkb: i think that's the immediate cause for kswapd running, but there's plenty of available memory because of the caches
21:54:06 <clarkb> corvus: I see, so once we actually need memory it should start to use what is available?
21:54:41 <clarkb> ah yup free just went back up to 554
21:54:47 <clarkb> from below 500 (thats MB)
21:54:57 <clarkb> so ya I think your hypothesis matches what we observe
21:54:59 <corvus> clarkb: yeah; i expect free to stay relatively low (until java exits)
21:55:43 <corvus> but it won't cause real memory pressure because the caches will be reduced to make more room.
21:55:58 <clarkb> in that case the safest thing may be to just let it run?
21:56:51 <corvus> i think so; if we were seeing cpu or io pressure i would be more inclined to intervene, but atm it may be working as designed.  no idea if we're benefitting from the caches on this workload, but i don't think it's hurting us.
21:57:29 <corvus> the behavior just changed
21:57:39 <corvus> (all java cpu no kswapd)
21:58:24 <clarkb> it switched to gc'ing all users
21:58:27 <clarkb> then I think it does a reindex
21:59:44 <ianw> yeah i think that dropping caches is a way to short-circuit kswapd0's scan basically, which has now finished
21:59:47 <clarkb> this is all included in the tool (we've manually done it in other contexts too, just clarifying that it is choosing these things)
22:06:03 <fungi> also with most of this going on in a memory-preallocated jvm, it's not clear how much fiddling with virtual memory distribution within the underlying operating system will really help
22:06:39 <clarkb> fungi: that 20GB is spoken for though aiui
22:07:08 <clarkb> which is about 1/3 of our available memory
22:07:18 <clarkb> (we should have plenty of extra)
22:08:57 <clarkb> I think this gc is single threaded. When we run the xargs each gc gets 16 threads and we do 16 of them
22:09:11 <clarkb> which explains why this is so much slower. I wonder if jgit gc isn't multithreaded
22:10:44 <clarkb> kids are out of school now. I may go watch the mandalorian now if others are paying attention
22:10:51 <clarkb> I'll keep an eye on irc but not the screen session
22:12:21 <clarkb> I just got overruled, great british bakeoff is happening
22:19:56 <fungi> i feel for you
22:20:07 <fungi> back to food-related tasks for now as well
22:39:36 * fungi find noel fielding to be the only redeeming factor for the bakeoff
22:40:19 <corvus> i've never seen a bakeoff, but i did recently acquire a pet colony of lactobacillus sanfranciscensis
22:40:40 * fungi keeps wishing julian barratt would appear and then it would turn into a new season of mighty boosh
22:42:12 <fungi> i have descendants of lactobacillus newenglandensis living in the back of my fridge which come out periodically to make more sandwich bread
22:43:36 <corvus> fungi: i await the discovery of lactobacillus fungensis.  that won't be confusing at all.
22:45:03 <fungi> it would be a symbiosis
22:45:08 <ianw> apropos the Mandalorian, the planet he's trying to reach is Corvus
22:45:23 <fungi> the blackbird!
22:47:31 * clarkb checked between baking challenges, it is on to reindexing now
22:47:36 <clarkb> iirc the reindexing is the last step of the process
22:48:09 <clarkb> it is slower than the reindexing we just did. I think because we just added a ton of refs and haven't gc'd but not sure of that
22:51:04 <corvus> ianw: wow; indeed i was looking up https://en.wikipedia.org/wiki/Corvus_(constellation) to suggest where the authors may have gotten the idea to name a planet that and google's first autocomplete suggestion was "corvus star wars"
22:51:14 <ianw> my son's obsessions are Gaiman's Norse Mythology, with odin's ravens, and the Mandalorian, who is going to Corvus, and I have a corvus at work
22:52:52 <corvus> ianw: you have corvids circling around you
22:53:29 <ianw> (actually he's obsessed with thor comics, but i told him he had to read the book before i'd start buying https://www.darkhorse.com/Comics/3005-354/Norse-Mythology-1 :)
22:55:51 <corvus> wow the radio 4 adaptation looks fun: https://en.wikipedia.org/wiki/Norse_Mythology_(book)
22:57:11 <fungi> you'll have to enlighten me on gaiman's take on norse mythology, i read all his sandman comics (and some side series) back when they were in print, but he was mostly focused on greek mythology at the time
22:58:06 <fungi> clearly civilization has moved on whilst i've been dreaming
22:59:20 <fungi> i think i have most of sandman still in mylar bags with acid-free backing boards
23:01:07 <fungi> delirium was my favorite character, though she was also sort of a tank girl rip-off
23:04:56 <ianw> fungi: it's a very easy read book, a few chuckles
23:05:38 <ianw> you would probably enjoy https://www.audible.com.au/pd/The-Sandman-Audiobook/B086WR6FG8
23:06:15 <ianw> https://www.abc.net.au/radio/programs/conversations/neil-gaiman-norse-mythology/12503632 is a really good listen on the background to the book
23:09:01 <fungi> now i'm wondering if there's a connection with dream's raven "matthew"
23:11:24 <clarkb> oof only to 5% now. I wonder if this reindex will expand that 4.5 hour estimate
23:12:07 * clarkb keeps saying to himself "we only need to do this once so its ok"
23:12:32 <fungi> follow it up with "so long as it's done when i wake up we're not behind schedule"
23:12:36 <corvus> it is all out on the cpu
23:12:49 <corvus> we have 16 cpus and our load average is 16
23:12:56 <clarkb> ya its definitely doing its best
23:12:59 <fungi> sounds idyllic
23:19:34 <clarkb> ideally we get to run the gc today too, I can probably manage to hit the up arrow key a few times in the screen and start that if its too late for fungi :)
23:19:45 <clarkb> but ya as long as thats done before tomorrow we're still doing well
23:19:48 <clarkb> s/before/by/
23:20:25 <fungi> yeah, if this ends on schedule i should have no trouble initiating the git gc, but...
23:25:45 <clarkb> if this pace keeps up it's actually on track for ~10 hours from now? that's rough napkin math, so I may be completely off
23:26:02 <clarkb> also if I remember correctly it does the biggest projects first then the smaller ones so maybe the pace will pick up as it gets further
23:26:12 <clarkb> since the smaller projects will have fewer changes (and thus refs) impacting reindexing
23:26:21 <clarkb> anyway its only once and should be done by tomorrow :)
00:05:47 <clarkb> counting off time to index 200 changes it does seem to be slowly getting quicker
00:05:59 <clarkb> but that might not be a wide enough sample to check
00:34:12 <clarkb> ~now is when we expected it to be done. It is not done if anyone is wondering. Still slow but maybe slowly getting quicker. I'll keep an eye on it
00:34:33 <clarkb> fungi: corvus: I'll aim to be back around about 15:00 tomorrow as well
00:34:40 <clarkb> but we'll see how I do
00:37:12 <clarkb> ~10k changes in ~17 minutes
00:37:21 <clarkb> not great
00:37:50 <clarkb> but also watching it like this may not be great for my health. I'm gonna take a break
01:29:20 <clarkb> I've discovered that there may actually have been a flag to tell the migrator to not reindex. That would have allowed us to do the gc'ing first then manually reindex. But at this point sticking to what we've tested is our best bet I think even if it takes all night
01:29:35 <corvus> ++
01:29:44 <corvus> plan the dive and dive the plan
01:29:56 <clarkb> are you mordred now?
01:30:21 <corvus> i, um, used to have a long daily commute by train and read pulp adventure novels
01:30:27 <clarkb> ha
01:30:54 <clarkb> for anyone following along I don't really expect this to finish before I go to bed so that I can kick off the gc
01:31:40 <clarkb> I'll still check on it, but probably try and return tomorrow at 15:00 UTC. Assuming it exits 0 I think fungi you can probably go ahead and start the gc? but wait on others before doing the next steps. Or if you'd prefer to wait for me to be awake I'm cool with that too
01:32:18 <corvus> clarkb: it's probably going to be fungi that hits the button; but in case i (or someone else) happens to be around first... it's ....
01:32:31 <corvus> sorry what step?
01:32:51 <clarkb> currently 4.3: time find /home/gerrit2/review_site/git/ -type d -name "*.git" -print0 | xargs -t -0 -P 16 -n 1 -IGITDIR sudo -H -u gerrit2 git --git-dir="GITDIR" gc --aggressive
01:33:09 <clarkb> please run echo $? when this current command finishes so we can confirm it exits 0
01:33:22 <clarkb> during testing we discovered that gerrit commands don't always tell you they have errored when they error :?
01:33:40 <clarkb> so it's echo $?, then if 0, step 4.3 from a couple lines above
01:34:21 <corvus> clarkb: so 4.1 (migrate-to-notedb) that's running now; then 4.2 when that finishes, and if it's zero and nothing seems to be on fire, 4.3 (gc).  right?
01:34:39 <clarkb> correct
01:34:51 <corvus> clarkb: can i 'strikethrough' the steps done on the etherpad?
01:34:58 <clarkb> corvus: yes I think that is fine
01:35:17 <corvus> done (and i bolded 4.1)
01:35:53 <corvus> clarkb: have a good evening!
01:36:05 <clarkb> I'll try! :) dinner then the mandalorian I hope
01:50:18 <fungi> i just caught up, had two episodes to get through
01:50:36 <fungi> and yeah, this looks like it's taking a while
01:51:13 <fungi> i'm planning to fire off the git gc when i wake up, assuming the reindex is even done by then
03:47:28 <ianw> Reindexing changes: project-slices: 29% (785/2697), 30% (235273/760363) (-) fyi
04:47:34 <clarkb> just crossed 300k
04:54:42 <clarkb> also I've learned that one of the things the wikimedia changes do is shuffle the project "slices". They are supposed to be broken down into smaller chunks to prevent a single repo like nova from dominating the cost
04:55:04 <clarkb> however, that element of randomness may explain why we see times that vary so much? or at least contribute to it
04:56:24 <clarkb> I haven't done as much testing as wikimedia did, but I would be really surprised if it is faster to skip around like that. it seems like you want to keep things warm in the cache
04:56:37 <clarkb> eg do all of nova, then do all of neutron and so on
05:01:24 <clarkb> "It does mean that reindexing after invalidating the DiffSummary cache will be expensive" is another tidbit from the source (I wonder if we're in that situation, perhaps induced by the notedb migration?)
05:09:31 <clarkb> oh neat they also split up slices based on changeid/number not actual ref count
05:09:50 <clarkb> so if you've got lots of changes with lots of refs (patchsets) in certain projects those won't be balanced well
05:11:19 <clarkb> they also use mod to split them up, so change 1 and 2 go in different slices, and 101 and 102 go in different slices if modding by 2. When you probably want them to be in the same slice due to git tree state cache warmth? Anyway that's probably enough java for me tonight. There is likely quite a bit of room for improvement in the reindexer to be more deterministic and less reliant on luck
05:17:00 <clarkb> oh and when we tested we would typically start gerrit at 2.16 and maybe that populates the DiffSummary caches? We didn't want to do that this time because to interact with it we'd have to drop our web notice. It would be funny if not starting on 2.16 without notedb was the problem
05:19:24 <mnaser> o/ is there an etherpad with the steps that are occurring and what was done / left to do for those curious people who want to watch from the sidelines ?
05:19:32 <mnaser> (aka me)
05:19:45 <clarkb> mnaser: https://etherpad.opendev.org/p/opendev-gerrit-3.2-upgrade-plan the bolded item is the one we're on
05:20:46 <clarkb> mnaser: we are currently doing the last part of the notedb migration which is a full reindex (which is going slower than expected, but we also planned for this long task to happen during the overnight between-days period)
05:21:11 <clarkb> when this is done we git gc all the repos to pack up the notedb contents (makes things faster), then upgrade to 3.0, 3.1, 3.2 and reindex again
05:22:32 <mnaser> Cool!  So it sounds like the major migration is done
05:22:55 <clarkb> the actual data migration part is, ya. Now it's a bunch of housekeeping around that (reindex and gc)
05:23:14 <mnaser> I’d argue that the actual migration into notedb is the trickier bit, indexing is indexing
05:23:17 <mnaser> Awesome
05:24:18 <mnaser> So I assume from now on, Gerrit will no longer use a database server
05:24:33 <mnaser> It will be using purely notedb I guess?
05:24:35 <clarkb> unfortunately that is a bad assumption :P
05:24:42 <clarkb> the accountPatchReviewDb remains in mysql
05:24:52 <clarkb> its the single table database that tracks when you have reviewed a file
05:25:08 <clarkb> but ya one of the changes I have proposed and WIP'd is one to remove the main db configuration from the gerrit config
05:25:46 <clarkb> we'll actually do that cleanup after we're settled on the new version as its ok to have the old db config in place. gerrit 3.2 will just ignore it
05:26:07 <mnaser> Oh I see
05:26:40 <mnaser> So in a way however the database is not that important, you’d just lose track of what patches you reviewed if that db is lost?
05:26:51 <clarkb> what files you have reviewed
05:26:59 <clarkb> the change votes are in notedb
05:27:10 <clarkb> you know when you look at a file diff and it gives you a checkmark on that file?
05:27:18 <mnaser> oh yes
05:27:18 <clarkb> that's all that database is doing: tracking those checkmarks next to files for you
05:27:26 <clarkb> and ya it's not super critical
05:28:44 <clarkb> replication to gitea will also take a bit once this is all done as all that notedb state will be replicated for changes
05:28:45 <mnaser> and I guess in terms of scale there are a few other deployments that have run at our scale or even bigger :p
05:29:19 <mnaser> oh ouch, that will add a lot of additional data that is replicated across every gitea system
05:29:21 <clarkb> ya I haven't checked recently. I think gerrithub may be similar? But they didn't really exist until notedb was a thing? I may misremember that. I know they were a driving force for it because it meant they could store stuff in github iirc
05:29:42 <clarkb> mnaser: ya the problem is refs/changes/12345/45/meta is where it goes
05:30:05 <clarkb> so you can't replicate the patchsets without the notedb content (since git ref spec doesn't allow you to exclude things like that as far as I can tell)
05:30:16 <clarkb> I don't expect it will cause many issues once we get the initial sync done
05:30:25 <clarkb> that will just take some time (in testing it was like 1.5 days)
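[A sketch of why the meta refs tag along; the remote name and url are made up, but the point is that replication.config push refspecs have no exclusion syntax, so replicating refs/changes/* necessarily carries the new notedb meta refs alongside the patchset refs:]
    [remote "gitea01"]
      url = git@gitea01.opendev.org:${name}.git
      push = +refs/heads/*:refs/heads/*
      push = +refs/tags/*:refs/tags/*
      push = +refs/changes/*:refs/changes/*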
05:30:30 <mnaser> Looks like gerrithub is in the 500000s of changes
05:30:55 <mnaser> And I think we’re in the 700k’s
05:31:05 <clarkb> 760363
05:31:23 <clarkb> we're watching a slow count up to that number on the reindex right now
05:31:37 <mnaser> Doesnt Google have a big installation too?
05:32:05 <clarkb> there is the gerrit gerrit, chrome, and android
05:32:11 <clarkb> however, google doesn't really run gerrit
05:32:21 <clarkb> they use dependency injection to replace a bunch of stuff aiui
05:32:52 <clarkb> so that it ties into their proprietary internal distributed filesystems and databases and indexers etc
05:33:09 <mnaser> The chrome one is at 2.5m wow heh
05:33:23 <mnaser> Oh I see so they’re probably not running notedb
05:33:30 <clarkb> we discovered this the hard way when we did an upgrade once and jgit just didn't work
05:33:43 <clarkb> it turned out that jgit was fine talking to their filesystem/storage/whatever it was but not to a posix fs
05:34:02 <clarkb> and so no one caught it until an open source deployment upgraded
05:34:04 <clarkb> (us)
05:34:18 <mnaser> ouch
06:25:01 <corvus> i think they're using notedb, but the git data store isn't what mere mortals use
06:25:35 <corvus> Reindexing changes: project-slices: 49% (1345/2697), 51% (390766/760363) (/)    |
06:26:05 <corvus> that's a timestamped progress status before i go to bed
10:21:28 <ianw> Reindexing changes: project-slices: 74% (2021/2697), 77% (587125/760363) (-)
10:21:45 <ianw> 25% in ~ 4 hours
10:22:26 <ianw> that puts it at about 14:00UTC to finish
12:21:06 <fungi> yeah, awake again and it's claiming around 88% complete now
12:27:12 <fungi> Reindexing changes: project-slices: 87% (2373/2697), 89% (679414/760363)
13:44:24 <fungi> 99%!
13:46:49 <fungi> once this wraps up, assuming it looks good, i'll start the git gc and then i need to run out to the hardware store to pick up an order for some tools
14:11:52 <fungi> 1086m41.925s
14:12:10 <fungi> that's 18h6m42s
14:12:23 <fungi> exited 0
14:13:05 <fungi> i've pulled the gc command back up and will start it momentarily
14:13:24 <fungi> just need to switch computers to double-check our notes
14:18:59 <fungi> okay, looking good and i've updated our notes to indicate which step we're on, gc is running now
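A rough sketch of the kind of gc pass being run here, assuming Gerrit's bare repositories live under /home/gerrit2/review_site/git (a hypothetical path; the exact command and options from the maintenance notes may differ):

    # walk every bare repo under the Gerrit site and garbage collect it,
    # timing the whole pass
    cd /home/gerrit2/review_site/git
    time find . -type d -name '*.git' -prune | \
        while read -r repo; do git -C "$repo" gc; done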
14:19:45 <clarkb> thanks. I'm very slowly waking up but maybe I can take it easy for another hour or two now
14:19:46 <fungi> estimated time to completion is 1.25 hours so hopefully done before 16:00
14:20:05 <clarkb> the previous gc times were fairly accurate if a few minutes fast iirc
14:20:45 <fungi> other than the final offline reindex, all the other steps should go quickly
14:21:14 <fungi> at least up until we start gerrit again, and then there's the replication which will probably take ages
14:21:47 <fungi> and the long tail of fixing things which are broken (some of which we know about, some of which we likely don't yet)
14:22:39 <fungi> anyway, not seeing any obvious errors stream by, so i'll take this opportunity to go pick up my order and be back in plenty of time for the rest of the upgrade
14:22:54 <clarkb> thanks again
14:25:07 <corvus> o/
15:21:24 <fungi> okay, i'm back. if the gc finishes at 1.25 hours then that'll be ~12 minutes from now
15:22:16 <clarkb> judging by the cinder runtime when I checked about 5 minutes ago I think it will be longer but not significantly so. All the expensive repos seem to be processing at this point
15:24:34 <clarkb> nova, cinder, horizon, manuals
15:24:42 <clarkb> oh and neutron
15:33:50 <clarkb> nova is the only one running now
15:39:56 <fungi> 80m54.544s
15:40:02 <fungi> exited 0
15:40:26 <clarkb> about 6 minutes longer than estimated, much better.
15:41:15 <fungi> okay, ready for the next pull?
15:41:29 <clarkb> yes that looks good to me
15:41:59 <fungi> opendevorg/gerrit   3.0                 fbd02764262c        46 hours ago        679MB
15:42:10 <clarkb> that looks about right
15:42:33 <clarkb> if you're ready to run the init I am
15:42:37 <fungi> running
15:42:54 <fungi> in testing this was near instantaneous
15:43:08 <fungi> 0m12.344s and exited 0
15:43:13 <fungi> no error messages
15:43:54 <fungi> ready for me to work on 3.1 or want to check anything?
15:44:10 <clarkb> I don't think there is anything other than the exit code to check
15:44:19 <clarkb> lets do 3.1. This init doesn't do any schema updates
15:44:46 <fungi> and pulling
15:45:05 <fungi> opendevorg/gerrit   3.1                 eae7770f89d6        46 hours ago        681MB
15:45:08 <clarkb> lgtm
15:45:33 <fungi> ready to init with 3.1?
15:45:39 <clarkb> I think so. Can't think of anything else to check first
15:45:46 <fungi> underway
15:45:59 <clarkb> and done
15:46:01 <fungi> 0m11.280s
15:46:07 <fungi> exited 0
15:46:44 <fungi> ready to pull 3.2?
15:46:46 <clarkb> yup
15:47:03 <fungi> opendevorg/gerrit   3.2                 6fdfe303e8df        46 hours ago        681MB
15:47:05 <clarkb> that image lgtm
15:47:26 <clarkb> I think we can do the reindex
15:47:32 <fungi> running
15:47:32 <clarkb> er no sorry I keep getting ahead of myself
15:47:34 <clarkb> the init
15:47:45 <fungi> yeah, the init is what i'm running, sorry
15:47:49 <clarkb> the command you have queued looks right :)
15:47:56 <fungi> okay, running now
15:48:23 <fungi> 0m13.628s and exited 0
15:48:32 <fungi> *now* it's time to reindex
15:48:46 <clarkb> yup and the command you have up for that lgtm
15:48:57 <fungi> okay, starting it now
15:49:08 <fungi> eta 41 minutes
15:49:36 <fungi> then we start gerrit and begin unwinding things
15:50:41 <clarkb> or 18 hours :/
15:50:47 <fungi> yeah, ugh
15:51:54 <fungi> well, we're already at 1% done so hopefully not 18 hours
15:52:12 <clarkb> ya this is going much quicker just counting off progress at 20 second intervals
15:52:22 <clarkb> we were doing about 200 changes per 20 second interval last night. This just did like 4k
15:52:37 <clarkb> I think the gc'ing helps tremendously
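For reference, a hedged sketch of the underlying Gerrit commands behind the init and reindex steps above; in this deployment they are wrapped in docker run against the opendevorg/gerrit images, and the site path inside the container is an assumption:

    # run the site initialization non-interactively, then rebuild the indexes
    java -jar /var/gerrit/bin/gerrit.war init    -d /var/gerrit --batch --no-auto-start
    java -jar /var/gerrit/bin/gerrit.war reindex -d /var/gerrit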
15:53:51 <clarkb> for the unwinding it would be good for others to maybe look over what I've written down again and just sanity check it. I think my biggest concern at this point is any interaction between our ci/cd and gitea replication lag
15:54:18 <clarkb> I believe in cd we pull from gerrit and not gitea so that isn't an issue but I've got us explicitly replicating our infra repos first to mitigate that
15:55:08 <clarkb> as another sanity check our disk utilization has gone up about 5GB since the gc which is what we expected based on testing
15:55:14 <clarkb> 93GB -> 98GB on that fs
15:55:31 <clarkb> the unpacked state was about 110GB iirc
15:57:12 <clarkb> already up to 10% much much quicker this time
15:59:40 <clarkb> those exceptions in the screen scrollback are expected (small number of corrupted changes)
15:59:52 <corvus> gerrit needs its coffee
16:00:33 <corvus> i'm estimating ~18:00 for completion of this step
16:00:58 <corvus> oh the rate seems to have just significantly improved
16:01:46 <corvus> and my math was wrong
16:01:59 <clarkb> corvus needs coffee too?
16:02:14 <corvus> maybe ~17:00?
16:02:45 <clarkb> ya about another hour by my math
16:03:01 <clarkb> it took ~14 minutes to get to 20% so another 4 blocks of 14 minutes
16:05:13 <corvus> i have to run an errand; i probably won't be back until after this completes, but i'll check in when i get back and see if there's unexpected issues i can help with
16:17:12 <fungi> thanks!
16:26:26 <clarkb> it is up to 61% now
16:27:35 <clarkb> I guess the trick with the notedb migration would've been to somehow stop that process prior to reindexing, then garbage collect, then reindex manually. Reading the code there is a --reindex flag but it isn't clear to me if you can negate that somehow. Anyway we shouldn't need to do this again so not worth thinking about too much anymore
16:28:24 <clarkb> fungi: not to get ahead of myself, but do you think we should block port 29418 and leave the apache message in place when we first start gerrit? then check that logs indicate it is happy before opening things up?
16:28:50 <clarkb> I did have us starting gerrit before updating apache to check logs but realize that port 29418 would still be accessible
16:37:55 <fungi> yeah, wouldn't hurt to temporarily remove public access to that port initially, but obviously we shouldn't start up anything which would need access either (like zuul)
16:38:37 <fungi> i can edit the firewall rules temporarily now to do that. i'll use a second window in that screen session
16:40:17 <clarkb> ya there are a number of things I think we should do before starting zuul in the etherpad
16:40:56 <fungi> and done
16:40:58 <clarkb> thanks
16:41:18 <clarkb> I'm putting together a list of scripts to update to use the 3.2 image on review.o.o now since it occurred to me that we run manage-project type things periodically iirc
16:41:19 <fungi> iptables -nL and ip6tables -nL now report no allow rule for 29418
16:41:33 <clarkb> and we don't want them to use the old image (it's actually probably ok for them to use the old image since it's the same version of jeepyb, but I don't want to count on that)
16:41:37 <fungi> (i left the overflow reject rules for 29418 in there for now)
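A hedged sketch of that firewall edit, assuming the persistent rule files are /etc/iptables/rules.v4 and rules.v6 (the actual edits were made by hand and may differ):

    # drop the public allow rules for Gerrit's SSH port and reload,
    # leaving the overflow/reject rules for 29418 in place
    sudo sed -i '/--dport 29418 .*-j ACCEPT/d' /etc/iptables/rules.v4 /etc/iptables/rules.v6
    sudo iptables-restore  < /etc/iptables/rules.v4
    sudo ip6tables-restore < /etc/iptables/rules.v6
    sudo iptables -nL | grep 29418    # confirm only the reject rule remains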
16:42:26 <fungi> 90%
16:43:23 <clarkb> docker-compose.yaml, /usr/local/bin/manage-projects, /usr/local/bin/track-upstream seem to be the files using that variable when I grep in system-config
16:43:46 <clarkb> docker-compose is already edited but we should update the other two before starting zuul (I've made a note in the etherpad too)
16:48:20 <clarkb> done in 59 minutes
16:48:25 <fungi> 59m13.719s exited 0
16:48:27 <fungi> yep
16:48:54 <fungi> okay, and 29418 is currently blocked so in theory we can start gerrit and check its service logs for obvious signs of distress
16:49:01 <clarkb> yup I think that is our next step
16:49:09 <clarkb> docker-compose up -d
16:49:20 <fungi> ready?
16:49:23 <clarkb> I guess so
16:50:01 <clarkb> Gerrit Code Review 3.2.5-1-g49f2331755-dirty ready
16:50:23 <clarkb> that plugin manager exception is expected. I believe it is because we don't enable the plugin manager in our config but have the plugin installed
16:50:44 <fungi> something to add to the to do list to remove or enable i guess
16:50:48 <clarkb> ya
16:51:30 <clarkb> before we open things up I should add my gerrit admin ssh key. But I think you've had more experience with doing those things so maybe you want to do the force submit of the change if it still looks good to you as well as kick off replication for system-config and project-config?
16:51:39 <clarkb> we want to force merge first then replicate I think
16:51:47 <clarkb> also before we go further let me reread the etherpad notes :)
16:52:13 <fungi> are you going to be able to do those things without 29418 open?
16:52:25 <clarkb> no I'm saying lets just be ready for that when we open it
16:52:30 <fungi> oh, sure
16:52:56 <clarkb> before we open things though why don't we fix /usr/local/bin/manage-projects and /usr/local/bin/track-upstream ?
16:53:04 <clarkb> we need to change the image version in those scripts to 3.2
16:53:10 <fungi> once 29418 is open i can add your openid account to project bootstrappers temporarily so you can add verify +2 and call submit
16:53:49 <clarkb> fungi: do you want to do the script fix in the screen or should I just do them off screen then you can confirm on screen?
16:53:52 <fungi> do we have a change to update /usr/local/bin/manage-projects and /usr/local/bin/track-upstream already?
16:54:17 <fungi> they're not going to get called until we reenable the crontabs
16:54:19 <clarkb> fungi: yes, the change which we force merge sets gerrit_container_image in ansible vars and that is used in docker-compose and the two scripts
16:54:26 <fungi> ahh, okay
16:54:28 <clarkb> fungi: manage-projects is called by zuul periodically iirc
16:54:38 <clarkb> so once zuul is up it may try it
16:54:43 <fungi> well, ansible is still disabled for the server too
16:54:48 <clarkb> oh good point
16:55:02 <clarkb> well I think we should fix it anyway since it's a good sanity check
16:55:13 <fungi> sure, i can edit those manually for now
16:55:19 <clarkb> my concern in particular is a race between the config management updates and the manage-project updates
16:55:22 <clarkb> I don't know that they always go in order
16:56:38 <fungi> lgty?
16:56:40 <clarkb> those edits lgtm thanks
16:57:38 <clarkb> ok give me a minute to get situated with auth things then I guess we can turn it on and force merge the config mgmt change then replicate
16:59:39 <clarkb> alright i've got keys loaded and have my totp token
17:00:14 <fungi> cool, so open 29418 first or undo the maintenance page in apache first?
17:00:19 <clarkb> I think lets undo apache first
17:00:51 <fungi> does that look correct?
17:00:58 <clarkb> yes, but we also want to remove the /p blocks too
17:01:23 <fungi> like that?
17:01:26 <clarkb> yup
17:01:48 <fungi> ready for me to reload apache2?
17:02:09 <clarkb> let me just double check zuul isn't running somehow
17:02:16 <fungi> k
17:02:30 <clarkb> ps shows no zuul processes on zuul01
17:02:44 <clarkb> I guess we continue unless you can think of anything else
17:02:52 <fungi> nope, nothing comes to mind
17:03:00 <fungi> and it's up
17:03:16 <fungi> i get the webui
17:03:41 <fungi> signing in
17:04:08 <clarkb> I'm signed in
17:04:30 <clarkb> as my regular user. Did you want to review https://review.opendev.org/c/opendev/system-config/+/762895/1 and maybe be the one to force merge it?
17:04:36 <clarkb> doesn't look like anyone else has voted on it yet
17:04:54 <fungi> yeah, signed in as my normal user too
17:05:01 <fungi> firing up gertty
17:05:20 <fungi> seems to be syncing okay
17:05:24 <clarkb> I'm removing my WIP on that change now
17:06:45 <fungi> should remember to remind gertty users that they now need to add "auth-type: basic" to their configs
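A minimal sketch of the gertty config tweak being referred to, assuming a typical ~/.gertty.yaml with a single server entry (other settings elided):

    servers:
      - name: opendev
        url: https://review.opendev.org/
        username: myuser
        password: my-http-password
        auth-type: basic    # needed after the upgrade; previously digest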
17:09:43 <fungi> worth noting you actually wanted me to review https://review.opendev.org/c/opendev/system-config/+/762895/2 not /1
17:09:55 <fungi> took a bit to realize i was looking at an old patch there
17:10:15 <clarkb> oh sorry, that's what it redirected me to from my link in the etherpad
17:10:31 <clarkb> because etherpad had the /1 too
17:12:02 <fungi> no worries, i've voted +2 on it
17:12:11 <clarkb> fungi: ok do you want to submit it or do you want me to?
17:12:24 <fungi> i can do it, just a sec
17:12:36 <clarkb> you'll need to add the +2 verified too
17:12:41 <fungi> yep
17:12:49 <fungi> and workflow +1 obviously
17:13:39 <clarkb> once that force merges I want  to see if replication for system-config replicates everything or just that ref
17:13:56 <clarkb> but generally replicate system-config and project-config next I think
17:16:35 <fungi> fatal: "762895" no such change
17:16:48 <fungi> d'oh
17:16:56 <fungi> i was doing it to review-test ;)
17:17:02 * fungi curses his command history
17:17:37 <fungi> need to open 29418 on review.o.o for this
17:17:46 <fungi> are we good with that?
17:17:49 <clarkb> yes I am
17:17:59 <clarkb> also you still need the verified +2 (I assume your admin accounts will do that)
17:18:11 <fungi> it will
17:18:36 <clarkb> fungi: note that rules.v4 is the file now iirc
17:19:14 <clarkb> and if we missed actually blocking 29418 on ipv4 then oh well at this point :) it seems fine
17:19:14 <fungi> yeah, i'm just keeping rules consistent with it until we confirm and clean up the cruft
17:19:18 <clarkb> kk
17:19:23 <fungi> i edited all three
17:19:30 <clarkb> gotcha
17:20:35 <fungi> okay, it's merged and i've removed membership for my admin account from project bootstrappers
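Roughly what that force-merge looks like through the Gerrit SSH CLI, as a hedged sketch; the admin account name is hypothetical and the exact label flags depend on site configuration:

    # temporarily add the admin account to the bootstrappers group
    ssh -p 29418 admin@review.opendev.org gerrit set-members --add admin.account 'Project Bootstrappers'
    # vote and submit the change directly
    ssh -p 29418 admin@review.opendev.org gerrit review --code-review +2 \
        --label Verified=+2 --label Workflow=+1 --submit 762895,2
    # then remove the membership again
    ssh -p 29418 admin@review.opendev.org gerrit set-members --remove admin.account 'Project Bootstrappers'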
17:20:53 <clarkb> now lets see what is being replicated
17:21:34 <clarkb> nothing in the queue so did it only replicate that ref? /me looks at gitea
17:22:04 <clarkb> https://opendev.org/opendev/system-config/commit/2197f11a0f27da3f9bd1c009c84107dc09559f6e yes only that ref
17:22:31 <fungi> neat
17:22:40 <fungi> i suppose we need to manually trigger a full replication
17:22:41 <clarkb> what I think that means is we could not replicate anything and let it catch up over time?
17:22:57 <clarkb> ya or we manually replicate. I still think we manually replicate system-config and project-config first though
17:23:15 <fungi> i can trigger replication for system-config first
17:23:15 <clarkb> probably ripping off this bandaid is the best option to ensure we have plenty of disk on the giteas
17:23:22 <clarkb> fungi: ++ that would be great
17:23:44 <fungi> triggered
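The per-project trigger here uses the replication plugin's SSH command; a sketch, with the admin account name assumed, plus the queue check used below to watch it drain:

    ssh -p 29418 admin@review.opendev.org replication start opendev/system-config --wait
    ssh -p 29418 admin@review.opendev.org gerrit show-queue -w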
17:24:36 <clarkb> there are already 4 new changes too
17:25:14 <clarkb> hrm system-config is done replicating? that took suspiciously little time
17:27:37 <clarkb> I see the refs on gitea01 though
17:27:52 <clarkb> I wonder if part of the reason we were slow replicating in testing was network bandwidth
17:28:01 <fungi> could be...
17:28:09 <fungi> i can trigger project-config next
17:28:12 <clarkb> ++
17:28:26 <fungi> done
17:28:42 <clarkb> this might be overoptimization, but we may also want to do nova, neutron, cinder, horizon, openstack-manuals so that we can run the gitea gc after they are done
17:28:44 <fungi> i assumed we were talking openstack/project-config not opendev/project-config in this case
17:28:47 <clarkb> since they should be the biggest repos
17:28:51 <clarkb> fungi: correct
17:29:03 <fungi> sure i can do nova next and see what happens
17:29:41 <clarkb> wow it says project-config is done
17:30:14 <fungi> tailing replication_log in the screen session is probably not useful. it lags waaaay behind because of how verbose the log is
17:30:50 <clarkb> spot checking project-config on gitea01 shows that it seems to have worked too
17:30:57 <clarkb> I see refs/changes/xy/abcxy/meta
17:31:34 <clarkb> but ya lets work through that list I posted just above, check disk usage on gitea01 and run gc on all the giteas if it looks like we expanded disk use a lot
17:31:57 <clarkb> then when we're happy with that trigger full replication then start looking at zuul I guess
17:32:18 <fungi> the replication log is really, really busy though, are you sure it's not actively replicating everything?
17:32:35 <clarkb> fungi: gerrit show-queue -w says no
17:32:42 <fungi> strange
17:32:54 <clarkb> if I start a new tail on replication_log its quiet
17:33:02 <clarkb> I think thats just screen and ssh buffering with large amounts of text?
17:33:32 <fungi> previously it was very noisy but now it seems to have quiesced, yeah
17:33:58 <fungi> okay, i'll do nova now
17:34:06 <clarkb> ++
17:34:09 <fungi> and it should be running
17:34:37 <clarkb> I see it in the show queue
17:34:43 <fungi> yeah
17:36:30 <clarkb> I see disk use slowly increasing on gitea01 so it seems to be doing things
17:40:07 <fungi> status notice The Gerrit service on review.opendev.org is accepting connections but is still in the process of post-upgrade sanity checks and data replication, so Zuul will not see any changes uploaded or rechecked at this time; we will provide additional updates when all services are restored.
17:40:16 <fungi> something like that ^?
17:40:20 <clarkb> sounds good to me
17:41:16 <clarkb> nova replication is done according to show queue and disk use increased by about a gig so ya I think doing some of these big ones first, gc'ing then doing everything is a good idea
17:41:19 <fungi> #status notice The Gerrit service on review.opendev.org is accepting connections but is still in the process of post-upgrade sanity checks and data replication, so Zuul will not see any changes uploaded or rechecked at this time; we will provide additional updates when all services are restored.
17:41:19 <openstackstatus> fungi: sending notice
17:42:04 <fungi> okay, i'll do openstack-manuals next
17:42:09 <clarkb> ++
17:42:15 <fungi> and it's running
17:42:54 <clarkb> and honestly at the rate these have gone I think we should start global replication, benchmark it, then see if we can wait a bit before starting zuul since it seems quick. If benchmarks say it will be all day then nevermind
17:43:33 <fungi> sure
17:43:49 <clarkb> since that will rule out any out of sync unexpectedness
17:44:24 <clarkb> manuals is done
17:44:30 <openstackstatus> fungi: finished sending notice
17:44:43 <fungi> neutron next?
17:44:43 <clarkb> I think you can just enqueue the others in the list and let gerrit figure out ordering
17:45:02 <clarkb> I would just tell it to do neutron cinder and horizon now
17:45:43 <fungi> yup, was just finding your original list in scrollback
17:46:02 <clarkb> that list is based on which things were slow to gc which implies more data/more refs
17:46:04 <fungi> triggered all three
17:47:48 <clarkb> horizon is done, neutron and cinder still running
17:48:58 * mnaser is playing around gerrit right now
17:49:21 <fungi> just be aware zuul is still offline
17:49:31 <mnaser> fungi: yep!  i'm just trying to see if the gerrit functionality itself seems to be okay
17:49:43 <fungi> thanks, appreciated!
17:49:47 <mnaser> i am noticing a few things, none are critical of course, but "oh, interesting" type of things
17:49:58 <clarkb> mnaser: ya I expect a lot of that :)
17:50:00 <fungi> sure, i'm going to hate the new ui for a while i'm sure
17:50:08 <mnaser> i.e. anything except verified/code-review/workflow are under this thing called "Other labels"
17:50:08 <clarkb> polygerrit adds a bunch of new excellent features and some not so great things
17:50:18 <mnaser> so roll call votes in governance are under "Other labels"
17:50:46 <mnaser> backport candidate patches seem to be affected too, not a big deal but maybe good for us to know how it decides what's other and what's not
17:50:48 <clarkb> but where we were was a dead end so we're ripping the bandaid off and going to try and work upstream and with plugins etc to make stuff better
17:50:57 <clarkb> mnaser: have a link to a change so we can see that?
17:51:11 <mnaser> sure -- https://review.opendev.org/c/openstack/governance/+/760917
17:51:38 <fungi> also i'm noticing that the gitweb links are broken, probably worth working on a proper link to gitea to replace those anyway
17:51:56 <mnaser> you can see rollcall-vote is under other labels, so is code-review in there (but i guess maybe that's cause code-review doesn't mean anything for merging inside openstack/governance)
17:52:32 <fungi> might be a good time to start a post-upgrade notes etherpad where we can collect lists of things which have changed people might ask about, and things we know are broken which will either be fixed or removed
17:52:45 <mnaser> yeah, i can start putting a few things in there too
17:53:08 <clarkb> ++
17:53:08 <mnaser> some other minor things are the ordering of code review comments
17:53:15 <mnaser> it seems to be verified, code-review then workflow
17:53:34 <clarkb> I think it was that way before?
17:53:41 <clarkb> I've already forgotten
17:53:49 <mnaser> i remember you would see code-review, verified, workflow in the list
17:53:55 <mnaser> zuul always came in the middle, workflow was always at the end
17:54:08 <mnaser> (in the display of votes at least)
17:54:21 <clarkb> fungi: ok those replications are done and we're using 4gb extra disk. I'll trigger the gc cron on all of the giteas now? any other repos you think we should replicate first?
17:54:28 * corvus checking in
17:54:58 <clarkb> corvus: tl;dr is gerrit is up and seems ok so far. replication is much quicker than anticipated. We are manually triggering replication for "large" repos so that we can gc on the giteas to pack back down again then start global replication
17:55:13 <clarkb> after that we'll be looking at zuul
17:55:42 <fungi> i've started a pad here https://etherpad.opendev.org/p/gerrit-3.2-post-upgrade-notes
17:55:44 <corvus> ++
17:55:47 <fungi> mnaser: ^
17:56:09 <mnaser> fungi: cool i'll fill those out
17:56:22 <fungi> clarkb: i agree, git gc on gitea next
17:56:26 <clarkb> corvus: fungi any other repos we should manually replicate? We have done system-config project-config nova neutron cinder horizon and openstack manuals
17:56:31 <fungi> then we can do a full replication
17:56:48 <corvus> can't think of any others
17:56:49 <clarkb> fungi: k will give corvus a minute to bring up any other repos that may be worth doing that to then I can do the gitea gc'ing
17:56:55 <clarkb> cool I'll work on gitea gc'ing now
17:57:01 <fungi> just to avoid overrunning the fs with all of them at once
17:57:12 <fungi> thanks!
17:57:43 <mnaser> something i remember broke last time we did an update was all the bp topic links from specs
17:57:46 <mnaser> i just tested one and its working just fine
17:57:59 <mnaser> specifically: https://review.opendev.org/#/q/topic:bp/action-event-fault-details from https://blueprints.launchpad.net/nova/+spec/action-event-fault-details as an example
17:58:25 <mnaser> oops
17:58:28 <mnaser> i found our first broken thing
17:59:24 <mnaser> Directly linked changes are redirecting to an incorrect port, Example: https://review.opendev.org/712697 => Location: https://review.opendev.org:80/c/openstack/nova/+/712697/
17:59:28 <mnaser> i added that to the etherpad
17:59:59 <mnaser> i remember fixing that inside our gerrit installation actually, let me find
18:00:26 <clarkb> that could be related to the thing fungi linked about after the bug fixing this week
18:00:32 <fungi> mnaser: that may be a known issue, at least wmf and eclipse ran into it and filed bugs
18:00:37 <mnaser> if i remember right, we did this: `listenUrl = proxy-https://*:8080/`
18:00:43 <mnaser> or maybe that was for https redirection stuff
18:00:50 <fungi> apparently we can fiddle the proxy settings in apache if it's the same issue
18:00:53 * fungi checks notes
18:01:06 <clarkb> all 8 giteas are gc'ing now
18:02:01 <fungi> mnaser: can you see if it looks like https://bugs.chromium.org/p/gerrit/issues/detail?id=13701
18:02:10 <clarkb> using /c/number works fwiw
18:02:16 <clarkb> that may be an easy workaround for now if necessary
18:02:35 <fungi> if so the solution is supposedly "X-Forwarded-Proto" expr=%{REQUEST_SCHEME}" in our vhost config
18:03:17 <corvus> clarkb: well, the /# links are supposed to be "permalinks" so i don't think "use /c" is an easy solution (the problem is existing links point there)
18:03:18 <mnaser> that makes sense
18:03:27 <clarkb> corvus: yup we should fix it
18:03:54 <corvus> fungi: x-forward-proto makes sense to me
18:03:55 <mnaser> "X-Forwarded-Proto is now required because of underlying upgrade of the Jetty library, when Gerrit is accessed through an HTTP(/S) reverse-proxy."
18:03:57 <clarkb> I think I have figured out why replication timing is so much better. it's because we're not replicating all the actual git content now
18:04:02 <mnaser> indeed, so yes, that does all make sense
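A minimal sketch of the Apache vhost addition being discussed, based on the workaround from that bug (mod_headers assumed enabled; placement within the review.opendev.org vhost is an assumption):

    # tell Gerrit/Jetty which scheme the client actually used so redirects
    # come back as https rather than the proxied scheme/port
    RequestHeader set X-Forwarded-Proto "expr=%{REQUEST_SCHEME}"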
18:04:26 <corvus> anyone writing an x-forwarded-proto change?
18:04:42 <clarkb> I'm not
18:04:58 <corvus> looks like i am :)
18:05:03 <clarkb> in fact I need to find something to drink. back shortly
18:05:17 * mnaser keeps looking
18:05:25 <corvus> i kind of want to ninja the fix in first just to make sure it works
18:05:42 <fungi> corvus: please feel free to hand-patch it into the config first
18:05:50 <corvus> k will do both
18:06:17 <fungi> i agree the change isn't much good if the fix turns out to be incorrect for our deployment for some reason
18:06:32 <mnaser> i'll add the "you'll need a new version of git-review" to "what's changed"
18:06:36 <mnaser> as i guess that might come up
18:08:37 <corvus> mnaser: redirect look good now?
18:08:59 <mnaser> corvus: yes!  working in my browser and curl shows the right path too
18:09:12 <clarkb> yay
18:10:29 <mnaser> seems like gerritbot is not posting changes
18:10:33 <fungi> mnaser: i think it's git-review>=1.26
18:10:35 <mnaser> i am not sure if thats cause its turned off or
18:10:50 <fungi> it probably needs to be restarted now that the event stream is accessible
18:10:56 <fungi> i'll do that now
18:11:13 <clarkb> corvus: fwiw if you're looking at the vhost I think there may be old cruft in there we should clean up. I always get lost when looking though
18:11:15 <mnaser> fungi: i see the 1.27.0 release notes have: "Update default gerrit namespace for newer gerrit. According to Gerrit documentation for 2.15.3, refs/for/’branch’ should be used when pushing changes to Gerrit instead of refs/publish/’branch’." -- is it not that change?
18:11:19 <corvus> remote:   https://review.opendev.org/c/opendev/system-config/+/763577 Add X-Forwarded-Proto to gerrit apache config [NEW]
18:11:43 <fungi> gerritbot has been restarted
18:11:44 <clarkb> corvus: fungi ^ should we force merge that one too?
18:11:54 <corvus> clarkb: i see comments related to upgrade i will address them
18:12:13 <clarkb> corvus: well the upgrade things should be handled
18:12:18 <mnaser> well look at that, i can now post emojis in my changes without a 500
18:12:18 <mnaser> :P
18:12:21 <clarkb> as part of the earlier force merge
18:12:31 <fungi> mnaser: thanks, yeah 1.27 sounds right, i was going from memory
18:13:19 <corvus> clarkb: oh, er, what do you want me to do?
18:13:35 <corvus> clarkb: i agree that the TODO lines have been removed in system-config master
18:13:37 <clarkb> I'm more thinking about what I think is old gitweb config. I don't think it needs doing now. I just mean someone that groks apache better than me should look at that vhost and audit it
18:13:48 <corvus> clarkb: i have manually removed them from the live apache config
18:13:51 <clarkb> as there may be a few cleanups we can do
18:13:53 <clarkb> corvus: thanks
18:13:58 <corvus> but they were already commented out, so that should all be a noop
18:14:21 <mnaser> is the "links" part in the gerrit change display something that is customizable by the deploy (where gitweb currently is listed?). if so, probably would be neat if we added a "zuul builds" link which went to a prefiltered zuul build search using the changeid!
18:14:30 <clarkb> the gitea gc's are still going. The cron only does one repo at a time
18:15:00 <clarkb> mnaser: you can probably write a plugin for that
18:15:15 <mnaser> ok i see, so the gitweb link comes from a plugin
18:15:25 <clarkb> mnaser: gitweb is built in but gitiles is a plugin
18:15:28 <clarkb> aiui
18:15:49 <mnaser> https://review.opendev.org/q/hashtag:%22dnm%22+(status:open%20OR%20status:merged) tags stuff working pretty neatly too
18:15:58 <fungi> but i feel like we should consider replacing that with a link to gitea anyway if we can
18:16:02 <corvus> mnaser: https://review.opendev.org/Documentation/dev-plugins.html#links-to-external-tools may be relevant?
18:16:11 <corvus> looks like we'd need to do a tiny plugin
18:16:32 <mnaser> ou, that's pretty cool and seems like it would be quite straightforward too
18:17:04 <corvus> not sure if that's the right interface to put it in the 'links' section
18:17:14 <corvus> but seems pretty close to that
18:17:29 <corvus> could incorporate that into the zuul plugin
18:18:00 <clarkb> fungi: https://review.opendev.org/c/opendev/system-config/+/763577 lgtm if you want to review that one and force merge it too?
18:18:03 <corvus> speaking of which https://gerrit.googlesource.com/plugins/zuul/
18:18:16 <mnaser> https://review.opendev.org/c/openstack/project-config/+/763576 seems to work pretty well too for a WIP change that is accessible :)
18:18:20 <corvus> also https://gerrit.googlesource.com/plugins/zuul-status/
18:18:55 <corvus> btw gertty has half-implemented support for hashtags
18:19:00 <corvus> i will be motivated to finish it now :)
18:19:08 <clarkb> mnaser: ya one followup we can look at doing is removing workflow -1
18:19:53 <mnaser> it seems like i see some 3pci still reporting to cinder, so they're probably 'just fine'
18:20:52 <fungi> 763577 is merged
18:20:58 <mnaser> it looks like you can mark a change as private, which i guess can be useful
18:20:58 <clarkb> yup and gerritbot reported it
18:21:09 <clarkb> mnaser: hrm I think we should actually disable that
18:21:11 <fungi> indeed it did
18:21:17 <mnaser> yeah i remember it was disabled before
18:21:27 <fungi> well, "drafts" were disabled
18:21:28 <clarkb> mnaser: I don't want people assuming "private" is really "private" until we can check it
18:21:34 <clarkb> ya private is a newer thing iirc
18:21:44 <mnaser> i wonder if you can enable it per project too, or for specific users
18:21:44 <fungi> but gerrit removed drafts and replaced them with two features, private changes and work in progress status
18:21:55 <mnaser> would be really nice for embargo'd security changes
18:22:10 <clarkb> "Do not use private changes for making security fixes (see pitfalls below)"
18:22:13 <clarkb> no it won't be :P
18:22:20 <mnaser> aha
18:22:21 <clarkb> this is why I don't want it enabled if we can disable it
18:22:31 <clarkb> drafts was a honeypot and private will likely be too
18:22:34 <mnaser> i'll add it to "what's changed" for now
18:23:01 <clarkb> https://gerrit-review.googlesource.com/Documentation/intro-user.html#private-changes that quote is from there
18:23:37 <clarkb> we can set change.disablePrivateChanges to true
18:23:40 <fungi> yeah, it's an attractive nuisance
18:23:45 <fungi> i agree we should disable it
18:23:45 <clarkb> in gerrit.config
18:23:53 <fungi> i can write that change now
18:23:58 <clarkb> thanks
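The gerrit.config knob in question, as a short sketch (placement within the existing config is assumed):

    [change]
      disablePrivateChanges = true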
18:24:32 <mnaser> i moved my test change back to public in case that causes some issue about disabling it with a private change already there
18:25:03 <clarkb> mnaser: thanks, though I expect it's fine. usually that stuff gets enforced when you push
18:25:11 <clarkb> similar to how we disabled drafts, the old drafts were fine
18:25:16 <clarkb> giteas are about done gc'ing
18:25:32 <fungi> what change topic were we using for upgrade-related changes?
18:26:21 <mnaser> oh that's quite cool -- if you look at a change diff and click on "show blame", it shows the blame and you can click to go to the original change behind it
18:26:33 <fungi> nifty
18:26:36 <clarkb> gitea06 only has 19GB free disk. I'm going to look at that as it's much lower than the others
18:26:57 <clarkb> fungi: I was using gerrit-upgrade-prep
18:27:05 <fungi> thanks
18:27:07 <clarkb> and have a couple of wip changes there that we should land once we're properly settled
18:27:49 <clarkb> fungi: I wonder if we shouldn't manually apply that change and force merge it too
18:28:13 <fungi> it'll need a service restart
18:28:19 <clarkb> ya
18:28:26 <clarkb> probably not is the best time for those?
18:28:28 <clarkb> s/not/now/
18:28:34 <clarkb> since we're telling people its not ready yet
18:30:53 <fungi> change is 763578
18:31:00 <fungi> i'll hand edit the config now and restart gerrit
18:32:05 <fungi> i have the line added in the screen session if you want to double-check
18:32:22 <clarkb> screen looks correct
18:32:30 <mnaser> it looks like a change owner can set an assignee for their change
18:32:34 <clarkb> I'm still trying to sort out gitea06 disk
18:33:13 <mnaser> i'm not too sure what an assignee really .. means
18:33:53 <clarkb> the gitea web container is using 20gb of disk in /var/lib/docker/containers
18:33:58 <fungi> that may be to support workflows where reviewers are auto-assigned
18:34:02 <fungi> mnaser: ^
18:34:06 <clarkb> which should be separate from the bind mounted stuff which is where we expect data to go
18:34:21 <fungi> okay, restarting the service
18:34:29 <corvus> mnaser: i'm also not sure who's supposed to check the "resolved" box for comments.  the author or the reviewer?
18:34:34 <clarkb> I expect that if I restart gitea on 06 that will clean up after itself
18:34:47 <corvus> mnaser: we'll have some more cultural things to figure out
18:34:56 <clarkb> but maybe I should exec into it first and figure out where the disk is used
18:35:10 <mnaser> corvus: yep, as the assignee of a change seems to be a 1:1 mapping too
18:35:29 <mnaser> clarkb: i'd probably see why it ran away with so much disk space in the first place out of my curiosity :)
18:36:33 <clarkb> it is the log file
18:37:06 <clarkb> I'll compress a copy into my homedir then down up the container?
18:37:23 <fungi> wfm
18:37:30 <mnaser> hmmmmmmm
18:37:38 <mnaser> you can change your full display name inside gerrit right now
18:37:46 <clarkb> mnaser: you always could
18:38:02 <mnaser> oh, i thought you could change the formatting
18:38:09 <clarkb> some people would stick their irc nicks in there or put away messages
18:38:10 <fungi> nope, always was allowed
18:38:15 <mnaser> ah got it
18:38:28 <fungi> but now away messages are unnecessary, because...
18:38:37 <fungi> you can set your status!
18:38:42 <mnaser> indeed
18:39:10 <fungi> actually what's changed around the name is that it has a separate "display name" and "full name"
18:39:22 <fungi> you can change them both
18:39:33 <fungi> used to just be a full name
18:40:16 <mnaser> unrelated but
18:40:29 <mnaser> the static link url to the CLA is well, ancient
18:40:30 <mnaser> https://review.opendev.org/static/cla.html
18:40:44 <mnaser> "you agree that OpenStack, LLC may assign the LLC Contribution Agreement along with all its rights and obligations under the LLC Contribution License Agreement to the Project Manager."
18:40:53 <fungi> mnaser: technically still accurate
18:41:03 <mnaser> openstack, llc? :p
18:41:42 <fungi> mnaser: yep
18:41:47 <clarkb> hrm using xz because I don't want a 2GB gzip file
18:41:49 <clarkb> but this is slow
18:41:59 <clarkb> fungi: has gerrit restarted?
18:42:03 <mnaser> well IANAL but if it works, it works
18:42:08 <fungi> mnaser: section #9 contains the previous icla
18:42:17 <fungi> because lawyers
18:42:27 <fungi> it's an icla within an icla
18:42:31 <fungi> clarkb: yes
18:42:45 <fungi> in theory private changes should no longer appear as an option
18:43:26 <clarkb> fungi: cool the change for that lgtm. I +2'd it if you want to force merge it too
18:43:56 <clarkb> once I've got gitea06 in a good spot I think we're ready to start replicating more things
18:44:08 <clarkb> I'll give it say 5 minutes on the xz but if that isn't done switch to gz?
18:44:27 <fungi> mnaser: the short answer is that agreeing to the new license agreement carries a clause saying you agree that contributions previously made under the old agreement can be assumed to be under the new agreement, and part of doing that is specifying a copy of the old agreement
18:45:00 <fungi> i'll merge the private disable change now
18:46:02 <fungi> clarkb: care to add a workflow +1?
18:46:18 <clarkb> done
18:46:20 <fungi> thanks
18:46:40 <fungi> and now it's merged and i've removed my admin account from project bootstrappers again
18:46:47 <clarkb> thanks for taking care of that
18:46:51 <fungi> np
18:47:35 <mnaser> do we have an 'opendev' plugin in ue?
18:48:01 <mnaser> i was researching on how to add the opendev logo and replace 'Gerrit' by 'OpenDev', found out it was possible by writing a style plugin
18:48:10 <mnaser> i've found the one used by chromium -- https://chromium.googlesource.com/infra/gerrit-plugins/chromium-style/+/refs/heads/master
18:48:11 <fungi> nope, though that raises the question whether we'd want an aio plugin for all our stuff or separate single-purpose plugins
18:48:27 <clarkb> what is ue?
18:48:42 <fungi> i assumed he meant "use"
18:48:57 <mnaser> ah yes, in use indeed
18:49:49 <clarkb> mnaser: no what we came to realize was that if we tried to get every single thing like that done before we did the notedb transition in particular we'd just be making it harder and harder as more changes land
18:50:01 <mnaser> clarkb: oh yes, of course, i agree :)
18:50:06 <clarkb> instead it felt prudent to upgrade, then figure out what we need to change as we're able to roll ahead with eg the 3.3 release
18:50:16 <clarkb> that comes out next week, maybe we'll upgrade week after
18:50:25 <mnaser> by the way, funny thing
18:50:36 <mnaser> in that plugin `if (window.location.host.includes("chromium-review")) {`
18:50:39 <mnaser> `} else if (window.location.host.includes("chrome-internal-review")) {`
18:51:00 <mnaser> https://chrome-internal-review.googlesource.com/ i wonder where this little guy goes :)
18:51:47 <fungi> behind a firewall/vpn you can't reach, no doubt
18:52:15 <fungi> it's likely full of googlicious goodness
18:52:48 <clarkb> ok its been more than 5 minutes and xz is still going. I'm going to stop it and see how big a gzip is
18:53:21 <fungi> xz takes a lot more memory/cpu to compress than gzip
18:53:40 <fungi> so not surprising
18:53:50 <fungi> gz will probably still make it nearly as small
18:54:15 <clarkb> fungi: I went with xz to start because compressing journald logs is significantly better with it than gzip
18:54:21 <clarkb> that is why the devstack jobs use xz for that purpose
18:54:28 <clarkb> like an order of magnitude
18:55:02 <fungi> woah really?
18:55:20 <clarkb> ya
18:55:25 <fungi> i rarely see xz get that much of an advantage over gz. maybe 25%
18:55:39 <fungi> order of magnitude is impressive indeed
18:55:39 <clarkb> its like 30MB xz and 200MB gzip iirc
18:56:02 <mnaser> rest of the gerrit looks pretty good to me so far in terms of functionality at this point, i'll come try 'break' things again once zuul is back up :)
18:56:08 <fungi> i guess it's on super repetitive stuff
18:56:19 * mnaser goes for a walk
18:56:20 <mnaser> gl!
18:56:22 <fungi> thanks again mnaser!
18:56:41 <fungi> i'm going to need to break in an hour to light the grill and start cooking dinner
18:57:14 <clarkb> corvus: ^ if you're still around any thoughts on the zuul startup process I have on the etherpad?
18:57:31 <clarkb> fungi: ya lunch here is in about an hour and I barely ate breakfast so should have something too
18:58:57 <clarkb> ok gzip is done. took 18GB down to 1.2GB so it's probably going to give us more than enough space. I'm stopping gitea06 now using the safer process in the playbook
19:00:29 <clarkb> yup 35GB available now which I think is plenty
19:00:56 <clarkb> fungi: corvus I think we are ready to trigger global replication now. Gitea01 has the least free disk at 27GB but our git repo growth was about 15GB so I expect that to be plenty
19:01:12 <clarkb> fungi: ^ do you want to trigger that if you agree we're good?
19:01:18 <fungi> sounds good, i can trigger it as soon as you're ready
19:01:58 <clarkb> I guess I'm as ready as I will be. gitea06 is up now
19:03:04 <fungi> i've done `replication start --all --now`
19:03:16 <clarkb> I see things getting queued up in show-queue
19:05:00 <clarkb> it doesn't seem to load the queue items as quickly as before
19:05:05 <clarkb> the number is still climbing
19:06:02 <clarkb> heh it's stream-events showing the replication-scheduled events for everything
19:08:43 <clarkb> peaked at just over 17k events in the queue
19:08:48 <clarkb> number is falling now (slowly)
19:10:09 <clarkb> I'm going to remove the digest auth option from all our zuul config files as the default is basic
19:10:25 <clarkb> this is required before we start zuul back up again, but I will wait on zuul startup until we've got eyeballs
19:12:31 <clarkb> looks like it may only be necessary on the scheduler? the others have it but no corresponding secret. I'll do the others for completeness
19:15:45 <fungi> sounds right
19:16:09 <clarkb> just under 16k events now, so whatever that comes out to for replicating
19:16:21 <fungi> only the scheduler performs privileged actions on gerrit, the other services just pull refs (at least in our deployment)
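A hedged sketch of the zuul.conf edit being described; the surrounding connection settings are elided and assumed:

    [connection gerrit]
    driver=gerrit
    server=review.opendev.org
    user=zuul
    # auth_type=digest   <- line removed; Zuul's default HTTP auth type is basic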
19:16:34 <corvus> clarkb: looking re zuul
19:17:44 <corvus> clarkb: 6.4.1 and 6.4.2?
19:17:56 <clarkb> corvus: ya
19:17:58 <corvus> clarkb: i think 6.4.2 is done already, right?
19:18:10 <clarkb> yup and 6.4.1 is done as of 30 seconds ago
19:18:26 <clarkb> I guess the question for you is do you think we should start zuul now or wait or do other things first?
19:18:35 <clarkb> zuul can't ssh into bridge to run ansible right now
19:18:50 <clarkb> so we should be able to bring it up, have it run normal ci jobs, be happy with it then work to reenable cd?
19:19:07 <corvus> clarkb: sgtm.  i can't think of a reason to delay
19:19:54 <clarkb> looks like zuul_start.yaml starts the scheduler, then web, then mergers, then executors
19:20:05 <clarkb> do we want to hack up a playbook to not exclude disabled or do it more manually?
19:20:38 <corvus> clarkb: i'd just hack out disabled then run that
19:21:08 <clarkb> ok I think it has to be in the same dir as what we run out of because it includes other roles?
19:21:16 <clarkb> I guess that's fine because nothing is updating system-config on bridge right now
19:22:01 <fungi> are we planning on relying on ansible to undo the commented-out cronjobs or should we manually uncomment them (and when)?
19:22:36 <clarkb> fungi: I was going to rely on ansible
19:22:42 <clarkb> track-upstream isn't super critical
19:23:11 <clarkb> actually lets uncomment them because the gc'ing and the log cleanup is good to have
19:23:23 <clarkb> we can probably do that now?
19:23:41 <clarkb> corvus: fungi: I've got an edited zuul start playbook in the root screen on bridge
19:23:53 <clarkb> that is a vim buffer if you want to take a look at that before we run it
19:23:54 <fungi> okay, i'll uncomment the cronjobs now
19:24:18 <fungi> playbook in bridge root screen lgtm
19:24:19 <clarkb> down to 14.7k replication tasks now
19:24:24 <corvus> clarkb: lgtm
19:24:33 <corvus> clarkb: remember -f 20 :)
19:24:38 <clarkb> corvus: ++
19:24:55 <corvus> or 50 is fine :)
19:25:00 <fungi> heh, 50 it is
19:25:01 <corvus> -f lots
19:25:13 <clarkb> that command was in the scrollback so easy to modify
19:25:13 * fungi fasts fireball
19:25:17 <clarkb> does that command look good to yall?
19:25:19 <fungi> er, casts
19:25:27 <fungi> yeah, looks fine
19:25:30 <corvus> ++
19:25:31 <clarkb> ok running it
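A sketch of that invocation, assuming the edited copy of zuul_start.yaml (with the disabled-host exclusion removed) was saved alongside the original; the copy's name here is hypothetical:

    ansible-playbook -f 50 playbooks/zuul_start_no_disabled.yaml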
19:26:07 <fungi> success!
19:26:21 <clarkb> looks happy
19:26:27 <clarkb> now to see what the running service is like
19:26:43 <corvus> executors are deleting stale dirs
19:27:20 <corvus> 2020-11-21 19:25:55,459 DEBUG zuul.Repo: Updating repository /var/lib/zuul/git/opendev.org/inaugust/src.sh
19:27:25 <fungi> crontabs edited in root screen session on review.o.o if anyone wants to double-check those
19:27:34 <corvus> that is not going as quickly as i would expect
19:28:07 <corvus> i wonder if zuul is going to have to pull a lot of new refs
19:28:21 <corvus> oh okay, things are moving now
19:28:31 <corvus> i think we might have been stuck at branch iteration longer than i expected
19:28:46 <corvus> ie, the delay wasn't git, but rather the rest api querying branches
19:29:15 <corvus> cat jobs are proceeding
19:29:29 <clarkb> this takes about 5-10 minutes typically iirc
19:31:22 <corvus> i'm seeing a number of errors in gertty
19:31:26 <clarkb> I moved my temporary playbook into my homedir to avoid any trouble it may cause with system-config syncing when we get there
19:31:54 <corvus> i have no reason to think they are on the gerrit side; more likely minor api tweaks
19:32:13 <corvus> zuul is running jobs in the openstack tenant
19:32:28 <clarkb> https://review.opendev.org/763599 for that change
19:32:54 <clarkb> down to 13.3k replication tasks
19:32:54 <fungi> corvus: gertty isn't logging any errors for me... did you change your auth from digest to basic?
19:33:34 <corvus> fungi: oh, not yet; that's not the error i'm getting but maybe it's a secondary effect
19:33:36 <corvus> 2020-11-21 19:31:30,509 WARNING zuul.ConfigLoader: Zuul encountered an error while accessing the repo x/ansible-role-
19:33:36 <corvus> bindep.  The error was:
19:33:36 <corvus> invalid literal for int() with base 16: 'l la'
19:33:48 <corvus> zuul logged that error for a handful of repos ^
19:33:51 <clarkb> corvus: I think I saw that scroll by in the zuul scheduler debug
19:34:00 <corvus> yeah
19:34:38 <clarkb> should I be digging into that or are you investigating?
19:34:44 <fungi> corvus: yeah, the error i remember gertty throwing when i had the wrong auth type was opaque to say the least
19:35:01 <corvus> i don't recall seeing that before, therefore i don't know if it could be upgrade related.  but it doesn't seem like it should be -- that's in-repo content over the git protocol, so i don't think anything should be different.  but i dunno.
19:35:03 <fungi> i've put a reminder in the post-upgrade etherpad for gertty users to update their configs
19:35:25 <clarkb> corvus: oh I see this is us talking git not api
19:36:06 <clarkb> three jobs have succeeded, but the other jobs on that change will take a while to run so it will be a while before we see zuul comment back
19:37:03 <corvus> fatal: https://review.opendev.org/x/ansible-role-bindep/info/refs not valid: is this a git repository?
19:37:19 <corvus> that would explain the proximate cause of the zuul error
19:38:53 <clarkb> info/refs/ is there and file level permissions look ok
19:39:18 <clarkb> ansible-role-bindep doesn't show up in the error_log
19:39:44 <corvus> i can clone it over ssh
19:39:51 <corvus> is there a problem with "x/" repos and http?
19:40:44 <clarkb> x/ranger reproduces (just a random one I remembered was in x/)
19:41:00 <clarkb> I wonder if this is a permissions issue perhaps related to the bug that got mitigated?
19:41:25 <corvus> just for 'x/' though?
19:41:58 <clarkb> review-test reproduces fwiw
19:43:50 <clarkb> if you search for changes in those repos you can see them
19:43:54 <clarkb> in the web ui I mean
19:44:25 <corvus> if i curl info/refs for x repos, i get the gerrit web app
19:45:12 <corvus> i'm a little worried there's some kind of routing thing in gerrit that assumes any one-letter path component is not a repo
19:45:19 <clarkb> oh fun
19:45:25 <fungi> yikes
19:45:28 <corvus> no basis for that other than observed behavior
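A sketch of the check corvus describes: a healthy smart-HTTP endpoint answers with a git-upload-pack ref advertisement, while the broken /x/ paths return the PolyGerrit HTML app instead:

    curl -s 'https://review.opendev.org/x/ansible-role-bindep/info/refs?service=git-upload-pack' | head -c 100
    # expected when working: output starting with "001e# service=git-upload-pack"
    # observed when broken: an HTML document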
19:46:05 <corvus> i'm going to start looking at gerrit source code
19:46:10 <clarkb> ok
19:48:07 <clarkb> down to 11.1k replication tasks and things look good on gitea01 disk wise
19:49:04 <clarkb> its x/
19:49:12 <clarkb> corvus: java/com/google/gerrit/httpd/raw/StaticModule.java
19:49:44 <clarkb> it serves something related to polygerrit judging by the path names
19:49:47 <clarkb> s/path/variable/
19:49:55 <corvus> clarkb: thx
19:52:34 <corvus> clarkb: poly gerrit extension plugins?
19:53:00 <clarkb> ya the docs talk about #/x/<plugin-name>/settings
19:53:12 <corvus> and /x/pluginname/*screenname*
19:55:09 <clarkb> do we need to start talking about renaming them?
19:56:06 <clarkb> I did test a rename and if you move the project in gerrit's git dir everything seems to be fine except for project watches config
19:56:17 <clarkb> you can do an online reindex too
19:56:26 <clarkb> or maybe this is somethign to pull luca in on
19:56:47 <corvus> i think a surprise project rename might be disruptive
19:57:11 <clarkb> agreed
19:58:12 <corvus> grepping logs, i'm not seeing any currently legit access for /x/*
19:58:20 <corvus> (other than attempted clones)
19:58:30 <corvus> there are some requests for fonts: /x/fonts/roboto/Roboto-Bold.ttf
19:58:58 <corvus> but i'm not sure those are actually returning fonts (i think they may just return the app)
19:59:55 <clarkb> thinking out loud here. I wonder if we can convince the gerrit http server to check for x/repo first then fallback to x/else
20:00:16 <corvus> clarkb: i think long term if gerrit wants to own x/ we can't have it
20:00:31 <clarkb> ya agreed, I figure something like that would let us schedule a rename rather than doing it today
20:00:49 <corvus> but short term, i'm wondering if, since it doesn't seem like our gerrit is using x/ right now, we can rebuild it without that exclusion then work on a rename plan
20:01:27 <fungi> i'm around to review a gerrit patch, though getting started grilling
20:01:43 <corvus> (if we're right about x/ being used for plugins, then it'll become an issue as we add polygerrit plugins)
20:01:58 <fungi> i assume we'll want to start a thread on repo-discuss noting that polygerrit has made some repository names impossible. that seems like a bug they would be interested in fixing
20:02:20 <corvus> fungi: i assume they'll fix it with a doc change saying 'don't use these'
20:02:28 <clarkb> corvus: ya maybe we can add a sed to the jobs to comment that out on the 3.2 branch, which will rebuild the image, then pull that and use it?
20:02:28 <corvus> just like /p/ and /c/ are unavailable
20:02:47 <corvus> clarkb: sounds good
20:03:03 <clarkb> corvus: do you want to write that change or should I/
20:03:12 <fungi> if you can't use repositories whose names start with c/ or p/ or x/ but gerrit doesn't prevent you from creating them, that sounds like a bug
20:03:17 <corvus> clarkb: you if you're available
20:03:17 <clarkb> also I think we should trim down the images so it's just 3.2 on that change
20:03:21 <clarkb> ok working on that now
20:03:25 <fungi> for not properly separating api paths from git project paths
20:03:40 <corvus> fungi: perhaps gerrit does prevent creation; we should check that
20:04:35 <corvus> i imagine we should just no longer allow single-char in the initial path component of project names to be safe for the future
20:05:38 <clarkb> ++
20:05:54 <fungi> or is there a more correct path prefix we should switch to using to access git repositories?
20:06:24 <clarkb> I always have to spend 10 minutes figuring out how we build the gerrit wars in these jobs
20:06:33 <clarkb> fungi: the download urls are rooted at /
20:06:36 <clarkb> I checked that as I wondered too
20:06:51 <fungi> and arent' configurable?
20:07:28 <fungi> because that seems like it would be a relatively minor fix... deprecate the / routing for project names and add a new prefix
20:08:02 <fungi> and instruct users to migrate to the new prefix and then eventually remove the download routing at / in a later release
20:14:20 <corvus> clarkb: i came to the same conclusion
20:14:41 <corvus> i mean, /p/ *used* to work :/
20:15:36 <clarkb> remote:   https://review.opendev.org/c/opendev/system-config/+/763600 Handle x/ prefix projects on gerrit 3.2
20:15:56 <clarkb> I figure we can pull that image onto review-test and test out there first, then if that looks ok do it to prod
20:16:11 <clarkb> and I'll update my change so that we can land it
20:16:16 <corvus> clarkb: ++
20:16:27 <corvus> clarkb: what needs to be updated?
20:16:51 <clarkb> corvus: stuff around which jobs to run I think
20:17:16 <clarkb> corvus: I removed 2.13 - 3.1 since they aren't necessary to get that image
20:17:47 <ianw> o/ ... well done everyone!
20:17:52 <corvus> clarkb: can't we land that?
20:17:52 <fungi> since luca reached out when he saw our upgrade was in progress and suggested we should let him know if we hit any snags, is this something we should give him a heads up about?
20:17:52 <clarkb> if we add them back in I need to make the sed branch specific. If we don't add them back in then I need to squash it into fungi's use regular stable branches change I think
20:18:02 <corvus> fungi: yes
20:18:16 <clarkb> corvus: yes I think I need to update the system-config-run dependency maybe?
20:18:16 <corvus> i think we should send an email saying we found this issue and our proposed solution and see if he thinks it's ok
20:18:29 <clarkb> corvus: I'm sorry these jobs always confuse me
20:18:42 <clarkb> I'm basically just saying that we need to review the job updates carefully if we land this
20:18:45 <fungi> i've got the grill starting so i'm happy to throw a quick e-mail out there pointing to our workaround and asking for suggestions
20:18:52 <clarkb> fungi: go for it
20:22:15 <clarkb> 5.8k on the replication
20:24:25 <clarkb> gitea01 is down to 18GB free. Should have plenty for the remaining replication
20:25:01 <clarkb> I'm going to find some food while I wait for zuul to build that image
20:31:02 <fungi> reply sent to luca, seems like my patio is experiencing unnecessary levels of packet loss so i'm less responsive than i might otherwise be at the moment
20:32:48 <clarkb> my ansible is bad
20:32:51 <clarkb> fixing
20:32:55 <ianw> the new UI is so much faster, very pleasant for us high latency users
20:34:47 <clarkb> new ps has been pushed
20:39:57 <clarkb> infra-root. I added myself to project bootstrappers and admins on review-test. Then went to /plugins/ which returns a json doc of plugins
20:40:10 <clarkb> the index_url for each plugin we have is listed there and they all start with plugins/ not x/
20:40:19 <clarkb> (just another data point towards the safety of this change)
20:41:29 <clarkb> I think to get that document you have to be in the admins group
20:41:36 <clarkb> could probably get it via the rest api instead too
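Roughly that same check via the REST API, as a hedged sketch (curl -n reads admin credentials from ~/.netrc; sed strips the )]}' prefix Gerrit prepends to JSON responses):

    curl -n 'https://review-test.opendev.org/a/plugins/?all' | sed 1d | python3 -m json.tool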
20:41:37 <corvus> clarkb: yeah, if i'm following correctly x/ might be used by polygerrit plugins to serve certain resources
20:43:24 <clarkb> corvus: hrm are any of the plugins we have polygerrit plugins? I assume that some are like the codemirror-editor and download-commands?
20:45:16 <corvus> clarkb: no idea
20:46:23 <corvus> clarkb: ansible parse error again
20:47:53 <clarkb> k, can someone look at it really quickly? I feel like my brain isn't working
20:47:57 <corvus> clarkb: will do
20:48:14 <clarkb> poking at the codemirror editor on review-test with ff dev tools, it looks like it self-hosts its static content
20:49:55 <clarkb> I think I see it: shell needs to be a list
20:50:00 <corvus> yes i'm on it
20:50:09 <clarkb> k
20:50:11 <corvus> -          "/x/*",
20:50:11 <corvus> +          //"/x/*",
20:50:17 <corvus> clarkb: that's the intended change, yeah?
20:50:20 <clarkb> corvus: yes
20:50:30 <corvus> i'm validating it makes it all the way through ansible unscathed
20:50:33 <clarkb> it comments out that line with the /x/* in it
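A hedged sketch of that build tweak: comment out the /x/* handler in StaticModule.java on the 3.2 branch before building the image (the sed in the actual change may differ):

    sed -i 's|"/x/\*",|// "/x/*",|' java/com/google/gerrit/httpd/raw/StaticModule.java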
20:51:18 <corvus> clarkb: pushed
20:51:38 <corvus> i figured i'd double check the whole thing to save us any more round trips
20:51:38 <clarkb> thanks
20:51:42 <clarkb> ++
20:56:15 <clarkb> ~600 replication tasks now
20:58:27 <fungi> once this is built, pulled and restarted, do we need to restart the executors and mergers as well?
20:59:32 <clarkb> its running the bazelisk build now
21:00:16 <clarkb> fungi: you want to respond to luca?
21:00:19 <clarkb> and file the bug?
21:00:41 <clarkb> replication is done. I'm going to do another round of gc'ing on the giteas
21:01:09 <fungi> oh, cool, he already replied. yeah i can do that immediately after dinner
21:01:15 <clarkb> fungi: specifically I think the bit that was missing in the email was that its cloning repos
21:02:09 <fungi> yes
21:03:40 <clarkb> giteas are gc'ing now
21:13:42 <corvus> build finished
21:13:49 <corvus> docker://insecure-ci-registry.opendev.org:5000/opendevorg/gerrit:f76ab6a8900f40718c6cd8a57596e3fc_3.2
21:14:12 <clarkb> cool I'll get that on review-test momentarily
21:14:26 <corvus> i'm also running it locally for fun
21:14:55 <corvus> or will, when it downloads, in a few minutes
21:15:57 <clarkb> note review-test's LE cert expired a few days ago and we decided to leave it be
21:16:54 <clarkb> cloning x/ranger from review-test works now
21:17:22 <corvus> \o/
21:18:07 <clarkb> https://review-test.opendev.org/x/fonts/fonts/robotomono/RobotoMono-Regular.ttf is a 404
21:19:06 <corvus> clarkb: but it's also not a real thing on prod
21:19:12 <clarkb> ya I guess not
21:19:17 <clarkb> I just wanted to see what it does there
21:20:08 <corvus> clarkb: want me to update your patch with the system-config-run change?
21:20:27 <clarkb> corvus: that would be swell
21:20:38 <clarkb> then I think it should be landable?
21:22:19 <corvus> clarkb: actually... maybe we should make this 2 changes
21:22:32 <clarkb> corvus: I'm good with that too
21:22:37 <clarkb> just the sed then a cleanup?
21:22:53 <corvus> yep
21:22:58 <clarkb> wfm
21:23:03 <corvus> i'll take care of that
21:23:15 <clarkb> corvus: remember you need to check the branch if you do that
21:23:15 <corvus> clarkb: meanwhile, we have a built image -- want to go ahead and run it on prod?
21:23:20 <clarkb> or have 3.2 use a different playbook
21:23:30 <corvus> clarkb: how about we invert the order?
21:23:36 <clarkb> corvus: that also works
21:23:38 <corvus> remove old stuff, then the x/ change
21:23:41 <clarkb> ++
21:23:42 <corvus> will be easy to revert
21:23:53 <clarkb> for prod any concern that this may break something else? or are we willing to find out the hard way :)
21:24:12 <corvus> clarkb: i think we've done the testing we can
21:24:15 <clarkb> ok
21:24:20 <clarkb> I'll do this in the screen fwiw
21:24:30 <corvus> i'm not worried about it breaking anything in a way we can't roll back
21:26:46 <clarkb> gerrit is starting back up again on prod
21:28:05 <clarkb> hrm the change screen isn't loading for me though I thought I tested that on review-test too
21:28:08 <clarkb> oh there it goes
21:28:10 <clarkb> I just need patience
21:28:59 <clarkb> I can clone ranger from prod via https now too
21:30:05 <corvus> remote:   https://review.opendev.org/c/opendev/system-config/+/763616 Remove container image builds for old gerrit versions [NEW]
21:30:06 <corvus> remote:   https://review.opendev.org/c/opendev/system-config/+/763600 Handle x/ prefix projects on gerrit 3.2
21:30:12 <corvus> clarkb: i think we should do a full-reconfigure in zuul
21:30:14 <corvus> i'll do that
21:30:15 <clarkb> oh I should go restart gerritbot now that I restarted gerrit
21:30:17 <clarkb> corvus: ++
21:30:56 <clarkb> gerritbot has been restarted
21:31:17 <corvus> i have more work to do on those image build changes; on it
21:34:19 <clarkb> btw zuul commented a -1 on https://review.opendev.org/c/openstack/os-brick/+/763599/ which was the first change that started running zuul jobs. That aspect of things looks good
21:34:37 <corvus> clarkb, fungi, ianw: remote:   https://review.opendev.org/c/openstack/project-config/+/763617 Remove old gerrit image jobs from jeepyb [NEW]
21:35:36 <clarkb> +2
21:36:08 <corvus> cat jobs are running
21:37:37 <clarkb> corvus: one small thing on https://review.opendev.org/c/opendev/system-config/+/763616
21:38:28 <clarkb> I'm happy to fix the issue on ^ if you want to roll forward instead
21:38:33 <clarkb> er I mean fix it in a follow on
21:38:38 <corvus> clarkb: i'll respin
21:38:41 <clarkb> ok
21:40:12 <corvus> clarkb: respin done
21:40:34 <corvus> 2020-11-21 21:37:15,977 INFO zuul.Scheduler: Full reconfiguration complete (duration: 379.767 seconds)
21:40:48 <clarkb> and no more of those errors?
21:41:06 <fungi> was review.o.o restarted with the fix? i guess so, my tests to reproduce the error don't fail
21:41:13 <fungi> what was the error message on attempting to clone?
21:41:23 <fungi> sorry, just now catching up since dinner's done
21:41:37 <corvus> fungi: heh, lemme see if i have a terminal open with the error :)
21:41:43 <fungi> back to nominal levels of packet loss again and can test things suitably
21:41:57 <fungi> thanks!
21:42:17 <fungi> working up the reply to luca now
21:42:22 <clarkb> another thing I notice is that gitweb doesn't work but gitiles seems to
21:42:30 <clarkb> I think we should just stop using gitweb maybe and have it use gitiles
21:42:36 <clarkb> that isn't super urgent though
21:42:43 <clarkb> then we can add in gitea when we sort that out
21:42:49 <corvus> fungi: i don't, sorry :(
21:43:05 <corvus> 19:37 < corvus> fatal: https://review.opendev.org/x/ansible-role-bindep/info/refs not valid: is this a git repository?
21:43:11 <corvus> fungi: but i pasted that ^
21:43:16 <corvus> that was about it
21:44:00 <corvus> clarkb: confirmed, no new 'invalid literal' errors from zuul
21:45:17 <clarkb> +2 from me on corvus' image stack
21:46:00 <corvus> +2 from me on clarkb's image stack
21:46:26 <clarkb> zuul still can't ssh into bridge (I think that is a good thing), once we've got these issues settled I figured we would use https://review.opendev.org/c/opendev/system-config/+/757161 this change as the canary for that?
21:46:49 <clarkb> my family has pointed out to me that I have yet to shower today though, so now might be time for me to take a break.
21:46:57 <clarkb> is there anything else you'd like me to do before I pop out for a bit?
21:48:02 <fungi> nope, go become less offensive to your family ;)
21:48:05 <corvus> i think now's a good break time
21:48:11 <clarkb> fungi: maybe you can include a diff for luca as well: http://paste.openstack.org/show/qz6zQ6a3jkRVluxebh8l/
21:48:13 <corvus> fungi: can you +3 https://review.opendev.org/763617 ?
21:48:14 <fungi> corvus: thanks! i'll try to work with that
21:48:26 <fungi> yeah, will review
21:49:03 <fungi> and approved
21:49:18 <clarkb> giteas are still gc'ing but free disk space is going up so we should be more than good there
21:49:20 <clarkb> and now break time
21:49:58 <clarkb> I've also removed my normal user from privileged groups on review-test
21:50:05 <clarkb> as I am done testing there for now
21:55:07 <fungi> i've re-replied to luca, will start putting the bug report together shortly
21:55:34 <fungi> any other urgent upgrade-related tasks need my attention first?
21:56:11 <corvus> fungi: i don't think so.  i'm about to +w the remaining image stack
21:57:22 <corvus> err, there's another error
21:59:42 <corvus> clarkb, fungi: can you +3 https://review.opendev.org/763616 ?
22:00:18 <corvus> missed an update for the infra-prod jobs to trigger on 3.2 builds
22:01:17 <fungi> yup, taking a look now
22:02:17 <corvus> current status: we need to merge https://review.opendev.org/763616 and https://review.opendev.org/763600 then the repos will match the image we're running in production.  then we can proceed with enabling cd.  aside from that, i think there's no known issues in prod and we're just waiting for replication to finish.
22:02:26 <fungi> i've approved 763616 now
22:02:45 <corvus> cool, then i'm going to afk for another errand
22:03:22 <corvus> infra-root: just a highlight ping for what i think is the current status (a couple lines up ^) as i think we're all on break while waiting for tasks to complete
22:04:08 <fungi> awesome, thanks again!
22:44:40 <fungi> https://bugs.chromium.org/p/gerrit/issues/detail?id=13721
22:45:03 <fungi> if anyone feels inclined, please clarify mistakes or omissions therein
22:46:06 <clarkb> I'll take a look in a few.
22:47:20 <clarkb> gitea01 has finished gc'ing and has 22gb free which should be plenty for now
22:48:19 <clarkb> the others all have more free disk too
22:48:22 <clarkb> and are done as well
22:48:28 <clarkb> I think that means all the replication related activities are done
22:49:35 <clarkb> fungi: the bug looks good to me
22:50:42 <clarkb> I'm going to start drafting a "its up, this is what we've discovered, this is where we go from here" type email in etherpad
22:52:04 <fungi> thanks! don't forget to incorporate notes from https://etherpad.opendev.org/p/gerrit-3.2-post-upgrade-notes as appropriate
22:52:32 <clarkb> ya was going to link to that I think
22:55:07 <clarkb> https://etherpad.opendev.org/p/rNXB-vJe8IUeFnOKFVs8 is what I'm drafting
22:57:41 <ianw> fungi: no idea if it helps but i think x/ was introduced @ https://gerrit.googlesource.com/gerrit/+/153d46c367965cd7782a3ac86212c07b298eaca8
22:58:33 <ianw> actually no, more to dig
22:59:19 <clarkb> the file was moved at some point which makes it difficult to go back in time with
22:59:26 <clarkb> I ended up doing a git log -p and grepping for it and giving up
23:00:07 <ianw> https://gerrit.googlesource.com/gerrit/+/7cadbc0c0c64b47204cf0de293b7c68814774652
23:00:31 <ianw> +    serve("/x/*").with(PolyGerritUiIndexServlet.class);
23:00:44 <ianw> that is really the first instance.  i wonder if it's not really necessary and just been pulled along since
23:01:42 <clarkb> ianw: the docs hint at it but could still be dead code
23:02:44 <ianw> .. at least it's in a "add x/ this is a really important path never remove" type change i guess :)
23:04:05 <clarkb> not in or in?
23:04:46 <clarkb> https://etherpad.opendev.org/p/rNXB-vJe8IUeFnOKFVs8 ok I think thats largely put together at this point
23:13:48 <ianw> clarkb: minor suggestion on maybe something that explains the x/ thing at a high level but enough for people to understand
23:14:36 <clarkb> ianw: something like that?
23:15:09 <ianw> yeah, i think so; feel like it explains how both want to "own" the /x endpoint
23:15:24 <ianw> namespace, whatever :)
23:19:44 <clarkb> oh shoot, I think there is a minor but not super important issue with https://review.opendev.org/763600 it doesn't update the dockerfile so we won't promote the image
23:19:59 <clarkb> corvus: ^ maybe thats something we can figure out manually or just push up another change that does a noop dockerfile edit?
23:20:34 <clarkb> double check me on all that first though
23:22:57 <clarkb> also I'm starting to feel the exhaustion roll in. If others want to drive things and get cd rolling again I'll do my best to help, otherwise, tomorrow morning might be good
23:23:50 <clarkb> ya I think the promote jobs for the 3.2 docker image tagging didn't run
23:23:57 <clarkb> I'll push up a noop job now to get that rolling
23:26:03 <clarkb> remote:   https://review.opendev.org/c/opendev/system-config/+/763618 Noop change to promote docker image build
23:26:49 <ianw> i've got to run out, but i can get to the CD stuff early my tomorrow?  i don't think we need it before then?
23:27:23 <clarkb> ya I don't think its super urgent unless others really want their sunday back. I'm just wiped out
23:27:39 <clarkb> fungi: corvus ^ fyi. Also any thoughts on that email? should I send that nowish?
23:28:18 <clarkb> infra-root Note that https://review.opendev.org/c/opendev/system-config/+/763618 or something like it should land before we start doing cd again
23:29:23 <ianw> ok, that's the new image with the x/ fix right?
23:29:32 <clarkb> yes
23:29:41 <ianw> i.e. we don't want to CD deploy the old image
23:29:49 <clarkb> we actually just built it when corvus' changes landed but because we didn't modify files that the promote jobs match we didn't promote it
23:29:58 <clarkb> we could also do an out of band promote via docker directly if we want
23:30:11 <clarkb> 763618 should also take care of it since the dockerfile is modified
23:30:53 <ianw> ok, i have to head out but will check back later
23:30:59 <clarkb> ianw: o/
23:37:45 <fungi> clarkb: sorry, stepped away for a bit, reading draft e-mail now
23:41:01 <fungi> made a couple of minor edits but lgtm in general
23:42:13 <clarkb> cool I'll wait a bit to see if corvus is able to take a look then send that out
23:42:30 <clarkb> fungi: and maybe a corresponding #status notice
23:42:50 <clarkb> I'm taking a break now though. The tired hit me hard in the last little bit
23:43:21 <fungi> yup, a status notice at the same time that e-mail gets sent would make sense
23:46:19 <clarkb> fungi: did you see 763618 too?
23:47:43 <fungi> likely not if you're asking
23:48:16 <fungi> approvidado
23:48:22 <corvus> reading scrollback
23:50:24 <corvus> clarkb: email lgtm
23:50:34 <clarkb> cool I'll send that out momentarily
23:55:04 <clarkb> how about this for the notice #status notice Gerrit is up and running again on version 3.2. Zuul is talking to it and running jobs. You can push and review changes. However, we are still working through things and there may be additional service restarts during out upgrad window which ends at 01:00 November 23.
23:55:25 <corvus> clarkb: s/out upgrad/our upgrade/
23:55:32 <clarkb> I can also add "See http://lists.opendev.org/pipermail/service-announce/2020-November/000013.html for more details"
23:55:54 <clarkb> how about this for the notice #status notice Gerrit is up and running again on version 3.2. Zuul is talking to it and running jobs. You can push and review changes. However, we are still working through things and there may be additional service restarts during our upgrade window which ends at 01:00 November 23. See http://lists.opendev.org/pipermail/service-announce/2020-November/000013.html for
23:55:56 <clarkb> more details
23:56:23 <clarkb> is that just short enough if I drop my prefix?
23:57:24 <fungi> maybe squeeze it down a bit so it fits in a single notice
23:57:33 <fungi> i think statusbot will truncate it otherwise
23:57:43 <clarkb> like #status notice Gerrit is up and running again on version 3.2. Zuul is talking to it and running jobs. You can push and review changes. However, we are still working through things and there may be additional service restarts during our upgrade window ending 01:00UTC November 23. See http://lists.opendev.org/pipermail/service-announce/2020-November/000013.html for more details
23:57:56 <fungi> or, rather, statusbot doesn't know to so the irc server ends up discarding the rest
23:58:40 <fungi> looks good. hopefully that's short enough
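As a rough rule of thumb, an IRC line is capped at 512 bytes including the command, target, and trailing CRLF, and the sender prefix the server adds when relaying eats into that further, so a quick check along these lines can tell whether a notice will survive intact; the channel name and the prefix allowance below are illustrative assumptions:

    # Rough sanity check against the 512-byte IRC line limit. The channel
    # name and the relay-prefix allowance are illustrative assumptions.
    notice = "Gerrit is up and running again on version 3.2. ..."
    channel = "#openstack-infra"
    overhead = len("NOTICE {} :\r\n".format(channel)) + 60  # ~60 bytes assumed for the server-added sender prefix
    print(len(notice.encode("utf-8")) + overhead, "of 512 bytes")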
23:58:55 <clarkb> I can trim it a bit more but I'll just go ahead and send it with that trimming
23:59:23 <clarkb> #status notice Gerrit is up and running again on version 3.2. Zuul is talking to it and running jobs. You can push and review changes. We are still working through things and there may be additional service restarts during our upgrade window ending 01:00UTC November 23. http://lists.opendev.org/pipermail/service-announce/2020-November/000013.html for more details
23:59:23 <openstackstatus> clarkb: sending notice
00:00:21 <fungi> looks like it fit!
00:00:46 <clarkb> I think its still channel dependent because the channel name goes at the beginning of the message but ya seems like for the channels I'm in it is good
00:02:03 <clarkb> once 763618 lands and promotes the image I think we're in a good spot to turn on cd again, but more and more I'm feeling like that is a tomorrow morning thing
00:02:07 <clarkb> will others be around for that?
00:02:14 <clarkb> sounded like ianw would be around AU morning
00:07:07 <fungi> i will be around circa 13-14z
00:08:05 <corvus> i probably won't be around tomorrow
00:08:56 <clarkb> my concern with doing it today is if we turn it on and don't notice problems because they happen at 0600 utc or whatever
00:09:02 <clarkb> so probably tomorrow is best?
00:09:22 <corvus> yeah i agree
00:09:38 <fungi> i don't expect to have major additional issues crop up which we'll be unable to deal with on the spot
01:05:10 <clarkb> ok I think the promote has happened
01:05:44 <clarkb> https://review.opendev.org/c/opendev/system-config/+/763618/ build succeeded deploy pipeline
01:40:05 <fungi> excellent
01:40:14 <fungi> i'm starting to fade though
02:20:07 <mordred> Congrats on the Gerrit upgrade!!!
02:22:45 <mordred> The post upgrade etherpad doesn't look horrible
02:26:24 <clarkb> the x/ conflict is probably the big thing
02:32:18 <fungi> thanks mordred!
02:56:35 <mordred> Btw ... If anybody needs an unwind Netflix ... We Are The Champions is amazing. We watched the hair episode last night. I promise you've never seen anything like it
14:23:06 <fungi> no signs of trouble this morning, i'm around whenever folks are ready to try reenabling ansible
15:02:25 <clarkb> fungi: I think our first step is to confirm our newly promoted 3.2 tagged image is happy on review-test, then roll that out to prod
15:03:15 <clarkb> then in my sleep I figured a staged rollout of cd would probably be good: put the ssh keys back but keep review and zuul in the emergency file and see what jobs do, then remove zuul from emergency file and see what jobs do, then remove review and land the html cleanup change I wrote and see how that does?
15:03:25 <clarkb> I think for the first two steps the periodic jobs should give us decent coverage
15:04:56 <clarkb> mordred: I've watched the first two episodes. the cheese wheel racing is amazing
15:07:23 <fungi> the image we're currently running from is hand-built? or fetched from check pipeline?
15:07:27 <clarkb> david ostrovsky has congratulated us on the gerrit mailing list. Also asks if we have any feedback. I guess following up there with the url paths thing might be good as well as questions about whether you can make the notedb step go faster by somehow manually gc'ing then manually reindexing
15:07:58 <clarkb> fungi: review-test should be running the check pipeline image
15:08:06 <clarkb> the docker compose file should reflect that
15:08:19 <fungi> but we've got a fix of some sort in place in production right?
15:08:19 <clarkb> fungi: and prod is in the same boat iirc
15:09:05 <fungi> ahh, okay, yep looks like it's also running the image built in the check pipeline then
15:09:10 <clarkb> fungi: the fix is that https://review.opendev.org/c/opendev/system-config/+/763618 and promoted our workaround as the 3.2 tag in docker hub. Which means we can switch back to using the opendevorg/gerrit:3.2 image on both servers
15:09:24 <clarkb> I think we should do review-test first and just quickly double check that git clone still works, then do the same with prod
15:09:25 <fungi> right, i'll get it swapped out on review-test now
15:10:55 <fungi> opendevorg/gerrit                                         3.2                                    3391de1cd0b2        15 hours ago        681MB
15:11:07 <fungi> that's what review-test is in the process of starting on now
15:18:11 <clarkb> fungi: I can clone ranger from review-test
15:18:16 <clarkb> via https
15:23:20 <fungi> yup, same. perfect
15:23:38 <fungi> shall i similarly edit the docker-compose.yaml on review.o.o in that case?
15:23:50 <clarkb> yes I think we should go ahead and get as many of these restarts in on prod during our window as we can
15:24:19 <fungi> edits made, see screen window
15:24:32 <fungi> do i need to down before i pull, or can i pull first?
15:24:50 <clarkb> you can pull first
15:25:00 <clarkb> sorry I'm not on the screen yet, but I think it will be fine since you just did it on -test
15:25:23 <fungi> opendevorg/gerrit                                         3.2                                    3391de1cd0b2        15 hours ago        681MB
15:25:28 <fungi> that's what's pulled
15:25:32 <clarkb> and it matches -test
15:25:35 <fungi> shall i down and up -d?
15:25:42 <clarkb> ++
15:25:53 <fungi> done
15:29:50 <clarkb> one thing that occurred to me is we should double check our container shutdown process is still valid. I figured an easy way to do that was to grab the deb packages they publish and read the init script but I can find where the actual packages are
15:30:25 <fungi> `git clone https://review.opendev.org/x/ranger` is still working for me
15:30:46 <clarkb> *I can't find where
15:31:33 <fungi> which package? docker? docker-compose?
15:31:37 <clarkb> nevermind, found them. deb.gerritforge.com is only older stuff, bionic.gerritforge.com has newer things
15:31:47 <clarkb> fungi: the "native packages" that luca publishes http://bionic.gerritforge.com/dists/gerrit/contrib/binary-amd64/gerrit-3.2.5.1-1.noarch.deb
15:32:02 <clarkb> since I assume that will have systemd unit or init file that we can see how stop is done
15:32:31 <clarkb> our current stop is based on the 2.13 provided init script. actually I wonder if 3.2 provides one too
15:33:32 <clarkb> ah yup it does
15:33:45 <clarkb> resources/com/google/gerrit/pgm/init/gerrit.sh and that still shows sig hup so I think we're good
15:33:54 <fungi> oh, got it
15:34:03 <fungi> thought you were talking about docker tooling packages
15:34:45 <clarkb> no just more generally. Our docker-compose config should send a sighup to stop gerrit's container
15:34:50 <clarkb> which it looks like is correct
15:35:00 <clarkb> *is still correct
15:58:34 <clarkb> remote:   https://review.opendev.org/c/opendev/system-config/+/763656 Update gerrit docker image to java 11
15:58:39 <clarkb> I think that is a later thing so will mark it WIP
15:58:44 <clarkb> also gerritbot didn't report that :/
15:58:49 <clarkb> oh right we just restarted :)
15:59:02 <clarkb> I'm restarting gerritbot now
16:00:20 <clarkb> also git review gives a nice error message when you try to push to the new gerrit with a too-old git-review
16:00:25 <fungi> seems like we need to restart gerritbot any time we restart gerrit these days
16:03:30 <clarkb> ok I won't say I feel ready, but I'm probably as ready as I will be :) what do you think of my staged plan to get zuul cd happening again?
16:06:20 <fungi> it seems sound, i'm up for it
16:06:58 <clarkb> opendev-prod-hourly jobs are the ones that we'd expect to run and those run at the top of the hour. So if we move authorized_keys back in place then we should be able to monitor at 17:00UTC?
16:07:24 <clarkb> then if we're happy with the results of that we remove zuul from emergency and wait for the hourly prod jobs at 18:00UTC
16:07:29 <clarkb> (zuul is in that list)
16:10:08 <clarkb> fungi: I put a commented out mv command in the bridge screen to put keys back in place, can you check it?
16:10:40 <fungi> yep, that looks adequate
16:10:58 <clarkb> ok I guess we wait for 17:00 then?
16:14:00 <fungi> was ansible globally disabled, and have we taken things back out of the emergency disable list?
16:14:47 <fungi> looks like /home/zuul/DISABLE-ANSIBLE does not exist on bridge at least
16:14:48 <clarkb> ansible was not globally disabled with the DISABLE-ANSIBLE file and the hosts are all still in the emergency disable list
16:14:58 <clarkb> we used the more forceful "you cannot ssh at all" disable method
16:15:34 <fungi> cool, so in theory the 1700z deploy will skip the stuff we still have disabled in the emergency list
16:15:56 <clarkb> yup, then after that if we're happy with the results we take the zuul hosts out of emergency and let the next hourly pulse run on them
16:16:13 <clarkb> then if we're happy with that we remove review and then land my html cleanup change
16:16:42 <clarkb> review isn't part of the hourly jobs so we need something else to trigger a job on it (it is on the daily periodic jobs though so we should ensure we run jobs against it before ~0600 or put it back in the emergency file)
16:19:21 <clarkb> fungi: one upside to doing the ssh disable is that the jobs fail quicker in zuul
16:19:31 <clarkb> which we wanted because we knew that things would be off for a long period of time
16:19:43 <clarkb> when you write the disable ansible file the jobs will poll it and see if it goes away before their timeout
16:20:02 <clarkb> during typical operation ^ is better because it's a short window where you want to pause rather than a full stop
16:21:50 <clarkb> https://etherpad.opendev.org/p/E3ixAAviIQ1F_1-gzuq_ is the gerrit mailing list email from david. I figure we should respond. fungi not sure if you're subscribed? but seems like we should write up an email and bring up the x/ conflict?
16:22:48 <fungi> i'm not subscribed, but happy if someone mentions that bug to get some additional visibility/input
16:32:26 <clarkb> fungi: I drafted a response in that etherpad, have a moment to take a look?
16:33:16 <fungi> yep, just a sec
16:35:27 <clarkb> I think they were impressed we were able to incorporate a jgit fix from yesterday too :)
16:35:31 <clarkb> something something zuul
16:37:03 <fungi> yep, reply lgtm, thanks!
16:38:53 <fungi> i've got something to which i must attend briefly, but will be back to check the hourly deploy run
16:41:11 <clarkb> response sent
17:01:02 <clarkb> infra-prod-install-ansible is running
17:01:28 <clarkb> as well as publish-irc-meetings (that one maybe didn't rely on bridge though?)
17:03:24 <clarkb> infra-prod-install-ansible reports success
17:03:30 <clarkb> now it is running service -bridge
17:06:49 <clarkb> service-bridge claims success now too
17:07:52 <clarkb> cloud-launcher is running now
17:12:46 <clarkb> fungi: are you back?
17:12:49 <fungi> yup
17:12:55 <clarkb> I've checked that system-config updated in zuul's homedir
17:12:57 <fungi> looking good so far
17:13:05 <clarkb> but now am trying to figure out where the hell project-config is synced to/from
17:13:19 <clarkb> /opt/project-config is what system-config ansible vars seem to show but that seems old as dirt on bridge
17:13:37 <clarkb> that makes me think that it isn't actually where we sync from, but I'm having a really hard time understanding it
17:13:39 <fungi> i thought it put one in ~zuul
17:14:16 <clarkb> ok that one looks more up to date
17:14:24 <clarkb> but I still can't tell from our config management what is used
17:14:24 <fungi> from friday, yah
17:14:39 <clarkb> (also its a project creation from friday... maybe we should've stopped those for a bit)
17:15:09 <fungi> well, it won't run manage-projects yet
17:15:20 <fungi> because of review.o.o still being in emergency
17:15:40 <clarkb> ya
17:15:45 <fungi> but yeah once we reenable that, we should check the outcome of manage-projects runs
17:15:56 <clarkb> I think I figured it out
17:16:02 <clarkb> /opt/project-config is the remote path but not the bridge path
17:16:12 <clarkb> the bridge path is /home/zuul/src/../project-config
17:17:56 <clarkb> fungi: looking at timestamps there is a good chance that project is already created /me checks
17:18:20 <clarkb> https://opendev.org/openstack/charm-magpie
17:18:44 <clarkb> and they are in gerrit too, ok one less thing to worry about until we're happy with the state of the world
17:19:05 <clarkb> fungi: nodepool's job is next and I think that one may be expected to fail due to the issues on the builders that ianw was debugging. Not sure if they have been fixed yet
17:19:13 <clarkb> just a heads up that a failure there is probably sane
17:20:46 <clarkb> I suspect that our hourly jobs take longer than an hour to complete
17:22:56 <clarkb> huh cloud launcher failed, I wonder if it is trying to talk to a cloud that isn't available anymore (that is usually why it fails)
17:23:21 <clarkb> fungi: it just occurred to me that the jeepyb scripts that talk to the db likely won't fail until we remove the db config from the gerrit config
17:23:41 <clarkb> fungi: and there is potential there for welcome message to spam new users created on 3.2 because it won't see them in the 2.16 db
17:24:09 <clarkb> I don't think that is urgent (we can apologise a lot) but it's in the stack of changes to do that cleanup anyway. Then those scripts should start failing on the ini file lookups
17:26:16 <fungi> mmm, maybe if they have only one recorded change in the old db, yes
17:26:56 <fungi> i think it would need them to exist in the old db but have only one change present
17:27:03 <fungi> i need to look back at the logic there
17:27:46 <clarkb> also we can just edit playbooks/roles/gerrit/files/hooks/patchset-created to drop welcome-message?
17:27:53 <fungi> easily
17:28:00 <clarkb> the other one was the blueprint update?
17:28:05 <fungi> bug update
17:29:19 <clarkb> looks like bug and blueprint both
17:29:29 <clarkb> confirmed that nodepool failed
17:29:35 <clarkb> registry running now
17:29:51 <clarkb> fungi: so ya maybe we get a change in that simply comments out those scripts in the various hook scripts for now?
17:30:03 <clarkb> then that can land before or after the db config cleanup
17:30:06 <fungi> yeah, looking at welcome_message.py the query is looking for changes in the db matching that account id, so it won't fire for actually new users post upgrade, but will i think continue to fire for existing users who only had one change in before the upgrade
17:30:17 <clarkb> got it
17:30:32 <clarkb> registry just succeeded. zuul is running now and it should noop succeed
17:30:42 <fungi> i think update_bug.py will still half-work, it will just fail to reassign bugs
17:31:05 <clarkb> fungi: but will it raise an exception early because the ini file doesn't have the keys it is looking for anymore?
17:31:51 <clarkb> I think it will
17:31:53 <fungi> oh, did we remove the db details from the config?
17:32:03 <clarkb> fungi: not yet, that is one of the changes to land though
17:32:22 <clarkb> fungi: https://review.opendev.org/c/opendev/system-config/+/757162
17:32:22 <fungi> got it
17:32:39 <fungi> so yeah i guess we can strip them out until someone has time to address those two scripts
17:32:50 <fungi> i'll amend 757162 with that i guess
17:33:03 <clarkb> wfm
17:33:11 <clarkb> note its parent is the html cleanup
17:33:20 <clarkb> which is also not yet merged
17:33:30 <fungi> yeah, i'm keeping it in the stack
17:33:56 <clarkb> zuul "succeeded"
17:34:14 <clarkb> fungi: also its three scripts
17:34:23 <clarkb> welcome message, and update bug and update blueprint
17:36:17 <fungi> update blueprint doesn't rely on the db though
17:36:36 <clarkb> it does
17:36:50 <fungi> huh, i wonder why. okay i'll take another look at that one too
17:37:46 <fungi> select subject, topic from changes where change_key=%s
17:38:01 <fungi> yeesh, okay so it's using the db to look up changes
17:38:24 <clarkb> ya rest api should be fine for that anonymously too
17:38:29 <clarkb> I'm adding notes to the etherpad
17:38:59 <fungi> and yeah, the find_specs() function performing that query is called unconditionally in update_blueprint.py so it'll break entirely
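For reference, the anonymous REST equivalent of that lookup is small; this is only a sketch (the change number is a placeholder) and it relies just on the subject and topic fields of Gerrit's ChangeInfo plus the usual )]}' response prefix:

    # Sketch: look up a change's subject and topic anonymously over REST
    # instead of querying the old reviewdb. The change number is a placeholder.
    import json
    import requests

    resp = requests.get("https://review.opendev.org/changes/763600")
    resp.raise_for_status()
    change = json.loads(resp.text.split("\n", 1)[1])  # strip the ")]}'" prefix
    print(change["subject"], change.get("topic"))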
17:40:20 <clarkb> and eavesdrop succeeded. Now puppet else is starting
17:40:59 <fungi> update_bug.py is also called from two other hook scripts, i'll double-check whether those modes are expected to work at all
17:43:26 <fungi> looks like the others are safe to stay, update_bug.py is only connecting to the db within set_in_progress() which is only called within a conditional checking args.hook == "patchset-created"
17:44:23 <clarkb> fungi: where does it do the ini file lookups?
17:44:33 <clarkb> because it will raise on those when the keys are removed from the file
17:44:47 <clarkb> (it's less about where it connects and more where it finds the config)
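The failure mode being described can be seen with plain configparser; the section and option names below only illustrate the database keys being removed from the config, they are not jeepyb's exact code:

    # Illustration: once the database keys are gone, the config lookup raises
    # before any DB connection is attempted. Names are illustrative only.
    import configparser

    cfg = configparser.ConfigParser()
    cfg.read_string("[gerrit]\nbasepath = git\n")  # no [database] section any more
    try:
        cfg.get("database", "hostname")
    except (configparser.NoSectionError, configparser.NoOptionError) as exc:
        print("config lookup failed early:", exc)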
17:48:28 <clarkb> puppetry is still running according to the log I'm tailing
17:48:46 <clarkb> I don't expect this to finish before 18:00, but should I go ahead and remove the zuul nodes from the emergency file anyway now since things seem to be working?
17:48:48 <clarkb> fungi: ^
17:49:59 <fungi> ini file is parsed in jeepyb.gerritdb.connect() which isn't called outside the check for patchset-created
17:50:26 <fungi> sorry, digging in jeepyb internals
17:50:49 <fungi> what's the desire to take zuul servers out of emergency in the middle of a deploy run?
17:51:03 <fungi> oh, just in case it finishes before the top of the hour?
17:51:06 <clarkb> that this deploy run is racing the next cron iteration
17:51:07 <clarkb> yup
17:51:18 <clarkb> if we wait we might have to skip to 19:00 though this may end up happening anyway
17:51:32 <clarkb> oh wait puppet is done it says
17:51:39 <fungi> is it likely to decide to deploy things to zuul servers in this run if they're taken out of emergency early?
17:51:42 <fungi> ahh, then go for it
17:51:48 <clarkb> ok will do it in the screen
17:52:55 <clarkb> and done, can you double check the contents of the emergency file really quickly just to make sure I didn't do anything obviously wrong?
17:54:29 <fungi> emergency file in the bridge screen lgtm
17:55:00 <fungi> gerritbot is still silent. did it get restarted?
17:55:45 <clarkb> ya I thought I restarted it
17:55:50 <fungi> last started at 15:59
17:56:12 <fungi> gerrit restart was 15:25
17:56:30 <fungi> but for some reason it didn't echo when i pushed updates for your system-config stack
17:56:35 <fungi> checking gerritbot's logs
17:58:03 <fungi> Nov 22 16:59:23 eavesdrop01 docker-gerritbot[1386]: 2020-11-22 16:59:23,196 ERROR gerrit.GerritWatcher: Exception consuming ssh event stream:
17:58:23 <fungi> (from syslog)
17:58:28 <clarkb> neat
17:59:04 <clarkb> I thought it was supposed to try and reconnect
17:59:14 <fungi> looks like the json module failed to parse an event
17:59:24 <fungi> and that crashed gerritbot
17:59:38 <clarkb> probably sufficient for now to ignore json parse failures there?
17:59:46 <clarkb> basically go "well I don't understand, oh well"
18:00:34 <fungi> this is being parsed from within gerritlib, the exception was raised from there
18:00:49 <fungi> so we'll likely need to fix this in gerritlib and tag a new release
18:01:17 <clarkb> side note: zuul doesn't use gerritlib, maybe there is something to be learned in zuul's implementation
18:05:00 <fungi> http://paste.openstack.org/show/800291/
18:05:05 <fungi> that's the full traceback
18:05:35 <clarkb> are we getting empty events
18:06:02 <clarkb> maybe the fix is to wrap that in if l : data = json.loads(l)
18:07:03 <clarkb> and maybe catch json decode errors that happen anyway and reconnect or something like that
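Something along the lines of what clarkb is suggesting, sketched against a generic readline loop rather than gerritlib's actual code:

    # Sketch of the guarded read: skip blank lines and log-and-continue on
    # bad JSON instead of letting the exception kill the watcher thread.
    # Illustrative only; gerritlib's real watcher is structured differently.
    import json
    import logging

    log = logging.getLogger("gerrit.GerritWatcher")

    def read_event(stream):
        line = stream.readline()
        if not line:
            return None  # nothing to parse (possibly a closed stream)
        try:
            return json.loads(line)
        except ValueError:  # JSONDecodeError is a ValueError subclass
            log.warning("Cannot parse data from Gerrit event stream: %r", line)
            return None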
18:11:18 <clarkb> fungi: what's with the docker package inspection in the review screen?
18:11:28 <fungi> https://review.opendev.org/c/opendev/gerritlib/+/763658 Log JSON decoding errors rather than aborting
18:12:25 <fungi> clarkb: when you were first suggesting we needed to look at shutdown routines in some unspecified package you couldn't find, i thought you meant the docker package so i was tracking down where we'd installed it from
18:12:36 <clarkb> gotcha
18:14:47 <clarkb> fungi: looks like gerritbot side will also need to be updated to handle None events
18:14:58 <clarkb> I think we can land the gerritbot change first and be backward compatible
18:15:05 <clarkb> then land gerritlib and release it
18:16:24 <fungi> clarkb: i don't see where gerritbot needs updating. _read() is just trying to parse a line from the stream and then either explicitly returning None early or returning None implicitly after enqueuing any event it found
18:17:15 <fungi> i figured the return value wasn't being used since that method wasn't previously explicitly returning at all
18:17:18 <clarkb> fungi: https://opendev.org/opendev/gerritbot/src/branch/master/gerritbot/bot.py#L303-L346 specifically line 307 assumes a dict
18:17:52 <clarkb> oh I see
18:17:55 <clarkb> you're short circuiting
18:17:59 <clarkb> nevermind you're right
18:18:44 <fungi> yeah, the _read() in gerritbot is being passed contents from the queue, not return values from gerritlib's _read()
18:18:52 <clarkb> left a suggestion but +2'd it
18:18:56 <clarkb> yup
18:24:52 <clarkb> nodepool is running now, then registry then zuul
18:25:09 <clarkb> on the zuul side of things we expect it to noop the config because I already removed the digest auth option from zuul's config files
18:26:51 <fungi> okay, gerritbot is theoretically running now with 763658 hand applied
18:27:30 <fungi> we should be able to check its logs for "Cannot parse data from Gerrit event stream:"
18:27:50 <clarkb> and see what the data is
18:27:55 <fungi> exactly
18:28:13 <clarkb> infra-prod-service-zuul is starting nowish
18:29:42 <clarkb> oh another thing I noticed is that we do fetch from gitea for our periodic jobs when syncing project-config
18:30:07 <clarkb> this didn't end up being a problem because replication went quickly and we replicated project-config first, but we should keep that in mind for the future. It isn't always gerrit state
18:30:14 <clarkb> (maybe a good followup change is to switch it)
18:30:14 <fungi> good thing we waited for the replication to finish yeah
18:33:36 <clarkb> that is in the sync-project-config role
18:33:48 <clarkb> it has a flag to run off of master that we set on the periodic jobs
18:37:22 <clarkb> looks like we're pulling new zuul images (not surprising)
18:40:36 <clarkb> it succeeded and ansible logs show that zuul.conf is changed: false which is what we wanted to see \o/
18:42:13 <clarkb> infra-root I think we are ready for reviews on https://review.opendev.org/c/opendev/system-config/+/757161 since zuul looked good. if this change looks good to you maybe don't approve it just yet as we have to remove review.o.o from the emergency file for it to take effect
18:42:37 <clarkb> also note that we'll have to manually clean up those html/js/css files as the change doesn't rm them. But the change does update gerrit.config so we'll see if it does the right thing there
19:11:13 <fungi> it's just dawned on me that 763658 isn't going to log in production, i don't think, because that's being emitted by gerritlib and we'd need python logging set up to write to a gerritlib debug log?
19:11:36 <clarkb> depends on what the default log setup is I think
19:11:42 <fungi> mmm
19:11:44 <clarkb> I don't know how that service sets up logging
19:12:33 <clarkb> you could edit your update to always log the event at debug level and see if you get those
19:12:43 <clarkb> if you don't then more digging is required
19:13:32 <fungi> gerritbot itself is logging info level and above to syslog at least
19:16:30 <fungi> ahh, okay, https://opendev.org/opendev/system-config/src/branch/master/playbooks/roles/gerritbot/files/logging.config#L14-L17 seems to be mapped into the container and logs gerritlib debug level and higher to stdout
19:16:37 <fungi> if i'm reading that correctly
19:17:07 <fungi> oh, but then the console handler overrides the level to info i guess?
19:17:17 <fungi> so i'd need to tweak that presumably
19:18:04 <clarkb> ya or update your thing to log at a higher level
19:18:11 <clarkb> I suggested that on the change earlier
19:18:23 <clarkb> warn was what I suggested
19:19:26 <fungi> yeah, which i agree with if the content it can't parse identifies some buggy behavior somewhere
19:19:55 <fungi> anyway, in the short term i've restarted it set to debug level logging in the console handler
19:20:08 <clarkb> sounds good
19:22:10 <clarkb> fungi: what do you think about 757161? should we proceed with that?
19:23:00 <fungi> sure, i've approved it just now
19:24:47 <clarkb> we need to edit emergency.yaml as well
19:24:53 <fungi> oh, yep, doing
19:25:30 <fungi> removed review.o.o from the emergency list just now in the screen on bridge.o.o
19:25:53 <clarkb> lgtm
19:32:39 <clarkb> fungi: I'm going to make a copy of gerrit.config in my homedir so that we can easily diff the result after these changes land
19:34:36 <fungi> sounds good
19:35:34 <clarkb> if this change lands ok then I think we land the next one, then restart gerrit and ensure it is happy with those updates? We can also make these updates on review-test really quickly and restart it
19:35:42 <clarkb> why don't I do that since I'm paranoid and this will make me feel better
19:40:08 <fungi> yeah, not a terrible idea
19:40:12 <fungi> go for it
19:40:51 <clarkb> I did both of the changes against review-test manually and it is restarting now
19:40:52 <fungi> i'm poking more stuff into https://etherpad.opendev.org/p/gerrit-3.2-post-upgrade-notes i've thought of, and updated stuff there we've since resolved
19:41:28 <clarkb> I didn't do the hooks updates though since testing those is more painful
19:42:20 <fungi> oh, one procedural point which sprang to mind, once all remaining upgrade steps are completed and we're satisfied with the result and endmeeting here, we can include the link to the maintenance meeting log in the conclusion announcement
19:42:51 <fungi> that may be a nice bit of additional transparency for folks
19:42:58 <clarkb> ++
19:43:15 <clarkb> review-test seems happy to me and no errors in the log (except the expected one from plugin manager plugin)
19:43:40 <fungi> i guess that should go on the what's broken list so we don't forget to dig into it
19:43:44 <fungi> adding
19:44:22 <clarkb> fungi: I'm 99% sure its because you have to explicitly enable that plugin in config in addition to installing it
19:44:49 <clarkb> but we aren't enabling remote plugin management so it breaks. But ya we can test if enabling remote plugin management fixes review-test's error log error
19:45:18 <fungi> i just added it as a reminder to either enable or remove that
19:45:34 <fungi> not a high priority, but would rather not forget
19:45:44 <clarkb> ++
19:46:02 <clarkb> zuul says we are at least half an hour from it merging the change to update commentlinks on review
19:47:23 <clarkb> ianw: when your monday starts, I was going to ask if you could maybe do a quick check of the review backups to ensure that all the shuffling hasn't made anything sad
19:49:21 <clarkb> related to ^ we'll want to clean up the old reviewdb when we're satisfied with things so that only the accountPatchReviewDb is backed up
19:49:26 <clarkb> should cut down on backup sizes
19:52:11 <fungi> yeah, though we should preserve the pre-upgrade mysqldump for a while "just in case"
19:52:46 <clarkb> ++
19:57:56 <fungi> i added some "end maintenance" communication steps to the upgrade plan pad
19:58:44 <clarkb> fungi: that list lgtm
20:20:31 <clarkb> the change that should trigger infra-prod-service-review is about to merge
20:21:21 <clarkb> hrm I think that decided it didn't need to run the deploy job :/
20:22:12 <clarkb> ya ok our files list seems wrong for that job :/
20:22:32 <clarkb> or wait now playbooks/roles/gerrit is in there
20:23:02 <clarkb> Unable to freeze job graph: Job system-config-promote-image-gerrit-2.13 not defined is the error
20:23:26 <clarkb> I see the issue
20:26:07 <clarkb> remote:   https://review.opendev.org/c/opendev/system-config/+/763663 Fix the infra-prod-service-review image dependency
20:26:21 <clarkb> fungi: ^ gerritbot didn't report that or the earlier merge that failed to run the job I expected. Did you catch anything in the logs?
20:26:46 <fungi> looking
20:27:05 <fungi> it didn't log the "Cannot parse ..." message at least
20:27:13 <fungi> seeing if it's failed in some other way
20:32:29 <clarkb> I'm not sure if merging https://review.opendev.org/c/opendev/system-config/+/763663 will trigger the infra-prod-service-review job (I think it may since we are updating that job). If it doesn't then I guess we can land the db cleanup change?
20:33:37 <fungi> so here's the new gerritbot traceback :/
20:33:40 <fungi> http://paste.openstack.org/show/800294/
20:34:39 <clarkb> fungi: its a bug in your change
20:34:46 <clarkb> you should be print line not data
20:34:52 <clarkb> because data doesn't get assigned if you fall into the traceback
20:35:05 <fungi> d'oh, yep!
20:36:19 <fungi> okay, switched that log from data to l
20:36:22 <fungi> will update the change
20:36:50 <clarkb> fungi: note its line not l
20:36:55 <clarkb> at least looking at your change
20:37:06 <fungi> well, it's "l" in production, it'll be "line" in my change
20:37:19 <fungi> we're running the latest release of gerritlib in that container, not the master branch tip
20:37:24 <clarkb> I see
20:37:28 <fungi> pycodestyle mandated that get "fixed"
20:38:00 <clarkb> of course
20:38:50 <fungi> but the fact that we were tripping that code path indicates we're seeing more occurrences of unparseable events in the stream at least
20:38:58 <clarkb> ya
20:39:33 <clarkb> can you review https://review.opendev.org/c/opendev/system-config/+/763663 ?
20:39:45 <clarkb> zuul should be done check testing it in about 7 minutes
20:40:11 <clarkb> fungi: I wonder if there is a new event type that isn't json
20:40:19 <clarkb> and we've just got to ginore it or parse it differently
20:40:39 <clarkb> I guess we should find out soon enough
20:41:15 <fungi> aha, yep, good catch
20:41:19 <fungi> on 763663
20:42:13 <fungi> also ansible seems to have undone my log level edit in the gerritbot logging config so i restarted again with it reset
20:42:32 <clarkb> fungi: it will do that hourly as eavesdrop is in the hourly cron jobs
20:42:38 <clarkb> fungi: maybe put eavesdrop in emergency?
20:42:50 <fungi> yeah, i suppose i can do that
20:43:50 <fungi> done
20:46:51 <clarkb> https://review.opendev.org/c/opendev/system-config/+/757162/ is the next chagne to land if infra-prod-service-review doesn't run after my fix (its not clear to me if the fix will trigger the job due to file matchers and zuul behavior)
20:55:58 <ianw> clarkb will do
20:59:17 <clarkb> fungi: fwiw `grep GerritWatcher -A 20 debug.log` in /var/log/zuul on zuul01 doesn't show anything like that. It does show when we restart gerrit and connectivity is lost
21:00:01 <ianw> just trying to catch up ... gerritbot not listening?
21:00:08 <clarkb> ianw: its having trouble decoding events at times
21:00:23 <clarkb> grepping JSONDecodeError in that debug.log for zuul shows it happens once?
21:00:54 <clarkb> and then it tries to reconnect. I think that may line up with a service restart
21:01:09 <clarkb> 15:25:46,161 <- iirc that is when we restarted to get on the newly promoted image
21:01:15 <clarkb> so no real key indicator yet
21:01:53 <clarkb> ianw: we've started reenabling cd too, having trouble getting infra-prod-service-review to run due to job deps which should be fixed by https://review.opendev.org/c/opendev/system-config/+/763663 not sure if that change landing will run the job though
21:02:27 <clarkb> once that job does run and we're happy with the result I think we're good from the cd perspective
21:03:17 <clarkb> fungi: ya in the zuul example of this problem it seems that zuul gets a short read because we restarted the service. That then fails to decode because it's incomplete json. Then it fails a few times after that trying to reconnect
21:04:18 <ianw> ok sorry i'm about 40 minutes away from being 100% here
21:04:34 <clarkb> ianw: no worries
21:11:04 <clarkb> waiting for system-config-legacy-logstash-filters to start
21:11:48 <clarkb> kolla runs 40 check jobs
21:12:15 <clarkb> 29 are non voting
21:12:20 <clarkb> I'm not sure this is how we imagined this would work
21:23:00 <clarkb> "OR, more simply, just check the User-Agent and serve the all the HTTP incoming requests for Git repositories if the client user agent is Git." I like this idea from luca
21:26:41 <fungi> clarkb: yeah, no idea if it's a short read or what, though we're not restarting gerrit when it happens
21:27:04 <fungi> though that could explain why it was crashing when we'd restart gerrit
21:27:15 <clarkb> ya
21:27:24 <fungi> i hadn't looked into that failure mode
21:27:29 <clarkb> that makes me wonder if sighup isn't happening or isn't as graceful as we hope
21:27:41 <clarkb> you'd expect gerrit to flush connections and close them on a graceful stop
21:27:47 <clarkb> that might be a question for luca
21:28:26 <clarkb> we can probably test that by manually doing a sighup to the process and observing its behavior
21:28:32 <clarkb> rather than relying on docker-compose to do it
21:28:39 <clarkb> then we at least know gerrit got the signal
21:30:51 <clarkb> or maybe we need a longer stop_grace_period value in docker-compose
21:30:58 <clarkb> though its already 5m and we stop faster than that
21:33:33 <clarkb> this system-config-legacy-logstash-filters job ended up on the airship cloud and its super slow :/
21:34:49 <clarkb> slightly worried it might time out
21:36:31 <clarkb> fungi: I put a kill command (commented out) in the review-test screen if we want to try and manually stop the gerrit process that way and see if it goes away quickly like we see with docker-compose down
21:37:01 <fungi> checking
21:38:04 <clarkb> if https://review.opendev.org/c/opendev/system-config/+/763663 fails I'm gonna break for lunch/rest
21:38:09 <clarkb> while it rechecks
21:38:50 <fungi> clarkb: yeah, that looks like the proper child process
21:39:17 <clarkb> k I guess I should go ahead and run it and see what happens
21:40:00 <clarkb> it stopped almost immediately
21:40:19 <clarkb> thats "good" i guess. means our docker compose file is unlikely to be broken
21:40:48 <clarkb> I wonder if that means that gerrit no longer handles sighup
21:41:11 <fungi> may be worth double-checking the error_log to see if it did log a graceful stop
21:45:09 <clarkb> wow I think it may have finished just before the timeout
21:45:15 <clarkb> the job I mean
21:45:41 <fungi> seems like we ought to consider putting it on a diet rsn
21:46:19 <fungi> hey! my stuff is logging
21:46:56 <fungi> looks like it's getting a bunch of empty strings on read
21:47:07 <fungi> Nov 22 21:46:32 eavesdrop01 docker-gerritbot[1386]: 2020-11-22 21:46:20,320 DEBUG gerrit.GerritWatcher: Cannot parse data from Gerrit event stream:
21:47:10 <fungi> Nov 22 21:46:32 eavesdrop01 docker-gerritbot[1386]: ''
21:47:35 <fungi> it's seriously spamming the log
21:47:50 <clarkb> fungi: maybe just filter those out then?
21:47:50 <fungi> i wonder if this is some pathological behavior from it getting disconnected and not noticing
21:47:55 <clarkb> oh maybe?
21:48:19 <fungi> but yeah, i'll add a "if line" or similar to skip empty reads
21:48:21 <clarkb> ok I don't think https://review.opendev.org/c/opendev/system-config/+/763663 was able to trigger the deploy job
21:48:23 <fungi> and see what happens
21:48:53 <fungi> it's having trouble stopping the gerritbot container even
21:50:14 <fungi> okay, it's restarted with that conditional wrapping everything after the readline()
21:50:16 <clarkb> interestingly zuul doesn't appear to have that problem
21:50:27 <clarkb> could it be a paramiko issue?
21:50:33 <clarkb> maybe compare paramiko between zuul and gerritbot
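fungi's hunch fits the usual file-object semantics: readline() on a closed or exhausted stream returns '' forever, so an empty read is better treated as end-of-stream and a cue to reconnect rather than just skipped. A sketch of that shape, illustrative rather than gerritlib's actual code:

    # Sketch: treat an empty readline() as EOF so the caller can reconnect,
    # instead of spinning on '' after the SSH channel has gone away.
    # Illustrative only; gerritlib/gerritbot structure this differently.
    def consume_events(stream, handle_event):
        while True:
            line = stream.readline()
            if not line:
                # End of stream: the event stream (or SSH connection) is gone.
                raise EOFError("Gerrit event stream closed; reconnect needed")
            handle_event(line)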
21:51:34 <clarkb> infra-root do we want to land https://review.opendev.org/c/opendev/system-config/+/757162 to try and get the infra-prod-service-review deploy job to run now? Or would we prefer a more noopy change to get the change that previously merged to run?
21:52:00 <clarkb> I don't think enqueue will work because the issue was present on the change that merged earlier so enqueueing will just abort
21:52:37 <fungi> i'm in the middle of dinner now so can look soonish or will happily defer to others' judgement there
21:52:49 <clarkb> fungi: related to that I'm fading fast I think I need a meal
21:52:53 * clarkb tracks one down
21:53:19 <clarkb> I think the risk with 757162 is that it adds more changes to apply with infra-prod-service-review rather than the more simple 757161
21:53:44 <clarkb> I'll push up a noopy change then go get food
21:55:42 <clarkb> remote:   https://review.opendev.org/c/opendev/system-config/+/763665 Change to trigger infra-prod-service-review
21:56:00 <clarkb> and now I'm taking a break
21:56:55 <ianw> i'm surprised gerrit would want to do the UA matching; but we could do something like the google approach and move git to a separate hostname, but then we do the UA switching with a 301 in apache?
21:57:40 <clarkb> ianw: well its luca not google
21:58:11 <clarkb> I'm not quite sure how just a separate hostname helps
21:58:12 <fungi> i haven't seen anyone reply or comment on my bug report yet
21:58:24 <clarkb> because you need to apply gerrit acls and auth
21:58:54 <clarkb> fungi: ya most of the discussion is on the mailing list. I'm hoping they poke at the bug when the work week resumes
21:59:02 <fungi> is there some sort of side discussion going on?
21:59:04 <fungi> oh
21:59:58 <fungi> process: mention to luca and he asks me to file a bug. i do that and discussion of it happens somewhere other than the bug or my e-mail
22:00:29 <clarkb> indeed
22:01:35 <fungi> so just to be clear: if there has been some discussion of the bug report i filed, i have seen none of it
22:02:10 <fungi> i'll happily weigh in on discussion within the bug report
22:02:58 <clarkb> I mentioned both the issue and the bug itself in my response to the mailing list and they are now discussing it on the mailing list not the bug
22:03:26 <fungi> i'll just continue to assume my input on the topic is not useful in that case
22:04:25 <clarkb> my hunch is it's more that on sunday it's easy to couch quarterback the mailing list but not the bug tracker
22:05:22 <fungi> fair enough, i'll be at the ready to reply with bug comments once the professionals are back on the field
22:06:57 <ianw> yeah, i see it as just floating a few ideas; but fundamentally you let people call their projects anything, have people access repos via /<name> and use some bits for their UI.  seems like a choose two situation
22:07:47 <ianw> clarkb: https://review.opendev.org/c/opendev/system-config/+/757162 seems ok to me?
22:12:31 <clarkb> ianw: ya I expect it's fine. It's more that we've force merged a number of changes as well as merged 757161 at this point and none of those have run yet
22:12:39 <clarkb> ianw: so I'm thinking keep the delta down as much as possible may be nice
22:13:05 <clarkb> ianw: but if you're able to keep an eye on things I'm also ok with merging 757162
22:13:13 <clarkb> I'm "around". Eating soup
22:13:53 <clarkb> fwiw looking at the code it seems that gerrit does properly install a java runtime shutdown hook
22:14:03 <clarkb> not sure if that hook is sufficient to gracefully stop connections though
22:14:17 <ianw> clarkb: yeah, i'm around and can watch
22:14:34 <clarkb> ianw: that change still has my WIP on it but feel free to remove that and approve if the change itself looks good after a review
22:15:13 <clarkb> ianw: I also put a copy of gerrit.config and secure.config in ~clarkb/gerrit_config_backups on review to aid in checking of diffs after stuff runs
22:15:30 <ianw> clarkb: i'm not sure i can remove your wip now
22:15:35 <clarkb> oh because we don't have admin
22:15:40 <clarkb> ok give me a minute
22:16:30 <clarkb> WIP removed
22:16:35 <clarkb> (but I didn't approve)
22:17:40 <ianw> i'll watch that
22:44:27 <ianw> TASK [sync-project-config : Sync project-config repo] ************************** seems to be failing on nb01 & nb02
22:44:34 <fungi> :/
22:45:26 <clarkb> are the disks full again?
22:45:34 <clarkb> we put project-config on /opt too
22:45:52 <ianw> /dev/mapper/main-main 1007G 1007G     0 100% /opt
22:45:57 <ianw> clarkb: jinx :)
22:46:12 <ianw> ok, it looks like i'm debugging that properly today now :)
22:48:06 <fungi> gerritbot is still parsing events for the moment
22:53:06 <fungi> time check, we've got just over two hours until our maintenance is officially slated to end
22:53:30 <ianw> system-config-run-review (2. attempt) ... unfortunately i missed what caused the first attempt to fail
22:53:45 <ianw> this is on the gate job for https://review.opendev.org/c/opendev/system-config/+/757162/
22:54:18 <clarkb> fungi: yup I'm hopeful we'll get ^ to deploy and then we restart one more time
22:54:38 <ianw> i'll go poke at the zuul logs to make sure it was an infra error, not something with the job
22:55:20 <clarkb> but once that restart is done and we're happy with things I think we call it done
22:58:11 <clarkb> its merging now
22:58:27 <fungi> excellent
22:58:34 <clarkb> infra-prod-service-review is queued
22:59:59 <clarkb> and ansible is running
23:01:14 <clarkb> and its done
23:01:22 <clarkb> the only thing I didn't quite expect was it restarted apache2
23:01:31 <clarkb> so maybe the edits we made to the vhost config didn't quite line up
23:01:42 <clarkb> I'm going to compare diffs and look at files and stuff now
23:03:18 <clarkb> gerrit.config looks "ok". We are not quoting the same way as gerrit, I don't think, so a lot of the comment links have "changes"
23:03:24 <clarkb> I think those are fine
23:03:51 <clarkb> secure.config looks good
23:04:11 <fungi> we should probably try to normalize them in git though
23:04:16 <clarkb> ++
23:04:18 <clarkb> docker-compose.yaml lgtm
23:04:43 <clarkb> the track-upstream and manage-projects scripts lgtm
23:05:45 <clarkb> patchset-created lgtm
23:06:35 <clarkb> the apache vhost lgtm its got the request header and no /p/ redirection
23:07:34 <clarkb> ok I think the only other thing to do is delete/move aside the files that 757161 stops managing
23:07:42 <clarkb> I'll move them into my homedir then we can restart?
23:09:50 <fungi> sgtm
23:10:09 <fungi> on hand for the gerrit container restart once you've moved those files away
23:10:55 <clarkb> files are moved
23:11:23 <clarkb> fungi: do you want me to do the down up -d or will you do it?
23:11:31 <fungi> happy to do it
23:11:31 <clarkb> (not sure if on hand meant around for it or doing the typing)
23:11:34 <clarkb> k go for it
23:12:03 <fungi> downed and upped in the root screen session on review.o.o
23:12:08 <clarkb> yup saw it happen
23:13:56 <clarkb> seems to be up now. I can view changes
23:14:08 <clarkb> fungi: you may need to convince gerritbot to be happy? or maybe not after your changes
23:14:58 <clarkb> on the upgrade etherpad everything but item 7 is struck through
23:16:10 <clarkb> I'll abandon my noopy change
23:16:13 <fungi> looking
23:16:34 <fungi> Nov 22 23:11:46 eavesdrop01 docker-gerritbot[1386]: 2020-11-22 23:11:46,598 DEBUG paramiko.transport: EOF in transport thread
23:16:53 <fungi> that seems likely to indicate it lost the connection and didn't reconnect?
23:17:31 <fungi> i've restarted the gerritbot container now
23:17:58 <fungi> it's getting and logging events
23:18:12 <clarkb> cool
23:18:21 <clarkb> fungi: you haven't happened to have drafted the content for item 7 have you?
23:18:55 <fungi> nope, but i could
23:19:19 <clarkb> re the config I actually wonder if what is happening is we quote things in our ansible and old gerrit removed them but new gerrit does not remove them
23:19:33 <clarkb> because the config has been static since 2.13 except for hand edits
23:19:43 <clarkb> fungi: that would be great
23:19:59 <fungi> i'll start a draft in a pad
23:20:07 <clarkb> zuul is seeing events too because a horizon change just entered the gate
23:33:19 <clarkb> fungi: I've detached from the screen on review and bridge
23:33:33 <clarkb> I don't think they have to go away anytime soon but I think I'm done with them myself
23:33:34 <fungi> cool
23:33:54 <clarkb> also detached on review-test
23:37:27 <fungi> started the announce ml post draft here: https://etherpad.opendev.org/p/nzYm6eWfCr1mSf0Dis4B
23:37:36 <fungi> i'm positive it's missing stuff
23:37:40 <clarkb> looking
23:38:50 <clarkb> fungi: I made a couple of small edits
23:38:52 <clarkb> lgtm otherwise
00:04:56 <clarkb> ianw: fungi should we endmeeting in here, send that email out, and the status notice?
00:06:43 <ianw> if you've got nothing else for me to do but monitor, ++
00:06:47 <fungi> i think so, unless anybody can think of anything else we need to do first
00:06:56 <clarkb> I can't
00:07:00 <clarkb> and I'm ready for a 12 hour nap
00:07:09 <fungi> i hear ya
00:08:34 <ianw> is the http password a suggestion or a requirement?
00:08:54 <clarkb> for normal users I think just a suggestion. But for infra-root we should all do that
00:08:55 <ianw> unlike before, you only get one look at it now
00:09:53 <fungi> yup, if you dismiss the window prematurely, you need to generate anotehr
00:09:57 <fungi> another
00:10:47 <clarkb> fungi: I don't think you need to endmeeting because it's been longer than an hour but maybe you should for symmetry?
00:10:49 <ianw> i'd probably s/mostly functional/functional/ just to not make it sound like we're worried about anything
00:14:18 <clarkb> fungi: and were you planning to send that out? /me is fading fast so wants to ensure this gets done :)
00:15:19 <fungi> yeah, i can send it momentarily
00:15:45 <clarkb> don't forget you wanted the meeting log which may need an endmeeting first?
00:15:54 <fungi> yup
00:28:37 <fungi> okay, any objections to me doing endmeeting now?
00:28:52 <clarkb> no
00:29:15 <fungi> in that case, all followup discussion should happen in #opendev
00:29:18 <fungi> see you all there
00:29:23 <clarkb> o/
00:29:24 <fungi> #endmeeting