Friday, 2020-11-20

*** hamalq has quit IRC03:56
*** sboyron has joined #opendev-meeting07:04
*** hashar has joined #opendev-meeting09:26
*** hashar has quit IRC12:27
*** hashar has joined #opendev-meeting12:53
fungi#startmeeting opendev-maint12:59
openstackMeeting started Fri Nov 20 12:59:28 2020 UTC and is due to finish in 60 minutes.  The chair is fungi. Information about MeetBot at http://wiki.debian.org/MeetBot.12:59
openstackUseful Commands: #action #agreed #help #info #idea #link #topic #startvote.12:59
*** openstack changes topic to " (Meeting topic: opendev-maint)"12:59
openstackThe meeting name has been set to 'opendev_maint'12:59
fungi#status notice The Gerrit service at review.opendev.org will be offline starting at 15:00 UTC (roughly two hours from now) for a weekend upgrade maintenance: http://lists.opendev.org/pipermail/service-announce/2020-October/000012.html13:01
openstackstatusfungi: sending notice13:01
-openstackstatus- NOTICE: The Gerrit service at review.opendev.org will be offline starting at 15:00 UTC (roughly two hours from now) for a weekend upgrade maintenance: http://lists.opendev.org/pipermail/service-announce/2020-October/000012.html13:01
openstackstatusfungi: finished sending notice13:04
fungi#status notice The Gerrit service at review.opendev.org will be offline starting at 15:00 UTC (roughly one hour from now) for a weekend upgrade maintenance: http://lists.opendev.org/pipermail/service-announce/2020-October/000012.html13:59
openstackstatusfungi: sending notice13:59
-openstackstatus- NOTICE: The Gerrit service at review.opendev.org will be offline starting at 15:00 UTC (roughly one hour from now) for a weekend upgrade maintenance: http://lists.opendev.org/pipermail/service-announce/2020-October/000012.html13:59
openstackstatusfungi: finished sending notice14:02
clarkbmorning!14:25
clarkbfungi: I think I'll go ahead and put gerrit and zuul in the emergency file now.14:28
clarkband that's done. Please double check I got all the hostnames correct (digits and openstack vs opendev etc)14:31
clarkband when you're done with that do you think we should do the belts and suspenders route of disabling the ssh keys for zuul there too?14:31
clarkbfungi: also do you want to start a root screen on review? maybe slightly wider than normal :P14:38
fungidone, `screen -x 123851`14:40
clarkband attached14:41
funginot sure what disabling zuul's ssh keys will accomplish, can you elaborate?14:41
clarkbit will prevent zuul jobs from ssh'ing into bridge and making unexpected changes to the system should something "odd" happen14:41
clarkbI think gerrit being down will effectively prevent that even if zuul managed to turn back on again though14:41
fungioh, there, i guess we can14:41
fungii thought you meant its ssh key into gerrit14:41
clarkbsorry no, ~zuul on bridge14:42
fungisure, i can do that14:44
clarkbfungi: just move aside authorized_keys is probably easiest?14:44
fungias ~zuul on bridge i did `mv .ssh/{,disabled_}authorized_keys`14:45
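(A minimal sketch of the rename-and-restore pattern used here, run as ~zuul on bridge; the restore step is an assumption about how access gets re-enabled after the maintenance:)

    # disable zuul's ability to ssh into bridge by renaming its key file
    mv ~/.ssh/{,disabled_}authorized_keys
    # after the maintenance, reverse the rename to restore access
    mv ~/.ssh/{disabled_,}authorized_keys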
clarkbfungi: can you double check the emergency file contents too (just making sure we've got this correct on both sides then that way if one doesn't work as expected we've got a backup)14:45
clarkbmy biggest concern is mixing up a digit eg 01 instead of 02 and openstack and opendev in hostnames14:45
clarkbI think I got it right though14:46
fungihostnames in the emergency file look correct, yes, was just checking that14:46
clarkbthanks14:47
clarkbI've just updated the maintenance file that apache will serve from the copy in my homedir14:47
fungichecked them against our inventory in system-config14:47
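(For illustration, a hypothetical layout of the emergency file being discussed; the real path, format and hostnames on bridge may differ:)

    # /etc/ansible/hosts/emergency (hypothetical path) -- an inventory
    # fragment excluded by the playbooks' "!disabled" host pattern
    [disabled]
    review01.opendev.org
    zuul01.opendev.org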
clarkbI plan to make my first cup of tea during the first gc pass :)14:53
fungiyeah, i'm switching computers now and will get more coffee once that's underway14:58
clarkbfungi: I've edited the vhost file on review. When you're at the other computer I think we check that then restart apache at 1500?14:58
clarkbthen we can start turning off gerrit and zuul14:58
fungilgtm15:00
fungiready for me to reload apache?15:00
clarkbmy clock says 1500 now I think so15:00
fungidone15:01
fungimaintenance page appears for me15:01
clarkbme too15:01
clarkbnext we can stop zuul and gerrit. I don't think the order matters too much15:01
fungistatus notice or status alert? wondering if we want to leave people's irc topics altered all weekend given there's also a maintenance page up15:01
clarkbya let's not change the topics15:02
clarkbif we get too many questions we can flip to topic swapping15:02
fungi#status notice The Gerrit service at review.opendev.org is offline for a weekend upgrade maintenance, updates will be provided once it's available again: http://lists.opendev.org/pipermail/service-announce/2020-October/000012.html15:02
openstackstatusfungi: sending notice15:02
-openstackstatus- NOTICE: The Gerrit service at review.opendev.org is offline for a weekend upgrade maintenance, updates will be provided once it's available again: http://lists.opendev.org/pipermail/service-announce/2020-October/000012.html15:03
clarkbif you get the gerrit docker compose down I'll do zuul15:03
fungii guess we should save queues in zuul?15:03
clarkbeh15:03
fungiand restore at the end of the maintenance? or no?15:03
clarkbI guess we can?15:03
clarkbI hadn't planned on it15:04
clarkbgiven the long period of time between states I wasn't entirely sure if we wanted to do that15:04
fungii guess don't worry about it. we can include messaging reminding people to recheck changes with no zuul feedback on them15:04
fungigerrit is down now15:04
fungii'll comment out crontab entries on gerrit next15:05
openstackstatusfungi: finished sending notice15:05
*** corvus has joined #opendev-meeting15:06
clarkb`sudo ansible-playbook -v -f 50  /home/zuul/src/opendev.org/opendev/system-config/playbooks/zuul_stop.yaml` <- is what I'll run on bridge to stop zuul15:06
clarkbactually I'll start a root screen there and run it there without the sudo15:06
fungii've got one going on bridge now15:07
fungiif you just want to join it15:07
clarkboh I just started one too. I'll join yours15:07
fungiahh, okay, screen -list didn't show any yet when i created this one, sorry15:07
clarkbhahaha we put them in the emergency file so the playbook doesn't work15:08
clarkbI'll manually stop them15:08
fungioh right15:08
fungiheh15:08
clarkbscheduler and web are done. Now to do a for loop for the mergers and executors15:09
fungii'll double-check the gerrit.config per step 1.615:11
corvusclarkb: could probably still do "ansible -m shell ze*"; or edit the playbook to remove !disabled15:12
fungiserverId, enableSignedPush, and change.move are still in there, though you did check them after we restarted gerrit earlier in the week too15:12
corvusbut i bet you already started the loop15:12
clarkbyup looping should be done now if anyone wants to check15:13
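(A rough sketch of the two approaches just described for stopping the mergers and executors; the hostnames, paths and stop mechanism are assumptions, not the real inventory:)

    # option 1: the manual loop clarkb describes (host names are placeholders,
    # and the exact stop command -- docker-compose vs systemd -- is assumed)
    for h in zm01 zm02 ze01 ze02; do
        ssh root@"$h" 'cd /etc/zuul-merger && docker-compose down'
    done
    # option 2: corvus's suggestion, ad-hoc ansible which does not apply
    # the playbook's "!disabled" filter
    ansible 'zm*:ze*' -f 20 -m shell -a 'cd /etc/zuul-merger && docker-compose down'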
fungii'll go ahead and start the db dump per step 1.7.1, estimated time is 10 minutes15:13
clarkbfungi: ya I expected that one to be fine after our test but didn't remove it as it seemed like a good sanity check15:13
fungimysqldump command is currently underway in the root screen session on review.o.o15:14
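(A hedged sketch of the dump being run; the database name and target directory come from the discussion, while the connection details and file name are assumptions:)

    time mysqldump --single-transaction -h "$TROVE_HOST" -u gerrit2 -p reviewdb \
        | gzip > /home/gerrit2/reviewdb-pre-2.14.sql.gz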
fungiin parallel i'll start the rsync update for our 2.13 ~gerrit2 backup in a second screen window15:15
clarkbfungi: we don't want to start that until the db dump is done?15:15
clarkbthat way the db dump is copied properly too15:15
fungioh, fair, since we're dumping into the homedir15:16
fungiyeah, i'll wait15:16
fungii guess we could have dumped into the /mnt/2020-11-20_backups volume instead15:16
clarkboh good point15:16
clarkboh well15:16
fungiit'll be finished any minute now anyway, based on my earlier measurements15:22
fungimysqldump seems to have completed fine15:23
clarkbya I think we can rsync now15:23
fungi1.7gb compressed15:23
clarkbis that size in line with our other backups?15:24
fungirsync update is underway now, i'll compare backup sizes in a second window15:24
clarkbyes it is15:24
clarkbI checked outside of the screen15:24
fungiyeah, they're all roughly 1.7gb except the old 2.13-backup-1505853185.sql.gz from 201715:25
fungiwhich we probably no longer need15:25
fungiin theory this rsync should be less than 5 minutes15:26
fungithough could be longer because of the db dump(s)/logrotate i suppose15:26
clarkbeven if it was a full sync we'd still be on track for our estimated time target15:27
fungiyeah, fresh rsync starting with nothing took ~25 minutes15:27
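(A hedged sketch of the checkpoint copy; the source homedir and destination volume come from the discussion, the subdirectory name and flags are assumptions:)

    time rsync -a --delete /home/gerrit2/ /mnt/2020-11-20_backups/gerrit2-2.13/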
clarkbI think the gerrit caches and git dirs change a fair bit over time15:34
clarkbin addition to the db and log cycling15:34
fungiand it's done15:35
fungiyeah periodic git gc probably didn't help either15:35
fungianybody want to double-check anything before we start the aggressive git gc (step 2.1)?15:36
clarkbecho $? otherwise no I can't think of anything15:36
fungiyeah, i don't normally expect rsync to silently fail15:37
fungibut it exited 015:37
clarkbyup lgtm15:37
clarkbI think we can gc now15:37
fungii have the gc staged in the screen session now15:37
fungiand it's running15:37
clarkbafter the gc we can spot check that everything is still owned by gerrit215:38
fungiestimated time at this step is 40 minutes, so you can go get your tea15:38
clarkbyup I'm gonna go start the kettle now. thanks15:38
fungii don't see any obvious errors streaming by anyway15:38
clarkbkeeping timing notes on the etherpad too because I'm curious to see how close the estimates are, particularly for today15:38
fungigood call, and yeah that's more or less why i left the time commands in most of these15:39
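(A hedged sketch of the aggressive gc step; the git tree location, the sudo to gerrit2 and the 16-way parallelism come from the discussion, the exact find/xargs invocation is an assumption:)

    time find /home/gerrit2/review_site/git -type d -name '*.git' -print0 \
        | xargs -0 -I{} -P16 sudo -u gerrit2 git --git-dir={} gc --aggressive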
*** melwitt has joined #opendev-meeting15:44
fungiprobably ~15 minutes remaining16:01
clarkbI'm back fwiw just monitoring over tea and toast16:01
fungiestimated 5 minutes remaining on this step16:13
clarkbit is down to 2 repos16:13
clarkbof course one of them is nova :)16:13
fungithe other is presumably either neutron or openstack-manuals16:13
clarkbit was airshipctl16:13
fungioh16:13
fungiwow16:13
clarkbI think it comes down to how find and xargs sort16:13
clarkbI think openstack manuals was the third to last16:14
fungilooks like we're down to just nova now16:14
fungihere's hoping these rebuilt gerrit images which we haven't tested upgrading with are still fine16:15
clarkbI'm not too worried about that, I did a bunch of local testing with our images over the last few months and the images moved over time and were always fine16:16
fungiyeah, the functional exercises we put them through should suffice for catching egregious problems with them, at the very least16:17
clarkbthen ya we also put them through the fake prod marathons16:17
clarkbbefore we proceed to the next step it appears that the track upstream cron fired?16:20
clarkbfungi: did that one get disabled too?16:20
fungiand done16:20
fungii thought i disabled them both, checking16:20
fungioh... it's under root's crontab not gerrit2's16:21
clarkbwe should disable that cron then kill the running container for it16:21
clarkbI think the command is kill16:22
fungilike that? or is it docker kill?16:22
clarkbto line up with ps16:22
fungiyup16:22
fungiokay, it's done16:22
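(What was just done, sketched out; the container id is a placeholder:)

    docker ps                      # list running containers to find the one
                                   # started by the track-upstream cron
    docker kill <container-id>     # stop it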
clarkbwe should keep an eye on those things because they use the explicit docker image iirc16:22
clarkbthe change updates the docker image version in hiera which will apply to all those scripts16:23
clarkbgranted they don't really run gerrit things, just jeepyb in gerrit, so it's probably fine for them to use the old image accidentally16:23
fungithe only remaining cronjobs for root are bup, mysqldump, and borg(x2)16:23
clarkbok I think we can proceed?16:23
fungiand confirmed, the cronjobs for gerrit2 are both disabled still16:24
fungiwe were going to check ownership on files in the git tree16:24
clarkb++16:24
fungieverything looks like it's still gerrit2, even stuff with timestamps in the past hour16:25
clarkbthat spot check looks good to me16:25
fungiso i think we're safe (but also we change user to gerrit2 in our gc commands so it shouldn't be a problem any longer)16:25
clarkbya just a double check since we had problems with that on -test before we updated the gc commands16:26
clarkbI think it's fine and we can proceed16:26
fungidoes that look right?16:26
clarkbyup updated to opendevorg/gerrit:2.1416:26
clarkbon both entries in the docker compose file16:27
fungiokay, will pull with it now16:27
fungihow do we list them before running with them?16:27
clarkbdocker image list16:27
fungii need to make myself a cheatsheet for container stuff, clearly16:27
fungiopendevorg/gerrit   2.14                39de77c2c8e9        22 hours ago   676MB16:28
fungithat seems right16:28
clarkbyup16:28
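(The per-version pattern repeated for each upgrade step, sketched; the compose file location is an assumption:)

    cd /etc/gerrit-compose             # edit image: opendevorg/gerrit:2.14
    docker-compose pull                # fetch the newly pinned image
    docker image list | grep gerrit    # confirm tag, image id and age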
fungiready to init?16:28
clarkbI guess so :)16:29
fungiand it's running16:29
clarkbaround now is when we would expect this one to finish, but also this was the one with the least consistent timing16:37
fungitaking longer than our estimate, yeah16:37
clarkbwe theorized its due to hashing the http passwds16:37
clarkband the input for that has changed a bit recently16:38
clarkb(but maybe we also need entropy? I dunno)16:38
fungishould be far fewer of those now though16:38
corvusit seems pretty idle16:39
clarkbya top isn't showing it as busy16:39
clarkbthe first time we ran it it took just under 30 minutes16:40
fungicould also be that the server instance or volume or (more likely?) trove instance we used on review-test performed better for some reason16:40
fungithe idleness of the server suggests to me that maybe this is the trove instance being sluggish16:41
corvus| 460106 | gerrit2 | 10.223.160.46:56540 | reviewdb | Query   |  716 | copy to tmp table | ALTER TABLE change_messages ADD real_author INT |16:41
corvus| Id     | User    | Host                | db       | Command | Time | State             | Info                                            |16:41
corvus^ column headers16:41
clarkbah ok so it is the db side then?16:42
corvusfungi: so yeah, looks like16:42
corvusyep that's "show full processlist"16:42
corvusin mysql16:42
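(A hedged sketch of how the migration was being watched from the database side; the connection details are assumptions, the database name is from the processlist output above:)

    mysql -h "$TROVE_HOST" -u gerrit2 -p reviewdb -e 'SHOW FULL PROCESSLIST;'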
mordredyeah - sounds like maybe the old db is tuned/sized differently16:42
mordredor just on an old host or something16:42
* fungi blames mordred since he created the trove instance for review-test ;)16:42
mordredtotally fair :)16:42
clarkbthis is one reason why we allocated tons of extra time :)16:42
fungis/blames/thanks/16:43
clarkbas long as we can explain it (and sounds like we have) I'm happy16:43
clarkbthough it's a bit disappointing we're investing in the db when we're gonna discard it shortly :)16:43
mordredright?16:44
fungii'll just take it as an opportunity to catch up on e-mail in another terminal16:44
corvusthere should be a word for blame/thanks16:46
fungithe germans probably have one16:46
corvusmordred: _____ you very much for setting up that trove instance!16:47
fungideutsche has all sorts of awesome words english is missing16:47
mordredschadendanke perhaps? (me making up new words)16:48
fungidoch (the positive answer to a negative question) is in my opinion the greatest example of potentially solvable vagueness in english16:48
mordredyup16:48
corvusomg i need that in my life16:49
mordredit fills the "no, yes it is"16:49
mordredrole16:49
fungisomehow english, while a germanic language, decided to just punt on that16:49
mordredyup16:49
mordredI blame the normans16:49
corvus| 460106 | gerrit2 | 10.223.160.46:56540 | reviewdb | Query   | 1227 | rename result table | ALTER TABLE change_messages ADD real_author INT |16:50
fungimordred: sshhhh, ttx might be listening16:50
corvuschanged from "copy" to "rename"  sounds like progress16:50
corvus| 460106 | gerrit2 | 10.223.160.46:56540 | reviewdb | Query   |    5 | copy to tmp table | ALTER TABLE patch_comments ADD real_author INT |16:50
corvusnew table16:50
corvusi wonder what the relative sizes of those 2 tables are16:51
mordredalso - in newer mysql that should be able to be an online operation16:51
mordredbut apparently not in the version we're running16:51
mordredso it's doing the alter by making a new table with the new column added, copying all the data to the new table and deleting the old16:52
mordredyay16:52
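(A hedged illustration of mordred's point: on a newer MySQL the same column addition can be requested as an in-place change instead of a table copy; whether the server accepts it depends on version and column type:)

    mysql reviewdb -e "ALTER TABLE change_messages ADD real_author INT, ALGORITHM=INPLACE, LOCK=NONE"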
clarkbya our mysql is old. we used old mysql on review-test and it was fine so I didn't think we should need to upgrade first16:52
fungimaybe the mysql version for the review-test trove instance was newer than for review?16:52
clarkbfungi: I'm 99% sure I checked that16:52
clarkband they matched16:52
fungiahh, so that did get checked16:52
clarkbbut maybe I misread the rax web ui or something16:52
mordredmaybe they both did the copy and hte new one is just on better hypervisor16:52
fungior the dump/src process optimizes the disk layout a lot compared to a long-running server16:53
clarkbI'm trying to identify which schema makes this change but the way they do migrations doesn't make that easy for all cases16:53
clarkbthey guice inject db specific migrations from somewhere16:53
clarkbI can't find the somewhere16:54
clarkbanyway its proceeding I'll chill16:54
mordredfungi: yeah - that's also potentially the case16:54
mordredclarkb: they guice inject *everything*16:54
clarkbI don't think the notedb conversion will be very affected by that either since it's all db reads16:54
clarkbso hopefully the very long portion of the upgrade continues to just be long and not longer16:55
corvusoof, it also looks like they're doing one-at-a-time16:55
corvus| 460106 | gerrit2 | 10.223.160.46:56540 | reviewdb | Query   |   15 | copy to tmp table | ALTER TABLE patch_comments ADD unresolved CHAR(1) DEFAULT 'N' NOT NULL  CHECK (unresolved IN ('Y','N')) |16:55
corvussecond update to same table16:55
corvuswhich, to be fair, is the way we usually do it too16:55
corvusbut now i feel more pressure to do upgrade rollups :)16:56
mordredyah - to both16:56
fungi"we" being zuul/nodepool?16:56
fungier, i guess not nodepool as it doesn't use an rdbms16:56
clarkbya still having no luck figuring out where the Schema_13X.java files map to actual sql stuff17:04
clarkbI wonder if it's automagic based on their table defs somewhere17:04
corvusfungi: yes (also openstack)17:04
clarkbI'm just trying to figure out what sort of progress we're making relative to the stack of schema migrations. Unfortunately it prints out all the ones it will do at the beginning then does them so you don't get that insight17:05
fungii would not be surprised if these schema migrations aren't somehow generated at runtime17:05
mordredcorvus: I think nova decided to do rollups when releases are cut - so if you upgrade from icehouse to juno it would be a rollup, but if you're doing CD between icehouse and juno it would be a bunch of individual ones17:06
mordredwhich seems sane - I'm not sure how that would map into zuul - but maybe something to consider in the v4/v5 boundaries17:06
corvusmordred: ++17:07
fungiyay!17:07
fungiit's doing the data migrations now17:07
clarkbok cool17:08
fungilooks like it's coming in around 40 minutes?17:08
clarkbseems like things may be slower but not catastrophically so17:08
fungi(instead of 8)17:08
clarkb142 is the hashing schema change iirc17:09
clarkbyup confirmed that one has content in the schema version java file because they hash java side17:11
clarkbcorvus: is it doing interesting db things at the moment? I wonder if it is also doing some sort of table update for the hashed data17:19
clarkbrather than just inserting records17:19
fungilooks like there's a borg backup underway, that could also be stealing some processor time... though currently the server is still not all that busy17:19
clarkbya I think it must be busy with mysql again17:19
mordreddb schema upgrades are the boringest17:19
clarkbalso note that we had originally thought that the notedb conversion would run overnight. Based on how long this is taking that may be the case again, but we've already built in that buffer so I don't think we need to roll back or anything like that yet17:20
clarkbjust need to be patient I guess (something I am terrible at accomplishing)17:21
corvusclarkb: "UPDATE account_external_ids SET"17:21
fungithat looks like what we expect, yeah17:21
corvusthen some personal info; it's doing lots of those individually17:21
clarkbyup17:21
clarkbdb.accountExternalIds().upsert(newIds); <- is the code that should line up to17:22
clarkboh you know what17:22
fungiyeah this is the stage where we believe it's replacing plaintext rest api passwords with bcrypt2 hashes17:22
clarkbits updating every account even if they didn't have a hashed password17:22
corvusyes17:22
corvusi just caught it doing one :)17:22
clarkbList<AccountExternalId> newIds = db.accountExternalIds().all().toList();17:23
corvuspassword='bcrypt:...17:23
clarkbrather than finding the ones with a password and only updating them17:23
clarkbI guess that explains why this is slow17:23
fungiis it hashing null for 99.9% of the accounts?17:23
clarkbno it only hashes if the previous value was not null17:23
fungior just skipping them once it realizes they have no password?17:23
clarkbbut it is still upserting them back again17:23
clarkbrather than skipping them17:23
corvusit's doing an update to set them to null17:23
fungiahh, okay that's better than, you know, the other thing17:23
corvus(which mysql may optimize out, but it'll at least have to go through the parser and lookup)17:24
clarkbcorvus: do you see sequential ids? if so that may give us a sense for how long this will take. I think we have ~36k ids17:24
corvusids seem random17:24
corvusmay be sorted by username though: it's at "mt.."17:25
corvusnow p..17:25
fungiso maybe ~halfway17:25
corvushah, i saw 'username:rms...' and started, then moved the rest of the window in view to see 'username:rmstar'17:26
corvusmysql is idle17:26
fungiand done17:26
clarkbit reports done on the java side17:26
fungiexited 017:27
clarkbyup from what we can see it lgtm17:27
fungianything we need to check before proceeding with 2.15?17:27
clarkbI think we can proceed and just accept these will be slower. Then expect notedb to run overnight again17:27
fungi57m11.729s was the reported runtime17:27
clarkbya I put that on the etherpad17:28
fungiupdated compose file for 2.15, shall i pull?17:28
clarkbyes please pull17:28
fungiopendevorg/gerrit   2.15                bfef80bd754d        23 hours ago        678MB17:29
fungilooks right17:29
clarkbyup17:29
fungiready to init 2.15?17:29
clarkbI'm ready17:29
fungiit's running17:30
clarkbschema 144 is the writing to external ids in all users17:31
clarkb143 is opaque due to guice17:31
clarkbanyway I shall continue to practice patience17:32
* fungi finds a glass full of opaque juice17:32
clarkbthe java is very busy on 14417:33
clarkb(as expected given its writing back to git)17:33
fungihuh, it's doing a git gc now17:34
clarkbonly on all-users17:34
fungiof all-users i guess17:34
clarkbya17:34
mordredbusy busy javas17:34
clarkbyou still need it for everything else to speed up the reindexing aiui17:34
fungisure17:35
fungithis one's running long too, compared to our estimate17:38
fungibut i have a feeling we're still going to wind up on schedule when we get to the checkpoint17:38
clarkb151 migrates groups into notedb I think17:39
fungiwe baked in lots of scotty factor17:40
clarkbya I think it "helps" that there was no way we thought we'd get everything done in one ~10 hour period. So once we assume an overnight, being able to slot a very slow process in there makes for a lot of wiggle room17:40
clarkbmordred: you've just reminded me that mandalorian has a new episode today. I know what I'm doing during the notedb conversion17:42
clarkbbusy busy jawas17:42
mordredhaha. I'm waiting until the whole season is out17:42
fungiand done17:42
clarkbjust under 13 minutes17:42
fungi12m47.295s17:43
fungianybody want to check anything before i work on the 2.16 upgrade?17:43
clarkbI don't think so17:43
fungiproceeding17:44
fungigood to pull images?17:44
clarkb2.16 lgtm I think you should pull17:44
fungiopendevorg/gerrit   2.16                aacb1fac66de        24 hours ago        681MB17:44
fungialso looks right17:44
clarkbyup17:45
fungiready to init 2.16?17:45
clarkb++17:45
fungirunning17:45
fungitime estimate is 7 minutes, no idea how accurate that will end up being17:45
*** hashar has quit IRC17:46
* mordred is excited17:46
fungiafter this we have another aggressive git gc followed by an offline reindex, then we'll checkpoint the db and homedir in preparation for the notedb migration17:47
fungithis theoretically gives us a functional 2.16 pre-notedb state we can roll back to in a pinch17:47
clarkbthen depending on what time it is we'll do 3.0, 3.1, and 3.2 this evening or tomorrow17:47
fungiyup17:48
clarkbsort of related, I feel like notedb is sort of a misleading name. None of the db stuff lives in what git notes thinks are notes as far as I can tell17:49
clarkbit's just special refs17:49
clarkbthis had me very confused when I first started looking at the upgrade17:49
fungiyeah, i expect that was an early name which stuck around long after they decided using actual git notes for it was suboptimal17:50
*** hamalq has joined #opendev-meeting17:51
fungii think we'll make up some of the lost time in our over-estimate of the checkpoint steps17:53
fungiglad we weren't late starting17:54
clarkb++ I never want to wake up early but having the extra couple of hours tends to be good for buffering ime17:56
fungihappy to anchor the early hours while your tea and toast kick in17:56
fungiin exchange, it's your responsibility to take up my slack later when my beer starts to kick in17:57
clarkbha17:58
fungisporadic java process cpu consumption at this stage17:59
clarkbmigrations 168 and 170 are opaque due to guice. 169 is more group notedb stuff18:01
*** hamalq has quit IRC18:01
clarkbnot sure which one we are on now as things scrolled by18:02
clarkboh did it just finish?18:02
clarkboh interesting18:02
*** hamalq has joined #opendev-meeting18:02
clarkbthe migrations are done but now it is reindexing?18:02
fungino, i was scrolling back in the screen buffer to get a feel for where we are18:02
fungiit's been at "Index projects in version 4 is ready" for a while18:03
clarkbya worrying about whati t may be doing since it said 170 was done right?18:03
fungithough maybe it's logging18:03
fungiyeah, it got through the db migrations18:03
fungiand started an offline reindex apparently18:03
fungithere it goes18:03
fungidone finally18:03
clarkbya that was expected for projects and accounts and groups18:03
clarkbbecause accounts and groups and project stuff go into notedb but not changes18:04
fungi18m19.111s18:04
clarkbyup etherpad updated18:04
clarkbexit code is zero I think we can reindex18:04
fungiready to do a full aggressive git gc now?18:04
clarkber sorry not reindex18:04
clarkbgc18:04
clarkbgetting ahead of myself18:04
fungiyup18:05
fungiokay, running18:05
fungi41 minutes estimated18:05
clarkbthe next reindex is a full reindex because we've done skip level upgrades18:05
clarkbwith no intermediate online reindexing18:05
fungishould be a reasonably accurate estimate since no trove interaction18:05
clarkband we did one prior to the upgrades which was close in time too18:06
fungiyup18:07
*** gouthamr_ has quit IRC18:30
clarkbone thing the delete plugin lets you do which I didn't manage to have time to test is to archive repos18:30
clarkbit will be nice to test that a bit more for all of the dead repos we've got and see if that improves things like reindexing18:30
*** yoctozepto has quit IRC18:37
*** yoctozepto has joined #opendev-meeting18:38
clarkbdown to nova and all users now18:40
fungiyup18:41
*** gouthamr_ has joined #opendev-meeting18:46
fungidone18:48
clarkblooks happy18:48
clarkbtime for the reindex now?18:48
fungianything we should check before starting the offline reindex?18:48
clarkbI don't think so. UNless you want to check file perms again18:48
fungiwe want to stick with 16 threads?18:48
clarkbyes18:49
clarkbI think so anyway18:49
fungifile perms look okay still18:49
clarkbone of the things brought up on the gerrit mailing list is that threads for these things use memory and if you overdo the threads you oom18:49
clarkbso sticking with what we know shouldn't oom seems like a good idea18:49
clarkbit's 24 threads on the notedb conversion but 16 on reindexing18:49
fungiyeah, i'm fine with sticking with the count we tested with18:49
fungiokay, it's running18:50
fungiestimated time to completion is 35 minutes18:50
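(A hedged sketch of the offline reindex just started; the thread count is from the discussion, the war and site paths and any container wrapping are assumptions:)

    time java -jar gerrit.war reindex -d /var/gerrit/review_site --threads 16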
fungigc time was ~43 minutes so close to our estimate. i didn't catch the actual time output18:51
clarkboh I didn't look, I should've18:52
fungifor those watching the screen session, the java exceptions are about broken changes which are expected18:52
clarkbya we reproduced the unhappy changes on 2.13 prod18:52
clarkbit's just that newer gerrit complains more18:52
fungistems from some very old/early history lingering in the db18:53
clarkbit is about a quarter of the way through now so on track for ~40 minutes19:01
fungifairly close to our estimate in that case19:01
clarkbjust over 50% now19:14
clarkbjust crossed 90%19:32
clarkbdown to the last hundred or so changes to index now19:37
fungiand done19:37
clarkb~48 minutes19:38
fungi47m51.046s yeah19:38
clarkb2.16 db dump now?19:38
fungiyup, ready for me to start it?19:38
clarkbyes I am19:38
fungiand it's running19:39
clarkbthen we backup again, then start the notedb offline transition19:39
clarkbsuch excite19:39
fungiit all over my screen19:42
fungi(literally)19:42
ianwo/19:43
ianwsounds like it's going well19:43
clarkbianw: slower than expected but no major issues otherwise19:43
* fungi hands everything off to ianw19:43
fungi[just kidding!]19:44
clarkbwe're at our 2.16 checkpoint step. backing up the db then copying gerrit2 homedir aside19:44
clarkbthe next step after the checkpoint is to run the offline notedb migration19:44
* ianw recovers after heart attack19:44
fungiyeah, i think we're basically on schedule, thanks to minor miracles of planning19:44
clarkbwhich is good because I'm getting hungry for lunch and the notedb migration step is a perfect time for that :)19:44
fungiother than the trove instance being slower than what we benchmarked with review-test, it's been basically uneventful. no major issues, just tests of patience19:45
ianwclarkb: one good thing about being in .au is the mandalorian comes out at 8pm19:45
clarkbianw: hacks19:45
* fungi relocates to a different hemisphere19:46
fungii hear there are plenty of island nations on that side of the equator which would be entirely compatible with my lifestyle19:47
clarkbinternet connectivity tends to be the biggest issue19:48
fungii can tolerate packet loss and latency19:48
fungiokay, db dump is done19:49
fungirsync next19:49
fungiready to run?19:49
clarkblet me check the filesize19:49
clarkbstill 1.7gb lgtm19:50
clarkbI think you can run the rsync now19:50
fungioh, good call, thanks19:50
fungirunning19:50
fungithe 10 minute estimate there is very loose. could be more like 20, who knows19:50
clarkbwe'll find out :)19:50
fungiif it's accurate, puts us right on schedule19:51
fungiand done!20:01
fungi10m56.072s20:01
fungireasonably close20:01
corvus\o/20:01
clarkbonly one minute late20:01
corvushopefully not 10% late20:01
clarkbwell one minute against the estimated 10 minutes but also ~20:00 UTC was when I guessed we would start the notedb transition20:02
fungiokay, notedb migration20:02
fungianything we need to check now, or ready to start?20:03
clarkbjust that the command has the -Xmx value (which it does) and the threads are overridden (they are). I can't think of anything else to check since we aren't starting 2.16 and interacting with it20:03
clarkbI think we are ready to start notedb migration20:03
fungiokay, running20:04
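(A hedged sketch of the offline notedb migration being kicked off here; the heap size and the way the 24-thread override is supplied are assumptions drawn from the surrounding discussion:)

    time java -Xmx20g -jar gerrit.war migrate-to-note-db -d /var/gerrit/review_site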
fungieta for this is 4.5 hours20:04
fungino idea if it will be slower, but seems likely?20:04
fungithat will put us at 00:35 utc at the earliest20:05
clarkbwe should check it periodically too  just to be sure it hasn't bailed out20:05
fungii can probably start planning friday dinner now20:05
clarkb++ I'm going to work on lunch as well20:05
clarkbalso the docs say this process is resumable should we need to do that20:05
clarkbI don't think we tested that though20:05
ianwis this screen logging to a file?20:06
fungiyeah, it always worked in the tests i observed20:06
fungiianw: no20:06
fungii can try to ask screen to start recording if you think that would be helpful20:06
ianwmight be worth a ctrl-a h if you like, ... just in case20:06
clarkbwhat does that do?20:07
clarkb(I suspect I'll learn something new about screen)20:07
ianwactually it's a capital-H20:07
fungidone. ~root/hardcopy.0 should have it20:07
ianwclarkb: just keeps a file of what's going on20:07
fungiokay, ~root/screenlog.0 now20:07
clarkbTIL20:08
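(For reference, the screen bindings being discussed, both typed inside the attached session; the last line is an alternative for starting a fresh session with logging already on:)

    # Ctrl-a h   write a one-off snapshot of the window to hardcopy.N
    # Ctrl-a H   toggle continuous logging to screenlog.N (what was enabled here)
    screen -L -S maint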
clarkbalright I'm going to find lunch now then will check in again20:08
fungiit's mostly something i've accidentally hit in the past and then later had to delete, though i appreciate the potential usefulness20:08
fungifor folks who haven't followed closely, this is the "overnight" step, though if it completes at the 4.5 hour estimate (don't count on it) i should still be around to try to complete the upgrades20:35
fungithe git gc which follows it is estimated at 1.5 hours as well though, which will be getting well into my evening at that point20:36
clarkbya as noted on the etherpad I kind of expected we'd finish with the gc then resume tomorrow20:37
clarkbthat gc is longer because it packs all the notedb stuff20:37
fungiif both steps finish on schedule somehow, i should still be on hand to drive the rest assuming we don't want to break until tomorrow20:37
clarkbya I can be around to push further if you're still around20:38
fungithe upgrade steps after the git gc should be fast20:38
fungithe real risk is that we turn things back on and then there are major unforeseen problems while most of us are done for the day20:38
corvusclarkb, fungi: etherpad link?20:38
fungihttps://etherpad.opendev.org/p/opendev-gerrit-3.2-upgrade-plan20:38
corvus#link https://etherpad.opendev.org/p/opendev-gerrit-3.2-upgrade-plan20:39
clarkbya I don't think we turn it on even if we get to that point20:39
fungiooh, thanks for remembering meetbot is listening!20:39
clarkbbecause we'll want to be around for that20:39
fungii definitely don't want to feel like i've left a mess for others to clean up, so am all for still not starting services up again until some point where everyone's around and well-rested20:40
corvuswe might be able to get through the 3.2 upgrade tonight and let it sit there until tomorrow20:41
fungithat seems like the ideal, yes20:41
corvuslike stop at 5.1720:41
fungisgtm20:41
corvus(i totally read that as "stop at procedure five decimal one seven")20:42
clarkbya I think that would be best.20:43
clarkbfun fact: this notedb transion is running with the "make it faster" changes too20:43
clarkbs/transion/migration/20:44
fungii couldn't even turn on the kitchen tap without filling out a twenty-seven b stroke six, bloody paperwork20:44
clarkbI got really excited about those changes then realized we were already testing with them20:44
clarkbhrm the output indicates we may be closer to finishing than I would've expected?21:17
clarkbTotal number of rebuilt changes 757000/760025 (99.6%)21:17
fungii'm not falling for it21:17
clarkbya it's possible there are multiple passes to this or something21:18
clarkbthe log says it's switching primary to notedb now21:18
clarkbI will continue to wait patiently but act optimistic21:18
clarkboh ya it is a multipass thing21:21
clarkbI remember now that it will do another reindex21:21
clarkbbuilt in to the migrator21:21
clarkbgot my hopes up :)21:21
clarkb[2020-11-20 21:21:59,798] [RebuildChange-15] WARN  com.google.gerrit.server.notedb.PrimaryStorageMigrator : Change 89432 previously failed to rebuild; skipping primary storage migration21:22
clarkbthat is the cause of the traceback we see21:23
clarkb(this was expected for a number of changes in the 10-20 range)21:23
ianwdon't know why kswapd0 is so busy21:40
clarkbya was just going to mention that21:40
clarkbwe're holding steady at ~500mb swap use and have ~36gb memory available21:40
clarkbbut free memory is only ~600mb21:40
ianwi've seen this before and a dop_caches sometimes helps21:40
clarkbdop_caches?21:41
ianwecho 3 > /proc/sys/vm/drop_caches21:41
fungidope caches dude21:42
clarkb"This is a non-destructive operation and will only free things that are completely unused. Dirty objects will continue to be in use until written out to disk and are not freeable. If you run "sync" first to flush them out to disk, these drop operations will tend to free more memory. " says the internet21:43
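(The sequence being quoted, for reference; it was considered here but ultimately not run. It only drops clean page/dentry/inode caches and does not touch process memory:)

    sync
    echo 3 > /proc/sys/vm/drop_caches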
* fungi goes back to applying heat to comestible wares21:43
corvusdo we want to clear the caches?21:43
clarkbpresumably gerrit/java/jvm will just reread what it needs back itno the kernel caches when it needs them?21:44
clarkbwhether or not that will be a problem I don't know21:44
corvusi guess that might avoid having the vmm write out unused pages to disk because more ram is avail?21:44
ianwyeah, this has no effect on userspace21:44
ianwwell, other than temporal21:45
corvusexcept indirectly21:45
corvus(iow, if we're not using caches because sizeof(git repos)>sizeof(ram) and it's just churning, then this could help avoid it making bad decisions; but we'd probably have to do it multiple times.)21:45
corvus(if we are using caches, then it'll just slow us down while it rebuilds)21:46
ianw2019-08-2721:47
ianw  * look into afs server performance; drop_caches to stop kswapd0,21:47
ianw    monitor21:47
ianwthat was where i saw it going crazy before21:47
corvusianw, clarkb: i think with near zero iowait and low cpu usage i would not be inclined to drop caches21:48
clarkbthe buff/cache value is going up slowly as the free value goes down slowly. But swap usage is stable21:48
clarkbcorvus: that makes sense to me21:48
corvuscould this be accounting for the cpu time spent performing cache reads?21:48
clarkbI'm not sure I understand the question21:49
corvusi don't know what actions are actually accounted for under kswapd21:49
ianwhttps://www.suse.com/support/kb/doc/?id=000018698 ; we should check the zones21:50
corvusso i'm wondering if gerrit performing a bunch of io and receiving cache hits might cause cpu usage under kswapd21:50
corvusianw: If the system is under memory pressure, it can cause the kswapd scanning the available memory over and over again. This has the effect of kswapd using a lot of CPU cycles.21:50
corvusthat sounds plausible21:50
clarkbmy biggest concern is that the "free" number continues to fall slowly21:53
clarkbdo we think the cache value may fall on its own if we start to lose even more "free" space?21:53
corvusclarkb: i think that's the immediate cause for kswapd running, but there's plenty of available memory because of the caches21:53
clarkbcorvus: I see, so once we actually need memory it should start to use what is available?21:54
clarkbah yup free just went back up to 55421:54
clarkbfrom below 500 (that's MB)21:54
clarkbso ya I think your hypothesis matches what we observe21:54
corvusclarkb: yeah; i expect free to stay relatively low (until java exits)21:54
corvusbut it won't cause real memory pressure because the caches will be reduced to make more room.21:55
clarkbin that case the safest thing may be to just let it run?21:55
corvusi think so; if we were seeing cpu or io pressure i would be more inclined to intervene, but atm it may be working as designed.  no idea if we're benefitting from the caches on this workload, but i don't think it's hurting us.21:56
corvusthe behavior just changed21:57
corvus(all java cpu no kswapd)21:57
clarkbit switched to gc'ing all users21:58
clarkbthen I think it does a reindex21:58
ianwyeah i think that dropping caches is a way to short-circuit kswapd0's scan basically, which has now finished21:59
clarkbthis is all included in the tool (we've manually done it in other contexts too, just clarifying that it is choosing these things)21:59
*** sboyron has quit IRC22:02
fungialso with most of this going on in a memory-preallocated jvm, it's not clear how much fiddling with virtual memory distribution within the underlying operating system will really help22:06
clarkbfungi: that 20GB is spoken for though aiui22:06
clarkbwhich is about 1/3 of our available memory22:07
clarkb(we should have plenty of extra)22:07
clarkbI think this gc is single threaded. When we run the xargs each gc gets 16 threads and we do 16 of them22:08
clarkbwhich explains why this is so much slower. I wonder if jgit gc isn't multithreaded22:09
clarkbkids are out of school now. I may go watch the mandalorian now if others are paying attention22:10
clarkbI'll keep an eye on irc but not the screen session22:10
clarkbI just got overruled, great british bakeoff is happening22:12
fungii feel for you22:19
fungiback to food-related tasks for now as well22:20
* fungi finds noel fielding to be the only redeeming factor for the bakeoff22:39
corvusi've never seen a bakeoff, but i did recently acquire a pet colony of lactobacillus sanfranciscensis22:40
* fungi keeps wishing julian barratt would appear and then it would turn into a new season of mighty boosh22:40
fungii have descendants of lactobacillus newenglandensis living in the back of my fridge which come out periodically to make more sandwich bread22:42
corvusfungi: i await the discovery of lactobacillus fungensis.  that won't be confusing at all.22:43
fungiit would be a symbiosis22:45
ianwapropos the Mandalorian, the planet he's trying to reach is Corvus22:45
fungithe blackbird!22:45
* clarkb checked between baking challenges, it is on to reindexing now22:47
clarkbiirc the reindexing is the last step of the process22:47
clarkbit is slower than the reindexing we just did. I think because we just added a ton of refs and haven't gc'd but not sure of that22:48
corvusianw: wow; indeed i was looking up https://en.wikipedia.org/wiki/Corvus_(constellation) to suggest where the authors may have gotten the idea to name a planet that and google's first autocomplete suggestion was "corvus star wars"22:51
ianwmy son's obsessions are Gaiman's Norse Mythology, with odin's ravens, and the Mandalorian, who is going to Corvus, and I have a corvus at work22:51
corvusianw: you have corvids circling around you22:52
ianw(actually he's obsessed with thor comics, but i told him he had to read the book before i'd start buying https://www.darkhorse.com/Comics/3005-354/Norse-Mythology-1 :)22:53
corvuswow the radio 4 adaptation looks fun: https://en.wikipedia.org/wiki/Norse_Mythology_(book)22:55
fungiyou'll have to enlighten me on gaiman's take on norse mythology, i read all his sandman comics (and some side series) back when they were in print, but he was mostly focused on greek mythology at the time22:57
fungiclearly civilization has moved on whilst i've been dreaming22:58
fungii think i have most of sandman still in mylar bags with acid-free backing boards22:59
fungidelirium was my favorite character, though she was also sort of a tank girl rip-off23:01
ianwfungi: it's a very easy read book, a few chuckles23:04
ianwyou would probably enjoy https://www.audible.com.au/pd/The-Sandman-Audiobook/B086WR6FG823:05
ianwhttps://www.abc.net.au/radio/programs/conversations/neil-gaiman-norse-mythology/12503632 is a really good listen on the background to the book23:06
funginow i'm wondering if there's a connection with dream's raven "matthew"23:09
clarkboof only to 5% now. I wonder if this reindex will expand that 4.5 hour estimate23:11
* clarkb keeps saying to himself "we only need to do this once so it's ok"23:12
fungifollow it up with "so long as it's done when i wake up we're not behind schedule"23:12
corvusit is all out on the cpu23:12
corvuswe have 16 cpus and our load average is 1623:12
clarkbya it's definitely doing its best23:12
fungisounds idyllic23:12
clarkbideally we get to run the gc today too, I can probably manage to hit the up arrow key a few times in the screen and start that if it's too late for fungi :)23:19
clarkbbut ya as long as that's done before tomorrow we're still doing well23:19
clarkbs/before/by/23:19
fungiyeah, if this ends on schedule i should have no trouble initiating the git gc, but...23:20
clarkbif this pace keeps up it's actually on track for ~10 hours from now? that's rough napkin math, so I may be completely off23:25
clarkbalso if I remember correctly it does the biggest projects first then the smaller ones so maybe the pace will pick up as it gets further23:26
clarkbsince the smaller projects will have fewer changes (and thus refs) impacting reindexing23:26
clarkbanyway it's only once and should be done by tomorrow :)23:26

Generated by irclog2html.py 2.17.2 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!