Thursday, 2023-11-16

clarkbhttps://213.32.76.236:3081/opendev/system-config this is the held node and it confirms the concern that we have to reorganize our image assets00:31
clarkbI won't be able to dig into that today00:31
clarkbbut yay for testing00:31
tonybyay01:55
opendevreviewRoman Kuznecov proposed zuul/zuul-jobs master: tox: Separate stdout and stderr in getting siblings  https://review.opendev.org/c/zuul/zuul-jobs/+/90107214:37
opendevreviewRoman Kuznecov proposed zuul/zuul-jobs master: tox: Separate stdout and stderr in getting siblings  https://review.opendev.org/c/zuul/zuul-jobs/+/90107214:39
opendevreviewRoman Kuznecov proposed zuul/zuul-jobs master: tox: Do not concat stdout and stderr in getting siblings  https://review.opendev.org/c/zuul/zuul-jobs/+/90107214:50
opendevreviewNate Johnston proposed ttygroup/gertty master: Support SQL Alchemy 2.0  https://review.opendev.org/c/ttygroup/gertty/+/90116614:52
opendevreviewNate Johnston proposed ttygroup/gertty master: Support SQL Alchemy 2.0  https://review.opendev.org/c/ttygroup/gertty/+/90116614:53
corvusi'm going to gracefully stop ze01 and upgrade it so i can observe the new git repo behavior15:18
fungicorvus: thanks! what new behavior is that again? the shallowness?15:18
corvusi wouldn't use that word since it suggests the git "shallow clone" feature which is definitely not what's going on; but rather that we (a) don't checkout a workspace when cloning and (b) are more efficient setting refs15:20
fungioh, right i forgot about skipping the checkout15:22
fungiand yeah, the ref replication improvements aren't really shallow in the clone sense, it's shallow copying?15:22
fungii'll check what terminology ended up in the release note15:23
fungi"thin" (not shallow) per the commit message15:24
corvusyeah i tried to find a new word.  and i used it as a verb too.  :)15:25
corvusbecause we're not changing the result, just the process.15:25
fungiwfm, it's great for clarity15:26
opendevreviewClark Boylan proposed opendev/system-config master: Update gitea to 1.21.0  https://review.opendev.org/c/opendev/system-config/+/89767916:22
opendevreviewClark Boylan proposed opendev/system-config master: DNM intentional gitea failure to hold a node  https://review.opendev.org/c/opendev/system-config/+/84818116:22
clarkbI'm going to rotate the autohold for ^16:22
fungicool, i'm disappearing for lunch but should be back by 17:30 utc16:27
tonybI'm going to be out for the rest of the day.16:28
clarkblooks like I have extra reason to be up early tomorrow. Starship launch window opens at 5am local time16:46
corvuslet's hope that's the only source of guaranteed excitement :)17:01
corvusthere is a zuul sql db schema migration ready to merge that, at least in our case, should be treated as planned full downtime.  i have run the migration locally on my workstation and it takes 22 minutes.  i don't have a factor to translate that to an estimate on our particular database server; it seems like using a scotty factor of 2x might be a good idea for safety.  so we're looking at a 22-44 minute outage, which fits within the already17:13
corvusscheduled maintenance window for gerrit.  i propose that i take zuul down during the gerrit outage and perform the sql migration then.17:13
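[editor's note] The 22-44 minute figure above is just the locally measured migration time multiplied by the suggested 2x safety factor. A minimal sketch of that arithmetic, assuming nothing about the real database server; the function name `estimate_outage` is hypothetical and only illustrates the padding step:

```python
def estimate_outage(measured_minutes: float, safety_factor: float = 2.0) -> tuple[float, float]:
    """Return (best_case, padded) outage estimates in minutes.

    measured_minutes: runtime observed on a test machine (22 in the log).
    safety_factor: multiplier for unknown differences on the production
    server (the log suggests 2x); must be at least 1.
    """
    if measured_minutes < 0:
        raise ValueError("measured_minutes must be non-negative")
    if safety_factor < 1:
        raise ValueError("safety_factor must be at least 1")
    return measured_minutes, measured_minutes * safety_factor


low, high = estimate_outage(22)
print(f"plan for a {low:g}-{high:g} minute outage")
```

The padded upper bound is the number to compare against whatever maintenance window is available.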
corvusclarkb: fungi ^17:13
corvus(oh and ze01 finally exited, so i will be restarting it now with the new repo stuff)17:13
clarkbcorvus: we also haven't done a full restart since we updated the github api usage right? so unsure if that will take a number of iterations (but we don't expect it to at this point)17:14
clarkbfrom a gerrit upgrade perspective the gerrit upgrade process doesn't rely on zuul until we land the change to reflect what we've already done17:14
clarkband any downgrade that might happen would be manual outside of zuul as well17:14
clarkbI think for this reason I'm ok with it happening during the gerrit changes as they are well decoupled17:14
corvusclarkb: yeah, i believe we don't expect github to be a problem at this point (but also, we shouldn't actually need to do a full-reconfigure so we don't need to trigger the "list all branches on github" code path)17:15
corvus(but if something goes wrong, we might need to, so good to consider that for planning)17:15
clarkbcorvus: oh that is because we aren't clearing the zk data right?17:15
clarkbin my head shutting everything down still requires a from scratch rebuild of the configs but we cache in zk now so that isn't the case17:16
corvusyep17:17
fungiclarkb: corvus: i'm cool with a full zuul restart during or coordinated around the gerrit restart too17:42
clarkbfungi's message made it through the matrix bridge before it made it through the oftc network to my normal irc connection17:45
fungichristine has an eye appointment (only a few minutes from home) at 13:30 utc that i need to drive her to because she might not be able to see well enough after to drive herself back, they've been really slow in the past so there's a possibility i might be stuck working on my phone from a parking lot at 15:30 (i really hope not), but all that's to say don't count on me being 100% useful until17:46
fungilater in the maintenance17:46
corvusze01 seems to have produced the usual complement of results from jobs.  so far so good.17:51
clarkblogos are back on: https://104.130.127.229:3081/17:57
clarkbso ya I think that gitea change is now ready for review with the asterisk next to the ssh key length verification removal17:57
clarkbhappy to rotate keys first then do the upgrade17:57
fungiyeah, i think (without having closely reviewed changes yet) that week after next we should do the key rotation, check that replication is still working, upgrade gitea, and then force a full re-replication just to be sure17:59
clarkbwe will need another change to do the key updates on the gerrit side. I think we can land the change to add a new key to gitea first, then add that key to gerrit with a .ssh/config there selecting the key then restart gerrit to put it into use18:03
clarkbshouldn't be too difficult18:03
clarkbthe most difficult thing is deciding what key type and size we should use18:03
fungiyou might be surprised that, as a security professional, i consider that decision mostly irrelevant. newer algorithm, larger key, whatever. malfeasors won't be attacking our keys, they'll look for easier ways in regardless18:05
fungiwe should pick whatever makes sense and is simplest for long-term maintenance18:06
fungipeople who argue over key size or which algorithm is stronger than which other algorithm are missing the bigger security picture18:07
clarkbin that case I think I'd lean towards ed25519 since it has a single size? We'd only replace it if the entire algorithm/protocol is decided to be insecure vs doing an rsa key length extension every 5 years18:07
fungithat sounds fine to me18:07
corvushttps://tracing.opendev.org/trace/6b5bc808fec911b1abf607f6fc7ea37b has some of the new tracing info18:08
fungithis is quite literally one service we control talking to other services we control, and also restricted by firewall rules i think? we could just about do no encryption at all and not care18:08
fungithe only real concern is someone launching a middle-node attack between gerrit and gitea in order to inject or subvert git replication18:10
corvusis there an etherpad for tomorrow?21:29
tonybhttps://etherpad.opendev.org/p/gerrit-upgrade-3.821:30
fungicorvus: ^ that21:30
fungithanks tonyb!21:30
tonyb#nailedit21:31
corvuscool; i'm thinking of adding the zuul hosts to the emergency file now-ish, just to avoid any surprises when the change merges.  that means that changes to the tenant config file (ie, adding new projects) won't take effect.  does that sound reasonable?  should i wait till later?21:32
fungiclarkb: we want to do steps 1-2 (or 1-3) an hour before, so ~14:30 utc? or should it be earlier?21:32
corvus(the change = the schema migration change; it shouldn't auto-deploy, but it would if something went wrong overnight)21:32
fungicorvus: sounds fine to me21:33
fungii don't think we're merging anything new for those in the interim21:33
fungii expect we could even do all of step #2 nowish21:34
clarkbcorvus: seems fine to me22:32
clarkbfungi: I think an hour should be plenty22:32
corvusokay i'm editing emergency now22:32
corvuser, remind me -- can i put a group in here?  or is it just hostnames?22:34
clarkbcorvus: I've always just done hostnames. I'm not sure if it will recursively expand groups into the disabled group22:35
clarkbs/hostnames/the names that appear in the inventory/22:35
corvusif i'm reading this right, only one of the hosts in emergency is actually an ansible host22:37
corvusstoryboard-dev01.opendev.org22:37
corvusand yeah, i don't think recursive groups works22:37
clarkbcorvus: yes I think that file can use some clearing out22:37
corvus`ansible localhost -m debug -a 'var=groups["disabled"]'` is useful22:38
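[editor's note] The observation above is that an emergency-file entry only has an effect if it matches a name that actually appears in the inventory, and that group names are probably not expanded recursively. A rough sketch of that matching idea, assuming plain exact-name comparison; the function `split_emergency_entries` and all sample names except storyboard-dev01.opendev.org are hypothetical, and this does not model how Ansible itself resolves groups:

```python
def split_emergency_entries(entries, inventory_hosts):
    """Split emergency-file entries into those that match an inventory
    host name exactly and those that match nothing.

    Group names are deliberately NOT expanded here, mirroring the
    suspicion in the log that nested groups are not resolved.
    """
    known = set(inventory_hosts)
    effective = [name for name in entries if name in known]
    ignored = [name for name in entries if name not in known]
    return effective, ignored


# Hypothetical sample data for illustration only.
inventory = ["storyboard-dev01.opendev.org", "host-a.example.org"]
emergency = [
    "storyboard-dev01.opendev.org",   # real inventory host, so it is disabled
    "retired-host.example.org",       # no longer in the inventory, so stale
    "some-group",                     # a group name, not expanded
]

effective, ignored = split_emergency_entries(emergency, inventory)
print("effective:", effective)
print("ignored:", ignored)
```

The `ansible localhost -m debug` command quoted above remains the authoritative check, since it reports the `disabled` group as Ansible actually resolves it.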
corvusi left notes in that file; maybe someone can double check that and we can clean it up later22:40
corvus#status log added zuul hosts to ansible emergency file to prepare for 2023-11-17 maintenance22:41
opendevstatuscorvus: finished logging22:41
clarkb++ to cleanup but let's do that after we're done with tomorrow's fun :)22:44
corvusyep22:49
