Thursday, 2021-03-04

clarkbfungi: my utility venv was still python2 so I've had to switch that over to 3 now :)00:07
clarkbinfra-prod service zuul is running now. So should be able to remove whenever we like. At this rate it will likely be tomorrow morning for me to get to that00:16
clarkbcool if we say that accounts without usernames or sshkeys and no reviews in the last year are fair game to retire. We should be able to cleanup another 180ish00:29
openstackgerritClark Boylan proposed opendev/system-config master: Add tools being used to make sense of gerrit account inconsistencies
clarkbthat contains my latest updates to check usernames and ssh keys00:31
ianwafs01.ord is upgraded.  since nothing has made like a spacex rocket and seemed to work but then exploded, i'll assume it's ok :)00:31
clarkbI didn't catch it live but watched the replay which ended before it had its big kaboom00:32
clarkbthen later found out it also had a sad00:32
clarkbI've uploaded the results of the sshkeys and username checking to review if other infra-root want to look that over00:35
ianwclarkb: in your homedir?00:36
clarkbyes under gerrit_user_cleanups00:37
clarkbit's the file with today's (yesterday for you and utc) timestamp suffix00:38
clarkbI think the biggest risk there is if there are two accounts and one is being used for ssh (code pushes) and the other is used for http (code reviews) we can end up preventing people from logging in via http and doing reviews. However the recency check on code reviews should guard against that for any current users00:43
ianwhow did they not choose a username?  didn't finish the signup?00:43
clarkbianw: ya, I suspect for many gerrit gave them a new account and either they sorted out how to keep using the old account and just deal with it, we fixed things enough for them to use old account, or they used one for ssh and one for http00:44
clarkbI think if we couple this with the recency check we can avoid the vast majority of issues, then hopefully if anyone shows up and needs account fixing we can do that on a case by case basis once we've addressed the consistency issues00:45
clarkbI do think we're fast approaching where we're going to have to accept we may get things wrong though :/00:45
clarkbthat is why I think setting accounts inactive for a bit first then following up with the external id cleanups is probably a good idea. We can also send people email and tell them feel free to ignore this unless you expect to use this account again and if so tell us00:46
clarkbthen put anyone who responds aside00:46
ianwyeah i don't think a pre-emptive email is a bad thing00:46
clarkbya maybe we refine this ssh list down a bit (there are a few which appear odder than most like the 6 account case). Then fungi can help me sort out how to send a bunch of emails00:47
clarkbI suspect that the vast majority will not care as many of these do appear to be long ago contributors00:51
openstackgerritIan Wienand proposed opendev/system-config master: system-config-roles: only match jobs on roles tested
openstackgerritIan Wienand proposed opendev/system-config master: Remove obsolete Bazel spawn strategies
clarkbbut asking nicely first is a good way to ensure we don't inadvertently ruin someone's day00:51
openstackgerritMerged openstack/project-config master: Add custom cirros image with ahci module enabled to cache
fungiianw: a while back gerrit stopped requesting (or perhaps lp stopped handing over) a username, so users have to enter one manually now01:01
fungiif they never use the ssh or rest apis to do anything, like maybe they're only reviewing changes with the account, then they may be actively using it even though they have no username01:02
* fungi sighs... privmsg spam has started back up again01:02
openstackgerritIan Wienand proposed opendev/system-config master: system-config-roles: only match jobs on roles tested
openstackgerritIan Wienand proposed opendev/system-config master: Remove obsolete Bazel spawn strategies
openstackgerritKendall Nelson proposed openstack/project-config master: Add New Repo for StoryBoard-vue
ianwwell this is ... crap ... it seems that afs01.dfw is having issues creating its vicepa ... seems like some of the partitions aren't responding03:16
ianw"failed to start LVM2 PV scan for device 202:145"03:17
fungilike cinder volumes?03:17
fungioh yeah03:17
ianwit would seem to me that one of the PV's isn't responding, yeah they are all attached volumes03:17
ianwthere are three other "PV scans" that worked03:18
fungiit's also refusing ssh for me03:18
ianwthis has 5 volumes03:18
ianwyeah, it's stuck in early boot, i can only see via emergency console03:18
fungiahh, got it03:18
ianwif i detach these volumes, it seems like they will reattach as different devices03:19
fungiwell, that's fine03:19
ianwi *think* lvm puts a uuid on them, right03:19
ianwso it won't care if they move around03:19
fungilvm2 writes headers to all of them, yep03:19
fungithat's what pvcreate does, in fact03:19
ianwall the volumes are green, and listed as attached03:19
ianwi'm not sure what else to do :/03:20
fungiit's possible they live migrated the server instance and the device xmls for libvirt got all screwy03:20
fungiyou say you've tried detaching and reattaching with the server instance powered off?03:21
ianwinterestingly "shutoff" seems to have not made the cut for the webui03:24
fungii think openstack server poweroff will do it though03:25
ianwyep, i've stopped it via that now03:26
ianwi've detached all 5 volumes, and will try reattaching03:28
ianwthis is not looking good03:31
fungican't attach?03:34
ianwthis host really doesn't want to boot03:35
ianw"A start job is running for LVM2 PV scan on ..." and it keeps looping through things03:35
fungimight be better to try booting it with the cinder volumes detached, then hot attach them after03:36
ianwhrm, ok03:36
ianwi've unattached everything and am trying booting it03:39
ianwit still has a start job for dev-main-vicepa.device03:41
ianw"reached target network (pre)" and sitting there03:42
ianwnetworking issues could explain both unattached volumes and this i guess...03:42
ianwhrm, it's gone to an emergency shell prompt03:43
ianwtrying to boot it into default target ...03:44
ianwi wonder if this is all based off fstab; if i use the emergency attach thing to go in and modify the fstab on disk maybe it will boot03:45
fungiyeah, i think since systemd things get dodgy if you have a filesystem set to mount at boot in fstab which can't actually mount03:46
fungipreviously mount would timeout/give up and you could still at least try to boot up the rest of the way03:47
ianwin good news, every other afsdb/afs server has rebooted just fine :/03:49
ianwok, i edited out the mount03:56
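The fstab edit described here (disabling the /vicepa mount so systemd stops blocking boot on it) can be sketched as below. This is only an illustration: the helper name `comment_out_mount` is hypothetical, and the device path and filesystem type in the sample are assumptions, not copied from the server.

```python
def comment_out_mount(fstab_text: str, mount_point: str) -> str:
    """Return fstab text with any active entry for mount_point commented out."""
    out = []
    for line in fstab_text.splitlines():
        fields = line.split()
        # An active entry has at least device and mount point fields
        # and is not already a comment.
        is_entry = len(fields) >= 2 and not line.lstrip().startswith("#")
        if is_entry and fields[1] == mount_point:
            line = "#" + line
        out.append(line)
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    # Sample content only; the real fstab on the host may differ.
    sample = (
        "/dev/xvda1 / ext4 defaults 0 1\n"
        "/dev/main/vicepa /vicepa ext4 defaults 0 2\n"
    )
    print(comment_out_mount(sample, "/vicepa"), end="")
```

Commenting the line out rather than deleting it keeps the original entry around for when the mount is re-enabled.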
ianwok, exiting rescue mode ...03:58
ianwalright, the host is back04:00
ianwbut with no storage obviously04:00
ianwlet me try attaching and see what it thinks04:00
ianwb,c,d,f,g are attached04:02
ianwblkid only shows three drives however04:02
ianwb,c,d it shows.  f & g not showing04:03
fungithat's certainly not what i'd expect04:03
ianwpvscan has just gone into something unkillable04:04
ianwthere's no /dev/xvdg104:05
ianwthis is
fungigot it, i wonder if it's just that volume or both which actually have problems04:06
ianwit took forever to respond, but it did04:09
ianwit's possible xvdg doesn't have a lvm partition but is just a raw disk04:09
ianwwe have /dev/mapper/main-vicepa now ...04:09
ianwi'm not sure if i trust mounting it ...04:10
ianwwell i did.  it seems to be there04:11
fungiwe can probably work out from the lvm config which devices (by uuid) are actually used to assemble04:11
fungior vgs --verbose (i think) should show which pvs are used04:11
ianwmain05 / xvdg5 was the extra area for snapshots i added when we did the 1.8 upgrade04:12
ianwi think we should remove that snapshot and remove the xvdg / main05 volume04:14
fungiahh, is it part of the main vg?04:14
fungii'm switching rooms but should be able to jump on there in just a sec04:14
ianwbut the lv is vicepa_snap04:14
ianwfungi: i think it's in a stable state ... though i'm not sure why it was failing04:15
ianwit is upgraded to bionic, vicepa is mounted and openafs is running04:15
fungii'd hesitate to reboot it if we uncomment /vicepa in fstab04:15
ianwthat's still commented, only manually mounted04:15
fungiuntil we can work out what's choking it04:16
ianwit was definitely xvdg being very slow ... pvscan took like 5 minutes but now seems ok04:16
fungiwe should probably try scaling back of the main05 volume, but that can wait until i get some sleep, i suppose04:17
ianwi think we should leave it given both our time constraints now04:17
ianwtomorrow we can delete the snapshot volume and get it back to a regular 4tb array04:17
fungiyep, full agreement from me04:17
fungithanks for working through that!04:17
ianwok, i will keep an eye but not planning on any more excitement now :)04:18
fungisounds good, thanks again04:20
*** lpetrut has quit IRC04:47
openstackgerritIan Wienand proposed opendev/system-config master: Remove obsolete Bazel spawn strategies
openstackgerritIan Wienand proposed opendev/system-config master: gerrit docker: match some more files
*** marios has joined #opendev06:07
*** ykarel_ is now known as ykarel06:18
*** ykarel_ has joined #opendev06:34
*** ykarel has quit IRC06:36
*** lpetrut has joined #opendev07:22
*** ralonsoh has joined #opendev07:37
openstackgerritAndreas Jaeger proposed zuul/zuul-jobs master: cabal-test: add install_args and build_args role var
*** lpetrut has quit IRC07:56
*** ykarel_ is now known as ykarel08:13
*** lpetrut has joined #opendev08:51
*** ykarel has quit IRC09:23
*** zoharm has joined #opendev09:33
*** roman_g has joined #opendev09:58
ianwfungi/clarkb: forgot to mention, mirror-update is still off to prevent any releases happening and getting corrupt volumes10:16
ianwmy plan is to remove the lvm snapshot / /dev/xvdg device asap, then complete the upgrade to focal (has gone fine on the other two servers)10:16
ianwif there's urgent need, feel free to do that, or even just turn mirror-update back on -- it *should* be fine10:17
ianwi'm not going to do anything at this stage, i'm more likely to make a mistake than get it done correctly :)10:17
openstackgerritIan Wienand proposed opendev/system-config master: gerrit docker: match some more files
openstackgerritIan Wienand proposed opendev/system-config master: Remove obsolete Bazel spawn strategies
openstackgerritBharat Kunwar proposed openstack/project-config master: [magnum] Add Backport-Candidate and Review-Priority labels
*** ykarel has joined #opendev10:30
openstackgerritOleksandr Kozachenko proposed opendev/base-jobs master: Update post-logs playbook
openstackgerritMaksim Malchuk proposed openstack/diskimage-builder master: Don't use hardcode while override base image file
openstackgerritMaksim Malchuk proposed openstack/diskimage-builder master: Fix hooks order for CentOS/Fedora when mirror used
*** artom has joined #opendev11:37
openstackgerritMoshiur Rahman proposed openstack/diskimage-builder master: Fix: IPA image buidling with OpenSuse.
*** fressi has quit IRC15:23
openstackgerritMoshiur Rahman proposed openstack/diskimage-builder master: Fix: IPA image buidling with OpenSuse. This PR is also related to the following PR in Ironic-python-agent-builder: Change-Id: Id2759be29bfcbf2ecf1ce67e171686924b506b1a
clarkbianw: fungi: thank you for working through that, let me know if I can help otherwise the plan to keep mirror update idle for now makes sense to me as well as cleaning up the snapshot volume16:10
fungii'm planning to try to pick it back up once meetings calm down16:13
fungirevisiting the gentoo image situation, looks like we've got uploaded images from 4 days ago and today, so it's ~roughly working again i guess?16:15
fungihowever, dib-image-list also mentions what i think must be orphaned records from before the builders got replaced16:16
fungitwo entries from nearly a year ago16:16
clarkbfungi: they may also be held in clouds that are preventing us from deleting the image16:16
fungii wonder what the best way is to clean those up (they're not uploaded to any clouds at this point, according to image-list)16:16
fungithe builder which held them no longer exists16:17
clarkbfungi: image-list is paginated, you should image show the image uuids16:17
clarkbbut ya if not in any clouds, then probably need to manually remove the records from zk16:17
clarkbsince there won't be a builder around to do that for us (only the builder that built an image can fully delete it iirc)16:18
fungiconfirmed, those image ids also don't appear in image-list16:18
clarkbfungi: image-list or show?16:18
clarkbbe very careful with image list because it will only show you like 50 images16:18
funginodepool image-list16:18
fungioh! i didn't realize that16:18
clarkboh nodepool image list, I thought you meant glance image list16:19
clarkbnodepool will not paginate16:19
fungino, not glance16:19
clarkbbut glance will16:19
funginodepool dib-image-list mentions the old entries, nodepool image-list does not show them uploaded anywhere16:19
fungiso presumably they're not in use for booting nodes any longer16:19
fungii think they were still current at the time their builder was taken out of the pool16:20
clarkbya in that case I think we may have to manually remove the zk records for the builds16:20
fungii'll try to remember to take care of that in the near future, but if anyone else is in zk-shell at some point cleaning up anything else keep in mind those can be cleared out16:20
clarkbfungi: re 'maybe they're only reviewing changes with the account, then they may be actively using it even though they have no username' yup, that is why we're also cross checking against reviewedby:foo after:year-ago16:31
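A rough sketch of the recency cross-check mentioned here, assuming the audit script issues one Gerrit change search per account. `reviewedby:` and `after:` are Gerrit search operators; the helper names, the account id, the base URL, and the 365-day window are illustrative assumptions rather than details of the actual script.

```python
from datetime import date, timedelta
from urllib.parse import urlencode


def recent_review_query(account_id: int, today: date, days: int = 365) -> str:
    """Build a Gerrit search for changes the account reviewed since the cutoff."""
    cutoff = today - timedelta(days=days)
    return f"reviewedby:{account_id} after:{cutoff.isoformat()}"


def changes_url(base_url: str, query: str) -> str:
    """Return a REST search URL for the query; one result is enough to prove activity."""
    return f"{base_url.rstrip('/')}/changes/?{urlencode({'q': query, 'n': 1})}"


if __name__ == "__main__":
    # 1234 and gerrit.example.org are placeholders, not real values.
    q = recent_review_query(1234, date(2021, 3, 4))
    print(q)
    print(changes_url("https://gerrit.example.org", q))
```

An empty result for this query would mean no reviews in the window, which is the signal being combined with the missing username and ssh keys.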
clarkbI'm about to dig into the data the audit script returned a bit more to try and understand how some of these accounts ended up in this situation16:31
clarkbcuriously for a non zero number of emails with conflict they never set a username or ssh keys on any of the associated accounts16:33
clarkbmost of them have an account with ssh credentials and one without16:33
clarkbwith those I worry about the scenario where one could be used for reviews and the other for pushes and I should be able to see evidence of that once I've settled in and start looking at what gerrit says16:34
clarkbya first insight digging in is that some of these accounts have never reviewed anything16:45
clarkbin addition to having no ssh keys or username16:45
clarkbfungi: I suspect that ^ those are actually quite safe to clean up. I'll work on modifying the audit script to differentiate between the two groups (no reviews ever vs no recent reviews)16:46
clarkbfungi: I'm finding cases where different openids report the same email address. do you know what sort of behavior may account for that?16:52
clarkb(maybe that will help us further narrow down accounts that can be cleaned up)16:53
clarkbdoesn't seem to be super common though, but maybe represents a set of accounts that can be simplified16:55
clarkbooh this may be promising, if you open the openid links it looks like login.ubuntu will indicate if the openid is valid or not?17:07
clarkbfungi: ^ this may be a golden ticket if I'm processing it correctly17:07
fungiclarkb: so, if you'll remember back, there was a time when our account management involved a "sync" from launchpad because of cla management happening there. it's quite likely a number of accounts were created by that sync but never used, and then the same people later created new accounts and added the same e-mail addresses to them17:09
fungiand the openid lookup trick does seem like a good way to find invalid ids, yes17:11
clarkbya I think these will be my two next improvements to the audit script. First is identifying where no reviews have ever been done, but also call out accounts without valid openids17:12
clarkbI'm testing some assumptions about the openid thing now though to see if that makes sense17:12
clarkbya it seems that valid accounts get an http 200 and invalid openids get http 40417:13
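The 200-versus-404 observation could be turned into a check along these lines. This is a sketch under assumptions: the function names are hypothetical, the URL is supplied by the caller, and any status other than 200 or 404 is treated as inconclusive rather than guessed at.

```python
import urllib.error
import urllib.request


def openid_status(url: str, timeout: float = 10.0) -> int:
    """Fetch the OpenID URL and return the HTTP status code."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urlopen raises on 4xx/5xx; the status code is still what we want.
        return err.code


def classify_openid(status: int) -> str:
    """Map an HTTP status to the categories discussed in the channel."""
    if status == 200:
        return "valid"
    if status == 404:
        return "invalid"
    return "unknown"


if __name__ == "__main__":
    for code in (200, 404, 503):
        print(code, classify_openid(code))
```

Network-level failures (for example a DNS lookup error) surface as `urllib.error.URLError` and are deliberately not caught here, so a transient outage is not mistaken for an invalid id.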
clarkband if we know there is no valid openid and no username and no ssh keys that account should be 100% safe to clean up?17:14
clarkbalso in doing some manual digging I found at least one account's openid is named foo-not-used and has never pushed or reviewed code. I'm going to just keep notes for accounts like this one and the tripleo one as "these can be cleaned up" and aren't immediately mechanically determined to be that way17:16
fungiyes, ssh keys are irrelevant there for that matter. if there's no username and no openid, there's no way to log in. if there's a username and a password or a username and ssh keys, then the lack of valid openid doesn't necessarily mean it's unused17:19
fungibut both methods of authenticating without an openid rely on having a username17:20
fungiif there's a username but no password and no ssh keys and no openid, then also no way to log in, but i expect that to be an unusual combo17:21
clarkbya I think we'll ignore that combo for now17:21
clarkband chip away at the easier sets (I don't know if we can check password via api)17:21
fungibut yeah, no/invalid openid + no username is a great set to look for17:22
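The login rules fungi lays out can be written down as a small predicate. This is a sketch only: the `Account` fields are hypothetical names for the facts being discussed, not the schema the audit script or Gerrit uses.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Account:
    username: Optional[str] = None
    has_valid_openid: bool = False
    has_password: bool = False
    has_ssh_keys: bool = False


def can_log_in(acct: Account) -> bool:
    """True if any authentication path described in the discussion is available."""
    via_openid = acct.has_valid_openid
    # Password and ssh-key authentication both depend on having a username.
    via_username = acct.username is not None and (
        acct.has_password or acct.has_ssh_keys
    )
    return via_openid or via_username


if __name__ == "__main__":
    print(can_log_in(Account()))                                # no username, no openid
    print(can_log_in(Account(username="x", has_ssh_keys=True)))  # ssh path available
    print(can_log_in(Account(has_ssh_keys=True)))                # keys without username
```

Accounts for which this returns False are the "no/invalid openid + no username" set, plus the rarer username-only case.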
openstackgerritAbhishek Kekane proposed openstack/project-config master: Change gerrit ACLs for glance-tempest-plugin
fungii'm starting to suspect that the wackyness with the main05 volume on afs01.dfw is that the raw block device was made a pv rather than being partitioned with a single partition marked as a pv17:40
fungii have a feeling this is causing problems for pvscan/vgscan17:40
fungiinfra-root: i'm going to delete the vicepa_snap logical volume from the main vg on afs01.dfw (it's not mounted, this was insurance from before the openafs 1.8 upgrade). if that goes well i'll vgreduce main off of the main05 pv (it will have no extents in use at that point) and then detach it from the server instance17:50
clarkbfungi: roger17:57
fungidouble-checking how to properly deactivate the snapshot volume first, since normal lvchange also wants to deactivate the origin volume when i try (that would be bad)17:58
clarkbfungi: re account sync from lp, the accounts exhibiting this are much newer than that I think18:03
fungiahh, okay18:03
fungiso now i've used vgreduce to remove /dev/xvdg (main05) from the main vg18:05
fungiand that pv is showing unused18:05
funginow i've used pvremove to clean off the pv header from that device18:05
clarkbI've just kicked off a run against all 608 remaining email conflicts that will check for valid openids18:06
clarkbI expect this to be much slower as I also did the check if there were any reviews or changes pushed at all rather than using after:year-ago on that query18:07
*** ralonsoh has quit IRC18:09
clarkbside note: the openid validity check seems to properly catch the tripleo ci accounts too18:10
clarkb(that gives me a bit more confidence that it is doing the right thing)18:10
*** toomer has quit IRC18:10
clarkbweshay|ruck: ^ do you know what you all did if anything to make the ubuntu one account for return a 404 on its openid now?18:11
clarkbI wonder if that is just a side effect of being assigned a new openid or if there is an explicit deactivation step. Either way just trying to build confidence in this categorization and it isn't super urgent18:12
weshay|ruckclarkb, hrm.. I have not done anything w/ it.. I'll check w/ sshnaidm|off18:12
weshay|ruckwe're probably the only two that would have done anything though18:12
*** iurygregory has quit IRC18:22
fungii ended up detaching through the rackspace dashboard, i couldn't remember how to wrangle osc to communicate correctly with their cinder since it's stuck on the v1 api (at least i think that's the problem)18:24
fungianyway, that volume is detached and deleted now18:25
fungimy current theory is that adding the main05 volume as a pv directly without any partitioning confused the device scan the next time the server was rebooted18:26
fungiand that rebooting now with the remaining 4 cinder volumes attached will come up just fine. i'm hesitant to readd /vicepa to the fstab until we've tested that though, since otherwise systemd will have many sads and we'll need to do another emergency repair boot which is a royal pain18:28
fungiianw: since you also have a lot of this paged in from yesterday, i'll hold until you're around before i test that theory18:28
clarkbhrm temporary failure in name resolution caused my script to bail out :( /me reruns it18:33
*** slaweq has quit IRC18:34
clarkbthe name it was looking for has a very low ttl, I'll blame that on random internets for now. If it persists I may need to look at my local networking18:36
*** smekala has joined #opendev18:42
clarkbinfra-root I'm going to delete 0cbe6ecb-be68-43aa-ba0d-58296a81ebcf now unless I hear objections in the next few minutes18:44
fungisounds good18:45
clarkband done18:49
clarkbstarting launches for 02-04 now18:54
clarkb#status log Removed old in favor of More new zuul executors to arrive shortly.18:57
openstackstatusclarkb: finished logging18:57
openstackgerritClark Boylan proposed opendev/ master: Add new ze02-04 servers to dns
openstackgerritClark Boylan proposed opendev/system-config master: Replace with
clarkbinfra-root ^ I think both of those are ready for review now. The servers are up and running and /var/lib/zuul is properly configured19:12
clarkband now it is time for lunch19:13
fungiawesome, yeah i saw some unattended upgrades spam from them19:14
openstackgerritMerged opendev/ master: Add new ze02-04 servers to dns
fungilooks like debian buster-updates has a patched openafs now since roughly a month, but it raced the 10.8 stable point release. if we're adding buster-updates to our default sources we can probably drop the workaround we put in place... otherwise we're roughly a month out from 10.9 which should include it19:44
fungi for those following along, 1.8.2-1+deb10u1 is the patched version for buster19:46
openstackDebian bug 980115 in openafs-client "connection failure when rx initialized after 08:25:36 GMT 14 Jan 2021" [Grave,Fixed]19:46
clarkbfungi: I think our sources are whatever debootstrap will install?19:46
clarkbthough I guess our mirror configuration role will then go over that after19:46
clarkb that seems to have -updates in it so we should be ok?19:47
clarkblooking at accounts with no username, no ssh keys, no reviews, no changes, and no valid openid (openid url is not a 200) we end up with ~70 accounts for cleaning19:50
clarkblet me copy the output of that and upload the script changes19:50
openstackgerritClark Boylan proposed opendev/system-config master: Add tools being used to make sense of gerrit account inconsistencies
clarkbfungi: ianw: maybe you can look over ^ and the resulting file (it has been copied to review) and see if it makes sense to clean those ~70 up?19:52
openstackgerritMerged opendev/system-config master: Replace with
clarkbit is interesting that a number of these with a completely idle account have >1 non completely idle accounts too20:23
clarkbthat means doing the cleanup for the idle side won't fix the conflict, however, it will reduce the number of accounts involved in the conflict20:23
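To illustrate the point that retiring idle accounts shrinks a conflict without necessarily resolving it, here is a minimal sketch. The email addresses and account ids are made up, and `remaining_conflicts` is a hypothetical helper, not part of the audit script.

```python
from typing import Dict, List, Set


def remaining_conflicts(
    accounts_by_email: Dict[str, List[int]], idle: Set[int]
) -> Dict[str, List[int]]:
    """Return emails still shared by more than one account after idle ones are retired."""
    result = {}
    for email, accounts in accounts_by_email.items():
        kept = [a for a in accounts if a not in idle]
        if len(kept) > 1:
            result[email] = kept
    return result


if __name__ == "__main__":
    # Fabricated sample data for illustration only.
    conflicts = {
        "a@example.org": [1, 2],      # one idle account: cleanup fixes the conflict
        "b@example.org": [3, 4, 5],   # one idle account: two active ones still conflict
    }
    print(remaining_conflicts(conflicts, idle={2, 5}))
```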
clarkbone individual has 6 accounts. One seems to have no valid openid, no username, no changes, no reviews. two have pushed changes a couple years apart and still more than half a decade ago and the other 3 while they have ssh usernames have never reviewed or pushed code20:26
clarkbI'm not even sure where we start with something like that20:27
clarkbmaybe declare bankruptcy since all 6 haven't been recently used, email them, and plan to disable all the accounts and clean them up except for maybe the one that was most recently used?20:27
*** slaweq has joined #opendev20:29
fungii favor bankruptcy20:44
ianwfungi: thanks for looking at it!20:52
clarkbjrosser: related to the gerrit account cleanup discussion above it appears you have three accounts. The first is the one you appear to be actively using. There are two others that share an email address which is causing us problems with gerrit consistency checks. One of those two does not appear to have any username or valid openid.20:52
ianwfungi: i was thinking, i'm not sure the partition made any difference, however the amount of time pvscan took to come back i think did -- i think it took so long that the startup jobs would timeout, putting us into an emergency mode loop20:53
clarkbjrosser: I'm sort of spot checking the output of an audit script I wrote which tries to assert which accounts are probably safe to retire and remove all their externalids to fix the conflict. It has detected that the one without a username, valid openid, etc is one to clean up. I thought I'd run this by you as a sanity check against my script and see if there are issues with that20:53
clarkbjrosser: feel free to PM me and I can share details like emails and account ids, etc20:54
ianwi feel like maybe, since that was added probably many many years after the other volumes; maybe it's some sort of back-end weirdness with them being very far apart or something20:54
jrosserclarkb: some time ago I got in a spectacular mess with gerrit/openid and my personal email address20:55
clarkbjrosser: and the account I've identified as not really ever being used or having a valid openid would be the case?20:56
jrosserI think I broke the openid/ubuntu one stuff sufficiently that I would need them to help me ever make it valid again20:57
clarkb"cool". That helps me with confidence in the output of the script I've written20:58
clarkbit seems to have properly detected that situation and identified an account that is otherwise unusable20:58
jrosseryup, cleaning up on the gerrit side is really valid first step to me ever getting that account working again, should I need to use something other than my current work email21:01
fungiianw: i feel like i've seen similar pv detection problems in rax when we didn't add a partition table on a cinder device21:02
ianwfungi: pvscan did detect it, just eventually ... anyway a good thing to keep in mind21:02
clarkbfungi: ianw: if you have time for a quick review will speed up the rotation of ze05-12 though too late for 02-04 (its ok I have a workaround I can use)21:03
ianwclarkb: lgtm21:04
clarkbjrosser: I've put that account in my notes for direct cleanup too since you've confirmed that situation (though the cleanup suggested by the script will likely be used anyway). Should get around to it once I've got sufficient confidence in a large enough group to make that set of changes21:05
clarkbgiven that ^ checks out I think we likely can move ahead with those ~70 identified as not having a username, ssh keys, reviews, pushes, or valid openids21:08
clarkbI've got an appointment this afternoon so probably won't get to it today which means others can review the script and data and object :)21:08
clarkbalso we can set the accounts inactive first, then wait a bit and do the external id removals21:08
jrosserclarkb: thanks for taking the time to clean this all up21:09
clarkbthen I expect the next sets of accounts will be ones we want to email about since they all have some activity somewhere21:09
ianwclarkb: script looks good to me.  i'd be careful, it sort of feels like you're writing a replacement for our jeepyb "first contribution" scripts :)21:09
clarkbjrosser: it has been a very interesting experience to see how accounts have gotten mixed up due to gerrit and ubuntu one assumptions that don't hold true for either side21:09
clarkbianw: I'm mostly worried that my own script will decide that my account should be turned off :)21:10
ianw"I'm sorry, clarkb.  I'm afraid I can't do that." :)21:12
fungijust stay clear of the airlock21:14
clarkbthere is also another account I identified when digging into openid validity that is named foo-not-current and doesn't have any use while another one does so I'll mix that one in as well as the account that we identified wasn't used anymore21:15
clarkbI think that gets us ~72 or so potential cleanups. However, not all will result in happy gerrit consistency checks because some still have multiple active accounts with external id conflicts even after that cleanup21:15
clarkbjust keep chipping away I guess21:16
fungiianw: just about done with dinner but i figure we can do two test reboots of afs01.dfw, the first with /vicepa still commented out of fstab, and the second with it back in if the first succeeds21:21
clarkbthe infra-prod-service-mirror-update job is running now. I don't expect that to cause problems with however you disable it but thought I would call it out just in case21:22
ianwfungi: yeah, i just kicked off the in-place focal upgrade; i'm fairly ok to reboot it with /vicepa set to mount because we can go in via the emergency host if we have issues and turn it off, if you are21:23
fungiianw: i'm okay with that if you are, i just didn't want to risk having to muddle through the emergency recovery boot on my own (have had inconsistent experience with that in the past)21:33
fungii miss being able to just escape a grub boot menu and override init in the kernel command line21:35
*** klonn has quit IRC21:35
fungirackspace's emergency boot wants to create a new server instance from the original metadata (including rerunning any userdata scripts) but then mount the original rootfs21:36
fungiwhich always weirds me out a bit21:36
fungiianw: when you say "just kicked off the in-place focal upgrade" you mean yesterday or now?21:49
clarkbfungi: I read it as now (I think yesterday was to bionic?)21:51
fungioh, right, multi-phase upgrade21:55
fungicool, i'll wait until ianw is satisfied with the focal upgrade, and am around (and hopefully more useful) if this reboot poses similar problems21:56
fungiin /etc/issue it already says the server is on 20.04.221:56
fungibut i likely checked too late21:56
clarkbthe new zuul executors are getting ansibled now. I'll get them started and turn off the old ones as soon as that finishes21:58
clarkbiirc it took a while compiling the openafs kernel driver last time so probably a little ways away still21:58
ianwfungi: yeah, sorry, kicked it off this morning.  i'm back now, i think we just uncomment and try a regular reboot as i'm pretty confident the array will build22:00
ianwit's been fine on all the other hosts, anyway22:01
ianwit's back ... hopefully that is it for afs!  still krb hosts but very close.  turning on mirror-update now to confirm22:04
fungiianw: i'm feeling fairly sure that fifth cinder volume was the problem, so yeah i guess let's go for it22:05
fungisupposed problem volume is completely gone now22:06
ianwok, i've run a manual --flush-cache --limit base.yaml run against it, mirror-update is up and trying some docs partition releases22:11
clarkbinfra-prod-service-zuul is done. I'll start the new executors and ask the old ones to pause now22:16
clarkband that's done. Now we wait for the old ones to go quiet22:18
ianwclarkb: nice, i think that's the last of the xenial afs clients too, one less thing to worry about22:19
ianwi guess it has decided docs needs a full release ... that's weird22:21
fungiif we rebooted in the middle of an earlier release, that could easily happen22:21
clarkbianw: note I still have to do 8-1222:32
clarkber 5-1222:32
ianwmirror-update was shutdown the whole time with no active releases (it's all release from the python script on there).  anyway, it's moved on; i'm feeling pretty confident it's all working22:32
clarkbI figured I'd do ~3-4 at a time just to avoid having too many moving pieces at once22:32
ianwclarkb: there's a little stack of changes ending at that updates some role matching and gerrit building bits.  apart from the slight change to build arguments should be a no-op generally22:38
ianwclarkb: and if you have a sec to double-check the stevedore one @ i can babysit that too, and cleanup the old bits22:39
clarkbianw: for the stevedore one I guess we create the cache dir on all hosts?22:40
ianwyeah, /root/.cache seems like something generic, to me22:41
clarkbianw: for the change at the bottom of the noopy stack is the log collection working properly? looks empty22:45
ianwhrm, maybe i got the matchers wrong, i can never remember it22:47
clarkbThe other two changes lgtm though22:47
clarkbI've not approved anything though as I will soon need to pop out22:48
* clarkb looks at the log collection more to see if it can be figured out22:48
ianwmaybe it needs stage_dir?22:51
clarkbthe default for that appears to be ansible_user_dir and it should append /logs to that since these are logs_txt I think22:52
clarkb shows it created some stage dirs22:52
ianwit should be "'/var/log/openafs': logs" (not logs_txt)22:53
clarkband that also shows it copying things like syslog into the stage dir22:53
clarkbianw: I think it may be a mismatch between the copies done on the executor and where we stage22:55
clarkbthe executor copy seems to want to copy from remote:/home/zuul/zuul-output/logs but we stage at /home/zuul/logs22:56
ianwyeah, i think that's the stage-output22:56
*** tkajinam has joined #opendev22:57
clarkbya I think if you set stage_dir to {{ ansible_user_dir }}/zuul-output it might work22:58
clarkbianw: but also I'm not sure you need the explicit dir creates in the post run you added22:58
clarkbI think that may happen for you22:58
corvus#status log updated eavesdrop hosts entry and restarted gerritbot due to netsplit23:04
openstackstatuscorvus: finished logging23:05
openstackgerritIan Wienand proposed opendev/system-config master: system-config-roles: only match jobs on roles tested
openstackgerritIan Wienand proposed opendev/system-config master: gerrit docker: match some more files
openstackgerritIan Wienand proposed opendev/system-config master: Remove obsolete Bazel spawn strategies
clarkbianw: I have +2'd that on the assumption that it will fix things, but I probably won't be around when the jobs finish to double check23:07
ianwclarkb: ok, thanks .. it's fairly minor.  the actual problem was that the x86-64 centos-8 image was out of date, and it couldn't find the kernel headers package for the kernel it was actually running so couldn't build openafs23:09
ianwi don't like these jobs in general; they're an odd construction.  when we don't care about xenial i'll rework them :)23:10
clarkball three zuul executors are still running jobs (I asked them to pause though). I'll check on these when I can and get them shut down when quiet, but popping out now23:27
openstackgerritMartin Kopec proposed opendev/system-config master: refstack: Edit URL of public RefStackAPI

