Friday, 2021-01-15

clarkbzuul, static, mirror, and mirror-update appear to be all the places we run openafs-client00:00
clarkbI agree a mirror node seems like the best option for upgrading out of that set00:00
clarkbthe inap provider appears to be disabled in nodepool. I'll test on its mirror00:01
clarkbhttps://mirror.mtl01.inap.opendev.org/ubuntu/ it appears to be working now00:01
clarkbI will apt-get update && apt-get install openafs-client on it?00:02
clarkbthen reboot?00:02
clarkbianw: fungi ^ that seem like a reasonable approach?00:02
ianwclarkb: ++00:02
clarkblooks like we also install openafs-krb5 so I'll apt-get install that with openafs-client00:03
clarkbhrm looks like we also do a dance where we install the kernel module first then those two00:04
clarkbI'll try to figure out how to translate that to apt-get commands00:04
fungiclarkb: sounds great00:05
clarkb--no-install-recommends when installing openafs-modules-dkms, then install the other packages00:06
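A rough sketch of that sequence as it might be run by hand on a mirror node (an outline of what is described above, not the actual ansible role; exact versions come from whatever the PPA currently publishes):

    # refresh the package index, then install the DKMS kernel module on its own,
    # skipping recommends so nothing unexpected gets pulled in
    sudo apt-get update
    sudo apt-get install --no-install-recommends openafs-modules-dkms
    # then the client and kerberos integration packages
    sudo apt-get install openafs-client openafs-krb5
    # reboot so the freshly built module and new client are actually in use
    sudo reboot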
fungii think system-config-run-static may be impacted by this00:10
clarkbpresumably if it reruns it will run with the new packages and be happy00:10
clarkbstill waiting on dkms to do its thing on the inap mirror00:11
fungihttps://zuul.opendev.org/t/openstack/build/28c0192e984548b0a48d10451e6752fb/log/job-output.txt#45767-4577500:11
fungi(also wow that's a long log)00:11
fungilook for the "Check AFS mounted" task since the autoscroll isn't going to work on progressive loading a log that long00:12
clarkbit looks like inap's mirror may have been running 1.8.3 not 1.8.6-1 fwiw00:13
clarkbso these may all need manual intervention?00:13
fungii can try to upgrade some once you're comfortable with the first one00:13
fungishould i go ahead and recheck the change which was failing to afs mount?00:14
fungii guess all the relevant packages are in our ppa now00:14
clarkbyes they should be there except for arm64 last I checked00:15
fungiit's an x86 job so should be fine then00:15
ianwyep all published00:18
clarkbrebooting inap mirror now00:19
ianwso https://etherpad.opendev.org/p/infra-openafs-1.8 has the outline of what i think an emergency 1.8 upgrade would be00:20
clarkbianw: I was trying to follow along as you went and added some notes too. I think that captures it. I don't know if it is possible to do a non-downtime upgrade. My understanding in the past was that it was not but that may not be accurate (and this was because 1.8 and 1.6 at the server level couldn't talk to each other)00:21
clarkbhttps://mirror.mtl01.inap.opendev.org/ubuntu/lists/ seems to be working post reboot00:22
clarkband from what I can tell it installed the new packages00:22
clarkbI think we'll update openafs in most (all?) places when zuul does its daily runs and/or when unattended-upgrades runs00:23
clarkbdo we want to proactively upgrade them? ahead of that? if not I can check dpkg -l tomorrow and confirm they updated on their own00:23
ianwthat should be right, though i guess things might want a reboot00:23
clarkbya my concern with rebooting mirrors is that zuul's queue is super deep right now00:24
clarkbtrying to balance the various factors in play here (not easy)00:24
ianwbut if the mirrors are getting random failures that's also not great00:24
clarkbya though as far as I can tell they haven't yet. Are you concerned that they may after the upgrade happens but before a reboot?00:25
ianwam i understanding correctly that is the current failure case?  randomish failures from the 1.6 servers?00:25
clarkbianw: 1.8 will 100% fail apparently when it starts exhibiting the problem00:25
clarkb1.6 will be randomish00:25
clarkbnone of our systems should do that unless we sufficiently restart openafs (not sure what that is but reboot definitely is sufficient)00:26
clarkbthis is because all of our systems should've started before the epoch rollover thing00:26
clarkbobviously they won't necessarily all remain in that state in the future as clouds do their cloudy thing00:26
ianwyeah, i don't really like sitting on a time-bomb that as soon as the backend fails, we have a fire-drill to get the servers updated00:27
ianwactually it's not a fire-drill, it's an actual fire at that point :)00:27
clarkbagreed, but we also have a potential multiday zuul backlog that will just implode on itself if we take an outage to fix it. Trying to figure out where in my head the balance is between imploding all those jobs to take a downtime and fix this vs waiting for now and fixing it when zuul is hopefully happier00:28
clarkbmirrors, zuul executors, and static are all involved in that00:29
clarkb(in addition to the afs servers)00:29
ianwi think doing the manual upgrade to 1.8 with servers in emergency probably isn't a bad thing in the long run00:30
ianwit will give us a chance to see 1.8 in action before we upgrade the base os of the servers00:30
ianweffectively one thing at a time.  we can feel more confident about dropping in replacement servers one-by-one if everything is at 1.800:31
clarkbya agreed00:31
clarkbfor mirrors we can disable a cloud in nodepool, wait for it to drain out (up to 3 hours or so each due to tripleo jobs), update the mirror and reboot it00:31
clarkbfor zuul executors we can stop zuul, update openafs, then reboot them one at a time00:31
clarkbzuul should retry the jobs that fail as a result00:32
clarkbthat has an impact but it is smaller00:32
clarkbfungi: we may also want to talk to the release team?00:32
fungi2021-01-15 00:03:31     <--     openstackgerrit (~openstack@eavesdrop01.openstack.org) has quit (*.net *.split)00:32
* fungi sighs00:32
clarkbare they making a ton of releases as part of this milestone that is plugging everything up?00:32
clarkbstatic I think is largely read only and so the impact of that might be less noticeable00:33
clarkbianw: thinking out loud here, doing mirrors, executors, and static first is probably more straightforward and will test our packages betterer?00:34
fungii think the release team really only notices when the releases and tarball sites get out of date (and release notes on docs site)00:34
clarkbfungi: well we'll potentially break our ability to write to tarballs00:35
fungiso 770856 is probably all they'll need00:35
clarkb(if zuul executors are not happy with the upgrade)00:35
fungioh, writes, yes00:35
ianwyeah, i agree we need to make sure they are all in a state of having the latest ppa client running so that if the server does get switched on them we are ok00:35
clarkbfungi: the new package seems to work for reads just fine00:35
clarkband we can do a single zuul executor first then observe it ?00:36
fungiso the current risk is that clients running unpatched 1.8 may spontaneously reboot and stop working, which if we've upgraded the packages (unattended upgrades, ansible, et cetera) and just not rebooted them yet, is probably fine00:36
clarkbfungi: also note the change you are trying to land will update static's install00:36
fungiwe already deal with corrupted afs caches at reboot which blocks afs from working on the mirrors on a frequent basis00:36
clarkbfungi: ya the other risk is that the new packages don't work in some way and we may not notice until we reboot00:37
fungiupdate static's install but not restart afsd or reboot the server00:37
clarkbcorrect00:37
fungisure, but like i said, we already frequently deal with afs not working after an unclean spontaneous reboot00:38
clarkbmirror update, mirrors, and static normally update daily via the periodic pipeline. zuul updates hourly00:38
fungihaving to scramble to work out why it broke differently would be not great, sure00:38
clarkbit's possible that all the zuul executors have already updated?00:38
clarkbfungi: ya I get what you are saying. basically we've reduced the risk of a reboot causing 100% failure00:39
clarkband the chance of any failure post reboot is minimal since reads are working00:39
fungiii  openafs-modules-dkms                   1.8.6-1ubuntu1~xenial100:39
fungithat's ze0100:39
fungiso no, not all anyway00:39
clarkbze01 has not updated00:39
clarkbya I mean once they update00:39
clarkbmirror-update and mirrors won't happen until ~060000:40
clarkbstatic may happen sooner if your change lands00:40
clarkbzuul should happen in the next hour?00:40
clarkbpart of why I'm bringing this up is I need to pop out for dinner in a few short minutes and then ideally also call it a work day00:41
fungii'll be glad to stick around, and am happy to do a controlled reboot of static.o.o after the apache change triggers a fresh deployment there00:41
ianwyeah, i can make a list on that etherpad page and make sure things update00:42
ianwi can also create the vicepa snapshot in preparation for a manual server upgrade00:43
clarkbze01 should've updated openafs-client and openafs-krb5 at 23:43:44 ish. Now trying to cross check with when the ppa updated for xenial00:43
ianwfungi: be good if you could double check the instructions00:43
clarkbthe timestamp for the amd64 openafs-client package seems to be 23:4400:44
clarkbmissed it by seconds00:44
clarkbianw: fungi thanks00:44
clarkbthen maybe tomorrow I can work on rolling reboots of zuul executors and we can do a reboot of mirror-update if fungi's locks sufficiently idle that server?00:45
fungiianw: i'll take a look, sure00:45
clarkbthen maybe aim for Monday upgrade of the servers?00:45
fungii'd rather not reboot mirror-update until the tarballs volume is at least done releasing00:45
fungibut i guess if we need to we need to00:46
clarkbfungi: ya I think we want to wait for it to idle if we can get it to do so00:46
clarkbsince you've got all the locks held right?00:46
clarkbso it should finish the current set of releases then do nothing00:46
clarkbif we aim for monday for the outage we can send out comms tomorrow too and try and warn people as much as possible00:46
clarkb"the outage" being the main afs server outage00:46
clarkband that also gives time for vos releases to complete00:47
fungiright, i terminated the other outstanding vos release calls from mirror-update.o.o (which unfortunately doesn't stop the transactions so isn't actually freeing up the afs servers) and held locks for all of them in a screen session00:47
fungithe possible wrench in the ointment here is that the other replica sync transactions which already got initiated are likely to continue well into next week00:48
clarkbhrm I wonder how terrible it will be to upgrade with those happening :/00:49
fungiwe might just need to consider afs02.dfw a total loss and start all its replicas from scratch again00:49
fungipresumably afs01.dfw will give up trying to replicate to it if the server goes away for a while?00:49
clarkbno clue00:50
clarkbmight be good to have corvus think over some of this stuff too00:50
ianwindeed.  i can shepherd it on my monday, which is usually a very quiet time00:50
fungiany one of these mirror volumes easily needs a weekend or more to do a full release, and we have something like 10 around that size00:50
ianwthat would give y'all your monday to fix anything :)00:50
clarkbI've just double checked that openafs-client role is only applied to zuul-executor and not all of zuul. That is the case00:53
clarkband with that I need to go catch up on household/family things. Thank you for all the help today. If you discover new things or have scheduling thoughts for getting stuff updated maybe update the etherpad and I'll do my best to catch up in the morning?00:53
ianwnp.  i think what i'll do is monitor all those servers and update the etherpad.  i'll add the vicepa snapshot00:54
ianwthen i might send a summary email we can sync on00:54
clarkbsounds good. ++00:54
fungiawesome, i'll be back around shortly, need to switch rooms00:54
clarkbmight also be good to indicate if you think an AU monday outage is a good idea given the other releases and everything when your day ends. That way we can send out a warning tomorrow about it00:54
clarkbI guess we can always send a warning and mention it may not happen depending on server state00:55
clarkbthanks again!00:55
fungii'm going to restart gerritbot, i don't see it coming back since the split00:56
fungi#status log restarted gerritbot since it was lost in a netsplit at 00:03 utc00:57
openstackstatusfungi: finished logging00:57
fungize01 still doesn't have newer openafs packages installed yet. i suspect our hourly jobs for those aren't upgrading distro packages01:34
fungiunattended-upgrades will take care of it anyway01:43
fungize07 looks like it's mid-upgrade01:56
fungiopenafs-modules-dkms is at 1.8.6-5ubuntu1~xenial2 now but openafs-client and openafs-krb5 are still on 1.8.6-1ubuntu1~xenial1 there01:56
fungiahh, yeah, the dkms postinst build is in progress there01:58
ianwok, lunch done, will take a look01:59
fungiit's parented to what looks like an ansible ssh session, so i guess we are updating them that way01:59
fungiyeah, other executors are in a similar state now02:00
fungiso they should be done shortly02:00
ianwianw@ze01:~$ dpkg --list | grep openafs02:20
ianwii  openafs-client                         1.8.6-1ubuntu1~xenial1                          amd64        AFS distributed filesystem client support02:20
ianwii  openafs-krb5                           1.8.6-1ubuntu1~xenial1                          amd64        AFS distributed filesystem Kerberos 5 integration02:20
ianwii  openafs-modules-dkms                   1.8.6-5ubuntu1~xenial2                          all          AFS distributed filesystem kernel module DKMS source02:20
fungiyeah02:22
ianw- name: Install client packages02:22
fungilook at the process list and you'll see lkm compilation underway02:22
fungior i guess it's finished on at least some i'm checking now02:23
ianwi think it's finished on ze0102:23
ianwthe next step in the role should have updated the other two packages02:23
ianwservice-zuul.yaml.log just finished02:27
fungiyeah, looks like they're fully upgraded now02:28
ianwyep02:28
funginow the question whether we should reboot an executor and make sure it's still working correctly02:29
ianwjust looping them all to double check they're all upgraded02:29
ianwall good.  probably worth restarting one ... i can do that if we like02:32
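The kind of loop described here can be done as an ad-hoc ansible command; a sketch, assuming an inventory group named zuul-executor (the real inventory group names may differ):

    # ask every executor for its installed openafs package versions
    ansible zuul-executor -f 5 -m shell -a "dpkg -l | grep openafs"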
fungiit'll restart a bunch of in-progress builds, but that's probably still better than getting caught unawares with a problem when one needs to be rebooted for other reasons02:34
fungiyeah, we've still got a nearly 3k node backlog02:35
ianwi'll restart ze01 for sanity02:36
fungithanks02:37
ianwok, it's back, it's got afs and can look around /afs/openstack.org02:40
fungilooks like it's up and yeah02:40
fungiwe'll want to make sure a docs build succeeds on it02:40
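A few quick post-reboot checks along the lines of what was done on ze01, assuming the standard openafs-client tooling (the docs build itself still has to be confirmed from the Zuul side):

    systemctl status openafs-client   # client service came back up
    lsmod | grep openafs              # kernel module is loaded
    ls /afs/openstack.org/            # the cell is reachable and readable
    fs checkservers                   # reports any fileservers the client thinks are down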
ianwi don't think anything updates the cache for the mirror runs02:59
ianwthe package cache02:59
ianwruns being the ansible runs02:59
fungiunattended-upgrades will get it at some point in 24 hours03:01
fungior we can manually force them earlier03:01
ianwi'm running ansible by hand now with update_cache fix in the openafs-client role03:02
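If we did want to pull the upgrade forward on a host rather than wait for unattended-upgrades, something like this would do it (a sketch, assuming the PPA is already configured on the host):

    sudo apt-get update
    sudo apt-get install --only-upgrade openafs-client openafs-krb5 openafs-modules-dkms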
funginot seeing any obvious job failures attributable to afs writes from ze01, good so far04:02
ianwok, all mirrors should be updated04:12
fungithanks! i think i'm about to nod off, but i'll check through everything again when i wake up04:15
ianwmail sent to discuss list to discuss how to move on from here05:57
ianwi have confirmed that all our clients have fixed 1.8.6 packages05:58
ianwi have implemented the vicepa snapshot on afs01.dfw; we may want to recreate this I guess but the basics are there05:59
ianwnot sure what else to do now06:06
ianwi'll wait for some feedback and we can take it from there06:06
openstackgerritMartin Kopec proposed opendev/system-config master: WIP Deploy refstack with ansible docker  https://review.opendev.org/c/opendev/system-config/+/70525807:25
openstackgerritMartin Kopec proposed opendev/system-config master: WIP Deploy refstack with ansible docker  https://review.opendev.org/c/opendev/system-config/+/70525808:09
openstackgerritMartin Kopec proposed opendev/system-config master: WIP Deploy refstack with ansible docker  https://review.opendev.org/c/opendev/system-config/+/70525809:12
zbrany infra-core around?11:28
mrungeHi there, is there anyone to help me figure out a POST_FAILURE in patches in gate phase? E.g https://review.opendev.org/c/openstack/panko/+/76490613:18
mrungethe same checks work in check phase, but fail in gate phase with this post_failure13:19
mrungeand I can't figure out why13:19
fricklermrunge: looks to me like "just" an unstable job, the same failure is also seen in check here https://zuul.opendev.org/t/openstack/build/3d7cb7959325456f98381916c95081ea13:36
fricklerit's extremely unlikely that job results should consistently depend on whether the job runs in check or gate13:36
mrungefrickler, thank you. It looks like they are failing pretty consistently, but only in gate phase13:37
fricklerwe could either hold a node for you to debug the failure in situ or you could amend the devstack post job to not fail completely in that scenario, allowing for more logs to be present in case of this failure13:37
mrungeis it possible that *just* a signal is not sent to the right target?13:39
fricklerzbr: the keywords to highlight people are either infra-root or config-core (the latter for e.g. project-config reviews). it would also be much more productive if you could just state your issue instead of doing empty pings13:39
fricklermrunge: not sure why exporting the journal would fail, most likely it gets oomed. without other logs present it's difficult to really tell, though13:40
mrungeright13:40
mrungeit looks like this is getting killed13:40
mrungefrickler, is it possible to freeze a node in the case of failure?13:41
mrungealthough, why would the job then just fail in gate and not also in check mode?13:42
fricklermrunge: as I said before, I don't think the latter is true. I'll set up a hold in order to keep a node online when it is failing13:45
mrungeokay, thank you. I can retrigger a job13:46
mrunge(or you could...)13:46
fricklerI just did, the hold is specific to the patch you mentioned above.13:50
mrungethank you, that's awesome13:50
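For reference, the hold frickler set up would look roughly like this via the zuul CLI (project, job, and change numbers taken from the discussion above; exact flags depend on the zuul version in use):

    # hold the failing node set for this change so it can be inspected afterwards
    zuul autohold --tenant openstack --project openstack/panko \
        --job telemetry-dsvm-integration --change 764906 \
        --reason "mrunge debugging gate POST_FAILURE" --count 1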
zbrfrickler: it would be very useful to mention the keywords in channel topic as we have lots of channels with their own rules.13:56
fricklerzbr: we did not put them into the topic in order to avoid getting spammed too often, they are mentioned somewhere in our docs, though13:58
zbrhttps://review.opendev.org/q/hashtag:%22low-hanging%22+(status:open%20OR%20status:merged)13:58
zbrbut with projects like git-review it is quite hard to guess which keyword to use.14:00
zbri am core there, but i still need two others to get my own changes in.14:00
zbrit is a bit tricky as low-activity projects rely more on help from infra in order to get things moving.14:02
zbri still have no idea how we could improve this14:02
zbranother interesting subject would be related to elastic-search licensing: does this impact us? (long-term)14:07
fricklerwell we could discuss dropping the two-review rule for projects like this. maybe add that as a topic to our meeting agenda?14:07
zbrsure, i will do. i think that the requirement is not enforced but it could be tricky to know which project needs it or not.14:08
mrungezbr, you're not alone with this issue.14:14
mrungeusually we dealt with this by distinguishing between low hanging fruits and patches where a second pair of eyes is really appreciated14:15
zbrmrunge: yep, my impression is that simple low-risk changes should be allowed with downgraded quorum, as in single reviewer. still, even if that is agreed, we need to give a good set of examples (good and bad).14:17
mrunge+1 , however it would be hard to give these do's and don'ts14:18
zbrfixing ci pipelines, reqs, broken jobs, is likely to be low risk. but breaking changes not14:21
zbrfunny bit is that dropping support for py27 and py35 is somewhere in-between; depending on the project, it could have breaking impact.14:22
mrungeyes!14:22
zbron the other hand, when you already have CI broken for months, you start to wonder which one is the lesser evil14:22
zbri added topic to https://wiki.openstack.org/wiki/Meetings/InfraTeamMeeting#Weekly_Project_Infrastructure_team_meeting14:23
fricklermrunge: oom theory confirmed: http://paste.openstack.org/show/801664 , if you let me know your ssh key I can give you access for further debugging14:49
mrungegood to know14:52
mrungefrickler, https://github.com/mrunge.keys14:52
fungiis there swap set up on that job?14:52
mrungetbh. I don't know14:52
mrungeI even haven't seen that job description so far14:53
fungishould be able to run free once logged in and find out14:53
fricklerour default of 1G, yes, and it is all used up14:53
fricklermrunge: ssh root@158.69.69.7214:54
frickler/var/log/journal has 409M, so nothing super huge, though14:56
fungiso presumably memory pressure from something else14:56
mrungeall together may be a bit tight with 1gb14:56
mrungeis there a possibility to give it, e.g. 1.5 gig?14:56
fricklerit's strange that xz gets killed, though. maybe it chokes on some special kind of input? or is it just unlucky?14:56
fricklermrunge: there is a patch to have 8g of swap again, but so far it seemed only to be needed for stable/stein14:57
frickleror was it train?14:57
mrungesince the same job passes on check queue, the question is, what is the real difference?14:58
mrungebetween check and gate queue14:58
fricklerhttps://review.opendev.org/c/openstack/devstack/+/75748814:58
fricklermrunge: this failure was in check, it doesn't really matter which queue14:58
mrungeugh14:58
mrungethere goes my theory14:58
fricklerseems just like a 50% chance of failing14:58
fungimrunge: frickler: it's entirely configurable. 1gb is simply the default14:59
mrungefungi, where would I set that?14:59
mrungeI wouldn't want to use a lot more, since it often passes15:00
fungiyou can set a variable in the job definition to indicate how large of a swapfile you want, just be aware it subtracts from available rootfs on at least some providers so don't make it huge unless you know you won't be using a lot of disk15:00
fricklermrunge: see the above patch.15:00
* fungi goes looking for the variable15:00
fungiconfigure_swap_size yeah15:02
mrungesince there is 17 Gig out of 75 Gig used in rootfs, increasing swap to 8 gig would totally fit15:02
fungiyou can see folks overriding it in various jobs: https://codesearch.opendev.org/?q=configure_swap_size15:03
fungi8192 and 0 seem to be the popular override values15:03
fungidepending on whether folks wanted more swap, or more disk and no swap15:03
mrungeso, in this case, nearly all of the 1 gig swap is used15:04
fungifor a bit of background, we basically had to make this compromise when newer linux kernels started refusing to allow sparsely allocated swapfiles15:04
fungithe swapfiles are now preallocated instead of sparse, which means the prior 8gb default made a lot of jobs start failing on providers with smaller rootfs sizes15:05
mrungeany idea where I would find the job description for telemetry-dsvm-integration ?15:06
mrungethat seems to be inherited from somewhere?15:06
fungiit will inherit from the job's parent if there's no description set in the job definition, i believe15:07
fungimrunge: https://zuul.opendev.org/t/openstack/job/telemetry-dsvm-integration says it inherits from https://zuul.opendev.org/t/openstack/job/telemetry-tempest-base15:08
mrungeright, that's what I found so far15:09
fungithe former has no description set in the job, but the latter does15:09
fricklerhmm, according to the manpage, xz may use multiple GB of memory when running with high compression settings like -9. we may want to add --memlimit=x, maybe 256M or so15:12
fungior switch to gz and accept that the files will be a bit larger15:13
fungithough i want to say clarkb found xz was massively better at compressing systemd journals than gz15:14
fricklerwell, failing with oom doesn't sound better, so there seems to be a tradeoff to make ;)15:15
fungiyep, absolutely15:16
frickleranyway, eod for me, maybe someone wants to add extra swap to the held node and test various xz options. or I'll do that later15:17
fungii suppose we could do a fallback to --memlimit=something if the normal compress attempt fails15:17
mrungeyes, that sounds sensible15:17
fungibut yeah, let's get some more suggestions. i need to switch to double-checking afs stuff15:17
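A minimal sketch of that fallback, assuming the journal export ends up as a shell step (the filename here is made up for illustration):

    # try the normal compression first; if xz gets oom-killed, retry with a hard
    # cap on its memory use, accepting a somewhat larger output file
    xz -9 export.journal.txt || xz -f --memlimit=256MiB export.journal.txt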
mrungeand then there is this change: https://github.com/openstack/devstack/commit/d02fa6f856ac5951b8a879c23b57d5a752f2891815:18
mrungebut not causing this issue15:19
mrungehttps://review.opendev.org/c/openstack/devstack/+/77094915:23
mrungethat is the change15:24
mrungethank you fungi and frickler15:24
mrungelet's see how this goes15:24
clarkbxz is much better space wise yes15:28
openstackgerritMerged openstack/project-config master: update gentoo from python 3.6 to python 3.8  https://review.opendev.org/c/openstack/project-config/+/77082815:29
mrungefungi, frickler: I got access to a held node, I don't believe I still need it. that is 158.69.69.72 and could be released back to the pool15:35
mrungeOr can I do that on my own, and if yes: how?15:35
clarkbmrunge: you can't currently release it back on your own15:35
clarkbthank you for letting us know15:35
fungiinfra-root: static.o.o has the new openafs packages installed, unless there are objections i'm going to do a quick reboot in a few minutes so it's using the fixes and then maybe we can look at merging 77085615:36
mrungethank you for giving access to help debugging15:36
clarkbfungi: looks like zuul has caught up a bit (though still well behind)15:36
fungiclarkb: yep15:36
clarkbfungi: no objections from me on the reboot15:37
fungi#status log rebooted static.o.o to pick up the recent openafs fixes15:39
openstackstatusfungi: finished logging15:39
fungiserver's been up 2 minutes now15:42
fungiseems to be working, afs content is served15:43
fungisince there have been no objections, i'm going to self-approve 770856 for now and keep an eye on it once it deploys to make sure we're serving up to date content15:44
clarkbsounds good15:47
*** caiqilong has joined #opendev15:50
caiqilongI received a email said I have been added to "Autopatrolled users". Is that good or bad?15:50
fungicaiqilong: it's "good"15:50
caiqilongfungi: ok. thanks.15:51
fungii watch every change made to the wiki to make sure we catch and roll back any spam, but if i see a user make consistently legitimate changes i add them to the autopatrolled users and then i no longer need to review their edits in the future15:51
fungiclarkb: enough vos releases stopped that i can quickly get a vos status on afs01.dfw again (the tarballs volume release is still underway though)15:54
caiqilongfungi: thanks for your patience.15:54
fungicaiqilong: you're welcome!16:04
openstackgerritMerged opendev/system-config master: Temporarily serve static sites from AFS R+W vols  https://review.opendev.org/c/opendev/system-config/+/77085616:16
fungiokay, weird, looks like the last time we tried to build a gentoo image was september16:24
clarkbI think nodepool will stop building an iamge if we stop telling it to upload to any provider16:27
clarkbmaybe we did that?16:27
fungilooks like we paused it16:28
clarkbah16:28
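To see what the builder thinks is going on, something like this on one of the nodepool builders (a sketch; the image name matches the build log referenced later):

    nodepool dib-image-list | grep gentoo-17-0-systemd   # recent dib builds and their state
    nodepool image-list | grep gentoo-17-0-systemd       # uploads of the image to providers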
openstackgerritJeremy Stanley proposed openstack/project-config master: Un-pause Gentoo image builds  https://review.opendev.org/c/openstack/project-config/+/77103116:32
fungiprometheanfire: clarkb: ^16:32
prometheanfireyarp :D16:33
clarkbfungi: is tarballs the only thing releasing now or is it just fewer things overall?17:28
clarkbI guess the mirrors are likely still releasing due to their size17:28
clarkbfungi: I've +2'd the gentoo unpause, not sure if you want to approve it or wait for another review17:28
fungiclarkb: what's "releasing" is a bit misleading. the previously killed vos release calls also still have sync transactions underway, should be able to spot them by listing the transactions17:30
clarkbgotcha17:30
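The in-flight operations can be listed per fileserver; a sketch, with the server name and auth flag as assumptions based on the hosts discussed:

    # active volume transactions (e.g. releases still syncing) on a fileserver
    vos status -server afs01.dfw.openstack.org -localauth
    # locked volume entries in the VLDB
    vos listvldb -server afs01.dfw.openstack.org -localauth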
fungiclarkb: i'll watch the build log for gentoo if you approve17:31
clarkbok approving now17:31
fungithanks!17:32
clarkbfungi: making sure I'm up to date on people's thoughts re afs upgrades. We've rebooted enough afs clients to be reasonably confident that when the other clients reboot they will be fine (and all clients have the new packages installed)17:38
clarkband that means we don't need to worry about doing rolling reboots of all the mirrors and zuul executors?17:38
openstackgerritMerged openstack/project-config master: Un-pause Gentoo image builds  https://review.opendev.org/c/openstack/project-config/+/77103117:41
fungiclarkb: yes, we rebooted at least the inap mirror, ze01 and static.o.o17:45
clarkbthat should be a pretty good representative sample17:45
fungii think that's a reasonable cross-section, yeah17:46
clarkbok, is there anything afs related that you think I should be doing next to help? Looks like ianw did a vicepa snapshot already17:47
clarkbare we happy with that lvm level redundancy?17:47
clarkbI think yesterday we had suggested it as the least overhead option (lvm is often a mystery to me so want to double check I should be doing anything else around backups/snapshots to help)17:48
fungias a temporary measure it seems like a fine solution to me17:48
fungibut we should make sure we have enough available extents in the vg to allow the volumes to diverge for a while17:49
clarkbfungi: is that something vgs would show us?17:50
fungia quick rundown of how lvm2 snapshotting works: all the extents (device blocks essentially) which belong to the original volume are marked as also belonging to the snapshot. when new writes are committed in the original volume the old extents are kept and new extents are used instead. as the volumes diverge obviously more additional extents will be used up to (eventually) the size of the original volume17:51
fungiitself17:51
fungiand yeah, vgs will show you your available extents/space17:51
fungiwe can always tack on another cinder device as a pv in that vg if needed17:52
clarkbianw did that already according to the etherpad (a full size 1tb volume)17:53
clarkbI've just loaded my keys and will double check it too17:53
clarkbvgs only shows vfree and vsize by default17:53
* clarkb reads manpages17:53
clarkboh I may need lvs17:54
clarkbya the actual lv and its snap show up there17:54
fungiright, lvs to see the volumes, vgs to see how much room there is for them to grow17:56
clarkbin this case I think we've created two lvs one at 4TB and one at 1TB which fills up the entire 5TB vg17:57
clarkbthe data% of the 1TB snapshot is .55%. Should I read that as this snapshot is using 0.55% of the space allocated to it?17:57
clarkbif so then I think we are currently quite happy with that state?17:58
clarkbfungi: if you get a chance maybe you want to double check all that too?18:02
clarkbcorvus: if you haven't seen https://etherpad.opendev.org/p/infra-openafs-1.8 yet, it would be great if you could review that18:03
clarkbjust to get another set of eyeballs on the whole situation18:03
fungiyep, that sounds right, but i'll take a closer look when i'm done eating18:04
clarkbthanks, that reminds me I should find breakfast18:05
fungiyeah, the snapshot setup on afs01.dfw looks okay to me, it's up to 0.56% in use, but at this rate of growth we've probably got plenty of time. also we'd presumably recreate a newer snapshot immediately prior to starting the upgrade18:35
fungialso calling sync before making the snapshot may be a good idea18:36
clarkbfungi: you should add that to the etherpad18:36
fungii am18:36
clarkbthanks!18:36
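A minimal sketch of the snapshot arrangement being checked here, with made-up volume group and volume names (the real names on afs01.dfw may differ):

    sync                                   # flush pending writes first, per the note above
    lvcreate --snapshot --size 1T --name vicepa-snap /dev/vg_main/vicepa
    lvs    # the snapshot appears here with a Data% column tracking how far it has diverged
    vgs    # free extents left in the VG for that divergence to grow into
    # if the VG runs low, another cinder volume can be attached and added:
    #   pvcreate /dev/xvdX && vgextend vg_main /dev/xvdX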
auristorclarkb: if you wait until jan 31 the need to update the servers to 1.8 will no longer be necessary18:50
clarkbauristor: oh does the bit flip over again at that point?18:51
auristorby the 31st the number of non-zero bits in the count of seconds since unix epoch will once again be great enough to permit time based random connection ids to be random18:51
clarkbaha18:51
fungiauristor: ooh, excellent point, though we have other reasons for wanting to upgrade too18:52
clarkbauristor: fwiw I think we should upgrade to 1.8 anyway.18:52
clarkbauristor: is changing the key format the only thing that you need to do in the upgrade? And it can't be done as a rolling upgrade right?18:52
auristorthe underlying problem that openafs 1.8 had was that an effort was made to replace time based randomness with RAND_bytes and the implementation was botched.18:52
auristorthe key does not need to change.18:53
clarkbauristor: akeyconvert was the thing that ianw found18:53
clarkbsomeone mentioned that converting the file format for the key was necessary?18:53
auristor1.6 stores non-DES keys in a krb5 keytab.   1.8 uses the KeyFileExt that AuriStor contributed.18:53
clarkbright, so to upgrade we stop 1.6, run akeyconvert, then start 1.8?18:54
clarkbon the fileservers and db servers18:54
auristorakeyconvert is used to import the keys from the keytab to the KeyFileExt.18:54
auristor1.6 doesn't know about KeyFileExt and doesn't care if it exists.   1.8 won't use the keytab.18:54
clarkbgot it18:54
clarkbso we could convert, then stop 1.6 then start 1.8. But do need to convert before 1.8 starts18:55
auristorcreate the KeyFileExt and distribute to all servers as you would the keytab.    Then upgrade / downgrade as you wish.18:55
clarkbdoes that also mean we need to update zuul secrets that use keytabs?18:55
clarkbthough all of our clients are already 1.8 so maybe they do the correct thing already?18:56
auristorKeyFileExt is used by all openafs 1.8 servers and admin tools that use -localauth18:56
clarkbgot it18:56
auristoropenafs clients do not use keytabs or KeyFileExt18:56
clarkbauristor: and we shouldn't have a mix of 1.6 and 1.8 servers?18:57
auristorAuriStorFS clients use krb5 keytabs and provide encryption for anonymous processes18:57
clarkbI seem to recall reading that once upon a time but haven't found confirmation of that18:57
auristoryou can mix 1.6 and 1.818:57
auristorthere are no protocol, database or vice partition changes18:57
clarkbthat is really good to know actually18:57
clarkbgiven ^ I think we should upgrade the ord fileserver first18:58
auristorfor that matter you can mix AuriStorFS and OpenAFS18:58
clarkbbecause that will have the least impact, then we could do a rolling update across the others18:58
clarkbianw: ^ highlight mark in irc for some interesting details18:58
auristorbuild a 1.8 test fileserver and add it to the cell to make sure your KeyFileExt works18:58
auristoronce you are comfortable it does, distribute it to all the other db and file servers in the cell18:59
clarkbauristor: ya that is essentially what the ord server is (it is in another location compared to the others and we don't use it for much because releases to it over the internet are slow)18:59
clarkbwe should be able to use it as the test in this case with minimal impact (others should double check that assertion though)18:59
clarkbI'll update our etherpad with this new info after lunch19:00
fungiall of the above sounds great, and also greatly simplifies the planned upgrade19:02
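Putting auristor's notes together, the key conversion plus a rolling server upgrade might look roughly like this (package and unit names are assumptions based on the ubuntu openafs packaging; the authoritative plan lives in the etherpad):

    # import the cell keys from the krb5 keytab into the KeyFileExt format that 1.8
    # servers read (akeyconvert ships with the 1.8 tooling), then distribute the
    # resulting KeyFileExt to every db and file server just like the keytab
    akeyconvert
    # then, one server at a time, starting with the lightly used ord fileserver:
    systemctl stop openafs-fileserver
    apt-get install openafs-fileserver openafs-dbserver   # 1.8 packages from the PPA
    systemctl start openafs-fileserver
    # 1.6 and 1.8 servers can coexist in the cell, so the rest can follow once ord looks healthy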
openstackgerritGhanshyam proposed openstack/project-config master: Combine acl file for all interop source code repo  https://review.opendev.org/c/openstack/project-config/+/77106619:28
fungiprometheanfire: gentoo image build underway, log here: https://nb01.opendev.org/gentoo-17-0-systemd-0000143983.log19:30
fungihopefully we'll know shortly if we have usable images again19:30
fungioh, though it's been on "Emerging (1 of 1) dev-python/packaging-20.7::gentoo" for ~1.5 hours according to the timestamp. maybe that build got terminated19:32
funginot terminated, this is still in the process table on nb01 since 18:03... "/bin/bash /tmp/in_target.d/pre-install.d/02-gentoo-04-install-desired-python"19:33
fungithis child of it has been running since 18:05... "/usr/bin/python3.8 -b /usr/lib/python-exec/python3.8/emerge --binpkg-respect-use --rebuilt-binaries=y --usepkg=y --with-bdeps=y --binpkg-changed-deps=y --quiet --jobs=2 --autounmask=n --oneshot --update --newuse --deep --nodeps dev-python/packaging"19:34
fungilooks like it might be deadlocked (livelocked?), strace shows it waiting on a private futex19:35
clarkband it is that pid with the futex?19:37
clarkbor is another child of ^ (might be helpful to know which exact process is sitting on the futex)19:37
clarkbfungi: ianw I updated https://etherpad.opendev.org/p/infra-openafs-1.8 with the notes from auristor but didn't change the upgrade plan. Instead just added the new info and proposed a potentially different upgrade plan19:38
fungii also replied to the ml thread with some of it19:39
fungiclarkb: the process i straced didn't have any children in the process table19:40
clarkbyou wouldn't expect emerge to deadlock since that is a tool used by many users, but maybe we've managed to trip a weird corner case in emerging python packaging19:41
fungimaybe it doesn't like running with linux 4.15 or something19:45
openstackgerritJeremy Stanley proposed openstack/project-config master: Set up access for #openinfra channel  https://review.opendev.org/c/openstack/project-config/+/77107319:55
fungiyeah, i'm fairly certain it's not going to terminate on its own, will try killing it and see what happens on the next attempt20:40
fungithat caused it to fail and clean up20:43
clarkbfungi: it looks like zuul might be caught up by tomorrow20:53
clarkbassuming general friday trends continue20:53
fungithat'll be nice. gerrit and scheduler restart over the weekend then?20:53
clarkbwhat is the gerrit restart for?20:53
fungii'll be around and can drive20:53
clarkbbut ya getting a zuul scheduelr in would get us the WIP support which would be great20:54
fungigerrit restart will be needed for the zuul results plugin, if we approve the stack20:54
clarkbI'll be around but in and out. Family is telling me that they are stir crazy and need to get out20:54
fungiyeah, no worries20:54
clarkbfungi: ah, I haven't had a chance to look at it since my last pass of reivews20:54
clarkbdid the server rename stuff get pulled out of the stack? I really do think simplifying that is a good idea for now and then rename when we actually rename20:55
fungichecking20:56
fungiyeah, https://review.opendev.org/767059 is still near the top of the stack20:58
fungiwe could do the gitea upgrade maybe?20:59
clarkboh ya thats another one that got ignored due to the afs things20:59
clarkbya if people are happy with the test node results on that one I think we can proceed21:00
clarkblooks like ianw did check the held node and I think you already did so that is three of us21:00
fungii'm happy and seems like ianw tried it and didn't see any problem either21:00
clarkbya21:00
clarkbI would probably prioritize zuul then gitea then gerrit21:01
clarkbzuul since it is most sensitive to the load21:01
clarkband gerrit last since I'm not sure if we are going to rebase that and stop renaming the server in testing (my preference is that we do that)21:01
fungiclarkb: have any suggestions on the best way to reach out to nibalizer about https://review.opendev.org/771073 ?21:03
clarkbya let me see21:04
fungistrange, nodepool still doesn't seem to have started trying to build another gentoo image21:22
clarkbif other images are building the slots may be full21:23
fungiyeah, likely21:23
clarkbsince nodepool may have decided to start another build after you killed the broken gentoo build21:23
openstackgerritMerged opendev/system-config master: system-config-run-review: remove review-dev server  https://review.opendev.org/c/opendev/system-config/+/76686721:39
ianwthanks for looking in on the afs stuff22:18
ianwi'd agree we can just start with ORD and see how it goes22:18
ianwi think it's good to have a plan in case things restart before the end of the month though :)22:19
fungiianw: of course, and i'm not suggesting we wait to the end of the month to start either, just that we also can take it more slowly and carefully (events permitting)23:05
openstackgerritLon Hohberger proposed openstack/diskimage-builder master: Pass DIB image's kernel version when checking modules  https://review.opendev.org/c/openstack/diskimage-builder/+/77109223:12
*** yoctozepto6 has joined #opendev23:15
*** yoctozepto has quit IRC23:17
*** yoctozepto6 is now known as yoctozepto23:17
*** logan- has joined #opendev23:18
*** ysandeep|away has quit IRC23:20
*** ysandeep has joined #opendev23:22
*** yoctozepto4 has joined #opendev23:42
*** yoctozepto has quit IRC23:44
*** yoctozepto4 is now known as yoctozepto23:44
*** yoctozepto4 has joined #opendev23:58
*** akrpan-pure has joined #opendev23:59
*** yoctozepto has quit IRC23:59
*** yoctozepto4 is now known as yoctozepto23:59
