Friday, 2021-01-15

clarkbzuul, static, mirror, and mirror-update appear to be all the places we run openafs-client00:00
clarkbI agree a mirror node seems like the best option for upgrading out of that set00:00
clarkbthe inap provider appears to be disabled in nodepool. I'll test on its mirror00:01
clarkbhttps://mirror.mtl01.inap.opendev.org/ubuntu/ it appears to be working now00:01
clarkbI will apt-get update && apt-get install openafs-client on it?00:02
clarkbthen reboot?00:02
clarkbianw: fungi ^ that seem like a reasonable approach?00:02
ianwclarkb: ++00:02
clarkblooks like we also install openafs-krb5 so I'll apt-get install that with openafs-client00:03
clarkbhrm looks like we also do a dance where we install the kernel module first then those two00:04
clarkbI'll try to figure out how to translate that to apt-get commands00:04
fungiclarkb: sounds great00:05
clarkb--no-install-recommends when installing openafs-modules-dkms, then install the other packages00:06
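A rough sketch of that sequence as it might be run by hand on a mirror node (an outline of what is described above, not the actual ansible role; exact versions come from whatever the PPA currently publishes):

    # refresh the package index, then install the DKMS kernel module on its own,
    # skipping recommends so nothing unexpected gets pulled in
    sudo apt-get update
    sudo apt-get install --no-install-recommends openafs-modules-dkms
    # then the client and kerberos integration packages
    sudo apt-get install openafs-client openafs-krb5
    # reboot so the freshly built module and new client are actually in use
    sudo reboot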
fungii think system-config-run-static may be impacted by this00:10
clarkbpresumably if it reruns it will run with the new packages and be happy00:10
clarkbstill waiting on dkms to do its thing on the inap mirror00:11
fungihttps://zuul.opendev.org/t/openstack/build/28c0192e984548b0a48d10451e6752fb/log/job-output.txt#45767-4577500:11
fungi(also wow that's a long log)00:11
fungilook for the "Check AFS mounted" task since the autoscroll isn't going to work on progressive loading a log that long00:12
clarkbit looks like inap's mirror may have been running 1.8.3 not 1.8.6-1 fwiw00:13
clarkbso these may all need manual intervention?00:13
fungii can try to upgrade some once you're comfortable with the first one00:13
fungishould i go ahead and recheck the change which was failing to afs mount?00:14
fungii guess all the relevant packages are in our ppa now00:14
clarkbyes they should be there except for arm64 last I checked00:15
fungiit's an x86 job so should be fine then00:15
ianwyep all published00:18
clarkbrebooting inap mirror now00:19
ianwso https://etherpad.opendev.org/p/infra-openafs-1.8 has the outline of what i think an emergency 1.8 upgrade would be00:20
clarkbianw: I was trying to follow along as you went and added some notes too. I think that captures it. I don't know if it is possible to do a non-downtime upgrade. My understanding in the past was that it was not but that may not be accurate (and this was because 1.8 and 1.6 at the server level couldn't talk to each other)00:21
clarkbhttps://mirror.mtl01.inap.opendev.org/ubuntu/lists/ seems to be working post reboot00:22
clarkband from what I can tell it installed the new packages00:22
clarkbI think we'll update openafs in most (all?) places when zuul does its daily runs and/or when unattended-upgrades runs00:23
clarkbdo we want to proactively upgrade them? ahead of that? if not I can check dpkg -l tomorrow and confirm they updated on their own00:23
ianwthat should be right, though i guess things might want a reboot00:23
clarkbya my concern with rebooting mirrors is that zuul's queue is super deep right now00:24
clarkbtrying to balance the various factors in play here (not easy)00:24
ianwbut if the mirrors are getting random failures that's also not great00:24
clarkbya though as far as I can tell they haven't yet. Are you concerned that they may after the upgrade happens but before a reboot?00:25
ianwam i understanding correctly that is the current failure case?  randomish failures from the 1.6 servers?00:25
clarkbianw: 1.8 will 100% fail apparently when it starts exhibiting the problem00:25
clarkb1.6 will be randomish00:25
clarkbnone of our systems should do that unless we sufficiently restart openafs (not sure what that is but reboot definitely is sufficient)00:26
clarkbthis is because all of our systems should've started before the epoch rollover thing00:26
clarkbobviously they won't necessarily all remain in that state in the future as clouds do their cloudy thing00:26
ianwyeah, i don't really like sitting on a time-bomb that as soon as the backend fails, we have a fire-drill to get the servers updated00:27
ianwactually it's not a fire-drill, it's an actual fire at that point :)00:27
clarkbagreed, but we also have a potential multiday zuul backlog that will just implode on itself if we take an outage to fix it. Trying to figure out where in my head the balance is between imploding all those jobs to take a downtime and fix this vs waiting for now and fixing it when zuul is hopefully happier00:28
clarkbmirrors, zuul executors, and static are all involved in that00:29
clarkb(in addition to the afs servers)00:29
ianwi think doing the manual upgrade to 1.8 with servers in emergency probably isn't a bad thing in the long run00:30
ianwit will give us a chance to see 1.8 in action before we upgrade the base os of the servers00:30
ianweffectively one thing at a time.  we can feel more confident about dropping in replacement servers one-by-one if everything is at 1.800:31
clarkbya agreed00:31
clarkbfor mirrors we can disable a cloud in nodepool, wait for it to drain out (up to 3 hours or so each due to tripleo jobs), update the mirror and reboot it00:31
clarkbfor zuul executors we can stop zuul, update openafs, then reboot them one at a time00:31
clarkbzuul should retry the jobs that fail as a result00:32
clarkbthat has an impact but it is smaller00:32
clarkbfungi: we may also want to talk to the release team?00:32
fungi2021-01-15 00:03:31     <--     openstackgerrit (~openstack@eavesdrop01.openstack.org) has quit (*.net *.split)00:32
* fungi sighs00:32
clarkbare they making a ton of releases as part of this milestone that is plugging everything up?00:32
clarkbstatic I think is largely read only and so the impact of that might be less noticeable00:33
clarkbianw: thinking out loud here, doing mirrors, executors, and static first is probably more straightforward and will test our packages betterer?00:34
fungii think the release team really only notices when the releases and tarball sites get out of date (and release notes on docs site)00:34
clarkbfungi: well we'll potentially break our ability to write to tarballs00:35
fungiso 770856 is probably all they'll need00:35
clarkb(if zuul executors are not happy with the upgrade)00:35
fungioh, writes, yes00:35
ianwyeah, i agree we need to make sure they are all in a state of having the latest ppa client running so that if the server does get switched on them we are ok00:35
clarkbfungi: the new package seems to work for reads just fine00:35
clarkband we can do a single zuul executor first then observe it ?00:36
fungiso the current risk is that clients running unpatched 1.8 may spontaneously reboot and stop working, which if we've upgraded the packages (unattended upgrades, ansible, et cetera) and just not rebooted them yet, is probably fine00:36
clarkbfungi: also note the change you are trying to land will update static's install00:36
fungiwe already deal with corrupted afs caches at reboot which blocks afs from working on the mirrors on a frequent basis00:36
clarkbfungi: ya the other risk is that the new packages don't work in some way and we may not notice until we reboot00:37
fungiupdate static's install but not restart afsd or reboot the server00:37
clarkbcorrect00:37
fungisure, but like i said, we already frequently deal with afs not working after an unclean spontaneous reboot00:38
clarkbmirror update, mirrors, and static normally update daily via the periodic pipeline. zuul updates hourly00:38
fungihaving to scramble to work out why it broke differently would be not great, sure00:38
clarkbit's possible that all the zuul executors have already updated?00:38
clarkbfungi: ya I get what you are saying. basically we've reduced the risk of a reboot causing 100% failure00:39
clarkband the chance of any failure post reboot is minimal since reads are working00:39
fungiii  openafs-modules-dkms                   1.8.6-1ubuntu1~xenial100:39
fungithat's ze0100:39
fungiso no, not all anyway00:39
clarkbze01 has not updated00:39
clarkbya I mean once they update00:39
clarkbmirror-update and mirrors won't happen until ~060000:40
clarkbstatic may happen sooner if your change lands00:40
clarkbzuul should happen in the next hour?00:40
clarkbpart of why I'm bringing this up is I need to pop out for dinner in a few short minutes and then ideally also call it a work day00:41
fungii'll be glad to stick around, and am happy to do a controlled reboot of static.o.o after the apache change triggers a fresh deployment there00:41
ianwyeah, i can make a list on that etherpad page and make sure things update00:42
ianwi can also create the vicepa snapshot in preparation for a manual server upgrade00:43
clarkbze01 should've updated openafs-client and openafs-krb5 at 23:43:44 ish. Now trying to cross check with when the ppa updated for xenial00:43
ianwfungi: be good if you could double check the instructions00:43
clarkbthe timestamp for the amd64 openafs-client package seems to be 23:4400:44
clarkbmissed it by seconds00:44
clarkbianw: fungi thanks00:44
clarkbthen maybe tomorrow I can work on rolling reboots of zuul executors and we can do a reboot of mirror-update if fungi's locks sufficiently idle that server?00:45
fungiianw: i'll take a look, sure00:45
clarkbthen maybe aim for Monday upgrade of the servers?00:45
fungii'd rather not reboot mirror-update until the tarballs volume is at least done releasing00:45
fungibut i guess if we need to we need to00:46
clarkbfungi: ya I think we want to wait for it to idle if we can get it to do so00:46
clarkbsince you've got all the locks held right?00:46
clarkbso it should finish the current set of releases then do nothing00:46
clarkbif we aim for monday for the outage we can send out comms tomorrow too and try and warn people as much as possible00:46
clarkb"the outage" being the main afs server outage00:46
clarkband that also gives time for vos releases to complete00:47
fungiright, i terminated the other outstanding vos release calls from mirror-update.o.o (which unfortunately doesn't stop the transactions so isn't actually freeing up the afs servers) and held locks for all of them in a screen session00:47
fungithe possible wrench in the ointment here is that the other replica sync transactions which already got initiated are likely to continue well into next week00:48
clarkbhrm I wonder how terrible it will be to upgrade with those happening :/00:49
fungiwe might just need to consider afs02.dfw a total loss and start all its replicas from scratch again00:49
fungipresumably afs01.dfw will give up trying to replicate to it if the server goes away for a while?00:49
clarkbno clue00:50
clarkbmight be good to have corvus think over some of this stuff too00:50
ianwindeed.  i can shepherd it on my monday, which is usually a very quiet time00:50
fungiany one of these mirror volumes easily needs a weekend or more to do a full release, and we have something like 10 around that size00:50
ianwthat would give y'all your monday to fix anything :)00:50
clarkbI've just double checked that openafs-client role is only applied to zuul-executor and not all of zuul. That is the case00:53
clarkband with that I need to go catch up on household/family things. Thank you for all the help today. If you discover new things or have scheduling thoughts for getting stuff updated maybe update the etherpad and I'll do my best to catch up in the morning?00:53
ianwnp.  i think what i'll do is monitor all those servers and update the etherpad.  i'll add the vicepa snapshot00:54
ianwthen i might send a summary email we can sync on00:54
clarkbsounds good. ++00:54
fungiawesome, i'll be back around shortly, need to switch rooms00:54
clarkbmight also be good to indicate if you think an AU monday outage is a good idea given the other releases and everything when your day ends. That way we can send out a warning tomorrow about it00:54
clarkbI guess we can always send a warning and mention it may not happen depending on server state00:55
clarkbthanks again!00:55
fungii'm going to restart gerritbot, i don't see it coming back since the split00:56
fungi#status log restarted gerritbot since it was lost in a netsplit at 00:03 utc00:57
openstackstatusfungi: finished logging00:57
fungize01 still doesn't have newer openafs packages installed yet. i suspect our hourly jobs for those aren't upgrading distro packages01:34
fungiunattended-upgrades will take care of it anyway01:43
fungize07 looks like it's mid-upgrade01:56
fungiopenafs-modules-dkms is at 1.8.6-5ubuntu1~xenial2 now but openafs-client and openafs-krb5 are still on 1.8.6-1ubuntu1~xenial1 there01:56
fungiahh, yeah, the dkms postinst build is in progress there01:58
ianwok, lunch done, will take a look01:59
fungiit's parented to what looks like an ansible ssh session, so i guess we are updating them that way01:59
fungiyeah, other executors are in a similar state now02:00
fungiso they should be done shortly02:00
ianwianw@ze01:~$ dpkg --list | grep openafs02:20
ianwii  openafs-client                         1.8.6-1ubuntu1~xenial1                          amd64        AFS distributed filesystem client support02:20
ianwii  openafs-krb5                           1.8.6-1ubuntu1~xenial1                          amd64        AFS distributed filesystem Kerberos 5 integration02:20
ianwii  openafs-modules-dkms                   1.8.6-5ubuntu1~xenial2                          all          AFS distributed filesystem kernel module DKMS source02:20
fungiyeah02:22
ianw- name: Install client packages02:22
fungilook at the process list and you'll see lkm compilation underway02:22
fungior i guess it's finished on at least some i'm checking now02:23
ianwi think it's finished on ze0102:23
ianwthe next step in the role should have updated the other two packages02:23
ianwservice-zuul.yaml.log just finished02:27
fungiyeah, looks like they're fully upgraded now02:28
ianwyep02:28
funginow the question whether we should reboot an executor and make sure it's still working correctly02:29
ianwjust looping them all to double check they're all upgraded02:29
ianwall good.  probably worth restarting one ... i can do that if we like02:32
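The kind of loop described here can be done as an ad-hoc ansible command; a sketch, assuming an inventory group named zuul-executor (the real inventory group names may differ):

    # ask every executor for its installed openafs package versions
    ansible zuul-executor -f 5 -m shell -a "dpkg -l | grep openafs"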
fungiit'll restart a bunch of in-progress builds, but that's probably still better than getting caught unawares with a problem when one needs to be rebooted for other reasons02:34
fungiyeah, we've still got a nearly 3k node backlog02:35
ianwi'll restart ze01 for sanity02:36
fungithanks02:37
ianwok, it's back, it's got afs and can look around /afs/openstack.org02:40
fungilooks like it's up and yeah02:40
fungiwe'll want to make sure a docs build succeeds on it02:40
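A few quick post-reboot checks along the lines of what was done on ze01, assuming the standard openafs-client tooling (the docs build itself still has to be confirmed from the Zuul side):

    systemctl status openafs-client   # client service came back up
    lsmod | grep openafs              # kernel module is loaded
    ls /afs/openstack.org/            # the cell is reachable and readable
    fs checkservers                   # reports any fileservers the client thinks are down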
ianwi don't think anything updates the cache for the mirror runs02:59
ianwthe package cache02:59
ianwruns being the ansible runs02:59
fungiunattended-upgrades will get it at some point in 24 hours03:01
fungior we can manually force them earlier03:01
ianwi'm running ansible by hand now with update_cache fix in the openafs-client role03:02
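If we did want to pull the upgrade forward on a host rather than wait for unattended-upgrades, something like this would do it (a sketch, assuming the PPA is already configured on the host):

    sudo apt-get update
    sudo apt-get install --only-upgrade openafs-client openafs-krb5 openafs-modules-dkms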
funginot seeing any obvious job failures attributable to afs writes from ze01, good so far04:02
ianwok, all mirrors should be updated04:12
fungithanks! i think i'm about to nod off, but i'll check through everything again when i wake up04:15
ianwmail sent to discuss list to discuss how to move on from here05:57
ianwi have confirmed that all our clients have fixed 1.8.6 packages05:58
ianwi have implemented the vicepa snapshot on afs01.dfw; we may want to recreate this I guess but the basics are there05:59
ianwnot sure what else to do now06:06
ianwi'll wait for some feedback and we can take it from there06:06
openstackgerritMartin Kopec proposed opendev/system-config master: WIP Deploy refstack with ansible docker  https://review.opendev.org/c/opendev/system-config/+/70525807:25
openstackgerritMartin Kopec proposed opendev/system-config master: WIP Deploy refstack with ansible docker  https://review.opendev.org/c/opendev/system-config/+/70525808:09
openstackgerritMartin Kopec proposed opendev/system-config master: WIP Deploy refstack with ansible docker  https://review.opendev.org/c/opendev/system-config/+/70525809:12
zbrany infra-core around?11:28
mrungeHi there, is there anyone to help me figure out a POST_FAILURE in patches in gate phase? E.g https://review.opendev.org/c/openstack/panko/+/76490613:18
mrungethe same checks work in check phase, but fail in gate phase with this post_failure13:19
mrungeand I can't figure out why13:19
fricklermrunge: looks to me like "just" an unstable job, the same failure is also seen in check here https://zuul.opendev.org/t/openstack/build/3d7cb7959325456f98381916c95081ea13:36
fricklerit's extremely unlikely that job results should consistently depend on whether the job runs in check or gate13:36
mrungefrickler, thank you. It looks like they are failing pretty consistently, but only in gate phase13:37
fricklerwe could either hold a node for you to debug the failure in situ or you could amend the devstack post job to not fail completely in that scenario, allowing for more logs to be present in case of this failure13:37
mrungeis it possible that *just* a signal is not sent to the right target?13:39
fricklerzbr: the keywords to highlight people are either infra-root or config-core (the latter for e.g. project-config reviews). it would also be much more productive if you could just state your issue instead of doing empty pings13:39
fricklermrunge: not sure why exporting the journal would fail, most likely it gets oomed. without other logs present it's difficult to really tell, though13:40
mrungeright13:40
mrungeit looks like this is getting killed13:40
mrungefrickler, is it possible to freeze a node in the case of failure?13:41
mrungealthough, why would the job then just fail in gate and not also in check mode?13:42
fricklermrunge: as I said before, I don't think the latter is true. I'll set up a hold in order to keep a node online when it is failing13:45
mrungeokay, thank you. I can retrigger a job13:46
mrunge(or you could...)13:46
fricklerI just did, the hold is specific to the patch you mentioned above.13:50
mrungethank you, that's awesome13:50
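For reference, the hold frickler set up would look roughly like this via the zuul CLI (project, job, and change numbers taken from the discussion above; exact flags depend on the zuul version in use):

    # hold the failing node set for this change so it can be inspected afterwards
    zuul autohold --tenant openstack --project openstack/panko \
        --job telemetry-dsvm-integration --change 764906 \
        --reason "mrunge debugging gate POST_FAILURE" --count 1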
zbrfrickler: it would be very useful to mention the keywords in channel topic as we have lots of channels with their own rules.13:56
fricklerzbr: we did not put them into the topic in order to avoid getting spammed too often, they are mentioned somewhere in our docs, though13:58
zbrhttps://review.opendev.org/q/hashtag:%22low-hanging%22+(status:open%20OR%20status:merged)13:58
zbrbut with projects like git-review it is quite hard to guess which keyword to use.14:00
zbri am core there, but i still need two others to get my own changes in.14:00
zbrit is a bit tricky as low-activity projects rely more on help from infra in order to get things moving.14:02
zbri still have no idea how we could improve this14:02
zbranother interesting subject would be related to elastic-search licensing: does this impact us? (long-term)14:07
fricklerwell we could discuss dropping the two-review rule for projects like this. maybe add that as a topic to our meeting agenda?14:07
zbrsure, i will do. i think that the requirement is not enforced but it could be tricky to know which project needs it or not.14:08
mrungezbr, you're not alone with this issue.14:14
mrungeusually we dealt with this by distinguishing between low hanging fruits and patches where a second pair of eyes is really appreciated14:15
zbrmrunge: yep, my impression is that simple low-risk changes should be allowed with downgraded quorum, as in single reviewer. still, even if that is agreed, we need to give a good set of examples (good and bad).14:17
mrunge+1 , however it would be hard to give these do's and don'ts14:18
zbrfixing ci pipelines, reqs, broken jobs, is likely to be low risk. but breaking changes not14:21
zbrfunny bit is that dropping support for py27 and py35 is somewhere in-between; depending on the project, it could have breaking impact.14:22
mrungeyes!14:22
zbron the other hand, when you already have CI broken for months, you start to wonder which one is the lesser evil14:22
zbri added topic to https://wiki.openstack.org/wiki/Meetings/InfraTeamMeeting#Weekly_Project_Infrastructure_team_meeting14:23
fricklermrunge: oom theory confirmed: http://paste.openstack.org/show/801664 , if you let me know your ssh key I can give you access for further debugging14:49
mrungegood to know14:52
mrungefrickler, https://github.com/mrunge.keys14:52
fungiis there swap set up on that job?14:52
mrungetbh. I don't know14:52
mrungeI even haven't seen that job description so far14:53
fungishould be able to run free once logged in and find out14:53
fricklerour default of 1G, yes, and it is all used up14:53
fricklermrunge: ssh root@158.69.69.7214:54
frickler/var/log/journal has 409M, so nothing super huge, though14:56
fungiso presumably memory pressure from something else14:56
mrungeall together may be a bit tight with 1gb14:56
mrungeis there a possibility to give it, e.g. 1.5 gig?14:56
fricklerit's strange that xz gets killed, though. maybe it chokes on some special kind of input? or is it just unlucky?14:56
fricklermrunge: there is a patch to have 8g of swap again, but so far it seemed only to be needed for stable/stein14:57
frickleror was it train?14:57
mrungesince the same job passes on check queue, the question is, what is the real difference?14:58
mrungebetween check and gate queue14:58
fricklerhttps://review.opendev.org/c/openstack/devstack/+/75748814:58
fricklermrunge: this failure was in check, it doesn't really matter which queue14:58
mrungeugh14:58
mrungethere goes my theory14:58
fricklerseems just like a 50% chance of failing14:58
fungimrunge: frickler: it's entirely configurable. 1gb is simply the default14:59
mrungefungi, where would I set that?14:59
mrungeI wouldn't want to use a lot more, since it often passes15:00
fungiyou can set a variable in the job definition to indicate how large of a swapfile you want, just be aware it subtracts from available rootfs on at least some providers so don't make it huge unless you know you won't be using a lot of disk15:00
fricklermrunge: see the above patch.15:00
* fungi goes looking for the variable15:00
fungiconfigure_swap_size yeah15:02
mrungesince there is 17 Gig out of 75 Gig used in rootfs, increasing swap to 8 gig would totally fit15:02
fungiyou can see folks overriding it in various jobs: https://codesearch.opendev.org/?q=configure_swap_size15:03
fungi8192 and 0 seem to be the popular override values15:03
fungidepending on whether folks wanted more swap, or more disk and no swap15:03
mrungeso, in this case, nearly all of the 1 gig swap is used15:04
fungifor a bit of background, we basically had to make this compromise when newer linux kernels started refusing to allow sparsely allocated swapfiles15:04
fungithe swapfiles are now preallocated instead of sparse, which means the prior 8gb default made a lot of jobs start failing on providers with smaller rootfs sizes15:05
mrungeany idea where I would find the job description for telemetry-dsvm-integration ?15:06
mrungethat seems to be inherited from somewhere?15:06
fungiit will inherit from the job's parent if there's no description set in the job definition, i believe15:07
fungimrunge: https://zuul.opendev.org/t/openstack/job/telemetry-dsvm-integration says it inherits from https://zuul.opendev.org/t/openstack/job/telemetry-tempest-base15:08
mrungeright, that's what I found so far15:09
fungithe former has no description set in the job, but the latter does15:09
fricklerhmm, according to the manpage, xz may use multiple GB of memory when running with high compression settings like -9. we may want to add --memlimit=x, maybe 256M or so15:12
fungior switch to gz and accept that the files will be a bit larger15:13
fungithough i want to say clarkb found xz was massively better at compressing systemd journals than gz15:14
fricklerwell, failing with oom doesn't sound better, so there seems to be a tradeoff to make ;)15:15
fungiyep, absolutely15:16
frickleranyway, eod for me, maybe someone wants to add extra swap to the held node and test various xz options. or I'll do that later15:17
fungii suppose we could do a fallback to --memlimit=something if the normal compress attempt fails15:17
mrungeyes, that sounds sensible15:17
fungibut yeah, let's get some more suggestions. i need to switch to double-checking afs stuff15:17
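A minimal sketch of that fallback, assuming the journal export ends up as a shell step (the filename here is made up for illustration):

    # try the normal compression first; if xz gets oom-killed, retry with a hard
    # cap on its memory use, accepting a somewhat larger output file
    xz -9 export.journal.txt || xz -f --memlimit=256MiB export.journal.txt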
mrungeand then there is this change: https://github.com/openstack/devstack/commit/d02fa6f856ac5951b8a879c23b57d5a752f2891815:18
mrungebut not causing this issue15:19
mrungehttps://review.opendev.org/c/openstack/devstack/+/77094915:23
mrungethat is the change15:24
mrungethank you fungi and frickler15:24
mrungelet's see how this goes15:24
clarkbxz is much better space wise yes15:28
openstackgerritMerged openstack/project-config master: update gentoo from python 3.6 to python 3.8  https://review.opendev.org/c/openstack/project-config/+/77082815:29
mrungefungi, frickler: I got access to a held node, I don't believe I still need it. that is 158.69.69.72 and could be released back to the pool15:35
mrungeOr can I do that on my own, and if yes: how?15:35
clarkbmrunge: you can't currently release it back on your own15:35
clarkbthank you for letting us know15:35
fungiinfra-root: static.o.o has the new openafs packages installed, unless there are objections i'm going to do a quick reboot in a few minutes so it's using the fixes and then maybe we can look at merging 77085615:36
mrungethank you for giving access to help debugging15:36
clarkbfungi: looks like zuul has caught up a bit (though still well behind)15:36
fungiclarkb: yep15:36
clarkbfungi: no objections from me on the reboot15:37
fungi#status log rebooted static.o.o to pick up the recent openafs fixes15:39
openstackstatusfungi: finished logging15:39
fungiserver's been up 2 minutes now15:42
fungiseems to be working, afs content is served15:43
fungisince there have been no objections, i'm going to self-approve 770856 for now and keep an eye on it once it deploys to make sure we're serving up to date content15:44
clarkbsounds good15:47
*** caiqilong has joined #opendev15:50
caiqilongI received a email said I have been added to "Autopatrolled users". Is that good or bad?15:50
fungicaiqilong: it's "good"15:50
caiqilongfungi: ok. thanks.15:51
fungii watch every change made to the wiki to make sure we catch and roll back any spam, but if i see a user make consistently legitimate changes i add them to the autopatrolled users and then i no longer need to review their edits in the future15:51
fungiclarkb: enough vos releases stopped that i can quickly get a vos status on afs01.dfw again (the tarballs volume release is still underway though)15:54
caiqilongfungi: thanks for your patience.15:54
fungicaiqilong: you're welcome!16:04
openstackgerritMerged opendev/system-config master: Temporarily serve static sites from AFS R+W vols  https://review.opendev.org/c/opendev/system-config/+/77085616:16
fungiokay, weird, looks like the last time we tried to build a gentoo image was september16:24
clarkbI think nodepool will stop building an iamge if we stop telling it to upload to any provider16:27
clarkbmaybe we did that?16:27
fungilooks like we paused it16:28
clarkbah16:28
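To see what the builder thinks is going on, something like this on one of the nodepool builders (a sketch; the image name matches the build log referenced later):

    nodepool dib-image-list | grep gentoo-17-0-systemd   # recent dib builds and their state
    nodepool image-list | grep gentoo-17-0-systemd       # uploads of the image to providers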
openstackgerritJeremy Stanley proposed openstack/project-config master: Un-pause Gentoo image builds  https://review.opendev.org/c/openstack/project-config/+/77103116:32
fungiprometheanfire: clarkb: ^16:32
prometheanfireyarp :D16:33
clarkbfungi: is tarballs the only thing releasing now or is it just fewer things overall?17:28
clarkbI guess the mirrors are likely still releasing due to their size17:28
clarkbfungi: I've +2'd the gentoo unpause, not sure if you want to approve it or wait for another review17:28
fungiclarkb: what's "releasing" is a bit misleading. the previously killed vos release calls also still have sync transactions underway, should be able to spot them by listing the transactions17:30
clarkbgotcha17:30
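The in-flight operations can be listed per fileserver; a sketch, with the server name and auth flag as assumptions based on the hosts discussed:

    # active volume transactions (e.g. releases still syncing) on a fileserver
    vos status -server afs01.dfw.openstack.org -localauth
    # locked volume entries in the VLDB
    vos listvldb -server afs01.dfw.openstack.org -localauth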
fungiclarkb: i'll watch the build log for gentoo if you approve17:31
clarkbok approving now17:31
fungithanks!17:32
clarkbfungi: making sure I'm up to date on people's thoughts re afs upgrades. We've rebooted enough afs clients to be reasonably confident that when the other clients reboot they will be fine (and all clients have the new packages installed)17:38
clarkband that means we don't need to worry about doing rolling reboots of all the mirrors and zuul executors?17:38
openstackgerritMerged openstack/project-config master: Un-pause Gentoo image builds  https://review.opendev.org/c/openstack/project-config/+/77103117:41
fungiclarkb: yes, we rebooted at least the inap mirror, ze01 and static.o.o17:45
clarkbthat should be a pretty good representative sample17:45
fungii think that's a reasonable cross-section, yeah17:46
clarkbok, is there anything afs related that you think I should be doing next to help? Looks like ianw did a vicepa snapshot already17:47
clarkbare we happy with that lvm level redundancy?17:47
clarkbI think yesterday we had suggested it as the least overhead option (lvm is often a mystery to me so want to double check I should be doing anything else around backups/snapshots to help)17:48
fungias a temporary measure it seems like a fine solution to me17:48
fungibut we should make sure we have enough available extents in the vg to allow the volumes to diverge for a while17:49
clarkbfungi: is that something vgs would show us?17:50
fungia quick rundown of how lvm2 snapshotting works: all the extents (device blocks essentially) which belong to the original volume are marked as also belonging to the snapshot. when new writes are committed in the original volume the old extents are kept and new extents are used instead. as the volumes diverge obviously more additional extents will be used up to (eventually) the size of the original volume17:51
fungiitself17:51
fungiand yeah, vgs will show you your available extents/space17:51
fungiwe can always tack on another cinder device as a pv in that vg if needed17:52
clarkbianw did that already according to the etherpad (a full size 1tb volume)17:53
clarkbI've just loaded my keys and will double check it too17:53
clarkbvgs only shows vfree and vsize by default17:53
* clarkb reads manpages17:53
clarkboh I may need lvs17:54
clarkbya the actual lv and its snap show up there17:54
fungiright, lvs to see the volumes, vgs to see how much room there is for them to grow17:56
clarkbin this case I think we've created two lvs one at 4TB and one at 1TB which fills up the entire 5TB vg17:57
clarkbthe data% of the 1TB snapshot is .55%. Should I read that as this snapshot is using 0.55% of the space allocated to it?17:57
clarkbif so then I think we are currently quite happy with that state?17:58
clarkbfungi: if you get a chance maybe you want to double check all that too?18:02
clarkbcorvus: if you haven't seen https://etherpad.opendev.org/p/infra-openafs-1.8 yet, it would be great if you could review that18:03
clarkbjust to get another set of eyeballs on the whole situation18:03
fungiyep, that sounds right, but i'll take a closer look when i'm done eating18:04
clarkbthanks, that reminds me I should find breakfast18:05
fungiyeah, the snapshot setup on afs01.dfw looks okay to me, it's up to 0.56% in use, but at this rate of growth we've probably got plenty of time. also we'd presumably recreate a newer snapshot immediately prior to starting the upgrade18:35
fungialso calling sync before making the snapshot may be a good idea18:36
clarkbfungi: you should add that to the etherpad18:36
fungii am18:36
clarkbthanks!18:36
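A minimal sketch of the snapshot arrangement being checked here, with made-up volume group and volume names (the real names on afs01.dfw may differ):

    sync                                   # flush pending writes first, per the note above
    lvcreate --snapshot --size 1T --name vicepa-snap /dev/vg_main/vicepa
    lvs    # the snapshot appears here with a Data% column tracking how far it has diverged
    vgs    # free extents left in the VG for that divergence to grow into
    # if the VG runs low, another cinder volume can be attached and added:
    #   pvcreate /dev/xvdX && vgextend vg_main /dev/xvdX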
auristorclarkb: if you wait until jan 31 the need to update the servers to 1.8 will no longer be necessary18:50
clarkbauristor: oh does the bit flip over again at that point?18:51
auristorby the 31st the number of non-zero bits in the count of seconds since unix epoch will once again be great enough to permit time based random connection ids to be random18:51
clarkbaha18:51
fungiauristor: ooh, excellent point, though we have other reasons for wanting to upgrade too18:52
clarkbauristor: fwiw I think we should upgrade to 1.8 anyway.18:52
clarkbauristor: is changing the key format the only thing that you need to do in the upgrade? And it can't be done as a rolling upgrade right?18:52
auristorthe underlying problem that openafs 1.8 had was that an effort was made to replace time based randomness with RAND_bytes and the implementation was botched.18:52
auristorthe key does not need to change.18:53
clarkbauristor: akeyconvert was the thing that ianw found18:53
clarkbsomeone mentioned that converting the file format for the key was necessary?18:53
auristor1.6 stores non-DES keys in a krb5 keytab.   1.8 uses the KeyFileExt that AuriStor contributed.18:53
clarkbright, so to upgrade we stop 1.6, run akeyconvert, then start 1.8?18:54
clarkbon the fileservers and db servers18:54
auristorakeyconvert is used to import the keys from the keytab to the KeyFileExt.18:54
auristor1.6 doesn't know about KeyFileExt and doesn't care if it exists.   1.8 won't use the keytab.18:54
clarkbgot it18:54
clarkbso we could convert, then stop 1.6 then start 1.8. But do need to convert before 1.8 starts18:55
auristorcreate the KeyFileExt and distribute to all servers as you would the keytab.    Then upgrade / downgrade as you wish.18:55
clarkbdoes that also mean we need to update zuul secrets that use keytabs?18:55
clarkbthough all of our clients are already 1.8 so maybe they do the correct thing already?18:56
auristorKeyFileExt is used by all openafs 1.8 servers and admin tools that use -localauth18:56
clarkbgot it18:56
auristoropenafs clients do not use keytabs or KeyFileExt18:56
clarkbauristor: and we shouldn't have a mix of 1.6 and 1.8 servers?18:57
auristorAuriStorFS clients use krb5 keytabs and provide encryption for anonymous processes18:57
clarkbI seem to recall reading that once upon a time but haven't found confirmation of that18:57
auristoryou can mix 1.6 and 1.818:57
auristorthere are no protocol, database or vice partition changes18:57
clarkbthat is really good to know actually18:57
clarkbgiven ^ I think we should upgrade the ord fileserver first18:58
auristorfor that matter you can mix AuriStorFS and OpenAFS18:58
clarkbbecause that will have the least impact, then we could do a rolling update across the others18:58
clarkbianw: ^ highlight mark in irc for some interesting details18:58
auristorbuild a 1.8 test fileserver and add it to the cell to make sure your KeyFileExt works18:58
auristoronce you are comfortable it does, distribute it to all the other db and file servers in the cell18:59
clarkbauristor: ya that is essentially what the ord server is (it is in another location compared to the others and we don't use it for much because releases to it over the internet are slow)18:59
clarkbwe should be able to use it as the test in this case with minimal impact (others should double check that assertion though)18:59
clarkbI'll update our etherpad with this new info after lunch19:00
fungiall of the above sounds great, and also greatly simplifies the planned upgrade19:02
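Putting auristor's notes together, the key conversion plus a rolling server upgrade might look roughly like this (package and unit names are assumptions based on the ubuntu openafs packaging; the authoritative plan lives in the etherpad):

    # import the cell keys from the krb5 keytab into the KeyFileExt format that 1.8
    # servers read (akeyconvert ships with the 1.8 tooling), then distribute the
    # resulting KeyFileExt to every db and file server just like the keytab
    akeyconvert
    # then, one server at a time, starting with the lightly used ord fileserver:
    systemctl stop openafs-fileserver
    apt-get install openafs-fileserver openafs-dbserver   # 1.8 packages from the PPA
    systemctl start openafs-fileserver
    # 1.6 and 1.8 servers can coexist in the cell, so the rest can follow once ord looks healthy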
openstackgerritGhanshyam proposed openstack/project-config master: Combine acl file for all interop source code repo  https://review.opendev.org/c/openstack/project-config/+/77106619:28
fungiprometheanfire: gentoo image build underway, log here: https://nb01.opendev.org/gentoo-17-0-systemd-0000143983.log19:30
fungihopefully we'll know shortly if we have usable images again19:30
fungioh, though it's been on "Emerging (1 of 1) dev-python/packaging-20.7::gentoo" for ~1.5 hours according to the timestamp. maybe that build got terminated19:32
funginot terminated, this is still in the process table on nb01 since 18:03... "/bin/bash /tmp/in_target.d/pre-install.d/02-gentoo-04-install-desired-python"19:33
fungithis child of it has been running since 18:05... "/usr/bin/python3.8 -b /usr/lib/python-exec/python3.8/emerge --binpkg-respect-use --rebuilt-binaries=y --usepkg=y --with-bdeps=y --binpkg-changed-deps=y --quiet --jobs=2 --autounmask=n --oneshot --update --newuse --deep --nodeps dev-python/packaging"19:34
fungilooks like it might be deadlocked (livelocked?), strace shows it waiting on a private futex19:35
clarkband it is that pid with the futex?19:37
clarkbor is another child of ^ (might be helpful to know which exact process is sitting on the futex)19:37
clarkbfungi: ianw I updated https://etherpad.opendev.org/p/infra-openafs-1.8 with the notes from auristor but didn't change the upgrade plan. Instead just added the new info and proposed a potentially different upgrade plan19:38
fungii also replied to the ml thread with some of it19:39
fungiclarkb: the process i straced didn't have any children in the process table19:40
clarkbyou wouldn't expect emerge to deadlock since that is a tool used by many users, but maybe we've managed to trip a weird corner case in emerging python packaging19:41
fungimaybe it doesn't like running with linux 4.15 or something19:45
openstackgerritJeremy Stanley proposed openstack/project-config master: Set up access for #openinfra channel  https://review.opendev.org/c/openstack/project-config/+/77107319:55
fungiyeah, i'm fairly certain it's not going to terminate on its own, will try killing it and see what happens on the next attempt20:40
fungithat caused it to fail and clean up20:43
clarkbfungi: it looks like zuul might be caught up by tomorrow20:53
clarkbassuming general friday trends continue20:53
fungithat'll be nice. gerrit and scheduler restart over the weekend then?20:53
clarkbwhat is the gerrit restart for?20:53
fungii'll be around and can drive20:53
clarkbbut ya getting a zuul scheduelr in would get us the WIP support which would be great20:54
fungigerrit restart will be needed for the zuul results plugin, if we approve the stack20:54
clarkbI'll be around but in and out. Family is telling me that they are stir crazy and need to get out20:54
fungiyeah, no worries20:54
clarkbfungi: ah, I haven't had a chance to look at it since my last pass of reivews20:54
clarkbdid the server rename stuff get pulled out of the stack? I really do think simplifying that is a good idea for now and then rename when we actually rename20:55
fungichecking20:56
fungiyeah, https://review.opendev.org/767059 is still near the top of the stack20:58
fungiwe could do the gitea upgrade maybe?20:59
clarkboh ya thats another one that got ignored due to the afs things20:59
clarkbya if people are happy with the test node results on that one I think we can proceed21:00
clarkblooks like ianw did check the held node and I think you already did so that is three of us21:00
fungii'm happy and seems like ianw tried it and didn't see any problem either21:00
clarkbya21:00
clarkbI would probably prioritize zuul then gitea then gerrit21:01
clarkbzuul since it is most sensitive to the load21:01
clarkband gerrit last since I'm not sure if we are going to rebase that and stop renaming the server in testing (my preference is that we do that)21:01
fungiclarkb: have any suggestions on the best way to reach out to nibalizer about https://review.opendev.org/771073 ?21:03
clarkbya let me see21:04
fungistrange, nodepool still doesn't seem to have started trying to build another gentoo image21:22
clarkbif other images are building the slots may be full21:23
fungiyeah, likely21:23
clarkbsince nodepool may have decided to start another build after you killed the broken gentoo build21:23
openstackgerritMerged opendev/system-config master: system-config-run-review: remove review-dev server  https://review.opendev.org/c/opendev/system-config/+/76686721:39
ianwthanks for looking in on the afs stuff22:18
ianwi'd agree we can just start with ORD and see how it goes22:18
ianwi think it's good to have a plan in case things restart before the end of the month though :)22:19
fungiianw: of course, and i'm not suggesting we wait to the end of the month to start either, just that we also can take it more slowly and carefully (events permitting)23:05
openstackgerritLon Hohberger proposed openstack/diskimage-builder master: Pass DIB image's kernel version when checking modules  https://review.opendev.org/c/openstack/diskimage-builder/+/77109223:12
*** yoctozepto6 has joined #opendev23:15
*** yoctozepto has quit IRC23:17
*** yoctozepto6 is now known as yoctozepto23:17
*** logan- has joined #opendev23:18
*** ysandeep|away has quit IRC23:20
*** ysandeep has joined #opendev23:22
*** yoctozepto4 has joined #opendev23:42
*** yoctozepto has quit IRC23:44
*** yoctozepto4 is now known as yoctozepto23:44
*** yoctozepto4 has joined #opendev23:58
*** akrpan-pure has joined #opendev23:59
*** yoctozepto has quit IRC23:59
*** yoctozepto4 is now known as yoctozepto23:59
