Wednesday, 2018-01-10

clarkbugh ubuntu has kernels up for xenial now but not for trusty unless you use trusty with hardware enablement kernels00:01
clarkbwhich btw is not their default trusty server kernel00:01
clarkbinfra-root ^00:02
mordredclarkb: AWESOME00:03
clarkbI'm going to patch my local fileserver then will probably start doing logstash workers00:04
clarkbas those are probably a good canary and easily rebuilt00:04
clarkb(I had hoped to start with infracloud, maybe we update infracloud to the hwe kernel?)00:04
corvusclarkb: where's the info about hwe?00:06
clarkbcorvus: https://wiki.ubuntu.com/Kernel/LTSEnablementStack is the generic doc on it00:07
corvusclarkb: right, i mean where did you see that they aren't doing stuff for hwe?00:08
corvuser for not-hwe?00:08
clarkboh00:08
clarkbhttps://wiki.ubuntu.com/SecurityTeam/KnowledgeBase/SpectreAndMeltdown and the usn site00:09
clarkbcurrently they have only published for the 14.04 + hwe kernel00:09
clarkbwell that and linux-aws00:09
clarkbheh now https://usn.ubuntu.com/usn/usn-3524-1/ is there00:11
clarkbso maybe just lag involved00:11
clarkbin that case I will likely do infracloud next to pick up ^00:11
clarkbmy local fileserver doesn't seem to see the new kernels yet00:14
clarkbgoing to sort that out00:15
corvusclarkb: yeah, on a trusty machine i have, i only see linux-image-3.13.0-137-generic  not 13900:17
clarkbI'm pointed at security.ubuntu.com and I see the package in the package list if I browse to it00:19
* clarkb tries again00:19
clarkbya apt not seeing it for some reason00:20
*** rlandy has quit IRC00:24
clarkblogstash-worker01 does see the new kernel though00:24
clarkband I have the same set of xenial-security entries in my apt sources list so this has to be cdn or caching etc00:26
clarkbcorvus: I think the InRelease and Release files I am being served are older than the packages in the repo00:31
corvusi need to go offline for a bit to patch my workstation and the system my irc client is on... i may be gone for a bit, but will rejoin asap00:32
clarkbok00:33
clarkbsomeone completely unrelated to openstack says xenial's 4.4.0-108.131 kernel panics on their workstation00:34
clarkbso uh be prepared for that I guess :/00:34
corvusthat's great00:34
*** rosmaita_ has quit IRC00:36
clarkbI might just turn my fileserver back off and patch it when things get better :/00:37
corvusclarkb: so... permanently off?00:43
corvusokay, really signing off now00:43
*** corvus has quit IRC00:44
clarkbI'm going to do logstash-worker01 by hand really quickly to get a xenial canary up00:46
clarkb aha I think I sorted out my local fileserver issue00:50
clarkbI was using the old hwe00:50
clarkbdue to chipset issues00:50
clarkbok logstash-worker01 came up00:53
clarkbit does not have the bugs: cpu_insecure entry in the cpuinfo file but does have [    0.000000] Kernel/User page tables isolation: enabled in dmesg00:53
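A quick way to run both of those checks on a freshly rebooted host (a minimal sketch; patched kernels expose the flag as cpu_insecure in early builds and cpu_meltdown in later ones, so the dmesg line is the authoritative signal):

    # positive confirmation from the running kernel that page table isolation is active
    dmesg | grep 'Kernel/User page tables isolation: enabled'
    # informational only: the CPU bug flag, if this kernel version exposes it
    grep -E 'bugs[[:space:]]*:.*(cpu_insecure|cpu_meltdown)' /proc/cpuinfo | head -1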
clarkbhttps://etherpad.openstack.org/p/infra-meltdown-patching01:02
clarkbusing compute000.vanilla.ic.openstack.org as infracloud canary01:09
clarkbjust manually going to do that one too to figure out how to check if kpti is enabled and if instances will even boot up01:09
fungiyou should see it mentioned in dmesg at least01:10
clarkbya on xenial it was in dmesg01:11
clarkbdetails like this going into the etherpad above01:12
fungiKernel/User page tables isolation: enabled01:12
fungithat's what i was looking for on my systems01:12
clarkbyup01:14
clarkbarg sudoers contents changed on infracloud nodes01:14
clarkbfungi: http://paste.openstack.org/show/641745/ should be safe to accept the new version there ya?01:19
*** tristanC has joined #openstack-infra-incident01:21
clarkbI'm going ahead and choosing the new version so this problem doesn't persist as I believe it to be largely equivalent to the old version01:23
fungiyeah, it shouldn't cause any problems01:25
fungiand puppet will undo it anyway01:25
clarkboh do we manage the top level sudoers file? I kept our unattended upgrades config because I know puppet manages that one01:26
clarkband am keeping the nova conf01:26
clarkbarg sudoers update did end up breaking sudo for me01:28
clarkbI'm going to kick.sh compute000.vanilla.ic.openstack.org so I can reboot it01:28
clarkbso I think we want to keep our versions of all those config files on trusty nodes01:28
*** rosmaita has joined #openstack-infra-incident01:29
fungiyour account was in the sudo group, right?01:30
ianwso sorry do we need repos for xenial, or is it just update & reboot?01:30
clarkbyes I am in the sudo group01:30
ianwi can go through the ze nodes and do that, since they're tolerant to dying01:30
clarkbianw: it should be just update and reboot. see https://etherpad.openstack.org/p/infra-meltdown-patching01:31
clarkbfungi: but it wants my password now01:31
fungiweird that it broke sudo for you. it's not clear to me what in that diff would have01:31
fungioh01:31
fungiOH01:31
fungiit dropped NOPASSWD01:31
clarkband puppet master is having a hard time ssh'ing to compute000.vanilla.ic.openstack.org now (I am ssh'd though)01:32
fungii totally overlooked that since the %sudo stanza also moved01:32
clarkbya oh well01:32
clarkbas soon as I can get puppetmaster to ssh in it should fix it01:32
clarkbbut uh thats not working01:33
ianwok, trying on ze01 and see what happens01:33
fungiwas there a sudo package uprade or something?01:34
fungii guess you're upgrading more than just kernel and microcode packages01:34
clarkbfungi: yes01:34
clarkbfungi: ya I was doing full dist-upgrades01:34
clarkbassuming unattended upgrades were mostly keeping up with the other stuff01:34
clarkbwhich seems to be the case on xenial at least01:34
clarkboh wow compute000 doesn't permit root ssh01:35
clarkbso I may have toasted it and it needs to be rebuilt :/01:35
clarkbsshd_config was not something I was asked to update01:35
fungii bet another package upgraded (openssh-server?) and disabled root login01:35
clarkboh wait there is a specific whitelist for puppetmaster on compute00001:36
clarkbin sshd_config so why isn't this working01:36
clarkbssh -vvv seems to indicate it is a tcp problem01:37
clarkboh wait no that was just slow but it connected via tcp01:37
clarkbit might just be slow as molasses01:38
ianwhow long does the dkms take ...01:38
ianwohh, it's doing afs01:38
clarkbyup confirmed just slow as can be01:40
clarkbI'm going to manually replace the sudoers file as I doubt ansible + puppet will ever be happy with this slow connection01:40
ianwok, ze01 done01:42
ianwi will let it run for a while as i get some lunch, and if all good, i'll go through the rest and update01:42
clarkbalso based on how slow this connection is I'm not entirely convinced we should be using infracloud in its current state01:42
clarkbianw: sounds good, thanks01:42
ianwi'll also stop the executor and clear out /var/lib/zuul/builds on each host, as it seems there's a little cruft in there01:43
clarkbcompute000 is rebooting now01:43
clarkbit will probably take 10 minutes to come back up so I can check on it01:44
clarkbfungi: does dist-upgrade -y imply responding N (eg keep old version of conf file) when there are conflicts in packages?01:44
clarkbfungi: since I think we do want N in all cases, but -y is yes not no, even though N is the default01:45
clarkbI'm going to put the elasticsearch cluster in the "don't worry about rebalancing indexes" mode then reboot the whole cluster at once01:46
clarkbthat's the best way of ripping off that bandaid I think01:46
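For reference, the "don't worry about rebalancing indexes" mode is usually done by disabling shard allocation before the mass reboot and re-enabling it once all members are back; a minimal sketch against the cluster settings API (the localhost:9200 endpoint is an assumption about how this cluster is reached):

    # before rebooting the cluster members: stop shard allocation
    curl -XPUT 'http://localhost:9200/_cluster/settings' \
      -d '{"transient": {"cluster.routing.allocation.enable": "none"}}'
    # after all members are back up: let shards recover and rebalance
    curl -XPUT 'http://localhost:9200/_cluster/settings' \
      -d '{"transient": {"cluster.routing.allocation.enable": "all"}}'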
fungiclarkb: not really sure. if you use the noninteractive mode it will keep old configs01:46
clarkbok rebooting elasticsearch cluster now01:51
clarkbelasticsearch is recovering shards now01:55
clarkbfungi: --force-confold as a dpkg option is the magic there apparently01:59
fungiyeah, you can do it that way too01:59
clarkbfungi: `export DEBIAN_FRONTEND=noninteractive && apt-get update && apt-get -o Dpkg::Options::="--force-confold" dist-upgrade -y` how does that look?02:01
fungiclarkb: remarkably similar to the syntax we're using somewhere in our automation02:02
fungiand yes, should do what you want02:02
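For completeness, a commonly used variant of that command also passes --force-confdef, which tells dpkg to take the default action for changed conffiles when one exists and, combined with --force-confold, keep the old file otherwise (a sketch of that variant, not exactly what was run here):

    export DEBIAN_FRONTEND=noninteractive
    apt-get update
    apt-get -y \
      -o Dpkg::Options::="--force-confdef" \
      -o Dpkg::Options::="--force-confold" \
      dist-upgrade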
clarkblooks like archive.ubuntu.com has the new kernel now02:03
clarkbso don't need to update sources.list on infracloud02:03
clarkbif compute001 comes out of this looking happy I'm going to ansible the above command across infracloud compute nodes and I guess start doing reboots?02:04
clarkbI figure control plane is lesser concern and probably needs more eyeballs on it02:04
fungisounds good. i should be able to help tomorrow but it's getting late here02:05
clarkbI also confirmed that trusty has the same dmesg check as xenial02:06
fungiexcellent02:08
clarkbhrm rebooting broke neutron networking02:15
fungiouch02:16
clarkbpuppet is what sets that up02:16
clarkband since puppetmaster can't ssh to half these nodes...02:16
clarkbok 000 and 001 in vanilla are working with the manually run steps that puppet would normally run02:23
clarkbapt-get update is running on those vanilla compute hosts that puppetmaster can ssh into02:23
clarkbI think I will go ahead and upgrade those, reboot them, then fix networking, then use the dmesg check for finding those that were missed due to ssh issues02:24
*** mordred has quit IRC02:42
*** mordred has joined #openstack-infra-incident02:44
*** rosmaita has quit IRC02:58
ianwdoing the executors now03:05
clarkbI'm tracking the various states of infracloud things and its not very pretty...03:06
clarkbbunch of nodes can't be hit, at least one node has a ro / that appears to be due to medium errors03:06
clarkbI'm gonna continue working through this but I'm beginning to think we might want to seriously consider not infraclouding anymore03:07
clarkbI expect I'll be in a position to do mass reboots in an hour or so03:08
* clarkb finds dinner03:08
clarkbthats cool the bad disk/fs happened just now ish03:19
clarkbI guess I'll give it a reboot and see if it comes up and is patchable03:19
*** rosmaita has joined #openstack-infra-incident03:42
clarkbdoing mass reboot of chocolate now03:59
clarkb(it updated quicker than vanilla)03:59
clarkball the gate jobs are failing on a tox siblings thing anyways so I figure just go for it04:00
clarkb(also zuul should retry anyways)04:00
clarkbchocolate nodes are coming back and I am applying their network config04:07
clarkbI expect that I will have all of the reachable chocolate compute hosts patched shortly04:07
clarkbvanilla compute hosts are rebooting now as well04:10
*** rosmaita has quit IRC04:21
clarkbI think all infracloud computes that are reachable are patched04:40
clarkbthe chocolate cloud appears to be functioning too04:40
clarkbbut the vanilla cloud is somewhat ambiguous going by grafana data04:40
clarkbTL;DR for those of you catching up in the morning. I think a good next step would be to generate some proper node lists and we can start going through them. What ianw and I did this evening was mostly using representative sets like logstash-workers and elasticsearch* and infracloud to get the ball rolling. I've tried to capture my notes on what I've done to get things updated properly04:55
clarkbhttps://etherpad.openstack.org/p/infra-meltdown-patching has more infos04:56
clarkband with that I'm off to bed04:57
*** jeblair has joined #openstack-infra-incident05:21
*** jeblair is now known as corvus05:21
*** frickler has joined #openstack-infra-incident08:53
*** rosmaita has joined #openstack-infra-incident13:04
*** rlandy has joined #openstack-infra-incident13:34
*** pabelanger has quit IRC14:56
*** pabelanger has joined #openstack-infra-incident14:56
-openstackstatus- NOTICE: Gerrit is being restarted due to slowness and to apply kernel patches14:58
fungito also track in here, review.o.o is running the updated kernel as of a few minutes ago (thanks pabelanger!)15:13
pabelangerYes, seems things are working15:14
fungiand frickler noticed the mirror in ic-choco was broken. apparently got shutdown when the hosts were rebooted for newer kernels, and i guess nova doesn't start instances automagically when the host comes back online?15:14
fungii should check vanilla too, now that i think about it15:14
fungii can't ssh to it either. not a good sign15:15
pabelangeryah, I think you need a nove.conf setting to turn instances back on15:15
pabelangernova.conf*15:15
fungiconfirmed, it too was in a shutoff state15:15
pabelangerack15:16
*** dmsimard has joined #openstack-infra-incident15:19
fricklerI haven't dug into how we set up infra cloud yet, but it should be "resume_guests_state_on_host_boot = true" in nova.conf/DEFAULT15:35
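If we wanted that behaviour, the option would really need to land in the puppet-managed nova.conf; a minimal sketch of setting it by hand on a compute node (crudini being installed there is an assumption, and puppet would revert a manual edit on its next run):

    # results in: [DEFAULT] resume_guests_state_on_host_boot = true
    crudini --set /etc/nova/nova.conf DEFAULT resume_guests_state_on_host_boot true
    service nova-compute restart    # trusty; 'systemctl restart nova-compute' on xenial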
corvusoh, it looks like we're up to 4.4.0-109 this morning15:38
corvusit was 108 yesterday15:38
corvus(for xenial)15:39
corvusand trusty+hwe15:41
corvusxenial: https://usn.ubuntu.com/usn/usn-3522-3/15:41
corvusapparently fixes a regression that caused "a few systems failed to boot successfully"15:42
corvusso we *probably* don't have to redo the 108 hosts -- if they booted.15:42
clarkbfungi: oh sorry I completely spaced on the mirrors16:36
fungino problem. they're working now16:43
clarkbis review.o.o the only thing that has been patched since last night?16:44
fungicorvus: yeah, i say we just make sure the updated kernels get installed so the next time we reboot they'll also hopefully not fail to reboot16:44
clarkbI'll probably work on generating server lists as next step and adding that into the etherpad16:44
fungiclarkb: yes, we missed the opportunity to patch zuulv3.o.o when it got restarted for excessive memory use16:44
dmsimardbtw I mentioned yesterday a playbook to get an inventory of patched and unpatched nodes, I cleaned what I had and pushed it here as WIP: https://github.com/dmsimard/ansible-meltdown-spectre-inventory16:45
dmsimardneed to afk16:45
fungiclarkb: oh, also i saw that debian oldstable/jessie got meltdown patches if you want to update your personal system16:45
clarkbfungi: oh thanks for that heads up. I may actually start with that and patch my irc client server16:46
clarkbya gonna get that taken care of, will be afk for a bit16:48
*** clarkb has quit IRC16:53
*** clarkb1 has joined #openstack-infra-incident16:54
fungihelo clarkb116:56
clarkb1hello, now to figure out my nick situation16:56
fungiindeed!16:56
fungii see webkitgtk+ just released mitigation patches along with an advisory, so maybe chrome and safari will be a little safer shortly?16:56
*** rosmaita has quit IRC16:58
*** clarkb1 is now known as clarkb16:58
clarkbok I think thats irc mostly sorted out. Need to join a couple dozen more channels but that can happen later17:00
clarkbdmsimard: you had to afk, but how work in progress is that playbook? should we avoid running it against say infracloud or the rest of infra?17:01
clarkbdoes sshing to compute030.vanilla.ic.openstack.org close the connection just as soon as you actually login?17:02
clarkbor rather does it do that for anyone else?17:02
corvusclarkb: yep17:02
clarkb`ansible -i /etc/ansible/hosts/openstack '*:!git*.openstack.org' -m shell -a "dmesg | grep 'Kernel/User page tables isolation: enabled'"` is what I'm going to start with to generate a list of what is and what isn't patched17:04
clarkbthat excludes infracloud and excludes the centos git servers which should already be patched17:04
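One way to get an explicit per-host marker out of that same ad-hoc run, so the output can be grepped into patched/unpatched lists instead of relying on task failures (a sketch, not the exact command used here):

    ansible -i /etc/ansible/hosts/openstack '*:!git*.openstack.org' -m shell \
      -a "dmesg | grep -q 'Kernel/User page tables isolation: enabled' && echo PATCHED || echo UNPATCHED"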
clarkbcorvus: completely unreachable from local and puppetmaster, unreachable from puppetmaster but reachable from local, reachable but the connection gets killed like for compute030, and sad hard drives plus RO / seems to be the rough set of different ways things are broken in infracloud17:06
pabelangeryah, sad HDDs in infracloud are happening more often17:15
clarkbvanilla also appears to be in worse shape17:16
clarkbI think the chocolate servers are newer17:17
clarkbI'm going to reorganize the etherpad a bit as I don't think the trusty vs xenial distinction matters much17:25
clarkbok I think https://etherpad.openstack.org/p/infra-meltdown-patching is fairly well organized now17:33
clarkbI'm going to continue to pick off some of the easy ones like translate-dev etherpad-dev logstash.o.o subunit workers17:36
clarkbbut then after breakfast I probably should context switch back to infracloud and finish that up17:36
pabelangerI can start picking up some hosts too17:38
fungii'm just about caught up with other stuff to the point where i can as well17:39
pabelangerwill do kerberos servers, since they fail over17:40
*** rosmaita has joined #openstack-infra-incident17:41
pabelangerokay, rebooting kcd01, run-kprop.sh worked as expected17:42
pabelangeractually, let me confirm we have right kernel first17:42
pabelangerlinux-image-3.13.0-139-generic17:44
pabelangerrebooting now17:44
clarkbon trusty unattended upgrades pulled in the latest kernel17:45
clarkbbut ubuntu released a newer kernel for xenial (109 instead of 108) that addressed some booting issues, so we should install that on unpatched servers17:45
pabelanger[    0.000000] Kernel/User page tables isolation: enabled17:46
clarkb(I'm still running an apt-get update and dist-upgrade per the etherpad on the servers I'm patching regardless)17:47
pabelangeryah, just doing that on kdc04.o.o now17:47
pabelangerwhich is xenial and got latest 109 kernel17:47
pabelangerlists-upgrade-test.openstack.org | FAILED17:49
pabelangercan we just delete that now?17:49
pabelangerclarkb: corvus: ^17:50
pabelangerlogstash-worker-test.openstack.org | FAILED I guess too17:50
clarkbI have no idea what logstash-worker-test.o.o is17:52
clarkbso I think it can be cleaned up17:52
pabelangerworking nb03.o.o and nb04.o.o17:52
clarkblists-upgrade-test was used to test the inplace upgrade of lists.o.o I don't think we need the server but corvus should confirm17:52
clarkbfungi: maybe you want to do the wiki hosts as I think you understand their current situation best17:54
corvusclarkb: confirmed -- lists-17:55
corvusha17:56
corvusupgrade-test is not needed :)17:56
fungiclarkb: yup, i'll get wiki-upgrade-test (a.k.a. wiki.o.o) and wiki-dev now17:57
pabelangerack, I'll clean up both in a moment17:58
pabelangershould we wait until later this evening for nodepool-launchers or fine to reboot now?17:59
clarkbI'm kinda leaning towards ripping the bandaid off on this one18:00
clarkband nodepool should handle that sanely18:00
pabelangerYah, launcher should bring nodes online again. nodepool.o.o (zookeeper) we'll need to do when we stop zuul18:01
clarkbspeaking of bandaids, just going to do codesearch since there isn't a good way of doing that one without an outage18:03
clarkbmaybe I should send email to the dev list first though18:03
clarkbya writing an email now18:04
pabelangerI didn't see anybody on pbx.o.o, so I've rebooted it18:05
pabelangermoving to nl01.o.o18:06
pabelangerclarkb: okay, we ready for reboot of nl01?18:08
clarkbI think so, that should be zero outage from a user perspective18:08
clarkbI'm almost done with this email too18:09
pabelangerokay, nl01 rebooted18:10
pabelangerand back online18:10
clarkbemail sent18:11
clarkboh hey arm64 email I should read too18:11
fungithe production wiki server is taking a while to boot18:12
fungimay be doing a fsck... checking console now18:12
pabelangerokay, both nodepool-launchers are done18:14
fungiyeah, fsck in progress, ~1/3 complete now18:14
clarkbcodesearch is reindexing18:15
pabelangerclarkb: how do we want to handle mirrors? No impact would be to launch new mirrors or disable provider in nodepool.yaml18:15
clarkbpabelanger: ya considering how painful the last day or so has been for jobs we may want to be extra careful there18:16
fungireplacing mirrors loses their warm cache18:16
clarkbeither boot new instances or do them when we do zuul18:16
pabelangerfungi: yah, that too18:16
fungii vote do them when we do zuul scheduler18:16
pabelangerokay, that works18:16
fungione outage to rule them all18:16
clarkbwfm18:17
fungialso, we may not have sufficient quota in some providers to stand up a second mirror without removing the first?18:17
*** ChanServ changes topic to "Meltdown patching | https://etherpad.openstack.org/p/infra-meltdown-patching | OpenStack Infra team incident handling channel"18:17
clarkbpabelanger: what we can do no though is run the update and dist upgrade steps18:18
clarkbso that all we have to do when zuul is ready is reboot them18:18
pabelangerfungi: I think we do, at least when I've built them we've had 2 online at once18:19
pabelangerclarkb: sure18:19
pabelangerk, doing grafana.o.o now18:21
corvuswhat do you think about pulling the list of systems that need patching into an ansible inventory file, then run apt-get update && apt-get dist-upgrade on all of them?18:21
clarkbthat's annoying, hound wasn't actually started on codesearch01 (I've started it)18:22
clarkbcorvus: not a bad idea. I can regenerate my list so that it is up to date18:22
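A sketch of that approach, assuming the remaining hosts are written one per line into a throwaway inventory (the /tmp path, group name, and host names here are hypothetical):

    # /tmp/unpatched.ini:
    #   [unpatched]
    #   example-host01.openstack.org
    #   example-host02.openstack.org
    ansible unpatched -i /tmp/unpatched.ini -m shell \
      -a "DEBIAN_FRONTEND=noninteractive apt-get update && apt-get -o Dpkg::Options::='--force-confold' -y dist-upgrade"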
fungi3.13.0-139-generic is booted on wiki-upgrade-test (and the wiki is back online again) but dmesg doesn't indicate that page table isolation is in effect18:23
corvusrebooting cacti0218:23
clarkbfungi: weird, does it show up in the kernel config as an enabled option?18:24
pabelangerHmm, grafana is trying to migrate something18:24
fungiCONFIG_PAGE_TABLE_ISOLATION=y is in /boot/config-3.13.0-139-generic18:25
pabelangerah, new grafana package was pulled in18:25
corvusmeeting channels are idle, i will do eavesdrop now18:25
fungigood call18:25
*** openstack has joined #openstack-infra-incident18:30
*** ChanServ sets mode: +o openstack18:30
corvussupybot is not set to start on boot.  and we call it meetbot-openstack in some places and openstack-meetbot in others.  :|18:30
corvusi'm guessing something was missed in the systemd unit conversion.18:30
fungieep, last time i restarted it i used `sudo service openstack-meetbot restart`18:31
fungiwhich _seemed_ to work18:31
corvusfungi: i think they all go to the same place :)18:31
fungifun18:32
corvusany reason why etherpad-dev was done but not etherpad?18:32
fungii'll do planet next18:32
corvushrm.  apt-get dist-upgrade on etherpad only sees -137, not 13918:33
fungipointed at an out-of-date mirror?18:34
clarkbcorvus: yes18:34
clarkbcorvus: we are using the etherpad to coordindate so didn't want to just reboot it18:34
clarkbbut maybe it is best to get that out of the way18:34
clarkbI'm going to update the node list first though18:34
corvusfungi: deb http://security.ubuntu.com/ubuntu trusty-security main restricted18:34
fungistrange18:34
fungithat's after apt-get update?18:34
corvusanyone sucessfully done a trusty upgrade to 139 today?18:34
corvusfungi: yep18:34
corvusthe ones i picked up were all xenial18:35
clarkbcorvus: yes etherpad-dev was trusty too and worked18:35
fungii just did wiki and wiki-dev and they're trusty18:35
clarkbis it saying 137 is no longer needed?18:35
corvusoh wait...18:35
clarkbthe trusty nodes largely got the new kernel from unattended upgrades last night18:35
corvusclarkb: i think that's the case18:35
clarkband then the message says 137 is not needed anymore18:35
corvussorry :)18:36
clarkbnp18:36
fungiright, unattended-upgrades will install new kernel packages and then e-mail you to let you know you need to reboot18:36
pabelangersince we no longer do HTTP on zuul-mergers, those could be rebooted with no impact?18:36
clarkbeveryone ok if I just delete the current list (you'll lose your annotations) and replace with current list?18:36
clarkbor current as of a few minutes ago18:36
corvusclarkb: wfm18:36
fungiclarkb: fine with me18:36
fungithough wiki-upgrade-test will show back up18:36
fungiunless you're filtering out amd cpus18:37
pabelangercorvus: ok here18:37
clarkbfungi: I am not, but can make a note on that one18:37
corvusclarkb: i'm ready to reboot when you're done etherpadding.18:37
clarkbcorvus: ready now18:37
corvuspabelanger: why didn't your zk hosts show up as success?18:38
fungiclarkb: more pointing out that there could be other instances in the same boat (however unlikely at this point)18:38
pabelangerclarkb: I just did them, maybe clarkb's list is outdated?18:38
clarkbya could be a race in ansible running and me filtering18:39
clarkbit takes a bit of time for ansible to go through the whole list18:39
clarkbI haven't gotten a fully automated solution yet because ansible hangs on a couple hosts so it never exits in order to do things like sorting18:39
corvusetherpad is back up18:40
corvusfor that matter, the 2 i just did also showed up as failed18:40
corvusany special handling for files02, or should we just rip off the bandaid?18:41
clarkbwe could deploy a new 01 and then delete 0218:42
clarkbbut I'm not sure that is necessary, server should be up quickly and start serving data again18:42
clarkb(also we lose all the cache data if we do that18:42
corvusclarkb: if we feel that a 1 minute outage is too much, i'd suggest we deploy 01 and *not* delete 02 :)18:43
corvusi'll ask in #openstack-doc18:43
clarkbsounds good18:43
clarkbI got a response to my "we are patching" email pointing to a thing that indicates centos may not actually kpti if your cpus don't have pcid18:44
clarkbso I'm going to track that down18:44
corvusalso *jeepers* the dkms build is slow18:44
clarkb"The performance impact of needing to fully flush the TLB on each transition is apparently high enough that at least some of the Meltdown-fixing variants I've read through (e.g. the KAISER variant in RHEL7/RHEL6 and their CentOS brethren) are not willing to take it. Instead, some of those variants appear to implicitly turn off the dual-page-table-per-process security measure if the processor18:46
clarkbthey are running on does not have PCID capability."18:46
clarkbuhm18:46
fungihuh18:46
corvushow do we get a console these days?18:46
clarkbgit.o.o does not have pcid in its cpuinfo flags like my laptop does18:46
clarkbcorvus: log into the rax website, go to servers, open up the specific server page, then under actions there is "open remote console" iirc18:47
clarkbpabelanger: corvus dmsimard this may explain why centos dmesg doesn't say kpti is enabled18:47
clarkbbecause it isn't? the sysfs entry definitely seems to say it is there though18:47
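Two quick checks for what is being discussed here, the CPU's PCID capability and the RHEL/CentOS runtime knob (the debugfs path is an assumption based on the RHEL 7 KPTI backport and requires debugfs to be mounted):

    # does this CPU advertise PCID at all?
    grep -qw pcid /proc/cpuinfo && echo 'pcid: yes' || echo 'pcid: no'
    # RHEL/CentOS 7 kernels expose their own indicator rather than the dmesg line
    cat /sys/kernel/debug/x86/pti_enabled 2>/dev/null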
corvusokay, files02 is back up.  i got nervous because it took about 2 minutes to reboot, which is very long.18:48
corvusclarkb: also, we don't necessarily lose the afs cache -- it shouldn't be any worse than a typical volume release (which happens every 5min)18:49
clarkboh right docs releases often18:49
corvusand it caches the content, and just checks that the files are up to date after invalidation.18:50
corvusclarkb: link to what you're looking at?18:52
clarkbcorvus: https://groups.google.com/forum/m/#!topic/mechanical-sympathy/L9mHTbeQLNU18:53
clarkbreading other things on the internet, the best check for kpti enablement is the message in dmesg18:54
clarkbthat is a clear indication it was enabled and is being used post boot18:54
clarkbbut ya apparently red hat did the sysfs thing which isn't standard so harder to get info on that and determine if that is a clear "it is on" message18:54
clarkbso I'm fairly convinced that it is working as expected on the ubuntu hosts we are patching (because we are checking dmesg)18:55
clarkbdmsimard: do you have red hat machines where the cpu does have pcid in the cpu flags? if so does dmesg output the isolation enabled message in that case?18:56
clarkbpabelanger: ^18:56
clarkbI'm going to do review-dev now18:56
fungias soon as i confirm paste is okay, i'm going to do openstackid-dev and then openstackid.org18:57
fungithough i'll give the foundation staff and tipit devs a heads up on the latter18:58
fungilodgeit on paste01 seems unhappy18:59
pabelangerclarkb: I'll look and see on some internal things19:01
pabelangerbut don't have much access myself19:01
pabelangeronce bandersnatch is done running, I'll reboot mirror-update.o.o19:04
fungilooks like the openstack-paste systemd unit raced unbound at boot and failed to start because it couldn't resolve the dns name of the remote database. do we have a good long-term solution to that?19:04
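One common way to deal with that class of boot-time race, assuming it is an ordinary systemd service, is a drop-in that orders the unit after the network and local resolver and retries on failure; this is a sketch only, not what is deployed:

    # /etc/systemd/system/openstack-paste.service.d/override.conf
    [Unit]
    After=network-online.target unbound.service
    Wants=network-online.target

    [Service]
    Restart=on-failure
    RestartSec=10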
clarkbanyone know why apps-dev and stackalytics still show up in our inventory?19:07
clarkbI'm going to make a list of services that should be rebooted along with zuul19:09
pabelangerhave they been deleted?19:09
clarkbpabelanger: haven't checked yet, I guess if they are still up as instances but not in dns that could explain it19:09
pabelangerah, I wasn't aware we deleted stackalytics.o.o19:10
clarkbwell I don't know that we did but both are unreachable19:13
clarkbdo the zuul mergers need to happen with zuul scheduler?19:13
clarkbor can we do them one at a time and get them done ahead of time?19:13
clarkbcorvus: ^19:13
corvusclarkb: stopping them while running a job may cause a merger failure to be reported.  other than that, it's okay.19:14
pabelangermaybe we can add graceful stop for next time :)19:16
pabelangercorvus: also, ns01 and ns02, anything specific we need to do for nameservers? Assume rolling restarts are okay19:17
clarkbI'm inclined to do them as part of the zuul restart then since we've already had a rough day with test stability yesterday19:17
corvuspabelanger: rolling should be fine19:17
pabelangerokay, I'll look at them now19:17
corvuspabelanger: a graceful stop for mergers would be great :)19:18
clarkbtranslate.o.o is failing to update package indexes, something related to a nova agent repo? I wonder if we made this xenial instance during a period of funny rax images19:22
clarkbI'm looking into it19:22
fungiick19:22
pabelangerclarkb: question asked about PCID on internal chat, hopefully know more soon19:23
clarkbdeb http://ppa.launchpad.net/josvaz/nova-agent-rc/ubuntu xenial main19:23
clarkbis in a sources.list.d file19:24
clarkbthat looks like a use customers as guinea pigs solution19:24
clarkbcloud-support-ubuntu-rax-devel-xenial.list is a different file name in /etc/apt/sources.list.d but is empty19:25
clarkbI'm thinking I will just comment out the deb line there and move on19:25
clarkbany objections to that?19:25
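For the record, one way to comment out just that deb line rather than deleting the file, so it is easy to revert later (the *.list glob is used because the exact file name may vary between hosts):

    sed -i 's|^deb http://ppa.launchpad.net/josvaz/nova-agent-rc|# &|' /etc/apt/sources.list.d/*.list
    apt-get update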
pabelangerclarkb: a PCID cpu with RHEL7, doesn't show isolation in dmesg19:27
clarkbpabelanger: ok, thanks for checking19:27
clarkbso I guess other than rtfs'ing we treat the sysfs entry saying pti is enabled as good enough?19:28
clarkbcommenting out the ppa allowed translate01 to update. I am rebooting it now19:29
pabelangerokay, all mirrors upgraded, just need reboots when we are ready19:30
pabelangerI don't think we have any clients connected to firehose01.o.o yet?19:32
clarkbnothing production like. I think mtreinish uses it randomly19:33
pabelangerkk, will reboot now then19:33
clarkbzuul-dev can just be rebooted right?19:33
clarkbcorvus: ^19:33
pabelangeror even deleted?19:33
corvusclarkb, pabelanger: either of those19:34
clarkbwell if it can be deleted then that is probably preferable19:34
clarkbdoing ask-staging now19:34
pabelangerokay, I can propose patches to remove zuul-dev from system-config19:35
fungiclarkb: yes please to be commenting out the rax nova-agent guinea pig sources.list entry19:39
pabelangerremote:   https://review.openstack.org/511986 Remove zuulv3-dev.o.o19:40
pabelangerremote:   https://review.openstack.org/532615 Remove zuul-dev.o.o19:40
pabelangershould be able to land both of them19:40
fungiopenstackid server restarts are done, after coordinating with foundation/tipit people19:40
fungigoing to do groups-dev and groups next19:41
pabelangergraphite.o.o is ready for a reboot, if we want to do it19:41
clarkbfungi: ask-staging looked happy anything I should know about before doing ask.o.o?19:41
fungipabelanger: after restarting firehose01, please take a moment to test connecting to it per the instructions in the system-config docs to confirm it's streaming again19:41
pabelangerfungi: sure, I can do that now19:42
fungiclarkb: assuming the webui is up, i think it's safe to proceed with prod19:42
clarkbpabelanger: if we do graphite with zuul then we won't lose job stats or nodepool stats19:42
clarkbpabelanger: but I'm not sure that is mission critical its probably fine to have a small gap in the data and reboot it19:42
clarkbfungi: ok doing ask now then19:42
*** rlandy has quit IRC19:43
pabelangerfungi: firehose.o.o looks good19:44
pabelangerclarkb: okay, will reboot now19:45
clarkbask.o.o is rebooting now19:45
fungicool19:45
fungithanks for checking pabelanger!19:45
pabelangernp!19:45
*** rlandy has joined #openstack-infra-incident19:48
pabelangerclarkb: what about health.o.o, reboot when we do zuul?19:49
clarkbhealth should be fine to do beforehand19:49
clarkbsince its mostly decoupled from zuul (it reads the subunit2sql db)19:49
pabelangerk19:49
pabelangerrebooting now19:50
clarkbask.o.o is up and patched but not serving data properly yet19:50
clarkbI see there are some manage processes running for aksbot though19:50
clarkb[Wed Jan 10 19:50:48.367125 2018] [mpm_event:error] [pid 2155:tid 140109368838016] AH00485: scoreboard is full, not at MaxRequestWorkers19:50
clarkbany idea what that means?19:50
pabelangerI'd have to google19:51
clarkblooks like it may indicate that apache worker processes are out to lunch19:52
clarkband it won't start more to handle incoming connections19:52
clarkbI am going to try restarting apache19:52
pabelangergoing to see how we can do rolling updates on AFS servers19:53
pabelangerI think we just need to do one at a time19:53
clarkbseems to be working now19:54
clarkbpabelanger: ya I think I documented the process in the system config docs19:54
pabelangerYup19:54
pabelangerhttps://docs.openstack.org/infra/system-config/afs.html#no-outage-server-maintenance19:54
pabelangergoing to start with db servers19:54
fungigot another one... groups-dev is AMD Opteron(tm) Processor 4170 HE19:55
clarkbadns1.openstack.org can just be rebooted right? its not internet facing19:56
clarkbcorvus: ^ you good for me to do that one?19:56
fungii agree it should be safe to reboot at will, but will defer to corvus19:57
clarkbI've pinged jbryce about kata's mail list server to make sure there isn't any crazy conflict with reboot it19:58
fungigroups-dev is looking good after its reboot, so moving on to groups.o.o19:58
fungigroups.o.o is Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz19:58
fungiso groups-dev won't be a great validation for prod, but them's the breaks19:59
pabelangerboth AFS db servers are done20:01
clarkbI'm actually going to take a break and get lunch20:02
clarkbI think we are close to being ready to doing zuul though which is exciting20:02
AJaegerclarkb: kata-dev mailing list is dead silent ;/20:02
pabelangerAFS file servers need a little more work, we'll have to stop mirror-update for sure, and likely wait until zuul is shutdown to be safe20:02
pabelangerotherwise, we can migrate volumes20:02
clarkbjbryce also confirms we can do kata list server whenever we are ready20:02
pabelangerwhich takes time20:02
pabelangeractually, afs02.dfw.openstack.org looks clear of RW volumes, so I can start work on that20:06
* clarkb lunches20:06
fungigroups servers are done and looking functional in their webuis20:08
fungihrm... we have status and status01. is the former slated for deletion?20:09
fungii'll coordinate with the refstack/interop team on the refstack.o.o reboot20:11
pabelangerfungi: yes, i believe status.o.o can be deleted but it was there even before we created the new status01.o.o server.  I'm not sure why20:12
corvusfungi, clarkb: reboot adns at will20:13
pabelangermoving on to afs01.ord.openstack.org it also has not RW volumes20:13
pabelangerno*20:14
corvuspabelanger: i would just gracefully stop mirror-update before doing afs01.dfw.  i wouldn't worry about zuul.20:14
pabelangerk20:14
corvusthe wheel build jobs should be fine if interrupted.  the only reason to take care with mirror-update is so we don't accidentally get in a state where we need to restart bandersnatch.  but even that should be okay.20:15
corvus(i mean, it's run out of space twice in the past few months and has not needed a full restart)20:15
pabelangeryah, that is true20:16
pabelangerAFK for a few to get some coffee before doing afs01.dfw20:18
fungii'm taking a quick look in rax to see if i can suss out what the situation with stackalytics.o.o is20:18
fungii want to say we decided to delete it and redeploy with xenial if/when we were ready to work on it further20:19
clarkbya some of the unreachable servers may have had failed migrations20:19
fungilooks like stackalytics is in that boat20:20
funginothing in its oob console, but i issued a ctrl-alt-del through there20:21
fungii have at times seen a similar issue i've also guessed may be migration-related, where the ip addresses for some instances cease getting routed until they're rebooted, but they're otherwise fine20:21
fungieven happened to one of my personal debian systems in rax20:22
fungiin that case i was able to work out a login through the oob console and inspect from the guest side20:22
fungiand the interface was up and configured but tcpdump showed no packets arriving on it20:23
fungi(my personal instances have password login through the console enabled, just not via ssh)20:24
clarkbgeneral warning kids have been sick the last couple days and larissa now feels sick so I will be doing patching from couch while entertaining 2 year olds20:24
dmsimarddid someone forget to delete the old eavesdrop machine?20:24
dmsimardfatal: [eavesdrop.openstack.org]: UNREACHABLE! => {"changed": false, "msg": "Failed to connect to the host via ssh: ssh: connect to host 2001:4800:7818:101:be76:4eff:fe05:31bf port 22: No route to host\r\n", "unreachable": true}20:24
fungidmsimard: i bet it was shutdown but not deleted until we were sure the replacement was good?20:24
fungii think stackalytics is exhibiting the same broken network behavior. console indicated it wasn't able to configure its network interface and eventually timed that out and booted anyway... remotely the instance is still unreachable. i'll move on to rebooting it through the api next20:27
fungihuh, even after a nova reboot of the instance, stackalytics.o.o seems unable to bring up its network20:32
dmsimardclarkb, fungi: the playbook works after some fiddling, the inventory is in /tmp/meltdown-spectre-inventory20:33
dmsimard~60 unpatched hosts still20:34
dmsimardand 66 patched hosts20:34
fungidmsimard: does it check whether they're intel or amd?20:35
fungii've already turned up two amd hosts which won't show patched because kpti isn't enabled for amd cpus20:35
dmsimardFor some reason the facts gathering would hang at some point under the system-installed ansible version (2.2.1.0), I installed latest (2.4.2.0) in a venv and used that20:35
dmsimardfungi: it doesn't but I can add that in -- can you show me an example ?20:36
fungidmsimard: so far wiki-upgrade-test and groups-dev20:37
fungidmsimard: you could check for GenuineIntel or not AuthenticAMD in /proc/cpuinfo20:37
fungiwhichever seems better20:37
dmsimardhmm, my key isn't on wiki-upgrade-test20:37
dmsimardlooks like I can use the ansible_processor fact (since we're gathering facts anyway) let me fix that20:40
fungidmsimard: yeah, just don't apply puppet on it. it's basically frozen in time from before we started the effort to puppetize our mediawiki installation (which is what's on wiki-dev)20:45
fungiand since it's perpetually in the emergency file until we get the puppet-deployed mediawiki viable, new shell accounts aren't getting created20:45
clarkbdmsimard: its because ssh fails to a couple nodes I think. Newer ansible must handle that better and timeout20:46
pabelangerI've added mirror-update.o.o to emergency file so I can edit crontab20:55
fungistackalytics.o.o isn't coming back no matter how hard i reboot it. we likely need to delete and rebuild anyway20:59
dmsimardclarkb: that's what I thought as well.21:00
pabelangerfungi: yah, sounds fair. needs to be moved to xenial anyways21:03
pabelangereavesdrop.o.o was likely me, yah it was shutdown and never deleted after eavesdrop01.o.o was good21:03
clarkbok back to computer now21:18
clarkbI'm going to do adns now if it isn't already done21:18
clarkbdmsimard: the storyboard hosts have been patched for ~2 hours but they show up in your unpatched list21:22
clarkbdmsimard: was it run more than two hours ago or maybe its not quite accurate?21:22
clarkb(I just checked both by hand and they are patched)21:22
clarkbadns1 rebooting now21:24
clarkbok its up and reports being patched and named is running21:26
clarkbanyone know what is up with status.o.o and status01.o.o?21:28
clarkblooks like they are different hosts21:28
fungimy bet is a not-yet-completed xenial replacement21:29
fungistatus01 is not in dns21:29
fungiand status is still in production and on trusty21:29
clarkbany objections to me patching and rebooting both of them?21:29
funginone from me. i'll see if i can figure out who was working on the status01 build21:30
clarkbstatus hosts elasticsearch and bots21:30
clarkbotherwise its mostly a redirect to zuulv3 status I think21:30
fungiyeah21:30
clarkbok doing that now21:30
fungiseems to have been ianw working on status01 during the sprint, according to channel logs21:31
clarkber not elasticsearch, elastic-recheck21:31
fungihttp://eavesdrop.openstack.org/irclogs/%23openstack-sprint/%23openstack-sprint.2017-12-12.log.html#t2017-12-12T00:41:1921:32
clarkbI think most of what is left is the zuul group, lists, puppetmaster and backup server21:33
clarkbstatus* rebooting now21:34
clarkbshould probably do puppetmaster last21:34
clarkbso that if it hiccups it does so after everything else is updated21:34
clarkboh and infracloud still needs updating on control plane21:35
fungizuul-dev and zuul.o.o also need updating but we should take care not to accidentally start services on the latter21:37
fungimaybe it's time to talk again about deleting them21:37
clarkbya earlier we said we could delete zuul-dev21:38
clarkbI'm doing the backup server now21:38
fungizuulv3.o.o has only one vhost so apache directs all requests there as the default vhost, and as such just making zuul a cname to zuulv3 will work without needing the old zuul serving a redirect21:38
clarkbI'm up for deleting zuul.o.o too21:38
clarkbbut first backup server21:39
fungii tested with an /etc/hosts entry and my browser earlier, worked fine21:39
corvusthere is no zuul, only zuulv321:40
clarkbfungi: I would say make the dns record update then and lets delete both old servers21:41
corvus++21:41
clarkbbackup server rebooting now21:42
clarkbok backup server came back happy and has both filesystems (old and current) mounted21:44
fungii'll do the cname dance now21:44
fungifor zuul->zuulv321:44
clarkbI guess I'm up now for lists reboots21:45
clarkbI'm going to start with kata because of lower traffic21:45
clarkblists.o.o is actually amd21:48
clarkbso patching less urgent for it but we may as well21:48
fungittl on zuul.o.o a and aaaa were 5 minutes21:49
fungicname from zuul to zuulv3 now added21:49
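A quick way to confirm the new record once the old entries' five-minute TTL has expired (a sketch; any resolver will do):

    dig +short zuul.openstack.org CNAME
    dig +short zuul.openstack.org AAAA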
clarkblists.katacontainers.io is back up and happy. At least it has exim, mailman, and apache running21:50
clarkbgoing to do lists.openstack.org now21:50
fungi#status log deleted old zuul-dev.openstack.org instance21:51
openstackstatusfungi: finished logging21:51
fungiwill delete the zuul.o.o instance shortly once dns changes have a chance to propagate21:51
clarkbafter zuul is cleaned up I'm going to rerun my check for what is patched21:51
clarkbbut I think we will be down to the zuulv3 group21:52
clarkb(and infracloud)21:52
clarkboh and puppetmaster (but again do this one after zuulv3 group)21:52
clarkblists.o.o rebooting now21:53
clarkband is back, services look good, but no kpti because it is AMD21:54
clarkbI'm going to make sure zm01-zm08 are patched now but not reboot them21:55
fungii think that makes three we know about now with amd cpus21:56
fungi(wiki-upgrade-test, groups-dev and lists)21:56
clarkbI think it has to do with the age of the server21:57
clarkbsince rax was all amd before they added the performance flavors21:57
clarkbfungi: let me know when zuul.o.o is gone and I am gonna regen our list to make sure only the zuulv3 set is left21:58
clarkbpatching nodepool.o.o now too but not rebooting22:01
clarkband now patching static.o.o22:06
corvusi just popped back to pick up another server and don't see one available.  it looks like we're at the end of the list, where we reboot the zuul system all at once?22:09
fungiyeah, i'm about to delete old zuul.o.o now22:09
fungiwhich i think is the end22:10
fungiwe've made short work of all this22:10
fungi(not me so much, but the rest of you)22:10
clarkbcorvus: yes I am generating a new list from ansible output to compare now22:10
clarkbwill be a sec22:10
corvuscool, count me in for helping with that.  i'd like to handle zuul.o.o itself, and patch in the repl in case i have time to debug memory stuff later in the week.22:11
clarkbok22:11
fungii hope you mean zuulv3.o.o22:11
fungisince i'm about to push the button on deleting the old zuul.o.o22:11
corvusfungi: yes.  thinking ahead.  :)22:11
fungidns propagation has had plenty of time now22:12
fungiand http://zuul.openstack.org/ is giving me the status page just fine22:12
ianwstatus01 is waiting for us to finish the node puppet stuff22:13
fungiianw: cool, thanks. i thought it might be something like that after looking at the log from the sprint channel22:14
fungi#status log deleted old zuul.openstack.org instance22:14
clarkbhttp://paste.openstack.org/show/642507/ up to date list22:14
openstackstatusfungi: finished logging22:14
clarkbI sorted twice so we can split up by status too22:15
clarkbpabelanger: you still doing afs01?22:15
clarkbI guess that one can happen out of band22:15
pabelangerclarkb: yup, just waiting for mirror-update stop bandersnatch22:15
fungii'll go ahead and delete the stackalytics server too, it's not going to be recoverable without a trouble ticket, and that's a bridge too far for something we know is inherently broken and unused anyway22:16
clarkbfungi: sounds good22:16
fungi#status log deleted old stackalytics.openstack.org instance22:16
openstackstatusfungi: finished logging22:16
pabelangerI'll have to step away shortly for supper with family, are we thinking of doing zuulv3.o.o reboot shortly?22:17
clarkbpabelanger: yes I think so22:17
clarkbyou still good to do the mirrors if we do it in the next few minutes?22:17
pabelangerclarkb: yah, I can do reboot them22:17
clarkbline 151 has the zuulv3 set. I've given corvus zuulv3.o.o and I've taken the zuul mergers22:18
clarkbI'll put pabelanger on the mirrors22:18
clarkbthat leaves static and nodepool22:18
fungii'm around to help for the next little bit, but it's going to be lateish here soon so i likely won't be around later if something crops up22:18
clarkbfungi: ^22:18
fungii'll take static22:18
fungiand nodepool unless someone else wants to grab that22:18
corvusthis one we should announce -- should we go ahead and put statusbot on that?22:18
fungibig concern with static.o.o is making sure it doesn't fsck /srv/static/logs22:19
fungidoes touching /fastboot still work in this day and age?22:19
clarkbcorvus: ya why don't we do that22:19
clarkbfungi: I'm not sure, it is a trusty node right?22:19
fungialso, fully agree on a status alert for this22:19
fungiyeah, static.o.o is trusty22:20
corvusstatus notice The zuul system is being restarted to apply security updates and will be offline for several minutes.  It will be restarted and changes re-equeued; changes approved during the downtime will need to be rechecked or re-approved.22:21
corvus^?22:21
clarkbcorvus: +122:21
fungilgtm22:21
corvus#status notice The zuul system is being restarted to apply security updates and will be offline for several minutes.  It will be restarted and changes re-equeued; changes approved during the downtime will need to be rechecked or re-approved.22:21
*** corvus is now known as jeblair22:21
jeblair#status notice The zuul system is being restarted to apply security updates and will be offline for several minutes.  It will be restarted and changes re-equeued; changes approved during the downtime will need to be rechecked or re-approved.22:22
openstackstatusjeblair: sending notice22:22
clarkbfungi: supposedly setting the sixth column in fstab to 0 will prevent fscking22:22
*** jeblair is now known as corvus22:22
pabelangeransible -i /etc/ansible/hosts/openstack 'mirror01*' -m command -a "reboot"22:22
pabelangeris the command I'll use to reboot mirrors22:22
corvusoh, now i spot the typo in that22:22
clarkbfungi: http://www.man7.org/linux/man-pages/man5/fstab.5.html22:22
fungiclarkb: thanks, done. that's a more durable solution anyway, i should have thought of it22:22
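The fstab change being referred to, for reference: the sixth field (fs_passno) controls boot-time fsck ordering and 0 skips the check entirely. The device and options below are illustrative, not the real static.openstack.org entry:

    # <device>            <mountpoint>       <type>  <options>  <dump>  <pass>
    /dev/main/logs        /srv/static/logs   ext4    defaults   0       0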
corvusre-equeued.  something about a horse i assume.22:22
-openstackstatus- NOTICE: The zuul system is being restarted to apply security updates and will be offline for several minutes. It will be restarted and changes re-equeued; changes approved during the downtime will need to be rechecked or re-approved.22:23
fungihistorically anyway, creating a /fastboot file only exempted it from running fsck for the next reboot, and then the secure or single-user runlevel initscripts would delete it at boot22:23
clarkbok I've double checked the zm0* servers have up to date packages. I think I am ready22:24
clarkb(everyone else should probably double check too before we turn off zuul)22:24
fungichecking mine now22:24
openstackstatusjeblair: finished sending notice22:24
corvusi'm ready and waiting on fungi and pabelanger to indicate they are ready22:25
pabelangeryes, ready22:25
corvuswaiting on fungi22:25
clarkbI think the order is grab queues, stop zuul, reboot everything but zuul, reboot zuul after static etc come back up22:25
fungiokay, all set to reboot mine22:25
corvusclarkb: yep.  after fungi is ready, i will grab + stop, then let you know to proceed22:26
fungiand yes, that plan sounds correct22:26
fungicorvus: start at will22:26
corvuszuul is stopped -- you may proceed22:26
fungithanks, rebooting static and nodepool now22:26
pabelangerokay, doing mirrors22:26
clarkbzm0* have been rebooted, waiting for them to return to us22:27
fungiwill the launchers need to have anything done to reconnect to zk on nodepool.o.o?22:27
clarkbfungi: I don't think so22:27
clarkbit should retry if it fails iirc22:27
clarkbsimilar to how gear works22:28
clarkb(but we should check)22:28
clarkbfungi: oh also we may need to make sure the old nodepool process doesn't start?22:28
clarkbfungi: should just have zk running on nodepool.o.o22:28
funginodepool has booted already22:28
fungii'll go check what's running22:28
clarkball zuul mergers report kpti, checking them for daemons22:28
dmsimardclarkb: looking for storyboard re: playbook22:29
fungizookeeper is running, nodepool is not22:29
fungiso should be okay22:29
clarkball 8 zuul mergers are running a zuul merger process I think my side is good22:29
fungistatic.o.o is up and reporting correct kernel security22:29
dmsimardclarkb: ok I need to fix ubuntu 14.04 vs 16.0422:30
dmsimardstoryboard have not yet been updated22:30
clarkbdmsimard: they should be the same22:30
clarkbdmsimard: and they have been updated22:30
dmsimard14.04 doesn't have the kaiser flag in /proc/cpuinfo, 16.04 does22:30
pabelangerokay, just waiting for inap, all other mirrors show good for reboot. Checking apache now22:30
pabelangerinap is also good for reboot22:31
clarkbdmsimard: ansible -i /etc/ansible/hosts/openstack 'logstash-worker0*' -m shell -a "dmesg | grep 'Kernel/User page tables isolation: enabled'" is apparently the most reliable way to check22:31
clarkbdmsimard: as all the flag in cpuinfo means is that the kernel has detected the insecure cpu not that it has necessarily enabled pti22:31
dmsimardclarkb: yeah but that doesn't work for centos/rhel :(22:31
clarkbpabelanger: good for reboot meaning reboot is complete and the dmesg content is there?22:31
clarkbdmsimard: ya I know rhel is the only distro it doesn't work on from the ones I have sampled22:31
clarkbbut checking cpuinfo doesn't tell you if pti is enabled22:32
clarkbit only tells you if the kernel knows its cpu is insecure22:32
pabelangerclarkb: yes, just validating AFS is working properly now22:32
dmsimard /boot/config-3.13.0-139-generic:CONFIG_PAGE_TABLE_ISOLATION=y is probably safe22:32
pabelangerso far, issue with vexxhost mirror22:32
clarkbdmsimard: that also doesn't tell you pti is enabled22:32
clarkbdmsimard: only that support for it was compiled in22:32
dmsimarddamn it22:32
clarkb(this is why the dmesg check is important it is positive confirmation from the kernel that it is ptiing)22:32
mgagnepabelanger: what is the reboot about? kernel for meltdown?22:32
clarkbmgagne: yes22:33
corvusclarkb, fungi: i believe i'm waiting only on pabelanger at this point, correct?22:33
clarkbcorvus: that is my understanding yes22:33
fungiyes, everything good on my end22:33
corvuspabelanger: i'm idle if you need me to jump on a mirror22:33
pabelangercorvus: yes, vexxhost please22:33
pabelangerit is not serving up AFS22:33
pabelangerI am checking others still22:33
corvusack22:34
corvuspabelanger: seems up now: http://mirror01.ca-ymq-1.vexxhost.openstack.org/ubuntu/lists/22:34
clarkbmgagne: we are doing all of our VM kernels today (and if I don't run out of time the control plane of infracloud)22:35
fungi`ls /afs/openstack.org/docs` returns content for me on mirror01.ca-ymq-1.vexxhost22:35
mgagneclarkb: cool, just wanted to check if it was for spectre22:35
clarkbmgagne: I'm not aware of any spectre patches yet22:35
pabelangercorvus: yes, confirmed. Thanks22:35
fungi329 entries to be exact22:35
clarkbmgagne: unfrotunately22:35
mgagneclarkb: ok, we are on the same page then22:36
pabelangerokay, all mirrors are rebooted, dmesg confirmed and apache running22:36
corvusmgagne: from what i read from gkh, that's probably next week.  and the week after.  and so on forever.  :|22:36
clarkbcorvus: we'll get really good at rebooting :)22:36
fungiuntil we get redesigned cpus deployed everywhere22:36
mgagne¯\_(ツ)_/¯22:36
clarkbcorvus: I think that means you are good to patch zuulv3 and reboot22:36
fungiand discover whatever new class of bugs they introduce22:36
corvuscool, proceeding with zuulv3.o.o.  i expect it to start on boot.22:37
clarkbcorvus: note I didn't prepatch zuulv322:37
corvusclarkb: i did22:37
clarkbsince python had been crashing there I didn't want to do anything early22:37
clarkbcool22:37
corvushost is up22:38
corvuszuul is querying all gerrit projects22:38
corvuszuul-web did not start22:38
corvusor rather, it seems to have started and crashed without error?22:39
pabelangerclarkb: okay if I step away now? Have some guests over this evening22:39
clarkbpabelanger: yup22:39
pabelangergreat, good luck all22:39
clarkbpabelanger: enjoy, and thanks for the help!22:39
fungithanks pabelanger!22:40
clarkbpabelanger: oh can you tldr what needs to be done with afs?22:40
clarkbpabelanger: we can finish that up while you dinner :)22:40
corvussubmitting cat jobs now22:40
clarkbI guess its wait for mirror-update to stop doing things22:40
clarkbthen reboot the afs server22:40
corvusmergers seem to be running them22:40
fungiand after that, puppetmaster22:40
clarkbfungi: ++22:40
corvusi've restarted zuul-web22:41
clarkblooks like a couple changes have enqueued according to status22:42
corvusaccording to grafana we're at 10 executors and 18 mergers which is the full complement22:42
clarkband jobs are running22:42
corvusokay, i'll re-enqueue now22:43
fungi43fa686e-12a4-4c51-ad3b-d613e2417ff3 claims to be named "ethercalc01"22:43
fungibut is not the real slim shady22:43
ianwfungi: that could be our in-progress ethercalc ... again waiting for nodejs22:43
clarkbfungi: oh I don't think the real one is in my listings either22:43
ianwi think fricker had one up for testing last year22:43
fungi93b2b91f-7d01-442b-8dff-96a53088654a is actual ethercalc0122:44
clarkbfungi: it might be a good idea to regenerate the openstack inventory cache file?22:44
clarkbsince a few servers have been deleted that are showing up in there22:44
fungiso it's in the inventory, just tracked by uuid since there are two of them22:44
clarkbcorvus: console streaming is working22:44
fungishould we delete the in-progress one before clearing and regenerating the inventory cache?22:44
ianwfungi / clarkb: see https://review.openstack.org/#/c/529186/ ... may be related?22:45
clarkbfungi: the in progress ethercalc? its probably fine to keep it but just patch it?22:45
fungishould we delete status01 and the nonproduction ethercalc01 duplicate for now, since we'd want to test bootstrapping them from scratch again anyway?22:45
ianwfungi: yep22:46
clarkbfungi: thinking about it though that may be simplest22:46
mordredinfra-root: oh, well. I somehow missed that we were doing work over here in incident (and quite literally spent the day wondering why everyone was so quiet)22:46
clarkb++ to cleaning up22:46
fungimordred: i hope you got a lot done in the quiet!22:46
clarkbmordred: maybe you want to help mwhahaha debug jobs?22:46
mordredwhat, if anything, can I do to make myself useful?22:46
mordredclarkb: kk. will do22:46
ianwfungi: do you want me to do that, if you have other things in flight?22:46
clarkbmordred: the bulk of patching is done and unless you really want to do reboots on what's left we probably have it under control?22:47
fungiianw: feel free but i'm basically idle for the moment which is why i got to looking at the odd entries22:47
clarkbinfracloud was mostly ignored today but I think I got most/all of the hypervisors done there last night22:47
ianwfungi: ok, well if you've got all the id's there ready to go might be quicker for you22:47
clarkbfungi: mirror-update looks idle now, maybe you want to do the remaining afs server?22:47
fungiwill do22:48
clarkbin that case I can do afs server :)22:48
fungigo for it, i'll work on ethercalc01 upgrade and the deletions for dupes22:48
clarkbcool22:48
clarkbcorvus: and to confirm afs01.dfw can just be rebooted now that mirror-update is happy?22:49
corvusclarkb: yep, should be fine.22:51
clarkbmordred: http://logs.openstack.org/80/532680/1/check/openstack-tox-py27/2b32f22/ara/result/1771aec0-9c03-40bd-837e-4aca16e1ec88/ the fail there in run.yaml caused a post failure22:51
clarkbmordred: so that may be part of it22:51
fungii've deleted status01 and the shadow of ethercalc01, cleared the inventory cache, and am updating the real ethercalc01 now22:51
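(For context, the inventory cache here is a file on puppetmaster written by the OpenStack inventory script; clearing it just means deleting the cache file and letting the next run repopulate it. A minimal sketch; the path is purely illustrative and should be checked against the inventory script's configuration.)
    # path is an assumption -- confirm the real cache location first
    sudo rm /var/cache/ansible-inventory/openstack_inventory.cache
    # the next ansible run (or an explicit inventory listing) rebuilds it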
corvusclients should fail over to afs02.dfw22:51
clarkbupdating packages on afs01.dfw now22:52
clarkband rebooting it now22:53
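(The per-host sequence used throughout this log is the usual apt update, dist-upgrade, reboot cycle; a minimal sketch, with the KPTI check afterwards:)
    sudo apt-get update
    sudo apt-get dist-upgrade      # pulls in the new linux-image-* packages
    sudo reboot
    # once it's back:
    dmesg | grep 'page tables isolation'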
fungiethercalc01 is done and seems to be none the worse22:53
clarkbafs01.dfw is back up and running now and has the patch in place according to dmesg22:56
* clarkb reads docs on how to check its volumes are happy22:56
fungii'll make sure the updated packages are on puppetmaster (very carefully taking note first of what it wants to upgrade)22:56
corvusall changes re-enqueued22:57
clarkbcorvus: vos listvldb looks good to me22:57
clarkbanything else I should check before reenabling puppet on mirror-update?22:58
corvusshould we send a second status notice, or was the first sufficient?22:58
fungiupgraded packages on puppetmaster, keeping old configs22:58
fungii think the first was fine, that was brief enough of an outage22:58
fungipuppetmaster is ready for its reboot when we're all ready22:58
corvusclarkb: 'listvol' is probably better for this -- i think it consults the fileserver itself, not just the database22:58
clarkbcorvus: cool will do that now22:58
mordredclarkb: that's a normal test failure - so yah, we probably need to make our post-run things a little more resilient22:59
clarkband they all show online22:59
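(Both checks are standard OpenAFS vos commands; the difference corvus points at is that listvldb reads the volume location database while listvol asks the fileserver itself. A rough sketch, assuming -localauth is usable on the AFS servers:)
    vos listvldb -localauth                          # VLDB view of volume placement
    vos listvol afs01.dfw.openstack.org -localauth   # volumes as the fileserver itself reports them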
clarkbmordred: ya not sure if mwhahaha's things were all due to that weird error reporting though22:59
mordredclarkb: I'm guessing that'll be "zomg there is no .testrepository" - which is normal when a patch causes tests to not be able to import22:59
clarkbmordred: just noticed it may be a cause of lots of post failures22:59
mordredclarkb: indeed.22:59
clarkbfungi: I think I'm ready for puppetmaster when everyone else is22:59
mordredclarkb: I'll work up a fix for it in any case - it's definitely sub-optimal22:59
clarkbfungi: i can reenable puppet on mirror-update after the reboot23:00
fungisure23:00
clarkbze10 is apparently not patched?23:00
clarkb(that can happen after puppetmaster)23:01
fungiokay, last call for objections before i reboot puppetmaster.o.o23:01
fungiand here goes23:01
fungii can ssh into puppetmaster.o.o again now23:02
clarkbas can I23:03
fungiKernel/User page tables isolation: enabled23:03
fungishould be all set23:03
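(The check fungi quotes above is just a grep over the boot messages:)
    dmesg | grep 'page tables isolation'
    # expected on a patched kernel:
    # [    0.000000] Kernel/User page tables isolation: enabled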
clarkbI'm going to remove mirror-update from the emergency file now23:03
fungisounds good23:03
clarkbthat's done23:03
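(Removing a host from the emergency file is a manual edit on puppetmaster; a sketch, with the path given as an assumption rather than a confirmed location:)
    # path is an assumption -- confirm before editing
    sudo vi /etc/ansible/hosts/emergency
    # delete the mirror-update line; ansible/puppet runs resume on the next pulse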
clarkbnow to look into ze1023:03
clarkbok it looks like the dmesg buffer rolled over23:04
clarkbwe don't have the initial boot messages, but the kernel version lgtm so I'm going to trust that it was done23:04
fungiyeah, that'll happen23:05
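(When the dmesg ring buffer has rolled over, the fallback is comparing the running kernel with what's installed; a minimal sketch:)
    uname -r                                  # running kernel
    dpkg -l 'linux-image-*' | grep '^ii'      # installed kernel packages
    # if the running version matches the patched package, the host was rebooted onto it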
clarkbgoing to generate a new list then I think it likely infracloud is what is left23:05
fungii guess the noise from bwrap floods dmesg pretty thoroughly23:06
fungiahh, oom killer running wild on ze10 actually23:07
clarkbcorvus: jeblairtest01 is the only server that is reachable and not patched23:09
clarkbinfra-root http://paste.openstack.org/show/642516/23:10
* clarkb looks into some of those unreachable servers23:10
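(A sweep like the one behind that paste can be done with an ad-hoc Ansible run from puppetmaster; the host pattern and fork count here are assumptions:)
    ansible all -f 20 -m shell -a "uname -r; dmesg | grep -c 'page tables isolation: enabled'"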
fungihopefully there are fewer unreachables after the stuff i deleted earlier23:13
clarkbyup just 5 now.23:13
fungioh, apps-dev should be able to go away23:13
fungii'll delete it23:13
fungieavesdrop is in a stopped state, pending deletion after we're sure eavesdrop01 is good (per pabelanger)23:13
clarkbianw: https://etherpad.openstack.org/p/infra-meltdown-patching line 173 is I think the old backup server?23:14
clarkbianw: is that something you are able to confirm (and did you want to delete it?)23:14
fungi#status log deleted old apps-dev.openstack.org server23:14
openstackstatusfungi: finished logging23:14
fungiinfra-root: any objections to deleting the old (trusty) eavesdrop.o.o server?23:15
dmsimardfungi: pabelanger mentioned earlier that he kept it undeleted just in case23:15
dmsimardI don't believe there's anything wrong with the new eavesdrop23:15
fungidata was on a cinder volume, so unless you stuck something in your homedir there's nothing on there anyway23:15
mordredno objections here23:16
clarkbfungi: if you are confident it can go away then I'm fine with it23:16
ianwclarkb: yeah, the old one, think it's fine to delete it now, will do23:16
clarkbI am putting these servers on the etherpad fwiw23:16
fungi#status log deleted old eavesdrop.openstack.org server23:16
openstackstatusfungi: finished logging23:16
fungikdc02 was supplanted by kdc04 right?23:17
clarkbfungi: ya there was a change for that too23:17
fungipretty sure i reviewed (maybe approved?) that23:17
clarkblooks like it's already been deleted, so stale inventory cache?23:17
clarkbunless its not in dfw23:17
fungishouldn't be stale inventory cache23:17
clarkbah yup, it's in ORD23:18
fungiord23:18
fungijust looked in the cache23:18
clarkbits state is shut off23:18
fungiwhich makes sense, and explains the unreachable23:18
fungiso we're safe to delete that instance as well?23:19
clarkbI think so23:19
fungiand odsreg, i'm pretty sure we checked with ttx and he said it was clear for deletion too23:19
clarkbya I seem to recall that23:19
fungi#status log deleted old kdc02.openstack.org server23:19
openstackstatusfungi: finished logging23:20
dmsimardclarkb: retrying the inventory after fixing 14.04/16.04 and adding AMD in23:20
clarkbfungi: you don't happen to have that logged in your irc logs do you? I can't find it (I don't keep long-term logs and just rebooted)23:20
clarkbdmsimard: thanks23:20
fungiclarkb: that's what i'm hunting down now23:20
dmsimardsorry about what little I could do today, I've been fighting other things23:20
dmsimardI'll help after dinner23:21
clarkbI think we are just about to the real fun. INFRACLOUD!23:21
dmsimardshould probably drain nodepool23:23
dmsimardand then yolo it23:23
ianwso my test xenial build is -109 generic, what got added over 108?23:24
clarkbianw: they fixed booting problems :)23:24
fungiianw: now with less crashing23:24
clarkbianw: so anything that is already on 108 is probably fine as long as unattended upgrades is working and updates to 109 before the next reboot23:24
clarkb109 is not required to be secure aiui23:24
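(To confirm a host on -108 will pick up -109 without another manual pass, a simulated upgrade shows what unattended-upgrades should install next; a sketch:)
    uname -r                                        # currently running kernel
    apt-get -s dist-upgrade | grep linux-image      # what the next upgrade run would pull in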
ianwok, cool, thanks23:25
clarkbdmsimard: I'm not even sure we have to drain nodepool. I got the hypervisors done last night (the ones I could access), and now it's just the control plane, plus doing due diligence on any other hypervisors that might still be alive but not responding to puppetmaster.o.o23:26
clarkbcompute000.vanilla.ic.openstack.org for example I can connect to from home but not puppetmaster23:26
clarkbcompute030.vanilla.ic.openstack.org accepts ssh connections then immediately kills them23:27
clarkbalright I'm going to try collecting data on what needs help in infracloud23:30
*** rlandy is now known as rlandy|bbl23:37
clarkbinfra-root http://paste.openstack.org/show/642518/ that is infracloud23:45
clarkbwe need to figure out which hypervisor is running the mirror but we should be able to reboot any other node23:45
clarkblet's save baremetal00 for last as it's the bastion for that env23:45
clarkb(similar to how we did puppetmaster last in our control plane)23:45
* clarkb figures out where the mirror is running23:46
clarkbcompute039.vanilla.ic.openstack.org is where mirror01.regionone.vanilla.openstack.org is running and it has been patched so we don't have to worry about that23:49
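(With admin credentials, the hypervisor hosting an instance is visible on the server record; a sketch, assuming the standard nova extended-attributes field is exposed:)
    openstack server show mirror01.regionone.vanilla.openstack.org \
        -c 'OS-EXT-SRV-ATTR:host' -f value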
clarkbanyone want to do compute012.vanilla.ic.openstack.org? I'm going to start digging into compute005.vanilla.ic.openstack.org and why it is not working23:49
*** SergeyLukjanov has quit IRC23:53
*** SergeyLukjanov has joined #openstack-infra-incident23:54
fungipretty sure the hypervisor hosts for both mirror instances were already done, because i had to explicitly boot them both earlier today23:56
fungii can give compute012.vanilla.ic a shot at an upgrade23:56
clarkblooking at the nova-manage service list and the ironic node-list, I think we have diverged a bit between what is working and what is expected to be working23:57
clarkbI think what we should likely do is do our best to recover nodes like compute005, but if they don't come back we can disable them in nova and then remove them from the inventory file23:57
clarkbI'm going to quickly make sure all of the patched hypervisors + 012 are the only ones that nova thinks it can use23:57
fungicompute012.vanilla.ic is taking a while to reach, seems like23:58
clarkbthe networking there is so screwed up23:59
clarkbcompute000 for example apparently puppetmaster can hit it now23:59
clarkbthe XXX nodes in nova-manage service list are the ones that are unreachable23:59
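(The XXX entries come from nova's service listing; hosts that can't be recovered can be disabled so the scheduler stops considering them. A rough sketch; whether nova-manage or the openstack CLI applies depends on the nova version running on the infracloud controller:)
    nova-manage service list                          # XXX marks services that have stopped checking in
    openstack compute service set --disable \
        compute005.vanilla.ic.openstack.org nova-compute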