Friday, 2021-08-13

*** dmellado_ is now known as dmellado00:36
clarkbI'm still able to ssh into servers like codesearch and nl01 and eavesdrop0100:56
clarkbthey all have a single port 22 entry now00:56
clarkb(I think it was infra-prod-base that applied the updates)00:56
clarkbI don't know that I'll still be functional by the time this deploy finishes to clean up the old ansible stuff on bridge. Might have to just be very very careful tomorrow morning and clean up stuff that is a day old?00:58
clarkbalso the matrix oftc bridge seems to have died00:58
fungiyep, the servers lgtm02:06
fungimatrix bridge less so :/02:07
Unit193fungi: Thanks for fixing things, btw!05:14
*** corvus is now known as Guest410109:45
*** dviroel|ruck|out is now known as dviroel|ruck11:13
*** diablo_rojo is now known as Guest411511:36
fungiUnit193: it's not fixed yet, is it? we're still working on debugging the problem afaik12:06
fricklerif mordred happens to show up again later, would be great if someone can point him to today's backlog in #openstack-sdks where he might be able to help (regression in sdk's caching caused by major release of decorator lib)12:38
fricklerof course anyone else who might be able to help is also welcome, the issue seems to be out of reach for my mediocre python skills12:39
opendevreviewMerged opendev/system-config master: Upgrade etherpad to 1.8.14  https://review.opendev.org/c/opendev/system-config/+/80413614:31
Clark[m]fungi: I'm making tea now but will be at a proper keyboard soon14:35
fungino worries, i think we've still got a while before the deployment happens14:35
fungietherpad just restarted14:39
fungii'm able to load an existing pad just fine14:40
clarkboh that was quicker than I expected but I'm at a keyboard now14:46
clarkblet me load ssh keys and all that14:46
fungithere's no rush, i doubt anyone's using meetpad right this moment14:47
fungiand the infra-prod-service-etherpad job is still running14:47
clarkbIt loads in meetpad, but I'm not sure this is using the newer version as the colors don't touch between lines14:48
fungiwe haven't cleaned up the ansible processes on bridge yet, eh? load average there is ~1214:48
clarkbya ansible was still going last night when I needed to call it a day14:49
fungii guess we can disable-ansible after this and clean up14:49
clarkbfungi: ya the image we are running is not updated on prod14:50
fungiwonder why it restarted in that case... checking the deploy log on bridge14:50
clarkbI wonder if that is a race with the dockerhub indexes' eventual consistency14:51
fungiit's at the docker-compose pull phase just now14:51
clarkboh weird are we going to restart twice?14:52
fungii wonder14:52
clarkbfungi: looks like the pull started almost 15 minutes ago14:52
clarkbrelated to the high system load maybe?14:52
fungithat's what i suspect, yeah14:52
clarkbI think the pull may have completed and then the restart just hasn't been properly logged yet? Because the timing lines up for the restart I think14:54
clarkbI half suspect that we restarted after the pull because the mariadb image updated but the pull of the etherpad image didn't update due to docker hub races14:54
clarkbif that is the case we should be able to safely pull and up -d manually once ansible is out of the way14:54
clarkbfungi: it seems like ansible is making no progress at all14:59
clarkbI'm going to start looking at process listings on bridge14:59
clarkbfungi: `ps -elf | grep ansible | grep Aug12 | grep remote_puppet_else` maybe we start by killing those processes?15:00
clarkbI'm going to start there. There are no remote puppet else jobs running and those are all from yesterday15:02
fungiyeah, that should be plenty safe15:02
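
A minimal sketch of the cleanup being agreed to here; the grep patterns come from clarkb's line above, but the kill step itself is an assumption about how the processes were actually removed:

    # confirm which day-old remote_puppet_else ansible processes are still around
    ps -elf | grep ansible | grep Aug12 | grep remote_puppet_else
    # once no matching job is running in zuul, terminate them (PID is column 4 in ps -elf)
    ps -elf | grep ansible | grep Aug12 | grep remote_puppet_else \
        | awk '{print $4}' | xargs -r kill
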
clarkbfungi: also we should look for ssh connectivity problems as these tend to start from that15:03
*** Guest4101 is now known as notcorvus15:03
*** notcorvus is now known as corvus15:03
clarkbI think if you ps and grep for the controlmaster processes you can find old ones that might indicate bad connectivity15:03
fungii also see a couple from Aug1115:03
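
A sketch of the ControlMaster check clarkb describes; the "[mux]" process title and the ~/.ansible/cp socket directory are assumptions about a typical OpenSSH/Ansible setup, not details quoted from the log:

    # persistent ssh masters normally show a "[mux]" suffix in ps; very old ones
    # hint at hosts with broken connectivity
    ps -ef | grep '[m]ux'
    # ansible's ControlPersist sockets usually live here; stale sockets are another clue
    ls -l ~/.ansible/cp
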
*** corvus is now known as reallynotcorvus15:04
*** reallynotcorvus is now known as corvus15:04
fungisome of these are showing up as defunct too, so not sure if they'll be killable15:04
clarkblogstash-worker11 and elasticsearch06 are maybe sad hosts15:05
clarkbfungi: do you think you can check on those and reboot them while I dig through processes that we might be able to clean up?15:05
fungiyeah, looking into them15:06
clarkbelasticsearch02 maybe as well15:06
fungiConnection closed by 2001:4800:7819:103:be76:4eff:fe04:b9d7 port 2215:06
clarkbnext I'll clean out the base.yaml playbooks from august 12. That playbook doesn't seem to be running in zuul either15:07
fungiyeah, all three of them are resetting connections on 22/tcp15:07
fungii'll check their oob consoles15:07
corvustristanC: http://eavesdrop01.opendev.org:9001/ is answering ... what's the URI for prometheus stats?  and are you monitoring it now?  do you have enough data to see if the connection issue is resolved?15:08
clarkbAll of the august 12 ansible processes seem to be cleaned up and load has fallen significantly15:11
clarkbLooking at etherpad the job finished and it is still running the old image. I think we should manually pull and up -d as soon as we are happy with bridge15:11
fungiall three of the servers you mentioned were showing hung kernel tasks reported on their consoles, i've hard rebooted them15:12
clarkbthanks15:12
clarkbthose were the three IPs I saw with stale ssh control processes15:12
fungii can ssh into all three of them now, though i expect the elasticsearch data is shot15:13
clarkbfungi: I would give it a bit to try and recover on its own (but check if the processes need to start)15:14
clarkbthen we can delete any corrupted indexes once it has had a chance to recover15:14
fungi#status log Hard rebooted elasticsearch02, elasticsearch06, and logstash-worker11 as all three seemed to be hung15:14
opendevstatusfungi: finished logging15:14
clarkbansible is busy now but all of the processes related to ansible on bridge seem to be current15:15
clarkbya elasticsearch doesn't seem to be running on 0215:16
clarkbfungi: should I start those processes?15:16
fungioh, yeah i forgot it doesn't start them automatically15:16
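
A rough sketch of the recovery approach being discussed; the service name, port, and red-index cleanup are assumptions about a stock Elasticsearch install rather than commands quoted from the log:

    # start elasticsearch on the rebooted node (it is not started automatically here)
    sudo systemctl start elasticsearch
    # watch shard recovery progress
    curl -s 'localhost:9200/_cluster/health?pretty'
    # once recovery settles, list any indices still stuck red before deciding
    # whether they need to be deleted
    curl -s 'localhost:9200/_cat/indices?v' | awk 'NR == 1 || $1 == "red"'
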
fungii guess once the prod-hourly builds complete, we can see if there are any lingering ansible processes15:17
fungifatal: [etherpad01.opendev.org]: FAILED! => { ... "cmd": "docker-compose up -d"15:18
clarkbthere are a set of base.yaml playbooks running with current timestamps however I see no associated job. I half wonder if we unstuck those processes by rebooting the servers15:19
fungimaybe15:19
clarkbfungi: ya but it definitely restarted the containers. I think from ansible's perspective it sees it as a failure but it did restart15:19
clarkbfungi: that said I think our next step is to rerun pull on etherpad and up -d to get the image15:19
clarkbfungi: do you want to do that or should I?15:19
fungii'll do that now15:19
clarkbthanks15:19
fungiERROR: readlink /var/lib/docker/overlay2/l: invalid argument15:20
clarkbyou get that when trying to up the service?15:21
fungiyes15:21
fungithe compose file looks fine though, not truncated15:21
clarkbstackoverflow says that is a corrupted image. https://stackoverflow.com/questions/55334380/error-readlink-var-lib-docker-overlay2-invalid-argument15:21
fungiargh15:22
clarkbfungi: can you up just the mariadb container and see if that starts?15:22
clarkbif that starts then we can delete and repull the etherpad image15:22
fungiyeah, that works15:22
clarkb`sudo docker image rm 5dbd5f4908bd` then docker-compose pull again?15:23
fungialready done, almost finished pulling15:23
fungithat's better15:23
fungithat looks like newer etherpad now15:24
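
A sketch of the recovery sequence just performed; the image id and the mariadb-first ordering come from the conversation, while the compose directory path is an assumption:

    cd /etc/etherpad-docker   # assumed location of the compose file
    # bring up just the database to confirm its image is intact
    sudo docker-compose up -d mariadb
    # remove the corrupted etherpad image layer and fetch it again
    sudo docker image rm 5dbd5f4908bd
    sudo docker-compose pull
    sudo docker-compose up -d
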
clarkbhttps://etherpad.opendev.org/p/project-renames-2021-07-30 loads for me now and ya looks newer15:24
fungii loaded the same one15:25
fungii see you active on it15:25
clarkbhttps://meetpad.opendev.org/isitbroken loads that etherpad for me and I can add text15:25
clarkbI'm not too worried about the actual call as long as the pad loads there and it seems to15:25
fungijoining15:25
fungialso looks like recent improvements in jitsi-meet or etherpad (or both) have made the window embedding a bit more serviceable15:32
clarkbmeetpad and etherpad both seem happy. If you notice anything feel free to mention it15:32
fungiclarkb: for the kata listserv, should i go ahead and start trying to create a server snapshot?15:33
clarkbfungi: if you want to. The thing I'm always confused about is what do we need to do on the server to make it safe to boot the resulting snapshot? Do we disable and stop exim and mailman?15:34
clarkbthat is my biggest concern and I'm not completely up to speed on how all the file spooling works there to feel confident in doing it myself15:34
fungiclarkb: i guess it's a question of what we want to do with the snapshot. if we just keep it as insurance in case the in-place upgrade goes sideways, we shouldn't need to disable anything because we wouldn't boot them both at the same time15:34
clarkboh ya I was thinking we would boot the snapshot and run through an upgrade on the booted snapshot15:35
clarkbthen do the upgrade on the actual server and it will serve as both a fallback and a test system15:35
fungiin that case we could stop and disable the exim and mailman services while snapshotting, i guess15:36
clarkbI figured doing that sort of thing with the lower traffic lists.kc.io would be less impactful15:37
clarkbbut then we'd get basically the same confidence out of upgrading it vs the prod snapshot15:37
fungisure, versions would be the same, though our multi-site setup wouldn't15:38
fungii also need to work out what to tweak in a dnm change to break lodgeit testing so i get a held paste equivalent for further troubleshooting the pastebinit regression15:39
clarkbfungi: put an assert False in system-config testinfra/test_paste.py15:40
fungioh, yeah that'd do it15:40
clarkbI'm going to go find something to eat now that etherpad seems happy, but then after will look at the lists.kc.io stuff if you haven't already done it15:40
tristanCcorvus: `curl http://eavesdrop01.opendev.org:9001/metrics | grep ssh_errors` shows no errors15:40
fungiclarkb: sounds good, i'll get started temporarily disabling things there shortly15:41
opendevreviewJeremy Stanley proposed opendev/system-config master: DNM: Break paste for an autohold  https://review.opendev.org/c/opendev/system-config/+/80453515:42
corvustristanC: great, thanks!  i'll work on an email / timeline for moving #zuul :)15:43
fungiclarkb: i guess i can clear your autohold for the etherpad upgrade testing?15:45
fungimnaser: do you still need held nodes for multi-arch container debugging in node-labeler or uwsgi build errors in loci-keystone?15:47
Clark[m]fungi yes you can clear my etherpad autohold15:50
fungithanks, done15:51
fungiit's fun that autohold and autohold-list need --tenant but autohold-delete errors if you supply --tenant15:51
fungii should probably be using the standalone zuul-client instead of the rpc client anyway15:52
fungiprod-hourly jobs are almost done, it's on the last one now. though it likely won't complete before the top of the hour15:53
fungiregardless, there are no ansible processes on bridge older than a minute, so looks like cleanup was thorough15:54
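
A sketch of the kind of check being described, using ps's etimes column; this is illustrative, not the literal command run:

    # list ansible-related processes with their age in seconds, keeping only
    # anything older than a minute
    ps -eo etimes,pid,cmd | grep '[a]nsible' | awk '$1 > 60'
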
fungiand load average has dropped from 12 to around 1, so lots better15:55
fungithe last job did wrap up its deployment tasks before the top of the hour, and i caught bridge with 0 ansible processes15:59
fungisqueaky clean15:59
clarkbjust in time for the next hourly run16:00
fungiindeed16:00
fungii've put lists.katacontainers.io into the deployment disable list, disabled and stopped the exim4.service and mailman.service units, and initiated image creation in the provider for the server now16:07
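
A sketch of the quiesce step fungi describes; the unit names come from the message above, but using systemctl disable --now (and re-enabling afterwards) is an assumption about the exact mechanism:

    # on lists.katacontainers.io: stop mail handling so the snapshot is consistent
    sudo systemctl disable --now exim4.service mailman.service
    # ...create the server image in the provider, then undo once it finishes...
    sudo systemctl enable --now exim4.service mailman.service
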
clarkbfungi: oh the other question I had about that was what client do you use to talk to the rax snapshot api?16:08
clarkbdoes it work with a modern osc?16:08
clarkbalso thank you!16:08
fungii just used their webui16:08
clarkbah16:08
fungisince i already had it up for the oob console stuff on the hung servers a few minutes ago16:08
fungiit's currently still "saving"16:14
fungiimaging is complete, putting services back in place now16:19
clarkbanother benefit to using that server for this is much quicker snapshotting16:19
fungiyes16:19
fungiand it's back out of the disable list again16:20
clarkbI need to do a bunch of paperwork type stuff today, but hopefully monday we can boot that and test an upgrade16:20
fungii also double-checked that services were running on it after starting16:20
fungiwfm16:20
fungitime to see if my well-laid trap caught a paste server16:20
fungiwe got one16:21
fungithis is going to get tricky, pastebinit hard-codes server names, and also verifies ssl certs16:27
fungii'm starting to wonder if it's the server rename or redirect to https confusing it16:30
clarkbfungi: try it with your browser to see?16:31
clarkbwith etherpad we had to set up /etc/hosts because of the redirect16:31
fungithe browser's fine, and yeah i'm doing it with /etc/hosts entries to work around it16:31
clarkbmaybe use curl instead of pastebinit?16:33
clarkbthen you can control cert verification16:33
fungiyeah, but i'll need to work out what pastebinit is passing to the method16:33
fungiyep, i think i've confirmed it's the redirects16:34
fungii was able to use pastebinit with the held server by making the vhost no longer redirect from http to https16:34
fungithing is, pastebinit has a list of allowed hostnames, one of which is paste.openstack.org16:35
fungitrying to use it with the name paste.opendev.org throws an error16:35
fungioh, though i think it may be due to the way the redirect was constructed16:36
fungiwe didn't redirect to https://paste.opendev.org/$1 we're just redirecting to the root url16:36
fungii'll see if that works with a more thorough redirect16:37
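
A minimal way to see the behavior being described, using curl against the held node with an /etc/hosts override; the IP and the request path are placeholders:

    # point the production hostname at the held node for testing
    echo '203.0.113.10 paste.openstack.org paste.opendev.org' | sudo tee -a /etc/hosts
    # a 301 whose Location is the site root rather than the same path over https
    # is the part that trips up pastebinit
    curl -sI http://paste.openstack.org/raw/example | grep -iE '^(HTTP|Location)'
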
fungiyeah, no luck getting the redirect to work with pastebinit, but if i get rid of the redirect it's fine. just "tested" on the production server by editing its apache vhost config and that got pastebinit working16:49
fungialso since we don't allow search engines to index the content there, and we don't support "secretly" pasting to it really, there's no real need to redirect from http to https16:51
fungii'll propose a change16:51
opendevreviewJeremy Stanley proposed opendev/system-config master: Stop redirecting for the paste site  https://review.opendev.org/c/opendev/system-config/+/80453917:01
fungiUnit193: ianw: clarkb: ^ that seems to be the fix17:01
clarkbfungi: does pastebinit work with https:// too?17:01
fungiit would i think, but we'd need to update the site entry at https://phab.lubuntu.me/source/pastebinit/browse/master/pastebin.d/paste.openstack.org.conf17:02
fungiregexp = http://paste.openstack.org17:02
fungiright now trying it results in the following error:17:03
fungiUnknown website, please post a bugreport to request this pastebin to be added (https://paste.openstack.org)17:03
opendevreviewJeremy Stanley proposed opendev/lodgeit master: Properly handle paste exceptions  https://review.opendev.org/c/opendev/lodgeit/+/80454017:09
fungiand that's ^ the other bug i discovered in digging into the problem17:09
fungilest upstream just starts smacking down every bug report from someone using a distro package17:20
fungi(which happens in lots of projects)17:20
clarkbwrong window? :)17:21
fungihah, yep17:21
clarkblooks like refstack had a backup failure. I'm hoping that it, like lists, is a one-off "internet is weird" situation17:21
Unit193fungi: https://github.com/lubuntu-team/pastebinit/issues/6 isn't reassuring about the state of things.17:25
fungiUnit193: well, regardless we'll strive to keep backward compatibility with old pastebinit versions, so once 804539 merges and deployed things should hopefully stay working17:26
fungi(and it's temporarily working now, since i directly applied that change to the apache config to make sure it's sane)17:27
fungibut thanks for the pointer to that github issue, i didn't know about arch using a fork... if i get a moment i'll file a bug in debian to suggest switching to the same fork17:28
Unit193The maintainer in Debian is the Lubuntu team guy...  I may go poking around to see what I can find.17:29
fungiahh, yeah. also the fork on gh that arch is using doesn't seem to actually differ from the revision history in the lubuntu phabricator17:31
fungiUnit193: please let me know what you find, and thanks again for alerting us to the issue before i ran into it myself!17:33
Unit193Hah, sure thing.17:33
Unit193And thanks for taking errors over IRC too.17:33
fungimy preference, really ;)17:34
fungiclarkb: ianw: looks like lance has e-mailed us asking if we've seen new issues with leaked/stuck images in osuosl17:42
clarkb| 0000041150 | 0000000001 | osuosl-regionone    | ubuntu-focal-arm64    | ubuntu-focal-arm64-1628601595    | 7e23243b-aee2-4100-b702-d7e05f456606 | deleting  | 01:00:50:58  | that might be a leak17:44
clarkblooking at a cloud side image list there may be a few leaks there too17:45
clarkbdebian-bullseye-arm64-1627056483 for example17:46
clarkbcreated_at       | 2021-07-23T16:08:06Z for that bullseye image but I don't see it in nodepool17:47
clarkbI can compile a list and see what others think of it17:47
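
A rough sketch of how such a list can be compiled; the cloud name and the reliance on the -arm64- naming pattern are assumptions for illustration:

    # upload names nodepool still tracks for that region
    nodepool image-list | grep osuosl-regionone \
        | grep -oE '[a-z0-9-]+-arm64-[0-9]+' | sort -u > /tmp/np-images
    # image names the cloud actually has
    openstack --os-cloud osuosl image list -f value -c Name \
        | grep -- '-arm64-' | sort -u > /tmp/cloud-images
    # names present only cloud-side are candidate leaks (verify before deleting)
    comm -13 /tmp/np-images /tmp/cloud-images
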
clarkbubuntu-focal-arm64-1628601595 cannot be deleted because it is in use through the backend store outside of glance17:49
clarkbserver list shows no results though17:49
fungiwe're not doing bfv for the mirror or builder are we?17:57
clarkbwe might be, but those should use images we don't build17:59
clarkbthe builder is in linaro not osuosl. The osuosl mirror is booted from Ubuntu 20.04 (7ffbb2e7-d2f4-467a-9512-313a1c6b6afd)18:00
clarkbI've got an email just about ready to send to Lance18:00
clarkbsent18:02
fungithanks!18:16
*** dviroel|ruck is now known as dviroel|out19:42
clarkbfungi: looks like a bunch of hosts had backup failures?19:58
clarkbboth servers report they have disk space19:59
clarkblooking at kdc03 the main backup failed but then the stream succeeded20:00
clarkbConnection closed by remote host. Is borg working on the server? was the error20:00
clarkbI'm going to try rerunning in screen on kdc0320:02
clarkblooking at the log more closely they all started about 2 hours before they errored20:03
fungimm, yeah refstack, storyboard, kdc03, translate, review20:04
fungialso gitea01 twice (i guess one was the usual db backup failure?)20:04
fungiall of those except kdc03 have mysql databases20:04
clarkbkdc03 does do a stream backup of something though20:05
clarkbthat said running it manually succeeded20:05
clarkbI suspect there was a network blip of some sort20:05
clarkbfungi: note all of those started around 17:12 then timed out after 2 hours and reported failure around 19:1220:05
clarkbI suspect this isn't a persistent issue given the kdc03 rerun succeeded20:06
fungiyeah, makes sense20:06
clarkbfungi: but you can check the log on kdc03 to see what it did. It was failure on normal backup, success on stream, then a bit later (nowish) I reran and you get success for both20:06
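
A sketch of the manual re-run described here; the wrapper script name and log path are placeholders rather than details from the log:

    # re-run the backup by hand in a detached screen so a dropped ssh session
    # does not interrupt it (script name is hypothetical)
    sudo screen -dmS borg-rerun /usr/local/bin/run-borg-backup
    sudo screen -r borg-rerun                    # attach to watch progress
    sudo tail -n 50 /var/log/borg-backup.log     # assumed log location
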
clarkbI guess check it tomorrow and see if things persist20:08
fungiyeah, missing one day is unlikely to be catastrophic20:09
clarkbfungi: https://review.opendev.org/c/opendev/system-config/+/804460 reviewing that one would be good before the memory of what the renames were like becomes too stale :)20:11
fungisure, i should be able to take a look now, thanks20:26
fungiclarkb: left one question on it, otherwise lgtm20:30
opendevreviewClark Boylan proposed opendev/system-config master: Update our project rename docs  https://review.opendev.org/c/opendev/system-config/+/80446020:33
clarkbnice catch that was indeed meant to be rooted20:34
fungidebian bullseye releasing this weekend, probably20:34
clarkbexciting20:35
fungiyeah, scheduled for tomorrow20:36
