Friday, 2020-10-09

clarkbreal    437m38.542s <- one notedb migration00:00
clarkbfungi: I think I figured out how the change numbers work the All-Projects repo has a refs/meta/sequence entry that seems to be the counter00:19
clarkbso ya redirects may just work00:19
clarkbthough maybe it has to scan all the repos for the number? not sure how that works00:20
clarkbperhaps it looks it up in the index00:20
*** hamalq has quit IRC00:32
openstackgerritFatema Khalid Sherif proposed opendev/storyboard-webclient master: Show story description markdown preview by default  https://review.opendev.org/75694001:12
openstackgerritFatema Khalid Sherif proposed opendev/storyboard-webclient master: Show story description markdown preview by default  https://review.opendev.org/75694001:47
*** auristor has quit IRC02:17
*** auristor has joined #opendev02:20
ianwi believe this opens the maintenance window for the rax db's .. will keep an eye03:02
clarkbthanks. I'm checking irc periodically03:06
*** ysandeep|away is now known as ysandeep03:29
openstackgerritMerged opendev/system-config master: Add initial borg backup server  https://review.opendev.org/75660703:42
ianwinfra-prod-service-bridge timed_out03:42
ianwhrm03:42
ianwlooks like i have also not hooked borg-backup jobs in correctly either03:43
ianwoh, no doh that's the hourly runs.  still something is wrong03:43
ianw# ps -aef | grep ansible-playbook | wc -l03:44
ianw19303:44
ianwlogstash-worker02.openstack.org. seems to be the dead host03:45
ianwhung tasks as usual i guess03:51
ianw(i mean i checked the console and that's what's on it)03:51
clarkbrebooting those is basically always safe03:52
clarkbeven when not sad03:52
clarkb(we'll drop some logs but meh, at a billion records a day that's ok)03:52
ianw#status rebooted logstash-worker02.openstack.org03:56
openstackstatusianw: unknown command03:56
ianw#status log rebooted logstash-worker02.openstack.org03:56
openstackstatusianw: finished logging03:56
ianwi've cleared out everything on bridge that was stuck03:56
*** ykarel|away has joined #opendev04:23
*** ykarel|away is now known as ykarel04:28
clarkbI think we are outside the db window?05:01
*** marios has joined #opendev05:12
*** ykarel has quit IRC05:34
*** ykarel has joined #opendev05:35
*** rpittau|afk is now known as rpittau05:43
ianw2am cdt05:43
openstackgerritIan Wienand proposed opendev/system-config master: install-borg: also install python3-venv  https://review.opendev.org/75700005:51
*** sshnaidm is now known as sshnaidm|off06:06
*** eolivare has joined #opendev06:34
*** tkajinam has quit IRC06:42
*** tkajinam has joined #opendev06:42
*** ralonsoh has joined #opendev06:59
*** hashar has joined #opendev07:01
*** Dmitrii-Sh has quit IRC07:11
*** ysandeep is now known as ysandeep|lunch07:25
*** fressi has joined #opendev07:27
*** slaweq has joined #opendev07:37
*** slaweq has quit IRC07:37
*** slaweq has joined #opendev07:38
*** tosky has joined #opendev07:46
*** moppy has quit IRC08:01
*** moppy has joined #opendev08:01
*** fressi has left #opendev08:20
*** fressi has joined #opendev08:23
*** Dmitrii-Sh has joined #opendev08:58
openstackgerritlikui proposed openstack/diskimage-builder master: Switch to unittest mock  https://review.opendev.org/75703108:59
openstackgerritlikui proposed openstack/diskimage-builder master: replace imp module  https://review.opendev.org/75123609:09
*** ysandeep|lunch is now known as ysandeep10:01
*** roman_g has joined #opendev10:02
*** DSpider has joined #opendev10:31
*** priteau has joined #opendev10:49
*** Eighth_Doctor has quit IRC11:02
*** mordred has quit IRC11:02
*** mordred has joined #opendev11:11
*** ykarel has quit IRC11:16
*** ykarel_ has joined #opendev11:16
*** ykarel has joined #opendev11:32
*** ykarel_ has quit IRC11:33
*** ttx has quit IRC11:34
*** Eighth_Doctor has joined #opendev11:35
*** ttx has joined #opendev11:36
*** slaweq has quit IRC12:03
*** slaweq has joined #opendev12:17
*** ysandeep is now known as ysandeep|brb12:25
*** fressi has quit IRC12:31
*** hashar has quit IRC12:46
*** rpittau is now known as rpittau|afk13:03
*** ysandeep|brb is now known as ysandeep13:31
openstackgerritBernard Cafarelli proposed openstack/project-config master: Update neutron stable grafana dashboards  https://review.opendev.org/75710213:49
*** ykarel has quit IRC13:52
*** ykarel has joined #opendev13:52
openstackgerritNicolas Alvarez proposed openstack/project-config master: Add initial files to project-config repo.  https://review.opendev.org/75671714:16
openstackgerritNicolas Alvarez proposed openstack/project-config master: Rename StarlingX Armada App files.  https://review.opendev.org/75711314:16
*** slaweq has quit IRC14:42
*** fressi has joined #opendev14:44
*** fressi has quit IRC14:47
*** mlavalle has joined #opendev14:57
*** ysandeep is now known as ysandeep|away14:58
openstackgerritNicolas Alvarez proposed openstack/project-config master: Add SNMP Armada App to StarlingX.  https://review.opendev.org/75671715:00
*** ykarel is now known as ykarel|away15:08
*** eolivare has quit IRC15:10
openstackgerritNicolas Alvarez proposed openstack/project-config master: Add SNMP Armada App to StarlingX.  https://review.opendev.org/75671715:12
*** lpetrut has joined #opendev15:19
*** lpetrut has quit IRC15:35
*** ykarel has joined #opendev15:36
*** ykarel|away has quit IRC15:37
*** ykarel has quit IRC15:39
*** priteau has quit IRC16:06
*** marios is now known as marios|out16:10
*** marios|out has quit IRC16:23
*** tosky has quit IRC16:35
*** priteau has joined #opendev16:41
openstackgerritClark Boylan proposed opendev/system-config master: Stop replicating to local git mirror on gerrit  https://review.opendev.org/75715216:44
clarkbfungi: as I push changes like ^ up I'll be editing configs on review-test and restarting things there if necessary16:45
*** hamalq has joined #opendev16:45
openstackgerritClark Boylan proposed opendev/system-config master: Disable change.move in gerrit  https://review.opendev.org/75715316:50
openstackgerritNicolas Alvarez proposed openstack/project-config master: Add SNMP Armada App to StarlingX.  https://review.opendev.org/75671716:55
openstackgerritClark Boylan proposed opendev/system-config master: Stop blocking /p/ in the gerrit apache vhost  https://review.opendev.org/75715516:56
fungiclarkb: thanks for the heads up, i'm not testing anything at the moment17:02
openstackgerritClark Boylan proposed opendev/system-config master: Switch to zuul's default gerrit auth type  https://review.opendev.org/75715617:03
clarkbfungi: is there a change to fix the cla.html problem yet?17:04
clarkbit won't be a problem in prod but only because we'll upgrade gerrit on the existing host17:05
clarkbasking because I need to sort out the best way to clean up commentlinks and such and want to avoid conflicts if I can17:05
clarkbmight just rebase my whole stack on that actually. If you haven't written one yet should I go ahead and add it to my stack ?17:07
fungiclarkb: there's not yet, but per your earlier comments about dropping the js file change, maybe we can just repurpose that one to add the cla.html file?17:08
clarkbfungi: ya I'm thinking now we should land a change soon that manages the files we use, then I'll do a cleanup change that is WIP'd to remove the ones we don't want later17:09
clarkbthe reason for that is if we end up on 2.16 then we want to keep at least the css stuff for the old web ui17:09
clarkbI can update that change to do the cla.html too if you'd prefer I do it17:09
clarkbthen I'll rebase the changes above on that17:09
*** ykarel has joined #opendev17:12
fungii figure we'll want that file included for 2.16 use too17:15
clarkb++17:17
openstackgerritClark Boylan proposed opendev/system-config master: Add gerrit static files that were lost in ansiblification  https://review.opendev.org/74633517:43
openstackgerritClark Boylan proposed opendev/system-config master: Stop replicating to local git mirror on gerrit  https://review.opendev.org/75715217:43
openstackgerritClark Boylan proposed opendev/system-config master: Disable change.move in gerrit  https://review.opendev.org/75715317:43
openstackgerritClark Boylan proposed opendev/system-config master: Stop blocking /p/ in the gerrit apache vhost  https://review.opendev.org/75715517:43
openstackgerritClark Boylan proposed opendev/system-config master: Switch to zuul's default gerrit auth type  https://review.opendev.org/75715617:43
clarkbfungi: ^ updated that first change and rebased the stack on it. Need to reapply some WIP's but the beginning of that stack should be safe to land17:43
fungii guess 746335 was rebased in addition to being updated. interdiff is yuge17:53
openstackgerritClark Boylan proposed opendev/system-config master: Clean up old Gerrit html theming and commentlinks  https://review.opendev.org/75716117:54
openstackgerritClark Boylan proposed opendev/system-config master: Remove reviewdb config from Gerrit  https://review.opendev.org/75716217:54
clarkbfungi: oh yup sorry about that17:54
clarkbI'm going to apply those last two changes to review-test now17:55
*** priteau has quit IRC17:58
clarkband all that looks good17:58
*** priteau has joined #opendev18:00
johnsomAre there known issues with the proxies? Failed to fetch https://mirror.bhs1.ovh.opendev.org/ubuntu/dists/bionic/InRelease  Could not connect to mirror.bhs1.ovh.opendev.org:443 (158.69.73.218), connection timed out18:02
johnsomhttps://71ded92e78f7ee54474f-70fe5a5f20a67e625f6dcb03e84a8d62.ssl.cf1.rackcdn.com/757158/1/check/octavia-v2-dsvm-scenario/afa1d04/job-output.txt18:02
fungithat looks unexpected18:03
fungii wonder if the vm has died18:04
johnsomMost of our jobs are red right now18:04
clarkb[Fri Oct  9 14:33:46 2020] afs: Lost contact with file server 23.253.73.143 in cell openstack.org (code -1) (all multi-homed ip addresses down for the server)18:04
clarkbdidn't that just happen?18:04
fungii can ssh into it18:04
fungi14:33:46 is 3.5 hours ago18:04
clarkbya I wonder if it made the afs sad18:05
fungi[Fri Oct  9 14:34:30 2020] afs: file server 23.253.73.143 in cell openstack.org is back up (code 0) (multi-homed address; other same-host interfaces may still be down)18:05
clarkbif you hit https://mirror.bhs1.ovh.opendev.org/ubuntu/dists/bionic/InRelease it fails though18:05
clarkbwhich implies apache can't read from the fs18:05
fungiit saw it again ~45 seconds later18:05
fungii am having trouble accessing /afs/openstack.org/ from that vm18:06
fungii can get to it from other servers, like static.o.o18:07
clarkbwe can restart the openafsclient service or reboot then probably?18:07
clarkbI'm betting it's a local issue triggered by the server going away18:07
*** ykarel has quit IRC18:07
*** priteau has quit IRC18:08
fungiyeah, it may not be able to restart though if there are open file handles in /afs18:08
fungitrying to restart it now18:09
*** ralonsoh has quit IRC18:10
fungiit restarted but still hangs trying to access /afs/openstack.org/18:11
clarkbI wonder if the network issues I've got to rax hosts are related18:11
fungii can't ssh into afs02.dfw18:11
clarkbugh18:11
fungiafs01.dfw is responding though18:12
fungibetting afs02.dfw is hung once again18:12
fungichecking oob console18:12
fungihung kernel tasks 323400 seconds after boot (3.75 days ago, i rebooted it 2020-10-05 20:06:49 UTC according to our status log)18:15
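The uptime arithmetic above can be checked with a short sketch. It assumes the 323400 seconds counts from the reboot time quoted from the status log; the variable names are illustrative only.

```python
from datetime import datetime, timedelta, timezone

# Reboot time quoted from the status log (UTC).
rebooted_at = datetime(2020, 10, 5, 20, 6, 49, tzinfo=timezone.utc)
# Seconds after boot at which the console reported hung kernel tasks.
hung_after = timedelta(seconds=323400)

# Convert the uptime to days and project the wall-clock time of the hang.
days_up = hung_after.total_seconds() / 86400
hung_at = rebooted_at + hung_after

print(f"{days_up:.2f} days")  # 3.74 days
print(hung_at.isoformat())    # 2020-10-09T13:56:49+00:00
```

Under that assumption the hang lands at about 13:56 UTC, shortly before the 14:33:46 "Lost contact" message seen on the mirror.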
fungi#status log hard rebooted afs02.dfw.o.o to address a server hung condition18:17
openstackstatusfungi: finished logging18:17
clarkbit's weird because those servers have been fairly stable until recently. But I guess it could be live migrations or similar18:17
openstackgerritClark Boylan proposed opendev/system-config master: DNM Forcing a gitea job failure to test gerrit replication  https://review.opendev.org/75716518:18
fungii can ssh into afs02 again18:18
TheJuliaare mirror issues a known thing right now?18:18
clarkbTheJulia: yes, the file server went out to lunch again18:19
TheJulialooks like it18:19
TheJuliaugh18:19
clarkbI've put a hold on https://review.opendev.org/757165 and will use the gitea it builds to test gerrit replication from review-test18:19
clarkbprobably tomorrow though as I'll need to figure out credentials and all that18:19
*** roman_g has quit IRC18:19
TheJuliashould we be giving CI some time or....18:20
TheJuliaWell, should I go enjoy a beverage or go poke patches I guess is what I'm wondering18:20
clarkbI expect this will recover as soon as the server finishes rebooting18:20
TheJuliak18:20
*** roman_g has joined #opendev18:20
*** roman_g has quit IRC18:21
clarkbhttps://mirror.bhs1.ovh.opendev.org/ubuntu/dists/ isn't loading yet so not yet18:21
fungiyeah, my attempts to ls in /afs/openstack.org/ from it are still stuck18:21
fungiload average on mirror.bhs1.ovh.opendev.org is 12218:22
fungii'm going to stop apache on it for a minute18:22
TheJuliawould it make sense to just shutdown CI because I suspect it is getting killed with activity too18:23
TheJuliaWith mirrors out to lunch, the jobs are toast anyway18:23
fungiwell, it will take longer to remove that region from nodepool than it will to get it back on track, worst case i'll reboot it18:23
TheJuliak18:24
* TheJulia goes and checks on the pie18:24
fungistopping apache on it is taking forever, so may be faster to forcibly reboot the vm18:24
clarkbfungi: wfm18:24
*** tkajinam has quit IRC18:25
fungi#status log hard rebooted mirror01.bhs1.ovh to recover from high load average (apparently resulting from too many hung reads from afs)18:26
openstackstatusfungi: finished logging18:26
fungiit's still booting up. hopefully it's not prompting for an interactive fsck on the console18:27
clarkbyou should be able to get the console from ovh18:28
clarkbvia the normal apis18:28
fungiA start job is running for OpenAFS client (3min 4s / 3min 24s)18:29
fungiokay, now it's up18:29
fungils: cannot access '/afs/openstack.org/': No such file or directory18:29
clarkbI wonder if it is a network issue between ovh and rax then18:30
clarkbon top of the other issue18:30
clarkbtry restarting the openafs client?18:30
fungii can ping both afs01.dfw and afs02.dfw from mirror.bhs1.ovh18:30
fungididn't help18:31
*** tosky has joined #opendev18:32
clarkbthe fact that /afs is entirely empty makes me think it is the client/kernel18:33
clarkbif it were just the remote we'd be able to see the other afs fses?18:33
fungi`vos status -server afs01.dfw.openstack.org` reports "attachFlags:  busy"18:34
clarkbfungi: you restarted openafs client? based on systemctl status openafs-client it appears to have been running since 18:2618:35
clarkber not status ps18:35
fungiand for afs02.dfw it says "procedure: Restore"18:35
clarkbthe afsd is running since 18:2618:35
clarkbbut the service is from 18:31 which makes me think it didn't truly restart18:35
fungihuh, i did a `sudo systemctl restart openafs-client`18:35
fungifor me it says "Active: active (running) since Fri 2020-10-09 18:31:08 UTC; 5min ago" which was shortly after my restart18:36
clarkbmaybe we should stop then start it18:37
fungiyeah, i concur. systemd and the process list are not in agreement18:37
fungiit's not stopping, presumably because it's busy18:38
clarkbugh18:38
fungiit's been trying to rmmod openafs for 11 minutes18:38
fungishall i add the nl hosts into the emergency disable list and then take bhs1 out of the nodepool configs on the servers?18:39
clarkbfungi: you only need to do nl04 and ya that seems like a good idea18:40
fungioh, righ18:40
fungit18:40
*** priteau has joined #opendev18:42
fungidoes the container need any kicking for a max-servers value change?18:42
fungior does it pick that up automatically when the file is modified?18:42
clarkbit should pick it up18:43
clarkbit rereads the config file on every pass through its runtime loop18:43
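The behaviour described above (a config file reread on every pass, so an edit takes effect without a restart) can be sketched roughly as follows. This is not nodepool's actual code or config format: the JSON file, the function names, and the value 120 are all hypothetical, chosen only to show the pattern.

```python
import json
import os
import tempfile


def load_config(path):
    """Read the config from disk; called once per loop pass."""
    with open(path) as f:
        return json.load(f)


def run_passes(path, passes, on_pass=None):
    """Run a fixed number of loop passes, rereading the config each time."""
    seen = []
    for i in range(passes):
        config = load_config(path)  # fresh read, so edits need no restart
        seen.append(config["max-servers"])
        if on_pass:
            on_pass(i)  # hook used here to simulate an operator edit
    return seen


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "launcher.json")  # hypothetical file name
    with open(path, "w") as f:
        json.dump({"max-servers": 120}, f)  # arbitrary starting value

    def edit_after_first_pass(i):
        # Simulate zeroing max-servers while the loop is running.
        if i == 0:
            with open(path, "w") as f:
                json.dump({"max-servers": 0}, f)

    observed = run_passes(path, 3, edit_after_first_pass)

print(observed)  # [120, 0, 0]
```

The first pass still sees the old value; every later pass sees the edited one, which is why no container restart should be needed under this pattern.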
fungii suppose i should check the other mirrors to see if this problem is more widespread18:43
clarkbI wonder if we should try another cleaner reboot of the bhs1 mirror18:43
fungii suspect it won't be able to cleanly reboot because of the openafs lkm18:44
fungibut trying a grafeful reboot now18:44
fungigraceful18:44
fungiit did manage to shutdown18:45
fungiat least according to the console18:45
fungigra1.ovh seems to be working fine18:46
clarkbfungi: did you sudo reboot or nova reboot?18:46
fungisudo reboot18:47
fungiall three rax mirrors can reach afs18:47
clarkbit does seem like it may have shut down services but is waiting for the kernel to be happy18:47
clarkbbecause ssh is refusing connections for a long time which normally doesn't happen on boot18:47
fungiboth vexxhost mirrors are happy18:47
fungissh just started18:48
clarkboh now I get pam complaining ya18:48
*** priteau has quit IRC18:48
fungiit was complaining about hung kernel tasks at shutdown for a few minutes18:48
fungii guess because the openafs driver was unresponsive/busy18:48
fungiconsole says the openafs-client startup is timing out again18:50
clarkbI feel like auristor said there was a race that may cause this at one time18:50
clarkbbut I don't recall if there was a proposed fix (or if I even recall accurately)18:50
clarkbfungi: idea: we disable openafs-client and reboot again. Let it come up happy then manually start openafs-client?18:51
fungithe construction crew here is wrapping up so i'm going to need to step away for a bit. if this persists at all we're going to need a change to zero the max-servers in git18:51
clarkbI can do the service disable and reboot18:52
fungithanks18:52
clarkbthen try and start it manually18:52
clarkbdisabled and rebooting now18:52
clarkbafter confirming there was another rmmod and /afs was empty18:52
openstackgerritClark Boylan proposed opendev/system-config master: Switch to zuul's default gerrit auth type  https://review.opendev.org/75715619:01
openstackgerritClark Boylan proposed opendev/system-config master: Clean up old Gerrit html theming and commentlinks  https://review.opendev.org/75716119:01
openstackgerritClark Boylan proposed opendev/system-config master: Remove reviewdb config from Gerrit  https://review.opendev.org/75716219:01
openstackgerritClark Boylan proposed opendev/system-config master: Update gerrit container image to 3.2  https://review.opendev.org/75717619:01
clarkbdoing the manual start doesn't seem happier19:02
clarkbmy systemctl start openafs-client hasn't returned after a few minutes19:02
clarkbI need to eat lunch but will look at this more after19:02
clarkbthe region is disabled in nodepool so we should be ok just at lower capacity19:03
cgoncalvesclarkb, thanks19:06
*** priteau has joined #opendev19:13
openstackgerritAlbin Vass proposed zuul/zuul-jobs master: Add nim roles  https://review.opendev.org/74786519:18
openstackgerritAlbin Vass proposed zuul/zuul-jobs master: Add nim roles and job  https://review.opendev.org/74786519:20
*** priteau has quit IRC19:27
clarkbcheckvolumes and flushvolume do not work; they say they are not implemented19:29
clarkbthere is another stuck rmmod happening though so maybe related to unloading from the kernel?19:29
clarkbI'm beginning to run out of ideas that aren't rebuild the mirror19:30
clarkbit seems the problem is in getting the openafsclient to run at all and not specific to our afs tree19:31
fungii'm still tied up for the moment, but suspect there's some persistent state causing it to not access a working fileserver. like we see when "localhost" winds up in a server list19:31
clarkboh19:31
fungioh, yeah i guess afsd should still be able to start under those conditions though19:31
clarkbeverything looks fine in /etc/openafs19:33
clarkbI'm trying to modprobe openafs just to see if it will complain about something19:35
clarkbbut it isn't returning19:36
clarkbcould it be that our dkms build is just bad?19:36
openstackgerritAlbin Vass proposed zuul/zuul-jobs master: Add nim roles and job  https://review.opendev.org/74786519:36
clarkbmaybe we should force a rebuild/reinstall of openafs-client?19:36
fungiit's possible a kernel update triggered a rebuild which got interrupted for some reason19:38
clarkbI'm reinstalling openafs-modules-dkms which appears to be rebuilding the modules19:39
openstackgerritAlbin Vass proposed zuul/zuul-jobs master: Add nim roles and job  https://review.opendev.org/74786519:40
clarkband the rebuild is done. I'm going to try and reboot it and start manually again19:47
clarkbit's not looking better19:53
clarkbstill waiting for it to error but it's doing the sit and wait thing. Could be new kernel isn't compat with our openafs package or we have some other local state problem19:54
clarkbit just timed out19:54
clarkbthis is a bionic mirror I think we've got at least one focal mirror now19:54
clarkbmaybe we do just rebuild it19:54
*** roman_g has joined #opendev20:32
*** roman_g has quit IRC20:44
clarkbI'm going to pop out for a bike ride now. I still don't have a good answer other than rebuild21:10
*** nuclearg1 has joined #opendev21:17
*** nuclearg1 has quit IRC21:29
*** nuclearg1 has joined #opendev21:32
*** hamalq has quit IRC21:37
*** moppy has quit IRC21:42
*** paramite has quit IRC22:00
*** hamalq has joined #opendev22:00
johnsom Failed to fetch https://mirror.mtl01.inap.opendev.org/ubuntu/dists/focal/InRelease  Could not connect to mirror.mtl01.inap.opendev.org:44322:06
johnsominap too22:06
*** Dmitrii-Sh has quit IRC22:08
fungimm, that's one of a couple i didn't test22:31
fungiit can still access afs22:31
fungiis that maybe from a few hours ago?22:32
fungiokay, this is nuts. i can connect to the ssh port on it, but https times out?22:36
corvusmaybe all the apache procs are stuck?22:41
fungimaybe. tcpdump says i can reach it, but it's not responding to my syn packets22:41
*** qchris has quit IRC22:41
fungiyeah, that was likely it22:42
fungistopping apache, making sure all the processes were gone, then starting again seems to have allowed me to get a response22:43
corvusoh, that explains why i wasn't seeing anything in netstat, i'm assuming you restarted it right before i started inspecting22:43
fungilikely22:44
fungibetween 2241 and 224222:44
fungii'm supposing lsof would have showed open file handles to /afs which might have been timing out from the afs02.dfw restart22:45
fungior failing to time out, rather22:45
fungithere are a bunch of old vos operations (listvol, release, partinfo) hanging out in the process lists for mirror-update.openstack.org and mirror-update.opendev.org too which need reaping, looks like22:48
fungii'll terminate them22:48
fungidone, they were all from 7+ hours ago22:52
fungikeeping an eye on the periodic volume release logs now to see if they pick back up normal operation22:52
*** qchris has joined #opendev22:53
fungilooks like it's on track again23:14
*** mlavalle has quit IRC23:19
*** Dmitrii-Sh has joined #opendev23:21
*** hamalq has quit IRC23:26
*** tosky has quit IRC23:26
clarkbany better ideas for bhs1 mirror?23:54
funginot really, i guess it will be nice to have it on focal anyway?23:57
clarkbya I think we've already started converting them (though I would need to double check that)23:58
