Tuesday, 2021-09-14

jinyuanliu_hi02:45
jinyuanliu_https://review.opendev.org/SignInFailure,SIGN_IN,Contact+site+administrator02:46
jinyuanliu_I have newly registered an account. This error is reported when I log in. Does anyone know how to deal with it02:47
*** ysandeep|out is now known as ysandeep05:11
*** odyssey4me is now known as Guest719605:47
opendevreviewJiri Podivin proposed zuul/zuul-jobs master: DNM  https://review.opendev.org/c/zuul/zuul-jobs/+/80703107:11
*** jpena|off is now known as jpena07:23
*** hjensas is now known as hjensas|afk07:27
opendevreviewJiri Podivin proposed zuul/zuul-jobs master: DNM  https://review.opendev.org/c/zuul/zuul-jobs/+/80703107:53
opendevreviewOleg Bondarev proposed openstack/project-config master: Update grafana to reflect dvr-ha job is now voting  https://review.opendev.org/c/openstack/project-config/+/80559408:02
*** ykarel is now known as ykarel|lunch08:05
*** ysandeep is now known as ysandeep|lunch08:14
opendevreviewJiri Podivin proposed zuul/zuul-jobs master: DNM  https://review.opendev.org/c/zuul/zuul-jobs/+/80703108:20
noonedeadpunkhey there! 08:30
noonedeadpunkIt feels to me that git repos might be out of sync right now. https://paste.opendev.org/show/809294/08:31
noonedeadpunkand at the same time on other machine I get valid 768b8996ba4cb24eb2e5cd5dc149cd114186debd08:31
*** ykarel|lunch is now known as ykarel09:24
ianwnoonedeadpunk: is the other machine coming from another ip address?09:25
noonedeadpunkyep09:26
opendevreviewJiri Podivin proposed zuul/zuul-jobs master: DNM  https://review.opendev.org/c/zuul/zuul-jobs/+/80703109:29
ianwd615a7da              15:30:25.611      [1904d04c] push ssh://git@gitea03.opendev.org:222/openstack/cinder.git09:31
ianwinfra-root: ^ that to me looks like a stuck replication from gerrit, and matches the repo noonedeadpunk is cloning09:32
ianwthere are a few others as well, some older, up to sep 9 is the oldest09:32
ianwi've killed that process, i don't really have time for a deep debug at this point09:43
ianwhttps://paste.opendev.org/show/809297/ <- i have killed those processes which all seem stuck.  push queues seem empty now09:50
ianwwe should keep an eye to see if more are getting stuck09:51
noonedeadpunkgit pull worked for my faulty machine09:55
noonedeadpunkthanks!09:55
*** odyssey4me is now known as Guest720910:02
*** ysandeep|lunch is now known as ysandeep10:08
iurygregoryopendev folks, we started to see ironic jobs failing because of reno 3.4.0... I'm wondering if anyone saw the same problem in other projects (e.g https://zuul.opendev.org/t/openstack/build/6f39928da6a04d2ab0a64258d7309bfa )10:38
iurygregorymaybe an issue with pip?11:01
*** dviroel|out is now known as dviroel11:10
rosmaitaiurygregory: seeing that on a lot of different jobs, usually see that when the pypi mirror is outdated11:13
*** jpena is now known as jpena|lunch11:24
iurygregoryrosmaita, so we just wait to do some recheck?11:36
*** ysandeep is now known as ysandeep|brb11:46
*** ysandeep|brb is now known as ysandeep11:54
rosmaitaiurygregory: other than reporting it here (not sure the infra team can do anything about the mirrors), not sure what else we can do12:04
iurygregoryI asked in #openstack-infra also =)12:19
*** jpena|lunch is now known as jpena12:20
ykarelClark[m], fungi, ^ happened again12:25
ykarelsome mirrors affected like pip install --index-url=https://mirror.mtl01.iweb.opendev.org/pypi/simple reno==3.4.012:25
ykareli ran PURGE as you suggested last time but it doesn't seem to help, maybe i'm running it wrong12:26
ykarelran curl -XPURGE https://mirror.mtl01.iweb.opendev.org/pypi/simple/reno12:26
ykarelseems to work now12:27
ykarelthis time i ran it without reno, maybe that worked?12:27
ykarelcurl -v -XPURGE https://mirror.mtl01.iweb.opendev.org/pypi/simple12:27
ykareliurygregory, both the failures you shared were on mirror.mtl01.iweb.opendev.org, which i see working now12:30
ykarelhave you seen on other mirrors too?12:30
ykarelif not can try recheck and see if all good now12:30
iurygregoryykarel, I haven't checked other jobs, I will recheck and see how it goes13:09
ykarelack13:09
*** lbragstad_ is now known as lbragstad13:15
fungiykarel: it's not our mirrors which need to be purged, it's pypi's cdn13:19
fungiiurygregory: ^13:19
fungiwe're just proxying through whatever pypi's serving, and their cdn seems to sometimes serve obsolete content around montreal canada13:19
ykarelfungi, yes correct, i messed up the words13:19
fungino, i mean you're calling curl to purge our mirror which won't do anything, you need to purge that url from pypi's cdn13:20
fungiwhich should cause fastly to re-cache it from the correct content (hopefully)13:20
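A minimal sketch of the two purge targets under discussion, assuming PyPI's Fastly configuration still accepts unauthenticated PURGE requests for index URLs (it has in the past, but that is not guaranteed); the package name is just the one from this incident:

    # purge the stale simple index from PyPI's CDN, which is the cache that matters here
    curl -XPURGE https://pypi.org/simple/reno/
    # the earlier commands targeted the regional proxy instead
    curl -XPURGE https://mirror.mtl01.iweb.opendev.org/pypi/simple/reno/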
ykarelohkk, wasn't aware of that13:20
ykareli just ran above curl and it worked somehow13:21
ykarelmay be it was just timing13:21
fungirosmaita: also we have no pypi mirror to get outdated, we just proxy pypi13:22
fungiykarel: yes, from what we've seen in the past it's intermittent and may recur13:22
fungimy suspicion is that pypi still maintains a fallback mirror of their primary site which they instruct fastly to pull from if it has network connectivity issues reaching the main site, and that fallback mirror may be stale, and for whatever reason the fastly cdn endpoints near montreal have frequent connectivity issues and wind up serving content from the fallback site13:23
rosmaitafungi: ok, good to know13:23
ykarelfungi, ack13:25
fungiit used to be the case that the mirroring method pypi used to populate their fallback site didn't create the necessary metadata to make python_requires work, so we'd see incorrect versions of packages selected. it seems like they solved that, but possible that fallback could be very behind in replication or something13:26
fungiso instead we're just ending up getting old indices sometimes13:26
rosmaitai'm getting an openstack-tox-docs failure during the pdf build ... sphinx-build-pdf.log tells me "Command for 'pdflatex' gave return code 1, Refer to 'doc-cinder.log' for details" ... but i can't find doc-cinder.log in https://zuul.opendev.org/t/openstack/build/f45e84debbae44449bf6d98cc0807d68/logs ... where should i be looking?13:27
fungiit's possible the job isn't configured to collect that log13:27
rosmaitaoh13:27
fungilooking to see if i can tell where it would be written on the node13:28
toskyhttps://47bdf347398a802ddc78-6d4960ba8e43184f8d8ec59d1a3f8e83.ssl.cf2.rackcdn.com/760178/13/check/openstack-tox-docs/f45e84d/sphinx-build-pdf.log13:28
toskyrosmaita: ^^13:28
fungiahh, it just has a different filename13:28
rosmaitais that the same log?13:28
fungidunno, do pdf builds work locally for you? if they break similarly you can compare the log content13:28
rosmaitafungi: guess i will have to check13:29
fungilooks like it contains the same error referring you to the other log, so i think it's not13:29
fungiit probably wrote it at /home/zuul/src/opendev.org/openstack/cinder/doc/build/pdf/doc-cinder.log13:30
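For anyone trying to reproduce this outside the gate, a sketch assuming cinder carries the usual OpenStack pdf-docs tox environment; the log path is the build-tree location referenced above, relative to the repo checkout:

    # run the PDF docs build locally; on failure the detailed LaTeX log is
    # written into the build tree rather than collected by the job
    tox -e pdf-docs
    less doc/build/pdf/doc-cinder.log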
iurygregoryfungi, ack13:37
opendevreviewJiri Podivin proposed zuul/zuul-jobs master: DNM  https://review.opendev.org/c/zuul/zuul-jobs/+/80703113:38
Clark[m]fungi: rosmaita: https://zuul.opendev.org/t/openstack/build/f45e84debbae44449bf6d98cc0807d68/log/job-output.txt#2013 shows the file it redirected to which is the one tosky linked to13:42
fungiClark[m]: yeah, check the end of that file, it refers you to the other file rosmaita is looking for13:43
fungimaybe it thinks it's referring to itself by a different name?13:43
fungior maybe it contains all the same contents as the other file it references and is a superset?13:44
Clark[m]Well the redirected content is what the build outputs. I guess the build could write a separate log file the jobs don't know to collect. In the past the stdout and stderr have been enough to debug those though iirc 13:44
Clark[m]fungi: should we trigger replication for the repos ianw had to clean up old replication tasks to be sure they are caught up?13:46
fungiClark[m]: probably, i didn't see ianw mention having triggered a full replication13:46
fungii'll do that in a sec13:47
fungi`gerrit show-queue` indicates it's still caught up at least, so no new hung tasks13:48
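For reference, a sketch of the commands behind the last couple of lines, assuming Gerrit's standard admin SSH interface and the replication plugin:

    # inspect the Gerrit task queue, including pending replication pushes
    ssh -p 29418 review.opendev.org gerrit show-queue -w
    # re-enqueue replication of every project to all configured remotes
    ssh -p 29418 review.opendev.org replication start --all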
Clark[m]fungi: ykarel: our proxies should proxy the purge requests so I expect those curl commands work. But as fungi points out it creates confusion over where the issue lies. To be very very clear we are only giving the jobs what pypi.org has served to us.13:49
Clark[m]Ya those stale ones could be fallout from when we did the server migrations if those weren't as graceful as we expected13:49
Clark[m]I don't know if the timing lines up for that or not13:50
fungiroughly 1.8k tasks queued now13:50
fungiup to 7.8k now13:51
fungi10k...13:51
opendevreviewJiri Podivin proposed zuul/zuul-jobs master: DNM  https://review.opendev.org/c/zuul/zuul-jobs/+/80703113:52
Clark[m]I think the total for a full replication is around 17k13:52
Clark[m]It should go quickly for repos that are up to date13:53
fungitopped out a little over 18k and now falling13:53
fungiwe're down around 16k tasks now14:29
fungii guess we're looking at 4-5 hours for it to finish at this pace14:30
clarkbI wonder if the bw between sjc1 and ymq is lower than it was between dfw and sjc114:34
clarkbfungi: fwiw we can replicate specific repos which goes much quicker. In this case probably a good idea to do everything anyway14:35
opendevreviewJiri Podivin proposed zuul/zuul-jobs master: DNM  https://review.opendev.org/c/zuul/zuul-jobs/+/80703114:38
clarkbfungi: thinking out loud here about lists.o.o reboots. I think step 0 is confirm that auto apt updates haven't replaced our existing boot tools with newer versions including the decompressed kernel. Reboot on what we've got and confirm that works consistently. Then we can replace the decompressed kernel with the compressed kernel and try rebooting again. If that works we are good and14:40
clarkbdon't really need to do anything more. If that doesn't work we recover with the decompressed kernel and put package pins in place and plan server replacement14:40
fungiyep14:42
fungithat should suffice14:42
clarkbjinyuanliu_ isn't here anymore but chances are an existing account has the same email address as the one they just tried to log in with and gerrit caught that as a conflict and prevented login from proceeding14:46
clarkbthey should either use a different email address or login to the existing account14:46
fungiskimming last modified times in /boot on lists.o.o it looks like nothing has been touched since our edits14:48
clarkbcool. I still need to load ssh keys for the day. But I guess we can look at rebooting after meetings14:52
fungisgtm14:55
clarkbfungi: I just double checked and /boot/grub/grub.cfg shows the 5.4.0-84 kernel as the first entry, that kernel file in /boot is the decompressed version (much larger than the others), and /boot/grub/menu.lst still lists the chain load option first14:56
clarkbI agree those files all seem to be as we had them on the previous boot14:56
*** ykarel is now known as ykarel|away15:02
clarkbfungi: should we status notice the delay in git replication?15:03
clarkbsomething like #status notice Gerrit is currently undergoing a full replication sync to gitea and you may notice gitea contents are stale for a few hours.15:04
*** ysandeep is now known as ysandeep|dinner15:11
fungiwe can, though it's probably not going to be that noticeable. we're already down to ~14k remaining15:15
clarkbfungi: https://www.mail-archive.com/grub-devel@gnu.org/msg30984.html I'm no longer hopeful the chainloaded bootloader is any better15:17
clarkbhttps://lists.gnu.org/archive/html/grub-devel/2020-04/msg00198.html too15:18
clarkbfungi: in that case maybe we should focus on some combo of pinning the kernel package, doing the apt post install hook thing from that forum suggestion, and replacing the server15:21
fungiyeah, that seems like the best we can probably arrange15:22
fungiit's still unclear to me from those posts why chainloading the bootloader still doesn't provide lz4 decompression, yet somehow grub in pvhvm guests can decompress them just fine15:23
clarkbfungi: I suspect the chainloader may not hand off and execute grub2; instead it knows how to read grub2 configs and then negotiate running the kernel directly like pv wants15:25
fungimmm, i see. yeah there's mention of the kernel file being handed off to the hypervisor, so i suppose it's being handed off compressed and it's up to the hypervisor whether it supports that compression15:26
fungiin which case i wonder why the chainloading was even needed15:27
clarkbfungi: the only thing I can think of is maybe I got the menu.lst wrong? But menu.lst isn't going to be reliably updated anymore is it?15:28
fungiright, i guess rackspace's external bootloader will want to parse menu.lst (though maybe it would parse grub.cfg if menu.lst wasn't there)15:29
clarkbbased on what I've read I think we should do a reboot to double check it works reliably as is. Then either pin the kernel package or put in place the post install hack. Then start planning a replacement server?15:32
clarkbhttps://lists.xenproject.org/archives/html/xen-users/2020-05/msg00023.html "I think we lost most of them to KVM already anyway :(" Not going to lie it seems like maybe problems like this are a big reason for that15:32
fungimaybe this is a good time to try to combine the two ml servers into one, and/or a more urgent push for mm3 migration15:35
fungiworth discussing in today's meeting15:36
clarkbya it's on the agenda to discuss this stuff15:36
mordred"rackspace's external bootloader" ... do I even want to know?15:37
clarkbmordred: basically xen is weird and it can't reliably boot ubuntu focal anymore in pv? mode15:38
fungimordred: it's how xen pv works, the bootloader is external to the server image15:38
clarkbmordred: unlike kvm xen isn't running the bootloader if we understand things correctly. Instead it's finding the kernel and running it directly15:38
clarkbkvm is like your laptop and runs the actual bootloader avoiding all of these issues15:38
clarkbfungi: we can also yolo go for it and see if it can do the compressed file though I'm fairly certain it can't at this point15:39
fungii think xen pvhvm works like kvm though15:39
clarkbfungi: ya our pvhvm instances seem fine15:39
fungiclarkb: no, i agree with you after digging into those ml threads15:39
mordredwow. also - we have non-pvhvm instances? I thought we just used that for everything ... but in any case, just wow15:41
clarkbmordred: lists.openstack.org is our oldest instance as we have upgraded it in place to preserve ip reputation for smtp15:42
clarkbI think it predates rax offering pvhvm15:42
mordredahhhhhh yes15:42
clarkbfungi: the extract tool and the postinstall script don't seem too terrible if we just want to put those in place for now as a CYA maneuver while we plan to replace the server?15:42
clarkbNew suggestion, reboot on current state to ensure it is stable. Then install extract-vmlinux and the kernel postinstall.d script. Then start planning to replace the server15:43
clarkbfungi: ^ does that seem reasonable to you and if so do you think we need to try and ansible the extract-vmlinux and postinst.d stuff or just do it by hand?15:44
fungiwiki.o.o may be of a similar vintage to lists (though it's actually a rebuild from an image of the original so not technically an old instance, just an instance on an old flavor)15:44
fungimordred: lists.o.o has been continuously in-place upgraded since it was created with ubuntu precise (12.04 lts) some time in 201315:45
fungiclarkb: yes, that plan sounds solid15:45
clarkbcool in that case should we do a reboot nowish?15:46
fungiwe can probably by-hand manage the kernel decompression and either put a hold on the current kernel package or just plan to fix it with a recovery boot if unattended-upgrades puts us on a newer kernel and then rackspace reboots the instance for some reason15:47
clarkbya that is an option for us as well15:47
clarkbthe script from linux is in lists.o.o:~clarkb/kernel-stuff if we want to extract it by hand15:48
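One possible shape for that post-install hook, assuming extract-vmlinux (from scripts/ in the kernel source tree) has been copied to /usr/local/bin; Debian/Ubuntu kernel hooks are invoked with the version as $1 and the image path as $2. The file name is hypothetical:

    #!/bin/sh
    # hypothetical /etc/kernel/postinst.d/zz-decompress-kernel
    # Decompress a freshly installed kernel in place so the Xen PV
    # bootloader, which cannot handle lz4-compressed images, can load it.
    version="$1"
    image="${2:-/boot/vmlinuz-$version}"
    [ -n "$version" ] || exit 0
    tmp=$(mktemp)
    if /usr/local/bin/extract-vmlinux "$image" > "$tmp"; then
        mv "$tmp" "$image"
    else
        rm -f "$tmp"
    fi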
fungiand yes, i'm able to do the reboot now if you're ready15:48
fungiand at the ready to do the recovery boot if needed15:49
clarkbfungi: ya I'm ready. My meeting is over15:49
clarkbfungi: will you push the button? or should I?15:50
fungii will15:51
fungisorry, had to switch rooms, pushing the button now15:52
clarkbok15:53
fungilooks like it finished shutting down15:54
clarkband now it isn't coming back :/15:54
clarkbI guess it is a good thing to know this isn't reliable. But would be really nice to know why15:55
clarkboh wait it pings now15:55
fungiit's booting15:55
clarkbhuh is it just REALLY slow?15:55
fungifsck ran for a bit15:55
clarkbah15:55
fungialso the chainloading may delay it if we didn't take out the keypress timeouts15:56
fungi15:56:05 up 1 min,  2 users,  load average: 2.89, 0.81, 0.2815:56
fungiLinux lists 5.4.0-84-generic #94-Ubuntu SMP Thu Aug 26 20:27:37 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux15:56
clarkband it is on the 5.4 kernel15:56
clarkbok rebooting is reliable but slow.15:56
fungiftr, it's not our slowest rebooting server15:56
fungibut yeah it did take a bit15:57
fungiso looking at mm3 options, we could either go the distro package route for something stable (focal has mm 3.2.2) or the semi-official docker container route (which has options for latest mm 3.3.4 or rolling snapshots of upstream revision control)15:58
clarkbI just updated the lists upgrade etherpad with reboot and potential future options. I guess we discuss what we want to do next in our meeting15:59
fungithe up side to distro packages, in theory, is we get backported security fixes but fairly stable featureset until the next distro release upgrade. with docker containers we get a version independent of the running distro but end up consuming new versions a lot more frequently if we want security fixes15:59
clarkbDo we want to remove lists.o.o from the emergency file before the meeting? The two things to consider there are that we should probably remove the cached ansible facts file for that instance on bridge, and that it will run autoremove of packages which may remove our 4.x kernels, but we seem to be able to boot 5.4 now16:00
clarkbfungi: one thing I have really enjoyed with things like gitea where we consume upstream with containers is that we can update frequently and keep the deltas as small as possible16:00
clarkbletting things sit for years results in scaryness16:00
fungiyeah, i think it's fine. do we want a manual ansible run or just let the deploy jobs do their thing on their own time?16:01
clarkbfungi: the deploy jobs already ran so we should probably remove it then manually run the playbook16:01
clarkbwell deploy jobs ran for the fixup change I mean16:01
fungido we need to remove cached facts before taking it out of the emergency list?16:01
clarkbfungi: yes I think we need to do that otherwise some of our option selections might select xenial options16:01
clarkb/var/cache/ansible/facts/lists.openstack.org appears to be the file?16:02
clarkbI'm going to find breakfast but can help with that stuff after; if you want to go ahead feel free16:03
fungiahh, we can just delete that file i guess. i was looking through documentation to find out how to clear cached facts16:04
fungihah, first hit on a web search was https://docs.openstack.org/openstack-ansible/12.2.6/install-guide/ops-troubleshooting-ansiblecachedfacts.html16:04
fungithat at least seems to confirm your suggested method16:05
*** ysandeep|dinner is now known as ysandeep16:08
fungiclarkb: after a bit more looking around, i managed to convince myself deleting that file should do what we want, so removed it16:11
opendevreviewMartin Kopec proposed openstack/project-config master: Move ansible-role-refstack-client from x/ to openinfra/  https://review.opendev.org/c/openstack/project-config/+/76578716:12
fungiand removed the lists.openstack.org entry from the emergency disable list16:12
Clark[m]Cool I guess next up is running the playbook? I should be back at the keyboard in 20-30 minutes16:13
fungidon't rush. i'll get things queued up in a root screen session on bridge.o.o16:15
fungii've queued up a command to run the base playbook first, presumably that's what we want to start with16:16
Clark[m]++16:18
Clark[m]That is what does autoremove fwiw16:19
*** jpena is now known as jpena|off16:26
opendevreviewJiri Podivin proposed zuul/zuul-jobs master: DNM  https://review.opendev.org/c/zuul/zuul-jobs/+/80703116:27
fungihard to say for sure since we just reset it with a reboot, but memory consumption on lists.o.o looks slightly lessened after the upgrade than it was prior16:28
fungijust based on comparing the memory usage graph in cacti to the same day last week16:29
clarkbfungi: I've joined the screen16:35
clarkbI guess I'm ready if you are16:36
fungiready for me to flip the big red switch then?16:36
*** artom_ is now known as artom16:36
clarkbya I think we run base, then check what it did (if anything) to /boot contents16:36
clarkbthen we run the service lists playbook and restart mailman and apache services to be sure they are happy with slightly updated contents?16:37
fungistarting now16:38
clarkbthe exim conf task seems to have nooped as expected16:40
fungiyeah, so far so good16:40
clarkbthe older kernels have remained and we didn't end up with a recompressed new kernel16:41
clarkbI think that is a happy base run16:41
fungiso the service-lists playbook next? is that the correct one?16:41
clarkbchecking16:41
clarkbinfra-prod-service-lists job runs service-lists.yaml16:42
clarkbyes I believe that command is correct16:42
fungii mean, that's a file anyway, i was able to tab-complete it16:42
fungiokay, running now16:42
*** ysandeep is now known as ysandeep|out16:42
clarkbit is updating the things I expect it to update and skipping creation of lists because they already exist if I read the log properly16:44
fungiyep, that's my take on the output16:45
clarkbthe updates for airship's init script, apache config and the global mm_cfg.py all look as expected to me. Will just have to restart services and ensure they function when ansible is done I think16:45
fungiyep, i'll do that via ssh to the lists server16:46
fungii'm already logged in there16:46
fungiyay, completed without errors16:46
clarkbyup ansible looked good I think16:46
fungiso should i test the mailman service restarts then?16:46
clarkbfungi: maybe restart apache first and we check apache for each of the list domains then we can restart mailman-opendev and send an email to lists.opendev.org?16:47
fungisure16:47
clarkbthen if that is happy restart the other 416:47
fungiapache is fully restarted now16:48
clarkbI can browse all 5 list domains via the web and the lists seem to line up with the domain16:48
clarkbif that looks good to you I think we are ready to restart mailman-opendev and send a quick test email to it (just a response to your existing thread on service-discuss?)16:49
fungii browsed from the root page of each of the 5 sites all the way to specific archived messages, for some list on them, so lgtm16:50
fungirestarting mailman-opendev now16:51
fungirestarted. the list owned processes dropped from 45 to 36 and then went back up to 45 again16:51
fungisending a reply now16:52
clarkbya I see 9 new processes16:52
fungisent16:53
fungiit's in the archive16:54
fungiand i received a copy16:54
clarkbI see it in my inbox too16:54
fungii'll proceed with restarting the other four sets of mailman services now16:54
clarkb++16:54
fungiall of them have been cleanly restarted16:56
clarkbI see an appropriate number of new processes16:56
fungii confirmed the 9 processes for each went away and returned16:56
clarkbgerrit tasks queue down to 9.8k16:58
fungiso nearly halfway done16:59
fungii need to take a quick break, and then i'll try to spend a few minutes digging deeper on the current state of containerized mm3 deployments17:01
clarkbI think we're done with lists for now with next steps to be discussed in the meeting. I'm going to context switch to catching up on some zuul related changes I've been reviewing17:01
clarkbfungi: thanks for the help!17:01
fungialmost done with lists? we still have some test servers and autoholds we need to clean up, right?17:02
fungiclarkb: mind if i delete autohold 0000000332 "Clarkb checking ansible against focal lists.o.o"17:03
clarkbfungi: go ahead17:03
fungidone17:04
clarkbfungi: I think there is an older hold for lists too that can be cleaned up if it is still there17:04
fungilooks like the manually booted test servers are already cleaned up17:05
clarkbthe one I had originally when checking the upgrade from xenial to focal17:05
fungiclarkb: nope, just gerrit revert and gitea upgrade17:05
fungino other lists autohosts17:05
fungiautoholds17:06
clarkbI guess I cleaned that one up after the lists.kc.io upgrade since that was a better check17:06
fungilooks that way17:06
fungion the rackspace side of things, i'll clean up the old image we made for lists.o.o in may, but keep the one from last weekend17:07
clarkb++17:08
fungiand done17:09
clarkbfungi: should we let westernkansasnews know they have likely been owned? Their WP install is what these phishing list owner emails link back to17:15
clarkb(otherwise why return to them?)17:16
*** artom_ is now known as artom17:22
fungii receive countless phishing messages to openstack-discuss-owner, not sure it's worth my time to follow up on every one. i just delete them17:22
clarkbya I'm just noticing that a legit organization (they have a wikipedia entry so they are real!) seems to maybe be compromised.17:24
fungiskimming the mm3 docs, and the documentation for the container images, this doesn't look too onerous. three containers (for the core listserv+rest api, the web archiver/search index, and the administrative webui). the web components are uwsgi listening on a high-numbered port and expect an external webserver, so it fits well with our usual deployment model. similarly the core container expects to17:26
fungicommunicate with an mta on the system17:26
fungithey have postfix and exim4 config examples for the mta, only nginx example for the webserver but apache shouldn't be hard to adapt to it17:26
opendevreviewAnanya proposed opendev/elastic-recheck rdo: Fix ER bot to report back to gerrit with bug/error report  https://review.opendev.org/c/opendev/elastic-recheck/+/80563817:27
opendevreviewAnanya proposed opendev/elastic-recheck rdo: Fix ER bot to report back to gerrit with bug/error report  https://review.opendev.org/c/opendev/elastic-recheck/+/80563817:32
opendevreviewMartin Kopec proposed openstack/project-config master: Move ansible-role-refstack-client from x/ to openinfra/  https://review.opendev.org/c/openstack/project-config/+/76578718:07
clarkbdown to about 7k tasks now18:09
opendevreviewClark Boylan proposed opendev/infra-specs master: Spec to deploy Prometheus as a Cacti replacement  https://review.opendev.org/c/opendev/infra-specs/+/80412218:32
clarkbinfra-root ^ it seems everyone was largely happy with the last patchset of that change. I updated it based on some of the feedback I got. I expect this is mergeable in the near future if you can take a look to rereview it quickly18:32
corvusclarkb: i replied on a change there; i'm a little unclear if i should leave a -1 or +1.  maybe you can read the comment and let me know :)18:45
clarkbcorvus: hrm ya good point. I'm thinking maybe we evaluate both as we bootstrap and then commit to one before turning off cacti?18:48
clarkbcorvus: I can update the text in there to be more explicit along those lines if that makes sense to you18:48
corvusok.  tbh, if we want to consider node_exporter, i'd do it now and not spend any wasted effort on snmp_exporter.  like, it's a bit "why do it once when we can do it twice?"18:49
fungiwe're down to ~4k tasks remaining in the gerrit queue now18:49
clarkbcorvus: if we want to not bother with extra evaluation then I would probably say stick with snmp18:51
fungiis there a good summary on the benefits of node_exporter over snmp_exporter? or is it just a way to get rid of yet one more uncontainerized daemon on our servers?18:51
clarkbfungi: node_exporter is a bit more precanned for its gathering and graphing18:51
clarkbfungi: with snmp we'll have to define the mibs we want to grab and then define graphs for them18:51
clarkbthe downside to node_exporter is you have to run an additional service, and without docker doing that for node_exporter is likely to be difficult, and we run some services without docker18:52
fungii'll read up on it since i have no idea what precanned means in this sense. net-snmpd seems nicely precanned already18:52
corvusyeah, i think that's a fair summary of the trade off18:52
clarkbfungi: the service already knows the sorts of stuff you want to report because it is geared towards server performance metrics18:52
clarkbsnmp is far more generic and you have to configure the snmp exporter to grab what you want18:52
clarkbI'll revert the bit about node_exporter since I think the safest course for us is snmp exporter18:53
corvusi think my concern is that if the plan is "get node_exporter running once to try it out" we've done 98% of what's needed for "get node_exporter running everywhere" and we should just do that instead of snmp.  the advantage of snmp is we don't have to do any node_exporter work.  :)18:53
fungiahh, so the summary is that node_exporter is more opinionated and decides what you're likely to want rather than letting you choose?18:53
corvusso if we do both, we'll defeat the advantages.  :)18:53
clarkbfungi: I think it lets you choose too but yes comes with good defaults 18:53
opendevreviewClark Boylan proposed opendev/infra-specs master: Spec to deploy Prometheus as a Cacti replacement  https://review.opendev.org/c/opendev/infra-specs/+/80412218:53
clarkbBut for our environment I strongly suspect we need the flexibility of something like snmp18:54
clarkbsince running docker everywhere is not always going to be good for us18:54
corvusnode_exporter has a lot of collectors and i think it can be extended18:54
fungioh, i think i get it. it's not that net-snmpd is the problem, it's that snmp_exporter requires fiddling?18:54
clarkbfungi: yes18:54
clarkbfungi: you have to tell the prometheus snmp collector what to gather and how often to gather it and where to store it. Then you have to tell grafana to pull that data out and render it18:55
fungibecause prometheus is push-based as opposed to pull-based, so the configuration is on the pushing agent side not the polling server side18:55
corvusi thought docker-everywhere was a goal?18:55
clarkbcorvus: I don't think it ever was? iirc we explicitly didn't use docker for things like the dns servers18:55
fungii can't imagine every single daemon on every server will be in a container though, there's bound to be a line somewhere18:55
clarkbfungi: the constraint would be more along the lines of does every server run a dockerd that can have node_exporter running in it18:56
fungifor example we've so far considered it simple enough to rely on distro packages of apache rather than requiring an apache container on every webserver18:56
corvusrunning node_exporter everywhere doesn't seem like it should be too much of a challenge?  like, there isn't a server where we can't run docker?18:56
clarkbcorvus: currently there isn't a place where we can't run docker but we have chosen not to on some18:56
clarkbthe afs infrastructure is without docker and the dns servers are the ones I can think of immediately (mailman too but we're talking about changing that above)18:57
fungii don't really see a problem with deciding to deploy something in docker on every server now that we have orchestration for that though18:57
corvusfungi: for node_exporter: prometheus main server will poll node_exporter on leaf-node server.  pretty similar to an snmp architecture.  just the agent is node_exporter instead of netsnmpd18:57
clarkbI guess in today's meeting I'll ask people to specifically think about whether or not they think node exporter with docker everywhere is worthwhile18:57
clarkband to leave comments on the spec with what they decide18:58
corvusfungi: for snmp_exporter: prometheus main server will poll snmp_exporter on main server which will poll snmpd on leaf-node servers.18:58
fungiis the concern with the additional overhead of dockerd on some of the servers?18:58
clarkbfungi: I think that concern is minimal. For me it was more that we had explicitly decided not to run docker as it wasn't necessary for certain services like dns and afs18:58
fungioh, so prometheus does poll, like mrtg/cacti?18:59
clarkband if we want to gather metrics from those services we would need to add docker everywhere18:59
clarkbin either case we have to modify firewall rules so that doesn't count against either option18:59
fungiand it just doesn't speak snmp, so needs a translating layer to turn its calls into snmp queries?18:59
tristanCyou can also run the node exporter without docker, the project even publish static binary for many architecture as part of their release18:59
corvusfungi: yes it polls18:59
clarkbtristanC: I don't think we would do it that way.18:59
clarkbtristanC: that would be worse than running it in docker imo18:59
clarkb(because now you need some additional system to keep it up to date etc)19:00
mordredand we're already pretty well set up to run docker in places19:00
mordreda positive about the polling is that it makes some of the firewall rules simpler - just allow from the prom server on all the endpoints (static data) rather than needing to open the firewall on the cacti server for each endpoint. I mean - we have that complexity implemented so it's not an issue, but it's a place where a change to a service also impacts the config management of the cacti so there's an overlap19:03
corvusmordred: cacti polls snmp on the leaf-node servers, so it's the same firewall situation19:04
corvus(we don't use snmp traps, which would require inbound connections to cacti)19:04
mordredcorvus: oh - what am I thinking about where we need to open the firewall rules centrally for each leaf node?19:04
mordredam I just making up stuff in my head?19:05
corvusi think that may be it?  :)  we do have to add cacti graphs for every host...19:05
fungiso then is the difference that snmp_exporter would run on the prometheus server and connect to net-snmpd on every system while node_exporter would run on the individual systems in place of net-snmpd and prometheus would query it remotely via its custom protocol?19:05
corvusbut i reckon we'd probably end up adding grafana dashboards for every host too19:06
corvusfungi: yes  (note the custom protocol is http carrying plain text in "key:value" form)19:07
fungigot it, so node_exporter basically runs a private httpd on some specific port we would allow access to similar to how we control access to the snmpd service today19:08
corvusyeah; i assume we'd pick some value we could use consistently across all hosts.  maybe TCP 1061 ;)19:09
corvus(that wouldn't interfere with service-level prometheus servers also running on the host)19:10
fungioh, so it doesn't have a well-known port assigned19:10
fungiwe just pick something19:10
fungiwfm19:10
fungithis new internet where ip has been replaced by http is still somehow foreign to me ;)19:11
corvusfungi: prometheus servers can relay data from other prometheus servers, so there's a prometheus network topology on top of the http layer too :)19:13
fungiwow19:14
clarkbfungi: also applications like gerrit, gitea, and zuul can expose a port from within themselves to report to prometheus polls19:14
fungii see, and then prometheus knows the app-specific metrics endpoints to collect from as well as the system endpoint. that makes for rather a lot of sockets in some cases i'm betting19:15
mnasercorvus, fungi: i'm jumping into this but there is a 'registry' of exporter ports19:49
mnaserit ain't IANA but.. https://github.com/prometheus/prometheus/wiki/Default-port-allocations19:50
fungimnaser: oh cool19:50
mnasernode_exporter is 9100, etc, that is also indirectly a nice 'list' of exporters you can look at :p19:50
corvusgood, so as long as services aren't abusing 9100 that should be fine, and i agree we should try to follow that if we go that route19:51
* fungi concurs wholeheartedly19:51
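For illustration, a quick look at the plain-text exposition format being discussed, assuming node_exporter is listening on its registered port 9100:

    # scrape the metrics endpoint the same way the prometheus server would
    curl -s http://localhost:9100/metrics | grep '^node_load1'
    # typical output: node_load1 0.29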
mnaseralso, fungi, you mentioned some ipv6 reachability issues over the past .. while, we finally (i believe) got to the bottom of this, could you let me know if you're still seeing those issues19:51
clarkbmnaser: the last email we got about failed backups due to ipv6 issues was on the 10th19:52
fungimnaser: i can test again in just a moment19:52
clarkbmnaser: seems it has been happier the last few days19:52
fungibut yeah, if we're no longer getting notified of backup failures, that's a good sign it's fixed19:52
fungiianw: ^ good news!19:52
mnaserwee, awesome.  backlogs work! :P19:53
fungimnaser: do you feel like the fix probably took effect on friday or saturday? if so, that coincides with our resumption of backups19:53
mnaserfungi: the fix would have been applied by 12pm pt on saturday19:54
fungimnaser: sounds like a correlation to me then19:54
ianwfungi / mnaser: debug1: connect to address 2604:e100:1:0:f816:3eff:fe83:a5e5 port 22: Network is unreachable19:55
mnaseraw darn19:56
ianwso it looks like it's falling back to ipv4, which is making it work19:56
fungiaha19:56
ianwwhich is better(?) than before when it just hung? :)19:56
mnaserah dang, i think i have an idea here what's happening19:56
mnaseri bet its because the bgp announcement is not being picked up19:56
mnasersince the asn that announces that route is the same as all of our regions19:56
mnaserand we don't usually install an ipv6 static route19:56
mnaserso it makes sense that it cant find a route, gr19:57
clarkbfungi: for the kernel pin we have to craft a special file and stick it in a dir under /etc right? The thing that I always get lost on is what goes in the special file (priorities etc)19:58
clarkbfungi: maybe if you do the pin I can take a look at the file afterward and ask questions about the semantics of the thing?19:58
fungiclarkb: nah, we can just echo the package name followed by the word "hold" and pipe that through dpkg --set-selections19:59
clarkbfungi: TIL19:59
fungifirst step is getting the package name right20:00
clarkbfungi: in that case I guess just share the command you end up running and I'll feel filled in20:00
ianwmnaser: yeah, that was what i was going to say ... not! :)  i mean of course just let us/me know if i can do anything to help!20:00
fungiclarkb: `dpkg -S /boot/vmlinuz-5.4.0-84-generic` reports the linux-image-5.4.0-84-generic package is what installs that file, so it's what we want to hold20:01
clarkbfungi: that makes sense20:01
fungiecho linux-image-5.4.0-84-generic hold | sudo dpkg --set-selections20:01
fungii ran that just now20:02
clarkband then if we dpkg -l it should show a hold attribute on the package listing?20:02
fungiif you `dpkg -l linux-image-5.4.0-84-generic` you'll see the desired state is reported as "hold" instead of "install"20:02
funginow apt-get and other tools will refuse to replace that package version until the hold flag is manually removed20:03
mnaserianw: thanks, that error helps :)20:03
clarkbfungi: will it still update the kernels and install them in grub?20:03
fungiif we wanted to revert it, we'd just say install instead of hold on the command i ran earlier20:03
clarkbfungi: if so that hold may not be sufficient because we're relying on the first grub entry aiui20:03
clarkbmaybe we also hold linux-generic and linux-image-generic ?20:04
fungiclarkb: good point, we can also hold the virtual package20:04
fungii've held both of those too now20:05
funginot normally necessary but as you note, with kernel packages they make a new one for each version20:05
clarkbthere is also linux-image-virtual but it just installs linux-image-generic so I think we're good20:05
fungiyes, the problem is with new package names, new versions of the existing package of the same name will be blocked20:06
fungikernel packages are special in that regard20:06
fungimost packages don't have version-specific names20:06
clarkbright20:06
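A compact recap of the hold workflow above, for future reference; apt-mark hold/unhold/showhold is an equivalent front end to the same selection state:

    # hold the running kernel plus the meta-packages that would pull in new ones
    for pkg in linux-image-5.4.0-84-generic linux-generic linux-image-generic; do
        echo "$pkg hold" | sudo dpkg --set-selections
    done
    dpkg -l linux-image-5.4.0-84-generic   # desired-state column now shows "h" (hold)
    # to release a hold later:
    echo "linux-image-5.4.0-84-generic install" | sudo dpkg --set-selections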
clarkbok time to eat lunch.20:08
opendevreviewMarco Vaschetto proposed openstack/diskimage-builder master: Allowing use local image  https://review.opendev.org/c/openstack/diskimage-builder/+/80900920:31
clarkbfungi: looking at https://opendev.org/opendev/system-config/src/branch/master/playbooks/gitea-rename-tasks.yaml I think what we want for updating the metadata is a task at the very end of that list of tasks that looks up the metadata from somewhere and applies it20:32
clarkbhttps://opendev.org/opendev/system-config/src/branch/master/playbooks/roles/gitea-git-repos/library/gitea_create_repos.py#L141-L178 is how we do that in normal projcet creation20:32
clarkband that is called if we set the always update flag. I suppose another option here is to just run the manage projects playbook with always update set?20:33
clarkbas a distinct step of the rename process rather than trying to collapse it all. The problem with that is it is likely to be much slower than collapsing it down to only the projects that are renamed20:33
clarkbhttps://www.brendangregg.com/blog/2014-05-09/xen-feature-detection.html might be interesting to others (ran into it when digging around to see if there is any documentation on converting from pv to pvhvm)20:46
clarkbI've used that to confirm we are PV mode on lists and HVM elsewhere20:47
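For anyone repeating that check, one common approach (not necessarily the exact method from that post) is to look at what the kernel reported at boot:

    cat /sys/hypervisor/type                   # prints "xen" for either mode
    dmesg | grep -i 'booting paravirtualized'  # PV guests report running on Xen,
                                               # HVM/PVHVM guests on Xen HVM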
clarkbXen does support converting from pv to pvhvm. Not clear if rax/nova do20:50
clarkbLooks like the vm_mode metadata in openstack (which you can use to select pv or hvm) is an image attribute. This might be what makes it tricky20:52
clarkblooking at the image properties I don't see the vm_mode set though20:54
opendevreviewLuciano Lo Giudice proposed openstack/project-config master: Add the cinder-netapp charm to Openstack charms  https://review.opendev.org/c/openstack/project-config/+/80901220:55
clarkbI'm not finding anything definitive on the internet saying there is a preexisting process for openstack or rackspace. I guess we might have to file a ticket and ask?20:57
clarkbLooking at https://github.com/prometheus/node_exporter/blob/master/CHANGELOG.md to get a sense of what bionic node exporter vs focal node exporter looks like if using packages, and they appear to report metrics under different names21:16
clarkb0.18.1 includes a number of systemd related performance updates too21:17
clarkbI suppose it is possible to use the distro packages but then when we upgrade servers metric names will change on us21:17
clarkbif we deploy the latest release with docker we'd avoid that as we could have a consistent version, and it seems they have avoided changing names like that on the 1.0 release series21:18
clarkbI think if the distro packages were at least 1.0 it wouldn't be as big of a deal21:21
clarkbthere are ~5 tasks in the gerrit task queue from the great big replication that haven't completed21:23
clarkbI wonder if we should kill them and then attempt replication for those specific repos21:24
clarkbI guess give them another half hour and if they don't complete then stop them and reenqueue specifically for those repos21:25
clarkbmakes me wonder if we're potentially having the ipv6 issues in the other direction now21:26
clarkbsince the replication plugin should fail then retry with a new task entry iirc21:27
clarkbI'm actually going to go ahead and do the deb-liberasurecode task stop and restart it since that repo is not used anymore aiui21:30
clarkbya it completed when I did that. I'll go through the others and just reenqueue them21:32
*** dviroel is now known as dviroel|out21:36
clarkbthat is done. Seems like that worked to clear things out and then also ensure replication ended up running21:36
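A sketch of the per-repo variant of this cleanup, using the same admin SSH interface; the task ID and project are the ones mentioned earlier in the day and stand in for whatever show-queue actually reports:

    # list tasks and note the IDs of any that look stuck
    ssh -p 29418 review.opendev.org gerrit show-queue -w
    # kill a specific stuck task by ID
    ssh -p 29418 review.opendev.org kill 1904d04c
    # re-enqueue replication for just the affected project
    ssh -p 29418 review.opendev.org replication start openstack/cinder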
fungiinteresting, so that suggests we're getting some hung tasks, maybe on the order of 0.03% of the time21:43
fungirare, but not so rare that we wouldn't notice at our volume21:44
clarkbya. Most were to gitea03 (3 total) and one each to gitea01 and gitea0621:44
clarkbI suspect it is an issue with creating a network connection similar to the gitea01 backups21:47
clarkbsince we should retry and create new tasks if we fail, but that doesn't seem to have happened21:48
opendevreviewAnanya proposed opendev/elastic-recheck rdo: Fix ER bot to report back to gerrit with bug/error report  https://review.opendev.org/c/opendev/elastic-recheck/+/80563822:35
opendevreviewAnanya proposed opendev/elastic-recheck rdo: Fix ER bot to report back to gerrit with bug/error report  https://review.opendev.org/c/opendev/elastic-recheck/+/80563822:36
ianwclarkb: after killing the ones last night, i'm pretty sure i saw the replication restart22:47
clarkbianw: huh I don't think I saw that here but maybe it did22:48
clarkbit could've retried quicker than I could relist22:48
ianwthat was my thinking behind not setting off a full replication.  but anyway it would be good to know why these seem stuck permanently 22:49
ianwyou'd think there'd be a timeout22:49
clarkbI wonder if the network level connection is just sitting there and not doing anything to return a failure but also not completing a SYN SYN-ACK ACK handshake successfully22:50
clarkbsimilar to how ansible will sometimes ssh forever22:50
opendevreviewMerged openstack/diskimage-builder master: Fix debian-minimal security repos  https://review.opendev.org/c/openstack/diskimage-builder/+/80618823:16
