Wednesday, 2020-07-01

*** ryohayakawa has joined #opendev00:02
clarkbhttps://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_261/738714/2/check/system-config-run-gitea/261d1a6/gitea99.opendev.org/logs/access.log we have ports there00:06
clarkbfungi: corvus ianw ^ https://review.opendev.org/#/c/738714/ could use rereview, though I need to call it a day then can restart things with that tomorrow00:06
ianwi can restart if we like; but overall i'm not sure00:15
ianwthe UAs look a lot like those listed in https://www.informationweek.com/pdf_whitepapers/approved/1370027144_VRSNDDoSMalware.pdf, a 2013 article about a 2011 ddos tool called russkill00:16
ianwhowever, https://amionrails.wordpress.com/2020/02/27/list-of-user-agent-used-in-ddos-attack-to-website/ is another one that links to00:19
ianwhttps://github.com/mythsman/weiboCrawler/blob/master/opener.py00:19
ianwthat has the very specific00:20
ianwUAs we see -- compare and contrast to http://paste.openstack.org/show/795414/00:21
ianwlike "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SE 2.X MetaSr 1.0; SE 2.X MetaSr 1.0; .NET CLR 2.0.50727; SE 2.X MetaSr 1.0)" ... that's no coincidence00:22
ianwhttps://github.com/mythsman/weiboCrawler is also suspicious00:23
ianw"Opener.py independently encapsulates some anti-reptile header information."00:33
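For context, a User-Agent reject of the kind being drafted in 738725 could look roughly like the following Apache mod_rewrite sketch; it is illustrative only (the patterns come from the UA strings quoted above, and the actual change may target a different front end or syntax):

    RewriteEngine On
    # Return 403 to requests whose User-Agent matches the crawler strings seen in the access logs
    RewriteCond %{HTTP_USER_AGENT} "SE 2\.X MetaSr 1\.0" [OR]
    RewriteCond %{HTTP_USER_AGENT} "MSIE 7\.0; Windows NT 5\.1; Trident/4\.0"
    RewriteRule .* - [F]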
*** rchurch has quit IRC00:34
ianwkevinz: ^ i hope i'm not being rude asking but maybe you could translate more of what the intent is?00:35
*** Dmitrii-Sh has quit IRC00:38
*** Dmitrii-Sh has joined #opendev00:43
*** diablo_rojo has quit IRC00:44
*** DSpider has quit IRC00:57
openstackgerritIan Wienand proposed opendev/system-config master: [wip] add gitea proxy option  https://review.opendev.org/73872101:48
corvusianw: an alternate google translate of that phrase is "some anti-anti crawlers"02:16
ianwcorvus: yeah, it's pretty much a smoking gun for what's hitting us ... whether it is malicious or a university project gone wrong is probably debatable ... the result is the same anyway02:17
openstackgerritIan Wienand proposed opendev/system-config master: [wip] add gitea proxy option  https://review.opendev.org/73872102:20
corvusianw: the code is clearly intended to avoid detection as a crawler.  it looks like it's designed to crawl weibo without weibo detecting that it's a crawler02:27
corvusperhaps repurposed02:29
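What corvus is describing -- a crawler that rotates realistic desktop browser User-Agent headers so the target site cannot easily single it out -- boils down to something like this minimal Python sketch (illustrative only; the names are made up and this is not the actual opener.py code):

    import random
    import urllib.request

    # Pool of real-looking browser User-Agent strings; one is picked at random per
    # request so the traffic blends in with normal browser traffic.
    USER_AGENTS = [
        "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; "
        "SE 2.X MetaSr 1.0; SE 2.X MetaSr 1.0; .NET CLR 2.0.50727; SE 2.X MetaSr 1.0)",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    ]

    def fetch(url):
        req = urllib.request.Request(url, headers={"User-Agent": random.choice(USER_AGENTS)})
        with urllib.request.urlopen(req) as resp:
            return resp.read()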
*** sgw1 has quit IRC02:49
openstackgerritIan Wienand proposed opendev/system-config master: [wip] crawler ua reject  https://review.opendev.org/73872502:54
openstackgerritIan Wienand proposed opendev/system-config master: [wip] add gitea proxy option  https://review.opendev.org/73872103:23
openstackgerritIan Wienand proposed opendev/system-config master: [wip] crawler ua reject  https://review.opendev.org/73872503:23
*** sgw1 has joined #opendev03:27
openstackgerritIan Wienand proposed opendev/system-config master: [wip] crawler ua reject  https://review.opendev.org/73872503:51
openstackgerrityatin proposed openstack/diskimage-builder master: Revert "Make ipa centos8 job non-voting"  https://review.opendev.org/73872803:57
openstackgerrityatin proposed openstack/diskimage-builder master: Revert "Make ipa centos8 job non-voting"  https://review.opendev.org/73872803:57
*** ykarel|away is now known as ykarel04:24
openstackgerritIan Wienand proposed opendev/system-config master: [wip] crawler ua reject  https://review.opendev.org/73872504:28
*** iurygregory has quit IRC04:31
*** sgw1 has quit IRC04:40
*** mugsie has quit IRC04:53
*** mugsie has joined #opendev04:57
*** ysandeep|away is now known as ysandeep05:09
openstackgerritIan Wienand proposed opendev/system-config master: gitea: Add reverse proxy option  https://review.opendev.org/73872105:36
openstackgerritIan Wienand proposed opendev/system-config master: gitea: crawler UA reject rules  https://review.opendev.org/73872505:36
openstackgerritFederico Ressi proposed openstack/project-config master: Create a new repository for Tobiko DevStack plugin  https://review.opendev.org/73837805:48
*** factor has quit IRC06:03
*** factor has joined #opendev06:03
*** icarusfactor has joined #opendev06:05
*** factor has quit IRC06:06
openstackgerritIan Wienand proposed opendev/system-config master: gitea: crawler UA reject rules  https://review.opendev.org/73872506:16
*** icarusfactor has quit IRC06:22
*** bhagyashris is now known as bhagyashris|brb06:59
*** hashar has joined #opendev07:09
*** iurygregory has joined #opendev07:09
*** sorin-mihai_ has joined #opendev07:15
*** bhagyashris|brb is now known as bhagyashris07:15
*** sorin-mihai_ has quit IRC07:16
*** sorin-mihai_ has joined #opendev07:16
*** sorin-mihai has quit IRC07:17
*** sorin-mihai_ has quit IRC07:19
*** sorin-mihai_ has joined #opendev07:19
*** sorin-mihai_ has quit IRC07:21
*** sorin-mihai_ has joined #opendev07:21
*** sorin-mihai__ has joined #opendev07:23
*** sorin-mihai_ has quit IRC07:26
*** dtantsur|afk is now known as dtantsur07:35
*** tosky has joined #opendev07:40
*** moppy has quit IRC08:01
*** moppy has joined #opendev08:01
*** DSpider has joined #opendev08:12
*** hashar has quit IRC08:17
*** hashar has joined #opendev08:18
openstackgerritDaniel Bengtsson proposed openstack/diskimage-builder master: Update the tox minversion parameter.  https://review.opendev.org/73875408:19
*** ysandeep is now known as ysandeep|lunch08:21
*** ysandeep|lunch is now known as ysandeep09:21
*** hashar has quit IRC09:25
*** ryohayakawa has quit IRC09:31
openstackgerritAlbin Vass proposed zuul/zuul-jobs master: Allow deleting workspace after running terraform destroy  https://review.opendev.org/73877109:34
*** hashar has joined #opendev09:42
*** hashar is now known as hasharAway09:47
openstackgerritAlbin Vass proposed zuul/zuul-jobs master: Allow deleting workspace after running terraform destroy  https://review.opendev.org/73877109:53
*** hasharAway has quit IRC09:57
*** hashar has joined #opendev09:58
*** tkajinam has quit IRC09:59
openstackgerritAlbin Vass proposed zuul/zuul-jobs master: Allow deleting workspace after running terraform destroy  https://review.opendev.org/73877110:13
openstackgerritAlbin Vass proposed zuul/zuul-jobs master: Allow deleting workspace after running terraform destroy  https://review.opendev.org/73877110:26
*** ysandeep is now known as ysandeep|brb10:27
*** dtantsur is now known as dtantsur|brb10:27
openstackgerritSlawek Kaplonski proposed openstack/project-config master: Update Neutron Grafana dashboard  https://review.opendev.org/73878410:28
*** ysandeep|brb is now known as ysandeep10:37
openstackgerritAlbin Vass proposed zuul/zuul-jobs master: Allow deleting workspace after running terraform destroy  https://review.opendev.org/73877111:00
*** priteau has joined #opendev11:02
*** kevinz has quit IRC11:11
openstackgerritAlbin Vass proposed zuul/zuul-jobs master: Allow deleting workspace after running terraform destroy  https://review.opendev.org/73877111:12
sshnaidm|afkjust fyi, I saw in console today failure of cleanup playbook. It didn't affect the job results or something else, but in case you weren't aware: http://paste.openstack.org/show/795427/11:19
*** sshnaidm|afk is now known as sshnaidm|ruck11:19
openstackgerritAlbin Vass proposed zuul/zuul-jobs master: Allow deleting workspace after running terraform destroy  https://review.opendev.org/73877111:25
*** owalsh_ has joined #opendev11:55
*** owalsh has quit IRC11:58
*** ysandeep is now known as ysandeep|afk12:45
*** dtantsur|brb is now known as dtantsur12:45
*** sgw1 has joined #opendev12:55
clarkbsshnaidm|ruck: the way we are using the cleanup playbook is as a last effort to produce debug data at the end of jobs. The unreachable there is expected in the case of job failures that would otherwise break the job. In this case the job looks mostly happy at the end though. Is it possible that is the centos8 issue showing up later than usual? Maybe that points to the ssh keys being modified on the host by12:57
clarkbthe job somehow?12:57
clarkbsshnaidm|ruck: long story short we expect it to fail, but really only when the job node was left so unhappy by the job12:57
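For reference, the cleanup playbook clarkb mentions hangs off the base job roughly like this (a sketch, not the exact opendev base-jobs definition; the playbook paths are illustrative):

    - job:
        name: base
        pre-run: playbooks/base/pre.yaml
        post-run: playbooks/base/post.yaml
        # cleanup-run is executed even when the build failed, was canceled, or the
        # node went unreachable, so it is a best-effort place to collect debug data.
        cleanup-run: playbooks/base/cleanup.yaml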
sshnaidm|ruckclarkb, ok, that's fine then13:05
sshnaidm|ruckclarkb, I'm not sure any centos issue exists btw13:05
sshnaidm|ruckclarkb, we tried the same image and job on third party CI and never got retry_limit13:06
sshnaidm|ruckclarkb, even today there are far fewer retry_limits than the previous 2 days, and I think I saw multiple attempts in non-tripleo jobs as well13:06
clarkbyes, I've offered now like 5 times to attempt to hold nodes on our CI system or boot test nodes in our clouds, but no one will take me up on it13:07
clarkbthe issue is clearly centos8 related in that it happens there13:07
clarkbit may require specific "hardware" or timing races to be triggered though13:07
sshnaidm|ruckclarkb, if it was a centos issue, I13:07
openstackgerritLance Bragstad proposed openstack/project-config master: Create a new project for ansible-tripleo-ipa-server  https://review.opendev.org/73884213:07
sshnaidm|ruck'd expect it to be more consistent13:07
sshnaidm|ruck30% retry_limits yesterday and only a few today13:08
clarkbsshnaidm|ruck: is the job load different though?13:08
clarkbtripleo's queue looks very small right now13:08
clarkb(we can be our own noisy neighbor, etc)13:09
sshnaidm|ruckclarkb, during these 2 days it was the usual load, maybe more yesterday because of rechecks13:09
sshnaidm|ruckclarkb, usually it's growing to US afternoon13:10
clarkbalso I think it's still happening; if you look at the queue there are many retry attempts13:10
clarkbyou're just getting lucky in that it's not failing 3 in a row consistently13:10
clarkbwe should be careful about treating this as fixed if we're relying on it passing on attempt 2 or 313:12
clarkbsshnaidm|ruck: ^ you may also want to double check that your third party CI checks weren't doing the same thing13:13
sshnaidm|ruckclarkb, I set attempts:1 in both patches, on upstream and third party13:13
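The attempts:1 setting sshnaidm|ruck mentions looks roughly like this in a job definition (a sketch; the real change is in 738557 and the job name here is just one example from the discussion):

    - job:
        name: tripleo-ci-centos-8-standalone
        # Do not retry on pre-run/unreachable failures; surface the first failure
        # immediately so the node can be autoheld for debugging.
        attempts: 1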
clarkbok, I'm just calling it out because I see the current jobs with a bunch of retries13:14
sshnaidm|ruckclarkb, never got retry limit on 3party though, and today is much better in upstream too: https://review.opendev.org/#/c/738557/13:14
sshnaidm|ruckoh, got one13:14
fungialso when i analyzed the per-provider breakdown, it was disproportionately more likely in some than others (not proportional to our quotas in each) suggesting that the timing/load/noisy-neighbor influence could be greater in some places than others13:15
sshnaidm|ruckfungi, the problem is that I don't have logs, usually it's only a build id13:18
clarkbsshnaidm|ruck: yes, this is why I've offered to set holds or boot test nodes13:18
clarkbthe logs are on the zuul test node and zuul can't talk to them to get the logs13:18
sshnaidm|ruckclarkb, let's do it, I'm only for it13:18
clarkbwhich means it's difficult for zuul to do anything useful when we get in this situation. But if we set a hold maybe we can reboot the test node and log in or boot it on a rescue host or something13:19
clarkbsshnaidm|ruck: is there a specific job + change combo we should set a hold that is likely to have this problem?13:19
sshnaidm|ruckclarkb, all jobs here for example: https://review.opendev.org/#/c/738557/13:20
clarkbthat's the big problem I have with setting a hold. I don't know what to set it on13:20
sshnaidm|ruckall of them with attempts:113:20
clarkbsshnaidm|ruck: do the -standalone jobs meet that criteria? Those would be good simply because they are single node which keeps the cost of holds down13:21
sshnaidm|ruckclarkb, I think it's fine, the whole list is: http://paste.openstack.org/show/795437/13:22
*** jhesketh has quit IRC13:22
sshnaidm|ruckclarkb, you can just remove "multinode" from there13:22
*** jhesketh has joined #opendev13:24
clarkbsshnaidm|ruck: I've set it on tripleo-ci-centos-8-standalone tripleo-ci-centos-8-scenario001-standalone and tripleo-ci-centos-8-scenario002-standalone. The first two are running in check so maybe we'll catch one shortly13:24
clarkbI'll do 003 and 01013:24
sshnaidm|ruckclarkb, great13:25
clarkbthat's a good spread based on the jobs that are running I think13:25
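The holds clarkb is setting are created with the zuul client's autohold command; the invocation is roughly the following (a sketch -- the exact flags and project form may differ from what was actually run):

    zuul autohold --tenant openstack \
        --project openstack/tripleo-ci \
        --job tripleo-ci-centos-8-standalone \
        --change 738557 \
        --reason "tripleo with sshnaidm debug retry failures" \
        --count 1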
sshnaidm|ruckclarkb, can we pull any statistics which jobs have more attempts?13:25
sshnaidm|rucklooking at this patch, seems like not only centos has more attempts: https://zuul.opendev.org/t/openstack/status/change/737983,213:26
clarkbsshnaidm|ruck: sort of. I think we may only really record that if/when an attempt succeeds because we're logging that in elasticsearch. There is/was work to put it in the zuul db which may get it in all cases but I'm not sure if that has landed13:26
sshnaidm|ruck"puppet-openstack-lint-ubuntu-bionic (2. attempt)"13:26
clarkbya retries aren't abnormal13:26
fungireattempts can be for a number of reasons13:26
fungifailed to download a package during a pre phase playbook? retry13:27
fungiin this case we specifically care about retries from the node becoming unreachable13:27
sshnaidm|ruckI see, though I still suspect it's not centos, I couldn't reproduce any problem with the same image on a different ci13:29
clarkbI don't think any of the jobs in this pass will hit it. Looking at console logs they seem to be further along than the toci quickstart13:30
clarkbwe'll just have to keep trying until one trips over the hold13:30
*** ysandeep|afk is now known as ysandeep13:31
sshnaidm|ruckclarkb, ok, so I will leave standalone, 001-003, 010 in this patch and will keep rechecking?13:32
sshnaidm|ruckclarkb, please add also tripleo-ci-centos-8-scenario010-ovn-provider-standalone, seems like it may actually have a problem with the network: http://paste.openstack.org/show/795439/13:33
clarkbsshnaidm|ruck: yup, then let us know if one of those hits the issue and we'll see if we can reboot the host and get you ssh'd in13:33
sshnaidm|ruckclarkb, great13:33
clarkbsshnaidm|ruck: if that reboot doesn't fix things we can try to boot from snapshot or use a rescue instance too13:33
clarkbadded that hold13:34
fungiin the past we've also seen job nodes which go unreachable suddenly become reachable again soon after the job ends13:34
sshnaidm|ruckclarkb, now a different problem: what could cause a post_failure in finger://ze05.openstack.org/7be6bcfbd9274abba5caf69c65a3d519 ? No logs from the job, it just builds containers, no tripleo there13:34
clarkbsshnaidm|ruck: that's the same failure mode. If the job fails during the run step then it doesn't upload logs and you get the finger url13:35
clarkbdoes the tripleo quickstart script build images too?13:35
sshnaidm|ruckclarkb, no13:35
clarkbperhaps this is a networking issue with docker/podman?13:35
sshnaidm|ruckclarkb, it's a job that only builds containers, nothing else13:36
clarkband doing container things trips it13:36
sshnaidm|ruckclarkb, can not be13:36
sshnaidm|rucklike completely nothing touches network there13:36
clarkbcontainers do though13:36
clarkbincluding container image builds13:36
fungi2020-07-01 12:53:36,912 DEBUG zuul.AnsibleJob: [e: 7f319ac2c4284af3b8d5381a995ee25d] [build: 7be6bcfbd9274abba5caf69c65a3d519] Ansible complete, result RESULT_UNREACHABLE code None13:36
clarkbbecause containers, if they are namespacing the network, need networking13:36
sshnaidm|rucka-ha, so node disappeared13:37
sshnaidm|ruckclarkb, it doesn't run containers, just build them13:37
clarkbiirc building a container image happens in a container13:37
fungihttps://review.opendev.org/738668 should help get more public info for those cases13:37
clarkbI believe that is true for buildah and docker13:38
sshnaidm|ruckclarkb, sorry, but it's not realistic that buildah does something to the node's network13:38
clarkbwhy?13:38
clarkbthe container for the image build needs network access13:38
clarkbit has to do something13:38
clarkbI'm not sure what exactly that is, but it is doing something if I understand container image builds properly13:39
clarkbbasically I'm trying to not rule anything out13:39
corvusclarkb, sshnaidm|ruck: the patch to store retried builds in the db has landed; but it's not fully deployed in opendev's zuul13:40
clarkblet's see if we can recover system logs and go from there13:40
clarkbbut ruling things out before we have any logs is not very helpful13:40
clarkbfungi: I've rechecked https://review.opendev.org/#/c/738710/1 and plan to approve the gitea side as the conference winds down13:42
*** bhagyashris is now known as bhagyashris|afk13:42
fungii was considering pasting the executor loglines for build 7be6bcfbd9274abba5caf69c65a3d519 but it's 9302 lines and each is so long that i can only fit about 200 in a paste, so not wanting to make a series of 50 pastes out of that13:43
sshnaidm|ruckclarkb, let's add also tripleo-build-containers-centos-8 tripleo-build-containers-centos-8-ussuri13:43
sshnaidm|ruckclarkb, they're pretty short jobs13:44
sshnaidm|ruckmaybe we could catch it13:44
clarkbsshnaidm|ruck: ok that's done. Let's get some of these jobs running again. Maybe push a new ps that disables the other jobs and then start iterating to see if we catch some?13:45
sshnaidm|ruckclarkb, ack, retriggered now13:45
sshnaidm|ruckclarkb, will try to figure out how to disable others, not so simple now..13:46
sshnaidm|ruckfungi, clarkb from this last post_failure:13:48
sshnaidm|ruck2020-07-01 11:51:13.001938 | TASK [upload-logs-swift : Upload logs to swift]13:48
sshnaidm|ruck2020-07-01 12:51:00.211178 | POST-RUN END RESULT_TIMED_OUT: [trusted : opendev.org/opendev/base-jobs/playbooks/base/post-logs.yaml@master]13:48
clarkbsshnaidm|ruck: that means the job was uploading logs for 30 minutes and didn't finish in time13:49
clarkbwe have seen that happen due to sheer volume of log uploads in a job.13:49
clarkbeither a massive log file or many many files etc13:50
fungithough we tended to see it a lot more before we started limiting the amount of log data a job could save13:50
clarkbwe do have a watchdog for that but it relies on checking after the fact iirc so isn't perfect13:50
fungiyep13:50
clarkbya13:50
sshnaidm|ruckthis job has only a few logs, unlike usual tripleo jobs, I believe it's a network hiccup13:51
sshnaidm|ruckProvider: vexxhost-ca-ymq-113:51
clarkbsshnaidm|ruck: it could be, though I would expect ssh to complain in that case13:51
clarkbbecause we rsync the logs from the test node to the executor then upload to swift13:52
clarkbif it was a network issue on the test node the rsync should've failed. It's possible it's a bandwidth limitation between the executor and the swift I guess?13:52
sshnaidm|ruckthese jobs timed out to upload logs: http://paste.openstack.org/show/795442/13:53
clarkbdifferent zuul executors13:54
sshnaidm|ruckand clouds, and jobs..13:54
clarkbsshnaidm|ruck: did they both fail during upload-logs-swift?13:54
sshnaidm|ruckclarkb, yes13:54
clarkbok, that step is swift upload from executor to one of 5 swift regions. 3 in rax and 2 in ovh13:54
sshnaidm|ruckand this one as well http://paste.openstack.org/show/795443/13:58
clarkbthe logs for successful runs of that job aren't particularly large. Not tiny either. Looks like ~20MB which is reasonable. I've also found successful builds with uploads to both ovh regions13:58
*** mlavalle has joined #opendev13:58
sshnaidm|ruckmaybe the node also dies when it uploads logs, but.. such timing seems weird13:58
fungiit's more likely the count of individual log files being uploaded has a bigger impact on upload time than the aggregate byte count13:59
clarkbsshnaidm|ruck: also the swift upload is not from the node to swift. Its executor to swift13:59
clarkbfungi: the count is not bad either, though there are a few files13:59
sshnaidm|ruckclarkb, well, then even a dying node doesn't explain it..13:59
clarkbsshnaidm|ruck: the process is rsync the logs from node to executor then swift upload from executor to swift13:59
clarkbwhat could be happening is we spend a large amount of time doing the earlier steps like that rsync14:00
sshnaidm|ruckclarkb, ack14:00
clarkbthen by the time we do the swift upload we fail because there isn't much time left14:00
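The flow clarkb describes -- rsync the logs from the node to the executor, then upload from the executor to swift -- corresponds roughly to this simplified post-run sketch (role names as used in zuul-jobs/base-jobs; the real base job splits these across separate playbooks and passes more parameters):

    - hosts: all
      roles:
        - fetch-output        # pull job logs from the test node onto the executor

    - hosts: localhost
      roles:
        - upload-logs-swift   # push the collected logs from the executor to swift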
fungiyeah, we'd need to profile the start/end times of the individual tasks14:01
clarkbya though looking at the timestamps sshnaidm|ruck posted here it's spending an hour doing the upload logs to swift step14:02
clarkbfull context for the post-run stage would be good though14:03
clarkbsshnaidm|ruck: can you paste that somewhere?14:03
clarkbbut also my ability to debug things deteriorates as the number of simultaneous issues increases14:03
sshnaidm|ruckclarkb, yeah, preparing..14:03
clarkbthe build uuid is useful too because we can use that to find the logs on the executor to see if there is any hints there14:04
*** hashar has quit IRC14:06
corvusclarkb: you're saying there's a second issue related to log uploading?14:08
clarkbsshnaidm|ruck: to stop running jobs you want to edit this file: https://review.opendev.org/#/c/738557/3/zuul.d/layout.yaml. Under templates remove everything but the container builds and standalone template. Under check remove everything I think. Set check to [] or remove the check key entirely14:08
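What clarkb is suggesting for zuul.d/layout.yaml would end up looking roughly like this (a sketch; the template names here are placeholders, not the real tripleo-ci template names):

    - project:
        templates:
          - tripleo-standalone-jobs-template   # placeholder: keep only the standalone template
          - tripleo-container-builds-template  # placeholder: and the container builds template
        check:
          jobs: []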
clarkbcorvus: yes, or perhaps its the same issue with different symptoms (like maybe the network issues are on the executors)14:08
clarkbI don't think that is the case because I was able to reproduce failure to connect to test nodes from home on monday14:08
clarkbmy hunch is it is two separate issues14:09
corvusclarkb: would you like me to look into http://paste.openstack.org/show/795443/ ?  (is that a good example)?14:09
clarkbcorvus: I haven't been able to look at those more closely since they don't have build uuids and sshnaidm|ruck's examples pasted directly into irc weren't directly attributed to anything, but those are what we've got until sshnaidm|ruck pastes more context somewhere14:10
clarkbbasically I think that is the breadcrumb we've got but I don't know how good an example they are14:10
corvusit has an event id and a job, i can track it down14:10
fungii concur, the node we witnessed go unreachable was really unreachable from everywhere, not just from the executor (i couldn't reach it from home either)14:10
sshnaidm|ruckclarkb, https://pastebin.com/Zn30H9sj https://pastebin.com/4nz8V1W714:10
corvussshnaidm|ruck: what are those pastes?14:11
sshnaidm|ruckcorvus, consoles where upload logs times out14:11
corvuscool -- since we're tracking multiple issues, it's good to be explicit :)14:11
sshnaidm|ruckclarkb, corvus one build id is 0e70e37c7f8948a581483123aaab98a214:12
sshnaidm|rucksecond is bf81f447b4924b56a613c3449a474d3014:13
clarkbthanks14:13
corvusi'll track down the executor logs for d3014:14
clarkb0e70e37c7f8948a581483123aaab98a2 seems to have been unreachable when the cleanup playbook ran14:15
sshnaidm|ruckyeah, second one was more stable14:16
corvussame is true for d30 -- it spent 1 hour trying to upload logs, timed out, then unreachable for cleanup; that sounds like a hung ssh connection during the 1 hour upload playbook, then no further connections?14:17
corvuswhen the final playbook fails, it logs the json output from the playbook; i'm looking through that14:18
corvusit only has the json output from the previous playbook, likely because it killed the process running the final playbook, so that's not much help14:22
sshnaidm|ruckclarkb, tripleo-ci-centos-8-scenario003-standalone has retry limit14:22
sshnaidm|ruck\o/14:22
clarkbsshnaidm|ruck: cool let me see what we canfind14:22
corvusclarkb: however, the second-to-last playbook removes the build ssh key -- why does the cleanup playbook work at all?14:23
corvusclarkb: (are we relying on a persistent connection lasting through that?)14:23
clarkbcorvus: oh we may with the control persistent thing14:24
sshnaidm|ruckOMG almost all of them start to fail14:24
clarkb23.253.159.20214:24
clarkbit actually pings14:24
clarkband I can ssh in14:25
corvusclarkb: i think we should take any unreachable errors on the cleanup playbook with a grain of salt.  especially if we sat there for an hour doing the swift upload, it's likely controlpersist will timeout and we won't re-establish.  so i think that's a red herring.14:25
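The persistent connection corvus refers to is Ansible's SSH ControlPersist socket; the relevant settings look roughly like this (values illustrative, not necessarily what the executors actually use):

    # ansible.cfg
    [ssh_connection]
    # Reuse one master SSH connection per host; after it idles out (or the per-build
    # key is removed from the node), new connections must re-authenticate.
    ssh_args = -o ControlMaster=auto -o ControlPersist=60s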
clarkbwhich suddenly has me thinking: zk connections dropping maybe?14:25
clarkbsshnaidm|ruck: do you have an ssh key somewhere that I can put on the host? You'd be better able to look at tripleo logs to check them for oddities14:25
corvusthere are no zk errors in the scheduler log14:25
sshnaidm|ruckclarkb, https://github.com/sshnaidm.keys14:26
fungiand the scheduler isn't running out of memory (or even close)14:26
clarkbsshnaidm|ruck: your key has been added. Let's not change anything on the server just yet as we try and observe what may be happening14:27
corvusclarkb: do you have a build id handy for that?14:27
clarkbcorvus: not yet that was going to be my next thing14:27
sshnaidm|ruckclarkb, thanks14:27
clarkbcorvus: finger://ze05.openstack.org/603b84def27f47ee99585d58436cdc72 I think14:27
sshnaidm|ruckclarkb, which user?14:27
fungisshnaidm|ruck: root14:28
sshnaidm|ruckI'm in14:28
clarkbnow this is interesting14:28
clarkbits an ipv6 host14:29
clarkbbut ifconfig doesn't show me the ipv6 addr?14:29
clarkbya I think its ipv6 stack is gone14:29
clarkband we'd be using that for the job I'm pretty sure14:29
clarkbsshnaidm|ruck: ^14:29
sshnaidm|ruckclarkb, hmm.. not sure why14:30
clarkbI'm double checking that zuul would use that IP now14:30
fungilooks like the first task to end in unreachable was prepare-node : Assure src folder has safe permissions14:30
clarkb2001:4802:7803:104:be76:4eff:fe20:3ee4 is the ipv6 addr nodepool knows about14:31
sshnaidm|ruckfungi, this task takes much time from what I saw in jobs..14:31
clarkbwe don't seem to record either ip in the executor log14:31
clarkbso need to do more digging to see what ip was used14:31
corvusfungi: for build 603b84def27f47ee99585d58436cdc72 ?  that looks like it succeeded to me.14:31
fungicorvus: oh, that was the last task of 11 for the play where one task was unreachable14:32
clarkbcorvus: 603b84def27f47ee99585d58436cdc72 hit unreachable14:32
corvusclarkb: yes, i was saying i disagreed with fungi about what task was running at the time14:33
clarkbah14:33
clarkbsorry, I read it as successful job; you meant successful task14:33
corvusyep, i'll take my own advice and use more words14:33
corvuswe do have all of the command output logs available in /tmp on the host14:34
corvus/tmp/console-*14:34
fungii'm still having trouble digesting ansible output. the way it aggregates results makes it hard to spot which task failed in that play14:34
clarkbooh should we copy those off so that a reboot doesn't clear tmp?14:34
corvusyes14:34
corvusnote that the most recent one is console-bc764e04-a4bf-cd3a-2382-000000000016-primary.log and exited 014:35
clarkbk I'll do that copy once I've confirmed the ip used was the ipv6 addr14:35
clarkbthe copy of the logs14:35
corvusclarkb: save timestamps14:35
corvussince that's about the only way to line them up with tasks14:35
corvusoh they have timestamps in them :)14:36
corvusso not critical, but still helpful :)14:36
clarkbrgr14:36
fungithis is a rax node, so would have gotten its ipv6 address via glean reading it from configdrive, right?14:37
clarkbfungi: oh yup. Also I've just found other rax-iad jobs and they use the ipv4 addr as ansible_host14:38
fungiweird14:38
clarkbso ipv6 may just be a shiny nuisance14:38
clarkbI think glean may not configure ipv6 on red hat distros14:38
clarkbwith static config drive config14:38
clarkbwe are swapping quite a bit on ze05 as well14:39
clarkbcopying logs now14:39
sshnaidm|ruckclarkb, node 23.253.159.202 - from which jobs is it?14:42
sshnaidm|ruckfungi, clarkb, do you have console logs for it just in case?14:42
clarkbsshnaidm|ruck: tripleo-ci-centos-8-scenario003-standalone14:43
sshnaidm|ruckack14:43
*** diablo_rojo has joined #opendev14:43
clarkbsshnaidm|ruck: the console logs are all in /tmp/console-* on that host14:43
corvusclarkb: i'm having trouble lining this up with the executor log14:43
clarkbI've pulled them onto my desktop because /tmp there could be cleared out on a restart14:43
corvuscan we double check that the ip and uuid match14:43
clarkbcorvus: oh I can do that14:43
corvusclarkb: my understanding is we're looking at 603b84def27f47ee99585d58436cdc72 on ze05, right?14:44
clarkb| 0017542272 | rax-iad             | centos-8                  | e7bd207a-87d2-4394-877e-696b381c34f2 | 23.253.159.202  | 2001:4802:7803:104:be76:4eff:fe20:3ee4  | hold     | 00:00:45:56  | unlocked | main              | centos-8-rax-iad-0017542272                          | 10.176.193.212  | None     | 22   | nl01-10-PoolWorker.rax-iad-main                          | 300-0009721443 | openstack14:44
clarkbopendev.org/openstack/tripleo-ci tripleo-ci-centos-8-scenario003-standalone refs/changes/57/738557/.* | tripleo with sshnaidm debug retry failures               |14:44
clarkbthat is the nodepool hold14:44
clarkband that tripleo-ci-centos-8-scenario003-standalone job had the finger url I posted above. But I'm double checking again14:44
sshnaidm|ruckwow, so many console logs in /tmp/14:44
clarkb2020-07-01 13:49:50,678 INFO zuul.AnsibleJob: [e: e0622c34b7944cfcadddb92079a3e537] [build: 603b84def27f47ee99585d58436cdc72] Beginning job tripleo-ci-centos-8-scenario003-standalone for ref refs/changes/57/738557/3 (change https://review.opendev.org/738557)14:45
clarkbthe job and change are correct at least14:45
clarkbstill trying to be sure on the build. Not sure what key we use between nodepool and executor logs14:45
clarkbok that's weird14:46
corvusshould be able to correlate node id to build on the scheduler14:46
corvus2020-07-01 11:27:56,980 INFO zuul.ExecutorClient: [e: 7f319ac2c4284af3b8d5381a995ee25d] Execute job tripleo-ci-centos-8-scenario003-standalone (uuid: 1f2db893d451453bacfb80face66ec92) on nodes <NodeSet single-centos-8-node [<Node 001754227  ('primary',):centos-8>]> for change <Change 0x7fab39611990 openstack/tripleo-ci 738557,2> with dependent changes [{'project': {'name': 'openstack/tripleo-ci',14:47
corvus'short_name': 'tripleo-ci', 'canonical_hostname': 'opendev.org', 'canonical_name': 'opendev.org/openstack/tripleo-ci', 'src_dir': 'src/opendev.org/openstack/tripleo-ci'}, 'branch': 'master', 'change': '738557', 'change_url': 'https://review.opendev.org/738557', 'patchset': '2'}]14:47
clarkbthe ansible facts for 603b84def27f47ee99585d58436cdc72 show the hostname is centos-8-inap-mtl01-001755060414:47
corvusclarkb: the scheduler log i just pasted is for the node id you pasted, but it's a different uuid14:47
corvusit's also several hours old14:47
clarkbit's like we held a rax node but the job ran on inap?14:48
fungi23.253.159.202 is centos-8-rax-iad-0017542272 if that's what everyone's still looking at14:48
sshnaidm|ruckclarkb, do you have other nodes?14:49
fungihuh, yeah that's extra weird14:49
corvusclarkb: how did you arrive at the uuid 603b84def27f47ee99585d58436cdc72 ?14:49
*** mwhahaha has joined #opendev14:49
clarkbsshnaidm|ruck: yes all of the standalone jobs ended up being held14:49
clarkbcorvus: I went to the zuul web ui and retrieved the finger url for the job that triggered the hold14:50
clarkbcorvus: the jobs have been modified on that change to set attempts: 1 and so won't retry beyond the first failure14:50
corvusclarkb:  uuid 603b84def27f47ee99585d58436cdc72 ran on node 00175422714:51
corvusis that one held?14:51
weshay_rucksshnaidm|ruck, mwhahaha https://review.opendev.org/#/c/738557/14:51
clarkbcorvus: yes | 0017542272 | rax-iad             | centos-8                  | e7bd207a-87d2-4394-877e-696b381c34f2  | 23.253.159.202 is the held node14:52
clarkboh yours is short a digit? is that a mispaste?14:52
corvuschecking14:52
clarkb[build: 603b84def27f47ee99585d58436cdc72] Ansible output: b'        "ansible_hostname": "centos-8-inap-mtl01-0017550604",' <- from the executor logs14:54
clarkbI would expect the ansible_hostname fact to be centos-8-rax-iad-001754227214:54
corvusyes -- i think that whole number was a mispaste; let's start over14:54
corvus603b84def27f47ee99585d58436cdc72 ran on 001755060414:55
corvus1f2db893d451453bacfb80face66ec92 ran on 001754227214:55
clarkbok that matches the ansible host value we see in the build so that's good. That doesn't explain why the finger url on the web ui seems to be wrong14:56
corvuswhich one are we going to look at?14:56
clarkbthe hold is for 0017542272 so we want 1f2db893d451453bacfb80face66ec9214:56
clarkboh you know what14:56
sshnaidm|ruckclarkb, seems like 23.253.159.202 is the wrong node14:56
clarkbI think I know what happened. sshnaidm|ruck abandoned the change and restored it to rerun jobs14:57
clarkbthe hold was already in place when that happened. Do we hold when jobs are cancelled for that state change?14:57
clarkb603b84def27f47ee99585d58436cdc72 is what we want a hold for, but not what we got a hold for14:57
clarkbthe other holds we got for the other jobs are much newer and all in inap14:58
clarkbI'm thinking let's ignore standalone003 and look at the other jobs?14:58
clarkb| 0017550599 | inap-mtl01          | centos-8                  | d4bf117d-ef6d-427e-9081-fb37cd069594 | 198.72.124.39   |                                         | hold     | 00:00:26:07  | unlocked | main              | centos-8-inap-mtl01-0017550599                       | 198.72.124.39   | nova       | 22   | nl03-7-PoolWorker.inap-mtl01-main                        | 300-0009723644 | openstack14:58
clarkbopendev.org/openstack/tripleo-ci tripleo-ci-centos-8-standalone refs/changes/57/738557/.*             | tripleo with sshnaidm debug retry failures               |14:58
clarkbThat hold looks like a proper hold14:58
corvusclarkb: the autohold info may be the most concise way to correlate them14:59
sshnaidm|ruckclarkb,  198.72.124.39  ?14:59
clarkband I think finger://ze08.openstack.org/b5b5835abca34522933ee496c49514a6 is related to 0017550599 but I'm double checking that now before I do anything else14:59
clarkbsshnaidm|ruck: yes, but let me double check my values :)14:59
sshnaidm|ruckclarkb, sure14:59
corvusclarkb: what autohold number is that?15:00
clarkb[build: b5b5835abca34522933ee496c49514a6] Ansible output: b'        "ansible_hostname": "centos-8-inap-mtl01-0017550599",'15:00
corvusautohold-info 0000000160  Held Nodes: [{'build': 'b5b5835abca34522933ee496c49514a6', 'nodes': ['0017550599']}]15:00
corvusclarkb: is that correct ^ ?15:01
clarkbcorvus: 0000000160 yup15:01
clarkbthat server does not ping15:01
clarkbwhich is more like what we expected15:01
corvusbuild b5b5835abca34522933ee496c49514a6 ran on ze0815:02
corvusit failed 24 minutes into 2020-07-01 13:57:59,664 DEBUG zuul.AnsibleJob.output: [e: e0622c34b7944cfcadddb92079a3e537] [build: b5b5835abca34522933ee496c49514a6] Ansible output: b'TASK [run-test : run toci_gate_test.sh executable=/bin/bash, _raw_params=set -e'15:03
clarkbserver show on the server instance shows no errors from nova15:03
clarkbI can try a reboot and see if it comes back15:03
fungiTASK: oooci-build-images : Run build-images.sh15:03
clarkbbut I'll hold off on that to ensure we've done all the debugging we can15:04
clarkbmgagne: ^ if you're around you may be able to help as well as this is an inap instance15:04
fungioh, again, i'm looking at the last task in the play which had the first unreachable result... how do you identify the task which actually was unreachable?15:05
corvusfungi: i don't see that line at all.  are you looking at build b5b5835abca34522933ee496c49514a6 on ze08?15:05
fungioh, i know what i'm doing wrong15:06
fungii resolved build b5b5835abca34522933ee496c49514a6 to event e0622c34b7944cfcadddb92079a3e537 so i'm looking at the entire buildset15:06
funginow that i'm not looking at interleaved logs from multiple builds, yes this is easier to follow15:08
fungii agree it was run-test : run toci_gate_test.sh during which the node became unreachable15:08
clarkbthe neutron port for that IP address shows it is active and I don't see any errors15:08
clarkbI'm now trying to confirm that that port is actually attached to the instance15:09
clarkbthe port device id matches our instance id15:09
clarkbnow I'm going to double check security groups15:10
clarkbthe security group is the default group with all ingress and egress allowed, as expected15:11
fungiunfortunately the nova console log buffer has been overrun by network traffic logging15:11
clarkbfungi: the time range is about 45 minutes though15:12
clarkbwe can find when the instance booted to see if any crashes should show up in that15:12
clarkb| created                | 2020-07-01T13:48:50Z15:13
fungiyep, current console log covers 50 minutes of time and ends at 4827 seconds15:14
clarkb4827 seconds + 2020-07-01T13:48:50Z is ~= 10 minutes ago?15:14
clarkbmaybe a bit more recent?15:14
fungiso basically an hour and twenty minutes after 13:48:5015:14
fungi~15:08z yep15:15
clarkb2020-07-01 14:21:08,075 is when failure occurred15:15
fungiso maybe 7 minutes ago15:15
*** ysandeep is now known as ysandeep|away15:15
clarkbit's possible a crash occurred earlier and ansible didn't know about it I guess15:15
fungiit's continuing to log15:15
clarkbI'm running out of things to consider from the api side of things15:15
clarkbanything else people want to try before we ask for a reboot?15:16
fungiso yes the instance is still running and outputting network traffic logging on its console15:16
fungilooks like it's logging its own dhcp requests15:16
fungioh, actually these look like other machines' dhcp requests maybe15:17
fungii suppose they could be virtual machines running on this instance15:17
corvusfungi: any identifying info like mac?15:17
clarkbfa:16:3e:fa:da:5e is the port mac according to the openstack api15:18
fungiMAC=ff:ff:ff:ff:ff:ff:fa:16:3e:e9:0c:f4:08:0015:18
fungiMAC=ff:ff:ff:ff:ff:ff:fa:16:3e:16:ce:62:08:0015:18
fungii keep seeing those same two logged on ens315:18
*** _mlavalle_1 has joined #opendev15:19
fungiudp datagrams from 0.0.0.0:68 to 255.255.255.255:67 so definitely dhcp discovery15:19
corvusoh you know what, i'm silly -- that's not going to tell us anything, because of course the internal vms are openstack, and the real vm is also openstack, so they're all going to be fa:16:3e :)15:19
clarkbcentos-8-1593208195 is MAC=ff:ff:ff:ff:ff:ff:fa:16:3e:16:ce:62:08:0015:20
clarkbcorvus: ya15:20
* clarkb figures out the other one15:20
fungiyup. also if they're other machines and not this one but somehow still showing up on our port (because they're broadcast) they'll also likely be openstack15:20
clarkber that's a mispasted image name15:21
clarkbcentos-8-inap-mtl01-0017550603 is MAC=ff:ff:ff:ff:ff:ff:fa:16:3e:16:ce:62:08:0015:21
clarkbcentos-8-inap-mtl01-0017550600 is MAC=ff:ff:ff:ff:ff:ff:fa:16:3e:e9:0c:f4:08:0015:21
clarkbthose also happen to be two held nodes for other jobs that failed15:22
*** mlavalle has quit IRC15:22
corvusso there's a dhcp process running on our failed node which is continuing to receive and log requests from other inap vms nearby?15:22
clarkbcorvus: I think it's just the iptables logging from the kernel that is receiving them15:23
clarkbwhat that implies to me is l2 is fine15:23
clarkbit's l3 that is failing15:23
corvusack15:23
clarkband those other two hosts also don't ping15:23
clarkbso ya they all seem to have working links and ethernet is functioning. IP is not functioning15:24
clarkbI'm ready to try a reboot if no one objects15:24
corvus++ i think we learned a lot.  reboot ++15:24
fungino objection here15:24
fungiwe also have a couple more candidates to compare, sounds like15:24
fungiit's also possible these are reachable from other nodes in the same network but not from outside (lost their default route?)15:25
fungiwe can check that with one of the others15:25
clarkbit pings now15:26
clarkband I'm in15:26
clarkbssh root@198.72.124.39 for the others (I'll get sshnaidm|ruck's key shortly)15:26
clarkbthe console logs are still in /tmp so reboot didn't blast that away15:26
sshnaidm|ruckclarkb, did you restart it?15:26
clarkbsshnaidm|ruck: yes15:26
sshnaidm|ruckack15:26
fungirebooted via nova api15:26
clarkbsshnaidm|ruck: your key should be working now15:27
sshnaidm|ruckI'm in15:27
sshnaidm|ruckclarkb, which job is it?15:27
clarkbJul  1 14:16:56 centos-8-inap-mtl01-0017550599 NetworkManager[1023]: <warn>  [1593613016.1172] dhcp4 (ens3): request timed out15:28
clarkbsshnaidm|ruck: tripleo-ci-centos-8-standalone15:28
fungii wonder, could something be putting a firewall rule in place on the public interface which blocks dhcp requests?15:29
clarkbfungi: I was just going to mention that earlier in syslog there is a bunch of iptables updates :)15:29
clarkbI don't know that it is breaking dhcp yet but the order of operations definitely fits15:30
fungithat would explain the random behavior... depends on if it needs to renew a lease during that timeframe15:30
clarkbansible-iptables does a bunch of work then soon after dhcp stops15:30
clarkbfungi: ya15:30
fungii wonder if i can find the dhcp requests getting blocked in the console log15:31
fungichecking15:31
fungifa:16:3e:fa:da:5e is the mac for ens315:31
clarkbsshnaidm|ruck: ^ maybe you can look into that? I'm not sure what tripleo's iptables rulesets are intended to do. Also you could cross check against successful jobs to see if they fail to renew a lease (usually you renew at 1/2 the lease time so you'd possibly still fail to renew but have a working IP until a bit later)15:31
sshnaidm|ruckmwhahaha, weshay_ruck ^15:31
sshnaidm|ruckbrb15:32
*** sshnaidm|ruck is now known as sshnaidm|afk15:32
*** factor has joined #opendev15:32
clarkbthat host has rebooted with no iptables rules ?15:32
clarkbwhich is I guess good for debugging, but surprising since I thought we write out a ruleset15:32
fungiindeed, iptables -L and ip6tables -L are both entirely empty15:33
clarkbmaybe that doesn't work on centos815:34
clarkbsomething to investigate15:34
fungii'm not finding any blocked dhcp requests logged by iptables, but i have a feeling the iptables logging is not comprehensive (probably does not include egress, for example)15:37
clarkbI'm also having a hard time finding the lease details for our current lease15:37
clarkboh neat, I think it's because the ens3 config is static now15:38
clarkbthe network type in config drive is ipv4 not ipv4_dhcp or whatever the value is15:40
clarkbso I think static config is actually what we expect15:40
mwhahahaso the mtu is bad i believe15:40
clarkbis something kicking it over to dhcp which is then failing because there is no dhcp server?15:40
mwhahahait shouldn't be 150015:40
mwhahaha?15:40
clarkbmwhahaha: why not?15:40
mwhahahaon most clouds it's not 150015:40
mwhahahabecause of tenant networking15:40
clarkbmost of ours are15:40
mwhahahayou sure?15:40
clarkbI think openedge is our only cloud with a smaller mtu15:40
clarkbmwhahaha: ya they do jumbo frames to carry tenant networks allowing tenant traffic to have a proper 1500 mtu15:41
mwhahahak15:41
fungiJul  1 14:15:25 centos-8-inap-mtl01-0017550599 NetworkManager[1023]: <info>  [1593612925.5251] dhcp4 (ens3): activation: beginning transaction (timeout in 45 seconds)15:41
fungithat's almost an hour after the node booted15:41
fungicould the job be reconfiguring ens3 to dhcp?15:41
clarkbfungi: cool I think that's the smoking gun. This should be statically configured based on the config drive stuff15:41
mwhahahacentos did have a bug with an ens3 existing always btw15:41
* mwhahaha digs up bug15:41
mwhahahathe cloud image that is15:42
clarkbit switches to dhcp, fails to dhcp because neutron isn't doing dhcp there and then network manager helpfully unconfigures our interface15:42
clarkbwe reboot and go back to our existing glean config and are statically configured again15:42
fungithat may also explain why the dhcp discovery datagrams we were seeing logged by the node's iptables drop rules were other nodes experiencing the same issue15:42
mwhahahahttps://bugs.launchpad.net/tripleo/+bug/186620215:42
openstackLaunchpad bug 1866202 in tripleo "OVB on centos8 fails because of networking failures" [Critical,Fix released] - Assigned to wes hayutin (weshayutin)15:42
clarkbfungi: oh ya they're all asking for dhcp and no one can respond15:42
clarkband this would happen in any other cloud that is statically configured too15:42
mwhahahaso if /etc/sysconfig/network-scripts/ifcfg-en3 exists, when we restart legacy networking it'll nuke the address15:42
fungibecause that cloud doesn't have a dhcpd to answer15:42
clarkbrax is at least15:42
clarkbmwhahaha: that will do it15:43
clarkb/etc/sysconfig/network-scripts/ifcfg-en3 is how glean configures the interface15:43
clarkbens3 but ya15:43
mwhahahathough on that node it's configured static15:43
mwhahahaso seems weird15:43
clarkbmwhahaha: it is statically configured via the network script and NM's sysconfig compat layet15:43
mwhahahaovs and networkmanager don't place nicely so i wonder if there's an issue around that15:44
clarkband this could explain why your third party ci doesn't see it. If that cloud is configured to use dhcp then it will work fine15:44
fungialso explains why we saw disproportionately more of these in some providers... those are likely the ones with no dhcpd15:45
clarkbfungi: ya I think rax and inap are the non dhcp cases for us15:45
clarkbeveryone else does dhcp I think15:45
mwhahahahrm, no, /etc/sysconfig/network-scripts is empty in the image i pulled from nb0215:45
mwhahahaso we're not hitting that bug where the leftover thing is there15:45
clarkbmwhahaha: it's written by glean on boot based on config drive information15:45
mwhahahayea but we shouldn't be changing it to dhcp (and it's not dhcp on that node)15:46
clarkbmwhahaha: well something is telling NM to dhcp15:46
clarkbper the syslog fungi pasted above15:46
mwhahahaclearly15:46
clarkband it's happening about an hour after the node boots so it isn't glean15:46
* mwhahaha goes digging15:46
clarkb(at least it would be super weird for boot units to fire that late)15:47
clarkbmwhahaha: we can add your ssh key to this node if it helps to dig with logs15:47
clarkbI just need a copy of the pubkey15:47
mordredwow. I pop in to see how things are going and I see a conversation about NM randomly dhcping an hour after boot15:48
mwhahahai think sshnaidm|afk added me if you're looking at centos-8-inap-mtl01-001755059915:48
clarkbmwhahaha: yup thats the one15:48
clarkbmwhahaha: not sure if you had all the background on it. We caught the held node. Observed its link layer was working based on iptables drop logging in the console log but it did not ping or ssh15:49
fungii'll correlate the timestamp to the job15:49
clarkbmwhahaha: after double checking the cloud hadn't error'd with oepsntack apis we rebooted the instance and it came back15:49
clarkbcurrently it appears to be using the glean static config as expected which is why it is working15:49
clarkbsyslog shows at some point NM switched to dhcp which failed and unconfigured the interface15:49
clarkbwe've also got ~2 more instances in the unpingable state that we can probably reboot if necessary but for now keeping them in that state might be good to have in our back pocket15:51
clarkbnote the container image build jobs seem to hit this too. I wonder if podman/buildah are doing NM config?15:51
fungilooks like the "run toci_gate_test.sh" task starts at 13:57:59 and then dhcp4 is activated on ens3 by NetworkManager at 14:15:25 we log the unreachable state for the node at 14:21:0715:52
clarkband with that I'm going to find breakfast and maybe do a bike ride since we're at a point where this is debuggable15:52
clarkbfungi: mwhahaha sshnaidm|afk /tmp/console-* will have console logs generated by the host too15:52
clarkbthose may offer a more exact correlation to whatever did the thing15:52
mwhahahaso we run os-net-config right before networkmanager starts doing stuff15:53
mwhahahalet me trouble shoot this15:53
fungithere is a bit of time between dhcp starting to try to get a lease and giving up/unconfiguring the interface (more likely picking a v4 linklocal address to replace the prior static one with) and then a bit more delay before ansible decides ssh access is timing out15:55
mwhahahahas glean always used network manager?15:55
fungii believe it had to for centos-8 and newer fedora15:55
mwhahahak15:55
clarkbwe switched centos7 to it too I think15:55
mwhahahawe aren't touching ens3 in our os-net-config configuration15:55
clarkbbut would need to double check that15:55
fungiianw knows a lot more of the history there, he had to struggle mightily to find something consistent for everything15:56
mwhahahawe're adding a bridge br-ex to br-ctlplane which we already configured15:56
mwhahahaso let me see if i can figure out why networkmanager tries to dhcp ens315:56
clarkbmwhahaha: fungi we could rerun suspect things on that host and reboot to bring it back if it breaks15:56
clarkbthat may help narrow it down quickly15:56
mwhahahaJul  1 14:15:25 centos-8-inap-mtl01-0017550599 NetworkManager[1023]: <info>  [1593612925.4888] device (ens3): state change: activated -> deactivating (reason 'connection-removed', sys-iface-state: 'managed')15:57
mwhahahais likely the cause but i don't know why that's occurring15:57
mwhahahaso we do a systemctl restart network which is the legacy network stuff and it seems to mess with networkmanager15:57
fungineat15:58
clarkbyou could try it on that held node and see if it breaks15:59
mwhahahaso we restart networking, network manager bounces ens3 and it tries to reconnect it using dhcp15:59
clarkbI wonder if that is an 8.2 change15:59
mwhahahaeven though ifcfg-ens3 is configured as static15:59
mwhahahai bet it is but not certain15:59
*** shtepanie has joined #opendev15:59
mwhahahacan you point me to the glean config code16:00
clarkbhttps://opendev.org/opendev/glean/src/branch/master/glean/cmd.py16:00
mwhahahathx16:00
*** sshnaidm|afk is now known as sshnaidm|ruck16:01
clarkbthat is the bulk of it and there should be a systemd unit that calls that for ens3 on the host16:01
clarkbthough it also uses udev so it is a parameterized unit16:01
mwhahahayou folks rebooted this to get the node back right?16:02
clarkbyes16:02
mwhahahayea so that points to networkmanager doing silly stuff16:03
mwhahahai think we can work around it by just disabling networkmanager in a pre task16:04
mwhahahafor now16:04
clarkbwill that unconfigure the static config NM sets?16:05
clarkbworth a try I guess16:05
fungiyeah, easy enough to test16:06
mwhahahait shouldn't16:06
mwhahahastopping the networkmanager service should leave the networking configured16:07
mwhahahait should prevent networkmanager from waking up and touching the interface16:07
mwhahahaweshay_ruck, sshnaidm|ruck: we should be able to reproduce this by launching a vm on a network w/o DHCP and configuring the interface statically via network manager. then 'service restart network'16:11
mwhahahaon a centos8.2 vm16:11
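mwhahaha's proposed reproducer would look something like this on a fresh CentOS 8.2 VM (a sketch; the addresses are placeholders and the static config below is set up with nmcli rather than exactly the way glean does it):

    # statically configure the interface on a network that has no DHCP server
    nmcli con add type ethernet ifname ens3 con-name "System ens3" \
        ipv4.method manual ipv4.addresses 192.0.2.10/24 ipv4.gateway 192.0.2.1
    nmcli con up "System ens3"

    # then restart the legacy network service and watch whether NetworkManager
    # bounces ens3 and falls back to DHCP, dropping the static address
    systemctl restart network
    journalctl -u NetworkManager -f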
*** etp has quit IRC16:16
fungilooking at cacti graphs for gitea-lb01, the background noise from our rogue distributed crawler is still ongoing16:19
*** sshnaidm|ruck is now known as sshnaidm|afk16:20
clarkbfungi: ya I'm trying to get out the door for a bike ride but afterwards I'd like to land and apply our logging updates for haproxy and gitea16:21
clarkband see if the post failures for swift uploads are persistent and dig into that more16:21
fungisounds good. i'm still trying to catch up since half of every day this week has been conference16:24
clarkbfungi: also did you see ianw found the likely tool that is hitting us?16:25
clarkbbased on the UA values16:25
fungiyup16:25
fungii fiddled with machine translating the readme16:25
fungiseems a very likely suspect16:26
*** hashar has joined #opendev16:29
*** ykarel is now known as ykarel|away16:34
openstackgerritMerged openstack/project-config master: Update Neutron Grafana dashboard  https://review.opendev.org/73878416:39
*** dtantsur is now known as dtantsur|afk16:44
*** factor has quit IRC16:51
*** icarusfactor has joined #opendev16:51
*** hashar has quit IRC16:58
*** moppy has quit IRC17:15
*** corvus has quit IRC17:15
*** bolg has quit IRC17:15
*** guillaumec has quit IRC17:15
*** andreykurilin has quit IRC17:15
*** factor has joined #opendev17:16
*** icarusfactor has quit IRC17:16
*** hashar has joined #opendev17:17
*** corvus has joined #opendev17:18
*** guillaumec has joined #opendev17:18
*** moppy has joined #opendev17:18
*** andreykurilin has joined #opendev17:19
*** slittle1 has quit IRC17:26
*** weshay_ruck has quit IRC17:26
*** mtreinish has quit IRC17:26
*** cloudnull has quit IRC17:26
*** hrw has quit IRC17:26
*** AJaeger has quit IRC17:26
*** slittle1 has joined #opendev17:29
*** weshay_ruck has joined #opendev17:29
*** mtreinish has joined #opendev17:29
*** cloudnull has joined #opendev17:29
*** hrw has joined #opendev17:29
*** AJaeger has joined #opendev17:29
openstackgerritJames E. Blair proposed zuul/zuul-jobs master: Test multiarch release builds and use temp registry with buildx  https://review.opendev.org/73731517:30
hrwtarballs.opendev.org have PROJECTNAME-stable-BRANCH.tar.gz tarballs for most projects. What defines which projects get them?17:39
*** bolg has joined #opendev17:41
fungihrw: those are created by the publish-openstack-python-branch-tarball job included in the post pipeline by lots of project-templates and also directly by some projects17:42
fungimostly it'll be projects which use the openstack-python-jobs project-template or one of its dependency-specific variants17:43
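In practice that means a repo opts in with something like this in its zuul configuration (a sketch):

    - project:
        templates:
          # per fungi, this template includes publish-openstack-python-branch-tarball
          # in the post pipeline, which is what produces the stable branch tarballs
          - openstack-python-jobs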
hrwfungi: thanks.17:45
*** hashar is now known as hasharAway17:47
openstackgerritMerged zuul/zuul-jobs master: Test multiarch release builds and use temp registry with buildx  https://review.opendev.org/73731518:19
*** hasharAway is now known as hashar18:25
openstackgerritLance Bragstad proposed openstack/project-config master: Create a new project for ansible-tripleo-ipa-server  https://review.opendev.org/73884218:28
*** chandankumar is now known as raukadah18:39
openstackgerritMerged zuul/zuul-jobs master: ensure-pip debian: update package lists  https://review.opendev.org/73752918:39
clarkbI've approved https://review.opendev.org/#/c/738714/18:41
clarkbfungi: I'm thinking maybe we single core approve https://review.opendev.org/#/c/738710/ ? I think having that extra logging will be important to continue monitoring this ddos18:43
fungiwfm, it's been tested18:44
corvus+318:44
fungii was only hesitating to self-approve in case folks disagreed with the approach there18:44
clarkbcorvus: thanks!18:44
fungisince it's a general haproxy role we might use for other things with different logging styles18:45
fungiso i could have set it per-backend instead of in the default section18:45
clarkbfungi: ya I think we can sort that out if we end up adding different backend types18:45
corvusi bet if we used it for others we'd want consistency18:45
corvus(oh, yeah, if we used different backend types that might be different.)18:45
fungithough the way we're generating the backends from a template now would still need reworking to support other forwarders18:45
corvusanyway, that's a future problem18:46
corvusand i'm in the present18:46
*** factor has quit IRC18:46
*** factor has joined #opendev18:46
clarkbmwhahaha: if you all end up sorting out the underlying issue it would be great to hear details so that we're aware of those issues generally (since others do use those images too)18:48
mwhahahawill do18:48
mwhahahai don't think it affects others because most folks integrate with networkmanager, os-net-config's support is still WIP so we use the legacy network scripts still18:48
clarkbgotcha18:49
mwhahahabut if i can figure out the RCA i'll get a bz and a ML post together18:49
fungithat would be awesome18:49
fungiwe sort of expect the situation with glean+nm to be a little fragile after everything we discovered about the interactions between ipv6 slaac autoconf in the kernel and nm fighting over interfaces18:50
clarkbunfortunately it's the system rhel and friends are committed to so we're trying to play nice18:50
mwhahahayea the best thing to do would be NM_managed=No18:50
mwhahahait looked like it was still yes for ens318:51
mwhahahaanyway time to try and reproduce :D18:51
mwhahahaat least now we have a direction, thanks for your efforts18:51
*** factor has quit IRC19:04
*** factor has joined #opendev19:04
*** factor has quit IRC19:10
*** _mlavalle_1 has quit IRC19:10
clarkbI'm going to clean up some of those holds now. In particular the one that had us confused as it isn't useful19:12
clarkbcorvus: fungi is the proper way to do that via zuul autohold commands or nodepool delete?19:12
clarkbI think I've been doing nodepool delete in the past but realized when corvus asked for a hold id earlier today that maybe I can do that via zuul?19:13
clarkbya I think I may have been doing this wrong19:15
*** _mlavalle_1 has joined #opendev19:15
mwhahahai'm still poking at centos-8-inap-mtl01-0017550599 so if you could keep that around for a few that'd be great19:16
mwhahahai did grab logs/configs but i want to make sure i got everything19:16
fungiclarkb: i had been doing it through nodepool delete, but recently learned that removing the autohold is better since that will still cause nodepool to clean up the nodes19:18
clarkbmwhahaha: yup will leave that one alone19:18
mwhahahathanks19:18
clarkbfungi: cool I was just about to test that19:18
fungiyeah, better because it doesn't leave orphaned autohold entries around19:18
clarkbya I've got a couple to clean up19:19
clarkbnow to double check which autohold id is the one mwhahaha is on19:21
clarkbid 0000000160 is the one mwhahaha is on and I kept 0000000161 too in case we need a second one but cleaned up the others19:22
clarkbthe gitea and haproxy things should be landing soon too. I'll be sure to restart giteas for the new format once that config is applied19:24
mwhahahais there an easy way to manually run glean to configure the network?19:25
mwhahahatrying to reproduce how it would configure it vs how the installer does it19:25
clarkbmwhahaha: you can invoke the script that the unit runs either directly or by triggering the unit. Let me take a quick look to be more specific19:26
clarkbmwhahaha: the unit is glean@.service and the @ means it takes a parameter. In this case the interface name. I think you trigger that with systemctl start glean.ens3 ? But also I notice the unit has a condition where it won't run if the sysconfig files exist19:27
clarkbin this case it may be easiest to run it directly, and for that you want Environment="ARGS=--interface %I" and ExecStart=/usr/local/bin/glean.sh --use-nm --debug $ARGS19:28
clarkb%I is magical systemd interpolation for the argument and in this case it would be ens319:28
mwhahahayea that's what i'm looking for, thanks19:28
mwhahahai'll figure it out from there19:28
clarkband if you want to know what triggers it with the parameter I think it is a udev rule19:29
clarkbya /etc/udev/rules.d/99-glean.rules SUBSYSTEM=="net", ACTION=="add", ATTR{addr_assign_type}=="0", TAG+="systemd", ENV{SYSTEMD_WANTS}+="glean@$name.service"19:29
clarkbglean@ens3 is the name to use with systemctl19:29
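Putting those pieces together, a short sketch of the two ways to re-run glean for ens3 by hand (taken from the unit and udev rule quoted above; note the unit's condition on existing sysconfig files may mean moving the current ifcfg aside first):

# via systemd, using the templated unit with the interface as the instance name
sudo systemctl start glean@ens3.service

# or directly, mirroring the unit's ExecStart line
sudo /usr/local/bin/glean.sh --use-nm --debug --interface ens3

# and the udev rule that normally fires it when an interface is added
cat /etc/udev/rules.d/99-glean.rules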
openstackgerritMerged opendev/system-config master: Update gitea access log format  https://review.opendev.org/73871419:33
*** yoctozepto7 has joined #opendev19:37
openstackgerritMerged opendev/system-config master: Remove the tcplog option from haproxy configs  https://review.opendev.org/73871019:40
corvusclarkb, fungi: yes, delete via zuul autohold delete19:42
corvusone command instead of 219:42
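For the record, a sketch of the two cleanup paths being compared; the ids below are hypothetical placeholders and exact flags (for example a --tenant argument) may differ by deployment:

# preferred: delete the autohold request and let nodepool reclaim the node
zuul autohold-list
zuul autohold-delete 0000000999   # hypothetical hold id from the list above

# older habit: delete the held node directly, which leaves an orphaned autohold entry
nodepool list | grep hold
nodepool delete 0099999999        # hypothetical node id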
*** yoctozepto has quit IRC19:45
*** yoctozepto7 is now known as yoctozepto19:45
openstackgerritSean McGinnis proposed openstack/project-config master: update-constraints: Install pip for all versions  https://review.opendev.org/73892619:53
clarkbI've restarted gitea on gitea01 to pick up the access log format change19:55
clarkbit is happy so I'm working through the others then will double check the lb is similarly updated19:55
weshay_ruckclarkb, fungi thanks guys!19:59
*** sorin-mihai__ has quit IRC20:01
mwhahahaso it looks like the ens3 config gets lost20:01
mwhahahaglean writes out the file and it's named 'System ens3': NetworkManager[1023]: <info>  [1593611369.2978] device (ens3): Activation: starting connection 'System ens3' (21d47e65-8523-1a06-af22-6f121086f085)20:02
clarkbinfra-root we now have ip:port recorded in haproxy and gitea logs so we can map between them now20:02
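As a hedged illustration of what that buys us (the log path and field layout below are assumptions, not the actual formats): take the source port recorded next to a suspect request in gitea's access.log and look it up in haproxy's log to recover the real client ip:port.

# hypothetical example only
PORT=54321   # source port seen beside the request in gitea's access.log
sudo grep ":${PORT}\b" /var/log/haproxy.log | tail -n 5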
mwhahahabut when we restart networking, it doesn't have this, so it creates a 'Wired connection 1'20:02
clarkbmwhahaha: so when I rebooted it, it didn't just reread the existing config; instead glean reran and rewrote the config? interesting20:04
mwhahahamaybe?20:04
mwhahahabecause now it's back to being 'System ens3'20:04
clarkbianw should be waking up soon and may have NM thoughts20:05
fungioh, i'm sure he has nm thoughts, but most are probably angry ones20:07
mwhahahaha20:08
mwhahahalet me see if i can see where we might be removing this file all of a sudden20:08
mwhahahaseems weird still20:08
mwhahahayea it's like ifcfg-ens3 goes missing. the behaviour is that of a node where ifcfg-ens3 is removed: when you restart networkmanager it creates the 'Wired connection 1'20:11
mwhahahawe shouldn't be touching that20:11
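For anyone reproducing this, the quick checks for that symptom are plain nmcli/journalctl invocations, nothing opendev-specific:

ls -l /etc/sysconfig/network-scripts/ifcfg-ens3      # is the glean-written config still there?
nmcli -f NAME,UUID,DEVICE connection show            # 'System ens3' vs an auto-created 'Wired connection 1'
sudo journalctl -u NetworkManager --no-pager | grep ens3   # activation history for the interface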
openstackgerritSean McGinnis proposed openstack/project-config master: Use python3 for update_constraints  https://review.opendev.org/73893120:13
mwhahahasweet, it's os-net-config20:14
mwhahahacould you do me a favor and restart that node20:16
* mwhahaha nukes the network20:16
clarkbmwhahaha: I can20:16
mwhahahathank you20:16
clarkbmwhahaha: ready for that now?20:16
mwhahahayes plz20:16
clarkbreboot issued, will probably be a minute before ssh is accessible again20:17
*** sgw1 has quit IRC20:35
mwhahahayea i don't think it's coming back, oh well20:38
clarkbmwhahaha: we have another held node, and now that we know what is going on we can boot out-of-band centos8 images in inap too20:53
clarkbassuming we need them let me know20:53
mwhahahano i think we have some direction. you should be able to release them for now20:53
mwhahahawe're going to disable networkmanager as a starting point while we investigate what's happening to that config20:54
clarkbk I'll clean those up in a bit20:56
clarkbtomorrow I'll plan to land https://review.opendev.org/#/c/737885/ as that should make the gitea api interactions a bit more correct21:02
clarkbbut I'll want to watch it afterwards, and it's already "late" in my day relative to my early start21:02
clarkbthe follow-on to ^ needs work though21:02
clarkbalso I've just noticed that ianw implemented the apache proxy for gitea with UA filtering21:03
clarkbso reviewing that now too21:03
clarkbthose changes actually look good. fungi, corvus: https://review.opendev.org/#/c/738721/4 is done in a way that doesn't cut haproxy over to it yet. So we could land those changes, deploy the apache proxy, test it, then switch haproxy. Should be very safe21:06
clarkbianw: ^ thanks for doing that, I like that the approach allows a measured transition in prod too21:07
openstackgerritDmitriy Rabotyagov (noonedeadpunk) proposed opendev/system-config master: Add copr-lxc3 to list of mirrors  https://review.opendev.org/73894221:15
corvusclarkb, ianw: +2, but i hope we don't have to use it21:20
*** priteau has quit IRC21:21
*** factor has joined #opendev21:29
*** factor has quit IRC21:33
*** factor has joined #opendev21:33
*** hashar has quit IRC21:49
*** factor has quit IRC21:49
*** factor has joined #opendev21:50
fungiwell, the traffic we saw yesterday seems to be continuing, so i expect it may come down to either that or leaving every customer of the largest isp in china blocked with iptables21:54
fungino idea when (or if) it will ever subside21:55
ianwhey, around now21:59
clarkbianw: no rush, was just looking at your change to filter UAs with apache in front of gitea. I think we can roll that out if we decide it is necessary22:01
ianwi figure with the proxy we could just watch the logs for 301's until they disappear, and then turn it off22:01
clarkbya22:01
clarkbthough aren't you doing 403?22:01
ianwsorry 403 yeah22:01
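For context, the general shape of a UA reject in apache is just mod_rewrite answering 403. A minimal sketch under the assumption that the real change does something similar; the conf file name and the UA pattern here are placeholders, not the rules in 738721:

# requires mod_rewrite (a2enmod rewrite) and a Debian/Ubuntu apache2 layout
sudo tee /etc/apache2/conf-available/gitea-ua-filter.conf <<'EOF'
RewriteEngine On
# answer 403 to requests whose User-Agent matches a known crawler signature
RewriteCond %{HTTP_USER_AGENT} "ExampleBadCrawlerUA" [NC]
RewriteRule .* - [F]
EOF
sudo a2enconf gitea-ua-filter
sudo systemctl reload apache2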
ianwif anyone is feeling containerish and wants to look over https://review.opendev.org/#/q/topic:grafana-container+status:open that would be great too22:03
ianwreally the only non-standard thing in there is in graphite where i've used ipv4 and ipv6 socat on 8125 to send into the graphite container22:04
clarkbianw: also we managed to catch a centos8 node that lost networking during a tripleo job. Rebooting it brought it back. Turns out that something to do with restarting networking in os-net-config or similar caused our static config for the interface to be wiped out and NM switched to dhcp22:05
clarkbtripleo is going to try to work around it by disabling NM but is also looking into the possibility that centos8.2 changed that and broke things22:05
clarkbianw: small thing on https://review.opendev.org/#/c/737406/1822:11
clarkband question on https://review.opendev.org/#/c/738125/722:15
*** DSpider has quit IRC22:18
ianwclarkb: yeah, the host network thing is the ipv6 thing, which is addressed in https://review.opendev.org/#/c/738125/7/playbooks/roles/graphite/tasks/main.yaml @ 6322:20
ianwbasically, i found that when setting up port 8125 to go into the container, docker would bind to ipv622:20
ianwbut the container has no ipv6 handling at all22:21
clarkbianw: what about host networking?22:21
clarkbis the issue that with host networking the graphite services only bind on 0.0.0.0 ?22:21
clarkbmaybe it would be better to use host networking with the proxy (if we can configure graphite services to bind on other ports and then proxy to them); that way we can be consistent?22:22
ianwhrm, yeah what i tried first was having host with 8125 and hoping i could run a socat proxy for ipv6 8125 but found docker took it over22:24
ianwyeah, i guess if it's different ports it doesn't matter, let me try22:25
clarkbya in my suggestion we'd put statsd on 8126 and then the proxy listens on 8125 and forwards22:25
clarkbat least that way it's weird, but consistently weird with our normal setup :)22:25
ianwsystemd actually has a nice socket activated forwarding service22:26
ianw... that doesn't support udp22:26
clarkbha22:26
fungi"nice"22:29
clarkbI guess they do that because it just wants to stop at the syn / syn-ack / ack22:31
fungiit's synful22:31
openstackgerritIan Wienand proposed opendev/system-config master: Graphite container deployment  https://review.opendev.org/73812522:31
ianwclarkb: ^ ok, so that has socat listening on ipv4 and ipv6 8125, which forwards to 8825 (8126 is taken by statsd admin), which docker should map to 8125 in the container ... phew!22:32
clarkbhrm I'm not sure docker will do the port mapping with host mode22:33
clarkb(it might)22:33
clarkbwe may need to configure the services to use a different port if that doesn't work22:33
ianwOOOHHHHH yeah THAT's right!22:33
ianwthat's why i did it22:33
*** icarusfactor has joined #opendev22:33
ianwdocker silently says "oh btw i'm not doing the port mapping" but continues on22:34
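To make the moving parts concrete, a hedged sketch of the socat forwarding being described, assuming the container-side statsd ends up reachable on port 8825 on the host (ports taken from the discussion; the real deployment would wrap these in systemd units or the compose file):

# IPv4 listener on the public statsd port, relaying each packet to the container-side port
socat -u UDP4-RECVFROM:8125,fork UDP4-SENDTO:127.0.0.1:8825 &

# separate IPv6 listener on the same port, relaying to the same place
socat -u UDP6-RECVFROM:8125,fork UDP4-SENDTO:127.0.0.1:8825 &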
*** factor has quit IRC22:36
openstackgerritJames E. Blair proposed zuul/zuul-jobs master: Handle multi-arch docker manifests in promote  https://review.opendev.org/73894522:39
ianwi wonder if with host networking ipv6 gets into the container22:41
fungii would expect so22:43
fungiwe have other containerized services listening on v6 sockets22:43
fungiour gitea haproxy, for example22:44
ianwthe problem is i'll have to re-write the upstream container to have ipv6 bindings22:45
ianwmaybe it's worth the pain22:45
clarkbthey don't make it configurable? that's too bad if so22:45
ianwthey did take one of my patches fairly quickly22:45
*** tkajinam has joined #opendev22:46
corvuswhy are there port bindings with host network mode?22:50
clarkbcorvus: they were there from an earlier ps which did not use host networking22:51
corvusok, so next ps will remove those?22:51
corvus(cause ps8 has both)22:51
ianwcorvus: this is the question ... i'm trying to avoid having to hack the pre-built container for ipv622:52
clarkbanother option could be to drop our AAAA record from dns for that service22:52
ianwyeah, i feel like that's the worst option, it's a regression on what we have22:53
openstackgerritMerged zuul/zuul-jobs master: Handle multi-arch docker manifests in promote  https://review.opendev.org/73894522:53
corvusi'd love to stick with host networking like all the others, so if we could change the bind that would be great22:58
ianwthere are two things to deal with in the container: gunicorn presenting graphite, and statsd22:59
clarkbit looks like we can mount in configs for both23:01
clarkbwe'd then need to write the configs but I think that will allow us to change the bind?23:01
ianwstatsd maybe, that launches from a config file23:01
clarkbas an alternative we could change the bind upstream since ipv6 listening at :: will also work for ipv423:01
ianwit looks like gunicorn is started from the script with "-b" arguments23:02
ianwurgh, the other thing is it has no support for ssl23:05
ianwit runs nginx23:08
*** _mlavalle_1 has quit IRC23:11
corvuswonder why they didn't just use uwsgi23:16
*** tosky has quit IRC23:16
ianwif we map in the keys, an nginx config, and a statsd config it might just work23:20
ianw... assuming upstream never changes anything23:20
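A rough sketch of where that lands, assuming the upstream graphiteapp/graphite-statsd image with host networking; every mount source and target below is an assumption about the image layout, so verify against the image before relying on it:

docker run -d --name graphite --network host \
  -v /etc/graphite-docker/nginx.conf:/etc/nginx/nginx.conf:ro \
  -v /etc/graphite-docker/statsd-udp.js:/opt/statsd/config/udp.js:ro \
  -v /etc/graphite-docker/ssl:/etc/nginx/ssl:ro \
  graphiteapp/graphite-statsd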

Generated by irclog2html.py 2.17.2 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!