Thursday, 2018-11-29

00:00 <ianw> let me bring it up...
00:00 <mnaser> corvus: oops, that's probably a horizon bug that should be fixed
00:05 <ianw> clarkb: root@38.145.34.8 ... that is a f29 vm from the .qcow2 of a failed job that i copied off a held node.  it never got network during the job; i moved it and it booted just fine :/
00:06 <ianw> then i tunneled port 6080 into the held node, got into devstack's horizon, broke into the bootloader and reset the root password, rebooted it and ... it was alive.  so i couldn't even replicate it in the same place it failed
00:07 <clarkb> ianw: did you delete the interface file before rebooting in that case?
00:07 <clarkb> chances are glean got it configured enough that on next startup it worked?
00:08 <clarkb> looking at this f29 instance, the glean unit file is a bit different than the one in gerrit
00:08 <clarkb> it doesn't have the Before=NetworkManager.service thing
00:08 <ianw> yeah, this was just before i switched it to the local-fs.target -- which made it work in CI
00:10 <ianw> unfortunately, the logs aren't recorded: i was watching it, and when it failed i uploaded a new change so zuul canceled it
00:13 <openstackgerrit> Ian Wienand proposed openstack-infra/system-config master: bridge.o.o : install ansible 2.7.2  https://review.openstack.org/617218
00:13 <clarkb> ianw: maybe this is something. systemctl list-units shows glean@lo.service but not glean@ens3.service on that f29 instance
00:13 <clarkb> ianw: is it possible glean isn't managing this interface at all?
00:14 <ianw> hrm, it's in nmcli ...
00:14 <clarkb> it certainly seems like there is an ifcfg-ens3 file that looks like it came from glean though
00:15 <ianw> oh, hrm, i've rebooted this; it's the second boot so it would have skipped it
00:15 <clarkb> oh ya
00:15 <ianw> can delete the file and reboot
00:15 <clarkb> let's do that, I think it will help to better understand what it is doing on boot. Do you also want to update the unit to match the current version?
00:15 <clarkb> (you can reboot it under me, I don't have anything running on that connection I care about currently)
00:17 <ianw> ok, let me try
00:19 <clarkb> should glean configure an lo ifcfg too, for consistency? (I don't actually know; maybe linux just does that for us?)
00:20 <ianw> i think we skip it on purpose, but not sure
00:20 <ianw> rebooting.  i've left the defaultdependencies=no in, for now
00:20 <ianw> it's going to be a while before the gate reports if it's happier without that
00:20 <clarkb> I think we skip it in glean, ya, but that means we end up with the unit running in systemd (not a huge deal, but we don't run the unit for everything else after first boot)
00:21 <ianw> yeah, the ens3 service is there now
00:24 <ianw> hrm, the dot file from systemd-analyze creates a 21mb png
00:25 <clarkb> it made a 6.3kb svg for me doing: systemd-analyze dot glean@ens3.service | dot -Tsvg -o systemd.svg
00:25 <clarkb> did you run it without specifying a unit? that must be for all the things?
00:25 <openstackgerrit> James E. Blair proposed openstack-infra/system-config master: Add kube config to nodepool servers  https://review.openstack.org/620755
00:25 <clarkb> in any case, it seems to confirm that NetworkManager happens after glean@ens3
00:26 <clarkb> and journalctl -u glean@ens3 -u NetworkManager seems to confirm as well
00:28 <ianw> yeah, https://imgur.com/a/2VBZ11G is a more restricted set
00:28 <clarkb> http://paste.openstack.org/show/736341/ logging with very precise timestamps
00:30 <clarkb> -rw-r--r--. 1 root root 134 2018-11-29 00:20:04.439000000 +0000 ifcfg-ens3
00:30 <clarkb> that lines up with when selinux was restored. Further evidence it isn't likely a sync/flush
00:31 <openstackgerrit> James E. Blair proposed openstack-infra/project-config master: Nodepool: add kubernetes provider  https://review.openstack.org/620756
00:32 <ianw> i guess the big difference is that during the job it's a binary-translated nested vm
00:33 <clarkb> ianw: if it continues to not work, it would be interesting to see whether adding a sleep(5) after the selinux restore makes it work. Like maybe we just need to go slower because qemu
00:34 <corvus> clarkb: can you +3 https://review.openstack.org/620704 and https://review.openstack.org/620646 ?
00:35 <clarkb> corvus: yes
00:36 <ianw> clarkb: yep, good idea.
00:39 <ianw> clarkb: to summarise -- setting "After= & Wants=local-fs.target" empirically works, but is theoretically wrong.  setting "Before=NetworkManager.service network-pre.target" is theoretically right, but empirically does not work in the gate
00:40 <ianw> currently i'm testing before=networkmanager but dropping "defaultdependencies=no" (which we've just always had) to see if that makes a difference
00:41 <ianw> if not, i'll try again with a sleep() and sync() in glean to see if it's some sort of qemu race in the gate between writing the file out and starting networkmanager
00:41 <ianw> if not that, we'll just go back to local-fs.target and call it a day, i guess
00:41 <clarkb> ianw: note that case seems to matter, so it's NetworkManager, not networkmanager
00:42 <ianw> yep, it's using the CamelCase name in the systemd files
00:47 <openstackgerrit> Clark Boylan proposed openstack-infra/system-config master: Nodepool group no longer hosts zookeeper  https://review.openstack.org/620760
00:48 <clarkb> corvus: fyi ^ is a cleanup I noticed when reviewing your change
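[editor's note: the two unit orderings being compared in the conversation above can be sketched as systemd unit-file fragments. These are illustrative only, not the actual glean@.service shipped by glean:]

```ini
# Variant 1: "After= & Wants=local-fs.target" -- empirically works in
# the gate, but theoretically wrong: glean only needs to finish before
# NetworkManager starts, it has no real dependency on local filesystems.
[Unit]
After=local-fs.target
Wants=local-fs.target

# Variant 2: "Before=NetworkManager.service network-pre.target" --
# theoretically right (write the ifcfg files before any network
# management begins), but did not work in the gate during this session.
[Unit]
Before=NetworkManager.service network-pre.target
DefaultDependencies=no
```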
01:05 <clarkb> ianw: alright, I've still got nothing. I am going to go rake leaves and take a better look at my retaining wall that fell over. Here's to hoping there is understanding when I'm back tomorrow :)
01:07 <openstackgerrit> Tristan Cacqueray proposed openstack-infra/zuul master: executor: add support for generic build resource  https://review.openstack.org/570668
01:30 <ianw> clarkb: yeah, me either :)  i'm on a half-day today so will check it all out later
01:48 <openstackgerrit> Merged openstack-infra/project-config master: Add opendev-website project to Zuul  https://review.openstack.org/620704
03:53 <openstackgerrit> Merged openstack/diskimage-builder master: Fix unit tests for elements  https://review.openstack.org/619387
04:02 <openstackgerrit> Merged openstack-infra/zuul master: Remove uneeded if statement  https://review.openstack.org/617984
04:29 <openstackgerrit> Clint 'SpamapS' Byrum proposed openstack-infra/nodepool master: Amazon EC2 driver  https://review.openstack.org/535558
04:54 <openstackgerrit> Kendall Nelson proposed openstack-infra/infra-specs master: StoryBoard Story Attachments  https://review.openstack.org/607377
05:00 <openstackgerrit> Merged openstack-infra/project-config master: Add opendev-website jobs  https://review.openstack.org/620646
06:19 <openstackgerrit> Ian Wienand proposed openstack-infra/glean master: Add NetworkManager distro plugin support  https://review.openstack.org/618964
06:19 <openstackgerrit> Ian Wienand proposed openstack-infra/glean master: A systemd skip for Debuntu systems  https://review.openstack.org/620420
07:53 <openstackgerrit> Kendall Nelson proposed openstack-infra/infra-specs master: StoryBoard Story Attachments  https://review.openstack.org/607377
08:53 <openstackgerrit> Merged openstack-infra/irc-meetings master: Update Meeting Chairs  https://review.openstack.org/620685
08:57 <openstackgerrit> Merged openstack-infra/irc-meetings master: rpm-packaging: Adjust meeting time  https://review.openstack.org/620616
09:25 <openstackgerrit> Brendan proposed openstack-infra/zuul master: Fix "reverse" Depends-On detection with new Gerrit URL schema  https://review.openstack.org/620838
09:50 <openstackgerrit> Merged openstack-infra/nodepool master: Update node request during locking  https://review.openstack.org/618807
09:59 <openstackgerrit> Merged openstack-infra/nodepool master: Add second level cache of nodes  https://review.openstack.org/619025
09:59 <openstackgerrit> Merged openstack-infra/nodepool master: Add second level cache to node requests  https://review.openstack.org/619069
09:59 <openstackgerrit> Merged openstack-infra/nodepool master: Only setup zNode caches in launcher  https://review.openstack.org/619440
10:23 <ianw> clarkb: http://logs.openstack.org/71/618671/14/experimental/nodepool-functional-py35-redhat-src/0067296/controller/logs/screen-nodepool-launcher.txt.gz#_Nov_29_08_02_15_155672
10:23 <ianw> http://paste.openstack.org/show/736383/
10:23 <ianw> even with two sync()s and a pause, something still goes wrong
10:23 <openstackgerrit> Benoît Bayszczak proposed openstack-infra/zuul master: Disable Nodepool nodes lock for SKIPPED jobs  https://review.openstack.org/613261
10:31 <openstackgerrit> Ian Wienand proposed openstack-infra/glean master: Add NetworkManager distro plugin support  https://review.openstack.org/618964
10:31 <openstackgerrit> Ian Wienand proposed openstack-infra/glean master: A systemd skip for Debuntu systems  https://review.openstack.org/620420
11:42 <openstackgerrit> Merged openstack/diskimage-builder master: Fix a typo in the help message of disk-image-create  https://review.openstack.org/619679
11:57 <openstackgerrit> Tobias Henkel proposed openstack-infra/nodepool master: Asynchronously update node statistics  https://review.openstack.org/619589
12:23 <chandan_kumar> odyssey4me: Hello
12:24 <chandan_kumar> odyssey4me: https://review.openstack.org/#/c/620800/ and https://review.openstack.org/#/c/619986/4 both do different tasks
12:24 <chandan_kumar> odyssey4me: I am not getting how they are similar
12:24 <chandan_kumar> odyssey4me: need help here
12:24 <chandan_kumar> sorry, wrong channel
12:53 <openstackgerrit> Erno Kuvaja proposed openstack-infra/project-config master: Add Review Priority column to glance repos  https://review.openstack.org/620904
13:32 <Dobroslaw> Hello again zuul masters
13:32 <Dobroslaw> what I want: in the `release` step, create a docker image with a tag containing the release version
13:32 <Dobroslaw> question: does zuul create some env variable on the machine when creating a new release, so that I could catch it with a bash script?
13:32 <Dobroslaw> or is there any other way of getting this value?
13:32 <Dobroslaw> I can't find anything useful in the docs or zuul code
13:33 <pabelanger> Dobroslaw: you can look for zuul.tag in the inventory
13:34 <pabelanger> then check zuul.pipeline
13:34 <pabelanger> to know you are in the release pipeline
13:35 <Dobroslaw> pabelanger: something like this?: https://github.com/openstack/kolla/blob/master/tests/templates/kolla-build.conf.j2#L5
13:36 <pabelanger> yup
13:36 <pabelanger> https://zuul-ci.org/docs/zuul/user/jobs.html#tag-items
13:36 <pabelanger> for more info
13:37 <Dobroslaw> pabelanger: great, checking, thank you
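[editor's note: the suggestion above amounts to something like this Ansible task sketch in a release playbook. The image name and build command are made up; zuul.tag and zuul.pipeline are real Zuul job variables, with zuul.tag only defined for tag-triggered items:]

```yaml
# Hypothetical release-pipeline task: tag the docker image with the
# release version taken from Zuul's inventory.
- name: Build release image tagged with the release version
  command: docker build -t myimage:{{ zuul.tag }} .
  when:
    - zuul.pipeline == 'release'
    - zuul.tag is defined
```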
14:11 <efried> Hey folks, I think I brought this up a few weeks ago, but then lost track of it.
14:11 <efried> http://ci-watch.tintri.com/ <== seems to be down. Did we figure out who had been maintaining it, whether it was going to be fixed or replaced with something equivalent, etc?
14:11 <efried> The nova team (at least) used to get a lot of use out of it.
14:13 <mordred> efried: I don't know that we know anything about it. what did it do?
14:14 <efried> mordred: It was like a summary table of all the CIs, including 3rd party. You could filter (by things like project, date range, etc) with query params. It had a nice big green checkmark or red X to indicate whether a particular run passed or failed, with links to the run.
14:15 <efried> made it easy to tell at a glance whether a particular CI was really dead (lots of red in a row), etc.
14:16 <fungi> it's referenced in this infra spec: https://specs.openstack.org/openstack-infra/infra-specs/specs/deploy-ci-dashboard.html#proposed-change
14:16 <mordred> gotcha
14:17 <mordred> ah - so we have the source code for it, at least
14:17 <fungi> looks like krtaylor and mmedvede were involved in the drafting of that spec, so maybe they know who was running the poc
14:18 <mordred> with the update to config management stuff, it might be an easier task for someone to pick up now
14:19 <fungi> in unrelated news, looks like we're getting a lot of pip install errors in vexxhost-sjc1, so i'm checking out the proxy host now
14:19 <ttx> infra-core: we are holding off on releases until the pypi access issue affecting some regions is solved (http://status.openstack.org/elastic-recheck/#1449136) -- if you notice things are working correctly again, please let us know!
14:20 <fungi> ttx: the issue in rax-dfw seemed to clear up late yesterday, but now we have a problem in another provider i'm just starting to check into
14:20 <fungi> mirror.sjc1.vexxhost.o.o seems to be completely unreachable for me
14:20 <ttx> hmm, yeah, that graph could be explained by two different issues
14:21 <fungi> right, if you click on the node_provider field in one of the relevant logstash queries you'll see it's a different issue
14:24 <fungi> nova claims the instance is active
14:24 <mordred> fungi: anything exciting in the nova console log?
14:24 <fungi> console log show is empty
14:25 <fungi> nova reboot?
14:25 <fungi> er, server reboot
14:25 <fungi> unless we want mnaser to see if there's a network-related explanation for why it's unreachable
14:26 <fungi> in which case we can turn down that region
14:27 <fungi> but we're already running full-out with some ~3 hours to get node assignments in check at this point, so further reduction in capacity is probably not going to help that situation
14:28 <fungi> judgement call... i'm going to reboot it via the api and see if i can get any sort of post-mortem from whatever system logs it managed to write (if any)
14:28 <mmedvede> fungi: the person who was running the ci-watch.tintri.com poc no longer works there. He left a contact email which is not responding so far. I did deploy the same service on http://ciwatch.mmedvede.net
14:29 <fungi> #status log rebooted mirror01.sjc1.vexxhost.openstack.org via api as it seems to have been unreachable since ~02:30z
14:29 <openstackstatus> fungi: finished logging
14:30 <fungi> mmedvede: thanks! efried: see mmedvede's comment above
14:31 <fungi> A start job is running for Raise ne...k interfaces (2min 5s / 5min 1s)
14:31 <fungi> that's not very promising
14:31 <efried> mmedvede, fungi: Thanks!
14:31 <fungi> if it can't bring up the nic in the next couple of minutes, i'll get the region temporarily disabled
14:36 <openstackgerrit> David Shrewsbury proposed openstack-infra/nodepool master: Add arbitrary node attributes config option  https://review.openstack.org/620691
14:38 <openstackgerrit> Jeremy Stanley proposed openstack-infra/project-config master: Temporarily disable vexxhost-sjc1 in nodepool  https://review.openstack.org/620924
14:38 <fungi> config-core: ^
14:39 <fungi> i'll see about adding nl03 to the emergency disable list and manually applying that diff while we wait for the change to merge
14:39 <mordred> ++
14:39 <pabelanger> +3
14:41 <pabelanger> At one point in time, we quickly discussed the idea of running multiple mirrors, so if one went down we didn't have to disable it in nodepool. logan- actually rehashed the discussion in berlin for another reason
14:41 <fungi> after editing nodepool.yaml, do i need to reload the config with the nodepool rpc cli?
14:42 <pabelanger> fungi: no, it will be live on save
14:42 <fungi> perfect
14:42 <fungi> #status log temporarily added nl03.o.o to the emergency disable list and manually applied https://review.openstack.org/620924 in advance of it merging
14:42 <openstackstatus> fungi: finished logging
15:08 <fungi> we're down to 0 nodes in the vexxhost-sjc1 main pool now
15:24 <mnaser> hi
15:24  * mnaser looks
15:25 <mnaser> is it all vms or just the mirror, fungi?
15:26 <cmurphy> clarkb, mordred: any idea why this failed https://review.openstack.org/602380 http://logs.openstack.org/80/602380/3/gate/infra-puppet-apply-3-ubuntu-trusty/a3a1e0c/job-output.txt.gz#_2018-11-27_21_19_01_265126 and if i recheck, is someone around to babysit in case it makes it through?
15:26 <mnaser> sigh
15:26 <mnaser> ok, i know what's going on
15:28 <mnaser> fungi: it should be back
15:30 <mnaser> config-core: feel free to propose a revert of that patch
15:30 <mnaser> the issue should be fixed now
15:31 <ianychoi> Hello infra team, would some system-config cores kindly review https://review.openstack.org/#/c/620661/ and +A? so many spams (>400 in just the last ~12 hours..) are really hurting my mail box..
15:32 <fungi> mnaser: okay, thanks for finding the issue!
15:32 <fungi> i confirm i can ping it now
15:33 <fungi> and ssh into it
15:35 <fungi> i've abandoned the change since it hadn't merged yet, and will unroll the emergency disablement
15:38 <mnaser> fungi: thanks
15:39 <evrardjp> a non-important patch in my opened things just needs a simple review from someone here: https://review.openstack.org/#/c/619216/
15:44 <openstackgerrit> Merged openstack-infra/system-config master: Add kube config to nodepool servers  https://review.openstack.org/620755
15:44 <openstackgerrit> Merged openstack-infra/system-config master: Nodepool group no longer hosts zookeeper  https://review.openstack.org/620760
15:50 <openstackgerrit> Tobias Henkel proposed openstack-infra/nodepool master: Asynchronously update node statistics  https://review.openstack.org/619589
16:00 <fungi> we're back up running with the original max-servers count in vexxhost-sjc1
16:07 <openstackgerrit> Sorin Sbarnea proposed openstack-infra/elastic-recheck master: Identify DLRN builds fail intermittently (network errors)  https://review.openstack.org/620950
16:13 <openstackgerrit> Merged openstack-infra/system-config master: Blackhole messages to openstack-ko-owner@l.o.o  https://review.openstack.org/620661
16:19 <frickler> mnaser: can you share what the issue was? (just being curious from an ops perspective)
16:23 <mnaser> frickler: openstack-ansible bug; seems like it didn't cleanly restart neutron-openvswitch-agent on reboot
16:26 <frickler> mnaser: ah, ok, that certainly causes network issues ;)  luckily it should be pretty easy to spot. thx
16:31 <ssbarnea|rover> anyone looking at zuul? it seems unresponsive
16:31 <ssbarnea|rover> http://zuul.openstack.org/status .... not loading.
16:33 <mnaser> loads for me, ssbarnea|rover
16:33 <mnaser> there is a lot of things in the queue, so maybe it takes a little while before it comes up
16:34 <ssbarnea|rover> mnaser: yep, it did load for me like 1-2 minutes later....
16:34 <pabelanger> yes, the status json file is pretty large, but it does load
16:34 <pabelanger> wow, the tripleo queue is 45hrs
16:35 <pabelanger> wonder what is going on there
16:36 <ssbarnea|rover> pabelanger: i do have the impression that this was caused by pip: http://status.openstack.org/elastic-recheck/ for which I raised a CR yesterday which was not approved due to risks.
16:37 <ssbarnea|rover> pabelanger: https://review.openstack.org/#/c/620630/ if you remember
16:37 <openstackgerrit> James E. Blair proposed openstack-infra/nodepool master: Support relative priority of node requests  https://review.openstack.org/620954
16:38 <pabelanger> ssbarnea|rover: yah, looks like vexxhost had a large impact, but that seems to just be from this morning. Do you know if rax is still having an issue?
16:38 <ssbarnea|rover> I didn't see the failure rate going down since yesterday, so I suspect the problem is still valid.
16:39 <pabelanger> ssbarnea|rover: http://status.openstack.org/elastic-recheck/data/others.html#tripleo-ci-centos-7-scenario000-multinode-oooq-container-updates
16:39 <fungi> pabelanger: rax networking issues cleared up around 1800z yesterday
16:39 <pabelanger> looks to be unclassified failures; if we add them to elastic-recheck we'll get a better idea of what is happening too
16:39 <pabelanger> fungi: ack
16:40 <fungi> ssbarnea|rover: it's not that i pushed back on those changes due to "risks" but that you seem to not understand how extra-index-url works. it won't result in fewer failures (if anything it'll result in more)
16:41 <fungi> it adds indexes, all of which will be queried, and then pip will nondeterministically pick one at random to pull packages from if there are duplicate entries
16:41 <ssbarnea|rover> i know about its weird implementation; i do not question the downvote ;)
16:41 <fungi> it's not a "fallback" mechanism
16:41 <mordred> it would be so nice if it was a fallback mechanism
16:41 <ssbarnea|rover> but based on my local tests it should make it more reliable
16:42 <ssbarnea|rover> if I were not under pressure i would have tried to fix pip
16:42 <fungi> it will likely result in ~half of our packages skipping the mirror and pulling from pypi.org anyway
16:42 <ssbarnea|rover> fungi: and is that a big issue?
16:43 <ssbarnea|rover> the idea is when one of the sources fails, the other ones should still be able to serve it, right?
16:43 <fungi> yes, that's why we have a local cache in each region. it reduces the calls to outside services which, because of the unreliability of the internet, fail at random more often
16:43 <fungi> that's not how pip works
16:44 <fungi> if the source it decides to pick fails, pip will return an error and exit
16:44 <mordred> it never ceases to boggle my mind that this is how it works :)
16:44 <ssbarnea|rover> fungi: not if it times out; if it times out the other source should win.
16:45 <ssbarnea|rover> mordred++ :D
16:45 <fungi> it doesn't pull from both. it picks one of the entries exclusively
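[editor's note: the extra-index-url behaviour described above corresponds to a pip configuration like this sketch; the mirror hostname is illustrative:]

```ini
# pip.conf sketch.  This is NOT a fallback chain: pip fetches every
# configured index, merges the candidate lists, and picks a single
# origin for each package; if that one download fails, the install
# fails.
[global]
index-url = https://mirror.regionone.example.org/pypi/simple
extra-index-url = https://pypi.org/simple
# retries re-request the same chosen origin; they never fall back to
# the other index
retries = 3
```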
16:46 <ssbarnea|rover> fungi: btw, why are we not just using a local http proxy?
16:46 <fungi> first it retrieves all the indices, then it decides which entry from which index will satisfy the stated requirement, and then it tries to retrieve that. and if it gets an error, it fails (and if you have it set to retry, it just retries that same one multiple times)
16:46 <fungi> ssbarnea|rover: we _are_ using a local http proxy
16:46 <mordred> yeah. that's what the mirrors are
*** hamzy has quit IRC16:46
fungissbarnea|rover: furthermore, the errors we saw in rackspace's dfw region yesterday (which prompted you to write that change) weren't because the proxy was broken but because networking from that provider region to the pypi.org cdn was broken. bypassing the proxy wouldn't have made any difference there16:47
pabelangerin the case of vexxhost this morning, the mirror was down for reasons unrelated to pypi. So, other things outside of pip were also affected, eg: apt / rpm.  In the past we talked about the idea of standing up a 2nd mirror to load balance requests, maybe we should look back at that and do something like round robin DNS.  As I type this, I am also not sure how our validate-hosts role didn't catch the16:47
pabelangerissue and abort the job16:47
fungipabelanger: then the load balancer becomes a single point of failure instead of the proxy16:47
*** gfidente has quit IRC16:48
pabelangeragreed16:48
logan-pabelanger fungi: my thinking is we should handle the mirror selection using a random list of eligible mirrors in the pre-run. select a random one until a health check passes, then use it for the job16:48
fungiunless we set up some sort of distributed lb with address takeover anyway16:48
*** e0ne has quit IRC16:48
pabelangerright, we could make our configure-mirror role a little smarter in that way16:49
fungibut that could be an interesting challenge in providers who block multicast16:49
logan-there is no need for a lb or spof that way, the zuul pre-run can select a mirror from a list of 1 or more eligibles16:49
pabelangerget IP from dns, validate online, then use16:49
*** trown is now known as trown|lunch16:49
pabelangerthat is kinda how validate-host role should work16:49
fungilogan-: yeah, if the job node performs the health check, then that at least gets us a solution for situations where one of the servers has died completely, maybe not for when one is having intermittent issues and not the other16:50
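The selection scheme logan- and pabelanger sketch above could look roughly like this (a hypothetical pre-run helper; the mirror URLs are invented and the real logic would live in the configure-mirror/validate-host roles). The health check is injectable so the selection logic can be exercised without network access:

```python
import random
import urllib.request

def select_mirror(mirrors, check=None, timeout=5):
    """Shuffle the candidate mirrors and return the first one that
    answers a health check.  `check` defaults to a simple HTTP probe
    but can be replaced, e.g. for testing."""
    if check is None:
        def check(url):
            try:
                with urllib.request.urlopen(url, timeout=timeout) as resp:
                    return resp.status < 500
            except OSError:
                return False
    candidates = list(mirrors)
    random.shuffle(candidates)
    for mirror in candidates:
        if check(mirror):
            return mirror
    # No healthy mirror: the job should abort in pre-run and be retried
    # (hopefully landing in another provider).
    raise RuntimeError("no healthy mirror found")

# Example with a stubbed check: only the second mirror is "up".
healthy = select_mirror(
    ["https://mirror01.example.org", "https://mirror02.example.org"],
    check=lambda url: url.endswith("02.example.org"))
print(healthy)  # -> https://mirror02.example.org
```

As fungi notes, a point-in-time check like this only catches a server that is fully down, not one failing intermittently mid-job.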
ssbarnea|rover... just to recapitulate: in less than 24h we were hit by two different mirrors going down: rax, and later vexxhost.16:51
fungihowever, most of the time when there's a problem with the mirror/proxy servers it's either a global issue or it's an issue impacting an entire region in a provider, so multiple servers is really only a solution for a fraction of these situations16:51
fungissbarnea|rover: correct16:51
pabelangerOdd, validate-host does not actually check our mirrors16:51
pabelangerI thought it did16:51
ssbarnea|roveroh,... next time someone tells me that mirrors are reliable I will send them a link to irc logs.16:52
fungissbarnea|rover: i don't dispute that these problems occurred, i'm saying they were different problems entirely and there's no single solution here16:52
ssbarnea|roverdoes pip respect http_proxy env value or not really?16:52
fungissbarnea|rover: they weren't "mirror problems" (in both cases they were "network problems")16:52
pabelangerssbarnea|rover: fungi: logan-: At a minium, I think we could update our pre-run playbook in base, to do a health check of both git.o.o and regional mirror, of if either fail, the job will abort and hopefully rerun on another provider16:53
fungithey were noticed as errors from the mirror servers because that's where jobs were trying to hit the outside through16:53
ssbarnea|roverfungi: sure, the mirror can go down for various reasons, but if pip knew how to fall back, this could be one solution covering both outages.16:53
fungithat's like saying that you have a "foot problem" because your leg fell off16:53
fungissbarnea|rover: it wouldn't have solved both outages, no16:54
ssbarnea|roverand the irony is that pip would have worked without custom mirror in both cases16:54
fungissbarnea|rover: yesterday, the inability of rackspace's dfw region to reach pypi was the problem. the only solution would have been to stop running jobs there16:55
fungiwhich we nearly did, but then their network issues in that region cleared up16:56
fungi(either a network problem within that region or a problem with the nearest endpoint for the fastly cdn pypi.org uses)16:56
fungiif nodes in that region had tried to connect directly to pypi.org the failure rate would have been identical16:57
fungino amount of patching pip or its configuration will solve that16:57
ssbarnea|roverfungi: ohh, are you sure? from what i heard this issue was limited to ipv6, and if the nodes were using ipv4, it would have worked.16:59
fungissbarnea|rover: yes, an alternative would have been to figure out how to disable ipv6 on our test nodes in that region. also a drastic enough solution that you're not going to just work around it17:00
ssbarnea|roverwhat's the status with vex, is it sorted? i am worried about gate queue which seems to only go up.17:01
fungissbarnea|rover: it's sorted, yes17:01
fungithe gate queue is only going up because tripleo monopolizes it. sorry, i have to point that out. complaints from the team who consume most of our resources are not really making this a fun thing to spend my valuable time maintaining today17:02
ssbarnea|roverthe gate queue is still 1.9 didn't see it going down at all.17:02
ssbarnea|rover1.9days, not hours :D17:02
fungii think you mean tripleo's gate queue17:03
ssbarnea|roveryeah17:03
*** eernst has joined #openstack-infra17:03
fungiyeah17:03
*** david-lyle has joined #openstack-infra17:04
*** ginopc has quit IRC17:05
*** dklyle has quit IRC17:05
openstackgerritJames E. Blair proposed openstack-infra/zuul master: Set relative priority of node requests  https://review.openstack.org/61535617:06
fungissbarnea|rover: ^ that should help get everyone else's changes moving faster at least17:06
openstackgerritPaul Belanger proposed openstack-infra/project-config master: base-test: Check that regional mirror is online  https://review.openstack.org/62096117:08
pabelangerssbarnea|rover: fungi: logan-: ^is something we can test, to offer some protection to jobs17:08
*** shardy has quit IRC17:09
clarkbssbarnea|rover: fungi right re proxy our mirror is a proxy17:09
clarkbsetting up a different proxy wouldnt help17:09
*** gyee has joined #openstack-infra17:09
clarkbalso we shouldnt set up a transparent proxy as the potential for abuse on that skyrockets17:10
mordredpabelanger: I think that would be helpful17:10
clarkbthis is why we reverse proxy specific services from our mirrors as we do17:10
pabelangermordred: yah, I could have sworn we did that in validate-host before, but it seems to be just git.o.o17:10
pabelangeralso I think we could maybe make validate-hosts take a list of hosts too, and have it validate them17:11
fungii'd be good with either solution17:11
clarkbwe always knew this would be a risk with using a proxy instead of a proper mirror17:12
clarkbbut pypi growth just isnt manageable anymore to mirror properly17:12
clarkbblame cuda linking17:12
*** bhavikdbavishi has quit IRC17:15
*** bhavikdbavishi has joined #openstack-infra17:16
pabelangerclarkb: https://review.openstack.org/620961/ should help with vexxhost mirror outage this moring, if you'd like to review. We can do a bit of testing before deciding to move it to our base job17:19
clarkbpabelanger: I'm not sure it will change anything. We already run bindep and things like devstack in pre17:20
clarkbthese will look for the distro package mirrors and fail in pre if that is down17:20
clarkbwe can certainly explicitly check things but I dont expect a major change in job retry behavior17:21
*** agopi is now known as agopi|food17:21
pabelangertripleo isn't using devstack, so they don't have coverage. That said, we could just move that check into their jobs17:21
pabelangerbut figured, since the mirrors are our infra servers, having the check in base might help protect all jobs17:22
clarkbpabelanger: ya we can explicitly check. Maybe this also points to tripleo maybe needing to move stuff into pre? I dont know I get lost every time I try to trace through a job there17:23
*** jamesmcarthur has joined #openstack-infra17:24
pabelangerclarkb: yah, that is fair17:24
*** dpawlik has joined #openstack-infra17:25
*** dpawlik has quit IRC17:26
clarkbgenerally low churn bootstrapping steps that are expected to succeed should be in pre17:26
clarkbfor most of our jobs this means install distro packages is in pre17:26
*** dpawlik has joined #openstack-infra17:26
clarkbmaybe not exclusively, as tripleo is a deployment project and wants to test those steps, but tripleo must have deps itself?17:27
clarkbpabelanger: because the next thing we'll run into is mirror is up but we rsynced a bad state and so its broken when you install17:28
clarkbthen we'll go through all of this again17:28
*** jpich has quit IRC17:29
*** hamerins has quit IRC17:29
pabelangerclarkb: Yah, that was my thought with adding this check to validate-host; there we only do a traceroute to git.o.o. If that fails, we assume a network issue.  We could update it to support a list of hosts, and fail if we cannot traceroute to both git.o.o and the mirror17:30
*** hamerins has joined #openstack-infra17:31
*** lujinluo has joined #openstack-infra17:33
fungii thought we did a ping. the traceroute was more so that we could perform post-mortem analysis on reachability issues17:36
clarkbits all to git.o.o though, and that host is no longer a good one actually since zuul pushes all git state into jobs now17:39
clarkbmaybe change git.o.o to the in-region mirror and check that host instead17:39
pabelangeryah, we could do that too17:40
pabelangerI think the idea with git.o.o, is to confirm we can route out side the provider network17:40
pabelangerso, might be good to also keep that17:40
fungiyes, granted it doesn't actually confirm that in rax-dfw since that's where git.o.o resides17:42
clarkbin the past all jobs had to talk to that node17:42
clarkbso not being able to talk to that load balancer would result in job failure17:43
clarkbthis is no longer true but I think that is why it was chosen17:43
fungibut having the node try to reach across the internet to some host we expect to always be up can be a good canary to keep17:43
clarkbyup17:43
openstackgerritJames E. Blair proposed openstack-infra/zuul master: More strongly recommend the simple reverse proxy deployment  https://review.openstack.org/62096917:44
fungiclarkb: is there an "easy" way to export the events data from a logstash query in kibana?17:52
*** ykarel|away has quit IRC17:53
fungii have a query with 449 results and don't feel like stitching together 5 pages of copy+paste17:53
*** jpena is now known as jpena|off17:53
fungithough that's what i did in the end17:55
clarkbthere should be a csv export option somewhere17:56
fungii found where to adjust the pagination at least17:57
*** wolverineav has joined #openstack-infra17:59
openstackgerritMerged openstack-infra/system-config master: docs: add info on generating DS records  https://review.openstack.org/61933418:00
fungiconfig-core: we can add another 80 nodes back to the pool with https://review.openstack.org/61975018:03
clarkbmordred: pabelanger and the actual http GET runs on all of the ansible test nodes not the ansible process on the executor?18:03
clarkbfungi: we aren't concerned that it will go back to being unstable?18:04
fungihalving our utilization in ovh-bhs1 didn't solve the excessive timeouts there18:04
clarkbroger18:04
fungisee the comment i just added18:04
openstackgerritPaul Belanger proposed openstack-infra/zuul master: Add support for zones in executors  https://review.openstack.org/54919718:05
*** derekh has quit IRC18:06
openstackgerritSorin Sbarnea proposed openstack-infra/elastic-recheck master: Identify DLRN builds fail intermittently (network errors)  https://review.openstack.org/62095018:06
*** trown|lunch is now known as trown18:08
pabelangerclarkb: yah, all nodes should curl the mirror18:08
openstackgerritSorin Sbarnea proposed openstack-infra/elastic-recheck master: Identify DLRN builds fail intermittently (network errors)  https://review.openstack.org/62095018:08
sshnaidmcan somebody tell why ara-report is from Sep in todays job? http://logs.openstack.org/64/606064/2/gate/tripleo-ci-centos-7-containers-multinode/dfc9a9c/18:09
sshnaidmis some server not time synced?18:09
sshnaidmara-report/2018-09-04 17:5418:09
fungithat's definitely weird18:09
clarkbthat ara report is generated by the zuul executor I think. Could be that one of them got confused?18:10
fungithe timestamps inside the ara report look like they're from today18:10
pabelangerara-report folder will contain a sqlite.db18:10
corvusfungi, clarkb: can one of you please obtain an opendev.org cert?18:10
pabelangerI think that is created from the executor18:10
fungii think apache is confused18:10
clarkbcorvus: yes, I can do that18:11
fungils -l /srv/static/logs/64/606064/2/gate/tripleo-ci-centos-7-containers-multinode/dfc9a9c/18:11
fungidrwxr-xr-x  2 jenkins jenkins   4096 Nov 29 11:30 ara-report18:11
clarkbcorvus: it will likely require we edit DNS to verify ownership of the domain though now that gdpr means whois is useless18:11
*** hamzy has joined #openstack-infra18:12
pabelangerlol, jenkins18:12
corvusclarkb: the automation should be in place18:12
clarkbcorvus: cool, I'll push up a change for that when I've got the details from namecheap18:12
fungipabelanger: yeah, we haven't renamed the account on static.o.o18:12
*** bobh has quit IRC18:12
*** bhavikdbavishi has quit IRC18:13
openstackgerritTobias Henkel proposed openstack-infra/nodepool master: Ensure that completed handlers are removed frequently  https://review.openstack.org/61002918:13
*** bhavikdbavishi1 has joined #openstack-infra18:13
ssbarnea|roverfungi: clarkb : if you could help with recent CRs on https://review.openstack.org/#/q/project:openstack-infra/elastic-recheck+status:open it would be great18:13
openstackgerritJames E. Blair proposed openstack-infra/system-config master: Serve opendev.org website from files.o.o  https://review.openstack.org/62097918:14
*** dpawlik has quit IRC18:14
pabelangerfungi: sshnaidm: http://logs.openstack.org/64/606064/2/gate/tripleo-ci-centos-7-containers-multinode/dfc9a9c/job-output.txt.gz#_2018-11-29_11_30_44_008080 I think is when we create the directory18:14
*** david-lyle has quit IRC18:15
TheJuliaso the maximum zuul job length is 3 hours?18:15
*** agopi|food is now known as agopi18:15
openstackgerritTobias Henkel proposed openstack-infra/nodepool master: Ensure that completed handlers are removed frequently  https://review.openstack.org/61002918:15
pabelangerTheJulia: yes18:15
*** bhavikdbavishi1 is now known as bhavikdbavishi18:15
clarkbcorvus: fungi: a year cert as per normal? I expect we'll maybe be using free certs in the not too distant future so that makes sense to me18:15
corvusclarkb: ++18:16
sshnaidmpabelanger, so is it apache fault..?18:16
fungiclarkb: yeah, that's what i'd do18:16
*** wolverineav has quit IRC18:16
openstackgerritPurnendu Ghosh proposed openstack-infra/project-config master: Create airship-spyglass repo  https://review.openstack.org/61949318:17
clarkbsshnaidm: that is fungi's theory18:17
fungisshnaidm: yeah, must be? i can't for the life of me figure out where apache is getting that timestamp: http://paste.openstack.org/show/736434/18:17
TheJuliapabelanger: I guess I should try and see what I can prune out of a grenade job :(18:18
openstackgerritJames E. Blair proposed openstack-infra/system-config master: DNS: replace ip addresses with names  https://review.openstack.org/62098018:18
*** wolverineav has joined #openstack-infra18:18
corvusTheJulia: for context: the idea was that 3 hours is plenty of headroom for jobs that were designed to be 1 hour long18:19
*** jistr has quit IRC18:19
*** jistr has joined #openstack-infra18:19
TheJuliaexcept grenade has always been 2-2.518:19
*** bhavikdbavishi has quit IRC18:19
TheJuliahit a slow node, and you pass 3 hours18:20
fungialways?18:20
sshnaidmfungi, yeah, weird..maybe some network storage bug18:20
TheJuliaas far as I can remember18:20
fungii vaguely remember grenade jobs taking ~1.25 hours when devstack jobs took ~0.75 hours18:20
TheJuliaYou have to keep in mind the whole fake baremetal deployment boot/deploy cycle for ironic adds a chunk of time18:20
clarkbcorvus: fungi: do we want an alt name of www.opendev.org?18:20
fungisshnaidm: i have a feeling it's something weird in apache's caching layer18:20
corvusclarkb: yeah i think so18:21
*** roman_g has quit IRC18:22
fungiclarkb: yes, that would then be consistent with the one for zuul-ci.org18:22
fungiX509v3 Subject Alternative Name: DNS:zuul-ci.org, DNS:www.zuul-ci.org18:22
fungiecho|openssl s_client -connect zuul-ci.org:443|openssl x509 -text18:23
fungiso "do whatever you did for that domain"18:23
clarkbfungi: I don't think I did that domain18:23
*** jamesmcarthur has quit IRC18:24
openstackgerritJames E. Blair proposed openstack-infra/zone-opendev.org master: Add A(AAA) records for (www.)opendev.org  https://review.openstack.org/62098218:25
fungiclarkb: it's in bridge.o.o:~root/certs/2018-03-26/ so one of us did anyway18:25
fungioh, wait, that's git.zuul-ci.org18:26
fungiin 2018-01-19 instead18:26
*** lujinluo has quit IRC18:26
corvusfungi, clarkb: i set "topic:opendev" on all related changes18:27
fungithanks!!!18:27
clarkbI'll model it off of the static cert and create a new opendev.org cnf file and generate with that18:27
openstackgerritMerged openstack-infra/project-config master: Revert "Halve ovh-bhs1 max-servers temporarily"  https://review.openstack.org/61975018:28
fungiclarkb: i think it adds www automagically as a san, but i could be wrong18:29
clarkbfungi: it being openssl?18:29
fungiclarkb: namecheap18:29
fungiX509v3 Subject Alternative Name: DNS:git.zuul-ci.org, DNS:www.git.zuul-ci.org18:30
clarkboh huh18:30
fungiX509v3 Subject Alternative Name: DNS:zuul.openstack.org, DNS:www.zuul.openstack.org18:30
clarkbya same for git.openstack.org. In that case I'll just use our normal process and it should just work (tm)18:31
clarkb(I hope)18:31
clarkbeasy enough if so18:31
fungiso if you just ask for a cert for opendev.org they should end up giving you www.opendev.org as a san18:31
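The openssl pipeline fungi pasted above can also be done with the Python stdlib; this is a generic sketch (the hostnames are just examples), with the SAN extraction split out so it can be exercised without a network connection:

```python
import socket
import ssl

def get_sans(cert):
    """Pull the DNS subjectAltName entries out of the dict returned by
    ssl.SSLSocket.getpeercert()."""
    return [value for kind, value in cert.get("subjectAltName", ())
            if kind == "DNS"]

def fetch_sans(host, port=443):
    """Connect to a TLS server and return its certificate's DNS SANs,
    roughly what `openssl s_client ... | openssl x509 -text` shows."""
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=10) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            return get_sans(tls.getpeercert())

# e.g. fetch_sans("zuul-ci.org") would be expected to include both
# the bare domain and the www name, as in the output quoted above.
sample = {"subjectAltName": (("DNS", "zuul-ci.org"),
                             ("DNS", "www.zuul-ci.org"))}
print(get_sans(sample))  # ['zuul-ci.org', 'www.zuul-ci.org']
```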
*** hamzy has quit IRC18:31
*** diablo_rojo has joined #openstack-infra18:31
clarkbyup18:31
openstackgerritClark Boylan proposed openstack-infra/zone-opendev.org master: Add SSL Cert verification record  https://review.openstack.org/62098518:38
openstackgerritClark Boylan proposed openstack-infra/zone-opendev.org master: Revert "Add SSL Cert verification record"  https://review.openstack.org/62098618:38
clarkbfungi: corvus ^ double check me on that, it's been years since I hand edited a bind zone file :)18:39
*** eernst has quit IRC18:40
clarkbI didn't increment the serial18:41
clarkblet me fix that real quick18:41
* fungi was about to point that out ;)18:41
fungimake sure to increment it again in the "revert" too18:41
corvusemacs'll do it for you18:41
fungiemagic18:41
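For zone files using the common YYYYMMDDnn serial convention, the serial bump clarkb had to redo by hand (and that emacs can automate) can be sketched like this; purely illustrative, and plain integer serials would just need +1:

```python
import datetime
import re

def bump_serial(serial, today=None):
    """Bump a YYYYMMDDnn-style zone serial: increment the two-digit
    sequence if the date part is already today, otherwise roll to
    today's date with sequence 01.  (Assumes fewer than 100 edits per
    day, per the two-digit convention.)"""
    today = today or datetime.date.today()
    datestr = today.strftime("%Y%m%d")
    m = re.fullmatch(r"(\d{8})(\d{2})", str(serial))
    if m and m.group(1) == datestr:
        return int(datestr + "%02d" % (int(m.group(2)) + 1))
    return int(datestr + "01")

nov29 = datetime.date(2018, 11, 29)
print(bump_serial(2018112901, today=nov29))  # 2018112902 (same day: bump seq)
print(bump_serial(2018112805, today=nov29))  # 2018112901 (new day: reset seq)
```

The one hard rule is that the serial must strictly increase on every published change, or secondaries will not transfer the new zone.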
*** eernst has joined #openstack-infra18:42
openstackgerritClark Boylan proposed openstack-infra/zone-opendev.org master: Add SSL Cert verification record  https://review.openstack.org/62098518:43
openstackgerritClark Boylan proposed openstack-infra/zone-opendev.org master: Revert "Add SSL Cert verification record"  https://review.openstack.org/62098618:43
clarkbcorvus: yup got both of them18:43
fungiyou might want to wip 620986 until you're done, just to be safe18:44
*** gfidente has joined #openstack-infra18:45
clarkb++18:45
frickler$ ls -l /usr/local/bin/ara-wsgi-sqlite18:45
frickler-rwxr-xr-x 1 root root 1818 Sep  4 17:54 /usr/local/bin/ara-wsgi-sqlite18:45
*** dklyle has joined #openstack-infra18:45
fricklertogether with WSGIScriptAliasMatch ^.*/ara-report(?!/ansible.sqlite) /usr/local/bin/ara-wsgi-sqlite18:45
frickleris what makes that timestamp18:46
fungifrickler: oh! so that's the timestamp of the cgi18:46
fungisshnaidm: ^ mystery solved18:46
fricklerbit of a weird behaviour of apache I'd say18:46
sshnaidminteresting18:46
clarkbcorvus: fungi: ok I've approved the dns updates so now we wait for that to merge and apply and comodo to notice. I'll get the cert data into hiera/ansiblevars as soon as I have it18:49
*** gfidente has quit IRC18:51
openstackgerritPaul Belanger proposed openstack-infra/zuul master: Clarify executor zone documentation  https://review.openstack.org/62098918:51
*** hamzy has joined #openstack-infra18:52
clarkbssbarnea|rover: pabelanger out of curiosity where is the rdo/rpm packaging line vs pypi line drawn for tripleo testing? is there a general rule there?18:52
openstackgerritMerged openstack-infra/zone-opendev.org master: Add A(AAA) records for (www.)opendev.org  https://review.openstack.org/62098218:52
*** hamzy_ has joined #openstack-infra18:57
openstackgerritMerged openstack-infra/zone-opendev.org master: Add SSL Cert verification record  https://review.openstack.org/62098518:57
*** lpetrut has joined #openstack-infra18:58
*** hamzy has quit IRC18:58
pabelangerclarkb: I'm not sure myself, but believe anything that is openstack (and dependency), gets built as RPM.18:59
fungii think they have some tox jobs too though?19:01
*** mtreinish has joined #openstack-infra19:03
*** wolverineav has quit IRC19:04
*** wolverineav has joined #openstack-infra19:05
weshaythanks clarkb!19:05
weshayfor the elastic recheck reviews19:05
*** wolverineav has quit IRC19:06
clarkbweshay: np. I'm happy to see people using it for this :)19:06
*** wolverineav has joined #openstack-infra19:06
clarkbweshay: I figure if mriedem or fungi or some other current root doesn't review in the next day or so I can approve the stack (except maybe the py3 support change since ianw was reviewing that one already and its a bit bigger than adding queries)19:06
ssbarnea|roverclarkb: yep, the main rule is to use rpm whenever possible, but there are a few exceptions, like tox testing.19:06
weshayk19:07
*** eernst has quit IRC19:07
*** wolverineav has quit IRC19:07
*** dklyle has quit IRC19:07
mriedemhuh?19:07
*** wolverineav has joined #openstack-infra19:07
clarkbmriedem: e-r reviews19:07
ssbarnea|roveri have seen other workarounds too, but they are usually only temporary, like installing a package from pip until we get an rpm for it.19:07
ssbarnea|roverclarkb: exceptions apply only to non-shipping code, like test related things. anything that ships / installs on production must be rpm based.19:08
clarkbssbarnea|rover: got it,thanks19:08
*** fuentess has quit IRC19:08
*** electrofelix has quit IRC19:12
*** eernst has joined #openstack-infra19:13
clarkbssbarnea|rover: comment on https://review.openstack.org/#/c/620950/319:16
mtreinishclarkb, fungi: so I've got kind of a random question, do you have a pointer to the script used to build wheels?19:17
clarkbmtreinish: ya I'll dig it up19:17
mtreinishI got a request to upload wheels for stestr to pypi, and I've been using the old tarball script to upload that to pypi19:17
mtreinishand reading the twine docs was not at all helpful19:17
mtreinishclarkb: cool, thanks19:18
*** eernst has quit IRC19:18
clarkbmtreinish: https://git.openstack.org/cgit/openstack-infra/openstack-zuul-jobs/tree/roles/build-wheels/files/wheel-build.sh19:18
clarkbdarn I typed that out wrong19:19
*** wolverineav has quit IRC19:19
*** eernst has joined #openstack-infra19:19
clarkbmtreinish: https://git.openstack.org/cgit/openstack-infra/project-config/tree/roles/build-wheels/files/wheel-build.sh19:19
clarkbthere19:19
fungiyou can do it in the same command where you also build an sdist, i.e. `python setup.py bdist_wheel sdist`19:20
*** wolverineav has joined #openstack-infra19:20
clarkboh wait you are wanting the publish side of wheel building19:20
clarkbthat is the build all the wheels for mirroring script19:20
fungiaha, yes19:20
fungisorry, i too misread19:20
clarkbbut ya you basically pip wheel ./ or python setup.py bdist_wheel19:21
fungi`twine upload dist/*`19:21
mtreinishthanks19:21
mtreinishhmm, that's what I tried19:21
*** eernst has quit IRC19:21
fungiif you want to test against the dev pypi, do `twine upload --repository-url https://test.pypi.org/legacy/ dist/*`19:21
mtreinishwell I've got a bug fix release to push, so I'll give it a try on a fresh tag19:21
mtreinishoh, that's good to know19:22
fungibefore doing that, i also recommend running `twine check dist/*`19:22
clarkbcorvus: root email says ns2.opendev.org requires a reboot to complete package upgrades. Maybe we should do that once dns is verified with comodo?19:22
fungiat the moment i think that only checks the long description to make sure pypi will render it successfully, but in the future the expectation is that will grow additional checks for things like invalid trove classifiers19:22
clarkbod that ns1 wouldn't require it, but then i Remember we likely use different base images in the two different clouds19:23
*** xek has quit IRC19:23
ssbarnea|roverweshay: please read https://review.openstack.org/#/c/620950/3 and add your input, i am not sure if voting:1 should be in or not.19:23
*** wolverineav has quit IRC19:24
*** wolverineav has joined #openstack-infra19:24
ssbarnea|rovermriedem: fungi : i am not sure if you are also aware of the "twine check" command, which proved to be VERY useful as it also lints the readme that goes to pypi and assures it renders well.19:24
ssbarnea|rovermtreinish: ^^ this was for you. :)19:25
clarkbssbarnea|rover: the openstack release process uses that command to check things before we make releases19:25
clarkbit is indeed quite helpful19:25
ssbarnea|roverclarkb: the problem is that most of the projects do not run it as part of their tox targets, so we find out about breakage late in the process.19:26
ssbarnea|rovermy personal preference is to include "twine check" as part of tox-linters ... as this is what it does, mostly.19:26
clarkbssbarnea|rover: ya, though at least before we try to publish now19:26
mtreinishheh, yeah it looks like it will be useful19:26
fungissbarnea|rover: well, twine check doesn't technically check the readme, it checks the long description field of the built packages (which in our case are embedded copies of a readme)19:26
mtreinishbut there is a typo in the warning message about a missing dep19:27
mtreinishtold me to install 'readme_render[md]' but it meant 'readme_renderer[md]'19:27
ssbarnea|roveryeah, i know about this. i think I made a PR to fix it, or wanted to.19:28
fungithough if you don't use markdown you can ignore that19:28
fungirestructuredtext is supported by default19:28
mtreinishyeah the README for stestr is rst, but I have other projects that use md so I figured better to have it19:28
ssbarnea|rovermtreinish: i had the same impression, better to have it.19:29
ssbarnea|roverclarkb: wes replied on https://review.openstack.org/#/c/620950/ -- you can make a decision. I think in this case it is better to have voting on.19:31
ssbarnea|roverto eliminate expected noise19:31
*** graphene has quit IRC19:32
*** graphene has joined #openstack-infra19:33
clarkbssbarnea|rover: ok I'm ok if we choose voting, just wanted to make sure it was explicit19:34
ssbarnea|roverclarkb: now we only need to find someone else to workflow these. slowly we improve the categorization rate.19:36
*** markvoelker has joined #openstack-infra19:36
clarkbssbarnea|rover: as I mentioend I'm happy to approve the query changes with my +2 if no one else reviews them today. The py3 port should get more eyes though19:36
*** jamesmcarthur has joined #openstack-infra19:37
ssbarnea|roveri wonder if we track its value over time, so we can see how it goes. it would be nice to have an alarm, when it goes above: x% we start working on it until we bring it to y%.19:38
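The alarm ssbarnea|rover describes is essentially hysteresis on the uncategorized-failure rate: trigger above x%, and stay triggered until the rate drops back below y%. A minimal sketch, with made-up thresholds of x=15% and y=5%:

```python
def alarm_state(rate, prev_alarming, high=0.15, low=0.05):
    """Hysteresis check for an uncategorized-failure rate: raise the
    alarm when the rate climbs above `high`, and keep it raised until
    the rate falls back below `low`, so it doesn't flap around a
    single threshold."""
    if prev_alarming:
        return rate >= low
    return rate > high

# Walk a series of daily rates through the alarm.
alarming = False
history = []
for rate in [0.04, 0.10, 0.20, 0.12, 0.06, 0.03]:
    alarming = alarm_state(rate, alarming)
    history.append(alarming)
print(history)  # [False, False, True, True, True, False]
```

The two-threshold design means a team starts working the backlog when categorization gets bad and keeps going until it is comfortably healthy again, rather than toggling on every small fluctuation.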
openstackgerritJeremy Stanley proposed openstack-infra/zuul-website master: Revert "Add a promotional message banner and events list"  https://review.openstack.org/62099519:38
ssbarnea|roveri do find elastic-recheck extremly useful, especially for those doing ruck/rovering.19:38
fungiclarkb: i think in the past we've only expected a single +2 on e-r query additions/removals19:39
*** dpawlik has joined #openstack-infra19:39
fungimriedem can correct me there19:39
clarkbfungi: yes, though typically those reviews were from mriedem and mtreinish who know how to review the changes :)19:39
clarkbI'm quite rusty :)19:39
mriedemit's one thing if it's a query19:40
mriedemthese are py3 changes right?19:40
*** markvoelker has quit IRC19:40
clarkbmriedem: one change is a py3 porting. That one should have multiple reviews. The others are all queries which I've +2'd and can aopprove if that is what we want19:40
mriedemoh ok let me review19:42
corvusclarkb: ns2 reboot post comodo wfm19:43
ssbarnea|rovermriedem: thanks, i am here to answer your questions. the only tricky part was around a few lp modules managed by canonical, which were initially not py3 ready, but they made a new release.... those libraries do not even have a *CI*... :D19:43
mtreinishooh, an updated review on: https://github.com/ansible/ansible/pull/23769 maybe we won't have to carry a local version soon19:48
fungithat would be swell19:49
fungilatest state just got two shipits. i think you're set?19:50
*** jamesmcarthur has quit IRC19:50
* fungi has no idea how the review process for ansible works19:50
mtreinishnor do I19:51
ssbarnea|rovermtreinish: i can help you with few hints around: ping key people on #ansible-devel -- bcoca helped me many times.19:51
ssbarnea|rovermtreinish: ok, now you got feedback on it, be sure you address it.19:52
clarkbok ansible + puppet are not running19:52
*** jamesmcarthur has joined #openstack-infra19:52
clarkbthis explains why I'm waiting on dns records for longer than I expected19:52
clarkbFailed to discover available identity versions when contacting https://La1.citycloud.com:5000/v3/. Attempting to parse version from URL.19:53
clarkbhrm is that a cloud outage? we should've fixed the citycloud per-region keystone thing19:54
clarkband I get 502 bad gateway if I try to talk to that url19:54
mtreinishssbarnea|rover: thanks, it might be a while though. I don't have a lot of bandwidth for it right now. It sat idle for a long time and is low on my prio list right now19:54
clarkbmordred: ^ any thoughts?19:54
*** wolverineav has quit IRC19:55
ssbarnea|roverclarkb: mtreinish regarding elastic-recheck I observed that in many cases, before a CR is reviewed, logstash has already recycled the logs, so we should aim to review changes while they are fresh. I aim to review all in 24-48h to avoid this.19:56
*** wolverineav has joined #openstack-infra19:56
*** sshnaidm is now known as sshnaidm|afk19:56
*** markvoelker has joined #openstack-infra20:00
*** jamesmcarthur has quit IRC20:00
mriedemnot sure why specific tests are called out in this https://review.openstack.org/#/c/617579/20:00
*** wolverineav has quit IRC20:00
clarkbhttp://cnstatus.com/?p=4413 maybe the issue is that20:00
mriedemas generic ssh failures can hit most of tempest now since it runs with validation in tempest-full jobs20:00
openstackgerritMerged openstack-infra/elastic-recheck master: Categorize ImageNotFoundException on tripleo jobs  https://review.openstack.org/62011420:02
mordredclarkb: looking20:04
*** wolverineav has joined #openstack-infra20:04
mordredclarkb: yeah - I think it's likely that20:05
mordredclarkb: sto2.citycloud.com is working20:05
*** wolverineav has quit IRC20:10
mriedemssbarnea|rover: https://review.openstack.org/#/c/616578/920:10
ssbarnea|roversure, taking care of these now.20:11
*** slaweq has quit IRC20:11
*** irdr has quit IRC20:12
*** jtomasek has quit IRC20:12
mriedemclarkb: btw, i haven't seen e-r commenting on failures lately20:13
mriedemi wonder if the log index workers are overwhelmed with tripleo console log indexing?20:13
mriedemone of the comments when you brought this up as a goal in berlin was that it'd be nice if we had some kind of status page / dashboard for the logstash workers and/or e-r bot to know if it's off the rails20:14
openstackgerritMerged openstack-infra/nodepool master: Add arbitrary node attributes config option  https://review.openstack.org/62069120:14
clarkbmriedem: we do sort of have one for the logstash workers. http://grafana.openstack.org/d/T6vSHcSik/zuul-status?orgId=1 the logstash job queue graph there is part of the zuul status20:14
clarkbmriedem: it looks like its keeping up, though individual files in the pipeline may be lagging more than say 20 minutes20:15
openstackgerritSorin Sbarnea proposed openstack-infra/elastic-recheck master: Made elastic-recheck py3 compatible  https://review.openstack.org/61657820:15
fungipossible the bot has crashed/hung?20:15
mriedema lot of the time, e-r is dead or something20:16
mriedemb/c of a bad query or something like that20:16
mriedemalthough i'd think a bad query would also break the graph20:16
*** ralonsoh has quit IRC20:16
fungiin the middle of making fried rice, but can take a look once i'm done20:17
mriedemoh i didn't know efried was there20:17
clarkbmriedem: ya I would expect that to be the case too20:17
efried:P20:17
clarkbfungi: thanks!20:17
* clarkb returns to reviewing relative priority support in zuul20:17
openstackgerritMatt Riedemann proposed openstack-infra/elastic-recheck master: fix tox python3 overrides  https://review.openstack.org/60561820:18
fungimy kitchen would be a lot more awesome if efried were running it, i'm sure20:18
ianwinfra-root: can we look at ansible 2.7.2 install for bridge with -> https://review.openstack.org/#/c/617218/ .  the other version didn't get reviews, and the cloud-launcher is still broken.  i know we're not rolling out cloud changes, but evidence shows it tends to bitrot easily20:18
*** eernst has joined #openstack-infra20:19
efriedI make a mean fried rice. Though I'm better at curries.20:19
openstackgerritMatt Riedemann proposed openstack-infra/elastic-recheck master: Include query results in graph  https://review.openstack.org/26018820:19
*** florianf is now known as florianf|afk20:19
*** irdr has joined #openstack-infra20:19
ianwinfra-root: and actually, now i look at http://grafana.openstack.org/d/qzQ_v2oiz/bridge-runtime?orgId=1&from=now-12h&to=now ... clearly something has gone wrong20:20
*** e0ne has joined #openstack-infra20:20
pabelangerNice, didn't know it was hooked up to grafana20:21
clarkbianw: see above, citycloud outage is preventing us from generating inventory20:21
*** eernst has quit IRC20:21
clarkbianw: http://cnstatus.com/?p=4413 is the issue I think (sparse on details though)20:21
ianwclarkb: oh, oh cool, if we have a reason good :)20:21
ianwmordred: if around, would be great if you could review at least the glean bits of https://review.openstack.org/#/q/status:open+topic:fedora29 to enable networkmanager support20:23
*** eernst has joined #openstack-infra20:25
*** jamesmcarthur has joined #openstack-infra20:25
openstackgerritMerged openstack-infra/elastic-recheck master: Categorize ovs crash bug #1805176  https://review.openstack.org/62010520:26
openstackbug 1805176 in tripleo "tripleo jobs failing to setup bridge: fatal_signal|WARN|terminating with signal 14 (Alarm clock)" [High,Triaged] https://launchpad.net/bugs/180517620:26
clarkbmriedem: fwiw on the bug that is limited to those specific tests I also left a note that maybe the qa team wants to be involved since many jobs seems to match20:26
clarkbssbarnea|rover: as far as alerting goes, we've generally tried to avoid any semblance of on-call, must-react-now type behavior. Instead we present the data so that it can be consumed by individuals as they have time/ability20:27
clarkbssbarnea|rover: so we generate graphs. We could also maybe light a batsignal if thresholds are reached that requires you to "look to the sky" or wherever for that rather than it hitting your laptop/phone20:28
*** eernst has quit IRC20:29
*** jamesmcarthur has quit IRC20:29
*** e0ne has quit IRC20:30
ssbarnea|roverclarkb: sure.20:30
openstackgerritMerged openstack-infra/elastic-recheck master: Categorize error mounting image volumes due to libpod bug[1]  https://review.openstack.org/61905920:31
clarkbssbarnea|rover: I think in the past the first (less explicit) signal has been the categorization rate being under like 80%20:31
clarkbsince the first order of business is tracking the issues, then gaining understanding, then fixing them20:31
clarkbwe can probably have some metric we call out on the stuff we understand too. Like X failures in a day20:32
*** slaweq has joined #openstack-infra20:32
ssbarnea|roverclarkb: practical question, this is uncategorized: http://logs.openstack.org/04/618604/1/gate/tripleo-ci-centos-7-scenario000-multinode-oooq-container-updates/e892675/job-output.txt.gz -- any idea on how to categorize it?20:35
ssbarnea|roveri guess i found the string: "PRE-RUN END RESULT_UNREACHABLE"20:36
clarkbssbarnea|rover: I think we already have a query for that one. Its a cloud level issue with using duplicate IPs :/20:36
clarkbhowever because that ran in pre the job will be retried20:36
clarkbhrm not seeing a query for it anymore. maybe it was cleaned up? its a known issue we've engaged rax on. But I'm not sure if they know what causes it or if there is a fix20:37
ssbarnea|roverclarkb: i was not able to find any bug or query with "PRE-RUN END RESULT_UNREACHABLE" in it, so maybe this one was missed. I will create new one, unless someone knows an existing one that can be adapted.20:42
ssbarnea|roverhttps://bugs.launchpad.net/openstack-gate/+bug/180590020:45
openstackLaunchpad bug 1805900 in OpenStack-Gate "PRE-RUN END RESULT_UNREACHABLE" [Undecided,New] - Assigned to Sorin Sbarnea (ssbarnea)20:45
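For reference, elastic-recheck tracks a bug by adding a YAML file under queries/ named for the Launchpad bug number. A hypothetical sketch for this bug follows; the actual query string merged in review 621004 may differ, and the message/filename values here are assumptions about what gets indexed:

```yaml
# queries/1805900.yaml -- illustrative only; check the real change for
# the query that was actually merged.
query: >-
  message:"RESULT_UNREACHABLE" AND
  filename:"job-output.txt"
```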
*** jcoufal has quit IRC20:46
fungirecheck  21279  0.2 14.0 964108 569948 ?       Sl   Nov16  40:13 /usr/bin/python /usr/local/bin/elastic-recheck /etc/elastic-recheck/elastic-recheck.conf20:46
fungilooks like the recheck bot is running, at least20:46
openstackgerritSorin Sbarnea proposed openstack-infra/elastic-recheck master: Identify POST-RUN END RESULT_UNREACHABLE  https://review.openstack.org/62100420:50
*** wolverineav has joined #openstack-infra20:51
*** wolverineav has quit IRC20:53
*** wolverineav has joined #openstack-infra20:53
*** jamesmcarthur has joined #openstack-infra20:53
*** olivierbourdon38 has quit IRC20:57
corvusmordred, clarkb: any news on the ansible issue?21:00
openstackgerritMerged openstack-infra/nodepool master: Asynchronously update node statistics  https://review.openstack.org/61958921:00
openstackgerritMerged openstack-infra/zuul-website master: Revert "Add a promotional message banner and events list"  https://review.openstack.org/62099521:00
clarkbcorvus: pretty sure it's that outage I linked to on citycloud status page. We can disable those regions if it persists21:01
clarkbI'm eating lunch now. back in a bit21:01
corvusmordred, clarkb: our theory with nodepool is that clouds are unreliable so we handle them disappearing gracefully.  but we have the same clouds now blocking our operations whenever there's an error.  is there a way to mitigate this, or should we just switch to static inventories?21:03
corvusclarkb: this can wait till after lunch :)21:03
*** hamerins has quit IRC21:05
mordredcorvus: I'm honestly torn on that question21:05
mordredcorvus: there's a config flag that can be given to the inventory to not let cloud errors bomb the whole thing out21:05
fungiwhat other alternatives are there? not update the cached inventory for a given provider if shade hits an error trying to query it?21:06
mordredbut I think we've been reluctant to set it in the past out of fear we'd silently ignore a chunk of our inventory21:06
openstackgerritJames E. Blair proposed openstack-infra/nodepool master: Fix updateFromDict overrides  https://review.openstack.org/62100821:07
mordredfungi: it's a single inventory and the caching is at the inventory level, so there's not really a way to selectively update or not update portions of the cache21:07
mordredwe could also switch to static/generated inventories - it's not like the vms we're running in our clouds are terribly dynamic21:07
mordredso MOST of the time it's the same servers21:07
fungiyeah, i get that would likely entail an overhaul of the dynamic inventory generation21:07
mordredbut we'd need to think through workflows for that with new server creation and old server deletion21:08
corvusmordred: yeah, i'd worry about losing the inventory.  i think the biggest problem is we'd have to think about that possibility every time we make a change ("will this role work if half the servers are gone?")21:08
fungiperhaps if shade errors aborted inventory generation we could just proceed with the previous cache?21:08
corvusprevious cache or pre-generation sound better to me21:09
fungias you say, it's the same servers 99.9% of the time anyway21:09
mordredfungi: maybe? I'm not sure if we have that possibility - the caching layer is actually handled inside of ansible itself now21:09
fungiworst case it goes unnoticed until we need to add or remove a server and notice the inventory's not updating21:09
corvusprevious cache and pre-generation are functionally the same thing; it sounds like just doing pre-generation would be the simplest way of getting the result21:10
*** dklyle has joined #openstack-infra21:10
mordredyeah. I mean - I was thinking generating the inventory and shoving it in to git21:10
corvusmordred: oh, i was imagining we just run a script at the start of the cron job, and if it errors, continue anyway...21:11
fungiright, only copy the generated inventory into place if the pregen script succeeds, otherwise leave the old one in place21:11
mordredcorvus: I was thinking basically just update the inventory when we create or delete servers21:11
mordredand just remove it from being in the cronjob execution path completely21:12
corvusauto-proposed changes to git is interesting21:12
corvusi don't object to that, but it sounds like extra work and i don't know what we'd gain (but maybe that's lack of imagination on my part)21:13
*** dpawlik has quit IRC21:13
mordredcorvus: I think the main thing I was thinking we'd gain is less moving parts at runtime - and I wasn't thinking auto-proposed as much as "when you're done running launch-node, run this script and submit a patch to git" ... but you're right, at that point I don't know that there's much benefit to having the data in git - other than visibility of the otherwise hidden data21:14
mordredcorvus: I'm not sure I'm *actually* advocating that we do that- it's just been a thought in the back of my head when we have inventory issues21:15
corvusmordred: yeah. that'd work too.21:16
clarkbmordred: ansible handling the caching is why I think we are noticing this now21:21
fungiif we ever want more automation in server launching though, that's one more wait-for-a-human step21:21
clarkbI mean I'm sure it was an issue before but we just used the last version right?21:21
fungigranted, we've already got the "submit a patch to add dns records" step which needs to wait for reviewers21:22
fungiunless we decide there's a subdomain we want to run completely off an autogenerated zonefile21:23
clarkbanother option could be to put all of the mirror nodes in their own ansible cron run thing and stop generating inventories for them in the main control plane run21:23
clarkbfor the main control plane run we only care about vexxhost and rax today21:23
clarkbdoesn't fix the issue but limits the scope of it21:24
*** agopi is now known as agopi|pto21:24
clarkbcitynetwork says the issue I found is estimated to be fixed at 2400 CET21:25
clarkbwhich is 35 minutes from now?21:25
clarkbor is CET only +1?21:25
*** agopi|pto has quit IRC21:26
*** yamamoto has quit IRC21:26
clarkbmordred: would it be worthwhile to suggest to ansible (or implement for ansible) falling back to the prior cache?  I mean what is the cache actually buying us if we can't use it without hitting the clouds?21:26
*** dpawlik has joined #openstack-infra21:28
corvusclarkb: i was thinking about splitting, but we have some control-plane nodes in non-rax clouds.  i don't think i want to inhibit more of that in the future, so i prefer the idea of making it more robust.21:29
fungiclarkb: utc+121:29
clarkbcorvus: ya I'm beginning to think the most generally robust thing would be for ansible caching to act as a cache that doesn't need refreshing every run21:30
fungicest (their dst) is utc+221:30
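To make the timezone arithmetic concrete: in late November Europe is on winter time, so CET (UTC+1) applies and "2400 CET" on the 29th is 23:00 UTC. A quick check using any CET-observing zone (Europe/Stockholm is chosen here only because City Cloud is Swedish; any CET zone gives the same result):

```python
from datetime import datetime
from zoneinfo import ZoneInfo  # Python 3.9+

# "24:00" on 2018-11-29 is midnight at the start of 2018-11-30 local time.
# In November the zone is on CET (UTC+1), not CEST (UTC+2).
fixed_at = datetime(2018, 11, 30, 0, 0, tzinfo=ZoneInfo("Europe/Stockholm"))
print(fixed_at.astimezone(ZoneInfo("UTC")))  # 2018-11-29 23:00:00+00:00
```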
mordredI've got a patch locally with a static copy of the inventory that we can look at for sake of argument. I'm 99% sure it's safe to push up - does anyone want to double-check it somewhere private before I do?21:30
clarkbcorvus: but that is also likely the longest fix time wise21:30
clarkbmordred: rax sets passwords, if that is done via metadata that might leak out in the inventory?21:31
mordredall the adminPass fields are null21:31
corvusi'll give it a look, you want to put it on bridge?21:31
mordredthey only show it to you the one time in the initial server creation response21:31
mordredcorvus: /home/mordred/static-inventory.yaml on bridge21:31
clarkbmordred: ah21:31
corvusthat's a restless api21:31
mordredalso - we could write a friendlier generation script than what I have there - we don't actually use 99% of those variables21:32
clarkbif we go the static inventory route do we want to use the machine generated inventory applied against our groups.yaml file? or should we just write an inventory that accommodates both things for human and machine consumption21:32
clarkbmordred: ya that21:32
openstackgerritSean McGinnis proposed openstack-infra/project-config master: Add openstack/arch-design  https://review.openstack.org/62101221:33
mordredclarkb: easiest first step is just to have our current groups.yaml plus a really simple file with a server_name: ansible_host: ip_address list21:33
*** dpawlik has quit IRC21:33
mordredI mean- we could even leave out the ansible_host thing and just rely on dns21:33
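A minimal static inventory of the shape mordred describes might look like the following sketch; the hostnames and addresses are made up for illustration, not the real hosts:

```yaml
# Hand-maintained static inventory sketch: one ansible_host entry per
# server, with group membership still coming from groups.yaml.
all:
  hosts:
    mirror01.example.openstack.org:
      ansible_host: 203.0.113.10
    review01.example.openstack.org:
      ansible_host: 203.0.113.11
```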
*** hamerins has joined #openstack-infra21:34
openstackgerritSean McGinnis proposed openstack-infra/project-config master: Add openstack/arch-design  https://review.openstack.org/62101221:34
*** wolverineav has quit IRC21:34
*** wolverineav has joined #openstack-infra21:35
fungii don't think we want to rely on dns for this21:35
fungimultiple instances with the same names, being able to update configuration before dns is in place...21:36
fungialso allows things to keep working even if dns won't resolve for unrelated (or related!) reasons21:37
*** wolverineav has quit IRC21:38
*** wolverineav has joined #openstack-infra21:38
*** markvoelker has quit IRC21:38
*** markvoelker has joined #openstack-infra21:38
corvusis our rax user id at all sensitive?21:39
*** dklyle has quit IRC21:39
clarkbI don't think so.21:40
corvusi kind of doubt it.  but that's the only thing i can think to question.21:40
corvusthe file lgtm.21:40
*** markvoelker has quit IRC21:43
mordredok. I've actually got a slimmed down version21:43
openstackgerritMonty Taylor proposed openstack-infra/system-config master: Switch to a static inventory  https://review.openstack.org/62103121:44
*** rlandy is now known as rlandy|biab21:44
clarkbmordred: the location: block there isn't gonna cause ansible to do any lookups that would fail similarly to inventory generation right? (I don't think so)21:45
clarkbjust double checking that it is info only21:45
mordrednope. it's just a piece of metadata from the shade record that I thought might be useful to us as humans looking at a record21:45
clarkb++21:45
*** kjackal has quit IRC21:46
clarkbI'm willing to give ^ a go. It will change how we launch new servers, which might be a little weird until we get into the practice of that21:46
*** dpawlik has joined #openstack-infra21:46
mordrednow - if we decided to go this route - we probably want to make a script that generates that file decently - I pulled that one from the json in the ansible inventory cache and then did some transforms on it21:46
mordredso consider it 'hand made'21:46
clarkbmordred: and maybe we check in the script not the file?21:47
clarkblike it could just run the regular inventory generation; if that failed, use the last successful result?21:47
mordredwell - we should _definitely_ check in the script ... but I think running it regularly as part of runs doesn't gain much value21:47
*** hamzy_ has quit IRC21:47
mordredif we don't check the inventory in - we should at most just run it after launch-node21:47
mordred(if we're not going to be fully dynamic)21:48
*** jamesmcarthur has quit IRC21:48
mordredbut honestly - I still don't know what I think about this :)21:48
*** jamesmcarthur has joined #openstack-infra21:48
clarkbya I mostly want to avoid needing to launch node without any inventory (so you only get base server), then push a git change, wait for two people to approve it, then be able to run ansible/puppet/docker on your new server21:49
clarkbif we have to do that temporarily for a bit thats fine, and maybe we discover its not that painful21:49
clarkbmordred: what does the ansible cache actually cache?21:50
clarkbI think understanding ^ may help us formulate a plan too. Like maybe its a matter of using the cache more effectively?21:50
mordredclarkb: if you look in ./playbooks/roles/install-ansible/files/inventory_plugins/openstack.py21:51
mordredaround line 19321:52
mordredthat's where the generation sets the cache data21:52
mordredclarkb: maybe what we need is to be able to tell if fail_on_errors caused anything to be skipped21:53
mordredclarkb: and if so, skip the cache.set step21:53
mordredso that we can run during that period on partial data21:55
mordredbut not cache the partial data, so we're sure to get full data once it comes back?21:55
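The "run on partial data now, but never cache it" idea mordred describes can be sketched generically. This is not the actual openstack.py inventory plugin code; the function names and the callable-per-cloud shape are invented for illustration:

```python
import json

def fetch_inventory(clouds):
    """Query every cloud; return (hosts, complete) where complete is False
    if any cloud errored and was skipped (fail_on_errors=False behavior)."""
    hosts, complete = {}, True
    for cloud in clouds:
        try:
            hosts.update(cloud())  # each cloud callable returns {name: vars}
        except Exception:
            complete = False       # partial data: some cloud is down
    return hosts, complete

def get_inventory(clouds, cache_path):
    hosts, complete = fetch_inventory(clouds)
    if complete:
        # Only persist the cache when every cloud answered, so a later run
        # never silently serves a permanently shrunken inventory.
        with open(cache_path, "w") as f:
            json.dump(hosts, f)
    return hosts
```

The run that hits an outage still gets whatever hosts were reachable, while the on-disk cache keeps the last complete view until all clouds come back.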
openstackgerritJames E. Blair proposed openstack-infra/zuul master: Set relative priority of node requests  https://review.openstack.org/61535621:57
*** hamerins has quit IRC21:57
clarkbmriedem: fungi looking at the bot logs it seems that the bot is behind on querying console logs? like it's looking for a failure in neutron-grenade from the 26th against today's index, so that has no results21:57
*** hamerins has joined #openstack-infra21:57
clarkbmriedem: fungi: I think this is a bug in how we check for current results in e-r, not an issue with indexing?21:57
*** rcernin has joined #openstack-infra21:57
mriedemhmm21:58
clarkbthe data is there in the index from a few days ago. My guess is over time we get further and further behind then we start querying newer indexes for older data and never get results and then at that point are forever behind on the bot side21:59
clarkbmordred: ++21:59
fungioh, yeah i looked at the log but didn't notice the time on those events21:59
*** markvoelker has joined #openstack-infra22:01
*** hamerins has quit IRC22:02
*** jamesmcarthur has quit IRC22:03
clarkbbasically we have such a long timeout that things in the queue pile up during a relatively short period and we'll be backlogged long enough that future queries stop working22:04
clarkbits almost like we want a more global timeout rather than timing out per event22:04
clarkbevent comes in, mark that time, then if after 20 minutes from then (regardless of how quickly anything before it went) we don't have results move on22:05
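clarkb's proposal amounts to computing one deadline per event at arrival time, rather than restarting a timeout for each query. A sketch of the shape of the idea follows; this is not the actual elastic-recheck code, and CHECK_DEADLINE, the event dict, and the queries mapping are made-up names:

```python
import time

CHECK_DEADLINE = 20 * 60  # seconds allowed per event, from arrival

def process_event(event, queries, now=time.monotonic):
    """Run queries for an event, but give up CHECK_DEADLINE seconds after
    the event *arrived*, no matter how long earlier events took."""
    deadline = event["arrived_at"] + CHECK_DEADLINE
    results = {}
    for name, run_query in queries.items():
        if now() >= deadline:
            # Backlogged past the index window: stop retrying so one slow
            # stretch can't leave the queue permanently behind.
            break
        results[name] = run_query()
    return results
```

With a per-event-from-arrival deadline, a burst of slow events exhausts its own budget instead of pushing every later event against a fresh 20-minute timer queried over the wrong daily index.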
*** trown is now known as trown|outtypewww22:05
*** yamamoto has joined #openstack-infra22:06
fungii can restart it for now i guess22:10
*** xek has joined #openstack-infra22:11
clarkbya that should reset things22:11
*** manjeets has quit IRC22:15
fungi#status log manually restarted elastic-recheck service on status.openstack.org to clear event backlog22:16
openstackstatusfungi: finished logging22:16
*** yamamoto has quit IRC22:18
*** dklyle has joined #openstack-infra22:19
*** rlandy|biab is now known as rlandy22:21
*** graphene has quit IRC22:23
*** dklyle has quit IRC22:24
openstackgerritJames E. Blair proposed openstack-infra/nodepool master: Support relative priority of node requests  https://review.openstack.org/62095422:26
*** manjeets has joined #openstack-infra22:28
*** rh-jelabarre has quit IRC22:29
*** manjeets has quit IRC22:29
*** manjeets has joined #openstack-infra22:29
*** dklyle has joined #openstack-infra22:32
*** mriedem is now known as mriedem_afk22:33
*** sshnaidm|afk is now known as sshnaidm|off22:37
*** agopi has joined #openstack-infra22:38
*** slaweq has quit IRC22:41
openstackgerritClark Boylan proposed openstack-infra/elastic-recheck master: Better event checking timeouts  https://review.openstack.org/62103822:42
clarkbmriedem_afk: fungi ^ something like that should help22:42
*** tpsilva has quit IRC22:43
*** dklyle has quit IRC22:44
*** xek has quit IRC22:44
clarkbssbarnea|rover: ^ you may be interested in that too (though it doesn't affect the dashboard generation of elastic-recheck, just the IRC and gerrit commenting)22:45
*** jonher has joined #openstack-infra22:45
*** dpawlik has quit IRC22:47
*** kgiusti has left #openstack-infra22:47
*** dpawlik has joined #openstack-infra22:48
*** dpawlik has quit IRC22:48
*** slaweq has joined #openstack-infra22:53
ianwdmsimard: thanks for review :)  i just removed the +w as explained, sorry i should have had it marked as wip as it requires a glean release22:53
*** rkukura_ has joined #openstack-infra22:54
*** rkukura has quit IRC22:57
*** rkukura_ is now known as rkukura22:57
*** slaweq has quit IRC22:57
openstackgerritMerged openstack-infra/project-config master: Remove ansible-role-redhat-subscription from central repo  https://review.openstack.org/61797422:58
openstackgerritMerged openstack-infra/project-config master: add jobs to publish library from governance repo  https://review.openstack.org/61934722:58
openstackgerritIan Wienand proposed openstack/diskimage-builder master: Revert "Make tripleo-buildimage-overcloud-full-centos-7 non-voting"  https://review.openstack.org/62020123:00
openstackgerritIan Wienand proposed openstack/diskimage-builder master: package-installs: provide for skip from env var  https://review.openstack.org/61911923:00
openstackgerritIan Wienand proposed openstack/diskimage-builder master: simple-init: allow for NetworkManager support  https://review.openstack.org/61912023:00
openstackgerritIan Wienand proposed openstack/diskimage-builder master: Revert "Make tripleo-buildimage-overcloud-full-centos-7 non-voting"  https://review.openstack.org/62020123:03
openstackgerritIan Wienand proposed openstack/diskimage-builder master: package-installs: provide for skip from env var  https://review.openstack.org/61911923:03
openstackgerritIan Wienand proposed openstack/diskimage-builder master: simple-init: allow for NetworkManager support  https://review.openstack.org/61912023:03
*** boden has quit IRC23:07
*** lpetrut has quit IRC23:07
*** mgutehal_ has joined #openstack-infra23:08
*** mgutehall has quit IRC23:09
*** agopi has quit IRC23:16
*** eernst has joined #openstack-infra23:17
*** eernst has quit IRC23:22
openstackgerritJames E. Blair proposed openstack-infra/nodepool master: OpenStack: count leaked nodes in unmanaged quota  https://review.openstack.org/62104023:22
*** jamesdenton has quit IRC23:23
*** dhellmann_ has joined #openstack-infra23:26
*** eernst has joined #openstack-infra23:26
*** dhellmann has quit IRC23:26
*** eernst has quit IRC23:27
*** dhellmann_ is now known as dhellmann23:30
*** jamesdenton has joined #openstack-infra23:32
ianwhttp://grafana.openstack.org/d/qzQ_v2oiz/bridge-runtime?orgId=1&from=1543271530069&to=1543377490234 this does seem to be a consistent increase in bridge runtime from around 2018-11-2723:32
openstackgerritJames E. Blair proposed openstack-infra/nodepool master: OpenStack: store ZK records for launch error nodes  https://review.openstack.org/62104323:38
*** pbourke has quit IRC23:46
*** manjeets has quit IRC23:46
*** manjeets has joined #openstack-infra23:46
*** pbourke has joined #openstack-infra23:47
ianwlooks like the missing 1/2 hour is in here -> http://paste.openstack.org/show/736459/23:48
clarkbpuppet unhappy on the arm node?23:49
ianwi'm guessing so, logs on host look weird23:55
ianwNov 29 11:30:19 mirror01 puppet-user[31959]: Compiled catalog for mirror01.nrt1.arm64ci.openstack.org in environment production in 4.82 seconds23:55
ianwlike it starts but then nothing?23:55
ianwoh, wow, a lot of stuck processes23:55
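A quick generic way to confirm the stuck-process diagnosis (nothing OpenAFS-specific here): processes blocked on hung kernel I/O sit in uninterruptible sleep ("D" state), which is also why strace appeared to die, since ptrace attach waits on the same kernel path.

```shell
# Show processes in uninterruptible sleep (state D) plus the kernel
# function they are blocked in -- hung I/O usually shows up here.
ps -eo state,pid,wchan:30,comm | awk 'NR==1 || $1 ~ /^D/'

# Long-hung tasks also leave "Call trace" blocks in the kernel log
# (may need root to read dmesg on some systems):
dmesg | grep -B2 -A10 'Call trace'
```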
ianwinteresting, attach strace to one and now strace is dead23:56
ianwoh dear, i think we have a smoking gun23:57
ianw1582391.992063] Call trace:23:57
ianw[1582391.998695]  afs_linux_raw_open+0x114/0x158 [openafs]23:57
ianw[1582392.008571]  osi_UFSOpen+0xa4/0x1d8 [openafs]23:57
ianwafs has got stuck23:57
ianwi'm rebooting the host23:57
ianwactually23:58
ianw[1241534.846289] print_req_error: I/O error, dev sdb, sector 31493481623:58
ianw[1241534.854136] Aborting journal on device dm-1-8.23:58
ianw[1241534.964702] sd 0:0:0:1: [sdb] tag#79 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE23:58
ianwseems it's not happy in multiple ways23:58
clarkbouch that is a cinder volume?23:59
