Thursday, 2018-11-29

SpamapSWell00:20
SpamapSyou could also have the app for the app things, and then users for permission things.00:20
SpamapSOne reason that sucks is if you're using GHE or private repos, you pay per seat.00:20
SpamapSBut it's just one user. ;)00:20
goernyes, that's what we do.... the user is shared by zuul and other bots00:21
goernand the app is zuul's baby00:21
SpamapSThe whole point of apps though, is so that you get all of that in one package.00:23
SpamapSSo it's pretty frustrating that they can't just be listed in branch protections.00:23
SpamapSYeah I think given GitHub's limitations, the thing to do is make it so GitHub can "click merge" with a real user and not the app user.00:28
SpamapSerr00:28
SpamapSYeah I think given GitHub's limitations, the thing to do is make it so *Zuul* can "click merge" with a real user and not the app user.00:28
corvustristanC: 00:31 < openstackgerrit> James E. Blair proposed openstack-infra/project-config master: Nodepool: add kubernetes provider  https://review.openstack.org/62075600:31
tristanCcorvus: excellent!00:34
corvustristanC: i'd love it if you could look over that one and the depends-on and see if i did it right -- you should be able to see everything about the config except the key.00:35
SpamapSyuuummy00:35
tristanCcorvus: what about the roadmap, without https://review.openstack.org/570668 it won't be usable...00:35
corvusat least, i think the thing i removed is the key :)00:35
clarkbdoes the k8s driver not dynamically create namespaces and pods like nodepool does openstack instances?00:36
corvustristanC: can you drop the depends-on to the openshift change?00:36
clarkbor is that specified more as a label attribute and it will create many of those things00:36
SpamapSWhile we're looking at providers... https://review.openstack.org/535558 is all green and wants some feedback (and has been running in production at Good Money for about 3 months now)00:36
corvusclarkb: it does both of those things00:36
SpamapSIt does have a bug that causes NODE_FAILURE that I haven't had time to look at, but I think it has something to do with quota management.00:37
corvusSpamapS: can we find some way to test it?  it doesn't have to be perfect, but some kind of fake or mock or something that at least lets us have python evaluate some of the code (so that we can refactor, etc) would be great.00:38
clarkbya ok it's label attributes, then it creates many of those things00:38
tristanCcorvus: about https://review.openstack.org/#/c/620755/1/playbooks/templates/clouds/nodepool_kube_config.yaml.j2 , best is to try using it first, e.g. "kubectl get pods" shouldn't return an error00:38
corvustristanC: yep, i ran that locally00:39
tristanCthen 620756 should be good to go00:39
tristanCcorvus: yes we can remove the depends-on openshift change, but can it be reviewed too?00:41
SpamapScorvus: definitely.. betamax should be possible.00:41
SpamapSOr maybe this https://github.com/spulec/moto00:42
tristanCclarkb: the k8s driver creates a new namespace for each node request, e.g. https://git.zuul-ci.org/cgit/nodepool/tree/nodepool/driver/kubernetes/provider.py#n25500:42
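
For illustration only, a minimal sketch of that namespace-per-request idea using the official kubernetes Python client; the naming scheme is an assumption, and the real driver code is in the provider.py linked above:

    from kubernetes import client, config

    def create_request_namespace(request_id):
        # One isolated namespace per node request; the request's pods are
        # then created inside it and the whole namespace can be deleted
        # when the request is done.
        config.load_kube_config()          # reads the nodepool kube config
        core = client.CoreV1Api()
        ns = client.V1Namespace(
            metadata=client.V1ObjectMeta(name="zuul-ci-%s" % request_id))
        return core.create_namespace(ns)
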
corvustristanC: yes, though i'd like to complete the k8s work before landing the openshift work00:42
tristanCcorvus: can you define complete? :)00:43
corvustristanC: all the required parts of the spec landed?00:44
corvustristanC: probably just that zuul change and any bugfixes00:44
clarkbcorvus: I'm reading the kube config file and while I have no reason to believe it is wrong I also don't understand how the admin user at the bottom of the file is associated with the vexxhost cluster. If we added a second cluster would that cluster have to have a different username than admin because the existing cluster already uses the username admin? if so I wonder if k8s would consider that a bug00:45
SpamapS(funny enough, I may not even want the ec2 driver anymore, I might just kubernetes it up... we'll see)00:45
clarkbseems like you should be able to use the same username in any cluster00:45
* SpamapS pokes at moto'ing00:45
clarkbSpamapS: tristanC ^ you may know the answer to that question00:45
corvusclarkb: i do not know the answer; the config is mostly what 'openstack coe config' spat out00:46
tristanCclarkb: you can use a single username in any cluster or any namespace yes00:47
clarkbtristanC: how do you express that in the config file?00:48
clarkbtristanC: as is the config file doesn't map the cluster to the user other than by name00:48
clarkbtristanC: so it would have to be unique or use the same key and cert?00:48
clarkbtristanC: https://review.openstack.org/#/c/620755/1/playbooks/templates/clouds/nodepool_kube_config.yaml.j2 for example00:48
SpamapSclarkb: guessing contexts?00:49
tristanCclarkb: oh right, the 'name' entries in kubectl are just keys, not actual usernames00:49
tristanCclarkb: the username is defined by the auth-data sent to the cluster00:49
clarkbtristanC: gotcha so that is part of the key?00:49
clarkbso we'd have a vexxhost-admin and just update those name keys but key-data stays the same?00:50
tristanCclarkb: maybe? i don't know how k8s works under the hood, but in openshift you can login using just a token, no username required00:50
clarkbwe can make that update if/when there is more than one cluster00:50
SpamapSThere are various auth providers00:50
SpamapSAWS has a token based one00:50
SpamapSGKE has something similar00:50
clarkbah so this may even come down to implementation details. yay00:51
SpamapShttp://paste.openstack.org/show/736344/00:51
* corvus eods00:51
SpamapSthat's my user section for one of my k8s clusters00:51
SpamapSthe username is not actually there, because it's defined by my AWS credentials that aws-iam-authenticator uses.00:52
SpamapSSo it's just a name to refer to in context in the config file as "unique way to get access"00:52
clarkbSpamapS: ya then each user: block would determine the actual details00:53
clarkbmakes sense. I mostly wanted to make sure we didn't need to manage more than one file and it doesn't seem like we do. We'd have to update the existing file if we add another k8s but it should work fine (and we have to do that anyway)00:53
SpamapScorrect00:53
SpamapSyeah most systems that I've used have this exec method that grabs credentials.00:54
SpamapSthat way we can all just have one big happy kube config00:54
SpamapSsimilar to clouds.yaml with secure.yaml having the actual creds00:54
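
To make the clusters/users/contexts relationship above concrete, here is a rough sketch of a multi-cluster kube config built as a Python dict (all names, endpoints and the aws-iam-authenticator arguments are illustrative, not the actual nodepool config):

    import yaml

    kube_config = {
        "apiVersion": "v1",
        "kind": "Config",
        "clusters": [
            {"name": "vexxhost", "cluster": {"server": "https://k8s1.example.com:6443"}},
            {"name": "other-cloud", "cluster": {"server": "https://k8s2.example.com:6443"}},
        ],
        "users": [
            # 'name' is only a local key referenced from a context, not a username.
            {"name": "vexxhost-admin",
             "user": {"client-certificate-data": "<base64>",
                      "client-key-data": "<base64>"}},
            # exec-based auth: credentials are fetched by an external helper at runtime.
            {"name": "other-cloud-admin",
             "user": {"exec": {"apiVersion": "client.authentication.k8s.io/v1beta1",
                               "command": "aws-iam-authenticator",
                               "args": ["token", "-i", "my-cluster"]}}},
        ],
        # a context is what actually ties a cluster to a user
        "contexts": [
            {"name": "vexxhost", "context": {"cluster": "vexxhost", "user": "vexxhost-admin"}},
            {"name": "other-cloud", "context": {"cluster": "other-cloud", "user": "other-cloud-admin"}},
        ],
        "current-context": "vexxhost",
    }

    print(yaml.safe_dump(kube_config, default_flow_style=False))
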
openstackgerritTristan Cacqueray proposed openstack-infra/zuul master: executor: add support for generic build resource  https://review.openstack.org/57066801:07
SpamapSthis moto thing is perfect actually01:19
*** bhavikdbavishi has joined #zuul02:42
*** rlandy|bbl is now known as rlandy03:18
openstackgerritMerged openstack-infra/zuul master: Remove uneeded if statement  https://review.openstack.org/61798404:02
openstackgerritClint 'SpamapS' Byrum proposed openstack-infra/nodepool master: Amazon EC2 driver  https://review.openstack.org/53555804:29
SpamapScorvus: ^ now with a unit test!04:29
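
For anyone curious, a hedged sketch of the general moto pattern for faking EC2 in a unit test (this is not the test in 535558; the mock_ec2 decorator is the moto API of this era, newer moto releases use mock_aws instead):

    import os
    import unittest

    import boto3
    from moto import mock_ec2


    class TestLaunchInstance(unittest.TestCase):
        @mock_ec2
        def test_launch_and_list(self):
            # dummy credentials so botocore never looks for real ones
            os.environ.setdefault("AWS_ACCESS_KEY_ID", "testing")
            os.environ.setdefault("AWS_SECRET_ACCESS_KEY", "testing")
            ec2 = boto3.client("ec2", region_name="us-east-1")
            # moto ships canned AMIs, so no real image id is needed
            image_id = ec2.describe_images()["Images"][0]["ImageId"]
            ec2.run_instances(ImageId=image_id, InstanceType="t2.micro",
                              MinCount=1, MaxCount=1)
            reservations = ec2.describe_instances()["Reservations"]
            self.assertEqual(1, len(reservations[0]["Instances"]))


    if __name__ == "__main__":
        unittest.main()
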
*** bhavikdbavishi has quit IRC05:18
*** bhavikdbavishi has joined #zuul05:18
*** bhavikdbavishi has quit IRC05:30
*** bhavikdbavishi has joined #zuul06:12
*** bjackman has joined #zuul06:20
SpamapShrm06:23
SpamapSupgraded to 3.3.1 and I'm getting no consoles :(06:23
SpamapShttp://paste.openstack.org/show/736353/06:23
SpamapSgah06:28
SpamapSn/m06:28
SpamapSanother backward incompatible change. :-p06:28
tristanCSpamapS: "Header Upgrade is not defined"... what rewrite rules are you using?06:28
SpamapSNo the problem is that somewhere between 3.2.x and 3.3.0 (which is in the release notes) the requirements changed for static offload06:28
SpamapSHonestly.. this is getting pretty frustrating. I don't feel like retooling my deployment every time I do a minor upgrade. :-/06:29
SpamapSI mean, I love the new UI06:29
SpamapSso grateful for all the new stuff06:29
SpamapSbut.. can.. I just change my version pointer.. just once.. and not have to spend 2 hours mopping up? :-P06:30
* SpamapS will stop venting06:30
tristanCSpamapS: static offload had to be changed to support the new html5 urls... and since it's outside zuul's control, there isn't much we can do about it06:38
tristanCperhaps don't offload, that works out of the box06:39
tobiashSpamapS: you do static offload?06:39
*** bhavikdbavishi has quit IRC07:01
*** quiquell|off is now known as quiquell07:07
*** chkumar|away has quit IRC07:13
*** chandan_kumar has joined #zuul07:15
quiquellGood morning07:27
quiquellWhat is the cccp.yaml ?07:27
tobiashquiquell: morning07:55
tobiash?07:55
*** gtema has joined #zuul08:01
quiquelltobiash: Nah, it's not here https://github.com/openstack-infra/zuul/tree/master/doc/source/admin/examples08:02
quiquelltobiash: I think it gets generated after building the images08:02
tobiashquiquell: what file are you referring to? cccp.yaml doesn't mean anything to me ;)08:03
quiquelltobiash: Me neither, will just remove it08:04
quiquelltobiash: thanks anyways08:04
tobiashk08:04
*** bjackman has quit IRC08:10
*** quiquell is now known as quiquell|brb08:13
*** bjackman has joined #zuul08:13
*** bhavikdbavishi has joined #zuul08:18
bjackmanJust noticed from reading the semaphore thread on zuul-discuss that acquisition/release is tied to job start/end. In my ultimate use-case I'd like to be able to acquire/release them in the middle of jobs (specifically, I don't really want to lock test HW until I've made sure the code compiles)08:20
bjackmanJust a thought08:20
bjackman(In this case test HW does not mean the node provided by nodepool, it's a separate resource)08:21
*** hashar has joined #zuul08:24
*** jpena|off is now known as jpena08:31
*** bjackman has quit IRC08:34
*** themroc has joined #zuul08:36
*** bjackman has joined #zuul08:37
*** quiquell|brb is now known as quiquell08:37
*** bhavikdbavishi1 has joined #zuul08:37
*** bhavikdbavishi has quit IRC08:38
*** bhavikdbavishi1 is now known as bhavikdbavishi08:38
bjackmanAnyway Jim Blair said in an email on zuul-discuss that there's been some discussion about what he calls the "bisecting pipeline manager", does anyone know where that discussion took place / if there's a record of it?08:39
bjackman(Mail I'm referring to http://lists.zuul-ci.org/pipermail/zuul-discuss/2018-November/000636.html)08:39
*** pcaruana has joined #zuul08:48
openstackgerritBrendan proposed openstack-infra/zuul master: Fix "reverse" Depends-On detection with new Gerrit URL schema  https://review.openstack.org/62083809:25
*** ssbarnea has quit IRC09:45
*** ssbarnea|rover has joined #zuul09:45
openstackgerritMerged openstack-infra/nodepool master: Update node request during locking  https://review.openstack.org/61880709:50
*** bhavikdbavishi has quit IRC09:55
openstackgerritMerged openstack-infra/nodepool master: Add second level cache of nodes  https://review.openstack.org/61902509:59
openstackgerritMerged openstack-infra/nodepool master: Add second level cache to node requests  https://review.openstack.org/61906909:59
openstackgerritMerged openstack-infra/nodepool master: Only setup zNode caches in launcher  https://review.openstack.org/61944009:59
openstackgerritBenoĆ®t Bayszczak proposed openstack-infra/zuul master: Disable Nodepool nodes lock for SKIPPED jobs  https://review.openstack.org/61326110:23
*** bjackman has quit IRC10:30
*** AJaeger_ has quit IRC10:32
*** electrofelix has joined #zuul11:04
*** AJaeger_ has joined #zuul11:13
openstackgerritTobias Henkel proposed openstack-infra/nodepool master: Asynchronously update node statistics  https://review.openstack.org/61958911:57
*** jpena is now known as jpena|lunch11:59
*** panda|pto is now known as panda12:41
*** gtema has quit IRC12:45
*** sshnaidm|afk is now known as sshnaidm12:51
quiquelltobiash: Do you have an example of a log config for the scheduler? I feel lazy today :-/13:06
quiquelltobiash:  I mean log_config, the yaml files with filters and all13:06
tobiashquiquell: we're still using the ini style, but I'm not at a computer atm13:07
tobiashquiquell: look in system-config repo, there should be a log config too13:08
quiquelltobiash: system-config that was the name I was looking for thanks!13:08
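
For the record, a hedged sketch of the general shape such a log config can take when zuul's log_config option points at a YAML dictConfig file; the logger names and levels here are illustrative, not the contents of system-config:

    import logging
    import logging.config

    import yaml

    LOG_CONFIG_YAML = """
    version: 1
    formatters:
      simple:
        format: '%(asctime)s %(levelname)s %(name)s: %(message)s'
    handlers:
      console:
        class: logging.StreamHandler
        formatter: simple
        level: DEBUG
    loggers:
      zuul:
        handlers: [console]
        level: DEBUG
        propagate: false
      gear:
        handlers: [console]
        level: WARNING
        propagate: false
    root:
      handlers: [console]
      level: INFO
    """

    logging.config.dictConfig(yaml.safe_load(LOG_CONFIG_YAML))
    logging.getLogger("zuul.Scheduler").debug("log config loaded")
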
*** gtema has joined #zuul13:10
*** rlandy has joined #zuul13:11
*** jpena|lunch is now known as jpena13:31
*** j^2 has joined #zuul14:04
*** themroc has quit IRC14:23
*** themroc has joined #zuul14:25
openstackgerritDavid Shrewsbury proposed openstack-infra/nodepool master: Add arbitrary node attributes config option  https://review.openstack.org/62069114:36
Shrewsmordred: s/metadata/attributes/  ^^^14:37
mordredShrews: brilliant!@14:37
*** bhavikdbavishi has joined #zuul14:56
*** themroc has quit IRC15:16
*** themroc has joined #zuul15:16
quiquellIs there any way to exclude specific jobs from configuration in the tenant yaml config?15:23
tobiashquiquell: only all jobs15:32
quiquelltobiash: ack15:32
quiquelltobiash: Maybe I can shadow the ones I don't want ?15:32
*** panda is now known as panda|pto15:34
*** hashar has quit IRC15:37
tobiashquiquell: that is possible, shadowing allows you to override jobs15:38
quiquelltobiash: ok, thanks15:40
corvusquiquell: though you might consider moving the jobs you don't want into a new repo15:40
corvustobiash: why did you add the event.clear in 61958915:41
quiquellcorvus: ack15:41
tobiashcorvus: because otherwise it just updated every second regardless of cache-triggered changes15:41
tobiashjust noticed that today and then read about threading events. Without the clear any subsequent wait effectively doesn't wait at all because the event is still set15:43
corvustobiash: left a couple comments15:43
tobiashcorvus: good points15:45
tobiashcorvus: shall we signal on stop and still keep the timeout as a safety net?15:48
openstackgerritTobias Henkel proposed openstack-infra/nodepool master: Asynchronously update node statistics  https://review.openstack.org/61958915:50
corvustobiash: i like dropping the timeout; less wasted work15:51
tobiashOk, timeout dropped15:54
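
A standalone illustration of the threading.Event point tobiash made above (nothing nodepool-specific):

    import threading
    import time

    event = threading.Event()
    event.set()                      # e.g. a cache update signalled us

    start = time.monotonic()
    event.wait(timeout=5)            # returns immediately, event still set
    print("without clear: waited %.3fs" % (time.monotonic() - start))

    event.clear()                    # reset before the next round
    start = time.monotonic()
    event.wait(timeout=1)            # now actually blocks until the timeout
    print("with clear: waited %.3fs" % (time.monotonic() - start))
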
*** quiquell is now known as quiquell|off15:55
*** bjackman has joined #zuul15:58
*** gtema has quit IRC16:23
*** themroc has quit IRC16:23
openstackgerritJames E. Blair proposed openstack-infra/nodepool master: Support relative priority of node requests  https://review.openstack.org/62095416:37
corvustobiash, Shrews: ^ that's the nodepool half of that.  i'll finish up the zuul change now.16:38
tobiashCool, looking after dinner :)16:40
pabelangerOooh, I'm going to review that too! Totally interested in the code16:40
*** bjackman has quit IRC16:55
openstackgerritJames E. Blair proposed openstack-infra/zuul master: Set relative priority of node requests  https://review.openstack.org/61535617:06
corvustobiash, Shrews, pabelanger: ^ that should be ready now too17:06
corvusclarkb: ^17:06
pabelangerlooks17:08
clarkbcorvus: great. I'm listening to an openci meeting now but will review after17:09
corvuspabelanger: can you look at https://review.openstack.org/620691 and make sure it will work for zones?17:10
pabelangersure17:10
corvuspabelanger: feel free to +W if it works for you17:12
*** dkehn_ has joined #zuul17:15
*** bhavikdbavishi has quit IRC17:15
*** dkehn has quit IRC17:15
*** dkehn_ is now known as dkehn17:15
*** bhavikdbavishi has joined #zuul17:16
pabelanger+A17:17
pabelangerThanks!17:17
pabelangerI'll update zuul patch now17:17
tobiashcorvus: re nodepool relative prios, we have the case that we have many providers that decline because they don't support the label (static providers). In this case there will be much lock contention leading to skipped node requests in the loop. So I think we should make sure that the loop is frequently restarted from the beginning with a fresh sorted  list.17:44
openstackgerritJames E. Blair proposed openstack-infra/zuul master: More strongly recommend the simple reverse proxy deployment  https://review.openstack.org/62096917:44
tobiashcorvus: but I think in combination with 610029 that might be good enough17:45
corvustobiash: does relative_priority alter that behavior?  -- is that something we need to address in the relative_priority change, or can that be a followup?17:46
corvusmordred: based on the feedback from SpamapS earlier, i worked on 620969, but i noticed that we suggest static offloading for white-labeling because white-label without static offload is said to be complex.  you were talking about dropping static-offload for openstack -- were you thinking of doing that while retaining white-label?  should we document that case?17:48
tobiashcorvus: no, I just wanted to mention this as our lock contention will circumvent the relative priority without 61002917:48
tobiashcorvus: +2 from me17:50
corvustobiash: ah yes.  i agree they will be better together and we should merge 029 shortly.  can you explain what the 'yield' does there?17:50
tobiashcorvus: I did that based on Simon's first comment17:52
*** jpena is now known as jpena|off17:53
corvustobiash: that makes sense now17:54
corvusi'll leave the approval of https://review.openstack.org/610029 to Shrews17:55
tobiashcorvus: but I think the yield is not good anymore with the relative priority as it just continues the loop while we now actually want to restart the loop17:55
corvustobiash: agreed that would be better, but not a fatal flaw.  do you want to do a followup after they both merge, or rebase 029 on relative_priority?17:57
tobiashcorvus: I think I'll rebase on relative priority17:57
corvustobiash: ok.  better wip it then, it has a lot of +2s17:58
openstackgerritPaul Belanger proposed openstack-infra/zuul master: Add support for zones in executors  https://review.openstack.org/54919718:05
pabelangertobiash: corvus: clarkb: mordred: ^now updated for node-attributes logic18:06
*** bhavikdbavishi has quit IRC18:13
openstackgerritTobias Henkel proposed openstack-infra/nodepool master: Ensure that completed handlers are removed frequently  https://review.openstack.org/61002918:13
*** bhavikdbavishi1 has joined #zuul18:13
openstackgerritTobias Henkel proposed openstack-infra/nodepool master: Ensure that completed handlers are removed frequently  https://review.openstack.org/61002918:15
*** bhavikdbavishi1 is now known as bhavikdbavishi18:15
tobiashcorvus: that should do it ^18:16
tobiashpabelanger: while you're at it, do you want to address my doc comment of PS18?18:18
tobiashpabelanger: it's not provider anymore ;)18:18
*** bhavikdbavishi has quit IRC18:19
pabelanger+118:22
sshnaidmwhere can I see possible zuul statuses? like SUCCESS, FAILURE, TIMED_OUT...?18:29
corvussshnaidm: i don't think the whole list is enumerated anywhere18:32
pabelangertobiash: provider pool? is that better?18:38
tobiashpabelanger: maybe just zone as this is what you actually configure?18:38
pabelangersure18:38
tobiashpabelanger: '... that have nodes with the specified executor-zone attribute...'18:39
tobiashpabelanger: how about this ^ ?18:39
pabelangersure18:40
tobiashcorvus: I've commented on 615356, what do you think?18:44
corvustobiash: yep18:45
openstackgerritPaul Belanger proposed openstack-infra/zuul master: Clarify executor zone documentation  https://review.openstack.org/62098918:51
pabelangertobiash: ^18:51
tobiashpabelanger: thanks18:52
tobiash+2 on both18:52
pabelangerthank you!18:56
Shrewstobiash: i don't think i'm clear on the need for the kazoo election recipe (though docs are horrible). seems like only a single thread is involved in the election, so what purpose is that serving?18:57
tobiashShrews: there may be multiple launcher processes involved in the election18:57
Shrewstobiash: oh right. external scaling. what happens if the leader is stopped? will another take it's place?18:58
Shrewss/it's/its18:58
tobiashShrews: yes, that's how the leader election works18:58
tobiashas soon as the leader finishes its function or loses its session a new leader is elected18:59
Shrewstobiash: ok, so election.run() basically blocks forever while it is not the leader19:00
tobiashyes19:00
Shrewstobiash: gotcha. thx19:00
tobiash:)19:00
Shrewsit wasn't clear to me if it returned when it LOST the election, but that makes sense19:00
tobiashyeah I like that concept19:01
*** hashar has joined #zuul19:06
Shrewstobiash: what if the leader gets an exception and the thread stops, but the zk session (which is shared among threads) remains active?19:08
clarkbShrews: if the thread stops the connection should break allowing another process to become leader?19:09
clarkbthough I guess if its a shared session maybe that doesn't happen19:09
tobiashShrews: if the leader returns from the function that is triggered by the election process a new leader is elected19:09
clarkbmight want to use a connection/session specific to elections19:09
Shrewshow does the connection break? it's a shared connection19:09
clarkbShrews: well it could be some logical action over the top of the connection too19:10
clarkbtcp might still be happy though19:10
tobiashShrews: so throwing an exception would lead to a new leader election as well19:10
Shrewstobiash: *nod*19:11
*** electrofelix has quit IRC19:12
tobiashShrews: that leader election is much simpler than I thought: https://kazoo.readthedocs.io/en/latest/_modules/kazoo/recipe/election.html#Election.run19:16
tobiashIt just gets a lock and runs the leader function while holding the lock; as soon as this returns (no matter how) the lock is released and the next one in the election gets the lock19:17
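
For reference, a minimal sketch of the kazoo election recipe being discussed; the ZooKeeper address and election path are made up:

    import socket

    from kazoo.client import KazooClient

    zk = KazooClient(hosts="127.0.0.1:2181")
    zk.start()


    def run_as_leader():
        # Runs only while this contender holds the election lock.  Returning
        # (or raising) releases the lock and lets the next contender in line
        # become the leader.
        print("elected leader, doing the periodic work now")


    election = zk.Election("/nodepool/leader/stats", socket.gethostname())
    election.run(run_as_leader)   # blocks until elected, then calls the function
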
openstackgerritJeremy Stanley proposed openstack-infra/zuul-website master: Revert "Add a promotional message banner and events list"  https://review.openstack.org/62099519:38
*** sshnaidm is now known as sshnaidm|afk19:56
SpamapStobiash: yeah I do static offload because when I set up my zuul cherrypy was totally screwed up and not serving static content.20:12
SpamapS(I think actually the problem was the yarn hooks or something.. anyway.. it's fixed now)20:12
SpamapSAnd also because of what corvus mentioned.. it was suggested as the simpler path (which it is).20:13
SpamapSI am sure I'll find a multi-tenant case at some point, but thus far, I've always been single tenant.20:13
openstackgerritMerged openstack-infra/nodepool master: Add arbitrary node attributes config option  https://review.openstack.org/62069120:14
openstackgerritMerged openstack-infra/nodepool master: Asynchronously update node statistics  https://review.openstack.org/61958921:00
openstackgerritMerged openstack-infra/zuul-website master: Revert "Add a promotional message banner and events list"  https://review.openstack.org/62099521:00
openstackgerritJames E. Blair proposed openstack-infra/nodepool master: Fix updateFromDict overrides  https://review.openstack.org/62100821:07
corvustobiash, Shrews: ^ just noticed that wee typo.21:07
Shrewsoops21:12
pabelanger+321:13
pabelangercorvus: tobiash: Shrews: I have a provider in nodepool that seems to be having some quota issues. Right now my tenant, vexxhost-mtl1, is full of nodes in ERROR state. But they don't seem to be reflected in the openstack quota from the provider. I wonder if we could somehow try to account for the error instances in the unmanagedQuotaUsed() call:21:35
pabelangerhttp://git.openstack.org/cgit/openstack-infra/nodepool/tree/nodepool/driver/openstack/provider.py#n17621:35
pabelangerbecause right now, I have jobs that are wedged waiting for nodes to free up in the provider21:36
pabelangerand nodepool won't try another provider it seems21:36
clarkbpabelanger: the issue is those nodes actually count against your quota21:36
clarkband you will likely be charged for them so I don't think nodepool should ignore them21:36
pabelangerbut for some reason, openstack api says there is quota available21:37
pabelangerbut really, there isn't21:37
corvuspabelanger: why won't nodepool try another provider?21:37
pabelangercorvus: I don't know, still trying to figure that out21:37
clarkbpabelanger: they may not count against cpu/memory/disk etc quota if they never started but they should count against instance quota. I have definitely seen error instances count against instance quota on clouds21:38
*** rlandy is now known as rlandy|biab21:44
dkehnclarkb: FYI the lastest update did fix the .ssh/known_host issue21:52
clarkbdkehn: great21:53
openstackgerritJames E. Blair proposed openstack-infra/zuul master: Set relative priority of node requests  https://review.openstack.org/61535621:57
pabelangercorvus: clarkb: looking at the log, I can see new node requests being created each time vexxhost-mtl fails, then a few pools decline the new node request, because they don't have the right flavor. But of the 4 provider pools that do have the flavor, vexxhost-mtl is always the first to fulfill the node request, and I don't know why that is21:57
pabelangerso, the 3 behind vexxhost-mtl, never get a chance to try and fulfill it21:58
pabelangerleading to stuck jobs in zuul21:59
clarkbpabelanger: the other three should get it after mtl1 fails 3 times?21:59
pabelangerclarkb: no, we delete the node request it looks like, and create it new22:00
clarkbmtl1 should decline the request after three errors, that gives the other regions opportunity to pick it up22:00
clarkbwe being nodepool?22:00
clarkbthat seems like a bug22:00
pabelanger1 sec, let me get a log22:00
pabelangeryah, nodepool22:00
corvusnodepool can delete min-ready node requests, but otherwise doesn't delete requests.22:03
dmsimardclarkb, SpamapS, corvus: Has there been any traction on reproducing jobs outside Zuul ? I wasn't in Berlin so I may have missed out.22:11
dmsimardI asked about this briefly before Berlin: http://eavesdrop.openstack.org/irclogs/%23zuul/%23zuul.2018-11-06.log.html#t2018-11-06T17:39:0522:13
*** manjeets has quit IRC22:15
pabelangercorvus: clarkb: okay, I am looking at the right node requests now:22:16
pabelanger2018-11-29 20:58:35,195 DEBUG nodepool.PoolWorker.vexxhost-ansible-network-mtl1-v2-highcpu-1: Active requests: ['300-0000331819', '300-0000331925']22:16
corvusdmsimard: in the conversation you linked, i noted who was working on it; maybe ask them?22:16
pabelangercorvus: clarkb: it seems vexxhost-mtl is retrying indefinitely because we hit a quota exception22:17
pabelangertrying to see why these are not failing over the other providers now22:18
dmsimardcorvus: not a terrible idea :/22:19
*** rlandy|biab is now known as rlandy22:21
corvuspabelanger, Shrews, tobiash, clarkb: https://review.openstack.org/536930 is the change which introduced the behavior which is vexing pabelanger now22:22
corvuspabelanger: however, it shouldn't retry indefinitely, it should pause22:23
corvuspabelanger: (which means you should end up with one stuck request, and others continue)22:23
pabelangerlooking22:23
corvuspabelanger: wait, you're not getting a quota exception though, are you?  the way you described it, it sounded like it was nodepool hitting quota limits and not trying to boot nodes...22:24
corvusi'm no longer certain 536930 is relevant22:24
pabelangercorvus: no, I am. I was tracing the wrong node request22:24
corvuspabelanger: maybe you could paste the logs?22:24
pabelangeryup22:24
pabelanger1 min22:24
openstackgerritJames E. Blair proposed openstack-infra/nodepool master: Support relative priority of node requests  https://review.openstack.org/62095422:26
pabelangercorvus: https://pabelanger.fedorapeople.org/launcher.log22:27
pabelangerI was tracing 300-000033192522:27
pabelangerwhich leads to an aborted node for 000031866022:28
*** manjeets has joined #zuul22:28
*** manjeets has quit IRC22:29
*** manjeets has joined #zuul22:29
corvuspabelanger: okay, so 536930 does describe the behavior you're seeing, and nodepool is not pausing because nova is incorrectly reporting quota usage (it's not reporting the quota is used, but if you try to launch a node, it fails with a quota error)22:33
pabelangerdo you know why it is not trying another provider pool? for example vexxhost-sjc1 should have space22:34
corvusto be honest, i don't see a way to resolve this other than going back to the original behavior (where unexpected quota errors counted against the three strikes which eventually caused a node-request decline).  but in tobiash's environment, an unexpected quota error is common enough for that to be a problem.22:34
corvuspabelanger: the only way for another provider to try the request is for this provider to decline it.  that was the old behavior.  tobiash wanted it changed so that quota errors weren't fatal.22:35
corvus(mind you, the underlying problem is openstack lying; fix that and there's no issue.  the issue we're grappling with is how to work around that.)22:36
*** sshnaidm|afk is now known as sshnaidm|off22:37
pabelangercorvus: okay, I understand now22:37
pabelangerunless we start tracking quota retries, and decline after x attempts22:38
corvuspabelanger: that was the old behavior22:38
pabelangeryah22:38
pabelangerokay, will wait until tobiash is back online to better understand why that broke him22:38
clarkbmy understanding is that nova at least only periodically updates the quota usage for what you see via the api22:39
pabelangerbut yes, in this case openstack is at fault22:39
clarkbhowever on every boot it seems to know an up to date number? seems like it could give you a valid number on every api request too22:39
corvuspabelanger: yeah, let's discuss with tobiash.  i think we need to do something -- the possibility for nodepool to wedge is not acceptable.  maybe we can come up with something other than reverting to the original algorithm.22:41
corvusclarkb: it sounds like this has been going on for some time for pabelanger -- so maybe it's not updating the quota at all?22:42
clarkbor it has gotten out of sync somehow (we've also seen that)22:42
pabelangercorvus: wfm, for now I can disable the provider, so no rush.22:43
corvuspabelanger: do those error nodes show up in openstack server list?22:46
corvusit actually looks like we ignore the quota used values from openstack; instead we only use the max values, and then perform our own subtraction.  so if nodepool doesn't see those servers (or doesn't have them associated with a flavor), it won't see them as impacting the quota...22:47
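
A back-of-the-envelope illustration of why that matters (all numbers invented): if nodepool derives "used" only from servers it can account for, ERROR'd servers it lost track of make it overestimate the headroom, and nova then rejects the launch.

    max_instances = 10        # quota limit reported by nova
    nodepool_nodes = 4        # servers nodepool has ZooKeeper records for
    unmanaged_servers = 2     # non-nodepool servers visible in the tenant
    leaked_error_servers = 4  # ERROR servers with no ZooKeeper record

    assumed_free = max_instances - nodepool_nodes - unmanaged_servers
    actual_free = assumed_free - leaked_error_servers
    print("nodepool thinks %d slots are free, nova will only allow %d"
          % (assumed_free, actual_free))
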
pabelangercorvus: yes, they show up in ERROR22:48
corvuspabelanger: are there zk node entries for them?22:48
pabelangercorvus: I'd have to ask tristanC or nhicher to look for that, I don't have access myself22:49
corvuspabelanger: you... have logs?22:49
pabelangerI'll ask for them22:49
pabelangerI don't have access to the production servers, just a user22:49
corvuspabelanger: can you get the output of 'nodepool list'.  the /node-list endpoint of the nodepool web app would have it22:52
pabelangerlet me see22:54
mordredhttps://softwarefactory-project.io/zuul/t/ansible-network/nodes <-- I'm only seeing 2 nodes in mtl122:55
pabelangeryah, I don't believe nodepool-web is exposed22:56
pabelangerdmsimard: around to run the commands corvus asked for?22:57
dmsimardo/22:58
dmsimardlet me see22:58
corvuspabelanger, mordred, dmsimard: we'll need the nodepool ids of the error nodes (should be in the nova metadata), and then the nodepool logs for those ids22:58
pabelangerhttp://paste.openstack.org/show/736454/22:58
pabelangeris openstack server show of an ERROR node22:59
corvuspabelanger, dmsimard: great, so nodepool logs of 0000318690 would be helpful22:59
pabelangerlet me check if that is in the copy I have, I pruned the posted one23:00
dmsimardcorvus: logs http://paste.openstack.org/show/736455/23:00
corvusit's starting to look like openstack is providing the right quota info, but nodepool isn't subtracting out those error nodes because it doesn't have zk records for them23:00
corvusdmsimard: can you confirm there aren't any other interesting log lines between the last two?23:01
corvus(ie, an exception or something)23:01
pabelanger0000318690 shows up in https://pabelanger.fedorapeople.org/launcher.log23:01
corvusit's only a few microseconds, so shouldn't be much there23:01
clarkbthat's curious, nodepool should keep records for error'd nodes until they get deleted23:01
clarkb(we even show the instance after delete to make sure it actually deleted iirc)23:02
pabelangerhttp://paste.openstack.org/show/736456/23:02
*** hashar has quit IRC23:02
pabelangerthat is the full list of ERROR nodes, FYI23:02
corvusoh... look closer at the history for 318690 -- it reused the node id after the initial error23:03
dmsimardcorvus: there's a few traces: http://paste.openstack.org/show/736457/23:04
corvusit didn't think a server was created, so it just launched it again, and overwrote the zk record.  but a server really was created in attempt #1.23:04
dmsimardIt's hard to correlate that last trace with anything specific23:04
clarkbcorvus: ah so that might be the openstack bug23:05
clarkb(incorrect return to create request?)23:05
dmsimard"keystoneauth1.exceptions.http.Unauthorized" is 1384 times in the logs but I'm not sure for which cloud or provider it is.23:05
corvusmordred: maybe you can help here -- look at the first exception in http://paste.openstack.org/show/736457/23:06
pabelangerdmsimard: that looks to be limestone, I need to see why23:06
corvusmordred, clarkb: nodepool is getting an exception from openstacksdk and thinks there's no server, but there really is a server in nova -- the error message even gives us the id23:06
corvus(that uuid matches up with pabelanger's paste in http://paste.openstack.org/show/736456/ )23:07
dmsimardI need to step away for dinner, I'll be back later23:07
pabelangerdmsimard: thanks for help23:07
corvusmordred: the issue is that http://git.zuul-ci.org/cgit/nodepool/tree/nodepool/driver/openstack/handler.py#n125 raises an exception; we only save the external id of the server if it succeeds (see line 142).23:14
corvusso basically, every server create that hits an error on the initial call ends up leaked23:15
pabelangerah, nice debugging23:17
pabelangercorvus: do we still need to look in zookeeper or does https://softwarefactory-project.io/zuul/t/ansible-network/nodes give the proper information?23:21
openstackgerritJames E. Blair proposed openstack-infra/nodepool master: OpenStack: count leaked nodes in unmanaged quota  https://review.openstack.org/62104023:22
corvuspabelanger: that's all the info we need.  you should never need to talk to zookeeper directly.23:23
corvuspabelanger, tobiash, clarkb: ^ 621040 may help mitigate this situation23:23
clarkbcorvus: that fix should work, but would it be better to record the leaked nodes so they can be deleted?23:23
corvusclarkb: yes.  i do not know how to do that.  i was hoping mordred could help.23:24
pabelangercorvus: ack, thanks23:24
corvuspabelanger: do you know which version of openstacksdk you're using?23:25
pabelangerokay, cool. that kinda lines up with my initial thought, updating unmanagedQuotaUsed to account for the leak :)23:25
pabelangercorvus: I don't think so, let me check but tristanC will know23:25
pabelangerI think that is pulled in via RPM23:26
corvusline 6831 in openstackcloud.py is the middle of a comment on master; so it's a bit hard to line up that traceback23:26
corvusclarkb: oh nice -- the error string in that exception is generated/formatted by openstacksdk.  the server id is an attribute on openstack.cloud.exc.OpenStackCloudCreateException23:28
corvusso we should be able to catch that and snag the server id and keep the zk record around23:29
clarkbcorvus: yup I'm writing that change now23:29
pabelangercorvus: educated guess for openstacksdk is 0.19.0, that is just looking at the spec file however. Not sure if that is the version running23:29
pabelanger0.17.2 was the version before that23:29
corvusclarkb: oh so was i.  how many lines have you written? :)23:30
clarkbcorvus: uh the try except with self.node.state = zk.FAILED and set the uuid and zk.storeNode? oh and reraise23:31
clarkbmine is sort of handwavy though with placeholders for actual values (like I don't know what the actual server id attribute is yet)23:31
clarkbif you've got specifics sorted out then you should go for it23:31
corvusclarkb: i'll do it; you review :)23:33
clarkbok23:33
clarkbI'll continue sorting out mine because then I'll be able to review yours easily :)23:33
clarkbhttp://paste.openstack.org/show/736458/ I think that is functional23:35
clarkb(but totally untested)23:36
clarkboh the if e.uuid should be if e.resource_id23:36
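
A hedged sketch of the idea being kicked around here (not the code in either change under review; the zk helper names are taken from the discussion above and may not match the real driver exactly): catch openstacksdk's create exception, record the half-created server's id on the node record so cleanup can delete it, then re-raise.

    from nodepool import zk
    from openstack.cloud import exc


    def create_server_or_record_leak(cloud, zk_conn, node, **create_kwargs):
        try:
            return cloud.create_server(**create_kwargs)
        except exc.OpenStackCloudCreateException as e:
            if e.resource_id:
                node.external_id = e.resource_id   # server id reported in the error
                node.state = zk.FAILED
                zk_conn.storeNode(node)            # keep a ZK record for cleanup
            raise
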
openstackgerritJames E. Blair proposed openstack-infra/nodepool master: OpenStack: store ZK records for launch error nodes  https://review.openstack.org/62104323:38
clarkbah ok yours is a bit cleaner by relying on the callee to store the failed node23:39
corvusclarkb: yeah-- and actually i think we want to do that after a closer reading of http://git.zuul-ci.org/cgit/nodepool/tree/nodepool/driver/openstack/handler.py#n24423:39
clarkbcorvus: one thing I did was move image and hostname up so that we'd have that data in the record we store (which might help a human debug later)23:39
corvusclarkb: it actually generates a *new* zk node23:39
corvusclarkb: i almost did that too until i realized what line 244+ was doing23:40
clarkboh ya huh23:40
clarkbI mean with the uuid you should be able to figure it out23:40
corvus(so the leaked instance gets a new node, and the original node gets reused)23:40
clarkbso probably not a major concern23:40
corvustbh, i think we may want to revisit that -- the metadata of the leaked node will refer to a different node id.23:41
corvusbut that's going to be a bigger change23:42
corvuser, leaked isn't the right word -- error node i guess23:42
pabelangercool, I've +2 both too. in case mordred wants to review23:43
pabelangercorvus: thanks for helping debug this today23:43
clarkbcorvus: ya I would expect that we'd create a new node for the retry not for tracking the delete23:43
corvuspabelanger: thank you for bringing it up.  and dmsimard for pitching in :)23:43
clarkbin any case I much prefer this fix to the first workaround :)23:43
corvusclarkb: oh, i think we can have both23:44
corvusthe second will make the first one mostly a noop23:44
corvusbut it's still safe to have in case there are other leaks23:44
clarkbya the first is more belt and suspenders. I just didn't want to only rely on that first one23:44
corvus++23:44
pabelangerwfm23:45
*** manjeets has quit IRC23:46
*** manjeets has joined #zuul23:46
