Thursday, 2018-11-29

SpamapSWell00:20
SpamapSyou could also have the app for the app things, and then users for permission things.00:20
SpamapSOne reason that sucks is if you're using GHE or private repos, you pay per seat.00:20
SpamapSBut it's just one user. ;)00:20
goernyes, that's what we do.... the user is shared by zuul and other bots00:21
goernand the app is zuul's baby00:21
SpamapSThe whole point of apps though, is so that you get all of that in one package.00:23
SpamapSSo it's pretty frustrating that they can't just be listed in branch protections.00:23
SpamapSYeah I think given GitHub's limitations, the thing to do is make it so GitHub can "click merge" with a real user and not the app user.00:28
SpamapSerr00:28
SpamapSYeah I think given GitHub's limitations, the thing to do is make it so *Zuul* can "click merge" with a real user and not the app user.00:28
corvustristanC: 00:31 < openstackgerrit> James E. Blair proposed openstack-infra/project-config master: Nodepool: add kubernetes provider  https://review.openstack.org/62075600:31
tristanCcorvus: excellent!00:34
corvustristanC: i'd love it if you could look over that one and the depends-on and see if i did it right -- you should be able to see everything about the config except the key.00:35
SpamapSyuuummy00:35
tristanCcorvus: what about the roadmap, without https://review.openstack.org/570668 it won't be usable...00:35
corvusat least, i think the thing i removed is the key :)00:35
clarkbdoes the k8s driver not dynamically create namespaces and pods like nodepool does openstack instances?00:36
corvustristanC: can you drop the depends-on to the openshift change?00:36
clarkbor is that specified more as a label attribute and it will create many of those things00:36
SpamapSWhile we're looking at providers... https://review.openstack.org/535558 is all green and wants some feedback (and has been running in production at Good Money for about 3 months now)00:36
corvusclarkb: it does both of those things00:36
SpamapSIt does have a bug that causes NODE_FAILURE that I haven't had time to look at, but I think it has something to do with quota management.00:37
corvusSpamapS: can we find some way to test it?  it doesn't have to be perfect, but some kind of fake or mock or something that at least lets us have python evaluate some of the code (so that we can refactor, etc) would be great.00:38
clarkbya ok it's label attributes, then it creates many of those things00:38
tristanCcorvus: about https://review.openstack.org/#/c/620755/1/playbooks/templates/clouds/nodepool_kube_config.yaml.j2 , best is to try using it first, e.g. "kubectl get pods" shouldn't return an error00:38
corvustristanC: yep, i ran that locally00:39
tristanCthen 620756 should be good to go00:39
tristanCcorvus: yes we can remove the depends-on openshift change, but can it be reviewed too?00:41
SpamapScorvus: definitely.. betamax should be possible.00:41
SpamapSOr maybe this https://github.com/spulec/moto00:42
tristanCclarkb: the k8s driver creates a new namespace for each node request, e.g. https://git.zuul-ci.org/cgit/nodepool/tree/nodepool/driver/kubernetes/provider.py#n25500:42
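
For illustration only, a minimal sketch of that namespace-per-request idea using the official kubernetes Python client; the naming scheme is an assumption, and the real driver code is in the provider.py linked above:

    from kubernetes import client, config

    def create_request_namespace(request_id):
        # One isolated namespace per node request; the request's pods are
        # then created inside it and the whole namespace can be deleted
        # when the request is done.
        config.load_kube_config()          # reads the nodepool kube config
        core = client.CoreV1Api()
        ns = client.V1Namespace(
            metadata=client.V1ObjectMeta(name="zuul-ci-%s" % request_id))
        return core.create_namespace(ns)
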
corvustristanC: yes, though i'd like to complete the k8s work before landing the openshift work00:42
tristanCcorvus: can you define complete? :)00:43
corvustristanC: all the required parts of the spec landed?00:44
corvustristanC: probably just that zuul change and any bugfixes00:44
clarkbcorvus: I'm reading the kube config file and while I have no reason to believe it is wrong I also don't understand how the admin user at the bottom of the file is associated with the vexxhost cluster. If we added a second cluster would that cluster have to have a different username than admin because the existing cluster already uses the username admin? if so I wonder if k8s would consider that a bug00:45
SpamapS(funny enough, I may not even want the ec2 driver anymore, I might just kubernetes it up... we'll see)00:45
clarkbseems like you should be able to use the same username in any cluster00:45
* SpamapS pokes at moto'ing00:45
clarkbSpamapS: tristanC ^ you may know the answer to that question00:45
corvusclarkb: i do not know the answer; the config is mostly what 'openstack coe config' spat out00:46
tristanCclarkb: you can use a single username in any cluster or any namespace yes00:47
clarkbtristanC: how do you express that in the config file?00:48
clarkbtristanC: as is the config file doesn't map the cluster to the user other than by name00:48
clarkbtristanC: so it would have to be unique or use the same key and cert?00:48
clarkbtristanC: https://review.openstack.org/#/c/620755/1/playbooks/templates/clouds/nodepool_kube_config.yaml.j2 for example00:48
SpamapSclarkb: guessing contexts?00:49
tristanCclarkb: oh right, the 'name' entries in kubectl are just keys, not actual usernames00:49
tristanCclarkb: the username is defined by the auth-data sent to the cluster00:49
clarkbtristanC: gotcha so that is part of the key?00:49
clarkbso we'd have a vexxhost-admin and just update those name keys but key-data stays the same?00:50
tristanCclarkb: maybe? i don't know how k8s works under the hood, but in openshift you can login using just a token, no username required00:50
clarkbwe can make that update if/when there is more than one cluster00:50
SpamapSThere are various auth providers00:50
SpamapSAWS has a token based one00:50
SpamapSGKE has something similar00:50
clarkbah so this may even come down to implementation details. yay00:51
SpamapShttp://paste.openstack.org/show/736344/00:51
* corvus eods00:51
SpamapSthat's my user section for one of my k8s clusters00:51
SpamapSthe username is not actually there, because it's defined by my AWS credentials that aws-iam-authenticator uses.00:52
SpamapSSo it's just a name to refer to in context in the config file as "unique way to get access"00:52
clarkbSpamapS: ya then each user: block would determine the actual details00:53
clarkbmakes sense. I mostly wanted to make sure we didn't need to manage more than one file and it doesn't seem like we do. We'd have to update the existing file if we add another k8s but it should work fine (and we have to do that anyway)00:53
SpamapScorrect00:53
SpamapSyeah most systems that I've used have this exec method that grabs credentials.00:54
SpamapSthat way we can all just have one big happy kube config00:54
SpamapSsimilar to clouds.yaml with secure.yaml having the actual creds00:54
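
To make the clusters/users/contexts relationship above concrete, here is a rough sketch of a multi-cluster kube config built as a Python dict (all names, endpoints and the aws-iam-authenticator arguments are illustrative, not the actual nodepool config):

    import yaml

    kube_config = {
        "apiVersion": "v1",
        "kind": "Config",
        "clusters": [
            {"name": "vexxhost", "cluster": {"server": "https://k8s1.example.com:6443"}},
            {"name": "other-cloud", "cluster": {"server": "https://k8s2.example.com:6443"}},
        ],
        "users": [
            # 'name' is only a local key referenced from a context, not a username.
            {"name": "vexxhost-admin",
             "user": {"client-certificate-data": "<base64>",
                      "client-key-data": "<base64>"}},
            # exec-based auth: credentials are fetched by an external helper at runtime.
            {"name": "other-cloud-admin",
             "user": {"exec": {"apiVersion": "client.authentication.k8s.io/v1beta1",
                               "command": "aws-iam-authenticator",
                               "args": ["token", "-i", "my-cluster"]}}},
        ],
        # a context is what actually ties a cluster to a user
        "contexts": [
            {"name": "vexxhost", "context": {"cluster": "vexxhost", "user": "vexxhost-admin"}},
            {"name": "other-cloud", "context": {"cluster": "other-cloud", "user": "other-cloud-admin"}},
        ],
        "current-context": "vexxhost",
    }

    print(yaml.safe_dump(kube_config, default_flow_style=False))
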
openstackgerritTristan Cacqueray proposed openstack-infra/zuul master: executor: add support for generic build resource  https://review.openstack.org/57066801:07
SpamapSthis moto thing is perfect actually01:19
*** bhavikdbavishi has joined #zuul02:42
*** rlandy|bbl is now known as rlandy03:18
openstackgerritMerged openstack-infra/zuul master: Remove uneeded if statement  https://review.openstack.org/61798404:02
openstackgerritClint 'SpamapS' Byrum proposed openstack-infra/nodepool master: Amazon EC2 driver  https://review.openstack.org/53555804:29
SpamapScorvus: ^ now with a unit test!04:29
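
For anyone curious, a hedged sketch of the general moto pattern for faking EC2 in a unit test (this is not the test in 535558; the mock_ec2 decorator is the moto API of this era, newer moto releases use mock_aws instead):

    import os
    import unittest

    import boto3
    from moto import mock_ec2


    class TestLaunchInstance(unittest.TestCase):
        @mock_ec2
        def test_launch_and_list(self):
            # dummy credentials so botocore never looks for real ones
            os.environ.setdefault("AWS_ACCESS_KEY_ID", "testing")
            os.environ.setdefault("AWS_SECRET_ACCESS_KEY", "testing")
            ec2 = boto3.client("ec2", region_name="us-east-1")
            # moto ships canned AMIs, so no real image id is needed
            image_id = ec2.describe_images()["Images"][0]["ImageId"]
            ec2.run_instances(ImageId=image_id, InstanceType="t2.micro",
                              MinCount=1, MaxCount=1)
            reservations = ec2.describe_instances()["Reservations"]
            self.assertEqual(1, len(reservations[0]["Instances"]))


    if __name__ == "__main__":
        unittest.main()
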
*** bhavikdbavishi has quit IRC05:18
*** bhavikdbavishi has joined #zuul05:18
*** bhavikdbavishi has quit IRC05:30
*** bhavikdbavishi has joined #zuul06:12
*** bjackman has joined #zuul06:20
SpamapShrm06:23
SpamapSupgraded to 3.3.1 and I'm getting no consoles :(06:23
SpamapShttp://paste.openstack.org/show/736353/06:23
SpamapSgah06:28
SpamapSn/m06:28
SpamapSanother backward incompatible change. :-p06:28
tristanCSpamapS: "Header Upgrade is not defined"... what rewrite rules are you using?06:28
SpamapSNo the problem is that somewhere between 3.2.x and 3.3.0 (which is in the release notes) the requirements changed for static offload06:28
SpamapSHonestly.. this is getting pretty frustrating. I don't feel like retooling my deployment every time I do a minor upgrade. :-/06:29
SpamapSI mean, I love the new UI06:29
SpamapSso grateful for all the new stuff06:29
SpamapSbut.. can.. I just change my version pointer.. just once.. and not have to spend 2 hours mopping up? :-P06:30
* SpamapS will stop venting06:30
tristanCSpamapS: static offload had to be changed to support the new html5 urls... and since it's outside zuul's control, there isn't much we can do about it06:38
tristanCperhaps don't offload, that works out of the box06:39
tobiashSpamapS: you do static offload?06:39
*** bhavikdbavishi has quit IRC07:01
*** quiquell|off is now known as quiquell07:07
*** chkumar|away has quit IRC07:13
*** chandan_kumar has joined #zuul07:15
quiquellGood morning07:27
quiquellWhat is the cccp.yaml ?07:27
tobiashquiquell: morning07:55
tobiash?07:55
*** gtema has joined #zuul08:01
quiquelltobiash: Nah, it's not here https://github.com/openstack-infra/zuul/tree/master/doc/source/admin/examples08:02
quiquelltobiash: I think it gets generated after building the images08:02
tobiashquiquell: what file are you referring to? cccp.yaml doesn't mean anything to me ;)08:03
quiquelltobiash: Me neither, will just remove it08:04
quiquelltobiash: thanks anyways08:04
tobiashk08:04
*** bjackman has quit IRC08:10
*** quiquell is now known as quiquell|brb08:13
*** bjackman has joined #zuul08:13
*** bhavikdbavishi has joined #zuul08:18
bjackmanJust noticed from reading the semaphore thread on zuul-discuss that acquisition/release is tied to job start/end. In my ultimate use-case I'd like to be able to acquire/release them in the middle of jobs (specifically, I don't really want to lock test HW until I've made sure the code compiles)08:20
bjackmanJust a thought08:20
bjackman(In this case test HW does not mean the node provided by nodepool, it's a separate resource)08:21
*** hashar has joined #zuul08:24
*** jpena|off is now known as jpena08:31
*** bjackman has quit IRC08:34
*** themroc has joined #zuul08:36
*** bjackman has joined #zuul08:37
*** quiquell|brb is now known as quiquell08:37
*** bhavikdbavishi1 has joined #zuul08:37
*** bhavikdbavishi has quit IRC08:38
*** bhavikdbavishi1 is now known as bhavikdbavishi08:38
bjackmanAnyway Jim Blair said in an email on zuul-discuss that there's been some discussion about what he calls the "bisecting pipeline manager", does anyone know where that discussion took place / if there's a record of it?08:39
bjackman(Mail I'm referring to http://lists.zuul-ci.org/pipermail/zuul-discuss/2018-November/000636.html)08:39
*** pcaruana has joined #zuul08:48
openstackgerritBrendan proposed openstack-infra/zuul master: Fix "reverse" Depends-On detection with new Gerrit URL schema  https://review.openstack.org/62083809:25
*** ssbarnea has quit IRC09:45
*** ssbarnea|rover has joined #zuul09:45
openstackgerritMerged openstack-infra/nodepool master: Update node request during locking  https://review.openstack.org/61880709:50
*** bhavikdbavishi has quit IRC09:55
openstackgerritMerged openstack-infra/nodepool master: Add second level cache of nodes  https://review.openstack.org/61902509:59
openstackgerritMerged openstack-infra/nodepool master: Add second level cache to node requests  https://review.openstack.org/61906909:59
openstackgerritMerged openstack-infra/nodepool master: Only setup zNode caches in launcher  https://review.openstack.org/61944009:59
openstackgerritBenoĆ®t Bayszczak proposed openstack-infra/zuul master: Disable Nodepool nodes lock for SKIPPED jobs  https://review.openstack.org/61326110:23
*** bjackman has quit IRC10:30
*** AJaeger_ has quit IRC10:32
*** electrofelix has joined #zuul11:04
*** AJaeger_ has joined #zuul11:13
openstackgerritTobias Henkel proposed openstack-infra/nodepool master: Asynchronously update node statistics  https://review.openstack.org/61958911:57
*** jpena is now known as jpena|lunch11:59
*** panda|pto is now known as panda12:41
*** gtema has quit IRC12:45
*** sshnaidm|afk is now known as sshnaidm12:51
quiquelltobiash: Do you have an example of a log config for the scheduler? I feel lazy today :-/13:06
quiquelltobiash:  I mean log_config, the yaml files with filters and all13:06
tobiashquiquell: we're still using the ini style, but I'm not at a computer atm13:07
tobiashquiquell: look in system-config repo, there should be a log config too13:08
quiquelltobiash: system-config that was the name I was looking for thanks!13:08
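
For the record, a hedged sketch of the general shape such a log config can take when zuul's log_config option points at a YAML dictConfig file; the logger names and levels here are illustrative, not the contents of system-config:

    import logging
    import logging.config

    import yaml

    LOG_CONFIG_YAML = """
    version: 1
    formatters:
      simple:
        format: '%(asctime)s %(levelname)s %(name)s: %(message)s'
    handlers:
      console:
        class: logging.StreamHandler
        formatter: simple
        level: DEBUG
    loggers:
      zuul:
        handlers: [console]
        level: DEBUG
        propagate: false
      gear:
        handlers: [console]
        level: WARNING
        propagate: false
    root:
      handlers: [console]
      level: INFO
    """

    logging.config.dictConfig(yaml.safe_load(LOG_CONFIG_YAML))
    logging.getLogger("zuul.Scheduler").debug("log config loaded")
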
*** gtema has joined #zuul13:10
*** rlandy has joined #zuul13:11
*** jpena|lunch is now known as jpena13:31
*** j^2 has joined #zuul14:04
*** themroc has quit IRC14:23
*** themroc has joined #zuul14:25
openstackgerritDavid Shrewsbury proposed openstack-infra/nodepool master: Add arbitrary node attributes config option  https://review.openstack.org/62069114:36
Shrewsmordred: s/metadata/attributes/  ^^^14:37
mordredShrews: brilliant!@14:37
*** bhavikdbavishi has joined #zuul14:56
*** themroc has quit IRC15:16
*** themroc has joined #zuul15:16
quiquellIs there any way to exclude specific jobs from configuration in the tenant yaml config?15:23
tobiashquiquell: only all jobs15:32
quiquelltobiash: ack15:32
quiquelltobiash: Maybe I can shadow the ones I don't want ?15:32
*** panda is now known as panda|pto15:34
*** hashar has quit IRC15:37
tobiashquiquell: that is possible, shadowing allows you to override jobs15:38
quiquelltobiash: ok, thanks15:40
corvusquiquell: though you might consider moving the jobs you don't want into a new repo15:40
corvustobiash: why did you add the event.clear in 61958915:41
quiquellcorvus: ack15:41
tobiashcorvus: because otherwise it just updated every second regardless of cache-triggered changes15:41
tobiashjust noticed that today and then read about threading events. Without the clear any subsequent wait effectively doesn't wait at all because the event is still set15:43
corvustobiash: left a couple comments15:43
tobiashcorvus: good points15:45
tobiashcorvus: shall we signal on stop and still keep the timeout as a safety net?15:48
openstackgerritTobias Henkel proposed openstack-infra/nodepool master: Asynchronously update node statistics  https://review.openstack.org/61958915:50
corvustobiash: i like dropping the timeout; less wasted work15:51
tobiashOk, timeout dropped15:54
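
A standalone illustration of the threading.Event point tobiash made above (nothing nodepool-specific):

    import threading
    import time

    event = threading.Event()
    event.set()                      # e.g. a cache update signalled us

    start = time.monotonic()
    event.wait(timeout=5)            # returns immediately, event still set
    print("without clear: waited %.3fs" % (time.monotonic() - start))

    event.clear()                    # reset before the next round
    start = time.monotonic()
    event.wait(timeout=1)            # now actually blocks until the timeout
    print("with clear: waited %.3fs" % (time.monotonic() - start))
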
*** quiquell is now known as quiquell|off15:55
*** bjackman has joined #zuul15:58
*** gtema has quit IRC16:23
*** themroc has quit IRC16:23
openstackgerritJames E. Blair proposed openstack-infra/nodepool master: Support relative priority of node requests  https://review.openstack.org/62095416:37
corvustobiash, Shrews: ^ that's the nodepool half of that.  i'll finish up the zuul change now.16:38
tobiashCool, looking after dinner :)16:40
pabelangerOooh, I'm going to review that too! Totally interested in the code16:40
*** bjackman has quit IRC16:55
openstackgerritJames E. Blair proposed openstack-infra/zuul master: Set relative priority of node requests  https://review.openstack.org/61535617:06
corvustobiash, Shrews, pabelanger: ^ that should be ready now too17:06
corvusclarkb: ^17:06
pabelangerlooks17:08
clarkbcorvus: great. I'm listening to an openci meeting now but will review after17:09
corvuspabelanger: can you look at https://review.openstack.org/620691 and make sure it will work for zones?17:10
pabelangersure17:10
corvuspabelanger: feel free to +W if it works for you17:12
*** dkehn_ has joined #zuul17:15
*** bhavikdbavishi has quit IRC17:15
*** dkehn has quit IRC17:15
*** dkehn_ is now known as dkehn17:15
*** bhavikdbavishi has joined #zuul17:16
pabelanger+A17:17
pabelangerThanks!17:17
pabelangerI'll update zuul patch now17:17
tobiashcorvus: re nodepool relative prios, we have the case that we have many providers that decline because they don't support the label (static providers). In this case there will be much lock contention leading to skipped node requests in the loop. So I think we should make sure that the loop is frequently restarted from the beginning with a fresh sorted  list.17:44
openstackgerritJames E. Blair proposed openstack-infra/zuul master: More strongly recommend the simple reverse proxy deployment  https://review.openstack.org/62096917:44
tobiashcorvus: but I think in combination with 610029 that might be good enough17:45
corvustobiash: does relative_priority alter that behavior?  -- is that something we need to address in the relative_priority change, or can that be a followup?17:46
corvusmordred: based on the feedback from SpamapS earlier, i worked on 620969, but i noticed that we suggest static offloading for white-labeling because white-label without static offload is said to be complex.  you were talking about dropping static-offload for openstack -- were you thinking of doing that while retaining white-label?  should we document that case?17:48
tobiashcorvus: no, I just wanted to mention this as our lock contention will circumvent the relative priority without 61002917:48
tobiashcorvus: +2 from me17:50
corvustobiash: ah yes.  i agree they will be better together and we should merge 029 shortly.  can you explain what the 'yield' does there?17:50
tobiashcorvus: I did that based on Simon's first comment17:52
*** jpena is now known as jpena|off17:53
corvustobiash: that makes sense now17:54
corvusi'll leave the approval of https://review.openstack.org/610029 to Shrews17:55
tobiashcorvus: but I think the yield is not good anymore with the relative priority as it just continues the loop while we now actually want to restart the loop17:55
corvustobiash: agreed that would be better, but not a fatal flaw.  do you want to do a followup after they both merge, or rebase 029 on relative_priority?17:57
tobiashcorvus: I think I'll rebase on relative priority17:57
corvustobiash: ok.  better wip it then, it has a lot of +2s17:58
openstackgerritPaul Belanger proposed openstack-infra/zuul master: Add support for zones in executors  https://review.openstack.org/54919718:05
pabelangertobiash: corvus: clarkb: mordred: ^now updated for node-attributes logic18:06
*** bhavikdbavishi has quit IRC18:13
openstackgerritTobias Henkel proposed openstack-infra/nodepool master: Ensure that completed handlers are removed frequently  https://review.openstack.org/61002918:13
*** bhavikdbavishi1 has joined #zuul18:13
openstackgerritTobias Henkel proposed openstack-infra/nodepool master: Ensure that completed handlers are removed frequently  https://review.openstack.org/61002918:15
*** bhavikdbavishi1 is now known as bhavikdbavishi18:15
tobiashcorvus: that should do it ^18:16
tobiashpabelanger: while you're at it, do you want to address my doc comment of PS18?18:18
tobiashpabelanger: it's not provider anymore ;)18:18
*** bhavikdbavishi has quit IRC18:19
pabelanger+118:22
sshnaidmwhere can I see possible zuul statuses? like SUCCESS, FAILURE, TIMED_OUT...?18:29
corvussshnaidm: i don't think the whole list is enumerated anywhere18:32
pabelangertobiash: provider pool? is that better?18:38
tobiashpabelanger: maybe just zone as this is what you actually configure?18:38
pabelangersure18:38
tobiashpabelanger: '... that have nodes with the specified executor-zone attribute...'18:39
tobiashpabelanger: how about this ^ ?18:39
pabelangersure18:40
tobiashcorvus: I've commented on 615356, what do you think?18:44
corvustobiash: yep18:45
openstackgerritPaul Belanger proposed openstack-infra/zuul master: Clarify executor zone documentation  https://review.openstack.org/62098918:51
pabelangertobiash: ^18:51
tobiashpabelanger: thanks18:52
tobiash+2 on both18:52
pabelangerthank you!18:56
Shrewstobiash: i don't think i'm clear on the need for the kazoo election recipe (though docs are horrible). seems like only a single thread is involved in the election, so what purpose is that serving?18:57
tobiashShrews: there may be multiple launcher processes involved in the election18:57
Shrewstobiash: oh right. external scaling. what happens if the leader is stopped? will another take it's place?18:58
Shrewss/it's/its18:58
tobiashShrews: yes, that's how the leader election works18:58
tobiashas soon as the leader finishes its function or loses its session a new leader is elected18:59
Shrewstobiash: ok, so election.run() basically blocks forever while it is not the leader19:00
tobiashyes19:00
Shrewstobiash: gotcha. thx19:00
tobiash:)19:00
Shrewsit wasn't clear to me if it returned when it LOST the election, but that makes sense19:00
tobiashyeah I like that concept19:01
*** hashar has joined #zuul19:06
Shrewstobiash: what if the leader gets an exception and the thread stops, but the zk session (which is shared among threads) remains active?19:08
clarkbShrews: if the thread stops the connection should break allowing another process to become leader?19:09
clarkbthough I guess if its a shared session maybe that doesn't happen19:09
tobiashShrews: if the leader returns from the function that is triggered by the election process a new leader is elected19:09
clarkbmight want to use a connection/session specific to elections19:09
Shrewshow does the connection break? it's a shared connection19:09
clarkbShrews: well it could be some logical action over the top of the connection too19:10
clarkbtcp might still be happy though19:10
tobiashShrews: so throwing an exception would lead to a new leader election as well19:10
Shrewstobiash: *nod*19:11
*** electrofelix has quit IRC19:12
tobiashShrews: that leader election is much simpler than I thought: https://kazoo.readthedocs.io/en/latest/_modules/kazoo/recipe/election.html#Election.run19:16
tobiashIt just gets a lock and runs the leader function while holding the lock; as soon as this returns (no matter how) the lock is released and the next one in the election gets the lock19:17
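
For reference, a minimal sketch of the kazoo election recipe being discussed; the ZooKeeper address and election path are made up:

    import socket

    from kazoo.client import KazooClient

    zk = KazooClient(hosts="127.0.0.1:2181")
    zk.start()


    def run_as_leader():
        # Runs only while this contender holds the election lock.  Returning
        # (or raising) releases the lock and lets the next contender in line
        # become the leader.
        print("elected leader, doing the periodic work now")


    election = zk.Election("/nodepool/leader/stats", socket.gethostname())
    election.run(run_as_leader)   # blocks until elected, then calls the function
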
openstackgerritJeremy Stanley proposed openstack-infra/zuul-website master: Revert "Add a promotional message banner and events list"  https://review.openstack.org/62099519:38
*** sshnaidm is now known as sshnaidm|afk19:56
SpamapStobiash: yeah I do static offload because when I set up my zuul cherrypy was totally screwed up and not serving static content.20:12
SpamapS(I think actually the problem was the yarn hooks or something.. anyway.. it's fixed now)20:12
SpamapSAnd also because of what corvus mentioned.. it was suggested as the simpler path (which it is).20:13
SpamapSI am sure I'll find a multi-tenant case at some point, but thus far, I've always been single tenant.20:13
openstackgerritMerged openstack-infra/nodepool master: Add arbitrary node attributes config option  https://review.openstack.org/62069120:14
openstackgerritMerged openstack-infra/nodepool master: Asynchronously update node statistics  https://review.openstack.org/61958921:00
openstackgerritMerged openstack-infra/zuul-website master: Revert "Add a promotional message banner and events list"  https://review.openstack.org/62099521:00
openstackgerritJames E. Blair proposed openstack-infra/nodepool master: Fix updateFromDict overrides  https://review.openstack.org/62100821:07
corvustobiash, Shrews: ^ just noticed that wee typo.21:07
Shrewsoops21:12
pabelanger+321:13
pabelangercorvus: tobiash: Shrews: I have a provider in nodepool that seems to be having some quota issues. Right now my tenant, vexxhost-mtl1, is full of nodes in ERROR state. But they don't seem to be reflected in the openstack quota from the provider. I wonder if we could somehow try to account for the error instances in the unmanagedQuotaUsed() call:21:35
pabelangerhttp://git.openstack.org/cgit/openstack-infra/nodepool/tree/nodepool/driver/openstack/provider.py#n17621:35
pabelangerbecause right now, I have jobs that are wedged waiting for nodes to free up in the provider21:36
pabelangerand nodepool won't try another provider it seems21:36
clarkbpabelanger: the issue is those nodes actually count against your quota21:36
clarkband you will likely be charged for them so I don't think nodepool should ignore them21:36
pabelangerbut for some reason, openstack api says there is quota available21:37
pabelangerbut really, there isn't21:37
corvuspabelanger: why won't nodepool try another provider?21:37
pabelangercorvus: I don't know, still trying to figure that out21:37
clarkbpabelanger: they may not count against cpu/memory/disk etc quota if they never started but they should count against instance quota. I have definitely seen error instances count against instance quota on clouds21:38
*** rlandy is now known as rlandy|biab21:44
dkehnclarkb: FYI the lastest update did fix the .ssh/known_host issue21:52
clarkbdkehn: great21:53
openstackgerritJames E. Blair proposed openstack-infra/zuul master: Set relative priority of node requests  https://review.openstack.org/61535621:57
pabelangercorvus: clarkb: looking at the log, I can see new node requests being created each time vexxhost-mtl fails, then a few pools decline the new node request, because they don't have the right flavor. But of the 4 provider pools that do have the flavor, vexxhost-mtl is always the first to fulfill the node request, and I don't know why that is21:57
pabelangerso, the 3 behind vexxhost-mtl, never get a chance to try and fulfill it21:58
pabelangerleading to stuck jobs in zuul21:59
clarkbpabelanger: the other three should get it after mtl1 fails 3 times?21:59
pabelangerclarkb: no, we delete the node request it looks like, and create it new22:00
clarkbmtl1 should decline the request after three errors, that gives the other regions opportunity to pick it up22:00
clarkbwe being nodepool?22:00
clarkbthat seems like a bug22:00
pabelanger1 sec, let me get a log22:00
pabelangeryah, nodepool22:00
corvusnodepool can delete min-ready node requests, but otherwise doesn't delete requests.22:03
dmsimardclarkb, SpamapS, corvus: Has there been any traction on reproducing jobs outside Zuul ? I wasn't in Berlin so I may have missed out.22:11
dmsimardI asked about this briefly before Berlin: http://eavesdrop.openstack.org/irclogs/%23zuul/%23zuul.2018-11-06.log.html#t2018-11-06T17:39:0522:13
*** manjeets has quit IRC22:15
pabelangercorvus: clarkb: okay, I am looking at the right node requests now:22:16
pabelanger2018-11-29 20:58:35,195 DEBUG nodepool.PoolWorker.vexxhost-ansible-network-mtl1-v2-highcpu-1: Active requests: ['300-0000331819', '300-0000331925']22:16
corvusdmsimard: in the conversation you linked, i noted who was working on it; maybe ask them?22:16
pabelangercorvus: clarkb: it seems vexxhost-mtl is retrying indefinitely because we hit a quota exception22:17
pabelangertrying to see why these are not failing over the other providers now22:18
dmsimardcorvus: not a terrible idea :/22:19
*** rlandy|biab is now known as rlandy22:21
corvuspabelanger, Shrews, tobiash, clarkb: https://review.openstack.org/536930 is the change which introduced the behavior which is vexing pabelanger now22:22
corvuspabelanger: however, it shouldn't retry indefinitely, it should pause22:23
corvuspabelanger: (which means you should end up with one stuck request, and others continue)22:23
pabelangerlooking22:23
corvuspabelanger: wait, you're not getting a quota exception though, are you?  the way you described it, it sounded like it was nodepool hitting quota limits and not trying to boot nodes...22:24
corvusi'm no longer certain 536930 is relevant22:24
pabelangercorvus: no, I am. I was tracing the wrong node request22:24
corvuspabelanger: maybe you could paste the logs?22:24
pabelangeryup22:24
pabelanger1 min22:24
openstackgerritJames E. Blair proposed openstack-infra/nodepool master: Support relative priority of node requests  https://review.openstack.org/62095422:26
pabelangercorvus: https://pabelanger.fedorapeople.org/launcher.log22:27
pabelangerI was tracing 300-000033192522:27
pabelangerwhich leads to an aborted node for 000031866022:28
*** manjeets has joined #zuul22:28
*** manjeets has quit IRC22:29
*** manjeets has joined #zuul22:29
corvuspabelanger: okay, so 536930 does describe the behavior you're seeing, and nodepool is not pausing because nova is incorrectly reporting quota usage (it's not reporting the quota is used, but if you try to launch a node, it fails with a quota error)22:33
pabelangerdo you know why it is not trying another provider pool? for example vexxhost-sjc1 should have space22:34
corvusto be honest, i don't see a way to resolve this other than going back to the original behavior (where unexpected quota errors counted against the three strikes which eventually caused a node-request decline).  but in tobiash's environment, an unexpected quota error is common enough for that to be a problem.22:34
corvuspabelanger: the only way for another provider to try the request is for this provider to decline it.  that was the old behavior.  tobiash wanted it changed so that quota errors weren't fatal.22:35
corvus(mind you, the underlying problem is openstack lying; fix that and there's no issue.  the issue we're grappling with is how to work around that.)22:36
*** sshnaidm|afk is now known as sshnaidm|off22:37
pabelangercorvus: okay, I understand now22:37
pabelangerunless we start tracking quota retries, and decline after x attempts22:38
corvuspabelanger: that was the old behavior22:38
pabelangeryah22:38
pabelangerokay, will wait until tobiash is back online to better understand why that broke him22:38
clarkbmy understanding is that nova at least only periodically updates the quota usage for what you see via the api22:39
pabelangerbut yes, in this case openstack is at fault22:39
clarkbhowever on every boot it seems to know an up to date number? seems like it could give you a valid number on every api request too22:39
corvuspabelanger: yeah, let's discuss with tobiash.  i think we need to do something -- the possibility for nodepool to wedge is not acceptable.  maybe we can come up with something other than reverting to the original algorithm.22:41
corvusclarkb: it sounds like this has been going on for some time for pabelanger -- so maybe it's not updating the quota at all?22:42
clarkbor it has gotten out of sync somehow (we've also seen that)22:42
pabelangercorvus: wfm, for now I can disable the provider, so no rush.22:43
corvuspabelanger: do those error nodes show up in openstack server list?22:46
corvusit actually looks like we ignore the quota used values from openstack; instead we only use the max values, and then perform our own subtraction.  so if nodepool doesn't see those servers (or doesn't have them associated with a flavor), it won't see them as impacting the quota...22:47
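
A back-of-the-envelope illustration of why that matters (all numbers invented): if nodepool derives "used" only from servers it can account for, ERROR'd servers it lost track of make it overestimate the headroom, and nova then rejects the launch.

    max_instances = 10        # quota limit reported by nova
    nodepool_nodes = 4        # servers nodepool has ZooKeeper records for
    unmanaged_servers = 2     # non-nodepool servers visible in the tenant
    leaked_error_servers = 4  # ERROR servers with no ZooKeeper record

    assumed_free = max_instances - nodepool_nodes - unmanaged_servers
    actual_free = assumed_free - leaked_error_servers
    print("nodepool thinks %d slots are free, nova will only allow %d"
          % (assumed_free, actual_free))
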
pabelangercorvus: yes, they show up in ERROR22:48
corvuspabelanger: are there zk node entries for them?22:48
pabelangercorvus: I'd have to ask tristanC or nhicher to look for that, I don't have access myself22:49
corvuspabelanger: you... have logs?22:49
pabelangerI'll ask for them22:49
pabelangerI don't have access to the production servers, just a user22:49
corvuspabelanger: can you get the output of 'nodepool list'.  the /node-list endpoint of the nodepool web app would have it22:52
pabelangerlet me see22:54
mordredhttps://softwarefactory-project.io/zuul/t/ansible-network/nodes <-- I'm only seeing 2 nodes in mtl122:55
pabelangeryah, I don't believe nodepool-web is exposed22:56
pabelangerdmsimard: around to run the commands corvus asked for?22:57
dmsimardo/22:58
dmsimardlet me see22:58
corvuspabelanger, mordred, dmsimard: we'll need the nodepool ids of the error nodes (should be in the nova metadata), and then the nodepool logs for those ids22:58
pabelangerhttp://paste.openstack.org/show/736454/22:58
pabelangeris openstack server show of an ERROR node22:59
corvuspabelanger, dmsimard: great, so nodepool logs of 0000318690 would be helpful22:59
pabelangerlet me check if that is in the copy I have, I pruned the posted one23:00
dmsimardcorvus: logs http://paste.openstack.org/show/736455/23:00
corvusit's starting to look like openstack is providing the right quota info, but nodepool isn't subtracting out those error nodes because it doesn't have zk records for them23:00
corvusdmsimard: can you confirm there aren't any other interesting log lines between the last two?23:01
corvus(ie, an exception or something)23:01
pabelanger0000318690 shows up in https://pabelanger.fedorapeople.org/launcher.log23:01
corvusit's only a few microseconds, so shouldn't be much there23:01
clarkbthat's curious, nodepool should keep records for error'd nodes until they get deleted23:01
clarkb(we even show the instance after delete to make sure it actually deleted iirc)23:02
pabelangerhttp://paste.openstack.org/show/736456/23:02
*** hashar has quit IRC23:02
pabelangerthat is the full list of ERROR nodes, FYI23:02
corvusoh... look closer at the history for 318690 -- it reused the node id after the initial error23:03
dmsimardcorvus: there's a few traces: http://paste.openstack.org/show/736457/23:04
corvusit didn't think a server was created, so it just launched it again, and overwrote the zk record.  but a server really was created in attempt #1.23:04
dmsimardIt's hard to correlate that last trace with anything specific23:04
clarkbcorvus: ah so that might be the openstack bug23:05
clarkb(incorrect return to create request?)23:05
dmsimard"keystoneauth1.exceptions.http.Unauthorized" is 1384 times in the logs but I'm not sure for which cloud or provider it is.23:05
corvusmordred: maybe you can help here -- look at the first exception in http://paste.openstack.org/show/736457/23:06
pabelangerdmsimard: that looks to be limestone, I need to see why23:06
corvusmordred, clarkb: nodepool is getting an exception from openstacksdk and thinks there's no server, but there really is a server in nova -- the error message even gives us the id23:06
corvus(that uuid matches up with pabelanger's paste in http://paste.openstack.org/show/736456/ )23:07
dmsimardI need to step away for dinner, I'll be back later23:07
pabelangerdmsimard: thanks for help23:07
corvusmordred: the issue is that http://git.zuul-ci.org/cgit/nodepool/tree/nodepool/driver/openstack/handler.py#n125 raises an exception; we only save the external id of the server if it succeeds (see line 142).23:14
corvusso basically, every server create that hits an error on the initial call ends up leaked23:15
pabelangerah, nice debugging23:17
pabelangercorvus: do we still need to look in zookeeper or does https://softwarefactory-project.io/zuul/t/ansible-network/nodes give the proper information?23:21
openstackgerritJames E. Blair proposed openstack-infra/nodepool master: OpenStack: count leaked nodes in unmanaged quota  https://review.openstack.org/62104023:22
corvuspabelanger: that's all the info we need.  you should never need to talk to zookeeper directly.23:23
corvuspabelanger, tobiash, clarkb: ^ 621040 may help mitigate this situation23:23
clarkbcorvus: that fix should work, but would it be better to record the leaked nodes so they can be deleted?23:23
corvusclarkb: yes.  i do not know how to do that.  i was hoping mordred could help.23:24
pabelangercorvus: ack, thanks23:24
corvuspabelanger: do you know which version of openstacksdk you're using?23:25
pabelangerokay, cool. that kinda lines up with my initial thought, updating unmanagedQuotaUsed to account for the leak :)23:25
pabelangercorvus: I don't think so, let me check but tristanC will know23:25
pabelangerI think that is pulled in via RPM23:26
corvusline 6831 in openstackcloud.py is the middle of a comment on master; so it's a bit hard to line up that traceback23:26
corvusclarkb: oh nice -- the error string in that exception is generated/formatted by openstacksdk.  the server id is an attribute on openstack.cloud.exc.OpenStackCloudCreateException23:28
corvusso we should be able to catch that and snag the server id and keep the zk record around23:29
clarkbcorvus: yup I'm writing that change now23:29
pabelangercorvus: educated guess for openstacksdk is 0.19.0, that is just looking at the spec file however. Not sure if that is the version running23:29
pabelanger0.17.2 was the version before that23:29
corvusclarkb: oh so was i.  how many lines have you written? :)23:30
clarkbcorvus: uh the try except with self.node.state = zk.FAILED and set the uuid and zk.storeNode? oh and reraise23:31
clarkbmine is sort of handwavy though with placeholders for actual values (like I don't know what the actual server id attribute is yet)23:31
clarkbif you've got specifics sorted out then you should go for it23:31
corvusclarkb: i'll do it; you review :)23:33
clarkbok23:33
clarkbI'll continue sorting out mine because then I'll be able to review yours easily :)23:33
clarkbhttp://paste.openstack.org/show/736458/ I think that is functional23:35
clarkb(but totally untested)23:36
clarkboh the if e.uuid should be if e.resource_id23:36
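
A hedged sketch of the idea being kicked around here (not the code in either change under review; the zk helper names are taken from the discussion above and may not match the real driver exactly): catch openstacksdk's create exception, record the half-created server's id on the node record so cleanup can delete it, then re-raise.

    from nodepool import zk
    from openstack.cloud import exc


    def create_server_or_record_leak(cloud, zk_conn, node, **create_kwargs):
        try:
            return cloud.create_server(**create_kwargs)
        except exc.OpenStackCloudCreateException as e:
            if e.resource_id:
                node.external_id = e.resource_id   # server id reported in the error
                node.state = zk.FAILED
                zk_conn.storeNode(node)            # keep a ZK record for cleanup
            raise
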
openstackgerritJames E. Blair proposed openstack-infra/nodepool master: OpenStack: store ZK records for launch error nodes  https://review.openstack.org/62104323:38
clarkbah ok yours is a bit cleaner by relying on the callee to store the failed node23:39
corvusclarkb: yeah-- and actually i think we want to do that after a closer reading of http://git.zuul-ci.org/cgit/nodepool/tree/nodepool/driver/openstack/handler.py#n24423:39
clarkbcorvus: one thing I did was move image and hostname up so that we'd have that data in the record we store (which might help a human debug later)23:39
corvusclarkb: it actually generates a *new* zk node23:39
corvusclarkb: i almost did that too until i realized what line 244+ was doing23:40
clarkboh ya huh23:40
clarkbI mean with the uuid you should be able to figure it out23:40
corvus(so the leaked instance gets a new node, and the original node gets reused)23:40
clarkbso probably not a major concern23:40
corvustbh, i think we may want to revisit that -- the metadata of the leaked node will refer to a different node id.23:41
corvusbut that's going to be a bigger change23:42
corvuser, leaked isn't the right word -- error node i guess23:42
pabelangercool, I've +2 both too. in case mordred wants to review23:43
pabelangercorvus: thanks for helping debug this today23:43
clarkbcorvus: ya I would expect that we'd create a new node for the retry not for tracking the delete23:43
corvuspabelanger: thank you for bringing it up.  and dmsimard for pitching in :)23:43
clarkbin any case I much prefer this fix to the first workaround :)23:43
corvusclarkb: oh, i think we can have both23:44
corvusthe second will make the first one mostly a noop23:44
corvusbut it's still safe to have in case there are other leaks23:44
clarkbya the first is more belt and suspenders. I just didn't want to only rely on that first one23:44
corvus++23:44
pabelangerwfm23:45
*** manjeets has quit IRC23:46
*** manjeets has joined #zuul23:46
