17:00:22 #startmeeting ironic
17:00:23 Meeting started Mon Jul 13 17:00:22 2015 UTC and is due to finish in 60 minutes. The chair is devananda. Information about MeetBot at http://wiki.debian.org/MeetBot.
17:00:24 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
17:00:27 o/
17:00:28 The meeting name has been set to 'ironic'
17:00:29 o/
17:00:30 o/
17:00:32 #chair NobodyCam
17:00:32 Current chairs: NobodyCam devananda
17:00:35 o/
17:00:37 o/
17:00:57 oh hello
17:01:17 the agenda, though light, is here: https://wiki.openstack.org/wiki/Meetings/Ironic
17:01:32 o/
17:01:47 and I need to apologize for my absence last week and lack of preparation for the meeting today.
17:01:56 #topic announcements
17:02:15 o/
17:02:20 o/
17:02:26 probably the biggest thing to announce today is just a reminder for our midcycle
17:02:44 o/
17:02:55 :)
17:03:41 has everyone filled out the Mid-Cycle Lunch questions?
17:03:59 we've got an etherpad started, though very light at this point
17:04:06 #link https://etherpad.openstack.org/p/ironic-liberty-midcycle
17:04:09 reminder that lunch questions and an invite for tuesday dinner/drinks is here
17:04:13 ah, deva beat me
17:04:22 :)
17:04:36 and if you haven't, please "buy" a free ticket from eventbrite so I can track attendees with the site coordinators
17:04:50 perhaps we should start tracking what we want to hack on?
17:04:54 * NobodyCam thinks he has but is not sure
17:04:55 or is it too early?
17:05:16 #link https://www.eventbrite.com/e/openstack-ironic-sprint-august-2015-tickets-17533862254
17:05:19 jroll: not too early at all
17:05:38 k :)
17:05:47 NobodyCam: you would have gotten a confirmation email from eventbrite ...
17:06:15 :)
17:06:32 * devananda checks attendee list
17:06:36 NobodyCam: no - you have not signed up
17:06:42 oh
17:06:50 does BadCub have a +1?
17:07:01 NobodyCam: this doesn't take +1's
17:07:02 NobodyCam: yes
17:07:06 lol
17:07:17 I ordered two tickets if memory serves
17:07:18 BadCub: oooh. you *do* list this as 2 tickets
17:07:24 please don't do that :)
17:07:27 lol
17:07:36 * devananda wonders how he can disable that
17:07:38 ugh
17:07:45 maybe BadCub was ordering two chairs so he can put his feet up
17:07:52 hehehe
17:07:53 * NobodyCam will sign up to buy the free ticket
17:07:57 something something lazy PMs
17:07:59 :P
17:08:02 jroll: that seems reasonable
17:08:07 lol
17:08:19 ok - any other announcements from folks?
17:08:40 just a reminder python-ironicclient gate is broken :-(
17:08:51 there's a patch fixing it but gate is pretty slow right now
17:09:05 #link https://review.openstack.org/201043
17:09:23 lucasagomes: thanks!
17:09:39 lucasagomes: thanks, wasn't sure if you and ruby had decided how you wanted to order that :P
17:09:46 lucasagomes: seems like that would affect other projects, no?
17:09:57 * jroll +2
17:10:02 devananda, yup it did. It affected pretty much all projects
17:10:05 devananda: it's just unit tests
17:10:12 ironic/nova are already fixed up though
17:10:13 ironic is already fixed, but I forgot to look at python-ironicclient on friday
17:10:18 just found out it was broken this morning
17:10:18 gotcha
17:10:31 ok, moving on
17:10:31 jroll: yeah, lucasagomes and I have a plan :)
17:10:34 #topic subteam reports
17:10:38 mock the world, break the world.
17:10:39 jroll, re order, me and rloo are working on it
17:10:50 lucasagomes: rloo ok :)
17:11:09 jroll: it is actually, we didn't use mock properly, and mock is now telling us :)
17:11:09 networking subteam report: these specs are *so* ready.
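
[Editor's note: the gate breakage above came from a stricter mock release surfacing latent misuse in existing unit tests. The sketch below illustrates the general failure mode and the autospec remedy; the class and test names are hypothetical, not the actual ironic or python-ironicclient test code.]

```python
import unittest
from unittest import mock


class PowerDriver(object):
    """Hypothetical driver class, used only for illustration."""

    def set_power_state(self, state):
        raise NotImplementedError


class TestMockStrictness(unittest.TestCase):
    def test_bare_mock_hides_mistakes(self):
        # A bare Mock invents attributes on demand, so calling a method
        # that does not exist on the real class still "passes" silently.
        driver = mock.Mock()
        driver.set_power_statee("power on")  # typo goes unnoticed

    def test_autospec_catches_mistakes(self):
        # create_autospec restricts the mock to the real API, so the
        # same typo raises AttributeError instead of passing silently.
        driver = mock.create_autospec(PowerDriver, instance=True)
        with self.assertRaises(AttributeError):
            driver.set_power_statee("power on")
        # Call signatures are checked too: a missing argument raises.
        with self.assertRaises(TypeError):
            driver.set_power_state()


if __name__ == "__main__":
    unittest.main()
```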
17:11:21 rloo: right, I know :)
17:11:24 one more announcement: don't forget to submit your summit talk ;)
17:11:30 * dtantsur already did
17:11:39 * jroll assumes dtantsur is giving a talk about microversions
17:11:44 jroll: I'm going to dig into that spec again today, i promise
17:11:45 lol
17:11:48 LOOOL :D
17:11:50 lol!
17:12:02 something with gifs, like in Vancouver
17:12:12 heh
17:12:38 dtantsur: oh speaking of microversions, you should review my update: https://review.openstack.org/#/c/196320
17:13:00 will do! let the flame war begin :)
17:13:10 devananda: devref in specs? :/
17:13:22 why isn't devref in ironic tree?
17:13:36 jroll: a) we should have a devref in ironic tree (mostly just reorg of what's there)
17:13:43 jroll: b) because we have very long lived specs
17:13:57 which are aspirational and not completed in one (or two) cycles
17:14:00 devananda: okay
17:14:01 right
17:14:10 * lucasagomes adds to his todo list
17:14:15 why not call them 'long-lived' then?
17:14:32 rloo: I'm not tied to the name "devref"
17:14:48 but 'aspirational' doesn't seem to instill confidence in our users :P
17:15:01 * rloo will look/comment in the patch itself later :)
17:15:18 other subteams want to chime in?
17:15:22 yep
17:15:40 I'd like our simple inspector gate to join the ironic experimental pipeline https://review.openstack.org/#/c/198381/
17:15:55 with a goal of eventually joining other pipelines :)
17:16:01 the secure boot for pxe-ilo spec has been there for a very long time, please review
17:16:05 dtantsur: ++
17:16:30 dtantsur: will add to my review list but ++ on the idea
17:16:36 thnx!
17:17:01 oh talking about gate, devananda I think this is waiting for you https://review.openstack.org/#/c/199494/
17:17:18 any updates on docs or qa? or are those folks still out on PTO?
17:17:24 making pxe_ipa gate jobs voting (it's been running reliably since march)
17:17:38 I'd like to help
17:17:53 I know jlvillal is out
17:18:17 lucasagomes: ack, adding to my list
17:18:24 wrt docs, sigh. https://review.openstack.org/#/c/191900/. you know how they/we use 'bare metal service' vs 'ironic'
17:18:40 lana seems open to using just 'ironic' instead of 'bare metal services' in the install guide
17:18:56 i'm not quite sure that makes sense but am mentioning it
17:19:46 i thought that was just about service name capitalization -- not about whether to use project vs service name?
17:20:11 devananda: well, the install guide is being 'cleaned up' in that patch. and we use both 'bare metal service' and 'ironic' in that guide.
17:20:29 devananda: so i asked them if they were cleaning it up, why they left some 'ironic's around...
17:20:44 devananda: i suspect i should just stick with reviewing code
17:21:13 hrmm
17:21:23 devananda: specifically, line 1645 for comments: https://review.openstack.org/#/c/191900/6/doc/source/deploy/install-guide.rst
17:21:28 so I will give it a skim, but overall I'd like the docs team to help us
17:21:39 devananda: yeah, i was hoping the doc team would help us...
17:21:46 under the assumption that they know more about making words that non-developers will understand than I do
17:21:54 so I think this is them trying to help us
17:22:47 rloo: ok, let's discuss this outside the meeting
17:23:00 devananda: i'm fine if you make an executive decision :)
17:23:03 i need to read the discussion on that doc change...
17:23:04 * NobodyCam adds to his list of open tabs
17:23:32 going to time box this section since we have the etherpad status, too
17:23:39 thanks, all, for the reports :)
17:23:54 #topic API retries
17:24:21 hrm, this item on the agenda doesn't follow the format for agenda items
17:24:49 Oh why? I have added it
17:24:54 lucasagomes: it's your bug report -- https://bugs.launchpad.net/ironic/+bug/1472565
17:24:54 Launchpad bug 1461140 in Ironic "duplicate for #1472565 conflict (HTTP 409) incorrect for some cases" [Undecided,New] - Assigned to Ruby Loo (rloo)
17:25:12 ah, great. the floor is yours :)
17:25:15 yeah later on I found out it was duplicated. But I kept this link because I put some suggestions there
17:25:41 So basically our client retries on every 409 (Conflict)
17:26:05 * dtantsur wants it to do more retries btw..
17:26:06 but in some situations I think it makes no sense to retry, for example, when one tries to create a port whose mac address is already registered
17:26:24 this is not something that the server will fix up eventually so we shouldn't retry
17:26:25 or if you try to create a node with an existing name :-(
17:26:30 yeah
17:26:41 I added two suggestions about how to fix it in the bug
17:26:41 so I tend to think that client auto-retries are just a band-aid, an anti-pattern if you will
17:26:48 and we should just fix the real issue
17:26:56 which is that the number of locks is too damn high
17:27:03 jroll: ++
17:27:14 well, sometimes you just have to wait...
17:27:15 I did it because it's hard to use Ironic right now without retries, but I'm open to a better fix :)
17:27:30 rloo: sure, and the error message should indicate that :)
17:27:31 yeah, today I saw hardware where a power on request took 17 seconds
17:27:40 right, so one suggestion would be to use a Retry-After header
17:27:41 http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.37
17:28:08 dtantsur: that shouldn't block the client, though
17:28:17 and the client would look at it and would only retry in case the header is specified. The value of that header is the number of seconds the client should wait before it retries
17:28:20 devananda, but nothing is possible while this happens
17:28:22 lucasagomes: could we switch some of the 409s to 406s?
17:28:24 dtantsur: hardware IS slow though
17:28:38 * nothing = no operations except for get
17:28:44 NobodyCam, another option would be to change the return code, yes.
17:28:44 dtantsur: nothing is possible *for that Node*, because it's locked by the driver during that time?
17:28:47 I suggested 422 for that
17:28:54 NobodyCam: that's the second suggestion, don't use 409 for the non-retries.
17:28:58 dtantsur: or nothing is possible *at all* because the conductor is frozen?
17:29:09 devananda, sorry, late evening :) for this node obviously
17:29:14 dtantsur: ok :)
17:29:27 dtantsur: there's a bug with the dell driver that blocks even other nodes
17:29:46 yeah, yeah.. no, that's about one node
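
[Editor's note: a minimal sketch of lucasagomes's first suggestion above, retrying a 409 only when the server sends a Retry-After header. The endpoint, function name, and use of the requests library are illustrative assumptions, not python-ironicclient's actual code.]

```python
import time

import requests


def request_with_retries(method, url, max_attempts=5, **kwargs):
    """Retry a 409 only when the server marks it as transient.

    Per the suggestion above, the server would attach Retry-After
    (in seconds) to retryable conflicts such as NodeLocked, and omit
    it for permanent ones such as a duplicate MAC address.
    """
    for _attempt in range(max_attempts):
        resp = requests.request(method, url, **kwargs)
        if resp.status_code != 409:
            return resp
        retry_after = resp.headers.get("Retry-After")
        if retry_after is None:
            # No header: the conflict is permanent, retrying won't help.
            return resp
        time.sleep(int(retry_after))
    return resp


# Hypothetical usage against an ironic endpoint:
# resp = request_with_retries(
#     "PATCH", "http://ironic.example.com:6385/v1/nodes/NODE_UUID",
#     json=[{"op": "replace", "path": "/name", "value": "node-1"}])
```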
17:29:55 I also tend to think 409 is a bad status code for "node is locked", it's not a client error, which is what 4xx designates
17:30:01 so re: 409, I agree that we're overloading the meaning of Conflict
17:30:12 jroll: right
17:30:30 IMHO I believe 409 is correct for the situations we described, re creating a port with a duplicated mac address
17:30:35 409 is the correct error for duplicate mac, duplicate name, things like that
17:30:42 agree
17:30:50 I think it's also the correct error for invalid state transitions
17:30:54 yes
17:31:16 this is even merged in the API guidelines, to use 409 for async operations
17:31:18 503 service unavailable then? looks a bit too much, but maybe..
17:31:23 when you try to start something which is already started
17:31:29 dtantsur: no - that means the service as a whole is down
17:31:35 yep
17:31:36 dtantsur: gateways and proxies will interpret that
17:31:39 yeah, the hard part is that no 5xx codes really fit well
17:31:49 that's why I like suggestion 1), because we then can indicate whether we should retry or not on 409
17:32:08 520 Unknown Error? :D
17:32:08 are there cases that 409 is incorrect for, aside from NodeLocked?
17:32:14 dtantsur, we know the error
17:32:41 yep, just no other codes fit even remotely IMO
17:32:57 to reiterate, I don't believe that retrying is good behavior for the client. besides the fact that it's just slapping a bandaid on the problem, what if the node is locked because it's doing some operation that changes the state of the node, after which maybe you don't want your request to go through?
17:33:19 well, I do want
17:33:31 jroll, right, yeah that's why I think suggestion 1) would be good. Because it gives the server the power to say
17:33:36 this is retryable and this is not
17:33:36 according to https://en.wikipedia.org/wiki/List_of_HTTP_status_codes#4xx_Client_Error, 403 == locked?
17:33:37 I have a script that invokes a series of operations, and unless the previous one failed I'd like to proceed
17:33:47 the client just needs to respect the header
17:33:56 rloo: 423?
17:33:58 lucasagomes: if we do (1) I still don't think the client should retry
17:34:04 NobodyCam: oh yeah, 423.
17:34:12 hah. 423 sounds reasonable :P
17:34:22 jroll, so you suggest everyone continue implementing their own retries? nova, inspector, downstream scripts...
17:34:59 are there any other cases?
17:35:27 dtantsur: yes. or perhaps we add a method to the python client, or an argument or whatever, to make it retry. but I don't think it should retry by default.
17:35:40 if the only issue is around NodeLocked - perhaps the solution is, after all, to move those retries into the API, and after some #, return a timeout
17:35:41 devananda: cases for retrying? I think only when node is locked? or out of workers?
17:35:50 rloo: ah - out of workers, yes
17:36:00 devananda, ++ the best solution IMO
17:36:01 that's a great example of a server-side issue
17:36:05 devananda: the conductor already retries on NodeLocked, iirc
17:36:08 out of workers gives 503, no?
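
[Editor's note: the flip side of the same suggestion, sketched from the server's point of view. The exception classes below are local stand-ins named after ironic's, and the mapping helper is hypothetical, not ironic's actual API code; it keeps 409 for true conflicts and adds Retry-After only for transient conditions.]

```python
# Stand-in exceptions mirroring the cases discussed above.
class MACAlreadyExists(Exception):
    pass


class DuplicateName(Exception):
    pass


class InvalidStateTransition(Exception):
    pass


class NodeLocked(Exception):
    pass


class NoFreeConductorWorker(Exception):
    pass


def http_error_for(exc):
    """Return (status_code, headers) for an API error response."""
    headers = {}
    if isinstance(exc, (MACAlreadyExists, DuplicateName,
                        InvalidStateTransition)):
        # A real conflict: retrying can never fix it, so no header.
        code = 409
    elif isinstance(exc, (NodeLocked, NoFreeConductorWorker)):
        # Transient server-side condition: same code, but tell the
        # client when it is reasonable to try again.
        code = 409
        headers["Retry-After"] = "2"
    else:
        code = 500
    return code, headers
```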
17:36:15 rloo: and really highlights that this isn't a 4xx error at all
17:36:17 jroll, not always
17:36:32 jroll, IIRC node-update fails early if it detects lock presence
17:36:40 NodeLocked and ConductorOutOfWorkers are transient server-side errors
17:37:04 dtantsur: that's the api I guess
17:37:08 but the conductor always retries
17:37:11 https://github.com/openstack/ironic/blob/master/ironic/conductor/task_manager.py#L191
17:37:27 I mean, that's not what I thought a year or two ago, but that is becoming clear
17:37:33 yeah, that's true. but e.g. in inspector node-update fails if something is going on with a node
17:37:38 yes, NoFreeConductorWorker == 503
17:37:55 I think the real solution is to lock less
17:38:00 power sync loop shouldn't lock
17:38:08 jroll: ++
17:38:15 agent heartbeats probably shouldn't lock by default
17:38:18 jroll, what about power on/off?
17:38:18 the problem with 423 is that it represents a REST API client's ability to lock a node, which we do not expose
17:38:23 that eliminates 90%
17:38:32 dtantsur: I'm not sure
17:38:42 probably should? I'd have to look at it more
17:38:46 jroll: ++ to power sync loop using a shared lock, escalating to an exclusive lock IFF it needs to power on/off the node
17:38:58 anyone want to file a bug & fix that ^ ?
17:38:59 jroll, but that's the source of problems in my today's case (power on/off taking 17 seconds)
17:39:01 devananda: well, the client indirectly locks by issuing a request that causes a lock on the node.
17:39:21 dtantsur: I feel like that's not normal. you should RMA that machine. :)
17:39:29 jroll: agent heartbeat locks because it goes through vendor passthru
17:39:41 devananda: right, passthru shouldn't lock by default.
17:39:43 jroll: it's a great test case, though!
17:40:17 jroll, then it should be an explicit error, but I don't want people to report bugs about "node locked error" :)
17:40:33 ironic has real issues that make it hard to use and we're just patching it over by making clients retry automatically
17:40:47 which doesn't help people not using the official client, either
17:41:02 dtantsur: then we should make the error messages better, too.
17:41:02 we can retry in the API, as devananda suggested above..
17:41:02 right
17:41:12 I would like to time box this discussion -- these are all very good points and I think we agree on the problems
17:41:26 yeah I would like at least an action plan for it
17:41:40 do people like the idea of retries at the API level?
17:41:42 I can take a look at not locking the nodes in some parts
17:41:43 lucasagomes: do you have time / want to coordinate fixing these issues?
17:41:49 * dtantsur can write a spec
17:41:56 devananda, yes
17:42:02 dtantsur: I'm not sure what "retry at the api level" means. the api retries the rpc calls?
17:42:05 I think rloo was/is looking at solving it too
17:42:14 jroll, at first glance, yes
17:42:32 I would like to see an outline of these problems -- unfortunately, yea, a spec is probably the right way to go, just to make it digestible to everyone
17:42:44 because this is going to affect several areas of the project
17:42:54 dtantsur: the rpc calls that lock a thing already retry, I think we just need to remove the 'reservation' check in node-update etc
17:43:02 to be clear then, until there is a spec etc we shouldn't make any more changes, like extending the retrying at the api level?
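
[Editor's note: a minimal sketch of the shared-to-exclusive escalation proposed above for the power sync loop, loosely modeled on ironic's conductor task_manager (linked earlier in the discussion). Treat the details, especially upgrade_lock(), as assumptions about the interface rather than the exact API.]

```python
from ironic.conductor import task_manager


def sync_power_state(context, node_id, desired_state):
    # Start with a shared lock: checking power state does not conflict
    # with other readers, so most iterations never block anyone.
    with task_manager.acquire(context, node_id, shared=True) as task:
        current = task.driver.power.get_power_state(task)
        if current == desired_state:
            return  # common case: no exclusive lock was ever taken

        # Escalate to an exclusive lock only when the node actually
        # needs a state change ("IFF it needs to power on/off").
        task.upgrade_lock()
        task.driver.power.set_power_state(task, desired_state)
```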
17:43:31 jroll, I have to research more, I can't say for sure right now
17:43:40 dtantsur: same
17:43:41 rloo: lest we all try to solve this in different ways, probably a good idea
17:44:05 #agreed we all feel that there are issues with the current locking model, especially around 409 Conflict and NodeLocked
17:44:08 rloo: devananda +1
17:44:22 right, let's investigate which areas may be overusing node locking
17:44:25 #agreed lucas and dmitry are going to put a plan together to address these
17:44:30 as jroll has pointed out
17:44:31 ack
17:44:33 makes sense. are we good with LOCKED = 409? I don't think so.
17:44:39 thanks much!
17:44:57 rloo, changing an error code is a breaking change btw
17:45:01 #topic open discussion
17:45:10 dtantsur: I know. will leave that to the spec to discuss :)
17:45:17 dtantsur would, i'm sure, like to say some things about API versions
17:45:19 dtantsur, yeah we probably will need to use microversions for it
17:45:28 \o/
17:45:32 I have some strong opinions as well on them, which I wrote into a revision of the old spec
17:45:46 #link https://review.openstack.org/196320
17:46:13 I don't understand the -compatible header
17:46:37 idk if you want to explain here or in the patch
17:46:54 jroll: see the ref material in the patch, it's explained there
17:47:25 devananda: I don't see any new references?
17:47:27 wait, no it's not :(
17:47:31 urgh. one sec
17:47:44 hah
17:47:44 jroll: http://www.gnu.org/software/libtool/manual/libtool.html#Updating-version-info
17:48:07 * dtantsur always hated libtool versioning
17:48:36 devananda: ctrl+f compatible gives me nothing relevant
17:49:01 this says bump the version if you change the api
17:49:08 the only thing that we're trying to achieve with hiding features is to prevent people from "cheating" and not requesting the correct version, right?
17:49:08 also, before I forget, I want to bring up the topic of meeting times again
17:49:49 I did a poll on this a while back, and got ~17 responses
17:49:53 the night time meetings are very hard for me to be there for
17:50:02 dtantsur: IMO it's valuable because you can know exactly what version they appeared in, and thus whether your ironic has them or not
17:50:08 esp with daylight savings time
17:50:12 dtantsur: in other words I like sean's take on it
17:50:21 dtantsur: though I don't think we have time to talk about this atm
17:50:24 NobodyCam: I've missed several of the 0500 GMT meetings as well
17:50:30 jroll, it's not about hiding features, it's about stating versions. but yeah, better on the spec.
17:50:52 ++ to not talking about microversioning now
17:50:55 * NobodyCam tends to fall asleep with laptop in his lap :-p
17:51:36 the responses were more in favor of keeping the meeting, even though I do not feel that the 0500 meetings are productive
17:51:54 I have never attended the 0500 meeting because the time is just too bad for me. It would be good to listen to the people that attend it, see if they find it useful or not
17:52:04 I answered "keep" because I didn't want people to be excluded. But if the core team does not attend them, then I'd change my vote..
17:52:08 I'm mentioning it now in case anyone wants to discuss -- i'm going to write up my thoughts and post to the ML (it's much overdue)
17:52:08 I agree that 0500 meetings aren't typically productive
17:52:17 devananda: for the ones that wanted to keep the meeting -- are they happy keeping the meeting if no cores attend?
17:52:20 dtantsur: yea, usually the core team isn't there
17:52:28 rloo: probably not :)
17:52:35 ++ for ML
17:52:38 we do have 2 cores in that timezone, though
17:52:45 that's the problem with polls... can't get to the nitty gritty details.
17:52:46 or, well, not in US/EU
17:52:50 yea
17:52:51 I am one of them .. (almost sleepy now)
17:52:57 rameshg87: indeed :)
17:53:10 rameshg87: also hi there!
17:53:19 rameshg87: thank you for being here :)
17:53:25 we could ask Haomeng if he would be able to attend that meeting more often
17:53:37 I too typically feel nothing much happens in the 0500 meeting. I would personally rather prefer this time every week :)
17:53:58 I personally don't do the 0500 mtg at all.
17:54:14 + for ML
17:54:19 rameshg87: you're the most active core that that meeting is attempting to serve -- and if you'd rather just have this time, that makes it easy
17:54:47 we completely miss mrda-away with this time, however. *sigh*
17:54:48 devananda: I am all for this time rather than having a not-much-of-a-meeting at 0500 GMT
17:54:58 rameshg87: thanks
17:55:10 is there some other time that works?
17:55:17 ok - i appreciate everyone's feedback. will get a post up shortly
17:55:50 rloo: there is no time that works for everyone, and this seems to work for the majority pretty well, and we're all used to it
17:56:12 also, 5 minutes left - and it's open discussion :)
17:56:15 devananda: well, i mean another time that works for most cores + others that can't make this meeting.
17:56:48 dtantsur has something: https://review.openstack.org/#/c/166386/
17:57:04 of course, it is microversion related
17:57:12 my beloved microversions :)
17:57:27 dtantsur: what do you want us to do about that? +1?
17:57:48 dtantsur: fwiw, I would like to just call it "API version negotiation"
17:57:54 because there's really nothing "micro" about it
17:57:56 I'm mostly just bringing attention, in case someone has time to help them land it
17:58:08 ++ for versions
17:58:26 what about milliversions?
17:58:32 :P
17:58:51 names are hard, tho yeah negotiation makes more sense, at least in my understanding
17:58:51 dtantsur: http://hintjens.com/blog:85
17:59:02 "The End of Software Versions"
17:59:20 dtantsur: shorthand "μv"
18:00:01 oh awesome!
18:00:02 ok I gotta run, thanks everyone
18:00:07 that's time
18:00:14 great meeting all
18:00:15 hmm right
18:00:16 cheers - thanks everyone! see you next time!
18:00:17 thanks!
18:00:20 thanks
18:00:22 :)
18:00:22 bye
18:00:27 #endmeeting
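
[Editor's note: since API version negotiation threads through the whole meeting, here is a small illustration of how a client pins an ironic API microversion over HTTP. The endpoint is a placeholder and the min/max response headers are an assumption about ironic's behavior; the request header name follows ironic's X-OpenStack-Ironic-API-Version convention.]

```python
import requests

IRONIC_API = "http://ironic.example.com:6385"  # placeholder endpoint

# Ask the server to behave as a specific microversion; without the
# header, the server falls back to a default version of its choosing.
resp = requests.get(
    IRONIC_API + "/v1/nodes",
    headers={"X-OpenStack-Ironic-API-Version": "1.9"},
)

# The server advertises the version range it can negotiate within,
# so a client can detect whether a feature's version is available.
print(resp.headers.get("X-OpenStack-Ironic-API-Minimum-Version"))
print(resp.headers.get("X-OpenStack-Ironic-API-Maximum-Version"))
```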