14:00:34 #startmeeting cinder
14:00:35 Meeting started Wed Jun 17 14:00:34 2020 UTC and is due to finish in 60 minutes. The chair is rosmaita. Information about MeetBot at http://wiki.debian.org/MeetBot.
14:00:36 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
14:00:38 The meeting name has been set to 'cinder'
14:00:44 o/
14:00:44 #topic roll call
14:00:48 Hi
14:00:49 Hi
14:00:50 hi
14:00:50 o/
14:00:58 hi
14:01:06 o/
14:01:15 o/
14:01:24 hi! o/
14:01:45 nice turnout!
14:01:46 #link https://etherpad.openstack.org/p/cinder-victoria-meetings
14:02:06 hi
14:02:20 #topic announcements
14:02:28 o/
14:02:45 thanks to everyone who filled out the poll to determine a good time for the virtual midcycle
14:02:50 which is next week!
14:03:00 the winner is:
14:03:11 Wednesday 24 June 1400-1600 UTC
14:03:26 so the first hour overlaps with the time of the usual cinder meeting
14:03:38 so everyone should be able to attend at least the first part
14:03:52 and we will need topics:
14:04:02 #link https://etherpad.opendev.org/p/cinder-victoria-mid-cycles
14:04:03 hi
14:04:42 since it's the week before spec freeze, anyone working on a spec who has questions or needs some guidance, the virtual midcycle would be a good place to get answers
14:05:02 anyone working on a spec should check out the virtual PTG etherpad
14:05:19 each spec was discussed, and there are comments on the etherpad
14:05:34 #link https://etherpad.opendev.org/p/victoria-ptg-cinder
14:05:50 and, of course, we can discuss other issues as well
14:05:57 mep
14:06:16 i know it's coming up fast, but this is a short cycle, and spec freeze is in 2 weeks
14:06:58 rosmaita, we will have only 1 meeting right ? (of 2 hours)
14:07:01 Crazy.
14:07:16 whoami-rajat: well, there will be part 2, but at week R-9
14:07:24 so yes, just one meeting before spec freeze
14:07:31 rosmaita, yep. ack.
14:07:32 thanks
14:07:47 #action rosmaita announcement to ML about virtual midcycle
14:08:05 we'll do bluejeans again as a videoconf technology
14:08:46 the other topic is that last time i looked, all devstack-based zuul jobs are broken
14:09:00 #link http://lists.openstack.org/pipermail/openstack-discuss/2020-June/015432.html
14:09:13 so recheck is futile until that's fixed
14:09:59 they've proposed a patch which is merged in master but when i looked at grenade failure, it shows error on previous code, not sure why
14:10:45 https://review.opendev.org/#/c/577955/27/lib/apache
14:10:55 the new uwsgi brought havoc
14:11:16 but the stable branches don't seem to be doing so well, either
14:11:17 whoami-rajat: grenade installs the previous branch
14:11:27 #link https://zuul.opendev.org/t/openstack/builds?job_name=tempest-full#
14:11:34 I thought at least train was fixed as of yesterday.
14:11:41 so the sequence is: make grenade non-voting, make it non-voting also in the backport, fix the issue in the previous branch, re-enable it
14:12:01 but yeah, until stein things should work, but not rocky
14:12:04 and queens
14:12:15 tosky, shouldn't it use the previous uwsgi version then? isn't it constrained for every release?
14:12:29 ok, so we think stein through master should work now?
14:13:10 i still see failure in my master patch i updated yesterday https://review.opendev.org/#/c/735513/
14:14:35 uhm, but that was yesterday
14:14:58 whoami-rajat: this patch has recent results https://review.opendev.org/#/c/712832/
14:15:00 hopefully ajaeger will send an update to the ML when everything is back to normal
14:15:29 #topic python2.7 compatibility issue
14:15:36 oh, thanks tosky
14:16:21 ok, so we've been talking about backports being more difficult now that master and ussuri are python3-only, and most of the stable branches are not
14:16:41 so, here's a case in point so we can see what we're up against
14:16:52 #link https://review.opendev.org/#/c/733100/
14:17:10 open that link and take a look at the zuul results
14:17:49 also, since this is a driver change, there was third party ci run on it manually for stable/train
14:18:04 but, take a look at this:
14:18:17 #link https://review.opendev.org/#/c/733100/1/os_brick/privileged/scaleio.py
14:18:36 look at line 99 and think for a minute before opening the next link
14:18:53 #link https://launchpad.net/bugs/1883654
14:18:53 Launchpad bug 1883654 in os-brick train "os-brick fail to get password from config file" [Undecided,In progress] - Assigned to hamza (alqtaishat)
14:19:27 so, you may remember hearing me say that we need to have good tests that will break on python3-only features
14:19:41 :(
14:19:42 and you may also notice that i +2'd the above patch
14:20:02 anyway, here's a test:
14:20:05 #link https://review.opendev.org/#/c/735989/6/os_brick/tests/privileged/test_scaleio.py
14:20:27 which is a 68 line test file to test a function that contains 4 statements
14:20:41 and i'm not even saying that's a great test
14:20:56 for instance, there's a pretty good chance that it doesn't work on windows
14:21:34 but, the point is that we need to up our test game
14:21:40 or at least i do :D
14:21:48 fwiw, there is no scaleio connector for windows
14:22:02 cool
14:22:21 but i guess this unit test would run anyway
14:22:37 i think i left a comment on the master version of this patch regarding testing that method
14:22:37 the problem is using the tempfile
14:22:51 i believe you did
14:23:09 so bonus points for whoami-rajat
14:23:28 :D
14:23:48 we were just testing if the method was called but not going inside it
14:24:08 i'm not sure if it would've helped with this situation
14:24:24 whoami-rajat++
14:24:41 yeah, so the code wasn't completely un-tested
14:24:47 just not tested thoroughly enough
14:25:11 my main point is that reviewers should feel empowered to demand more tests
14:25:28 and reviewees need to be patient when more tests are demanded
14:26:04 rosmaita++
14:26:08 any comments or observations before we move on?
14:26:40 i guess i have one more -- anyone guess why the third-party CI passed?
14:26:49 I agree
14:26:56 because we made everyone move to Python 3!!!
14:27:24 :)
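
A minimal sketch of the kind of test being discussed here. The function under test is a stand-in, not the real os-brick scaleio helper, and the option name and config layout are made up; the point is only that the test writes a real temporary config file and runs the helper end to end, so a Python-3-only construct inside the helper would fail when the suite runs under Python 2 instead of being hidden behind a mocked call.

    import configparser
    import os
    import tempfile
    import unittest


    def read_password_from_config(path, section='DEFAULT', option='san_password'):
        # Stand-in for the config-reading helper under discussion; the real
        # os-brick code differs, this only illustrates the test pattern.
        parser = configparser.ConfigParser()
        parser.read(path)
        return parser.get(section, option)


    class TestReadPasswordFromConfig(unittest.TestCase):
        def test_reads_password_from_real_file(self):
            # Exercise the helper against an actual temporary file instead of
            # only asserting that it was called.
            fd, path = tempfile.mkstemp(suffix='.conf')
            self.addCleanup(os.remove, path)
            with os.fdopen(fd, 'w') as f:
                f.write('[DEFAULT]\nsan_password = s3cr3t\n')
            self.assertEqual('s3cr3t', read_password_from_config(path))


    if __name__ == '__main__':
        unittest.main()
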
14:27:26 rosmaita, on stable branches also the CI running is py3?
14:27:36 i guess it's the same CI
14:27:37 so even if third-party CI is testing stable branches, we can't count on that to detect problems
14:28:23 ok, so in the next topic we can discuss why this is a particularly painful bug
14:28:36 #topic updating OSSN-0086
14:28:54 ok, so the patch that introduced that bug is one of the patches addressing a security issue
14:29:06 the security issue had to be fixed in both cinder and os-brick
14:29:17 luckily, the cinder side of the fix doesn't contain this problem
14:29:28 (and hopefully doesn't contain a different one)
14:29:58 so, ussuri and master are python3-only, so we don't need to worry about those
14:30:23 but we do have T, S, R, and Q (plus some joker has proposed a backport from Q -> P)
14:30:39 at this point, ocata is dead to me, and pike is getting there
14:30:52 but we can discuss pike later
14:31:04 in the ossn, we promised to fix back to queens
14:31:14 so that's all i'm worried about now
14:31:36 here's the train patch, which still needs some work:
14:31:41 #link https://review.opendev.org/#/c/735989/
14:32:07 in particular, smcginnis and jungleboyj please look at the release note on that patch
14:32:30 i tested locally, it will be a clean backport from T to S
14:32:52 so once we get train fixed, that will be big progress
14:33:05 Ok.
14:33:21 so, the OSSN lists the patches. i will have to update it to include the new patch
14:33:49 we just released cinder and os-brick from train, i think we will have to do new releases?
14:34:12 so what we're looking at is: merge when gate is fixed, release new os-brick, wait for the u-c change to merge, update cinder requirements, release new cinder
14:34:29 ++
14:34:44 ok, thanks, just wanted to make sure i wasn't being overly dramatic
14:34:45 ++
14:35:30 i'll include a similar release note on the new cinder release, just saying you only need to upgrade if you are using that particular backend and using python 2.7
14:35:59 ok, so now when we get to Rocky and Queens, we don't release any more because they are in Extended Maintenance mode
14:36:52 the commit message for the cinder patch refers to the os-brick patch (which is now broken)
14:37:06 but it also mentions the OSSN, which i will update with the second brick patch
14:37:39 and i think if the commit message on the second patch says "Fix regression", anyone grabbing the first patch will notice the second in the git log?
14:38:15 plus, i can send out an embarrassing email to the ML giving operators a heads up
14:38:27 i guess that will take care of Rocky and Queens?
14:38:44 anyway, if anyone has a better suggestion, let me know
14:38:53 i am just thinking out loud here
14:39:30 so basically, what i really need is for stable cores to look for these patches
14:39:39 i will liberally ping people in IRC
14:39:51 ++
14:39:58 #topic expected behavior when the default volume type config option specifies a non-existent volume_type
14:40:48 ok, the context is the work in Train to make sure that every volume has a volume_type
14:41:00 because we used to allow no type, and that caused problems
14:41:27 so now we have __DEFAULT__ if an operator doesn't set a default type in the cinder conf
14:42:00 and we have a migration (i think in ussuri?) that makes any untyped volume __DEFAULT__
14:42:05 (is that correct whoami-rajat ?)
14:42:28 rosmaita, yes
14:42:31 cool
14:42:46 basically it modifies the volumes, snapshots, encryption tables but yes
14:43:22 Sounds familiar.
:-)
14:43:29 ok, so there is a slight regression at the REST API layer (nothing to do with the database)
14:43:58 and in fixing that, i am having an argument with whoami-rajat about expected behavior
14:44:28 so the issue is that before train
14:44:40 if the operator mis-configured the default volume type
14:44:44 with a non-existent type
14:45:02 we would log a message, and let the build succeed with no volume type
14:45:13 the build ?
14:45:22 building the volume
14:46:05 the code in Train changes this so that instead of continuing the volume creation
14:46:14 an exception is raised, and no volume
14:46:37 oh
14:46:38 i was thinking that the analog to the old behavior would be to allow the create to succeed, but with __DEFAULT__ as the volume type
14:46:46 failing the create is the right thing to do
14:46:58 #link https://review.opendev.org/#/c/639180/41/cinder/volume/volume_types.py
14:47:01 L#186
14:47:08 I think failing is the right thing to do
14:47:11 well, except you get a 202 and only find out the volume create failed later
14:47:12 it's a misconfigured cinder
14:47:43 we want to fail the create because it also enables the case where an admin wants to make users select a volume type instead of defaulting to one
14:48:14 ok, what about this case
14:48:16 so the intention is to set an invalid default on purpose ?
14:48:29 no, but it's logged for the operator
14:48:46 all i'm saying is that we have __DEFAULT__, why not use it?
14:48:55 in eharney's case an invalid default is specified with no intention of fixing it ?
14:48:57 The problem with failing is that it is an API change. Right?
14:49:05 do we check for an invalid volume type in the config at service start up?
14:49:13 not sure
14:49:18 the other place this will occur
14:49:18 i honestly don't remember exactly but this made sense as if we configure a wrong default type (the right one points to a specific backend) then the volumes will end up in a totally unexpected backend (with the __DEFAULT__ type)
14:49:29 failing should happen already for a volume type passed in that's invalid
14:49:53 At one point in history we were checking bad defaults at start-up.
14:49:54 eharney: We probably should have a check on start up.
14:50:04 smcginnis +1
14:50:15 ok, we can follow up with that
14:50:25 smcginnis: +1. it's a good suggestion
14:50:27 ++
14:50:39 the other case is if a user creates a volume from an image
14:50:43 the only downside to that is that they have to start cinder w/o a default set, then create the volume type, then change the default in cinder.conf, then bounce cinder
14:50:52 you can put metadata on the image saying what volume type you want
14:51:11 vs w/o the start check, they can specify the missing type, then start cinder and create it.
14:51:13 and since that's done by end-users, they are bound to make mistakes
14:51:47 do we want to fail the build, or give them the default?
14:52:06 they used to get a successful build with no volume type
14:52:08 create should fail with invalid volume type
14:52:17 as it does today if you pass in an invalid type
14:52:35 yeah, i don't think we want to quietly use the default in the image case either
14:52:50 ok
14:53:17 we could do a check at startup and throw a warning
14:53:22 I agree. We either need to fail on startup, or we should fail the volume creation.
14:53:37 well, for the image case, we can't check at startup
14:53:54 smcginnis: ++
14:54:08 Ok, then it sounds like we need to fail at creation.
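
A rough sketch of the start-up check floated above. This is not existing cinder code; the function and argument names are hypothetical. It only illustrates validating the configured default_volume_type against the defined types when the service starts, before any create request can trip over the misconfiguration.

    import logging

    LOG = logging.getLogger(__name__)


    def check_default_volume_type(configured_default, existing_type_names):
        # configured_default: value of the default_volume_type option (may be None)
        # existing_type_names: names of the volume types currently defined
        if not configured_default:
            # No default configured; users must pick a volume type explicitly.
            return
        if configured_default not in existing_type_names:
            LOG.error("default_volume_type '%s' does not exist; volume creates "
                      "relying on it will fail", configured_default)
            raise ValueError("configured default_volume_type '%s' not found"
                             % configured_default)

Whether such a check should raise and stop the service or only log a warning is exactly the open question in the discussion above, and it cannot cover the image-metadata case, which is only known at create time.
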
14:54:19 ok, so the conclusion is: if a volume-type is mis-specified, we do not give you a default, we fail the create
14:54:28 yes
14:54:29 which actually makes sense
14:54:31 Yes.
14:54:36 it's just different from pre-Train
14:55:05 but we already do it so we're fine :)
14:55:27 well, apparently we don't do it too much, or someone would have caught the bug earlier :)
14:55:44 for more justification: defaulting to a different type would be quite bad for someone who thinks they're using encrypted volumes but accidentally isn't
14:56:12 yeah, silently doing something that sort of works is never a good strategy
14:56:19 rosmaita, yep, that too!
14:56:38 i'll revise my patch to preserve the train behavior, then
14:56:54 ok, thanks for the discussion
14:57:02 #topic open discussion
14:57:12 LiangFang is having connection problems
14:57:28 but is asking people to please look at his patches for volume-local-cache
14:57:47 the brick patch is using six, which i think we don't want any more
14:58:08 but i forgot to actually leave that comment on the patch
14:58:14 ++
14:58:18 ok I'll take a look
14:58:39 hemna_ had some questions or observations in the cinder channel just before the meeting started
14:58:49 rosmaita thanks
14:58:53 Hi, same request here for the NFS encryption patch https://review.opendev.org/#/c/597148/
14:59:18 i had really good intentions to review both, but this unending security bug ate my time
14:59:28 so I was wondering why the volume's existing AZ isn't specified in the request spec during retype by default?
14:59:53 we will have to continue this in the cinder channel, horizon meets in 20 seconds
14:59:55 this ends up resulting in the scheduler picking a backend existing in a different AZ, which fails
14:59:59 heh ok
15:00:09 sorry about that
15:00:12 thanks everyone
15:00:15 #endmeeting