Friday, 2016-12-16

shaohe_fengzhenguo: $ openstack baremetal compute instance show 2ff1bf8b-08c0-486f-8691-05f45fe98f57 -c created_at -c status -f json00:31
shaohe_fengzhenguo: {00:31
shaohe_feng  "status": "building",00:31
shaohe_feng  "created_at": "2016-12-15T03:38:19+00:00"00:31
shaohe_feng}00:31
shaohe_fengzhenguo: $ date00:32
shaohe_fengThu Dec 15 20:25:30 CST 201600:32
shaohe_fengY00:32
shaohe_fengzhenguo: It has been building for a long time00:33
shaohe_fengzhenguo: IMHO, something wrong with the instance00:33
shaohe_fengzhenguo: maybe network or volume00:34
zhenguoshaohe_feng: with the taskflow patch?00:34
shaohe_fengzhenguo: In this scenario, what should the user do?00:34
shaohe_fengzhenguo: yes00:34
zhenguoshaohe_feng: the only way is to delete it00:35
zhenguoshaohe_feng: I have tested with the taskflow many times and it worked00:36
shaohe_fengzhenguo: so, there is no way for users to know what's wrong?00:36
zhenguoshaohe_feng: yes, we should make sure it goes to error state00:37
zhenguoshaohe_feng: you can test without the taskflow patch first to make sure your env is ok00:37
shaohe_fengzhenguo: so I need to test these cases: 1. create the instance as normal 2. terminate the creating process, such as delete. 3. delete the instance during creating 4. restart the nimble-engine during creating00:39
shaohe_fengzhenguo: any other cases?00:39
shaohe_fengzhenguo: 5. make the network error during creating00:40
zhenguoshaohe_feng: please don't test delete related cases now as we haven't refactored delete yet00:40
shaohe_fengzhenguo: so 1, 4, 500:41
zhenguoshaohe_feng: and stop ironic during creating00:42
shaohe_fengzhenguo: OK. 6. stop ironic00:42
zhenguoshaohe_feng: ok, thanks00:43
shaohe_fengzhenguo: now, let's discuss my scenario. Do you think users should know some details about the creation, so they can tell why it stays in building status for a long time?00:45
shaohe_fengzhenguo: or have we missed some edge case? In my scenario, nimble should mark this as error; it should perceive that something is wrong with the create.00:48
*** kevinz has joined #openstack-nimble01:10
*** yuntongjin has joined #openstack-nimble01:38
*** yuntongjin has quit IRC01:53
shaohe_fengzhenguo: I'm in office now01:53
zhenguoshaohe_feng: hi01:57
*** yuntongjin has joined #openstack-nimble01:57
shaohe_fengzhenguo: Hi.02:06
shaohe_fengzhenguo: re-test.02:06
shaohe_fengzhenguo: it reports an error. I'm debugging it.02:06
shaohe_fengzhenguo: Failed to retrieve node by instance_uuid 0c95c2b8-0ff4-44e8-8529-bdef652257af from ironic api: Remote error: NotFound Not Found (HTTP 404)02:07
shaohe_fengzhenguo: $ openstack baremetal node list -f json -c UUID |grep UUID02:08
shaohe_fengzhenguo: "UUID": "f3289f56-f272-4695-b699-eb1033eb278f"02:08
shaohe_feng    "UUID": "26f4524b-a1a9-48d9-b598-fc7bc104c3f4"02:09
shaohe_feng    "UUID": "c83ee6e7-0d46-4451-a2f1-613ca73611f6"02:09
shaohe_fengzhenguo: but I run the ironic command, it can get the node.02:09
zhenguoshaohe_feng: with the taskflow patch?02:09
zhenguoshaohe_feng: that's because I have cleaned the instance_info from ironic node02:10
zhenguoshaohe_feng: so you can list the nodes but can't get it by instance_uuid :(02:10
zhenguoliusheng: if you are available, please help to test the taskflow patch as well, thanks!02:13
liushengzhenguo: ok, will do that later02:14
zhenguoas it changes the basic codes, may block many other works02:14
zhenguoliusheng: thanks02:14
shaohe_fengzhenguo: OK. Ironic node: f3289f56-f272-4695-b699-eb1033eb278f virtual to physical interface count mismatch (Vif count: 2, Pif count: 1)02:15
liushengzhenguo: np02:15
shaohe_fengzhenguo: I configured two networks, so it cannot work.02:15
shaohe_fengzhenguo: I need to remove one of the networks02:16
zhenguoshaohe_feng: yes, it will raise an exception in the network task, but not revert, only setinstanceinfo task revert02:16
zhenguoshaohe_feng: no, it's a test case02:16
zhenguoshaohe_feng: so you are stuck in the building status, right?02:16
shaohe_fengzhenguo: it reports error. This is OK.02:17
zhenguoshaohe_feng: and the instance is set to be error?02:17
shaohe_fengzhenguo: testcase 7. with wrong parameters to create, for 2 networks. The instance status is "error"02:18
shaohe_fengzhenguo: yes.02:18
zhenguoshaohe_feng: nice02:18
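[Editor's sketch] The "Vif count: 2, Pif count: 1" error above comes from a guard that compares the number of requested networks with the number of physical ports on the node. A minimal sketch of such a guard follows. The function name, the exception class, and the choice to fail only when there are more virtual than physical interfaces are assumptions for illustration, not code taken from nimble.

```python
class InterfacePlugError(Exception):
    """Hypothetical error type; the real exception class is not shown in the log."""


def check_interface_counts(node_uuid, vifs, pifs):
    """Pair each virtual interface with a physical one, or fail early.

    Assumption: the deploy is refused only when there are more virtual
    interfaces (requested networks) than physical ports on the node.
    """
    if len(vifs) > len(pifs):
        raise InterfacePlugError(
            "Ironic node: %s virtual to physical interface count mismatch "
            "(Vif count: %d, Pif count: %d)" % (node_uuid, len(vifs), len(pifs)))
    # zip stops at the shorter list, so extra physical ports stay unused.
    return list(zip(vifs, pifs))
```

With two requested networks and one physical port, the call raises, and the engine is then expected to move the instance to "error", as seen in test case 7.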
zhenguoshaohe_feng: on yesterday's party, some china mobile guy said they will have more than 30000 bare metals in future, really exciting actually02:22
shaohe_fengzhenguo: really a good news.02:23
shaohe_fengzhenguo: https://etherpad.openstack.org/p/nimble-testcases-for-create02:23
shaohe_fengzhenguo: liusheng: we can add our manually tests in this link.02:24
zhenguoshaohe_feng: ok, I'm adding UT now, but it can't test the real flow, only mock execute and revert in a single task.02:25
shaohe_fengzhenguo: How long does it take for building?02:32
zhenguoshaohe_feng: just a few minutes02:32
shaohe_fengzhenguo: 4 minutes have passed, and it is still in building status02:32
zhenguoshaohe_feng: you can check the engine logs, it uses a looping call to ironic to get the node status02:32
shaohe_fengzhenguo: yes.02:33
shaohe_fengzhenguo: 2016-12-15 22:27:00.785 DEBUG nimble.engine.flows.create_instance [-] Current ironic node state is wait call-back from (pid=18104) _wait_for_active /opt/stack/nimble/nimble/engine/flows/create_instance.py:29802:34
zhenguoshaohe_feng: hah02:35
shaohe_fengzhenguo: $ sudo virsh list --all02:35
shaohe_feng Id    Name                           State02:36
shaohe_feng----------------------------------------------------02:36
shaohe_feng 2     node-0                         running02:36
shaohe_fengzhenguo: node-0 is running, is something wrong?02:36
zhenguoshaohe_feng: that means it's power on, and wait_call_back means it's still deploying02:37
zhenguoshaohe_feng: so, it's still in wait-call-back?02:38
shaohe_fengzhenguo: yes.02:40
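[Editor's sketch] The engine behaviour described above (poll ironic until the node leaves "wait call-back", give up after a timeout) can be sketched as a plain polling loop. This is not the code in create_instance.py; the function name, the DeployTimeout class, and the default timeout and interval are assumptions. The state strings "active", "deploy failed" and "wait call-back" are ironic provision states.

```python
import time


class DeployTimeout(Exception):
    """Hypothetical stand-in for the timeout error raised by the engine."""


def wait_for_active(get_state, timeout=1800.0, interval=2.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Poll get_state() until the node is active, fails, or time runs out."""
    deadline = clock() + timeout
    while True:
        state = get_state()
        if state == "active":
            return state
        if state == "deploy failed":
            raise RuntimeError("node entered state %r" % state)
        # Any other state (for example "wait call-back") means: keep waiting.
        if clock() >= deadline:
            raise DeployTimeout(
                "Timeout reached while waiting for callback, last state %r" % state)
        sleep(interval)
```

The sleep and clock parameters exist only so the loop can be exercised quickly in a test; a timeout raised here is what would trigger the revert and rescheduling discussed later in the log.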
shaohe_fengzhenguo: but, in mysql02:40
shaohe_fengzhenguo: mysql> select node_uuid from instances where UUID="4010b1f7-ac43-4d73-bfeb-8c9d22f71b52";02:40
shaohe_feng+-----------+02:40
shaohe_feng| node_uuid |02:40
shaohe_feng+-----------+02:40
shaohe_feng| NULL      |02:40
shaohe_feng+-----------+02:40
zhenguoshaohe_feng: so the instance status is ?02:41
*** openstackgerrit has joined #openstack-nimble02:41
openstackgerritliusheng proposed openstack/python-nimbleclient: Add support for updating instance  https://review.openstack.org/41159102:41
shaohe_fengzhenguo: $ openstack baremetal compute instance show 4010b1f7-ac43-4d73-bfeb-8c9d22f71b52 -f json -c status02:41
shaohe_fengzhenguo: { "status": "building" }02:42
zhenguoshaohe_feng: that's weird, some exceptions in engine logs?02:42
zhenguoshaohe_feng: I think it triggers the setinstanceinfo revert to clean instance_info and instance_uuid from the ironic node02:43
zhenguoshaohe_feng: maybe loopingcall timeout is raised02:43
shaohe_fengzhenguo: many logs on  "Current ironic node state is wait call-back"02:44
zhenguoshaohe_feng: no other logs?02:44
shaohe_fengzhenguo: so many logs, let me look for other clues02:45
shaohe_fengzhenguo: another log "Running periodic task EngineManager._sync_node_resources"02:46
shaohe_fengzhenguo: this is not an error02:47
zhenguoshaohe_feng: so there is no error log at all?02:47
zhenguoshaohe_feng: you can show the ironic node to see whether it has instance_uuid and instance_info02:48
shaohe_fengzhenguo: yes, and other logs for AMQP, receive msg, and reply msg.02:48
shaohe_fengzhenguo: yes, good idea.02:48
shaohe_fengzhenguo: in mysql or cmd-line?02:49
zhenguoshaohe_feng: CLI02:49
zhenguoironic node-show02:49
liushengzhenguo: shaohe_feng I can create an instance successfully with the taskflow patch02:51
zhenguoliusheng: nice02:51
zhenguoliusheng: and maybe need to test with some exceptions02:52
liushengzhenguo: I assume the current Nimble cannot work with most exception scenarios, hah02:54
zhenguoliusheng: hah,02:55
shaohe_fengzhenguo: $ openstack baremetal node list |grep f3289f56-f272-4695-b699-eb1033eb278f02:55
shaohe_fengzhenguo: | f3289f56-f272-4695-b699-eb1033eb278f | node-0 | 4010b1f7-ac43-4d73-bfeb-8c9d22f71b52 | power on    | wait call-back     |02:55
zhenguoliusheng: yes, although it can't work, I want to see whether it reverts to a state we expected.02:55
shaohe_fengzhenguo: it has instance UUID 4010b1f7-ac43-4d73-bfeb-8c9d22f71b52, that's right02:56
zhenguoshaohe_feng:yes02:56
shaohe_fengzhenguo: and it is "wait  call-back"02:56
zhenguoshaohe_feng: there is still no error log in your engine log?02:57
zhenguoshaohe_feng: and your instance isn't associated with any node?02:57
shaohe_fengzhenguo: $ openstack baremetal compute instance list |grep 4010b1f7-ac43-4d73-bfeb-8c9d22f71b5202:58
shaohe_fengzhenguo: | 4010b1f7-ac43-4d73-bfeb-8c9d22f71b52 | test_instance1 |               | building |          |02:58
shaohe_fengzhenguo: yes. let me try the mysql again02:58
liushengzhenguo, shaohe_feng: which exception scenarios do you want to test?03:01
zhenguoliusheng: https://etherpad.openstack.org/p/nimble-testcases-for-create03:01
liushengzhenguo: ok03:02
shaohe_fengzhenguo: $ openstack baremetal node list03:02
shaohe_fengzhenguo: the status of first node, node-0 is "clean wait"03:03
shaohe_fengzhenguo: the status node-1 is "wait call-back"03:03
zhenguoshaohe_feng: so there's a rescheduling occurred03:04
zhenguoshaohe_feng: if so, there must be some logs03:04
shaohe_fengzhenguo: is this right: when the node is in "wait call-back", we should not assign the node-uuid to the instance?03:04
shaohe_fengzhenguo: yes, some logs03:07
shaohe_fengzhenguo: first there is a bug:03:08
shaohe_fengzhenguo: Failed to provision instance 4010b1f7-ac43-4d73-bfeb-8c9d22f71b52: Timeout reached while waiting for callback for node f3289f56-f272-4695-b699-eb1033eb278f03:08
liushengzhenguo, shaohe_feng is the *error* state of the instance expected after creating an instance with two networks?03:09
liushengIronic node: 0fe83120-4132-49ae-90b7-835198e6f155 virtual to physical interface count  mismatch (Vif count: 2, Pif count: 1)03:10
liushengtransitioned into state 'FAILURE' from state 'RUNNING'03:10
liushengNoValidNode: Exceeded max scheduling attempts 3 for instance 33db0e39-e7d9-4ea9-ae16-a084195a4b0003:10
zhenguoshaohe_feng: Timeout triggers the rescheduling process, but the new node-uuid should be set to the instance03:10
shaohe_fengzhenguo: yes, then rescheduling finds another node.03:11
zhenguoliusheng: yes, we just have one physical interface in ironic node03:11
liushengzhenguo: so it seems to work ok03:12
zhenguoliusheng: we expect the ironic node's instance info and instance uuid to have been cleaned in this scenario03:12
zhenguoshaohe_feng: yes, find another node and set it to the instance03:12
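[Editor's sketch] The rescheduling behaviour discussed here (on failure pick another node, set it on the instance, and end in "error" once the attempts are used up) can be sketched synchronously as below. In nimble the retry is an RPC cast back to create_instance, not a local loop, so this only illustrates the bookkeeping. The function name, the dict-based instance, the "active" success status, and the error message wording are assumptions.

```python
class NoValidNode(Exception):
    """Local stand-in named after the NoValidNode error seen in the log."""


def schedule_with_retries(instance, candidate_nodes, deploy, max_attempts=3):
    """Try up to max_attempts nodes; record the node that finally works."""
    for node_uuid in candidate_nodes[:max_attempts]:
        try:
            deploy(node_uuid)
        except Exception:
            # In the real flow the task reverts would clean the node here.
            continue
        instance["node_uuid"] = node_uuid
        instance["status"] = "active"
        return node_uuid
    # No attempt succeeded: leave no node attached and surface an error state.
    instance["node_uuid"] = None
    instance["status"] = "error"
    raise NoValidNode(
        "No node could be deployed within %d attempts for instance %s"
        % (max_attempts, instance["uuid"]))
```

The point of the sketch is the final branch: whenever every attempt fails, the instance should leave "building" and the node_uuid should be empty, which is the state the testers are checking for.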
liushengzhenguo: the instance uuid wasn't set on the ironic node in my env03:13
zhenguoliusheng: yes, because SetInstance task revert to clean it03:13
liushengzhenguo: so it is OK!03:14
zhenguoliusheng: yes, and the instance status is error, right?03:14
liushengzhenguo: yes03:14
zhenguoliusheng:nice03:14
zhenguoliusheng: if instance_info and instance_uuid are left in the ironic node, the node will never be used again03:15
liushengzhenguo: yes03:15
* zhenguo brb03:16
shaohe_fengzhenguo: so we must clean the instance_info and instance_uuid when no instance associated to this node.03:17
shaohe_fengzhenguo: Task 'nimble.engine.flows.create_instance.ScheduleCreateInstanceTask;instance:create' (43507aae-5419-4a4e-a5ca-2716e643cc18) transitioned into state 'SUCCESS' from state 'RUNNING' with result 'None'03:18
shaohe_fengzhenguo: So here, 43507aae-5419-4a4e-a5ca-2716e643cc18 is the uuid of a task? but  I did not find a task db in mysql03:19
shaohe_fengzhenguo: and the state 'SUCCESS' or 'RUNNING' is also the tasks?03:19
shaohe_fengzhenguo: seem all log is OK.  and it reports find a port successfully for the node.03:21
shaohe_fengzhenguo: until this error:03:23
shaohe_fengzhenguo: InstanceDeployFailure: Failed to provision instance 4010b1f7-ac43-4d73-bfeb-8c9d22f71b52: Timeout reached while waiting for callback for node f3289f56-f272-4695-b699-eb1033eb278f03:23
zhenguoshaohe_feng: yes03:27
*** yuntongjin has quit IRC03:28
zhenguoshaohe_feng: on InstanceDeployFailure, it will trigger revert: first the CreateInstance task revert, then the Network revert, then the SetInstanceInfo revert, and then the OnFailureReschedule revert, which casts an rpc call to create_instance again03:29
shaohe_fengzhenguo: why  Timeout?03:29
zhenguoshaohe_feng: then if max attempts of rescheduling reached, the Schedule task revert will clean the node_uuid03:29
zhenguoshaohe_feng: because the node status is in wait call back for a long time03:30
shaohe_fengzhenguo: sounds reasonable.03:30
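[Editor's sketch] The revert order described above (last executed task reverts first, back to the scheduling task) can be illustrated with a tiny linear runner. This deliberately does not use the taskflow library; the Task class and run_linear are hypothetical stand-ins, and the short task names are taken from the conversation rather than from the patch.

```python
class Task:
    """Tiny stand-in for a flow task; not the taskflow library's Task class."""

    def __init__(self, name, fail=False):
        self.name = name
        self.fail = fail

    def execute(self, log):
        log.append("execute " + self.name)
        if self.fail:
            raise RuntimeError(self.name + " failed")

    def revert(self, log):
        log.append("revert " + self.name)


def run_linear(tasks, log):
    """Run tasks in order; on failure revert every started task in reverse."""
    started = []
    try:
        for task in tasks:
            started.append(task)  # the failing task gets reverted as well
            task.execute(log)
    except Exception:
        for task in reversed(started):
            task.revert(log)
        raise


def demo():
    log = []
    flow = [Task("Schedule"), Task("OnFailureReschedule"),
            Task("SetInstanceInfo"), Task("Network"),
            Task("CreateInstance", fail=True)]
    try:
        run_linear(flow, log)
    except RuntimeError:
        pass
    return [entry for entry in log if entry.startswith("revert")]
```

Calling demo() returns the reverts in the order CreateInstance, Network, SetInstanceInfo, OnFailureReschedule, Schedule. Because the Schedule revert runs after the OnFailureReschedule revert has already cast a new create request, anything the Schedule revert cleans can collide with the retry.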
shaohe_fengzhenguo: why can't it call back?03:30
zhenguoshaohe_feng: there are many reasons,03:31
zhenguoshaohe_feng: like when the ironic node starts but can't get an IP from the neutron network03:31
zhenguoshaohe_feng: or something wrong with the IPA during deploying03:32
zhenguoshaohe_feng: there must be something wrong with your env, but we should also handle that error03:33
zhenguoshaohe_feng: so currently you only get one Timeout exception and a rescheduling, then no new node uuid is associated with the instance, but the node is in wait call back state?03:34
zhenguoshaohe_feng: during revering we also cleaned the create ports in neutron, and clean the port association information in ironic03:34
zhenguos/revering/ reverting03:35
shaohe_fengzhenguo: yes.03:40
shaohe_fengzhenguo: I can see the unplug log.03:40
shaohe_fengzhenguo: now another error:03:41
zhenguoshaohe_feng: so the instance status is still building and there is no node associated with it, right?03:41
shaohe_fengzhenguo: 2016-12-15 23:24:04.818 TRACE taskflow.engines.action_engine.engine NetworkError: Could not remove VIF 93405cd4-0f8e-4cfa-b0fb-385c6fb8a161 of instance 4010b1f7-ac43-4d73-bfeb-8c9d22f71b52, possibly a network issue: auth_url was not provided to the Neutron client03:41
shaohe_fengzhenguo: let me check03:41
shaohe_fengzhenguo: it is still in building03:42
shaohe_fengzhenguo: I found that my keystone uses fernet keys this time, in the env built by devstack03:43
zhenguoshaohe_feng: I found an issue: after OnFailureRescheduleTask casts a create instance rpc, the ScheduleTask revert will clean the node_uuid, so there's a race03:44
zhenguoshaohe_feng: maybe it's not proper to add the clean node_uuid action to the ScheduleTask revert03:45
zhenguoshaohe_feng: I will find another place to clean the instance node_uuid, and the reason why I want to clean this is that I don't want to do the resource cleanup again if our revert is successful03:47
* zhenguo away for lunch04:04
*** yuntongjin has joined #openstack-nimble06:02
openstackgerritliusheng proposed openstack/nimble: Fix the AttributeError when list available zones  https://review.openstack.org/41165506:59
shaohe_fengzhenguo: still building.07:07
zhenguoshaohe_feng: the ironic node status is...?07:07
shaohe_fengzhenguo: IMHO, the scheduler has tried all nodes, and failed, so it should be error.07:08
shaohe_fengzhenguo: let me check.07:08
shaohe_fengzhenguo: You think there is a race, so the node-uuid does not get attached to the instance?07:09
shaohe_fengzhenguo: $ openstack baremetal node list07:10
shaohe_feng| UUID                                 | Name   | Instance UUID | Power State | Provisioning State | Maintenance |07:10
shaohe_feng| f3289f56-f272-4695-b699-eb1033eb278f | node-0 | None          | power off   | clean failed       | True        |07:10
shaohe_feng| 26f4524b-a1a9-48d9-b598-fc7bc104c3f4 | node-1 | None          | power off   | clean failed       | True        |07:10
shaohe_feng| c83ee6e7-0d46-4451-a2f1-613ca73611f6 | node-2 | None          | power off   | available          | False       |07:11
*** yuntongjin has quit IRC07:11
shaohe_fengzhenguo: call you by wechat?07:12
zhenguoshaohe_feng: so it's still in building status? seems it didn't try all nodes, as node-2 is still available07:27
zhenguoshaohe_feng: you can call my phone07:28
zhenguoshaohe_feng: so I think there are two Timeout exceptions in your log, and rescheduling doesn't reach the max attempts, right?07:29
openstackgerritMerged openstack/nimble: Replace six.iteritems() with .items()  https://review.openstack.org/40782407:29
shaohe_fengzhenguo: why does the scheduler stop trying?07:30
zhenguoshaohe_feng: not sure07:32
zhenguoshaohe_feng: any new clue in your log?07:32
zhenguoshaohe_feng: please remove the revert method in ScheduleTask, and try again07:33
shaohe_fengzhenguo:  Could not remove VIF 93405cd4-0f8e-4cfa-b0fb-385c6fb8a161 of instance 4010b1f7-ac43-4d73-bfeb-8c9d22f71b52, possibly a network issue: auth_url was not provided to the Neutron client07:37
shaohe_fengzhenguo: Traceback07:37
shaohe_feng2016-12-15 23:24:04.801 TRACE nimble.common.neutron NoAuthURLProvided: auth_url was not provided to the Neutron client07:37
zhenguoshaohe_feng: maybe something wrong with your keystone07:37
shaohe_fengzhenguo: so, is this also the reason the scheduler stopped?07:38
zhenguoshaohe_feng: I read an openstack insight report this morning, it said keystone has changed its default token type, maybe that's the cause of your problem07:39
zhenguoshaohe_feng: no,07:39
shaohe_fengzhenguo: I updated all project repos before running devstack07:40
zhenguoshaohe_feng: you can search for the scheduler retry information in your logs to see why, it will show the number of retries07:40
zhenguoshaohe_feng: maybe that's the problem07:40
shaohe_fengzhenguo: what's the default value.07:40
zhenguoshaohe_feng: let me check07:41
zhenguoshaohe_feng: changed from uuid to fernet07:43
shaohe_fengzhenguo: Oh, got it. I did not upgrade the devstack07:47
shaohe_fengzhenguo: so the devstack should help me to create the fernet dir and key.07:47
zhenguoshaohe_feng: not sure, but you can try to upgrade devstack first07:48
shaohe_fengzhenguo: yes.07:48
shaohe_fengzhenguo: I should fix it manually.07:48
zhenguoshaohe_feng:yes07:49
shaohe_fengzhenguo: I should not fix it manually.07:49
zhenguoshaohe_feng: oh, yes, haha07:49
shaohe_fengso bad.07:50
zhenguoshaohe_feng: I talked with some guys at yesterday's party, seems they are curious about what capabilities we can provide in the O release07:53
zhenguoand liusheng :P07:54
zhenguoshaohe_feng, liusheng: and someone invited me to introduce nimble on next Beijing OpenStack Meetup activities.07:56
liushengzhenguo: :)07:57
zhenguoliusheng: hah07:58
*** yuntongjin has joined #openstack-nimble07:58
liushengzhenguo: I am coding, what are you talking about, lol07:58
zhenguoliusheng: lol07:59
liushengzhenguo: it is a good chance to introduce Nimble to more people07:59
zhenguoliusheng: yes, maybe attract new contributors08:00
liushengzhenguo: yeah, really hope more new comers08:01
zhenguoliusheng: not that easy, they said many big tent projects or even core projects don't have as many contributors as before07:02
liushengzhenguo: yes, that is current situation08:03
zhenguoliusheng: yes, so actually we are more active than many big tent projects08:06
liushengzhenguo: hah, if there is a real requirement, I think more contributors will be interested in it08:09
zhenguoliusheng: yes, hah08:12
shaohe_fengzhenguo: call again08:13
zhenguoshaohe_feng: our espace can't call that number :(08:13
shaohe_fengzhenguo: let me call you08:13
zhenguoshaohe_feng: ok08:13
shaohe_fengzhenguo: your number?08:14
zhenguoshaohe_feng: oh, I don't have a number, only my phone number, lol08:14
shaohe_fengzhenguo: espace number?08:15
zhenguoshaohe_feng: I remember JF used a web conference tool before08:15
shaohe_fengzhenguo: +1 (861) 832-903508:15
zhenguoshaohe_feng: no, it's an internal chat tool08:15
shaohe_fengzhenguo: you can call me08:15
openstackgerritliusheng proposed openstack/python-nimbleclient: Support availability zone in nimbleclient  https://review.openstack.org/41169008:15
zhenguoshaohe_feng: let me try08:16
openstackgerritliusheng proposed openstack/nimble: Fix the AttributeError when list available zones  https://review.openstack.org/41165508:22
zhenguoliusheng: have you tested the 'uuid as primary key' patch08:34
liushengzhenguo: yes08:35
zhenguoliusheng: ok, thanks08:35
zhenguoliusheng: thanks for taking over the client work, really appreciate :)09:05
liushengzhenguo: np:)09:14
*** yuntongjin has quit IRC09:16
openstackgerritMerged openstack/nimble: Use uuid as the primary key  https://review.openstack.org/40516509:17
openstackgerritMerged openstack/nimble: Fix the AttributeError when list available zones  https://review.openstack.org/41165509:20
openstackgerritliusheng proposed openstack/python-nimbleclient: Add support for setting power state for instance  https://review.openstack.org/41172909:41
openstackgerritliusheng proposed openstack/python-nimbleclient: Add support for setting power state for instance  https://review.openstack.org/41172909:49
*** kevinz has quit IRC10:02
openstackgerritZhenguo Niu proposed openstack/nimble: Add create instance taskflow  https://review.openstack.org/40355510:06
shaohe_fengzhenguo: the node uuid is still not assigned. I need to go through the patch.10:15
*** openstackgerrit has quit IRC10:18
*** openstackgerrit has joined #openstack-nimble11:31
openstackgerritZhenguo Niu proposed openstack/nimble: Add create instance taskflow  https://review.openstack.org/40355511:31
openstackgerritZhenguo Niu proposed openstack/nimble: Remove unused fake_constants.py  https://review.openstack.org/41176711:35
*** shihanzhang has quit IRC13:12
openstackgerritMerged openstack/nimble: Remove unused fake_constants.py  https://review.openstack.org/41176714:40
openstackgerritZhenguo Niu proposed openstack/nimble: Add create instance taskflow  https://review.openstack.org/40355515:15
*** liusheng has quit IRC18:58
*** liusheng has joined #openstack-nimble19:21
