Friday, 2018-07-27

rm_workso I'm still not hearing anything that is a problem for my deployment00:00
dansmithheh, okay00:00
rm_workand if I can do it, others can do it00:00
dansmithhere's one problem: file injection has been deprecated :P00:00
rm_workyes, thanks00:00
rm_workI happened to notice that recently ;P00:01
*** tetsuro has joined #openstack-nova00:01
penickSo the root problem is you need to put arbitrary files in an instance, or you need instances to have x509 cert chains?00:02
rm_worki don't think we're going to get anywhere today on this, maybe we pick up at the PTG00:02
rm_workright now it's a cert-chain, PK, and agent config00:02
rm_workso I guess "arbitrary files"00:02
rm_workand they all contain data that we consider "sensitive"00:03
rm_work(obviously, in the case of the PK)00:03
penickAre they "shared" secrets, like the keypair for a public website? Barbican might be the right place for those.00:03
*** david-lyle has joined #openstack-nova00:03
rm_workwell, specifically the PK00:03
rm_workwe do use Barbican, but our instances have no way to auth against it00:03
rm_workone is shared00:04
rm_workthe other is generated specifically for the VM in question00:04
rm_work(the PK)00:04
*** shaohe_feng has quit IRC00:04
penickWe generate secrets on our instances, then have another system the instances call to have their csr signed, it asserts their identity before it's signed by our root of trust00:05
rm_workok so that there is the important bit00:05
johnsomSpeaking of FF - had to go take care of that. Yeah I think at least a floppy disk's worth of storage is reasonable. lol  Like the PXE boot image size.00:05
rm_workwe USE those certs/PK to assert identity00:05
*** harlowja has joined #openstack-nova00:05
rm_workhow do you assert VM's identity without that?00:05
rm_worki mean, that is exactly our workflow00:06
rm_workwell ... ALMOST our workflow00:06
rm_workwe reach out to the VM, not the other way around00:06
johnsomYeah, this was the whole discussion that led us to what was implemented years ago.00:06
*** shaohe_feng has joined #openstack-nova00:06
dansmithpenick is headed down the right path here, which is not to pass everything to nova and expect it to keep it (most people) or disavow it (some people), and only give nova enough information to let you interact with some service that can do what you want00:06
*** dklyle has quit IRC00:06
dansmithinformation that is not sensitive forever00:06
johnsomI'm just concerned that if we don't trust how we store and handle images we are in trouble before we even get to config data and establishing secure channels.00:07
*** linkmark has quit IRC00:07
penickWe create a signed bearer document that's time limited and place it in the instance, on boot the instance creates a PK and CSR, then sends those along with the attestation document (created as part of vendor data) to the token server, which verifies the signature in the attestation document (and then invalidates the document) then calls openstack to verify the details in the CSR00:08
rm_workwhich "we" is that?00:08
penickeg, ensure the IP, UUID, etc in the CSR match the instance00:08
rm_workwhich service00:08
rm_workbecause that does sound like the workflow we're aiming for00:08
penickthe service is called Athenz, and the system we've built to integrate it into OpenStack is called copper argos00:08
penickI have a talk on it, one sec..00:09
rm_worki was hoping to just glance at the repo00:09
rm_workhttps://github.com/yahoo/athenz ?00:09
penickhttps://www.openstack.org/videos/vancouver-2018/attestable-service-identity-with-copper-argos00:10
penickyup00:10
rm_workhttps://github.com/yahoo/athenz/blob/master/docs/copper_argos_dev.md00:10
penickAyup, that's it00:10
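(For reference, a minimal sketch of the instance-side flow penick describes above. This is not the actual Athenz/Copper Argos API: the token server URL, attestation document path, and field names are all assumptions. The point is only that the private key is generated on the instance and never leaves it, while the time-limited attestation document delivered via vendordata is what the token service verifies, and invalidates, before cross-checking the CSR with OpenStack and signing it.)

    import json
    import requests
    from cryptography import x509
    from cryptography.hazmat.primitives import hashes, serialization
    from cryptography.hazmat.primitives.asymmetric import rsa
    from cryptography.x509.oid import NameOID

    TOKEN_SERVER = "https://token.example.test/sign"      # assumed endpoint
    ATTESTATION_DOC = "/var/lib/cloud/attestation.json"   # assumed vendordata drop location

    def request_signed_cert(instance_name):
        # The private key is generated on the instance and never leaves it.
        key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
        # The CSR carries the identity details the token service will
        # cross-check against OpenStack (name here; IP/UUID and so on in a
        # fuller version).
        csr = x509.CertificateSigningRequestBuilder().subject_name(
            x509.Name([x509.NameAttribute(NameOID.COMMON_NAME, instance_name)])
        ).sign(key, hashes.SHA256())
        with open(ATTESTATION_DOC) as f:
            attestation = json.load(f)
        # The token service verifies the attestation document's signature,
        # invalidates it, checks the CSR details with OpenStack, then signs.
        resp = requests.post(TOKEN_SERVER, json={
            "csr": csr.public_bytes(serialization.Encoding.PEM).decode(),
            "attestation": attestation,
        })
        resp.raise_for_status()
        return key, resp.json()["certificate"]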
*** vladikr has quit IRC00:11
rm_workso basically, we're screwed once Stein hits, and we have to get something like this working before then? :P00:11
rm_worksounds like another day at the office, lol00:11
penickI feel like it benefits me to say Yes :)00:11
rm_workwe'll investigate00:11
dansmithrm_work: you should really read the spec you're freaking out about00:11
rm_workI did00:11
dansmith"Since personality file injection will still be supported with older microversions, there will be nothing removed from the backend compute code related to file injection"00:11
penickWe're eager to have other people use this, so lmk if y'all (who..are..you?) are interested in using Athenz. It'd be good to get other organizations using/contributing to Athenz00:12
rm_workyeah, but in Octavia we don't necessarily control the nova deployments00:12
rm_workso we can't guarantee they have the thing enabled00:12
rm_workbut we still need our stuff to work00:12
*** dklyle has joined #openstack-nova00:12
*** gbarros has joined #openstack-nova00:12
dansmithrm_work: oooh, I have good news for you00:12
rm_workpenick: we'd be writing something like that into Octavia00:12
dansmithrm_work: user_data will always work? see how nice it is to have features that don't come and go with the deployment choices? :)00:12
rm_worklol00:13
rm_workexcept user-data already doesn't work :P00:13
johnsomWell, nova is a stable api, so it shouldn't be going away any time soon or they are dropping their stable assertion....00:13
penickWe'll be using octavia with this in the near future. It's one of the things we have to suss out this qtr00:13
dansmithyou mean jamming a bus into your wallet won't work00:13
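(The restriction rm_work is alluding to: user_data is passed base64-encoded and the compute API rejects payloads longer than 65535 characters, so a cert chain, private key, and agent config all have to fit under that limit once encoded. A quick sketch of checking that; the file names are hypothetical.)

    import base64

    MAX_USER_DATA = 65535  # characters of base64 the nova API will accept

    def fits_in_user_data(*paths):
        blob = b"".join(open(p, "rb").read() for p in paths)
        return len(base64.b64encode(blob)) <= MAX_USER_DATA

    # e.g. fits_in_user_data("ca_chain.pem", "amphora.key", "agent.conf")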
penickBut, we already have Athenz in place00:13
penickdansmith: Well not with that attitude00:13
dansmithjohnsom: that's what I'm trying to point out00:13
rm_workbut you're saying it's already disabled in most nova deploys?00:14
dansmithjohnsom: which is what you get if you read a paragraph down below "and now lose your mind"00:14
*** dklyle_ has joined #openstack-nova00:14
dansmithrm_work: no, we're saying that file injection is disabled, but as you pointed out we're putting those personality files into the config drive the first time we make it00:14
*** david-lyle has quit IRC00:14
*** shaohe_feng has quit IRC00:15
rm_work[16:38:53]  <dansmith>so this has been disabled by default for libvirt for a long time,00:15
rm_work^^ so what did that mean?00:15
dansmithrm_work: file. injection.00:15
*** itlinux has joined #openstack-nova00:15
rm_workyes, which has always worked via personality files?00:15
dansmithrm_work: you saw the part where I said "I'm not sure how this is going into config drive" and then ... found and quoted the code right?00:15
*** shaohe_feng has joined #openstack-nova00:15
*** jamesde__ has quit IRC00:15
rm_workmaybe?00:15
johnsomdansmith I was shocked because we hadn't heard of this and it was the *way* to do this securely and reliably and user-data was  .... less than ideal00:16
rm_workhttps://github.com/openstack/nova/blob/master/nova/api/metadata/base.py#L191-L194 this link?00:16
rm_workI thought that was via libvirt using the thing you said was disabled00:16
dansmithjohnsom: you know that config drive is disable-able and depending on it is also not reliable yeah?00:16
dansmithrm_work: no00:16
dansmithrm_work: I get that it says libvirt there, but...00:17
*** jamesden_ has joined #openstack-nova00:17
*** dklyle has quit IRC00:17
*** Sundar has quit IRC00:17
johnsomdansmith We force require it as the metadata service was swiss cheese and blew up if you booted more than a few instances at a time00:17
rm_workif that's not "file injection" then I don't know00:17
dansmithrm_work: the rest of the spec is talking about file injection specifically, which has nothing to do with config drive and is all about violating the very sanctity of the image by forcing large things into small holes00:17
rm_workerr00:18
penickrm_work what's generating the secrets that you're putting into the instance? (amphora vms?)00:18
rm_workso *are we using file injection or not*?00:18
dansmithI'm serious, you should totes read the spec :)00:18
rm_workI read the spec00:18
rm_workseveral sections more than once00:18
rm_workso obviously whatever you're hinting at, i'm not going to get00:18
johnsomYeah, the terminology in that spec is super confusing compared to the nova API and client API00:18
dansmiththat's the point of the first #1 bullet00:19
rm_workthis whole conversation started because I asked "is what we are doing the deprecated file injection" and multiple people said "yes"00:19
dansmithusers can't know whether they will get the files they send, because either the deployment may have actual injection disabled (the default),00:19
rm_workwhich #1 bullet, there are several00:19
dansmithor they may have disabled config drive (the other way to get these files)00:19
dansmithrm_work: I said the first :)00:19
rm_work(in fact, I DID notice something new by re-reading -- that SECTION has two, rofl)00:20
openstackgerritMerged openstack/nova master: conf: Add '[neutron] physnets' and related options  https://review.openstack.org/56444000:20
dansmithlet me try to restate this whole thing00:20
dansmithand if that doesn't help, then I'll leave and you can keep your torches and pitchforks for whatever you want00:21
dansmithin the olden times,00:21
dansmiththere was a feature called "file injection"00:21
dansmiththere are two halves of said feature:00:21
dansmith1. The API (personality files) by which people provide this data which may get ignored if config is unfriendly00:21
johnsomAnyhow, any chance we can bump that max size of user-data up to a floppy size? Is it just the API limitation and a DB column alter, or is cloud-init going to need to spin too?00:22
dansmith2. The actual injection part, where the virt driver (some not all) could inject files into images forcibly, literally by taking a hard-coded partition number, and writing over it with your data00:22
dansmithare you with me?00:22
dansmithconfig drive didn't exist at this point00:22
*** gyee has quit IRC00:23
dansmithaight, I guess nobody wants to hear my story00:24
rm_worki'm trying to parse it00:24
*** medberry has joined #openstack-nova00:24
*** vladikr has joined #openstack-nova00:24
dansmithwhich part?00:24
rm_workso, file-injection IS what we're using, correct? so right now, we are using both halves of this?00:24
dansmithno,00:24
dansmithyou're using the first part,00:24
rm_workor this was just the past, and it's changed now, and you're getting to that00:24
dansmithand another part I haven't gotten to yet00:24
rm_workk00:25
*** shaohe_feng has quit IRC00:25
dansmiththe #2 part is the really nasty bit, which has been disabled by default, and which we _actually_ want to be rid of00:25
*** itlinux has quit IRC00:25
dansmithhowever, the first part is problematic because we don't store it and it breaks several of our other features (agree to disagree on this)00:25
dansmithso, in the middle ages, long before you showed up,00:25
dansmiththis config_drive thing was created00:25
*** shaohe_feng has joined #openstack-nova00:26
dansmithwhich was a way to avoid the metadata server's restrictions, complication, whatever00:26
*** gbarros has quit IRC00:26
dansmithapparently when we create that the first time, we also put those files in there (TIL)00:26
*** gbarros has joined #openstack-nova00:26
dansmithbut we can't re-create it later, which is the #2 part of the spec problem section00:27
dansmithso,00:27
dansmithyou're using the API part, and the config drive part, but not the actual injection thing which is the most smelly bit00:27
rm_workha, right, which is funny because the #2 "problem" is actually WHY we chose this method00:27
dansmithfine, but whatever00:27
rm_workok so if #2 was the bad part, and that's just not done anymore... why is the first part being removed?00:27
dansmith#2 is related to the API not the really bad part00:28
rm_workerr00:28
rm_worksorry, PART 1 and 200:28
dansmiththe API part is bad because it takes arbitrary files and then kind of keeps track of them, until a rebuild or something and then we lose them00:29
rm_workper "1. The API (personality files) by which people provide this data" and "the #2 part is the really nasty bit, which has been disabled by default, and which we _actually_ want to be rid of"00:29
rm_workhmmm00:29
dansmiththe #2 part is the libvirt injection partition thing00:29
dansmithsorry00:29
dansmitheff,00:29
rm_workyeah00:30
dansmiththis straightening isn't going well00:30
rm_workso right, #2 part (libvirt) isn't even done anymore00:30
rm_worknow it puts things into config-drive00:30
rm_workwhich is ... fine?00:30
rm_workit's just that nova then loses track of that data, which you consider bad (but we don't)00:30
rm_work(and it has worked that way for a while?)00:31
dansmithokay, you know, it's after 5pm and I'm getting more frustrated here, so I'm just going to go00:31
rm_workkk00:31
rm_workprolly just discussing at the PTG is best00:31
*** Ileixe has joined #openstack-nova00:31
IleixeHello guys00:32
IleixeRecently I implement custom hooking code for server create api in nova-api by hook api.00:33
johnsomMy take away. There was some nasty bit taking files and making some strange partition at boot. We aren't using that and never have. Then there is the bit that takes files, stashes them in the config drive and cloud-init drops them in the guest filesystem.  This is what we use. However to remove the partition stuff the config drive part got removed too00:33
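(To make the two halves concrete: the "API part" johnsom and dansmith are talking about is the personality attribute on the server create request, which carries base64-encoded file contents that end up on the config drive for cloud-init to drop into the guest. A minimal sketch of the shape of such a request body follows; the flavor/image IDs and guest paths are placeholders, and endpoint/token handling is omitted.)

    import base64
    import json

    def b64(text):
        return base64.b64encode(text.encode()).decode()

    def build_server_create_body(cert_chain, private_key, agent_conf):
        # Shape of the request body only; IDs and paths are placeholders.
        return json.dumps({
            "server": {
                "name": "amphora-example",
                "flavorRef": "FLAVOR_ID",
                "imageRef": "IMAGE_ID",
                "config_drive": True,
                "personality": [
                    {"path": "/etc/octavia/certs/ca_chain.pem", "contents": b64(cert_chain)},
                    {"path": "/etc/octavia/certs/server.pem", "contents": b64(private_key)},
                    {"path": "/etc/octavia/amphora-agent.conf", "contents": b64(agent_conf)},
                ],
            }
        })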
*** jangutter has joined #openstack-nova00:35
IleixeOh sorry, there was a conversation going on now. Never mind. I'll ask later00:35
*** shaohe_feng has quit IRC00:35
rm_workIleixe: we are ... wrapped up on that :P00:35
rm_workit's fine00:35
rm_worklol00:35
*** shaohe_feng has joined #openstack-nova00:36
IleixeThanks rm_work :) just a simple question. I found the hook api was deprecated, and the api was the right thing for my logic, so i wonder what replaces the hook api00:37
*** namnh has joined #openstack-nova00:40
*** namnh has quit IRC00:44
*** shaohe_feng has quit IRC00:45
*** Ileixe_ has joined #openstack-nova00:46
*** Ileixe has quit IRC00:47
*** shaohe_feng has joined #openstack-nova00:49
*** Ileixe_ has quit IRC00:51
*** ileixe has joined #openstack-nova00:53
*** felipemonteiro has quit IRC00:55
*** shaohe_feng has quit IRC00:56
melwittargh, looks like we have a new gate failure as of today00:57
*** shaohe_feng has joined #openstack-nova00:57
melwitthttp://logstash.openstack.org/#/dashboard/file/logstash.json?query=message:%5C%22Unsupported%20VIF%20type%20unbound%20convert%20'_nova_to_osvif_vif_unbound'%5C%22%20AND%20tags:screen-n-cpu.txt&from=7d00:57
melwittunless it's only the numa-aware-vswitches patches that are affected... looking closer00:58
melwittit's hitting several of the numa-aware-vswitches patches but is hitting other patches as well. started very recently01:04
*** slaweq has joined #openstack-nova01:05
*** shaohe_feng has quit IRC01:06
*** gbarros has quit IRC01:07
openstackgerritXiaohan Zhang proposed openstack/nova master: compute node local_gb_used include swap disks  https://review.openstack.org/58592801:07
*** shaohe_feng has joined #openstack-nova01:08
*** artom has quit IRC01:09
*** gbarros has joined #openstack-nova01:09
*** slaweq has quit IRC01:10
*** mrsoul has joined #openstack-nova01:12
*** abhishekk has joined #openstack-nova01:12
*** medberry has quit IRC01:13
*** mrsoul` has quit IRC01:15
*** shaohe_feng has quit IRC01:16
*** harlowja has quit IRC01:17
*** shaohe_feng has joined #openstack-nova01:17
*** tiendc has joined #openstack-nova01:25
*** shaohe_feng has quit IRC01:26
*** gbarros has quit IRC01:27
*** shaohe_feng has joined #openstack-nova01:27
*** gbarros has joined #openstack-nova01:28
*** ileixe has quit IRC01:33
*** ileixe has joined #openstack-nova01:34
*** shaohe_feng has quit IRC01:37
*** shaohe_feng has joined #openstack-nova01:37
*** sean-k-mooney has joined #openstack-nova01:39
*** tbachman has quit IRC01:43
*** namnh has joined #openstack-nova01:43
mriedemmelwitt: i was noticing those randomly the last couple of weeks01:44
mriedemunless it's major, just recheck01:44
melwittmriedem: oh, logstash was claiming it started today. and I was wondering if it might be related to https://review.openstack.org/52253701:45
melwittI've rechecked the numa patches at least twice because of it so far. maybe it's a coincidence. I'll keep trying to recheck01:45
mriedemhmm, yeah it might be, mostly hitting on the live migration and multinode jobs01:46
mriedemwhich is where that is turned on01:46
mriedemwell that would be...awesome01:47
mriedemcan you report a neutron bug?01:47
*** shaohe_feng has quit IRC01:47
melwittthat patch landed at 13:00 (my time) which coincides with the logstash start of hits01:47
melwittmriedem: can do. was just writing it up for nova not realizing it's neutron. will copy it over and open for neutron01:48
*** yamahata has quit IRC01:48
mriedemit could be either01:48
mriedemjust add both01:49
*** dklyle has joined #openstack-nova01:49
melwittoh, right. we can do that01:49
mriedemKevin_Zheng: fyi, might need to see if zhaobo can investigate this ^01:49
mriedemmlavalle is already gone for the day01:49
*** shaohe_feng has joined #openstack-nova01:49
Kevin_ZhengACK, I will ask him01:49
mriedemmelwitt: there would be an easy way to disable it in nova if needed01:50
melwittk01:50
mriedemand then could be tracked as an rc bug (it will need to be an rc bug anyway)01:50
mriedemrather than revert01:50
*** dklyle_ has quit IRC01:50
openstackgerritMatt Riedemann proposed openstack/nova-specs master: Fix problem description number in deprecate file injection spec  https://review.openstack.org/58638501:51
mriedemi'm also going to fast approve ^ b/c of the confusion i saw in the backscroll01:51
*** namnh has quit IRC01:52
Kevin_Zhengmriedem, could you provide a error log?01:55
dansmithmriedem: way ahead of you01:55
Kevin_Zhengmriedem, never mind, I got it00:55
melwittmriedem: https://bugs.launchpad.net/neutron/+bug/178391701:57
openstackLaunchpad bug 1783917 in OpenStack Compute (nova) "live migration fails with NovaException: Unsupported VIF type unbound convert '_nova_to_osvif_vif_unbound'" [Undecided,New]01:57
openstackgerritMatt Riedemann proposed openstack/nova master: api-ref: document user_data length restriction  https://review.openstack.org/58638801:57
melwittKevin_Zheng ^01:57
*** shaohe_feng has quit IRC01:57
Kevin_ZhengThanks01:57
*** medberry has joined #openstack-nova01:57
mriedemi'll push up an e-r and nova wip patch and then i have to run i think01:57
melwittoh, I'm not 100% sure it makes live migration "fail", I meant to change that to "raises"01:58
*** shaohe_feng has joined #openstack-nova01:58
mriedeme-r query https://review.openstack.org/#/c/586389/01:59
mriedemit fails01:59
mriedemhttp://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22Live%20migration%20failed%5C%22%20AND%20message%3A%5C%22Unsupported%20VIF%20type%20unbound%20convert%20'_nova_to_osvif_vif_unbound'%5C%22%20AND%20tags%3A%5C%22screen-n-cpu.txt%5C%22&from=7d01:59
melwittalthough yeah, all the logstash hits containing the message are build failures02:00
melwittbah *changes it back*02:00
*** takashin has left #openstack-nova02:00
melwittcool, thanks for adding the e-r query02:01
*** david-lyle has joined #openstack-nova02:01
*** dklyle has quit IRC02:02
*** david-lyle has quit IRC02:03
*** dklyle_ has joined #openstack-nova02:04
*** alexpilotti has quit IRC02:04
sean-k-mooneyso i'm going to sleep now but http://logs.openstack.org/63/585163/1/check/nova-live-migration/1b2aebb/logs/screen-n-cpu.txt#_Jul_27_01_44_01_083831 looks like it's happening because we are calling unplug on the source node after we have activated the binding on the dest02:07
*** shaohe_feng has quit IRC02:07
melwittsean-k-mooney: thanks. so maybe something we need to adjust given the use of the new binding API? I dunno02:08
*** shaohe_feng has joined #openstack-nova02:08
melwittI'll add your comment to the bug02:08
openstackgerritMatt Riedemann proposed openstack/nova master: Temporarily disable port binding flows for live migration  https://review.openstack.org/58639102:09
mriedem^ is an option for temporarily disabling this while debugging a fix02:09
mriedemi hope it doesn't have to come to that, but would understand if it's causing a lot of failures02:10
* melwitt nods02:10
sean-k-mooneymelwitt: i can try and reproduce this in the morning. we probably need to store the original vif type and use that to construct the vif object we use to do the unplug on the source host.02:10
melwittmriedem: okay, we'll decide what to do in the morning tomorrow when other people are around02:11
openstackgerritMerged openstack/nova-specs master: Fix problem description number in deprecate file injection spec  https://review.openstack.org/58638502:11
mriedemyeah the error is from unplugging vifs in _post_live_migration which happens on the source,02:12
mriedemhttps://github.com/openstack/nova/blob/2afc5fed1f60077e7ff0b9e81b64cff4e4dbabfc/nova/compute/manager.py#L658102:12
mriedemright before that,02:12
mriedemhttps://github.com/openstack/nova/blob/2afc5fed1f60077e7ff0b9e81b64cff4e4dbabfc/nova/compute/manager.py#L657202:12
mriedemwe activate the port bindings for the dest host02:13
melwittah, I see02:13
sean-k-mooneymriedem: yep that will deactivaate all other port bindings for that port meaning it will be in the unbound state on the sorce host02:13
melwittso just flip that?02:13
mriedemhttps://github.com/openstack/nova/blob/2afc5fed1f60077e7ff0b9e81b64cff4e4dbabfc/nova/network/neutronv2/api.py#L253402:14
mriedemi didn't know we couldn't unplug a deactivated port...02:14
melwittI wonder how it doesn't fail 100% of the time02:14
mriedemmelwitt: race02:14
mriedemapparently02:14
melwittah02:14
melwittyeah, what luck that the actual change *didn't* fail02:14
sean-k-mooneymriedem: your raising with the notification neutron send for the port status change02:15
mriedemhmm http://logs.openstack.org/63/585163/1/check/nova-live-migration/1b2aebb/logs/screen-n-cpu.txt#_Jul_27_01_44_00_97424802:15
mriedemmelwitt: i had seen this once in the series and mlavalle debugged it and couldn't find anything wrong02:15
mriedemJul 27 01:44:00.974248 ubuntu-xenial-rax-dfw-0001002000 nova-compute[2629]: DEBUG nova.network.neutronv2.api [None req-33283139-ba55-4106-b76c-8751a025f153 service nova] [instance: 6b72a721-0995-446e-848f-f407b788c7f4] Port 21095ff0-6bcd-414b-9d6f-b63e03aacb23 binding to destination host ubuntu-xenial-rax-dfw-0001002004 is already ACTIVE. {{(pid=2629) migrate_instance_start /opt/stack/new/nova/nova/network/neutronv2/api.py:2502:15
melwittah, okay02:15
mriedemoh i know why it's already active,02:16
mriedembecause we activate the dest host port binding during post-copy02:16
mriedemwhich is the whole point of the blueprint - to shorten the window of time that you don't have networking on the dest host02:16
melwittright02:17
melwittshorten the window02:17
mriedemthis is the unplug event http://logs.openstack.org/63/585163/1/check/nova-live-migration/1b2aebb/logs/screen-n-cpu.txt#_Jul_27_01_44_00_06952602:18
*** shaohe_feng has quit IRC02:18
mriedemthis is where we activate the ports on the dest host during post-copy http://logs.openstack.org/63/585163/1/check/nova-live-migration/1b2aebb/logs/screen-n-cpu.txt#_Jul_27_01_43_58_56139102:18
*** shaohe_feng has joined #openstack-nova02:18
mriedemwe could have the live migration method wait for the unplug event before starting with post live migration, but (1) i'm not sure that helps anything and (2) it might not work that way for all virt drivers - only libvirt + post-copy has this02:19
melwittyeah, events are sketch depending on which networking backend too, right02:20
melwittlike ovs vs other02:20
openstackgerritYikun Jiang (Kero) proposed openstack/nova master: Change deprecated policies to policy  https://review.openstack.org/58343402:20
mriedemmelwitt: shouldn't be in this case,02:20
mriedemodl should send the event on host binding changes02:20
openstackgerritYikun Jiang (Kero) proposed openstack/nova master: Fix all invalid obj_make_compatible test case  https://review.openstack.org/57424002:20
openstackgerritYikun Jiang (Kero) proposed openstack/nova master: Fix all invalid obj_make_compatible test case  https://review.openstack.org/57424002:20
mriedemjust not plug/unplug02:20
melwittoh, because neutron knows about it and not relying on anything else? ok02:20
melwittjust remember getting burned by the whole plug event thing for reboot02:21
melwittbut that was because we do os-vif plug only, not any call to neutron, and the agent (or something) has to notice it02:21
sean-k-mooneymelwitt: the binding change is handled in the common ml2 layer if i remember correctly yes. the port wire up/tear down event however has to come from the backend not the common layer hence the delta between odl/ovs in that case02:22
*** gongysh has joined #openstack-nova02:22
melwittsean-k-mooney: yeah, I was having trouble remembering what the deal was. thanks02:22
*** psachin`` has joined #openstack-nova02:23
sean-k-mooneymelwitt: the reason it did not work with linux bridge is that it polls. the reason it did not work for odl was they were missing the handler for the event in odl to send it to the websocket created by networking-odl. i think they have fixed that. maybe02:24
sean-k-mooneyanyway nova is receiving the port update event in this case from neutron and it's updating the network info cache so by the time we call nova_to_osvif_vif the vif_type is set to unbound and boom. if we still have the migration data object at this point we should have a copy of the original vif object that we could use instead of the info_cache version to work around it.02:27
*** Dinesh_Bhor has joined #openstack-nova02:28
mriedemso migrate_instance_start() was always a noop before this series,02:28
*** shaohe_feng has quit IRC02:28
mriedemso its order in _post_live_migration would have never mattered except for nova-network02:28
*** shaohe_feng has joined #openstack-nova02:29
mriedemgiven we already call migrate_instance_start during post-copy, i don't think moving the order of those calls in _post_live_migration will matter,02:29
mriedembecause from these logs, i can see that when we call migrate_instance_start from _post_live_migration, it's a noop b/c the dest port binding is already active02:29
mriedemhttp://logs.openstack.org/63/585163/1/check/nova-live-migration/1b2aebb/logs/screen-n-cpu.txt#_Jul_27_01_44_00_97424802:30
mriedemso i would think it means, we need to handle unbound vifs during unplug in the driver?02:30
mriedemor just not call unplug_vifs in certain cases02:30
mriedemnot totally sure though02:30
mriedemall the libvirt driver does in post_live_migration_at_source is unplug_vifs02:31
sean-k-mooneyif we dont call unplug_vif we could leak the linux bridges we create for ovs hybrid plug02:32
mriedemumm...02:33
mriedemoh i see what you were saying about storing off the vif_type then02:33
*** psachin`` has quit IRC02:33
mriedemb/c i was going to say, we could just not call unplug_vifs if the vif type (after refreshing the network info cache from neutron) was now 'unbound'02:33
mriedemif it is, we can temporarily heal that using migrate_data.vifs02:33
mriedemthat has the vif type in it02:34
sean-k-mooneymriedem: yep02:34
mriedemok i could try cooking something up real quick,02:34
mriedemmy wife is going to kill me though02:34
melwittyou could do tomorrow morning?02:34
sean-k-mooneyi can try this in the morning too. i just need a 2 node vanilla devstack install right02:35
melwittunless you were thinking to fast-approve this tonight02:35
mriedemwhy would the vif type be unbound?02:36
mriedemshouldn't it be bound to the dest host?02:36
mriedemsince we activated it there?02:36
sean-k-mooneymriedem: it is. each host has its own binding now. only one will be in the bound state all the rest will be unbound02:37
mriedembut i think the port in our info cache is not host-aware...02:38
*** shaohe_feng has quit IRC02:38
mriedemi need to check02:38
*** medberry has quit IRC02:39
*** shaohe_feng has joined #openstack-nova02:39
mriedemhttp://logs.openstack.org/63/585163/1/check/nova-live-migration/1b2aebb/logs/screen-n-cpu.txt#_Jul_27_01_44_00_72693502:39
mriedemthat's where we refresh the info cache in _post_live_migration02:40
mriedemafter activating the dest host port binding02:40
mriedem[{"profile": {"migrating_to": "ubuntu-xenial-rax-dfw-0001002004"}, "ovs_interfaceid": null, "preserve_on_delete": false, "network": {"bridge": null, "subnets": [{"ips": [{"meta": {}, "version": 4, "type": "fixed", "floating_ips": [], "address": "10.1.0.10"}], "version": 4, "meta": {}, "dns": [], "routes": [], "cidr": "10.1.0.0/28", "gateway": {"meta": {}, "version": 4, "type": "gateway", "address": "10.1.0.1"}}], "meta": {"in02:40
mriedemed": false, "tenant_id": "7dbeedd7076e472091193779ebbcf887", "mtu": 1400}, "id": "1d8de970-331e-46b5-8c7b-574821e891e5", "label": "tempest-LiveMigrationTest-411356071-network"}, "devname": "tap21095ff0-6b", "vnic_type": "normal", "qbh_params": null, "meta": {}, "details": {}, "address": "fa:16:3e:34:c9:90", "active": false, "type": "unbound", "id": "21095ff0-6bcd-414b-9d6f-b63e03aacb23", "qbg_params": null}]02:40
mriedemyeah...that's wrong02:40
mriedemit should be bound to the dest host02:40
sean-k-mooneywell it was bound shortly before http://logs.openstack.org/63/585163/1/check/nova-live-migration/1b2aebb/logs/screen-n-cpu.txt#_Jul_27_01_43_59_31189602:44
mriedemyup we hit post-copy callback here and activate the dest host port binding http://logs.openstack.org/63/585163/1/check/nova-live-migration/1b2aebb/logs/screen-n-cpu.txt#_Jul_27_01_43_58_56139102:46
mriedemrefresh nw info cache here http://logs.openstack.org/63/585163/1/check/nova-live-migration/1b2aebb/logs/screen-n-cpu.txt#_Jul_27_01_43_59_31073802:47
mriedemthen we get an unplugged vif event from neutron02:48
mriedemcould be concurrently02:48
sean-k-mooneywhat's happening is likely that when the ovs neutron agent sees the tap device disappear it is sending an update to notify us the port state has changed on the source node.02:48
mriedemyeah we get the unplugged event and refresh the cache and it's unbound http://logs.openstack.org/63/585163/1/check/nova-live-migration/1b2aebb/logs/screen-n-cpu.txt#_Jul_27_01_44_00_72693502:48
*** shaohe_feng has quit IRC02:48
mriedempost live migrate the dest host port binding is already active http://logs.openstack.org/63/585163/1/check/nova-live-migration/1b2aebb/logs/screen-n-cpu.txt#_Jul_27_01_44_00_97424802:49
mriedemthen we unplug and kablammo02:50
mriedemdoesn't help that we route all of these plug/unplug neutron events to the source host only, that's a nova limitation during live migration right now02:50
mriedemand there might be some kind of delay in the state updates or something in the neutron db?02:50
*** shaohe_feng has joined #openstack-nova02:51
openstackgerritTetsuro Nakamura proposed openstack/nova master: Fix create_all() to replace_all() in comments  https://review.openstack.org/58639602:51
mriedemanyway, i can hack around this a bit i think but kind of sucks02:51
sean-k-mooneymriedem: well there is a delay in the neutron agent sending the update over the rabbit rpc bus to the neutron-server and then the rest call to nova.02:51
mriedemi just worry the port isn't wired up on the dest or something, but that shouldn't be the case b/c we plug_vifs on the dest host during pre_live_migration now02:52
mriedemit's just inactive until post-copy02:52
sean-k-mooneywe could probably hack in a filter to ignore any info cache updates where the vif type is unbound and the port profile contains a migrating_to field02:52
mriedemyeah...02:53
mriedemthat would coincide with this http://logs.openstack.org/63/585163/1/check/nova-live-migration/1b2aebb/logs/screen-n-cpu.txt#_Jul_27_01_43_59_31073802:53
sean-k-mooneymriedem: yes if the plugging fails in pre_live_migration we bail out early and try another host so at this point the dest networking should be fully set up02:54
mriedemalso, if we get the info cache based on what's setup for the dest host, we could have changed vif types, so unplugging on the source could be a different vif type...couldn't it?02:55
mriedemthis gets a bit wonky02:56
mriedemwe do have an exact copy of the source_vif in the migrate data vifs02:56
sean-k-mooneyyes it could have changed.02:56
sean-k-mooneyyep02:56
sean-k-mooneythe migrate data has everything you need.02:56
sean-k-mooneyjust look up the vif by the port uuid and unplug or better yet just loop over all the vifs in migrate data instead of instance02:57
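(A rough sketch of the idea sean-k-mooney is proposing here, written from the discussion above rather than taken from the actual patch that went up for review: when unplugging on the source in _post_live_migration, prefer the source vifs captured in migrate_data over the refreshed info cache, whose vif_type may already have flipped to unbound.)

    def pick_unplug_network_info(migrate_data, refreshed_network_info):
        # migrate_data.vifs carries a copy of the original source vif per port,
        # which still has the real vif_type (ovs, bridge, ...) that os-vif
        # needs in order to unplug; the refreshed cache may say "unbound".
        vifs = getattr(migrate_data, 'vifs', None)
        if vifs:
            return [vif.source_vif for vif in vifs]
        return refreshed_network_info

    # driver.post_live_migration_at_source(
    #     context, instance,
    #     pick_unplug_network_info(migrate_data, network_info))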
mriedemthat's kind of what i'm going to do, will hack something up quick and post it then flesh it out more in the morning02:57
mriedemsean-k-mooney: and for the love of toast go to bed02:57
bzhao__Sorry for a nic break, I had a brief look in the neutron log from the link shown.  For the failed test instance, it seems to work correctly on the Neutron side.02:58
sean-k-mooneyhaha it's only 4 am. but ya. i'll be back in 6-8 hours and i'll take a look at it then. night o/02:58
*** shaohe_feng has quit IRC02:59
*** shaohe_feng has joined #openstack-nova03:00
melwittbzhao__: thanks. feel free to add a comment to explain about the neutron side in https://bugs.launchpad.net/neutron/+bug/1783917 see comment #603:01
openstackLaunchpad bug 1783917 in OpenStack Compute (nova) "live migration fails with NovaException: Unsupported VIF type unbound convert '_nova_to_osvif_vif_unbound'" [High,Confirmed]03:01
bzhao__melwitt:  Thanks, I will. ;-)03:02
mriedemgot a patch, pretty simple, no tests but can be easily added by someone else tonight or in the morning03:07
sapdHi everyone. I got this error when attach a SR-IOV port to instance http://paste.openstack.org/show/726723/  Please help me03:09
*** shaohe_feng has quit IRC03:09
mriedemsapd: read through https://docs.openstack.org/neutron/latest/admin/config-sriov.html and check everything in there03:10
melwittmriedem: coolness, sounds good03:10
*** shaohe_feng has joined #openstack-nova03:12
sapdmriedem: yep. I have read it and followed the guide to configure. Everything I set up is correct, because I already launched an instance using SR-IOV successfully. But it did not receive DHCP. So I launched another instance using Open vSwitch then added an SR-IOV port to the instance, but got the above error.03:14
melwittsapd: looks like the bug has been around for awhile and still not resolved https://bugs.launchpad.net/nova/+bug/1708433 they say you can boot with the port if you pass it during server create, but that attaching port separately is broken03:16
openstackLaunchpad bug 1708433 in OpenStack Compute (nova) "Attaching sriov nic VM fail with keyError pci_slot" [Undecided,Expired]03:16
*** abhishekk has quit IRC03:17
melwittsapd: what release of nova are you using?03:18
sapdmelwitt: I'm using queens version. 17.0.403:18
openstackgerritMatt Riedemann proposed openstack/nova master: WIP: Use source vifs when unplugging on source during post live migrate  https://review.openstack.org/58640203:18
mriedemmelwitt: bzhao__: Kevin_Zheng: sean-k-mooney: ^ just needs unit tests03:18
melwittsapd: okay, I'm going to re-open that bug and mention what version you saw it in. it will need to be worked on03:19
Kevin_Zhengmriedem, got it, just finish reading launchpad report03:19
mriedemask sahid to look at it03:19
mriedemthe sriov bug i mean03:19
*** shaohe_feng has quit IRC03:19
melwittk03:19
sean-k-mooney[m]Melwitt we used to have an api check at one point to expressly forbid attaching sriov ports to existing instances.03:20
melwitthmm, interesting. I wonder what happened to that03:20
sapdmelwitt: I'm waiting.03:21
melwitthah03:21
*** shaohe_feng has joined #openstack-nova03:21
sean-k-mooney[m]Melwitt i'm guessing some of artom's changes03:21
melwittokay, I'll ask him about it03:23
*** dave-mccowan has quit IRC03:24
openstackgerritMerged openstack/os-vif stable/rocky: Add vif_plug_noop to setup.cfg packages  https://review.openstack.org/58634003:26
melwitthot dog03:26
bzhao__mriedem:  So so quick....  =。=03:29
*** shaohe_feng has quit IRC03:29
*** annp has quit IRC03:31
*** tiendc has quit IRC03:31
*** trungnv has quit IRC03:31
melwittI think I'm gonna give up on rechecking the r-3 patches, seems like a pretty high fail rate with the live migration thing03:31
*** shaohe_feng has joined #openstack-nova03:32
*** tiendc has joined #openstack-nova03:32
*** trungnv has joined #openstack-nova03:32
melwittget the fix sorted in the morning and go from there03:32
*** annp has joined #openstack-nova03:32
*** gbarros has quit IRC03:39
*** shaohe_feng has quit IRC03:40
*** shaohe_feng has joined #openstack-nova03:40
*** vladikr has quit IRC03:45
*** vladikr has joined #openstack-nova03:45
mriedemshould have tests done pretty soon03:48
*** shaohe_feng has quit IRC03:50
*** vladikr has quit IRC03:51
*** vladikr has joined #openstack-nova03:51
*** shaohe_feng has joined #openstack-nova03:52
*** links has joined #openstack-nova03:52
*** Dinesh_Bhor has quit IRC03:52
*** gongysh has quit IRC03:52
*** yamahata has joined #openstack-nova03:53
*** Dinesh_Bhor has joined #openstack-nova03:54
openstackgerritMatt Riedemann proposed openstack/nova master: Use source vifs when unplugging on source during post live migrate  https://review.openstack.org/58640203:56
mriedemalright gang there it is with a test ^03:57
* melwitt clicks03:58
*** Dinesh_Bhor has quit IRC04:00
*** shaohe_feng has quit IRC04:00
*** shaohe_feng has joined #openstack-nova04:01
*** vladikr has quit IRC04:03
*** vladikr has joined #openstack-nova04:03
mriedemand now i'm going to bed04:04
mriedemo/04:04
*** mriedem has quit IRC04:04
melwittgnite04:04
*** mschuppert has joined #openstack-nova04:06
*** tiendc has quit IRC04:10
*** shaohe_feng has quit IRC04:10
*** tiendc has joined #openstack-nova04:11
*** slaweq has joined #openstack-nova04:11
*** shaohe_feng has joined #openstack-nova04:11
*** slaweq has quit IRC04:16
*** shaohe_feng has quit IRC04:21
*** mdnadeem has joined #openstack-nova04:21
*** itlinux has joined #openstack-nova04:22
*** shaohe_feng has joined #openstack-nova04:22
*** pcaruana has joined #openstack-nova04:28
*** pcaruana has quit IRC04:30
*** shaohe_feng has quit IRC04:31
*** shaohe_feng has joined #openstack-nova04:33
*** shaohe_feng has quit IRC04:41
*** shaohe_feng has joined #openstack-nova04:41
openstackgerritXiaohan Zhang proposed openstack/nova master: compute node local_gb_used include swap disks  https://review.openstack.org/58592804:47
*** gongysh has joined #openstack-nova04:50
*** shaohe_feng has quit IRC04:51
*** shaohe_feng has joined #openstack-nova04:53
*** vladikr has quit IRC04:53
*** vladikr has joined #openstack-nova04:54
*** flwang1 has quit IRC04:59
*** shaohe_feng has quit IRC05:02
*** shaohe_feng has joined #openstack-nova05:02
*** vladikr has quit IRC05:05
*** itlinux has quit IRC05:05
*** tbachman has joined #openstack-nova05:06
*** vladikr has joined #openstack-nova05:08
*** tbachman has quit IRC05:11
vishakhamelwitt : Hi, waiting for your response https://review.openstack.org/#/c/580271/. Thanks05:11
*** shaohe_feng has quit IRC05:12
*** slaweq has joined #openstack-nova05:13
*** shaohe_feng has joined #openstack-nova05:14
*** tbachman has joined #openstack-nova05:16
*** Bhujay has joined #openstack-nova05:17
*** slaweq has quit IRC05:17
*** Bhujay has quit IRC05:21
*** shaohe_feng has quit IRC05:22
*** shaohe_feng has joined #openstack-nova05:23
*** vladikr has quit IRC05:27
*** vladikr has joined #openstack-nova05:29
*** shaohe_feng has quit IRC05:32
*** sridharg has joined #openstack-nova05:32
*** shaohe_feng has joined #openstack-nova05:34
*** shaohe_feng has quit IRC05:43
*** shaohe_feng has joined #openstack-nova05:46
*** tbachman has quit IRC05:46
*** vladikr has quit IRC05:48
*** josecastroleon has joined #openstack-nova05:48
*** vladikr has joined #openstack-nova05:51
*** trungnv has quit IRC05:51
*** annp has quit IRC05:51
*** tiendc has quit IRC05:51
*** tiendc has joined #openstack-nova05:52
*** trungnv has joined #openstack-nova05:52
*** annp has joined #openstack-nova05:52
*** zigo_ has joined #openstack-nova05:53
*** zigo has quit IRC05:53
*** shaohe_feng has quit IRC05:53
*** shaohe_feng has joined #openstack-nova05:54
*** Luzi has joined #openstack-nova05:54
*** vladikr has quit IRC06:01
*** vladikr has joined #openstack-nova06:02
*** shaohe_feng has quit IRC06:03
*** shaohe_feng has joined #openstack-nova06:05
openstackgerritVishakha Agarwal proposed openstack/nova master: No change in  field 'updated' in server  https://review.openstack.org/58644606:08
*** shaohe_feng has quit IRC06:13
*** shaohe_feng has joined #openstack-nova06:15
*** alexchadin has joined #openstack-nova06:15
*** sapd has quit IRC06:22
*** sapd has joined #openstack-nova06:23
*** shaohe_feng has quit IRC06:24
openstackgerritVishakha Agarwal proposed openstack/nova master: No change in  field 'updated' in server  https://review.openstack.org/58644606:25
*** shaohe_feng has joined #openstack-nova06:26
*** tiendc_ has joined #openstack-nova06:28
*** tiendc has quit IRC06:30
ileixeHello again06:32
ileixeDoes anybody know how to extend APIExtensionBase for pre-processing, not for post-processing..?06:33
*** shaohe_feng has quit IRC06:34
*** shaohe_feng has joined #openstack-nova06:35
*** abhishekk has joined #openstack-nova06:41
*** mgoddard has joined #openstack-nova06:41
*** shaohe_feng has quit IRC06:44
*** shaohe_feng has joined #openstack-nova06:45
*** vladikr has quit IRC06:45
openstackgerritXiaohan Zhang proposed openstack/nova master: compute node local_gb_used include swap disks  https://review.openstack.org/58592806:47
*** vladikr has joined #openstack-nova06:48
*** mgoddard has quit IRC06:50
*** brault has joined #openstack-nova06:51
*** tesseract has joined #openstack-nova06:52
*** shaohe_feng has quit IRC06:54
*** shaohe_feng has joined #openstack-nova06:56
*** rcernin has quit IRC07:00
*** ispp has joined #openstack-nova07:00
*** liuyulong__ has joined #openstack-nova07:02
*** shaohe_feng has quit IRC07:05
*** shaohe_feng has joined #openstack-nova07:05
*** liuyulong_ has quit IRC07:06
*** ileixe has quit IRC07:09
*** ttsiouts has joined #openstack-nova07:14
*** shaohe_feng has quit IRC07:15
openstackgerritChen proposed openstack/nova master: Make nova-manage capable of syncing all cell databases  https://review.openstack.org/51927507:15
*** tiendc has joined #openstack-nova07:15
*** tiendc_ has quit IRC07:16
*** shaohe_feng has joined #openstack-nova07:16
*** ccamacho has joined #openstack-nova07:20
*** dtantsur|afk is now known as dtantsur07:21
*** ttsiouts has quit IRC07:24
*** shaohe_feng has quit IRC07:25
*** shaohe_feng has joined #openstack-nova07:26
*** ileixe has joined #openstack-nova07:27
*** ispp has quit IRC07:27
*** AlexeyAbashkin has joined #openstack-nova07:29
*** gibi is now known as giblet07:30
openstackgerritVishakha Agarwal proposed openstack/nova master: No change in  field 'updated' in server  https://review.openstack.org/58644607:33
*** shaohe_feng has quit IRC07:35
*** shaohe_feng has joined #openstack-nova07:37
openstackgerritTetsuro Nakamura proposed openstack/nova master: Fix create_all() to replace_all() in comments  https://review.openstack.org/58639607:43
*** shaohe_feng has quit IRC07:46
*** shaohe_feng has joined #openstack-nova07:46
*** tssurya has joined #openstack-nova07:48
*** ispp has joined #openstack-nova07:48
*** alexchadin has quit IRC07:52
*** ttsiouts has joined #openstack-nova07:54
*** shaohe_feng has quit IRC07:56
*** shaohe_feng has joined #openstack-nova07:57
*** rpittau has quit IRC07:57
*** rpittau has joined #openstack-nova07:57
*** dtantsur is now known as dtantsur|bbl08:00
*** abhishekk has quit IRC08:04
*** alexchadin has joined #openstack-nova08:05
*** shaohe_feng has quit IRC08:06
*** vladikr has quit IRC08:07
*** vladikr has joined #openstack-nova08:08
*** shaohe_feng has joined #openstack-nova08:08
*** mgoddard has joined #openstack-nova08:12
*** tetsuro has quit IRC08:14
*** vladikr has quit IRC08:15
*** vladikr has joined #openstack-nova08:15
*** shaohe_feng has quit IRC08:16
*** shaohe_feng has joined #openstack-nova08:19
*** bauzas is now known as PapaOurs08:19
kashyapHey folks, I'm hitting a "POST_FAILURE" state for the 'nova-live-migration' CI job; seems like a Zuul problem?08:20
kashyap(For this change: https://review.openstack.org/#/c/567258/)08:20
PapaOurskashyap: nothing raised by infra AFAIK08:21
PapaOurskashyap: but maybe you should ask in #openstack-infra ?08:21
kashyapNod; in the past I've seen the channel topic being changed when such errors occurred.08:21
kashyapPapaOurs: Yep, was just about to check there.08:21
kashyapWhen I look into the log, it's the SSH failing08:22
*** derekh has joined #openstack-nova08:23
*** shaohe_feng has quit IRC08:27
*** shaohe_feng has joined #openstack-nova08:28
*** avolkov has joined #openstack-nova08:28
*** mgoddard has quit IRC08:34
*** flwang1 has joined #openstack-nova08:34
*** shaohe_feng has quit IRC08:37
*** jaosorior has quit IRC08:38
*** shaohe_feng has joined #openstack-nova08:38
*** vivsoni has quit IRC08:41
openstackgerritVishakha Agarwal proposed openstack/nova master: No change in  field 'updated' in server  https://review.openstack.org/58644608:43
*** mgoddard has joined #openstack-nova08:43
*** flwang1 has quit IRC08:46
*** shaohe_feng has quit IRC08:47
*** shaohe_feng has joined #openstack-nova08:49
*** lifeless has quit IRC08:54
*** vladikr has quit IRC08:55
*** vladikr has joined #openstack-nova08:55
*** vishakha has quit IRC08:57
*** shaohe_feng has quit IRC08:57
*** jaosorior has joined #openstack-nova08:58
*** shaohe_feng has joined #openstack-nova08:58
*** vivsoni has joined #openstack-nova09:05
*** shaohe_feng has quit IRC09:08
*** shaohe_feng has joined #openstack-nova09:08
*** flwang1 has joined #openstack-nova09:09
*** josecastroleon has quit IRC09:09
*** lifeless has joined #openstack-nova09:11
*** vladikr has quit IRC09:11
*** vladikr has joined #openstack-nova09:12
*** akki has joined #openstack-nova09:12
*** akki has quit IRC09:13
*** akki has joined #openstack-nova09:13
akkican we take lxd container snapshots and use them to launch new containers?09:15
*** cdent has joined #openstack-nova09:18
*** naichuans has quit IRC09:18
*** shaohe_feng has quit IRC09:18
*** josecastroleon has joined #openstack-nova09:18
PapaOursdo folks have any idea why we stupidly set the device owner of a port to be compute:<instance_az> ?09:18
openstackgerrithuanhongda proposed openstack/nova master: hypervisor-stats shows wrong disk usages with shared storage  https://review.openstack.org/14987809:18
*** vladikr has quit IRC09:21
*** shaohe_feng has joined #openstack-nova09:21
*** shaohe_feng has quit IRC09:28
*** shaohe_feng has joined #openstack-nova09:29
*** MultipleCrashes has joined #openstack-nova09:29
MultipleCrashesLooking for further review for some time, please have a look https://review.openstack.org/#/c/563418/09:29
openstackgerrithuanhongda proposed openstack/nova master: Change the metadata re to match the unicode  https://review.openstack.org/53623609:32
*** vladikr has joined #openstack-nova09:33
*** MultipleCrashes has quit IRC09:37
*** shaohe_feng has quit IRC09:38
*** shaohe_feng has joined #openstack-nova09:41
*** Dinesh_Bhor has joined #openstack-nova09:45
*** andymccr- has joined #openstack-nova09:47
*** shaohe_feng has quit IRC09:49
*** jaosorior has quit IRC09:49
*** shaohe_feng has joined #openstack-nova09:49
*** andymccr_ has quit IRC09:50
*** johnthetubaguy has quit IRC09:52
*** flwang1 has quit IRC09:55
*** flwang1 has joined #openstack-nova09:56
*** shaohe_feng has quit IRC09:59
*** shaohe_feng has joined #openstack-nova10:00
*** flwang1 has quit IRC10:00
*** vladikr has quit IRC10:03
*** stakeda has quit IRC10:03
*** vladikr has joined #openstack-nova10:04
*** andymccr has quit IRC10:04
*** andymccr- is now known as andymccr10:05
*** liuzz_ has quit IRC10:09
*** shaohe_feng has quit IRC10:09
*** ispp has quit IRC10:09
*** shaohe_feng has joined #openstack-nova10:10
*** Dinesh_Bhor has quit IRC10:10
*** flwang1 has joined #openstack-nova10:13
openstackgerritBalazs Gibizer proposed openstack/nova master: Use placement 1.28 in scheduler report client  https://review.openstack.org/58366710:15
*** trungnv has quit IRC10:18
*** shaohe_feng has quit IRC10:19
*** shaohe_feng has joined #openstack-nova10:21
*** alexchadin has quit IRC10:26
*** cdent has quit IRC10:27
*** shaohe_feng has quit IRC10:30
*** shaohe_feng has joined #openstack-nova10:30
*** ttsiouts has quit IRC10:31
*** ispp has joined #openstack-nova10:33
*** vladikr has quit IRC10:36
*** vladikr has joined #openstack-nova10:36
sean-k-mooney[m]kashyap: post_failure means the job failed to upload the logs/result10:36
kashyapsean-k-mooney[m]: Ah, I see10:36
kashyapsean-k-mooney[m]: I hit a recheck, let's see if it goes through.10:37
kashyapsean-k-mooney[m]: Would you happen to have time to have a gander at this: https://review.openstack.org/#/c/567258/ ("libvirt: Remove usage of migrateToURI{2} APIs")10:37
kashyapFairly mechanical, but some churn in there.10:37
kashyap(The 'recheck' is still in progress, though.)10:38
kashyapIt's slow as molasses.10:38
sean-k-mooney[m]Am sure. I'll take a look once i get coffee10:38
sean-k-mooney[m]It's feature freeze time; the gate is under a lot of load. Recheck is all you could have done in this case10:39
*** alexchadin has joined #openstack-nova10:39
*** shaohe_feng has quit IRC10:40
*** shaohe_feng has joined #openstack-nova10:42
kashyapAh, right10:47
*** gongysh has quit IRC10:47
*** dtantsur|bbl is now known as dtantsur10:49
*** sridharg has quit IRC10:50
*** brault_ has joined #openstack-nova10:50
*** shaohe_feng has quit IRC10:50
*** shaohe_feng has joined #openstack-nova10:51
openstackgerritMerged openstack/nova master: doc: add missing permission for the vCenter service account  https://review.openstack.org/58568310:52
*** brault has quit IRC10:53
*** savvas has quit IRC10:53
*** savvas has joined #openstack-nova10:53
*** vladikr has quit IRC10:55
*** vladikr has joined #openstack-nova10:55
*** gilfoyle has joined #openstack-nova10:58
gilfoyleI'm trying to replicate some of nova's (the cli util) is doin. This is an old deployment of openstack. My goal is to understand how it is getting the zone-related information from the database when no zones are created10:59
gilfoylecould someone help me by pointing out where in the repos should I be looking for this?11:00
gilfoylethe relevant command is `nova availability-zone-list`11:00
*** shaohe_feng has quit IRC11:00
*** shaohe_feng has joined #openstack-nova11:01
sean-k-mooneygilfoyle: what is the result you are getting and what were you expecting11:04
sean-k-mooneythere are 2 default azs that exist without you creating any11:05
sean-k-mooneyinternal and nova11:05
sean-k-mooneythe controller nodes will be in internal and all computes will be in nova11:05
*** dave-mccowan has joined #openstack-nova11:06
*** pooja_jadhav has quit IRC11:08
sean-k-mooneykashyap: i was going to ask why there is a migrateToURI() migrateToURI2() and migrateToURI3() then i remembered libvirt is written in c...11:08
gilfoylesean-k-mooney: my issue is that I'm running a query against a database that's not returning me any of the computes in the `nova` az, and from the nova command above I do see it thee11:10
gilfoylethere even, apologies11:10
*** shaohe_feng has quit IRC11:11
sean-k-mooneygilfoyle: yes i think the api layer injects the nova az before it gets to the client11:11
*** shaohe_feng has joined #openstack-nova11:11
*** takedakn has joined #openstack-nova11:12
gilfoyleis it a case of if a compute node has been added without specifying an AZ, the reporting then returns it as being `nova`? that's how I've handled it in the past11:13
*** s10 has joined #openstack-nova11:15
sean-k-mooneygilfoyle: yes and that is still how it's handled today11:15
gilfoyleor, let me restart, if the compute node has not been added to an AZ, it ends up in 'nova'? I've seen occasions where the aggregates.name came up as NULL, so I used the following shortcut in mysql `IFNULL(aggregates.name, 'nova') as zone`11:16
gilfoyles/restart/restate11:16
sean-k-mooneygilfoyle: ah no if you have added a host to a host aggregate and you have set the availability_zone metadata key on the aggregate it should not show up in nova anymore11:17
gilfoyleah, that explains my conundrum then, however, I now have a different question/ask11:18
gilfoylewhat's the case where aggregates.name is NULL?11:18
gilfoyleif this isn't an obvious one, then I'll go back to the drawing board and try to analyse it further :)11:19
sean-k-mooneygilfoyle: i believe we allow you to have a host aggregate where you only set the uuid11:19
sean-k-mooneyi can't remember off the top of my head why however11:20
*** shaohe_feng has quit IRC11:21
gilfoyleah, cool :)11:22
*** flwang1 has quit IRC11:22
*** vivsoni has quit IRC11:23
*** takedakn has quit IRC11:23
*** shaohe_feng has joined #openstack-nova11:24
sean-k-mooneygilfoyle: the name field on the aggregate is not the availability_zone name by the way. it's the host aggregate name, just in case you thought they were the same11:24
sean-k-mooneyi mean i personally always set them the same but they don't have to be11:24
gilfoylesean-k-mooney: Oh. interesting, I've been using a query with a relationship between aggregates, aggregate_hosts, compute_nodes and services tables to try and get all nodes for all AZs11:25
*** flwang1 has joined #openstack-nova11:26
sean-k-mooneygilfoyle: an availability zone isn't really a thing in nova. it's just a host_aggregate with a metadata key called availability_zone in it11:27
*** cdent has joined #openstack-nova11:28
*** Shilpa has quit IRC11:28
*** ttsiouts has joined #openstack-nova11:29
sean-k-mooneyso to get all hosts in an az you just find the host_aggregate with the correct metadata key then list its hosts.11:29
*** pooja_jadhav has joined #openstack-nova11:29
sean-k-mooneythe nova and internal az are special however11:29
gilfoylecould you possibly eyeball this and see if you can spot any obvious assumption(s) https://paste.ubuntu.com/p/sDFRDffzpy/ ?11:30
sean-k-mooneyi think the nova az is calculated by generating a list of hosts that are not part of another az11:31
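(A hedged sketch of the resolution logic being described: an AZ is just a host aggregate carrying an availability_zone metadata key, and any compute host not covered by one is reported under the default AZ, normally "nova" via default_availability_zone in nova.conf. This illustrates the behaviour, it is not nova's actual code.)

    DEFAULT_AZ = "nova"  # [DEFAULT]/default_availability_zone in nova.conf

    def hosts_by_az(aggregates, compute_hosts):
        """aggregates: iterable of (metadata_dict, host_list); compute_hosts: all compute hosts."""
        result, placed = {}, set()
        for metadata, hosts in aggregates:
            az = metadata.get("availability_zone")
            if not az:
                continue  # an aggregate without the key does not define an AZ
            result.setdefault(az, set()).update(hosts)
            placed.update(hosts)
        # Hosts in no AZ-bearing aggregate fall back to the default AZ, which
        # is why `nova availability-zone-list` shows them even when no
        # aggregates exist at all.
        result.setdefault(DEFAULT_AZ, set()).update(set(compute_hosts) - placed)
        return result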
*** jamesde__ has joined #openstack-nova11:31
*** shaohe_feng has quit IRC11:31
*** jamesden_ has quit IRC11:32
*** flwang1 has quit IRC11:33
*** jamesde__ has quit IRC11:34
*** shaohe_feng has joined #openstack-nova11:34
gilfoylethat seems to make sense to me, so I assume it does that as a separate step/query in the `nova` cli? would you have any idea where this is defined in the source?11:34
sean-k-mooneygilfoyle: i think services.topic = 'compute' can be changed in the nova conf. so that might be more fragile than looking at the service.binary11:34
*** flwang1 has joined #openstack-nova11:34
*** alexchadin has quit IRC11:34
*** alexchadin has joined #openstack-nova11:35
sean-k-mooneygilfoyle: but that should list the capacity of all compute nodes ordered by the az they are in11:35
*** alexchadin has quit IRC11:35
*** alexchadin has joined #openstack-nova11:36
gilfoyleyes, that's the goal, but for a cluster w/o any zones, I don't see the only compute node with it. Probably because it needs to be a separate query as you suggested above :)11:36
sean-k-mooneyactully no it wont11:36
*** alexchadin has quit IRC11:36
*** alexchadin has joined #openstack-nova11:36
sean-k-mooneyya thats because you are matching on the aggregate name not the az name11:37
*** alexchadin has quit IRC11:37
sean-k-mooneyactually that's not quite true either11:37
sean-k-mooneyby default you will not have any aggregates so the left join on aggregate_hosts.host = compute_nodes.hypervisor_hostname will filter out all the hosts11:38
gilfoyleyup, that became apparent after your nugget above, too :)11:39
*** kholkina has joined #openstack-nova11:39
sean-k-mooneygilfoyle: so what you need to do, rather than setting aggregate.name to nova if null, is to also join this result with a subquery on the compute_nodes table for every host that is not in the first result set11:40
gilfoylethank you sean-k-mooney! :)11:41
*** shaohe_feng has quit IRC11:41
sean-k-mooneygilfoyle: do you want to view this by host_aggregate or availability zone by the way11:42
sean-k-mooneythe service has the az embedded https://github.com/openstack/nova/blob/2afc5fed1f60077e7ff0b9e81b64cff4e4dbabfc/nova/objects/service.py#L19011:42
*** shaohe_feng has joined #openstack-nova11:42
gilfoyleby availability zone :)11:42
*** abhishekk has joined #openstack-nova11:46
openstackgerritMerged openstack/nova master: [placement] Use base test in placement functional tests  https://review.openstack.org/58577811:49
*** shaohe_feng has quit IRC11:52
kashyapsean-k-mooney: Was AFK for lunch11:52
kashyapsean-k-mooney: Hehe, yeah.  I linked to a libvirt commit that explains it11:52
*** tiendc has quit IRC11:53
*** shaohe_feng has joined #openstack-nova11:53
*** shaohe_feng has quit IRC12:02
*** shaohe_feng has joined #openstack-nova12:03
*** linkmark has joined #openstack-nova12:03
*** savvas has quit IRC12:04
*** medberry has joined #openstack-nova12:04
*** ispp has quit IRC12:08
*** savvas has joined #openstack-nova12:09
*** savvas has quit IRC12:11
*** savvas_ has joined #openstack-nova12:11
*** shaohe_feng has quit IRC12:12
*** shaohe_feng has joined #openstack-nova12:14
*** alexchadin has joined #openstack-nova12:16
*** edmondsw has joined #openstack-nova12:17
*** johnthetubaguy has joined #openstack-nova12:17
*** alexchadin has quit IRC12:20
*** ispp has joined #openstack-nova12:20
*** armaan has joined #openstack-nova12:22
*** shaohe_feng has quit IRC12:22
*** shaohe_feng has joined #openstack-nova12:23
*** sridharg has joined #openstack-nova12:24
*** wolverineav has joined #openstack-nova12:26
*** annp has quit IRC12:27
*** Shilpa has joined #openstack-nova12:31
*** mdnadeem has quit IRC12:32
*** alexchadin has joined #openstack-nova12:33
*** shaohe_feng has quit IRC12:33
*** shaohe_feng has joined #openstack-nova12:33
*** lyan has joined #openstack-nova12:34
*** lyan is now known as Guest8780812:34
*** vladikr has quit IRC12:35
*** mriedem has joined #openstack-nova12:35
*** ispp has quit IRC12:36
*** savvas_ has quit IRC12:40
*** armaan has quit IRC12:41
*** shaohe_feng has quit IRC12:43
*** flwang1 has quit IRC12:43
*** shaohe_feng has joined #openstack-nova12:44
*** armaan has joined #openstack-nova12:45
*** savvas has joined #openstack-nova12:45
*** flwang1 has joined #openstack-nova12:46
mriedemhttp://status.openstack.org/elastic-recheck/index.html#1783917 is clearly our top code-related gate failure so need eyes on the proposed fix https://review.openstack.org/#/c/586402/12:47
gibletmriedem: as sean-k-mooney is +1 on the change I'm going to approve it12:49
mriedemgiblet: ok. i'm looking at what other calls we make on the source,12:49
*** armaan has quit IRC12:49
mriedemrollback_live_migration looks OK - nothing directly using the info cache in there12:49
*** savvas has quit IRC12:50
*** armaan has joined #openstack-nova12:50
*** ttsiouts has quit IRC12:50
PapaOursmriedem: there were some POST_FAILURE gate issues this morning too12:51
mriedemPapaOurs: that's not code related12:51
PapaOursyup, I know, just FYI12:51
mriedemand has been a known issue the last few weeks with one of the node providers12:51
PapaOursthat I didn't know of12:51
PapaOurseither way, giblet +Wd your change12:52
*** shaohe_feng has quit IRC12:53
*** armaan has quit IRC12:54
*** savvas has joined #openstack-nova12:54
*** shaohe_feng has joined #openstack-nova12:55
*** ttsiouts has joined #openstack-nova12:56
mriedemi do see one potential place i missed12:57
*** savvas has quit IRC12:59
mriedemgiblet: comment inline, i'll do a follow up12:59
*** rmart04 has joined #openstack-nova12:59
gibletmriedem: OK, cool13:00
*** pchavva has joined #openstack-nova13:01
*** vladikr has joined #openstack-nova13:01
mriedemhyperv ci failed but on unrelated tests13:01
mriedemlooks like those were failing due to ssh and timeouts13:01
mriedem{7} tempest.api.volume.test_volumes_extend.VolumesExtendTest.test_volume_extend_when_volume_has_snapshot [365.093541s] ... FAILED13:01
mriedemhuh13:03
mriedem2018-07-27 05:15:36.661 5060 105049744 MainThread WARNING nova.scheduler.client.report [req-640b132e-9a1b-4f75-8f8d-7ae96964af72 c329c90c52a44fe2889e0284651a21f0 82e0a447215e49079fe42481922ccd81 - default default] Failed to save allocation for 390d33d0-36e2-469e-85be-8ec10658e953. Got HTTP 400: {"errors": [{"status": 400, "request_id": "req-fc67d1c6-b641-475a-afdf-27075995c0ff", "detail": "The server could not comply with the request13:03
mriedemsince it is either malformed or otherwise incorrect.\n\n JSON does not validate: {} does not have enough properties  Failed validating 'minProperties' in schema['properties']['allocations']['items']['properties']['resources']: {'additionalProperties': False, 'minProperties': 1, 'patternProperties': {'^[0-9A-Z_]+$': {'minimum': 1, 'type': 'integer'}}, 'type': 'object'}13:03
mriedemOn instance['allocations'][0]['resources']: {}", "title": "Bad Request"}]}13:03
*** shaohe_feng has quit IRC13:03
*** ispp has joined #openstack-nova13:03
mriedemSending updated allocation [{'resource_provider': {'uuid': u'b2979fd7-376b-4f9e-a1b9-b4c69d619cb9'}, 'resources': {}}] for instance 390d33d0-36e2-469e-85be-8ec10658e95313:03
mriedem2018-07-27 05:15:36.513 5060 105049744 MainThread INFO nova.compute.manager [req-640b132e-9a1b-4f75-8f8d-7ae96964af72 c329c90c52a44fe2889e0284651a21f0 82e0a447215e49079fe42481922ccd81 - default default] [instance: 390d33d0-36e2-469e-85be-8ec10658e953] Doing legacy allocation math for migration 8221f52a-c72b-4b7b-81d9-67cb67fb37bc after instance move13:04
*** shaohe_feng has joined #openstack-nova13:04
mriedemi'm not sure why the hyperv ci would be hitting that in rocky13:05
mriedemedmondsw: powervm in-tree ci took over 5 hours here and timed out https://review.openstack.org/#/c/586402/13:06
mriedemfyi13:06
*** mgariepy has quit IRC13:08
*** mgariepy has joined #openstack-nova13:10
*** edleafe is now known as figleaf13:11
openstackgerritBalazs Gibizer proposed openstack/nova master: Use placement 1.28 in scheduler report client  https://review.openstack.org/58366713:11
cdentIs this already a known thing: http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22Unsupported%20VIF%20type%20unbound%20convert%5C%2213:12
cdentoh never mind, my search on launchpad just hit13:13
mriedemhttp://status.openstack.org/elastic-recheck/index.html#178391713:13
cdentit didn't when I was missing a closing t13:13
mriedemfix is in the gate13:13
cdentcool, thanks13:13
*** flwang1 has quit IRC13:14
*** antosh has joined #openstack-nova13:14
*** shaohe_feng has quit IRC13:14
*** shaohe_feng has joined #openstack-nova13:14
mriedembased on the 50 mocks i have to do in _post_live_migration, clearly that method is too big13:15
cdentugh13:16
*** savvas has joined #openstack-nova13:16
*** cdent has quit IRC13:16
*** savvas has quit IRC13:21
*** savvas has joined #openstack-nova13:21
*** abhishekk has quit IRC13:21
*** shaohe_feng has quit IRC13:24
*** shaohe_feng has joined #openstack-nova13:25
edmondswmriedem the powervm ci is borked right now. I'm trying to help get it fixed13:26
*** ttsiouts has quit IRC13:26
*** mdrabe has joined #openstack-nova13:26
*** gbarros has joined #openstack-nova13:27
*** flwang1 has joined #openstack-nova13:31
*** jistr is now known as jistr|mtg13:32
*** Luzi has quit IRC13:32
openstackgerritBalazs Gibizer proposed openstack/nova master: Use placement 1.28 in scheduler report client  https://review.openstack.org/58366713:33
*** shaohe_feng has quit IRC13:34
dansmithefried: what should happen if I have compute nodes with MISC_SHARES (and thus no DISK_GB inventory)? Should the scheduler receive split allocations from placement with disk on the sharing provider?13:35
*** shaohe_feng has joined #openstack-nova13:35
dansmithbecause I have yet to convince it to do that in a functional test13:36
*** alexchadin has quit IRC13:36
*** medberry has quit IRC13:37
*** gilfoyle has quit IRC13:37
*** alexchadin has joined #openstack-nova13:42
*** burt has joined #openstack-nova13:43
*** tbachman has joined #openstack-nova13:44
*** shaohe_feng has quit IRC13:44
*** shaohe_feng has joined #openstack-nova13:45
*** fanzhang has quit IRC13:45
*** fanzhang has joined #openstack-nova13:45
*** alexchadin has quit IRC13:49
*** shaohe_feng has quit IRC13:55
*** ttsiouts has joined #openstack-nova13:56
*** shaohe_feng has joined #openstack-nova13:56
*** mlavalle has joined #openstack-nova13:57
mriedemspeaking of, i think this is going to be the money patch https://review.openstack.org/#/c/586363/13:57
mriedemcreates a shared storage provider using the DISK_GB calculated from the compute node provider, then removes the compute node provider's DISK_GB inventory before the compute service host is discovered13:58
*** awaugama has joined #openstack-nova13:58
*** med_ has quit IRC13:58
s10Please check this bug: https://bugs.launchpad.net/nova/+bug/178400613:59
openstackLaunchpad bug 1784006 in OpenStack Compute (nova) "Instances misses neutron QoS on their ports after unrescue and soft reboot" [Undecided,New]13:59
*** ttsiouts has quit IRC14:00
*** blkart has quit IRC14:01
s10User can easily drop QoS limitations on ports with _soft_reboot() or unrescue() for libvirt driver.14:01
*** blkart has joined #openstack-nova14:01
*** ttsiouts has joined #openstack-nova14:02
mriedems10: i think we do plug_vifs on hard reboot now, but maybe not in pike...14:04
mriedemor maybe only for certain types of vifs...14:04
mriedemit's kind of a mess14:04
*** shaohe_feng has quit IRC14:05
s10plug_vifs are executed on hard reboot and spawn(). Not for soft reboot, in master: https://github.com/openstack/nova/blob/master/nova/virt/libvirt/driver.py#L270614:06
mriedemoh right, i missed the "without" here: "Execute nova reboot (without parameter --hard)"14:06
melwitthm, I thought for soft reboot they shouldn't have been unplugged in the first place, but the bug says a new domain is created, which I didn't think happened either. I wonder if something changed there14:07
*** shaohe_feng has joined #openstack-nova14:08
*** gbarros has quit IRC14:08
dansmithsoft reboot will turn into a hard reboot if the guest doesn't shut down voluntarily right?14:08
melwittshutdown and then a create14:08
mriedemcorrect14:08
dansmithit's trivial for me to make my guest not shut down when asked14:09
mriedembut apparently in this case soft reboot works14:09
melwittlooking at the code, indeed it does a guest.shutdown() followed by a create. so you'd think you'd have to plug the vifs in again, I wonder how this normally works?14:09
*** links has quit IRC14:10
dansmithhmm, it doesn't do an actual reboot?14:10
*** gilfoyle has joined #openstack-nova14:10
melwittdoesn't look like it? I guess I've never looked at soft reboot in detail before https://github.com/openstack/nova/blob/stable/pike/nova/virt/libvirt/driver.py#L254714:11
dansmithhmm, yeah, I didn't think this was like this14:11
openstackgerritMatt Riedemann proposed openstack/nova master: Pass source vifs to driver.cleanup in _post_live_migration  https://review.openstack.org/58656814:12
mriedemgiblet: ^14:12
gibletmriedem: looking14:12
dansmithI thought if we were running a virt that could do real reboot, we did that and only fell back to the shutdown/restart if not14:13
dansmithbut I don't see that14:13
melwittmriedem: why were you thinking not to use the source vifs throughout the entire method? just wondering14:13
mriedemno particular reason, just wanted to minimize the amount of change,14:14
mriedembut we could just do that at the top rather than get the refreshed nw info cache14:14
*** gilfoyle has quit IRC14:14
mriedemi.e. here https://review.openstack.org/#/c/586568/1/nova/compute/manager.py@655514:15
*** eharney has joined #openstack-nova14:15
*** shaohe_feng has quit IRC14:15
mriedemi can definitely make that change if it makes more sense14:17
*** shaohe_feng has joined #openstack-nova14:17
melwittyeah, I'm not 100% sure but it feels like it should be consistent throughout. but I guess that's never guaranteed anyway because neutron events in flight could change the network_info as it goes through the method anyway?14:17
mriedemshouldn't14:18
*** jlvacation is now known as jlvillal14:18
mriedeman event would be processed separately and shouldn't be able to modify that network_info variable by reference14:18
melwittoh, yeah, okay14:18
mriedemthe instance.info_cache might be updated concurrently, sure14:18
mriedembut we're using the local variable in most places14:18
*** r-daneel has joined #openstack-nova14:18
melwittyeah14:18
mriedemthe versioned notifications will still use instance.info_cache14:18
*** felipemonteiro has joined #openstack-nova14:19
mriedemleft that as a comment so giblet can also ponder it14:20
mriedemi didn't do it in https://review.openstack.org/#/c/586402/ because (1) it was late and (2) i just wanted to get the immediate fire put out14:20
melwittI guess I could see the rationale in only using the source vifs for the relevant actions because like I think you mentioned, maybe the notifications should reflect the state of the network info cache at the time it was queried14:20
melwittthat's the only other thing network info is used for in that method, I assume?14:21
pooja_jadhavmriedem: hello14:21
*** gilfoyle has joined #openstack-nova14:22
pooja_jadhavsean-k-mooney : hello14:22
mriedemand unfilter_instance in the firewall driver,14:22
mriedemi looked at how it was used in the various drivers and it was just getting the mac address off the vifs in one case14:23
mriedemwhich i don't think should change14:23
mriedembut,14:23
mriedemadmittedly, only passing the source vifs from migrate_data to 2 spots indicates tight coupling into knowing exactly what those methods are doing with network_info14:23
gibletmriedem, melwitt: I think having the current network info sent in the notification is the valid thing as we are notifying about the current state14:23
pooja_jadhavsean-k-mooney, mriedem: I am trying live migrate and using nfs storage, its failing for "Binding failed for port e973dde6-d68c-4aec-a70d-86dcd81fa11b and host Neha-VirtualBox."14:24
mriedempooja_jadhav: i can't really help you debug that right now14:24
melwittgiblet, mriedem: I think that makes sense too, the more I think about it14:24
mriedempooja_jadhav: i'd suggest using something besides devstack if you want a more sophisticated deployment tool for multi-node with live migration, like openstack-ansible14:24
pooja_jadhavmriedem: ok14:24
mriedemmelwitt: i'm totally fine with making the generic switch at the top of the method14:25
mriedemi don't like the tight coupling that's in here really14:25
mriedemi just wanted to reduce any exposure to regression14:25
mriedempooja_jadhav: or look at a nova-live-migration job config and see how it set things up14:25
mriedembut those don't use nfs14:25
*** shaohe_feng has quit IRC14:25
mriedemhttp://logs.openstack.org/02/586402/2/check/nova-live-migration/2db7a54/14:25
*** med_ has joined #openstack-nova14:25
*** med_ has quit IRC14:25
*** med_ has joined #openstack-nova14:25
*** alexchadin has joined #openstack-nova14:26
mriedempooja_jadhav: binding failed means something failed in neutron14:26
mriedemso network is messed up14:26
*** gilfoyle has quit IRC14:26
pooja_jadhavmriedem: Hmm14:26
*** mdrabe has quit IRC14:26
*** shaohe_feng has joined #openstack-nova14:27
melwittmriedem: yeah, I'm thinking I agree with giblet though, that we should leave it the way you have it. let the notifications use the fresh network info and not artificially send source vif. I think the only reason to use source vifs there is if somehow a notifications listener might want to know which vif is actually being acted upon during the actions in the method. hmm.14:27
pooja_jadhavmriedem: But I am not able to see any error logs at neutron side.. thats the problem14:28
mriedemmelwitt: giblet: well, only the versioned notifications will use the instance.info_cache,14:28
mriedemthe legacy ones would end up using the source vifs14:28
melwittoh14:29
*** links has joined #openstack-nova14:29
mriedemanyway, we could always change this later i guess if it causes some other unanticipated problem14:29
melwittyeah14:30
mriedemlet me check to make sure the mac address on the vif is the same between source and dest14:30
mriedemsince that's used in the firewall driver to unfilter14:30
*** alexchadin has quit IRC14:30
gibletmriedem: in the current code the legacy notification uses the local network_info and I guess that is the same as what the versioned gets from instance.info_cache14:31
mriedemsource vif "address": "fa:16:3e:cc:ff:66"14:31
mriedemfrom the cache: "address": "fa:16:3e:cc:ff:66"14:31
mriedemso yeah the mac doesn't change14:31
mriedemgiblet: yes14:31
gibletmriedem: then I still think that the current code in your patch is good14:32
*** cdent has joined #openstack-nova14:32
*** breton has quit IRC14:33
*** gilfoyle has joined #openstack-nova14:34
s10What could be done about the unrescue/soft reboot QoS issue? Should we use _create_domain_and_network() in those functions instead of a simple _create_domain()? Or call plug_vifs()?14:34
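Purely illustrative pseudocode of the behaviour being discussed above, not nova's actual libvirt driver code (all names below are hypothetical stand-ins): the soft-reboot path as described recreates the domain without going back through the vif-plug step, so anything applied at plug time (such as the neutron QoS rules in the bug) is not reapplied, whereas the hard-reboot path re-plugs the vifs first.

    # hypothetical, simplified sketch -- not the real driver code
    def soft_reboot_as_described(guest):
        guest.shutdown()   # gracefully stop the domain
        guest.create()     # recreate it; vifs are never re-plugged, so
                           # plug-time config (QoS etc.) is not reapplied

    def hard_reboot_as_described(driver, instance, network_info, xml):
        driver.plug_vifs(instance, network_info)  # re-applies plug-time config
        driver.create_domain(xml)                 # then rebuild the domain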
* giblet is logging off for the weekend14:35
*** shaohe_feng has quit IRC14:36
*** shaohe_feng has joined #openstack-nova14:37
*** jistr|mtg is now known as jistr14:39
*** tidwellr has joined #openstack-nova14:39
*** gilfoyle has quit IRC14:39
*** tidwellr has quit IRC14:39
*** tidwellr has joined #openstack-nova14:39
*** bhagyashris has quit IRC14:41
*** flwang1 has quit IRC14:41
mriedemwoot ceph shared storage change got through stack.sh and is now running tempest14:42
*** felipemonteiro_ has joined #openstack-nova14:42
cdenthuzzah14:42
dansmithcdent: did you see my question to efried earlier?14:42
cdentdansmith: no sir, what's up?14:43
openstackgerritChris Dent proposed openstack/nova master: [placement] Retry allocation writes server side  https://review.openstack.org/58604814:43
dansmith[06:36:22]  <dansmith>efried: what should happen if I have compute nodes with MISC_SHARES (and thus no DISK_GB inventory)? Should the scheduler receive split allocations from placement with disk on the sharing provider?14:43
dansmith[06:36:46]  <dansmith>because I have yet to convince it to do that in a functional test14:43
dansmithcdent: ^14:43
cdentone sec, let me find something14:44
cdentdansmith: this is current passing: https://github.com/cdent/placecat/blob/master/gabbits/fridge.yaml#L204-L21314:45
cdentwhich is an example of some allocations with sharing providers14:45
*** [fcandido] has joined #openstack-nova14:45
cdentso in theory it should work, but I'm not clear on what needs to happen on compute-node side to set things up14:45
*** felipemonteiro has quit IRC14:46
dansmithcdent: that is asserting what? that one of the providers only has a part of the whole?14:46
*** shaohe_feng has quit IRC14:46
dansmithor, two providers in the request14:46
mriedem # but there are two resource providers in that one allocations block14:46
*** shaohe_feng has joined #openstack-nova14:46
cdent^14:46
dansmithyeah14:46
*** gilfoyle has joined #openstack-nova14:47
dansmithso, that tells me that a single non-fancy request to placement should return a split allocation14:47
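For illustration, this is roughly the shape of a single "split" candidate in a GET /allocation_candidates response (microversion 1.12 or later) when DISK_GB comes from a sharing provider; the UUID keys are placeholders.

    # one entry from the "allocation_requests" list in the response body
    split_candidate = {
        "allocations": {
            "compute-node-rp-uuid": {            # compute node provider
                "resources": {"VCPU": 1, "MEMORY_MB": 512},
            },
            "shared-disk-rp-uuid": {             # sharing DISK_GB provider
                "resources": {"DISK_GB": 1},
            },
        },
    }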
mriedemdansmith: we should know shortly from this ceph patch i have14:47
cdentIf we need a specific functional test for something, I'm semi idle right now, so could make something if someone tells me what it needs to be14:47
dansmithand the scheduler is doing a non-fancy request, so it should be getting back a split allocation I guess14:47
melwittmriedem: in https://review.openstack.org/586568 is that taking care of the live migration rollback scenario? or is that still an open question14:47
mriedemmelwitt: i looked at rollback and didn't see anything that needed this type of thing14:47
dansmithcdent: well, I've tried writing a very hacky one and placement is returning no allocation requests14:47
melwittmriedem: ack14:47
cdentdansmith: do you want to push it up and I'll tune it and you can go review something?14:48
mriedemmelwitt: i'd say if we ever go the generic route in _post_live_migration, we'd want to do the same in _rollback_live_migration14:48
mriedemrollback is likely less of an issue b/c if we failed live migration, we won't activate the dest host port bindings and get into this mess14:48
*** lucasbagb has joined #openstack-nova14:49
[fcandido]http://eavesdrop.openstack.org/meetings14:49
*** efried is now known as fried_rice14:49
*** [fcandido] has left #openstack-nova14:49
melwittack14:49
openstackgerritDan Smith proposed openstack/nova master: WIP: funtional test with sharing providers  https://review.openstack.org/58658914:49
dansmithcdent: ^14:49
cdenton it14:49
dansmithcdent: warning, it's very, uh, forced14:50
cdentha, noted14:50
*** flwang1 has joined #openstack-nova14:50
fried_ricedansmith/superdan: I haven't caught up on the whole conversation, but you're asking about a compute node that's marked as a sharing provider?14:50
dansmithcdent: attempts to create a provider with disk, associate with the compute host providers, nuke the disk inventory from one and then try to boot and see if we got the shared bit14:50
cdent14:51
dansmithfried_rice: no, not a compute node marked as sharing, just a compute with no disk because it's associated to a shared disk provider14:51
*** mlavalle has quit IRC14:52
*** imacdonn has quit IRC14:52
*** mlavalle has joined #openstack-nova14:52
mriedemdansmith: why not write a simple fake virt driver that doesn't report DISK_GB inventory?14:52
*** imacdonn has joined #openstack-nova14:52
dansmithmriedem: because this was quick14:52
dansmithmriedem: obviously not mergeable14:52
*** fgonzales_ has joined #openstack-nova14:53
mriedemyour max_unit is wrong14:54
melwittargh, gate bug fix just failed merge for POST_FAILURE14:54
mriedemyour sharing provider has 1gb14:54
mriedemunless flavor1 doesn't have any root_gb14:55
*** bacape has joined #openstack-nova14:56
*** breton has joined #openstack-nova14:56
dansmithit has 1024 GB14:56
*** Bellesse has joined #openstack-nova14:56
*** jfinck has joined #openstack-nova14:56
*** shaohe_feng has quit IRC14:56
mriedembut max you can request in a chunk is 1 right?14:56
dansmithoh max_unit14:56
cdenti'll mess with it14:57
mriedemmax_unit should equal total14:57
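For reference, a sketch of the PUT /resource_providers/{uuid}/inventories payload being discussed; the generation value is a placeholder and must match the provider's current generation. With max_unit at 1, as in the test above, no single request larger than 1 GB can be satisfied, so for a shared-disk provider max_unit normally equals total.

    disk_inventory_payload = {
        "resource_provider_generation": 1,   # placeholder
        "inventories": {
            "DISK_GB": {
                "total": 1024,
                "max_unit": 1024,            # largest single allocation allowed
                "min_unit": 1,
                "step_size": 1,
                "reserved": 0,
                "allocation_ratio": 1.0,
            },
        },
    }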
*** shaohe_feng has joined #openstack-nova14:57
dansmithstill no dice14:57
dansmither, hmm it didn't update14:57
dansmithah, I'm setting inventory twice for some reason14:58
sean-k-mooneymelwitt: the live migrate one?14:58
mriedemyeah14:58
mriedemyou might be using a 1 root_gb flavor anyway14:58
mriedemso the max_unit being 1 might not make a difference14:59
dansmithI was, and still no difference14:59
dansmithyeah14:59
melwittsean-k-mooney: yeah14:59
fried_ricedansmith/superdan: Okay, you're trying to make a setup that has its disk allocated from a sharing provider, not the compute node. And then what, migrate it?14:59
mriedemboot and then migrate14:59
mriedembut boot fails?14:59
dansmithfried_rice: well, boot first would be nice14:59
fried_ricebhagyashri got that working live and in a func test with the libvirt driver.15:00
dansmithfried_rice: I believe migrate will mangle the allocations, but trying to prove it15:00
fried_riceHave you located that func test yet?15:00
dansmithnope15:00
fried_ricedansmith: I suspect you may be right.15:00
fried_riceokay, stand by...15:00
mriedemfried_rice: that libvirt func test doesn't go through the scheduler though right?15:00
dansmithfried_rice: yeah, so in that case, I want to remove the bit of the libvirt inventory thing that will not expose disk_gb, because people may turn that on, and then be mangling their allocations with migrations for a couple days before realizing it15:01
fried_ricemriedem: I sure thought it did.15:01
fried_ricehttps://review.openstack.org/#/c/560459/15:01
mriedemhmm yeah https://review.openstack.org/#/c/560459/17/nova/tests/functional/libvirt/test_shared_resource_provider.py15:01
fried_riceyup15:02
*** links has quit IRC15:02
dansmithyeah, so I dunno why it's not working for me15:02
dansmithbut that's fine15:03
sean-k-mooneydansmith: only the allocations for the compute resources need to be migrated, correct? the shared storage allocation should remain the same.15:03
mriedemsean-k-mooney: well, that's the point of the test,15:03
dansmithsean-k-mooney: right, but we don't do that properly15:03
fried_ricedansmith: Building on that one and trying a migration would be informative. I would be surprised if it works properly, because we have no logic to do ^15:03
mriedembecause we have FIXME notes all over the migration code15:03
sean-k-mooneyi guess unless we are migrating with a block migration to a different storage provider15:03
dansmithfried_rice: I have fixmes about it being broken and known15:03
fried_riceyup15:04
*** alexchadin has joined #openstack-nova15:04
dansmithfried_rice: so, yeah, I'm not sure why we landed the patch to do that for inventory in that case, but.. alas15:04
fried_ricedansmith: So that we wouldn't be double-reporting inventory allocations.15:04
*** gilfoyle_ has joined #openstack-nova15:05
fried_ricedansmith: Can't you only migrate an instance that's on volume storage anyway?15:05
dansmithfried_rice: right, but that has been broken since forever, and this change means we *lose* data15:05
dansmithno15:05
fried_ricewhat happens to the disk?15:05
mriedemssh to the dest15:05
dansmithhah15:05
fried_riceeek, really?15:05
dansmithit gets migrated15:05
*** mdrabe has joined #openstack-nova15:05
dansmitheither block migration or shared (non-volume) storage in the backend15:05
fried_riceOkay, so what are we expecting to happen here?15:05
dansmithfor live, and yeah, scp to dest for the cold migration case15:06
fried_riceI would have thought we would ssh the data to whatever disk got allocated on the dest.15:06
*** gilfoyle has quit IRC15:06
dansmithI think we need to remove that bit of the inventory logic that doesn't expose DISK_GB15:06
dansmithso that we don't get split allocations that we trash during a migration15:06
dansmithbecause we'll end up with instances with no DISK_GB allocation at all15:06
fried_ricewhich may or may not be the same provider as we started on, but to a different spot on that disk - which would be something to fix later15:06
dansmithand then start overcommitting15:06
*** shaohe_feng has quit IRC15:06
fried_riceI don't understand that thinking. And IMO it is premature to land a patch to yank that out until we've demonstrated that anything bad happens.15:07
*** shaohe_feng has joined #openstack-nova15:07
dansmiththat's why I'm trying to write a test15:07
fried_ricesounds good.15:07
fried_riceneed help?15:07
*** josecastroleon has quit IRC15:07
dansmithI asked for help and now am working on using that functional test to do my bidding15:08
* cdent is still poking at the test too15:08
*** r-daneel_ has joined #openstack-nova15:08
*** r-daneel has quit IRC15:09
*** r-daneel_ is now known as r-daneel15:09
mriedemi believe this is the problem https://github.com/openstack/nova/blob/master/nova/conductor/tasks/migrate.py#L4815:09
mriedemb/c we're assuming only allocations on the source compute node provider15:09
fried_riceI think the worst that happens is we fail to remove the original allocation for the DISK_GB on the sharing provider.  What happens after that depends on whether we migrated to a compute node with or without sharing disk. But the doubled allocation leaves us in no worse shape than we were before this fix, I would have thought.15:09
mriedemand copy those to the migration consumer15:09
mriedemwhich won't include the DISK_GB allocation on the shared provider15:09
sean-k-mooneyfried_rice: dansmith do we handle flavors with root_gb=0 in placement by the way. pre-placement we just did not track their disk usage properly. im assuming that is still broken15:09
mriedemsean-k-mooney: fixed like 1 week ago15:10
mriedemsean-k-mooney: https://review.openstack.org/#/q/topic:bug/1469179+(status:open+OR+status:merged)15:10
sean-k-mooneymriedem: fixed by reading the disk size from the image?15:10
dansmithfried_rice: and I think we lose the disk allocation silently15:10
fried_ricedansmith: mriedem: oic, yeah, that makes sense.15:10
mriedemsean-k-mooney: no, we don't request DISK_GB allocations for bfv15:10
sean-k-mooneymriedem: i was thinking about the non boot form volume case15:11
fried_riceI didn't realize we don't go through GET /a_c to request the resources on the dest.15:11
mriedemyes this is what removes the instances allocations https://github.com/openstack/nova/blob/master/nova/conductor/tasks/migrate.py#L6015:11
mriedemfrom all providers15:11
mriedemfried_rice: we do to pick the dest host during scheduling15:12
sean-k-mooneymriedem: for example the nano flavor with the cirros image in devstack with no volume for the guest.15:12
fried_ricemriedem: GET /a_c or just GET /rps?resources=... ?15:12
mriedemGET /a_c,15:12
mriedemwe have to do that in the scheduler to figure out which providers to filter for a dest host15:12
fried_ricemriedem: and then ignore that result and just copy the resources from the src to the dest?15:12
mriedemi'm looking to confirm that15:13
fried_ricemriedem: well, you could have used GET /rps?resources=... as well15:13
mriedemsure but we don't in the scheduler15:13
fried_riceThe right thing would be to use GET /a_c to pick the host *and* create the allocations. Then we wouldn't be having this problem.15:13
mriedemoh you know what,15:13
mriedemyes that's what we o15:14
mriedem*do15:14
mriedemwe move the existing allocs from the instance on the source node to the migration record,15:14
mriedemand then call the scheduler and claim on the dest host15:14
mriedemso the migration has allocs on source host and instance has allocs on dest host15:14
mriedemthen on successful migration we delete the migration allocs on the source host15:14
mriedemon failure, we delete allocs for instance on dest and move allocs from migratoin on source host to instane15:15
mriedem*instance15:15
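A loose sketch (not nova's actual code) of the allocation "swap" step described above, done straight against the placement REST API; the endpoint, token, UUIDs and project/user IDs are placeholders. The subtle point behind the bugs being discussed: if only the source compute node provider's allocations get copied to the migration consumer, a DISK_GB allocation held against a sharing provider is left behind and can then be lost on revert or failure, whereas copying the instance's full allocation set keeps it.

    import requests

    PLACEMENT = "http://controller:8778/placement"   # placeholder endpoint
    HEADERS = {
        "X-Auth-Token": "ADMIN_TOKEN",               # placeholder
        "OpenStack-API-Version": "placement 1.28",
    }

    def get_allocations(consumer_uuid):
        r = requests.get("%s/allocations/%s" % (PLACEMENT, consumer_uuid),
                         headers=HEADERS)
        r.raise_for_status()
        # {"<rp uuid>": {"resources": {"VCPU": 1, ...}, ...}, ...}
        return r.json()["allocations"]

    def put_allocations(consumer_uuid, allocations, project_id, user_id):
        body = {
            "allocations": allocations,
            "project_id": project_id,
            "user_id": user_id,
            "consumer_generation": None,   # None means a brand new consumer
        }
        r = requests.put("%s/allocations/%s" % (PLACEMENT, consumer_uuid),
                         headers=HEADERS, json=body)
        r.raise_for_status()

    # copy *all* of the instance's allocations (compute node provider plus any
    # sharing provider) to the migration consumer, not just the compute node's
    instance_allocs = get_allocations("INSTANCE_UUID")
    migration_allocs = {rp: {"resources": alloc["resources"]}
                        for rp, alloc in instance_allocs.items()}
    put_allocations("MIGRATION_UUID", migration_allocs, "PROJECT_ID", "USER_ID")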
fried_riceoh, so what's actually happening is we're erroneously losing the DISK_GB allocation for a minute during the migration, but picking it up again on the dest.15:15
mriedemso we don't hit _move_operation_alloc_request in the scheduler report client15:15
cdentdansmith: the root problem in your test is that two compute nodes are not in the aggregate, when you do the put for that it is coming up 404, so the resource providers don't exist yet, not sure why that is15:15
dansmithcdent: ah, okay15:15
dansmithmriedem: hmm, so we end up double-claiming on the shared provider?15:16
dansmithmriedem: I thought even with the new accounting we had to grab the allocation for the provider in question and regenerate it, which would mean the instance's allocation on the dest wouldn't include the shared one15:16
mriedemi don't think so...as eric said, we'll remove the allocs for the instance on the shared provider,15:16
mriedemthen claim on the dest during scheduling15:16
*** shaohe_feng has quit IRC15:17
dansmithbecause we do a full regular schedule?15:17
mriedemi think on a revert or failed migration we'd eff that up though15:17
mriedemyes15:17
mriedem*EXCEPT* in the case of forced live migrate15:17
dansmithoh you're saying we drop the disk allocation but only because we don't copy it for the migration15:17
mriedemwe don't go through the scheduler there15:17
mriedemdansmith: yeah15:17
dansmithso,15:18
*** shaohe_feng has joined #openstack-nova15:18
mriedemon a revert or failed resize, we'll delete the allocs for the instance on the dest host (created by the scheduler) and move those back from the migration to the instance, but the migration allocs won't be on the sharing provider15:18
dansmithwhat happens if placement picks a different sharing provider than we had before? our disk doesn't actually move15:18
mriedemso we'd lose the DISK_GB allocs in that case15:18
dansmithah, yeah, anything where we use the migration's allocations would be wrong15:18
*** niraj_singh has quit IRC15:18
mriedemdansmith: yup that's this https://github.com/openstack/nova/blob/master/nova/compute/manager.py#L413815:19
mriedemwell, we wouldn't hit that yet15:19
mriedemthe migration consumer will only have VCPU and MEMORY_MB allocations against the source node15:19
mriedemso this https://github.com/openstack/nova/blob/master/nova/compute/manager.py#L415515:19
mriedemso that's definitely busted - we could easily test that with a resize revert test and verify the DISK_GB allocation for the instance is gone15:20
dansmithand forced live15:20
mriedemi haven't stepped through forced live yet (or evac for that matter)15:20
sean-k-mooneydansmith: we should only be able to pick a different provider in a block migrate case correct? if we do not set that flag we should not allow the sharing provider to change15:21
dansmithalso,15:21
dansmithmigrate to an older node that doesn't have this will drop the shared disk allocation15:21
dansmithbecause placement will allocate from its own disk inventory, even though it's the same pool,15:21
mriedemsean-k-mooney: re: "for example the nano flavor with with the cirros image in devstack with no volume for the guest." i don't know what you're asking me15:21
dansmithand then when we upgrade that node, we won't convert the allocations15:21
dansmithin fact,15:21
dansmithany upgrade where we boot up on rocky code and change our inventory will break all our allocations right?15:22
dansmithfried_rice: cdent what happens if I have allocations against my disk_gb inventory and then I update my inventory with no disk_gb ?15:22
fried_riceThe inv update will bounce 409 InventoryInUse.15:22
mriedemi don't think you can do that15:22
mriedemyeah15:22
fried_riceon every periodic15:23
dansmithokay, so anyone with MISC_SHARES now will failboat on upgrade to rocky15:23
fried_riceupdate_from_provider_tree will never succeed.15:23
dansmithand anyone that sets that on non-empty computes will stop reporting15:23
mriedemoh right b/c upt removes the DISK_GB from the node provider if it sees it's in a sharing provider relationship15:23
dansmithyeah15:24
fried_riceNote that we didn't document that you could do this.15:24
mriedemand if that DISK_GB is being used it will blow up on the remove15:24
mriedemfried_rice: heh i know15:24
dansmithfried_rice: and yet, it's in documentation and people have tried it, hence the bug yeah?15:24
sean-k-mooneymriedem: in that instance. the flavor has root_gb=0 the imange is like 20MB in glance  and we boot it on the dest without claim space in placement. the vm can use as much space as disk topology in the image specifies15:24
*** alexchadin has quit IRC15:24
*** tssurya has quit IRC15:24
fried_ricedansmith: The bug was opened because bhagyashri was working on it and I said it should have a bug report.15:24
mriedemshared storage providers is definitely a feature/spec15:25
fried_rice...which we don't claim works yet.15:25
mriedemgiven the upgrade/CI/etc15:25
mriedemi know, but15:25
*** ttsiouts has quit IRC15:25
fried_ricewe should document that we *don't* support it.15:25
dansmithwell, there's mention of that trait in our own docs, and given what it breaks it's not trivial, IMHO15:25
fried_riceand then take some time to resolve these issues correctly.15:25
mriedemthat's what dansmith and melwitt were talking about last night15:25
sean-k-mooneymriedem: anyway that's unrelated to dan's question except for the fact we don't track the disk usage correctly in placement15:25
*** ttsiouts has joined #openstack-nova15:26
melwittyes, described here L8 https://etherpad.openstack.org/p/nova-rocky-release-candidate-todo15:26
mriedemsean-k-mooney: yes i think that's correct and likely a bug; i'm not entirely sure how the resource tracker reports disk usage for a flavor like that which *is* using local disk15:26
*** fgonzales_ has quit IRC15:26
mriedemwhere root_gb=015:26
mriedemsean-k-mooney: we've also said you shouldn't use root_gb=0 except for volume-backed flavors15:27
mriedemand added a policy rule in rocky to disable that15:27
*** shaohe_feng has quit IRC15:27
sean-k-mooneymriedem: oh cool15:27
sean-k-mooneyis it set by default?15:27
mriedemsean-k-mooney: https://github.com/openstack/nova/commit/763fd62464e9a0753e061171cc1fd826055bbc0115:28
mriedemthe plan was to disable that by default starting in stein15:28
*** jfinck has quit IRC15:28
mriedemso you can't boot a server with a root_gb=0 flavor unless you're doing boot from volume15:28
cdentdansmith: microversions :(15:28
mriedemhow are microversions related to this?15:28
dansmithI assume because I messed up a version in my test15:29
*** shaohe_feng has joined #openstack-nova15:29
cdent(sorry, his test)15:29
mriedemah15:29
mriedemwhew15:29
sean-k-mooneymriedem: right ok cool. if you have existing instances booted that way we would have to update the embedded flavor or resource dict to indicate it, or live migration will explode15:29
mriedemsean-k-mooney: so figuring out how we track disk usage for those types of flavors in the resource tracker would be good to know15:29
mriedembecause if it was never tracked as usage before, then it's not really a huge regression to not be tracking it in placement15:30
sean-k-mooneymriedem: im pretty sure we track it as 015:30
sean-k-mooneye.g. we dont track it at all15:30
*** ttsiouts has quit IRC15:30
mriedemsean-k-mooney: i think that too b/c https://github.com/openstack/nova/blob/master/nova/compute/resource_tracker.py#L146115:30
mriedemobject_or_dict.flavor.root_gb15:30
sean-k-mooneyit was a way to bypass quota in the past15:31
mriedemthe is_bfv in there was just recently added in the same series of fixes for the bfv thing15:31
mriedemright, so to summarize, don't set flavor root_gb=0 *unless* those flavors are only used with bfv instances,15:31
mriedemand we have the is_bfv root_gb reporting in the RT and placement fixed in rocky15:31
dansmithmelwitt: aight, well, anyway, my recommendation is that we just remove that inventory quirk for rocky since it can't work and it's one line. alternatively, at least a known-issue reno just to cover our butts in case someone hits it15:32
sean-k-mooneymriedem: ya i think though we will have to fix up the allocations for existing instances that are not bfv going forward15:32
mriedemsean-k-mooney: we do15:32
dansmithmelwitt: it's like having a half-merged feature.. doesn't really serve any purpose and is externally tickle-able to failure15:32
dansmithobviously your call on what to do15:33
mriedemsean-k-mooney: https://review.openstack.org/#/c/583715/15:33
mriedemsean-k-mooney: we'll heal on move15:33
cdentany swag on how hard to make it go, now-ish?15:33
mriedem"make it go" == make it work?15:33
mriedemwe don't even have multi-node shared storage provider CI15:33
mriedemso very high risk IMO15:33
sean-k-mooneymriedem: in the non BFV case we need to read the size from the image if the flavor root_gb=015:33
mriedemway too late15:34
mriedemsean-k-mooney: yup15:34
dansmithyeah, way too late to try to make any of the broken non-broken15:34
mriedembut that's not reported to the RT as far as i know15:34
fried_ricedansmith: I can propose that if you like.15:34
dansmithfried_rice: which?15:34
mriedemdansmith: so we should likely start with a bug saying this stuff will nuke your DISK_GB allocations on failure or revert at least15:34
*** andymccr has quit IRC15:34
mriedemfried_rice: melwitt: ^15:34
fried_ricedansmith: Taking that line out of the libvirt driver.15:34
dansmithmriedem: for sure15:35
*** andymccr has joined #openstack-nova15:35
dansmithfried_rice: sure, I'm happy to do it as well, either way15:35
mriedemand we can track the various bugs in a spec in stein if we're going to go full on and support this15:35
fried_ricedansmith: You want to write up the bug, I'll do the patch?15:36
mriedemb/c we need a spec for the upgrade impacts obviously, and how to deploy the thing, plus CI requirements (which i'm already half-way done with)15:36
melwittsounds like a plan15:36
dansmithfried_rice: if mriedem isn't going to15:36
mriedemgo ahead15:36
dansmithfried_rice: I think mriedem really wants to do it15:36
dansmithI heard him say earlier15:36
dansmithso I don't want to step on his toes15:36
mriedemi'm cleaning up stephenfin's last 2 changes in his vswitch series15:36
dansmithbecause I think he measures his weekly progress by bugs reported15:37
dansmithmriedem: more?15:37
mriedemplus, zuul just f'ed my ceph ci run that was almost done!15:37
melwittthis is new, gate failure RETRY_LIMIT15:37
melwittgreat15:37
mriedemmelwitt: yes same15:37
openstackgerritChris Dent proposed openstack/nova master: WIP: funtional test with sharing providers  https://review.openstack.org/58658915:37
mriedeminfra just posted a status15:37
mriedem#status alert A zuul config error slipped through and caused a pile of job failures with retry_limit - a fix is being applied and should be back up in a few minutes15:37
*** shaohe_feng has quit IRC15:37
mriedemso don't recheck15:37
cdentdansmith: ^ that gets the test actually making reasonable requests, but no more that that15:38
cdentnot sure if you care given the earlier discussion, but in case you do...15:38
*** shaohe_feng has joined #openstack-nova15:38
dansmithcdent: yeah, probably don't care now that I found this other one15:38
dansmithbut thanks for setting me straight15:38
-openstackstatus- NOTICE: A zuul config error slipped through and caused a pile of job failures with retry_limit - a fix is being applied and should be back up in a few minutes15:39
*** ChanServ changes topic to "A zuul config error slipped through and caused a pile of job failures with retry_limit - a fix is being applied and should be back up in a few minutes"15:39
*** hongbin_ has joined #openstack-nova15:39
fried_ricecdent: All I care about is that you misspelled funtional15:40
mriedemi can write the bug if no one has started yet15:40
fried_riceDo it. And let the English see you do it.15:41
mriedemalright15:41
cdentfried_rice: that was dansmith in this case15:42
dansmithI was rushing15:42
cdentbut I can see how it being me would be unsurprising15:42
* cdent is always rushing15:42
*** Shilpa has quit IRC15:42
cdentis my new excuse15:42
fried_ricecdent: If it had been three weeks ago, and it had been fuctional, I would have totally known it was you.15:42
cdentI fuctional tests all the time15:43
fried_riceIs there a way to mark a normal funtional test as an xfail?15:44
fried_riceoh, shit, bhagyashri's test still succeeds with that bit commented out :(15:45
*** mdrabe has quit IRC15:46
fried_riceignore me, phew.15:46
*** mdrabe has joined #openstack-nova15:46
*** AlexeyAbashkin has quit IRC15:46
cdentfried_rice: https://docs.python.org/3/library/unittest.html#unittest.expectedFailure15:47
cdenthttps://docs.python.org/3/library/unittest.html#skipping-tests-and-expected-failures15:47
fried_ricethanks cdent15:47
fried_riceonly py3? Are we running func tests on only py3 these days?15:47
*** shaohe_feng has quit IRC15:47
sean-k-mooneyfried_rice: i think we have both. still15:48
fried_riceyup sean-k-mooney15:48
sean-k-mooneyyou can probably do a version check and just skip on 2 and expect failure on 315:49
*** shaohe_feng has joined #openstack-nova15:49
mriedemhere you go https://bugs.launchpad.net/nova/+bug/178402015:50
openstackLaunchpad bug 1784020 in OpenStack Compute (nova) "Shared storage providers are not supported and will break things if used" [High,Triaged]15:50
mriedemdansmith: fried_rice: melwitt: ^15:50
dansmithoh thanks15:50
* dansmith closes the empty bug report he had open15:50
cdentfried_rice: https://docs.python.org/2.7/library/unittest.html#skipping-tests-and-expected-failures15:50
melwittnow that's a bug report15:50
dansmithmine would have been 5% of that15:51
mriedemfried_rice: testtools has an expectedFailure thing15:51
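For what it's worth, a minimal illustration of the xfail idea; the stdlib decorator below exists on both python 2.7 and 3 (testtools has an equivalent, per the comment above). The class and test names are made up for the example.

    import unittest

    class SharedStorageMigrateTest(unittest.TestCase):
        @unittest.expectedFailure
        def test_migration_keeps_shared_disk_allocation(self):
            # documents the known-broken behaviour; once the allocation
            # handling is fixed this flips to an "unexpected success"
            self.assertTrue(False)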
melwitthah, I know15:51
mriedemyou said i take pride in it...15:51
dansmith"s'broken, kthx"15:51
mriedemmostly because if i don't put those details in there, i'll totally forget wtf we talked about a year from now15:51
melwittyeah. the details are super helpful15:51
sean-k-mooneymriedem: i think that bug also falls into the category of "we dont have ci for it so its broken by default"15:52
mriedemwell,15:52
mriedemwe don't have CI for a lot of things15:52
mriedemand we still support them, <cough>evacuate</cough>15:52
*** gyee has joined #openstack-nova15:52
sean-k-mooneyyes and i assume they are broken by default unless proven otherwise by it working when i use it and being happy15:53
dansmithevacuate is hard to test for legit reasons, but this shared thing is not15:53
dansmithand it's also often broken15:53
mriedemyup15:54
mriedembtw, yes, forced host live migrate/evacuate will drop the DISK_GB allocation on the shared provider15:54
*** flwang1 has quit IRC15:55
dansmithmriedem: from your test?15:56
mriedemno just looking at teh code15:57
mriedemhttps://github.com/openstack/nova/blob/6be7f7248fb1c2bbb890a0a48a424e205e173c9c/nova/scheduler/utils.py#L50015:57
mriedemwe only get the allocations for the instance against the source node15:57
dansmithoh15:57
mriedemand copy those to the dest node for the instance15:57
mriedemdouble up15:57
mriedemdoesn't put anything on the migration record in the force cas15:58
mriedem*case15:58
*** shaohe_feng has quit IRC15:58
*** rpittau has quit IRC15:58
*** r-daneel_ has joined #openstack-nova15:58
mriedemhmm, which makes me wonder if we ever cleanup the dest host allocations on a failed live migration15:58
mriedemthat is forced15:58
*** flwang1 has joined #openstack-nova16:00
mriedemlooks like post_live_migration will give you a warning but remove the doubled allocation16:00
*** shaohe_feng has joined #openstack-nova16:00
*** r-daneel has quit IRC16:00
*** r-daneel_ is now known as r-daneel16:00
mriedemhttps://github.com/openstack/nova/blob/6be7f7248fb1c2bbb890a0a48a424e205e173c9c/nova/compute/manager.py#L6638L666916:00
mriedemoops16:00
mriedemi'll write a functional test for the rollback forced live migration case16:01
*** openstackgerrit has quit IRC16:04
mriedemhttps://bugs.launchpad.net/nova/+bug/178402216:06
openstackLaunchpad bug 1784022 in OpenStack Compute (nova) "Failed forced live migration does not rollback doubled up allocations in placement" [High,Triaged]16:06
mriedemlooks like we regressed that in queens16:07
*** shaohe_feng has quit IRC16:08
mriedemblarg https://review.openstack.org/#/c/507638/25/nova/compute/manager.py@625216:08
*** shaohe_feng has joined #openstack-nova16:09
*** jangutter has quit IRC16:10
dansmithmriedem: are you saying we don't have a migration record if we do a forced?16:11
mriedemdansmith: we do, but we don't put the allocations on it16:11
*** lbragstad_ is now known as lbragstad16:11
mriedemb/c we don't go through the scheduler for forced16:11
*** ispp has quit IRC16:11
dansmithum16:11
mriedemthis is just one of the many reasons for the dreaded -5 in dublin16:11
*** flwang1 has quit IRC16:12
mriedemdansmith: forced live migration calls this method to double up the allocations from the source to the forced dest https://github.com/openstack/nova/blob/6be7f7248fb1c2bbb890a0a48a424e205e173c9c/nova/scheduler/utils.py#L47316:12
mriedemthat's from pike when doubling was all the rage16:12
dansmithokay, so you're saying on forced we don't do the migration allocations, we just allocate against the newhost, then if we have to revert, we don't have the migration allocations to revert to the instance?16:12
melwittis it safe to recheck yet? I didn't see another status update16:12
mriedemdansmith: correct16:13
mriedemmelwitt: yeah i just did16:13
melwittok16:13
mriedemdansmith: i'll write a functional test for it when i'm back from lunch16:13
dansmithmriedem: okay but the doubling is not intentional, just incidental since we didn't replace the instance allocs with the migration one yeah?16:13
mriedemit's intentional16:13
mriedemit mimics the behavior of doubling in the scheduler from before quens16:13
mriedem*queens16:14
dansmithright, but we shouldn't be doing any doubling anymore16:14
mriedemsure,16:14
mriedembut we are :)16:14
mriedemfor forced16:14
mriedemb/c forced is FUN16:14
mriedem-20!16:14
*** links has joined #openstack-nova16:14
dansmithI'm saying we shouldn't intend to be doing that,16:14
mriedemnot anymore no16:14
dansmithwhich means it's a case we missed in converting to non-doubling16:14
mriedembut we missed it in queens with your bp16:14
mriedemyes16:14
dansmithright, that's what I mean16:14
dansmithunintentional16:14
mriedemyeah16:14
mriedemok lunch16:15
*** mriedem is now known as mriedem_away16:15
*** links has quit IRC16:17
*** links has joined #openstack-nova16:17
*** shaohe_feng has quit IRC16:18
*** shaohe_feng has joined #openstack-nova16:19
*** artom_ has joined #openstack-nova16:22
*** links has quit IRC16:23
*** Sundar_ has joined #openstack-nova16:23
sean-k-mooneymriedem_away: im going to choose to read -20! as -(20 factorial) to give it the weight it should have16:23
Sundar_efried: Please ping me when you have the time16:25
*** openstackgerrit has joined #openstack-nova16:26
openstackgerritEric Fried proposed openstack/nova master: libvirt: Revert non-reporting DISK_GB if sharing  https://review.openstack.org/58661416:26
fried_ricemriedem_away, dansmith, cdent, melwitt: ^16:26
fried_riceSundar_: Bad timing :( I have to run for a bit.  Will you be around in a couple of hours?16:26
Sundar_NP, sure16:27
*** harlowja has joined #openstack-nova16:27
*** flwang1 has joined #openstack-nova16:28
*** shaohe_feng has quit IRC16:28
*** shaohe_feng has joined #openstack-nova16:29
*** derekh has quit IRC16:30
*** tesseract has quit IRC16:32
*** fried_rice is now known as fried_rolls16:33
*** vladikr has quit IRC16:35
*** vladikr has joined #openstack-nova16:35
*** shaohe_feng has quit IRC16:39
dansmithmriedem_away: when you're back: I guess I don't really see the thing requiring the dynamic opts registration as being a bad thing16:40
dansmithmriedem_away: it forces us to think about it when we write new code and the tests for it16:40
*** shaohe_feng has joined #openstack-nova16:41
*** Bellesse has quit IRC16:44
*** rmart04 has quit IRC16:46
*** shaohe_feng has quit IRC16:49
openstackgerritDan Smith proposed openstack/nova master: Assorted cleanups from numa-aware-vswitches series  https://review.openstack.org/58265116:49
openstackgerritDan Smith proposed openstack/nova master: Add additional functional tests for NUMA networks  https://review.openstack.org/58538516:49
*** shaohe_feng has joined #openstack-nova16:49
*** felipemonteiro__ has joined #openstack-nova16:52
*** felipemonteiro_ has quit IRC16:52
cdentmelwitt, dansmith, mriedem_away : next week I'm pretty broadly available, so if stuff comes up and you want to wind me up and point me particular places, please ask.16:52
melwittwill do, thanks16:54
*** shaohe_feng has quit IRC16:59
*** shaohe_feng has joined #openstack-nova17:04
*** felipemonteiro_ has joined #openstack-nova17:06
*** mgoddard has quit IRC17:07
*** yamahata has quit IRC17:07
*** burt has quit IRC17:08
*** shaohe_feng has quit IRC17:09
*** felipemonteiro__ has quit IRC17:10
*** dtantsur is now known as dtantsur|afk17:10
*** gbarros has joined #openstack-nova17:11
*** shaohe_feng has joined #openstack-nova17:12
*** bacape_ has joined #openstack-nova17:16
*** felipemonteiro__ has joined #openstack-nova17:18
*** felipemonteiro_ has quit IRC17:18
*** bacape_ has quit IRC17:18
*** bacape has quit IRC17:20
*** shaohe_feng has quit IRC17:20
*** shaohe_feng has joined #openstack-nova17:20
*** mriedem_away is now known as mriedem17:23
*** gbarros has quit IRC17:23
*** artom has joined #openstack-nova17:23
*** jmlowe has joined #openstack-nova17:24
*** artom_ has quit IRC17:26
*** savvas has quit IRC17:29
*** shaohe_feng has quit IRC17:30
*** harlowja has quit IRC17:31
*** shaohe_feng has joined #openstack-nova17:32
*** felipemonteiro_ has joined #openstack-nova17:34
*** felipemonteiro__ has quit IRC17:37
*** cfriesen_ has quit IRC17:39
*** shaohe_feng has quit IRC17:40
*** shaohe_feng has joined #openstack-nova17:41
*** gbarros has joined #openstack-nova17:42
*** mgoddard has joined #openstack-nova17:43
*** yamahata has joined #openstack-nova17:43
*** colby_ has joined #openstack-nova17:46
colby_Hey Everyone. Im trying to get metrics based filtering working in nova. I tried enabling compute_monitors but I always get an error in the logs:17:47
colby_compute_monitors=["nova.compute.monitors.cpu.virt_driver", "numa_mem_bw.virt_driver"]17:47
colby_2018-07-27 17:43:14.001 2295696 WARNING nova.compute.monitors [req-51711d41-c626-4af2-92fd-dde09c576fb2 - - - - -] Excluding nova.compute.monitors.cpu monitor virt_driver. Not in the list of enabled monitors (CONF.compute_monitors).17:48
colby_Ive tried variations on the monitor: cpu.virt_driver & just virt_driver. It always gives the same error17:48
colby_Im on pike, Centos, kvm17:49
colby_I have gnocchi running and collecting resource17:49
colby_2018-07-27 17:44:36.110 2800963 INFO nova.filters [req-0e8215e5-e029-4104-8578-a917bf9edddc e28435e0a66740968c523e6376c57f68 18882d9c32ba42aeaa33c4703ad84b2c - default default] Filter MetricsFilter returned 0 hosts17:50
colby_Not sure where the problem is17:50
*** shaohe_feng has quit IRC17:50
colby_weight_setting=compute.node.cpu.percent=-1.017:51
dansmithcolby_: I really can't help you, but I can tell you that metrics have nothing to do with gnocchi/ceilo17:51
colby_ok I thought I read somewhere that it used the gnocchi metrics...17:51
dansmithcolby_: the computes have to be configured to report them in order to use the filter17:51
dansmithnope17:51
*** shaohe_feng has joined #openstack-nova17:52
colby_ok so then the compute_monitors is the issue then17:52
dansmiththe metrics come from libvirt, reported by the compute, used by the filter17:52
colby_ok then Im not sure why Im not getting the metrics17:53
colby_besides the filed driver load17:53
colby_or monitor load I mean17:53
dansmithyeah, I can't really help beyond that17:53
sean-k-mooneydansmith: colby_ if you enable the metric reporting on the compute node ceilometer is able to read them from the message bus and store them but that is a side effect17:54
colby_Ok so does that mean my metrics reporting is working?17:55
sean-k-mooneycolby_: by the way memory bandwidth monitoring is broken on skylake. both the read and write metrics are actually reads...17:56
colby_Im actually just interested in the cpu.percent17:56
colby_I want to not put instances on nodes with high cpu usage. We have a large memory node and the scheduler always puts instances there even when it's way overcommitted on cpu17:57
sean-k-mooneyah well you could just change the order of the weighers to prefer weighing cpus before memory. but i have not used the metric based weigher myself so i have not tried to configure it before17:58
*** penick is now known as OcataGuy17:58
*** OcataGuy is now known as MostlyOcataGuy17:58
colby_ah ok. I tried weight_setting=cpu.percent=-1.017:59
colby_but I got zero hosts returned with metrics filter enabled17:59
*** Sundar_ has quit IRC18:00
colby_I was not aware that changing weigher order made any difference18:00
*** shaohe_feng has quit IRC18:01
colby_I just used: nova.scheduler.weights.all_weighers18:01
colby_I thought it was all just based on multipliers18:02
*** savvas has joined #openstack-nova18:02
*** med_ has quit IRC18:02
sean-k-mooneycolby_: well strictly speaking it does not, but what i meant was listing only the weighers you care about and then setting their multipliers18:02
*** jdillaman has quit IRC18:03
*** shaohe_feng has joined #openstack-nova18:03
sean-k-mooneyif you only care about cpus then you can simply only enable the cpu Weigher18:03
colby_hmm ok18:04
colby_thanks18:04
melwittcolby_: are you specifying compute_monitors= under the [DEFAULT] section of the nova.conf?18:04
colby_yes18:04
colby_but I get the error: Excluding nova.compute.monitors.cpu monitor virt_driver. Not in the list of enabled monitors (CONF.compute_monitors)18:04
melwittokay. the log message you posted earlier is saying it doesn't find the monitor in the list from the conf option. hm18:05
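For reference, compute_monitors is an oslo.config list option, so the value is a plain comma-separated list of monitor names with no brackets or quotes; something along these lines (the exact set of monitor names available depends on the release, and additional monitors such as the memory-bandwidth one can be appended comma-separated if supported):

    [DEFAULT]
    # enable the libvirt CPU monitor used by MetricsFilter/metrics weigher
    compute_monitors = cpu.virt_driver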
*** gbarros_ has joined #openstack-nova18:05
colby_oh wait...there is a typo <smacks head>18:06
sean-k-mooneycolby_: you could probably get a similar effect by setting ram_weight_multiplier=0 or 0.1 so that ram is basically ignored when weighing if that does not work18:08
colby_ok thanks for your help!18:08
*** gbarros has quit IRC18:09
*** gbarros_ has quit IRC18:09
*** gbarros has joined #openstack-nova18:10
*** shaohe_feng has quit IRC18:11
*** gbarros_ has joined #openstack-nova18:12
*** gbarros__ has joined #openstack-nova18:13
*** mriedem1 has joined #openstack-nova18:14
*** mriedem has quit IRC18:14
*** gbarro___ has joined #openstack-nova18:14
*** gbarros has quit IRC18:15
*** shaohe_feng has joined #openstack-nova18:15
*** gbarros has joined #openstack-nova18:15
*** harlowja has joined #openstack-nova18:15
*** gbarros_ has quit IRC18:16
*** gbarros__ has quit IRC18:18
*** sridharg has quit IRC18:18
*** gbarro___ has quit IRC18:18
sean-k-mooneymelwitt: mriedem1 https://review.openstack.org/#/c/586568/ hit the retry_limit issue after your last recheck. is that issue(retry_limit) still happening in the gate18:19
melwittI think it's been fixed18:19
sean-k-mooneywell there is no gate job for that patch at the moment. will i retry it?18:20
melwittyeah, go ahead. I didn't realize that one hadn't been rechecked18:21
*** shaohe_feng has quit IRC18:21
sean-k-mooneymelwitt: it had. you did it at 5:14 but it hit the error again18:22
sean-k-mooneyyou probably missed the fix by a few minutes18:22
melwittyeah, guh18:22
mriedem1dansmith: danicus, i have good pleasurable news18:22
*** mriedem1 is now known as mriedem18:22
dansmithum18:22
*** shaohe_feng has joined #openstack-nova18:22
mriedembug 1784022 isn't a problem18:23
openstackbug 1784022 in OpenStack Compute (nova) queens "Failed forced live migration does not rollback doubled up allocations in placement" [High,Triaged] https://launchpad.net/bugs/178402218:23
mriedemit's handled18:23
dansmithoh yeah?18:23
dansmiththat is indeed pleasurable18:23
melwittdansmith: wanna ack this? https://review.openstack.org/58661418:24
dansmithyup18:25
*** artom has quit IRC18:26
melwitthooray18:26
melwittdangit, missed artom again. I had wanted to ask him about https://bugs.launchpad.net/nova/+bug/170843318:27
openstackLaunchpad bug 1708433 in OpenStack Compute (nova) "Attaching sriov nic VM fail with keyError pci_slot" [Undecided,New]18:27
mriedemdansmith: i'll push up the functional test anyway since it didn't look like we had one, only for the non-forced rollback checks18:28
dansmithokay18:28
dansmithmriedem: did you see my comment above about stephen's set?18:29
dansmithand I pushed up the other fixes to that, btw18:29
dansmithsince you hadn't and seemingly got distracted with this other thing18:29
*** Sundar_ has joined #openstack-nova18:29
dansmithoh I see you did18:29
dansmithcool18:29
*** rmart04 has joined #openstack-nova18:31
*** jmlowe has quit IRC18:31
*** shaohe_feng has quit IRC18:31
mriedemsure did18:33
openstackgerritMatt Riedemann proposed openstack/nova master: Add functional test for forced live migration rollback allocs  https://review.openstack.org/58663618:33
*** shaohe_feng has joined #openstack-nova18:34
mriedemwell, just in time for us to kill the shared storage provider support, i got it passing the ceph job http://logs.openstack.org/63/586363/3/check/legacy-tempest-dsvm-full-devstack-plugin-ceph/569c574/18:35
dansmithpresumably because we're left with broken allocations after a revert or something, but don't check/assert them?18:36
*** artom has joined #openstack-nova18:37
mriedemright tempest won't assert any of that stuff,18:37
mriedemwe do have a post-test hook in the nova-next job for making sure there are no orphaned allocations but only on compute node providers18:37
dansmithwe had some sanity checking and logging in the RT when we removed the healing.. maybe there is some evidence in there?18:37
mriedemoh nvm it's not just computes, it's all resource providers18:38
mriedembut we don't run it on that job18:38
dansmithhttp://logs.openstack.org/63/586363/3/check/legacy-tempest-dsvm-full-devstack-plugin-ceph/569c574/logs/screen-n-cpu.txt.gz#_Jul_27_17_31_35_33725818:39
mriedemyeah i don't see any obvious warnings related to allocations18:40
mriedemi think if we ran our post-test leaked allocation hook on this job it would fail18:40
*** flwang1 has quit IRC18:41
mriedemwell, maybe not for single node18:41
*** shaohe_feng has quit IRC18:42
*** flwang1 has joined #openstack-nova18:42
dansmithyeah, so there are 133 logs of instance fd563ed2-d42c-4dc1-a614-8700c6e6c8fd18:42
dansmithhaving non-cleaned-up allocations18:43
*** shaohe_feng has joined #openstack-nova18:43
dansmithalthough really the allocations that we'd destroy wouldn't be against the compute node,18:43
dansmithand would be gone not stale18:43
dansmithso even your check probably wouldn't catch it18:43
dansmithbecause we'd be _losing_ not _leaking_ disk allocations18:43
dansmithalso, um18:45
dansmithI just noticed that we're logging an entire console log out of privsep somewhere18:45
dansmithhttp://logs.openstack.org/63/586363/3/check/legacy-tempest-dsvm-full-devstack-plugin-ceph/569c574/logs/screen-n-cpu.txt.gz#_Jul_27_18_07_23_55067018:45
dansmithyou could argue that is a security issue if instances log sensitive info to their console18:46
mriedemnice, 9 of those18:46
mriedemyou can open that bug18:46
*** r-daneel_ has joined #openstack-nova18:47
*** r-daneel has quit IRC18:47
*** r-daneel_ is now known as r-daneel18:47
dansmithokay18:47
dansmithdoes privsep daemon log everything over the channel or something?18:48
*** s10 has quit IRC18:48
sean-k-mooneydansmith: that log is because setting a route in the guest failed http://logs.openstack.org/63/586363/3/check/legacy-tempest-dsvm-full-devstack-plugin-ceph/569c574/logs/screen-n-cpu.txt.gz#_Jul_27_18_07_23_55561818:49
dansmithnot sure about that18:50
sean-k-mooneyi think18:50
Sundar_efried: I need to take off for lunch. I'll look for your response in https://review.openstack.org/#/c/577438/. We need to get this discussion to a closure.18:50
dansmithdon't think so, I'm not sure why we'd log the console output in that case18:50
*** Sundar_ has quit IRC18:50
dansmiththe route errors on the console are just there because we're logging it, if that's what you're looking at18:51
*** rmart04 has quit IRC18:51
sean-k-mooneyyes it was, but this looks like the output of dmesg when we are running through cloud-init18:51
sean-k-mooneywell i guess it's the main console log18:52
*** shaohe_feng has quit IRC18:52
dansmithsean-k-mooney: it's the instance console log18:53
*** shaohe_feng has joined #openstack-nova18:53
dansmithwhich is more than dmesg18:53
*** rmart04 has joined #openstack-nova18:53
sean-k-mooneywell its a debug log. i wonder is it related to http://logs.openstack.org/63/586363/3/check/legacy-tempest-dsvm-full-devstack-plugin-ceph/569c574/logs/screen-n-cpu.txt.gz#_Jul_27_18_07_23_54872318:54
dansmithit looks to me like privsep daemon is logging anything sent over the channel,18:54
sean-k-mooneyby that i mean its a debug log so at least it does not do this normally18:54
*** rmart04 has quit IRC18:54
dansmithand since we're using it to do a readpty of the console, it gets logged18:55
dansmithsean-k-mooney: lots of people run with debug on all the time18:55
dansmithhttps://bugs.launchpad.net/nova/+bug/178406218:55
openstackLaunchpad bug 1784062 in OpenStack Compute (nova) "Instance console data is logged at DEBUG" [Undecided,New]18:55
dansmithmelwitt: ^18:55
dansmithI dunno what will be involved in squelching that,18:55
*** gbarros has quit IRC18:55
dansmithbut might be good to fix that before GA, IMHO18:55
melwittgah, moar bugs18:55
melwittyeah, agreed. I'll put it on the RC1 list18:56
sean-k-mooneydansmith: well i know privsep propagates any exceptions back over the unix socket, and any logging within the privsep daemon is redirected to the parent too as far as i know18:57
dansmithI'd like to point to mriedem's statement that we should be finding and fixing critical bugs during this phase instead of rushing on a lot of FFEs18:57
dansmiththe last 24 hours has been pretty ... that.18:57
*** fried_rolls is now known as fried_rice18:59
*** MostlyOcataGuy is now known as penick19:00
mriedemSWEET VALIDATION19:01
*** shaohe_feng has quit IRC19:02
sean-k-mooneydansmith: its coming from this line https://github.com/openstack/oslo.privsep/blob/master/oslo_privsep/daemon.py#L44219:02
dansmithsean-k-mooney: that wouldn't make much sense19:03
dansmithI expect it's the one below, L45519:03
dansmithTestNetworkBasicOps-1426085565] privsep: reply[140593546325360]: (4, '')19:03
sean-k-mooneysorry yes l45519:03
*** shaohe_feng has joined #openstack-nova19:03
dansmithyup19:03
*** r-daneel has quit IRC19:04
melwittthat doesn't look very squelchable19:04
sean-k-mooneyso should we just delete those?19:04
dansmithmelwitt: agree, it's sticky, but .. imagine what else we might be logging when we're running commands as root...19:04
melwittno, I agree. just thinking, how can we stop it19:05
dansmithmelwitt: maybe we recommend squelching privsep DEBUG logs in the levels as a security measure?19:05
dansmithbut still,19:05
dansmithsomething better likely needs doing19:05
sean-k-mooneywe could add a conf option for extra verbose logging to privsep.19:05
dansmithwe control that to some degree in our default levels for libraries,19:05
dansmithassuming the daemon starts with our config19:06
*** gbarros has joined #openstack-nova19:06
sean-k-mooneythings like os-vif plugins create their own privsep daemons19:06
sean-k-mooneyit would be nice to turn that off by default globally19:07
*** gbarros has quit IRC19:09
dansmithdecorating privsep methods as "may return sensitive stuff" would be one way, and let the daemon just not log the result19:11
dansmithfor the DoS case, limiting what we log to 256 chars max or something seems prudent19:11
melwittare you talking about changes to oslo_privsep or nova?19:12
*** shaohe_feng has quit IRC19:12
dansmithwell, the decoration would be both19:12
dansmithwe'd decorate our things, and the daemon code would have to honor it19:13
melwittokay, I see19:13
dansmiththe log length limit would be purely privsep19:13
melwittgotcha19:13
dansmithand our forcing of a log level for our own daemon could maybe be all on our end, but not sure19:13
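A rough Python sketch of the two ideas just floated (purely hypothetical -- oslo.privsep has no such marker or length cap at this point; the attribute and helper names here are invented for illustration):

    import logging

    LOG = logging.getLogger(__name__)

    def sensitive(func):
        # hypothetical marker: tell the privsep daemon that this function's
        # arguments/return value must never be logged
        func._privsep_sensitive = True
        return func

    def _log_reply(func, reply, max_len=256):
        # sketch of what the daemon side could do before logging a reply
        if getattr(func, '_privsep_sensitive', False):
            return  # marked sensitive: skip logging entirely
        # cap the logged size so huge payloads (e.g. console reads) can't
        # flood the logs
        LOG.debug('privsep reply (truncated): %s', repr(reply)[:max_len])

Nova (or os-vif) would decorate things like the console readpty helper, and the daemon would honor the marker; the truncation covers the DoS-style flooding case.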
*** shaohe_feng has joined #openstack-nova19:14
melwittyeah, I was looking for where the default log levels come from and didn't find it yet19:14
dansmithwell, we control them for our libraries you know,19:15
dansmithbut I think the daemon itself is logging this19:15
sean-k-mooneymelwitt: well this is a devstack run so we probably hardcode the log level to debug in the nova conf19:15
melwittthe decorator idea sounds like a good feature but I don't know how hard it would be to coordinate that with oslo in the next week or so19:15
dansmithbut, I assumed it was following our debug=true, so..19:15
*** jaypipes has joined #openstack-nova19:16
dansmithI wonder if we've been doing this since this patch merged...19:16
melwittyeah, I mean how do we configure another library to log at a certain different level19:16
dansmithsurely we'd have heard of it though19:16
dansmiththere's a default log levels thing19:18
sean-k-mooneywell privsep has its own log handler that redirects everything over the unix socket https://github.com/openstack/oslo.privsep/blob/master/oslo_privsep/daemon.py#L14419:18
dansmithhttps://docs.openstack.org/kilo/config-reference/content/list-of-compute-config-options.html19:18
dansmithdefault_log_levels =19:19
*** med_ has joined #openstack-nova19:19
*** med_ has quit IRC19:19
*** med_ has joined #openstack-nova19:19
dansmithdefault contains, for example: oslo.messaging=INFO19:19
dansmithheh, that's kilo, but... :)19:19
melwittoh, never knew about that. cool19:19
dansmithI don't see that we much control the execution of the daemon really,19:21
dansmithso not sure if it even knows what our config is19:21
dansmithor how it knows to have debug on19:22
dansmithbut yeah, if it's being fed into our logger (like sean-k-mooney is suggesting) then setting a level in that config might affect it19:22
*** shaohe_feng has quit IRC19:23
melwitthm, yeah. the example shows all kinds of libraries that aren't openstack things as being affected19:23
sean-k-mooneywell this is what is handling the log message on the nova side of the call https://github.com/openstack/oslo.privsep/blob/master/oslo_privsep/daemon.py#L20619:23
dansmithmelwitt: it has nothing to do with openstack19:24
*** shaohe_feng has joined #openstack-nova19:24
dansmithmelwitt: it's in our config of the root logger,19:24
dansmithwhich any library will ultimately use19:24
*** mgoddard has quit IRC19:24
dansmithit just matters that it's in our process space19:24
*** rtjure has quit IRC19:25
dansmithso the daemon being outside, would be unaffected (unless it's looking at our config), but if it's redirecting all the log traffic over the channel and we have something our side reading that and logging _as_ privsep.daemon in our process,19:25
dansmiththen our root logger config would affect it19:25
melwittokay, I see. thanks for explaining that19:25
sean-k-mooneydansmith: in this case it's even going to work across processes because both the rootwrap and fork clients swap out the logger to redirect it over the socket19:26
dansmithsean-k-mooney: yeah I just said that :)19:26
mriedemso we just need to hard-code oslo.privsep=INFO or something in our default_log_levels yeah for that bug? didn't read all the backscroll19:27
sean-k-mooneyhah yep. i was typing when you did :)19:27
dansmithsean-k-mooney: heh okay19:27
dansmithmriedem: yeah, sounds like it19:27
mriedemeasy peasy19:27
dansmithyup19:27
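For operators who want the same mitigation from config alone (e.g. on a release that predates the fix), the oslo.log option can carry the override; a sketch, assuming you append to rather than replace your release's default list:

    [DEFAULT]
    # keep your release's stock entries and add the privsep one; this stops
    # the privsep daemon's DEBUG request/reply logging (which can include
    # instance console data) from reaching the logs
    default_log_levels = amqp=WARN, amqplib=WARN, sqlalchemy=WARN, oslo.messaging=INFO, oslo.privsep.daemon=INFO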
mriedemmelwitt: don't forget to defer a bunch of these https://blueprints.launchpad.net/nova/rocky19:28
melwittmriedem: right, thanks19:28
mriedemi only see 3 in there that wouldn't be deferred19:28
mriedemmox-removal, versioned notifications and stephen's numa vswitch bp19:28
melwittthanks19:28
sean-k-mooneydansmith: um, could we use a decorator/context manager to also change the config for a specific call?19:29
dansmithsean-k-mooney: not sure I parsed that, but I think we'd not want to override log levels in a context manager19:29
sean-k-mooneybasically im thinking about your previous suggestion of a decorator for the "this is sensitive, never log it" cases19:30
dansmithsean-k-mooney: yep, something intentional for this might be good19:30
*** eharney has quit IRC19:31
melwittsetting the default log level for oslo.privsep is a good mitigation we can do immediately. then we can look at the idea of adding something to oslo.privsep to control this in a better, non-overrideable way (though I guess one could argue if the user really wants to override, they should be able to)19:31
sean-k-mooneythe default log level change is also good but that read tty call probably should never be logged19:31
sean-k-mooneymelwitt: if the user really wants to log it that much they could add a print()19:32
sean-k-mooneyor remove the decorator19:32
sean-k-mooneyits likely that you would only want to do this if you're debugging19:32
*** shaohe_feng has quit IRC19:33
melwittyeah, I just meant to point out it's a consideration. not arguing either way19:33
*** shaohe_feng has joined #openstack-nova19:34
*** rtjure has joined #openstack-nova19:35
sean-k-mooneyya thats true19:36
mriedemthe default_log_levels thing is backportable, which i'm assuming this needs to be19:36
mriedemwe've had privsep in for awhile19:36
dansmithI've been trying to git-review this mofo for a few minutes now19:37
sean-k-mooneywell the default_log_levels can be set in deployment tools so it can be done downstream also even if it was not upstream19:38
openstackgerritDan Smith proposed openstack/nova master: Force oslo.privsep.daemon logging to INFO level  https://review.openstack.org/58664319:38
dansmiththar ^19:38
dansmithwe can check the logs after a run of that and make sure theres no privsep debug noise in there19:39
*** lbragstad_ has joined #openstack-nova19:40
*** lbragstad has quit IRC19:41
*** shaohe_feng has quit IRC19:43
*** shaohe_feng has joined #openstack-nova19:45
*** mchlumsky_ has quit IRC19:47
* mriedem goes to get ma child19:48
*** mchlumsky has joined #openstack-nova19:50
sean-k-mooneydansmith: the only downside to this change is i used to use some of those log messages to debug os-vif plugging stuff, but in hindsight i should have probably questioned why they were there19:51
sean-k-mooneydansmith: that said http://logs.openstack.org/63/586363/3/check/legacy-tempest-dsvm-full-devstack-plugin-ceph/569c574/logs/screen-n-cpu.txt.gz#_Jul_27_17_31_34_22961419:51
dansmithsean-k-mooney: this is just the default, you can still override it in config to turn it on19:51
sean-k-mooneythis is being logged from the privsep daemon but reported as oslo_concurrency19:52
sean-k-mooneydansmith: oh i know, what will we do in the gate?19:52
dansmithwell, we can override this for the gate, it just needs to not be on by default19:53
*** shaohe_feng has quit IRC19:53
dansmithsean-k-mooney: are you sure? that doesn't look like the privsep format19:54
dansmithand processutils would log something like that19:54
dansmithmaybe it's inside the daemon, but running processutils, which is emitting the actual log?19:54
sean-k-mooneydansmith: that code is executed via privsep but that message is not from that log19:54
sean-k-mooneydansmith: yes19:54
dansmithokay I'm confused about what you're saying19:55
sean-k-mooneysorry one sec19:55
*** shaohe_feng has joined #openstack-nova19:56
sean-k-mooneyits basically this https://github.com/openstack/os-vif/blob/master/vif_plug_ovs/linux_net.py#L15519:56
*** lbragstad_ is now known as lbragstad19:56
*** mchlumsky has quit IRC19:56
sean-k-mooneywhich invokes processutils here https://github.com/openstack/os-vif/blob/master/vif_plug_ovs/linux_net.py#L5819:56
sean-k-mooneythe actual privsep request message is printed here http://logs.openstack.org/63/586363/3/check/legacy-tempest-dsvm-full-devstack-plugin-ceph/569c574/logs/screen-n-cpu.txt.gz#_Jul_27_17_31_34_22913919:58
*** mchlumsky has joined #openstack-nova19:58
sean-k-mooneybut any logging the privileged functions do is also relayed to the parent over the socket.19:59
*** pchavva has quit IRC19:59
*** ccamacho1 has joined #openstack-nova20:00
*** mgoddard has joined #openstack-nova20:01
*** ccamacho has quit IRC20:01
sean-k-mooneyanyway lets see if that config option just affects the default log level of oslo.privsep's own internal logging or also the stuff called via a privsep context20:02
*** itlinux has joined #openstack-nova20:03
dansmithokay, I'm still not sure what your concern is20:03
dansmithbut it's probably my friday brain20:03
openstackgerritArtom Lifshitz proposed openstack/nova master: DNM: Extra logs for volume detach device tags cleanup  https://review.openstack.org/58403220:03
*** shaohe_feng has quit IRC20:04
sean-k-mooneywell im hoping that oslo.privsep.daemon=INFO just disables the debug logging for privsep's own debug logs but not debug logs from things called via privsep20:04
dansmithwhy?20:05
*** mgoddard has quit IRC20:05
dansmithit should affect anything that logs with oslo.privsep.daemon, not anything else20:05
*** mchlumsky has quit IRC20:05
*** shaohe_feng has joined #openstack-nova20:05
dansmithif those concurrency logs are logged with a logger name of oslo.concurrency.processutils, then they should be unaffected20:06
dansmithis that what you mean?20:06
sean-k-mooneyyes.20:06
dansmithokay I think we'll be okay on that, assuming it works the way I think it does20:06
dansmithI expect there is some code in privsep that does:20:06
*** weaksauce2 has joined #openstack-nova20:06
*** weaksauce2 has quit IRC20:07
dansmithfor message_logged_in_the_daemon: logger.getLogger(message.log_name).log.$level(message.msg)20:07
*** mchlumsky has joined #openstack-nova20:07
dansmithso my change should only affect actual messages logged on the daemon log name20:07
dansmithnot anything logged in the context of the daemon at all20:07
*** liuyulong__ has quit IRC20:08
dansmithhmm, that code was kindof nonsense, let me try again:20:08
sean-k-mooneyits this code that i was unsure about https://github.com/openstack/oslo.privsep/blob/master/oslo_privsep/daemon.py#L249-L25420:08
itlinuxhello Nova guys, when spinning up a VM, and the hypervisor is asking to pull the image from glance does that go over the storage network? thanks20:08
*** liuyulong__ has joined #openstack-nova20:08
dansmithsean-k-mooney: that's the daemon-side code that intercepts the logs to redirect20:09
sean-k-mooneyyes20:09
dansmithit's the non-daemon code that does the actual logging and would do what I surmised above20:09
*** s10 has joined #openstack-nova20:09
sean-k-mooneywell part of it20:09
sean-k-mooneyanyway we will see soon.20:09
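A simplified Python sketch of the relay being described (not the actual oslo.privsep code; the message fields are assumptions): the daemon ships log records back over the socket and the client re-emits them through the parent process's logging config, which is why the parent's default_log_levels setting can silence them.

    import logging

    def replay_remote_log(message):
        # 'message' is assumed to carry the original logger name, level and text
        name, level, text = message['name'], message['level'], message['msg']
        logger = logging.getLogger(name)   # e.g. 'oslo.privsep.daemon'
        if logger.isEnabledFor(level):
            # re-emitted under the original logger name in the parent process,
            # so the parent's log level config decides whether it gets written
            logger.log(level, text)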
*** errantekarmico has joined #openstack-nova20:14
*** shaohe_feng has quit IRC20:14
*** shaohe_feng has joined #openstack-nova20:15
*** slaweq has joined #openstack-nova20:15
dansmithyup20:15
*** errantekarmico has left #openstack-nova20:16
mnaserthere technically should never be rows with cell_id=NULL in instance_mappings.. right?20:19
dansmithmnaser: mappings have no cell until they're scheduled20:20
mnaserdansmith: right, but yknow, not an instance from march lets say20:21
mnaser:p20:21
dansmiththey should always end up scheduled, to cell0 at least, but they can be there transiently and/or if something fails20:21
mnaseralright so i think i'll have to write something to look in our cell vs cell0 and update mappings to make the db consistent20:22
*** dtruong_ has quit IRC20:23
*** shaohe_feng has quit IRC20:24
*** shaohe_feng has joined #openstack-nova20:25
*** dtruong_ has joined #openstack-nova20:26
*** med_ has quit IRC20:27
*** savvas has quit IRC20:28
*** savvas has joined #openstack-nova20:28
*** savvas has quit IRC20:30
*** savvas has joined #openstack-nova20:30
*** artom has quit IRC20:32
*** shaohe_feng has quit IRC20:34
*** mchlumsky has quit IRC20:37
*** tidwellr has quit IRC20:38
*** slaweq has quit IRC20:40
*** shaohe_feng has joined #openstack-nova20:40
*** felipemonteiro_ has quit IRC20:40
*** felipemonteiro_ has joined #openstack-nova20:40
*** shaohe_feng has quit IRC20:45
*** shaohe_feng has joined #openstack-nova20:45
mriedemmnaser: same issue from last week right?20:46
mriedemcould have been rpc outage so a failed db update20:46
mriedemer db?20:46
mriedemfailed write i mean20:46
*** cdent has quit IRC20:46
mnasermriedem: no, it looks like over the lifetime of our cloud any rpc or db related failures might have accumulated a lot of rows in nova_api with cell_id = NONE20:46
mnaserlike, 20000 worth.20:46
mriedemi had also identified one spot in conductor where the build request will be gone and we don't set the instance mapping to cell020:46
mnaserhowever for 99.9999% of those, they were actually assigned a cell and not buried in cell020:47
mnaserdansmith, mriedem: http://paste.openstack.org/show/726767/ might be a useful little tool if someone ends up in the same situation20:47
mnaserconnect to api db, get all cells, go over them all and check where it can find the instance, and then print out an update statement for manual fix20:48
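The paste itself isn't reproduced here, but a minimal sketch of the audit idea as just described (SQLAlchemy against the nova_api and cell databases; the connection URL is a placeholder, and a real script would also skip rows that still have a build request):

    from sqlalchemy import create_engine, text

    # placeholder connection URL for the nova_api database
    api = create_engine('mysql+pymysql://user:pass@dbhost/nova_api')

    with api.connect() as conn:
        cells = conn.execute(text(
            "SELECT id, database_connection FROM cell_mappings")).fetchall()
        unmapped = conn.execute(text(
            "SELECT instance_uuid FROM instance_mappings "
            "WHERE cell_id IS NULL")).fetchall()

    for (uuid,) in unmapped:
        for cell_id, db_url in cells:
            with create_engine(db_url).connect() as conn:
                hit = conn.execute(text(
                    "SELECT 1 FROM instances WHERE uuid = :u"),
                    {"u": uuid}).first()
            if hit:
                # emit an UPDATE for manual review instead of writing directly
                print("UPDATE instance_mappings SET cell_id = %s "
                      "WHERE instance_uuid = '%s';" % (cell_id, uuid))
                break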
mriedemwe could nova-manage cell_v2 that baby20:48
mnaseri can push up an initial patch but i dunno how much i can iterate/test/etc because i've been a bit overwhelmed20:49
mnaserand it would have to be updated to use nova objects too i guess20:49
mriedemnp, or just report a bug and put this paste in it as a template20:49
mriedemlatter is fine ^20:49
mnasergood idea20:49
mriedemis this finding instances in non-cell0 cells?20:50
mriedemthat aren't in error state?20:50
mnasermriedem: im not sure about the exact logic, but i grab a list of all cells, connect to them, and loop until i find an entry inside 'instances' table with the same id20:50
mnaserif that is logically wrong, i can fix it20:51
mriedemit makes sense20:51
mriedemif the instance mapping doesn't tell what cell it's in, we have to iterate the cells looking for it20:51
mnaserand there is no change it ever being in two cells20:52
mnasers/change/chance/20:52
mriedemis that a question?20:52
mnaseryes20:52
mriedemshouldn't be no20:52
mnaserokay sounds good, because i break off once i find it and stop looping20:52
mriedembut this shouldn't be happening in the first place20:52
mnaseryeah :\ but i dunno how much to blame nova when it might have been an infra problem20:52
mriedemi mean in a normal case we create the instance in the cell here https://github.com/openstack/nova/blob/6be7f7248fb1c2bbb890a0a48a424e205e173c9c/nova/conductor/manager.py#L125720:53
mriedemif the user goes over quota we should put the instance into error state and mark the instance mapping https://github.com/openstack/nova/blob/6be7f7248fb1c2bbb890a0a48a424e205e173c9c/nova/conductor/manager.py#L137020:53
*** david-lyle has joined #openstack-nova20:54
mriedemin a normal case, we update the instance mapping here https://github.com/openstack/nova/blob/6be7f7248fb1c2bbb890a0a48a424e205e173c9c/nova/conductor/manager.py#L132220:54
mnaserin any case -- https://bugs.launchpad.net/nova/+bug/178407420:54
openstackLaunchpad bug 1784074 in OpenStack Compute (nova) "Instances end up with no cell assigned in instance_mappings" [Undecided,New]20:54
mriedembefore deleting the build request and casting to compute20:54
mnaserhmm20:54
mnaseri wonder if i wanna update that script20:54
mnaserto check if a build_request exists20:55
mriedemif anything fails in between there we could fail to update the mapping20:55
mriedemmnaser: maybe - if the build request exists, the instance shouldn't be in a cell20:55
*** manjeets_ has joined #openstack-nova20:55
mriedemso L42 in your script is where i'd look for a build request20:55
mriedemas a sanity check20:55
*** shaohe_feng_ has joined #openstack-nova20:56
mnasermriedem: yeah i was planning to just run the mysql statements up to a certain point and assume the rest was just unscheduled stuff but it could be confusing to hand off to others20:56
*** dklyle_ has quit IRC20:57
mnaseri'm leaning towards checking if a build request exists at L27 so a) i dont hit the cells and b) if a build request exists, technically there shouldn't be an issue because api calls will interact with that build request20:57
*** manjeets has quit IRC20:57
*** anupn_ has quit IRC20:57
mnaseri think the problem is there when a build request AND cell mapping is missing20:57
mnaserbut i believe if build request is there but cell mapping is missing, it'll work just fine and not do any weird 404s on instances20:57
mriedemcorrect20:57
mriedemthis was the case i was worried about last week https://github.com/openstack/nova/blob/6be7f7248fb1c2bbb890a0a48a424e205e173c9c/nova/conductor/manager.py#L124320:58
*** karimull has quit IRC20:58
mriedemin that case, the api has deleted the build request, and we haven't updated the instance mapping20:58
mriedembut, we wouldn't put the instance in cell0 b/c the user deleted the instance before we created it (via build request)20:58
mriedemmnaser: might be nice info to know if these unmapped instances are deleted20:59
melwittone thing that's interesting that I learned recently is that if, for some reason, there is a case where a build request exists but *no* instance mapping exists, the API does not handle it in that, the "instance" will show up in a 'nova list' but it can't be deleted because delete will raise NotFound20:59
*** shaohe_feng has quit IRC20:59
*** shaohe_feng_ is now known as shaohe_feng20:59
mriedemi don't know how that could happen20:59
mriedemwe create the build request and the instance mapping in _provision_instances20:59
mriedem*and request spec21:00
melwittand via code inspection, I don't know how that state could be gotten into other than nova-api restarting at precisely the moment after the build request is created but before the instance mapping was21:00
mriedemhttps://github.com/openstack/nova/blob/6be7f7248fb1c2bbb890a0a48a424e205e173c9c/nova/compute/api.py#L930 and then https://github.com/openstack/nova/blob/6be7f7248fb1c2bbb890a0a48a424e205e173c9c/nova/compute/api.py#L94221:00
*** anupn has joined #openstack-nova21:00
mnasermelwitt: yeah that's essentially the state that these vms are in21:00
mriedemor the db failing the instance mapping insert21:00
*** karimull has joined #openstack-nova21:01
melwittmnaser: I thought you had instance mappings though, right?21:01
melwittyeah, or that21:01
mnasermelwitt: instance_mapping is there sure, but cell_id=NONE21:01
mnaserso some of those are list-able, but not delete-able21:01
melwittyeah, that's different than what I said. your case will let a delete work21:01
mriedemmnaser: are you listing as admin?21:01
melwittoh really?21:01
mriedemto list out deleted instances?21:01
mnasernope, i had a user complain they could list an instance but could not delete it21:02
mriedemi have to think you're hitting this https://github.com/openstack/nova/blob/6be7f7248fb1c2bbb890a0a48a424e205e173c9c/nova/conductor/manager.py#L124321:02
mnaserhell i cant even delete it21:02
melwitthm, okay, that is a new case I didn't know21:02
mnaserlet me dig up the ticket21:02
*** brault_ has quit IRC21:02
melwittI guess what it must do is, get the instance mapping, see cell_id=None and then think "I can't lookup the instance, therefore I can't delete it"21:03
mriedemwell,21:03
* melwitt looks at the code21:03
mnaserok so confirmed here21:03
mriedemit will fallback to trying to lookup the instance from the locally configured (in the api) [database]/connection21:03
mnasernova list --all-tenants | grep 1812c2eb-cfbc-4659-9817-4694ad3d2c37 < returns the instance with ERROR/NOSTATE21:03
mnasernova show 1812c2eb-cfbc-4659-9817-4694ad3d2c37 => ERROR (CommandError): No server with a name or ID of '1812c2eb-cfbc-4659-9817-4694ad3d2c37' exists.21:03
mriedemmnaser: is that instance deleted?21:03
mriedeminstances.deleted != 021:04
mnaserlet me double check21:04
mnaserfwiw though cell_id=NULL21:04
mnaserchecking instances21:04
*** edmondsw has quit IRC21:04
mnaserdeleted=0 but this one is in cell021:05
mriedemmelwitt: this is what i'm thinking of https://github.com/openstack/nova/blob/6be7f7248fb1c2bbb890a0a48a424e205e173c9c/nova/compute/api.py#L176821:05
*** edmondsw has joined #openstack-nova21:05
mriedemmnaser: hmm, ok so the instance was created in cell0 but the instance mapping update failed21:05
mnaserin this case yes21:05
*** yamahata has quit IRC21:05
melwittthat's not what runs for a delete though21:05
*** shaohe_feng has quit IRC21:05
mriedemmelwitt: it has to lookup the instance right?21:05
mriedem_lookup_instance is called via API.get()21:05
mnaseryeah i cant even look it up, it just 404s21:05
melwittyeah but it goes here https://github.com/openstack/nova/blob/6be7f7248fb1c2bbb890a0a48a424e205e173c9c/nova/compute/api.py#L233321:05
*** r-daneel has joined #openstack-nova21:05
*** shaohe_feng has joined #openstack-nova21:06
mnaserlet me check21:06
mnaserit probably doesnt have a build request21:06
mnaserno build request indeed21:07
mnaserhttps://github.com/openstack/nova/blob/6be7f7248fb1c2bbb890a0a48a424e205e173c9c/nova/compute/api.py#L235321:07
mnaserso ending up here afaik21:07
mriedemhow are we listing it then...21:08
mnasermaybe list just hits the cells and ignores api stuff?21:09
mnaseri can help if i knew where the list code is :p21:09
mriedemhttps://github.com/openstack/nova/blob/6be7f7248fb1c2bbb890a0a48a424e205e173c9c/nova/compute/instance_list.py#L9821:09
*** edmondsw has quit IRC21:10
melwitt_lookup_instance is called via API().delete, _get_instance is called via API().get21:10
melwittand the API (nova/api/openstack/compute/servers.py) does a API().get first before doing anything with an instance21:10
mriedemmnaser: you're right, we'll just iterate the cells21:10
mnaseri guess in an ideal world you retrieve list of vms from nova_api, and then generate a subsequent list to each cell with a list of instance uuids to request21:11
mnaserwhich might even eliminate extra calls if a user is located in one cell21:12
melwittso in the case of a build request with a instance mapping with cell_mapping = None, it will return build_request.instance, which I'm not sure what will happen if you try to delete that21:12
mriedemmnaser: that's what this is for https://github.com/openstack/nova/blob/6be7f7248fb1c2bbb890a0a48a424e205e173c9c/nova/compute/instance_list.py#L10121:12
melwittpresumably it fails21:12
mriedemand that's what cern uses21:12
mnaserwouldn't it be safer to only delete the build request once the cell has been set?21:13
melwittso that means build_request.instance gets passed to compute API().delete21:13
mriedemmelwitt: in that case we should go through here https://github.com/openstack/nova/blob/6be7f7248fb1c2bbb890a0a48a424e205e173c9c/nova/compute/api.py#L187721:14
mriedemmnaser: the idea is if the user deletes the build request before the instance has been scheduled to a cell, we never create the instance in the cell,21:15
mriedemso there is nothing to do with the instance mapping b/c it's not in a cell21:15
*** shaohe_feng has quit IRC21:15
mriedemand shouldn't get listed either b/c it's (1) not a build request and (2) not in a cell21:16
*** r-daneel has quit IRC21:16
mnaseryeah so maybe the issue here really inside list?21:16
melwittright, so the delete of the build request would succeed, but then the lookup of the instance will fail because it was just a build_request.instance shell21:16
mriedemwhich if that is really working, we get here in conductor after the build request was deleted in api https://github.com/openstack/nova/blob/6be7f7248fb1c2bbb890a0a48a424e205e173c9c/nova/conductor/manager.py#L124321:16
melwittor well, maybe not. _lookup_instance would return None, None in the cell_mapping = None case21:17
mriedemi wonder why we don't update the instance mapping right after this https://github.com/openstack/nova/blob/6be7f7248fb1c2bbb890a0a48a424e205e173c9c/nova/conductor/manager.py#L125721:17
*** shaohe_feng has joined #openstack-nova21:18
*** yamahata has joined #openstack-nova21:18
mriedemmelwitt: right, if _delete_while_booting returns True, we exit https://github.com/openstack/nova/blob/6be7f7248fb1c2bbb890a0a48a424e205e173c9c/nova/compute/api.py#L187721:18
melwitthm, so I'm not seeing how delete would fail in that case21:19
*** manjeets_ is now known as manjeets21:20
melwittmnaser: is there any chance the service version in one of the records in the 'services' tables is < 15?21:22
mriedemheh, i asked that last week too :)21:22
mnasermelwitt: i checked that with mriedem last time we tried to look into this and no, none21:22
melwittI guess that wouldn't make sense. all of your instance GET would fail in that case21:22
mriedembtw, i think we should probably remove that service version check now21:22
mriedemcommented on the bug https://bugs.launchpad.net/nova/+bug/1784074/comments/121:23
openstackLaunchpad bug 1784074 in OpenStack Compute (nova) "Instances end up with no cell assigned in instance_mappings" [Undecided,New]21:23
mriedemwith what *might* be happening21:23
mriedembut you'd have errors in the logs21:23
melwittthis doesn't make any sense how delete returns 40421:23
mriedemmelwitt: read ^ that comment in the bug because i think that could explain a window where it could happen21:23
*** liuyulong_ has joined #openstack-nova21:24
mriedemmnaser: i wonder if these are instances getting created as part of a multi-create request where they all get created in a cell, then when we go to update mappings, something fails and then the rest are left unmapped21:24
*** liuyulong__ has quit IRC21:24
mriedemthe user attempts to delete the instance, they delete the build request, but then they can still list it,21:24
mriedembut can't delete it b/c the build request is gone and the instance mapping isn't pointing at a cell21:24
mriedemhence your fix up script21:24
melwittohhh21:25
mriedemthis goes back to something we've talked about before where the schedule_and_build_instances method was split into a few phases where it was originally one21:25
*** shaohe_feng has quit IRC21:26
mriedemso now we (1) get hosts from scheduler (2) create instances in cells (3) recheck quota (4) do some other stuff including updating instance mappings and casting to compute to build21:26
*** awaugama has quit IRC21:26
mriedemif anything fails in the loop in #4 we'd have this situation21:26
mnaserthese could be a multi create21:26
mnaserlet me double check21:26
*** shaohe_feng has joined #openstack-nova21:26
mriedemmnaser: you'd have to find the request spec and look that up21:27
melwittyeah, gosh21:27
mnaseri know of a customer that uses this feature all the time21:27
mnaserso it could just be them21:27
mriedemthere should be a num_instances field in the request spec for any of those instances21:27
mnasernope, at least one i randomly picked out is not a multi create21:27
mriedemok, well,21:27
mriedemi think the theory still applies21:27
mriedemif we fail *before* setting the instance mapping but after we've created the instance in the cell, we're toast21:28
mriedemdid we ever figure out if rabbit being down for notifications could screw us up too? because we send notifications before we update the instance mapping...21:29
melwittI don't know21:30
mriedemi'll throw something up quick before i have to head out21:31
mnaserso my audit script helped bring them from 20k down to 308 left which have no build_requests, no cell_id in the mapping21:32
mnaserand not existing in any cells21:32
mriedemmnaser: ok those are likely just instance mappings for deleted and purged instances21:32
mriedemdo you archive/purge the cell dbs often/21:32
mriedem?21:32
mriedemb/c it wasn't until i think rocky that we added instance mapping and reqspec hard delete to nova-manage db archive_deleted_rows when instances are archived21:33
mriedemor maybe you run your own archive/purge script?21:33
mnaserselect created_at from instances order by id asc limit 1; => 2014-12-14 02:38:5321:33
mnaser...ha.21:33
mnaserbut i think i'm mostly waiting for the rocky archive delete stuff21:34
*** shaohe_feng has quit IRC21:36
*** shaohe_feng has joined #openstack-nova21:37
mriedemdo you run your own archive script or nova-manage db archive_deleted_rows?21:41
mnasermriedem: none of the above, we just have a really really really big database21:42
mnasermysql indexing seems fast enough that it hasn't really affected us much other than just.. being a big db.21:42
sean-k-mooneymriedem: fyi i left a comment on the review but is the call to  self.driver.cleanup in https://review.openstack.org/#/c/586568/1 against the source or dest node?21:42
openstackgerritMatt Riedemann proposed openstack/nova master: WIP: Update instance mapping as soon as instance is created in cell  https://review.openstack.org/58671321:44
mriedemmnaser: melwitt: throwing things at the wall ^21:44
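In outline, the reordering that WIP patch is aiming for looks something like this (a simplified sketch, not the patch itself; the surrounding conductor code is omitted):

    # per instance, inside conductor's schedule_and_build_instances (sketch)
    instance.create()                 # instance row now exists in the target cell
    inst_mapping.cell_mapping = cell  # record the cell immediately...
    inst_mapping.save()
    # ...so that a later failure (notifications, BDM/tag creation, quota
    # recheck, deleting the build request, casting to compute) can no longer
    # leave a created-but-unmapped instance behind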
mriedemsean-k-mooney: source21:44
mriedem_post_live_migration and _rollback_live_migration run on the source host21:44
*** liuyulong__ has joined #openstack-nova21:45
mriedemsean-k-mooney: replie21:45
mriedem*replied21:45
sean-k-mooneymriedem: oh ok, then yes it probably should have the source vifs, however i dont think it will actually need them unless we replug the vifs21:45
mriedemthat's not what you said last night21:46
*** rtjure has quit IRC21:46
mriedemsomething something ovs hybrid plug cleanup21:46
mriedembut it was 4am and you were maybe loopy21:46
sean-k-mooneymriedem: for the cleanup21:46
*** shaohe_feng has quit IRC21:46
sean-k-mooneymriedem: self.driver.post_live_migration_at_source should use the old source vifs so it can unplug correctly21:47
*** liuyulong_ has quit IRC21:47
*** shaohe_feng has joined #openstack-nova21:47
mriedemsean-k-mooney: yes, same thing21:48
sean-k-mooneyi dont know what self.driver.cleanup does. if its on the source however it should also probably be using the source vifs21:48
mriedemsean-k-mooney: in the commit message, i pointed out that if post_live_migration_at_source is successful, destroy_vifs=False and the libvirt driver won't try to unplug in cleanup()21:48
mriedemhowever, not all virt drivers adhere to that destroy_vifs flag21:49
mriedemthe hyperv driver doesn't for example21:49
sean-k-mooneyah ok then yes that all looks good then21:49
mriedemit looks...beautiful21:49
sean-k-mooneynormally i like shorter function names but the at_source and at_destination really help keep context in this code21:50
mriedemthat's why i did https://review.openstack.org/#/c/551371/21:51
mriedembecause knowing wtf is going on in the 20 methods involved in live migration is not something you can keep in your head21:51
mriedemalso https://docs.openstack.org/nova/latest/reference/live-migration.html21:52
melwittyes. every time I figure out code like that, a few months later I end up wishing I had added a lot of code comments to it, if nothing else21:52
mriedemyup also https://review.openstack.org/#/c/496861/21:53
melwitttwo thumbs up21:53
mriedemthanks ebert21:54
melwittlooking at your change, trying to remember why the instance.create() was split up from the inst mapping update in the first place21:54
mriedemRIP21:54
mriedemmelwitt: the quota stuff21:54
mriedemi can find a review comment where we talked about the split21:54
melwittyeah, trying to re-remember21:54
sean-k-mooneyya i have that bookmarked, i just didnt see we were still in _post_live_migration. that function does a lot21:54
mriedemtoo much21:54
melwittI think it was something about, if we failed a quota recheck in the middle of a multi create, and to nix all the instances before creating any mappings21:55
melwittbut we ended up not doing that and putting them in ERROR state21:55
sean-k-mooneypart of the issue is it's implementing a state machine and all of that context is mixed in with what it's doing21:55
melwittso that ended up being the wrong thing to do, I think21:55
mriedemmelwitt: https://review.openstack.org/#/c/501408/2/nova/conductor/manager.py@102021:56
*** antosh has quit IRC21:56
mriedemtoo bad i didn't link that irc convo in21:56
*** shaohe_feng has quit IRC21:56
melwittyeah, this is coming back to me. there were other things like, at the time I was thinking don't create the BDMs etc until after we know we're good after the quota recheck21:58
*** shaohe_feng has joined #openstack-nova21:58
melwittbut we discussed on IRC and determined that all had a failure path to clean up anything that was created, and so should have been okay to just do everything normally and check quota at the end21:58
melwittin one loop instead of two21:59
mriedemhttp://eavesdrop.openstack.org/irclogs/%23openstack-nova/%23openstack-nova.2017-09-06.log.html#t2017-09-06T20:33:5121:59
mriedemit was also a refactor we didn't want to backport21:59
melwittright yeah21:59
mriedemi had a todo to combine back to a single loop on my desk for a long time, b/c i had in mind how to do it,22:00
mriedembut long forgot now22:00
sean-k-mooneymriedem: haha i was just looking at the irc logs to see if i could find it for you.22:00
melwittI added it to my todo list too so hopefully one of us will do it this time. I had forgotten about it22:00
*** savvas has quit IRC22:01
*** med_ has joined #openstack-nova22:02
*** med_ has quit IRC22:02
*** med_ has joined #openstack-nova22:02
mriedem"dansmithmriedem:  we wouldn't know where to find the instance record to mark it as  deleted when they deleted the buildreq, so we'd leave that undeleted but  unfindable instance forever"22:02
mriedemheh22:02
mriedemsound familiar?22:03
mriedem"mriedemi shit my pants everytime we touch nova these days"22:04
mriedemha22:04
melwitthaha, relatable22:04
mriedemmnaser: again, congratulations to you to continue running a business on top of stuff we're still talking about fixing almost 1 year later :)22:04
*** itlinux has quit IRC22:05
openstackgerritkarim proposed openstack/nova master: Updated AggregateImagePropertiesIsolation filter illustration  https://review.openstack.org/58631722:05
*** felipemonteiro_ has quit IRC22:06
mriedemi think the tl;dr from the irc convo is just combine the loops and move the quota check to the end22:06
mriedem"locally" deleting the instance will automatically delete the tags and bdms along with the instance from the cell22:06
melwittI'm trying to think, why didn't we move the instance mapping update earlier last time?22:06
melwittyeah, that's what I'm getting from it too, merge the loops and check quota at the end22:07
*** shaohe_feng has quit IRC22:07
*** jmlowe has joined #openstack-nova22:07
*** shaohe_feng has joined #openstack-nova22:07
mriedemidk, my guess is tunnel vision on the fix at hand22:07
melwittwait, that change (last year) *did* move the inst mapping update earlier to right after the instance.create(). looking to see what happened to that22:10
*** itlinux has joined #openstack-nova22:10
*** savvas has joined #openstack-nova22:11
*** rtjure has joined #openstack-nova22:13
mriedembut only if the quota check failed22:14
mriedemb/c we exit after that22:14
mriedemwe don't bury in cell0 if quota check  fails because the instances are already created in cells at that point22:15
*** figleaf is now known as edleafe22:15
melwittI mean this, this is showing an update of the instance mapping right after we create the instance record https://review.openstack.org/#/c/501408/2/nova/conductor/manager.py@100322:16
*** savvas has quit IRC22:16
mriedemoh right yewah22:17
mriedem*yeah22:17
melwittbut in the current version of the code, the instance mapping update isn't right after the instance create anymore22:17
melwittand I can't find how that changed, looking at git blame and failing22:17
*** shaohe_feng has quit IRC22:17
mriedem_populate_instance_mapping was only ever used in the cellsv1 path22:17
mriedemthe build_instances method22:17
mriedemi'm pretty sure22:17
melwittbut in that old patch, it's in schedule_and_build_instances22:17
*** shaohe_feng has joined #openstack-nova22:18
mriedembecause mnaser was re-using it22:19
mriedemyou mean why did we talk him out of that?22:19
melwittno I mean, as of that patch, the instance mapping update was right after instance create, but the current code has the mapping update much later, and I was wondering why that was moved. I assume it was to fix some other bug or something22:20
*** savvas has joined #openstack-nova22:20
mriedemlooks like it was changed as a result of the irc convo22:21
melwittoh gaaaahhh, I didn't realize I was looking at an earlier PS22:21
melwittokay so the final version only added a mapping update to the cleanup method, like you said earlier I think. so the normal path for updating the mapping was always later on22:24
*** sambetts_ has quit IRC22:24
*** savvas has quit IRC22:25
melwittok22:25
*** sambetts_ has joined #openstack-nova22:26
mriedemyup. alright gotta run. o/22:27
melwitto/22:27
*** shaohe_feng has quit IRC22:27
*** shaohe_feng has joined #openstack-nova22:28
*** shaohe_feng has quit IRC22:37
*** savvas has joined #openstack-nova22:38
*** shaohe_feng has joined #openstack-nova22:39
*** avolkov has quit IRC22:40
*** hongbin_ has quit IRC22:42
*** shaohe_feng has quit IRC22:48
*** shaohe_feng has joined #openstack-nova22:48
*** mhg has quit IRC22:53
*** shaohe_feng has quit IRC22:58
*** shaohe_feng has joined #openstack-nova23:01
*** mschuppert has quit IRC23:03
*** gilfoyle_ has quit IRC23:08
*** shaohe_feng has quit IRC23:08
*** harlowja has quit IRC23:09
*** shaohe_feng has joined #openstack-nova23:10
*** shaohe_feng has quit IRC23:18
*** shaohe_feng has joined #openstack-nova23:20
openstackgerritMerged openstack/nova master: Use source vifs when unplugging on source during post live migrate  https://review.openstack.org/58640223:27
openstackgerritMerged openstack/nova master: Pass source vifs to driver.cleanup in _post_live_migration  https://review.openstack.org/58656823:27
openstackgerritMerged openstack/nova master: Update queued-for-delete from the ComputeAPI during deletion/restoration  https://review.openstack.org/56681323:27
*** shaohe_feng has quit IRC23:29
melwittfinally \o/23:30
*** shaohe_feng has joined #openstack-nova23:32
*** itlinux has quit IRC23:37
*** shaohe_feng has quit IRC23:39
*** shaohe_feng has joined #openstack-nova23:40
*** gongysh has joined #openstack-nova23:43
mnaserwell i know its late23:49
mnaserbut now even another whole interesting failure23:49
mnaserno record in nova_api but one in the cell23:49
mnaserlol23:49
*** shaohe_feng has quit IRC23:49
*** shaohe_feng has joined #openstack-nova23:50
*** itlinux has joined #openstack-nova23:50
*** itlinux has quit IRC23:50
*** itlinux has joined #openstack-nova23:51
*** itlinux has quit IRC23:51
*** wolverineav has quit IRC23:53
*** wolverineav has joined #openstack-nova23:54
melwittmnaser: no build request or instance mapping?23:54
mnasermelwitt: build request, no instance mapping23:55
mnaserwait sorry23:55
mnaserit doesnt exist in the cell, sorry23:55
melwittbuild request, instance in cell, no instance mapping23:56
melwittbuild request only?23:56
mnaseryes23:56
mnaserbuild request only23:56
melwittthat's the exact same thing rdo cloud ran into23:56
mnaserso shows up in list but not deletable etc23:56
melwittright23:56
mnaseri guess i can just delete the build request and have it disappear?23:56
melwittdo you have several or just a few? like does it happen a lot?23:56
melwittyes, that's what I told rdo cloud to do too23:57
mnaseri mean after running my fixup script, i still had a few instances that were stuck BUILD/scheduling23:57
melwittI dug around in the code and didn't see a way it can happen other than nova-api going down at the precise moment between the build_request.create() and the instance_mapping.create() or the instance_mapping.create() somehow failing23:57
mnaserso for context it is possible that rpc and/or db both had issues at the time23:57
mnaserdoes the build request and instance_mapping get created at the same time or?23:58
melwittwhich seems it would be crazy rare ... so maybe we're missing some other way it could happen23:58
melwittpretty much yeah. let me grab a link23:58
melwitthttps://github.com/openstack/nova/blob/3e0b17b1e138615b66293976ca5b55c291957844/nova/compute/api.py#L930-L94223:58
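In outline, the window being described sits between two back-to-back creates on the API side (heavily simplified from the _provision_instances code linked above):

    # API side, per requested instance (sketch)
    build_request.create()
    # <- if nova-api dies or the DB write below fails right here, a build
    #    request exists with no instance mapping: it shows up in listings
    #    but lookups/deletes can fail
    inst_mapping = objects.InstanceMapping(context=ctxt)
    inst_mapping.instance_uuid = build_request.instance_uuid
    inst_mapping.project_id = ctxt.project_id
    inst_mapping.cell_mapping = None   # filled in later by conductor
    inst_mapping.create()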
* mnaser is learning so much lol23:59
melwittyeah, soon you can come fix all these bugs23:59
mnaserok that's interesting23:59
mnaserhaha23:59
*** wolverineav has quit IRC23:59
*** shaohe_feng has quit IRC23:59
mnaserso build request was created, instance mapping was *not* created.   unless there was an attempt to delete the instance while it was still in build request23:59
