Thursday, 2022-06-02

02:45 *** mfo is now known as Guest973
02:45 *** mfo_ is now known as mfo
05:28 *** arne_wiebalck_ is now known as arne_wiebalck
07:50 <bauzas> hi nova
07:50 * bauzas is back today
07:51 <gibi> o/
08:56 <bauzas> wow, missed one day and the world went crazy
08:56 <bauzas> was hoping to do reviews this morning, apparently I was wrong :/
08:57 <gibi> which craziness did you observe?
09:08 <bauzas> gibi: just a lot of things arrived in my inbox that require a bit of priority :)
09:09 <bauzas> don't worry, I'm French, I'm used to complaining
09:09 <gibi> :)
09:46 <opendevreview> Balazs Gibizer proposed openstack/nova master: Reject AZ changes during aggregate add / remove host  https://review.opendev.org/c/openstack/nova/+/821423
11:11 <opendevreview> Rajesh Tailor proposed openstack/nova master: Remove unnecessary if condition  https://review.opendev.org/c/openstack/nova/+/844418
11:48 <opendevreview> Rico Lin proposed openstack/nova master: libvirt: Add vIOMMU device to guest  https://review.opendev.org/c/openstack/nova/+/830646
13:41 <opendevreview> Alexey Stupnikov proposed openstack/nova master: Optimize _local_delete calls by compute unit tests  https://review.opendev.org/c/openstack/nova/+/844285
14:24 <opendevreview> Alexey Stupnikov proposed openstack/nova master: Optimize _local_delete calls by compute unit tests  https://review.opendev.org/c/openstack/nova/+/844285
14:55 <opendevreview> Balazs Gibizer proposed openstack/nova master: Unparent PciDeviceSpec from PciAddressSpec  https://review.opendev.org/c/openstack/nova/+/844491
15:20 <dansmith> kashyap: slaweq is seeing a qemu segv on their fedora periodic job.. could you help us examine it and open a bug for the qemu-type folks to look at?
15:21 <kashyap> dansmith: Hiya; sure.
15:21 <dansmith> kashyap: thanks, we're still in a meeting, but I imagine slaweq will be around here with a job link shortly
15:21 <kashyap> dansmith: Got a link for it?  I'm on a call right now, but can look at the errors (I wonder which version of Fedora)
15:21 <kashyap> Sure
15:22 <slaweq> kashyap dansmith here's the failed job https://zuul.openstack.org/build/4a7f284f32eb436da6b5ef59d46e615d/logs
15:22 <slaweq> I know that @gibi was looking briefly into it yesterday
15:23 <slaweq> but the interesting thing is that today this job passed https://zuul.openstack.org/build/4c1f894e55f84447b8b0b0f14c774c89/logs
15:23 <kashyap> So it's a bit intermittent
15:23 <slaweq> kashyap it was failing every day for at least the last week, except today
15:23 <slaweq> this is a periodic job so we can check how it goes tomorrow
15:23 <kashyap> slaweq: Can you link me to the exact error, pls?
15:24 <slaweq> kashyap https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_4a7/periodic/opendev.org/openstack/neutron/master/neutron-ovn-tempest-ovs-master-fedora/4a7f284/controller/logs/libvirt/libvirt/qemu/instance-0000002e_log.txt
15:25 <kashyap> slaweq: Interesting ... I wonder if there's a `coredumpctl list` output, then we can get the crash dumps right away
15:26 <slaweq> I don't think there is anything like that in the job's logs
15:26 <slaweq> all logs are here https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_4a7/periodic/opendev.org/openstack/neutron/master/neutron-ovn-tempest-ovs-master-fedora/4a7f284/controller/logs/index.html
15:27 <kashyap> slaweq: Thanks; looking while on a call
15:27 <dansmith> kashyap: if it's just something we run post-crash we can add that as a one-off
15:27 <slaweq> kashyap sure, I'm on a call now too
15:28 <kashyap> dansmith: What would be good is to capture both: `coredumpctl list | grep qemu`, and then for each QEMU PID log `coredumpctl info $PID`  (I know ... I'm asking too much)
15:29 <kashyap> dansmith: E.g. see the bottom of this page for example output of `coredumpctl info $PID` - https://www.freedesktop.org/software/systemd/man/coredumpctl.html
15:29 <kashyap> The reason I ask is, I've successfully found several root-cause stack traces from it in the past.
15:31 <kashyap> dansmith: slaweq: Ah, scratch the above, we could even just get this post-crash: `coredumpctl -o qemu.coredump dump /usr/bin/qemu-system-x86_64`
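A rough sketch of the inspection kashyap describes above, for a node where systemd-coredump is enabled; the helper below is purely illustrative and not part of any existing job:

    # Illustrative only: list QEMU core dumps known to systemd-coredump and
    # print `coredumpctl info` for each matching PID.
    import subprocess

    def qemu_coredump_report():
        listing = subprocess.run(
            ['coredumpctl', 'list', '--no-pager', '--no-legend'],
            capture_output=True, text=True)
        if listing.returncode != 0:
            return 'no core dumps recorded (or coredumpctl unavailable)'
        report = []
        for line in listing.stdout.splitlines():
            if 'qemu' not in line:
                continue
            # The PID is the first all-digit field on the line (the TIME
            # field contains separators, so it never matches isdigit()).
            pid = next(f for f in line.split() if f.isdigit())
            info = subprocess.run(['coredumpctl', 'info', pid],
                                  capture_output=True, text=True)
            report.append(info.stdout)
        return '\n'.join(report) or 'no qemu core dumps found'

    if __name__ == '__main__':
        print(qemu_coredump_report())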
15:32 <kashyap> slaweq: A quick question: does the instance simply crash when launching it?
15:32 <slaweq> kashyap I think it crashed during snapshotting
15:32 <slaweq> it was spawned properly
15:33 <opendevreview> Alexey Stupnikov proposed openstack/nova master: Optimize _local_delete calls by compute unit tests  https://review.opendev.org/c/openstack/nova/+/844285
15:33 <dansmith> kashyap: ah, running it on a specific pid would be much harder
15:34 * dansmith is catching up
15:34 <dansmith> running it like you describe is something we could hack in as a post job
15:34 <kashyap> dansmith: Nah, we can disregard the per-PID thing
15:34 <kashyap> Yeah, the binary is easier indeed
15:35 <dansmith> okay, after call(s) I can help hack that in if we need, but if it's pretty repeatable it might be easier to just try to repro locally
15:36 <kashyap> dansmith: Yeah, that's the next thing I'm looking at.  It looks like it's not Ceph-based, just plain local storage, IIRC
15:36 <dansmith> cool
15:37 <kashyap> slaweq: I'm just trying to find the precise test trigger.  From looking at the 'n-cpu' log, snapshots seem to happen just fine.  I'll look more after I'm done w/ this call
15:38 <slaweq> kashyap but IIUC the nova logs, the instance is gone during the snapshotting process
15:38 <slaweq> please take your time, it's not urgent for us for sure
15:38 <dansmith> kashyap: it's test_create_backup
15:38 <dansmith> https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_4a7/periodic/opendev.org/openstack/neutron/master/neutron-ovn-tempest-ovs-master-fedora/4a7f284/testr_results.html
15:39 <kashyap> (Yeah, just found it; thx)
15:43 <kashyap> dansmith: Thanks; so the "coredumpctl dump" command I noted above will dump the most recent core dump.  I guess we have to redirect it to a file
15:43 <opendevreview> Merged openstack/osc-placement master: Add Python3 zed unit tests  https://review.opendev.org/c/openstack/osc-placement/+/835369
15:43 <kashyap> dansmith: Err, ignore the above comment; the "-o" is the file.
15:43 <dansmith> kashyap: ack, are you going to try to repro locally first?
15:44 <kashyap> dansmith: On F35, just run the Tempest test, or construct a manual libvirt-based repro?
15:44 <dansmith> if it doesn't repro locally that might be interesting to know as well, like whether it's related to having run a lot of tests first, or whether that one thing always fails in isolation
15:44 <dansmith> kashyap: I just meant devstack, run that one tempest test
15:45 <kashyap> Ah; nod.  I can't today; but I can give it a go tomorrow.
15:45 <dansmith> when I'm done here I can work on adding that as a post task, it just might take a bunch of iterations to get it right (based on experience)
15:45 <dansmith> okay, I'll give it a shot at least when I'm done here
15:47 <kashyap> dansmith: When you say "adding that as a post task" -- I take it you mean adding the above "coredumpctl ... dump", yeah?
15:47 <dansmith> yep
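For reference, a minimal sketch of what such a post task could invoke on the node, assuming systemd-coredump and a writable log directory; the paths below are assumptions for illustration and this is not the actual devstack change:

    # Illustrative post-run collection step: save the most recent QEMU core
    # dump plus its coredumpctl metadata into the job's log directory.
    import subprocess

    QEMU_BINARY = '/usr/bin/qemu-system-x86_64'
    LOG_DIR = '/opt/stack/logs'   # assumed log directory

    def collect_qemu_coredump():
        # Human-readable summary (signal, backtrace, etc.) for the bug report.
        with open(f'{LOG_DIR}/qemu-coredump-info.txt', 'w') as out:
            subprocess.run(['coredumpctl', 'info', QEMU_BINARY],
                           stdout=out, stderr=subprocess.STDOUT, check=False)
        # The raw core file, for loading into gdb later.
        subprocess.run(['coredumpctl', '-o', f'{LOG_DIR}/qemu.coredump',
                        'dump', QEMU_BINARY], check=False)

    if __name__ == '__main__':
        collect_qemu_coredump()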
15:49 <kashyap> I have déjà vu about this test_create_backup test, reading its code
15:56 <kashyap> slaweq: When you get a minute, can you please file an upstream LP bug to track this?  So we can keep all the investigation in one place?
16:01 <dansmith> kashyap: against what, nova?
16:01 <kashyap> Yeah, I'd say so
16:01 <kashyap> dansmith: I just looked at the compressed libvirtd.log
16:01 <kashyap> And I see a familiar libvirt error:
16:01 <kashyap> 2022-06-01 03:35:33.685+0000: 87576: error : qemuMonitorJSONCheckErrorFull:412 : internal error: unable to execute QEMU command 'blockdev-del': Failed to find node with node-name='libvirt-5-storage'
16:02 <kashyap> dansmith: We found the same error earlier this year, and I recall working w/ libvirt folks to get a fix.  But that was in a different context: https://listman.redhat.com/archives/libvir-list/2022-February/msg00790.html
16:03 <kashyap> dansmith: slaweq: The root TripleO bug (it actually should've been filed against Nova) where we did the analysis is this one: https://bugs.launchpad.net/tripleo/+bug/1959014
16:03 <kashyap> If you open that last link, scroll from the bottom for more signal
16:54 <dansmith> kashyap: yeah I remember that one.. so to be clear, you expect this is a different issue, right?
17:09 <dansmith> kashyap: this is running, we'll see: https://review.opendev.org/c/openstack/devstack/+/844503
17:10 <ricolin> bauzas: I think https://review.opendev.org/c/openstack/nova/+/830646 is ready for review now, could you kindly remove the -2
17:11 <bauzas> ricolin: sure, lemme look
17:12 <ricolin> bauzas: thanks :)
17:12 <bauzas> ricolin: oh, yeah you created the bp and the spec, ta
17:12 <ricolin> bauzas: yeah, the spec merged :)
17:13 <ricolin> and I updated the implementation patch accordingly
17:13 <ricolin> I think :)
17:20 <opendevreview> Rico Lin proposed openstack/nova master: Add traits for viommu model  https://review.opendev.org/c/openstack/nova/+/844507
17:21 <opendevreview> Artom Lifshitz proposed openstack/nova stable/ussuri: fake: Ensure need_legacy_block_device_info returns False  https://review.opendev.org/c/openstack/nova/+/843950
17:21 <opendevreview> Artom Lifshitz proposed openstack/nova stable/ussuri: Add a regression test for bug 1939545  https://review.opendev.org/c/openstack/nova/+/843951
17:21 <opendevreview> Artom Lifshitz proposed openstack/nova stable/ussuri: compute: Ensure updates to bdms during pre_live_migration are saved  https://review.opendev.org/c/openstack/nova/+/843952
17:24 <ricolin> sean-k-mooney: this should be the last piece of the libvirt-viommu-device implementation, but as I'm not familiar with traits, can you review it and let me know if I did it right/wrong
17:24 <ricolin> https://review.opendev.org/c/openstack/nova/+/844507
17:41 <sean-k-mooney> sure
17:41 <sean-k-mooney> just so you are aware, the unit test will fail until the trait is merged and released
17:41 <sean-k-mooney> but the tempest test should be able to pass because Depends-On works for devstack jobs
17:41 <sean-k-mooney> but not for tox jobs
17:42 <sean-k-mooney> so if you see the tox py38 job fail, that will be why
17:42 <sean-k-mooney> assuming your tests are otherwise correct :)
17:45 <sean-k-mooney> ricolin: the patch is definitely not correct, but I'll comment inline
17:46 <sean-k-mooney> ricolin: libvirt is never going to report an iommu model of auto or none
17:46 <sean-k-mooney> so you need to actually see what is reported from the domain caps API
17:47 <sean-k-mooney> by doing virsh domcapabilities --machine q35 --arch x86_64
17:50 <sean-k-mooney> ricolin: but looking at that, this is not something that is reported in that API
17:50 <sean-k-mooney> so instead of looking at the domcapabilities API you need to report the traits based on the libvirt version number
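In other words, the trait becomes a pure function of the libvirt version. A minimal sketch of that idea follows; it is not the code under review in 844507, and the trait name and minimum version below are assumptions for illustration:

    # Illustrative only: report a vIOMMU-related trait purely from the libvirt
    # version. libvirt's getLibVersion() encodes the version as
    # major * 1_000_000 + minor * 1_000 + release.
    MIN_LIBVIRT_VIOMMU = (8, 3, 0)                  # assumed minimum version
    VIOMMU_TRAIT = 'COMPUTE_VIOMMU_MODEL_INTEL'     # assumed trait name

    def decode_libvirt_version(encoded):
        """Turn getLibVersion()'s integer into a (major, minor, release) tuple."""
        return (encoded // 1_000_000, (encoded // 1_000) % 1_000, encoded % 1_000)

    def viommu_traits(version_tuple):
        """Return the traits to report for this libvirt version."""
        return [VIOMMU_TRAIT] if version_tuple >= MIN_LIBVIRT_VIOMMU else []

    print(viommu_traits(decode_libvirt_version(8_004_000)))  # ['COMPUTE_VIOMMU_MODEL_INTEL']
    print(viommu_traits(decode_libvirt_version(7_000_000)))  # []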
18:22 <melwitt> artom, sean-k-mooney: dunno if y'all have seen this related preserve_on_delete bug from a few years ago https://bugs.launchpad.net/nova/+bug/1834463
18:43 <opendevreview> Merged openstack/nova stable/ussuri: [stable-only] Make sdk broken job non voting until it is fixed  https://review.opendev.org/c/openstack/nova/+/844309
18:52 <ricolin> sean-k-mooney: so I need to check the libvirt version before I put iommu in the devices for fakelibvirt, right?
19:23 <artom> melwitt, hrmm, good find
19:43 <melwitt> artom: I looked through the code and saw that _heal_instance_info_cache preserves the existing value of preserve_on_delete. tried it out on devstack (created a server with nova creating the port, changed the value of preserve_on_delete to true in the database, saw _heal_instance_info_cache run a number of times, then detached the port) and it did not delete the port
19:49 <melwitt> I'm realizing the scenario in the above bug is different. they're saying they removed an interface by a manual database update, then nova added it back without (obviously) the original value of preserve_on_delete. I guess they are saying if they detach the port and then reattach it, they don't get the same value of preserve_on_delete. a bit different issue
19:56 <melwitt> although, they should get the same value because if they reattach the port, nova won't consider it to be created by nova and thus should set preserve_on_delete = True
20:00 <sean-k-mooney> melwitt: they were updating the db
20:01 <sean-k-mooney> so really all bets are off at that point
20:01 <melwitt> just tried reattach and it indeed has preserve_on_delete = true. that means the bug report is very specifically the case where the interface gets removed from the info cache not via the API and then _heal_instance_info_cache runs. I don't know how that could happen during normal operation (no manual db update)
20:01 <sean-k-mooney> so their case does not make sense
20:01 <sean-k-mooney> well
20:02 <melwitt> yeah, I assumed they did the manual update to simplify a real world case but without any more data, I don't know how that case can happen
20:02 <sean-k-mooney> they booted with nova creating a nic
20:02 <sean-k-mooney> they somehow detached it without it getting deleted
20:02 <sean-k-mooney> and then reattached it
20:02 <sean-k-mooney> so with the undocumented behavior, when it got detached it should have gotten deleted
20:02 <sean-k-mooney> so there is no port to reattach
20:02 <melwitt> no, in their report they say they called server create with port_id passed in
20:03 <melwitt> so that means it begins with preserve_on_delete = true
20:03 <sean-k-mooney> oh, then it should have preserve_on_delete true
20:03 <melwitt> yeah
20:03 <melwitt> no idea how what they describe can happen "in real life"
20:04 <sean-k-mooney> so let's see
20:04 <sean-k-mooney> they simulated the network info cache getting corrupted
20:04 <sean-k-mooney> and then waited for the heal task to fix the info cache
20:04 <sean-k-mooney> and then nova thinks it created the port
20:05 <sean-k-mooney> I guess I can see that happening if we lost the info of how the port was requested
20:05 <sean-k-mooney> so that implies we store that in the info cache only
20:05 <melwitt> yeah, nova just sets the flag to true if it created the port, at port creation time. after that it's cache only
20:06 <sean-k-mooney> well, that's broken
20:06 <sean-k-mooney> I guess we do that for attach
20:06 <sean-k-mooney> too
20:06 <sean-k-mooney> e.g. if we do attach network instead of attach port
20:07 <sean-k-mooney> we probably need to change this to store it in either the virtual_interfaces table or instance_system_metadata if we want to avoid a db migration
20:08 <sean-k-mooney> the initial boot request would be stored in the request spec, but we don't update that on network attach, at least I doubt we do
20:09 <melwitt> it would be nice to save it somewhere... other than instance_info_caches, if that table is apparently fraught with problems
20:11 <sean-k-mooney> well, it's meant to be a cache
20:11 <melwitt> request_spec seems like a good place?
20:12 <sean-k-mooney> as in, we should be able to drop it if we needed to
20:12 <melwitt> fair
20:12 <sean-k-mooney> request_spec is in the api db
20:12 <melwitt> oh right :/
20:12 <sean-k-mooney> so we could update the requested networks in the api, but we would have to wait till after the virt driver finished attaching
20:12 <sean-k-mooney> is this a call or a cast
20:13 <sean-k-mooney> I guess it's a call
20:13 <sean-k-mooney> since it's a 200 response
20:13 <sean-k-mooney> https://docs.openstack.org/api-ref/compute/?expanded=add-network-detail%2Ccreate-interface-detail#create-interface=
20:13 <melwitt> yeah it's a call
20:13 <sean-k-mooney> so we could update the request_spec network_requests list if we really wanted to
20:14 <sean-k-mooney> we just need to make sure to only do it if the call succeeds
20:14 <melwitt> but nova-compute couldn't get to it without an upcall right
20:14 <sean-k-mooney> not via nova-compute; in the api
20:15 <sean-k-mooney> when we wait for the call
20:15 <sean-k-mooney> I think there are better places to store it however
20:15 <melwitt> if nova-compute needs to rebuild the info cache from nothing, like the db row update example in the bug
20:16 <sean-k-mooney> ya it should be able to
20:16 <sean-k-mooney> we have had cases where we lost ports in the cache due to a buggy neutron backend or neutron policy issues
20:16 <sean-k-mooney> e.g. where neutron returned an empty port list
20:16 <melwitt> nova-compute can't read it from request_specs without it being an upcall. am I missing something?
20:17 <sean-k-mooney> the heal logic will recreate the info cache entries from the neutron data if that happens
20:17 <sean-k-mooney> melwitt: correct, it can't
20:17 <melwitt> so storing it in request spec doesn't help afaict
20:17 <sean-k-mooney> not really, no
20:17 <sean-k-mooney> https://github.com/openstack/nova/blob/master/nova/db/main/models.py#L784=
20:18 <sean-k-mooney> the virtual_interfaces table should store it, but it has no field we can abuse to store it without a db change
20:18 <melwitt> just saying it sounded like a good place to store it initially but if nova-compute can't read it, it doesn't solve this issue
20:18 <sean-k-mooney> instance_system_metadata can store it since it's just a set of key/value pairs
20:19 <sean-k-mooney> and that's in the cell db
20:19 <sean-k-mooney> so that is probably where I would stash it
20:19 <melwitt> yeah, that would work
20:20 <sean-k-mooney> so we just have the key be <neutron port uuid>_preserve_on_delete
20:20 <sean-k-mooney> or store the list as a single key
20:20 <sean-k-mooney> that is probably better since it's indexed by the instance_id anyway
20:20 <sean-k-mooney> it technically denormalises the db
20:21 <sean-k-mooney> but a preserve_on_delete_list key that we look up with "select preserve_on_delete_list from instance_system_metadata where instance_id = xyz"
20:22 <sean-k-mooney> is much simpler to look up
20:22 <sean-k-mooney> but either would work
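A rough sketch of the two key layouts being discussed, using plain dicts to stand in for instance_system_metadata rows; the key names are just the ones suggested above, the port UUIDs are made up, and nothing in nova currently works this way:

    # Illustrative only: two ways to stash preserve_on_delete in
    # instance_system_metadata-style key/value pairs.
    import json

    ports = {
        '11111111-aaaa-4bbb-8ccc-222222222222': True,   # user-supplied port
        '33333333-dddd-4eee-8fff-444444444444': False,  # nova-created port
    }

    # Option 1: one key per port, "<neutron port uuid>_preserve_on_delete".
    per_port_keys = {
        f'{port_id}_preserve_on_delete': str(flag)
        for port_id, flag in ports.items()
    }

    # Option 2: a single preserve_on_delete_list key holding the ports to keep,
    # serialized as JSON since metadata values are plain strings.
    single_key = {
        'preserve_on_delete_list': json.dumps(
            [port_id for port_id, flag in ports.items() if flag])
    }

    def preserve_on_delete(metadata, port_id):
        """Look up a port's flag from the single-key layout."""
        return port_id in json.loads(metadata.get('preserve_on_delete_list', '[]'))

    print(per_port_keys)
    print(preserve_on_delete(single_key, '11111111-aaaa-4bbb-8ccc-222222222222'))  # True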
21:10 <opendevreview> Merged openstack/placement stable/ussuri: Use 'functional-without-sample-db-tests' tox env for placement nova job  https://review.opendev.org/c/openstack/placement/+/840773
22:40 <opendevreview> melanie witt proposed openstack/nova stable/train: DNM Testing for ceph setup gate fail  https://review.opendev.org/c/openstack/nova/+/844530
