14:00:08 #startmeeting tripleo
14:00:08 #topic agenda
14:00:08 * Review past action items
14:00:08 * One off agenda items
14:00:08 * Squad status
14:00:08 * Bugs & Blueprints
14:00:08 * Projects releases or stable backports
14:00:08 * Specs
14:00:08 * open discussion
14:00:09 Meeting started Tue Oct 16 14:00:08 2018 UTC and is due to finish in 60 minutes. The chair is mwhahaha. Information about MeetBot at http://wiki.debian.org/MeetBot.
14:00:10 Anyone can use the #link, #action and #info commands, not just the moderator!
14:00:10 Hi everyone! who is around today?
14:00:11 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
14:00:13 The meeting name has been set to 'tripleo'
14:00:32 abishop: https://bugs.launchpad.net/tripleo/+bug/1797918
14:00:32 Launchpad bug 1797918 in tripleo "teclient returns failures when attempting to provision a stack in rdo-cloud" [Critical,Triaged] - Assigned to Marios Andreou (marios-b)
14:00:35 o/
14:00:41 o/
14:00:41 o/
14:00:43 o/
14:00:46 you're a bit early mwhahaha :)
14:00:53 abishop, wait for the rdo-cloud status update
14:00:55 o/
14:00:56 0/
14:01:00 am not
14:01:03 i'm 8 seconds late
14:01:09 Meeting started Tue Oct 16 14:00:08 2018
14:01:10 o/
14:01:12 o/
14:01:18 o/
14:01:23 weshay, marios|rover: ack, thx!
14:01:26 o/
14:01:47 mwhahaha: ah, maybe my server is a bit late? thanks ntpd -.-
14:02:06 maybe you should be running chrony
14:02:07 :D
14:02:14 :]
14:02:17 easy target ;)
14:02:35 o/
14:02:48 o/
14:03:13 o/
14:04:02 alright let's do this
14:04:09 o/
14:04:10 #topic review past action items
14:04:10 None!
14:04:29 since there were no action items let's move on
14:04:33 #topic one off agenda items
14:04:33 #link https://etherpad.openstack.org/p/tripleo-meeting-items
14:04:44 (bogdando) replace ovb jobs with multinode plus HW-provisioning done on fake QEMU VMs locally on undercloud, then switched to predeployed servers
14:04:46 hi
14:04:47 o/
14:05:00 I hope the topic is self explaining?
14:05:08 o/
14:05:13 not really
14:05:28 where would these multinode jobs run?
14:05:33 ok, it's about removing network communications while testing HW prov
14:06:17 once we are sure qemu vms can be introspected and provisioned, we switch to deployed servers and continue the multinode setup, instead of ovb
14:06:35 so qemu are fakes to be thrown away
14:06:56 Sorin Sbarnea proposed openstack/tripleo-quickstart-extras master: Run DLRN gate role only if to_build is true https://review.openstack.org/610728
14:06:56 the expectation is to see only ironic et al bits in action and having 0 interference from L2 networking
14:07:00 so would this be an additional step to be added to like the undercloud job?
14:07:07 this removes coverage for the non-deployed-servers flow, right?
14:07:18 dtantsur: not sure, thus asking/proposing
14:07:32 btw your idea is quite close to what derekh is doing for his ironic-in-overcloud CI job
14:07:42 do we really test something of the Ironic stack after the nodes are introspected?
14:07:50 interesting
14:07:52 yes
14:07:57 tear down at least?
14:07:58 we need to make sure the images are good
14:08:02 I see
14:08:03 and that the provisioning aspect works
14:08:15 ok, could we provision those QEMU vms then?
14:08:18 so if you just wanted to test introspection early in the undercloud job, that might be ok
14:08:34 but we need ovb because it provides actual provisioning coverage (and coverage for our image building)
14:08:35 bogdando: this is what devstack does. nested virt is slooooooooooooooooo
14:08:39 ooooooooooooooooo
14:08:43 you got the idea :)
14:08:47 * mwhahaha hands dtantsur an ow
14:09:03 yay, here it is: ooow! thanks mwhahaha
14:09:06 well yes, but it hopefully takes not so much just to provision a node? got stats?
14:09:34 bogdando: the ironic-in-overcloud job can be seen here https://review.openstack.org/#/c/582294/ see tripleo-ci-centos-7-scenario012-multinode-oooq-container
14:09:40 no hard data, but expect a 2-3 times slowdown of roughly everything
14:09:59 given our memory constraints i'm not sure we have enough to do the undercloud + some tiny vms for provisioning
14:10:15 Michele Baldessari proposed openstack/puppet-tripleo master: Fix ceph-nfs duplicate property https://review.openstack.org/609599
14:10:17 URGENT TRIPLEO TASKS NEED ATTENTION
14:10:18 https://bugs.launchpad.net/tripleo/+bug/1797600
14:10:19 https://bugs.launchpad.net/tripleo/+bug/1797838
14:10:19 https://bugs.launchpad.net/tripleo/+bug/1797918
14:10:19 Launchpad bug 1797600 in tripleo "Fixed interval looping call 'nova.virt.ironic.driver.IronicDriver._wait_for_active' failed: InstanceDeployFailure: Failed to set node power state to power on." [Critical,Incomplete]
14:10:20 Launchpad bug 1797838 in tripleo "openstack-tox-linters job is failing for the tripleo-quickstart-extras master gate check" [Critical,In progress] - Assigned to Sorin Sbarnea (ssbarnea)
14:10:21 Launchpad bug 1797918 in tripleo "teclient returns failures when attempting to provision a stack in rdo-cloud" [Critical,Triaged] - Assigned to Marios Andreou (marios-b)
14:10:24 Would be nice to see it in real CI job logs, how long it takes
14:10:31 tiny?
14:10:39 in real ovb jobs
14:10:41 our IPA images require 2G of RAM just to run...
14:11:01 right we can't do that on our upstream jobs
14:11:50 anyway, once the provisioning is done the idea was to throw the vms away and switch to multinode as usual
14:11:51 Michele Baldessari proposed openstack/puppet-tripleo master: WIP Fix ipv6 addrlabel for ceph-nfs https://review.openstack.org/610987
14:12:16 maybe run some sanity checks before that...
14:12:41 this removes the ability to check that our overcloud-full images 1. are usable, 2. can be deployed by ironic
14:12:48 o/
14:12:59 (this is assuming that ironic does its job correctly, which is verified in our CI.. modulo local boot)
14:13:01 o/
14:13:11 dtantsur: ;-( for 1, but not sure for 2
14:13:31 bogdando: ironic has some requirements for images. e.g. since we default to local boot, we need grub2 there.
14:13:41 and if/when we switch to uefi, fun will increase
14:13:42 I thought that 2 will be covered by Ironic provisioning those images on qemu
14:13:50 hrm.. need scenario12 doc'd https://github.com/openstack/tripleo-heat-templates/blob/master/README.rst#service-testing-matrix
14:14:01 weshay: will do
14:14:17 thank you sir /me looks at code
14:14:24 ok, how is that idea different to what derekh does?
14:15:08 bogdando: derekh only needs to verify that ironic boots stuff. he does not need to verify the final result.
14:15:28 also, an alternative may be to convert those qemu into images for the host cloud
14:15:36 to continue with multi-node
14:15:56 fyi.. anyone else.. scenario12 defined here https://review.openstack.org/#/c/579603/
14:15:59 the take away is we avoided L2 dances :)
14:16:05 bogdando: what's the problem you're trying to solve?
14:16:18 these L2 dances you're referring to?
14:16:20 avoiding networking issues
14:16:34 i think we need to fix those and not just work around them in ci
14:16:44 ++
14:16:46 the main thing to solve is https://bugs.launchpad.net/tripleo/+bug/1797526
14:16:46 Launchpad bug 1797526 in tripleo "Failed to get power state for node FS01/02" [Critical,Triaged]
14:16:48 we need proper coverage for this
14:16:54 no one knows how to solve it, it seems
14:17:07 and a bunch of similar issues hitting us periodically
14:17:09 this also seems like a good thing to add some resilience to
14:17:17 that's bad, but won't we just move the problems to a different part?
14:17:41 I can tell you that nested virt does give us a hard time around networking from time to time
14:17:51 e.g. ubuntu updated their ipxe ROM - and we're broken :(
14:17:54 noting that it's not easy to delineate the issues we're seeing in 3rd party atm..
14:18:03 I do not know tbh, if solving network issues is the domain of tripleo
14:18:08 introspection errors can be caused by rdo-cloud networking
14:18:11 the base OS rather
14:18:15 and infrastructure
14:18:16 so it seems like it would be better to invest in improved logging/error correction in the introspection process
14:18:17 not tripleo
14:18:19 it's a very unstable env atm
14:18:36 is this something that a retry would resolve? Like ironic.conf could be tuned for this a bit perhaps?
14:18:37 ya.. we have a patch to tcpdump during introspection
14:18:37 rather than coming up with a complex ci job to just skip it
14:18:42 I do not mean control/data plane network ofc
14:18:56 but provisioning and hardware layers
14:19:29 bogdando: OVB is an openstack cloud. if we cannot make it reliable enough for us.. we're in trouble
14:19:33 if integration testing may be done w/o real networking involved, only virtual SDNs, maybe we can accept that? dunno
14:19:49 but the problems in CI (at least the ones I've looked at) were where ipmi to the bmc wasn't working over a longer term, retries ain't going to help
14:19:49 just not sure if that would be moving the problems to a different part
14:20:30 dtantsur, agree.. there is work being done to address it. apevec has some details there however right now I agree it's unstable and painful
14:20:30 what's the problem with IPMI? the BMC is unstable? using UDP?
14:20:35 Can you make periodic jobs leave envs around on failure so they can be debugged ?
14:20:37 dtantsur: it is not related to the testing patches
14:20:38 and it's causing a lot of false negatives
14:20:46 if the former, we can fix it. if the latter, we can switch to redfish (http based)
14:20:53 only produces irrelevant noise and tons of grey failures
14:20:58 derekh, you can recreate those w/ the reproducer scripts
14:21:05 weshay: ++
14:21:06 in logs/reproduce-foo.sh
14:21:12 for false negatives
14:21:15 derekh: I'd like to know more on this. If it's a long term connectivity issue.. Then that is our problem right?
14:21:24 weshay: yes, but we were not able to on friday
14:21:34 basically, 99% of errors one can observe in those integration builds had nothing to do with the subject of testing
14:22:09 (taking numbers out of thin air)
14:22:16 John Fulton proposed openstack/tripleo-common master: Remove ceph-ansible Mistral workflow https://review.openstack.org/604783
14:22:29 are we sure IPMI is always to blame in these 99% of made up cases? :)
14:22:32 derekh, k k.. not sure what you were hitting it's hit or miss w/ the cloud atm but also happy to help. derekh probably getting w/ panda locally will help too
14:22:33 It looks to me like either 1. ipmi traffic from the undercloud to the bmc node was disrupted or 2. traffic from the BMC to the rdo-cloud api was disrupted (perhaps because of DNS outages)
14:22:49 ya
14:23:26 mwhahaha: so not sure for "we need to fix those"
14:23:29 :)
14:23:41 the next topic in the agenda is related
14:23:47 basically rdo-cloud status
14:24:00 k let's dive into the next topic then
14:24:00 RDO-Cloud and Third Party TripleO CI
14:24:06 weshay: is that yours
14:24:19 ya.. just basically summarizing what most of you already know
14:24:33 we have an unstable env for 3rd party atm.. and it's causing pain
14:24:53 the rdo team has built a new ovs package based on osp and we're waiting on ops to install
14:25:03 should we remove it from check until we can resolve stability?
14:25:10 nhicher, is also experimenting with other 3rd party clouds
14:25:13 Derek Higgins proposed openstack/tripleo-heat-templates master: Add scenario 012 - overlcoud baremetal+ansible-ml2 https://review.openstack.org/579603
14:25:24 mwhahaha: if it's not gating, you can ignore it even if it's voting
14:25:24 so we can actually devote efforts to troubleshooting rather than continue to add useless load and possibly hide issues?
14:25:34 dtantsur: i mean not even run it on patches
14:25:35 mwhahaha, I think that's on the table for sure.. however I can't say I'm very comfortable with that idea
14:25:37 ah
14:25:56 if it's 25% what's the point of even running it other than wasting resources
14:26:00 well, that means not touching anything related to ironic until it gets fixed
14:26:03 (which may be fine)
14:26:12 testing and loading the cloud?? ya .. it's a fair question
14:26:28 mwhahaha, we still need rdo-cloud for master promotions and other branches
14:26:33 so the cloud NEEDS to work
14:26:48 I've had luck on the weekends
14:27:00 if it's fine on the weekends that points to load related issues
14:27:08 fine is relative
14:27:10 in the meantime merge the previously mentioned scenario012 and use it to verify ironic in the overcloud instead of the undercloud
14:27:11 better maybe
14:28:13 weshay: do we know what has changed in the cloud itself since the stability issues started?
14:28:45 mwhahaha, personally I'd like to see the admins, tripleo folks, and ironic folks chat for a minute and see what we can do to improve the situation there
14:28:56 Hi, What about Undercloud OS upgrade?
14:29:03 mwhahaha, I have 0 insight into any changes
14:29:23 I'm ready to join such a chat, even though I'm not sure we can do much on the ironic side
14:29:35 better lines of communication between tripleo, ops, and ironic, ci etc.. are needed..
14:29:35 (well, retries for IPMI are certainly an option)
14:30:05 dtantsur, agree however being defensive and getting the steps in ci that prove it's NOT ironic would be helpful I think
14:30:20 it's too easy for folks to just think.. oh look.. introspection failed
14:30:22 imho
14:30:39 unstable IPMI can be hard to detect outside of ironic
14:30:40 when really it's the networking or other infra issue
14:30:54 as I said, if UDP is a problem, we have TCP-based protocols to consider
14:31:11 ah.. interesting
14:31:14 maybe add a tcpdump log from running tcpdump in the background ?
14:31:23 panda, ya.. there is a patch up to do that
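For context on the background capture being discussed here, a minimal sketch of what such a capture could look like on the undercloud. This is not the actual tripleo-quickstart-extras patch; the interface name, filter and output path are assumptions for illustration only.

    # Capture IPMI (UDP 623) plus DHCP/TFTP traffic on the provisioning
    # interface while introspection runs, for later inspection.
    # "br-ctlplane" and the pcap path are placeholders, not taken from the patch.
    sudo tcpdump -i br-ctlplane -nn 'udp port 623 or port 67 or port 68 or port 69' \
        -w /var/log/extra/introspection-ipmi.pcap &
    TCPDUMP_PID=$!

    # ... run introspection in the meantime, e.g.:
    # openstack overcloud node introspect --all-manageable --provide

    # Stop the capture once introspection finishes.
    sudo kill "$TCPDUMP_PID"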
14:31:27 we added that already
14:31:28 rascasoft, already added that code :)
14:31:40 > as I said, if UDP is a problem, we have TCP-based protocols to consider
14:31:40 good idea to explore
14:31:49 along with a tcpdump on the BMC node
14:31:51 I can never have an original idea :(
14:31:57 lolz
14:32:07 would the IPMI retries go into the CI code or the ironic code?
14:32:20 actually I used it to debug a previous ironic problem so it should fit this issue too
14:32:29 we had retries at one point
14:32:45 we have retries in ironic. we can have moar of them for the CI.
14:33:02 rascasoft: paste me the link to that change in pvt please
14:33:21 panda, it's merged into extras
14:33:34 lol
14:33:49 dtantsur: are they on by default?
14:33:54 or is it tunable in a way?
14:33:58 (and are we doing that)
14:33:59 panda, https://github.com/openstack/tripleo-quickstart-extras/commit/78c60d2102f5e22c3abb30bb0c8179d4c999829c
14:34:03 mwhahaha: yes, yes, dunno
14:34:35 I like the idea of not using udp if possible
14:34:41 dtantsur: what's involved with switching to TCP-based protocols - can we discuss after the meeting? we can try it
14:35:00 and maybe ironic erroring out with "nodes unreachable, please check network"
14:35:01 but https://github.com/openstack/tripleo-quickstart-extras/commit/78c60d2102f5e22c3abb30bb0c8179d4c999829c doesn't show the vbmc VM side
14:35:03 it's a different protocol, so the BMC part will have to be changed
14:35:15 hint: etingof may be quite happy to work on anything redfish-related :)
14:35:16 and we need to place dstat there also IMO
14:35:23 #action rlandy, weshay, dtantsur to look into switching to TCP for IPMI and possibly tuning retries
14:35:26 tag you're it
14:35:59 btw, yesterday I plotted these ipmi outages on an undercloud I'm running on rdo-cloud, they lasted hours, retries and tcp aren't going to help that https://plot.ly/create/?fid=higginsd:2#/
14:36:05 can we just blame evilien?
14:36:19 derekh++
14:36:47 yeah, hours is too much
14:36:47 in the case of the env above, DNS wasn't working (using 1.1.1.1), so the BMC couldn't talk to the rdo-cloud API
14:36:49 :-o
14:37:27 dtantsur, derekh so would you guys be open to the idea of an ovb no-change job that runs say every 4 hours or so and does some kind of health report on our 3rd party env?
14:37:28 for the record: these two options together control ipmi timeouts: https://docs.openstack.org/ironic/latest/configuration/config.html#ipmi
14:37:36 until we have this solved?
14:37:59 weshay: you mean, instead of voting OVB jobs?
14:38:23 I think the two are independent proposals, both I think should be considered
14:38:42 * dtantsur welcomes etingof
14:38:51 weshay: if it could be left up afterwards it would be great, I'd be happy to jump on and try and debug it
14:38:52 I think ironic takes the brunt of at least the current issues in our 3rd party cloud
14:39:05 rock on..
14:39:06 etingof: tl;dr a lot of ipmi outages in the ovb CI, thinking of various ways around
14:39:37 really happy to see you guys in the mix and proactive here.. this has been very painful for the project.. so thank you!!!!!
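For reference, the two [ipmi] options pointed at above live in ironic.conf on the undercloud; the values below are only illustrative of the kind of tuning being discussed, not a recommendation from the meeting.

    [ipmi]
    # Upper bound (seconds) on how long ironic keeps retrying a failed
    # ipmitool command before giving up; raising it tolerates longer
    # BMC/network blips at the cost of slower failure detection.
    command_retry_timeout = 120
    # Minimum delay (seconds) between ipmitool commands sent to the same BMC.
    min_command_interval = 5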
14:39:38 that reminds me of my ipmi patches
14:39:45 etingof: yep, this is also relevant
14:39:49 https://imgflip.com/i/2k8a00
14:39:49 weshay: actually if you want I could just do that on my own rdo account
14:39:57 we may want to increase retries here, so we'd better sort out our side
14:40:13 I'm not sure if I see the same networking issues in personal tenants as I see in the openstack-infra tenant
14:40:26 not sure what others have noticed
14:40:28 weshay: ok
14:40:31 etingof: but also thinking if switching ovb to redfish will make things a bit better
14:40:40 so I have this killer patch that limits ipmitool run time -- https://review.openstack.org/#/c/610007/
14:41:14 getting tcpdumps from the BMC is a bit more difficult
14:41:18 that patch might hopefully improve the situation when we have many nodes down
14:41:23 https://docs.openstack.org/ironic/pike/admin/drivers/redfish.html ?
14:41:36 weshay: yep
14:42:21 we would probably have to switch from virtualbmc to the sushy-tools emulator
14:43:20 the redfish stuff seems like a longer term investment
14:43:29 etingof: ovb does not use vbmc, but its own bmc implementation
14:43:42 well, but you have a point: sushy-tools has a redfish-to-openstack bridge already
14:44:19 dtantsur, ah, right! because it manages OS instances rather than libvirt?
14:44:24 yep
14:44:37 luckily, sushy-tools can do both right out of the box \o/
14:45:03 it's a big plus indeed, it means we can remove the protocol code from ovb itself
14:45:25 help me understand how sushy-tools and redfish would help introspection / provisioning when networking is very unstable
14:45:35 maybe offline, but that is not clear to me yet
14:45:47 weshay: simply because of using tcp instead of udp for power actions
14:45:54 ah k
14:45:58 I'm not sure if it's going to be a big contribution or not
14:46:01 but worth trying IMO
14:46:09 esp. since the world is (slowly) moving towards redfish anyway
14:46:10 agree, should be on the table as well
14:46:13 nhicher, ^^^
14:46:41 sounds like a lot of work to take on before figuring out the problem
14:46:59 well, the proper fix is to make networking stable
14:47:01 although possibly work we'll do eventually anyways
14:47:08 we should wait for the admins to upgrade ovs imho
14:47:13 run for a day or two w/ that
14:47:22 any ETA on the OVS upgrade?
14:47:25 will we have to maintain ipmi-based deployments for a long time (forever)?
14:47:34 asking admins to join
14:47:46 because people in the trenches still use ipmi
14:47:52 yea we'll have to support it
14:48:12 amoralej, do you know?
14:48:16 but we don't necessarily need to rely on it 100% upstream
14:48:22 this ^^^
14:48:53 well it needs to be tested anyway, if it's supported.
14:48:57 anyway so it looks like there's some possible improvements to OVB that could be had, who wants to take a look at that?
14:49:05 weshay, I need to do it in staging ... that will be tomorrow
14:49:06 weshay, what's the question?
14:49:12 Tengu: yea but we could limit it to a single feature set rather than all OVB jobs
14:49:16 ovs update?
14:49:18 amoralej, when we're going to get the new ovs package in rdo-cloud
14:49:25 mwhahaha: yup.
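To make the redfish option discussed above a bit more concrete, a rough sketch of what the TCP-based flow could look like: the sushy-tools emulator exposing a Redfish endpoint in front of cloud instances, and a node enrolled in ironic with the redfish hardware type instead of ipmi. The port, address, credentials and system id are made up, and the emulator flags are an assumption to be checked against the sushy-tools docs.

    # Run the Redfish emulator (sushy-tools); --os-cloud is assumed to select
    # the OpenStack backend (manage cloud instances rather than libvirt domains).
    sushy-emulator --port 8000 --os-cloud my-ovb-cloud &

    # Enroll a node with ironic's redfish driver; driver-info values are placeholders.
    openstack baremetal node create --driver redfish \
        --driver-info redfish_address=http://192.0.2.10:8000 \
        --driver-info redfish_system_id=/redfish/v1/Systems/<instance-uuid> \
        --driver-info redfish_username=admin \
        --driver-info redfish_password=password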
14:49:32 Tengu: and we can just do a simple 1+1 deploy rather than full HA
14:49:44 +1
14:49:52 Tengu: we technically support the redfish stuff (i think) but we don't have any upstream coverage in tripleo
14:49:57 weshay, no idea about the plan to deploy it in rdo-cloud
14:49:59 anyway we need to move on
14:50:10 alan did a build that hopefully will help
14:50:20 who wants to take on the sushy-tools ovb review bits? dtantsur or etingof?
14:50:45 amoralej, k.. we need to expose these details more to the folks consuming 3rd party imho
14:50:48 * dtantsur pushes etingof forward :D
14:51:00 #action etingof to take a look at OVB+sushy-tools
14:51:00 done
14:51:03 moving on :D
14:51:04 dtantsur: network is unreliable by definition :)
14:51:19 bogdando: yeah, there are grades of unreliability :)
14:51:27 especially if thinking of edge cases in the future, we barely can/should "fix" it
14:51:41 in all seriousness we do need to move forward, please feel free to continue this discussion after the meeting or on the ML
14:51:51 k k
14:52:01 (rfolco) CI Community Meeting starts immediately upon this meeting closing in #oooq. This is our team's weekly "open office hours." All are welcome! Ask/discuss anything, we don't bite. Agenda (add items freely) --> https://etherpad.openstack.org/p/tripleo-ci-squad-meeting ~ L49.
14:52:40 that's it on the meeting items
14:52:50 moving on to the status portion of our agenda
14:52:55 #topic Squad status
14:52:55 ci
14:52:55 #link https://etherpad.openstack.org/p/tripleo-ci-squad-meeting
14:52:55 upgrade
14:52:55 #link https://etherpad.openstack.org/p/tripleo-upgrade-squad-status
14:52:55 containers
14:52:55 #link https://etherpad.openstack.org/p/tripleo-containers-squad-status
14:52:56 edge
14:52:56 #link https://etherpad.openstack.org/p/tripleo-edge-squad-status
14:52:57 integration
14:52:57 #link https://etherpad.openstack.org/p/tripleo-integration-squad-status
14:52:58 ui/cli
14:52:58 #link https://etherpad.openstack.org/p/tripleo-ui-cli-squad-status
14:52:59 validations
14:52:59 #link https://etherpad.openstack.org/p/tripleo-validations-squad-status
14:53:00 networking
14:53:01 #link https://etherpad.openstack.org/p/tripleo-networking-squad-status
14:53:01 workflows
14:53:01 #link https://etherpad.openstack.org/p/tripleo-workflows-squad-status
14:53:02 security
14:53:02 #link https://etherpad.openstack.org/p/tripleo-security-squad
14:53:16 any particular highlights that anyone would like to raise?
14:54:50 sounds like no
14:55:13 #topic bugs & blueprints
14:55:13 #link https://launchpad.net/tripleo/+milestone/stein-1
14:55:13 For Stein we currently have 28 blueprints and about 743 open Launchpad bugs. 739 stein-1, 4 stein-2. 102 open Storyboard bugs.
14:55:13 #link https://storyboard.openstack.org/#!/project_group/76
14:55:47 rfolco, post the link for the bluejeans ci here please
14:55:58 please take a look at the open blueprints as there are a few without approvals and such
14:56:07 oops sorry
14:56:11 https://bluejeans.com/5878458097 --> ci community mtg bj
14:56:21 any specific bugs, other than the rdo cloud ones, that people want to point out?
14:57:18 sounds like no on that as well
14:57:19 #topic projects releases or stable backports
14:57:25 EmilienM: stable releases?
14:57:56 mwhahaha: I do once a month now
14:58:01 no updates this week
14:58:04 k
14:58:11 #topic specs
14:58:11 #link https://review.openstack.org/#/q/project:openstack/tripleo-specs+status:open
14:58:38 please take some time to review the open specs, looks like there are only a few for stein
14:58:45 #topic open discussion
14:58:47 anything else?
14:59:42 mwhahaha: What about Undercloud/Overcloud OS upgrade?
14:59:52 huynq: upgrade to what?
15:00:03 Marius Cornea proposed openstack/tripleo-upgrade master: Add build option to plugin.spec https://review.openstack.org/611005
15:00:19 thrash: therve : is it an ok idea to kick off a workflow from within an action?
15:00:33 slagle: Uhhh... I would say no
15:00:35 mwhahaha: e.g. from CentOS7 to CentOS8
15:01:16 huynq: some folks are still looking into that (cc: chem, holser_) but i'm not sure we have a solid plan at the moment due to the lack of CentOS8
15:01:25 alright we're out of time
15:01:28 #endmeeting