11:00:12 <oneswig> #startmeeting scientific-sig
11:00:13 <openstack> Meeting started Wed Apr  8 11:00:12 2020 UTC and is due to finish in 60 minutes.  The chair is oneswig. Information about MeetBot at http://wiki.debian.org/MeetBot.
11:00:14 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
11:00:16 <openstack> The meeting name has been set to 'scientific_sig'
11:00:33 <oneswig> #topic greetings
11:00:37 <janders> g'day
11:00:41 <oneswig> hi janders
11:00:57 <oneswig> What's new with you?
11:01:01 <dh3> hi
11:01:08 <janders> interesting cx6-eth breakages for AOB :)
11:01:10 <oneswig> Hi dh3
11:01:22 <oneswig> already looking forward to it...
11:01:49 <verdurin> Hello.
11:02:00 <oneswig> greetings verdurin
11:02:26 <oneswig> Let's get the show on the road
11:02:37 <oneswig> #topic Kolla user group
11:02:52 <oneswig> OK, only new item on this week's agenda
11:03:25 <oneswig> For Kolla users there's a new effort to get users and operators talking with developers
11:03:51 <oneswig> #link Kolla user group thread http://lists.openstack.org/pipermail/openstack-discuss/2020-April/013787.html
11:04:44 <oneswig> If you're using Kolla, hopefully it will be a worthwhile discussion
11:06:32 <verdurin> Yes, hoping to attend.
11:06:45 <oneswig> Is there something similar for TripleO? I know there are feedback sessions at the summits
11:07:07 <janders> I call it tripleno :P
11:07:17 <janders> (can be made to work though :)  )
11:08:49 <oneswig> OK, good to know.
11:09:24 <oneswig> I've been to quite a few operator feedback sessions in different forms, hopefully this will be productive.
11:09:33 <oneswig> Move on?
11:09:41 <oneswig> #topic AOB
11:10:00 <janders> ok... i promised some cx6-eth stories
11:10:28 <janders> we used to have an issue getting VPI to work. All-IB config *just worked* but mixed Eth+IB less so
11:10:44 <janders> a FW version came out some time back that was supposed to fix it; we started testing it a couple of weeks back
11:10:59 <janders> catch: it seems for VPI to work, eth needs to be on port1 and ib on port2
11:11:20 <janders> and we're wired the other way round and have limited DC access due to lockdown so it's tricky to swap it over
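For reference, the per-port protocol on ConnectX VPI cards is set with mlxconfig. A minimal sketch, assuming the Mellanox MFT tools are installed; the device path shown is only an example, and this changes which protocol each port runs, not the physical cabling:

```
# Start the Mellanox Software Tools service to expose /dev/mst devices
mst start

# Query the current per-port link types (device path is an example)
mlxconfig -d /dev/mst/mt4123_pciconf0 query | grep LINK_TYPE

# Set port 1 to Ethernet (2) and port 2 to InfiniBand (1);
# takes effect after a reboot or firmware reset (mlxfwreset)
mlxconfig -d /dev/mst/mt4123_pciconf0 set LINK_TYPE_P1=2 LINK_TYPE_P2=1
```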
11:11:25 <verdurin> janders: ???
11:11:29 <janders> in any case - it does malfunction in an interesting way
11:11:35 <janders> eth link comes up no worries
11:11:43 <janders> till... you try to use it with a Linux bridge or OVS
11:11:58 <janders> when it just drops any traffic with a MAC address not matching the physical port's
11:12:16 <janders> our friends at Mellanox are looking at it but that's where things are at
11:12:28 <janders> I've seen a few issues with VPI but not this one :)
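The symptom described (frames for MACs other than the port's own being dropped when bridging) is usually worth ruling out on the host side first. A sketch of the usual checks, with interface and VF numbers as placeholder examples; no claim that these work around the firmware issue itself:

```
# Put the uplink into promiscuous mode so the NIC passes frames for
# MACs other than its own (needed when enslaved to a Linux bridge)
ip link set dev ens1f0 promisc on

# For SR-IOV VFs, relax the MAC anti-spoof check and trust the VF
# so it can carry guest-assigned MAC addresses
ip link set ens1f0 vf 0 spoofchk off
ip link set ens1f0 vf 0 trust on

# Watch interface counters for drops while reproducing the problem
ip -s link show dev ens1f0
```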
11:12:37 <janders> have you guys seen anything like that?
11:12:56 <oneswig> I have... kind of...
11:13:07 <janders> what card was that on?
11:13:11 <janders> also a cx6?
11:13:15 <janders> or something different?
11:13:21 <oneswig> ConnectX-5, probably different issue.
11:13:32 <oneswig> This is SR-IOV over a bonded LAG
11:14:14 <janders> sorry got dropped out
11:14:21 <janders> (not VPIs fault :)  )
11:14:35 <janders> oneswig: what was the problem on cx5?
11:14:49 <oneswig> Aha, well it was related to VF-LAG and OVS
11:15:06 <belmoreira> hi, just joined
11:15:09 <oneswig> I got into a situation where I could receive broadcast traffic but not unicast
11:15:13 <oneswig> Hi belmoreira
11:15:20 <janders> right!
11:15:31 <oneswig> Are you using OVS 2.12?
11:15:34 <janders> was that only specific to VFs, or would it impact traffic across the board?
11:15:45 <janders> (checking)
11:16:00 <oneswig> Saw it first in VFs.  When I installed OVS 2.12 it affected both
11:16:54 <oneswig> Haven't investigated in sufficient detail yet but it might not be related
11:17:41 <janders> 2.9.0 is the version
11:17:49 <oneswig> BTW the RDO build of OVS 2.12 is apparently quite old compared to recent code on that branch
11:19:25 <janders> right!
11:19:32 <janders> this is OSP13/Queens based project
11:19:40 <janders> so it may be worth re-testing with latest ovs
11:19:52 <janders> has upgrading ovs helped in your case?
11:19:54 <oneswig> OVS troubleshooting tools are a dark art all their own
11:20:15 <janders> yeah we may need to sacrifice a packet or two to the ovs gods...
11:20:20 <oneswig> janders: not yet, 2.11->2.12 caused many problems on first attempt
11:20:39 <oneswig> Need to go back and do it again, with better preparation
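For anyone chasing similar VF-LAG/offload behaviour, a sketch of the usual starting points for checking what OVS is actually offloading. Interface and representor names are examples, the service name varies by distro, and exact command availability depends on the OVS version in use:

```
# Confirm the running OVS version and whether hardware offload is enabled
ovs-vsctl --version
ovs-vsctl get Open_vSwitch . other_config:hw-offload

# Enable TC-based hardware offload on the PF and in OVS (restart needed;
# service is openvswitch on EL, openvswitch-switch on Debian/Ubuntu)
ethtool -K ens1f0 hw-tc-offload on
ovs-vsctl set Open_vSwitch . other_config:hw-offload=true
systemctl restart openvswitch

# Dump datapath flows; newer OVS lets you filter by offload status
ovs-appctl dpctl/dump-flows type=offloaded

# Inspect the TC rules programmed on a VF representor
tc -s filter show dev ens1f0_0 ingress
```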
11:22:06 <oneswig> belmoreira: was talking with someone recently about external hardware inventories and Ironic.  I was thinking about Arne's work on that.  Did CERN settle on a hardware inventory management system?
11:22:33 <belmoreira> not yet
11:24:18 <oneswig> belmoreira: so what's new at CERN?
11:24:58 <belmoreira> :) related to Ironic, we are now moving to conductor groups
11:25:37 <belmoreira> this allows us to split the ironic infrastructure more or less like cells
11:25:56 <oneswig> how many nodes are you managing now?
11:27:35 <belmoreira> ~5100
11:28:03 <oneswig> nice work :-)
11:29:01 <oneswig> How do you size the number of nodes managed by a conductor group?
11:30:29 <belmoreira> We introduced the first conductor group with ~600 nodes
11:30:53 <belmoreira> the metric that we use is the time that the resource tracker takes to run
11:31:35 <janders> clever!
11:32:05 <oneswig> Have you reduced how often it runs?  I think I remember you changed it to run every few hours?
11:33:24 <belmoreira> yes, but with the latest versions it impacts node state updates
11:33:49 <belmoreira> with 600 nodes it takes around 15 minutes to run
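The knob being discussed is the nova-compute periodic task that runs the resource tracker. A minimal sketch of lengthening it; the value shown (2 hours) is only an example:

```
# nova.conf on the nova-compute service fronting Ironic
[DEFAULT]
# Seconds between resource tracker runs (0 = every periodic tick).
# Raising it reduces load with large node counts, at the cost of
# slower propagation of node state changes.
update_resources_interval = 7200
```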
11:34:13 <oneswig> On a related subject, someone on our team mentioned that the software RAID deployment is more flexible now.
11:34:17 <belmoreira> we are discussing having conductor groups of ~500 nodes
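Conductor groups work roughly as follows; a sketch with hypothetical group, host and node names, assuming Ironic and Nova at Stein or later:

```
# ironic.conf on each conductor in the group
[conductor]
conductor_group = rack-a

# nova.conf on the matching nova-compute service (ironic driver)
[ironic]
partition_key = rack-a
peer_list = compute-rack-a-1,compute-rack-a-2

# Assign nodes to the group
openstack baremetal node set <node-uuid> --conductor-group rack-a
```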
11:36:11 <belmoreira> I think the RAID work is already merged upstream. For details, Arne is the best person to contact
11:37:17 <oneswig> great, thanks.
11:40:21 <oneswig> On a different subject - dh3: has there been any further development on your work integrating Lustre with OpenStack clients?
11:42:03 <dh3> oneswig: "a bit" - it is fighting for our attention with other projects, and we were hitting some weird Lustre errors which made upgrading the client version more urgent (we are settling on 2.12.4 now) but the users still want it
11:44:06 <dh3> the general approach of building an image with Lustre already in it is working though; we have several groups using it as a PoC
11:44:16 <dh3> (can't escape posix :)  )
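A minimal sketch of the "Lustre client baked into the image" approach described here; the MGS address, filesystem name and package names are examples (packages vary by distro and must match the image kernel):

```
# At image build time: install a Lustre client matching the image kernel
# (e.g. from the Whamcloud repos on EL-based images)
yum install -y lustre-client kmod-lustre-client

# At boot (e.g. via cloud-init): mount the filesystem.
# <MGS NID>@tcp:/<fsname> and the mount point are placeholders.
mkdir -p /lustre
mount -t lustre 10.0.0.1@tcp:/lustre01 /lustre
```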
11:44:22 <oneswig> dh3: I saw you guys are now famous - some familiar faces here https://www.youtube.com/watch?v=WAWJxVYH9QM&feature=youtu.be
11:45:47 <dh3> haha, I knew there was something in the works but I hadn't seen that. we are in contract (re)negotiation land now, different kind of fun
11:46:44 <oneswig> good luck with that! :-)
11:47:13 <dh3> thanks :)
11:49:18 <oneswig> dh3: there has been some work recently with Cambridge Uni on improved Ansible automation for Lustre.  Might be worth sharing notes with you on that.
11:50:21 <dh3> oneswig: we'd be interested. We don't let our main site Ansible touch Lustre servers much at the moment, trying to keep them as a "black box" - only to install stats scripts and (soon) to control iptables
11:51:47 <oneswig> dh3: ok sounds good.
11:52:01 <oneswig> I didn't have more for today - anyone else?
11:54:36 <oneswig> OK y'all, good to see everyone, thanks for joining :-)
11:54:41 <oneswig> #endmeeting