21:00:26 <oneswig> #startmeeting scientific-sig
21:00:27 <openstack> Meeting started Tue Sep 18 21:00:26 2018 UTC and is due to finish in 60 minutes.  The chair is oneswig. Information about MeetBot at http://wiki.debian.org/MeetBot.
21:00:27 <janders_> g'day everyone
21:00:28 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
21:00:30 <openstack> The meeting name has been set to 'scientific_sig'
21:00:37 <oneswig> greetings janders_ and all
21:00:50 <oneswig> what's new?
21:01:20 <oneswig> #link agenda for today is https://wiki.openstack.org/wiki/Scientific_SIG#IRC_Meeting_September_18th_2018
21:02:08 <oneswig> Tomorrow is upgrade day over here... this time it's Pike->Queens
21:03:17 <oneswig> We've been doing the drill on the staging environment but there's nothing quite like the real thing ...
21:03:34 <janders_> oneswig: what are the main challenges?
21:04:20 <oneswig> In this case, not too many.  One concern is correctly managing resource classes in Ironic
21:04:42 <janders_> right! are you doing BIOS/firmware upgrades as well?
21:05:07 <oneswig> oh no.  That's not in the plan (should it be I wonder?)
21:05:28 <b1air> o/
21:05:44 <oneswig> G'day b1air, which airport are you in today? :-)
21:05:44 <b1air> Do all the changes all at once!!
21:05:56 <janders_> if you were to, would you use something like lifecycle manager, or would you temporarily boot ironic nodes into a "service image" with all the tools?
21:05:58 <b1air> Very near AKL as it happens
21:06:00 <oneswig> Fighting talk from a safe distance, that
21:06:07 <b1air> ;-)
21:06:34 <oneswig> janders_: last time we did this, it was the latter - a heat stack for all compute instances with a service image in it.
21:07:16 <janders_> right! in a KVM-centric world, it's easy - just incorporate all the BIOS/FW management tools in the image. Ironic changes this paradigm so I was wondering how you go about it. Might be an interesting forum topic.
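[For reference, a minimal sketch of the kind of Heat stack oneswig describes above, booting bare-metal nodes into a firmware/service image as ordinary Nova instances. The template version, image, flavor and network names are placeholders, not the stack actually used.]

    heat_template_version: 2017-09-01

    # Sketch only: boot a group of bare-metal nodes into a service image so
    # firmware/BIOS tooling can be run from the OS. All names are placeholders.
    parameters:
      node_count:
        type: number
        default: 4

    resources:
      service_nodes:
        type: OS::Heat::ResourceGroup
        properties:
          count: { get_param: node_count }
          resource_def:
            type: OS::Nova::Server
            properties:
              name: fw-service-%index%
              image: firmware-service-image   # placeholder service image
              flavor: baremetal               # placeholder Ironic flavor
              networks:
                - network: provision-net      # placeholder network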
21:07:26 <martial_> (difficulty joining on the phone)
21:07:30 <oneswig> Have you seen an Ansible playbook for doing firmware upgrades via the dell idrac?
21:07:33 <oneswig> Hello martial_
21:07:40 <oneswig> #chair b1air martial_
21:07:41 <openstack> Current chairs: b1air martial_ oneswig
21:07:45 <oneswig> (remiss of me)
21:07:47 <janders_> do you pxeboot the service image via ironic or outside ironic?
21:08:18 <oneswig> In that case we booted it like a standard compute instance, via Ironic
21:08:24 <b1air> KVM world easy? Pull the other one @janders_ ! :-)
21:08:54 <janders_> no.. I looked at the playbooks for managing the settings but not the BIOS/FW versions. If it works it'd be gold (I'm not worried about the playbooks, I'm worried about the Dell hardware side :))
21:09:14 <janders_> oneswig: does this mean you had to delete all the ironic instances first?
21:09:31 <janders_> b1air: KVM world is easy in this one sense :)
21:09:42 <oneswig> In that case, yes - I guess the lifecycle manager could have avoided that, do you think?
21:10:01 <janders_> oneswig: yes - it will do all of this in the pre-boot environment (if it works..)
21:10:41 <janders_> when I say "if it works" - on our few hundred nodes of HPC it definitely works for 70-95% of nodes. Success rates vary. The ones that failed usually just need more attempts.. (thanks, Dell)
21:11:10 <b1air> Power drain?
21:11:30 <janders_> however I am unsure if Mellanox firmware can be done via Lifecycle Controller (we usually do this part from the compute OS)
21:11:33 <oneswig> janders_: is this the playbooks at https://github.com/dell/Dell-EMC-Ansible-Modules-for-iDRAC ?
21:12:10 <b1air> janders_: only if it is a Dell OEM Mellanox part - that's the value add
21:13:28 <janders_> b1air: most of our HCAs are indeed OEM - I need to revisit this (I guess the guys have always done this with mft & flint, cause it works 99/100) - in the ironic world doing everything from LC could simplify things
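[On the playbooks oneswig links above, a hedged sketch of what an out-of-band iDRAC firmware update play can look like. The module and parameter names follow the newer dellemc.openmanage collection (idrac_firmware); the 2018-era repo linked above organised its modules differently, and the inventory group, credentials and NFS share below are placeholders.]

    # Sketch only: apply firmware from a Dell catalog via the iDRAC,
    # out-of-band, using the dellemc.openmanage idrac_firmware module.
    - hosts: idrac_oob            # placeholder group of iDRAC endpoints
      connection: local
      gather_facts: false
      tasks:
        - name: Update firmware from a catalog on an NFS share
          dellemc.openmanage.idrac_firmware:
            idrac_ip: "{{ inventory_hostname }}"
            idrac_user: "{{ idrac_user }}"
            idrac_password: "{{ idrac_password }}"
            share_name: "192.168.0.10:/fw-catalog"   # placeholder NFS share
            catalog_file_name: "Catalog.xml"
            apply_update: true
            reboot: true
            job_wait: true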
21:14:27 <janders_> closer to the main topics - from your experience, how big do the forum sessions typically get?
21:14:49 <oneswig> janders_: there has also been talk previously of performing these actions as a manual cleaning step - less obtrusive but without out-of-band dependencies on idrac
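[The manual-cleaning route oneswig mentions works roughly as below. This shows only the generic mechanics, since the actual firmware clean step name is driver-specific; the example step shown is the stock metadata erase.]

    # Node must be in the 'manageable' state for manual cleaning.
    openstack baremetal node manage <node>
    openstack baremetal node clean <node> \
        --clean-steps '[{"interface": "deploy", "step": "erase_devices_metadata"}]'
    openstack baremetal node provide <node>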
21:15:00 <b1air> At Monash we found the LCs to be ok reliability-wise from 13G
21:15:27 <oneswig> janders_: perhaps we should, indeed, look at the agenda..
21:15:37 <oneswig> #topic Forum sessions
21:16:18 <oneswig> Forum sessions I've been in have ranged in size from ~8 people to ~50 (but about 12 holding court)
21:16:29 <janders_> oneswig: this is a neat way to do it in a rolling fashion - however the drawback is having a mix of versions for quite a while as users delete/reprovision the nodes. I'm trying to come up with an option of doing it all in a defined downtime window, without affecting existing ironic instances.
21:16:36 <janders_> b1air: that is great to hear! :)
21:16:55 <janders_> oneswig: that is good - it shouldn't be impossible to get some bandwidth in these sessions! :)
21:17:32 <oneswig> I get the feeling one on Ironic and BIOS firmware management could be interesting!
21:17:46 <oneswig> Facilitating it but also, conversely, preventing it
21:19:30 <priteau> janders_: I think at CERN they have a way of letting the instance owner select their downtime period
21:19:52 <priteau> I am trying to find where I saw it described
21:20:14 <oneswig> Good evening priteau!
21:20:34 <janders_> wow - very cool idea.. I wonder if it's leveraging AZs (which might have different downtime windows) or something else
21:20:35 <priteau> Hi everyone by the way :-)
21:20:54 <priteau> janders_: it may even be per-host
21:21:26 <b1air> Sounds a bit like AWS' reboot/downtime scheduling API
21:22:42 <janders_> thinking about it - if it's just the instance that's supposed to be up and it has no volumes etc attached it can be quite fine grained
21:23:13 <janders_> however if the instance is leveraging any services coming off the control plane, it might be tricky to go below AZ-level downtime
21:23:28 <janders_> or at least that's my quick high level thought without looking into details
21:23:51 <janders_> very interesting topic though! :)
21:24:39 <oneswig> question of procedure - do we add a proposal like this to the Ironic forum etherpad, or mint our own SIG etherpad and add it to the list?
21:25:42 <priteau> I found http://openstack-in-production.blogspot.com/2018/01/keep-calm-and-reboot-patching-recent.html, but it's not how I remember it
21:26:54 <oneswig> Another area I am interested in pursuing is support for the recent features introduced to Ironic for alternative boot methods (boot from volume, boot to ramdisk) - is there scope for getting these working with multi-tenant networking?
21:26:55 <priteau> Maybe there is another procedure for the less critical upgrades
21:30:19 <janders_> oneswig: alternative boot methods would definitely be of interest. Looking at the PTG notes there are some good ideas so it looks like the next step would be to find out if/when these ideas can be implemented
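[For context on the alternative boot methods discussed above: they are selected per node via the storage and deploy interfaces. A hedged sketch, assuming the relevant interfaces are enabled in ironic.conf and supported by the node's driver; the open question raised here is whether either combines with multi-tenant networking.]

    # Boot-from-volume: use the 'cinder' storage interface on the node
    openstack baremetal node set <node> --storage-interface cinder
    # Boot-to-ramdisk: use the 'ramdisk' deploy interface on the node
    openstack baremetal node set <node> --deploy-interface ramdisk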
21:31:01 <janders_> something from my side (across all the storage-related components) would be BeeGFS support/integration in OpenStack
21:31:30 <oneswig> Ooh, interesting.
21:31:35 <janders_> would you guys be interested in this, too?
21:31:37 <oneswig> Like, in Manila?
21:31:48 <janders_> yes, that's the most powerful scenario
21:32:05 <oneswig> Absolutely!  We've got playbooks for it, but nothing "integrated"
21:32:12 <oneswig> (but does it need to be?)
21:32:16 <janders_> but running VM instances (for those who still need VMs) and cinder volumes off BeeGFS would be of value as well
21:32:56 <oneswig> That follows quite closely what IBM was up to with SpectrumScale
21:33:00 <janders_> given no kerberos support in BeeGFS for the time being I think it would be very useful to have some smarts there
21:33:22 <oneswig> OK, let's get these down...
21:33:29 <janders_> haha! you found the logic behind my thinking
21:33:38 <oneswig> #link SIG brainstorming ideas https://etherpad.openstack.org/p/BER-stein-forum-scientific-sig
21:33:58 <janders_> I liked what IBM have done with GPFS/Spectrum Scale, however I find deploying and maintaining this solution more and more painful as time goes by
21:34:13 <janders_> I see the same sentiment on the storage side
21:34:22 <janders_> "it's good, but..."
21:34:41 <janders_> I'll add some points to the etherpad now
21:35:11 <janders_> ok, you already have - thank you! :)
21:36:10 <janders_> another storage related idea
21:36:27 <janders_> would you find it useful to be able to separate storage backends for instance boot drives and ephemeral drives?
21:36:43 <janders_> I like the raw performance of node-local SSD/NVMe
21:37:10 <janders_> however having something more resilient (and possibly shared) for the boot drive is good, too
21:37:34 <janders_> I would happily see support for splitting the two up (I do not think this is possible today, please correct me if I am wrong)
21:37:41 <goldenfri> I was just thinking about that today, so I 2nd that
21:38:17 <janders_> in this case, we could even wipe ephemeral on live migration (this would have to be configurable) so only the boot drive needs to persist
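[For context on why the boot/ephemeral split is a feature request: as far as I know, nova-compute currently selects a single image backend per host, applied to root and ephemeral disks alike, e.g. in nova.conf:]

    [libvirt]
    # Applies to root, ephemeral and swap disks together; there is no
    # per-disk-type backend selection today (hence the request above).
    images_type = rbd      # or qcow2 / flat / lvm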
21:38:53 <oneswig> It seems like a good idea to me, certainly worth suggesting
21:38:58 <janders_> ok!
21:38:59 <oneswig> hello goldenfri!
21:39:17 <goldenfri> o/
21:40:07 <priteau> janders_: if the ephemeral storage is mounted while live migrating, wouldn't the guest OS complain if data gets wiped out?
21:42:02 <janders_> good point, there would have to be some smarts around it. I don't have this fully thought through yet, but I think the capability would be useful. Perhaps cloud-* services could help facilitate this?
21:42:10 <oneswig> OK we are linked up to https://wiki.openstack.org/wiki/Forum/Berlin2018#Etherpads_from_Teams_and_Working_Groups
21:42:31 <janders_> but obviously if there's heavy IO hitting ephemeral, some service trying to umount /dev/sdb won't have a lot of luck..
21:42:56 <b1air> +1 to janders_ ephemeral separation feature request
21:43:28 <priteau> janders_: VM-aware live migration?
21:43:31 <b1air> I see it more likely to be used with cold migration
21:44:07 <b1air> Where you have a fleet of long lived instances that you want to move around due to underlying maintenance etc
21:46:16 <janders_> another thing I'm looking at is using trim/discard-like features for node cleaning - however bits of this might be already implemented, looking at ironic and pxe_idrac/pxe_ilo bits
21:46:23 <janders_> have any of you used this with success?
21:46:52 <janders_> (I might have asked this question here already, not sure)
21:47:22 <b1air> Yes I recall discussing this before, but don't think anything came of it yet
21:47:24 <oneswig> Did we cover this last week?  I think there's an Ironic config parameter for key rotation
21:47:57 <b1air> With hardware encrypted storage?
21:48:08 <oneswig> We use it, and when I checked up I believe it was as simple as that - with the caveat that some of the drives needed a firmware update (of course!)
21:48:29 <priteau> janders_: you asked last week ;-) http://eavesdrop.openstack.org/meetings/scientific_sig/2018/scientific_sig.2018-09-12-11.00.log.html#l-139
21:48:40 <oneswig> b1air: hardware encryption as I understand it but with an empty secret.
21:49:01 <oneswig> So not really encryption...
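[The Ironic settings being referred to are, as far as I recall, the disk-erase options in ironic.conf; on self-encrypting drives an ATA secure erase amounts to a crypto erase / key reset, which matches the "empty secret" description above. Values below are illustrative defaults, not a recommendation.]

    [deploy]
    # IPA falls back to shredding if secure erase is disabled or unsupported.
    enable_ata_secure_erase = true
    erase_devices_priority = 10
    erase_devices_metadata_priority = 99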
21:49:55 <b1air> Cunning - the baddies will never suspect an empty password!
21:50:04 <janders_> oneswig: :) I've discussed this with too many parties and lost track (scientific-sig, RHAT, Dell, ... )
21:50:39 <oneswig> janders_: your comrades here are the source of truth, you can't trust those other guys :-)
21:51:11 <janders_> that's right :) can't trust those sales organisations
21:51:31 <oneswig> There was one other matter to cover today, before I forget
21:51:38 <priteau> keycloack
21:51:43 <oneswig> #topic SIG event space at Berlin
21:52:09 <oneswig> priteau: I think we have that on the agenda for next week
21:52:18 <priteau> Oh, I looked at the wrong week :-)
21:52:47 <oneswig> I know - it's a handy aide memoire for me, probably confusing for anyone else!
21:53:13 <oneswig> Anyway - we have the option of 1 working group session + 1 BoF session (i.e., what we've had at previous summits).
21:53:43 <oneswig> I think this works well enough, unless anyone prefers to shorten it?
21:53:59 <oneswig> b1air? martial_? Thoughts on that?
21:56:56 <janders_> I have a couple more forum ideas - given we're running low on time I will fire them off now
21:57:17 <oneswig> Please do.
21:57:23 <janders_> 1) being able to schedule a bare-metal instance to a specific piece of hardware (I don't think this is supported today) - would this be useful to you?
21:57:43 <janders_> think --availability-zone host:x.y.z equivalent for Ironic
21:57:44 <oneswig> On the SIG events - looks like Wednesday morning is clear for the AI-HPC-GPU track
21:58:22 <oneswig> janders_: I believe that exists, in the form of a three-tuple delimited by colons
21:58:34 <janders_> 2) I don't think "nova rebuild" works with baremetal instances - I think it would be something useful
21:58:43 <oneswig> The form might be nova::<Ironic uuid of the node>
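[For the record, the form oneswig gives resolves to something like the following, which is admin-only by default since it bypasses the scheduler filters; the flavor, image and node UUID are placeholders.]

    openstack server create --flavor baremetal --image centos7 \
        --availability-zone nova::<ironic-node-uuid> \
        my-baremetal-server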
21:59:24 <oneswig> On 2, are you sure? I think I've rebuilt Ironic instances before
21:59:43 <oneswig> Let's follow up on that...
21:59:48 <janders_> in this case, I will retest both and update the etherpad as required
21:59:58 <oneswig> good plan, let us know!
22:00:07 <oneswig> OK, we are out of time
22:00:14 <oneswig> Thanks everyone
22:00:32 <oneswig> keep adding to that etherpad if you get more ideas we should advocate
22:00:56 <oneswig> https://etherpad.openstack.org/p/BER-stein-forum-scientific-sig
22:00:59 <janders_> thanks guys!
22:01:02 <oneswig> #endmeeting