11:03:55 <oneswig> #startmeeting scientific-sig
11:03:56 <openstack> Meeting started Wed Sep 23 11:03:55 2020 UTC and is due to finish in 60 minutes.  The chair is oneswig. Information about MeetBot at http://wiki.debian.org/MeetBot.
11:03:57 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
11:03:59 <openstack> The meeting name has been set to 'scientific_sig'
11:04:11 <oneswig> apologies for lateness
11:10:39 <belmoreira> o/
11:11:04 <belmoreira> hi oneswig
11:11:27 <oneswig> Hi belmoreira, how's things?
11:11:46 <oneswig> priteau was mentioning the discussion in the large scale sig this morning
11:12:09 <belmoreira> good, and with you?
11:12:28 <oneswig> I have been negligent in my attention to the Scientific SIG :-(
11:12:52 <oneswig> Otherwise things are going well I'd say
11:12:53 <belmoreira> yes, we had the large scale sig meeting this morning and the discussion for the next goals and PTG
11:13:20 <belmoreira> good to hear that
11:14:56 <oneswig> What are the pain points you are having with scaling?
11:17:34 <belmoreira> well, I think we hit everything possible to hit over the last few years... for now, until we can physically add more nodes, we should be good
11:20:10 <oneswig> Pierre mentioned you were describing limitations with cells because they don't help with network scaling - is that new or something you've been fighting from the beginning?
11:20:57 <belmoreira> it's more related to neutron scalability, not nova or cells design
11:21:33 <belmoreira> last year we split the infrastructure in 2 regions. Currently we have 3 regions
11:26:39 <belmoreira> I think it will be interesting to share it
11:27:22 <oneswig> How does storage link between the 3 regions, do you have to share storage?
11:36:08 <janders> hi oneswig belmoreira
11:36:22 <janders> belmoreira do you have multiple cells per region, do I get that right?
11:36:36 <oneswig> Hi janders, good to see you
11:36:56 <janders> oneswig good to see you too. Apologies for joining late - team meeting clash.
11:36:59 <b1airo> g'evening
11:37:08 <janders> hey b1airo!
11:37:23 <b1airo> my late excuse is more about beer...
11:37:36 <janders> b1airo important! :)
11:38:10 <oneswig> Hi b1airo, very important.  Back home?
11:38:15 <oneswig> #chair b1airo
11:38:16 <openstack> Current chairs: b1airo oneswig
11:38:21 <b1airo> oh totally janders , certainly higher priority than meetings anyways :-P
11:38:38 <janders> I look forward to times when we can combine both again
11:38:45 <janders> as it should be
11:38:46 <b1airo> no oneswig , still up north hanging out with the NIWA crew
11:39:14 <belmoreira> janders yes, we have multiple cells
11:39:34 <b1airo> how many now belmoreira ?
11:39:35 <belmoreira> in 3 regions we have more than 70 cells in total
11:39:46 <b1airo> *whistle*
11:40:02 <janders> belmoreira do any of your cells span regions, or is each cell contained within one region?
11:40:11 <belmoreira> each cell has a maximum of 250 nodes
11:40:35 <belmoreira> janders cells are per region
11:40:41 <janders> belmoreira which aspect of scalability do cells help with the most in your experience?
11:40:44 <b1airo> are you still following the same, err, "disposable" cell controllers model? :-)
11:41:24 <belmoreira> each cell has its own rabbit infrastructure
11:41:58 <b1airo> (i vaguely recall you are running your cell controllers within the prod cloud itself... 🐢)
11:41:59 <belmoreira> and it's a good failure domain, in case of issues things are contained
11:42:56 <belmoreira> b1airo :) yes, all our control plane runs inside the cloud itself (inception)
11:43:07 <janders> belmoreira nice! :)
11:43:30 <b1airo> agree on that - spanning regions (a user facing construct) across cells (a backend scalability and failure domain concern) seems like a questionable idea
11:43:32 <janders> belmoreira does this architecture pose a challenge in case of a need of a full-system shutdown?
11:44:30 <belmoreira> you mean a shutdown in the data centre :)
11:44:35 <janders> belmoreira I really like it, just wonder what extra measures are needed to prevent losing the "starter motor"
11:44:51 <janders> belmoreira yeah
11:45:13 <belmoreira> yes, sure... if that happens we need to understand what needs to be available first
11:46:13 <b1airo> i guess maybe the api top-level needs to come up first, followed by compute "cell0" (i guess that must be a thing in this architecture?)
11:46:14 <belmoreira> but it's not a big issue, because instance start doesn't need the control plane
11:46:37 <janders> belmoreira right!
11:47:03 <janders> belmoreira do you have dedicated compute nodes for infra services, so that they are separate from user workloads and easy to identify?
11:47:07 <belmoreira> b1airo yes, if we really need APIs from the beginning, but if a disaster happens API availability will probably be the last
11:47:46 <b1airo> ha, good point! so "cell0" is really just select instance startup directly on compute nodes?
11:48:25 <janders> do the infra instances have networking statically configured?
11:48:35 <janders> (cause I suppose DHCP services may not be available yet)
11:49:11 <b1airo> was coming to the networking question too :-)
11:49:11 <janders> or is the inception arch cell-specific, with neutron being independent of this?
11:49:16 <belmoreira> b1airo in case of a disaster we will probably force instance start per compute node
11:49:31 <belmoreira> and worry about having the DBs up
11:51:31 <belmoreira> janders we use DHCP, but it's a separate infrastructure... yes, it needs to be up
11:51:50 <janders> belmoreira makes sense
11:52:00 <belmoreira> janders users and infra instances share the same infrastructure
11:52:14 <janders> belmoreira no noisy neighbour issues?
11:52:39 <belmoreira> only compute instances have their dedicated cells/regions
11:53:37 <belmoreira> janders yes, sometimes... we usually live migrate noisy neighbours to less busy compute nodes
11:54:16 <martial_> Late (still getting kids ready)
11:54:53 <oneswig> Hi martial_, morning
11:55:13 <janders> belmoreira it's awesome to hear about your architecture and your experiences with it, thanks for sharing!
11:55:40 <oneswig> janders: how's things with you?
11:55:56 <janders> oneswig good, thank you for asking! :)
11:56:08 <belmoreira> janders np
11:56:45 <janders> something I've been looking at most recently that might be useful for the SIG is potentially introducing NVMe-aware cleaning to Ironic
11:56:57 <janders> (think trim/discard/... functionality)
11:57:20 <oneswig> janders: I've seen key rotation used for SATA SSDs, does that also apply here?
11:57:44 <oneswig> The discard idea is good though!
11:58:07 <janders> yeah some of the "secure" deletion options leverage manipulating crypto keys
11:58:30 <janders> what's supported really varies but I hope we can find enough common ground
11:59:04 <janders> how are things at your end oneswig? What are you guys up to these days?
11:59:34 <oneswig> janders: too much to describe in our final minute :-(
11:59:50 <oneswig> I think finally we are back to having rather too much fun.
11:59:57 <janders> oneswig true! poor timing on my behalf. Next time!
12:00:16 <oneswig> Until next time.  I promise to come better prepared.
12:00:30 <janders> have a good one all
12:00:36 <oneswig> time to close
12:00:38 <oneswig> #endmeeting