19:01:06 <clarkb> #startmeeting infra
19:01:06 <openstack> Meeting started Tue Mar  9 19:01:06 2021 UTC and is due to finish in 60 minutes.  The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:07 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:09 <openstack> The meeting name has been set to 'infra'
19:01:21 <clarkb> #link http://lists.opendev.org/pipermail/service-discuss/2021-March/000195.html Our Agenda
19:01:29 <clarkb> #topic Announcements
19:02:12 <clarkb> clarkb out March 23rd, could use a volunteer meeting chair or plan to skip
19:02:37 <clarkb> I'll probably just let this resolve itself. If you see a meeting agenda next week show up to the meeting otherwise skip it :)
19:03:07 <clarkb> er sorry, it's 2 weeks from now
19:03:11 <clarkb> I'm getting too excited :)
19:03:13 <fungi> heh
19:03:20 <clarkb> DST change happens for those of us in North America this weekend. EU and others follow in a few weeks.
19:03:27 <fungi> you're in a hurry for northern hemisphere spring i guess
19:03:31 <ianw> :) i am around so can run it
19:03:38 <clarkb> ianw: thanks!
19:03:57 <clarkb> heads up on the DST changes starting soon for many of us. You'll want to update your calendars if you operate in local time
19:04:13 <clarkb> North America is this weekend, then EU in like two weeks and Australia in 3? something like that
19:04:42 <clarkb> #topic Actions from last meeting
19:04:48 <clarkb> #link http://eavesdrop.openstack.org/meetings/infra/2021/infra.2021-03-02-19.01.txt minutes from last meeting
19:04:58 <clarkb> corvus has started the jitsi unfork
19:05:00 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/778308
19:05:21 <clarkb> I don't think we need to re-action that and can track it with the change now. Currently it is failing CI for some reason; I haven't had a chance to look into it, though I should try to today
19:05:34 <clarkb> #topic Priority Efforts
19:05:39 <clarkb> #topic OpenDev
19:05:50 <clarkb> The gerrit account inconsistency work continues.
19:06:21 <clarkb> Since last we recorded the status here I have deleted conflicting external ids from about 35 accounts that were previously inactive. These were considered safe changes because the accounts were already disabled.
19:07:31 <clarkb> I have also put another 70 something through the disabling process in preparation for deleting their external ids (something I'd like to do today or tomorrow if we are comfortable with the list). These were accounts that the audit script identified as not having valid openids, ssh usernames, ssh keys, or any reviewed or pushed changes
19:08:07 <clarkb> essentially they were never used and cannot be used for anything. I set them inactive on friday so they would throw an error if they were used over the last few days, but I haven't seen anything related to that.
19:08:18 <clarkb> I'll put together the external id cleanup input file and run that when we are ready
19:08:22 <fungi> yeah, i don't see any way those accounts could be used in their current state, so should be safe
19:08:37 <fungi> hard to say they were never logged into, but they can't be logged into now anyway
19:09:11 <fungi> possible at least some are lingering remnants of earlier account merges/cleanups
19:09:22 <fungi> but just vestiges now if so
19:09:22 <clarkb> yup
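A minimal sketch of how conflicting external ids can be inspected and removed through Gerrit's accounts REST API; this is not necessarily the tooling used for the cleanup above, and the host, credentials, account id, and "conflicting" predicate are placeholders.

    # Sketch only: list an account's external ids and delete the conflicting
    # ones via Gerrit's REST API. Host, credentials, and account id are
    # placeholders; the caller needs the Modify Account capability.
    import json
    import requests

    GERRIT = "https://review.example.org"
    AUTH = ("admin-user", "http-password")
    ACCOUNT_ID = "1000042"  # hypothetical account

    def get_json(resp):
        # Gerrit prefixes JSON responses with )]}' to prevent XSSI.
        resp.raise_for_status()
        return json.loads(resp.text.split("\n", 1)[1])

    ids = get_json(requests.get(
        f"{GERRIT}/a/accounts/{ACCOUNT_ID}/external.ids", auth=AUTH))

    # Hypothetical predicate: treat mailto: identities as the conflicting ones.
    conflicting = [e["identity"] for e in ids if e["identity"].startswith("mailto:")]

    # Delete them in one request; the body is a JSON list of identity strings.
    requests.post(
        f"{GERRIT}/a/accounts/{ACCOUNT_ID}/external.ids:delete",
        auth=AUTH, json=conflicting).raise_for_status()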
19:09:52 <clarkb> Sort of related to this, we had a user reach out to service-incident about not being able to log in. This was for an entirely new account though and appears to be the moving-openid-then-email-conflict problem.
19:10:43 <clarkb> they removed their email from moderation themselves, but I reached out anyway asking them for info on how they got into that state and offered a couple of options for moving forward (they managed to create a second account with a third non-conflicting email which would work, or we can apply these same retirement and external id cleanups to the original account and have them try again)
19:11:29 <clarkb> will wait and see what they say. I cc'd fungi on the email since fungi would've gotten the moderation notice too. But kept it off of public lists so we can talk about email addrs and details like that
19:11:52 <fungi> not that service-incident is a public list anyway
19:11:57 <fungi> but also not really on-topic
19:12:15 <clarkb> fungi: right, I was thinking we could discuss it on service-discuss if not for all the personal details
19:12:39 <fungi> yep
19:13:16 <clarkb> any other opendev items to discuss? if not we can move on? (I've sort of stalled out on the profiling of gerrit work in CI, as I'm prioritizing the account work)
19:14:15 <clarkb> #topic Update Config Management
19:14:33 <clarkb> I'm not aware of any items under this heading to talk about, but thought I'd ask before skipping ahead
19:14:59 <fungi> nothing new this week afaik
19:16:21 <clarkb> #topic General Topics
19:16:32 <clarkb> #topic OpenAFS cluster status
19:16:55 <clarkb> ianw: I think this may be all complete now? (aside from making room on dfw01's vicepa?)
19:17:05 <clarkb> we have a third db server and all servers are upgraded to focal now?
19:17:22 <ianw> yeah, i've moved on to the kerberos kdc hosts, which are related
19:17:50 <ianw> i hope to get ansible for that finished very soon; i don't think an in-place upgrade is as important there but probably easiest
19:18:01 <clarkb> good point. Should we drop this topic and track kerberos under the general upgrades heading or would you like to keep this here as a separate item?
19:18:23 <clarkb> also thank you for pushing on this, our openafs cluster should be much happier now and possibly ready for 2.0 whenever that becomes something to consider
19:18:43 <ianw> i think we can remove it
19:18:55 <clarkb> ok
19:19:07 <ianw> you're right i owe looking at the fedora mirror bits, on the todo list
19:19:20 <clarkb> #topic Borg Backups
19:19:42 <clarkb> Last I heard we were going to try manually running the backup for gitea db and see if we could determine why it is sad but only to one target
19:19:46 <clarkb> any new news on that?
19:19:46 <fungi> the errors for gitea01 ceased. not sure if anyone changed anything there?
19:19:59 <ianw> i did not
19:20:20 <clarkb> I did not either
19:20:39 <clarkb> I guess we keep our eyes open for recurrence but can probably drop this topic now too?
19:21:18 <ianw> yep!  i think we're done there
19:21:21 <clarkb> great
19:21:25 <clarkb> thank you for working on this as well
19:21:34 <fungi> i just rechecked the root inbox to be sure, no more gitea01 errors
19:21:47 <clarkb> #topic Puppet replacements and Server upgrades
19:22:01 <fungi> though the ticket about fedora 33 being unable to reach emergency consoles seems to be waiting for an update from us
19:22:25 <clarkb> I've rotated all the zuul-executors at this point. That means zuul-mergers and executors are done. Next on my list was nodepool launchers
19:23:06 <clarkb> I think these are going to be a bit more involved since we need to keep the old launcher from interfering with the new one. One idea I had was to land a change that sets max-servers: 0 on the old host and max-servers: valid-value on the new server, then remove the old server when it stops managing any hosts
19:23:15 <clarkb> corvus wasn't sure if that would be safe (it sounds like maybe it would be)
19:23:37 <clarkb> not sure if we want to find out the hard way or do a more careful disable old server, wait for it to idle, start new server setup
19:23:53 <clarkb> the downside with the careful approach is we'll drop our node count by the number of nodes in that provider in the interim
19:24:09 <ianw> if anyone would know, corvus would :)
19:24:26 <ianw> it doesn't seem like turning one to zero would communicate anything to the other
19:24:31 <clarkb> I've got some changes to rebase and clean up anyway related to this so I'll look at it a bit more and see if I can convince myself the first idea is safe
19:24:44 <ianw> it feels like the other one would just see more resources available
19:24:45 <clarkb> ianw: I think the concern is that they may see each other as leaking nodes within the provider
19:25:16 <clarkb> also possibly the max-servers 0 instance may reject node requests for the provider since it has no quota. Not sure how the node request rejections work though and if they would be unique enough to avoid that problem.
19:25:50 <clarkb> if the node requests are handled with host+provider unique info we would be ok. I can check on that
19:26:39 <clarkb> That was all I had on this though
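As a rough illustration of the first idea (max-servers: 0 on the old launcher, the real value on the new one), a small sanity check that parses two hypothetical launcher configs and confirms only one of them would actually manage the provider; the file paths and provider name are made up.

    # Sketch: confirm that only one launcher config sets max-servers above
    # zero for the provider being migrated. Paths/provider are hypothetical.
    import yaml

    OLD_CONFIG = "old-launcher-nodepool.yaml"
    NEW_CONFIG = "new-launcher-nodepool.yaml"
    PROVIDER = "example-provider"

    def max_servers(path, provider_name):
        with open(path) as f:
            config = yaml.safe_load(f)
        for provider in config.get("providers", []):
            if provider.get("name") == provider_name:
                # max-servers is set per pool in the nodepool config.
                return sum(pool.get("max-servers", 0)
                           for pool in provider.get("pools", []))
        return 0

    old_total = max_servers(OLD_CONFIG, PROVIDER)
    new_total = max_servers(NEW_CONFIG, PROVIDER)
    print(f"old launcher max-servers: {old_total}, new launcher: {new_total}")
    if old_total and new_total:
        raise SystemExit("both launchers would try to manage this provider")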
19:26:46 <clarkb> ianw: anything to add re kerberos servers?
19:27:53 <ianw> no, wip, but i think i have a handle on it after a crash course on kerberos configuration files :)
19:28:06 <clarkb> let us know when you want reviews
19:28:21 <clarkb> #topic Deploy new refstack
19:28:35 <clarkb> kopecmartin: ianw: any luck sorting out the api situation (and wsgi etc)
19:28:41 <kopecmartin> this is ready: https://review.opendev.org/c/opendev/system-config/+/776292
19:29:00 <kopecmartin> i wrote a comment as well so that we know why the vhost was edited the way it is
19:29:02 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/776292 refstack api url change
19:29:13 <clarkb> thanks I've got that on my list for review now
19:29:20 <kopecmartin> great, thanks
19:29:25 <kopecmartin> i tested it and it seems ok
19:29:29 <clarkb> I guess we land that then retest the new server and take it from there?
19:29:32 <kopecmartin> so I'd say it can go to production
19:30:16 <clarkb> the comment helps, thank you for that
19:30:26 <clarkb> anything else related to refstack?
19:30:34 <ianw> ok, it seems we don't quite understand what is going on, but i doubt any of us have a lot of effort to put into it if it seems to work
19:30:57 <kopecmartin> yeah, i'm out of time on this unfortunately
19:31:14 <kopecmartin> nothing else from my side
19:31:50 <clarkb> kopecmartin: ok, left a quick note for something I noticed on that change
19:31:57 <clarkb> if we update that I expect we can land it
19:32:19 <clarkb> #topic Bridge Disk Space
19:32:46 <clarkb> the major consumer of disk here got tracked down to stevedore (thank you ianw and mordred and frickler) writing out entrypoint cache files
19:33:03 <clarkb> latest stevedore avoids writing those caches when it can detect it is running under ansible
19:33:29 <clarkb> ianw wrote some changes to upgrade stevedore since the major problem was ansible-related, but also wrote out a disable file to the cache dir to avoid other uses polluting the dir
19:33:39 <clarkb> ianw: it also looks like you cleaned up the stale cache files.
19:33:51 <clarkb> Anything else to bring up on this? I think we can consider this a solved problem
19:34:31 <ianw> yep i checked the deployment of stevedore and cleaned up those files, so ++
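For reference, a sketch of the disable-file approach: recent stevedore skips its entry-point cache when a marker file is present in the cache directory. The directory and ".disable" marker name below are assumptions about recent stevedore behaviour, so verify them against the installed version before relying on this.

    # Sketch: drop a disable marker into stevedore's entry-point cache dir so
    # ad-hoc invocations stop writing cache files there. Path and marker name
    # are assumptions; check the installed stevedore release.
    import pathlib

    cache_dir = pathlib.Path.home() / ".cache" / "python-entrypoints"
    cache_dir.mkdir(parents=True, exist_ok=True)
    (cache_dir / ".disable").touch()
    print(f"created {cache_dir / '.disable'}")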
19:35:19 <clarkb> #topic PTG Prep
19:35:53 <clarkb> The next PTG dates have been announced as April 19-23. We have been asked to fill out a survey by March 25 if we wish to participate
19:36:13 <clarkb> I'd be interested in hearing from others if they think this will be valuable or not.
19:36:41 <clarkb> The last PTG was full of distractions and a lot of other stuff going on and it felt less useful to me. But I'm not sure if that was due to circumstances or if this smaller group just doesn't need as much synchronous time
19:37:05 <clarkb> curious to hear what others think. I'm happy to organize time for us if we want to participate, just let me know
19:38:17 <clarkb> maybe think about it and we can bring it up again in next week's meeting and take it from there
19:38:29 <clarkb> #topic Open Discussion
19:38:38 <clarkb> That was all I had on the agenda, anything else?
19:39:42 <ianw> i haven't got to the new review server setup although the server is started
19:40:02 <ianw> i did wonder if maybe we should run some performance things on it now
19:40:35 <clarkb> we might also want to consider if we need a bigger instance?
19:40:51 <ianw> i know the old server has hot-migrated itself around, but i wonder if maybe the new server is in a new rack or something and might be faster?
19:41:14 <clarkb> but ya some performance sanity checks make sense to me. In addition to cpu checks disk io checking (against the ssd volume?) might be good
19:41:18 <ianw> i'm not sure how it works on the backend.  perhaps there's the "openstack rack" in a corner and that's what we're always on :)
19:41:35 <ianw> i think the next size up was 96gb
19:41:40 <clarkb> ianw: ya i don't know either. That said the notedb migration was much quicker on review-test than it ended up being on prod
19:41:54 <clarkb> it is possible that the newer hosts gain some benefit somewhere based on ^
19:42:08 <clarkb> too many variables involved to say for sure though
19:42:55 <ianw> performance2-90         | 90 GB Performance                 |  92160 |   40 |       900 |    24 | N/A
19:43:28 <clarkb> ianw: it is probably worth considering ^ since we're tight on memory right now and one of my theories is the lack of space for the kernel to cache things may be impacting io
19:43:36 <clarkb> (of course I haven't been able to test that in any reasonable way)
19:44:00 <ianw> we also have an onMetal flavour
19:44:21 <clarkb> I think we had been asked to not onmetal at some point
19:45:37 <ianw> ok, the medium flavor looks about right with 64gb and 24vcpu; but yeah, we may not actually have any quota
19:46:14 <clarkb> we do have a couple of services we have talked about turning off like pbx and mqtt
19:46:48 <clarkb> and if we go more radical dropping elasticsearch would free up massive resources. At this point the trade between better gerrit and no elasticsearch may be worthwhile
19:46:58 <ianw> i'd have to check but i think we'd be very tight to add 30gb ATM
19:47:06 <clarkb> definitely something to consider, I don't really want to hold your work up overthinking it though
19:48:18 <ianw> at this point i need to get back to pulling apart where we've mixed openstack.org/opendev.org in ansible so we have a clear path for adding the new server anyway
19:48:50 <clarkb> ok, fungi frickler corvus ^ maybe you can think about that a bit and we can decide if we should go bigger (and make necessary sacrifices if so)
19:49:31 <frickler> going bigger would mean changing IPs, right? would it be an option to move to vexxhost then?
19:49:51 <ianw> frickler: either way we've been looking at a new host and changing IPs
19:50:14 <clarkb> frickler: that may also be something to consider especially if we want to fine tune sizing
19:50:28 <corvus> o/
19:50:41 <clarkb> my concern with a change like that would be the disk io (we can test it to ensure some confidence in it though). We'd also want to talk to mnaser and see if that is reasonable
19:50:43 <frickler> iirc mnaser has nice fast amd cpus
19:51:02 <clarkb> frickler: yup, but then all the disk is ceph and I'm not sure how that compares to $ssd gerrit cinder volume we have currently
19:51:11 <clarkb> it may be great it may not be, something to check
19:51:19 <frickler> sure
19:51:34 <mnaser> clarkb / frickler: our ceph is all nvme/ssd backed
19:51:41 <mnaser> and we also have local (but unreliable) storage available
19:52:13 <mnaser> depending on your timeline, we're rolling out access to baremetal systems
19:52:23 <mnaser> so that might be an interesting option too
19:52:43 <frickler> depending on your timeline, we might consider waiting for that ;)
19:52:49 <corvus> i like the idea of increasing gerrit size; i also like the idea of moving it to vexx if mnaser is ok;
19:52:51 <clarkb> mnaser: is that something you might be interested in hosting on vexxhost? we're thinking that a bigger server will probably help with some of the performance issues. In particular we allocate a ton of memory to the jvm and that impacts the kernel's ability to cache at its level
19:53:11 <clarkb> mnaser: the current server is 60GB ram + 16 vcpu and we'd probably want to bump up both of those axis if possible
19:53:37 <mnaser> hm
19:53:49 <mnaser> so, we're 'recycling' our old compute nodes to make them available as baremetal instances
19:54:01 <ianw> (plus a 256gb attached volume for /home/gerrit2)
19:54:32 <mnaser> so you'd have 2x 240G for OS (RAID-1), 2x 960G disks (for whatever you want to use them, including raid), 384gb memory, but the cpus arent the newest, but..
19:54:34 <mnaser> it's not vcpus
19:55:07 <clarkb> part of me likes the simplicity of VMs. If they are on a failing host they get migrated somewhere else
19:55:11 <corvus> baremetal sounds good assuming remote disk and floating ips to cope with hardware failure; with our issues changing ips, i wouldn't want to need to do an ip change to address a failure
19:55:14 <clarkb> but there is a performance impact
19:55:35 <clarkb> corvus: thats a better way of describing what I'm saying I think
19:55:39 <mnaser> 40 thread cpu systems, but yeah
19:55:40 <corvus> clarkb: yeah, my preference is still fully virtualized until we hit that performance wall :)
19:56:12 <mnaser> virtual works too, our cpu to mem ratio is 4 so
19:56:18 <mnaser> for every 1 vcpu => 4 gb of memory
19:56:25 <mnaser> 32vcpus => 128gb memory
19:56:42 <clarkb> mnaser: is 96GB and 24 vcpu a possibility?
19:56:56 <clarkb> (I haven't looked at flavors lately)
19:56:59 <mnaser> i think there is a flavor with that size i believe
19:57:07 <mnaser> if not we can make it happen as long as it fits the ratio
19:57:09 <clarkb> I suspect that sort of bump may be reasonable given the current situation on 60 + 16
19:57:33 <clarkb> we wouldn't really increase the jvm heap allocation from 48gb, we'd just let the kernel participate in file caching
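To make that concrete, rough headroom arithmetic from the numbers above (48GB heap, 60GB current flavor, 96GB proposed), ignoring everything else running on the host:

    # Rough arithmetic only; other processes on the host are ignored.
    heap_gb = 48
    for total_gb, label in [(60, "current 60GB/16vcpu"), (96, "proposed 96GB/24vcpu")]:
        print(f"{label}: ~{total_gb - heap_gb} GB left for OS + page cache")
    # current 60GB/16vcpu: ~12 GB left for OS + page cache
    # proposed 96GB/24vcpu: ~48 GB left for OS + page cache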
19:57:54 <mnaser> also, i'd advise against using a floating ip (so traffic is not hairpinned) but instead attach publicly directly -- you can keep the port and reuse it if you need to
19:58:29 <corvus> one thing to consider if we move gerrit to vexxhost is there will likely be considerable network traffic between it and zuul; probably not a big deal, but right now all of gerrit+zuul is in one data center
19:58:41 <corvus> mnaser: ++
19:58:54 <clarkb> mnaser: that looks like: create a port with an ip in neutron (but not a floating ip), then when we openstack server create or similar, pass the port in for the network info?
19:59:09 <mnaser> admins can create ports with any ips
19:59:11 <mnaser> so can help with that
19:59:15 <clarkb> mnaser: gotcha
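A minimal openstacksdk sketch of the approach mnaser describes (boot on the public network via a pre-created port instead of a floating ip, keeping the port for reuse); the cloud name, network, image, flavor, and address are placeholders, and pinning a specific address generally needs admin help as noted above.

    # Sketch: create a neutron port on the public network and boot the server
    # attached to it; the port (and its address) can be kept and reattached
    # later. All names/addresses below are placeholders.
    import openstack

    conn = openstack.connect(cloud="example-cloud")

    public_net = conn.network.find_network("public")
    port = conn.network.create_port(
        network_id=public_net.id,
        name="review-public-port",
        fixed_ips=[{"ip_address": "203.0.113.10"}],  # admin-only in most clouds
    )

    server = conn.compute.create_server(
        name="review-new",
        image_id=conn.compute.find_image("example-image").id,
        flavor_id=conn.compute.find_flavor("example-flavor").id,
        networks=[{"port": port.id}],
    )
    conn.compute.wait_for_server(server)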
19:59:57 <clarkb> we are just about at time. It sounds like mnaser isn't opposed to the idea. In addition to untangling opendev vs openstack maybe the next step here is to decide what an instance in vexxhost should look like and discuss those specifics with mnaser?
20:00:10 <mnaser> +1, also recommend going to mtl for this one
20:00:13 <clarkb> then we can spin that up and do some perf testing to make sure we aren't missing something important and take it from there
20:00:36 <clarkb> I'll go ahead and end the meeting now so that we can have lunch/dinner/breakfast
20:00:40 <clarkb> #endmeeting