Tuesday, 2021-03-09

*** hamalq has quit IRC  01:24
*** hashar has joined #opendev-meeting  07:09
*** hashar has quit IRC  08:19
*** hashar has joined #opendev-meeting  09:25
*** hashar has quit IRC  11:08
*** hashar has joined #opendev-meeting  13:04
*** hashar has quit IRC  15:28
*** hashar has joined #opendev-meeting  15:57
*** hashar has quit IRC  17:07
*** hamalq has joined #opendev-meeting  18:30
*** hashar has joined #opendev-meeting  18:53
clarkb  anyone else here for the meeting?  19:00
clarkb  we will get started shortly  19:00
ianw  o/  19:00
clarkb  #startmeeting infra  19:01
openstack  Meeting started Tue Mar  9 19:01:06 2021 UTC and is due to finish in 60 minutes.  The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.  19:01
openstack  Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.  19:01
*** openstack changes topic to " (Meeting topic: infra)"  19:01
openstack  The meeting name has been set to 'infra'  19:01
clarkb  #link http://lists.opendev.org/pipermail/service-discuss/2021-March/000195.html Our Agenda  19:01
clarkb  #topic Announcements  19:01
*** openstack changes topic to "Announcements (Meeting topic: infra)"  19:01
clarkb  clarkb out March 23rd, could use a volunteer meeting chair or plan to skip  19:02
clarkb  I'll probably just let this resolve itself. If you see a meeting agenda next week, show up to the meeting; otherwise skip it :)  19:02
clarkb  er sorry, it's 2 weeks from now  19:03
clarkb  I'm getting too excited :)  19:03
fungi  heh  19:03
clarkb  DST change happens for those of us in North America this weekend. EU and others follow in a few weeks.  19:03
fungi  you're in a hurry for northern hemisphere spring i guess  19:03
ianw  :) i am around so can run it  19:03
clarkb  ianw: thanks!  19:03
clarkb  heads up on the DST changes starting soon for many of us. You'll want to update your calendars if you operate in local time  19:03
clarkb  North America is this weekend, then EU in like two weeks and Australia in 3? something like that  19:04
clarkb  #topic Actions from last meeting  19:04
*** openstack changes topic to "Actions from last meeting (Meeting topic: infra)"  19:04
clarkb  #link http://eavesdrop.openstack.org/meetings/infra/2021/infra.2021-03-02-19.01.txt minutes from last meeting  19:04
clarkb  corvus has started the jitsi unfork  19:04
clarkb  #link https://review.opendev.org/c/opendev/system-config/+/778308  19:05
clarkb  I don't think we need to re-action that; we can track it with the change now. Currently it is failing CI for some reason, and I haven't had a chance to look at it, though I should try to look at it today  19:05
clarkb  #topic Priority Efforts  19:05
*** openstack changes topic to "Priority Efforts (Meeting topic: infra)"  19:05
clarkb  #topic OpenDev  19:05
*** openstack changes topic to "OpenDev (Meeting topic: infra)"  19:05
clarkb  The gerrit account inconsistency work continues.  19:05
clarkb  Since we last recorded the status here I have deleted conflicting external ids from about 35 accounts that were previously inactive. These were considered safe changes because the accounts were already disabled.  19:06
clarkb  I have also put another 70 something through the disabling process in preparation for deleting their external ids (something I'd like to do today or tomorrow if we are comfortable with the list). These were accounts that the audit script identified as not having valid openids or ssh usernames or ssh keys or any reviewed changes or pushed changes  19:07
clarkb  essentially they were never used and cannot be used for anything. I set them inactive on Friday to cause them to flip an error if they were used over the last few days, but I haven't seen anything related to that.  19:08
clarkb  I'll put together the external id cleanup input file and run that when we are ready  19:08
fungi  yeah, i don't see any way those accounts could be used in their current state, so should be safe  19:08
fungi  hard to say they were never logged into, but they can't be logged into now anyway  19:08
fungi  possible at least some are lingering remnants of earlier account merges/cleanups  19:09
*** hamalq has quit IRC  19:09
fungi  but just vestiges now if so  19:09
clarkb  yup  19:09
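A rough sketch (in Python) of the kind of per-account filter clarkb describes above; the field names and record layout are illustrative assumptions, not the actual audit script:

    def is_unused_account(account: dict) -> bool:
        """Return True when an account looks safe to retire.

        Mirrors the criteria described above: no valid openid, no ssh
        username or keys, and no reviewed or pushed changes.
        """
        has_login = (bool(account.get("valid_openids"))
                     or bool(account.get("ssh_username"))
                     or bool(account.get("ssh_keys")))
        has_activity = (account.get("reviewed_changes", 0) > 0
                        or account.get("pushed_changes", 0) > 0)
        return not has_login and not has_activity

    # Example: an account passing this filter would be set inactive first,
    # then have its conflicting external ids deleted later.
    example = {"valid_openids": [], "ssh_username": None, "ssh_keys": [],
               "reviewed_changes": 0, "pushed_changes": 0}
    assert is_unused_account(example)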
*** hamalq has joined #opendev-meeting  19:09
clarkb  Sort of related to this we had a user reach out to service-incident about not being able to log in. This was for an entirely new account though and appears to be the moving openid then email conflict problem.  19:09
clarkb  they removed their email from moderation themselves but I reached out anyway asking them for info on how they got into that state and offered a couple of options for moving forward (they managed to create a second account with a third non-conflicting email which would work, or we can apply these same retirement and external id cleanups to the original account and have them try again)  19:10
clarkb  will wait and see what they say. I cc'd fungi on the email since fungi would've gotten the moderation notice too. But kept it off of public lists so we can talk about email addrs and details like that  19:11
fungi  not that service-incident is a public list anyway  19:11
fungi  but also not really on-topic  19:11
clarkb  fungi: right, I was thinking we could discuss it on service-discuss if not for all the personal details  19:12
fungi  yep  19:12
clarkb  any other opendev items to discuss? if not we can move on? (I've sort of stalled out on the profiling of gerrit work in CI, as I'm prioritizing the account work)  19:13
clarkb  #topic Update Config Management  19:14
*** openstack changes topic to "Update Config Management (Meeting topic: infra)"  19:14
clarkb  I'm not aware of any items under this heading to talk about, but thought I'd ask before skipping ahead  19:14
fungi  nothing new this week afaik  19:14
clarkb  #topic General Topics  19:16
*** openstack changes topic to "General Topics (Meeting topic: infra)"  19:16
clarkb  #topic OpenAFS cluster status  19:16
*** openstack changes topic to "OpenAFS cluster status (Meeting topic: infra)"  19:16
clarkb  ianw: I think this may be all complete now? (aside from making room on dfw01's vicepa?)  19:16
clarkb  we have a third db server and all servers are upgraded to focal now?  19:17
ianw  yeah, i've moved on to the kerberos kdc hosts, which are related  19:17
ianw  i hope to get ansible for that finished very soon; i don't think an in-place upgrade is as important there but probably easiest  19:17
clarkb  good point. Should we drop this topic and track kerberos under the general upgrades heading, or would you like to keep this here as a separate item?  19:18
clarkb  also thank you for pushing on this, our openafs cluster should be much happier now and possibly ready for 2.0, whenever that is something to consider  19:18
ianw  i think we can remove it  19:18
clarkb  ok  19:18
ianw  you're right, i owe a look at the fedora mirror bits; on the todo list  19:19
clarkb  #topic Borg Backups  19:19
*** openstack changes topic to "Borg Backups (Meeting topic: infra)"  19:19
clarkb  Last I heard we were going to try manually running the backup for the gitea db and see if we could determine why it is sad, but only to one target  19:19
clarkb  any new news on that?  19:19
fungi  the errors for gitea01 ceased. not sure if anyone changed anything there?  19:19
ianw  i did not  19:19
clarkb  I did not either  19:20
clarkb  I guess we keep our eyes open for recurrence but can probably drop this topic now too?  19:20
ianw  yep!  i think we're done there  19:21
clarkb  great  19:21
clarkb  thank you for working on this as well  19:21
fungi  i just rechecked the root inbox to be sure, no more gitea01 errors  19:21
clarkb  #topic Puppet replacements and Server upgrades  19:21
*** openstack changes topic to "Puppet replacements and Server upgrades (Meeting topic: infra)"  19:21
fungi  though the ticket about fedora 33 being unable to reach emergency consoles seems to be waiting for an update from us  19:22
clarkb  I've rotated all the zuul-executors at this point. That means zuul-mergers and executors are done. Next on my list was nodepool launchers  19:22
clarkb  I think these are going to be a bit more involved since we need to keep the old launcher from interfering with the new launcher. One idea I had was to land a change that sets max-servers: 0 on the old host and max-servers: valid-value on the new server, and then remove the old server when it stops managing any hosts  19:23
clarkb  corvus wasn't sure if that would be safe (it sounds like maybe it would be)  19:23
clarkb  not sure if we want to find out the hard way or do a more careful disable old server, wait for it to idle, start new server setup  19:23
clarkb  the downside with the careful approach is we'll drop our node count by the number of nodes in that provider in the interim  19:23
ianw  if anyone would know, corvus would :)  19:24
ianw  it doesn't seem like turning one to zero would communicate anything to the other  19:24
clarkb  I've got some changes to rebase and clean up anyway related to this so I'll look at it a bit more and see if I can convince myself the first idea is safe  19:24
ianw  it feels like the other one would just see more resources available  19:24
clarkb  ianw: I think the concern is that they may see each other as leaking nodes within the provider  19:24
clarkb  also possibly the max-servers 0 instance may reject node requests for the provider since it has no quota. Not sure how the node request rejections work though and if they would be unique enough to avoid that problem.  19:25
clarkb  if the node requests are handled with host+provider unique info we would be ok. I can check on that  19:25
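For the more careful route (disable the old launcher, wait for the provider to idle, then start the new one), something like the following illustrative openstacksdk sketch could confirm the provider has actually drained before the new launcher is brought up; the cloud name and node name prefix are placeholders, not the real configuration:

    import openstack

    # Placeholder cloud name; substitute the provider the old launcher manages.
    conn = openstack.connect(cloud="old-provider")

    # Count servers that look like nodepool nodes (the "np" prefix is an assumption).
    nodes = [s for s in conn.compute.servers() if s.name.startswith("np")]
    if nodes:
        print(f"{len(nodes)} nodes still active, keep waiting:")
        for s in nodes:
            print(f"  {s.name} ({s.status})")
    else:
        print("provider is idle; safe to start the new launcher")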
clarkb  That was all I had on this though  19:26
clarkb  ianw: anything to add re kerberos servers?  19:26
ianw  no, wip, but i think i have a handle on it after a crash course on kerberos configuration files :)  19:27
clarkb  let us know when you want reviews  19:28
clarkb  #topic Deploy new refstack  19:28
*** openstack changes topic to "Deploy new refstack (Meeting topic: infra)"  19:28
clarkb  kopecmartin: ianw: any luck sorting out the api situation (and wsgi etc)?  19:28
kopecmartin  this is ready: https://review.opendev.org/c/opendev/system-config/+/776292  19:28
kopecmartin  i wrote comments as well so that we know why the vhost was edited the way it is  19:29
clarkb  #link https://review.opendev.org/c/opendev/system-config/+/776292 refstack api url change  19:29
clarkb  thanks, I've got that on my list for review now  19:29
kopecmartin  great, thanks  19:29
kopecmartin  i tested it and it seems ok  19:29
clarkb  I guess we land that then retest the new server and take it from there?  19:29
kopecmartin  so I'd say it can go to production  19:29
clarkb  the comment helps, thank you for that  19:30
clarkb  anything else related to refstack?  19:30
ianw  ok, it seems we don't quite understand what is going on, but i doubt any of us have a lot of effort to put into it if it seems to work  19:30
kopecmartin  yeah, i'm out of time on this unfortunately  19:30
kopecmartin  nothing else from my side  19:31
clarkb  kopecmartin: ok, left a quick note for something I noticed on that change  19:31
clarkb  if we update that I expect we can land it  19:31
clarkb  #topic Bridge Disk Space  19:32
*** openstack changes topic to "Bridge Disk Space (Meeting topic: infra)"  19:32
clarkb  the major consumer of disk here got tracked down to stevedore (thank you ianw and mordred and frickler) writing out entrypoint cache files  19:32
clarkb  latest stevedore avoids writing those caches when it can detect it is running under ansible  19:33
clarkb  ianw wrote some changes to upgrade stevedore as the major problem was ansible related, but also wrote out a disable file to the cache dir to avoid other uses polluting the dir  19:33
clarkb  ianw: it also looks like you cleaned up the stale cache files.  19:33
clarkb  Anything else to bring up on this? I think we can consider this a solved problem  19:33
ianw  yep, i checked the deployment of stevedore and cleaned up those files, so ++  19:34
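For reference, a small sketch of the kind of cleanup described above. The cache location is an assumption based on stevedore's defaults, so verify it on the host before removing anything; deletion is left commented out:

    from pathlib import Path

    # Assumed stevedore entry-point cache location; confirm before deleting.
    cache_dir = Path.home() / ".cache" / "python-entrypoints"

    stale = sorted(cache_dir.glob("*.json")) if cache_dir.is_dir() else []
    total = sum(f.stat().st_size for f in stale)
    print(f"{len(stale)} cache files, {total / 1024 / 1024:.1f} MiB")

    # Uncomment to actually remove them once the listing looks right.
    # for f in stale:
    #     f.unlink()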
clarkb  #topic PTG Prep  19:35
*** openstack changes topic to "PTG Prep (Meeting topic: infra)"  19:35
clarkb  The next PTG dates have been announced as April 19-23. We have been asked to fill out a survey by March 25 to indicate interest in participating if we wish to participate  19:35
clarkb  I'd be interested in hearing from others if they think this will be valuable or not.  19:36
clarkb  The last PTG was full of distractions and a lot of other stuff going on and it felt less useful to me. But I'm not sure if that was due to circumstances or if this smaller group just doesn't need as much synchronous time  19:36
clarkb  curious to hear what others think. I'm happy to organize time for us if we want to participate, just let me know  19:37
clarkb  maybe think about it and we can bring it up again in next week's meeting and take it from there  19:38
clarkb  #topic Open Discussion  19:38
*** openstack changes topic to "Open Discussion (Meeting topic: infra)"  19:38
clarkb  That was all I had on the agenda, anything else?  19:38
ianw  i haven't got to the new review server setup although the server is started  19:39
ianw  i did wonder if maybe we should run some performance things on it now  19:40
clarkb  we might also want to consider if we need a bigger instance?  19:40
ianw  i know the old server has hot-migrated itself around, but i wonder if maybe the new server is in a new rack or something and might be faster?  19:40
clarkb  but ya, some performance sanity checks make sense to me. In addition to cpu checks, disk io checking (against the ssd volume?) might be good  19:41
ianw  i'm not sure how it works on the backend.  perhaps there's the "openstack rack" in a corner and that's what we're always on :)  19:41
ianw  i think the next size up was 96gb  19:41
clarkb  ianw: ya, i don't know either. That said the notedb migration was much quicker on review-test than it ended up being on prod  19:41
clarkb  it is possible that the newer hosts gain some benefit somewhere based on ^  19:41
clarkb  too many variables involved to say for sure though  19:42
ianw  performance2-90         | 90 GB Performance                 |  92160 |   40 |       900 |    24 | N/A  19:42
clarkb  ianw: it is probably worth considering ^ since we're tight on memory right now and one of my theories is the lack of space for the kernel to cache things may be impacting io  19:43
clarkb  (of course I haven't been able to test that in any reasonable way)  19:43
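Rough numbers behind that theory (illustrative only; assumes the ~48GB JVM heap mentioned later in this discussion and a guessed allowance for everything else on the host):

    heap_gb = 48          # current Gerrit JVM heap, per the later discussion
    overhead_gb = 4       # rough guess for OS + other processes; illustrative

    for total_gb in (60, 90, 96):
        page_cache_gb = total_gb - heap_gb - overhead_gb
        print(f"{total_gb:3d} GB host -> ~{page_cache_gb} GB left for the page cache")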
ianw  we also have an onMetal flavour  19:44
clarkb  I think we had been asked to not use onmetal at some point  19:44
ianw  ok, the medium flavor looks about right with 64gb and 24vcpu; but yeah, we may not actually have any quota  19:45
clarkb  we do have a couple of services we have talked about turning off like pbx and mqtt  19:46
clarkb  and if we go more radical, dropping elasticsearch would free up massive resources. At this point the trade between better gerrit and no elasticsearch may be worthwhile  19:46
ianw  i'd have to check but i think we'd be very tight to add 30gb ATM  19:46
clarkb  definitely something to consider, I don't really want to hold your work up overthinking it though  19:47
ianw  at this point i need to get back to pulling apart where we've mixed openstack.org/opendev.org in ansible so we have a clear path for adding the new server anyway  19:48
clarkb  ok, fungi frickler corvus ^ maybe you can think about that a bit and we can decide if we should go bigger (and make necessary sacrifices if so)  19:48
frickler  going bigger would mean changing IPs, right? would it be an option to move to vexxhost then?  19:49
ianw  frickler: either way we've been looking at a new host and changing ips  19:49
clarkb  frickler: that may also be something to consider, especially if we want to fine tune sizing  19:50
corvus  o/  19:50
clarkb  my concern with a change like that would be the disk io (we can test it to ensure some confidence in it though). We'd also want to talk to mnaser and see if that is reasonable  19:50
frickler  iirc mnaser has nice fast amd cpus  19:50
clarkb  frickler: yup, but then all the disk is ceph and I'm not sure how that compares to the $ssd gerrit cinder volume we have currently  19:51
clarkb  it may be great, it may not be, something to check  19:51
frickler  sure  19:51
mnaser  clarkb / frickler: our ceph is all nvme/ssd backed  19:51
mnaser  and we also have local (but unreliable) storage available  19:51
mnaser  depending on your timeline, we're rolling out access to baremetal systems  19:52
mnaser  so that might be an interesting option too  19:52
frickler  depending on your timeline, we might consider waiting for that ;)  19:52
corvus  i like the idea of increasing gerrit size; i also like the idea of moving it to vexx if mnaser is ok;  19:52
clarkb  mnaser: is that something you might be interested in hosting on vexxhost? we're thinking that a bigger server will probably help with some of the performance issues. In particular we allocate a ton of memory to the jvm and that impacts the kernel's ability to cache at its level  19:52
clarkb  mnaser: the current server is 60GB ram + 16 vcpu and we'd probably want to bump up both of those axes if possible  19:53
mnaser  hm  19:53
mnaser  so, we're 'recycling' our old compute nodes to make them available as baremetal instances  19:53
ianw  (plus a 256gb attached volume for /home/gerrit2)  19:54
mnaser  so you'd have 2x 240G for OS (RAID-1), 2x 960G disks (for whatever you want to use them, including raid), 384gb memory, but the cpus aren't the newest, but..  19:54
mnaser  it's not vcpus  19:54
clarkb  part of me likes the simplicity of VMs. If they are on a failing host they get migrated somewhere else  19:55
corvus  baremetal sounds good assuming remote disk and floating ips to cope with hardware failure; with our issues changing ips, i wouldn't want to need to do an ip change to address a failure  19:55
clarkb  but there is a performance impact  19:55
clarkb  corvus: that's a better way of describing what I'm saying I think  19:55
mnaser  40 thread cpu systems, but yeah  19:55
corvus  clarkb: yeah, my preference is still fully virtualized until we hit that performance wall :)  19:55
mnaser  virtual works too, our cpu to mem ratio is 4 so  19:56
mnaser  for every 1 vcpu => 4 gb of memory  19:56
mnaser  32vcpus => 128gb memory  19:56
clarkb  mnaser: is 96GB and 24 vcpu a possibility?  19:56
clarkb  (I haven't looked at flavors lately)  19:56
mnaser  i think there is a flavor with that size i believe  19:56
mnaser  if not we can make it happen as long as it fits the ratio  19:57
clarkb  I suspect that sort of bump may be reasonable given the current situation on 60 + 16  19:57
clarkb  we wouldn't really increase jvm heap allocation from 48gb, we'd just let the kernel participate in file caching  19:57
mnaser  also, i'd advise against using a floating ip (so traffic is not hairpinned) but instead attach directly to the public network -- you can keep the port and reuse it if you need to  19:57
corvus  one thing to consider if we move gerrit to vexxhost is there will likely be considerable network traffic between it and zuul; probably not a big deal, but right now all of gerrit+zuul is in one data center  19:58
corvus  mnaser: ++  19:58
clarkb  mnaser: that looks like create a port with an ip in neutron (but not a floating ip), then when we openstack server create or similar, pass the port value in for network info?  19:58
mnaser  admins can create ports with any ips  19:59
mnaser  so can help with that  19:59
clarkb  mnaser: gotcha  19:59
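A minimal openstacksdk sketch of what mnaser suggests above; the cloud name, network name, server name, and image/flavor ids are placeholders, and in practice an admin would create the port if a specific address is wanted:

    import openstack

    conn = openstack.connect(cloud="vexxhost")

    # Create (or have an admin create) a port on the public network so the
    # address survives server rebuilds/replacements.
    net = conn.network.find_network("public")
    port = conn.network.create_port(network_id=net.id, name="review-gerrit")

    server = conn.compute.create_server(
        name="review.example.org",       # placeholder name
        image_id="IMAGE_UUID",           # placeholder
        flavor_id="FLAVOR_UUID",         # placeholder
        networks=[{"port": port.id}],    # attach the pre-created port
    )
    conn.compute.wait_for_server(server)
    print(port.fixed_ips)  # the address for DNS; the port can be reused later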
clarkb  we are just about at time. It sounds like mnaser isn't opposed to the idea. In addition to untangling opendev vs openstack, maybe the next step here is to decide what an instance in vexxhost should look like and discuss those specifics with mnaser?  19:59
mnaser  +1, also recommend going to mtl for this one  20:00
clarkb  then we can spin that up and do some perf testing to make sure we aren't missing something important and take it from there  20:00
clarkb  I'll go ahead and end the meeting now so that we can have lunch/dinner/breakfast  20:00
clarkb  #endmeeting  20:00
*** openstack changes topic to "Incident management and meetings for the OpenDev sysadmins; normal discussions are in #opendev"  20:00
openstack  Meeting ended Tue Mar  9 20:00:40 2021 UTC.  Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4)  20:00
openstack  Minutes:        http://eavesdrop.openstack.org/meetings/infra/2021/infra.2021-03-09-19.01.html  20:00
openstack  Minutes (text): http://eavesdrop.openstack.org/meetings/infra/2021/infra.2021-03-09-19.01.txt  20:00
openstack  Log:            http://eavesdrop.openstack.org/meetings/infra/2021/infra.2021-03-09-19.01.log.html  20:00
clarkb  thanks everyone and feel free to continue conversations in #opendev or on the mailing list  20:00
corvus  mnaser: aww, i was hoping for sjc ;)  20:01
fungi  thanks clarkb!  20:02
clarkb  corvus: the worst thing is when silly ISPs send you halfway across the country to hit local resources due to peering and route costs  20:04
clarkb  corvus: up here it's really common to go to at least seattle before returning to oregon  20:04
ianw  clarkb: my brother-in-law lives in what could only be termed the middle of nowhere.  he's signed up for starlink ... going to be very interested if he ends up with better internet than me in a suburb of a major city  20:15
fungi  my folks did the early signup too. same deal. their only current "broadband" option is slow and often dead at&t adsl  20:16
fungi  though they're in a tight valley, i warned them that it may be a while before there's a satellite which isn't behind a mountain for them  20:17
clarkb  ianw: eventually you should be able to get starlink too? though that may be a long way away  20:27
*** irclogbot_3 has quit IRC  20:31
*** irclogbot_1 has joined #opendev-meeting  20:32
*** sboyron has quit IRC  20:38
fungi  in more ways than one  20:38
*** hashar has quit IRC  22:53
