19:01:06 #startmeeting infra
19:01:06 Meeting started Tue Mar 9 19:01:06 2021 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:07 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:09 The meeting name has been set to 'infra'
19:01:21 #link http://lists.opendev.org/pipermail/service-discuss/2021-March/000195.html Our Agenda
19:01:29 #topic Announcements
19:02:12 clarkb out March 23rd, could use a volunteer meeting chair or plan to skip
19:02:37 I'll probably just let this resolve itself. If you see a meeting agenda next week show up to the meeting, otherwise skip it :)
19:03:07 er sorry, it's 2 weeks from now
19:03:11 I'm getting too excited :)
19:03:13 heh
19:03:20 DST change happens for those of us in North America this weekend. EU and others follow in a few weeks.
19:03:27 you're in a hurry for northern hemisphere spring i guess
19:03:31 :) i am around so can run it
19:03:38 ianw: thanks!
19:03:57 heads up on the DST changes starting soon for many of us. You'll want to update your calendars if you operate in local time
19:04:13 North America is this weekend, then EU in like two weeks and Australia in 3? something like that
19:04:42 #topic Actions from last meeting
19:04:48 #link http://eavesdrop.openstack.org/meetings/infra/2021/infra.2021-03-02-19.01.txt minutes from last meeting
19:04:58 corvus has started the jitsi unfork
19:05:00 #link https://review.opendev.org/c/opendev/system-config/+/778308
19:05:21 I don't think we need to re-add that action item and can track it with the change now. Currently it is failing CI for some reason, and I haven't had a chance to look at it, though I should try to look at it today
19:05:34 #topic Priority Efforts
19:05:39 #topic OpenDev
19:05:50 The gerrit account inconsistency work continues.
19:06:21 Since we last recorded the status here I have deleted conflicting external ids from about 35 accounts that were previously inactive. These were considered safe changes because the accounts were already disabled.
19:07:31 I have also put another 70-something through the disabling process in preparation for deleting their external ids (something I'd like to do today or tomorrow if we are comfortable with the list). These were accounts that the audit script identified as not having valid openids or ssh usernames or ssh keys or any reviewed changes or pushed changes
19:08:07 essentially they were never used and cannot be used for anything. I set them inactive on Friday to cause them to flip an error if they were used over the last few days, but I haven't seen anything related to that.
19:08:18 I'll put together the external id cleanup input file and run that when we are ready
19:08:22 yeah, i don't see any way those accounts could be used in their current state, so should be safe
19:08:37 hard to say they were never logged into, but they can't be logged into now anyway
19:09:11 possible at least some are lingering remnants of earlier account merges/cleanups
19:09:22 but just vestiges now if so
19:09:22 yup
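For reference, the "set inactive" step discussed above corresponds to Gerrit's documented REST endpoint for deactivating accounts. The sketch below is illustrative only: the credentials and account IDs are placeholders, and the real candidate list comes from the audit script rather than a hard-coded list.

```python
#!/usr/bin/env python3
"""Minimal sketch: mark candidate Gerrit accounts inactive over the REST API."""
import requests
from requests.auth import HTTPBasicAuth

GERRIT = "https://review.opendev.org"
# Placeholder admin credentials (a Gerrit HTTP password, not the openid login).
AUTH = HTTPBasicAuth("admin-user", "http-password")
# Hypothetical account IDs; the real list comes from the audit script output.
ACCOUNT_IDS = [1000123, 1000456]

for account_id in ACCOUNT_IDS:
    # DELETE /a/accounts/{id}/active deactivates the account.
    # Gerrit replies 204 when it flips to inactive and 409 if already inactive.
    resp = requests.delete(f"{GERRIT}/a/accounts/{account_id}/active", auth=AUTH)
    if resp.status_code in (204, 409):
        print(f"{account_id}: inactive")
    else:
        resp.raise_for_status()
```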
19:09:52 Sort of related to this, we had a user reach out to service-incident about not being able to log in. This was for an entirely new account though and appears to be the moving openid then email conflict problem.
19:10:43 they removed their email from moderation themselves, but I reached out anyway asking them for info on how they got into that state and offered a couple of options for moving forward (they managed to create a second account with a third non-conflicting email which would work, or we can apply these same retirement and external id cleanups to the original account and have them try again)
19:11:29 will wait and see what they say. I cc'd fungi on the email since fungi would've gotten the moderation notice too. But kept it off of public lists so we can talk about email addrs and details like that
19:11:52 not that service-incident is a public list anyway
19:11:57 but also not really on-topic
19:12:15 fungi: right, I was thinking we could discuss it on service-discuss if not for all the personal details
19:12:39 yep
19:13:16 any other opendev items to discuss? if not we can move on? (I've sort of stalled out on the profiling of gerrit work in CI, as I'm prioritizing the account work)
19:14:15 #topic Update Config Management
19:14:33 I'm not aware of any items under this heading to talk about, but thought I'd ask before skipping ahead
19:14:59 nothing new this week afaik
19:16:21 #topic General Topics
19:16:32 #topic OpenAFS cluster status
19:16:55 ianw: I think this may be all complete now? (aside from making room on dfw01's vicepa?)
19:17:05 we have a third db server and all servers are upgraded to focal now?
19:17:22 yeah, i've moved on to the kerberos kdc hosts, which are related
19:17:50 i hope to get ansible for that finished very soon; i don't think an in-place upgrade is as important there but probably easiest
19:18:01 good point. Should we drop this topic and track kerberos under the general upgrades heading or would you like to keep this here as a separate item?
19:18:23 also thank you for pushing on this, our openafs cluster should be much happier now and possibly ready for 2.0 whenever that happens, something to consider
19:18:43 i think we can remove it
19:18:55 ok
19:19:07 you're right, i owe looking at the fedora mirror bits, on the todo list
19:19:20 #topic Borg Backups
19:19:42 Last I heard we were going to try manually running the backup for the gitea db and see if we could determine why it is sad but only to one target
19:19:46 any new news on that?
19:19:46 the errors for gitea01 ceased. not sure if anyone changed anything there?
19:19:59 i did not
19:20:20 I did not either
19:20:39 I guess we keep our eyes open for recurrence but can probably drop this topic now too?
19:21:18 yep! i think we're done there
19:21:21 great
19:21:25 thank you for working on this as well
19:21:34 i just rechecked the root inbox to be sure, no more gitea01 errors
19:21:47 #topic Puppet replacements and Server upgrades
19:22:01 though the ticket about fedora 33 being unable to reach emergency consoles seems to be waiting for an update from us
19:22:25 I've rotated all the zuul-executors at this point. That means zuul-mergers and executors are done. Next on my list was nodepool launchers
19:23:06 I think these are going to be a bit more involved since we need to keep the old launcher from interfering with the new launcher. One idea I had was to land a change that sets max-servers: 0 on the old host and max-servers: valid-value on the new server and then remove the old server when it stops managing any hosts
19:23:15 corvus wasn't sure if that would be safe (it sounds like maybe it would be)
19:23:37 not sure if we want to find out the hard way or do the more careful approach: disable the old server, wait for it to idle, then start the new server
19:23:53 the downside with the careful approach is we'll drop our node count by the number of nodes in that provider in the interim
19:24:09 if anyone would know, corvus would :)
19:24:26 it doesn't seem like turning one to zero would communicate anything to the other
19:24:31 I've got some changes to rebase and clean up anyway related to this so I'll look at it a bit more and see if I can convince myself the first idea is safe
19:24:44 it feels like the other one would just see more resources available
19:24:45 ianw: I think the concern is that they may see each other as leaking nodes within the provider
19:25:16 also possibly the max-servers 0 instance may reject node requests for the provider since it has no quota. Not sure how the node request rejections work though and if they would be unique enough to avoid that problem.
19:25:50 if the node requests are handled with host+provider unique info we would be ok. I can check on that
19:26:39 That was all I had on this though
19:26:46 ianw: anything to add re kerberos servers?
19:27:53 no, wip, but i think i have a handle on it after a crash course on kerberos configuration files :)
19:28:06 let us know when you want reviews
19:28:21 #topic Deploy new refstack
19:28:35 kopecmartin: ianw: any luck sorting out the api situation (and wsgi etc)
19:28:41 this is ready: https://review.opendev.org/c/opendev/system-config/+/776292
19:29:00 i wrote a comment as well so that we know why the vhost was edited the way it is
19:29:02 #link https://review.opendev.org/c/opendev/system-config/+/776292 refstack api url change
19:29:13 thanks, I've got that on my list for review now
19:29:20 great, thanks
19:29:25 i tested it and it seems ok
19:29:29 I guess we land that then retest the new server and take it from there?
19:29:32 so I'd say it can go to production
19:30:16 the comment helps, thank you for that
19:30:26 anything else related to refstack?
19:30:34 ok, it seems we don't quite understand what is going on, but i doubt any of us have a lot of effort to put into it if it seems to work
19:30:57 yeah, i'm out of time on this unfortunately
19:31:14 nothing else from my side
19:31:50 kopecmartin: ok, left a quick note for something I noticed on that change
19:31:57 if we update that I expect we can land it
19:32:19 #topic Bridge Disk Space
19:32:46 the major consumer of disk here got tracked down to stevedore (thank you ianw and mordred and frickler) writing out entrypoint cache files
19:33:03 latest stevedore avoids writing those caches when it can detect it is running under ansible
19:33:29 ianw wrote some changes to upgrade stevedore as the major problem was ansible related, but also wrote out a disable file to the cache dir to avoid other uses polluting the dir
19:33:39 ianw: it also looks like you cleaned up the stale cache files.
19:33:51 Anything else to bring up on this? I think we can consider this a solved problem
19:34:31 yep i checked the deployment of stevedore and cleaned up those files, so ++
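For context on the disable-file approach mentioned above, here is a minimal sketch. The cache location and the ".disable" marker file name are assumptions based on how newer stevedore releases behave, not something confirmed in the discussion; check the stevedore version deployed on bridge before relying on them.

```python
#!/usr/bin/env python3
"""Minimal sketch: report stevedore entrypoint cache usage and drop a disable marker.

Assumptions: stevedore caches under ~/.cache/python-entrypoints and honors a
".disable" file in that directory; verify against the deployed stevedore version.
"""
import pathlib

cache_dir = pathlib.Path.home() / ".cache" / "python-entrypoints"

if cache_dir.is_dir():
    # Sum up the cache files to see how much disk they are actually consuming.
    total = sum(f.stat().st_size for f in cache_dir.rglob("*") if f.is_file())
    print(f"{cache_dir}: {total / 1024 / 1024:.1f} MiB in cache files")
else:
    cache_dir.mkdir(parents=True, exist_ok=True)

# Creating this marker is what "wrote out a disable file to the cache dir"
# refers to; the exact file name here is an assumption.
(cache_dir / ".disable").touch()
```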
19:35:19 #topic PTG Prep
19:35:53 The next PTG dates have been announced as April 19-23. We have been asked to fill out a survey by March 25 to indicate whether we wish to participate
19:36:13 I'd be interested in hearing from others if they think this will be valuable or not.
19:36:41 The last PTG was full of distractions and a lot of other stuff going on and it felt less useful to me. But I'm not sure if that was due to circumstances or if this smaller group just doesn't need as much synchronous time
19:37:05 curious to hear what others think. I'm happy to organize time for us if we want to participate, just let me know
19:38:17 maybe think about it and we can bring it up again in next week's meeting and take it from there
19:38:29 #topic Open Discussion
19:38:38 That was all I had on the agenda, anything else?
19:39:42 i haven't got to the new review server setup, although the server is started
19:40:02 i did wonder if maybe we should run some performance things on it now
19:40:35 we might also want to consider if we need a bigger instance?
19:40:51 i know the old server has hot-migrated itself around, but i wonder if maybe the new server is in a new rack or something and might be faster?
19:41:14 but ya some performance sanity checks make sense to me. In addition to cpu checks, disk io checking (against the ssd volume?) might be good
19:41:18 i'm not sure how it works on the backend. perhaps there's the "openstack rack" in a corner and that's what we're always on :)
19:41:35 i think the next size up was 96gb
19:41:40 ianw: ya i don't know either. That said the notedb migration was much quicker on review-test than it ended up being on prod
19:41:54 it is possible that the newer hosts gain some benefit somewhere based on ^
19:42:08 too many variables involved to say for sure though
19:42:55 performance2-90 | 90 GB Performance | 92160 | 40 | 900 | 24 | N/A
19:43:28 ianw: it is probably worth considering ^ since we're tight on memory right now and one of my theories is the lack of space for the kernel to cache things may be impacting io
19:43:36 (of course I haven't been able to test that in any reasonable way)
19:44:00 we also have an onMetal flavour
19:44:21 I think we had been asked to not onmetal at some point
19:45:37 ok, the medium flavor looks about right with 64gb and 24vcpu; but yeah, we may not actually have any quota
19:46:14 we do have a couple of services we have talked about turning off, like pbx and mqtt
19:46:48 and if we go more radical, dropping elasticsearch would free up massive resources. At this point the trade between better gerrit and no elasticsearch may be worthwhile
19:46:58 i'd have to check but i think we'd be very tight to add 30gb ATM
19:47:06 definitely something to consider, I don't really want to hold your work up overthinking it though
19:48:18 at this point i need to get back to pulling apart where we've mixed openstack.org/opendev.org in ansible so we have a clear path for adding the new server anyway
19:48:50 ok, fungi frickler corvus ^ maybe you can think about that a bit and we can decide if we should go bigger (and make necessary sacrifices if so)
19:49:31 going bigger would mean changing IPs, right? would it be an option to move to vexxhost then?
19:49:51 frickler: either way we've been looking at a new host and changing IPs
19:50:14 frickler: that may also be something to consider, especially if we want to fine-tune sizing
19:50:28 o/
19:50:41 my concern with a change like that would be the disk io (we can test it to ensure some confidence in it though). We'd also want to talk to mnaser and see if that is reasonable
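As a concrete example of the kind of disk io sanity check mentioned above, a rough sketch follows. The target path is a placeholder for wherever the ssd-backed volume is mounted, and a quick synced sequential-write timing like this is only a coarse comparison between hosts, not a substitute for a real benchmark tool.

```python
#!/usr/bin/env python3
"""Rough sketch: time a synced sequential write to compare block devices."""
import os
import time

TARGET = "/home/gerrit2/iotest.tmp"   # placeholder path on the volume under test
CHUNK = b"\0" * (4 * 1024 * 1024)     # write in 4 MiB chunks
TOTAL_MIB = 1024                      # write 1 GiB in total

start = time.monotonic()
with open(TARGET, "wb") as f:
    for _ in range(TOTAL_MIB // 4):
        f.write(CHUNK)
    f.flush()
    os.fsync(f.fileno())              # make sure the data actually hit the device
elapsed = time.monotonic() - start

print(f"wrote {TOTAL_MIB} MiB in {elapsed:.1f}s ({TOTAL_MIB / elapsed:.0f} MiB/s)")
os.unlink(TARGET)                     # clean up the scratch file
```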
19:50:43 iirc mnaser has nice fast amd cpus
19:51:02 frickler: yup, but then all the disk is ceph and I'm not sure how that compares to the $ssd gerrit cinder volume we have currently
19:51:11 it may be great, it may not be, something to check
19:51:19 sure
19:51:34 clarkb / frickler: our ceph is all nvme/ssd backed
19:51:41 and we also have local (but unreliable) storage available
19:52:13 depending on your timeline, we're rolling out access to baremetal systems
19:52:23 so that might be an interesting option too
19:52:43 depending on your timeline, we might consider waiting for that ;)
19:52:49 i like the idea of increasing gerrit size; i also like the idea of moving it to vexx if mnaser is ok;
19:52:51 mnaser: is that something you might be interested in hosting on vexxhost? we're thinking that a bigger server will probably help with some of the performance issues. In particular we allocate a ton of memory to the jvm and that impacts the kernel's ability to cache at its level
19:53:11 mnaser: the current server is 60GB ram + 16 vcpu and we'd probably want to bump up both of those axes if possible
19:53:37 hm
19:53:49 so, we're 'recycling' our old compute nodes to make them available as baremetal instances
19:54:01 (plus a 256gb attached volume for /home/gerrit2)
19:54:32 so you'd have 2x 240G for OS (RAID-1), 2x 960G disks (for whatever you want to use them, including raid), 384gb memory, but the cpus aren't the newest, but..
19:54:34 it's not vcpus
19:55:07 part of me likes the simplicity of VMs. If they are on a failing host they get migrated somewhere else
19:55:11 baremetal sounds good assuming remote disk and floating ips to cope with hardware failure; with our issues changing ips, i wouldn't want to need to do an ip change to address a failure
19:55:14 but there is a performance impact
19:55:35 corvus: that's a better way of describing what I'm saying I think
19:55:39 40 thread cpu systems, but yeah
19:55:40 clarkb: yeah, my preference is still fully virtualized until we hit that performance wall :)
19:56:12 virtual works too, our cpu to mem ratio is 4 so
19:56:18 for every 1 vcpu => 4 gb of memory
19:56:25 32 vcpus => 128gb memory
19:56:42 mnaser: is 96GB and 24 vcpu a possibility?
19:56:56 (I haven't looked at flavors lately)
19:56:59 i think there is a flavor with that size i believe
19:57:07 if not we can make it happen as long as it fits the ratio
19:57:09 I suspect that sort of bump may be reasonable given the current situation on 60 + 16
19:57:33 we wouldn't really increase the jvm heap allocation from 48gb, we'd just let the kernel participate in file caching
19:57:54 also, i'd advise against using a floating ip (so traffic is not hairpinned) but instead attach publicly directly -- you can keep the port and reuse it if you need to
19:58:29 one thing to consider if we move gerrit to vexxhost is there will likely be considerable network traffic between it and zuul; probably not a big deal, but right now all of gerrit+zuul is in one data center
19:58:41 mnaser: ++
19:58:54 mnaser: that looks like create a port with an ip in neutron (but not a floating ip), then when we openstack server create or similar, pass the port value in for the network info?
19:59:09 admins can create ports with any ips
19:59:11 so can help with that
19:59:15 mnaser: gotcha
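To make the port-then-server idea above concrete, here is a minimal sketch using openstacksdk. The cloud name, network, image, flavor, and fixed IP are all placeholders, and pinning a specific public address to the port would need the provider admin help mnaser mentions.

```python
#!/usr/bin/env python3
"""Minimal sketch: pre-create a neutron port and boot a server attached to it."""
import openstack

conn = openstack.connect(cloud="vexxhost")            # placeholder clouds.yaml entry

network = conn.network.find_network("public")         # hypothetical network name
port = conn.network.create_port(
    name="review-gerrit-port",
    network_id=network.id,
    # Pinning a specific address usually requires admin rights on the network.
    fixed_ips=[{"ip_address": "203.0.113.10"}],
)

image = conn.compute.find_image("ubuntu-20.04")       # placeholder image name
flavor = conn.compute.find_flavor("96gb-24vcpu")      # placeholder flavor name
server = conn.compute.create_server(
    name="review-new",
    image_id=image.id,
    flavor_id=flavor.id,
    # Attaching by port means the address survives a rebuild or replacement.
    networks=[{"port": port.id}],
)
conn.compute.wait_for_server(server)
print(server.name, server.status)
```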
19:59:57 we are just about at time. It sounds like mnaser isn't opposed to the idea. In addition to untangling opendev vs openstack, maybe the next step here is to decide what an instance in vexxhost should look like and discuss those specifics with mnaser?
20:00:10 +1, also recommend going to mtl for this one
20:00:13 then we can spin that up and do some perf testing to make sure we aren't missing something important and take it from there
20:00:36 I'll go ahead and end the meeting now so that we can have lunch/dinner/breakfast
20:00:40 #endmeeting