Tuesday, 2021-03-09

*** hamalq has quit IRC  01:24
*** hashar has joined #opendev-meeting  07:09
*** hashar has quit IRC  08:19
*** hashar has joined #opendev-meeting  09:25
*** hashar has quit IRC  11:08
*** hashar has joined #opendev-meeting  13:04
*** hashar has quit IRC  15:28
*** hashar has joined #opendev-meeting  15:57
*** hashar has quit IRC  17:07
*** hamalq has joined #opendev-meeting  18:30
*** hashar has joined #opendev-meeting  18:53
clarkb  anyone else here for the meeting?  19:00
clarkb  we will get started shortly  19:00
ianw  o/  19:00
clarkb  #startmeeting infra  19:01
openstack  Meeting started Tue Mar  9 19:01:06 2021 UTC and is due to finish in 60 minutes.  The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.  19:01
openstack  Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.  19:01
*** openstack changes topic to " (Meeting topic: infra)"  19:01
openstack  The meeting name has been set to 'infra'  19:01
clarkb  #link http://lists.opendev.org/pipermail/service-discuss/2021-March/000195.html Our Agenda  19:01
clarkb  #topic Announcements  19:01
*** openstack changes topic to "Announcements (Meeting topic: infra)"  19:01
clarkb  clarkb out March 23rd, could use a volunteer meeting chair or plan to skip  19:02
clarkb  I'll probably just let this resolve itself. If you see a meeting agenda next week, show up to the meeting; otherwise skip it :)  19:02
clarkb  er sorry, it's 2 weeks from now  19:03
clarkb  I'm getting too excited :)  19:03
fungi  heh  19:03
clarkb  DST change happens for those of us in North America this weekend. EU and others follow in a few weeks.  19:03
fungi  you're in a hurry for northern hemisphere spring i guess  19:03
ianw  :) i am around so can run it  19:03
clarkb  ianw: thanks!  19:03
clarkb  heads up on the DST changes starting soon for many of us. You'll want to update your calendars if you operate in local time  19:03
clarkb  North America is this weekend, then EU in like two weeks and Australia in 3? something like that  19:04
clarkb  #topic Actions from last meeting  19:04
*** openstack changes topic to "Actions from last meeting (Meeting topic: infra)"  19:04
clarkb  #link http://eavesdrop.openstack.org/meetings/infra/2021/infra.2021-03-02-19.01.txt minutes from last meeting  19:04
clarkb  corvus has started the jitsi unfork  19:04
clarkb  #link https://review.opendev.org/c/opendev/system-config/+/778308  19:05
clarkb  I don't think we need to re-action that; we can track it with the change now. Currently it is failing CI for some reason, and I haven't had a chance to look at it, though I should try to look at it today  19:05
clarkb  #topic Priority Efforts  19:05
*** openstack changes topic to "Priority Efforts (Meeting topic: infra)"  19:05
clarkb  #topic OpenDev  19:05
*** openstack changes topic to "OpenDev (Meeting topic: infra)"  19:05
clarkb  The gerrit account inconsistency work continues.  19:05
clarkb  Since we last recorded the status here I have deleted conflicting external ids from about 35 accounts that were previously inactive. These were considered safe changes because the accounts were already disabled.  19:06
clarkb  I have also put another 70 something through the disabling process in preparation for deleting their external ids (something I'd like to do today or tomorrow if we are comfortable with the list). These were accounts that the audit script identified as not having valid openids or ssh usernames or ssh keys or any reviewed changes or pushed changes  19:07
clarkb  essentially they were never used and cannot be used for anything. I set them inactive on Friday to cause them to flip an error if they were used over the last few days, but I haven't seen anything related to that.  19:08
clarkb  I'll put together the external id cleanup input file and run that when we are ready  19:08
fungi  yeah, i don't see any way those accounts could be used in their current state, so should be safe  19:08
fungi  hard to say they were never logged into, but they can't be logged into now anyway  19:08
fungi  possible at least some are lingering remnants of earlier account merges/cleanups  19:09
*** hamalq has quit IRC  19:09
fungi  but just vestiges now if so  19:09
clarkb  yup  19:09
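A rough sketch (in Python) of the kind of per-account filter clarkb describes above; the field names and record layout are illustrative assumptions, not the actual audit script:

    def is_unused_account(account: dict) -> bool:
        """Return True when an account looks safe to retire.

        Mirrors the criteria described above: no valid openid, no ssh
        username or keys, and no reviewed or pushed changes.
        """
        has_login = (bool(account.get("valid_openids"))
                     or bool(account.get("ssh_username"))
                     or bool(account.get("ssh_keys")))
        has_activity = (account.get("reviewed_changes", 0) > 0
                        or account.get("pushed_changes", 0) > 0)
        return not has_login and not has_activity

    # Example: an account passing this filter would be set inactive first,
    # then have its conflicting external ids deleted later.
    example = {"valid_openids": [], "ssh_username": None, "ssh_keys": [],
               "reviewed_changes": 0, "pushed_changes": 0}
    assert is_unused_account(example)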
*** hamalq has joined #opendev-meeting  19:09
clarkb  Sort of related to this we had a user reach out to service-incident about not being able to log in. This was for an entirely new account though and appears to be the moving openid then email conflict problem.  19:09
clarkb  they removed their email from moderation themselves but I reached out anyway asking them for info on how they got into that state and offered a couple of options for moving forward (they managed to create a second account with a third non-conflicting email which would work, or we can apply these same retirement and external id cleanups to the original account and have them try again)  19:10
clarkb  will wait and see what they say. I cc'd fungi on the email since fungi would've gotten the moderation notice too. But kept it off of public lists so we can talk about email addrs and details like that  19:11
fungi  not that service-incident is a public list anyway  19:11
fungi  but also not really on-topic  19:11
clarkb  fungi: right, I was thinking we could discuss it on service-discuss if not for all the personal details  19:12
fungi  yep  19:12
clarkb  any other opendev items to discuss? if not we can move on? (I've sort of stalled out on the profiling of gerrit work in CI, as I'm prioritizing the account work)  19:13
clarkb  #topic Update Config Management  19:14
*** openstack changes topic to "Update Config Management (Meeting topic: infra)"  19:14
clarkb  I'm not aware of any items under this heading to talk about, but thought I'd ask before skipping ahead  19:14
fungi  nothing new this week afaik  19:14
clarkb  #topic General Topics  19:16
*** openstack changes topic to "General Topics (Meeting topic: infra)"  19:16
clarkb  #topic OpenAFS cluster status  19:16
*** openstack changes topic to "OpenAFS cluster status (Meeting topic: infra)"  19:16
clarkb  ianw: I think this may be all complete now? (aside from making room on dfw01's vicepa?)  19:16
clarkb  we have a third db server and all servers are upgraded to focal now?  19:17
ianw  yeah, i've moved on to the kerberos kdc hosts, which are related  19:17
ianw  i hope to get ansible for that finished very soon; i don't think an in-place upgrade is as important there but probably easiest  19:17
clarkb  good point. Should we drop this topic and track kerberos under the general upgrades heading, or would you like to keep this here as a separate item?  19:18
clarkb  also thank you for pushing on this, our openafs cluster should be much happier now and possibly ready for 2.0, whenever that is something to consider  19:18
ianw  i think we can remove it  19:18
clarkb  ok  19:18
ianw  you're right, i owe a look at the fedora mirror bits; on the todo list  19:19
clarkb  #topic Borg Backups  19:19
*** openstack changes topic to "Borg Backups (Meeting topic: infra)"  19:19
clarkb  Last I heard we were going to try manually running the backup for the gitea db and see if we could determine why it is sad, but only to one target  19:19
clarkb  any new news on that?  19:19
fungi  the errors for gitea01 ceased. not sure if anyone changed anything there?  19:19
ianw  i did not  19:19
clarkb  I did not either  19:20
clarkb  I guess we keep our eyes open for recurrence but can probably drop this topic now too?  19:20
ianw  yep!  i think we're done there  19:21
clarkb  great  19:21
clarkb  thank you for working on this as well  19:21
fungi  i just rechecked the root inbox to be sure, no more gitea01 errors  19:21
clarkb  #topic Puppet replacements and Server upgrades  19:21
*** openstack changes topic to "Puppet replacements and Server upgrades (Meeting topic: infra)"  19:21
fungi  though the ticket about fedora 33 being unable to reach emergency consoles seems to be waiting for an update from us  19:22
clarkb  I've rotated all the zuul-executors at this point. That means zuul-mergers and executors are done. Next on my list was nodepool launchers  19:22
clarkb  I think these are going to be a bit more involved since we need to keep the old launcher from interfering with the new launcher. One idea I had was to land a change that sets max-servers: 0 on the old host and max-servers: valid-value on the new server, and then remove the old server when it stops managing any hosts  19:23
clarkb  corvus wasn't sure if that would be safe (it sounds like maybe it would be)  19:23
clarkb  not sure if we want to find out the hard way or do a more careful disable old server, wait for it to idle, start new server setup  19:23
clarkb  the downside with the careful approach is we'll drop our node count by the number of nodes in that provider in the interim  19:23
ianw  if anyone would know, corvus would :)  19:24
ianw  it doesn't seem like turning one to zero would communicate anything to the other  19:24
clarkb  I've got some changes to rebase and clean up anyway related to this so I'll look at it a bit more and see if I can convince myself the first idea is safe  19:24
ianw  it feels like the other one would just see more resources available  19:24
clarkb  ianw: I think the concern is that they may see each other as leaking nodes within the provider  19:24
clarkb  also possibly the max-servers 0 instance may reject node requests for the provider since it has no quota. Not sure how the node request rejections work though and if they would be unique enough to avoid that problem.  19:25
clarkb  if the node requests are handled with host+provider unique info we would be ok. I can check on that  19:25
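For the more careful route (disable the old launcher, wait for the provider to idle, then start the new one), something like the following illustrative openstacksdk sketch could confirm the provider has actually drained before the new launcher is brought up; the cloud name and node name prefix are placeholders, not the real configuration:

    import openstack

    # Placeholder cloud name; substitute the provider the old launcher manages.
    conn = openstack.connect(cloud="old-provider")

    # Count servers that look like nodepool nodes (the "np" prefix is an assumption).
    nodes = [s for s in conn.compute.servers() if s.name.startswith("np")]
    if nodes:
        print(f"{len(nodes)} nodes still active, keep waiting:")
        for s in nodes:
            print(f"  {s.name} ({s.status})")
    else:
        print("provider is idle; safe to start the new launcher")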
clarkb  That was all I had on this though  19:26
clarkb  ianw: anything to add re kerberos servers?  19:26
ianw  no, wip, but i think i have a handle on it after a crash course on kerberos configuration files :)  19:27
clarkb  let us know when you want reviews  19:28
clarkb  #topic Deploy new refstack  19:28
*** openstack changes topic to "Deploy new refstack (Meeting topic: infra)"  19:28
clarkb  kopecmartin: ianw: any luck sorting out the api situation (and wsgi etc)?  19:28
kopecmartin  this is ready: https://review.opendev.org/c/opendev/system-config/+/776292  19:28
kopecmartin  i wrote comments as well so that we know why the vhost was edited the way it is  19:29
clarkb  #link https://review.opendev.org/c/opendev/system-config/+/776292 refstack api url change  19:29
clarkb  thanks, I've got that on my list for review now  19:29
kopecmartin  great, thanks  19:29
kopecmartin  i tested it and it seems ok  19:29
clarkb  I guess we land that then retest the new server and take it from there?  19:29
kopecmartin  so I'd say it can go to production  19:29
clarkb  the comment helps, thank you for that  19:30
clarkb  anything else related to refstack?  19:30
ianw  ok, it seems we don't quite understand what is going on, but i doubt any of us have a lot of effort to put into it if it seems to work  19:30
kopecmartin  yeah, i'm out of time on this unfortunately  19:30
kopecmartin  nothing else from my side  19:31
clarkb  kopecmartin: ok, left a quick note for something I noticed on that change  19:31
clarkb  if we update that I expect we can land it  19:31
clarkb  #topic Bridge Disk Space  19:32
*** openstack changes topic to "Bridge Disk Space (Meeting topic: infra)"  19:32
clarkb  the major consumer of disk here got tracked down to stevedore (thank you ianw and mordred and frickler) writing out entrypoint cache files  19:32
clarkb  latest stevedore avoids writing those caches when it can detect it is running under ansible  19:33
clarkb  ianw wrote some changes to upgrade stevedore as the major problem was ansible related, but also wrote out a disable file to the cache dir to avoid other uses polluting the dir  19:33
clarkb  ianw: it also looks like you cleaned up the stale cache files.  19:33
clarkb  Anything else to bring up on this? I think we can consider this a solved problem  19:33
ianw  yep, i checked the deployment of stevedore and cleaned up those files, so ++  19:34
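For reference, a small sketch of the kind of cleanup described above. The cache location is an assumption based on stevedore's defaults, so verify it on the host before removing anything; deletion is left commented out:

    from pathlib import Path

    # Assumed stevedore entry-point cache location; confirm before deleting.
    cache_dir = Path.home() / ".cache" / "python-entrypoints"

    stale = sorted(cache_dir.glob("*.json")) if cache_dir.is_dir() else []
    total = sum(f.stat().st_size for f in stale)
    print(f"{len(stale)} cache files, {total / 1024 / 1024:.1f} MiB")

    # Uncomment to actually remove them once the listing looks right.
    # for f in stale:
    #     f.unlink()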
clarkb  #topic PTG Prep  19:35
*** openstack changes topic to "PTG Prep (Meeting topic: infra)"  19:35
clarkb  The next PTG dates have been announced as April 19-23. We have been asked to fill out a survey by March 25 to indicate interest in participating if we wish to participate  19:35
clarkb  I'd be interested in hearing from others if they think this will be valuable or not.  19:36
clarkb  The last PTG was full of distractions and a lot of other stuff going on and it felt less useful to me. But I'm not sure if that was due to circumstances or if this smaller group just doesn't need as much synchronous time  19:36
clarkb  curious to hear what others think. I'm happy to organize time for us if we want to participate, just let me know  19:37
clarkb  maybe think about it and we can bring it up again in next week's meeting and take it from there  19:38
clarkb  #topic Open Discussion  19:38
*** openstack changes topic to "Open Discussion (Meeting topic: infra)"  19:38
clarkb  That was all I had on the agenda, anything else?  19:38
ianw  i haven't got to the new review server setup although the server is started  19:39
ianw  i did wonder if maybe we should run some performance things on it now  19:40
clarkb  we might also want to consider if we need a bigger instance?  19:40
ianw  i know the old server has hot-migrated itself around, but i wonder if maybe the new server is in a new rack or something and might be faster?  19:40
clarkb  but ya, some performance sanity checks make sense to me. In addition to cpu checks, disk io checking (against the ssd volume?) might be good  19:41
ianw  i'm not sure how it works on the backend.  perhaps there's the "openstack rack" in a corner and that's what we're always on :)  19:41
ianw  i think the next size up was 96gb  19:41
clarkb  ianw: ya, i don't know either. That said the notedb migration was much quicker on review-test than it ended up being on prod  19:41
clarkb  it is possible that the newer hosts gain some benefit somewhere based on ^  19:41
clarkb  too many variables involved to say for sure though  19:42
ianw  performance2-90         | 90 GB Performance                 |  92160 |   40 |       900 |    24 | N/A  19:42
clarkb  ianw: it is probably worth considering ^ since we're tight on memory right now and one of my theories is the lack of space for the kernel to cache things may be impacting io  19:43
clarkb  (of course I haven't been able to test that in any reasonable way)  19:43
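Rough numbers behind that theory (illustrative only; assumes the ~48GB JVM heap mentioned later in this discussion and a guessed allowance for everything else on the host):

    heap_gb = 48          # current Gerrit JVM heap, per the later discussion
    overhead_gb = 4       # rough guess for OS + other processes; illustrative

    for total_gb in (60, 90, 96):
        page_cache_gb = total_gb - heap_gb - overhead_gb
        print(f"{total_gb:3d} GB host -> ~{page_cache_gb} GB left for the page cache")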
ianw  we also have an onMetal flavour  19:44
clarkb  I think we had been asked to not use onmetal at some point  19:44
ianw  ok, the medium flavor looks about right with 64gb and 24vcpu; but yeah, we may not actually have any quota  19:45
clarkb  we do have a couple of services we have talked about turning off like pbx and mqtt  19:46
clarkb  and if we go more radical, dropping elasticsearch would free up massive resources. At this point the trade between better gerrit and no elasticsearch may be worthwhile  19:46
ianw  i'd have to check but i think we'd be very tight to add 30gb ATM  19:46
clarkb  definitely something to consider, I don't really want to hold your work up overthinking it though  19:47
ianw  at this point i need to get back to pulling apart where we've mixed openstack.org/opendev.org in ansible so we have a clear path for adding the new server anyway  19:48
clarkb  ok, fungi frickler corvus ^ maybe you can think about that a bit and we can decide if we should go bigger (and make necessary sacrifices if so)  19:48
frickler  going bigger would mean changing IPs, right? would it be an option to move to vexxhost then?  19:49
ianw  frickler: either way we've been looking at a new host and changing ips  19:49
clarkb  frickler: that may also be something to consider, especially if we want to fine tune sizing  19:50
corvus  o/  19:50
clarkb  my concern with a change like that would be the disk io (we can test it to ensure some confidence in it though). We'd also want to talk to mnaser and see if that is reasonable  19:50
frickler  iirc mnaser has nice fast amd cpus  19:50
clarkb  frickler: yup, but then all the disk is ceph and I'm not sure how that compares to the $ssd gerrit cinder volume we have currently  19:51
clarkb  it may be great, it may not be, something to check  19:51
frickler  sure  19:51
mnaser  clarkb / frickler: our ceph is all nvme/ssd backed  19:51
mnaser  and we also have local (but unreliable) storage available  19:51
mnaser  depending on your timeline, we're rolling out access to baremetal systems  19:52
mnaser  so that might be an interesting option too  19:52
frickler  depending on your timeline, we might consider waiting for that ;)  19:52
corvus  i like the idea of increasing gerrit size; i also like the idea of moving it to vexx if mnaser is ok;  19:52
clarkb  mnaser: is that something you might be interested in hosting on vexxhost? we're thinking that a bigger server will probably help with some of the performance issues. In particular we allocate a ton of memory to the jvm and that impacts the kernel's ability to cache at its level  19:52
clarkb  mnaser: the current server is 60GB ram + 16 vcpu and we'd probably want to bump up both of those axes if possible  19:53
mnaser  hm  19:53
mnaser  so, we're 'recycling' our old compute nodes to make them available as baremetal instances  19:53
ianw  (plus a 256gb attached volume for /home/gerrit2)  19:54
mnaser  so you'd have 2x 240G for OS (RAID-1), 2x 960G disks (for whatever you want to use them, including raid), 384gb memory, but the cpus aren't the newest, but..  19:54
mnaser  it's not vcpus  19:54
clarkb  part of me likes the simplicity of VMs. If they are on a failing host they get migrated somewhere else  19:55
corvus  baremetal sounds good assuming remote disk and floating ips to cope with hardware failure; with our issues changing ips, i wouldn't want to need to do an ip change to address a failure  19:55
clarkb  but there is a performance impact  19:55
clarkb  corvus: that's a better way of describing what I'm saying I think  19:55
mnaser  40 thread cpu systems, but yeah  19:55
corvus  clarkb: yeah, my preference is still fully virtualized until we hit that performance wall :)  19:55
mnaser  virtual works too, our cpu to mem ratio is 4 so  19:56
mnaser  for every 1 vcpu => 4 gb of memory  19:56
mnaser  32vcpus => 128gb memory  19:56
clarkb  mnaser: is 96GB and 24 vcpu a possibility?  19:56
clarkb  (I haven't looked at flavors lately)  19:56
mnaser  i think there is a flavor with that size i believe  19:56
mnaser  if not we can make it happen as long as it fits the ratio  19:57
clarkb  I suspect that sort of bump may be reasonable given the current situation on 60 + 16  19:57
clarkb  we wouldn't really increase jvm heap allocation from 48gb, we'd just let the kernel participate in file caching  19:57
mnaser  also, i'd advise against using a floating ip (so traffic is not hairpinned) but instead attach directly to the public network -- you can keep the port and reuse it if you need to  19:57
corvus  one thing to consider if we move gerrit to vexxhost is there will likely be considerable network traffic between it and zuul; probably not a big deal, but right now all of gerrit+zuul is in one data center  19:58
corvus  mnaser: ++  19:58
clarkb  mnaser: that looks like create a port with an ip in neutron (but not a floating ip), then when we openstack server create or similar, pass the port value in for network info?  19:58
mnaser  admins can create ports with any ips  19:59
mnaser  so can help with that  19:59
clarkb  mnaser: gotcha  19:59
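A minimal openstacksdk sketch of what mnaser suggests above; the cloud name, network name, server name, and image/flavor ids are placeholders, and in practice an admin would create the port if a specific address is wanted:

    import openstack

    conn = openstack.connect(cloud="vexxhost")

    # Create (or have an admin create) a port on the public network so the
    # address survives server rebuilds/replacements.
    net = conn.network.find_network("public")
    port = conn.network.create_port(network_id=net.id, name="review-gerrit")

    server = conn.compute.create_server(
        name="review.example.org",       # placeholder name
        image_id="IMAGE_UUID",           # placeholder
        flavor_id="FLAVOR_UUID",         # placeholder
        networks=[{"port": port.id}],    # attach the pre-created port
    )
    conn.compute.wait_for_server(server)
    print(port.fixed_ips)  # the address for DNS; the port can be reused later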
clarkb  we are just about at time. It sounds like mnaser isn't opposed to the idea. In addition to untangling opendev vs openstack, maybe the next step here is to decide what an instance in vexxhost should look like and discuss those specifics with mnaser?  19:59
mnaser  +1, also recommend going to mtl for this one  20:00
clarkb  then we can spin that up and do some perf testing to make sure we aren't missing something important and take it from there  20:00
clarkb  I'll go ahead and end the meeting now so that we can have lunch/dinner/breakfast  20:00
clarkb  #endmeeting  20:00
*** openstack changes topic to "Incident management and meetings for the OpenDev sysadmins; normal discussions are in #opendev"  20:00
openstack  Meeting ended Tue Mar  9 20:00:40 2021 UTC.  Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4)  20:00
openstack  Minutes:        http://eavesdrop.openstack.org/meetings/infra/2021/infra.2021-03-09-19.01.html  20:00
openstack  Minutes (text): http://eavesdrop.openstack.org/meetings/infra/2021/infra.2021-03-09-19.01.txt  20:00
openstack  Log:            http://eavesdrop.openstack.org/meetings/infra/2021/infra.2021-03-09-19.01.log.html  20:00
clarkb  thanks everyone and feel free to continue conversations in #opendev or on the mailing list  20:00
corvus  mnaser: aww, i was hoping for sjc ;)  20:01
fungi  thanks clarkb!  20:02
clarkb  corvus: the worst thing is when silly ISPs send you halfway across the country to hit local resources due to peering and route costs  20:04
clarkb  corvus: up here it's really common to go to at least seattle before returning to oregon  20:04
ianw  clarkb: my brother-in-law lives in what could only be termed the middle of nowhere.  he's signed up for starlink ... going to be very interested if he ends up with better internet than me in a suburb of a major city  20:15
fungi  my folks did the early signup too. same deal. their only current "broadband" option is slow and often dead at&t adsl  20:16
fungi  though they're in a tight valley, i warned them that it may be a while before there's a satellite which isn't behind a mountain for them  20:17
clarkb  ianw: eventually you should be able to get starlink too? though that may be a long way away  20:27
*** irclogbot_3 has quit IRC  20:31
*** irclogbot_1 has joined #opendev-meeting  20:32
*** sboyron has quit IRC  20:38
fungi  in more ways than one  20:38
*** hashar has quit IRC  22:53
