19:01:11 <clarkb> #startmeeting infra
19:01:13 <openstack> Meeting started Tue Feb 16 19:01:11 2021 UTC and is due to finish in 60 minutes.  The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:14 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:16 <openstack> The meeting name has been set to 'infra'
19:01:21 <clarkb> #link http://lists.opendev.org/pipermail/service-discuss/2021-February/000184.html Our Agenda
19:01:34 <clarkb> Sorry I had meant to send this out yesterday and got it put together but then got distracted by server upgrades
19:03:30 <clarkb> #topic Announcements
19:03:38 <clarkb> There were none listed
19:03:40 <clarkb> #topic Actions from last meeting
19:03:48 <clarkb> #link http://eavesdrop.openstack.org/meetings/infra/2021/infra.2021-02-09-19.01.txt minutes from last meeting
19:03:57 <clarkb> we had two (though only one got properly recorded)
19:04:07 <clarkb> ianw was looking at wiki borg backups (I think this got done?)
19:04:49 <ianw> yes wiki is now manually configured to be backing up to borg
19:05:28 <clarkb> corvus had an action to unfork jitsi meet
19:05:38 <clarkb> the web component at least (everything else is already unforked)
19:06:09 <corvus> not done, feel free to re-action
19:06:22 <clarkb> #action corvus unfork jitsi meet web component
19:06:32 <clarkb> #topic Priority Efforts
19:06:37 <clarkb> #topic OpenDev
19:06:38 <fungi> i also saw ianw's request for me to double-check the setup on wiki.o.o, will try to get to that after the meeting wraps up
19:06:43 <clarkb> fungi: thanks
19:07:19 <clarkb> I did further investigation of gerrit inconsistent accounts and wrote up notes on review-test
19:07:31 <clarkb> I won't go through all the status of things because I don't think much has changed since the last meeting
19:07:49 <clarkb> but I could use another set or two of eyeballs to look over what I've written down to see if the choices described there make sense
19:08:00 <clarkb> if they do then the next step is likely to make that staging All-Users repo and start committing changes
19:08:14 <clarkb> we don't need to work through that in the meeting but if you have time to look at it and want me to walk you through it let me know
19:08:58 <clarkb> I was going to call out a couple of Gerrit 3.3 related changes but looks like both have merged at this point. Thank you reviewers
19:09:29 <clarkb> For the gitea OOM problems we've noticed recently I pushed up a haproxy rate limiting framework change
19:09:31 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/774023 Rate limiting framework change for haproxy.
19:09:57 <clarkb> I doubt that is mergeable as is, but if you have a chance to review it and provide thoughts like "never" or "we could probably get away with $RATE" that may be useful for future occurrences
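(For context, a minimal sketch of what per-source connection-rate limiting looks like in an haproxy frontend; the frontend name, table sizing, and threshold below are placeholders, not what change 774023 proposes.)

```
frontend balance_git_https
    bind :::443
    # track per-source connection rate over a 60 second window
    stick-table type ipv6 size 100k expire 60s store conn_rate(60s)
    tcp-request connection track-sc0 src
    # reject new connections from sources exceeding the placeholder rate
    tcp-request connection reject if { sc0_conn_rate gt 100 }
    default_backend balance_git_https
```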
19:10:05 <fungi> i'm feeling like if the count of conflicting accounts is really that high, we should consider sorting by which have the most recent (review/owner) activity and prioritize those, then just disable any which are inactive and let people know, rather than manually investigating hundreds of accounts
19:10:05 <clarkb> that said, I am beginning to suspect that these problems may be self induced
19:10:43 <clarkb> fungi: yup, I'm beginning to think that may be the case. We could do a rough quick search for active accounts, manually check and fix those, then do retirement for all others
19:10:44 <fungi> er, inactive in the sense of not used recently
19:10:56 <fungi> not the inactive account flag specifically
19:11:09 <clarkb> I can look at the data from that perspective and write up a set of alternate notes
19:11:15 <fungi> maybe also any which are referenced in groups
19:11:31 <clarkb> judging based on the existing data I expect that may be something like 50 accounts max that we have to sort out manually and the rest we can just retire
19:11:32 <fungi> but those are likely very few at this point
19:11:43 <clarkb> but would need to do that audit
19:11:55 <fungi> i can try to help with that
19:11:58 <clarkb> thanks
19:12:21 <clarkb> To help investigate further whether the gitea ooms may be self-inflicted by our aggressive project description updates I've been trying to get some server metrics into our system-config-run jobs
19:12:23 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/775051 Dstat stat gathering in our system-config-run jobs to measure relative performance impacts.
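(As a rough illustration of the kind of host metric collection 775051 aims for, assuming dstat is available on the test node; the flags, interval, and output path here are placeholders.)

```
# sample time, cpu, memory, disk, network, and load every 60s, writing CSV alongside terminal output
dstat --time --cpu --mem --disk --net --load --output /var/log/dstat-csv.log 60 &
```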
19:12:49 <clarkb> That failed in gitea previously, but I just pushed a rebase to help make gerrit load testing a thing
19:13:06 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/775883 Gerrit load testing attempt
19:13:26 <clarkb> there was a recent email to the gerrit mailing list about gatling-git which can be used to do artificial load testing against a gerrit and that inspired me to write ^
19:13:33 <clarkb> I think that could be very useful for us if I can manage to make it work
19:13:43 <clarkb> in particular I'm interested in seeing differences between 3.2 and 3.3
19:13:54 <ianw> lgtm; unfortunately i couldn't find a good way to provide a visualization of that
19:14:05 <ianw> (dstat)
19:14:24 <clarkb> ya I think we can approach this step by step and add bits we find are lacking or would be helpful
19:14:49 <clarkb> anyway this has all been in service of me trying to better profile our services as we've had a couple of issues around that recently
19:15:00 <clarkb> I think the work is promising but still early and may have very rough edges :)
19:15:30 <clarkb> ianw and fungi also updated some links on opendev docs and front page to better point people at our incident list
19:15:43 <clarkb> Are there any other opendev related items to bring up before we move on?
19:17:37 <clarkb> #topic Update Config Management
19:18:14 <clarkb> ianw: the new refstack deployment is happy now? we are just waiting on testing before scheduling a migration?
19:18:51 <ianw> well i guess i have migrated it
19:19:01 <ianw> the data at least
19:19:18 <ianw> yes, not sure what else to do other than click around a bit?
19:19:25 <clarkb> right but refstack.openstack.org is still pointed at the old server (so we'll need to do testing, then schedule a downtime where we can update dns and remigrate the data)
19:19:42 <clarkb> I think kopecmartin had some ideas around testing, probably just point kopecmartin at it to start and see what that turns up
19:19:42 <ianw> has any new data come into it?
19:20:01 <clarkb> new data does occasionally show up, though I don't know if it has in this window
19:20:22 <ianw> you can access the site via https://refstack01.openstack.org/#/
19:21:02 <clarkb> I'll try to catch kopecmartin and point them to ^
19:21:05 <clarkb> and then we can take it from there
19:21:10 <ianw> ++
19:21:25 <clarkb> fungi: ianw: I also saw that ansible was reenabled on some afs nodes
19:21:32 <clarkb> any updates on that to go over?
19:22:40 <fungi> i think it's all caught up, now we can focus on ubuntu upgrades on those
19:22:45 <ianw> that was a small problem i created that fungi fixed :)
19:23:04 <ianw> yep, trying some in-place focal upgrades is now pretty much top of my todo
19:23:05 <fungi> more like a minor oversight in the massive volume of work you completed to get all that done
19:23:25 <clarkb> ++ and thanks for the followup there fungi
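(For reference, the usual in-place Ubuntu release upgrade path being considered above, sketched with generic steps; the real per-server procedure will need service-specific preparation.)

```
# bring the current release fully up to date, then run the release upgrader
sudo apt-get update && sudo apt-get -y dist-upgrade
sudo do-release-upgrade
```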
19:23:36 <clarkb> Any other config management items to cover?
19:24:23 <ianw> semi related is
19:24:25 <ianw> #link https://review.opendev.org/c/opendev/system-config/+/775546
19:24:31 <ianw> to upgrade grafana, just to keep in sync
19:24:58 <clarkb> looks like an easy review
19:25:48 <clarkb> #topic General Topics
19:26:00 <clarkb> We just went over afs so we can skip to bup and borg backups
19:26:05 <clarkb> #topic Bup and Borg Backups
19:26:13 <clarkb> wiki has been assimilated
19:26:54 <fungi> resistance was substantial, but eventually futile
19:27:02 <clarkb> any other updates? should we consider removing this topic from our meetings?
19:27:52 <ianw> umm maybe keep it for one more week as i clean it up
19:28:00 <ianw> #link https://review.opendev.org/c/opendev/system-config/+/766630
19:28:09 <ianw> would be good to look at, which removes bup things
19:28:14 <clarkb> ok
19:28:32 <ianw> i left removing the cron jobs just as a manual task, it's easy enough to just delete them
19:29:00 <clarkb> sounds good
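(A minimal sketch of that manual cleanup, assuming the bup entry lives in root's crontab on each backed-up host; the exact entry varies per server.)

```
# confirm which entry is the bup one, then edit the crontab and delete that line
sudo crontab -l | grep bup
sudo crontab -e
```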
19:30:11 <clarkb> #topic Enable Xenial to Bionic/Focal system upgrades
19:30:19 <clarkb> #link https://etherpad.opendev.org/p/infra-puppet-conversions-and-xenial-upgrades Start capturing TODO list here
19:30:48 <clarkb> please add additional info on todo items there. I add them as I come across them (though I have many other distractions too)
19:31:13 <clarkb> I also intend to start looking at zuul, nodepool, and zookeeper os upgrades as soon as the zuul release settles
19:31:34 <clarkb> I'm hopeful we can largely just roll through those by adding new servers, and removing old ones
19:31:41 <clarkb> the zuul scheduler being the exception there
19:32:11 <fungi> if we were already on zuul v5...
19:32:13 <fungi> ;)
19:32:27 <clarkb> if others have time to start looking at other services (I know ianw has been talking about looking at review, thanks) that would be much appreciated
19:33:37 <clarkb> #topic opendev.org not reachable via IPv6 from some ISPs
19:33:51 <clarkb> frickler put this item on the agenda. frickler are you around to talk about it? If not I'll do my best
19:34:02 <frickler> yeah so I brought this up mainly to add some nagging toward mnaser
19:34:17 <frickler> or maybe find some other contact at vexxhost
19:34:48 <frickler> the issue is that the IPv6 prefix vexxhost is using is not properly registered, so some ISPs (like mine) are not routing it
19:34:56 <clarkb> noonedeadpunk is another contact there
19:35:30 <frickler> oh, great, I can try that
19:35:45 <fungi> it's specifically about how the routes are being announced in bgp, right?
19:36:20 <frickler> the issue is in the route registry, which providers use to filter bgp announcements
19:36:40 <fungi> usually the way we dealt with it in $past_life was to also announce our aggregates from all borders
19:36:44 <frickler> they registered only a /32, but announce multiple /48s instead
19:36:57 <clarkb> I see, so it's a separate record that routers will check against to ensure they don't accept bad bgp advertisements?
19:37:26 <fungi> so you announce the /32 to all your peers but also the individual /48 prefixes or whatever from the gateways which can route for them best
19:37:56 <frickler> vexxhost only needs to create route objects for the individual /48s matching what they announce via bgp
19:38:07 <fungi> and yes, there is basically a running list maintained by the address registries which says which prefix lengths to expect
19:38:40 <fungi> out of what ranges
19:39:35 <frickler> the prefix opendev.org is in is 2604:e100:3::/48, which is what they announce via their upstreams
19:39:51 <fungi> and operators wishing to optimize their table sizes use that list to implement filters
19:39:52 <frickler> but a route object only exists for 2604:e100::/32
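(For illustration, the missing RPSL route6 object for the announced /48 would look roughly like this; the origin ASN and maintainer below are placeholders, not vexxhost's real values.)

```
route6:   2604:e100:3::/48
descr:    example route object matching the announced /48 (placeholder)
origin:   AS64496
mnt-by:   MAINT-EXAMPLE
source:   RADB
```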
19:40:14 <frickler> no, that's not about table size, it is general bgp sanity
19:40:31 <frickler> except not too many providers care about that
19:40:42 <frickler> but I expect that to change in the future
19:40:50 <fungi> the main sanity they care about is "will the table overrun my allocated memory in some routers"
19:41:18 <fungi> (and it's no fun when your border routers start crashing and rebooting in a loop as soon as they peer, let me tell you)
19:41:29 <frickler> this is more related to the possibility of route hijacking
19:42:05 <clarkb> frickler: where does this registry live? arin (those IPs are hosted in the USA iirc)
19:42:09 <fungi> yeah, but that possibility exists with or without that filter list, and affects v4 as well
19:42:35 <frickler> in that case it would be arin maybe, though the /32 is registered in radb
19:42:46 <clarkb> (mostly just curious, I know we can't update it for them)
19:43:05 <frickler> I don't know all the details for american networks, in europe it would be RIPE
19:43:40 <clarkb> ok, in any case I would see if noonedeadpunk can help
19:43:59 <ianw> (ftp://ftp.radb.net/radb/dbase/level3.db.gz contains a large amount of ascii art of cartoon characters, which is ... interesting)
19:44:33 <clarkb> anything else on this topic?
19:45:06 <frickler> no, fine for me
19:45:18 <clarkb> #topic Open Discussion
19:45:21 <clarkb> Anything else?
19:46:13 <fungi> yeah, the individual lirs make and (generally) publish their allocation policies indicating what size allocations they're making from what ranges
19:46:28 <fungi> they tend to expect you to at least have aggregates announced for those
19:47:09 <fungi> er, s/lirs/rirs/
19:48:29 <clarkb> sounds like that may be it?
19:48:34 <clarkb> I'll give it another couple of minutes
19:49:51 <fungi> you find recommendations like "route-filter 2600::/12 prefix-length-range /19-/32;" in old lists, e.g. https://www.space.net/~gert/RIPE/ipv6-filters.html
19:50:27 <fungi> that's the /12 which covers our address, and the recommendation is to only accept prefixes between /19 and /32 long in it
19:51:08 <clarkb> and sounds like that may be it, thanks everyone.
19:51:11 <fungi> so if a provider is using a filter like that, they'll discard the /48 routes vexxhost is announcing
19:51:14 <clarkb> we can continue the ipv6 discussion in #opendev
19:51:17 <fungi> thanks clarkb!
19:51:20 <clarkb> #endmeeting