19:01:11 #startmeeting infra
19:01:13 Meeting started Tue Feb 16 19:01:11 2021 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:14 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:16 The meeting name has been set to 'infra'
19:01:21 #link http://lists.opendev.org/pipermail/service-discuss/2021-February/000184.html Our Agenda
19:01:34 Sorry, I had meant to send this out yesterday and got it put together, but then got distracted by server upgrades
19:03:30 #topic Announcements
19:03:38 There were none listed
19:03:40 #topic Actions from last meeting
19:03:48 #link http://eavesdrop.openstack.org/meetings/infra/2021/infra.2021-02-09-19.01.txt minutes from last meeting
19:03:57 we had two (though only one got properly recorded)
19:04:07 ianw was looking at wiki borg backups (I think this got done?)
19:04:49 yes, wiki is now manually configured to be backing up to borg
19:05:28 corvus had an action to unfork jitsi meet
19:05:38 the web component at least (everything else is already unforked)
19:06:09 not done, feel free to re-action
19:06:22 #action corvus unfork jitsi meet web component
19:06:32 #topic Priority Efforts
19:06:37 #topic OpenDev
19:06:38 i also saw ianw's request for me to double-check the setup on wiki.o.o, will try to get to that after the meeting wraps up
19:06:43 fungi: thanks
19:07:19 I did further investigation of gerrit inconsistent accounts and wrote up notes on review-test
19:07:31 I won't go through all the status of things because I don't think much has changed since the last meeting
19:07:49 but I could use another set or two of eyeballs to look over what I've written down to see if the choices described there make sense
19:08:00 if they do then the next step is likely to make that staging All-Users repo and start committing changes
19:08:14 we don't need to work through that in the meeting, but if you have time to look at it and want me to walk you through it let me know
19:08:58 I was going to call out a couple of Gerrit 3.3 related changes, but it looks like both have merged at this point. Thank you reviewers
19:09:29 For the gitea OOM problems we've noticed recently I pushed up a haproxy rate limiting framework change
19:09:31 #link https://review.opendev.org/c/opendev/system-config/+/774023 Rate limiting framework change for haproxy.
19:09:57 I doubt that is mergeable as is, but if you have a chance to review it and provide thoughts like "never" or "we could probably get away with $RATE" that may be useful for future occurrences
19:10:05 i'm feeling like if the count of conflicting accounts is really that high, we should consider sorting by which have the most recent (review/owner) activity and prioritize those, then just disable any which are inactive and let people know, rather than manually investigating hundreds of accounts
19:10:05 that said, I am beginning to suspect that these problems may be self-induced
19:10:43 fungi: yup, I'm beginning to think that may be the case. We could do a rough quick search for active accounts, manually check and fix those, then do retirement for all others
19:10:44 er, inactive in the sense of not used recently
19:10:56 not the inactive account flag specifically
19:11:09 I can look at the data from that perspective and write up a set of alternate notes
19:11:15 maybe also any which are referenced in groups
19:11:31 judging based on the existing data I expect it may be something like 50 accounts max that we have to sort out manually, and the rest we can just retire
19:11:32 but those are likely very few at this point
19:11:43 but would need to do that audit
19:11:55 i can try to help with that
19:11:58 thanks
19:12:21 To help investigate further whether the gitea OOMs may be self-inflicted by our aggressive project description updates, I've been trying to get some server metrics into our system-config-run jobs
19:12:23 #link https://review.opendev.org/c/opendev/system-config/+/775051 Dstat stat gathering in our system-config-run jobs to measure relative performance impacts.
19:12:49 That failed in gitea previously, but I just pushed a rebase to help make gerrit load testing a thing
19:13:06 #link https://review.opendev.org/c/opendev/system-config/+/775883 Gerrit load testing attempt
19:13:26 there was a recent email to the gerrit mailing list about gatling-git, which can be used to do artificial load testing against a gerrit, and that inspired me to write ^
19:13:33 I think that could be very useful for us if I can manage to make it work
19:13:43 in particular I'm interested in seeing differences between 3.2 and 3.3
19:13:54 lgtm; unfortunately i couldn't find a good way to provide a visualization of that
19:14:05 (dstat)
19:14:24 ya, I think we can approach this step by step and add bits we find are lacking or would be helpful
19:14:49 anyway, this has all been in service of me trying to better profile our services, as we've had a couple of issues around that recently
19:15:00 I think the work is promising but still early and may have very rough edges :)
19:15:30 ianw and fungi also updated some links on the opendev docs and front page to better point people at our incident list
19:15:43 Are there any other opendev related items to bring up before we move on?
19:17:37 #topic Update Config Management
19:18:14 ianw: the new refstack deployment is happy now? we are just waiting on testing before scheduling a migration?
19:18:51 well, i guess i have migrated it
19:19:01 the data at least
19:19:18 yes, not sure what else to do other than click around a bit?
19:19:25 right, but refstack.openstack.org is still pointed at the old server (so we'll need to do testing, then schedule a downtime where we can update dns and remigrate the data)
19:19:42 I think kopecmartin had some ideas around testing, probably just point kopecmartin at it to start and see what that turns up
19:19:42 has any new data come into it?
19:20:01 new data does occasionally show up, though I don't know if it has in this window
19:20:22 you can access the site via https://refstack01.openstack.org/#/
19:21:02 I'll try to catch kopecmartin and point them to ^
19:21:05 and then we can take it from there
19:21:10 ++
19:21:25 fungi: ianw: I also saw that ansible was re-enabled on some afs nodes
19:21:32 any updates on that to go over?
19:22:40 i think it's all caught up, now we can focus on ubuntu upgrades on those
19:22:45 that was a small problem i created that fungi fixed :)
19:23:04 yep, trying some in-place focal upgrades is now pretty much top of my todo
19:23:05 more like a minor oversight in the massive volume of work you completed to get all that done
19:23:25 ++ and thanks for the followup there fungi
19:23:36 Any other config management items to cover?
19:24:23 semi-related is
19:24:25 #link https://review.opendev.org/c/opendev/system-config/+/775546
19:24:31 to upgrade grafana, just to keep in sync
19:24:58 looks like an easy review
19:25:48 #topic General Topics
19:26:00 We just went over afs, so we can skip to bup and borg backups
19:26:05 #topic Bup and Borg Backups
19:26:13 wiki has been assimilated
19:26:54 resistance was substantial, but eventually futile
19:27:02 any other updates? should we consider removing this topic from our meetings?
19:27:52 umm, maybe keep it for one more week as i clean it up
19:28:00 #link https://review.opendev.org/c/opendev/system-config/+/766630
19:28:09 would be good to look at, which removes bup things
19:28:14 ok
19:28:32 i left removing the cron jobs just as a manual task, it's easy enough to just delete them
19:29:00 sounds good
19:30:11 #topic Enable Xenial to Bionic/Focal system upgrades
19:30:19 #link https://etherpad.opendev.org/p/infra-puppet-conversions-and-xenial-upgrades Start capturing TODO list here
19:30:48 please add additional info on todo items there. I add them as I come across them (though have many other distractions too)
19:31:13 I also intend to start looking at zuul, nodepool, and zookeeper OS upgrades as soon as the zuul release settles
19:31:34 I'm hopeful we can largely just roll through those by adding new servers and removing old ones
19:31:41 the zuul scheduler being the exception there
19:32:11 if we were already on zuul v5...
19:32:13 ;)
19:32:27 if others have time to start looking at other services (I know ianw has talked about looking at review, thanks) that would be much appreciated
19:33:37 #topic opendev.org not reachable via IPv6 from some ISPs
19:33:51 frickler put this item on the agenda. frickler, are you around to talk about it? If not I'll do my best
19:34:02 yeah, so I brought this up mainly to add some nagging toward mnaser
19:34:17 or maybe find some other contact at vexxhost
19:34:48 the issue is that the IPv6 prefix vexxhost is using is not properly registered, so some ISPs (like mine) are not routing it
19:34:56 noonedeadpunk is another contact there
19:35:30 oh, great, I can try that
19:35:45 it's specifically about how the routes are being announced in bgp, right?
19:36:20 the issue is in the route registry, which providers use to filter bgp announcements
19:36:40 usually the way we dealt with it in $past_life was to also announce our aggregates from all borders
19:36:44 they registered only a /32, but announce multiple /48s instead
19:36:57 I see, so it's a separate record that routers will check against to ensure they don't accept bad bgp advertisements?
19:37:26 so you announce the /32 to all your peers, but also the individual /48 prefixes or whatever from the gateways which can route for them best
19:37:56 vexxhost only needs to create route objects for the individual /48s matching what they announce via bgp
19:38:07 and yes, there is basically a running list maintained by the address registries which says which prefix lengths to expect
19:38:40 out of what ranges
19:39:35 the prefix opendev.org is in is 2604:e100:3::/48, which is what they announce via their upstreams
19:39:51 and operators wishing to optimize their table sizes use that list to implement filters
19:39:52 but a route object only exists for 2604:e100::/32
19:40:14 no, that's not about table size, it is general bgp sanity
19:40:31 except not too many providers care about that
19:40:42 but I expect that to change in the future
19:40:50 the main sanity they care about is "will the table overrun my allocated memory in some routers"
19:41:18 (and it's no fun when your border routers start crashing and rebooting in a loop as soon as they peer, let me tell you)
19:41:29 this is more related to the possibility of route hijacking
19:42:05 frickler: where does this registry live? arin? (those IPs are hosted in the USA iirc)
19:42:09 yeah, but that possibility exists with or without that filter list, and affects v4 as well
19:42:35 in that case it would be arin maybe, though the /32 is registered in radb
19:42:46 (mostly just curious, I know we can't update it for them)
19:43:05 I don't know all the details for American networks, in Europe it would be RIPE
19:43:40 ok, in any case I would see if noonedeadpunk can help
19:43:59 (ftp://ftp.radb.net/radb/dbase/level3.db.gz contains a large amount of ascii art of cartoon characters, which is ... interesting)
19:44:33 anything else on this topic?
19:45:06 no, fine for me
19:45:18 #topic Open Discussion
19:45:21 Anything else?
19:46:13 yeah, the individual lirs make and (generally) publish their allocation policies indicating what size allocations they're making from what ranges
19:46:28 they tend to expect you to at least have aggregates announced for those
19:47:09 er, s/lirs/rirs/
19:48:29 sounds like that may be it?
19:48:34 I'll give it another couple of minutes
19:49:51 you find recommendations like "route-filter 2600::/12 prefix-length-range /19-/32;" in old lists, e.g. https://www.space.net/~gert/RIPE/ipv6-filters.html
19:50:27 that's the /12 which covers our address, and the recommendation is to only accept prefixes between /19 and /32 long in it
19:51:08 and sounds like that may be it, thanks everyone.
19:51:11 so if a provider is using a filter like that, they'll discard the /48 routes vexxhost is announcing
19:51:14 we can continue the ipv6 discussion in #opendev
19:51:17 thanks clarkb!
19:51:20 #endmeeting
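
Note on the haproxy rate limiting item (19:09:29, change 774023): the log only mentions the change, so the snippet below is not its contents, just a minimal sketch of the common stick-table connection-rate pattern such a framework might build on. The frontend/backend names and the "100 connections per minute" threshold are placeholders, not values from the discussion.

    frontend balance_git_https
        bind *:443
        mode tcp
        # track each client address in a stick table and count its connection rate
        # (ipv6 table type also stores IPv4 clients as mapped addresses)
        stick-table type ipv6 size 100k expire 5m store conn_rate(1m)
        tcp-request connection track-sc0 src
        # reject clients exceeding the placeholder rate of 100 connections per minute
        tcp-request connection reject if { sc0_conn_rate gt 100 }
        default_backend gitea_servers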
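Note on the IPv6 routing item: the fix described at 19:37:56 amounts to registering IRR route6 objects for each announced /48 so they match what is announced via bgp, alongside the existing 2604:e100::/32 object. A minimal sketch of such an object follows; the origin ASN, maintainer, and description are placeholders rather than vexxhost's real values, and the object would be submitted to whichever IRR database (RADB or ARIN's IRR, per the discussion) vexxhost keeps its objects in.

    route6:     2604:e100:3::/48
    descr:      placeholder description for the announced prefix
    origin:     AS64496            # placeholder ASN; must be the AS actually originating the /48
    mnt-by:     MAINT-EXAMPLE      # placeholder maintainer object
    source:     RADB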