19:01:28 <clarkb> #startmeeting infra
19:01:29 <openstack> Meeting started Tue Sep  8 19:01:28 2020 UTC and is due to finish in 60 minutes.  The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:30 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:32 <openstack> The meeting name has been set to 'infra'
19:01:42 <clarkb> #link http://lists.opendev.org/pipermail/service-discuss/2020-September/000082.html Our Agenda
19:01:54 <clarkb> #topic Announcements
19:01:55 <ianw> o/
19:02:47 <clarkb> I didn't have any formal announcements. But yesterday and today Oregon decided to catch on fire, so I'm semi-distracted by that. We should be ok, though a nearby field decided it wanted to be a fire instead
19:02:59 <clarkb> anyone else have anything to announce?
19:03:15 <clarkb> (oh also power outages have been a problem so I may drop out due to that too though haven't lost power yet)
19:03:23 <fungi> nothing which tops that, no ;)
19:03:46 <fungi> :/
19:03:53 <clarkb> really I expect the worst bit will be the smoke when the winds shift again. So I should just be happy right now :)
19:04:14 <clarkb> #topic Actions from last meeting
19:04:23 <clarkb> #link http://eavesdrop.openstack.org/meetings/infra/2020/infra.2020-09-01-19.01.txt minutes from last meeting
19:04:33 <clarkb> There were no actions from last meeting. Let's just dive into this one then
19:04:39 <clarkb> #topic Priority Efforts
19:04:49 <clarkb> #topic Update Config Management
19:05:04 <clarkb> I've booted a new nb03.opendev.org to run nodepool-builder with docker for arm64 image builds
19:05:20 <clarkb> That has been enrolled into our inventory but has a problem installing things because there aren't wheels for arm64 :)
19:05:29 <clarkb> #link https://review.opendev.org/750472 Add build deps for docker-compose on nb03
19:05:39 <clarkb> that should fix it, and once that's done everything should be handled by docker, so it should work
19:06:12 <clarkb> one thing that came up as part of this is that we don't seem to have ansible using sshfp records yet? or maybe we do and the issue I had was specific to having a stale known_hosts entry for a reused IP?
19:06:23 <clarkb> ianw: fungi ^ any updates on that?
19:06:49 <fungi> we have usable sshfp records for at least some hosts
19:06:54 <ianw> umm, i think that the stale known_hosts overrides the sshfp
19:07:14 <fungi> yes, if there is an existing known_hosts entry that will be used instead
19:07:22 <clarkb> gotcha, that was likely the issue here then
19:07:27 <ianw> it might be a bit of a corner case with linaro
19:07:29 <clarkb> do we expect sshfp to work otherwise?
19:07:36 <ianw> where we have a) few ip's and b) have rebuilt the mirror a lot
19:08:01 <fungi> though i also don't think bridge.o.o is configured to do VerifyHostKeyDNS=yes is it?
19:08:14 <clarkb> https://review.opendev.org/#/c/744821/ <- reviewing and landing that would be good if we expect sshfp to work now
19:08:15 <ianw> my understanding is yes, since it is using unbound and the dns records are trusted
19:08:32 <fungi> i thought VerifyHostKeyDNS=ask was the default
19:09:06 <fungi> and i couldn't find anywhere we'd overridden it
19:10:17 <fungi> ahh, ssh_config manpage on bridge.o.o claims VerifyHostKeyDNS=no is the default actually
19:10:42 <clarkb> ok we don't have to solve this in the meeting but wanted to call it out as a question that came up
19:11:02 <fungi> yeah, i'm not certain we've actually started using sshfp records for ansible runs from bridge yet
19:11:35 <clarkb> Are there other config management updates to call out?
19:12:20 <fungi> also worth noting, glibc 2.31 breaks dnssec (there are nondefault workarounds), so we need to be mindful of that when we eventually upgrade bridge.o.o, or for our own systems
19:12:44 <clarkb> fungi: is 2.31 or newer in focal?
19:12:49 <fungi> as that will also prevent openssh from relying on sshfp records
19:13:22 <fungi> yeah, focal
19:13:46 <fungi> 2.31-0ubuntu9
19:14:57 <clarkb> sounds like that may be it for config management and sshfp
19:15:00 <clarkb> #topic OpenDev
19:15:02 <ianw> we could also move back to the patch that just puts the fingerprints into known_hosts
19:15:24 <ianw> as sshfp seems like it is a nice idea, but ... perhaps more trouble than it's worth tbh
19:15:38 <clarkb> ianw: something to consider for sure
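[For reference, the check VerifyHostKeyDNS=yes performs is roughly the following: hash the host key blob and look for a matching SSHFP record in DNS. This is only a minimal Python sketch, assuming dnspython is installed; the hostname and key line are placeholders, not how OpenSSH implements it.]

    # Minimal sketch of the comparison behind VerifyHostKeyDNS=yes.
    # Assumes dnspython 2.x is installed; inputs are placeholders.
    import base64
    import hashlib

    import dns.resolver  # pip install dnspython

    def sshfp_matches(hostname, keyscan_line):
        """keyscan_line is one line of `ssh-keyscan <hostname>` output."""
        _, _, key_b64 = keyscan_line.split()[:3]
        digest = hashlib.sha256(base64.b64decode(key_b64)).hexdigest()
        for rdata in dns.resolver.resolve(hostname, "SSHFP"):
            # fp_type 2 is SHA-256; this sketch skips algorithm matching
            if rdata.fp_type == 2 and rdata.fingerprint.hex() == digest:
                return True
        return False

[As noted above, an existing known_hosts entry still takes precedence over this check, and for OpenSSH to trust the record without prompting the DNS answer also has to validate with DNSSEC, which is why the glibc 2.31 behaviour matters.]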
19:15:40 <clarkb> #link https://review.opendev.org/#/c/748263/ Update opendev.org front page
19:15:47 <clarkb> Thank you ianw for reviewing this one
19:16:13 <clarkb> Looks like we've got a couple +2s now. corvus do you want to review it before we approve it?
19:16:48 <clarkb> I should re-review it, but in trying to follow the comments it's all made sense to me so far, so I doubt I'll have major concerns
19:17:00 <fungi> i've left some comments there for things i'm happy to address in a follow-up patch
19:17:50 <fungi> so as not to drag this one out unnecessarily
19:18:11 <fungi> it's already a significant improvement over what's on the site now, in my opinion
19:18:14 <clarkb> frickler: ^ you may be interested as well
19:18:37 <clarkb> maybe fungi can approve it first thing tomorrow if there are no further objections between now and then?
19:18:46 <clarkb> because ya I agree a big improvement
19:19:05 <fungi> sure, i'll push up my suggestions as a second change when doing so
19:19:37 <clarkb> On the gerrit upgrade testing side of things I've not had time to push on that since my last email to luca. I'm hoping that I'll have time this week for more testing
19:20:03 <clarkb> Any other opendev topics others would like to call out before we move on?
19:20:32 <corvus> clarkb: i will +3 front page
19:20:43 <clarkb> corvus: k
19:20:51 <fungi> i finished the critical volume replacements in rax-dfw last week
19:21:05 <fungi> and have been poking at replacing the less critical ones in the background as time allows
19:21:21 <clarkb> fungi: other than the problem where old volumes sometimes don't delete, were there issues?
19:21:54 <fungi> ahh, yeah, looks like wiki.o.o will need special attention. i expect it's because it's booted from a snapshot of a legacy flavor instance, but i can't attach a new volume to it
19:22:23 <fungi> may need to rsync its content over to another instance booted from a modern flavor
19:22:34 <clarkb> "fun"
19:22:54 <fungi> the api accepts the volume add, but then the volume immediately returns to available and the instance never sees it
19:23:20 <fungi> oh, and also i discovered that something about osc is causing it not to be able to refer to volumes by name
19:23:48 <fungi> and it gives an empty name column in the volume list output too
19:24:13 <fungi> i've resorted to using cinderclient for now to get a volume listing with names included
19:24:30 <fungi> i suspect it's something to do with using cinder v1 api, or maybe a rackspace-specific problem
19:24:43 <fungi> just something worth keeping in mind if anybody needs something similar
19:24:59 <fungi> i haven't really had time to take it up with the sdk/cli folks yet
19:25:13 <clarkb> Thank you for taking care of that
19:25:48 <fungi> no problem
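[As a possible workaround for the missing volume names in osc output, listing volumes through openstacksdk directly may help; this is only a sketch, where "rax-dfw" is a placeholder for whatever cloud entry is in clouds.yaml, and whether it avoids the same cinder v1 / rackspace quirk is untested.]

    # Sketch of listing volumes with names via openstacksdk instead of osc.
    # "rax-dfw" is a placeholder clouds.yaml entry; untested against the
    # cinder v1 naming quirk discussed above.
    import openstack

    conn = openstack.connect(cloud="rax-dfw")
    for volume in conn.block_storage.volumes(details=True):
        print(volume.id, volume.name, volume.status)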
19:26:16 <clarkb> #topic General Topics
19:26:30 <clarkb> #topic Vexxhost Mirror IPv6 Problems
19:27:08 <clarkb> With this issue it seems we get rogue router advertisements which add bogus IPs to our instance. When that happens we basically break IPv6 routing on the host
19:27:20 <clarkb> This is likely a neutron bug but needs more cloud side involvement to debug
19:27:59 <fungi> note we've seen it (at least) once in limestone too. based on the prefixes getting added we suspect it's coming from a job node in another tenant
19:28:02 <clarkb> frickler has brought up that we should try and mitigate this better. Perhaps via assigning the IP details statically. I looked at this and it should be possible with the new netplan tooling, but it's a new thing we'll need to figure out
19:28:20 <clarkb> I wrote up an etherpad that I can't find anymore with a potential example config
19:28:35 <clarkb> another thought I had was maybe we can filter RAs by origin mac?
19:28:42 <clarkb> is that something iptables can be convinced to do ?
19:29:11 <fungi> i'm not absolutely sure iptables can block that
19:29:28 <fungi> if it's handled like arp, the kernel may be listening to a bpf on the interface
19:29:41 <fungi> so will see and act on it before it ever reaches iptables
19:29:59 <fungi> (dhcp has similar challenges in that regard)
19:29:59 <clarkb> my concern with the netplan idea is if we get it wrong we may have to build a new server. At least with iptables we can test the rule and if we get it wrong, reboot
19:30:29 <ianw> clarkb: you could always set a console root password for a bit?
19:30:52 <clarkb> ianw: does remote console access work with vexxhost (I'm not sure, but if it does that would be a reasonable compromise)
19:31:23 <ianw> oh, i'm assuming it would, yeah
19:31:37 <clarkb> Also totally open to other ideas here :)
19:32:18 <ianw> it seems like this is something you have to stop, like a rogue dhcp server
19:32:41 <fungi> statically configuring ipv6 and configuring the kernel not to do autoconf is probably the safest workaround
19:32:41 <clarkb> ya, it's basically the same issue just with different IP protocols
19:33:05 <clarkb> I'll try harder to dig out the netplan etherpad after the meeting
19:33:10 <ianw> yeah, so i'm wondering what best practice others use is ... ?
19:33:35 <ianw> oh, it's ipv6
19:33:39 <ianw> of course there's a rfc
19:33:42 <ianw> https://tools.ietf.org/html/rfc6104
19:33:51 <fungi> ianw: generally it's to rely on autoconf and hope there's no bug in neutron leaking them between tenants
19:34:11 <clarkb> manual configuration is the first item on that rfc
19:34:17 <ianw> just 15 pages of options
19:34:18 <clarkb> so maybe we start there as frickler suggests
19:34:42 <clarkb> but if any of the other options there look preferable to you I'm happy to try others instead :)
19:35:44 <ianw> is it neutron leaking ra's ... or devstack doing something to the underlying nic maybe?
19:36:11 <clarkb> ianw: we believe it is neutron running in test jobs on the other tenant (we split mirror and test nodes into different tenants)
19:36:27 <fungi> devstack in a vm altering the host's nic would be... even more troubling
19:36:28 <clarkb> and neutron in the base cloud (vexxhost) is expected to block those RAs
19:36:40 <clarkb> per the bug we filed when limestone had this issue
19:36:42 <fungi> in which case it would point to a likely bug in qemu i guess
19:36:43 <ianw> that seems like a DOS attack :/
19:37:02 <clarkb> ianw: yes I originally filed it as a security bug a year ago or whatever it was
19:37:21 <clarkb> but it largely got ignored as cannot reproduce and then disclosed (so now we can talk about it freely)
19:37:23 <fungi> ianw: yep. neutron has protections which are supposed to prevent exactly this, but sometimes those aren't effective apparently
19:37:40 <clarkb> it's possible that because we open up our security groups we're the only ones that notice
19:37:47 <clarkb> (we could try using security groups to block them too maybe?)
19:38:02 <fungi> however we haven't worked out the sequence to reliably recreate the problem, only observed it cropping up with some frequency, so it's hard to pin down the exact circumstances which lead to it
19:38:16 <fungi> the open bug on neutron is still basically a dead end without a reproducer
19:38:22 <clarkb> yup also we don't run the clouds so we don't really see the underlying network behavior
19:39:48 <clarkb> anyway we don't have to solve this here, let's just not forget to work around it this time :) I can help with this once nb03 is in a good spot
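[One way to pin down where the rogue RAs come from, before committing to static netplan config or a filtering rule, would be to watch for them directly on the mirror. A rough sketch using scapy follows; the interface name and trusted gateway MAC are placeholders, and sniffing needs root.]

    # Rough rogue-RA watcher: print any router advertisement whose source
    # MAC is not the expected gateway. Interface and trusted MAC below are
    # placeholders; run as root.
    from scapy.all import Ether, ICMPv6ND_RA, sniff

    TRUSTED_GATEWAY_MAC = "fa:16:3e:00:00:01"  # placeholder

    def report(pkt):
        src = pkt[Ether].src.lower()
        if src != TRUSTED_GATEWAY_MAC:
            print("unexpected RA from %s: %s" % (src, pkt.summary()))

    sniff(iface="ens3", filter="icmp6",
          lfilter=lambda p: ICMPv6ND_RA in p, prn=report, store=False)

[That would at least confirm whether the bogus prefixes share a source MAC that an ip6tables rule or security group could plausibly match on.]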
19:40:05 <clarkb> #topic Bup and Borg Backups
19:40:27 <clarkb> ianw anything new on this? and if not should we drop it from the agenda until we start enrolling servers with borg?
19:40:37 <ianw> sorry i've just had my head in rhel and efi stuff
19:40:45 <clarkb> (I've kept it on because I think backups are important but bup seems to be working well enough for now so borg isn't urgent)
19:40:49 <ianw> it is right at the top of my todo list though
19:41:45 <ianw> we can keep it for now, and i'll try to get at least an initial host done asap
19:41:50 <clarkb> ok and thank you
19:41:55 <clarkb> #topic PTG Planning
19:42:08 <clarkb> #topic https://etherpad.opendev.org/opendev-ptg-planning-oct-2020 October PTG planning starts here
19:42:22 <clarkb> er
19:42:24 <clarkb> #undo
19:42:25 <openstack> Removing item from minutes: #topic https://etherpad.opendev.org/opendev-ptg-planning-oct-2020 October PTG planning starts here
19:42:35 <clarkb> #link https://etherpad.opendev.org/opendev-ptg-planning-oct-2020 October PTG planning starts here
19:42:48 <clarkb> October is fast approaching and I really do intend to add some content to that etherpad
19:43:03 <clarkb> as always others should feel free to add their own content
19:43:31 <clarkb> #topic Docker Hub Rate Limits
19:43:56 <clarkb> This wasn't on the agenda I sent out this morning, as it occurred to me that it may be worth talking about after looking at emails in openstack-discuss
19:44:51 <clarkb> Long story short, docker hub is changing/has changed how they apply rate limits to image pulls. In the past limits were applied to layer blobs, which we do cache in our mirrors. Now limits are applied to manifest fetches, not blob layers. We don't cache manifests because fetching those requires auth (even as an anonymous user you get an auth token)
19:45:11 <clarkb> This is unfortunate because it means our caching strategy is no longer effective for docker hub
19:45:33 <clarkb> On the plus side, projects like zuul and nodepool and system-config haven't appeared to be affected yet. But others like tripleo have
19:45:54 <clarkb> docker has promised they'll write a blog post with suggestions for CI operators, which I haven't seen published yet /me waits patiently
19:46:26 <clarkb> If our users struggle with this in the meantime I think their best bet may be to stop using our mirrors, because then they will make anonymous requests from IPs that will generally be unique enough to avoid issues
19:47:01 <clarkb> Other ideas I've seen include building images rather than fetching them (tripleo is doing this) as well as using other registries like quay
19:47:27 <fungi> there are certainly multiple solutions available to us, but i've been trying to remind users that dockerhub has promised to publish guidance and we should wait for that
19:47:58 <fungi> at least before we invest effort in building an alternative solution
19:48:16 <clarkb> ++ I mostly want people to be aware there is an issue and workarounds from the source should be published at some point
19:48:33 <clarkb> and there are "easy" workarounds that can be used between now and then like not using our mirrors
19:48:35 <fungi> (such as running our own proxy registry, or switching to a different web proxy which might be more flexible than apache mod_proxy)
19:49:11 <fungi> there was also some repeated confusion i've tried my best to correct around zuul-registry and its presumed use in proxying docker images for jobs
19:50:14 <clarkb> oh ya a couple people were confused by that
19:50:24 <clarkb> not realizing it's a temporary staging ground, not a canonical source/cache
19:50:26 <ianw> didn't github also announce a competing registry too?
19:50:37 <clarkb> ianw: yes
19:50:43 <clarkb> and google has one
19:50:51 <fungi> yes, but who knows if it will have similar (or worse) rate limits. we've been bitten by github rate limits pretty often as it is
19:51:13 <fungi> man, my typing is atrocious today
19:51:26 <ianw> yeah, just thinking that's sure to become something to mix in as well
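[For anyone wanting to see how close a given node or mirror IP is to the limit, a minimal Python sketch follows. It assumes the RateLimit headers and the ratelimitpreview/test repository that docker has described are still exposed, so it is worth verifying against whatever guidance they end up publishing.]

    # Sketch of checking the anonymous pull limit docker hub reports for the
    # requesting IP. Assumes the RateLimit-* headers and the
    # ratelimitpreview/test repository are still available; needs requests.
    import requests

    token = requests.get(
        "https://auth.docker.io/token",
        params={"service": "registry.docker.io",
                "scope": "repository:ratelimitpreview/test:pull"},
    ).json()["token"]

    resp = requests.head(
        "https://registry-1.docker.io/v2/ratelimitpreview/test/manifests/latest",
        headers={"Authorization": "Bearer " + token},
    )
    # e.g. "100;w=21600" would mean 100 pulls per 6 hour window
    print(resp.headers.get("RateLimit-Limit"),
          resp.headers.get("RateLimit-Remaining"))

[This also illustrates the point above about why the proxy can't easily cache manifests: even an anonymous pull starts with a token request.]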
19:53:02 <clarkb> #topic Open Discussion
19:53:08 <clarkb> Anything else to bring up in our last 7 minutes?
19:53:46 <fungi> oh, yeah
19:53:51 <fungi> pynotedb
19:54:13 <fungi> a few years ago, zara started work on a python library to interface with gerrit notedb databases
19:54:18 <fungi> but didn't get all that far with it
19:54:42 <fungi> we have the package name on pypi and a repo in our (opendev's) namespace on opendev, but that's mostly just a cookie-cutter commit
19:54:54 <hashar> :-\
19:55:23 <fungi> more recently softwarefactory needed something to be able to interface with notedb from python and started writing a module for that
19:55:39 <fungi> they (ironically) picked the same name without checking whether it was taken
19:55:56 <fungi> now they're asking if we can hand over the pypi project so they can publish their library under that name
19:56:00 <clarkb> for the name on pypi, was anything released to it?
19:56:24 <clarkb> if yes, then we may want to quickly double check nothing is using it (I think pypi exposes that somehow) but if not I have no objections to that idea
19:56:24 <fungi> a couple of dev releases several years ago, looks like
19:57:19 <fungi> also SotK has confirmed that the original authors are okay with letting it go
19:57:40 <fungi> and probably just using tristanC's thing instead once they're ready
19:58:25 <clarkb> works for me
19:58:32 <clarkb> particularly if the original authors are happy with the plan
19:58:49 <diablo_rojo> Seems reasonable
19:59:24 <fungi> ahh, looks like the "releases" for it on pypi have no files anyway
19:59:58 <fungi> evidenced by the lack of "download files" at https://pypi.org/project/pynotedb/
20:00:13 <hashar> there is no tag in the repo apparently
20:00:15 <fungi> so the two dev releases on there are just empty, no packages
20:00:17 <clarkb> that makes things easy
20:00:24 <diablo_rojo> Nice
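[The same check is scriptable against the public PyPI JSON API if anyone wants to double-check before the handover; a minimal sketch with requests:]

    # Confirm the existing pynotedb releases on pypi have no uploaded files,
    # using the public PyPI JSON API.
    import requests

    data = requests.get("https://pypi.org/pypi/pynotedb/json").json()
    for version, files in sorted(data["releases"].items()):
        print(version, "->", len(files), "file(s)")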
20:00:26 <clarkb> and we are at time
20:00:29 <clarkb> Thank you everyone!
20:00:31 <fungi> thanks clarkb!
20:00:31 <clarkb> #endmeeting