19:01:04 <clarkb> #startmeeting infra
19:01:05 <openstack> Meeting started Tue Jul 28 19:01:04 2020 UTC and is due to finish in 60 minutes.  The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:06 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:08 <openstack> The meeting name has been set to 'infra'
19:01:15 <clarkb> #link http://lists.opendev.org/pipermail/service-discuss/2020-July/000059.html Our Agenda
19:01:44 <clarkb> #topic Announcements
19:01:49 <clarkb> I have no announcements
19:01:58 <clarkb> #topic Actions from last meeting
19:02:02 <fungi> we now have dates for the (no surprise) virtual open infrastructure summit
19:02:11 <clarkb> #undo
19:02:12 <openstack> Removing item from minutes: #topic Actions from last meeting
19:02:29 <fungi> and there's a survey which has gone out to pick dates for the ptg
19:02:57 <clarkb> #link https://www.openstack.org/summit Info on the now-virtual open infrastructure summit
19:03:21 <mordred> o/
19:03:45 <clarkb> #link http://lists.openstack.org/pipermail/openstack-discuss/2020-July/016098.html PTG date selection survey
19:03:54 <fungi> ptg date options seem to span from a couple weeks before the summit to several weeks after
19:03:54 <clarkb> fungi: thank you for the reminders
19:04:10 <fungi> my personal preference is to not interfere with hallowe'en ;)
19:04:44 <mordred> fungi: I dunno - interfering with halloween is a long-standing summit tradition :)
19:04:55 <fungi> and an unfortunate one in my opinion
19:05:36 <fungi> anyway, you can proceed with the meeting, sorry for derailment
19:05:41 <clarkb> #topic Actions from last meeting
19:05:46 <mordred> I don't disagree with you
19:05:48 <clarkb> #link http://eavesdrop.openstack.org/meetings/infra/2020/infra.2020-07-21-19.01.txt minutes from last meeting
19:05:56 <clarkb> There were no actions
19:06:03 <clarkb> #topic Specs approval
19:06:11 <clarkb> #link https://review.opendev.org/#/c/731838/ Authentication broker service
19:06:28 <clarkb> This just got a new patchset. Probably still not ready for approval but definitely worthy of review
19:06:43 <fungi> yeah, i think we at least have direction for a poc
19:06:45 <clarkb> fungi: anything you want to add to that re new patchset or $other
19:07:12 <fungi> new patchset is primarily thanks to your inspiration, more comments welcome of course
19:08:00 <fungi> basically noting that we're agreed on keycloak, not much new otherwise
19:09:42 <fungi> also, you can move on, nothing else from me
19:09:44 <clarkb> #topic Priority Efforts
19:09:51 <clarkb> #topic Update Config Management
19:10:21 <clarkb> A note that we went about 5 days without any of our infra-prod jobs running due to the zuul security bug fix and our playbook tripping over that. Since that has been fixed, things seem happy
19:10:32 <clarkb> If you find things are lagging or otherwise not up to date this could be why
19:11:24 <clarkb> Semi-related to that are the changes to run zuul executors under containers. This gets all of our zuul and nodepool services except nb03's builder on containers
19:11:28 <fungi> workarounds were mostly trivialish
19:12:12 <clarkb> two things to make note of there. nb03 needs an arm64 image which has hit a speed bump due to the slowness of building python wheels under buildx. Zuul executors need to start docker after afs (a fix for this has landed) to ensure bind mounts work properly for afs
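(For illustration, the ordering fix amounts to a systemd override along the lines of the sketch below — unit names docker.service and openafs-client.service are assumptions; the actual change may differ:)

    # /etc/systemd/system/docker.service.d/override.conf
    # Make sure the OpenAFS client has mounted /afs before docker starts,
    # so containers that bind-mount paths under /afs see the real filesystem.
    [Unit]
    After=openafs-client.service
    Requires=openafs-client.service

    # apply without a reboot:
    #   systemctl daemon-reload && systemctl restart docker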
19:12:52 <clarkb> for nb03 I think it is using the plaintext zk connection still. Is that correct corvus ?
19:13:13 <corvus> clarkb: unclear, i'll check
19:13:26 <corvus> yes
19:13:40 <mordred> clarkb: have we checked whether building the wheels in question on native arm64 is also slow?
19:14:12 <clarkb> mordred: that's a thing that ianw has been investigating with upstream python cryptography and I think it is significantly faster
19:14:19 <fungi> mordred: i think it's more that far fewer of our dependencies publish arm manylinux wheels so we have to build more of them than for x86
19:14:21 <mordred> nod
19:14:59 <clarkb> fungi: it is that, but also we had a one-hour timeout on the job and they would hit that compiling any one of cryptography, pynacl and bcrypt
19:15:06 <clarkb> fungi: so the compiles there are also really really slow
19:15:13 <fungi> and yeah, building on native arm64 is apparently far faster than qemu emulated arm64 on amd64
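(For context, the slow emulated path looks roughly like this — a sketch with a hypothetical image name, not the actual job:)

    # Register qemu binfmt handlers so an x86_64 host can run arm64 binaries,
    # then cross-build; any wheel compiled inside the build runs under emulation.
    docker run --rm --privileged multiarch/qemu-user-static --reset -p yes
    docker buildx create --use
    docker buildx build --platform linux/arm64 -t example/nodepool-builder:arm64 .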
19:15:26 <corvus> solutions include: 1) help upstreams build arm wheels; 2) build nodepool images on a native arm builder; 3) add in a new layer to the nodepool image stack that builds and installs dependencies so that we can rebuild this layer separately and less often.
19:15:55 <fungi> i like the idea of helping upstreams, and there's a topic on the agenda for that
19:16:19 <corvus> ianw started on #1;  #2 and #3 are not in play yet, just for discussion
19:16:30 <corvus> i'm still not keen on #2
19:16:38 <mordred> me either
19:16:46 <fungi> philosophically, #1 is preferable
19:16:50 <corvus> (because i don't think it's okay not to be able to merge nodepool changes if we lose that cloud)
19:16:50 <mordred> I think failing 1 we should try 3 before going to 2
19:17:04 <ianw> #3 doesn't seem like it precludes #1 either
19:17:22 <corvus> yeah, we could go ahead and start on #3 if we think it's not a terrible idea
19:17:35 <fungi> yeah, #3 is doable independently and will likely help
19:17:38 <mordred> yah. we could also do them not as another image layer, but just as a local version of #1 that publishes to a zuul-specific wheel mirror similar to the openstack specific wheel mirror
19:17:53 <corvus> basically, i'm imagining an image that we build only on requirements.txt changes and nightly
19:17:55 <fungi> but if #1 is practical, then #3 will probably not be a huge gain
19:18:14 <corvus> fungi: it's also a fallback for the next requirement we have that doesn't have a wheel
19:18:25 <corvus> it sort of scales up and down as needed :)
19:18:30 <mordred> so we treat "make sure we have wheels of X, Y and Z" as a task separate from "build an image containing nodepool and X, Y and Z"
19:18:35 <fungi> yeah, i agree it's a useful backstop either way
19:18:41 <mordred> corvus: ++
19:18:57 <corvus> mordred, ianw: do you think we can generalize the make-a-wheel-mirror thing like that?
19:19:12 <ianw> corvus: yes, it's just a script :)
19:19:13 <corvus> is that a service we'd like opendev to provide?
19:19:35 <mordred> might not be a terrible general service to offer
19:19:40 <fungi> if the new layer for #3 is part of the same job though, then it can still easily time out whenever there's something new to build
19:19:41 <ianw> the only thing is that the arm64 wheel build already runs quite long, such that we've restricted it to the latest 2 branches
19:19:48 <corvus> ianw: but also a vhost?  or would we just have the mirror host be a superset of all the wheels for all the projects?
19:20:00 <corvus> fungi: same job as what job?
19:20:12 <fungi> same as the job building the other layers
19:20:21 <fungi> the job currently timing out
19:20:43 <corvus> fungi: yes there would be no point in that :)  my suggestion is a separately tagged image layer built by a separate job
19:20:49 <fungi> do we have a separate job per layer? if so then less of a problem i guess
19:20:56 <clarkb> corvus: I think superset
19:21:01 <corvus> fungi: if the word 'layer' is tripping you up, just ignore it and call it an 'image' :)
19:21:08 <clarkb> corvus: because requirements and constraints or lock files should control what projects actually use
19:21:21 <clarkb> if they rely on packages being available (or not) in our cache to select versions that is a bug imo
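(In other words, version selection should come from the project's own inputs rather than from what happens to be cached — e.g.:)

    # install exactly what requirements plus constraints pin, regardless of
    # which wheels the mirror happens to have pre-built:
    pip install -r requirements.txt -c upper-constraints.txt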
19:21:21 <ianw> (on the wheel job) but this is something that is open to more parallelism.  we already run it under "parallel" and I think using a couple of nodes could really speed it up by letting the longest running jobs sit on one node, while the smaller things zoom by
19:21:36 <fungi> i do like the superset idea, if we have a good way for projects to chuck in the package+version they want a wheel of
19:21:58 <ianw> fungi: i was probably thinking we run requirements.txt, just without a cap
19:22:00 <corvus> yeah, a wheel mirror sounds way easier than a new layer
19:22:41 <corvus> so let's call that #4, and execute #1 and #4, and leave ideas #2 and #3 on the shelf?
19:22:52 <ianw> i can take an action item to look at the wheel mirror if we like
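(Roughly, #4 would look something like the sketch below — build wheels on a native arm64 node and publish them to an AFS-backed wheelhouse that pip consumes via find-links; paths and URL here are hypothetical:)

    # build every wheel the project needs, on real arm64 hardware:
    pip wheel -r requirements.txt -w ./wheelhouse
    # publish into the read-write AFS volume backing the mirror:
    rsync -av ./wheelhouse/ /afs/.example.org/mirror/wheel/zuul-arm64/

    # consumers add the wheelhouse as an extra source of pre-built wheels
    # (pip.conf):
    #   [global]
    #   find-links = https://mirror.example.org/wheel/zuul-arm64/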
19:22:53 <fungi> there is a slight gain from not parallelizing, in that if projects with common dependencies are split between concurrent jobs then that dependency will be built twice, but in practice that's probably not a huge concern
19:23:37 <fungi> my hands are typing at incompatible speeds today
19:24:31 <mordred> fungi: to one another?
19:24:44 <fungi> at least, yes
19:25:03 <ianw> one other thing about wheels is that some of the caveats in https://review.opendev.org/#/c/703916/ still apply ... we append wheels and never delete currently
19:25:38 <clarkb> ianw: thankfully these packages tend to be fairly small (unlike say the cuda pypi packages)
19:25:41 <ianw> basically the cache grows unbounded, which is one thing with capped requirements, but if we're chasing upstream it might need thinking about
19:25:57 <fungi> true, if we shard and build in parallel then deleting becomes much harder too
19:27:12 <fungi> though if we're worried about space on our current mirror, i'd say the first course of action is to delete any wheels which are also present on pypi. we used to copy them (unnecessarily) into the mirror
19:27:37 <fungi> we no longer do that, but since we only append...
19:28:22 <clarkb> also due to using afs we could clear the contents, rebuild, then publish the result periodically if we want to prune right?
19:28:46 <clarkb> the RO volume will serve the old contents until we vos release
19:28:46 <fungi> yeah, pruning as a separate step could be atomic, at least
19:28:50 <ianw> this is true, but i think that all the jobs would time out trying to refresh the cache
19:28:57 <fungi> though we'd need to block additions while pruning
19:29:07 <ianw> so it would be very manual
19:29:42 <clarkb> gotcha
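(The pruning idea leans on AFS read-only replicas: rebuild in the read-write volume, then publish atomically — a sketch with a hypothetical volume name:)

    # regenerate the wheel cache under the RW path, then release; clients
    # keep reading the old RO replica until the vos release completes.
    vos release mirror.wheel-example -localauth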
19:30:08 <clarkb> ok we've got time in a bit to talk further about #1, anything else to bring up on #4 before we move on?
19:30:41 <ianw> do you want to give me an action item on it?  or is anyone else super keen?
19:31:07 <clarkb> if you're interested I would say go for it
19:31:38 <corvus> ianw: if you can get started and tag me for assistance/reviews that'd be great
19:31:53 <ianw> ++ will do
19:31:55 <clarkb> #action ianw Work on incorporating non-OpenStack requirements into our python wheel caches. corvus willing to assist
19:32:33 <clarkb> #topic OpenDev
19:32:51 <clarkb> I announced deprecation for gerrit /p/ mirrors
19:32:53 <clarkb> #link http://lists.opendev.org/pipermail/service-announce/2020-July/000007.html Gerrit /p/ mirror deprecation.
19:33:10 <clarkb> #link https://review.opendev.org/#/c/743324/ Implement Gerrit /p/ mirror "removal"
19:33:18 <clarkb> and pushed that change to make it happen. I said I would do this friday
19:33:31 <clarkb> if we think the plan above is a bad one let me know and we can modify as necessary
19:33:59 <fungi> just to confirm (i also left a review comment), is the idea to remove that line from the config when we upgrade?
19:34:09 <clarkb> The reasons for it are that gerrit needs that url for something else on the newer versions we plan to upgrade to, and I've slowly been trying to update how we manage git branches; not having another mirror (which is going away anyway) makes that simpler
19:34:11 <clarkb> fungi: yes
19:34:13 <fungi> or will it not break polygerrit dashboards?
19:34:18 <fungi> ahh, okay, thanks
19:34:35 <clarkb> there will also need to be follow-on cleanup to stop syncing to the local git repos and all that
19:34:45 <clarkb> but I was going to wait on that to be sure we don't revert for some reason
19:35:02 <mordred> ++
19:35:54 <clarkb> That was what I had for opendev subjects.
19:36:20 <clarkb> #topic General topics
19:36:28 <clarkb> #topic Bup and Borg backups
19:36:46 <clarkb> #link https://review.opendev.org/741366 Adds borg backup support and needs review
19:37:00 <ianw> we'd also discussed the old backups and i checked on them
19:37:04 <clarkb> I've reviewed ^ and think it makes sense. Allows us to have bup and borg side by side too
19:37:23 <clarkb> ianw: they still happy even after the index cleanup?
19:37:24 <ianw> review seems fine; i did a full extraction and here are some important bits compared: http://paste.openstack.org/show/796367/
19:37:48 <ianw> zuul also seems ok, but i did notice we've missed adding that to the "new" server
19:38:02 <ianw> #link https://review.opendev.org/#/c/743445/
19:38:52 <ianw> if i can get reviews on the borg backup roles, i'll start a new server and we can try it with something
19:39:25 <clarkb> great, thank you for putting that together
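(For reference, the workflow the borg role wraps is roughly the following — host names, paths, and the no-encryption choice are assumptions for the sketch:)

    # one-time: create the repository on the backup server
    borg init --encryption=none ssh://borg-host01@backup01.example.org/./backup-host01
    # nightly: push a new archive of the interesting paths
    borg create --compression lz4 \
        ssh://borg-host01@backup01.example.org/./backup-host01::{now} \
        /etc /var/lib /root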
19:39:27 <ianw> (as a tangent on starting a new server)
19:39:30 <ianw> #link https://review.opendev.org/743461
19:39:32 <ianw> and
19:39:40 <ianw> #link https://review.opendev.org/743470
19:39:52 <ianw> adds sshfp records for our hosts, and makes the launch script print them too :)
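(ssh-keygen can emit those records straight from a host's keys, e.g. — the hostname here is illustrative:)

    # print SSHFP resource records for the zone file from the host's ed25519 key
    ssh-keygen -r review01.example.org -f /etc/ssh/ssh_host_ed25519_key.pub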
19:40:25 <clarkb> #topic GitHub 3rd Party CI
19:40:33 <clarkb> as promised earlier we can talk about this a bit more
19:41:13 <clarkb> Basically python cryptography doesn't have arm64 wheels. ianw filed an issue with them to start a conversation on whether or not we could help with that
19:41:15 <clarkb> #link https://github.com/pyca/cryptography/issues/5339
19:41:49 <clarkb> we also talked about it a bit ourselves and it is something we'd like to do if we can make all the pieces fit together such that cryptography is happy and all that
19:42:08 <clarkb> To do this we'll need to spin up a tenant for them
19:42:10 <clarkb> #link https://review.opendev.org/#/q/topic:opendev-3pci
19:42:20 <fungi> though in short, the main reason they don't have arm64 manylinux1 wheels is that travis takes too long to test/build them because it uses emulation?
19:42:42 <clarkb> yes I believe travis is doing something very similar to our buildx nodepool image builds for docker
19:42:43 <ianw> i'm not sure on the emulation part; i'm pretty sure they run on AWS
19:42:56 <ianw> but the slow part, yes
19:43:13 <corvus> i was thinking the tenant would be a 'cryptography' tenant
19:43:46 <corvus> the main issue prompting giving them their own tenant is that they won't be coordinating with anyone else
19:43:48 <fungi> yeah, or at most a pyca tenant
19:43:56 <corvus> that seems reasonable too
19:44:10 <ianw> yeah, that was something to discuss, i made the tenant deliberately a bit more generic thinking that it may be a home for a collection of things we might need/be interested in
19:44:15 <corvus> (so a generic 3pci tenant seems weird since we'd never want to add anyone else there)
19:44:25 <fungi> i concur
19:44:35 <fungi> granularity is good here
19:44:47 <ianw> already i've had mentions from people that lxml would like something similar
19:44:48 <fungi> tenants would, ideally, mirror the authorities for them
19:45:22 <clarkb> it's possible other third party ci groups would be happy with a more generic pool but cryptography in particular seemed to want to be very much their own thing
19:45:32 <clarkb> so at least to start I agree with fungi and corvus
19:45:56 <clarkb> then evaluate if other python deps do similar
19:46:21 <ianw> ok, i will rework it to a pyca tenant, that sounds like a good level
19:46:36 <corvus> ++
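(A minimal sketch of what the pyca tenant stanza in Zuul's main.yaml could look like — the connection names and config-project are illustrative, not the actual change:)

    - tenant:
        name: pyca
        source:
          gerrit:
            config-projects:
              - opendev/project-config
          github:
            untrusted-projects:
              - pyca/cryptography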
19:47:10 <clarkb> Anything else to add?
19:47:39 * mordred thinks this is neat
19:47:49 <ianw> not really, if we get that up, we can start to report on pull requests
19:47:52 <fungi> i, too, think this is neat
19:48:01 <clarkb> #topic Gerrit project CI rework from Google
19:48:10 <clarkb> corvus: ^ want to fill us in
19:48:50 <corvus> the google folks recently decided to abandon work on the checks plugin for gerrit
19:48:54 <corvus> #link https://docs.google.com/document/d/1v2ETifhRXpuYlahtnfIK-1KL3zStERvbnEa8sS1pTwk/edit#
19:49:14 <corvus> i think they mostly have an idea of where they want to go
19:49:44 <fungi> would it still be a plugin, or something more directly integrated?
19:49:54 <corvus> but they did ask a few folks to help them gather requirements, so i hopped on a video conference with ben and patrick and told them about us
19:50:09 <corvus> fungi: possibly both and multiple plugins
19:50:14 <fungi> ahh
19:50:19 <clarkb> corvus: this is driven by googles needs right? not necessarily a change of need for upstream itself?
19:51:00 <corvus> yes, theoretically the checks plugin could live on
19:51:13 <corvus> but google is driving most of the work in this area
19:51:40 <corvus> the biggest driving force for the change is that internal google ci systems didn't want to adapt to the new checks api
19:52:13 <mordred> corvus: did you suggest the internal google ci systems just migrate to zuul which already supports the new api? ;)
19:52:22 * mordred hides
19:52:23 <corvus> anyway, i told them about how we do third-party ci, stream-events, firewalls, really big firewalls, polling, etc
19:52:38 <corvus> mordred: yes actually, but that would be a change too
19:52:46 <mordred> good point
19:53:00 <corvus> i showed them hideci
19:53:03 <corvus> (they said we should upgrade)
19:53:21 <clarkb> have they documented how to upgrade yet ;)
19:53:29 <corvus> clarkb: yes?
19:53:44 <clarkb> corvus: I just remember all the undocumented info you and mordred brought back from gerrit user summit last year
19:53:45 <fungi> step 1: dump all your existing data ;)
19:54:00 <corvus> this is starting to derail, but i'm pretty sure since we're the last folks in the world running on 2.13 and everyone else has upgraded, that's pretty solid
19:54:06 <mordred> nah - upgrade should be fairly straightforward
19:54:15 <mordred> we just need to practice run it
19:54:21 <corvus> yep
19:54:35 <clarkb> k
19:54:37 <fungi> yeah, even my sso spec update assumes we're running newer gerrit
19:55:03 <clarkb> I mean I'm all onboard with upgrading its just still not entirely clear to me that we have a process (but I don't mean to derail from the CI discussion)
19:55:06 <corvus> let me be perfectly clear: the fact that we're running 2.13 and have not upgraded is entirely our doing and has nothing to do with the willingness of the gerrit community to help us do it.
19:55:07 <mordred> corvus: how did our use cases resonate?
19:55:13 <fungi> i didn't list it as a dependency because i figure we're very, very close
19:55:58 <clarkb> also do they intend on disabling features like stream events?
19:56:13 <corvus> if we have any problems upgrading, they really really want us to let them help us
19:56:28 <corvus> anyway, sorry, i'll try to get back on track.  where were we?
19:56:39 <mordred> corvus: you told them about how we do things
19:56:55 <mordred> and showed them hideci
19:57:06 <corvus> and i emphasized the ui features of hideci, and what we've learned from that
19:57:13 <corvus> about showing multiple ci systems, re-runs, etc
19:57:31 <fungi> the idea that a change might have dozens of ci systems reporting on it
19:57:36 <corvus> what summary information is important, why you would want to show successful runs, etc.
19:57:54 <corvus> some of this was new to them, so i think it was a useful exercise.
19:58:19 <corvus> i can't say for sure what will come out of it.  i don't think google is in a position to commit to implementing things specifically to make our lives easier
19:58:46 <corvus> but i did stress that to the extent that they can make use cases like these simple to deal with, it makes gerrit a better product
19:59:12 <clarkb> assuming that the checks plugin dies and the google thing doesn't work we're basically in a similar spot to where we are today using stream events and comments (though they can be labeled as robot comments) right?
19:59:28 <corvus> i think they at least understand what we're doing, why, how, and what works and doesn't work, so hopefully that will inform their work and they'll end up with something we can use without too much difficulty
20:00:00 <clarkb> I guess my biggest concern is that we don't regress where even the simple thing we do today is no longer functional (hence my question about stream events)
20:00:07 <clarkb> something richer would be great, but we can probably make do if that doesn't pan out
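(The "simple thing we do today" is essentially: follow the event stream over ssh and report back as a review comment — a sketch with a hypothetical CI account and change number:)

    # third-party CI consumes events as they happen:
    ssh -p 29418 ci-account@review.opendev.org gerrit stream-events
    # ...and reports results back on the change it tested:
    ssh -p 29418 ci-account@review.opendev.org \
        gerrit review 12345,1 --message "'Build succeeded: https://logs.example.org/...'"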
20:00:09 <corvus> clarkb: it's the reporting/ui thing that i think is the biggest issue
20:00:32 <corvus> remind me what's our story for replacing hideci when we upgrade?
20:00:45 <mordred> we don't have 100% of one yet
20:00:48 <mordred> but ...
20:00:51 <clarkb> I think we had assumed checks was going to handle it for first party ci
20:00:53 <mordred> our friends at wikimedia have a thing very similar to hideci
20:01:01 <fungi> checks api having subchecks integration in the polygerrit ui
20:01:03 <mordred> that's implemented as a proper polygerrit plugin
20:01:20 <mordred> no - checks api has never been the plan for the upgrade, sorry, that might not have been clear
20:01:37 <corvus> okay, so if that new plugin works, then yeah, we're probably good for a while
20:02:08 <corvus> for me, the goal is still getting all this data out of the comment stream
20:02:56 <clarkb> corvus: ya that would be nice particularly to help reviewers not be distracted by all the CI activity
20:03:12 <clarkb> also we are at time now. I don't mind going for a few more minutes but we should probably wrap up
20:03:21 <clarkb> though feel free to continue discussion in #opendev
20:03:33 <corvus> anyway, that's the gist -- i think they're interested in talking to us more as their plans solidify
20:03:54 <corvus> i'm sure we can invite others to the next meeting if folks want
20:04:25 <corvus> [eot]
20:05:11 <clarkb> cool thanks for your time everyone. As mentioned feel free to continue the discussion in #opendev
20:05:15 <clarkb> #endmeeting