19:01:04 #startmeeting infra
19:01:05 Meeting started Tue Jul 28 19:01:04 2020 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:06 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:08 The meeting name has been set to 'infra'
19:01:15 #link http://lists.opendev.org/pipermail/service-discuss/2020-July/000059.html Our Agenda
19:01:44 #topic Announcements
19:01:49 I have no announcements
19:01:58 #topic Actions from last meeting
19:02:02 we now have dates for the (no surprise) virtual open infrastructure summit
19:02:11 #undo
19:02:12 Removing item from minutes: #topic Actions from last meeting
19:02:29 and there's a survey which has gone out to pick dates for the ptg
19:02:57 #link https://www.openstack.org/summit Info on now virtual open infrastructure summit
19:03:21 o/
19:03:45 #link http://lists.openstack.org/pipermail/openstack-discuss/2020-July/016098.html PTG date selection survey
19:03:54 ptg date options seem to span from a couple weeks before the summit to several weeks after
19:03:54 fungi: thank you for the reminders
19:04:10 my personal preference is to not interfere with hallowe'en ;)
19:04:44 fungi: I dunno - interfering with halloween is a long-standing summit tradition :)
19:04:55 and an unfortunate one in my opinion
19:05:36 anyway, you can proceed with the meeting, sorry for derailment
19:05:41 #topic Actions from last meeting
19:05:46 I don't disagree with you
19:05:48 #link http://eavesdrop.openstack.org/meetings/infra/2020/infra.2020-07-21-19.01.txt minutes from last meeting
19:05:56 There were no actions
19:06:03 #topic Specs approval
19:06:11 #link https://review.opendev.org/#/c/731838/ Authentication broker service
19:06:28 This just got a new patchset. Probably still not ready for approval but definitely worthy of review
19:06:43 yeah, i think we at least have direction for a poc
19:06:45 fungi: anything you want to add to that re new patchset or $other
19:07:12 new patcheset is primarily thanks to your inspiration, more comments welcome of course
19:08:00 basically noting that we're agreed on keycloak, not much new otherwise
19:09:42 also, you can move on, nothing else from me
19:09:44 #topic Priority Efforts
19:09:51 #topic Update Config Management
19:10:21 A note that we went about 5 days without any of our infra-prod jobs running due to the zuul security bug fix and our playbook tripping over that. Since that has been fixed things seem happy
19:10:32 If you find things are lagging or otherwise not up to date this could be why
19:11:24 Semi related to that are the changes to run zuul executors under containers. This gets all of our zuul and nodepool services except nb03's builder on containers
19:11:28 workarounds were mostly trivialish
19:12:12 two things to make note of there. nb03 needs an arm64 image which has hit a speed bump due to the slowness of building python wheels under buildx. Zuul executors need to start docker after afs (a fix for this has landed) to ensure bind mounts work properly for afs
19:12:52 for nb03 I think it is using the plaintext zk connection still. Is that correct corvus ?
19:13:13 clarkb: unclear, i'll check
19:13:26 yes
19:13:40 clarkb: have we checked to see if building the wheels on native arm64 for the wheels in question is also slow?
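
(Side note, not part of the meeting log: the exchange around this point turns on which dependencies already publish aarch64 wheels and which have to be compiled during the image build. Below is a minimal sketch of how one could check that against PyPI's JSON API; the package names are only examples taken from the discussion, and this is not a script the team actually runs.)

```python
#!/usr/bin/env python3
# Sketch: list which aarch64 wheels, if any, a package publishes on PyPI
# for its latest release. Packages without them must be built from sdist
# during an arm64 image build, which is what makes the builds slow.
import json
import urllib.request

PACKAGES = ["cryptography", "pynacl", "bcrypt"]  # example packages only


def aarch64_wheels(package):
    """Return aarch64 wheel filenames for the latest release, if any."""
    url = f"https://pypi.org/pypi/{package}/json"
    with urllib.request.urlopen(url) as resp:
        data = json.load(resp)
    return [
        f["filename"]
        for f in data["urls"]
        if f["filename"].endswith(".whl") and "aarch64" in f["filename"]
    ]


if __name__ == "__main__":
    for pkg in PACKAGES:
        wheels = aarch64_wheels(pkg)
        status = ", ".join(wheels) if wheels else "no aarch64 wheels (must build from sdist)"
        print(f"{pkg}: {status}")
```
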
19:14:12 mordred: thats a thing that ianw has been investigating with upstream python cryptography and I think it is significantly faster
19:14:19 mordred: i think it's more that far fewer of our dependencies publish arm manylinux wheels so we have to build more of them than for x86
19:14:21 nod
19:14:59 fungi: it is that, but also we had an hour time out on the job and they would hit that compiling any one of cryptography, pynacl and bcrypt
19:15:06 fungi: so the compiles there are also really really slow
19:15:13 and yeah, building on native arm64 is apparently far faster than qemu emulated arm64 on amd64
19:15:26 solutions include: 1) help upstreams build arm wheels; 2) build nodepool images on a native arm builder; 3) add in a new layer to the nodepool image stack that builds and installs dependencies so that we can rebuild this layer separately and less often.
19:15:55 i like the idea of helping upstreams, and there's a topic on the agenda for that
19:16:19 ianw started on #1; #2 and #3 are not in play yet, just for discussion
19:16:30 i'm still not keen on #2
19:16:38 me either
19:16:46 philosophically, #1 is preferable
19:16:50 (because i don't think it's okay not to be able to merge nodepool changes if we lose that cloud)
19:16:50 I think failing 1 we should try 3 before going to 2
19:17:04 #3 doesn't seem like it precludes #1 either
19:17:22 yeah, we could go ahead and start on #3 if we think it's not a terrible idea
19:17:35 yeah, #3 is doable independently and will likely help
19:17:38 yah. we could also do them not as another image layer, but just as a local version of #1 that publishes to a zuul-specific wheel mirror similar to the openstack specific wheel mirror
19:17:53 basically, i'm imagining an image that we build only on requirements.txt changes and nightly
19:17:55 but if #1 is practical, then it will probably not be a huge gain
19:18:14 fungi: it's also a fallback for the next requirement we have that doesn't have a wheel
19:18:18 er, if #1 is practical then #3 will probably not be a huge gain
19:18:25 it sort of scales up and down as needed :)
19:18:30 so we treat "make sure we have wheels of X, Y and Z" as a task separate from "build an image containing nodepool and X, Y and Z"
19:18:35 yeah, i agree it's a useful backstop either way
19:18:41 corvus: ++
19:18:57 mordred, ianw: do you think we can generalize the make-a-wheel-mirror thing like that?
19:19:12 corvus: yes, it's just a script :)
19:19:13 is that a service we'd like opendev to provide?
19:19:35 might not be a terrible general service to offer
19:19:40 if the new layer for #3 is part of the same job though, then it can still easily time out whenever there's something new to buid
19:19:41 the only thing is that the arm64 wheel build already runs quite long, such that we've restricted it to the lastest 2 branches
19:19:42 build
19:19:48 ianw: but also a vhost? or would we just have the mirror host be a superset of all the wheels for all the projects)?
19:20:00 fungi: same job as what job?
19:20:12 same as the job building the other layers
19:20:21 the job currently timing out
19:20:43 fungi: yes there would be no point in that :) my suggestion is a separately tagged image layer built by a separate job
19:20:49 do we have a separate job per layer? if so then less of a problem i guess
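
(Side note on the "it's just a script" wheel-mirror idea above, not taken from the actual wheel-build jobs: a minimal sketch of building wheels for an uncapped requirements list into a local wheelhouse that a later publish step could copy into the mirror. The requirements path and wheelhouse directory are hypothetical.)

```python
#!/usr/bin/env python3
# Sketch of a wheel-mirror build step: compile wheels for every entry in a
# requirements file into a local directory. A separate publish step would
# then copy the results into the AFS-backed mirror.
import subprocess
import sys


def build_wheelhouse(requirements="requirements.txt", wheel_dir="wheelhouse"):
    """Build wheels for all entries in the given requirements file."""
    cmd = [
        sys.executable, "-m", "pip", "wheel",
        "--requirement", requirements,
        "--wheel-dir", wheel_dir,
    ]
    subprocess.run(cmd, check=True)


if __name__ == "__main__":
    build_wheelhouse()
```
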
19:20:56 corvus: I think superset
19:21:01 fungi: if the word 'layer' is tripping you up, just ignore it and call it an 'image' :)
19:21:08 corvus: becuase requirements and constraints or lock files should control what projects actually use
19:21:21 if they rely on packages being available (or not) in our cache to select versions that is a bug imo
19:21:21 (on the wheel job) but this is something that is open to more parallelism. we already run it under "parallel" and I think using a couple of nodes could really speed it up by letting the longest running jobs sit on one node, while the smaller things zoom by
19:21:36 i do like the superset idea, if we have a good way for projects to chuck in the package+version they want a wheel of
19:21:58 fungi: i was probably thinking we run requirements.txt, just without a cap
19:22:00 yeah, a wheel mirror sounds way easier than a new layer
19:22:41 so let's call that #4, and execute #1 and #4, and leave ideas #2 and #3 on the shelf?
19:22:52 i can take an action item to look at the wheel mirror if we like
19:22:53 there is a slight gain from not aprallelizing, in that if projects with common dependencies are split between concurrent jobs then that dependency will be built twice, but ni practice that's probably not a huge concern
19:23:20 s/aprallelizing/parallelizing/
19:23:37 my hands are typing at incompatible speeds today
19:24:31 fungi: to one another?
19:24:44 at least, yes
19:25:03 one other thing about wheels is that some of the caveats in https://review.opendev.org/#/c/703916/ still applies ... we append wheels and never delete currently
19:25:38 ianw: thankfully these packages tend to be fairly small (unlike say the cuda pypi packages)
19:25:41 basically the cache grows unbounded, which is one thing with capped requirements but if we're chasing upstream might need thinking about
19:25:57 true, if we shard and build in parallel then deleting becomes much harder too
19:27:12 though if we're worried about space on our current mirror, i'd say the first course of action is to delete any wheels which are also present on pypi. we used to copy them (unnecessarily) into the mirror
19:27:37 we no longer do that, but since we only append...
19:28:22 also due to using afs we could clear the contents, rebuild, then publish the result periodically if we want to prune right?
19:28:46 the RO volume will serve the old contents until we switch
19:28:46 s/switch/vos release/
19:28:46 yeah, pruning as a separate step could be atomic, at least
19:28:50 this is true, but i think that all the jobs would timeout trying to refresh the cache
19:28:57 though we'd need to block additions while pruning
19:29:07 so it would be very manual
19:29:42 gotcha
19:30:08 ok we've got time in a bit to talk further about #1, anything else to bring up on #4 before we move on?
19:30:41 do you want to give me an action item on it? or is anyone else super keen?
19:31:07 if you're interested I would say go for it
19:31:38 ianw: if yo can get started and tag me for assistance/reviews that'd be great
19:31:53 ++ will do
19:31:55 #action ianw Work on incorporating non OpenStack requirements into our python wheel caches. corvus willing to assist
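
(Side note on the pruning idea discussed above, not part of the meeting log: a hedged sketch of how an atomic prune could look on AFS, repopulating the read-write copy and then running "vos release" so the read-only replicas switch over in one step. The AFS path and volume name are hypothetical placeholders, not the real mirror volume.)

```python
#!/usr/bin/env python3
# Sketch: prune and republish a wheel cache served from AFS. Readers keep
# seeing the old read-only contents until the "vos release" at the end.
import pathlib
import shutil
import subprocess

RW_PATH = pathlib.Path("/afs/.example.org/mirror/wheel")  # hypothetical RW path
VOLUME = "mirror.wheel"                                   # hypothetical volume name


def prune_and_publish(fresh_wheelhouse):
    # Clear the read-write contents (not the mount point itself).
    for child in RW_PATH.iterdir():
        if child.is_dir():
            shutil.rmtree(child)
        else:
            child.unlink()
    # Repopulate from a freshly built wheelhouse.
    shutil.copytree(fresh_wheelhouse, RW_PATH, dirs_exist_ok=True)
    # Publish the new contents out to the read-only replicas.
    subprocess.run(["vos", "release", VOLUME], check=True)


if __name__ == "__main__":
    prune_and_publish("wheelhouse")
```
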
19:32:33 #topic OpenDev
19:32:51 I announced deprecation for gerrit /p/ mirrors
19:32:53 #link http://lists.opendev.org/pipermail/service-announce/2020-July/000007.html Gerrit /p/ mirror deprecation.
19:33:10 #link https://review.opendev.org/#/c/743324/ Implement Gerrit /p/ mirror "removal"
19:33:18 and pushed that change to make it happen. I said I would do this friday
19:33:31 if we think the plan above is a bad one let me know and we can modify as necessary
19:33:59 just to confirm (i also left a review comment), is the idea to remove that line from the config when we upgrade?
19:34:09 The reasons for it are that gerrit needs that url for something else on newer versions which we plan to upgrade to and I've slowly been trying to update things to manage git branches and not having another mirror (that is going away anyway) makes that simpler
19:34:11 fungi: yes
19:34:13 or will it not break polygerrit dashboards?
19:34:18 ahh, okay, thanks
19:34:35 there will also need to be followon cleanup to stop syncing to the local git repos and all that
19:34:45 but I was going to wait on that to be sure we don't revert for some reason
19:35:02 ++
19:35:54 That was what I had for opendev subjects.
19:36:20 #topic General topics
19:36:28 #topic Bup and Borg backups
19:36:46 #link https://review.opendev.org/741366 Adds borg backup support and needs review
19:37:00 we'd also discussed the old backups and i checked on them
19:37:04 I've reviewed ^ and think it makes sense. Allows us to have bup and borg side by side too
19:37:23 ianw: they still happy even after the index cleanup?
19:37:24 review seems fine; i did a full extraction and here's some important bits compared : http://paste.openstack.org/show/796367/
19:37:48 zuul also seems ok, but i did notice we've missed adding that to the "new" server
19:37:51 https://review.opendev.org/#/c/743445/
19:38:02 #link https://review.opendev.org/#/c/743445/
19:38:52 if i can get reviews on the borg backup roles, i'll start a new server and we can try it with something
19:39:25 great, thank you for putting that together
19:39:27 (as a tangent on starting a new server)
19:39:30 #link https://review.opendev.org/743461
19:39:32 and
19:39:40 #link https://review.opendev.org/743470
19:39:52 adds sshfp records for our hosts, and makes the launch script print them too :)
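
(Side note, not from the meeting or the actual launch script: a rough sketch of what "print SSHFP records" for a newly launched host could look like, using ssh-keygen -r against the host keys. The hostname and host key paths are assumptions.)

```python
#!/usr/bin/env python3
# Sketch: emit the SSHFP DNS records for a host so they can be added to its
# zone, by running ssh-keygen -r against each public host key file.
import glob
import subprocess


def sshfp_records(hostname, key_glob="/etc/ssh/ssh_host_*_key.pub"):
    """Return SSHFP resource record lines for the given hostname."""
    records = []
    for pubkey in sorted(glob.glob(key_glob)):
        out = subprocess.run(
            ["ssh-keygen", "-r", hostname, "-f", pubkey],
            check=True, capture_output=True, text=True,
        )
        records.extend(out.stdout.splitlines())
    return records


if __name__ == "__main__":
    # Hypothetical hostname used for illustration only.
    for line in sshfp_records("example01.opendev.org"):
        print(line)
```
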
19:40:25 #topic GitHub 3rd Party CI
19:40:33 as promised earlier we can talk about this a bit more
19:41:13 Basically python cryptography doesn't have arm64 wheels. ianw filed an issue with them to start a conversation on whether or not we could help with that
19:41:15 #link https://github.com/pyca/cryptography/issues/5339
19:41:49 we also talked about it a bit ourselves and it is something we'd like to do if we can make all the pieces fit together such that cryptogrpahy is happy and all that
19:42:08 To do this we'll need to spin up a tenant for them
19:42:10 #link https://review.opendev.org/#/q/topic:opendev-3pci
19:42:20 though in short, the main reason they don't have arm64 manylinux1 wheels is that travis takes too long to test/build them because it uses emulation?
19:42:42 yes I believe travis is doing something very similar to our buildx nodepool image builds for docker
19:42:43 i'm not sure on the emulation part; i'm pretty sure they run on AWS
19:42:56 but the slow part, yes
19:43:13 i was thinking the tenant would be a 'cryptography' tenant
19:43:46 the main issue prompting giving them their own tenant is that they won't be coordinating with anyone else
19:43:48 yeah, or at most a pyca tenant
19:43:56 that seems reasonable too
19:44:10 yeah, that was something to discuss, i made the tenant deliberately a bit more generic thinking that it may be a home for a collection of things we might need/be interested in
19:44:15 (so a generic 3pci tenant seems weird since we'd never want to add anyone else there)
19:44:25 i concur
19:44:35 granularity is good here
19:44:47 already i've had mentions from people that lxml would like something similar
19:44:48 tenants would, ideally, mirror the authorities for them
19:45:22 its possbile other third party ci groups would be happy with a more generic pool but cryptography in particular seemed to want to be very much their own thing
19:45:32 so at least to start I agree with fungi and corvus
19:45:56 then evaluate if other python deps do similar
19:46:21 ok, i will rework it to a pyca tenant, that sounds like a good level
19:46:36 ++
19:47:10 Anything else to add?
19:47:39 * mordred thinks this is neat
19:47:49 not really, if we get that up, we can start to report on pull requests
19:47:52 i, too, think this is neat
19:48:01 #topic Gerrit project CI rework from Google
19:48:10 corvus: ^ want to fill us in
19:48:50 the google folks recently decided to abandon work on the checks plugin for gerrit
19:48:54 #link https://docs.google.com/document/d/1v2ETifhRXpuYlahtnfIK-1KL3zStERvbnEa8sS1pTwk/edit#
19:49:14 i think they mostly have an idea of where they want to go
19:49:44 would it still be a plugin, or something more directly integrated?
19:49:54 but they did ask a few folks to help them gather requirements, so i hopped on a video conference with ben and patrick and told them about us
19:50:09 fungi: possibly both and multiple plugins
19:50:14 ahh
19:50:19 corvus: this is driven by googles needs right? not necessarily a change of need for upstream itself?
19:51:00 yes, theoretically the checks plugin could live on
19:51:13 but google is driving most of the work in this area
19:51:40 the biggest driving force for the change is that internal google ci systems didn't want to adapt to the new checks api
19:52:13 corvus: did you suggest the internal google ci systems just migrate to zuul which already supports the new api? ;)
19:52:22 * mordred hides
19:52:23 anyway, i told them about how we do third-party ci, stream-events, firewalls, really big firewalls, polling, etc
19:52:38 mordred: yes actually, but that would be a change too
19:52:46 good point
19:53:00 i showed them hideci
19:53:03 (they said we should upgrade)
19:53:21 have they documented how to upgrade yet ;)
19:53:29 clarkb: yes?
19:53:44 corvus: I just remember all the undocumented info you and mordred brought back from gerrit user summit last year
19:53:45 step 1: dump all your existing data ;)
19:54:00 this is starting to derail, but i'm pretty sure since we're the last folks in the world running on 2.13 and everyone else has upgraded, that's pretty solid
19:54:06 nah - upgrade should be fairly straightforward
19:54:15 we just need to practice run it
19:54:21 yep
19:54:35 k
19:54:37 yeah, even my sso spec update assumes we're running newer gerrit
19:55:03 I mean I'm all onboard with upgrading its just still not entirely clear to me that we have a process (but I don't mean to derail from the CI discussion)
19:55:06 let me be perfectly clear: the fact that we're running 2.13 and have not upgraded is entirely our doing and has nothing to do with the willingness of the gerrit community to help us do it.
19:55:07 corvus: how did our use case resonate?
19:55:13 i didn't list it as a dependency because i figure we're very, very close
19:55:32 cases
19:55:58 also do they intend on disabling features like stream events?
19:56:13 if we have any problems upgrading, they really really want us to let them help us
19:56:28 anyway, sorry, i'll try to get back on track. where were we?
19:56:39 corvus: you told them about how we do things
19:56:55 and showed them hideci
19:57:06 and i emphasized the ui features of hideci, and what we've learned from that
19:57:13 about showing multiple ci systems, re-runs, etc
19:57:31 the idea that a change might have dozens of ci systems reporting on it
19:57:36 what summary information is important, why you would want to show successful runs, etc.
19:57:54 some of this was new to them, so i think it was a useful exercise.
19:58:19 i can't say for sure what will come out of it. i don't think google is in a position to commit to implementing things specifically to make our lives easier
19:58:46 but i did stress that to the extent that they can make use cases like these simple to deal with, it makes gerrit a better product
19:59:12 assuming that the checks plugin dies and the google thing doesn't work were basically in a similar spot to where we are today using stream events and commits (though they can be labeled as robot comments) right?
19:59:23 s/commits/comments/
19:59:28 i think they at least understand what we're doing, why, how, and what works and doesn't work, so hopefully that will inform their work and they'll end up with something we can use without too much difficulty
20:00:00 I guess my biggest concern is that we don't regress where even the simple thing we do today is no longer functional (hence my question about stream events)
20:00:07 something richer would be great, but we can probably make due if that doesn't pan out
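
(Side note illustrating "the simple thing we do today", not part of the meeting log: a minimal sketch of a third-party CI consumer following Gerrit's stream-events over SSH and reacting to new patchsets, later reporting back with a review comment. The account and host below are placeholders, not a real configuration.)

```python
#!/usr/bin/env python3
# Sketch: follow gerrit stream-events over SSH and react to new patchsets.
import json
import subprocess

# Hypothetical CI account and Gerrit host.
GERRIT = ["ssh", "-p", "29418", "third-party-ci@review.example.org"]


def stream_events():
    """Yield decoded JSON events from gerrit stream-events."""
    proc = subprocess.Popen(
        GERRIT + ["gerrit", "stream-events"],
        stdout=subprocess.PIPE, text=True,
    )
    for line in proc.stdout:
        yield json.loads(line)


if __name__ == "__main__":
    for event in stream_events():
        if event.get("type") == "patchset-created":
            change = event["change"]["number"]
            patchset = event["patchSet"]["number"]
            # A real CI system would enqueue builds here and later report
            # back with "gerrit review <change>,<patchset> --message ...".
            print(f"would trigger jobs for change {change},{patchset}")
```
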
20:00:09 clarkb: it's the reporting/ui thing that i think is the biggest issue
20:00:32 remind me what's our story for replacing hideci when we upgrade?
20:00:45 we don't have 100% of one yet
20:00:48 but ...
20:00:51 I think we had assumed checks was going to handle it for first party ci
20:00:53 our friends at wikimedia have a thing very similar to hideci
20:01:01 checks api having subchecks integration in the polygerrit ui
20:01:03 that's implemented as a proper polygerrit plugin
20:01:20 no - checks api has never been the plan for the upgrade, sorry, that might not have been clear
20:01:37 okay, so if that new plugin works, then yeah, we're probably good for a while
20:02:08 for me, the goal is still getting all this data out of the comment stream
20:02:56 corvus: ya that would be nice particularly to help reviewers not be distracted by all the CI activity
20:03:12 also we are at time now. I don't mind going for a few more minutes but we should probably wrap up
20:03:21 though feel free to continue discussion in #opendev
20:03:33 anyway, that's the gist -- i think they're interested in talking to us more as their plans solidify
20:03:54 i'm sure we can invite others to the next meeting if folks want
20:04:25 [eot]
20:05:11 cool thanks for your time everyone. As mentioned feel free to continue the discussion in #opendev
20:05:15 #endmeeting