19:01:01 <clarkb> #startmeeting infra
19:01:01 <opendevmeet> Meeting started Tue Dec  7 19:01:01 2021 UTC and is due to finish in 60 minutes.  The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:01 <opendevmeet> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:01 <opendevmeet> The meeting name has been set to 'infra'
19:01:05 <clarkb> #link http://lists.opendev.org/pipermail/service-discuss/2021-December/000305.html Our Agenda
19:01:27 <clarkb> #topic Announcements
19:02:17 <clarkb> This didn't make it onto the agenda because it didn't occur to me until this morning. We are fast approaching a holiday period for many of us. I'll be unable to make a meeting on the 21st and likely unable to make a meeting on January 4
19:02:40 <fungi> i'm okay with skipping those
19:03:15 <clarkb> ya I think we can go ahead and cancel the 21st and 4th. And I'll try hard to do a check in on the 28th though I expect things will get pretty quiet all around
19:03:46 <clarkb> everyone should enjoy the holidays and their associated time off. I'm going to attempt to do this myself :)
19:04:21 <ianw> ++ won't be regularly around then either
19:05:01 <clarkb> #topic Actions from last meeting
19:05:06 <clarkb> #link http://eavesdrop.openstack.org/meetings/infra/2021/infra.2021-11-30-19.01.txt minutes from last meeting
19:05:13 <clarkb> There weren't any actions recorded
19:05:18 <clarkb> #topic Topics
19:05:26 <clarkb> #topic Improving CD Throughput
19:05:41 <clarkb> We made some progress here and also took a step or two back but we learned some stuff
19:06:10 <clarkb> When we switched in the "setup source for system-config on bridge at the start of each buildset" change we missed a few important things, so we have since reverted that change
19:06:42 <clarkb> We need to make sure that we are using nodeless jobs, that we update system-config on bridge and not on a normal zuul node, that we honor DISABLE-ANSIBLE, and that every buildset runs this job first
19:06:55 <clarkb> The good news is that since we learned all of that we are able to regroup and make a new plan.
19:07:04 <clarkb> #link http://lists.opendev.org/pipermail/service-discuss/2021-December/000306.html has a good rundown of the next steps
19:07:36 <ianw> on the DISABLE-ANSIBLE ... i was thinking about that
19:07:38 <clarkb> infra-root ^ that email describes a refactor of various things to make it harder to make those previous mistakes again
19:08:06 <ianw> i feel like it makes sense to check that in the base job that sets up the system-config checkout for the buildset ... that will hold all prod jobs
19:08:29 <ianw> but maybe not so much in the prod jobs themselves.  that way, if a buildset starts, it completes, but a new one won't
19:09:06 <clarkb> ya I'm somewhat on the fence over that. To me if I disable ansible that means no more ansible even if the buildset is still running
19:09:20 <clarkb> But I can see an argument for allowing an in progress buildset to complete for consistency
19:09:22 <ianw> in parallel operation, it seems unclear what it would catch if you dropped it in the middle of a deploy buildset
19:09:41 <clarkb> What we can do if we really really need to stop the production line is move the authorized keys file aside
19:10:15 <clarkb> and usually we use that toggle when doing stuff like project renames, not as an emergency off switch (ssh authorized_keys seems better for that)
19:10:16 <corvus> or dequeue the job
19:10:27 <clarkb> ya I guess that too
19:10:37 <ianw> with zuul authenticated ui, that would be practical
19:10:50 <clarkb> ianw: we should update the documentation to make that behavior change clear though
19:10:58 <ianw> (currently, pretty sure the jobs would be done before i'd pulled up a login window and figured it out :)
19:11:30 <ianw> sure, i can post a doc update and we can discuss there
19:11:52 <clarkb> sounds good, thanks
19:12:07 <fungi> we talked about having the base job abort if the disable-ansible file is present, did i push that change (or has someone)? i can't recall now
19:12:32 <clarkb> fungi: I don't recall seeing a change for that. You did split it out into a separate role if we wanted to consume it in multiple jobs
19:13:25 <fungi> oh, right
19:13:28 <ianw> fungi: abort as in abort, or do the pause thing it does now?
19:13:39 <fungi> i think the pause thing it does now, sorry
19:13:41 <clarkb> infra-root ^ if you can review the changes outlined in that email that would be great. I'm planning on digging in this afternoon myself. I think we're really close to being able to start updating semaphores and getting parallel runs which is exciting
19:14:36 <ianw> fungi: that would be the status quo I believe, as that is checked in the setup-src job
19:15:00 <ianw> currently every prod job runs that; after the changes, only the bootstrap-bridge (that all other jobs depend on) would run it
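A minimal sketch of that check as a task in the bootstrap job, assuming the flag file lives at /home/zuul/DISABLE-ANSIBLE (the path and timeout here are illustrative, not necessarily what system-config uses):

    - name: Wait for the DISABLE-ANSIBLE flag file to be removed
      # wait_for with state: absent blocks until the file disappears, so a
      # new buildset holds at the bootstrap job while the flag is in place
      wait_for:
        path: /home/zuul/DISABLE-ANSIBLE
        state: absent
        timeout: 3600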
19:15:34 <clarkb> I think we have to avoid soft dependencies to make that work, but I was already asking for that.
19:15:42 <fungi> yeah, and the problem we ran into was that subsequent jobs didn't check it so proceeded normally when setup-src got skipped
19:16:04 <clarkb> I suspect this is because in the current system, if you set the disable-ansible file, they all run serially, failing and retrying in a loop until they have failed 3 times in a row? Or maybe that is only when you pull the ssh keys out
19:16:09 <fungi> as belt-and-braces safety we could check it in the job they all inherit from
19:16:12 <fungi> or in base
19:16:27 <clarkb> fungi: ya I think that is an artifact of the soft dependency
19:16:28 <ianw> right, yes the base job (after proposed changes) is a hard dependency that should always run (no file matchers)
19:16:35 <clarkb> if we make it a hard dependency then they shouldn't proceed
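For context, the distinction maps to Zuul project-pipeline config roughly like the sketch below; the pipeline and job names follow the discussion above but the exact config is illustrative:

    - project:
        deploy:
          jobs:
            - infra-prod-bootstrap-bridge
            - infra-prod-service-example:    # hypothetical prod job
                dependencies:
                  - name: infra-prod-bootstrap-bridge
                    # hard dependency: if the parent fails or is skipped,
                    # this job does not run
                    soft: false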
19:17:29 <corvus> if you don't want child jobs to run, you can filter them out of the list
19:17:45 <clarkb> corvus: isn't that what a hard dependency failing to succeed will already do?
19:17:51 <corvus> https://zuul-ci.org/docs/zuul/reference/jobs.html#skipping-dependent-jobs
19:18:19 <ianw> yeah, it was supposed to be a hard dependency in this case -- it has to run to checkout the system-config source for the buildset
19:18:20 <corvus> yeah, but i think you could do "child_jobs: []" to cause 0 child jobs to run regardless of hard/soft
19:18:33 <corvus> so if you want to do it with soft, that could be a way
19:18:40 <corvus> but if it needs to be hard for other reasons, then meh.  :)
19:18:44 <clarkb> gotcha, ya in this case I think we need a hard dependency either way
19:18:45 <ianw> it was a bug to not run it, not the intention
19:19:31 <corvus> ack
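For reference, the child_jobs mechanism corvus links to is invoked from within a job's playbook, something like this sketch based on those docs:

    - name: Skip all remaining child jobs in this buildset
      # an empty child_jobs list tells Zuul to run none of the dependent
      # jobs, regardless of whether the dependencies are hard or soft
      zuul_return:
        data:
          zuul:
            child_jobs: []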
19:19:45 <clarkb> Let's continue on as we have a few other subjects to cover
19:19:50 <clarkb> #topic User management on our systems
19:20:04 <clarkb> Yesterday we managed to update the matrix-gerritbot image to run under the gerritbot user
19:20:32 <clarkb> I think what we learned from this exercise is that even images that appear simple and "read only" can be complicated, and that setting the user a container runs under is going to be an image by image exercise
19:20:42 <clarkb> that said I still think there is value in this and we should try to pick them off as we can
19:21:18 <clarkb> But beware that we need to be careful about permissions within the image and bind mounts, as well as the expectations of the running processes. Turns out openssh fails if it is running as a user without an entry in /etc/passwd
19:21:56 <clarkb> At this point I don't think there is anything else to review or cover other than to say, if you've got free time you might look into updating one of our containers :)
19:22:08 <ianw> do we have a list?
19:22:11 <clarkb> IRC bots in particular seem like good targets since they all run on a shared host
19:22:23 <clarkb> ianw: I haven't made a comprehensive one yet as I was mostly going to focus on eavesdrop to start
19:22:41 <clarkb> low impact from our perspective to restart them and debug as we go, but also relatively high ROI since they share a host
19:22:52 <clarkb> most other systems are all dedicated hosts so less returns
19:22:59 <ianw> ok, np.  i know i had issues when haproxy switched *to* having a separate user with the LB setup
19:24:33 <clarkb> ianw: looking at my notes hound, lodgeit, refstack, grafana are others. But this isn't a comprehensive list I don't think
19:24:47 <clarkb> but ya I was focusing on ircbots to start since all of ^ are on dedicated hosts
19:25:37 <clarkb> Anyway as mentioned I/we have learned a bit doing this for the gerritbots and there are more irc/matrix bots to address. Also the services above. If you've got time feel free to pick them off. Our testing helps with ensuring it is happy too
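For anyone picking one off, the gerritbot change boiled down to this shape of docker-compose entry; the image name, uid:gid, and mount paths below are illustrative rather than the exact values used:

    services:
      gerritbot-matrix:
        image: opendevorg/matrix-gerritbot   # assumed image name
        user: "11000:11000"                  # dedicated uid:gid instead of root
        volumes:
          # bind-mounted config and state must be readable by that uid
          - /etc/matrix-gerritbot:/config:ro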
19:25:49 <clarkb> #topic Zuul Gearman Going Away
19:26:13 <clarkb> Zuul's gearman tooling is very close to being deleted. This means we can no longer use the zuul gearman commands to enqueue/dequeue etc
19:26:27 <clarkb> Instead we'll need to use zuul-client to talk to the REST API for this, which requires a JWT
19:26:54 <clarkb> corvus has changes up to set up a local JWT for administrative tasks on our zuul installation. We should also update our docs and our queue saving scripts to match when that is ready
19:27:08 <corvus> i think they just merged (thanks fungi )
19:27:31 <corvus> with those in place, i'll generate a jwt and set up zuul-client
19:27:32 <fungi> yeah, so we should in theory still be able to run them from a shell on the server without looking up credentials
19:27:55 <corvus> note, zuul-client != zuul.  they are very similar, but only zuul-client has the ability to read a jwt from a config file.
19:28:33 <corvus> we will probably remove the admin commands from zuul eventually too since they are redundant and not as useful as zuul-client's implementation
19:28:53 <corvus> so anyway, that'll be "zuul-client enqueue" etc in the future
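Roughly, the future workflow from the server would look like the task below; the tenant, project, and change values are made up, and the API url and token come from the zuul-client config file mentioned above:

    - name: Re-enqueue a change through the Zuul REST API
      # equivalent to running the zuul-client command by hand on bridge
      shell: >
        zuul-client enqueue --tenant openstack --pipeline gate
        --project opendev/system-config --change 123456,1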
19:29:06 <fungi> thanks!
19:29:07 <clarkb> yup mostly calling this out so people are aware and that we don't forget to update docs and our queue saving script
19:29:30 <clarkb> #topic keycloak.opendev.org
19:30:04 <clarkb> On the subject of authentication we now have a keycloak server to experiment with
19:30:46 <clarkb> The main thing I wanted to clarify on this is that the server is currently in a pre-production state, right? We shouldn't be relying on this for anything production-like and should instead use it to figure out how to make keycloak work according to our auth spec that fungi wrote
19:31:05 <clarkb> for example we can integrate keycloak with zuul's new auth stuff but we aren't doing that yet while we learn about keycloak?
19:31:29 <clarkb> or maybe if we do that it will be in a limited capacity and functionality could come and go. We'll continue to rely on local auth for admin stuff
19:31:39 <fungi> i'm willing to be flexible there
19:31:50 <corvus> it is in pre-prod.  expect data to disappear at any time.
19:32:08 <corvus> i would like to go ahead and create a realm for use with zuul... i think maybe something simple where a few of us make some accounts manually or something
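A sketch of bootstrapping such a realm with Keycloak's admin CLI, wrapped in an Ansible task; the kcadm.sh path, server URL, and credential handling are assumptions about the deployment rather than known details:

    - name: Create a throwaway zuul realm for experimentation
      shell: |
        # log the CLI in against the built-in master realm first
        /opt/jboss/keycloak/bin/kcadm.sh config credentials \
          --server http://localhost:8080/auth --realm master \
          --user admin --password "{{ keycloak_admin_password }}"
        # then create an empty, enabled realm named zuul
        /opt/jboss/keycloak/bin/kcadm.sh create realms -s realm=zuul -s enabled=true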
19:32:15 <fungi> one thing we already learned is it's apparently still easy to accidentally create multiple accounts when you use different ids if you don't link them in advance
19:32:34 <corvus> yeah.  that can be resolved, but only if we allow password authentication.
19:33:10 <corvus> (like, you can fix that in a self-service way, but only if password auth is available too)
19:33:12 <clarkb> Seems like we should avoid that if we can, to make sure people understand we aren't intending this to be an actual auth identity yet
19:33:41 <fungi> which we had previously wanted to avoid so we could not be in the business of having a database of passwords as a high-profile target, nor deal with frequent password reset requests. it's something we'll need to weigh as the poc moves along
19:34:05 <corvus> i think that's worth revisiting.  here's a thought experiment:
19:34:27 <clarkb> ya and I bet it is impossible to disable the password auth for external identity usages, because that identity should be the same for any method used to authenticate via keycloak
19:34:35 <corvus> how different is a database of passwords from a database of mappings from a threat POV.
19:34:49 <corvus> sorry, was meant to be a question
19:35:04 <fungi> if users avoid reusing passwords, not terribly different. but users often reuse passwords
19:35:08 <clarkb> if there was a way to run it where password auth only let you run keycloak account tasks and not log in elsewhere I think that would be fine
19:35:29 <clarkb> But I strongly suspect that isn't how things are designed
19:35:35 <corvus> anyway -- not something we need to answer now, but i do think it's worth revisiting that with updated knowledge
19:35:51 <corvus> clarkb: i couldn't say whether that's possible or not
19:36:01 <fungi> also if we add passwords, we probably need to add integrated 2fa
19:36:25 <fungi> which could become its own support burden
19:36:30 <ianw> iiuc, a holdup for gerrit conversion was that keycloak doesn't allow adding launchpad/openid right?  but there was a theory that it wouldn't be too hard to add?  is that accurate?
19:36:30 <clarkb> ya all stuff to explore. Maybe figure out if 2fa is viable and if we can require it for example to mitigate the concerns with passwords
19:36:35 <corvus> clarkb:  there's a lot of workflow-by-form stuff, so maybe something can be created for that.  but it's certainly not a "checkbox" :)
19:36:47 <corvus> ianw: yes
19:36:55 <clarkb> ianw: yes there is a php saml tool thing that can translate to other backends and keycloak speaks the saml to that php tool in theory
19:36:57 <corvus> 2fa is available and is a "checkbox" :)
19:37:07 <fungi> ianw: yes, there's a proposal in the spec to create a sort of bridge from keycloak to openid via phpsimplesaml
19:37:32 <fungi> corvus: i expect turning on 2fa is not hard, but helping users reset it every time they lock themselves out might be
19:37:36 <clarkb> writing that bit would be a good next step for someone interested in experimenting with keycloak more
19:37:38 <ianw> cool, well this seems like a great step in having an environment we can test that too.  i'd be interested in working on that in the future
19:37:57 <clarkb> ianw: ++ having the actual service up gives us something to look at that is more than theoretical
19:38:18 <clarkb> I'll also need to finish the gerrit user cleanups so that we can update the external ids database in a straightforward manner
19:38:26 <corvus> yeah... and i won't be able to drive this, so having other folks step in and pick it up would be great
19:39:00 <clarkb> Alright tldr is work to be done, feel free to experiment, but this isn't for production use yet
19:39:03 <clarkb> anything else?
19:40:03 <clarkb> #topic Adding a lists.openinfra.dev mailman site
19:40:25 <fungi> i'm still trying to fix things to make our current mailman orchestration go
19:40:28 <clarkb> fungi and I ran into some trouble with newlist when doing this that we thought we had corrected. Long story short newlist is still looking for input to confirm emailing people
19:40:56 <clarkb> seems that redirecting /dev/null into newlist corrects this, but it also exposes that our testing is different from prod
19:41:12 <fungi> i was able to reproduce it with a dnm change, and determined that redirecting stdin from /dev/null in a shell task properly solves it
19:41:12 <clarkb> fungi: the plan is to update our system-config-run jobs to all block port 25 outbound then we can tell newlist to send email right?
19:41:22 <clarkb> oh sorry I'll let fungi fill us in :)
19:41:23 <fungi> setting stdin to a null string in a cmd task does not have the same effect
19:41:33 <fungi> (which is what we had merged previously)
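Concretely, the difference looks like the sketch below; the list name, admin address, and password are placeholders:

    # Works: a shell task can redirect stdin from /dev/null so newlist
    # never sits waiting for a confirmation keypress
    - name: Create the mailing list non-interactively
      shell: newlist examplelist listadmin@example.org changeme < /dev/null

    # What had merged previously: stdin as a null string on a command
    # task, which did not have the same effect in practice
    - name: Create the mailing list
      command: newlist examplelist listadmin@example.org changeme
      args:
        stdin: ""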
19:42:12 <fungi> and yeah, i have changes up to collect exim logs so we can see what's trying to send e-mail through the mta in tests, as well as blocking 25/tcp egress to prevent our deploy jobs from accidentally sending e-mail
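The egress block could be as small as a task of this shape in the test-node firewall setup (a guess at the form, not the actual change under review):

    - name: Reject outbound SMTP so test jobs cannot send real mail
      iptables:
        chain: OUTPUT
        protocol: tcp
        destination_port: "25"
        jump: REJECT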
19:42:37 <fungi> and then i'm dropping the test-time addition of the -q option for the newlist command
19:42:55 <clarkb> fungi: are any of those ready for review yet?
19:43:19 <fungi> probably, though i have a pending update to one of them once i get test results back from the latest revision to the iptables change
19:43:34 <fungi> and haven't rebased the dropping of -q onto that stack yet
19:43:40 <clarkb> gotcha, feel free to ping me when you want reviews and I'll happily take a look
19:43:57 <fungi> should hopefully have it up right after the meeting, and then once those merge we can add new mailing lists again more easily
19:44:06 <clarkb> thanks!
19:44:49 <fungi> topic:mailman-lists
19:44:55 <fungi> in case anyone's looking for them
19:45:12 <clarkb> #topic Gerrit User Summit
19:45:31 <clarkb> Gerrit User Summit happened last week. I found it useful to catch up on some of the gerrit upstream activities
19:45:54 <clarkb> I took notes and they are in my homedir on review02. But I'll try to summarize some of the interesting bits really quickly here
19:46:03 <clarkb> Gerrit 3.2 is EOL. Thank you ianw for helping get us to 3.3
19:46:33 <clarkb> The new Checks UI work relies on a plugin in the Gerrit server that queries CI systems for results/status and then renders them in a consistent way regardless of the CI system
19:46:57 <clarkb> this means that we could probably replace the Zuul summary plugin with a Checks UI plugin using this new system. But I think that is 3.4 and beyond. Not a 3.3 thing
19:47:37 <corvus> and that's a java plugin?  or is it a pg plugin?
19:47:54 <clarkb> I think that is a java plugin because you have to interact with gerrit internal state
19:48:16 <clarkb> the plugin acts as a data retrieval and conformance system between your CI system and the checks UI
19:48:29 <clarkb> and I think that requires you make writes somewhere which I suspect the js stuff can't do
19:48:39 <clarkb> however, that wasn't entirely clear to me so I could be wrong
19:49:13 <clarkb> Gerrit is working towards deleting prolog for complex acl rule applications. Instead they are replacing it with "Composable Submit Requirements" which use a simple query language based on Gerrit's existing query language
19:49:31 <clarkb> you essentially write rules that say "if this gerrit query returns a result then this rule applies to this change"
19:49:42 <clarkb> and the rules can say this is required for submitting etc
19:50:22 <clarkb> I don't expect we'll migrate to this quickly for anything though random users may use it for various additional checks. However, we should be careful to ensure we don't accidentally reduce our requirements for submitting via zuul
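For a flavor of the new syntax, a submit requirement in a project.config looks something like this example adapted from the upstream documentation, not anything we run (shown in git-config format since that is what Gerrit uses):

    [submit-requirement "Code-Review"]
        description = A maximum Code-Review vote and no veto is required
        submittableIf = label:Code-Review=MAX AND -label:Code-Review=MIN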
19:51:08 <clarkb> There is a ChronicleMap libmodule plugin for persistent caches. This apparently improves performance quite a bit since you don't lose cache data when restarting gerrit. Some people suggested it be incorporated directly into Gerrit rather than a plugin
19:51:34 <clarkb> Our performance is pretty good these days and we don't restart Gerrit often, but it may be worth looking into as some users (those talking to nova stuff) have indicated slowness after restarts
19:52:02 <clarkb> And finally the Gerrit meetings are open to the entire community. You can also put stuff on their agenda if you have something specific you want to discuss
19:52:12 <clarkb> This is something I wasn't clear about since I think they title them the EC meeting or similar
19:52:25 <clarkb> I'll probably try to start attending these once I figure out when they happen
19:53:30 <clarkb> So ya feel free to ask me any questions if you have them though I'm still a gerrit community noob
19:53:46 <clarkb> Overall I think the event went well and I learned some stuff about what to look for for the future
19:54:38 <clarkb> #topic Nodepool Image cleanups
19:54:42 <ianw> thanks for attending/the summary!
19:55:33 <clarkb> We've got a number of images that are either under maintained or going EOL
19:55:36 <clarkb> #link http://lists.opendev.org/pipermail/service-announce/2021-December/000029.html
19:55:47 <clarkb> I sent email outlining a rough plan for cleanups to service-announce
19:56:07 <clarkb> This should reduce a lot of pressure on our AFS volumes too
19:57:16 <clarkb> If we do get responses for opensuse and gentoo help it would be good to maybe also try and run periodic jobs on those platforms somewhere
19:57:21 <clarkb> to serve as a signal when they break?
19:57:34 <clarkb> An idea I had that might make maintenance a bit more responsive in the future
19:57:38 <clarkb> #topic Open Discussion
19:57:41 <ianw> we do have a periodic run of zuul-jobs that tries everything
19:57:51 <clarkb> ianw: ah ok so those volunteers could watch that
19:57:53 <ianw> but if a job fails in the woods with nobody listening ... :)
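The canary idea could be as small as attaching a job per platform to the periodic pipeline, along these lines (the job name is hypothetical):

    - project:
        periodic:
          jobs:
            - base-test-opensuse-15   # hypothetical job exercising the image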
19:58:03 <ianw> speaking of
19:58:04 <clarkb> We are almost out of time, anything else?
19:58:18 <ianw> #link https://review.opendev.org/c/zuul/zuul-jobs/+/818702
19:58:30 <jentoio> I'd like to help/volunteer
19:58:32 <ianw> that will enable f35, which seems to have gone smoothly
19:59:03 <clarkb> jentoio: cool, can you respond to that email so that we can help keep track of it and not miss that there is interest?
19:59:23 <jentoio> sure, I was hoping we can meet for coffee to discuss
19:59:37 <jentoio> since we live near each other - unless you moved ;)
19:59:46 <ianw> i can also lookup the details of the zuul-jobs runs and post
20:00:02 <clarkb> I'm still in the same part of the world though I don't get out much these days
20:00:19 <jentoio> but I'll respond to email as well.
20:00:33 <clarkb> ya I think that helps since others are involved as well and around the world
20:00:43 <clarkb> but I'm happy to discuss further as well. And we are at time
20:00:46 <clarkb> thank you everyone!
20:00:49 <clarkb> #endmeeting