Tuesday, 2021-12-07

clarkbOur meeting will start in a few minutes.18:58
ianwo/19:00
clarkb#startmeeting infra19:01
opendevmeetMeeting started Tue Dec  7 19:01:01 2021 UTC and is due to finish in 60 minutes.  The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.19:01
opendevmeetUseful Commands: #action #agreed #help #info #idea #link #topic #startvote.19:01
opendevmeetThe meeting name has been set to 'infra'19:01
clarkb#link http://lists.opendev.org/pipermail/service-discuss/2021-December/000305.html Our Agenda19:01
clarkb#topic Announcements19:01
clarkbThis didn't make it onto the agenda because it didn't occur to me until this morning. We are fast approaching a holiday period for many of us. I'll be unable to make a meeting on the 21st and likely unable to make a meeting on January 419:02
fungii'm okay with skipping those19:02
clarkbya I think we can go ahead and cancel the 21st and 4th. And I'll try hard to do a check in on the 28th though I expect things will get pretty quiet all around19:03
clarkbeveryone should enjoy the holidays and their associated time off. I'm going to attempt to do this myself :)19:03
ianw++ won't be regularly around then either19:04
clarkb#topic Actions from last meeting19:05
clarkb#link http://eavesdrop.openstack.org/meetings/infra/2021/infra.2021-11-30-19.01.txt minutes from last meeting19:05
clarkbThere weren't any actions recorded19:05
clarkb#topic Topics19:05
clarkb#topic Improving CD Throughput19:05
clarkbWe made some progress here and also took a step or two back but we learned some stuff19:05
clarkbWhen we switched in the "setup source for system-config on bridge at the start of each buildset" change we missed a few important things, so we have reverted that change19:06
clarkbWe need to make sure that we are using nodeless jobs, that we update system-config on bridge and not on a normal zuul node, we need to honor DISABLE-ANSIBLE and we need to be sure every buildset has this job run first19:06
clarkbThe good news is that since we learned all of that we are able to regroup and make a new plan.19:06
clarkblink http://lists.opendev.org/pipermail/service-discuss/2021-December/000306.html has a good rundown of the next steps19:07
clarkb#link http://lists.opendev.org/pipermail/service-discuss/2021-December/000306.html has a good rundown of the next steps19:07
ianwon the DISABLE-ANSIBLE ... i was thinking about that19:07
clarkbinfra-root ^ that email describes a refactor of various things to make it harder to make those previous mistakes again19:07
ianwi feel like it makes sense to check that in the base job that sets up the system-config checkout for the buildset ... that will hold all prod jobs19:08
ianwbut maybe not so much in the prod jobs themselves.  that way, if a buildset starts, it completes, but a new one won't19:08
clarkbya I'm somewhat on the fence over that. To me if I disable ansible that means no more ansible even if the buildset is still running19:09
clarkbBut I can see an argument for allowing an in progress buildset to complete for consistency19:09
ianwin parallel operation, it seems unclear if you dropped it in the middle of a deploy buildset what it would catch19:09
clarkbWhat we can do if we really really need to stop the production line is move the authorized keys file aside19:09
clarkband usually we use that toggle when doing stuff like project renames, not as an emergency off switch (ssh authorized_keys seems better for that)19:10
corvusor dequeue the job19:10
clarkbya I guess that too19:10
ianwwith zuul authenticated ui, that would be practical19:10
clarkbianw: we should update the documentation to make that behavior change clear though19:10
ianw(currently, pretty sure the jobs would be done before i'd pulled up a login window and figured out :)19:10
ianwsure, i can post a doc update and we can discuss there19:11
clarkbsounds good, thanks19:11
fungiwe talked about having the base job abort if the disable-ansible file is present, did i push that change (or has someone)? i can't recall now19:12
clarkbfungi: I don't recall seeing a change for that. You did split it out into a separate role if we wanted to consume it in multiple jobs19:12
fungioh, right19:13
ianwfungi: abort as in abort, or do the pause thing it does now?19:13
fungii think the pause thing it does now, sorry19:13
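
The "pause" behavior discussed above can be sketched roughly as follows. This is a minimal illustration, not the actual role: the flag file path, the polling interval, and the timeout are all assumptions, since the log does not state them.

```python
import os
import time

# Hypothetical location of the flag file; the log does not give the real path.
DISABLE_FILE = "/tmp/DISABLE-ANSIBLE"


def wait_while_disabled(path=DISABLE_FILE, poll_seconds=1.0, timeout_seconds=10.0):
    """Block while the flag file exists.

    Returns True once the file is gone, or False if it is still present
    when the timeout expires (the caller would then fail the job).
    """
    deadline = time.monotonic() + timeout_seconds
    while os.path.exists(path):
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_seconds)
    return True
```

Checking this only in the first job of a buildset would give the behavior ianw describes: a buildset that has already started completes, but a new one waits.
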
clarkbinfra-root ^ if you can review the changes outlined in that email that would be great. I'm planning on digging in this afternoon myself. I think we're really close to being able to start updating semaphores and getting parallel runs which is exciting19:13
ianwfungi: that would be the status quo I believe, as that is checked in the setup-src job19:14
ianwcurrently every prod job runs that; after the changes, only the bootstrap-bridge (that all other jobs depend on) would run it19:15
clarkbI think we have to avoid soft dependencies to make that work, but I was already asking for that.19:15
fungiyeah, and the problem we ran into was that subsequent jobs didn't check it so proceeded normally when setup-src got skipped19:15
clarkbI suspect this because in the current system if you set the disable-ansible file they all run serially failing and retrying in a loop until they have failed 3 times in a row? Or maybe that is only when you pull the ssh keys out19:16
fungias belt-and-braces safety we could check it in the job they all inherit from19:16
fungior in base19:16
clarkbfungi: ya I think that is an artifact of the soft dependency19:16
ianwright, yes the base job (after proposed changes) is a hard dependency that should always run (no file matchers)19:16
clarkbif we make it a hard dependency then they shouldn't proceed19:16
corvusif you don't want child jobs to run, you can filter them out of the list19:17
clarkbcorvus: isn't that what a hard dependency failing to succeed will already do?19:17
corvushttps://zuul-ci.org/docs/zuul/reference/jobs.html#skipping-dependent-jobs19:17
ianwyeah, it was supposed to be a hard dependency in this case -- it has to run to checkout the system-config source for the buildset19:18
corvusyeah, but i think you could do "child_jobs: []" to cause 0 child jobs to run regardless of hard/soft19:18
corvusso if you want to do it with soft, that could be a way19:18
corvusbut if it needs to be hard for other reasons, then meh.  :)19:18
clarkbgotcha, ya in this case I think we need a hard dependency either way19:18
ianwit was a bug to not run it, not the intention19:18
corvusack19:19
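
The hard/soft dependency and `child_jobs` discussion above can be summarized with a toy model. This is a deliberately simplified sketch, not Zuul's implementation: the function, the job names, and the three parent states are hypothetical, and a hard dependency on a parent that did not run is modelled simply as "child does not run".

```python
def children_to_run(children, parent_state, child_jobs=None):
    """Toy model: return the names of child jobs that would run.

    children:     list of (name, kind) pairs, kind is "hard" or "soft".
    parent_state: "success", "failure", or "not_run" (e.g. a file
                  matcher excluded the parent from the buildset).
    child_jobs:   optional list returned by a successful parent to
                  filter its children; an empty list selects none.
    """
    if parent_state == "failure":
        # A failed parent blocks its children, hard or soft.
        return []
    if parent_state == "not_run":
        # Only soft dependencies ignore a parent that never ran.
        return [name for name, kind in children if kind == "soft"]
    # Parent succeeded: apply the optional child_jobs filter.
    if child_jobs is None:
        return [name for name, _ in children]
    return [name for name, _ in children if name in child_jobs]
```

In this model the bug fungi describes corresponds to the `not_run` case with soft children, which proceed as if nothing happened.
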
clarkbLet's continue on as we have a few other subjects to cover19:19
clarkb#topic User management on our systems19:19
clarkbYesterday we managed to update the matrix-gerritbot image to run under the gerritbot user19:20
clarkbI think what we learned from this exercise is that even simple "read only" appearing images can be complicated and that setting users to run a container under is going to be an image by image exercise19:20
clarkbthat said I still think there is value in this and we should try to pick them off as we can19:20
clarkbBut beware that we need to be careful about permissions within the image and bind mounts as well as expectations of the running processes. Turns out openssh fails if it is running as a user without an entry in /etc/passwd19:21
clarkbAt this point I don't think there is anything else to review or cover other than to say, if you've got free time you might look into updating one of our containers :)19:21
ianwdo we have a list?19:22
clarkbIRC bots in particular seem like good targets since they all run on a shared host19:22
clarkbianw: I haven't made a comprehensive one yet as I was mostly going to focus on eavesdrop to start19:22
clarkblow impact from our perspective to restart them and debug as we go, but also relatively high ROI since they share a host19:22
clarkbmost other systems are all dedicated hosts so less returns19:22
ianwok, np.  i know i had issues when haproxy switched *to* having a separate user with the LB setup19:22
clarkbianw: looking at my notes hound, lodgeit, refstack, grafana are others. But this isn't a comprehensive list I don't think19:24
clarkbbut ya I was focusing on ircbots to start since all of ^ are on dedicated hosts19:24
clarkbAnyway as mentioned I/we have learned a bit doing this for the gerritbots and there are more irc/matrix bots to address. Also the services above. If you've got time feel free to pick them off. Our testing helps with ensuring it is happy too19:25
clarkb#topic Zuul Gearman Going Away19:25
clarkbZuul's gearman tooling is very close to being deleted. This means we can no longer use the zuul gearman commands to enqueue/dequeue etc19:26
clarkbInstead we'll need to use Zuul client to talk to the REST API for this which requires a JWT19:26
clarkbcorvus has changes up to set up a local JWT for administrative tasks on our zuul installation. We should also update our docs and our queue saving scripts to match when that is ready19:26
corvusi think they just merged (thanks fungi )19:27
corvuswith those in place, i'll generate a jwt and set up zuul-client19:27
fungiyeah, so we should in theory still be able to run them from a shell on the server without looking up credentials19:27
corvusnote, zuul-client != zuul.  they are very similar, but only zuul-client has the ability to read a jwt from a config file.19:27
corvuswe will probably remove the admin commands from zuul eventually too since they are redundant and not as useful as zuul-client's implementation19:28
corvusso anyway, that'll be "zuul-client enqueue" etc in the future19:28
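
For readers unfamiliar with JWTs: a token is three base64url segments (header, claims, signature) joined by dots. The following is a minimal sketch using only the Python standard library, assuming an HS256 shared secret. The claim names and values, including the `zuul` admin claim and the tenant name, are illustrative assumptions and are not taken from this log or from the deployed configuration.

```python
import base64
import hashlib
import hmac
import json
import time


def _b64url(data: bytes) -> str:
    """Base64url-encode without padding, as JWTs require."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _sign(signing_input: str, secret: bytes) -> str:
    digest = hmac.new(secret, signing_input.encode("ascii"), hashlib.sha256).digest()
    return _b64url(digest)


def make_hs256_jwt(claims: dict, secret: bytes) -> str:
    """Build header.claims.signature for the given claims."""
    header = {"alg": "HS256", "typ": "JWT"}
    signing_input = ".".join(
        _b64url(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in (header, claims)
    )
    return signing_input + "." + _sign(signing_input, secret)


def verify_hs256_jwt(token: str, secret: bytes) -> dict:
    """Check the signature and return the decoded claims."""
    signing_input, _, signature = token.rpartition(".")
    if not hmac.compare_digest(signature, _sign(signing_input, secret)):
        raise ValueError("bad signature")
    payload = signing_input.split(".")[1]
    padded = payload + "=" * (-len(payload) % 4)
    return json.loads(base64.urlsafe_b64decode(padded))


def example_admin_claims(tenant: str, lifetime_seconds: int = 600) -> dict:
    """Hypothetical claim set for a short-lived admin token."""
    return {
        "sub": "local-admin",                        # hypothetical subject
        "exp": int(time.time()) + lifetime_seconds,  # expiry timestamp
        "zuul": {"admin": [tenant]},                 # assumed claim layout
    }
```

Note that `verify_hs256_jwt` only checks the signature; a real verifier would also reject expired tokens and unexpected algorithms.
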
fungithanks!19:29
clarkbyup mostly calling this out so people are aware and that we don't forget to update docs and our queue saving script19:29
clarkb#topic keycloak.opendev.org19:29
clarkbOn the subject of authentication we now have a keycloak server to experiment with19:30
clarkbThe main thing I wanted to clarify on this is currently the server is in a pre production state right? we shouldn't be relying on this for anything production like and instead use it to figure out how to make keycloak work according to our auth spec that fungi wrote19:30
clarkbfor example we can integrate keycloak with zuul's new auth stuff but we aren't doing that yet while we learn about keycloak?19:31
clarkbor maybe if we do that it will be in a limited capacity and functionality could come and go. We'll continue to rely on local auth for admin stuff19:31
fungii'm willing to be flexible there19:31
corvusit is in pre-prod.  expect data to disappear at any time.19:31
corvusi would like to go ahead and create a realm for use with zuul... i think maybe something simple where a few of us make some accounts manually or something19:32
fungione thing we already learned is it's apparently still easy to accidentally create multiple accounts when you use different ids if you don't link them in advance19:32
corvusyeah.  that can be resolved, but only if we allow password authentication.19:32
corvus(like, you can fix that in a self-service way, but only if password auth is available too)19:33
clarkbSeems like we should avoid that if we can to make sure people understand we aren't intending to be an actual auth identity19:33
fungiwhich we had previously wanted to avoid so we could not be in the business of having a database of passwords as a high-profile target, nor deal with frequent password reset requests. it's something we'll need to weigh as the poc moves along19:33
corvusi think that's worth revisiting.  here's a thought experiment:19:34
clarkbya and I bet it is impossible to disable the password auth for external identity usages because that identity should be the same for any method used to authenticate via keycloak19:34
corvushow different is a database of passwords from a database of mappings from a threat POV.19:34
corvussorry, was meant to be a question19:34
fungiif users avoid reusing passwords, not terribly different. but users often reuse passwords19:35
clarkbif there was a way to run it where password auth only let you run keycloak account tasks and not log in elsewhere I think that would be fine19:35
clarkbBut I strongly suspect that isn't how things are designed19:35
corvusanyway -- not something we need to answer now, but i do think it's worth revisiting that with updated knowledge19:35
corvusclarkb: i couldn't say whether that's possible or not19:35
fungialso if we add passwords, we probably need to add integrated 2fa19:36
fungiwhich could become its own support burden19:36
ianwiiuc, a holdup for gerrit conversion was that keycloak doesn't allow adding launchpad/openid right?  but there was a theory that it wouldn't be too hard to add?  is that accurate?19:36
clarkbya all stuff to explore. Maybe figure out if 2fa is viable and if we can require it for example to mitigate the concerns with passwords19:36
corvusclarkb:  there's a lot of workflow-by-form stuff, so maybe something can be created for that.  but it's certainly not a "checkbox" :)19:36
corvusianw: yes19:36
clarkbianw: yes there is a php saml tool thing that can translate to other backends and keycloak speaks the saml to that php tool in theory19:36
corvus2fa is available and is a "checkbox" :)19:36
fungiianw: yes, there's a proposal in the spec to create a sort of bridge from keycloak to openid via phpsimplesaml19:37
fungicorvus: i expect turning on 2fa is not hard, but helping users reset it every time they lock themselves out might be19:37
clarkbwriting that bit would be a good next step for someone interested in experimenting with keycloak more19:37
ianwcool, well this seems like a great step in having an environment we can test that too.  i'd be interested in working on that in the future19:37
clarkbianw: ++ having the actual service up gives us something to look at that is more than theoretical19:37
clarkbI'll also need to finish the gerrit user cleanups so that we can update the external ids database in a straightforward manner19:38
corvusyeah... and i won't be able to drive this, so having other folks step in and pick it up would be great19:38
clarkbAlright tldr is work to be done, feel free to experiment, but this isn't for production use yet19:39
clarkbanything else?19:39
clarkb#topic Adding a lists.openinfra.dev mailman site19:40
fungii'm still trying to fix things to make our current mailman orchestration go19:40
clarkbfungi and I ran into some trouble with newlist when doing this that we thought we had corrected. Long story short newlist is still looking for input to confirm emailing people19:40
clarkbseems that redirecting /dev/null into newlist corrects this, but it also exposes that our testing is different than prod19:40
fungii was able to reproduce it with a dnm change, and determined that redirecting stdin from /dev/null in a shell task properly solves it19:41
clarkbfungi: the plan is to update our system-config-run jobs to all block port 25 outbound then we can tell newlist to send email right?19:41
clarkboh sorry I'll let fungi fill us in :)19:41
fungisetting stdin to a null string in a cmd task does not have the same effect19:41
fungi(which is what we had merged previously)19:41
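
The `/dev/null` redirection described above can be illustrated outside Ansible. This is a sketch only: the child command is a stand-in that reports what it could read from stdin, not `newlist`, and it does not reproduce the difference between Ansible's shell and cmd tasks.

```python
import subprocess
import sys

# Stand-in for an interactive tool: it prints whatever it can read from stdin.
CHILD = [sys.executable, "-c", "import sys; print(repr(sys.stdin.read()))"]


def run_without_stdin(cmd):
    """Run cmd with stdin attached to the null device.

    Any attempt by the child to read a confirmation hits end-of-file
    immediately instead of waiting for input.
    """
    return subprocess.run(
        cmd,
        stdin=subprocess.DEVNULL,
        capture_output=True,
        text=True,
        check=True,
    )
```

Calling `run_without_stdin(CHILD)` returns promptly, with the child having read an empty string.
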
fungiand yeah, i have changes up to collect exim logs so we can see what's trying to send e-mail through the mta in tests, as well as blocking 25/tcp egress to prevent our deploy jobs from accidentally sending e-mail19:42
fungiand then i'm dropping the test-time addition of the -q option for the newlist command19:42
clarkbfungi: are any of those ready for review yet?19:42
fungiprobably, though i have a pending update to one of them once i get test results back from the latest revision to the iptables change19:43
fungiand haven't rebased the dropping of -q onto that stack yet19:43
clarkbgotcha, feel free to ping me when you want reviews and I'll happily take a look19:43
fungishould hopefully have it up right after the meeting, and then once those merge we can add new mailing lists again more easily19:43
clarkbthanks!19:44
fungitopic:mailman-lists19:44
fungiin case anyone's looking for them19:44
clarkb#topic Gerrit User Summit19:45
clarkbGerrit User Summit happened last week. I found it useful to catch up on some of the gerrit upstream activities19:45
clarkbI took notes and they are in my homedir on review02. But I'll try to summarize some of the interesting bits really quickly here19:45
clarkbGerrit 3.2 is EOL. Thank you ianw for helping get us to 3.319:46
clarkbThe new Checks UI work relies on a plugin in the Gerrit server that queries CI systems for results/status and then renders them in a consistent way regardless of the CI system19:46
clarkbthis means that we could probably replace the Zuul summary plugin with a Checks UI plugin using this new system. But I think that is 3.4 and beyond. Not a 3.3 thing19:46
corvusand that's a java plugin?  or is it a pg plugin?19:47
clarkbI think that is a java plugin because you have to interact with gerrit internal state19:47
clarkbthe plugin acts as a data retrieval and conformance system between your CI system and the checks UI19:48
clarkband I think that requires you make writes somewhere which I suspect the js stuff can't do19:48
clarkbhowever, that wasn't entirely clear to me so I could be wrong19:48
clarkbGerrit is working towards deleting prolog for complex acl rule applications. Instead they are replacing it with "Composable Submit Requirements" which use a simple query language based on Gerrit's existing query language19:49
clarkbyou essentially write rules that say "if this gerrit query returns a result then this rule applies to this change"19:49
clarkband the rules can say this is required for submitting etc19:49
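
The "if this query matches, the rule applies" idea can be modelled in a few lines. This is a toy sketch: Python predicates stand in for Gerrit query expressions, and the class, field names, and change attributes are hypothetical, not Gerrit's configuration format.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Change = Dict[str, object]


@dataclass
class SubmitRequirement:
    name: str
    applies_if: Callable[[Change], bool]      # does the rule apply to this change?
    submittable_if: Callable[[Change], bool]  # is the rule satisfied?


def blocking_requirements(change: Change, rules: List[SubmitRequirement]) -> List[str]:
    """Names of the rules that apply to the change but are not satisfied."""
    return [
        rule.name
        for rule in rules
        if rule.applies_if(change) and not rule.submittable_if(change)
    ]


RULES = [
    SubmitRequirement(
        name="Verified",
        applies_if=lambda c: True,
        submittable_if=lambda c: c.get("verified", 0) >= 2,
    ),
    SubmitRequirement(
        name="Docs-Review",
        applies_if=lambda c: c.get("branch") == "docs",
        submittable_if=lambda c: bool(c.get("docs_reviewed")),
    ),
]
```

A change is submittable in this model only when `blocking_requirements` returns an empty list, which is the property to preserve for the Zuul-gated submit path mentioned in the discussion.
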
clarkbI don't expect we'll migrate to this quickly for anything though random users may use it for various additional checks. However, we should be careful to ensure we don't accidentally reduce our requirements for submitting via zuul19:50
clarkbThere is a ChronicleMap libmodule plugin for persistent caches. This apparently improves performance quite a bit since you don't lose cache data when restarting gerrit. Some people suggested it be incorporated directly into Gerrit rather than a plugin19:51
clarkbOur performance is pretty good these days and we don't restart Gerrit often but may be worth looking into as some users (those talking to nova stuff) have indicated slowness after restarts19:51
clarkbAnd finally the Gerrit meetings are open to the entire community. You can also put stuff on their agenda if you have something specific you want to discuss19:52
clarkbThis is something I wasn't clear about since I think they title them the EC meeting or similar19:52
clarkbI'll probably try to start attending these once I figure out when they happen19:52
clarkbSo ya feel free to ask me any questions if you have them though I'm still a gerrit community noob19:53
clarkbOverall I think the event went well and I learned some stuff about what to look for for the future19:53
clarkb#topic Nodepool Image cleanups19:54
ianwthanks for attending/the summary!19:54
clarkbWe've got a number of images that are either under maintained or going EOL19:55
clarkb#link http://lists.opendev.org/pipermail/service-announce/2021-December/000029.html19:55
clarkbI sent email outlining a rough plan for cleanups to service-announce19:55
clarkbThis should reduce a lot of pressure on our AFS volumes too19:56
clarkbIf we do get responses for opensuse and gentoo help it would be good to maybe also try and run periodic jobs on those platforms somewhere19:57
clarkbto serve as a signal when they break?19:57
clarkbAn idea I had that might make maintenance a bit more responsive in the future19:57
clarkb#topic Open Discussion19:57
ianwwe do have a periodic run of zuul-jobs that tries everything19:57
clarkbianw: ah ok so those volunteers could watch that19:57
ianwbut if a job fails in the woods with nobody listening ... :)19:57
ianwspeaking of19:58
clarkbWe are almost out of time, anything else?19:58
ianw#link https://review.opendev.org/c/zuul/zuul-jobs/+/81870219:58
jentoioI'd like to help/volunteer19:58
ianwthat will enable f35; which seems to have gone smoothly19:58
clarkbjentoio: cool, can you respond to that email so that we can help keep track of it and not miss that there is interest?19:59
jentoiosure, I was hoping we can meet for coffee to discuss19:59
jentoiosince we live near each other - unless you moved ;)19:59
ianwi can also lookup the details of the zuul-jobs runs and post19:59
clarkbI'm still in the same part of the world though I don't get out much these days20:00
jentoiobut I'll respond to email as well.20:00
clarkbya I think that helps since others are involved as well and around the world20:00
clarkbbut I'm happy to discuss further as well. And we are at time20:00
clarkbthank you everyone!20:00
clarkb#endmeeting20:00
opendevmeetMeeting ended Tue Dec  7 20:00:49 2021 UTC.  Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4)20:00
opendevmeetMinutes:        https://meetings.opendev.org/meetings/infra/2021/infra.2021-12-07-19.01.html20:00
opendevmeetMinutes (text): https://meetings.opendev.org/meetings/infra/2021/infra.2021-12-07-19.01.txt20:00
opendevmeetLog:            https://meetings.opendev.org/meetings/infra/2021/infra.2021-12-07-19.01.log.html20:00
fungithanks clarkb!20:01
clarkbAs always feel free to continue discussion on the mailing list (service-discuss@lists.opendev.org) or in #opendev20:01
clarkbAnd we'll see you here next week before taking a break for the holidays20:01
