19:01:33 <clarkb> #startmeeting infra
19:01:33 <opendevmeet> Meeting started Tue Aug 17 19:01:33 2021 UTC and is due to finish in 60 minutes.  The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:33 <opendevmeet> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:33 <opendevmeet> The meeting name has been set to 'infra'
19:01:39 <clarkb> #link http://lists.opendev.org/pipermail/service-discuss/2021-August/000277.html Our Agenda
19:02:04 <yoctozepto> o/
19:02:21 <clarkb> #topic Announcements
19:02:27 <clarkb> I didn't have any announcements
19:02:43 <clarkb> #topic Actions from last meeting
19:02:51 <clarkb> #link http://eavesdrop.openstack.org/meetings/infra/2021/infra.2021-08-10-19.01.txt minutes from last meeting
19:02:57 <clarkb> There were no actions recorded
19:03:03 <clarkb> #topic Specs
19:03:08 <clarkb> #link https://review.opendev.org/c/opendev/infra-specs/+/804122 Prometheus Cacti replacement
19:03:31 <clarkb> tristanC reviewed this spec (thank you!). Would be good to get some feedback from infra-root as well
19:04:21 <clarkb> This isn't really urgent at the moment, but I'd like feedback while it's still somewhat fresh for me if possible
19:04:31 <clarkb> #topic Topics
19:04:46 * fungi grins
19:04:53 <clarkb> Let's jump right in. I did quite a bit of pruning on the topics list as we seem to have reached the end point of a number of in-flight items
19:05:05 <clarkb> thank you to everyone who had been working on those items. it is much appreciated.
19:05:17 <clarkb> If something got removed that should have remained please let me know
19:05:29 <clarkb> #topic Service Coordinator Election
19:05:49 <clarkb> The only nomination I saw was the one I submitted for myself. By default I think this means I'm it again.
19:06:00 <yoctozepto> \o/ congrats clarkb
19:06:22 <fungi> and/or condolences
19:06:26 <clarkb> ha
19:06:28 <fungi> but either way, thanks!
19:06:47 <yoctozepto> yes, thanks for being awesome 8-)
19:06:48 <clarkb> We'll have another election in about 6 months (the details were in that email thread)
19:07:06 <clarkb> until then I'll run this meeting and coordinate with the projects and our resource donators
19:07:23 <clarkb> #topic Matrix eavesdrop and gerritbot bots
19:07:39 <clarkb> The bots live and the latest version of the gerritbot seems to be more reliable
19:07:47 <fungi> and sounds like zuul is taking the plunge this weekend
19:07:58 <clarkb> corvus has proposed that August 21 will be the Zuul switch day
19:08:01 <clarkb> yup
19:08:08 <corvus> no objections so far, so i'm assuming so
19:08:21 <yoctozepto> what's the current plan to support the current bots on irc?
19:08:25 <corvus> i'll send out a formal announcement... wednesday?
19:08:29 <clarkb> At this point I'm not sure there is much else for us to do on the opendev side other than helping get a new room created if necessary and joining our matrix clients to it
19:08:32 <clarkb> corvus: wfm
19:08:42 <corvus> and i can take care of making the room too
19:08:47 <mordred> \o/
19:09:02 <clarkb> yoctozepto: at this point we'll keep running them though I'd like to start testing matrix gerritbot as a replacement for irc gerritbot at some point
19:09:24 <yoctozepto> clarkb: ack, thanks; I know it's less featureful so to speak
19:09:27 <clarkb> yoctozepto: we'll need a channel to do that in where people won't mind the noise and possible double posting. But if that goes well I think we can drop the irc gerritbot in favor of the matrix bot
19:09:29 <yoctozepto> but perhaps it has better design
19:09:47 <clarkb> yoctozepto: eavesdrop is trickier because it doesn't support meetings on the matrix side
19:10:00 <clarkb> we'll need a separate matrix meeting bot before we can swap out the limnoria setup I think
19:10:13 <yoctozepto> eh, always uphill
19:10:24 <corvus> or add meetings to the existing matrix eavesdrop
19:10:33 <fungi> also not sure if the reliability of the oftc matrix bridge plays into it (missing the occasional gerrit event in an irc channel is probably fine though?)
19:10:46 <clarkb> corvus: ya or that
19:11:02 <clarkb> fungi: yup, another reason starting with gerritbot is a good idea
19:11:08 <clarkb> that would be more problematic for logging and meetings though
19:11:18 <corvus> for the actual logging part of it, it's nice since it never misses anything that matrix sees
19:11:51 <corvus> and even though occasionally the oftc irc bridge misses something, on the whole, i'd say it <= what one misses just in normal netsplits
19:12:00 <yoctozepto> mayhaps the openstack community migrates as well at some point
19:12:05 <corvus> (it probably is netsplits that cause that)
19:12:12 <clarkb> corvus: good point re netsplits
19:12:12 <yoctozepto> there was some pressure from ironic I think
19:12:25 <yoctozepto> we'll see ;-)
19:12:50 <corvus> yoctozepto: it's my personal hope -- everything we've done for zuul is designed to be "forward-compatible" for any other projects wanting to either have a robust bridged presence or switch
19:12:52 <clarkb> yoctozepto: at this point I think it would be good to see how it goes for zuul since it is a much smaller community that knows how to pivot if necessary :) we are learning and having the ability to be flexible with zuul is nice. Not that I expect issues at this point
19:12:57 <fungi> yoctozepto: one up-side to this model is that ironic can decide to switch even if not all of openstack does
19:13:16 <clarkb> fungi: I'm not sure that is entirely true the way zuul is doing it
19:13:26 <yoctozepto> you three just said all my thoughts ;p
19:13:27 <clarkb> assuming you mean they could leverage the bridge
19:13:40 <clarkb> once zuul moves the irc bridge is not something that will function for it aiui
19:13:53 <clarkb> it will be all matrix all the time
19:14:09 <fungi> i mean ironic could decide to stop using their oftc channel and have a matrix channel and still take advantage of the same bots (once the lack of meetings is solved, assuming they have meetings)
19:14:20 <clarkb> gotcha
19:14:23 <corvus> that's a choice we're making, but not the only possible choice :)
19:14:54 <fungi> the openstack community as a whole could decide that it's okay if some teams use matrix instead of irc, they already have some teams using wechat instead of irc
19:15:22 <clarkb> Definitely get your matrix clients set up if you would like to interact with zuul more synchronously in the future. And ya I don't see any issues with zuul moving forward at this point from the bot perspective
19:15:30 <mordred> yoctozepto: I don't know if the ironic team has seen the zuul docs - but zuul updated all of their contributor docs to point people to matrix as the way to get to the channel even with it currently still on irc
19:15:35 <yoctozepto> fungi: who's using wechat?
19:16:16 <yoctozepto> mordred: well, I see more and more people coming from the irc bridge
19:16:19 <yoctozepto> on all the channels
19:16:28 <yoctozepto> so I think they did see these tips
19:17:29 <clarkb> Are there any other concerns or issues to call out on the subject of matrix before zuul moves? If not we should move on. Openstack and Ironic can definitely look to see how it goes with zuul to help evaluate things for their uses
19:17:30 <fungi> yoctozepto: i forget which ones, but one or more of the teams consisting predominately of contributors from mainland china coordinate over wechat, and i think have irc channels which sit unused
19:17:48 <yoctozepto> fungi: ack
19:17:59 <yoctozepto> clarkb: go on; I'm happy to see zuul move :-)
19:18:21 <clarkb> #topic Backup reliability
19:18:42 <clarkb> We had a number of servers fail to backup on Friday. But manually rerunning the backups on one of them succeeded a few hours later
19:18:54 <clarkb> what seemed to happen was they all started their backups at the same time and then timed out 2 hours later sending email about it
19:19:01 <clarkb> Since then they haven't reported failures
19:19:23 <clarkb> Calling this out as it was backups to the other backup server, not the one gitea01 struggles with. Though gitea01 is also still struggling
19:19:53 <clarkb> ianw: I think the backups use a random hour but not minute? or is the minute in the cron randomized too?
19:20:06 <clarkb> I'm wondering if we need better randomization for the minute portion as it seems quite a few have the same time set up
19:20:35 <ianw> umm, i would have thought it was random minutes
19:20:54 <ianw> iirc we divide 24 by the number of backup servers
19:21:13 <clarkb> hrm I guess it is just chance on the way it hashes then
19:21:43 <clarkb> Things have been happy since. I'm fairly certain this was an internet blip and we don't need to focus too much on it unless it becomes persistent
19:21:43 <ianw> https://opendev.org/opendev/system-config/src/branch/master/playbooks/roles/borg-backup/tasks/main.yaml#L62
19:21:51 <ianw> the minute should be random
19:22:00 <clarkb> huh so it is chance then
19:22:03 <clarkb> lucky us :)
19:22:26 <ianw> actually -> '{{ 59|random(seed=item) }}'
19:22:53 <clarkb> ya so it doesn't change the cron each ansible run
19:22:54 <ianw> item is the backup server i guess.  the seed should probably be inventory hostname of the host we're setting up
19:23:01 <clarkb> oh yes ++
19:23:23 <clarkb> seed=ansible_fqdn or similar?
19:23:35 <clarkb> anyway we don't need to fix it now. I wanted to call it out and it does seem that maybe there is a small bug here :)
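For context on the fix being floated above, a minimal sketch of a cron task whose minute is seeded on the host being backed up; the task layout and variable names (borg_backup_servers, backup_hour, the borg-backup command path) are illustrative, not the actual role's:

    - name: Install borg backup cron job (sketch, not the real role)
      cron:
        name: "borg-backup {{ item }}"
        user: root
        hour: "{{ backup_hour }}"
        # Seeding on the host being backed up (not just the backup server)
        # spreads hosts apart while staying stable across ansible runs.
        minute: "{{ 59 | random(seed=(inventory_hostname + item)) }}"
        job: "/usr/local/bin/borg-backup {{ item }}"
      loop: "{{ borg_backup_servers }}"

Because inventory_hostname is part of the seed, hosts backing up to the same server land on different minutes instead of all hashing to the same slot.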
19:24:43 <clarkb> #topic Mailman server upgrades
19:24:59 <clarkb> Yesterday with the help of fungi I got a test copy of lists.kc.io booted and upgraded to focal
19:25:06 <clarkb> The mailman web stuff seems to work
19:25:29 <clarkb> fungi did you manage to try a newlist yet to see if it successfully sends email to your server (as I expect mine would reject it)
19:25:37 <fungi> i still need to spend a few minutes creating a test ml on it and trying getting messages in and out
19:25:56 <clarkb> thanks
19:26:30 <clarkb> If people can think of other test actions that would be great, but so far things look ok. The biggest hurdle was realizing the snapshot wouldn't boot because its fstab had an entry for swap that didn't exist on the new server's ephemeral disk
19:26:54 <clarkb> Once we're happy with the results I'll reach out to the kata project and schedule an upgrade for them.
19:27:02 <clarkb> Then when we're happy with that we can do the same for lists.o.o
19:27:49 <clarkb> fungi: one thing I meant to check which i haven't yet is how the python2 config is selected for mailman
19:27:51 <fungi> also remembering that you need to fake the hostname resolution in your /etc/hosts to test the webui
19:27:57 <fungi> because apache cares
19:28:02 <clarkb> want to make sure that in our overrides for mailman vhosts we don't break that
19:28:27 <clarkb> I expect it is fine since I think we still include the Defaults.py first but want to be sure
19:28:55 <clarkb> #topic Improving OpenDev CD throughput
19:29:16 <fungi> i replaced our cds with dvds, now they're much faster
19:29:35 <clarkb> This came up in discussion last week around how some change in zuul behavior has made applying deploy buildsets for merged changes much slower now due to the hourly deploy pipeline buildsets
19:29:54 <clarkb> The hourly deploys were taking almost an hour for each buildset then immediately re-enqueuing another buildset
19:30:25 <clarkb> this meant you'd land a change, then it would wait up to an hour for its turn, and as soon as it was done the hourly buildset would take the locks and continue. Landing a second change would have to wait even longer
19:31:03 <clarkb> There are a few fundamental issues here: our ansible jobs take a lot of time, they don't run in parallel, and we run/ran expensive jobs hourly
19:31:34 <clarkb> A quick and easy improvement was to stop running the cloud launcher job hourly since we rarely need its updates anyway. That cut about 20 minutes off of the hourly buildset runtime
19:31:57 <clarkb> I think we can make similar improvements dropping zuul-preview and remote-puppet-else from the hourly job as well (though the impact won't be as dramatic)
19:32:10 <clarkb> We can run those daily instead and not be majorly impacted
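For reference, the moves described here are just shuffling jobs between project-pipeline stanzas in the Zuul config; the pipeline and job names below only approximate the real ones and are not copied from the actual configuration:

    - project:
        opendev-prod-hourly:
          jobs:
            - infra-prod-service-zuul
            - infra-prod-service-nodepool
        opendev-prod-daily:
          jobs:
            - infra-prod-cloud-launcher
            - infra-prod-zuul-preview
            - infra-prod-remote-puppet-else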
19:32:29 <fungi> pabelanger has also proposed a related change for zuul to start prioritizing semaphore locks along with the pipeline priorities, which implies some potential ordering issues between our timer and change-merged triggered deploy pipelines
19:32:50 <clarkb> A more comprehensive fix is to figure out our dependencies between jobs and express them in the zuul config properly. Then we can run the jobs in parallel.
19:33:12 <clarkb> One concern I've got running jobs in parallel is the system-config updates on bridge, but I think we can move that into a central locking job that pauses, allowing its children to run
19:33:49 <fungi> with the semaphore priority change, we could end up with incomplete periodic deploys rolling back configuration when their buildsets resume after regaining control of the lock, unless we make sure every build updates system-config before using it
19:34:16 <clarkb> yup and the hourly deploys do currently do that so we should be ok for now
19:34:29 <clarkb> hourly jobs always pull latest system-config
19:34:51 <clarkb> the issue will be if we then flip back to a specific change in deploy
19:34:55 <fungi> ahh, okay for some reason i thought we had concluded they were using the state of system-config from the time they were enqueued instead
19:35:00 <clarkb> fungi: only in deploy
19:35:06 <fungi> got it
19:35:07 <corvus> clarkb: they pull it or push it?
19:35:19 <corvus> the jobs behave differently in different pipelines?
19:35:57 <clarkb> corvus: they pull when in the hourly and daily deploy pipelines, and yes
19:36:10 <corvus> the job says "i'm in a periodic pipeline, i will manually run 'git fetch...'"?
19:36:22 <clarkb> https://opendev.org/opendev/system-config/src/branch/master/playbooks/zuul/run-production-playbook.yaml#L27-L32
19:36:26 <clarkb> corvus: yes
19:36:41 <corvus> okay, neat, i didn't remember that
19:37:00 <corvus> so those jobs are safe from "interruptions"
19:37:05 <clarkb> I suspect so
19:37:27 <fungi> yes, i agree
19:37:41 <fungi> i'm now less worried about that behavior changing in the scheduler
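Roughly, the linked task amounts to the following: when the job runs in one of the periodic pipelines, refresh the system-config checkout on bridge to current master before running the production playbook. The pipeline names and checkout path here are assumptions, not copied from the playbook:

    - name: Update system-config to master when triggered by a timer
      git:
        repo: https://opendev.org/opendev/system-config
        dest: /home/zuul/src/opendev.org/opendev/system-config
        version: master
        force: yes
      # Pipeline names are illustrative; change-triggered deploy runs keep
      # the checkout Zuul prepared for the triggering change instead.
      when: zuul.pipeline in ['opendev-prod-hourly', 'opendev-prod-daily']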
19:37:45 <clarkb> the one place we might have trouble is if we land a number of changes that all queue up in deploy and the hourly buildsets run the results of the end of that stack, then we run the first in the normal deploy which is older
19:38:10 <clarkb> then we flip back to hourly and run the latest, then flip back to an older change until we eventually become consistent by running the last change in the normal deploy
19:38:34 <clarkb> This is why I mentioned yesterday that we might just want to always deploy from master instead
19:38:35 <fungi> if deploy has a higher priority than hourly, that shouldn't occur though, right?
19:38:45 <fungi> deploy will be exhausted before hourly gets to run more builds
19:38:51 <clarkb> fungi: yes I think that is another way to address it
19:38:57 <fungi> so there should never be something "older" run later
19:39:16 <clarkb> fungi: currently I'm not sure we do that; it is worth checking if someone can do that
19:39:23 <fungi> i'll look
19:39:49 <clarkb> also we don't want to switch until we address pabelanger's issue I suspect. But we can coordinate that
19:40:02 <fungi> it's already how we want
19:40:15 <fungi> precedence high on deploy, low on hourly
19:40:50 <fungi> as long as we keep it that way, should be race-free
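For reference, the precedence being checked here lives on the pipeline definitions; a trimmed-down sketch of the two pipelines, with managers and triggers simplified and names only approximate:

    - pipeline:
        name: deploy
        manager: serial
        precedence: high
        trigger:
          gerrit:
            - event: change-merged

    - pipeline:
        name: opendev-prod-hourly
        manager: serial
        precedence: low
        trigger:
          timer:
            - time: '0 * * * *'

As noted above, precedence currently decides whose requests get serviced first, and pabelanger's proposed change would extend it to ordering semaphore waiters as well.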
19:40:52 <clarkb> Anyway, to do parallelized jobs I think we want an anchor job that grabs a global lock and sets up system-config for the entire buildset. Then individual jobs will grab a secondary semaphore that limits the concurrency we subject bridge to for that buildset
19:41:15 <fungi> though if we want to do the parallel thing, yeah we need two layers of semaphores
19:41:15 <corvus> clarkb: i don't think having deploy run master is necessarily best; if there's a sequence of changes that should happen in deploy, that may not work as expected?
19:41:35 <clarkb> In order to do that we need to understand the dependencies between all our jobs and then decide if some jobs should be combined or if they should be separate with explicit dependencies in the config.
19:41:43 <corvus> clarkb: maybe it's okay -- but it's worth thinking about whether that subverts expectations
19:41:52 <clarkb> corvus: yup that is the downside. If we want to, say, remove a cron by ensuring it is absent and then remove the cron definition, always deploying master breaks that
19:42:06 <clarkb> corvus: if the priority settings work then that is probably the best option
19:42:42 <clarkb> For the first step, which is understanding the relationships between jobs, I think it would be good to have that in some sort of human-readable format (graph images? I dunno)
19:42:54 <clarkb> does anyone have a good idea for a way to make that collaborative?
19:43:05 <fungi> and corvus pointed out that having a buildset semaphore with count 1 would allow us to safely do a build semaphore with a higher count
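To make the two-layer idea concrete, a rough sketch of the shape it could take; every name here is invented for illustration, and the anchor job's playbook would pause (via zuul_return) after refreshing system-config so its children run while it still holds the buildset-wide lock:

    - semaphore:
        name: infra-prod-buildset-lock
        max: 1

    - semaphore:
        name: infra-prod-bridge-slots
        max: 4

    - job:
        name: infra-prod-bootstrap-bridge
        description: Anchor job; updates system-config on bridge, then pauses.
        semaphore: infra-prod-buildset-lock

    - job:
        name: infra-prod-service-example
        description: One per-service deploy job, throttled against bridge.
        semaphore: infra-prod-bridge-slots

The count-1 semaphore is the buildset lock fungi mentions; the higher-count one bounds how many deploy playbooks hit bridge at once.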
19:43:10 <clarkb> Then we could potentially each grab some subset of jobs and start mapping them out together
19:43:42 <clarkb> miro is a web tool I've used in other contexts, we could maybe try that here if people are interested?
19:43:42 <fungi> i would say just do an etherpad, we can make a yaml-ish dag in there
19:43:59 <fungi> if we want a flowchart instead, then maybe not etherpad
19:44:02 <clarkb> fungi: my concern with that is we seem to already fail with the yamlish dag because we haven't used it in the zuul config
19:44:18 <clarkb> (a small number of jobs do have dependencies mapped but not all)
19:44:28 <fungi> upside to a dag in actual yaml would be we can just copy it into the config ;)
19:44:46 <clarkb> fair point. I guess we can start there and switch if it isn't good enough
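As an example of the yaml-ish dag format, which could later be pasted into the project stanza more or less as-is; the jobs and the dependency shown are placeholders, since the real relationships are exactly what still needs mapping:

    - project:
        deploy:
          jobs:
            - infra-prod-bootstrap-bridge
            - infra-prod-letsencrypt:
                dependencies:
                  - infra-prod-bootstrap-bridge
            - infra-prod-service-gitea:
                dependencies:
                  - infra-prod-letsencrypt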
19:45:16 <clarkb> I'm hoping to be able to start mapping some of this out this week. Feel free to get started first if you are interested
19:45:24 <fungi> i'm not opposed to trying some whiteboarding tool instead/in addition though, if you think it would help
19:45:40 <clarkb> fungi: I'm fine either way; I wasn't really happy with any of the options.
19:45:54 <clarkb> miro has its problems (makes your browser die)
19:46:22 <fungi> we could start by writing out explanations of the relationships, and then trying to reflect those in structured data, i guess
19:46:41 <clarkb> Then once we've got things mapped out we can express those relationships in the zuul config while keeping everything serialized the way it is today
19:47:01 <clarkb> And once we're happy with the relationships in the zuul config we can swap over to the anchor jobs and the various locks
19:47:13 <clarkb> I don't expect this will be a quick change but it should make our deployments much happier
19:47:16 <corvus> i like 'anchor job' as a term
19:47:50 * fungi weighs anchor
19:47:55 <clarkb> Definitely let me know if you start looking at this. Otherwise I'll let others know when I've got something written down somewhere that we can look at :)
19:48:16 <clarkb> Finally I think it is worth mentioning some of the "crazier" ideas that have come up:
19:48:19 <ianw> when in doubt, write a DSL
19:48:23 <clarkb> we could try mitogen
19:48:34 <clarkb> we could run the hourly buildsets less often like every other hour or every third hour
19:48:56 <clarkb> I'm not a huge fan of reducing the hourly frequency since the things we want in there do want frequent updates (like zuul and nodepool)
19:49:06 <clarkb> But the mitogen idea (thanks pabelanger) is an interesting one
19:49:20 <clarkb> mitogen claims to significantly reduce the amount of python forking that ansible does
19:49:26 <clarkb> and we know that is a major cost in some of our playbooks
19:49:40 <clarkb> It is BSD licensed and installable from pypi
19:50:08 <clarkb> The risk would be that we need to rework our playbooks significantly to make it work (and maybe even downgrade ansible? it isn't clear to me if they work with newer ansible)
19:50:13 <corvus> [fwiw, i don't actually think it's the forking, it's the serialized task distribution algorithm]
19:50:40 <fungi> this? https://pypi.org/project/mitogen/
19:50:55 <clarkb> fungi: yes, it is a generic tool but installing it comes with an ansible plugin that you configure ansible to use
19:51:15 <clarkb> corvus: ya I think they are related because ansible doesn't pre-fork the worker processes anymore and instead forks a new process for each task that it grabs off the serialized queue?
19:51:29 <fungi> yeah, started digging in the git repo linked from there, seems like that's the one
19:51:32 <corvus> clarkb: yeah
19:51:49 <corvus> clarkb: it thinks it has an executor worker distribution model, but it doesn't
19:52:19 <clarkb> Anyway mitogen is a large subject. Paul offered to push a change and see if it would work in testing
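Per its docs, enabling mitogen itself is a small configuration change (whether our playbooks then behave is the open question): point strategy_plugins at the ansible_mitogen strategy directory and set strategy = mitogen_linear. A sketch of how a test change might wire that up on bridge, with the ansible.cfg path and install location as assumptions:

    - name: Enable the mitogen strategy plugin (paths are illustrative)
      ini_file:
        path: /etc/ansible/ansible.cfg
        section: defaults
        option: "{{ item.option }}"
        value: "{{ item.value }}"
      loop:
        - option: strategy_plugins
          value: /usr/local/lib/python3.8/dist-packages/ansible_mitogen/plugins/strategy
        - option: strategy
          value: mitogen_linear

Rolling back would just mean setting strategy back to linear, which keeps the experiment fairly low-risk.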
19:52:35 <fungi> just interesting to see that mitogen hasn't released a new version in almost two years, but it does have git commits as recent as 5 months ago
19:52:38 <clarkb> I think it is worth entertaining if it shows an improvement in performance without requiring major refactoring
19:52:56 <clarkb> I don't expect us to say yay or nay here but wanted to call it out
19:53:10 <clarkb> Definitely the most important thing is improving the way we run the jobs in zuul in the first place
19:53:24 <fungi> oh, they uploaded some prereleases to pypi in january, okay
19:53:25 <corvus> and the playbooks themselves
19:53:26 <clarkb> Anything else on this subject or should I open up the floor?
19:54:07 <clarkb> #topic Open Discussion
19:54:09 * fungi checks that he's not standing on a trapdoor
19:54:34 <clarkb> Is there anything else?
19:55:02 <fungi> noticed today, possibly related to a missed change-merged event for a zuul config update, we're getting a dozen or two of these exceptions a day:
19:55:07 <fungi> #link http://paste.openstack.org/show/808156/
19:55:47 <fungi> it's possible this means we're missing a small but not insignificant number of gerrit events in zuul
19:55:56 <clarkb> After some lunch and a bike ride I can help try and understand the sequence of events there if that would be useful
19:56:06 <fungi> something to keep in mind if you end up needing to run down any similar weirdness
19:56:12 <clarkb> I think we want to see what sort of actions would be aborted when we get those exceptions
19:56:28 <clarkb> and determine if we can retry those requests or if we can avoid aborting important actions by running them earlier etc
19:56:42 <clarkb> but that discussion probably belongs in #zuul and I haven't pulled the code up yet :0
19:56:44 <clarkb> er :)
19:57:05 <corvus> is opendev's gerrit or network having issues?
19:57:17 <clarkb> corvus: we haven't seen any other reports
19:57:22 <corvus> maybe we're seeing this now because zuul<-internet->gerrit ?
19:57:25 <clarkb> it is a possibility
19:57:41 <corvus> fungi said 12/day
19:57:54 <fungi> 17 yesterday, 16 so far today when i looked
19:58:01 <clarkb> er I mean no reports by users or other issues. These exceptions are the only indication so far of an internet problem
19:58:07 <clarkb> that said it could definitely be an internet problem
19:58:27 <fungi> zuul is certainly hammering gerrit way harder than the typical user, so more likely to notice
19:59:00 <corvus> yeah, given that this is a TCP-level error, and we haven't changed that handling in zuul in ages, but we have recently moved gerrit farther from zuul in network topology, it seems there's a circumstancial fit
19:59:31 <corvus> if that is the case, then "retry harder" may be the best/only solution
19:59:39 <fungi> also not related, but paste reminded me just now, a user let us know last week that we stopped supporting the pastebinit tool when we switched lodgeit servers. apparently pastebinit hardcodes hostnames and our redirect from paste.openstack.org (which it knows about) to paste.opendev.org is breaking it. we're investigating options to be able to maybe exclude the pastebinit user agent
19:59:41 <fungi> from the redirect, in case anyone interested is following the meeting
20:00:04 <clarkb> fungi: ianw: it wasn't clear to me if someone was volunteering to update the change for that?
20:00:12 <clarkb> do we need someone to volunteer?
20:00:18 <fungi> i can update the change, i just need to find time
20:00:22 <clarkb> ok
20:00:32 <fungi> i've got a test server held from the earlier investigations
20:00:34 <clarkb> I was going to suggest we could land the redirect removal now if we want then followup with the UA check
20:00:47 <clarkb> but if we have someone working on the UA check already then no need
20:00:59 <clarkb> And we are at time.
20:01:08 <fungi> if "i've thought about what it looks like" counts as working on it ;)
20:01:20 <clarkb> Feel free to continue discussion in #opendev or on the mailing list, but we should call it here and let people get to lunch/breakfast/dinner etc :)
20:01:23 <clarkb> (I'm hungry)
20:01:26 <clarkb> thanks everyone!
20:01:28 <fungi> thanks clarkb!
20:01:29 <clarkb> #endmeeting