19:01:33 #startmeeting infra
19:01:33 Meeting started Tue Aug 17 19:01:33 2021 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:33 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:33 The meeting name has been set to 'infra'
19:01:39 #link http://lists.opendev.org/pipermail/service-discuss/2021-August/000277.html Our Agenda
19:02:04 o/
19:02:21 #topic Announcements
19:02:27 I didn't have any announcements
19:02:43 #topic Actions from last meeting
19:02:51 #link http://eavesdrop.openstack.org/meetings/infra/2021/infra.2021-08-10-19.01.txt minutes from last meeting
19:02:57 There were no actions recorded
19:03:03 #topic Specs
19:03:08 #link https://review.opendev.org/c/opendev/infra-specs/+/804122 Prometheus Cacti replacement
19:03:31 tristanC reviewed this spec (thank you!). It would be good if I could get some feedback from infra-root as well
19:04:21 This isn't really urgent at the moment, but I'd like feedback while it's still somewhat fresh for me if possible
19:04:31 #topic Topics
19:04:46 * fungi grins
19:04:53 Let's jump right in. I did quite a bit of pruning on the topics list as we seem to have reached the end point of a number of in-flight items
19:05:05 thank you to everyone who has been working on those items. it is much appreciated.
19:05:17 If something got removed that should have remained, please let me know
19:05:29 #topic Service Coordinator Election
19:05:49 The only nomination I saw was the one I submitted for myself. By default I think this means I'm it again.
19:06:00 \o/ congrats clarkb
19:06:22 and/or condolences
19:06:26 ha
19:06:28 but either way, thanks!
19:06:47 yes, thanks for being awesome 8-)
19:06:48 We'll have another election in about 6 months (the details were in that email thread)
19:07:06 until then I'll run this meeting and coordinate with the projects and our resource donors
19:07:23 #topic Matrix eavesdrop and gerritbot bots
19:07:39 The bots are live and the latest version of the gerritbot seems to be more reliable
19:07:47 and it sounds like zuul is taking the plunge this weekend
19:07:58 corvus has proposed that August 21 will be the Zuul switch day
19:08:01 yup
19:08:08 no objections so far, so i'm assuming so
19:08:21 what's the current plan to support the current bots on irc?
19:08:25 i'll send out a formal announcement... wednesday?
19:08:29 At this point I'm not sure there is much else for us to do on the opendev side other than helping get a new room created if necessary and joining our matrix clients to it
19:08:32 corvus: wfm
19:08:42 and i can take care of making the room too
19:08:47 \o/
19:09:02 yoctozepto: at this point we'll keep running them, though I'd like to start testing matrix gerritbot as a replacement for irc gerritbot at some point
19:09:24 clarkb: ack, thanks; I know it's less featureful so to speak
19:09:27 yoctozepto: we'll need a channel to do that in where people won't mind the noise and possible double posting. But if that goes well I think we can drop the irc gerritbot in favor of the matrix bot
19:09:29 but perhaps it has a better design
19:09:47 yoctozepto: eavesdrop is trickier because it doesn't support meetings on the matrix side
19:10:00 we'll need a separate matrix meeting bot before we can swap out the limnoria setup I think
19:10:13 eh, always uphill
19:10:24 or add meetings to the existing matrix eavesdrop
19:10:33 also not sure if the reliability of the oftc matrix bridge plays into it (missing the occasional gerrit event in an irc channel is probably fine though?)
19:10:46 corvus: ya, or that
19:11:02 fungi: yup, another reason starting with gerritbot is a good idea
19:11:08 that would be more problematic for logging and meetings though
19:11:18 for the actual logging part of it, it's nice since it never misses anything that matrix sees
19:11:51 and even though occasionally the oftc irc bridge misses something, on the whole i'd say it's <= what one misses just in normal netsplits
19:12:00 mayhaps the openstack community migrates as well at some point
19:12:05 (it probably is netsplits that cause that)
19:12:12 corvus: good point re netsplits
19:12:12 there was some pressure from ironic I think
19:12:25 we'll see ;-)
19:12:50 yoctozepto: it's my personal hope -- everything we've done for zuul is designed to be "forward-compatible" for any other projects wanting to either have a robust bridged presence or switch
19:12:52 yoctozepto: at this point I think it would be good to see how it goes for zuul since it is a much smaller community that knows how to pivot if necessary :) we are learning and having the ability to be flexible with zuul is nice. Not that I expect issues at this point
19:12:57 yoctozepto: one upside to this model is that ironic can decide to switch even if not all of openstack does
19:13:16 fungi: I'm not sure that is entirely true the way zuul is doing it
19:13:26 you three just said all my thoughts ;p
19:13:27 assuming you mean they could leverage the bridge
19:13:40 once zuul moves, the irc bridge is not something that will function for it aiui
19:13:53 it will be all matrix all the time
19:14:09 i mean ironic could decide to stop using their oftc channel and have a matrix channel and still take advantage of the same bots (once the lack of meetings is solved, assuming they have meetings)
19:14:20 gotcha
19:14:23 that's a choice we're making, but not the only possible choice :)
19:14:54 the openstack community as a whole could decide that it's okay if some teams use matrix instead of irc; they already have some teams using wechat instead of irc
19:15:22 Definitely get your matrix clients set up if you would like to interact with zuul more synchronously in the future. And ya, I don't see any issues with zuul moving forward at this point from the bot perspective
19:15:30 yoctozepto: I don't know if the ironic team has seen the zuul docs - but zuul updated all of their contributor docs to point people to matrix as the way to get to the channel even with it currently still on irc
19:15:35 fungi: who's using wechat?
19:16:16 mordred: well, I see more and more people coming from the irc bridge
19:16:19 on all the channels
19:16:28 so I think they did see these tips
19:17:29 Are there any other concerns or issues to call out on the subject of matrix before zuul moves? If not we should move on. OpenStack and Ironic can definitely look to see how it goes with zuul to help evaluate things for their uses
19:17:30 yoctozepto: i forget which ones, but one or more of the teams consisting predominantly of contributors from mainland china coordinate over wechat, and i think have irc channels which sit unused
19:17:48 fungi: ack
19:17:59 clarkb: go on; I'm happy to see zuul move :-)
19:18:21 #topic Backup reliability
19:18:42 We had a number of servers fail to back up on Friday, but manually rerunning the backups on one of them succeeded a few hours later
19:18:54 what seemed to happen was they all started their backups at the same time and then timed out 2 hours later, sending email about it
19:19:01 Since then they haven't reported failures
19:19:23 Calling this out as it was backups to the other backup server, not the one gitea01 struggles with. Though gitea01 is also still struggling
19:19:53 ianw: I think the backups use a random hour but not minute? or is the minute in the cron randomized too?
19:20:06 I'm wondering if we need better randomization for the minute portion as it seems quite a few have the same time set up
19:20:35 umm, i would have thought it was random minutes
19:20:54 iirc we divide 24 by the number of backup servers
19:21:13 hrm, I guess it is just chance on the way it hashes then
19:21:43 Things have been happy since. I'm fairly certain this was an internet blip and we don't need to focus too much on it unless it becomes persistent
19:21:43 https://opendev.org/opendev/system-config/src/branch/master/playbooks/roles/borg-backup/tasks/main.yaml#L62
19:21:51 the minute should be random
19:22:00 huh, so it is chance then
19:22:03 lucky us :)
19:22:26 actually -> '{{ 59|random(seed=item) }}'
19:22:53 ya, so it doesn't change the cron each ansible run
19:22:54 item is the backup server i guess. the seed should probably be the inventory hostname of the host we're setting up
19:23:01 oh yes ++
19:23:23 seed=ansible_fqdn or similar?
19:23:35 anyway we don't need to fix it now. I wanted to call it out and it does seem that maybe there is a small bug here :)
19:24:43 #topic Mailman server upgrades
19:24:59 Yesterday with the help of fungi I got a test copy of lists.kc.io booted and upgraded to focal
19:25:06 The mailman web stuff seems to work
19:25:29 fungi: did you manage to try a newlist yet to see if it successfully sends email to your server (as I expect mine would reject it)?
19:25:37 i still need to spend a few minutes creating a test ml on it and trying to get messages in and out
19:25:56 thanks
19:26:30 If people can think of other test actions that would be great, but so far things look ok. The biggest hurdle was realizing the snapshot wouldn't boot because its fstab had an entry for swap that didn't exist on the new server's ephemeral disk
19:26:54 Once we're happy with the results I'll reach out to the kata project and schedule an upgrade for them.
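[Editor's note: on the borg-backup cron randomization discussed above, this is a minimal sketch of the suggested fix (seeding the random minute with the host being configured rather than only the backup server), not the actual opendev/system-config task. The task name, user, job path, backup_hour, and the borg_backup_servers variable are illustrative assumptions.]

    # Seed the minute with the host being backed up plus the backup server,
    # so different hosts land on different minutes while each host's cron
    # line stays stable across ansible runs.
    - name: Install borg backup cron job
      cron:
        name: "borg backup to {{ item }}"
        user: borg
        hour: "{{ backup_hour }}"   # computed elsewhere in the role
        minute: "{{ 59 | random(seed=inventory_hostname ~ item) }}"
        job: "/usr/local/bin/borg-backup {{ item }}"
      loop: "{{ borg_backup_servers | default([]) }}"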
19:27:02 Then when we're happy with that we can do the same for lists.o.o
19:27:49 fungi: one thing I meant to check which I haven't yet is how the python2 config is selected for mailman
19:27:51 also remembering that you need to fake the hostname resolution in your /etc/hosts to test the webui
19:27:57 because apache cares
19:28:02 want to make sure that in our overrides for mailman vhosts we don't break that
19:28:27 I expect it is fine since I think we still include the Defaults.py first, but want to be sure
19:28:55 #topic Improving OpenDev CD throughput
19:29:16 i replaced our cds with dvds, now they're much faster
19:29:35 This came up in discussion last week around how some change in zuul behavior has made applying deploy buildsets for merged changes much slower due to the hourly deploy pipeline buildsets
19:29:54 The hourly deploys were taking almost an hour for each buildset, then immediately re-enqueuing another buildset
19:30:25 this meant you'd land a change, it would wait up to an hour for its turn, and as soon as it was done the hourly buildset would take the locks and continue. Landing a second change would have to wait even longer
19:31:03 There are a couple of fundamental issues here: our ansible jobs take a lot of time, our ansible jobs don't run in parallel, and we run/ran expensive jobs hourly
19:31:34 A quick and easy improvement was to stop running the cloud launcher job hourly since we rarely need its updates anyway. That cut about 20 minutes off of the hourly buildset runtime
19:31:57 I think we can make similar improvements dropping zuul-preview and remote-puppet-else from the hourly job as well (though the impact won't be as dramatic)
19:32:10 We can run those daily instead and not be majorly impacted
19:32:29 pabelanger has also proposed a related change for zuul to start prioritizing semaphore locks along with the pipeline priorities, which implies some potential ordering issues between our timer and change-merged triggered deploy pipelines
19:32:50 A more comprehensive fix is to figure out the dependencies between our jobs and express them in the zuul config properly. Then we can run the jobs in parallel.
19:33:12 One concern I've got with running jobs in parallel is the system-config updates on bridge, but I think we can move that into a central locking job that pauses, allowing its children to run
19:33:49 with the semaphore priority change, we could end up with incomplete periodic deploys rolling back configuration when their buildsets resume after regaining control of the lock, unless we make sure every build updates system-config before using it
19:34:16 yup, and the hourly deploys do currently do that, so we should be ok for now
19:34:29 hourly jobs always pull the latest system-config
19:34:51 the issue will be if we then flip back to a specific change in deploy
19:34:55 ahh, okay, for some reason i thought we had concluded they were using the state of system-config from the time they were enqueued instead
19:35:00 fungi: only in deploy
19:35:06 got it
19:35:07 clarkb: do they pull it or push it?
19:35:19 the jobs behave differently in different pipelines?
19:35:57 corvus: they pull when in the hourly and daily deploy pipelines, and yes
19:36:10 the job says "i'm in a periodic pipeline, i will manually run 'git fetch...'"?
19:36:22 https://opendev.org/opendev/system-config/src/branch/master/playbooks/zuul/run-production-playbook.yaml#L27-L32
19:36:26 corvus: yes
19:36:41 okay, neat, i didn't remember that
19:37:00 so those jobs are safe from "interruptions"
19:37:05 I suspect so
19:37:27 yes, i agree
19:37:41 i'm now less worried about that behavior changing in the scheduler
19:37:45 the one place we might have trouble is if we land a number of changes that all queue up in deploy and the hourly buildsets run the results of the end of that stack, then we run the first in the normal deploy which is older
19:38:10 then we flip back to hourly and run the latest, then flip back to an older change, until we eventually become consistent by running the last change in the normal deploy
19:38:34 This is why I mentioned yesterday that we might just want to always deploy from master instead
19:38:35 if deploy has a higher priority than hourly, that shouldn't occur though, right?
19:38:45 deploy will be exhausted before hourly gets to run more builds
19:38:51 fungi: yes, I think that is another way to address it
19:38:57 so there should never be something "older" run later
19:39:16 fungi: currently I'm not sure we do that; it is worth checking if someone can do that
19:39:23 i'll look
19:39:49 also we don't want to switch until we address pabelanger's issue I suspect. But we can coordinate that
19:40:02 it's already how we want it
19:40:15 precedence high on deploy, low on hourly
19:40:50 as long as we keep it that way, it should be race-free
19:40:52 Anyway, to do parallelized jobs I think we want an anchor job that grabs a global lock and sets up system-config for the entire buildset. Then individual jobs will grab a secondary semaphore that limits the concurrency we subject bridge to for that buildset
19:41:15 though if we want to do the parallel thing, yeah we need two layers of semaphores
19:41:15 clarkb: i don't think having deploy run master is necessarily best; if there's a sequence of changes that should happen in deploy, that may not work as expected?
19:41:35 In order to do that we need to understand the dependencies between all our jobs and then decide if some jobs should be combined or if they should be separate with explicit dependencies in the config.
19:41:43 clarkb: maybe it's okay -- but it's worth thinking about whether that subverts expectations
19:41:52 corvus: yup, that is the downside. If we want to, say, remove a cron by ensuring it is absent and then remove the cron definition, always doing master breaks that
19:42:06 corvus: if the priority settings work then that is probably the best option
19:42:42 For the first step, which is understanding the relationships between jobs, I think it would be good to have that in some sort of human-readable format (graph images? I dunno)
19:42:54 does anyone have a good idea for a way to make that collaborative?
19:43:05 and corvus pointed out that having a buildset semaphore with count 1 would allow us to safely do a build semaphore with a higher count
19:43:10 Then we could potentially each grab some subset of jobs and start mapping them out together
19:43:42 miro is a web tool I've used in other contexts, we could maybe try that here if people are interested?
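[Editor's note: a minimal sketch of the "anchor job" plus two-semaphore layout discussed above, for orientation only. All job and semaphore names are hypothetical, the max value of 3 is arbitrary, and a real change would also need playbooks, wiring for every service job, and testing.]

    # One semaphore serializes whole buildsets; a second caps how many jobs
    # hit bridge at once within a buildset.
    - semaphore:
        name: infra-prod-buildset
        max: 1

    - semaphore:
        name: infra-prod-bridge
        max: 3

    - job:
        name: infra-prod-setup-src        # the "anchor" job
        semaphore: infra-prod-buildset
        # its run playbook would update system-config on bridge once, then
        # pause (zuul_return with zuul.pause: true) so it keeps holding the
        # buildset lock while its child jobs run

    - job:
        name: infra-prod-service-example
        semaphore: infra-prod-bridge

    - project:
        deploy:
          jobs:
            - infra-prod-setup-src
            - infra-prod-service-example:
                dependencies:
                  - infra-prod-setup-src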
19:43:42 i would say just do an etherpad, we can make a yaml-ish dag in there
19:43:59 if we want a flowchart instead, then maybe not etherpad
19:44:02 fungi: my concern with that is we seem to already fail with the yaml-ish dag because we haven't used it in the zuul config
19:44:18 (a small number of jobs do have dependencies mapped, but not all)
19:44:28 upside to a dag in actual yaml would be we can just copy it into the config ;)
19:44:46 fair point. I guess we can start there and switch if it isn't good enough
19:45:16 I'm hoping to be able to start mapping some of this out this week. Feel free to get started first if you are interested
19:45:24 i'm not opposed to trying some whiteboarding tool instead/in addition though, if you think it would help
19:45:40 fungi: I'm fine either way; I wasn't really happy with any of the options.
19:45:54 miro has its problems (makes your browser die)
19:46:22 we could start by writing out explanations of the relationships, and then trying to reflect those in structured data, i guess
19:46:41 Then once we've got things mapped out we can express those relationships in the zuul config while keeping everything serialized the way it is today
19:47:01 And once we're happy with the relationships in the zuul config we can swap over to the anchor jobs and the various locks
19:47:13 I don't expect this will be a quick change, but it should make our deployments much happier
19:47:16 i like 'anchor job' as a term
19:47:50 * fungi weighs anchor
19:47:55 Definitely let me know if you start looking at this. Otherwise I'll let others know when I've got something written down somewhere that we can look at :)
19:48:16 Finally I think it is worth mentioning some of the "crazier" ideas that have come up:
19:48:19 when in doubt, write a DSL
19:48:23 we could try mitogen
19:48:34 we could run the hourly buildsets less often, like every other hour or every third hour
19:48:56 I'm not a huge fan of reducing the hourly frequency since the things we want in there do want frequent updates (like zuul and nodepool)
19:49:06 But the mitogen idea (thanks pabelanger) is an interesting one
19:49:20 mitogen claims to significantly reduce the amount of python forking that ansible does
19:49:26 and we know that is a major cost in some of our playbooks
19:49:40 It is BSD licensed and installable from pypi
19:50:08 The risk would be that we need to rework our playbooks significantly to make it work (and maybe even downgrade ansible? it isn't clear to me if they work with newer ansible)
19:50:13 [fwiw, i don't actually think it's the forking, it's the serialized task distribution algorithm]
19:50:40 this? https://pypi.org/project/mitogen/
19:50:55 fungi: yes, it is a generic tool but installing that comes with an ansible plugin that you configure ansible to use
19:51:15 corvus: ya, I think they are related because ansible doesn't pre-fork the worker processes anymore and instead forks a new process for each task that it grabs off the serialized queue?
19:51:29 yeah, started digging in the git repo linked from there, seems like that's the one
19:51:32 clarkb: yeah
19:51:49 clarkb: it thinks it has an executor worker distribution model, but it doesn't
19:52:19 Anyway, mitogen is a large subject. Paul offered to push a change and see if it would work in testing
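[Editor's note: for reference, mitogen is enabled through its Ansible strategy plugin via ansible.cfg rather than playbook changes; the snippet below follows mitogen's documented settings. The install path is illustrative, and whether the plugin works with the Ansible version in use on bridge is exactly the open question raised above.]

    # ansible.cfg (sketch)
    [defaults]
    # point at the strategy plugin shipped inside the pip-installed package;
    # adjust the path to wherever mitogen actually lands on the host
    strategy_plugins = /usr/local/lib/python3.8/site-packages/ansible_mitogen/plugins/strategy
    # opt in by switching the default strategy
    strategy = mitogen_linear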
19:52:35 just interesting to see that mitogen hasn't released a new version in almost two years, but it does have git commits as recent as 5 months ago
19:52:38 I think it is worth entertaining if it shows an improvement in performance without requiring major refactoring
19:52:56 I don't expect us to say yay or nay here, but wanted to call it out
19:53:10 Definitely the most important thing is improving the way we run the jobs in zuul in the first place
19:53:24 oh, they uploaded some prereleases to pypi in january, okay
19:53:25 and the playbooks themselves
19:53:26 Anything else on this subject or should I open up the floor?
19:54:07 #topic Open Discussion
19:54:09 * fungi checks that he's not standing on a trapdoor
19:54:34 Is there anything else?
19:55:02 noticed today, possibly related to a missed change-merged event for a zuul config update, we're getting a dozen or two of these exceptions a day:
19:55:07 #link http://paste.openstack.org/show/808156/
19:55:47 it's possible this means we're missing a small but not insignificant number of gerrit events in zuul
19:55:56 After some lunch and a bike ride I can help try and understand the sequence of events there if that would be useful
19:56:06 something to keep in mind if you end up needing to run down any similar weirdness
19:56:12 I think we want to see what sort of actions would be aborted when we get those exceptions
19:56:28 and determine if we can retry those requests or if we can avoid aborting important actions by running them earlier etc
19:56:42 but that discussion probably belongs in #zuul and I haven't pulled the code up yet :0
19:56:44 er :)
19:57:05 is opendev's gerrit or network having issues?
19:57:17 corvus: we haven't seen any other reports
19:57:22 maybe we're seeing this now because zuul<-internet->gerrit ?
19:57:25 it is a possibility
19:57:41 fungi said 12/day
19:57:54 17 yesterday, 16 so far today when i looked
19:58:01 er, I mean no reports by users or other issues. These exceptions are the only indication so far of an internet problem
19:58:07 that said, it could definitely be an internet problem
19:58:27 zuul is certainly hammering gerrit way harder than the typical user, so more likely to notice
19:59:00 yeah, given that this is a TCP-level error, and we haven't changed that handling in zuul in ages, but we have recently moved gerrit farther from zuul in network topology, it seems there's a circumstantial fit
19:59:31 if that is the case, then "retry harder" may be the best/only solution
19:59:39 also not related, but paste reminded me just now: a user let us know last week that we stopped supporting the pastebinit tool when we switched lodgeit servers. apparently pastebinit hardcodes hostnames and our redirect from paste.openstack.org (which it knows about) to paste.opendev.org is breaking it. we're investigating options to be able to maybe exclude the pastebinit user agent
19:59:41 from the redirect, in case anyone interested is following the meeting
20:00:04 fungi: ianw: it wasn't clear to me if someone was volunteering to update the change for that?
20:00:12 do we need someone to volunteer?
20:00:18 i can update the change, i just need to find time
20:00:22 ok
20:00:32 i've got a test server held from the earlier investigations
20:00:34 I was going to suggest we could land the redirect removal now if we want and then follow up with the UA check
20:00:47 but if we have someone working on the UA check already then no need
20:00:59 And we are at time.
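[Editor's note: one plausible shape for the pastebinit user-agent exclusion mentioned above, as an Apache mod_rewrite sketch. The existing vhost layout and redirect directives are not shown in the meeting, and the assumption that pastebinit's User-Agent string contains "pastebinit" should be verified against the held test server.]

    # paste.openstack.org vhost (sketch): redirect browsers to
    # paste.opendev.org but let pastebinit keep using the old hostname
    RewriteEngine On
    RewriteCond %{HTTP_USER_AGENT} !pastebinit [NC]
    RewriteRule ^/(.*)$ https://paste.opendev.org/$1 [R=301,L]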
20:01:08 if "i've thought about what it looks like" counts as working on it ;)
20:01:20 Feel free to continue discussion in #opendev or on the mailing list, but we should call it here and let people get to lunch/breakfast/dinner etc :)
20:01:23 (I'm hungry)
20:01:26 thanks everyone!
20:01:28 thanks clarkb!
20:01:29 #endmeeting