Tuesday, 2021-08-17

10:34 *** diablo_rojo is now known as Guest4602
19:00 <clarkb> Anyone else here for the meeting?
19:00 <clarkb> We will get started shortly
19:00 <ianw> o/
19:01 <fungi> ahoy
19:01 <clarkb> #startmeeting infra
19:01 <opendevmeet> Meeting started Tue Aug 17 19:01:33 2021 UTC and is due to finish in 60 minutes.  The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01 <opendevmeet> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01 <opendevmeet> The meeting name has been set to 'infra'
19:01 <clarkb> #link http://lists.opendev.org/pipermail/service-discuss/2021-August/000277.html Our Agenda
19:02 <yoctozepto> o/
19:02 <clarkb> #topic Announcements
19:02 <clarkb> I didn't have any announcements
19:02 <clarkb> #topic Actions from last meeting
19:02 <clarkb> #link http://eavesdrop.openstack.org/meetings/infra/2021/infra.2021-08-10-19.01.txt minutes from last meeting
19:02 <clarkb> There were no actions recorded
19:03 <clarkb> #topic Specs
19:03 <clarkb> #link https://review.opendev.org/c/opendev/infra-specs/+/804122 Prometheus Cacti replacement
19:03 <clarkb> tristanC reviewed this spec (thank you!). It would be good if I could get some feedback from infra-root as well
19:04 <clarkb> This isn't really urgent at the moment, but I'd like feedback while it's still somewhat fresh for me, if possible
19:04 <clarkb> #topic Topics
19:04 * fungi grins
19:04 <clarkb> Let's jump right in. I did quite a bit of pruning on the topics list as we seem to have reached the end point of a number of in-flight items
19:05 <clarkb> thank you to everyone who has been working on those items. It is much appreciated.
19:05 <clarkb> If something got removed that should have remained, please let me know
19:05 <clarkb> #topic Service Coordinator Election
19:05 <clarkb> The only nomination I saw was the one I submitted for myself. By default I think this means I'm it again.
19:06 <yoctozepto> \o/ congrats clarkb
19:06 <fungi> and/or condolences
19:06 <clarkb> ha
19:06 <fungi> but either way, thanks!
19:06 <yoctozepto> yes, thanks for being awesome 8-)
19:06 <clarkb> We'll have another election in about 6 months (the details were in that email thread)
19:07 <clarkb> until then I'll run this meeting and coordinate with the projects and our resource donors
19:07 <clarkb> #topic Matrix eavesdrop and gerritbot bots
19:07 <clarkb> The bots are live and the latest version of the gerritbot seems to be more reliable
19:07 <fungi> and sounds like zuul is taking the plunge this weekend
19:07 <clarkb> corvus has proposed that August 21 will be the Zuul switch day
19:08 <clarkb> yup
19:08 <corvus> no objections so far, so i'm assuming so
19:08 <yoctozepto> what's the current plan to support the current bots on irc?
19:08 <corvus> i'll send out a formal announcement... wednesday?
19:08 <clarkb> At this point I'm not sure there is much else for us to do on the opendev side other than helping get a new room created if necessary and joining our matrix clients to it
19:08 <clarkb> corvus: wfm
19:08 <corvus> and i can take care of making the room too
19:08 <mordred> \o/
19:09 <clarkb> yoctozepto: at this point we'll keep running them, though I'd like to start testing the matrix gerritbot as a replacement for the irc gerritbot at some point
19:09 <yoctozepto> clarkb: ack, thanks; I know it's less featureful so to speak
19:09 <clarkb> yoctozepto: we'll need a channel to do that in where people won't mind the noise and possible double posting. But if that goes well I think we can drop the irc gerritbot in favor of the matrix bot
19:09 <yoctozepto> but perhaps it has better design
19:09 <clarkb> yoctozepto: eavesdrop is trickier because it doesn't support meetings on the matrix side
19:10 <clarkb> we'll need a separate matrix meeting bot before we can swap out the limnoria setup, I think
19:10 <yoctozepto> eh, always uphill
19:10 <corvus> or add meetings to the existing matrix eavesdrop
19:10 <fungi> also not sure if the reliability of the oftc matrix bridge plays into it (missing the occasional gerrit event in an irc channel is probably fine though?)
19:10 <clarkb> corvus: ya or that
19:11 <clarkb> fungi: yup, another reason starting with gerritbot is a good idea
19:11 <clarkb> that would be more problematic for logging and meetings though
19:11 <corvus> for the actual logging part of it, it's nice since it never misses anything that matrix sees
19:11 <corvus> and even though occasionally the oftc irc bridge misses something, on the whole, i'd say it <= what one misses just in normal netsplits
19:12 <yoctozepto> mayhaps the openstack community migrates as well at some point
19:12 <corvus> (it probably is netsplits that cause that)
19:12 <clarkb> corvus: good point re netsplits
19:12 <yoctozepto> there was some pressure from ironic I think
19:12 <yoctozepto> we'll see ;-)
19:12 <corvus> yoctozepto: it's my personal hope -- everything we've done for zuul is designed to be "forward-compatible" for any other projects wanting to either have a robust bridged presence or switch
19:12 <clarkb> yoctozepto: at this point I think it would be good to see how it goes for zuul since it is a much smaller community that knows how to pivot if necessary :) we are learning and having the ability to be flexible with zuul is nice. Not that I expect issues at this point
19:12 <fungi> yoctozepto: one up-side to this model is that ironic can decide to switch even if not all of openstack does
19:13 <clarkb> fungi: I'm not sure that is entirely true the way zuul is doing it
19:13 <yoctozepto> you three just said all my thoughts ;p
19:13 <clarkb> assuming you mean they could leverage the bridge
19:13 <clarkb> once zuul moves, the irc bridge is not something that will function for it aiui
19:13 <clarkb> it will be all matrix all the time
19:14 <fungi> i mean ironic could decide to stop using their oftc channel and have a matrix channel and still take advantage of the same bots (once the lack of meetings is solved, assuming they have meetings)
19:14 <clarkb> gotcha
19:14 <corvus> that's a choice we're making, but not the only possible choice :)
19:14 <fungi> the openstack community as a whole could decide that it's okay if some teams use matrix instead of irc; they already have some teams using wechat instead of irc
19:15 <clarkb> Definitely get your matrix clients set up if you would like to interact with zuul more synchronously in the future. And ya, I don't see any issues with zuul moving forward at this point from the bot perspective
19:15 <mordred> yoctozepto: I don't know if the ironic team has seen the zuul docs - but zuul updated all of their contributor docs to point people to matrix as the way to get to the channel even with it currently still on irc
19:15 <yoctozepto> fungi: who's using wechat?
19:16 <yoctozepto> mordred: well, I see more and more people coming from the irc bridge
19:16 <yoctozepto> on all the channels
19:16 <yoctozepto> so I think they did see these tips
19:17 <clarkb> Are there any other concerns or issues to call out on the subject of matrix before zuul moves? If not we should move on. Openstack and Ironic can definitely look to see how it goes with zuul to help evaluate things for their uses
19:17 <fungi> yoctozepto: i forget which ones, but one or more of the teams consisting predominantly of contributors from mainland china coordinate over wechat, and i think have irc channels which sit unused
19:17 <yoctozepto> fungi: ack
19:17 <yoctozepto> clarkb: go on; I'm happy to see zuul move :-)
19:18 <clarkb> #topic Backup reliability
19:18 <clarkb> We had a number of servers fail to back up on Friday. But manually rerunning the backups on one of them succeeded a few hours later
19:18 <clarkb> what seemed to happen was they all started their backups at the same time and then timed out 2 hours later, sending email about it
19:19 <clarkb> Since then they haven't reported failures
19:19 <clarkb> Calling this out as it was backups to the other backup server, not the one gitea01 struggles with. Though gitea01 is also still struggling
19:19 <clarkb> ianw: I think the backups use a random hour but not minute? or is the minute in the cron randomized too?
19:20 <clarkb> I'm wondering if we need better randomization for the minute portion as it seems quite a few have the same time set up
19:20 <ianw> umm, i would have thought it was random minutes
19:20 <ianw> iirc we divide 24 by the number of backup servers
19:21 <clarkb> hrm I guess it is just chance in the way it hashes then
19:21 <clarkb> Things have been happy since. I'm fairly certain this was an internet blip and we don't need to focus too much on it unless it becomes persistent
19:21 <ianw> https://opendev.org/opendev/system-config/src/branch/master/playbooks/roles/borg-backup/tasks/main.yaml#L62
19:21 <ianw> the minute should be random
19:22 <clarkb> huh so it is chance then
19:22 <clarkb> lucky us :)
19:22 <ianw> actually -> '{{ 59|random(seed=item) }}'
19:22 <clarkb> ya so it doesn't change the cron each ansible run
19:22 <ianw> item is the backup server i guess.  the seed should probably be the inventory hostname of the host we're setting up
19:23 <clarkb> oh yes ++
19:23 <clarkb> seed=ansible_fqdn or similar?
19:23 <clarkb> anyway we don't need to fix it now. I wanted to call it out and it does seem that maybe there is a small bug here :)
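
(A minimal sketch of the fix floated above, assuming an Ansible cron task shaped roughly like the one in the linked borg-backup role: seeding the random minute with the host being backed up, rather than with the backup server ("item"), spreads hosts that share a backup server across different minutes. Task, variable, and script names here are placeholders, not copies of the role.)

    # Sketch only -- the role linked above currently uses
    # '{{ 59|random(seed=item) }}', where item is the backup server, so every
    # host pointed at the same server computes the same minute.
    - name: Install borg backup cron job
      cron:
        name: 'borg-backup-{{ item }}'
        user: root
        hour: '5'                                        # placeholder hour
        minute: '{{ 59 | random(seed=inventory_hostname) }}'
        job: '/usr/local/bin/borg-backup {{ item }}'
      loop: '{{ borg_backup_servers | default([]) }}'
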
19:24 <clarkb> #topic Mailman server upgrades
19:24 <clarkb> Yesterday, with the help of fungi, I got a test copy of lists.kc.io booted and upgraded to focal
19:25 <clarkb> The mailman web stuff seems to work
19:25 <clarkb> fungi did you manage to try a newlist yet to see if it successfully sends email to your server (as I expect mine would reject it)
19:25 <fungi> i still need to spend a few minutes creating a test ml on it and trying to get messages in and out
19:25 <clarkb> thanks
19:26 <clarkb> If people can think of other test actions that would be great, but so far things look ok. The biggest hurdle was realizing the snapshot wouldn't boot because its fstab had an entry for swap that didn't exist on the new server's ephemeral disk
19:26 <clarkb> Once we're happy with the results I'll reach out to the kata project and schedule an upgrade for them.
19:27 <clarkb> Then when we're happy with that we can do the same for lists.o.o
19:27 <clarkb> fungi: one thing I meant to check which i haven't yet is how the python2 config is selected for mailman
19:27 <fungi> also remembering that you need to fake the hostname resolution in your /etc/hosts to test the webui
19:27 <fungi> because apache cares
19:28 <clarkb> want to make sure that in our overrides for mailman vhosts we don't break that
19:28 <clarkb> I expect it is fine since I think we still include the Defaults.py first but want to be sure
19:28 <clarkb> #topic Improving OpenDev CD throughput
19:29 <fungi> i replaced our cds with dvds, now they're much faster
19:29 <clarkb> This came up in discussion last week around how some change in zuul behavior has made applying deploy buildsets for merged changes much slower now due to the hourly deploy pipeline buildsets
19:29 <clarkb> The hourly deploys were taking almost an hour for each buildset, then immediately re-enqueuing another buildset
19:30 <clarkb> this meant you'd land a change, it would wait up to an hour for its turn, and then as soon as it was done the hourly buildset would take the locks and continue. Landing a second change would have to wait even longer
19:31 <clarkb> There are a couple of fundamental issues here: our ansible jobs take a lot of time, our ansible jobs don't run in parallel, and we run/ran expensive jobs hourly
19:31 <clarkb> A quick and easy improvement was to stop running the cloud launcher job hourly since we rarely need its updates anyway. That cut about 20 minutes off of the hourly buildset runtime
19:31 <clarkb> I think we can make similar improvements dropping zuul-preview and remote-puppet-else from the hourly job as well (though the impact won't be as dramatic)
19:32 <clarkb> We can run those daily instead and not be majorly impacted
19:32 <fungi> pabelanger has also proposed a related change for zuul to start prioritizing semaphore locks along with the pipeline priorities, which implies some potential ordering issues between our timer and change-merged triggered deploy pipelines
19:32 <clarkb> A more comprehensive fix is to figure out our dependencies between jobs and express them in the zuul config properly. Then we can run the jobs in parallel.
19:33 <clarkb> One concern I've got running jobs in parallel is the system-config updates on bridge, but I think we can move that into a central locking job that pauses, allowing its children to run
19:33 <fungi> with the semaphore priority change, we could end up with incomplete periodic deploys rolling back configuration when their buildsets resume after regaining control of the lock, unless we make sure every build updates system-config before using it
19:34 <clarkb> yup and the hourly deploys do currently do that so we should be ok for now
19:34 <clarkb> hourly jobs always pull latest system-config
19:34 <clarkb> the issue will be if we then flip back to a specific change in deploy
19:34 <fungi> ahh, okay, for some reason i thought we had concluded they were using the state of system-config from the time they were enqueued instead
19:35 <clarkb> fungi: only in deploy
19:35 <fungi> got it
19:35 <corvus> clarkb: they pull it or push it?
19:35 <corvus> the jobs behave differently in different pipelines?
19:35 <clarkb> corvus: they pull when in the hourly and daily deploy pipelines, and yes
19:36 <corvus> the job says "i'm in a periodic pipeline, i will manually run 'git fetch...'"?
19:36 <clarkb> https://opendev.org/opendev/system-config/src/branch/master/playbooks/zuul/run-production-playbook.yaml#L27-L32
19:36 <clarkb> corvus: yes
19:36 <corvus> okay, neat, i didn't remember that
19:37 <corvus> so those jobs are safe from "interruptions"
19:37 <clarkb> I suspect so
19:37 <fungi> yes, i agree
19:37 <fungi> i'm now less worried about that behavior changing in the scheduler
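
(A rough sketch of the behavior described at the link above, not the actual playbook tasks: in the periodic pipelines the system-config checkout on bridge is reset to the tip of master rather than left at the enqueued change. The pipeline names and checkout path below are assumptions.)

    # Sketch only: pull current master when running from a periodic pipeline.
    - name: Update system-config to current master in periodic pipelines
      when: zuul.pipeline in ['opendev-prod-hourly', 'opendev-prod-daily']
      shell: |
        git fetch origin master
        git reset --hard origin/master
      args:
        chdir: /home/zuul/src/opendev.org/opendev/system-config
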
19:37 <clarkb> the one place we might have trouble is if we land a number of changes that all queue up in deploy and the hourly buildsets run the results of the end of that stack, then we run the first in the normal deploy which is older
19:38 <clarkb> then we flip back to hourly and run the latest, then flip back to an older change until we eventually become consistent by running the last change in the normal deploy
19:38 <clarkb> This is why I mentioned yesterday that we might just want to always deploy from master instead
19:38 <fungi> if deploy has a higher priority than hourly, that shouldn't occur though, right?
19:38 <fungi> deploy will be exhausted before hourly gets to run more builds
19:38 <clarkb> fungi: yes I think that is another way to address it
19:38 <fungi> so there should never be something "older" run later
19:39 <clarkb> fungi: currently I'm not sure we do that; it is worth checking on if someone can do that
19:39 <fungi> i'll look
19:39 <clarkb> also we don't want to switch until we address pabelanger's issue, I suspect. But we can coordinate that
19:40 <fungi> it's already how we want
19:40 <fungi> precedence high on deploy, low on hourly
19:40 <fungi> as long as we keep it that way, should be race-free
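
(For reference, a heavily trimmed sketch of what "precedence high on deploy, low on hourly" looks like in Zuul pipeline definitions; the names, triggers, and everything omitted here are illustrative rather than the exact opendev config.)

    - pipeline:
        name: deploy
        manager: independent
        precedence: high          # change-merged deployments jump the queue
        trigger:
          gerrit:
            - event: change-merged

    - pipeline:
        name: opendev-prod-hourly
        manager: independent
        precedence: low           # periodic buildsets yield to deploy
        trigger:
          timer:
            - time: '0 * * * *'
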
19:40 <clarkb> Anyway, to do parallelized jobs I think we want an anchor job that grabs a global lock and sets up system-config for the entire buildset. Then individual jobs will grab a secondary semaphore that limits the concurrency we subject bridge to for that buildset
19:41 <fungi> though if we want to do the parallel thing, yeah we need two layers of semaphores
19:41 <corvus> clarkb: i don't think having deploy run master is necessarily best; if there's a sequence of changes that should happen in deploy, that may not work as expected?
19:41 <clarkb> In order to do that we need to understand the dependencies between all our jobs and then decide if some jobs should be combined or if they should be separate with explicit dependencies in the config.
19:41 <corvus> clarkb: maybe it's okay -- but it's worth thinking about whether that subverts expectations
19:41 <clarkb> corvus: yup that is the downside. If we want to, say, remove a cron by ensuring it is absent and then remove the cron definition, always doing master breaks that
19:42 <clarkb> corvus: if the priority settings work then that is probably the best option
19:42 <clarkb> For the first step, which is understanding the relationship between jobs, I think it would be good to have that in some sort of human readable format (graph images? I dunno)
19:42 <clarkb> does anyone have a good idea for a way to make that collaborative?
19:43 <fungi> and corvus pointed out that having a buildset semaphore with count 1 would allow us to safely do a build semaphore with a higher count
19:43 <clarkb> Then we could potentially each grab some subset of jobs and start mapping them out together
19:43 <clarkb> miro is a web tool I've used in other contexts, we could maybe try that here if people are interested?
19:43 <fungi> i would say just do an etherpad, we can make a yaml-ish dag in there
19:43 <fungi> if we want a flowchart instead, then maybe not etherpad
19:44 <clarkb> fungi: my concern with that is we seem to already fail with the yamlish dag because we haven't used it in the zuul config
19:44 <clarkb> (a small number of jobs do have dependencies mapped but not all)
19:44 <fungi> upside to a dag in actual yaml would be we can just copy it into the config ;)
19:44 <clarkb> fair point. I guess we can start there and switch if it isn't good enough
19:45 <clarkb> I'm hoping to be able to start mapping some of this out this week. Feel free to get started first if you are interested
19:45 <fungi> i'm not opposed to trying some whiteboarding tool instead/in addition though, if you think it would help
19:45 <clarkb> fungi: I'm fine either way; I wasn't really happy with any of the options really.
19:45 <clarkb> miro has its problems (makes your browser die)
19:46 <fungi> we could start by writing out explanations of the relationships, and then trying to reflect those in structured data, i guess
19:46 <clarkb> Then once we've got things mapped out we can express those relationships in the zuul config while keeping everything serialized the way it is today
19:47 <clarkb> And once we're happy with the relationships in the zuul config we can swap over to the anchor jobs and the various locks
19:47 <clarkb> I don't expect this will be a quick change but it should make our deployments much happier
19:47 <corvus> i like 'anchor job' as a term
19:47 * fungi weighs anchor
19:47 <clarkb> Definitely let me know if you start looking at this. Otherwise I'll let others know when I've got something written down somewhere that we can look at :)
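
(A very rough sketch of the two layers of locking discussed above: an "anchor" job holds a buildset-wide semaphore while it pauses, and each child job takes a smaller per-build semaphore that bounds concurrency on bridge. Every job and semaphore name below is a placeholder; the real dependency graph still has to be mapped out.)

    - semaphore:
        name: infra-prod-deployment   # buildset-level lock, one deployment at a time
        max: 1

    - semaphore:
        name: infra-prod-playbook     # per-build lock, bounds concurrency on bridge
        max: 2

    - job:
        name: infra-prod-setup-src    # the "anchor" job: updates system-config, then pauses
        semaphore: infra-prod-deployment

    - job:
        name: infra-prod-service-example
        semaphore: infra-prod-playbook
        dependencies:
          - infra-prod-setup-src
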
19:48 <clarkb> Finally I think it is worth mentioning some of the "crazier" ideas that have come up:
19:48 <ianw> when in doubt, write a DSL
19:48 <clarkb> we could try mitogen
19:48 <clarkb> we could run the hourly buildsets less often, like every other hour or every third hour
19:48 <clarkb> I'm not a huge fan of reducing the hourly frequency since the things we want in there do want frequent updates (like zuul and nodepool)
19:49 <clarkb> But the mitogen idea (thanks pabelanger) is an interesting one
19:49 <clarkb> mitogen claims to significantly reduce the amount of python forking that ansible does
19:49 <clarkb> and we know that is a major cost in some of our playbooks
19:49 <clarkb> It is BSD licensed and installable from pypi
19:50 <clarkb> The risk would be that we need to rework our playbooks significantly to make it work (and maybe even downgrade ansible? it isn't clear to me if they work with newer ansible)
19:50 <corvus> [fwiw, i don't actually think it's the forking, it's the serialized task distribution algorithm]
19:50 <fungi> this? https://pypi.org/project/mitogen/
19:50 <clarkb> fungi: yes, it is a generic tool but installing that comes with an ansible plugin that you configure ansible to use
19:51 <clarkb> corvus: ya I think they are related because ansible doesn't pre-fork the worker processes anymore and instead forks a new process for each task that it grabs off the serialized queue?
19:51 <fungi> yeah, started digging in the git repo linked from there, seems like that's the one
19:51 <corvus> clarkb: yeah
19:51 <corvus> clarkb: it thinks it has an executor worker distribution model, but it doesn't
19:52 <clarkb> Anyway mitogen is a large subject. Paul offered to push a change and see if it would work in testing
19:52 <fungi> just interesting to see that mitogen hasn't released a new version in almost two years, but it does have git commits as recent as 5 months ago
19:52 <clarkb> I think it is worth entertaining if it shows an improvement in performance without requiring major refactoring
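
(For reference, the Mitogen extension for Ansible is enabled through a couple of lines in ansible.cfg per its documentation; the plugin path depends on where the package lands when installed, so treat the one below as an assumption.)

    [defaults]
    # Path varies with how/where mitogen is installed (pip, distro package, etc.)
    strategy_plugins = /usr/local/lib/python3.8/dist-packages/ansible_mitogen/plugins/strategy
    strategy = mitogen_linear
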
19:52 <clarkb> I don't expect us to say yay or nay here but wanted to call it out
19:53 <clarkb> Definitely the most important thing is improving the way we run the jobs in zuul in the first place
19:53 <fungi> oh, they uploaded some prereleases to pypi in january, okay
19:53 <corvus> and the playbooks themselves
19:53 <clarkb> Anything else on this subject or should I open up the floor?
19:54 <clarkb> #topic Open Discussion
19:54 * fungi checks that he's not standing on a trapdoor
19:54 <clarkb> Is there anything else?
19:55 <fungi> noticed today, possibly related to a missed change-merged event for a zuul config update, we're getting a dozen or two of these exceptions a day:
19:55 <fungi> #link http://paste.openstack.org/show/808156/
19:55 <fungi> it's possible this means we're missing a small but not insignificant number of gerrit events in zuul
19:55 <clarkb> After some lunch and a bike ride I can help try and understand the sequence of events there if that would be useful
19:56 <fungi> something to keep in mind if you end up needing to run down any similar weirdness
19:56 <clarkb> I think we want to see what sort of actions would be aborted when we get those exceptions
19:56 <clarkb> and determine if we can retry those requests or if we can avoid aborting important actions by running them earlier etc
19:56 <clarkb> but that discussion probably belongs in #zuul and I haven't pulled the code up yet :0
19:56 <clarkb> er :)
19:57 <corvus> is opendev's gerrit or network having issues?
19:57 <clarkb> corvus: we haven't seen any other reports
19:57 <corvus> maybe we're seeing this now because zuul<-internet->gerrit ?
19:57 <clarkb> it is a possibility
19:57 <corvus> fungi said 12/day
19:57 <fungi> 17 yesterday, 16 so far today when i looked
19:58 <clarkb> er I mean no reports by users or other issues. These exceptions are the only indication so far of an internet problem
19:58 <clarkb> that said it could definitely be an internet problem
19:58 <fungi> zuul is certainly hammering gerrit way harder than the typical user, so more likely to notice
19:59 <corvus> yeah, given that this is a TCP-level error, and we haven't changed that handling in zuul in ages, but we have recently moved gerrit farther from zuul in network topology, it seems there's a circumstantial fit
19:59 <corvus> if that is the case, then "retry harder" may be the best/only solution
19:59 <fungi> also not related, but paste reminded me just now, a user let us know last week that we stopped supporting the pastebinit tool when we switched lodgeit servers. apparently pastebinit hardcodes hostnames and our redirect from paste.openstack.org (which it knows about) to paste.opendev.org is breaking it. we're investigating options to be able to maybe exclude the pastebinit user agent
19:59 <fungi> from the redirect, in case anyone interested is following the meeting
20:00 <clarkb> fungi: ianw: it wasn't clear to me if someone was volunteering to update the change for that?
20:00 <clarkb> do we need someone to volunteer?
20:00 <fungi> i can update the change, i just need to find time
20:00 <clarkb> ok
20:00 <fungi> i've got a test server held from the earlier investigations
20:00 <clarkb> I was going to suggest we could land the redirect removal now if we want then follow up with the UA check
20:00 <clarkb> but if we have someone working on the UA check already then no need
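
(One possible shape for that user-agent exclusion, assuming the redirect lives in the Apache vhost for paste.openstack.org and can use mod_rewrite; this is a sketch, not the actual change.)

    RewriteEngine On
    # Let pastebinit, which hard-codes paste.openstack.org, keep hitting the
    # old name; redirect everything else to the new host.
    RewriteCond %{HTTP_USER_AGENT} !pastebinit [NC]
    RewriteRule ^/(.*)$ https://paste.opendev.org/$1 [R=301,L]
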
20:00 <clarkb> And we are at time.
20:01 <fungi> if "i've thought about what it looks like" counts as working on it ;)
20:01 <clarkb> Feel free to continue discussion in #opendev or on the mailing list, but we should call it here and let people get to lunch/breakfast/dinner etc :)
20:01 <clarkb> (I'm hungry)
20:01 <clarkb> thanks everyone!
20:01 <fungi> thanks clarkb!
20:01 <clarkb> #endmeeting
20:01 <opendevmeet> Meeting ended Tue Aug 17 20:01:29 2021 UTC.  Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4)
20:01 <opendevmeet> Minutes:        https://meetings.opendev.org/meetings/infra/2021/infra.2021-08-17-19.01.html
20:01 <opendevmeet> Minutes (text): https://meetings.opendev.org/meetings/infra/2021/infra.2021-08-17-19.01.txt
20:01 <opendevmeet> Log:            https://meetings.opendev.org/meetings/infra/2021/infra.2021-08-17-19.01.log.html
20:01 <ianw> oh, sorry, yes i will do that
