18:00:44 #startmeeting tc
18:00:44 Meeting started Tue Feb 20 18:00:44 2024 UTC and is due to finish in 60 minutes. The chair is JayF. Information about MeetBot at http://wiki.debian.org/MeetBot.
18:00:44 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
18:00:44 The meeting name has been set to 'tc'
18:00:55 Welcome to the weekly meeting of the OpenStack Technical Committee. A reminder that this meeting is held under the OpenInfra Code of Conduct available at https://openinfra.dev/legal/code-of-conduct.
18:00:55 Today's meeting agenda can be found at https://wiki.openstack.org/wiki/Meetings/TechnicalCommittee.
18:01:01 #topic Roll Call
18:01:03 o/
18:01:04 o/
18:01:05 o/
18:01:07 o/
18:01:09 \0
18:01:11 o/
18:01:17 O/
18:01:18 o/
18:01:32 #info One indicated absence on the agenda: jamespage
18:01:40 And everyone else is here!
18:02:01 #topic Followup on Action Items
18:02:13 rosmaita appears to have sent the email as requested last meeting, thanks for that
18:02:24 ;)
18:02:29 Any comments about that action item to send an email inviting people to the global unmaintained group?
18:02:44 moving on
18:02:45 #topic Gate Health Check
18:02:47 the promised followup from tonyb and elodilles is still pending
18:02:56 ack
18:03:10 #action JayF to reach out to Tony and Elod about follow-up to unmaintained group email
18:03:14 I'll reach out to them
18:03:23 Anything on the gate?
18:03:29 nova just merged a change to our lvm-backed job to bump swap to 8G. I know that's sort of controversial because "we'll just consume the 8G now", but
18:03:45 we've spent a LOT of time looking at OOM issues
18:03:57 I think we should consider just bumping the default on all jobs to 8G
18:04:23 we've also got the zswap bit up and available and I think we should consider testing that in more jobs as well
18:04:26 note that will make jobs run slower as we can no longer use fallocated sparse files for swap files
18:04:34 I think Ironic is one of the most sensitive projects to I/O performance, so I should see if it impacts our jobs, as a bit of a canary.
18:04:43 speaking of, JayF I assume ironic is pretty tight on ram?
18:04:47 (Ironic CI, I mean)
18:04:52 the zuul update over the weekend dropped ansible 6 support. i haven't seen anyone mention being impacted (it's not our default ansible version)
18:04:55 JayF: see how zswap impacts you, you mean?
18:05:22 more swap usage generally
18:05:25 okay
18:05:26 IIRC we tried that some time ago and yes, it helps with OOM but indeed, as clarkb said, it makes many jobs slower and they time out more often
18:05:26 I suspect zswap would help in all cases
18:05:28 dansmith: which jobs are running zswap now?
18:05:40 rosmaita: I think just nova-next
18:05:44 ok
18:05:52 slaweq: which zswap?
18:06:06 er, s/which/which,/
18:06:16 no, with swap bump to 8GB or something like that
18:06:33 ah, well, a number of our jobs are already running 8G, like the ceph one because there's no way it could run without it
18:06:56 dansmith: https://zuul.opendev.org/t/openstack/build/fec0dc577f204ea09cf3ee37bec9183f is an example Ironic job if you know exactly what to look for w/r/t memory usage; it's a good question I can't quickly answer during the meeting
18:07:06 zswap will not only make more swap available, but it also sort of throttles the IO, which is why I'm interested in ironic's experience with it
18:07:19 zswap will also lessen the actual I/O hitting the disk
18:07:23 I think monitoring nova-next first will be good to see how slow it will be. I am worried about doing it for slow test/multinode test jobs
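(For context on the zswap option being discussed: zswap keeps a compressed, in-RAM cache in front of the swap device, so pages that would otherwise be written to disk are compressed in memory first, trading CPU for I/O. A minimal sketch for checking whether zswap is active on a job node and dumping its counters follows; the sysfs/debugfs paths are the conventional locations on recent Linux kernels, but exact stat names vary by kernel version and reading debugfs normally requires root.)

```python
#!/usr/bin/env python3
"""Dump zswap configuration and runtime stats from sysfs/debugfs."""
from pathlib import Path

PARAMS = Path("/sys/module/zswap/parameters")  # enabled, compressor, zpool, ...
STATS = Path("/sys/kernel/debug/zswap")        # stored_pages, pool_total_size, ...


def dump(directory: Path) -> None:
    """Print every attribute in the directory, tolerating missing/denied paths."""
    try:
        entries = sorted(directory.iterdir())
    except OSError as exc:
        print(f"{directory}: not readable ({exc})")
        return
    for entry in entries:
        try:
            print(f"{entry.name:>24}: {entry.read_text().strip()}")
        except OSError as exc:
            print(f"{entry.name:>24}: <unreadable: {exc}>")


if __name__ == "__main__":
    dump(PARAMS)
    dump(STATS)
```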
18:07:23 with zswap it may be better indeed
18:07:32 at the expense of cpu i guess?
18:07:34 we could also experiment with zram, I suspect
18:07:41 gmann: doing which, 8G or zswap?
18:07:42 fungi: yep, but I/O is the long pole in every Ironic job I've seen so far
18:07:51 dansmith: 8 gb
18:08:00 makes sense. i agree i/o bandwidth seems to be the scarcest resource in most jobs
18:08:01 fungi: yes, but we're starting to think that IO bandwidth is the source of a lot of our weird failures that don't seem to make sense
18:08:09 gmann: okay
18:08:31 which is made worse by noisy neighbors swapping
18:08:45 which in many cases are also our jobs ;)
18:08:54 So if we can reduce actual disk I/O through things like zswap (I'm really curious about zram, too), it should be an improvement
18:09:10 clarkb: I'm totally open to other suggestions, but a lot of us have spent a lot of time debugging these failures lately,
18:09:35 including guest kernel crashes that seem to never manifest on anything locally, even with memory pressure applied
18:10:15 honestly, I'm pretty burned out on this stuff
18:10:21 dansmith: I haven't looked recently but every time I have in the past there is at least one openstack service that is using some unreasonable (per my opinion) amount of memory
18:10:28 the last one I saw doing that was privsep
18:10:28 and trying to care when it seems other people don't, or just want to say "no it's not that"
18:10:51 and ya I totally get the others not caring making it difficult to care yourself
18:11:23 I think it would be useful for openstack to actually do a deeper analysis of memory utilization (I think bloomberg built a tool for this?) and determine if openstack needs a diet
18:11:42 if not then proceed with adjusting the test env. But I strongly suspect that we can improve the software
18:11:58 memray is the bloomberg tool
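(memray, the Bloomberg profiler named just above, can either wrap a whole process from the CLI with `memray run` or be scoped to one code path via its Tracker context manager. A minimal sketch of the latter, where `suspect_workload` is a stand-in for whatever service code is under suspicion; the capture file can then be rendered with `memray flamegraph memory-profile.bin`.)

```python
# Capture an allocation profile for one code path with memray.
# suspect_workload() is a placeholder for the code being investigated,
# e.g. a privsep call path or an API request handler.
import memray


def suspect_workload() -> int:
    data = [b"x" * 1024 for _ in range(10_000)]  # stand-in allocations
    return len(data)


if __name__ == "__main__":
    with memray.Tracker("memory-profile.bin"):
        suspect_workload()
```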
18:11:59 clarkb: I think the overriding question is: who is going to do this work? Where does it fall in priority amongst the dozens of high priority things we're working on as well?
18:12:24 JayF: I mean priority wise I would say issues affecting most projects in a continuous fashion should go right to the top
18:12:25 The real resources issue isn't in CI; it's not having enough humans with the ability to dig deep and fix CI issues to tag-out those who have been doing that for a long time
18:12:26 JayF: right, I think the implication is that we haven't tried dieting, and I resent said implication :)
18:12:30 I don't know how to find volunteers
18:12:42 JayF: we're all trying to make changes we think will be the best bang for the buck
18:12:46 yeah, exactly
18:12:59 because those of us that sell this software have a definite interest in it being efficient and high-quality
18:13:11 so anyone acting like we don't care about those things is a real turn-off
18:13:14 dansmith: Can you take an action to email a summary of the memory pressure stuff to the list, and include a call for help in it? Especially if we have a single project (e.g. privsep?) that we can point a finger at
18:13:49 JayF: TBH, I don't want to do that because I feel like I'm the one that has _been_ doing that and I'm at 110% burnout level
18:14:11 In the meantime I think it's exceedingly reasonable to adjust the swap size, look into environmental helps such as zswap
18:14:12 (not necessarily via email lately, it's been a year or so)
18:14:25 Is there any TC member with intimate knowledge of the CI issues willing to send such an email?
18:14:51 I can take a stab at that memory usage profiling. I'm not planning to run for reelection to the TC this round, and doing a deep technical dive on something sounds like something I'm itching for to try to cure burnout.
18:14:54 my main concern with going the swap route is that I'm fairly confident that the more we swap the worse overall reliability gets due to noisy neighbor phenomena. I agree that zswap is worth trying to see if we can mitigate some of that
18:15:15 and I'll add, because I'm feeling pretty defensive and annoyed right now and so I shouldn't be writing such emails.. sorry, and I hate to not volunteer, but.. I'm just not the right person to do that right this moment.
18:16:00 we should have this discussion at the PTG and take time to gather some data ... i have seen a bunch of OOMs, but don't see an obvious culprit
18:16:02 clarkb: the other side of that is also: if we can get the fire out in the short term it should free up folks to do deeper dives. There's a balance there and we don't always find it, I know, but it's clear folks are trying -- at least the folks engaged enough to be at a TC meeting.
18:16:09 I think we also have the problem of a 1000 cuts rather than one big issue we can address directly
18:16:17 which makes the coordination and volunteer problem that much more difficult
18:16:29 i also get the concern with increasing available memory/swap. openstack is like a gas that expands to fill whatever container you put it in, so once there's more resources available on a node people will quickly merge changes that require it all, plus a bit more, and we're back to not having enough available resources for everything
18:16:41 clarkb: that's exactly why this is hard, but you also shouldn't assume that those of us working on this have not been looking for cuts to make
18:16:56 dansmith: I don't think I am
18:17:07 if you remember, I had some stats stuff to work on performance.json, but it's shocking how often the numbers from one (identical, supposedly) run to the next vary by a lot
18:17:15 fungi I think that what you wrote applies to any software, not only OpenStack :)
18:17:57 fungi: I also think it's worth noting that even non-AIO production deployments have double-digit memory requirements these days
18:18:15 many of us are trying/have tried our best on those things and the reality is that we hardly get enough support on debugging/solving the issues from the project side. very few projects worry about gate stability and most are just happy with rechecksssss
18:18:15 my suggestion is more that this should be a priority for all of openstack and project teams should look into it, not that the few who have been poking at it are not doing anything
18:18:45 In lieu of other volunteers to send the email, I'll send it and ask for help.
18:19:02 #action JayF to email ML with a summary of gate ram issues and do a general call for assistance
18:19:31 gmann we should introduce some quota of rechecks for teams :)
18:19:35 I remember we have sent a lot of emails on those. A few of those were regular emails with explicit [gate stability] etc. but I did not see much outcome from that
18:19:41 dansmith: related, we've discovered that the memory leaks affect our inmotion/openmetal cloud
18:19:44 slaweq: ++ :)
18:20:06 gmann: I know, and I have very minor faith it will do something, but if I'm able to make a good case while also talking to people privately, like I did for Eventlet, maybe I can shake some new resources free
18:20:26 we had to remove nova compute services from at least one node in order to ensure things stop trying to schedule there because there is no memory
18:20:37 sure, that will be great if we get help
18:20:43 and we have like 128GB of memory or something like that
18:21:21 I can try to find some time and add to my "rechecks" script a way to get the reasons for rechecks and then do some analysis of the reasons - maybe there will be some pattern there
18:21:30 but I can't say it will be for next week
18:22:03 I will probably need some time to finish some internal stuff and find cycles for that there
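(A rough sketch of the kind of recheck-reason aggregation offered above: pull change messages from Gerrit's REST API and tally whatever text follows a bare "recheck" comment. The host, query, and output limits here are illustrative assumptions, not the actual "rechecks" script; opendev's Gerrit prefixes JSON responses with a ")]}'" guard line that has to be stripped.)

```python
# Tally the free-text reasons people give in "recheck <reason>" comments.
import collections
import json
import urllib.parse
import urllib.request

GERRIT = "https://review.opendev.org"
QUERY = "project:openstack/nova status:merged -age:30d"  # example query


def gerrit_get(path: str):
    """GET a Gerrit REST path and strip the XSSI guard line."""
    with urllib.request.urlopen(f"{GERRIT}{path}") as resp:
        raw = resp.read().decode("utf-8")
    return json.loads(raw.split("\n", 1)[1])


def recheck_reasons() -> collections.Counter:
    reasons = collections.Counter()
    changes = gerrit_get(f"/changes/?q={urllib.parse.quote(QUERY)}&o=MESSAGES")
    for change in changes:
        for msg in change.get("messages", []):
            for line in msg.get("message", "").splitlines():
                line = line.strip()
                if line.lower().startswith("recheck"):
                    reason = line[len("recheck"):].strip() or "<no reason given>"
                    reasons[reason] += 1
    return reasons


if __name__ == "__main__":
    for reason, count in recheck_reasons().most_common(20):
        print(f"{count:4d}  {reason}")
```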
18:22:07 I'm going to quickly summarize the discussion here to make sure it's all been marked down:
18:22:29 knikolla offered to do some deep dive memory profiling. Perhaps he'd be a good person for clarkb/infra people to reach out to if/when they find another memory-leaking monster.
18:22:42 slaweq offered to aggregate data on recheck comments and look for patterns
18:23:02 JayF (I) will be sending an email/business case to the list about the help we need getting OpenStack to run in less ram
18:23:19 and meanwhile, nova jobs are going to bump to 8G swap throughout, and begin testing zswap
18:23:30 Did I miss or misrepresent anything in terms of concrete actions?
18:23:35 that last one is not quite accurate
18:23:45 What's a more accurate way to restate it?
18:23:54 we probably won't mess with the standard templates, but we might move more of our custom jobs to 8G
18:24:05 like, the ones we inherit from the standard template we'll leave alone
18:24:12 ack; so nova jobs are going to move to 8G swap as-needed and begin testing zswap
18:24:23 but, if gmann is up for testing zswap on some standard configs, that'd be another good action
18:24:51 I would be very curious to see how that looks in an Ironic job, if you all get to the point where it's easy to enable/depends-on a change I can vet it from that perspective
18:24:52 sure, we do have heavy jobs in periodic or in the tempest gate and we can try that
18:25:11 JayF: it's available, I'll link you later
18:25:16 ack; thanks
18:25:39 I'm going to move on for now, it sounds like we have some actions to try, one of which will possibly attempt to start a larger effort.
18:25:54 Is there anything additional on the gate health topic?
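(Complementary to the deeper profiling work summarized above, a quick census of which processes hold the most resident memory on a CI node can be taken with psutil; a hedged sketch, grouping by process name. Python-based services will mostly show up as "python3", so grouping by command line instead may be more informative for OpenStack agents.)

```python
# Sum resident set size per process name to spot memory-heavy services.
import collections
import psutil


def rss_by_name() -> collections.Counter:
    totals = collections.Counter()
    for proc in psutil.process_iter(["name", "memory_info"]):
        mem = proc.info.get("memory_info")
        if mem is None:  # access denied or process vanished
            continue
        totals[proc.info.get("name") or "?"] += mem.rss
    return totals


if __name__ == "__main__":
    for name, rss in rss_by_name().most_common(15):
        print(f"{rss / (1024 * 1024):8.1f} MiB  {name}")
```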
18:26:37 #topic Implementation of Unmaintained branch statuses
18:26:47 Is there an update on the effort to unmaintain things? :DD
18:27:20 don't think so ... fungi mentioned that last week's email about the global group helped
18:27:27 yoga is quite far along, doing another round of branch deletions right now
18:27:32 ack; and I'll reach out to Elod/Tony about them refreshing that
18:27:55 awesome
18:27:56 frickler: ack, thanks for that
18:28:11 Is there anything the TC needs to address or assist with in this transition that's not already assigned?
18:28:29 some PTLs did not respond at all
18:29:14 we did some overrides for the train + ussuri eols already, may need to do that for yoga, too
18:29:28 if no PTLs respond, that means they get the default behavior: that project is no longer responsible and it passes to the global-unmaintained-core (?) group, right?
18:30:27 well currently the automation requires either a ptl response for release patches or an override from the release team
18:30:40 we can go with the policy, keep it open for a month or so and then EOL
18:31:04 to eol someone would need to create another patch
18:31:12 gmann: AIUI, we already have volunteers to unmaintain these projects -- someone volunteered at a top level to maintain everything down from Victoria
18:31:40 frickler: ack, do you want me as a TC rep to vote as PTL in those cases? Or is there a technical change we need to make?
18:32:07 JayF: let me discuss this with the release team and get back to you
18:32:12 thanks for turning unmaintain into a verb JayF
18:32:38 I've been doing it in private messages/irl conversations for months; I'm surprised this is the first leak to an official place :D
18:33:08 i had hoped that the phrasing of the unmaintained branch policy made it clear that the tc didn't have to adjudicate individual project branch choices
18:33:11 JayF: you mean the global group? or a specific volunteer for specific projects?
18:34:09 gmann: I recall an email to the mailing list, some group volunteered to maintain the older branches back to victoria, I'm having trouble finding the citation now, a sec
18:34:49 well the gerrit group currently is elodilles + tonyb, I don't remember anyone else
18:35:25 elodilles volunteered to maintain back to victoria
18:35:31 yeah, I also do not remember if anyone else volunteered / is interested in keeping them alive even with no response from the PTL/team
18:36:00 rosmaita: ack; I'll try to find that email specifically
18:36:03 rosmaita: ok, even if there is no response/confirmation from the PTL on whether anyone needs those or not
18:36:19 fungi: this is about moving from stable to unmaintained, so not quite yet covered by the unmaintained policy. or only halfway maybe?
18:36:47 might need to be specified more clearly and communicated to the release team, too?
18:36:47 JayF: not sure there was an email, may be in the openstack-tc or openstack-release logs
18:37:05 anyway, I think actual cases will be known when we go into the explicit OPT-IN time
18:37:21 #action JayF Find something written about who volunteered to keep things in un-maintained status back to Victoria
18:37:24 at that time we will see more branches filter out and move to EOL
18:37:29 I'll do it offline, and reach out to elod if it was them (I think it was)
18:39:37 frickler: ack; I think you're right we may have a gap
18:39:56 https://governance.openstack.org/tc/resolutions/20230724-unmaintained-branches.html doesn't mention the normal supported->unmaintained flow
18:40:14 oh, wait, here it is
18:40:16 > By default, only the latest eligible Unmaintained branch is kept. When a new branch is eligible, the Unmaintained branch liaison must opt-in to keep all previous branches active.
18:40:37 So every project gets one unmaintained branch for free; keeping it alive after the 1 year is what has to be opted into
18:41:08 and i think the latest unmaintained branch also needs a +1 from the PTL/liaison, otherwise it will go to EOL after a month or so
18:41:20 we have that also in the policy/doc somewhere
18:41:24 gmann: that is the opposite of the policy
18:41:26 > The PTL or Unmaintained branch liaison are allowed to delete an Unmaintained branch early, before its scheduled branch deletion.
18:41:35 so you can opt into getting an early delete
18:41:50 nothing in the policy, as written, permits us to close it early without the PTL or Liaison opt-in
18:41:58 yeah, I am not talking about deletion. I mean those will not always be opted in automatically
18:42:36 let me find where we wrote those cases
18:43:23 'The patch to EOL the Unmaintained branch will be merged no earlier than one month after its proposal.'
18:43:33 ah, this one it was: to keep those open for a month
18:43:40 https://docs.openstack.org/project-team-guide/stable-branches.html#unmaintained
18:44:06 that reflects my interpretation of the policy ++
18:44:06 so we can close them after one month and no response
18:44:14 only if it's not the *latest* UM branch though
18:44:59 from the top of that:
18:45:00 > By default, only the latest eligible Unmaintained branch is kept open. To prevent an Unmaintained branch from automatically transitioning to End of Life once a newer eligible branch enters the status, the Unmaintained branch liaison must manually opt-in as described below for each branch.
18:45:33 So I think that's reasonably clear. Is there any question or dispute?
18:46:10 one question
18:47:13 so in the current case, should we apply this policy only to yoga, the latest unmaintained, or to xena through victoria also? as we agreed to make victoria onwards unmaintained as the initial set
18:47:16 I think what applies to pre-yoga is in the resolution, not the project team guide: https://governance.openstack.org/tc/resolutions/20230724-unmaintained-branches.html#transition
18:48:06 Yeah I would think anything older than the release in question should have already been EOL or be included in this
18:48:16 so all 4 v->y will be kept until they are explicitly proposed as EOL
18:48:32 also a note that the policy is nice, but afaict a lot of that still needs implementation / automation
18:49:28 yeah, and that is a lot of work. now i have second thoughts about that and think we could have kept only stable/yoga moving to unmaintained and EOLed all the previous ones
18:49:52 Well, changing our course after announcing, updating documentation, and setting user expectations is pretty expensive, too :/
18:50:20 yeah, I agree not to change now but I feel we added a lot of work for the release team
18:50:22 I'm going to move on, we have a lot of ground to cover and 10m left
18:50:45 frickler: If you can document what work needs to be done (still), it'd be helpful, I know this is a form of offloading work to the releases team and I apologize for that :/
18:50:55 #topic Testing Runtime for 2024.2 release
18:51:01 #link https://review.opendev.org/c/openstack/governance/+/908862
18:51:10 I need to check other review comments/discussion in gerrit but we can discuss py3.12 testing.
18:51:12 This has extremely lively discussion in gerrit; I'd suggest we keep most of the debate in gerrit.
18:51:37 I am not very confident about adding it as voting in the next cycle, considering there might be more breaking changes in py3.12 and we have not had any cycle testing it as non-voting so that we know the results before we make it voting.
18:51:49 agree with frickler's comment in gerrit on that
18:52:15 yes, another case of "policy needs to be backed by code"
18:52:49 Well, the eventlet blocker that caused this to be completely undoable in most repos is gone now
18:52:55 are there further blockers to turning on that -nv testing?
18:53:35 no distro with py3.12 as default
18:53:45 we never tested it as -nv right? so we do not know what other blockers we have
18:53:52 that at least blocks devstack testing afaict
18:53:55 gmann: that's sorta what I was thinking, too
18:53:56 blocker/how-much-work
18:54:07 frickler: ack
18:54:19 doing tox testing should be relatively easy
18:54:24 tonyb was going to look into the stow support in the ensure-python role as an option for getting 3.12 testing on a distro where it's not part of their usual package set
18:54:35 we usually add nv at least one cycle in advance before we make it voting
18:55:22 I think the right path is: 1. add -nv in 2024.2; 2. based on results, discuss making it voting in 2025.1
18:55:27 I'm going to move on from this topic and suggest all tc-members review 908862 intensively with "how are we going to do this?" in mind as well. We have other topics to hit and I do not want us to get too deep on a topic where most of the debate is already async.
18:56:00 I'm skipping our usual TC Tracker topic for time.
18:56:10 #topic Open Discussion and Reviews
18:56:30 I wanted to raise https://review.opendev.org/c/openstack/governance/+/908880
18:56:34 and the general question of Murano
18:56:55 IMO, we made a mistake in not marking Murano inactive before C-2.
18:57:32 Now the question is whether the better path forward is to release it in a not-great state, or violate our promises to users/release team/etc that we determine what's in the release by C-2.
18:58:44 frickler: I saw your -1 on 908880, I'm curious what you think the best path forward is?
18:58:46 replied in gerrit on the release team deadline
18:59:06 i suspect from the release team point of view, it's ok to drop, but not add?
18:59:20 I am strongly biased towards whatever action takes the least human effort so we can free up time to work on projects that are active.
18:59:25 releasing broken source code is more dangerous and marking them inactive without any new release is less so
18:59:30 From previous experience I can say that if we send an email to the community saying that we are planning not to include it in the release, it is more likely that new volunteers will step up to help with the project
18:59:44 slaweq: ++
18:59:57 so I would say - send an email to warn about it and do it at any time if the project doesn't meet the criteria to be included in the release
19:00:17 we did in the murano case and we found a volunteer, let's see how they progress on this
19:00:21 if nobody raises a hand, then maybe nobody really needs it
19:00:33 if we do that, why do we have the C-2 cutoff point at all?
19:00:41 but this is not just the murano case; it can be any project going inactive after m-2, so we should cover that in our process
19:00:43 if there is a volunteer I would keep it and give people a chance to fix stuff
19:01:14 frickler: we do mark any inactive one as soon as we detect it before m-2, but what if we detect it or it becomes inactive after m-2?
19:01:16 frickler maybe we should remove that cutoff
19:01:26 frickler: I honestly don't know. That's why I asked for a release team perspective. I suspect it was meant more to be a process of "we mark you inactive in $-1, you have until $-2 to fix it" and instead we are marking them late
19:01:29 so I'd say mark inactive but still keep it in the release
19:01:36 our best effort will always be to check before m-2
19:01:49 if a project isn't in good shape before milestone 2, then it should be marked inactive at that time. if it's in good shape through milestone 2 then releasing it in that state seems fairly reasonable
19:02:12 release even with a broken gate/code?
19:02:13 We should finish this debate in gerrit, and we're over time.
19:02:14 frickler what if a project has e.g. broken gates and it happened after m-2? How would we then keep it in the release?
19:02:34 It's a hard decision with downsides in every direction :(
19:02:37 slaweq: exactly
19:02:44 IMHO we should write there something like each case will be discussed individually by the TC before making a final decision
19:03:01 it's hard to describe all potential cases in the policy
19:03:08 I think that'd be a good revision to 908880
19:03:17 but still doesn't settle the specific case here
19:03:23 but we need to let folks go to their other meetings
19:03:24 #endmeeting