17:59:52 #startmeeting tc
17:59:52 Meeting started Tue Aug 22 17:59:52 2023 UTC and is due to finish in 60 minutes. The chair is knikolla. Information about MeetBot at http://wiki.debian.org/MeetBot.
17:59:52 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
17:59:52 The meeting name has been set to 'tc'
17:59:57 #topic Roll Call
17:59:59 o/
18:00:01 Hi all, welcome to the weekly meeting of the OpenStack Technical Committee
18:00:05 o/
18:00:05 A reminder that this meeting is held under the OpenInfra Code of Conduct available at https://openinfra.dev/legal/code-of-conduct
18:00:05 o/
18:00:07 you're early... _again_
18:00:09 Today's meeting agenda can be found at https://wiki.openstack.org/wiki/Meetings/TechnicalCommittee
18:00:11 o/
18:00:12 :D
18:00:13 We have no noted absences.
18:00:14 o/
18:00:19 o/
18:00:22 o/
18:00:24 I try to get the copy pasting done by 2.00pm :)
18:00:57 o/
18:01:48 #topic Follow up on past action items
18:01:56 No items noted to follow up on from the previous meeting.
18:02:30 #topic Gate health check
18:02:37 Any updates on the state of the gate?
18:02:43 definitely steady improvement
18:02:44 I know we haven't been in a good place the past few weeks.
18:02:49 it is much better this week
18:03:02 still seeing lots of volume-related failures as the largest single area by my estimate
18:03:09 Awesome! Great to hear of the improvement!
18:03:13 I know some work is going on in that realm, which is good to see
18:03:37 I've also started poking at Keystone database metrics and will hopefully have something to report back by next week.
18:03:47 ah, was just about to ask :)
18:03:48 thanks
18:03:56 I'm also looking at the neutron db still
18:04:01 ++
18:04:03 +1
18:04:18 we've reclaimed the full capacity of our rackspace quotas, which has helped tremendously
18:04:24 teamwork makes the gate work (i'll see myself out)
18:04:26 I spoke with ralonsoh and we identified our number one query, which we are trying to change now
18:04:30 i was watching over 600 builds running concurrently earlier today
18:04:55 just in time
18:05:36 o/
18:05:44 I guess I can add one other thing:
18:05:47 projects with broken job configurations are putting a strain on the collaboratory sysadmins though, we were just discussing in #opendev that we probably need to take a harder line on that
18:05:50 slaweq: there might be a lot of unnecessary or duplicate network creation happening in tests, feel free to ping me if there is. we just create a network resource for every test creds
18:06:00 but I am sure it can be optimized more
18:06:13 in recent weeks I talked about improving actual performance to help with gate performance, related to things like db queries in those projects
18:06:24 basically give a cutoff date where if projects haven't deleted branches with broken configuration we'll remove them from the tenant config and stop testing changes for all of their branches
18:06:38 and I've since put my money where my fingers are and have been squashing a bunch of nova lazy loads that have crept in over the years, which are unnecessary and should be avoided
18:06:45 gmann: so far I was rather looking at it from the neutron server PoV and trying to understand and hopefully optimize how it works
18:06:54 which will reduce database queries and load, and improve responsiveness in general
18:06:56 but later I may also try to look at e.g. tempest
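As an aside on the lazy-load point above: each lazy-loaded relationship issues an extra query the first time the attribute is touched, so loading the data eagerly up front collapses the classic N+1 query pattern into a single round trip. A minimal, generic SQLAlchemy sketch of the difference (illustrative only; the Instance/Fault models here are hypothetical and are not nova's actual object code):

```python
# Generic SQLAlchemy sketch of lazy vs. eager loading (hypothetical models,
# not nova's real schema), illustrating why squashing lazy loads cuts queries.
from sqlalchemy import ForeignKey, create_engine, select
from sqlalchemy.orm import (DeclarativeBase, Mapped, Session, joinedload,
                            mapped_column, relationship)


class Base(DeclarativeBase):
    pass


class Instance(Base):
    __tablename__ = "instances"
    id: Mapped[int] = mapped_column(primary_key=True)
    faults: Mapped[list["Fault"]] = relationship(back_populates="instance")


class Fault(Base):
    __tablename__ = "faults"
    id: Mapped[int] = mapped_column(primary_key=True)
    instance_id: Mapped[int] = mapped_column(ForeignKey("instances.id"))
    instance: Mapped[Instance] = relationship(back_populates="faults")


engine = create_engine("sqlite://", echo=True)  # echo=True prints every query
Base.metadata.create_all(engine)

with Session(engine) as session:
    # Lazy (the default): one query for the instances, plus one more per
    # instance the first time .faults is touched -- the N+1 pattern.
    for inst in session.scalars(select(Instance)):
        _ = inst.faults

    # Eager: a single joined query fetches instances and their faults together.
    stmt = select(Instance).options(joinedload(Instance.faults))
    for inst in session.scalars(stmt).unique():
        _ = inst.faults
```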
18:07:03 cleaning up branches through implementing unmaintained should hopefully help with that as well
18:07:04 slaweq: cool, thanks
18:07:35 well, unmaintained branches can also have the same issue of job config errors
18:07:51 Generally it looks like most of the offenders are the oldest branches though
18:07:58 I think that will add more issues when one project deletes a branch and another project is using it? maybe
18:08:00 it should at least take a chunk out and leave meaningful breaks
18:08:02 knikolla: i don't think the unmaintained plan directly solves broken job configs since it means projects will still by default have broken configuration on their abandoned branches for a year potentially before they're culled for lack of interest
18:08:05 yes, but fewer branches would qualify to even be there in the first place
18:08:20 hence fewer would be broken. not saying it would solve it, just help :)
18:08:36 also there are plenty of repos with broken configuration on maintained branches too
18:08:50 fungi: if I understand your proposal correctly, if stable/rocky of nova has any zuul config error and it is not fixed in time, then opendev will remove testing for all nova changes?
18:09:01 or just the stable/rocky one
18:09:06 maintained branches, those are more concerning and we should take a harder line. especially in this election cycle.
18:09:27 gmann: yes, all nova changes
18:09:34 gmann: based on chatter in #opendev
18:09:43 humm. not sure if that is a good idea
18:09:49 point being, broken configuration needs to be cleaned up. opendev sysadmins aren't going to bespoke delete random branches, we don't have time for that. openstack can eol and delete branches with broken job configs or the opendev sysadmins can remove those repositories from the zuul config
18:09:50 Pretty good stick to get us to prioritize fixing these errors
18:09:55 and they've let us know about them for a while
18:09:58 having our unmaintained branches also in that list, where we do not maintain them
18:10:23 or someone can fix the broken configs, but even the fixes for those in many cases sit unreviewed and unmerged
18:10:26 how about unmaintained branches? are those also in the list for this stick?
18:10:29 fungi: when you say "broken job configs", what do you mean exactly?
18:10:54 rosmaita: i mean anything that shows up in https://zuul.opendev.org/t/openstack/config-errors
18:11:00 thanks
18:11:08 basically we'd just remove all those repositories for you
18:11:21 fungi: but no results in that list for nova?
18:11:25 we started this a while back but did not finish https://etherpad.opendev.org/p/zuul-config-error-openstack
18:11:25 I'm not sure what the nova connection was
18:11:50 dansmith: there was no nova connection, this was about gate health more generally
18:11:58 fungi: are you considering unmaintained branches also in that list? for example, if unmaintained/train has some error, then you stop testing all nova branches?
18:12:02 fungi: okay, JayF said "all nova changes"
18:12:10 dansmith: was answering gmann's question about a hypothetical
18:12:13 And it was an example, is how it read to me
18:12:14 considering unmaintained/train is not maintained by the upstream team
18:12:26 gmann: zuul reads configuration from all branches of every repository
18:12:38 dansmith: I am taking nova as an example to understand the new rule
18:12:42 okay
18:12:45 fungi: can we override that to specific branches only?
18:12:57 ok so this is an issue then. we cannot commit to fixing the unmaintained branches, right
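For reference, the config-errors dashboard linked above is also exposed as JSON, so checking which repositories and branches are affected can be scripted. A rough sketch follows, with the caveat that the exact API path and payload shape are assumptions and should be verified against the API docs of the deployed Zuul version:

```python
# Rough sketch: summarize Zuul config errors per project/branch using the
# JSON behind https://zuul.opendev.org/t/openstack/config-errors.
# The endpoint path and the response layout (a list of error dicts, each
# with a "source_context" carrying project/branch) are assumptions here.
import collections

import requests

URL = "https://zuul.opendev.org/api/tenant/openstack/config-errors"

data = requests.get(URL, timeout=30).json()
# Tolerate either a bare list or a wrapper object with an "errors" key.
errors = data.get("errors", []) if isinstance(data, dict) else data

per_branch = collections.Counter()
for err in errors:
    ctx = err.get("source_context") or {}
    per_branch[(ctx.get("project"), ctx.get("branch"))] += 1

for (project, branch), count in sorted(per_branch.items(), key=str):
    print(f"{project} [{branch}]: {count} error(s)")
```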
18:12:59 we could maybe find a way to tell zuul to no longer read configuration from (and no longer test changes for) "unmaintained/.*" branches if that's what you're asking
18:13:17 frickler: I don't think so. It's come up that doing so might be a good zuul feature but no one has implemented it as far as I know
18:13:22 considering only supported branches in that stick rule will make sense
18:13:29 you can tell zuul to ignore taking action on specific branches but not ignore the config on branches
18:13:49 I don't think fungi is proposing a rule for governance reasons; I think he's reflecting a technical pain that these errors cause
18:14:01 gmann: if projects have unmaintained branches with broken zuul configuration in them, they could delete those branches so that their maintained branches continue to be tested
18:14:02 I don't know exactly what that pain is, but it's reasonable to ask us to fix them
18:14:02 fungi: yeah, that would be great, so that any job issue from there does not make that project a defaulter for not fixing it
18:14:14 I would also ask that zuul avoid breaking config changes in the future, too, because that was painful :)
18:14:31 Unless Zuul can be configured
18:14:42 well, we want to keep unmaintained branches open and not maintained by ourselves, so I am not sure how soon those zuul configs can be fixed by an external maintainer
18:15:04 gmann: it's already an implicit requirement in practice that we follow opendev requirements in CI
18:15:09 a simple fix is for openstack to just delete the zuul config in those branches
18:15:09 at least if it doesn't get fixed it goes away in the next cycle, rather than lingering as a zombie forever.
18:15:19 you don't have to maintain it, you have to make the error go away. It's a slightly different need
18:15:20 gmann: I would suggest that would extend to unmaintained/ and we'd retire a branch if it was a recurrent issue
18:15:24 JayF: the main one which is a problem at the moment was deprecated and announced over a year in advance of the backward-incompatible change to configuration parsing going into effect
18:15:45 JayF: we do not want to control/maintain the unmaintained CI or anything, right?
18:16:30 gmann: I personally don't, no, but the policy we just landed gives the PTL, and by extension the TC, power around delegating that
18:16:41 gmann: this makes it clear that a condition of that delegation is 'keep zuul configured properly'
18:17:00 on one hand we want unmaintained branches to be open/some-testing for external maintainers, and at the same time we are putting hard expectations on their maintenance. almost the same as supported branches
18:17:26 Nobody would prevent those unmaintained branches from merging a `git rm -r zuul.d/`
18:17:38 so it's a self-imposed requirement that can be removed
18:17:50 It's a different level of testing. As JayF mentioned.
18:17:56 the only concern we have is that our infrastructure remains happy, and fungi is reflecting we aren't doing a good job of that right now
18:18:22 but did we mention that expectation in the resolution? i feel it is coming across as a little extra hard stick for them
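On the `git rm -r zuul.d/` option mentioned above: the cleanup change for a broken branch is mechanical enough to script. A rough sketch of generating such a commit for one checked-out repository (the paths, commit message, and example repo/branch names are illustrative; the result would still be pushed as a normal review against the affected branch):

```python
# Rough sketch: strip in-repo Zuul configuration from a checked-out branch
# and create the cleanup commit. Paths follow the common Zuul conventions;
# the example repository/branch below are hypothetical.
import pathlib
import subprocess

ZUUL_CONFIG_PATHS = [".zuul.yaml", ".zuul.d", "zuul.yaml", "zuul.d"]


def make_cleanup_commit(repo: str, branch: str) -> None:
    repo_path = pathlib.Path(repo)
    subprocess.run(["git", "-C", repo, "checkout", branch], check=True)

    present = [p for p in ZUUL_CONFIG_PATHS if (repo_path / p).exists()]
    if not present:
        print(f"{repo} [{branch}]: no in-repo zuul config found")
        return

    subprocess.run(["git", "-C", repo, "rm", "-r", "--", *present], check=True)
    subprocess.run(
        ["git", "-C", repo, "commit", "-m",
         f"Remove broken Zuul configuration from {branch}"],
        check=True,
    )


if __name__ == "__main__":
    make_cleanup_commit("./python-exampleclient", "unmaintained/train")
```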
18:18:44 all those entries you see in the config-errors list that say "extra keys not allowed @ data['gate']['queue']" have been broken that way for years now, and there was a warning that it would break a year before it did
18:18:55 "At a minimum this needs to contain all integrated jobs, unit tests, pep8, and functional testing."
18:19:11 I am fine with the expectation to keep them green, but stopping a project's testing based on the state of unmaintained branches does not look good to me
18:19:15 "The CI for all branches must be in good standing at the time of opt-in."
18:19:31 So keeping Zuul configured is part of the reqs.
18:19:36 remember, those branches are already not being tested because zuul can't parse their configuration, so you're not going to make that any worse by removing the configuration in them
18:20:14 but also, nothing has merged to those branches, because (again) zuul can't parse their job configuration
18:20:29 there really aren't that many projects here that are in this boat,
18:20:33 I get the requirement, my concern is that if unmaintained branches go bad then it adds the risk of supported branches' testing being stopped. that is my concern.
18:20:40 or at least most of them are in a few projects
18:20:45 that's because I did a lot of cleanup already
18:20:48 IMO, it should be "stop testing the branch that has the zuul config error"
18:20:52 dansmith: yep, Ironic is a big offender and believe it or not that's after we already resolved literally dozens of them
18:20:58 so we might be able to get a bunch of these resolved by pointing some people there and seeing if there's any objection
18:20:58 there were tons more, frickler has done an amazing job trying to push to clean them up
18:21:06 https://review.opendev.org/q/topic:zuul-config-errors
18:21:09 instead of "stop testing all of a project's branches if any branch has an error"
18:21:13 I wasn't trying to minimize anything :)
18:21:21 and some other topics, I mixed them up a bit
18:21:42 the status of unmaintained should have zero impact whatsoever on stable branches.
18:22:04 exactly
18:22:14 but with this proposal it seems like it has a direct impact.
18:22:24 Whether the config of those works or not. If it doesn't, the branch gets deleted rather than the project being removed from zuul.
18:22:51 it should be like: if there is a zuul config error in an unmaintained/xyz branch and it is not fixed before the deadline, then stop testing it and then proceed with EOL.
18:22:53 as happy CI is a condition of opt-in or renewal.
18:23:03 yes, that's the policy.
18:23:32 during opt-in that is ok. but what about if it starts having config errors a month later?
18:23:37 I'm still concerned about the process of opt-in/opt-out
18:23:41 as well as timelines for that
18:23:56 @noonedeadpunk: that's the next item in the agenda.
18:24:03 as this is something that was mentioned in the decision but never explained
18:24:06 here the status of unmaintained branches at any time directly impacts supported branch testing
18:24:06 ++
18:24:50 #topic Documenting implementation processes for Unmaintained
18:25:00 considering we've already sort of switched topics.
18:25:19 fungi: any specific reason it cannot be done as deletion at the branch level (the branch having the config error) instead of all branches?
18:25:21 We merged the policy, the next step is documenting the opt-in and renewal process, and implementing the tooling necessary.
18:26:24 So updating the project-config-guide and defining the timelines as well.
18:26:30 I still feel taking action at the level of the branch having the error is more appropriate, especially as we have two sets of maintainers now with the unmaintained concept
18:27:06 gmann: because the opendev sysadmins don't have time to police branch configs for projects. we have a list of projects included in the tenant and can edit that list fairly easily
18:27:24 it's easy to add and remove a repository from the list
18:27:30 humm
18:29:16 anyways I still think it is a big risk, and if it happens to any project then it is a panic situation
18:29:31 i agree, config errors should be
18:29:55 gmann: I trust fungi and the other opendev sysadmins to make loud noise about this on mailing lists and other places before action is taken, giving the TC time to remove the branch as a last resort
18:30:09 they already have made loud noises without a stick; now they have to make loud noises with one :)
18:30:13 ++
18:30:31 zuul doesn't take backward-incompatible changes to configs lightly (as i said, the queue in pipeline change was announced a year before it was merged)
18:30:49 JayF: even with the risk of projects' testing being stopped? that does not sound like a good solution
18:31:12 We'll nuke the branch before we nuke the project.
18:31:14 stopping testing some projects is preferable to stopping testing all projects
18:31:15 I agree; the good solution is that we proactively fix or retire branches with zuul-config-errors :)
18:31:43 We can't not maintain something and not be OK with consequences from that lack of maintenance.
18:32:00 Which is realistically what this conversation is about; these are all branches that haven't had tests running in over a year
18:32:01 if we don't run tests on a project,
18:32:10 I'm not worried about retiring them; I'm worried about someone running them untested
18:32:13 then nothing can merge (or zuul won't merge anything) right?
18:32:16 or does it become open season?
18:32:17 exactly
18:32:21 nothing runs, nothing merges
18:32:28 dansmith: no merging since +2 Verified is a requirement to merge
18:32:28 openstack has a vested interest in making sure the people running the ci/cd system have the time to do that effectively
18:32:29 yeah, that is a good question
18:32:29 the branch is effectively dead when it's in a config-errored state
18:32:34 clarkb: ack
18:32:51 ok, so testing and merging things stops on master too, even if everything is fine there
18:32:57 if it defaults to blocked, then that seems like a reasonable "fix your stuff and then you can merge again" incentive
18:33:11 JayF: IIUC it's not the branch but the project that is effectively dead in such a case
18:33:43 on supported branches i am fine with that approach, but the unmaintained branch state impacting all the other supported branches as well as master branch development is not good
18:33:45 slaweq: Eh, I don't think that's true in all cases. python-ironicclient has old branches on the list; it's a sign those branches are not cared about and should be retired (the action I'll be taking with my PTL hat on unless someone fixes it)
18:33:50 it has to be independent
18:33:52 as long as a PTL can nuke an unmaintained release that isn't fixing its configs to unblock master, which I think we've covered, right?
18:34:40 maybe i can restate this better... the opendev sysadmins review changes that add and remove projects from the openstack tenant. the openstack maintainers decide what branches are still open and can fix or remove configuration problems in them. if the maintainers don't do that, the opendev sysadmins can remove those repositories from the tenant in order to keep the configuration clean
18:34:53 @gmann: if an unmaintained branch having a wrong config causes the maintained branches' CI to break, that is a sign the branch should be retired and we can expedite the process. We have the CI breaking occasionally for all kinds of reasons; it's annoying but it's not unfixable in an active manner.
18:35:13 JayF: ahh, ok, now I understand what you said earlier :)
18:35:41 and I agree with it
18:36:19 fungi: sorry if I missed this, but why does this matter at all? some zuul detail that hurts performance or something if there are any projects with config errors?
18:36:29 yes, we can do that, but that is extra monitoring and work to keep eyes on. my understanding is that we check unmaintained branches at opt-in time, and if all is good then we say OK and check again next cycle
18:36:46 like, I would expect zuul to just ignore those branches once they have an error without a lot of additional overhead, but I'm guessing it's not that simple?
18:36:59 if a config error happens in the middle of a cycle and the opendev process takes effect, then we are either at risk or doing extra work on unmaintained branch checks
18:37:13 That risk exists today with EM too.
18:37:29 dansmith: yeah, if that can be done it would be great, but agreed, it does not seem easy
18:37:34 dansmith: among other reasons, it makes it hard to identify new errors when nobody is bothering to fix the existing ones
18:38:28 fungi: I guess I don't know why it matters really, but okay
18:38:50 also it means zuul is indexing configuration on an ever-increasing number of broken branches which makes restarts/reconfigs take that much longer
18:39:01 I propose we continue the conversation with regard to implementing the processes from the policy that we merged.
18:39:06 indexing takes longer if there are config errors?
18:39:09 frickler and clarkb may have additional concerns beyond those
18:39:10 it creates unexpected behaviors in your testing, it makes it harder to debug real problems, and they tend to have knock-on effects where you get errors causing errors causing errors that are harder to untangle over time
18:39:34 obscuring more important errors is my main concern
18:39:52 i suppose the question is why do the opendev sysadmins care that openstack has broken job configs. we might not care quite as much if people didn't come to us asking for help with their job configs
18:40:02 so you don't care about error A today, and in six months you decide you don't care about error B. Then six months after that you get error C and now you have to fix all three because you care about C and it's much more difficult
18:40:08 easier if you just fixed A and B as they occurred
18:40:19 on a personal level I'm particularly frustrated with the errors that occur due to renaming projects
18:40:25 we could entertain not actually caring if openstack's job configuration is broken, and tell the openstack maintainers good luck, they're on their own figuring it out
18:40:56 renaming projects in gerrit requires a downtime, is unsupported by upstream, and is potentially dangerous. We do it anyway because people like names to align or not conflict etc, and then they don't even fix their zuul configs after we (opendev) do this major surgery on the system for them
18:41:25 I guess I'm just trying to figure out how this is materially different from people not fixing gate stability issues that compound over time, other than that there's a dashboard that lists these errors in a nice list
18:42:00 yeah, gate stability seems more important than syntax errors to me. it directly reflects the quality of OpenStack as software
18:42:06 that dashboard exists because previously when people wanted to know why their changes weren't being tested one of the sysadmins had to go trolling through service logs
18:42:21 can we make a policy to kick a project out if they do not fix gate stability?
18:42:38 now we still end up looking at that page for people when they don't see changes getting tested
18:43:18 anyway, I think I'm fine with a project no longer running tests on master if they have a broken branch if that will make everyone feel better and be a stronger sentinel to the owners of broken stuff
18:43:20 to be fair, a syntax error creates very stable gating. you can predict with 100% certainty that no changes will be merging on that branch
18:43:21 ya, I think the main difference is that we end up in the debug path immediately
18:43:33 and we're already providing you the information needed to get ahead of those problems
18:43:33 dansmith++
18:43:43 dansmith++
18:43:44 as long as a designated unmaintained owner can't block master indefinitely by not fixing things (i.e. the PTL can just nuke it)
18:44:24 not sure what the deadline from opendev will be, but it seems it could block master immediately depending on the deadline?
18:44:31 ++, I see the situation as not particularly different from EM. With the difference that a project team had to fix the branch in EM, whereas Unmaintained allows nuking the branch rather than fixing it.
18:44:43 yes, giving the ptl control over whether fixes will be merged or deleting the affected branch is absolutely the right way
18:44:46 so that is also an important thing to note/check. the deadline should give at least 1-2 cycles
18:44:55 knikolla: or nuking the .zuul in the meantime
18:45:07 ++, that too
18:45:15 gmann: this came up because we're wanting to switch the openstack zuul tenant to ansible 8 by default and that may create zuul config errors. The reason for this is ansible 6 is no longer supported and zuul added ansible 8 support recently. The timeline is going to be after the release though
18:45:33 we're discussing it in our meeting in 15 minutes and there will probably be email about that particular change sometime this week once we sort out some details
18:45:35 yeah, removing the .zuul config until they fix it can be better than deleting the whole branch
18:45:39 and with lots of warning and opportunity for projects to check if it will cause problems for them
18:45:46 it will give time for any external maintainer to come forward and fix it
18:46:03 This was very helpful in bringing to our attention a concern from the OpenDev team that we hadn't prioritized before, and we will be prioritizing moving forward.
18:46:06 clarkb: ack
18:46:13 clarkb: well, that for sure will create some issues with jobs. As that contains openstack.cloud collection 2.0 already, doesn't it?
18:46:23 noonedeadpunk: I have no idea
18:46:27 yeah, just nuking .zuul and then giving someone until the end of the cycle to revert/fix or delete the branch seems fine to me
18:46:37 but you can test it today on a per-job basis by setting the ansible version on the job
18:46:48 ++, this seems like a good tradeoff
18:46:49 dansmith++
18:46:56 ++dansmith: that seems like a good approach and could even be automated.
18:47:03 which has quite different inputs/outputs, so anything that uses openstack.cloud>2.0 is barely compatible with content that was written for 1.0
18:47:06 knikolla: I'd prefer we just document it
18:47:13 let's document this so that PTLs know about this way out
18:47:16 dansmith: yeah
18:47:23 knikolla: like we document that any nova patch that merged without sufficient core review has a fast-revert escape hatch
18:47:27 just document it as
18:47:33 "this is how to get out of jail"
18:47:53 noonedeadpunk: jobs relying on that are probably invoking their own nested ansible on job nodes anyway, we're talking about the ansible version run on the executor
18:48:00 Makes sense
18:48:09 Scripting is more fun than writing, haha. But yes, I'll make a note to add that to the PTL docs.
18:48:28 knikolla: well, write a script someone can run to generate the delete commit or something, sure
18:48:39 we can use it to submit patches against this list right now :)
18:48:50 knikolla: also in the unmaintained branches doc, that can help the PTL know what action to take on unmaintained branches if this happens
18:49:15 btw, the zuul config-errors list is retrievable in json from its rest api if that helps
18:49:36 One thing I want to spend at least some time talking about today is the opt-in process for Unmaintained.
18:49:41 fungi: do we have a doc link for the opendev policy which we can link in the OpenStack docs to communicate it to PTLs/the community along with the required actions?
18:49:43 fungi: well, given that post jobs (like upload logs to swift) and things like that are well-tested in zuul - then it might be fine
18:49:52 fungi: or are you going to draft one?
18:49:57 What would be the right place to implement that in?
18:50:14 gmann: there is no policy yet, we just started discussing it in irc a few minutes ago
18:50:15 or well, in most cases, as indeed actions against openstack are usually done in nested envs, except maybe pre/post steps
18:50:24 fungi: ok, noted
18:50:27 knikolla: I assume we delete branches via commits in releases or something, right?
18:50:50 if so, generate the commit and let someone -1 it to volunteer :)
18:51:11 also iirc ansible 8.0 requires >=python3.10 to run
18:51:42 That might work: give a few weeks' time for someone to -1 a patch, and if not, delete.
18:51:43 I suspect if we let the "fix it or your CI is turned off" message hit the list with a date, we'd see the list pare down even further. No need for the TC to take direct/scripted action (yet) IMO.
18:51:45 noonedeadpunk: it's already running on python 3.11, so that's fine (our executors are all 3.11)
18:51:54 sweet :)
18:53:01 noonedeadpunk: the specific concern with ansible 8 is if there are job playbooks/roles whose syntax isn't valid in newer ansible
18:54:23 Last 5 minutes
18:54:25 #topic Reviews and Open Discussion
18:54:27 nah, I'd say it's the least of my concerns - the code changes between core 2.10 and core 2.15 are rather minimal
18:54:40 it's mostly collections that bring the pain
18:56:11 I'll note for open discussion: if you're in your 11th month as a TC member and are planning on continuing to serve, please ensure you re-nominate yourself.
18:56:21 If you aren't planning on continuing to serve, please help recruit :)
18:56:37 ++
18:56:39 ++, also encourage other members to run
18:56:52 I guess all chairs have submitted. So we at least won't lack one
18:57:09 But helping to recruit won't hurt for sure
18:57:10 yeah, we have 4 seats and 4 candidacies for now
18:57:24 as it's always a good thing to do
18:57:30 more and more members running in the election is good for the long term
18:57:41 ++
18:57:55 It would be amazing to have to go back to running elections :)
18:58:10 also we can encourage PTLs, existing or new ones, to send nominations before the deadline, at least the ones we know
18:58:15 we actually were running one a year ago ;)
18:58:31 and we have a week to spare
18:58:31 Also it really helps if you post to the ML, not just make your commit
18:59:02 sure
18:59:08 Alright, thanks all!
18:59:16 #endmeeting