Tuesday, 2022-04-05

*** rlandy is now known as rlandy|out00:50
*** icey_ is now known as icey06:59
*** jpena|off is now known as jpena07:37
*** ysandeep is now known as ysandeep|lunch08:28
*** ysandeep|lunch is now known as ysandeep09:00
*** rlandy|out is now known as rlandy10:27
*** whoami-rajat__ is now known as whoami-rajat12:01
noonedeadpunkhey there! We noticed changed zuul behaviour regarding queues, that invalidates check results and kind of wastes CI resources for us13:07
noonedeadpunkhttps://review.opendev.org/c/openstack/openstack-ansible/+/836378 as example13:07
noonedeadpunkI tried to check for options and how to set up queues, but I'm not sure I understand the consequences.13:07
noonedeadpunkDoesn't having the same change queue for all repos (projects) we manage mean that all patches, even if they don't have Depends-On, would not run in parallel since they are queued one after another?13:09
funginoonedeadpunk: yes, shared queues in dependent pipelines take all changes ahead of the current change into account when testing. they're still tested in parallel, but their testing has to be reset if there's a failure for a change ahead of the current change so it can be removed from the checkout13:24
fungithat's not a change in behavior, it's how zuul has basically always worked for the past 10 years since it was first conceived13:24
fungiyou use shared queues when your projects have fairly tightly-coupled interrelationships, such that you're concerned a change in one project could break functionality in another13:25
jrosserthis is the thing i mentioned yesterday, where if someone +2+W before a depends-on has merged, zuul drops a -213:26
jrosserthis seems to be new behaviour in the last ~week13:26
fungiokay, the verified -2 for a depends-on when approved out of order is new behavior, yes, and is currently being discussed by the zuul maintainers13:27
jrosseroh ok i'd missed that - where would I keep across that?13:27
fungii don't know what keep across means13:27
fungiif you're asking where to find the zuul maintainers, they're in the #zuul:opendev.org matrix channel13:28
jrosseroh sorry - i didn't know it was discussed beyond me mentioning it in #opendev yesterday13:28
fungipretty sure it got brought up in the zuul matrix channel, but i'm still catching up on discussions this morning13:28
fungijust to be clear, if two changes in projects which don't share a change queue have a depends-on relationship and the depending change is approved before the dependency merges, then the depending change gets a -2 verified vote. previously, zuul simply ignored the approval on the depending change, requiring you to reapprove it, correct?13:30
noonedeadpunkfungi: so basically we'd need to define a queue and then for each project reference its name, am I right?13:31
noonedeadpunkyep, it is correct13:31
funginoonedeadpunk: yes, that's how you indicate a particular project belongs to a named queue, you can look at the integrated or tripleo queues for examples13:31
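For reference, putting a project into a named shared queue is a one-line addition to its project stanza. A minimal sketch, assuming a hypothetical queue name and an illustrative job (the real examples are the "integrated" and "tripleo" queues mentioned above):

```yaml
# Minimal sketch of a project-level queue assignment in a repo's .zuul.yaml.
# The queue name "openstack-ansible" and the job used here are illustrative,
# not copied from any real project configuration.
- project:
    queue: openstack-ansible
    check:
      jobs:
        - openstack-tox-linters
    gate:
      jobs:
        - openstack-tox-linters
```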
noonedeadpunkfor some short period, zuul was even re-visiting previously ignored +2s and starting gate jobs without needing a manual push13:32
noonedeadpunkbut it was for quite a short time - like a month or so13:32
fungiand here's where the resource waste comes in: because the openstack tenant is configured to require a verified +1 from zuul before a change can be enqueued into the gate pipeline, another pass through check is required to clear the resulting verified -213:32
noonedeadpunkyep13:34
noonedeadpunkI see tripleo uses queue for gate only, but not for check13:35
jrosserthat's it - and the out-of-order approval is actually very handy for us with limited reviewers13:35
*** dasm|off is now known as dasm13:59
*** dasm is now known as dasm|ruck14:00
clarkbnoonedeadpunk: queue is a project level setting not a pipeline level setting14:55
clarkbput another way, tripleo is setting their queue for all pipelines, not just gate. It is just that gate, being a dependent pipeline, has the most visible impact of that14:55
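For contrast, the older placement clarkb is describing puts the queue under a pipeline inside the project stanza. A sketch of that (since-deprecated) form, which newer Zuul effectively treats as a project-wide queue assignment:

```yaml
# Sketch of the older per-pipeline placement (queue under the gate pipeline),
# roughly how the tripleo repos declared it at the time; the job is
# illustrative.
- project:
    gate:
      queue: tripleo
      jobs:
        - openstack-tox-linters
```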
*** ysandeep is now known as ysandeep|out16:03
*** jpena is now known as jpena|off16:31
clarkbnoonedeadpunk: jrosser: to follow up on this from the zuul matrix room, I think it would be good to set up queues appropriately then reevaluate if zuul is still causing problems. It sounds like we would need a new zuul reporter type to report status messages back without a negative vote, and I'm not sure everyone is convinced yet that that is correct16:48
noonedeadpunkwell, right now it's kind of a choice between wasting CI resources by setting +W in the wrong order vs wasting CI resources by invalidating all changes in the queue in case of a single accidental gate failure?16:50
noonedeadpunkas then in our use case setting up queues will lead to more issues I guess...16:51
clarkbwell the second thing isn't a waste, that's how gating is expected to work and prevents landing conflicting changes simultaneously16:51
clarkbthat was zuul's first feature16:51
noonedeadpunkbut it works this way even for changes that don't explicitly depend on each other? It's enough to just be in the same queue?16:52
clarkbbasically what zuul is saying is you are currently subverting zuul's expectations so the result you get is less than ideal. The ask is that we not subvert zuul and try it the way zuul is meant to be used and see if it is still a problem16:52
clarkbnoonedeadpunk: when you have projects that share a queue then they enter shared queues with dependent pipeline managers. In our case that is the gate queue16:53
clarkbthis means they are tested together to avoid two conflicting changes in different projects from landing at the same time16:53
clarkbthis also enables upgrade testing and other neat functionality with testing things together without needing to explicitly depends-on everything16:54
jrosserdoes that work outside mono-repos though?16:55
noonedeadpunkHm, I think indeed here we might have an issue in terms of how we test things... As from the beginning we have used integrated testing16:55
clarkbnot sure I understand. We don't host any mono repos as far as I know16:55
jrosserwhere a new feature for us might be one patch to 5 repos then a pretty empty patch to openstack-ansible which depends-on them all16:55
jrosserunless i mis-understand "testing things together without needing to explicitly depends-on everything"16:56
clarkbjrosser: that is what the queue setting is for16:56
noonedeadpunkbut for that they all must be in the gate at the same time....16:56
clarkbfor example nova, cinder, glance, swift, neutron, etc share a queue16:56
clarkbnoonedeadpunk: yes that is literally the point it was zuul's reason for existing :)16:56
noonedeadpunkclarkb: the thing is that we also kind of need to share a queue with _all_ projects we deploy16:57
clarkbbut this ensures that a change to nova cannot be approved and race a change to neutron approved at the same time that conflicts with it16:57
clarkbone will be tested before the other and testing will ensure that only one merges16:57
clarkbnoonedeadpunk: well that is the question, right? I'm basically saying set up the queue for OSA repos and then let's see if this problem persists?16:57
clarkbI don't know your review patterns well enough to be able to predict that, but gathering data should be straightforward16:58
noonedeadpunkclarkb: so the concern we have with using queues is that we quite often have failures unrelated to our code16:58
clarkbanyway fixing this problem in zuul is not straightforward and requires adding entirely new features. This is why the ask is we use zuul as it is intended to be used and then evaluate if this problem persists16:59
noonedeadpunkclarkb: and then a failure of a single patch would invalidate everything that is currently running?16:59
clarkbnoonedeadpunk: that sounds like something that should be addressed?16:59
noonedeadpunkclarkb: how should we address an ansible galaxy outage, for instance?16:59
clarkbyes flaky testing is bad in the gate. Typically we indicate that people should try to remove the flakiness since flakiness is bad for other reasons16:59
noonedeadpunksorry need to run away now17:00
clarkbnoonedeadpunk: one approach (taken by tripleo I think) is to have zuul hook up to the git repos for ansible roles so that zuul caches them. Another could potentially be to proxy cache galaxy (this is what we do for docker images)17:00
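A sketch of the first approach, using Zuul's required-projects so the role repos are cloned by Zuul and can be installed from the job workspace instead of being fetched from Galaxy at run time; the job name and the exact project list are hypothetical:

```yaml
# Hypothetical job sketch: Zuul checks out (and caches) these repos itself,
# so the deployment playbook can install the roles from their on-disk
# src_dir checkouts rather than downloading them from Ansible Galaxy.
- job:
    name: openstack-ansible-deploy-example
    parent: base
    required-projects:
      - openstack/openstack-ansible
      - openstack/openstack-ansible-os_cinder
      - openstack/openstack-ansible-os_nova
```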
clarkbOptimizing the gate so that test failures are acceptable is also sort of contradictory to zuul's expectations17:00
clarkbwe do have tools to make external resource access less problematic and depending on specific situations take different approaches17:01
jrosseri've previously asked for rabbitmq repos to be mirrored17:01
jrosserbut anyway17:01
*** dasm|ruck is now known as dasm|ruck|mtg17:01
clarkbjrosser: and I've said we'd be happy to proxy cache them iirc17:01
clarkbbut no one has written a change for that as far as I know17:02
jrosserthe trouble is they break them17:02
clarkboh right this is the case where the upstream doesn't know how to run a repo at all17:02
jrosserinvalid apt repos pretty much every time they release new stuff17:02
jrosserand then there are galaxy roles which are malformed so can't be installed locally17:02
clarkbI think my suggestion for that was to start by working with the upstream to address that17:02
jrosserit's not through lack of trying with any of this, really17:03
clarkbit isn't difficult to do right if you understand the problem exists. The problem is many people don't realize that deb repos work that way and so don't realize it is a problem17:03
jrosserwe got a ton of pushback on properly versioning some of the core ansible collections17:03
clarkbwhen you say a galaxy role is malformed so it can't be installed locally, how do they work at all? I thought all galaxy did was take a tarball or similar and put it on disk. Not all that different from checking out a git repo?17:03
jrosserbecause whatever $process pushes them into the galaxy backend inserts the relevant metadata17:04
jrosserit's otherwise missing in the git repo17:04
clarkbI see. I'd be inclined to not use those dependencies myself if they cannot be built from source17:04
jrosserand that's hit / miss depending on which collection17:04
jrosseransible.netcommon ansible.utils for a start :(17:05
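The metadata being discussed is roughly what a collection's galaxy.yml carries; a hedged sketch of the minimum a git repo would need for `ansible-galaxy collection build` to work from source (all values here are made up, not taken from any upstream collection):

```yaml
# Illustrative galaxy.yml: the kind of metadata that has to exist in the git
# repo for a collection to be buildable/installable from source.
namespace: example
name: mycollection
version: 1.0.0
readme: README.md
authors:
  - Example Maintainers
```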
clarkbproxy caching galaxy is likely also doable. I think tripleo went with caching git repos because they were all on github so that was straightforward17:05
clarkbI don't know much about the galaxy protocol though and if they subvert caches17:06
clarkbdocker does this which makes caching docker hub difficult but still possible with the right options17:06
jrosseri think i'm still failing to understand how a single queue can understand the relationship between patches without depends-on17:07
clarkbjrosser: it's an implied relationship that builds the queue based on reviewer activity17:07
clarkbjrosser: if you approve change A and then change B they get enqueued in that order17:07
clarkbwhat sharing a queue does is build that queue for changes A and B in the same order if they come from different projects17:08
clarkb(within a single project this is always the case due to how git works)17:08
clarkbadditionally zuul knows that it should check all the related projects (as set by queue) for actionable state if a parent or child is approved17:08
jrosserand what happens for check rather than gate?17:09
jrosser^ where there is no approval, i mean17:09
clarkbcheck is basically the same as before since check is pre review. It tests changes merged to their target branch17:10
jrosserso i'd still need the depends-on there?17:10
clarkbif there is a strict dependency then yes17:10
jrosseri think that we have a very high proportion of our changes being like that17:10
clarkbdepends-on handles strict dependencies. queue: and co-gating handle related projects and the implied relationship, preventing things from landing in an improper sequence17:11
clarkbthey solve two different problems and we seem to be conflating them here which isn't very helpful17:11
jrosserno indeed17:11
clarkbto go back to the openstack integrated gate example: you use depends-on when nova wants to use a new feature in neutron when creating instances. The nova change depends on the neutron change to add the new feature. They share an integrated gate to ensure that refactoring the "give me a network" api call continues to work with nova's changes to booting an instance17:12
clarkbdepends on are explicit expressions of relationships. The queue is an indication that changes to related projects may interfere with each other so we check them together17:13
clarkband that is why zuul uses queue to determine the list of related projects which it checks for actions when children or parents have actionable events17:14
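For completeness, the explicit form is just a footer in the Gerrit commit message; the change URL below is a made-up placeholder:

```
Depends-On: https://review.opendev.org/c/openstack/openstack-ansible-os_cinder/+/123456
```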
jrosserthe thing i am concerned about is when we need to make completely unrelated changes like this one https://review.opendev.org/c/openstack/openstack-ansible-os_cinder/+/83570217:18
jrosserthey need to be wholly unrelated (because they are) and implying anything from the approval order will almost certainly be counterproductive17:18
fungipart of what makes the recent behavior change "waste" additional resources is the openstack gate pipeline's "clean check" rule, which is considered a bit of an antipattern. other tenants allow enqueuing a change with a negative verified vote (even verified -2) directly into the gate pipeline without first going through check again17:19
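Roughly how the "clean check" rule is expressed, as a pipeline requirement on the openstack gate pipeline; this is a paraphrase under assumption, not the exact project-config text:

```yaml
# Paraphrased sketch of the clean-check requirement: a change can only enter
# the gate pipeline if zuul has already left Verified +1 or +2 on the
# current patchset of an open change.
- pipeline:
    name: gate
    manager: dependent
    require:
      gerrit:
        open: True
        current-patchset: True
        approval:
          - Verified: [1, 2]
            username: zuul
```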
jrosseras those sort of changes tend to weed out latent brokenness in some of our less active repos17:19
clarkbthere are always changes like that, for example updating docs in nova isn't going to break neutron api use17:19
clarkbbut we accept that and generally they don't seem to be a major problem because they are simple changes that test quickly and accurately17:19
jrosserthat example i gave is anything but quick to run, hence my concern about implying a relationship that doesn't exist17:20
clarkbok, I'm asking that we try it before dismissing it since that is why zuul exists in the first place and other projects use it successfully17:21
fungimost (all?) of the non-openstack tenants in our zuul deployment even conserve resources by cancelling any running check jobs if the change gets approved and enqueued into the gate17:21
clarkbotherwise I think the OSA team should bring this up with zuul and maybe help write the new functionality necessary to address this17:21
clarkb(I'm willing to help do that myself but only if we've exhausted the "intended usage pattern doesn't function" first)17:21
fungithis also might be a reason for openstack to revisit the "clean check" rule for its gate pipeline17:22
clarkbfungi: well they are talking about blind rechecks in the nova room right now and sounds like they are a major problem? clean check was implemented due to blind rechecks landing broken code17:23
clarkbI guess the risk there is that if blind rechecks are super common, people will get more stuff landed without any inspection as to why it failed in the first place. But it also may be worthy of an experiment17:23
fungisort of. it was implemented because of core reviewers approving untested or failing changes17:23
fungibut yes, repeatedly rechecking changes to clear failures without bothering to look into those failures is closely related17:24
clarkbfungi: well what happens is zuul +1's in check, a reviewer +A's, then the developer can recheck as many times in a row as it takes to land the code17:24
clarkbbut trying it and seeing if we end up root causing a bunch of random changes that were rechecked into oblivion when stuff breaks is doable and would be good data gathering17:25
clarkbjrosser: re the rabbitmq thing, does upstream not acknowledge the problem exists, or do they not want help to fix it, or? In general if you put package files in place before removing indexes and remove old files after some delay, you end up with a repo that doesn't error. Typically that isn't difficult to achieve. They build new packages, upload packages to repo, update index,17:33
clarkbsleep $TIME, remove old packages. I guess I'm wondering if the problem is that they think the problem doesn't exist at all so updating order of operations is refused?17:33
clarkbhrm looks like they use packagecloud.io?17:35
clarkbI wonder if this is a problem with the third party service17:35
jrosserwe used to get them from the rabbitmq upstream repo17:36
jrosserand when they broke i'd tweet them and ~24hours later it would be fixed17:36
clarkbLooks like cloudsmith.io also hosts packages. I wonder if one has problems and the other doesn't or maybe they are using the same repo content and then synced from broken state17:37
jrosserbut it got so bad we now get them from cloudsmith instead17:37
clarkbjrosser: looking at their release notes for recent releases they seem to only have cloudsmith and packagecloud listed.17:37
jrosseri do wonder if they just found maintaining the repo too much trouble17:38
clarkbat the very least it seems they don't want people using some other repo for recent releases17:39
clarkbhttps://packagecloud.io/rabbitmq/rabbitmq-server/install they seem to host something that is theoretically proxyable as a repo17:41
jrosseriirc this stuff was bad as well https://www.erlang-solutions.com/downloads/17:41
jrosserand we get all of it from cloudsmith now17:41
clarkbI'm hoping that a tool named "packagecloud" is building proper deb repos17:41
clarkboh it's erlang not rabbit?17:41
jrosserwell you need both17:42
clarkbgot it17:42
jrossertheres a compatibility matrix that relates the two17:42
clarkbhttps://packagecloud.io/rabbitmq/erlang yup they provide both17:42
clarkblooks like cloudsmith hosts deb files but not a repo. packagecloud does expose things as a repo with a gpg key you can trust and sources.list entries you can add17:43
clarkband the packages on packagecloud go back ~6 years.17:44
clarkbI think if those packages continue to be unreliable for network access reasons instead of repo consistency problems, then using a proxy cache to packagecloud is reasonable to set up. Then anything else hosted there would be able to be retrieved through the same proxy17:45
clarkblooks like they also host npm and maven and other stuff too so potentially useful beyond distro packages as well17:45
jrosseri think cloudsmith does have a repo, they just make it somehow not browsable https://dl.cloudsmith.io/public/rabbitmq/rabbitmq-erlang/deb/ubuntu17:48
clarkbah17:49
jrossereg `deb https://dl.cloudsmith.io/public/rabbitmq/rabbitmq-erlang/deb/ubuntu bionic main`17:49
clarkbya so could proxy cache either or both. Likely a matter of preference due to reliability more than anything else17:50
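If a proxy cache were set up, consuming it from the deployment side could look something like the Ansible task below; the mirror hostname and path are placeholders, since no such proxy exists yet:

```yaml
# Hypothetical sketch: point apt at the cloudsmith erlang repo through a
# caching proxy. "mirror.example.opendev.org" and the path are placeholders.
- name: Configure RabbitMQ erlang repository via a proxy cache
  ansible.builtin.apt_repository:
    repo: "deb https://mirror.example.opendev.org/cloudsmith/rabbitmq-erlang/deb/ubuntu bionic main"
    state: present
```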
*** dasm|ruck|mtg is now known as dasm|ruck17:57
*** rlandy is now known as rlandy|mtg18:33
*** rlandy|mtg is now known as rlandy19:02
*** rlandy is now known as rlandy|mtg20:26
*** dviroel is now known as dviroel|out20:36
timburkeis this a good place to mention there seems to be a problem with the fedora-35 mirrors? seeing py310 failures like https://zuul.openstack.org/build/37ab457a35f74e8eaab81af2fea63916/log/job-output.txt#34120:57
fungitimburke: thanks for the heads up, i haven't seen anyone else mention it yet21:05
fungiwe currently mirror from rsync://pubmirror2.math.uh.edu/fedora-buffet/fedora/linux according to this: https://opendev.org/opendev/system-config/src/branch/master/playbooks/roles/mirror-update/files/fedora-mirror-update#L3621:07
fungithe timestamp file at the root of our fedora mirror tree says we last updated at 2022-04-05T21:02:04,946404602+00:0021:08
fungiso only a few minutes ago, i guess21:09
fungithough the indexes which returned a 404 in the linked job are still nonexistent on our mirrors21:10
fungistrangely, http://pubmirror2.math.uh.edu/fedora-buffet/fedora/linux/updates/35/Everything/x86_64/repodata/ seems to have newer files than what we're serving21:12
fungiincluding the files the job is looking for21:12
fungii'll check the rsync log21:12
fungihttps://static.opendev.org/mirror/logs/rsync-mirrors/fedora.log21:14
fungilooks like there's some massive upheaval for f35 today21:14
fungii don't see any indication that rsync has picked up the missing indices yet21:15
fungialso that log ends 2 hours ago, so is for the prior refresh. i bet we don't flush the output from the latest run to the volume before we release it21:17
fungiyeah, the latest log is still in /var on the mirror-update.o.o server21:18
fungihard to say, but i think we've caught the uh.edu mirror in the middle of a large fedora 35 update21:20
fungiwe could try switching to pull from a different mirror which has already stabilized, or try to ride it out a bit longer21:22
timburke👍 thanks for the analysis! i'm content to wait it out -- nothing critical for me21:23
fungiif it's still broken in 2-4 hours, then we might want to consider picking a different mirror to pull from21:24
fungiunrelated, looks like pypi had a bunch of not-fun earlier: https://status.python.org/incidents/mxgkk3xxr9v721:24
clarkbthat must've caused the issues we observed with package installs22:37
clarkbI think this is the first time they've noticed that sort of problem when we do. I guess in this case because it was more catastrophic22:37
*** dasm|ruck is now known as dasm|off22:43
opendevreviewGhanshyam proposed openstack/project-config master: Remove tempest-lib from infra  https://review.opendev.org/c/openstack/project-config/+/83670322:45
*** rlandy|mtg is now known as rlandy23:16
opendevreviewGhanshyam proposed openstack/project-config master: Retire openstack-health project: end project gating  https://review.opendev.org/c/openstack/project-config/+/83670723:43
*** rlandy is now known as rlandy|out23:55
