Monday, 2019-02-18

*** flwang has joined #openstack-tc00:45
*** whoami-rajat has joined #openstack-tc02:14
*** ricolin has joined #openstack-tc03:38
*** Luzi has joined #openstack-tc05:46
*** e0ne has joined #openstack-tc06:41
*** e0ne has quit IRC06:47
*** e0ne has joined #openstack-tc07:35
*** e0ne has quit IRC07:38
*** dtantsur|afk is now known as dtantsur07:42
*** tosky has joined #openstack-tc08:33
*** jpich has joined #openstack-tc08:57
*** e0ne has joined #openstack-tc09:08
toskyI noticed a new wave of "add py37" jobs - is it for sure this time? I may have missed the new discussion on the list09:28
*** cdent has joined #openstack-tc10:55
smcginnistosky: Stein should be py36 for the officially targeted Python 3 runtime: https://governance.openstack.org/tc/reference/runtimes/stein.html13:07
smcginnistosky: So maybe those are being added to make sure we don't have problems in Train. They should probably post something to the ML to let folks know what their plan is though.13:08
* cdent wonders if we should shrink the tc13:11
cmurphyif you shrink the tc then that leaves even less opportunity for new blood to come in13:12
ttxcdent: we might want to do that yes. 13 is a large number13:12
cdentcmurphy: yes, that is the big (rather quite very big) negative13:13
cdentbut 13 is a pretty big number relative to the size of the community13:13
evrardjp_cdent: well we need to automate more things if we want to scale down -- I can't imagine accurately following all the projects with less than 13 ppl.13:15
cdentevrardjp_: keep in mind that "following all the projects" is a relatively new thing. It might be the right thing, but it hasn't been round long enough for us to base the composition of the TC on just that.13:17
*** leakypipes is now known as jaypipes13:39
*** mriedem has joined #openstack-tc13:50
*** cdent has quit IRC13:59
*** cdent has joined #openstack-tc13:59
*** Luzi has quit IRC14:04
*** jroll has quit IRC14:13
*** elbragstad has joined #openstack-tc14:14
*** jroll has joined #openstack-tc14:14
*** elbragstad is now known as lbragstad14:24
*** ricolin_ has joined #openstack-tc14:40
*** ricolin has quit IRC14:42
*** ricolin_ has quit IRC15:16
*** ricolin_ has joined #openstack-tc15:17
*** e0ne has quit IRC15:42
*** e0ne has joined #openstack-tc15:48
*** jamesmcarthur has joined #openstack-tc16:01
dhellmannsmcginnis , tosky : one of the folks from canonical was trying to get 3.7 jobs added because they're shipping on 3.7 (stein, I think). IIRC, we said "ok, but you have to do it"16:30
dhellmannand there was a mailing list thread, several months ago16:30
toskydhellmann: thanks; I remember the thread, which is also referenced from the commit messages16:31
dhellmannok, good16:31
toskybut I remember that at some point the attempt to add new jobs was stopped (before the end of the year vacations, maybe)16:31
dtantsurF29 also ships with Py37 by default, so it helps people running tox locally. (although Fedora has many versions available)16:31
toskyso I was not sure whether it was fine to propose again those patches or not16:32
dhellmannI think there was some confusion about whether adding the 3.7 jobs was related to dropping 3.5 and if having 3.5, 3.6, and 3.7 all at the same time was "a waste"16:32
toskyoh, I see16:33
dhellmannI don't think that adding those unit test jobs is the biggest "resource hog" of our CI though so I tried to get people to consider the questions separately16:33
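For reference, adding a forward-looking interpreter job like the py37 one under discussion is usually a small addition to a project's Zuul configuration. A minimal sketch, assuming the openstack-tox-py36 and openstack-tox-py37 job names; the exact layout varies per project:

    # Illustrative project config: py36 is the required Stein runtime,
    # py37 is layered on top as a non-voting, forward-looking check.
    - project:
        check:
          jobs:
            - openstack-tox-py36
            - openstack-tox-py37:
                voting: false
        gate:
          jobs:
            - openstack-tox-py36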
toskywouldn't it make sense to also add 3.7 as supported into setup.cfg?16:33
dhellmannnow that we have the 3.6 decision in place and documented, we could drop 3.5 anyway16:33
*** e0ne has quit IRC16:34
dhellmannprobably. I'm not sure we've tied unit test jobs to the classifiers in setup.cfg directly before16:34
toskynot always, but I think it wouldn't hurt if we can avoid another wave of reviews just to fix the classifiers16:35
dhellmannyeah, good point. I guess the question is, do you want to do that for unit tests alone, or wait for functional tests of some sort?16:37
dhellmannI don't have an answer to that one16:38
clarkbalso don't we still run a subset of tests in many cases for python3 unittests?16:38
dhellmanncdent , ttx, evrardjp_ : this would also be a good opportunity to engage with some of our Chinese contributors who have run in the past but not won seats16:39
ttxdhellmann: work in progress !16:39
dhellmannclarkb : I don't think so16:39
dhellmanna few, but not nearly as many as before16:39
*** ricolin_ has quit IRC17:00
smcginnisI wonder if it would be more explicit if we had job templates for each release cycle that defined the python versioned jobs to run.17:10
smcginnisMaybe slightly less confusion around what needs to be run versus what would be useful.17:10
smcginnistosky: I would think we would want to have functional tests running, and passing, first before adding it to setup.cfg.17:10
toskyoki17:11
toskymakes sense17:11
smcginnisThere probably wouldn't need to be much of a gap between adding a functional job and adding it to setup.cfg. Assuming everything passes.17:12
*** ricolin has joined #openstack-tc17:16
clarkbsmcginnis: re release specific python templates I think 3.7 would still be an outlier because it's informational, future-looking work, not a directly supported python.17:21
smcginnisclarkb: Not sure if I parsed that, but I think that was what I was saying. python-stein-pti template would only have the designated py3 environment of py36. So it would be somewhat more clear that any py37 jobs would be additional testing, not the core required runtime for stein.17:23
clarkbah I read it more as "this is what we test on stein, don't add others" but that expansion is basically what I was saying. You'd have the main template then additional stuff on top17:23
smcginnis++17:27
smcginnisI was also thinking then for stable branches we would have a set template of jobs to run and we may choose to drop some of the forward-looking jobs that aren't as interesting for stable.17:27
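A rough sketch of what such a release-scoped template could look like, using the python-stein-pti name smcginnis mentions above; the template and the layering shown are illustrative, not existing openstack-zuul-jobs definitions:

    # Hypothetical per-cycle template carrying only the designated runtime jobs
    - project-template:
        name: python-stein-pti
        check:
          jobs:
            - openstack-tox-py36
        gate:
          jobs:
            - openstack-tox-py36

    # A project consumes the template and can still add forward-looking jobs on top
    - project:
        templates:
          - python-stein-pti
        check:
          jobs:
            - openstack-tox-py37:
                voting: false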
*** jpich has quit IRC17:50
*** ricolin has quit IRC17:52
fungiespecially if those jobs depend on "ephemeral" node states such as short-support distro releases or rolling release distros17:54
clarkbfungi: rolling release seems to address the target function of these jobs well though. Test with whatever newest python3 is basically. Then let rolling release update that17:56
fungiwe can test stable/stein changes on ubuntu 18.04 lts and centos 7 for years to come still. fedora 29 on the other hand will cease to be supported upstream shortly after we release17:56
fungiand testing stable changes on rolling releases is basically a ticking timebomb17:56
clarkboh ya don't test stable on rolling releases. I meant for master the rolling release model seems ideal for this particluar use case17:57
fungibut for master branch testing, sure i'm all for it as long as we also test development on whatever the target long-term release is to make sure it works by the time we release17:57
clarkbas with pypy I expect we'll not have long term interest/maintenance of such forward looking jobs though and so adding them if/when there is interest on $whateverplatform is good enough17:58
cdentthis conversation reminds me of a somewhat related but mostly orthogonal question: Have we got the capacity to do unit tests in containers if that's what we prefer?18:06
clarkbwe don't currently have access to managed containers at that scale no. In fact I think mnaser prefers to give us VMs because we can scale that up and down whereas container resources are always on18:06
clarkb(we could in theory build scale up and down for containers, but it isn't there currently)18:07
cdents/capacity/capability/ is probably more accurate for what I really meant18:07
clarkbcdent: nodepool has kubernetes drivers and we have a test cluster (mostly to make sure the driver itself functions). So yes18:07
clarkbfwiw many of our unittest jobs do consume the whole node so resource wise they are quite efficient18:08
cdentI suspect that's true, but there are other cases where it's not. For example, in placement a local run of the unit tests takes ... (one mo)18:09
cdent9 seconds, 99% of that is tox and stestr overhead18:10
cdentRan: 146 tests in 0.1959 sec.18:10
cdentfunctional is a bit longer at 29s, with about 50% in overhead18:11
cdentbut the nodepool jobs report at 4-6 minutes18:11
fungihow much of that is installing software?18:11
clarkbcdent: sure, but containers don't make cpus or IO faster.18:12
fungiunless we pre-baked job-specific container images knowing in advance what dependencies to install, it's going to spend the same amount of time installing software18:12
cdentclarkb: that's my point: those tests don't care about cpus or IO. The reason I want to use a container is so that the node creation time is shorter and the resources can be freed up for other things18:12
clarkbwhat containers in theory give you is caching of system deps (some of the overhead in those jobs for sure is installing these deps) and better resource packing at the cost of being always on (at least with current tooling)18:12
clarkbcdent: right the node creation time is shorter because we never turn off the hosts18:13
clarkbwhich is a different potentially worse cost according to the clouds18:13
cdentfungi: presumably projects could be responsible for providing their own images to some repo?18:14
fungithey could do that with vm images too18:14
fungicontainers don't change that18:14
cdentclarkb: yeah, understood. Do we go to zero nodes in use much?18:14
cdentfungi: I think it does. Containers can be easier in some instances.18:15
fungibasically the only efficiency gains i see from running some jobs in reused containers on nodes we don't recycle is we spend (marginally) less time waiting for nova boot and nova delete calls to complete18:15
cdents/easier/easier to define and manage/18:15
clarkbcdent: not quite zero but we are very cyclical. http://grafana.openstack.org/d/T6vSHcSik/zuul-status?orgId=1&from=now-7d&to=now the test nodes graph there illustrates this18:15
fungiour longest-running jobs are also the ones which use the majority of our available quota, and would see proportionally the least benefit from containerizing18:16
clarkbweekends are very quiet and we tend to have less demand during apac hours18:16
fungiour average job runtime across the board is somewhere around 30 minutes last i checked, while our average nova boot and delete times measure in seconds, so optimizing those away doesn't necessarily balance out the new complexities implied18:17
clarkbyou can actually see the difference in boot times between ceph backed clouds and non ceph backed clouds in our time to ready graphs18:18
cdentI think focusing on the averages suggests that those patterns are somehow something we should strive to maintain. The fact that many of the larger projects have unit tests that take an age is definitely not a good thing.18:18
clarkbunder a minute for ceph and just under 5 minutes for non ceph. So the cost is definitely there18:19
fungiit might mean a significant efficiency gain proportional to a 30-second job, but when most of our nodes are tied up in jobs which run for an hour it's not a significant improvement18:19
clarkbright similar to why running python3.7 tests isn't a major impact on our node usage18:19
fungiit's really less the unit tests and more the devstack/tempest/grenade/et cetera jobs we spend an inordinate amount of our resources on18:20
cdentright, I know, but that's in aggregate. Every time a job finishes sooner, however, is the sooner another job can start18:20
cdentIf that's not something we care about, fine, but it seems like it would be nice to achieve. It sounds like containers are not an ideal way to do that at this time, that's fine too.18:21
clarkbcdent: I think that would be nice to achieve which is why we now start jobs in a priority queue. The problem with simply running short jobs quicker is we quickly end up using all the nodes for the long running jobs18:21
clarkbthe priority queue we've got now aims to reduce that pain18:22
fungiit's a ~2% efficiency improvement which implies rather a lot of effort to implement and significant additional complexity to maintain. i expect we can spend that same effort in other places to effect much larger performance gains18:22
cdentyeah, I'm keen on the priority queue18:22
cdentokay, let's go for a different angle18:23
cdentwhat about an up-and-running openstack node to which we apply incremental software updates and run tempest against18:23
cdent(instead of from scratch grenade and tempest)18:23
cdentso again, a long running node, which I guess is not desirable18:24
cdentI assume you both saw the blogpost lifeless had on an idea like that? If not I can find the link.18:24
clarkbya the lifeless suggestion basically. It is doable if you somehow guard against (or detect and reset after) crashes18:24
*** e0ne has joined #openstack-tc18:24
fungiyet another thing we've tried in the past. running untrusted code as root on long-running machines is basically creating a job for someone (or many someones) to go around replacing them every time they get corrupted18:24
clarkbya I've read it. Nothing really prevents us from doing that except it's a large surface area of problems you have to guard against18:24
clarkbthe reason we use throwaway nodes is we avoid that problem entirely (and scale down on weekends to be nice to clouds)18:25
* cdent nods18:25
clarkband at the same time can give our jobs full root access which is useful when testing systems software18:25
fungiwe haven't explicitly tried lifeless's suggestion, but we've definitely run proposed code on pets-not-cattle job nodes in years gone by and it's a substantial maintenance burden (which led directly to the creation of nodepool)18:26
clarkba concrete place this would get annoying is we have a subset of projects that insist on running with nested virt. This often crashes the test nodes18:26
clarkb(if you can detect for that and replace nodes then maybe its fine, but how do you detect for that when the next job runs and fails in $newesoteric manner)18:27
clarkbone naive approach is to rerun every job that fails with a recycled node18:27
cdentI guess in my head what I was thinking was that if a test fails or a node crashes (hopefully that is detectable somehow) it gets scratched and the18:27
cdentyeah, that18:27
clarkb(and use that to turn over instances)18:27
clarkbbut now you are running 2x the resources for every test failure18:27
clarkbthat may or may not be a cheaper resource cost (someone would need to maths it)18:27
cdentoh, that's not quite what I meant18:27
cdentif a test fails or a node fails, that job fails, the next job in the queue (whichever it was) starts on a node (building the needs)18:28
fungialso you run the risk that one build corrupts the node in such a way that the next build which should have failed succeeds instead18:28
cdentyou wouldn't do it for unit or functional tests, only integration, and if, in integration tests, we can't cope with that kind of weird state, we have a bug that needs to be fixed18:29
fungiso you can't necessarily identify corrupt nodes based on job failures alone18:29
cdenthowever, I recognize that we don't have a good record with fixing gate unreliability18:29
clarkbcdent: yes, but pre merge testing accepts there will be many bugs18:29
fungiand back to the suggestion of running on pre-baked job-specific images, we did that for a while too, and the maintenance burden there pushed us into standardizing on image-per-distro(+release+architecture)18:29
clarkband then ya you add on top that me and mriedem and slaweq end up fixing all the gate bugs... it is likely to not be sustainable18:29
fungiimage management is actually one of the more fragile parts of running a ci system18:30
cdentclarkb: yeah, I wish we didn't run tempest, grenade, dsvm style jobs until after unit, functional, pep, docs had all passed18:30
clarkbcdent: we've tried that too :)18:30
cdentheh18:30
fungisomething else we've tried! ;)18:30
* cdent shakes fist at history18:30
clarkb(and zuul totally supports the use case we just found it to end up in more round trips overall)18:30
clarkbbasically what happens is you end up with extra cycles to find bugs18:30
clarkbso rather than getting all results on ps1 then being able to address them, its ps1 fails, then ps2 fails then ps3 fails and ps4 now passes18:31
clarkbbut zuul can be configured to run that way if we really want to try it again18:31
fungiyeah, 1: it extends time-to-report by the sum of your longest job in each phase, and 2: i often find that a style checker and a functional job point out different bugs in my changes which i can fix before resubmitting18:31
cdentIs it a thing zuul can do per project, or is it more sort of tenant-wide?18:32
clarkbcdent: its per project job list18:32
fungiwhereas if we aborted or skipped the functional job, i'll be submitting two revisions and also running more jobs as a result18:32
clarkbcdent: basically where you list out your check/gate jobs you can specify that each job depends on one or more other jobs completing successfully first18:32
fungii want to say one of the projects started doing that with their changes recently. was it tripleo? or someone else?18:33
fungicurious to see if they stuck with it or found the drawbacks to be suboptimal18:33
cdentooooh. this is quite interesting. I'd be interested to try it in placement. most of us in the group were disappointed when we had to turn on tempest and grenade. If we could push it into a second stage that might be interesting. On the other hand, everyone should be running the functional and unit before they send stuff up for review...18:34
* cdent shrugs18:34
cdentTIL18:34
clarkbhttps://zuul-ci.org/docs/zuul/user/config.html#attr-job.dependencies18:34
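A minimal sketch of the arrangement cdent is describing, using the job.dependencies attribute documented at the link above; the job names and the exact split into stages are illustrative:

    # Illustrative project config: the tempest job waits for the cheaper
    # unit/functional jobs to pass before it starts.
    - project:
        check:
          jobs:
            - openstack-tox-pep8
            - openstack-tox-py36
            - openstack-tox-functional
            - tempest-full-py3:
                dependencies:
                  - openstack-tox-py36
                  - openstack-tox-functional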
cdentthanks18:34
cdentI really wish I had more time. zuul is endlessly fascinating to me, and I've barely scratched the surface18:35
cdentIt's a very impressive bit of work.18:36
clarkbfungi: I am not aware of anyone currently doing it but it wouldn't surprise me18:36
fungias someone who is a zuul maintainer, i too wish i had more time to learn about it18:37
clarkbon the infra side it is more there for building buildset resources like with the docker image testing work corvus is doing18:37
clarkbrather than as a canary18:37
* cdent must go feed the neighbor's cat18:38
cdentThanks to you both for the info and stimulating chat.18:38
fungiany time!18:39
*** dtantsur is now known as dtantsur|afk18:47
* smcginnis had wondered about container usage in CI too19:14
smcginnisThis has been interesting.19:14
*** jamesmcarthur has quit IRC19:19
*** e0ne has quit IRC19:30
*** jamesmcarthur has joined #openstack-tc19:33
dhellmannsmcginnis : your idea about well-named job templates makes a lot of sense. I agree that would make rolling out job changes simpler19:33
dhellmannor at least clearer19:34
smcginnisAt least one more step towards making things more explicit.19:34
dhellmannyeah19:34
*** e0ne has joined #openstack-tc19:37
openstackgerritGorka Eguileor proposed openstack/governance master: Add cinderlib  https://review.openstack.org/63761419:40
*** zaneb has quit IRC19:50
mnaserinteresting talk about containers19:59
mnaseri think they do bring an interesting value to the table in terms of being able to run more jobs, but we can do that by leveraging different flavors on clouds20:00
mnaseri.e.: i think it would be good if we run all doc/linting jobs in 1c/1gb instances20:00
mnaserinstead of 1 doc/lint job, we now run 8 at the same time (possibly)20:00
clarkbmnaser: one complication for ^ is most clouds set a max instances quota so that doesn't help us much (your cloud is an exception to that)20:01
evrardjp_lots of scrollback, and interesting20:01
mnaserclarkb: yeah, i think it would be good to approach the clouds and ask to drop the max instance limit and move to a ram based approach20:01
mnaserin aggregate you're still using just as many resources really20:01
clarkbya it is a possibility20:01
evrardjp_yeah aggregating docs/linting jobs that are pretty much self contained in env seems a good way to reduce turnover (at first sight....now experience has proven me wrong there :p)20:02
evrardjp_increase turnover? anyway20:02
mnaserit doesn't get us jobs that start quicker and get rid of the overheads.. but it does give us more efficient use of resources20:02
clarkbmnaser: right can run more jobs in parallel20:02
clarkb(assuming quotas are updated)20:02
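For context, nodepool's provider configuration can already express this kind of small-flavor label that doc and lint jobs could request; a rough sketch with made-up cloud, flavor, and label names:

    # Illustrative nodepool config: a small label backed by a 1 vCPU / 1 GB flavor
    labels:
      - name: ubuntu-bionic-small

    providers:
      - name: example-cloud
        cloud: example
        driver: openstack
        diskimages:
          - name: ubuntu-bionic
        pools:
          - name: main
            max-servers: 50
            labels:
              - name: ubuntu-bionic-small
                diskimage: ubuntu-bionic
                flavor-name: small-1vcpu-1gb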
mnaseralso, something i've thought about is with the PTI, we can have a set of dependencies that we almost always deploy, so for example -- doc jobs have a prebaked image, lint jobs as well20:03
clarkbmnaser: we just got away from that :/20:03
evrardjp_clarkb: because of the pain of image management?20:03
*** mriedem has quit IRC20:03
*** mriedem has joined #openstack-tc20:04
clarkbevrardjp_: thats a big part of it. You end up with 20GB images * number of image variants and on top of that the infra team becomes this bottleneck for fixing jobs or adding features to jobs20:04
evrardjp_(just wondering the reasons, and _if containers are thrown into the mix_ if that has changed)20:04
clarkbif you want to use prebaked container images vs bindep I don't really see that being a huge speedup because we already locally cache the data in both cases20:05
clarkbwhere you might see a difference is if those packages are expensive to install for some reason20:05
clarkband maybe yum being slow makes that the case20:05
mnaserclarkb: also you're introducing io into the equation in this case i guess20:05
mnaserdoing a 1g image pull (maybe only once if it's cached) that does big sequential writes20:06
mnaservs doing a ton of tiny io as apt-get install or yum does its thing20:06
clarkbya but it is all already locally cached in the cloud region and bottleneck should be the wire not the disk20:06
evrardjp_my question is more can we run multiple of those jobs on a containerized infra? Instead of getting a minikube cluster per job, we could have a shared minikube that's spawned every x time, which allows tests in containers in jobs if we receive a kubeconfig file for example20:07
clarkbcalculating deps and verifying package integrity is likely to be the bigger delta in cost20:07
clarkbevrardjp_: see the earlier discussion. The problem with that is that becomes a more static setup which costs the clouds more20:07
evrardjp_ok20:07
evrardjp_yeah true I remember what you said above20:08
clarkbin theory we could manage those resources on demand too, but don't have the drivers for nodepool to do that today20:08
evrardjp_so basically we've got the best already. :D20:08
clarkbI think what we have is a good balance between what the clouds want, what the infra team can reliably manage, and throughput for jobs20:08
clarkbthere are many competing demands and what we have ended up with works fairly well for those demands20:09
clarkbbut not perfectly in any specific case20:09
evrardjp_I am happy with it myself.20:09
evrardjp_I don't think we say enough thank you to infra :p20:10
clarkbone thing we can experiment with is jobs consuming docker images to bootstrap system dep installations20:11
clarkbit is possible that that will be quicker due to the package manager overhead of installing individual packages20:11
clarkb(at the same time docker images tend to be a fair bit larger than installing a handful of packages so it might go the other direction)20:11
*** whoami-rajat has quit IRC20:23
evrardjp_clarkb: what do you mean by "bootstrap system dep installation" ? I am confused20:28
clarkbevrardjp_: at the beginning of most of our jobs we run a bindep role that installs all the system dependencies for that test20:29
clarkbevrardjp_: you could instead pull and run a docker container20:29
clarkbwhether one is faster than another depends on a lot of factors though20:30
evrardjp_I guess it depends on the jobs too -- because some can have lightweight image that's fast to pull and fast to install a few packages -- but then PTI comes in the way I guess20:31
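For reference, the bindep step clarkb mentions is a role applied in the base job's pre-run playbook; a container-based variant would swap that step for an image pull. A rough sketch, assuming the bindep role from zuul-jobs, with the docker half purely hypothetical:

    # Roughly what the base job's pre-run does today: install the packages
    # listed in the project's bindep.txt
    - hosts: all
      roles:
        - bindep

    # A hypothetical alternative step: pull a prebaked dependency image
    # (assumes docker is already present on the node; image name is made up)
    - hosts: all
      tasks:
        - name: Pull prebaked unit test dependency image
          command: docker pull registry.example.org/openstack/unit-test-deps:latest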
fungiclarkb: have we tried using eatmydata to cut down on i/o cost from package installs?20:36
fungiit seems that clouds tend to be *much* more i/o-constrained than network bandwidth20:37
clarkbfungi: not recently. I remember fiddling with it once long ago and deciding that $clouds were probably not doing safe writes in the first place. This may have changed and is worth reinvestigating possibly20:37
clarkbfungi: we have seen a move towards tmpfs for zookeeper and etcd though in part due to the io slowness on some clouds20:38
clarkbso probably would help at least in some cases20:38
fungiit does seem like most of the system package installer delay is i/o-related, but i have no proper experimental data to back that up20:39
*** zaneb has joined #openstack-tc21:12
*** zaneb has quit IRC21:37
*** cdent has quit IRC21:42
*** e0ne has quit IRC22:03
*** e0ne has joined #openstack-tc22:04
*** mriedem has quit IRC23:03
*** dklyle has quit IRC23:03
*** e0ne has quit IRC23:12
*** e0ne has joined #openstack-tc23:23
*** e0ne has quit IRC23:30
*** spsurya has quit IRC23:58
