Tuesday, 2023-03-14

ianwmaybe should have just let chatgpt write the rules.  i mean sure every change would have to include "i love you chatgpt", but small price to pay :)00:02
opendevreviewIan Wienand proposed openstack/project-config master: gerrit/acl : Convert remaining AnyWithBlock to submit requirements  https://review.opendev.org/c/openstack/project-config/+/87599600:12
ianwclarkb: ^ that's not new, but i think i accidentally unstacked it from all the prior changes so it went into merge conflict00:13
*** dhill is now known as Guest761700:19
clarkbianw: small thing on that00:21
opendevreviewIan Wienand proposed openstack/project-config master: gerrit/acl : Convert remaining AnyWithBlock to submit requirements  https://review.opendev.org/c/openstack/project-config/+/87599600:22
ianwthanks!00:22
opendevreviewMerged opendev/system-config master: launch: add a probe for ssh after reboot  https://review.opendev.org/c/opendev/system-config/+/86836900:26
fungibetter yet, get pre-lobotomy bingbot to write the rules00:26
fungithough it may demand blood sacrifice for every change upload00:27
*** \join_weakmayors is now known as \join_iwp903:33
opendevreviewMerged openstack/project-config master: gerrit/acl : Convert remaining AnyWithBlock to submit requirements  https://review.opendev.org/c/openstack/project-config/+/87599605:49
*** jpena|off is now known as jpena08:20
opendevreviewEbbex proposed openstack/diskimage-builder master: Simplify epel/pkg-map  https://review.opendev.org/c/openstack/diskimage-builder/+/87734708:51
opendevreviewEbbex proposed openstack/diskimage-builder master: Simplify epel/pkg-map  https://review.opendev.org/c/openstack/diskimage-builder/+/87734708:57
dpawlikfungi, ianw, clarkb: o/ https://review.opendev.org/c/zuul/zuul-jobs/+/876081 - could you review or reply to Tengu's comment when you have a few minutes, please?15:16
clarkbdpawlik: the Zuul matrix room would be a better venue to discuss zuul-jobs. I mentioned there when you first pushed the change that you should consider adding a job toat runs the role15:17
dpawlikack clarkb15:18
clarkb*job that runs the role15:19
dpawlikclarkb: will do. Just please check Tengu comment and reply. We would like to move the role from zuul-jobs to dedicated project in oko org and zuul-jobs will just use it. I will cover it with proper zuul job15:20
clarkbagain that is a discussion for zuul-jobs15:21
clarkber sorry for the zuul matrix room15:21
dpawlikunderstand. Thanks 15:22
clarkbianw: feedback from foundation is that the statement we have looks good but it would be better if we can add data/stats like Debian and Ocaml have. Maybe we can talk about numbers of tests run weekly/test nodes weekly?15:42
clarkbs/weekly/some time frame/15:42
opendevreviewClark Boylan proposed opendev/git-review master: Test Python bounds only  https://review.opendev.org/c/opendev/git-review/+/87732116:02
opendevreviewClark Boylan proposed opendev/git-review master: Test old and new Gerrit  https://review.opendev.org/c/opendev/git-review/+/87731316:02
clarkbfungi: ^ I think those will both pass testing now and should be mergeable assuming you want to switch to nox.16:02
clarkbnot terrible to adapt to tox but jobs will need redoing and tox.ini would need updating to pass the var through16:03
fungii'm not convinced that the max-concurrency setting for rax-ord is actually taking effect: https://grafana.opendev.org/d/a8667d6647/nodepool-rackspace?orgId=116:03
fungii see spikes as high as 57 nodes building at the same time there16:03
clarkbfungi: a thread dump should confirm as I think you'll be able to identify the threads associated with the provider16:03
fungijust as recently as an hour ago16:03
clarkb(they should be named in such a way that this is possible iirc)16:03
fungikill -USR1 yeah?16:05
fungitwice obviously16:05
fungiUSR216:06
fungihttps://zuul-ci.org/docs/nodepool/latest/operation.html#daemon-usage16:06
fungirunning this now on nl01 since it seems like we're consistently well above 10 building concurrently: sudo kill -USR2 1285922;sleep 60;sudo kill -USR2 128592216:08
funginl01:~fungi/stack_dump.2023-03-1416:14
fungilooks like there are 8 openstack-api-rax-ord_* threads, 6 keyscan-rax-ord_* threads, and a PoolWorker.rax-ord-main thread16:17
fungithere are also some generic threads not tied to a particular provider but it doesn't seem like they handle building nodes16:19
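(Editor's sketch) The per-provider thread tally above can be reproduced from a saved dump with a few lines of Python. This is only an illustration: the dump layout is an assumption (any text in which thread names such as openstack-api-rax-ord_0 appear as whitespace-separated tokens), and count_provider_threads is a hypothetical helper, not part of nodepool.

```python
import re
from collections import Counter


def count_provider_threads(dump_text, provider):
    """Count distinct thread names mentioning `provider`, grouped by prefix.

    Assumes thread names appear as tokens made of word characters, dots
    and dashes (for example "keyscan-rax-ord_2"); the real dump layout
    may differ.
    """
    pattern = r"[\w.\-]*" + re.escape(provider) + r"[\w.\-]*"
    # Use a set so a name printed on several lines is only counted once.
    names = set(re.findall(pattern, dump_text))
    counts = Counter()
    for name in names:
        # Everything before the provider name is treated as the thread type.
        prefix = name.split(provider)[0].rstrip("-._") or "(none)"
        counts[prefix] += 1
    return counts


if __name__ == "__main__":
    sample = (
        "Thread: 1 openstack-api-rax-ord_0\n"
        "Thread: 2 openstack-api-rax-ord_1\n"
        "Thread: 3 keyscan-rax-ord_0\n"
        "Thread: 4 PoolWorker.rax-ord-main\n"
        "Thread: 5 openstack-api-rax-dfw_0\n"
    )
    for prefix, total in sorted(count_provider_threads(sample, "rax-ord").items()):
        print(prefix, total)
```

Grouping by prefix keeps the API, keyscan and pool worker threads separate, which matters because, as noted in the discussion, the counts may not simply add up to the number of nodes being built.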
clarkbya a building node is always in its own thread iirc16:21
clarkbI think those counts don't contradict our limit of 10 (I'm not sure you can add 8+6 it may be 6 of 8 are in keyscan mode?)16:22
fungii'm more concerned by the "Maximum number of node requests that this provider is allowed to handle concurrently." but maybe those building nodes don't correspond to a node request, or maybe we average >2.5 nodes per request?16:23
clarkbI'm not sure I understand that16:24
fungithe graph is showing node count rather than node request count, that could be the difference16:24
clarkbthe limit is 10 concurrent builds and the data above is under that limit I think16:24
fungihttps://zuul-ci.org/docs/nodepool/latest/configuration.html#attr-providers.max-concurrency16:24
clarkbyes we set that to 10 and the data above shows 816:25
clarkblet me pull up the dump and look directly16:25
fungiit says "node requests" there, not "nodes" which is where my confusion over the graph likely stems from16:25
clarkbdocker just emailed infra-root and said we have been identified as possibly being a free team organization which is sunsetting on April 1416:26
clarkbI thought we never proceeded with that due to their odd requirement to not document the use of podman or whatever16:26
fungiwe never did complete that application, no16:26
corvusyeah i was just reading that email and looking into it...16:27
clarkbthey are warning us that after that time we may lose access to some of our data16:27
clarkbcorvus: thanks!16:27
fungianyway, on the max-concurrency setting, what we're trying to limit is the number of nova create calls which haven't returned a ready node yet, what we control is the number of outstanding node requests which have been accepted, there's no 1:1 correlation between building nodes and node requests, so we need to scale it by the average request size i think16:27
corvusit looks like "opendevorg" and "zuul" are "organizations" under the "Docker Free Team" subscription16:28
corvusso my reading of that email is that after april, those can disappear at any time16:28
corvusso that's pretty much worst-case scenario for us16:28
clarkbcorvus: huh we asked about it then they told us we couldn't document use of tools other than docker which caused us to not pursue further. Did they just grant it to us anyway?16:29
clarkboh or maybe they are saying there is no free tier at all?16:29
corvushttps://paste.opendev.org/show/bPGrtnOfGSNjw5UG225s/ is the email btw16:29
clarkbya I think that's it. Basically no free option period16:30
corvusclarkb: erm, i don't remember being involved in any application process...16:30
clarkbcorvus: they had a program for open source projects to avoid the rate limits. We looked into that and then didn't pursue it due to their requirements. I was confusing this with that16:30
clarkbcorvus: But I think in reality what is happening is they are saying docker hub will have zero free options after april 1416:30
corvusclarkb: gotcha, yeah i think this is sort-of orthogonal to the rate limits thing (except inasmuch as their subscription levels have differing rate limits)16:31
corvusand i mostly read your second thing about "no free" as correct -- except that i think they may still have a "personal" level that's free16:32
corvus(which they say is "good for open source projects" !)16:32
clarkbI'll make a note to bring this up on the infra team meeting today. But I guess this just became a bit of a priority (hosting elsewhere)16:33
fungilooks like this has exploded on reddit, unsurprisingly16:33
clarkbthe month + 30 days of RO access seems really aggressive too16:33
corvusperhaps there is a possibility to convert each of our several (we have 4) orgs into 4 different "personal" accounts?  but i don't see a button to do that -- only upgrade.16:33
clarkbany idea what the costs are? I could see us potentially paying once to extend that timeframe and actively work to get away from them16:34
clarkbhttps://www.docker.com/pricing/ $9/user/month16:35
corvushttps://www.docker.com/pricing/16:35
corvusi think that would be $600 one time cost to keep opendevorg and zuul for 1 year (and abandon the other 2)16:36
corvusbecause of minimums16:36
clarkbcorvus: sorry what were the other two?16:36
corvusopenstackinfra and stackforge16:36
clarkbah yup those should be abandonable16:36
corvusyes they have no repos16:37
corvusi think we can move pretty quickly if necessary16:37
clarkbso ya that gives us a potential out so that we are not scrambling over the next month. I think we should continue to look into what it would take to migrate. In particular we probably need to start with base images16:37
clarkbbasically we can move the base images elsewhere, then once that is done rebuild everything that sits on top of those16:38
clarkbcorvus: I agree. I think the main gotcha might be whether or not the images we rely on will continue to be available16:38
corvusif it's another public service, we can probably move everything in a few days. if we want to self-host, give us another week.16:38
corvusmm like python base?16:38
clarkbor if we need to also spin up the hosting of a python base image for example16:38
clarkbyes16:38
clarkbthose are all docker hub "library" images so presumably won't be going away?16:38
clarkbbut maybe you'll lose access to them without a docker account?16:39
clarkbfungi: re max concurrent requests I think those are "node build requests" and not api requests.16:39
fungioh, in that case the graph is misleading i guess? or maybe those are nodes which nodepool has given up on but haven't exited building state yet?16:40
clarkbfungi: ya I'm wondering if it is an accounting problem more than a real state issue. Probably needs more investigating to understand. One way to do it is check nodepool list for ord building state nodes over time16:41
corvusi believe max_concurrency is for requests, so if we get 10 requests for 2 nodes each, we will spawn 20 node building state machines simultaneously16:42
clarkboh!16:43
fungiyeah, that's what i was surmising16:43
corvusso that's not quite what we wanted for the ord case -- in that we really want to limit node requests -- but on average, maybe it will work out?16:43
fungiso like i said, we'll need to scale it by the average node request size16:43
clarkbya and if 10 is still too high we could scale it back a bit more I guess16:43
corvuslike, maybe on average we have 1.2 nodes per request or something16:43
clarkbfungi: sorry I was thinking about api requests and thought that is what you were suspecting16:44
clarkbso 10 api requests in flight at a time rather than 10 nodepool things16:44
fungiwell, the graph suggests that we averaged 6 nodes per request at the point where there were 57 building nodes at max-concurrency 10, and currently it's around 2.5 nodes per request average16:44
corvusthen, unfortunately, we'll just get hit in the face when we're unlucky enough to run infra jobs where we get 10 requests for 5 nodes each :)16:44
corvusthat's higher than i would have guessed16:45
corvus6 nodes/request is unpossible, right?16:45
clarkbya I thought our limit is 5 but /me checks16:45
fungiwell, was current up until a few minutes ago, now there are no requests i think16:45
clarkbmax is apparently 1016:45
corvusoh nope, we raised the limit16:46
corvusit's possible16:46
corvusit's 1016:46
corvus(in most tenants)16:46
fungiit could still be some combination of high average nodes per request and the graph also reflecting "building" nodes which the launcher gave up on due to timeouts but haven't transitioned to ready/deleting yet16:46
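(Editor's sketch) The scaling argument in this exchange is simple arithmetic: max-concurrency caps accepted node requests, so the number of nodes building at once is roughly the cap times the average nodes per request. The helper names below are hypothetical, and the model ignores nodes stuck in "building" after a timeout, which the discussion suggests may also inflate the graph.

```python
import math


def expected_building_nodes(max_concurrency, avg_nodes_per_request):
    """Rough number of nodes building at once for a given request cap."""
    return max_concurrency * avg_nodes_per_request


def scaled_max_concurrency(target_building_nodes, avg_nodes_per_request):
    """Request cap that keeps building nodes at or below the target.

    Rounds down, but never below 1 so the provider can still make progress.
    """
    return max(1, math.floor(target_building_nodes / avg_nodes_per_request))


if __name__ == "__main__":
    # 10 requests of 2 nodes each, the example given in the discussion.
    print(expected_building_nodes(10, 2))
    # Cap needed to hold roughly 10 building nodes at 2.5 nodes per request.
    print(scaled_max_concurrency(10, 2.5))
```

With the figures quoted here (a cap of 10 and about 2.5 nodes per request) the model gives about 25 building nodes, and holding the provider near 10 building nodes would need a cap of about 4. The worst case of 10 requests for 10 nodes each is not covered by averaging.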
clarkbhttps://web.docker.com/rs/790-SSB-375/images/privatereposfaq.pdf is an faq that seems to create more confusion because it specifically calls out "private repos"16:53
clarkbsome people seem to think that only private data (which we have none) is affected16:53
clarkbhowever that isn't how I read the email16:53
clarkbI think we should begin planning at a high level but looking at discussion around this there is a tremendous amount of confusion. It is probably a good idea to try and avoid making any hard decisions for a day or two giving docker some time to clarify things which may impact our decisions17:07
clarkbI'm going to start collecting notes here: https://etherpad.opendev.org/p/MJTzrNTDMFyEUxi1ReSo17:13
corvusclarkb: agreed all around17:17
*** jpena is now known as jpena|off17:18
fricklerseems the kolla docker account is also affected, mnasiadka received the same email17:42
fricklerclarkb: I don't think I'll make it to the meeting (again), but since the storyboard topic was triggered at the last PTG, I think it would be really good if we could come up with at least some kind of answer before the next one17:45
fricklermaybe at least we can schedule a session during the ptg itself joining the affected projects17:46
clarkbfrickler: re docker I think this is universal for all free orgs on docker hub17:46
clarkbfrickler: re ptg and storyboard. The problem is I can't seem to get any opinions other than my own to be stated :)17:46
clarkbI don't want to make any unilateral decisions here. I think projects that are actively moving to launchpad should talk to each other and coordinate potential tooling/process to reduce the amount of effort but none of that needs to go through us. It's just they weren't talking together from what I could tell so tried to centralize that discussion which didn't really happen17:47
clarkbIf I were to make a decision I would probably recommend we sunset storyboard (topical for today) give a shutdown date probably more than a month in the future and take it from there17:48
clarkbbut I am/was hopeful that we could reach such a decision more collectively or reach a different conclusion as long as it was a bit more of a collective path17:50
fricklerI'd agree to that except maybe we should see if we can keep it in readonly mode for longer, allowing to still reference existing stories and not have a short-term pressure to migrate those17:50
clarkbya or maybe some sort of read only archive export? Thats a good idea worth investigating17:50
fricklerworst case a recursive wget that can be put onto static17:52
clarkbfungi: that git-review stack is green now fwiw18:07
*** gibi is now known as gibi_pto18:38
clarkbRamereth: hey, if you do end up hearing back from docker re the DSOS I would be curious to know if they have kept the requirement that program participants document that you must use docker and docker desktop to run the images hosted in the program18:49
clarkbRamereth: we looked into this a while back and that was one of the requirements in the response we got which led us to not followup on it18:50
clarkbNeilHanlon: ^ same for you18:50
Ramerethclarkb: I will certainly do that. That's quite the requirement if that's the case but how would they enforce that?18:51
fungiyeah, they really didn't seem keen on the idea of including mention of docker alternative container tooling18:51
fungiin project documentation i mean18:51
clarkbRamereth: they could potentially enforce it using user agent strings in requests? But I suspect it was more of an honor code thing that they would evaluate in the annual reapplication/reapproval process18:51
fungialso they wanted frequent participation in writing docker marketing materials18:51
* NeilHanlon sighs18:51
NeilHanlonwe'll make our own container registry18:52
clarkbit is entirely possible they've removed that rule18:52
NeilHanlonwith blackjack..18:52
clarkbwhich is why I'm curious18:52
Ramerethyup, sounds like the best option is to run your own registry if that's the case18:52
NeilHanlonthe new TOS is pretty.. simple https://web.docker.com/rs/790-SSB-375/images/DockerOpenSourceProgramTermsofAgreement.pdf18:52
fungiodds are we caught them while they were still on an early draft of the requirements18:53
NeilHanloni do remember that from the old program18:53
clarkbI wish us all much luck. we've been taking notes here: https://etherpad.opendev.org/p/MJTzrNTDMFyEUxi1ReSo though a lot of that is specific to us18:54
fungiwe originally approached them right after the change in download quotas was announced, so it was probably very early days for their open source community options still18:54
fungiprobably they loosened up a bit after enough communities turned down the offer18:55
NeilHanlonprobably. i do know I got pushback when publishing the Rocky images in their 'library' that our documentation mentioned instructions for running systemd in non-docker containers.. they did not like that :) 18:58
fungiugh18:59
* NeilHanlon was only half kidding about the 'build our own' thing and begins thumbing through his collection of unused, purchased-at-3am domains19:01
RamerethI just got a reply for the OSUOSL request: All right! Your request has been received and put into our queue. The team will start to address this issue immediately.19:03
Ramerethwe'll see what they say..19:04
NeilHanloni applied again as well.. we'll see. will let you know clarkb, fungi, how it goes19:07
clarkbthanks!19:08
mnasiadkaFrom Kolla side we might consider swapping out docker for something else (we already direct our users mainly to quay.io)20:18
Clark[m]mnasiadka: is that with a paid or free quay.io account? It isn't clear how to setup the free use and creating an account requires a phone number20:21
*** dviroel_ is now known as dviroel20:21
NeilHanlonquay is free for public repos20:25
NeilHanlonat the cost of a RH account ;) 20:26
NeilHanlonCan I use Quay for free?20:26
NeilHanlonYes! We offer unlimited storage and serving of public repositories. We strongly believe in the open source community and will do what we can to help!20:26
clarkbNeilHanlon: ya but then when you go to sign up it asks for a phone number. Maybe we just have to give them a phone number and send it20:27
clarkbianw: did you see my note re works on arm statement?20:27
clarkbianw: gitea09 has the backup cron jobs on it now and I haven't seen email yet that it has failed. I'll try to do more indepth verification of it though20:36
ianwclarkb: yeah, was just thinking about how to get some raw counters20:51
clarkbwe can't query by label in zuul's api unfortunately20:52
clarkbbut we can query by pipeline and many of those jobs are in known pipelines. That may be good enough?20:52
ianwperhaps i shouldn't have looked because now i don't like the look of the linaro graph20:53
clarkbuh oh :)20:53
clarkbI'm going to pop out now to take advantage of some "warm" sunny weather. Haven't had this in a few weeks20:53
clarkbbut happy to help look more when I get back20:54
ianw'[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: certificate has expired (_ssl.c:997)')20:54
ianwkevinz: ^20:54
ianwi'm not sure what acme method is used here, but this seems like something that can be a cron job20:55
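(Editor's sketch) Alongside a renewal cron job, a periodic check can warn before a certificate lapses. The following is a minimal illustration using only the Python standard library. The helper names and the 14-day threshold are assumptions, and because the check uses normal verification it can only warn ahead of time: once the certificate has expired, the handshake itself fails with the CERTIFICATE_VERIFY_FAILED error quoted above.

```python
import socket
import ssl
import time


def days_until_expiry(not_after, now=None):
    """Days left before a certificate's notAfter time (negative if past).

    `not_after` uses the format returned by SSLSocket.getpeercert(),
    for example "Jan  5 09:34:43 2018 GMT".
    """
    expires = ssl.cert_time_to_seconds(not_after)
    if now is None:
        now = time.time()
    return (expires - now) / 86400


def fetch_not_after(host, port=443, timeout=10):
    """Return the notAfter field of the certificate presented by host:port."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def needs_renewal(not_after, threshold_days=14, now=None):
    """True when fewer than `threshold_days` remain (threshold is arbitrary)."""
    return days_until_expiry(not_after, now=now) < threshold_days
```

A cron job could call fetch_not_after on the cloud's API endpoint and send mail when needs_renewal returns True. The hostname to check is deliberately left out because it is not given in the log.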
Ramerethclarkb: so it looks like the OSUOSL got approved and this part of the email was interesting: If you haven’t already, please take the time to update your project’s Hub pages to include a detailed project description, links to your project source code, as well as contributing guidelines, and a link to your organization’s website. Projects lacking this information may not receive the Docker Sponsored Open Source badging for their images on Docker Hub.21:14
Ramerethso far nothing mentioning the requirement of using docker tools21:16
fungithat's reassuring21:17
opendevreviewSergiy Markin proposed opendev/base-jobs master: Bindep libraries update  https://review.opendev.org/c/opendev/base-jobs/+/87743022:50
opendevreviewSergiy Markin proposed opendev/base-jobs master: Bindep libraries update  https://review.opendev.org/c/opendev/base-jobs/+/87743023:14
clarkbdid anyone prune the vexxhost backups server? We got an email about it being at 90% yesterday and haven't received one yet today23:16
* clarkb writes a service-announce email for the April 6 22:00 UTC gerrit work23:16
ianwclarkb: i didn't, i thought at the time "that seems like not that long since we last did it".  which i then looked up to be 2023-01-3023:17
ianwbut what would actually be interesting is how often we have done it before that23:18
ianwi feel like it was not usually ~1.5 months23:18
clarkbianw: that is the smaller of the two backup servers right? But ya I think you are correct that it hasn't been this frequent. We probably do end up accumulating more over time simply with new servers but also with the pruning keeping more content over time (eg two annual backups or whatever our retention is)23:19
ianwis cacti graphing them?23:20
clarkblooks like yes23:20
clarkbthe saw edge on the disk graph is actually pretty consistent23:20
ianwhttp://cacti.openstack.org/cacti/graph.php?action=zoom&local_graph_id=69081&rra_id=4&view_type=&graph_start=1645782851&graph_end=167883603523:20
clarkbhttp://cacti.openstack.org/cacti/graph.php?action=view&local_graph_id=69081&rra_id=all23:20
ianwheh, jinx again, yeah, that looks fairly periodic23:21
clarkbI'm going to state the gerrit outage will be approximately 2 hours for the april stuff23:21
clarkbsince we have to do offline reindexing I want to give us a bit more time than usual23:21
ianwin that case, i don't mind running the prune in a screen, i'll start it in a bit23:21
clarkbthanks!23:21
ianwcurrently it seems we run playbooks/service-base.yaml just against all hosts in the inventory23:26
clarkbyes, since service-base is our very base config (users, firewall, email)23:26
ianwi'm wondering if for the linaro cloud we could have it inventory, but maybe in a group "unmanaged" or something23:26
clarkbwe want to keep that stuff in sync globally23:26
ianwand then pick-and-choose the bits we want to apply23:26
clarkbianw: and exclude that group from base?23:26
ianwyeah23:27
clarkbya that might work23:27
clarkbmaybe stick them in a special section of the inventory file too to make the distinction more clear23:27
ianwso install our users, something to manage renewing the LE certs, maybe other stuff in the future23:27
ianwthe other option is just to put acme.sh renewal on the linaro host in a local cron job and largely forget about it23:28
ianwwhich frankly makes sense, but also triggers my gitops/collaborative infra nerves23:28
clarkbanother option is to do what we did with the inmotion cloud and self sign a longer term cert and not worry about it for a while23:28
clarkbthats less clean, but is nice and stable23:29
ianwtrue, but it is nice having a cert trusted by all the nodepools with no extra effort23:29
clarkbgerrit outage email sent23:33
ianwactually acme.sh has a deploy plugin to haproxy that concatenates things automatically23:34
clarkbianw: as far as works on arm statements go maybe something like "This has enabled us to run X arm64 test VMs that executed Y test jobs within our CI system" assuming we can answer what X and Y are without too much effort would be good23:37
ianwi wonder if we can just do something in graphite that keeps adding the in-use nodes, and then just take the highest point23:39
clarkbyes I think that is doable with graphite23:42
ianwclarkb: https://graphite.opendev.org/S/X23:42
clarkbI've also done a thing where i ask it for json data instead of pngs and then have python do some calculations23:42
ianwi think you could probably say 7,000 raw23:42
clarkbianw: ya or maybe take a weekly value since that 7k will be out of date in a few weeks23:42
clarkb"Enabled 1k test VMs weekly for CI jobs within our CI system" ish23:43
ianwyeah, between 01/26 -> 02/26 ~ 5.5 -- say 6k23:43
ianwflip d/m there if you live in a weird place that does that :)23:43
clarkbbut ya I would give a monthly weekly or daily count rather than a total sum as that should be more accurate going forward. Probably pick whichever one sounds most impressive23:44
ianwyeah, i think the figures support saying 6k/month 23:46
clarkbworks for me23:47
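(Editor's sketch) The "ask graphite for json instead of pngs" approach mentioned above can look roughly like this. Graphite's render endpoint returns a list of series, each with a "datapoints" list of [value, timestamp] pairs when format=json is requested. The helper names and the metric name in the usage note are placeholders, since the actual target behind the short link is not shown in the log.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def render_url(base, target, start, until):
    """Build a Graphite render URL that returns JSON instead of a PNG."""
    query = urlencode(
        {"target": target, "from": start, "until": until, "format": "json"}
    )
    return f"{base}/render?{query}"


def fetch_series(url, timeout=30):
    """Fetch and decode the JSON list of series from a render URL."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


def peak_value(series):
    """Highest non-null value in one series, or None if it is all gaps."""
    values = [value for value, _ts in series["datapoints"] if value is not None]
    return max(values) if values else None
```

For a running total like the one described, the target could wrap a counter in Graphite's integral() function, and peak_value would then give the highest point over the window, for example peak_value(fetch_series(render_url("https://graphite.opendev.org", "integral(some.metric.name)", "-30d", "now"))[0]), where some.metric.name is a stand-in for the real metric.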
ianw"OpenDev currently provides almost 6000 testing virtual-machines per week, with steady growth as community engagement with ARM increases."23:50
clarkbianw: I made a couple of edits to that line in the etherpad (also its per month not week right?)23:51
ianwoh yeah, sorry23:51
clarkbI think that looks great23:52
ianwok, i can send it on tomorrow maybe, let it marinate and if anyone else wants to chime in23:52
clarkbsounds good. I'll ask foundation to take another look at it too23:53
