Wednesday, 2022-02-09

00:00 <ianw> https://zuul.opendev.org/t/openstack/build/3ce64b0d3e3845609c14fcd26be34db4/console
00:00 <ianw> it's coming from
00:00 <ianw> pip install -c /home/zuul/src/opendev.org/openstack/horizon/upper-constraints.txt -r requirements.txt -r test-requirements.txt
00:01 <fungi> can't hurt to add it, i'm just not expecting it to solve the retry failures i was seeing where ensure-sphinx was breaking
00:01 <ianw> but that's a bit misleading, because aiui the script has already run active.sh by that point, so that pip isn't the system pip
00:02 <ianw> my reading of it was that pip was from that venv ... but could be wrong!
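For context, sourcing a venv's activate script prepends its bin/ directory to PATH, so a bare "pip" afterwards is the venv's pip rather than the system one. A minimal sketch of that pattern (paths illustrative, not the job's actual layout):

    python3 -m venv /tmp/translate-venv
    . /tmp/translate-venv/bin/activate    # prepends the venv's bin/ to PATH
    command -v pip                        # now resolves to /tmp/translate-venv/bin/pip
    pip install -c upper-constraints.txt -r requirements.txt -r test-requirements.txt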
00:02 <fungi> a lot of the failures i was looking at happen in pre-run, before that script ever comes into the picture
00:03 <fungi> e.g. the builds for nova
00:03 <fungi> failures to install pillow into the sphinx venv
00:04 <ianw> hrm, ok, i'm looking at https://zuul.opendev.org/t/openstack/builds?job_name=propose-translation-update&job_name=upstream-translation-update&result=FAILURE
00:04 <fungi> you need to broaden it to include RETRY_LIMIT as well
00:04 <fungi> but yeah, one thing at a time. as you say, we probably have multiple places this is breaking
00:06 <ianw> ok, i see it now.  i'll start making some notes
00:06 <fungi> it could still be the same underlying issue. maybe the ensure-sphinx role needs to upgrade pip
00:10 <fungi> though a bigger problem in my mind is that we're running this on bionic but applying the master branch upper-constraints.txt, which no longer takes older python into account. we may not be building those wheels for bionic, and upstream may no longer be publishing cp36 wheels to pypi either, even though they have a requires_python which still allows 3.6
00:11 <ianw> this is true
00:11 <fungi> so as a result, pip is going to grab sdists of some things, and the projects don't have the necessary build deps in their bindep.txt
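Building Pillow from an sdist needs the JPEG (and usually zlib) development headers on the build host. A hedged example of the sort of bindep.txt entries that would cover that; the package names are assumptions for the dpkg and rpm families:

    # illustrative build deps for compiling Pillow from source
    libjpeg-turbo8-dev [platform:dpkg]
    libjpeg-turbo-devel [platform:rpm]
    zlib1g-dev [platform:dpkg]
    zlib-devel [platform:rpm]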
00:12 <fungi> but yeah, let's try the easy things first and then it's simpler to reason about solutions for what's still breaking after that
00:15 <opendevreview> Ian Wienand proposed zuul/zuul-jobs master: ensure-sphinx: upgrade pip  https://review.opendev.org/c/zuul/zuul-jobs/+/828441
00:16 <ianw> fungi: ^ i feel like that might restore the status quo, maybe?
00:24 <fungi> i expect installing with python 2.7 was hitting some legacy paths through constraints files which masked a bunch of problems, so i'm not getting my hopes up
00:27 <ianw> error: invalid command 'bdist_wheel'
00:27 <ianw> i guess maybe that venv needs wheel too...
00:28 <fungi> yes, a venv doesn't have wheel by default
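"invalid command 'bdist_wheel'" is what setuptools reports when the wheel package is absent, and a freshly created venv ships only pip and setuptools. A small sketch of the failure mode and the usual fix, assuming only an sdist is available:

    python3 -m venv v && . v/bin/activate
    pip install Pillow            # sdist build can fail: error: invalid command 'bdist_wheel'
    pip install -U pip wheel      # with wheel present, the sdist is built into a wheel first
    pip install Pillow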
00:28 <ianw> then again, it also says
00:28 <ianw> The headers or library files could not be found for jpeg,
00:28 <ianw>     a required dependency when compiling Pillow from source.
00:28 <fungi> but also that only comes into play if it's trying to install things from sdist because it can't find a wheel
00:28 <ianw> ... so what is the problem :/
00:29 <fungi> and it's a warning, there's a legacy build codepath which doesn't involve building and installing a wheel
00:29 <ianw> so the theory is basically that we had a wheel for pillow and now don't
00:30 <fungi> a cp27 wheel probably, yes
00:30 <fungi> now it wants cp36 for bionic
00:31 <ianw> Collecting Pillow===8.4.0 (from -c /home/zuul/src/opendev.org/openstack/requirements/upper-constraints.txt (line 97))
00:31 <clarkb> I think it may have had to do with abi3 wheels
00:31 <clarkb> old pip doesn't understand those as valid for any python version iirc
00:32 <ianw> http://mirror.iad.rax.opendev.org/wheel/ubuntu-18.04-x86_64/pillow/
00:32 <clarkb> then once you update pip it recognizes it can install those specially annotated wheels
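That theory is checkable: pip only considers wheels whose tags appear in its compatible-tag list, and correct handling of abi3 tags only arrived in newer pip releases. One hedged way to compare, using pip's (experimental but real) debug command:

    pip debug --verbose | grep -c abi3    # an old venv pip may list few or no abi3 tags
    pip install -U pip
    pip debug --verbose | grep abi3       # newer pip lists tags like cp36-abi3-manylinux1_x86_64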
00:32 <opendevreview> Neil Hanlon proposed openstack/project-config master: Add rockylinux-8 to nodepool configuration  https://review.opendev.org/c/openstack/project-config/+/828435
00:32 <ianw> ahhh ... then the pip upgrade *might* help :)  are we back where we started?! :)
00:32 <NeilHanlon> 😂
00:33 <ianw> i think we're where we've always been, in a huge tangled mess of dependencies that somehow sometimes works
00:33 <NeilHanlon> i.e., python
00:34 <fungi> ianw: i think we'll be at the point that pip will think it's possible to install a newer version of pillow than is available as a wheel, but maybe abi3 works for cp36
00:34 <ianw> yeah, it seems like we need to keep iterating
00:44 <opendevreview> Ian Wienand proposed zuul/zuul-jobs master: ensure-sphinx: upgrade pip  https://review.opendev.org/c/zuul/zuul-jobs/+/828441
00:44 <ianw> now with fewer typos
00:45 <clarkb> ianw: any idea why the base-test log-inventory stuff was out of sync?
00:46 <ianw> i didn't go through the history; i assumed something was tested that didn't make it in
00:47 <clarkb> gotcha
00:47 <ianw> looks like the bits i removed came in via I6c93fd03aadb5e4d15ac7da98887dd7ca4998319
00:48 <ianw> https://review.opendev.org/c/opendev/base-jobs/+/798139
00:48 *** dviroel|ruck|afk is now known as dviroel|ruck
00:48 <ianw> then it looks like https://review.opendev.org/c/zuul/zuul-jobs/+/798087 didn't make it in?
00:54 <clarkb> that might've stalled out due to the zuul fixes that went in mid last year?
00:54 <clarkb> things were more aggressively split out and filtered
00:57 *** rlandy|ruck|bbl is now known as rlandy|ruck
00:57 *** dviroel|ruck is now known as dviroel|ruck|out
00:57 *** dviroel|ruck|out is now known as dviroel|out
01:22 <opendevreview> Merged zuul/zuul-jobs master: ensure-sphinx: upgrade pip  https://review.opendev.org/c/zuul/zuul-jobs/+/828441
01:23 <opendevreview> Merged opendev/base-jobs master: base-test: sync with base/pre.yaml  https://review.opendev.org/c/opendev/base-jobs/+/828439
03:33 <opendevreview> Ian Wienand proposed openstack/diskimage-builder master: Revert "Use rpm -e instead of dnf for cleaning old kernels"  https://review.opendev.org/c/openstack/diskimage-builder/+/827381
03:58 <ianw> ok, it looks like https://zuul.opendev.org/t/openstack/build/0c8aa08db4c844a7bdb70dfe222597ea (upstream-translation-update for nova) passed after making ensure-sphinx update pip in the venv
03:58 <ianw> this still leaves problems with the host type, zanata in general, etc.  but that's for tomorrow :)
04:06 *** rlandy|ruck is now known as rlandy|out
04:39 <opendevreview> Merged openstack/diskimage-builder master: Cleanup more CentOS 8 bits  https://review.opendev.org/c/openstack/diskimage-builder/+/827210
05:02 <opendevreview> Merged openstack/diskimage-builder master: Remove contrib/setup-gate-mirrors.sh  https://review.opendev.org/c/openstack/diskimage-builder/+/827211
05:19 <opendevreview> Merged openstack/diskimage-builder master: General improvements to the ubuntu-minimal docs  https://review.opendev.org/c/openstack/diskimage-builder/+/806308
05:19 <opendevreview> Merged openstack/diskimage-builder master: Remove extra if/then/else construct in pip element  https://review.opendev.org/c/openstack/diskimage-builder/+/822224
07:50 <opendevreview> Merged openstack/diskimage-builder master: Revert "Use rpm -e instead of dnf for cleaning old kernels"  https://review.opendev.org/c/openstack/diskimage-builder/+/827381
08:03 <opendevreview> Merged opendev/base-jobs master: base-test: fail centos-8 if pointing to centos-8-stream image type  https://review.opendev.org/c/opendev/base-jobs/+/828440
08:11 *** amoralej|off is now known as amoralej
08:31 *** jpena|off is now known as jpena
08:54 *** sshnaidm|afk is now known as sshnaidm
09:01 *** ysandeep|out is now known as ysandeep
09:16 <opendevreview> Riccardo Pittau proposed openstack/diskimage-builder master: Fallback to persistent netifs names with systemd  https://review.opendev.org/c/openstack/diskimage-builder/+/828266
09:18 <opendevreview> Riccardo Pittau proposed openstack/diskimage-builder master: Fallback to persistent netifs names with systemd  https://review.opendev.org/c/openstack/diskimage-builder/+/828266
09:18 *** mnasiadka_ is now known as mnasiadka
09:21 <opendevreview> Merged openstack/diskimage-builder master: Don't run functional tests on doc changes  https://review.opendev.org/c/openstack/diskimage-builder/+/825891
09:35 <opendevreview> Merged openstack/diskimage-builder master: fedora-container: pull in glibc-langpack-en  https://review.opendev.org/c/openstack/diskimage-builder/+/827772
10:41 <sshnaidm> cores, please review the perms patch when you have time: https://review.opendev.org/c/openstack/project-config/+/828371
10:41 <sshnaidm> fungi, ^^
11:06 *** rlandy|out is now known as rlandy|ruck
11:10 *** dviroel|out is now known as dviroel|ruck
12:37 <mnasiadka> Good afternoon
12:39 <mnasiadka> Since https://opendev.org/openstack/diskimage-builder/commit/398e07e6f2bb5a2f763a22a8e4801108c242ffe2 landed - is there a slight chance that it would be possible to add Rocky Linux 8 to the possible nodesets in Zuul? Kolla projects would be happy to run their CI on something that is not as unpredictable as CentOS Stream (and there's user interest in adding Rocky Linux support - which we'd like to have properly tested).
12:41 *** ysandeep is now known as ysandeep|break
13:03 *** artom__ is now known as artom
13:07 *** amoralej is now known as amoralej|lunch
13:13 *** ysandeep|break is now known as ysandeep
13:18 <frickler> mnasiadka: seems https://review.opendev.org/c/openstack/project-config/+/828435 is the next step
13:18 <fungi> mnasiadka: it's in progress, i believe we need a dib release and then a version bump in nodepool
13:18 <mnasiadka> oh, great - Neill followed up on that
13:18 <mnasiadka> wasn't aware :)
13:19 <fungi> maybe the dib release already happened while i was asleep
13:20 <frickler> I don't think that that change actually tests builds, so that release+bump may still be needed
13:59 *** amoralej|lunch is now known as amoralej
14:06 *** pojadhav is now known as pojadhav|brb
14:11 *** akahat|rover is now known as akahat|PTO
14:24 *** pojadhav|brb is now known as pojadhav
15:00 *** pojadhav is now known as pojadhav|dinner
15:43 *** ysandeep is now known as ysandeep|out
16:14 *** pojadhav|dinner is now known as pojadhav
16:35 <opendevreview> Clark Boylan proposed opendev/system-config master: DNM change to test and hold gitea 1.16.1  https://review.opendev.org/c/opendev/system-config/+/828586
16:35 <clarkb> fungi: ^ how do I hold for a specific change again? is it --ref refs/changes/xy/abcxy ?
16:36 <clarkb> looks like I need the ps in there too
16:39 <fungi> yes, you need the revision
16:39 <fungi> refs/changes/xy/abcxy/z
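Assembled, the hold request looks something like the sketch below; the job name and reason are illustrative, and the ref encodes change 828586 patchset 1 using Gerrit's last-two-digits sharding:

    zuul autohold --tenant openstack \
        --project opendev/system-config \
        --job system-config-run-gitea \
        --ref refs/changes/86/828586/1 \
        --reason "check gitea 1.16.1 against the changelog" \
        --count 1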
16:40 *** ralonsoh_ is now known as ralonsoh
16:40 <clarkb> thanks. I'm hoping to get gitea 1.16.1 held today so we can look it over and double check it against the changelog, then maybe upgrade soon
16:42 <clarkb> I set that hold up and cleaned up my old gerrit tag pushing hold
16:44 <clarkb> ianw: looks like you may have gerrit 3.4 holds that are no longer required since we upgraded. But I'll wait for you to confirm before doing any cleanup
16:48 *** ykarel is now known as ykarel|away
16:53 *** marios is now known as marios|out
16:55 <clarkb> I suspect https://review.opendev.org/c/openstack/diskimage-builder/+/826976 is the change that ianw is hoping to get sorted for the dib release, based on what was said yesterday
16:58 <clarkb> I'm having a really hard time parsing what that change aims to do
17:00 <clarkb> I guess we want to set up grub without installing a bootloader. But aren't those two things intertwined?
17:27 *** jpena is now known as jpena|off
17:46 <corvus> the mergers look like they may be stuck
17:47 <corvus> infra-root: i think we may be looking at the gerrit slowness again
17:48 <corvus> the mergers are not stuck, they're just getting really slow performance from gerrit on their git ops.  they have a 300 job backlog
17:48 <corvus> clarkb: did we have a next step for debugging that?
17:50 <clarkb> corvus: luca asked for show-queue -w output from when it was happening
17:50 <clarkb> corvus: and maybe we should grab another thread dump since I think the last one captured the very tail end of it
17:51 <corvus> clarkb: do you want to "sudo" and do that?
17:51 <clarkb> ya I'll work on running jstack to capture a current thread dump
17:52 <clarkb> then around the same time try to show-queue -w
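Roughly the two captures in question; the account name and pid discovery are placeholders rather than the production specifics:

    # Gerrit's view of queued and running tasks, unabbreviated:
    ssh -p 29418 admin@review.opendev.org gerrit show-queue -w
    # JVM thread dump of the Gerrit process, to see what each worker is blocked on:
    jstack -l <gerrit-java-pid> > gerrit_thread_dump.$(date +%Y%m%d)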
17:53 <corvus> ah, i've re-learned about my admin account, so i can do show-queue -w now too :)
17:54 <corvus> oh wow the output is a lot different, you can see the jobs waiting
17:54 <clarkb> corvus: yup that was what I was trying to explain to luca. I guess it's a good thing we can actually capture it now. I've done both a jstack and show-queue, which captures data that needs filtering before we can upload it
17:55 <clarkb> if others want to keep poking and debugging I can start working on cleaning up these files
17:58 <clarkb> corvus: one thing I notice is that we've got another old task at the top of the list.
17:58 <clarkb> last time we had the same thing
17:58 <clarkb> I killed the task last time and maybe that helped make things get better?
17:58 <clarkb> I think we didn't see immediate relief but the built-up backlog may explain that
17:59 <corvus> clarkb: yeah let's try that
17:59 <clarkb> corvus: do you want to kill it or should I?
17:59 <corvus> i'm watching a zm log and should notice if progress picks up.  you kill it.
17:59 <ianw> ok one sec while I look up the command again
18:00 <corvus> fwiw, the backlog started at 16:39
18:00 <clarkb> done
18:03 <corvus> there might have been a slight improvement, from ~30s per repo to ~20
18:04 <corvus> i don't think it's enough to keep up with incoming workload though
18:06 <corvus> the queue is decreasing
18:06 <clarkb> pretty sharply too
18:06 <clarkb> so maybe the issue is that old task
18:07 <fungi> no sudden jump in the merger queue this time, it did see a bit of a rise and then climbed a decelerating curve as of ~16:40z
18:07 <clarkb> and hopefully correlation between the thread dump and show-queue prior to removing the old task can help upstream diagnose it. There may also be zuul settings to time out connections for ssh?
18:07 <fungi> the executors were very busy for a while before that too
18:07 <clarkb> we might be able to work around this, if so, by setting a reasonable connection timeout to like an hour?
18:07 <clarkb> I'm going to keep working on sanitizing these files. But maybe someone can look at gerrit config options around that
18:08 <fungi> a largeish stack of nova changes were enqueued into check around that time
18:09 <fungi> yeah, looks like there was a stack rebase and push for those right at 16:40
18:10 <fungi> so it's possible gerrit was already slow, and this was the bump which pushed things over the edge
18:10 <corvus> so do we have 4 slots for servicing this?
18:11 <clarkb> corvus: we should have 100, which is part of what I asked about on the bug I filed
18:11 <clarkb> the thread dump also shows 100 threads exist
18:11 <fungi> that's how it's acting (and how it was acting last time as well) but yeah it's not what we think is configured
18:11 <clarkb> It seems like there is some other limitation (thread contention, locks? I don't know what)
18:11 <corvus> then show-queue has all but 4 git-upload-pack jobs waiting
18:12 <clarkb> yup I think that is why luca wanted the show-queue output.
18:12 <corvus> k
18:12 <clarkb> review02:~clarkb/gerrit_queues.20220209.sanitized should be sanitized. But please double check it, particularly the query tasks, as I'm not sure if we need to scrub out the change identifiers too (I don't think so since all our changes are public)
18:12 <clarkb> now to work on the thread dump
18:14 <corvus> clarkb: sanitized lgtm
18:15 <clarkb> corvus: still sanitizing the thread dump but it looks like some of the waiting tasks are waiting on a lock
18:15 <clarkb> I'm hopeful this will end up allowing this to be understood and fixed given what I'm seeing. This may take some time though as I'm trying to synchronize the sanitized usernames between the two files
18:15 <clarkb> that way they can be directly correlated
18:17 *** amoralej is now known as amoralej|off
18:18 <corvus> queue @ 200 now
18:31 <corvus> there's now a 2m old task at the top, and things are slowing down again
18:31 <corvus> so it does seem like we're right on the edge of holding performance
18:32 <corvus> it finished; so we don't need to kill it or anything, just may be informative.
18:34 <clarkb> it does seem that after I killed the very old task the queue dropped quickly
18:35 <clarkb> makes me wonder if longer-running tasks create a lot of contention somehow
18:35 <corvus> it leveled off while the 2m old task was there and has resumed falling
18:35 <corvus> well, if we can only service 4 of them at a time, then our capacity drops by 25%
18:35 <clarkb> but also none of this explains why 95% of our interactive ssh worker threads are doing nothing. Unless the same lock causes contention with thread assignment
18:35 <clarkb> ya that
18:36 <corvus> you can now see the little shelf on the merger queue graph from that 2m job
18:37 <corvus> (was probably 3+ minutes by the end, which more closely corresponds with the shelf length)
18:48 <corvus> <100
18:49 <corvus> still seeing about 20s per repo on the merger
18:50 <clarkb> ok I've gone through review02:~clarkb/gerrit_thread_dump.20220209.sanitized and cleaned up what I could find. The diff against gerrit_thread_dump.20220209 will show you what I changed
18:50 <clarkb> there were no http entries that needed cleanup this time from what I could see
18:51 <clarkb> corvus: I think the two big questions are "why are we slowing down in general" and "why are we not using the many free ssh worker threads that could be used to spread out the load"
18:51 <corvus> yeah, though if the slowdown is entirely just waiting for threads, it could be only one question.
18:52 <clarkb> indeed
18:53 <clarkb> sshd.idleTimeout and sshd.waitTimeout may be useful here depending on whether or not longer-running requests are a problem
18:53 <clarkb> that might impact zuul listening on ssh event streaming though
18:54 <clarkb> anyway if y'all can take a look at those two files on review02 and give them a critical eye I can update the issue with a bit more info
18:55 <clarkb> hrm we set idleTimeout to an hour already
18:55 <clarkb> and waitTimeout defaults to 30s so maybe not
18:55 <corvus> wonder why that job was there for so long then
18:57 <clarkb> oh! I've just now noticed that luca wanted -q added to show-queue
18:58 <clarkb> unfortunately too late to add that now, but if you run it it gives more detailed information on the internal queues too
18:58 <clarkb> notably all those tasks are apparently in the batch worker queue, not the stream-events queue
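For reference, the flag combination being described; both are documented show-queue options:

    ssh -p 29418 admin@review.opendev.org gerrit show-queue -w -q
    # -w / --wide: don't truncate task descriptions
    # -q / --by-queue: group tasks by worker queue, which is what separates the
    #     batch (service user) pool from the interactive pool in the output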
18:58 <clarkb> I think we are using 2 batch threads, which is the default on a multicore system, based on that
18:58 <clarkb> and zuul et al. are being scheduled to that despite my earlier group membership checking
18:59 <clarkb> maybe zuul is a member of service users and I missed it before?
18:59 <corvus> oh interesting.  the split between batch and interactive seems arbitrary too.
19:00 <corvus> there are some 3pci in batch and some in interactive.
19:00 <fungi> i think it's based on group membership
19:00 <clarkb> ya I think we may have to address that via groups
19:00 <clarkb> reading the config docs, if we set batch threads to 0 then the interactive and non-interactive users share a threadpool
19:00 <clarkb> that might be the most straightforward thing for us to do, though maybe not the most correct thing
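The knobs under discussion all live in the [sshd] section of gerrit.config; a sketch of the relevant settings as described in this conversation, not the complete production config:

    [sshd]
        threads = 100       # interactive worker pool, set long ago
        batchThreads = 0    # 0 = service users share the interactive pool
        idleTimeout = 1h    # drop ssh connections idle for an hour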
19:01 <corvus> we need 3 pools :/
19:01 <clarkb> corvus: humans, zuul, else?
19:01 <corvus> ya
19:01 <fungi> or arbitrary preemptable pools
19:01 <fungi> but yes, more than 2
19:01 <fungi> our users, our ci systems, our users' ci systems
19:02 <clarkb> any objections to me pushing a change to set batchThreads to 0 and share for now? or would we prefer to look into cleaning things up and splitting the pools
19:02 <corvus> clarkb: fwiw your file looks good, but it seems like we have things to try before we necessarily go back to gerrit folks
19:02 <clarkb> ++
19:03 <fungi> maybe an alternative would be to put third-party ci accounts back into the normal user threadpool, and dedicate the batch pool to zuul?
19:03 <fungi> but i'm fine with trying the giant shared pool first
19:03 <clarkb> fungi: Service Users also impacts the attention set stuff, unfortunately
19:03 <clarkb> overloading those two sets makes this really awkward for us
19:03 <fungi> that does seem like something they ought to consider splitting
19:05 <corvus> why do each of the queues have 2 worker threads?
19:05 <corvus> oh wait, batch has 2, interactive has 100?
19:05 <clarkb> ya that
19:06 <clarkb> corvus: we've long set sshd.threads to 100 (since like 2.8 maybe? it's old) but then recently, with attention sets and changes along those lines, gerrit recognizes service users and split those out into a separate pool
19:06 <clarkb> corvus: I thought the default was that threads were always shared though, but maybe that changed in 3.4?
19:06 <corvus> i think the batch pool has been around for a while
19:06 <clarkb> corvus: it has been, but I was fairly certain the default was to share threads, not to only use 2. I think that may be the change
19:07 <clarkb> I'm trying to find the 3.3 docs to confirm
19:07 <corvus> what was the batch setting for if not to segregate threads?
19:07 <corvus> though it's really not that important
19:07 <fungi> looks like the mergers have fully caught up again now
19:08 <corvus> the important thing is which group of users we want to have sidelined when someone holds an ssh operation open for 2 weeks
19:08 <clarkb> corvus: ya, ideally the idleTimeout would address that and then we can have enough headroom on thread count that we largely avoid it. Or add additional pools
19:08 <clarkb> corvus: fwiw the 3.3 docs say 2 is the default for batchThreads, so maybe we just never noticed until recently
19:09 <corvus> do you have the unsanitized queue dump?
19:10 <opendevreview> Clark Boylan proposed opendev/system-config master: Set Gerrit sshd.batchThreads to 0  https://review.opendev.org/c/opendev/system-config/+/828605
19:10 <clarkb> corvus: I do
19:10 <clarkb> corvus: one sec, I'll put it on review02
19:10 <corvus> i forgot which user was the one running the task from feb 4 that we killed; would like to confirm they're in the batch worker group
19:11 <clarkb> corvus: it's on review02 now without the sanitized suffix
19:11 <clarkb> I think it was userA though
19:13 <clarkb> "gerrit ls-members --recursive 'Service Users'" <- that doesn't show me zuul, so it must be finding zuul in that group via some other method?
19:15 <corvus> priority = batch group Non-Interactive Users
19:16 <clarkb> Non-Interactive Users got renamed to Service Users. I think the text may not have updated in that move because the uuid for the group stayed the same, and that is what gerrit uses
19:17 <clarkb> so ya, it would be membership of that group or another priority = batch entry for additional groups
19:17 <clarkb> I half wonder if ls-members doesn't recurse properly and we should "sudo" and check via the web ui
19:18 <clarkb> https://osm.etsi.org/gerrit/Documentation/rest-api-groups.html#list-subgroups should be able to do it too
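A hedged sketch of that REST call; the credentials are placeholders, the group name is URL-encoded, and Gerrit's )]}' XSSI prefix on JSON responses has to be stripped before parsing:

    curl -s -u user:http-password \
        "https://review.opendev.org/a/groups/Service%20Users/members/?recursive" \
        | tail -n +2 | python3 -m json.tool    # drop the )]}' line, pretty-print the rest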
19:19 <clarkb> heh, it's fun how you get different gerrit installations back from google when googling this stuff
19:19 <corvus> yeah, i think it's worth exploring in the web ui; i have to run for a bit
19:19 <fungi> the rest api has a recurse option for group member listing
19:20 <fungi> clarkb: oh, so does ls-members... "--recursive : to resolve included groups recursively (default: false)"
19:21 <clarkb> ya, but if I do --recursive it returns the same results, and from everything we can tell somehow zuul is in that group
19:21 <fungi> oh
19:21 <clarkb> basically I'm not trusting it :)
19:21 <clarkb> because otherwise how is zuul ending up in the batch queue
19:21 <fungi> ahh, now i see in scrollback you already discussed --recursive
19:21 <clarkb> I need to take a quick break myself and find something to drink. But I can settle back in and look via the web ui
19:29 <fungi> mgagne: we've been contacted by someone in sales at inap saying they're turning off the iweb cloud in mtl01. i guess that part didn't transfer to leaseweb?
19:29 <mgagne_> fungi: I didn't know it was that far into the process and that sales was going to do the communication themselves.
19:30 <fungi> ahh, well it was someone offering to put us in touch with sales reps about pricing out access to their vmware cloud
19:31 <fungi> i just wanted to make sure it's actually going away and not moving to leaseweb before i replied further
19:31 <mgagne_> I don't think you should have received this email, not in that format. But it's true that we are planning on sunsetting the openstack platform in mtl01. The sequence/date is yet tbd.
19:32 <fungi> seems like they may have reached out to contact addresses they had on file for the accounts in that environment, since it went to our infra-root alias inbox
19:33 <clarkb> fungi: corvus: "Continuous Integration Tools" and "Third-Party CI" are both group members of Service Users. The ssh ls-members is bugged, I guess. Zuul is a member of Continuous Integration Tools
19:33 <mgagne_> I wonder how they got that specific email tbh.
19:33 <clarkb> fungi: mgagne_: we cc'd you on the email thread, but maybe not to a currently valid email?
19:34 <clarkb> fungi: corvus: now that we know that, I think we can consider this provisionally solved and work to adjust the batch users thread pool size. I expect that setting it to 0 and sharing is probably the least bad option for us currently due to the overloaded use of service users with attention sets
19:35 <clarkb> it may be possible to do some followup where we remove batch priority from service users and assign it to Continuous Integration Tools, and then have third-party ci remain non-participatory with attention sets and go into the interactive pool or similar. But that would probably need more testing and planning. Setting batchThreads to 0 should be fairly safe
19:35 <fungi> mgagne_: anyway, thanks for the details. it sounds like i have enough info to be able to reply. as far as timeline, they said we have 90 days to migrate to vmware before they're turning the environment off
19:36 <clarkb> I've de-escalated my privs in the web ui now
19:36 <mgagne_> I'm currently in a meeting, I'll get back to you in ~60m at most.
19:37 <fungi> mgagne_: no worries, take your time. i appreciate the help. i'll wait to reply in that case
19:39 <clarkb> ++ thank you
19:44 <clarkb> infra-root: https://104.130.74.7:3081/opendev/system-config has been held for gitea checking. Though I think I'll defer a bit on that until we can close out the gerrit issue
20:54 *** timburke__ is now known as timburke
21:02 <opendevreview> James E. Blair proposed openstack/project-config master: Add zuul-web stats to zuul-status page  https://review.opendev.org/c/openstack/project-config/+/828609
21:03 <opendevreview> James E. Blair proposed openstack/project-config master: Add zuul-web stats to zuul-status page  https://review.opendev.org/c/openstack/project-config/+/828609
21:03 <corvus> i would like to do a rolling restart of zuul now.
21:04 <corvus> i'm going to run https://review.opendev.org/828176 and then do the scheduler/web part manually at the end
21:04 <corvus> and by "now" i mean in about 5 minutes, after i confirm the image promotion
21:05 <fungi> sounds good to me. thanks!
21:12 <clarkb> zuul is happy with https://review.opendev.org/c/opendev/system-config/+/828605 if we want to go ahead and land that and plan for a gerrit restart later today
21:13 <corvus> lgtm
21:14 <fungi> and in it goes
21:15 <sshnaidm> cores, please merge a patch about perms to delete branches: https://review.opendev.org/c/openstack/project-config/+/828371
21:17 <corvus> pull finished, restarting now
21:19 <fungi> i'm getting started on dinner but am somewhat around and can pivot to help if something goes sour
21:25 <corvus> it does not appear that the mergers exit appropriately on 'zuul-merger stop'
21:26 <corvus> i hard-stopped them after gracefully stopping them.  i think that will stop them without errors.  they're probably just hung on a thread that doesn't exit.
21:27 <corvus> ze01-06 are gracefully stopping now.
21:32 <fungi> interesting in relation to clarkb's graceful change
21:35 <clarkb> corvus: oh fun
21:48 <clarkb> corvus: we can probably do a stop against one of the mergers, then ask it for a thread dump to see what it is held up on
21:48 <clarkb> or I suppose just running it locally may reproduce it
21:49 <clarkb> looking at the gerrit code for ls-members, I don't see anything that might recurse in the actual implementation
21:57 <mgagne_> fungi: sounds like they are planning on 90 days. We had internal discussions about it but no timeline. I guess we have one now. Hopefully they didn't confuse mtl01, which used to be at INAP, with our other OpenStack platform at iWeb.
22:00 <corvus> clarkb: yeah, can repro locally.  i'm working on a fix.
22:01 <fungi> mgagne_: they said that the iweb.com domain went to inap in the sale, and that the identity.api.cloud.iweb.com endpoint we're communicating with is what's being shut down. do we need to switch hostnames there?
22:02 <fungi> we're definitely using the mtl01 region, but maybe we need to adjust the api url?
22:03 <mgagne_> ok, I wonder what was in that email, it's a bit confusing.
22:03 <fungi> there wasn't much in the email, which is why i started asking them questions
22:04 <mgagne_> For INAP customers, they need to move the URLs used for the OpenStack API to inap.com. Although it was an INAP product, it was using iweb DNS. Now they have to move to inap DNS because, well, they don't own iweb.com
22:04 <fungi> i said we were using https://identity.api.cloud.iweb.com and they replied "That domain transferred to INAP in the sale.  That is exactly what we are shutting down."
22:04 <mgagne_> * face palm *
22:05 <fungi> yeah, maybe language barrier? it's possible jennifer curry at inap didn't completely understand what i said we're using
22:05 <mgagne_> I think there is confusion and the request got lost in translation or across departments.
22:07 <fungi> that wouldn't surprise me at all. this is a complicated field ;)
22:07 <mgagne_> For INAP customers: they need to move to cloud.inap.com. cloud.iweb.com will be phased out. It's the same product, different DNS. For mtl01, it's gonna be phased out, there is no replacement. So updating DNS won't help. Now, the official timeline for the mtl01 phase-out wasn't known to our team. But we had discussions about how/when to make it happen.
22:08 <mgagne_> We (I) didn't communicate it to you yet because we didn't have an official timeline/answer about it. But now you know. it won't happen overnight, but it's gonna happen at some point.
22:11 <fungi> oh, no worries, i was just reaching out trying to understand, sounds like there are a lot of people not talking to one another. this actually started more than a week ago (we received the first communication on february 1 and responded a couple of days later)
22:12 <fungi> i guess inap chose a timeline and didn't pass that information along right away
22:13 <mgagne_> I can't officially talk for my new employer, but there is no plan to work with OpenStack in the near future. But they also didn't want it to be perceived as officially closing the door forever, whatever that would mean.
22:14 <fungi> i definitely don't see it that way either. the help we've had has been great and much appreciated
22:14 <mgagne_> =)
22:14 <fungi> i mainly just need to know whether we should turn off our use of that environment right away or wait until the dns record disappears
22:15 <fungi> nodepool will handle it fine either way
22:15 <opendevreview> Merged opendev/system-config master: Set Gerrit sshd.batchThreads to 0  https://review.opendev.org/c/opendev/system-config/+/828605
22:15 <mgagne_> DNS will be there for a couple months.
22:15 <mgagne_> I'll keep the boat floating until they officially ask us to shut it down.
22:16 <fungi> sounds like we can keep our configuration in place in the meantime then. thanks for all the clarity!
22:16 <mgagne_> We have a lot of other things to take care of before mtl01.
22:17 <mgagne_> np, sorry that it happened that way, I didn't know their plan about communication.
22:17 <mgagne_> #canadians
22:17 <fungi> i would say "don't apologize" but... canadians
22:19 <mgagne_> :D
22:33 *** dviroel|ruck is now known as dviroel|out
22:38 <opendevreview> Merged openstack/project-config master: Add zuul-web stats to zuul-status page  https://review.opendev.org/c/openstack/project-config/+/828609
22:39 <opendevreview> Ian Wienand proposed zuul/zuul-jobs master: [dnm] testing https://review.opendev.org/c/opendev/base-jobs/+/828440  https://review.opendev.org/c/zuul/zuul-jobs/+/828615
22:47 <ianw> ^ "msg": "The conditional check '{{ item }} == 'centos-8'' failed. The error was: error while evaluating conditional ({{ item }} == 'centos-8'): 'centos' is undefined
22:47 <ianw> ... and that is why we test base job additions :)
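The error message points straight at the bug: a when: clause is already evaluated as a Jinja expression, so wrapping the variable in {{ }} renders it first, and the rendered result is then parsed as a bare expression in which centos-8 reads as the undefined variable centos minus 8. A minimal illustration of the broken form and the fix, as an Ansible task fragment:

    # broken: "{{ item }}" renders to e.g. centos-8, which Jinja then parses as (centos - 8)
    when: "{{ item }} == 'centos-8'"

    # fixed: reference the loop variable directly
    when: item == 'centos-8'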
22:49 <corvus> i love how excited we all are when the system tells us we're wrong :)
22:50 <fungi> i've come to accept zuul's judgement, that i am nearly always wrong
22:51 <ianw> heh, i've never thought of it like that, but so true!  clarkb ^ might be one for your talk.  we actually get excited when we thought something would work and it fails, because it means we just avoided a big production mess :)
22:52 <corvus> ianw: the grafana change is deployed already.  thanks!  (for the review and deployment speedup)
22:53 <fungi> earlier today i tried to make a trivial zuul docs change. i attempted to test it locally, but `tox -e docs` wanted more than the 3.5gb i had available, so i punted it up to review and zuul let me know that my assumption about sphinx treating implicit labels the same for :ref: directives as it does for normal link targets is wrong
22:54 <fungi> i was quite sure it was fine, but happy to have been proved wrong
22:55 <fungi> humility as a service
23:00 <opendevreview> Ian Wienand proposed opendev/base-jobs master: base: fail centos-8 if pointing to centos-8-stream image type  https://review.opendev.org/c/opendev/base-jobs/+/828437
23:00 <opendevreview> Ian Wienand proposed opendev/base-jobs master: base-test: fix typos in centos-8 detection  https://review.opendev.org/c/opendev/base-jobs/+/828616
23:11 <ianw> fungi / ianchoi[m]: i'm guessing from https://zuul.opendev.org/t/openstack/builds?job_name=propose-translation-update&job_name=upstream-translation-update&skip=0 the translation jobs are roughly back in shape
23:11 <fungi> oh, awesome! thanks ianw, less complicated than i had feared
23:12 <ianw> the only failure seems to be possibly just a network blip -> https://zuul.opendev.org/t/openstack/build/397db8ca6c204fca8620d7c0a470959b/console
23:13 <fungi> only so many times we can defibrillate zanata though
23:13 <ianw> fungi: i definitely agree with your analysis though, it's a ticking time-bomb of fair complexity
23:14 <fungi> "the patient miraculously survived" <furtive glance at other soap opera actors>
23:22 <corvus> one of the executors has finally finished stopping :)
23:22 <fungi> that's reassuring. the others will fall like dominoes
23:22 <corvus> [in low gravity]
23:23 *** rlandy|ruck is now known as rlandy|out
23:24 <fungi> better than witnessing their infinite fall into the event horizon of a black hole
23:26 <clarkb> ianw: ++
23:27 <clarkb> I got new hardware today and am flipping back and forth between it and the old one so I can get real work done too
23:27 <clarkb> I should probably just put it down for a bit though
23:27 <clarkb> turns out relatively high-res displays in a small form factor cause a bunch of random things to be weird
23:28 <fungi> or force yourself onto the new hardware and fix up whatever's missing as you go
23:28 <clarkb> fungi: I find that I have a really hard time doing that :) I need xmonad and firefox set up just so, and so on
23:29 <opendevreview> Merged opendev/base-jobs master: base-test: fix typos in centos-8 detection  https://review.opendev.org/c/opendev/base-jobs/+/828616
23:29 <fungi> after i found a scalable terminal, i was all set
23:29 <clarkb> oh ya, that's the other thing: fonts, and getting the terminal set up so it doesn't get in xmonad's way with a bunch of menu bars
23:30 <clarkb> I could probably automate some of this, but xfce (and I think other desktops) have so much config in a registry-like db these days
23:30 <clarkb> looks like the batchThreads gerrit change is ready
23:30 <clarkb> s/ready/in place on review02/
23:31 <clarkb> is now a bad time to restart for that? gate queues seem pretty quiet and I don't see release jobs
23:32 <fungi> executors are still stopping, but that's probably not going to make it a worse time for a gerrit restart
23:35 <corvus> no objection from me
23:35 <clarkb> ok, I'll finish this zuul stopping fixup change review, then restart gerrit
23:38 <opendevreview> Steve Baker proposed openstack/diskimage-builder master: Replace kpartx with qemu-nbd in extract-image  https://review.opendev.org/c/openstack/diskimage-builder/+/828617
23:38 <fungi> interesting problem with booting a snapshot of the ethercalc server... we put the software for it on the ephemeral disk mounted on /opt, which isn't included when making a server image
23:39 <fungi> rsync to the rescue
23:39 <clarkb> ok, I'm going to prep for a gerrit restart. Shouldn't need a new image. Will just be a docker-compose down then up -d
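The restart being described, sketched; the compose directory is a placeholder path, not necessarily where it lives on review02:

    cd /etc/gerrit-compose     # placeholder for the gerrit docker-compose directory
    sudo docker-compose down   # stop and remove the container
    sudo docker-compose up -d  # recreate it, picking up the new gerrit.config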
23:40 <clarkb> zuul queues still look good. I'm proceeding
23:40 <clarkb> oh, I just realized that the batchThreads change will conflict with fungi's normalization change of the gerrit config
23:40 <clarkb> fungi: we should land your normalization change and land the change to force case-sensitive users soon
23:40 <clarkb> but back to restarting gerrit
23:41 <fungi> which normalization change? i've clearly not told myself what it is i'm working on lately
23:42 <clarkb> fungi: you added a bunch of tabs for consistency iirc
23:43 <clarkb> the web ui seems to be up. One thing I noticed is that changes loaded immediately after the restart did not have diff or file info
23:43 <clarkb> wait 30 seconds and refresh and it shows up
23:43 <fungi> oh, i thought the tabs merged already
23:45 <clarkb> there don't appear to be tabs in the diff for my change? https://review.opendev.org/c/opendev/system-config/+/828605/1/playbooks/roles/gerrit/templates/gerrit.config.j2 or maybe we need to add more tabs?
23:45 <clarkb> Zuul shows up in the interactive queue worker list now when doing a show-queue -w -q
23:47 <clarkb> fungi: ya, looks like it wasn't a complete edit. Just partial. That explains my confusion
23:47 <clarkb> #status log Restarted Gerrit to pick up sshd.batchThreads = 0 config update
23:47 <opendevstatus> clarkb: finished logging
23:49 <clarkb> I notice that apple's web crawler is tripping over the changes that are in a sad state, which our reindexing complains about too
23:49 <clarkb> I don't think there is much we can do about that
23:50 <corvus> regarding the zuul restart, i will likely allow the executor restart to continue tonight and then do the scheduler+web first thing tomorrow
23:51 <corvus> (unless someone feels adventurous overnight; but i think running half-upgraded for a while is fine)
23:52 <fungi> i'll consider it a valuable experiment
