Thursday, 2020-09-17

ianwfrickler / AJaeger: was wording the only issue with ?  it would be good to get that stack in, but particularly the f31 drop05:29
AJaegerianw: yes, I think so - gave my +2.05:57
AJaegerianw: found a few nits in
ianwfrickler: at least for zuul-jobs, i think let's drop it rather than have people work on it06:24
ianwi would like to see it working with devstack, i think there has been some initial work06:24
fricklerianw: oh, that only changes zuul-jobs, not retire f31 nodes completely, that's fine then06:30
fricklerianw: lyarwood started but there seems to be some issue with dhcp to instances06:31
ianwclarkb: 3.3.0 dib released, should build suse07:49
ianwfrickler/AJaeger: thanks... in the meantime haskell has updated and dropped f32 from the copr repo, and so the change is broken in the gate :/07:57
AJaegerianw: oh fun ;(08:35
AJaegerfrickler: could you review 751975, please?08:35
clarkbianw: thanks! looks like now we need to land a nodepool change to get a new nodepool image15:51
clarkbworking on sorting that out next15:51
fungiour storyboard docker image promotion is apparently breaking on getting a jwt token:
clarkbfungi: we may have to create the storyboard image under our opendevorg ?17:29
clarkband/or ensure we've got the correct creds in there17:29
fungiwe were publishing images there until a few months ago17:30
clarkboh I didn't realize storyboard was already publishing images17:30
clarkband updated within the last 3 days17:31
fungithe gate job is uploading them17:31
fungiit's the promote we're failing17:32
fungiif it were the creds at issue, i would expect them to be the same17:32
clarkbagreed. In this case we're manually manipulating the api to get a token then promote an image (rather than using the client itself)17:33
clarkbI guess they could've updated things breaking that17:33
clarkbI'm fairly sure we haven't updated things on our side but we should double check that too17:33
clarkbwe may have to emulate what ansible is doing (or run it locally without the no_log: false stuff) and see what the error is17:34
fungii wonder if this is impacting any other users of the job17:34
fungizuul-promote-image worked today just fine17:36
clarkb ya thats the most recent one17:37
clarkbwas about 12 hours ago17:37
clarkbcould be a one off17:38
clarkbdocker servers just had a sad17:38
clarkbmaybe we should try a reenqueue first and see if it is persistent?17:38
fungii can try to reenqueue the change in promote, yeah17:38
clarkband if it happens again then we try to reproduce locally so we can see what the failure is17:38
fungiokay, reenqueued now17:42
clarkb is the task we already retry it a bunch (so failures are maybe common?)17:44
clarkbthat looks pretty easy to reproduce locally with ansible though which is good17:44
fungifailed on the same task again17:45
clarkbwe can watch the nodepool image promotion that should happen soon too17:47
* clarkb looking to see what secrets we use for this job17:47
fungiyeah, i wonder if we're merely referencing the wrong secret17:48
clarkb is what we use17:48
clarkb same username there at least17:49
fungiyeah, just checked, storyboard-upload-opendev-image and storyboard-promote-opendev-image use the exact same vars/secret ref17:50
clarkbI think we need to take that task and run it locally with the credentials to see what the log details are17:51
clarkbmaybe the return code is no longer 200 (201 would make sense possibly?) or similar change with the api17:52
clarkbfungi: the credentials appear to be in the normal location17:52
clarkbdo you want to try doing the local run or should I spin up a local ansible venv for that?17:52
clarkbfungi: I'm reading the ansible docs and it looks like the token request should be a POST but I think we do a GET based on default uri module behavior17:57
clarkb vs
clarkbwe should run it locally and reproduce the failure then add method: POST to the task and see if it works17:58
fungistrange though, zuul-promote-image is parented on the same opendev-promote-docker-image that storyboard-promote-opendev-image is, and worked fine18:01
clarkbmaybe it changed in the last 12 hours?18:03
funginah, the earlier failure was from days ago18:05
fungithis job has never actually run successfully:
fungidating back to december (when it was first added, i guess)18:07
fungithe oldest build we have ansible data from still is which failed almost a week ago18:09
fungisame task18:09
fungibut i wouldn't be surprised if it's been failing that way ~forever18:10
fungimaybe the credential used in the storyboard jobs isn't authorized for whatever action it's trying to perform?18:10
clarkbthe task that is running doesn't have any context like that yet18:11
clarkbits just authing in a basic fashion18:11
clarkbI've got a local play running now that works how I expect with bogus creds18:11
clarkbnow to try with real creds18:11
fungithe upload role uses `docker login` instead, looks like18:13
clarkbI am not able to reproduce using real creds18:13
clarkbI get a token18:13
clarkbdoes the secret possibly have whitespace on either end of it?18:14
fungimaybe the way the credentials are encoded works in the docker login call but not with a uri task? trailing newline?18:14
clarkbmaybe other things chomp that18:14
fungihah, minds going to the same place18:14
clarkbfungi: maybe just reencrypt it since the utility for that chomps whitespace now I think18:15
fungitrying to decrypt it now18:15
clarkbthat works too :)18:16
fungimainly to work out what's there18:16
clarkbfwiw I think we should switch that to a POST method too but I won't go changing things until we figure this out18:16
clarkbcorvus: ^ fyi do you know why we don't use method: POST to auth with docker hub?18:17
fungiyup! there's a trailing newline on the password string18:17
fungiokay, so we need to replace it with a reencrypted copy that has no newline18:18
fungii already have the string here, so i can push that18:18
clarkbyup at least reproducing works locally when there is no trailing new line :)18:18
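The trailing-newline failure found above can be sketched as follows. This is only an illustration, not the actual promote role: the `build_auth_payload` helper and the sample credentials are hypothetical, and it assumes the token request sends the username and password as a JSON body.

```python
import json


def build_auth_payload(username, password, strip=False):
    """Return a JSON login body, optionally stripping whitespace from the secret."""
    if strip:
        password = password.strip()
    return json.dumps({"username": username, "password": password})


# Hypothetical secret that was encrypted with a trailing newline.
stored_password = "s3cret\n"

raw = build_auth_payload("exampleuser", stored_password)
clean = build_auth_payload("exampleuser", stored_password, strip=True)

# The newline survives JSON encoding, so the two request bodies differ
# and the server would be sent a different password in the first case.
print(raw)
print(clean)
```

Re-encrypting the secret without the newline, as discussed, removes the need for any stripping in the task itself.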
fungii must have a change in some project which is taking gertty forever to sync things18:22
clarkbalso interesting that that must mean docker tooling chomps18:24
fungii think it's the command line parsing doing that18:24
fungiwe inline the password in a command task string18:24
fungiso the shell just sees additional whitespace as a separator18:24
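A rough way to see why the inlined password still worked for `docker login`: when the command is passed as one unquoted string, word splitting treats the trailing newline as a separator. The sketch below uses Python's `shlex` as an approximation of shell splitting; the username, password and exact command line are hypothetical.

```python
import shlex

# Hypothetical secret with the stray trailing newline.
password = "s3cret\n"

# The password is interpolated unquoted into a single command string.
command = f"docker login -u exampleuser -p {password}"

# Splitting on whitespace drops the newline, so the final argument is clean.
argv = shlex.split(command)
print(argv)
```

A task that passes the same value as a structured field, rather than through command-line parsing, gets no such cleanup.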
fungion a related note, it would be nice to finally get back18:25
clarkbthe linaro mirror was off, I have started it. Noticed because nodepool's image builds were unhappy18:42
clarkbwe should maybe move nodepool's image builds to pull the arm wheels from an x86 mirror I'll look at that after lunch18:42
fungiianw may have set up a netconsole stream for that already18:43
clarkbfungi: ^ fyi I think we'll need to reenqueue the nodepool change18:43
clarkbI smell lunch now, back in a bit to look at that more18:43
fungiclarkb: or approve your one-liner followup?18:43
fungiif my gertty ever gets done trying to consume all my swap space i'll review it18:44
fungii should get around to switching to a vm with more than 1gb ram18:44
fungidiscussion over in #openstack makes me realize that lower-constraints jobs don't get the benefit of our prebuilt wheelhouse in many cases because they can specify different versions from what the central requirements upper-constraints.txt does19:08
fungilike designate trying to install an old version of cffi with a new python interpreter19:09
clarkbespecially if upper constraints skips versions19:09
fungiwell, also upper-constraints is going to use them with relatively contemporary python interpreter versions19:14
fungiso even if it did at one time include cffi 1.11.5 it likely moved on to list a newer version before python3.8 even existed19:15
fungiand therefore would never have been built for cp3819:15
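The point about cp38 can be illustrated with wheel filename tags. This is a simplified sketch: the filenames are hypothetical examples, and the check ignores generic tags such as `py3` and `abi3` compatibility, which a real installer also considers.

```python
def wheel_python_tags(filename):
    """Return the python tags from a wheel filename.

    Wheel names end in ...-{python tag}-{abi tag}-{platform tag}.whl,
    and the python tag may hold several dot-separated values.
    """
    stem = filename[: -len(".whl")]
    python_tag = stem.split("-")[-3]
    return set(python_tag.split("."))


def usable_on(filename, interpreter_tag):
    """True if the wheel was built for the given interpreter tag (exact match only)."""
    return interpreter_tag in wheel_python_tags(filename)


# Hypothetical wheelhouse contents: an old release built before python3.8 existed,
# and a newer release built for cp38.
old_wheel = "cffi-1.11.5-cp37-cp37m-manylinux1_x86_64.whl"
new_wheel = "cffi-1.14.2-cp38-cp38-manylinux1_x86_64.whl"

print(usable_on(old_wheel, "cp38"))
print(usable_on(new_wheel, "cp38"))
```

A lower-constraints job pinning the old version on a python3.8 node would find no matching prebuilt wheel and fall back to building from source.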
clarkbpypi's bandersnatch mirror is still not done rebuilding fwiw19:31
fungino surprise there19:32
clarkbit will be interesting to see if it comes in under 12TB :)19:34
clarkbfungi: corvus I think is a safe followup to getting all the zuul and nodepool processes on zk tls19:47
clarkb addresses an annoying thing building new servers too19:48
corvusclarkb: lgtm; if we need to zk-shell we'll need to install and run it on a zkXX host after that20:28
clarkbyup, I've already got my venv on zk01 from previous debugging20:29
corvuskk, i think i used nb01 in the past; others may have as well20:30
corvusor nl01.  or something starting with an n.20:30
funginl01. i'm also still able to run the nodepool cli from there for the moment20:45
clarkbyou'll still be able to if you exec into the container20:46
fungicool, i'll remember to use docker-compose exec for that in the future20:46
clarkbI've abandoned since we seem to be moving to f32 and not bother with f3120:52
ianwclarkb: did nb03 go off with the mirror?22:23
clarkbianw: no22:23
clarkbor at least the nova api said it was up22:23
clarkbI didn't try to ssh in22:23
johnsomFYI, I just saw a docs job timeout as well. I know Julia mentioned this earlier in the week. Job setup, before the tox ran took 29.5 minutes. Much longer than I would have expected. The actual docs work only got two minutes to run before being killed.22:26
clarkbjohnsom: can you link to a log?22:27
clarkbin TheJulia's case I think we confirmed that pdf generation was slow?22:27
clarkbjohnsom: installing pdf prereqs was slow22:28
johnsomYeah, that is what I see too22:28
clarkbI wonder if the ansible console log stuff will give us more info than what we get in the terminal console log22:28
johnsom21 minutes in fact22:29
clarkb if I had to guess its the tex stuff22:29
clarkbsince that usually expands to a billion packages, but confirming that with the ansible stuff would be good22:29
johnsomI really haven't seen this before, but thought I would mention it as someone else had also run into it.22:35
johnsomSeems to be a trend22:39
fungimaybe related to the recent change which added texlive-full to docs builds?22:40
fungithat drags in an insane number of dependencies22:40
clarkbfungi: ya I may have exaggerated when I said a billion but its pretty close :)22:41
clarkbarg the console stuff doesn't work when we timeout22:42
clarkbI guess we can spelunk in the json22:42
fungii'll see if i can find where that was22:42
johnsomIt looks like 458 packages on one of those older runs22:42
johnsomclarkb I found it here on a previous failure:
johnsomMy log the run hasn't finished yet22:42
johnsomYeah, so 3GB of stuff if I read that correctly22:43
clarkbunfortunately no internal timing22:44
fungi merged just over a week ago22:44
clarkbbut we can probably get an idea for where the cost is if someone runs the same installation in a container or vm and pays attention22:44
clarkbanother option would be to only install what is needed22:45
clarkbrather than every tex package22:45
johnsomThe title pretty much sums it up: TeX Live: metapackage pulling in all components of TeX Live22:46
johnsomIt's a long list of dependencies.22:46
fungiand those dependencies also have dependencies22:46
johnsomSo, yeah, we might want to be a bit more targeted here.22:46
fungigmann: AJaeger: heads up, the texlive-full addition in the prepare-build-pdf-docs role may have started causing job timeouts22:47
johnsomPulling this down for every docs job run can't be a good thing either "Need to get 3096 MB of archives."22:47
fungiyeah, any time i'm installing stuff on debian and it "recommends: texlive-full" i end up doing surgery because, holy heck that's a bunch of packages22:49
johnsomSadly I don't have a focal instance (nor the cycles) to try to roll that back and see what is needed.22:49
fungiand unfortunately the failed build linked in the commit message has since expired out of swift, so not sure what the error was22:51
johnsomYeah, at least that tells us it was the nova docs that triggered whatever the issue was.22:52
fungii guess proposing a revert and then doing a dnm depends-on change for nova ought to recreate it22:53
gmannfungi: johnsom it is in many projects where doc job failed on Focal.22:55
johnsomgmann Yes:
gmannit is easily and 100% reproducible on Focal without that pckg22:56
gmanni mean without texlive-full22:56
johnsomOh, I have no idea22:56
clarkbgmann: right but just because installing every tex package fixes it doesn't mean we need every package22:57
gmannwe might shrink it to the specific pckg which is causing the error, but need to check that as the pdf error was not clear22:57
clarkbthere is likely a subset that would work22:57
johnsomI was just looking at the commit message for the patch that added texlive-full, it referenced a nova docs run.22:59
clarkbthinking about the tex stuff: I think what we should do is install the package list on bionic and xenial and diff the resulting package lists23:12
clarkber bionic and focal23:12
clarkbI'm guessing on bionic it pulled in more for whatever reason23:13
* clarkb spins up some containers23:13
clarkbdo our VMs disable recommends too? that may play into it23:14
ianwi think so, that has got us in trouble sometimes, openafs packages is one i can think of23:15
johnsomLooking at that console log it doesn't appear to be installing the suggested or recommended packages.23:17
clarkbya its 182 vs 197 with bionic having more23:20
clarkbI'm installing them then will do a dpkg listing and diff it23:20
clarkb nothing jumps out to me after I've removed overlap just with different versions23:36
clarkbmy hunch is something moved from requires to recommends though23:37
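The listing-and-diff comparison described above can be sketched like this. The package names and versions in the two listings are made up for illustration, and the parser assumes `dpkg -l` style lines where installed packages start with `ii`.

```python
def package_names(listing):
    """Collect installed package names from dpkg -l style output, ignoring versions."""
    names = set()
    for line in listing.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[0] == "ii":
            # Drop any architecture qualifier such as ":amd64".
            names.add(fields[1].split(":")[0])
    return names


# Hypothetical listings captured from a bionic and a focal container.
bionic_listing = """\
ii  texlive-base       2017.20180305-1  all    example description
ii  example-fonts      1.0-1            all    example description
ii  libexample1:amd64  1.2-3            amd64  example description
"""
focal_listing = """\
ii  texlive-base       2019.20200218-1  all    example description
ii  libexample1:amd64  1.4-1            amd64  example description
"""

# Packages present on bionic but absent on focal, regardless of version.
only_bionic = package_names(bionic_listing) - package_names(focal_listing)
print(sorted(only_bionic))
```

Comparing names only removes the overlap that differs just by version, leaving the candidates that may have moved from requires to recommends.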
*** mlavalle has quit IRC23:44
fungiyeah, probably need to try the pdf build steps on a focal node and then check what it actually says is missing23:51
ianwclarkb: mind a look at ; we should fix up our deb-docker usage23:54
ianwalso which switches in base-jobs23:55
ianwthen i can delete the old, non-updating bits from the mirror23:55
ianw(in case you forgot, we have to use deb-docker/<distro> because upstream keeps the same package names with different versions so we can't build one consolidated mirror)23:57
clarkbianw: the non updating bits are what make things work right now ya?23:57
ianwyeah, it's just stuck in the past from the last time it happened to work23:59

