Wednesday, 2020-04-15

[00:09] <clarkb> mordred: ok I think that is really close but some of the puppet stuff still needs updating; comments on the change
[00:11] <mordred> clarkb: responded
[00:11] <mordred> clarkb: and no - those are remote paths
[00:12] <clarkb> oh I'm going to need to melt my brain again I guess
[00:12] <mordred> clarkb: (I had to check myself)
[00:12] <mordred> clarkb: I actually think we should completely rework the puppet tests to be based on remote_puppet_else
[00:12] <clarkb> mordred: mgmt_ is bridge? and not mgmt_ is remote?
[00:13] <clarkb> mordred: ok so the way this would work is we just copy from /home/zuul/etc into /opt/system-config/production on the remote and nothing else changes?
[00:13] <clarkb> I guess that simplifies things for making changes on bridge
[00:13] <mordred> like - I think it would be nice to get rid of the current puppet jobs completely - make per-service jobs that are essentially "run remote-puppet-else but with only host X" - then we'll be set for each service we transition
[00:13] <mordred> clarkb: yah
[00:13] <clarkb> mordred: ++ on the job idea
[00:14] <mordred> clarkb: because also we need those legacy puppet jobs to die anyway
[00:15] <mordred> clarkb: I mean - really - we could start making service-foo playbooks for everything too - just with roles: - puppet in them
[00:15] <mordred> and completely get rid of else
[00:17] <mordred> corvus: if you have a sec for a re-review of the first patch in the stack: - I can land those when I'm watching in the morning
[00:18] <clarkb> similarly if someone is willing to review those docker-compose upgrade changes I'm happy to babysit those tomorrow as they go in (assuming I get a second +2)
[00:28] <mordred> infra-root: that's and ^^
<openstackgerrit> Ian Wienand proposed zuul/zuul-jobs master: Document output variables
<openstackgerrit> Merged zuul/zuul-jobs master: ensure-pip: Add role
<openstackgerrit> Merged opendev/system-config master: Write out db config for root user
<openstackgerrit> Ian Wienand proposed zuul/zuul-jobs master: Python roles: misc doc updates
<openstackgerrit> Merged openstack/project-config master: Move suse builds to nb04, drop pip-and-virtualenv
[02:11] *** ysandeep|away is now known as ysandeep|rover
<openstackgerrit> Ian Wienand proposed openstack/project-config master: AFS Grafana : add mirror release timers
<openstackgerrit> Ian Wienand proposed openstack/project-config master: AFS Grafana : add mirror release timers
[03:48] <ianw> dirk / clarkb: so suse has built on nb04 now
[03:49] <ianw> i'd like to, and will be available to, push on anything needed to get things working without pip-and-virtualenv.  as i've said, i think the ensure-pip stack is ready
[03:51] *** DSpider has joined #opendev
[03:53] <ianw> cmurphy: ^ might also affect as i saw some things fly by about certs
<cmurphy> ianw: ooh good to know, a new image might help me avoid needing
[03:59] <ianw> ahh yeah that was what i was thinking of.  i'm not going to make a prediction, but maybe? :)
<openstackgerrit> Merged openstack/project-config master: AFS Grafana : add mirror release timers
[04:08] *** ysandeep|rover is now known as ysandeep|BRB
[04:23] *** ysandeep|BRB is now known as ysandeep|rover
[04:25] *** ykarel|away is now known as ykarel
<openstackgerrit> Merged openstack/project-config master: Revert "Revert "Introduce job for granular GitHub mirroring""
[05:41] <AJaeger> ianw: reviewed the stack and gave my +2s, I did not approve - wanted you to do the honours yourself when you're around. Thanks!
[05:42] *** roman_g has joined #opendev
[05:43] <ianw> AJaeger: ok, thanks, i'll do that in the morning then to avoid pushing anything before i disappear :)
[05:46] <AJaeger> ianw: enjoy your evening ;)
<openstackgerrit> OpenStack Proposal Bot proposed openstack/project-config master: Normalize projects.yaml
[06:15] *** roman_g has quit IRC
[06:19] <prometheanfire> should glean be updated (tox wise) to py36/38?
[06:26] *** hashar has joined #opendev
[06:42] *** hashar has quit IRC
[06:49] *** dpawlik has joined #opendev
<AJaeger> ianw, cmurphy, dirk, keystone is now failing openSUSE tests, see
[06:59] <AJaeger> RETRY_LIMIT - and no log files ;(
<openstackgerrit> Matthew Thode proposed opendev/glean master: write one resolv config
[07:00] *** roman_g has joined #opendev
[07:00] <prometheanfire> ok, that passes tests locally ^
[07:07] *** lpetrut has joined #opendev
[07:07] *** hashar has joined #opendev
<openstackgerrit> Merged openstack/project-config master: Normalize projects.yaml
[07:14] *** ralonsoh has joined #opendev
[07:14] <ianw> AJaeger: ok ... hrm that seems before anything i'd even expect to have changed wrt pip-and-virtualenv
<ianw> AJaeger, dirk, cmurphy: this seems to be the relevant bit ->
[07:18] <ianw>   "msg": "Data could not be sent to remote host \"\". Make sure this host can be reached over ssh: Permission denied
[07:21] <ianw> # cat /etc/dib-builddate.txt
[07:21] <ianw> 2020-04-15 04:38
[07:22] <ianw> i'm logged into an opensuse host that was built today though ...
[07:23] <AJaeger> so, you can login but Zuul cannot?
[07:23] <ianw> hrm, maybe?
[07:23] *** tosky has joined #opendev
[07:25] <ianw> zuul@opensuse-15-inap-mtl01-0015944709:~> cat .ssh/authorized_keys
[07:25] <ianw> /var/lib/nodepool/.ssh/id_rsa.pub
[07:26] <ianw> that .. does not look right?  like it's a file path and not the actual public key?
[07:26] <ianw> 2020-04-15 02:02:16.244 | + /opt/dib_tmp/dib_build.ujqmkwxc/hooks/extra-data.d/60-zuul-user:main:16          :   echo /var/lib/nodepool/.ssh/id_rsa.pub
[07:28] *** rpittau|afk is now known as rpittau
[07:28] <AJaeger> shouldn't that be cat?
[07:30] <ianw> i think so, but ...
<openstackgerrit> Ian Wienand proposed openstack/project-config master: Add ZUUL_USER_SSH_PUBLIC_KEY to opensuse-15 image
[07:30] <ianw> AJaeger: ^ that should fix it.  i'll have to think about the echo/cat thing
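[Editor's note] The bug spotted above is the classic echo-vs-cat mixup: `echo file` writes the file's *path* into authorized_keys, while `cat file` writes its *contents*. A minimal sketch (all paths are temp-dir stand-ins, not the real dib hook):

```shell
#!/bin/sh
# Demonstrate the echo-vs-cat bug from the 60-zuul-user dib hook:
# echo writes the *path* of the key file, cat writes its *contents*.
set -eu
tmp=$(mktemp -d)
pub="$tmp/id_rsa.pub"
auth="$tmp/authorized_keys"
printf 'ssh-rsa AAAAB3Nza...fake zuul@bridge\n' > "$pub"

echo "$pub" > "$auth"        # buggy: authorized_keys now holds a file path
buggy=$(cat "$auth")

cat "$pub" > "$auth"         # fixed: authorized_keys now holds the key itself
fixed=$(cat "$auth")
echo "buggy: $buggy"
echo "fixed: $fixed"
```

sshd ignores the buggy line entirely (it does not start with a key type), which produces exactly the "Permission denied" pasted above.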
[07:31] <ianw> if we want to merge that, i can come back and kick off a build soon, or maybe frickler could babysit it if around?
[07:32] <ianw> this is *exactly* why i did the abstract job/inheritance thing in the nodepool config, so we wouldn't forget stuff like this.  still have to get back to converting the file
[07:32] <AJaeger> ianw: thanks, approved
[07:43] <ianw> i feel like opendev-prod-hourly might be stuck
[07:48] *** ysandeep|rover is now known as ysandeep|lunch
<openstackgerrit> Merged openstack/project-config master: Add ZUUL_USER_SSH_PUBLIC_KEY to opensuse-15 image
[07:54] <ianw> AJaeger: ok, i pulled that manually and triggered a build
[07:54] *** ykarel is now known as ykarel|lunch
[07:55] <ianw> <- this one
[08:02] *** hashar has quit IRC
<openstackgerrit> Sorin Sbarnea proposed opendev/gerritlib master: Switch to ensure-docker role
[08:37] *** ykarel|lunch is now known as ykarel
[08:40] *** ysandeep|lunch is now known as ysandeep|rover
<openstackgerrit> Sorin Sbarnea proposed zuul/zuul-jobs master: Improve 404 error message on
<openstackgerrit> Merged opendev/irc-meetings master: Update OpenDev meeting location and name
<openstackgerrit> Roman Gorshunov proposed openstack/project-config master: Retire airship-in-a-bottle
<openstackgerrit> Roman Gorshunov proposed openstack/project-config master: Retire airship-in-a-bottle
[09:14] *** hashar has joined #opendev
[09:25] *** roman_g has quit IRC
<openstackgerrit> Marcin Juszkiewicz proposed openstack/project-config master: Add CentOS 8 AArch64 nodes
<openstackgerrit> Merged zuul/zuul-jobs master: Support ssh-enabled windows hosts in add-build-sshkey
[10:23] *** rpittau is now known as rpittau|bbl
<ttx> Test GitHub replication on release-test repository:
[10:36] <ttx> fungi, corvus, mnaser: please review ^ -- I'm wondering about all those deleted references and created branches
[10:37] <ttx> I mean those branches definitely correspond to the opendev repo... just wondering why they weren't already up
[10:37] <ttx> (maybe it's just a log artifact)
[10:38] <ttx> Like... That list of deleted refs could be quite long in a more active repo
[10:39] * ttx is tempted to queue a second test
[10:40] <AJaeger> yeah, interesting to see
[10:40] <ttx> ok, sending a new one in
<openstackgerrit> Merged zuul/zuul-jobs master: Improve 404 error message on
[11:00] *** ysandeep|rover is now known as ysandeep|coffee
[11:03] <ttx> only has the additional change mentioned
[11:05] <ttx> so yeah I fear that for large projects we may end up deleting thousands of references, which might or might not be costly
[11:17] *** ysandeep|coffee is now known as ysandeep|rover
<openstackgerrit> Merged openstack/project-config master: Add devstack-plugin-ceph notifications to manila channel
[12:03] <AJaeger> ttx, are those changes all on github? Did you double check?
[12:05] <ttx> They are, but then since Gerrit-wide replication was not turned off, that does not mean much
[12:05] <ttx> AJaeger: oh, you mean the refs?
[12:05] <ttx> let me do a recent clone
[12:10] <ttx> AJaeger: on a fresh clone there aren't any refs on GitHub other than refs/remotes/origin/HEAD and refs/heads/master (+ branches)
[12:10] <ttx> no refs/changes
[12:16] *** factor has joined #opendev
[12:18] <Eighth_Doctor> hey, is it normal that a repo like openstack/nova would have 182116 refs?
[12:19] <Eighth_Doctor> there's this refs/changes thing and refs/users thing...
[12:29] <ttx> Eighth_Doctor: where did you clone from?
[12:30] <Eighth_Doctor> I did a `git clone --mirror`
[12:30] <ttx> from opendev or github?
[12:30] <ttx> (or both)
[12:35] <Eighth_Doctor> I only pulled from opendev
[12:38] <ttx> I suspect opendev has a full mirror of the Gerrit repo, which keeps all the refs/changes
[12:39] <ttx> while the new job pushes a mirror of a clone, so it does get rid of refs/changes in the process
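[Editor's note] The distinction ttx is drawing can be shown locally: `git clone --mirror` fetches `+refs/*:refs/*` and so carries Gerrit's hidden refs, while a plain clone only fetches `refs/heads/*`. A small self-contained demo (everything happens in a temp dir; the change ref is a made-up Gerrit-style example):

```shell
#!/bin/sh
# Show why a --mirror clone carries refs/changes/* while a plain clone does not.
set -eu
work=$(mktemp -d); cd "$work"
git init -q --bare src.git
git clone -q src.git seed && cd seed
git -c user.email=a@b -c user.name=a commit -q --allow-empty -m init
git push -q origin HEAD:refs/heads/master
git push -q origin HEAD:refs/changes/86/719186/9    # a Gerrit-style hidden ref
cd "$work"
git clone -q --mirror src.git mirror.git            # fetches +refs/*:refs/*
git clone -q src.git plain                          # fetches refs/heads/* only
echo "mirror: $(git -C mirror.git for-each-ref refs/changes | wc -l) change ref(s)"
echo "plain:  $(git -C plain for-each-ref refs/changes | wc -l) change ref(s)"
```

Pushing the mirror clone elsewhere (`git push --mirror`) therefore replays every hidden ref, which is what took Eighth_Doctor four days for nova below.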
[12:41] <Eighth_Doctor> ttx, well, it certainly exposed some interesting things about doing a full mirror from there to stg.pagure.io
[12:41] <Eighth_Doctor> also, wow, `git reflog` does not like this repo on my computer :/
[12:44] <Eighth_Doctor> I was taking a look at it due to a convo I had with mordred, clarkb, and fungi about using pagure as the source code browser frontend for instead of gitea
[12:48] <Eighth_Doctor> processing all those refs at once was a bit painful on the machine that runs on...
[12:48] <Eighth_Doctor> but at least now it's there:
[12:48] <Eighth_Doctor> this is probably going to turn into a good test case, actually, since I hadn't encountered a repo like this before
[12:49] *** rpittau|bbl is now known as rpittau
[12:50] *** roman_g has joined #opendev
[12:53] <mnaser> ttx: i wonder if the reason why it does this is because we don't do a deep mirror clone by zuul into the executor
[12:54] <mnaser> ttx: and so because we have a shallow clone that doesn't include all the refs (because that would take a long time and probably isn't needed)
[12:54] <ttx> that's what I meant by "pushes a mirror of a clone"
[12:55] *** ykarel is now known as ykarel|afk
<openstackgerrit> Andreas Jaeger proposed openstack/project-config master: Update update_constraints for Py3.8
[13:08] <Eighth_Doctor> ttx: well, it took four days to push all those refs
[13:09] <Eighth_Doctor> and most of the git command line tools seem to be rather unhappy with the repo on my machine because of all the refs
[13:09] <Eighth_Doctor> but it's a nice test case, so it's not all bad
[13:10] <ttx> lol... Yeah I expect it will also take days to delete them if we end up mirroring nova with the new per-repo system
[13:10] <ttx> hence my question up there
[13:22] <Eighth_Doctor> ttx: if gerrit+zuul was directly managing the pagure git repository, I don't think this would be a problem
[13:22] <Eighth_Doctor> otherwise, probably should be somehow not sending those refs when pushing, because damn they're expensive
[13:25] <mordred> ttx, mnaser we _do_ have a full mirror on the executor - however, the refs/changes thing might be a smidge interesting
[13:25] <mordred> because I'm not sure each executor is always going to fetch refs/changes it doesn't happen to work with - so in any given push we may not get the full story of the refs/changes/*
[13:26] <mordred> although maybe it's fine that they're not there
[13:26] <Eighth_Doctor> mordred: my theory at least is that this would only be painful once
[13:26] <mordred> I'm a little concerned about that origin/stable/train -> origin/stable/train and friends
[13:27] <mordred> Eighth_Doctor: well for the gitea/pagure case it's a little different - we use those also so that people can browse proposed changes too - so we need all of the refs/changes to be in that system
[13:27] <mordred> for github mirroring - meh, I don't think it's actually important
[13:27] <Eighth_Doctor> though gitea looks like it's not happy with me doing a git fetch right now
[13:27] <frickler> ianw: AJaeger: cmurphy: new opensuse image seems to work better, but now fails with "virtualenv: command not found"
[13:28] <Eighth_Doctor> mordred: at least with gitea, refs/changes are not visible
[13:28] <mordred> ttx: I think the created branches are a logic bug
[13:28] <Eighth_Doctor> it wouldn't be hard to extend pagure to show you the refs/changes stuff, but I'm not sure how useful it would be given that the refs have no context
[13:28] <Eighth_Doctor> I'm not even sure what the numbering scheme is here
[13:29] <mordred> Eighth_Doctor: yeah - they're hidden refs - those are how gerrit stores proposed changes
[13:30] <ttx> mordred: not very concerned with the branches really. Just don't want the script to block executors for one day deleting 182,116 refs every time Nova is synced
[13:30] <Eighth_Doctor> pagure PRs work similarly, except they're stored in an adjacent repo for pull requests
[13:30] <mordred> so - is going to be in refs/changes/86/719186/9
[13:31] <Eighth_Doctor> where does `86` come from?
[13:31] <Eighth_Doctor> is it just the last two digits?
[13:31] <mordred> the last 2 digits
[13:31] <mordred> it's a dir hashing scheme
<mordred> but that ref can be seen in gitea:
[13:32] <mordred> so we push them there, but since they aren't branches they don't show up in the branches list
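[Editor's note] The "dir hashing scheme" mordred describes is simply: last two digits of the change number (zero-padded), then the full change number, then the patchset. A one-liner sketch using the example change above:

```shell
#!/bin/sh
# Build a Gerrit change ref: refs/changes/<last two digits>/<change>/<patchset>
change=719186
patchset=9
shard=$(printf '%02d' $((change % 100)))   # "86" here; zero-padded for low change numbers
ref="refs/changes/$shard/$change/$patchset"
echo "$ref"
```

The sharding exists so no single directory under refs/changes/ holds every change; it spreads them over at most 100 subdirectories.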
[13:32] *** ysandeep|rover is now known as ysandeep|away
[13:33] <Eighth_Doctor> that should work the same way with pagure, I think
[13:33] <mordred> ttx: yeah - I think we might want to come up with a $something to do in git config to control refs/ interactions
[13:34] <mordred> ttx: or - we could do an offline script to push up refs/changes deletions for all of them
[13:34] <mordred> ttx: so that we just stop caring about those refs on github completely
[13:34] <mordred> they're not exactly browseable anyway
[13:35] <ttx> yeah, it's just tricky to do without freezing mirroring for a bit
[13:36] <ttx> Like 1/ disable Gerrit-wide replication, 2/ run refs/changes deletion script over a thousand repos and 700,000 refs, 3/ enable per-repo mirroring
[13:37] <ttx> I have no idea how long 2 will take :)
[13:37] <Eighth_Doctor> I wonder if we could be clever here in pagure, and make it so that when those refs/changes things show up, they make a link to Gerrit?
[13:37] <Eighth_Doctor> ttx: four days at least on nova :)
[13:37] <ttx> Eighth_Doctor: it was to create them, hopefully deleting is faster :)
[13:38] <Eighth_Doctor> actually, would the Change-Id be a better thing to process and hyperlink than the refs?
[13:38] <ttx> Damn it's more than 700,000, it's one per patchset
[13:39] <Eighth_Doctor> ttx: yeah, it's a _lot_
[13:39] <Eighth_Doctor> Change-Ids are unique to Gerrit and are the way it tracks those things, is there a way to use that to link to the change review?
[13:39] <ttx> It deleted 80 in 50ms in the script
[13:41] <ttx> so about 14 hours for a million patchsets
[13:42] <ttx> assuming 3 revs per change (average from fungi), would take about a day
[13:43] <ttx> napkin math
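[Editor's note] A quick sanity check of the napkin math, assuming roughly 50 ms per ref deletion (the per-ref cost that yields ttx's "about 14 hours" figure — the numbers below come from the discussion, not from measurement):

```shell
#!/bin/sh
# Napkin math for the refs/changes cleanup, at an assumed ~50ms per deletion.
ms_per_ref=50
patchset_refs=1000000                     # "a million patchsets"
seconds=$(( patchset_refs * ms_per_ref / 1000 ))
echo "1M refs: ${seconds}s (~$(( seconds / 3600 )) hours, integer-truncated)"
changes=700000                            # "more than 700,000" changes
revs_per_change=3                         # fungi's average revisions per change
all_refs=$(( changes * revs_per_change ))
echo "nova-scale cleanup: ~$(( all_refs * ms_per_ref / 1000 / 86400 )) day(s)"
```

50,000 seconds is about 13.9 hours, and 2.1M refs at the same rate lands just over a day, matching both estimates above.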
[13:44] <Eighth_Doctor> mordred: so the way that pagure renders commits doesn't seem to make the refs thing useful :(
[13:44] <Eighth_Doctor> that might be worth fixing, not sure
[13:44] <Eighth_Doctor> the commits list typically has these things, so it might be worth extending that view to support it
[13:46] <Eighth_Doctor> mordred: what do you think would be more useful? a link via change-id (assuming that's possible) or a population in the commits view of refs/changes/* that link to gerrit?
[13:48] <mordred> Eighth_Doctor: I think a view of a given refs/change is really only useful as something you might look at if you follow the link _from_ gerrit
[13:48] <mordred> that said - I do think a link from change-id back to gerrit could be useful for people browsing normal commits
[13:49] <mordred> Eighth_Doctor: is how you go to a change via change-id
[13:49] <Eighth_Doctor> okay, that's neat
[13:50] <Eighth_Doctor> I'm going to log that as an RFE and take a look at adding a feature for supporting that in pagure
<openstackgerrit> Monty Taylor proposed opendev/system-config master: Upgrade to gitea 1.11.4
[14:03] <mordred> infra-root: I'm landing the patches to run zuul prod playbooks from the zuul checkout - I'll be watching to make sure it all happens properly
[14:03] *** ykarel|afk is now known as ykarel
[14:06] <corvus> ttx: i agree with your analysis; we may be able to reconfigure gerrit not to replicate refs/changes, so if we did that, we could modify your process to: reconfigure gerrit to not replicate refs/changes; delete refs/changes asynchronously; enable zuul replication; disable gerrit.  that would avoid a replication outage.
[14:10] <corvus> mordred: #zuul -> we're about to need to make a moderately complex change to the zuul deployment in order to support zk tls
[14:10] <fungi> Eighth_Doctor: still catching up, but the most effective way to link back to gerrit reviews from the git repository is via the git "notes" it stores
[14:11] <fungi> they used to be displayed by default by cgit, i think we need to configure gitea to do it (they didn't support alternative notes trees until somewhat recently and we haven't had time to revisit it since upgrading)
[14:11] <corvus> mordred: but we don't have a solution for running the executor in docker yet, so i don't think we can convert everything to docker; should we do the new work in ansible instead of puppet?  should we use windmill?
[14:11] <mordred> corvus: re: gerrit - setting 'push' to +refs/heads/*:refs/heads/* should do the trick
[14:11] <corvus> ttx: ^
[14:12] <mordred> corvus: I mean - I've got all the config bits converted - so I think it would be easier to just change the executor to pip install in that instead of trying to use windmill
[14:12] <fungi> the way i had imagined that replication job was that it would just push the current head or tag when triggered, not try to push a full mirror every time it's invoked
[14:12] <mordred> also - we'd have to add zk tls support to windmill and I don't know what paul's status for stuff like that is atm
[14:12] <fungi> so i'm surprised it was deleting anything
[14:13] <corvus> mordred: what do you mean you've got all the config bits converted?  isn't zuul.conf still written by puppet?
[14:13] <mordred> corvus: (we could also change all of it to run via pip instead of docker)
[14:13] <mordred> corvus: my zuul patch ... one sec
[14:13] <fungi> assuming it's in sync already, the job should be triggered for any update to a branch anyway (and a tag once we add it to the right pipelines)
[14:13] <corvus> mordred: re windmill -- someone needs to, right?  doesn't the ansible zuul run via windmill?
<openstackgerrit> Merged opendev/system-config master: Update install-ansible away from /opt/system-config
<openstackgerrit> Merged opendev/system-config master: Run playbooks out of zuul checkout
[14:14] <mordred> corvus: that's a stab at converting our puppet use to using ansible instead - although it's obviously not going to work for the executor because of docker - but it's a start. but we could also use windmill - I didn't do that in this case because it seemed like a harder step
[14:15] <corvus> mordred: oh!  you sure did write that patch.  :)
[14:15] <corvus> mordred: i agree, updating that to s/docker/pip/ is probably easy and gets us to a place to use zk tls quickest
[14:15] <mordred> corvus: do you think I should do that everywhere? or just on ze?
[14:17] <corvus> mordred: well, it seems nodepool -> docker is already in progress, so a mixed env is a given; therefore, maybe just do that on ze
[14:25] <mordred> corvus: kk. I'll update the patch
<openstackgerrit> Monty Taylor proposed opendev/system-config master: Run Zuul using Ansible and Containers
[14:28] <mordred> that's just a rebase
[14:31] *** ysandeep|away is now known as ysandeep
[14:33] <Eighth_Doctor> fungi: git notes?
[14:33] <fungi> Eighth_Doctor: notes refs
[14:33] <fungi> that stuff
[14:34] <fungi> though gerrit doesn't use the default refs/notes/commits tree in case you're already using that for other purposes
[14:35] <fungi> it uses a refs/notes/review tree
[14:36] <fungi> but it stores the numeric vote values/dates/users, review link and related data in there
[14:36] <ttx> mordred, corvus: would limiting the push not result in refs deletion? (like what happens during the replication process for refs/changes already on GitHub?)
[14:36] <Eighth_Doctor> fungi: interesting
[14:36] <ttx> or is it just additive
[14:38] <fungi> ttx: i'm surprised we actually built that job to mirror all refs in the first place, i had thought it was just going to push refs for the branch or tag which triggered the build
[14:40] <ttx> also where is that setting 'push' to +refs/heads/*:refs/heads/* happening? replication.config?
[14:40] <corvus> ttx: oh, that's a good point, it may well do that.
[14:40] <ttx> answering my last question: yes
[14:40] <fungi> ttx: yeah, it'll be in the replication config
[14:41] <ttx>     push = +refs/heads/*:refs/heads/*
[14:41] <ttx>     push = +refs/tags/*:refs/tags/*
[14:41] <ttx> probably both
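[Editor's note] The two refspecs ttx pastes live in a remote section of Gerrit's replication.config. A hedged sketch of what such a section could look like — the remote name, URL, and thread count are illustrative placeholders, not OpenDev's actual config:

```ini
# Hypothetical replication.config remote: replicate branches and tags only,
# so refs/changes/* and refs/users/* are never pushed to GitHub.
[remote "github"]
    url = git@github.com:example-org/${name}.git
    push = +refs/heads/*:refs/heads/*
    push = +refs/tags/*:refs/tags/*
    threads = 2
```

Whether restricting `push` also stops deletions of refs/changes already on the remote is exactly the open question in the discussion below; the safe path suggested later is to try it on review-dev first.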
[14:41] <corvus> ttx: we still might want to consider that though; i think we dedicate one thread to github replication; we could increase that to two, which would mean replication is slow, but would allow other work to happen while nova was 'stuck'
[14:42] <ttx> ok will give it some extra thought
[14:42] <mordred> I'm not sure it would push deletes
[14:42] <mordred> I think with the mirror script it's mirroring all refs, so that means it's going to try to mirror in the refs/changes namespace, meaning pushing deletes
[14:43] <ttx> on mirroring it definitely deletes remote extra refs
[14:43] <mordred> if we limit the ref namespaces gerrit is pushing
[14:43] <mordred> then I don't think it would push empty refs/changes to delete things
[14:43] <mordred> that would be pushing ref information for a namespace we told it not to push
[14:43] <corvus> yeah, i don't know for sure.  that sounds plausible.
[14:43] <mordred> we can try this out on review-dev
[14:44] <mordred> but I'm gonna put my money on it being safe to configure gerrit to just stop replicating them
[14:44] <mordred> and then being able to run a cleanup script
[14:47] *** mlavalle has joined #opendev
[14:47] <fungi> it's still not clear to me, why have the job replicate everything each time it runs and not just the branch or tag for which the build was triggered?
[14:48] <fungi> branch and tag updates won't happen outside gerrit typically anyway, so zuul will receive events for those and then run the job
[14:49] <corvus> fungi: that's a fair question.  perhaps to catch up after previous errors?  maybe that's low-risk though?
[14:49] <corvus> or maybe it could be configured not to delete
[14:49] <fungi> the only one i'd worry about is missed tags, but maybe if the job is triggered by a tag then push all tags, but branches will eventually get new commits
[14:50] <fungi> just seems unnecessary to have it try to replicate the entire repository when the triggering event was a new commit merging to a single branch, and the job gets run each time that happens
[14:51] <fungi> (to be honest, i had it in my head that was the design, and didn't realize until now that wasn't how it was working)
[14:51] <corvus> speaking of replication... i think we may have a gitea backend out of sync; i'm seeing different data pulling zuul updates
[14:52] <cmurphy> frickler: ianw dirk "virtualenv: command not found" the pip-and-virtualenv element was removed from the image build ???
[14:52] <cmurphy> can we put it back? keystone needs this
[14:53] <fungi> cmurphy: the idea was that the tox parent job would start installing virtualenv, i think
[14:54] <AJaeger> cmurphy: ianw is working on this stack:
[14:54] <fungi> (and no, a big part of the delay in the suse image updates was so that we didn't have to work out installing pip, virtualenv and tox into the system context, since we're going that direction for the other distros as well, fedora is already like that apparently)
[14:54] <AJaeger> once that's merged all should be green again
[14:55] <AJaeger> cmurphy: best discuss with ianw once he's awake. We expected that what we had would work already as is.
[14:55] <cmurphy> AJaeger: great! can we merge that asap?
[14:55] <cmurphy> :( ianw won't be awake till the end of my day
<AJaeger> cmurphy: ianw wants to merge once he's around - but corvus just left a -1 on
[14:57] <corvus> i thought the "plain" image was being used to work through this?
[14:57] <corvus> i didn't think anything was removed from the main images yet
[14:58] <AJaeger> corvus: - we had problems building the opensuse images as well
[14:58] <AJaeger> So, between a rock and a hard place ;(
[14:59] <corvus> perhaps we should revert that as cmurphy says?  because if we override my -1 we're going to break other zuul installs
[15:00] <corvus> i haven't looked into how long it would take to fix my -1, it's probably not too hard, but i'm certainly not up to speed
[15:00] <AJaeger> corvus: 718299 was needed to fix image builds that have been broken for ages ;(
[15:02] <fungi> we can roll back to months-old images maybe, if we still have them hanging around
[15:03] <cmurphy> the months-old image was semi-working for me with workarounds
[15:03] <mordred> corvus: if you have a sec - could you look at ?
[15:04] <mordred> corvus: that's trying to use mirror-workspace-git-repos when talking to bridge - it seems to be having a sad but I'm not 100% sure what the issue would be - did I use the wrong role here?
[15:04] <mordred> corvus: I'm starting to think maybe I was supposed to use prepare-workspace instead
[15:05] <mordred> corvus: no - I guess prepare-workspace does the synchronize
[15:05] <corvus> mordred: in a bit...
[15:05] *** lpetrut has quit IRC
[15:05] <corvus> what should we do about the keystone situation?
[15:06] <corvus> revert and rollback?  or are we going to say "sorry it's broken for a day or two"?
[15:06] <corvus> are there any other options?
[15:07] <mordred> I think revert and rollback
[15:08] <mordred> and then the re-revert needs to take this issue into account
[15:08] <mordred> because I think it was the assumption that this wouldn't break things
[15:08] <corvus> are we talking about the opensuse-15 image?
[15:08] <mordred> I believe so?
[15:09] <yoctozepto> morning folks
[15:09] <yoctozepto> it's etherpad again :-(
[15:09] <yoctozepto> no worky
[15:09] <yoctozepto> only loady
[15:09] <corvus> it looks like we have deleted all of the months-old opensuse-15 images
[15:10] <corvus> oh wait
[15:10] <yoctozepto> An error occurred
[15:10] <yoctozepto> The error was reported with the following id: 'LxotxdY5BrhtpIZtbDud'
[15:10] <corvus> we still have one that's 69 days old on nb02
[15:10] <mordred> corvus: well, that one won't have the pip-and-virtualenv element removed - maybe that's the most recent?
[15:11] <corvus> yeah, i think if we revert back to nb02, it'll upload those
[15:12] <fungi> priteau just mentioned in #openstack-infra that is unresponsive. i'll take a look
[15:12] <mordred> fungi: while you're looking, see issue from yoctozepto above about that etherpad too
<openstackgerrit> James E. Blair proposed openstack/project-config master: Revert "Move suse builds to nb04, drop pip-and-virtualenv"
[15:13] <corvus> mordred: i think the local apache mirrors for gerrit may be out of date
[15:13] <corvus> mordred: maybe we missed a bind mount?
[15:13] <yoctozepto> mordred, fungi: oh well, I asked priteau if it worked for him :-)
[15:13] <mordred> corvus: looking
[15:13] <corvus> AJaeger, fungi, mordred, cmurphy: see 720223 ^
[15:13] <yoctozepto> did not think he would crossreport :-)
[15:14] <fungi> yoctozepto: oh, well thanks, i missed your mention of it in here, sorry, there's been a lot of discussion going on
[15:14] <yoctozepto> fungi: sure, no problem
[15:15] <AJaeger> thanks, corvus
[15:15] <mordred> corvus: yes.
[15:15] <fungi> the server itself is definitely up and reachable over ssh
[15:15] <cmurphy> thanks corvus
[15:15] <fungi> and "node node_modules/ep_etherpad-lite/node/server.js" is running since some time on monday
[15:16] <fungi> and it's suddenly responding to me again via browser, i didn't change anything
[15:16] <fungi> load average is low
<openstackgerrit> Monty Taylor proposed opendev/system-config master: Add /opt/lib/git to the volume mounts
[15:17] <fungi> nothing going haywire with the kernel per dmesg
[15:19] <fungi> cacti doesn't show anything particularly anomalous either
<openstackgerrit> Monty Taylor proposed opendev/system-config master: Use prepare-workspace-git in production playbook
[15:20] <mordred> corvus: ^^ I believe the first of those will fix the gerrit local git replica issue
[15:21] <mordred> corvus: and the second should fix my git repo replication issue
[15:21] <mordred> actually ... let me change that
[15:22] <fungi> yoctozepto: so far i'm not finding anything on the server to explain the temporary outage. is loading now too
[15:22] <fungi> might have been network-related, but i'm going to dig deeper in logs
[15:22] <yoctozepto> fungi: yes, thanks; I can only offer this id LxotxdY5BrhtpIZtbDud
[15:23] <yoctozepto> maybe it's greppable or something :D
[15:23] <fungi> i'm checking
[15:24] <yoctozepto> duck, I got another failure
<openstackgerrit> Monty Taylor proposed opendev/system-config master: Just use synchronize to sync the repos
[15:26] <corvus> those are both "Uncaught TypeError: Cannot read property 'setStateIdle' of null"
[15:26] <mordred> corvus: ^^ I think that's a better approach for our use on bridge
[15:27] <fungi> i find a couple of recent proxy errors apache logged at 15:06:22z
[15:27] <fungi> "AH01102: error reading status line from remote server localhost:9001" and "AH00898: Error reading from remote server returned by /"
[15:27] <fungi> those may be unrelated though
[15:28] <yoctozepto> this must be nodejs looking at that message
[15:28] <fungi> do we use docker-compose to view etherpad's service logs now? are those written to disk in the chroot or spewed on stdout/stderr?
[15:28] <mordred> fungi: spewed
[15:28] <mordred> fungi: cd /etc/etherpad-docker ; docker-compose logs
[15:28] <mordred> fungi: will get you the spew
[15:28] <mordred> (-f will tail)
<openstackgerrit> Marcin Juszkiewicz proposed openstack/project-config master: Add CentOS 8 AArch64 nodes
[15:29] <corvus> fungi: i ran: "docker logs etherpaddocker_etherpad_1 | grep -C 8 LxotxdY5BrhtpIZtbDud" to get the error message above
[15:29] *** ykarel is now known as ykarel|away
[15:29] <corvus> is relevant
[15:30] <mordred> corvus, fungi: I know there's a ton of things going on - but could I get a quick +A on 720227? we're dead in the water on bridge without it, and that means we can't land the nodepool revert that we need for the keystone issue
[15:30] <mordred> (when it rains it pours)
[15:31] <corvus> mordred: oh :( well i wanted to really look into that
[15:31] <corvus> but if we need to just merge it to put out fires sure
[15:31] <corvus> how did we get into that situation though?
clarkbmordred: is there a tl;dr of nodepool issue?15:31
corvusoh i guess this only runs in prod15:31
yoctozepto17:30:39 <mordred> (when it rains it pours)15:31
mordredcorvus: we merged the "run from git" patch - and it failed being unable to sync the git repos to bridge15:31
*** redrobot has joined #opendev15:32
clarkband nodepool change I assume is related to the opensuse things?15:32
clarkbnote I think opensuse like fedora3X was not actually building on the old setups15:32
mordredcorvus: yeah - it's unfortunate timing - when I clicked +A it was quiet15:32
clarkbso a revert is unlikely to fix anything15:32
corvusclarkb: revert+rollback is the proposal15:32
mordredclarkb: there is a 69 day old opensuse image15:32
clarkbah ok if we still have old image then we are good15:32
mordredclarkb: but we need to be able to land the revert15:33
clarkbin the case of opensuse it isn't building due to python2 changes15:33
mordredclarkb: so if you have a quick morning second15:33
clarkbso its directly related to the change made to the image build, not to anything in the builder itself15:33
clarkbbasically you can't have a working oepnsuse with python2 now or something15:33
corvusthe main difference i see betwen our etherpad config and is we don't have a timeout setting15:34
mordredcorvus: I agree re: timeout15:35
corvusmordred: wait i don't understand your comment about "delete: false"15:35
corvusmordred: that just means that synchronize won't delete files (which could cause errors)15:36
clarkbcorvus: oh I was just getting to that :)15:36
corvusi mean, i'm still okay with +2 meaning -1 just to try to dig out of this hole15:37
clarkbI think in the context of a git repo thats not a good thing to have set to false15:37
corvusyeah.  it will probably work okay for the next couple of changes we land15:37
mordredcorvus: oh - want me to put that back in? I was mostly just thinking we don't want to delete and repush project-config over and over15:37
clarkbmordred: well I think the best way to handle it is to use git push15:37
corvusi don't know why that would "delete and repush"15:37
clarkbnot rsync15:37
mordredcorvus: not all jobs have project-config in their required-projects15:38
corvusthat just means "delete files on the remote side that aren't on this side"15:38
corvuslet's just merge this and replace it with the right role15:38
mordredyeah - this should work until we can breathe and dig in better15:38
clarkbok I've approved it15:38
clarkbbut ya I think we want a role that does a git push15:38
clarkb(and it can skip pushing if the source doesn't exist)15:39
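The git-push-based sync clarkb proposes might look roughly like this (a sketch only; the role was not written in this log, and all paths here are throwaway demo paths, not bridge's real layout):

```shell
# Create a throwaway "source" repo standing in for the local checkout.
SRC=$(mktemp -d)
DST=$(mktemp -d)/dst.git
git init --quiet "$SRC"
git -C "$SRC" -c user.email=demo@example.com -c user.name=demo \
    commit --quiet --allow-empty -m "demo"
# The sync itself: make sure a repo exists on the "remote" side, then
# force-push HEAD into it -- unlike rsync --delete, this never removes
# unrelated files on the destination.
git init --quiet --bare "$DST"
git -C "$SRC" push --quiet --force "$DST" HEAD:refs/heads/master
git -C "$DST" log --oneline master
```

A real role would also skip the push entirely when the source checkout does not exist, as clarkb notes above.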
corvusdo we need to make sure that 720223 lands after that?15:39
mordredcorvus: yes15:39
yoctozeptoeh, etherpad does not like me :/15:39
openstackgerritJames E. Blair proposed openstack/project-config master: Revert "Move suse builds to nb04, drop pip-and-virtualenv"
mordredcorvus: +A15:40
mordredcorvus: thanks - and sorry for timing there15:40
corvusmordred: np; what was wrong with prepare-workspace-git?15:40
mordredcorvus: it might be the right choice - we were using mirror-workspace-git15:40
corvusmordred: here's a handy cheat sheet:
mordredcorvus: I wasn't 100% sure prepare-workspace-git was the right thing to use and figured the simple rsync would _definitely_ work in this case15:41
corvusmordred: prepare-workspace-git calls mirror-workspace-git15:41
corvusso what went wrong with mirror-workspace-git?15:41
mordredcorvus: there were no git repos on the remote side to push to15:42
corvusmordred: ah, then prepare-workspace-git may well work15:42
mordredit tried to git config them and got an error "you can't do that without a git repo"15:42
corvusbecause prepare-workspace-git does the "use a cache if it's there, otherwise git init" step i believe15:42
mordredcorvus: yeah. I believe that's accurate15:42
corvusmordred: okay, want to push up a prepare-workspace-git change, and we can merge it after the nodepool change lands?15:43
mordredfwiw - we could handle the "delete and re-push project-config over and over again" by having the bridge playbook maintain an /opt/git cache of both15:43
mordredcorvus: ++15:43
corvusmordred: i think using this role will avoid the issue; it's not going to delete any repos already in the workspace15:44
mordredit will - it'll call mirror-workspace-git at the end which will do the rsync --delete15:44
openstackgerritTristan Cacqueray proposed zuul/zuul-jobs master: dhall-diff: add new job
corvusmordred: mirror-workspace-git-repos doesn't use rsync15:45
AJaegerconfig-core, could you review - needed for release, please15:45
mordredcorvus: sigh. so you are right :)15:46
mordredcorvus: so yay15:46
openstackgerritMonty Taylor proposed opendev/system-config master: Switch to prepare-workspace-git
mordredclarkb corvus :^^15:47
clarkbAJaeger: I guess we are ok with dealing with cases where 3.6 doesn't imply 3.8 when they come up?15:48
fungitailing etherpad logs, i see quite a few errors related to that KollaWhiteBoard pad15:49
yoctozeptohZUCejCWzEctlM9uquL5 - another token of despise from etherpad15:49
yoctozeptofungi: it's probably me15:49
yoctozeptoit fails for me and other folks15:49
funginot just "Uncaught TypeError: Cannot read property \'setStateIdle\' of null" but also some others15:49
yoctozeptowe are having a kolla meeting, must be the reason15:49
AJaegerclarkb: for now yes15:49
clarkbfungi: yoctozepto the other day we were theorizing with subline that it is the client15:49
yoctozeptochrome 81? :D15:50
yoctozeptotoo fast? too slow?15:50
mordredclarkb: that some browsers are doing a bad thing?15:50
yoctozeptotoo awesome?15:50
clarkbyoctozepto: not specific versions of browsers but state in your browser15:50
yoctozeptolemme try another15:50
fungi"Error: Can't apply USER_CHANGES, because Trying to submit changes as another author in changeset ..."15:50
clarkbso try another or try private browsing mode15:50
yoctozeptoyeah, I tried incognito already15:50
clarkbfungi found a bug from etherpad that showed etherpad is super sensitive to client activity too :/15:51
prometheanfireian: fungi: have time to look at ? (glean systemd-resolved thing)15:51
fungi"[ERROR] console - (node:1) UnhandledPromiseRejectionWarning: TypeError: Cannot read property 'colorId' of null"15:51
clarkband the bug fix proposed was client side too I think15:51
clarkbrather than making the server robust15:51
fungi[WARN] client - TypeError: pad.collabClient is null"15:52
yoctozeptohmm, it works in firefox, but seems sluggish15:53
corvusfwiw, the kolla pad has been working for me without issue in ff for a while, but i haven't been writing15:53
yoctozepto(to load)15:53
clarkbAJaeger: also that sed seems to do the same replacement?15:53
clarkbAJaeger: it replaces python_version==3.8 to python_version==3.8 can you double check that?15:54
AJaegerclarkb: that's correct - it should update versions. Let me double check...15:55
AJaegerclarkb: we need to use '$VERSION' - that's the difference15:56
clarkbAJaeger: got it15:56
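The quoting difference AJaeger points at can be shown with a generic example (this is not the actual update_constraints script, just an illustration of single vs. double quotes around `$VERSION` in a sed expression):

```shell
# With double quotes the shell expands $VERSION before sed ever runs;
# with single quotes sed receives the literal text $VERSION.
VERSION=3.8
printf 'python_version==VER\n' > /tmp/quote-demo.txt
sed "s/VER/$VERSION/" /tmp/quote-demo.txt   # -> python_version==3.8
sed 's/VER/$VERSION/' /tmp/quote-demo.txt   # -> python_version==$VERSION
```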
openstackgerritMerged zuul/zuul-jobs master: Adds roles to install and run hashicorp packer
fungisaw a similar setStateIdle warning pop up for an unrelated pad16:02
fungii wonder if we're just running into tuning errors and today is the first day we've got the new deployment under typical load16:03
yoctozeptoyikes, it finally loaded16:04
fungii picked another pad i saw scroll by in the logs and am getting indefinite "loading" from it16:04
yoctozeptofungi could be right16:04
fungiokay, the one i was trying to load finally loaded16:05
openstackgerritBrian Rosmaita proposed openstack/project-config master: Change gerrit ACLs for cinder-tempest-plugin
AJaegerfungi, could you review , please?16:05
openstackgerritBrian Rosmaita proposed openstack/project-config master: Change gerrit ACLs for cinder-tempest-plugin
fungiokay, spotted a "[WARN] client - TypeError: r.dropdowns is undefined" for which is likely related to and the later
fungichecking to see if that pad is broken16:14
openstackgerritMerged openstack/project-config master: Update update_constraints for Py3.8
openstackgerritMonty Taylor proposed opendev/system-config master: Run Zuul using Ansible and Containers
mordredcorvus: ^^ ok - that's now updated to use pip to install the executor16:23
mordredYAY the system-config fix patch failed on a puppet module remote repo being unreachable16:24
corvusi'll re-enqueue it16:25
mordredok - cool16:25
corvusmordred: it doesn't make sense to land 231 right after 22716:27
corvusmordred: which do you want?16:28
mordredcorvus: why don't we just do 23116:28
corvusi'm starting to think we should just put all of nodepool in the emergency file and manually fix it16:28
mordredcorvus: yeah16:29
corvusbecause we're now at 1 hour past our decision to rollback and have made no progress on actually doing it16:29
mordredcorvus: i'll add nodepool to emergency16:29
corvusi'll start logging into the builders16:30
mordredcorvus: we just need the builders right?16:31
corvusmordred: yeah16:32
mordredk. done16:32
*** rpittau is now known as rpittau|afk16:33
fungito follow up on my earlier speculation, doesn't seem permanently broken (it loaded for me at least) so that "r.dropdowns is undefined" warning is apparently not always accompanied by a broken pad16:34
johnsomfungi FYI, we just noticed that pad won't open for some of us anymore. It times out.16:35
johnsomRough time to lose our priority planning etherpad I have to say.16:35
openstackgerritMonty Taylor proposed opendev/system-config master: Switch to prepare-workspace-git
fungijohnsom: yeah, i'm starting to suspect tuning issues. we just switched our deployment to use containers so may be hitting different performance limitations, but we also upgraded to a newer etherpad release so could just be hitting new regressions in the software16:36
corvusthis is nice -- it's easy to revert nb04 since its nodepool.yaml is a git checkout16:37
corvusbut i have to copy the file on nb01 and nb0216:37
mordredcorvus: yay for things being nicer in the future16:38
mordredcorvus: actually - I think the not-yet-landed project-config would make it back in to files - but I think "I want to easily revert a change in an emergency" is a good use case, so I'll make sure we retain the nb04 behavior when we roll that out16:39
mordredcorvus: nope - nevermind - it'll stay a git repo16:40
johnsomfungi FYI ^^^16:43
fungijohnsom: yep, that's another of the "[WARN] client - Uncaught TypeError: Cannot read property 'setStateIdle' of null" events16:48
clarkbfungi: does the apache status page show us filling up on connections?16:49
openstackgerritMerged opendev/system-config master: Just use synchronize to sync the repos
corvusclarkb, fungi, mordred: okay i think the configs are reverted on nb01, nb02, and nb0416:50
fungiclarkb: i was just working on trying to connect to it, we do seem to be flat-lining at 500 established per cacti, lower than typical before the switch as you can see at
corvusi think i should now delete the opensuse-15-0000086893 and opensuse-15-0000086892 dib images?16:51
corvusthat should then prompt nb02 to upload the opensuse-15-0000053000 image?16:51
mordredcorvus: yes - and nb should upload the old one16:51
clarkbfungi: I wonder if our old tuning isn't applying because bionic apache uses a new mpm worker system compared to xenial16:52
clarkbfungi: but ya that seems like a good thread to pull on16:52
fungii've confirmed that the new deployment at least hasn't broken reaching from a local shell on the server (and also hasn't inadvertently exposed it to the public)16:54
fungifirefox always likes to pick the worst possible times to tell me it needs to restart for an upgrade16:56
fungithe scoreboard still has quite a few open slots16:58
corvusokay, we have no opensuse-15 images now; i don't see an upload happening yet16:58
fungi148 requests currently being processed, 2 idle workers16:59
fungiclaims we're still using the "event" mpm16:59
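The symptoms fungi describes (established connections flat-lining, 148 busy workers and only 2 idle) are consistent with hitting an mpm_event worker cap. The knobs involved look roughly like this; the values below are illustrative examples, not the actual etherpad server's config:

```apache
<IfModule mpm_event_module>
    ServerLimit                8
    ThreadsPerChild           64
    # Hard cap on concurrent request-handling threads; must not exceed
    # ServerLimit * ThreadsPerChild (8 * 64 = 512 here).
    MaxRequestWorkers        512
    AsyncRequestWorkerFactor   2
</IfModule>
```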
corvusi wonder if it's because there's still a deleted image in vexxhost for it16:59
mordredcorvus: I didn't think we blocked on that - but maybe I'm wrong?17:01
corvusinstance d2d73e84-d988-4605-a596-b0ddef9b2b23 in vexxhost has been deleting for 18 days17:01
mordredthat didn't block it from uploading the new image last night17:01
clarkbcorvus: that's "normal" because openstack17:01
corvusmordred: right but this is an *old* image17:01
mordredgood point17:01
corvusit's an image that already has existing zk records because it "exists" because it's "deleting"17:02
corvuscan anyone try deleting the that instance while i try to figure out what nodepool should do in this case?17:02
clarkbfungi: taking a quick look at the server we have /var/log/apache2 logs for gerrit vhost (I think that must be a copy paste error from taking the gerrit ansible and adapting it for etherpad) we should clean that up17:02
mordredcorvus: I'll take a stab at it17:03
fungiclarkb: yeah, i saw that too17:03
clarkbfungi: I'll take a look at that now while I'm thinking about it17:03
fungiinterestingly we're logging traffic in those17:04
clarkbmordred: corvus ime there are two states. One is where volume is attached to a server that does not exist. That we can clean up by removing the attachment and deleting the volume. The other is server refuses to delete which keeps the whole resource chain alive. That requires cloud intervention17:04
fungioh, but they're etherpad access requests17:05
clarkbalso I do not think that would affect nodepool's ability to make new images17:05
corvuswe don't want it to make a new image17:05
corvuswe want it to upload an old image17:05
openstackgerritMerged openstack/project-config master: Revert "Move suse builds to nb04, drop pip-and-virtualenv"
mordredcorvus: I don't see that instance in vexxhost - by instance you mean server here right?17:07
corvusmordred: yes17:08
corvus| 0014437332 | vexxhost-sjc1       | opensuse-15               | d2d73e84-d988-4605-a596-b0ddef9b2b23 |    | 2604:e100:3:0:f816:3eff:fe52:b724       | deleting | 18:02:57:09  | locked   |17:08
mordredcorvus: thanks17:08
openstackgerritClark Boylan proposed opendev/system-config master: Fix etherpad port 80 logging
clarkbfungi: ^17:08
corvusmordred, clarkb, fungi, AJaeger, cmurphy: i figured out why nodepool isn't uploading the old image17:08
corvusit is no longer on the filesystem of the nodepool image builder17:09
corvusso we've lost it17:09
AJaegeroops ;(17:09
fungiahh, right, we "fixed" that in nodepool17:09
corvusfungi: we what?17:09
clarkbcorvus: fungi the change was once all images were in a deleting state we could delete the image from disk17:09
fungibecause before, it would pile up local copies of images on the builders until they could be completely deleted from all the providers17:09
corvusit's not in a deleting state, it's ready.17:09
clarkbrather than waiting for the image to delete from the cloud because what was happening is vexxhost was failing to delete many images and then our disks filled up17:09
corvusthis was deleted out from under nodepool.17:09
fungioh, then that's different17:10
corvusthough, also, that's a really unfortunate nodepool behavior17:10
fungiwhat got fixed was to have the builder delete the local copy once it told all providers to delete their copies, whether or not the delete command was successful/completed17:10
clarkb| 0000053000 | 0000000002 | vexxhost-sjc1       | opensuse-15          | opensuse-15-1580919581          | c5b3b55a-4c74-4d41-998c-265342ab3afc | deleting | 33:14:40:58 |17:10
clarkbis it that image? beacuse it is deleting17:10
clarkbfungi: right because otherwise we'd need like 10TB of disk17:11
corvusthat's the *upload* not the diskimage17:11
corvus| opensuse-15-0000053000          | opensuse-15          | nb02        | qcow2,raw,vhd | ready    | 70:01:24:58  |17:11
corvusthat is the image that nodepool told us is ready to be uploaded17:11
fungibecause nodepool has no control over whether providers actually follow through on image delete requests, so we were filling up the hard drives when providers failed to be able to process a delete for various reasons17:11
corvusyeah, i get it17:11
corvusso 1 of 2 things happened here: either one of us deleted the image from disk behind nodepool's back to free up space17:11
corvusor, somehow this new behavior change we made to nodepool did apply to this case, in which case, we seem to have programmed our software to lie to us17:12
clarkbcorvus: I think that may be the case because we can't remove the image record until all uploads are done due to the zk fs hierarchy? and maybe that's a bug where we need to update the state on the dib build as a result?17:12
corvuseither way, we just blew 2 hours of work17:12
corvusbecause it said "ready" when it wasn't17:12
corvusclarkb: if nodepool deleted the diskimage, then there is no excuse for it saying "ready".  we have "deleting" for that.17:13
clarkbbasically the record can't go away until all the uploads go away so we need to update that record state and it may be a bug that we don't (I haven't checked that in the code)17:13
clarkbcorvus: I get that, but code has bugs17:13
clarkbit's clearly not intentional if that is the case17:13
corvusmordred: can you check the openstack state of the image with id c5b3b55a-4c74-4d41-998c-265342ab3afc ?17:14
mordredcorvus: it shows active17:14
mordredcorvus: is that the image we want?17:14
corvusclarkb: yes, we agree that if that is the case, then it's a bug17:15
clarkbmordred: yes, that is the copy of the image in vexxhost sjc117:15
mordredcorvus: we can download it from openstack17:15
corvusmordred: yes.  so we may be able to convince nodepool to continue to use that17:15
mordredcorvus: why don't we download it as well, just to be on the safe side17:15
corvusmordred: first thing: were you successful in deleting that instance?17:15
clarkbI wasn't around when all of this was originally debugged. Did we decide we can't roll forward for some reason (thinking about options here)17:15
fungiclarkb: there are issues raised with the next steps job config changes17:16
corvusclarkb: see my -1 on
clarkbfungi: right but one that is easily fixable17:16
mordredcorvus: no - I do not see it when I look for it - which is very strange to me17:16
corvusmordred: neat, at least we're in stasis17:16
corvusmordred: then yes, let's start by downloading that17:17
mordredcorvus: ok. I'm going to do that now17:17
fungiclarkb: seemed mostly a decision as to whether it would be more work/faster/better guaranteed to return to a known state17:17
fungithough i think it was also assumed at the time that rolling back to the older image would be relatively easy17:18
corvusyep, and that is SOP in situations like this17:18
clarkbya, I think the thing that makes this odd is we haven't been able to build that image for months (very similar to the fedora situation)17:19
clarkbnormally I would agree17:19
clarkband probably would have this morning. Just wanting to make sure the other options were considered too (and if so what counted against them)17:19
mordredcorvus: I am downloading the image to /opt/nodepool_dib/ on nb0217:21
mordred~/osc/bin/openstack --os-cloud=vexxhost --os-region-name=sjc1 image save c5b3b55a-4c74-4d41-998c-265342ab3afc --file=/opt/nodepool_dib/
corvusmordred: cool, i think when that finishes we probably want to make md5 and sha256 files, then copy that to opensuse-15-0000053000.raw17:23
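The checksum step corvus mentions can be sketched like this on a dummy file (the `<image>.<ext>.md5` / `<image>.<ext>.sha256` naming and the md5sum-style file contents are assumptions based on this discussion, not verified against nodepool's source):

```shell
# Stand-in for the downloaded raw image.
IMG="$(mktemp -d)/opensuse-15-demo.raw"
echo "fake image data" > "$IMG"
# Write the companion checksum files next to the image.
md5sum "$IMG"    > "$IMG.md5"
sha256sum "$IMG" > "$IMG.sha256"
# Sanity-check both files round-trip.
md5sum -c "$IMG.md5" && sha256sum -c "$IMG.sha256"
```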
corvusnodepool also expects qcow2 and vhd17:24
corvusmaybe we can just let it fail those uploads?17:24
corvusor maybe we can edit the zk record17:24
mordredwe could convert them17:24
corvusor that17:24
mordredwe have the conversion tools on the host after all17:24
clarkbnote that nodepool may try to delete them again if that image does end up deleting (periodic cleanup by provider maybe)17:24
fungichecking the etherpad apache server-status periodically, we have 5 of the currently 11 running workers perpetually in "stopping" state due to being on an old config generation, so not accepting connections. though i don't think that's currently causing issues because there are as many open slots for more worker processes too17:27
fungii take that to mean we've updated the apache config since the parent started, and those workers are in a graceful shutdown but still have existing clients who haven't closed out (or where the line has gone dead and apache doesn't know they're never coming back)17:29
openstackgerritMonty Taylor proposed opendev/system-config master: Use project-config from zuul instead of direct clones
corvusmordred: not the speediest process is it?17:31
mordredcorvus: nope17:32
mordredcorvus, clarkb : ^^ I had to rebase that patch due to merge conflict17:33
corvusmordred:  do you know how to do those conversions?17:34
fungii spot-checked one of the "Cannot read property 'setStateIdle' of null" hits in the log just now and found it correlated to a request which started for the old domain (determined through correlation with /var/log/apache2/etherpad.openstack.org_access.log since we're logging that redirect vhost separately). will try to see if that is consistent17:34
mordredcorvus: I can pull it out of the dib source17:35
corvusmordred: cool, if we want to do that, now's probably a good time to get that ready17:35
corvusmordred: is the d/l finished?17:36
mordredcorvus: yes - it just finished17:36
mordredcp $TMP_IMAGE_PATH $1-intermediate17:36
mordred        vhd-util convert -s 0 -t 1 -i $1-intermediate -o $1-intermediate17:36
mordred        vhd-util convert -s 1 -t 2 -i $1-intermediate -o $1-new17:36
openstackgerritClark Boylan proposed zuul/zuul-jobs master: Check if pip is preinstalled before installing it
corvusmordred: cool, i'll do the work for the raw image if you want to convert17:36
mordredthat's the "convert to vhd" steps17:36
corvusmordred: or i can do those too17:36
mordredI can do the converts17:36
mordredyou want me to wait until you've renamed?17:36
mordredor should we cp and keep this as-is just in case?17:37
corvusmordred: i was going to copy, in order to avoid nodepool deleting it :)17:37
clarkbI think addresses corvus and tristanC's concern with ensure-pip17:37
clarkbI'm hoping that existing tests will help point me in the right direction as far as testing goes (because so many of our images come preinstalled with pip)17:37
corvusi was told there was a 'plain' image to verify this stuff17:38
clarkbcorvus: oh right that one should be "clean" lets see if it runs tests yet17:38
mordredclarkb: we should make sure that the 3pci shows it doesn't try to reinstall17:38
mordredcorvus: wow. even just copying the image is slow17:39
corvusmordred: i'm several minutes into an md5sum17:39
clarkbmordred: ya if it clears our initial tests I can rebase into ianw's stack and that should have 3pci run it17:39
tristanCclarkb: commented17:39
clarkbtristanC: because `pip` should always be present regardless of python version17:40
clarkbthen we check version specifics based on what is enabled17:40
tristanCclarkb: i meant the change assigns shell variable using `if` jinja statement, and then it evaluate content based on `if` shell statement. couldn't the type command be selected by the jinja `if` statement?17:41
clarkbtristanC: it could but I thought it was easier to set flags (basically translate yaml truthyness to bash truthyness) then evaluate the results in a bash context17:42
clarkbthis way you don't have to parse jinjayamlbash17:42
clarkband instead its jinjayaml then bash17:43
tristanCclarkb: hmm ok17:44
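The flag-translation pattern clarkb describes might look like this inside the role's templated script (a sketch; the variable and flag names below are illustrative, not the real ensure-pip source):

```jinja
{# Jinja evaluates the YAML boolean once and emits a plain shell variable... #}
{% if ensure_pip_from_upstream | default(false) | bool %}
from_upstream=true
{% else %}
from_upstream=false
{% endif %}

# ...so the rest of the script branches in pure bash, with no Jinja mixed
# into the shell conditionals.
if [ "$from_upstream" = "true" ]; then
    echo "would install pip from upstream get-pip.py"
else
    echo "would install pip from distro packages"
fi
```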
tristanCiirc, a tox user who wants python2 needs to set both `tox_prefer_python2: true` and `ensure_pip_from_packages_with_python2` ?17:47
clarkbtristanC: or ensure_pip_from_upstream and ensure_pip_from_upstream_interpreters has python2 in it17:48
*** dpawlik has quit IRC17:48
clarkbthough maybe what you mean is if you want python packages you need that? since pip_from_upstream doesn't imply python packages17:49
smcginnisHopefully quick and easy question - do we have any nodes with py38 available yet?17:49
clarkbsmcginnis: I believe the bionic nodes can do that with special packages (the tox-py38 enables them)17:50
smcginnisPerfect, thanks clarkb.17:50
mordredcorvus: qcow2 image convert should be: qemu-img convert -f raw -O qcow2 opensuse-15-0000053000.raw opensuse-15-0000053000.qcow217:51
fungismcginnis: yeah, if you just use the py38 jobs they should work magically17:52
mordredcorvus: I am currently doing the second stage of the vhd convert17:52
smcginnisfungi, clarkb: Would that include openstack-tox-functional jobs?17:52
clarkbsmcginnis: I don't know. You may have to add the package installs that tox-py38 does to tox-functional if it isn't already there17:53
fungismcginnis: yeah, no clue, i've only seen folks using the tox-py38 job so far17:53
smcginnisGuess we'll find out.17:54
fungibut that installs the python3.8 package on the default image17:54
fungier, default node type17:54
*** ralonsoh has quit IRC17:55
corvusmordred: nodepool is attempting to upload images now (but failing since not all files are in place)17:56
corvusso either the md5sum file or the "vhd-new" file is enough for it to think there's an image there17:56
corvusanyway, i think that's good, harmless, but chatty :)17:56
corvusokay, all 3 raw pieces are in place17:57
mordredcorvus: cool17:58
corvusand it looks like we're really uploading to vexxhost now17:59
mordredcorvus: I'll do the qcow2 conversion as soon as the vhd conversion is done17:59
corvusmordred: cool -- want me to do the checksums and rename for vhd, or you?17:59
mordredcorvus: if you could do the checksums that would be neat18:00
openstackgerritClark Boylan proposed zuul/zuul-jobs master: Check if pip is preinstalled before installing it
corvuswill do; lemme know when it's ready18:00
mordredwill do18:00
clarkbre image conversions for qcow2 you might need to set the compatibility flag. I'm not sure if we ever managed to decide if that was or wasn't needed anymore18:01
clarkbthe --compat=0.10 or something similar flag18:01
mordredclarkb: I believe we stopped doing it18:01
mordredcorvus: done18:01
mordredclarkb: we were only doing that for hp cloud anyway18:01
clarkbmordred: ya at this point it would surprise me if there were any qemu-imgs in the wild old enough to trip over that18:02
mordredclarkb: I also don't see us setting QEMU_IMG_OPTIONS18:02
mordredcorvus: I am now doing the qcow conversion18:03
clarkbmordred: well if some clouds are unhappy with it without the flag we'll learn something :)18:03
clarkb(and probably be able to suggest strongly that people upgrade qemu)18:03
mordredclarkb: re-review ?18:11
* mordred would like to get that done since there's a manual transition step and we're sort of in the awkward half-rolled-out stage :)18:12
clarkbmordred: is that gonna need a new rebase when the git role changes ?18:12
clarkbmaybe we should decide on an order there with some depends on?18:12
mordredclarkb: or else the git role change will need a rebase18:13
mordredclarkb: I'd like to get the zuul one landed first (deleting the extra ansible.cfg is important) - then I'll rebase the other one18:13
mordredclarkb: (turns out that ansible.cfg in the root of the repo was a bad idea)18:14
mordredcorvus: qcow2 is done18:14
corvusneat, still waiting on the sha256 from vhd :)18:15
mordredcorvus: have brainspace for a rebase re-review of while we wait? (ok if not)18:15
openstackgerritMerged opendev/system-config master: Fix etherpad port 80 logging
mordredhrm. that patch was unhappy in deploy ... why18:17
openstackgerritMonty Taylor proposed opendev/system-config master: Remove infra-prod-update-system-config from etherpad
mordredfungi, clarkb : ^^18:18
corvusvhd summed and moved into place18:19
mordredcorvus: wot18:19
corvusmordred: did you create the qcow2 as the final name?18:20
mordredcorvus: oh - I did. cause I'm dumb18:20
corvusmordred: i think it's been uploading the qcow2 for 15 minutes, but it only finished converting 5 min ago18:20
corvusi don't know what that's going to do18:20
corvusthat's kna1, mtl01, limestone, openedge, ovh18:21
corvusmaybe it'll just work?18:21
mordredcorvus: maybe? maybe it'll just be reading from a file that's being appended to18:22
corvusi don't know if it does anything with sizes or checksums beforehand though18:22
clarkbcorvus: it does, but I don't think it checks any of that except for on rax18:23
clarkband there it's just checking the checksum for reuploading purposes?18:23
corvusokay, checksum files for qcow2 are in place18:25
corvusit looks like we're now really uploading everywhere18:25
mordredcorvus: woot18:27
mordredclarkb: heh. your zuul-jobs fix for opensuse failed on there being no opensuse images18:34
clarkbmordred: yup, it also failed on -plain and centos ps1 but ps2 looks good18:35
clarkbI think that implies our testing has reasonable coverage18:36
mordredyeah. I agree18:36
AJaegerteam, I'm puzzled gives me a 404 - but exists18:36
AJaegerlooking at the last promote job via - everything looks fine.18:38
AJaegercan we run the promote job again?18:40
openstackgerritDonny Davis proposed openstack/project-config master: Adding custom label to OE for airship support
AJaegeror has anybody an idea why after the upload there's no content?18:42
clarkbAJaeger: I think the job log records what it rsyncs? /me is lokoing18:42
clarkb heh I guess not18:43
clarkbAJaeger: is ^ that the job that needs to be rerun?18:43
AJaegeryes - and rsync output is
clarkbAJaeger: hrm that seems to show files being copied to the correct place. We need an index.html for your url to work right?18:46
clarkb(and there is an index.html copied)18:47
clarkbAJaeger: if you look in afs the files are there18:48
openstackgerritAndreas Jaeger proposed openstack/project-config master: Use TOX_CONSTRAINTS_FILE in release script
AJaegerclarkb: they are in afs - but not displayed on docs.o.o?18:49
AJaegerclarkb: do you also get a 404 on ?18:49
clarkbAJaeger: I do18:49
clarkbif I try to navigate to /afs/ on static01.o.o that fails18:50
clarkbso I think this is an afs issue18:50
clarkbperhaps related to [Wed Apr 15 00:08:56 2020] afs: Waiting for busy volume 536871090 () in cell openstack.org18:50
clarkbI'm going to try and invalidate the cache things for that path18:51
clarkbI just have to remember how to do that18:51
fungioh, maybe the vos release hasn't completed?18:54
clarkbmaybe? fwiw `fs flush` on that path fails because it thinks it doesn't exist18:55
clarkbflushing the parent dir didn't help18:55
fungi/afs/ wouldn't exist, would it? i thought that was a redirect18:55
clarkbfungi: its perfectly navigable on hosts that are not static01.opendev.org18:56
fungioh, i see18:56
clarkbfungi: and the log AJaeger shared shows we copy directly into it18:56
fungiand also /afs/ is navigable from static18:56
clarkbI don't think it is a redirect18:56
clarkblistvldb shows there isn't a release in progress18:56
fungioh, right, we redirect *to* latest18:57
fungiyeah, so this does seem like a cache problem if other clients see it18:57
fungiyesterday's kernel upgrade seems to have broken my local openafs lkm18:57
clarkbwe could try restarting openafs services on static or rebooting it18:57
clarkbwe could also try a flushvolume18:58
clarkbwhich is the more heavy handed version of flush that applies to the volume entirely18:58
clarkbshould I try fs flushvolume first? that seems like maybe the least heavy handed thing we can do next18:58
clarkbdidn't help18:59
corvusAJaeger, clarkb, fungi, mordred, cmurphy: some uploads of the old image have completed, so i think we should be back to where we were yesterday19:00
fungithanks corvus, mordred!19:00
clarkbcorvus: I can recheck my ensure-pip change to check19:00
AJaegerthanks, corvus and mordred !19:00
clarkbalso looks like systemctl stop openafs-client ; systemctl start openafs-client might be the next thing to try on static?19:01
clarkbthat will blip everything though19:01
fungiless of a blip than a reboot at least19:01
fungibut yeah, that's where i'd go next unless corvus has suggestions19:01
cmurphythanks corvus19:02
fungiclarkb: is the kernel logging anything19:02
fungiahh, just the "Waiting for busy volume"19:02
clarkbfungi: kern.log just shows those waiting for busy volume19:02
fungiclarkb: more than one volume though19:02
clarkblet me see what volume that id belongs to19:02
fungilooks like they were all for volume 536870992 in previous weeks, but 536871090 is the one from earlier today19:03
*** hashar has quit IRC19:04
clarkbproject.airship maybe? it's got 536871091 and 536871092 now19:04
fungithat's for the site19:06
clarkbwhich isn't where python-cinderclient docs are stored so could eb those warnings are just noise?19:07
fungii'm suspecting they may be unrelated, yes19:07
fungiespecially since they're occurring infrequently19:08
fungithere's only one entry in dmesg from today, and it was around 08:00z if memory serves19:08
AJaeger is an opensuse-15 log ;)19:11
clarkbI'm not coming up with anything better than stop starting openafs-client. Except for maybe use the rw volume for now19:11
clarkb(and that will let us debug further)19:11
AJaegerand clarkb's change passed now19:13
*** factor has quit IRC19:14
*** factor has joined #opendev19:14
corvusoy, there's another fire?  /me catches up on afs stuff19:14
clarkbfwiw I checked lsof against that path and its parent and it says nothing has parent open and child doesn't exist19:15
clarkb(just in case there would be clues in the kernel file tables)19:15
openstackgerritMerged opendev/system-config master: Use project-config from zuul instead of direct clones
openstackgerritMerged opendev/system-config master: Remove infra-prod-update-system-config from etherpad
clarkbmordred: ^ fyi19:17
mordredclarkb: woot19:17
mordredI have renamed the zuulcd user and moved the home dir - so that _should_ run without issue19:18
clarkbI need to find lunch. On static.o.o's /afs/ issue my only current input is that maybe we need to restart openafs-client there. I can't find anything in logs or vos output saying that it is unhappy. But it definitely doesn't seem to stat19:20
clarkbthe dir does stat and is navigable on other hosts19:20
corvusclarkb: i have run some flush commands as root and they made it better19:20
clarkbcorvus: hrm I ran fs flush on the cinderclient/ and cinderclient/latest paths as well as flushvolume on cinderclient/ and cinderclient/latest19:21
corvusclarkb: as root?19:21
clarkbcorvus: I ran those from static01. did you do differently?19:21
corvushuh.  then maybe the 'fs checkvolumes' command helped19:21
corvusi ran that as non-root, but initially didn't think it did anything, but i may have been mistaken19:22
AJaeger is working now - thanks!19:22
corvusat any rate, some combination of those 3 commands run as some combination of non-root and root seem to have helped19:22
clarkbcorvus: that is what I ran19:22
corvusif it happens again, maybe we can narrow it down more19:22
AJaegerlet me spider again ;)19:23
AJaeger(openstack-manuals merge does some sanity check for indices)19:23
corvusclarkb: me too, though i did it from the python-cinderclient directory against '.'19:23
corvusmordred: i think we can remove nb from the emergency file now, yeah?19:24
clarkbso ya maybe checkvolumes was what we needed. I'll keep that in mind for testing if this comes up again19:24
clarkb(basically try that first then test paths I guess)19:24
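The flush sequence discussed above, collected as a runbook sketch. The exact AFS paths here are assumptions for illustration (the log only says "cinderclient/ and cinderclient/latest"); run on the affected client host, some commands may need root:

```console
# flush individual cached files/dirs (run on the AFS client host)
fs flush /afs/openstack.org/docs/python-cinderclient
fs flush /afs/openstack.org/docs/python-cinderclient/latest
# flush everything cached for the containing volume
fs flushvolume -path /afs/openstack.org/docs/python-cinderclient
# ask the cache manager to re-check volume mappings; per the discussion,
# this is the command that most likely fixed things -- try it first
fs checkvolumes
```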
mordredcorvus: yes - I agree - I'll do that in just a bit19:28
mordredcorvus, clarkb : the project-config change did not work - we hit retry limit on it in deploy pipeline - I'm looking on the zuul scheduler to try to figure out why19:29
clarkbmordred: k, I'm making a burger but can help after lunch19:29
dirkcorvus: ajaeger: cmurphy: the original issue is fixed if we'd get a new dib release19:32
dirkThere is a fix in there that would make the pip-and-virtualenv element work again and then we have time to figure out things19:33
mordredclarkb: I may need it - I'm not sure what I'm looking for :(19:33
clarkbmordred: usually if you grep the job name you find the jobs that ran. They'll have an event id in the logs then you grep that id and do a trace19:33
clarkbat least thats been how I've debugged similar in the past. Also you can look in logstash if we are caught up19:34
clarkbbut it will only have info if there were logs published19:34
openstackgerritMerged openstack/project-config master: Adding custom label to OE for airship support
mordredclarkb: I'm dumb19:36
mordredclarkb: I missed a rename19:36
mordredclarkb: turns out - when you rename a user in /etc/passwd - you ALSO need to rename the user in /etc/shadow :)19:36
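The gotcha mordred hit can be demonstrated on throwaway copies of the two files (in real life `usermod -l` edits both for you; file contents below are made-up example entries):

```shell
set -e
tmp=$(mktemp -d)
# fake /etc/passwd and /etc/shadow entries for the old user name
printf 'zuulcd:x:1000:1000::/home/zuulcd:/bin/bash\n' > "$tmp/passwd"
printf 'zuulcd:*:18000:0:99999:7:::\n' > "$tmp/shadow"
# renaming only in passwd leaves shadow stale -- the failure described
# above; the rename has to hit BOTH files:
sed -i 's/^zuulcd:/zuul:/' "$tmp/passwd" "$tmp/shadow"
grep '^zuul:' "$tmp/passwd" "$tmp/shadow"
```

With only the passwd edit, authentication for the renamed user breaks because the shadow entry still carries the old name.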
*** factor has quit IRC19:36
mordredclarkb: I want enqueue-ref for re-triggering the deploy pipeline right?19:38
mordredcorvus: ^^ ?19:39
mordreddoes zuul enqueue-ref --pipeline deploy --ref refs/changes/43/719343/19 --trigger gerrit --tenant openstack --project opendev/system-config  look reasonable?19:41
mordredor I need newrev and oldrev don't I?19:42
*** factor has joined #opendev19:43
mordredit's a change-merged trigger - so I think I don't19:44
corvusfor change merged you want 'enqueue'19:44
mordredah - cool19:45
corvusshould be just like a check/gate enqueue19:45
*** osmanlicilegi has quit IRC19:45
mordredcorvus: zuul enqueue --pipeline deploy --change719343 --trigger gerrit --tenant openstack --project opendev/system-config19:45
mordredso that look ... sigh. with a =19:45
mordredzuul enqueue --pipeline deploy --change 719343 --trigger gerrit --tenant openstack --project opendev/system-config19:46
mordredthat look more sane?19:46
corvus--change 719343,1919:46
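Putting corvus's corrections together, the final working invocation (change number taken from the discussion above; `enqueue` rather than `enqueue-ref` because this was a change-merged trigger):

```console
zuul enqueue --tenant openstack --trigger gerrit \
    --pipeline deploy --project opendev/system-config --change 719343,19
```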
mordredk. hopefully it'll work more better this time19:47
mordredcorvus: it is at least running - so yay!19:48
mordredcorvus: I have removed nodepool from emergency - so we should get a nodepool ansible run this time too19:50
mordredcorvus, clarkb : infra-prod-install-ansible has run successfully from /home/zuul19:51
corvusmordred: woot!19:51
mordredcorvus, clarkb : we should be able to land now19:52
openstackgerritMonty Taylor proposed opendev/system-config master: Add /opt/lib/git to the volume mounts
mordredcorvus, clarkb also that ^^ which should fix the local mirror issue19:55
*** jkt has quit IRC20:09
AJaegerconfig-core, please review  - small cleanup for release20:10
*** jkt has joined #opendev20:10
openstackgerritMonty Taylor proposed opendev/system-config master: Run Zuul using Ansible and Containers
corvusmordred: i'm about to start looking into your question on ^, ok?20:17
corvusmordred: i suspect the answer is "no it's not still accurate, and we are using the normal python3 gear install that we get in the ansible envs"20:19
corvusmordred: yep, pretty sure that's the case20:22
mordredcorvus: cool!20:23
mordredcorvus: that excites me20:23
mordredcorvus: I'll remove that from the comment in the next iteration then :)20:23
corvusmordred, clarkb, fungi: so... i can't remember if i asked this question or not -- i do remember i was typing it into irc right as all the fires exploded.  for the zk tls work, we can either (a) run manually on bridge and copy the resulting keys into private hostvars (like we did for the gear certs).  or (b) we could (all in ansible) run on bridge and slurp up the keys to put them on the20:25
corvuszuul hosts.20:25
corvusi kind of like (b) -- the keys aren't precious, it just means that if we lose bridge, we will end up rekeying zuul.  that doesn't sound like a big deal.20:26
mordredcorvus: because won't make new certs if we already have old certs, right?20:27
corvusyep, it's idempotent20:27
mordredyeah. so I like b20:27
mordredI don't see any point in managing them in private hostvars if we don't have to20:27
corvuscool, i'll start down that path then20:28
corvus(i will make followup changes to 717620)20:28
mordredfwiw - manage-projects did not run well this trigger20:29
mordredI am now investigating20:29
mordredit failed on synchronize with no logs because no_log20:31
mordredI'm going to say "shrug"20:31
corvusmordred: was that on the superseded patch?20:31
corvuswhere we replaced synchronize with the git role?20:31
clarkbcorvus: how does the slurping in b) operate? is it different than putting things in private vars?20:32
mordredno - the superseded patch landed - I just re-approved the patch to replace it with the git role20:32
corvusmordred: i mean, what 'synchronize' operation failed?20:33
mordredthe synchronize that we're replacing with the git role20:33
corvusok, that was my question, sorry for being unclear. i agree that shrug is the right answer20:33
fungiclarkb: i was assuming copying files20:33
mordredyeah - if the other thing fails, I'll debug _that_20:33
corvusclarkb: yeah, it would mean a task to copy the file from bridge to the remote zuul/nodepool node20:33
fungicorvus: plan b sounds safe, and less hands-on20:34
clarkbcorvus: gotcha so major difference is not tracking it in git history20:34
clarkbya I think that is fine for this use case20:34
funginot as fantastical as plan 9, but then what is?20:34
corvusoh, we're going to need all of nodepool out of puppet for this too20:34
mordredplan 💩20:34
corvusis there anything preventing rolling nb01/02/03 into containers now?20:35
corvus(afaik nb04 is good, with no outstanding issues)20:35
mordredcorvus: I don't think so - I think ianw was going to start rolling each of them out20:35
fungicorvus: yeah, i think that was ianw's plan next, once the pip-and-virtualenv bits are settled20:35
mordredcorvus: I think we need to do all of zk too, yeah?20:36
corvuscool, i'll go ahead and write the skeleton of this change, but clearly we won't be able to land it until that happens20:36
corvusmordred: yeah20:36
mordredcorvus: cool. are you doing that bit in your change? or want me to start working on a change for that. also - for nodepool-launcher20:36
corvusmordred: i'll focus on the CA aspects for now, and deploying to zuul; if you want to start on zk and nodepool-launcher, that'd be great; i can pitch in on that when this is done20:37
corvusthen if all that's done, we can help ianw with the nb rollout :)20:37
corvusmordred: oh, i just did a bunch of docker testing for zk, let me grab my docker-compose file20:38
clarkbbefore we start deploying more services with docker compose it might be a good idea to land and its child20:38
corvusclarkb: the names are changing?20:38
mordredcorvus: yeah - isn't that swell?20:39
clarkbcorvus: yes docker-compose was chomping the - in dir names but now it doesn't20:39
corvusclarkb: what happens with the upgrade?20:39
clarkb is my attempt at testing that upgrade path20:39
clarkbcorvus: ^ seems to show everything works even with the name change, but reviewing this upgrade change is probably worthwhile too20:39
clarkbI was also hoping I could spend a bit more time trying to formalize what that change does into a generic upgrade testing job/tool20:40
corvuswhat does "work" mean?  does it restart/recreate containers or does it just recognize old names as its own containers still?20:40
clarkbcorvus: based on testing it stopped the old containers and started the new containers with no problems despite the name change. When I didn't do the updated test sed's in that change we failed testinfra tests because those old containers did not exist anymore20:41
clarkbcorvus: that implies to me that its stopping old name properly, then starting new name properly20:41
clarkb(the job runs everything with old version, upgrades docker-compose, runs docker-compose up --force-restart, then reruns testinfra)20:42
corvusclarkb: does --force-recreate cause the restart?20:42
corvuswe don't normally run that, right?20:42
clarkbcorvus: ya its the flag that says stop and start even if container images haven't changed20:42
corvusi'm just trying to figure out what happens to gerrit when we land
clarkbcorrect we normally rely on images to have changed in order to trigger the restarts20:42
corvusso if that's omitted, and we upgrade docker-compose, do we know what happens?20:43
clarkbcorvus: I see you're thinking that maybe new docker-compose will restart even without the force20:43
clarkbwe can test that :) one moment I'll get a patchset up for that case20:43
corvusyeah, it might (a) do nothing (yay) (b) restart without any prompting (meh) (c) run a second copy (boo)20:43
corvusmy guess based on your test so far is (a), but would be good to confirm that, because (c) would be bad.20:44
mordredfour legs good, c bad20:45
openstackgerritClark Boylan proposed opendev/system-config master: DNM Test docker-compose upgrade
clarkbcorvus: mordred ^ that runs docker-compose up -d which is what we normally run. Then it runs testinfra against the old names (we should expect this to pass), then it updates testinfra to check for new names and runs testinfra. This last testinfra run should fail20:46
clarkbif the last testinfra run passes it implies we are running both sets of containers20:47
clarkband if the second to last fails it implies a restart happens even though we don't force it to20:47
corvusmordred:  zk docker compose and config file from my testing -- for the first pass of containerization, we should drop all the tls stuff obviously20:48
corvusmordred: that's based on the upstream documentation for using the container images with docker-compose, so it's shiny and new20:48
corvusclarkb: ack sounds good, thx20:48
corvusmordred: and we have actual real different hosts, so we don't need to worry about the ports and docker-based hostnames and stuff20:49
clarkbalso actual different hosts are important for taking advantage of reliability there20:49
corvusand we probably have some tuning in our current config we should make sure not to lose20:50
clarkb(though I guess we can't guarantee they are on different hypervisors)20:50
corvusso all in all, maybe a few lines of that paste will be useful, but it's a good reference :)20:50
clarkbcorvus: ya we force it to rotate the journal and bump up the write to disk time20:50
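The journal-rotation tuning clarkb mentions lives in zoo.cfg; a sketch with placeholder values (these are illustrative knobs only, not our actual tuning -- check the current puppet config for the real values):

```ini
# illustrative zoo.cfg fragment -- values are placeholders
snapCount=10000              # txns logged before the journal rolls over
autopurge.snapRetainCount=3  # keep old snapshots/journals from piling up
autopurge.purgeInterval=1    # hours between purge runs
```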
openstackgerritMerged opendev/system-config master: Switch to prepare-workspace-git
mordredcorvus: ++20:58
openstackgerritMonty Taylor proposed opendev/system-config master: Run Zuul using Ansible and Containers
clarkbapparently my new yaml in that test change isn't valid for jinja?21:12
clarkbits the ' unbalancing again21:13
clarkbI should just start typing without any 's and the issue will go away21:13
openstackgerritClark Boylan proposed opendev/system-config master: DNM Test docker-compose upgrade
clarkbtrying again21:13
corvusclarkb: be like data; no contractions21:13
*** DSpider has quit IRC21:14
corvusmordred: there seems to be a chunk of puppet in the ansible for the zuul-scheduler role :)21:15
clarkbcorvus: what about compression?21:17
mordredcorvus: you're just imagining that21:17
corvusclarkb: i believe his upper spinal support is a poly-alloy, designed to withstand extreme stress21:18
openstackgerritMonty Taylor proposed opendev/system-config master: Run Zuul using Ansible and Containers
corvusmordred: left a second comment on that too21:19
openstackgerritMonty Taylor proposed opendev/system-config master: Run Zuul using Ansible and Containers
openstackgerritMerged opendev/system-config master: Add /opt/lib/git to the volume mounts
mordredclarkb, corvus : we're going to need to restart the gerrit container to pick that up21:31
mordredmaybe we should wait until the compose change lands too21:31
mordredso that we just do one restart21:32
corvusmordred: i'm inclined to restart asap -- i have clones from those urls and they are several days out of date.  maybe i'm the only one, but if not, then it's a service impact21:33
corvus(also, we're going to need to trigger a full replication of everything to that after restarting)21:33
clarkbya I think the only reason we need to wait is if we are worried about not restarting gerrit gracefully21:34
clarkbbecause the docker-compose stack also addresses ^21:34
corvusmordred: and our container is going to take up a lot of extra space -- so maybe we should --recreate it?21:34
corvusclarkb: is there a way to gracefully shut it down now?  with a plain docker comand maybe?21:35
mordredwe could do a docker-compose exec to send the hup21:35
mordredcorvus: and yes - let's do recreate for sure21:35
corvusand we don't have 'restart: always'?21:35
clarkbcorvus: ya I'm not sure. What we want to do is hup it then wait long enough for it to stop on its own21:36
clarkbwhich is less than a minute with our version of gerrit iirc21:36
corvusright i'm just wondering if we do that will docker restart it21:37
clarkboh mordred ^21:37
mordredwe do not have restart: always21:37
mordredso I think it will not21:37
corvusi guess we're waiting for that to land on disk21:38
mordredoh fun21:38
corvushow does that not have a default value21:40
corvusmordred: i guess just add that to the role invocation?21:40
clarkbcorvus: mordred maybe because we add_host the server21:40
clarkbI bet you can set it when you add host?21:41
openstackgerritMonty Taylor proposed opendev/system-config master: Add port and user_dir to add_host in prod playbook
corvusclarkb wins21:41
mordredyeah - I think that should do it21:41
mordredI added ansible_user_dir - just from looking at the role for other things it wants21:41
clarkbI feel like I'm learning a lot about ansible :)21:41
mordredclarkb: me too!21:41
mordred(we could also add a default(22) to the role there)21:42
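A sketch of the fix being discussed: pass the connection vars when the host is added dynamically. Hostname and values here are illustrative, not the actual system-config playbook:

```yaml
- name: Add gerrit server to the running inventory
  add_host:
    name: review.opendev.org        # illustrative hostname
    ansible_port: 22                # or default(22) inside the role
    ansible_user_dir: /home/gerrit2 # illustrative value
```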
clarkbmordred: should system-config-run-nodepool have a parent of system-config-run-containers? or does it not matter because it is consuming images from the zuul tenant?21:43
mordredclarkb: that's right - that base job is only for jobs where we're depending on containers we're building21:43
mordred(that's right- it doesn't matter)21:44
corvusmordred: we should run our zuul containers as non-root users21:44
corvus10001 is set up as the zuul user in the container21:47
corvuser in the image21:47
mordredcorvus: ++21:47
mordredwe should run them as that21:47
corvusand likewise, same number is the nodepool user in the np images21:47
mordredcorvus: the images set USER already ... so don't these start as that user absent other intervention?21:48
corvusdo they?21:48
corvusi didn't see that they did21:48
mordredoh - I guess not21:48
clarkbwhat does the USER directive do in that case?21:49
mordredwe don't do one21:49
corvusclarkb: the user directive says what user to run as21:49
corvusin the image21:49
mordredyeah- so if we DID do a USER, it would run as that - but we don't, so we need to set it in the compose21:49
corvusthe current state of the nodepool/zuul images is that they have a unix user created in the filesystem of the image, but they run as root by default.  but we can tell docker to run as that user.21:49
clarkbcorvus: oh its for build time21:50
corvusclarkb: USER affects build and run21:50
corvus(you can use it during build to switch users for build activities; and the last USER line also says what it will run as by default)21:51
clarkbgot it21:51
corvuswhich makes a weird sort of sense when you think of building and running images as the same thing, which docker does21:51
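corvus's point about USER, sketched as a minimal Dockerfile (base image and names are illustrative):

```dockerfile
FROM python:3.8-slim
# Create the runtime user in the image filesystem; uid 10001 matches
# what the zuul/nodepool images use per the discussion above.
RUN useradd -u 10001 -m zuul
# USER switches the effective user for subsequent build steps...
USER zuul
RUN whoami > /tmp/built-as   # this step runs as zuul during build
# ...and the last USER line is also the default user at `docker run`.
# With no USER line at all, containers start as root.
```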
openstackgerritMonty Taylor proposed opendev/system-config master: Run Zuul using Ansible and Containers
mordreduser: zuul added21:52
mordredgood catch21:52
corvuswe should be able to run as 10001 everywhere except zuul-fingergw, which still probably wants to be run as root since we run in host networking; that way it can grab the port and drop21:52
mordredoh. yeah. lemme fix fingergw - I forgot about port drop21:53
corvusoh, and i have no idea about nodepool-builder :)21:53
openstackgerritMonty Taylor proposed opendev/system-config master: Run Zuul using Ansible and Containers
mordredcorvus: I think just running n-b as root makes sense- otherwise it's just going to be sudoing all over the place anyway21:53
mordredoh - hah. we run as nodepool but with privileged: true on21:54
corvushuh, we apparently run the builders as the nodepool user21:54
mordredso I guess diskimage-builder sudos where necessary?21:54
mordredI mean - whatever it's doing is apparently working21:55
clarkbya it should sudo21:55
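Pulling the thread together, the compose-level settings discussed would look roughly like this (service names and images are illustrative, not the actual system-config files):

```yaml
services:
  scheduler:
    image: zuul/zuul-scheduler
    user: "10001"              # the in-image zuul user
  fingergw:
    image: zuul/zuul-fingergw
    network_mode: host
    # left as root: binds the privileged finger port, then drops
  nodepool-builder:
    image: zuul/nodepool-builder
    user: "10001"              # nodepool user; dib sudos where needed
    privileged: true
```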
clarkbcorvus: mordred it appears to have recreated the containers22:00
clarkbthats painful22:00
clarkbI'll get links to logs once the buildset reports22:00
corvusmordred: we have some files with owner: zuul... we may want to change that to owner: 10001 ?22:01
corvus(maybe later we could re-id the zuul / nodepool user as 10001?)22:02
corvusmordred: i'm looking at the 'add github key' task in your change22:02
mordredclarkb: ok. so we need to put review in the emergency file when landing that22:03
mordredcorvus: the zuul/nodepool user already is22:03
clarkb its recreating there then we fail when checking the old names at
mordredcorvus: we set them as 10001 in the images because that's what they are in opendev :)22:03
clarkbmordred: well review doesn't actually docker-compose up ever I think22:03
clarkbthats always manual22:03
corvusmordred: no way.  wow.  cool.22:04
clarkbbut all of the other services we'll need to have a think about?22:04
corvusclarkb: yeah, but i suspect they should all be okay.  except we'll probably leak something in nb04.  but we do anyway.22:04
clarkbya so maybe this is a "land it when there haven't been fires all day and we can pay attention to things as the change goes in"22:05
openstackgerritJames E. Blair proposed opendev/system-config master: WIP: add Zookeeper TLS support
clarkbI'll WIP the change now22:05
fungiclarkb: i'm looking forward to a day with no fires22:06
corvusmordred: ^ if you have a quick second to look at 720302 as an early draft, that'd be great22:06
clarkbgitea is the one I worry about most since our restart process relies on a new image build being available22:06
clarkbwe might be able to coincide the docker-compose update with a new image somehow and have it run through its normal updates22:07
corvusmainly looking for feedback about how i set it up for delegation.  the role is heavyweight, so that using it should basically be a one-liner to each of the zuul/nodepool service roles, then updating their config files to point to the locations.22:07
corvus(and yeah, i'm thinking of having the nodepool and zookeeper config files point to /etc/zuul/certs/cert.pem)22:08
corvus(cause why not)22:08
clarkbfungi: ya it might be wishful thinking. I just want to balance "restart all the things" against "we probably need to make this transition at some point so better when all the things is relatively small"22:09
clarkbwe could do it service by service too fwiw22:09
clarkbthen only merge docker-compose install into the ensure-docker role once all existing services use new docker-compose22:10
mordredcorvus: that's the flock incantation that waits for the lock?22:10
clarkbinfra-root ^ would you prefer I split it up that way and we can iterate through it?22:10
corvusmordred: yep, it's exclusive and waits by default22:10
mordredcorvus: cool - I think that approach looks good22:10
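The waiting behaviour corvus confirms can be seen with a minimal flock(1) sketch (lock path is arbitrary):

```shell
lockfile=$(mktemp)
# flock on fd 9 takes an exclusive lock and, by default, blocks until
# the lock is free -- without -n/--nonblock it always waits
(
  flock 9
  echo "got the lock"
) 9>"$lockfile"
```

A second copy of the subshell started while the first holds fd 9's lock simply sits in flock until the first exits.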
fungiclarkb: what's the list of services we're currently deploying that way?22:11
fungigitea, gerrit, etherpad, one of the nodepool builders...22:11
clarkbfungi: the list of services is roughly represented by the playbooks/roles files there22:11
fungijust trying to judge possible impact22:11
mordredclarkb: honestly - I think I'd go with the bandaid myself - we already serialize gitea, so it should be fine22:11
mordredwe don't do gerrit by default, so it should be fine22:11
mordredso we're really just talking about etherpad and nb0422:12
clarkbetherpad, gerrit, gitea, haproxy, jitsi, nodepool-builder, docker registry, zuul-preview22:12
mordred(as things where a restart might have a noticeable impact we should worry about)22:12
fungiclarkb: yep, basically the set i was thinking of22:12
clarkbmordred: that's a good point re gitea. We may have to do a replication to everything after but that's relatively low effort22:13
fungiand yeah, the current set seems small enough we can probably just juggle them all in one go22:13
mordredyeah - mostly seems like the review/land burden of doing them one at a time might actually be more costly on the team22:13
mordredbut - definitely not today22:14
clarkbya I'll leave the WIP in place for now but if things are calmer tomorrow maybe we give it a go then22:14
clarkb is a related change that is completely safe to land now if anyone wants to look at it (ensures we run jobs when updating dockerfiles)22:15
mordredcorvus: left one thought on there - it's not important, just a thing we might want to think of as a followup22:17
*** prometheanfire has quit IRC22:17
openstackgerritClark Boylan proposed zuul/zuul-jobs master: ensure-tox: use ensure-pip role
openstackgerritClark Boylan proposed zuul/zuul-jobs master: Update Fedora to 31
openstackgerritClark Boylan proposed zuul/zuul-jobs master: Make ubuntu-plain jobs voting
openstackgerritClark Boylan proposed zuul/zuul-jobs master: Document output variables
openstackgerritClark Boylan proposed zuul/zuul-jobs master: Python roles: misc doc updates
corvusmordred: cool yeah, i don't like the number in there either.  we'll just want to make it work for both zuul and nodepool22:19
corvusi'm well past eod now, so i'm going to, well, eod.22:19
ianwi just noticed a revert for the suse change ... what's the plan?22:19
clarkbianw: basically get in once 3pci confirms it works against
clarkbianw: then we can land the zuul-jobs stack you've got (I think that was the only objection that came up) and then we can retry with new images for suse22:20
clarkbianw: as an alternative midway step dirk asserts that a dib release would make existing builds work22:21
corvuswhen we retry, we should keep the gap between image builds and landing that stack small -- keystone broke which is why we rolled back22:21
fungiit was specifically keystone's functional test job, yeah?22:22
fungisomething which expects virtualenv to be present but isn't a typical tox unit test/linter/whatever model22:22
openstackgerritMerged opendev/system-config master: Add port and user_dir to add_host in prod playbook
clarkbianw: fwiw now that the docker-compose thing is on semi hold I'm available to keep pushing on the suse things22:23
clarkbat least for a few more hours22:23
fungipizza time is just about over and then i can get back to looking at etherpad/apache logs22:23
ianwthe only thing with rolling back is that new images won't work because pip-and-virtualenv is broken ... i've been trying to avoid making a dib release with a pip-and-virtualenv that only sort of works by accident22:24
fungiso far the handful of spot checks i did showed each of the characteristic etherpad warnings was preceded by a request for that pad at the old domain name roughly a minute prior22:25
ianwat the time the ensure-pip stack was fully reviewed, so i'd hoped we could push forward with it, that was my thinking, anyway.22:25
openstackgerritMonty Taylor proposed opendev/system-config master: Run Zuul using Ansible and Containers
clarkbfungi: fwiw mordred was wondering if we should drop the redirect and just continue to serve at the old name too22:25
clarkbfungi: maybe we put it in emergency and try that?22:25
mordredyeah - maybe something something cookies something state something sad22:26
clarkbianw: not sure I understand your second to last message22:26
clarkbpip and virtualenv is broken but would build proper suse images?22:26
clarkblikely broken for a different platform I guess22:26
mordredianw: also - not related to suse or pip - we started working on getting zuul+nodepool+zk all up on the ansible so we can roll out zk auth. just as an fyi22:26
fungiclarkb: maybe that would be okay... though could make getting people to use the new domain harder and prolong the problem if it's their existing cookies. still if it clears up the problem that's at least a data point22:26
corvusmordred, fungi, clarkb: i'd like to keep the redirect...22:27
fungicorvus: as would i22:27
corvusmaybe we can confirm that's the problem before doing that22:27
* mordred would also like to keep it22:27
corvusmaybe by asking people to clear cookies, restart browser, and directly go to the new url... things like that22:27
mordredknowing how to reproduce the issue at all would be super great22:27
clarkbianw: oh so we need package lists22:28
ianwclarkb: like how tumbleweed is a python3 only platform, but _do_py3 is commented out, so it's using the python2 logic to install the python3 path, and making links with tools with "2" in them and stuff22:28
clarkbhrm tumbleweed has python222:29
ianwbut not python2 packages i think?22:30
ianwanyway ... i don't want anyone to invest a lot of time fixing things up, and i don't want to spend a lot of time reviewing it, when we want to get rid of it asap22:30
clarkbthats fair22:31
clarkbhrm git/gerrit/zuul don't like my ensure-pip change being set as a depenods on22:32
clarkbmaybe I have to rebase it in properly22:32
clarkbworking on that now22:32
openstackgerritClark Boylan proposed zuul/zuul-jobs master: ensure-pip: export ensure_pip_virtualenv_command
openstackgerritClark Boylan proposed zuul/zuul-jobs master: fetch-zuul-cloner: use ensure-pip
openstackgerritClark Boylan proposed zuul/zuul-jobs master: fetch-subunit-output test: use ensure-pip
openstackgerritClark Boylan proposed zuul/zuul-jobs master: ensure-tox: use ensure-pip role
openstackgerritClark Boylan proposed zuul/zuul-jobs master: Update Fedora to 31
openstackgerritClark Boylan proposed zuul/zuul-jobs master: Make ubuntu-plain jobs voting
openstackgerritClark Boylan proposed zuul/zuul-jobs master: Document output variables
openstackgerritClark Boylan proposed zuul/zuul-jobs master: Python roles: misc doc updates
clarkbthere was a conflict between my change and so ya needed to be rebased :/22:34
*** prometheanfire has joined #opendev22:39
clarkbianw: I guess one risk there is we're sort of equating pip to virtualenv there in the followon change?22:39
clarkbianw: do we also need to check for virtualenv and if it isn't present run the install anyway?22:39
ianwclarkb: sorry which one is the followon change?22:40
clarkbianw: that one22:41
clarkbianw: basically with that change we introduce the idea that if pip is present then so is virtualenv (because we're installing them together when installing pip)22:41
clarkbI think in the default case everything will work fine, but tristanC's case might get a little odd if they aren't also installing virtualenv22:42
clarkb(and maybe that is ok as power users they can deal with that)22:42
ianwclarkb: umm, not really ... i've tried to deliberately make it not install virtualenv22:42
ianwit seems like we have to on Xenial, because we found that venv doesn't work there with our mirrors22:43
clarkbianw: but it is?22:43
clarkbgotcha there might be a few exceptions but in general its relying on python -mvenv which should be there if python is there22:43
ianwyeah, where it has to, such as the python2 install22:43
ianwbut i expect that to be hardly used22:43
clarkbianw: I'm mostly wondering if we need to check for python -m venv and/or virtualenv being valid in addition to `pip` in
clarkb(or in a followup)22:44
ianwi don't think so? checks and prefers "-m venv" in all cases it can?22:45
ianwthat should then be tested by the on all our platforms, to ensure that the ensure_pip_virtualenv_command is something valid22:46
clarkbianw: ya but we are skipping the installs entirely if pip is already present22:46
clarkbso if you had pip installed but not venv or virtualenv (depending on platform) you would be in a weird spot22:46
clarkbI think for now its probably fine22:47
clarkbbecause its a corner case that only power user types like tristanC will run into22:47
clarkbthinking about it more I think its ok to not worry about that too much. Basically what we're saying is if you know better then we'll get out of the way22:49
clarkband if that breaks you its on you22:49
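The "skip the install when pip is already there" behaviour being discussed might look roughly like this in the role (task names and the probe command are illustrative, not the actual zuul-jobs code):

```yaml
- name: Check whether pip is already usable
  command: python3 -m pip --version
  register: pip_check
  failed_when: false
  changed_when: false

- name: Install python3-pip from distro packages
  package:
    name: python3-pip
  become: yes          # only reached when pip is missing, so
  when: pip_check.rc != 0   # pre-provisioned sudo-less images no-op here
```

This is the shape of the fix: the sudo-requiring package task never runs on images that pre-install pip, which is what broke the third-party CI.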
ianwi'm wondering if we should be doing this for the packaged pip case ->
clarkbianw: fwiw this all started because ensure-pip broke 3pci22:51
clarkband it's my understanding that happened because pip was already installed22:51
clarkband this wasn't reconciling that state for some reason22:51
ianwwell it is already installed on infra images too22:52
clarkbbut the ensure-* roles are intended to noop if the thing they ensure is already there22:52
clarkbwhich is why corvus -1'd it22:52
clarkb(and why people didn't want to roll forward this morning)22:52
clarkbthe default is to install from packages so skipping the checks when installing from packages doesn't help I don't think?22:53
clarkbat least not with the current testing22:53
clarkbI wonder if they are running jobs with a ro fs?22:53
ianwjust that the package: install should be idempotent (i.e. noop when already installed) anyway22:54
tristanCclarkb: not sure what do you mean by power user, but i think that using the tox job with a python container that doesn't have sudo should not be a corner case22:54
clarkbtristanC: the corner case is you've preprepped the image. This role is for prepping the image22:54
clarkbtristanC: I think the correct way for you to use this would be to not use esnure-* anything if you are using prebuilt images without root22:55
clarkbbut I'm also happy to try and accommodate the preinstalled case because I think it won't be uncommon22:55
clarkbtristanC: the corner case here is that you are using a role that will install things if necessary but you don't let it do that22:55
ianwtristanC: so if ensure-pip has a "package:" call with become: yes, that won't work for you, right?22:55
ianweven though that is idempotent, as such -- keeping to the rules of ensure-* roles that they don't do anything if the stuff is already there22:56
tristanCclarkb: we are not using that role, we just use the tox job provided by the zuul-jobs project.22:57
clarkbtristanC: on the root point the whole system has sort of been designed to make using root as safe as possible. because unfortunately a lot of stuff does need root (not necessarily tox though)22:58
fungihow exactly did it break for you then?22:58
clarkbfungi: its because sudo rpm -q or whatever it does to check if the package is installed failed22:58
fungioh, right the *job* not the *role*22:58
clarkbfungi: via ensure-tox consuming ensure-pip22:58
fungithe tox job in zuul-jobs tries to install the things it will use, so if you're preinstalling those things the job might still try to sudo even if it'll be a no-op23:00
fungigot it23:00
clarkbfungi: yup23:00
ianwso ... should we make the tox job not call ensure-tox?  i thought we decided it wasn't yesterday?23:00
fungiso yeah any become would need to be guarded behind whatever conditional ensures it's a no-op23:01
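The guard fungi describes could be sketched roughly like this in Ansible (task names and the `pip_preinstalled` variable are illustrative, not the actual role contents):

```yaml
- name: Check whether pip is already available
  command: "{{ ansible_python.executable }} -m pip --version"
  register: pip_preinstalled
  failed_when: false
  changed_when: false

- name: Install pip from packages only when it is missing
  package:
    name: python3-pip
  become: yes
  when: pip_preinstalled.rc != 0
```

On a preprepped image the first task succeeds, the second is skipped, and no privilege escalation is attempted.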
ianwplaybooks/tox/pre.yaml:    - ensure-tox23:01
clarkbianw: well I think there is still value in the "check if pip is there without the package manager" case, because pip could have been installed without the package manager?23:01
fungior else the ensure roles should not be included23:01
tristanCclarkb: fungi: iirc we already agreed that the tox job should be usable without sudo access23:01
fungitristanC: yep, makes sense23:01
clarkbtristanC: yup I wrote the change to fix it :)23:02
ianwtristanC: ++ on tox job not using sudo23:02
clarkbbut there is a weird side case where the way we pull in pip implies virtualenv (or venv) will be available23:02
ianwheh, well we agree on something :)23:02
clarkband if you haven't built the image with virtualenv or venv it will be weird for you23:02
clarkbbut we can't fix that in any case because there is no sudo so it's not worth worrying about I don't think23:02
ianwi'm back to why ensure-tox is in the tox role pre.yaml playbook23:02
clarkbianw: how does the job work if it isn't ensuring tox is available?23:03
clarkb(I don't think I followed that conversation from before)23:03
ianwclarkb: i thought from yesterday, i'll have to go back, we were somewhat in agreement that it was up to you to run "ensure-tox" before running the tox job23:04
clarkbianw: I think the implication was that maybe tristanC should have a different tox job that didn't run any of the roles23:04
clarkb*any of the ensure-* roles23:04
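A variant job like the one clarkb suggests might look something like this in Zuul config (the job name is hypothetical, and it assumes zuul-jobs' run playbook works standalone when tox is preinstalled; only the pre.yaml path is confirmed above):

```yaml
- job:
    name: tox-preinstalled
    description: Run tox on images that already ship tox, skipping the ensure-* pre-run roles.
    run: playbooks/tox/run.yaml
    vars:
      tox_envlist: py3
```

Because a child job's pre-run playbooks are appended to its parent's rather than replacing them, such a job would need to avoid inheriting from the standard tox job instead of simply overriding pre-run.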
tristanCperhaps we could drop the assumption that zuul-jobs are not meant to be usable by custom containers, and then we could provide a zuul-container-jobs that provides lightweight versions of the jobs' plays that don't use the ensure-* roles23:05
clarkbbut I'm not sure23:05
tristanCthose jobs could even reference public container images that are known to work with the jobs23:05
clarkbtristanC: I wouldn't even label them container jobs as the pattern could be useful in other systems too23:05
clarkbfwiw I think my change will fix this particular problem23:06
clarkband we never merged the change that would have broken tristanC ?23:06
clarkbso the system is working?23:06
ianw was the comment i was thinking of23:07
ianw"that happens to make it so that tristanC can avoid running the ensure role too if he wants to define a new tox job."23:08
tristanCyes, the system is working, and i'm happy to keep supporting sudo-less environment. I'm also happy to drop the support, as long as we have an agreement23:08
clarkb is green now23:09
clarkbfrom I mean23:09
corvusianw: the context for that quote was that tristanC was concerned that we were doing extra work that wasn't necessary for him.23:09
corvusthat's different than our current understanding, which is that if we merged that change we would have broken a working system23:10
corvus(so, to be clear, i support tristanC optionally creating a new job that is more efficient; but at this point i don't think we're saying that should be required in order for the basic thing to work)23:10
ianwright, that's ok23:12
tristanCcorvus: clarkb: it seems like there is value in being able to associate a job with a prepared runtime known to be working for a specific task. So perhaps we could start designing an extra zuul-jobs project that provides job plays using the roles from zuul-jobs.23:12
tristanCwe could even agree on label names and provide nodesets too23:12
corvustristanC: i'm not sure i'm ready to give up on having a tox job in zuul-jobs that works everywhere23:13
corvusit seems like there's a path forward here, so maybe let's see how good we can make that before we fork23:13
ianwnow i'm starting to wonder if having the virtualenv bits in ensure-pip is a good idea23:21
openstackgerritMerged zuul/zuul-jobs master: Check if pip is preinstalled before installing it
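The merged change follows this general pattern: probe for an existing pip before touching the package manager, so preinstalled images never trigger sudo. A minimal shell sketch of the probe (the function name is illustrative, not what the role actually runs):

```shell
# pip_probe prints whether a command is reachable on PATH,
# without ever escalating privileges.
pip_probe() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found:$1"
    else
        echo "missing:$1"
    fi
}

# Probe the usual pip entry point; a role would only fall back
# to the package manager (and sudo/become) on "missing".
pip_probe pip3
```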
ianwlooking at the keystone job23:35
*** tosky has quit IRC23:39
*** mlavalle has quit IRC23:47
ianwcmurphy: ^ i can not understand where this is coming from :/23:55
ianw ... it should be using venv ... it must be a branch or something i haven't considered23:55
fungistable/stein, right?23:56
cmurphy is on master not stein23:57
fungiyeah, just double-checked23:57
fungiso codesearch is returning the relevant hits in that case23:58
fungionly seems to appear in devstack23:58

Generated by irclog2html 2.15.3 by Marius Gedminas