Wednesday, 2021-03-17

ianwi hacked in a mkdir -p and touch/chmod+x of the file in nb02; let's see if the next build works with that00:04
openstackgerritMerged opendev/system-config master: Use upstream jitsi-meet web image
fungii'm stumped by the failure on
fungithe error message seems completely contradictory to what's implemented by the depends-on change, and the zuul inventory even indicates it included that change00:40
ianwClass[Ptgbot]: has no parameter named 'aliases' at /opt/system-config/production/modules/openstack_project/manifests/eavesdrop.pp:114:3 on node ?00:45
ianwyeah, i wonder if we use the zuul checkout correctly?00:46
ianwtumbleweed is converting.  i guess it's an open question how it goes01:19
corvusfungi, ianw: let's not drop that; that could be the sort of error we should be on the lookout for with zuul; i can pitch in tomorrow to help verify if it's not confirmed by then01:39
corvusmaybe start by seeing if there are git shas in the build log01:40
fungiyeah, i'll try to dig into it tomorrow, getting late for me01:46
*** ysandeep|away is now known as ysandeep|bbl01:56
openstackgerritIan Wienand proposed openstack/project-config master: nodepool elements: create suse boot rc directory
ianw#status log kdc03/04 manually upgraded to focal.  they are in emergency until 779890; we will run manually first time to confirm operation03:50
openstackstatusianw: finished logging03:50
openstackgerritMerged opendev/system-config master: kerberos: switch servers to Ansible control
openstackgerritIan Wienand proposed opendev/system-config master: borg-backup-server: fix verification run
*** ykarel|away has joined #opendev04:17
*** ykarel|away is now known as ykarel04:17
openstackgerritOpenStack Proposal Bot proposed openstack/project-config master: Normalize projects.yaml
openstackgerritArtem Goncharov proposed opendev/irc-meetings master: Move meeting to 16 UTC as agreed
*** ykarel has joined #opendev07:42
openstackgerritArtem Goncharov proposed opendev/irc-meetings master: Move meeting to 16 UTC as agreed
*** ysandeep is now known as ysandeep|lunch08:46
openstackgerritRoman Gorshunov proposed opendev/git-review master: Add missing -h to manpage
openstackgerritRoman Gorshunov proposed opendev/git-review master: Add missing -h to manpage and remove -c from it
openstackgerritRoman Gorshunov proposed opendev/git-review master: Add missing -h to manpage and remove -c from it
openstackgerritRoman Gorshunov proposed opendev/git-review master: Add missing -h to manpage and remove -c from it
lourotttx o/ I have a few governance reviews open. Are you the right person to ping for that? thanks!
ttxlourot: no that would be the TC members, I don't approve those changes anymore as I'm not elected to the TC anymore. You can ask them on #openstack-tc09:34
ttxI'll do a pass on them and sprinkle Codereview+1 magic, but that only goes so far09:34
*** hemanth_n has joined #opendev09:58
lourotunderstood, thanks!10:00
ykarellooks like openstackgerrit bot is down or has some issues10:42
ykarelnot getting IRC notification10:43
ykarelcan someone check10:43
fungiykarel: yep, sorry, taking a look now13:45
fungi2021-03-17 09:33:30     <--     openstackgerrit ( has quit (Quit: Changing servers)13:46
fungii guess that was the last we heard from it13:46
fungi#status log Restarted gerritbot container since it never returned after a 09:33 UTC server change13:49
openstackstatusfungi: finished logging13:49
fungilooks like it reported a change in #openstack-ansible after the restart, 2021-03-17T13:48:0413:50
fungiykarel: it should be back to normal now13:50
*** openstackgerrit has joined #opendev13:54
openstackgerritGomathi Selvi Srinivasan proposed zuul/zuul-jobs master: Create a template for ssh-key and size
ykarelfungi, Thanks it's working now13:59
*** amoralej|lunch is now known as amoralej14:01
fungii wish we could figure out why it doesn't detect that it's no longer connected to freenode and keeps shouting into the void14:06
fungiit never even logged anything from the server about it14:10
ykarelyes that would be good14:15
fungithough interestingly, it did log a traceback right at that time because of the length of the commit subject on this change:
fungiauthor forgot to add a blank as the second line14:18
fungii doubt it's related since the quit message freenode showed was "Changing servers" but i'll try to see if i can recreate the problem later14:19
fricklerfungi: I saw similar logs about overlong messages on earlier disconnects, but never was able to figure out whether they were just happening because the connection got broken just a bit earlier or whether they might actually be related to triggering the issue14:56
fricklerwould sure be interesting to try and reproduce14:57
clarkbfungi: frickler: any objections to me starting to land this morning to begin the rollout of nl02-04 today?15:06
clarkbsome of those need reviews too15:06
clarkbI'm happy to do the +As and watch them if others can sanity check them first15:06
fungiclarkb: no objection15:09
clarkbgreat, I've approved the first 3. The 4th is child of one of the other three and depends on a different one in the other three so I'll let things settle a bit before landing that one15:12
clarkbthanks for the reviews!15:12
fungitrying to work out why a system-config change with a depends-on to a puppet-ptgbot change seems not to have used the modified module source... how do we go about making use zuul-prepared repository states?15:15
fungii can see where it cloned puppet-ptgbot in the job, but unfortunately it doesn't say where it cloned it from or what command it ran to do the clone15:16
clarkbcorvus: is there any reason to not make the change at ? I noticed that our existing nodepool configs were confusing when I was updating them to do this rolling replacement of the servers. That change is an attempt to make it less confusing but I am not sure if maybe the confusing aspect was intentional for some reason15:16
clarkbfungi: have a link to the job?15:16
clarkbI think I'll need the context to help make sense of it at this point15:16
fungithis was the build:
fungiit was triggered by system-config change
fungiwhich depends-on
openstackgerritMerged opendev/ master: Add nl02-04 to DNS
fungithe zuul inventory reflects that set of changes, but i'm trying to figure out if we somehow didn't actually check out the patched version of the module, or my change is just subtly wrong in some way i can't see15:18
fungithe log indicates /etc/puppet/modules did not exist on the bridge node, and then it was created and our needed modules (including the ptgbot module) were cloned into it15:19
clarkbya and that uses modules.env in system-config as the input list15:21
clarkblooking at modules.env I expect we are supposed to override OPENSTACK_GIT_ROOT maybe?15:22
fungiyeah, that's what i'm guessing too15:22
clarkbhowever it doesn't check for a value before setting https://opendev.org15:22
fungiwell, also codesearch turns up no existing references to that variable outside that one file15:23
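The point about modules.env not checking for an existing value before assigning the default can be sketched as follows. This is only an illustration in Python of the "honour an override, else fall back" pattern; the real file is a shell fragment, and the override path used here is a hypothetical example, not something taken from the job.

```python
# Default taken from the discussion above.
DEFAULT_GIT_ROOT = "https://opendev.org"


def unconditional_git_root(environ):
    """Behaviour as described in the chat: the default always wins."""
    return DEFAULT_GIT_ROOT


def overridable_git_root(environ):
    """Use OPENSTACK_GIT_ROOT from the environment when it is set and
    non-empty, otherwise fall back to the default."""
    return environ.get("OPENSTACK_GIT_ROOT") or DEFAULT_GIT_ROOT


if __name__ == "__main__":
    # Hypothetical location of repositories prepared by the CI system.
    env = {"OPENSTACK_GIT_ROOT": "/home/zuul/src/opendev.org"}
    print(unconditional_git_root(env))
    print(overridable_git_root(env))
```

With the second form a job could point the clone step at locally prepared repositories, assuming the job sets the variable at all.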
fungithere's this at the end of the file: ttps://
openstackgerritMerged openstack/project-config master: Add idle configs for
clarkbfungi: I think the way it worked with some jobs was we used zuul-cloner to put the repos in place first15:24
clarkband then would largely noop15:25
fungiyeah, maybe system-config-run-eavesdrop is missing that somehow15:25
clarkbI expect all the system-config-run jobs are because they rely on more modern zuulisms and we don't seem to have adapted this to new zuul15:26
clarkb# If puppet integration tests are not being run, merge SOURCE and INTEGRATION modules <- that's from modules.env and is a clue15:28
fungioh, you know what? system-config-run-eavesdrop probably assumes we're only doing integration testing for ansibilified/containerized services15:28
clarkbwe need to set PUPPET_INTEGRATION_TEST=1 when running and separately clone the repos ourselves15:28
clarkbfungi: yes exactly15:28
openstackgerritMerged opendev/system-config master: Cleanup
clarkbfor the separate clone step we may be able to simply do a ln -s of the module from the zuul dir to puppet modules dir?15:29
fungiokay, well, i'm not super concerned about it in that case, it's a problem which will solve itself once we move the rest of eavesdrop's services off puppet15:29
clarkbPUPPET_INTEGRATION_TEST=1 seems to be the key though as that is what causes to say it is your problem15:29
fungii see little point in spending time trying to fix our ansible integration testing to do puppet integration testing correctly15:30
fungicorvus: mystery solved ^ i assumed incorrectly that was one of our existing puppet integration test jobs, it's not15:31
* clarkb finds breakfast while ansible sorts out the three changes that just landed. Will cleanup afterwards then land the change to add nl02-04.opendev.org15:32
dtantsuranyone knowing about glean around now? I'm pondering adding `NetworkManager restart` somewhere (where?) otherwise static DHCP won't work until reboot..15:39
fungii'm about to disappear for a routine dental checkup, or i'd try to figure it out now15:41
*** ykarel|away has quit IRC15:42
openstackgerritSorin Sbârnea proposed zuul/zuul-jobs master: Upgrade ansible-lint to 5.0
fungiokay, headed out, back as soon as possible15:44
clarkbdtantsur: what does static dhcp mean?15:44
dtantsurclarkb: ouch. It was static IP.15:44
* dtantsur brain boiling15:44
fungiahh, okay, so not a reservation15:44
openstackgerritSorin Sbârnea proposed zuul/zuul-jobs master: Upgrade ansible-lint to 5.0
dtantsurno. I'll post more details to the older thread with ianw15:45
clarkbdtantsur: we use glean + static IPs on all our rax nodes. The centos and fedora nodes use network manager. They do not require a separate restart step15:45
dtantsurclarkb: yeah, I know. No idea how it works for you, it 100% does not for me. But that's on a ramdisk, that's the only difference.15:45
clarkbdtantsur: does the ramdisk use systemd or some other init setup? I bet it has to do with service startup ordering15:46
dtantsursystemd, we've already ruled it out with ianw as it seems15:46
dtantsurI'll paste links to the relevant ML posts, hold on15:46
dtantsurthis is my earlier message with systemd-analyze output:
dtantsurthis is today's with boot logs:
dtantsurclarkb: ^^15:53
dtantsurThis issue may be related to reading from a (virtual) CD ROM?15:54
clarkbdtantsur: you need glean to run before network manager15:54
clarkband that is a unit ordering problem aiui15:54
dtantsurso, the DIB simple-init element is broken?15:54
dtantsurI think it simply uses 'glean install' or something like that15:55
clarkbno it also sets up the udev and unit rule as explained in ianw's response15:55
dtantsurudev rules also come from glean itself:
clarkb and
dtantsurright. so these are wrong?15:56
clarkbdtantsur: yes but the pip install can't install them15:56
dtantsurthey do run, otherwise glean wouldn't be triggered at all15:57
clarkbno, I'm explaining that the simple-init element installs those files. glean is merely a containment vessel since pip isn't robust enough to do those installations properly across various distros15:57
dtantsur(and simple-init doesn't do pip install, it uses glean-install)15:57
dtantsurokay, fine, I think we're talking about the same thing in the end15:57
dtantsurI would assume is enough. maybe it's not?15:58
dtantsurmaybe it needs Before=networkmanager.service?15:58
clarkbpossibly. The first thing I notice is that there is a different unit file for nm use and the udev rule refers to the unit file name15:59
clarkbI wonder if the udev rule isn't configured properly for the nm unit15:59
dtantsur"Started Glean for interface enp1s0 with15:59
dtantsurthis seems the correct unit15:59
dtantsurmatches the description:
clarkbdo we know if that is or  ?16:00
clarkb(I don't think we do)16:00
dtantsurclarkb: see "with NetworkManager" above16:00
dtantsurthe one you link does not have this bit in the description16:01
clarkboh it went across multiple lines16:01
dtantsuryeah, sorry for that16:01
dtantsurI think I'll try before=networkmanager, it may give us a better insight16:01
clarkbok that is good, that means the udev rule is triggering the expected unit at least16:02
dtantsurI'm worried that NM has a habit of always initializing a connection16:02
clarkblooking at your log, not only are nm and glean started roughly at the same time but doesn't do anything for almost a minute?16:02
dtantsuryeah, that's the most surprising bit16:03
dtantsurI wonder if it has anything to do with reading from a virtual CD16:03
dtantsur(silly, but that's the only guess I have)16:03
clarkbreading the config-drive data from virtual cdrom? ya that could be it I suppose16:03
dtantsuryep. nested VM, so everything is slooooooooow.16:03
clarkbwhat you need for NM to be configured properly is for to have written the interface config file before NM evaluates the interface16:04
clarkband that log clearly shows we aren't writing that file until well after NM has done its thing16:04
clarkbis network manager also before
dtantsurouch. I've just realized that Before=NetworkManager may not work since NetworkManager starts very early16:05
dtantsurmmm, lemme try to figure out16:05
clarkbdtantsur: I think the is intended to run glean prior to NM16:06
clarkbwith the assumption that NM and friends won't happen until starts16:06 dbus.service16:06 network.service16:06
dtantsurThis is on my normal machine, not inside the ramdisk16:06
dtantsurbut it's also centos 816:06
clarkbmy NetworkManager is dbus.service and Before=network.target16:07
clarkbya for both of these machines the glean-nm service should be fine16:07
clarkbbut it certainly seems like this isn't the case in the system whose log you've pasted16:07
clarkb(thinking out loud here) could it potentially be that in your nested VM udev doesn't process its events until well after systemd has decided to move along?16:08
clarkbthe glean unit is also implicitly expecting that udev will have fired off an event to systemd saying add this to your startup graph I think16:08
clarkbsystemd-udev-trigger.service is the unit on my suse system that coldplugs all devices and it runs Before sysinit.target16:10
clarkbdouble checking ^ on the test env might also be worthwhile16:10
dtantsur[   40.128942] systemd[1]: Started udev Coldplug all Devices.16:12
dtantsurit's much earlier than NetworkManager (timestamps 63)16:12
*** lpetrut has quit IRC16:12
dtantsur.. which doesn't necessarily mean that it has processed all events by NM start-up...16:13
roman_gHello team. Could I ask you to check kna1 Ubuntu mirror, please? Thank you!
roman_gErrors: E: Failed to fetch  Unable to connect to
clarkbdtantsur: ya my udev plug unit does execstart and as soon as the process starts running I think systemd will consider the unit started?16:15
roman_gdtantsur Hi Dmitry. Привет, Дмитрий :)16:16
dtantsurroman_g: Привет! o/16:16
clarkbroman_g: you can check it too :) is available from here16:16
clarkbroman_g: can you link to the job that hit that problem so we can see timestamps and cross check against the server (but without timestamps the best I can do quickly is say the server is up and running now according to my browser)16:16
roman_gclarkb I did. But unreachable from VMs16:17
roman_gclarkb this job:
clarkbhrm that doesn't log timestamps?16:17
clarkbah the other parts of the job do, that's good16:18
roman_gclarkb thanks for pointing out to a way to link to specific line. This seems to be something new.16:19
clarkbdmesg on the mirror doesn't show any recent afs connectivity issues. I think we can rule that out16:20
roman_gclarkb I'm also missing a tab in Zuul UI to list all jobs on specific provider (label). You know we are often having troubles :)16:20
roman_gThis could have allowed me to see if this is common problem for all jobs using this provider (or label).16:21
clarkbroman_g: zuul doesn't track that iirc16:22
clarkbthe logstash service tries to fill that gap though16:22
TheJuliaI guess meetpad got updated?16:22
clarkbTheJulia: yes, or at least I think the change to do that landed16:22
roman_gclarkb yes, that's right16:22
clarkbTheJulia: just the web portion though, we had had a fork for a while but then they upstreamed corvus' changes and so that got updated16:22
TheJuliaCool, are we expecting any etherpad integration breakages?16:23
clarkbTheJulia: no, but the component that was updated does handle that so is possible16:24
* dtantsur adds moar logging to and rebuilds16:24
clarkbTheJulia: is the etherpad integration not working as expected?16:25
TheJuliaclarkb: did not but it could just be cached items16:25
TheJuliatrying another browser16:25
clarkbroman_g: I see the test node hitting the access logs around the time the build logs it cannot access things16:26
TheJuliaclarkb: nope, not working. the embedded etherpad says 400 Bad Request in the background on a brand new fresh browser16:26
clarkbTheJulia: ok, it is possible the integration stuff which we thought was working upstream isn't actually working upstream. corvus did they not take your change as is?16:27
clarkbwe can revert the cleanup and go back to our forked version easily enough. Though I'm helping to debug a couple of other things right now so may be a bit16:27
TheJuliaI'm not too worried about it, tbh16:28
TheJuliaso don't rush anything on my account16:28
clarkbTheJulia: ok, ya you should be able to screenshare an etherpad window and/or tell people to open it up separately16:28
clarkbroman_g: here is a curious thing, I see the same IP hitting the mirror ~13 minutes prior16:28
TheJuliaclarkb: exactly16:28
clarkbroman_g: but the job had only just started ~3 minutes prior16:29
roman_gclarkb Thank you. This is very interesting. I don't have ideas how to debug it then.16:29
clarkbroman_g: I wonder if this is an arp caching problem with stale arp tables and IP reuse16:29
roman_gclarkb on provider side?16:29
clarkbroman_g: basically the packets from the host get to the mirror but then the return path goes to some node that doesn't exist anymore16:29
clarkbroman_g: ya, mostly just calling that out as a possibility given the use of the IP outside of this job context only a few minutes prior16:29
roman_gThen this is very occasional. Happens not so often16:29
clarkbroman_g: let me get a paste together that tries to show this16:30
*** hamalq has joined #opendev16:31
clarkbroman_g: assuming apt doesn't say "connection time out" when it gets a 404 (that would be a pretty big bug) I think we can probably blame provider networking since the mirror sees the requests but the host doesn't seem to agree16:37
clarkbTheJulia: I can reproduce the etherpad problem16:38
roman_gTheJulia, clarkb I confirm that I also see Bad Request page there16:39
clarkb <- is the url that jitsi tries to fetch the etherpad at and that is producing the same result16:40
*** marios|call is now known as marios16:40
clarkbthe nginx config to do that proxying seems to still exist16:43
clarkbI need to find the server side logs to see what it doesn't like about this request I guess16:44
openstackgerritDmitry Tantsur proposed openstack/diskimage-builder master: Install glean from source-repository
openstackgerritDmitry Tantsur proposed openstack/diskimage-builder master: Install glean from source-repository
dtantsurclarkb: added debugging logging, found out something. this first invocation: takes half a minute17:00
dtantsurwhich pushes the actual network configuration to much later. why is it needed?17:00
dtantsurit seems like we could just default to --ssh and --hostname and avoid it17:01
clarkbdtantsur: I'm not sure, will need to do some digging17:03
dtantsurand I can confirm that networkmanager ignores before=network-pre :)17:05
clarkbthat seems particularly problematic17:06
dtantsuryep. it's a recipe for races17:06
clarkbdtantsur: the line you call out is just for when the config drive isn't present. I think there is a bug there where we should put the last line in the script in an else block though17:09
clarkbdtantsur: we want one or the other to run not both I think.17:10
dtantsurclarkb: the other way around: it's when configdrive IS present17:10
dtantsur(which I can confirm in vivo)17:10
clarkbdtantsur: however, in your case if the no config drive block is running that will default to dhcp which may explain part of your problem?17:10
clarkbdtantsur: the test is -n "$CONFIG_DRIVE_LABEL"17:10
clarkboh ya the comment there is confusing17:11
clarkbya I really don't understand what that is trying to achieve17:11
dtantsur-n checks for presence, you're confusing with -z17:11
dtantsurI'm trying without that line now17:11
clarkbthe code has always been there too :/17:12
clarkbdtantsur: yes, the comment above it was confusing me17:12
clarkbit says if the config drive isn't present skip it17:12
clarkbwhich is not what the code is doing at all17:12
dtantsurright, yeah. I had a minute of confusion as well17:12
clarkbfungi: the meetpad upgrade broke etherpad embedment17:12
clarkbfungi: jitsi meet should look to for the pads. That too fails with an http 40017:13
dtantsurI think I understand why After=network-pre does not work. network-pre is optional, it gets pulled in via Wants=network-pre in glean@ units. But they appear too late for NM.17:13
dtantsurthis is how I read
clarkbfungi: if you look at the nginx logs on meetpad (using docker logs for the web container) it shows the 400s but I don't see any explanation of why it doesn't like that url17:13
clarkbdtantsur: I think I get it, the script is only wanting to write out the ssh key and the hostname if the config drive is present as that data is in the config drive only. It then runs again to set up the network which can handle config drive being present or not (fallback to dhcp)17:16
clarkbdtantsur: I can write up a patch you can test that executes glean once handling both scenarios17:16
dtantsurwill gladly do (I'm trying my own hacked-together version now)17:16
dtantsurglean takes 40 seconds just to start executing O_____o17:18
dtantsurwhat the....17:18
* dtantsur rewrites it in rust17:18
dtantsurideally we should be able to deal with networkmanager somehow17:19
dtantsurthe only problem is: I haven't found a way to tell it to re-read configuration files short of restarting17:20
openstackgerritClark Boylan proposed opendev/glean master: Run glean fewer times in
clarkbdtantsur: you have to delete the interface configuration. It skips if it sees they are already there iirc17:20
clarkbdtantsur: ^ something like that maybe17:21
dtantsurclarkb: no, I mean a different thing. I think NM keeps its configuration somewhere else.17:21
clarkboh NM yes it does. The /etc/ configs are a compat convenience thing but it maintains a db or something iirc17:21
dtantsurso this DB is initialized early with a dummy "Wired Connection 1"17:22
dtantsurand then it refuses to read the files glean creates...17:22
* dtantsur shakes first at NM17:22
dtantsurfirst... fist17:22
*** frigo has joined #opendev17:24
dtantsurfrigo (not here) has put together workarounds:
openstackLaunchpad bug 1916348 in diskimage-builder "simple-init/glean missing some requirements (centos-minimal 8)" [Undecided,New]17:25
dtantsurbut it boils down to restarting NM networking17:25
dtantsur(and I question some of these steps)17:25
dtantsurhey frigo, I've just posted your link17:25
frigohaha:D  I did all that without thinking, and did not put the "updated" version of the glean.sh17:25
dtantsurI'm not sure why /dev/sr0 doesn't work for you, it works for me17:26
dtantsurthe NM changes essentially amount to restarting the full networking, right?17:26
clarkbfrigo: dtantsur: I don't think glean should handle the multiple config drive situation (that isn't valid is it?)17:26
frigo /dev/sr0 works but17:26
clarkbyour cloud has broken something very badly if that happens and you should fix it under glean17:26
frigoit does not if you already have a /dev/sda with a config-2 drive label in it17:27
dtantsuraaaaaah. oooh!17:27
* dtantsur runs away17:27
clarkbfrigo: but you shouldn't have that?17:27
frigoin the context of bifrost, for some reason, the first time you enroll a server, the clean-up does not run17:27
clarkbif nova gave me a host with two config drives I would immediately ask the nova devs to fix it :)17:27
dtantsurfrigo: fair enough. it's an issue with microversions in ansible openstack modules.17:27
frigoalso sometimes, it's useful to disable the automated clean-up17:27
dtantsurevery time you disable cleaning somewhere far-far away cries a lonely ironic developer17:28
clarkband yes dhcp-all-interfaces and simple-init need to be used XOR each other17:28
dtantsur(but yes, there is an actual bug with enrollment)17:28
clarkbbut that is up to you as the person compiling the elements list17:28
frigowell, I used to envision to leverage the cleaning steps to run a lot of wild things17:29
frigolike firmware upgrades17:29
frigothen I opened!/story/200864317:29
frigoand more I think17:29
dtantsurfunny enough, dhcp-all-interfaces worked fine in my earlier testing on debian17:30
dtantsurIIRC it has logic to skip DHCP if there is configuration17:30
dtantsurfrigo: please report it to HPE folks17:31
frigoI report things one after the other:D  I opened quite a lot of tickets already17:32
dtantsurthat's good, thank you17:32
corvusclarkb: nodepool zk change lgtm17:37
corvusfungi: re mystery solved, great!17:37
clarkbcorvus: cool, just wanted to double check on that17:37
corvusclarkb: afaik they took the change as written, but who knows, maybe they renamed the variable or reverted it17:38
clarkbcorvus: ya digging around it seems they kept it pretty stable.17:39
clarkbI have just discovered that the request is making it all the way to etherpad but with an extra /p/ prefix so you get /p//p/padnamehere in the url and that breaks it17:40
clarkbI don't understand why this is happening yet though17:40
*** frigo has quit IRC17:41
clarkbinfra-root I'm going to delete shortly (say in 5 minutes). Please say something if that doesn't work for you17:42
dtantsurclarkb: left one suggestion on your patch17:42
openstackgerritMerged opendev/irc-meetings master: Move meeting to 16 UTC as agreed
clarkbdtantsur: that is an interesting idea, my only concern is I would need to think about what that means for logging (I don't think it means anything since fds should be inherited but need to run it through in my head)17:44
dtantsurwell, I'm going to try it out now17:44
* dtantsur is trying to understand what is taking so much time17:45 has been cleaned up17:54
dtantsurPondering another idea: a very early execution of glean for whatever interfaces are already initialized (without --interface)17:57
clarkbTheJulia: are you done with meetpad? I think I see what a fix is but I'd like to test it before I push anything up18:04
clarkb(and to do that I need to restart services)18:04
clarkbessentially we set ETHERPAD_URL_BASE= and that gets picked up by the default nginx config. What I don't understand is how the default nginx config is being used when we supply an alternative18:05
clarkbbut I think this is close enough that updating ETHERPAD_URL_BASE to drop the p/ is worth a go18:05
clarkband if that fixes things, push that update, then work out why our supplied nginx config is ignored18:06
clarkbok that didn't fix things but it did confirm that that var seems to be in play as now I have //p/padnamehere instead of /p//p/padnamehere18:09
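Both observed paths (/p//p/padnamehere and then //p/padnamehere) are what plain string concatenation would produce. The sketch below assumes, without having confirmed it against the actual nginx config, that the proxy simply appends "/p/<pad>" to whatever ETHERPAD_URL_BASE contains; the hostname is a placeholder.

```python
def proxied_pad_url(etherpad_url_base, pad_name):
    """Assumed proxy behaviour: naive concatenation of the configured
    base URL with a fixed "/p/" prefix and the pad name."""
    return etherpad_url_base + "/p/" + pad_name


if __name__ == "__main__":
    pad = "padnamehere"
    # Base configured with a trailing "p/" (the original setting).
    print(proxied_pad_url("https://etherpad.example.org/p/", pad))
    # Base with the "p/" dropped but the trailing slash kept.
    print(proxied_pad_url("https://etherpad.example.org/", pad))
    # Base with no trailing slash at all.
    print(proxied_pad_url("https://etherpad.example.org", pad))
```

Under that assumption only the last form yields a single clean /p/ segment.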
clarkbbut also that may have broken more things? yay18:10
fungilunch prep and consumption took longer than expected, so still catching up18:10
johnsomFYI, just got an odd rsync POST_FAILURE18:11
dtantsurfood, mmmmm18:11
clarkbI think restarting services to pick up the config change has resulted in sadness18:11
clarkbI'm going to undo the /p/ removal in case that somehow made things worse18:12
openstackgerritMerged opendev/system-config master: Add new nodepool launchers
clarkbwell I see the /p//p/padnamehere behavior has returned but jitsi still says I've been disconnected18:13
dtantsurI think I'll continue suffering tomorrow, dinner is calling18:13
*** dtantsur is now known as dtantsur|afk18:14
clarkbinterestingly the mobile app seems to load it up ok18:14
fungijohnsom: that looks like the node died before test result collection was attempted18:14
fungithe executor is saying it couldn't reach it18:15
clarkbsomething about my browser? can anyone else confirm or deny that jitsi meet is alive for them18:15
johnsomfungi Yeah, not a lot in the main log. Thought I would mention it in case it was a bad sign of things to come.18:15
fungiclarkb: i get the jitsi-meet main page when i go to
fungiand i can start a meeting with the start meeting button18:18
clarkbdoes it say you have been disconnected when you start a meeting?18:18
fungijohnsom: thanks, looks like that happened in ovh-gra1 so i guess if we see more and they're in the same region, we'll have reason to believe it's correlated18:19
fungiclarkb: oh, i wasn't trying from a machine which i expected to actually be able to use it, switching rooms now18:19
fungiconnected from my workstation with chromium, i get "you have been disconnected" yeah18:21
fungiretrying with firefox because i saw something different there18:22
clarkbya firefox gives you a little dialog that says connection failed18:22
fungioh, firefox shows a little red "connection failed" after the browser warning in the bottom-left18:23
fungiso that happened after you switched the etherpad base url?18:23
clarkb is the problem I think18:23
clarkbfungi: yes, more precisely after I restarted all the jitsi meet services to pick up the etherpad base url change18:23
clarkbbut I think that issue gives me a fix18:24
fungiaha, "proxy the incoming WSS connection, not only WS"18:30
*** calcmandan_ has quit IRC18:33
*** calcmandan has joined #opendev18:33
clarkbI manually did ^ and that fixed audio and video. etherpad doc sharing is still broken but we have more clues to debug that18:41
clarkbI need to eat lunch but I'll pick this back up again after18:41
clarkbok I think I see what may be happening config wise with jitsi meet. We bind mount /var/jitsi-meet/web/ to /config in the container. When the container starts it populates the contents of /config from /defaults with env vars replaced19:01
clarkbwe then later run ansible which updates the /var/jitsi-meet/web contents causing all sorts of confusion19:02
clarkbbut the file contents actually loaded by nginx at startup are the ones produced from the defaults file19:02
clarkbknowing that I now believe that the problem with the etherpad doc loading is at least partially related to and
clarkbI think we are sending a Host header to with the value localhost in it19:09
clarkbwe may still need to fork their images to fix this; however, it should be possible to use a much lighter weight fork that simply patches the etherpad proxy config19:12
fungiaha, and that's causing it to redirect with an additional /p/?19:13
clarkbfungi: we set ETHERPAD_URL_BASE with a /p/ suffix. It shouldn't have that suffix at all. But even fixing that (I've manually done this) we still get an http 400 response from and I suspect it is because of the bad Host header values19:14
clarkbI'm now checking if I can manually edit the config and have nginx reload it somehow19:15
clarkbthis way the container doesn't restart and rewrite the files19:15
clarkbyou can `nginx -s reload` and that fixed
clarkbhowever the shared document is still not showing up in the meetings19:17
clarkbconfig.etherpad_base = '""/etherpad/p/'; I suspect this is the problem19:21
clarkbya so that is being written out from the configs too19:24
clarkbI'm somewhat amazed that this is working at all to be honest :(19:24
clarkboh I see the beginning of the generated config has a bunch of generic stuff then it redefines things with the vars we pass in later which is how this works19:27
clarkbok I think I fixed it by hand. But not really sure how to properly fix it yet19:30
clarkbI removed the ""s from config.etherpad_base = '""/etherpad/p/'; and hard refreshed and it loads after I also fixed the ETHERPAD_URL_BASE and the proxy config19:30
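One way a value like '""/etherpad/p/' can end up in a generated config is an env file entry written as NAME="" being read without quote stripping, so the variable's value is literally two quote characters. The sketch below shows that mechanism under that assumption; the variable name BASE is hypothetical, and Python's string.Template stands in for whatever templating the image really uses.

```python
from string import Template

# Hypothetical template line; only the surrounding literal text
# comes from the config quoted in the discussion.
TEMPLATE = Template("config.etherpad_base = '${BASE}/etherpad/p/';")


def parse_env_line(line):
    """Split NAME=VALUE at the first '=' and keep the value verbatim,
    including any quote characters (the assumed parser behaviour)."""
    name, _, value = line.partition("=")
    return name, value


def render(env_line):
    """Render the template using the single variable from env_line."""
    name, value = parse_env_line(env_line)
    return TEMPLATE.substitute({name: value})


if __name__ == "__main__":
    print(render('BASE=""'))  # quotes survive into the output
    print(render("BASE="))    # truly empty value
```

If that is what happened, leaving the value unquoted in the env file would give the same result as removing the quotes by hand.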
clarkbI'm beginning to wonder if we should consider a soft revert. Basically add the opendev image back but base it off of an up to date jitsi meet web and then reapply our configs19:35
fungiis the problem that there's configuration baked into the image which we can't easily override?19:39
*** frigo has joined #opendev19:42
clarkbfungi: yes19:43
clarkbfungi: the image has a bunch of configs in /defaults it then runs the frep templating tool over them to produce working configs and outputs the results to /config. /config is where we bind mount /var/jitsi-meet/web19:43
clarkbwhich essentially means that all of the configs we manage with ansible are ignored19:43
fungii guess we could try to work out how to bind-mount over top of some of the templates?19:44
clarkbfungi: the old setup "worked" because we had the same configs in ansible and in our forked docker image19:44
clarkbfungi: I don't know how that would work since the container is writing over what we bind mount19:44
fungii mean bind-mount over the templates in /defaults19:45
clarkboh ya that could potentially work19:45
fungireplace what the templating tool reads rather than what it wants to write19:45
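A minimal sketch of why editing the rendered files does not stick while replacing the template inputs does. It is an illustration only: `string.Template` stands in for the image's frep templating, and the directory names, the `ETHERPAD_HOST` variable, and the `etherpad.example.org` hostname are hypothetical rather than taken from the real image.

```python
import tempfile
from pathlib import Path
from string import Template


def render_defaults(defaults_dir: Path, config_dir: Path, env: dict) -> None:
    """Mimic container startup: render every template in defaults_dir into config_dir."""
    config_dir.mkdir(parents=True, exist_ok=True)
    for template_path in defaults_dir.iterdir():
        rendered = Template(template_path.read_text()).safe_substitute(env)
        # Whatever was in config_dir before is overwritten unconditionally.
        (config_dir / template_path.name).write_text(rendered)


with tempfile.TemporaryDirectory() as tmp:
    defaults = Path(tmp) / "defaults"  # stand-in for /defaults in the image
    config = Path(tmp) / "config"      # stand-in for the bind-mounted /config
    defaults.mkdir()
    (defaults / "meet.conf").write_text("proxy_set_header Host $ETHERPAD_HOST;\n")
    env = {"ETHERPAD_HOST": "localhost"}  # hypothetical variable name

    # First start, then a config-management edit of the rendered output.
    render_defaults(defaults, config, env)
    (config / "meet.conf").write_text("proxy_set_header Host etherpad.example.org;\n")

    # A restart re-renders from the defaults and discards that edit.
    render_defaults(defaults, config, env)
    overwritten = (config / "meet.conf").read_text()

    # Replacing the template input instead survives the next restart.
    (defaults / "meet.conf").write_text("proxy_set_header Host etherpad.example.org;\n")
    render_defaults(defaults, config, env)
    persisted = (config / "meet.conf").read_text()
```

In this model the bind mount over the output directory loses on every restart, whereas a file mounted over the input is simply picked up by the next render.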
clarkbfwiw I think most things are working except for the doc sharing due to some issues with our env vars that get substituted and due to the meet.conf setting bad headers19:45
clarkbwe also set the browser warning and a couple of other flags that I think are probably less of a concern19:46
clarkbfungi: ya, that may work19:46
clarkbI'll give that a try with the meet.conf nginx config shortly19:47
openstackgerritClark Boylan proposed opendev/system-config master: Improve meetpad env options for templating
clarkbfungi: ^ that's not what we have been talking about but also seems to be necessary19:49
corvusclarkb: where's your zk 4 letter word change?19:53
corvusah found it
openstackgerritJames E. Blair proposed opendev/system-config master: Add mntr to ZK whitelist
openstackgerritClark Boylan proposed opendev/system-config master: Manage jitsi-meet meet.conf as a template input for the container
clarkbinfra-root ^ maybe that will work?20:05
clarkbI think ansible undid my manual changes so jitsi meet may not be working again. The first change in that stack should be very safe. The second too20:13
clarkband that will get at least audio and video working again20:14
*** roman_g has quit IRC20:18
*** roman_g has joined #opendev20:19
openstackgerritClark Boylan proposed opendev/system-config master: Restore some meetpad settings we had previously set
openstackgerritClark Boylan proposed opendev/system-config master: Restore meetpad etherpad settings.
clarkbI think ^ that stack may be largely complete now assuming it works and people are happy with it in review20:32
clarkbI haven't touched the cleanups yet because I figure that will be easier once we've got something that works20:32
clarkbfungi: do you think we should go ahead and approve the first two changes so that things minimally work again?20:32
fungiclarkb: yeah, i'll single-core approve them. if corvus gets a chance to look (since the original implementation was his he might have a different take on it) we can always revert20:34
corvusfungi, clarkb: ++ go ahead and approve; i'm about to push up some changes i'd like you to review and i'll retro-review that then.20:34
clarkbansible failed on for an odd reason "reason": "Could not find or access '/home/zuul/src/' on the Ansible Controller." but nl03 and nl04 did not and I started their launchers which seem to be idling as expected20:37
clarkbthe hourly nodepool deployment should run once the directly triggered set of jobs finish and I expect it will finish up nl02 at that point. Once that is done I'll remove my WIP from the change that flips from the to servers for launching20:38
fungi"the ansible controller" is bridge.o.o in that sense?20:39
clarkbyes I believe so20:39
fungicannot access '/home/zuul/src/': No such file or directory20:39
fungidoesn't seem to exist on bridge.o.o20:39
clarkbyup it doesn't exist20:39
clarkbnot in the git repo either20:40
fungithe directory is there, yeah20:40
clarkbno it shouldn't be there20:40
clarkbat least from what I can tell it isn't a valid path, I expect ansible had a problem doing logrotate role lookup and somehow that bubbled up as an actual problem20:41
fungiright, that's what i mean, the parent directory exists and seems to be a current checkout, so not sure why it was looking for something not in the repo20:42
fungibut yeah, maybe that was a fallback and the actual error was earlier20:42
openstackgerritJames E. Blair proposed opendev/system-config master: Add zookeeper-statsd
corvusinfra-root: ^ if you could take a look at that relatively soonish, i'd like to get that running ASAP so we have baseline data before we start to increase our load on zookeeper with the zuul ha scheduler work20:48
corvusinfra-root: also note the parent change... actually, let me just squash them.20:49
openstackgerritJames E. Blair proposed opendev/system-config master: Add zookeeper-statsd
*** roman_g has joined #opendev20:50
corvusinfra-root: ^ squashed20:50
corvusmordred, tristanC, tobiash: ^ fyi20:51
clarkbcorvus: I +2'd it, seems to do what it says on the tin. The only thing I am not quite sure of is if there is any risk to exposing those gauges and counters though I expect not20:58
fungiit's hard to cram sensitive information into a gauge/counter20:59
corvusi don't think they're sensitive (at least, no more sensitive than any other metric we're exposing for anything)20:59
corvusclarkb: meetpad changes seem reasonable; it seems like a lot of stuff has changed in the interim; i'm sure we had a good reason for /p/ at the time but it feels like with so much changing, all original assumptions are invalid, so if it works, great.  i don't see any red flags in there.21:01
corvusclarkb: (but it sure does seem like there's a bunch of stuff that's going to bitrot before the next update unless we can start to get some things upstream in the dockerfiles)21:01
clarkbcorvus: ya, it feels like rolling forward with the new assumptions makes sense. A lot of config things seem to have gone away (I think because they are moving forward too)21:01
fungialso in testing new meetpad, i spotted an exciting feature: experimental end-to-end encryption!21:02
clarkbya I was starting to think about how to make the meet.conf upstreamable21:02
clarkbI think the issue is they primarily support people proxying to localhost for etherpad21:02
clarkbin that case they don't want Host header to be localhost21:02
openstackgerritMerged opendev/system-config master: Disable xmpp websocket in jitsi meet config
clarkbbut we can probably add in another template switch to toggle that21:02
corvusyeah, so seems like we'll need some more conditions in their template.  yeah that.  :)21:02
openstackgerritMerged opendev/system-config master: Improve meetpad env options for templating
clarkbI think using the room name as the doc name may also be the default21:03
clarkbat least meetpad seemed to do that for me when I had it working for a short time21:03
fungicorvus: any feel for whether we should be trying to get wss support working/proxied? sounded like they added it because it was more stable and better supported by browsers21:03
clarkband then maybe add another toggle for starting the meeting with the shared doc open21:03
corvusclarkb: cool, definitely was not before21:03
corvusfungi: i have no current knowledge of that21:03
clarkbfungi: I think it may already be since we're using the upstream meet.conf for nginx21:03
fungipart of why they made it the new default, i suppose21:04
clarkbfungi: I just didn't want to also debug websockets on top of everything else21:04
fungisure, makes sense to test that separately later21:04
clarkbit probably makes sense to do another change on top of all that to toggle that var to 1 if you want to do that21:04
fungiyeah, let's save that for after the other changes are in21:04
fungialso easier to revert if we find it's terrible before or even during the ptg21:04
clarkbthere are definitely upsides to relying on upstream more. Upgrades should be easier as we improve our side21:06
clarkbbut we're in that awkward place where we need to redo things and then catch upstream up to speed on some of our changes21:06
clarkb makes me wonder if I'm missing something with how the config templating is intended to work21:10
clarkblike we're doing it the hard way now maybe21:10
clarkbactually no. I think people are editing the configs by hand after they are templated then simply restarting services without redoing the containers21:13
clarkbwhich is basically what I was doing earlier21:13
corvusi think us building the container was masking that issue earlier21:14
corvusclarkb: this looks really weird:
corvusclarkb: i'm guessing that's related to launcher turnover21:19
corvusthe blue line doesn't usually move like that :)21:19
clarkbya it sets max-servers to 0 for the providers I bet it is related21:19
clarkbbut I checked the two I started and they both were running without any active requests.21:19
clarkbI think it may just be a reporting artifact if max-servers is sent as is21:20
clarkb(then the two conflicting servers fight over the statsd data)21:20
*** roman_g has quit IRC21:21
clarkbcorvus: fyi and
clarkbfungi: posted a response to your question on
openstackgerritClark Boylan proposed opendev/system-config master: Restore meetpad etherpad settings.
clarkbyay testing ^ it caught a bug with my ansible update21:33
corvusi'm going to restart zuul now21:40
fungithanks for the heads up21:42
corvus#status log restarted zuul on commit 8a06dc90101c4b5285aaed858a62dadc5ae2786821:46
openstackstatuscorvus: finished logging21:46
openstackgerritJames E. Blair proposed opendev/system-config master: Add zookeeper-statsd
corvusclarkb, fungi: ^ that was missing a job dependency21:48
corvusoh wait one more error21:48
clarkbcorvus: the nodepool job21:49
corvusok 2 errors :)21:49
openstackgerritJames E. Blair proposed opendev/system-config master: Add zookeeper-statsd
corvusthe nice thing is that 2 jobs failed; one because of a missing python import in the testinfra file, the other due to the missing job dependency.  the one that failed due to python did have the right job dependencies and it ran the image21:53
fungianother example of why short-circuiting after the first failure isn't a clear win21:54
corvusi think we may not have the latest images pulled; i may need to redo that restart21:55
ianwreading has got me thinking that we should store /home/gerrit2 on the new server on LVM, and modify the backup script to snapshot, incrementally backup from that, then remove the snapshot22:04
clarkbianw: what is the advantage to that? seems like it would suffer from the same problems with git packs changing?22:05
fungithat's generally safer if you have data changing continuously, but then you have the problem that the fs you're snapshotting may not be quiescent22:05
ianwyeah, that's true22:06
openstackgerritClark Boylan proposed opendev/system-config master: Restore meetpad etherpad settings.
fungithere are filesystems designed for fs-level snapshotting, which solves that particular challenge22:07
ianwyeah, we could make /home/gerrit2 btrfs22:07
fungias long as you can be sure transactional writes your application makes are atomic22:07
fungiotherwise you still have the same problem another layer up22:07
ianwno idea if jgit provides that22:07
clarkbbtrfs needs defragging too22:08
clarkband they aren't upfront about it until your disk fills and you wonder why and then find some esoteric arch wiki article on the subject22:08
clarkbnote the disk will report many hundreds of gigabytes of disk free when it happens too22:08
ianwi'm pretty sure gerrit's suggestion to copy the h2 .db files does not make for consistent backups22:09
fungibasically to get true point-in-time backups, your application needs to be designed to accommodate that first and foremost22:09
clarkbcorvus: are the hourly opendev-prod-hourly enqueued changes supposed to have a commit of 0000000... ?22:09
clarkbI'm slightly concerned that that might not do what we expect there :/22:09
fungiwell, there is no valid commit, maybe that's a placeholder?22:10
clarkbit could be, but I thought it reported something else like master instead22:10
clarkb is what it is linking to, which is why I'm concerned22:11
clarkbshould I remove ssh keys from bridge?22:11
clarkbor just let it happen and see what happens?22:11
corvuslet's watch for a sec22:12
clarkbok, it is starting a job now22:12
corvusbut be ready to kill it22:12
fungigitea says that commit doesn't exist, so it should probably just break?22:12
fungiwas that from a reenqueue?22:12
fungigit also says that object doesn't exist in my copy22:13
clarkbfungi: yes it would have been manually reenqueued after the restart22:13
corvus<Branch 0x7f3b5112b7c0 opendev/system-config deletes refs/heads/master from 000000000000000000000000000000000000000022:13
corvusso it thinks master was deleted22:13
fungiright, wondering if that's a bug in the reenqueue script not handling timer triggered pipelines correctly22:14
corvuszuul enqueue-ref --tenant openstack --pipeline opendev-prod-hourly --project --ref refs/heads/master22:14
corvusfungi: that seems plausible22:14
clarkbit hasn't tried to run ansible on bridge yet according to my tail -F install-ansible.yaml.log22:15
corvusi'm guessing that means "enqueue refs/heads/master with oldrev=0 and newrev=0"22:15
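A small sketch of how an all-zero revision pair is usually read, assuming the convention git uses when reporting ref updates to hooks (all-zero old revision means the ref was created, all-zero new revision means it was deleted). The function name and return labels are hypothetical, and this is not Zuul's actual code.

```python
NULL_SHA = "0" * 40  # git's placeholder for "no object"


def classify_ref_update(oldrev: str, newrev: str) -> str:
    """Classify a ref update from its old and new revisions."""
    if oldrev == NULL_SHA and newrev == NULL_SHA:
        # Neither side names a commit, so the update carries no revision info.
        return "ambiguous"
    if oldrev == NULL_SHA:
        return "created"
    if newrev == NULL_SHA:
        return "deleted"
    return "updated"


# The commit hash below is the one from the restart status log entry above.
real_sha = "8a06dc90101c4b5285aaed858a62dadc5ae27868"
results = {
    "both_null": classify_ref_update(NULL_SHA, NULL_SHA),
    "new_null": classify_ref_update(real_sha, NULL_SHA),
    "old_null": classify_ref_update(NULL_SHA, real_sha),
}
```

Code that only checks the new revision would label the both-null case a deletion, which would match the "deletes refs/heads/master" representation seen here.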
clarkbcorvus: can you manually dequeue it?22:15
corvusclarkb: possibly; i'd like to see if it breaks though22:15
corvusthat way we know if this is dangerous22:15
corvus(if it breaks harmlessly, no big deal, if it doesn't then we know there's danger lurking in the enqueue script)22:16
clarkbthe job doesn't seem to be doing much. The console doesn't show anything and tailing the log file for the playbook it should run shows it hasn't started doing that yet22:16
ianwthis feels like which I never quite got to the bottom of22:16
corvusianw: yep22:17
clarkbit just started according to the console stream22:17
corvusianw: if that's the case we can expect retry limits22:17
corvusthough i'm surprised it made it this far22:17
corvusclarkb: if it really checks out the null commit; there won't be any ansible files to run22:18
clarkbcorvus: good point22:18
corvusit looks like it checked out master22:19
clarkbya the console stream seems to confirm that22:19
clarkband git log in the actual repo dir does as well22:19
corvusoh good i was about to check that; glad you did22:19
corvusso for whatever reason, it seems to actually be doing the thing we want it to do22:20
clarkbya, I think the only other concern is if that will somehow find the wrong project-config version but I don't think it will since that commit shouldn't affect project-config right?22:20
corvusthere's at least a minor bug in that if it's going to checkout master, it should report the correct sha.  there may be a larger bug in what it's actually deciding to check out (depending on whether that's undefined behavior)22:21
corvusbut it doesn't seem to be a major bug22:21
corvusclarkb: yeah.  i'm inclined to say it's looking harmless and we can let it run22:21
clarkbianw: heh luca is saying to use an h2 db now? ugh I feel like this question comes up constantly and I have to go dig in emails and docs and find where it says to not do that22:25
ianwclarkb: yeah.  tbh i feel performance is not much of a consideration.  being able to manage the db not using odd .jars downloaded from the web and using our (now) standard backup mechanisms i think still makes an external db worth it22:27
clarkbthat is a fair point22:27
fungisaying to use h2 for the reviewed files db?22:28
fungihuh, okay...22:29
clarkbI swear I just dug up where someone (I thought luca) said not to use h222:35
clarkbit was corvus asking iirc22:35
clarkbbut I cannot find it in my logs22:35
corvusi think all the notes i took are in etherpads but i don't have links handy22:36
ianwthe old documentation says not to
clarkbya not sure it is urgent, it just bugs me that I distinctly remember looking this up recently and can't manage it again22:37
clarkbianw: that is for the reviewdb though22:37
ianwyeah, but all the points about everything but performance i think still count22:38
ianwi'm going to manually run the new ansible kerberos playbook on kdc04 just to make sure it is as idempotent as testing suggests it is22:39
clarkbianw: did you do the double run of playbook idea?22:42
ianwyep, that passed22:42
fungione of the changes adds it22:42
clarkbinfra-root we are ready for the hourly run updated after its failure and I have started the launcher there22:50
fungisame as the previous flip22:52
fungijust several times as many launchers22:52
openstackgerritClark Boylan proposed opendev/system-config master: Clean up the old nodepool launchers.
clarkbI've WIP'd ^ as we don't want that landing until the new servers take over22:54
clarkbfungi: thanks I went ahead and approved it since we've done this once already22:55
fungii'm around to assist if something goes wrong22:57
clarkbfungi: cool, I plan to stick around until we've at least got the old ones headed to idle22:58
clarkbthen probably pick up meetpad stuff again tomorrow22:58
openstackgerritIan Wienand proposed opendev/system-config master: Add kerberos-client group
ianwinfra-root: ^ that will help with a missing var23:01
openstackgerritMerged openstack/project-config master: Flip to
clarkbI've got my nodepool launcher tails running on both old and new to see when old has taken over23:11
clarkband new servers have begun taking requests23:17
ianwclarkb: what's with the removal of "--skip-network" in ?23:25
clarkbianw: dtantsur|afk pointed out that one of the things making glean slow in their testing is that glean is run twice23:26
clarkbianw: in the old code we ran glean ignoring the network to configure ssh keys and the hostname if config drive was present then ran it again to configure the network regardless of the config drive state23:26
clarkbianw: instead we should be able to run it once with the config drive and setup ssh keys, hostname, and network and once without config drive where we only configure the network23:26
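The before/after run plans described above can be sketched as follows. This only models which tasks each invocation performs; the function names and task labels are hypothetical and no real glean options are used.

```python
from typing import List, Set


def old_run_plan(config_drive_present: bool) -> List[Set[str]]:
    """Old behaviour: an optional ssh/hostname run, then always a network run."""
    runs: List[Set[str]] = []
    if config_drive_present:
        runs.append({"ssh_keys", "hostname"})  # first run skips the network
    runs.append({"network"})                   # second run, regardless of config drive
    return runs


def new_run_plan(config_drive_present: bool) -> List[Set[str]]:
    """Proposed behaviour: exactly one run in either case."""
    if config_drive_present:
        return [{"ssh_keys", "hostname", "network"}]
    return [{"network"}]


def tasks_done(runs: List[Set[str]]) -> Set[str]:
    """Union of everything configured across all runs."""
    done: Set[str] = set()
    for run in runs:
        done |= run
    return done
```

Both plans configure the same things; the new one halves the number of invocations when a config drive is present, which is where the per-run startup cost on slow VMs would be saved.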
ianwohh, right, ok23:27
clarkband that should improve startup costs on slow qemu vms pretending to be baremetal23:27
clarkbfrom dtantsur|afk logs it was taking like 30 seconds each time or something like that23:27
ianwlooking at we could probably just leave --ssh and --hostname turned on always, it looks like it will gracefully ignore it if there's no configdrive23:30
clarkbianw: that would simplify the code even more23:31
clarkbthough it's possible users may not always want it?23:31
clarkb(we should be fine as we use it through simple-init but maybe some users dont and write their own units/shell scripts?)23:32
ianwwe could also optimise in glean there to only read meta_data.json once, i think it's doing multiple times23:32
clarkbI think nl02 and have gone idle now. I'll give them a little longer then stop their containers23:32
ianwyeah each step opens and json loads it23:32
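One way to avoid re-reading and re-parsing the same metadata file in every step is to memoize the loader. This is a sketch under assumptions, not the actual glean change: the helper names are hypothetical, and the `hostname` and `public_keys` keys are assumed fields of the metadata document.

```python
import json
import tempfile
from functools import lru_cache
from pathlib import Path


@lru_cache(maxsize=None)
def load_metadata(path: str) -> dict:
    """Open and parse the metadata file once per path; later calls hit the cache."""
    with open(path) as handle:
        return json.load(handle)


def get_hostname(path: str):
    return load_metadata(path).get("hostname")


def get_public_keys(path: str) -> dict:
    return load_metadata(path).get("public_keys", {})


with tempfile.TemporaryDirectory() as tmp:
    meta_path = str(Path(tmp) / "meta_data.json")
    Path(meta_path).write_text(
        json.dumps({"hostname": "node1", "public_keys": {"mykey": "ssh-ed25519 AAAA"}})
    )
    hostname = get_hostname(meta_path)
    keys = get_public_keys(meta_path)
    cache_stats = load_metadata.cache_info()  # one miss (first read), one hit
```

The cached dict is shared between callers, so steps should treat it as read-only.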
openstackgerritIan Wienand proposed opendev/glean master: Reduce metadata read/parsing overhead
clarkbianw: fwiw I think dtantsur|afk intended on testing these changes pre merge with the ironic fake baremetal stuff23:42
clarkbso we can probably safely wait on approving things until dtantsur|afk shows they help23:42
ianwok, i think they're generally correct at any rate, and hopefully help23:42
clarkbyup and I think yours may really help those small systems23:43
clarkbit's our own little version of the gtav startup bug23:43
ianwhaha yes.  i await a bounty ;)23:44
ianw(i'm sure they spend more on m&m's for the office than they paid that guy :)23:44
clarkbI think my favorite part of the bug is that part of why it was a problem was they were parsing the entire manifest of things they would sell you for real money23:45
funginot sure the sweatshop studios provide office m&ms for their slave laborers23:47
fungiit's not like they're valve or something23:47
fungiunless the m&ms are laced with amphetamines to keep them awake through those 16-hour workdays23:48
clarkbnow nl04 has gone idle23:49
clarkbI've stopped the nodepool launchers on all 3 old hosts now23:50
clarkbI've set active (it was wip) but I'm happy to land that tomorrow morning after we've confirmed the new launchers are all happy23:50
clarkbthen once that lands we can land the project-config cleanups23:50
clarkband delete the servers23:50
clarkbI'll also try to land the meetpad fixes if they haven't gone in by then (I don't think we're in a rush as the service is minimally functional now or should be anyway)23:51
*** hamalq has quit IRC23:57
*** hamalq has joined #opendev23:57

