Thursday, 2020-04-16

fungithat's what it looked like a month ago00:02
ianwoohhh, this must have run before https://opendev.org/openstack/devstack/commit/07be5574726ac71cae7707677258a5d71141172500:02
ianwmerged 7 hours ago?00:02
ianwthat has to be it00:03
fungihttps://review.opendev.org/712609 is what changed it to not that00:03
ianwyes it merged at 16:16 and the job i'm looking at ran at 13 something00:04
fungithat makes sense, in that case00:04
ianwso what have we reverted the suse images to?  was the old one still around?00:04
fungiianw: it was only sort of around00:04
fungibasically vexxhost was failing to delete a copy so mordred managed to download it back from vexxhost and he and corvus reintegrated it into nodepool00:05
ianwwell, i feel like it would work now anyway00:05
fungiseems likely, yes00:07
fungibut in case it doesn't, it would be good to have a rollback solution which doesn't take hours to enact00:08
fungigranted a bunch of that time was spent in confusion because nodepool doesn't correctly reflect deleting state for the remote copy (a fix for that has since been proposed)00:11
fungibut we basically lucked out that there was still a copy stuck deleting in vexxhost00:12
openstackgerritIan Wienand proposed openstack/project-config master: nodepool: use job inheritance  https://review.opendev.org/71315800:46
openstackgerritIan Wienand proposed openstack/project-config master: Add ubuntu-bionic-plain to more regions  https://review.opendev.org/72031600:48
openstackgerritIan Wienand proposed openstack/project-config master: Add ubuntu-bionic-plain to all regions  https://review.opendev.org/72031600:52
*** prometheanfire has quit IRC01:00
openstackgerritIan Wienand proposed openstack/project-config master: nodepool: Add more plain images  https://review.opendev.org/72031801:07
*** prometheanfire has joined #opendev01:07
openstackgerritIan Wienand proposed openstack/project-config master: Add ubuntu-bionic-plain to all regions  https://review.opendev.org/72031602:31
openstackgerritIan Wienand proposed openstack/project-config master: nodepool: Add more plain images  https://review.opendev.org/72031802:31
prometheanfireianw: mind taking a look at my glean review? https://review.opendev.org/71733903:18
prometheanfireianw: it's probably needed for the dib change03:18
prometheanfirewhich means a release :(03:18
ianwthanks, i think that seems sane03:23
prometheanfireglad it does to someone :D03:25
ianwcouple of nits inline03:26
prometheanfirecool03:37
openstackgerritMonty Taylor proposed opendev/system-config master: Remove ansible_user_dir  https://review.opendev.org/72033603:38
mordredianw: if you have a sec ... ^^ - been fighting the long tail on the new trigger-things-with-in-tree03:39
ianwmordred: heh, i am familiar with long tail on changes!!!!03:39
mordredit's the best tail isn't it :)03:39
mordredianw: most of the stuff is well tested - but the bits where tests and prod are different are where we keep failing. go figure right?03:40
prometheanfireianw: for some reason, even though the exit code is 0, if I use == 0 instead of true it doesn't work, I'll try again (tox runs should catch it)03:40
prometheanfireianw: this fails tox :| https://gist.github.com/77295626417800dedb6971e3188ae7a503:49
prometheanfireI do think that it should check against numbers though03:49
ianwwhat's the failure?03:54
prometheanfire   DEBUG [glean] resolved in use, writing to /etc/systemd/resolved.conf04:03
prometheanfireianw: it's confusing to me, it seems like the networkd codepath in the test_glean.py file isn't being hit04:04
prometheanfirespecifically if distro.lower() is 'networkd':04:05
*** osmanlicilegi has joined #opendev04:14
openstackgerritIan Wienand proposed openstack/diskimage-builder master: Add centos aarch64 tests  https://review.opendev.org/72033904:16
prometheanfireianw: you mind applying the patch and running tox?  I feel like I'm doing something wrong mock wise04:21
ianwprometheanfire: tox -e py3 works for me?04:24
prometheanfireianw: with the patch (gist) I linked you?04:25
prometheanfireianw: those should be returning 0/3 and checking for it, but that doesn't seem to be working, for some reason (for instance) the opensuse test has resolved_enabled == 004:27
ianwhrm, just a sec04:28
*** ykarel|away is now known as ykarel04:28
openstackgerritMerged openstack/diskimage-builder master: Do not try to use MBR on AArch64  https://review.opendev.org/71980504:28
prometheanfirechanging the is not to 0 to be is not '0' (and is '0') seems to help04:29
prometheanfirenow I have the opposite fail, but closer, I think04:29
ianwwhy don't you have just one function as the side-effect of os.system and switch in there?04:33
prometheanfireI'd have to pass the distro to the side_effect as well, at least I think04:33
prometheanfirelike I said, my python/mock isn't the best04:33
ianwyou can do that, just pass it with the functools partial04:37
prometheanfiredo you think it'll help here? or is that just a style fix?04:38
prometheanfirebecause I think it's just style04:38
* prometheanfire has spent hours on this true/false/string/int stuff at this point04:38
openstackgerritMerged opendev/system-config master: Remove ansible_user_dir  https://review.opendev.org/72033604:44
openstackgerritIan Wienand proposed openstack/project-config master: nb03: use linaro-us mirror  https://review.opendev.org/72034204:51
ianwsorry i've just got about 3 other things i'm monitoring right now04:51
prometheanfireunderstood, still banging my head against it04:56
prometheanfiresomething is going to give and it's not my head04:56
prometheanfireianw: basically, if you can figure out a better way to mock that system call (which returns ints, not bools, it seems) then I'm all for it, but for some reason the mocking is not picking up which distro the test is running04:58
prometheanfireadded a print statement that only gets printed if we are networkd, then, if not, returns 3 for that os.system call.  in cmd.py I log the output of that os.system call05:00
prometheanfireit's always 005:00
prometheanfirewtf05:00
prometheanfirefreaking is vs ==05:01
openstackgerritMatthew Thode proposed opendev/glean master: write one resolv config  https://review.opendev.org/71733905:04
prometheanfireianw: computers suck, they do exactly what we tell them instead of figuring out the right thing05:04
ianwif distro.lower() is 'networkd' won't do what you'd think ... that wants to be ==05:12
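The `is` versus `==` pitfall discussed here can be shown in isolation. This is a minimal sketch and not glean's actual code; the function names and the sample distro strings are hypothetical, chosen only for illustration.

```python
def matches_by_equality(distro):
    # == compares the contents of the two strings, which is what a check
    # like "is this the networkd case?" actually wants.
    return distro.lower() == 'networkd'


def matches_by_identity(distro):
    # `is` asks whether both names refer to the very same object.
    # distro.lower() builds a new string at runtime, so in CPython it is
    # normally a different object from the literal below, even when the
    # characters are identical. The result is an implementation detail
    # and should never be relied on.
    target = 'networkd'
    return distro.lower() is target


if __name__ == '__main__':
    print(matches_by_equality('NetworkD'))
    print(matches_by_identity('NetworkD'))
```

The same reasoning applies to integer return codes: comparing with `== 0` states the intent, while `is 0` only appears to work because of how an interpreter happens to cache small integers.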
prometheanfireis one of those still around?05:16
prometheanfireoh, I should remove my tox change05:17
openstackgerritIan Wienand proposed opendev/glean master: [dnm] update of I644e0b50cfb7bb00a108160b99c0c1359d6a9dd4  https://review.opendev.org/72034805:17
ianwprometheanfire: ^ something like that i think05:18
openstackgerritMatthew Thode proposed opendev/glean master: write one resolv config  https://review.opendev.org/71733905:18
prometheanfireianw: I'm not sure what you changed? my review doesn't use 'is' anymore05:18
prometheanfireah, we both solved it separately :D05:19
ianwjust make one os_system_side_effect function05:19
prometheanfireI did05:20
prometheanfireianw: see my last two changes05:20
prometheanfire:D05:20
ianwok05:22
prometheanfiremy function is slightly different, but overall does the same thing05:23
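The single side-effect function with `functools.partial` suggested above could look roughly like the following. It is a sketch under stated assumptions, not the code from either review: `os_system_side_effect` is the name used in the discussion, but its body, the helper `resolved_status`, the command string, and the rule "0 on networkd, 3 elsewhere" are all hypothetical.

```python
import functools
import os
from unittest import mock


def os_system_side_effect(distro, command):
    # Hypothetical rule for illustration: the resolved check reports 0 for
    # the networkd case and 3 for every other distro.
    if 'systemd-resolved' in command:
        return 0 if distro.lower() == 'networkd' else 3
    # Any other command is treated as succeeding.
    return 0


def resolved_status(distro):
    # Bind the distro up front. The mock then calls the partial with only
    # the command string, the same way os.system itself is called.
    side_effect = functools.partial(os_system_side_effect, distro)
    with mock.patch('os.system', side_effect=side_effect):
        # The command string is an assumption, not taken from glean.
        return os.system('systemctl is-enabled systemd-resolved')


if __name__ == '__main__':
    for name in ('networkd', 'opensuse'):
        print(name, resolved_status(name))
```

Keeping the switch inside one function means each test only has to choose a distro, and the integer results are compared with `==` rather than `is`.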
openstackgerritAndreas Jaeger proposed openstack/project-config master: Use TOX_CONSTRAINTS_FILE in release script  https://review.opendev.org/72026505:36
AJaegerianw: updated as suggested ^05:37
ianwAJaeger: perhaps a __ typo there?05:37
*** ysandeep is now known as ysandeep|brb05:38
AJaegerindeed ;(05:41
openstackgerritAndreas Jaeger proposed openstack/project-config master: Use TOX_CONSTRAINTS_FILE in release script  https://review.opendev.org/72026505:41
AJaegerthx05:46
AJaegerianw: is https://review.opendev.org/713158 safe to merge? Then I'll review later...05:46
ianwumm, how about i restart the builders so it is, we have to do it sometime.  nb04 is ok05:48
openstackgerritMerged openstack/project-config master: nb03: use linaro-us mirror  https://review.opendev.org/72034205:49
AJaegerinfra-root, in https://review.opendev.org/720342 the promote job failed - infra-prod-service-nodepool05:52
AJaegerguess mordred needs to fix the failure above first ^05:53
ianwHost key verification failed. ... weird05:53
*** ralonsoh has joined #opendev05:56
openstackgerritMerged openstack/project-config master: Use TOX_CONSTRAINTS_FILE in release script  https://review.opendev.org/72026505:58
*** Romik has joined #opendev05:58
ianw#status log restarted all nodepool builders to pickup https://review.opendev.org/#/c/713157/05:59
openstackstatusianw: finished logging05:59
ianw(well, not nb04 because that's already got it)05:59
prometheanfiregood, passed tests06:10
*** Romik has quit IRC06:24
*** ysandeep|brb is now known as ysandeep06:34
*** DSpider has joined #opendev06:35
openstackgerritAndreas Jaeger proposed openstack/project-config master: nodepool: use job inheritance  https://review.opendev.org/71315806:37
openstackgerritAndreas Jaeger proposed openstack/project-config master: Add ubuntu-bionic-plain to all regions  https://review.opendev.org/72031606:37
openstackgerritAndreas Jaeger proposed openstack/project-config master: nodepool: Add more plain images  https://review.opendev.org/72031806:37
fricklermordred: ianw: AJaeger: the jobs adds the local known host key for bridge.o.o, but then the role connects to zuul@localhost. not sure whether amending the role or just adding the key for localhost would be the right solution, though06:38
AJaegerthanks, frickler06:39
*** drifterza has joined #opendev06:39
AJaegerfrickler: care to review https://review.opendev.org/713158 , please?06:39
*** Romik has joined #opendev06:44
fricklerAJaeger: uh, that's a big one, will have to put it on my list for later today06:44
AJaegerfrickler: yeah, took me a mug of tea ;)06:45
AJaegerfrickler: it's rather mechanical on the other hand06:45
*** dpawlik has joined #opendev06:49
*** rpittau|afk is now known as rpittau07:34
*** tosky has joined #opendev07:38
*** ykarel is now known as ykarel|lunch07:43
*** moppiner is now known as moppy07:45
openstackgerritRoman Gorshunov proposed openstack/project-config master: Retire airship-in-a-bottle  https://review.opendev.org/72016008:03
*** Romik has quit IRC08:10
*** ysandeep is now known as ysandeep|lunch08:53
*** hrw has joined #opendev09:04
hrwmorning09:04
hrwianw: thanks09:04
*** ykarel|lunch is now known as ykarel09:21
openstackgerritMarcin Juszkiewicz proposed openstack/project-config master: Add CentOS 8 AArch64 nodes  https://review.opendev.org/72016709:22
ianwthanks, lgtm, my preference would be to remove pip-and-virtualenv to not grow further dependencies we need to remove.  happy for someone to merge, you should be able to view build logs on nb03.openstack.org09:26
openstackgerritMarcin Juszkiewicz proposed openstack/project-config master: Add CentOS 8 AArch64 nodes  https://review.opendev.org/72016709:30
hrwthanks09:37
*** smcginnis has quit IRC09:52
*** lpetrut has joined #opendev09:56
fricklermordred: the post-merge failure here also seems related to your bridge updates https://review.opendev.org/72024510:09
*** elod_ has joined #opendev10:09
*** elod_ has quit IRC10:09
*** rpittau is now known as rpittau|bbl10:19
*** ysandeep|lunch is now known as ysandeep10:22
openstackgerritMerged openstack/project-config master: Add CentOS 8 AArch64 nodes  https://review.opendev.org/72016710:37
openstackgerritAndreas Jaeger proposed openstack/project-config master: nodepool: use job inheritance  https://review.opendev.org/71315810:55
*** avass has quit IRC11:06
*** ysandeep is now known as ysandeep|afk11:15
*** drifterza has quit IRC11:22
*** ysandeep|afk is now known as ysandeep11:47
*** dpawlik has quit IRC11:50
*** dpawlik has joined #opendev11:50
*** rpittau|bbl is now known as rpittau12:07
*** hashar has joined #opendev12:09
*** dpawlik has quit IRC12:09
*** smcginnis has joined #opendev12:09
*** dpawlik has joined #opendev12:10
openstackgerritTristan Cacqueray proposed zuul/zuul-jobs master: dhall-diff: add new job  https://review.opendev.org/71869412:44
mordredfrickler: I'm very confused why it's trying to push to localhost :(13:02
mordredit should be trying to push to bridge ... which is what I'd expect ansible_host to be set to there13:06
openstackgerritMonty Taylor proposed opendev/system-config master: Set ansible_host explicitly  https://review.opendev.org/72046913:07
mordredfrickler, fungi : ^^ that's a bit of a stab in the dark13:08
*** ysandeep is now known as ysandeep|mtg13:14
corvusmordred: see comment on https://review.opendev.org/72046913:25
mordredcorvus: damn. yup13:27
openstackgerritMonty Taylor proposed opendev/system-config master: Set ansible_host explicitly  https://review.opendev.org/72046913:27
mordredcorvus: that said - does that make _any_ sense to you?13:28
corvusmordred: nope, was about to start looking for other ideas13:28
mordredthe role does delegate_to: localhost and uses ansible_host ... only thing I could think of is maybe add_host isn't setting ansible_host - but that is just bong13:29
mordredcorvus: I have confirmed the behavior13:31
corvusmordred: well, it did the same thing with _port13:31
corvusmordred: can we add it to add_host, like port?13:32
mordredmaybe - lemme check13:32
*** ykarel is now known as ykarel|afk13:32
mordredyes, that works13:32
openstackgerritMonty Taylor proposed opendev/system-config master: Set ansible_host explicitly  https://review.opendev.org/72046913:34
mordredcorvus: that works in my local testing13:34
corvusmordred: +313:35
mordredcorvus: ok - so - I get the same behavior with the file in the ... oh!13:36
mordredcorvus: we explicitly set ansible_host in the zuul prepared inventory13:36
mordredto the ip address13:36
mordredif I make an inventory with just a host in it (no ansible_host set) - I get the same behavior as with add_host in the playbook13:36
corvusah, so we accidentally relied on that being set by zuul (which isn't crazy, it's pretty much a zuul role :)13:37
mordredso - in general there seems like some weirdness related to add_host there - but in the zuul execution context it should work13:37
mordredyeah13:37
corvusi think this is probably the best solution13:37
openstackgerritJames E. Blair proposed opendev/system-config master: Meetpad: proxy through meetpad to etherpad.opendev.org  https://review.opendev.org/72009513:39
corvusmordred: i think my next task should either be working on containerizing zk or nodepool-launcher -- are you doing either of those right now?13:41
mordredcorvus: I started looking at nodepool-launcher right before eod yesterday - it's my planned next task13:42
corvuscool, i'll start on zk13:42
mordredcorvus: I'm excited about our new containerized zuul future13:42
corvusya13:42
*** kevinz has quit IRC13:51
openstackgerritMonty Taylor proposed opendev/system-config master: Remove puppet and cron mentions from docs  https://review.opendev.org/71879114:00
mordredcorvus, fungi: ^^ updated that with mentions of the DISABLE-ANSIBLE flag file14:00
openstackgerritMerged opendev/system-config master: Set ansible_host explicitly  https://review.opendev.org/72046914:01
corvusoof, and rebased :(14:01
corvusmordred: whenever possible, can you try to avoid rebasing changes like that? :)14:02
corvusi know there's a lot of stuff in flight, but i'd really love to just review the delta there14:02
*** roman_g has quit IRC14:02
corvusmordred: i rebased it locally, mind if i push that up?14:04
openstackgerritJames E. Blair proposed opendev/system-config master: Remove puppet and cron mentions from docs  https://review.opendev.org/71879114:05
mordredcorvus: thanks14:08
mordredcorvus: woot! install-ansible worked14:09
corvusmordred: install-ansible?  er, do you mean the workspace sync stuff?14:09
mordredyeah14:09
corvus\ol14:09
* mordred is going to re-enqueue the project-config patch that failed overnight14:10
mordredcorvus: multiple project stanzas are ok and get merged right?14:13
corvusyep14:13
corvus(we use that heavily in zuul-jobs/zuul-test.d)14:14
mordredcorvus: I'd like to split the system-config zuul.yaml into a .zuul.d dir organized by purpose14:14
mordredso I was thinking putting the project defs for the set of jobs in the same file would be nice14:14
corvus++ yep that's the pattern in zuul-jobs14:14
mordredcool14:14
mordredinfra-root: could we land https://review.opendev.org/#/c/711057/ and https://review.opendev.org/#/c/718788/14:17
* mordred looking through things that are maybe falling through the cracs14:17
mordredcracks14:17
*** lpetrut has quit IRC14:34
*** mlavalle has joined #opendev14:38
openstackgerritMonty Taylor proposed opendev/system-config master: Install kubectl via openshift client tools  https://review.opendev.org/70741214:42
openstackgerritMonty Taylor proposed opendev/system-config master: Remove snap cleanup tasks  https://review.opendev.org/70929314:42
openstackgerritMonty Taylor proposed opendev/system-config master: Install kubectl via openshift client tools  https://review.opendev.org/70741214:42
openstackgerritMonty Taylor proposed opendev/system-config master: Remove snap cleanup tasks  https://review.opendev.org/70929314:42
corvusapparently we have an "install-zookeeper" role which we use to test the nodepool deployment; i think once this is finished, we should remove that in favor of having the gate stand up a 1-node zk cluster with this setup.14:42
mordredcorvus: ++14:44
mordredcorvus: also - amusingly enough - the service-zuul patch is failing ... because it's trying to install a zuul user and there is already a zuul user, because it's what we use in zuul14:45
mordredit is ... an unfortunate edge case14:45
*** ykarel|afk is now known as ykarel14:45
openstackgerritMonty Taylor proposed opendev/system-config master: Run Zuul using Ansible and Containers  https://review.opendev.org/71762014:47
mordredcorvus: for now I'm going to just add a failed_when: false to the user creation... I don't have a better idea of how to deal with it but maybe we'll think of something14:47
openstackgerritJames E. Blair proposed opendev/system-config master: WIP: run ZK from containers  https://review.opendev.org/72049814:50
AJaegerconfig-core, another victoria change for review, please: https://review.opendev.org/72025714:51
corvusactually, we're probably going to *have* to replace install-zookeeper with this, otherwise we're not going to be testing the tls stuff14:51
corvusAJaeger: is that more of an #openstack-infra thing?14:52
AJaegercorvus: I rather use #opendev nowadays - should we really split that?14:54
corvusi think the idea was to make this more approachable to folks who are here for non-openstack projects14:55
AJaegerbut the goal was to abandon #openstack-infra, so my understanding is we wanted to use this channel primarily. #opendev is for everybody - not excluding openstack. That's at least how I understood it so far...14:57
mordredI didn't think it was the goal to abandon #openstack-infra - but instead to split so that opendev is about the general service, and openstack-infra is about things in service of the openstack project more specifically14:58
corvusAJaeger: we have different understandings then; mine was a venn-diagram of communities, and they overlap here, but other channels may still be useful for openstack/airship/zuul/etc specific topics.  we should get more folks together to talk about this and come to consensus :)14:58
*** ysandeep|mtg is now known as ysandeep14:59
mordredyeah - it's a new concept and I think we're still figuring it out for sure :)14:59
AJaeger;)14:59
AJaegerI'm fine to change, would be great to have a common understanding.15:00
fungiwe did talk about folding the openstack-infra ml into openstack-discuss, but i don't recall talking about doing something like that for the irc channel15:01
fungiif we abandon the #openstack-infra irc channel, it might be more consistent to move openstack-oriented discussions to #openstack-dev or #openstack-qa15:02
mordredyeah - although project-config like discussions might be weird still15:03
fungiinfrastructure-related discussions can certainly happen in here for any project hosted in opendev's infrastructure i think15:05
*** roman_g has joined #opendev15:05
fungibecause we shouldn't need to be expected to hang out in everybody's irc channels15:05
fungibut discussing release and job configuration which is specific to a particular project, even if it's hosted in one of our trusted config repos, may still be better in a project-specific channel15:06
AJaegerso, airship/starlingx job configuration should happen in some airship/starlingx/... channel? Or #opendev - but openstack ones in #openstack-infra?15:07
fungiit's a good question. the current comingling of project-specific configs in a central repository means picking a venue for discussion depends on finding somewhere everyone who needs to discuss that can be present15:09
AJaegerwe can play it by ear for now - and review in a few weeks.15:10
AJaegerWhat I take away is that really openstack specific discussion can stay on #openstack-infra - for now ;)15:10
fungithat's true, at a minimum15:10
fungii intend to continue sticking around in that channel anyway15:11
*** ysandeep is now known as ysandeep|away15:11
AJaegerreading http://lists.openstack.org/pipermail/openstack-discuss/2020-March/013380.html again - I agree with your comments, sorry, I somehow had that internalized differently.15:12
openstackgerritMonty Taylor proposed opendev/system-config master: Install kubectl via openshift client tools  https://review.opendev.org/70741215:12
openstackgerritMonty Taylor proposed opendev/system-config master: Remove snap cleanup tasks  https://review.opendev.org/70929315:12
fricklermordred: corvus: comment/question on https://review.opendev.org/711057 about non-root-useability15:21
*** dpawlik has quit IRC15:26
*** dpawlik has joined #opendev15:26
AJaegerfrickler: is https://review.opendev.org/713158 now good? I made the suggested changes quickly15:28
fricklerAJaeger: I was hoping ianw could answer the question regarding nb03, but we can also do that in a followup, if you think we should merge now before new conflicts appear15:30
AJaegerfrickler: Ah, I see - ok, let ianw self-approve and followup15:32
*** ykarel is now known as ykarel|away15:35
*** ttx has quit IRC15:38
*** ttx has joined #opendev15:38
openstackgerritJames E. Blair proposed opendev/system-config master: WIP: run ZK from containers  https://review.opendev.org/72049815:40
corvusmordred, fungi: can you review the commit message there ^  -- figure out what kind of a migration we want to do15:40
fungilookin'15:41
openstackgerritMerged opendev/system-config master: Get rid of all-clouds.yaml  https://review.opendev.org/71878815:41
mordredcorvus: I think that proposal seems fine15:42
corvusk.  part of me is itchin to do the 'roll out new servers under a new domain' thing, but this'll be faster, and that's not user-facing15:42
fungicorvus: will moving data directories break the running daemons?15:43
corvusfungi: yes, sorry i meant to suggest we do that during an outage15:43
corvuslike a 5-min outage15:43
fungiahh, okay, that wasn't indicated in the commit message. given that, sounds fine. we need to take down zuul and nodepool during that time too i guess?15:44
corvusyep15:44
corvusoh, you know, there's probably a way to do this as a rolling restart15:44
corvusit *is* an ha cluster :)15:45
mordredcorvus: in nodepool-launcher we're currently installing an ssh private key - but it feels like a leftover - are we still using that for something?15:45
corvusmordred: i don't think so15:45
mordredk. I'm going to leave it out - and we can always add it back if needed15:45
corvussounds like a plan15:45
*** moppy has quit IRC15:49
openstackgerritMonty Taylor proposed opendev/system-config master: Run Zuul using Ansible and Containers  https://review.opendev.org/71762015:52
openstackgerritMonty Taylor proposed opendev/system-config master: Run nodepool launchers with ansible and containers  https://review.opendev.org/72052715:52
mordredcorvus: somehow I think that's actually it for nodepool15:53
mordredcorvus: and looking at it - I kind of think we could rebase it to not be on top of the zuul one - and just shut down the launchers and run it and I think we'll be transitioned :)15:53
mordred(there's not a lot going on with the launchers)15:53
corvusmordred, fungi: a quick look at zk upgrade instructions makes me think we should be able to do a live rolling upgrade from our current to the new container-based system15:54
mordredcorvus: cool!15:55
mordredcorvus: that said - won't we need an outage to update to ssl?15:55
mordredor is that online too?15:55
corvusi'm thinking maybe what we should do is make a copy of the data files on all 3 servers (as a DR backup), then do the rolling upgrade.  if it borks, shut everything down and try to restart the new servers on the old data files.  and if that borks, well, we'll get nice new images.  :)15:56
corvusmaybe we can do a data dump too....15:56
mordredcorvus: seems good to me15:56
corvusmordred: i'm not sure, i think there might be a way to rolling upgrade to tls15:56
corvuseither way, i'd like to do that as a second phase anyway15:57
corvuslooks like we can do zk-shell mirror15:58
corvusso that's a good second-level data backup15:58
corvuscool, i just made a data backup on nl0116:01
openstackgerritJames Page proposed openstack/project-config master: Add TrilioVault charms  https://review.opendev.org/72053416:01
corvusit took ~40 seconds16:01
corvusit's a json file16:01
fungilooking into rolling maintenance across the cluster seems like a useful exercise anyway, agreed16:03
openstackgerritJames E. Blair proposed opendev/system-config master: Run ZK from containers  https://review.opendev.org/72049816:06
mordredcorvus: cool!16:09
AJaegerdo we want to run pypy on bindep - or time to drop pypy testing?16:14
* mordred does not care about pypy at all16:15
*** knikolla has joined #opendev16:16
openstackgerritAndreas Jaeger proposed openstack/project-config master: Remove pypy job from bindep  https://review.opendev.org/72054316:17
AJaegerIf anybody cares, please -1 ;) ^16:17
*** sshnaidm has joined #opendev16:17
*** rpittau is now known as rpittau|afk16:18
mordredAJaeger: I'm excited about removing pypy jobs :)16:21
AJaeger;)16:22
AJaegerthe templates are still used in a few stable branches but master is now rid of it - with exception of jjb16:23
corvusaww, i like pypy :)16:23
fungiif they renamed it puppy, nobody would want to get rid of something so cute16:24
corvusi mean, i agree that we're not really targeting it or putting effort into supporting it, and we shouldn't run it.  but that doesn't make me happy16:24
openstackgerritMonty Taylor proposed opendev/system-config master: Install kubectl via openshift client tools  https://review.opendev.org/70741216:25
openstackgerritMonty Taylor proposed opendev/system-config master: Remove snap cleanup tasks  https://review.opendev.org/70929316:25
openstackgerritMonty Taylor proposed opendev/system-config master: Run Zuul using Ansible and Containers  https://review.opendev.org/71762016:25
openstackgerritMonty Taylor proposed opendev/system-config master: Run nodepool launchers with ansible and containers  https://review.opendev.org/72052716:25
mordredcorvus: I have no problem with it - my excitement about removing them is that nobody put any energy into supporting it, so the jobs have been wasted energy - I agree, I think it would have been cool if people had actually cared16:26
corvusya16:29
corvusapparently we have a 'linter' that runs tests that check "yaml groups"16:36
corvusit emitted the error "The group <puppet> does not contain host <zk01.openstack.org>"16:36
corvusi'm like "yeah, that's right... why do you think it should?"16:36
openstackgerritJames E. Blair proposed opendev/system-config master: Run ZK from containers  https://review.opendev.org/72049816:38
corvusfungi, mordred: ^ okay that is ready for review and expected to pass all jobs now16:39
corvusi think we might be able to execute that today16:39
fungithanks! adding to the top of my pile16:39
corvusyou can look at the output of the run from the previous patchset -- the newest ps only fixes that linter error16:39
mordredcorvus: yeah - I think that's ultimately a test of the yamlgroup plugin - but it sure is annoying when we shift things :)16:40
openstackgerritMonty Taylor proposed opendev/system-config master: Run nodepool launchers with ansible and containers  https://review.opendev.org/72052716:42
mordredcorvus: ^^ I rebased that off of the other stack - because it really doesn't depend on it16:43
mordredcorvus: I think we should be able to roll it out today too16:43
corvusmordred: ya.  we'll still need the zuul stuff for tls though16:43
mordredyah16:43
mordredbut in the spirit of rolling out smaller changes as we can :)16:44
corvus++16:44
clarkbmordred: corvus so should we land https://review.opendev.org/#/c/719589/ to avoid needing to coordinate that change across more services?16:44
clarkbfungi: also any progress on etherpad. I was going to use it to help record that nodepool debugging but it seems to now be unhappy with my browser :/16:45
mordredclarkb: yeah - I think it would be good to land the compose change before we roll out the new services16:45
clarkbalso do we need to treat the etherpad issues as more of a fire?16:45
clarkb(they seem to be persisting)16:45
corvusclarkb: what browser?16:46
mordredclarkb: have we applied the new remediations?16:46
clarkbcorvus: FF16:46
corvuswhat remediations?16:46
clarkbmordred: I wasn't aware of any, do you have links?16:46
fungiclarkb: we were going to try to troubleshoot from an affected client next16:46
mordredI think we might need to run the etherpad playbook - if we landed any changes while the jobs were broken16:46
fungimordred: the only outstanding possible config adjustment indicated in that one seemingly related bug was to set a timeout on the proxy16:46
mordredah - but we haven't done that yet16:47
clarkbcorvus: fungi: I clicked new pad which didn't actually render the text in the button box, then it seemed to sit on loading until I closed the tab and switched to paste16:47
corvuscan we downgrade?16:47
fungiclarkb: what ip address were you coming from?16:47
corvusclarkb: i also see it hanging after clicking 'new pad'16:48
mordredcorvus: I don't know - I'm not sure what the db implications would be16:48
corvusclarkb: eventually loaded for me after about a minute16:48
mordred(since there isn't a schema or schema upgrades, I have no idea if 1.8 would write data that earlier can't read)16:48
fungieww, docker-compose logs include ansi escapes even if stdout isn't a tty16:49
clarkbfungi: probably because those originate in the service16:49
clarkb(so they aren't rewriting the logs for us)16:50
mordredthere is a --no-ansi option to docker-compose logs16:50
fungimordred: thanks16:50
corvusor you can use 'docker logs'16:50
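For log output that has already been captured with the escapes in it, they can also be stripped after the fact. The snippet below is a minimal sketch, assuming only the common CSI-style sequences (colours, resets) are present; the sample log line and the names `ANSI_CSI` and `strip_ansi` are made up for illustration.

```python
import re

# Matches CSI sequences such as "\x1b[32m" (set colour) and "\x1b[0m"
# (reset): ESC, "[", optional parameter bytes, optional intermediate
# bytes, then one final byte.
ANSI_CSI = re.compile(r'\x1b\[[0-9;?]*[ -/]*[@-~]')


def strip_ansi(text):
    # Remove every matched escape sequence and keep the rest unchanged.
    return ANSI_CSI.sub('', text)


if __name__ == '__main__':
    sample = '\x1b[32m[INFO]\x1b[0m pad created'
    print(strip_ansi(sample))
```

Asking the tool not to emit the escapes in the first place, as suggested above, remains the simpler option when the logs can be fetched again.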
fungiclarkb: looks like it attempted to create k4tmEkOycMixbBdNxbHc for you16:51
fungiat 16:42:0716:51
clarkbmordred: looking at etherpad server we don't have the fixed apache logs yet so rerunning config management for that may be a good idea if nothing else16:51
mordredclarkb: k. want me to run the playbook real quick?16:51
clarkbfungi: that timestamp looks correct16:51
clarkbmordred: I'll defer to others as I'm not sure what all else is happening, just noting we did fix that in config so should apply it at some point16:52
mordredcorvus, fungi : thoughts?16:52
clarkbidea: etherpad-dev is upgraded we can compare to it maybe?16:53
corvusmordred: running the pb sounds good16:53
corvusclarkb: can you elaborate?16:53
fungiclarkb: so interestingly, there was no error or traceback related to pad k4tmEkOycMixbBdNxbHc in the logs16:53
* mordred runs playbook16:54
clarkbcorvus: we've got etherpad-dev running the newer etherpad code too, but it is not using upstream docker images, may be using a different apache, nodejs, mysql, etc16:54
fungijust two info lines, one for creating the pad, one for the author leaving the pad16:54
clarkbcorvus: if we find that etherpad-dev is operating more happily that may help us narrow down where the problems are16:54
corvusclarkb: right -- though we should make sure we consider "load" as a variable16:54
clarkbcorvus: thats fair. fwiw etherpad-dev new pad button loads properly and gives me a pad relatively quickly (~3 seconds or so?)16:55
fungii'm fetching an updated server-status since we know it's misbehaving currently16:55
fungithat's also taking a while to return16:55
*** moppy has joined #opendev16:56
fungii think apache itself may be having trouble16:56
fungi`wget -qO- https://etherpad.opendev.org/server-status` locally on the server just hangs for me16:57
corvusmordred: "mysql -u root -p" inside the mariadb container with the root password specified in the environment in the docker-compose file isn't working -- anything obvious you see i'm missing?16:57
fungi[Thu Apr 16 16:57:45.176108 2020] [mpm_event:error] [pid 31892:tid 139699770100672] AH03490: scoreboard is full, not at MaxRequestWorkers.Increase ServerLimit.16:57
fungithat's likely related16:58
fungistarted up around 16:54:54 after a graceful restart was triggered16:58
fungianybody object if i want to restart apache? i don't think we can get status info out of it in this state anyway16:58
corvusfungi: coordinate with mordred16:59
corvushe's running a playbook which is probably gracefully restarting apache16:59
fungioh16:59
mordredgo for it16:59
mordredplaybook is done16:59
mordredcorvus: looking16:59
fungiwell, possible your playbook is why i couldn't get server-status in that case16:59
clarkbAH03490: scoreboard is full, not at MaxRequestWorkers.Increase ServerLimit. <- that implies to me that maybe we do need apache tuing16:59
clarkb*tuning16:59
fungisorry, i was still trying to investigate and gather data, didn't realize folks were already changing things16:59
mordredcorvus: mysql -p$MYSQL_ROOT_PASSWORD works for me17:00
corvusclarkb: we had apache tuning; did we lose it, or are you saying we need to re-tune?17:00
fungiclarkb: yes, that's what i pasted a few minutes ago, but it only began at 16:54:5417:00
fungiwhich probably coincides with ansible17:00
corvusmordred: weird that works for me too.  i wonder why my copy/paste didn't work17:00
mordredyeah. that's about when I ran ansible17:00
clarkbcorvus: maybe its a side effect of the graceful restart and wouldn't otherwise be an issue17:00
corvusmordred: hahaha17:01
fungiseems to have started immediately following ansible requesting a service reload17:01
corvusmordred: a close examination of the password will explain the problem17:01
corvusmordred: there is a character that should not be in a password17:01
corvus(we should probably use a consistent "pwgen -s 16" or similar to make our passwords)17:01
fungianyway, i still can't fetch server-status, while i was able to do it this way yesterday with no problem17:01
mordredcorvus: "neast"17:01
corvusfungi: are you going to restart apache?17:02
fungiwaiting for conversation to coalesce. we're okay with trying an apache restart next?17:02
corvusfungi: mordred said go for it17:02
fungiokay, doing that next17:02
fungiapache has started back up now17:03
fungiand /server-status returns now for me17:03
corvusi'm able to load a pad but it's agonizingly slow17:04
corvusthe db is super responsive; it's almost completely idle, i don't see any contention17:05
mordredyeah - db seems fine to me too17:07
fungithe /server-status scorecard is much smaller today than yesterday17:08
fungier, scoreboard17:08
corvusclarkb, mordred, fungi: i think we did lose our apache tuning17:09
corvushttps://opendev.org/opendev/puppet-etherpad_lite/src/branch/master/files/apache-connection-tuning17:10
corvusi don't see anything like that on the new server17:10
mordredagree. totally lost that.17:10
fungiyeah, even yesterday, we only had 11 workers when i was checking17:11
mordredcorvus: you re-adding or want me to?17:11
corvusmordred: you do it17:11
mordredon it17:11
corvusi'll look for anything else in the old module we might have missed17:11
corvusmordred, clarkb, fungi: i also don't see this, but i don't know what it does: https://opendev.org/opendev/puppet-etherpad_lite/src/branch/master/files/pad.js17:12
clarkbcorvus: that should default open the chat window17:13
clarkbwe can probably work that in later once general performance things are happier17:13
corvusah, yep, that does appear to be a behavior change17:13
corvusagreed17:13
fungidoesn't sound too critical17:13
openstackgerritMonty Taylor proposed opendev/system-config master: Add apache connection tuning back to apache  https://review.opendev.org/72056217:14
corvuswe do have a robots.txt, but it's returning 40317:15
mordredcorvus: we're not bind mounting it17:15
fungiRewriteRule ^/robots.txt$ /var/etherpad/robots.txt [L]17:16
fungiyeah, i guess need a bindmount for /var/etherpad/robots.txt then?17:16
corvusAH01630: client denied by server configuration: /var/etherpad/robots.txt17:16
corvusno we want apache doing that17:16
corvuswe just don't have a correct apache config17:16
openstackgerritMonty Taylor proposed opendev/system-config master: Bind mount robots.txt  https://review.opendev.org/72056417:16
fungiinteresting, yeah i wonder why we're missing a directory allow17:17
mordredoh - wait - that's dumb from me - sorry.17:18
mordredwe don't need to bind mount it - apache is running on the host17:18
fungion the old server we used to serve it from /srv/etherpad-lite/robots.txt17:18
mordredcorvus: we're missing an allow on that path aren't we?17:19
fungieven on the old server i don't see any directory block allowing access to that17:19
mordredwell - on the old server that was set as docroot17:20
mordredin the puppet module17:20
funginot in the vhost config though17:20
corvusmordred: yeah, we need a Directory + Require all granted17:20
mordreddid puppet set something?17:20
fungithere is no docroot in the old vhost either17:20
mordredyeah. I agree17:20
fungimaybe we broke it a while back17:21
corvusmordred: if it was under /var/www i think it'd be ok17:21
fungianyway, yeah, i think we need to explicitly include an allow directive for at least that one file path17:21
corvuscan you do a file?17:21
corvusotherwise, how about we move that to /var/etherpad/www/robots.txt then add a <Directory> for /var/etherpad/www17:22
corvusso that we don't accidentally allow /var/etherpad/db/17:22
openstackgerritMonty Taylor proposed opendev/system-config master: Grant access to robots.txt  https://review.opendev.org/72056417:22
mordredcorvus: good point - changing17:22
openstackgerritMonty Taylor proposed opendev/system-config master: Grant access to robots.txt  https://review.opendev.org/72056417:23
fungithough if we just want that one file, https://httpd.apache.org/docs/current/mod/core.html#files17:23
corvusah cool, that'd work too17:24
fungii'm not seeing absolute path references for the files directive, but it can be nested in a directory17:25
fungiso could make a directory block of /var/etherpad with a files block for robots.txt inside that and then grant access just to that17:25
corvuswell, mordred has the directory version done... except i left a comment of https://review.opendev.org/72056417:25
fungibut yeah, i think the directory solution is fine17:26
corvusi think we only need "Require" now?17:26
fungiRequire all granted17:26
fungiyep17:26
corvusyeah, but we don't need order or allow17:26
fungiunless we want to allow overrides or anything17:26
corvusi think order+allow is 2.2 backwards compat17:26
fungiright17:26
corvusyeah https://cwiki.apache.org/confluence/display/HTTPD/ClientDeniedByServerConfiguration confirms17:27
fungijust "Require all granted" is sufficient in 2.4+17:27
openstackgerritMonty Taylor proposed opendev/system-config master: Grant access to robots.txt  https://review.opendev.org/72056417:28
corvuscool +2 on both17:28
corvusactually +3 on the first17:28
fungi+3 on the second now17:29
corvusokay, i think there's a good possibility this explains the issues, so we should re-evaluate after they land i think17:29
fungiit almost definitely explains the issues. we were low on slots when i started looking at /server-status yesterday *after* problems had calmed down17:30
corvusah, even better then17:30
fungiand we didn't really get complaints until tuesdayish when folks started to do stuff en masse on the server17:30
mordredagree17:30
fungiour testing over the weekend showed it was nice and snappy when we were the only folks using it17:31
fungiand etherpad-dev is still nice and snappy17:31
corvusthose pesky users17:31
mordredwe should pay attention post ansible ...17:33
mordredthe last ansible driven apache restart seemed maybe unhappy - but maybe also it was fine and was just the same symptom17:33
fungii think it was just that graceful restarting when the server was already overloaded isn't going to go well17:34
mordrednod17:35
fungiwe may need to do a hard restart after this one for the same reason17:35
fungigraceful restart tries to keep serving established connections and no longer accepting new connections on each worker until they can be expired from rotation and a new worker spawned with the updated config17:35
corvusmordred: zuul -1 on https://review.opendev.org/72052717:36
corvusi'm going to afk for 30m17:36
fungiso couple that with long-lived websocket connections for etherpad and a lack of available worker slots...17:36
clarkbya I'm in and out right now with kids school stuff. I have about 10 minutes to the next thing anything I should review urgently17:36
*** sshnaidm has quit IRC17:37
corvusi think we got the urgent stuff +3d17:37
fungireview a cup of tea17:37
corvusclarkb: you can probably review the commit message of https://review.opendev.org/72049817:37
corvusnot urgent, but a good use of a minute i think17:38
mordredclarkb, corvus: https://review.opendev.org/#/c/707412/ ... what am I doing wrong?17:38
clarkbcorvus: looks good. I only wonder what split data and log files means, but that seems to be implementation detail17:39
mordredoh. blerg17:39
openstackgerritMonty Taylor proposed opendev/system-config master: Install kubectl via openshift client tools  https://review.opendev.org/70741217:40
openstackgerritMonty Taylor proposed opendev/system-config master: Remove snap cleanup tasks  https://review.opendev.org/70929317:40
openstackgerritMonty Taylor proposed opendev/system-config master: Run Zuul using Ansible and Containers  https://review.opendev.org/71762017:40
clarkbmordred: path issues I think17:40
corvusclarkb: zk has an option to store the data files and transaction logs in different locations; we don't use it, but the default zk image sets up volume mounts for it that way.  i figured now would be a good time to do that, even though we still will have them on the same disk.  we could move them later more easily if we want to add an ssd or something.17:40
clarkbcorvus: got it17:41
corvusessentially something like "mv *.log ....."17:41
clarkbmordred: seems like you untar to /opt but check /usr for the binaries17:41
openstackgerritMonty Taylor proposed opendev/system-config master: Run nodepool launchers with ansible and containers  https://review.opendev.org/72052717:42
mordredclarkb: no - it was a symlink issue - I didn't tell it to link :)17:42
mordredclarkb: or - I could copy them in place. let me do that actually17:44
openstackgerritMonty Taylor proposed opendev/system-config master: Install kubectl via openshift client tools  https://review.opendev.org/70741217:45
openstackgerritMonty Taylor proposed opendev/system-config master: Remove snap cleanup tasks  https://review.opendev.org/70929317:45
openstackgerritMonty Taylor proposed opendev/system-config master: Run Zuul using Ansible and Containers  https://review.opendev.org/71762017:45
mordredclarkb: that should be better17:45
mordred(that way we don't have to mount /opt/oc into containers or bubblewrap)17:46
*** diablo_rojo has joined #opendev17:50
openstackgerritJeremy Stanley proposed opendev/irc-meetings master: Not all meetings are OpenStack  https://review.opendev.org/72006317:52
*** ildikov has joined #opendev17:58
*** hashar is now known as hasharAway18:03
openstackgerritMonty Taylor proposed opendev/system-config master: Run Zuul using Ansible and Containers  https://review.opendev.org/71762018:10
mordredinfra-root: nb02 is unhappy - I can't shell in to it18:11
clarkbI agree "connection reset by peer"18:12
mordredI think at this point we need to cloud reboot it right?18:12
clarkbmordred: I think that is the normal resolution. Maybe check the console first for obvious signs of distress18:12
mordredyeah18:13
mordredlooking now18:13
mordrednope18:13
mordredrebooting18:14
corvusback18:15
mordredback in18:15
corvusit apparently died around 18:10:3118:16
*** ralonsoh has quit IRC18:16
*** diablo_rojo has quit IRC18:18
mordredcorvus: re-review on https://review.opendev.org/#/c/707412 ?18:19
mordredclarkb: and review from you please on https://review.opendev.org/#/c/707412 and https://review.opendev.org/#/c/709293/18:20
*** icarusfactor has joined #opendev18:24
clarkbmordred: corvus I've removed the WIP from https://review.opendev.org/#/c/719589/ maybe thats a thing to try and land once we're happy with where etherpad has ended up?18:26
*** factor has quit IRC18:27
mordredclarkb: I think it's ok to land that one whenever there's adequate human attention18:28
clarkbok, I'm not really in that space at the moment. virtual aquarium tour is over but now I need to find food and get a bike ride in then should have the bulk of the afternoon for attention18:28
mordredyeah. I'm ok with rolling forward with it once you're around - however, I'm doing a bike ride a little later this afternoon, so I don't know if our attention buckets will overlap (Although I'm also fine if you want to go ahead with it)18:29
clarkbk18:29
mordredprobably just mostly need 2 of us to actually pay attention18:30
openstackgerritMerged opendev/system-config master: Add apache connection tuning back to apache  https://review.opendev.org/72056218:33
clarkbmordred: the weather has been great here. My fake commute keeps getting longer :)18:33
mordredclarkb: same!18:34
mordredclarkb: our range for what a "normal short walk" is has gotten quite long too18:34
corvusclarkb: +218:34
corvusclarkb, fungi: if either of you want to add a +2 to https://review.opendev.org/720498 then i can start doing that after lunch18:35
clarkbcorvus: left a couple of notes18:39
openstackgerritMonty Taylor proposed opendev/system-config master: Run Zuul using Ansible and Containers  https://review.opendev.org/71762018:39
corvusclarkb: thanks!18:40
clarkbinfra-root I think we may have to restart apache to pick up the connection tuning change in production18:40
clarkbthe file seems to be there but apache wasn't restarted18:40
fungichecking18:40
mordredclarkb: oh - you know - we need a notify on that task18:40
fungiwe may not trigger a reload on those18:40
fungiyeah18:41
mordredit's about to get a restart for something else18:41
mordredthe robots patch is in the gate18:41
mordredso that'll take care of it - but I'll do a followup real quick18:41
clarkbah ok so if we wait it will get taken care of but still a minor thing to fix18:41
openstackgerritJames E. Blair proposed opendev/system-config master: Run ZK from containers  https://review.opendev.org/72049818:41
mordredoh - there IS a notify on that18:41
mordredso I don't know why it didn't restart18:42
mordredclarkb: hahaha. my nodepool launchpad patch failed because I forgot to install docker-compose18:43
fungi[Thu Apr 16 18:35:30.647491 2020] [mpm_event:notice] [pid 30025:tid 139892810472384] AH00493: SIGUSR1 received.  Doing graceful restart18:43
openstackgerritMerged opendev/system-config master: Grant access to robots.txt  https://review.opendev.org/72056418:44
fungi[Thu Apr 16 18:35:30.658496 2020] [mpm_event:warn] [pid 30025:tid 139892810472384] AH00501: changing ServerLimit to 128 from original value of 16 not allowed during restart18:44
clarkbmordred: just parent to my change and it will fix that for you :P18:44
mordredclarkb: yeah18:44
fungi[Thu Apr 16 18:35:30.658529 2020] [mpm_event:warn] [pid 30025:tid 139892810472384] AH00516: MaxRequestWorkers of 4096 would require 128 servers and exceed ServerLimit of 16, decreasing to 51218:44
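The numbers in the two AH00501/AH00516 warnings above can be checked with a little arithmetic. This is only a sketch: the value of 32 threads per child is not stated in the log and is inferred here from 4096 / 128 (and 512 / 16), and the helper names are hypothetical.

```python
import math

# Assumption: ThreadsPerChild is 32, inferred from the log (4096 / 128 = 32).
THREADS_PER_CHILD = 32


def required_servers(max_request_workers, threads_per_child):
    """Number of child processes needed to host the requested worker threads."""
    return math.ceil(max_request_workers / threads_per_child)


def effective_max_workers(max_request_workers, server_limit, threads_per_child):
    """Worker count after capping at what ServerLimit child processes can hold."""
    return min(max_request_workers, server_limit * threads_per_child)


# The tuning asked for 4096 workers, but the running parent still had ServerLimit 16.
print(required_servers(4096, THREADS_PER_CHILD))
print(effective_max_workers(4096, 16, THREADS_PER_CHILD))
```

With the old ServerLimit of 16 still in force during the graceful restart, the requested 4096 workers are capped at 16 * 32, which matches the "decreasing to 512" in the warning.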
corvusfungi: cool, i vote we just manually restart as a one off18:44
clarkbcorvus: ++18:44
fungiyep, that's where i was headed18:44
mordredwell - hang on18:45
mordredthe robots patch is about to run in deploy18:45
mordredlet's see if that does it?18:45
mordred(otherwise we might be fighting ansible here)18:45
fungiyeah, but it's going to do a graceful too right?18:45
mordredoh - yeah - probably18:45
mordredfungi: so I now agree just restart it :)18:45
fungiapparently the tuning changes need a hard restart, not just graceful18:46
mordred++18:46
fungiamd restarting18:46
fungier, and18:46
funginow /server-status has a huuuuge scoreboard compared to before18:47
mordredcorvus: just left a comment on the zk patch ... it's going to conflict silently with clarkb's patch18:47
fungiinfra-root: keep an ear to the ground for more reports of etherpad issues, but this has hopefully resolved them18:47
mordredcorvus: so we either need to rebase yours on his and remove the install of docker-compose from packages, or we need to rebase his on yours and then have his include a removal of the docker-compose from the zk role18:48
* mordred doesn't have a strong opinion on which - just want to make sure we don't miss the overlap18:48
corvusmordred: yep, i think we'll just need to see what our schedules are :)18:49
openstackgerritMonty Taylor proposed opendev/system-config master: Run nodepool launchers with ansible and containers  https://review.opendev.org/72052718:49
mordredcorvus: ++18:49
clarkbmordred: corvus another option is to have zk change install from pypi and then we can drop that once my change merges18:49
clarkbrebase lite18:49
* mordred has rebased the nodepool patch on the clarkb patch - because the nodepool patch is failing due to lack of docker-compose install :)18:49
mordredclarkb: good point18:50
clarkbthat might be a good way to decouple things for now18:50
clarkbits a little extra on the todo list but its an easy todo18:50
mordredyeah - will also keep us at one zk restart18:50
mordredleft that suggestion as a comment18:51
corvusclarkb: how did you determine that distro docker-compose did not support stop_grace_period?18:56
mordredcorvus: we added it to the compose file and the tests failed18:56
mordredcorvus: with an "unsupported option" error - and then clarkb went through and found out that option was added in a later version of compose than what's in xenial18:57
mordredcorvus: so - you know - not so much with that versioned compose file format :(18:57
corvusis there a link to where it was added?18:57
mordrednot sure - lemme find a link to the error though18:58
corvusi'm just not seeing the version info when i look it up18:58
corvusi believe you :)18:58
corvusi'm just trying to learn (a) what version is required and (b) how to learn what version is required18:58
mordredhttps://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_436/719051/1/check/system-config-run-review/4366e0a/bridge.openstack.org/ara-report/result/f97758dc-12ba-44de-9275-d2e7195d2f38/18:59
mordredcorvus: :)18:59
mordredcorvus: yeah- good point18:59
corvusmaybe the docs are just wrong?19:00
corvusie, the config versioning is correct, it just shouldn't have been listed in the v2 docs?19:00
mordredhttps://docs.docker.com/compose/release-notes/#110019:01
mordredin the compose file version 2.0 and up section19:01
corvuswell, that knocks that theory19:01
mordredand xenial has 1.8.019:01
corvusmordred: but i think that doc is the answer to my q, thanks!19:01
mordredcorvus: \o/19:01
* mordred has provided helpful19:01
corvusmordred: bionic has 1.1719:02
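The version comparison being discussed can be sketched as below, assuming (per the release notes linked above) that stop_grace_period arrived in docker-compose 1.10.0. The function names are hypothetical and the parser only handles plain dotted numeric versions.

```python
def parse_version(text):
    """Turn a dotted version string such as '1.8.0' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


# Assumption taken from the chat: the option was added in docker-compose 1.10.0.
STOP_GRACE_PERIOD_MIN = parse_version("1.10.0")


def supports_stop_grace_period(compose_version):
    """True if this docker-compose release is new enough for stop_grace_period."""
    return parse_version(compose_version) >= STOP_GRACE_PERIOD_MIN


# Distro versions quoted in the chat: xenial ships 1.8.0, bionic ships 1.17.
for distro, version in [("xenial", "1.8.0"), ("bionic", "1.17")]:
    print(distro, version, supports_stop_grace_period(version))
```

Comparing tuples of integers rather than raw strings matters here, since a string comparison would sort "1.8.0" after "1.10.0".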
corvusi realize the pip install is a working solution19:02
corvusi just have some slight hesitation, because, well, the whole point of dockering was to stop global pip installs19:03
clarkbdocker compose release notes had it iirc19:03
clarkboh good you found it19:03
clarkbya v2 doesnt mean v2 I guess?19:03
corvusbut we can totally play the "it'll be fine this time" card, and fix it later :)19:03
clarkbit limits the blast radius at least19:04
corvusthis is actually a case where i'd much rather just install a statically compiled binary :)19:05
mordredyeah :)19:05
*** hasharAway is now known as hashar19:05
openstackgerritJames E. Blair proposed opendev/system-config master: Install docker-compose from pypi  https://review.opendev.org/71958919:05
corvusokay, that's rebase-lite19:05
mordredcorvus: it looks like you did rebase-lite in the docker-compose patch19:06
mordredinstead of in the zk patch19:06
corvusderp, sorry19:07
openstackgerritJames E. Blair proposed opendev/system-config master: Install docker-compose from pypi  https://review.opendev.org/71958919:08
openstackgerritJames E. Blair proposed opendev/system-config master: Run ZK from containers  https://review.opendev.org/72049819:08
corvusshould be fixed (reverted to previous ps on d-c patch)19:08
fungiwe seem to still be getting frequent oom conditions on lists.o.o... today around 12:45-12:50 we had a burst of 10 oom events killing different "python" processes over that 5 minute span... i can go through and restart all the queue runners for the sites on that server, but am curious if there's a good way to diagnose what's running away with memory (or if we've simply added too many sites to one machine).19:09
fungicacti graphs also show a hole in data around that time, suggesting the server mostly fell over, though on the tail end of that you can see load average coming down from a massive spike, most likely swap thrash?19:09
fungicacti data (and lack thereof) is making me suspect that adding more swap won't help, and that it's probably not a function of the number of sites we're hosting19:10
mordredcorvus: +219:10
mordredfungi: ugh19:11
mordredfungi: I have no useful suggestions19:11
fungi#status log restarted all mailman sites on lists.openstack.org following oom events around 12:45-12:50z19:18
openstackstatusfungi: finished logging19:18
openstackgerritMonty Taylor proposed opendev/system-config master: Run Zuul using Ansible and Containers  https://review.opendev.org/71762019:24
openstackgerritMonty Taylor proposed opendev/system-config master: Run nodepool launchers with ansible and containers  https://review.opendev.org/72052719:24
mordredclarkb: if you get a sec, https://review.opendev.org/#/c/718791/19:25
mordredcorvus: neat! the zuul change has failed because zookeeper hosts is undefined :)19:51
mordredcorvus: so - I think it'll likely make sense to base it on your zk patch so it can, you know, create a zk19:52
corvusmordred: cool, yeah let's do that (rather than the nodepool thing)19:52
corvusi'm just about to start the manual zk work19:53
mordredcool. I'm about to step out for a bit - but I think you've got that under control19:53
corvusadded zk* to emergency20:07
corvusi prepared a checkout of change 720498 with the parts about running docker-compose commented out20:12
corvusi copied the data files to a backup location on all 3 servers20:13
corvusi made a secondary json backup on nl0120:15
corvusi'll start running the playbook now20:16
corvusnow i'm going to run this in the locally modified checkout: ansible-playbook --limit="zk01.openstack.org:localhost" playbooks/service-zookeeper.yaml20:17
corvusnow i'm going to be confused why "no hosts matched"20:20
fungizk01.openstack.org is definitely an actual hostname, and localhost should match regardless right?20:21
corvusyeah...20:21
corvusand it's using some of the inventory files from /etc, but that should be okay -- zk01 is currently in the inventory file and in the 'zookeeper' group20:21
corvus(the only inventory related modifications from my change are to remove it from the puppet group; should have no impact here)20:22
corvusoh20:24
corvusit's also reading the emergency file20:24
corvusi'll modify the playbook to omit !disabled20:24
corvusoff we go20:24
fungimakes sense20:25
corvusdone; i'll check out the state on zk01 now20:26
corvusoh, i just now had the thought that maybe we should try running zk as a non-root user20:28
corvus(that's how it is currently run in packages)20:28
fungiand not how it's running in the container images i guess?20:28
corvusright20:29
openstackgerritJames E. Blair proposed opendev/system-config master: Run ZK from containers  https://review.opendev.org/72049820:37
corvusfungi: how's the patchset delta on that look to you?20:38
corvusi think creation will no-op since we currently have a zookeeper user and group on the host20:38
corvusso if it looks good, i can rm -rf /var/zookeeper (the new data/conf location) and re-run the playbook with that applied20:39
fungichecking20:39
fungicorvus: lgtm. marked a misspelling inline but just a nit as it's only in the task name not anything syntactic20:41
*** Romik has joined #opendev20:41
corvusgoup? :)20:41
corvuscool i'll fix that and remove some whitespace20:41
openstackgerritJames E. Blair proposed opendev/system-config master: Run ZK from containers  https://review.opendev.org/72049820:42
corvusand i'll rm and run the playbook again now20:42
fungicool20:42
fungii'm here but also operating a hot skillet at the same time so may go quiet for a few minutes at a time20:43
corvusah drat, that's a change to the user's home dir, which can't run while the user is in use20:46
corvusi think what i should do is modify the home directory in my local copy so it's a no-op, then do a further manual usermod while the service is stopped to change to the new location, then we'll merge the change as written with the new location20:48
corvusi'm going to stop zk on zk01 now20:49
fungithat sounds fine, yep20:53
clarkbreviewing mordred's doc change now. Is etherpad stuff done?20:53
*** Romik has quit IRC20:55
corvushere's what i've done so far for the migration: http://paste.openstack.org/show/792255/20:56
corvusfollowed by some apt-get purging and autoremoving20:57
corvusi think i'm ready to bring zk01 up now20:58
clarkbcorvus: are you going to manually run ansible on bridge to do that?20:58
corvusclarkb: i manually ran a modified playbook to do the prep steps; to bring it up i'll manually run docker-compose up -d20:58
clarkbrgr20:59
corvusdoing that now20:59
corvusoh, ha -- removing the packages removed the zk user21:00
corvusso i'll run the playbook again, and it should (re-)create the user this time21:01
corvusperhaps with different ids, so i'll check that21:01
fungiyeah, if the user was in an autocreate uid range it will wind up getting whatever the next unassigned uid is in that range, so will likely change21:04
clarkblooks like apache on etherpad was restarted21:04
clarkbcreating a new pad seems to be happy21:04
clarkbI guess we watch it and see if that was the fix we needed?21:05
fungii restarted it manually, yes, and confirmed /server-status showed many more available slots21:05
fungiapparently some tuning changes can't be applied via graceful restart, and require a hard restart instead21:06
openstackgerritClark Boylan proposed opendev/system-config master: Use HUP to stop gerrit in docker-compose  https://review.opendev.org/71905121:08
clarkbok ^ is rebased onto latest ps of its parent now21:08
clarkbmordred: corvus fungi (ianw if around) should we go ahead and approve https://review.opendev.org/#/c/719589/ now?21:08
corvusoh... hrm, the zk container image has a zk user, but it's uid 100021:09
corvusthat's the 'ubuntu' user on our host...21:09
clarkbya 1000 is a bad choice21:09
clarkbsince a lot of distros start non system there21:09
corvus(ugh, the whole non-root user under docker thing is a mess)21:09
clarkbcorvus: is there a way to map uids in and out of containers21:10
corvusi think we can just give it a numeric uid and probably nothing inside the container will care21:10
corvuslike, we could tell it to run as 999:998 (which is what just got created), or we could pick new numbers, like 1000221:11
fungialmost everything is always happy with numeric uid/gid anyway, yes21:11
fungiuser and group names are mainly cosmetic21:11
fungiunless referred to in conffiles and the like21:11
corvus(i believe in the bleeding edge or possibly later crio systems, they actually modify /etc/passwd inside containers on startup, so this might actually get better in the future)21:11
corvusso maybe let's just specify 10001:10001 in our user/group creation, and map that in numerically21:12
fungii'm good with giving that a try21:13
clarkbhttps://docs.docker.com/engine/security/userns-remap/#about-remapping-and-subordinate-user-and-group-ids21:13
fungiit's probably only a concern if the uid/gid also happen to be used by another unprivileged user in the parent system21:13
fungiin which case they get access to files in the container's file tree21:13
clarkbI guess that is so you can pretend to be root? but maybe it would work for this too? though it seems fairly involved to set up21:14
clarkboptions to dockerd, and config files need to be set21:14
clarkboh and we couldn't use host networking if we do that21:15
corvusi think that's useful for root, but unnecessary here21:15
*** hashar has quit IRC21:17
openstackgerritMerged opendev/system-config master: Remove puppet and cron mentions from docs  https://review.opendev.org/71879121:18
corvushrm.  it's running, but i don't think it's able to join21:24
corvusan exception in the log i don't understand21:24
clarkbcorvus: let me know if you'd like more eyeballs on it21:26
corvusack, i'll try to eliminate some simple things first21:27
corvushttps://stackoverflow.com/a/6121548721:33
corvusthat might be us21:33
clarkbah so maybe we pin the image after all?21:34
corvusyeah, let me see what i was testing with locally21:34
corvuslooks like 3.5.621:38
corvusthere's a 3.5.7 now21:38
corvusmaybe we should use :3.5 ?21:38
corvusor should we pin to 3.5.7?21:38
clarkbI think we are probably safe to stick to 3.5.x21:39
corvusokay, now i think i should shut it down, and copy all the old data files over again21:39
corvusbecause 3.6.0 may have munged them21:39
fungisounds prudent given the circumstances21:39
corvuszk_1  | 2020-04-16 21:42:08,022 [myid:1] - INFO  [QuorumPeer[myid=1](plain=[0:0:0:0:0:0:0:0]:2181)(secure=disabled):Follower@69] - FOLLOWING - LEADER ELECTION TOOK - 19 MS21:42
corvuszk_1  | 2020-04-16 21:42:08,189 [myid:1] - INFO  [QuorumPeer[myid=1](plain=[0:0:0:0:0:0:0:0]:2181)(secure=disabled):Learner@529] - Learner received UPTODATE message21:42
corvusthis looks good.21:42
corvusi'm going to stop it and back out some of my silly hostname/ip changes which shouldn't be necessary21:43
corvusokay, one of them is necessary21:45
corvuswe have to specify 0.0.0.0 for an own server's binding in the config file21:45
corvusi'll work on jinjaing that up real quick21:46
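The binding rule described above (each server lists its own entry as 0.0.0.0 and its peers by hostname) can be sketched as follows. This is a Python stand-in for the jinja template, not the actual change: the zk02/zk03 hostnames are assumed by analogy with zk01.openstack.org, the 2888/3888 ports are the conventional ZooKeeper quorum and election ports rather than values from the log, and server_lines is a hypothetical helper.

```python
def server_lines(hosts, my_id):
    """Build server.N lines for one quorum member's config.

    hosts maps the numeric server id to its hostname; the entry for my_id
    is rewritten to 0.0.0.0 so the local server binds on all interfaces.
    """
    lines = []
    for server_id, host in sorted(hosts.items()):
        address = "0.0.0.0" if server_id == my_id else host
        # Assumption: conventional quorum (2888) and election (3888) ports.
        lines.append(f"server.{server_id}={address}:2888:3888")
    return lines


# Assumed hostnames, following the zk01.openstack.org pattern from the chat.
HOSTS = {
    1: "zk01.openstack.org",
    2: "zk02.openstack.org",
    3: "zk03.openstack.org",
}

for line in server_lines(HOSTS, my_id=1):
    print(line)
```

Each of the three servers would get a different rendering, with only its own line replaced.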
openstackgerritJames E. Blair proposed opendev/system-config master: Run ZK from containers  https://review.opendev.org/72049821:51
corvusokay, i think that reflects reality on zk121:51
corvuscool -- that last patchset with the bogus uids -- it failed testing :)21:53
corvusall right, i think i'm ready to move on to zk02 with this modified procedure: http://paste.openstack.org/show/792256/21:55
fungilooks good, i guess ansible creating the user and group is still fine then?21:59
corvusyep.  it'll get removed, then created with the id we specify22:00
corvusmoving on to zk02 now22:03
clarkbcorvus: plan lgtm22:03
corvusstopped22:04
corvusrunning playbook22:06
clarkbnot to completely change the subject but I'm thinking if we have interest from frickler and ajaeger for virtual PTG slots we can do an early morning (relative to me) PTG chunk of time one day and if ianw is interested a late day chunk22:09
clarkband maybe do ~2 4 hour chunks for a total of 8 hours or something22:09
clarkb(I worry any more than that will just be painful)22:09
corvusokay, second set of manual steps complete; config files look good22:10
corvusi'm going to start it22:10
corvusi was unconvinced it rejoined correctly the first time, so i stopped it and started it again22:15
corvusthe second time i see the warm fuzzy UPTODATE message22:15
ianwclarkb: i'm happy to be around for such a thing, maybe some others in ~ tz's like tonyb or ricolin and kevinz might find it good too22:15
corvusi will move on to zk03 now22:17
clarkbianw: good point22:17
clarkbLooks like the europe chunk is 1300-1700 UTC and apac is 0400-080022:18
clarkbnow to do math to see how those map onto my timezone22:18
clarkb6am to 10am and 9pm to 1am22:19
corvusincidentally, we did the rolling restart the 'hard' way, killing the leader each time22:20
clarkbthe third chunk is 2pm to 6pm. I think I can actually do all three of those if necessary22:20
fungifor me that's 9am to 1pm and midnight to 4am22:20
clarkbfungi: ya I'm thinking that maybe 1300-1700 and 2100-0100 might be better given your eastern timezone22:20
clarkbbut I'm not sure how early 2100 is for ianw yet22:21
fungimidnight to 4am might be tough for me, but if i don't have anything scheduled the day before/after that i can manage it22:21
openstackgerritJames E. Blair proposed opendev/system-config master: Run ZK from containers  https://review.opendev.org/72049822:21
clarkb7am to 11am for ianw I think22:21
ianw2100 would be ... 6 or 7 i think22:21
clarkbianw: and china would be even earlier I think22:21
clarkbnow I'm thinking maybe we do 2 hours in each of the chunks maybe and try and make things nicer then do ~4 sessions? I'll keep noodling on this and probably set up a chart of timezones :)22:22
ianwi think taiwan is 2 hours before that22:23
ianwnot sure what tz kevinz falls into22:23
clarkblike maybe we do 1300-1500, 2300-0100, and 0400-060022:23
corvusmoving to #opendev-meeting22:24
clarkbcorvus: sorry!22:24
corvusdon't be, that's what it's there for22:24
ianwfrickler / AJaeger: thanks for reviews and fixups on the nodepool config.  i'll apply it now and watch through.  frickler i responded that i very much hope the -plain images disappear ASAP22:25
ianwif we need fixups or other odd things (which it's looking like we should, hopefully, not) we should be able to handle that in base jobs, rather than images22:26
ianwclarkb: the only change to https://review.opendev.org/#/c/718224/ was putting it on top of the "check for pip" bits, yeah?22:28
clarkbianw: yup22:29
ianwcorvus: when you have a second, if you could loop back on https://review.opendev.org/#/c/717663/26 and i believe with the changes to check if pip is installed, your -1 should be satisfied22:29
corvusianw: yep +0 (looks like it just carried over cause it was a rebase)  feel free to proceed with existing votes :)22:30
ianwthanks22:32
openstackgerritMerged openstack/project-config master: nodepool: use job inheritance  https://review.opendev.org/71315822:34
ianwmy scrollback ran out ... did we figure out the project-config post job localhost key thing?22:34
ianwi guess so, it looks like it's running for ^^22:36
openstackgerritIan Wienand proposed openstack/project-config master: Add ubuntu-bionic-plain to all regions  https://review.opendev.org/72031622:42
openstackgerritIan Wienand proposed openstack/project-config master: nodepool: Add more plain images  https://review.opendev.org/72031822:42
ianwinfra-root: ^ if we could consider these, it will be good for testing both "plain" hosts, and also testing the container builder on other image types22:42
clarkbok I used pencil on paper and drew some things. I think if we do Monday 1300-1500, 2300-0100 then Wednesday 0400-0600, that gives us 6 hours to talk about things but also gives each timezone chunk 4 hours of non super painful time. And we should be able to find sleep too22:43
clarkbwe can shift the days forward if necessary too, but this also nicely doesn't conflict with our normal team meeting22:44
clarkbianw: for those changes we don't add the -plain images to the launcher providers looks like? is that intentional? we just want to upload for now?22:47
ianwyeah i thought maybe build them first?  i can add if you want an all-in-one22:52
clarkbnah if that is intentional its fine22:55
clarkbI was worried you were expecting nodes in the providers too22:55
ianwit moves a lot of builds into the container builder, so i'll be interested if anything happens22:56
ianwwrt to building the other image types there.  we don't (yet) have functional tests covering them all; on my todo list to convert dib22:56
openstackgerritMerged zuul/zuul-jobs master: ensure-pip: export ensure_pip_virtualenv_command  https://review.opendev.org/71822423:01
ianwnote that devstack is working on the bionic-plain images https://review.opendev.org/#/c/712211/23:04
ianwopensuse should have actually been working, but the required change got -2'd by devstack initially and didn't make it in, leading to a lot of unfortunate confusion23:04
*** tosky has quit IRC23:09
mnaserhmm23:16
mnaserdid something happen with zuul not long ago?23:16
mnaseroh, docker-ifying things23:16
mnaserthat can explain why a job that was wrapping up is now in 2. attempt ?23:17
clarkbmnaser: yes, its zookeeper db had a sad. We think we've got it back to 2 happy nodes and now trying to get the third to achieve quorum23:17
clarkbmnaser: yes23:17
mnaserok no worries, sorry, i quickly glanced scrollback and didn't see anything obvious23:17
* mnaser sends hugops23:17
clarkbmnaser: when zuul loses connectivity to zk nodepool cleans up all the nodes23:17
clarkbzuul sees that as a network error and will retry the jobs if they haven't already retried 3 times for some reason23:17
mnaserive only seen the 2 attempt happen when pre fails, but that's a new scenario i learned i guess23:18
openstackgerritMerged zuul/zuul-jobs master: fetch-zuul-cloner: use ensure-pip  https://review.opendev.org/71788223:25
mnaseri just saw 2 jobs fail to upload to buildset registry with a timeout23:31
mnaser Get https://zuul-jobs.buildset-registry:5000/v2/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)23:31
mnaseri'll retry them, because they had to be restarted, but i'll let you know if i see it happen again..23:32
clarkbmnaser: that could be fallout from the other issues23:45
clarkbbecause buildset registry runs in a paused job23:46
clarkband that won't restart properly maybe?23:46
clarkbI expect rechecks to be fine now that zk is stable again23:46
*** mlavalle has quit IRC23:50
