Thursday, 2020-04-16

*** diablo_rojo has joined #opendev-meeting17:51
*** diablo_rojo has quit IRC18:18
corvushi me!22:24
corvusokay, i have stopped zk03 22:24
clarkbhello I'm paying attention to the zk stuff too and didn't mean to interrupt it. Just noticed the cross platform folks were asked to do ptg thinking by april 16th 22:24
* clarkb follows along here too22:24
fungithis is the last cluster member, right?22:25
corvusyep22:25
fungiawesome22:25
fungiso with it down we're running entirely on containerized cluster members already22:25
corvusi noticed during the leader election of zk02 that it emitted an error about being unable to write out the 'next version of the dynamic config'22:26
corvusit should have write access to the conf dir, but the existing zoo.cfg was root-owned22:26
corvusi'm not expecting it to update zoo.cfg, so i dunno what's up with that22:26
corvusit also wasn't too upset -- it didn't crash or anything22:26
corvusbut to learn more, i've chowned those on the other hosts, so maybe in a bit i'll force a leader election and see what happens22:27
corvusbut for now, i'll just proceed with zk03 22:27
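The ownership question above (a root-owned zoo.cfg inside a conf dir the service should be able to write) can be checked before forcing another election. A minimal sketch, assuming it runs as the same user as ZooKeeper; the helper name `unwritable_paths` is hypothetical and not part of any tooling mentioned in this log:

```python
import os
from typing import Iterable, List


def unwritable_paths(paths: Iterable[str]) -> List[str]:
    """Return the paths the current user cannot write to.

    Paths that do not exist are reported as well, since os.access()
    returns False for them.
    """
    return [p for p in paths if not os.access(p, os.W_OK)]
```

Passing the conf dir and the zoo.cfg path on each host would list whichever of them still needs a chown. Note that root can write almost anywhere, so the check only says something useful when run as the unprivileged service user.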
corvusrunning playbook now22:28
corvusconfig files look good, starting22:32
corvusit says it's synchronizing, but like zk02, it's taking a while; i may stop and start it again22:36
corvusthat appears better22:39
corvusokay, i'm going to try restarting zk02 now since i think it's the leader22:41
corvusthis is not going well22:42
corvusthe remaining servers are rejecting client connection requests22:43
clarkbcorvus: we expected them to elect a leader and continue to run along right?22:44
corvusinfra-root: we may be about to see a lot of fallout in zuul22:44
corvusyep22:44
clarkbthe "good" news is we know what that fallout looks like22:44
clarkb(it's the OOM in scheduler case)22:45
fungithanks for the heads up22:45
corvusi restarted zk02 but they're still not happy22:45
corvusi'm open to ideas22:48
corvusmaybe a full stop/start?22:48
clarkbRefusing session request for client /104.130.246.196:53542 as it has seen zxid 0xb00000000 our last zxid is 0xa00040d89 client must try another server22:49
clarkbthat seems to be the issue. I think we want to find the server with zxid 0xb00000000 and make it the running one?22:49
clarkb(the issue with clients)22:49
corvusi don't think that zk02 is able to join22:49
fungii'm not finding any literal references to "next version of the dynamic config" in a web search, fwiw22:49
fungi(if that was a literal quote)22:50
corvusit was not22:50
clarkb[WorkerSender[myid=2]:QuorumCnxManager@438] - Have smaller server identifier, so dropping the connection: (3, 2) is also interesting22:51
corvusclarkb: 0xb00000000 looks suspicious to me22:51
corvusi'd like to just stop everything and restart22:51
corvusany objections?22:51
fungithat is a very round number22:51
fungicorvus: no objection here22:52
clarkbcorvus: no objections here and yes I believe it is suspect too22:54
clarkbif you look at logs for 'proposed zxid' they are all smaller22:54
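The two zxids in the refused-session message are easier to compare once each is split into its parts. A minimal sketch, assuming the usual ZooKeeper layout in which the high 32 bits of a zxid hold the leader epoch and the low 32 bits hold a per-epoch counter; the helper name `split_zxid` is hypothetical:

```python
from typing import Tuple


def split_zxid(zxid: int) -> Tuple[int, int]:
    """Split a 64-bit zxid into (epoch, counter)."""
    epoch = zxid >> 32             # high 32 bits: leader epoch
    counter = zxid & 0xFFFFFFFF    # low 32 bits: transactions within that epoch
    return epoch, counter


# The values quoted from the log line above.
print(split_zxid(0xb00000000))   # zxid the client says it has seen
print(split_zxid(0xa00040d89))   # last zxid known to the refusing server
```

Read this way, the client had seen epoch 11 with a counter of zero, while the refusing server was still at epoch 10. That would be consistent with the value looking so round: it is the first zxid of a new epoch, with no transactions in it yet.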
corvusfully stopped22:54
corvusi will start it up in reverse order22:54
clarkbwhen the election starts they emit log info for myid and proposed zxid22:54
corvuszk_1  | 2020-04-16 22:55:13,988 [myid:1] - ERROR [QuorumPeer[myid=1](plain=[0:0:0:0:0:0:0:0]:2181)(secure=disabled):QuorumPeer@1619] - Error writing next dynamic config file to disk:22:56
corvusthat's the error22:56
corvusit looks like zk1 and zk3 have established a quorum with 3 as the leader22:57
corvusnodepool has reconnected22:57
corvuszk02 still looks like it has not connected22:58
clarkbhttps://zookeeper.apache.org/doc/r3.5.2-alpha/zookeeperReconfig.html#sc_reconfig_file 22:58
clarkbdynamicConfigFile is the config setting for that file path22:58
clarkbtrying to figure out what the default is now22:58
fungi"the docs don't mention that the "static" configuration files _needs_22:59
fungito be writable"22:59
fungihttps://github.com/pravega/zookeeper-operator/issues/66 22:59
corvusyeah, maybe it's writing to some unwritable path in the container22:59
corvusand we should explicitly set that22:59
clarkbexplicitly setting it solves needing to figure out what the default is :)23:00
clarkbI like explicit23:00
corvusi still can't think of why zk02 won't join23:00
clarkbcorvus: 2020-04-16 22:54:49,250 [myid:2] - WARN  [WorkerSender[myid=2]:QuorumCnxManager@685] - Cannot open channel to 1 at election address zk01.openstack.org/23.253.236.126:3888 23:02
clarkbcould it be a tcp problem?23:02
clarkbhrm no I can telnet23:02
clarkbthough maybe that was a race during startup and I need to look later in the logs23:03
corvusyeah, that time looks suspect23:03
clarkbcrazy idea. we stop zk02, move its data file aside and start it and force it to sync that state from the other two23:03
corvussounds good23:04
clarkbthis assumes that maybe it's a conflict between their db states (we would be asserting 01 and 03 are correct)23:04
corvusfyi: here's an explanation of the 'smaller identifier' message http://zookeeper-user.578899.n2.nabble.com/Have-smaller-server-identifier-so-dropping-the-connection-td7583860.html 23:04
clarkbbasically do the process of completely replacing a server23:04
corvus02 is down, moving data23:04
corvusoh!23:05
corvuscat myid23:05
corvus02 23:05
corvusthat should be single digit23:05
corvusand it is on the other hosts23:06
corvushow did that end up as 02?23:06
clarkbcorvus: I believe it was 02 from puppet23:06
clarkblet me rephrase23:06
clarkbI believe puppet put the 0 prefix because it got it from the hostname23:06
clarkband zk was ok with that on puppet deployed zk23:06
corvushrm.  well, i guess i "improved" it then23:06
corvusbecause the ansible stuff should be doing single digits23:06
corvusand....23:07
corvusmaybe i accidentally copied the puppet myid file on 02 23:07
clarkboh that could explain the errors with permissions too if file paths were different23:07
corvusyeah, i don't think i can confirm that, but i think that's entirely possible23:07
corvusi don't think there are any perm errors other than the dynamic config thing23:08
corvus(i definitely didn't copy the puppet zoo.cfg file, that's in a different dir)23:08
corvusclarkb: so let's continue with the 'restart from scratch' process, and just also fix the myid file23:08
clarkbk23:08
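The myid mismatch discussed above comes from deriving the id from the hostname and keeping its zero padding. A minimal sketch of producing the unpadded form, assuming hostnames shaped like `zk02.openstack.org`; the helper name `myid_from_hostname` is hypothetical and this is not the actual puppet or ansible logic:

```python
import re


def myid_from_hostname(hostname: str) -> str:
    """Return the server id for a host such as 'zk02.openstack.org' as '2'."""
    match = re.match(r"[A-Za-z]+(\d+)", hostname)
    if match is None:
        raise ValueError(f"no numeric suffix in {hostname!r}")
    # int() drops the leading zero, so '02' becomes 2.
    return str(int(match.group(1)))


print(myid_from_hostname("zk02.openstack.org"))
```

Writing the result to the myid file keeps every host on the same unpadded form regardless of how the hostname is written.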
corvusi've restarted 2 23:13
corvusno joy yet23:13
clarkbthis time around might take longer if it's syncing the data?23:14
clarkbhrm looks like container isn't running anymore23:14
corvusoh, it was a *different* myid file23:14
corvusi stopped it23:14
corvusi'm going to rm the data again23:15
corvusstarted again23:15
clarkbcorvus: INFO  [NIOWorkerThread-1:FourLetterCommands@235] - The list of enabled four letter word commands is : [[srvr]] that was me trying to run the stat command and it wasn't allowed23:24
corvusclarkb: ah yeah, i think we may need to whitelist them in this version23:24
clarkbsrvr is apparently more verbose than stat so I'll try that now23:24
clarkband that errors with zk is not currently serving requests23:25
clarkbso it seems busy?23:25
corvusclarkb: on zk02?23:25
clarkbya23:25
corvusthat's the one that's not in the quorum23:25
clarkbdid telnet localhost 2181 and issued that command23:25
clarkbit's like gearman23:25
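The telnet check described above can also be scripted. A minimal sketch, assuming the server answers a four letter word on its client port and then closes the connection; the function name `four_letter_word` is hypothetical, and the default port 2181 is the one used in the telnet command above:

```python
import socket


def four_letter_word(command: str, host: str = "localhost",
                     port: int = 2181, timeout: float = 5.0) -> str:
    """Send a four letter word (e.g. 'srvr') and return the raw reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(command.encode("ascii"))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:       # server closed the connection: reply complete
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")
```

Calling `four_letter_word("srvr")` on a cluster member should return the same text the telnet session shows. Only commands in the enabled list reported in the log line above will be answered; in the 3.5 series that list is controlled by the `4lw.commands.whitelist` setting.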
corvustry it on the other nodes23:26
clarkbcorvus: ya I seem to recall when I set up the puppetized cluster that things would still respond to confirm but maybe that's because it was always in quorum and never unhappy23:26
clarkbbut let me see what 01 says23:26
clarkbhttps://issues.apache.org/jira/browse/ZOOKEEPER-2164 23:28
clarkbcorvus: ^ possible that is related?23:29
clarkbthere is at least one individual in there indicating the docker images have exhibited this where older 3.4 zk never did23:29
clarkbcorvus: the last thing there indicates that 3.5.8 will have the fix23:30
corvusgah23:30
clarkbcomputers23:30
clarkbI've been summoned to start dinner plans, back in a bit23:30
corvusthere are 2 bugs in that bug23:31
corvusone of them is this: 1254 23:31
corvusgrr23:31
corvushttps://github.com/apache/zookeeper/pull/1254 23:31
corvusi think that's the one that's fixed in 3.5.8 23:31
corvusand it's related to using 0.0.0.0 and connection ids being smaller23:32
corvusso looks very likely23:32
corvusokay, we *might* be able to work around this by specifying our actual ip addresses23:32
corvuswe should have that in the inventory23:32
corvusi'm going to stop 02, then replace the config entirely with ipv4 addrs, then start 2 23:33
clarkbok23:33
corvusinstajoin23:35
corvusclarkb: nice find :)23:35
clarkbwow23:37
clarkbso we need to name the members by ip explicitly?23:38
corvusyep, due to that bug23:39
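The workaround above amounts to listing every member by its real address in the server entries. A minimal sketch of generating those entries from an inventory-like mapping, assuming the conventional `server.N=host:quorum_port:election_port` form. Only the zk01 address and the 3888 election port appear in this log; the 2888 quorum port and the other two addresses (documentation-range placeholders) are assumptions, and `server_lines` is a hypothetical helper:

```python
from typing import Dict, List


def server_lines(members: Dict[int, str], quorum_port: int = 2888,
                 election_port: int = 3888) -> List[str]:
    """Build one 'server.N=ip:quorum:election' line per cluster member."""
    return [
        f"server.{sid}={ip}:{quorum_port}:{election_port}"
        for sid, ip in sorted(members.items())
    ]


# zk01's address is taken from the log; the other two are placeholders.
members = {1: "23.253.236.126", 2: "192.0.2.2", 3: "192.0.2.3"}
for line in server_lines(members):
    print(line)
```

Generating the same list for every host means no member describes itself by a wildcard address, which is the situation the linked bug is reported to mishandle.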
corvusi'm a little confused about how we should handle the dynamic config23:39
corvusperhaps we should omit it entirely and leave our zoo.cfg file owned by root23:41
clarkbcorvus: does it try to write to the normal file by default?23:41
clarkbseems like explicitly setting a writeable path should work?23:42
corvusclarkb: the way i read that is that if it writes a dynamic config file, it may, in some circumstances, rewrite the main config file and remove some lines23:42
corvuswhich would be weird for us the next time ansible runs23:42
clarkboh huh23:43
corvusi think i'm going to eod and leave things as they are right now23:44
corvusi have 2 notes in my zk change to address; i'll do that tomorrow, and as part of that, we'll get the ips in the other config files23:44
clarkbya seems stable enough for now, they are in the emergency file right?23:44
corvusthen we can merge the change and remove from emergency23:44
corvusyep23:44
corvusinfra-root: ^ fyi23:45
corvusbasically, zk hosts are stable, but in emergency file, because their configuration is ahead of what's in ansible.23:45
corvusand we're running in containers now23:45
corvus#status log upgraded zk ensemble to 3.5.7 running in containers23:46
openstackstatuscorvus: finished logging23:46
fungimakes sense, thanks for working through that23:46
corvusthat *almost* went perfectly :)23:46
corvuswe actually did complete a rolling upgrade of zk from 3.4 to 3.5 without any user-visible impact23:47
corvusit was only when we did further testing afterwards that we had the hiccup23:47
fungiand a worthwhile experiment23:48
