Because the hardware warranty has expired, we need to move the main-eqiad zookeeper cluster from conf100[123] to conf100[456].
Description
Details
Status | Assigned | Task
---|---|---
Resolved | elukey | T192642 Upgrade Analytics infrastructure to Debian Stretch
Resolved | elukey | T182924 Refresh zookeeper nodes in eqiad
Resolved | None | T166081 rack/setup/install conf1004-conf1006
Resolved | Joe | T190230 install conf1004-6 ssd upgrades
Event Timeline
Noted one thing: when restarting the leader, an observer shuts down and tries to participate in the leader election (like a participant) even though it should not:
2018-03-28 16:52:51,690 - WARN [QuorumPeer[myid=105]/0:0:0:0:0:0:0:0:2181:Observer@77] - Exception when observing the leader
java.io.EOFException
    at java.io.DataInputStream.readInt(DataInputStream.java:392)
    at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
    at org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83)
    at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:99)
    at org.apache.zookeeper.server.quorum.Learner.readPacket(Learner.java:153)
    at org.apache.zookeeper.server.quorum.Observer.observeLeader(Observer.java:73)
    at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:833)
2018-03-28 16:52:51,698 - WARN [RecvWorker:103:QuorumCnxManager$RecvWorker@813] - Interrupting SendWorker
2018-03-28 16:52:51,698 - WARN [SendWorker:103:QuorumCnxManager$SendWorker@727] - Interrupted while waiting for message on queue
java.lang.InterruptedException
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2088)
    at java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:418)
    at org.apache.zookeeper.server.quorum.QuorumCnxManager.pollSendQueue(QuorumCnxManager.java:879)
    at org.apache.zookeeper.server.quorum.QuorumCnxManager.access$500(QuorumCnxManager.java:65)
    at org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(QuorumCnxManager.java:715)
2018-03-28 16:52:51,699 - WARN [SendWorker:103:QuorumCnxManager$SendWorker@736] - Send worker leaving thread
2018-03-28 16:52:51,702 - INFO [QuorumPeer[myid=105]/0:0:0:0:0:0:0:0:2181:Observer@135] - shutdown called
java.lang.Exception: shutdown Observer
    at org.apache.zookeeper.server.quorum.Observer.shutdown(Observer.java:135)
    at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:837)
2018-03-28 16:52:51,703 - INFO [QuorumPeer[myid=105]/0:0:0:0:0:0:0:0:2181:ZooKeeperServer@497] - shutting down
2018-03-28 16:52:51,703 - INFO [QuorumPeer[myid=105]/0:0:0:0:0:0:0:0:2181:ObserverRequestProcessor@121] - Shutting down
2018-03-28 16:52:51,703 - INFO [ObserverRequestProcessor:105:ObserverRequestProcessor@105] - ObserverRequestProcessor exited loop!
2018-03-28 16:52:51,704 - INFO [QuorumPeer[myid=105]/0:0:0:0:0:0:0:0:2181:CommitProcessor@184] - Shutting down
2018-03-28 16:52:51,704 - INFO [CommitProcessor:105:CommitProcessor@153] - CommitProcessor exited loop!
2018-03-28 16:52:51,704 - INFO [QuorumPeer[myid=105]/0:0:0:0:0:0:0:0:2181:FinalRequestProcessor@402] - shutdown of request processor complete
2018-03-28 16:52:51,708 - INFO [QuorumPeer[myid=105]/0:0:0:0:0:0:0:0:2181:SyncRequestProcessor@208] - Shutting down
2018-03-28 16:52:51,708 - INFO [SyncThread:105:SyncRequestProcessor@186] - SyncRequestProcessor exited!
2018-03-28 16:52:51,709 - INFO [QuorumPeer[myid=105]/0:0:0:0:0:0:0:0:2181:QuorumPeer@774] - LOOKING
2018-03-28 16:52:51,709 - INFO [QuorumPeer[myid=105]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@818] - New election. My id = 105, proposed zxid=0x8000000000000000
2018-03-28 16:52:51,722 - INFO [WorkerReceiver[myid=105]:FastLeaderElection@600] - Notification: 0 (message format version), 103 (n.leader), 0x50014eb56 (n.zxid), 0x2 (n.round), LOOKING (n.state), 101 (n.sid), 0x5 (n.peerEpoch) LOOKING (my state)
2018-03-28 16:52:51,724 - WARN [WorkerSender[myid=105]:QuorumCnxManager@400] - Cannot open channel to 103 at election address zk1-3.analytics.eqiad.wmflabs/10.68.20.68:2183
java.net.ConnectException: Connection refused (Connection refused)
The other nodes do the right thing and do not accept its id as a participant, but using observers for this migration does not seem like a stable/viable solution.
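For context, this is roughly how an observer is declared in a ZooKeeper 3.4 ensemble; the ids, hostnames and paths below are illustrative, not our actual settings:

```
# Illustrative zoo.cfg fragments only (ids/hostnames are made up):
#   peerType=observer                                            <- only on the observer node
#   server.105=zk-observer.example.eqiad.wmflabs:2888:3888:observer   <- ":observer" suffix in every node's config
grep -E '^(peerType|server\.)' /etc/zookeeper/conf/zoo.cfg
```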
As we did in a similar use case before (https://etherpad.wikimedia.org/p/Jessie-Zookeeper), the procedure should be to:
- select one of conf100[456], add it with a new id in hiera, replacing one of the running nodes. Merge and run puppet on all nodes (to verify: this should not restart zookeeper).
- start zookeeper on the selected conf100[456] node; it will join the cluster and become a follower.
- stop zookeeper (and mask the unit) on the replaced node among conf100[123] (a follower).
- use the kafka CLI in all the clusters against the new node and make sure it works.
- restart zookeeper on the other two nodes.
- check the status of the cluster (a sketch of the checks is after this list).
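A minimal sketch of the per-node checks, assuming the standard ZooKeeper four-letter commands and the Debian zookeeper systemd unit (hostnames are illustrative):

```
# Against the newly added conf100[456] node: it should answer and report itself as a follower.
echo ruok | nc conf1004.eqiad.wmnet 2181               # expect: imok
echo stat | nc conf1004.eqiad.wmnet 2181 | grep Mode   # expect: Mode: follower

# On the conf100[123] node being replaced: stop zookeeper and mask the unit
# so that puppet or a reboot does not bring it back.
sudo systemctl stop zookeeper
sudo systemctl mask zookeeper
```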
At this point, a rolling restart of the kafka brokers + hadoop daemons would be needed to pick up and test the new config, but in theory, as long as one zookeeper node is reachable, it might be fine to just keep going.
Another idea to reduce the number of restarts would have been to add a DNS CNAME for each zookeeper host, so that, for example, the zookeeper.connect setting in each kafka cluster config would not need to change when the zookeeper cluster does. This does not seem feasible due to how the zookeeper client handles DNS resolution, namely only once.
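To make the coupling concrete, this is roughly the broker-side setting that ties Kafka to the ensemble; the config path, hostnames and chroot below are illustrative assumptions, not the actual production values:

```
# Each Kafka broker lists the ZooKeeper ensemble explicitly, so swapping a
# zookeeper host means editing this line and rolling-restarting the broker.
grep '^zookeeper.connect=' /etc/kafka/server.properties
# zookeeper.connect=conf1004.eqiad.wmnet:2181,conf1005.eqiad.wmnet:2181,conf1006.eqiad.wmnet:2181/kafka
```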
Change 425238 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::configcluster_stretch: add zookeeper profiles for conf1004
Change 425239 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] zookeeper: swap conf1001 with conf1004
Change 425474 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] network::constants: add conf100[456] to zookeeper_main_hosts
Change 425474 merged by Elukey:
[operations/puppet@production] network::constants: add conf100[456] to zookeeper_main_hosts
Re-tested the swap of one node today in labs (analytics project), to verify again the logs and anything that might break. Some details about the procedure:
Starting cluster:
- zk1-1.eqiad.wmflabs - id 101 - OS Jessie - zk version 3.4.5 (same as on conf100[123]) - zookeeper role: follower
- zk1-2.eqiad.wmflabs - id 102 - OS Stretch - zk version 3.4.9 (same as on conf100[456]) - zookeeper role: leader
- zk1-4.eqiad.wmflabs - id 104 - OS Stretch - zk version 3.4.9 - zookeeper role: follower
Goal: swap zk1-4 with zk1-3.
Did the following:
1. Swapped zk1-4 with zk1-3 in hiera, and ran puppet on zk1-3.
2. Ensured that zookeeper on zk1-3 was running and reporting itself as 'follower', while the rest was unchanged.
3. Stopped and masked zookeeper on zk1-4, and re-checked everything. All good.
4. Restarted zookeeper on zk1-1 and checked that everything was good. A weird state was observed.
So the status at point 4) was what I wanted to test, since this is the turning point: the quorum for the cluster goes away when zk1-1 begins the restart, but then the cluster does not recover even when zk1-1 comes back up.
zk1-2 seems fine:
2018-04-12 09:59:20,023 - WARN [RecvWorker:101:QuorumCnxManager$RecvWorker@810] - Connection broken for id 101, my id = 102, error =
java.net.SocketException: Socket closed
    at java.net.SocketInputStream.socketRead0(Native Method)
    at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
    at java.net.SocketInputStream.read(SocketInputStream.java:171)
    at java.net.SocketInputStream.read(SocketInputStream.java:141)
    at java.net.SocketInputStream.read(SocketInputStream.java:224)
    at java.io.DataInputStream.readInt(DataInputStream.java:387)
    at org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumCnxManager.java:795)
2018-04-12 09:59:20,026 - WARN [SendWorker:101:QuorumCnxManager$SendWorker@727] - Interrupted while waiting for message on queue
java.lang.InterruptedException
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2088)
    at java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:418)
    at org.apache.zookeeper.server.quorum.QuorumCnxManager.pollSendQueue(QuorumCnxManager.java:879)
    at org.apache.zookeeper.server.quorum.QuorumCnxManager.access$500(QuorumCnxManager.java:65)
    at org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(QuorumCnxManager.java:715)
2018-04-12 09:59:20,033 - WARN [SendWorker:101:QuorumCnxManager$SendWorker@736] - Send worker leaving thread
2018-04-12 09:59:20,033 - WARN [RecvWorker:101:QuorumCnxManager$RecvWorker@813] - Interrupting SendWorker
2018-04-12 09:59:20,054 - INFO [WorkerReceiver[myid=102]:FastLeaderElection@600] - Notification: 0 (message format version), 101 (n.leader), 0x800015c31 (n.zxid), 0x1 (n.round), LOOKING (n.state), 101 (n.sid), 0x8 (n.peerEpoch) LEADING (my state)
2018-04-12 09:59:20,269 - INFO [WorkerReceiver[myid=102]:FastLeaderElection@600] - Notification: 0 (message format version), 101 (n.leader), 0x800015c31 (n.zxid), 0x1 (n.round), LOOKING (n.state), 101 (n.sid), 0x8 (n.peerEpoch) LEADING (my state)
2018-04-12 09:59:20,675 - INFO [WorkerReceiver[myid=102]:FastLeaderElection@600] - Notification: 0 (message format version), 101 (n.leader), 0x800015c31 (n.zxid), 0x1 (n.round), LOOKING (n.state), 101 (n.sid), 0x8 (n.peerEpoch) LEADING (my state)
2018-04-12 09:59:21,481 - INFO [WorkerReceiver[myid=102]:FastLeaderElection@600] - Notification: 0 (message format version), 101 (n.leader), 0x800015c31 (n.zxid), 0x1 (n.round), LOOKING (n.state), 101 (n.sid), 0x8 (n.peerEpoch) LEADING (my state)
2018-04-12 09:59:23,086 - INFO [WorkerReceiver[myid=102]:FastLeaderElection@600] - Notification: 0 (message format version), 101 (n.leader), 0x800015c31 (n.zxid), 0x1 (n.round), LOOKING (n.state), 101 (n.sid), 0x8 (n.peerEpoch) LEADING (my state)
2018-04-12 09:59:26,290 - INFO [WorkerReceiver[myid=102]:FastLeaderElection@600] - Notification: 0 (message format version), 101 (n.leader), 0x800015c31 (n.zxid), 0x1 (n.round), LOOKING (n.state), 101 (n.sid), 0x8 (n.peerEpoch) LEADING (my state)
2018-04-12 09:59:32,694 - INFO [WorkerReceiver[myid=102]:FastLeaderElection@600] - Notification: 0 (message format version), 101 (n.leader), 0x800015c31 (n.zxid), 0x1 (n.round), LOOKING (n.state), 101 (n.sid), 0x8 (n.peerEpoch) LEADING (my state)
2018-04-12 09:59:45,503 - INFO [WorkerReceiver[myid=102]:FastLeaderElection@600] - Notification: 0 (message format version), 101 (n.leader), 0x800015c31 (n.zxid), 0x1 (n.round), LOOKING (n.state), 101 (n.sid), 0x8 (n.peerEpoch) LEADING (my state)
2018-04-12 10:00:11,112 - INFO [WorkerReceiver[myid=102]:FastLeaderElection@600] - Notification: 0 (message format version), 101 (n.leader), 0x800015c31 (n.zxid), 0x1 (n.round), LOOKING (n.state), 101 (n.sid), 0x8 (n.peerEpoch) LEADING (my state)
But zk1-1's state is not fine, since it keeps repeating the following:
2018-04-12 10:07:55,347 - INFO [WorkerReceiver[myid=101]:FastLeaderElection@542] - Notification: 102 (n.leader), 0x700010d13 (n.zxid), 0x1 (n.round), LEADING (n.state), 102 (n.sid), 0x7 (n.peerEPoch), LOOKING (my state)
After a final restart of zk1-2 (the leader), all the hosts have the new config and a new leader election happens (in this case, electing zk1-3 as leader). From this point onwards, everything works as expected.
I would have expected issues when restarting the leader rather than one of the followers, but it is somewhat expected that a move like this causes a brief unavailability due to inconsistency in the ensemble. Moreover, from what I can read, it might also be that version 3.4.5 is buggy and can lead to these weird states.
From the consumers' point of view (two kafka clusters and one hadoop cluster), I have observed only some non-critical logs on one of the hadoop masters when zookeeper broke its session:
2018-04-12 10:08:24,485 INFO org.apache.zookeeper.ClientCnxn: Unable to read additional data from server sessionid 0x66629631f0860005, likely server has closed socket, closing socket connection and attempting reconnect
2018-04-12 10:08:24,587 INFO org.apache.hadoop.ha.ActiveStandbyElector: Session disconnected. Entering neutral mode...
2018-04-12 10:08:24,841 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server zk1-1.analytics.eqiad.wmflabs/10.68.18.31:2181. Will not attempt to authenticate using SASL (unknown error)
2018-04-12 10:08:24,843 INFO org.apache.zookeeper.ClientCnxn: Socket connection established, initiating session, client: /10.68.22.32:55520, server: zk1-1.analytics.eqiad.wmflabs/10.68.18.31:2181
2018-04-12 10:08:24,844 INFO org.apache.zookeeper.ClientCnxn: Unable to read additional data from server sessionid 0x66629631f0860005, likely server has closed socket, closing socket connection and attempting reconnect
2018-04-12 10:08:25,533 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server zk1-3.analytics.eqiad.wmflabs/10.68.20.68:2181. Will not attempt to authenticate using SASL (unknown error)
2018-04-12 10:08:25,534 INFO org.apache.zookeeper.ClientCnxn: Socket connection established, initiating session, client: /10.68.22.32:37352, server: zk1-3.analytics.eqiad.wmflabs/10.68.20.68:2181
2018-04-12 10:08:25,543 INFO org.apache.zookeeper.ClientCnxn: Session establishment complete on server zk1-3.analytics.eqiad.wmflabs/10.68.20.68:2181, sessionid = 0x66629631f0860005, negotiated timeout = 5000
2018-04-12 10:08:25,544 INFO org.apache.hadoop.ha.ActiveStandbyElector: Session connected.
2018-04-12 10:08:25,546 INFO org.apache.hadoop.ha.ActiveStandbyElector: Checking for any old active which needs to be fenced...
2018-04-12 10:08:25,558 INFO org.apache.hadoop.ha.ActiveStandbyElector: Old node exists: 0a15616e616c79746963732d6861646f6f702d6c61627312276861646f6f702d6d61737465722d312d616e616c79746963732d65716961642d776d666c6162731a276861646f6f702d6d61737465722d312e616e616c79746963732e65716961642e776d666c61627320d43e28d33e
2018-04-12 10:08:25,558 INFO org.apache.hadoop.ha.ActiveStandbyElector: But old node has our own data, so don't need to fence it.
2018-04-12 10:08:25,558 INFO org.apache.hadoop.ha.ActiveStandbyElector: Writing znode /hadoop-ha/analytics-hadoop-labs/ActiveBreadCrumb to indicate that the local node is the most recent active...
2018-04-12 10:08:25,561 INFO org.apache.hadoop.ha.ZKFailoverController: Trying to make NameNode at hadoop-master-1.analytics.eqiad.wmflabs/10.68.22.32:8020 active...
2018-04-12 10:08:25,565 INFO org.apache.hadoop.ha.ZKFailoverController: Successfully transitioned NameNode at hadoop-master-1.analytics.eqiad.wmflabs/10.68.22.32:8020 to active state
So, as far as I can see, even if we should expect a brief unavailability, everything resolves very quickly.
This error seems really similar to what is described in https://issues.apache.org/jira/browse/ZOOKEEPER-1732 (fixed in 3.4.6). If true, this would mean that our 3.4.5 cluster (conf100[123]) might be affected by it, and the rolling restart of the nodes might end up in more errors after the conf1001/4 swap. I am wondering if an upgrade to the Stretch version of zookeeper (3.4.9) could be a good intermediate step.
Interesting discovery today while testing zookeeper on stretch. I tried to clean up /etc/zookeeper/conf and ran puppet to check whether everything was going to be restored, and to my great surprise, the zookeeper systemd unit wasn't able to "start". After a bit of digging, the culprit seems to be the following:
elukey@conf1001:~$ ls -l /etc/zookeeper/conf
lrwxrwxrwx 1 root root 32 Jun 17  2015 /etc/zookeeper/conf -> /etc/alternatives/zookeeper-conf
elukey@conf1001:~$ ls -l /etc/alternatives/zookeeper-conf
lrwxrwxrwx 1 root root 27 Jun 17  2015 /etc/alternatives/zookeeper-conf -> /etc/zookeeper/conf_example
elukey@conf1001:~$ ls -l /etc/zookeeper/conf_example
total 20
-rw-r--r-- 1 root root  535 Mar 15  2014 configuration.xsl
-rw-r--r-- 1 root root  754 Mar 15  2014 environment
-rw-r--r-- 1 root root 1063 Jun 17  2015 log4j.properties
-rw-r--r-- 1 root root    4 Jun 17  2015 myid
-rw-r--r-- 1 root root 2217 Jun  8  2016 zoo.cfg
So we make sure that myid, zoo.cfg and log4j.properties are populated via puppet, but we don't do anything about the environment file (which is created the first time by the package). So if for some reason it is deleted and/or overwritten by the package, then we don't have a good way to restore it.
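If the file goes missing before a proper puppet fix is in place, one possible manual recovery sketch (the package name owning the conffile is an assumption here) would be to ask dpkg which package ships it and reinstall the missing conffiles:

```
# Find the package that ships the example environment file, then restore any
# deleted conffiles for it; --force-confmiss tells dpkg to reinstall missing conffiles.
dpkg -S /etc/zookeeper/conf_example/environment
sudo apt-get install --reinstall -o Dpkg::Options::="--force-confmiss" zookeeper   # package name is a guess
```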
There seems to be a bug open about it:
Change 426918 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet/zookeeper@master] Ensure existence of environment conf file
Tested a migration in labs with two Stretch hosts running zk 3.4.9 and one Jessie host running zk 3.4.9 (Moritz's backport), and the host swap happened without any issue (no host stuck in the LOOKING state).
Change 427078 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::zookeeper::server: introduce the zookeeper349 component
Change 427078 merged by Elukey:
[operations/puppet@production] profile::zookeeper::server: introduce the zookeeper349 component
Change 427090 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::configcluster: upgrade zookeeper to 3.4.9-3~jessie in codfw
Mentioned in SAL (#wikimedia-operations) [2018-04-17T13:00:20Z] <elukey> upgrade zookeeper on conf200[123] to 3.4.9~jessie - T182924
Change 427090 merged by Elukey:
[operations/puppet@production] role::configcluster: upgrade zookeeper to 3.4.9-3~jessie in codfw
Change 427343 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::configcluster: upgrade zookeeper main-eqiad to 3.4.9
Mentioned in SAL (#wikimedia-operations) [2018-04-19T09:33:32Z] <elukey> upgrade zookeper on conf100[123] from 3.4.5 to 3.4.9 - T182924
Change 427343 merged by Elukey:
[operations/puppet@production] role::configcluster: upgrade zookeeper main-eqiad to 3.4.9
Highlights of the plan to swap conf1001 with conf1004:
- merge https://gerrit.wikimedia.org/r/#/c/432615/ and run puppet on the kafka[12]00[123] hosts (not related, but since we are doing rolling restarts anyway..).
- disable puppet on all the conf1* nodes and the kafka-jumbo nodes.
- merge https://gerrit.wikimedia.org/r/#/c/425238 and run puppet on einsteinium and conf1004. Test zookeeper on conf1004: it should report the status "follower", and zkCli.sh should be able to connect and see the znodes. Also check the logs.
- run puppet on conf1001, stop zookeeper and mask the zookeeper unit.
- run puppet on conf1003, verify that zookeeper starts up correctly, and then do the same on conf1002 (the current leader).
After the above, the zookeeper cluster should be healthy and not show any errors.
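A minimal sketch of what the verification could look like; the hostnames and the zkCli.sh path are assumptions based on the Debian package layout:

```
# Ensemble roles: expect one "Mode: leader" and two "Mode: follower" among conf1002/1003/1004.
for h in conf1002 conf1003 conf1004; do
  echo -n "$h: "; echo stat | nc "$h.eqiad.wmnet" 2181 | grep Mode
done

# Spot-check the znodes through the new node with the ZooKeeper CLI.
/usr/share/zookeeper/bin/zkCli.sh -server conf1004.eqiad.wmnet:2181 ls /
```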
Then, a rolling restart of the following (an illustrative restart loop is sketched after the list):
- Kafka main eqiad and codfw
- Kafka Analytics
- Kafka Jumbo
- Burrow on kafkamon1001 should restart by itself
- Hadoop masters and journal nodes
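Purely as an illustration of the rolling restarts above (not the actual procedure; host names, the systemd unit name and the fixed wait are assumptions), a broker-by-broker loop could look like:

```
# Restart one Kafka broker at a time; a real run would wait for under-replicated
# partitions to drain back to zero before moving on, not just sleep.
for h in kafka1001 kafka1002 kafka1003; do
  ssh "$h.eqiad.wmnet" 'sudo systemctl restart kafka'
  sleep 300
done
```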
Change 425238 merged by Elukey:
[operations/puppet@production] Swap conf1001 with conf1004 in Zookeeper main-eqiad
Change 433157 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::prometheus::ops: add new zookeeper hosts' monitoring
Change 433157 merged by Elukey:
[operations/puppet@production] role::prometheus::ops: add new zookeeper hosts' monitoring
Change 433183 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::configcluster: decom zookeeper on conf1001
Added conf100[4-6]'s IPs to the analytics-in4 firewall rules; forgot to do it yesterday.
Change 433183 merged by Elukey:
[operations/puppet@production] role::configcluster: decom zookeeper on conf1001
Change 433322 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Swap zookeeper on conf1002 with conf1005
Change 433322 merged by Elukey:
[operations/puppet@production] Swap zookeeper on conf1002 with conf1005
Change 435810 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] network::constants: update kafka[12]00[1-3] ipv6 addresses
Change 435810 merged by Elukey:
[operations/puppet@production] network::constants: update kafka[12]00[1-3] ipv6 addresses
Change 435821 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] network::constants: update kafka-jumbo ipv6 addresses
Change 435821 merged by Elukey:
[operations/puppet@production] network::constants: update kafka-jumbo ipv6 addresses
Change 435963 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Swap zookeeper from conf1003 to conf1006
Change 435963 merged by Elukey:
[operations/puppet@production] Swap zookeeper from conf1003 to conf1006
Change 436064 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Add notes about conf100[1-3] -> conf100[4-6] migration status
Change 436064 merged by Elukey:
[operations/puppet@production] Add notes about conf100[1-3] -> conf100[4-6] migration status
Change 436222 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::configcluster_stretch: page if a zookeeper node goes down
Change 436222 merged by Elukey:
[operations/puppet@production] role::configcluster_stretch: page if a zookeeper node goes down
Change 436226 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] network::constants: remove decommed zookeeper nodes
Change 436226 merged by Elukey:
[operations/puppet@production] network::constants: remove decommed zookeeper nodes