Because the hardware warranty has expired, we need to move the main-eqiad zookeeper cluster from conf100[123] to conf100[456].
Description
Details
Status | Assigned | Task
---|---|---
Resolved | elukey | T192642 Upgrade Analytics infrastructure to Debian Stretch
Resolved | elukey | T182924 Refresh zookeeper nodes in eqiad
Resolved | None | T166081 rack/setup/install conf1004-conf1006
Resolved | Joe | T190230 install conf1004-6 ssd upgrades
Event Timeline
Noted one thing: when restarting the leader, an observer shuts down and tries to participate in the leader election (like a participant) even though it should not:
2018-03-28 16:52:51,690 - WARN [QuorumPeer[myid=105]/0:0:0:0:0:0:0:0:2181:Observer@77] - Exception when observing the leader
java.io.EOFException
    at java.io.DataInputStream.readInt(DataInputStream.java:392)
    at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
    at org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83)
    at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:99)
    at org.apache.zookeeper.server.quorum.Learner.readPacket(Learner.java:153)
    at org.apache.zookeeper.server.quorum.Observer.observeLeader(Observer.java:73)
    at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:833)
2018-03-28 16:52:51,698 - WARN [RecvWorker:103:QuorumCnxManager$RecvWorker@813] - Interrupting SendWorker
2018-03-28 16:52:51,698 - WARN [SendWorker:103:QuorumCnxManager$SendWorker@727] - Interrupted while waiting for message on queue
java.lang.InterruptedException
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2088)
    at java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:418)
    at org.apache.zookeeper.server.quorum.QuorumCnxManager.pollSendQueue(QuorumCnxManager.java:879)
    at org.apache.zookeeper.server.quorum.QuorumCnxManager.access$500(QuorumCnxManager.java:65)
    at org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(QuorumCnxManager.java:715)
2018-03-28 16:52:51,699 - WARN [SendWorker:103:QuorumCnxManager$SendWorker@736] - Send worker leaving thread
2018-03-28 16:52:51,702 - INFO [QuorumPeer[myid=105]/0:0:0:0:0:0:0:0:2181:Observer@135] - shutdown called
java.lang.Exception: shutdown Observer
    at org.apache.zookeeper.server.quorum.Observer.shutdown(Observer.java:135)
    at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:837)
2018-03-28 16:52:51,703 - INFO [QuorumPeer[myid=105]/0:0:0:0:0:0:0:0:2181:ZooKeeperServer@497] - shutting down
2018-03-28 16:52:51,703 - INFO [QuorumPeer[myid=105]/0:0:0:0:0:0:0:0:2181:ObserverRequestProcessor@121] - Shutting down
2018-03-28 16:52:51,703 - INFO [ObserverRequestProcessor:105:ObserverRequestProcessor@105] - ObserverRequestProcessor exited loop!
2018-03-28 16:52:51,704 - INFO [QuorumPeer[myid=105]/0:0:0:0:0:0:0:0:2181:CommitProcessor@184] - Shutting down
2018-03-28 16:52:51,704 - INFO [CommitProcessor:105:CommitProcessor@153] - CommitProcessor exited loop!
2018-03-28 16:52:51,704 - INFO [QuorumPeer[myid=105]/0:0:0:0:0:0:0:0:2181:FinalRequestProcessor@402] - shutdown of request processor complete
2018-03-28 16:52:51,708 - INFO [QuorumPeer[myid=105]/0:0:0:0:0:0:0:0:2181:SyncRequestProcessor@208] - Shutting down
2018-03-28 16:52:51,708 - INFO [SyncThread:105:SyncRequestProcessor@186] - SyncRequestProcessor exited!
2018-03-28 16:52:51,709 - INFO [QuorumPeer[myid=105]/0:0:0:0:0:0:0:0:2181:QuorumPeer@774] - LOOKING
2018-03-28 16:52:51,709 - INFO [QuorumPeer[myid=105]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@818] - New election. My id = 105, proposed zxid=0x8000000000000000
2018-03-28 16:52:51,722 - INFO [WorkerReceiver[myid=105]:FastLeaderElection@600] - Notification: 0 (message format version), 103 (n.leader), 0x50014eb56 (n.zxid), 0x2 (n.round), LOOKING (n.state), 101 (n.sid), 0x5 (n.peerEpoch) LOOKING (my state)
2018-03-28 16:52:51,724 - WARN [WorkerSender[myid=105]:QuorumCnxManager@400] - Cannot open channel to 103 at election address zk1-3.analytics.eqiad.wmflabs/10.68.20.68:2183
java.net.ConnectException: Connection refused (Connection refused)
The other nodes do the right thing and do not accept its id as a participant, but using observers for this migration does not seem like a stable/viable solution.
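For context, this is roughly how an observer is declared in a ZooKeeper 3.4 ensemble; the ids, hostnames and paths below are illustrative, not our actual settings:

```
# Illustrative zoo.cfg fragments only (ids/hostnames are made up):
#   peerType=observer                                            <- only on the observer node
#   server.105=zk-observer.example.eqiad.wmflabs:2888:3888:observer   <- ":observer" suffix in every node's config
grep -E '^(peerType|server\.)' /etc/zookeeper/conf/zoo.cfg
```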
As we did in a similar use case before (https://etherpad.wikimedia.org/p/Jessie-Zookeeper), the procedure should be to:
- select one of conf100[456], add it with a new id in hiera, replacing one of the running nodes. Merge and run puppet on all nodes (to verify: this should not restart zookeeper).
- start zookeeper on the selected conf100[456] node; it will join the cluster and become a follower.
- stop zookeeper (and mask the unit) on the replaced node among conf100[123] (a follower).
- use the kafka CLI in all the clusters against the new node and make sure it works.
- restart zookeeper on the other two nodes.
- check the status of the cluster (a sketch of the checks is after this list).
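A minimal sketch of the per-node checks, assuming the standard ZooKeeper four-letter commands and the Debian zookeeper systemd unit (hostnames are illustrative):

```
# Against the newly added conf100[456] node: it should answer and report itself as a follower.
echo ruok | nc conf1004.eqiad.wmnet 2181               # expect: imok
echo stat | nc conf1004.eqiad.wmnet 2181 | grep Mode   # expect: Mode: follower

# On the conf100[123] node being replaced: stop zookeeper and mask the unit
# so that puppet or a reboot does not bring it back.
sudo systemctl stop zookeeper
sudo systemctl mask zookeeper
```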
At this point, a rolling restart of the kafka brokers + hadoop daemons would be needed to pick up and test the new config, but in theory, as long as one zookeeper node is reachable, it might be fine to just keep going.
Another idea to reduce the number of restarts would have been to add a DNS CNAME for each zookeeper host, so that, for example, the zookeeper.connect setting in each kafka cluster config would not need to change when the zookeeper cluster does. This does not seem feasible due to how the zookeeper client handles DNS resolution, namely only once.
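To make the coupling concrete, this is roughly the broker-side setting that ties Kafka to the ensemble; the config path, hostnames and chroot below are illustrative assumptions, not the actual production values:

```
# Each Kafka broker lists the ZooKeeper ensemble explicitly, so swapping a
# zookeeper host means editing this line and rolling-restarting the broker.
grep '^zookeeper.connect=' /etc/kafka/server.properties
# zookeeper.connect=conf1004.eqiad.wmnet:2181,conf1005.eqiad.wmnet:2181,conf1006.eqiad.wmnet:2181/kafka
```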
Change 425238 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::configcluster_stretch: add zookeeper profiles for conf1004
Change 425239 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] zookeeper: swap conf1001 with conf1004
Change 425474 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] network::constants: add conf100[456] to zookeeper_main_hosts
Change 425474 merged by Elukey:
[operations/puppet@production] network::constants: add conf100[456] to zookeeper_main_hosts
Re-tested the swap of one node today in labs (analytics project), to verify again the logs and anything that might break. Some details about the procedure:
Starting cluster:
- zk1-1.eqiad.wmflabs - id 101 - OS Jessie - zk version 3.4.5 (same as on conf100[123]) - zookeeper role: follower
- zk1-2.eqiad.wmflabs - id 102 - OS Stretch - zk version 3.4.9 (same as on conf100[456]) - zookeeper role: leader
- zk1-4.eqiad.wmflabs - id 104 - OS Stretch - zk version 3.4.9 - zookeeper role: follower
Goal: swap zk1-4 with zk1-3.
Did the following:
1. Swapped zk1-4 with zk1-3 in hiera, and ran puppet on zk1-3.
2. Ensured that zookeeper on zk1-3 was running and reporting itself as 'follower', while the rest was unchanged.
3. Stopped and masked zookeeper on zk1-4, and re-checked everything. All good.
4. Restarted zookeeper on zk1-1 and checked that everything was good. A weird state was observed.
So the status at point 4) was what I wanted to test, since this is the turning point: the quorum for the cluster goes away when zk1-1 begins the restart, but then the cluster does not recover even when zk1-1 comes back up.
zk1-2 seems fine:
2018-04-12 09:59:20,023 - WARN [RecvWorker:101:QuorumCnxManager$RecvWorker@810] - Connection broken for id 101, my id = 102, error =
java.net.SocketException: Socket closed
    at java.net.SocketInputStream.socketRead0(Native Method)
    at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
    at java.net.SocketInputStream.read(SocketInputStream.java:171)
    at java.net.SocketInputStream.read(SocketInputStream.java:141)
    at java.net.SocketInputStream.read(SocketInputStream.java:224)
    at java.io.DataInputStream.readInt(DataInputStream.java:387)
    at org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumCnxManager.java:795)
2018-04-12 09:59:20,026 - WARN [SendWorker:101:QuorumCnxManager$SendWorker@727] - Interrupted while waiting for message on queue
java.lang.InterruptedException
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2088)
    at java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:418)
    at org.apache.zookeeper.server.quorum.QuorumCnxManager.pollSendQueue(QuorumCnxManager.java:879)
    at org.apache.zookeeper.server.quorum.QuorumCnxManager.access$500(QuorumCnxManager.java:65)
    at org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(QuorumCnxManager.java:715)
2018-04-12 09:59:20,033 - WARN [SendWorker:101:QuorumCnxManager$SendWorker@736] - Send worker leaving thread
2018-04-12 09:59:20,033 - WARN [RecvWorker:101:QuorumCnxManager$RecvWorker@813] - Interrupting SendWorker
2018-04-12 09:59:20,054 - INFO [WorkerReceiver[myid=102]:FastLeaderElection@600] - Notification: 0 (message format version), 101 (n.leader), 0x800015c31 (n.zxid), 0x1 (n.round), LOOKING (n.state), 101 (n.sid), 0x8 (n.peerEpoch) LEADING (my state)
2018-04-12 09:59:20,269 - INFO [WorkerReceiver[myid=102]:FastLeaderElection@600] - Notification: 0 (message format version), 101 (n.leader), 0x800015c31 (n.zxid), 0x1 (n.round), LOOKING (n.state), 101 (n.sid), 0x8 (n.peerEpoch) LEADING (my state)
2018-04-12 09:59:20,675 - INFO [WorkerReceiver[myid=102]:FastLeaderElection@600] - Notification: 0 (message format version), 101 (n.leader), 0x800015c31 (n.zxid), 0x1 (n.round), LOOKING (n.state), 101 (n.sid), 0x8 (n.peerEpoch) LEADING (my state)
2018-04-12 09:59:21,481 - INFO [WorkerReceiver[myid=102]:FastLeaderElection@600] - Notification: 0 (message format version), 101 (n.leader), 0x800015c31 (n.zxid), 0x1 (n.round), LOOKING (n.state), 101 (n.sid), 0x8 (n.peerEpoch) LEADING (my state)
2018-04-12 09:59:23,086 - INFO [WorkerReceiver[myid=102]:FastLeaderElection@600] - Notification: 0 (message format version), 101 (n.leader), 0x800015c31 (n.zxid), 0x1 (n.round), LOOKING (n.state), 101 (n.sid), 0x8 (n.peerEpoch) LEADING (my state)
2018-04-12 09:59:26,290 - INFO [WorkerReceiver[myid=102]:FastLeaderElection@600] - Notification: 0 (message format version), 101 (n.leader), 0x800015c31 (n.zxid), 0x1 (n.round), LOOKING (n.state), 101 (n.sid), 0x8 (n.peerEpoch) LEADING (my state)
2018-04-12 09:59:32,694 - INFO [WorkerReceiver[myid=102]:FastLeaderElection@600] - Notification: 0 (message format version), 101 (n.leader), 0x800015c31 (n.zxid), 0x1 (n.round), LOOKING (n.state), 101 (n.sid), 0x8 (n.peerEpoch) LEADING (my state)
2018-04-12 09:59:45,503 - INFO [WorkerReceiver[myid=102]:FastLeaderElection@600] - Notification: 0 (message format version), 101 (n.leader), 0x800015c31 (n.zxid), 0x1 (n.round), LOOKING (n.state), 101 (n.sid), 0x8 (n.peerEpoch) LEADING (my state)
2018-04-12 10:00:11,112 - INFO [WorkerReceiver[myid=102]:FastLeaderElection@600] - Notification: 0 (message format version), 101 (n.leader), 0x800015c31 (n.zxid), 0x1 (n.round), LOOKING (n.state), 101 (n.sid), 0x8 (n.peerEpoch) LEADING (my state)
But zk1-1's state is not fine, since it keeps repeating the following:
2018-04-12 10:07:55,347 - INFO [WorkerReceiver[myid=101]:FastLeaderElection@542] - Notification: 102 (n.leader), 0x700010d13 (n.zxid), 0x1 (n.round), LEADING (n.state), 102 (n.sid), 0x7 (n.peerEPoch), LOOKING (my state)
After a final restart of zk1-2 (the leader), all the hosts have the new config and a new leader election happens (in this case, electing zk1-3 as leader). From this point onwards, everything works as expected.
I would have expected issues when restarting the leader rather than one of the followers, but it is somewhat expected that a move like this causes a brief unavailability due to inconsistency in the ensemble. Moreover, from what I can read, it might also be that version 3.4.5 is buggy and can lead to these weird states.
From the consumers' point of view (two kafka clusters and one hadoop cluster), I have observed only some non-critical logs on one of the hadoop masters when zookeeper broke its session:
2018-04-12 10:08:24,485 INFO org.apache.zookeeper.ClientCnxn: Unable to read additional data from server sessionid 0x66629631f0860005, likely server has closed socket, closing socket connection and attempting reconnect
2018-04-12 10:08:24,587 INFO org.apache.hadoop.ha.ActiveStandbyElector: Session disconnected. Entering neutral mode...
2018-04-12 10:08:24,841 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server zk1-1.analytics.eqiad.wmflabs/10.68.18.31:2181. Will not attempt to authenticate using SASL (unknown error)
2018-04-12 10:08:24,843 INFO org.apache.zookeeper.ClientCnxn: Socket connection established, initiating session, client: /10.68.22.32:55520, server: zk1-1.analytics.eqiad.wmflabs/10.68.18.31:2181
2018-04-12 10:08:24,844 INFO org.apache.zookeeper.ClientCnxn: Unable to read additional data from server sessionid 0x66629631f0860005, likely server has closed socket, closing socket connection and attempting reconnect
2018-04-12 10:08:25,533 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server zk1-3.analytics.eqiad.wmflabs/10.68.20.68:2181. Will not attempt to authenticate using SASL (unknown error)
2018-04-12 10:08:25,534 INFO org.apache.zookeeper.ClientCnxn: Socket connection established, initiating session, client: /10.68.22.32:37352, server: zk1-3.analytics.eqiad.wmflabs/10.68.20.68:2181
2018-04-12 10:08:25,543 INFO org.apache.zookeeper.ClientCnxn: Session establishment complete on server zk1-3.analytics.eqiad.wmflabs/10.68.20.68:2181, sessionid = 0x66629631f0860005, negotiated timeout = 5000
2018-04-12 10:08:25,544 INFO org.apache.hadoop.ha.ActiveStandbyElector: Session connected.
2018-04-12 10:08:25,546 INFO org.apache.hadoop.ha.ActiveStandbyElector: Checking for any old active which needs to be fenced...
2018-04-12 10:08:25,558 INFO org.apache.hadoop.ha.ActiveStandbyElector: Old node exists: 0a15616e616c79746963732d6861646f6f702d6c61627312276861646f6f702d6d61737465722d312d616e616c79746963732d65716961642d776d666c6162731a276861646f6f702d6d61737465722d312e616e616c79746963732e65716961642e776d666c61627320d43e28d33e
2018-04-12 10:08:25,558 INFO org.apache.hadoop.ha.ActiveStandbyElector: But old node has our own data, so don't need to fence it.
2018-04-12 10:08:25,558 INFO org.apache.hadoop.ha.ActiveStandbyElector: Writing znode /hadoop-ha/analytics-hadoop-labs/ActiveBreadCrumb to indicate that the local node is the most recent active...
2018-04-12 10:08:25,561 INFO org.apache.hadoop.ha.ZKFailoverController: Trying to make NameNode at hadoop-master-1.analytics.eqiad.wmflabs/10.68.22.32:8020 active...
2018-04-12 10:08:25,565 INFO org.apache.hadoop.ha.ZKFailoverController: Successfully transitioned NameNode at hadoop-master-1.analytics.eqiad.wmflabs/10.68.22.32:8020 to active state
So, as far as I can see, even if we should expect a brief unavailability, everything resolves very quickly.
This error seems really similar to what is described in https://issues.apache.org/jira/browse/ZOOKEEPER-1732 (fixed in 3.4.6). If true, this would mean that our 3.4.5 cluster (conf100[123]) might be affected by it, and the rolling restart of the nodes might end up in more errors after the conf1001/4 swap. I am wondering if an upgrade to the Stretch version of zookeeper (3.4.9) could be a good intermediate step.
Interesting discovery today while testing zookeeper on stretch. I tried to clean up /etc/zookeeper/conf and ran puppet to check whether everything was going to be restored, and to my great surprise, the zookeeper systemd unit wasn't able to "start". After a bit of digging, the culprit seems to be the following:
elukey@conf1001:~$ ls -l /etc/zookeeper/conf
lrwxrwxrwx 1 root root 32 Jun 17  2015 /etc/zookeeper/conf -> /etc/alternatives/zookeeper-conf
elukey@conf1001:~$ ls -l /etc/alternatives/zookeeper-conf
lrwxrwxrwx 1 root root 27 Jun 17  2015 /etc/alternatives/zookeeper-conf -> /etc/zookeeper/conf_example
elukey@conf1001:~$ ls -l /etc/zookeeper/conf_example
total 20
-rw-r--r-- 1 root root  535 Mar 15  2014 configuration.xsl
-rw-r--r-- 1 root root  754 Mar 15  2014 environment
-rw-r--r-- 1 root root 1063 Jun 17  2015 log4j.properties
-rw-r--r-- 1 root root    4 Jun 17  2015 myid
-rw-r--r-- 1 root root 2217 Jun  8  2016 zoo.cfg
So we make sure that myid, zoo.cfg and log4j.properties are populated via puppet, but we don't do anything about the environment file (which is created the first time by the package). So if for some reason it is deleted and/or overwritten by the package, then we don't have a good way to restore it.
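If the file goes missing before a proper puppet fix is in place, one possible manual recovery sketch (the package name owning the conffile is an assumption here) would be to ask dpkg which package ships it and reinstall the missing conffiles:

```
# Find the package that ships the example environment file, then restore any
# deleted conffiles for it; --force-confmiss tells dpkg to reinstall missing conffiles.
dpkg -S /etc/zookeeper/conf_example/environment
sudo apt-get install --reinstall -o Dpkg::Options::="--force-confmiss" zookeeper   # package name is a guess
```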
There seems to be a bug open about it:
Change 426918 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet/zookeeper@master] Ensure existence of environment conf file
Tested a migration in labs with two Stretch hosts running zk 3.4.9 and one Jessie host running zk 3.4.9 (Moritz's backport), and the host swap happened without any issue (no host stuck in the LOOKING state).
Change 427078 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::zookeeper::server: introduce the zookeeper349 component
Change 427078 merged by Elukey:
[operations/puppet@production] profile::zookeeper::server: introduce the zookeeper349 component
Change 427090 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::configcluster: upgrade zookeeper to 3.4.9-3~jessie in codfw
Mentioned in SAL (#wikimedia-operations) [2018-04-17T13:00:20Z] <elukey> upgrade zookeeper on conf200[123] to 3.4.9~jessie - T182924
Change 427090 merged by Elukey:
[operations/puppet@production] role::configcluster: upgrade zookeeper to 3.4.9-3~jessie in codfw
Change 427343 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::configcluster: upgrade zookeeper main-eqiad to 3.4.9
Mentioned in SAL (#wikimedia-operations) [2018-04-19T09:33:32Z] <elukey> upgrade zookeper on conf100[123] from 3.4.5 to 3.4.9 - T182924
Change 427343 merged by Elukey:
[operations/puppet@production] role::configcluster: upgrade zookeeper main-eqiad to 3.4.9
Highlights of the plan to swap conf1001 with conf1004:
- merge https://gerrit.wikimedia.org/r/#/c/432615/ and run puppet on the kafka[12]00[123] hosts (not related, but since we are doing rolling restarts anyway..).
- disable puppet on all the conf1* nodes and the kafka-jumbo nodes.
- merge https://gerrit.wikimedia.org/r/#/c/425238 and run puppet on einsteinium and conf1004. Test zookeeper on conf1004: it should report the status "follower", and zkCli.sh should be able to connect and see the znodes. Also check the logs.
- run puppet on conf1001, stop zookeeper and mask the zookeeper unit.
- run puppet on conf1003, verify that zookeeper starts up correctly, and then do the same on conf1002 (the current leader).
After the above, the zookeeper cluster should be healthy and not show any errors.
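A minimal sketch of what the verification could look like; the hostnames and the zkCli.sh path are assumptions based on the Debian package layout:

```
# Ensemble roles: expect one "Mode: leader" and two "Mode: follower" among conf1002/1003/1004.
for h in conf1002 conf1003 conf1004; do
  echo -n "$h: "; echo stat | nc "$h.eqiad.wmnet" 2181 | grep Mode
done

# Spot-check the znodes through the new node with the ZooKeeper CLI.
/usr/share/zookeeper/bin/zkCli.sh -server conf1004.eqiad.wmnet:2181 ls /
```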
Then, a rolling restart of the following (an illustrative restart loop is sketched after the list):
- Kafka main eqiad and codfw
- Kafka Analytics
- Kafka Jumbo
- Burrow on kafkamon1001 should restart by itself
- Hadoop masters and journal nodes
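Purely as an illustration of the rolling restarts above (not the actual procedure; host names, the systemd unit name and the fixed wait are assumptions), a broker-by-broker loop could look like:

```
# Restart one Kafka broker at a time; a real run would wait for under-replicated
# partitions to drain back to zero before moving on, not just sleep.
for h in kafka1001 kafka1002 kafka1003; do
  ssh "$h.eqiad.wmnet" 'sudo systemctl restart kafka'
  sleep 300
done
```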
Change 425238 merged by Elukey:
[operations/puppet@production] Swap conf1001 with conf1004 in Zookeeper main-eqiad
Change 433157 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::prometheus::ops: add new zookeeper hosts' monitoring
Change 433157 merged by Elukey:
[operations/puppet@production] role::prometheus::ops: add new zookeeper hosts' monitoring
Change 433183 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::configcluster: decom zookeeper on conf1001
Added conf100[4-6]'s IPs to the analytics-in4 firewall rules; forgot to do it yesterday.
Change 433183 merged by Elukey:
[operations/puppet@production] role::configcluster: decom zookeeper on conf1001
Change 433322 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Swap zookeeper on conf1002 with conf1005
Change 433322 merged by Elukey:
[operations/puppet@production] Swap zookeeper on conf1002 with conf1005
Change 435810 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] network::constants: update kafka[12]00[1-3] ipv6 addresses
Change 435810 merged by Elukey:
[operations/puppet@production] network::constants: update kafka[12]00[1-3] ipv6 addresses
Change 435821 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] network::constants: update kafka-jumbo ipv6 addresses
Change 435821 merged by Elukey:
[operations/puppet@production] network::constants: update kafka-jumbo ipv6 addresses
Change 435963 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Swap zookeeper from conf1003 to conf1006
Change 435963 merged by Elukey:
[operations/puppet@production] Swap zookeeper from conf1003 to conf1006
Change 436064 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Add notes about conf100[1-3] -> conf100[4-6] migration status
Change 436064 merged by Elukey:
[operations/puppet@production] Add notes about conf100[1-3] -> conf100[4-6] migration status
Change 436222 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::configcluster_stretch: page if a zookeeper node goes down
Change 436222 merged by Elukey:
[operations/puppet@production] role::configcluster_stretch: page if a zookeeper node goes down
Change 436226 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] network::constants: remove decommed zookeeper nodes
Change 436226 merged by Elukey:
[operations/puppet@production] network::constants: remove decommed zookeeper nodes