Decouple analytics zookeeper cluster from kafka zookeeper cluster [2019-2020]
Closed, Resolved · Public · 13 Estimated Story Points

Assigned To

Authored By

	• Nuria
	Feb 25 2019, 4:57 PM

Description

At present, analytics tools such as Druid and Hadoop (tier-2) share a single Zookeeper cluster with Kafka (tier-1). This becomes a problem when Analytics needs, for example, to test changes or upgrades that have Zookeeper ramifications, as those could affect our tier-1 services. Let's split the clusters so there is a clear boundary on availability and ops support for each.

Details

Subject	Repo	Branch	Lines +/-
Remove zookeeper terms from the Analytics filters	operations/homer/public	master	+0 -36
ferm: remove hadoop_masters from puppet config	operations/puppet	production	+2 -16
Move the Analytics Hadoop cluster to the new Analytics ZK cluster	operations/puppet	production	+1 -1
role::analytics_cluster::zookeeper: fix prometheus monitors	operations/puppet	production	+1 -0
role::analytics_cluster::zookeeper: enable monitoring	operations/puppet	production	+17 -27
role::prometheus::analytics: add Analytics Zookeeper cluster's metrics	operations/puppet	production	+13 -0
profile::zookeeper::server: enable prometheus metrics by default	operations/puppet	production	+2 -6
Move the Hadoop test cluster to the Analytics Zookeeper cluster	operations/puppet	production	+1 -1
profile::zookeeper::server: use openjkd-8 on Buster	operations/puppet	production	+29 -1

Related Objects

		Status	Subtype	Assigned	Task
		Resolved		elukey	T217057 Decouple analytics zookeeper cluster from kafka zookeeper cluster [2019-2020]
		Resolved		• Cmjohnson	T227025 (Need By: August 31) rack/setup/install (3) new zookeeper nodes

Event Timeline

• Nuria created this task. · Feb 25 2019, 4:57 PM

Restricted Application added a subscriber: Aklapper. · Feb 25 2019, 4:57 PM

Milimetric renamed this task from decouple analytics zookeeper cluster from kafka zookeeper cluster to decouple analytics zookeeper cluster from kafka zookeeper cluster [2019-2020]. · Feb 28 2019, 5:46 PM

Milimetric triaged this task as Medium priority.

Milimetric moved this task from Incoming to Operational Excellence on the Analytics board.

• Nuria assigned this task to elukey. · May 16 2019, 6:49 PM

New hosts should be racked this month: https://phabricator.wikimedia.org/T220687

• Nuria updated the task description. · May 28 2019, 9:40 AM

elukey moved this task from Backlog to Waiting for others on the User-Elukey board. · Jul 5 2019, 9:47 AM

elukey renamed this task from decouple analytics zookeeper cluster from kafka zookeeper cluster [2019-2020] to Decouple analytics zookeeper cluster from kafka zookeeper cluster [2019-2020]. · Aug 22 2019, 2:58 PM

elukey moved this task from Waiting for others to In Progress on the User-Elukey board.

elukey added a subtask: T227025: (Need By: August 31) rack/setup/install (3) new zookeeper nodes.

The zookeeper analytics-eqiad cluster has been created in T227025. The remaining steps are:

Test the new cluster properly (it runs Java 11 on Debian Buster).
Test the switch to the new cluster for the Hadoop testing cluster (this will likely require stopping both the Yarn RM and Hadoop HDFS daemons and starting them again with the new config).
Do the same in production.
Clean up the znodes on the current main-eqiad zookeeper cluster (the ones related to the Hadoop clusters).
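
The switch described in the steps above amounts to pointing the Hadoop daemons at a different Zookeeper ensemble. A minimal sketch, assuming the standard Hadoop properties `ha.zookeeper.quorum` (HDFS automatic failover) and `yarn.resourcemanager.zk-address` (Yarn RM state store) are the ones involved. The hostnames and helper names are hypothetical, and the real change was made through puppet rather than a script like this:

```python
from typing import Dict, List


def zk_connect_string(hosts: List[str], port: int = 2181) -> str:
    """Join hosts into the comma-separated host:port list Zookeeper clients expect."""
    return ",".join(f"{host}:{port}" for host in hosts)


def hadoop_zk_overrides(hosts: List[str]) -> Dict[str, str]:
    """Return the Hadoop properties that would need to point at the new ensemble."""
    quorum = zk_connect_string(hosts)
    return {
        "ha.zookeeper.quorum": quorum,
        "yarn.resourcemanager.zk-address": quorum,
    }


# Hypothetical hostnames, for illustration only.
NEW_ENSEMBLE = ["zk-a.example.org", "zk-b.example.org", "zk-c.example.org"]

if __name__ == "__main__":
    for key, value in hadoop_zk_overrides(NEW_ENSEMBLE).items():
        print(f"{key} = {value}")
```

Both properties are read at daemon startup, which is consistent with the note that the Yarn RM and HDFS daemons need a stop and start to pick up the new config.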

elukey added a project: Analytics-Kanban. · Aug 22 2019, 3:01 PM

elukey moved this task from Next Up to In Progress on the Analytics-Kanban board.

The testing is currently blocked by two things:

In T231067 we are wondering whether Java 8 should be used instead (running Java 11 on the ZK servers and Java 8 on the clients might be problematic, due to the nature of the zookeeper java clients).
In T227025 there seems to be a serial console redirection problem with the hosts; I need to follow up with Chris/Rob to figure out how to fix it.

elukey mentioned this in T231067: Install Debian Buster on Hadoop. · Sep 16 2019, 9:56 AM

Change 539069 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::zookeeper::server: use openjkd-8 on Buster

https://gerrit.wikimedia.org/r/539069

gerritbot added a project: Patch-For-Review. · Sep 25 2019, 9:15 AM

Change 539069 merged by Elukey:
[operations/puppet@production] profile::zookeeper::server: use openjkd-8 on Buster

https://gerrit.wikimedia.org/r/539069

Maintenance_bot removed a project: Patch-For-Review. · Sep 25 2019, 1:11 PM

Change 539120 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Move the Hadoop test cluster to the Analytics Zookeeper cluster

https://gerrit.wikimedia.org/r/539120

gerritbot added a project: Patch-For-Review. · Sep 25 2019, 1:21 PM

Change 539120 merged by Elukey:
[operations/puppet@production] Move the Hadoop test cluster to the Analytics Zookeeper cluster

https://gerrit.wikimedia.org/r/539120

Change 539122 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::analytics_cluster::zookeeper: enable prometheus metrics by default

https://gerrit.wikimedia.org/r/539122

Change 539122 merged by Elukey:
[operations/puppet@production] profile::zookeeper::server: enable prometheus metrics by default

https://gerrit.wikimedia.org/r/539122

Maintenance_bot removed a project: Patch-For-Review. · Sep 25 2019, 2:10 PM

Change 539129 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::prometheus::analytics: add Analytics Zookeeper cluster's metrics

https://gerrit.wikimedia.org/r/539129

gerritbot added a project: Patch-For-Review. · Sep 25 2019, 2:11 PM

Change 539129 merged by Elukey:
[operations/puppet@production] role::prometheus::analytics: add Analytics Zookeeper cluster's metrics

https://gerrit.wikimedia.org/r/539129

Maintenance_bot removed a project: Patch-For-Review. · Sep 25 2019, 3:11 PM

I tried to deploy openjdk-8 on one node and ran into an error similar to https://github.com/plasma-umass/doppio/issues/497 (logged by zookeeper). This is probably due to jars compiled for Java 11 that cannot run on Java 8. I have restored Java 11 on the node and migrated the Hadoop test cluster to the new Zookeeper cluster; everything seems to be running fine.
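
One way to check the "compiled for Java 11" hypothesis is to read the class-file version from the header of a `.class` file extracted from the jar. A minimal sketch: the header layout (magic `0xCAFEBABE`, then minor and major version as big-endian 16-bit integers) and the version numbers (52 for Java 8, 55 for Java 11) are standard, while the function names and the synthetic headers are made up for illustration:

```python
import struct

# Class-file major versions: 52 is Java 8, 55 is Java 11 (major = release + 44).
JAVA_8_MAJOR = 52


def class_major_version(data: bytes) -> int:
    """Return the major version stored in the header of a .class file."""
    if len(data) < 8:
        raise ValueError("too short to be a class file")
    magic, _minor, major = struct.unpack(">IHH", data[:8])
    if magic != 0xCAFEBABE:
        raise ValueError("not a Java class file")
    return major


def runs_on_java8(data: bytes) -> bool:
    """True if a Java 8 JVM would accept this class-file version."""
    return class_major_version(data) <= JAVA_8_MAJOR


# Synthetic 8-byte headers, for illustration only.
java11_header = struct.pack(">IHH", 0xCAFEBABE, 0, 55)
java8_header = struct.pack(">IHH", 0xCAFEBABE, 0, 52)
```

A class whose major version is above 52 is rejected by a Java 8 JVM at load time, so a check like this on the packaged jars would confirm or rule out the guess.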

Side note: the current partman recipe does not create a big root partition; instead it reserves a big unused LVM volume. I created a 100G zookeeper volume on all nodes and manually mounted it at /var/lib/zookeeper.

Mentioned in SAL (#wikimedia-operations) [2019-09-26T08:07:13Z] <elukey> executed 'rmr /yarn-rmstore/analytics-test-hadoop/ZKRMStateRoot' on conf1004's zkCli.sh to clean up znodes - T217057
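
For reference, `rmr` removes a znode together with everything below it. A minimal sketch of the same children-first traversal, written against any client that exposes kazoo-style `get_children(path)` and `delete(path)` methods. The `ZkLike` protocol and the in-memory `FakeZk` class are hypothetical stand-ins so the example runs without a Zookeeper server:

```python
from typing import List, Protocol


class ZkLike(Protocol):
    def get_children(self, path: str) -> List[str]: ...
    def delete(self, path: str) -> None: ...


def rmr(client: ZkLike, path: str) -> int:
    """Delete path and everything below it, children first. Returns znodes removed."""
    removed = 0
    for child in client.get_children(path):
        removed += rmr(client, f"{path.rstrip('/')}/{child}")
    client.delete(path)
    return removed + 1


class FakeZk:
    """In-memory stand-in so the sketch runs without a Zookeeper server."""

    def __init__(self, paths: List[str]) -> None:
        self.paths = set(paths)

    def get_children(self, path: str) -> List[str]:
        prefix = path.rstrip("/") + "/"
        return sorted(
            p[len(prefix):] for p in self.paths
            if p.startswith(prefix) and "/" not in p[len(prefix):]
        )

    def delete(self, path: str) -> None:
        # Like Zookeeper, refuse to delete a znode that still has children.
        if self.get_children(path):
            raise RuntimeError(f"{path} is not empty")
        self.paths.remove(path)
```

Children have to go first because Zookeeper refuses to delete a znode that still has children.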

Next steps:

Fix the serial console issue with DCops - T227025
Test whether the Hadoop Test cluster works well with ZK on Java 11, and plan the upgrade for the Analytics Cluster
Clean up the Hadoop znodes from the Zookeeper main-eqiad cluster

elukey moved this task from In Progress to Paused on the Analytics-Kanban board. · Oct 4 2019, 1:42 PM

elukey moved this task from In Progress to Waiting for others on the User-Elukey board. · Oct 10 2019, 6:26 AM

elukey changed the task status from Stalled to Open. · Oct 14 2019, 6:28 AM

elukey closed subtask T227025: (Need By: August 31) rack/setup/install (3) new zookeeper nodes as Resolved.

Hosts are ready, and we have been testing the Hadoop Test cluster with the new Zk cluster for a while without any big issues. Next step is deploying to prod! https://etherpad.wikimedia.org/p/analytics-zk-migration

elukey moved this task from Paused to Ready to Deploy on the Analytics-Kanban board. · Oct 14 2019, 6:38 AM

Change 542789 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::analytics_cluster::zookeeper: enable monitoring

https://gerrit.wikimedia.org/r/542789

gerritbot added a project: Patch-For-Review. · Oct 14 2019, 6:45 AM

Change 542789 merged by Elukey:
[operations/puppet@production] role::analytics_cluster::zookeeper: enable monitoring

https://gerrit.wikimedia.org/r/542789

Change 542866 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Move the Analytics Hadoop cluster to the new Analytics ZK cluster

https://gerrit.wikimedia.org/r/542866

Change 542867 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] ferm: remove hadoop_masters from puppet config

https://gerrit.wikimedia.org/r/542867

elukey moved this task from Waiting for others to In Progress on the User-Elukey board. · Oct 14 2019, 1:25 PM

Change 543027 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::analytics_cluster::zookeeper: fix prometheus monitors

https://gerrit.wikimedia.org/r/543027

Change 543027 merged by Elukey:
[operations/puppet@production] role::analytics_cluster::zookeeper: fix prometheus monitors

https://gerrit.wikimedia.org/r/543027

Change 542866 merged by Elukey:
[operations/puppet@production] Move the Analytics Hadoop cluster to the new Analytics ZK cluster

https://gerrit.wikimedia.org/r/542866

Change 542867 merged by Elukey:
[operations/puppet@production] ferm: remove hadoop_masters from puppet config

https://gerrit.wikimedia.org/r/542867

elukey set the point value for this task to 13. · Oct 15 2019, 1:45 PM

The last step is to remove the old Hadoop znodes (~30k, a lot) from the zookeeper main-eqiad cluster to complete the migration.

Mentioned in SAL (#wikimedia-operations) [2019-10-15T14:42:58Z] <elukey> start a root tmux containing a bash script on conf1004 to clean up znodes under /yarn-rmstore/analytics-hadoop/ZKRMStateRoot/RMAppRoot slowly - T217057
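
The bash script itself is not shown in this task. As an illustration of the "slowly" part only, here is a minimal Python sketch that deletes the children of a znode in small batches with a pause in between, so that tens of thousands of deletes do not hit the ensemble at once. It assumes a client with kazoo-style `get_children(path)` and `delete(path, recursive=True)` methods; the function name, batch size, and pause are hypothetical:

```python
import time
from typing import Callable, List


def delete_children_slowly(
    client,
    root: str,
    batch_size: int = 100,
    pause_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Delete every child of root in small batches, pausing between batches.

    client is anything with kazoo-style get_children(path) and
    delete(path, recursive=True) methods. Returns the number of children removed.
    """
    children: List[str] = client.get_children(root)
    removed = 0
    for start in range(0, len(children), batch_size):
        for child in children[start:start + batch_size]:
            client.delete(f"{root.rstrip('/')}/{child}", recursive=True)
            removed += 1
        if start + batch_size < len(children):
            sleep(pause_seconds)  # give the ensemble time to serve other clients
    return removed
```

The `sleep` parameter is injectable so the pacing can be tested without waiting. In a real run the batch size and pause would be tuned against the load on the main-eqiad cluster, which also serves tier-1 clients.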

elukey moved this task from Ready to Deploy to Done on the Analytics-Kanban board. · Oct 15 2019, 3:21 PM

Change 543183 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/homer/public@master] Remove zookeeper terms from the Analytics filters

https://gerrit.wikimedia.org/r/543183

elukey moved this task from In Progress to Done on the User-Elukey board. · Oct 16 2019, 8:13 AM

Change 543183 merged by Ayounsi:
[operations/homer/public@master] Remove zookeeper terms from the Analytics filters

https://gerrit.wikimedia.org/r/543183

• Nuria closed this task as Resolved. · Oct 24 2019, 7:06 PM

elukey mentioned this in T220387: Transition Kafka main ownership from Analytics Engineering to SRE - (2018-2019 Q4 SRE Goal Tracking Task). · Nov 5 2019, 4:12 PM

elukey mentioned this in T244211: Analytics Hardware for Fiscal Year 2019/2020. · Feb 4 2020, 10:07 AM
