
Decouple analytics zookeeper cluster from kafka zookeeper cluster [2019-2020]
Closed, Resolved. Public. 13 Estimated Story Points

Description

At this time, analytics tools such as Druid and Hadoop (tier-2) share the same zookeeper cluster as Kafka (tier-1). This becomes a problem when analytics needs, for example, to test changes or updates that have zookeeper ramifications, as those could affect our tier-1 services. Let's split the clusters so there is a clear boundary on availability and ops support for each.

Event Timeline

Milimetric renamed this task from decouple analytics zookeeper cluster from kafka zookeeper cluster to decouple analytics zookeeper cluster from kafka zookeeper cluster [2019-2020]. Feb 28 2019, 5:46 PM
Milimetric triaged this task as Medium priority.
Milimetric moved this task from Incoming to Operational Excellence on the Analytics board.
elukey renamed this task from decouple analytics zookeeper cluster from kafka zookeeper cluster [2019-2020] to Decouple analytics zookeeper cluster from kafka zookeeper cluster [2019-2020]. Aug 22 2019, 2:58 PM
elukey moved this task from Waiting for others to In Progress on the User-Elukey board.

The zookeeper analytics-eqiad cluster has been created in T227025. The remaining steps are:

  1. test the new cluster properly (it runs java11 and buster)
  2. test the switch to the new cluster for the Hadoop testing cluster (will likely require both Yarn RM and Hadoop HDFS daemons to stop and start with the new config)
  3. do the same in production
  4. clean up znodes on the current main-eqiad zookeeper cluster (the ones related to the Hadoop clusters).
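One way to approach step 1 is to poll each server with ZooKeeper's four-letter commands and check that the ensemble has exactly one leader. The sketch below is only an illustration: the task does not describe how the cluster was actually tested, the helper names are hypothetical, and depending on the ZooKeeper version the `mntr` command may need to be enabled through `4lw.commands.whitelist`.

```python
import socket


def four_letter_word(host: str, port: int, command: str, timeout: float = 5.0) -> str:
    """Send a ZooKeeper four-letter command (e.g. 'ruok', 'mntr') and return the reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(command.encode("ascii"))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # the server closes the connection after replying
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def parse_mntr(reply: str) -> dict:
    """Turn the tab-separated 'mntr' reply into a {key: value} dict."""
    stats = {}
    for line in reply.splitlines():
        key, sep, value = line.partition("\t")
        if sep:  # skip lines that are not key<TAB>value
            stats[key.strip()] = value.strip()
    return stats


def ensemble_looks_healthy(stats_by_host: dict) -> bool:
    """Expect exactly one leader and every other server to be a follower."""
    states = [stats.get("zk_server_state") for stats in stats_by_host.values()]
    return states.count("leader") == 1 and all(
        state in ("leader", "follower") for state in states
    )
```

In use, one would call `parse_mntr(four_letter_word(host, 2181, "mntr"))` for each server of the new cluster (2181 being ZooKeeper's default client port) and pass the resulting mapping to `ensemble_looks_healthy`.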
elukey changed the task status from Open to Stalled. Edited Sep 11 2019, 9:46 AM

The testing is currently blocked by two things:

  • in T231067 we are wondering if Java 8 should be used instead (since running Java 11 on the ZK servers and 8 on the clients might be problematic, due to the nature of the zookeeper java clients).
  • in T227025 there seems to be a serial console redirection problem with the hosts, need to follow up with Chris/Rob to figure out how to fix it.

Change 539069 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::zookeeper::server: use openjkd-8 on Buster

https://gerrit.wikimedia.org/r/539069

Change 539069 merged by Elukey:
[operations/puppet@production] profile::zookeeper::server: use openjkd-8 on Buster

https://gerrit.wikimedia.org/r/539069

Change 539120 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Move the Hadoop test cluster to the Analytics Zookeeper cluster

https://gerrit.wikimedia.org/r/539120

Change 539120 merged by Elukey:
[operations/puppet@production] Move the Hadoop test cluster to the Analytics Zookeeper cluster

https://gerrit.wikimedia.org/r/539120

Change 539122 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::analytics_cluster::zookeeper: enable prometheus metrics by default

https://gerrit.wikimedia.org/r/539122

Change 539122 merged by Elukey:
[operations/puppet@production] profile::zookeeper::server: enable prometheus metrics by default

https://gerrit.wikimedia.org/r/539122

Change 539129 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::prometheus::analytics: add Analytics Zookeeper cluster's metrics

https://gerrit.wikimedia.org/r/539129

Change 539129 merged by Elukey:
[operations/puppet@production] role::prometheus::analytics: add Analytics Zookeeper cluster's metrics

https://gerrit.wikimedia.org/r/539129

I tried to deploy openjdk-8 on one node and hit an error similar to https://github.com/plasma-umass/doppio/issues/497 (logged by zookeeper). This is probably due to jars compiled for Java 11 that cannot run on Java 8. I have restored Java 11 on the node and migrated the Hadoop test cluster to the new Zookeeper cluster; everything seems to be running fine.
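If the cause is indeed a class-file version mismatch (the comment above only says "probably"), it can be confirmed by reading the header of a `.class` file extracted from the jar: the first eight bytes hold the magic number `0xCAFEBABE`, the minor version and the major version, where major 52 corresponds to Java 8 and 55 to Java 11. A minimal sketch, with hypothetical helper names:

```python
import struct


def class_file_major_version(data: bytes) -> int:
    """Return the major version stored in the header of a compiled .class file."""
    if len(data) < 8:
        raise ValueError("too short to be a class file")
    # Header layout (big-endian): u4 magic, u2 minor_version, u2 major_version.
    magic, _minor, major = struct.unpack(">IHH", data[:8])
    if magic != 0xCAFEBABE:
        raise ValueError("not a class file (bad magic number)")
    return major


def minimum_java_release(major: int) -> int:
    """Map a class-file major version to the Java release that introduced it."""
    return major - 44  # 52 -> Java 8, 55 -> Java 11


def runs_on(data: bytes, runtime_release: int) -> bool:
    """True if a JVM of the given release can load this class file."""
    return minimum_java_release(class_file_major_version(data)) <= runtime_release
```

A class compiled for Java 11 makes `runs_on(data, 8)` return `False`, which matches the failure mode suspected here.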

Side note: the current partman recipe does not create a big root partition; instead it reserves a big unused lvm volume. I created a 100G zookeeper volume on all nodes and manually mounted it at /var/lib/zookeeper.

Mentioned in SAL (#wikimedia-operations) [2019-09-26T08:07:13Z] <elukey> executed 'rmr /yarn-rmstore/analytics-test-hadoop/ZKRMStateRoot' on conf1004's zkCli.sh to clean up znodes - T217057

Next steps:

  • Fix with DCops the serial console issue - T227025
  • Test if the Hadoop Test cluster is working well with ZK on Java 11, and plan the upgrade for the Analytics Cluster
  • Clean up the Zookeeper main-eqiad cluster from Hadoop Znodes

Hosts are ready, and we have been testing the Hadoop Test cluster with the new Zk cluster for a while without any big issues. Next step is deploying to prod! https://etherpad.wikimedia.org/p/analytics-zk-migration

Change 542789 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::analytics_cluster::zookeeper: enable monitoring

https://gerrit.wikimedia.org/r/542789

Change 542789 merged by Elukey:
[operations/puppet@production] role::analytics_cluster::zookeeper: enable monitoring

https://gerrit.wikimedia.org/r/542789

Change 542866 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Move the Analytics Hadoop cluster to the new Analytics ZK cluster

https://gerrit.wikimedia.org/r/542866

Change 542867 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] ferm: remove hadoop_masters from puppet config

https://gerrit.wikimedia.org/r/542867

Change 543027 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::analytics_cluster::zookeeper: fix prometheus monitors

https://gerrit.wikimedia.org/r/543027

Change 543027 merged by Elukey:
[operations/puppet@production] role::analytics_cluster::zookeeper: fix prometheus monitors

https://gerrit.wikimedia.org/r/543027

Change 542866 merged by Elukey:
[operations/puppet@production] Move the Analytics Hadoop cluster to the new Analytics ZK cluster

https://gerrit.wikimedia.org/r/542866

Change 542867 merged by Elukey:
[operations/puppet@production] ferm: remove hadoop_masters from puppet config

https://gerrit.wikimedia.org/r/542867

elukey set the point value for this task to 13. Oct 15 2019, 1:45 PM

The last step is to clean up the old hadoop znodes (~30k, a lot) from the zookeeper main-eqiad cluster to complete the migration.

Mentioned in SAL (#wikimedia-operations) [2019-10-15T14:42:58Z] <elukey> start a root tmux containing a bash script on conf1004 to clean up znodes under /yarn-rmstore/analytics-hadoop/ZKRMStateRoot/RMAppRoot slowly - T217057
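The bash script mentioned in this log entry is not included in the task. As a rough illustration of what deleting znodes "slowly" can look like, the sketch below removes the children of a parent znode in small batches and pauses between batches, so that a cluster still serving other clients is not flooded with deletes. Everything in it is an assumption: the function names, the batch size and the pause are hypothetical, and the ZooKeeper client is passed in as two callables rather than tied to a specific library.

```python
import time
from typing import Callable, Iterable, Iterator, List


def batched(items: Iterable[str], size: int) -> Iterator[List[str]]:
    """Yield successive lists of at most `size` items."""
    batch: List[str] = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def delete_children_slowly(
    list_children: Callable[[str], Iterable[str]],
    delete: Callable[[str], object],
    parent: str,
    batch_size: int = 100,       # hypothetical value
    pause_seconds: float = 1.0,  # hypothetical value
    sleep: Callable[[float], object] = time.sleep,
) -> int:
    """Delete every child of `parent`, pausing after each batch; return the count."""
    deleted = 0
    for batch in batched(list_children(parent), batch_size):
        for name in batch:
            delete(f"{parent}/{name}")
            deleted += 1
        sleep(pause_seconds)  # throttle: give the ensemble time between batches
    return deleted
```

With the kazoo client library (an assumption, not something the task mentions), `list_children` could be `client.get_children` and `delete` could be `functools.partial(client.delete, recursive=True)`, with `parent` set to the path named in the log entry.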

Change 543183 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/homer/public@master] Remove zookeeper terms from the Analytics filters

https://gerrit.wikimedia.org/r/543183

Change 543183 merged by Ayounsi:
[operations/homer/public@master] Remove zookeeper terms from the Analytics filters

https://gerrit.wikimedia.org/r/543183