At this time, both analytics tools like druid and hadoop (tier-2) and kafka (tier-1) share the same zookeeper cluster. This becomes a problem when analytics needs, for example, to test changes or updates that have zookeeper ramifications, as those could affect our tier-1 services. Let's split the clusters so there is a clear boundary on availability and ops support for each.
Description
Details
Status | Subtype | Assigned | Task
---|---|---|---
Resolved | | elukey | T217057 Decouple analytics zookeeper cluster from kafka zookeeper cluster [2019-2020]
Resolved | | Cmjohnson | T227025 (Need By: August 31) rack/setup/install (3) new zookeeper nodes
Event Timeline
The zookeeper analytics-eqiad cluster has been created in T227025. The remaining steps are:
- test the new cluster properly (it runs java11 and buster)
- test the switch to the new cluster for the Hadoop testing cluster (will likely require both Yarn RM and Hadoop HDFS daemons to stop and start with the new config)
- do the same in production
- clean up znodes on the current main-eqiad zookeeper cluster (the ones related to the Hadoop clusters).
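As a rough illustration of the first step, a basic liveness check can be done with ZooKeeper's four-letter-word commands: sending `ruok` to the client port should return `imok` from a healthy server. This is only a minimal sketch. The host name below is a placeholder rather than one of the real nodes, 2181 is the default ZooKeeper client port, and on ZooKeeper 3.5.3 or later the command has to be allowed through `4lw.commands.whitelist`.

```python
import socket


def four_letter_word(host, port=2181, command=b"ruok", timeout=5.0):
    """Send a ZooKeeper four-letter-word command and return the raw reply.

    The server answers and then closes the connection, so we read until EOF.
    """
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(command)
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks)


if __name__ == "__main__":
    # "zk-host.example" is a hypothetical host name used for illustration.
    reply = four_letter_word("zk-host.example")
    print("healthy" if reply == b"imok" else f"unexpected reply: {reply!r}")
```

The same helper can send `srvr` to see the server's mode (leader or follower) and version, which is useful when checking that all three nodes joined the ensemble.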
The testing is currently blocked by two things:
- in T231067 we are wondering if Java 8 should be used instead (since running Java 11 on the ZK servers and 8 on the clients might be problematic, due to the nature of the zookeeper java clients).
- in T227025 there seems to be a serial console redirection problem with the hosts, need to follow up with Chris/Rob to figure out how to fix it.
Change 539069 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::zookeeper::server: use openjkd-8 on Buster
Change 539069 merged by Elukey:
[operations/puppet@production] profile::zookeeper::server: use openjkd-8 on Buster
Change 539120 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Move the Hadoop test cluster to the Analytics Zookeeper cluster
Change 539120 merged by Elukey:
[operations/puppet@production] Move the Hadoop test cluster to the Analytics Zookeeper cluster
Change 539122 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::analytics_cluster::zookeeper: enable prometheus metrics by default
Change 539122 merged by Elukey:
[operations/puppet@production] profile::zookeeper::server: enable prometheus metrics by default
Change 539129 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::prometheus::analytics: add Analytics Zookeeper cluster's metrics
Change 539129 merged by Elukey:
[operations/puppet@production] role::prometheus::analytics: add Analytics Zookeeper cluster's metrics
I tried to deploy openjdk-8 on one node and ended up with an error similar to https://github.com/plasma-umass/doppio/issues/497 (logged by zookeeper). This is probably due to jars compiled for Java 11 that cannot run on Java 8. I have restored Java 11 on the node and migrated the Hadoop test cluster to the new Zookeeper cluster; everything seems to be running fine.
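One way to check the "jars compiled for Java 11" hypothesis is to read the class-file version stored in each jar: every compiled `.class` file starts with the magic number `0xCAFEBABE` followed by a minor and a major version, where major 52 corresponds to Java 8 and 55 to Java 11. The sketch below is illustrative only. The jar path in the usage example is hypothetical, and this is not the check that was actually run on the node.

```python
import struct
import zipfile

# Class-file major versions as defined by the JVM specification.
JAVA_8_MAJOR = 52
JAVA_11_MAJOR = 55


def class_major_version(class_bytes):
    """Return the major version from the first 8 bytes of a .class file."""
    if len(class_bytes) < 8:
        raise ValueError("class file header is truncated")
    magic, _minor, major = struct.unpack(">IHH", class_bytes[:8])
    if magic != 0xCAFEBABE:
        raise ValueError("not a Java class file")
    return major


def max_major_version_in_jar(jar_path):
    """Return the highest class-file major version in a jar, or None."""
    highest = None
    with zipfile.ZipFile(jar_path) as jar:
        for name in jar.namelist():
            # Skip versioned entries of multi-release jars; a Java 8 runtime
            # ignores them, so they would give a misleading answer.
            if not name.endswith(".class") or name.startswith("META-INF/versions/"):
                continue
            major = class_major_version(jar.read(name))
            if highest is None or major > highest:
                highest = major
    return highest


if __name__ == "__main__":
    # Hypothetical path, shown only to illustrate usage.
    found = max_major_version_in_jar("/usr/share/java/zookeeper.jar")
    if found is not None and found > JAVA_8_MAJOR:
        print(f"class version {found}: needs a runtime newer than Java 8")
```

A jar whose highest major version is above 52 cannot be loaded by a Java 8 runtime, which would be consistent with the error seen here.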
Side note: the current partman recipe does not create a big root partition; instead, it reserves a big unused lvm volume. I created a 100G zookeeper volume on all nodes and manually mounted it on /var/lib/zookeeper.
Mentioned in SAL (#wikimedia-operations) [2019-09-26T08:07:13Z] <elukey> executed 'rmr /yarn-rmstore/analytics-test-hadoop/ZKRMStateRoot' on conf1004's zkCli.sh to clean up znodes - T217057
Next steps:
- Fix with DCops the serial console issue - T227025
- Test if the Hadoop Test cluster is working well with ZK on Java 11, and plan the upgrade for the Analytics Cluster
- Clean up the Hadoop znodes from the Zookeeper main-eqiad cluster
Hosts are ready, and we have been testing the Hadoop Test cluster with the new Zk cluster for a while without any big issues. Next step is deploying to prod! https://etherpad.wikimedia.org/p/analytics-zk-migration
Change 542789 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::analytics_cluster::zookeeper: enable monitoring
Change 542789 merged by Elukey:
[operations/puppet@production] role::analytics_cluster::zookeeper: enable monitoring
Change 542866 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Move the Analytics Hadoop cluster to the new Analytics ZK cluster
Change 542867 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] ferm: remove hadoop_masters from puppet config
Change 543027 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::analytics_cluster::zookeeper: fix prometheus monitors
Change 543027 merged by Elukey:
[operations/puppet@production] role::analytics_cluster::zookeeper: fix prometheus monitors
Change 542866 merged by Elukey:
[operations/puppet@production] Move the Analytics Hadoop cluster to the new Analytics ZK cluster
Change 542867 merged by Elukey:
[operations/puppet@production] ferm: remove hadoop_masters from puppet config
Last step is to clean up the old hadoop znodes (~30k, a lot) from the zookeeper main-eqiad cluster to complete the migration.
Mentioned in SAL (#wikimedia-operations) [2019-10-15T14:42:58Z] <elukey> start a root tmux containing a bash script on conf1004 to clean up znodes under /yarn-rmstore/analytics-hadoop/ZKRMStateRoot/RMAppRoot slowly - T217057
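The cleanup above was done with a bash script whose contents are not recorded here. The general idea of deleting a large number of znodes slowly can be sketched as follows. This is an assumption-laden illustration, not the script that ran on conf1004: the batch size and pause are made-up values, and `client` is any object exposing `get_children(path)` and `delete(path, recursive=True)`, which is the interface a connected kazoo `KazooClient` provides.

```python
import time


def delete_children_slowly(client, parent, batch_size=100, pause_seconds=1.0,
                           sleep=time.sleep):
    """Delete every child znode of `parent`, pausing after each batch.

    Pausing between batches keeps the write load on the ensemble low, which
    matters when the cluster also serves tier-1 clients.
    Returns the number of child znodes deleted.
    """
    parent = parent.rstrip("/")
    deleted = 0
    for child in client.get_children(parent):
        client.delete(f"{parent}/{child}", recursive=True)
        deleted += 1
        if deleted % batch_size == 0:
            sleep(pause_seconds)
    return deleted
```

With kazoo installed, this could be used by creating `KazooClient(hosts=...)`, calling `start()`, and passing the client together with the path `/yarn-rmstore/analytics-hadoop/ZKRMStateRoot/RMAppRoot`. The parent znode itself is left in place.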
Change 543183 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/homer/public@master] Remove zookeeper terms from the Analytics filters
Change 543183 merged by Ayounsi:
[operations/homer/public@master] Remove zookeeper terms from the Analytics filters