At this time, analytics tools like Druid and Hadoop (tier-2) and Kafka (tier-1) share the same ZooKeeper cluster. This becomes a problem when analytics needs, for example, to test changes or updates with ZooKeeper ramifications, since those could affect our tier-1 services. Let's split the clusters so there is a clear boundary on availability and ops support for each.
| Resolved | elukey | T217057 Decouple analytics zookeeper cluster from kafka zookeeper cluster [2019-2020] |
| Resolved | Cmjohnson | T227025 (Need By: August 31) rack/setup/install (3) new zookeeper nodes |
- Mentioned In
  - T244211: Analytics Hardware for Fiscal Year 2019/2020
  - T220387: Transition Kafka main ownership from Analytics Engineering to SRE - (2018-2019 Q4 SRE Goal Tracking Task)
  - T231067: Install Debian Buster on Hadoop
- Mentioned Here
  - T231067: Install Debian Buster on Hadoop
  - T227025: (Need By: August 31) rack/setup/install (3) new zookeeper nodes
The zookeeper analytics-eqiad cluster has been created in T227025. The remaining steps are:
- test the new cluster properly (it runs Java 11 on Debian Buster)
- test the switch to the new cluster with the Hadoop testing cluster (this will likely require stopping and restarting both the YARN ResourceManager and the HDFS daemons with the new config)
- do the same in production
- clean up znodes on the current main-eqiad zookeeper cluster (the ones related to the Hadoop clusters).
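On the Hadoop side, the switch essentially means repointing the ZooKeeper quorum settings and restarting the daemons that read them. A minimal sketch of the two properties involved — the hostnames below are placeholders, not the real analytics-eqiad nodes:

```xml
<!-- core-site.xml: ZooKeeper quorum used by HDFS NameNode HA failover -->
<property>
  <name>ha.zookeeper.quorum</name>
  <value>an-zk1001:2181,an-zk1002:2181,an-zk1003:2181</value>
</property>

<!-- yarn-site.xml: ZooKeeper ensemble used by the YARN ResourceManager -->
<property>
  <name>yarn.resourcemanager.zk-address</name>
  <value>an-zk1001:2181,an-zk1002:2181,an-zk1003:2181</value>
</property>
```

Both settings are read only at daemon startup, which is why the switch requires a stop/start rather than a live reconfiguration.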
The testing is currently blocked by two things:
- in T231067 we are wondering whether Java 8 should be used instead, since running Java 11 on the ZooKeeper servers and Java 8 on the clients might be problematic, given the nature of the ZooKeeper Java clients.
- in T227025 there seems to be a serial console redirection problem with the hosts; we need to follow up with Chris/Rob to figure out how to fix it.
I tried to deploy openjdk-8 on one node and hit an error similar to https://github.com/plasma-umass/doppio/issues/497 (logged by ZooKeeper). This is probably due to jars compiled for Java 11, which cannot run on Java 8. I restored Java 11 on the node and migrated the Hadoop test cluster to the new ZooKeeper cluster; everything seems to be running fine.
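The "jars compiled for Java 11 cannot run on Java 8" symptom can be confirmed by reading the class-file major version baked into the jar (52 = Java 8, 55 = Java 11; a Java 8 JVM rejects anything above 52 with `UnsupportedClassVersionError`). A quick sketch — the jar path and helper names here are hypothetical, not part of our tooling:

```python
import struct
import zipfile

# Class-file major version -> Java release (subset relevant here).
MAJOR_TO_JAVA = {52: "8", 53: "9", 54: "10", 55: "11"}

def class_major_version(data: bytes) -> int:
    """Major version from the first 8 bytes of a .class file."""
    magic, _minor, major = struct.unpack(">IHH", data[:8])
    if magic != 0xCAFEBABE:
        raise ValueError("not a class file")
    return major

def jar_min_java(jar_path: str) -> str:
    """Lowest Java release able to load every class in the jar."""
    with zipfile.ZipFile(jar_path) as jar:
        highest = max(
            class_major_version(jar.read(name))
            for name in jar.namelist()
            if name.endswith(".class")
        )
    return MAJOR_TO_JAVA.get(highest, f"major {highest}")

# usage (path hypothetical):
# jar_min_java("/usr/share/java/zookeeper.jar")
```

If this reports "11" for the ZooKeeper jars shipped on Buster, downgrading the runtime to Java 8 cannot work without also swapping the jars.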
Side note: the current partman recipe does not create a big root partition; instead it reserves a big unused LVM volume. I created a 100G zookeeper logical volume on all nodes and manually mounted /var/lib/zookeeper on it.
Hosts are ready, and we have been testing the Hadoop test cluster against the new ZooKeeper cluster for a while without any big issues. Next step is deploying to prod! https://etherpad.wikimedia.org/p/analytics-zk-migration