At this time, both analytics tools like Druid and Hadoop (tier-2) and Kafka (tier-1) share the same Zookeeper cluster. This becomes a problem when Analytics needs, for example, to test changes or updates that have Zookeeper ramifications, as those could affect our tier-1 services. Let's split the clusters so there is a clear boundary on availability and ops support for each.
|operations/homer/public : master||Remove zookeeper terms from the Analytics filters|
|operations/puppet : production||ferm: remove hadoop_masters from puppet config|
|operations/puppet : production||Move the Analytics Hadoop cluster to the new Analytics ZK cluster|
|operations/puppet : production||role::analytics_cluster::zookeeper: fix prometheus monitors|
|operations/puppet : production||role::analytics_cluster::zookeeper: enable monitoring|
|operations/puppet : production||role::prometheus::analytics: add Analytics Zookeeper cluster's metrics|
|operations/puppet : production||profile::zookeeper::server: enable prometheus metrics by default|
|operations/puppet : production||Move the Hadoop test cluster to the Analytics Zookeeper cluster|
|operations/puppet : production||profile::zookeeper::server: use openjkd-8 on Buster|
|Resolved||elukey||T217057 Decouple analytics zookeeper cluster from kafka zookeeper cluster [2019-2020]|
|Resolved||Cmjohnson||T227025 (Need By: August 31) rack/setup/install (3) new zookeeper nodes|
The zookeeper analytics-eqiad cluster has been created in T227025. The remaining steps are:
- test the new cluster properly (it runs java11 and buster)
- test the switch to the new cluster for the Hadoop testing cluster (will likely require both Yarn RM and Hadoop HDFS daemons to stop and start with the new config)
- do the same in production
- clean up znodes on the current main-eqiad zookeeper cluster (the ones related to the Hadoop clusters).
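As a rough aid for the "test the new cluster" step, here is a minimal sketch that sends a ZooKeeper four-letter-word command (such as `srvr` or `ruok`) to a server and returns the raw reply. The function names and the default port 2181 are assumptions for illustration. On ZooKeeper 3.5+ the command also has to be allowed via `4lw.commands.whitelist`, otherwise the server will not answer it.

```python
import socket


def four_letter_word(host: str, port: int = 2181, cmd: str = "srvr",
                     timeout: float = 5.0) -> str:
    """Send a ZooKeeper four-letter-word command and return the reply.

    The server writes its answer and then closes the connection, so we
    simply read until EOF.
    """
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(cmd.encode("ascii"))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def cluster_status(hosts, port: int = 2181, cmd: str = "srvr") -> dict:
    """Query every host and collect either the reply or the error text."""
    status = {}
    for host in hosts:
        try:
            status[host] = four_letter_word(host, port, cmd)
        except OSError as exc:  # connection refused, timeout, DNS failure...
            status[host] = f"ERROR: {exc}"
    return status
```

Calling `cluster_status([...])` with the three new hosts (names omitted here on purpose) and looking at the `Mode:` line of each `srvr` reply should show one leader and two followers if the ensemble is healthy.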
The testing is currently blocked by two things:
- in T231067 we are wondering if Java 8 should be used instead (since running Java 11 on the ZK servers and 8 on the clients might be problematic, due to the nature of the zookeeper java clients).
- in T227025 there seems to be a serial console redirection problem with the hosts, need to follow up with Chris/Rob to figure out how to fix it.
I tried to deploy openjdk-8 on one node and ran into an error similar to https://github.com/plasma-umass/doppio/issues/497 (logged by zookeeper). This is probably due to jars compiled for Java 11 that cannot run on Java 8. I have restored Java 11 on the node and migrated the Hadoop test cluster to the new Zookeeper cluster; everything seems to be running fine.
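If the failure is indeed a class-file version mismatch, as guessed above, it can be confirmed by reading the version header of the classes inside a jar. A minimal sketch: every `.class` file starts with the magic `0xCAFEBABE` followed by a minor and a major version, where major 52 corresponds to Java 8 and 55 to Java 11. The function names are made up for illustration, and multi-release jars (classes under `META-INF/versions/`) are not treated specially.

```python
import struct
import zipfile

JAVA_8_MAJOR = 52   # class-file major version produced for Java 8
JAVA_11_MAJOR = 55  # class-file major version produced for Java 11


def class_major_versions(jar_path) -> dict:
    """Map each .class entry in the jar to its class-file major version."""
    versions = {}
    with zipfile.ZipFile(jar_path) as jar:
        for name in jar.namelist():
            if not name.endswith(".class"):
                continue
            with jar.open(name) as f:
                header = f.read(8)
            if len(header) < 8:
                continue
            # Header layout: u4 magic, u2 minor_version, u2 major_version
            magic, _minor, major = struct.unpack(">IHH", header)
            if magic != 0xCAFEBABE:
                continue
            versions[name] = major
    return versions


def too_new_for(jar_path, max_major: int = JAVA_8_MAJOR) -> dict:
    """Return only the classes that need a newer JVM than max_major."""
    return {name: major
            for name, major in class_major_versions(jar_path).items()
            if major > max_major}
```

Running `too_new_for()` against the zookeeper jars installed on the node would list any classes that a Java 8 runtime refuses to load; an empty result would point to a different cause.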
Side note: the current partman recipe does not create a big root partition; instead it reserves a big unused LVM volume. I created a 100G zookeeper volume on all nodes and manually mounted it on /var/lib/zookeeper.
Hosts are ready, and we have been testing the Hadoop Test cluster with the new Zk cluster for a while without any big issues. Next step is deploying to prod! https://etherpad.wikimedia.org/p/analytics-zk-migration