
Review znodes on Zookeeper cluster to possibly remove not-used data
Closed, Resolved · Public · 3 Estimated Story Points

Description

The two Zookeeper clusters, main-eqiad (conf100[4-6]) and main-codfw (conf200[1-3]), are holding data for the following clusters:

  • Kafka main eqiad/codfw
  • Kafka Analytics eqiad
  • Kafka Jumbo eqiad
  • Kafka Logging
  • Kafka Burrow codfw/eqiad
  • Hadoop Test Analytics
  • Hadoop Analytics

The current znode allocation is the following:

  • main-eqiad -> ~50k
  • main-codfw -> ~3.7k
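
For reference, counts like these can be read from Zookeeper's mntr four-letter command (a sketch, assuming mntr is enabled on the conf hosts and nc is available on the box); zk_znode_count is the total number of znodes held by the server:

$ echo mntr | nc localhost 2181 | grep zk_znode_count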

In order to preserve mental sanity when dealing with maintenance or (hopefully rare) critical events, I think that it would be great to:

  • check which data is surely garbage and can be trashed (reducing the number of znodes)
  • check whether Zookeeper is misused somehow, e.g. an application storing data rather than state. A huge number of znodes under the same parent is, for example, fine, but I am wondering if we could hit some limits (or maybe subtle failures) sooner or later.
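
A quick way to check both points from zkCli (a sketch; the paths are just examples taken from the listing below): stat on a znode reports dataLength (how many bytes that znode stores, which should stay small if it only holds state) and numChildren (how many direct children sit under it):

[zk: localhost:2181(CONNECTED) 0] stat /kafka
[zk: localhost:2181(CONNECTED) 1] stat /consumers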

Event Timeline

elukey triaged this task as Medium priority. Feb 25 2019, 7:15 AM
elukey created this task.

List of parent znodes in main-eqiad:

[zk: localhost:2181(CONNECTED) 0] ls /
[registry, brokers, zookeeper, yarn-leader-election, hadoop-ha, rmstore-analytics-test-hadoop, services, druid, etc, hive_zookeeper_namespace, kafka, rmstore, burrow, consumers]
  • registry (Testing znodes for Apache Slider, can be trashed)
[zk: localhost:2181(CONNECTED) 4] ls /registry/users/joal/services
[org-apache-slider]
  • brokers (probably old Kafka Analytics stuff)
[zk: localhost:2181(CONNECTED) 9] ls /brokers/topics
[]
  • zookeeper (seems internal usage for quotas)
[zk: localhost:2181(CONNECTED) 11] ls /zookeeper/quota
[]
  • yarn-leader-election (HA data for Hadoop Yarn)
[zk: localhost:2181(CONNECTED) 12] ls /yarn-leader-election
[analytics-hadoop, analytics-test-hadoop]
  • hadoop-ha (HA znodes for Hadoop HDFS)
[zk: localhost:2181(CONNECTED) 13] ls /hadoop-ha
[analytics-hadoop, analytics-test-hadoop]
  • services (Again seems to be Apache Slider testing data)
[zk: localhost:2181(CONNECTED) 18] ls /services/slider/users/joal
[]
  • druid (old Druid Zookeeper data; we currently have separate clusters for Druid, so this can surely be cleaned up)
  • etc - This is old Burrow data, probably my mistake from when I set it up (see the burrow znode later on)
[zk: localhost:2181(CONNECTED) 19] ls /etc/burrow
[notifier, notifier-eqiad, notifier-main-eqiad, notifier-analytics, notifier-jumbo-eqiad]
  • rmstore-analytics-test-hadoop and rmstore, see T216952
  • hive_zookeeper_namespace - still not sure what this is; from a first look it seems to be old data though.
[zk: localhost:2181(CONNECTED) 20] ls /hive_zookeeper_namespace
[wmf, qchris, ironholds, Wmf, otto, wmf_Raw, yurik, ellery, wmf_raw]
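
The name matches the default namespace of Hive's Zookeeper-based lock manager, but either way a quick staleness check (a sketch; wmf is just one of the children listed above) is to look at the mtime that stat reports, i.e. the last time the znode was written:

[zk: localhost:2181(CONNECTED) 21] stat /hive_zookeeper_namespace/wmf
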
  • kafka - Parent znode for all the Kafka clusters. It contains a lot of znodes, but it all seems to be legit data
[zk: localhost:2181(CONNECTED) 38] ls /kafka
[main-codfw, jumbo-eqiad, logging-eqiad, eqiad, main-eqiad]
  • burrow - Parent znode for all the Burrow data.
[zk: localhost:2181(CONNECTED) 47] ls /burrow/notifier
[analytics, jumbo-eqiad, logging-eqiad, main-eqiad]
  • consumers - probably old Kafka znodes from when consumers were handled via Zookeeper
[zk: localhost:2181(CONNECTED) 48] ls /consumers
[ebernhardson_test1, otto0, otto1, test_joal_flink, KafkaWordCount-otto-0, eventlogging-8c5a95e0-a8ef-11e5-b1da-782bcb0a0efc, eventlogging-group]
  • Main-codfw is less crowded and probably doesn't need a cleanup:
[zk: localhost:2181(CONNECTED) 0] ls /
[burrow, kafka, zookeeper]

[zk: localhost:2181(CONNECTED) 1] ls /burrow
[notifier-main-codfw, notifier]

[zk: localhost:2181(CONNECTED) 2] ls /kafka
[logging-codfw, main-codfw]

Proposal for removal:

/registry, /brokers, /services, /etc, /consumers

@Ottomata what do you think?

I don't know about registry, services or etc, but /brokers and /consumers should be leftovers from when we might have had an un-namespaced Kafka cluster in Zookeeper, and should be safe to delete.

/etc is my fault from when I set up Burrow the first time, and registry/services seem to be @joal's Slider tests (so safe to delete IIRC).
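
A sketch of how the actual removal could then be done from zkCli (assuming a 3.4-era client, where rmr is the recursive delete; newer clients call it deleteall):

[zk: localhost:2181(CONNECTED) 30] rmr /registry
[zk: localhost:2181(CONNECTED) 31] rmr /brokers
[zk: localhost:2181(CONNECTED) 32] rmr /services
[zk: localhost:2181(CONNECTED) 33] rmr /etc
[zk: localhost:2181(CONNECTED) 34] rmr /consumers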

Got down to:

[zk: localhost:2181(CONNECTED) 37] ls /
[zookeeper, yarn-leader-election, hadoop-ha, hive_zookeeper_namespace, kafka, burrow]

That looks much nicer :)

elukey set the point value for this task to 3. Feb 28 2019, 4:39 PM
elukey moved this task from In Code Review to Done on the Analytics-Kanban board.

Mentioned in SAL (#wikimedia-operations) [2019-02-28T16:39:57Z] <elukey> clean up old/stale zookeeper znodes from conf100[4-6] - T216979