Create kafka test cluster
Closed, Resolved (Public)

Description

In order to have confidence in our migration plan for https://phabricator.wikimedia.org/T255973, we will create a cluster of a few Kafka nodes and use Kafka MirrorMaker to replicate the kafka-jumbo cluster data to the new cluster. We can then add nodes to the mirror cluster and test rebalancing the partitions to include the new nodes, and test that this process is smooth.

For simplicity, we'll want to mirror a subset of topics, making sure to include one of the highest-traffic topics, like webrequest_text.

One candidate set of nodes is analytics1051-analytics1056, which are former hadoop workers and are not currently in use. These apparently only have 1GB/s network, whereas the kafka-jumbo nodes have 10GB/s, so if the migration works on these less-networked nodes, it should be just fine on the production cluster.
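The network argument above can be made concrete with a rough back-of-the-envelope estimate. This is only a sketch: it assumes the quoted figures refer to 1 Gbit/s and 10 Gbit/s links, and the data size and utilization values are hypothetical placeholders, not measurements from kafka-jumbo.

```python
def transfer_hours(data_bytes: float, link_gbps: float, utilization: float = 0.5) -> float:
    """Estimate hours needed to copy data_bytes over a link.

    link_gbps is the nominal link speed in gigabits per second (assumption:
    the "1GB/s" and "10GB/s" figures mean 1 and 10 Gbit/s NICs).
    utilization is the fraction of the link that partition movement is
    allowed to use; 0.5 is an arbitrary illustrative default.
    """
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    bytes_per_second = link_gbps * 1e9 / 8 * utilization
    return data_bytes / bytes_per_second / 3600


if __name__ == "__main__":
    hypothetical_bytes = 1e12  # 1 TB of partition data, made up for illustration
    for gbps in (1, 10):
        hours = transfer_hours(hypothetical_bytes, gbps)
        print(f"{gbps} Gbit/s link: about {hours:.1f} hours")
```

Under these assumptions, the same move takes ten times longer on the test nodes than on the production nodes, which is why a smooth run on the slower hardware is a conservative test.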

Event Timeline

Questions:

  • new zookeeper cluster or reuse a zookeeper cluster?

A possibility that we discussed on IRC could be to co-locate zookeeper on the nodes, like we do for the Druid nodes. In theory there shouldn't be any problem in using our an-conf100x nodes, but I wouldn't use the production conf[12]00x cluster (to be super sure that we don't inadvertently cause any harm to other kafka clusters).

We should also do the partition move test while producers/consumers are sending data, to observe their impact.

@elukey, what ZK does HA failover use in the analytics-test-hadoop cluster?

Also, what nodes should we build this on? If we do this work, I'd prefer to do something permanent, a kafka jumbo-test-eqiad cluster. If we use analytics1051-1056, will we be able to keep those nodes for a while? Perhaps we can set up Kafka on analytics1051-53, and then test the procedure for T255973 by adding analytics1054-55 and removing analytics1051-52, freeing up 2 slots in the rack.

If we can't keep those nodes for a while (because they are out of warranty), what should we do? We could probably get away with running Kafka on Ganeti if we can get several nodes with 4-6GB RAM for it.
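The add-two-remove-two procedure proposed above could be sketched roughly as follows. This is an illustration, not the procedure actually used for T255973: the broker IDs (taken from the hostnames), the topic's current replica layout, and the helper name `plan_reassignment` are all hypothetical. The output follows the JSON shape accepted by Kafka's `kafka-reassign-partitions` tool via `--reassignment-json-file`.

```python
import json
from typing import Dict, List


def plan_reassignment(topic: str,
                      current: Dict[int, List[int]],
                      retiring: List[int],
                      incoming: List[int]) -> dict:
    """Replace each retiring broker in every replica list with an incoming one.

    Incoming brokers are handed out round-robin, skipping any broker that
    already holds a replica of the partition.
    """
    partitions = []
    next_idx = 0
    for partition in sorted(current):
        replicas = list(current[partition])
        for i, broker in enumerate(current[partition]):
            if broker not in retiring:
                continue
            for _ in range(len(incoming)):
                candidate = incoming[next_idx % len(incoming)]
                next_idx += 1
                if candidate not in replicas:
                    replicas[i] = candidate
                    break
            else:
                raise ValueError(f"no free incoming broker for partition {partition}")
        partitions.append({"topic": topic, "partition": partition, "replicas": replicas})
    return {"version": 1, "partitions": partitions}


if __name__ == "__main__":
    # Hypothetical layout: three partitions replicated across analytics1051-53.
    current = {0: [1051, 1052, 1053], 1: [1052, 1053, 1051], 2: [1053, 1051, 1052]}
    plan = plan_reassignment("webrequest_text", current,
                             retiring=[1051, 1052], incoming=[1054, 1055])
    print(json.dumps(plan, indent=2))
```

Keeping each replacement in the same list position preserves the preferred-leader ordering where possible; a real plan would also need to consider rack awareness and throttling.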

> @elukey, what ZK does HA failover use in the analytics-test-hadoop cluster?

an-conf100[1-3], our cluster!

> Also, what nodes should we build this on? If we do this work, I'd prefer to do something permanent, a kafka jumbo-test-eqiad cluster. If we use analytics1051-1056, will we be able to keep those nodes for a while? Perhaps we can set up Kafka on analytics1051-53, and then test the procedure for T255973 by adding analytics1054-55 and removing analytics1051-52, freeing up 2 slots in the rack.
>
> If we can't keep those nodes for a while (because they are out of warranty), what should we do? We could probably get away with running Kafka on Ganeti if we can get several nodes with 4-6GB RAM for it.

I think that we can keep the analytics* nodes for a while, but probably on the order of some months, so not a permanent thing. Given the amount of time required to bootstrap a kafka cluster (it shouldn't be long with the current puppet profiles etc., IIRC), I think it is worth building a temp cluster for the remapping of partitions + kafka 2.x upgrade, and then tearing it down afterwards. If we like the experiment we can add nodes to next fiscal year's budget (basically like we did for the hadoop test cluster). I don't think that on Ganeti there is a lot of space for something like 5 VMs with 4/6GB of RAM, but we can ask SRE and see :)

The value of this testing will be double, since we'll also have to do the same tasks on the job queue kafkas (well, not us directly, but we'll help SRE), so a lot of karma points if we come up with solid procedures that can be applied anywhere :)

Ok, let's not call this kafka jumbo-test-eqiad then. I think just test-eqiad is best, and we can use the cluster at whim for various upgrades, etc.

> I don't think that on Ganeti there is a lot of space for something like 5 VMs with 4/6GB of RAM, but we can ask SRE and see :)

Let's try this first, we can at least ask. @razzi can you follow up with @akosiaris (see also https://wikitech.wikimedia.org/wiki/SRE_Team_requests#Virtual_machine_requests_(Production) ) to find out if we can/should do this in Ganeti?

If not, let's use analytics1051-1055 (and try to keep them online for testing until SRE gets cranky about it :p )

> what ZK does HA failover use in the analytics-test-hadoop cluster?

> an-conf100[1-3], our cluster!

Ok. Hm. I guess we can just use this for the test cluster too, it shouldn't hurt.

Ottomata renamed this task from Create kafka-jumbo mirror cluster to Create kafka test cluster. Nov 30 2020, 4:39 PM
Ottomata assigned this task to razzi.
Ottomata moved this task from Backlog to Q1 2021/2022 on the Analytics-Clusters board.

Change 644344 had a related patch set uploaded (by Razzi; owner: Razzi):
[operations/puppet@production] Configure zookeeper-test1002.eqiad.wmnet

https://gerrit.wikimedia.org/r/644344

Change 644344 merged by Razzi:
[operations/puppet@production] Configure zookeeper-test1002.eqiad.wmnet

https://gerrit.wikimedia.org/r/644344

Change 644962 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::prometheus::ops: add monitoring for zookeeper test

https://gerrit.wikimedia.org/r/644962

Change 644962 merged by Elukey:
[operations/puppet@production] profile::prometheus::ops: add monitoring for zookeeper test

https://gerrit.wikimedia.org/r/644962

Change 655126 had a related patch set uploaded (by Razzi; owner: Razzi):
[operations/puppet@production] kafka-test: Remove rack B from kafka-test cluster

https://gerrit.wikimedia.org/r/655126

Change 655126 merged by Razzi:
[operations/puppet@production] kafka-test: Remove rack B from kafka-test cluster

https://gerrit.wikimedia.org/r/655126

Change 655494 had a related patch set uploaded (by Razzi; owner: Razzi):
[operations/puppet@production] kafka-test: Mirror eventlogging_SearchSatisfaction topic

https://gerrit.wikimedia.org/r/655494

Change 655494 merged by Razzi:
[operations/puppet@production] kafka-test: Mirror eventlogging_SearchSatisfaction topic

https://gerrit.wikimedia.org/r/655494

Cluster is up and running!

Mentioned in SAL (#wikimedia-operations) [2022-01-12T19:17:13Z] <mutante> zookeeper-test1002 - CRITICAL - degraded: The following units failed: ifup@ens5.service - for this issue see T273026 (T268074)