Create kafka test cluster
Closed, Resolved · Public

Assigned To: razzi

Authored By: razzi, Nov 17 2020, 8:43 PM

Description

In order to have confidence in our migration plan for https://phabricator.wikimedia.org/T255973, we will create a cluster of a few Kafka nodes and use Kafka MirrorMaker to replicate the kafka-jumbo cluster data to the new cluster. We can then add nodes to the mirror cluster and test rebalancing the partitions to include the new nodes, and test that this process is smooth.
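As a rough illustration of the rebalancing step, the sketch below builds a partition reassignment plan that spreads replicas round-robin over an enlarged broker list. The broker ids, partition count, and replication factor are hypothetical, and the helper name `build_reassignment` is invented for this example. The output follows the JSON layout that Kafka's `kafka-reassign-partitions` tool reads via `--reassignment-json-file`, but this is only a sketch of the idea, not the plan used for the actual migration.

```python
import json


def build_reassignment(partitions, brokers, replication_factor):
    """Assign replicas round-robin over `brokers`.

    partitions: list of (topic, partition_number) tuples.
    brokers: list of broker ids (hypothetical values in this example).
    """
    if replication_factor > len(brokers):
        raise ValueError("replication factor exceeds number of brokers")
    plan = []
    for i, (topic, partition) in enumerate(partitions):
        # Shift the starting broker by one for each partition so that
        # leaders and followers are spread evenly across the cluster.
        replicas = [
            brokers[(i + r) % len(brokers)] for r in range(replication_factor)
        ]
        plan.append({"topic": topic, "partition": partition, "replicas": replicas})
    return {"version": 1, "partitions": plan}


# Hypothetical layout: 6 partitions, 5 brokers after adding new nodes.
partitions = [("webrequest_text", p) for p in range(6)]
plan = build_reassignment(
    partitions, brokers=[1001, 1002, 1003, 1004, 1005], replication_factor=3
)
print(json.dumps(plan, indent=2))
```

A real plan would also need to account for rack awareness and for throttling the data movement, neither of which is modelled here.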

For simplicity, we'll want to mirror a subset of topics, making sure to include one of the highest-traffic topics, like webrequest_text.
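One way to sanity-check a topic subset before configuring the mirror is to test a whitelist pattern against a list of topic names. The sketch below assumes the mirror selects topics with a regular-expression whitelist, and it uses Python's `re` module as a stand-in for the Java regex engine that Kafka itself would use. The topic `example_other_topic` and the helper `mirrored_topics` are made up for illustration.

```python
import re


def mirrored_topics(topics, whitelist):
    """Return the topics whose full name matches the whitelist regex."""
    pattern = re.compile(whitelist)
    return [t for t in topics if pattern.fullmatch(t)]


# `example_other_topic` is a hypothetical name standing in for topics
# that should NOT be mirrored.
topics = [
    "webrequest_text",
    "eventlogging_SearchSatisfaction",
    "example_other_topic",
]
whitelist = r"webrequest_text|eventlogging_SearchSatisfaction"
selected = mirrored_topics(topics, whitelist)
print(selected)
```

Matching the full topic name, rather than a prefix, avoids accidentally mirroring topics that merely start with a whitelisted name.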

One candidate set of nodes is analytics1051-analytics1056, which are former Hadoop workers and are not currently in use. These apparently only have 1 Gbit/s network links, whereas the kafka-jumbo nodes have 10 Gbit/s, so if the migration works on these nodes with less bandwidth, it should be just fine on the production cluster.
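To get a feel for what the bandwidth difference means for a partition move, here is a back-of-the-envelope estimate. It assumes the figures above refer to 1 and 10 gigabit-per-second links, and the 1 TB data size is a hypothetical number chosen only to make the arithmetic concrete. The `efficiency` parameter is likewise an assumption, a way to model running below line rate.

```python
def transfer_seconds(data_bytes, link_gbit_per_s, efficiency=1.0):
    """Seconds needed to move `data_bytes` over a single link.

    efficiency is the assumed fraction of line rate actually achieved.
    """
    bits = data_bytes * 8
    usable_bits_per_s = link_gbit_per_s * 10**9 * efficiency
    return bits / usable_bits_per_s


# Hypothetical figure: 1 TB of partition data moving onto one broker.
data_bytes = 10**12
slow = transfer_seconds(data_bytes, link_gbit_per_s=1)
fast = transfer_seconds(data_bytes, link_gbit_per_s=10)
print(f"1 Gbit/s: {slow / 3600:.2f} h, 10 Gbit/s: {fast / 3600:.2f} h")
```

At line rate this works out to 8000 seconds (a bit over two hours) on the slower link versus 800 seconds on the faster one. Real moves share the link with producer and consumer traffic, so actual times would be longer.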

Details

Subject	Repo	Branch	Lines +/-
kafka-test: Mirror eventlogging_SearchSatisfaction topic	operations/puppet	production	+1 -1
kafka-test: Remove rack B from kafka-test cluster	operations/puppet	production	+0 -5
profile::prometheus::ops: add monitoring for zookeeper test	operations/puppet	production	+9 -0
Configure zookeeper-test1002.eqiad.wmnet	operations/puppet	production	+4 -4


Related Objects

Mentioned In: T273026: Errors for ifup@ens5.service after rebooting Ganeti VMs
T268202: Eq: 5 VM request for kafka-test-eqiad cluster
Mentioned Here: T273026: Errors for ifup@ens5.service after rebooting Ganeti VMs
T255973: Balance Kafka topic partitions on Kafka Jumbo to take advantage of the new brokers

Event Timeline

razzi created this task. Nov 17 2020, 8:43 PM

Restricted Application added a subscriber: Aklapper. Nov 17 2020, 8:43 PM

razzi updated the task description. Nov 17 2020, 8:52 PM

Ottomata added a subscriber: elukey. Nov 17 2020, 9:03 PM

Ottomata subscribed.

Questions:

new zookeeper cluster or reuse a zookeeper cluster?

In T268074#6628851, @razzi wrote:

Questions:

new zookeeper cluster or reuse a zookeeper cluster?

A possibility that we discussed on IRC could be to co-locate zookeeper on the nodes, like we do for the Druid nodes. In theory there shouldn't be any problem in using our an-conf100x nodes, but I wouldn't use the production conf[12]00x cluster (to be super sure that we don't inadvertently cause any harm to other kafka clusters).

We should also do the partition move test while producer/consumers are sending data, to observe their impact.

@elukey, what ZK does HA failover use in the analytics-test-hadoop cluster?

Also, what nodes should we build this on? If we do this work, I'd prefer to do something permanent, a kafka jumbo-test-eqiad cluster. If we use analytics1051-1056, will we be able to keep those nodes for a while? Perhaps we can set up Kafka on analytics1051-53, and then test the procedure for T255973 by adding analytics1054-55 and removing analytics1051-52, freeing up 2 slots in the rack.

If we can't keep those nodes for a while (because they are out of warranty), what should we do? We could probably get away with running Kafka on Ganeti if we can get several nodes with 4-6GB RAM for it.

In T268074#6630532, @Ottomata wrote:

@elukey, what ZK does HA failover use in the analytics-test-hadoop cluster?

an-conf100[1-3], our cluster!

Also, what nodes should we build this on? If we do this work, I'd prefer to do something permanent, a kafka jumbo-test-eqiad cluster. If we use analytics1051-1056, will we be able to keep those nodes for a while? Perhaps we can set up Kafka on analytics1051-53, and then test the procedure for T255973 by adding analytics1054-55 and removing analytics1051-52, freeing up 2 slots in the rack.

If we can't keep those nodes for a while (because they are out of warranty), what should we do? We could probably get away with running Kafka on Ganeti if we can get several nodes with 4-6GB RAM for it.

I think that we can keep the analytics* nodes for a while, but probably on the order of some months, so not a permanent thing. Given the amount of time required to bootstrap a Kafka cluster (it shouldn't be long with the current puppet profiles, IIRC), I think it is worth building a temp cluster for the remapping of partitions + Kafka 2.x upgrade, and then tearing it down afterwards. If we like the experiment we can add nodes to next fiscal year's budget (like we basically did for the hadoop test cluster). I don't think that there is a lot of space on Ganeti for something like 5 VMs with 4-6GB of RAM, but we can ask SRE and see :)

This testing will be doubly valuable, since the same tasks will also have to be done on the job queue Kafka clusters (not by us directly, but we'll help SRE), so a lot of karma points if we come up with solid procedures that can be applied anywhere :)

Ok, let's not call this kafka jumbo-test-eqiad then. I think just test-eqiad is best, and we can use the cluster at whim for various upgrades, etc.

I don't think that there is a lot of space on Ganeti for something like 5 VMs with 4-6GB of RAM, but we can ask SRE and see :)

Let's try this first, we can at least ask. @razzi can you follow up with @akosiaris (see also https://wikitech.wikimedia.org/wiki/SRE_Team_requests#Virtual_machine_requests_(Production) ) to find out if we can/should do this in Ganeti?

If not, let's use analytics1051-1055 (and try to keep them online for testing until SRE gets cranky about it :p )

what ZK does HA failover use in the analytics-test-hadoop cluster?

an-conf100[1-3], our cluster!

Ok. Hm. I guess we can just use this for the test cluster too, it shouldn't hurt.

razzi mentioned this in T268202: Eq: 5 VM request for kafka-test-eqiad cluster. Nov 19 2020, 3:56 AM

Ottomata renamed this task from Create kafka-jumbo mirror cluster to Create kafka test cluster. Nov 30 2020, 4:39 PM

Ottomata assigned this task to razzi.

Ottomata moved this task from Backlog to Q1 2021/2022 on the Analytics-Clusters board.

Change 644344 had a related patch set uploaded (by Razzi; owner: Razzi):
[operations/puppet@production] Configure zookeeper-test1002.eqiad.wmnet

https://gerrit.wikimedia.org/r/644344

gerritbot added a project: Patch-For-Review. Nov 30 2020, 10:12 PM

Change 644344 merged by Razzi:
[operations/puppet@production] Configure zookeeper-test1002.eqiad.wmnet

https://gerrit.wikimedia.org/r/644344

Maintenance_bot removed a project: Patch-For-Review. Dec 1 2020, 10:10 PM

Change 644962 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::prometheus::ops: add monitoring for zookeeper test

https://gerrit.wikimedia.org/r/644962

gerritbot added a project: Patch-For-Review. Dec 3 2020, 7:15 AM

Change 644962 merged by Elukey:
[operations/puppet@production] profile::prometheus::ops: add monitoring for zookeeper test

https://gerrit.wikimedia.org/r/644962

Maintenance_bot removed a project: Patch-For-Review. Dec 3 2020, 10:10 AM

Ottomata moved this task from Q1 2021/2022 to Done on the Analytics-Clusters board. Jan 4 2021, 4:45 PM

Change 655126 had a related patch set uploaded (by Razzi; owner: Razzi):
[operations/puppet@production] kafka-test: Remove rack B from kafka-test cluster

https://gerrit.wikimedia.org/r/655126

gerritbot added a project: Patch-For-Review. Jan 8 2021, 8:25 PM

Change 655126 merged by Razzi:
[operations/puppet@production] kafka-test: Remove rack B from kafka-test cluster

https://gerrit.wikimedia.org/r/655126

Maintenance_bot removed a project: Patch-For-Review. Jan 8 2021, 9:10 PM

Change 655494 had a related patch set uploaded (by Razzi; owner: Razzi):
[operations/puppet@production] kafka-test: Mirror eventlogging_SearchSatisfaction topic

https://gerrit.wikimedia.org/r/655494

gerritbot added a project: Patch-For-Review. Jan 11 2021, 7:29 PM

Change 655494 merged by Razzi:
[operations/puppet@production] kafka-test: Mirror eventlogging_SearchSatisfaction topic

https://gerrit.wikimedia.org/r/655494

Cluster is up and running!

Mentioned in SAL (#wikimedia-operations) [2022-01-12T19:17:13Z] <mutante> zookeeper-test1002 - CRITICAL - degraded: The following units failed: ifup@ens5.service - for this issue see T273026 (T268074)
