
Request permission to create 4 kafka topics in kafka-main (WDQS graph split)
Closed, Resolved · Public · 3 Estimated Story Points

Description

As part of the work to split the WDQS graph we will need to populate 4 new topics:

  • eqiad.rdf-streaming-updater.mutation-main
  • codfw.rdf-streaming-updater.mutation-main
  • eqiad.rdf-streaming-updater.mutation-scholarly
  • codfw.rdf-streaming-updater.mutation-scholarly

The expected combined size of the two added topics (per datacenter prefix) should not exceed the size of eqiad.rdf-streaming-updater.mutation, which is around 17GB (51GB including replication). The same applies to the rate of messages.

Because of topic mirroring, this means that an additional 100GB per cluster is required (+100GB on main-eqiad and +100GB on main-codfw).
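
The +100GB figure follows from the numbers above; here is a quick sanity check (the per-topic size is the task's own estimate, not a measurement):

```shell
# Back-of-the-envelope check of the +100GB-per-cluster estimate.
# Assumption from the task: the two split topics together hold roughly
# what eqiad.rdf-streaming-updater.mutation holds today (~17GB unreplicated).
combined_unreplicated_gb=17   # mutation-main + mutation-scholarly, one DC prefix
replication_factor=3          # 17 * 3 = 51GB, matching the figure quoted above
dc_prefixes=2                 # mirroring puts both eqiad.* and codfw.* topics on each cluster
echo $((combined_unreplicated_gb * replication_factor * dc_prefixes))   # 102, i.e. ~100GB per cluster
```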

These topics must have the following characteristics:

  • a single partition
  • retention of 4 weeks
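
For reference, a sketch (not the exact production commands, which are recorded later in this task) of how topics with these characteristics could be created with the stock Kafka CLI; the broker address is a placeholder:

```shell
# Sketch only: derive the 4-week retention in milliseconds and print the
# stock Kafka CLI invocations that would create the four topics.
# The bootstrap server below is a placeholder, not a real production host.
RETENTION_MS=$((4 * 7 * 24 * 3600 * 1000))   # 4 weeks = 2419200000 ms
for topic in eqiad.rdf-streaming-updater.mutation-main \
             codfw.rdf-streaming-updater.mutation-main \
             eqiad.rdf-streaming-updater.mutation-scholarly \
             codfw.rdf-streaming-updater.mutation-scholarly; do
  echo kafka-topics.sh --bootstrap-server localhost:9092 \
    --create --topic "$topic" --partitions 1 --replication-factor 3 \
    --config "retention.ms=${RETENTION_MS}"
done
```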

AC:

Event Timeline

Restricted Application added a subscriber: Aklapper.
dcausse renamed this task from Request permission to create 4 kafka topics in kafka-main to Request permission to create 4 kafka topics in kafka-main (WDQS graph split). · Jun 14 2024, 1:20 PM
Gehel triaged this task as High priority. · Jun 17 2024, 1:18 PM
Gehel moved this task from Incoming to Current work on the Wikidata-Query-Service board.
Gehel set the point value for this task to 3. · Jun 17 2024, 3:48 PM

Disk usage looks ok (all servers are around 30% usage of 3.3TB partitions), rate for eqiad.rdf-streaming-updater.mutation is under 20 msg/s. I think it's ok. @akosiaris can you confirm?

Yeah, I confirm. The older hosts in the clusters, kafka-main[12]00[0-5] have 2TB free space left, so 100GB isn't an issue. The newer hosts have smaller disks (budget reasons) but they aren't in service yet.

I do have one question though. Why 4 weeks retention? Is there some business reason or could it be dropped to a smaller duration?

We need 4 weeks to be able to backfill after an import: the window covers the time from when the wikidata dump process starts, through the time required to shuffle the data around (compression, hdfs-rsync to HDFS), until the end of the import into Blazegraph. See the initial lag column in T241128 for past import times. Perhaps 3 weeks would be manageable, but we went with 4 weeks to have extra room.

Perfect, that's the justification I was looking for. Numbers make sense to me now. Thanks!

Gehel claimed this task.
This comment was removed by Gehel.

We are getting ready to deploy the new updater that will populate these new topics, @bking could we have the topics created with proper retention and partitioning? (We could also let the topics autocreate and adapt the retention after the fact using https://wikitech.wikimedia.org/wiki/Kafka/Administration#Alter_topic_retention_settings). Thanks!

Mentioned in SAL (#wikimedia-operations) [2024-07-17T16:08:03Z] <inflatador> bking@kafka-main1005 kafka topics --create --topic ${TOPIC} --partitions 1 --replication-factor 3; kafka configs --entity-type topics --entity-name ${TOPIC} --alter --add-config retention.ms=2592000000 T367510

Mentioned in SAL (#wikimedia-operations) [2024-07-17T17:13:32Z] <inflatador> bking@kafka-main2005 kafka topics --create --topic ${TOPIC} --partitions 1 --replication-factor 3; kafka configs --entity-type topics --entity-name ${TOPIC} --alter --add-config retention.ms=2592000000 T367510
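
As a side note on the retention value used in the commands above, retention.ms=2592000000 works out to 30 days, slightly more than the 4 weeks (28 days) requested in the description:

```shell
# Convert the retention.ms value from the SAL entries back to days.
ms=2592000000
echo $((ms / (24 * 3600 * 1000)))   # 30 days (4 weeks would be 28)
```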

The new code has been deployed this morning and data is flowing properly into these new topics. We improved batching a bit to save some space via compression; I believe that we have some room to increase some buffer sizes if we want to optimize for space even further.

@dcausse re: your comment

I believe that we have some room to increase some buffer size if we want to optimize for space even further.

Is that something you are planning on doing in the graph split code, or would that be a Kafka configuration that the SREs can apply? Kafka apparently does have config options for various buffers.

I have no plans to do so unless required. To do it, a change in deployment-prep is required:

# default 250000
kafka_producer_config.batch.size: 500000
# default 2000
kafka_producer_config.linger.ms: 3000

Going further might also require tuning other things like max.request.size.
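
For context on why max.request.size might come into play: Kafka's stock producer default for max.request.size is 1048576 bytes (1 MiB); whether this particular deployment overrides it is not stated here, so treat the figure below as an assumption.

```shell
# Headroom check: with the standard Kafka producer default max.request.size
# of 1048576 bytes (1 MiB), the proposed batch.size of 500000 bytes fits
# comfortably, but doubling batch.size again would approach the limit.
batch_size=500000
max_request_size=1048576   # stock Kafka producer default (assumed here)
echo $((max_request_size - batch_size))   # 548576 bytes of remaining headroom
```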