
Enable kafka log compaction for page_rerender on jumbo
Closed, Resolved · Public · 2 Estimated Story Points

Description

Since page_rerender is a particularly chatty topic (~300 records/s expected), we would like to use log compaction to keep only the latest record per key. Enabling this might increase broker CPU usage, so we'd like to test it on kafka-jumbo first.

AC:

  • log compaction in combination with retention-based delete (cleanup.policy=[compact,delete]) is enabled for both topics on kafka-jumbo (see the sketch after this list):
    • codfw.mediawiki.cirrussearch.page_rerender.v1
    • eqiad.mediawiki.cirrussearch.page_rerender.v1
  • monitored impact on CPU utilisation
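
For reference, a minimal sketch of how such a policy could be applied with the stock kafka-configs CLI; the extra tuning knobs and their values are illustrative assumptions, not settings required by this task (the commands actually run on the cluster are in the timeline below):

# Sketch: enable compact+delete and, optionally, tune when the log cleaner kicks in
# (min.cleanable.dirty.ratio) and how long segments are kept before retention-based
# deletion (retention.ms, here 7 days); both values are illustrative only.
kafka-configs --zookeeper conf1007.eqiad.wmnet,conf1008.eqiad.wmnet,conf1009.eqiad.wmnet/kafka/jumbo-eqiad \
  --entity-type topics --entity-name eqiad.mediawiki.cirrussearch.page_rerender.v1 \
  --alter --add-config 'cleanup.policy=[compact,delete],min.cleanable.dirty.ratio=0.5,retention.ms=604800000'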

Details

Other Assignee
pfischer

Event Timeline

pfischer changed the task status from Open to In Progress. Jan 2 2024, 9:49 AM
pfischer claimed this task.
pfischer triaged this task as High priority.
pfischer moved this task from needs triage to elastic / cirrus on the Discovery-Search board.
pfischer set the point value for this task to 2.
pfischer changed the task status from In Progress to Open. Jan 2 2024, 10:35 AM
pfischer removed pfischer as the assignee of this task.
pfischer updated Other Assignee, added: pfischer.
pfischer updated the task description.

Mentioned in SAL (#wikimedia-analytics) [2024-01-02T10:56:33Z] <brouberol> configuring [eqiad,codfw].mediawiki.cirrussearch.page_rerender.v1 as compacted topics on jumbo-eqiad - T353715

brouberol@kafka-jumbo1010:~$ kafka configs --entity-type topics --entity-name 'eqiad.mediawiki.cirrussearch.page_rerender.v1' --alter --add-config 'cleanup.policy=[compact,delete]'
kafka-configs --zookeeper conf1007.eqiad.wmnet,conf1008.eqiad.wmnet,conf1009.eqiad.wmnet/kafka/jumbo-eqiad --entity-type topics --entity-name eqiad.mediawiki.cirrussearch.page_rerender.v1 --alter --add-config cleanup.policy=[compact,delete]
Completed Updating config for entity: topic 'eqiad.mediawiki.cirrussearch.page_rerender.v1'.

We can see the impact on the overall topic size:

Screenshot 2024-01-02 at 11.57.42.png
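
Besides the Grafana panel, per-topic on-disk usage can also be queried directly from a broker; a sketch using the stock kafka-log-dirs tool (the broker address/port and the tool being available under this name on the host are assumptions):

# Sketch: report the on-disk size of every partition of the topic, per log directory.
# Output is JSON; pipe it through jq or python3 -m json.tool to make it readable.
kafka-log-dirs --describe \
  --bootstrap-server kafka-jumbo1010.eqiad.wmnet:9092 \
  --topic-list eqiad.mediawiki.cirrussearch.page_rerender.v1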

brouberol@kafka-jumbo1010:~$ kafka configs --entity-type topics --entity-name 'codfw.mediawiki.cirrussearch.page_rerender.v1' --alter --add-config 'cleanup.policy=[compact,delete]'
kafka-configs --zookeeper conf1007.eqiad.wmnet,conf1008.eqiad.wmnet,conf1009.eqiad.wmnet/kafka/jumbo-eqiad --entity-type topics --entity-name codfw.mediawiki.cirrussearch.page_rerender.v1 --alter --add-config cleanup.policy=[compact,delete]
Completed Updating config for entity: topic 'codfw.mediawiki.cirrussearch.page_rerender.v1'.
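
Once altered, the effective overrides can be read back with --describe; a sketch, using the same ZooKeeper connection string as above:

# Sketch: confirm the cleanup.policy override is now set on both topics.
for topic in eqiad.mediawiki.cirrussearch.page_rerender.v1 codfw.mediawiki.cirrussearch.page_rerender.v1; do
  kafka-configs --zookeeper conf1007.eqiad.wmnet,conf1008.eqiad.wmnet,conf1009.eqiad.wmnet/kafka/jumbo-eqiad \
    --entity-type topics --entity-name "$topic" --describe
done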

22% of the topic segments were compacted and deleted:

Screenshot 2024-01-02 at 12.16.41.png

The change was applied an hour ago (marked by the vertical line in the graph below). We don't observe any impact on broker CPU usage.

Screenshot 2024-01-02 at 13.16.35.png

Interesting! Curious, so the reason for using compaction here is just to save space, not necessarily to keep the latest record per key forever?

300 / second is not nothing, but the total disk space used for this stream (~32G across 5 brokers?) isn't that much, and depending on how many distinct keys there are, the total number of messages removed by compaction probably won't be that much?
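
For a rough sense of the numbers being discussed (the 7-day retention window and the share of repeat re-renders below are assumptions, not measurements):

# Records produced into one DC's topic over a 7-day retention window at ~300/s:
echo '300 * 60 * 60 * 24 * 7' | bc    # 181440000, i.e. ~181M records
# Compaction can only drop records whose key shows up again later, so the savings
# are bounded by (records - distinct keys); if, say, ~20% of records are repeat
# re-renders of the same page within the window:
echo '181440000 / 5' | bc             # 36288000, i.e. ~36M records removable at most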

I'm very interested in using compaction for keeping a full current state in Kafka, but I hadn't considered using it for saving disk space.

@Ottomata, yes, this was intended to a) save disk space and b) reduce the number of records that have to be processed in case of a backfill. Events representing "page X has been re-rendered" seemed like a perfect use case where only the latest event per key is meaningful and worth keeping.
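
Compaction keeps only the latest record per message key, so it only pays off if page_rerender events carry a stable key (e.g. something identifying the page); a quick way to eyeball the keys is sketched below (the broker address is an assumption, and whatever the producer actually sets as the key is what applies):

# Sketch: print key and value for a handful of records to confirm the topic is keyed;
# log compaction relies on non-null message keys.
kafka-console-consumer \
  --bootstrap-server kafka-jumbo1010.eqiad.wmnet:9092 \
  --topic eqiad.mediawiki.cirrussearch.page_rerender.v1 \
  --property print.key=true \
  --max-messages 5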

Are you sure you want delete in the policy then? Perhaps you want to keep the latest event per page forever, so you can backfill fully from the topic?

@Ottomata, we considered this but decided against it since:

a) page_rerender is only one of currently five source topics we aggregate, so this would only be complete if all those topics retained records just as long as page_rerender
b) building a full index from scratch based on events is not a scenario we want to cover, mainly because it's a lot more time-consuming than starting from an index snapshot and backfilling only the delta (since the snapshot was taken)

For reference, here's a screenshot of more kafka metrics around enabling compaction:

screencapture-grafana-rw-wikimedia-org-d-000000027-kafka-2024-01-02-13_20_42.png
