
Enable kafka log compaction for page_rerender on jumbo
Closed, Resolved · Public · 2 Estimated Story Points

Description

Since page_rerender is a particularly chatty topic (~300 records/s expected), we would like to use log compaction to keep only the latest record per key. Enabling this might increase broker CPU usage, so we'd like to test it on kafka-jumbo first.

AC:

  • log compaction in combination with retention-based delete (cleanup.policy=[compact,delete]) is enabled for both topics on kafka-jumbo (see the sketch after this list):
    • codfw.mediawiki.cirrussearch.page_rerender.v1
    • eqiad.mediawiki.cirrussearch.page_rerender.v1
  • monitored impact on CPU utilisation
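
For reference, a minimal sketch of how such a policy could be applied with the stock kafka-configs CLI; the extra tuning knobs and their values are illustrative assumptions, not settings required by this task (the commands actually run on the cluster are in the timeline below):

# Sketch: enable compact+delete and, optionally, tune when the log cleaner kicks in
# (min.cleanable.dirty.ratio) and how long segments are kept before retention-based
# deletion (retention.ms, here 7 days); both values are illustrative only.
kafka-configs --zookeeper conf1007.eqiad.wmnet,conf1008.eqiad.wmnet,conf1009.eqiad.wmnet/kafka/jumbo-eqiad \
  --entity-type topics --entity-name eqiad.mediawiki.cirrussearch.page_rerender.v1 \
  --alter --add-config 'cleanup.policy=[compact,delete],min.cleanable.dirty.ratio=0.5,retention.ms=604800000'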

Details

Other Assignee
pfischer

Event Timeline

pfischer changed the task status from Open to In Progress. Jan 2 2024, 9:49 AM
pfischer claimed this task.
pfischer triaged this task as High priority.
pfischer moved this task from needs triage to elastic / cirrus on the Discovery-Search board.
pfischer set the point value for this task to 2.
pfischer changed the task status from In Progress to Open. Jan 2 2024, 10:35 AM
pfischer removed pfischer as the assignee of this task.
pfischer updated Other Assignee, added: pfischer.
pfischer updated the task description.

Mentioned in SAL (#wikimedia-analytics) [2024-01-02T10:56:33Z] <brouberol> configuring [eqiad,codfw].mediawiki.cirrussearch.page_rerender.v1 as compacted topics on jumbo-eqiad - T353715

brouberol@kafka-jumbo1010:~$ kafka configs --entity-type topics --entity-name 'eqiad.mediawiki.cirrussearch.page_rerender.v1' --alter --add-config 'cleanup.policy=[compact,delete]'
kafka-configs --zookeeper conf1007.eqiad.wmnet,conf1008.eqiad.wmnet,conf1009.eqiad.wmnet/kafka/jumbo-eqiad --entity-type topics --entity-name eqiad.mediawiki.cirrussearch.page_rerender.v1 --alter --add-config cleanup.policy=[compact,delete]
Completed Updating config for entity: topic 'eqiad.mediawiki.cirrussearch.page_rerender.v1'.

We can see the impact on the overall topic size:

Screenshot 2024-01-02 at 11.57.42.png
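
Besides the Grafana panel, per-topic on-disk usage can also be queried directly from a broker; a sketch using the stock kafka-log-dirs tool (the broker address/port and the tool being available under this name on the host are assumptions):

# Sketch: report the on-disk size of every partition of the topic, per log directory.
# Output is JSON; pipe it through jq or python3 -m json.tool to make it readable.
kafka-log-dirs --describe \
  --bootstrap-server kafka-jumbo1010.eqiad.wmnet:9092 \
  --topic-list eqiad.mediawiki.cirrussearch.page_rerender.v1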

brouberol@kafka-jumbo1010:~$ kafka configs --entity-type topics --entity-name 'codfw.mediawiki.cirrussearch.page_rerender.v1' --alter --add-config 'cleanup.policy=[compact,delete]'
kafka-configs --zookeeper conf1007.eqiad.wmnet,conf1008.eqiad.wmnet,conf1009.eqiad.wmnet/kafka/jumbo-eqiad --entity-type topics --entity-name codfw.mediawiki.cirrussearch.page_rerender.v1 --alter --add-config cleanup.policy=[compact,delete]
Completed Updating config for entity: topic 'codfw.mediawiki.cirrussearch.page_rerender.v1'.
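
Once altered, the effective overrides can be read back with --describe; a sketch, using the same ZooKeeper connection string as above:

# Sketch: confirm the cleanup.policy override is now set on both topics.
for topic in eqiad.mediawiki.cirrussearch.page_rerender.v1 codfw.mediawiki.cirrussearch.page_rerender.v1; do
  kafka-configs --zookeeper conf1007.eqiad.wmnet,conf1008.eqiad.wmnet,conf1009.eqiad.wmnet/kafka/jumbo-eqiad \
    --entity-type topics --entity-name "$topic" --describe
done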

22% of the topic segments were compacted and deleted:

Screenshot 2024-01-02 at 12.16.41.png

The change was applied an hour ago (marked by the vertical line in the graph below). We don't observe any impact on broker CPU usage.

Screenshot 2024-01-02 at 13.16.35.png

Interesting! Curious, so the reason for using compaction here is just to save space, not necessarily to keep the latest record per key forever?

300 / second is not nothing, but the total disk space used for this stream (~32G across 5 brokers?) isn't that much, and depending on how many distinct keys there are, the total number of messages removed by compaction probably won't be that much?
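
For a rough sense of the numbers being discussed (the 7-day retention window and the share of repeat re-renders below are assumptions, not measurements):

# Records produced into one DC's topic over a 7-day retention window at ~300/s:
echo '300 * 60 * 60 * 24 * 7' | bc    # 181440000, i.e. ~181M records
# Compaction can only drop records whose key shows up again later, so the savings
# are bounded by (records - distinct keys); if, say, ~20% of records are repeat
# re-renders of the same page within the window:
echo '181440000 / 5' | bc             # 36288000, i.e. ~36M records removable at most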

I'm very interested in using compaction for keeping a full current state in Kafka, but I hadn't considered using it for saving disk space.

@Ottomata, yes, this was intended to a) save disk space and b) reduce the number of records that have to be processed in case of a backfill. Events representing "page X has been re-rendered" seemed like a perfect use case where only the latest event per key is meaningful and worth keeping.
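
Compaction keeps only the latest record per message key, so it only pays off if page_rerender events carry a stable key (e.g. something identifying the page); a quick way to eyeball the keys is sketched below (the broker address is an assumption, and whatever the producer actually sets as the key is what applies):

# Sketch: print key and value for a handful of records to confirm the topic is keyed;
# log compaction relies on non-null message keys.
kafka-console-consumer \
  --bootstrap-server kafka-jumbo1010.eqiad.wmnet:9092 \
  --topic eqiad.mediawiki.cirrussearch.page_rerender.v1 \
  --property print.key=true \
  --max-messages 5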

Are you sure you want delete in the policy then? Perhaps you want to keep the latest event per page forever, so you can backfill fully from the topic?

@Ottomata, we considered this but decided against it since:

a) page_rerender is only one of currently five source topics we aggregate, so this would only be complete if all those topics retained records just as long as page_rerender
b) building a full index from scratch based on events is not a scenario we want to cover, mainly because it's a lot more time-consuming than starting from an index snapshot and backfilling only the delta (since the snapshot was taken)

For reference, here's a screenshot of more kafka metrics around enabling compaction:

screencapture-grafana-rw-wikimedia-org-d-000000027-kafka-2024-01-02-13_20_42.png
