Requesting permission to enable kafka log compaction for page_rerender on kafka-main
Closed, Resolved (Public)

Description

Hello ServiceOps,

This is a follow-up to T353715, where we enabled kafka log compaction on kafka-jumbo. We are requesting to enable log compaction on the page_rerender topic on kafka-main as the next step in moving the Search Update Pipeline into production. Enabling log compaction will save disk space, but our main motivation is to optimize the backfilling process, as explained here.
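For readers less familiar with the feature: log compaction retains at least the latest record for each message key and discards older records with the same key, so a backfill that replays the topic has less to read. Below is a minimal sketch of that effect in shell. The `compact_log` helper and the `key value` lines are made up for illustration; they are not real page_rerender events, and this is not how Kafka's log cleaner is implemented.

```shell
#!/bin/sh
# Illustration only: mimic the *effect* of compaction on a tiny log.
# Each input line is "<key> <value>"; line order stands in for offset order.
compact_log() {
    # Record the last line number seen for each key, then print only the
    # lines that are the final occurrence of their key, in log order.
    awk '
        { last[$1] = NR; line[NR] = $0 }
        END {
            for (i = 1; i <= NR; i++) {
                split(line[i], f, " ")
                if (last[f[1]] == i) print line[i]
            }
        }
    '
}

# Hypothetical records: page-1 is updated three times.
printf '%s\n' \
    'page-1 rev-a' \
    'page-2 rev-a' \
    'page-1 rev-b' \
    'page-3 rev-a' \
    'page-1 rev-c' | compact_log
```

Only the last record for each key survives (page-2, page-3, then the final page-1), and surviving records keep their relative order.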

@elukey mentioned his concerns about enabling log compaction here. I believe we have addressed the resource usage concern around log compaction/deletion by enabling it on kafka-jumbo topics (thanks @brouberol!). The screenshots attached to T353715 give us a good idea of the resource impact we can expect when we enable it on the page_rerender topic on kafka-main. You can also see kafka-jumbo's resource usage during the change in Grafana until it's aged out.

Thanks for looking and please let us know if you need additional info around this request.

Event Timeline

Per IRC conversation with @Joe, he is the right person to approve this change in the absence of @elukey. He'll review and make a decision on this as time permits.

Gehel triaged this task as High priority. Jan 22 2024, 2:27 PM

It generally seems ok, but a few considerations:

  • kafka-main is much smaller than kafka-jumbo, and critical to site operations
  • The codfw.mediawiki.cirrussearch.page_rerender.v1 topic is pretty large at the moment (292 GB in codfw and 149 GB in eqiad), while the corresponding eqiad topic is, as expected, tiny/irrelevant.

There is still the possibility that the initial compaction has a negative effect on the producers, which could degrade site performance and functionality.

I think that to play it fully safe we have two options:

  • Limit impact by following traffic
    1. We perform the compaction now on the eqiad cluster, which is mostly unused at the moment for producing events as the main datacenter is codfw
    2. After the switchover during the week of the next equinox, so in March, we also do codfw
  • Limit impact by temporarily reducing retention
    1. reduce retention to, say, 1 day, thus reducing the amount of data to compact to 1/7th of what it is now
    2. apply the change
    3. re-raise retention after compaction has happened
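The second option can be sketched as a sequence of topic config changes. The snippet below only prints the commands (a dry run). It assumes the current retention is 7 days, as implied by the "1/7th" estimate, and that retention is governed by the per-topic `retention.ms` setting. The ZooKeeper connection string is the kafka-main-eqiad one used elsewhere in this task, and the helper name `alter_topic` is hypothetical.

```shell
#!/bin/sh
# Dry run: print the kafka-configs invocations instead of executing them.
ZK='conf1007.eqiad.wmnet,conf1008.eqiad.wmnet,conf1009.eqiad.wmnet/kafka/main-eqiad'
TOPIC='codfw.mediawiki.cirrussearch.page_rerender.v1'

DAY_MS=$((24 * 60 * 60 * 1000))   # milliseconds in one day
WEEK_MS=$((7 * DAY_MS))           # ASSUMED current retention (7 days)

alter_topic() {
    # $1 = a single key=value topic config to add
    echo "kafka-configs --zookeeper $ZK --entity-type topics" \
         "--entity-name $TOPIC --alter --add-config $1"
}

# 1. Shrink retention to 1 day, then wait for old segments to be deleted.
alter_topic "retention.ms=$DAY_MS"
# 2. Enable compaction alongside time-based deletion.
alter_topic 'cleanup.policy=[compact,delete]'
# 3. Restore the assumed original retention once compaction has run.
alter_topic "retention.ms=$WEEK_MS"
```

Note that step 1 deletes every record older than one day, so this route trades some data for a smaller initial compaction.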

I'm pretty agnostic about which way we're going to go between the two.

I also want to note that this doesn't solve the long-standing issue of persisting topic configurations somehow.

As an aside, this is *very much* on my radar and something I'd like to fix in the near future if possible.

Discussion with @Joe: no objection to enabling compaction as long as we follow one of the options to reduce impact.

I'm going to go ahead with solution 1. As there's no strong preference for solution 2, I'm going to implement the one in which there's no data loss.

Mentioned in SAL (#wikimedia-operations) [2024-01-31T14:53:39Z] <brouberol> I'm going to apply kafka log compaction for {eqiad,codfw}.mediawiki.currussearch.page_rerender.v1 on kafka-main-eqiad only (current replica) - T354794

Looking at the topic sizes, I'm going to change the config of the smallest topic (eqiad.mediawiki.cirrussearch.page_rerender.v1), wait a bit, then apply the same change on the bigger topic (codfw.mediawiki.cirrussearch.page_rerender.v1).

brouberol@kafka-main1003:~$ kafka configs --entity-type topics --entity-name 'eqiad.mediawiki.cirrussearch.page_rerender.v1' --alter --add-config 'cleanup.policy=[compact,delete]'
kafka-configs --zookeeper conf1007.eqiad.wmnet,conf1008.eqiad.wmnet,conf1009.eqiad.wmnet/kafka/main-eqiad --entity-type topics --entity-name eqiad.mediawiki.cirrussearch.page_rerender.v1 --alter --add-config cleanup.policy=[compact,delete]
Completed Updating config for entity: topic 'eqiad.mediawiki.cirrussearch.page_rerender.v1'.

The topic is so small that the effect of compaction did not register at all.

Screenshot 2024-01-31 at 16.02.16.png (850×2 px, 160 KB)

brouberol@kafka-main1003:~$ kafka configs --entity-type topics --entity-name 'codfw.mediawiki.cirrussearch.page_rerender.v1' --alter --add-config 'cleanup.policy=[compact,delete]'
kafka-configs --zookeeper conf1007.eqiad.wmnet,conf1008.eqiad.wmnet,conf1009.eqiad.wmnet/kafka/main-eqiad --entity-type topics --entity-name codfw.mediawiki.cirrussearch.page_rerender.v1 --alter --add-config cleanup.policy=[compact,delete]
Completed Updating config for entity: topic 'codfw.mediawiki.cirrussearch.page_rerender.v1'.

The compaction of the 160 GB topic took about 30 minutes, and had a noticeable impact on R/W IOPS and disk time. The topic is now 25% smaller.
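As a rough sanity check on those figures, the arithmetic can be written out as below. The average rate assumes the whole topic was read exactly once during the 30 minutes and that 1 GB = 1000 MB. Both are simplifying assumptions, not measurements.

```shell
#!/bin/sh
# Back-of-the-envelope numbers from the figures quoted in the comment.
SIZE_GB=160        # topic size before compaction
REDUCTION_PCT=25   # observed size reduction
DURATION_MIN=30    # approximate compaction time

# Size left after compaction, in GB (integer arithmetic).
REMAINING_GB=$(( SIZE_GB * (100 - REDUCTION_PCT) / 100 ))

# Average rate IF the whole topic were read once in that window.
# Assumption: 1 GB = 1000 MB; the result is rounded down.
RATE_MB_S=$(( SIZE_GB * 1000 / (DURATION_MIN * 60) ))

echo "remaining: ${REMAINING_GB} GB"
echo "average read rate: ~${RATE_MB_S} MB/s"
```

This gives about 120 GB remaining and an average on the order of 88 MB/s under the stated assumptions.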

Screenshot 2024-01-31 at 16.26.53.png (2×2 px, 948 KB)

Screenshot 2024-01-31 at 16.27.01.png (1×2 px, 250 KB)

However, no producer impact was seen (there's no active consumer on that cluster).
{F41735142}

brouberol changed the task status from Open to Stalled. Edited Jan 31 2024, 4:21 PM

This is blocked until the next codfw -> eqiad failover, which will happen on March 20th 2024.

Moving to our backlog board, to be picked up again after March 20th 2024.

Gehel moved this task from Incoming to Quarterly Goals on the Data-Platform-SRE board.

Mentioned in SAL (#wikimedia-operations) [2024-03-26T08:31:55Z] <brouberol> I'm going to apply kafka log compaction for {eqiad,codfw}.mediawiki.currussearch.page_rerender.v1 on kafka-main-codfw only (current replica) - T354794

brouberol@kafka-main2001:~$ kafka configs --entity-type topics --entity-name 'codfw.mediawiki.cirrussearch.page_rerender.v1' --alter --add-config 'cleanup.policy=[compact,delete]'
kafka-configs --zookeeper conf2004.codfw.wmnet,conf2005.codfw.wmnet,conf2006.codfw.wmnet/kafka/main-codfw --entity-type topics --entity-name codfw.mediawiki.cirrussearch.page_rerender.v1 --alter --add-config cleanup.policy=[compact,delete]
Completed Updating config for entity: topic 'codfw.mediawiki.cirrussearch.page_rerender.v1'.

brouberol changed the task status from Stalled to In Progress. Mar 26 2024, 8:32 AM

brouberol@kafka-main2001:~$ kafka configs --entity-type topics --entity-name 'eqiad.mediawiki.cirrussearch.page_rerender.v1' --alter --add-config 'cleanup.policy=[compact,delete]'
kafka-configs --zookeeper conf2004.codfw.wmnet,conf2005.codfw.wmnet,conf2006.codfw.wmnet/kafka/main-codfw --entity-type topics --entity-name eqiad.mediawiki.cirrussearch.page_rerender.v1 --alter --add-config cleanup.policy=[compact,delete]
Completed Updating config for entity: topic 'eqiad.mediawiki.cirrussearch.page_rerender.v1'.
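
One way to confirm that the override is in place on both topics would be to describe their configuration with the same tool. The sketch below only prints the commands (a dry run) and reuses the kafka-main-codfw ZooKeeper connection string from the commands above. The helper name `describe_cmd` is hypothetical.

```shell
#!/bin/sh
# Dry run: print (do not execute) one describe command per topic.
ZK='conf2004.codfw.wmnet,conf2005.codfw.wmnet,conf2006.codfw.wmnet/kafka/main-codfw'

describe_cmd() {
    # $1 = topic name
    echo "kafka-configs --zookeeper $ZK --entity-type topics --entity-name $1 --describe"
}

for dc in eqiad codfw; do
    describe_cmd "$dc.mediawiki.cirrussearch.page_rerender.v1"
done
```

When run for real, the output for each topic should list the `cleanup.policy` override among its configs.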

All done! It had quite the impact on eqiad.mediawiki.cirrussearch.page_rerender.v1 (36% size reduction), and a bit less on codfw.mediawiki.cirrussearch.page_rerender.v1 (15% size reduction).

Screenshot 2024-03-26 at 09.56.12.png (848×5 px, 258 KB)