Enable mediawiki.cirrussearch.page_rerender.v1 on all public wikis
Closed, Resolved · Public · 3 Estimated Story Points

Description

The stream mediawiki.cirrussearch.page_rerender.v1 is currently enabled only for testwiki.
As we'd like to test how the search update pipeline works in a backfill scenario (T350826), having this stream populated for more wikis would be useful to us.

This task is to track what needs to be done to have this stream populated for most of our wikis (the public ones).

  • Double-check if kafka-main can be used
    • rate ~300 evt/s
    • expected topic size for 7 days: ~110Gb, replicated 330Gb, 5 partitions of 22Gb each, an additional 66Gb per node on a 5-node cluster
    • if size is a concern we could reduce retention to 4 days or possibly explore whether log compaction is usable/useful in this context (cc @pfischer)
  • Should we enable this gradually, in 2, 3, 4 or more steps?
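
The sizing figures above can be reproduced with a few lines of arithmetic. This is only a sketch: the replication factor of 3 is inferred from the 110Gb to 330Gb step, replicas are assumed to spread evenly across brokers, and the 4-day figure assumes size scales linearly with retention time. The function names are illustrative.

```python
def per_node_storage_gb(topic_size_gb: float, replication_factor: int, brokers: int) -> float:
    """Extra disk per broker, assuming replicas are spread evenly across brokers."""
    return topic_size_gb * replication_factor / brokers


def scale_retention(size_gb: float, from_days: int, to_days: int) -> float:
    """Scale a size estimate linearly with retention time (constant event rate assumed)."""
    return size_gb * to_days / from_days


topic_7d = 110.0        # GB, unreplicated, from the task description
replication_factor = 3  # assumption: inferred from 110 GB -> 330 GB replicated
brokers = 5
partitions = 5

replicated_7d = topic_7d * replication_factor
per_partition_7d = topic_7d / partitions
per_node_7d = per_node_storage_gb(topic_7d, replication_factor, brokers)
per_node_4d = scale_retention(per_node_7d, from_days=7, to_days=4)

print(replicated_7d, per_partition_7d, per_node_7d, round(per_node_4d, 1))
# prints: 330.0 22.0 66.0 37.7
```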

Prerequisites:

  • RESOLVED Get the green light from serviceops (cc @elukey)
  • Increase the number of partitions to 5 on existing topics on main-eqiad, main-codfw and kafka-jumbo
    • codfw.mediawiki.cirrussearch.page_rerender.v1
    • eqiad.mediawiki.cirrussearch.page_rerender.v1

AC:

  • the mediawiki.cirrussearch.page_rerender.v1 stream is populated for all the public wikis

Event Timeline

Gehel set the point value for this task to 3.
Gehel triaged this task as High priority.Nov 22 2023, 9:24 AM
Gehel moved this task from Incoming to Quarterly Goals on the Data-Platform-SRE board.

@elukey, we would like to start populating this kafka topic on kafka-main. page_rerender is the last missing source we need to enable to ensure our search stream processing works with production volumes. The calculations can be found in the following spreadsheet, sheet "Kafka", line 12.

  • Option A) 7 days retention: storage need of 66Gb per node
  • Option B) 4 days retention: storage need of 37Gb per node

Both options assume retention of all records. Since, for this topic in particular, we are interested only in the latest record per key, I would propose a combination of max retention time and log compaction (cleanup.policy=delete,compact, see docs). This would bring down the storage requirements even further, depending on the number of records sharing the same key.

As of now this would add to the existing topics since we are not ready to get rid of legacy topics, such as eqiad.mediawiki.job.cirrusSearchLinksUpdate.

Would you say we can start publishing to this topic, @elukey?
Who would take care of configuring the topics, including partitioning? (cc @brouberol)

@pfischer Once you agree on the config, I can create and configure the topic for you. As this is to be a compacted topic, please don't auto-create it by publishing to it directly. I'd rather everything is configured correctly before we start writing to it. Thanks!

@pfischer option A) is fine. If there is a way to add the new traffic incrementally (to double-check the space used by the topic etc. after every step) it would be great :)

@elukey, sure. We would start onboarding smaller wikis first (test, it, fr) before moving on to the bigger ones.

@brouberol, thanks, sadly the topic already exists. :-( Is this a major issue? We should be able to delete it, as we (search update pipeline) should be the only consumers at the moment.

For deleting the topic, if we need to pause all writers and consumers that can relatively easily be done. testwiki should be the only one sending events, those can be disabled with a mediawiki-config patch. The only consumer should be the staging deployment of cirrus-streaming-updater in k8s-eqiad-staging. We can easily stop that consumer as well, it's strictly for testing at the moment.

Change 979133 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[operations/mediawiki-config@master] cirrus: Disable event bus bridge

https://gerrit.wikimedia.org/r/979133

The partition count can be changed on the fly (only increased, never decreased), and so can the topic config, so that's no issue. No need to delete it. It's a matter of personal preference really: for greenfield topics, I prefer having them configured as required and _then_ have them be written to. As this topic already exists, this does not apply here.

Change 979133 abandoned by Ebernhardson:

[operations/mediawiki-config@master] cirrus: Disable event bus bridge

Reason:

turns out to be unnecessary, the topic can be mutated in place.

https://gerrit.wikimedia.org/r/979133

@brouberol, if you have time, could you configure codfw.mediawiki.cirrussearch.page_rerender.v1 and eqiad.mediawiki.cirrussearch.page_rerender.v1 with cleanup.policy=delete,compact, please? If you have any best practices regarding the compaction configuration, I'd be happy to learn about those, too. The retention period would remain 7 days for now.

@pfischer is there any reason why we need a specific cleanup policy for those topics? I thought we were going to change the partition count, not other settings.

@elukey, for page re-render, we're definitely interested only in the latest event, since we only care for the fact that a page was re-rendered, not how many times (especially for the backfill scenario). Further, that would reduce the storage requirements. Is that of any concern?

We currently don't have any git-ops-like way to apply specific settings to topics, so it would be nice not to use special settings unless we really have to. We definitely delete after 7 days (Kafka broker setting); I'm not sure what compact does, to be honest, but as a starting step I'd avoid it if possible. There is also a risk of adding more computation on the Kafka brokers' side, which may or may not be an issue down the road. Shall we start with only a partition count change (if needed) and monitor the size of the Kafka topic after the first traffic increments?

We currently don't have any git-ops-like way to apply specific settings to topics

Okay, I get your concern. Is this planned, though? Presumably kafka use will increase, so having a way to tailor topic config appears to be a valuable feature.

not sure what compact does to be honest

Kafka keeps a log of records (key + value). Those logs are made of segments. There's a hot segment, which is currently being written to, and there are cold segments. If a client subscribes to a topic, it will be served from the logs. Without compaction, the broker keeps all events. If compaction is enabled, brokers start to reduce cold log segments by keeping only the latest record for every key. In the context of page_rerender that means: if a page (key: testwiki:page_42) was re-rendered 10 times over two days, and we have to backfill our pipeline a day later, the broker would have kept only one event related to testwiki:page_42 instead of 10.
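
To make the effect concrete, here is a toy simulation of what compaction does to a cold segment. It is only an illustration of the "latest record per key" idea, not how brokers implement it: real compaction runs in the background, leaves the hot segment untouched, and also handles deletion markers. The record layout and the second key (testwiki:page_7) are made up for the example.

```python
from typing import Dict, Iterable, List, Tuple

Record = Tuple[int, str, str]  # (offset, key, value)


def compact(segment: Iterable[Record]) -> List[Record]:
    """Keep only the highest-offset record per key; input must be in offset order."""
    latest: Dict[str, Record] = {}
    for offset, key, value in segment:
        latest[key] = (offset, key, value)  # later records overwrite earlier ones
    return sorted(latest.values())  # surviving records keep their original offsets


# testwiki:page_42 is re-rendered 10 times, another (hypothetical) page once.
log = [(i, "testwiki:page_42", f"render #{i + 1}") for i in range(10)]
log.append((10, "testwiki:page_7", "render #1"))

compacted = compact(log)
print(len(log), "->", len(compacted))
print(compacted)
# prints:
# 11 -> 2
# [(9, 'testwiki:page_42', 'render #10'), (10, 'testwiki:page_7', 'render #1')]
```

A consumer that backfills from the compacted segment still sees every key at least once, which is all the search update pipeline needs for a re-render signal.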

We currently don't have any git-ops-like way to apply specific settings to topics

Okay, I get your concern. Is this planned, though? Presumably kafka use will increase, so having a way to tailor topic config appears to be a valuable feature.

Not right now, but if needed we'll surely be able to create something. So far (years of Kafka work) we didn't receive many requests and we relied on the Kafka broker's settings for most of the work (except ACLs).

Thanks for the explanation. As described above, it is more work for the broker and we have never used it before. I'd prefer to proceed with the standard settings if you agree, and then refine if needed.

Not right now, but if needed we'll surely be able to create something.

It's not blocking us right now, so we could enable it in a mirrored cluster (jumbo) to see if there's any noticeable effect on CPU consumption.

Thanks for the explanation. As described above, it is more work for the broker and we have never used it before. I'd prefer to proceed with the standard settings if you agree, and then refine if needed.

Sure. So we would start publishing to this topic with a subset of wikis and observe what happens.

The current plan for the gradual deploy is to start with a selection of wikis that add up to ~25% of the total rate. If that's too high we can remove commonswiki from the set, which should bring it down to ~13%. Before we can turn those events on, I believe we need to have the topic partitioning changes applied; the topic currently has a single partition.

Mentioned in SAL (#wikimedia-operations) [2023-12-04T21:06:17Z] <ryankemper> T351503 Setting partition count to 5: ryankemper@kafka-main1001:~$ kafka topics --alter --topic eqiad.mediawiki.cirrussearch.page_rerender.v1 --partitions 5

Mentioned in SAL (#wikimedia-operations) [2023-12-04T21:09:01Z] <ryankemper> T351503 Setting partition count to 5: ryankemper@kafka-main1001:~$ kafka topics --alter --topic codfw.mediawiki.cirrussearch.page_rerender.v1 --partitions 5

Mentioned in SAL (#wikimedia-operations) [2023-12-04T21:47:28Z] <ryankemper> T351503 Setting partition count to 5: ryankemper@kafka-main2001:~$ kafka topics --alter --topic eqiad.mediawiki.cirrussearch.page_rerender.v1 --partitions 5

Mentioned in SAL (#wikimedia-operations) [2023-12-04T21:47:37Z] <ryankemper> T351503 Setting partition count to 5: ryankemper@kafka-main2001:~$ kafka topics --alter --topic codfw.mediawiki.cirrussearch.page_rerender.v1 --partitions 5

I think (please ignore if already done) we're still missing the partition count change on kafka-jumbo for both topics.

Mentioned in SAL (#wikimedia-operations) [2023-12-18T17:12:28Z] <inflatador> bking@kafka-jumbo1007 kafka topics --alter --topic eqiad.mediawiki.cirrussearch.page_rerender.v1 --partitions 5 T351503

Mentioned in SAL (#wikimedia-operations) [2023-12-18T17:14:10Z] <inflatador> bking@kafka-jumbo1007 kafka topics --alter --topic codfw.mediawiki.cirrussearch.page_rerender.v1 --partitions 5 T351503

@elukey, we have an updated estimate of the expected topic size increment per wiki we publish page_rerender records for: https://docs.google.com/spreadsheets/d/1Fp44MdLxUVlxi03MBD_64m0zQErny-9jUD5C6RGf_bU/edit#gid=670687915

The color coding of the column "accumulated topic size" roughly groups the wikis into batches of ~10GB (per broker). Unless you object, we would start to enable those wikis, one batch per day.
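
One way such batches could be derived is a greedy grouping, smallest wikis first, that closes a batch before it would exceed ~10GB. This is a sketch of the idea, not the method used in the spreadsheet; the function name and the per-wiki sizes below are hypothetical placeholders.

```python
from typing import Dict, List


def plan_batches(size_gb_per_wiki: Dict[str, float], batch_gb: float = 10.0) -> List[List[str]]:
    """Greedily group wikis, smallest first, into batches of at most ~batch_gb each."""
    batches: List[List[str]] = []
    current: List[str] = []
    current_size = 0.0
    for wiki, size in sorted(size_gb_per_wiki.items(), key=lambda kv: kv[1]):
        # Close the running batch if adding this wiki would exceed the budget.
        if current and current_size + size > batch_gb:
            batches.append(current)
            current, current_size = [], 0.0
        current.append(wiki)
        current_size += size
    if current:
        batches.append(current)
    return batches


# Hypothetical per-broker estimates in GB; the real figures live in the spreadsheet.
estimates = {
    "testwiki": 0.1, "itwiki": 2.0, "frwiki": 3.5,
    "dewiki": 4.0, "enwiki": 9.0, "commonswiki": 12.0,
}
print(plan_batches(estimates))
# prints: [['testwiki', 'itwiki', 'frwiki', 'dewiki'], ['enwiki'], ['commonswiki']]
```

A wiki whose estimate alone exceeds the budget ends up in a batch of its own, so it can be enabled and observed separately.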

+1, looks good (IIUC the new estimates are similar to the original ballpark figures, if not please lemme know)

One suggestion - maybe let's not add too much traffic before the holidays to avoid any headaches for SRE during the next 2/3 weeks :)

Change 987783 had a related patch set uploaded (by Peter Fischer; author: Peter Fischer):

[operations/deployment-charts@master] Search update pipeline: 2nd batch page_rerender

https://gerrit.wikimedia.org/r/987783

Change 987783 merged by jenkins-bot:

[operations/deployment-charts@master] Search update pipeline: 2nd batch page_rerender

https://gerrit.wikimedia.org/r/987783

Change 988449 had a related patch set uploaded (by Peter Fischer; author: Peter Fischer):

[operations/mediawiki-config@master] enable page_rerender for 3rd batch of wikis

https://gerrit.wikimedia.org/r/988449

Change 988449 merged by jenkins-bot:

[operations/mediawiki-config@master] enable page_rerender for 3rd batch of wikis

https://gerrit.wikimedia.org/r/988449

Mentioned in SAL (#wikimedia-operations) [2024-01-08T14:02:45Z] <urbanecm@deploy2002> Started scap: Backport for [[gerrit:988449|enable page_rerender for 3rd batch of wikis (T351503)]]

Mentioned in SAL (#wikimedia-operations) [2024-01-08T14:04:11Z] <urbanecm@deploy2002> pfischer and urbanecm: Backport for [[gerrit:988449|enable page_rerender for 3rd batch of wikis (T351503)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)

Mentioned in SAL (#wikimedia-operations) [2024-01-08T14:12:21Z] <urbanecm@deploy2002> Finished scap: Backport for [[gerrit:988449|enable page_rerender for 3rd batch of wikis (T351503)]] (duration: 09m 35s)

Change 988500 had a related patch set uploaded (by Peter Fischer; author: Peter Fischer):

[operations/deployment-charts@master] Search update pipeline: 3rd batch page_rerender

https://gerrit.wikimedia.org/r/988500

Change 988500 merged by jenkins-bot:

[operations/deployment-charts@master] Search update pipeline: 3rd batch page_rerender

https://gerrit.wikimedia.org/r/988500

Change 989442 had a related patch set uploaded (by Peter Fischer; author: Peter Fischer):

[operations/mediawiki-config@master] enable page_rerender for 4th batch of wikis

https://gerrit.wikimedia.org/r/989442

Change 989443 had a related patch set uploaded (by Peter Fischer; author: Peter Fischer):

[operations/deployment-charts@master] Search update pipeline: 4th batch page_rerender

https://gerrit.wikimedia.org/r/989443

Change 989442 merged by jenkins-bot:

[operations/mediawiki-config@master] enable page_rerender for 4th batch of wikis

https://gerrit.wikimedia.org/r/989442

Mentioned in SAL (#wikimedia-operations) [2024-01-10T08:35:50Z] <dcausse@deploy2002> Started scap: Backport for [[gerrit:989442|enable page_rerender for 4th batch of wikis (T351503)]]

Mentioned in SAL (#wikimedia-operations) [2024-01-10T08:37:37Z] <dcausse@deploy2002> pfischer and dcausse: Backport for [[gerrit:989442|enable page_rerender for 4th batch of wikis (T351503)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)

Mentioned in SAL (#wikimedia-operations) [2024-01-10T08:47:41Z] <dcausse@deploy2002> Finished scap: Backport for [[gerrit:989442|enable page_rerender for 4th batch of wikis (T351503)]] (duration: 11m 50s)

Change 989443 merged by jenkins-bot:

[operations/deployment-charts@master] Search update pipeline: 4th batch page_rerender

https://gerrit.wikimedia.org/r/989443

Change 990029 had a related patch set uploaded (by Peter Fischer; author: Peter Fischer):

[operations/mediawiki-config@master] enable page_rerender for 5th batch of wikis

https://gerrit.wikimedia.org/r/990029

Change 990586 had a related patch set uploaded (by Peter Fischer; author: Peter Fischer):

[operations/deployment-charts@master] Search update pipeline: 5th batch page_rerender

https://gerrit.wikimedia.org/r/990586

Change 990029 merged by jenkins-bot:

[operations/mediawiki-config@master] enable page_rerender for 5th batch of wikis

https://gerrit.wikimedia.org/r/990029

Mentioned in SAL (#wikimedia-operations) [2024-01-15T08:12:02Z] <dcausse@deploy2002> Started scap: Backport for [[gerrit:990029|enable page_rerender for 5th batch of wikis (T351503)]]

Mentioned in SAL (#wikimedia-operations) [2024-01-15T08:13:36Z] <dcausse@deploy2002> pfischer and dcausse: Backport for [[gerrit:990029|enable page_rerender for 5th batch of wikis (T351503)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)

Mentioned in SAL (#wikimedia-operations) [2024-01-15T08:23:42Z] <dcausse@deploy2002> Finished scap: Backport for [[gerrit:990029|enable page_rerender for 5th batch of wikis (T351503)]] (duration: 11m 40s)

Change 990586 merged by jenkins-bot:

[operations/deployment-charts@master] Search update pipeline: 5th batch page_rerender

https://gerrit.wikimedia.org/r/990586

Change 990718 had a related patch set uploaded (by Peter Fischer; author: Peter Fischer):

[operations/mediawiki-config@master] enable page_rerender for all wikis

https://gerrit.wikimedia.org/r/990718

Change 990718 merged by jenkins-bot:

[operations/mediawiki-config@master] enable page_rerender for all wikis

https://gerrit.wikimedia.org/r/990718

Mentioned in SAL (#wikimedia-operations) [2024-01-17T08:46:53Z] <dcausse@deploy2002> Started scap: Backport for [[gerrit:990718|enable page_rerender for all wikis (T351503)]]

Mentioned in SAL (#wikimedia-operations) [2024-01-17T08:48:21Z] <dcausse@deploy2002> pfischer and dcausse: Backport for [[gerrit:990718|enable page_rerender for all wikis (T351503)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)

Change 991282 had a related patch set uploaded (by Peter Fischer; author: Peter Fischer):

[operations/deployment-charts@master] Search update pipeline: enable page_rerender for all wikis

https://gerrit.wikimedia.org/r/991282

Mentioned in SAL (#wikimedia-operations) [2024-01-17T08:56:08Z] <dcausse@deploy2002> Finished scap: Backport for [[gerrit:990718|enable page_rerender for all wikis (T351503)]] (duration: 09m 15s)

Change 991282 merged by jenkins-bot:

[operations/deployment-charts@master] Search update pipeline: enable page_rerender for all wikis

https://gerrit.wikimedia.org/r/991282

Change 991307 had a related patch set uploaded (by Peter Fischer; author: Peter Fischer):

[operations/deployment-charts@master] Search update pipeline: enable page_rerender for all wikis

https://gerrit.wikimedia.org/r/991307

Change 991307 merged by jenkins-bot:

[operations/deployment-charts@master] Search update pipeline: enable page_rerender for all wikis

https://gerrit.wikimedia.org/r/991307

As of today, all non-private wikis featuring the cirrussearch extension publish page_rerender events by default.