Add support for page re-renders
Closed, Resolved · Public · 8 Estimated Story Points

Description

CirrusSearch should track page re-renders so that it updates its index whenever a change external to the page itself (template, page properties, Lua...) might affect the page's rendered version.
As of today CirrusSearch tracks this using the LinksUpdateComplete MediaWiki hook.

For the rewrite of the update-pipeline we should consider using similar events to trigger page updates that are not revision based.

When the LinksUpdateComplete hook is triggered, CirrusSearch should emit a change-event, and ideally should avoid emitting one if the change is already captured by the page-state stream.
The event should contain everything required to enrich it:

  • domain
  • wiki_id
  • page_id
  • page_namespace
  • page_title (not strictly required but perhaps useful for debug purposes?)
  • timestamp (probably the current time at which the MW hook is executed?)

Ideally the index_name and cluster_group should be part of these events so that we save a call to the mw API.
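As an illustration only, the event described above could be assembled as follows. This is a minimal sketch, not the final schema: the field names are taken from the list in this description, while the function name `build_rerender_event` and the `covered_by_page_state` flag are hypothetical. How CirrusSearch would actually detect that a change is already covered by the page-state stream is not specified here.

```python
from datetime import datetime, timezone
from typing import Optional


def build_rerender_event(
    domain: str,
    wiki_id: str,
    page_id: int,
    page_namespace: int,
    page_title: str,
    index_name: Optional[str] = None,
    cluster_group: Optional[str] = None,
    covered_by_page_state: bool = False,  # hypothetical flag, see lead-in
) -> Optional[dict]:
    """Return a re-render event, or None when no event should be emitted."""
    if covered_by_page_state:
        # The change is already captured by the page-state stream: skip it.
        return None

    event = {
        "domain": domain,
        "wiki_id": wiki_id,
        "page_id": page_id,
        "page_namespace": page_namespace,
        # Not strictly required, but handy when debugging.
        "page_title": page_title,
        # Assumption: the time at which the hook runs, in UTC.
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
    # Optional routing hints that would save a call to the MW API downstream.
    if index_name is not None:
        event["index_name"] = index_name
    if cluster_group is not None:
        event["cluster_group"] = cluster_group
    return event
```

Keeping `index_name` and `cluster_group` optional reflects the "ideally" above: a consumer could still fall back to the MW API when they are absent.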

Open question:
Should we enrich during the preparation job or the ingestion job?

Enriching during the preparation might require some non-negligible space on the target kafka cluster to store this:
kafka_log_size = re_renders_rate * (avg_doc_size / compression_ratio) * kafka_retention
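If we take a re-render rate of 400 events/s, an average document size of 20KiB, a compression ratio of 2 and a retention of 7 days (604800s):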

kafka_log_size = 400 * (20KiB/2) * 604800 = 2.25TiB

In addition to the kafka log size we also need to estimate the size of the flink state holding the window used for event-reordering and optimizations. Assuming a 10-minute window (600s) it would be:
flink_state_size = 400 * 20KiB * 600 = 4.6GiB (at least)
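
The two estimates above can be reproduced with a few lines of Python. This is just the arithmetic from the formulas, assuming the rate is expressed in events per second (consistent with retention and window being in seconds); the function names are made up for this sketch.

```python
KIB = 1024
GIB = 1024 ** 3
TIB = 1024 ** 4

# Inputs taken from the estimates in this description.
RE_RENDERS_RATE = 400              # events per second (assumed unit)
AVG_DOC_SIZE = 20 * KIB            # bytes
COMPRESSION_RATIO = 2
KAFKA_RETENTION = 7 * 24 * 3600    # 604800 seconds
FLINK_WINDOW = 10 * 60             # 600 seconds


def kafka_log_size(rate: float, doc_size: float, ratio: float, retention: float) -> float:
    """Bytes kept in the kafka log over the retention period (compressed)."""
    return rate * (doc_size / ratio) * retention


def flink_state_size(rate: float, doc_size: float, window: float) -> float:
    """Lower bound, in bytes, of the state held for one window (uncompressed)."""
    return rate * doc_size * window


kafka_bytes = kafka_log_size(RE_RENDERS_RATE, AVG_DOC_SIZE, COMPRESSION_RATIO, KAFKA_RETENTION)
flink_bytes = flink_state_size(RE_RENDERS_RATE, AVG_DOC_SIZE, FLINK_WINDOW)

print(f"kafka_log_size   = {kafka_bytes / TIB:.2f} TiB")  # 2.25 TiB
print(f"flink_state_size = {flink_bytes / GIB:.2f} GiB")  # 4.58 GiB
```

Both figures scale linearly with the re-render rate and the average document size, so halving either input halves the estimate.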

Having page re-render content in kafka might allow us to replay these updates during an in-place re-index and save one API call for cloudelastic, but it's not clear that the space cost is worth it.

Another approach is doing enrichment of page re-renders during the ingestion job:

  • will help to keep the kafka backlog and the flink state smaller
  • we probably won't want to replay such updates after an in-place reindex (we don't replay those today anyways)
  • this content is not addressable (not bound to a specific revision) so there's no strong reason to capture and store the content
  • unsure we want to track an error side-output for this kind of update
  • will be a natural throttling mechanism to ensure that revision based updates are prioritized
  • between 65% and 80% of these updates are discarded when hitting elasticsearch

AC:

  • write a schema that supports such update events
  • emit these events from CirrusSearch (using EventBus?)
  • consume these events from the producer job
  • enrich the events (from the preparation or the ingestion job, see open question)

Event Timeline

Restricted Application added a subscriber: Aklapper.
Gehel triaged this task as High priority. Feb 6 2023, 4:46 PM
Gehel edited projects, added Discovery-Search; removed Discovery-Search (Current work).

Change 935452 had a related patch set uploaded (by DCausse; author: DCausse):

[mediawiki/extensions/CirrusSearch@master] Produce a stream for CirrusSearch page-rerenders

https://gerrit.wikimedia.org/r/935452

Change 935697 had a related patch set uploaded (by DCausse; author: DCausse):

[schemas/event/primary@master] Add mediawiki/cirrussearch/page-rerender

https://gerrit.wikimedia.org/r/935697

Change 947315 had a related patch set uploaded (by DCausse; author: DCausse):

[mediawiki/extensions/CirrusSearch@master] Try to identify page changes the same way EventBus does

https://gerrit.wikimedia.org/r/947315

Change 935697 merged by jenkins-bot:

[schemas/event/primary@master] Add mediawiki/cirrussearch/page_rerender

https://gerrit.wikimedia.org/r/935697

Change 947315 merged by jenkins-bot:

[mediawiki/extensions/CirrusSearch@master] Try to identify page changes the same way EventBus does

https://gerrit.wikimedia.org/r/947315

Change 935452 merged by jenkins-bot:

[mediawiki/extensions/CirrusSearch@master] Produce a stream for CirrusSearch page-rerenders

https://gerrit.wikimedia.org/r/935452

Change 957726 had a related patch set uploaded (by DCausse; author: DCausse):

[operations/mediawiki-config@master] cirrus: add the mediawiki.cirrussearch.page_rerender stream

https://gerrit.wikimedia.org/r/957726

Change 957727 had a related patch set uploaded (by DCausse; author: DCausse):

[operations/mediawiki-config@master] cirrus: add wgCirrusSearchUseEventBusBridge and enable it on testwiki

https://gerrit.wikimedia.org/r/957727

Change 961197 had a related patch set uploaded (by Joal; author: Joal):

[operations/deployment-charts@master] Update eventgate services docker image

https://gerrit.wikimedia.org/r/961197

Change 961197 merged by jenkins-bot:

[operations/deployment-charts@master] Update eventgate services docker image

https://gerrit.wikimedia.org/r/961197

Change 957726 merged by jenkins-bot:

[operations/mediawiki-config@master] cirrus: add the mediawiki.cirrussearch.page_rerender.v1 stream

https://gerrit.wikimedia.org/r/957726

Mentioned in SAL (#wikimedia-operations) [2023-10-24T13:14:25Z] <samtar@deploy2002> Started scap: Backport for [[gerrit:957726|cirrus: add the mediawiki.cirrussearch.page_rerender.v1 stream (T325565)]]

Mentioned in SAL (#wikimedia-operations) [2023-10-24T13:15:47Z] <samtar@deploy2002> samtar and dcausse: Backport for [[gerrit:957726|cirrus: add the mediawiki.cirrussearch.page_rerender.v1 stream (T325565)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)

Mentioned in SAL (#wikimedia-operations) [2023-10-24T13:22:11Z] <samtar@deploy2002> Finished scap: Backport for [[gerrit:957726|cirrus: add the mediawiki.cirrussearch.page_rerender.v1 stream (T325565)]] (duration: 07m 45s)

Change 957727 merged by jenkins-bot:

[operations/mediawiki-config@master] cirrus: add wgCirrusSearchUseEventBusBridge and enable it on testwiki

https://gerrit.wikimedia.org/r/957727

Mentioned in SAL (#wikimedia-operations) [2023-10-24T13:27:14Z] <samtar@deploy2002> Started scap: Backport for [[gerrit:957727|cirrus: add wgCirrusSearchUseEventBusBridge and enable it on testwiki (T325565)]]

Mentioned in SAL (#wikimedia-operations) [2023-10-24T13:28:39Z] <samtar@deploy2002> samtar and dcausse: Backport for [[gerrit:957727|cirrus: add wgCirrusSearchUseEventBusBridge and enable it on testwiki (T325565)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)

Mentioned in SAL (#wikimedia-operations) [2023-10-24T13:34:10Z] <samtar@deploy2002> Finished scap: Backport for [[gerrit:957727|cirrus: add wgCirrusSearchUseEventBusBridge and enable it on testwiki (T325565)]] (duration: 06m 55s)