CirrusSearch should track page re-renders so that it can update its index whenever a change external to the page itself (template, page properties, Lua, ...) might affect its rendered version.
As of today CirrusSearch tracks this using the LinksUpdateComplete MediaWiki hook.
For the rewrite of the update pipeline we should consider using similar events to trigger page updates that are not revision-based.
When the LinksUpdateComplete hook is triggered, CirrusSearch should emit a change event; ideally it should not emit an event if the change is already captured by the page-state stream.
The event should contain everything required to enrich it:
- domain
- wiki_id
- page_id
- page_namespace
- page_title (not strictly required but perhaps useful for debug purposes?)
- timestamp (probably the current time at which the MW hook is executed?)
Ideally, index_name and cluster_group should also be part of these events so that we save a call to the MW API.
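A minimal sketch of what such an event payload could look like. Only the fields listed above come from this task; the stream name, schema URI and example values are illustrative assumptions, not a finalized schema:

```python
# Hypothetical page re-render event payload; stream name, schema URI and
# concrete values are placeholders.
page_rerender_event = {
    "$schema": "/mediawiki/page/rerender/1.0.0",   # assumed schema URI
    "meta": {
        "stream": "mediawiki.page-rerender",       # assumed stream name
        "domain": "en.wikipedia.org",
        "dt": "2024-01-01T00:00:00Z",              # time the MW hook executed
    },
    "wiki_id": "enwiki",
    "page_id": 42,
    "page_namespace": 0,
    "page_title": "Example",          # not strictly required; handy for debugging
    "index_name": "enwiki_content",   # saves a MW API call downstream
    "cluster_group": "dc1",           # assumed value format
}
```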
Open question:
Should we enrich during the preparation job or the ingestion job?
Enriching during the preparation job might require non-negligible space on the target Kafka cluster to store the enriched events:
kafka_log_size = re_renders_rate * (avg_doc_size / compression_ratio) * kafka_retention
If we take:
- re_renders_rate: 400 re-renders/s (estimated from current cirrusSearchLinksUpdate insertion rate).
- avg_doc_size: 20KiB
- compression_ratio: 2:1
- kafka_retention: 604800 secs (7 days)
kafka_log_size = 400 * (20KiB/2) * 604800 = 2.25TiB
In addition to the Kafka log size we also need to estimate the size of the Flink state holding the window used for event re-ordering and optimizations. Assuming a 10-minute window it would be:
flink_state_size = 400 * 20KiB * 600 = 4.6GiB (at least)
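For reference, a quick sanity check of the two estimates above (pure arithmetic over the figures already listed):

```python
# Back-of-envelope check of the kafka_log_size and flink_state_size estimates.
KiB, GiB, TiB = 1024, 1024**3, 1024**4

re_renders_rate = 400        # re-renders/s (cirrusSearchLinksUpdate insertion rate)
avg_doc_size = 20 * KiB      # bytes
compression_ratio = 2        # 2:1 compression in the Kafka log
kafka_retention = 604800     # seconds (7 days)
window = 10 * 60             # seconds (10-minute Flink window)

kafka_log_size = re_renders_rate * (avg_doc_size / compression_ratio) * kafka_retention
flink_state_size = re_renders_rate * avg_doc_size * window  # stored uncompressed

print(f"kafka_log_size   ~ {kafka_log_size / TiB:.2f} TiB")   # ~ 2.25 TiB
print(f"flink_state_size ~ {flink_state_size / GiB:.2f} GiB")  # ~ 4.58 GiB
```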
Having the page re-render content in Kafka might allow us to replay these updates during an in-place re-index and would save one API call for cloudelastic, but it's not clear that the space cost is worth it.
The other approach is enriching page re-renders during the ingestion job (see the sketch after this list):
- it will help to keep the Kafka backlog and the Flink state smaller
- we probably won't want to replay such updates after an in-place reindex (we don't replay them today anyway)
- this content is not addressable (not bound to a specific revision), so there's no strong reason to capture and store it
- unsure we want to track an error side-output for this kind of update
- it will act as a natural throttling mechanism, ensuring that revision-based updates are prioritized
- between 65% and 80% of these updates are discarded when they hit Elasticsearch
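A rough sketch of what that ingestion-time enrichment could look like, written as a plain blocking function for clarity. In the real pipeline this would presumably be an async operator in the Flink ingestion job; the use of the CirrusSearch cirrusbuilddoc API prop and the event field names are assumptions carried over from the payload sketch above:

```python
import requests

def enrich_rerender_event(event: dict) -> dict | None:
    """Fetch the rendered search document for a page re-render event.

    Sketch only: assumes the hypothetical event payload above and the
    CirrusSearch `cirrusbuilddoc` API prop. Returns None when the page has
    disappeared since the event was emitted, in which case the update is
    simply dropped (consistent with not tracking an error side-output).
    """
    domain = event["meta"]["domain"]
    resp = requests.get(
        f"https://{domain}/w/api.php",
        params={
            "action": "query",
            "prop": "cirrusbuilddoc",
            "pageids": event["page_id"],
            "format": "json",
            "formatversion": 2,
        },
        timeout=10,
    )
    resp.raise_for_status()
    pages = resp.json().get("query", {}).get("pages", [])
    page = pages[0] if pages else None
    if page is None or page.get("missing") or "cirrusbuilddoc" not in page:
        return None  # page gone; nothing to index
    return {**event, "document": page["cirrusbuilddoc"]}
```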
AC:
- write a schema that supports such update events
- emit these events from CirrusSearch (using EventBus?)
- consume these events from the producer job
- enrich the events (from the preparation or the ingestion job, see open question above)