CirrusSearch should track page re-renders so that it can update its index whenever a change external to the page itself (template, page properties, Lua, ...) might affect its rendered version.
As of today CirrusSearch tracks this using the LinksUpdateComplete MediaWiki hook.
For the rewrite of the update pipeline we should consider using similar events to trigger page updates that are not revision-based.
When the LinksUpdateComplete hook is triggered, CirrusSearch should emit a change event; ideally it should not emit an event if the change is already captured by the page-state stream.
The event should contain everything required to enrich it:
- domain
- wiki_id
- page_id
- page_namespace
- page_title (not strictly required but perhaps useful for debug purposes?)
- timestamp (probably the current time at which the MW hook is executed?)
Ideally, index_name and cluster_group should also be part of these events so that we save a call to the MW API.
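A minimal sketch of what such an event payload could look like. Only the fields listed above come from this task; the stream name, schema URI and example values are illustrative assumptions, not a finalized schema:

```python
# Hypothetical page re-render event payload; stream name, schema URI and
# concrete values are placeholders.
page_rerender_event = {
    "$schema": "/mediawiki/page/rerender/1.0.0",   # assumed schema URI
    "meta": {
        "stream": "mediawiki.page-rerender",       # assumed stream name
        "domain": "en.wikipedia.org",
        "dt": "2024-01-01T00:00:00Z",              # time the MW hook executed
    },
    "wiki_id": "enwiki",
    "page_id": 42,
    "page_namespace": 0,
    "page_title": "Example",          # not strictly required; handy for debugging
    "index_name": "enwiki_content",   # saves a MW API call downstream
    "cluster_group": "dc1",           # assumed value format
}
```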
Open question:
Should we enrich during the preparation job or the ingestion job?
Enriching during the preparation job might require non-negligible space on the target Kafka cluster to store the enriched events:
kafka_log_size = re_renders_rate * (avg_doc_size / compression_ratio) * kafka_retention
If we take:
- re_renders_rate: 400 re-renders/s (estimated from current cirrusSearchLinksUpdate insertion rate).
- avg_doc_size: 20KiB
- compression_ratio: 2:1
- kafka_retention: 604800 secs (7 days)
kafka_log_size = 400 * (20KiB/2) * 604800 = 2.25TiB
In addition to the Kafka log size we also need to estimate the size of the Flink state holding the window used for event re-ordering and optimizations. Assuming a 10-minute window it would be:
flink_state_size = 400 * 20KiB * 600 = 4.6GiB (at least)
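For reference, a quick sanity check of the two estimates above (pure arithmetic over the figures already listed):

```python
# Back-of-envelope check of the kafka_log_size and flink_state_size estimates.
KiB, GiB, TiB = 1024, 1024**3, 1024**4

re_renders_rate = 400        # re-renders/s (cirrusSearchLinksUpdate insertion rate)
avg_doc_size = 20 * KiB      # bytes
compression_ratio = 2        # 2:1 compression in the Kafka log
kafka_retention = 604800     # seconds (7 days)
window = 10 * 60             # seconds (10-minute Flink window)

kafka_log_size = re_renders_rate * (avg_doc_size / compression_ratio) * kafka_retention
flink_state_size = re_renders_rate * avg_doc_size * window  # stored uncompressed

print(f"kafka_log_size   ~ {kafka_log_size / TiB:.2f} TiB")   # ~ 2.25 TiB
print(f"flink_state_size ~ {flink_state_size / GiB:.2f} GiB")  # ~ 4.58 GiB
```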
Having the page re-render content in Kafka might allow us to replay these updates during an in-place re-index and would save one API call for cloudelastic, but it's not clear that the space cost is worth it.
The other approach is enriching page re-renders during the ingestion job (see the sketch after this list):
- it will help to keep the Kafka backlog and the Flink state smaller
- we probably won't want to replay such updates after an in-place reindex (we don't replay them today anyway)
- this content is not addressable (not bound to a specific revision), so there's no strong reason to capture and store it
- unsure we want to track an error side-output for this kind of update
- it will act as a natural throttling mechanism, ensuring that revision-based updates are prioritized
- between 65% and 80% of these updates are discarded when they hit Elasticsearch
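A rough sketch of what that ingestion-time enrichment could look like, written as a plain blocking function for clarity. In the real pipeline this would presumably be an async operator in the Flink ingestion job; the use of the CirrusSearch cirrusbuilddoc API prop and the event field names are assumptions carried over from the payload sketch above:

```python
import requests

def enrich_rerender_event(event: dict) -> dict | None:
    """Fetch the rendered search document for a page re-render event.

    Sketch only: assumes the hypothetical event payload above and the
    CirrusSearch `cirrusbuilddoc` API prop. Returns None when the page has
    disappeared since the event was emitted, in which case the update is
    simply dropped (consistent with not tracking an error side-output).
    """
    domain = event["meta"]["domain"]
    resp = requests.get(
        f"https://{domain}/w/api.php",
        params={
            "action": "query",
            "prop": "cirrusbuilddoc",
            "pageids": event["page_id"],
            "format": "json",
            "formatversion": 2,
        },
        timeout=10,
    )
    resp.raise_for_status()
    pages = resp.json().get("query", {}).get("pages", [])
    page = pages[0] if pages else None
    if page is None or page.get("missing") or "cirrusbuilddoc" not in page:
        return None  # page gone; nothing to index
    return {**event, "document": page["cirrusbuilddoc"]}
```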
AC:
- write a schema that supports such update events
- emit these events from CirrusSearch (using EventBus?)
- consume these events from the producer job
- enrich the events (from the preparation or the ingestion job, see open question above)