**The original design is done, but we are keeping this ticket open to continue the discussion**
==== **User Story**
> As a platform engineer, I need a common MediaWiki page state change schema that can be used as a 'changelog' of page state. I can then use this to maintain a materialized view of the current state of pages outside of MediaWiki.
> As a search engineer, I need to be able to easily subscribe to ordered changes to pages to keep search indexes up to date.
==== Timebox:
- 2 weeks
==== Done is:
[x] Schema reviewed and agreed with group, including Data Engineering, Research, and Wikimedia Enterprise
[x] Schema is merged and deployed
For collaboration on this schema design, please use this [[ https://docs.google.com/document/d/1Pt5mFeRYJ1c6joiKeHCAyWhSAa1yC34G5W8yF6PPGXA/edit#heading=h.vyt1m0p2t0j6 | MediaWiki Page State Change Event Schema Design ]] google doc.
=== Details
This event stream addresses the “comprehensiveness” problem described in {T291120}.
==== How is this different from what we already have?
We do not currently have a way to get real time updates of comprehensive MediaWiki state outside of MediaWiki.
- [[ https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Edits/MediaWiki_history | MediaWiki History ]] is a monthly snapshot
- [[ https://meta.wikimedia.org/wiki/Data_dumps | Wikimedia Dumps ]] are (monthlyish) snapshots
- [[ https://stream.wikimedia.org/?doc#/streams | MediaWiki Event Streams ]] (e.g. mediawiki.revision-create) are notification streams, and do not have full state changes (e.g. no content in streams).
We want to design MediaWiki event streams that can be used to fully externalize MediaWiki state, without involving MediaWiki on the consumer side. That is, we want MediaWiki state to be carried by events to any downstream consumer.
See also: [[ https://medium.com/swlh/event-notification-vs-event-carried-state-transfer-2e4fdf8f6662 | Event Notification vs. Event-Carried State Transfer ]]
We had hoped that MediaWiki entity based changelog streams would be enough to externalize all MediaWiki state; after all, the MediaWiki revision table itself is essentially an event log of changes to pages. However, this is not strictly true, as past revisions can be updated. On page deletes, revision records are 'archived', and they can later be merged back into existing pages, updating the revision record's page id. Modeling these modifications as page changelog events would be very difficult.
Instead, this page state change data model will support use cases that only care about externalized current state. That is, we will not try to capture modifications to MediaWiki's past in this stream.
This stream will be useful for Wikimedia Enterprise, Dumps, Search updates, cache invalidation, etc, but not for keeping a comprehensive history of all state changes to pages.
We aim to create a new page ‘entity’ based stream that can be used to ‘materialize’ the current state of any MediaWiki page. An entity based stream will have all kinds of changes (creates, updates, deletes, etc.) in a single stream. That is, the mediawiki.page_change stream will have page creates, page edits, page deletes, and possibly other types of changes (page properties changes?).
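To illustrate what 'materializing' from an entity based stream means, a consumer could fold each change into a key-value view of current page state. This is a minimal sketch; the field names (`changelog_kind`, `page_id`) are hypothetical, not the final schema:

```python
# Sketch: materialize current page state from an entity-based changelog
# stream. Every event carries full current state, so inserts and updates
# can simply overwrite the stored record. Field names are hypothetical.

def apply_change(view: dict, event: dict) -> None:
    """Apply a single page change event to an in-memory view keyed by page_id."""
    kind = event["changelog_kind"]  # 'insert', 'update', or 'delete'
    page_id = event["page_id"]
    if kind in ("insert", "update"):
        view[page_id] = event       # last event wins; full state is in the event
    elif kind == "delete":
        view.pop(page_id, None)     # page no longer exists in the current state

view = {}
apply_change(view, {"changelog_kind": "insert", "page_id": 123, "title": "Foo"})
apply_change(view, {"changelog_kind": "update", "page_id": 123, "title": "Foo2"})
apply_change(view, {"changelog_kind": "delete", "page_id": 123})
```

Because every event carries the full current state, the consumer never needs to call back into MediaWiki to fill in missing fields.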
=== Decisions made
==== What is MediaWiki page state? What are the relevant entities?
- wiki/database
- page table data: e.g. page_id, page_title, etc.
- actor: the user making a change to a page
- revision
-- comment
-- content slots (MCR) (& content body)
-- rendered content slots (for derived/enriched streams)
-- editor (same as actor on edit events).
==== What is not MediaWiki page state (for now)
- page properties: these are usually parsing hints, and are not persisted through edits.
- editing restrictions: these are about edit restrictions on a page, not how the page looks. We could add these state changes later if we change our minds.
- page links changes: We have this in a different stream already, can join if this is needed.
=== Page state changes and changelog kinds
What kinds of page changes are we going to capture in this stream, and what 'changelog kind' do they map to? A 'changelog kind' is the type of change to apply to a state store: either an 'insert'/'create', an 'update', or a 'delete'. Each page change kind maps to exactly one changelog kind. (In Flink, these will be mapped to a [[ https://nightlies.apache.org/flink/flink-docs-release-1.15/api/java/org/apache/flink/types/RowKind.html | RowKind ]].)
|**MediaWiki page change kind**|**changelog kind**|
|create|insert|
|edit|update|
|current revision visibility change*|update|
|move|update|
|delete|delete|
|suppress|delete|
|undelete|insert|
*//This can happen if the comment or editor's user_text are hidden on the current revision.//
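The table above is a pure lookup, so it could be expressed as a simple mapping. A sketch (the change kind names here are illustrative, not the final enum values):

```python
# Mapping of MediaWiki page change kinds to changelog kinds, mirroring
# the table above. The string names are illustrative placeholders.
PAGE_CHANGE_TO_CHANGELOG_KIND = {
    "create": "insert",
    "edit": "update",
    "visibility_change": "update",  # current revision visibility change
    "move": "update",
    "delete": "delete",
    "suppress": "delete",
    "undelete": "insert",
}

def changelog_kind(page_change_kind: str) -> str:
    """Return the changelog kind for a given page change kind."""
    return PAGE_CHANGE_TO_CHANGELOG_KIND[page_change_kind]
```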
==== Modeling decisions
- We will keep this schema well organized, in that we are not going to force ourselves to stick with previous event data model decisions. E.g. we will have a `revision` object with revision related data, rather than top level `rev_id` and `rev_timestamp` fields. **NOTE: This decision is being revisited, see [[ https://phabricator.wikimedia.org/T308017#8402493 | this comment ]].**
- Every page change event will have ALL of the data needed to represent the current page state (page content will be in a different stream). That is, a page move event will still have all the data about the page's current revision in it, even if only the title has changed.
- We will use our existing [[ https://wikitech.wikimedia.org/wiki/Event_Platform/Schemas/Guidelines#Modeling_state_changes | prior_state modeling convention ]]
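As a sketch of how these two decisions combine, here is a hypothetical page move event: it carries the full current state, and `prior_state` holds only the fields that changed. All field names here are illustrative, not the final schema:

```python
# Hypothetical page_change event for a page move. The event carries
# the full current state (page, revision), while `prior_state` holds
# only the fields the move changed. Field names are illustrative.
move_event = {
    "changelog_kind": "update",
    "page_change_kind": "move",
    "database": "enwiki",
    "page": {"page_id": 123, "page_title": "New_Title", "namespace_id": 0},
    "revision": {"rev_id": 456, "rev_dt": "2022-11-01T00:00:00Z"},
    "prior_state": {
        "page": {"page_title": "Old_Title"},
    },
}
```

A consumer that only cares about current state can ignore `prior_state` entirely; one that needs to invalidate the old title (e.g. cache purging) reads it from `prior_state`.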
==== Outstanding TODOs and unknowns
- Nested vs flat/top level fields
-- It is difficult to work with nested fields in SQL. Perhaps flat is best. See [[ https://phabricator.wikimedia.org/T308017#8402493 | this comment ]].
- Deprecate `meta.domain` and `meta.uri`, and put that info top level
-- See [[ https://phabricator.wikimedia.org/T308017#8402493 | this comment ]].
- Message Keys
-- We'll need a message key data model too. Perhaps something like `{"database": "enwiki", "page_id": 123}` is enough.
-- We haven't yet had to think about message keys in Event Platform.
--- wikimedia-event-utilities (Java client), EventGate (HTTP produce API), EventStreams (HTTP consume API) need to support keyed messages, and likely validation of key schemas too.
- Compacted Kafka topics
-- Can we maintain just one compacted Kafka topic for each of these streams, or do we need to maintain both a non-compacted topic (e.g. with suppressions in it) and a separate compacted one (where suppression deletes are nulled/tombstoned out)?
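On message keys: the idea is that serializing the key deterministically ensures all events for the same page hash to the same Kafka partition, preserving per-page ordering. A sketch of the key serialization only (Kafka's default partitioner then hashes these bytes with murmur2; the key fields shown are the proposed ones, not a settled design):

```python
import json

# Sketch: serialize a message key as canonical JSON bytes, so that all
# events for a given (database, page_id) pair produce identical key
# bytes and therefore land on the same Kafka partition.
def serialize_key(database: str, page_id: int) -> bytes:
    # sort_keys=True makes the byte representation deterministic
    return json.dumps(
        {"database": database, "page_id": page_id}, sort_keys=True
    ).encode("utf-8")

key = serialize_key("enwiki", 123)
```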
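For the compacted topic question, the relevant semantics are that compaction retains only the most recent record per key, and a null-valued record (a tombstone) removes the key entirely. A sketch of that behavior, assuming illustrative keys and values:

```python
# Sketch of Kafka log-compaction semantics: keep only the latest value
# per key; a None value is a tombstone that deletes the key (which is
# how a page suppression would be represented in a compacted topic).
def compact(log):
    """log is a sequence of (key, value) pairs; returns the compacted view."""
    latest = {}
    for key, value in log:
        if value is None:       # tombstone, e.g. a page suppression
            latest.pop(key, None)
        else:
            latest[key] = value
    return latest

log = [
    ("enwiki:123", {"title": "Foo"}),
    ("enwiki:123", {"title": "Foo2"}),  # supersedes the first record
    ("enwiki:123", None),               # suppression -> tombstone
]
```

The open question above is whether one such compacted topic suffices, given that consumers of a non-compacted topic would still see the suppression events themselves.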