Building services that need realtime MediaWiki data outside of MediaWiki currently requires a strong runtime coupling between the service and MediaWiki. Using event streams as a source of MediaWiki data would remove this runtime coupling, but our existing MediaWiki state change events suffer from two main flaws:
- Consistency: there is no guarantee that a MediaWiki state change will result in an event being produced.
- Comprehensiveness: much of the data services need is not present in any existing MediaWiki state change event stream, and there is no easy, built-in way for MediaWiki to automatically generate state change events.
What does the future look like if this is achieved?
A reliable streaming source of truth for MediaWiki data will allow us to build new products that serve MediaWiki data in different shapes and forms in a near-realtime and flexible manner without involving MediaWiki.
What happens if we do nothing?
- Product and Engineering teams will build new services serving different read models using unreliable and incomplete data.
- These services will be tightly coupled to MediaWiki or will be built into MediaWiki itself. There will be no standardized way to copy and use MediaWiki data in realtime without directly involving MediaWiki.
- We will continue to expend engineering resources solving the same data integration problems over and over again.
- We won't be able to implement OLTP use cases that MariaDB cannot scale to support.
Consistent and comprehensive MediaWiki state change events are relevant for any service that wants to use MediaWiki data without being directly coupled to MediaWiki internals. This ties directly into MTP-Y2: Platform Evolution - Evolutionary Architecture.
Why are you bringing this decision to the Technical Forum?
- This problem involves MediaWiki core and primary MediaWiki MariaDB datastores, and is relevant to every product and engineering team that uses MediaWiki data outside of MediaWiki.
Examples of affected projects:
| Project | Use Case/Need | Primary Contacts |
|---|---|---|
| Image Recommendations | Platform Engineering needs to collect events about image changes in MediaWiki and about users accepting image recommendations, and correlate them with a list of candidate recommendations in order to decide which recommendations can be shown to users. See also Lambda Architecture for Generated Datasets. | Gabriele Modena, Clara Andrew-Wani |
| Structured Data Across Wikimedia | This will be supported by data storage infrastructure that can treat section data in wikitext as its own entity and associate topical metadata with each section entity. | |
| Structured Data Topics | Developers need a way to trigger/run a topic algorithm based on page updates in order to generate relevant section topics for users based on page content changes. | Desiree Abad |
| Similar Users | AHT, in cooperation with Research, wants to provide a feature for the CheckUser community group to compare users and determine whether they might be the same user, to help with evaluating negative behaviours. See also Lambda Architecture for Generated Datasets. | Gabriele Modena, Hugh Nowlan |
| Add a link | The Link Recommendation Service recommends phrases of text in an article to link to other articles on a wiki. Users can then accept or reject these recommendations. | Marshall Miller, Kosta Harlan |
| MediaWiki History Incremental Updates | The Data Engineering team bulk loads monthly snapshots of MediaWiki data from dedicated MariaDB replicas, transforms this data with Spark into MediaWiki History, stores it in Hive and Cassandra, and serves it via AQS. Data Engineering would like to keep this dataset up to date within a few hours using MediaWiki state change events. | Joseph Allemandou, Dan Andreescu |
| WDQS Streaming Updater | The Search team consumes Wikidata MediaWiki change events with Flink, queries MediaWiki APIs, builds a stream of RDF diffs, and updates the Blazegraph database backing the Wikidata Query Service. | David Causse, Zbyszko Papierski |
| Knowledge Store PoV | The Architecture Team’s Knowledge Store PoV consumes events, looks up content from the MediaWiki API, transforms it, stores structured versions of that content in an object store, and serves it via GraphQL. | Diana Montalion, Kate Chapman |
| MediaWiki REST API Historical Data Endpoint | Platform Engineering wants to consume MediaWiki events to compute edit statistics that can be served from an API endpoint to build iOS features. (See also this data platform design document.) | Will Doran, Joseph Allemandou |
| Cloud Data Services | The Cloud Services team consumes MediaWiki MariaDB data and transforms it for tool maintainers, sanitizing it for public consumption. (Many tool maintainers have to implement OLAP-type use cases on data shapes that don’t support them.) | Andrew Bogott |
| Wikimedia Enterprise | Wikimedia Enterprise (Okapi) consumes events externally, looks up content from the MediaWiki API, transforms it, stores structured versions of that content in AWS, and serves APIs on top of that data there. | Ryan Brounley |
| Change Propagation / RESTBase | The Platform Engineering team uses Change Propagation to consume MediaWiki change events, triggering RESTBase to store re-rendered HTML in Cassandra and serve it. | Petr Pchelko |
| Frontend cache purging | SRE consumes MediaWiki resource-purge events and transforms them into HTTP PURGE requests to clear frontend HTTP caches. | Petr Pchelko, Giuseppe Lavagetto |
| MW DerivedPageDataUpdater and friends | A collection of derived data generators that run in-process within MediaWiki or defer to the job queue. | Core team |
| Some jobs | Many jobs are pure RPC calls, but many others essentially fit this topic, driving derived data generation (Cirrus jobs, for example). | Core team, Search team, etc. |
| ML Feature store | Machine learning models need features for training and serving. These features are often derived from existing datasets and may have different latency and throughput requirements (training vs. serving, mostly). | Machine Learning Team, Chris Albon |
| MediaWiki XML dumps | XML dumps of MediaWiki data are generated twice monthly. Reworking this process to use a more up-to-date data source would be very valuable. | |
| Wikidata RDF / JSON dumps | Wikidata state changes could be used to generate more frequent and useful Wikidata dumps. | Search, WMDE |
| Revision scoring | For wikis where these machine learning models are supported, edits and revisions are automatically scored using article content and metadata. The service currently makes API calls back to MediaWiki, leading to a fragile dependency cycle and high latency. | Machine Learning team |
Let’s use the Wikidata Query Service Updater as an example. The WDQS Updater starts from a snapshot of Wikidata content. It then subscribes to the stream of revision create events to be notified when new revisions are created, and queries the MediaWiki API for the content of those revisions. That content is transformed into updates for WDQS’s datastore.
This is an example of Event Notification architecture, and is used or will be used by Wikimedia Enterprise, the Analytics Data Lake, various machine learning pipelines, the proposed Knowledge Store, Change Propagation (for RESTBase), various async MediaWiki PHP jobs, etc. Event Notification is a step in the right direction, but it does not remove the runtime coupling between the source data service (MediaWiki) and external services that need data in a different shape.
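The notification-then-fetch loop described above can be sketched as follows. This is a minimal illustration, not any team's actual code, and all names (`RevisionCreateEvent`, `fetch_revision_content`, `handle`) are hypothetical stand-ins:

```python
# Sketch of the Event Notification pattern: the event only says *that*
# something changed; the content must still be fetched from MediaWiki,
# which is exactly the runtime coupling discussed above.
from dataclasses import dataclass


@dataclass
class RevisionCreateEvent:
    page_id: int
    rev_id: int


def fetch_revision_content(rev_id: int) -> str:
    # Hypothetical stand-in for a synchronous call back to the MediaWiki API.
    return f"content of revision {rev_id}"


def handle(store: dict, event: RevisionCreateEvent) -> None:
    # Every event forces an extra round trip to MediaWiki before the
    # consumer's own datastore can be updated.
    store[event.page_id] = fetch_revision_content(event.rev_id)


store = {}
for ev in (RevisionCreateEvent(1, 100), RevisionCreateEvent(1, 101)):
    handle(store, ev)
```

If MediaWiki is slow or unavailable, every consumer built this way stalls with it.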
Making it possible to rely on MediaWiki event streams as a ‘source of truth’ for MediaWiki will incrementally allow us to build services using Event Carried State Transfer and/or Event Sourcing architectures, which enable us to use CQRS to serve different read models (see the Why? section above). Note that none of these architectures is specifically prescribed by this decision record. We just want to make it possible to build reliable data systems using event-driven architectures.
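For contrast, a minimal sketch of Event Carried State Transfer, again with hypothetical names (`RevisionStateEvent`, `apply`): the event carries the new state itself, so a consumer can maintain its own read model without ever calling back into MediaWiki.

```python
# Event Carried State Transfer sketch: the state travels with the event,
# so the stream itself can serve as the source of truth for consumers.
from dataclasses import dataclass


@dataclass
class RevisionStateEvent:
    page_id: int
    rev_id: int
    content: str  # the state is carried in the event


def apply(read_model: dict, ev: RevisionStateEvent) -> None:
    # No API round trip: the consumer builds its read model from the
    # stream alone, at whatever pace and in whatever shape it needs.
    read_model[ev.page_id] = ev.content


read_model = {}
apply(read_model, RevisionStateEvent(1, 100, "first draft"))
apply(read_model, RevisionStateEvent(1, 101, "second draft"))
```

The same stream can feed many such read models (search indexes, caches, dumps), which is what makes the CQRS-style architectures mentioned above possible.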
Ultimately we’d like to treat all Wikimedia data in this way, allowing us to build a platform supporting cataloged and sourceable streaming shared business data from which any service or product can draw. Making MediaWiki event production consistent and more comprehensive is an important step in that direction.
The Consistency problem
The existing streams of MediaWiki events are not consistent. For example, there is no guarantee that a revision saved in a MediaWiki database will result in a revision create event. This makes it difficult for consumers of these events to rely on them for event carried state transfer.
In a distributed system, data will never be 100% consistent. This is true even now for the MediaWiki MariaDBs. (Some MediaWiki data relies on writes to different MariaDB instances, with no way to update the data transactionally.) However, we currently rely on MariaDB replication to distribute MediaWiki state and scale database reads. The acceptable level of consistency for MariaDB replica data should be explicitly defined using SLOs. We generally accept that MariaDB state replication may lag, but we do not accept it being incorrect.
Hopefully, the SLOs we define for MariaDB replication consistency will be the same SLOs we define for event stream consistency.
For example, we know that currently ~1% of revision create events are missing from the event streams, and we are not sure why. Missing 1% of rows in MariaDB replicas would violate any acceptable SLO, and it should be no different for event streams. Ideally, if data were missing in MariaDB replicas, the same data would be missing in the event streams.
2022 EDIT: ^ is no longer true. We do miss some data, but the amount is now small, thanks to fixes by Petr Pchelko in T215001: Revisions missing from mediawiki_revision_create. We will need to solve the consistency problem, but the urgency is less.
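One way to quantify a consistency gap like the ~1% figure above is to reconcile revision ids seen in the database against those seen in the event stream. A sketch, with a hypothetical `missing_event_fraction` helper:

```python
# Reconciliation sketch: estimate the fraction of database revisions
# that never produced a corresponding event in the stream.
def missing_event_fraction(db_rev_ids, stream_rev_ids):
    """Fraction of database revisions with no matching event."""
    db = set(db_rev_ids)
    if not db:
        return 0.0
    return len(db - set(stream_rev_ids)) / len(db)


# 1 revision out of 100 has no event -> a 1% gap, like the one above.
gap = missing_event_fraction(range(100), range(99))
```

An SLO for event stream consistency could then be expressed as an upper bound on this fraction over some time window.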
The Comprehensiveness problem
Data needed by many of the use cases listed above is not in the existing MediaWiki state change event streams, e.g. article content (wikitext or otherwise).
When bootstrapping, a service may need to make many requests to the MediaWiki API to get its data, overloading the API and/or causing the bootstrapping process to take a very long time. During normal operation, contacting the MediaWiki API in a realtime pipeline adds external latency that could be avoided.
Specifically, getting content and other data out of the MediaWiki API suffers from the following problems:
- Dealing with API retries
- Stale reads due to MariaDB replica lag
- Quick deletes (cannot differentiate between a stale read and a deleted revision)
If relevant MediaWiki state changes were captured in event streams, services could be runtime decoupled from MediaWiki.
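The problems listed above mean every consumer today carries logic like the following sketch (all names hypothetical): retries around the API, where an empty result is ambiguous because it could be replica lag (a stale read) or a quick delete.

```python
# Sketch of what each API-polling consumer must implement today.
def fetch_with_retry(fetch, rev_id, max_attempts=3):
    """Retry an API lookup; a missing result stays ambiguous."""
    for _ in range(max_attempts):
        content = fetch(rev_id)
        if content is not None:
            return content
    # Still nothing after retrying: stale read from a lagging replica,
    # or a revision that was quickly deleted? We cannot tell which.
    return None


# Simulated lagging replica: the first read misses, the second succeeds.
attempts = {"n": 0}


def lagging_fetch(rev_id):
    attempts["n"] += 1
    return None if attempts["n"] == 1 else f"rev {rev_id}"


result = fetch_with_retry(lagging_fetch, 7)
```

With state carried in the events themselves, this entire class of retry and stale-read handling disappears from consumers.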
- Messaging as the Single Source of Truth
- Turning the database inside-out with Apache Samza
- What do you mean by “Event-Driven”?
- T120242: Consistent MediaWiki state change events | MediaWiki events as source of truth
- T290203: Discussion of Event Driven Systems
- Lambda Architecture for Generated Datasets
- Central Data Storage & Pipelines Workstream
- Shared-Data Platform for Knowledge as a Service [draft]
- Data Infrastructure Integration WG notes
- MediaWiki Event completeness meeting discussion notes
- Problem Statement
- Problem Statement feedback
- Decision Records