
MediaWiki Event Carried State Transfer - Problem Statement
Open, High, Public

Description

What?

Building services that need MediaWiki data outside of MediaWiki currently requires a strong runtime coupling between the service and MediaWiki. Using event streams as a source of MediaWiki data would remove this runtime coupling, but our existing MediaWiki state change events suffer from two main flaws:

  • Consistency: there is no guarantee that a MediaWiki state change will result in an event being produced
  • Comprehensiveness: much of the data needed by a service is not in any existing MediaWiki state change event stream, and there is no easy, built-in way for MediaWiki to automatically generate state change events.

What does the future look like if this is achieved?

A reliable streaming source of truth for MediaWiki data will allow us to build new products that serve MediaWiki data in different shapes and forms in a near-realtime and flexible manner without involving MediaWiki.

What happens if we do nothing?

  • Product and Engineering teams will build new services serving different read models using unreliable and incomplete data.
  • These services will be tightly coupled to MediaWiki or will be built into MediaWiki itself. There will be no standardized way to copy and use MediaWiki data in realtime without directly involving MediaWiki.
  • We will continue to expend engineering resources solving the same data integration problems over and over again.
  • We won't be able to implement OLTP use cases that are not scalable using MariaDB.

Why?

Consistent and comprehensive MediaWiki state change events are relevant for any service that wants to use MediaWiki data without being directly coupled to the MediaWiki internals. This ties directly into MTP-Y2: Platform Evolution - Evolutionary Architecture.

Why are you bringing this decision to the Technical Forum?

  • This problem involves MediaWiki core and primary MediaWiki MariaDB datastores, and is relevant to every product and engineering team that uses MediaWiki data outside of MediaWiki.

Examples of affected projects:

Project | Use Case/Need | Primary Contacts
Image Recommendations | Platform engineering needs to collect events about image changes in MediaWiki and when users accept image recommendations, and correlate them with a list of possible recommendations, in order to decide what recommendations are available to be shown to users. See also Lambda Architecture for Generated Datasets. | Gabriele Modena, Clara Andrew-Wani
Structured Data Across Wikimedia | This will be supported by data storage infrastructure that can structure section data in wikitext as its own entity and associate topical metadata with each section entity. |
Structured Data Topics | Developers need a way to trigger/run a topic algorithm based on page updates in order to generate relevant section topics for users based on page content changes. | Desiree Abad
Similar Users | AHT, in cooperation with Research, wish to provide a feature for the CheckUsers community group to compare users to determine if they might be the same user, to help with evaluating negative behaviours. See also Lambda Architecture for Generated Datasets. | Gabriele Modena, Hugh Nowlan
Add a link | The Link Recommendation Service recommends phrases of text in an article to link to other articles on a wiki. Users can then accept or reject these recommendations. | Marshall Miller, Kosta Harlan
MediaWiki History Incremental Updates | The Data Engineering team bulk loads monthly snapshots of MediaWiki data from dedicated MariaDB replicas, transforms this data using Spark into a MediaWiki History, stores it in Hive and Cassandra, and serves it via AQS. Data Engineering would like to keep this dataset up to date within a few hours using MediaWiki state change events. | Joseph Allemandou, Dan Andreescu
WDQS Streaming Updater | The Search team consumes Wikidata change MediaWiki events with Flink, queries MediaWiki APIs, builds a stream of RDF diffs, and updates their Blazegraph database for the Wikidata Query Service. | David Causse, Zbyszko Papierski
Knowledge Store PoV | The Architecture Team’s Knowledge Store PoV consumes events, looks up content from the MediaWiki API, transforms it, stores structured versions of that content in an object store, and serves it via GraphQL. | Diana Montalion, Kate Chapman
MediaWiki REST API Historical Data Endpoint | Platform Engineering wants to consume MediaWiki events to compute edit statistics that can be served from an API endpoint to build iOS features. (See also this data platform design document.) | Will Doran, Joseph Allemandou
Cloud Data Services | The Cloud Services team consumes MediaWiki MariaDB data and transforms it for tool maintainers, sanitizing it for public consumption. (Many tool maintainers have to implement OLAP-type use cases on data shapes that don’t support that.) | Andrew Bogott
Wikimedia Enterprise | Wikimedia Enterprise (Okapi) consumes events externally, looks up content from the MediaWiki API, transforms it, stores structured versions of that content in AWS, and serves APIs on top of that data there. | Ryan Brounley
Change Propagation / RESTBase | The Platform Engineering team uses Change Propagation to consume MediaWiki change events, causing RESTBase to store re-rendered HTML in Cassandra and serve it. | Petr Pchelko
Frontend cache purging | SRE consumes MediaWiki resource-purge events and transforms them into HTTP PURGE requests to clear frontend HTTP caches. | Petr Pchelko, Giuseppe Lavagetto
MW DerivedPageDataUpdater and friends | A ‘collection’ of various derived data generators running in-process within MW or deferring to the job queue. | Core team
Some jobs | Many jobs are pure RPC calls, but many jobs basically fit this topic, driving derived data generation. Cirrus jobs, for example. | Core team, Search team, etc.
ML Feature store | Machine Learning models need features to be trained and served. These features are often derived from existing datasets, and may have different requirements for latency and throughput (training vs. serving, mostly). | Machine Learning Team, Chris Albon
MediaWiki XML dumps | XML dumps of MediaWiki data are generated semi-monthly. Reworking this process to work on a more up-to-date data source would be very valuable. |
Wikidata RDF / JSON dumps | Wikidata state changes could be used to generate more frequent and useful Wikidata dumps. | Search, WMDE
Revision scoring | For wikis where these machine learning models are supported, edits and revisions are automatically scored using article content and metadata. The service currently makes API calls back to MediaWiki, leading to a fragile dependency cycle and high latency. | Machine learning team

More context

Let’s use the Wikidata Query Service Updater as an example. WDQS Updater starts from a snapshot of Wikidata content. It then subscribes to the event stream of revision create events to get notifications of when new revisions are created, and queries the MediaWiki API to get the content for those revisions. That content is transformed into updates for WDQS’s datastore.

This is an example of Event Notification architecture, and is used or will be used by Wikimedia Enterprise, the Analytics Data Lake, various machine learning pipelines, the proposed Knowledge Store, Change Propagation (for RESTBase), various async MediaWiki PHP jobs, etc. Event Notification is a step in the right direction, but it does not remove the runtime coupling between the source data service (MediaWiki) and external services that need data in a different shape.
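To make the pattern concrete, below is a minimal sketch of this notification-plus-lookup loop: a consumer of the public EventStreams SSE endpoint that calls back into the Action API for content. It is illustrative only; error handling, retries, and offset tracking are omitted, and the Wikidata-only filtering is just an assumption to mirror the WDQS example.

```
import json
import requests

STREAM_URL = "https://stream.wikimedia.org/v2/stream/mediawiki.revision-create"
API_URL = "https://www.wikidata.org/w/api.php"

def revision_create_events():
    """Yield revision-create events from the public EventStreams SSE endpoint."""
    resp = requests.get(STREAM_URL, stream=True, timeout=60)
    for line in resp.iter_lines(decode_unicode=True):
        if line and line.startswith("data: "):
            yield json.loads(line[len("data: "):])

def fetch_revision_content(rev_id):
    """Call back into the Action API for the revision content: this is exactly
    the runtime coupling the problem statement wants to remove."""
    params = {
        "action": "query",
        "prop": "revisions",
        "revids": rev_id,
        "rvprop": "content",
        "rvslots": "main",
        "format": "json",
    }
    return requests.get(API_URL, params=params, timeout=30).json()

for event in revision_create_events():
    if event.get("database") != "wikidatawiki":
        continue  # mirror the WDQS example: only Wikidata edits
    content = fetch_revision_content(event["rev_id"])
    # ... transform content into RDF diffs and update the WDQS datastore ...
```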

Making it possible to rely on MediaWiki event streams as a ‘source of truth’ for MediaWiki will incrementally allow us to build services using Event Carried State Transfer and/or Event Sourcing architectures, which enable us to use CQRS to serve different read models (see the Why? section above). Note that none of these architectures are specifically prescribed by this decision record. We just want to make it possible to build reliable data systems using event driven architectures.

Ultimately we’d like to treat all Wikimedia data in this way, allowing us to build a platform supporting cataloged and sourceable streaming shared business data from which any service or product can draw. Making MediaWiki event production consistent and more comprehensive is an important step in that direction.

The Consistency problem

The existing streams of MediaWiki events are not consistent. For example, there is no guarantee that a revision saved in a MediaWiki database will result in a revision create event. This makes it difficult for consumers of these events to rely on them for event carried state transfer.

In a distributed system, data will never be 100% consistent. This is true even now for the MediaWiki MariaDBs. (Some MediaWiki data relies on writing to different MariaDB instances, for which there is no way to update data transactionally.) However, we currently rely on MariaDB database replication for distributing MediaWiki state to scale database reads. The level of acceptable consistency of MariaDB replica data should be explicitly defined using SLOs. We generally accept that MariaDB replication of state may be late, but we do not accept it being incorrect.

Hopefully, the SLOs we define for MariaDB replication consistency will be the same SLOs we define for event stream consistency.

For example, we know that currently, ~1% of revision create events are missing from the event streams, and we are not sure why. Missing 1% of rows in MariaDB replicas would be an unacceptable SLO, and it shouldn't be for event streams either. Ideally, if there were missing data in MariaDB replicas, the same data would be missing in event streams.

2022 EDIT: ^ is no longer true. We do miss some data, but the amount is now small, thanks to fixes by Petr Pchelko in T215001: Revisions missing from mediawiki_revision_create. We will need to solve the consistency problem, but the urgency is less.

The Comprehensiveness problem

Data needed by many of the use case examples listed above is not in the existing MediaWiki state change event streams, e.g. article content (wikitext or otherwise).

When bootstrapping, a service may need to make many requests to the MediaWiki API to get its data, potentially overloading the API and/or causing the bootstrapping process to take a very long time. During normal operation, contacting the MediaWiki API in a realtime pipeline adds external latency that could be avoided.

Specifically, getting content and other data out of the MediaWiki API suffers from the following problems:

  • Dealing with API retries
  • Stale reads due to MariaDB replica lag
  • Quick deletes (cannot differentiate between a stale read and a deleted revision)

If relevant MediaWiki state changes were captured in event streams, services could be runtime decoupled from MediaWiki.

NOTE: Unless we replicate low-level database state changes, true ‘completeness’ of state change data will be hard to accomplish. Instead, we may consider providing a standard and default way of producing MediaWiki state changes (perhaps using some core abstraction on top of the MediaWiki hook mechanism?). This would encourage (or require?) developers to produce state change events for updates to MediaWiki data.
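As a sketch of what such a standard production path might look like from a producer’s point of view: the endpoint, stream name, and schema URI below are hypothetical, and in practice this would be emitted from inside MediaWiki (e.g. via a hook abstraction) rather than from a standalone script.

```
from datetime import datetime, timezone

import requests

# Hypothetical EventGate-style intake endpoint and stream name, for illustration only.
EVENTGATE_URL = "https://eventgate.example.org/v1/events"
STREAM = "mediawiki.page_section_change"

def emit_state_change(page_id, section, wikitext):
    """POST a single state change event to an EventGate-style intake service."""
    event = {
        # $schema and meta.stream follow the Event Platform convention;
        # this particular schema URI is made up for the example.
        "$schema": "/mediawiki/page/section_change/1.0.0",
        "meta": {
            "stream": STREAM,
            "dt": datetime.now(timezone.utc).isoformat(),
            "domain": "en.wikipedia.org",
        },
        "page_id": page_id,
        "section": section,
        "wikitext": wikitext,
    }
    resp = requests.post(EVENTGATE_URL, json=[event], timeout=10)
    resp.raise_for_status()
```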

Further reading

Tech Forum docs

Event Timeline


The theory is that event sourcing is specifically designed to be replayed, so this greatly simplifies the requirement to build quasi-two-phase-commit architecture.

That would require us to REPLACE the database as the primary source of truth. That basically means rearchitecting and rewriting a good chunk of MW. Even if we knew that this is the direction we want to take for the future, we are far, far away from it. I think it makes more sense to focus on getting a reasonably reliable stream of events from MediaWiki, in a standardized way.

The format of "problem statement first, solutions later" is very hard to follow, since I feel an immediate urge to jump to solutions, but I'll try to stick to the problem.

After our most recent set of improvements under T215001, and thanks to analysis by @Milimetric and @JAllemandou, we're now down to 0.0007% of revision-create events missing. We do not have exact numbers for the rest of the event types since we didn't do as thorough an analysis, but it's safe to assume that the numbers are similar. IMHO, this already looks quite good. I believe it's still possible to drop this number even more without drastically reinventing the architecture. We could look more into eventgate (interactions between the node.js thread pool and librdkafka's own thread pool have always been suspicious), or we can experiment with replacing eventgate with a librdkafka-based PHP kafka driver (there's a new one on the market that we've never evaluated).

I wonder, for which use-cases would 0.0001% not be enough? 0.00001%? Also, revisions are 'self-healing' for use-cases that rely on the latest view of the data: when the page is next edited, the view of the latest state becomes consistent again. I believe that if there are use-cases where 100% reliability is required we can find solutions based on some form of eventual consistency or reconciliation.

Non-guaranteed delivery of things like page deletes or revision suppressions is more important, since these are not going to be naturally eventually consistent like revisions. But these operations are much, much rarer than page edits, so perhaps we shouldn't put them in one bag with revisions - solutions that are unacceptable for revisions might be totally OK for the much rarer page deletes or renames.

The problem statement seems to suggest there will be a golden bullet that would allow us to make all events 100% reliable. That's mathematically impossible without sacrificing availability. I think the solution would be unique for each type of event or use-case, with a sliding scale between availability and consistency. The problem now, IMHO, is that we don't have any other option of event delivery and we don't have any easy option for reconciliation, so our sliding scale is missing a knob.

Thanks Petr! For consistency's sake (;p), let's keep discussions about consistency in T120242: Eventually Consistent MediaWiki State Change Events.

The problem statement seems to suggest there will be a golden bullet that would allow us to make all events 100% reliable.

I hope not! But, the problem statement does suggest that we should try to make (important/relevant) MW state change events as (or as close to) reliable as MariaDB replication. That seems possible to me. Perhaps reconciliation will be the final solution, but I think there are other options we should consider too.

But, the problem statement does suggest that we should try to make (important/relevant) MW state change events as (or as close to) reliable as MariaDB replication.

I know this is jumping to "solutioning" but the 'cheap' thing that is as reliable as MariaDB replication are MariaDB binlogs. Can the types of events desired be reconstructed from binlogs?

I know this is jumping to "solutioning" but the 'cheap' thing that is as reliable as MariaDB replication are MariaDB binlogs. Can the types of events desired be reconstructed from binlogs?

@bd808, see T120242: Eventually Consistent MediaWiki State Change Events

A lot of thoughtful comments have been made, and like others I find it difficult to separate the technical options/trade-offs from the problem statement itself. There seems to be a consensus that a reliable events architecture is desirable, but not on whether we are willing to pay a price in reliability for it. Some points we could include to make the problem statement more "decidable":

  • a commitment to create a core event infrastructure that is a source of truth, for which we are willing to pay a price in reliability (and engineering resources). That price is determined by many factors yet to be determined/discussed, and should be "configurable" by SRE. I.e. there will be some sort of coupling between the transactions to the MW DB and the event system; if one transaction fails the other gets a chance to react to it. This is difficult to discuss without getting technical, but without coupling there are no events with guarantees, and we can't have coupling without a price in reliability.
  • explicitly exclude, as a goal, making events a source of truth for MW itself. This seems too risky/ambitious; it could still be done more incrementally in the future once events are a source of truth.

The changes required to make events a core feature with guarantees, and the technical discussions/decisions preceding it, should also offer the chance to decrease the overall complexity of the WMF systems. We currently pay a hard-to-quantify but non-trivial price to build new systems/services that don't need to be a part of MW, in terms of human effort but also in terms of the options that are even considered feasible/reasonable to implement. Have we done a back-of-the-envelope evaluation of the projects listed in the description, in terms of estimated work given the status quo vs estimated work given a core event infrastructure with some correctness guarantees/streaming pipelines/etc? In my opinion, this aggregate cost needs to be weighed against the price in reliability it would cost us to build a core event infrastructure.

Hello, apologies for a late question (I had marked the "feedback" deadline for the wrong Friday and only now realized it was last week's Friday):
I might simply know too little about the current MediaWiki "event system", but reading through the problem statement, and thinking of its title ("source of truth"), I do not fully understand why it focuses only on "services that need realtime MediaWiki data outside of MediaWiki". If the event system were reliable enough to be considered a "source of truth", I'd imagine that (at least some parts of) MediaWiki (and its extensions) could actually move away from using the current SQL tables (revision, text, etc) to get the data they need. Is the different "performance" requirement level something that leads to ruling this application out and focusing on "services outside of MediaWiki"? Or am I misinterpreting what "MediaWiki" refers to here? (e.g. would a hypothetical event-based service that updates Wikipedia infoboxes when Wikidata data changes -- currently implemented as Wikidata pushing updates to Wikipedias via MediaWiki job queues -- be an instance of a "service outside of MediaWiki"?)

I apologize if it occurs to be a loaded/non-constructive question. It is not my intention.

My understanding is that some of us got swept up into discussing the choice of a single source of truth (SSOT) but the bug author meant a source of truth, in a relative sense. In other words, the RFC is concerned with making the event streams trustworthy, but not trying to rearchitect MediaWiki to model all data flow as events. I propose that the RFC be renamed to clear this up, maybe to "Reliable event streams"?

As long as streams are reliable, we get most of the benefits of CQRS/ES even without our entire application using events internally. Waving hands in the general direction of what we might be missing without going for the "full" rewrite: tight RDBMS coupling will keep our application mostly monolithic, the API surface will continue to be the primary write interface, and it will continue to be awkward for mw-core to ingest from externally-produced event sources.

oh dear, your favorite facepalm meme could go here. Thanks @awight for pointing out my stupidity. I am apparently so deep into the cool kids talks about the single source of truth all over the internet, that I've semiconsciously added it to the title of this document, while it indeed "only" talks about a source of truth.
Apologies for the noise.

After some discussion with Tim, it seems to me like the most realistic way to do this is Change Data Capture (CDC): we look at the database's replication log and generate events from that. However, direct CDC means binding directly to the raw database schema. That should be avoided. We'd want a layer that turns database row changes into more abstract events for general consumption. The code for the CDC event emitter should live in the same repo as the code that writes to the database - that is, it would live in MW core. It would be executed as a permanently running maintenance script. The log position could be maintained in Kafka. This would provide parity with database replication in terms of latency and consistency.

I understand that we should not be "solutioning" too much while discussing the problem statement. I'm outlining the solution mainly to better understand the requirements - would this solution solve the problem outlined? Does it match the requirements?
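For what it's worth, here is a rough sketch of the CDC loop outlined above, using the python-mysql-replication and confluent-kafka libraries purely for illustration. Hostnames, credentials, table and topic names are placeholders, and a real implementation would live in MW core as described, persisting its binlog position (e.g. in Kafka) rather than starting fresh.

```
import json

from confluent_kafka import Producer
from pymysqlreplication import BinLogStreamReader
from pymysqlreplication.row_event import WriteRowsEvent

# Placeholders; a real deployment would resume from a stored binlog position.
MYSQL = {"host": "db-replica.example.org", "port": 3306, "user": "repl", "passwd": "secret"}
producer = Producer({"bootstrap.servers": "kafka.example.org:9092"})

stream = BinLogStreamReader(
    connection_settings=MYSQL,
    server_id=4242,                 # must be unique among replication clients
    only_events=[WriteRowsEvent],   # inserts only, for brevity
    only_tables=["revision"],
    blocking=True,
    resume_stream=True,
)

for binlog_event in stream:
    for row in binlog_event.rows:
        values = row["values"]
        # Translate the raw row change into a more abstract domain event,
        # rather than exposing the database schema directly to consumers.
        event = {
            "entity": "revision",
            "action": "create",
            "rev_id": values.get("rev_id"),
            "page_id": values.get("rev_page"),
        }
        producer.produce("mediawiki.revision-create.cdc", value=json.dumps(event))
    producer.poll(0)  # serve delivery callbacks
```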

My problem is that we don't know why we want a solution in this direction. The underlying problem is mentioned as:

For example, we know that currently, ~1% of revision create events are missing from the event streams

Let's assume the issue is in eventgate. Then we will be building a lot of code that would couple with mediawiki's database schema (increasing the cost of its much-needed improvements) for it to lose the 1% again and gain nothing. I would really like to see an investigation on this 1%. Where it vanishes into the void, how it can be improved, etc. If we know it's something in the architecture of the system, then yes, let's change the architecture.

Re CDC idea, pros and cons, see T120242: Eventually Consistent MediaWiki State Change Events, let's keep that discussion there. @daniel see the Hybrid CDC + Transactional Outbox idea for how we could have control over the event model, but still use the binlog directly (without a complicated raw row change translation layer).

For example, we know that currently, ~1% of revision create events are missing from the event streams

Recent fixes (some envoy timeout race conditions) may have brought this number way way down, which is great. We are about to have the full month of October imported into Hadoop, so we can do further analysis to really get this number.

However,

If we know it's something in the architecture of the system, then yes, let's change the architecture.

We do know this. No matter how reliable we make the deferred update async POSTing of a core state change event, we know that there will be discrepancies that over time will really matter. A solution like CDC might make the stream so good that any discrepancies (between what is in the binlog and what is in the stream) would totally disappear...but also maybe they won't? How confident are we currently that a MariaDB replica's database, or even just its revision table, is 100% consistent with the master? Should a reliable event stream do better? I'd argue that it should be AS good, or better.

In either case, others have often argued that even if we get our event production consistency really really good, we will always need a way to reconcile discrepancies in the stream with the 'real' source of truth...the MediaWiki MariaDB. Perhaps all we will need is better EventGate SLOs (which maybe we can make now?) and a reconciliation mechanism built into MediaWiki. I was skeptical that this could work, but I'm starting to be convinced it might be possible. Hm, I should update the description of T120242 to include that as a possible solution.
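To sketch what such a reconciliation mechanism might look like (all helper functions here are hypothetical placeholders; the point is only the shape of the job: compare a past time window between the replica and the stream, and re-emit whatever is missing):

```
from datetime import datetime, timedelta, timezone

# Hypothetical placeholders: in reality these would query a MariaDB replica,
# the consumed event store (e.g. the data lake), and an event intake service.
def revision_ids_in_replica(start, end):
    return []  # placeholder: query the replica's revision table for this window

def revision_ids_in_stream(start, end):
    return []  # placeholder: query the revision-create events already consumed

def reemit_revision_create(rev_id):
    pass       # placeholder: re-produce the missing event

def reconcile(window_hours=1, lag_hours=2):
    """Re-emit revision-create events that never made it into the stream,
    comparing a past window (lagged to allow for late events) to the replica."""
    end = datetime.now(timezone.utc) - timedelta(hours=lag_hours)
    start = end - timedelta(hours=window_hours)
    missing = set(revision_ids_in_replica(start, end)) - set(revision_ids_in_stream(start, end))
    for rev_id in sorted(missing):
        reemit_revision_create(rev_id)
```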

@WMDE-leszek, if we can rely on event streams to source MW data for non-MW services...there's no reason we couldn't do it for MW too (at least in cases where MW is ok with eventual consistency). This problem statement def focuses on external services though, since MW already has a 'source of truth' that it alone has access to.

The replies above suggest that this ticket is indeed about events as the source of truth rather than a source of truth. @Ottomata can you clarify?

The replies above suggest that this ticket is indeed about events as the source of truth rather than a source of truth. @Ottomata can you clarify?

Hm, I'm not sure, but I think the confusion is that if events are the (only?) source of truth then we'd have to event source MediaWiki in order for MediaWiki itself to be consistent. While that would be a nice solution if we were starting from scratch, doing that is intractable and not what this problem statement is concerned with. Instead, this is about being able to use MediaWiki state change events as a reliable source of truth for MediaWiki state changes. This would make it possible to externalize MediaWiki state in realtime without a runtime coupling to MediaWiki (and its MariaDB db).

In either case, others have often argued that even if we get our event production consistency really really good, we will always need a way to reconcile discrepancies in the stream with the 'real' source of truth...the MediaWiki MariaDB. Perhaps all we will need is better EventGate SLOs (which maybe we can make now?) and a reconciliation mechanism built into MediaWiki.

This is a really exciting suggestion—if a reconciliation mechanism is built-in and can be made flexible enough, it could also solve for a broad range of use cases:
  • edit conflict resolution
  • real-time collaborative editing
  • offline editing
  • and federation, just to name a handful...

Instead, this is about being able to use MediaWiki state change events as a reliable source of truth for MediaWiki state changes. This would make it possible to externalize MediaWiki state in realtime without a runtime coupling to MediaWiki (and its MariaDB db).

If this is the case, I suggest removing the "source of truth" wording from the proposal entirely. As is evident from the discussion on this ticket, it's bound to create confusion. Personally, I have never seen "source of truth" used to refer to anything other than the single source of truth. How about just saying that we need MediaWiki to act as a reliable event source, and focus the discussion on what "reliable" should mean in terms of consistency and latency.

I suggest removing the "source of truth" wording from the proposal entirely
I have never seen "source of truth" used to refer to anything other than the single source of truth

Hm, perhaps you are right.

However, this ticket is about making it possible to use MediaWiki produced events as the only source of truth a non-MediaWiki service needs to get MediaWiki data. I'm not suggesting that MediaWiki also only use events as SSOT (single source of truth) because of the intractability of that. It is much easier to design Event Sourced services from the start, rather than try to convert one to Event Sourced later. MediaWiki is huge and complicated and too important; I wouldn't suggest any kind of concerted effort to change its data architecture outright. If we get events as a source of truth, then parts of MediaWiki could incrementally be moved to an Event Sourced architecture. In an (my?) ideal world, this could be a theoretical goal, so in that sense I'd love to theoretically make MediaWiki events THE single source of truth for everything, including MediaWiki. It's just that I don't think doing so is practical, so I excluded that from consideration in this problem statement.

I suppose, since this is just a very abstract problem statement, we could frame the problem to include MediaWiki too? Although that would probably confuse more people and cause more friction.

How about just saying that we need MediaWiki to act as a reliable event source

Reliable and comprehensive. Meaning that any time someone wants to set up a new service with transformed MediaWiki data, there is a way to get all the state they need from events. If the specific state changes they need aren't yet in the streams, there should be something that allows them to code up MediaWiki to begin emitting that state as events, for themselves and others to use.

Just following up because I may have been wrong to introduce the idea of "a" source of truth above. I did this to try to show where the confusion in this task is coming from, but it's important to note that "single source of truth" has a very strict definition: if we had a SSOT, it would be the MediaWiki primary database and there would be no other sources of truth. For example, if you want to make an update it must be to this database, not in an event stream, etc.

Also following the definition of SSOT, the event source records are not another source of truth. They are in fact breaking ideal SSOT because they are copying data literally rather than as a reference. If we sent events that simply said "revision: 123" and you had to query the database to get the revision content, this would be closer to pure SSOT. Instead, we inline the content of the revision, which becomes problematic if a revision is deleted. In this case the event is out-of-sync with the actual source of truth. This is exactly the reason that SSOT exists, and shows how breaking it causes issues for us.

+1 that it seems most productive to focus on just "reliable and complete" event sourcing for this RFC, and give it a concrete target threshold of percentage messages dropped.

which becomes problematic if a revision is deleted

If it is deleted without a corresponding revision-delete state change event, that is problematic. Otherwise, fine, no?

They are in fact breaking ideal SSOT because they are copying data literally rather than as a reference

IIUC it is totally fine to copy data literally, as long as you are copying it from the SSOT, and the SSOT is reliable and complete.

I'd like to make it possible for everything except MediaWiki to consider events as their source of truth. Technically this isn't SINGLE source of truth, I guess, since the fundamental MW truth is MariaDB. I'd like to abstract that away from everything else, so from the perspective of a service that doesn't have access to MW MariaDB, the events are the only 'source of truth' they need.

which becomes problematic if a revision is deleted

If it is deleted without a corresponding revision-delete state change event, that is problematic. Otherwise, fine, no?

I only call this problematic because it's important that our downstream consumers respect the social convention that revision-delete is projected to delete previously received data. This makes event sourcing feel slightly uncomfortable, similar to if we were broadcasting protected data such as PII but asking our consumers to only retain for at most the legally-allowed 30-day period, for example. In both cases consumers have an obligation to periodically transform the entire history of each stream in order to comply with rules, which feels at odds with the ideal of a write-only, permanent event store.

But I agree that it's totally fine to do this, and to copy denormalized data literally, sending as events. What events have already done brilliantly is to give CQRS access to MediaWiki records, to external peers. From the perspective of an event consumer, it doesn't matter where the boundaries are between our event producer and the authoritative DB rows. Whether this is a tightly-coupled replication log follower or a set of MediaWiki hooks, the event stream will look the same, the only differences are in the expected reliability.

I think you said it well, that events are the only truth many consumers will need. Let's abandon this "truth" terminology if possible, though—if data is followed through our systems e.g. as a data-flow diagram, then each data-transforming process has a source data store and a sink data store. The source of any one thread of data flow is not necessarily true, it could already be out-of-date relative to the furthest upstream producer. Truth is a very strong claim to make about a piece of data (I would even hesitate to claim that the primary db contains truth), and also doesn't have a convenient pairing (i.e., the sink of a data flow is not "untrue") so not a good fit here. One could say "authoritative", but that would be tiresome to repeat across each data flow: "events are less authoritative than the primary database, but more authoritative than a locally-cached projection of these events". Maybe we can avoid this sort of language entirely, instead pointing to a DFD showing that e.g. the EventStreams ES endpoint mediawiki.revision-create mirrors the internal Kafka topic of the same name, which is produced by the EventBus extension hooking into onRevisionRecordInserted. Describing as a series of data flows lets the reader see the relationship between each node, choose which is the most appropriate for their use case, and also hints at potential losses through each linkage.

Of course, the point of this task is that we should project event streams from all important MediaWiki activities, and do it reliably. I'm completely in favor, apologies for all the words thrown at such a small issue of language. I would love to see what additional event streams are being considered (doc is not readable to the public yet). I'm sure that OKAPI will provide plenty of business motivation to guarantee a high level of reliability, deciding on the exact threshold is more of a cost/benefit thing than a RFC question. So maybe we should be talking about what's in the newly proposed event streams, governance over data retention, and for fun we could look at the potential for *incoming* event streams?

I only call this problematic because it's important that our downstream consumers respect the social convention that revision-delete is projected to delete previously received data

This is true of the replication binlog now, right? (If MariaDB didn't respect deletes). But also kafka compacted topics may help.
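A small sketch of how a compacted topic could honor deletes, assuming a topic keyed by revision id and configured with cleanup.policy=compact (broker address and topic name are placeholders):

```
from confluent_kafka import Producer

# Placeholder broker and topic; the topic is assumed to be configured with
# cleanup.policy=compact and keyed by revision id.
producer = Producer({"bootstrap.servers": "kafka.example.org:9092"})

def publish_revision(rev_id, payload_json):
    """Publish the latest state of a revision, keyed by its id."""
    producer.produce("mediawiki.revision-state", key=str(rev_id), value=payload_json)

def suppress_revision(rev_id):
    """Publish a tombstone (null value); log compaction eventually removes
    earlier records with this key, so compliant consumers stop seeing the content."""
    producer.produce("mediawiki.revision-state", key=str(rev_id), value=None)

producer.flush()
```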

(doc is not readable to the public yet).

fixed.

what's in the newly proposed event streams, governance over data retention, and for fun we could look at the potential for *incoming* event streams?

Heckay ya. There aren't any new proposed event streams at this time, only talk about putting things like wikitext and/or HTML in a stream. It's going to be hard to define what is 'complete', so we changed our hopes to just making it easy enough for MW state change events to be made more complete over time, as more use cases emerge and are implemented.

Let's abandon this "truth" terminology if possible

I don't mind.

One could say "authoritative",

Canonical?

source of any one thread of data flow is not necessarily true

But each data source should have one official producer that emits the original state change event. The stream of these events is the series of true/canonical state changes for the relevant entity. I think the 'truth' term is a philosophical one, not a practical one. If you are able to traverse and apply the full history of all events up to a particular point in time, then you should have a full snapshot of what the current state of all entities looked like at that time. In practice/realtime, the externalized state is not going to be consistent everywhere at once, only eventually so.
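As a toy illustration of "traverse and apply the full history of events": folding an ordered event log into a snapshot of current state. The event shape here is made up, and a real replay would have to deal with ordering, deletes, suppressions, and far larger volumes.

```
def snapshot_at(events, as_of):
    """Fold an ordered log of state change events into the latest state of
    each page as of a given timestamp (toy event shape, for illustration)."""
    state = {}
    for event in events:                      # assumed ordered by event["dt"]
        if event["dt"] > as_of:
            break
        if event["type"] == "revision-create":
            state[event["page_id"]] = event["content"]
        elif event["type"] == "page-delete":
            state.pop(event["page_id"], None)
    return state
```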

I think we agree @awight, and I am not attached to the term 'truth' in any way. I mostly want MediaWiki to produce state change event streams that other services can fully rely on to get what they need. "Events as a reliable and comprehensive source of MediaWiki state changes?"

I think we agree @awight, and I am not attached to the term 'truth' in any way. I mostly want MediaWiki to produce state change event streams that other services can fully rely on to get what they need. "Events as a reliable and comprehensive source of MediaWiki state changes?"

I feel that I am missing some subtlety in the distinction that you are drawing between 'truth' and 'can fully rely on'. What concretely is traded away in the sense of CAP theorem's framing of the binary choice of decreasing availability or decreasing consistency in the face of message loss in your fully reliable designation?

What concretely is traded away in the sense of CAP theorem's framing of the binary choice of decreasing availability or decreasing consistency in the face of message loss in your fully reliable designation?

In CAP terms, we lose Consistency, but only in that CAP defines it as "Every read receives the most recent write or an error.". State changes via event streams only provide eventual consistency (as do MariaDB replicas).

awight renamed this task from MediaWiki Events as Source of Truth - Decision Statement Overview to Constent and comprehensive event streams - Decision Statement Overview. Nov 25 2021, 11:45 AM

I've "boldly" edited the title to reflect how I understand the consensus above. We aren't trying to shift the source of truth, the point is to make events more useful for external system and service integrations.

freephile renamed this task from Constent and comprehensive event streams - Decision Statement Overview to Constant and comprehensive event streams - Decision Statement Overview. Nov 26 2021, 5:39 PM
awight renamed this task from Constant and comprehensive event streams - Decision Statement Overview to Consistent and comprehensive event streams - Decision Statement Overview. Nov 26 2021, 11:21 PM

Hmmm, bold! :) This task is owned by the Architecture team and the Technical Decision Forum chairs. They use this to create 'artifacts', so the description here needs to be synced to their artifact documents. We probably shouldn't change it without some coordination.

I'd like to discuss a little more before changing the title. I think 'a source of truth' is accurate for the intent: event streams should capture all MediaWiki state changes, so that they may be used as a 'source of truth' for MediaWiki. If that confuses people, I'm not opposed to changing the title, but I want to make sure that the intent is still captured somehow.

Also, this task is explicitly about MediaWiki event streams; so we should keep 'MediaWiki' in the title.

Of course, no problem at all to "revert" the title change :-), it was just a thought in order to avoid further confusion about the intent. Just to illustrate the problem with "source of truth", here are the Wikipedia search results.

Aye, k I'll revert for now. I'm mostly getting my 'source of truth' terminology from https://www.confluent.io/blog/messaging-single-source-truth/

Ottomata renamed this task from Consistent and comprehensive event streams - Decision Statement Overview to MediaWiki Events as a Source of Truth - Problem Statement. Nov 29 2021, 2:35 PM

Just to add my 2 cents, I too find the "a source of truth" terminology confusing. I am so accustomed to equating Source Of Truth with SSOT that it would take me some mental effort to accommodate that. And while I am now able to do so, thanks to this discussion, newcomers will not have that historical benefit. I'd argue that it's not worth it to confuse newcomers and ask them to deal with that nuance.

Aye, k I'll revert for now. I'm mostly getting my 'source of truth' terminology from https://www.confluent.io/blog/messaging-single-source-truth/

Please forgive me for saying so, but that page has "single" in the URL and the title and "One" (capitalized) in the 3rd subheading. A cursory reading makes me suppose that it supports SSOT and not ASOT via messages, so it's confusing me somewhat more.

supports SSOT and not ASOT via messages, so it's confusing me somewhat more.

I think the philosophical intention is to be able to use streams as a SSOT. It's just that we will likely never rearchitect MediaWiki itself to use this streaming source of truth for all its state changes, even if we might wish that we could.

In the 'Getting Events into the Log' section of that article, it specifically talks about how database-only legacy systems can use other techniques (like CDC) to build the event streams that other services can use as a source of truth, and then incrementally also use that stream to rearchitect the legacy system into an event sourced one. Sure, that would be great! I just don't want to propose that we should work on that.

So, in principle, I'm suggesting we should have a SSOT, but in practice we'll likely never get there...so ASOT is great too. :)

But, heard. Am willing to change the title, let's discuss with TechForum folks.

But, heard. Am willing to change the title, let's discuss with TechForum folks.

In the 'Getting Events into the Log' section of that article, it specifically talks about how database only based legacy systems can use other techniques (like CDC) to build the event streams other services can use as source of truth, and then incrementally also use that stream to rearchitect the legacy system into an event sourced one. Sure that would be great! I just don't want to propose that we should work on that.

Thanks for clarifying that.

FYI, if anyone is interested, there is a free talk from Confluent on Dec 16 2021: Consistency and Completeness: Rethinking Distributed Stream Processing in Apache Kafka

Perhaps a better title would be "Event Carried State Transfer of MediaWiki State"?

Ottomata renamed this task from MediaWiki Events as a Source of Truth - Problem Statement to MediaWiki Event Carried State Transfer - Problem Statement. Jan 18 2022, 3:37 PM

Ok, going with 'MediaWiki Event Carried State Transfer' as title.

@Jenlenfantwright @LNguyen This task changed state twice (in February and in March) despite a lack of substantial updates since November last year. Is there a record of any discussions or decisions that is available elsewhere?

Discussions that caused changes to the task here are all in the comments.

Some notes were taken during the meetings about the Decision Record, and are included at the bottom of that document.

The current status is that we have submitted the Decision Record to the Tech Forum, and will be meeting to discuss with Tech Forum board(?) on March 14 for their sign off. I probably should have posted a comment here last week noting that, sorry about that.

@LSobanski if you or anyone wants to discuss more, feel free to make a meeting or to just ask here. :)

Oh, BTW in case you weren't aware, the Decision Record we are submitting now is explicitly about the 'Comprehensiveness' problem, not the 'Consistency' problem. So, we are trying to solve for getting more MW state into streams, but not for improving the consistency of state streams emitted from MW. The consistency problem is much less pressing now that Petr has fixed some proxy timeout settings, and its solutions (described in T120242) are also much more controversial, so we are punting on those for now.

@Ottomata: Removing task assignee as this open task has been assigned for more than two years - see the email sent to all task assignees on 2024-04-15.
Please assign this task to yourself again if you still realistically [plan to] work on this task - it would be welcome! :)
If this task has been resolved in the meantime, or should not be worked on by anybody ("declined"), please update its task status via "Add Action… 🡒 Change Status".
Also see https://www.mediawiki.org/wiki/Bug_management/Assignee_cleanup for tips how to best manage your individual work in Phabricator. Thanks!