
MediaWiki Event Carried State Transfer - Problem Statement
Open, High, Public

Description

What?

Building services that need realtime MediaWiki data outside of MediaWiki currently requires a strong runtime coupling between the service and MediaWiki. Using event streams as a source of MediaWiki data would remove this runtime coupling, but our existing MediaWiki state change events suffer from two main flaws:

  • Consistency: there is no guarantee that a MediaWiki state change will result in an event being produced.
  • Comprehensiveness: much of the data needed by a service is not in any existing MediaWiki state change event stream, and there is no easy, built-in way for MediaWiki to automatically generate state change events.

What does the future look like if this is achieved?

A reliable streaming source of truth for MediaWiki data will allow us to build new products that serve MediaWiki data in different shapes and forms in a near-realtime and flexible manner without involving MediaWiki.

What happens if we do nothing?

  • Product and Engineering teams will build new services serving different read models using unreliable and incomplete data.
  • These services will be tightly coupled to MediaWiki or will be built into MediaWiki itself. There will be no standardized way to copy and use MediaWiki data in realtime without directly involving MediaWiki.
  • We will continue to expend engineering resources solving the same data integration problems over and over again.
  • We won't be able to implement OLTP use cases that are not scalable using MariaDB.

Why?

Consistent and comprehensive MediaWiki state change events are relevant to any service that wants to use MediaWiki data without being directly coupled to MediaWiki internals. This ties directly into MTP-Y2: Platform Evolution - Evolutionary Architecture.

Why are you bringing this decision to the Technical Forum?

  • This problem involves MediaWiki core and primary MediaWiki MariaDB datastores, and is relevant to every product and engineering team that uses MediaWiki data outside of MediaWiki.

Examples of affected projects:

  • Image Recommendations: Platform Engineering needs to collect events about image changes in MediaWiki and about users accepting image recommendations, and correlate them with a list of possible recommendations in order to decide which recommendations are available to be shown to users. See also Lambda Architecture for Generated Datasets. (Gabriele Modena, Clara Andrew-Wani)
  • Structured Data Across Wikimedia: This will be supported by data storage infrastructure that can structure section data in wikitext as its own entity and associate topical metadata with each section entity.
  • Structured Data Topics: Developers need a way to trigger/run a topic algorithm based on page updates in order to generate relevant section topics for users based on page content changes. (Desiree Abad)
  • Similar Users: AHT, in cooperation with Research, wishes to provide a feature for the CheckUsers community group to compare users and determine if they might be the same user, to help with evaluating negative behaviours. See also Lambda Architecture for Generated Datasets. (Gabriele Modena, Hugh Nowlan)
  • Add a link: The Link Recommendation Service recommends phrases of text in an article to link to other articles on a wiki. Users can then accept or reject these recommendations. (Marshall Miller, Kosta Harlan)
  • MediaWiki History Incremental Updates: The Data Engineering team bulk loads monthly snapshots of MediaWiki data from dedicated MariaDB replicas, transforms this data using Spark into a MediaWiki History, stores it in Hive and Cassandra, and serves it via AQS. Data Engineering would like to keep this dataset up to date within a few hours using MediaWiki state change events. (Joseph Allemandou, Dan Andreescu)
  • WDQS Streaming Updater: The Search team consumes Wikidata MediaWiki change events with Flink, queries MediaWiki APIs, builds a stream of RDF diffs, and updates the Blazegraph database backing the Wikidata Query Service. (David Causse, Zbyszko Papierski)
  • Knowledge Store PoV: The Architecture Team’s Knowledge Store PoV consumes events, looks up content from the MediaWiki API, transforms it, stores structured versions of that content in an object store, and serves it via GraphQL. (Diana Montalion, Kate Chapman)
  • MediaWiki REST API Historical Data Endpoint: Platform Engineering wants to consume MediaWiki events to compute edit statistics that can be served from an API endpoint to build iOS features. (See also this data platform design document.) (Will Doran, Joseph Allemandou)
  • Cloud Data Services: The Cloud Services team consumes MediaWiki MariaDB data and transforms it for tool maintainers, sanitizing it for public consumption. Many tool maintainers have to implement OLAP-type use cases on data shapes that don’t support that. (Andrew Bogott)
  • Wikimedia Enterprise: Wikimedia Enterprise (Okapi) consumes events externally, looks up content from the MediaWiki API, transforms it, stores structured versions of that content in AWS, and serves APIs on top of that data there. (Ryan Brounley)
  • Change Propagation / RESTBase: The Platform Engineering team uses Change Propagation to consume MediaWiki change events, causing RESTBase to store re-rendered HTML in Cassandra and serve it. (Petr Pchelko)
  • Frontend cache purging: SRE consumes MediaWiki resource-purge events and transforms them into HTTP PURGE requests to clear frontend HTTP caches. (Petr Pchelko, Giuseppe Lavagetto)
  • MW DerivedPageDataUpdater and friends: A ‘collection’ of various derived data generators running in-process within MW or deferring to the job queue. (Core team)
  • Some jobs: Many jobs are pure RPC calls, but many basically fit this topic, driving derived data generation (Cirrus jobs, for example). (Core team, Search team, etc.)
  • ML Feature store: Machine Learning models need features for training and serving. These features are often derived from existing datasets, and may have different latency and throughput requirements (mostly training vs. serving). (Machine Learning Team, Chris Albon)
  • MediaWiki XML dumps: XML dumps of MediaWiki data are generated semimonthly. Reworking this process to work from a more up-to-date data source would be very valuable.
  • Wikidata RDF / JSON dumps: Wikidata state changes could be used to generate more frequent and useful Wikidata dumps. (Search, WMDE)
  • Revision scoring: For wikis where these machine learning models are supported, edits and revisions are automatically scored using article content and metadata. The service currently makes API calls back to MediaWiki, leading to a fragile dependency cycle and high latency. (Machine Learning team)

More context

Let’s use the Wikidata Query Service Updater as an example. WDQS Updater starts from a snapshot of Wikidata content. It then subscribes to the event stream of revision create events to get notifications of when new revisions are created, and queries the MediaWiki API to get the content for those revisions. That content is transformed into updates for WDQS’s datastore.
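The Event Notification pattern described above can be sketched roughly as follows. This is a hypothetical illustration, not WDQS Updater code: the function and event field names are invented, and the MediaWiki API call is stubbed out. The key point is that the event only carries identifiers, so the consumer must call back into MediaWiki for the content, which is exactly the runtime coupling the problem statement wants to remove.

```python
def fetch_revision_content(rev_id):
    # Stand-in for a MediaWiki API call (e.g. a revisions query).
    return {"rev_id": rev_id, "content": f"wikitext for revision {rev_id}"}

def transform(content):
    # Shape the content for the downstream store (RDF diffs for WDQS, etc.).
    return content["content"].upper()

def handle_revision_create(event, datastore):
    # The event is only a notification; the content is not carried in it,
    # so we must call back to MediaWiki at runtime (the coupling).
    content = fetch_revision_content(event["rev_id"])
    datastore[event["page_id"]] = transform(content)

store = {}
handle_revision_create({"rev_id": 42, "page_id": 7}, store)
print(store)  # {7: 'WIKITEXT FOR REVISION 42'}
```

With Event Carried State Transfer, `fetch_revision_content` would disappear: the event itself would carry the revision content.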

This is an example of Event Notification architecture, and is used or will be used by Wikimedia Enterprise, the Analytics Data Lake, various machine learning pipelines, the proposed Knowledge Store, Change Propagation (for RESTBase), various async MediaWiki PHP jobs, etc. Event Notification is a step in the right direction, but it does not remove the runtime coupling between the source data service (MediaWiki) and external services that need data in a different shape.

Making it possible to rely on MediaWiki event streams as a ‘source of truth’ for MediaWiki will incrementally allow us to build services using Event Carried State Transfer and/or Event Sourcing architectures, which enable us to use CQRS to serve different read models (see the Why? section above). Note that none of these architectures are specifically prescribed by this decision record. We just want to make it possible to build reliable data systems using event driven architectures.

Ultimately we’d like to treat all Wikimedia data in this way, allowing us to build a platform supporting cataloged and sourceable streaming shared business data from which any service or product can draw. Making MediaWiki event production consistent and more comprehensive is an important step in that direction.

The Consistency problem

The existing streams of MediaWiki events are not consistent: e.g., there is no guarantee that a revision saved in a MediaWiki database will result in a revision create event. This makes it difficult for consumers of these events to rely on them for event carried state transfer.

In a distributed system, data will never be 100% consistent. This is true even now for the MediaWiki MariaDBs. (Some MediaWiki data relies on writes to different MariaDB instances, for which there is no way to update data transactionally.) However, we currently rely on MariaDB replication to distribute MediaWiki state and scale database reads. The acceptable level of consistency of MariaDB replica data should be explicitly defined using SLOs. We generally accept that MariaDB state replication may be late, but we do not accept it being incorrect.

Hopefully, SLOs we define for MariaDB replication consistency will be the same SLOs we define for event stream consistency.

For example, we know that currently, ~1% of revision create events are missing from the event streams, and we are not sure why. Missing 1% of rows in MariaDB replicas would be an unacceptable SLO, and it shouldn't be for event streams either. Ideally, if there were missing data in MariaDB replicas, the same data would be missing in event streams.

2022 EDIT: ^ is no longer true. We do miss some data, but the amount is now small, thanks to fixes by Petr Pchelko in T215001: Revisions missing from mediawiki_revision_create. We will need to solve the consistency problem, but the urgency is less.

The Comprehensiveness problem

Data needed by many of the use case examples listed above is not in the existing MediaWiki state change event streams, e.g. article content (wikitext or otherwise).

When bootstrapping, a service may need to make many requests to the MediaWiki API to get its data, overloading the API and/or causing the bootstrapping process to take a very long time. During normal operation, contacting the MediaWiki API in a realtime pipeline adds external latency that could be avoided.

Specifically, getting content and other data out of the MediaWiki API suffers from the following problems:

  • Dealing with API retries
  • Stale reads due to MariaDB replica lag
  • Quick deletes (cannot differentiate between a stale read and a deleted revision)

If relevant MediaWiki state changes were captured in event streams, services could be runtime decoupled from MediaWiki.

NOTE: Unless we replicate low-level database state changes, true ‘completeness’ of state change data will be hard to accomplish. Instead, we may consider providing a standard, default way of producing MediaWiki state changes (perhaps using some core abstraction on top of the MediaWiki hook mechanism?). This would encourage (or require?) developers to produce state change events for updates to MediaWiki data.

Further reading

Tech Forum docs

Event Timeline


Which part of this is controversial / of wide impact / needs a shared decision? If you expect to need architectural changes, those should be clearly outlined.

I was told to not suggest solutions at this phase, as this is a problem statement.

and until we know why, maybe we shouldn't propose solutions

We know that, no matter what we do with the current implementation, we will miss important state change events that MediaWiki does not emit.

Which part of this is controversial / of wide impact / needs a shared decision?

T120242 talks about potential solutions, is plenty controversial (as you know) and you suggested that I submit this to the tech forum! :p

I have a hard time understanding what the exact definition of "MediaWiki data" is.

Me too, which is probably why the "comprehensiveness" part of the problem isn't that well defined. Perhaps we can make up a test? If someone wants to build a new service in production, but in order for that service to function it either has to bootstrap data from MediaWiki / MariaDB or has to talk to the MW API at runtime, the data it needs could be considered "MediaWiki data"?

... [snip]
Also, I would like to see a reference for this statement:

For example, we know that currently, ~1% of revision create events are missing from the event streams, and we are not sure why.

and until we know why, maybe we shouldn't propose solutions.

Since this was written, we got that number down, but maybe there's a clearer way to state this, I'll try it here:

"For example, no matter how reliably we publish events to Kafka from EventBus or similar, unless it happens inside a MW transaction with the edit, this event will potentially be lost. One technical reason for this is that a global kill timer starts with the transaction and after 120 seconds it kills everything, including deferred updates. (Timo pointed out some nuances here that I can link to, but the general statement stands). Fairly self-evidently, if we don't publish events as part of the transaction, there's a small chance they don't make it to their destination."

In many cases, that's ok. MW itself doesn't need many of its operations to have perfect reliability. But MW and external services do need it sometimes, and revisions created/updated/deleted is an example. Without the same kind of reliability that MW database replicas enjoy, based on transactional guarantees and subject only to replication problems, some external services can't exist.

There's a different way to state the higher level requirement, and I'm also curious if that would help:

"As an external listener to events published by MW, I should be able to ask MW the question 'What are all the events I should've seen since time X?',"

MW should be able to answer quickly. Currently, for some particular instances of this question, MW databases would crash and burn trying to answer. The way I see this problem space is us iterating together to allow MW to answer different instances of this question.
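The "events since time X" question above can be made concrete with a toy sketch. This is purely illustrative (the log structure and names are invented): if MediaWiki kept an ordered, queryable event log, a listener that crashed or missed messages could catch up by replaying the tail after its last-seen position.

```python
import bisect

timestamps = []   # kept sorted
events = []       # events[i] happened at timestamps[i]

def append_event(ts, event):
    # Insert while keeping both lists ordered by timestamp.
    i = bisect.bisect_right(timestamps, ts)
    timestamps.insert(i, ts)
    events.insert(i, event)

def events_since(ts):
    # Everything strictly after ts: the catch-up / reconciliation query.
    i = bisect.bisect_right(timestamps, ts)
    return events[i:]

append_event(100, "revision-create r1")
append_event(200, "page-delete p9")
append_event(300, "revision-create r2")

print(events_since(150))  # ['page-delete p9', 'revision-create r2']
```

The hard part, as the comment notes, is that today's MW databases cannot answer some instances of this query efficiently; the sketch only shows what the interface would look like.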

Which part of this is controversial / of wide impact / needs a shared decision?

T120242 talks about potential solutions, is plenty controversial (as you know) and you suggested that I submit this to the tech forum! :p

Yeah, well, I suggested presenting ideas for changing our status quo, not the statement "delivery of events should be reliable", which I see as quite uncontroversial. The issues were with the proposed architectural choices, which are hardly the "how" in this case.

... [snip]
Also, I would like to see a reference for this statement:

For example, we know that currently, ~1% of revision create events are missing from the event streams, and we are not sure why.

and until we know why, maybe we shouldn't propose solutions.

Since this was written, we got that number down, but maybe there's a clearer way to state this, I'll try it here:

"For example, no matter how reliably we publish events to Kafka from EventBus or similar, unless it happens inside a MW transaction with the edit, this event will potentially be lost. One technical reason for this is that a global kill timer starts with the transaction and after 120 seconds it kills everything, including deferred updates. (Timo pointed out some nuances here that I can link to, but the general statement stands). Fairly self-evidently, if we don't publish events as part of the transaction, there's a small chance they don't make it to their destination."

Ok perfect, this is a clearer statement. "Event generation should be a first-class part of the transaction inserting data in the mediawiki datastore" is what I would've liked to see in the problem statement. To which I would've objected that you're proposing to sacrifice website performance for ensuring (well, not really, we can get into that later) delivery of events.

This means, in practice, introducing a strong coupling between MediaWiki, EventGate, and Kafka. This kind of strong coupling always comes at a cost in reliability, because the total uptime of editing now becomes:

Edit uptime = 1 - (downtime of MediaWiki) - (downtime of MySQL) - (downtime of EventGate) - (downtime of Kafka)

So doing this will only happen at the expense of the edit uptime. It will also need support as right now we treat kafka-main as a second-layer system that can be down for short intervals.
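The uptime expression above is the small-downtime approximation of the serial-availability product: chaining independent components multiplies their availabilities. A quick check with made-up figures (the downtime numbers below are purely illustrative) shows the two forms agree closely when downtimes are small:

```python
# Hypothetical per-component downtime fractions, for illustration only.
downtimes = {"mediawiki": 0.001, "mysql": 0.0005,
             "eventgate": 0.002, "kafka": 0.001}

exact = 1.0
for d in downtimes.values():
    exact *= (1 - d)           # A_total = product of component availabilities

approx = 1 - sum(downtimes.values())   # the linearized formula quoted above

print(round(exact, 6), round(approx, 6))
```

The product form is always at least as large as the linearized one, so the quoted formula is a (slightly pessimistic) lower bound on edit uptime.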

Also: imagine we produce the event, and then the transaction fails to commit to the database for some reason. This is not an XA transaction, and if we want XA transactions, we will have to implement the logic ourselves in mediawiki; and problems and failures will still happen. This is one of the hardest problems in practical distributed architectures, IMHO.

I consider nesting transactions to different datastores an antipattern and I've seen the consequences of tying an external action in the middle of a database transaction many times, both here and at previous jobs.

Again, we can give an opinion on statements that are somewhat precise (like the one you enunciated above) and that is what I naively expected the technical decision forum to be the good place for getting feedback about.

In many cases, that's ok. MW itself doesn't need many of its operations to have perfect reliability. But MW and external services do need it sometimes, and revisions created/updated/deleted is an example. Without the same kind of reliability that MW database replicas enjoy, based on transactional guarantees and subject only to replication problems, some external services can't exist.

Let me briefly analyze your last statement.

MySQL replicas don't offer any kind of guarantee on the freshness of data, nor on their immediate availability, nor that data are consistent between databases. Are you sure that's what you aim for?

I do think, though, that the need for reliable event production is a valid issue; but the solution is to 1) acknowledge issues can happen, 2) decide what model of eventual consistency we would like to have, 3) have a way to detect drifts and missing events and do reconciliation, and 4) have a mechanism to "dump and load" the data derived from MediaWiki that doesn't depend on an event stream.

Again, strongly consistent replication between remote datastores can only happen at the cost of availability; or you can have eventual consistency and get better availability (which is what we should aim for IMHO), but you will always need mechanisms to reconcile errors and dump/load data from the source of the "replica". Again, this is a hard problem to solve, will need a big commitment of engineering resources, and will probably mean rewriting eventgate into something different from what it is today.
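Point (3) above, drift detection and reconciliation, can be sketched in a few lines. This toy sketch is hypothetical (the function and identifier names are invented): periodically compare the identifiers in the source of truth against what the stream actually delivered, and re-emit anything missing.

```python
def reconcile(source_ids, delivered_ids, reemit):
    # Detect drift: ids present in the source of truth but never delivered.
    missing = sorted(set(source_ids) - set(delivered_ids))
    for rev_id in missing:
        reemit(rev_id)   # produce a make-up event for each lost change
    return missing

reemitted = []
missing = reconcile(
    source_ids={1, 2, 3, 4, 5},     # e.g. revision ids in MariaDB
    delivered_ids={1, 2, 4},        # revision ids seen on the stream
    reemit=reemitted.append,
)
print(missing)  # [3, 5]
```

In practice the comparison would be windowed (e.g. per hour of revisions) rather than over full id sets, but the shape of the mechanism is the same.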

Concluding: there is much to discuss about this idea, but I don't see much of it in the problem statement. I don't think that in its present form this problem statement is particularly less vague than "we serve errors to our users, and we shouldn't. Improve the reliability of the website."

Also, I would like to see a reference for this statement:

For example, we know that currently, ~1% of revision create events are missing from the event streams, and we are not sure why.

A search leads me to: T215001: Revisions missing from mediawiki_revision_create and an earlier proposal T120242. It would be great to link these tasks here, and update the statistics about how much data is missing.

The problem statement should be tweaked a bit to reflect the actual decision that we need to make. Due to the impossibility of a perfect two-phase-commit protocol across systems, we need to pick a *single* source of truth. To quickly explain what I mean, let's say the current flow is something like this:

  1. Commit state change to MediaWiki DB
  2. Send change event to stream producer

Obviously, if anything breaks after step (1) we lose the event.

A quasi-two-phase protocol would be more like:

  1. Begin MediaWiki DB state change transaction
  2. Begin stream producer transaction
  3. Commit DB state change.
  4. If db commit fails, roll back stream transaction and retry from step 1.
  5. Commit stream transaction
  6. If stream transaction fails, retry--or alternatively, roll back db transaction.

As you can see, this is also imperfect and if the process is aborted at any step we're left with inconsistent data. It gives us slightly more assurance that the db and event stream have the same content, but there's still the problem of what to do if the db commit never succeeds, and if we abort before step 6 then we still have inconsistent data. This isn't a solvable problem.
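The numbered quasi-two-phase flow above can be sketched with stand-in transaction objects. This is a hypothetical illustration (the `Txn` class and failure model are invented, and step 6 is omitted): a crash between the DB commit and the stream commit still loses or duplicates the event, which is exactly the imperfection being described.

```python
class Txn:
    # Minimal stand-in for a DB or stream-producer transaction.
    def __init__(self, name):
        self.name, self.state = name, "open"
    def commit(self):
        self.state = "committed"
    def rollback(self):
        self.state = "rolled back"

def quasi_two_phase(make_db_txn, make_stream_txn, retries=3):
    for _ in range(retries):
        db, stream = make_db_txn(), make_stream_txn()  # steps 1-2
        try:
            db.commit()                                # step 3
        except RuntimeError:
            stream.rollback()                          # step 4: retry
            continue
        stream.commit()                                # step 5
        return db, stream                              # both committed
    raise RuntimeError("db commit kept failing")

db, stream = quasi_two_phase(lambda: Txn("db"), lambda: Txn("stream"))
print(db.state, stream.state)  # committed committed
```

Nothing here protects against the process dying between step 3 and step 5, which is the unsolvable gap the surrounding discussion keeps returning to.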

I believe the way to look at the proposal we're talking about here is actually to make event streams the *authoritative* source of truth, with the MediaWiki DB updated by a downstream consumer of the stream. Let's clarify whether this is the case. The flow would look something like this:

  1. Submit change event to producer.
  2. If sending event fails, retry. Repeated failures return an error to atomic callers.
  3. Event consumer reads in "exactly once" or "at least once" mode, and commits these changes to the MW DB in an idempotent way.
  4. If DB update fails, retry, or mark event as failed.
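Step (3) above hinges on idempotence: an "at least once" consumer stays correct only if replaying a duplicate event leaves the store unchanged. A toy sketch (the event shape and table are invented for illustration):

```python
db = {}  # rev_id -> row, standing in for a MW table

def apply_event(event):
    # Idempotent upsert keyed on the event's identifier: processing the
    # same event twice is indistinguishable from processing it once.
    db[event["rev_id"]] = {"page": event["page"], "sha1": event["sha1"]}

events = [
    {"rev_id": 1, "page": "Foo", "sha1": "aaa"},
    {"rev_id": 2, "page": "Bar", "sha1": "bbb"},
    {"rev_id": 1, "page": "Foo", "sha1": "aaa"},  # at-least-once duplicate
]
for e in events:
    apply_event(e)

print(len(db))  # 2
```

Because the write is keyed on `rev_id`, the duplicate delivery of revision 1 is harmless, which is what lets the consumer retry freely in step (4).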

@awight I fully agree with you: we need a single source of truth. But I don't think that is our actual issue - our issue is how we sync secondary datastores with the source of truth.

If we were to - say, I actually have strong reservations of a practical nature - switch to events as the single source of truth, we would still have to solve the problem of how to reliably reproduce them in the MediaWiki database.

As a side note: when considering reliability scenarios, assume that in a nested or chained transaction, PHP can crash for N reasons at, or between, any of the steps, with non-negligible probability. Whichever way we flip the replication direction or nest the transactions, we will have situations where one is committed and the other rolled back in a way the application can't control or manage.

In many cases, that's ok. MW itself doesn't need many of its operations to have perfect reliability. But MW and external services do need it sometimes, and revisions created/updated/deleted is an example. Without the same kind of reliability that MW database replicas enjoy, based on transactional guarantees and subject only to replication problems, some external services can't exist.

Sending the data from within the transaction will not guarantee consistency, unless we can ensure to send a "cleanup" event when the transaction fails and is rolled back. We could emit the event right after successful commit, but then again consistency isn't guaranteed, because emitting the event may fail. We could retry, but that would mean delaying the response to the user.

If we need consistency, we are looking at implementing a distributed transaction system. That is hard, and resource intensive. Basically, we are looking to find our optimal solution for the CAP theorem (or more precisely the PACELC theorem): We have to trade off latency vs consistency. We can't have both.

EDIT: oh I see that @Joe explained all this much better in T291120#7464184.

@awight I fully agree with you: we need a single source of truth. But I don't think that is our actual issue - our issue is how do we sync secondary datastores with the source of truth.

The theory is that event sourcing is specifically designed to be replayed, so this greatly simplifies the requirement to build a quasi-two-phase-commit architecture. Once a producer successfully inserts into the event stream, it can report success to an atomic caller. In theory this is a much safer operation than updating several MW DB tables, so producer reliability should be improved. The events can be replayed by secondary consumers (such as the DB in the scenario I described), and if the transaction fails we have an offset for easy retrying. I do think this solves the issues you describe, of being unable to predict rollback etc. At least that's what the literature that came with my box of K**l-Aid told me.

I recently attended a workshop called "Software Architecture: The Hard Parts" by Mark Richards and Neal Ford, promoting their book of the same name. One interesting idea we discussed in the workshop is a set of "transactional sagas" for distributed architectures, categorized by three characteristics: synchronous vs asynchronous communication, atomic vs eventual consistency, and orchestrated vs choreographed coordination.

In the context of the discussion above, it seems two of these "sagas" are relevant:

  • async + atomic + orchestrated: the "Fantasy Fiction" saga. High coupling, high complexity, low responsiveness, bad scalability.
  • async + eventual + orchestrated: the "Parallel" saga. Low coupling, low complexity, high responsiveness, easy scalability.

So what they are telling us is that we should be ready to live with eventual consistency of the state of services that receive data from MediaWiki. Attempting to get atomic consistency will not end well.

It would be great to link these tasks here,

@awight, those tasks are linked at the bottom of the description.

I suggested presenting ideas for changing our status quo,
"delivery of events should be reliable", which I see as quite uncontroversial.
that is what I naively expected the technical decision forum to be the good place for getting feedback about.

I was told pretty explicitly to write a problem statement that only states the problem, for a non-technical audience (I found it really hard to write about a technical problem in a non-technical way, which is probably why you don't like how the statement is written). IIUC, this phase of the tech forum is about making sure everyone agrees that the problem is a problem and is important, and then building a RACI that will be used to find the right people to solve the problem later.

This means, in practice, introducing a strong coupling between Mediawiki, eventgate, kafka

If we need consistency, we are looking at implementing a distributed transaction system

This should probably be discussed in T120242: Consistent MediaWiki state change events | MediaWiki events as source of truth. Note that the proposed solutions there (CDC / transactional outbox) do not suggest introducing distributed transactions, and avoid this strong coupling. But, to keep the discussion followable, let's discuss possible consistency solutions on that ticket, instead of this one.
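For readers unfamiliar with the transactional outbox pattern mentioned above, here is a minimal toy sketch, with an invented schema and sqlite standing in for MariaDB. The event is written to an outbox table in the same local database transaction as the state change, so it cannot be lost if the write succeeds; a separate relay process later publishes outbox rows to the broker, giving at-least-once delivery without any cross-system transaction.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE revision (rev_id INTEGER PRIMARY KEY, page TEXT);
    CREATE TABLE outbox (id INTEGER PRIMARY KEY AUTOINCREMENT,
                         payload TEXT, published INTEGER DEFAULT 0);
""")

def save_revision(rev_id, page):
    with conn:  # one local transaction: both rows commit, or neither does
        conn.execute("INSERT INTO revision VALUES (?, ?)", (rev_id, page))
        conn.execute("INSERT INTO outbox (payload) VALUES (?)",
                     (f"revision-create:{rev_id}",))

def relay(publish):
    # Poll unpublished outbox rows and hand them to the broker; marking a
    # row published only after a successful send gives at-least-once.
    rows = conn.execute(
        "SELECT id, payload FROM outbox WHERE published = 0").fetchall()
    for row_id, payload in rows:
        publish(payload)
        conn.execute("UPDATE outbox SET published = 1 WHERE id = ?",
                     (row_id,))
    conn.commit()

sent = []
save_revision(42, "Foo")
relay(sent.append)
print(sent)  # ['revision-create:42']
```

The coupling to the broker moves out of the request path and into the relay, which is why this pattern avoids the edit-uptime cost discussed earlier in the thread.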

I don't think that in its present form this problem statement is particularly less vague than "we serve errors to our users, and we shouldn't. Improve the reliability of the website",

I disagree but am very happy to change the problem statement in any way that would make this different. The problem is that it is currently not possible to reliably use MW events for state transfer to build new services. Or, perhaps I should say: it is currently not possible to reliably use MW data outside of MW in a realtime way, and consistent and comprehensive events as a source of truth is a solution to that problem.

The hard part will be figuring out the right architecture (and SLOs) to make MW events usable for state transfer. (Or at least a first class way to reconcile what has been missed).

The theory is that event sourcing is specifically designed to be replayed, so this greatly simplifies the requirement to build quasi-two-phase-commit architecture.

That would require us to REPLACE the database as the primary source of truth. That basically means rearchitecting and rewriting a good chunk of MW. Even if we knew that this is the direction we want to take for the future, we are far, far away from it. I think it makes more sense to focus on getting a reasonably reliable stream of events from MediaWiki, in a standardized way.

The format of "problem statement first, solutions later" is very hard to follow, since I feel immediate urge jump to solutions, I'll try to stick to the problem.

After our most recent set of improvements under T215001, and thanks to @Milimetric's and @JAllemandou's analysis, we're now down to 0.0007% of revision-create events missing. We do not have exact numbers for the other event types since we haven't done such thorough analysis, but it's safe to assume the numbers are similar. IMHO, this already looks quite good. I believe it's still possible to drop this number further without drastically reinventing the architecture. We could look more into EventGate (the interaction between the Node.js thread pool and librdkafka's own thread pool has always been suspicious), or we could experiment with replacing EventGate with a librdkafka-based PHP Kafka driver (there's a new one on the market that we've never evaluated).

I wonder: for which use-cases would 0.0001% not be enough? 0.00001%? Also, revisions are 'self-healing' for use-cases that rely on the latest view of the data: when the page is next edited, the view of the latest state becomes consistent again. I believe that if there are use-cases where 100% reliability is required, we can find solutions based on some form of eventual consistency or reconciliation.

Non-guaranteed delivery of things like page deletes or revision suppressions is more important, since these are not naturally eventually consistent like revisions. But these operations are much rarer than page edits, so perhaps we shouldn't put them in one bag with revisions: solutions that are unacceptable for revisions might be totally fine for the much rarer page deletes or renames.

The problem statement seems to suggest there will be a golden bullet that would allow us to make all events 100% reliable. That's mathematically impossible without sacrificing availability. I think the solution will be unique for each type of event or use-case, with a sliding scale between availability and consistency. The problem now, IMHO, is that we don't have any other option for event delivery and no easy option for reconciliation, so our sliding scale is missing a knob.

Thanks Petr! For consistency's sake (;p), let's keep discussions about consistency in T120242: Consistent MediaWiki state change events | MediaWiki events as source of truth.

The problem statement seems to suggest there will be a golden bullet that would allow us to make all events 100% reliable.

I hope not! But, the problem statement does suggest that we should try to make (important/relevant) MW state change events as reliable (or as close to as reliable) as MariaDB replication. That seems possible to me. Perhaps reconciliation will be the final solution, but I think there are other options we should consider too.

But, the problem statement does suggest that we should try to make (important/relevant) MW state change events as (or as close to) reliable as MariaDB replication.

I know this is jumping to "solutioning" but the 'cheap' thing that is as reliable as MariaDB replication are MariaDB binlogs. Can the types of events desired be reconstructed from binlogs?

I know this is jumping to "solutioning" but the 'cheap' thing that is as reliable as MariaDB replication are MariaDB binlogs. Can the types of events desired be reconstructed from binlogs?

@bd808, see T120242: Consistent MediaWiki state change events | MediaWiki events as source of truth

A lot of thoughtful comments have been made, and like others I find it difficult to separate the technical options/trade-offs from the problem statement itself. There seems to be consensus that a reliable events architecture is desirable, but not on whether we are willing to pay a price in reliability for it. Some points we could include to make the problem statement more "decidable":

  • a commitment to create a core event infrastructure that is a source of truth, for which we are willing to pay a price in reliability (and engineering resources). That price is determined by many factors yet to be discussed, and should be "configurable" by SRE. I.e. there will be some sort of coupling between the transactions to the MW DB and the event system; if one transaction fails the other gets a chance to react to it. This is difficult to discuss without getting technical, but without coupling there are no events with guarantees, and we can't have coupling without a price in reliability.
  • explicitly exclude, as a goal, making events a source of truth for MW itself. This seems too risky/ambitious; it could still be done more incrementally in the future once events are a source of truth.

The changes required to make events a core feature with guarantees, and the technical discussions/decisions preceding it, should also offer the chance to decrease the overall complexity of the WMF systems. We currently pay a hard-to-quantify but non-trivial price to build new systems/services that don't need to be a part of MW, in terms of human effort but also in terms of the options that are even considered feasible/reasonable to implement. Have we done a back-of-the-envelope evaluation of projects listed in the description in terms of estimated work given the status quo vs estimated work given a core event infrastructure with some correctness guarantees/streaming pipelines/etc? In my opinion, this aggregate cost needs to be weighed against the price in reliability it would cost us to build a core event infrastructure.

Hello, apologies for a late question (I have marked the "feedback" deadline for a wrong Friday and only now realized it was last week's Friday):
I might simply know too little about the current MediaWiki "event system", but reading through the problem statement, and thinking of its title ("source of truth"), I do not fully understand why it focuses only on "services that need realtime MediaWiki data outside of MediaWiki". If the event system was reliable enough to be considered a "source of truth", I'd imagine that (at least some parts of) MediaWiki (and its extensions) could actually move away from using the current SQL tables (revision, text, etc) to get the data they need. Is the different "performance" requirement level something that leads to ruling this application out and focusing on "services outside of MediaWiki"? Or am I misinterpreting what "MediaWiki" refers to here? (e.g. would a hypothetical event-based service that updates Wikipedia infoboxes when Wikidata data changes -- currently implemented as Wikidata pushing updates to Wikipedias via MediaWiki job queues -- be an instance of a "service outside of MediaWiki"?)

I apologize if it occurs to be a loaded/non-constructive question. It is not my intention.

My understanding is that some of us got swept up into discussing the choice of a single source of truth (SSOT) but the bug author meant a source of truth, in a relative sense. In other words, the RFC is concerned with making the event streams trustworthy, but not trying to rearchitect MediaWiki to model all data flow as events. I propose that the RFC be renamed to clear this up, maybe to "Reliable event streams"?

As long as streams are reliable, we get most of the benefits of CQRS/ES even without our entire application using events internally. Waving hands in the general direction of what we might be missing without going for the "full" rewrite: tight RDBMS coupling will keep our application mostly monolithic, the API surface will continue to be the primary write interface, and it will continue to be awkward for mw-core to ingest from externally-produced event sources.

oh dear, your favorite facepalm meme could go here. Thanks @awight for pointing out my stupidity. I am apparently so deep into the cool kids talks about the single source of truth all over the internet, that I've semiconsciously added it to the title of this document, while it indeed "only" talks about a source of truth.
Apologies for the noise.

After some discussion with Tim, it seems to me like the most realistic way to do this is Change Data Capture (CDC): we look at the database's replication log and generate events from that. However, direct CDC means binding directly to the raw database schema. That should be avoided. We'd want a layer that turns database row changes into more abstract events for general consumption. The code for the CDC event emitter should live in the same repo as the code that writes to the database - that is, it would live in MW core. It would be executed as a permanently running maintenance script. The log position could be maintained in Kafka. This would provide parity with database replication in terms of latency and consistency.

I understand that we should not be "solutioning" too much while discussing the problem statement. I'm outlining the solution mainly to better understand the requirements - would this solution solve the problem outlined? Does it match the requirements?
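To make the shape of that translation layer concrete, here is a minimal sketch of mapping a raw CDC row change into a more abstract event. The row fields, table names, and event schema here are all hypothetical illustrations, not the real MediaWiki schema or any actual EventBus event format:

```python
# Sketch: translate a raw binlog row change (table-level) into a more
# abstract, consumer-friendly state change event. All field names here
# are hypothetical, not the real MediaWiki schema.

def row_change_to_event(table, action, row):
    """Map a raw CDC row change to an abstract event, or None if the
    table/action is not one we want to expose to consumers."""
    if table == "revision" and action == "insert":
        return {
            "event_type": "mediawiki.revision-create",
            "rev_id": row["rev_id"],
            "page_id": row["rev_page"],
            "timestamp": row["rev_timestamp"],
        }
    # Other tables/actions would be handled here; unknown ones are dropped,
    # so consumers never bind to the raw database schema.
    return None

event = row_change_to_event(
    "revision", "insert",
    {"rev_id": 123, "rev_page": 42, "rev_timestamp": "2021-11-25T11:45:00Z"},
)
```

The point of the indirection is that the raw schema can evolve while the abstract event contract stays stable for consumers.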

My problem is that we don't know why we want a solution in this direction. The underlying problem is mentioned as:

For example, we know that currently, ~1% of revision create events are missing from the event streams

Let's assume the issue is in eventgate. Then we will be building a lot of code that couples with MediaWiki's database schema (increasing the cost of its much-needed improvements), only to lose the 1% again and gain nothing. I would really like to see an investigation on this 1%. Where it vanishes into the void, how it can be improved, etc. If we know it's something in the architecture of the system, then yes, let's change the architecture.

Re CDC idea, pros and cons, see T120242: Consistent MediaWiki state change events | MediaWiki events as source of truth, let's keep that discussion there. @daniel see the Hybrid CDC + Transactional Outbox idea for how we could have control over the event model, but still use the binlog directly (without a complicated raw row change translation layer).
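As a rough illustration of the transactional-outbox idea referenced above: the state change and the event describing it are committed in the same database transaction, so either both exist or neither does. This sketch uses sqlite3 purely for illustration; the table and column names are made up, and a real relay (or the binlog/CDC reader) would ship outbox rows to Kafka:

```python
import json
import sqlite3

# Illustrative schema only: a business table plus an outbox table.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE revision (rev_id INTEGER PRIMARY KEY, rev_page INTEGER);
    CREATE TABLE outbox (id INTEGER PRIMARY KEY AUTOINCREMENT, payload TEXT);
""")

# One atomic transaction: the row change and its event commit together,
# so the event stream can never silently miss a committed state change.
with db:
    db.execute("INSERT INTO revision (rev_id, rev_page) VALUES (?, ?)", (123, 42))
    db.execute(
        "INSERT INTO outbox (payload) VALUES (?)",
        (json.dumps({"event_type": "revision-create", "rev_id": 123}),),
    )

# A relay process would read (and then delete) outbox rows in order;
# here we just inspect what it would see.
events = [json.loads(p) for (p,) in db.execute("SELECT payload FROM outbox ORDER BY id")]
```

In the hybrid variant, the relay is the binlog reader itself: it tails only the outbox table, so the event model stays under application control while delivery inherits the binlog's reliability.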

For example, we know that currently, ~1% of revision create events are missing from the event streams

Recent fixes (some envoy timeout race conditions) may have brought this number way way down, which is great. We are about to have the full month of October imported into Hadoop, so we can do further analysis to really get this number.

However,

If we know it's something in the architecture of the system, then yes, let's change the architecture.

We do know this. No matter how reliable we make the deferred update async POSTing of a core state change event, we know that there will be discrepancies that over time will really matter. A solution like CDC might make the stream so good that any discrepancies (between what is in the binlog and what is in the stream) would totally disappear...but also maybe they won't? How confident are we currently that a MariaDB replica's database, or even just its revision table, is 100% consistent with the master? Should a reliable event stream do better? I'd argue that it should be AS good, or better.

In either case, others have often argued that even if we get our event production consistency really really good, we will always need a way to reconcile discrepancies in the stream with the 'real' source of truth...the MediaWiki MariaDB. Perhaps all we will need is better EventGate SLOs (which maybe we can make now?) and a reconciliation mechanism built into MediaWiki. I was skeptical that this could work, but I'm starting to be convinced it might be possible. Hm, I should update the description of T120242 to include that as a possible solution.

@WMDE-leszek, if we can rely on event streams to source MW data for non-MW services...there's no reason we couldn't do it for MW too (at least in cases where MW is ok with eventual consistency). This problem statement def focuses on external services though, since MW already has a 'source of truth' that it alone has access to.

The replies above suggest that this ticket is indeed about events as the source of truth rather than a source of truth. @Ottomata can you clarify?

The replies above suggest that this ticket is indeed about events as the source of truth rather than a source of truth. @Ottomata can you clarify?

Hm, I'm not sure, but I think the confusion is: if events are the (only?) source of truth, then we'd have to event source MediaWiki in order for MediaWiki itself to be consistent. While that would be a nice solution if we were starting from scratch, doing that is intractable and not what this problem statement is concerned with. Instead, this is about being able to use MediaWiki state change events as a reliable source of truth for MediaWiki state changes. This would make it possible to externalize MediaWiki state in realtime without a runtime coupling to MediaWiki (and its MariaDB db).

In either case, others have often argued that even if we get our event production consistency really really good, we will always need a way to reconcile discrepancies in the stream with the 'real' source of truth...the MediaWiki MariaDB. Perhaps all we will need is better EventGate SLOs (which maybe we can make now?) and a reconciliation mechanism built into MediaWiki.

This is a really exciting suggestion—if a reconciliation mechanism is built-in and can be made flexible enough, it could also solve for a broad range of use cases:
  • edit conflict resolution
  • real-time collaborative editing
  • offline editing
  • federation
just to name a handful...

Instead, this is about being able to use MediaWiki state change events as a reliable source of truth for MediaWiki state changes. This would make it possible to externalize MediaWiki state in realtime without a runtime coupling to MediaWiki (and its MariaDB db).

If this is the case, I suggest removing the "source of truth" wording from the proposal entirely. As is evident from the discussion on this ticket, it's bound to create confusion. Personally, I have never seen "source of truth" used to refer to anything other than the single source of truth. How about just saying that we need MediaWiki to act as a reliable event source, and focus the discussion on what "reliable" should mean in terms of consistency and latency.

I suggest removing the "source of truth" wording from the proposal entirely
I have never seen "source of truth" used to refer to anything other than the single source of truth

Hm, perhaps you are right.

However, this ticket is about making it possible to use MediaWiki produced events as the only source of truth a non-MediaWiki service needs to get MediaWiki data. I'm not suggesting that MediaWiki also only use events as SSOT (single source of truth) because of the intractability of that. It is much easier to design Event Sourced services from the start, rather than try to convert one to Event Sourced later. MediaWiki is huge and complicated and too important; I wouldn't suggest any kind of concerted effort to change its data architecture outright. If we get events as a source of truth, then parts of MediaWiki could incrementally be moved to an Event Sourced architecture. In an (my?) ideal world, this could be a theoretical goal, so in that sense I'd love to theoretically make MediaWiki events THE single source of truth for everything, including MediaWiki. It's just that I don't think doing so is practical, so I excluded that from consideration in this problem statement.

I suppose, since this is just a very abstract problem statement, we could frame the problem to include MediaWiki too? Although that would probably confuse more people and cause more friction.

How about just saying that we need MediaWiki to act as a reliable event source

Reliable and comprehensive. Meaning that any time someone wants to set up a new service with transformed MediaWiki data, there is a way to get all the state they need from events. If the specific state changes they need aren't yet in the streams, there should be something that allows them to code up MediaWiki to begin emitting that state as events, for themselves and others to use.

Just following up because I may have been wrong to introduce the idea of "a" source of truth above. I did this to try to show where the confusion in this task is coming from, but it's important to note that "single source of truth" has a very strict definition: if we had a SSOT, it would be the MediaWiki primary database, and there would be no other sources of truth. For example, if you want to make an update it must be to this database, not in an event stream, etc.

Also following the definition of SSOT, the event source records are not another source of truth. They are in fact breaking ideal SSOT because they are copying data literally rather than as a reference. If we sent events that simply said "revision: 123" and you had to query the database to get the revision content, this would be closer to pure SSOT. Instead, we inline the content of the revision, which becomes problematic if a revision is deleted. In this case the event is out-of-sync with the actual source of truth. This is exactly the reason that SSOT exists, and shows how breaking it causes issues for us.

+1 that it seems most productive to focus on just "reliable and complete" event sourcing for this RFC, and give it a concrete target threshold for the percentage of messages dropped.

which becomes problematic if a revision is deleted

If it is deleted without a corresponding revision-delete state change event, that is problematic. Otherwise, fine, no?

They are in fact breaking ideal SSOT because they are copying data literally rather than as a reference

IIUC it is totally fine to copy data literally, as long as you are copying it from the SSOT, and the SSOT is reliable and complete.

I'd like to make it possible for everything except MediaWiki to consider events as their source of truth. Technically this isn't SINGLE source of truth, I guess, since the fundamental MW truth is MariaDB. I'd like to abstract that away from everything else, so from the perspective of a service that doesn't have access to MW MariaDB, the events are the only 'source of truth' they need.

which becomes problematic if a revision is deleted

If it is deleted without a corresponding revision-delete state change event, that is problematic. Otherwise, fine, no?

I only call this problematic because it's important that our downstream consumers respect the social convention that revision-delete is projected to delete previously received data. This makes event sourcing feel slightly uncomfortable, similar to if we were broadcasting protected data such as PII but asking our consumers to only retain for at most the legally-allowed 30-day period, for example. In both cases consumers have an obligation to periodically transform the entire history of each stream in order to comply with rules, which feels at odds with the ideal of a write-only, permanent event store.

But I agree that it's totally fine to do this, and to copy denormalized data literally, sending it as events. What events have already done brilliantly is to give external peers CQRS-style access to MediaWiki records. From the perspective of an event consumer, it doesn't matter where the boundaries are between our event producer and the authoritative DB rows. Whether this is a tightly-coupled replication log follower or a set of MediaWiki hooks, the event stream will look the same; the only differences are in the expected reliability.

I think you said it well, that events are the only truth many consumers will need. Let's abandon this "truth" terminology if possible, though—if data is followed through our systems e.g. as a data-flow diagram, then each data-transforming process has a source data store and a sink data store. The source of any one thread of data flow is not necessarily true, it could already be out-of-date relative to the furthest upstream producer.

Truth is a very strong claim to make about a piece of data (I would even hesitate to claim that the primary db contains truth), and also doesn't have a convenient pairing (i.e., the sink of a data flow is not "untrue") so not a good fit here. One could say "authoritative", but that would be tiresome to repeat across each data flow: "events are less authoritative than the primary database, but more authoritative than a locally-cached projection of these events".

Maybe we can avoid this sort of language entirely, instead pointing to a DFD showing that e.g. the EventStreams ES endpoint mediawiki.revision-create mirrors the internal Kafka topic of the same name, which is produced by the EventBus extension hooking into onRevisionRecordInserted. Describing as a series of data flows lets the reader see the relationship between each node, choose which is the most appropriate for their use case, and also hints at potential losses through each linkage.

Of course, the point of this task is that we should project event streams from all important MediaWiki activities, and do it reliably. I'm completely in favor, apologies for all the words thrown at such a small issue of language. I would love to see what additional event streams are being considered (doc is not readable to the public yet). I'm sure that OKAPI will provide plenty of business motivation to guarantee a high level of reliability, deciding on the exact threshold is more of a cost/benefit thing than a RFC question. So maybe we should be talking about what's in the newly proposed event streams, governance over data retention, and for fun we could look at the potential for *incoming* event streams?

I only call this problematic because it's important that our downstream consumers respect the social convention that revision-delete is projected to delete previously received data

This is true of the replication binlog now, right? (If MariaDB didn't respect deletes). But also Kafka compacted topics may help.
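To sketch why compacted topics are relevant here: after compaction, Kafka retains only the latest value per message key, and a tombstone (a null value for the key) eventually removes the key entirely. A toy model of that retention behavior, with made-up keys and values:

```python
# Toy model of Kafka log compaction: only the latest value per key
# survives, and a tombstone (value=None) removes the key once
# compaction has run. Keys and payloads here are illustrative.

def compact(log):
    """Return the surviving key -> value map after full compaction."""
    latest = {}
    for key, value in log:
        # Later records for the same key supersede earlier ones.
        latest[key] = value
    # Tombstoned keys are dropped entirely.
    return {k: v for k, v in latest.items() if v is not None}

log = [
    ("rev:123", {"content": "original text"}),
    ("rev:124", {"content": "another revision"}),
    ("rev:123", None),  # revision-delete tombstone
]
state = compact(log)
```

Under this model, a consumer that bootstraps from a compacted topic never sees the deleted revision's content, which addresses part of the revision-delete concern above (though consumers that read the topic before compaction ran still must honor the tombstone themselves).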

(doc is not readable to the public yet).

fixed.

what's in the newly proposed event streams, governance over data retention, and for fun we could look at the potential for *incoming* event streams?

Heckay ya. There aren't any new proposed event streams at this time, only talk about putting things like wikitext and/or html in a stream. It's going to be hard defining what is 'complete', so we changed our hopes to just making it easy enough for MW state change events to be made more complete over time, as more use cases emerge and are implemented.

Let's abandon this "truth" terminology if possible

I don't mind.

One could say "authoritative",

Canonical?

source of any one thread of data flow is not necessarily true

But each data source should have one official producer that emits the original state change event. The stream of these events is the series of true/canonical state changes for the relevant entity. I think the 'truth' term is a philosophical one, not a practical one. If you are able to traverse and apply the full history of all events, up to a particular point in time, then you should have a full snapshot of what the current state of all entities looked like at that time. In practice/realtime, the externalized state is not going to be consistent everywhere at once, only eventually so.
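The "traverse and apply the full history" idea is just a fold over the event stream. A minimal sketch, with hypothetical event shapes (assumed to arrive in timestamp order):

```python
# Sketch: fold an ordered stream of state change events into a snapshot
# of entity state at a point in time. Event fields are hypothetical.

def snapshot(events, up_to):
    """Apply events with timestamp <= up_to; return entity-id -> state."""
    state = {}
    for ev in events:
        if ev["ts"] > up_to:
            continue
        if ev["type"] in ("create", "update"):
            state[ev["id"]] = ev["data"]
        elif ev["type"] == "delete":
            state.pop(ev["id"], None)
    return state

events = [
    {"ts": 1, "type": "create", "id": "page:1", "data": {"title": "A"}},
    {"ts": 2, "type": "update", "id": "page:1", "data": {"title": "B"}},
    {"ts": 3, "type": "delete", "id": "page:1"},
]
```

Replaying to ts=2 yields the page with title "B"; replaying through ts=3 yields an empty state, since the delete event removes it. This is exactly why completeness matters: a missing event silently corrupts every downstream snapshot.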

I think we agree @awight, and I am not attached to the term 'truth' in any way. I mostly want MediaWiki to produce state change event streams that other services can fully rely on to get what they need. "Events as a reliable and comprehensive source of MediaWiki state changes?"

I think we agree @awight, and I am not attached to the term 'truth' in any way. I mostly want MediaWiki to produce state change event streams that other services can fully rely on to get what they need. "Events as a reliable and comprehensive source of MediaWiki state changes?"

I feel that I am missing some subtlety in the distinction that you are drawing between 'truth' and 'can fully rely on'. What concretely is traded away, in the sense of the CAP theorem's framing of the binary choice between decreasing availability and decreasing consistency in the face of message loss, in your fully reliable designation?

What concretely is traded away, in the sense of the CAP theorem's framing of the binary choice between decreasing availability and decreasing consistency in the face of message loss, in your fully reliable designation?

In CAP terms, we lose Consistency, but only in that CAP defines it as "Every read receives the most recent write or an error.". State changes via event streams only provide eventual consistency (as do MariaDB replicas).

awight renamed this task from MediaWiki Events as Source of Truth - Decision Statement Overview to Constent and comprehensive event streams - Decision Statement Overview.Nov 25 2021, 11:45 AM

I've "boldly" edited the title to reflect how I understand the consensus above. We aren't trying to shift the source of truth, the point is to make events more useful for external system and service integrations.

freephile renamed this task from Constent and comprehensive event streams - Decision Statement Overview to Constant and comprehensive event streams - Decision Statement Overview.Nov 26 2021, 5:39 PM
awight renamed this task from Constant and comprehensive event streams - Decision Statement Overview to Consistent and comprehensive event streams - Decision Statement Overview.Nov 26 2021, 11:21 PM

Hmmm, bold! :) This task is owned by the Architecture team and the Technical Decision Forum chairs. They use this to create 'artifacts', so the description here needs to be synced to their artifact documents. We probably shouldn't change it without some coordination.

I'd like to discuss a little more before changing the title. I think 'a source of truth' is accurate for the intent: event streams should capture all MediaWiki state changes, so that it may be used as a 'source of truth' for MediaWiki. If that confuses people, I'm not opposed to changing the title, but I want to make sure that the intent is still captured somehow.

Also, this task is explicitly about MediaWiki event streams; so we should keep 'MediaWiki' in the title.

Of course, no problem at all to "revert" the title change :-), it was just a thought in order to avoid further confusion about the intent. Just to illustrate the problem with "source of truth", here are the Wikipedia search results.

Aye, k I'll revert for now. I'm mostly getting my 'source of truth' terminology from https://www.confluent.io/blog/messaging-single-source-truth/

Ottomata renamed this task from Consistent and comprehensive event streams - Decision Statement Overview to MediaWiki Events as a Source of Truth - Problem Statement.Nov 29 2021, 2:35 PM

Just to add my 2 cents, I too find the "a source of truth" terminology confusing. I am so accustomed to equating Source Of Truth with SSOT that it would take me some mental effort to accommodate that. And while I am now able to do so, thanks to this discussion, newcomers will not have that historical benefit. I'd argue that it's not worth it to confuse newcomers and ask them to deal with that nuance.

Aye, k I'll revert for now. I'm mostly getting my 'source of truth' terminology from https://www.confluent.io/blog/messaging-single-source-truth/

Please forgive me for saying so, but that page has "single" in the URL and the title and "One" (capitalized) in the 3rd subheading. A cursory reading makes me suppose that it supports SSOT and not ASOT via messages, so it's confusing me somewhat more.

supports SSOT and not ASOT via messages, so it's confusing me somewhat more.

I think the philosophical intention is to be able to use streams as a SSOT. It's just that we will likely never rearchitect MediaWiki itself to use this streaming source of truth for all its state changes, even if we might wish that we could.

In the 'Getting Events into the Log' section of that article, it specifically talks about how database-only legacy systems can use other techniques (like CDC) to build the event streams other services can use as a source of truth, and then incrementally also use that stream to rearchitect the legacy system into an event sourced one. Sure, that would be great! I just don't want to propose that we should work on that.

So, in principle, I'm suggesting we should have a SSOT, but in practice we'll likely never get there...so ASOT is great too. :)

But, heard. Am willing to change title, lets discuss with TechForum folks.

But, heard. Am willing to change title, lets discuss with TechForum folks.

In the 'Getting Events into the Log' section of that article, it specifically talks about how database-only legacy systems can use other techniques (like CDC) to build the event streams other services can use as a source of truth, and then incrementally also use that stream to rearchitect the legacy system into an event sourced one. Sure, that would be great! I just don't want to propose that we should work on that.

Thanks for clarifying that.

FYI, if anyone is interested, there is a free talk from Confluent on Dec 16 2021: Consistency and Completeness: Rethinking Distributed Stream Processing in Apache Kafka

Perhaps a better title would be "Event Carried State Transfer of MediaWiki State"?

Ottomata renamed this task from MediaWiki Events as a Source of Truth - Problem Statement to MediaWiki Event Carried State Transfer - Problem Statement.Jan 18 2022, 3:37 PM

Ok, going with 'MediaWiki Event Carried State Transfer' as title.

@Jenlenfantwright @LNguyen This task changed state twice (in February and in March) despite a lack of substantial updates since November last year. Is there a record of any discussions or decisions that is available elsewhere?

Discussions that caused changes to the task here are all in the comments.

Some notes were taken during the meetings about the Decision Record, and are included at the bottom of that document.

The current status is that we have submitted the Decision Record to the Tech Forum, and will be meeting to discuss with Tech Forum board(?) on March 14 for their sign off. I probably should have posted a comment here last week noting that, sorry about that.

@LSobanski if you or anyone wants to discuss more, feel free to make a meeting or to just ask here. :)

Oh, BTW in case you weren't aware, the Decision Record we are submitting now is explicitly about the 'Comprehensiveness' problem, not the 'Consistency' problem. So, we are trying to solve for getting more MW state into streams, but not solving for improving the consistency of state streams emitted from MW. The consistency problem is much less pressing now that Petr has fixed some proxy timeout settings, and its solutions (described in T120242) are also much more controversial, so we are punting on those for now.