T116786 introduced MediaWiki event production (EventBus) via an extension utilizing hooks. While adequate for the EventBus MVP, this is only an interim solution. Ultimately, we need a mechanism that guarantees event delivery (eventual consistency is OK).
The Event Platform program extended the work started in T116786 to provide standardized event production APIs, unified for both production and analytics purposes.
However, in order to build truly reliable new production services using events based on MediaWiki data, we need a single source of truth for MediaWiki data. That source of truth is the MediaWiki MySQL database, which is only consistently accessible by MediaWiki itself. There is currently no way to consistently expose (real-time) MediaWiki state changes to non-MediaWiki applications.
We do have events produced by MediaWiki, but these events are decoupled from the MySQL writes, and there is no guarantee that e.g. every revision table save results in a mediawiki.revision-create event. This means that, as of today, MediaWiki events cannot be relied on as a 'source of truth' for MediaWiki data. They are little more than a best effort (a really good effort!) notification.
Background reading: Turning the database inside out
Why do we need this?
I asked a few stakeholders to explain why this is important to them, and they gave me permission to quote them here. These are a few examples of why consistent events are important.
WikiData Query Service Updater - T244590
... missed events are probably the biggest issue in the system. We have visibility into late and out of order events (and probably mostly buggy events, but there's no way of knowing for sure). Not only that, there are sensible ways of dealing with them, both in general and in our specific situation.
Missed events are, by their nature, invisible to us via standard means and hard to observe in general. Since we also don't really understand the situation when those are dropped, it's hard to assess the impact on WDQS updater. We decided we're ok with it for now, because it's simply still better than the previous solution.
To reiterate - we can deal with lateness and out-of-orderliness - dealing with missed events is an order of magnitude harder challenge.
Image Recommendations project - T254768
Throughout the month, the state of an article can change. We'll need to track a "revisions events topic" to establish a feedback loop with the model re the following state changes (among others):
- Previously unillustrated articles that are now illustrated
- Articles illustrated algorithmically, that have been reverted
- Orthogonal (technically not a MW state change): track which recommendations have been rejected by a client.
Being late in capturing state changes would result in a degraded UX that will fix itself with time.
Missing events would be an order of magnitude harder problem to solve.
HTML wiki content dumps and other public datasets - T182351
Another category of tools that depend on the correctness of the events is derived datasets that the foundation could publish. This includes the equivalent of the wiki dumps on which the analytics wiki history datasets are based, which could be replaced with a snapshot-less and continuous log of revisions. Another example is the HTML dumps discussed in T182351: Make HTML dumps available, which the OKAPI team can also relate to, as well as any number of other datasets that one can think of.
Wikimedia Enterprise AKA Okapi
If you don't have consistent events, how else would you get the data you need for your use case? - We heavily rely on events to maintain our dataset. Basically, we do CDC from event streams to maintain our dataset. Not having consistent events means that our dataset gets out of sync and we need to engineer something on top of events to make sure that it is consistent. Just FYI, we are acknowledging that events may not be consistent and putting that problem into a box for now, but that's probably going to be our next bridge to cross.
Potential solutions
Event Sourcing is an approach that event driven architectures use to ensure they have a single consistent source of truth that can be used to build many downstream applications. If we were building an application from scratch, this might be a great way to start. However, MediaWiki + MySQL already exist as our source of truth, and migrating it to an Event Sourced architecture all at once is intractable.
In lieu of completely re-architecting MediaWiki's data source, there are a few possible approaches to solving this problem in a more incremental way.
Change Data Capture (CDC)
CDC uses the MySQL replication binlog to produce state change events. This is the same source of data used to keep the MySQL read replicas up to date.
Description
A binlog reader such as Debezium would produce database change events to Kafka. This reader may be able to transform the database change events into a more useful data model (e.g. mediawiki/revision/create), or the transformation may be done later by a stream processing framework such as Flink or Kafka Streams.
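To make this concrete, here is a rough Python sketch of the kind of transformation such a stream processing job would need to do to turn a low-level row change into something like a mediawiki/revision/create event. The envelope shape loosely follows Debezium's op/after format; the column names and the output schema here are illustrative assumptions, not the actual schemas.

```python
# Illustrative sketch only: map a low-level Debezium-style row change on the
# revision table into a higher-level domain event. Field names and the output
# schema are assumptions, not the real MediaWiki/Debezium schemas.

import json
from typing import Optional


def revision_row_change_to_domain_event(change: dict) -> Optional[dict]:
    """Map a binlog row change on the revision table to a domain event."""
    if change.get("op") != "c":  # only handle row inserts (creates)
        return None
    row = change["after"]
    return {
        "$schema": "/mediawiki/revision/create/1.0.0",  # assumed schema URI
        "rev_id": row["rev_id"],
        "page_id": row["rev_page"],
        "rev_timestamp": row["rev_timestamp"],
        # Joining in page title, actor name, comment text, etc. would require
        # lookups against other tables' change streams -- this is where a
        # stateful stream processor (Flink, Kafka Streams) would come in.
    }


if __name__ == "__main__":
    example_change = {
        "op": "c",
        "after": {"rev_id": 12345, "rev_page": 678, "rev_timestamp": "20240101000000"},
    }
    print(json.dumps(revision_row_change_to_domain_event(example_change), indent=2))
```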
Pros
- No MediaWiki code changes needed
- Events are guaranteed to be produced for every database state change
- May be possible to guarantee each event is produced exactly once
- Would allow us to incrementally Event Source MediaWiki (if we wanted to)
Cons
- Events are emitted (by default?) in a low-level database change model rather than a higher-level domain model, and need to be joined together and transformed by something, most likely a stateful stream processing application.
- WMF's MariaDB replication configuration may not support this (we may need GTIDs).
- Data Persistence is not excited about maintaining more 'unicorn' replication setups.
Transactional Outbox
This makes use of database transactions and a separate poller process to produce events.
See also: https://microservices.io/patterns/data/transactional-outbox.html
Description
Here's how this might work with the revision table:
When a revision is to be inserted into the MySQL revision table:
- A MySQL transaction is started.
- A record is inserted into both the revision table and the revision_event_log table.
- The MySQL transaction is committed. Since both writes happen in the same transaction, they happen atomically.
- The revision event is produced to Kafka.
- When the Kafka produce request succeeds, the revision_event_log record's produced_at timestamp (or boolean) field is set.
A separate process polls the revision_event_log table for records where produced_at is NULL, produces them to Kafka, and sets produced_at when the produce request succeeds.
If needed, revision_event_log records may be removed after they are successfully produced.
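As a minimal sketch of how the write path and the poller could fit together (assuming a PyMySQL-style DB-API connection and the kafka-python client; the revision_event_log table and its columns are hypothetical, per the description above). Only the poller's produce step is shown; the optimistic produce right after the commit would use the same send-then-mark step.

```python
# Minimal sketch of the transactional outbox pattern, assuming a PEP 249
# (DB-API) MySQL connection and the kafka-python client. Table and column
# names (revision_event_log, event_json, produced_at) are hypothetical.

import json

from kafka import KafkaProducer  # pip install kafka-python


def save_revision(conn, revision_row, event):
    """Write the revision row and its outbox record in one MySQL transaction."""
    with conn.cursor() as cur:
        cur.execute(
            "INSERT INTO revision (rev_page, rev_timestamp) VALUES (%s, %s)",
            (revision_row["rev_page"], revision_row["rev_timestamp"]),
        )
        cur.execute(
            "INSERT INTO revision_event_log (event_json, produced_at) VALUES (%s, NULL)",
            (json.dumps(event),),
        )
    conn.commit()  # both inserts become visible atomically, or neither does


def poll_and_produce(conn, producer: KafkaProducer, topic="mediawiki.revision-create"):
    """Poller: produce any not-yet-produced outbox rows to Kafka, then mark them."""
    with conn.cursor() as cur:
        cur.execute(
            "SELECT id, event_json FROM revision_event_log WHERE produced_at IS NULL"
        )
        for row_id, event_json in cur.fetchall():
            # Block until Kafka acks the produce request; only then mark the row.
            producer.send(topic, value=event_json.encode("utf-8")).get(timeout=10)
            cur.execute(
                "UPDATE revision_event_log SET produced_at = NOW() WHERE id = %s",
                (row_id,),
            )
    conn.commit()
```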
Pros
- Events can be emitted using whatever model we choose
- Since MW generally wraps all DB writes in a transaction, no MW core change needed. This could be done in an extension.
Cons
- At-least-once delivery guarantee for events, but this should be fine. There may be ways to easily detect duplicate events.
- Separate polling process to run and manage.
Hybrid: Change Data Capture via Transactional Outbox
This is a hybrid of the above two approaches. The main difference is that instead of using CDC to emit change events for all MySQL tables, we only emit change events for the event outbox tables.
This idea is from Debezium: https://debezium.io/blog/2019/02/19/reliable-microservices-data-exchange-with-the-outbox-pattern/
Description
MediaWiki would be configured to write all changes in a transaction together with writes to the outbox tables. When a revision is to be inserted into the revision table, a MySQL transaction is started. A record is inserted into the revision table as well as the revision_event_outbox table. The revision_event_outbox record has a field containing a JSON string representing the payload of the change event. The transaction is then committed.
A binlog reader such as Debezium would then filter for changes to the revision_event_outbox table (likely extracting only the JSON event payload) and emit only those to Kafka.
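As a small illustration of where the payload lives, here is a hedged Python sketch of extracting the domain event from a Debezium change record for the hypothetical revision_event_outbox table. In practice, Debezium's outbox event router transform can do this extraction inside Kafka Connect, so no consumer-side code would be needed; the 'payload' column name and the plain JSON envelope (no Connect schema wrapper) are assumptions.

```python
# Illustrative sketch: extract the domain event payload from a Debezium change
# record for the (hypothetical) revision_event_outbox table. Assumes a plain
# JSON-serialized Debezium envelope and a 'payload' column written by MediaWiki.

import json
from typing import Optional


def extract_outbox_payload(debezium_record_value: bytes) -> Optional[dict]:
    envelope = json.loads(debezium_record_value)
    if envelope.get("op") != "c":  # only outbox inserts carry new events
        return None
    outbox_row = envelope["after"]
    # 'payload' is the assumed name of the JSON string column in the outbox row.
    return json.loads(outbox_row["payload"])
```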
Pros
- Events can be emitted using whatever model we choose
- Events are guaranteed to be produced for every database state change
- May be possible to guarantee each event is produced exactly once
- No need to transform from low level database changes to high level domain models.
- Since MW generally wraps all DB writes in a transaction, no MW core change needed. This could be done in an extension.
- Would allow us to incrementally Event Source MediaWiki (if we wanted to)
Cons
- WMF's MariaDB replication configuration may not support this (we may need GTIDs).
- Data Persistence is not excited about maintaining more 'unicorn' replication setups.
2 Phase Commit with Kafka Transactions
This may or may not be possible, and requires more research if we want to consider it. Implementing it would likely be difficult and error prone, and could have an adverse effect on MediaWiki performance. If we do need Kafka Transactions, this might be impossible anyway, unless a good PHP Kafka client is written.
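For context only, here is a purely illustrative Python sketch of the Kafka half of such a scheme, using the confluent-kafka client's transactional producer API (MediaWiki itself is PHP, which is exactly the client gap noted above; the broker address and transactional.id are placeholders). Kafka transactions only make a batch of produces atomic within Kafka; atomically coordinating that commit with the MySQL commit is the unresolved 2-phase-commit part.

```python
# Purely illustrative: the Kafka side of a transactional produce using the
# confluent-kafka Python client. This does NOT solve the 2-phase commit
# problem by itself -- the MySQL COMMIT would still need to be coordinated.

from confluent_kafka import KafkaException, Producer


def produce_revision_events_transactionally(events, topic="mediawiki.revision-create"):
    """Produce a batch of already-serialized events atomically within Kafka."""
    producer = Producer({
        "bootstrap.servers": "localhost:9092",          # assumed broker address
        "transactional.id": "mediawiki-revision-tx-1",  # hypothetical id
    })
    producer.init_transactions()
    producer.begin_transaction()
    try:
        for event in events:
            producer.produce(topic, value=event)
        # For true 2-phase commit, the MySQL COMMIT and this commit would have
        # to succeed or fail together -- that coordination is the open problem.
        producer.commit_transaction()
    except KafkaException:
        producer.abort_transaction()
        raise
```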