
Reliable (atomic) MediaWiki event production / MediaWiki events as source of truth
Open, Medium, Public

Description

T116786 introduced MediaWiki Event-Platform production via an extension utilizing hooks. While adequate for the EventBus MVP, this is only an interim solution. Ultimately, we need a mechanism that guarantees event delivery (eventual consistency is OK).

The Event Platform program extended the work started in T116786 to provide standardized event-producing APIs, unified for both production and analytics purposes.

However, in order to build truly reliable new production services with events based on MediaWiki data, we need a single source of truth for MediaWiki data. That source of truth is the MediaWiki MySQL database, which is consistently accessible only by MediaWiki itself. There is currently no way to consistently expose (real-time) MediaWiki data to non-MediaWiki applications.

We do have events produced by MediaWiki, but these events are decoupled from the MySQL writes, and there is no guarantee that e.g. every revision table save results in a mediawiki.revision-create event. This means that as of today, MediaWiki events cannot be relied on as a 'source of truth' for MediaWiki data. They are not much more than best-effort (really good!) notifications.

Background reading: Turning the database inside out

Potential approaches

Event Sourcing is an approach that event driven architectures use to ensure they have a single consistent source of truth that can be used to build many downstream applications. If we were building an application from scratch, this might be a great way to start. However, MediaWiki + MySQL already exist as our source of truth, and migrating it to an Event Sourced architecture all at once is intractable.

In lieu of completely re-architecting MediaWiki's data source, there are a few possible approaches to solving this problem in a more incremental way.


Change Data Capture (CDC)

CDC uses the MySQL replication binlog to produce state change events. This is the same source of data used to keep the read MySQL replicas up to date.

Description
A binlog reader such as Debezium would produce database change events to Kafka. This reader may be able to transform the database change events into a more useful data model (e.g. mediawiki/revision/create), or the transformation may be done later by a stream processing framework such as Flink or Kafka Streams.
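To make that transformation concrete, here is a rough sketch, written as plain SQL over hypothetical per-table change streams (the stream names and the target revision_create_events stream are invented; the joined columns are real MediaWiki schema fields), of the kind of join a stream processor would have to perform to turn row-level changes into a mediawiki/revision/create-style event:

```
-- Illustrative only: joining low-level per-table change streams (as a CDC
-- tool like Debezium would emit them) into a higher-level domain event.
INSERT INTO revision_create_events
SELECT
  r.rev_id,
  r.rev_timestamp,
  a.actor_name     AS performer,
  c.comment_text   AS comment,
  p.page_title,
  p.page_namespace
FROM revision_changes r                  -- row-level changes to `revision`
JOIN actor_changes    a ON a.actor_id   = r.rev_actor
JOIN comment_changes  c ON c.comment_id = r.rev_comment_id
JOIN page_changes     p ON p.page_id    = r.rev_page
WHERE r.change_type = 'insert';
```

In practice this would be a Flink or Kafka Streams job (or Flink SQL), and the joins are stateful, since the related rows arrive on different topics at different times.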

Pros

  • No MediaWiki code changes needed
  • Events are guaranteed to be produced for every database state change
  • May be possible to guarantee each event is produced exactly once
  • Would allow us to incrementally Event Source MediaWiki (if we wanted to)

Cons

  • Events are emitted (by default?) in a low level database change model, instead of a higher level domain model, and need to be joined together and transformed by something, most likely a stateful stream processing application.

Transactional Outbox

This makes use of database transactions and a separate poller process to produce events.

See also: https://microservices.io/patterns/data/transactional-outbox.html

Description
Here's how this might work with the revision table:

When a revision is to be inserted into the MySQL revision table, a MySQL transaction is started. A record is inserted into both the revision table and the revision_event_log table. The MySQL transaction is committed. Since this is done in a transaction, we can be sure that both of the table writes happen atomically. The revision event is produced to Kafka. When the Kafka produce request succeeds, the revision_event_log's produced_at timestamp (or boolean) field is set.
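A minimal SQL sketch of the transactional part, assuming a hypothetical revision_event_log table with (rev_id, event_payload, produced_at) columns; the revision columns are real, the values purely illustrative:

```
-- Sketch only: both writes commit together, or neither does.
START TRANSACTION;

INSERT INTO revision
  (rev_page, rev_actor, rev_comment_id, rev_timestamp, rev_len, rev_sha1, rev_parent_id)
VALUES
  (42, 7, 11, '20210121160700', 2048, 'phoiac9h4m842xq45sp7s6u21eteeq1', 0);

SET @rev_id = LAST_INSERT_ID();

-- Outbox row; produced_at stays NULL until the Kafka produce succeeds.
INSERT INTO revision_event_log (rev_id, event_payload, produced_at)
VALUES (@rev_id, JSON_OBJECT('$schema', '/mediawiki/revision/create/1.0.0', 'rev_id', @rev_id), NULL);

COMMIT;

-- Outside the transaction: attempt the Kafka produce, and only on success:
UPDATE revision_event_log SET produced_at = NOW() WHERE rev_id = @rev_id;
```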

A separate process polls the revision_event_log table for records where produced_at is NULL, produces them to Kafka, and sets produced_at when the produce request succeeds.
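The poller's side of the contract might look something like this (same hypothetical table as above, arbitrary batch size):

```
-- Run periodically by a separate process.

-- 1. Find events whose produce attempt has not (yet) succeeded:
SELECT rev_id, event_payload
FROM revision_event_log
WHERE produced_at IS NULL
ORDER BY rev_id
LIMIT 1000;

-- 2. Produce each event_payload to Kafka; only after the produce request is
--    acknowledged, mark that row as produced:
UPDATE revision_event_log SET produced_at = NOW() WHERE rev_id = 12345;  -- id of the produced row
```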

If needed, revision_event_log records may be removed after they are successfully produced.

NOTE: This example is just one of various ways a Transactional Outbox might be implemented. The core idea is the use of MySQL transactions and a separate poller to ensure that all events are produced.

Pros

  • Events can be emitted modeled as we choose
  • Since MW generally wraps all DB writes in a transaction, no MW core change needed. This could be done in an extension.

Cons

  • Only an at-least-once guarantee for events, but this should be fine. There may be ways to easily detect duplicate events.
  • Separate polling process to run and manage.

Hybrid: Change Data Capture via Transactional Outbox

This is a hybrid of the above two approaches. The main difference is instead of using CDC to emit change events on all MySQL tables, we only emit change events for event outbox tables.

This idea is from Debezium: https://debezium.io/blog/2019/02/19/reliable-microservices-data-exchange-with-the-outbox-pattern/

Description
MediaWiki would be configured to write all changes in a transaction together with the corresponding outbox table writes. When a revision is to be inserted into the revision table, a MySQL transaction is started. A record is inserted into the revision table as well as the revision_event_outbox table. The revision_event_outbox record includes a field containing a JSON string representing the payload of the change event. The transaction is then committed.
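A sketch of what that write might look like, assuming a hypothetical revision_event_outbox table whose payload column carries the serialized domain event (column names are illustrative, loosely modeled on the Debezium outbox post linked above):

```
-- Sketch only: the domain event is written in the same transaction as the
-- revision row; MediaWiki never talks to Kafka directly in this approach.
START TRANSACTION;

INSERT INTO revision
  (rev_page, rev_actor, rev_comment_id, rev_timestamp, rev_len, rev_sha1, rev_parent_id)
VALUES
  (42, 7, 11, '20210121160700', 2048, 'phoiac9h4m842xq45sp7s6u21eteeq1', 0);

SET @rev_id = LAST_INSERT_ID();

INSERT INTO revision_event_outbox (aggregate_id, event_type, payload)
VALUES (
  @rev_id,
  'mediawiki.revision-create',
  JSON_OBJECT('$schema', '/mediawiki/revision/create/1.0.0', 'rev_id', @rev_id)
);

COMMIT;

-- A binlog reader (e.g. Debezium) is configured to capture only
-- revision_event_outbox and forward the payload column to Kafka.
```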

A binlog reader such as Debezium would then filter for changes to the revision_event_outbox table (likely extracting only the JSON event payload) and emit only those to Kafka.

Pros

  • Events can be emitted modeled as we choose
  • Events are guaranteed to be produced for every database state change
  • May be possible to guarantee each event is produced exactly once
  • No need to transform from low level database changes to high level domain models.
  • Since MW generally wraps all DB writes in a transaction, no MW core change needed. This could be done in an extension.
  • Would allow us to incrementally Event Source MediaWiki (if we wanted to)

Cons

  • ?

2 Phase Commit with Kafka Transactions

This may or may not be possible and requires more research if we want to consider it. Implementing it would likely be difficult and error-prone, and could have an adverse effect on MediaWiki performance. If we do need Kafka Transactions, this might be impossible anyway unless a good PHP Kafka client is written.

Event Timeline

Eevans created this task. Dec 3 2015, 5:21 PM
Eevans raised the priority of this task to Medium.
Eevans updated the task description.
Eevans added projects: Services, MediaWiki-API.
Eevans added subscribers: Eevans, Ottomata, mobrovac and 4 others.
Restricted Application added a subscriber: Aklapper. Dec 3 2015, 5:21 PM
Anomie set Security to None.
Anomie added a subscriber: Anomie.

If there's anything here that has to do with the action API, I'm not seeing it. Removing MediaWiki-API.

Anomie removed a subscriber: Anomie. Dec 3 2015, 5:41 PM
Pchelolo moved this task from Backlog to later on the Services board. Oct 12 2016, 7:57 PM
Pchelolo edited projects, added Services (later); removed Services.

Facebook actually heavily relies on SQL comments to pass event information to binlog tailer daemons (see the TAO paper). We currently use those SQL comments only to mark the source of a SQL query (PHP function), but could potentially add some annotations that would make it easy to generically extract & export such events into individual Kafka topics.
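As a purely illustrative sketch of that idea (the event: annotation format below is invented; today the comments only identify the calling PHP function), an annotated write might look like:

```
-- Hypothetical: a binlog tailer would strip the comment out of the statement
-- and produce its JSON body to a Kafka topic named after the event.
INSERT /* event:mediawiki.revision-create {"rev_id": 12345, "page_id": 42} */
INTO revision
  (rev_page, rev_actor, rev_comment_id, rev_timestamp, rev_len, rev_sha1, rev_parent_id)
VALUES
  (42, 7, 11, '20210121160700', 2048, 'phoiac9h4m842xq45sp7s6u21eteeq1', 0);
```

This presumably also requires the binlog to retain the original statement text (e.g. statement-based logging, or row-based logging with something like MySQL's binlog_rows_query_log_events enabled).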

Also, I get the impression that Kafka SQL connectors are getting better,
e.g. http://debezium.io/

@Ottomata, from a cursory look at those connectors, it looks like they all aim to capture all SQL updates (update, insert, delete). They don't seem to be targeted at emitting specific semantic events, such as the ones we are interested in for EventBus. This is where the SQL comment idea could help, by letting us essentially embed the events we want to have emitted in the statement, rather than trying to reverse-engineer an event from raw SQL statement(s).

BTW, there's been some recent talk about using Debezium (or something) for incremental updates of mediawiki history in Hadoop, which would help replace the analytics-store slaves. CC @Milimetric. @mobrovac perhaps events from the MySQL binlog (if coupled with a stream processing framework?) would also be helpful for dependency tracking?

@mobrovac perhaps events from the MySQL binlog (if coupled with a stream processing framework?) would also be helpful for dependency tracking?

Hm, I would rather rely on atomic structures inside our code base for that. Using the binlog is a bit tricky from the semantic perspective, because it implies intimate internal knowledge of the SQL structures used in MW (read: custom transaction-to-event mapping), which makes it hard to keep up to date. Also, it raises the bar from the portability perspective (other stores, environments, etc).

it implies intimate internal knowledge of the SQL structures used in MW

Aye

it raises the bar from the portability perspective

Hm, not necessarily. If we have a solid stream processing system, it might not be too hard to map the db-based events to a more agnostic stream of events.

Anyway, just an idea :)

We still have to check Debezium with the DBAs and hear their thoughts on it, but it's possible we could go forward with both ways of generating events and figure out which is easier in practice:

  • Continue to improve how mediawiki sends events to our general event infrastructure, if there are questions people are asking of the data, add more instrumentation
  • Meanwhile, get everything through Debezium and try to answer questions by converting transactions to events as Marko points out. More techy analysts could dig through the raw transactions?

The same event infrastructure would support both of these approaches, and until we get really good at the first bullet point we would probably need to do the second anyway. So maybe we don't need to choose before we start?

Restricted Application added a project: Analytics. Jun 29 2018, 7:21 PM
Milimetric moved this task from Incoming to Event Platform on the Analytics board. Jul 5 2018, 4:36 PM
Ottomata moved this task from Backlog to Radar on the Event-Platform board. Apr 14 2020, 1:26 PM
Krinkle removed a subscriber: Krinkle. Apr 14 2020, 2:50 PM
Ottomata added a comment (edited). Sep 17 2020, 3:40 PM

We might be able to achieve this with a Kafka client in MW that uses Kafka transactions. But then, we'd need a good PHP Kafka client, and would likely have to fire up a new Kafka producer connection with every MW request, which is not very efficient. We'd also have to somehow tie the Kafka produce call and the MySQL DB write call together into a transaction.

Ottomata updated the task description. Dec 4 2020, 4:06 PM

We'd also have to somehow tie the Kafka produce call and the MySQL DB write call together into a transaction.

To do this I think we'd need some kind of two phase commit service for MediaWiki, which sounds really hard to me!

@dianamontalion
I updated this task with what I hope is a more descriptive account of the problem and some possible solutions. I really think solving this is key to the Architecture team's mission of supporting 'infinite use cases insert rest of arch mission phraseology here' :)

@claroski o/ Curious, is Platform Engineering not a relevant tag? Work on this would very likely affect things like ChangeProp and JobQueue.

Ottomata updated the task description. Dec 17 2020, 9:13 PM
Ottomata updated the task description. Dec 23 2020, 3:42 PM
Ottomata updated the task description. Dec 23 2020, 6:16 PM
Ottomata updated the task description. Jan 5 2021, 4:12 PM

@Clarakosi: I think @Ottomata meant to ping you above, adding here.

@Clarakosi: I think @Ottomata meant to ping you above, adding here.

Thanks!

@claroski o/ Curious, is Platform Engineering not a relevant tag? Work on this would very likely affect things like ChangeProp and JobQueue.

I think I understood it as needing feedback from the Architecture team first but retagging with Platform Engineering Roadmap Decision Making for review.

At the moment I don't think there are any actionables, but there might be in a couple of quarters. Not sure if there is a decision to be made yet, but there will be eventually!

FYI, I had a chat with @Krinkle yesterday, and he informed me that all MediaWiki browser-client-generated writes to MediaWiki MySQL are wrapped in a transaction anyway! So, a Transactional Outbox solution would most likely be possible without any changes to MediaWiki core. We should follow up on how this is different for API requests, maintenance, or JobQueue based writes, as we'd likely have to have those wrapped in a transaction too.

Ottomata updated the task description. Jan 21 2021, 4:07 PM
Ottomata renamed this task from Reliable (atomic) MediaWiki event production to Reliable (atomic) MediaWiki event production / MediaWiki events as source of truth. Jan 21 2021, 4:11 PM
Joe added a subscriber: Joe. Fri, Feb 12, 8:16 AM