T116786 introduced production of MediaWiki events to #EventBus [[https://gerrit.wikimedia.org/r/#/admin/projects/mediawiki/extensions/EventBus|via an extension]] that uses hooks. While adequate for the [[https://phabricator.wikimedia.org/T114443|EventBus MVP]], this is only an interim solution. Ultimately, we need a mechanism that guarantees event delivery (eventual consistency is OK).
The [[ https://wikitech.wikimedia.org/wiki/Event_Platform | Event Platform ]] program extended the work started in T116786 to provide standardized event production APIs, unified for both production and analytics purposes.
However, in order to build truly reliable new production services with events based on MediaWiki data, we need a single source of truth for MediaWiki data. That source of truth is the MediaWiki MySQL database, which is only consistently accessible by MediaWiki itself. There is currently no way to consistently expose (real-time) MediaWiki data to non-MediaWiki applications.
We do have events produced by MediaWiki, but these events are decoupled from the MySQL writes, and there is no guarantee that e.g. every revision table save results in a mediawiki.revision-create event. This means that, as of today, MediaWiki events cannot be relied on as a 'source of truth' for MediaWiki data. They are not much more than a best-effort (albeit really good!) notification.
## Potential approaches
[[ https://martinfowler.com/eaaDev/EventSourcing.html | Event Sourcing ]] is an approach that event driven architectures use to ensure they have a single consistent source of truth that can be used to build many downstream applications. If we were building an application from scratch, this might be a great way to start. However, MediaWiki + MySQL already exist as our source of truth, and migrating it to an Event Sourced architecture all at once is intractable.
In lieu of completely re-architecting MediaWiki's data source, there are a few possible approaches to solving this problem in a more incremental way.
---
### Change Data Capture (CDC)
CDC uses the MySQL replication binlog to produce state change events. This is the same data source used to keep MySQL read replicas up to date.
**Description**
A binlog reader such as [[ https://debezium.io/ | debezium ]] would produce database change events to Kafka. This reader may be able to transform the database change events into a more useful data model (e.g. [[ https://schema.wikimedia.org/repositories/primary/jsonschema/mediawiki/revision/create/latest | mediawiki/revision/create ]]), or the transformation may be done later by a Stream Processing framework such as [[ https://flink.apache.org/ | Flink ]] or [[ https://kafka.apache.org/documentation/streams/ | Kafka Streams ]].
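For illustration, here is a minimal sketch (in Python, assuming a Debezium-style change envelope; the MediaWiki column names and output fields are illustrative, not the canonical schema) of the kind of transformation step such a reader or stream processor would perform:

```python
# Sketch: turning a low-level Debezium change event for the `revision` table
# into something closer to a mediawiki/revision/create domain event.
# Column names (rev_id, rev_page, ...) and output fields are illustrative.

def transform_revision_change(change_event: dict):
    payload = change_event.get("payload", {})
    if payload.get("op") != "c":
        # Only row inserts ("c" = create) map to revision-create events.
        return None
    row = payload["after"]  # column values of the newly inserted row
    return {
        "$schema": "/mediawiki/revision/create/1.0.0",
        "rev_id": row["rev_id"],
        "page_id": row["rev_page"],
        "rev_timestamp": row["rev_timestamp"],
        "rev_parent_id": row.get("rev_parent_id"),
        # Higher level fields (page title, performer, comment text, ...) are
        # not present in the revision row itself and would require joins
        # against other tables or streams.
    }
```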
**Pros**
* No (or minimal?) MediaWiki code changes needed
* Events are guaranteed to be produced for every database state change
* May be possible to guarantee each event is produced exactly once
* Would allow us to incrementally Event Source MediaWiki (if we wanted to)
**Cons**
* Events are emitted (by default?) in a low-level database change model, instead of a higher-level domain model, and need to be transformed by something
---
### Transactional Outbox
This is a hybrid method that makes use of database transactions and a separate poller process to produce events.
See also: https://microservices.io/patterns/data/transactional-outbox.html
**Description**
Here's how this might work with the revision table:
When a revision is to be inserted into the MySQL `revision` table:

1. A MySQL transaction is started.
2. A record is inserted into both the `revision` table and a `revision_event_log` table.
3. The MySQL transaction is committed. Because both writes happen in the same transaction, we can be sure they are applied atomically.
4. The revision event is produced to Kafka.
5. When the Kafka produce request succeeds, the `revision_event_log` row's `produced_at` timestamp (or boolean) field is set.
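A rough sketch of the transactional write (in Python with a PyMySQL-style connection rather than MediaWiki's PHP; table and column names are simplified and illustrative) might look like:

```python
import json

def save_revision_with_event_log(conn, revision: dict, event: dict):
    """Insert the revision row and its outbox row in one MySQL transaction."""
    with conn.cursor() as cur:
        cur.execute(
            "INSERT INTO revision (rev_page, rev_timestamp, rev_actor) "
            "VALUES (%s, %s, %s)",
            (revision["page_id"], revision["timestamp"], revision["actor"]),
        )
        rev_id = cur.lastrowid
        cur.execute(
            # produced_at stays NULL until a Kafka produce request succeeds.
            "INSERT INTO revision_event_log (rev_id, event_json, produced_at) "
            "VALUES (%s, %s, NULL)",
            (rev_id, json.dumps({**event, "rev_id": rev_id})),
        )
    conn.commit()  # both rows become visible atomically
    return rev_id
```

After the commit, the application attempts the Kafka produce and, on success, sets `produced_at` for the new `revision_event_log` row.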
A separate process polls the `revision_event_log` table for records where `produced_at` is NULL, produces them to Kafka, and sets `produced_at` when the produce request succeeds.
If needed, `revision_event_log` records may be removed after they are successfully produced.
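The poller itself could be as simple as the following sketch (Python with confluent-kafka; the `id` primary key, topic name, and column names are assumptions for illustration):

```python
from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})

def produce_pending_events(conn, topic="mediawiki.revision-create"):
    """Produce any revision_event_log rows that have not yet reached Kafka."""
    with conn.cursor() as cur:
        cur.execute(
            "SELECT id, event_json FROM revision_event_log "
            "WHERE produced_at IS NULL ORDER BY id LIMIT 100"
        )
        pending = cur.fetchall()
    for row_id, event_json in pending:
        producer.produce(topic, value=event_json)
        producer.flush()  # wait for the broker ack before marking the row
        with conn.cursor() as cur:
            cur.execute(
                "UPDATE revision_event_log SET produced_at = NOW() WHERE id = %s",
                (row_id,),
            )
        conn.commit()
```

Flushing per record keeps the sketch simple; a real poller would batch produce requests and mark rows only after delivery reports confirm them.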
NOTE: This example is just one of various ways a Transactional Outbox might be implemented. The core idea is the use of MySQL transactions and a separate poller to ensure that all events are produced.
**Pros**
* Events can be modeled and emitted however we choose
**Cons**
* Substantial MediaWiki code changes needed
* At-least-once delivery guarantee for events, but this should be fine. There may be easy ways to detect duplicate events downstream.
---
### Change Data Capture via Transactional Outbox
This is a hybrid of the above two approaches. The main difference is that instead of using CDC to emit change events for all MySQL tables, change events are emitted only for a single outbox table.
This idea is from Debezium: https://debezium.io/blog/2019/02/19/reliable-microservices-data-exchange-with-the-outbox-pattern/
**Description**
MediaWiki would be configured to write every change inside a transaction that also writes to an outbox table. When a revision is to be inserted into the `revision` table, a MySQL transaction is started. A record is inserted into the `revision` table as well as the `event_outbox` table. The `event_outbox` record includes a field containing a JSON string representing the payload of the change event. The transaction is then committed.
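As a sketch (again Python/DB-API style for illustration; the `aggregate_type`/`aggregate_id`/`payload` columns follow the outbox layout from the Debezium post linked above), the write path differs from the previous approach mainly in that MediaWiki never talks to Kafka:

```python
import json

def save_revision_with_outbox(conn, revision: dict, domain_event: dict):
    """Write the revision and its domain event payload in one transaction."""
    with conn.cursor() as cur:
        cur.execute(
            "INSERT INTO revision (rev_page, rev_timestamp, rev_actor) "
            "VALUES (%s, %s, %s)",
            (revision["page_id"], revision["timestamp"], revision["actor"]),
        )
        cur.execute(
            "INSERT INTO event_outbox (aggregate_type, aggregate_id, payload) "
            "VALUES (%s, %s, %s)",
            ("revision", cur.lastrowid, json.dumps(domain_event)),
        )
    conn.commit()
    # No Kafka produce here: Debezium reads the event_outbox changes from the
    # binlog and emits them to Kafka.
```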
A binlog reader such as Debezium would then filter for changes to the `event_outbox` table and emit only those to Kafka.
**Pros**
* Fewer initial event models to deal with, all change data goes into a single Kafka topic
* Events are guaranteed to be produced for every database state change
* May be possible to guarantee each event is produced exactly once
**Cons**
* Substantial MediaWiki code changes needed (though not as many as for the Transactional Outbox without CDC)
* Events are emitted (by default?) in a low-level database change model, instead of a higher-level domain model, and need to be transformed by something
* The single topic would need to be split into multiple per-entity (table) change topics after domain model transformation; see the sketch after this list
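For example, a small stream processing job (sketched in Python with confluent-kafka; the topic names and `aggregate_type` field are assumptions) could perform that per-entity split:

```python
import json
from confluent_kafka import Consumer, Producer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "outbox-router",
    "auto.offset.reset": "earliest",
})
producer = Producer({"bootstrap.servers": "localhost:9092"})
consumer.subscribe(["mediawiki.event_outbox"])

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    outbox_record = json.loads(msg.value())
    # Route each outbox record to a per-entity topic,
    # e.g. mediawiki.revision-change.
    target_topic = "mediawiki.%s-change" % outbox_record["aggregate_type"]
    producer.produce(target_topic, value=json.dumps(outbox_record["payload"]))
    producer.flush()
```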
---
### 2 Phase Commit with Kafka Transactions
This may or may not be possible and requires more research if we want to consider it. Implementing it would likely be difficult and error-prone, and could have an adverse effect on MediaWiki performance. If we do need Kafka Transactions, this might be impossible anyway, unless a good PHP Kafka client is written.