
Reliable (atomic) MediaWiki event production
Open, NormalPublic

Description

T116786 introduced MediaWiki EventBus production via an extension utilizing hooks. While adequate for the EventBus MVP, this is only an interim solution. Ultimately, we need a mechanism that guarantees event delivery (eventual consistency is OK).

Potential approaches

See also:

Event Timeline

Eevans created this task. Dec 3 2015, 5:21 PM
Eevans raised the priority of this task to Normal.
Eevans updated the task description.
Eevans added projects: Services, MediaWiki-API.
Eevans added subscribers: Eevans, Ottomata, mobrovac and 4 others.
Restricted Application added a subscriber: Aklapper. Dec 3 2015, 5:21 PM
Anomie set Security to None.
Anomie added a subscriber: Anomie.

If there's anything here that has to do with the action API, I'm not seeing it. Removing MediaWiki-API.

Anomie removed a subscriber: Anomie. Dec 3 2015, 5:41 PM
Pchelolo moved this task from Backlog to later on the Services board. Oct 12 2016, 7:57 PM
Pchelolo edited projects, added Services (later); removed Services.

Facebook actually heavily relies on SQL comments to pass event information to binlog tailer daemons (see the TAO paper). We currently use those SQL comments only to mark the source of a SQL query (PHP function), but could potentially add some annotations that would make it easy to generically extract & export such events into individual Kafka topics.
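A minimal sketch of the annotation idea, with a hypothetical `/* event:... */` comment format (MediaWiki's real SQL comments only name the originating PHP function; the format and helpers here are illustrative, not an existing API):

```python
import json
import re


def annotate_sql(sql, event):
    """Append a semantic event to a statement as a JSON-encoded SQL comment.

    A binlog tailer can then emit the embedded event verbatim, rather
    than reverse-engineering one from the raw statement.
    """
    payload = json.dumps(event, separators=(",", ":"))
    return f"{sql} /* event:{payload} */"


def extract_event(statement):
    """Recover the embedded event from an annotated statement, if present."""
    m = re.search(r"/\* event:(\{.*\}) \*/", statement)
    return json.loads(m.group(1)) if m else None


stmt = annotate_sql(
    "UPDATE page SET page_touched = NOW() WHERE page_id = 42",
    {"topic": "mediawiki.page-update", "page_id": 42},
)
```

Since the comment rides along inside the same SQL statement, the event is written (or not written) atomically with the data change itself, which is exactly the delivery guarantee this task is after.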

Also, I get the impression that Kafka SQL connectors are getting better,
e.g. http://debezium.io/

@Ottomata, from a cursory look at those connectors, it looks like they all aim to capture all SQL updates (update, insert, delete). They don't seem to be targeted at emitting specific semantic events, such as the ones we are interested in for EventBus. This is where the SQL comment idea could help, by letting us essentially embed the events we want to have emitted in the statement, rather than trying to reverse-engineer an event from raw SQL statement(s).
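To illustrate the distinction: a tailer built on the comment idea would export only statements that carry an explicit annotation and skip ordinary CDC noise. This is a hedged sketch; the `/* event:... */` format is hypothetical and `produce` stands in for a real Kafka producer:

```python
import json
import re
from collections import defaultdict

EVENT_RE = re.compile(r"/\* event:(\{.*\}) \*/")


def tail_binlog(statements, produce):
    """Scan replicated statements and emit only annotated semantic events.

    produce(topic, event) is a stand-in for a Kafka producer call; rows
    without an event comment are skipped rather than exported wholesale.
    """
    for stmt in statements:
        m = EVENT_RE.search(stmt)
        if m:
            event = json.loads(m.group(1))
            produce(event["topic"], event)


topics = defaultdict(list)
tail_binlog(
    [
        "UPDATE page SET page_touched = NOW() WHERE page_id = 7 "
        '/* event:{"topic":"mediawiki.page-update","page_id":7} */',
        "DELETE FROM objectcache WHERE keyname = 'x'",  # no annotation: skipped
    ],
    lambda topic, event: topics[topic].append(event),
)
```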

BTW, there's been some recent talk about using Debezium (or something) for incremental updates of mediawiki history in hadoop, which would help replace analytics-store slaves. CC @Milimetric. @mobrovac perhaps events from the MySQL binlog (if coupled with a stream processing framework?) would also be helpful for dependency tracking?

@mobrovac perhaps events from the MySQL binlog (if coupled with a stream processing framework?) would also be helpful for dependency tracking?

Hm, I would rather rely on atomic structures inside our code base for that. Using the binlog is a bit tricky from the semantic perspective, because it implies intimate internal knowledge of the SQL structures used in MW (read: custom transaction-to-event mapping), which makes it hard to keep up to date. Also, it raises the bar from the portability perspective (other stores, environments, etc.).

it implies intimate internal knowledge of the SQL structures used in MW

Aye

it raises the bar from the portability perspective

Hm, not necessarily. If we have a solid stream processing system, it might not be too hard to map the db-based events to a more agnostic stream.
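Such a mapping step might look like the following sketch. The input record shape loosely follows Debezium's change events (`op`, `source`, `after`); the output event schema is hypothetical, chosen only to show how MySQL-specific structure can be confined to one translation layer:

```python
def to_semantic_event(change):
    """Map a Debezium-style row-change record to a store-agnostic event.

    Hypothetical mapping for illustration: an update to the `page` table
    becomes a generic page-change event; all other changes are dropped.
    Downstream consumers never see MySQL-specific structure, so the
    resulting stream stays portable across storage backends.
    """
    if change.get("op") != "u" or change.get("source", {}).get("table") != "page":
        return None
    after = change["after"]
    return {
        "meta": {"topic": "mediawiki.page-change"},
        "page_id": after["page_id"],
        "page_title": after["page_title"],
    }


event = to_semantic_event({
    "op": "u",
    "source": {"db": "enwiki", "table": "page"},
    "after": {"page_id": 7, "page_title": "Main_Page", "page_touched": "20161012000000"},
})
```

The trade-off Marko raises still applies: this translation layer encodes intimate knowledge of MW's tables, so it has to be kept in sync with schema changes.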

Anyway, just an idea :)

We still have to check Debezium with the DBAs and hear their thoughts on it, but it's possible we could go forward with both ways of generating events and figure out which is easier in practice:

  • Continue to improve how mediawiki sends events to our general event infrastructure; if there are questions people are asking of the data, add more instrumentation.
  • Meanwhile, get everything through Debezium and try to answer questions by converting transactions to events, as Marko points out. More techy analysts could dig through the raw transactions?

The same event infrastructure would support both of these approaches, and until we get really good at the first bullet point we would probably need to do the second anyway. So maybe we don't need to choose before we start?

Restricted Application added a project: Analytics. Jun 29 2018, 7:21 PM