
Reliable publish / subscribe event bus
Closed, Resolved · Public

Description

We need a reliable way to distribute a variety of update events emitted from MediaWiki core (and other services) to various consumers. Currently we use the job queue for this (e.g. in the Parsoid extension), but it is fairly complex, not very reliable, and does not support multiple consumers without setting up separate job types.

We are looking for a solution that decouples producers from consumers, and gives us better reliability than the current job queue.

Benefits

  • Simplification: Avoid the need to write and maintain a separate MediaWiki extension for each event consumer. Reduce maintenance by focusing on one standard queuing solution.
  • Eliminating single points of failure: A failure of the job queue Redis instance or the EventLogging database currently causes instant failure of update jobs and the loss of events / jobs. A robust event queue eliminates these single points of failure.
  • Robust updates at scale: Updates like cache purges are currently propagated on a best-effort basis. If a node is down when the event is sent, there is no way to catch up. With more services and more aggressive caching, we will need more reliability at scale. Currently the only way to achieve this would be to create one job per consumer, which does not scale to many consumers.
  • Performance and scalability: Job queue overload has in the past slowed down edit latency significantly. Both the job queue and EventLogging are hitting scalability limits.
  • SOA integration: The job queue is a MediaWiki-specific solution that cannot be used by other services. The event queue should provide a clearly defined service interface, so that both MediaWiki and other services can produce and consume events using it.

Event type candidates

Moved to T116247: Define edit related events for change propagation.

Requirements for an implementation

  • persistent: state does not disappear on power failure & can support large delays (order of days) for individual consumers
  • no single point of failure
  • supports pub/sub consumers with varying speed
  • ideally, lets various producers enqueue new events (not just MW core)
    • example use case: RESTBase scheduling dependent updates for content variants after HTML was updated
  • can run publicly: consumers may be anyone on the public Internet (think of a random MediaWiki installation with instant Commons or instant Wikidata) rather than only selected ones with special permissions

Option 1: Kafka

Kafka is a persistent and replicated queue with support for both pub/sub and job queue use cases. We already use it at high volume for request log queueing, so we have operational experience and a working puppetization. This makes it a promising candidate.
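For a flavor of what this could look like, here is a minimal, hypothetical sketch using the kafka-python client; the topic, broker, and group names are made up. Because each consumer group receives its own copy of the stream, adding a consumer does not require a new job type:

```
import json
from kafka import KafkaProducer, KafkaConsumer

# Producer side: enqueue an event durably (acks='all' waits for the
# in-sync replicas, trading a little latency for durability).
producer = KafkaProducer(
    bootstrap_servers='kafka1001:9092',          # hypothetical broker
    acks='all',
    value_serializer=lambda v: json.dumps(v).encode('utf-8'))
producer.send('mediawiki.page-edit', {'title': 'Foo', 'revision': 12345})
producer.flush()

# Consumer side: each group_id gets its own copy of the stream, so new
# consumers can be added without creating new job types.
consumer = KafkaConsumer(
    'mediawiki.page-edit',
    bootstrap_servers='kafka1001:9092',
    group_id='cache-purger',                     # hypothetical consumer group
    auto_offset_reset='earliest',
    value_deserializer=lambda v: json.loads(v.decode('utf-8')))
for message in consumer:
    print(message.value)
```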

Rough tasks for an implementation:

Open questions


Event Timeline


The nature of these event type candidates is such that they are changes with a log that already exists at the provider. The only persistent state each consuming service needs to keep is the revision/time up to which it has applied the changes. The recent history of changes is already kept by the providing service, so a solution would not necessarily have to retain it. With a bit of care it is probably possible to apply changes out of order and idempotently, so the consuming service could process them restartably and in parallel.

The actual data can be pulled by the consumer from the provider; duplicating it in the queue probably has no benefit. I think the only place where pub/sub adds real value is signaling that new changes are available, to decrease update latency. That means only ever one event in the queue for all consumers, as each new event subsumes the information of the previous ones.
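A minimal sketch of that notify-then-pull pattern; `queue`, `provider`, and `state` are assumed interfaces standing in for whatever transport and provider API end up being chosen, not a real client library:

```
def run(queue, provider, state, apply_change):
    """Consume coalesced 'changes available' notifications, then pull."""
    while True:
        if queue.poll(timeout=1.0) is None:   # nothing new announced
            continue
        # Pull everything past our persisted cursor from the provider's log.
        for change in provider.changes_since(state['last_seen']):
            apply_change(change)              # idempotent, so replays are safe
            state['last_seen'] = change['timestamp']
```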

How is the job queue currently not reliable? ( https://www.mediawiki.org/wiki/Job_queue_redesign and https://www.mediawiki.org/wiki/Manual:Job_queue mention other problems but nothing regarding reliability problems. )

The nature of these event type candidates is such that they are changes with a log that already exists at the provider.

Wikidata might be the exception here. Most other events are not available at the provider in a reliable manner.

The actual data can be pulled by the consumer from the provider;

Is there an API for this already?

I think the only place where pub/sub adds real value is signaling that new changes are available, to decrease update latency. That means only ever one event in the queue for all consumers, as each new event subsumes the information of the previous ones.

Major benefits of an event stream are reliability, performance and simplicity. The same generic client (using WebSockets, for example) can be used to consume a variety of events.

Queues like Kafka are optimized for this use case and perform really well at high request volumes. As an example, we are currently processing about 50k messages/second (request logs) with two Kafka nodes. I know that most other events are lower volume than this, but it's good to have headroom in the system.

How is the job queue currently not reliable?

For one, it uses Redis for storage, which only supports asynchronous master/slave replication and has limited durability options. Asynchronous replication means that you are likely to lose messages in a fail-over; limited durability means that you lose messages on power loss. It also does not support multiple consumers per event (no pub/sub), which results in fairly static coupling of producer and consumer: you need to create a new job per consumer. Finally, the job runners do all kinds of processing instead of pure event delivery. Because misbehaving jobs compete with others for resources, they are less reliable than a pure event delivery system would be.

The actual data can be pulled by the consumer from the provider;

Is there an API for this already?

https://www.wikidata.org/w/api.php?action=help&modules=wbgetentities and recent changes, see also T85103 and T85100.

https://www.wikidata.org/w/api.php?action=help&modules=wbgetentities and recent changes, see also T85103 and T85100.

@JanZerebecki, none of these seem to provide the change information apart from 'something in this entity has changed'. Is there a way to efficiently get data for the actual changes?

As in a diff? Not that I am aware of. The granularity of actual changes in the DB is one entity; anything smaller might only be present before the change is applied, after which a full new revision of the entity is written. A smaller-grained diff would have to be computed from two revisions of an entity. There is also no API that combines which entities changed with their current data, but getting a list of changed entities and then fetching their current data in a second request (as sketched below) might be efficient enough CPU-wise.
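For illustration, the two-request pattern against the APIs mentioned above could look roughly like this; the cursor timestamp and batch size are example values:

```
import requests

API = 'https://www.wikidata.org/w/api.php'

# Step 1: which entities changed since our stored cursor?
rc = requests.get(API, params={
    'action': 'query', 'list': 'recentchanges',
    'rcnamespace': 0,                  # item pages live in the main namespace
    'rcend': '2014-12-18T00:00:00Z',   # example cursor: oldest change to fetch
    'rclimit': 50, 'format': 'json'}).json()
ids = {c['title'] for c in rc['query']['recentchanges']}

# Step 2: fetch the current data for those entities in one batch request.
if ids:
    entities = requests.get(API, params={
        'action': 'wbgetentities', 'ids': '|'.join(ids),
        'format': 'json'}).json()['entities']
```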

While talking to @daniel, he pointed out that I had missed that the actual serialized diff is saved in https://www.mediawiki.org/wiki/Wikibase/Schema/wb_changes; there is just no API for it.

can support large delays (order of days) for individual consumers

Do you have a strong use case to support this need? Kafka may very well be able to support this, but I'm wondering if there is a specific and strong reason for such a long retention period for an event bus. This seems like a specification for an event-based storage system rather than a communication bus.

can support large delays (order of days) for individual consumers

Do you have a strong use case to support this need?

Yes. Hosts can go down for multiple days, and if the event stream is used for something like reliable purges, it will be necessary to replay those events or throw away the entire cache. Really reliable purges will become more important once we cache for logged-in users as well.

There can also be bugs in consumers, which need to be fixed by re-starting the processing from a clean snapshot.
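As an illustration of how such a catch-up could work on Kafka (the topic, broker address, and two-day window are hypothetical, and time-based offset lookup assumes a broker version that supports it), a consumer rewinds to where it left off and replays from there:

```
import time
from kafka import KafkaConsumer, TopicPartition

consumer = KafkaConsumer(bootstrap_servers='kafka1001:9092')
tp = TopicPartition('mediawiki.page-edit', 0)   # hypothetical topic/partition
consumer.assign([tp])

# Rewind to the first message at or after the moment the host went down;
# Kafka retains the log on disk, so everything since is replayed in order.
downtime_start_ms = int((time.time() - 2 * 86400) * 1000)  # two days ago
offsets = consumer.offsets_for_times({tp: downtime_start_ms})
if offsets[tp] is not None:
    consumer.seek(tp, offsets[tp].offset)

for message in consumer:
    print(message.value)   # replayed purge/update events
```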

From what I hear, Analytics would love to get even longer event traces. @Halfak mentioned a back-of-the-envelope calculation that basically all the primary events he lists in his proposal, since the beginning of Wikipedia, might fit into 200G.

For comparison, I think we currently have several days' worth of buffer for our traffic logs in Kafka, which helps to avoid loss if the consumer has issues. That's a much higher volume, at up to 150k messages/s, while we are looking at low hundreds per second for edit-related events.

http://www.fedmsg.com might fit this need. It is used and developed by Fedora and Debian people, and is a federated, reliable message bus with message history and cryptographically authenticated JSON messages, built on 0mq with Python.

http://www.fedmsg.com might fit this need. It is used and developed by Fedora and Debian people, and is a federated, reliable message bus with message history and cryptographically authenticated JSON messages, built on 0mq with Python.

Since 0mq is not actually durable or replicated, this does not cover the 'reliable' bit.

Re signatures: We can always send signed JWTs on top of whatever solution we end up using *if* there is a need for per-message authentication. I don't see a strong need for this, though.
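For illustration, per-message signing with PyJWT could look like the sketch below; the shared key and its distribution are assumptions, and the scheme is transport-agnostic, so it would work on top of any queue:

```
import jwt  # PyJWT

SECRET = 'shared-signing-key'   # assumption: key management handled elsewhere

def sign_event(event: dict) -> str:
    # Wrap the event payload in a signed JWT before enqueueing it.
    return jwt.encode(event, SECRET, algorithm='HS256')

def verify_event(token: str) -> dict:
    # Consumers verify the signature before acting on the event.
    return jwt.decode(token, SECRET, algorithms=['HS256'])
```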

Since 0mq is not actually durable or replicated, this does not cover the 'reliable' bit.

That is done on top of 0mq. Every message is stored and numbered, and no number is ever skipped; thus if you get 4 but never received 3, you know you missed 3 and can request it. So reliability is emulated via storage and sequential numbering. (This is done transparently if the endpoint is configured that way; see http://www.fedmsg.com/en/latest/config/#term-replay-endpoints and http://www.fedmsg.com/en/latest/replay/.) Also, quoting en.wp on 0mq: "[...] message transports include TCP [...]", which means a reliable transport is usually used in addition.
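The same idea in a few lines of illustrative Python (assumed interfaces, not the actual fedmsg API): each message carries a strictly increasing sequence number, and a consumer that sees a hole asks the replay endpoint for what it missed:

```
def consume(stream, replay, handle):
    """`stream`, `replay` and `handle` are assumed interfaces."""
    expected = None
    for msg in stream:
        if expected is not None and msg['seq'] > expected:
            # A hole in the numbering: re-request the skipped messages.
            for missing in range(expected, msg['seq']):
                handle(replay.fetch(missing))
        handle(msg)
        expected = msg['seq'] + 1
```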

Re: reliability, RELP might be of help on the application level.

See also: @aaron is working on a cache update service at https://github.com/AaronSchulz/python-memcached-relay

This could also be seen as a replacement for some use cases of MW hooks and hook listeners. Right now, if the hook listener throws an exception, that breaks the code that triggered the hook (e.g. T102874: Using Special:EnableFlow on a French Wikiproject page has broken the page completely).

If the hook doesn't have to alter or return any values (e.g. to tell the hook-running code to do something), it might be able to use an event bus instead. If an event bus listener has a bug, my understanding is that it wouldn't cause the code that emits the event to break.
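To illustrate the contrast (hypothetical names, not MediaWiki's actual hook interface): a synchronous hook call unwinds into the caller when a listener throws, while an event emitter only hands the event to the queue:

```
def run_hook(listeners, page):
    for listener in listeners:
        listener(page)        # a buggy listener raises here and breaks
                              # the code path that triggered the hook

def emit_event(producer, page):
    # Consumers run elsewhere; a bug in one of them cannot unwind this stack.
    producer.send('mediawiki.page-edit', {'title': page})
```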

BTW, T102082 is mainly about analytics eventlogging, but the confluent stuff would be good for an event bus used for application stuff too.

GWicke claimed this task.

A basic event bus is now available in production and is being populated with edit events from MediaWiki. Consumption is directly from Kafka at this point.

This means that the core proposal of this task is implemented. I'm closing this task to reflect this.