Type of activity: Pre-scheduled session
Main topic: https://www.mediawiki.org/wiki/Wikimedia_Developer_Summit/2017/How_to_manage_our_technical_debt
The Wikimedia Foundation has at least three systems that process in-wiki events:
- The multicast HTCP purge messages that keep caches in sync. This has several problems, including race conditions in PURGE propagation.
- The current jobqueue implementation. It is based on a technology that does not scale to our size (Redis used as a queue with transactions) and does not work easily across datacenters.
- Change-propagation: a service that consumes events from Kafka and is able to make HTTP calls to other services.
To these we could probably add a number of cron scripts, such as the Wikidata ones.
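As a rough illustration of the change-propagation model described above (consume an event, then make HTTP calls to other services), here is a minimal in-memory sketch. It is not the actual change-propagation configuration or code: the `Rule` and `HttpCall` types, the `plan_calls` function, the `page_edit` topic name and the `*.example` URLs are all hypothetical, and no Kafka client or HTTP library is involved.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Rule:
    """Hypothetical rule: events on `topic` trigger one HTTP call."""
    topic: str
    method: str
    url_template: str  # filled in from the event's fields


@dataclass(frozen=True)
class HttpCall:
    """A planned HTTP request; a real service would actually send it."""
    method: str
    url: str


def plan_calls(rules: List[Rule], topic: str, event: Dict[str, str]) -> List[HttpCall]:
    """Return the HTTP calls that the matching rules would make for one event."""
    calls = []
    for rule in rules:
        if rule.topic == topic:
            calls.append(HttpCall(rule.method, rule.url_template.format(**event)))
    return calls


# Assumed example rules: one event fans out to a cache purge and a job request.
RULES = [
    Rule("page_edit", "PURGE", "http://cache.example/{title}"),
    Rule("page_edit", "POST", "http://jobrunner.example/refresh_links/{title}"),
]

calls = plan_calls(RULES, "page_edit", {"title": "Main_Page"})
for call in calls:
    print(call.method, call.url)
```

The point of the sketch is only that a single event stream can drive both purges and job-like work, which is the unification discussed in this task.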
It would make sense in general to unify everything into one system, unless there is a strong reason to keep things as they are. In particular, the jobqueue:
- Has constant scalability issues (given the way we use redis)
- Uses a ton of resources (we have 14 servers dedicated to redis alone!)
- Is very, very hard to debug whenever something goes wrong
- Is limited to MediaWiki and cannot propagate events to other services
So we need a plan to transition from the old jobqueue to change-propagation, which uses a more solid transport mechanism.
Consensus among participants about the need (or not!) to transition to change propagation. A general plan for the migration and owners of the various parts of it.
Current status of the discussion
There is not much discussion going on about this, even though the inner workings of the current jobqueue are known only to a small number of people (maybe one?). When the change-propagation service was introduced, progressively replacing the functionality of the jobqueue was discussed, but neither a timeline nor a plan has ever been laid out. Also, this will need multiple teams to buy into the project.
In various comments to this task, we refer to "small wikis" as a shorthand for "small-scale MediaWiki installations".