
Can we use a stream of events for facilitating incrementals of the project content dumps?
Closed, Resolved (Public)

Description

How can we get and use a stream of MediaWiki events to facilitate incrementals of the project content dumps, and what would this stream contain?

Event Timeline

Could this plug into the existing RCStream thingy? Or does that provide enough events to trigger small update/sync jobs?

ArielGlenn raised the priority of this task from Medium to High. Mar 7 2016, 6:10 PM
In T128754#2090048, @brion wrote:

Could this plug into the existing RCStream thingy? Or does that provide enough events to trigger small update/sync jobs?

The idea would be to use the shiny new Event-Platform system and listen to events there. Eventually we are hoping to use it for RCStream as well, but that's out of scope of this ticket :)

We definitely had a lot of discussion about EventBus use; RCStream is not 100% reliable, and its format is pretty clunky too.
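To make the "listen to events, then trigger small update jobs" idea a bit more concrete, here is a minimal sketch of how a consumer might collapse a batch of events into the set of pages an incremental dump would need to touch. Everything in it is an assumption for illustration: the field names (`wiki`, `page_id`, `rev_id`, `type`) and the function `collapse_events` are hypothetical and are not taken from any actual event schema, and it assumes events arrive in order.

```python
from typing import Dict, Iterable, Optional, Tuple

# (wiki, page_id) identifies a page across projects. Hypothetical key shape.
PageKey = Tuple[str, int]


def collapse_events(events: Iterable[dict]) -> Dict[PageKey, Optional[int]]:
    """Reduce an ordered batch of events to the latest state per page.

    Returns a mapping (wiki, page_id) -> latest rev_id, or None when the
    last event seen for that page was a delete. Field names are assumed.
    """
    latest: Dict[PageKey, Optional[int]] = {}
    for event in events:
        key = (event["wiki"], event["page_id"])
        if event["type"] == "delete":
            # A deleted page has no current revision to dump.
            latest[key] = None
        else:
            # Events are assumed to be ordered, so the last one wins.
            latest[key] = event["rev_id"]
    return latest


# Example batch with made-up values.
sample_events = [
    {"wiki": "enwiki", "page_id": 1, "rev_id": 10, "type": "edit"},
    {"wiki": "enwiki", "page_id": 1, "rev_id": 12, "type": "edit"},
    {"wiki": "dewiki", "page_id": 7, "rev_id": 3, "type": "edit"},
    {"wiki": "dewiki", "page_id": 7, "type": "delete"},
]
pending = collapse_events(sample_events)
```

An incremental job could then fetch content only for the pages in `pending` whose value is not `None` and record removals for the rest. Whether the real stream carries enough information for this is exactly the open question in this task.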

I always forget, is content stored in MySQL? If so, maybe T120242 would help? Maybe not, since the content is in a crazy binary format?

@Ottomata I might not understand your question properly. Page content (wikitext) is available in the external stores (MySQL databases), and we get it from there for the current adds/changes dumps. HTML with a lot of extra markup is stored in RESTBase and would be retrieved from there for 'incremental' HTML dumps if those were produced.

BTullis claimed this task.
BTullis subscribed.

I would say that we can resolve this ticket now, since we are using an event-based approach for Dumps 2.0 in T366752: Dumps 2.0 Phase III: Production level dumps (SDS 1.2).
Details on that approach are here: https://wikitech.wikimedia.org/wiki/Data_Platform/Data_Lake/Content/Mediawiki_content_history_v1

All new work on the next generation dumps is being coordinated through the DPE-Mediawiki-Content tag.