
Proposal: Generate Wikidata JSON & RDF dumps from Hadoop
Open, Needs Triage, Public

Description

Wikidata dumps currently come directly from the SQL servers.
The general process is to iterate through all pages and slowly write all content to files (possibly in multiple threads).

An alternative solution would be for Wikidata to produce two event streams, one RDF and one JSON, that end up in Hadoop, once T120242: Eventually-Consistent MediaWiki state change events | MediaWiki events as source of truth & T215001: Revisions missing from mediawiki_revision_create are complete.
To avoid waiting for T120242 or T215001, this could instead be implemented as a service that takes a reliable and consistent input (such as MediaWiki recent changes) and populates a reliable Kafka stream of content by requesting that content from Wikidata.
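
To make that a bit more concrete, here is a rough sketch of what such a bridging service could look like. Everything here is illustrative rather than a design: the broker address, topic name, and the choice of Special:EntityData as the content source are all assumptions.

```
import json

import requests
from kafka import KafkaProducer  # kafka-python

# Hypothetical broker/topic names, not an agreed design.
KAFKA_BROKERS = "kafka.example.org:9092"
TOPIC = "wikidata.entity_json"

producer = KafkaProducer(
    bootstrap_servers=KAFKA_BROKERS,
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)


def fetch_entity_json(entity_id, revision_id):
    """Fetch the JSON for one entity revision via Special:EntityData."""
    url = f"https://www.wikidata.org/wiki/Special:EntityData/{entity_id}.json"
    resp = requests.get(url, params={"revision": revision_id}, timeout=30)
    resp.raise_for_status()
    return resp.json()["entities"][entity_id]


def handle_change(entity_id, revision_id):
    """Turn one observed change into one Kafka message, keyed by entity id."""
    entity = fetch_entity_json(entity_id, revision_id)
    producer.send(
        TOPIC,
        key=entity_id.encode("utf-8"),
        value={"id": entity_id, "revision": revision_id, "entity": entity},
    )


# A driver loop (e.g. polling recent changes) would call handle_change()
# for every new revision it sees, and producer.flush() periodically.
```

The RDF side could work the same way, fetching the .ttl output of Special:EntityData into a second topic.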

Dumps could then be created directly from Hadoop, which I imagine would take far less time, allowing users to get fresher data and also benefiting services such as the Wikidata Query Service, which sometimes has to reload from dumps.
If we could quickly push this data to Kafka too, we would likely see some reduction in load on the s8 DB servers, as dump generation would no longer need to run against them. I'm sure the DBAs would appreciate this.
And the new query service Flink updater could also make use of the RDF stream, instead of consuming MediaWiki revision-create events and then requesting Special:EntityData.

This would likely also open the door a little more to subset dumps and other fun like that.
The wikidata_entities data set, currently imported into Hadoop weekly, would also be fresh for analysis rather than 1-2 weeks behind the times.
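
To illustrate the "dumps straight from Hadoop" part, the job could be something as simple as the sketch below. The table name, column names, and output path are all made up; the real schema would depend on how the streams end up being ingested.

```
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wikidata-json-dump").getOrCreate()

# Hypothetical table/column names standing in for however the entity
# stream ends up being ingested into Hadoop.
latest = spark.sql("""
    SELECT entity_id, entity_json
    FROM (
        SELECT entity_id,
               entity_json,
               ROW_NUMBER() OVER (PARTITION BY entity_id ORDER BY revision_id DESC) AS rn
        FROM wmf.wikidata_entity_updates
    ) ranked
    WHERE rn = 1
""")

# One JSON document per line, gzip compressed; a post-processing step could
# reshape these part files into the current dump format if needed.
latest.select("entity_json").write \
    .mode("overwrite") \
    .option("compression", "gzip") \
    .text("/wmf/data/archive/wikidata/json_dump/latest")
```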

Event Timeline

Cross-linking to T290839: Evaluate a double backend strategy for WDQS, which I was reading when I decided to write this down, particularly T290839#7354690.

a reliable and consistent input (such as MediaWiki recent changes)

I guess by this you mean polling the MW RecentChanges API?

Yes, I imagine that is the only truly "reliable" way to get all events, as I imagine other sources like https://wikitech.wikimedia.org/wiki/Event_Platform/EventStreams would all have the same problems?
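
For concreteness, polling recent changes would just mean walking the API with continuation, roughly like this (parameter choices are only illustrative):

```
import requests

API = "https://www.wikidata.org/w/api.php"
session = requests.Session()


def poll_recent_changes(start_timestamp):
    """Yield (title, revid, timestamp) for edits/creations since start_timestamp."""
    params = {
        "action": "query",
        "list": "recentchanges",
        "rcdir": "newer",
        "rcstart": start_timestamp,
        "rctype": "edit|new",
        "rcprop": "title|ids|timestamp",
        "rclimit": "max",
        "format": "json",
    }
    while True:
        data = session.get(API, params=params, timeout=30).json()
        for rc in data["query"]["recentchanges"]:
            yield rc["title"], rc["revid"], rc["timestamp"]
        if "continue" not in data:
            break
        params.update(data["continue"])  # carries rccontinue forward


# Each yielded revision could then be fed into the Kafka-populating
# service sketched in the task description.
for title, revid, timestamp in poll_recent_changes("2021-09-01T00:00:00Z"):
    print(title, revid, timestamp)
```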

And the new query service flink updater could also make use of the RDF stream

Perhaps the existing logic in the WDQS updater to generate its RDF stream could be factored out into its own service? Or, at least, it could emit its RDF stream as a side output into a Kafka topic?

To be reliable for dumps, we'd have to fix T215001: Revisions missing from mediawiki_revision_create as you say, but I'm hoping we can do that within this FY.

I imagine other sources like https://wikitech.wikimedia.org/wiki/Event_Platform/EventStreams would all have the same problems?

Yes, EventStreams uses the same data.

Perhaps the existing logic in the WDQS updater to generate its RDF stream could be factored out into its own service? Or, at least, it could emit its RDF stream as a side output into a Kafka topic?

This relies on the Kafka event streams right now, so it also suffers from the same problem.

The very old updater (not used in years) did poll recent changes, though it would probably be easier to write a small service for this from scratch.

From IRC

7:32 PM <+dcausse> addshore: I'm not convinced that RecentChanges is more reliable than the revision-create stream, using this stream did improve consistency of wdqs IIRC

This is indeed true; RC is also not a totally reliable thing.
Really this just wants the revision table...

So https://www.wikidata.org/w/api.php?action=query&list=allrevisions&arvdir=older&arvlimit=50

Technically this would be reliable, as all revisions are recorded here for sure, but revisions could still be removed from it later too.
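
Paging through that with the standard API continuation would look roughly like this (illustrative only, using the same parameters as the URL above):

```
import requests

API = "https://www.wikidata.org/w/api.php"


def iter_all_revisions():
    """Page through list=allrevisions (newest first, as in the URL above)."""
    params = {
        "action": "query",
        "list": "allrevisions",
        "arvdir": "older",
        "arvlimit": "50",
        "arvprop": "ids|timestamp",
        "format": "json",
    }
    session = requests.Session()
    while True:
        data = session.get(API, params=params, timeout=30).json()
        for page in data["query"]["allrevisions"]:
            for rev in page["revisions"]:
                yield page["title"], rev["revid"], rev["timestamp"]
        if "continue" not in data:
            break
        params.update(data["continue"])  # carries arvcontinue forward


for title, revid, timestamp in iter_all_revisions():
    print(title, revid, timestamp)
```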