
Proposal: Generate Wikidata JSON & RDF dumps from Hadoop
Open, Needs Triage, Public

Description

Wikidata dumps currently come directly from the SQL servers.
The general process is to iterate through all pages and slowly write all content to files (possibly in multiple threads).

An alternative solution would be for Wikidata to produce two event streams, one RDF and one JSON, that end up in Hadoop, once T120242: Eventually-Consistent MediaWiki state change events | MediaWiki events as source of truth & T215001: Revisions missing from mediawiki_revision_create are complete.
To avoid waiting for T120242 or T215001, this could instead be implemented as a service that takes a reliable and consistent input (such as MediaWiki recent changes) and populates a reliable Kafka stream of content by requesting that content from Wikidata.
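
To make that a bit more concrete, here is a rough sketch of what such a bridging service could look like. Everything here is illustrative rather than a design: the broker address, topic name, and the choice of Special:EntityData as the content source are all assumptions.

```
import json

import requests
from kafka import KafkaProducer  # kafka-python

# Hypothetical broker/topic names, not an agreed design.
KAFKA_BROKERS = "kafka.example.org:9092"
TOPIC = "wikidata.entity_json"

producer = KafkaProducer(
    bootstrap_servers=KAFKA_BROKERS,
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)


def fetch_entity_json(entity_id, revision_id):
    """Fetch the JSON for one entity revision via Special:EntityData."""
    url = f"https://www.wikidata.org/wiki/Special:EntityData/{entity_id}.json"
    resp = requests.get(url, params={"revision": revision_id}, timeout=30)
    resp.raise_for_status()
    return resp.json()["entities"][entity_id]


def handle_change(entity_id, revision_id):
    """Turn one observed change into one Kafka message, keyed by entity id."""
    entity = fetch_entity_json(entity_id, revision_id)
    producer.send(
        TOPIC,
        key=entity_id.encode("utf-8"),
        value={"id": entity_id, "revision": revision_id, "entity": entity},
    )


# A driver loop (e.g. polling recent changes) would call handle_change()
# for every new revision it sees, and producer.flush() periodically.
```

The RDF side could work the same way, fetching the .ttl output of Special:EntityData into a second topic.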

Dumps could then be created directly from Hadoop, which I imagine would take far less time, allowing users to get fresher data and also benefiting services such as the Wikidata Query Service, which sometimes has to reload from dumps.
If we could quickly push this data to Kafka too, we would likely see some reduction in load on the s8 DB servers, as dump generation would no longer need to run against them. I'm sure the DBAs would appreciate this.
And the new query service Flink updater could also make use of the RDF stream, instead of consuming MediaWiki revision-create events and then requesting Special:EntityData.

This would likely also open the door a little more to subset dumps and other fun like that.
The wikidata_entities data set, currently imported into Hadoop weekly, would also be fresh for analysis rather than 1-2 weeks behind the times.
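
To illustrate the "dumps straight from Hadoop" part, the job could be something as simple as the sketch below. The table name, column names, and output path are all made up; the real schema would depend on how the streams end up being ingested.

```
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wikidata-json-dump").getOrCreate()

# Hypothetical table/column names standing in for however the entity
# stream ends up being ingested into Hadoop.
latest = spark.sql("""
    SELECT entity_id, entity_json
    FROM (
        SELECT entity_id,
               entity_json,
               ROW_NUMBER() OVER (PARTITION BY entity_id ORDER BY revision_id DESC) AS rn
        FROM wmf.wikidata_entity_updates
    ) ranked
    WHERE rn = 1
""")

# One JSON document per line, gzip compressed; a post-processing step could
# reshape these part files into the current dump format if needed.
latest.select("entity_json").write \
    .mode("overwrite") \
    .option("compression", "gzip") \
    .text("/wmf/data/archive/wikidata/json_dump/latest")
```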

Event Timeline

Cross-linking to T290839: Evaluate a double backend strategy for WDQS, which I was reading when I decided to write this down, particularly T290839#7354690.

a reliable and consistent input (such as MediaWiki recent changes)

I guess by this you mean polling the MW RecentChanges API?

Yes, I imagine that is the only truly "reliable" way to get all events, as I imagine other sources like https://wikitech.wikimedia.org/wiki/Event_Platform/EventStreams would all have the same problems?
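
For concreteness, polling recent changes would just mean walking the API with continuation, roughly like this (parameter choices are only illustrative):

```
import requests

API = "https://www.wikidata.org/w/api.php"
session = requests.Session()


def poll_recent_changes(start_timestamp):
    """Yield (title, revid, timestamp) for edits/creations since start_timestamp."""
    params = {
        "action": "query",
        "list": "recentchanges",
        "rcdir": "newer",
        "rcstart": start_timestamp,
        "rctype": "edit|new",
        "rcprop": "title|ids|timestamp",
        "rclimit": "max",
        "format": "json",
    }
    while True:
        data = session.get(API, params=params, timeout=30).json()
        for rc in data["query"]["recentchanges"]:
            yield rc["title"], rc["revid"], rc["timestamp"]
        if "continue" not in data:
            break
        params.update(data["continue"])  # carries rccontinue forward


# Each yielded revision could then be fed into the Kafka-populating
# service sketched in the task description.
for title, revid, timestamp in poll_recent_changes("2021-09-01T00:00:00Z"):
    print(title, revid, timestamp)
```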

And the new query service flink updater could also make use of the RDF stream

Perhaps the existing logic in the WDQS updater to generate its RDF stream could be factored out into its own service? Or, at least, it could emit its RDF stream as a side output into a Kafka topic?

To be reliable for dumps, we'd have to fix T215001: Revisions missing from mediawiki_revision_create as you say, but I'm hoping we can do that within this FY.

I imagine other sources like https://wikitech.wikimedia.org/wiki/Event_Platform/EventStreams would all have the same problems?

Yes, EventStreams uses the same data.

Perhaps the existing logic in the WDQS updater to generate its RDF stream could be factored out into its own service? Or, at least, it could emit its RDF stream as a side output into a Kafka topic?

This relies on the Kafka event streams right now, so it also suffers from the same problem.

The very old updater (not used in years) did poll recent changes, though it would probably be easier to write a small service for this from scratch.

From IRC

7:32 PM <+dcausse> addshore: I'm not convinced that RecentChanges is more reliable than the revision-create stream, using this stream did improve consistency of wdqs IIRC

This is indeed true; RC is also not a totally reliable thing.
Really this just wants the revision table...

So https://www.wikidata.org/w/api.php?action=query&list=allrevisions&arvdir=older&arvlimit=50

Technically this would be reliable, as all revisions are recorded here for sure, but revisions could still be removed from it later too.
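
Paging through that with the standard API continuation would look roughly like this (illustrative only, using the same parameters as the URL above):

```
import requests

API = "https://www.wikidata.org/w/api.php"


def iter_all_revisions():
    """Page through list=allrevisions (newest first, as in the URL above)."""
    params = {
        "action": "query",
        "list": "allrevisions",
        "arvdir": "older",
        "arvlimit": "50",
        "arvprop": "ids|timestamp",
        "format": "json",
    }
    session = requests.Session()
    while True:
        data = session.get(API, params=params, timeout=30).json()
        for page in data["query"]["allrevisions"]:
            for rev in page["revisions"]:
                yield page["title"], rev["revid"], rev["timestamp"]
        if "continue" not in data:
            break
        params.update(data["continue"])  # carries arvcontinue forward


for title, revid, timestamp in iter_all_revisions():
    print(title, revid, timestamp)
```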