WD JSON dump processing w. WDTK for the WMDE Analytical Systems
Closed, Resolved (Public)

Description

Using WDTK (Wikidata Toolkit), produce a weekly-updated set of .tsv files from the Wikidata JSON dumps to support the WMDE-maintained Wikidata analytical systems (WDCM, External Identifiers, Languages Landscape, etc.).
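
To make the input format concrete, here is a minimal sketch of how the dump can be read one entity at a time. It is written in plain Python rather than WDTK, and it assumes the usual dump layout: one JSON array with one entity per line and a trailing comma on every line except the last. The function name `iter_entities` is hypothetical.

```python
import json


def iter_entities(lines):
    """Yield one entity dict per dump line.

    Assumes one entity per line, wrapped in a JSON array, so the lone
    '[' and ']' lines and the trailing commas have to be stripped.
    """
    for line in lines:
        line = line.strip().rstrip(",")
        if not line or line in ("[", "]"):
            continue
        yield json.loads(line)
```

For a compressed dump, the `lines` argument could be a file handle opened with `bz2.open(path, "rt", encoding="utf-8")` or `gzip.open(...)`, so the dump never has to be loaded into memory as a whole.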

The description of the tables and fields follows:

Table 1. wd_json_dump_gender.tsv
Fields:

  • Item ID: we are interested only in items with P21 (sex or gender);
  • P21 (sex or gender)
  • P19 (place of birth)
  • P106 (occupation)
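
As an illustration of Table 1, the following is a sketch in plain Python (not WDTK) of how one row could be derived from a single parsed entity. It assumes the standard Wikidata JSON layout (`claims` → property → statements → `mainsnak` → `datavalue`) and that item values carry an `id` key. The helper names and the choice to join multiple values with `|` are assumptions, not part of the request.

```python
def item_values(entity, prop):
    """Return the item IDs (e.g. 'Q5') stated for `prop` in main snaks."""
    out = []
    for statement in entity.get("claims", {}).get(prop, []):
        snak = statement.get("mainsnak", {})
        if snak.get("snaktype") != "value":
            continue  # skip 'somevalue' / 'novalue' snaks
        value = snak["datavalue"]["value"]
        if isinstance(value, dict) and "id" in value:
            out.append(value["id"])
    return out


def gender_rows(entity):
    """Return [(item, P21, P19, P106)], or [] if the item has no P21."""
    if entity.get("type") != "item" or "P21" not in entity.get("claims", {}):
        return []
    # Assumption: several values for one property are joined with '|'.
    columns = ["|".join(item_values(entity, p)) for p in ("P21", "P19", "P106")]
    return [(entity["id"], *columns)]
```

Items without P19 or P106 still produce a row; the corresponding column is simply empty.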

Table 2. wd_json_dump_geo.tsv
Fields:

  • Item ID: we are interested only in items with P625 (coordinate location);
  • lat (latitude)
  • lon (longitude)
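
For Table 2, a comparable sketch in plain Python (again not WDTK) could look as follows. It assumes that a P625 main snak of type `value` holds a globe-coordinate value with `latitude` and `longitude` keys, as in the standard Wikidata JSON layout. The function name `geo_rows` and the decision to emit one row per P625 statement are assumptions.

```python
def geo_rows(entity):
    """Return one (item, lat, lon) row per P625 statement of an item."""
    rows = []
    if entity.get("type") != "item":
        return rows
    for statement in entity.get("claims", {}).get("P625", []):
        snak = statement.get("mainsnak", {})
        if snak.get("snaktype") != "value":
            continue  # no coordinates to extract from 'somevalue'/'novalue'
        value = snak["datavalue"]["value"]
        rows.append((entity["id"], value["latitude"], value["longitude"]))
    return rows
```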

Table 3. wd_json_dump_externalIdentifiers.tsv
Fields:

  • Item ID: we are interested only in items that use at least one property of datatype external-id;
  • External_Identifier_Property: the external-id property used by the item. We are not interested in the values, only in which item uses which external identifiers.
  • Note: these need to be collected from claims, qualifiers, and references.
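
The three places where a property can occur (main claim, qualifiers, references) can be walked as in the sketch below. It is plain Python rather than WDTK and assumes that every snak in the dump carries `property` and `datatype` keys, that qualifiers are grouped by property under `qualifiers`, and that each reference groups its snaks under `snaks`. The helper names and the deduplication to one row per (item, property) pair are assumptions.

```python
def iter_snaks(entity):
    """Yield every snak of an entity: main snaks, qualifiers, references."""
    for statements in entity.get("claims", {}).values():
        for statement in statements:
            yield statement["mainsnak"]
            for snaks in statement.get("qualifiers", {}).values():
                yield from snaks
            for reference in statement.get("references", []):
                for snaks in reference.get("snaks", {}).values():
                    yield from snaks


def external_id_rows(entity):
    """Return sorted (item, property) pairs for external-id properties."""
    if entity.get("type") != "item":
        return []
    props = {snak["property"] for snak in iter_snaks(entity)
             if snak.get("datatype") == "external-id"}
    return [(entity["id"], prop) for prop in sorted(props)]
```

Only the property IDs are kept; the identifier values themselves are never read.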

Table 4. wd_json_dump_languages.tsv
Fields:

  • Item ID
  • Language: the Wikimedia language code of each language in which the item has a label, i.e. for every item, all languages in which it has a label.
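
Table 4 maps directly onto the `labels` object of an entity, whose keys are language codes in the standard Wikidata JSON layout. A minimal plain-Python sketch (not WDTK; the function name `language_rows` and the sorted output order are assumptions) could be:

```python
def language_rows(entity):
    """Return one (item, language_code) row per label of an item."""
    if entity.get("type") != "item":
        return []
    return [(entity["id"], lang) for lang in sorted(entity.get("labels", {}))]
```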

It would be best to store these tables somewhere on the stat1007 machine so that we can easily load them into HDFS/Spark from there.
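
For the output side, the sketch below shows one way to serialize rows as tab-separated text with Python's standard `csv` module. The function name `rows_to_tsv` and the column names in the usage example are hypothetical; the actual header names and storage path are not specified here.

```python
import csv
import io


def rows_to_tsv(header, rows):
    """Serialize a header and rows as tab-separated text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical column names, for illustration only.
text = rows_to_tsv(["item", "language"], [("Q42", "en"), ("Q42", "de")])
```

For dump-sized output it would be preferable to pass an open file handle to `csv.writer` and write rows as they are produced, rather than building the whole table in memory.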