
PySpark/R procedures to process the copy of the WD Dump in the Data Lake
Closed, Resolved · Public

Description

  • Develop a set of standardized, efficient PySpark/R procedures for processing the copy of the WD Dump (HDFS) in the WMF Data Lake, as sketched below.
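
For reference, a minimal PySpark sketch of the loading step, assuming the copy is the standard Wikidata JSON dump (a single JSON array with one entity per line) sitting at a hypothetical HDFS path; the actual location and schema handling in the Data Lake may differ:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wd-dump-load").getOrCreate()

# Hypothetical HDFS path to the dump copy; adjust to the actual location.
DUMP_PATH = "hdfs:///wmf/data/wikidata/wikidata-all.json"

# The JSON dump is one large array: "[" on the first line, one entity per
# line ending with a comma, and "]" on the last line. Strip the brackets
# and the trailing commas so each line is a standalone JSON document.
lines = (spark.read.text(DUMP_PATH).rdd
         .map(lambda r: r.value.strip().rstrip(","))
         .filter(lambda l: l not in ("[", "]", "")))

# Schema inference over the full dump is expensive; for production runs a
# hand-written schema restricted to the needed fields is preferable.
entities = spark.read.json(lines)
entities.select("id", "type").show(5)
```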

Event Timeline

Restricted Application added a subscriber: Aklapper. · Apr 1 2019, 11:31 PM
GoranSMilovanovic added a comment. · Edited · May 2 2019, 11:03 AM
  • These operations are meant to replace all of the R-orchestrated, massive, and time-consuming Wikidata API/WDQS SPARQL calls from the WDCM and related dashboard back-ends;
  • The following datasets have been produced so far:
    • WD labels for the top 15 languages by number of speakers (essential for the WDCM system);
    • Q5 (human): all items plus the essential properties for our WD statistical systems (see the PySpark sketch after this list):
      • P21 (sex or gender)
      • P106 (occupation)
      • P170 (creator)
      • P50 (author)
      • P101 (field of work)
      • P27 (country of citizenship)
      • P39 (position held)
      • P103 (native language)
      • P1412 (languages spoken, written or signed)
      • P172 (ethnic group)
      • P463 (member of)
      • P1344 (participant in)
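
A hedged sketch of how the Q5 dataset and the per-language labels could be derived from the loaded entities DataFrame (see the loading sketch under the task description). The first-value-per-property convention, the exists-based P31 test (a Spark 2.4+ higher-order function), and the illustrative language subset are assumptions, not the exact WDCM code:

```python
from pyspark.sql import functions as F

# Properties considered essential for the Q5 (human) dataset.
ESSENTIAL_PROPS = ["P21", "P106", "P170", "P50", "P101", "P27",
                   "P39", "P103", "P1412", "P172", "P463", "P1344"]

def first_item_value(prop):
    # First referenced item ID in the statement group of `prop`, if any;
    # statements sit in arrays under claims.<prop> in the JSON dump schema.
    return F.expr(
        "claims.{p}[0].mainsnak.datavalue.value.id".format(p=prop)
    ).alias(prop)

# Items whose P31 (instance of) statements include Q5 (human).
humans = (
    entities
    .where(F.expr(
        "exists(claims.P31, s -> s.mainsnak.datavalue.value.id = 'Q5')"))
    .select([F.col("id")] + [first_item_value(p) for p in ESSENTIAL_PROPS])
)

# Labels for a fixed set of languages (illustrative subset of the top 15).
TOP_LANGS = ["en", "es", "hi", "ar", "fr", "ru", "pt", "de"]
labels = entities.select(
    "id",
    *[F.col("labels.{l}.value".format(l=l)).alias(l) for l in TOP_LANGS]
)
```

Taking only the first value per property keeps the output rectangular; multi-valued properties would need an explode instead, and writing each result out as Parquet would give the dashboard back-ends a queryable snapshot.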

Next steps:

  • essential properties/classes for organizations (WDCM) - DONE (the shared extraction pattern is sketched below this list).
  • essential properties/classes for geographical objects (WDCM) - DONE.
  • essential properties/classes for taxa (WDCM) - DONE.
  • essential properties/classes for languages - for the Wikidata Languages Landscape T221965 --> transferred as a sub-task to T221965.
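
These per-class extractions all follow the same instance-of (P31) pattern, so the filter can be parameterized by class QID. A sketch under the same assumptions as above; the helper name and the example QIDs (Q16521 for taxon, plus the item-valued properties P171 parent taxon and P105 taxon rank) are illustrative, not the task's actual code:

```python
from pyspark.sql import functions as F

def items_of_class(entities, class_qid, props):
    """All items that are an instance of (P31) `class_qid`, together with
    the first value of each listed item-valued property (hypothetical helper)."""
    is_instance = F.expr(
        "exists(claims.P31, s -> s.mainsnak.datavalue.value.id = '{q}')"
        .format(q=class_qid))
    cols = [F.expr("claims.{p}[0].mainsnak.datavalue.value.id"
                   .format(p=p)).alias(p) for p in props]
    return entities.where(is_instance).select([F.col("id")] + cols)

# e.g. taxa: parent taxon (P171) and taxon rank (P105)
taxa = items_of_class(entities, "Q16521", ["P171", "P105"])
```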

Resolved.

GoranSMilovanovic closed this task as Resolved. · May 13 2019, 2:31 PM