
Provide a quantitative description of the Wikidata-triples dataset
Closed, Resolved · Public

Description

As a way to get familiar with the data, please provide quantitative information about the dataset using Spark in a notebook (probably using Python, as it facilitates making charts).
The data can be found in:

hdfs://analytics-hadoop/wmf/data/discovery/wikidata/rdf/date=20210419/wiki=wikidata

There are multiple snapshot dates available, as well as multiple wikis (wikidata and commons). Just pick one date with wikidata data :)
In Hive or Spark SQL:

use discovery;
show partitions wikibase_rdf;
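
Once a partition is picked, a first pass at one of the suggested counts (top predicates) could look like the query below. This is a sketch only: it assumes the `wikibase_rdf` table exposes `subject`/`predicate`/`object` columns partitioned by `date` and `wiki`; verify the actual schema with `describe wikibase_rdf` first.

```sql
-- Sketch: assumes subject/predicate/object columns; check the schema first.
use discovery;

select predicate, count(*) as triples
from wikibase_rdf
where `date` = '20210419' and wiki = 'wikidata'
group by predicate
order by triples desc
limit 20;
```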

Event Timeline

Suggested information to analyse or extract through this analysis includes:

  • Top items
  • Top properties
  • Top subject and object types
  • Top property types
  • Wikidata vs. non-Wikidata predicates
  • Number of S, P, O that don't involve Wikidata
    • The aim is to find the size of the subgraph not concerning Wikidata, i.e. the number of leaves. They are leaves because once they point to something outside of Wikidata, they are not expanded further within Wikidata; some, such as literals, are not expandable at all. If we have too many leaves, we may consider using property graphs instead (where leaves would be listed as properties of a node).
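
The leaf-counting idea in the last bullet can be illustrated with a small pure-Python sketch (the real job would run the same classification over the triples table in Spark). The sample triples and the namespace prefixes used to decide "expandable within Wikidata" (`wd:`, `wds:`, `wdv:`) are illustrative assumptions, not the task's definitive rule:

```python
from collections import Counter

# Hypothetical sample of (subject, predicate, object) triples; in the real
# task these rows would come from discovery.wikibase_rdf via Spark.
triples = [
    ("wd:Q42", "wdt:P31", "wd:Q5"),                        # item -> item, expandable
    ("wd:Q42", "wdt:P569", '"1952-03-11"^^xsd:dateTime'),  # literal, a leaf
    ("wd:Q42", "rdfs:label", '"Douglas Adams"@en'),        # literal, a leaf
    ("wd:Q42", "wdt:P856", "<https://douglasadams.com/>"), # external URI, a leaf
]

def is_leaf(obj: str) -> bool:
    """An object is a leaf if it cannot be expanded within Wikidata:
    literals, or URIs outside the assumed wd:/wds:/wdv: namespaces."""
    return not obj.startswith(("wd:", "wds:", "wdv:"))

counts = Counter("leaf" if is_leaf(o) else "internal" for _, _, o in triples)
print(counts)  # Counter({'leaf': 3, 'internal': 1})
```

The ratio of `leaf` to `internal` objects is exactly the signal the bullet describes: a high leaf fraction would argue for a property-graph model where leaves become node attributes.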