WD Languages Landscape: fundamental statistics
Open, Needs Triage | Public

Description

  • Collect fundamental statistics from external (i.e. non-Wikidata) sources for the Wikidata Languages Landscape.

Event Timeline

@Lydia_Pintscher @RazShuty

Something to begin with:

  • each node is a language (Wikimedia language codes are used);
  • each language points to its three most similar languages, where similarity is measured by the overlap of language labels across more than 57M Wikidata items;
  • explanation: for each language we find which WD items have a label in it, which gives one binary vector of length approx. 57M per language; two languages are then compared by the Jaccard distance between their vectors (the smaller the distance, the more similar the languages).
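
For illustration only, here is a minimal Python sketch of the neighbour computation described above. It is not the code used to produce the graph. The function names (`jaccard_distance`, `nearest_languages`), the toy item IDs, and the toy language data are all hypothetical. Each binary vector is represented as the set of item IDs that have a label in the language, which gives the same Jaccard distance as the 0/1 vector form.

```python
from typing import Dict, List, Set


def jaccard_distance(a: Set[str], b: Set[str]) -> float:
    """Jaccard distance between two label sets: 1 - |A & B| / |A | B|."""
    union = a | b
    if not union:
        # Assumption: two languages with no labels at all are treated as identical.
        return 0.0
    return 1.0 - len(a & b) / len(union)


def nearest_languages(labels: Dict[str, Set[str]], k: int = 3) -> Dict[str, List[str]]:
    """For each language, return the k languages with the smallest Jaccard distance."""
    graph: Dict[str, List[str]] = {}
    for lang, items in labels.items():
        others = [other for other in labels if other != lang]
        # Sort by distance; ties are broken alphabetically so the result is deterministic.
        others.sort(key=lambda other: (jaccard_distance(items, labels[other]), other))
        graph[lang] = others[:k]
    return graph


# Toy data (hypothetical): language code -> set of item IDs labelled in that language.
labels = {
    "en": {"Q1", "Q2", "Q3", "Q4"},
    "de": {"Q1", "Q2", "Q3"},
    "fr": {"Q1", "Q2", "Q4"},
    "sr": {"Q2", "Q5"},
    "hr": {"Q2", "Q5", "Q6"},
}

graph = nearest_languages(labels)
for lang, neighbours in graph.items():
    print(lang, "->", neighbours)
```

This pairwise loop is only meant to show the logic. At the scale of approx. 57M items and several hundred languages, a sparse-matrix formulation would probably be needed in practice.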

Mapping WDCM item re-use statistics onto languages now.