- Collect fundamental statistics from external (i.e. non-Wikidata) sources for the Wikidata Languages Landscape.
Something to begin with:
- each node is a language (Wikimedia language codes are used);
- each language points to its three most similar languages, where similarity is the overlap of language labels across >57M Wikidata items;
- explanation: for each language we find which WD items have a label in it, giving one binary vector of length approx. 57M per language; the similarity between two languages is then based on the Jaccard distance between their two vectors (smaller distance = more similar).
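
A minimal sketch of the neighbour selection described above, in Python. It assumes each language's binary vector is stored as the set of item IDs that have a label in that language, which yields the same Jaccard distance as the full vectors. The names `jaccard_distance`, `nearest_languages` and the toy `labels` data are hypothetical and only illustrate the idea; they are not the code used for the actual graph.

```python
def jaccard_distance(a: set, b: set) -> float:
    """Jaccard distance between two sets: 1 - |A & B| / |A | B|."""
    union = a | b
    if not union:
        # Two empty label sets: treat them as identical.
        return 0.0
    return 1.0 - len(a & b) / len(union)


def nearest_languages(labels: dict, k: int = 3) -> dict:
    """For each language, return the k languages at the smallest Jaccard distance."""
    result = {}
    for lang, items in labels.items():
        distances = [
            (jaccard_distance(items, other_items), other)
            for other, other_items in labels.items()
            if other != lang
        ]
        # Sort by distance; ties are broken alphabetically by language code.
        distances.sort()
        result[lang] = [other for _, other in distances[:k]]
    return result


# Toy data (hypothetical): language code -> set of item IDs labelled in it.
labels = {
    "en": {"Q1", "Q2", "Q3", "Q4"},
    "de": {"Q1", "Q2", "Q3"},
    "fr": {"Q1", "Q2"},
    "sr": {"Q3", "Q4"},
    "hr": {"Q4"},
}

neighbours = nearest_languages(labels)
print(neighbours["en"])
```

Each key in `neighbours` is a node and each listed language is an outgoing edge. With roughly 57M items, a sparse matrix representation would likely be needed instead of plain Python sets.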
Mapping WDCM item re-use statistics onto languages now.