Let's add some more languages (and projects) to the clickstream datasets! My goal is to add some easy ones up front and, longer term, to establish a process for adding more and to understand any barriers to further expanding the dataset.
Current status
There are 11 Wikipedia projects for which the clickstream is currently produced monthly (de, en, es, fa, fr, it, ja, pl, pt, ru, zh). I was not around when these languages were chosen but they seem to be the largest languages by # of pageviews. When these were added, the smallest (fa) was receiving just over 100 million pageviews per month.
Why add more?
The clickstream dataset is our sole source of public information about reader navigation -- i.e. how people move between pages. This is valuable for a variety of reasons:
- Editors: seeing which links readers click so they can improve the quality of those pages
- Researchers: as a measure of "link importance" for algorithms like PageRank, generating page embeddings based on where readers come/go, designing strategies for addressing gaps in Wikipedia (e.g., ensuring that not only is there more content about women but that it is sufficiently discoverable), etc.
One of the main challenges for this data has been accessibility -- i.e., it was only available as dumps, with no API for quick access and no friendly interface for exploring the data without data-science skills. That barrier has recently been lowered thanks to an Outreachy project, so it feels like a good time to discuss adding more languages: https://wikinav.toolforge.org/
As far as I can tell, there is no clear reason to limit the clickstream to just these 11 languages beyond concerns of complicating the job that produces them. There are some fast-growing wiki communities that I believe could make use of this data (and that we should encourage researchers to include in their work).
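As background for anyone wanting to work with the dumps: each monthly clickstream file is a tab-separated table with four columns (prev, curr, type, n). A minimal sketch of pulling a page's top referrers out of such a file -- the function name and file handling here are illustrative, not part of any existing tool:

```python
import csv
from collections import defaultdict

def top_referrers(path, target, limit=5):
    """Aggregate (prev -> target) counts from a clickstream TSV.

    Each row is: prev, curr, type, n (tab-separated).
    Returns the `limit` biggest sources of traffic to `target`.
    """
    counts = defaultdict(int)
    with open(path, encoding="utf-8") as f:
        for prev, curr, _type, n in csv.reader(f, delimiter="\t"):
            if curr == target:
                counts[prev] += int(n)
    return sorted(counts.items(), key=lambda kv: -kv[1])[:limit]
```

Note the published files are gzipped, so in practice you would open them with gzip.open rather than a plain open.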
Desired state
In an ideal world, the dataset would be produced for all Wikipedia languages, and we would discuss adding relevant sister projects too -- e.g., Wikisource. Practically speaking, I assume the job will need some refactoring to scale that dramatically (?), and the privacy requirement (a source-destination pair needs at least 10 clicks to be retained in the data) means that for some projects the dataset would be missing far too much data to be of much use.
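The 10-click rule is simple to illustrate: aggregate (prev, curr) page pairs and drop any pair below the threshold. A toy sketch of the idea, not the production job's code:

```python
from collections import Counter

MIN_CLICKS = 10  # pairs seen fewer times than this are dropped entirely

def apply_privacy_threshold(events, min_clicks=MIN_CLICKS):
    """Aggregate (prev, curr) page pairs from raw navigation events and
    drop any pair with fewer than min_clicks occurrences, mirroring the
    released dataset's threshold."""
    counts = Counter(events)
    return {pair: n for pair, n in counts.items() if n >= min_clicks}
```

On a small wiki, most pairs fall under the threshold, which is exactly why the resulting dataset may be too sparse to be useful there.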
Thus, I would like to at least propose a priority order for these projects, sorted by pageviews per month (turnilo):
- >100M: ar (Arabic), nl (Dutch), tr (Turkish), id (Indonesian)
- >50M: sv (Swedish), ko (Korean), vi (Vietnamese), hi (Hindi), th (Thai), he (Hebrew), cs (Czech), fi (Finnish)
- >10M: hu (Hungarian), uk (Ukrainian), el (Greek), ro (Romanian), no (Norwegian (Bokmål)), da (Danish), sr (Serbian), ms (Malay), bn (Bengali), bg (Bulgarian), hr (Croatian), ta (Tamil), sk (Slovak), az (Azerbaijani), simple (Simple), ca (Catalan), mr (Marathi), ml (Malayalam)
Of note, if you expand beyond Wikipedia, then these cut-offs also include Commons, Wikidata, and a few high-traffic Wiktionary sites. There have been previous requests for a Wikidata clickstream (see below), but I'm not sure whether the data would be useful for Commons or Wiktionary.
Concerns
- I don't know what the original privacy review covered, but it might be worth asking for a new privacy review for the >50M and >10M languages, given that the reader pool generating the data shrinks for those wikis.
- In theory, adding a Wikipedia site to the job is just a matter of adding it to the config. In practice, I don't know whether including all languages in a single job will scale forever. Though English dwarfs all of these wikis in the amount of data, a language like Swedish has almost half as many articles, so it still might break the job. Other projects like Commons or Wikidata are of course also wildcards for this reason.
- In practice, adding a bunch more languages shouldn't drastically increase the amount of data being stored, because we already generate the dataset for the largest wikis.
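To make the "just add it to the config" point concrete, here is a purely hypothetical sketch of a per-wiki parameterization -- the names and structure are invented for illustration and do not reflect the actual job's config. The idea is that if each wiki is its own job run, a large or slow wiki doesn't block the rest:

```python
# Hypothetical config sketch: each wiki becomes its own job run rather
# than one monolithic job over all languages. Lists are illustrative.
CURRENT = ["de", "en", "es", "fa", "fr", "it", "ja", "pl", "pt", "ru", "zh"]
PROPOSED = ["ar", "nl", "tr", "id", "sv", "ko", "vi", "hi", "th", "he", "cs", "fi"]

def wikis_to_process():
    """One job parameter per wiki, so failures/runtimes stay isolated."""
    return [f"{lang}.wikipedia" for lang in CURRENT + PROPOSED]
```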