Neil, Pau, and I discussed the following idea recently: to find a way to understand how much do the readers of Wikimedia projects in every country of the world read wiki pages in the languages actually spoken by the people of that country.
The purpose of this index is understanding the level of knowledge equity between different countries by understanding in which countries do the people get knowledge as a service in a language they know, and to track the progress of this equity over time.
- In United Kingdom, France, Russia, and Japan this index will be very high. People in these countries mostly read Wikipedia, Wikisource, etc., in English, French, Russian, and Japanese, respectively.
- In Israel it will be lower, because even though more than half of the pageviews from Israel are to projects in Hebrew, Arabic, and Russian, which are the most common spoken languages in Israel, roughly 40% of pageviews are to English projects. English is the first language for less than 3% of the people in Israel, but more than half of the people know it as a second language (numbers vary: 85% in CLDR, but according to the Israeli central bureau of statistics 30% know it "well", and 40% know it "partially").
- In Zambia it will be very low, because almost all the pageviews are to the English Wikipedia, even though only 16% of the people know English (according to CLDR). The main languages actually spoken by the people of Zambia are Bemba, Nyanja, Lozi, and Tonga.
So, this index will be a number that is derived from the:
- The % of pageviews to languages that are actually viewed by people during a given period.
- The % of people who know the popular languages as their first language.
- The % of people who know the popular languages as their second language.
- The % of people who don't know the popular languages.
It's a complex calculation, and at the moment I don't have an immediate good idea on how to come up with it.
However, I think I have at least a way to find what are the most popular projects in a country:
SELECT project, sum(view_count) as count FROM wmf.pageview_hourly WHERE country = 'Zambia' AND month = 3 AND year = 2019 GROUP BY project ORDER BY count desc LIMIT 50;
This can run in hive on stat1007. (If you have good familiarity with the wmf.pageview_hourly table, please review the correctness of this query.)