Page MenuHomePhabricator

Formulate a way to calculate the index of how many people in each country read Wikimedia projects in the local languages
Open, Needs TriagePublic

Description

Neil, Pau, and I discussed the following idea recently: to find a way to understand how much do the readers of Wikimedia projects in every country of the world read wiki pages in the languages actually spoken by the people of that country.

The purpose of this index is understanding the level of knowledge equity between different countries by understanding in which countries do the people get knowledge as a service in a language they know, and to track the progress of this equity over time.

For example:

  • In United Kingdom, France, Russia, and Japan this index will be very high. People in these countries mostly read Wikipedia, Wikisource, etc., in English, French, Russian, and Japanese, respectively.
  • In Israel it will be lower, because even though more than half of the pageviews from Israel are to projects in Hebrew, Arabic, and Russian, which are the most common spoken languages in Israel, roughly 40% of pageviews are to English projects. English is the first language for less than 3% of the people in Israel, but more than half of the people know it as a second language (numbers vary: 85% in CLDR, but according to the Israeli central bureau of statistics 30% know it "well", and 40% know it "partially").
  • In Zambia it will be very low, because almost all the pageviews are to the English Wikipedia, even though only 16% of the people know English (according to CLDR). The main languages actually spoken by the people of Zambia are Bemba, Nyanja, Lozi, and Tonga.

So, this index will be a number that is derived from the:

  • The % of pageviews to languages that are actually viewed by people during a given period.
  • The % of people who know the popular languages as their first language.
  • The % of people who know the popular languages as their second language.
  • The % of people who don't know the popular languages.

It's a complex calculation, and at the moment I don't have an immediate good idea on how to come up with it.

However, I think I have at least a way to find what are the most popular projects in a country:

SELECT
  project,
  sum(view_count) as count
FROM
  wmf.pageview_hourly
WHERE
  country = 'Zambia' AND
  month = 3 AND
  year = 2019 
GROUP BY
  project
ORDER BY
  count desc
LIMIT
  50;

This can run in hive on stat1007. (If you have good familiarity with the wmf.pageview_hourly table, please review the correctness of this query.)

Event Timeline

Amire80 created this task.Apr 10 2019, 5:00 PM
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptApr 10 2019, 5:00 PM
Amire80 moved this task from Backlog to Metrics on the Language-strategy board.Jul 30 2020, 1:54 PM