
Have a way to show the most popular pages per country
Open, High, Public

Description

The Pageviews tool provides valuable information about the most popular articles in every project.

However, it would also be useful to have information about the most popular articles by country. Countries and languages are very different. This is especially important for languages that are spoken in many countries, such as English, French, and Spanish: the most popular articles in U.K., U.S., Australia, Nigeria, South Africa, and India are probably quite different.

In such a tool it would be useful to see the most popular pages in all the projects; in such a case, the 100 most popular pages in Moldova will probably include articles in Romanian, Russian, and English Wikipedias, and possibly also some Commons, Wiktionary, and Wikisource pages. It would also be useful to filter for both country and language and, for example, see only the most popular English Wikipedia articles in Moldova.

It probably makes the most sense to integrate this into the existing Pageviews tool, and add a new tab to the current Langviews, Topviews, Siteviews, etc. However, it may also make sense to set it up elsewhere, for example in Turnilo, Superset, or some other platform.

I once raised this at the Analytics mailing list: https://lists.wikimedia.org/pipermail/analytics/2018-July/006385.html . The query suggested in that thread by @fdans works, but it's slowish, and there's no fast API for this, as there is for pageviews per project.

There are probably some blocking privacy issues. They should not be a total blocker, however. It's OK to filter out some problematic entries in small countries or languages where personally identifiable information can show up, but it's probably fine to show the top 500 viewed pages in Nigeria (just as an example).

Beyond the general "zeitgeist" curiosity, such a tool will be particularly strategically useful to people who want to develop projects in languages that are spoken by many people, but don't yet have a lot of articles. This is true for many languages of India, for example, where English is the most popular language by far, even though most people there speak other languages.

Event Timeline

Amire80 created this task. Oct 16 2018, 1:15 PM
Restricted Application added subscribers: MusikAnimal, Aklapper. Oct 16 2018, 1:15 PM
fdans awarded a token. Oct 16 2018, 3:18 PM
fdans triaged this task as Low priority. Oct 18 2018, 4:50 PM
fdans moved this task from Incoming to Analytics Query Service on the Analytics board.
Superyetkin removed a subscriber: Superyetkin.
Superyetkin added a subscriber: Superyetkin.

Here's a rather simple query that shows the top 100 most popular articles by country.

SELECT
  project,
  page_title,
  namespace_id,
  sum(view_count) as count
FROM
  wmf.pageview_hourly
WHERE
  -- Exclude special pages and files.
  namespace_id not in (-1, 6) AND
  -- This should be better understood. It shouldn't be needed.
  -- The existence of this '-' probably hides real pageviews.
  page_title not in ('-') AND
  -- Put the country name here
  country = 'France' AND
  -- Put the right month here
  month = 11 AND
  year = 2018 
GROUP BY
  project,
  page_title,
  namespace_id
ORDER BY
  count desc
LIMIT
  100;

For November 2018 and France this takes 227 seconds on stat1007.

The most viewed page has the count of 794,995, and #100 has 91,263.

I ran this for several countries and the results make sense to me, but please let me know if something there is wrong.

Nuria added a subscriber: Nuria. Mar 25 2019, 11:11 PM

This work needs bot filtering to be active first; otherwise you would just get "fake" top-10 lists per country, as much of the data will be distorted by bot traffic.

Amire80 removed a subscriber: Nuria. Mar 25 2019, 11:22 PM

This work needs bot filtering to be active first; otherwise you would just get "fake" top-10 lists per country, as much of the data will be distorted by bot traffic.

Thanks for the comment.

Is this also an issue for the topviews that are shown per language?

Is this also an issue for the topviews that are shown per language?

Yes, it is an issue with any top list. Now, topviews has a "spam" list so titles that are known to be spammy traffic are removed. Those are reported by users, and while the list is great to have, it only removes the major offenders.

Amire80 added a subscriber: Nuria. Mar 25 2019, 11:28 PM

Is this also an issue for the topviews that are shown per language?

Yes, it is an issue with any top list. Now, topviews has a "spam" list so titles that are known to be spammy traffic are removed. Those are reported by users, and while the list is great to have, it only removes the major offenders.

OK, so can the same list be reused for the top views per language and per country? If they have spam, it's probably OK that they have the same spam :)

OK, so can the same list be reused for the top views per language and per country? If they have spam, it's probably OK that they have the same spam :)

No, the list is incomplete: it mitigates the problem a bit but does not make it disappear, and it is not very effective on non-English wikis. See: https://tools.wmflabs.org/topviews/faq/#false_positive The data for this tool is the same pageview data you are looking at when you query pageview_hourly.

What I'm trying to say is that the pageviews tool now has a topviews tool that shows data per language, and there should also be a topviews tool that shows data per country. If it's not perfect for data per language, it will be just as imperfect for data per country. So not having a very good spam filter doesn't sound like a blocker. It's better to have something imperfect than not to have anything at all.

I ran this query for several countries in Africa, and the top results looked quite sensible: local politicians and celebrities, important events from the histories of these countries, etc. So maybe spam exists, but it doesn't look like a huge blocking problem.

So maybe spam exists, but it doesn't look like a huge blocking problem.

It is: at times bot traffic amounts to 10% of our total traffic. This issue is not equally spread across projects and countries; it affects enwiki most of all, across all countries. We need to release data with a notion of quality, and top pageviews per country does not have enough quality to be released as is. Neither (to be totally fair) does the per-project data, but that one has already been public for a while. It might be unknown to you, but that dataset has its share of bug reports that we need to fix before releasing data of a similar type.

Is this also an issue for the topviews that are shown per language?

Yes, it is an issue with any top list. Now, topviews has a "spam" list so titles that are known to be spammy traffic are removed. Those are reported by users, and while the list is great to have, it only removes the major offenders.

To clarify: I understand that this "spam" list used by topviews is about traffic that is not already classified as a known spider in our pageviews data. (@MusikAnimal should be able to confirm - it's not 100% clear from the tool's documentation.)
But the query at T207171#4807857 doesn't yet exclude those known spider views either. You can add the condition agent_type = 'user' for that.
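As a rough illustration of that suggestion, here is a minimal sketch that runs the earlier query with the extra agent_type = 'user' condition against a toy in-memory SQLite table. The table contents are invented, SQLite stands in for Hive, and only the columns the query touches are modelled, so treat it as a demonstration of the filter rather than of the real wmf.pageview_hourly schema.

```python
import sqlite3

# Toy stand-in for wmf.pageview_hourly; all rows below are invented.
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS wmf")
conn.execute(
    """
    CREATE TABLE wmf.pageview_hourly (
        project TEXT, page_title TEXT, namespace_id INTEGER,
        country TEXT, agent_type TEXT, year INTEGER, month INTEGER,
        view_count INTEGER
    )
    """
)
rows = [
    ("fr.wikipedia", "Paris", 0, "France", "user", 2018, 11, 120),
    ("fr.wikipedia", "Paris", 0, "France", "spider", 2018, 11, 900),
    ("fr.wikipedia", "-", 0, "France", "user", 2018, 11, 500),
    ("en.wikipedia", "France", 0, "France", "user", 2018, 11, 80),
    ("fr.wikipedia", "Lyon", 0, "Belgium", "user", 2018, 11, 70),
]
conn.executemany(
    "INSERT INTO wmf.pageview_hourly VALUES (?, ?, ?, ?, ?, ?, ?, ?)", rows
)

QUERY = """
SELECT project, page_title, namespace_id, SUM(view_count) AS views
FROM wmf.pageview_hourly
WHERE namespace_id NOT IN (-1, 6)   -- exclude special pages and files
  AND page_title != '-'             -- placeholder for titles that were not extracted
  AND agent_type = 'user'           -- drop views already classified as spiders
  AND country = ? AND year = ? AND month = ?
GROUP BY project, page_title, namespace_id
ORDER BY views DESC
LIMIT 100
"""
top = conn.execute(QUERY, ("France", 2018, 11)).fetchall()
```

In this toy data the 900 spider views of "Paris" no longer inflate its total, which is the effect the extra condition is meant to have.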

Here's a rather simple query that shows the top 100 most popular articles by country.

[...]

-- This should be better understood. It shouldn't be needed.
-- The existence of this '-' probably hides real pageviews.
page_title not in ('-') AND
...

See the explanation at https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/Pageview_hourly#Changes_and_known_problems_since_2015-06-16 :
"The special value used when the title is not extracted is -."

Is this also an issue for the topviews that are shown per language?

Yes, it is an issue with any top list. Now, topviews has a "spam" list so titles that are known to be spammy traffic are removed. Those are reported by users, and while the list is great to have, it only removes the major offenders.

To clarify: I understand that this "spam" list used by topviews is about traffic that is not already classified as a known spider in our pageviews data. (@MusikAnimal should be able to confirm - it's not 100% clear from the tool's documentation.)

Topviews lists pages from the /metrics/pageviews/top/ API endpoint, which I believe is restricted to pageviews with the user agent type. In addition, crowd-sourced data filters out obvious fake traffic. Currently I personally have to go in and manually verify these user-submitted reports one by one, so it is not comprehensive or always up-to-date.

SBisson added subscribers: AMuigai, SBisson.

This data available in an API would be very useful to the Inuka-Team for the KaiOS-Wikipedia-app to show locally-relevant trending articles.

I understand the concerns around data quality, but for our specific use case, there's no way the page views per country per language can be worse than the page views per language in terms of local relevance.

I have no idea how AQS works and if adding such a filter is a big effort but if it could be turned on for a whitelist of countries (maybe even just India) that would be really useful.

Restricted Application added a subscriber: Strainu. Nov 14 2019, 6:41 PM
SBisson moved this task from Backlog to Watching on the Inuka-Team board. Nov 14 2019, 6:41 PM
Nuria added a comment. Edited Nov 14 2019, 7:12 PM

I understand the concerns around data quality, but for our specific use case, there's no way the page views per country per language can be worse than the page views per language in terms of local relevance.

See the issues with Hungarian Wikipedia now to understand why this type of data is of no use until bot filtering is in place. Top lists as compiled now, unless they are curated by hand, have significant issues: T237282: Topviews Analysis of the Hungarian Wikipedia is flooded with spam

I understand the concerns around data quality, but for our specific use case, there's no way the page views per country per language can be worse than the page views per language in terms of local relevance.

See the issues with Hungarian Wikipedia now to understand why this type of data is of no use until bot filtering is in place. Top lists as compiled now, unless they are curated by hand, have significant issues: T237282: Topviews Analysis of the Hungarian Wikipedia is flooded with spam

I see what you mean. This form of spam can truly deface what we are showing in the trending section of the apps. I hope the effort around bots filtering makes this data usable in the future.

Peter added a subscriber: Peter. Mar 24 2020, 7:52 AM

@Nuria checking where this is on your list of priorities? This is still a feature we would like to use for new readers, more so in the Wikipedia app for KaiOS.

Nuria added a comment. Apr 14 2020, 3:15 PM

The bot detection (which is a prerequisite to this work) will be rolled out in the next couple of weeks. However, we do not have plans to work on this this quarter, so it is at least two quarters away.

fdans added a comment. May 21 2020, 6:22 AM

+1 to what Nuria said, but I'm moving this to incoming, since the completion of bot detection means we should probably reprioritize this from its current "low".

Milimetric raised the priority of this task from Low to High. Jun 8 2020, 4:16 PM
Milimetric moved this task from Incoming to Analytics Query Service on the Analytics board.
Milimetric added a subscriber: Milimetric.

From Nuria on priority: we need to finish the API work (Lex's work in T238365) before we do this, but we can grab it as a goal after that.

Amire80 moved this task from Backlog to Metrics on the Language-strategy board. Jul 30 2020, 9:07 PM

@Milimetric @Nuria Checking in again about the plans for this?

Nuria added a comment. Sep 7 2020, 7:59 PM

We are wrapping up the work listed on the ticket above, so it is likely we can get to this this quarter. To be clear, this data is available internally. The work we will be doing is to make it available externally as well.

Nuria assigned this task to lexnasser. Sep 7 2020, 7:59 PM
Nuria added a comment. Wed, Oct 7, 5:54 PM

@AMuigai @Amire80 @lexnasser is going to start working on this project this quarter. I propose we use a top 100 of articles per country where these articles can be in *any* language. Thus you could request "the most popular articles in French in Spain" for the month 2020-01, and that list might have, say, 5 articles because the top 100 includes 95 articles in Spanish and 5 in French. Does that make sense?

If you request the top list of articles (regardless of language) for a country the list returned will have 100 items as long as numbers of pageviews are sufficiently large.

Privacy-wise: articles with a very small number of views, or countries with a very small number of views, will not be reported.
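To make the proposed semantics concrete, here is a minimal sketch in which the ranking is computed across all projects first and the language filter is applied afterwards, so a per-language request can return fewer than 100 items. The Row type, the function name, and the numbers are all hypothetical; this is not the actual API design.

```python
from typing import NamedTuple


class Row(NamedTuple):
    project: str  # e.g. "es.wikipedia"
    title: str
    views: int


def top_for_country(rows, limit=100, project=None):
    """Rank across all projects first, then optionally keep one project."""
    ranked = sorted(rows, key=lambda r: r.views, reverse=True)[:limit]
    if project is not None:
        ranked = [r for r in ranked if r.project == project]
    return ranked


# Invented numbers: 95 Spanish and 10 French articles viewed from one country.
rows = [Row("es.wikipedia", f"es_{i}", 1000 - i) for i in range(95)]
rows += [Row("fr.wikipedia", f"fr_{i}", 950 - 10 * i) for i in range(10)]

all_languages = top_for_country(rows)
french_only = top_for_country(rows, project="fr.wikipedia")
```

With these numbers the unfiltered list has 100 entries, while the French-only request returns just the 5 French articles that made it into the overall top 100.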

... I propose we use a top 100 of articles per country where these articles can be in *any* language

Maybe someone should do some analysis around how that would look in practice. Because it's possible English Wikipedia, with the majority of the traffic, would dominate a lot of countries. So maybe if we run some numbers we'd get a better sense of what we should output.

... I propose we use a top 100 of articles per country where these articles can be in *any* language

Maybe someone should do some analysis around how that would look in practice. Because it's possible English Wikipedia, with the majority of the traffic, would dominate a lot of countries. So maybe if we run some numbers we'd get a better sense of what we should output.

Indeed, this is what will happen. In some other countries it will be French (Guinea, Mali), Portuguese (Mozambique, Angola), or Russian (the difference between Romania and Moldova is quite striking). And this is totally fine, and we want to know this!

So both things are useful:

  • The top 100 (or maybe 500?..) pages in all languages.
  • Top 100 in each language.

And of course it's OK if you stop after fewer than 100 if there's a privacy issue or if too few people read in that language.

Nuria added a comment. Edited Wed, Oct 7, 8:19 PM

Top 100 in each language.

To be clear, this is mostly not possible for privacy reasons, as we can only release buckets of pageviews that are large enough, and things like "Malaysian top pageviews in San Marino" will not meet that criterion. It is very likely that "top 100 pageviews in San Marino" will not meet it either. So a strategy that tries all languages for all countries is not very likely to be successful, as clients would be left with an "undetermined" number of projects for which pageviews are available per country and no way to know which projects those are.

In any case to milimetric's point this can be further quantified with data analysis.

Top 100 in each language.

To be clear, this is mostly not possible for privacy reasons, as we can only release buckets of pageviews that are large enough, and things like "Malaysian top pageviews in San Marino" will not meet that criterion.

San Marino is a very extreme example :)

Is it possible for Nigeria, Mali, Kenya, India, Philippines, Romania, Kyrgyzstan?

Nuria added a comment. Thu, Oct 8, 3:13 PM

Is it possible for Nigeria, Mali, Kenya, India, Philippines, Romania, Kyrgyzstan?

It depends, but (other than Romania) the answer is "mostly yes". Keep in mind that San Marino has about 10,000 pageviews per day with a population of 33,000. Kyrgyzstan has about 300K pageviews with a population of 6 million, so the number of pageviews per person is a lot smaller. If you do language splits that are very detailed, you can basically infer which Wikipedia pages the inhabitants of Kyrgyzstan who "speak French and have access to the internet" are viewing. That group might be much smaller than the population of San Marino.

In general, you are right that San Marino is an "extreme example". In this API we will probably not publish stats for such small countries, because there is no way to do that without disclosing a lot about the users in the country, but the "idea" applies to other, larger countries where the issues are not so obvious.

In terms of the privacy considerations for countries with low pageview counts, I found that the most-viewed articles by project endpoint reports articles with very low pageview counts. See: https://wikimedia.org/api/rest_v1/metrics/pageviews/top/kl.wikipedia/all-access/2019/10/01. It reports 374 different articles with a single pageview, and I'd assume that the vast majority of those originate from Greenland, given that it is Greenlandic-language Wikipedia. With the relative similarity between these two projects, I was wondering what led to the decision to not perform those privacy transformations on the data, and if those same reasons would be relevant to this case.
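For anyone who wants to repeat that check, here is a minimal sketch that builds the endpoint URL in the form shown above and counts low-view articles in a response. The helper names are hypothetical, and the assumed payload shape (an "items" list whose first entry holds an "articles" list with "article", "views", and "rank" fields) should be verified against a live response; the sample data is invented and no network call is made.

```python
def top_url(project, year, month, day, access="all-access"):
    """Build the most-viewed-articles URL in the form quoted above."""
    return (
        "https://wikimedia.org/api/rest_v1/metrics/pageviews/top/"
        f"{project}/{access}/{year:04d}/{month:02d}/{day:02d}"
    )


def count_low_view_articles(payload, max_views=1):
    """Count articles at or below max_views (assumed payload shape)."""
    articles = payload["items"][0]["articles"]
    return sum(1 for a in articles if a["views"] <= max_views)


# Invented sample in the assumed response shape.
sample = {
    "items": [
        {
            "articles": [
                {"article": "Example_A", "views": 12, "rank": 1},
                {"article": "Example_B", "views": 1, "rank": 2},
                {"article": "Example_C", "views": 1, "rank": 3},
            ]
        }
    ]
}

url = top_url("kl.wikipedia", 2019, 10, 1)
single_view = count_low_view_articles(sample)
```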

Nuria added a comment. Thu, Oct 8, 3:41 PM

@lexnasser Nice, yes, the same considerations apply to your example. That such a low count is available speaks of a bug here: https://github.com/wikimedia/analytics-refinery/blob/master/oozie/cassandra/monthly/pageview_top_articles.hql

speaks of a bug

I disagree :)
The raw data is broadly available per language project, on the API or in dumps. Therefore the top endpoint is just a computation facilitated over data that is already available.
For country data it is different, as it is a dimension we currently provide for pageviews at project level only, and with bucketing.

Nuria added a comment. Edited Thu, Oct 8, 3:55 PM

The raw data is available for per-language project broadly on the API or on dumps.

You are right. I forgot these files are available per project. My reasoning above breaks down where project (more or less) == country (per @lexnasser's example), as that data is already public. The example of users in Kyrgyzstan who "speak French and have access to the internet" still stands.

Just finished a first draft of the design doc for this project! You can find it here: https://docs.google.com/document/d/19HbdPvSHPUF9n4thFOlck0dIgvZvg50K3mcL-guoViY/edit?usp=sharing

I made a bunch of comments that I'd love everyone's thoughts on. Feel free to either respond to my comments or make new ones of your own.

I know there's still a bunch of ambiguity on what the API should look like, so I'll get started on some data analysis once my access to the relevant data is re-enabled.

Just wanted to follow up to say that I'd love for everyone to take a look at the design doc and make suggestions as you see fit.

I'm also starting my data analysis for this project, which may affect the API design. I'll be sure to report any relevant findings.

Thanks so much!

Nuria added a comment. Thu, Oct 15, 3:13 PM

Wei talked about doing some data analysis to quantify the issues with privacy and country splits. As we discussed, we need to quantify the identification risk: an article with 1 pageview in "Greenlandic-language Wikipedia" might carry an identification risk of 1/55,000 (55,000 being the population of Greenland), while an article in Malaysian in San Marino might have an identification risk of 1/5 (5 citizens with Malaysian names in San Marino). So it is not the "number of pageviews" that defines the identification risk but rather the "possible population from which these pageviews are drawn".
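The arithmetic behind that comparison can be sketched as follows. This deliberately simplifies the risk to one over the size of the candidate pool, assuming every member of the pool is equally likely to be the reader; the function name is hypothetical and the pool sizes are the illustrative figures from the comment above.

```python
from fractions import Fraction


def identification_risk(pool_size):
    """Risk of attributing a single view to one person in a pool.

    Simplifying assumption: all pool members are equally likely readers.
    """
    if pool_size < 1:
        raise ValueError("pool_size must be at least 1")
    return Fraction(1, pool_size)


greenland = identification_risk(55_000)        # Greenlandic Wikipedia example
san_marino_malaysian = identification_risk(5)  # Malaysian-in-San-Marino example

# How much riskier the second bucket is, for the same single pageview.
ratio = san_marino_malaysian / greenland
```

Under this simplification the same one-pageview article is 11,000 times riskier in the second bucket, even though the pageview count is identical.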

Nuria added a subscriber: Isaac. Thu, Oct 15, 5:36 PM

Adding @Isaac because I think he is a good person to help explore whether more than a simple bucketization solution is needed.

Isaac added a comment. Thu, Oct 15, 5:59 PM

Thanks @Nuria -- indeed, I'm highly motivated to find a good solution for this and we had good conversations about similar aspects for the Covid session data project. Just quickly, our solution there had a few parts to address different aspects:

  • a blocklist of countries we'd never retain data for -- i.e. highly sensitive countries where an error in logic is deemed more problematic
  • a minimum # of unique users (where user is IP+user-agent) -- i.e. try to assert that at least X unique individuals are generating the pageviews to be included in the data
  • a maximum % of pageviews we'd release data on -- e.g., don't release data if it covers more than 10% of pageviews from the whole country to again reduce the risk of a user appearing in the data and being identifiable
  • random sampling by day so that heavy users or e.g., users who always view popular pages don't continue to show up in the data every day (in general, introducing some small amount of randomness is good if you don't need exactness, which doesn't seem necessary for this dataset in my opinion though I'm happy to be proved wrong)
  • exclude people with traces of editing in their sessions -- these accounts can be tied to a given page on a given day via the edit history so it's best to exclude them as they are at higher risk of re-identification
  • exclude mobile app editors -- this is largely a function of the prior piece because the clause we use for detecting edits doesn't cover edits made in the apps
  • exclude power users -- i.e. userhashes with greater than X pageviews in a day. This doubles as another form of likely bot removal, protects very heavy users of the project, and also in theory would help reduce the chance of a single user heavily skewing the data.
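A few of the filters in this list can be sketched together as below. The record shape (country, user hash, page, session-level edit flag), the function name, and every threshold are hypothetical, and the sketch covers only the blocklist, the editing-trace exclusion, the power-user cap, and the unique-user minimum; the maximum-share rule and the random sampling are left out.

```python
from collections import Counter, defaultdict


def releasable_pages(views, blocked_countries, min_users, max_views_per_user):
    """Return the (country, page) pairs that pass a subset of the filters.

    `views` holds one day's (country, user_hash, page, has_edit) tuples;
    this record shape and all thresholds are hypothetical.
    """
    # Blocklist of countries, plus sessions with traces of editing.
    kept = [v for v in views if v[0] not in blocked_countries and not v[3]]
    # Power users: drop every view from hashes above the daily cap.
    per_user = Counter((country, user) for country, user, _, _ in kept)
    kept = [v for v in kept if per_user[(v[0], v[1])] <= max_views_per_user]
    # Minimum number of distinct user hashes per (country, page).
    users = defaultdict(set)
    for country, user, page, _ in kept:
        users[(country, page)].add(user)
    return {key for key, seen in users.items() if len(seen) >= min_users}


sample = (
    [("NG", "u1", "A", False), ("NG", "u2", "A", False), ("NG", "u1", "B", False)]
    + [("NG", "u3", "A", True)]          # editing trace: excluded
    + [("XX", "u4", "A", False)] * 2     # blocklisted country: excluded
    + [("NG", "heavy", "A", False)] * 5  # above the per-user cap: excluded
)
result = releasable_pages(sample, {"XX"}, min_users=2, max_views_per_user=3)
```

Applying the per-user cap before counting distinct users means a single heavy hash cannot push a page over the unique-user minimum by itself.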

I'm not sure how many of these will be relevant / necessary but I'll take a look at Lex's document and give it some thought in the next week or so.

@Isaac thanks so much, those are the kinds of considerations we'd love to apply.

This seems also like a great opportunity to bring in population data into our data pipelines. Some of us have wanted to normalize our data by population for a while (T242621). Partly to get up to parity with Wikistats 1, but now I see that it could also be invaluable in privacy considerations.

Basically:

  1. Maintain a regular (yearly?) import of country populations
  2. Use the data to normalize our geographic metrics, so we're displaying something more meaningful than a population map
  3. Use the data to compute what Nuria describes in T207171#6547043 and make dynamic decisions about what to publish and not publish

Keep in mind that per-population data is not necessarily needed (it will be great to have at some point, but it feels like scope creep in this task).

In the case of the small "Malaysian" bucket of "San Marino", for example, the country population does not help much. What quantifies that the pool is too small in that case (more or less) is the ratio <# pageviews on Malaysian in San Marino>/<# total pageviews in San Marino>.

One thing that seems to be missed here is that it's not really that important how many articles in the Slovak language people read in Zambia, so if their count is too low and can't be displayed, no one is going to notice it or ask for it. It is interesting what articles people read in Slovak from Hungary, Poland, or Czechia, but if that number is also too low, then this fact itself is interesting, and the further details are not so much. So, if you just say "this number is too low to be displayed", I don't think that anyone will complain, because that is all the information that the product managers and the editors' community need. I trust the Analytics professionals to define what exactly "too low" means.

Nuria added a comment. Fri, Oct 16, 3:05 PM

So, if you just say "this number is too low to be displayed" , I don't think that anyone will complain

This is actually very useful info, thank you.

@Isaac thanks for sharing these!

I think the following points are most useful:

a blocklist of countries we'd never retain data for -- i.e. highly sensitive countries where an error in logic is deemed more problematic

a minimum # of unique users (where user is IP+user-agent) -- i.e. try to assert that at least X unique individuals are generating the pageviews to be included in the data

Some metrics to report:

  • For San Marino, a country with <34,000 population
    • There were ~150,000 pageviews in October
      • The 10th most viewed article in all of October had ~110 pageviews
      • The 100th most viewed article in all of October had ~30 pageviews
      • The 1000th most viewed article in all of October had ~10 pageviews
    • There were ~7,000 pageviews on October 10
      • The 10th most viewed article on October 10 had ~10 pageviews

Given these numbers, I think that granularity should be monthly, and, per the solution Isaac brought up, we could only report articles with above Y(=50?) unique user views rather than limiting reporting to articles that have above X total views. I also think that the top 1000 articles should be reported as long as they all have above Y unique user views.

I'd love to hear everyone's thoughts on this approach.
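A minimal sketch of that rule, assuming per-article monthly totals and unique-user counts are already available for one country: articles are ranked by total views, but only those with more than Y unique users are reported. The function name, the input shape, and the sample numbers are hypothetical.

```python
def monthly_report(stats, min_unique=50, limit=1000):
    """Rank by total views, reporting only articles above the unique-user floor.

    `stats` maps article -> (total_views, unique_users) for one country
    and month; the shape and the default thresholds are hypothetical.
    """
    eligible = [
        (article, views)
        for article, (views, unique_users) in stats.items()
        if unique_users > min_unique
    ]
    # Sort by views descending, breaking ties by title for stable output.
    eligible.sort(key=lambda item: (-item[1], item[0]))
    return eligible[:limit]


# Invented sample: B has many views but few distinct readers.
stats = {"A": (110, 80), "B": (300, 12), "C": (30, 51), "D": (60, 50)}
report = monthly_report(stats)
```

Here B is dropped despite having the most views, which is the difference between a unique-user threshold and a total-view threshold.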

I'll have to repeat that San Marino is a very extreme case :)

However, monthly is an OK default, at least as a start. Perhaps, once this is rolling, you could switch some larger countries to weekly.

@Amire80 I definitely agree that San Marino is an edge case. Do you think there are any other metrics that could help gauge what would be the best way to exclude articles (total views vs unique views vs something else)? Or do you just in general prefer one way over the others?

@Amire80 I definitely agree that San Marino is an edge case. Do you think there are any other metrics that could help gauge what would be the best way to exclude articles (total views vs unique views vs something else)? Or do you just in general prefer one way over the others?

Can't think of anything special. I'm not a true web analytics expert.

As far as I can see, I am mostly interested in the most popular articles in different languages in every country, for example for these purposes:

  • Helping people who live in these countries understand which articles should be prioritized for translation. For example, if a Russian Wikipedia article is popular in Moldova, and is not available in the Romanian language, it can be somehow suggested in Content Translation (such a feature doesn't exist yet, but maybe it will appear some day).
  • If an English Wikipedia article is popular in Nigeria, and it is also available in the local languages of Nigeria (Yoruba, Hausa, Igbo, Fula), the interlanguage links to it can be emphasized for readers in Nigeria (such a feature doesn't exist yet, but maybe it will appear some day).
  • A Wikimedia chapter in Colombia can organize a workshop for writing articles about famous people from the history of Colombia, and later check how popular they became in each Spanish-speaking country.

(The countries and the languages above are just examples, and lots of other countries and languages could be there.)

Other product managers, strategists, designers, and community members can also have different purposes and usage scenarios.

... Another kind of related scenario that someone has just brought up in the Wikimedia Telegram chat: "Is there anyway to know where's the readers come from for an article?"

That is, to see in which countries is the article popular.

If something like this can be done together with this task, it would be nice, but it's fine to do it separately.

Isaac added a comment. Wed, Oct 21, 9:22 PM

Keep in mind that per-population data is not necessarily needed (it will be great to have at some point, but it feels like scope creep in this task). In the case of the small "Malaysian" bucket of "San Marino", for example, the country population does not help much. What quantifies that the pool is too small in that case (more or less) is the ratio <# pageviews on Malaysian in San Marino>/<# total pageviews in San Marino>.

Nuria makes a very good point and I would also add that tourists would also greatly complicate interpretation of these numbers (see this list of countries where tourists greatly outnumber citizens).

Given these numbers, I think that granularity should be monthly, and, per the solution Isaac brought up, we could only report articles with above Y(=50?) unique user views rather than limiting reporting to articles that have above X total views. I also think that the top 1000 articles should be reported as long as they all have above Y unique user views.

However, monthly is an OK default, at least as a start. Perhaps, once this is rolling, you could switch some larger countries to weekly.

I'm onboard in general with your approach @lexnasser . I'd love to have a release cadence though that is more fine-grained than monthly because so much happens in a month and it's nice to be able to get a sense of whether articles are spiking in interest and where. Editors who like to be addressing breaking-news-type events too I'm sure would appreciate a daily report. In particular, one option that I could see working well and hopefully still easy to get off the ground:

  • A daily release to provide quick information for editors interested in very targeted editing. I suspect that this could even be just a ranking of most popular articles that meet the privacy thresholds without including any raw count data (though being able to include pageview counts / bins obviously provides more nuance and value).
  • The monthly (or weekly) to make sure that as many country-project pairs as possible are included (per your San Marino analysis, it seems a monthly release is necessary if they'll ever be included) and give actual pageview counts while reducing privacy risks by having the data cover so many days. This then could be used for quantitative analyses by people interested in campaign impact, trends in reader interest, etc.
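The daily variant could be sketched as a ranking that exposes only a rank and a coarse bin instead of raw counts. The power-of-ten binning, the function name, and the input shape are assumptions made for illustration; the thread does not specify how bins would be defined.

```python
def daily_ranking(stats, min_unique):
    """Return (rank, article, bin_floor) tuples without raw view counts.

    `stats` maps article -> (views, unique_users) for one country and day.
    bin_floor is the largest power of ten not exceeding the view count,
    a hypothetical binning scheme; views are assumed to be positive ints.
    """
    eligible = sorted(
        (
            (views, article)
            for article, (views, unique_users) in stats.items()
            if unique_users >= min_unique
        ),
        key=lambda pair: (-pair[0], pair[1]),
    )
    return [
        (rank, article, 10 ** (len(str(views)) - 1))
        for rank, (views, article) in enumerate(eligible, start=1)
    ]


# Invented sample: C has many views but too few distinct readers.
stats = {"A": (1200, 400), "B": (95, 60), "C": (5000, 3)}
ranking = daily_ranking(stats, min_unique=50)
```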

A daily release to provide quick information for editors interested in very targeted editing. I suspect that this could even be just a ranking of most popular articles that meet the privacy thresholds without including any raw count data

Nice, +1 to this idea

Nuria added a comment. Edited Wed, Oct 21, 11:37 PM

I would implement the daily "top" first, and once that is in place I would add the monthly job. Given the very different amounts of data needed for each, a different strategy might be needed for the second one.