Have a way to show the most popular pages per country
Open, Low, Public

Description

The Pageviews tool provides valuable information about the most popular articles in every project.

However, it would also be useful to have information about the most popular articles by country. Countries and languages do not map onto each other one-to-one. This is especially important for languages that are spoken in many countries, such as English, French, and Spanish: the most popular articles in the U.K., the U.S., Australia, Nigeria, South Africa, and India are probably quite different.

In such a tool it would be useful to see the most popular pages in all the projects; in such a case, the 100 most popular pages in Moldova will probably include articles in Romanian, Russian, and English Wikipedias, and possibly also some Commons, Wiktionary, and Wikisource pages. It would also be useful to filter for both country and language and, for example, see only the most popular English Wikipedia articles in Moldova.
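The filtering described above could look roughly like the following Python sketch. The row layout (country, project, page_title, views) and the names `ViewRow` and `top_pages` are assumptions made for illustration; they are not part of the Pageviews tool or of any existing API.

```python
from collections import Counter
from typing import Iterable, NamedTuple, Optional


class ViewRow(NamedTuple):
    # Hypothetical row shape; a real data source may use other field names.
    country: str
    project: str
    page_title: str
    views: int


def top_pages(rows: Iterable[ViewRow], country: str,
              project: Optional[str] = None, limit: int = 100):
    """Return the most viewed (project, page_title) pairs for one country.

    If project is given, only pages from that project are counted.
    """
    totals = Counter()
    for row in rows:
        if row.country != country:
            continue
        if project is not None and row.project != project:
            continue
        # Sum views per page across all rows (for example, hourly rows).
        totals[(row.project, row.page_title)] += row.views
    return totals.most_common(limit)
```

Calling `top_pages(rows, "Moldova")` would rank pages from all projects together, while `top_pages(rows, "Moldova", project="en.wikipedia")` would keep only the English Wikipedia pages.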

It probably makes the most sense to integrate this into the existing Pageviews tool, and add a new tab to the current Langviews, Topviews, Siteviews, etc. However, it may also make sense to set it up elsewhere, for example in Turnilo, Superset, or some other platform.

I once raised this at the Analytics mailing list: https://lists.wikimedia.org/pipermail/analytics/2018-July/006385.html . The query suggested in that thread by @fdans works, but it's slowish, and there's no fast API for this, as there is for pageviews per project.

There are probably some blocking privacy issues. They should not be a total blocker, however. It's OK to filter out some problematic entries in small countries or languages where personally identifiable information can show up, but it's probably fine to show the top 500 viewed pages in Nigeria (just as an example).
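One possible way to filter out problematic entries is a minimum view count below which a page is not shown. The Python sketch below assumes that approach; the function name and the threshold value are hypothetical, and the actual privacy rule would be a policy decision that this task does not specify.

```python
def suppress_small_counts(ranked, min_views=100):
    """Drop entries whose view count is below min_views.

    ranked: list of (page, views) pairs, sorted by views in descending order.
    min_views: assumed threshold, chosen here only for illustration.
    """
    return [(page, views) for page, views in ranked if views >= min_views]
```

With a rule like this, a large country would still yield a long list, while a small country or language would yield a shorter one instead of exposing rarely viewed pages.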

Beyond the general "zeitgeist" curiosity, such a tool will be particularly strategically useful to people who want to develop projects in languages that are spoken by many people, but don't yet have a lot of articles. This is true for many languages of India, for example, where English is the most popular language by far, even though most people there speak other languages.

Event Timeline

Amire80 created this task. Oct 16 2018, 1:15 PM
Restricted Application added subscribers: MusikAnimal, Aklapper. Oct 16 2018, 1:15 PM
fdans awarded a token. Oct 16 2018, 3:18 PM
fdans triaged this task as Low priority. Oct 18 2018, 4:50 PM
fdans moved this task from Incoming to Analytics Query Service on the Analytics board.
Superyetkin removed a subscriber: Superyetkin.
Superyetkin added a subscriber: Superyetkin.

Here's a rather simple query that shows the top 100 most popular articles by country.

SELECT
  project,
  page_title,
  namespace_id,
  sum(view_count) as count
FROM
  wmf.pageview_hourly
WHERE
  -- Exclude special pages and files.
  namespace_id not in (-1, 6) AND
  -- This should be better understood. It shouldn't be needed.
  -- The existence of this '-' probably hides real pageviews.
  page_title not in ('-') AND
  -- Put the country name here
  country = 'France' AND
  -- Put the right month here
  month = 11 AND
  year = 2018 
GROUP BY
  project,
  page_title,
  namespace_id
ORDER BY
  count desc
LIMIT
  100;

For November 2018 and France this takes 227 seconds on stat1007.

The most viewed page has a count of 794,995, and #100 has 91,263.

I ran this for several countries and the results make sense to me, but please let me know if something there is wrong.

Nuria added a subscriber: Nuria. Mar 25 2019, 11:11 PM

This work needs bot filtering to be active first; otherwise you would just get "fake" top-10 lists per country, as much of the data would be distorted by bot traffic.

Amire80 removed a subscriber: Nuria. Mar 25 2019, 11:22 PM

Thanks for the comment.

Is this also an issue for the topviews that are shown per language?

Yes, it is an issue with any top list. Now, topviews has a "spam" list, so titles that are known to receive spammy traffic are removed. Those are reported by users, and while the list is great to have, it only removes the major offenders.

Amire80 added a subscriber: Nuria. Mar 25 2019, 11:28 PM

OK, so can the same list be reused for the top views per language and per country? If they have spam, it's probably OK that they have the same spam :)

No, the list is incomplete. It mitigates the problem a bit, but it does not make it disappear, and it is not very effective on non-English wikis. See: https://tools.wmflabs.org/topviews/faq/#false_positive The data for this tool is the same pageview data you are looking at when you query pageview_hourly.

What I'm trying to say is that the pageviews tool now has a topviews tool that shows data per language, and there should also be a topviews tool that shows data per country. If it's not perfect for data per language, it will be just as imperfect for data per country. So not having a very good spam filter doesn't sound like a blocker. It's better to have something imperfect than not to have anything at all.

I ran this query for several countries in Africa, and the top results looked quite sensible: local politicians and celebrities, important events from the histories of these countries, etc. So maybe spam exists, but it doesn't look like a huge blocking problem.

It is. At times bot traffic amounts to 10% of our total traffic. This issue is not spread equally across projects and countries; it affects enwiki most of all, across all countries. We need to release data with a notion of quality, and top pageviews per country does not have enough quality to be released as is. Neither (to be totally fair) does the per-project data, but that one has already been public for a while. It might be unknown to you, but that dataset has its share of bug reports that we need to fix before releasing data of a similar type.

To clarify: I understand that this "spam" list used by topviews is about traffic that is not already classified as a known spider in our pageviews data. (@MusikAnimal should be able to confirm - it's not 100% clear from the tool's documentation.)
But the query at T207171#4807857 doesn't yet exclude those known spider views either. You can add the condition agent_type = 'user' for that.
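As a sketch of that suggestion, the Python helper below rebuilds the query posted earlier in this task with the agent_type = 'user' condition added. The helper name `build_top_pages_query` is hypothetical, and renaming the result column from count to views is a choice made here to avoid reusing a function name as an alias; neither comes from the thread. The generated query has not been run against the cluster.

```python
def build_top_pages_query(country: str, year: int, month: int,
                          limit: int = 100) -> str:
    """Build the HiveQL query from this task with the agent_type filter added."""
    # Crude guard against breaking out of the string literal. It also rejects
    # legitimate country names that contain an apostrophe.
    if "'" in country or "\\" in country:
        raise ValueError("country must not contain quotes or backslashes")
    return f"""
SELECT
  project,
  page_title,
  namespace_id,
  sum(view_count) AS views
FROM
  wmf.pageview_hourly
WHERE
  -- Exclude special pages and files.
  namespace_id NOT IN (-1, 6) AND
  -- '-' is the special value used when the title is not extracted.
  page_title NOT IN ('-') AND
  -- Exclude views already classified as known spiders.
  agent_type = 'user' AND
  country = '{country}' AND
  month = {int(month)} AND
  year = {int(year)}
GROUP BY
  project,
  page_title,
  namespace_id
ORDER BY
  views DESC
LIMIT
  {int(limit)};
""".strip()
```

For example, `build_top_pages_query("France", 2018, 11)` produces the November 2018 query for France with known spider views excluded.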

Here's a rather simple query that shows the top 100 most popular articles by country.

[...]

-- This should be better understood. It shouldn't be needed.
-- The existence of this '-' probably hides real pageviews.
page_title not in ('-') AND
...

See the explanation at https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/Pageview_hourly#Changes_and_known_problems_since_2015-06-16 :
"The special value used when the title is not extracted is -."

Topviews lists pages from the /metrics/pageviews/top/ API endpoint, which I believe is restricted to pageviews with the user agent type. In addition, crowd-sourced data filters out obvious fake traffic. Currently I personally have to go in and manually verify these user-submitted reports one by one, so it is not comprehensive or always up-to-date.
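A rough Python sketch of that flow: build the URL for the top endpoint, then drop reported titles from a parsed response. The URL layout and the response shape (`items[0]["articles"]`, each entry carrying an `article` key) reflect my understanding of the public Wikimedia REST API and should be checked against its documentation. The function names and the `reported_titles` set are hypothetical and do not describe how Topviews is actually implemented.

```python
from urllib.parse import quote

API_ROOT = "https://wikimedia.org/api/rest_v1/metrics/pageviews/top"


def top_url(project: str, year: int, month: int, day: str = "all-days",
            access: str = "all-access") -> str:
    """Build the URL of the top-pageviews endpoint for one project."""
    return f"{API_ROOT}/{quote(project)}/{access}/{year:04d}/{month:02d}/{day}"


def drop_reported(response: dict, reported_titles: set) -> list:
    """Return the articles from a parsed response, minus reported titles."""
    articles = response["items"][0]["articles"]
    return [a for a in articles if a["article"] not in reported_titles]
```

The sketch does no network access; fetching the URL and decoding the JSON body is left to whatever HTTP client the caller prefers.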

SBisson added subscribers: AMuigai, SBisson.

Having this data available in an API would be very useful to the Inuka-Team for the KaiOS-Wikipedia-app, to show locally relevant trending articles.

I understand the concerns around data quality, but for our specific use case, there's no way the page views per country per language can be worse than the page views per language in terms of local relevance.

I have no idea how AQS works or whether adding such a filter is a big effort, but if it could be turned on for a whitelist of countries (maybe even just India), that would be really useful.

Restricted Application added a subscriber: Strainu. Thu, Nov 14, 6:41 PM
SBisson moved this task from Backlog to Watching on the Inuka-Team board. Thu, Nov 14, 6:41 PM
Nuria added a comment (edited). Thu, Nov 14, 7:12 PM

See the current issues with the Hungarian Wikipedia to understand why this type of data is of no use until bot filtering is in place. Top lists as compiled now, unless they are curated by hand, have significant issues: T237282: Topviews Analysis of the Hungarian Wikipedia is flooded with spam

I see what you mean. This form of spam can truly deface what we are showing in the trending section of the apps. I hope the effort around bot filtering makes this data usable in the future.