Maniphest T207171

Have a way to show the most popular pages per country
Closed, Resolved, Public

Authored By

	Amire80
	Oct 16 2018, 1:15 PM

Description

The Pageviews tool provides valuable information about the most popular articles in every project.

However, it would also be useful to have information about the most popular articles by country. Countries and languages are very different. This is especially important for languages that are spoken in many countries, such as English, French, and Spanish: the most popular articles in U.K., U.S., Australia, Nigeria, South Africa, and India are probably quite different.

In such a tool it would be useful to see the most popular pages in all the projects; in such a case, the 100 most popular pages in Moldova will probably include articles in Romanian, Russian, and English Wikipedias, and possibly also some Commons, Wiktionary, and Wikisource pages. It would also be useful to filter for both country and language and, for example, see only the most popular English Wikipedia articles in Moldova.

It probably makes the most sense to integrate this into the existing Pageviews tool, and add a new tab to the current Langviews, Topviews, Siteviews, etc. However, it may also make sense to set it up elsewhere, for example in Turnilo, Superset, or some other platform.

I once raised this at the Analytics mailing list: https://lists.wikimedia.org/pipermail/analytics/2018-July/006385.html . The query suggested in that thread by @fdans works, but it's slowish, and there's no fast API for this, as there is for pageviews per project.

There are probably some blocking privacy issues. They should not be a total blocker, however. It's OK to filter out some problematic entries in small countries or languages where personally identifiable information can show up, but it's probably fine to show the top 500 viewed pages in Nigeria (just as an example).

Beyond the general "zeitgeist" curiosity, such a tool will be particularly strategically useful to people who want to develop projects in languages that are spoken by many people, but don't yet have a lot of articles. This is true for many languages of India, for example, where English is the most popular language by far, even though most people there speak other languages.

Details

Subject	Repo	Branch	Lines +/-
Create pageviews 'top-per-country' endpoint with tests	analytics/aqs	master	+208 -2
Fix and optimize Hive query and change field names in properties file for top-per-country job	analytics/refinery	master	+38 -47
Create and configure Oozie job to load data into Cassandra for pageviews 'top-per-country' AQS endpoint	analytics/refinery	master	+325 -0

Related Objects

		Status	Subtype	Assigned	Task
		Resolved		Htriedman	T207171 Have a way to show the most popular pages per country
		Resolved		lexnasser	T263697 Add more popular articles per country data to AQS

Event Timeline

There are a very large number of changes, so older changes are hidden.

Just wanted to follow up to say that I'd love for everyone to take a look at the design doc and make suggestions as you see fit.

I'm also starting my data analysis for this project, which may affect the API design. I'll be sure to report any relevant findings.

Thanks so much!

We talked about doing some data analysis to quantify the issues with privacy and country splits. As we discussed, we need to quantify the identification risk: an article with 1 pageview in "Greenlandic-language Wikipedia" might carry an identification risk of 1/55,000 (55,000 being the population of Greenland), while an article in Malaysian in San Marino might have an identification risk of 1/5 (5 citizens with Malaysian names in San Marino). So it is not the "number of pageviews" that defines the identification risk but rather the "possible population from which these pageviews are drawn".

Adding @Isaac because I think he would be a good person to help explore whether more than a simple bucketization solution might be needed.

Thanks @Nuria -- indeed, I'm highly motivated to find a good solution for this and we had good conversations about similar aspects for the Covid session data project. Just quickly, our solution there had a few parts to address different aspects:

- a blocklist of countries we'd never retain data for -- i.e., highly sensitive countries where an error in logic is deemed more problematic
- a minimum # of unique users (where user is IP+user-agent) -- i.e., try to assert that at least X unique individuals are generating the pageviews to be included in the data
- a maximum % of pageviews we'd release data on -- e.g., don't release data if it covers more than 10% of pageviews from the whole country, to again reduce the risk of a user appearing in the data and being identifiable
- random sampling by day so that heavy users or, e.g., users who always view popular pages don't continue to show up in the data every day (in general, introducing some small amount of randomness is good if you don't need exactness, which doesn't seem necessary for this dataset in my opinion, though I'm happy to be proved wrong)
- exclude people with traces of editing in their sessions -- these accounts can be tied to a given page on a given day via the edit history, so it's best to exclude them as they are at higher risk of re-identification
- exclude mobile app editors -- this is largely a function of the prior piece because the clause we use for detecting edits doesn't cover edits made in the apps
- exclude power users -- i.e., userhashes with greater than X pageviews in a day. This doubles as another form of likely bot removal, protects very heavy users of the project, and also in theory would help reduce the chance of a single user heavily skewing the data.
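
A few of the filters in this list can be sketched in Python as follows. This is only an illustration under stated assumptions: the `PageStat` fields, the `releasable` function, the placeholder blocklist, and the default thresholds are all hypothetical, and "release nothing when the share is exceeded" is just one possible reading of the maximum-% rule. It is not the logic used for the Covid session data.

```python
from dataclasses import dataclass

# Hypothetical placeholder; a real blocklist would hold actual country codes.
BLOCKLIST = {"XX"}

@dataclass
class PageStat:
    country: str
    page: str
    views: int         # pageviews from this country to this page
    unique_users: int  # distinct IP + user-agent pairs

def releasable(stats, country_total_views, min_unique=100, max_share=0.10):
    """Return the rows that pass the blocklist, unique-user and share checks."""
    kept = [s for s in stats
            if s.country not in BLOCKLIST and s.unique_users >= min_unique]
    # Release nothing if the kept rows cover too much of the country's traffic.
    if sum(s.views for s in kept) > max_share * country_total_views:
        return []
    return kept
```

Sampling by day and the editor and power-user exclusions would be applied to the raw sessions before this aggregation step, so they are left out of the sketch.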

I'm not sure how many of these will be relevant / necessary but I'll take a look at Lex's document and give it some thought in the next week or so.

@Isaac thanks so much, those are the kinds of considerations we'd love to apply.

This seems also like a great opportunity to bring in population data into our data pipelines. Some of us have wanted to normalize our data by population for a while (T242621). Partly to get up to parity with Wikistats 1, but now I see that it could also be invaluable in privacy considerations.

Basically:

- Maintain a regular (yearly?) import of country populations
- Use the data to normalize our geographic metrics, so we're displaying something more meaningful than a population map
- Use the data to compute what Nuria describes in T207171#6547043 and make dynamic decisions about what to publish and not publish

Keep in mind that per-population data is not necessarily needed (it will be great to have at some point, but it feels like scope creep in this task).

In the case of the small "Malaysian" bucket of "San Marino", the country population is not helping much, for example. What quantifies that the pool is too small in that case (more or less) is the ratio <# pageviews on Malaysian in San Marino>/<# total pageviews in San Marino>.
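
The ratio described here can be written out as a short Python sketch. The function names and the `min_share` cutoff are assumptions made for illustration; the thread does not fix any particular value.

```python
def language_share(lang_views: int, country_views: int) -> float:
    """Fraction of a country's pageviews that go to one language edition."""
    if country_views <= 0:
        raise ValueError("country_views must be positive")
    return lang_views / country_views

def pool_too_small(lang_views: int, country_views: int, min_share: float = 0.01) -> bool:
    # min_share is an arbitrary illustrative cutoff, not a value from this task.
    return language_share(lang_views, country_views) < min_share
```

With such a check, a handful of pageviews in one language out of a country's monthly total would be flagged without needing any population data.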

One thing that seems to be missed here is that it's not really that important how many articles in the Slovak language people read in Zambia, so if their count is too low and can't be displayed, no one is going to notice it or ask for it. It is interesting what articles people read in Slovak from Hungary, Poland, or Czechia, but if that number is also too low, then this fact is interesting, and the further details, not so much. So, if you just say "this number is too low to be displayed", I don't think that anyone will complain, because that is all the information that the product managers and the editors' community need. I trust the Analytics professionals to define what exactly "too low" means.

So, if you just say "this number is too low to be displayed" , I don't think that anyone will complain

This is actually very useful info, thank you.

@Isaac thanks for sharing these!

I think the following points are most useful:

a blocklist of countries we'd never retain data for -- i.e. highly sensitive countries where an error in logic is deemed more problematic

a minimum # of unique users (where user is IP+user-agent) -- i.e. try to assert that at least X unique individuals are generating the pageviews to be included in the data

Some metrics to report:

For San Marino, a country with <34,000 population
- There were ~150,000 pageviews in October
  - The 10th most viewed article in all of October had ~110 pageviews
  - The 100th most viewed article in all of October had ~30 pageviews
  - The 1000th most viewed article in all of October had ~10 pageviews
- There were ~7,000 pageviews on October 10
  - The 10th most viewed article on October 10 had ~10 pageviews

Given these numbers, I think that granularity should be monthly, and, per the solution Isaac brought up, we could only report articles with above Y(=50?) unique user views rather than limiting reporting to articles that have above X total views. I also think that the top 1000 articles should be reported as long as they all have above Y unique user views.
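
As a rough sketch of this proposal in Python: keep only articles whose unique-user views exceed Y, then report up to the top 1000 by total views. The tuple layout and the function name are assumptions for the example, and Y=50 is the tentative value floated above.

```python
def monthly_top(rows, min_unique=50, limit=1000):
    """rows: iterable of (article, total_views, unique_views) for one country-month."""
    # Threshold on unique user views, not on total views.
    eligible = [r for r in rows if r[2] > min_unique]
    # Rank the surviving articles by total views, highest first.
    eligible.sort(key=lambda r: r[1], reverse=True)
    return eligible[:limit]
```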

I'd love to hear everyone's thoughts on this approach.

I'll have to repeat that San Marino is a very extreme case :)

However, monthly is an OK default, at least as a start. Perhaps, once this is rolling, you could switch some larger countries to weekly.

@Amire80 I definitely agree that San Marino is an edge case. Do you think there are any other metrics that could help gauge what would be the best way to exclude articles (total views vs unique views vs something else)? Or do you just in general prefer one way over the others?

In T207171#6568590, @lexnasser wrote:

@Amire80 I definitely agree that San Marino is an edge case. Do you think there are any other metrics that could help gauge what would be the best way to exclude articles (total views vs unique views vs something else)? Or do you just in general prefer one way over the others?

Can't think of anything special. I'm not a true web analytics expert.

As far as I can see, I am mostly interested in the most popular articles in different languages in every country, for example for these purposes:

Helping people who live in these countries understand which articles should be prioritized for translation. For example, if a Russian Wikipedia article is popular in Moldova, and is not available in the Romanian language, it can be somehow suggested in Content Translation (such a feature doesn't exist yet, but maybe it will appear some day).
If an English Wikipedia article is popular in Nigeria, and it is also available in the local languages of Nigeria (Yoruba, Hausa, Igbo, Fula), the interlanguage links to it can be emphasized for readers in Nigeria (such a feature doesn't exist yet, but maybe it will appear some day).
A Wikimedia chapter in Colombia can organize a workshop for writing articles about famous people from the history of Colombia, and later check how popular they became in each Spanish-speaking country.

(The countries and the languages above are just examples, and lots of other countries and languages could be there.)

Other product managers, strategists, designers, and community members can also have different purposes and usage scenarios.

... Another kind of related scenario that someone has just brought up in the Wikimedia Telegram chat: "Is there anyway to know where's the readers come from for an article?"

That is, to see in which countries the article is popular.

If something like this can be done together with this task, it would be nice, but it's fine to do it separately.

Keep in mind that per-population data is not necessarily needed (it will be great to have at some point, but it feels like scope creep in this task). In the case of the small "Malaysian" bucket of "San Marino", the country population is not helping much, for example. What quantifies that the pool is too small in that case (more or less) is the ratio <# pageviews on Malaysian in San Marino>/<# total pageviews in San Marino>.

Nuria makes a very good point and I would also add that tourists would also greatly complicate interpretation of these numbers (see this list of countries where tourists greatly outnumber citizens).

Given these numbers, I think that granularity should be monthly, and, per the solution Isaac brought up, we could only report articles with above Y(=50?) unique user views rather than limiting reporting to articles that have above X total views. I also think that the top 1000 articles should be reported as long as they all have above Y unique user views.

However, monthly is an OK default, at least as a start. Perhaps, once this is rolling, you could switch some larger countries to weekly.

I'm onboard in general with your approach @lexnasser . I'd love to have a release cadence though that is more fine-grained than monthly because so much happens in a month and it's nice to be able to get a sense of whether articles are spiking in interest and where. Editors who like to be addressing breaking-news-type events too I'm sure would appreciate a daily report. In particular, one option that I could see working well and hopefully still easy to get off the ground:

- A daily release to provide quick information for editors interested in very targeted editing. I suspect that this could even be just a ranking of most popular articles that meet the privacy thresholds without including any raw count data (though being able to include pageview counts / bins obviously provides more nuance and value).
- A monthly (or weekly) release to make sure that as many country-project pairs as possible are included (per your San Marino analysis, it seems a monthly release is necessary if they'll ever be included) and to give actual pageview counts while reducing privacy risks by having the data cover so many days. This could then be used for quantitative analyses by people interested in campaign impact, trends in reader interest, etc.

A daily release to provide quick information for editors interested in very targeted editing. I suspect that this could even be just a ranking of most popular articles that meet the privacy thresholds without including any raw count data

Nice, +1 to this idea

I would implement the daily "top" first and, once that is in place, add the monthly job. Given the very different amounts of data needed for each, a different strategy might be needed for the second one.

mforns subscribed. Oct 22 2020, 5:56 PM

Hey everyone, I spent the last couple of days compiling data for less edge-casey countries that are relatively multilingual (India and Belgium). The metrics are below and my takeaways are at the bottom.

For one day in 2020:

For India, a country with ~1.3B population:

		For all languages:
			The most viewed article had ~370000 pageviews
			The 10th most viewed article had ~28000 pageviews
			The 100th most viewed article had ~6700 pageviews

		For Hindi, spoken natively by ~44% of the Indian population:
			The most viewed Hindi article had ~28000 pageviews
			The 10th most viewed Hindi article had ~4600 pageviews
			The 100th most viewed Hindi article had ~1100 pageviews

		For Bengali (~8%):
			1)   ~3800 pageviews
			10)  ~800 pageviews
			100) ~200 pageviews

		For Marathi (~7%):
			1)   ~17000 pageviews
			10)  ~900 pageviews
			100) ~200 pageviews

		For Telugu (~7%):
			1)   ~5200 pageviews
			10)  ~450 pageviews
			100) ~100 pageviews

		For Kannada (~4%):
			1)   ~23000 pageviews
			10)  ~800 pageviews
			100) ~150 pageviews

		For English (~0%):
			1)   ~37000 pageviews
			10)  ~22000 pageviews
			100) ~6300 pageviews

For Belgium, a country with ~12M population:

		For all languages:
			1)   ~32000 pageviews
			10)  ~7000 pageviews
			100) ~900 pageviews

		For Dutch (~60%):
			1)   ~32000 pageviews
			10)  ~5500 pageviews
			100) ~500 pageviews

		For French (~40%):
			1)   ~25000 pageviews
			10)  ~2000 pageviews
			100) ~400 pageviews

		For German (~0.5%):
			1)   ~2000 pageviews
			10)  ~50 pageviews
			100) Too low to report

		For English (~0%):
			1)   ~17000 pageviews
			10)  ~1000 pageviews
			100) ~200 pageviews

My Takeaways:

- If we want useful data by country by language with day granularity, then the pageview threshold should be on the order of 100(s), which seems relatively low in terms of privacy
- Data by country by language with month granularity with a pageview threshold on the order of 1000(s) seems reasonable in terms of both utility and privacy, especially with bucketing - this could potentially be spun off into a separate API if we want daily granularity for all languages aggregated
- Data by country, not by language, with daily granularity with a pageview threshold on the order of 1000(s) seems reasonable for a wide variety of countries in terms of both utility and privacy, and is the best option for this API implementation
- Buckets of size 100 would likely not compromise the utility of the data

I'd love to hear where everyone agrees/disagrees, or any other suggestions or metric requests.

In T207171#6590596, @lexnasser wrote:

My Takeaways:

If we want useful data by country by language with day granularity, then the pageview threshold should be on the order of 100(s), which seems relatively low in terms of privacy

I agree that a threshold in the 100s of pageviews seems small for privacy.

Data by country by language with month granularity with a pageview threshold on the order of 1000(s) seems reasonable in terms of both utility and privacy, especially with bucketing - this could potentially be spun off into a separate API if we want daily granularity for all languages aggregated

Data by country, not by language, with daily granularity with a pageview threshold on the order of 1000(s) seems reasonable for a wide variety of countries in terms of both utility and privacy, and is the best option for this API implementation

Buckets of size 100 would likely not compromise the utility of the data

If we decide to go for per-project (=per-language) + all-projects, using the same API endpoint with special project value all-projects has been our way to go so far.

I agree that a threshold in the 100s of pageviews seems small for privacy.

I agree if we're delivering raw data and using pageviews as our sole threshold. I'm more open to e.g., 100 pageviews as a minimum threshold if...

- We're bucketing -- e.g., buckets in 100s -- i.e. 100-200 pageviews, 200-300 pageviews, ... 1000-1100 pageviews, ... I'd also be open to thousands buckets (100-1000, 1000-2000, ...) if that's deemed safer. 100 pageviews from a given country to a given article in a day is a pretty high level of traffic for the smaller language editions and I'd like to see us try to include it if possible.
- We're reporting pageviews but using unique # of users as the threshold. This is something that @lexnasser had indicated a willingness to consider. How does this change things? Is this still feasible? I'm much more comfortable with 100 pageviews if I know that's coming from 100 different UA+IPs in a day.
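
The bucketing in 100s described above could look roughly like this in Python. The function name and the `None` return for counts below the minimum are assumptions of this sketch; the uneven thousands variant (100-1000, 1000-2000, ...) is not covered.

```python
def bucket_label(views: int, size: int = 100, minimum: int = 100):
    """Return e.g. '100-200' for 150 views, or None when below the minimum."""
    if views < minimum:
        return None  # too low to be displayed
    low = (views // size) * size  # round down to the start of the bucket
    return f"{low}-{low + size}"
```

For example, `bucket_label(150)` gives `'100-200'` and `bucket_label(1050)` gives `'1000-1100'`, so the exact count is never exposed.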

Data by country by language with month granularity with a pageview threshold on the order of 1000(s)

Yeah, I think I would push for 1000 even with a bucket size of e.g., 100. Seems sufficiently high to preserve privacy given that this number will be aggregated from at least 28 days and that any daily data will hopefully be sufficiently bucketed as to not reveal too much information (where you could e.g. infer the # of pageviews on a day that doesn't show up in the daily reports). I don't have a great way of quantifying this though. In the past, we've used 500 pageviews from over the course of a week numerous times in Research for when to release data.

Data by country, not by language

@lexnasser what's your thinking about how this would look? Is it essentially a list of Wikidata items for a country? Do we have a use-case that this covers or is the hope that e.g., for countries with many languages, topics would bubble up at the country level that don't meet any language-specific thresholds?

The other scenario I have in the back of my mind that I want to see us explicitly consider is e.g., what about Norwegian Wikipedia articles. The raw pageview data is available and it's well known that well >90% of pageviews to Norwegian Wikipedia articles is coming from Norway, so whatever the choices are around daily / monthly thresholds and bucketing, we want to make sure there's enough ambiguity that nobody could reasonably pinpoint pageviews from Norwegian readers outside of Norway. So if pageview data shows 1001 pageviews to an article in a month and our monthly threshold is 1000 and the article shows up in the monthly report, is this problematic? I think it's fair to decide that the fact that that remaining pageview could come from any country means that it's not a breach of privacy, but it's something that should be decided at some point.
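
The arithmetic behind this concern can be sketched as follows. The function name is hypothetical, and the sketch assumes that an article's presence in the monthly report reveals only that its in-country views met the threshold.

```python
def views_outside_upper_bound(total_public_views: int, country_lower_bound: int) -> int:
    """Upper bound on views from all other countries combined.

    total_public_views: the article's count in the public pageview data.
    country_lower_bound: the threshold the article must have met in one country.
    """
    return max(total_public_views - country_lower_bound, 0)
```

In the example above, `views_outside_upper_bound(1001, 1000)` is 1: at most one pageview came from outside Norway, but nothing says which country it came from.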

Awesome analysis @lexnasser!

I was thinking about using the threshold on pageviews vs. on unique readers, as @Isaac suggests.

I see 2 privacy threats:

A person, or a small group of people, are the only ones that speak a given language in a very small country. If they visit Wikipedia (a lot), they could generate enough pageviews to appear in the ranking, and then anyone who knows them could deduce what they read in Wikipedia.

An editor lives in a foreign country with very low Wikipedia traffic. They write an article over the course of one month, making lots of small edits and visiting the article many times (maybe their local friends and relatives do as well). The article makes it into the ranking for that country. By combining the ranking with the public wiki databases, one could associate the edit history of the article with the ranked article and deduce the editor's country.

I believe applying the threshold on unique readers would be safer privacy-wise. However, I had suggested that @lexnasser first study the rankings to determine whether that would be needed, or whether we would be safe just using pageviews for the threshold (which would be easier). Now, seeing @lexnasser's analyses, I'm on the fence.

@lexnasser, would it be possible to repeat the analysis for a smaller country (e.g., Andorra)? Maybe we can find instances of the 2 privacy threats I mentioned, like ranked articles that look unexpected, in out-of-context languages? In any case, it will give us a better idea of what small-country rankings will look like, and help us decide what to threshold on, no?

@JAllemandou Thanks for your thoughts and for the all-projects suggestion! I'm often unaware of those types of existing naming conventions.

@Isaac

I definitely agree that additional privacy features like bucketing and unique view thresholds may allow for that 100-view threshold.

I'm more open to e.g., 100 pageviews as a minimum threshold if... We're bucketing -- e.g., buckets in 100s -- i.e. 100-200 pageviews, 200-300 pageviews, ... 1000-1100 pageviews, ... I'd also be open to thousands buckets (100-1000, 1000-2000, ...) if that's deemed safer. 100 pageviews from a given country to a given article in a day is a pretty high level of traffic for the smaller language editions and I'd like to see us try to include it if possible.

I'm leaning towards buckets in 100s if we set the minimum total/unique view threshold to 100. My concern with 1000-sized buckets is that the API would provide very little insight into granular trends, especially for languages (e.g., the India data, where the vast majority (>90%) of reported pages for Bengali, Marathi, Telugu, and Kannada would fall into that 100-1000 bucket).

reporting pageviews but using unique # of users as the threshold. This is something that lexnasser had indicated a willingness to consider. How does this change things? Is this still feasible? I'm much more comfortable with 100 pageviews if I know that's coming from 100 different UA+IPs in a day.

I think this feature could bridge the privacy gap of using small (~100) buckets. I don't see any reason why this feature is infeasible. Is there a particular reason why you were questioning its feasibility? I fully agree with your sentiment "I'm much more comfortable with 100 pageviews if I know that's coming from 100 different UA+IPs in a day."

Also, even though I can find the UA Map, I'm unsure how to retrieve the IP for a specific pageview in wmf.pageviews_hourly. Do you have any insight into this, or into a different unique ID? Is this what you were referring to regarding infeasibility?

what's your thinking about how this would look? Is it essentially a list of Wikidata items for a country? Do we have a use-case that this covers or is the hope that e.g., for countries with many languages, topics would bubble up at the country level that don't meet any language-specific thresholds?

All that I mean by "by language" is adding a request parameter so that data is returned for a specific language (== project) like:

>>> GET /metrics/pageviews/top-by-article/in/hi.wikipedia/mobile-app/2019/10

{
    "country": "in",             <- India
    "project": "hi.wikipedia",   <- Hindi language
    "access": "mobile-app",
    "year": "2019",
    "month": "10",
    "articles": [
        {
            "article": "Main_Page",
            "project": "hi.wikipedia",
            "views": 12324,
            "rank": 1
        },
        {
            "article": "India",
            "project": "hi.wiktionary",
            "views": 2312,
            "rank": 2
        },
        ...
    ]
}

So, not "by language" would return a list of items like:

>>> GET /metrics/pageviews/top-by-article/in/mobile-app/2019/10/01

{
    "country": "in",
    "access": "mobile-app",
    "year": "2019",
    "month": "10",
    "day": "01",
    "articles": [
        {
            "article": "Main_Page",
            "project": "hi.wikipedia",
            "views": 12324,
            "rank": 1
        },
        {
            "article": "Joker_(2019_film)",
            "project": "en.wikipedia",
            "views": 3290,
            "rank": 2
        },
        ...
    ]
}

I'm differentiating between them because there may be different privacy-utility tradeoffs for their respective use cases (e.g., language data may be just as useful with monthly granularity, general data may be more useful with daily granularity). I'm open to merging these into a single endpoint if it doesn't compromise utility much.
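
To make the two request shapes above concrete, here is a small Python helper that builds either path. It only mirrors the draft paths proposed in this comment; the function name and parameters are hypothetical, and the endpoint that was eventually implemented may use a different route.

```python
def top_by_article_path(country, access, year, month, day=None, project=None):
    """Build the draft request path for the per-country top-articles proposal."""
    parts = ["/metrics/pageviews/top-by-article", country]
    if project is not None:
        parts.append(project)      # "by language" variant, e.g. hi.wikipedia
    parts += [access, f"{year:04d}", f"{month:02d}"]
    if day is not None:
        parts.append(f"{day:02d}")  # daily granularity
    return "/".join(parts)
```

Calling it with `project="hi.wikipedia"` and no `day` reproduces the monthly by-language request, while passing `day=1` without a project reproduces the daily all-languages request.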

I think that clarification should address your remarks, but let me know if I'm missing something.

what about Norwegian Wikipedia articles... So if pageview data shows 1001 pageviews to an article in a month and our monthly threshold is 1000 and the article shows up in the monthly report, is this problematic? I think it's fair to decide that the fact that that remaining pageview could come from any country means that it's not a breach of privacy, but it's something that should be decided at some point.

For your example in particular, I don't think that should be a problem just because it could come from any country as you mentioned, but I'm open to reconsidering if you disagree.

Awesome analysis lexnasser!

Thanks!

I see 2 privacy threats:...

I agree with your assessment of the main privacy threats and that a unique pageview threshold could thwart those.

would it be possible to repeat the analysis for a smaller country (e.g., Andorra)? Maybe we can find instances of the 2 privacy threats I mentioned, like ranked articles that look unexpected, in out-of-context languages? In any case, it will give us a better idea of what small-country rankings will look like, and help us decide what to threshold on, no?

I'm actually not sure how to compute the unique view count (UA map + something else?). But here's the non-unique pageview data for Andorra:

For one day in Andorra, a country with ~80k population:

		For all languages:
			1)   ~300 pageviews
			10)  <50 pageviews
			100) too low to report

		For Catalan (~40%):
			1)   ~180 pageviews
			10)  <25 pageviews
			100) too low to report

		For Spanish (~40%):
			1)   ~300 pageviews
			10)  <25 pageviews
			100) too low to report

		For Portuguese (~0.5%):
			1)   too low to report
			10)  too low to report
			100) too low to report

		For English (~0%):
			1)   <50 pageviews
			10)  too low to report
			100) too low to report

@lexnasser Thanks for the analysis of Andorra's traffic!
In this case, no article would make it to the ranking because of the threshold (1000 pageviews).
So, no possibility of an unexpected privacy-sensitive article being featured.
I'm curious about Belgium's ranking then, to see if we can find something odd there.
Maybe we can pair on this @lexnasser?

• Nuria mentioned this in T267283: Evaluate a differentially private solution to release wikipedia's project-title-country data. Nov 5 2020, 2:04 AM

Hey everyone, I just created a table lex.pageview_ranks_with_unique, available on Hive and Superset, that holds the exploratory data (for one day) that I've been analyzing. I created this to make it easier for everyone to examine the data, and see how different thresholds (including unique pageview thresholds) would affect the data returned for different countries.

If you care about a specific use case and want to see how that use case is affected by different thresholds, I'd love for you to try out a few queries and report back your thoughts.

Here's an example query:

SELECT
    row_number() OVER (ORDER BY total_count DESC) AS output_rank,
    country_rank,
    country_code,
    lang,
    page_title,
    country_lang_rank,
    total_count,
    unique_count
FROM lex.pageview_ranks_with_unique
WHERE
    country_code = 'BE'
    AND total_count > 1000
    AND unique_count > 500
ORDER BY total_count DESC
LIMIT 1000;

Each record (row) would represent a single country-language-page combination.
In this above query:

output_rank represents the rank of that record among the records that would be returned, excluding pages that would not be returned due to pageview thresholds
country_rank represents the rank of that record among all the pages for that country, including pages that would not be returned due to pageview thresholds
country_lang_rank represents the rank of that record among all the pages for that country in that language (e.g. If the entry had lang = en, country_code = BE, and country_lang_rank = 47, then that would mean that that page was the 47th most-viewed English page in Belgium for that day)
total_count represents the total number of pageviews for that record
unique_count represents the number of unique pageviews for that record

As such, this query would represent data that would be returned for the country Belgium, with a total pageview minimum threshold of 1000 pageviews and a unique pageview minimum threshold of 500 pageviews. This query in particular retrieves 87 records that match those criteria from the top 1000 most-viewed articles on that day in Belgium.
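
For readers without Hive access, the same filtering and ranking logic can be approximated in plain Python. This is a sketch only: it assumes the rows have already been loaded as dictionaries with the column names used in the query, and the function name is made up for the example.

```python
def thresholded_ranking(rows, country="BE", min_total=1000, min_unique=500, limit=1000):
    """Mimic the example query: filter by thresholds, rank by total_count."""
    kept = [r for r in rows
            if r["country_code"] == country
            and r["total_count"] > min_total
            and r["unique_count"] > min_unique]
    kept.sort(key=lambda r: r["total_count"], reverse=True)
    # output_rank counts only the rows that survive the thresholds.
    return [dict(r, output_rank=i) for i, r in enumerate(kept[:limit], start=1)]
```

Changing `min_total` and `min_unique` shows how many records drop out of the ranking for a given country under different thresholds.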

TedTed subscribed. Nov 5 2020, 10:08 PM

Looks great @lexnasser!
I feel the threshold on unique actors is working well. I was surprised to see that the number of unique actors on an article is typically close to the number of pageviews.
So, maybe we should consider only using the unique actors threshold, because it sort of supersedes the pageview threshold, no? An article with 500 unique actors will at least have 500 pageviews.
And even if 500 pageviews is lower than we initially thought the pageview threshold would be, knowing the article has 500 unique actors, I think we're fine.
Thoughts, anyone?

• Nuria mentioned this in T267454: Get list of most viewed articles by viewers from specific country. Nov 7 2020, 3:36 AM

Aklapper merged a task: T267454: Get list of most viewed articles by viewers from specific country. Nov 7 2020, 8:14 AM

Aklapper added subscribers: Mike.Khoroshun, Base.

Ata subscribed. Nov 8 2020, 10:52 AM

Using unique actors as the main threshold metric seems like a nice idea. As @Milimetric pointed out the other day, there is an attack vector that involves adding fake users. A malicious attacker would be able to find out whether a page got a single viewer by faking 499 calls to that page with different UAs, if the threshold is 500 views, for instance. I wonder how we could think of mitigating that - buckets seem to do it.
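
The attack can be illustrated with a toy Python model. It assumes a page is published exactly when its unique-actor count reaches the threshold, and that each faked user agent counts as one extra actor; both function names are hypothetical.

```python
THRESHOLD = 500  # unique-actor threshold used as the example in this thread

def is_published(real_unique: int, fake_unique: int) -> bool:
    """A page appears in the ranking once enough distinct actors viewed it."""
    return real_unique + fake_unique >= THRESHOLD

def attacker_learns_someone_viewed(real_unique: int) -> bool:
    # The attacker injects THRESHOLD - 1 fake actors and checks the ranking:
    # the page shows up only if at least one real person viewed it.
    return is_published(real_unique, THRESHOLD - 1)
```

The page's presence or absence in the output then acts as a one-bit signal about a single real viewer, which is why exact thresholds on their own are fragile.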

Thanks for making the table @lexnasser! A few thoughts below:

Total Pageviews vs. Unique Pageviews

I checked on unique vs. total pageviews just to verify that it makes a difference and it does. I did a query to check for pages where a few users were generating many pageviews. For example, there were 55 pages where >1000 pageviews were generated from <50 unique devices and another 55 where >1000 pageviews were generated from <100 unique devices. Personally, I don't see any value to having both a total pageview and unique pageview threshold. I can't think of a privacy use-case where we'd e.g., be worried if there were 1000 pageviews but they all came from unique devices as opposed to 1000 pageviews that came from only 500 devices. So I'd just drop it to a threshold based on unique pageviews.

Bucketing for daily data

A malicious attacker would be able to find out whether a page got a single viewer by faking 499 calls to that page with different UAs, if the page has 500 views for instance. I wonder how we could mitigate that; buckets seem to do it.

Agreed but I'd probably go further -- I don't think there's any strong use-case that requires exact counts, especially in daily data. The long-term differential privacy approach will greatly help too with this but I don't think we're close yet to that. I frankly feel like until we have a better understanding of the privacy risks that we should just opt for the ranking of the pages and not actual pageview data for the daily reports. Maybe pilot it with just rankings and get feedback on whether it's useful enough that way? People can always cross-reference with the public pageview data to get a sense of orders of magnitude. That said, I would love to see some level of bucketing for the monthly reports though because it would seem to me that it'd be far more difficult to manipulate monthly data in a way that reveals information about a single user.
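To make the bucketing idea concrete, here is a rough sketch. The bucket size of 100 and the choice to round up are assumptions for illustration only, not something decided in this discussion, and `bucket_ceil` is a hypothetical helper name.

```python
def bucket_ceil(count, bucket_size=100):
    """Round a non-negative count up to the next multiple of bucket_size."""
    if count < 0:
        raise ValueError("count must be non-negative")
    # Integer ceiling division avoids floating-point rounding surprises.
    return -(-count // bucket_size) * bucket_size


# Counts that differ by a handful of views become indistinguishable.
for exact in (500, 501, 599, 600, 601):
    print(exact, "->", bucket_ceil(exact))
```

With these assumptions, 501 and 599 are both reported as 600, so a reader of the published data cannot tell how many views were added within a bucket.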

Miscellaneous

I assume you'll want to retain only pages in article namespace. I think I lean towards retaining the Main Page articles but you could also reasonably remove them (that filter could be automatically generated by doing an anti join on page_id + wiki_db where item_id = Q5296 in item_page_link)
It looks like you're currently grouping by page_title for the counts (I queried for Nigeria and saw both Special:Search and Ihü_kárírí:Search for Igbo Wikipedia). I'd recommend grouping by page_id instead (which will take redirects into account). Unfortunately there's no great way to get the current canonical title in bulk because the page table in Hive will generally be at least several days behind. So maybe the best fix is to re-add page_title by grabbing the most common page_title associated with each page_id. Others might have ideas too (or push back on using page_id to group instead of page_title).
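A minimal sketch of the "most common page_title per page_id" idea, in Python rather than Hive. The rows, page IDs, and titles are hypothetical, and "most common" is interpreted here as "received the most views", which is an assumption.

```python
from collections import Counter, defaultdict

# Hypothetical (page_id, page_title, views) rows; values are illustrative only.
rows = [
    (1001, "Old_Example_Title", 300),
    (1001, "New_Example_Title", 900),
    (2002, "Another_Example", 1500),
]


def canonical_titles(rows):
    """For each page_id, pick the title that accounts for the most views."""
    per_id = defaultdict(Counter)
    for page_id, title, views in rows:
        per_id[page_id][title] += views
    return {pid: titles.most_common(1)[0][0] for pid, titles in per_id.items()}


print(canonical_titles(rows))
```

Every page_id gets exactly one title this way, at the cost of occasionally picking an outdated title shortly after a page move.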

And regarding unique pageview threshold, I threw together this table of # of pages (and unique languages/projects) that would be on the list for each country for 100, 500, and 1000 unique pageviews. Example query for k = 1000 and data for all below. My takeaway is that unless we have a strong reason for k=1000, I'd push for k=500 or k=100 given that many countries are included and go from maybe 1-2 pages to 10 or more as you push k lower, which feels like a sizable jump in value for these countries. In general, I'd argue for pushing the k value as low as we feel comfortable so that more countries can be included (and would drop bucketed pageview data if that makes us feel comfortable with pushing the k lower).

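-- Example query (k = 1000):
-- counts qualifying pages and distinct languages per country
SELECT
  country_code,
  COUNT(1) AS num_pages,
  COUNT(DISTINCT(lang)) AS num_projects
FROM lex.pageview_ranks_with_unique
WHERE
  unique_count > 1000  -- unique pageview threshold k
GROUP BY
  country_code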
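
The same aggregation can be sketched in plain Python to show how the per-country numbers change with k. The rows here are hypothetical (they are not taken from lex.pageview_ranks_with_unique), and `pages_and_projects` is an illustrative helper name.

```python
from collections import defaultdict

# Hypothetical (country_code, lang, unique_count) rows; not real data.
rows = [
    ("BE", "nl", 1500),
    ("BE", "fr", 800),
    ("BE", "en", 120),
    ("GL", "da", 150),
]


def pages_and_projects(rows, k):
    """Per country: (number of rows, number of distinct langs) with unique_count > k."""
    pages = defaultdict(int)
    langs = defaultdict(set)
    for country, lang, unique_count in rows:
        if unique_count > k:
            pages[country] += 1
            langs[country].add(lang)
    return {country: (pages[country], len(langs[country])) for country in pages}


for k in (100, 500, 1000):
    print(k, pages_and_projects(rows, k))
```

Countries with no qualifying rows simply drop out of the result, which is how a country ends up with no list at all at a given k.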

Countries always excluded (if k>100): Antarctica; Cocos (Keeling) Islands; Cook Islands; Christmas Island; Western Sahara; Eritrea; British Indian Ocean Territory; Korea (Democratic People's Republic of); Marshall Islands; Montserrat; Norfolk Island; Nauru; Niue; Saint Helena, Ascension and Tristan da Cunha; Svalbard and Jan Mayen; Tokelau; Tuvalu; Wallis and Futuna; Antigua and Barbuda; Anguilla; American Samoa; Saint Barthélemy; Bonaire, Sint Eustatius and Saba; Central African Republic; Cabo Verde; Dominica; Falkland Islands (Malvinas); Micronesia (Federated States of); Grenada; Equatorial Guinea; Guinea-Bissau; Kiribati; Comoros; Saint Kitts and Nevis; Saint Martin (French part); Northern Mariana Islands; Saint Pierre and Miquelon; Palau; Solomon Islands; Seychelles; Sao Tome and Principe; Sint Maarten (Dutch part); Turks and Caicos Islands; Chad; Tonga; Holy See; Saint Vincent and the Grenadines; Virgin Islands (British); Vanuatu; Samoa; Mayotte; Saint Lucia

country	# pageviews	num_pages (k=100)	num_projects (k=100)	num_pages (k=500)	num_projects (k=500)	num_pages (k=1000)	num_projects (k=1000)
Greenland	1K - 10K	1	1	0	0	0	0
Gambia	1K - 10K	1	1	0	0	0	0
Lesotho	1K - 10K	1	1	0	0	0	0
San Marino	1K - 10K	1	1	0	0	0	0
Andorra	10K - 100K	2	2	0	0	0	0
Afghanistan	10K - 100K	11	2	2	2	1	1
Aruba	10K - 100K	1	1	0	0	0	0
Åland Islands	10K - 100K	1	1	0	0	0	0
Barbados	10K - 100K	6	1	0	0	0	0
Burkina Faso	10K - 100K	6	2	1	1	0	0
Burundi	10K - 100K	2	2	0	0	0	0
Benin	10K - 100K	10	2	1	1	0	0
Bermuda	10K - 100K	2	1	1	1	0	0
Brunei Darussalam	10K - 100K	4	1	0	0	0	0
Bahamas	10K - 100K	2	1	0	0	0	0
Bhutan	10K - 100K	1	1	0	0	0	0
Botswana	10K - 100K	2	1	1	1	0	0
Belize	10K - 100K	1	1	0	0	0	0
Congo	10K - 100K	1	1	0	0	0	0
Curaçao	10K - 100K	1	1	0	0	0	0
Djibouti	10K - 100K	2	2	0	0	0	0
Fiji	10K - 100K	3	1	0	0	0	0
Faroe Islands	10K - 100K	2	2	0	0	0	0
Gabon	10K - 100K	4	1	0	0	0	0
French Guiana	10K - 100K	1	1	0	0	0	0
Guernsey	10K - 100K	2	1	0	0	0	0
Gibraltar	10K - 100K	1	1	0	0	0	0
Guinea	10K - 100K	10	3	1	1	0	0
Guadeloupe	10K - 100K	3	1	1	1	0	0
Guam	10K - 100K	1	1	0	0	0	0
Guyana	10K - 100K	4	1	1	1	0	0
Isle of Man	10K - 100K	2	1	0	0	0	0
Jersey	10K - 100K	3	1	1	1	0	0
Cayman Islands	10K - 100K	1	1	0	0	0	0
Lao People's Democratic Republic	10K - 100K	5	2	1	1	0	0
Liechtenstein	10K - 100K	1	1	0	0	0	0
Liberia	10K - 100K	5	1	0	0	0	0
Monaco	10K - 100K	2	2	0	0	0	0
Madagascar	10K - 100K	3	2	0	0	0	0
Mali	10K - 100K	6	2	1	1	0	0
Martinique	10K - 100K	1	1	1	1	0	0
Mauritania	10K - 100K	5	2	0	0	0	0
Mauritius	10K - 100K	6	2	1	1	0	0
Maldives	10K - 100K	3	1	1	1	0	0
Malawi	10K - 100K	3	1	0	0	0	0
Mozambique	10K - 100K	11	2	0	0	0	0
Namibia	10K - 100K	2	1	1	1	0	0
New Caledonia	10K - 100K	4	1	0	0	0	0
Niger	10K - 100K	3	2	0	0	0	0
French Polynesia	10K - 100K	1	1	0	0	0	0
Papua New Guinea	10K - 100K	2	1	0	0	0	0
Rwanda	10K - 100K	2	1	0	0	0	0
Sierra Leone	10K - 100K	3	1	1	1	0	0
Somalia	10K - 100K	7	2	1	1	0	0
Suriname	10K - 100K	1	1	0	0	0	0
South Sudan	10K - 100K	2	1	0	0	0	0
Eswatini	10K - 100K	1	1	0	0	0	0
Togo	10K - 100K	6	2	0	0	0	0
Tajikistan	10K - 100K	14	3	1	1	0	0
Timor-Leste	10K - 100K	1	1	0	0	0	0
Turkmenistan	10K - 100K	4	3	0	0	0	0
Virgin Islands (U.S.)	10K - 100K	1	1	0	0	0	0
?	10K - 100K	4	2	0	0	0	0
Zambia	10K - 100K	8	2	2	1	1	1
Zimbabwe	10K - 100K	2	1	0	0	0	0
None	100K - 1M	59	15	4	3	2	1
Albania	100K - 1M	196	5	13	2	6	2
Armenia	100K - 1M	373	6	51	3	17	3
Angola	100K - 1M	7	2	3	2	2	2
Azerbaijan	100K - 1M	271	5	28	4	5	2
Bosnia and Herzegovina	100K - 1M	65	5	3	3	0	0
Bahrain	100K - 1M	56	4	7	2	2	1
Bolivia (Plurinational State of)	100K - 1M	119	2	10	1	4	1
Congo, Democratic Republic of the	100K - 1M	26	3	5	2	2	1
Côte d'Ivoire	100K - 1M	43	3	3	2	1	1
Cameroon	100K - 1M	16	3	3	2	1	1
China	100K - 1M	21	4	3	2	2	1
Costa Rica	100K - 1M	138	2	13	2	5	2
Cuba	100K - 1M	8	2	2	1	2	1
Cyprus	100K - 1M	32	3	4	2	0	0
Dominican Republic	100K - 1M	175	2	8	2	3	2
Algeria	100K - 1M	172	5	24	4	12	4
Estonia	100K - 1M	91	3	8	3	2	2
Ethiopia	100K - 1M	22	3	2	1	2	1
Georgia	100K - 1M	228	5	9	3	3	2
Ghana	100K - 1M	82	2	8	1	3	1
Guatemala	100K - 1M	233	2	14	2	4	1
Honduras	100K - 1M	93	2	11	2	4	1
Croatia	100K - 1M	262	7	14	2	5	2
Haiti	100K - 1M	17	2	1	1	0	0
Iraq	100K - 1M	214	5	25	3	11	3
Iceland	100K - 1M	12	2	1	1	1	1
Jamaica	100K - 1M	7	1	1	1	1	1
Jordan	100K - 1M	103	2	11	2	4	1
Kenya	100K - 1M	93	4	6	3	3	2
Kyrgyzstan	100K - 1M	144	3	6	2	4	2
Cambodia	100K - 1M	39	6	3	1	1	1
Kuwait	100K - 1M	340	4	114	2	61	2
Lebanon	100K - 1M	71	3	6	2	3	1
Sri Lanka	100K - 1M	81	5	7	2	4	2
Lithuania	100K - 1M	244	5	7	3	2	2
Luxembourg	100K - 1M	13	6	3	3	3	3
Latvia	100K - 1M	55	3	4	3	3	3
Libya	100K - 1M	19	2	1	1	1	1
Moldova, Republic of	100K - 1M	75	3	2	1	0	0
Montenegro	100K - 1M	10	4	0	0	0	0
North Macedonia	100K - 1M	31	3	2	1	0	0
Myanmar	100K - 1M	29	3	5	2	3	2
Mongolia	100K - 1M	22	2	1	1	0	0
Macao	100K - 1M	18	2	1	1	1	1
Malta	100K - 1M	10	2	2	1	0	0
Nicaragua	100K - 1M	71	2	3	1	2	1
Nepal	100K - 1M	88	2	4	1	2	1
Oman	100K - 1M	80	3	15	2	6	1
Panama	100K - 1M	169	2	10	2	3	1
Puerto Rico	100K - 1M	53	3	7	3	2	2
Palestine, State of	100K - 1M	22	2	2	1	1	1
Paraguay	100K - 1M	115	3	13	1	5	1
Qatar	100K - 1M	87	2	11	2	3	2
Réunion	100K - 1M	4	2	1	1	1	1
Sudan	100K - 1M	33	5	6	2	3	2
Slovenia	100K - 1M	50	6	3	3	2	2
Slovakia	100K - 1M	247	8	18	3	10	3
Senegal	100K - 1M	14	3	2	2	1	1
El Salvador	100K - 1M	90	2	6	1	2	1
Syrian Arab Republic	100K - 1M	22	2	2	1	1	1
Tunisia	100K - 1M	91	4	8	3	2	2
Trinidad and Tobago	100K - 1M	25	1	1	1	1	1
Tanzania, United Republic of	100K - 1M	28	4	2	1	2	1
Uganda	100K - 1M	9	2	2	1	2	1
Uruguay	100K - 1M	142	2	10	1	5	1
Uzbekistan	100K - 1M	54	3	5	2	2	1
Venezuela (Bolivarian Republic of)	100K - 1M	327	3	25	2	7	1
Yemen	100K - 1M	19	2	2	2	0	0
United Arab Emirates	1M - 10M	381	9	37	4	13	2
Argentina	1M - 10M	4833	6	341	3	94	2
Austria	1M - 10M	1088	11	84	2	27	2
Australia	1M - 10M	6146	21	461	6	159	5
Bangladesh	1M - 10M	470	16	31	3	16	2
Belgium	1M - 10M	1130	12	148	7	68	4
Bulgaria	1M - 10M	657	6	29	3	11	2
Belarus	1M - 10M	378	4	26	2	10	1
Switzerland	1M - 10M	759	14	50	7	22	6
Chile	1M - 10M	1646	5	94	2	29	2
Colombia	1M - 10M	4092	8	254	2	71	2
Czechia	1M - 10M	1736	11	101	5	25	4
Denmark	1M - 10M	533	12	36	3	14	3
Ecuador	1M - 10M	1157	3	78	2	17	1
Egypt	1M - 10M	742	9	50	3	21	3
Spain	1M - 10M	8503	21	700	11	209	9
Finland	1M - 10M	1407	9	94	3	36	2
Greece	1M - 10M	1129	9	74	3	20	2
Hong Kong	1M - 10M	1865	12	140	4	45	3
Hungary	1M - 10M	1314	6	80	3	26	2
Ireland	1M - 10M	936	26	89	12	42	8
Israel	1M - 10M	2706	11	214	4	95	4
Iran (Islamic Republic of)	1M - 10M	6188	11	576	3	146	2
Korea, Republic of	1M - 10M	1451	14	162	6	74	3
Kazakhstan	1M - 10M	3165	5	294	3	86	2
Morocco	1M - 10M	427	6	32	4	15	3
Malaysia	1M - 10M	1418	8	99	3	30	3
Nigeria	1M - 10M	755	9	67	5	24	3
Netherlands	1M - 10M	3548	25	271	9	92	7
Norway	1M - 10M	628	12	46	7	16	2
New Zealand	1M - 10M	328	3	31	1	10	1
Peru	1M - 10M	1542	4	79	3	24	3
Philippines	1M - 10M	5280	16	449	6	137	4
Pakistan	1M - 10M	825	9	62	1	20	1
Poland	1M - 10M	9019	17	676	6	179	3
Portugal	1M - 10M	653	7	38	2	11	2
Romania	1M - 10M	1634	58	112	24	23	4
Serbia	1M - 10M	487	9	35	2	13	2
Saudi Arabia	1M - 10M	2494	7	277	2	113	2
Sweden	1M - 10M	2847	18	172	5	50	4
Singapore	1M - 10M	698	14	44	3	17	3
Thailand	1M - 10M	2551	12	190	2	52	2
Turkey	1M - 10M	3470	16	307	5	109	3
Taiwan, Province of China	1M - 10M	5722	10	615	3	212	2
Ukraine	1M - 10M	5378	12	315	3	68	3
Viet Nam	1M - 10M	2756	13	180	5	53	2
South Africa	1M - 10M	988	16	51	6	16	3
Brazil	>10M	13803	13	1098	7	271	3
Canada	>10M	11539	21	893	12	294	6
Germany	>10M	31266	43	2897	17	881	11
France	>10M	20126	25	2006	11	663	8
United Kingdom of Great Britain and Northern Ireland	>10M	30751	35	3485	15	1102	11
Indonesia	>10M	11536	21	1843	6	592	3
India	>10M	34491	37	4301	27	1495	18
Italy	>10M	19577	18	1896	8	619	7
Japan	>10M	47674	21	5680	8	1900	5
Mexico	>10M	14460	14	2092	5	788	3
Russian Federation	>10M	22880	22	2085	11	570	7
United States of America	>100M	133217	101	23325	27	9196	22

Thanks for everyone's input!

Copying Isaac's style:

Total Pageviews vs. Unique Pageviews

@mforns @JAllemandou @Isaac

It seems like all of you think that a unique pageviews (actors) threshold would suffice without an overall pageview threshold, and I agree!

I'd love to hear if anyone has any doubts about it.

@Isaac thanks for your extensive analysis of this data. Looking at all the data you posted, I agree with you about "pushing the k value as low as we feel comfortable so that more countries can be included."

Bucketing for daily data

@JAllemandou

Thanks for bringing up that threat - I agree that bucketing could address that.

@Isaac

I also don't think there're any strong use-cases that require exact counts.

However, while I do think that ranks without pageview data would definitely eliminate some threats altogether, I'd think that buckets would provide substantially more utility while curbing those threats.

This is a subject that I'd love to dive deeper into, though. I'll be looping in a privacy engineer very soon, and I'm interested in hearing their opinion regarding the breadth of these privacy concerns.

Miscellaneous

@Isaac

I'm leaning towards retaining the Main Page articles just because all the other endpoints do the same.

The issue you brought up about page_id vs page_title is super interesting, I hadn't thought about that. Forgive my lack of context, but how does the wmf.pageview_hourly table do it, or does that not apply here? I'll definitely look into your suggested approach, though.

Again, thanks everyone for your input, and I'd love to hear your opinions about bucketed data vs ranks only, and some of these miscellaneous topics.

I'll be looping in a privacy engineer very soon, and I'm interested in hearing their opinion regarding the breadth of these privacy concerns.

Excellent - glad to hear!

I'm leaning towards retaining the Main Page articles just because all the other endpoints do the same.

Makes sense -- consistency is good and people can always filter out after-the-fact

The issue you brought up about page_id vs page_title is super interesting, I hadn't thought about that. Forgive my lack of context, but how does the wmf.pageview_hourly table do it, or does that not apply here? I'll definitely look into your suggested approach, though.

Yeah, it's complicated :( The data all is derived from wmf.webrequest, where there is a page_id field that reflects the page shown to the reader (i.e. after resolving redirects) and page_title field that reflects the page title the reader requested. So if a reader clicked on a link to Chicago, Illinois and got redirected to Chicago, then the page_title would be "Chicago, Illinois" but the page_id would be the page_id from the Chicago article (6886). The wmf.pageview_hourly table carries both the page_id and page_title fields from webrequests so if you group on page_title, you're keeping redirects separated out, and if you group on page_id then you're automatically resolving redirects. It's easy to see the breakdown (and why this matters) using the redirects part of the pageviews tool. Here's the data for the Chicago article on enwiki (all of these titles would be associated with the same page_id in pageview_hourly): https://pageviews.toolforge.org/redirectviews/?project=en.wikipedia.org&platform=all-access&agent=user&range=latest-20&sort=views&direction=1&view=list&page=Chicago
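
As a toy illustration of the difference, here is a Python sketch of the two groupings. The titles and page_id 6886 come from the Chicago example above; the view counts are invented.

```python
from collections import Counter

# (page_title requested, page_id served, views).
# Titles and page_id are from the Chicago example; view counts are invented.
rows = [
    ("Chicago", 6886, 9000),
    ("Chicago, Illinois", 6886, 700),
]

by_title = Counter()
by_id = Counter()
for title, page_id, views in rows:
    by_title[title] += views   # redirects stay separate
    by_id[page_id] += views    # redirects are folded into the target page

print(dict(by_title))
print(dict(by_id))
```

Grouping on page_title yields two entries, while grouping on page_id yields a single entry carrying the combined views.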

The challenge for grouping on page_id is it's nice to have page_title for readability purposes but there's no super straightforward and always up-to-date way of getting the canonical page_title for a given page_id in Hive. Ideally you join on the page table but that would require querying against the MariaDB tables if you wanted current information. What I suggested is a join where you'd pull the most common page_title for any given page_id based on the pageview_hourly data. That should almost always work and at least guarantee you find a title for every page_id. Cases where it breaks down are pages that are moved to a new title but already have a ton of links in Wikipedia to the old titles so many of the pageviews still come to the old title for a while until it's eventually balanced out.

My perspective on page_id at pageview level:

page_id is mostly available in pageviews - not present for mobile-app.
All pageview data is currently using page_title as identifier, also bringing in page_id when feasible. While for content-consumption analysis page_id is simpler as it allows to avoid the redirect complexity, page_title is the actual page identifier for other use-cases I know (mostly editing).
For the sake of consistency, I'd rather continue using page_title as identifier.

When we get the historical redirects problem solved we'll be able to provide a redirect-resolving table across time.

For the sake of consistency, I'd rather continue using page_title as identifier.

Thanks @JAllemandou for these additional details. What you say makes sense and for this dataset I'm more open to using page_title because of the dataset's clear intent to help editors and the fact that the ranking is presumed to be more valuable than the underlying pageview counts (so missing a few pageviews that came from a redirect feels like less of a concern). A few additional thoughts:

It goes against consistency, but another option is page_title for daily and page_id for monthly. This will handle page moves that happen mid-month, provide higher-quality (in my opinion) data for at least one of the datasets, and be far far easier to actually execute (because you can just join against the page table for the canonical title to associate with that page ID)
From a privacy perspective, the one thing I'll note is that preserving redirects can bring with it some implications because there will be page redirects that are only used by e.g., one external site that could have enough interest to make it onto the top articles list while still being so specific as to reveal information about exactly where those pageviews are coming from. Aggregating redirects helps with this because if an article has enough interest that it makes the list, it probably is receiving pageviews from a variety of independent sources.

page_id is mostly available in pageviews - not present for mobile-app.

I was unaware of this -- tangent but why? I assumed the page ID aspect was handled on the server side not the client...

When we get the historical redirects problem solved we'll be able to provide a redirect-resolving table across time.

Looking forward to this!

@JAllemandou @Isaac

Thanks to both of you for going into deeper detail about page_title vs. page_id. I'm also leaning towards using page_title because, in addition to being consistent with the other endpoints, it's simpler to implement as you both mentioned. I'm still open to considering this issue further, especially for the potential monthly data as Isaac brought up.

Also, just wanted to update everyone on the design status. Last week, I met with @JFishback_WMF, who was gracious enough to take on the privacy analysis of the current general design. Once he completes his analysis, there should only be a few more minor design considerations to work out, and then I can start on implementation.

Thanks again to everyone for all your help and input - I really appreciate it!

Once he completes his analysis, there should only be a few more minor design considerations to work out, and then I can start on implementation.

@lexnasser great to hear! and thank you for leading this work and taking all of these points into consideration. I'm excited to see this come to fruition!

Hey everyone, @JFishback_WMF has completed his risk analysis of the working API design, and, from a privacy perspective, everything's a go! Thanks to all of you for bringing up various potential privacy threats - it looks like we covered all the main ones.

Now that the design is all good in terms of privacy, I've updated the design doc, located here. Please take a look and add any thoughts you have.

Once we settle on a design, I'll get started on implementation. Can't wait!

Additional data to hopefully help show the impact that the different unique-actor privacy thresholds would have on which countries would actually be able to benefit from this data (this is based on the data from T207171#6615009). I look at 1000 vs. 500 unique-actor thresholds and how many countries in each continent show up on the resulting list with at least a given number of articles (more than 0, at least 5, at least 10). In general, if a country only has one or two articles on a list, they are Main Page and Special:Search, so not particularly useful data for that region; it looks to me like the list starts becoming useful at around 5 articles or more.

Continent	Total # Countries	>1000 unique actors; >0 articles on list	>500 unique actors; >0 articles on list	>1000 unique actors; >=5 articles on list	>500 unique actors; >=5 articles on list	>1000 unique actors; >=10 articles on list	>500 unique actors; >=10 articles on list
North America	26	14 (54%)	18 (69%)	4 (15%)	10 (38%)	3 (12%)	7 (27%)
Africa	46	20 (43%)	29 (63%)	5 (11%)	10 (22%)	5 (11%)	5 (11%)
Europe	48	33 (69%)	38 (79%)	27 (56%)	29 (60%)	25 (52%)	27 (56%)
Asia	51	41 (80%)	47 (92%)	25 (49%)	34 (67%)	23 (45%)	27 (53%)
Oceania	7	2 (29%)	2 (29%)	2 (29%)	2 (29%)	2 (29%)	2 (29%)
South America	13	10 (77%)	11 (85%)	9 (69%)	10 (77%)	6 (46%)	10 (77%)

Some takeaways for me:

Moving the threshold from 1000 to 500 moves the # of countries with at least 5 articles on their list from 72 to 95 and doubles representation in Africa (from 5 to 10 countries). Also large bumps in Central America / Caribbean and Asia, so clear positive impacts on equity
Moving to 100 unique actor threshold (not shown here) has an even more dramatic effect -- in general, if a country has at least 1 article at the 500 unique actor threshold, it'll have at least 10 on the 100 unique actor threshold.
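
The 72 and 95 totals can be re-derived from the ">=5 articles on list" columns of the table above. This is just a quick arithmetic check in Python; the dictionary is copied by hand from the table.

```python
# Countries with >=5 articles on the list, copied from the table above.
# Each value is (threshold of 1000 unique actors, threshold of 500 unique actors).
at_least_5 = {
    "North America": (4, 10),
    "Africa": (5, 10),
    "Europe": (27, 29),
    "Asia": (25, 34),
    "Oceania": (2, 2),
    "South America": (9, 10),
}

total_1000 = sum(k1000 for k1000, _ in at_least_5.values())
total_500 = sum(k500 for _, k500 in at_least_5.values())
print(total_1000, total_500)  # 72 95
```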

Thibaut120094 subscribed.Dec 8 2020, 5:17 PM

Isaac mentioned this in T270140: Release dataset on top search engine referrers by country, device, and language.Dec 14 2020, 10:36 PM

Sorry for the radio silence! I just finished up final exams, so I'm now freed up to make more progress on this.

@Isaac
Thanks for aggregating this country-inclusion data -- I didn't have any empirical data to consider when deciding the threshold, so this really helps put things into perspective. @JFishback_WMF is currently working on creating a more robust privacy framework to analyze these types of tradeoffs, so we'll be able to make more progress on this threshold decision when that is finished.

Until then, I'll be working on creating the initial Oozie job to load the data, and any decisions made about this threshold should be extremely easy to change. Thanks again for all your help!

Sorry for the radio silence! I just finished up final exams, so I'm now freed up to make more progress on this.

No worries - thanks for working on this!

is currently working on creating a more robust privacy framework to analyze these types of tradeoffs, so we'll be able to make more progress on this threshold decision when that is finished

Great news! @JFishback_WMF if I can provide any more data, let me know.

Until then, I'll be working on creating the initial Oozie job to load the data, and any decisions made about this threshold should be extremely easy to change.

Yep, that makes sense to me.

JAllemandou awarded a token.Dec 22 2020, 7:54 PM

Change 654924 had a related patch set uploaded (by Lex Nasser; owner: Lex Nasser):
[analytics/refinery@master] Create and configure Oozie job to load 'Top Articles by Country Pageviews API' data into Cassandra

https://gerrit.wikimedia.org/r/654924

gerritbot added a project: Patch-For-Review.Jan 7 2021, 8:47 PM

Change 657228 had a related patch set uploaded (by Lex Nasser; owner: Lex Nasser):
[analytics/aqs@master] Create pageviews 'top-per-country' endpoint with tests

https://gerrit.wikimedia.org/r/657228

kzimmerman mentioned this in T273924: Provide a list of 100 most popular articles of Russian and English Wikipedias in terms of page views from Ukraine.Feb 18 2021, 12:14 AM

Change 654924 merged by Joal:
[analytics/refinery@master] Create and configure Oozie job to load data into Cassandra for pageviews 'top-per-country' AQS endpoint

https://gerrit.wikimedia.org/r/654924

Change 668236 had a related patch set uploaded (by Lex Nasser; owner: Lex Nasser):
[analytics/refinery@master] Add double quote when constructing JSON in Hive query and change field names in properties file for top-per-country job

https://gerrit.wikimedia.org/r/668236

Change 668236 merged by Joal:
[analytics/refinery@master] Fix and optimize Hive query and change field names in properties file for top-per-country job

https://gerrit.wikimedia.org/r/668236

Change 657228 merged by jenkins-bot:
[analytics/aqs@master] Create pageviews 'top-per-country' endpoint with tests

https://gerrit.wikimedia.org/r/657228

Maintenance_bot removed a project: Patch-For-Review.Mar 16 2021, 9:10 AM

Mentioned in SAL (#wikimedia-operations) [2021-03-17T13:27:40Z] <otto@deploy1002> Started deploy [analytics/aqs/deploy@3e92346]: deploy aqs as part of train - T207171, T263697

Mentioned in SAL (#wikimedia-analytics) [2021-03-17T13:28:01Z] <ottomata> deploy aqs as part of train - T207171, T263697

Mentioned in SAL (#wikimedia-operations) [2021-03-17T13:31:04Z] <otto@deploy1002> Finished deploy [analytics/aqs/deploy@3e92346]: deploy aqs as part of train - T207171, T263697 (duration: 03m 24s)

The pageviews/top-per-country endpoint is now public! Take a look at the documentation here. You can query data starting from January 1, 2021, and examples can be found here. Note that, at the moment, the endpoint has a stability of 'experimental', meaning that the endpoint can change in incompatible ways at any time, without incrementing the API version. However, I don't expect this to occur.
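
For a rough idea of what a request might look like, here is a small Python sketch that only builds a request URL. The route layout (`/metrics/pageviews/top-per-country/{country}/{access}/{year}/{month}/{day}` under the Wikimedia REST API) and the list of access values are assumptions that should be checked against the linked documentation; `top_per_country_url` is a hypothetical helper, not part of any library.

```python
from datetime import date

# Assumed base route; verify against the endpoint documentation.
BASE = "https://wikimedia.org/api/rest_v1/metrics/pageviews/top-per-country"
# Assumed access values, mirroring the other pageviews endpoints.
ACCESS = {"all-access", "desktop", "mobile-app", "mobile-web"}
# Data is available starting from January 1, 2021 (stated above).
FIRST_DAY = date(2021, 1, 1)


def top_per_country_url(country, day, access="all-access"):
    """Build a request URL for one country (two-letter code) and one day."""
    if access not in ACCESS:
        raise ValueError(f"unknown access method: {access}")
    if day < FIRST_DAY:
        raise ValueError("data starts on 2021-01-01")
    return f"{BASE}/{country.upper()}/{access}/{day:%Y/%m/%d}"


print(top_per_country_url("BE", date(2021, 3, 17)))
```

The response fields are described in the documentation and are left out here rather than guessed.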

Feel free to reach out with any questions at all.

Thanks so much to all of you for your help in designing, developing, testing, and deploying this endpoint, namely @Milimetric, @Amire80, @Isaac, @JFishback_WMF, @JAllemandou, @elukey, @razzi, @Ottomata, @Pchelolo, and @Nuria!

Awesome work @lexnasser :) This new endpoint is worth a blog post IMO :)

Mike.Khoroshun awarded a token.Mar 25 2021, 9:27 AM

Thank you so much @lexnasser! This has been many years in the making and is truly excellent work doing the engineering and bringing everyone together to get it done!

This new endpoint is worth a blog post IMO :)

Very much agree -- right now I'm just going around sharing the wikitech documentation with whoever will listen :)

Piling on the thanks @lexnasser! Our team has been patiently waiting for this and glad you led it through to its completion!

lexnasser closed subtask T263697: Add more popular articles per country data to AQS as Resolved.Mar 25 2021, 5:34 PM

Passing this task over to Francisco to carry out the implementation of this data into WikiStats.

@Htriedman

(Resetting inactive assignee account)

odimitrijevic moved this task from Analytics Query Service to Incoming on the Analytics board.Aug 23 2021, 5:59 PM

odimitrijevic moved this task from Incoming to Analytics Query Service on the Analytics board.Aug 23 2021, 6:08 PM

odimitrijevic moved this task from Analytics Query Service to Wikistats on the Analytics board.Jan 6 2022, 4:26 AM

Maintenance_bot added a project: Data-Engineering.Jan 6 2022, 4:46 AM

odimitrijevic removed a project: Data-Engineering.Jan 6 2022, 4:50 AM

JArguello-WMF edited projects, added Data-Engineering-Icebox; removed Analytics.Jul 4 2022, 6:41 PM

Restricted Application edited projects, added Data-Engineering; removed Data-Engineering-Icebox. · View Herald TranscriptJul 4 2022, 6:41 PM

JArguello-WMF moved this task from Incoming (new tickets) to Ice-box on the Data-Engineering board.Jul 6 2022, 6:12 PM

This data was released. Due to various technical factors, there are three distinct datasets:
https://analytics.wikimedia.org/published/datasets/

I do not think this data is yet part of pageview api but the datasets exist so likely the task can be closed

For the non-DataEng folks like myself, how is that data different from the API described in T207171#6944256? At first glance, it seems the limit of inclusion is lower, 150 global pageviews in the README vs 1000 on the API?

Hi @Strainu! I was the primary person who worked on implementing this data release for the past 18 months and can describe how this data is different from the API.

The total number of global pageviews to be included in the dataset is 150 (vs. 1000 in the API)
This data is split by country and project and page (vs. country and page [across multiple projects] or project and page [with multiple countries])
This dataset uses differential privacy to add a small amount of random noise, in order to hide the contribution of an individual to the dataset and prevent dataset linkage/reidentification attacks — this is the meaning of the "noise type" and "noise scale" parts of the README (vs. no noise added to the API)
We have a minimum release threshold of:
1. 90 pageviews per country-project-page (from 6 Feb 2023 to present)
2. 450 pageviews per country-project-page (from 9 Feb 2017 to 5 Feb 2023)
3. 3500 pageviews per country-project-page (from 1 July 2015 to 8 Feb 2017)
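
To illustrate how noise and a release threshold interact, here is a simplified Python sketch. The date ranges and thresholds are the ones listed above; the use of Laplace noise, the noise scale, and applying the threshold to the noisy count are assumptions for illustration and may not match the actual pipeline (the README's "noise type" and "noise scale" are authoritative).

```python
import random
from datetime import date


def release_threshold(day):
    """Minimum pageviews per country-project-page for a given day (from the list above)."""
    if day >= date(2023, 2, 6):
        return 90
    if day >= date(2017, 2, 9):
        return 450
    if day >= date(2015, 7, 1):
        return 3500
    raise ValueError("no data before 1 July 2015")


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials with mean `scale`."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def noisy_release(true_count, day, scale, rng):
    """Add noise, then publish only if the noisy count clears the threshold (assumed order)."""
    noisy = round(true_count + laplace_noise(scale, rng))
    return noisy if noisy >= release_threshold(day) else None


rng = random.Random(0)  # seeded only so the sketch is repeatable
print(noisy_release(1280, date(2023, 5, 21), scale=5.0, rng=rng))
print(noisy_release(10, date(2023, 5, 21), scale=5.0, rng=rng))
```

Rows far above the threshold come out close to their true value, while rows far below it are not published at all.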

Does that help answer your question? Please feel free to follow-up if you have any other questions.

Thanks Hal! I have a few follow-up questions:

This data is split by country and project and page (vs. country and page [across multiple projects] or project and page [with multiple countries])

I don't quite get this. If I query this URL I thought I get views from Romania drilled-down per project and page (see "FCV_Farul_Constanța", present on both enwiki and rowiki). Is this not true or am I missing the definition of the splits?

When can we expect the data to be included in the API? Is there a task for that?

@Strainu

I don't quite get this. If I query this URL I thought I get views from Romania drilled-down per project and page (see "FCV_Farul_Constanța", present on both enwiki and rowiki). Is this not true or am I missing the definition of the splits?

You are correct about the API data. The advantage of this dataset is that it is significantly more granular and exact, and allows for customized analysis. For example, take a look at the "Example analysis of Romania on 21 May 2023" section of this sample notebook I put together. Rather than only 20 pages with >1,000 pageviews and rounded to multiples of 100, we have 1,280 pages, with nearly exact values.

Further on in the notebook, you can also see some examples that compare spikes in pageviews for Queen Elizabeth's page across languages in the 10 days surrounding her death. This data is more precise, granular, and usable than the existing pageviews API, and it uses differential privacy, which is more private for platform users.

When can we expect the data to be included in the API? Is there a task for that?

I've started working on this, and am waiting for AQS 2.0 to be deployed by WMF's API Platforms team before I move ahead. The old Pageviews API will be deprecated soon, and I'd rather not double my workload. My hope is that it can be done in the next month or so.

Flomeier85 subscribed.Jul 21 2023, 7:39 PM

This comment was removed by Flomeier85.

@Flomeier85 if you have any questions at all feel free to post them here or reach out to me via email at htriedman@wikimedia.org :)

Much appreciated! I just realised it was a dumb question/I made a mistake and deleted it again. No point in wasting anyone's time :)

Htriedman closed this task as Resolved.Oct 31 2023, 11:24 PM

Htriedman claimed this task.

Krinkle moved this task from Backlog to Done on the Tool-Pageviews board.Dec 6 2023, 10:05 PM

