Have a way to show the most popular pages per country
Open, High, Public

Description

The Pageviews tool provides valuable information about the most popular articles in every project.

However, it would also be useful to have information about the most popular articles by country. Countries and languages are very different things. This is especially important for languages that are spoken in many countries, such as English, French, and Spanish: the most popular articles in the U.K., the U.S., Australia, Nigeria, South Africa, and India are probably quite different.

In such a tool it would be useful to see the most popular pages in all the projects; in such a case, the 100 most popular pages in Moldova will probably include articles in Romanian, Russian, and English Wikipedias, and possibly also some Commons, Wiktionary, and Wikisource pages. It would also be useful to filter for both country and language and, for example, see only the most popular English Wikipedia articles in Moldova.

It probably makes the most sense to integrate this into the existing Pageviews tool, and add a new tab to the current Langviews, Topviews, Siteviews, etc. However, it may also make sense to set it up elsewhere, for example in Turnilo, Superset, or some other platform.

I once raised this at the Analytics mailing list: https://lists.wikimedia.org/pipermail/analytics/2018-July/006385.html . The query suggested in that thread by @fdans works, but it's slowish, and there's no fast API for this, as there is for pageviews per project.

There are probably some blocking privacy issues. They should not be a total blocker, however. It's OK to filter out some problematic entries in small countries or languages where personally identifiable information can show up, but it's probably fine to show the top 500 viewed pages in Nigeria (just as an example).

Beyond the general "zeitgeist" curiosity, such a tool will be particularly strategically useful to people who want to develop projects in languages that are spoken by many people, but don't yet have a lot of articles. This is true for many languages of India, for example, where English is the most popular language by far, even though most people there speak other languages.

Event Timeline

Milimetric raised the priority of this task from Low to High. Jun 8 2020, 4:16 PM
Milimetric moved this task from Incoming to Analytics Query Service on the Analytics board.
Milimetric added a subscriber: Milimetric.

From Nuria on priority: we need to finish the API work (Lex's work in T238365) before we do this, but we can grab it as a goal after that.

@Milimetric @Nuria Checking in again about the plans for this?

We are wrapping up the work listed on the ticket above, so it is likely we can get to this during this quarter. To be clear, this data is already available internally; the work we will be doing is to make it available externally as well.

@AMuigai @Amire80: @lexnasser is going to start working on this project this quarter. I propose we use a top 100 of articles per country, where these articles can be in *any* language. Thus you could request "the most popular articles in French in Spain" for the month of 2020-01, and that list might have, say, 5 articles, because the top 100 includes 95 articles in Spanish and 5 in French. Does that make sense?

If you request the top list of articles (regardless of language) for a country the list returned will have 100 items as long as numbers of pageviews are sufficiently large.

Privacy-wise: articles with a very small number of views, and countries with a very small number of views, will not be reported.
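As a minimal sketch of the proposal above (rank across all languages first, cut to the top 100, then filter by language), assuming hypothetical names (`Row`, `top_for_country`, `MIN_VIEWS`) and made-up sample numbers rather than any real API or agreed threshold:

```python
from typing import List, NamedTuple, Optional


class Row(NamedTuple):
    project: str   # e.g. "es.wikipedia"
    article: str
    views: int


# Hypothetical privacy floor; the real value was still under discussion.
MIN_VIEWS = 100


def top_for_country(rows: List[Row], limit: int = 100,
                    project: Optional[str] = None) -> List[Row]:
    """Rank all projects together, cut to `limit`, then filter by project."""
    eligible = [r for r in rows if r.views >= MIN_VIEWS]
    ranked = sorted(eligible, key=lambda r: r.views, reverse=True)[:limit]
    if project is not None:
        # Filtering happens after the cut, so the result can be much
        # shorter than `limit` (or empty) for a minority language.
        ranked = [r for r in ranked if r.project == project]
    return ranked


# Made-up rows for a single country.
rows = [
    Row("es.wikipedia", "España", 5000),
    Row("es.wikipedia", "Madrid", 3000),
    Row("fr.wikipedia", "Paris", 2500),
    Row("fr.wikipedia", "Lyon", 40),   # below the floor, never reported
]

print(top_for_country(rows, limit=3, project="fr.wikipedia"))
```

With a cut of 3 the French list contains one article; with a cut of 2 it would be empty, which is the behaviour clients would need to expect under this design.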

... I propose we use a top 100 of articles per country where these articles can be in *any* language

Maybe someone should do some analysis around how that would look in practice. Because it's possible English Wikipedia, with the majority of the traffic, would dominate a lot of countries. So maybe if we run some numbers we'd get a better sense of what we should output.

Indeed, this is what will happen. In some other countries it will be French (Guinea, Mali), Portuguese (Mozambique, Angola), or Russian (the difference between Romania and Moldova is quite striking). And this is totally fine, and we want to know this!

So both things are useful:

  • The top 100 (or maybe 500?..) pages in all languages.
  • Top 100 in each language.

And of course it's OK if you stop after fewer than 100 if there's a privacy issue or if too few people read in that language.

Top 100 in each language.

To be clear, this is mostly not possible for privacy reasons: we can only release buckets of pageviews that are large enough, and things like "Malaysian top pageviews in San Marino" will not meet that criterion. It is very likely that "top 100 pageviews in San Marino" will not meet it either. So a strategy that tries all languages for all countries is unlikely to succeed: clients would be left with an undetermined number of projects for which pageviews are available per country, and no way to know which projects those are.

In any case to milimetric's point this can be further quantified with data analysis.

Top 100 in each language.

To be clear, this is mostly not possible for privacy reasons: we can only release buckets of pageviews that are large enough, and things like "Malaysian top pageviews in San Marino" will not meet that criterion.

San Marino is a very extreme example :)

Is it possible for Nigeria, Mali, Kenya, India, Philippines, Romania, Kyrgyzstan?

It depends, but (other than Romania) the answer is "mostly yes". Keep in mind that San Marino has ~10,000 pageviews per day with a population of 33,000. Kyrgyzstan has ~300K pageviews with a population of 6 million, so the number of pageviews per person is a lot smaller. If you do very detailed language splits, you can basically infer which Wikipedia pages the inhabitants of Kyrgyzstan who "speak French and have access to the internet" are viewing. That group might be much smaller than the population of San Marino.
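To make the per-person comparison concrete, here is the arithmetic with the approximate figures quoted above. It assumes both pageview counts are daily; the function name is illustrative only:

```python
# Approximate figures quoted in the comment above; both are assumed
# to be pageviews per day.
countries = {
    "San Marino": {"daily_pageviews": 10_000, "population": 33_000},
    "Kyrgyzstan": {"daily_pageviews": 300_000, "population": 6_000_000},
}


def views_per_person(daily_pageviews: int, population: int) -> float:
    """Pageviews per inhabitant per day."""
    return daily_pageviews / population


for name, figures in countries.items():
    print(f"{name}: {views_per_person(**figures):.2f} pageviews per person per day")

# San Marino: 0.30 pageviews per person per day
# Kyrgyzstan: 0.05 pageviews per person per day
```

So a country with 30 times the traffic can still have a much thinner stream of pageviews per person, and any further split by language thins it again.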

In general, you are right that San Marino is an "extreme example". In this API we will probably not publish stats for such small countries, because there is no way to do that without disclosing a lot about the users in the country, but the idea applies to other, larger countries where the issues are not so obvious.

In terms of the privacy considerations for countries with low pageview counts, I found that the most-viewed articles by project endpoint reports articles with very low pageview counts. See: https://wikimedia.org/api/rest_v1/metrics/pageviews/top/kl.wikipedia/all-access/2019/10/01. It reports 374 different articles with a single pageview, and I'd assume that the vast majority of those originate from Greenland, given that it is Greenlandic-language Wikipedia. With the relative similarity between these two projects, I was wondering what led to the decision to not perform those privacy transformations on the data, and if those same reasons would be relevant to this case.

@lexnasser Nice, yes, same considerations apply to your example. That such a low count is available speaks of a bug here: https://github.com/wikimedia/analytics-refinery/blob/master/oozie/cassandra/monthly/pageview_top_articles.hql

speaks of a bug

I disagree :)
The raw data is broadly available per language project on the API or in dumps. Therefore the top endpoint is just a computation over data that is already available.
For country data it is different as it is a dimension we currently provide for pageviews at project level only, and with bucketing.

The raw data is broadly available per language project on the API or in dumps.

You are right. I forgot these files are available per project. My reasoning above breaks down where project (more or less) == country, as in @lexnasser's example, since that data is already public. The example of users in Kyrgyzstan who "speak French and have access to the internet" still stands.

Just finished a first draft of the design doc for this project! You can find it here: https://docs.google.com/document/d/19HbdPvSHPUF9n4thFOlck0dIgvZvg50K3mcL-guoViY/edit?usp=sharing

I made a bunch of comments that I'd love everyone's thoughts on. Feel free to either respond to my comments or make new ones of your own.

I know there's still a bunch of ambiguity on what the API should look like, so I'll get started on some data analysis once my access to the relevant data is re-enabled.

Just wanted to follow up to say that I'd love for everyone to take a look at the design doc and make suggestions as you see fit.

I'm also starting my data analysis for this project, which may affect the API design. I'll be sure to report any relevant findings.

Thanks so much!

Wei talked about doing some data analysis to quantify the issues with privacy and country splits. As we discussed, we need to quantify the identification risk: an article with 1 pageview in Greenlandic-language Wikipedia might carry an identification risk of 1/55,000 (55,000 being the population of Greenland), while an article in Malaysian viewed from San Marino might carry an identification risk of 1/5 (5 citizens with Malaysian names in San Marino). So it is not the number of pageviews that defines the identification risk, but rather the size of the population from which those pageviews are drawn.

Adding @Isaac because I think he is probably a good person to help explore whether more than a simple bucketization solution might be needed.

Thanks @Nuria -- indeed, I'm highly motivated to find a good solution for this and we had good conversations about similar aspects for the Covid session data project. Just quickly, our solution there had a few parts to address different aspects:

  • a blocklist of countries we'd never retain data for -- i.e. highly sensitive countries where an error in logic is deemed more problematic
  • a minimum # of unique users (where user is IP+user-agent) -- i.e. try to assert that at least X unique individuals are generating the pageviews to be included in the data
  • a maximum % of pageviews we'd release data on -- e.g., don't release data if it covers more than 10% of pageviews from the whole country to again reduce the risk of a user appearing in the data and being identifiable
  • random sampling by day so that heavy users or e.g., users who always view popular pages don't continue to show up in the data every day (in general, introducing some small amount of randomness is good if you don't need exactness, which doesn't seem necessary for this dataset in my opinion though I'm happy to be proved wrong)
  • exclude people with traces of editing in their sessions -- these accounts can be tied to a given page on a given day via the edit history so it's best to exclude them as they are at higher risk of re-identification
  • exclude mobile app editors -- this is largely a function of the prior piece because the clause we use for detecting edits doesn't cover edits made in the apps
  • exclude power users -- i.e. userhashes with greater than X pageviews in a day. This doubles as another form of likely bot removal, protects very heavy users of the project, and also in theory would help reduce the chance of a single user heavily skewing the data.
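A rough sketch of how several of these filters could compose, under stated assumptions: every name here (`View`, `releasable`, the constants) is hypothetical, the threshold values are placeholders, one `View` stands for one pageview, and the share cap is read as a cap on the cumulative share of a country's traffic that gets released. Daily random sampling and the separate mobile-app-editor rule are left out.

```python
from collections import Counter, defaultdict
from typing import Dict, List, NamedTuple, Tuple


class View(NamedTuple):
    country: str
    article: str
    user_hash: str   # stand-in for a hash of IP + user agent
    has_edit: bool   # session shows traces of editing


# Placeholder values, not agreed thresholds.
BLOCKLIST = {"XX"}
MIN_UNIQUE_USERS = 3
MAX_VIEWS_PER_USER = 5
MAX_SHARE_OF_COUNTRY = 0.10


def releasable(views: List[View]) -> Dict[Tuple[str, str], int]:
    """Return {(country, article): views} for rows that pass every filter."""
    # Country totals are taken over all traffic, before any filtering.
    country_total = Counter(v.country for v in views)

    # Blocklisted countries and sessions with edits are dropped outright.
    kept = [v for v in views if v.country not in BLOCKLIST and not v.has_edit]

    # Power users: too many pageviews from one user hash in the period.
    per_user = Counter(v.user_hash for v in kept)
    kept = [v for v in kept if per_user[v.user_hash] <= MAX_VIEWS_PER_USER]

    counts = Counter((v.country, v.article) for v in kept)
    users = defaultdict(set)
    for v in kept:
        users[(v.country, v.article)].add(v.user_hash)

    released: Dict[Tuple[str, str], int] = {}
    released_per_country: Counter = Counter()
    for key, n in counts.most_common():   # most viewed first
        country = key[0]
        if len(users[key]) < MIN_UNIQUE_USERS:
            continue
        share = (released_per_country[country] + n) / country_total[country]
        if share > MAX_SHARE_OF_COUNTRY:
            continue
        released[key] = n
        released_per_country[country] += n
    return released


# Toy data: only article "A" in "NG" has enough distinct, non-editing readers.
views = (
    [View("NG", "A", u, False) for u in ("u1", "u2", "u3")]
    + [View("NG", "B", "u4", False)] * 3                        # one reader only
    + [View("NG", f"C{i}", "power", False) for i in range(10)]  # power user
    + [View("NG", f"F{i}", f"f{i}", False) for i in range(20)]  # long tail
    + [View("NG", "A", "editor", True)]                         # editing session
    + [View("XX", "A", u, False) for u in ("x1", "x2", "x3")]   # blocklisted
)
print(releasable(views))
```

The order of the steps is itself a design choice: here power users are removed before unique readers are counted, so a single heavy reader cannot help an article reach the unique-user floor.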

I'm not sure how many of these will be relevant / necessary but I'll take a look at Lex's document and give it some thought in the next week or so.

@Isaac thanks so much, those are the kinds of considerations we'd love to apply.

This seems also like a great opportunity to bring in population data into our data pipelines. Some of us have wanted to normalize our data by population for a while (T242621). Partly to get up to parity with Wikistats 1, but now I see that it could also be invaluable in privacy considerations.

Basically:

  1. Maintain a regular (yearly?) import of country populations
  2. Use the data to normalize our geographic metrics, so we're displaying something more meaningful than a population map
  3. Use the data to compute what Nuria describes in T207171#6547043 and make dynamic decisions about what to publish and not publish

Keep in mind that per-population data is not necessarily needed (it will be great to have at some point, but it feels like scope creep in this task).

In the case of the small "Malaysian" bucket for San Marino, for example, the country population does not help much. What quantifies that the pool is too small in that case is (more or less) <# pageviews in Malaysian in San Marino>/<# total pageviews in San Marino>.

One thing that seems to be missed here is that it's not really that important how many articles in the Slovak language people read in Zambia, so if their count is too low and can't be displayed, no one is going to notice it or ask for it. It is interesting which articles people read in Slovak from Hungary, Poland, or Czechia, but if that number is also too low, then that fact is itself interesting, and the further details are much less so. So, if you just say "this number is too low to be displayed", I don't think that anyone will complain, because that is all the information that the product managers and the editors' community need. I trust the Analytics professionals to define exactly what "too low" means.

So, if you just say "this number is too low to be displayed", I don't think that anyone will complain

This is actually very useful info, thank you.

@Isaac thanks for sharing these!

I think the following points are most useful:

a blocklist of countries we'd never retain data for -- i.e. highly sensitive countries where an error in logic is deemed more problematic

a minimum # of unique users (where user is IP+user-agent) -- i.e. try to assert that at least X unique individuals are generating the pageviews to be included in the data

Some metrics to report:

  • For San Marino, a country with <34,000 population
    • There were ~150,000 pageviews in October
      • The 10th most viewed article in all of October had ~110 pageviews
      • The 100th most viewed article in all of October had ~30 pageviews
      • The 1000th most viewed article in all of October had ~10 pageviews
    • There were ~7,000 pageviews on October 10
      • The 10th most viewed article on October 10 had ~10 pageviews

Given these numbers, I think that granularity should be monthly, and, per the solution Isaac brought up, we could only report articles with above Y(=50?) unique user views rather than limiting reporting to articles that have above X total views. I also think that the top 1000 articles should be reported as long as they all have above Y unique user views.

I'd love to hear everyone's thoughts on this approach.

I'll have to repeat that San Marino is a very extreme case :)

However, monthly is an OK default, at least as a start. Perhaps, once this is rolling, you could switch some larger countries to weekly.

@Amire80 I definitely agree that San Marino is an edge case. Do you think there are any other metrics that could help gauge what would be the best way to exclude articles (total views vs unique views vs something else)? Or do you just in general prefer one way over the others?

Can't think of anything special. I'm not a true web analytics expert.

As far as I can see, I am mostly interested in the most popular articles in different languages in every country, for example for these purposes:

  • Helping people who live in these countries understand which articles should be prioritized for translation. For example, if a Russian Wikipedia article is popular in Moldova, and is not available in the Romanian language, it can be somehow suggested in Content Translation (such a feature doesn't exist yet, but maybe it will appear some day).
  • If an English Wikipedia article is popular in Nigeria, and it is also available in the local languages of Nigeria (Yoruba, Hausa, Igbo, Fula), the interlanguage links to it can be emphasized for readers in Nigeria (such a feature doesn't exist yet, but maybe it will appear some day).
  • A Wikimedia chapter in Colombia can organize a workshop for writing articles about famous people from the history of Colombia, and later check how popular they became in each Spanish-speaking country.

(The countries and the languages above are just examples, and lots of other countries and languages could be there.)

Other product managers, strategists, designers, and community members can also have different purposes and usage scenarios.

... Another kind of related scenario that someone has just brought up in the Wikimedia Telegram chat: "Is there anyway to know where's the readers come from for an article?"

That is, to see in which countries the article is popular.

If something like this can be done together with this task, it would be nice, but it's fine to do it separately.

Keep in mind that per-population data is not necessarily needed (it will be great to have at some point, but it feels like scope creep in this task). In the case of the small "Malaysian" bucket for San Marino, for example, the country population does not help much. What quantifies that the pool is too small in that case is (more or less) <# pageviews in Malaysian in San Marino>/<# total pageviews in San Marino>.

Nuria makes a very good point and I would also add that tourists would also greatly complicate interpretation of these numbers (see this list of countries where tourists greatly outnumber citizens).

Given these numbers, I think that granularity should be monthly, and, per the solution Isaac brought up, we could only report articles with above Y(=50?) unique user views rather than limiting reporting to articles that have above X total views. I also think that the top 1000 articles should be reported as long as they all have above Y unique user views.

However, monthly is an OK default, at least as a start. Perhaps, once this is rolling, you could switch some larger countries to weekly.

I'm on board in general with your approach, @lexnasser. I'd love to have a release cadence that is more fine-grained than monthly, though, because so much happens in a month and it's nice to be able to get a sense of whether articles are spiking in interest and where. Editors who like to address breaking-news-type events would, I'm sure, also appreciate a daily report. In particular, here is one option that I could see working well and that is hopefully still easy to get off the ground:

  • A daily release to provide quick information for editors interested in very targeted editing. I suspect that this could even be just a ranking of most popular articles that meet the privacy thresholds without including any raw count data (though being able to include pageview counts / bins obviously provides more nuance and value).
  • The monthly (or weekly) to make sure that as many country-project pairs as possible are included (per your San Marino analysis, it seems a monthly release is necessary if they'll ever be included) and give actual pageview counts while reducing privacy risks by having the data cover so many days. This then could be used for quantitative analyses by people interested in campaign impact, trends in reader interest, etc.

A daily release to provide quick information for editors interested in very targeted editing. I suspect that this could even be just a ranking of most popular articles that meet the privacy thresholds without including any raw count data

Nice, +1 to this idea

I would implement the daily "top" first and, once that is in place, add the monthly job. Given the very different amounts of data needed for each, a different strategy might be needed for the second one.

Hey everyone, I spent the last couple of days compiling data for less edge-casey countries that are relatively multilingual (India and Belgium). The metrics are below and my takeaways are at the bottom.

For one day in 2020:

For India, a country with ~1.3B population:

		For all languages:
			The most viewed article had ~370000 pageviews
			The 10th most viewed article had ~28000 pageviews
			The 100th most viewed article had ~6700 pageviews

		For Hindi, spoken natively by ~44% of the Indian population:
			The most viewed Hindi article had ~28000 pageviews
			The 10th most viewed Hindi article had ~4600 pageviews
			The 100th most viewed Hindi article had ~1100 pageviews

		For Bengali (~8%):
			1)   ~3800 pageviews
			10)  ~800 pageviews
			100) ~200 pageviews

		For Marathi (~7%):
			1)   ~17000 pageviews
			10)  ~900 pageviews
			100) ~200 pageviews

		For Telugu (~7%):
			1)   ~5200 pageviews
			10)  ~450 pageviews
			100) ~100 pageviews

		For Kannada (~4%):
			1)   ~23000 pageviews
			10)  ~800 pageviews
			100) ~150 pageviews

		For English (~0%):
			1)   ~37000 pageviews
			10)  ~22000 pageviews
			100) ~6300 pageviews

For Belgium, a country with ~12M population:

		For all languages:
			1)   ~32000 pageviews
			10)  ~7000 pageviews
			100) ~900 pageviews

		For Dutch (~60%):
			1)   ~32000 pageviews
			10)  ~5500 pageviews
			100) ~500 pageviews

		For French (~40%):
			1)   ~25000 pageviews
			10)  ~2000 pageviews
			100) ~400 pageviews

		For German (~0.5%):
			1)   ~2000 pageviews
			10)  ~50 pageviews
			100) Too low to report

		For English (~0%):
			1)   ~17000 pageviews
			10)  ~1000 pageviews
			100) ~200 pageviews

My Takeaways:

  • If we want useful data by country by language with day granularity, then the pageview threshold should be on the order of 100(s), which seems relatively low in terms of privacy
  • Data by country by language with month granularity with a pageview threshold on the order of 1000(s) seems reasonable in terms of both utility and privacy, especially with bucketing - this could potentially be spun off into a separate API if we want daily granularity for all languages aggregated
  • Data by country, not by language, with daily granularity with a pageview threshold on the order of 1000(s) seems reasonable for a wide variety of countries in terms of both utility and privacy, and is the best option for this API implementation
  • Buckets of size 100 would likely not compromise the utility of the data

I'd love to hear where everyone agrees/disagrees, or any other suggestions or metric requests.

My Takeaways:

  • If we want useful data by country by language with day granularity, then the pageview threshold should be on the order of 100(s), which seems relatively low in terms of privacy

I agree that a threshold in the 100s of pageviews seems small for privacy.

  • Data by country by language with month granularity with a pageview threshold on the order of 1000(s) seems reasonable in terms of both utility and privacy, especially with bucketing - this could potentially be spun off into a separate API if we want daily granularity for all languages aggregated
  • Data by country, not by language, with daily granularity with a pageview threshold on the order of 1000(s) seems reasonable for a wide variety of countries in terms of both utility and privacy, and is the best option for this API implementation
  • Buckets of size 100 would likely not compromise the utility of the data

If we decide to go for per-project (=per-language) + all-projects, using the same API endpoint with the special project value all-projects has been our way to go so far.

I agree that a threshold in the 100s of pageviews seems small for privacy.

I agree if we're delivering raw data and using pageviews as our sole threshold. I'm more open to e.g., 100 pageviews as a minimum threshold if...

  • We're bucketing -- e.g., buckets in 100s -- i.e. 100-200 pageviews, 200-300 pageviews, ... 1000-1100 pageviews, ... I'd also be open to thousands buckets (100-1000, 1000-2000, ...) if that's deemed safer. 100 pageviews from a given country to a given article in a day is a pretty high level of traffic for the smaller language editions and I'd like to see us try to include it if possible.
  • We're reporting pageviews but using unique # of users as the threshold. This is something that @lexnasser had indicated a willingness to consider. How does this change things? Is this still feasible? I'm much more comfortable with 100 pageviews if I know that's coming from 100 different UA+IPs in a day.
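A small sketch of the bucketing idea, assuming a hypothetical `bucket` helper. The labels here are inclusive and non-overlapping (100-199, 200-299, ...) rather than the "100-200, 200-300" shorthand above, and the defaults are just the example values from this thread:

```python
from typing import Optional


def bucket(views: int, size: int = 100, minimum: int = 100) -> Optional[str]:
    """Return an inclusive range label, or None when below the minimum."""
    if views < minimum:
        return None                      # too low to report
    floor = (views // size) * size       # start of the bucket
    low = max(floor, minimum)            # first bucket starts at the minimum
    high = floor + size - 1
    return f"{low}-{high}"


print(bucket(150))              # 100-199
print(bucket(1050))             # 1000-1099
print(bucket(99))               # None
print(bucket(150, size=1000))   # 100-999
print(bucket(1500, size=1000))  # 1000-1999
```

With buckets of 1000 and a minimum of 100, everything between 100 and 999 collapses into a single label, which is the loss of detail the wider buckets trade for safety.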

Data by country by language with month granularity with a pageview threshold on the order of 1000(s)

Yeah, I think I would push for 1000 even with a bucket size of e.g., 100. Seems sufficiently high to preserve privacy given that this number will be aggregated from at least 28 days and that any daily data will hopefully be sufficiently bucketed as to not reveal too much information (where you could e.g. infer the # of pageviews on a day that doesn't show up in the daily reports). I don't have a great way of quantifying this though. In the past, we've used 500 pageviews from over the course of a week numerous times in Research for when to release data.

Data by country, not by language

@lexnasser what's your thinking about how this would look? Is it essentially a list of Wikidata items for a country? Do we have a use-case that this covers or is the hope that e.g., for countries with many languages, topics would bubble up at the country level that don't meet any language-specific thresholds?

The other scenario I have in the back of my mind, and that I want us to consider explicitly, is e.g. Norwegian Wikipedia articles. The raw pageview data is available, and it's well known that well over 90% of pageviews to Norwegian Wikipedia articles come from Norway. So whatever the choices are around daily / monthly thresholds and bucketing, we want to make sure there's enough ambiguity that nobody could reasonably pinpoint pageviews from Norwegian readers outside of Norway. If pageview data shows 1001 pageviews to an article in a month, our monthly threshold is 1000, and the article shows up in the monthly report, is this problematic? I think it's fair to decide that, because the remaining pageview could come from any country, it's not a breach of privacy, but it's something that should be decided at some point.
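The Norwegian scenario can be framed as a differencing check: the public per-project total minus whatever is published for one country bounds the pageviews from everywhere else. A small sketch, where `outside_views_range` and its parameters are hypothetical and the published country figure is assumed to be a closed range:

```python
from typing import Tuple


def outside_views_range(total_views: int,
                        in_country_low: int,
                        in_country_high: int) -> Tuple[int, int]:
    """Bounds on views from outside the country, given a public total
    and a published range for the country."""
    if in_country_low > total_views:
        raise ValueError("country range is inconsistent with the total")
    # The country cannot account for more views than the article received.
    in_country_high = min(in_country_high, total_views)
    return (total_views - in_country_high, total_views - in_country_low)


# Article with 1001 public monthly views, reported for the country
# in the 1000-1099 bucket:
print(outside_views_range(1001, 1000, 1099))   # (0, 1)

# Same article if the exact country count of 1000 were published:
print(outside_views_range(1001, 1000, 1000))   # (1, 1)
```

In this example the bucket leaves only one pageview of uncertainty, because the public total caps the top of the range. Whether the remaining ambiguity about which other country is involved is enough is the decision raised above.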

Awesome analysis @lexnasser!

I was thinking about using the threshold on pageviews vs. on unique readers, as @Isaac suggests.

I see 2 privacy threats:

  1. A person, or a small group of people, are the only ones who speak a given language in a very small country. If they visit Wikipedia (a lot), they could generate enough pageviews to appear in the ranking, and then anyone who knows them could deduce what they read on Wikipedia.
  2. An editor lives in a foreign country with very low Wikipedia traffic. They write an article over the course of one month, making lots of small edits, and they visit the article they're editing many times (maybe their local friends and relatives do as well). The article makes it into the ranking for that country. By combining the ranking with the public wiki databases, one could associate the edits to the article with the ranked article and deduce the editor's country.

I believe applying the threshold on unique readers would be safer privacy-wise. However, I had suggested that @lexnasser first study the rankings, to determine whether that would be needed or whether we would be safe just using pageviews for the threshold (which would be easier). Now, seeing @lexnasser's analyses, I'm on the fence.

@lexnasser, would it be possible to repeat the analysis for a smaller country (e.g., Andorra)? Maybe we can find instances of the 2 privacy threats I mentioned, like ranked articles that look unexpected, in out-of-context languages? In any case, it will give us a better idea of what small-country rankings will look like, and help us decide what to threshold on, no?

@JAllemandou Thanks for your thoughts and for the all-projects suggestion! I'm often unaware of those types of existing naming conventions.


@Isaac

I definitely agree that additional privacy features like bucketing and unique view thresholds may allow for that 100-view threshold.

I'm more open to e.g., 100 pageviews as a minimum threshold if... We're bucketing -- e.g., buckets in 100s -- i.e. 100-200 pageviews, 200-300 pageviews, ... 1000-1100 pageviews, ... I'd also be open to thousands buckets (100-1000, 1000-2000, ...) if that's deemed safer. 100 pageviews from a given country to a given article in a day is a pretty high level of traffic for the smaller language editions and I'd like to see us try to include it if possible.

I'm leaning towards buckets in 100s if we set the minimum total/unique view threshold to 100. My concern with 1000-sized buckets is that the API would provide very little insight into granular trends, especially for languages (e.g., the India data, where the vast majority (>90%) of reported pages for Bengali, Marathi, Telugu, and Kannada would fall into that 100-1000 bucket).

reporting pageviews but using unique # of users as the threshold. This is something that lexnasser had indicated a willingness to consider. How does this change things? Is this still feasible? I'm much more comfortable with 100 pageviews if I know that's coming from 100 different UA+IPs in a day.

I think this feature could bridge the privacy gap of using small (~100) buckets. I don't see any reason why this feature would be infeasible. Is there a particular reason why you were questioning its feasibility? I fully agree with your sentiment: "I'm much more comfortable with 100 pageviews if I know that's coming from 100 different UA+IPs in a day."

Also, even though I can find the UA Map, I'm unsure how to retrieve the IP for a specific pageview in wmf.pageviews_hourly. Do you have any insight for this, or for a different unique ID? Is this what you were referring to for infeasibility?

what's your thinking about how this would look? Is it essentially a list of Wikidata items for a country? Do we have a use-case that this covers or is the hope that e.g., for countries with many languages, topics would bubble up at the country level that don't meet any language-specific thresholds?

All that I mean by "by language" is adding a request parameter so that data is returned for a specific language (== project) like:

>>> GET /metrics/pageviews/top-by-article/in/hi.wikipedia/mobile-app/2019/10

"country": "in", <- India
"project": "hi.wikipedia", <- Hindi language
"access": "mobile-app",
"year": "2019",
"month": "10",
"articles": [
    {
        "article": "Main_Page",
        "project": "hi.wikipedia",
        "views": 12324,
        "rank": 1
    },
    {
        "article": "India",
        "project": "hi.wikipedia",
        "views": 2312,
        "rank": 2
    },
    ...
]

So, not "by language" would return a list of items like:

>>> GET /metrics/pageviews/top-by-article/in/mobile-app/2019/10/01

"country": "in",
"access": "mobile-app",
"year": "2019",
"month": "10",
"day": "01",
"articles": [
    {
        "article": "Main_Page",
        "project": "hi.wikipedia",
        "views": 12324,
        "rank": 1
    },
    {
        "article": "Joker_(2019_film)",
        "project": "en.wikipedia",
        "views": 3290,
        "rank": 2
    },
    ...
]

I'm differentiating between them because there may be different privacy-utility tradeoffs for their respective use cases (e.g., language data may be just as useful with monthly granularity, general data may be more useful with daily granularity). I'm open to merging these into a single endpoint if it doesn't compromise utility much.
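
To make the two proposed request shapes concrete, here is a minimal sketch that builds both paths. The helper name and its parameters are hypothetical; the path layout only mirrors the two examples above and is not a finalized API.

```python
from typing import Optional


def top_by_article_path(country: str, access: str, year: int, month: int,
                        day: Optional[int] = None,
                        project: Optional[str] = None) -> str:
    """Build a request path shaped like the two proposals above (hypothetical helper)."""
    parts = ["metrics", "pageviews", "top-by-article", country]
    if project is not None:
        # "By language" variant: restrict results to a single project.
        parts.append(project)
    parts += [access, f"{year:04d}", f"{month:02d}"]
    if day is not None:
        # Daily granularity adds a trailing day segment.
        parts.append(f"{day:02d}")
    return "/" + "/".join(parts)


by_language = top_by_article_path("in", "mobile-app", 2019, 10, project="hi.wikipedia")
general = top_by_article_path("in", "mobile-app", 2019, 10, day=1)
```

Keeping the project and the day as optional segments is one way a single endpoint could cover both cases if they end up merged.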

I think that clarification should address your remarks, but let me know if I'm missing something.

what about Norwegian Wikipedia articles... So if pageview data shows 1001 pageviews to an article in a month and our monthly threshold is 1000 and the article shows up in the monthly report, is this problematic? I think it's fair to decide that the fact that that remaining pageview could come from any country means that it's not a breach of privacy, but it's something that should be decided at some point.

For your example in particular, I don't think that should be a problem just because it could come from any country as you mentioned, but I'm open to reconsidering if you disagree.


Awesome analysis lexnasser!

Thanks!

I see 2 privacy threats:...

I agree with your assessment of the main privacy threats and that a unique pageview threshold could thwart those.

would it be possible to repeat an analysis for a smaller country (like i.e. Andorra)? Maybe we can find any instances of the 2 privacy threats I mentioned? Like ranked articles that look unexpected, in out-of-context languages? In any case, it will give us a better idea of how small country rankings will look like, and help us decide what to threshold on, no?

I'm actually not sure how to compute the unique view count (UA map + something else?). But here's the non-unique pageview data for Andorra:

For one day in Andorra, a country with ~80k population:

		For all languages:
			1)   ~300 pageviews
			10)  <50 pageviews
			100) too low to report

		For Catalan (~40%):
			1)   ~180 pageviews
			10)  <25 pageviews
			100) too low to report

		For Spanish (~40%):
			1)   ~300 pageviews
			10)  <25 pageviews
			100) too low to report

		For Portuguese (~0.5%):
			1)   too low to report
			10)  too low to report
			100) too low to report

		For English (~0%):
			1)   <50 pageviews
			10)  too low to report
			100) too low to report

@lexnasser Thanks for the analysis of Andorra's traffic!
In this case, no article would make it to the ranking because of the threshold (1000 pageviews).
So, no possibility of an unexpected privacy-sensitive article being featured.
I'm curious about Belgium's ranking then, to see if we can find something odd there.
Maybe we can pair on this @lexnasser?

Hey everyone, I just created a table lex.pageview_ranks_with_unique, available on Hive and Superset, that holds the exploratory data (for one day) that I've been analyzing. I created this to make it easier for everyone to examine the data, and see how different thresholds (including unique pageview thresholds) would affect the data returned for different countries.

If you care about a specific use case and want to see how that use case is affected by different thresholds, I'd love for you to try out a few queries and report back your thoughts.


Here's an example query:

SELECT
    row_number() OVER (ORDER BY total_count DESC) AS output_rank,
    country_rank,
    country_code,
    lang,
    page_title,
    country_lang_rank,
    total_count,
    unique_count
FROM lex.pageview_ranks_with_unique
WHERE
    country_code = 'BE'
    AND total_count > 1000
    AND unique_count > 500
ORDER BY total_count DESC
LIMIT 1000;

Each record (row) represents a single country-language-page combination.
In the above query:

  • output_rank represents the rank of that record among the records that would be returned, excluding pages that would not be returned due to pageview thresholds
  • country_rank represents the rank of that record among all the pages for that country, including pages that would not be returned due to pageview thresholds
  • country_lang_rank represents the rank of that record among all the pages for that country in that language (e.g. If the entry had lang = en, country_code = BE, and country_lang_rank = 47, then that would mean that that page was the 47th most-viewed English page in Belgium for that day)
  • total_count represents the total number of pageviews for that record
  • unique_count represents the number of unique pageviews for that record

As such, this query would represent data that would be returned for the country Belgium, with a total pageview minimum threshold of 1000 pageviews and a unique pageview minimum threshold of 500 pageviews. This query in particular retrieves 87 records that match those criteria from the top 1000 most-viewed articles on that day in Belgium.
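
For anyone who wants to see how the three rank columns relate without querying Hive, here is a small sketch on made-up rows. The data, the function name, and the default thresholds are illustrative assumptions; only the column names and the strict `>` comparisons come from the query above.

```python
from collections import defaultdict

# Made-up rows for one country: (lang, page_title, total_count, unique_count).
rows = [
    ("nl", "Hoofdpagina", 5000, 4200),
    ("fr", "Accueil", 3000, 2500),
    ("en", "Main_Page", 1500, 400),   # many views, but few unique actors
    ("nl", "België", 1200, 900),
    ("en", "Belgium", 800, 700),      # below the total pageview threshold
]


def rank_rows(rows, min_total=1000, min_unique=500):
    """Attach country_rank, country_lang_rank, and output_rank to each row."""
    ordered = sorted(rows, key=lambda r: r[2], reverse=True)
    seen_per_lang = defaultdict(int)
    ranked = []
    for country_rank, (lang, title, total, unique) in enumerate(ordered, start=1):
        seen_per_lang[lang] += 1
        ranked.append({
            "country_rank": country_rank,              # rank among all pages
            "country_lang_rank": seen_per_lang[lang],  # rank within the language
            "lang": lang,
            "page_title": title,
            "total_count": total,
            "unique_count": unique,
        })
    # Apply both thresholds, then number only the rows that survive.
    kept = [r for r in ranked
            if r["total_count"] > min_total and r["unique_count"] > min_unique]
    for output_rank, r in enumerate(kept, start=1):
        r["output_rank"] = output_rank
    return kept


result = rank_rows(rows)
```

In this toy data, "België" ends up with output_rank 3 but country_rank 4, because "Main_Page" is dropped by the unique threshold before output ranks are assigned.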

Looks great @lexnasser!
I feel the threshold on unique actors is working well. I was surprised to see that the number of unique actors on an article is typically close to the number of pageviews.
So, maybe we should consider only using the unique actors threshold, because it sort of supersedes the pageview threshold, no? An article with 500 unique actors will at least have 500 pageviews.
And even if 500 pageviews is lower than we initially thought the pageview threshold would be, knowing the article has 500 unique actors, I think we're fine.
Thoughts, anyone?

Using unique actors as the main threshold metric seems like a nice idea. As @Milimetric was pointing out the other day, there is an attack vector that involves adding fake users. A malicious attacker would be able to find out whether a page got a single viewer by faking 499 calls to that page with different UAs, if the page has 500 views, for instance. I wonder how we could mitigate that; buckets seem to do it.

Thanks for making the table @lexnasser! A few thoughts below:

Total Pageviews vs. Unique Pageviews

I checked on unique vs. total pageviews just to verify that it makes a difference and it does. I did a query to check for pages where a few users were generating many pageviews. For example, there were 55 pages where >1000 pageviews were generated from <50 unique devices and another 55 where >1000 pageviews were generated from <100 unique devices. Personally, I don't see any value to having both a total pageview and unique pageview threshold. I can't think of a privacy use-case where we'd e.g., be worried if there were 1000 pageviews but they all came from unique devices as opposed to 1000 pageviews that came from only 500 devices. So I'd just drop it to a threshold based on unique pageviews.
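
As a rough illustration of the difference, here is a sketch that counts total pageviews and unique actors per page, and then applies a threshold on unique actors only. It assumes an "actor" is a distinct UA+IP pair, as discussed earlier in this thread; the events are made up and the real definition in the Hive table may differ.

```python
from collections import defaultdict

# Made-up pageview events: (page_title, user_agent, ip).
events = (
    [("A", "ua1", "10.0.0.1")] * 3      # one actor viewing page A three times
    + [("A", "ua2", "10.0.0.2")]
    + [("B", "ua1", "10.0.0.1"),
       ("B", "ua3", "10.0.0.3"),
       ("B", "ua4", "10.0.0.4")]
)


def total_and_unique(events):
    """Return {page: (total_pageviews, unique_actors)}."""
    totals = defaultdict(int)
    actors = defaultdict(set)
    for page, ua, ip in events:
        totals[page] += 1
        actors[page].add((ua, ip))  # assumption: actor == (UA, IP) pair
    return {page: (totals[page], len(actors[page])) for page in totals}


counts = total_and_unique(events)
# Keep only pages with at least k unique actors (k=3 is arbitrary for this toy data).
k = 3
kept = sorted(page for page, (_, unique) in counts.items() if unique >= k)
```

Page A has more total views than page B here, yet only B passes the unique threshold.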

Bucketing for daily data

A malicious attacker would be able to find out whether a page got a single viewer by faking 499 calls to that page with different UAs, if the page has 500 views, for instance. I wonder how we could mitigate that; buckets seem to do it.

Agreed but I'd probably go further -- I don't think there's any strong use-case that requires exact counts, especially in daily data. The long-term differential privacy approach will greatly help with this too, but I don't think we're close to that yet. Frankly, I feel that until we have a better understanding of the privacy risks, we should just opt for the ranking of the pages and not actual pageview data for the daily reports. Maybe pilot it with just rankings and get feedback on whether it's useful enough that way? People can always cross-reference with the public pageview data to get a sense of orders of magnitude. That said, I would love to see some level of bucketing for the monthly reports, because it seems to me that it'd be far more difficult to manipulate monthly data in a way that reveals information about a single user.
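
To compare the two options side by side, here is a minimal sketch of "ranks only" output versus bucketed counts. The bucket width of 100 and both function names are assumptions picked for illustration, not a proposal for the actual values; the page counts are taken from the example responses earlier in this thread.

```python
def bucket(count: int, width: int = 100) -> str:
    """Report the range a count falls into instead of the exact count."""
    low = (count // width) * width
    return f"{low}-{low + width - 1}"


def rank_only(views_by_page: dict) -> list:
    """Return (rank, page_title) pairs with no pageview data at all."""
    ordered = sorted(views_by_page, key=views_by_page.get, reverse=True)
    return list(enumerate(ordered, start=1))


views = {"Main_Page": 12324, "Joker_(2019_film)": 3290}
ranks = rank_only(views)
bucketed = {page: bucket(n) for page, n in views.items()}
```

With these inputs, the bucketed report shows "12300-12399" and "3200-3299", while the rank-only report keeps just the ordering.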

Miscellaneous

  • I assume you'll want to retain only pages in article namespace. I think I lean towards retaining the Main Page articles but you could also reasonably remove them (that filter could be automatically generated by doing an anti join on page_id + wiki_db where item_id = Q5296 in item_page_link)
  • It looks like you're currently grouping by page_title for the counts (I queried for Nigeria and saw both Special:Search and Ihü_kárírí:Search for Igbo Wikipedia). I'd recommend grouping by page_id instead (which will take into account redirects). Unfortunately there's no great way to get the current canonical title in bulk because the page table in Hive will generally be at least several days behind. So maybe the best fix is to re-add page_title by grabbing the most common page_title associated with each page_id. Others might have ideas too (or push back on using page_id to group instead of page_title).

And regarding unique pageview threshold, I threw together this table of # of pages (and unique languages/projects) that would be on the list for each country for 100, 500, and 1000 unique pageviews. Example query for k = 1000 and data for all below. My takeaway is that unless we have a strong reason for k=1000, I'd push for k=500 or k=100 given that many countries are included and go from maybe 1-2 pages to 10 or more as you push k lower, which feels like a sizable jump in value for these countries. In general, I'd argue for pushing the k value as low as we feel comfortable so that more countries can be included (and would drop bucketed pageview data if that makes us feel comfortable with pushing the k lower).

# Example query:
SELECT
  country_code,
  COUNT(1) AS num_pages,
  COUNT(DISTINCT(lang)) AS num_projects
FROM lex.pageview_ranks_with_unique
WHERE
  unique_count > 1000
GROUP BY
  country_code
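
If you want to try the same aggregation at several k values without Hive access, here is a self-contained sketch using an in-memory SQLite table. The table contents are made up, the table name drops the `lex.` prefix, and the helper name is hypothetical; only the shape of the query follows the example above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE pageview_ranks_with_unique "
    "(country_code TEXT, lang TEXT, page_title TEXT, unique_count INTEGER)"
)
# Made-up rows, for illustration only.
conn.executemany(
    "INSERT INTO pageview_ranks_with_unique VALUES (?, ?, ?, ?)",
    [
        ("AD", "ca", "Portada", 150),
        ("AD", "es", "Wikipedia:Portada", 120),
        ("BE", "nl", "Hoofdpagina", 4200),
        ("BE", "fr", "Accueil", 2500),
        ("BE", "nl", "België", 900),
    ],
)


def pages_per_country(conn, k):
    """Count pages and distinct projects per country above a unique-count threshold k."""
    query = """
        SELECT country_code,
               COUNT(1) AS num_pages,
               COUNT(DISTINCT lang) AS num_projects
        FROM pageview_ranks_with_unique
        WHERE unique_count > ?
        GROUP BY country_code
        ORDER BY country_code
    """
    return conn.execute(query, (k,)).fetchall()


by_k = {k: pages_per_country(conn, k) for k in (100, 500, 1000)}
```

A country that has no page above the threshold simply disappears from the result, which is how the "always excluded" list below arises.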

Countries always excluded (if k>100): Antarctica; Cocos (Keeling) Islands; Cook Islands; Christmas Island; Western Sahara; Eritrea; British Indian Ocean Territory; Korea (Democratic People's Republic of); Marshall Islands; Montserrat; Norfolk Island; Nauru; Niue; Saint Helena, Ascension and Tristan da Cunha; Svalbard and Jan Mayen; Tokelau; Tuvalu; Wallis and Futuna; Antigua and Barbuda; Anguilla; American Samoa; Saint Barthélemy; Bonaire, Sint Eustatius and Saba; Central African Republic; Cabo Verde; Dominica; Falkland Islands (Malvinas); Micronesia (Federated States of); Grenada; Equatorial Guinea; Guinea-Bissau; Kiribati; Comoros; Saint Kitts and Nevis; Saint Martin (French part); Northern Mariana Islands; Saint Pierre and Miquelon; Palau; Solomon Islands; Seychelles; Sao Tome and Principe; Sint Maarten (Dutch part); Turks and Caicos Islands; Chad; Tonga; Holy See; Saint Vincent and the Grenadines; Virgin Islands (British); Vanuatu; Samoa; Mayotte; Saint Lucia

| country | # pageviews | num_pages (k=100) | num_projects (k=100) | num_pages (k=500) | num_projects (k=500) | num_pages (k=1000) | num_projects (k=1000) |
Greenland1K - 10K110000
Gambia1K - 10K110000
Lesotho1K - 10K110000
San Marino1K - 10K110000
Andorra10K - 100K220000
Afghanistan10K - 100K1122211
Aruba10K - 100K110000
Åland Islands10K - 100K110000
Barbados10K - 100K610000
Burkina Faso10K - 100K621100
Burundi10K - 100K220000
Benin10K - 100K1021100
Bermuda10K - 100K211100
Brunei Darussalam10K - 100K410000
Bahamas10K - 100K210000
Bhutan10K - 100K110000
Botswana10K - 100K211100
Belize10K - 100K110000
Congo10K - 100K110000
Curaçao10K - 100K110000
Djibouti10K - 100K220000
Fiji10K - 100K310000
Faroe Islands10K - 100K220000
Gabon10K - 100K410000
French Guiana10K - 100K110000
Guernsey10K - 100K210000
Gibraltar10K - 100K110000
Guinea10K - 100K1031100
Guadeloupe10K - 100K311100
Guam10K - 100K110000
Guyana10K - 100K411100
Isle of Man10K - 100K210000
Jersey10K - 100K311100
Cayman Islands10K - 100K110000
Lao People's Democratic Republic10K - 100K521100
Liechtenstein10K - 100K110000
Liberia10K - 100K510000
Monaco10K - 100K220000
Madagascar10K - 100K320000
Mali10K - 100K621100
Martinique10K - 100K111100
Mauritania10K - 100K520000
Mauritius10K - 100K621100
Maldives10K - 100K311100
Malawi10K - 100K310000
Mozambique10K - 100K1120000
Namibia10K - 100K211100
New Caledonia10K - 100K410000
Niger10K - 100K320000
French Polynesia10K - 100K110000
Papua New Guinea10K - 100K210000
Rwanda10K - 100K210000
Sierra Leone10K - 100K311100
Somalia10K - 100K721100
Suriname10K - 100K110000
South Sudan10K - 100K210000
Eswatini10K - 100K110000
Togo10K - 100K620000
Tajikistan10K - 100K1431100
Timor-Leste10K - 100K110000
Turkmenistan10K - 100K430000
Virgin Islands (U.S.)10K - 100K110000
?10K - 100K420000
Zambia10K - 100K822111
Zimbabwe10K - 100K210000
None100K - 1M59154321
Albania100K - 1M196513262
Armenia100K - 1M3736513173
Angola100K - 1M723222
Azerbaijan100K - 1M271528452
Bosnia and Herzegovina100K - 1M6553300
Bahrain100K - 1M5647221
Bolivia (Plurinational State of)100K - 1M119210141
Congo, Democratic Republic of the100K - 1M2635221
Côte d'Ivoire100K - 1M4333211
Cameroon100K - 1M1633211
China100K - 1M2143221
Costa Rica100K - 1M138213252
Cuba100K - 1M822121
Cyprus100K - 1M3234200
Dominican Republic100K - 1M17528232
Algeria100K - 1M1725244124
Estonia100K - 1M9138322
Ethiopia100K - 1M2232121
Georgia100K - 1M22859332
Ghana100K - 1M8228131
Guatemala100K - 1M233214241
Honduras100K - 1M93211241
Croatia100K - 1M262714252
Haiti100K - 1M1721100
Iraq100K - 1M2145253113
Iceland100K - 1M1221111
Jamaica100K - 1M711111
Jordan100K - 1M103211241
Kenya100K - 1M9346332
Kyrgyzstan100K - 1M14436242
Cambodia100K - 1M3963111
Kuwait100K - 1M34041142612
Lebanon100K - 1M7136231
Sri Lanka100K - 1M8157242
Lithuania100K - 1M24457322
Luxembourg100K - 1M1363333
Latvia100K - 1M5534333
Libya100K - 1M1921111
Moldova, Republic of100K - 1M7532100
Montenegro100K - 1M1040000
North Macedonia100K - 1M3132100
Myanmar100K - 1M2935232
Mongolia100K - 1M2221100
Macao100K - 1M1821111
Malta100K - 1M1022100
Nicaragua100K - 1M7123121
Nepal100K - 1M8824121
Oman100K - 1M80315261
Panama100K - 1M169210231
Puerto Rico100K - 1M5337322
Palestine, State of100K - 1M2222111
Paraguay100K - 1M115313151
Qatar100K - 1M87211232
Réunion100K - 1M421111
Sudan100K - 1M3356232
Slovenia100K - 1M5063322
Slovakia100K - 1M2478183103
Senegal100K - 1M1432211
El Salvador100K - 1M9026121
Syrian Arab Republic100K - 1M2222111
Tunisia100K - 1M9148322
Trinidad and Tobago100K - 1M2511111
Tanzania, United Republic of100K - 1M2842121
Uganda100K - 1M922121
Uruguay100K - 1M142210151
Uzbekistan100K - 1M5435221
Venezuela (Bolivarian Republic of)100K - 1M327325271
Yemen100K - 1M1922200
United Arab Emirates1M - 10M3819374132
Argentina1M - 10M483363413942
Austria1M - 10M108811842272
Australia1M - 10M61462146161595
Bangladesh1M - 10M47016313162
Belgium1M - 10M1130121487684
Bulgaria1M - 10M6576293112
Belarus1M - 10M3784262101
Switzerland1M - 10M75914507226
Chile1M - 10M16465942292
Colombia1M - 10M409282542712
Czechia1M - 10M1736111015254
Denmark1M - 10M53312363143
Ecuador1M - 10M11573782171
Egypt1M - 10M7429503213
Spain1M - 10M850321700112099
Finland1M - 10M14079943362
Greece1M - 10M11299743202
Hong Kong1M - 10M1865121404453
Hungary1M - 10M13146803262
Ireland1M - 10M936268912428
Israel1M - 10M2706112144954
Iran (Islamic Republic of)1M - 10M61881157631462
Korea, Republic of1M - 10M1451141626743
Kazakhstan1M - 10M316552943862
Morocco1M - 10M4276324153
Malaysia1M - 10M14188993303
Nigeria1M - 10M7559675243
Netherlands1M - 10M3548252719927
Norway1M - 10M62812467162
New Zealand1M - 10M3283311101
Peru1M - 10M15424793243
Philippines1M - 10M52801644961374
Pakistan1M - 10M8259621201
Poland1M - 10M90191767661793
Portugal1M - 10M6537382112
Romania1M - 10M16345811224234
Serbia1M - 10M4879352132
Saudi Arabia1M - 10M2494727721132
Sweden1M - 10M2847181725504
Singapore1M - 10M69814443173
Thailand1M - 10M2551121902522
Turkey1M - 10M34701630751093
Taiwan, Province of China1M - 10M57221061532122
Ukraine1M - 10M5378123153683
Viet Nam1M - 10M2756131805532
South Africa1M - 10M98816516163
Brazil>10M1380313109872713
Canada>10M1153921893122946
Germany>10M312664328971788111
France>10M20126252006116638
United Kingdom of Great Britain and Northern Ireland>10M3075135348515110211
Indonesia>10M1153621184365923
India>10M3449137430127149518
Italy>10M1957718189686197
Japan>10M47674215680819005
Mexico>10M1446014209257883
Russian Federation>10M22880222085115707
United States of America>100M1332171012332527919622

Thanks for everyone's input!

Copying Isaac's style:

Total Pageviews vs. Unique Pageviews

@mforns @JAllemandou @Isaac

It seems like all of you think that a unique pageviews (actors) threshold would suffice without an overall pageview threshold, and I agree!

I'd love to hear if anyone has any doubts about it.

@Isaac thanks for your extensive analysis of this data. Looking at all the data you posted, I agree with you about "pushing the k value as low as we feel comfortable so that more countries can be included."

Bucketing for daily data

@JAllemandou

Thanks for bringing up that threat - I agree that bucketing could address that.

@Isaac

I also don't think there're any strong use-cases that require exact counts.

However, while I do think that ranks without pageview data would definitely eliminate some threats altogether, I'd think that buckets would provide substantially more utility while curbing those threats.

This is a subject that I'd love to dive deeper into, though. I'll be looping in a privacy engineer very soon, and I'm interested in hearing their opinion regarding the breadth of these privacy concerns.

Miscellaneous

@Isaac

I'm leaning towards retaining the Main Page articles just because all the other endpoints do the same.

The issue you brought up about page_id vs page_title is super interesting, I hadn't thought about that. Forgive my lack of context, but how does the wmf.pageview_hourly table do it, or does that not apply here? I'll definitely look into your suggested approach, though.


Again, thanks everyone for your input, and I'd love to hear your opinions about bucketed data vs ranks only, and some of these miscellaneous topics.

I'll be looping in a privacy engineer very soon, and I'm interested in hearing their opinion regarding the breadth of these privacy concerns.

Excellent - glad to hear!

I'm leaning towards retaining the Main Page articles just because all the other endpoints do the same.

Makes sense -- consistency is good and people can always filter out after-the-fact

The issue you brought up about page_id vs page_title is super interesting, I hadn't thought about that. Forgive my lack of context, but how does the wmf.pageview_hourly table do it, or does that not apply here? I'll definitely look into your suggested approach, though.

Yeah, it's complicated :( The data is all derived from wmf.webrequest, where there is a page_id field that reflects the page shown to the reader (i.e. after resolving redirects) and a page_title field that reflects the page title the reader requested. So if a reader clicked on a link to Chicago, Illinois and got redirected to Chicago, then the page_title would be "Chicago, Illinois" but the page_id would be the page_id from the Chicago article (6886). The wmf.pageview_hourly table carries both the page_id and page_title fields from webrequest, so if you group on page_title, you're keeping redirects separated out, and if you group on page_id then you're automatically resolving redirects. It's easy to see the breakdown (and why this matters) using the redirects part of the pageviews tool. Here's the data for the Chicago article on enwiki (all of these titles would be associated with the same page_id in pageview_hourly): https://pageviews.toolforge.org/redirectviews/?project=en.wikipedia.org&platform=all-access&agent=user&range=latest-20&sort=views&direction=1&view=list&page=Chicago

The challenge for grouping on page_id is it's nice to have page_title for readability purposes but there's no super straightforward and always up-to-date way of getting the canonical page_title for a given page_id in Hive. Ideally you join on the page table but that would require querying against the MariaDB tables if you wanted current information. What I suggested is a join where you'd pull the most common page_title for any given page_id based on the pageview_hourly data. That should almost always work and at least guarantee you find a title for every page_id. Cases where it breaks down are pages that are moved to a new title but already have a ton of links in Wikipedia to the old titles so many of the pageviews still come to the old title for a while until it's eventually balanced out.
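
Here's a rough sketch of that join logic in plain Python, in case it helps: group views by page_id, then label each page_id with its most-viewed page_title. The rows and view counts are made up (only the Chicago page_id 6886 comes from the example above), and the function name is hypothetical.

```python
from collections import Counter, defaultdict

# Made-up rows shaped like pageview_hourly: (page_id, page_title, views).
rows = [
    (6886, "Chicago", 9000),
    (6886, "Chicago,_Illinois", 700),   # redirect title, same page_id
    (6886, "Chicago,_IL", 300),         # redirect title, same page_id
    (1234, "Example", 50),              # hypothetical page_id
]


def group_by_page_id(rows):
    """Sum views per page_id and pick the most common title as its label."""
    views = defaultdict(int)
    titles = defaultdict(Counter)
    for page_id, title, n in rows:
        views[page_id] += n
        titles[page_id][title] += n
    return {
        page_id: (titles[page_id].most_common(1)[0][0], views[page_id])
        for page_id in views
    }


grouped = group_by_page_id(rows)
```

In this toy data the redirect views are folded into the "Chicago" total. A recently moved page would still get its old title as the label until the new title accumulates more views, which is the failure case described above.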

My perspective on page_id at pageview level:

  • page_id is mostly available in pageviews - not present for mobile-app.
  • All pageview data is currently using page_title as identifier, also bringing in page_id when feasible. While for content-consumption analysis page_id is simpler as it allows to avoid the redirect complexity, page_title is the actual page identifier for other use-cases I know (mostly editing).
  • For the sake of consistency, I'd rather continue using page_title as identifier.

When we get the historical redirects problem solved we'll be able to provide a redirect-resolving table across time.

For the sake of consistency, I'd rather continue using page_title as identifier.

Thanks @JAllemandou for these additional details. What you say makes sense and for this dataset I'm more open to using page_title because of the dataset's clear intent to help editors and the fact that the ranking is presumed to be more valuable than the underlying pageview counts (so missing a few pageviews that came from a redirect feels like less of a concern). A few additional thoughts:

  • It goes against consistency, but another option is page_title for daily and page_id for monthly. This will handle page moves that happen mid-month, provide higher-quality (in my opinion) data for at least one of the datasets, and be far far easier to actually execute (because you can just join against the page table for the canonical title to associate with that page ID)
  • From a privacy perspective, the one thing I'll note is that preserving redirects can bring with it some implications because there will be page redirects that are only used by e.g., one external site that could have enough interest to make it onto the top articles list while still being so specific as to reveal information about exactly where those pageviews are coming from. Aggregating redirects helps with this because if an article has enough interest that it makes the list, it probably is receiving pageviews from a variety of independent sources.

page_id is mostly available in pageviews - not present for mobile-app.

I was unaware of this -- tangent but why? I assumed the page ID aspect was handled on the server side not the client...

When we get the historical redirects problem solved we'll be able to provide a redirect-resolving table across time.

Looking forward to this!

@JAllemandou @Isaac

Thanks to both of you for going into deeper detail about page_title vs. page_id. I'm also leaning towards using page_title because, in addition to being consistent with the other endpoints, it's simpler to implement as you both mentioned. I'm still open to considering this issue further, especially for the potential monthly data as Isaac brought up.


Also, just wanted to update everyone on the design status. Last week, I met with @JFishback_WMF, who was gracious enough to take on the privacy analysis of the current general design. Once he completes his analysis, there should only be a few more minor design considerations to work out, and then I can start on implementation.

Thanks again to everyone for all your help and input - I really appreciate it!

Once he completes his analysis, there should only be a few more minor design considerations to work out, and then I can start on implementation.

@lexnasser great to hear! and thank you for leading this work and taking all of these points into consideration. I'm excited to see this come to fruition!

Hey everyone, @JFishback_WMF has completed his risk analysis of the working API design, and, from a privacy perspective, everything's a go! Thanks to all of you for bringing up various potential privacy threats - it looks like we covered all the main ones.

Now that the design is all good in terms of privacy, I've updated the design doc, located here. Please take a look and add any thoughts you have.

Once we settle on a design, I'll get started on implementation. Can't wait!

Additional data to hopefully help show the impact that the different unique actor privacy thresholds would have on which countries would actually be able to benefit from this data (this is based on the data from T207171#6615009). I look at 1000 vs. 500 unique actor thresholds and how many countries in each continent show up on the resulting list with more than 0, at least 5, or at least 10 articles. In general, if a country only has one or two articles on a list, that means Main Page and Special:Search (so not particularly useful data for that region), and it looks to me like the list starts becoming useful at around 5 articles.

| Continent | Total # Countries | >1000 unique actors; >0 articles on list | >500 unique actors; >0 articles on list | >1000 unique actors; >=5 articles on list | >500 unique actors; >=5 articles on list | >1000 unique actors; >=10 articles on list | >500 unique actors; >=10 articles on list |
| --- | --- | --- | --- | --- | --- | --- | --- |
| North America | 26 | 14 (54%) | 18 (69%) | 4 (15%) | 10 (38%) | 3 (12%) | 7 (27%) |
| Africa | 46 | 20 (43%) | 29 (63%) | 5 (11%) | 10 (22%) | 5 (11%) | 5 (11%) |
| Europe | 48 | 33 (69%) | 38 (79%) | 27 (56%) | 29 (60%) | 25 (52%) | 27 (56%) |
| Asia | 51 | 41 (80%) | 47 (92%) | 25 (49%) | 34 (67%) | 23 (45%) | 27 (53%) |
| Oceania | 7 | 2 (29%) | 2 (29%) | 2 (29%) | 2 (29%) | 2 (29%) | 2 (29%) |
| South America | 13 | 10 (77%) | 11 (85%) | 9 (69%) | 10 (77%) | 6 (46%) | 10 (77%) |

Some takeaways for me:

  • Moving the threshold from 1000 to 500 moves the # of countries with at least 5 articles on their list from 72 to 95 and doubles representation in Africa (from 5 to 10 countries). Also large bumps in Central America / Caribbean and Asia, so clear positive impacts on equity
  • Moving to 100 unique actor threshold (not shown here) has an even more dramatic effect -- in general, if a country has at least 1 article at the 500 unique actor threshold, it'll have at least 10 on the 100 unique actor threshold.
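
For anyone who wants to check the 72 and 95 totals in the first takeaway, here is a small sketch that sums the ">=5 articles on list" columns from the table above. The numbers are copied from that table; the variable names are mine.

```python
# Per-continent country counts copied from the table above:
# (>=5 articles at 1000 unique actors, >=5 articles at 500 unique actors)
at_least_5 = {
    "North America": (4, 10),
    "Africa": (5, 10),
    "Europe": (27, 29),
    "Asia": (25, 34),
    "Oceania": (2, 2),
    "South America": (9, 10),
}

total_1000 = sum(a for a, _ in at_least_5.values())
total_500 = sum(b for _, b in at_least_5.values())

# Countries gained per continent when lowering the threshold from 1000 to 500.
gained = {continent: b - a for continent, (a, b) in at_least_5.items()}
```

The sums come out to 72 and 95, and the largest gains are in Asia (9), North America (6), and Africa (5).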

Sorry for the radio silence! I just finished up final exams, so I'm now freed up to make more progress on this.

@Isaac
Thanks for aggregating this country-inclusion data -- I didn't have any empirical data to consider when deciding the threshold, so this really helps put things into perspective. @JFishback_WMF is currently working on creating a more robust privacy framework to analyze these types of tradeoffs, so we'll be able to make more progress on this threshold decision when that is finished.

Until then, I'll be working on creating the initial Oozie job to load the data, and any decisions made about this threshold should be extremely easy to change. Thanks again for all your help!

Sorry for the radio silence! I just finished up final exams, so I'm now freed up to make more progress on this.

No worries - thanks for working on this!

is currently working on creating a more robust privacy framework to analyze these types of tradeoffs, so we'll be able to make more progress on this threshold decision when that is finished

Great news! @JFishback_WMF if I can provide any more data, let me know.

Until then, I'll be working on creating the initial Oozie job to load the data, and any decisions made about this threshold should be extremely easy to change.

Yep, that makes sense to me.

Change 654924 had a related patch set uploaded (by Lex Nasser; owner: Lex Nasser):
[analytics/refinery@master] Create and configure Oozie job to load 'Top Articles by Country Pageviews API' data into Cassandra

https://gerrit.wikimedia.org/r/654924

Change 657228 had a related patch set uploaded (by Lex Nasser; owner: Lex Nasser):
[analytics/aqs@master] Create pageviews 'top-per-country' endpoint with tests

https://gerrit.wikimedia.org/r/657228

Change 654924 merged by Joal:
[analytics/refinery@master] Create and configure Oozie job to load data into Cassandra for pageviews 'top-per-country' AQS endpoint

https://gerrit.wikimedia.org/r/654924

Change 668236 had a related patch set uploaded (by Lex Nasser; owner: Lex Nasser):
[analytics/refinery@master] Add double quote when constructing JSON in Hive query and change field names in properties file for top-per-country job

https://gerrit.wikimedia.org/r/668236

Change 668236 merged by Joal:
[analytics/refinery@master] Fix and optimize Hive query and change field names in properties file for top-per-country job

https://gerrit.wikimedia.org/r/668236

Change 657228 merged by jenkins-bot:
[analytics/aqs@master] Create pageviews 'top-per-country' endpoint with tests

https://gerrit.wikimedia.org/r/657228

Mentioned in SAL (#wikimedia-operations) [2021-03-17T13:27:40Z] <otto@deploy1002> Started deploy [analytics/aqs/deploy@3e92346]: deploy aqs as part of train - T207171, T263697

Mentioned in SAL (#wikimedia-analytics) [2021-03-17T13:28:01Z] <ottomata> deploy aqs as part of train - T207171, T263697

Mentioned in SAL (#wikimedia-operations) [2021-03-17T13:31:04Z] <otto@deploy1002> Finished deploy [analytics/aqs/deploy@3e92346]: deploy aqs as part of train - T207171, T263697 (duration: 03m 24s)

The pageviews/top-per-country endpoint is now public! Take a look at the documentation here. You can query data starting from January 1, 2021, and examples can be found here. Note that, at the moment, the endpoint has a stability of 'experimental', meaning that the endpoint can change in incompatible ways at any time, without incrementing the API version. However, I don't expect this to occur.
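
For a quick start, here is a minimal sketch of how a request could be built and sent from Python. The base route and the order of the path segments are my assumptions based on the endpoint name, so please check them against the documentation; the function names and the User-Agent string are placeholders.

```python
import json
import urllib.request

# Assumed base route; verify against the endpoint documentation.
BASE = "https://wikimedia.org/api/rest_v1/metrics/pageviews/top-per-country"


def top_per_country_url(country: str, access: str, year: int, month: int, day: int) -> str:
    """Build a request URL (segment order is an assumption)."""
    return f"{BASE}/{country}/{access}/{year:04d}/{month:02d}/{day:02d}"


def fetch_top_per_country(country: str, access: str, year: int, month: int, day: int):
    """Fetch and decode the JSON response; requires network access."""
    request = urllib.request.Request(
        top_per_country_url(country, access, year, month, day),
        headers={"User-Agent": "top-per-country-example/0.1 (placeholder contact)"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


url = top_per_country_url("BE", "all-access", 2021, 1, 1)
```

Calling `fetch_top_per_country("BE", "all-access", 2021, 1, 1)` would then request the first day for which data is available.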

Feel free to reach out with any questions at all.


Thanks so much to all of you for your help in designing, developing, testing, and deploying this endpoint, namely @Milimetric, @Amire80, @Isaac, @JFishback_WMF, @JAllemandou, @elukey, @razzi, @Ottomata, @Pchelolo, and @Nuria!

Awesome work @lexnasser :) This new endpoint is worth a blog post IMO :)

Thank you so much @lexnasser! This has been many years in the making and is truly excellent work doing the engineering and bringing everyone together to get it done!

This new endpoint is worth a blog post IMO :)

Very much agree -- right now I'm just going around sharing the wikitech documentation with whoever will listen :)

Piling on the thanks @lexnasser! Our team has been patiently waiting for this and glad you led it through to its completion!

lexnasser added a subscriber: lexnasser.

Passing this task over to Francisco to carry out the implementation of this data into WikiStats.