
Investigate releasing historical top-pageview-per-country data
Closed, Resolved, Public

Description

There has been demand on the Analytics email list to backfill the top-pageview-per-country data in the API: https://lists.wikimedia.org/hyperkitty/list/analytics@lists.wikimedia.org/thread/STLYZXCF442KJZ6457TMK5XUMJNTA6PQ/
We have not yet done so because we use the actor field as a filtering mechanism for the dataset (we don't release pages that have been seen by fewer than 1k actors), and the pageview-actor data is only available for 90 days.
We could investigate a coarser-grained filtering method that would let us release less data, but some data nonetheless.
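
For readers unfamiliar with the actor-based filter described above, here is a minimal sketch of the idea. It is not the production job: the row shape `(country, page, actor_signature)` and the function name are assumptions made for illustration, and only the 1k-actor threshold comes from the description.

```python
from collections import defaultdict

MIN_ACTORS = 1000  # threshold from the description ("1k actors")


def releasable_pages(views, min_actors=MIN_ACTORS):
    """Return view counts for pages seen by at least `min_actors` distinct actors.

    `views` is an iterable of (country, page, actor_signature) tuples.
    This row shape is hypothetical, chosen only for the example.
    """
    actors = defaultdict(set)   # (country, page) -> distinct actor signatures
    counts = defaultdict(int)   # (country, page) -> total views
    for country, page, actor in views:
        key = (country, page)
        actors[key].add(actor)
        counts[key] += 1
    # Keep only the pages whose distinct-actor count reaches the threshold.
    return {key: counts[key] for key in counts if len(actors[key]) >= min_actors}
```

The sketch also shows why backfilling is hard: once the actor signatures have expired after 90 days, the distinct-actor count can no longer be computed, so this filter cannot be applied to older data.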

Event Timeline

Hi @JAllemandou — does this pageview data exist in a private table somewhere stripped of the actor_signature field? Or is it preaggregated somehow? This could be a case where differential privacy (which we are currently piloting on similar data) could come in handy.

Hi @Htriedman - Indeed, the pageview data has been available without the actor signature since mid-2015, aggregated hourly over a set of dimensions (see https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/Pageview_hourly). Differential privacy could be a great solution for this use case, I assume.

@JAllemandou

Thanks so much for getting back to me on this with some more information. We're currently in the middle of establishing protocols and processes around the use of differential privacy (and configuring software!), which should be done by the end of Q4 (June 2022). This data release is definitely possible within certain privacy bounds — if we can wait until then. If not, I can also potentially suggest some other mitigation heuristics.

Do you think this could wait a couple months for a higher quality data release?

I'd love it if we could use this use case as a first release of DiffPriv data :)

Great! I'll be sure to circle back to this in a month or two with some updates.

Hi all! Just wanted to come back to this thread (even though it's been more than a month or two) with some updates:

The Privacy Engineering team was recently funded to pursue the differentially private release of historical pageview_hourly data, grouped by country and project and summed. We're hoping to get that data release out by mid-year 2023. I'd love to know if anyone has comments, questions, concerns, etc.
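
As a rough illustration of what a differentially private release of summed counts can look like, the sketch below adds Laplace noise to each grouped count. The thread does not say which mechanism, privacy budget, or contribution bounds the team uses, so the Laplace mechanism, the `epsilon` parameter, the sensitivity of 1, and the function names here are all assumptions.

```python
import random


def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponentials with rate 1/scale
    # follows a Laplace distribution centred on 0 with that scale.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_counts(counts, epsilon, sensitivity=1.0, rng=None):
    """Add Laplace noise to each count in a {(country, project): views} dict.

    Assumes each individual changes any single count by at most
    `sensitivity`; a real release needs explicit contribution bounding.
    """
    rng = rng or random.Random()
    scale = sensitivity / epsilon  # smaller epsilon means more noise
    return {key: value + laplace_noise(scale, rng) for key, value in counts.items()}
```

A smaller `epsilon` gives stronger privacy and noisier counts, which is the trade-off behind releasing "within certain privacy bounds".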

Htriedman claimed this task.

Update (very late but still necessary): As of Feb 2023, this data request has been completed!

Daily data from 1 July 2015 - 8 Feb 2017 is available at: https://analytics.wikimedia.org/published/datasets/country_project_page_historical_pre_2017/

Daily data from 9 Feb 2017 - 5 Feb 2023 is available at: https://analytics.wikimedia.org/published/datasets/country_project_page_historical/

Daily data from 6 Feb 2023 - present is available at: https://analytics.wikimedia.org/published/datasets/country_project_page/
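
Since the data is split across three locations by date, a small helper can pick the right one. This is only a sketch: the date ranges and base URLs are taken from the comment above, while the function name is hypothetical and the layout of files inside each directory is not covered here.

```python
from datetime import date

BASE = "https://analytics.wikimedia.org/published/datasets/"

# (first day, last day or None for "present", directory URL), from the thread.
DATASETS = [
    (date(2015, 7, 1), date(2017, 2, 8), BASE + "country_project_page_historical_pre_2017/"),
    (date(2017, 2, 9), date(2023, 2, 5), BASE + "country_project_page_historical/"),
    (date(2023, 2, 6), None, BASE + "country_project_page/"),
]


def dataset_url_for(day):
    """Return the directory URL that covers `day`, or raise ValueError."""
    for start, end, url in DATASETS:
        if day >= start and (end is None or day <= end):
            return url
    raise ValueError("no published dataset covers %s" % day.isoformat())
```

For example, `dataset_url_for(date(2016, 1, 1))` returns the `country_project_page_historical_pre_2017/` directory, and any date before 1 July 2015 raises `ValueError`.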