There has been demand on the Analytics email list to backfill the top-pageview-per-country data in the API: https://lists.wikimedia.org/hyperkitty/list/analytics@lists.wikimedia.org/thread/STLYZXCF442KJZ6457TMK5XUMJNTA6PQ/
We have not yet done so because we use the actor field as a filtering mechanism for the dataset (we don't release pages seen by fewer than 1k actors), and the pageview-actor data is only available for 90 days.
We could investigate a coarser-grained filtering method that would let us release less data, but some data nonetheless.
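For readers unfamiliar with the actor-based filter mentioned above, here is a minimal sketch of the idea. The row layout `(country, page, actor_signature)` and the function name are assumptions made for illustration; this is not the production job.

```python
from collections import defaultdict

MIN_ACTORS = 1000  # the "1k actors" threshold mentioned in the description


def releasable_pages(rows, min_actors=MIN_ACTORS):
    """Return the (country, page) pairs seen by at least `min_actors` distinct actors.

    `rows` is an iterable of (country, page, actor_signature) tuples.
    This schema is hypothetical and only serves the illustration.
    """
    actors = defaultdict(set)
    for country, page, actor in rows:
        # A set de-duplicates repeated views by the same actor.
        actors[(country, page)].add(actor)
    return {key for key, seen in actors.items() if len(seen) >= min_actors}
```

Because the filter counts distinct actors, it cannot be applied once the actor signature has been dropped, which is why data older than 90 days needs a different approach.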
Event Timeline
Hi @JAllemandou — does this pageview data exist in a private table somewhere stripped of the actor_signature field? Or is it preaggregated somehow? This could be a case where differential privacy (which we are currently piloting on similar data) could come in handy.
Hi @Htriedman - Yes, the pageview data has been available without the actor signature since mid-2015, aggregated hourly over a set of dimensions (see https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/Pageview_hourly). Differential privacy could indeed be a great solution for this use case.
Thanks so much for getting back to me on this with some more information. We're currently in the middle of establishing protocols and processes around the use of differential privacy (and configuring software!), which should be done by the end of Q4 (June 2022). This data release is definitely possible within certain privacy bounds — if we can wait until then. If not, I can also potentially suggest some other mitigation heuristics.
Do you think this could wait a couple months for a higher quality data release?
Hi all! Just wanted to come back to this thread (even though it's been more than a month or two) with some updates —
The Privacy Engineering team was recently funded to pursue the differentially private release of historical pageview_hourly data, grouped by country and project and summed. We're hoping to get that data release out by mid-year 2023. I'd love to know if anyone has comments, questions, concerns, etc.
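As a rough illustration of what "grouped, summed, and released with differential privacy" can look like, here is a toy sketch using Laplace noise and a post-noise release threshold. It is not the Privacy Engineering team's pipeline: the row layout, the parameter names, and the assumption that one person changes any total by at most 1 are all hypothetical. A real release would also need contribution bounding and a proper privacy analysis.

```python
import random
from collections import Counter


def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponentials with mean `scale`
    # is distributed as Laplace(0, scale).
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def noisy_counts(rows, epsilon, release_threshold, rng=None):
    """Sum views per (country, project), add Laplace noise, drop small totals.

    `rows` is an iterable of (country, project, views) tuples (hypothetical schema).
    Assumes sensitivity 1, so the noise scale is 1 / epsilon.
    """
    rng = rng or random.Random()
    totals = Counter()
    for country, project, views in rows:
        totals[(country, project)] += views

    scale = 1.0 / epsilon
    released = {}
    for key, total in totals.items():
        noisy = total + laplace_noise(scale, rng)
        # Only publish groups whose noisy total clears the threshold.
        if noisy >= release_threshold:
            released[key] = round(noisy)
    return released
```

Unlike the actor-based filter, this only needs aggregate counts, so it can run on data that no longer carries an actor signature.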
Update (very late but still necessary): As of Feb 2023, this data request has been completed!
Daily data from 1 July 2015 - 8 Feb 2017 is available at: https://analytics.wikimedia.org/published/datasets/country_project_page_historical_pre_2017/
Daily data from 9 Feb 2017 - 5 Feb 2023 is available at: https://analytics.wikimedia.org/published/datasets/country_project_page_historical/
Daily data from 6 Feb 2023 - present is available at: https://analytics.wikimedia.org/published/datasets/country_project_page/
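Since the data is split across three locations by date, a small helper like the following can pick the right base URL for a given day. This is only a convenience sketch: the function name is made up, and the file layout underneath each URL is not described in this thread, so the helper stops at the directory level.

```python
from datetime import date

BASE = "https://analytics.wikimedia.org/published/datasets/"

# (first day, last day or None for "present", directory), taken from the list above.
DATASETS = [
    (date(2015, 7, 1), date(2017, 2, 8), BASE + "country_project_page_historical_pre_2017/"),
    (date(2017, 2, 9), date(2023, 2, 5), BASE + "country_project_page_historical/"),
    (date(2023, 2, 6), None, BASE + "country_project_page/"),
]


def dataset_url(day):
    """Return the base URL of the dataset that covers `day`."""
    for start, end, url in DATASETS:
        if day >= start and (end is None or day <= end):
            return url
    raise ValueError(f"no published dataset covers {day.isoformat()}")
```

For example, `dataset_url(date(2017, 2, 9))` returns the `country_project_page_historical/` directory, while any date before 1 July 2015 raises `ValueError`.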