Back in 2020 I worked on a dataset and notebook with a data processing flow that looks something like this:
Wikidata Item ID -> Wikipedia page titles -> aggregated daily pageview data across all Wikimedia projects
I'd like to publish the data and graphs that come out of the end of this pipeline for the COVID-19 topic.
Before I do this I would like some review of this data to make sure I won't be publishing anything that I should not.
I'm mainly concerned about low pageview values alongside per-country splits of the data, as mentioned at https://wikitech.wikimedia.org/wiki/Analytics/AQS/Pageviews#Country_data_and_privacy
Details of the data
The specific notebook and files that I wish to publish can currently be found at addshore@stat1005:~/publish/wd-topic-pageviews/COVID-19/2019-10-01_2022-03-07
These include:
- The notebook used for generation (as HTML, Markdown and ipynb)
- The data aggregated by the notebook, in CSV format
- Plotly interactive graphs as HTML files (totals, by access method, by continent, by country)
A fake sample of the data that would be in the CSV can be seen below
date        continent      country        access_method  pageviews
2020-11-08  North America  United States  desktop        6666.0
2020-11-15  North America  United States  desktop        7777.0
2020-11-01  North America  United States  desktop        8888.0
2020-11-22  North America  United States  desktop        9999.0
So this is per-day, per-country, per-access-method aggregated pageview data, across ALL Wikimedia projects, for a collection of titles (not published as a complete list).
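To make the shape of the aggregation concrete, here is a minimal sketch of how per-project, per-title rows could be collapsed into rows like the sample above. It is not the notebook's actual code: the function name, the input field names (project, title, views) and every value are made up for illustration.

```python
from collections import defaultdict


def aggregate_pageviews(rows):
    """Sum views over all projects and titles for each
    (date, continent, country, access_method) combination."""
    totals = defaultdict(float)
    for row in rows:
        key = (row["date"], row["continent"], row["country"], row["access_method"])
        totals[key] += row["views"]
    # Project and title are dropped here, so neither appears in the output.
    return [
        {
            "date": date,
            "continent": continent,
            "country": country,
            "access_method": access_method,
            "pageviews": views,
        }
        for (date, continent, country, access_method), views in sorted(totals.items())
    ]


# Hypothetical input rows; projects, titles and numbers are invented.
raw_rows = [
    {"date": "2020-11-01", "project": "en.wikipedia", "title": "Example_A",
     "continent": "North America", "country": "United States",
     "access_method": "desktop", "views": 5000},
    {"date": "2020-11-01", "project": "es.wikipedia", "title": "Example_B",
     "continent": "North America", "country": "United States",
     "access_method": "desktop", "views": 3888},
    {"date": "2020-11-01", "project": "en.wikipedia", "title": "Example_A",
     "continent": "North America", "country": "United States",
     "access_method": "mobile web", "views": 1200},
]

aggregated = aggregate_pageviews(raw_rows)
```

With this input, the two desktop rows collapse into a single row and the mobile web row stays separate, so the output no longer reveals which project or title contributed the views.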
Currently, I filter out any pageviews values that are less than 1000, on the advice of @JAllemandou:
"in our daily top-pageview per project and country we filter for less than 1000 actors (actually more radical than views)"
https://github.com/wikimedia/analytics-refinery/blob/master/oozie/cassandra/daily/pageview_top_percountry.hql#L115
All diagrams are then generated from this filtered dataset.
The notebook itself also doesn't include any of this data until after filtering has occurred.
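As a rough illustration of the filtering step described above, the sketch below drops every row whose pageviews value is below 1000 before anything else is done with the data. It is an assumption-laden example rather than the notebook's code: the names MIN_PAGEVIEWS and filter_low_pageviews are hypothetical, and the CSV content (including the country "Exampleland") is fake.

```python
import csv
import io

# Threshold taken from the description above: values below 1000 are removed.
MIN_PAGEVIEWS = 1000


def filter_low_pageviews(rows, minimum=MIN_PAGEVIEWS):
    """Keep only rows whose pageviews value is at least `minimum`."""
    return [row for row in rows if float(row["pageviews"]) >= minimum]


# Fake data in the same column layout as the sample CSV.
sample_csv = """date,continent,country,access_method,pageviews
2020-11-01,North America,United States,desktop,8888.0
2020-11-01,Europe,Exampleland,mobile web,999.0
2020-11-08,North America,United States,desktop,6666.0
"""

rows = list(csv.DictReader(io.StringIO(sample_csv)))
publishable = filter_low_pageviews(rows)
```

Note that this thresholds on views, as described in this task, whereas the linked refinery job thresholds on actors.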
Publication plan
I plan to publish the data to https://analytics.wikimedia.org/published/notebooks/addshore/
I also plan to write a personal blog post about this.