
Privacy review for dataset publishing (Wikidata topic -> pageview data)
Closed, Resolved (Public)

Description

Back in 2020 I worked on a dataset and notebook with a data processing flow that looks something like this:

Wikidata Item ID -> Wikipedia page titles -> aggregated daily pageview data across all Wikimedia projects

I'd like to publish the data and graphs that come out of the end of this pipeline for the COVID-19 topic.
Before I do this I would like some review of this data to make sure I won't be publishing anything that I should not.
Most notably I'd be concerned about low pageview values alongside per country splits of data, as mentioned at https://wikitech.wikimedia.org/wiki/Analytics/AQS/Pageviews#Country_data_and_privacy

Details of the data

The specific notebook and files that I wish to publish can currently be found at addshore@stat1005:~/publish/wd-topic-pageviews/COVID-19/2019-10-01_2022-03-07
These include:

  • The notebook used for generation (as HTML, Markdown, and ipynb)
  • The data aggregated by the notebook, in CSV format
  • Plotly interactive graphs as HTML files (totals, by access method, by continent, by country)

A fake sample of the data that would be in the CSV can be seen below

date	continent	country	access_method	pageviews
2020-11-08	North America	United States	desktop	6666.0
2020-11-15	North America	United States	desktop	7777.0
2020-11-01	North America	United States	desktop	8888.0
2020-11-22	North America	United States	desktop	9999.0

So this is pageview data aggregated per day, per country, and per access method, across ALL Wikimedia projects, for a collection of titles (not published as a complete list).
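A minimal sketch of what this aggregation amounts to, assuming per-title rows as input. The titles, numbers, and the `aggregate_topic` helper are hypothetical and are not taken from the notebook; the continent column is left out for brevity.

```python
from collections import defaultdict

# Hypothetical per-title rows: (date, title, country, access_method, views).
# Titles and numbers are invented for illustration only.
per_title_rows = [
    ("2020-11-01", "Title A", "United States", "desktop", 5000),
    ("2020-11-01", "Title B", "United States", "desktop", 3888),
    ("2020-11-01", "Title A", "United States", "mobile web", 1200),
    ("2020-11-08", "Title B", "United States", "desktop", 6666),
]


def aggregate_topic(rows):
    """Sum views over all titles, keeping date, country and access method."""
    totals = defaultdict(int)
    for date, _title, country, access_method, views in rows:
        # The title is deliberately dropped from the key, so the output
        # only ever carries topic-level totals.
        totals[(date, country, access_method)] += views
    return dict(totals)


topic_totals = aggregate_topic(per_title_rows)
for key in sorted(topic_totals):
    print(key, topic_totals[key])
```

Because the title never appears in the output key, per-page counts cannot be read back out of the published rows.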

Currently, I filter out any pageview values that are less than 1000, on the advice of @JAllemandou:

in our daily top-pageview per project and country we filter for less than 1000 actors (actually more radical than views)
https://github.com/wikimedia/analytics-refinery/blob/master/oozie/cassandra/daily/pageview_top_percountry.hql#L115

All diagrams are then generated from this filtered dataset.

The notebook itself also doesn't include any of this data until after filtering has occurred.
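The filter described above could look roughly like the following. This is a sketch rather than the notebook's actual code: the row layout mirrors the fake CSV sample, the country names are placeholders, and treating a value of exactly 1000 as publishable is an assumption based on the "less than 1000" wording.

```python
MIN_PAGEVIEWS = 1000  # threshold named in the task description

# Fake rows in the same shape as the CSV sample; values are invented.
rows = [
    {"date": "2020-11-01", "continent": "North America",
     "country": "Country A", "access_method": "desktop", "pageviews": 8888.0},
    {"date": "2020-11-01", "continent": "Europe",
     "country": "Country B", "access_method": "desktop", "pageviews": 999.0},
    {"date": "2020-11-01", "continent": "Europe",
     "country": "Country C", "access_method": "mobile web", "pageviews": 1000.0},
]


def drop_low_counts(rows, minimum=MIN_PAGEVIEWS):
    """Keep only rows whose pageview count is at least `minimum`."""
    return [row for row in rows if row["pageviews"] >= minimum]


filtered = drop_low_counts(rows)
for row in filtered:
    print(row["country"], row["access_method"], row["pageviews"])
```

Running the filter before any graph or table is built means nothing downstream can expose a small cell.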

Publication plan

I plan to publish the data to https://analytics.wikimedia.org/published/notebooks/addshore/
I also plan to write a personal blog post about this.

Event Timeline

Hi @Addshore! I'm Hal, a privacy engineer at WMF, and I'll be taking a look at this (rerunning the notebook, assessing potential harms, writing up a formal privacy review, etc.) in the next few days.

My initial take is that this data release doesn't strike me as particularly risky. COVID may be a controversial topic in some places at some times, but views of a page about COVID seem unlikely to spur any additional risk, particularly if page view counts have been filtered to only be greater than 1000. However, when releasing datasets about editors and readers, we typically do not include data from countries in the Country Protection List.

I'll be sure to ping you with any questions I have as I go through this process! Thanks for submitting this.

My initial take is that this data release doesn't strike me as particularly risky. COVID may be a controversial topic in some places at some times, but views of a page about COVID seem unlikely to spur any additional risk, particularly if page view counts have been filtered to only be greater than 1000. However, when releasing datasets about editors and readers, we typically do not include data from countries in the Country Protection List.

This should be easy enough to include in the notebook / filter out :)

@Htriedman do you have any timeline estimates for this?

Hi @Addshore working on this now, hopefully I'll have it done in the next 24h!

Amazing!
I'll keep an eye out here (also on Slack or IRC under the same name) in case anything crops up or there are issues.

Hi @Addshore! Hope you're well — I'm done with my privacy review and am hoping to share it with you soon (I just need your email)

Overall, this presents a low risk. You're good to proceed with publishing this data as long as you:

  • aggregate page views by a topic (e.g. a set of titles), rather than disaggregating and posting page view counts for individual pages
  • drop any crosstabs with fewer than 1000 views total
  • do not publish data for countries in the country protection list (linked above)
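One way to check a dataset against the three conditions above before publishing is sketched below. Everything here is an assumption made for illustration: the `release_problems` helper, the row layout, and the placeholder country set, which stands in for the real Country Protection List and must be replaced with it.

```python
# Placeholder only: substitute the real Country Protection List entries.
PLACEHOLDER_PROTECTED = {"Protected Country X"}


def release_problems(rows, protected_countries, minimum=1000):
    """Return (row index, reason) pairs for rows that should not be published."""
    problems = []
    for index, row in enumerate(rows):
        if "title" in row:
            # Condition 1: only topic-level aggregates, no per-page counts.
            problems.append((index, "per-title row; aggregate by topic first"))
        if row["pageviews"] < minimum:
            # Condition 2: drop crosstabs with fewer than `minimum` views.
            problems.append((index, "fewer than %d views" % minimum))
        if row["country"] in protected_countries:
            # Condition 3: no data for protected countries.
            problems.append((index, "country is on the protection list"))
    return problems


# Invented rows: one clean, and one violating each condition.
candidate_rows = [
    {"country": "United States", "pageviews": 8888.0},
    {"country": "Protected Country X", "pageviews": 5000.0},
    {"country": "United States", "pageviews": 500.0},
    {"title": "Title A", "country": "United States", "pageviews": 2000.0},
]

problems = release_problems(candidate_rows, PLACEHOLDER_PROTECTED)
for index, reason in problems:
    print(index, reason)
```

An empty result would mean the rows pass all three checks; it says nothing about risks outside those conditions.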

You should also check out two previous COVID datasets WMF has released; they may have information useful for your analysis:

Addshore claimed this task.

I'm gonna work on actually publishing this soon :)