Legal review for Wikidata queries data release proposal
Closed, Resolved, Public

Description

We would like to hear the Legal team's feedback on the proposal to publish the WDQS query data set that was collected for research by Markus Krötzsch.

The proposal is here: https://meta.wikimedia.org/wiki/User:Markus_Kr%C3%B6tzsch/Wikidata_queries

TL;DR summary: we would like to release anonymized data about queries performed on WDQS over a period of twelve weeks in summer 2017, which were collected for research done by Markus. This data set is anonymized and modified so as to exclude PII from the data. @Smalyshev (for WDQS) and @leila (for the Research team) have reviewed the proposal and it looks fine to us.

The purpose of the data release is to give other researchers insight into how SPARQL services are used in general, and the Wikidata one specifically.

Event Timeline

Smalyshev triaged this task as Medium priority. Mar 27 2018, 8:11 PM
Smalyshev created this task.
Smalyshev updated the task description.
Smalyshev claimed this task.

From Leighanna Mixter:

Hi Stas and Leila,

Legal has reviewed and we approve the release. Feel free to share that on the ticket.

I had a look at the sample, but there doesn't seem to be any data. The three compressed files seem empty.

Could you add below what exactly you asked the Legal team to review, so people know what was approved?

Somehow it seems odd that full logs would be published by WMF.

Somehow it seems odd that full logs would be published by WMF.

They won't be. Please see the proposal for what exactly is planned to be published. It is in no way, shape, or form "full logs"; it's highly filtered to remove anything that could even hint at PII.

I can see the data at https://github.com/Wikidata/QueryAnalysis/tree/master/exampleMonthsFolder/exampleMonth/anonymousRawData just fine; the files are not empty. Maybe there was some download issue? You can just click on any file there and download it directly from GitHub. I just checked and it works.

Thanks. I can get it to load now. Not sure what happened. I will have a look.

Is it correct that it includes all queries for these weeks, formatting is not standardized beyond variable names, and all QIDs are included as is?

it includes all queries for these weeks

Yes, all syntactically correct queries.

formatting is not standardized beyond variable names

Not true, queries are reformatted. Please do check https://meta.wikimedia.org/wiki/User:Markus_Kr%C3%B6tzsch/Wikidata_queries#How_is_this_data_generated? - it describes the whole procedure in detail.

all QIDs are included as is

Yes. All P-ids and Q-ids are preserved.

Is there something that concerns you specifically?