Investigate the possibility to release Wikidata queries
Closed, Resolved · Public
Authored By

	leila
	Dec 15 2017, 6:46 PM

Description

As part of the research on understanding Wikidata queries [1], Markus et al. are interested in releasing a dataset that can empower researchers and community members (the editor and developer communities) to build on top of their learnings. This is aligned with, and heavily encouraged by, Research, given our Open Access Policy [2].

To this end, Markus has written an initial proposal for us to get started on investigating the possibility of such a release and the specifics of what data we're talking about. We can follow a step-by-step process similar to an earlier investigation we did for releasing pageview traces [3].

The current list of steps is as follows:

[Stas] Please review the proposal [4] and provide feedback on the Discussion page. Ping Leila once you're done.
[Leila] Review the proposal and Stas' comment. Determine the next steps.
Team manager's signoff (@EBjune)
Legal signoff (T190874)
Security signoff (T190875)

Timeline: Let's aim for having a Yes/No decision no later than early March 2018.

[1] https://meta.wikimedia.org/wiki/Research:Understanding_Wikidata_Queries
[2] https://wikimediafoundation.org/wiki/Open_access_policy
[3] https://meta.wikimedia.org/wiki/Research:Improving_link_coverage/Release_page_traces
[4] https://meta.wikimedia.org/wiki/User:Markus_Kr%C3%B6tzsch/Wikidata_queries

Related Objects

Status	Assigned	Task
Resolved	leila	T135083 Create a formal collaboration for WDQS research
Resolved	Smalyshev	T200658 Release of processed queries from WDQS queries research
Resolved	Smalyshev	T183020 Investigate the possibility to release Wikidata queries
Resolved	Smalyshev	T190874 Legal review for Wikidata queries data release proposal
Resolved	Bawolff	T190875 Security review for Wikidata queries data release proposal

Event Timeline

leila triaged this task as Medium priority. · Dec 15 2017, 6:46 PM

leila created this task.

Restricted Application added a subscriber: Aklapper. · Dec 15 2017, 6:46 PM

@Smalyshev Can you review the task description and, if it makes sense to you, assign the task to yourself? (If you agree with the description, the first step of the work ahead is yours.)

Lydia_Pintscher added a project: Wikidata. · Dec 15 2017, 7:18 PM

Smalyshev claimed this task. · Dec 16 2017, 10:30 AM

Smalyshev added a project: User-Smalyshev.

Smalyshev moved this task from Backlog to Next on the User-Smalyshev board.

I've reviewed the proposal and it looks good to me. In fact, it aligns with my thinking about how we should make data from SPARQL logs available (see T143819: Data request for logs from SparQL interface at query.wikidata.org), though I think we should eventually find a way to provide such data sets on a regular basis. I'll add more of my thinking on the subject to T143819, but so far I think it's fine.

One further thing to consider (probably not for the current data set, but for applying the process to further applications) is that data can be not only strings but also numeric values and URIs. In the future, e.g. for the SDC General project, we could have URIs that identify particular users (e.g. as the author of content in Commons - T127929: [Story] Add a new datatype for linking to creators of artwork and more (smart URI)). Also, right now (though probably not in the data set in question) we produce both a string and a URI for an external ID. While in most cases I can think of an external ID is not PII (after all, it's an ID specifically invented to be a public identifier), it could potentially reveal some information, such as people looking for a particular book. So if we want to apply the same process to future data sets, we need to be aware of this and think about solutions.

I agree with Stas: regular data releases are desirable, but need further thought. The task is easier for our current case since we already know what is in the data. For a regular process, one has to be very careful to monitor potential future issues. By releasing historic data, we avoid exploits that could be theoretically possible based on detailed knowledge of the methodology.

Regarding external IDs, one could whitelist unproblematic IDs that can be preserved, and obfuscate others. I agree that authority control IDs might identify humans, since they scope over so many things that are tied to particular humans (books, authors, etc.) that one could have a hypothetical situation where the interest in a particular item would already suggest who asked the query. I don't think something similar is even theoretically plausible for other IDs (e.g., for proteins or stars). Even for book IDs, the lack of user traces makes it very hard to exploit this data further (the certainty you get from a single query being asked can hardly be high, and a query that helps you to guess who asked it will often not be interesting in its own right -- most likely you would want to know what else the identified person has asked). Anyway, we could restrict the "numerical strings are ok" rule to whitelisted properties for our current release. The main reason we have it at all is things like Blazegraph's "radius" service parameter, which has to be a number but is given as a string (I think the gas service might have similar cases).
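To make the "numerical strings are ok" rule concrete, here is a minimal sketch of string obfuscation that preserves numeric literals. It is not the code used for the release: the function name `obfuscate_strings`, the `"stringN"` placeholder scheme, and the regular expressions are all assumptions made for illustration. A real implementation would presumably work on a parsed query and could restrict the numeric exception to whitelisted properties, which this sketch does not attempt.

```python
import re

# Matches a double-quoted SPARQL string literal, allowing backslash escapes.
# Assumption: single-quoted and triple-quoted literals are ignored for brevity.
_STRING_LITERAL = re.compile(r'"((?:[^"\\]|\\.)*)"')

# Assumption: "numeric" means an optionally signed integer or decimal.
_NUMERIC = re.compile(r'-?\d+(?:\.\d+)?')


def obfuscate_strings(query, keep_numeric=True):
    """Replace string literals with placeholders ("string1", "string2", ...).

    The same literal always maps to the same placeholder within one query.
    Numeric strings are left untouched when keep_numeric is True.
    """
    placeholders = {}

    def replace(match):
        value = match.group(1)
        if keep_numeric and _NUMERIC.fullmatch(value):
            return match.group(0)  # keep e.g. "100" for a radius parameter
        if value not in placeholders:
            placeholders[value] = "string%d" % (len(placeholders) + 1)
        return '"%s"' % placeholders[value]

    return _STRING_LITERAL.sub(replace, query)


example = '?place bd:serviceParam wikibase:radius "100" . ?x rdfs:label "Jane Doe"@en .'
print(obfuscate_strings(example))
```

In this sketch the radius value survives while the label is replaced; passing `keep_numeric=False` would obfuscate both.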

There is a general limitation to potential exploits of SPARQL logs for breaching someone's privacy. If you don't control the software that formulated the query, then you can only connect queries to people if you already knew that only this person would ask this query. But then you learn very little by observing the query! On the other hand, if you control the software, then it would usually be easy to gather user data more directly, without needing the detour across some SPARQL logs released months later. One exception that might be relevant in the future is the use of SPARQL from Lua built-ins or MediaWiki tags on Wikipedia pages, which could in theory expose some page traffic. This is not relevant for our historic logs, and it would be hard to fully exploit due to parser caches and crawler-based hits, but it might become a theoretical issue nonetheless. To avoid it, one could either filter all Wikipedia servers from the logs, or use a separate SPARQL service for such requests (as discussed in Berlin), whose logs would not be released.
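The first mitigation mentioned above, filtering requests that originate from Wikipedia servers, could look roughly like the following sketch. Everything specific here is an assumption: the record layout with a `client_ip` field is hypothetical, and the network `192.0.2.0/24` is a reserved documentation range standing in for whatever internal address ranges would actually apply.

```python
import ipaddress

# Assumption: placeholder range (reserved for documentation), not a real
# list of Wikimedia server networks.
INTERNAL_NETWORKS = [ipaddress.ip_network("192.0.2.0/24")]


def is_internal(ip):
    """Return True if the address falls inside any configured internal network."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in INTERNAL_NETWORKS)


def filter_external(records):
    """Keep only log records whose (hypothetical) client_ip field is not internal."""
    return [r for r in records if not is_internal(r["client_ip"])]


records = [
    {"client_ip": "192.0.2.10", "query": "SELECT ..."},    # internal, dropped
    {"client_ip": "198.51.100.7", "query": "SELECT ..."},  # external, kept
]
print(filter_external(records))
```

The alternative discussed above, a separate SPARQL service whose logs are never released, would avoid the need for such filtering altogether.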

Considering our current dataset, it seems that even the obfuscation of strings is more than one would have to do, but in the future one might indeed have to extend the obfuscation to external URLs if they become more common in queries.

Smalyshev reassigned this task from Smalyshev to leila. · Dec 17 2017, 6:54 AM

Smalyshev moved this task from Next to Waiting/Blocked on the User-Smalyshev board. · Dec 17 2017, 7:21 AM

Lydia_Pintscher moved this task from incoming to monitoring on the Wikidata board. · Dec 18 2017, 3:09 PM

Smalyshev updated the task description. · Dec 18 2017, 11:47 PM

Smalyshev mentioned this in T143819: Data request for logs from SparQL interface at query.wikidata.org. · Dec 19 2017, 1:41 AM

leila moved this task from Backlog to Time Sensitive on the Research board. · Feb 2 2018, 4:54 PM

I did one pass and left a couple of comments at https://meta.wikimedia.org/wiki/User_talk:Markus_Kr%C3%B6tzsch/Wikidata_queries

leila moved this task from Time Sensitive to In Progress on the Research board. · Mar 19 2018, 10:42 PM

@Smalyshev can you check my comment at https://meta.wikimedia.org/wiki/User_talk:Markus_Kr%C3%B6tzsch/Wikidata_queries and let me know if this is something your team is willing to pick up?

leila moved this task from In Progress to Backlog on the Research board. · Mar 26 2018, 8:51 PM

@leila I can probably review it, but I am not sure what "sign off" looks like. Is it just me saying "I'm ok with it", or is something more formal required?

@Smalyshev I would say you need your team manager's sign-off, plus Security's and Legal's. Given that you're deeply familiar with this data and how it's processed, you're perhaps in the best position to have these conversations with the three people/entities.

Smalyshev added a subscriber: EBjune. · Mar 27 2018, 8:01 PM

Smalyshev closed subtask T190874: Legal review for Wikidata queries data release proposal as Resolved. · Apr 7 2018, 1:20 AM

Smalyshev updated the task description. · Apr 7 2018, 1:22 AM

Esc3300 reopened subtask T190874: Legal review for Wikidata queries data release proposal as Open. · Jun 14 2018, 7:04 AM

Smalyshev closed subtask T190874: Legal review for Wikidata queries data release proposal as Resolved. · Jun 14 2018, 5:12 PM

Smalyshev mentioned this in T197777: potential issues with planned release of query logs (Wikidata Query Server). · Jun 20 2018, 10:53 PM

Bawolff closed subtask T190875: Security review for Wikidata queries data release proposal as Resolved. · Jul 2 2018, 8:27 AM

Smalyshev updated the task description. · Jul 3 2018, 1:05 AM

I think we've got all the approvals for this except for the formal nod from @EBjune. @leila - anything else we need to do here before the release?

@Smalyshev you have my sign-off on this, thanks to you and @leila for persisting in making this important data available to researchers!

Smalyshev updated the task description. · Jul 10 2018, 3:54 AM

Smalyshev moved this task from Waiting/Blocked to Doing on the User-Smalyshev board. · Jul 10 2018, 6:10 AM

In T183020#4409356, @Smalyshev wrote:

@leila - anything else we need to do here before the release?

If you have Legal's, Security's, and the team manager's sign-off, you have checked all the practical boxes I listed earlier. You should be good to go. (I do want to call out that we have devised this process based on what makes sense for this dataset and on past experience. It would be good/essential for WMF to have a process in place.)

Smalyshev added a project: Data-release. · Jul 11 2018, 7:27 PM

DarTar added a subscriber: Andrawaag. · Jul 19 2018, 1:16 PM

Gstupp subscribed. · Jul 19 2018, 9:38 PM

Smalyshev moved this task from Doing to Waiting/Blocked on the User-Smalyshev board. · Jul 23 2018, 4:43 PM

Andrawaag merged a task: T143819: Data request for logs from SparQL interface at query.wikidata.org. · Jul 23 2018, 9:36 PM

Andrawaag added subscribers: I9606, Esc3300, JAllemandou and 8 others.

Since we're done with the "investigate" part and the decision is made, I've created a new task for the actual release process: T200658: Release of processed queries from WDQS queries research.

Smalyshev closed this task as Resolved. · Jul 30 2018, 4:57 AM

Smalyshev claimed this task.
