
Investigate the possibility to release Wikidata queries
Closed, Resolved, Public

Description

As part of the research on understanding Wikidata queries [1], Markus et al. are interested in releasing a dataset that can empower researchers and community members (the editor and developer communities) to build on top of their learnings. This is aligned with, and heavily encouraged by, Research given our Open Access Policy [2].

To this end, Markus has written an initial proposal for us to get started on investigating the possibility of such a release and the specifics of what data we're talking about. We can follow a step-by-step process similar to an earlier investigation we did for releasing pageview traces [3].

The current list of steps is as follows:

  • [Stas] Please review the proposal [4] and provide feedback on the Discussion page. Ping Leila once you're done.
  • [Leila] Review the proposal and Stas' comment. Determine the next steps.
  • Team manager's signoff (@EBjune)
  • Legal signoff (T190874)
  • Security signoff (T190875)

Timeline: Let's aim for having a Yes/No decision no later than early March 2018.

[1] https://meta.wikimedia.org/wiki/Research:Understanding_Wikidata_Queries
[2] https://wikimediafoundation.org/wiki/Open_access_policy
[3] https://meta.wikimedia.org/wiki/Research:Improving_link_coverage/Release_page_traces
[4] https://meta.wikimedia.org/wiki/User:Markus_Kr%C3%B6tzsch/Wikidata_queries

Event Timeline

leila triaged this task as Medium priority. Dec 15 2017, 6:46 PM
leila created this task.

@Smalyshev Can you review the task description and if it makes sense to you assign the task to yourself? (If you agree with the description, the first step of the work ahead is yours.)

Smalyshev added a project: User-Smalyshev.
Smalyshev moved this task from Backlog to Next on the User-Smalyshev board.

I've reviewed the proposal and it looks good to me. In fact, it aligns with my thinking about how we should make data from SPARQL logs available (see T143819: Data request for logs from SparQL interface at query.wikidata.org), though I think we should eventually find a way to provide such data sets on a regular basis. I'll add more of my thinking on the subject to T143819, but so far I think it's fine.

One further thing to consider (probably not for the current data set, but when applying this process to further applications) is that data can be not only strings, but also numeric values and URIs. In the future, e.g. for the SDC General project, we could have URIs that identify particular users (e.g. as the author of content in Commons - T127929: [Story] Add a new datatype for linking to creators of artwork and more (smart URI)). Also, right now (though probably not in the data set in question) we produce both a string and a URI for each external ID. While in most cases I can think of an external ID is not PII - after all, it's an ID specifically invented to be a public identifier - it could potentially reveal some information, such as people looking for a particular book. So if we want to apply the same process to future data sets, we need to be aware of this and think about solutions.

I agree with Stas: regular data releases are desirable, but need further thought. The task is easier for our current case since we already know what is in the data. For a regular process, one has to be very careful to monitor potential future issues. By releasing historic data, we avoid exploits that could be theoretically possible based on detailed knowledge of the methodology.

Regarding external IDs, one could whitelist unproblematic IDs that can be preserved, and obfuscate others. I agree that authority control IDs might identify humans, since they scope over so many things that are tied to particular humans (books, authors, etc.) that one could have a hypothetical situation where the interest in a particular item would already suggest who asked the query. I don't think something similar is even theoretically plausible for other IDs (e.g., for proteins or stars). Even for book IDs, the lack of user traces makes it very hard to exploit this data further (the certainty you get from a single query being asked can hardly be high, and a query that helps you to guess who asked it will often not be interesting in its own right -- most likely you would want to know what else the identified person has asked). Anyway, we could restrict the "numerical strings are ok" rule to whitelisted properties for our current release. The main reason we have the rule at all is things like BlazeGraph's "radius" service parameter, which has to be a number but is given as a string (I think the gas service might have similar cases).
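To make the whitelist idea concrete, here is a minimal sketch of restricting the "numerical strings are ok" rule to whitelisted properties. Everything named here is an assumption for illustration: the predicate names in `WHITELISTED_PREDICATES`, the `sanitize_literal` helper, and the `stringN` placeholder scheme are hypothetical and are not taken from the proposal or from any actual processing code.

```python
import re

# Hypothetical whitelist of predicates whose numeric string values may be kept.
# Placeholder entries for illustration only, not a vetted list.
WHITELISTED_PREDICATES = {"wikibase:radius"}

# Matches plain integers and decimals such as "100" or "-2.5".
NUMERIC = re.compile(r"^-?\d+(\.\d+)?$")


def sanitize_literal(predicate: str, value: str, placeholders: dict) -> str:
    """Keep a string literal only if it is numeric AND its predicate is
    whitelisted; otherwise replace it with a placeholder.

    `placeholders` maps each original string to its replacement, so the same
    string is replaced consistently within one run (an assumed design choice).
    """
    if predicate in WHITELISTED_PREDICATES and NUMERIC.match(value):
        return value
    if value not in placeholders:
        placeholders[value] = f"string{len(placeholders) + 1}"
    return placeholders[value]


if __name__ == "__main__":
    seen = {}
    print(sanitize_literal("wikibase:radius", "100", seen))
    print(sanitize_literal("example:bookId", "9780000000000", seen))
```

With this shape, a numeric string attached to a non-whitelisted property (for instance a book identifier) is obfuscated even though it looks like a number, which is the restriction suggested above.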

There is a general limitation to potential exploits of SPARQL logs for breaching someone's privacy. If you don't control the software that formulated the query, then you can only connect queries to people if you already knew that only this person would ask this query. But then you learn very little by observing the query! On the other hand, if you control the software, then it would usually be easy to gather user data more directly, without needing the detour across some SPARQL logs released months later. One exception that might be relevant in the future is the use of SPARQL from Lua built-ins or MediaWiki tags on Wikipedia pages, which could in theory expose some page traffic. This is not relevant for our historic logs, and it would be hard to fully exploit due to parser caches and crawler-based hits, but it might become a theoretical issue nonetheless. To avoid it, one could either filter all Wikipedia servers from the logs, or use a separate SPARQL service for such requests (as discussed in Berlin), whose logs would not be released.
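The first option, filtering requests that originate from Wikipedia servers, could look roughly like the sketch below. This is only an illustration under stated assumptions: the record layout (a dict with an `ip` field), the `INTERNAL_NETWORKS` range, and the helper names are all hypothetical, since the real log format and server addresses are not given in this thread.

```python
import ipaddress

# Hypothetical address ranges standing in for "Wikipedia servers".
# The real ranges are not stated in this discussion.
INTERNAL_NETWORKS = [ipaddress.ip_network("10.0.0.0/8")]


def is_internal(ip: str) -> bool:
    """Return True if the address falls inside one of the assumed internal ranges."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in INTERNAL_NETWORKS)


def filter_external(records):
    """Drop log records whose source address is internal, keeping the rest."""
    return [r for r in records if not is_internal(r["ip"])]


if __name__ == "__main__":
    logs = [
        {"ip": "10.1.2.3", "query": "SELECT ..."},      # assumed internal
        {"ip": "198.51.100.7", "query": "SELECT ..."},  # assumed external
    ]
    print(len(filter_external(logs)))
```

A separate SPARQL service for such requests would avoid the need for this kind of filtering, at the cost of running a second endpoint.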

Considering our current dataset, it seems that even the obfuscation of strings is more than one would have to do, but in the future one might indeed have to add external URLs if they become more common in queries.

leila removed leila as the assignee of this task. Mar 26 2018, 8:50 PM

@Smalyshev can you check my comment at https://meta.wikimedia.org/wiki/User_talk:Markus_Kr%C3%B6tzsch/Wikidata_queries and let me know if this is something your team is willing to pick up?

@leila I can probably review it, but I am not sure what "sign off" looks like. Is it just me saying "I'm ok with it", or is something more formal required?

@Smalyshev I would say you need your team manager's sign-off, plus Security's and Legal's. Given that you're deeply familiar with this data and how it's processed, you're perhaps in the best position to have these conversations with the three people/entities.

I think we've got all the approvals for this except for the formal nod from @EBjune. @leila - anything else we need to do here before the release?

@Smalyshev you have my sign-off on this, thanks to you and @leila for persisting in making this important data available to researchers!

@leila - anything else we need to do here before the release?

If you have Legal's, Security's, and the team manager's sign-off, you have checked all the practical boxes I listed earlier. You should be good to go. (I do want to call out that we have devised this process based on what makes sense for this dataset and on past experiences. It would be good/essential for WMF to have a process in place.)

Smalyshev claimed this task.