
potential issues with planned release of query logs (Wikidata Query Server)
Closed, Resolved · Public

Description

There seem to be a couple of potential issues with the release planned at T183020:

  • The period covered is fairly large (2 months)
  • All queries in the period are being released.
  • Items used in the query are included in the data being released.
  • Approximate coordinates are included in the data being released.
  • Given the number of queries being released, it's likely that these can be correlated with individual users.
  • Users are invited on-wiki to help and share their queries with other users, if they so choose. At no point are they advised that any queries will be released anyway.
  • Users are not advised in advance that such a release will take place. Ideally users would be advised before the collecting period.
  • Users can't opt out of the planned release, e.g. by using a separate server or server endpoint.
  • It's unclear if the data request is consistent with the researcher's organization's policies.
  • The research question isn't known (or maybe I missed it). It's unclear if it actually needs the data.
  • It's unclear if a published query for specific patterns couldn't be a better formulation of the research question.

Obviously, these issues are less likely to affect users who only use Wikidata Query Server and don't edit Wikipedia or Wikidata. They are also less likely to affect users who don't occasionally or regularly publish queries. Users who only occasionally contribute might be less likely to be affected.

Given recent issues with data releases by organizations for pseudo-research, this should be looked into more carefully.

Event Timeline

The period covered is fairly large (2 months)
All queries in the period are being released.
Items used in the query are included in the data being released.
Approximate coordinates are included in the data being released.

All correct. Though we could drop the coordinates, for example, if it's a concern (I believe it's not, due to very coarse granularity, but if it's a concern, it can be changed). Removing items would make most queries pointless, so that probably can't be achieved.

Given the number of queries being released, it's likely that these can be correlated with individual users.

Here I would like to hear more: how is it possible that "these can be correlated with individual users" given that no indication of queries belonging to users exists? Could you identify a specific concern here and what should be changed to remove it?

Users are invited on-wiki to help and share their queries with other users, if they so choose. At no point are they advised that any queries will be released anyway.

The queries that are published on wiki are trivially linked to the users that published and discussed them, by virtue of observing the edit history, which is public. For these queries, that provides far more information than the query log release ever would.

Users are not advised in advance that such a release will take place.

I think this is covered in our own privacy policy:

Similarly, we share non-Personal Information or aggregated information with researchers, scholars, academics, and other interested third parties who wish to study the Wikimedia Sites. Sharing this information helps them understand usage, viewing, and demographics statistics and patterns. They then can share their findings with us and our users so that we can all better understand and improve the Wikimedia Sites.

Users can't opt out of the planned release, e.g. by using a separate server or server endpoint.

I am not sure what you mean by "opting out": since no user information is ever released, what would they be opting out of? We could of course trivially implement functionality to ignore query requests that carry a certain query parameter or the like when processing, but I do not see what exactly that would achieve. The queries are not linked to the users in any way; all you know is that somebody ran this query sometime during the period discussed. If even that is a concern, we could randomize the log order so that the time sequence would not be apparent either. This may block certain research, but I think if we do it within a large range (e.g. within a day) it would still be useful for most statistical purposes.
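To make the "randomize within a day" idea concrete, here is a minimal sketch in Python. It is only an illustration under assumptions, not the actual release tooling: the function name `shuffle_within_day` and the input shape (a list of ISO timestamp and query text pairs) are hypothetical.

```python
import random
from collections import defaultdict
from datetime import datetime


def shuffle_within_day(entries, seed=None):
    """Drop time-of-day and shuffle query order inside each calendar day.

    `entries` is a list of (iso_timestamp, query) tuples; this shape is an
    assumption made for the sketch. Returns a list of (date, query) tuples.
    """
    rng = random.Random(seed)
    by_day = defaultdict(list)
    for timestamp, query in entries:
        # Keep only the calendar date; the time of day is discarded.
        day = datetime.fromisoformat(timestamp).date()
        by_day[day].append(query)

    released = []
    for day in sorted(by_day):
        queries = by_day[day]
        rng.shuffle(queries)  # order within the day no longer reflects time
        released.extend((day.isoformat(), query) for query in queries)
    return released
```

Per-day counts are preserved, so daily statistics remain usable, while the sequence of queries inside a day can no longer be read off the released data.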

It's unclear if the data request is consistent with the researcher's organization's policies.

I am not sure I understand this. Why should we be concerned about other organizations' policies?

The research question isn't known (or maybe I missed it). It's unclear if it actually needs the data.

The research question is not very relevant, since this is not a request for specific research. Such requests (for specific research) are already made and granted (if warranted), and this is covered, including Marcus' research. This is a request for data to be released for all researchers to use.

It's unclear if a published query for specific patterns couldn't be a better formulation of the research question.

I am not sure what you are suggesting here, could you explain? What "specific patterns" do you mean?

It's also less likely to affect users who don't occasionally or regularly publish queries.

As I noted above, users who occasionally or regularly publish queries already have much more information about them in our public editing history than we ever intend to provide in this proposal. Moreover, these users are exactly the same people that probably would be interested whether the queries they published have been popular or not (only one example of the research that can be done).

Given recent issues with data releases by organizations for pseudo-research

I am not sure how this is relevant to what we are doing. Could you please explain this?

this should be looked into more carefully.

Could you describe how this careful looking would be done: who will be doing it, in what timeframe, and how can it be discussed? I am not sure what procedure here would be satisfactory. We have the approval of the Analytics & Legal teams, and if we get the approval of the Security team too, who do you think should also be included to approve it? What do you think should be the procedure for approval that is missing now?

Sorry, I had confused two release requests, so part of the answer above was not relevant to T183020: Investigate the possibility to release Wikidata queries. I've updated the response above to correct it.

Smalyshev renamed this task from potential issues with planned release of server logs (Wikidata Query Server) to potential issues with planned release of query logs (Wikidata Query Server). Jun 20 2018, 10:53 PM

Clearly we disagree on the proposed release, its scope and users' expectations for privacy. If one thinks it's normal to publish users' search history, the arguments you advance are probably a valid point of view.

I think similar arguments were given by some organizations that faced severe backlash when their approach became more widely known, and I doubt the persons responsible would have gotten off by saying "We don't have the expertise needed in Research to be able to provide the sign off".

Advising users in advance that all their queries would be published unless they use https://query-2.wikidata.org instead of https://query.wikidata.org seems preferable.

If there are several similar releases, this obviously applies to all of them. Which ones are the others?

Clearly we disagree on the proposed release, its scope and users' expectations for privacy

Maybe, I don't know, that's what I am trying to find out. So far I am not clear what your concerns are really about and how they relate to what we are doing. I suspect that many of these concerns stem from a misunderstanding of the nature of what we're doing, but maybe the misunderstanding is on my side. That's why I ask questions, to gain more clarity.

If one thinks it's normal to publish users' search history, the arguments you advance are probably a valid point of view.

I am not sure who the "one" is, but it certainly has nothing to do with our case: we're not publishing any search histories for any users.

Advising users in advance that all their queries would be published unless they use https://query-2.wikidata.org instead of https://query.wikidata.org seems preferable.

"Their" queries would not be published, since there's no way to link queries to the users. Queries may be published, but they are just queries, not related to any user, so there is no way to know whether they are "their" queries or somebody else's. We take specific steps to remove all information that could make the query unique or identify the users: we remove variables, comments, locations, etc. The only thing that is left is the essence of the query, the usage of data. I am not sure which privacy problem you see in this; that's why, again, I am asking the questions.
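As a rough illustration of the kind of scrubbing described above, the following Python sketch drops comments from a SPARQL query and replaces variable names with generic ones. It is not the actual anonymization pipeline. The function name `scrub_query` is hypothetical, reading "remove variables" as "rename them to `?var1`, `?var2`, ..." is an assumption, and the sketch uses a simple regular expression rather than a real SPARQL parser, so it does not cover locations, triple-quoted literals, or unusual spacing around `<` and `>` operators.

```python
import re

# Alternatives are tried left to right, so a '#' inside an IRI or a quoted
# literal is protected before the comment rule can see it.
_TOKEN = re.compile(
    r'(?P<iri><[^<>\s]*>)'
    r'|(?P<string>"(?:[^"\\\n]|\\.)*"|\'(?:[^\'\\\n]|\\.)*\')'
    r'|(?P<comment>#[^\n]*)'
    r'|(?P<var>[?$][A-Za-z_][A-Za-z0-9_]*)'
)


def scrub_query(query):
    """Return `query` without comments and with generic variable names."""
    names = {}  # original variable name -> generic replacement

    def replace(match):
        kind = match.lastgroup
        if kind == "comment":
            return ""  # comments may carry personal notes; drop them
        if kind == "var":
            key = match.group()[1:]  # strip the leading '?' or '$'
            names.setdefault(key, "?var%d" % (len(names) + 1))
            return names[key]
        return match.group()  # IRIs and string literals are kept as-is

    return _TOKEN.sub(replace, query)
```

The same variable always maps to the same generic name within one query, so the structure of the query (which items and properties are used, and how they are joined) is kept while user-chosen names and comments are not.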

So you think it's not possible to link some of the queries to specific users?

Anyways, I think the advance notice and an alternative offered to users to opt out of this is preferable. WMF took steps ahead of other websites to offer privacy features to its users and this shouldn't be undermined by publishing their logs.

BTW Which other publications of users' searches on query.wikidata.org are being reviewed?

So you think it's not possible to link some of the queries to specific users?

Yes, this is my opinion. If you have counter-examples, I will be glad to hear them and improve the process to fix it.

WMF took steps ahead of other websites to offer privacy features to its users and this shouldn't be undermined by publishing their logs.

I am still not sure how this infringes anybody's privacy. We are not publishing "their" logs; we are publishing anonymized data from which any actual or potential PII is stripped. There's no way to know who runs which query, neither personally, nor by wiki or any other identity, nor by IP. If you see something in a SPARQL query that could still leak PII, please point it out and we can discuss.

BTW Which other publications of users' searches on query.wikidata.org are being reviewed?

Right now nothing else is being reviewed, but there are requests like T143819 and other similar ones which we may consider in the future, using similar anonymization procedures.

Currently this bug is marked restricted security. However, I'm not sure it needs to be security restricted. Are there any objections to me making this bug publicly viewable?

Smalyshev claimed this task.
Smalyshev changed the visibility from "Custom Policy" to "All Users".

I don't think there's anything left to do for our team here.

Visibility should be Public, not "All Users"

Smalyshev changed the visibility from "All Users" to "Public (No Login Required)". Feb 23 2019, 12:41 AM
Smalyshev updated the task description.
sbassett triaged this task as Medium priority. Oct 16 2019, 5:33 PM