
Get the Legal Team to review the Wikidata Query service and make sure it's good for production deployment
Closed, ResolvedPublic

Description

The Legal Team needs to review the Wikidata Query Service before it goes to production to make sure that it's okay for release. Let's make that happen.

Event Timeline

Deskana raised the priority of this task from to High.
Deskana updated the task description.
Deskana set Security to None.
Deskana added subscribers: Deskana, Aklapper.
Deskana added a subscriber: ksmith.

I'm going to reach out to @Slaporte and @ZhouZ via email for this, and pass on any tech questions to @Smalyshev as he's the tech lead for the project.

Smalyshev renamed this task from "Get the Legal Team to review the Maps service and make sure it's good for production deployment" to "Get the Legal Team to review the Wikidata Query service and make sure it's good for production deployment". Jul 7 2015, 10:17 PM

Assigning to me since it's my job to coordinate.

Since I've been asked this question before, here's the situation with data on the Query Service, which may be useful for the legal team:

  1. The service stores only the current snapshot of the data from Wikidata - no edit history, no edit logs, etc. - nothing except the actual data items and links.
  2. The service strives to stay in sync with Wikidata within minutes, though it is possible for it to fall behind due to technical issues - e.g. network problems, disk crashes, etc. - and be several days behind.
  3. Any data deleted from Wikidata is deleted from the service once the update process has caught up, and once it is deleted it is gone - no history is kept. However, since the update process uses the Wikidata recent changes stream, if there is a way to delete information that is not reflected in that stream, the regular update mechanism will not catch it.
  4. There is a possibility of manually syncing specific items or re-uploading dumps, though the latter is not recommended since it takes several days to complete and get back into sync. A manual sync always brings the data into a state that matches the current state of Wikidata.
  5. We plan to log user requests to the service (i.e. query texts, and possibly IPs, depending on how they arrive to us) and store them on the same or similar infrastructure where current query logs are stored. No login or any other authentication is needed or used when querying the service.
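To make points 1-3 concrete, here is a rough sketch of a snapshot-only update loop of the kind described above. This is purely illustrative - all the names (`apply_changes`, `fetch_current`, the in-memory dict store) are hypothetical placeholders, not the actual WDQS Updater code:

```python
# Illustrative sketch of the snapshot-only update loop described above.
# The service keeps just the latest state per entity: an edit replaces
# the stored copy, a deletion removes it, and no history is retained.
# All names here are hypothetical; this is not the real WDQS Updater.

def apply_changes(store, changes, fetch_current):
    """Apply a batch of recent-change events to the snapshot store.

    store         -- dict mapping entity id (e.g. 'Q42') to its current data
    changes       -- iterable of (entity_id, change_type) events
    fetch_current -- callable returning the entity's current state, or None
                     if the entity no longer exists on Wikidata
    """
    for entity_id, change_type in changes:
        if change_type == "delete":
            # Deletion wipes the entity entirely; nothing is archived.
            store.pop(entity_id, None)
        else:
            # For edits we always re-fetch the *current* snapshot, so the
            # store converges to Wikidata's present state even if events
            # arrive late or out of order.
            current = fetch_current(entity_id)
            if current is None:
                store.pop(entity_id, None)
            else:
                store[entity_id] = current
    return store
```

Note that a deletion which never produces a change event (the suppression case discussed below) would simply never reach this loop - which is exactly the gap being debated in this task.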

@Smalyshev That comment probably just saved you an hour of explaining it all in a meeting. Thank you, sir! I'll be sure to point it out to the Legal Team.

  1. Any data deleted from Wikidata is deleted from the service once the update process has caught up, and once it is deleted it is gone - no history is kept. However, since the update process uses the Wikidata recent changes stream, if there is a way to delete information that is not reflected in that stream, the regular update mechanism will not catch it.

Suppression actions do not go through the RC stream; however, you cannot suppress the current revision of an article, only past ones, so it shouldn't be an issue.

@Legoktm yes, correct. I was describing a scenario where, by some mechanism, an article is completely wiped out, bypassing the regular revision process. No idea if such a thing even exists, but if it does, right now the only way to deal with it is a manual update of the Q-id.

After our meeting on 9th June 2015, @ZhouZ has given this service preliminary legal approval, but wants to check with Analytics about our general data logging (e.g. request logs, and so on) before giving formal legal approval.

@Legoktm yes, correct. I was describing a scenario where, by some mechanism, an article is completely wiped out, bypassing the regular revision process. No idea if such a thing even exists, but if it does, right now the only way to deal with it is a manual update of the Q-id.

Err, kind of. If a page is deleted with the suppress checkbox ticked (common when the entire page needs suppression), then no recentchanges entry will be generated, since the log action is private (it appears in neither Special:RecentChanges nor RCStream).

I checked with the Analytics and the logging on this feature is not an issue anymore.

The issue of what to do about suppression of content is still unresolved. The issue is exceedingly rare; looking at the suppression log, there have only been two instances in the past year where content was removed from Wikidata items using suppression. This is not surprising, given the nature of the data in Wikidata; there are few opportunities to enter free-form text which could contain suppressable material.

Given the above, I'm concerned we're over-engineering a solution to a problem that doesn't really exist in any meaningful sense. Not moving ahead with the deployment because of a theoretical problem would be a disappointing outcome.

I would like @Smalyshev and @ZhouZ's thoughts on this.

there have only been two instances in the past year where content was removed from Wikidata items using suppression

Note that simple revision removal is not a problem for us. Only complete suppressed deletion of an article is a problem. That is probably an even less frequent case - in fact, as I said, there's no reason to do it, so the only way it may happen - unless I'm missing something - is some admin mistake.

That's good - this does not seem like a big problem. You're right: we should not hold up the release because of this corner case.

Nonetheless, if a suppression deletion does in fact occur, it could be a sign of a major issue, in which case we would want the Query Service data to be brought into conformance as well.

Can we 1) be notified whenever a complete suppression deletion has occurred, and/or 2) have an easy process to force a manual sync of the Query Service whenever it becomes necessary?

The manual sync process is described here: https://www.mediawiki.org/wiki/Wikidata_query_service/Implementation#Updating_specific_ID (running one command - pretty easy).
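Conceptually, what that one-command manual sync does is force-refresh specific entity IDs regardless of what the change stream saw. A minimal sketch, with hypothetical names (`manual_sync`, `fetch_current`, the dict store) standing in for the real implementation:

```python
# Illustrative sketch of a forced per-entity refresh, bypassing the
# recent-changes stream entirely. Hypothetical names throughout; the
# actual tooling is the one command linked above.

def manual_sync(store, entity_ids, fetch_current):
    """Force-refresh the given entities to Wikidata's current state.

    store         -- dict mapping entity id (e.g. 'Q42') to its current data
    entity_ids    -- the specific IDs an oversighter asked us to re-sync
    fetch_current -- callable returning the entity's current state, or None
                     if the entity no longer exists on Wikidata
    """
    for entity_id in entity_ids:
        current = fetch_current(entity_id)
        if current is None:
            # The entity is gone upstream (e.g. a suppressed deletion),
            # so remove our copy outright.
            store.pop(entity_id, None)
        else:
            store[entity_id] = current
    return store
```

Because the refresh always pulls the current upstream state, this handles exactly the case the thread worries about: a deletion that never produced a change-stream event still disappears from the service once someone asks for a manual sync of that ID.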

Notification is harder, as the current process has none, and the data is accessible only to oversight users.

The issue of what to do about suppression of content is still unresolved. The issue is exceedingly rare; looking at the suppression log, there have only been two instances in the past year where content was removed from Wikidata items using suppression. This is not surprising, given the nature of the data in Wikidata; there are few opportunities to enter free-form text which could contain suppressable material.

Given the above, I'm concerned we're over-engineering a solution to a problem that doesn't really exist in any meaningful sense. Not moving ahead with the deployment because of a theoretical problem would be a disappointing outcome.

It's not a theoretical problem, it's a real one. You said yourself that two such suppressions have occurred. This makes it pretty clear that we need to protect suppressed content, and the fact that WDQS doesn't is a security issue (hence the bug).

The solution I proposed in T105427#1448872 is rather simple, and automates what the legal team apparently wants to do manually.

there have only been two instances in the past year where content was removed from Wikidata items using suppression

Note that simple revision removal is not a problem for us. Only complete suppressed deletion of an article is a problem. That is probably an even less frequent case - in fact, as I said, there's no reason to do it, so the only way it may happen - unless I'm missing something - is some admin mistake.

Looking more closely, there have been six instances of suppression deletion in the entire history of Wikidata. So yes, practically speaking, a very rare issue. The most hassle-free way to handle this is, IMO, to manually refresh/delete individual pages from WDQS's copy of the database if we are asked to do so.

Thanks @Smalyshev and @ZhouZ! :-)

@Legoktm I think you misunderstand the issue (or I do). As far as I can see, there is no security issue, as you can easily suppress any actual content without a suppress-deletion, by merely deleting the actual claims from Wikidata and suppressing the revisions that contained them. The only thing that would not work is actually deleting the ID itself (which looks like Q12345). Maybe I am missing something, but what exactly would we be trying to protect in Q12345?

I have provided a patch along the lines you suggested, but so far haven't gotten approval for it from the Wikidata team.

I have provided a patch along the lines you suggested, but so far haven't gotten approval for it from the Wikidata team.

That's the important part. There is no consensus among the involved engineers on how to solve this issue more rigorously. In the absence of a consensus on a solution, manual updating is enough.

@ZhouZ Does this task now have official signoff from you given that we have resolved all outstanding issues? Is there anything else we need to take care of or discuss?

@ZhouZ has said to me that the Wikidata Query Service has legal signoff, so closing this as resolved. Thanks!