
Security review for Wikidata queries data release proposal
Closed, ResolvedPublic

Description

We would like to hear the Security team's feedback on the proposal to publish the WDQS query data set that was collected for research by Markus Krötzsch.

The proposal is here: https://meta.wikimedia.org/wiki/User:Markus_Kr%C3%B6tzsch/Wikidata_queries

TL;DR: we would like to release anonymized data about queries performed on WDQS over a period of twelve weeks in summer 2017, which was collected for research done by Markus. The data set has been anonymized and modified to exclude PII. @Smalyshev (for WDQS) and @leila (for the Research team) have reviewed the proposal and it looks fine to us. See the more detailed feedback on the proposal's discussion pages.

The purpose of the data release is to give other researchers insight into how SPARQL services are used in general, and the Wikidata one specifically.

We would like to have the Security team review the proposal and comment on whether there are any security aspects we should consider for this task.

Event Timeline

Smalyshev triaged this task as Medium priority. Mar 27 2018, 8:17 PM
Smalyshev created this task.
Smalyshev updated the task description.
Bawolff added subscribers: APalmer_WMF, Bawolff.

I suspect this is more something that should have a privacy review from WMF-Legal than a security review.

Hi,

So first of all, we'd like to see the code that does the query normalization.

Second, could this have a summary of the types of queries we expect to be most common in the data set? I appreciate there will be a very long tail here, but having a summary of the most common types (broadly speaking) ensures that we have a good understanding of the type of data we expect the data set to contain.

Hi,

The code is here: https://github.com/Wikidata/QueryAnalysis
It was not written for general re-use, so it might be a bit messy in places. The code includes the public Wikidata example queries as test data that can be used without accessing any confidential information.

We have a list of query types ordered by frequency. However, there are millions of query types, and the most frequent are those created by bots. I can dig up a pointer to the local file where we have it, if this is what you want. If you are interested in a broader analysis of the data, you could take a look at a recent workshop paper of ours: https://iccl.inf.tu-dresden.de/web/Inproceedings3196/en
It has detailed statistics of SPARQL feature distributions and discusses some findings.

Question: Looking at https://github.com/Wikidata/QueryAnalysis/blob/master/tools/extractAnonymized.py, at first glance it looks like the string handling code wouldn't handle edge cases properly, e.g.

"foo\"bar"
"foo'bar"

?

(I only skimmed the code, may have misunderstood)
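
For illustration, the two edge cases above can be handled with a pattern that treats a backslash escape as part of the literal body. This is a minimal sketch and is not taken from the QueryAnalysis repository: the function name replace_string_literals and the placeholder scheme are hypothetical, and it ignores triple-quoted literals, comments and IRIs, which a real SPARQL parser would handle.

```python
import re

# A short quoted literal: any run of characters that are not the quote,
# a backslash or a line break, or a backslash followed by one character.
DOUBLE_QUOTED = r'"(?:[^"\\\n\r]|\\.)*"'
SINGLE_QUOTED = r"'(?:[^'\\\n\r]|\\.)*'"
STRING_LITERAL = re.compile(DOUBLE_QUOTED + "|" + SINGLE_QUOTED)


def replace_string_literals(query: str, placeholder: str = "string") -> str:
    """Replace every short string literal with a numbered placeholder.

    Hypothetical helper for illustration only; the placeholder format
    is an assumption, not the format used in the released data.
    """
    counter = 0

    def _substitute(match: re.Match) -> str:
        nonlocal counter
        counter += 1
        return f'"{placeholder}{counter}"'

    return STRING_LITERAL.sub(_substitute, query)
```

With this pattern, an escaped quote inside a literal does not end the literal early, and a single quote inside a double-quoted literal is not mistaken for the start of a new one.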

We have a list of query types ordered by frequency. However, there are millions of query types, and the most frequent are those created by bots. I can dig up a pointer to the local file where we have it, if this is what you want. If you are interested in a broader analysis of the data, you could take a look at a recent workshop paper of ours: https://iccl.inf.tu-dresden.de/web/Inproceedings3196/en
It has detailed statistics of SPARQL feature distributions and discusses some findings.

OK, I think it's good enough that you did that. The main point of the question was to ensure that you had a good idea of the type of data in the data set (i.e. we aren't just releasing data we've never actually looked at).

extractAnonymized.py indeed seems broken, but I don't see anything using it. It seems that anonymization is done by the Java class Anonymizer in QueryAnalysis, which is used by the Anonymize.py script. I'm not sure this is entirely accurate, but I don't see any usage of extractAnonymized.py. @mkroetzsch - could you clarify whether that script is actually used?

The extractAnonymized.py script is indeed not used anywhere, so I've removed it.

Quick rundown of the anonymization process and its code locations:
Stage 1: Parsing of the query here. This uses a slightly modified version of the OpenRDF-Parser which, among other things, sets the default prefixes.
Stage 2, Point 1: Not exactly in the code, but parsing ignores comments.
Stage 2, Point 2: Replacing the strings is done here.
Stage 2, Point 3: The variable renaming code is here.
Stage 2, Point 4: The geographic coordinates are handled here.
Stage 3: The entire rendering is done here.

The Python script Anonymize.py is just for convenience: it supplies the default locations on the server and builds the Maven call.
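
To give a rough idea of the kind of transformation Stage 2, Point 3 refers to, here is a hypothetical sketch of consistent variable renaming. It is an assumption for illustration only: the actual code is Java and works on the parsed query, whereas this sketch uses a regular expression on the query text, and the names rename_variables and var1, var2, ... are made up here.

```python
import re

# A SPARQL variable: '?' or '$' followed by a simple identifier.
# Simplified on purpose; it does not know about strings, comments or IRIs.
VARIABLE = re.compile(r"[?$]([A-Za-z_][A-Za-z0-9_]*)")


def rename_variables(query: str) -> str:
    """Rename variables to var1, var2, ... in order of first appearance.

    Hypothetical helper; the naming scheme is an assumption, not the one
    used by QueryAnalysis.
    """
    mapping: dict[str, str] = {}

    def _substitute(match: re.Match) -> str:
        name = match.group(1)
        if name not in mapping:
            mapping[name] = f"var{len(mapping) + 1}"
        return "?" + mapping[name]

    return VARIABLE.sub(_substitute, query)
```

The same original name always maps to the same new name, so the structure of the query is preserved while any information carried by the variable names themselves is dropped. Doing this on the parsed query rather than on the text avoids touching question marks that appear inside string literals.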

@Bawolff do we have any other concerns, or is this fine?

Sorry, I'm on vacation until Monday. Perhaps someone else on the security team can take a look, or failing that I'll be back on Monday.

(Thank you for your patience, I know this has been delayed multiple times.)

@Bawolff sorry for the nudge, just wondering if you've had a chance to take a final look at this? Appreciate any update, thanks.

Hi @JBennett, adding you to this ticket because it's been blocking us for a while now. I don't think there's much else to look at, but I'm hoping for a final security sign-off so we can move things forward.

Sorry for the delay, this kind of got preempted by T194204 but is now next on my todo list.

As an aside - this sort of thing traditionally doesn't require security team sign-off (afaik), nor have we reviewed things like this in the past - historically it's been legal and maybe analytics only. As far as I know, the security team has no criteria for evaluating this sort of thing (beyond ensuring that it's not blatantly outputting PII), so I'm mostly planning to check that it implements the properties specified in the description on the linked wiki page. I hope that meets what everyone is looking for.

Thanks, Brian, I appreciate you taking a look. Maybe next time it would be
good to know all that stuff about security not usually reviewing this kind
of thing up front, it may save a lot of people a bunch of time and
expectations. I also wouldn't be nudging it if Stas wasn't escalating it to
me ;)

Cheers,

Erika

Erika Bjune
Engineering Manager - Search Platform
Wikimedia Foundation

@Bawolff @EBjune for context re why Security is asked to provide feedback: For data releases, we usually ask for privacy and security feedback if the data may contain private information (either within itself or in combination with other possible datasets that we, WMF, or others may release in the future.) Sometimes we don't have the capacity or expertise to do this in-house in which case we reach out to external privacy experts (check this example), sometimes we ask internally. Some level of such feedback is needed to understand the risks from the expert perspective before these releases.

Bawolff claimed this task.

I think this is ok, and I have no objections, with the caveat that (imo), security's role should be ensuring that stuff meets a certain standard or is safe against a certain threat model. I don't entirely feel confident that I'm competent to review something where the criteria of what we're trying to protect against is unspecified. That said, as far as I can tell, the proposal sounds reasonable.

If we intend to do this again in the future, we should fully disclose on the query.wikidata.org footer that we will release redacted logs.

This is good news -- thanks for the careful review! The lack of specific threat models for this data was also a challenge for us, for similar reasons, but it is also a good sign that, many years after the first SPARQL data releases, there is still no known realistic danger to user anonymity. The footer is still a good idea for general community awareness. People who do have concerns about their anonymity could be encouraged to come forward with scenarios that we should take into account.

Has legal reviewed this? I don't see any comments from them in this ticket. I'd like to sort out a process for reviewing items like this. It's sort of in-between security/privacy/data governance. I'll put together a strawman review process to help us avoid delays and follow up with Stas.

Has legal reviewed this? I don't see any comments from them in this ticket.

Yes, they have. T190874

I'd like to sort out a process for reviewing items like this. It's sort of in-between security/privacy/data governance. I'll put together a strawman review process to help us avoid delays and follow up with Stas.

Please check the approach developed at https://meta.wikimedia.org/wiki/Research:Improving_link_coverage/Release_page_traces in case you want to re-use parts of it. Happy to provide input to what you will put together. :)