
Increase of training data retention (>90 days) is validated with Legal / Privacy
Closed, Resolved · Public

Description

Query Clicks Data Retention

Executive Summary

The search platform team would like to extend data retention on the daily full-text search query/clicks dataset from the default of 90 days to 13 months. These logs do not contain traditional PII, and thus the standard 90-day retention guidelines may not be applicable. The search platform team has reviewed how these logs are created and has proposed a few improvements that would further protect user privacy when the data is stored beyond the retention period of the source datasets.

Background

The daily query clicks dataset is an unsampled log of user interactions with full-text search on desktop and mobile web. The dataset is not a direct log of user events; rather, it is generated by a daily batch job that combines multiple input sources. The dataset is used as input to the search platform's machine learning (ML) ranking algorithms, and it additionally allows backtesting new ranking approaches against historical data. It contains the search queries that were performed, the pages that were returned, and the pages the user clicked on, along with a session ID that refers to a single user over a short timespan.

PII in search

We collect all the search queries our users enter through our site and APIs in Hadoop (90-day retention). Search query strings are not by nature pure PII like IP addresses, SSNs, or GPS coordinates, but they may contain PII if, for example:

  • A user makes a mistake, e.g. copy/pasting their SSN or email address (or anything else that is considered PII) into the search box
  • Someone legitimately searches to see if their SSN is present in one of our corpora

We don’t have clear numbers on how many queries may contain PII.

In 2012 the WMF released anonymous search logs but quickly took the dump down after realizing that “a small percentage of queries contained information unintentionally inserted by users.” In 2019 we began a project called Glent, intended to improve the quality of the “did you mean” suggestions shown to users on-wiki. Glent maps an incoming user query onto another user’s query that is similar and has more results. End users don’t know whether their suggestion is based on another user’s query or came from the normal statistical suggestion process. The similarity requirements also mean that the output will always be reasonably similar to the input, so no particularly novel information can be retrieved by end users. We have recently received clearance from Legal to retain the aggregate query information Glent uses indefinitely.

Importance of Search Queries in Search Algorithms

It is no secret that search queries can be used to improve search ranking quality. The search platform team operates a project, named Mjolnir, which applies a variety of statistical algorithms to historical user interactions with full-text search. The output of this project is a set of per-wiki ML models that decide the final order of search results displayed to a majority of end users. The models this process generates are free of PII, as they contain no reference to the source search queries. All the computation is performed on the analytics network; the only artifact that ever leaves the analytics network is the model itself.

The ML approaches that have been applied to this dataset so far have been shown, through A/B testing, to improve the search experience on high-volume wikis. The same A/B testing shows decreased performance, compared to a hand-tuned baseline, on wikis outside the top 18 by search volume.

Data Retention

The statistical models we apply to the click logs require seeing the same search query, issued to the same wiki, across multiple search sessions. This results in a snowball effect, where the percentage of usable sessions inside the dataset increases as the dataset grows. Based on previous analysis and experiments, we’ve found that we need at least 300k search sessions to train a useful ranking model. Extrapolating from current event counts, increasing data retention from 90 days to 13 months could potentially double the number of wikis, from 18 to 36, for which we have enough data to train ranking algorithms. This additional data is likely to improve ranking performance on all but the busiest of the currently deployed wikis (i.e., everything except enwiki).
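As a rough check of the arithmetic, 13 months (~396 days) holds about 4.4 times as much data as 90 days, which lowers the per-90-day session volume a wiki needs to reach the 300k training threshold. The figures below are simple calendar approximations, not measured event counts:

```python
SESSIONS_NEEDED = 300_000      # minimum sessions for a useful ranking model
CURRENT_DAYS = 90              # current retention window
PROPOSED_DAYS = 13 * 30.4      # ~396 days in 13 months

ratio = PROPOSED_DAYS / CURRENT_DAYS       # ~4.4x more data
min_per_90_days = SESSIONS_NEEDED / ratio  # ~68k sessions

print(f"{ratio:.1f}x more data; a wiki needs roughly "
      f"{min_per_90_days:,.0f} sessions per 90 days to qualify")
```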

How user privacy is maintained

  • This dataset does not contain user names, IDs, IP addresses, or otherwise identifying information of the user.
  • Session reconstruction assigns a new identifier to searches from the same weak user fingerprint that are more than 30 minutes apart (see the sketch after this list).
  • Rather than storing IP addresses, we store aggregate metadata about the IP address for use in downstream processing.
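The sketch below illustrates the session-reconstruction rule from the second bullet: a fingerprint keeps its current session identifier only while consecutive searches are no more than 30 minutes apart. This is a minimal illustration, assuming events arrive as (fingerprint, timestamp) pairs sorted by time; the names `assign_session_ids` and `SESSION_TIMEOUT` are ours, not the production pipeline's.

```python
from datetime import timedelta
from uuid import uuid4

SESSION_TIMEOUT = timedelta(minutes=30)  # gap that closes a session

def assign_session_ids(events):
    """events: iterable of (fingerprint, timestamp) sorted by timestamp."""
    last_seen = {}  # fingerprint -> (last timestamp, current session id)
    out = []
    for fingerprint, ts in events:
        prev = last_seen.get(fingerprint)
        if prev is None or ts - prev[0] > SESSION_TIMEOUT:
            # More than 30 minutes since this fingerprint's last search:
            # start a fresh session with a brand-new identifier.
            session_id = uuid4().hex
        else:
            session_id = prev[1]
        last_seen[fingerprint] = (ts, session_id)
        out.append((fingerprint, ts, session_id))
    return out
```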

Potential improvements

The session reconstruction, and the identifiers emitted by that reconstruction, are based on a weak fingerprint of the user along with a timeout. Currently, identifiers for sequential sessions, potentially even across days, could be very similar for a given user. This link can be broken across days by hashing session identifiers with a daily random token that is used and then disposed of without being recorded. The downside is that all sessions will be split at day boundaries: a search at 23:59 will never be in the same session as a search at 00:01.
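A minimal sketch of this daily re-keying, assuming session identifiers are strings; the HMAC construction and the name `rekey_session_ids` are illustrative choices, not a description of existing code:

```python
import hashlib
import hmac
import secrets

def rekey_session_ids(session_ids):
    """Hash a day's session ids with a throwaway random token."""
    # Drawn once per daily batch and never persisted, so identifiers
    # from different days cannot be correlated after the run completes.
    daily_token = secrets.token_bytes(32)
    return [
        hmac.new(daily_token, sid.encode("utf-8"), hashlib.sha256).hexdigest()
        for sid in session_ids
    ]
```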
For debugging purposes we maintain in the dataset a unique token representing the search request. This token can be used to link this dataset back to raw operational logs, which contain private data (IP addresses, etc.). If there are no links between individual sessions, there should be no additional risk in carrying this metadata, as no link can be made between the operational logs and sessions beyond the retention period of those logs.
Even without explicit links, repeated user behaviour may allow linking sessions across days. For example, if a user performs the same search every day at 8 am, perhaps an editor interested in pages about a specific topic, dedicated analysis of these logs could find that behaviour and create probabilistic links across sessions. Comparing search results and clicks to page revision times could also generate probabilistic links between this dataset and individual users. These links could be weakened by fuzzing the timestamps on a per-session basis, essentially shifting each session in its entirety by ±15 minutes (see the sketch below).
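A sketch of the per-session fuzzing idea, assuming a session's events are grouped under its identifier; the uniform ±15 minute offset and the name `fuzz_session_timestamps` are our assumptions, not the team's implementation:

```python
import random
from datetime import timedelta

def fuzz_session_timestamps(sessions):
    """sessions: dict of session_id -> list of datetime timestamps."""
    fuzzed = {}
    for session_id, timestamps in sessions.items():
        # One offset per session keeps ordering and spacing within the
        # session intact while weakening alignment with external timelines.
        offset = timedelta(seconds=random.randint(-900, 900))  # +/- 15 min
        fuzzed[session_id] = [ts + offset for ts in timestamps]
    return fuzzed
```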

Appendix

Structure of stored data

| Classification | Field | Description | Reasoning |
| --- | --- | --- | --- |
| public | wikiid | The wiki the search occurred on | |
| semi-private | hits | List of page IDs returned | Derived from private data |
| semi-private | request_set_token | Random ID from operational logs | Links this record to more private datasets |
| semi-private | q_by_ip_day | Number of searches by same IP and day | Metadata about the IP address |
| private | query | Text input from the user | Users search for private things and don't expect it to be shared |
| private | timestamp | Second-resolution timestamp of the search | Direct record of user behavior |
| private | clicks | List of page IDs clicked, with second-resolution timestamps | Direct record of user behavior |
| private | session_id | Identifier of a single search session, typically 30 minutes | Associates multiple rows within the dataset |

Event Timeline

There is a new process to assess potential privacy risks. I've moved our proposal into the description of this ticket, copied relevant parts of the proposal into the Privacy Review Template along with links back to this ticket, and submitted the proposal to Privacy and Legal.

@EBernhardson I emailed the stakeholders with my updates to the privacy risk analysis. Please advise if you need anything else on this task or if I can resolve it. Thx!