
Increase of training data retention (>90 days) is validated with Legal / Privacy
Closed, Resolved · Public

Description

Query Clicks Data Retention

Executive Summary

The search platform team would like to extend data retention on the daily full-text search query/clicks dataset from the default of 90 days to 13 months. These logs do not contain traditional PII, and thus the standard 90-day retention guidelines may not be applicable. The search platform team has reviewed how these logs are created and has proposed a few improvements that would further protect user privacy when the data is stored beyond the retention period of the source datasets.

Background

The daily query clicks dataset is an unsampled log of user interactions with full-text search on desktop and mobile web. The dataset is not a direct log of user events; rather, it is generated by a daily batch job that combines multiple input sources. The dataset is used as input to the search platform's machine learning (ML) ranking algorithms, and it additionally allows backtesting new ranking approaches against historical data. It contains the search queries that were performed, the pages that were returned, and the pages the user clicked on, along with a session ID that refers to a single user over a short timespan.

PII in search

We collect all the search queries our users enter through our site and APIs in Hadoop (90-day retention). Search query strings are not by nature pure PII like IP addresses, SSNs, or GPS coordinates, but they may contain PII if, for example:

  • A user makes a mistake, e.g. copy/pasting their SSN or email address (or anything else that is considered PII) into the search box
  • Someone legitimately searches to see if their SSN is present in one of our corpora

We don’t have clear numbers on how many queries may contain PII.

In 2012 the WMF released anonymous search logs but quickly took the dump down after realizing that “a small percentage of queries contained information unintentionally inserted by users.” In 2019 we began a project called Glent, intended to improve the quality of the “did you mean” suggestions shown to users on-wiki. Glent maps an incoming user query onto another user’s query that is similar and has more results. End users don’t know whether their suggestion is based on another user’s query or came from the normal statistical suggestion process. The similarity requirements also mean that the output will always be reasonably similar to the input, so no particularly novel information can be retrieved by end users. We have recently received clearance from Legal to retain the aggregate query information Glent uses indefinitely.

Importance of Search Queries in Search Algorithms

It is no secret that search queries can be used to improve search ranking quality. The search platform team operates a project, named Mjolnir, which applies a variety of statistical algorithms to historical user interactions with full-text search. The output of this project is a set of per-wiki ML models that decide the final order of search results displayed to a majority of end users. The models this process generates are free of PII, as they contain no reference to the source search queries. All the computation is performed on the analytics network; the only artifact that ever leaves the analytics network is the model itself.

The ML approaches that have been applied to this dataset so far have been shown, through A/B testing, to improve the search experience on high-volume wikis. The same A/B testing shows decreased performance, compared to a hand-tuned baseline, on wikis outside the top 18 by search volume.

Data Retention

The statistical models we apply to the click logs require seeing the same search query, issued to the same wiki, across multiple search sessions. This results in a snowball effect, where the percentage of usable sessions inside the dataset increases as the dataset grows. Based on previous analysis and experiments, we’ve found that we need at least 300k search sessions to train a useful ranking model. Extrapolating from current event counts, increasing data retention from 90 days to 13 months could potentially double the number of wikis, from 18 to 36, for which we have enough data to train ranking algorithms. This additional data is likely to improve ranking performance on all but the busiest of the currently deployed wikis (i.e., everything except enwiki).
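As a rough check of the arithmetic, 13 months (~396 days) holds about 4.4 times as much data as 90 days, which lowers the per-90-day session volume a wiki needs to reach the 300k training threshold. The figures below are simple calendar approximations, not measured event counts:

```python
SESSIONS_NEEDED = 300_000      # minimum sessions for a useful ranking model
CURRENT_DAYS = 90              # current retention window
PROPOSED_DAYS = 13 * 30.4      # ~396 days in 13 months

ratio = PROPOSED_DAYS / CURRENT_DAYS       # ~4.4x more data
min_per_90_days = SESSIONS_NEEDED / ratio  # ~68k sessions

print(f"{ratio:.1f}x more data; a wiki needs roughly "
      f"{min_per_90_days:,.0f} sessions per 90 days to qualify")
```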

How user privacy is maintained

  • This dataset does not contain user names, IDs, IP addresses, or otherwise identifying information of the user.
  • Session reconstruction assigns a new identifier to searches from the same weak user fingerprint that are more than 30 minutes apart (see the sketch after this list).
  • Rather than storing IP addresses, we store aggregate metadata about the IP address for use in downstream processing.
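The sketch below illustrates the session-reconstruction rule from the second bullet: a fingerprint keeps its current session identifier only while consecutive searches are no more than 30 minutes apart. This is a minimal illustration, assuming events arrive as (fingerprint, timestamp) pairs sorted by time; the names `assign_session_ids` and `SESSION_TIMEOUT` are ours, not the production pipeline's.

```python
from datetime import timedelta
from uuid import uuid4

SESSION_TIMEOUT = timedelta(minutes=30)  # gap that closes a session

def assign_session_ids(events):
    """events: iterable of (fingerprint, timestamp) sorted by timestamp."""
    last_seen = {}  # fingerprint -> (last timestamp, current session id)
    out = []
    for fingerprint, ts in events:
        prev = last_seen.get(fingerprint)
        if prev is None or ts - prev[0] > SESSION_TIMEOUT:
            # More than 30 minutes since this fingerprint's last search:
            # start a fresh session with a brand-new identifier.
            session_id = uuid4().hex
        else:
            session_id = prev[1]
        last_seen[fingerprint] = (ts, session_id)
        out.append((fingerprint, ts, session_id))
    return out
```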

Potential improvements

The session reconstruction, and the identifiers emitted by that reconstruction, are based on a weak fingerprint of the user along with a timeout. Currently, identifiers for sequential sessions, potentially even across days, could be very similar for a given user. This link can be broken across days by hashing session identifiers with a daily random token that is used and then disposed of without being recorded. The downside is that all sessions will be split at day boundaries: a search at 23:59 will never be in the same session as a search at 00:01.
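A minimal sketch of this daily re-keying, assuming session identifiers are strings; the HMAC construction and the name `rekey_session_ids` are illustrative choices, not a description of existing code:

```python
import hashlib
import hmac
import secrets

def rekey_session_ids(session_ids):
    """Hash a day's session ids with a throwaway random token."""
    # Drawn once per daily batch and never persisted, so identifiers
    # from different days cannot be correlated after the run completes.
    daily_token = secrets.token_bytes(32)
    return [
        hmac.new(daily_token, sid.encode("utf-8"), hashlib.sha256).hexdigest()
        for sid in session_ids
    ]
```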
For debugging purposes we maintain in the dataset a unique token representing the search request. This token can be used to link this dataset back to raw operational logs, which contain private data (IP addresses, etc.). If there are no links between individual sessions, there should be no additional risk in carrying this metadata, as no link can be made between the operational logs and sessions beyond the retention period of those logs.
Even without explicit links, repeated user behaviour may allow linking sessions across days. For example, if a user performs the same search every day at 8 am, perhaps an editor interested in pages about a specific topic, dedicated analysis of these logs could find that behaviour and create probabilistic links across sessions. Comparing search results and clicks to page revision times could also generate probabilistic links between this dataset and individual users. These links could be weakened by fuzzing the timestamps on a per-session basis, essentially shifting each session in its entirety by ±15 minutes (see the sketch below).
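A sketch of the per-session fuzzing idea, assuming a session's events are grouped under its identifier; the uniform ±15 minute offset and the name `fuzz_session_timestamps` are our assumptions, not the team's implementation:

```python
import random
from datetime import timedelta

def fuzz_session_timestamps(sessions):
    """sessions: dict of session_id -> list of datetime timestamps."""
    fuzzed = {}
    for session_id, timestamps in sessions.items():
        # One offset per session keeps ordering and spacing within the
        # session intact while weakening alignment with external timelines.
        offset = timedelta(seconds=random.randint(-900, 900))  # +/- 15 min
        fuzzed[session_id] = [ts + offset for ts in timestamps]
    return fuzzed
```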

Appendix

Structure of stored data

| Classification | Field | Description | Reasoning |
| --- | --- | --- | --- |
| public | wikiid | The wiki the search occurred on | |
| semi-private | hits | List of page IDs returned | Derived from private data |
| semi-private | request_set_token | Random ID from operational logs | Links this record to more private datasets |
| semi-private | q_by_ip_day | Number of searches by same IP and day | Metadata about the IP address |
| private | query | Text input from the user | Users search for private things and don't expect it to be shared |
| private | timestamp | Second-resolution timestamp of the search | Direct record of user behavior |
| private | clicks | List of page IDs clicked, with second-resolution timestamps | Direct record of user behavior |
| private | session_id | Identifier of a single search session, typically 30 minutes | Associates multiple rows within the dataset |

Event Timeline

There is a new process to assess potential privacy risks. I've moved our proposal into the description of this ticket, copied relevant parts of the proposal into the Privacy Review Template along with links back to this ticket, and submitted the proposal to Privacy and Legal.

@EBernhardson I emailed the stakeholders with my updates to the privacy risk analysis. Please advise if you need anything else on this task or if I can resolve it. Thx!