
Remove user_agent_map from pageview_hourly long term
Closed, Declined; Public

Description

Remove user_agent_map from pageview_hourly long term

A much reduced version of this data is available as part of https://wikitech.wikimedia.org/wiki/Analytics/Data/Browser_general (with e.g. URL, project, referrer class and geographical information removed). The user_agent map per URL is useful in the 90-day period, when it helps us troubleshoot ops issues, and has been used beyond that period in various other analyses.

See:
https://wikitech.wikimedia.org/wiki/Analytics/Data/Pageview_hourly/Identity_reconstruction_analysis

Event Timeline

Browser data has been useful to many teams on Druid.

  • Detailed data can be deleted after 90 days
  • To see browser trends over time, we can load our browser dataset

Currently the pipeline is:

webrequest (60 days) -> pageview_hourly (indefinite)

Proposal is:

webrequest (60 days) -> pageview_hourly (90 days) -> pageview_hourly_sanitized (indefinite)

And pageview_hourly_sanitized would have user_agent_map removed and be sanitized as we planned.
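As a rough illustration of what the sanitizing step in this proposal could look like, here is a minimal Python sketch that drops user_agent_map and re-aggregates counts over the remaining dimensions. It is not the actual job: the function name `sanitize`, the `view_count` field name and the example dimensions are assumptions made for illustration, and the real pipeline would presumably run on the cluster rather than over in-memory rows.

```python
from collections import Counter


def sanitize(rows, drop_fields=("user_agent_map",), count_field="view_count"):
    """Drop the given fields and sum counts over the remaining dimensions.

    `rows` is a list of dicts standing in for pageview_hourly rows.
    The field names are assumptions for this sketch, not the real schema.
    """
    totals = Counter()
    for row in rows:
        # Build a hashable key from every dimension we keep.
        key = tuple(sorted(
            (name, value) for name, value in row.items()
            if name not in drop_fields and name != count_field
        ))
        totals[key] += row[count_field]
    # Rows that differed only in the dropped fields are now merged.
    return [{**dict(key), count_field: count} for key, count in totals.items()]


example_rows = [
    {"project": "en.wikipedia", "page_title": "Main_Page",
     "user_agent_map": {"browser_family": "Chrome"}, "view_count": 3},
    {"project": "en.wikipedia", "page_title": "Main_Page",
     "user_agent_map": {"browser_family": "Firefox"}, "view_count": 2},
]
sanitized = sanitize(example_rows)
```

Here the two example rows differ only in their user agent, so they collapse into a single row whose count is the sum of both.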

Milimetric triaged this task as Medium priority. Feb 2 2017, 5:14 PM
Milimetric moved this task from Incoming to Wikistats on the Analytics board.

I think this is a bad idea. This data has numerous uses way beyond 60 days, see e.g. (just to pick examples I happened to look at this week) T148461#3011175 or T149355#2796903. On the other hand, considering that this is already parsed, aggregated data instead of raw user agents, and only accessible internally under NDA, the privacy benefits would seem limited. What's more, isn't there already a plan for achieving these benefits in a less disruptive way (https://wikitech.wikimedia.org/wiki/Analytics/Data/Pageview_hourly/Sanitization#Solution:_Sanitizing_using_K-Anonymity_over_multiple_fields)?

Browser data is aggregated and kept long term in https://wikitech.wikimedia.org/wiki/Analytics/Data/Browser_general

And on the other hand, considering that this is already parsed, aggregated data instead of raw user agents, and only accessible internally under NDA, the privacy benefits would seem limited.

This is incorrect for several reasons; the most obvious ones are explained here: https://wikitech.wikimedia.org/wiki/Analytics/Data/Pageview_hourly/Identity_reconstruction_analysis

What's more, isn't there already a plan for achieving these benefits in a less disruptive way

Mmm, no: k-anonymization removes many signals from the data (on purpose), so it is no less disruptive. Our core use case there is anonymization of pageviews per geographical location, and removing the parsed user_agent would probably make our calculations in that regard a lot simpler.

Browser data is aggregated and kept long term in https://wikitech.wikimedia.org/wiki/Analytics/Data/Browser_general

Yes, the link was already in the task description. But that table doesn't contain the information necessary for, say, the two aforementioned cases: country (T148461#3011087) and referer_class+URL (for the proposed Google referrals investigation or any other future traffic trend analysis that needs to correct for the huge anomaly we had in July/August). In other words, the claim in the task that "We have that data as part of ..." is not true; I'm going to correct it accordingly, alongside the claim that "user_agent map per url is useful just on the 90 day period when it help us troubleshoot ops issues", which is likewise contradicted by these two examples.

And on the other hand, considering that this is already parsed, aggregated data instead of raw user agents, and only accessible internally under NDA, the privacy benefits would seem limited.

This is incorrect for several reasons; the most obvious ones are explained here: https://wikitech.wikimedia.org/wiki/Analytics/Data/Pageview_hourly/Identity_reconstruction_analysis

Not sure what you mean by "incorrect" - that the privacy benefits are unlimited? ;)
I am well aware of that page (identity reconstruction analysis), and had already linked above to another page that contains a summary of its results. Yes, there are some corner cases where this data would allow deanonymization, meaning we must not make this data public. But they are orders of magnitude less likely than, say, cases where IP addresses allow deanonymization.

In other words, the case for completely removing this data after 60 or 90 days is much weaker here than in the case of IPs. And I also think the case for anonymization is much stronger in the case of published datasets than in the case of private tables only accessible under NDA.

What's more, isn't there already a plan for achieving these benefits in a less disruptive way

Mmm, no: k-anonymization removes many signals from the data (on purpose), so it is no less disruptive. Our core use case there is anonymization of pageviews per geographical location, and removing the parsed user_agent would probably make our calculations in that regard a lot simpler.

"no less disruptive" is not true, at least not for the k-anonymization method your team outlined here and here. It specifically mentions ("os_family", "Android") as an example of user agent data that may be preserved for tuples with enough traffic.

What's more, it seems there is some choice of tradeoffs here. For example, I'm not convinced at all that keeping geolocation information with city (instead of country) resolution is more important than keeping information about browser families. There are certainly k-anonymization choices involving user_agent_map that do not disrupt investigations such as in the two examples above. E.g., as a simple concrete example, observing that in a large country like Pakistan there were more than 100k pageviews any given day for each major browser version examined in the linked chart, it's even possible to achieve k-anonymity for k=100,000 - a very large k - in a way that would still have allowed that particular analysis, whereas the blanket deletion proposed in this task would have prevented it.

BTW, back in 2015 @dr0ptp4kt already made a related recommendation at https://wikitech.wikimedia.org/wiki/Talk:Analytics/Data/Pageview_hourly/Identity_reconstruction_analysis ("I recommend that we retain the following - os_family, os_major, os_minor, browser_family - for all such distinct maps having at least 1000 daily members."). Sadly it seems to have been ignored in this task.
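To make the quoted recommendation concrete, here is a minimal Python sketch of threshold-based retention: the user agent map is reduced to the four recommended keys and kept only when that reduced map has at least 1000 daily pageviews, otherwise it is blanked. Several things are assumptions for illustration: the function name `reduce_user_agents`, the `day` and `view_count` field names, and reading "daily members" as daily pageviews. Note also that this only thresholds on the user agent tuple; it is not full k-anonymity over all fields as described on the sanitization page.

```python
from collections import Counter

# Keys named in the 2015 recommendation quoted above.
RETAINED_KEYS = ("os_family", "os_major", "os_minor", "browser_family")


def reduce_user_agents(rows, min_daily_views=1000, count_field="view_count"):
    """Keep a reduced user agent map only where it is common enough that day.

    `rows` is a list of dicts standing in for pageview_hourly rows; the
    `day` and `view_count` field names are assumptions for this sketch.
    """
    def reduced(row):
        ua = row["user_agent_map"]
        return tuple(ua.get(key) for key in RETAINED_KEYS)

    # First pass: total daily views for each reduced user agent map.
    daily = Counter()
    for row in rows:
        daily[(row["day"], reduced(row))] += row[count_field]

    # Second pass: retain the reduced map above the threshold, else blank it.
    out = []
    for row in rows:
        ua = reduced(row)
        new_row = dict(row)
        if daily[(row["day"], ua)] >= min_daily_views:
            new_row["user_agent_map"] = dict(zip(RETAINED_KEYS, ua))
        else:
            new_row["user_agent_map"] = None
        out.append(new_row)
    return out


android = {"os_family": "Android", "os_major": "6", "os_minor": "0",
           "browser_family": "Chrome Mobile", "browser_major": "55"}
example_rows = [
    {"day": "2017-02-01", "country": "PK", "user_agent_map": android, "view_count": 600},
    {"day": "2017-02-01", "country": "IN", "user_agent_map": android, "view_count": 500},
    {"day": "2017-02-01", "country": "PK",
     "user_agent_map": {"os_family": "Haiku", "browser_family": "WebPositive"},
     "view_count": 5},
]
reduced_rows = reduce_user_agents(example_rows)
```

Raising `min_daily_views` to 100,000 would correspond to the much stricter threshold discussed for the Pakistan example, while still preserving major browser families in high-traffic groups.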

subset of sanitizing, etc.