If only one IP maps to a given country/region/city data point (or user_agent_map data point), someone with access to the webrequest logs could re-link private data to a full reading history (assuming we don't delete aggregated pageviews). We want to know how often this risk arises so that we can mitigate it.
The goal of this ticket is to establish how easy it is to link records in pageview_hourly to a particular identity (that is, an IP) IF you were to gain access to the cluster and had both the pageview_hourly and webrequest tables at your disposal.
The webrequest table stores IPs, but only for a short time (2 months). Pageview_hourly has geographical information associated with pageviews but no IPs. Pageview_hourly data is stored forever.
Can we link records in pageview_hourly (and thus a viewing session) to records in webrequest by using the geographical information to join the two tables?
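To make the question concrete, here is a minimal sketch of the linkage in plain Python. The row layout, the field names (`ip`, `country`, `region`, `city`, `page_title`) and all sample values are assumptions for illustration only; the real webrequest and pageview_hourly schemas may differ. It only illustrates the idea: when a geography resolves to exactly one IP in webrequest, every pageview_hourly record for that geography can be attributed to that IP.

```python
from collections import defaultdict

# Hypothetical, simplified rows; the real table schemas may differ.
webrequest = [
    {"ip": "192.0.2.1", "country": "CA", "region": "ON", "city": "Toronto"},
    {"ip": "192.0.2.2", "country": "CA", "region": "ON", "city": "Toronto"},
    {"ip": "198.51.100.7", "country": "XX", "region": "Region", "city": "Smalltown"},
]
pageview_hourly = [
    {"country": "CA", "region": "ON", "city": "Toronto", "page_title": "Maple"},
    {"country": "XX", "region": "Region", "city": "Smalltown", "page_title": "Puffin"},
    {"country": "XX", "region": "Region", "city": "Smalltown", "page_title": "Fjord"},
]


def geo_key(row):
    """Join key shared by both tables: the geographical fields."""
    return (row["country"], row["region"], row["city"])


def link_single_ip_geographies(webrequest_rows, pageview_rows):
    """Attribute pageviews to an IP when its geography has exactly one IP."""
    ips_by_geo = defaultdict(set)
    for row in webrequest_rows:
        ips_by_geo[geo_key(row)].add(row["ip"])

    history_by_ip = defaultdict(list)
    for row in pageview_rows:
        ips = ips_by_geo.get(geo_key(row), set())
        if len(ips) == 1:  # unambiguous: only one candidate identity
            (ip,) = ips
            history_by_ip[ip].append(row["page_title"])
    return dict(history_by_ip)


linked = link_single_ip_geographies(webrequest, pageview_hourly)
```

In this toy data the two Toronto IPs stay ambiguous, while the single Smalltown IP picks up the whole reading history recorded for that geography.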
More info from IRC conversation:
1:38 PM <csteipp> nuria: So my specific example when talking with Joseph was, if we have pageviews for a very unique geographic area, and there's only one person in the geography, then someone with access to our IP's and can find one that geolocates there, they have the real id. I'd like to understand how common that is.
1:38 PM <csteipp> nuria: Yes pageviews + geographic is the worst. Very unique UserAgent would be the same issue too.
1:38 PM <csteipp> (iirc)
1:39 PM <nuria> csteipp: understanding that IP when it comes to #G connections is pretty meaningless right?
1:39 PM <nuria> csteipp: cause we are going to find tons of matches for regions in which mobile users share the same IP
1:39 PM <csteipp> #G connections?
1:39 PM <nuria> csteipp: sorry 3G
1:40 PM <csteipp> nuria: that's carrier dependent. Some give IP per device (especially ones that are ipv6), others nat.
1:43 PM <nuria> csteipp: that is less common than global IPs though on my experience, what I wanted to point out is that in mobile IP most of the time would not imply an individual , so: "will we find tons of records on the same city (even a small one) with the same IP? " Yes, we shall.
1:43 PM <csteipp> real id => reading history, not the reverse
1:44 PM <nuria> csteipp: i will spend some time doing queries but -unless I am missing something - the false positives will be "all IOS8 users on verizon on toronto"
1:44 PM <csteipp> nuria: Right. The question is do we have pageviews with very small (single) numbers of IP's that geolocate there.
1:44 PM <csteipp> Not large numbers
<csteipp> nuria: Cool. So just to be clear, running querries where we have geographies with *small* numbers of unique IP's, right?
1:46 PM <csteipp> I think Joseph and I discussed >10
1:46 PM <csteipp> <10, I mean :)
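The check discussed above (geographies with fewer than 10 unique IPs) could be sketched as follows. This is an illustration in plain Python over made-up tuples, not the actual query that would run on the cluster; the tuple layout `(ip, country, region, city)`, the function name `small_ip_geographies` and the sample values are assumptions.

```python
from collections import defaultdict


def small_ip_geographies(rows, threshold=10):
    """Return {geography: unique IP count} for geographies below the threshold.

    rows is an iterable of (ip, country, region, city) tuples; this layout
    is hypothetical and stands in for the relevant webrequest fields.
    """
    ips_by_geo = defaultdict(set)
    for ip, country, region, city in rows:
        ips_by_geo[(country, region, city)].add(ip)
    return {
        geo: len(ips)
        for geo, ips in ips_by_geo.items()
        if len(ips) < threshold
    }


# Made-up sample: one busy geography and one with a single IP.
rows = [(f"203.0.113.{i}", "CA", "ON", "Toronto") for i in range(25)]
rows += [("198.51.100.7", "XX", "Region", "Smalltown")]
rows += [("198.51.100.7", "XX", "Region", "Smalltown")]  # repeat request, same IP

at_risk = small_ip_geographies(rows)
```

Counting distinct IPs rather than requests matters here: a single reader generating many requests still counts as one IP, which is exactly the case of interest.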