
Hashed IP addresses in refined webrequest logs
Closed, Declined (Public)

Description

Research can benefit from hashed IP addresses in the refined webrequest table, provided the hashing uses the same algorithm and salt that are used for hashing IPs in EventLogging (EL).

To give more context: we need this for the reader research documented at https://meta.wikimedia.org/wiki/Research:Characterizing_Wikipedia_Reader_Behaviour, specifically to match EL survey responses to webrequest logs.

Open Question for Chris: what do you think about this?

Event Timeline

leila assigned this task to Ottomata.
leila raised the priority of this task from to Needs Triage.
leila updated the task description.
leila added a project: Improving-access.
DarTar set Security to None.
DarTar moved this task from Staged to Radar on the Research board.
Milimetric triaged this task as Medium priority. (Nov 19 2015, 5:55 PM)
Milimetric moved this task from Incoming to Backlog on the Analytics-Backlog board.
Milimetric updated the task description.
Milimetric added a subscriber: csteipp.

I talked with @ellery about this briefly.

I'd prefer that we don't permanently make this connection between our webrequests and EventLogging for all requests. If this is for a limited period of time, and preferably only for a specific set of pages, then storing this data in Hadoop temporarily while you do analysis for a specific project would be fine, but I don't think we should put it in the refined logs. Are there other options?

The other option (from a technical perspective) is to brute-force the IP from the hash for survey results that you want to analyze further. Since the hashes share a salt, it shouldn't take more than a minute or two to unhash all IPv4 addresses in the table. If you need IPv6, then you can forward-hash IPs from the webrequest logs and probably get all results even faster.
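A minimal sketch of the forward-hash variant, limited to a specific set of survey hashes. The EL hashing algorithm is not stated in this task, so HMAC-SHA256 is an assumption here, and `hash_ip`, `match_survey_to_requests`, and the `ip` field name are hypothetical:

```python
import hashlib
import hmac


def hash_ip(ip: str, salt: bytes) -> str:
    """Keyed hash of an IP string.

    Assumption: HMAC-SHA256 stands in for whatever algorithm EL actually uses.
    """
    return hmac.new(salt, ip.encode("utf-8"), hashlib.sha256).hexdigest()


def match_survey_to_requests(survey_hashes, webrequest_rows, salt):
    """Group webrequest rows under the survey hash that their IP produces.

    survey_hashes: hashed IPs taken from the survey responses of interest.
    webrequest_rows: dicts with an 'ip' key (hypothetical field name).
    Rows whose hash is not in survey_hashes are ignored.
    """
    wanted = set(survey_hashes)
    matches = {}
    for row in webrequest_rows:
        hashed = hash_ip(row["ip"], salt)
        if hashed in wanted:
            matches.setdefault(hashed, []).append(row)
    return matches
```

Because each webrequest IP is hashed exactly once, this works the same way for IPv4 and IPv6 and never needs to enumerate the address space.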

@csteipp Otto mentioned that there is the potential to introduce a request ID. We could associate EventLogging records with a request in the webrequest logs based on this ID. This would eliminate the need for hashed IPs in the webrequest table. Is this approach any better from your perspective?

Separately from privacy concerns about such deanonymizing techniques, a heads-up that the hashed IPs in EL might not be reliable currently: T119144

We don't keep IPs in eventlogging anymore. It would be good to send request-id with webrequest logs if we aren't doing that yet, and it would be good to log request-id with EL too.
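If both sides did log a request ID, the association could be a plain key join with no IPs involved. A rough sketch, assuming a hypothetical `request_id` field on both EL events and webrequest rows (neither field name is confirmed in this task):

```python
def join_on_request_id(el_events, webrequests):
    """Pair each EL event with the webrequest row that shares its request ID.

    Both inputs are iterables of dicts; 'request_id' is a hypothetical
    field name. Events without a matching webrequest are dropped.
    """
    # Index webrequests by ID so each event lookup is a single dict access.
    by_id = {}
    for req in webrequests:
        req_id = req.get("request_id")
        if req_id is not None:
            by_id[req_id] = req

    joined = []
    for event in el_events:
        req = by_id.get(event.get("request_id"))
        if req is not None:
            joined.append((event, req))
    return joined
```

This only links records that were explicitly tagged with the same ID, so it does not create a standing connection between all webrequests and all EL events.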