
Honor DNT header for access logs & varnish logs
Closed, DeclinedPublic

Assigned To
None
Authored By
Gilles
May 12 2015, 7:53 AM
Referenced Files
None
Tokens
"Like" token, awarded by dpatrick. "Love" token, awarded by Dbrant. "Love" token, awarded by yuvipanda. "Like" token, awarded by Dzahn.

Description

The change to make EventLogging respect Do Not Track has just been merged (https://gerrit.wikimedia.org/r/#q,I17ee55f464743580b18f6e594fdfc73a3d76f1cf,n,z), but I imagine that our analytics still rely heavily on access logs and varnish logs, which are a form of tracking.

If we're striving to honor DNT in its purest ideological form, people who have it set shouldn't be tracked at all, even for our internal tracking (that's the spirit of the EL change, I imagine).

If technically feasible (apparently possible with Apache: http://donottrack.us/server), should we simply not record entries in the access logs/varnish logs if DNT is turned on? That seems like the simplest solution and it's the safest in terms of privacy for these users. We'd have to accept the fact that those users don't show up in any statistics at all. But since we accept that they don't show up in EL-based statistics anymore, this seems like the next logical step to me.
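To make the proposal concrete, here is a minimal sketch of "don't write a log entry when DNT is on". It is not how Apache or Varnish are configured in production; the function names `dnt_enabled` and `access_log_line` and the log format are made up for illustration.

```python
from typing import Mapping, Optional


def dnt_enabled(headers: Mapping[str, str]) -> bool:
    """Return True when the request carries `DNT: 1`.

    HTTP header names are case-insensitive, so compare in lower case.
    """
    for name, value in headers.items():
        if name.lower() == "dnt":
            return value.strip() == "1"
    return False


def access_log_line(method: str, path: str, headers: Mapping[str, str]) -> Optional[str]:
    """Build an access log line, or return None so nothing is written for DNT requests.

    The log format here is a hypothetical placeholder, not a real server format.
    """
    if dnt_enabled(headers):
        return None
    user_agent = headers.get("User-Agent", "-")
    return f'{method} {path} "{user_agent}"'
```

A caller would simply skip the write when `access_log_line` returns `None`, which is why those requests would vanish from every log-derived statistic.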

Event Timeline

Gilles raised the priority of this task from to Needs Triage.
Gilles updated the task description.
Gilles subscribed.
Dzahn triaged this task as Medium priority. May 26 2015, 9:51 PM
Dzahn subscribed.

EFF just published a DNT policy template which sets out expectations for the behavior of DNT-friendly sites. For web server logs it recommends a 10-day retention of requests with DNT, at most.

There are reasons why logging requests with DNT for a short time is useful:

  • to have the ability to diagnose malicious requests (see the recent DDoS ops thread for an example of an investigation that would not be possible if we didn't log DNT requests and the attacker were clever enough to send the header)
  • to be able to calculate aggregate numbers, as mentioned in the task description
  • monitoring of load and the like usually relies on logs. Filtering out DNT requests shouldn't normally make a big difference, but it can cause unpleasant surprises (such as when IE 10 launched with DNT on by default)
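The short-retention alternative could be sketched roughly as follows. The 10-day limit is the figure from the EFF template mentioned above; the `LogEntry` record, the `purge_expired` helper, and the 90-day default for everything else are assumptions for illustration, not an existing tool.

```python
from datetime import datetime, timedelta
from typing import Iterable, List, NamedTuple


class LogEntry(NamedTuple):
    """Hypothetical parsed log record; `dnt` is True when the request sent DNT: 1."""
    timestamp: datetime
    dnt: bool
    line: str


DNT_RETENTION = timedelta(days=10)      # maximum recommended by the EFF template
DEFAULT_RETENTION = timedelta(days=90)  # assumed retention for all other requests


def purge_expired(entries: Iterable[LogEntry], now: datetime) -> List[LogEntry]:
    """Keep only entries still inside the retention window that applies to them."""
    kept = []
    for entry in entries:
        limit = DNT_RETENTION if entry.dnt else DEFAULT_RETENTION
        if now - entry.timestamp <= limit:
            kept.append(entry)
    return kept
```

With this shape, DNT requests stay available for DDoS investigation and monitoring for a few days, then disappear much earlier than the rest of the log.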

If we're striving to honor DNT in its purest ideological form, people who have it set shouldn't be tracked at all, even for our internal tracking (that's the spirit of the EL change, I imagine).

No, not really. EL data is, most of the time, about user behaviour when using the site. That data is of a different nature than data that is just used to aggregate counts. For example: "number of pageviews on Italian Wikipedia coming from Italy".

In my opinion, drafting actions for DNT needs to be owned and managed like a project that spans many teams. More so when some of our clients include explicit opt-outs (apps, for example) that are of a DNT nature but not exactly the same. Also, it is likely that research/product will want to know the % of users/devices/requests with DNT enabled.

FYI, our policy when it comes to operational logs is a 90-day retention. The same goes for data that contains PII; pageview data is currently retained for 60 days.

Aggregate counts aren't problematic, but the data we store is. If it's recorded, it can be compromised. I know we have retention policies, etc., but they're of no use if someone gets access to our data. 60 days is a lot.

I think that the safety of people who want complete anonymity when visiting our sites trumps having perfect aggregate figures. DNT and its loose definition seem like a good way to achieve that. And since it's opt-in, it really shouldn't affect our analytics much.

The idea of tracking uniques of people with DNT is ironic... IMHO you could do that for an experiment to determine the usual ratio once, but then the point is that you wouldn't record log entries of any form for people who have that header going forward. You can still count hits anonymously at the entry point but not uniques, which would at least inform us if DNT traffic is increasing over time.
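Roughly what I mean by counting hits anonymously at the entry point, as a sketch only: the `EntryPoint` class, its fields, and the log format are invented for illustration and do not describe any existing component.

```python
from collections import Counter
from datetime import date
from typing import List, Mapping


class EntryPoint:
    """Hypothetical request entry point that counts DNT hits without logging them."""

    def __init__(self) -> None:
        self.total_hits: Counter = Counter()  # day -> all requests seen
        self.dnt_hits: Counter = Counter()    # day -> requests that sent DNT: 1
        self.log: List[str] = []              # per-request entries, non-DNT only

    def handle(self, day: date, client_ip: str, path: str,
               headers: Mapping[str, str]) -> None:
        self.total_hits[day] += 1
        dnt = any(k.lower() == "dnt" and v.strip() == "1" for k, v in headers.items())
        if dnt:
            # Only a bare counter is incremented: no IP, path or user agent is kept,
            # so uniques cannot be derived for these requests.
            self.dnt_hits[day] += 1
            return
        self.log.append(f"{day.isoformat()} {client_ip} {path}")

    def dnt_ratio(self, day: date) -> float:
        """Share of that day's requests that carried DNT: 1."""
        total = self.total_hits[day]
        return self.dnt_hits[day] / total if total else 0.0
```

The per-day ratio is enough to tell whether DNT traffic is growing, while nothing that identifies an individual DNT request is ever stored.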

I don't get why losing aggregate unique counts for what will probably be less than 1% of our visitors is such a big deal. Especially since we can estimate that ratio before honoring DNT in its widest definition.

If I were living under a totalitarian regime where what I read on Wikipedia could get me in trouble, I'd feel safer knowing that I can turn an option on and nothing will be recorded on the other side. That wouldn't solve MITM attacks, but it would certainly be a lot safer than the current 60-day retention.

I think that our current policies are fine for most people. But what's the harm in providing an optional higher level of safety for people who need it?

FYI, our privacy policy does mention that we do not honor DNT.

https://wikimediafoundation.org/wiki/Privacy_policy/FAQ#DNTFAQ If we were to do it, I recently found out about the W3C API in this regard, which is resource-specific: https://www.w3.org/TR/tracking-dnt/

@Gilles in light of https://www.w3.org/2011/tracking-protection/ shall we decline this task? (Apple already announced that they will remove DNT from Safari).

Sure, there will probably be more precise standards in the future for this sort of thing.