[Spike] Should EventLogging support DNT?
Closed, Resolved, Public

Description

Currently EventLogging supports the Do Not Track header on the client-side, but not the server-side. There has been some debate about whether or not DNT should affect EventLogging at all. Let's resolve that question in order to inform how we resolve this inconsistency.
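For anyone unsure what "server-side support" would involve, here is a rough Python sketch of the check. It is illustrative only: `dnt_enabled`, `should_log_event`, and the `respect_dnt` switch are hypothetical names, not part of EventLogging, and the only assumption about the protocol is that a browser with DNT turned on sends the request header `DNT: 1`.

```python
from typing import Mapping


def dnt_enabled(headers: Mapping[str, str]) -> bool:
    """Return True when the request carries `DNT: 1`.

    Header names are case-insensitive, so compare them in lowercase.
    """
    for name, value in headers.items():
        if name.lower() == "dnt":
            return value.strip() == "1"
    return False


def should_log_event(headers: Mapping[str, str], respect_dnt: bool) -> bool:
    """Hypothetical gate: drop the event only if we chose to respect DNT
    and the request actually signals it."""
    return not (respect_dnt and dnt_enabled(headers))


if __name__ == "__main__":
    request_headers = {"User-Agent": "example", "DNT": "1"}
    print(should_log_event(request_headers, respect_dnt=True))   # False
    print(should_log_event(request_headers, respect_dnt=False))  # True
```

Either choice of `respect_dnt` is easy to implement; the open question in this task is which behavior we want, applied consistently on both client and server.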

Event Timeline

My opinion is that EventLogging should not be affected by DNT. Here are my reasons:

  • DNT is a failed experiment and has been superseded by ad blockers, browser-based tracking protection, and laws like the GDPR. The W3C's DNT working group was disbanded in January 2019 and most other websites, including Google, don't respect DNT.
  • It seems unlikely to me that our EventLogging activity even constitutes "tracking" as intended by the DNT draft documents. The draft documents (which were never finalized) seem to almost exclusively refer to third-party web tracking. The full title of the standard proposal is "Do Not Track: A Universal Third-Party Web Tracking Opt Out". EventLogging is purely internal tracking, not third-party tracking.
  • Strict adherence to DNT makes it harder for us to understand how users are using our site. We need comprehensive logging data to make decisions about features and products. Since a lot of Wikipedia's most active editors are also very privacy-conscious, we may be skewing our data by excluding users with DNT enabled, even though it's unlikely these users intend to prevent us from collecting strictly internal usage data that is not used for targeting or advertising.

This, I think, is already a settled question: the new instrumentation client we are working on for Modern Event Platform does not support DNT thus far: https://github.com/wikimedia/mediawiki-extensions-EventLogging/commit/c4dd0aeca5b678a937f85bab8b0bbc304c137df3#diff-b03b762d2fc042cf84e0494bdeacf525

At this time we are not modifying the older client at all.

Ah, no, wait, it DOES support DNT. My bad, cause we discussed this and I thought we decided on not including it.

DNT is a failed experiment and has been superseded by ad blockers, browser-based tracking protection, and laws like the GDPR.

I can see this argument and do not disagree.

It seems unlikely to me that our EventLogging activity even constitutes "tracking" as intended by the DNT draft documents.

Many, many will disagree with this point.

Strict adherence to DNT makes it harder for us to understand how users are using our site

No, not really: DNT usage is very small, much smaller than ad blocker usage. Remember that ad blockers also block EventLogging requests.

See also:

  • T190188: VirtualPageView schema should not use EventLogging api to send virtual page view events
  • T186572: uBlock blocks EventLogging
  • T220627: QuickSurveys EventLogging missing ~10% of interactions

Ah, no, wait, it DOES support DNT. My bad, cause we discussed this and I thought we decided on not including it.

@Nuria - So does that mean that the Analytics team has already decided this question in favor of supporting DNT, and thus we should move forward with adding DNT support on the server-side for parity? If so, I'm OK with that decision (even if I don't like it much personally).

So does that mean that the Analytics team has already decided this question in favor of supporting DNT, and thus we should move forward

No, to be honest, I think now is a good time to decide what to do given MEP.

I think we should respect DNT.

I know at the WMF we take our users' privacy very seriously. Even if we collect DNT events, I know that the information that we keep will never be used for those things that DNT is meant to prevent.
But I think one of our biggest assets as an organization is the trust of the community.
If someone from the community enables DNT and still continues to see event beacons being sent, that would be confusing for them, and could lead to loss of trust.
I acknowledge @Nuria's argument that DNT is very rarely used.
But that could also be an argument in favor of respecting it, because we'd be losing very little data.
Plus, adding DNT awareness to EventLogging clients is pretty straightforward (the current JS client already respects DNT).

The Legal team looked into this and said they do not see any legal impediment to either choice, as long as we are clear and consistent about it.

It's also worth noting that our privacy policy currently states that we do not respond to the Do Not Track header.

It's also worth noting that our privacy policy currently states that we do not respond to the Do Not Track header.

I retract my previous comment, then. I'd be OK with ignoring DNT, and it's probably cleaner to do so.
Thanks @nshahquinn-wmf for bringing up this fact!

I think this choice belongs with the security team, our privacy expert @JFishback_WMF, and I'm ok with whatever they decide. I will, of course, add my reasoning:

First, I interpret DNT as the EFF has re-framed it here, for the reasons explained here. As the EFF explains, this policy:

includes exceptions that respect the ordinary functionality of a site, that allow measures for security purposes and prevent fraud, and allow data analysis techniques that protect the anonymity of the users.

Reading more about this brings up two key points:

  • we can do nothing on the client side, the header was not meant to stop requests, it was meant to describe how they should be handled on the server
  • on the server, we already pretty much support this policy as written, except we use a period of 90 days instead of 10 and are not as strict about the K=5000 [1] when we k-anonymize (sometimes it's bigger sometimes smaller, according to our risk-mitigation analysis). I think this is reasonable due to our small team, large data, and relatively small compute resources.
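As a rough illustration of the two server-side knobs mentioned above (the retention period and the K threshold), here is a minimal Python sketch. It is not our actual pipeline: `k_anonymous_groups` and `is_expired` are hypothetical helpers, and suppressing small groups is a deliberate simplification of k-anonymization, which in practice also involves generalizing quasi-identifiers and a risk analysis.

```python
from datetime import date, timedelta
from typing import Dict, Set

K_THRESHOLD = 5000   # group size quoted from Section 4b of the EFF policy
RETENTION_DAYS = 90  # the period we use, versus 10 in the policy as written


def k_anonymous_groups(groups: Dict[str, Set[str]],
                       k: int = K_THRESHOLD) -> Dict[str, int]:
    """Return per-group counts, dropping any group that has fewer than
    k distinct individuals or devices."""
    return {key: len(members)
            for key, members in groups.items()
            if len(members) >= k}


def is_expired(event_date: date, today: date,
               retention_days: int = RETENTION_DAYS) -> bool:
    """True when raw event data is older than the retention period and
    should be purged or sanitized."""
    return today - event_date > timedelta(days=retention_days)


if __name__ == "__main__":
    sample = {"page_a": {"u1", "u2", "u3"}, "page_b": {"u1"}}
    print(k_anonymous_groups(sample, k=2))             # {'page_a': 3}
    print(is_expired(date(2020, 1, 1), date(2020, 4, 1)))  # True (91 days)
```

Adopting the policy's stricter numbers would only mean changing the two constants; whether our compute resources allow that is the real constraint.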

We can update our policy to say that what we implement for ALL users is very close to the DNT policy, and definitely in the same spirit. We can explain the differences and our reasons for them publicly. When Ori first blocked requests with the DNT header, we were not handling data well on the server. Due to monumental work by @mforns and others, we're now much more consistent. If someone has a problem with our exceptions to the DNT policy as written, I think it's reasonable to engage in that discussion and not too hard technically to implement a more strict adherence for requests with the DNT header. Most importantly, though, I think we can and should remove any DNT handling from clients, except maybe making sure the header is sent with the EL request. We can then adapt as needed on the server.

[1] (Section 4b) "Anonymized" means we have conducted risk mitigation to ensure that the dataset, plus any additional information that is in our possession or likely to be available to us, does not allow the reconstruction of reading habits, online or offline activity of groups of fewer than 5000 individuals or devices.

tl;dr: I think @kaldari gave one sound argument and two unsound arguments for not handling DNT, but you only need one sound argument. I agree that DNT handling should be disabled, though we should understand the narrow technical reason why it makes sense and prepare to support a similar technology through other means.
I think @mforns is also right to say

If someone from the community enables DNT and still continues to see event beacons being sent, that would be confusing for them, and could lead to loss of trust.

but I wonder if a console warning message or deprecation notice might be appropriate. We could even recommend in the notice that they use an ad-blocker or other privacy extension to help protect themselves. This would side-step the issue of implementing DNT handling on the server.

Agree with @Milimetric that I'd like to hear from @JFishback_WMF on this. James -- did you get a chance to talk with any of the DNT folks from W3C last year at the meeting? I might e-mail Bert Bos or somebody to understand how things unfolded (but I think most of it is public knowledge by now).

  • DNT is a failed experiment and has been superseded by ad blockers, browser-based tracking protection, and laws like the GDPR. The W3C's DNT working group was disbanded in January 2019 and most other websites, including Google, don't respect DNT.

So, yes, basically, DNT as specified by W3C was not fully adopted by browser vendors (some did, some didn't) and for various reasons (not necessarily as nefarious as you might think), is now being removed by some of them. This is separate from the matter of websites honoring or not honoring the header.

The only reason any browser vendors like Mozilla or orgs like EFF continue to support the header is to try to continue the discussion, as far as I can tell. It brings me no joy to say it, but as an item of technology, DNT is defunct. Practically speaking, the day is soon to come when users will not be able to set DNT in their browsers, and we will essentially be handling a non-standard HTTP header.

Also, I wouldn't say DNT is superseded by GDPR.

It seems unlikely to me that our EventLogging activity even constitutes "tracking" as intended by the DNT draft documents.

Many, many will disagree with this point.

Have to agree with @Nuria here. For context: the IETF memo you cited was a really early iteration, superseded by the W3C Tracking Preference Expression (DNT) Working Group specification. That work broke out the question of what is and isn't tracking into a separate Tracking Compliance and Scope note, which was never formalized into a spec. There, the matter of internal-use versus third-party transfer is handled separately, and it's clear that either can constitute tracking. But remember all of these documents are neither laws nor even adopted specifications, so these definitions are not enforceable consensus.

I personally think there are two separate discussions:

  • Whether a user can signal to the website that they do not consent to trust. This is the essence of DNT and is more akin to anonymous browsing where you are either anonymous or you aren't, it's 0-1.
  • Whether there are exceptions even if DNT has been signaled. Based on data lifecycle, identification risk, etc etc. (like the 'data analysis techniques that protect the anonymity of the users' exception mentioned in @Milimetric's post) it can be argued that quite a lot of data can be collected under the right conditions without exposing the user to meaningful risk. But if the user has signalled that they do not trust the website (using e.g. DNT), then why should they trust the website to adjudicate what is and is not adequately anonymized?

This is an area where we as an org can do some interesting work in defining policy and practice, but I think as of today, some (if not most) of our data usage would constitute tracking by the definitions laid out by the various DNT specs.

  • Strict adherence to DNT makes it harder for us to understand how users are using our site. We need comprehensive logging data to make decisions about features and products. Since a lot of Wikipedia's most active editors are also very privacy-conscious, we may be skewing our data by excluding users with DNT enabled, even though it's unlikely these users intend to prevent us from collecting strictly internal usage data that is not used for targeting or advertising.

I agree with your point here, though maybe not with your argument. I don't think it is reasonable to deny users the choice to "go dark" with DNT if denying them that choice means we are forcing them to surrender their privacy. I think between a user's need for privacy and our need for statistics, we should allow the user to protect their privacy.

However, if we give the user the protection of allowing only provably non-identifiable data to be collected, with something like differential privacy (as @Nuria, @JFishback_WMF, and I drone on about at various meetings), then I do think it becomes more reasonable to prevent them from "going dark", because now we can argue in good faith that the statistics we collect are essential to the service, introduce no association risk to the user, and thus don't constitute tracking. But we don't have anything like that right now.

So imo this is not a sound argument for not honoring DNT.
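For readers who haven't run into differential privacy, here is a minimal Python sketch of the textbook Laplace mechanism applied to a single count. This is not something EventLogging implements; `noisy_count` is a hypothetical helper, and the guarantee only holds under the stated assumption that each user contributes at most 1 to the count (sensitivity 1).

```python
import random


def noisy_count(true_count: int, epsilon: float, rng: random.Random) -> float:
    """Release a count with Laplace noise of scale 1/epsilon.

    Assumes each user changes the count by at most 1 (sensitivity 1).
    The difference of two independent Exponential(epsilon) draws is
    Laplace-distributed with scale 1/epsilon.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    noise = rng.expovariate(epsilon) - rng.expovariate(epsilon)
    return true_count + noise


if __name__ == "__main__":
    rng = random.Random()
    # Smaller epsilon means more noise and stronger privacy.
    print(noisy_count(1000, epsilon=0.5, rng=rng))
```

The point of the sketch is the trade-off: the published number stays useful in aggregate, while any single user's presence or absence is masked by the noise.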

First, I interpret DNT as the EFF has re-framed it here, for the reasons explained here. As the EFF explains, this policy:

Full disclosure: I love EFF and give them a donation about the size of a rent check every year. That said, I think the W3C recommendations are more fleshed out and better reflect the reality of the browser landscape. I think the EFF policy deepens the already confusing voluntary nature of the DNT spec by introducing these exceptions to what constitutes "tracking". I think the work they put into their technical guide is interesting and would be a good base for a systematic effort to provide standard techniques and measurements that would allow sites to verify that they were compliant with "soft DNT" of the kind I touched on above, where we can provably anonymize the collected data.

But as mentioned above, that is a whole other matter. We are not talking about "soft DNT" and that isn't what we implement in EventLogging. And if we did implement "soft DNT" I think it needs to be called something else that is more precise and developed on a W3C standards track, and preferably developed to interact with relevant law (e.g. GDPR).

  • we can do nothing on the client side, the header was not meant to stop requests, it was meant to describe how they should be handled on the server

This is not accurate wrt the W3C spec. See 5.3 JavaScript Property to Detect Preference.

  • on the server, we already pretty much support this policy as written, except we use a period of 90 days instead of 10 and are not as strict about the K=5000 [1] when we k-anonymize (sometimes it's bigger sometimes smaller, according to our risk-mitigation analysis)

It looks like from what you wrote that we don't support the policy as written. We either do or don't, right? That's not to say that their policy is law. I think the EFF policy (like the W3C Tracking Compliance and Scope) is inadequately developed and represents a real gap in standards. My guess is the intention was to wait for legislation to define a way to measure or define compliance, which has been slow and uneven. Again, I think this is a place where we (or a consortium of folks collaborating) could probably do really good work to define what is provably safe to collect, and for what values of 'safe'.

We can update our policy to say that what we implement for ALL users is very close to the DNT policy, and definitely in the same spirit. We can explain the differences and our reasons for them publicly.

I think the only reason I would choose to explain our policies as "close to DNT" (I assume the EFF version?) is if I had to explain the absence of support for DNT. I would rather explain our policies on their merits and leave the confusing DNT (IETF, W3C, or EFF?) business out.

Sorry for my delayed response here - I wanted to spend time really reviewing the disparate viewpoints and refining my own thoughts on the subject, so thank you for your patience. I appreciate the thoughtful discussion so far. Clearly, we all agree we want to respect the privacy of our users. Consistent with AGF, I think the disagreement here is how best to manifest this respect. Here's my attempt to frame the issue in terms of user expectations, along with my recommendation.

DNT is a method of obtaining tracking consent. Consent is an attractive notion in privacy because it seeks to shift power from the "tracker" to the "target" by allowing the target to determine whether or not their data are processed. All other things being equal, empowering our users feels consistent with our values as a Movement. So DNT seems, at first, like a Good Thing. But consent frameworks have also been roundly criticized as providing merely the illusion of control, specifically because they abstract away so much complexity in such a simple design.

DNT makes this consent abstraction problem even worse. We need not go any further than this very Phab task to see that the effects of a binary consent paradigm are not always clear. We have a whole group of smart, technically educated, privacy-minded experts weighing in and even we do not entirely agree what DNT means. Further, every website, browser, and advocacy group seems to disagree over how and whether DNT should be implemented, honored, and supported. How is a user with little or no privacy expertise or technical background supposed to understand the nuance or trade-offs involved in tracking? Especially with respect to secondary processing, etc.? This makes DNT even less useful as a consent mechanism.

To illustrate, let's assume every user wanted to enable DNT. If they use a browser that even supports DNT, and they're able to figure out how to enable it, what should their reasonable expectation be when it is checked? Does it only stop third-party tracking? First-party tracking? Only some first-party tracking? Will DNT prevent their IP address from being recorded when they edit? How about when they read? Will their data be excluded from third-party research projects or public data dumps? What if their browser does not support DNT headers? Does that mean they are unprotected on our project sites in some way?

I just don't think that there are clear answers to any of these questions in terms of user expectations. We probably have our opinions on how it should work, or know what our own expectations might be, but there is certainly no consistency in practice.

So DNT is not great, but what harm is there in responding to it anyway? Developing logic around DNT feels like we're incurring technical debt (and policy debt, if that's a thing?) for an inconsistently implemented quasi-standard that is difficult for users to understand, and in all likelihood will soon be completely abandoned. My recommendation is to ignore DNT going forward. I don't say that because I think privacy is unimportant, but because I think it's extremely important. If we want to meaningfully empower our users by implementing some form of tracking consent, there are more user-friendly and "standard" patterns to do it (think cookie banners)[0]. If we respond to DNT, it just clouds user expectations. We can also review our explanation of how we respond to DNT and why. If a user is savvy enough to enable DNT and then search for tracking that is inconsistent with their expectations, my guess is that a stop at our Privacy Policy won't be far behind, so let's set expectations there instead of relying on a conflicting interpretation of DNT.

[0] I think cookie banners suffer from a lot of flaws, but at least their presence is surfaced to the user in a largely consistent way. They promote all kinds of other privacy anti-patterns (obfuscating opt out links, etc.), but I don't have to Google "change firefox do not track settings" to figure out how to interact with a cookie banner.

We have a whole group of smart, technically educated, privacy-minded experts weighing in and even we do not entirely agree what DNT means.

Could not agree more; this has been the case for DNT since early on.

Coming mostly from the perspective that DNT is a failed experiment, half supported across the codebase, I think we should be explicit in not supporting it going forward. In practice this would mean:

  • the EventLogging MEP client will not support DNT
  • the privacy policy will have a specific note about WMF abandoning any support of DNT altogether.

Change 598794 had a related patch set uploaded (by DLynch; owner: DLynch):
[mediawiki/extensions/EventLogging@master] Remove DoNotTrack support

https://gerrit.wikimedia.org/r/598794

Change 598820 had a related patch set uploaded (by Sbisson; owner: Sbisson):
[mediawiki/extensions/WikimediaEvents@master] InukaPageView: remove support for "do not track"

https://gerrit.wikimedia.org/r/598820

Change 598820 merged by jenkins-bot:
[mediawiki/extensions/WikimediaEvents@master] InukaPageView: remove support for "do not track"

https://gerrit.wikimedia.org/r/598820

kaldari claimed this task.

Marking this as resolved (more or less). The conclusion is that EventLogging should not support DNT, for the following reasons:

  • DNT is a failed experiment which was not fully adopted by browser vendors and is now being removed by some of them.
  • There is no consensus on what DNT should apply to.
  • Our existing implementation of DNT-support is inconsistent.
  • Our Privacy Policy already states that we do not support DNT, and Legal has no definite opinion on the matter.

This is moot at this point, but sometimes I like to make sure I understand something even if it's moot.

  • we can do nothing on the client side, the header was not meant to stop requests, it was meant to describe how they should be handled on the server

This is not accurate wrt the W3C spec. See 5.3 JavaScript Property to Detect Preference.

I see that preference, but it looks like it's there to help an agent "determine what DNT header field value would be sent to the effective script origin". So, again, not to stop the request client-side, but to inform the header that is sent with the request. I see nowhere in the spec or follow-up guidance that says the request should be stopped client side based on DNT.

As for the way we handle it server side and how we should leave DNT out of "we support something close to DNT" in order to be more clear, I fully agree.

(And by the way I like and agree very much with your greater point of setting ourselves up to support something like DNT in spirit. And I like your eulogy of DNT, too :))

Change 598794 merged by Nuria:
[mediawiki/extensions/EventLogging@master] Remove DoNotTrack support

https://gerrit.wikimedia.org/r/598794

This is moot at this point, but sometimes I like to make sure I understand something even if it's moot.

I do it too!

I see that preference, but it looks like it's there to help an agent "determine what DNT header field value would be sent to the effective script origin". So, again, not to stop the request client-side, but to inform the header that is sent with the request. I see nowhere in the spec or follow-up guidance that says the request should be stopped client side based on DNT.

Yeah it's not stated in the spec because I think it doesn't really want to define a 'why' for using it, but if you go and dig around in the mailing lists enough, you'll find this is the basic reason it was proposed:

https://lists.w3.org/Archives/Public/public-tracking/2012May/0313.html

Yeah it's not stated in the spec because I think it doesn't really want to define a 'why' for using it, but if you go and dig around in the mailing lists enough, you'll find this is the basic reason it was proposed:

https://lists.w3.org/Archives/Public/public-tracking/2012May/0313.html

I salute your archivist skills. I see it's not quite a firm proposal to block requests that would have a DNT header, more of a guidance on what to do with potential requests going to third parties. In our case, is our tracking domain considered a "third party" because it's different from the domain we're serving on, or "first party" because we're the same entity operating both? The relevance here is that I still think DNT was mainly there to protect users from third party analysis and long-term tracking, and I feel that in spirit we're still heading in that direction. If there's evidence that people interpreted DNT to protect users from first-party analytics, then I'd withdraw my desire to support something in the spirit of DNT and have to agree with the rest of the industry that it was kind of a silly standard.

Change 682227 had a related patch set uploaded (by Krinkle; author: Krinkle):

[mediawiki/extensions/WikimediaEvents@master] statsd: Remove reference to undefined mw.eventLog.isDntEnabled

https://gerrit.wikimedia.org/r/682227

Change 682227 merged by jenkins-bot:

[mediawiki/extensions/WikimediaEvents@master] statsd: Remove reference to undefined mw.eventLog.isDntEnabled

https://gerrit.wikimedia.org/r/682227