
Should it be possible for a schema to override DNT in exceptional circumstances?
Closed, ResolvedPublic

Description

Currently, EventLogging always honors the user's DNT header (see also T184793#3965300).

In T184793 we want to track virtual page views when a user hovers over a link. It has been suggested that these should ignore DNT, but because we log virtual page views with EventLogging, which enforces DNT, this does not happen.

@Tbayer points out in T184793#3965440 that this makes the virtual pageviews inconsistent with page view traffic (which ignores DNT).

Possible Solutions

  1. EventLogging should provide a way for clients calling mw.track to override the DNT header in exceptional circumstances.
  2. We exclude DNT users from the pageview data
  3. We need to measure virtual page views in a different way

Short-term Outcome (Friday, 16th March)

Per T187277#4052896:

  • #1 was discarded; and
  • Readers Web will opt for #3: we'll use parts of the core EventLogging client-side API in order to construct the correct URL to request and make the request using sendBeacon inside of the Page Previews codebase, e.g.
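const eventData = {
  // ...
};
// Build the event payload for the VirtualPageview schema.
const payload = mw.eventLog.prepare( 'VirtualPageview', eventData );
// Construct the beacon URL for that payload.
const url = mw.eventLog.makeBeaconUrl( payload );

// Send the request directly, regardless of the user's DNT preference.
navigator.sendBeacon( url );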

Long-term Outcome

We should revisit our treatment of the user's DNT preference in order to get consensus and apply our conclusions consistently across all of our projects. This may mean opting for #2 or the exact opposite.

Event Timeline

Jdlrobson created this task.

IMHO DNT should be respected whenever possible.

IMHO DNT should be respected whenever possible.

I don't think that's being disputed here. The question is whether something that tracks page-view-like things should be consistent with how we collect page views (which does not respect DoNotTrack). :)

The question is whether something that tracks page-view-like things should be consistent with how we collect page views (which does not respect DoNotTrack). :)

See https://phabricator.wikimedia.org/T98831 for requests that pageviews abide by DNT. There are several of these; this is just one of the most recent ones.

Is collecting aggregated data TRACKING? I don't think so. Therefore, collecting such data is possible regardless of whether the user is sending DNT.

Collecting pageviews per user is something else, for example to give them personalized advice about what to read.

The question is whether something that tracks page-view-like things should be consistent with how we collect page views (which does not respect DoNotTrack). :)

See https://phabricator.wikimedia.org/T98831 for requests that pageviews abide by DNT. There are several of these; this is just one of the most recent ones.

Thanks for the link! Actually, that task from 2015 clearly shows that storing pageview numbers in aggregate form has so far been considered uncontroversial - even the task author agrees with that in the comments; the task was about storing individual log entries in non-aggregate form. This also matches the stance of the EFF (as summarized by @Tgr back then in T98831#1522212), which furthermore allows for short-term retention of individual log entries too.
Happy to review older discussions if you find the links, but for now it looks like there is consensus that DNT does not apply to aggregate content consumption metrics like the one we are currently working on in T186728 (the reason this task was filed).

And for context, the overarching issue here is that EventLogging's main use case has been studying UI interactions, where excluding DNT makes sense (no one wants to change that as far as I know). But now we want to use it for a different purpose, tallying content consumption, which, as @Tgr and I pointed out in last month's discussion already, is a different matter. In the 2015 task linked above (T98831#3107001), @Nuria expressed the same point as follows:

EL data is most of the time about user behaviour when using the site. That data is of a different nature than data that is just used to aggregate counts. For example: "number of pageviews on italian wikipedia coming from italy".

It seems option 1. is the way to go here; otherwise we won't be able to use EL for T184793, which has (so far) been the method recommended by the Analytics Engineering team for this.

How will the EventLogging client side javascript know if an event is a "page view like" or the more common UI interaction measurement? What is the damage to the movement if "page view like" events are undercounted by 11% as long as that undercount is relatively uniform? What is the damage to the movement if we slowly increase the amount of data collected and retained about each visitor? What is the damage if that increased collection is against the visitor's explicit request?

I'll stay out of the meta question about page view count fixation for now because I have some idea of why some people think they are sacred.

The problem is that the mechanics of EventLogging aren't collecting just aggregate counts, they're keeping individual records about requests. If the virtual pageviews were being sent to a backend that only keeps aggregate counts, incrementing on each hit, then that might be fine. You essentially want to turn a blind eye to the detailed data EventLogging records because you won't use it, but it's there, it's being recorded. The definition of DNT is loose enough that people keep making compromises to it. But IMHO in this case if you want to record virtual pageviews and only care about aggregate counts, you should be sending the data to a backend that only records aggregates.

Now, if you're talking about recording times links are hovered, they are not a page view. A popup with very limited information isn't a page. It seems like the definition of "virtual pageview" is stretched very far in this proposal, for the sole purpose of ignoring DNT.

Furthermore, we know that users with DNT represent only a tiny fraction of traffic. Will it be the end of the world to not have their data? Do you expect different conclusions to be drawn if a small percentage is missing from the data? How is the extra data "actionable"? Collecting more data for the sake of it is pointless if it can't result in a different course of action.

To exclude "DNT requests" from pageview counts, we'd at least need to keep tracking the proportion of such requests in each timeframe, geography and wiki, so that the remaining requests can be "scaled" up by anyone to approximately "real" totals. Probably overkill?
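As a rough illustration of that scaling step: if a fraction p of the requests in a given timeframe, geography and wiki is sent with DNT and excluded, the remaining count can be divided by (1 - p) to approximate the total. This is only a sketch. The function name and the numbers are made up for illustration (the 11% echoes the figure mentioned earlier in this thread), and it assumes that DNT and non-DNT clients behave alike within each bucket.

```javascript
// Hypothetical helper: estimate a total from a count that excludes DNT requests.
// dntShare is the measured proportion of requests sent with DNT in the same bucket.
function scaleUpCount( observedCount, dntShare ) {
  if ( dntShare < 0 || dntShare >= 1 ) {
    throw new RangeError( 'dntShare must be in [0, 1)' );
  }
  return observedCount / ( 1 - dntShare );
}

// Illustrative numbers only: 8900 counted events, 11% of requests sent with DNT.
const estimatedTotal = scaleUpCount( 8900, 0.11 );
```

The estimate is only as good as the per-bucket DNT share, which is why that share would have to be tracked for every timeframe, geography and wiki.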

As noted, EventLogging was not originally supposed to be used for such purposes. Calculating virtual pageviews in another way would be logical. Allowing each schema to (ab)use the system for different purposes is dangerous (there is little control over individual schemas) and could have unintended consequences (some users may decide that it's better to block all eventlogging requests with a browser extension or other method).

A fourth alternative could be to change the way we honor the DNT directive: the exclusion could be wider but more surgical. It could apply to all MediaWiki logs as well, but radically reduce the information content for each request rather than eliminate it altogether. Implementation may not be hard but some research would be needed to identify all the bits of information to obfuscate, and in the end some EventLogging schemas would also have a slightly diminished precision.

How will the EventLogging client side javascript know if an event is a "page view like" or the more common UI interaction measurement?

As outlined in the task description (1.), the instrumentation code generating such an event could pass an override parameter to the standard function responsible for sending it, making it easy to audit the code to ensure that only the few applicable schemas use it.
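To make the idea concrete, here is a minimal sketch of what such an override could look like. It is not the actual EventLogging API: the names logEvent, shouldLogEvent and overrideDNT are hypothetical, and the DNT state and the send function are passed in as arguments so that the logic can be read in isolation.

```javascript
// Hypothetical sketch, not the real EventLogging client.
// dntEnabled: whether the browser signals DNT.
// options.overrideDNT: assumed opt-in flag that only a few audited schemas would pass.
function shouldLogEvent( dntEnabled, options ) {
  const overrideDNT = Boolean( options && options.overrideDNT );
  return !dntEnabled || overrideDNT;
}

function logEvent( schemaName, eventData, options, dntEnabled, send ) {
  if ( !shouldLogEvent( dntEnabled, options ) ) {
    return false; // DNT honored: nothing is sent
  }
  send( schemaName, eventData );
  return true;
}
```

Because the flag has to be passed explicitly at each call site, searching the code for overrideDNT would list every schema that bypasses DNT.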

I'll stay out of the meta question about page view count fixation for now because I have some idea of why some people think they are sacred.

Both the editor community in general and the Foundation have long embraced pageview data as an important source of information about the impact of our work, even though we know it has limitations. It's fine to disagree with this stance on principle, but I don't think it's productive in a technical discussion to depict it as a mental aberration ("fixation") or a religious dogma.

The problem is that the mechanics of EventLogging aren't collecting just aggregate counts, they're keeping individual records about requests. If the virtual pageviews were being sent to a backend that only keeps aggregate counts, incrementing on each hit, then that might be fine. You essentially want to turn a blind eye to the detailed data EventLogging records because you won't use it, but it's there, it's being recorded. The definition of DNT is loose enough that people keep making compromises to it.

"people" includes the EFF here, I assume? (see above)

And yes, aggregate counts are the ultimate goal in the case of previews (T186728). I guess that the 10-day limit recommended by the EFF would be enough to carry out the aggregation step. For now EL uses the 90-day limit from our privacy policy instead, as does our general webrequest data (which, as discussed, includes requests from DNT clients).

But IMHO in this case if you want to record virtual pageviews and only care about aggregate counts, you should be sending the data to a backend that only records aggregates.

Now, if you're talking about recording times links are hovered, they are not a page view. A popup with very limited information isn't a page. It seems like the definition of "virtual pageview" is stretched very far in this proposal, for the sole purpose of ignoring DNT.

I'm not sure how productive it is to get hung up on the precise nomenclature - granted, previews are not "page views" in the classical sense, the whole current effort is about measuring them separately, as there is consensus that they are an important new form of reading Wikipedia content. We do know that they cause a measurable decrease in classical pageviews, as readers are content with reading the introductory sentence(s) on the preview card in lieu of the full article. I think @Jdlrobson's task description used "virtual page views" in a more technical sense, as a reading action associated with a particular page (here, the one being previewed). If that's a concern, then perhaps we should find a different name.

Furthermore, let's keep in mind that the preview EL event being discussed here (T184793) would always be immediately preceded by the existing API request for the preview content, which is already being logged in the webrequest table (for both DNT users and non-DNT users). In other words, the privacy benefits of not registering previews seen by DNT users would be minimal.

Happy to review older discussions if you find the links, but for now it looks like it's consensus that DNT does not apply to aggregate content consumption metrics like the one we are currently working on

Content consumption is akin to UI interactions in this case, just like it would be if you are trying to measure consumption of minutes of video played, playing a video IS a UI interaction and ALSO content consumption. A page preview IS a ui interaction and ALSO indicates consumption of a snippet of content. We have to decide what is our stand on DNT but in any case we can separate content consumption from ui interactions going forward. As wikimedia sites become less static most of the consumption will shift towards richer UI interactions.

To exclude "DNT requests" from pageview counts, we'd at least need to keep tracking the proportion of such requests in each timeframe, geography and wiki, so that the remaining requests can be "scaled" up by anyone to approximately

Yes, this has been mentioned before and it will be great data to have. Just like the browser data we publish at https://analytics.wikimedia.org/dashboards/browsers/, it will be useful to scale metrics such as this one, but also for the world at large. Filed ticket: https://phabricator.wikimedia.org/T187376

Yeah it sounds like we are trying to use EL “the wrong way”

Previews just aren’t page views. I realize in some ways we want to be able to show that our losses in page views are reasonable, as they are intentional: we don’t want users to navigate to pages just to have a basic understanding of a concept that is referenced in the text.

Rather than trying to make changes to EL, I think it would be worth considering that we just need to reset our baseline for page views. There were page views before previews, and page views after. The new numbers have a lower baseline.

And then let’s just measure previews like any other event going forward and be confident in the fact that we lowered page views intentionally for the users' benefit.

Modern web design is moving away from full page reloads and client side caching is becoming more aggressive. Apps already have this same issue.

I think this issue is a clear marker that we need to figure out how we are going to have to learn to deal with these types of logging inconsistencies moving forward.

Yeah it sounds like we are trying to use EL “the wrong way”

I think this issue is a clear marker that we need to figure out how we are going to have to learn to deal with these types of logging inconsistencies moving forward

Cough cough cough https://office.wikimedia.org/wiki/Tech_program_proposals/Modern_Event_Data_Platform (still WIP). Perhaps using EventLogging for this is not what it was originally intended for, but the idea of modeling content consumption of all kinds as events makes a lot of sense. A regular ol' pageview is an event, just one that we currently detect by grepping a huge firehose of webrequest logs. A page preview is another event.

Whatever the EventLogging of the Future ('EOF', good acronym, ehhh?) is, it will need to be able to support measuring things like pageviews. I also won't weigh in about this DNT decision, but whatever is decided, it should work the same in the future for pageviews and pagepreviews and any other content consumption measurement.

Allowing each schema to (ab)use the system for different purposes is dangerous (there is little control over individual schemas)

This would likely change in EOF, which would allow particular events to be configured to override DNT. BTW, even if a user has DNT, there is nothing stopping a developer now from deploying javascript that emits events to the EventLogging beacon. (Heckaay, I could emit an event from my CLI right now!) I don't think we get a lot of safety by not providing a configurable DNT override.

Some people in this discussion seem to labor under a misinterpretation of what DNT means. DNT expresses the user preference of not wanting to be tracked by third-party sites:

  • Do Not Track is a technology and policy proposal that enables users to opt out of tracking by websites they do not visit, including analytics services, advertising networks, and social platforms. (donottrack.us)
  • Do Not Track is a feature in Firefox that allows you to let a website know you would like to opt-out of third-party tracking for purposes including behavioral advertising. (Mozilla DNT FAQ)
  • Do Not Track (DNT) is a way to keep users’ online behavior from being followed across the Internet by behavioral advertisers, analytics companies, and social media sites. (EFF)
  • ...we believe that there are some kinds of web tracking which ... may not need to be categorically prohibited when the DNT header is set. A reasonable set of exceptions might be: 1. Tracking that is limited to a single "1st party" website (either by the website itself or by an analytics provider subject to suitable contractual and technical protections) (EFF Deeplinks newsletter)
  • Tracking is the collection and correlation of data about the Internet activities of a particular user, computer, or device, over time and across non-commonly branded websites, for any purpose other than fraud prevention or compliance with law enforcement requests. (CDT)
  • Tracking is the collection of data regarding a particular user's activity across multiple distinct contexts and the retention, use, or sharing of data derived from that activity outside the context in which it occurred. A context is a set of resources that are controlled by the same party or jointly controlled by a set of parties. (DNT spec)
  • With respect to a given user action, a first party to that action which receives a DNT:1 signal MAY collect, retain and use data received from those network interactions. (DNT compliance spec)

(All emphasis mine.)

Given that Wikimedia websites do not include third-party resources and do not share the collected data with third parties (except for very limited purposes such as research, which do not constitute tracking; see the definition of service providers in the spec), we would be fully compliant even if we completely disregarded the DNT header for all of EventLogging, both under the commonly accepted W3C definition and under CDT's stricter alternative proposal.


The DNT standard is mainly aimed at web analytics and advertisement companies and as such it is not a very high bar. Notably, the EFF suggests websites should adopt higher standards on a voluntary basis: "we hope that many 1st party domains will choose to adopt limited logging and retention practices for users who enable DNT" (source) and has created a more aspirational DNT policy. Even this policy does not say that no logging of events for DNT users should take place. It says that:

  1. All user identifiers, such as unique or nearly unique cookies, "supercookies" and fingerprints should be discarded as soon as the response is issued.
  2. Logs with DNT Users' identifiers removed (but including IP addresses and User Agent strings) may be retained for a period of 10 days or less

The EFF policy is fairly naive and not written with participatory sites like Wikipedia in mind, and would prohibit key anti-abuse features like CheckUser and cookie blocks; nevertheless for generic analytics features it might be a useful guideline. Given that EventLogging collects neither unique identifiers nor IP / useragent by default, even if we aim to comply with this more radical implementation of Do Not Track, we only need to make sure that those are not added manually, or - in the case of IP/UA - purged within 10 days.

Given that EventLogging collects neither unique identifiers nor IP / useragent by default

FYI, we do collect user agent, but it is saved in parsed format. Aaannnnd to support this page previews project, IP is back (not in MySQL db because it's complicated) in order to geocode! See T186833. Both will be purged within 90 days.

What is the damage to the movement if "page view like" events are undercounted by 11% as long as that undercount is relatively uniform?

For one thing, there is no reason to think it is relatively uniform. It probably correlates with the choice of browser because different browsers expose the setting to a different extent (IE10 used to make it default, Firefox enables it automatically when you are in private browsing mode, in Chrome you have to go into advanced settings to enable it, non-Apple browsers are not even allowed to use it on iOS) and also because privacy-consciousness of the user influences browser choice. So shifts in browser popularity and migration from desktop to mobile could create fake pageview trends.

Also, current pageview stats do not undercount, so a technical change from server-side to client-side page rendering would result in a large perceived pageview drop, which would probably have product owners scrambling to figure out what kind of horrible deficiency went unnoticed. This is less relevant for page previews (where there isn't a 1:1 correspondence between old and new views anyway) and more relevant for things like a skin that functions as a single-page application.

What is the damage to the movement if we slowly increase the amount of data collected and retained about each visitor?

I'll restrain myself from making a snarky comment, but hopefully we can agree that this is values territory where at best vague intuition can be pitted against vague intuition on what kind of borderline risks and bad incentives would be created in the long term by one or the other choice. I think as an organization we are not very good at being respectful of widely-held values, much less having reasonable and meaningful conversations about them, but debating them in semi-random Phabricator tickets is not a good replacement for that.

What is the damage if that increased collection is against the visitor's explicit request?

As I have hopefully shown in my previous comment, there is no explicit way to make such a request. DNT expresses the wish to not be tracked all across the internet by analytics providers, social networks and such. There are probably far more people uncomfortable about Google, Facebook & co knowing their entire browsing history and factoring that into the algorithm that spits back advertisements and news recommendations at them, than those who are also uncomfortable about Wikimedia storing short-lived not-really-identifying information about their article views, and we don't have any way to differentiate between them.

Happy to review older discussions if you find the links, but for now it looks like it's consensus that DNT does not apply to aggregate content consumption metrics like the one we are currently working on

Content consumption is akin to UI interactions in this case, just like it would be if you are trying to measure consumption of minutes of video played, playing a video IS a UI interaction and ALSO content consumption. A page preview IS a ui interaction and ALSO indicates consumption of a snippet of content. We have to decide what is our stand on DNT but in any case we can separate content consumption from ui interactions going forward. As wikimedia sites become less static most of the consumption will shift towards richer UI interactions.

That's true, but entirely beside the point here. Of course the act of viewing a preview involves an interaction with the user interface as well - just like a normal pageview has also been a major UI interaction ever since Tim Berners-Lee made the first web browser in 1990.

Rather, it's about measuring different aspects of one and the same action (previewing a page). As you may recall, an instrumentation was already built and used long ago to study the user interface aspects of the feature, that's the Popups schema. The whole reason that in recent weeks we have all been working on a new instrumentation of the feature for the purpose of measuring content consumption is that everyone agrees (or has agreed so far) that these are different purposes with different requirements.

Yeah it sounds like we are trying to use EL “the wrong way”

Previews just aren’t page views.

I'm not sure anyone said they are page views.

I realize in some ways we want to be able to show that our losses in page views are reasonable, as they are intentional: we don’t want users to navigate to pages just to have a basic understanding of a concept that is referenced in the text.

Rather than trying to make changes to EL, I think it would be worth considering that we just need to reset our baseline for page views. There were page views before previews, and page views after. The new numbers have a lower baseline.

And then let’s just measure previews like any other event going forward and be confident in the fact that we lowered page views intentionally for the users' benefit.

Maybe I misunderstand something in these remarks about baselines and losses, but this seems to be about an entirely different discussion, which has already been had (we measured the drop in pageviews in A/B tests - although not across all projects - and made the point about that intentional lowering when reporting year-over-year pageview changes recently).

Rather, T184793 is about recognizing page previews as a new method of reading Wikipedia pages in its own right, which needs to be taken into account in the future when we want to understand how Wikipedia content is being consumed, alongside the standard pageviews. The ratio between the two will be one important aspect to monitor (does it differ between projects/pages/languages? change over time? etc.), which is why introducing major discrepancies between the methods for each from the very beginning is a really bad idea.

Given that EventLogging collects neither unique identifiers nor IP / useragent by default

Aaannnnd to support this page previews project, IP is back (not in MySQL db because it's complicated) in order to geocode! See T186833. Both will be purged within 90 days.

Given that EventLogging collects neither unique identifiers nor IP / useragent by default, even if we aim to comply with this more radical implementation of Do Not Track, we only need to make sure that those are not added manually, or - in the case of IP/UA - purged within 10 days.

@Ottomata, @mforns: Out of curiosity, is it possible to configure purging on a per-schema basis such that IP/UA information is purged from the VirtualPageview table after 10 days rather than the default (?) 90?

@phuedx not easily. We implement policies per data type, in this case "schema-bound data". EventLogging data is purged every day according to a whitelist: for data that is older than 90 days at the time of purging, sensitive fields are removed. Other purging schemes are applied to other data types.
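For readers unfamiliar with this kind of purging, the rule described above can be sketched roughly as follows. This is an illustration only and not the actual purging job: the record shape, the field names and the function name are assumptions.

```javascript
const DAY_MS = 24 * 60 * 60 * 1000;

// Hypothetical sanitizer: for records older than maxAgeDays, keep only whitelisted fields.
// Each record is assumed to carry a numeric `timestamp` in milliseconds.
function purgeSensitiveFields( records, whitelist, now, maxAgeDays ) {
  const cutoff = now - maxAgeDays * DAY_MS;
  return records.map( ( record ) => {
    if ( record.timestamp >= cutoff ) {
      return record; // recent enough: keep as is
    }
    const kept = {};
    for ( const field of whitelist ) {
      if ( field in record ) {
        kept[ field ] = record[ field ];
      }
    }
    return kept;
  } );
}
```

In this sketch a per-schema retention period would amount to calling the function with a different maxAgeDays for one schema, which is the part that is reported above as not easy in the current setup.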

In T187277#3982615, @Tbayer provided clear reasoning as to why the logging and consequent aggregation of VirtualPageviews events should match that of the wmf.webrequest table.

Here and in this analytics@ thread, Analytics Engineering have advocated for tracking this new method of interacting with pages via EventLogging. You'll note that in that thread, we confirmed that the MultimediaViewer method of tracking virtual pageviews of files is regarded as a hack – it was called a hack when it was introduced in this analytics@ thread – and so was quickly discarded as a possible implementation of the Page Previews virtual pageview instrumentation.

After reading T187277#3973927 several times, I'm forced to ask whether EventLogging's behaviour of not logging events when the user has enabled DNT is a reasonable interpretation. We do purge IPs, UAs, and other sensitive information after 90 days, for instance. This is likely an RFC-worthy question and answering it shouldn't block the specific task of ensuring that pageviews and virtual pageviews via Page Previews are recorded and aggregated as similarly as possible.

It's already possible to use parts of the client-side EventLogging (EL) API to construct validated events that can be logged by another process, regardless of the user's DNT preference. This is because our Varnishes ignore the DNT HTTP header and they are the only other part of the EL stack that the user preference is visible to. If modifying the EL API in order to circumvent this DNT-related behaviour is seen as too broad or controversial a change to support what is the first exception to the most common EL use case, then Web can construct a lightweight EventLogging client in the Page Previews codebase specifically for logging VirtualPageviews events. Once we've vetted the refinement and aggregation steps, I think we should consider generalising VirtualPageviews events as necessary to support the MediaViewer instrumentation that I mentioned above.


Aside: I am concerned that we're now collecting IPs for EventLogging by default. I can conceive of situations where an analyst would require this information but I believe that it should only be made available by request, presumably during the privacy review performed by Analytics Engineering. Is there a way to whitelist recording this information on a per-schema basis? Should I create a task?

consequent aggregation of VirtualPageviews events should match that of the wmf.webrequest table

Personally I have no opinion in the fight for or against respecting DNT, so whatever yall decide is fine with me. To be specific though, I'd word this differently, and say that the VirtualPageviews should match Pageviews, not necessarily wmf.webrequest. Assuming some future RFC decided that Pageviews should respect DNT, then we'd track the DNT header in webrequest, and exclude those records from Pageview aggregation. OR even better...the client would emit a Pageview event, and whatever DNT policy we had would be contained in the client side code. That would be best!

Aside: I am concerned that we're now collecting IPs for EventLogging by default

It's a little funky, but since we collect IPs for all webrequests, and all EventLogging events are originally webrequests, we have always tracked IPs for EventLogging.

It's a little difficult (not impossible) to introduce rules for specific schemas at the ingestion and refinement step. But we could very easily salt and hash the IPs (like EventLogging used to do), or even remove the IP altogether after geocoding happens.
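A minimal sketch of the salt-and-hash option, assuming a Node.js environment and its built-in crypto module. The function name is hypothetical and this is not how EventLogging did or does it; in particular, how the salt is generated, rotated and discarded is left out.

```javascript
const crypto = require( 'crypto' );

// Hypothetical helper: replace a raw IP with a keyed hash so that the raw address is not stored.
// The salt is assumed to be a secret that is rotated and thrown away periodically.
function hashIp( ip, salt ) {
  return crypto.createHmac( 'sha256', salt ).update( ip ).digest( 'hex' );
}
```

With the same salt, the same IP always maps to the same hash, so events can still be grouped within one salt period. Once the salt is discarded, the hashes can no longer be linked back to addresses by recomputation.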

To be specific though, I'd word this differently, and say that the VirtualPageviews should match Pageviews, not necessarily wmf.webrequest.

Hah! I actually meant to type pageviews! Thanks for the correction and further clarification 🙂

Sounds like the way forward is to continue with @phuedx's proposal for building something within the Page Previews codebase specifically for this use case. I agree that perhaps a larger conversation around DNT is necessary on a different forum.

I also wanted to quickly summarize our motivation here: the goal of this change is to gain the ability to define and measure page previews and pageviews in an analogous way. The deployment of page previews will alter the significance of pageviews as a core metric. Based on the results from T182314: Analyze results of enwiki and dewiki page previews a/b test, we are expecting a significant decrease in pageviews. This decrease is intentional - we want users to use previews for pages they would have opened in the past only to get a quick definition or context. However, we are expecting users to interact with the content of more pages across the session. After this change, merely looking at pageviews no longer gives us an accurate representation of interaction with content across pages. This led to the creation of a new metric, so far defined as page interactions (roughly, page previews plus pageviews). We would like to track and report on page interactions in a similar way to pageviews, as well as to be able to monitor the ratio between the two, as @Tbayer mentioned in T187277#3982615.

ovasileva claimed this task.

Created T190188: VirtualPageView schema should not use EventLogging api to send virtual page view events for the changes proposed above. Resolving this for now while noting that the larger conversation should be continued elsewhere.