
% of "none" referers seems too high
Closed, Declined | Public | 3 Estimated Story Points

Description

(from @JKatzWMF , via email:)

I think there is possibly something wrong with how referers are currently classified:

If you look at the data, the none/internal are closely coupled and the external/unknown are as well (see link or chart below).

Based on user behavior I would expect the internal referrals to be highly correlated to the total number of non-internal (as the pages per session remain roughly constant). However, I don't expect that the internal pageviews are so tightly correlated and nearly equivalent to the "none". This consistent near equivalence seems like a very odd coincidence. The new distinction between the two in the last month is due to a known mobile Safari-related bug being resolved, but I don't think it explains what we're seeing.

Aside from being tightly coupled, the % of "none" seems incredibly high to me at 35% of total traffic. If you look at external-only traffic, that says that for every 6 google referred pages there are 4 unknown links (email/app links, direct urls). For another website I volunteer for, the % of known external traffic is 95%. For us, it is 60%. I know we don't have the tools/infrastructure that uncle Google has, but this seems to be a very big difference.

image.png (1×1 px, 488 KB)

Note: All the %s I derived are from exporting the turnilo link shared above, using division and eyeballing a mostly flat daily % over the course of 2 years.

Event Timeline

From @JAllemandou 's e-mail:

I can't find change-reasons for this artifact:

  • We deployed code on the 10th (not the 9th), but for code that doesn't impact referer_class
  • We did change the referer_class code, but we deployed it beginning of May, not April (5th to be precise).

I am not sure I understand the comments around "correlation", but what happened around late March is the launch of a new version of Safari that supports our referrer meta tag, and thus some hits that were previously classified as "unknown" or "none" are now classified as "internal".

referrers-all-year.png (1×2 px, 397 KB)

See: https://phabricator.wikimedia.org/T154702

Nuria set the point value for this task to 3. (May 31 2018, 10:52 AM)
Nuria moved this task from Next Up to In Progress on the Analytics-Kanban board.
Nuria moved this task from Incoming to Data Quality on the Analytics board.

@Nuria Sorry for the confusion. (context for anyone who didn't see the email exchange: I wrote the email quoted above). I wasn't suggesting the issues with referrer class were related to the updated header project, I was asking if the changes you were making might fix what I believe are pre-existing issues. Specifically:

If you look at the data, the none/internal are closely coupled and the external/unknown are as well (see link or chart below). Based on user behavior I would expect the internal referrals to be highly correlated to the total number of non-internal, but the near equivalence seems like a very odd coincidence.

Also, the number of unknowns and none seem very high to me.

Regardless of whether or not they were in scope for the project, I should probably just file a ticket or change this ticket to be a request to look into it?

From @JAllemandou 's e-mail:

[...]

  • We did change the referer_class code, but we deployed it beginning of May, not April (5th to be precise).

Does that refer to T191714: Add Ecosia and Startpage to list of search engines (which according to this log was deployed on May 2), or to some other change that could have affected the data as well?

Talked to @JKatzWMF and edited the premise of the ticket to better describe the issue. No changes have been made to referrers lately (recent changes just removed duplicated code), so the issues must be pre-existing. Will take a look.

Vvjjkkii renamed this task from Problems with external referrals? to 71baaaaaaa. (Jul 1 2018, 1:07 AM)
Vvjjkkii removed Nuria as the assignee of this task.
Vvjjkkii updated the task description. (Show Details)
Vvjjkkii removed the point value for this task.
Vvjjkkii removed a subscriber: Aklapper.

Hi @Nuria just following up on this to see if you have had a chance to take a look.

Let's see, the graph you attached shows the "number of pageviews per referral class per day for 2 years for wikipedias". So on a day like, for example, June 25th 2017 you have 44 million pageviews tagged with the "external (search engine)" referrer class, 36 million tagged with "none", 28 million tagged with "internal", and 2.4 million tagged with "external".
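For reference, the shares implied by the counts quoted for that day can be worked out with a few lines of Python. This is only a sketch: the numbers are the rounded millions from the comment above, the "unknown" class is not part of the quoted figures, and the `shares` helper is a hypothetical name used for illustration.

```python
# Pageview counts quoted above for June 25th 2017, in millions (rounded).
counts = {
    "external (search engine)": 44.0,
    "none": 36.0,
    "internal": 28.0,
    "external": 2.4,
}


def shares(counts, exclude=()):
    """Return each class's fraction of the total, ignoring excluded classes."""
    kept = {k: v for k, v in counts.items() if k not in exclude}
    total = sum(kept.values())
    return {k: v / total for k, v in kept.items()}


all_traffic = shares(counts)
non_internal = shares(counts, exclude=("internal",))

print(f"none share of all quoted traffic: {all_traffic['none']:.1%}")
print(f"none share of non-internal traffic: {non_internal['none']:.1%}")
```

With these rounded inputs, "none" comes out at roughly a third of the quoted traffic and a bit over 40% once internal pageviews are excluded, which is in the same range as the figures in the task description.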

I do not understand the "coupling" part. Since your main signal in this graph is "number of pageviews", it stands to reason that all lines fluctuate at the same time. That is, the count(pageviews) when referrer="external" decreases or increases with the same frequency as the count(pageviews) when referrer="internal", if that makes sense. Maybe I am totally missing your question.
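To illustrate the point about lines fluctuating together, here is a small sketch on purely synthetic data (none of these numbers come from our logs). It assumes a weekly cycle in total pageviews and two referrer classes whose shares are independent of each other. The raw counts end up strongly correlated because they share the same total, while the shares themselves do not.

```python
import math
import random


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


rng = random.Random(0)  # fixed seed so the sketch is reproducible
days = range(365)

# Synthetic assumption: total daily pageviews follow a weekly cycle.
totals = [100 + 20 * math.sin(2 * math.pi * d / 7) for d in days]

# Synthetic assumption: each class's share is a constant plus independent noise.
none_share = [0.33 + rng.gauss(0, 0.01) for _ in days]
internal_share = [0.26 + rng.gauss(0, 0.01) for _ in days]

none_counts = [t * s for t, s in zip(totals, none_share)]
internal_counts = [t * s for t, s in zip(totals, internal_share)]

corr_counts = pearson(none_counts, internal_counts)
corr_shares = pearson(none_share, internal_share)

print(f"correlation of raw counts: {corr_counts:.2f}")
print(f"correlation of shares:     {corr_shares:.2f}")
```

This only shows that co-movement of raw counts is expected. It says nothing about whether the levels of "none" and "internal" should be as close as they are in the chart.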

As to the percentages of referrers and how much of our percentage of traffic is external versus internal this is what I see in raw data: https://bit.ly/2zBBe4o
Notice the dataset: it is "raw data", not "pageviews". The graph shows the last 6 hours of requests and incoming referrers (notice the trick to split referrers via a filter regex). As you mentioned, the traffic with no referrer is very significant; note that these records actually show the requests we get. While referrer "-" is high percentage-wise, I really have no reason to think that is not correct. Let me know if you can think of one.

Main culprits (browser-wise) of sending "empty referrers": https://bit.ly/2zBP1Ig

Nuria to look into the differences between "unknown" and "none"

Per conversation with @JKatzWMF

  • we do not believe that there is an issue with "correlations" of referrers
  • it is kind of odd that so much of our traffic is tagged as having no referrer; that would mean that a big percentage of our traffic is not coming from an external source like a search engine but rather is a "direct" hit.

From my research on this the traffic tagged with referrer equal to "none" is so for different reasons:

  1. Browsers that do not understand our referrer policy or have bugs:
    Screen Shot 2018-07-16 at 3.28.52 PM.png (1×2 px, 374 KB)

As can be seen here: https://bit.ly/2NkH6ko the biggest culprit in terms of browsers that send no referrer is Safari. After the WebKit release that fixed the issue (https://phabricator.wikimedia.org/T154702) in late March, matters improved for this browser and requests with referrer == "none" lessened quite a bit. Older versions of Safari do not understand our referrer policy, so much of this traffic marked as "none" will remain as is.

  2. Bot traffic not labelled as such

The other major culprit is IE. Requests were wrongly classified as IE7 before but now appear mostly as IE11. A significant percentage of this no-referrer traffic corresponds to well-known bots (not tagged as such) that originate in the Middle East. See:

Screen Shot 2018-07-16 at 4.19.09 PM.png (1×2 px, 377 KB)

A little of this traffic might be legitimate IE traffic; IE is known to have issues with referrers.

  3. Webviews?

Chrome Mobile version 38 is the other major culprit of no-referrer traffic: https://bit.ly/2Ln6fuk. If you dig deeper, this version of Chrome is probably a Chrome Webview in an older version of Android 4. Android 4 was the first one that sported Chrome Webviews (https://developer.chrome.com/multidevice/webview/overview), so this version likely has issues reporting referrers: https://bit.ly/2uFMSoW. The newer versions of Android seem to also be represented here, but version 38 is not represented if you look at internal referrers for Chrome Mobile: https://bit.ly/2L3M20F. So this is a mix of actual direct hits via webviews and a widespread browser (Chrome Mobile 38) that never sends referrers.

Actions:
See the referrer policy below. For internal hits inside Wikipedia, browsers that do not understand any of these three will send an empty referrer. It is likely that a significant portion of the traffic that is tagged as "none" belongs to category 1) Browsers that do not understand our referrer policy or have bugs. My recommendation would be that @phuedx's team spends some time looking into improving the referrer policy below for browsers that do not understand it, or sees if any workarounds can be done.

<meta name="referrer" content="origin">
<meta name="referrer" content="origin-when-crossorigin">
<meta name="referrer" content="origin-when-cross-origin">
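To make the link between an empty Referer header and the "none" class concrete, here is a minimal sketch of how a request could be bucketed into the classes discussed in this task. It is not the actual referer_class code: the function name, the internal domain list and the search engine hints are all hypothetical and only serve as an illustration.

```python
from urllib.parse import urlsplit

# Hypothetical lists, for illustration only (not the production configuration).
INTERNAL_DOMAINS = ("wikipedia.org",)
SEARCH_ENGINE_HINTS = ("google.", "bing.", "duckduckgo.")


def classify_referer(referer):
    """Bucket a raw Referer header value into a coarse referrer class."""
    # Browsers that send nothing show up as an empty value or "-" in the logs.
    if not referer or referer == "-":
        return "none"
    try:
        host = urlsplit(referer).hostname
    except ValueError:
        return "unknown"
    if host is None:
        # Something was sent, but it does not parse as a URL with a host.
        return "unknown"
    host = host.lower()
    if any(host == d or host.endswith("." + d) for d in INTERNAL_DOMAINS):
        return "internal"
    if any(hint in host for hint in SEARCH_ENGINE_HINTS):
        return "external (search engine)"
    return "external"


for value in ("-", "https://en.wikipedia.org/", "https://www.google.com/",
              "https://example.org/page", "not a url"):
    print(repr(value), "->", classify_referer(value))
```

Under a scheme like this, an internal click from a browser that ignores all three meta tags and sends no Referer at all is indistinguishable from a direct hit: both land in "none".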

I am moving this ticket to "radar" on our kanban as I do not think it requires any further actions from analytics.

Nuria raised the priority of this task from High to Needs Triage. (Jul 17 2018, 4:53 PM)
Nuria removed a project: Analytics-Kanban.
Nuria added a project: Web-Team-Backlog.
Nuria moved this task from Data Quality to Radar on the Analytics board.
JKatzWMF renamed this task from Problems with external referrals? to % of "none" referers seems too high. (Nov 1 2018, 3:42 PM)

See also T211077 (TLDR: it looks like a lot of formerly "unknown" referrers on Chrome Mobile are now, since around September 13, classified as "external (search engine)")

FYI, the comment above is tied to the Android 8 upgrade. I think this ticket can be closed; it should be linked in the docs for future reference, as it does not really have any actions. "None" referrers are higher than you might see on other sites. Our bot traffic not labeled as such is as high as 7/10% on some days by our latest counts, and that is likely the biggest driver of "none" referrers.

@Nuria while bot detection certainly plays a role, I am nervous about classifying this as an issue that can be more or less fixed with better bot detection. Other sites have something like 5% of their external referrals coming from none + unknown (which includes direct) and we have 29% (in the last year). Our "referrals" from other sites is 14% of external traffic, for comparison. Even if 10% of TOTAL traffic was undetected bots (yikes), it wouldn't make up this difference. Also, the % of "none" traffic seems to be dropping over the last few years, while one would expect inflation over time of undetected bots (assuming a static bot definition).

Finally, in May a bug fix in Safari finally recategorized much of our "none" traffic to referral (T154702), causing "none" to drop from 36% of external traffic, suggesting that browser behavior is at least one major source of unknowns.

Not disagreeing that browsers play a major factor. Per the analysis above, referrer "none" is caused by 1) bots + 2) older browsers (or browsers that do not understand the referrer policy). We certainly would expect 2) to decrease over the years, and that matches recent findings. I am saying that without better bot detection you will not see the % of "none" that is due to undetected bots decrease.

Makes sense. Thanks for clarifying!

Not sure why I assigned this to myself…

I wanted to add a couple data points / hypotheses to this discussion:

  • Chrome Mobile Version 38 that Nuria mentions as #3 in T195880#4429156 is actually almost all Google Weblight Proxy (per an inspection of the full user-agent string: Mozilla/5.0 (Linux; Android 4.2.1; en-us; Nexus 5 Build/JOP40D) AppleWebKit/535.19 (KHTML, like Gecko; googleweblight) Chrome/38.0.1025.166 Mobile Safari/535.19).
  • Intuitively we would have a higher amount of none referer traffic than other sites due to people setting Wikipedia (or Special:Random) as their homepage.
    • This is further supported by the fact that there is almost no "none" traffic from Chrome Mobile (if you take out Google Weblight Proxy) and Chrome Mobile defaults either to no home page or just a blank "new tab". I don't have an iPhone but my understanding is that mobile Safari loads up the most recent page you visited (which would be Wikipedia a non-trivial amount of time).
  • Approximately 20% of these page views are coming from IP+UA pairs that have over 500 pageviews an hour -- suggesting either bot or shared VPN/proxy.
    • It's hard to judge what proportion of those page views are bots vs. VPNs/proxies, but when you define bots as >90% page views to a single project or a single article receiving >10% of page views, it looks like a quarter to a half of these page views could be VPNs and not bots.
  • Looking at the titles being read, we further see that while the Main Page generally gets less than 1% of page views, it receives closer to 10% of page views from None referrers (backing up Wikipedia as home page hypothesis).
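The heavy-hitter heuristic from the bullets above can be sketched roughly as follows. This is a sketch under assumptions, not the query I ran: the record fields (`ip`, `ua`, `hour`, `project`, `title`), the function name and the output labels are hypothetical. Only the thresholds (over 500 pageviews an hour, >90% of views to one project, one article with >10% of views) come from the comment.

```python
from collections import Counter, defaultdict


def label_heavy_actors(pageviews, heavy_threshold=500,
                       project_share=0.9, article_share=0.1):
    """Label IP+UA pairs with more than heavy_threshold pageviews in an hour.

    Each pageview is a dict with hypothetical keys:
    ip, ua, hour, project, title.
    """
    groups = defaultdict(list)
    for pv in pageviews:
        groups[(pv["ip"], pv["ua"], pv["hour"])].append(pv)

    labels = {}
    for key, views in groups.items():
        n = len(views)
        if n <= heavy_threshold:
            continue  # not a heavy actor in that hour, so it is left unlabeled
        # Number of views going to the single most-viewed project and article.
        top_project = Counter(v["project"] for v in views).most_common(1)[0][1]
        top_article = Counter(
            (v["project"], v["title"]) for v in views
        ).most_common(1)[0][1]
        bot_like = (top_project / n > project_share
                    or top_article / n > article_share)
        labels[key] = "bot-like" if bot_like else "possible VPN/proxy"
    return labels
```

A pair that sends 600 pageviews in an hour, all to one project, would come out as "bot-like", while 600 pageviews spread evenly over several projects and many titles would come out as "possible VPN/proxy". Real traffic is messier than these two labels suggest.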

Another data point that is interesting in this discussion: Youtube provides Wikipedia articles as fact-checks / context for a variety of conspiracy theories / state-sponsored broadcasting companies. For all of those Wikipedia article links, regardless of platform, they also provide a URL parameter that tells us that the person is coming from Youtube. This provides a rare opportunity to compare pageviews that have Youtube as a referrer with pageviews that we know came from Youtube. On top of that, I did some self-experimentation to see how the usage of different apps / browsers affects the Youtube referrer. The summary is that 40% of referrals from Youtube are None referrers and that this happens when the user starts in the Youtube app and switches to a mobile browser that is not Android+Chrome. This is not going to fully apply to every app, as they each handle referrers differently, but it does provide support that app traffic often comes through as None referrers. It is hard to know how big a piece of the pie this is, though. The None traffic part is about 200 thousand per day for Youtube, and other apps presumably produce similar or higher traffic counts.
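The Youtube comparison boils down to taking pageviews whose URL carries the known parameter and looking at how their referrer classes are distributed. A rough sketch follows, with the caveat that the parameter name and value used here (`src=yt`), the record fields and the function name are placeholders for illustration and not the real ones.

```python
from collections import Counter
from urllib.parse import parse_qs, urlsplit


def referer_class_shares(pageviews, param, value):
    """Share of each referrer class among pageviews tagged with param=value.

    Each pageview is a dict with hypothetical keys: url, referer_class.
    """
    tagged = [
        pv for pv in pageviews
        if value in parse_qs(urlsplit(pv["url"]).query).get(param, [])
    ]
    counts = Counter(pv["referer_class"] for pv in tagged)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {cls: n / total for cls, n in counts.items()}


# Toy input only, to show the shape of the data; not real traffic.
sample = [
    {"url": "https://en.wikipedia.org/wiki/A?src=yt", "referer_class": "none"},
    {"url": "https://en.wikipedia.org/wiki/A?src=yt", "referer_class": "none"},
    {"url": "https://en.wikipedia.org/wiki/B?src=yt", "referer_class": "external"},
    {"url": "https://en.wikipedia.org/wiki/B?src=yt", "referer_class": "external"},
    {"url": "https://en.wikipedia.org/wiki/B?src=yt", "referer_class": "external"},
    {"url": "https://en.wikipedia.org/wiki/C", "referer_class": "none"},
]
result = referer_class_shares(sample, "src", "yt")
print(result)
```

The untagged pageview in the toy input is ignored, so the shares describe only traffic that is known to have come from the tagged links.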

I'll also add the reference to the thread about labeling automated traffic so that this ticket is more useful as a reference: https://lists.wikimedia.org/pipermail/analytics/2020-May/006850.html

ovasileva subscribed.

Closing this as it seems it hasn't been updated in the past four years