
% of "none" referers seems too high
Open, Needs Triage · Public · 3 Story Points

Description

(from @JKatzWMF , via email:)

I think there is possibly something wrong with how referers are currently classified:

If you look at the data, the none/internal are closely coupled and the external/unknown are as well (see link or chart below).

Based on user behavior I would expect the internal referrals to be highly correlated to the total number of non-internal (as the pages per session remain roughly constant). However, I don't expect that the internal pageviews are so tightly correlated and nearly equivalent to the "none". This consistent near equivalence seems like a very odd coincidence. The new distinction between the two in the last month is due to a known mobile Safari-related bug being resolved, but I don't think it explains what we're seeing.

Aside from being tightly coupled, the % of "none" seems incredibly high to me at 35% of total traffic. If you look at external-only traffic, that says that for every 6 google referred pages there are 4 unknown links (email/app links, direct urls). For another website I volunteer for, the % of known external traffic is 95%. For us, it is 60%. I know we don't have the tools/infrastructure that uncle Google has, but this seems to be a very big difference.

Note: All the %s I derived are from exporting the turnilo link shared above, using division and eyeballing a mostly flat daily % over the course of 2 years.

Event Timeline

Nuria created this task.May 29 2018, 4:13 PM
Restricted Application added a subscriber: Aklapper.May 29 2018, 4:13 PM
Nuria claimed this task.May 29 2018, 4:14 PM
Nuria added a project: Analytics-Kanban.

From @JAllemandou 's e-mail:

I can't find change-reasons for this artifact:

  • We deployed code on the 10th (not the 9th), but for code that doesn't impact referer_class
  • We did change the referer_class code, but we deployed it beginning of May, not April (5th to be precise).
Nuria added a comment.EditedMay 31 2018, 10:51 AM

I am not sure I understand the comments around "correlation", but what happened around late March is the launch of a new version of Safari that supports our meta tag for referrers, so some hits that were previously classified as "unknown" or "none" are now classified as "internal".

See: https://phabricator.wikimedia.org/T154702

Nuria set the point value for this task to 3.May 31 2018, 10:52 AM
Nuria moved this task from Next Up to In Progress on the Analytics-Kanban board.
Nuria triaged this task as High priority.May 31 2018, 4:38 PM
Nuria moved this task from Incoming to Data Quality on the Analytics board.

@Nuria Sorry for the confusion. (context for anyone who didn't see the email exchange: I wrote the email quoted above). I wasn't suggesting the issues with referrer class were related to the updated header project, I was asking if the changes you were making might fix what I believe are pre-existing issues. Specifically:

If you look at the data, the none/internal are closely coupled and the external/unknown are as well (see link or chart below). Based on user behavior I would expect the internal referrals to be highly correlated to the total number of non-internal, but the near equivalence seems like a very odd coincidence.

Also, the number of unknowns and none seem very high to me.

Regardless of whether or not they were in scope for the project, I should probably just file a ticket or change this ticket to be a request to look into it?

From @JAllemandou 's e-mail:

[...]

  • We did change the referer_class code, but we deployed it beginning of May, not April (5th to be precise).

Does that refer to T191714: Add Ecosia and Startpage to list of search engines (which according to this log was deployed on May 2), or to some other change that could have affected the data as well?

JKatzWMF updated the task description.Jun 6 2018, 12:51 AM
Nuria added a comment.EditedJun 6 2018, 1:05 AM

Talked to @JKatzWMF and edited the premise of the ticket to better describe the issue. No changes have been made to referrers as of late (recent changes just removed duplicated code), so the issues must be preexisting. Will take a look.

Nuria moved this task from In Progress to Paused on the Analytics-Kanban board.Jun 11 2018, 10:56 PM
Vvjjkkii renamed this task from Problems with external referrals? to 71baaaaaaa.Jul 1 2018, 1:07 AM
Vvjjkkii removed Nuria as the assignee of this task.
Vvjjkkii updated the task description.
Vvjjkkii removed the point value for this task.
Vvjjkkii removed a subscriber: Aklapper.

Hi @Nuria just following up on this to see if you have had a chance to take a look.

Nuria added a comment.EditedJul 12 2018, 11:14 PM

Let's see, the graph you attached shows the "number of pageviews per referral class per day for 2 years for wikipedias". So in a day like for example: June 25th 2017 you have 44 million pageviews tagged with "external -search engine" referrer, 36 million tagged with "none" and 28 million tagged with "internal" and 2.4 million tagged with "external"
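
The percentages discussed in this task can be reproduced from the four counts quoted above. The sketch below only does that division; the counts are the June 25th 2017 figures from this comment (in millions), while the `shares` helper and the class labels used as dictionary keys are illustrative names, not part of any Analytics tooling.

```python
# Pageview counts for June 25th 2017, in millions, as quoted in the comment above.
counts = {
    "external (search engine)": 44.0,
    "none": 36.0,
    "internal": 28.0,
    "external": 2.4,
}


def shares(counts, exclude=()):
    """Return each class's share of the total, in percent, rounded to 1 decimal.

    Classes listed in `exclude` are dropped before the total is computed.
    """
    kept = {k: v for k, v in counts.items() if k not in exclude}
    total = sum(kept.values())
    return {k: round(100 * v / total, 1) for k, v in kept.items()}


# Share of all pageviews per referrer class.
overall = shares(counts)

# Share of non-internal pageviews only (the "external-only" view in the description).
non_internal = shares(counts, exclude=("internal",))

for name, pct in overall.items():
    print(f"{name}: {pct}% of all pageviews")
for name, pct in non_internal.items():
    print(f"{name}: {pct}% of non-internal pageviews")
```

On that single day "none" comes out at roughly a third of all pageviews and a bit over two fifths of non-internal pageviews, which is in line with the ballpark figures in the task description.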

I do not understand the "coupling" part. Since your main signal in this graph is "number of pageviews", it stands to reason that all lines fluctuate at the same time. That is, count(pageviews) when referrer="external" decreases or increases at the same frequency as count(pageviews) when referrer="internal", if that makes sense. Maybe I am totally missing your question.

As to the percentages of referrers and how much of our percentage of traffic is external versus internal this is what I see in raw data: https://bit.ly/2zBBe4o
Notice the dataset: it is "raw data", not "pageviews". The graph shows the last 6 hours of requests and incoming referrers (notice the trick to split referrers via a filter regex). As you mentioned, the traffic with no referrer is very significant; note that these records show the actual requests we get. While referrer "-" is high percentage-wise, I really have no reason to think that it is not correct. Let me know if you can think of one.

Main culprits (browser-wise) of sending "empty referrers": https://bit.ly/2zBP1Ig

Nuria moved this task from Paused to In Progress on the Analytics-Kanban board.Jul 12 2018, 11:16 PM
Nuria added a comment.Jul 13 2018, 7:06 PM

Nuria to look into the differences between "unknown" and "none"
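
For readers unfamiliar with the distinction, here is a minimal sketch of how the five classes named in this task could be told apart. The rules are assumptions made for illustration and are not the deployed referer_class code: it assumes "none" means an empty or "-" referrer, "unknown" means a referrer that cannot be parsed as an http(s) URL, and it uses two hypothetical, deliberately tiny domain lists.

```python
from urllib.parse import urlparse

# Hypothetical lists, for illustration only; the real lists are much longer.
INTERNAL_SUFFIXES = ("wikipedia.org", "wikimedia.org")
SEARCH_ENGINE_HINTS = ("google.", "bing.", "duckduckgo.")


def classify_referer(referer):
    """Assign one of the five referrer classes under the assumptions above."""
    # Assumption: a missing, empty or "-" referrer is what ends up as "none".
    if referer is None or referer.strip() in ("", "-"):
        return "none"
    try:
        parsed = urlparse(referer.strip())
        host = parsed.hostname
    except ValueError:
        # urlparse rejects some malformed inputs (e.g. a broken IPv6 host).
        return "unknown"
    # Assumption: anything that is not a parseable http(s) URL is "unknown".
    if parsed.scheme not in ("http", "https") or not host:
        return "unknown"
    if any(host == s or host.endswith("." + s) for s in INTERNAL_SUFFIXES):
        return "internal"
    if any(hint in host for hint in SEARCH_ENGINE_HINTS):
        return "external (search engine)"
    return "external"


for example in ["-", "not a url", "https://en.wikipedia.org/wiki/Foo",
                "https://www.google.com/", "https://example.com/page"]:
    print(example, "->", classify_referer(example))
```

Under these assumptions, a browser that drops the referrer on an internal click produces a "none" hit rather than an "internal" one, which is the effect discussed in the rest of this task.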

Nuria added a subscriber: phuedx.EditedJul 16 2018, 11:28 PM

Per conversation with @JKatzWMF

  • we do not believe that there is an issue with "correlations" of referrers
  • it is kind of odd that so much of our traffic is tagged as having no referrer; that would mean that a big percentage of our traffic is not coming from an external source like a search engine but rather is a "direct" hit.

From my research, traffic ends up tagged with referrer equal to "none" for several different reasons:

  1. Browsers that do not understand our referrer policy or have bugs:

As can be seen here: https://bit.ly/2NkH6ko the biggest culprit in terms of browsers that send no referrer is Safari. After the late-March release of WebKit that fixed the issue (https://phabricator.wikimedia.org/T154702), matters have improved for this browser and requests with "referrer" == none have dropped quite a bit. Older versions of Safari do not understand our referrer policy, so much of this traffic marked as "none" will remain as is.

  2. Bot traffic not labelled as such

The other major culprit is IE. Requests were wrongly classified as IE7 before but now appear mostly as IE11. A significant percentage of this no-referrer traffic corresponds to well-known bots (not tagged as such) that originate in the Middle East. See:

A small part of this traffic might be legitimate IE traffic; IE is known to have issues with referrers.

  3. Webviews?

Chrome Mobile version 38 is the other major culprit of no-referrer traffic: https://bit.ly/2Ln6fuk. If you dig deeper, this version of Chrome is probably a Chrome Webview in an older version of Android 4. Android 4 was the first one that shipped Chrome Webviews (https://developer.chrome.com/multidevice/webview/overview), so this version likely has issues reporting referrers: https://bit.ly/2uFMSoW. The newer versions of Android also seem to be represented here, but version 38 is not represented if you look at internal referrers for Chrome Mobile: https://bit.ly/2L3M20F. So this is a mix of actual direct hits via webviews and a widely used browser (Chrome Mobile 38) that never sends referrers.

Actions:
See the referrer policy below. For internal hits inside Wikipedia, browsers that understand none of these three values will send an empty referrer. It is likely that a significant portion of the traffic tagged as "none" belongs to category 1) Browsers that do not understand our referrer policy or have bugs. My recommendation would be that @phuedx's team spend some time looking into improving the referrer policy below for browsers that do not understand it, or see if any workarounds are possible.

<meta name="referrer" content="origin">
<meta name="referrer" content="origin-when-crossorigin">
<meta name="referrer" content="origin-when-cross-origin">
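
The three tags above act as a fallback chain. The sketch below models that under one assumption: a browser applies the last meta referrer value it recognizes and ignores the rest. The `effective_policy` function and the three per-browser support sets are hypothetical and only illustrate the mechanism; they do not describe what any specific browser version supports.

```python
# The three values from the meta tags above, in document order.
META_VALUES = ["origin", "origin-when-crossorigin", "origin-when-cross-origin"]


def effective_policy(meta_values, supported):
    """Return the last value (in document order) that the browser recognizes.

    Returns None when the browser recognizes none of them, which is the case
    the comment above associates with an empty referrer.
    """
    chosen = None
    for value in meta_values:
        if value in supported:
            chosen = value  # a later recognized value overrides an earlier one
    return chosen


# Hypothetical support sets, for illustration only.
current_browser = {"origin", "origin-when-cross-origin"}
legacy_spelling_browser = {"origin", "origin-when-crossorigin"}
no_support_browser = set()

print(effective_policy(META_VALUES, current_browser))
print(effective_policy(META_VALUES, legacy_spelling_browser))
print(effective_policy(META_VALUES, no_support_browser))
```

Listing the most widely understood value first and the preferred one last means each browser ends up with the most specific policy it can handle, and only browsers that recognize none of the values fall through.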

I am moving this ticket to "radar" on our kanban as I do not think it requires any further actions from analytics.

Nuria raised the priority of this task from High to Needs Triage.Jul 17 2018, 4:53 PM
Nuria removed a project: Analytics-Kanban.
Nuria added a project: Readers-Web-Backlog.
Nuria moved this task from Data Quality to Radar on the Analytics board.
phuedx claimed this task.Aug 22 2018, 3:56 PM
JKatzWMF renamed this task from Problems with external referrals? to % of "none" referers seems too high.Nov 1 2018, 3:42 PM
Tbayer updated the task description.Dec 7 2018, 4:57 AM

See also T211077 (TLDR: it looks like a lot of formerly "unknown" referrers on Chrome Mobile are now, since around September 13, classified as "external (search engine)")

Nuria added a comment.Dec 7 2018, 3:57 PM

FYI, the comment above is tied to the Android 8 upgrade. I think this ticket can be closed; it should be linked in the docs for future reference, as it does not really have any actions. "None" referrers are higher than you might see on other sites. Our bot traffic not labeled as such is as high as 7/10% on some days by our latest counts, and that is likely the biggest driver of "none" referrers.

@Nuria while bot detection certainly plays a role, I am nervous about classifying this as an issue that can be more or less fixed with better bot detection. Other sites have something like 5% of their external referrals coming from none + unknown (which includes direct) and we have 29% (in the last year). Our "referrals" from other sites is 14% of external traffic, for comparison. Even if 10% of TOTAL traffic was undetected bots (yikes), it wouldn't make up this difference. Also, the % of "none" traffic seems to be dropping over the last few years, while one would expect inflation over time of undetected bots (assuming a static bot definition).

Finally, in May a bug fix in Safari finally recategorized much of our "none" traffic to referral (T154702), causing "none" to drop from 36% of external traffic, suggesting that browser behavior is at least one major source of unknowns.

Nuria added a comment.Dec 7 2018, 5:57 PM

Not disagreeing that browsers are a major factor. Per the analysis above, referrer "none" is caused by 1) bots + 2) older browsers (or browsers that do not understand the referrer policy). We certainly would expect 2) to decrease over the years, and that matches recent findings. I am saying that w/o better bot detection you will not see the % of "none" that is due to undetected bots decrease.

Makes sense. Thanks for clarifying!

phuedx removed phuedx as the assignee of this task.Dec 11 2018, 11:58 AM

Not sure why I assigned this to myself…