
webrequest dataset sets referer_class "unknown" instead of "external (search engine)" for origin-based referer values
Closed, Declined · Public

Description

While investigating something for T214998, I used referer LIKE '%google%' to crudely approximate Google referrals (covering google.com, www.google.com, and country-specific domains such as google.es and google.nl). I then noticed that many of these were categorised not as referer_class='external (search engine)' but as referer_class='unknown'.

krinkle@stat1011$ hive
SELECT
  COUNT(*) AS cnt,
  referer_class,
  referer
FROM wmf.webrequest
WHERE
  year = 2025 AND month = 1 AND day = 5
  AND (is_pageview=true OR is_redirect_to_pageview=true)
  AND referer_class != 'external (search engine)'
  AND referer LIKE '%google%'
GROUP BY
  referer_class,
  referer
ORDER BY
  cnt DESC
LIMIT 10;

Using yesterday (5 Jan 2025) as an example, this 24-hour window contains about 10 million pageviews and redirects-to-pageviews affected by this issue.

cnt	referer_class	referer
9895226	unknown	www.google.com
…
…
6365	unknown	google.com
…
105	external	https://blog.google/

Looking at the W3C spec for Referrer-Policy, I don't see a way to get this behaviour through a policy. Looking at the IETF spec for the Referer HTTP header, such a scheme-less value does not appear to be strictly valid, but it is clearly common enough to care about.
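
To illustrate why a scheme-less value such as www.google.com can fall through host-based classification, here is a minimal Python sketch. It is not the refinery implementation; `referer_host` and its `require_scheme` flag are hypothetical names made up for this example. It only shows how a standard URL parser treats the two forms of the value.

```python
from urllib.parse import urlsplit


def referer_host(referer, require_scheme=True):
    """Return the lowercase host of a Referer value, or None if none is found.

    Hypothetical helper for illustration only; not the refinery code.
    """
    parts = urlsplit(referer)
    # Normal case: an absolute http(s) URL with a host component.
    if parts.scheme in ("http", "https") and parts.hostname:
        return parts.hostname
    if require_scheme:
        # Strict mode: anything without an http(s) scheme has no usable host.
        return None
    # Lenient mode (an assumption, not documented behaviour): treat a bare
    # "host[/path]" value as if it were a scheme-relative URL.
    if not parts.scheme and not referer.startswith("/"):
        return urlsplit("//" + referer).hostname
    return None


print(referer_host("https://www.google.com/"))               # www.google.com
print(referer_host("www.google.com"))                        # None
print(referer_host("www.google.com", require_scheme=False))  # www.google.com
```

In strict mode the bare hostname yields no host, which is consistent with the 'unknown' rows above. The lenient mode shows one possible way to recover the host, at the cost of accepting values without a scheme.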

I broke it down by user agent to find a likely source or cause.

SELECT
  COUNT(*) AS cnt,
  referer_class,
  referer,
  user_agent_map["browser_family"] AS browser
FROM wmf.webrequest
WHERE
  year = 2025 AND month = 1 AND day = 5 AND hour = 10
  AND (is_pageview=true OR is_redirect_to_pageview=true)
  AND referer_class != "external (search engine)"
  AND referer LIKE '%google%'
GROUP BY
  referer_class,
  referer,
  user_agent_map["browser_family"]
ORDER BY
  cnt DESC
LIMIT 10;

cnt	referer_class	referer	browser
9888357	unknown	www.google.com	Firefox
6810	unknown	www.google.com	Chrome Mobile WebView
…
1261	unknown	google.com	Chrome Mobile
1236	unknown	google.com	Samsung Internet
1234	unknown	google.com	Mobile Safari
1206	unknown	google.com	Chrome Mobile iOS
617	unknown	google.com	Chrome
579	unknown	google.com	Edge
…
223	unknown	google.com	Firefox
…
55	unknown	www.google.com	Chrome

I don't know how our search engine referral dashboard is built. I considered whether it might use referer_data instead of referer_class, so I spot-checked that on a few rows as well. Alas, no: referer_data carries the same "unknown" class. I take it referer_class is probably derived from referer_data, but I wanted to check, just in case.

SELECT
  referer_class,
  referer,
  referer_data
FROM wmf.webrequest
WHERE
  year = 2025 AND month = 1 AND day = 5 AND hour = 10
  AND (is_pageview=true OR is_redirect_to_pageview=true)
  AND referer_class != "external (search engine)"
  AND referer='www.google.com'
LIMIT 10;

referer_class	referer	referer_data
unknown	www.google.com	{"referer_class":"unknown","referer_name":"none"}
unknown	www.google.com	{"referer_class":"unknown","referer_name":"none"}
unknown	www.google.com	{"referer_class":"unknown","referer_name":"none"}
unknown	www.google.com	{"referer_class":"unknown","referer_name":"none"}
unknown	www.google.com	{"referer_class":"unknown","referer_name":"none"}
unknown	www.google.com	{"referer_class":"unknown","referer_name":"none"}
unknown	www.google.com	{"referer_class":"unknown","referer_name":"none"}
unknown	www.google.com	{"referer_class":"unknown","referer_name":"none"}
unknown	www.google.com	{"referer_class":"unknown","referer_name":"none"}
unknown	www.google.com	{"referer_class":"unknown","referer_name":"none"}

Going by the browser breakdown above, it seems all browsers send some of these, but Firefox disproportionately so.

This might be the effect of a privacy-related browser extension that perhaps Firefox users are more likely to have installed, and that various Firefox-based browsers may even have pre-installed, such as Tor Browser, Waterfox, and LibreWolf.

Event Timeline

Change #1198313 had a related patch set uploaded (by Joal; author: Joal):

[analytics/refinery/source@master] Update referer classification patterns

https://gerrit.wikimedia.org/r/1198313

Just commenting here too as a duplicate of T406531#11303547: I personally would leave in place the expectation that referrers start with http or https. My read is that that behavior is largely coming from bots who are improperly mocking up a referrer. I don't see nearly the volume that Krinkle saw in January and the majority of it is being labeled as automated (query below). Given that it's also not acceptable behavior per the specs, I'd lean towards us enforcing the expectation of having legitimate referers as a further check against bot data.

spark.sql("""
SELECT
  agent_type,
  COUNT(*) AS num_requests
FROM wmf.pageview_actor
WHERE
  year = 2025 AND month = 10 AND day = 18
  AND referer LIKE "www.%"
  AND is_pageview
GROUP BY
  agent_type
ORDER BY
  num_requests DESC
LIMIT 500
""").show(500, False)

+----------+------------+
|agent_type|num_requests|
+----------+------------+
|automated |4683        |
|user      |2233        |
|spider    |930         |
+----------+------------+
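
As a quick sanity check on "the majority", the shares implied by the counts above can be worked out as follows. This is plain arithmetic on the three rows printed above, nothing more. Note that the query matches any referer starting with www., not only Google ones.

```python
# Counts copied from the agent_type table above (18 Oct 2025, referer LIKE "www.%").
counts = {"automated": 4683, "user": 2233, "spider": 930}

total = sum(counts.values())

# Percentage share of each agent_type, rounded to one decimal place.
shares = {agent: round(100 * n / total, 1) for agent, n in counts.items()}

# Combined share of traffic not labelled as a human user.
non_user = round(100 * (counts["automated"] + counts["spider"]) / total, 1)

print(total)     # 7846
print(shares)    # {'automated': 59.7, 'user': 28.5, 'spider': 11.9}
print(non_user)  # 71.5
```

So roughly 60% of these requests are labelled automated, and a bit over 70% are automated or spider traffic.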

+1 to @Isaac. I re-ran the query that @Krinkle ran, using pageview_actor, for one day each in Sep and Oct 2025. Most of the requests are now tagged as 'automated'. I think this could be a result of the new heuristic we put into place in T395934.

image.png (790×606 px, 96 KB)

Based on my observations:

  1. The proportion of non-https requests is much smaller than that of the ones correctly tagged as search engine referers.
  2. Of the ones tagged as external or unknown, the majority are automated, i.e. user pageviews aren't impacted.
  3. Looking at browser_family, the automated requests mostly come from the 'Chrome' or 'Other' browser families. Firefox isn't dominating this anymore.

Boldly declining this based on the feedback above. Please re-open as needed.