Overview
I was evaluating the quality of referer_class data in webrequest, specifically the external referers (excluding search engines). I noticed a fair bit of noise in the referers that would be useful to remove: I estimate that about 40% of external pageviews actually come from search engines, another 10% from spam/virus sites, and 5% from Google Translate. It would be great to fix this, as being able to accurately track referers is quite important to some work that I'm starting around external re-use of Wikipedia content.
Search Engines as External Referers
I collected a sample of external referers from 13 November 2019. This was an arbitrary date; while I expect the relative share of each referer to shift somewhat on a different day, I suspect the overall picture is largely stable. I ran a very similar query back in February as part of an initial exploration and reached largely the same conclusions.
Query for the top sites counted as external, plus cookie data to help spot undetected bots:
```sql
SELECT
  w.host AS host,
  COUNT(w.host) AS num_referrals,
  SUM(w.new_user) AS num_new_users,
  SUM(w.same_day) AS num_same_days
FROM (
  SELECT
    parse_url(referer, 'HOST') AS host,
    IF(x_analytics_map['WMF-Last-Access-Global'] IS NULL, 1, 0) AS new_user,
    IF(x_analytics_map['WMF-Last-Access-Global'] = '13-Nov-2019', 1, 0) AS same_day
  FROM wmf.webrequest
  WHERE year = 2019
    AND month = 11
    AND day = 13
    AND is_pageview
    AND agent_type = 'user'
    AND referer_class = 'external'
) w
GROUP BY w.host
ORDER BY num_referrals DESC
LIMIT 5000;
```
Recommendations:
- There are many search engines that have search in their hostname that are not classified as search engines. A single regex would grab a very large number of them, though at the cost of at least a few false positives. I took a look at that here and it seems that websites with "research" in their domain such as www.researchgate.net are the main false positive: https://docs.google.com/spreadsheets/d/1-8cnEcb4GWit9-TXtUEQVm1DVJrAMlEcSDE8k2j-ZN0/edit#gid=55125331
- If we don't want to go this general route, I would advocate for at least adding the following search engines, due to their high volume and, in some cases, country-specificity, which would otherwise mean much heavier skew in the data for those regions:
- Naver (search.naver.com): 10% of external referrals; common in South Korea
- Docomo (search.smt.docomo.ne.jp): 6% of external referrals; common in Japan
- Qwant (qwant.com): 4% of external referrals; common across Europe
- Daum (search.daum.net): 3% of external referrals; common in South Korea
- MyWay (search.myway.com): 3% of external referrals; common in US
- AU (search.auone.jp): 2% of external referrals; common in Japan
- Seznam (search.seznam.cz): 2% of external referrals; common in Czech Republic
- Lilo (search.lilo.org): 1% of external referrals; common in France
- Coc coc (coccoc.com): 1% of external referrals; common in Vietnam
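To make the "search in hostname" heuristic above concrete, here is a minimal Python sketch of one way to write that regex. This is an illustrative assumption, not the production classifier: the pattern requires the label "search" to start at the beginning of the host or after a dot or hyphen, which is what keeps "research" domains like www.researchgate.net from matching.

```python
import re

# Hypothetical sketch: flag hosts where 'search' begins a label, so that
# 'research' (where 'search' is preceded by 'e') does not match.
SEARCH_HOST = re.compile(r'(^|[.\-])search')

def looks_like_search_engine(host):
    """Return True if the hostname looks like a search engine per the heuristic."""
    return bool(SEARCH_HOST.search(host.lower()))

for h in ['search.naver.com', 'search.smt.docomo.ne.jp',
          'www.researchgate.net', 'search.seznam.cz']:
    print(h, looks_like_search_engine(h))
```

A boundary-anchored pattern like this trades a little recall (e.g., it would miss a host like mysearch.com) for avoiding the "research" false positives noted in the spreadsheet.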
Google Translate as External Referer
Google Translate makes up 5% of external referrals worldwide but is heavily used in certain regions like Indonesia, skewing the data heavily there. Its share can also spike as Google tries different ways of leading search users to translations. It would therefore be nice to be able to map Google Translate referrals to their true source.
Query for evaluating Google Translate referrals from another arbitrary day:
```sql
SELECT
  continent,
  referer_class,
  SUM(toledo) AS num_toledo,
  SUM(google_search) AS num_gsearch,
  COUNT(1) AS total
FROM (
  SELECT
    geocoded_data['continent'] AS continent,
    referer_class,
    IF(referer LIKE '%client=srp%', 1, 0) AS toledo,
    IF(referer LIKE '%prev=search%', 1, 0) AS google_search
  FROM webrequest
  WHERE year = 2019
    AND month = 11
    AND day = 25
    AND x_analytics_map['translationengine'] IS NOT NULL
    AND is_pageview
    AND agent_type = 'user'
) w
GROUP BY continent, referer_class;
Recommendations:
- Right now about 75% of Google Translate referrals are classified as external, comprising 5% of all external referrals. Digging deeper into the referer URLs for that 75%, the URL parameter prev=search seems to indicate that the user came from Google Search, and client=srp indicates that it was an automatically translated page in the search results (see T212414#4996923). Together, these two parameters suggest that almost 90% of Google Translate referrals labeled as external are actually coming from Google Search. I am not sure whether there is an efficient way to handle this, though.
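As a sketch of what reclassification based on those two parameters could look like, here is a small Python function that inspects a referer URL's query string. The function name and the example URL are hypothetical; the prev=search / client=srp interpretation is the one described above.

```python
from urllib.parse import urlparse, parse_qs

def classify_translate_referer(referer):
    """Reclassify a Google Translate referer based on its query parameters.

    Hypothetical sketch: client=srp -> auto-translated search result,
    prev=search -> user came from Google Search, otherwise leave as external.
    """
    params = parse_qs(urlparse(referer).query)
    if params.get('client') == ['srp']:
        return 'search (auto-translated result)'
    if params.get('prev') == ['search']:
        return 'search'
    return 'external'

# Illustrative URL, not taken from real traffic:
example = ('https://translate.google.com/translate'
           '?sl=en&tl=id&u=https://en.wikipedia.org/wiki/Jakarta&prev=search')
print(classify_translate_referer(example))  # search
```

Parsing the query string properly (rather than the LIKE '%prev=search%' substring match used in the exploratory query) would avoid accidental matches inside the translated page's own URL.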
Virus / Spam sites
There are a number of referers that follow the pattern <sport>-<random characters>.site (e.g., www.motorsport-b9f4a06e.site) that are clearly bots and collectively make up close to 10% of external referral traffic. I saw them back in February as well, so this is an ongoing issue. I assume it does not make sense to attempt to individually blacklist these sites, but I wanted to document them as potentially problematic in analyses. There are also a number of virus sites that co-opt people's browsers and generate clicks. Again, there's no obvious solution for these, but the virus sites thankfully do not account for much traffic and can probably be ignored. It's also likely that methods like those mentioned for bot identification in pageview_hourly (T238357) could be applied here after the fact (e.g., a high proportion of requests with no cookies).
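For documentation purposes, a pattern generalizing from the one example above could look like the sketch below. Note the eight-hex-character suffix is an assumption extrapolated from www.motorsport-b9f4a06e.site alone and would need checking against more of the hosts in the sample before being used anywhere.

```python
import re

# Hypothetical pattern for the <sport>-<random characters>.site bot family.
# The {8} hex-digit suffix is inferred from a single example, not verified.
BOT_SITE = re.compile(r'^(www\.)?[a-z]+-[0-9a-f]{8}\.site$')

def is_suspect_bot_host(host):
    """Return True if the host matches the suspected bot-site pattern."""
    return bool(BOT_SITE.match(host.lower()))

print(is_suspect_bot_host('www.motorsport-b9f4a06e.site'))  # True
print(is_suspect_bot_host('search.naver.com'))              # False
```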
What next?
I'd appreciate some help deciding on the best path forward and how to implement it. The best example I could find of a similar task is this one for adding Ecosia / Startpage to the search engine list: T191714