
Improve quality of external referer data
Closed, ResolvedPublic

Description

Overview

I was evaluating the quality of referer_class data in webrequests, looking specifically at the external referers (not including search engines). I noticed a fair bit of noise in the referers that would be useful to remove. I estimate that about 40% of external pageviews are actually search engines, another 10% are spam/virus sites, and 5% are Google Translate. It would be great to fix this, as being able to accurately track referers is quite important to some work that I'm starting around external re-use of Wikipedia content.

Search Engines as External Referrers

I collected a sample of external referrers from 13 November 2019. This was an arbitrary date and while I expect the relative amount of each referrer to shift if I chose a different day, I suspect the data is largely stable. I ran a very similar query back in February as part of an initial exploration and reached largely the same conclusions.

Query for the top sites being counted as external referers, along with cookie data to help identify undetected bots:

SELECT w.host AS host,
       count(w.host) AS num_referrals,
       sum(w.new_user) AS num_new_users,
       sum(w.same_day) AS num_same_days
  FROM (
       SELECT parse_url(referer, 'HOST') AS host,
              (IF(x_analytics_map['WMF-Last-Access-Global'] IS NULL, 1, 0)) AS new_user,
              (IF(x_analytics_map['WMF-Last-Access-Global'] = '13-Nov-2019', 1, 0)) AS same_day
         FROM wmf.webrequest
        WHERE year = 2019 AND month = 11 AND day = 13
              AND is_pageview
              AND agent_type = 'user'
              AND referer_class = 'external'
       ) w
 GROUP BY w.host
 ORDER BY num_referrals DESC
 LIMIT 5000;

Results: https://docs.google.com/spreadsheets/d/1-8cnEcb4GWit9-TXtUEQVm1DVJrAMlEcSDE8k2j-ZN0/edit#gid=1444705399

Recommendations:

  • There are many search engines that have search in their hostname that are not classified as search engines. A single regex would grab a very large number of them, though at the cost of at least a few false positives. I took a look at that here and it seems that websites with "research" in their domain such as www.researchgate.net are the main false positive: https://docs.google.com/spreadsheets/d/1-8cnEcb4GWit9-TXtUEQVm1DVJrAMlEcSDE8k2j-ZN0/edit#gid=55125331
  • If we don't want to go this general route, I would advocate for at least the following search engines being added due to their high volume and, in some cases, country-specificity that would mean much higher skew in the data for those regions:
    • Naver (search.naver.com): 10% of external referrals; common in South Korea
    • Docomo (search.smt.docomo.ne.jp): 6% of external referrals; common in Japan
    • Qwant (qwant.com): 4% of external referrals; common across Europe
    • Daum (search.daum.net): 3% of external referrals; common in South Korea
    • MyWay (search.myway.com): 3% of external referrals; common in the US
    • AU (search.auone.jp): 2% of external referrals; common in Japan
    • Seznam (search.seznam.cz): 2% of external referrals; common in the Czech Republic
    • Lilo (search.lilo.org): 1% of external referrals; common in France
    • Coc coc (coccoc.com): 1% of external referrals; common in Vietnam
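
The host-level heuristic described above (match "search" in the hostname while excluding "research" hosts like www.researchgate.net) can be sketched in Python. This is an illustration only, not the exact regex proposed in the linked sheet:

```python
# Illustrative sketch (not the exact regex from the linked sheet) of the
# host-level heuristic: flag a referer host as a search engine when it
# contains "search", ignoring occurrences that are part of "research"
# (e.g., www.researchgate.net).
def looks_like_search_engine(host: str) -> bool:
    host = host.lower()
    # Drop "research" first so e.g. researchgate.net is not flagged;
    # hosts like search.naver.com still contain a bare "search".
    return 'search' in host.replace('research', '')

for h in ('search.naver.com', 'www.researchgate.net', 'qwant.com'):
    print(h, looks_like_search_engine(h))
```

Note that engines without "search" in the host (e.g., qwant.com) would still need the explicit list above.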

Google Translate as External Referer

Google Translate makes up 5% of referrals worldwide but is heavily used in certain regions like Indonesia, skewing the data there considerably. It can also spike as Google tries different ways of leading search users to translations. As a result, it would be nice to be able to map Google Translate referrals appropriately.

Query for evaluating Google Translate referrals from another arbitrary day:

SELECT continent,
       referer_class,
       sum(toledo) AS num_toledo,
       sum(google_search) AS num_gsearch,
       count(1) AS total
  FROM (
       SELECT geocoded_data['continent'] AS continent,
              referer_class,
              (IF(referer LIKE '%client=srp%', 1, 0)) AS toledo,
              (IF(referer LIKE '%prev=search%', 1, 0)) AS google_search
         FROM webrequest
        WHERE year = 2019 AND month = 11 AND day = 25
              AND x_analytics_map['translationengine'] IS NOT NULL
              AND is_pageview
              AND agent_type = 'user'
       ) w
 GROUP BY continent,
          referer_class;

Results: https://docs.google.com/spreadsheets/d/1-8cnEcb4GWit9-TXtUEQVm1DVJrAMlEcSDE8k2j-ZN0/edit#gid=1282449904

Recommendations:

  • Right now about 75% of Google Translate referrals are classified as external, comprising 5% of all external referrals. Digging deeper into the referer URL for that 75%, the URL parameter prev=search seems to indicate that the user came from Google Search, and client=srp indicates that it was an automatically-translated page in the search results (see T212414#4996923). Together, these two parameters suggest that almost 90% of Google Translate referrals that are labeled as external are actually coming from Google Search. I am not sure whether there is an efficient way to handle this, though.
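
The parameter check described above can be sketched in Python (illustrative only; the production classifier lives in the Java analytics/refinery code):

```python
from urllib.parse import urlparse, parse_qs

# Illustrative sketch of the parameter check described above (not the
# production refinery code): prev=search suggests the user came from
# Google Search; client=srp suggests an automatically-translated page
# served from the search results.
def translate_referer_from_search(referer: str) -> bool:
    params = parse_qs(urlparse(referer).query)
    return params.get('prev') == ['search'] or params.get('client') == ['srp']

print(translate_referer_from_search(
    'https://translate.googleusercontent.com/translate_c?hl=fi&prev=search'))
```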

Virus / Spam sites

There are a number of referers that follow the pattern <sport>-<random characters>.site (e.g., www.motorsport-b9f4a06e.site) that are clearly bots and collectively make up close to 10% of external referral traffic. I saw them back in February when I looked then too, so this is an ongoing issue. I assume it does not make sense to try to individually blacklist these sites, but I wanted to document them as potentially problematic in analyses. There are also a number of virus sites that co-opt people's browsers and generate clicks. Again, there's no obvious solution for these, and thankfully the virus sites do not account for much traffic, so they can probably be ignored. It's also likely that methods like those mentioned for bot identification in pageview_hourly (T238357) could be applied here after the fact (e.g., a high number of no-cookie requests).
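
As a rough illustration, a post-hoc filter for the pattern above could look like the following. The eight-hex-character suffix is an assumption based on the one cited example; real hosts may vary:

```python
import re

# Rough post-hoc filter for the spam-host pattern described above. The
# 8-hex-character suffix is an assumption based on the cited example
# (www.motorsport-b9f4a06e.site); real hosts may vary.
SPAM_HOST = re.compile(r'^(www\.)?[a-z]+-[0-9a-f]{8}\.site$')

def is_spam_host(host: str) -> bool:
    return bool(SPAM_HOST.match(host.lower()))

print(is_spam_host('www.motorsport-b9f4a06e.site'))
print(is_spam_host('search.naver.com'))
```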

What next?

I'd appreciate some help trying to decide what the best path forward is and how to implement it. The best example that I could find of a similar task is this one for adding Ecosia / Startpage to the search engine list: T191714

Event Timeline

mforns triaged this task as High priority.Dec 2 2019, 4:58 PM
mforns moved this task from Incoming to Mentoring on the Analytics board.
Nuria added a project: Analytics-Kanban.
Nuria moved this task from Next Up to In Progress on the Analytics-Kanban board.

Hey @Isaac,

Started looking into the referer class, and had a few questions:

  1. You classified go.mail.ru as not a search engine. However, it seems to be the search component of mail.ru, and acts like a search engine. From my understanding, this would be better classified under external (search engine), but let me know if I made a mistake.
  2. Right now, android-app://com/googlequicksearchbox(...), which is the referer corresponding to the Google Search app for Android, is classified as unknown. I was wondering if it would be better classified under external (search engine) under the Google purview. It makes up roughly 60% of unknown referers.
  3. Right now, google.com (no http... or www...) is classified as unknown. I was wondering if it would be better classified under external (search engine), or if there is a specific reason that URLs with no "front path" are automatically set to unknown. It makes up roughly 17% of unknown referers.
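
One plausible explanation for point 3 (an assumption on my part, not verified against the refinery source) is that host extraction on a scheme-less referer yields nothing, so the classifier falls through to unknown. Python's urlparse illustrates the behavior:

```python
from urllib.parse import urlparse

# Without a scheme, the whole string is parsed as a path and no host is
# found; Hive's parse_url(referer, 'HOST') similarly fails to extract a
# host here, which would explain the "unknown" fallback (assumption).
print(repr(urlparse('https://google.com/').netloc))  # 'google.com'
print(repr(urlparse('google.com').netloc))           # '' (no host found)
```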

Other than those questions, I've created regexes for:

  • Naver
  • Docomo
  • Qwant
  • Daum
  • MyWay
  • Seznam
  • AU
  • Ask (which you didn't suggest, but I thought would be fitting)
  • Lilo
  • Coc Coc
  • AOL Search (which you didn't suggest, but I thought would be fitting)
  • Rakuten (which you didn't suggest, but I thought would be fitting)

In addition, I've created a regex to classify referers other than these as search engines. I've added it to the search_regex tab of the Google Sheets doc under the field 'Regex Classification'. It classifies a referer as a search engine if its address contains search, but doesn't include research with the only exceptions of presearch and securesearch. From the top external referers you listed, this regex has 100% accuracy for the top 60 referers, and more than 95% accuracy for the top 500. I'm unsure of the best practices for classifying these referers, but I'm assuming that they can simply be classified as external (search engine) without any additional clarification.

I've also updated the Google regex to include referers like: translate.googleusercontent.% that include client=srp and prev=search. Let me know if my understanding is correct.

I'm also open to suggestions regarding the filtering out of spam sites, as I haven't figured out an effective way to do that.

Let me know if you have any questions or concerns. Thanks!

Hey @lexnasser this is really great! Thanks!!

You classified go.mail.ru as not a search engine. However, it seems to be the search component of mail.ru, and acts like a search engine. From my understanding, this would be better classified under external (search engine), but let me know if I made a mistake.

Yep, you're right. I was being sloppy and assumed incorrectly based on the "mail" part of the URL. I've updated my coding accordingly.

Right now, android-app://com/googlequicksearchbox(...) , which is the referer corresponding to the Google Search app for Android, is classified as unknown. I was wondering if it would be better classified under external (search engine) under the Google purview. It makes up roughly 60% of unknown referers.
Right now, google.com (no http... or www...) , is classified as unknown. I was wondering if it would be better classified under external (search engine), or if there is a specific reason that URLs with no "front path" are automatically set to unknown. It makes up roughly 17% of unknown referers.

Yeah, I hadn't even looked into unknown yet but I agree 100% with your reclassifications.

Other than those questions, I've created regexes for: Naver, Docomo, Qwant, Daum, MyWay, Seznam, AU, Ask (which you didn't suggest, but I thought would be fitting), Lilo, Coc Coc, AOL Search (which you didn't suggest, but I thought would be fitting), Rakuten (which you didn't suggest, but I thought would be fitting)

Awesome, and thanks for adding those others in!

In addition, I've created a regex to classify referers other than these as search engines. I've added it to the search_regex tab of the Google Sheets doc under the field 'Regex Classification'. It classifies a referer as a search engine if its address contains search, but doesn't include research with the only exceptions of presearch and securesearch. From the top external referers you listed, this regex has 100% accuracy for the top 60 referers, and more than 95% accuracy for the top 500. I'm unsure of the best practices for classifying these referers, but I'm assuming that they can simply be classified as external (search engine) without any additional clarification.

Yeah, that looks great. Frankly given how power-law-y the distribution is for pageviews per referer, getting it 100% correct for the top 60 covers almost everything. I'd agree that everything that matches the regex could be pushed into external (search engine).

I've also updated the Google regex to include referers like: translate.googleusercontent.% that include client=srp and prev=search. Let me know if my understanding is correct.

Just make sure it is an OR, not an AND. In my experience, URLs had either client=srp or prev=search, but not both.

I'm also open to suggestions regarding the filtering out of spam sites, as I haven't figured out an effective way to do that.

Yeah, I don't think there's a good solution unfortunately. I'm hopeful that T238357 will catch a lot of these further down the pipeline via checking nocookie but these virus/spam sites are much easier to detect in aggregate as opposed to when evaluating a single page view. In the meantime, I'll just have to remember to include custom filters when I do my analyses.

Additionally, Analytics is generally really good at documenting changes to schemas, but depending on the shift in data that these changes cause, I'll also try to remember to reach out to product analytics regarding their dashboards (e.g., https://discovery.wmflabs.org/external/) and potentially add a note to the clickstream documentation (e.g, https://dumps.wikimedia.org/other/clickstream/readme.html) so consumers of that data know where to find details of changes.

Change 556449 had a related patch set uploaded (by Lex Nasser; owner: Lex Nasser):
[analytics/refinery/source@master] Modified external webrequest search engine classification and added tests.

https://gerrit.wikimedia.org/r/556449

Change 556730 had a related patch set uploaded (by Lex Nasser; owner: Lex Nasser):
[analytics/refinery/source@master] Fix style and correct incorrect test case.

https://gerrit.wikimedia.org/r/556730

Change 556730 abandoned by Lex Nasser:
Fix style and correct incorrect test case.

Reason:
Did not amend to original commit

https://gerrit.wikimedia.org/r/556730

Quick Status Update

@Isaac, I just wanted to update you on the status of this task.

I committed the code that will classify the aforementioned search engine referers, predict whether other referers are from search engines, and classify a few existing sources as search engines. URL: https://gerrit.wikimedia.org/r/#/c/analytics/refinery/source/+/556449/

I ran some basic testing soon after committing, and recently attempted to test on real webrequest data, the final step before deployment. However, I ran into some issues with my user account regarding reading raw webrequest data, delaying this final testing process. *Note that this is not an issue with the code, but rather with my personal user account.* I am hoping to resolve this issue soon, be able to do some final testing, and then release the refinery update quickly after.

I hope this isn't too much of an inconvenience. Let me know if you have any questions, and thanks for your patience!

Nuria added a subscriber: Nuria.

Adding Product-Analytics so they can see where the changes to referrers are coming from.

SELECT w.host AS host,
       count(w.host) AS num_referrals,
       sum(w.new_user) AS num_new_users,
       sum(w.same_day) AS num_same_days
  FROM (
       SELECT parse_url(referer, 'HOST') AS host,
              (IF(x_analytics_map['WMF-Last-Access-Global'] IS NULL, 1, 0)) AS new_user,
              (IF(x_analytics_map['WMF-Last-Access-Global'] = '13-Nov-2019', 1, 0)) AS same_day
         FROM wmf.webrequest
        WHERE year = 2019 AND month = 11 AND day = 13
              AND is_pageview
              AND agent_type = 'user'
              AND referer_class = 'external'
       ) w
 GROUP BY w.host
 ORDER BY num_referrals DESC
 LIMIT 5000;

Just a note that this select as written counts a device that comes to the site twice in one day under both buckets: the first request as "new_user" and the second as "same_day". I think it does not matter for the larger point you are trying to make, but just an FYI.

I am hoping to resolve this issue soon, be able to do some final testing, and then release the refinery update quickly after.

@lexnasser that sounds great -- thanks for the update!

Just a note that this select as written is counting devices that come to the site twice in one day twice, once as "new_user" and the second time as "same_day".

@Nuria yeah, I didn't make any effort to count unique users, just page views. It's a good point, though, that in general there is no restriction on how many counts a given device can contribute to this query, beyond that it is presumably unlikely that any given individual would have very many externally-referred page views in a given day, even if they are an avid reader.

Hi @Isaac Got to final testing, and found an issue.

The way the referer search-engine classifier currently functions is it extracts the referer's host and runs the regexes on that.

For example, if the referer is https://translate.googleusercontent.com/translate_c?hl=fi&prev=search,external, then the classifier would extract translate.googleusercontent.com and run the regexes on that. As you can see, this creates an issue for the Google Translate referers that you requested be classified as search engines, as it does not consider the parameters (like prev=search and client=srp).

I made the mistake of forgetting that this was the case, and put the parameter regex into this existing system, which does not work.

To make it work, I will slightly refactor the code today to track the host and parameters separately, allowing the Google Translate search-engine classification to work. I will test the code tomorrow, and hopefully the change can be deployed by the end of the week.
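
A minimal sketch of that refactor (hypothetical names; the actual refinery code is Java) might parse the referer once and apply host-based and parameter-based rules to the same URL:

```python
from urllib.parse import urlparse, parse_qs

# Hypothetical sketch of the refactor described above (the real refinery
# code is Java): parse the referer once, then apply host-based rules and
# parameter-based rules (the Google Translate case) to the same URL.
def classify_referer(referer: str) -> str:
    parsed = urlparse(referer)
    host = parsed.netloc.lower()
    params = parse_qs(parsed.query)
    # Parameter rule: Google Translate pages reached from Google Search.
    if ('translate.googleusercontent.com' in host
            and (params.get('prev') == ['search']
                 or params.get('client') == ['srp'])):
        return 'external (search engine)'
    # Host rule: "search" in the hostname, excluding "research" hosts.
    if 'search' in host.replace('research', ''):
        return 'external (search engine)'
    return 'external'
```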

Let me know if you have any questions or suggestions. Thanks!

@lexnasser thanks for the update -- bummer regarding the issues with Google Translate, but thanks for continuing to work on it. If it ends up being too bulky to track host and parameters separately, just let me know and we'll make sure to note it as a known issue. Better handling of the search engines is already a huge improvement!

@Isaac

One last thing to resolve: There are a few Google Translate referers with the parameter prev=/search... (ex. prev=/search%3Fq%3DBARON%2BDE%2BHIRSCH%26hl%3Del%26rlz%3D1T4GGLL_elGR398GR398%26prmd%3Divns). Should these also be classified under the Google Translate search engine purview?

Other than that, everything's complete and tested, and I'll hopefully deploy either today or in the next few days.

(ex. prev=/search%3Fq%3DBARON%2BDE%2BHIRSCH%26hl%3Del%26rlz%3D1T4GGLL_elGR398GR398%26prmd%3Divns). Should these also be classified under the Google Translate search engine purview?

Let's keep things simple and let's document that this format is not covered.

Let's keep things simple and let's document that this format is not covered.

Works for me -- thanks!

Change 556449 merged by Milimetric:
[analytics/refinery/source@master] Modify external webrequest search engine classification and add tests.

https://gerrit.wikimedia.org/r/556449

lexnasser added a subscriber: Milimetric.

Deployed with the help of @Milimetric ! Hope you find these changes helpful!

This is great -- thanks @lexnasser and others who supported! I'll rerun some of the queries that inspired this work in a few days and let you know if I see anything amiss, but silence should be interpreted as success :)

And thanks in advance for updating documentation as well!