Expanding External Referrer Tracking
Closed, Resolved · Public · 5 Estimated Story Points

Description

Request Status: New Request
Request Type: project support request
Related OKRs:

Request Title: Expanding External Referrer Tracking

  • Request Description: Add a select list of external sites to our tracking, to avoid custom analyses and to make measuring external referrer impact more accessible to any team at the Foundation.
  • Indicate Priority Level: Medium
  • Main Requestors: @KinneretG @Maryana
  • Ideal Delivery Date: FY23 Q1-Q2
  • Stakeholders: @odimitrijevic

Request Documentation

Document Type | Required? | Document/Link
Related PHAB Tickets | Yes | Mobile redirects drop prov parameter, Impact of TikTok on Wikipedia traffic
Product One Pager | Yes | Adding External Referrers
Product Requirements Document (PRD) | Yes | <add link here>
Product Roadmap | No | <add link here>
Product Planning/Business Case | No | <add link here>
Product Brief | No | <add link here>
Other Links | No | <add links here>

The suggested list to track:
Youtube
Twitter
Facebook
Reddit
TikTok
Quora
Instagram
ycombinator
LinkedIn
Github
Pinterest
Coursera
Medium
Stackoverflow

Event Timeline

KinneretG updated the task description.

Thanks @KinneretG for creating this task! Just chiming in with a thought for whoever takes up this work: search engines are a bit more standardized in their referer URLs, but some of these external platforms have link shorteners that we need to account for. You can see a few examples from a past pilot with a similar scope (though fewer external platforms). Generally I just inspected the top external traffic from a given day to identify any non-standard referer formats -- e.g.:

SELECT
  parse_url(referer, 'HOST') AS host,
  COUNT(1) AS num_referrals
FROM wmf.pageview_actor
WHERE
  year = {year}
  AND month = {month}
  AND day = {day}
  AND is_pageview
  AND agent_type = 'user'
  AND referer_class = 'external'
GROUP BY
  parse_url(referer, 'HOST')
ORDER BY
  num_referrals DESC
LIMIT 5000
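
As a rough illustration of the link-shortener point above, here is a minimal Python sketch that maps shortener hosts back to their platform. The mapping and the function name `platform_for_shortener` are hypothetical, and the host list is deliberately incomplete; the real entries would come from inspecting query output like the above.

```python
from typing import Optional
from urllib.parse import urlsplit

# Hypothetical, incomplete mapping from link-shortener hosts to the platform
# that operates them. Real entries should come from inspecting actual traffic.
SHORTENER_HOSTS = {
    "t.co": "Twitter",
    "youtu.be": "Youtube",
    "redd.it": "Reddit",
    "lnkd.in": "LinkedIn",
    "pin.it": "Pinterest",
}


def platform_for_shortener(referer: str) -> Optional[str]:
    """Return the platform name if the referer host is a known shortener, else None."""
    # urlsplit(...).hostname is None for empty or scheme-less values such as "-".
    host = (urlsplit(referer).hostname or "").lower()
    return SHORTENER_HOSTS.get(host)
```

A lookup table like this only covers hosts someone has already spotted, so it would probably need occasional refreshing from the top-referrers query.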
EChetty triaged this task as Medium priority.
EChetty moved this task from Backlog to Investigate on the Foundational Technology Requests board.

@EChetty,
As a follow up to our conversation, I am adding the dimensions we would ideally like to track for the externally referred pageviews:

  • By Country
  • By Project
  • By Device type
  • By Article
EChetty set the point value for this task to 5. Oct 17 2022, 3:31 PM
EChetty raised the priority of this task from Medium to High. Nov 2 2022, 4:53 PM
EChetty updated Other Assignee, added: EChetty.
EChetty added a subscriber: EChetty.

Change 864772 had a related patch set uploaded (by Snwachukwu; author: Snwachukwu):

[analytics/refinery/source@master] Refactor and Expand External referer classification

https://gerrit.wikimedia.org/r/864772

In the current patch we have updated our referer classifier to include an "external (media sites)" class to represent the list of sites to track. This is in addition to the previous classes: unknown, internal, external (search engine) and external. The classifier also identifies the name of the site if it is a search engine or a media site (e.g., Youtube, Facebook).
Next steps:

  1. Test for performance and optimise to include caching if necessary.
  2. Create a new UDF that identifies the names of the search engines and media sites by using the referer classifier.
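
To make the class list and the caching idea concrete, here is a minimal Python sketch, not the code in the patch. The host patterns are hypothetical and heavily abbreviated, the order of the checks is an assumption, and `functools.lru_cache` simply stands in for whatever caching the performance tests might call for.

```python
import re
from functools import lru_cache
from urllib.parse import urlsplit

# Hypothetical, heavily abbreviated host patterns (assumptions for illustration).
INTERNAL = re.compile(r"(^|\.)(wikipedia|wikimedia|wiktionary)\.org$")
SEARCH_ENGINES = re.compile(r"(^|\.)(google\.com|bing\.com|duckduckgo\.com)$")
MEDIA_SITES = re.compile(r"(^|\.)(youtube\.com|facebook\.com|reddit\.com|tiktok\.com)$")


@lru_cache(maxsize=100_000)
def classify_referer(referer: str) -> str:
    """Return one of the five referer classes named in the comment above."""
    try:
        host = urlsplit(referer).hostname
    except ValueError:
        host = None  # malformed URL
    if not host:
        return "unknown"
    host = host.lower()
    if INTERNAL.search(host):
        return "internal"
    if SEARCH_ENGINES.search(host):
        return "external (search engine)"
    if MEDIA_SITES.search(host):
        return "external (media sites)"
    return "external"
```

Because many requests share the same referer string, memoising on the raw string avoids re-running the regexes for repeated values; `classify_referer.cache_info()` reports the hit rate.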

@Snwachukwu very excited to see this functionality being added -- thanks for working on it! As a potential user of this data, I wanted to verify my understanding of the code and check on next steps:

  • My understanding of the patch: it will alter the referer_class field in wmf.webrequest (and by extension, wmf.pageview_actor and wmf.pageview_hourly among others) by assigning the rows that would otherwise be labeled as external to either external (media sites) or the original external, based on whether they match the media list (Youtube etc.).
  • It doesn't look like the name of the specific site -- e.g., Youtube -- would be stored in a table at this point. Did I miss this or is it planned for a future patch or is the intent that users would have to write their own queries that use the new UDFs if they wanted this data?

@Isaac

  • Indeed this will alter the referer_class field as some rows previously labelled as external will now be labelled as external (media sites) class.
  • To answer your second point, the patch has been updated with a new UDF called GetRefererDataUDF. The UDF returns a struct containing the referer class and the referer name. For example, for https://www.youtube.com/ the UDF returns a struct with referer class = external (media sites) and referer name = Youtube.
  • The next step is to use the new UDF: we plan to introduce a new field to the wmf.webrequest table and maybe remove the referer_class field. This is yet to be decided. Kindly let me know if you have any suggestions.
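
For readers who want a feel for the struct described above, here is a small Python sketch of its shape. It is not the UDF itself: `RefererData`, `get_referer_data` and the site table are hypothetical names, the internal and search-engine branches are left out for brevity, and returning `None` as the name for unmatched referers is an assumption.

```python
from typing import NamedTuple, Optional
from urllib.parse import urlsplit


class RefererData(NamedTuple):
    """Illustrative stand-in for the struct: a referer class plus a referer name."""
    referer_class: str
    referer_name: Optional[str]


# Hypothetical subset of the tracked media sites, keyed by registered domain.
MEDIA_SITES = {
    "youtube.com": "Youtube",
    "facebook.com": "Facebook",
    "reddit.com": "Reddit",
}


def get_referer_data(referer: str) -> RefererData:
    host = (urlsplit(referer).hostname or "").lower()
    if not host:
        return RefererData("unknown", None)
    for domain, name in MEDIA_SITES.items():
        # Match the domain itself or any subdomain (e.g. www.youtube.com).
        if host == domain or host.endswith("." + domain):
            return RefererData("external (media sites)", name)
    # Assumption: unmatched external referers carry no name.
    return RefererData("external", None)
```

With this sketch, `get_referer_data("https://www.youtube.com/")` yields a class of external (media sites) and a name of Youtube, matching the example in the comment.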

@Snwachukwu thanks for the update!

The next step is to use the new UDF: we plan to introduce a new field to the wmf.webrequest table and maybe remove the referer_class field. This is yet to be decided. Kindly let me know if you have any suggestions.

I'll do some thinking on it. I'm pretty open to how this is represented in webrequest/pageview_actor. When this was initially proposed, I remember being most worried that it might greatly de-aggregate pageview_hourly to have an additional field that is the search engine or external media platform that referred the pageview. But I guess we won't know how big of an impact that actually is until we generate some data and compare.
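
To show what the de-aggregation worry means in practice, here is a toy Python sketch that counts how many aggregated rows a table would hold with and without the extra name field. The rows and column layout are made up for illustration; the real impact would have to be measured on actual data.

```python
# Hypothetical pageview rows: (project, country, referer_class, referer_name).
rows = [
    ("en.wikipedia", "US", "external (media sites)", "Youtube"),
    ("en.wikipedia", "US", "external (media sites)", "Reddit"),
    ("en.wikipedia", "US", "external (media sites)", "Youtube"),
    ("en.wikipedia", "US", "external (search engine)", "Google"),
    ("de.wikipedia", "DE", "external (search engine)", "Google"),
    ("de.wikipedia", "DE", "external (search engine)", "Bing"),
]


def aggregated_row_count(rows, key_indexes):
    """Number of rows left after grouping by the columns at key_indexes."""
    return len({tuple(row[i] for i in key_indexes) for row in rows})


without_name = aggregated_row_count(rows, (0, 1, 2))   # group by class only
with_name = aggregated_row_count(rows, (0, 1, 2, 3))   # also group by name
print(without_name, with_name)
```

In this toy data the grouped table grows from 3 rows to 5 once the name is part of the key; the same comparison on a real day of data would indicate how much the aggregated table actually grows.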

Tagging @nshahquinn-wmf too as he often works with referrer data and might have thoughts about where it would be most useful to incorporate this data and how.

Tagging Product-Analytics so we can discuss this as a team. Also adding @Mayakp.wiki given that there may be an impact on pageview_hourly, which we use in our monthly metrics.

@Snwachukwu I also wanted to flag this ticket - T325611 - about how the ua-parser library does not detect TikTok's in-app browser even though it can be identified using the BytedanceWebview substring
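
A minimal sketch of the substring check mentioned above, in Python. The function name is hypothetical and the user-agent strings in the usage note are made-up examples, not real traffic.

```python
def is_tiktok_in_app_browser(user_agent: str) -> bool:
    """Heuristic: flag user agents that contain the BytedanceWebview substring."""
    return "BytedanceWebview" in (user_agent or "")
```

For example, a made-up user agent ending in `BytedanceWebview/d8a21c6` would be flagged, while a plain mobile Safari string would not. A check like this is a workaround layered on top of the parser rather than a fix to ua-parser itself.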

I ran the UDF on a day's data and extracted the top 1000 referers for that day to show the impact of the GetRefererDataUDF on referers. You can check the spreadsheet and a little doc on it.

Thanks for running that analysis @Snwachukwu! I also looked at the rows under external in the attached spreadsheet to see what the UDF wasn't capturing (copy) and make sure that matched expectations. It generally looks good (one suggested change below) and, I think, surfaced a few interesting trends:

  • We're capturing a bunch of .github.io sites under Github, which are just a bunch of random personal websites that happen to be hosted via Github so in my view should be excluded. We should probably adjust the regex so it checks github.com but excludes github.io
  • In the top 5 we have two webpage translation sites -- Google's and Yandex's (turbopages). These are almost all going to be actually internal referrals (in the sense that it's a link within Wikipedia being clicked), just hosted via translation services so not on our servers. They're a small proportion of internal referrals but do comprise a large proportion of the remaining external referrers. I don't think we necessarily need to do anything about this right now but interesting to observe and if trends continue, we might think of adding an additional external category that is external (translation)
  • In the top 30, we also see a number of toolforge sites which also blur the internal/external boundary but I don't think it's wrong to leave that as is.
  • A newer category I'm seeing in the top 30 or so is games -- e.g., various wikipedia speed run games, wikitrivia -- which is pretty cool!
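
The github.com versus github.io adjustment suggested in the first bullet could look something like the following Python sketch. The pattern is an illustration of the idea, not the regex used in the patch, and `is_github` is a hypothetical helper.

```python
import re

# Match github.com and its subdomains only. Anchoring on "(^|\.)" and "$" keeps
# *.github.io (GitHub Pages sites) and look-alikes such as notgithub.com out.
GITHUB_HOST = re.compile(r"(^|\.)github\.com$")


def is_github(host: str) -> bool:
    """Return True only for github.com itself or one of its subdomains."""
    return bool(GITHUB_HOST.search(host.lower()))
```

Anchoring the pattern to the end of the host is what excludes the personal sites hosted on github.io, since those hosts never end in github.com.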

I ran the UDF on a day's data and extracted the top 1000 referers for that day to show the impact of the GetRefererDataUDF on referers. You can check the spreadsheet and a little doc on it.

ping @Mayakp.wiki who was interested in QA-ing the data as well.

@Maryana @KinneretG please see the spreadsheet provided for your review

@Maryana @KinneretG please see the spreadsheet provided for your review

Don't think we are ready for review yet. We still need to fix why Twitter is not being classified correctly. Will ping Kinneret when @Snwachukwu confirms we are ready for review.

Change 864772 merged by jenkins-bot:

[analytics/refinery/source@master] Refactor and Expand External referer classification

https://gerrit.wikimedia.org/r/864772

Heya @KinneretG & @Isaac:

@Snwachukwu has updated the QA evaluation here for one day's worth of data.

You can see the requested referer name in the UDF(referer).referer_name column. Can you please check that this makes sense to you and is what you are expecting?

Thanks - Emil

Hey @EChetty and @Snwachukwu -- thanks for sharing the updated outputs! There is just one change, I think a very easy one, that I'd request before deploying: we're still capturing a bunch of .github.io sites alongside github.com as Github. Those .io sites are just random personal websites that happen to be hosted via Github Pages, like this University of Pennsylvania course syllabus from Fall 2020 that somehow generated 48,000 referrals (presumably bot traffic, but that's a different issue). We should probably adjust the regex so it checks github.com but excludes github.io. Once that change is made, I'm quite happy with the output, and we can always revisit occasionally to update the sites as needed. Thanks!!

Hey @Isaac! This actually has already gone out.
But you are right that it's a super simple change to the regex. We have flagged it and it is next up: https://phabricator.wikimedia.org/T329307.

This actually has already gone out. But you are right that it's a super simple change to the regex. We have flagged it and it is next up

Thanks!

Adding a note here for posterity: we will need this field added to the following tables derived from webrequest, which are aggregated and easier to query and use in Turnilo/Superset:

  • pageview_hourly
  • pageview_actor
  • pageview_daily

cc @JArguello-WMF, @Kgordon