
NEWFEATURE REQUEST: Add new referral sources to pageview data
Closed, ResolvedPublicFeature

Description

Data Platform Request Form

Is this a request for a:

  • Dataset
  • Data Pipeline
  • Data Feature

Is this a change to something existing:

  • Yes - please provide details of existing datasets/data pipelines (wiki links, Git URL, names of jobs, etc)
  • No

If a new dataset, has this been through the essential metric review? (need link):

  • Yes
  • No

Please provide the description of your request:
Our referrer list hasn't been updated since at least the launch of ChatGPT, and because new referral sources aren't hardcoded into the referrer header parsing, they haven't been appearing in the pageview datasets. We should have visibility of new referral sources added to our datasets and categorized properly.
Ex. chatbot providers (ChatGPT, Perplexity, Claude, Grok, etc.)

Use Case: (Please briefly explain what this feature will be used for):
The addition of new referral sources will help us know if AI tools are bringing in more readers to our projects.

Ideal Delivery Date:
Q2

What is needed for this feature

  • decide which new referral sources to add
  • decide whether these should be included in an existing referer_class or created as a new category
  • find the source for this data (webrequest)
  • add to the hard-coded list?
  • add to pageview tables in Hive, Iceberg, and Druid
  • data QA
Data Feature Checklist

Please link to the following if applicable.

Document Type | Required? | Document/Link
Related PHAB Tickets | Yes | <add link here>
Product One Pager | Yes | <add link here>
Product Requirements Document (PRD) | Yes | <add link here>
Product Roadmap | No | <add link here>
Product Planning/Business Case | No | <add link here>
Product Brief | No | <add link here>
Other Links | No | <add links here>

Details

Related Changes in Gerrit:
Related Changes in GitLab:
Title | Reference | Author | Source Branch | Dest Branch
Bump refine_webrequest hive jar version | repos/data-engineering/airflow-dags!1796 | joal | update_webrequest_hive_jar | main
Update webrequest refine hive jar to v0.3.7 | repos/data-engineering/airflow-dags!1780 | joal | update_webrequest_refine_jar | main

Event Timeline

Pppery renamed this task from NEWFEATURE REQUEST: to NEWFEATURE REQUEST: Update referers.Oct 6 2025, 10:31 PM
Mayakp.wiki renamed this task from NEWFEATURE REQUEST: Update referers to NEWFEATURE REQUEST: Add new referral sources to pageview data.Oct 6 2025, 10:35 PM
Mayakp.wiki triaged this task as High priority.

For this we need an analysis of chatbot referral URLs, in order to update our referer classification.
I assume we'll create a new external (chatbots) referer-class.

I did an ad-hoc analysis of counting the number of referers from chatgpt some time ago (slack-thread). We saw that traffic from chatgpt showed up in (at least) two different ways:

  • F.col("referer")=="https://chatgpt.com/" or
  • F.col("uri_query").contains("utm_source=chatgpt.com")
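
Those two signals can be sketched as a single predicate in plain Python (a hypothetical helper, standing in for the Spark column expressions above):

```python
def is_chatgpt_referral(referer: str, uri_query: str) -> bool:
    """Plain-Python stand-in for the two Spark conditions above:
    an exact ChatGPT Referer header, or a chatgpt.com utm_source
    tag in the request's query string."""
    return (
        referer == "https://chatgpt.com/"
        or "utm_source=chatgpt.com" in (uri_query or "")
    )

print(is_chatgpt_referral("https://chatgpt.com/", ""))                     # True
print(is_chatgpt_referral("", "?utm_source=chatgpt.com&utm_medium=chat"))  # True
print(is_chatgpt_referral("https://www.google.com/", "?foo=bar"))          # False
```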

FYI chiming in as I've done a bit of this in the past but am happy to pass it off to others!

The basic query I used in previous iterations to help me generate candidates is something like what's below (with more days included leading to more stable data and probably better candidates). You'll see perplexity, chatgpt, copilot in those results (also ya.ru, which maybe should be added to our Search Engine list as an alias for Yandex):

SELECT
  SUBSTRING(parse_url(referer, 'HOST'), 1, 50) AS host,
  COUNT(1) AS num_referrals
FROM wmf.pageview_actor
WHERE
  year = 2025
  AND month = 9
  AND day = 16
  AND is_pageview
  AND agent_type = 'user'
  AND referer_class = 'external'
GROUP BY
  parse_url(referer, 'HOST')
HAVING
  num_referrals > 1000
ORDER BY
  num_referrals DESC
LIMIT 5000

We saw that traffic from chatgpt showed up in (at least) two different ways:

+1 to this being an interesting thing to consider. If we go down this route, I'd love someone to take a moment to explore all of the most common uri_query parameters and see if there is other valuable data in there that we're missing. Looking at a small sample of utm_source, I see the following: chatgpt.com; yandexsmartcamera; perplexity; substack; openai; null; zalo (with chatgpt.com being the most prevalent by far). Example query:

# uri_query sometimes contains duplicate keys and this makes Spark mad
# so we set it to just retain the last pair it finds when there are duplicate keys
originalmapKeyDedupPolicy = spark.conf.get("spark.sql.mapKeyDedupPolicy")
spark.conf.set("spark.sql.mapKeyDedupPolicy", "LAST_WIN")

year = 2025
month = 9
day = 16
query = f"""
WITH query_params AS (
    SELECT
      EXPLODE(STR_TO_MAP(SUBSTR(uri_query, 2), '&', '='))
    FROM wmf.pageview_actor
    WHERE
      year = {year}
      AND month = {month}
      AND day = {day}
      AND is_pageview
      AND agent_type = 'user'
      AND LENGTH(uri_query) > 0
)
SELECT
  value,
  COUNT(1) AS num_instances
FROM query_params
WHERE
  key = "utm_source"
GROUP BY
  value
HAVING
  num_instances > 1000
ORDER BY
  num_instances DESC
"""

spark.sql(query).show(5000, False)
    
spark.conf.set("spark.sql.mapKeyDedupPolicy", originalmapKeyDedupPolicy)
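
For quick experimentation outside Spark, the same utm_source tally can be sketched with the Python standard library (the sample query strings below are made up; the real job reads uri_query from wmf.pageview_actor):

```python
from collections import Counter
from urllib.parse import parse_qs

# Made-up uri_query samples; in production these come from wmf.pageview_actor.
sample_queries = [
    "?utm_source=chatgpt.com&utm_medium=chat",
    "?utm_source=perplexity",
    "?utm_source=openai&utm_source=chatgpt.com",  # duplicate key
    "?foo=bar",                                   # no utm_source at all
]

counts = Counter()
for q in sample_queries:
    params = parse_qs(q.lstrip("?"), keep_blank_values=True)
    values = params.get("utm_source")
    if values:
        # Take the last value to mirror Spark's LAST_WIN dedup policy.
        counts[values[-1]] += 1

print(counts.most_common())  # [('chatgpt.com', 2), ('perplexity', 1)]
```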

Thank you for your input folks :)
Regarding using the uri_query to find referrers: that's not how it's done currently; we only consider the referer field, so it would require more changes to our code.
I ran the following query, showing that if we were to consider only the referer field, we'd miss ~60% of rows.

SELECT
    (referer = 'https://chatgpt.com/') as ref,
    (uri_query like '%utm_source=chatgpt.com%') as query,
    COUNT(1) as c
FROM wmf.webrequest
WHERE webrequest_source = 'text' and year = 2025 and month =10 and day = 7
GROUP BY referer = 'https://chatgpt.com/', uri_query like '%utm_source=chatgpt.com%'
ORDER BY ref, query;

+-----+-----+----------+                                                        
|ref  |query|c         |
+-----+-----+----------+
|false|false|8521965309|
|false|true |472518    |
|true |false|47826     |
|true |true |264231    |
+-----+-----+----------+
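
Reading the cross-tab back (illustrative arithmetic on the counts above): of all rows that match either signal, the share carrying only the utm_source tag (what a referer-only rule would miss) is roughly 60%:

```python
# Counts from the result table above (ref = Referer match, query = utm_source match).
only_query = 472_518  # ref=false, query=true: missed by a referer-only rule
only_ref   = 47_826   # ref=true,  query=false
both       = 264_231  # ref=true,  query=true

total_matched = only_query + only_ref + both
missed_share = only_query / total_matched
print(f"{missed_share:.0%}")  # 60%
```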

@JAllemandou and @OSefu-WMF just had a chat about this one and how we can address it, now and in the future.

Next steps:

  1. Movement Insights (MI) to determine new referrers of interest (e.g. chatbots), and decide on signals used to characterize them (Referer header only? Or utm_source too?)
  2. (optional?) MI to identify, based on webrequest, whether any new referrers should be characterized, and how.
  3. (optional?) MI to review all referrer names and referrer classes. Do we want changes to our referrer taxonomy?
  4. DE to apply the changes: update the categorization logic and backfill the relevant data products.

I'll reassign this ticket to Maya for steps 1-3.

We also agree to support these updates in an incremental fashion if needed, not only as a big one like this. In the long run, we want to work together to make this a smoother process, moving the categorization logic to SQL that MI can easily maintain (e.g. as dbt models). But one step at a time :)

I have replicated the query above on pageviews only; the ratios are essentially the same.

Thank you all for working on this. I just want to mention that I encourage you to also refresh the classification for existing referrers in the list, to make sure they are catching everything they should. For instance, TikTok, YouTube, or Instagram may have started sending some traffic to us with different headers than when we first wrote these rules. Could you check for anything that we're missing? (Note: I won't see replies to this comment. Please reach me on Slack if needed).

Random suggestions:

  • Consider including Kagi as a search engine
  • Consider addressing T383088 by dropping the requirement that the referrer start with "http://" or "https://"

@calbon mentioned today that we want to make sure to capture the referrals from x.com/twitter.com . We do look for Twitter in the Referer header, but not X.

On a quick look at webrequest, we do seem to have that host in the header:

spark-sql (default)> select count(1) from wmf.webrequest where uri_host = 'en.wikipedia.org' and year = 2025 and month = 10 and day = 20 and parse_url(referer, 'HOST') = 'x.com';
count(1)
482

(thanks @Isaac for teaching me parse_url in a previous comment!)

Hi everyone, thank you for your input! This is helpful as we begin to think about upgrading the referer column. However, this should be done in phases, as we can make many improvements and I see quite a few suggestions in this task. I'll try to curate them in a document and share it soon.

For the short term, it would be helpful to get the referers with the largest impact on our pageviews and include them in the existing external (search engine) referer_class, as Wikimedia projects are used as search result citations.

DPE team, could we update the Search Engine definition with the following changes and make sure that we capture this all the way into Druid tables:

  • add CHATGPT("ChatGPT", "chatgpt.com\\.", "")
  • add PERPLEXITY( "Perplexity", "perplexity.ai\\.", "")
  • update YANDEX("Yandex", "yandex\\.", "ya.ru", "")

Change #1198313 had a related patch set uploaded (by Joal; author: Joal):

[analytics/refinery/source@master] Update referer classification patterns

https://gerrit.wikimedia.org/r/1198313

I have a question as I'm changing this: in the current classification, referers in the form of IPs with a protocol (for instance http://192.168.2.1) are classified as external referrers. I'm planning to change this to unknown, in the same way IPs without protocols are. Is that OK with you?
Pinging @Isaac on this too, because I know you devised the referrer dataset, and this change impacts it.

Thanks for the ping! A few thoughts but don't let this block the work if you all want to proceed:

  • I personally would leave in place the expectation that referrers start with http or https. I can comment on T383088 too, but my read is that that behavior is largely coming from bots that are improperly mocking up a referrer. I don't see nearly the volume that Krinkle saw in January, and the majority of it is being labeled as automated (query below). He noted that it's not actually acceptable behavior but believed it might be caused by some privacy extensions. It's a judgment call, but I'd prefer that we expect legitimate referers as a further check against bot data.
  • +1 to changing IPs to unknown -- seems reasonable and I assume not a major impact on our data so consistency is less important.
  • I'd be cautious about adding ChatGPT and Perplexity into our Search Engine definition. This is a broader philosophical thing so I don't think any right answer but my thoughts: I don't think "Search Engines" are actually well-defined anymore and by all means, Google is very chat-agenty these days so the boundaries are getting more and more blurred. That said, if we plan to establish a "chat agent" referer class, then please don't temporarily put ChatGPT/Perplexity into the Search Engine data as it'll just cause confusion as to why there's a temporary blip in the data.
spark.sql("""
SELECT
  agent_type,
  COUNT(*) AS num_requests
FROM wmf.pageview_actor
WHERE
  year = 2025 AND month = 10 AND day = 18
  AND referer LIKE "www.%"
  AND is_pageview
GROUP BY
  agent_type
ORDER BY
  num_requests DESC
LIMIT 500
""").show(500, False)

@Mayakp.wiki can I get your perspective on Isaac's comment above please? Thank you :)

hey @Isaac, more and more search engines are integrating conversational AI features into the search experience, and yes, I agree that the boundaries are getting blurry. So the way I see it, we may not need a new classifier for chat agents. Traditional chat agents like ChatGPT would fit into the search engine category, even more so now since the launch of Atlas.
Maybe at some point in the future we'll just have to rename or modify the definition of the search engine referer to accommodate this nuance.

@JAllemandou , +1 on changing the IP referer to unknown.

so the way I see it, it means we may not need a new classifier for chat agents.

@Mayakp.wiki that's fine by me then -- I'm mainly concerned about a shifting definition rather than the exact boundaries, and it sounds like you are planning to stick with ChatGPT/Perplexity as Search Engines as opposed to moving them to their own category at a later point. Thanks!

Thank you everyone for chiming in :)

I have created another task about improving our classification using the utm URL query parameter: T408185.

I still have one topic preventing me from moving forward here: should I keep our classification restricted to URLs with a protocol (http(s)://) as it is now, or should I open it to URLs without a protocol as suggested in T383088? Once this decision is made I can finalize my patch and proceed with applying the change.

hey @JAllemandou / @Ahoelzl, I've chimed in here T383088#11308163 about the modification to the referer -- since most of it is being tagged as automated, there is low impact to user pageviews currently.

Also, could you help answer a few questions on the referer_class field:

  • is there a cost to creating a new category?
  • if we have a new definition for chatbots, similar to media sites and search engines, would it be OK if it had very few values, like just the top 5 chatbots for now?

we are thinking about what would work best and will confirm in the upcoming week. Apologies for going back on this.

Hey everyone, we discussed this a bit more and have decided to go with a new category called external (ai_chatbot) to include some of the top chatbot referrers. In addition, I've proposed a few changes and additions to existing categories.

Requested changes to referrers

  • New: External (ai_chatbot)
  • Search engine
    • Update Yandex to include “ya.ru/”
    • Add Kagi [\\kagi.com]
  • Media sites (add new sources)
    • Bluesky [\\.bsky.app]
    • Discord [\\discord.com\\]
    • Twitch [\\twitch.tv\]
    • Steam [\\steam\\.com]
    • Threads [\\threads.com\\]
  • Media sites (update)
    • Update Twitter to include x.com
    • Update Reddit to include “com.reddit..”
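
A rough sketch of what such a classification could look like as host regexes in plain Python. The patterns and class names below are illustrative guesses only; the real definitions live in analytics/refinery/source and differ in detail:

```python
import re
from urllib.parse import urlparse

# Illustrative patterns only, not the actual refinery definitions.
REFERER_CLASSES = [
    ("external (ai_chatbot)", re.compile(r"(^|\.)chatgpt\.com$")),
    ("external (ai_chatbot)", re.compile(r"(^|\.)perplexity\.ai$")),
    ("external (search engine)", re.compile(r"(^|\.)(yandex\.[a-z]+|ya\.ru|kagi\.com)$")),
    ("external (media sites)", re.compile(r"(^|\.)(x\.com|twitter\.com|bsky\.app|threads\.com)$")),
]

def classify_referer(referer: str) -> str:
    """Map a referer URL to a class by matching its host against
    the pattern list; unmatched non-empty hosts stay 'external'."""
    host = urlparse(referer).netloc.lower()
    for referer_class, pattern in REFERER_CLASSES:
        if pattern.search(host):
            return referer_class
    return "external" if host else "unknown"

print(classify_referer("https://chatgpt.com/"))      # external (ai_chatbot)
print(classify_referer("https://x.com/someuser"))    # external (media sites)
print(classify_referer("https://example.org/page"))  # external
```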

@JAllemandou pls lmk if you have any questions!

@JAllemandou, do we still have webrequest data retained for the pageviews backfill? It would be great if we could backfill this field.

@Mayakp.wiki The extended webrequest data is still retained.

Back from holidays, I'll update my patch to:

  • not include invalid referer URLs (declined T383088 based on comments)
  • add the requested AI chatbot referer category
  • update/add individual referer definitions as suggested

The code got pushed and reviewed positively. I'll spend tomorrow doing some analysis on real data, trying to spot blatantly wrong rows, and if everything goes right it should be merged/deployed early next week.

I have vetted the data with the latest version of the change, and it looks good to me.
I have copied some values where the referer has changed, grouped by referer domain and ordered by number of hits desc, here (note: this list doesn't include IP referers, which have now been categorized as unknown instead of external).

One thing to note in this spreadsheet is that AI chatbots also scan images (the upload webrequest source), and this is not counted as pageviews!

My plan is to merge the patch tomorrow and apply it with this week's train, but please ask me to hold if you think something is wrong.

I have copied some values where the referer has changed, grouped by referer domain and ordered by number of hits desc, here (note: this list doesn't include IP referers, which have now been categorized as unknown instead of external).

Thank you for sharing this!

In addition to all the positive changes, I noticed a few new false positives from removing the requirement for some sites that there be a preceding subdomain (which is probably assumed to be "www."). The top examples I see are "https://bing.gugugegu.com/" (now classified as Bing) and "https://startpage.freebrowser.org/" (now classified as Startpage). The numbers are very small, so it's not a big issue, but it also makes me wonder about the "https://bing.com" referrers that are now classified as Bing. If the regex was originally written to require a subdomain because that's what Bing always uses, maybe the hits without it are actually from a bot that messed up when trying to imitate Bing.

So my suggestion is to consider reinstating the dot at the start of some of the regexes. I don't have a strong feeling about it, but I wanted to share the suggestion in case you hadn't thought about it.

In addition to all the positive changes, I noticed a few new false positives from removing the requirement for some sites that there be a preceding subdomain (which is probably assumed to be "www."). The top examples I see are "https://bing.gugugegu.com/" (now classified as Bing) and "https://startpage.freebrowser.org/" (now classified as Startpage). The numbers are very small, so it's not a big issue, but it also makes me wonder about the "https://bing.com" referrers that are now classified as Bing. If the regex was originally written to require a subdomain because that's what Bing always uses, maybe the hits without it are actually from a bot that messed up when trying to imitate Bing.

Thank you @nshahquinn-wmf for the validation :)
I have checked, and https://bing.com is the actual Bing search engine.
About the false positives, I had a similar pattern for Ask, and after analysis decided to accept [any-subdomain.]ask.com.
I'll do a similar analysis for Bing and Startpage and try to refine my regexes accordingly.

I have updated the regexes a few times, and I have posted the latest list of domains and referer classes in the doc, in place of the previous one.
In my checking of the list, I haven't found false positives, only improvements.
Obviously this will not stay true for long, but it seems good enough for now.
I'll be deploying tomorrow if that's OK with you, @nshahquinn-wmf.
Thanks!

@JAllemandou absolutely, I think the rule improvements are in great shape! By specifying TLDs for each search engine, you have already gone well above and beyond the requirements 😊

FYI, I have left two minor Gerrit comments that are still unresolved (suggesting renaming "external (ai chatbot)" to "external (AI chatbot)" and pointing out a possible typo in a test case).

Change #1198313 merged by jenkins-bot:

[analytics/refinery/source@master] Update referer classification patterns

https://gerrit.wikimedia.org/r/1198313

The patch has been deployed, new data started to flow:

SELECT
    hour,
    COUNT(1),
    COUNT(DISTINCT referer_data)
FROM wmf.webrequest
WHERE year = 2025 and month = 11 and day = 5 and hour IN (15, 16)
GROUP BY hour
ORDER BY hour

+----+---------+----------------------------+                                   
|hour|count(1) |count(DISTINCT referer_data)|
+----+---------+----------------------------+
|15  |625535257|41                          |
|16  |618935569|53                          |
+----+---------+----------------------------+

I'll start backfilling pageview data tomorrow.

Starting to backfill the pageview_hourly Hive table from 2025-05-10T04:00. I'll backfill one month at a time.

Change #1203389 had a related patch set uploaded (by Joal; author: Joal):

[analytics/refinery/source@master] Make referer classification more robust

https://gerrit.wikimedia.org/r/1203389

While backfilling, I discovered that some fake, very long domains could make the classification take far too long (2 days for one webrequest hour!). I have sent the above patch to solve this.
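
One simple guard against such pathological inputs (an illustration, not necessarily what the patch above does) is to reject hosts longer than any legal DNS name before running regexes on them:

```python
from typing import Optional
from urllib.parse import urlparse

MAX_HOST_LEN = 253  # maximum length of a full DNS name (RFC 1035)

def safe_host(referer: str) -> Optional[str]:
    """Extract the referer host, rejecting empty or absurdly long
    hosts so downstream regex matching stays cheap."""
    host = urlparse(referer).netloc
    if 0 < len(host) <= MAX_HOST_LEN:
        return host
    return None

print(safe_host("https://chatgpt.com/"))              # chatgpt.com
print(safe_host("https://" + "a" * 10_000 + ".com"))  # None
```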

The pageview_hourly referer-data field has been completely backfilled since 2025-05-10T04:00. Now I'll be backfilling the related Druid datasources.

Change #1203389 merged by jenkins-bot:

[analytics/refinery/source@master] Make referer classification more robust

https://gerrit.wikimedia.org/r/1203389

And the Druid datasources have been backfilled.
This task is considered done; I'll wait for feedback for a few days before closing it.