
Can we see chatgpt accesses in logs?
Closed, Resolved · Public

Description

Look through the two primary data sources, referred traffic and API searches, to see if we can identify ChatGPT in the logs.

For referred traffic:
provenance parameter (wprov = gpio1) https://wikitech.wikimedia.org/wiki/Provenance
The value of this parameter will be recorded in the X-Analytics field of the webrequest table (as x_analytics_map['wprov']).
webrequest table
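
As a rough sketch (the dates below are placeholders, and 'gpio1' is simply the wprov value noted above, not something verified here), provenance-tagged pageviews could be pulled from webrequest like this:

SELECT year, month, day, COUNT(1) AS views
FROM wmf.webrequest
WHERE webrequest_source = 'text'
  AND is_pageview
  AND year = 2023 AND month = 7 AND day = 25      -- placeholder date
  AND x_analytics_map['wprov'] = 'gpio1'          -- wprov value named above
GROUP BY year, month, day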

For ChatGPT requests (accessing our search API):
wmf_raw.cirrussearchrequestset
https://datahub.wikimedia.org/dataset/urn:li:dataset:(urn:li:dataPlatform:hive,wmf_raw.cirrussearchrequestset,PROD)/Schema?is_lineage_mode=false&schemaFilter=
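
As a first pass, a hedged way to see whether ChatGPT is identifiable in the search logs at all is to survey API user agents; this sketch uses event.mediawiki_cirrussearch_request (the event table the queries later in this task run against) and a placeholder date/hour:

SELECT http.request_headers['user-agent'] AS user_agent,
       COUNT(1) AS n_requests
FROM event.mediawiki_cirrussearch_request
WHERE year = 2023 AND month = 7 AND day = 19 AND hour = 9   -- placeholder hour
  AND source = 'api'
GROUP BY http.request_headers['user-agent']
ORDER BY n_requests DESC
LIMIT 50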

The team has relied on AWS logs up to now. Do we need the AWS data?
If we don't see it in our search logs (missing/unclear user agent), can we track by IP?
FYI: https://meta.wikimedia.org/wiki/User-Agent_policy
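
If the user agent turns out to be missing or generic, one hedged way to look for IP-based signals is to rank API search traffic by client IP for a sample hour (same placeholder date/hour and table as the sketch above):

SELECT http.client_ip,
       COUNT(1) AS n_requests
FROM event.mediawiki_cirrussearch_request
WHERE year = 2023 AND month = 7 AND day = 19 AND hour = 9   -- placeholder hour
  AND source = 'api'
GROUP BY http.client_ip
ORDER BY n_requests DESC
LIMIT 50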

See also this Chat GPT Plugin process, method, & logging overview

See also: github.com/search-traffic-breakdown

Event Timeline

Reedy renamed this task from "can we see chatgpt accessing in logs?" to "Can we see chatgpt accesses in logs?". Jul 13 2023, 10:33 AM

See ChatGPT click-throughs by day, created by IJ using data from "isaacj_group"."wprov_daily_source" in presto_analytics_hive.
Isaac has set up a cron job that goes through the webrequest table and gets wprov pageviews from both sources (fixing the bug described below), and that is what is being used on the Superset chart.
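
For reference, a very rough sketch of the kind of daily aggregation such a job might run (a guess at the shape only, not Isaac's actual job; the wprov_daily_source schema hasn't been checked here). Splitting by access_method also makes the mobile-redirect issue described below visible:

SELECT year, month, day,
       access_method,
       x_analytics_map['wprov'] AS wprov,
       COUNT(1) AS views
FROM wmf.webrequest
WHERE webrequest_source = 'text'
  AND is_pageview
  AND year = 2023 AND month = 7               -- placeholder month
  AND x_analytics_map['wprov'] IS NOT NULL
GROUP BY year, month, day, access_method, x_analytics_map['wprov']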

wprov bug: when people are on a mobile device and access the URL, they are redirected to the mobile site and the wprov is scrubbed; what gets logged as a pageview is the mobile pageview.

Q: The provenance parameter is sometimes, randomly, stripped. How often does this occur? ½ the time?
NH notes that wprov may be dropped on up to 1/3 of the queries.
IJ notes that it's happening infrequently and we don't have an accurate picture, so we can use current traffic as a reasonable estimate of traffic coming from ChatGPT.

FYI: see also these tickets that look into User Agent and research that referral source: T295073, T336715, T257893
FYI: https://developer.chrome.com/blog/private-prefetch-proxy/

FYI: In April, Maya P, Isaac, and Kinneret worked on identifying ChatGPT in our search traffic (external). See notes from their discussions and analysis.

@Iflorez I'm posting the query I used for getting AI referrer traffic:

  • From ChatGPT:
SELECT
    year, month, day,
    geocoded_data['country_code'] AS country,
    access_method,
    agent_type,
    referer_class,
    user_agent,
    referer,
    COUNT(1) AS views
FROM wmf.webrequest
WHERE
    webrequest_source = 'text'
    AND is_pageview
    AND year = 2023
    AND month = 3
    AND day IN (26, 19)
    AND referer LIKE '%chat.openai.com/' -- pageviews referred from ChatGPT
GROUP BY
    year, month, day,
    geocoded_data['country_code'],
    access_method,
    agent_type,
    referer_class,
    user_agent,
    referer

We can also use pageview_actor:

SELECT
    year, month, day,
    access_method,
    agent_type,
    referer_class,
    user_agent,
    referer,
    COUNT(1) AS views
FROM wmf.pageview_actor
WHERE
    year = 2023
    AND month = 3
    AND day IN (26, 19)
    AND referer LIKE '%chat.openai.com%' -- pageviews referred from ChatGPT
GROUP BY
    year, month, day,
    access_method,
    agent_type,
    referer_class,
    user_agent,
    referer

For other AI tools, you can use:

AND referer like '%edgeservices.bing.com%' --pageviews referred from Bing Chat
AND referer like '%bard.google.com%' --pageviews referred from Google's AI chatbot Bard
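
As a hedged convenience, the three referrer patterns above can also be combined into a single breakdown (same pageview_actor shape as before; the source labels are just for readability, and positional GROUP BY is used as in the daily-count query further down):

SELECT year, month, day,
       CASE
         WHEN referer LIKE '%chat.openai.com%'       THEN 'ChatGPT'
         WHEN referer LIKE '%edgeservices.bing.com%' THEN 'Bing Chat'
         WHEN referer LIKE '%bard.google.com%'       THEN 'Bard'
       END AS ai_source,
       COUNT(1) AS views
FROM wmf.pageview_actor
WHERE year = 2023 AND month = 3 AND day IN (26, 19)
  AND (referer LIKE '%chat.openai.com%'
       OR referer LIKE '%edgeservices.bing.com%'
       OR referer LIKE '%bard.google.com%')
GROUP BY 1, 2, 3, 4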
Iflorez updated the task description.

This test query, using expected parameters, pulls up many results from a single IP: 85.236.56.254

"""
WITH base AS (
    SELECT http.client_ip,
           search_id,
           http.request_headers,
           params,
           user_agent_map,
           hits,
           elasticsearch_requests
    FROM event.mediawiki_cirrussearch_request
    WHERE year = 2023
      AND month = {month}
      AND day = {day}
      AND database = 'enwiki'
      AND source = 'api'
      AND params IS NOT NULL
      AND hits IS NOT NULL
      AND http.method = 'GET'
      AND element_at(params, 'action') IS NOT NULL
      AND element_at(params, 'action') = 'query'
)

SELECT *
FROM base
WHERE
    -- constants taken from https://gitlab.wikimedia.org/repos/machine-learning/chatgpt-plugin/-/blob/dev/app/services/wikipedia_search_service.py
    request_headers['user-agent'] = 'USER_AGENT'
    AND params['format'] = 'json'
    AND params['action'] = 'query'
"""

Zero results when looking for the IP:

"""
WITH base AS (
    SELECT http.client_ip,
           http.request_headers,
           params
    FROM event.mediawiki_cirrussearch_request
    WHERE year = 2023
      AND month = 7
      AND day = 19
      AND hour = 9
      AND database = 'enwiki'
      AND source = 'api'
      AND params IS NOT NULL
      AND hits IS NOT NULL
      AND http.method = 'GET'
)

SELECT *
FROM base
WHERE client_ip LIKE '23.102.140%'
                        
"""

@Iflorez: The User-Agent isn't "USER_AGENT" – in the code that's a constant imported from https://gitlab.wikimedia.org/repos/machine-learning/chatgpt-plugin/-/blob/dev/app/constants.py

So the UA string the service uses is "wikipedia-chagpt-plugin bot"

The query below is working against the search logs data. See the notes with today's date.

"""
WITH base AS (
    SELECT http.client_ip,
           search_id,
           http.request_headers,
           params,
           user_agent_map,
           hits,
           elasticsearch_requests
    FROM event.mediawiki_cirrussearch_request
    WHERE year = 2023
      AND month = 7
      AND day = 25
      --AND database = 'enwiki'
      AND source = 'api'
      AND params IS NOT NULL
      AND hits IS NOT NULL
      AND http.method = 'GET'
      AND element_at(params, 'action') IS NOT NULL
      AND element_at(params, 'action') = 'query'
      AND http.request_headers['user-agent'] = 'wikipedia-chagpt-plugin bot'
)

SELECT *
FROM base
WHERE
    --request_headers['user-agent'] = 'wikipedia-chagpt-plugin bot' -- https://gitlab.wikimedia.org/repos/machine-learning/chatgpt-plugin/-/blob/dev/app/services/wikipedia_search_service.py
    params['format'] = 'json'
    AND params['action'] = 'query'
"""
And a daily count of searches coming from the plugin user agent:

SELECT
  DATE_TRUNC('day', CAST(FROM_ISO8601_TIMESTAMP(meta.dt) as TIMESTAMP)) AS __timestamp,
  COUNT(1) AS n_searches
FROM event.mediawiki_cirrussearch_request 
WHERE year = 2023 AND month >= 7
  AND http.request_headers['user-agent'] = 'wikipedia-chagpt-plugin bot'
GROUP BY 1

This work was completed. We worked with two tables, mediawiki_cirrussearch_request & webrequest, to gather internal-side data on plugin use. You can see the final report here:
https://meta.wikimedia.org/wiki/Future_Audiences/Experiments:_conversational/generative_AI