
Can we see chatgpt accesses in logs?
Closed, Resolved · Public

Description

Look through the two primary data sources, referred traffic and API searches, to see if we can identify ChatGPT in the logs.

For referred traffic:
provenance parameter (wprov = gpio1) https://wikitech.wikimedia.org/wiki/Provenance
The value of this parameter will be recorded in the X-Analytics field of the webrequest table (as x_analytics_map['wprov']).
webrequest table
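
As a rough sketch (the dates below are placeholders, and 'gpio1' is simply the wprov value noted above, not something verified here), provenance-tagged pageviews could be pulled from webrequest like this:

SELECT year, month, day, COUNT(1) AS views
FROM wmf.webrequest
WHERE webrequest_source = 'text'
  AND is_pageview
  AND year = 2023 AND month = 7 AND day = 25      -- placeholder date
  AND x_analytics_map['wprov'] = 'gpio1'          -- wprov value named above
GROUP BY year, month, day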

For ChatGPT requests (accessing our search API):
wmf_raw.cirrussearchrequestset
https://datahub.wikimedia.org/dataset/urn:li:dataset:(urn:li:dataPlatform:hive,wmf_raw.cirrussearchrequestset,PROD)/Schema?is_lineage_mode=false&schemaFilter=
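
As a first pass, a hedged way to see whether ChatGPT is identifiable in the search logs at all is to survey API user agents; this sketch uses event.mediawiki_cirrussearch_request (the event table the queries later in this task run against) and a placeholder date/hour:

SELECT http.request_headers['user-agent'] AS user_agent,
       COUNT(1) AS n_requests
FROM event.mediawiki_cirrussearch_request
WHERE year = 2023 AND month = 7 AND day = 19 AND hour = 9   -- placeholder hour
  AND source = 'api'
GROUP BY http.request_headers['user-agent']
ORDER BY n_requests DESC
LIMIT 50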

The team has relied on AWS logs up to now. Do we need the AWS data?
If we don't see it in our search logs (missing/unclear user agent), can we track by IP?
FYI: https://meta.wikimedia.org/wiki/User-Agent_policy
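
If the user agent turns out to be missing or generic, one hedged way to look for IP-based signals is to rank API search traffic by client IP for a sample hour (same placeholder date/hour and table as the sketch above):

SELECT http.client_ip,
       COUNT(1) AS n_requests
FROM event.mediawiki_cirrussearch_request
WHERE year = 2023 AND month = 7 AND day = 19 AND hour = 9   -- placeholder hour
  AND source = 'api'
GROUP BY http.client_ip
ORDER BY n_requests DESC
LIMIT 50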

See also this Chat GPT Plugin process, method, & logging overview

See also: github.com/search-traffic-breakdown

Event Timeline

Reedy renamed this task from "can we see chatgpt accessing in logs?" to "Can we see chatgpt accesses in logs?". Jul 13 2023, 10:33 AM

See ChatGPT click-throughs by day, created by IJ using data from "isaacj_group"."wprov_daily_source" in presto_analytics_hive.
Isaac has set up a cron job that goes through the webrequest table and gets wprov pageviews from both sources (fixing the bug described below), and that is what is being used on the Superset chart.
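
For reference, a very rough sketch of the kind of daily aggregation such a job might run (a guess at the shape only, not Isaac's actual job; the wprov_daily_source schema hasn't been checked here). Splitting by access_method also makes the mobile-redirect issue described below visible:

SELECT year, month, day,
       access_method,
       x_analytics_map['wprov'] AS wprov,
       COUNT(1) AS views
FROM wmf.webrequest
WHERE webrequest_source = 'text'
  AND is_pageview
  AND year = 2023 AND month = 7               -- placeholder month
  AND x_analytics_map['wprov'] IS NOT NULL
GROUP BY year, month, day, access_method, x_analytics_map['wprov']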

wprov bug: when people are on a mobile device and access the URL, they are redirected to the mobile site and the wprov is scrubbed; what gets logged as a pageview is the mobile pageview.

Q: The provenance parameter is sometimes, randomly, stripped. How often does this occur? ½ the time?
NH notes that wprov may be dropped on up to 1/3 of the queries.
IJ notes that it's happening infrequently and we don't have an accurate picture, so we can use current traffic as a reasonable estimate of traffic coming from ChatGPT.

FYI: see also these tickets that look into User Agent and research that referral source: T295073, T336715, T257893
FYI: https://developer.chrome.com/blog/private-prefetch-proxy/

FYI: In April, Maya P, Isaac, and Kinneret worked on identifying ChatGPT in our search traffic (external). See notes from their discussions and analysis.

@Iflorez I'm posting the query I used for getting AI referrer traffic:

  • From ChatGPT:
SELECT
    year, month, day,
    geocoded_data['country_code'] AS country,
    access_method,
    agent_type,
    referer_class,
    user_agent,
    referer,
    COUNT(1) AS views
FROM wmf.webrequest
WHERE
    webrequest_source = 'text'
    AND is_pageview
    AND year = 2023
    AND month = 3
    AND day IN (26, 19)
    AND referer LIKE '%chat.openai.com/' -- pageviews referred from ChatGPT
GROUP BY
    year, month, day,
    geocoded_data['country_code'],
    access_method,
    agent_type,
    referer_class,
    user_agent,
    referer

We can also use pageview_actor:

SELECT
    year, month, day,
    access_method,
    agent_type,
    referer_class,
    user_agent,
    referer,
    COUNT(1) AS views
FROM wmf.pageview_actor
WHERE
    year = 2023
    AND month = 3
    AND day IN (26, 19)
    AND referer LIKE '%chat.openai.com%' -- pageviews referred from ChatGPT
GROUP BY
    year, month, day,
    access_method,
    agent_type,
    referer_class,
    user_agent,
    referer

For other AI tools, you can use:

AND referer like '%edgeservices.bing.com%' --pageviews referred from Bing Chat
AND referer like '%bard.google.com%' --pageviews referred from Google's AI chatbot Bard
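
As a hedged convenience, the three referrer patterns above can also be combined into a single breakdown (same pageview_actor shape as before; the source labels are just for readability, and positional GROUP BY is used as in the daily-count query further down):

SELECT year, month, day,
       CASE
         WHEN referer LIKE '%chat.openai.com%'       THEN 'ChatGPT'
         WHEN referer LIKE '%edgeservices.bing.com%' THEN 'Bing Chat'
         WHEN referer LIKE '%bard.google.com%'       THEN 'Bard'
       END AS ai_source,
       COUNT(1) AS views
FROM wmf.pageview_actor
WHERE year = 2023 AND month = 3 AND day IN (26, 19)
  AND (referer LIKE '%chat.openai.com%'
       OR referer LIKE '%edgeservices.bing.com%'
       OR referer LIKE '%bard.google.com%')
GROUP BY 1, 2, 3, 4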
Iflorez updated the task description.

This test query, using expected parameters, pulls up many results from a single IP: 85.236.56.254

"""
WITH base AS (
    SELECT http.client_ip,
           search_id,
           http.request_headers,
           params,
           user_agent_map,
           hits,
           elasticsearch_requests
    FROM event.mediawiki_cirrussearch_request
    WHERE year = 2023
      AND month = {month}
      AND day = {day}
      AND database = 'enwiki'
      AND source = 'api'
      AND params IS NOT NULL
      AND hits IS NOT NULL
      AND http.method = 'GET'
      AND element_at(params, 'action') IS NOT NULL
      AND element_at(params, 'action') = 'query'
)

SELECT *
FROM base
WHERE
    -- constants taken from https://gitlab.wikimedia.org/repos/machine-learning/chatgpt-plugin/-/blob/dev/app/services/wikipedia_search_service.py
    request_headers['user-agent'] = 'USER_AGENT'
    AND params['format'] = 'json'
    AND params['action'] = 'query'
"""

Zero results when looking for the IP:

"""
WITH base AS (
    SELECT http.client_ip,
           http.request_headers,
           params
    FROM event.mediawiki_cirrussearch_request
    WHERE year = 2023
      AND month = 7
      AND day = 19
      AND hour = 9
      AND database = 'enwiki'
      AND source = 'api'
      AND params IS NOT NULL
      AND hits IS NOT NULL
      AND http.method = 'GET'
)

SELECT *
FROM base
WHERE client_ip LIKE '23.102.140%'
                        
"""

@Iflorez: The User-Agent isn't "USER_AGENT" – in the code that's a constant imported from https://gitlab.wikimedia.org/repos/machine-learning/chatgpt-plugin/-/blob/dev/app/constants.py

So the UA string the service uses is "wikipedia-chagpt-plugin bot"

The query below is working against the search logs data. See the notes with today's date.

"""
WITH base AS (
    SELECT http.client_ip,
           search_id,
           http.request_headers,
           params,
           user_agent_map,
           hits,
           elasticsearch_requests
    FROM event.mediawiki_cirrussearch_request
    WHERE year = 2023
      AND month = 7
      AND day = 25
      --AND database = 'enwiki'
      AND source = 'api'
      AND params IS NOT NULL
      AND hits IS NOT NULL
      AND http.method = 'GET'
      AND element_at(params, 'action') IS NOT NULL
      AND element_at(params, 'action') = 'query'
      AND http.request_headers['user-agent'] = 'wikipedia-chagpt-plugin bot'
)

SELECT *
FROM base
WHERE
    --request_headers['user-agent'] = 'wikipedia-chagpt-plugin bot' -- https://gitlab.wikimedia.org/repos/machine-learning/chatgpt-plugin/-/blob/dev/app/services/wikipedia_search_service.py
    params['format'] = 'json'
    AND params['action'] = 'query'
"""
And a daily count of searches coming from the plugin user agent:

SELECT
  DATE_TRUNC('day', CAST(FROM_ISO8601_TIMESTAMP(meta.dt) as TIMESTAMP)) AS __timestamp,
  COUNT(1) AS n_searches
FROM event.mediawiki_cirrussearch_request 
WHERE year = 2023 AND month >= 7
  AND http.request_headers['user-agent'] = 'wikipedia-chagpt-plugin bot'
GROUP BY 1

This work was completed. We worked with two tables, mediawiki_cirrussearch_request & webrequest, to gather internal-side data on plugin use. You can see the final report here:
https://meta.wikimedia.org/wiki/Future_Audiences/Experiments:_conversational/generative_AI