Look through two primary data sources, referred traffic and API search requests, to see whether we can identify ChatGPT in the logs.
For referred traffic:
Provenance parameter (wprov=gpio1): https://wikitech.wikimedia.org/wiki/Provenance
The value of this parameter is recorded in the X-Analytics field of the webrequest table (as x_analytics_map['wprov']).
webrequest table
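A minimal sketch of the referred-traffic check, assuming the raw X-Analytics value is a semicolon-separated list of key=value pairs and that wprov=gpio1 is the value to look for. The function names and the sample header strings are hypothetical and only for illustration; in the webrequest table the parsed map is already available as x_analytics_map.

```python
def parse_x_analytics(header: str) -> dict[str, str]:
    """Split a raw X-Analytics string such as 'ns=0;wprov=gpio1' into a dict."""
    pairs = {}
    for item in header.split(";"):
        key, sep, value = item.strip().partition("=")
        # Skip empty fragments and items that are not key=value pairs.
        if sep and key:
            pairs[key] = value
    return pairs


def is_chatgpt_referral(header: str, wprov: str = "gpio1") -> bool:
    """Return True when the provenance parameter matches the expected value."""
    # Assumption: 'gpio1' is the wprov value attached to ChatGPT-referred links.
    return parse_x_analytics(header).get("wprov") == wprov


# Hypothetical sample values, not taken from real logs.
samples = ["ns=0;wprov=gpio1;https=1", "ns=0;https=1"]
flags = [is_chatgpt_referral(s) for s in samples]
```

The same condition in a query against webrequest would be a filter on x_analytics_map['wprov'].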
For ChatGPT requests (accessing our search API):
wmf_raw.cirrussearchrequestset
https://datahub.wikimedia.org/dataset/urn:li:dataset:(urn:li:dataPlatform:hive,wmf_raw.cirrussearchrequestset,PROD)/Schema?is_lineage_mode=false&schemaFilter=
- https://wikitech.wikimedia.org/wiki/Search
- https://phabricator.wikimedia.org/project/view/209/
- https://etherpad.wikimedia.org/p/Search_Platform_Office_Hours
- https://en.wikipedia.org/wiki/Wikipedia?action=cirrusdump
The team has relied on AWS logs up to now. Do we need the AWS data?
If we don’t see it in our search logs (missing or unclear user agent), can we track requests by IP address?
FYI: https://meta.wikimedia.org/wiki/User-Agent_policy
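A rough sketch of the fallback described above: match on user agent first, then fall back to IP ranges when the user agent is missing or unclear. Everything specific here is an assumption: the "chatgpt" substring marker, the CANDIDATE_NETWORKS list (filled with a documentation-only range as a placeholder), and the function name. Real markers and ranges would need to be confirmed before use.

```python
import ipaddress

# Assumption: a ChatGPT user agent contains this substring (case-insensitive).
UA_MARKERS = ("chatgpt",)

# Placeholder only: 192.0.2.0/24 is a reserved documentation range,
# not a real ChatGPT or AWS address block.
CANDIDATE_NETWORKS = [ipaddress.ip_network("192.0.2.0/24")]


def classify_request(user_agent, client_ip):
    """Return how a request was matched: 'user_agent', 'ip', or 'unknown'."""
    ua = (user_agent or "").lower()
    if any(marker in ua for marker in UA_MARKERS):
        return "user_agent"
    try:
        ip = ipaddress.ip_address(client_ip or "")
    except ValueError:
        # Missing or malformed IP: nothing left to match on.
        return "unknown"
    if any(ip in net for net in CANDIDATE_NETWORKS):
        return "ip"
    return "unknown"
```

Keeping the match method in the result makes it easy to report how much traffic was identified by user agent versus by IP alone.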
See also this ChatGPT plugin process, method, and logging overview
See also: github.com/search-traffic-breakdown