Look through two primary data sources, referred traffic and API searches, to see whether we can identify ChatGPT in the logs.
For referred traffic:
Provenance parameter (wprov = gpio1); see https://wikitech.wikimedia.org/wiki/Provenance
The value of this parameter is recorded in the X-Analytics field of the webrequest table (as x_analytics_map['wprov']).
Source table: webrequest
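A minimal sketch of the referred-traffic check, for cases where only the raw X-Analytics string is at hand rather than the parsed x_analytics_map. It assumes the raw value is a semicolon-separated list of key=value pairs; the function names and the sample strings are hypothetical, and only the wprov value gpio1 comes from the note above.

```python
# Value of the provenance parameter noted above for ChatGPT referrals.
CHATGPT_WPROV = "gpio1"


def parse_x_analytics(header):
    """Split a raw X-Analytics style string into a dict.

    Assumption: the string looks like "key1=value1;key2=value2".
    Items without "=" are skipped.
    """
    pairs = {}
    for item in header.split(";"):
        item = item.strip()
        if not item or "=" not in item:
            continue
        key, value = item.split("=", 1)
        pairs[key.strip()] = value.strip()
    return pairs


def is_chatgpt_referral(header):
    """Return True when the wprov key carries the ChatGPT provenance value."""
    return parse_x_analytics(header).get("wprov") == CHATGPT_WPROV


if __name__ == "__main__":
    # Hypothetical sample values, not taken from real logs.
    for sample in ("ns=0;wprov=gpio1", "ns=0;wprov=sfla1", ""):
        print(repr(sample), is_chatgpt_referral(sample))
```

In the webrequest table itself the same test reduces to comparing x_analytics_map['wprov'] against 'gpio1'.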
For ChatGPT requests (accessing our search API):
wmf_raw.cirrussearchrequestset
https://datahub.wikimedia.org/dataset/urn:li:dataset:(urn:li:dataPlatform:hive,wmf_raw.cirrussearchrequestset,PROD)/Schema?is_lineage_mode=false&schemaFilter=
* https://wikitech.wikimedia.org/wiki/Search
* https://phabricator.wikimedia.org/project/view/209/
* https://etherpad.wikimedia.org/p/Search_Platform_Office_Hours
* https://en.wikipedia.org/wiki/Wikipedia?action=cirrusdump
The team has relied on AWS logs up to now. Do we need the AWS data?
If we don’t see it in our search logs (missing or unclear user agent), can we track requests by [[ https://platform.openai.com/docs/plugins/production/domain-verification-and-security | IPs ]]?
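If we fall back to IPs, the matching step could look like the sketch below, which uses Python's standard ipaddress module. The CIDR ranges are placeholders taken from the reserved documentation blocks, not OpenAI's actual addresses; the real ranges would have to come from the linked page. The function names are hypothetical.

```python
import ipaddress

# Placeholder ranges only (reserved documentation blocks).
# Replace with the egress ranges published on the linked OpenAI page.
PLACEHOLDER_RANGES = ["192.0.2.0/28", "2001:db8::/32"]


def build_networks(cidrs):
    """Parse CIDR strings into network objects once, up front."""
    return [ipaddress.ip_network(cidr) for cidr in cidrs]


def ip_in_ranges(ip, networks):
    """Return True if the IP string falls inside any of the networks.

    Unparseable values (empty or malformed log fields) count as no match.
    """
    try:
        addr = ipaddress.ip_address(ip)
    except ValueError:
        return False
    return any(
        addr.version == net.version and addr in net for net in networks
    )


if __name__ == "__main__":
    networks = build_networks(PLACEHOLDER_RANGES)
    for ip in ("192.0.2.5", "192.0.2.200", "2001:db8::1", "not-an-ip"):
        print(ip, ip_in_ranges(ip, networks))
```

This only helps if the search log retains a client IP for the request, which still needs to be confirmed against the schema.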
FYI: https://meta.wikimedia.org/wiki/User-Agent_policy
See also this [[ https://docs.google.com/document/d/1tX3yDE-URO09-ycxp0kGrTI7N-JYaT2EncJKwF3WcmM/edit | Chat GPT Plugin process, method, & logging overview ]]