Change Details

Look through two primary data sources, referred traffic and API searches, to see if we can identify ChatGPT in the logs.

For referred traffic:
* Provenance parameter (wprov = gpio1): https://wikitech.wikimedia.org/wiki/Provenance
* The value of this parameter is recorded in the X-Analytics field of the webrequest table (as x_analytics_map['wprov']).
* webrequest table

For ChatGPT requests (accessing our search API):
* wmf_raw.cirrussearchrequestset: https://datahub.wikimedia.org/dataset/urn:li:dataset:(urn:li:dataPlatform:hive,wmf_raw.cirrussearchrequestset,PROD)/Schema?is_lineage_mode=false&schemaFilter=
* https://wikitech.wikimedia.org/wiki/Search
* https://phabricator.wikimedia.org/project/view/209/
* https://etherpad.wikimedia.org/p/Search_Platform_Office_Hours
* https://en.wikipedia.org/wiki/Wikipedia?action=cirrusdump

The team has relied on AWS logs up to now. Do we need the AWS data? If we don’t see ChatGPT requests in our search log (missing/unclear user agent), can we track them by [[ https://platform.openai.com/docs/plugins/production/domain-verification-and-security | IPs ]]?

FYI: https://meta.wikimedia.org/wiki/User-Agent_policy

See also this [[ https://docs.google.com/document/d/1tX3yDE-URO09-ycxp0kGrTI7N-JYaT2EncJKwF3WcmM/edit | Chat GPT Plugin process, method, & logging overview ]]

See also: [[ https://github.com/wikimedia-research/search-traffic-breakdown/blob/main/T301902-get-events.ipynb | github.com/search-traffic-breakdown ]]
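
A minimal sketch of the referred-traffic check described in this task, for anyone prototyping it outside Hive. It assumes the raw X-Analytics value is a semicolon-separated list of key=value pairs. In the webrequest table this is already parsed into x_analytics_map, so the parser is only needed when working from raw header strings. The function names and the sample header values are hypothetical. Only wprov and gpio1 come from this task.

```python
from typing import Dict

# Provenance value this task names for ChatGPT-referred traffic.
CHATGPT_WPROV = "gpio1"


def parse_x_analytics(header: str) -> Dict[str, str]:
    """Parse a raw X-Analytics string into a dict.

    Assumption: the header looks like 'k1=v1;k2=v2'. Pairs without
    an '=' or without a key are skipped.
    """
    fields: Dict[str, str] = {}
    for pair in header.split(";"):
        key, sep, value = pair.strip().partition("=")
        if sep and key:
            fields[key] = value
    return fields


def is_chatgpt_referral(header: str) -> bool:
    """Return True when the request carries wprov=gpio1."""
    return parse_x_analytics(header).get("wprov") == CHATGPT_WPROV


if __name__ == "__main__":
    # Hypothetical sample headers, not taken from real logs.
    samples = ["ns=0;wprov=gpio1;https=1", "ns=0;https=1"]
    for sample in samples:
        print(sample, is_chatgpt_referral(sample))
```

This only covers traffic that arrives with the provenance parameter. Requests that reach the search API directly would still need the user agent or IP checks asked about above.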