In T112605, we performed a broad analysis of [[ https://www.mediawiki.org/wiki/Wikidata_query_service | Wikidata Query Service ]] users and queries. This was almost a year ago, and we're coming up on the first anniversary of WDQS' public launch (announced on Monday, 7 September 2015). The [[ http://discovery.wmflabs.org/wdqs/ | WDQS dashboard ]] only tracks basic metrics like SPARQL usage, so we don't currently have an up-to-date picture of who WDQS users are and where they're from. But it would be nice to know how that picture looks these days! :)
In this task, you will perform an original analysis of [[ https://wikitech.wikimedia.org/wiki/Analytics/Data/Webrequest | web requests ]], focusing specifically on successful (HTTP status codes 200 & 304) web requests to the SPARQL endpoint (see [[ https://github.com/wikimedia/wikimedia-discovery-golden/blob/master/wdqs/basic_usage.R | golden/wdqs/basic_usage.R ]] and [[ https://github.com/wikimedia-research/Discovery-WDQS-Adhoc-Usage/blob/master/01a%20Data.R#L44 | lines 45-52 ]] from that old report's analysis codebase for reference). Your analysis should focus on the geographic and agent-type breakdown of those queries. Which countries do WDQS users come from? What are the top countries by SPARQL query volume? How does that breakdown change when you compare known automata against everything else? Are the patterns consistent day-to-day over the course of a week?
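To make the filtering concrete, a starting-point Hive query might look like the sketch below. The `uri_path` pattern and the `webrequest_source` partition value are assumptions about how the SPARQL endpoint was served at the time — verify both against basic_usage.R and the old report's codebase before trusting the counts.

```sql
-- Sketch: daily SPARQL query counts by country and agent type.
-- ASSUMPTIONS (check against basic_usage.R): WDQS traffic is in the 'misc'
-- webrequest source, and the SPARQL endpoint path contains '/sparql'.
SELECT
  day,
  geocoded_data['country'] AS country,
  agent_type,                 -- 'spider' = known automata, 'user' = everything else
  COUNT(1) AS queries
FROM wmf.webrequest
WHERE webrequest_source = 'misc'
  AND year = 2016 AND month = 8
  AND day BETWEEN 1 AND 7               -- one example week
  AND http_status IN ('200', '304')     -- successful requests only
  AND uri_path LIKE '%/sparql%'
GROUP BY day, geocoded_data['country'], agent_type;
```

On the analysis side, a query string like this can be passed to `wmf::query_hive()` the same way the golden codebase does it.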
Produce a 1-2 page report of your findings. Once the report has been reviewed & OK'd by me, @debt, and @Smalyshev, please upload the PDF to [[ https://commons.wikimedia.org/wiki/Main_Page | Commons ]].
# Tips & Links
* You shouldn't need to import/use any refinery UDFs for this analysis; you'll do this in the next task :P
* Study the refined webrequest [[ https://wikitech.wikimedia.org/wiki/Analytics/Data/Webrequest#Current_Schema | schema ]]
* These articles on [[ https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hive | Hive ]] and [[ https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hive/Queries | Hive queries ]] are good resources. The second one uses the [[ https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Beeline | Beeline ]] interface, which we tried migrating to once but it didn't work out, so [[ https://github.com/wikimedia/wikimedia-discovery-wmf/blob/master/R/hive.R | wmf::query_hive() ]] still uses the Hive CLI.
* Remember not to include any PII (like IP addresses) in your report, and do not upload the raw data if you end up making a GitHub repo like [[ https://github.com/wikimedia-research/Discovery-WDQS-Adhoc-Usage | this one ]]
* After uploading the report to Commons, you'll need to copy over some of the licensing info from [[ https://commons.wikimedia.org/wiki/File:Impact_of_Wikipedia.org_Portal_Changes_-_A_Retrospective_Statistical_Analysis.pdf | this report ]] to yours
* As always, don't hesitate to ask questions or to ask for help/clarification :D