In T112605, we performed a broad analysis of Wikidata Query Service users and queries. This was almost a year ago, and we're coming up on the first anniversary of WDQS' public launch (announced on Monday, 7 September 2015). The WDQS dashboard only tracks basic metrics like SPARQL usage, so we don't currently have an up-to-date picture of who WDQS users are and where they're from. But it would be nice to know how that picture looks these days! :)
In this task, you will perform an original analysis of web requests, focusing specifically on successful (HTTP status codes 200 & 304) web requests to the SPARQL endpoint (see golden/wdqs/basic_usage.R and lines 45-52 from that old report's analysis codebase for reference). Your analysis should focus on the geographic and agent type breakdown of those queries. Which countries have users who use WDQS? What are the top countries by number of SPARQL queries? How does that breakdown look when you compare known automata with everything else? Are the patterns consistent day-to-day over the course of a week?
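As a starting point, here is a minimal sketch of what the per-day query might look like. It is written in Python only to assemble the HiveQL text for each day of a week; the resulting strings could then be run however you normally run Hive queries. The table and column names (`wmf.webrequest`, `uri_host`, `uri_path`, `http_status`, `geocoded_data['country']`, `agent_type`), the host and path values, and the helper names `build_query` and `week_of_queries` are all assumptions for illustration, so check them against the refined webrequest schema and the referenced basic_usage.R before relying on them.

```python
from datetime import date, timedelta

# ASSUMPTION: every table/column name and literal below is illustrative and
# should be verified against the refined webrequest schema.
# No IP addresses or other PII are selected; only aggregate counts come back.
QUERY_TEMPLATE = """
SELECT
  geocoded_data['country'] AS country,
  agent_type,
  COUNT(1) AS requests
FROM wmf.webrequest
WHERE year = {year} AND month = {month} AND day = {day}
  AND uri_host = 'query.wikidata.org'
  AND uri_path = '/bigdata/namespace/wdq/sparql'
  AND http_status IN ('200', '304')
GROUP BY geocoded_data['country'], agent_type
ORDER BY requests DESC
"""


def build_query(day):
    """Return the HiveQL text counting successful SPARQL requests for one day."""
    return QUERY_TEMPLATE.format(year=day.year, month=day.month, day=day.day).strip()


def week_of_queries(start):
    """Map each of 7 consecutive ISO dates (from `start`) to its query text."""
    days = [start + timedelta(days=offset) for offset in range(7)]
    return {d.isoformat(): build_query(d) for d in days}


if __name__ == "__main__":
    # Example: print the query for the first day of an arbitrary week.
    queries = week_of_queries(date(2016, 8, 29))
    print(queries["2016-08-29"])
```

Grouping by both country and agent type in one query keeps the known-automata comparison and the country ranking in a single result set per day, which makes the day-to-day comparison a matter of stacking seven small tables. Adding a filter on the table's source partition, if the schema has one, would likely cut down how much data each query scans.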
Tips & Links
- You shouldn't need to import/use any refinery UDFs for this analysis; you'll do that in the next task :P
- Study the refined webrequest schema
- These articles on Hive and Hive queries are good resources. The second one uses the Beeline interface, which we tried to migrate to once, but it didn't work out, so wmf::query_hive() still uses Hive. And here's a good reference for the functions and operations built into HiveQL.
- Remember not to include any PII, such as IP addresses, in your report, and don't upload the data if you end up making a GitHub repo like this one
- After uploading the report to Commons, you'll need to copy over some of the licensing info from this report to yours
- As always, don't hesitate to ask questions or to ask for help/clarification :D