
WDQS: Geographic breakdown of SPARQL queries
Closed, Resolved · Public · 6 Estimated Story Points

Assigned To
Authored By
mpopov
Aug 24 2016, 1:12 AM
Referenced Files
F4553759: report.pdf
Oct 4 2016, 4:47 AM
F4537643: report.pdf
Sep 29 2016, 6:36 PM
F4487819: report.pdf
Sep 19 2016, 9:26 PM
F4452046: report.pdf
Sep 9 2016, 8:58 PM
F4420629: report.pdf
Aug 31 2016, 6:22 PM

Description

Background

In T112605, we performed a broad analysis of Wikidata Query Service users and queries. This was almost a year ago, and we're coming up on the first anniversary of WDQS' public launch (announced on Monday, 7 September 2015). The WDQS dashboard only tracks basic metrics like SPARQL usage, so we don't currently have an up-to-date picture of who WDQS users are and where they're from. But it would be nice to know how that picture looks these days! :)

Objective

In this task, you will perform an original analysis of web requests, restricted to successful (HTTP status codes 200 & 304) requests to the SPARQL endpoint (see golden/wdqs/basic_usage.R and lines 45-52 from that old report's analysis codebase for reference). The analysis should focus on the geographic and agent-type breakdown of those queries. Which countries do WDQS users come from? What are the top countries by number of SPARQL queries? How does the breakdown change when you compare known automata against everything else? Are the patterns consistent day to day over the course of a week?
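
A rough sketch of the kind of Hive query this breakdown calls for, assuming the refined webrequest table's geocoded_data map and agent_type field (not the final analysis code; the dates are placeholders to adjust to your analysis window):

-- Successful SPARQL queries by country and agent type for one example day.
SELECT geocoded_data['country'] AS country,
  agent_type,
  COUNT(*) AS queries
FROM webrequest
WHERE year = 2016 AND month = 8 AND day = 15   -- placeholder date
  AND webrequest_source = 'misc'
  AND uri_host = 'query.wikidata.org'
  AND uri_path = '/bigdata/namespace/wdq/sparql'
  AND http_status IN ('200', '304')
  AND INSTR(uri_query, '?query=') > 0
GROUP BY geocoded_data['country'], agent_type
ORDER BY queries DESC;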

Produce a 1-2 page report of your findings. Once the report has been reviewed & OK'd by me, @debt, and @Smalyshev, please upload the PDF to Commons.

Tips & Links

  • You shouldn't need to import/use any refinery UDFs for this analysis; you'll do that in the next task :P
  • Study the refined webrequest schema; the query sketches above and after this list use a few of its fields (geocoded_data, agent_type, the year/month/day partitions)
  • These articles on Hive and Hive queries are good resources. The second one uses the Beeline interface, which we tried migrating to once, but it didn't work out, so wmf::query_hive() still uses Hive. And here's a good reference for the functions and operators built into HiveQL.
  • Remember not to include any PII (like IP addresses) in your report, and do not upload the data if you end up making a GitHub repo like this one
  • After uploading the report to Commons, you'll need to copy over some of the licensing info from this report to yours
  • As always, don't hesitate to ask questions or to ask for help/clarification :D
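
For the day-to-day consistency question, a minimal sketch along the same lines (same assumptions as the sketch above; widen the day filter to whichever week you pick):

-- Daily query counts split by agent type across one week, to check whether
-- the known-automata vs. everything-else pattern is stable day to day.
SELECT CONCAT(year, '-', month, '-', day) AS dt,
  agent_type,
  COUNT(*) AS queries
FROM webrequest
WHERE year = 2016 AND month = 8 AND day BETWEEN 15 AND 21   -- placeholder week
  AND webrequest_source = 'misc'
  AND uri_host = 'query.wikidata.org'
  AND uri_path = '/bigdata/namespace/wdq/sparql'
  AND http_status IN ('200', '304')
  AND INSTR(uri_query, '?query=') > 0
GROUP BY CONCAT(year, '-', month, '-', day), agent_type;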

Event Timeline

First draft of the report:


I put a lot of stuff into the report. However, because of my lack of domain knowledge, I don't have a very clear idea of which questions are meaningful/useful to answer, so any suggestions are very welcome!!!

First draft looks good! I will try to review this as soon as I can :)

Reviewed; marked-up copy of the 1st draft sent back to Chelsy. Looking forward to the 2nd draft :P

Reviewed copy with minor corrections & suggestions sent back to Chelsy.

Great job! Let's put it up on Commons! :)

Use the following licensing & categorization:

=={{int:license-header}}==
{{WMF-staff-upload|license=cc-by-sa-4.0}}
{{Wikimedia trademark}}

[[Category:Wikimedia Discovery]]
[[Category:Wiki Research]]

Let's be sure to get feedback from @Smalyshev :)

Updated Reviewers:


@debt and @Smalyshev, your suggestions are very welcome!!! :)

Excellent analysis! I think we should make a blog post out of its highlights.

BTW, does time-to-first-byte exclude error responses? If not, then a sharp decline may indicate downtime: if the service is down for maintenance, for example, it will answer very rapidly. 50 ms seems too low for a real query to run, but about right for an error response. Unfortunately, I don't remember whether we had downtime around August 16 :)

@Smalyshev what do you mean by "error responses"?
Here is an example of my query:

-- Daily medians of time-to-first-byte and response size for successful
-- SPARQL queries; the http_status filter already excludes error responses.
SELECT CONCAT(year, '-', month, '-', day) AS dt,
  PERCENTILE_APPROX(time_firstbyte, 0.5) AS median_time_firstbyte,
  PERCENTILE(response_size, 0.5) AS median_response_size
FROM webrequest
WHERE year = 2016 AND month = 7 AND day = 1
  AND webrequest_source = 'misc'                    -- the cache cluster serving WDQS
  AND uri_host = 'query.wikidata.org'
  AND uri_path = '/bigdata/namespace/wdq/sparql'    -- the SPARQL endpoint
  AND http_status IN ('200', '304')                 -- successful requests only
  AND INSTR(uri_query, '?query=') > 0               -- requests that actually carry a query
GROUP BY CONCAT(year, '-', month, '-', day);
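
Since the status filter above keeps only 200 & 304, error responses are already excluded from these medians. A hypothetical variant that drops that filter and splits by status could check whether a burst of fast error responses lines up with the August 16 dip (sketch only):

-- Request counts and median TTFB per status around the dip. A spike of
-- fast non-200 responses would point to downtime rather than genuinely
-- fast queries.
SELECT CONCAT(year, '-', month, '-', day) AS dt,
  http_status,
  COUNT(*) AS requests,
  PERCENTILE_APPROX(time_firstbyte, 0.5) AS median_time_firstbyte
FROM webrequest
WHERE year = 2016 AND month = 8 AND day = 16
  AND webrequest_source = 'misc'
  AND uri_host = 'query.wikidata.org'
  AND uri_path = '/bigdata/namespace/wdq/sparql'
  AND INSTR(uri_query, '?query=') > 0
GROUP BY CONCAT(year, '-', month, '-', day), http_status;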

Modified:


@debt Please let me know if there is anything else that needs to be changed.

Looks great - thanks! :) It's good to go up on Commons!

I've updated these two pages with a link to the analysis, to make sure we don't lose the links to good data! :)

https://www.mediawiki.org/wiki/Wikidata_query_service#Reports
https://www.mediawiki.org/wiki/Discovery_Analysis#Past_analyses