
WDQS: Geographic breakdown of SPARQL queries
Closed, Resolved · Public · 6 Estimated Story Points

Assigned To
Authored By
mpopov
Aug 24 2016, 1:12 AM
Referenced Files
F4553759: report.pdf
Oct 4 2016, 4:47 AM
F4537643: report.pdf
Sep 29 2016, 6:36 PM
F4487819: report.pdf
Sep 19 2016, 9:26 PM
F4452046: report.pdf
Sep 9 2016, 8:58 PM
F4420629: report.pdf
Aug 31 2016, 6:22 PM

Description

Background

In T112605, we performed a broad analysis of Wikidata Query Service users and queries. This was almost a year ago, and we're coming up on the first anniversary of WDQS' public launch (announced on Monday, 7 September 2015). The WDQS dashboard only tracks basic metrics like SPARQL usage, so we don't currently have an up-to-date picture of who WDQS users are and where they're from. But it would be nice to know how that picture looks these days! :)

Objective

In this task, you will perform an original analysis of web requests, restricted to successful (HTTP status codes 200 & 304) requests to the SPARQL endpoint (see golden/wdqs/basic_usage.R and lines 45-52 from that old report's analysis codebase for reference). The analysis should focus on the geographic and agent-type breakdown of those queries. Which countries do WDQS users come from? What are the top countries by number of SPARQL queries? How does the breakdown change when you compare known automata against everything else? Are the patterns consistent day to day over the course of a week?
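
A rough sketch of the kind of Hive query this breakdown calls for, assuming the refined webrequest table's geocoded_data map and agent_type field (not the final analysis code; the dates are placeholders to adjust to your analysis window):

-- Successful SPARQL queries by country and agent type for one example day.
SELECT geocoded_data['country'] AS country,
  agent_type,
  COUNT(*) AS queries
FROM webrequest
WHERE year = 2016 AND month = 8 AND day = 15   -- placeholder date
  AND webrequest_source = 'misc'
  AND uri_host = 'query.wikidata.org'
  AND uri_path = '/bigdata/namespace/wdq/sparql'
  AND http_status IN ('200', '304')
  AND INSTR(uri_query, '?query=') > 0
GROUP BY geocoded_data['country'], agent_type
ORDER BY queries DESC;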

Produce a 1-2 page report of your findings. Once the report has been reviewed & OK'd by me, @debt, and @Smalyshev, please upload the PDF to Commons.

Tips & Links

  • You shouldn't need to import/use any refinery UDFs for this analysis; you'll do that in the next task :P
  • Study the refined webrequest schema; the query sketches above and after this list use a few of its fields (geocoded_data, agent_type, the year/month/day partitions)
  • These articles on Hive and Hive queries are good resources. The second one uses the Beeline interface, which we tried migrating to once, but it didn't work out, so wmf::query_hive() still uses Hive. And here's a good reference for the functions and operators built into HiveQL.
  • Remember not to include any PII (like IP addresses) in your report, and do not upload the data if you end up making a GitHub repo like this one
  • After uploading the report to Commons, you'll need to copy over some of the licensing info from this report to yours
  • As always, don't hesitate to ask questions or to ask for help/clarification :D
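
For the day-to-day consistency question, a minimal sketch along the same lines (same assumptions as the sketch above; widen the day filter to whichever week you pick):

-- Daily query counts split by agent type across one week, to check whether
-- the known-automata vs. everything-else pattern is stable day to day.
SELECT CONCAT(year, '-', month, '-', day) AS dt,
  agent_type,
  COUNT(*) AS queries
FROM webrequest
WHERE year = 2016 AND month = 8 AND day BETWEEN 15 AND 21   -- placeholder week
  AND webrequest_source = 'misc'
  AND uri_host = 'query.wikidata.org'
  AND uri_path = '/bigdata/namespace/wdq/sparql'
  AND http_status IN ('200', '304')
  AND INSTR(uri_query, '?query=') > 0
GROUP BY CONCAT(year, '-', month, '-', day), agent_type;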

Event Timeline

First draft of the report:


I put a lot of stuff into the report. However, because of my lack of domain knowledge, I don't have a very clear idea of which questions are meaningful/useful to answer, so any suggestions are very welcome!!!

First draft looks good! I will try to review this as soon as I can :)

Reviewed; marked-up copy of the 1st draft sent back to Chelsy. Looking forward to the 2nd draft :P

Reviewed copy with minor corrections & suggestions sent back to Chelsy.

Great job! Let's put it up on Commons! :)

Use the following licensing & categorization:

=={{int:license-header}}==
{{WMF-staff-upload|license=cc-by-sa-4.0}}
{{Wikimedia trademark}}

[[Category:Wikimedia Discovery]]
[[Category:Wiki Research]]

Let's be sure to get feedback from @Smalyshev :)

Updated Reviewers:


@debt and @Smalyshev, your suggestions are very welcome!!! :)

Excellent analysis! I think we should make a blog post out of its highlights.

BTW, does time-to-first-byte exclude error responses? If not, then a sharp decline may indicate downtime: if the service is down for maintenance, for example, it will answer very rapidly. 50 ms seems too low for a real query to run, but about right for an error response. Unfortunately, I don't remember whether we had downtime around August 16 :)

@Smalyshev what do you mean by "error responses"?
Here is an example of my query:

-- Daily medians of time-to-first-byte and response size for successful
-- SPARQL queries; the http_status filter already excludes error responses.
SELECT CONCAT(year, '-', month, '-', day) AS dt,
  PERCENTILE_APPROX(time_firstbyte, 0.5) AS median_time_firstbyte,
  PERCENTILE(response_size, 0.5) AS median_response_size
FROM webrequest
WHERE year = 2016 AND month = 7 AND day = 1
  AND webrequest_source = 'misc'                    -- the cache cluster serving WDQS
  AND uri_host = 'query.wikidata.org'
  AND uri_path = '/bigdata/namespace/wdq/sparql'    -- the SPARQL endpoint
  AND http_status IN ('200', '304')                 -- successful requests only
  AND INSTR(uri_query, '?query=') > 0               -- requests that actually carry a query
GROUP BY CONCAT(year, '-', month, '-', day);
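
Since the status filter above keeps only 200 & 304, error responses are already excluded from these medians. A hypothetical variant that drops that filter and splits by status could check whether a burst of fast error responses lines up with the August 16 dip (sketch only):

-- Request counts and median TTFB per status around the dip. A spike of
-- fast non-200 responses would point to downtime rather than genuinely
-- fast queries.
SELECT CONCAT(year, '-', month, '-', day) AS dt,
  http_status,
  COUNT(*) AS requests,
  PERCENTILE_APPROX(time_firstbyte, 0.5) AS median_time_firstbyte
FROM webrequest
WHERE year = 2016 AND month = 8 AND day = 16
  AND webrequest_source = 'misc'
  AND uri_host = 'query.wikidata.org'
  AND uri_path = '/bigdata/namespace/wdq/sparql'
  AND INSTR(uri_query, '?query=') > 0
GROUP BY CONCAT(year, '-', month, '-', day), http_status;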

Modified:


@debt Please let me know if there is anything else that needs to be changed.

Looks great - thanks! :) It's good to go up on Commons!

I've updated these two pages with a link to the analysis, to make sure we don't lose the links to good data! :)

https://www.mediawiki.org/wiki/Wikidata_query_service#Reports
https://www.mediawiki.org/wiki/Discovery_Analysis#Past_analyses