
Examine WDQS queries for intent
Closed, Resolved · Public

Description

As a WDQS administrator, I want to examine WDQS queries for intent so that I can determine alternative solutions for expensive queries in order to effectively scale the system.

If we can determine the intent of a sample of queries, we may be able to determine the right things to do about them. For example, dropping some expensive queries may be an option depending on what we find - explicitly saying this is something we can't support.

Something to look at: is a query complicated in terms of inference, or is it because it’s wrongly formulated and the system can’t answer it in natural ways?

For this task, we will manually examine 50 queries for intent.

Per the email from @JAllemandou:

The base data is a day of WDQS requests that returned a result (HTTP 200) and are parsable by my SPARQL parser (90% of all HTTP-200 requests are).

I have randomly picked 10 queries per query-time-bucket (less than 10ms, between 10ms and 100ms, between 100ms and 1s, between 1s and 10s, more than 10s).

Finally I have tried to provide helpful info as discussed, such as nodes (distinguishing URIs, literals (values) and variables), operators and labels.

The file is in TSV format with a header. It has cells spanning multiple lines, and in order to open it in a not-so-messy manner I picked separator = tab and string-delimiter = ". This gives me one query per line, with multi-line queries inside cells as expected.

Please let me know if this is ok, as choosing a format to present this data in a hopefully readable way was somewhat challenging :)

PS: At least a few queries are duplicates - I left them unchanged on purpose

Event Timeline

Restricted Application added a subscriber: Aklapper.

@Lydia_Pintscher I know this is already in progress, but I wanted to create a phab task so we can track it in a findable place :)

I went over the first few queries. (I skipped a few that need more time/help/...). It'd be helpful if someone could have a look now and let me know if this is useful and going in the right direction.
I'm doing this here: https://docs.google.com/spreadsheets/d/1Km0Ene2LzBXnjRE-hpl6X5uBKyPw-o_K2UemcE5p7T8/edit?usp=sharing

Some observations so far:

  • I'm very often not 100% confident about my judgments
  • judging how up-to-date the query result needs to be is the hardest for me
  • I'm coming across a number of queries that are basically variations of the same query with just an ID changed. I'll probably mark them somehow and not bother repeating the same info for each of them
  • People are doing weird things...

> People are doing weird things...

@Lydia_Pintscher If you want a hand looking at any of these then perhaps we should schedule a quick call and go through what remains together?

> judging how up-to-date the query result needs to be is the hardest for me

I wonder if we should have a system to categorize these into three buckets? Something like "queries that need to be immediately up to date" (e.g., checking whether something I'm about to add is already there, or other types of maintenance work), "queries that should be relatively up to date" (e.g., a list of people who have died, which should probably be up to date within the last day or so), and "queries that can be completely async, or updated within the last week/month" (e.g., getting the life expectancy for a village in Peru). Does that seem like a helpful framework?

> @Lydia_Pintscher If you want a hand looking at any of these then perhaps we should schedule a quick call and go through what remains together?

Yes! That'd be lovely. I'll schedule something.

> judging how up-to-date the query result needs to be is the hardest for me
>
> I wonder if we should have a system to categorize these into three buckets? Something like "queries that need to be immediately up to date" (e.g., checking whether something I'm about to add is already there, or other types of maintenance work), "queries that should be relatively up to date" (e.g., a list of people who have died, which should probably be up to date within the last day or so), and "queries that can be completely async, or updated within the last week/month" (e.g., getting the life expectancy for a village in Peru). Does that seem like a helpful framework?

I was thinking of doing something similar, yeah. But even that looks harder than I expected. It often seems to depend a lot more on the context the data is used in than I can guess from the query :/

Adam and I went through the remaining queries. Here are a few more observations:

  • I sorted the queries alphabetically and this very easily grouped similar queries together.
  • Adam had the brilliant idea of searching GitHub for snippets from the queries, and we found a few projects there responsible for some of the queries in the sample.
  • Very few queries seem to have comments.
  • We found a pattern where people search for everything (for example all movies) and then filter by an identifier. This is pretty inefficient and could probably be rewritten.
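To make that last pattern concrete, here is a hedged sketch of what such a query might look like and how it could be rewritten. The specific property and item IDs (P31 "instance of", Q11424 "film", P345 "IMDb ID") and the example identifier value are illustrative assumptions, not taken from the actual sample:

```sparql
# Inefficient pattern: enumerate all films, then filter on an identifier value
SELECT ?film WHERE {
  ?film wdt:P31 wd:Q11424 .     # instance of: film (matches everything first)
  ?film wdt:P345 ?imdbId .      # IMDb ID
  FILTER(?imdbId = "tt0133093")
}
```

A rewrite that binds the identifier directly lets the engine start from an index lookup instead of scanning all films:

```sparql
# Cheaper: match the identifier triple directly, then confirm the type
SELECT ?film WHERE {
  ?film wdt:P345 "tt0133093" .  # IMDb ID as a constant object
  ?film wdt:P31 wd:Q11424 .
}
```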

Another key use case that came up was retrieving data for one or more known entities, but expanding the statement properties and values to include label data.
Ultimately this is just a search or a lookup, with more data than is contained within the matched entity itself.
It's not something we have an API for currently, but it can be done with 2 calls, or it may be covered by a future GraphQL API.
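For illustration, one way this use case can be expressed in a single SPARQL query today is via the WDQS label service. This is a sketch only; the choice of entity (Q42) and language ("en") are arbitrary assumptions:

```sparql
# Sketch: fetch all direct statements of a known entity and expand
# both the property and the value with their English labels
SELECT ?prop ?propLabel ?value ?valueLabel WHERE {
  wd:Q42 ?p ?value .
  ?prop wikibase:directClaim ?p .   # map the wdt: predicate back to its property entity
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
```

The hypothetical two-call alternative mentioned above would presumably be one call for the entity's statements and a second bulk call resolving the referenced IDs to labels.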

Here is the spreadsheet with the results: https://docs.google.com/spreadsheets/d/1Km0Ene2LzBXnjRE-hpl6X5uBKyPw-o_K2UemcE5p7T8/edit#gid=708401958

Since the manual analysis is complete, I'm closing this ticket. Next steps are described in T266022.