
Examine WDQS queries for intent
Closed, Resolved · Public

Description

As a WDQS administrator, I want to examine WDQS queries for intent so that I can determine alternative solutions for expensive queries in order to effectively scale the system.

If we can determine the intent of a sample of queries, we may be able to determine the right things to do about them. For example, dropping some expensive queries may be an option depending on what we find - explicitly saying this is something we can't support.

Something to look at: is a query complicated in terms of inference, or is it because it’s wrongly formulated and the system can’t answer it in natural ways?

For this task, we will manually examine 50 queries for intent.

Per the email from @JAllemandou:

The base data is a day of WDQS requests that returned a result (HTTP 200) and are parsable by my SPARQL parser (90% of all HTTP-200 requests are).

I have randomly picked 10 queries per query-time-bucket (less than 10ms, between 10ms and 100ms, between 100ms and 1s, between 1s and 10s, more than 10s).

Finally I have tried to provide helpful info as discussed, such as nodes (distinguishing URIs, literals (values) and variables), operators and labels.

The file is in TSV format with a header. It has cells spanning multiple lines, and in order to open it in a not-so-messy manner I picked separator = tab and string-delimiter = ". This gives me one query per line, with multi-line queries inside cells as expected.

Please let me know if this is ok, as choosing a format to present this data in a hopefully readable way was somewhat challenging :)

PS: At least a few queries are duplicates - I left them unchanged on purpose

Event Timeline

Restricted Application added a subscriber: Aklapper.

@Lydia_Pintscher I know this is already in progress, but I wanted to create a phab task so we can track it in a findable place :)

I went over the first few queries. (I skipped a few that need more time/help/...). It'd be helpful if someone could have a look now and let me know if this is useful and going in the right direction.
I'm doing this here: https://docs.google.com/spreadsheets/d/1Km0Ene2LzBXnjRE-hpl6X5uBKyPw-o_K2UemcE5p7T8/edit?usp=sharing

Some observations so far:

  • I'm very often not 100% confident about my judgments
  • judging how up-to-date the query result needs to be is the hardest for me
  • I'm coming across a number of queries that are basically variations of the same query with just an ID changed. I'll probably mark them somehow and not bother repeating the same info for each of them
  • People are doing weird things...

> People are doing weird things...

@Lydia_Pintscher If you want a hand looking at any of these then perhaps we should schedule a quick call and go through what remains together?

> judging how up-to-date the query result needs to be is the hardest for me

I wonder if we should have a system to categorize these into three buckets? Something like "queries that need to be immediately up to date" (e.g., checking whether something I'm about to add is already there, or other types of maintenance work), "queries that should be relatively up to date" (e.g., a list of people who have died, which should probably be up to date within the last day or so), and "queries that can be completely async, or updated within the last week/month" (e.g., getting the life expectancy for a village in Peru). Does that seem like a helpful framework?

> @Lydia_Pintscher If you want a hand looking at any of these then perhaps we should schedule a quick call and go through what remains together?

Yes! That'd be lovely. I'll schedule something.

> judging how up-to-date the query result needs to be is the hardest for me
>
> I wonder if we should have a system to categorize these into three buckets? Something like "queries that need to be immediately up to date" (e.g., checking whether something I'm about to add is already there, or other types of maintenance work), "queries that should be relatively up to date" (e.g., a list of people who have died, which should probably be up to date within the last day or so), and "queries that can be completely async, or updated within the last week/month" (e.g., getting the life expectancy for a village in Peru). Does that seem like a helpful framework?

I was thinking of doing something similar, yeah. But even that looks harder than I expected. It often seems to depend a lot more on the context the data is used in than I can guess from the query :/

Adam and I went through the remaining queries. Here are a few more observations:

  • I sorted the queries alphabetically and this very easily grouped similar queries together.
  • Adam had the brilliant idea of searching GitHub for snippets from the queries, and we found a few projects there responsible for some of the queries in the sample.
  • Very few queries seem to have comments.
  • We found a pattern where people search for everything (for example all movies) and then filter by an identifier. This is pretty inefficient and could probably be rewritten.
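To make that last pattern concrete, here is a hedged sketch of what such a query might look like and how it could be rewritten. The specific property and item IDs (P31 "instance of", Q11424 "film", P345 "IMDb ID") and the example identifier value are illustrative assumptions, not taken from the actual sample:

```sparql
# Inefficient pattern: enumerate all films, then filter on an identifier value
SELECT ?film WHERE {
  ?film wdt:P31 wd:Q11424 .     # instance of: film (matches everything first)
  ?film wdt:P345 ?imdbId .      # IMDb ID
  FILTER(?imdbId = "tt0133093")
}
```

A rewrite that binds the identifier directly lets the engine start from an index lookup instead of scanning all films:

```sparql
# Cheaper: match the identifier triple directly, then confirm the type
SELECT ?film WHERE {
  ?film wdt:P345 "tt0133093" .  # IMDb ID as a constant object
  ?film wdt:P31 wd:Q11424 .
}
```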

Another key use case that came up was retrieving data for one or more known entities, but expanding the statement properties and values to include label data.
Ultimately this is just a search or a lookup, with more data than is contained within the matched entity itself.
It's not something we have an API for currently, but it can be done with 2 calls, or it may be covered by a future GraphQL API.
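For illustration, one way this use case can be expressed in a single SPARQL query today is via the WDQS label service. This is a sketch only; the choice of entity (Q42) and language ("en") are arbitrary assumptions:

```sparql
# Sketch: fetch all direct statements of a known entity and expand
# both the property and the value with their English labels
SELECT ?prop ?propLabel ?value ?valueLabel WHERE {
  wd:Q42 ?p ?value .
  ?prop wikibase:directClaim ?p .   # map the wdt: predicate back to its property entity
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
```

The hypothetical two-call alternative mentioned above would presumably be one call for the entity's statements and a second bulk call resolving the referenced IDs to labels.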

Here is the spreadsheet with the results: https://docs.google.com/spreadsheets/d/1Km0Ene2LzBXnjRE-hpl6X5uBKyPw-o_K2UemcE5p7T8/edit#gid=708401958

Since the manual analysis is complete, I'm closing this ticket. Next steps are described in T266022.