As a WDQS administrator, I want to examine WDQS queries for intent so that I can determine alternative solutions for expensive queries in order to effectively scale the system.
If we can determine the intent of a sample of queries, we may be able to determine the right things to do about them. For example, depending on what we find, one option may be to drop some expensive queries and state explicitly that this is something we can't support.
Something to look at: is a query expensive because it is genuinely complicated in terms of inference, or because it is wrongly formulated and the system can't answer it in a natural way?
For this task, we will manually examine 50 queries for intent.
Per the email from @JAllemandou:
The base data is a day of WDQS requests that returned a result (HTTP 200) and that are parsable by my SPARQL parser (90% of all HTTP 200 requests are).
I have randomly picked 10 queries per query-time-bucket (less than 10ms, between 10ms and 100ms, between 100ms and 1s, between 1s and 10s, more than 10s).
Finally I have tried to provide helpful info as discussed, such as nodes (distinguishing URIs, literals (values) and variables), operators and labels.
The file is in TSV format with a header. It has cells spanning multiple lines, and in order to open it in a not-so-messy manner I picked separator = tab and string-delimiter = ". This gives me one query per row, with multi-line queries kept inside their cells as expected.
Please let me know if this is ok, as choosing a format to present this data in a hopefully readable way was somewhat challenging :)
PS: At least a few queries are duplicates. I left them in on purpose.
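
For anyone who prefers to load the file programmatically rather than in a spreadsheet, here is a minimal sketch that applies the settings described in the email (separator = tab, string-delimiter = ") using Python's standard `csv` module. The function names are made up for illustration, and the actual column names come from whatever the file's header contains. The `time_bucket` helper mirrors the five query-time buckets from the email; whether each boundary is inclusive or exclusive is an assumption, since the email does not say.

```python
import csv


def read_query_sample(path):
    """Return one dict per query row, keyed by the TSV header.

    Hypothetical helper: 'path' points at the TSV file from the email.
    """
    # newline="" lets the csv module handle line breaks inside quoted cells,
    # so a multi-line query stays in a single row.
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle, delimiter="\t", quotechar='"')
        return list(reader)


def time_bucket(query_time_ms):
    """Map a query time in milliseconds to one of the five buckets.

    Assumption: lower bounds are inclusive and upper bounds exclusive.
    """
    if query_time_ms < 10:
        return "<10ms"
    if query_time_ms < 100:
        return "10ms-100ms"
    if query_time_ms < 1_000:
        return "100ms-1s"
    if query_time_ms < 10_000:
        return "1s-10s"
    return ">=10s"
```

With these settings the reader should yield one row per query (50 if the file matches the description), and duplicates remain in the result because nothing here deduplicates them.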