
Programmatically categorize WDQS queries by potential alternative solution
Closed, Resolved · Public

Description

As a WDQS administrator, I want to be able to categorize WDQS queries into buckets by their potential alternative solutions so that we can prioritize the next steps for scaling WDQS.

Based on the work done in T264194, we want to programmatically categorize a sample of (1000?) WDQS queries into the categories defined in the WDQS user flow document, and then note how many "expensive" queries fall into each category. We learned in T264194 that a significant percentage of users are using the query service for questions that alternative solutions could answer with only a little more work. The hope is that some categories will contain more "expensive" queries, giving us a clear indicator that we should prioritize the alternative solution described in that category.

Some ways we discussed being able to differentiate the queries programmatically:

  • Identify the number of "hops" used in a query.
    • We noticed that many queries only require one "hop", which means they may be more efficiently served by another service, such as a property graph instead of a triple store graph, or perhaps by the API. Often this is the use case where the user is retrieving one or more known entities, but expanding the statement properties and values to include label data.
    • There are also many queries that ask for a specific property value pair and therefore require no "hops". These are likely better served by the new REST API.
    • There are also many queries that are simple identifier lookups, which we could have a separate service or dedicated space for.
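The hop-counting idea above could be sketched roughly as follows. This is a hypothetical heuristic for illustration only (not the actual analysis toolkit): it extracts triple patterns from simple basic graph patterns and counts variables that chain two patterns together, i.e. appear as the object of one pattern and the subject of another.

```python
import re

# Matches a single triple pattern such as "?item wdt:P31 wd:Q5 ."
# Only handles flat, dot-terminated patterns; property paths, OPTIONAL
# blocks, etc. are out of scope for this sketch.
TRIPLE_RE = re.compile(r"(\?\w+|wd:\w+)\s+(\S+)\s+(\?\w+|wd:\w+|\S+)\s*[.;]")

def extract_triples(query: str):
    """Return (subject, predicate, object) tuples found in the query text."""
    return TRIPLE_RE.findall(query)

def count_hops(query: str) -> int:
    """A 'hop' here is a variable that is the object of one triple pattern
    and the subject of another, i.e. one step away from the starting entity."""
    triples = extract_triples(query)
    subjects = {s for s, _, _ in triples if s.startswith("?")}
    objects = {o for _, _, o in triples if o.startswith("?")}
    return len(subjects & objects)
```

Under this heuristic, a lookup of a specific property-value pair (e.g. `?item wdt:P31 wd:Q5`) counts as zero hops, while a query that follows `?item` to `?place` and then to `?country` counts the intermediate joins as hops.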

Update November 5 2020:
We met to discuss and refine the buckets. The notes are here.

For this task, we will specifically categorize the queries into the following buckets:

  • Global (This is all queries that don't fall into another bucket. We will manually review and determine ways to further break this down later.)
  • Linked Data Fragments
  • REST API (first version - no hops)
  • API for an entity plus labels (one hop - star patterns (or two hops if hops are of type:labels))
  • “Sparkly” (one hop plus a search of the entities among which you return the star pattern (could be API or may better be Elastic))
  • Wikidata reconciliation service

We will also tag each query with its cost/time and its user agent (bot or not).
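A sketch of how the bucket assignment and tagging could be wired together, assuming the per-query features (hop count, star-pattern and search flags, identifier-lookup flag) have already been extracted. The function names, feature flags, and bot heuristic are illustrative assumptions, not the actual implementation; the short return values stand in for the buckets listed above.

```python
def assign_bucket(hops: int, is_star_pattern: bool, uses_search: bool,
                  is_identifier_lookup: bool) -> str:
    """Map pre-extracted query features to one of the buckets above.
    Feature extraction itself is out of scope for this sketch."""
    if is_identifier_lookup:
        return "Wikidata reconciliation service"
    if hops == 0:
        return "REST API"
    if hops == 1 and is_star_pattern:
        # With a search over the entities, this is the "Sparkly" bucket;
        # otherwise it is the entity-plus-labels API bucket.
        return "Sparkly" if uses_search else "API for an entity plus labels"
    # Everything else falls into the catch-all bucket for manual review.
    return "Global"

def tag_user_agent(user_agent: str) -> str:
    """Crude bot-vs-human tag based on common user-agent substrings."""
    ua = user_agent.lower()
    markers = ("bot", "crawler", "spider", "python-requests")
    return "bot" if any(m in ua for m in markers) else "human"
```

Each categorized query would then carry a bucket label plus the cost/time and user-agent tags, so that per-bucket counts of "expensive" queries can be aggregated.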

Acceptance criteria:

  • Refine the flow and categories defined in the WDQS user flow document.
    • Add more precision to each bucket's definition
    • Ensure we have a shared understanding of what goes in each bucket
  • Programmatically categorize a subset of (1000?) queries into each bucket
  • Programmatically determine which buckets contain the most "expensive" queries

Event Timeline

Heya - sorry for the late update :(
Our planned deadline was the end of last month, but various issues prevented me from meeting it.
I started the actual work today (I had given it thought earlier but hadn't written any code) and hope to present results before the end of the month.
My apologies for the delay.

Ah! I realize I have not updated that task. The analysis can be found here: https://wikitech.wikimedia.org/wiki/User:Joal/WDQS_Queries_Analysis
@CBogen: I'll let you handle the definition of done, and whether this task should be closed or not :)

Carly, unless you have another reason to keep this ticket open, I think we can close it as we move forward with the plan for an alternative service for the easy queries. By the time we need to return to identifying how to address hard queries on WDQS, we can open a new ticket and copy over anything from this one that is still relevant.

Change 681330 had a related patch set uploaded (by Joal; author: Joal):

[wikidata/query/rdf@master] [WIP] WDQS queries analysis toolkit

https://gerrit.wikimedia.org/r/681330

Change 681330 merged by jenkins-bot:

[wikidata/query/rdf@master] WDQS queries analysis toolkit

https://gerrit.wikimedia.org/r/681330