As a WDQS administrator, I want to be able to categorize WDQS queries into buckets by their potential alternative solutions so that we can prioritize the next steps for scaling WDQS.
Based on the work done in T264194, we want to programmatically categorize a sample of (1000?) WDQS queries into the categories defined in the WDQS user flow document, and then note how many "expensive" queries fall into each category. We learned in T264194 that a significant percentage of users are using the query service for questions that alternative solutions could easily answer with just a little more work. The hope is that some categories will contain more "expensive" queries, giving us a clear indicator that we should prioritize the alternative solution described for that category.
Some ways we discussed to differentiate the queries programmatically:
- Identify the number of "hops" used in a query.
- We noticed that many queries only require one "hop", which means they may be more efficiently served by another service, such as a property graph instead of a triple store graph, or perhaps by the API. This is often the use case where the user retrieves one or more known entities and expands the statement properties and values to include label data.
- There are also many queries that ask for a specific property value pair and therefore require no "hops". These are likely better served by the new REST API.
- There are also many queries that are simple identifier lookups, which we could have a separate service or dedicated space for.
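A minimal sketch of how hop counting might work, assuming a "hop" means one triple pattern's object variable being reused as the subject of another pattern. That definition, the function names, and the naive parsing are all assumptions for illustration: the parser only handles simple `s p o .` patterns with prefixed names, and a real implementation would need a proper SPARQL parser.

```python
import re


def triple_patterns(query: str) -> list[tuple[str, str, str]]:
    """Naively extract (subject, predicate, object) patterns from the query body.

    Assumption: patterns are written as plain `s p o .` with prefixed names,
    with no `;` shorthand, full IRIs, literals, or nested groups.
    """
    match = re.search(r"\{(.*)\}", query, flags=re.DOTALL)
    body = match.group(1) if match else ""
    triples = []
    for statement in body.split("."):
        tokens = statement.split()
        if len(tokens) == 3:
            triples.append((tokens[0], tokens[1], tokens[2]))
    return triples


def count_hops(query: str) -> int:
    """Return the longest chain of object-to-subject variable joins, minus one."""
    triples = triple_patterns(query)

    def chain_length(index: int, visited: frozenset) -> int:
        obj = triples[index][2]
        longest = 1
        if obj.startswith("?"):
            for nxt, (subj, _, _) in enumerate(triples):
                if nxt not in visited and subj == obj:
                    longest = max(longest, 1 + chain_length(nxt, visited | {nxt}))
        return longest

    if not triples:
        return 0
    return max(chain_length(i, frozenset({i})) for i in range(len(triples))) - 1


# A property/value lookup needs no hop; adding a label lookup adds one hop.
print(count_hops("SELECT ?x WHERE { wd:Q42 wdt:P31 ?x }"))  # 0
print(count_hops("SELECT ?x ?l WHERE { wd:Q42 wdt:P31 ?x . ?x rdfs:label ?l }"))  # 1
```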
Update November 5 2020:
We met to discuss and refine the buckets. The notes are here.
For this task, we will specifically categorize the queries into the following buckets:
- Global (This is all queries that don't fall into another bucket. We will manually review and determine ways to further break this down later.)
- Linked Data Fragments
- REST API (first version - no hops)
- API for an entity plus labels (one hop, star patterns; or two hops if the hops are of type:labels)
- “Sparkly” (one hop plus a search for the entities whose star pattern you return; could be an API, or may be better served by Elastic)
- Wikidata reconciliation service
We will also add tags for cost/time and for user agent (bot or not).
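A rough sketch of how the hop-based buckets could be assigned once per-query features are available. The `QueryFeatures` fields and the rule order are hypothetical, and the Linked Data Fragments and reconciliation buckets are left out because their criteria are not spelled out here; anything that matches no rule falls through to Global.

```python
from dataclasses import dataclass


@dataclass
class QueryFeatures:
    """Hypothetical per-query features, however they end up being computed."""
    hops: int                 # total number of hops in the query
    label_hops: int           # how many of those hops only fetch labels
    uses_entity_search: bool  # query first searches for the entities to expand


def assign_bucket(features: QueryFeatures) -> str:
    """Map features to a bucket name; the rules are assumptions, not agreed definitions."""
    if features.hops == 0:
        return "REST API"
    # One hop, or two hops when both hops are label lookups.
    one_hop_like = features.hops == 1 or (
        features.hops == 2 and features.label_hops == 2
    )
    if one_hop_like and features.uses_entity_search:
        return "Sparkly"
    if one_hop_like:
        return "API for an entity plus labels"
    return "Global"


print(assign_bucket(QueryFeatures(hops=1, label_hops=1, uses_entity_search=False)))
```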
Acceptance criteria:
- Refine the flow and categories defined in the WDQS user flow document.
- Add more precision to each bucket's definition.
- Ensure we have a shared understanding of what goes in each bucket.
- Programmatically categorize a subset of (1000?) queries into the buckets.
- Programmatically determine which buckets contain the most "expensive" queries.
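For the last criterion, a small sketch of ranking buckets by how many "expensive" queries they contain. The 10-second threshold, the `(bucket, runtime_ms)` record shape, and the sample values are assumptions for illustration only; the actual definition of "expensive" still needs to be agreed.

```python
from collections import Counter

# Hypothetical cutoff: treat queries running 10 s or longer as "expensive".
EXPENSIVE_MS = 10_000


def expensive_by_bucket(records, threshold_ms=EXPENSIVE_MS):
    """Return (bucket, expensive_count, total_count, share) rows, most expensive first.

    `records` is an iterable of (bucket_name, runtime_ms) pairs.
    """
    totals = Counter()
    expensive = Counter()
    for bucket, runtime_ms in records:
        totals[bucket] += 1
        if runtime_ms >= threshold_ms:
            expensive[bucket] += 1
    rows = [
        (bucket, expensive[bucket], totals[bucket], expensive[bucket] / totals[bucket])
        for bucket in totals
    ]
    # Rank by absolute number of expensive queries, then by share within the bucket.
    return sorted(rows, key=lambda row: (row[1], row[3]), reverse=True)


# Made-up sample data, only to show the call shape.
sample = [
    ("REST API", 120),
    ("REST API", 15_000),
    ("Global", 30_000),
    ("Global", 45_000),
    ("Global", 50),
]
for bucket, n_expensive, n_total, share in expensive_by_bucket(sample):
    print(f"{bucket}: {n_expensive}/{n_total} expensive ({share:.0%})")
```

Ranking by absolute count first keeps the focus on where the load is; ranking by share first would instead highlight small buckets that are disproportionately costly.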