Context
We want to measure query federation in the Wikibase ecosystem to be able to evaluate the success of our initiatives focused on improving it. See details in the main doc.
We are currently looking for a metric that 1) is easy to measure and 2) provides meaningful insights that can drive our work.
One of the approaches we decided to explore is measuring on the Query Service backend.
+: On Wikidata QS, some instrumentation already exists that analyzes query logs and places them into the dataset discovery.processed_external_sparql_query
+: We even see queries that started outside of our UIs
-: The results will only be reliable if we are able to de-duplicate queries in the ecosystem: a single query that fans out into N subqueries across the ecosystem should be counted only once; otherwise, our metric won't be useful. For this reason, it's important to differentiate between the 'original' query that is coordinated by the given backend and a subquery that the backend executes on behalf of another endpoint.
Task
Measure:
- A: # of federated queries that are coordinated on a specific WD/WBQS backend in the last day/week/month
- B: # of non-federated queries that are coordinated on this WD/WBQS backend in the last day/week/month
Our metric is A/(A+B).
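As a minimal sketch of the arithmetic, assuming A and B have already been counted for one backend and one time window (the function name is hypothetical, and returning None for an empty window is an assumption rather than a decided behaviour):

```python
from typing import Optional


def federation_share(a: int, b: int) -> Optional[float]:
    """Return A / (A + B) for one backend and one time window.

    a: number of coordinated federated queries (A)
    b: number of coordinated non-federated queries (B)
    Returns None when no coordinated queries were seen, because the
    ratio is undefined in that case (assumed handling).
    """
    total = a + b
    if total == 0:
        return None
    return a / total
```

For example, 25 federated and 75 non-federated coordinated queries give a share of 0.25. Subqueries executed on behalf of another endpoint are excluded from both counts before this step.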
Acceptance Criteria
- No dashboard required at this point, but we should be able to run this measurement ad hoc for a given WD/WBQS backend.
- For WBQS backend on Cloud, we should be able to get measurements for all running Wikibases separately.
Notes
- To differentiate between 'original' queries and subqueries, we could try looking at the User-Agent or other metadata attached to the query. If it came from another SPARQL endpoint, it is a subquery and should not be counted towards A or B; otherwise, it is an 'original' query.
- The PrPL team is already doing segmentation of queries on WDQS for their purposes (although they classify them according to different criteria). Please get in touch with them to make use of existing groundwork where possible. See details in GitLab cc: @Ifrahkhanyaree_WMDE @AndrewTavis_WMDE
- Some useful information can be found in the recommendations from the Search team here: T391383
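The User-Agent check suggested in these notes could look roughly like the sketch below. The marker substrings are placeholders invented for illustration, not real User-Agent values; the actual values would have to be taken from what other SPARQL endpoints send in practice. Treating a missing User-Agent as an 'original' query is also an assumption.

```python
from typing import Optional

# HYPOTHETICAL markers: replace with substrings actually observed in the
# User-Agent headers sent by other SPARQL endpoints.
ENDPOINT_AGENT_MARKERS = ("example-query-service", "example-sparql-endpoint")


def is_subquery(user_agent: Optional[str]) -> bool:
    """Guess whether a request was sent by another SPARQL endpoint.

    Returns True when the User-Agent contains one of the known endpoint
    markers (case-insensitive). A missing or empty User-Agent is treated
    as an 'original' query (assumed default).
    """
    if not user_agent:
        return False
    agent = user_agent.lower()
    return any(marker in agent for marker in ENDPOINT_AGENT_MARKERS)
```

Requests for which is_subquery returns True would be left out of both A and B.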
Examples
Below are examples of queries that should ideally be considered A or B or none of those on a specific Query Service.
The most important part: we need to recognize whether the query is being COORDINATED on this endpoint or not (= is this the 'original' query or the subquery of another query). We can try using the provenance to answer this question.
A:
- A1: User triggers the query on Query Service UI / Query Builder UI (something we have control over) -> it is accepted by this QS -> it sends out a subquery elsewhere
- A2: User triggers the query on a 3rd-party web-based UI (we don't have control over it, and the User-Agent will likely just be the web browser) -> it is accepted by this QS -> it sends out a subquery elsewhere
- A3: User executes a Python notebook (or any other code) that makes a SPARQL query to this endpoint (we don't have control over the tool, but it does set a User-Agent) -> it is accepted by this QS -> it sends out a subquery elsewhere
B:
- B1: Query Service UI / Query Builder UI -> it is accepted by this QS and fully executed there
- B2: 3rd party UI -> it is accepted by this QS and fully executed there
- B3: Any other code -> it is accepted by this QS and fully executed there
C (queries that are NOT coordinated by this endpoint):
- C1: Query Service UI / Query Builder of a different Query Service -> it gets accepted by the relevant QS -> that QS sends a subquery to our QS -> it is accepted by our QS and fully executed here
- C2: 3rd party UI making a call to another Query Service -> it gets accepted by the relevant QS -> that QS sends a subquery to our QS -> it is accepted by our QS and fully executed here
- C3: Any other code making a call to another Query Service -> it gets accepted by the relevant QS -> that QS sends a subquery to our QS -> it is accepted by our QS and fully executed here
Also variations of C1, C2, C3 when the coordinating endpoint is a non-Wikibase (non-Query Service) SPARQL endpoint, for example QLever (I assume such a variation of C1 is also possible, based on the comment from Tom in the main doc)
Also variations of C1, C2, C3 when the original query is nested, so the subquery then produces a sub-subquery of its own
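The A / B / C split above can be sketched as a small classifier. This is only an illustration under two assumptions: whether the request came from another endpoint is already known (for example from the provenance or User-Agent), and a query counts as federated when its text contains a SERVICE clause that targets a full IRI in angle brackets. The regular expression is a rough heuristic, not a SPARQL parser: it will miss endpoints written as prefixed names or variables, and it can be fooled by comments or string literals. Built-in services written as prefixed names, such as wikibase:label, are deliberately not matched.

```python
import re

# Heuristic: SERVICE [SILENT] <iri>. Prefixed names (e.g. wikibase:label)
# do not start with "<" and are therefore not treated as federation.
SERVICE_IRI = re.compile(r"\bSERVICE\s+(?:SILENT\s+)?<[^>\s]+>", re.IGNORECASE)


def classify(query_text: str, from_other_endpoint: bool) -> str:
    """Return 'A', 'B' or 'C' for one logged query.

    from_other_endpoint: True when the request was sent by another SPARQL
    endpoint, i.e. it is a subquery that is not coordinated here.
    """
    if from_other_endpoint:
        # Covers C1-C3 and the nested case: a subquery stays in C even if
        # it sends out a sub-subquery of its own.
        return "C"
    return "A" if SERVICE_IRI.search(query_text) else "B"
```

Counting the 'A' and 'B' results per backend and time window then gives the inputs for A/(A+B); 'C' results are dropped.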