
🧬Investigate measuring federation on WDQS/WBQS backend
Closed, ResolvedPublic

Description

Context

We want to measure query federation in the Wikibase ecosystem to be able to evaluate the success of our initiatives focused on improving it. See details in the main doc.

We are currently looking for a metric that could be used that is 1) easiest to measure, 2) provides meaningful insights that can drive our work.

One of the approaches we decided to explore is measuring on the Query Service backend.
+: On Wikidata QS, some instrumentation already exists to analyze query logs and place them into the dataset discovery.processed_external_sparql_query
+: We even see queries that started outside of our UIs
-: The results will only be reliable if we can de-duplicate queries across the ecosystem: a single query that fans out into N subqueries should only be counted once, otherwise our metric won't be useful. For this reason, it is important to differentiate between the 'original' query that is coordinated by the given backend and the subqueries that it executes on behalf of another endpoint.

Task

Measure:

  • A: # of federated queries that are coordinated on a specific WD/WBQS backend in the last day/week/month
  • B: # of non-federated queries that are coordinated on this WD/WBQS backend in the last day/week/month

Our metric is A/(A+B).
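The metric itself is plain arithmetic; a minimal sketch (the function name is ours, not part of any existing tooling):

```python
def federation_share(a: int, b: int) -> float:
    """Share of coordinated queries that are federated: A / (A + B).

    a: count of federated queries coordinated on this backend
    b: count of non-federated queries coordinated on this backend
    """
    total = a + b
    if total == 0:
        raise ValueError("no coordinated queries in the period")
    return a / total
```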

Acceptance Criteria

  1. No dashboard is required at this point, but we should be able to run this measurement ad hoc for a given WD/WBQS backend.
  2. For WBQS backend on Cloud, we should be able to get measurements for all running Wikibases separately.

Notes

  1. To differentiate between 'original' queries and subqueries, we could try looking at the User-Agent or other metadata attached to the query. If it came from another SPARQL endpoint, it is a subquery and should not be counted towards A or B; otherwise, it is an 'original' query.
  2. The PrPL team is already segmenting queries on WDQS for their own purposes (although they classify them according to different criteria). Please get in touch with them to make use of existing groundwork where possible. See details in GitLab cc: @Ifrahkhanyaree_WMDE @AndrewTavis_WMDE
  3. Some useful information can be found in the recommendations from the Search team here: T391383
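A rough sketch of the heuristic from note 1. The only User-Agent we have actually observed for an endpoint-originated subquery is the WDQS one that comes up later in this ticket; treating membership in a known-endpoint set as the subquery marker is an assumption, not a confirmed rule:

```python
from typing import Optional

# Only the WDQS UA below was actually observed in the data; the set is
# expected to grow as more endpoints are identified.
KNOWN_ENDPOINT_UAS = {
    "Wikidata Query Service; https://query.wikidata.org/",
}

def classify_provenance(user_agent: Optional[str]) -> str:
    """Return 'subquery' if the request looks like it came from another
    SPARQL endpoint, else 'original' (counted towards A or B)."""
    if user_agent is None:
        return "original"  # browsers / unknown tools: treat as original
    if user_agent in KNOWN_ENDPOINT_UAS:
        return "subquery"
    return "original"
```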

Examples

Below are examples of queries that should ideally be classified as A, B, or neither on a specific Query Service.
The most important part: we need to recognize whether the query is being COORDINATED on this endpoint (i.e., is this the 'original' query or a subquery of another query). We can try using provenance to answer this question.

A:

  • A1: User triggers the query in the Query Service UI / Query Builder UI (something we control) -> it is accepted by this QS -> it sends out a subquery elsewhere
  • A2: User triggers the query in a 3rd-party web-based UI (we don't control it, and the User-Agent will likely just be the web browser) -> it is accepted by this QS -> it sends out a subquery elsewhere
  • A3: User executes a Python notebook (or any other code) that makes a SPARQL query to this endpoint (we don't control the tool, but it does set a User-Agent) -> it is accepted by this QS -> it sends out a subquery elsewhere

B:

  • B1: Query Service UI / Query Builder UI -> it is accepted by this QS and fully executed there
  • B2: 3rd party UI -> it is accepted by this QS and fully executed there
  • B3: Any other code -> it is accepted by this QS and fully executed there

C (queries that are NOT coordinated by this endpoint):

  • C1: Query Service UI / Query Builder of a different Query Service -> the query is accepted by that QS -> that QS sends a subquery to our QS -> it is accepted by our QS and fully executed here
  • C2: 3rd-party UI making a call to another Query Service -> the query is accepted by that QS -> that QS sends a subquery to our QS -> it is accepted by our QS and fully executed here
  • C3: Any other code making a call to another Query Service -> the query is accepted by that QS -> that QS sends a subquery to our QS -> it is accepted by our QS and fully executed here

Also variations of C1, C2, C3 where the coordinating endpoint is a non-Wikibase (non-Query Service) SPARQL endpoint, for example QLever (I assume such a variation of C1 is also possible, based on the comment from Tom in the main doc).
Also variations of C1, C2, C3 where the original query is nested, so the subquery then produces a sub-subquery of its own.

Event Timeline

Progress on this ticket on Wednesday:

Note that not all WDQS queries are included in this dataset. The numbers vary, but as a rule of thumb I always assume I'm only working with 90% of the queries in a given period. From discussions with Joseph from WMF: the process that generates the dataset only includes queries with status codes 500 and 200, and monitoring queries are also removed. If certain data points are missing in q_info after parsing, the query is dropped as well.

I realise that we need to probably read and understand the class org.wikidata.query.rdf.spark.transform.queries.sparql.QueryExtractor from https://gerrit.wikimedia.org/g/wikidata/query/rdf/+/cd930447aa438d63548b9938d19cdf94d6d918ad/rdf-spark-tools/src/main/scala/org/wikidata/query/rdf/spark/transform/queries/sparql/QueryExtractor.scala#25 (and the QueriesProcessor it calls)

It probably doesn't make much sense for people to try to pick this up on Friday/Monday before I'm back, but it wouldn't be a waste of time for others (who are looking for stuff to do) to also read the excellent doc content mentioned above

@Tarrow here are our rough queries from last time:

SELECT q_info.services
FROM discovery.processed_external_sparql_query
WHERE
  year = 2025 AND month = 5 AND day = 20
  AND q_info.services IS NOT NULL
  AND any_keys_match(q_info.services, key -> starts_with(key, 'NODE_URI[http'))
LIMIT 100;

SELECT q_info.services
FROM discovery.processed_external_sparql_query
WHERE
  q_info.services IS NOT NULL
  AND element_at(q_info.services, 'NODE_URI[wikibase:label]') IS NULL
  AND element_at(q_info.services, 'NODE_URI[wikibase:mwapi]') IS NULL
  AND element_at(q_info.services, 'NODE_URI[bd:slice]') IS NULL
  AND element_at(q_info.services, 'NODE_URI[wikibase:box]') IS NULL
LIMIT 100;

SELECT COUNT(*)
FROM event.wdqs_external_sparql_query
WHERE
  year = 2025 AND month = 4 AND day = 20
  AND map_key_exists(http.request_headers, 'origin')
  AND starts_with(element_at(http.request_headers, 'origin'), 'https://query.wikidata.org')

I continued exploring this data with @dena. We are still trying to grasp the extent of what the queries that look federated are actually doing. We're going to document the details in the following internal doc: https://docs.google.com/document/d/1L7m4q2F-QYXAuwBLxiWXkRAG2RAulW3A2F2ym9S3HBA/edit?tab=t.0

We used:

SELECT q_info.services, query
FROM discovery.processed_external_sparql_query
WHERE
  year = 2025 AND month = 5 AND day = 20
  AND q_info.services IS NOT NULL
  AND any_keys_match(q_info.services, key -> (NOT starts_with(key, 'NODE_URI[https://query.wikidata.org/sparql')) AND starts_with(key, 'NODE_URI[http'))
LIMIT 100;

to try and filter out the huge number of queries we were seeing with a service of NODE_URI[https://query.wikidata.org/sparql

We're uncertain if these come from:

  • bad user queries that would return a 4xx and that we wrongly parse to look like this
  • good user queries containing valid SPARQL that gets parsed to this as some kind of internal federation

The verbatim SPARQL queries as submitted by users do not seem to be directly available to us (only a processed form). But in at least one example, all that was missing was re-adding the prefix/service URIs plus the right formatting/insertion of newlines (needed because of comments).

I also ran a federated query myself so that we can inspect the data more confidently, since the input is known.

I was able to confirm what I observed last week with a query that I executed myself: what we get in the query field is stripped of PREFIX and SERVICE URIs, and since newlines get lost, some reformatting is needed, at least when the query includes comments. I think it is possible to programmatically restore an executable query from all the processed data that is available, but it is not at all trivial, and we might be better off changing the code that generates this data in the first place.

side note: server timezone seems to be set to UTC.

Query that I ran last friday:

PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX mmd: <http://data.mimotext.uni-trier.de/entity/>
PREFIX mmdt: <http://data.mimotext.uni-trier.de/prop/direct/>
PREFIX wikibase: <http://wikiba.se/ontology#>
PREFIX bd: <http://www.bigdata.com/rdf#>

SELECT DISTINCT ?book ?bookLabel ?author ?authorLabel
WHERE {
    SERVICE <https://query.wikidata.org/sparql> {
        ?author wdt:P106 wd:Q33999.  # Author was an actor
        ?book wdt:P50 ?author;       # Books by the author
              wdt:P12047 ?mimoID.    # With a MiMoText ID
    }

    # Generate the MiMoText entity URI
    BIND(IRI(CONCAT(STR(mmd:), ?mimoID)) AS ?mimoEntity)

    SERVICE <https://query.mimotext.uni-trier.de/proxy/wdqs/bigdata/namespace/wdq/sparql> {
        ?mimoEntity mmdt:P36 mmd:Q3126.  # About travel
    }

    SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}

queried via:

SELECT q_info.services
FROM discovery.processed_external_sparql_query
WHERE
  year = 2025 AND month = 6 AND day = 6
  AND q_info.services IS NOT NULL
  AND any_keys_match(q_info.services, key -> starts_with(key, 'NODE_URI[https://query.mimotext.uni-trier.de/'))
LIMIT 100;

query field in analytics data:

PREFIX wd: PREFIX wdt: PREFIX mmd: PREFIX mmdt: PREFIX wikibase: PREFIX bd: SELECT DISTINCT ?book ?bookLabel ?author ?authorLabel WHERE { SERVICE { ?author wdt:P106 wd:Q33999. # Author was an actor ?book wdt:P50 ?author; # Books by the author wdt:P12047 ?mimoID. # With a MiMoText ID } # Generate the MiMoText entity URI BIND(IRI(CONCAT(STR(mmd:), ?mimoID)) AS ?mimoEntity) SERVICE { ?mimoEntity mmdt:P36 mmd:Q3126. # About travel } SERVICE wikibase:label { bd:serviceParam wikibase:language "en,en". } }

or after manually adding some newlines:

PREFIX wd: 
PREFIX wdt: 
PREFIX mmd: 
PREFIX mmdt: 
PREFIX wikibase: 
PREFIX bd: 

SELECT DISTINCT ?book ?bookLabel ?author ?authorLabel 
WHERE { 
SERVICE { 
?author wdt:P106 wd:Q33999. # Author was an actor 
?book wdt:P50 ?author; # Books by the author 
wdt:P12047 ?mimoID. # With a MiMoText ID 
} 

# Generate the MiMoText entity URI 
BIND(IRI(CONCAT(STR(mmd:), ?mimoID)) AS ?mimoEntity) 

SERVICE { 
?mimoEntity mmdt:P36 mmd:Q3126. # About travel 
} 

SERVICE wikibase:label { bd:serviceParam wikibase:language "en,en". } 

}

To be able to spot differences in the data when the Wikidata QS is NOT the coordinating node, I ran this query on https://query.mimotext.uni-trier.de/ - will inspect the data tomorrow:

PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX mmd: <http://data.mimotext.uni-trier.de/entity/>
PREFIX mmdt: <http://data.mimotext.uni-trier.de/prop/direct/>
PREFIX wikibase: <http://wikiba.se/ontology#>
PREFIX bd: <http://www.bigdata.com/rdf#>

SELECT DISTINCT ?book ?bookLabel ?author ?authorLabel
WHERE {
    SERVICE <https://query.wikidata.org/sparql> {
        ?book wdt:P50 ?author;      # Books by the author
              wdt:P12047 ?mimoID.  # With a MiMoText ID
        ?author wdt:P106 wd:Q33999. # Author was an actor
    }

    # Generate the MiMoText entity URI
    BIND(IRI(CONCAT(STR(mmd:), ?mimoID)) AS ?mimoEntity)

    SERVICE <https://query.mimotext.uni-trier.de/proxy/wdqs/bigdata/namespace/wdq/sparql> {
        ?mimoEntity mmdt:P36 mmd:Q3126.  # About travel
    }

    SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}

I'll dig a bit more this afternoon, but my initial explorations using Jupyter and SQL suggest that these newlines (at least) may be stripped only by Superset and not in the underlying Hive data.

I've done quite some staring, principally playing with Jupyter and Spark, and I've come up with a nice first number, which is the total (A+B+C) for May: 466082821

I'm still really just getting my head around SparkSQL. Currently playing games like:

SELECT *
FROM discovery.processed_external_sparql_query
WHERE
    year = 2025
    AND month = 5
    AND array_contains(map_keys(q_info.services), key -> key LIKE 'NODE_URI[http%')
LIMIT 100;

which unsurprisingly isn't working.
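For reference, Spark SQL's array_contains takes a plain value, not a lambda, which is likely why the query fails; the higher-order exists function would be the usual replacement there. The intended predicate, mirrored in plain Python over a parsed services map (an eyeballing aid, not part of the pipeline):

```python
def has_external_service(services: dict) -> bool:
    """True if any service key names an explicit NODE_URI[http...]
    endpoint, mirroring the intent of the broken filter above.

    LIKE 'NODE_URI[http%' is a prefix match, so startswith is the
    Python equivalent.
    """
    return any(k.startswith("NODE_URI[http") for k in services)
```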

I've held back from putting a draft PR of the ipynb up on GitLab until I'm certain I've got a safe workflow for removing it from the stats server so that it's totally free of any PII; hence pasting the snippet.

@Tarrow I was pulling the same metric yesterday (for unrelated reasons) and can confirm I have exactly the same final number as you. Afraid that's all I can help out w/ right now

Here's a list of queries we ran:

federation_count_all_query = """
SELECT count(*) AS total_processed_queries

FROM discovery.processed_external_sparql_query
    

WHERE
    year = 2025
    AND month = 5
"""
federation_probably_federated_from_here_sample_query = """
SELECT q_info.services

FROM discovery.processed_external_sparql_query
WHERE
    year = 2025
    AND month = 5
    AND day = 6
    AND q_info.services IS NOT NULL
    AND size(
        map_filter(
            q_info.services,
            (k, v) ->
                like(k, 'NODE_URI[http%')
                AND k != 'NODE_URI[https://query.wikidata.org/sparql]'
                AND k != 'NODE_URI[https://query-scholarly.wikidata.org/sparql]'
                AND k != 'NODE_URI[http://query-scholarly.wikidata.org/sparql]'
                AND k != 'NODE_URI[https://query.wikidata.org/bigdata/namespace/categories/sparql]'
                AND k != 'NODE_URI[http://query-scholarly.wikidata.org/bigdata/namespace/wdq/sparql]'
            )
    ) > 0
LIMIT 100;
"""
federation_count_probably_federated_from_here_query = """
SELECT count(*) AS total_federated_from_here_queries

FROM discovery.processed_external_sparql_query
WHERE
    year = 2025
    AND month = 5
    AND q_info.services IS NOT NULL
    AND size(
        map_filter(
            q_info.services,
            (k, v) ->
                like(k, 'NODE_URI[http%')
                AND k != 'NODE_URI[https://query.wikidata.org/sparql]'
                AND k != 'NODE_URI[https://query-scholarly.wikidata.org/sparql]'
                AND k != 'NODE_URI[http://query-scholarly.wikidata.org/sparql]'
                AND k != 'NODE_URI[https://query.wikidata.org/bigdata/namespace/categories/sparql]'
                AND k != 'NODE_URI[http://query-scholarly.wikidata.org/bigdata/namespace/wdq/sparql]'
            )
    ) > 0
    
"""
federation_count_distinct_services_from_here_query = """
SELECT COUNT(DISTINCT map_keys(q_info.services))

FROM discovery.processed_external_sparql_query
WHERE
    year = 2025
    AND month = 5
    AND day = 6
    AND q_info.services IS NOT NULL
"""
federation_distinct_services_from_here_query = """
SELECT DISTINCT map_keys(q_info.services)

FROM discovery.processed_external_sparql_query
WHERE
    year = 2025
    AND month = 5
    AND day = 6
    AND q_info.services IS NOT NULL
"""

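The map_filter predicate in the queries above amounts to: at least one NODE_URI[http...] service key that is not one of this backend's own endpoints. A plain-Python mirror of that predicate, useful for checking sample rows by hand (the endpoint list is copied from the queries above):

```python
# This backend's own endpoints, copied from the exclusion list in the
# Spark queries above; hits on these are self/internal federation.
OWN_ENDPOINTS = {
    "NODE_URI[https://query.wikidata.org/sparql]",
    "NODE_URI[https://query-scholarly.wikidata.org/sparql]",
    "NODE_URI[http://query-scholarly.wikidata.org/sparql]",
    "NODE_URI[https://query.wikidata.org/bigdata/namespace/categories/sparql]",
    "NODE_URI[http://query-scholarly.wikidata.org/bigdata/namespace/wdq/sparql]",
}

def probably_federated_from_here(services: dict) -> bool:
    """Mirror of the map_filter predicate: any explicit http(s) service
    endpoint that is not one of our own."""
    return any(
        k.startswith("NODE_URI[http") and k not in OWN_ENDPOINTS
        for k in services
    )
```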
Didn't manage to get any further with trying to find out what type of queries use the https://query.wikidata.org/sparql service on Friday.

Here is the query that lists the occurrences of services, though:

# occurrences of each query service from here

spark.sql("""
SELECT query_service, format_number( COUNT(*), 0 ) AS occurrences

FROM (
    SELECT explode( map_keys( q_info.services ) ) AS query_service
    FROM discovery.processed_external_sparql_query    
    WHERE year = 2025
      AND month = 5
      AND q_info.services IS NOT NULL
)
GROUP BY query_service
ORDER BY COUNT(*) DESC
""").show(truncate=False, n=100)

The full notebook can be found on stat1010 at /home/ollieshotton/wmde-analytics-git/tasks/wikibase_cloud/2025/federated_queries_investigation/T395044_federated_queries_exploration.ipynb

@dena and I had a look at the UA topic. We discovered that, at least for the example query from above that we ran, the User-Agent for the query coming in from MiMoText was reported as Wikidata Query Service; https://query.wikidata.org/. We also saw other queries (details redacted) which, at a guess, are probably user queries originating from Wikibase Cloud.

I suggest we spin out another ticket (T397052) to change the UA on wikibase cloud (and another to recommend changing it in suite) so that in the future this traffic is easier to differentiate.

While we drive that ticket forward, it doesn't help with historic data. I had the idea yesterday that perhaps by taking the UA in combination with the IP ranges that requests originate from, we could identify:

  • those that aren't internal
  • those that are coming from Wikibase Cloud
  • those that are coming from other places (maybe even we can figure out the details of these)
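A sketch of that UA-plus-IP classification. The CIDR below is purely illustrative, not a real range; an actual check would use Google's published europe-west3 ranges, and the bucket names are ours:

```python
import ipaddress
from typing import Optional

# The one UA we have actually observed for endpoint-originated traffic.
QS_UA = "Wikidata Query Service; https://query.wikidata.org/"

# Illustrative CIDR only; replace with the real europe-west3 range list.
WBCLOUD_RANGES = [ipaddress.ip_network("34.89.128.0/17")]

def classify_source(user_agent: Optional[str], ip: str) -> str:
    """Bucket a request by UA and originating IP range."""
    addr = ipaddress.ip_address(ip)
    from_cloud = any(addr in net for net in WBCLOUD_RANGES)
    if user_agent == QS_UA and from_cloud:
        return "wikibase-cloud"
    if user_agent == QS_UA:
        return "other-query-service"
    return "external-user"
```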

Well, it only took a whole week to make some stab at this.
We got a May-wide count of those with the query-service-looking User-Agent: 49497
Using the events table, not the processed query table, with this query:

spark.sql("""
SELECT COUNT(1)
FROM event.wdqs_external_sparql_query
WHERE
    year = 2025
    AND month = 5
    AND http.request_headers['user-agent'] = 'Wikidata Query Service; https://query.wikidata.org/'
""").show(truncate=False)

Of those, we determined which were in the Google europe-west3 IP ranges, and therefore probably come from Wikibase Cloud, and got a count of: 1643

No progress made yet on identifying "internal" requests

Reassigned to @Ollie.Shotton_WMDE for clarity while I'm off; have fun! As he knows, all the work is on stat1010 in my home directory

I will try to summarize my understanding of what has happened within the boundaries of this ticket so far, because we had some learnings, and our understanding of what's possible and necessary has evolved since we opened it. We can then gracefully close it and create new tickets for the investigations that are still relevant.

One problem we encountered is that queries that run on the main Wikidata graph and on the Scholarly graph are mixed in the same data pool. At first, this was confusing - we thought there was a lot of self-federation happening on Wikidata. Once we are able to tell them apart, we will have much better visibility into federation in the ecosystem. This is being taken care of in a separate ticket - T397528

In order to meaningfully estimate the volume of federated queries that users execute in the ecosystem, we would first need to differentiate between the 'original' queries that users executed vs. subqueries of federated queries that started elsewhere. This is the same as being able to recognize whether it arrived from another Blazegraph endpoint (=subquery) or not. We introduced a change that would allow us to group queries by user agent and can now recognize queries coming from Query Service UI (T396002) and Wikibase Cloud. Ideally, the more we are able to cluster provenance of a query, the more accurately we will be able to categorize a query into original/subquery/unknown. Next step here: for each graph separately (main / scholarly), use the info available in user agents to confidently categorize queries into original / subquery / unknown (by interpreting its provenance) and decide whether we still need to try to reduce the 'unknown' or the proportion is good enough. A separate ticket was created for this: T399220.

There was an initiative to guess queries coming from Wikibase Cloud based on IP to get historical data, but this historical data is not relevant to us anymore.

Besides looking at executed queries, we were also able to look at outgoing federation by analyzing how users use the SERVICE syntax in their queries. One of the core learnings was that people try to federate to endpoints that are not on the Allowlist. Addressing this will be one of our goals for Q3-2025. To measure our impact, we need to confidently categorize uses of the SERVICE syntax into invoking helpers, federating to endpoints on the Allowlist, and federating to endpoints outside of it. To be able to do this, we need to address several shortcomings in the data:

  • We don't see which queries or subqueries executed successfully or failed
  • When people use prefixes in their SERVICE syntax, it shows up the same way in the data - we don't see the actual URL behind it
  • We need to find a way to recognize a helper from an endpoint outside of the Allowlist

Separate tickets will be created to address these problems.