
Compare the results of sparql queries between the fullgraph and the subgraphs
Closed, Resolved · Public · 8 Estimated Story Points

Description

Using a tool that compares the results of the same SPARQL query run against two endpoints, we should evaluate how many queries might "break" when run against the Wikidata main graph instead of the full graph.

The comparison will use T351819 and be based on the sets of SPARQL queries extracted in T349512.
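
A minimal sketch of the comparison idea, assuming SPARQLWrapper and placeholder endpoint URLs (the actual comparison tooling is tracked in T351819):

```python
from SPARQLWrapper import SPARQLWrapper, JSON

# Placeholder endpoints: the full graph vs. the candidate main subgraph.
FULL_GRAPH = "https://query-full.example.org/sparql"
MAIN_GRAPH = "https://query-main.example.org/sparql"

def run(endpoint: str, query: str) -> set:
    # This sketch handles SELECT results only; other query forms
    # (ASK, CONSTRUCT, DESCRIBE) need form-specific handling.
    client = SPARQLWrapper(endpoint)
    client.setQuery(query)
    client.setReturnFormat(JSON)
    rows = client.query().convert()["results"]["bindings"]
    # Normalize rows to a hashable form so comparison ignores row order.
    # Note: a set collapses duplicate rows; a multiset would be stricter.
    return {tuple(sorted((var, b["value"]) for var, b in row.items()))
            for row in rows}

def compare(query: str) -> dict:
    full, main = run(FULL_GRAPH, query), run(MAIN_GRAPH, query)
    return {
        "identical": full == main,
        "only_in_full": len(full - main),
        "only_in_main": len(main - full),
    }
```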

We should attempt to identify the reasons for the differences and whether they are related or unrelated to the split (a heuristic sketch for flagging these follows the list):

  • query features dependent on the internal ordering of the Blazegraph B-trees (LIMIT X OFFSET Y, bd:slice)
  • use of external datasets (federation, mwapi)
  • unicode collation issues (T233204)
  • ...add more when discovered
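
A heuristic sketch of how such split-unrelated causes could be flagged from the query text alone (the regexes and feature names are illustrative, not the actual analysis code):

```python
import re

# Illustrative patterns for flagging result differences that are likely
# unrelated to the graph split itself.
FEATURES = {
    # LIMIT/OFFSET depends on Blazegraph's internal B-tree order
    # unless an explicit ORDER BY is present.
    "unordered_slice": re.compile(r"\bLIMIT\s+\d+(\s+OFFSET\s+\d+)?", re.I),
    "bd_slice": re.compile(r"\bbd:slice\b", re.I),
    # External datasets: SERVICE federation and the mwapi service.
    "federation": re.compile(r"\bSERVICE\s*<", re.I),
    "mwapi": re.compile(r"wikibase:mwapi", re.I),
}

def flag_features(query: str) -> set[str]:
    flags = {name for name, pat in FEATURES.items() if pat.search(query)}
    # LIMIT with an explicit ORDER BY is deterministic, so drop the flag.
    if "unordered_slice" in flags and re.search(r"\bORDER\s+BY\b", query, re.I):
        flags.discard("unordered_slice")
    return flags
```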

For the queries whose results vary because of the split, we should attempt to evaluate whether targeting scholarly articles is intentional (e.g. statistical queries with GROUP BY counts), and possibly identify the tools and their maintainers so they can be contacted for feedback on the project.
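
One way such a check could look, as a sketch (Q13442814 is the Wikidata "scholarly article" class; the scholarly_qids lookup set and the row format are assumptions):

```python
# Sketch of a "true positive" check: does the query mention the scholarly
# article class (Q13442814) directly, or do its results contain entities
# that belong to the scholarly-article subgraph?
SCHOLARLY_CLASS = "Q13442814"  # "scholarly article" on Wikidata

def targets_scholarly_articles(query, result_rows, scholarly_qids):
    if SCHOLARLY_CLASS in query:
        return True
    # scholarly_qids is assumed to be a precomputed set of item IDs
    # belonging to the scholarly-article subgraph.
    for row in result_rows:              # rows as {variable: value} dicts
        for value in row.values():
            qid = value.rsplit("/", 1)[-1]  # strip the entity URI prefix
            if qid in scholarly_qids:
                return True
    return False
```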

AC:

  • a report is available showing how the current split is going to affect queries once they run on the Wikidata main subgraph
  • a list of affected tools/scripts (when identifiable) that could possibly be contacted

Details

Title: Draft: early draft of a comparison analysis
Reference: repos/search-platform/notebooks!1
Author: dcausse
Source Branch: T355040_sholarly_articles_split_results_comparison_analysis
Dest Branch: main

Event Timeline

Gehel set the point value for this task to 8.

Quick report on the progress being made:

  • Our query logs do not contain only SELECT queries; the SPARQL client used to collect the data has to be adapted to support the other query forms (ASK, CONSTRUCT, DESCRIBE) (https://gerrit.wikimedia.org/r/c/wikidata/query/rdf/+/991622)
  • Some queries fail due to response size; I bumped the limit to 16M but am still seeing problems. I might stop here and simply tag & ignore such massive queries moving forward
  • Getting very bad numbers from Listeria and MixNMatch (34% and 17% identical results respectively); their average result sizes are 1.6k and 8k, which might partly explain why getting identical results is difficult. More investigation is needed to understand the cause...
  • Getting pretty mediocre numbers for WikidataIntegrator at 88%, with a very small average result size of 8; more investigation needed
  • Pywikibot and SPARQLWrapper both look good at 99.4% identical results (a sketch of how these rates could be computed follows this list)
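
For transparency on how such numbers could be derived, here is a minimal sketch, assuming each comparison record is a (user_agent, query_form, identical) tuple; the record layout is hypothetical, not the actual notebook code. It also shows why ASK/CONSTRUCT/DESCRIBE need form-specific comparison, which motivated the patch linked above.

```python
from collections import defaultdict

def results_identical(form, full, main):
    # Comparison has to be form-aware: ASK yields a single boolean,
    # CONSTRUCT/DESCRIBE yield RDF triples, SELECT yields bindings.
    if form == "ASK":
        return full == main
    if form in ("CONSTRUCT", "DESCRIBE"):
        return set(full) == set(main)   # order-insensitive triple comparison
    return full == main                 # SELECT: pre-normalized binding sets

def identical_rate_per_tool(records):
    # records: iterable of (user_agent, query_form, identical: bool) --
    # a hypothetical layout, not the actual log schema.
    totals, same = defaultdict(int), defaultdict(int)
    for ua, _form, identical in records:
        totals[ua] += 1
        same[ua] += identical
    return {ua: 100.0 * same[ua] / totals[ua] for ua in totals}
```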

WIP: https://people.wikimedia.org/~dcausse/T355040_EARLY_DRAFT_wdqs_query_results_analysis.html (UA redacted for now)

TL;DR:

  • added support for identifying true positives (queries with a scientific article in the SPARQL query or in the results)
  • MixNMatch has a very high number of true positives, and thus needs more qualitative analysis (ticket TBD)
  • Listeria does not have any true positives but shows bad outcomes (81% identical in the best case, 68% in the worst case); it needs more qualitative analysis too

WIP:

  • included the new 100k-query sample named QUERY-Q4 from T349512 (a random sample that is representative of query length and runtime)
  • the % of affected queries (deduplicated) per tool, computed on the QUERY-Q4 sample mentioned above:
    [chart: % of affected queries per tool — image.png (470×771 px, 33 KB)]

The above graph should be taken with a grain of salt, as the number of queries per data point varies a lot (86 queries for Listeria vs. 85k for the random sample); see the weighting sketch below. These numbers are still being reviewed, so no conclusions should be drawn yet, but we do not seem to obtain the same numbers that were originally found in Wikidata_Subgraph_Query_Analysis, where 2.5% of the total query count was identified as requiring scholarly articles.
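
As a reminder of why the per-tool percentages cannot simply be averaged, the overall affected share is the query-count-weighted mean. A toy illustration reusing the bucket sizes quoted above (the % affected values are placeholders, not measured numbers):

```python
# Toy illustration: the overall affected share is dominated by the large
# buckets. Query counts reuse the figures above; the % affected values
# are placeholders.
buckets = {
    "Listeria": (86, 19.0),     # (query count, % affected) -- placeholder %
    "random":   (85_000, 2.0),  # placeholder %
}
total = sum(count for count, _ in buckets.values())
overall = sum(count * pct for count, pct in buckets.values()) / total
print(f"overall affected: {overall:.2f}%")  # ≈ 2.02%, driven by "random"
```
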
A more qualitative analysis is in progress:

  • analysis of the user agents to understand which use cases are mainly affected; preliminary results show, for instance, that a single UA is the cause of 50% of the affected queries
  • extraction of some SPARQL queries to start evaluating how federation could be applied/tested (a federation sketch follows)
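
A sketch of what "applying federation" might look like: a main-graph query that pulls scholarly-article data from the split-off subgraph through a SERVICE clause. The subgraph endpoint URL and the example query are illustrative, not production values:

```python
# Sketch of one federation option: keep the query on the main graph and
# fetch scholarly-article triples from the split-off subgraph via SERVICE.
FEDERATED_QUERY = """
SELECT ?article ?title WHERE {
  ?author wdt:P31 wd:Q5 ;                     # a human, on the main graph
          rdfs:label "Ada Lovelace"@en .
  SERVICE <https://query-scholarly.example.org/sparql> {
    ?article wdt:P50 ?author ;                # author (P50)
             wdt:P1476 ?title .               # title (P1476)
  }
}
"""
```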