
Compare the results of sparql queries between the fullgraph and the subgraphs
Closed, Resolved · Public · 8 Estimated Story Points

Description

Using a tool that compares the results of the same SPARQL query run against two endpoints, we should evaluate how many queries might "break" when run against the Wikidata main graph instead of the full graph.

The comparison will use T351819 and be based on the sets of SPARQL queries extracted in T349512.
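
A minimal sketch of the comparison idea, assuming SPARQLWrapper and placeholder endpoint URLs (the actual comparison tooling is tracked in T351819):

```python
from SPARQLWrapper import SPARQLWrapper, JSON

# Placeholder endpoints: the full graph vs. the candidate main subgraph.
FULL_GRAPH = "https://query-full.example.org/sparql"
MAIN_GRAPH = "https://query-main.example.org/sparql"

def run(endpoint: str, query: str) -> set:
    # This sketch handles SELECT results only; other query forms
    # (ASK, CONSTRUCT, DESCRIBE) need form-specific handling.
    client = SPARQLWrapper(endpoint)
    client.setQuery(query)
    client.setReturnFormat(JSON)
    rows = client.query().convert()["results"]["bindings"]
    # Normalize rows to a hashable form so comparison ignores row order.
    # Note: a set collapses duplicate rows; a multiset would be stricter.
    return {tuple(sorted((var, b["value"]) for var, b in row.items()))
            for row in rows}

def compare(query: str) -> dict:
    full, main = run(FULL_GRAPH, query), run(MAIN_GRAPH, query)
    return {
        "identical": full == main,
        "only_in_full": len(full - main),
        "only_in_main": len(main - full),
    }
```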

We should attempt to identify the reasons for the differences and whether they are related or unrelated to the split (a heuristic sketch for flagging these follows the list):

  • query features dependent on the internal ordering of the Blazegraph B-trees (LIMIT X OFFSET Y, bd:slice)
  • use of external datasets (federation, mwapi)
  • unicode collation issues (T233204)
  • ...add more when discovered
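
A heuristic sketch of how such split-unrelated causes could be flagged from the query text alone (the regexes and feature names are illustrative, not the actual analysis code):

```python
import re

# Illustrative patterns for flagging result differences that are likely
# unrelated to the graph split itself.
FEATURES = {
    # LIMIT/OFFSET depends on Blazegraph's internal B-tree order
    # unless an explicit ORDER BY is present.
    "unordered_slice": re.compile(r"\bLIMIT\s+\d+(\s+OFFSET\s+\d+)?", re.I),
    "bd_slice": re.compile(r"\bbd:slice\b", re.I),
    # External datasets: SERVICE federation and the mwapi service.
    "federation": re.compile(r"\bSERVICE\s*<", re.I),
    "mwapi": re.compile(r"wikibase:mwapi", re.I),
}

def flag_features(query: str) -> set[str]:
    flags = {name for name, pat in FEATURES.items() if pat.search(query)}
    # LIMIT with an explicit ORDER BY is deterministic, so drop the flag.
    if "unordered_slice" in flags and re.search(r"\bORDER\s+BY\b", query, re.I):
        flags.discard("unordered_slice")
    return flags
```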

For the queries whose results vary because of the split, we should attempt to evaluate whether targeting scholarly articles is intentional (e.g. statistical queries with GROUP BY counts), and possibly identify the tools and their maintainers so they can be contacted for feedback on the project.
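
One way such a check could look, as a sketch (Q13442814 is the Wikidata "scholarly article" class; the scholarly_qids lookup set and the row format are assumptions):

```python
# Sketch of a "true positive" check: does the query mention the scholarly
# article class (Q13442814) directly, or do its results contain entities
# that belong to the scholarly-article subgraph?
SCHOLARLY_CLASS = "Q13442814"  # "scholarly article" on Wikidata

def targets_scholarly_articles(query, result_rows, scholarly_qids):
    if SCHOLARLY_CLASS in query:
        return True
    # scholarly_qids is assumed to be a precomputed set of item IDs
    # belonging to the scholarly-article subgraph.
    for row in result_rows:              # rows as {variable: value} dicts
        for value in row.values():
            qid = value.rsplit("/", 1)[-1]  # strip the entity URI prefix
            if qid in scholarly_qids:
                return True
    return False
```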

AC:

  • a report is available showing how the current split is going to affect queries once they run on the Wikidata main subgraph
  • a list of affected tools/scripts (when identifiable) that could possibly be contacted

Details

Title: Draft: early draft of a comparison analysis
Reference: repos/search-platform/notebooks!1
Author: dcausse
Source Branch: T355040_sholarly_articles_split_results_comparison_analysis
Dest Branch: main

Event Timeline

Gehel set the point value for this task to 8.

Quick report on the progress being made:

  • Our query logs do not contain only SELECT queries; the SPARQL client used to collect the data has to be adapted to support the other query forms (ASK, CONSTRUCT, DESCRIBE) (https://gerrit.wikimedia.org/r/c/wikidata/query/rdf/+/991622)
  • Some queries fail due to response size; I bumped the limit to 16M but am still seeing problems. I might stop here and simply tag & ignore such massive queries moving forward
  • Getting very bad numbers from Listeria and MixNMatch (34% and 17% identical results respectively); their average result sizes are 1.6k and 8k, which might partly explain why getting identical results is difficult. More investigation is needed to understand the cause...
  • Getting pretty mediocre numbers for WikidataIntegrator at 88%, with a very small average result size of 8; more investigation needed
  • Pywikibot and SPARQLWrapper both look good at 99.4% identical results (a sketch of how these rates could be computed follows this list)
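
For transparency on how such numbers could be derived, here is a minimal sketch, assuming each comparison record is a (user_agent, query_form, identical) tuple; the record layout is hypothetical, not the actual notebook code. It also shows why ASK/CONSTRUCT/DESCRIBE need form-specific comparison, which motivated the patch linked above.

```python
from collections import defaultdict

def results_identical(form, full, main):
    # Comparison has to be form-aware: ASK yields a single boolean,
    # CONSTRUCT/DESCRIBE yield RDF triples, SELECT yields bindings.
    if form == "ASK":
        return full == main
    if form in ("CONSTRUCT", "DESCRIBE"):
        return set(full) == set(main)   # order-insensitive triple comparison
    return full == main                 # SELECT: pre-normalized binding sets

def identical_rate_per_tool(records):
    # records: iterable of (user_agent, query_form, identical: bool) --
    # a hypothetical layout, not the actual log schema.
    totals, same = defaultdict(int), defaultdict(int)
    for ua, _form, identical in records:
        totals[ua] += 1
        same[ua] += identical
    return {ua: 100.0 * same[ua] / totals[ua] for ua in totals}
```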

WIP: https://people.wikimedia.org/~dcausse/T355040_EARLY_DRAFT_wdqs_query_results_analysis.html (UA redacted for now)

TL;DR:

  • added support for identifying true positives (queries with a scientific article in the SPARQL query or in the results)
  • MixNMatch has a very high number of true positives, and thus needs more qualitative analysis (ticket TBD)
  • Listeria does not have any true positives but shows bad outcomes (81% identical in the best case, 68% in the worst case); it needs more qualitative analysis too

WIP:

  • included the new 100k-query sample named QUERY-Q4 from T349512 (a random sample that is representative of query length and runtime)
  • the % of affected queries (deduplicated) per tool, computed on the QUERY-Q4 sample mentioned above:
    [chart: % of affected queries per tool — image.png (470×771 px, 33 KB)]

The above graph should be taken with a grain of salt, as the number of queries per data point varies a lot (86 queries for Listeria vs. 85k for the random sample); see the weighting sketch below. These numbers are still being reviewed, so no conclusions should be drawn yet, but we do not seem to obtain the same numbers that were originally found in Wikidata_Subgraph_Query_Analysis, where 2.5% of the total query count was identified as requiring scholarly articles.
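
As a reminder of why the per-tool percentages cannot simply be averaged, the overall affected share is the query-count-weighted mean. A toy illustration reusing the bucket sizes quoted above (the % affected values are placeholders, not measured numbers):

```python
# Toy illustration: the overall affected share is dominated by the large
# buckets. Query counts reuse the figures above; the % affected values
# are placeholders.
buckets = {
    "Listeria": (86, 19.0),     # (query count, % affected) -- placeholder %
    "random":   (85_000, 2.0),  # placeholder %
}
total = sum(count for count, _ in buckets.values())
overall = sum(count * pct for count, pct in buckets.values()) / total
print(f"overall affected: {overall:.2f}%")  # ≈ 2.02%, driven by "random"
```
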
A more qualitative analysis is in progress:

  • analysis of the user agents to understand which use cases are mainly affected; preliminary results show, for instance, that a single UA is the cause of 50% of the affected queries
  • extraction of some SPARQL queries to start evaluating how federation could be applied/tested (a federation sketch follows)
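
A sketch of what "applying federation" might look like: a main-graph query that pulls scholarly-article data from the split-off subgraph through a SERVICE clause. The subgraph endpoint URL and the example query are illustrative, not production values:

```python
# Sketch of one federation option: keep the query on the main graph and
# fetch scholarly-article triples from the split-off subgraph via SERVICE.
FEDERATED_QUERY = """
SELECT ?article ?title WHERE {
  ?author wdt:P31 wd:Q5 ;                     # a human, on the main graph
          rdfs:label "Ada Lovelace"@en .
  SERVICE <https://query-scholarly.example.org/sparql> {
    ?article wdt:P50 ?author ;                # author (P50)
             wdt:P1476 ?title .               # title (P1476)
  }
}
"""
```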