
[NEEDS GROOMING] implement a test harness for data quality and query validation
Closed, Duplicate (Public)

Description

See 2 related tasks:

We need the capability to automatically compare the results returned by running a query on different triplestores. Particularly, we need to stress test query rewrites.

Input can be

  • any query that succeeded on Blazegraph but failed during traffic replay
  • examples on query UI and on wiki documentation
  • bug reports.
  • ...

We need the capability to:

  • execute large batches of queries as well as perform spot analyses.
  • summarize and report the output of diffs (how many queries return different results, and which ones?). Reports should not contain PII and should be available publicly, outside of Superset (SPIKE: explore implementing a bespoke app on Toolforge).
  • schedule validation runs on Airflow, and allow end users to run the tool locally.
  • triage whether an issue lies with the query, the index serialization, our data ingestion, or the triplestore.
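As a rough illustration of the reporting requirement, the sketch below summarizes a batch of comparison outcomes without publishing query text. The names `query_id` and `summarize`, the `(query_text, matched)` input shape, and the choice of a truncated SHA-256 digest as a public identifier are all assumptions made for this example; whether hashing alone is enough to keep PII out of public reports would need a separate review.

```python
import hashlib


def query_id(query_text):
    """Hypothetical public identifier: a short hash of the whitespace-normalized query."""
    normalized = " ".join(query_text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()[:16]


def summarize(outcomes):
    """Summarize (query_text, matched) pairs into a report that omits query text."""
    outcomes = list(outcomes)
    # Only identifiers of differing queries are listed, never the queries themselves.
    differing = sorted(query_id(text) for text, matched in outcomes if not matched)
    return {
        "total": len(outcomes),
        "differing": len(differing),
        "differing_ids": differing,
    }


if __name__ == "__main__":
    batch = [
        ("SELECT * WHERE { ?s ?p ?o } LIMIT 1", True),
        ("ASK { ?s ?p ?o }", False),
    ]
    print(summarize(batch))
```

The report answers "how many" and "which ones" while leaving the mapping from identifier back to query text to a private lookup.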

Needs to handle:

  • Blank nodes
  • Ordering
  • Inference differences?
  • Datatype normalization
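One possible way to handle blank nodes, ordering, and datatype normalization is sketched below. It assumes both engines return the standard SPARQL 1.1 JSON results format; the function names and the specific normalization choices (dropping blank node labels, comparing finite numerics by value, ignoring row order) are illustrative assumptions, not a settled design. Ignoring row order in particular means this sketch cannot validate `ORDER BY`, and inference differences are not addressed at all.

```python
from collections import Counter
from decimal import Decimal, InvalidOperation

XSD = "http://www.w3.org/2001/XMLSchema#"
NUMERIC_TYPES = {XSD + "integer", XSD + "decimal", XSD + "double", XSD + "float"}


def normalize_term(term):
    """Map one SPARQL JSON term to a hashable, engine-independent tuple."""
    kind = term["type"]
    if kind == "bnode":
        # Blank node labels are engine-local, so only their presence is kept.
        return ("bnode",)
    if kind == "uri":
        return ("uri", term["value"])
    datatype = term.get("datatype")
    if datatype in NUMERIC_TYPES:
        try:
            value = Decimal(term["value"])
            if value.is_finite():
                # Compare numerics by value so "1", "1.0" and "1E0" coincide.
                return ("number", value.normalize())
        except InvalidOperation:
            pass  # Malformed numeric literal: fall back to lexical comparison.
    return ("literal", term["value"], datatype or "", term.get("xml:lang", "").lower())


def normalize_results(sparql_json):
    """Turn a SPARQL JSON result into an order-insensitive multiset of rows."""
    rows = Counter()
    for binding in sparql_json["results"]["bindings"]:
        row = frozenset((var, normalize_term(term)) for var, term in binding.items())
        rows[row] += 1
    return rows


def same_results(left, right):
    """True when both results contain the same rows with the same multiplicities."""
    return normalize_results(left) == normalize_results(right)
```

Using a multiset rather than a set keeps duplicate rows significant, which matters for queries without `DISTINCT`.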

Explore:

  • how to automate regression tests in the data preparation step that splits and normalizes the entity dump into main and scholarly datasets.
  • what makes for a good set of control queries to support regression tests.
  • any learnings we can implement as a Data Quality step in the indexing pipeline.

Tooling

  • Jena
  • strong preference for map/reduce-style parallelization
  • ...
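To make the map/reduce preference concrete, here is a minimal single-machine sketch: each query is mapped to a one-entry tally, and the tallies are reduced into a batch summary. The `run_left` and `run_right` callables are hypothetical stand-ins for "execute this query on engine A / engine B and return a comparable result"; a thread pool is used only to keep the example self-contained, and a real run would presumably use whatever distributed tooling the platform provides.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def map_query(query, run_left, run_right):
    """Map step: compare one query on both engines and emit a one-entry tally."""
    try:
        outcome = "match" if run_left(query) == run_right(query) else "diff"
    except Exception:
        # Any failure on either side is tallied rather than aborting the batch.
        outcome = "error"
    return Counter({outcome: 1})


def validate_batch(queries, run_left, run_right, workers=4):
    """Run the map step in parallel, then reduce the partial tallies by summing."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = pool.map(lambda q: map_query(q, run_left, run_right), queries)
        return reduce(lambda a, b: a + b, partials, Counter())
```

Because the reduce step is a plain sum of counters, it is associative and commutative, so partial tallies can be combined in any order or in stages.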

Event Timeline

gmodena renamed this task from [NEEDS GROOMING] setup a test haranesss for data quality and query validation to [NEEDS GROOMING] implement a test haranesss for data quality and query validation. Mar 25 2026, 9:32 PM
gmodena updated the task description.
gmodena added subscribers: Pfps, Physikerwelt.

@Pfps @Physikerwelt this is a placeholder task for further validation work we plan to tackle next quarter. We touched on this in T414443: Setup WDQS instances on test eqiad nodes. This task is about scaling up the bespoke data analysis work we have done so far, automating it, and integrating it with our Data Platform infra. It is purely about qualitative comparison of output, not about measuring performance.

I would be interested to explore ways in which what we build could be reused outside of Foundation infrastructure. Do you use similar tools in your workflows? Would you find test automation useful? Any pain points you'd like us to be aware of?

I don't think that the problem is solvable in general. You can't even rerun queries and always expect the same results because of updates.

If you are querying the same graph then you can compare results, but even then the comparison is potentially very expensive in general. For useful Wikidata queries, though, the comparison should be relatively easy, provided that there are no indexicals or random functions, because the Wikidata RDF dump no longer has blank nodes.

What I did was extract a single unit of information from the results and compare that. Incomplete but simple and effective. The details are all in my benchmark code.

> I don't think that the problem is solvable in general. You can't even rerun queries and always expect the same results because of updates.

On test infrastructure we can compare static snapshots (we generate split graph data weekly), so updates do not need to be enabled.

That's good. The next determination is whether to do a complete comparison or an incomplete one. Then there is the issue of whether to include a third or fourth engine so that compliance can be estimated.

lerickson renamed this task from [NEEDS GROOMING] implement a test haranesss for data quality and query validation to [NEEDS GROOMING] implement a test harness for data quality and query validation. Mar 27 2026, 4:43 PM