To evaluate the impact of splitting the Wikidata graph, we want to compare the results of a set of queries run against different endpoints.
For this we need a tool in the same vein as RelevanceForge that can:
- record the output of a given set of queries when executed against a particular SPARQL endpoint
- compute various metrics by analyzing the differences between the outputs of the same set of queries run against two different endpoints
Analyzing the differences might require extracting a few metrics:
- same results
- same results but different ordering
- % of identical lines
- ...
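A minimal sketch of how the first three metrics might be computed, assuming each query output has already been parsed into a list of rows (tuples of strings). The function name `compare_results` is hypothetical, and the definitions are assumptions since the task does not pin them down: `same_unordered` is taken to mean "equal when ordering is ignored" (so it is also true when `same` is true), and the percentage of identical lines is taken as the number of rows shared by both sides divided by the size of the larger side.

```python
from collections import Counter


def compare_results(left, right):
    """Compare two result sets, each a list of rows.

    Assumed definitions (not specified by the task):
    - same: identical rows in identical order
    - same_unordered: identical rows once ordering is ignored
    - pct_identical_lines: shared rows / size of the larger result set
    """
    left_rows = [tuple(row) for row in left]
    right_rows = [tuple(row) for row in right]

    # Counters keep duplicate rows, so two results differing only in how
    # many times a row repeats are not reported as equal.
    left_counts = Counter(left_rows)
    right_counts = Counter(right_rows)

    # Multiset intersection: each row counts min(left, right) times.
    shared = sum((left_counts & right_counts).values())
    total = max(len(left_rows), len(right_rows))

    return {
        "same": left_rows == right_rows,
        "same_unordered": left_counts == right_counts,
        # Two empty results are treated as fully identical.
        "pct_identical_lines": 100.0 if total == 0 else 100.0 * shared / total,
    }
```

For example, `compare_results([("a",), ("b",)], [("b",), ("a",)])` reports `same` as false and `same_unordered` as true, which is the "same results but different ordering" case.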
The input for the tool is a dataset with the following columns: query_provenance, query_id, query_text
The output is a dataset with the following columns: query_provenance, query_id, status_code_left, status_code_right, same, same_unordered, pct_identical_lines
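The input and output shapes described above can be sketched for the CSV case as follows. This is only an illustration under stated assumptions: `diff_queries`, `run_left` and `run_right` are hypothetical names, the two callables stand in for whatever actually executes a query against each endpoint and are assumed to return a `(status_code, rows)` pair, and the metric definitions are the same assumed ones (shared rows over the size of the larger result).

```python
import csv
from collections import Counter

# Output columns as listed in the task description.
OUTPUT_COLUMNS = [
    "query_provenance", "query_id", "status_code_left", "status_code_right",
    "same", "same_unordered", "pct_identical_lines",
]


def _metrics(left, right):
    """Assumed metric definitions; rows are compared as tuples."""
    left_rows = [tuple(r) for r in left]
    right_rows = [tuple(r) for r in right]
    left_counts, right_counts = Counter(left_rows), Counter(right_rows)
    shared = sum((left_counts & right_counts).values())
    total = max(len(left_rows), len(right_rows))
    return {
        "same": left_rows == right_rows,
        "same_unordered": left_counts == right_counts,
        "pct_identical_lines": 100.0 if total == 0 else 100.0 * shared / total,
    }


def diff_queries(input_csv, output_csv, run_left, run_right):
    """Read queries from input_csv and write one comparison row per query.

    input_csv / output_csv are open text file objects.
    run_left / run_right are hypothetical callables: query_text -> (status_code, rows).
    """
    reader = csv.DictReader(input_csv)
    writer = csv.DictWriter(output_csv, fieldnames=OUTPUT_COLUMNS)
    writer.writeheader()
    for row in reader:
        status_left, rows_left = run_left(row["query_text"])
        status_right, rows_right = run_right(row["query_text"])
        out = {
            "query_provenance": row["query_provenance"],
            "query_id": row["query_id"],
            "status_code_left": status_left,
            "status_code_right": status_right,
        }
        # Metrics are computed even for failed queries; callers should
        # read them together with the two status codes.
        out.update(_metrics(rows_left, rows_right))
        writer.writerow(out)
```

Passing the endpoint runners in as callables keeps the comparison logic independent of how queries are executed, so the same loop could later be applied per row of a Spark dataframe.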
AC:
- A diff tool is available and can be run on a Spark dataframe or a CSV file
- It can produce a Spark dataframe or a CSV file