As part of the offline evaluation of the WDQS graph split (scholarly articles vs. the rest), we want to extract multiple sets of SPARQL queries.
(initially drafted from https://docs.google.com/document/d/1QsV96LtpK5lDD2N2jy-6vaF_0d_Yf_HLb8uFARFMxJ8)
QUERY-Q1: From the query logs, identify and extract sets of queries emitted by a set of known sources:
- [[ https://www.wikidata.org/wiki/Wikidata:Listeria | Listeria ]]
- Mix-n-match
- Pywikibot
- wd/wb-integrator
- WDQS UI
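One possible way to tag queries by source is to match on the user agent recorded with each log entry. The sketch below is only an illustration: the field names `user_agent` and `query`, the substrings in `KNOWN_SOURCES`, and the provenance codes are all assumptions and would need to be checked against the actual logs. WDQS UI traffic is left out because it presumably arrives with ordinary browser user agents and would need another signal.

```python
from typing import Dict, Iterable, Iterator, Optional, Tuple

# Hypothetical user-agent substrings (lowercase) mapped to provenance codes.
# These patterns are assumptions for illustration, not verified against the logs.
KNOWN_SOURCES = [
    ("listeria", "listeria"),
    ("mix-n-match", "mixnmatch"),
    ("mixnmatch", "mixnmatch"),
    ("pywikibot", "pywikibot"),
    ("wikidataintegrator", "wd-wb-integrator"),
    ("wikibaseintegrator", "wd-wb-integrator"),
]


def provenance_of(user_agent: Optional[str]) -> Optional[str]:
    """Return the provenance code for a user agent, or None if unknown."""
    if not user_agent:
        return None
    ua = user_agent.lower()
    for needle, code in KNOWN_SOURCES:
        if needle in ua:
            return code
    return None


def extract_rows(events: Iterable[Dict]) -> Iterator[Tuple[str, str]]:
    """Yield (query, provenance) rows for events coming from a known source."""
    for event in events:
        code = provenance_of(event.get("user_agent"))
        if code is not None:
            yield (event["query"], code)
```

The `(query, provenance)` tuples have the same shape as the two-column output described in this task, so the same mapping could later be expressed as a query over the logs.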
QUERY-Q2: Extract queries written in Wikidata wiki pages:
- https://observablehq.com/@pac02/hello-sparql-queries-dataset?collection=@pac02/wikidata-tools
- https://huggingface.co/datasets/htriedman/wiki-sparql/viewer/htriedman--wiki-sparql/test
QUERY-Q3: Extract a set of queries known to be used by Scholia
QUERY-Q4: A set of queries from the query logs, ideally representative of the following characteristics:
- Query size
- Query time
- Status code (HTTP return status)
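One way to get a sample that covers these three characteristics is to bucket each logged query by size, time, and status code, then draw a fixed number of queries from every bucket. The sketch below assumes each log entry is a dict with hypothetical fields `query`, `query_time_ms`, and `status_code`; the bucket edges are arbitrary placeholders, not values taken from the logs.

```python
import random
from collections import defaultdict

# Hypothetical bucket edges, chosen only for illustration.
SIZE_EDGES = [200, 1000, 5000]    # query length in characters
TIME_EDGES = [100, 1000, 10000]   # query time in milliseconds


def bucket(value, edges):
    """Return the index of the first edge that value is below (len(edges) if none)."""
    for i, edge in enumerate(edges):
        if value < edge:
            return i
    return len(edges)


def stratum(event):
    """Map an event to its (size bucket, time bucket, status code) stratum."""
    return (
        bucket(len(event["query"]), SIZE_EDGES),
        bucket(event["query_time_ms"], TIME_EDGES),
        event["status_code"],
    )


def stratified_sample(events, per_stratum, seed=0):
    """Draw up to per_stratum events from every non-empty stratum."""
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    groups = defaultdict(list)
    for event in events:
        groups[stratum(event)].append(event)
    sample = []
    for key in sorted(groups):
        members = groups[key]
        sample.extend(rng.sample(members, min(per_stratum, len(members))))
    return sample
```

Drawing the same number from every stratum deliberately over-represents rare cases such as slow or failing queries. If the sample should instead mirror the traffic distribution, the per-stratum count could be made proportional to the stratum size.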
The output is expected to be a Hive table with 2 columns:
- query: the SPARQL query in plain text
- provenance: a code identifying the provenance (source) of the query
Note: query logs are available in `events.wdqs_external_sparql_query`.
**Acceptance criteria**
[] QUERY-Q1: From the query logs, identify and extract sets of queries emitted by a set of known sources
[X] QUERY-Q2: Extract queries written in Wikidata wiki pages ([done](https://docs.google.com/document/d/1sOa_QKrVNgR-jvd0h-ja-kNOuXiGCMcFdXPL1mZNwmY/edit#bookmark=id.auilm4r9oljz))
[X] QUERY-Q3: Extract a set of queries known to be used by Scholia ([done](https://docs.google.com/document/d/1sOa_QKrVNgR-jvd0h-ja-kNOuXiGCMcFdXPL1mZNwmY/edit#bookmark=id.auilm4r9oljz))
[] QUERY-Q4: A set of queries from the query logs, ideally representative of query size, query time, and status code
**Links**
* [14 Nov 2023 Meeting notes](https://docs.google.com/document/d/1sOa_QKrVNgR-jvd0h-ja-kNOuXiGCMcFdXPL1mZNwmY/edit)