As part of the offline evaluation of the WDQS graph split (scholarly articles vs. the rest), we want to extract multiple sets of SPARQL queries.
(initially drafted from https://docs.google.com/document/d/1QsV96LtpK5lDD2N2jy-6vaF_0d_Yf_HLb8uFARFMxJ8)
QUERY-Q1: From the query logs identify and extract sets of queries emitted from a set of known sources:
- Listeria
- Mix-n-match
- Pywikibot
- wd/wb-integrator
- WDQS UI
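A minimal sketch of how the QUERY-Q1 sources could be told apart, assuming the tools identify themselves in the User-Agent header and that WDQS UI traffic can be recognised by its Referer. The patterns, the referer prefix and the provenance codes below are hypothetical and need to be checked against the actual logs.

```python
import re

# Hypothetical User-Agent patterns, one per known source.
# These are assumptions for illustration, not verified against the query logs.
SOURCE_PATTERNS = {
    "listeria": re.compile(r"listeria", re.IGNORECASE),
    "mix-n-match": re.compile(r"mix.?n.?match", re.IGNORECASE),
    "pywikibot": re.compile(r"pywikibot", re.IGNORECASE),
    "wb-integrator": re.compile(r"wikidataintegrator|wikibaseintegrator", re.IGNORECASE),
}

# Assumption: queries sent from the WDQS UI carry this referer prefix.
WDQS_UI_REFERER_PREFIX = "https://query.wikidata.org/"


def provenance(user_agent, referer=None):
    """Return a provenance code for a logged request, or None if unknown."""
    for code, pattern in SOURCE_PATTERNS.items():
        if user_agent and pattern.search(user_agent):
            return code
    if referer and referer.startswith(WDQS_UI_REFERER_PREFIX):
        return "wdqs-ui"
    return None
```

The returned code is meant to fill the provenance column of the output table; requests that match no rule return None and would be left out of the QUERY-Q1 sets.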
QUERY-Q2: Queries written in Wikidata wiki pages, taken from:
- https://observablehq.com/@pac02/hello-sparql-queries-dataset?collection=@pac02/wikidata-tools
- https://huggingface.co/datasets/htriedman/wiki-sparql/viewer/htriedman--wiki-sparql/test
QUERY-Q3: Extract a set of queries known to be used by Scholia
QUERY-Q4: A set of queries from the query logs, ideally representative of the following characteristics:
- Query size
- Query time
- Status code (http return status)
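One way to keep the QUERY-Q4 sample representative of these three characteristics is stratified sampling: bucket each logged query by size, duration and status code, then draw the same number of queries from every bucket. The sketch below assumes each record is a dict with the hypothetical keys query, duration_ms and status; the bucket edges are placeholders, not values derived from the logs.

```python
import random
from collections import defaultdict

# Hypothetical bucket edges; real ones should come from the observed distributions.
SIZE_EDGES = [200, 1000, 5000]      # query length in characters
TIME_EDGES = [100, 1000, 10000]     # query duration in milliseconds


def bucket(value, edges):
    """Return the index of the first edge that value is below, or len(edges)."""
    for i, edge in enumerate(edges):
        if value < edge:
            return i
    return len(edges)


def stratified_sample(records, per_stratum, seed=0):
    """Draw up to per_stratum records from each (size, time, status) stratum."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    strata = defaultdict(list)
    for record in records:
        key = (
            bucket(len(record["query"]), SIZE_EDGES),
            bucket(record["duration_ms"], TIME_EDGES),
            record["status"],
        )
        strata[key].append(record)

    sample = []
    for key in sorted(strata):
        group = strata[key]
        sample.extend(rng.sample(group, min(per_stratum, len(group))))
    return sample
```

Sampling a fixed number per stratum deliberately over-represents rare cases such as slow or failing queries; drawing a fixed fraction per stratum instead would preserve the original proportions.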
The output is expected to be a Hive table with 2 columns:
- query: the SPARQL query in plain text
- provenance: a code identifying the provenance (source) of the query
Note: query logs are available in events.wdqs_external_sparql_query.
Acceptance criteria
- QUERY-Q1: From the query logs identify and extract sets of queries emitted from a set of known sources
- QUERY-Q2: From queries written in wikidata wikipages (done)
- QUERY-Q3: Extract a set of queries known to be used by Scholia (done)
- QUERY-Q4: A set of queries from the query logs, ideally representative of the following characteristics:
  - Size of query in characters
  - The query duration
- Upload the query sampling process for QUERY-Q1 and QUERY-Q4 as documentation
  - Location: https://gitlab.wikimedia.org/repos/search-platform/notebooks
  - Be sure to remove PII (IP addresses in table heads)
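A small sketch of the PII step, assuming rows are handled as (possibly nested) dicts before being displayed in a notebook. The field names in PII_FIELDS are hypothetical and should be matched to the actual log schema.

```python
# Hypothetical names of fields that hold client IP addresses.
PII_FIELDS = frozenset({"client_ip", "ip"})


def strip_pii(row, pii_fields=PII_FIELDS):
    """Return a copy of row without PII fields, recursing into nested dicts."""
    cleaned = {}
    for key, value in row.items():
        if key in pii_fields:
            continue  # drop the field entirely rather than masking it
        if isinstance(value, dict):
            value = strip_pii(value, pii_fields)
        cleaned[key] = value
    return cleaned
```

Applying this before any table preview is rendered keeps IP addresses out of the saved notebook output.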
Links