[Analytics] Collect multiple sets of SPARQL queries
Closed, Resolved, Public

Description

As part of the offline evaluation of the WDQS graph split (scholarly articles vs. the rest), we want to extract multiple sets of SPARQL queries.
(initially drafted from https://docs.google.com/document/d/1QsV96LtpK5lDD2N2jy-6vaF_0d_Yf_HLb8uFARFMxJ8)

QUERY-Q1: From the query logs identify and extract sets of queries emitted from a set of known sources:

  • Listeria
  • Mix-n-match
  • Pywikibot
  • wd/wb-integrator
  • WDQS UI
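
One possible way to tag log entries by source is to match on the client's user-agent string. The sketch below assumes that each log record exposes a user-agent value and that the tools identify themselves with recognizable tokens; the substrings and provenance codes in `SOURCE_PATTERNS` are hypothetical and would need to be checked against the real logs. WDQS UI is left out because it has no entry in this hypothetical mapping.

```python
from typing import Optional

# Hypothetical mapping of provenance code -> lowercase substrings expected
# in the user-agent string. These are assumptions, not verified values.
SOURCE_PATTERNS = {
    "listeria": ("listeria",),
    "mix-n-match": ("mix-n-match", "mixnmatch"),
    "pywikibot": ("pywikibot",),
    "wd-wb-integrator": ("wikidataintegrator", "wikibaseintegrator"),
}


def classify_source(user_agent: Optional[str]) -> Optional[str]:
    """Return a provenance code for a user agent, or None if no pattern matches."""
    if not user_agent:
        return None
    ua = user_agent.lower()
    for code, needles in SOURCE_PATTERNS.items():
        if any(needle in ua for needle in needles):
            return code
    return None
```

Queries whose user agent matches none of the patterns return `None` and would simply be excluded from the QUERY-Q1 sets.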

QUERY-Q2: Extract queries written in Wikidata wiki pages

QUERY-Q3: Extract a set of queries known to be used by Scholia

QUERY-Q4: A set of queries from the query logs, ideally representative of the following characteristics:

  • Query size
  • Query time
  • Status code (HTTP return status)
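
One way to aim for a representative set is stratified sampling: bucket each logged query by size, duration, and status code, then draw up to a fixed number of queries from every bucket. The sketch below is only an illustration under assumptions. The record keys `query_time_ms` and `status_code` and the bucket edges are hypothetical, and this is not necessarily the sampling process used in the notebooks.

```python
import random
from collections import defaultdict

# Hypothetical bucket edges; real values should come from the observed distributions.
SIZE_EDGES = (200, 1000, 5000)      # query length in characters
TIME_EDGES = (100, 1000, 10000)     # query duration in milliseconds


def bucket(value, edges):
    """Return the index of the first edge that value is below, or len(edges)."""
    for i, edge in enumerate(edges):
        if value < edge:
            return i
    return len(edges)


def stratified_sample(records, per_stratum, seed=0):
    """Draw up to per_stratum records from each (size, time, status) stratum."""
    rng = random.Random(seed)  # seeded so the sample is reproducible
    strata = defaultdict(list)
    for rec in records:
        key = (
            bucket(len(rec["query"]), SIZE_EDGES),
            bucket(rec["query_time_ms"], TIME_EDGES),
            rec["status_code"],
        )
        strata[key].append(rec)
    sample = []
    for key in sorted(strata):
        group = strata[key]
        sample.extend(rng.sample(group, min(per_stratum, len(group))))
    return sample
```

Capping each stratum keeps rare combinations (for example, long, slow queries that return an error) in the sample instead of letting short, fast, successful queries dominate.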

The output is expected to be a Hive table with two columns:

  • query: the SPARQL query in plain text
  • provenance: a code identifying the provenance (source) of the query
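
As a rough illustration of the two-column shape, the sketch below flattens several query sets into `(query, provenance)` rows before they would be written to the table. The provenance codes used in the usage note and the deduplication step are assumptions made for this example; the task itself only fixes the two column names.

```python
from typing import Dict, Iterable, List, NamedTuple


class QueryRow(NamedTuple):
    query: str        # the SPARQL query in plain text
    provenance: str   # code identifying the source of the query


def build_rows(query_sets: Dict[str, Iterable[str]]) -> List[QueryRow]:
    """Flatten {provenance: queries} into rows, skipping blanks and exact duplicates."""
    seen = set()
    rows = []
    for provenance, queries in query_sets.items():
        for text in queries:
            row = QueryRow(query=text.strip(), provenance=provenance)
            # Deduplicating per (query, provenance) pair is an assumption of this sketch.
            if row.query and row not in seen:
                seen.add(row)
                rows.append(row)
    return rows
```

For example, `build_rows({"Q3-scholia": ["SELECT 1"]})` yields a single row whose `provenance` is the hypothetical code `Q3-scholia`.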

Note: query logs are available in events.wdqs_external_sparql_query.

Acceptance criteria

  • QUERY-Q1: From the query logs identify and extract sets of queries emitted from a set of known sources
  • QUERY-Q2: From queries written in Wikidata wiki pages (done)
  • QUERY-Q3: Extract a set of queries known to be used by Scholia (done)
  • QUERY-Q4: A set of queries from the query logs, ideally representative of the following characteristics:
    • Size of query in characters
    • The query duration
  • Upload of query sampling process for QUERY-Q1 and QUERY-Q4 for documentation

Event Timeline

Gehel renamed this task from Collect multiple sets of SPARQL queries to [Analytics] Collect multiple sets of SPARQL queries. (Oct 31 2023, 8:45 AM)
Manuel updated the task description.

Based on conversations with WMF engineering, uniquely identifying queries that come from the WDQS UI doesn't appear to be possible at this time. We'll thus not be able to include queries from there in the sample.

AndrewTavis_WMDE updated Other Assignee, added: AndrewTavis_WMDE.
AndrewTavis_WMDE updated Other Assignee, removed: AndrewTavis_WMDE.

An MR with the notebooks and their HTML versions has just been submitted, as agreed in last week's Search Team meeting:

https://gitlab.wikimedia.org/repos/search-platform/notebooks/-/merge_requests/2

Please let me know if anything else is necessary!

Closing this as the MR has been merged 🎉 Thanks all for the support and the great work on this project!