[Analytics] Collect multiple sets of SPARQL queries
Closed, Resolved · Public

Assigned To

Authored By

	dcausse
	Oct 23 2023, 1:41 PM

Description

As part of the offline evaluation of the WDQS graph split (scholarly articles vs. the rest), we want to extract multiple sets of SPARQL queries.
(initially drafted from https://docs.google.com/document/d/1QsV96LtpK5lDD2N2jy-6vaF_0d_Yf_HLb8uFARFMxJ8)

QUERY-Q1: From the query logs identify and extract sets of queries emitted from a set of known sources:

Listeria
Mix-n-match
Pywikibot
wd/wb-integrator
~~WDQS UI~~
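
One way QUERY-Q1 could be approached is by matching the user agent recorded with each logged query against the known sources. The sketch below is an illustration only: it assumes the sources are distinguishable by a user-agent substring, and the substrings and provenance codes in `KNOWN_SOURCES` are hypothetical placeholders, not values taken from this task.

```python
from typing import Optional

# Hypothetical mapping from a lowercase user-agent substring to a
# provenance code. The real user-agent strings and codes are not
# specified in this task and would need to be checked against the logs.
KNOWN_SOURCES = {
    "listeria": "listeria",
    "mix-n-match": "mix-n-match",
    "pywikibot": "pywikibot",
    "wikidataintegrator": "wd/wb-integrator",
    "wikibaseintegrator": "wd/wb-integrator",
}


def provenance_from_user_agent(user_agent: Optional[str]) -> Optional[str]:
    """Return a provenance code for a known source, or None if unmatched."""
    if not user_agent:
        return None
    lowered = user_agent.lower()
    for needle, code in KNOWN_SOURCES.items():
        if needle in lowered:
            return code
    return None
```

Queries whose user agent matches none of the entries get `None` and would simply be left out of the QUERY-Q1 sets.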

QUERY-Q2: A set of queries taken from those written in Wikidata wiki pages

QUERY-Q3: Extract a set of queries known to be used by Scholia

QUERY-Q4: A set of queries from the query logs, ideally representative of the following characteristics:

Query size
Query time
Status code (http return status)
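
A sample that is representative of these three characteristics could be drawn with stratified sampling: bucket each logged query by size, duration, and status code, then take up to a fixed number of queries per bucket. The following is a minimal sketch under stated assumptions, not the method used for this task. The record keys (`query`, `query_time_ms`, `http_status`) and the bucket thresholds are hypothetical.

```python
import random
from collections import defaultdict


def size_bucket(query: str) -> str:
    """Bucket a query by its length in characters (thresholds are arbitrary)."""
    n = len(query)
    if n < 200:
        return "small"
    if n < 1000:
        return "medium"
    return "large"


def time_bucket(query_time_ms: float) -> str:
    """Bucket a query by its duration in milliseconds (thresholds are arbitrary)."""
    if query_time_ms < 100:
        return "fast"
    if query_time_ms < 10_000:
        return "moderate"
    return "slow"


def stratified_sample(records, per_stratum, seed=0):
    """Return up to `per_stratum` records from each (size, time, status) stratum."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    strata = defaultdict(list)
    for rec in records:
        key = (
            size_bucket(rec["query"]),
            time_bucket(rec["query_time_ms"]),
            rec["http_status"],
        )
        strata[key].append(rec)
    sample = []
    for key in sorted(strata):
        group = strata[key]
        sample.extend(rng.sample(group, min(per_stratum, len(group))))
    return sample
```

Capping each stratum keeps rare combinations (for example long, slow queries that end in an error) in the sample instead of letting the many short, fast queries dominate it.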

The output is expected to be a Hive table with two columns:

query: the SPARQL query in plain text
provenance: a code identifying the provenance (source) of the query

Note: query logs are available in events.wdqs_external_sparql_query.

Acceptance criteria

QUERY-Q1: From the query logs identify and extract sets of queries emitted from a set of known sources
QUERY-Q2: From queries written in Wikidata wiki pages (done)
QUERY-Q3: Extract a set of queries known to be used by Scholia (done)
QUERY-Q4: A set of queries from the query logs, ideally representative of the following characteristics:
- Size of query in characters
- The query duration
Upload of query sampling process for QUERY-Q1 and QUERY-Q4 for documentation
- Location: https://gitlab.wikimedia.org/repos/search-platform/notebooks
- Be sure to remove PII (IPs from table heads)
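
For the PII point, one simple safeguard is to strip sensitive fields from rows before they are displayed in a notebook, so that table heads saved in the uploaded output never contain them. This is a sketch only; the field names in `PII_FIELDS` are hypothetical and would need to match the actual column names in the query logs.

```python
# Hypothetical names of columns that may hold PII; adjust to the real schema.
PII_FIELDS = ("ip", "client_ip")


def drop_pii(rows, pii_fields=PII_FIELDS):
    """Return copies of the row dicts without the given PII fields."""
    blocked = set(pii_fields)
    return [
        {key: value for key, value in row.items() if key not in blocked}
        for row in rows
    ]
```

The input rows are left unmodified, so the full data stays available for the analysis itself while only the cleaned copies are shown.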

Links

14 Nov 2023 Meeting notes

Related Objects

Status	Assigned	Task
Open	None	T335067 Epic: Wikidata Query Service stabilization
Open	None	T337013 [Epic] Splitting the graph in WDQS
Open	None	T352538 [EPIC] Evaluate the impact of the graph split
Resolved	Manuel	T337799 [EPIC] Analytics support around splitting the WDQS graph [up to milestone 3]
Resolved	AndrewTavis_WMDE	T349512 [Analytics] Collect multiple sets of SPARQL queries
Duplicate	AndrewTavis_WMDE	T350157 [Analytics] Extract a representative sample of SPARQL queries from the query logs

Event Timeline

dcausse created this task. Oct 23 2023, 1:41 PM

Restricted Application added a subscriber: Aklapper. Oct 23 2023, 1:41 PM

dcausse added a parent task: T337013: [Epic] Splitting the graph in WDQS. Oct 23 2023, 1:41 PM

Maintenance_bot added a project: Wikidata. Oct 23 2023, 1:45 PM

EBernhardson moved this task from Incoming to Current work on the Wikidata-Query-Service board. Oct 23 2023, 3:27 PM

EBernhardson added a project: Discovery-Search (Current work).

EBernhardson moved this task from Incoming to Blocked/Waiting on the Discovery-Search (Current work) board.

Gehel renamed this task from Collect multiple sets of SPARQL queries to [Analytics] Collect multiple sets of SPARQL queries. Oct 31 2023, 8:45 AM

AndrewTavis_WMDE mentioned this in T350157: [Analytics] Extract a representative sample of SPARQL queries from the query logs. Nov 2 2023, 1:19 PM

Manuel added a subtask: T350157: [Analytics] Extract a representative sample of SPARQL queries from the query logs. Nov 2 2023, 1:29 PM

Gehel added a subtask: T337015: review sample of queries to get a better sense of impact of various graph split options. Nov 3 2023, 10:15 AM

Gehel edited parent tasks, added: T352538: [EPIC] Evaluate the impact of the graph split; removed: T337013: [Epic] Splitting the graph in WDQS. Dec 1 2023, 2:49 PM

Gehel removed a subtask: T337015: review sample of queries to get a better sense of impact of various graph split options.

Manuel merged a task: T350157: [Analytics] Extract a representative sample of SPARQL queries from the query logs. Dec 14 2023, 2:05 PM

Manuel mentioned this in T337799: [EPIC] Analytics support around splitting the WDQS graph [up to milestone 3].

Manuel added subscribers: Manuel, AndrewTavis_WMDE, dr0ptp4kt, WMDE-leszek.

Manuel updated the task description. Dec 14 2023, 2:08 PM

Manuel updated the task description. Dec 14 2023, 2:11 PM

Manuel added a parent task: T337799: [EPIC] Analytics support around splitting the WDQS graph [up to milestone 3].

Manuel added a subtask: T353453: [Analytics] Impact of Scholia on WDQS. Dec 14 2023, 2:31 PM

Manuel removed a subtask: T353453: [Analytics] Impact of Scholia on WDQS. Dec 14 2023, 2:40 PM

Manuel updated the task description. Dec 14 2023, 2:46 PM

Manuel updated the task description.

Manuel updated the task description. Dec 14 2023, 2:50 PM

Manuel added a project: Wikidata Analytics (Kanban). Dec 15 2023, 9:59 AM

Manuel moved this task from Incoming to Prioritized backlog on the Wikidata Analytics (Kanban) board.

Based on conversations with WMF engineering, uniquely identifying queries that come from the WDQS UI doesn't appear to be possible at this time. We'll thus not be able to include queries from there in the sample.

AndrewTavis_WMDE claimed this task. Dec 20 2023, 9:55 AM

AndrewTavis_WMDE updated Other Assignee, added: AndrewTavis_WMDE.

AndrewTavis_WMDE updated Other Assignee, removed: AndrewTavis_WMDE.

AndrewTavis_WMDE moved this task from Prioritized backlog to In progress on the Wikidata Analytics (Kanban) board. Dec 22 2023, 11:33 AM

dr0ptp4kt mentioned this in T350106: Implement a spark job that converts a RDF triples table into a RDF file format. Jan 4 2024, 5:52 PM

dcausse mentioned this in T355040: Compare the results of sparql queries between the fullgraph and the subgraphs. Jan 15 2024, 10:08 AM

AndrewTavis_WMDE updated the task description. Jan 29 2024, 11:03 AM

AndrewTavis_WMDE updated the task description. Jan 29 2024, 11:06 AM

AndrewTavis_WMDE moved this task from In progress to Prioritized backlog on the Wikidata Analytics (Kanban) board. Feb 5 2024, 9:19 AM

AndrewTavis_WMDE moved this task from Prioritized backlog to In progress on the Wikidata Analytics (Kanban) board.

An MR with the notebooks and their HTML versions has just been sent, as per the Search Team meeting last week:

https://gitlab.wikimedia.org/repos/search-platform/notebooks/-/merge_requests/2