Purpose
In the context of the WDQS Graph Splitting initiative we need to understand the consequences of selected splits. As a basis for these evaluations we need a set of representative queries.
Scope
The goal of this task is to extract a representative sample of SPARQL queries from the Blazegraph query logs.
A: The query set should be representative of the following characteristics:
OR
B: We create subsets of queries that are representative of different types of queries:
- Query size
- Query time
- Status code (HTTP return status) [not in the table!]
- User agent (mentioned in the call)
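A stratified sample along these characteristics could be sketched as follows. This is a minimal illustration, not the actual pipeline: the field names (`size_bin`, `time_bin`, `http_status`) and the sampling rate are assumptions, not the real columns of the query logs.

```python
import random
from collections import defaultdict


def stratified_sample(queries, rate=0.01, seed=42):
    """Sample queries proportionally from each stratum.

    `queries` is a list of dicts. The stratification keys used here
    (size_bin, time_bin, http_status) are hypothetical field names.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for q in queries:
        key = (q["size_bin"], q["time_bin"], q["http_status"])
        strata[key].append(q)

    sample = []
    for group in strata.values():
        # Keep at least one query per stratum so that rare query types
        # (e.g. very slow or failing queries) are still represented.
        k = max(1, round(len(group) * rate))
        sample.extend(rng.sample(group, k))
    return sample
```

Whether rare strata should be over-represented like this (at least one query each) or sampled strictly proportionally is one of the open questions below.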
Notes
- Query logs are available in events.wdqs_external_sparql_query.
- https://wikitech.wikimedia.org/wiki/User:AKhatun/Wikidata_Subgraph_Query_Analysis
- It would be nice to get the notebooks that produced these results, if possible.
- See prior work in https://phabricator.wikimedia.org/T349512
- https://github.com/tanny411
- https://wikitech.wikimedia.org/wiki/User:AKhatun
- WDQS Graph Splitting - Analysis needs (internal)
Open questions
- Confirmation of goal and scope
- Are you interested in a representative set of queries, in different subsets of queries that follow different characteristics, or in what AKhatun did?
- What subsets of queries would you be most interested in?
- What is the data source?
- What time frame should we look into?
- What sample size were you looking for?
- What output format would you prefer?
- What is the urgency of this task?
Desired output
The output is expected to be a Hive table with 2 columns:
- query: the SPARQL query in plain text
- provenance: a code identifying the provenance (source) of the query
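A minimal sketch of rows matching this schema; the example query and the provenance code "wdqs-external" are invented placeholders, not agreed-upon values.

```python
from typing import NamedTuple


class SampledQuery(NamedTuple):
    query: str       # the SPARQL query in plain text
    provenance: str  # code identifying the provenance (source) of the query


# Hypothetical example row; "wdqs-external" is a placeholder provenance code.
rows = [
    SampledQuery(
        query="SELECT ?item WHERE { ?item wdt:P31 wd:Q5 } LIMIT 10",
        provenance="wdqs-external",
    ),
]
```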
Urgency
Note when this task should be completed by. If this task is time sensitive, please make this clear. Please also provide the date when the output will be used if there is a specific meeting or event, for example.
DD.MM.YYYY
Information below this point is filled out by the Wikidata Analytics team.
General Planning
Information is filled out by the analytics product manager.
Assignee Planning
Information is filled out by the assignee of this task.
Estimation
Estimate:
Actual:
Sub Tasks
Full breakdown of the steps to complete this task:
- subtask
Data to be used
See Analytics/Data_Lake for the breakdown of the data lake databases and tables.
The following tables will be referenced in this task:
- link_to_table
Notes and Questions
Things that came up during the completion of this task, questions to be answered and follow up tasks:
- What is the metadata that defines this sample that we want?
- How big of a sample? Is this supposed to be determined by accuracy metrics?
- We don't need the sample to be exactly representative, but the various kinds of queries should at least be represented.
- Are we using the same breakdowns as AKhatun did?
- Query size: this was not binned, but could be
- Query time: < 10ms, 10-100ms, 100ms - 1s, 1-10s, > 10s
- Status code (http return status): 200 or 500
- User agent: modern or old version
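The query-time bins listed above could be implemented as a simple binning function. A sketch only; that durations arrive in milliseconds is an assumption about the log schema.

```python
def time_bin(duration_ms: float) -> str:
    """Map a query duration in milliseconds to the bins above."""
    if duration_ms < 10:
        return "< 10ms"
    if duration_ms < 100:
        return "10-100ms"
    if duration_ms < 1_000:
        return "100ms - 1s"
    if duration_ms < 10_000:
        return "1-10s"
    return "> 10s"
```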