[Analytics] [WDQS SEG M1] Identify queries that retrieve Items based on a simple statement
Closed, Resolved (Public)

Assigned To

Authored By

	• Manuel
	Jul 24 2024, 10:13 AM

Description

Wikidata Analytics Request

This task was generated using the Wikidata Analytics request form. Please use the task template linked on our project page to create issues for the team. Thank you!

Purpose

Please provide as much context as possible as well as what the produced insights or services will be used for.

T370416: [Analytics] [WDQS SEG M1] Segmentation of queries sent to WDQS

Identify SPARQL queries that should ideally go to Elasticsearch instead of WDQS.

Specific Results

Please detail the specific results that the task should deliver.

Identify WDQS SPARQL queries that only retrieve entities based on a simple statement (e.g. Which Item has IMDb ID "tt0133093"?)
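To make the example concrete, here is a minimal sketch of what such a query and its search-side counterpart could look like. It assumes that the IMDb ID property is P345 and that the query uses the truthy `wdt:` prefix; both helper names are hypothetical and are not taken from the task's notebook.

```python
def simple_statement_sparql(property_id: str, value: str) -> str:
    """Build a single-triple SPARQL query for a string-valued statement."""
    # Escape backslashes and double quotes so the literal stays valid SPARQL.
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'SELECT ?item WHERE {{ ?item wdt:{property_id} "{escaped}" . }}'


def haswbstatement_search(property_id: str, value: str) -> str:
    """Build the corresponding haswbstatement search string."""
    return f"haswbstatement:{property_id}={value}"


# Which Item has IMDb ID "tt0133093"? (P345 is assumed to be IMDb ID.)
print(simple_statement_sparql("P345", "tt0133093"))
print(haswbstatement_search("P345", "tt0133093"))
```

Queries of the first shape are the ones this task tries to identify, since the second form could in principle answer them without WDQS.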

Desired Outputs

Please list the desired outputs of this task.

Initial simple identification algorithm that can be used in T370854 (and iteratively improved on later).
- Notebook: T370853_single_statement_wd_query_use.ipynb

Notes

Using Elasticsearch for Wikidata, this should currently work for all properties with the "external identifier", "string", "item", "property", "lexeme", "form" and "sense" datatypes, except published in (P1433) and cites (P2860), which are currently omitted for performance reasons. Also, it is not possible to make use of the class hierarchy.
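The restriction described in this note can be written down as a small predicate. This is only an illustrative sketch: the datatype labels are copied from the note rather than from any Wikibase API, the function name is hypothetical, and the scope check for this task decided that the P1433/P2860 exclusion does not need to be accounted for.

```python
# Datatype labels as written in the note above (not API identifiers).
SUPPORTED_DATATYPES = frozenset(
    {"external identifier", "string", "item", "property", "lexeme", "form", "sense"}
)

# Properties the note says are omitted from the search index for performance reasons.
OMITTED_PROPERTIES = frozenset({"P1433", "P2860"})


def is_searchable_statement(property_id: str, datatype: str) -> bool:
    """Return True if a statement on this property could be found via Elasticsearch."""
    return (
        datatype.lower() in SUPPORTED_DATATYPES
        and property_id.upper() not in OMITTED_PROPERTIES
    )


print(is_searchable_statement("P345", "external identifier"))  # True
print(is_searchable_statement("P1433", "item"))  # False
```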

Open Questions

Should we only focus on the SPARQL equivalents of haswbstatement, or should we also go for inlabel, wbstatementquantity, hasdescription, or haslabel?

Deadline

Please make the time sensitivity of this task clear with a date that it should be completed by. If there is no specific date, then the task will be triaged based on its priority.

DD.MM.YYYY

Information below this point is filled out by the task assignee.

Assignee Planning

Sub Tasks

A full breakdown of the steps to complete this task.

Check scope of task

- It says "queries that only retrieve Items", but is this for properties and lexemes as well?
  - I ask because in the notes it says that Elasticsearch also works for property and lexeme data types
  - Answer: Also properties and lexemes
- Considering the various ways that simple statements could be defined
  - The example is: Which Item has IMDb ID "tt0133093"?
  - So the query should have a single triple?
  - Is there any restriction on the object of the triple, or can it be anything?
  - Answer: Single triple, and the object can be anything
- It says that published in (P1433) and cites (P2860) are currently omitted from Elasticsearch for Wikidata for performance reasons
  - Should queries referencing these two be conditionally removed from the results?
  - So it would be individual triple queries that don't have these two as predicates, or all individual triple queries?
  - Answer: We do not need to account for this
- It would be helpful to have more context on the Open Questions section
  - "Should we only focus on the SPARQL equivalents of haswbstatement, or should we also go for inlabel, wbstatementquantity, hasdescription, or haslabel?"
  - Context: Focusing on haswbstatement from the start

Explore idea of deriving simple queries via the number of subject node values from discovery.processed_external_sparql_query
- The general idea is to find queries that have a single unique subject, predicate and object from the processed metadata
Share baseline identification method
Check query results with WMF and WMDE stakeholders
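The idea of checking for a single unique subject, predicate and object can be sketched in plain Python as follows. The record layout is an assumption: only `triples.subjectNode.nodeValue` is named in this task, so the `predicateNode` and `objectNode` fields, the `nodeType` values, and the function name are hypothetical and would need to be checked against the actual table schema.

```python
from typing import Mapping, Sequence

Node = Mapping[str, str]
Triple = Mapping[str, Node]


def is_single_statement_query(triples: Sequence[Triple]) -> bool:
    """True if the triples contain exactly one unique subject, predicate and object."""
    subjects = {t["subjectNode"]["nodeValue"] for t in triples}
    # predicateNode and objectNode are assumed field names (see lead-in).
    predicates = {t["predicateNode"]["nodeValue"] for t in triples}
    objects = {t["objectNode"]["nodeValue"] for t in triples}
    return len(subjects) == 1 and len(predicates) == 1 and len(objects) == 1


# Hypothetical processed metadata for: ?item wdt:P345 "tt0133093"
example = [
    {
        "subjectNode": {"nodeType": "NODE_VAR", "nodeValue": "item"},
        "predicateNode": {"nodeType": "NODE_URI", "nodeValue": "wdt:P345"},
        "objectNode": {"nodeType": "NODE_LITERAL", "nodeValue": "tt0133093"},
    }
]
print(is_single_statement_query(example))  # True
```

A query with no triples yields empty sets and is therefore rejected, which matches the intent of only keeping queries that contain one statement.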

Estimation

Estimate: 1-2 days (depending on the query specifics determined in the scope check)
Actual: 1.5 days

Data

The tables that will be referenced in this task.

discovery.processed_external_sparql_query
- triples.subjectNode.nodeValue will be used to derive the number of triples and their characteristics
- Note: We have to use this table as these results will be combined with the results of T370848 into a single DAG, so it doesn't make sense to mix two source tables
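As a rough sketch of how the subject values could be counted directly on this table, the helper below assembles a Spark SQL statement. It is an untested assumption rather than the notebook's query: the selected `query` column is hypothetical, partition filters are left out, and the function name is made up for illustration. `size`, `array_distinct` and `transform` are standard Spark SQL functions.

```python
def single_subject_sql(
    table: str = "discovery.processed_external_sparql_query",
    query_column: str = "query",  # hypothetical column name, check the schema
) -> str:
    """Build Spark SQL selecting rows whose triples share one unique subject value."""
    return (
        f"SELECT {query_column}, triples\n"
        f"FROM {table}\n"
        "WHERE size(array_distinct(\n"
        "    transform(triples, t -> t.subjectNode.nodeValue)\n"
        ")) = 1"
    )


print(single_subject_sql())
```

The resulting string could be passed to `spark.sql(...)` in a notebook session once the column names and partition filters have been confirmed.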

Notes

Things that came up during the completion of this task, questions to be answered and follow up tasks.

Note

Related Objects

Status	Assigned	Task
Open	None	T373033 [EPIC] [WDQS SEG] Segmentation of WDQS queries
In Progress	Lydia_Pintscher	T370416 [Analytics] [WDQS SEG M1] Segmentation of queries sent to WDQS
Resolved	AndrewTavis_WMDE	T370853 [Analytics] [WDQS SEG M1] Identify queries that retrieve Items based on a simple statement

Event Timeline

• Manuel created this task. Jul 24 2024, 10:13 AM

• Manuel mentioned this in T370854: [Analytics] [WDQS SEG M1] Create monitoring for the number of WDQS queries that retrieve Items based on a simple statement. Jul 24 2024, 10:22 AM

• Manuel updated the task description.

• Manuel mentioned this in T370416: [Analytics] [WDQS SEG M1] Segmentation of queries sent to WDQS.

• Manuel triaged this task as High priority. Jul 24 2024, 10:25 AM

• Manuel moved this task from Incoming to To-Do on the Wikidata Analytics (Kanban) board.

• Manuel updated the task description. Jul 24 2024, 12:59 PM

• Manuel updated the task description.

AndrewTavis_WMDE updated the task description. Aug 6 2024, 4:04 PM

AndrewTavis_WMDE edited subscribers, added: karapayneWMDE; removed: • Manuel.

AndrewTavis_WMDE claimed this task. Aug 6 2024, 4:44 PM

AndrewTavis_WMDE moved this task from To-Do to In Progress on the Wikidata Analytics (Kanban) board. Aug 9 2024, 10:35 AM

AndrewTavis_WMDE updated the task description. Aug 9 2024, 10:46 AM

AndrewTavis_WMDE updated the task description. Aug 9 2024, 11:09 AM

AndrewTavis_WMDE updated the task description.

AndrewTavis_WMDE moved this task from In Progress to Stalled on the Wikidata Analytics (Kanban) board.

AndrewTavis_WMDE updated the task description. Aug 9 2024, 11:33 AM

AndrewTavis_WMDE updated the task description. Aug 9 2024, 3:11 PM

AndrewTavis_WMDE moved this task from Stalled to In Progress on the Wikidata Analytics (Kanban) board. Aug 9 2024, 3:35 PM

AndrewTavis_WMDE updated the task description. Aug 12 2024, 3:00 PM

T370853_single_statement_wd_query_use.ipynb has the work associated with this task :) Moving this to review so that we can check the queries for this and the other WDQS task T370848.

AndrewTavis_WMDE mentioned this in T370848: [Analytics] [WDQS SEG M1] Identify queries that only retrieve data of a known single entity. Aug 13 2024, 2:47 PM

AndrewTavis_WMDE updated the task description. Aug 13 2024, 2:51 PM

AndrewTavis_WMDE updated the task description.

AndrewTavis_WMDE updated the task description. Aug 13 2024, 2:55 PM

AndrewTavis_WMDE changed the task status from Open to In Progress. Aug 13 2024, 3:04 PM

karapayneWMDE mentioned this in T373033: [EPIC] [WDQS SEG] Segmentation of WDQS queries. Aug 21 2024, 3:33 PM

karapayneWMDE renamed this task from [Analytics] Identify queries that retrieve Items based on a simple statement to [Analytics] [WDQS SEG M1] Identify queries that retrieve Items based on a simple statement. Aug 21 2024, 3:40 PM

AndrewTavis_WMDE updated the task description. Sep 26 2024, 10:40 PM

Task is done! The associated work is in https://phabricator.wikimedia.org/T370851. Closing this one, as the main criterion is met.
