
[Analytics] [WDQS SEG M1] Identify queries that retrieve Items based on a simple statement
Closed, Resolved (Public)

Description

Wikidata Analytics Request

This task was generated using the Wikidata Analytics request form. Please use the task template linked on our project page to create issues for the team. Thank you!

Purpose

Please provide as much context as possible as well as what the produced insights or services will be used for.

T370416: [Analytics] [WDQS SEG M1] Segmentation of queries sent to WDQS

  • Identify SPARQL queries that should ideally go to Elasticsearch instead of WDQS.

Specific Results

Please detail the specific results that the task should deliver.

  • Identify WDQS SPARQL queries that only retrieve entities based on a simple statement (e.g. Which Item has IMDb ID "tt0133093"?)

Desired Outputs

Please list the desired outputs of this task.

Notes

  • With Elasticsearch for Wikidata, this should currently work for all properties with the "external identifier", "string", "item", "property", "lexeme", "form" and "sense" datatypes, except published in (P1433) and cites (P2860), which are currently omitted for performance reasons. Also, Elasticsearch cannot make use of the class hierarchy.
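
To make the "simple statement" idea concrete, here is a minimal sketch of the IMDb ID example expressed both as a single-triple SPARQL query and as the search keyword that could replace it. The helper names are hypothetical, the property ID P345 (IMDb ID) is not stated in this task, and the `wdt:` query shape is only one of several ways such a query might be written.

```python
def sparql_for_statement(property_id: str, value: str, value_is_entity: bool = False) -> str:
    """Build a single-triple SPARQL query for one property/value pair (hypothetical helper)."""
    # Entity values (e.g. Q42) use the wd: prefix; anything else is treated as a string literal.
    # Assumption: the value contains no characters that need escaping.
    obj = f"wd:{value}" if value_is_entity else f'"{value}"'
    return f"SELECT ?item WHERE {{ ?item wdt:{property_id} {obj} }}"


def search_for_statement(property_id: str, value: str) -> str:
    """Build the haswbstatement search keyword for the same statement (hypothetical helper)."""
    return f"haswbstatement:{property_id}={value}"


# Which Item has IMDb ID "tt0133093"? (P345 is assumed to be the IMDb ID property.)
sparql = sparql_for_statement("P345", "tt0133093")
search = search_for_statement("P345", "tt0133093")
print(sparql)
print(search)
```

Both strings describe the same lookup, which is why queries of this shape are candidates for Elasticsearch rather than WDQS.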

Open Questions

  • Should we only focus on the SPARQL equivalents of haswbstatement, or should we also go for inlabel, wbstatementquantity, hasdescription, or haslabel?

Deadline

Please make the time sensitivity of this task clear with a date that it should be completed by. If there is no specific date, then the task will be triaged based on its priority.

DD.MM.YYYY


Information below this point is filled out by the task assignee.

Assignee Planning

Sub Tasks

A full breakdown of the steps to complete this task.

  • Check scope of task
  1. It says "queries that only retrieve Items", but is this for properties and lexemes as well?
    • I ask because in the notes it says that Elasticsearch also works for property and lexeme data types
    • Answer: Also properties and lexemes
  2. Considering the various ways that simple statements could be defined
    • The example is Which Item has IMDb ID "tt0133093"?
    • So the query should have a single triple?
    • Is there any restriction on the object of the triple, or can it be anything?
    • Answer: Single triple, and the object can be anything
  3. It says that published in (P1433) and cites (P2860) are currently omitted from Elasticsearch for Wikidata for performance reasons
    • Should queries referencing these two be conditionally removed from the results?
    • So it would be individual triple queries that don't have these two as predicates, or all individual triple queries?
    • Answer: We do not need to account for this
  4. It would be helpful to have more context on the Open Questions section
    • "Should we only focus on the SPARQL equivalents of haswbstatement, or should we also go for inlabel, wbstatementquantity, hasdescription, or haslabel?"
    • Context: Focusing on haswbstatement from the start
  • Explore idea of deriving simple queries via the number of subject node values from discovery.processed_external_sparql_query
    • The general idea is to find queries that have a single unique subject, predicate and object from the processed metadata
  • Share baseline identification method
  • Check query results with WMF and WMDE stakeholders
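
The single-subject, single-predicate, single-object idea from the sub tasks above can be sketched in plain Python. This is only an illustration, not the method used in the notebook: it assumes each query's `triples` field is a list of records shaped like `triples.subjectNode.nodeValue`, and the `predicateNode` and `objectNode` field names are assumptions modeled on that one documented field.

```python
from typing import Any

# Assumed field names; only subjectNode.nodeValue is named in this task.
NODE_KEYS = ("subjectNode", "predicateNode", "objectNode")


def unique_node_values(triples: list[dict[str, Any]], node_key: str) -> set[str]:
    """Collect the distinct nodeValue entries for one node position."""
    return {triple[node_key]["nodeValue"] for triple in triples}


def is_simple_statement_query(triples: list[dict[str, Any]]) -> bool:
    """True if the query has exactly one unique subject, predicate and object."""
    if not triples:
        return False
    return all(len(unique_node_values(triples, key)) == 1 for key in NODE_KEYS)


# Hypothetical processed metadata for: SELECT ?item WHERE { ?item wdt:P345 "tt0133093" }
simple_query = [
    {
        "subjectNode": {"nodeValue": "?item"},
        "predicateNode": {"nodeValue": "wdt:P345"},
        "objectNode": {"nodeValue": '"tt0133093"'},
    }
]

# The same query with a second triple pattern is no longer a simple statement.
two_triple_query = simple_query + [
    {
        "subjectNode": {"nodeValue": "?item"},
        "predicateNode": {"nodeValue": "wdt:P31"},
        "objectNode": {"nodeValue": "wd:Q11424"},
    }
]

print(is_simple_statement_query(simple_query))
print(is_simple_statement_query(two_triple_query))
```

In practice the same check would presumably run as a filter over discovery.processed_external_sparql_query rather than over in-memory dictionaries.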

Estimation

Estimate: 1 - 2 days (depending on the query specifics determined in the scope check)
Actual: 1.5 days

Data

The tables that will be referenced in this task.

  • discovery.processed_external_sparql_query
    • triples.subjectNode.nodeValue will be used to derive the number of triples and their characteristics
    • Note: We have to use this table as these results will be combined with the results of T370848 into a single DAG, so it doesn't make sense to mix two source tables

Notes

Things that came up during the completion of this task, questions to be answered and follow up tasks.

  • Note

Event Timeline

Manuel moved this task from Incoming to To-Do on the Wikidata Analytics (Kanban) board.

T370853_single_statement_wd_query_use.ipynb has the work associated with this task :) Moving this to review so that we can check the queries for this and the other WDQS task T370848.

AndrewTavis_WMDE changed the task status from Open to In Progress. Aug 13 2024, 3:04 PM
karapayneWMDE renamed this task from [Analytics] Identify queries that retrieve Items based on a simple statement to [Analytics] [WDQS SEG M1] Identify queries that retrieve Items based on a simple statement. Aug 21 2024, 3:40 PM

Task is done! The associated work is in https://phabricator.wikimedia.org/T370851. Closing this one as the main criteria are met.