Wikidata Analytics Request
This task was generated using the Wikidata Analytics request form. Please use the task template linked on our project page to create issues for the team. Thank you!
Purpose
Please provide as much context as possible as well as what the produced insights or services will be used for.
T370416: [Analytics] [WDQS SEG M1] Segmentation of queries sent to WDQS
- Identify SPARQL queries that should ideally go to Elasticsearch instead of WDQS.
Specific Results
Please detail the specific results that the task should deliver.
- Identify WDQS SPARQL queries that only retrieve entities based on a simple statement (e.g. Which Item has IMDb ID "tt0133093"?)
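To make the target concrete, here is a minimal sketch of how the example question maps from SPARQL onto the haswbstatement search keyword. The helper name `to_haswbstatement` is hypothetical, the property ID P345 (IMDb ID) is added here for illustration, and values that would need quoting or escaping are not handled.

```python
import re

# The example question as a single-triple SPARQL query (illustrative only).
SPARQL_EXAMPLE = 'SELECT ?item WHERE { ?item wdt:P345 "tt0133093" }'


def to_haswbstatement(property_id: str, value: str) -> str:
    """Build a haswbstatement search keyword for one property/value pair.

    Hypothetical helper: it only checks the property ID shape and does not
    quote or escape the value.
    """
    if not re.fullmatch(r"P\d+", property_id):
        raise ValueError(f"Not a property ID: {property_id!r}")
    return f"haswbstatement:{property_id}={value}"


if __name__ == "__main__":
    print(to_haswbstatement("P345", "tt0133093"))
```

A query of this shape only needs a statement lookup, which is why it is a candidate for Elasticsearch rather than WDQS.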
Desired Outputs
Please list the desired outputs of this task.
- Initial simple identification algorithm that can be used in T370854 (and iteratively improved on later).
Notes
- With Elasticsearch for Wikidata, this should currently work for all properties with the "external identifier", "string", "item", "property", "lexeme", "form" and "sense" datatypes, except published in (P1433) and cites (P2860), which are currently omitted for performance reasons. Class hierarchies also cannot be used.
Open Questions
- Should we only focus on the SPARQL equivalents of haswbstatement, or should we also go for inlabel, wbstatementquantity, hasdescription, or haslabel?
Deadline
Please make the time sensitivity of this task clear with a date that it should be completed by. If there is no specific date, then the task will be triaged based on its priority.
DD.MM.YYYY
Information below this point is filled out by the task assignee.
Assignee Planning
Sub Tasks
A full breakdown of the steps to complete this task.
- Check scope of task
- It says "queries that only retrieve Items", but is this for properties and lexemes as well?
- I ask because the notes say that Elasticsearch also works for the property and lexeme datatypes
- Answer: Also properties and lexemes
- Consider the various ways that simple statements could be defined
- The example is Which Item has IMDb ID "tt0133093"?
- So the query should have a single triple?
- Is there any restriction on the object of the triple, or can it be anything?
- Answer: Single triple, and the object can be anything
- It says that published in (P1433) and cites (P2860) are currently omitted from Elasticsearch for Wikidata for performance reasons
- Should queries referencing these two be conditionally removed from the results?
- So it would be individual triple queries that don't have these two as predicates, or all individual triple queries?
- Answer: We do not need to account for this
- It would be helpful to have more context on the Open Questions section
- "Should we only focus on the SPARQL equivalents of haswbstatement, or should we also go for inlabel, wbstatementquantity, hasdescription, or haslabel?"
- Context: Focusing on haswbstatement from the start
- Explore the idea of deriving simple queries from the number of subject node values in discovery.processed_external_sparql_query
- The general idea is to find queries that have a single unique subject, predicate and object from the processed metadata
- Share baseline identification method
- Check query results with WMF and WMDE stakeholders
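The baseline idea from the sub tasks above (a single unique subject, predicate and object) could be sketched roughly as follows. This is an illustration under assumptions, not the implemented method: rows are modelled as plain Python dicts, and the `predicateNode` and `objectNode` field names are assumed by analogy with `triples.subjectNode.nodeValue`. The function name and sample values are hypothetical.

```python
from typing import Any

Triple = dict[str, dict[str, Any]]


def is_simple_statement_query(triples: list[Triple]) -> bool:
    """Return True if the query has one unique subject, predicate and object.

    Assumes each triple exposes subjectNode, predicateNode and objectNode,
    each carrying a nodeValue (only subjectNode is confirmed by the task).
    """
    subjects = {t["subjectNode"]["nodeValue"] for t in triples}
    predicates = {t["predicateNode"]["nodeValue"] for t in triples}
    objects = {t["objectNode"]["nodeValue"] for t in triples}
    # An empty triple list yields empty sets and is therefore rejected.
    return len(subjects) == 1 and len(predicates) == 1 and len(objects) == 1


# Hypothetical sample rows for illustration.
single_triple = [
    {
        "subjectNode": {"nodeValue": "item"},
        "predicateNode": {"nodeValue": "http://www.wikidata.org/prop/direct/P345"},
        "objectNode": {"nodeValue": "tt0133093"},
    }
]
two_triples = single_triple + [
    {
        "subjectNode": {"nodeValue": "item"},
        "predicateNode": {"nodeValue": "http://www.wikidata.org/prop/direct/P31"},
        "objectNode": {"nodeValue": "http://www.wikidata.org/entity/Q11424"},
    }
]

if __name__ == "__main__":
    print(is_simple_statement_query(single_triple))
    print(is_simple_statement_query(two_triples))
```

Counting unique values rather than rows means a query that repeats the same triple still counts as simple. Whether that is desirable is a design choice to confirm with stakeholders.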
Estimation
Estimate: 1 - 2 days (depending on the query specifics clarified in the scope check)
Actual: 1.5 days
Data
The tables that will be referenced in this task.
- discovery.processed_external_sparql_query
- triples.subjectNode.nodeValue will be used to derive the number of triples and their characteristics
- Note: We have to use this table as these results will be combined with the results of T370848 into a single DAG, so it doesn't make sense to mix two source tables
Notes
Things that came up during the completion of this task, questions to be answered and follow up tasks.
- Note