== Wikidata Analytics Request ==
> This task was generated using the [Wikidata Analytics](https://phabricator.wikimedia.org/project/profile/5408) request form. Please use the task template linked on our project page to create issues for the team. Thank you!
=== Purpose ===
> Please provide as much context as possible as well as what the produced insights or services will be used for.
{T370416}
* Identify SPARQL queries that should ideally go to [[ https://www.wikidata.org/wiki/Wikidata:Data_access#Search | Elasticsearch ]] instead of WDQS.
=== Specific Results ===
> Please detail the specific results that the task should deliver.
* Identify WDQS SPARQL queries that only retrieve entities based on a simple statement (e.g. Which Item has IMDb ID "tt0133093"?)
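To illustrate the kind of query in scope, the IMDb example might look as follows in SPARQL, alongside a `haswbstatement` search equivalent. This is a minimal sketch rather than part of the deliverable: `P345` is assumed to be the IMDb ID property, and the helper `to_haswbstatement` is a hypothetical name used only for illustration.

```python
# SPARQL form of 'Which Item has IMDb ID "tt0133093"?' as a single triple pattern.
# Assumption: P345 is the IMDb ID property.
SPARQL_QUERY = """
SELECT ?item WHERE {
  ?item wdt:P345 "tt0133093" .
}
"""


def to_haswbstatement(property_id: str, value: str) -> str:
    """Build the search keyword for one statement (hypothetical helper)."""
    return f"haswbstatement:{property_id}={value}"


# The same lookup expressed as a search keyword.
print(to_haswbstatement("P345", "tt0133093"))
```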
=== Desired Outputs ===
> Please list the desired outputs of this task.
[x] Initial simple identification algorithm that can be used in T370854 (and iteratively improved on later).
- Notebook: [T370853_single_statement_wd_query_use.ipynb](https://gitlab.wikimedia.org/repos/wmde/analytics/-/blob/main/tasks/wikidata/2024/T370853_single_statement_wd_query_use/T370853_single_statement_wd_query_use.ipynb?ref_type=heads)
=== Notes ===
* The Elasticsearch-based search for Wikidata should currently work for all properties with the "external identifier", "string", "item", "property", "lexeme", "form" and "sense" datatypes, except published in (`P1433`) and cites (`P2860`), which are currently omitted for performance reasons. Note also that it cannot make use of the class hierarchy.
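The restrictions in this note could be expressed as a small eligibility check. This is only a sketch under assumptions: the datatype strings are assumed to be the Wikibase identifiers corresponding to the datatypes named above, and `is_searchable` is a hypothetical helper, not something the task requires.

```python
# Assumption: Wikibase datatype identifiers matching the datatypes named in the note.
SUPPORTED_DATATYPES = {
    "external-id",
    "string",
    "wikibase-item",
    "wikibase-property",
    "wikibase-lexeme",
    "wikibase-form",
    "wikibase-sense",
}

# published in (P1433) and cites (P2860) are omitted for performance reasons.
EXCLUDED_PROPERTIES = {"P1433", "P2860"}


def is_searchable(property_id: str, datatype: str) -> bool:
    """Return True if a statement on this property could be answered via search."""
    return datatype in SUPPORTED_DATATYPES and property_id not in EXCLUDED_PROPERTIES
```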
=== Open Questions ===
* Should we only focus on the SPARQL equivalents of `haswbstatement`, or should we also go for `inlabel`, `wbstatementquantity`, `hasdescription`, or `haslabel`?
=== Deadline ===
> Please make the time sensitivity of this task clear with a date that it should be completed by. If there is no specific date, then the task will be triaged based on its priority.
DD.MM.YYYY
---
**Information below this point is filled out by the task assignee.**
== Assignee Planning ==
=== Sub Tasks ===
> A full breakdown of the steps to complete this task.
[x] Check scope of task
1. It says "queries that only retrieve Items", but does this also cover properties and lexemes?
- Asking because the notes say that Elasticsearch also works for the property and lexeme datatypes
- Answer: Also properties and lexemes
2. Considering the various ways that simple statements could be defined
- The example is `Which Item has IMDb ID "tt0133093"?`
- So the query should have a single triple?
- Is there any restriction on the object of the triple, or can it be anything?
- Answer: Single triple, and the object can be anything
3. It says that published in (`P1433`) and cites (`P2860`) are currently omitted from Elasticsearch for Wikidata for performance reasons
- Should queries referencing these two be conditionally removed from the results?
- That is, should the results be single-triple queries that don't have these two as predicates, or all single-triple queries?
- Answer: We do not need to account for this
4. It would be helpful to have more context on the `Open Questions` section
- "Should we only focus on the SPARQL equivalents of `haswbstatement`, or should we also go for `inlabel`, `wbstatementquantity`, `hasdescription`, or `haslabel`?"
- Context: Focusing on `haswbstatement` from the start
[x] Explore the idea of deriving simple queries from the number of subject node values in `discovery.processed_external_sparql_query`
- The general idea is to find queries that have a single unique subject, predicate and object from the processed metadata
[ ] Check query results with WMF and WMDE stakeholders
[x] Share baseline identification method
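
The general idea from the sub tasks above (a single unique subject, predicate and object in the processed metadata) can be sketched in plain Python. This is a hedged illustration, not the notebook's implementation: it assumes each query carries a `triples` list whose entries have `subjectNode`, `predicateNode` and `objectNode` fields with a `nodeValue` (only `triples.subjectNode.nodeValue` is named in this task; the other two fields are assumptions), and `is_single_statement_query` is a hypothetical name.

```python
from typing import Any


def is_single_statement_query(triples: list[dict[str, Any]]) -> bool:
    """Return True if the triples have exactly one unique subject, predicate and object."""
    if not triples:
        return False
    for node in ("subjectNode", "predicateNode", "objectNode"):
        # Collect the distinct node values for this position across all triples.
        values = {triple[node]["nodeValue"] for triple in triples}
        if len(values) != 1:
            return False
    return True


# Placeholder records mimicking the assumed structure of the processed metadata.
simple = [
    {
        "subjectNode": {"nodeValue": "item"},
        "predicateNode": {"nodeValue": "wdt:P345"},
        "objectNode": {"nodeValue": "tt0133093"},
    }
]
complex_query = simple + [
    {
        "subjectNode": {"nodeValue": "item"},
        "predicateNode": {"nodeValue": "wdt:P31"},
        "objectNode": {"nodeValue": "wd:Q11424"},
    }
]

print(is_single_statement_query(simple))
print(is_single_statement_query(complex_query))
```

In the actual table the same check would presumably be an aggregation over the `triples` column; the plain-Python form is used here only to keep the logic easy to read.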
=== Estimation ===
Estimate: 1-2 days (depending on the query specifics determined in the scope check)
Actual: 1.5 days
=== Data ===
> The tables that will be referenced in this task.
- `discovery.processed_external_sparql_query`
- `triples.subjectNode.nodeValue` will be used to derive the number of triples and their characteristics
- Note: We have to use this table because these results will be combined with those of T370848 in a single DAG, so mixing two source tables doesn't make sense
=== Notes ===
> Things that came up during the completion of this task, questions to be answered and follow up tasks.
- Note