Wikidata Analytics Request
This task was generated using the Wikidata Analytics request form. Please use the task template linked on our project page to create issues for the team. Thank you!
Purpose
Please provide as much context as possible as well as what the produced insights or services will be used for.
Related epic: T370416: [Analytics] [WDQS SEG M1] Segmentation of queries sent to WDQS
- Identify SPARQL queries that should ideally go to the Wikibase REST API, the Linked Data Interface (URI) or the MediaWiki Action API instead of WDQS.
- Note: See queripulator for an earlier example of this
Specific Results
Please detail the specific results that the task should deliver.
- The set of queries made up of the following three subsets:
- Only one QID (wd:Q or <http://www.wikidata.org/entity/Q) in the subject position
- Only one PID (wd:P or <http://www.wikidata.org/entity/P) in the subject position
- Only one LID (wd:L or <http://www.wikidata.org/entity/L) in the subject position
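The two notations listed above (prefixed name and full URI) can be recognized with one pattern. The following is a minimal sketch, not the method used in the notebook; the names `ENTITY_TERM_RE`, `KIND_BY_LETTER` and `classify_entity` are hypothetical, and it assumes full URIs appear wrapped in angle brackets.

```python
import re

# Assumption: an entity reference is either a prefixed name (wd:Q42) or a
# full URI in angle brackets (<http://www.wikidata.org/entity/Q42>).
ENTITY_TERM_RE = re.compile(
    r"wd:(?P<prefixed>[QPL][0-9]+)"
    r"|<http://www\.wikidata\.org/entity/(?P<full>[QPL][0-9]+)>"
)

KIND_BY_LETTER = {"Q": "QID", "P": "PID", "L": "LID"}


def classify_entity(term):
    """Return (kind, entity_id) for a Wikidata entity reference, else None."""
    match = ENTITY_TERM_RE.fullmatch(term.strip())
    if match is None:
        return None
    entity_id = match.group("prefixed") or match.group("full")
    return KIND_BY_LETTER[entity_id[0]], entity_id
```

Terms such as variables (`?item`) or truthy property references (`wdt:P31`) return `None`, so they do not count as entities in the subject position.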
Desired Outputs
Please list the desired outputs of this task.
- Initial simple identification algorithm that can be used in T370851 (and iteratively improved on later).
- We can use the table discovery.processed_external_sparql_query
- This table has a triples column that contains all triples for a given query
- triples.subjectNode.nodeValue is the subject of any triple
- For this task we're selecting queries whose triples all share a single unique subject that is also a Wikidata entity (QID, PID or LID)
- Notebook: T370848_discovery_single_wd_entity_query_use.ipynb
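The identification rule described above can be sketched in plain Python, independent of the actual table access. This is an illustration under assumptions: the input is a hypothetical mapping from query id to the list of `triples.subjectNode.nodeValue` values for that query, and the exact format of those values (prefixed name or bracketed URI) is assumed rather than confirmed from the table.

```python
import re

# Assumption: subject node values look like wd:Q42 or
# <http://www.wikidata.org/entity/Q42>.
ENTITY_SUBJECT_RE = re.compile(
    r"wd:[QPL][0-9]+|<http://www\.wikidata\.org/entity/[QPL][0-9]+>"
)


def single_entity_subject(subjects):
    """Return the subject if every triple shares one entity subject, else None."""
    unique = set(subjects)
    if len(unique) != 1:
        return None
    (subject,) = unique
    return subject if ENTITY_SUBJECT_RE.fullmatch(subject) else None


def select_single_entity_queries(subjects_by_query):
    """Map query id -> entity for the queries that qualify."""
    selected = {}
    for query_id, subjects in subjects_by_query.items():
        entity = single_entity_subject(subjects)
        if entity is not None:
            selected[query_id] = entity
    return selected


# Hypothetical example data: only q1 has a single entity as its sole subject.
example = {
    "q1": ["wd:Q42", "wd:Q42"],
    "q2": ["wd:Q42", "?item"],
    "q3": ["?item"],
}
print(select_single_entity_queries(example))
```

Note that this sketch keeps queries with several triples on the same subject; whether those should count is one of the open decisions listed under Sub Tasks.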
Deadline
Please make the time sensitivity of this task clear with a date that it should be completed by. If there is no specific date, then the task will be triaged based on its priority.
DD.MM.YYYY
Information below this point is filled out by the task assignee.
Assignee Planning
Sub Tasks
A full breakdown of the steps to complete this task.
- Check scope of task
- Entity is an item, property or lexeme in this case? (Wikidata:Identifiers)
- Does "only one max" mean that a query for a property of an item doesn't qualify?
- The above assumptions were not correct, so the specific results have been edited to be clearer
- Reorientation to SPARQL query data
- Explore ways that entities can be included in a SPARQL query
- Check with team on ideas
- Write queries to identify one instance of an entity
- Write full query that combines all WHERE clauses
- Check full query results with WMF
- The decision was that a regex approach based on the full query text is too complicated, so we'll use the triples from the parsed query data instead
- Write full query that derives queries to single entities
- Share baseline identification method
- Check new full query results with WMF and stakeholders
- We need to remove queries that use the property path operators *, + and / (referred to here as wildcards)
- It's unclear whether these are modeled in the parsed data; if not, we can remove them by matching the query text with RLIKE 'wdt:P[0-9]+[*+\/]'
- We should check to see if we need to remove queries that have entities in their triple objects
- A decision needs to be made on whether queries with multiple triples for a single entity are included here
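For checking the wildcard filter outside of SQL, the RLIKE pattern from the sub tasks translates directly to Python's `re` module. This is a minimal sketch; `PATH_OPERATOR_RE` and `has_path_operator` are hypothetical names, and the pattern is the one quoted above rather than a complete property path detector.

```python
import re

# Same pattern as the RLIKE expression: a wdt: property immediately
# followed by *, + or /. The slash needs no escaping in Python.
PATH_OPERATOR_RE = re.compile(r"wdt:P[0-9]+[*+/]")


def has_path_operator(query_text):
    """Return True if the query text contains a wdt: property path operator."""
    return PATH_OPERATOR_RE.search(query_text) is not None
```

As written, the pattern misses operators separated from the property by whitespace (`wdt:P31 / wdt:P279`) and paths on other prefixes such as `p:` or `ps:`, so it is best treated as a first pass.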
Estimation
Estimate: 2 days
Actual: 3 days (needed to pivot to a new data source after day 1)
Data
The tables that will be referenced in this task.
- event.wdqs_external_sparql_query
- First approach was not successful, so moving to discovery.processed_external_sparql_query
- We can use triples.subjectNode.nodeValue of this table
- Checking that this is an entity ID (wd:[QPL][0-9]+) and that each query id is associated with one such entity ID should be what we're after
Notes
Things that came up during the completion of this task, questions to be answered and follow up tasks.
- Note that the user can also use VALUES to substitute an entity into part of a query
- This makes it harder to determine whether the entity is actually the subject of the query or is in fact the object, in which case we would not want to include the query in the results
- See example below
VALUES (?entity) { (wd:Q123) } ?entity rdfs:label ?labels
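One possible way to handle this case is to flag queries that contain a VALUES clause so they can be reviewed separately. The sketch below is an assumption about how that could be done on the raw query text, not part of the agreed method; `VALUES_RE` and `uses_values` are hypothetical names.

```python
import re

# SPARQL keywords are case-insensitive. The lookbehind skips variables
# such as ?values or $values and identifiers that merely end in "values".
VALUES_RE = re.compile(r"(?<![?$\w])VALUES\b", re.IGNORECASE)


def uses_values(query_text):
    """Return True if the query text appears to contain a VALUES clause."""
    return VALUES_RE.search(query_text) is not None
```

Because this works on the text alone, it can also fire on the word inside a string literal or comment, so flagged queries would still need a closer look.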