
[Analytics] [WDQS SEG M1] Identify queries that only retrieve data of a known single entity
Closed, Resolved, Public

Description

Wikidata Analytics Request

This task was generated using the Wikidata Analytics request form. Please use the task template linked on our project page to create issues for the team. Thank you!

Purpose

Please provide as much context as possible as well as what the produced insights or services will be used for.

Related epic: T370416: [Analytics] [WDQS SEG M1] Segmentation of queries sent to WDQS

Specific Results

Please detail the specific results that the task should deliver.

  • The set of queries that belong to any of the following three sets:
    1. Only one QID (wd:Q or <http://www.wikidata.org/entity/Q) in the subject position
    2. Only one PID (wd:P or <http://www.wikidata.org/entity/P) in the subject position
    3. Only one LID (wd:L or <http://www.wikidata.org/entity/L) in the subject position
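
To make the three sets concrete, here is a minimal sketch of how a subject value could be classified as a QID, PID or LID. The names `classify_entity`, `ENTITY_RE` and `KIND_BY_LETTER` are hypothetical, and the sketch assumes subjects appear either in the prefixed form (wd:Q42) or as a full entity IRI, as listed above. Sub-entity IDs such as lexeme forms are deliberately not matched.

```python
import re

# Matches either the prefixed form (wd:Q42) or the full IRI form
# (<http://www.wikidata.org/entity/Q42>, angle brackets optional).
ENTITY_RE = re.compile(
    r"^(?:wd:|<?http://www\.wikidata\.org/entity/)([QPL])([0-9]+)>?$"
)

# Assumed mapping from the ID letter to the entity kind.
KIND_BY_LETTER = {"Q": "item", "P": "property", "L": "lexeme"}


def classify_entity(node_value):
    """Return (kind, entity_id) for a Wikidata entity reference, else None."""
    match = ENTITY_RE.match(node_value.strip())
    if match is None:
        return None
    letter, digits = match.groups()
    return KIND_BY_LETTER[letter], letter + digits
```

Variables such as ?item and other IRIs return None, so they can be told apart from the three entity kinds.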

Desired Outputs

Please list the desired outputs of this task.

  • Initial simple identification algorithm that can be used in T370851 (and iteratively improved on later).
    • We can use the table discovery.processed_external_sparql_query
      • This table has a triples column that contains all triples for a given query
      • triples.subjectNode.nodeValue is the subject of any triple
      • For this task we're getting queries that have a single unique subject across all of their triples, where that subject is a Wikidata entity (QID, PID, LID)
    • Notebook: T370848_discovery_single_wd_entity_query_use.ipynb
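
The identification rule described above could be sketched as follows. This is not the notebook's implementation: it assumes each query's triples are available as a list of dictionaries that mirror the triples.subjectNode.nodeValue path, and that entity subjects are stored in the prefixed form wd:Q..., wd:P... or wd:L.... The function name `single_entity_subject` is hypothetical.

```python
import re

# Assumed stored form of an entity subject: wd:Q42, wd:P31, wd:L1.
ENTITY_ID_RE = re.compile(r"^wd:[QPL][0-9]+$")


def single_entity_subject(triples):
    """Return the entity ID if all triples share one Wikidata entity subject.

    Returns None when there are no triples, more than one distinct subject,
    or the only subject is not a Wikidata entity (for example a variable).
    """
    subjects = {triple["subjectNode"]["nodeValue"] for triple in triples}
    if len(subjects) != 1:
        return None
    (subject,) = subjects
    return subject if ENTITY_ID_RE.match(subject) else None
```

Note that this sketch accepts queries with several triples as long as they all share the same entity subject, which is one of the open decisions for this task.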

Deadline

Please make the time sensitivity of this task clear with a date that it should be completed by. If there is no specific date, then the task will be triaged based on its priority.

DD.MM.YYYY


Information below this point is filled out by the task assignee.

Assignee Planning

Sub Tasks

A full breakdown of the steps to complete this task.

  • Check scope of task
    • Entity is an item, property or lexeme in this case? (Wikidata:Identifiers)
    • Does "only one max" mean that a query for a property of an item doesn't qualify?
    • The above assumptions were not correct, so the specific results have been edited to be clearer
  • Reorientation to SPARQL query data
  • Explore ways that entities can be included in a SPARQL query
  • Check with team on ideas
  • Write queries to identify one instance of an entity
  • Write full query that combines all WHERE clauses
  • Check full query results with WMF
    • Decision was that a regex approach based on the full query is too complicated, so we'll use the triples from the parsed query data
  • Write full query that derives queries to single entities
  • Share baseline identification method
  • Check new full query results with WMF and stakeholders
    • We need to remove queries that have wildcards, i.e. property path operators (*, + and /)
      • It's unclear whether these are modeled in the parsed data; if not, we can remove them by matching the query text with RLIKE 'wdt:P[0-9]+[*+\/]'
    • We should check to see if we need to remove queries that have entities in their triple objects
    • A decision needs to be made on whether queries with multiple triples for a single entity are included here
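
As an illustration of the text-based filter mentioned above, here is a small sketch that applies the same pattern with Python's `re` module. `re.search` is used because RLIKE matches anywhere in the string. The function name `has_path_operator` is hypothetical.

```python
import re

# Same pattern as RLIKE 'wdt:P[0-9]+[*+\/]'; "/" needs no escaping in Python.
PATH_OPERATOR_RE = re.compile(r"wdt:P[0-9]+[*+/]")


def has_path_operator(query_text):
    """Return True if a wdt: property is directly followed by *, + or /."""
    return PATH_OPERATOR_RE.search(query_text) is not None
```

The pattern only catches operators written directly after a wdt: property. A path with whitespace before the operator, or one that uses another prefix such as p:, would not be matched, so this is a baseline rather than a complete check.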

Estimation

Estimate: 2 days
Actual: 3 days (needed to pivot to a new data source after day 1)

Data

The tables that will be referenced in this task.

  • event.wdqs_external_sparql_query
    • The first approach with this table was not successful, so we moved to discovery.processed_external_sparql_query
      • We can use triples.subjectNode.nodeValue of this table
      • Checking that this is an entity ID (wd:[QPL][0-9]+) and that each query id is associated with one such entity ID should be what we're after
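
The per-query check described above could look roughly like this once the triples are flattened to one row per subject. This is a sketch on plain Python data, not on the actual table: the input shape of (query id, subject value) pairs and the name `queries_with_single_entity` are assumptions.

```python
import re
from collections import defaultdict

# Entity ID pattern taken from the note above.
ENTITY_ID_RE = re.compile(r"^wd:[QPL][0-9]+$")


def queries_with_single_entity(rows):
    """Map each qualifying query id to its single entity subject.

    rows is an iterable of (query_id, subject_node_value) pairs, one pair
    per triple of the query.
    """
    subjects_by_query = defaultdict(set)
    for query_id, subject in rows:
        subjects_by_query[query_id].add(subject)

    result = {}
    for query_id, subjects in subjects_by_query.items():
        if len(subjects) != 1:
            continue  # more than one distinct subject in this query
        (subject,) = subjects
        if ENTITY_ID_RE.match(subject):
            result[query_id] = subject
    return result
```

Grouping on the distinct set of subjects, rather than on matching subjects only, ensures that a query with one entity subject plus a variable subject is excluded.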

Notes

Things that came up during the completion of this task, questions to be answered and follow up tasks.

  • Note that the user can also use VALUES to substitute an entity into part of a query
    • This makes it harder to tell whether the entity is actually the subject of the query or the object, in which case we would not want to include the query in the results
    • See example below
VALUES (?entity) { (wd:Q123) }
?entity rdfs:label ?labels
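
One way to handle the VALUES case would be to flag such queries from their text so they can be reviewed or excluded separately. The following is only a rough sketch: the name `binds_entity_via_values` is hypothetical, only the prefixed wd: form is detected, and it does not determine whether the bound variable ends up in the subject or object position.

```python
import re

# VALUES keyword (not part of a variable name such as ?values), followed by
# a braced data block that contains at least one wd: entity ID.
VALUES_ENTITY_RE = re.compile(
    r"(?<![?$\w])VALUES\b[^{]*\{[^}]*\bwd:[QPL][0-9]+[^}]*\}",
    re.IGNORECASE,
)


def binds_entity_via_values(query_text):
    """Return True if the query text binds a wd: entity in a VALUES block."""
    return VALUES_ENTITY_RE.search(query_text) is not None
```

Queries flagged this way have a variable as their parsed subject, so a check based only on the subject value would miss them.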

Event Timeline

Manuel updated the task description.

Exploration is done and I've asked about the methodology in the Data channel on Mattermost :) I've also written the final queries for the cases where my assumptions about the method are correct.

AndrewTavis_WMDE updated the task description.

T370848_discovery_single_wd_entity_query_use.ipynb has the work associated with this task :) Moving this to review so that we can check the queries for this and the other WDQS task T370853.

AndrewTavis_WMDE changed the task status from Open to In Progress. (Aug 13 2024, 3:04 PM)
karapayneWMDE renamed this task from [Analytics] Identify queries that only retrieve data of a known single entity to [Analytics] [WDQS SEG M1] Identify queries that only retrieve data of a known single entity. (Aug 21 2024, 3:40 PM)

Task is done! Closing.