Wikidata Analytics Request
This task was generated using the Wikidata Analytics request form. Please use the task template linked on our project page to create issues for the team. Thank you!
Purpose
Please provide as much context as possible as well as what the produced insights or services will be used for.
The Wikibase Product Platform team is currently working to discover the biggest needs for reuse in Wikidata. In addition, Wikidata is looking for ways to reduce load on the Query Service.
Specific Results
Please detail the specific results that the task should deliver.
- % of queries across a 90-day period that ask for an inverse statement lookup (see the examples listed here, slides 3-6)
Desired Outputs
Please list the desired outputs of this task.
Over the last 90 days (or 3x30 days, since there may be too much data to group over 90 days directly):
- The percentage of queries that were asking for inverse statement lookups
- Result: 7.47%
- Time frame: 90 days from 2024/08/24
- Total queries: 909,166,767 (could be a bit lower due to the use of the processed queries)
- Total inverse statement queries: 67,904,531
- Notebook for analysis: T377704_wd_inverse_statement_queries.ipynb
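As a quick sanity check, the reported percentage can be reproduced from the two totals listed above. This is only a sketch of the arithmetic, not the notebook's code, and the variable names are illustrative:

```python
# Totals as reported in this task (90 days from 2024/08/24).
total_queries = 909_166_767
inverse_statement_queries = 67_904_531

# Share of inverse statement queries, as a percentage.
share = inverse_statement_queries / total_queries * 100

print(f"{share:.2f}%")  # prints 7.47%
```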
Deadline
Please make the time sensitivity of this task clear with a date that it should be completed by. If there is no specific date, then the task will be triaged based on its priority.
04.11.2024
Information below this point is filled out by the task assignee.
Assignee Planning
Sub Tasks
A full breakdown of the steps to complete this task.
- Explore potentially useful tables (DataHub)
- discovery.processed_external_sparql_query
- Using this so that the code for the query segmentation DAG job can be used for an initial segmentation
- Check with stakeholders to make sure the plan is aligned
- Q: Does "Medium Queries" in the section title of the slides mean that we do not want single statement queries?
- A: Can be single statements
- Q: Check on an example query
- A: Was removed, and we need to account for this in subsetting
- Q: Are predicates with wildcards valid for this? No such queries were in the examples.
- A: No need for wildcards
- Derive means of identifying queries of interest
- Look into inverse statement queries from T370853
- See if they cover the needed types of queries
- Derive the total number of inverse statement queries that were made over the last 90 days
Estimation
Estimate: 2 days
Actual: 3 days (lots of delay in getting to this)
Data
The tables that will be referenced in this task.
- discovery.processed_external_sparql_query
Notes
Things that came up during the completion of this task, questions to be answered and follow up tasks.
- Values are not exact, as we can only remove single-statement 'values' or 'bind' queries that mimic inverse statement queries
- We need to be able to assert that there are only four nodes and that the subject node accounts for two of them, one of which is the one in the 'values' or 'bind'
- If there are multiple statements, then we can't guarantee that the relationships are applied correctly solely by counting how many times the subject appears among the nodes
- Exact method is:
- Is a single statement
- Query includes 'values' or 'bind'
- Only four nodes
- Subject node appears twice
- Subject is not an entity (from another conditional)
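The conditions above can be sketched as a single predicate. This is a minimal illustration, not the notebook's implementation: the `ParsedQuery` structure and its field names are hypothetical, the check for 'values' or 'bind' is a plain substring match, and "not an entity" is assumed here to mean that the subject does not start with the `wd:` prefix.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ParsedQuery:
    """Hypothetical, simplified view of one processed query."""
    text: str             # raw SPARQL text
    statement_count: int  # number of statements in the query
    nodes: List[str]      # all nodes, including the one in 'values' or 'bind'
    subject: str          # subject node of the statement


def mimics_inverse_statement(q: ParsedQuery) -> bool:
    """True if the query meets all five conditions listed above."""
    lowered = q.text.lower()
    return (
        q.statement_count == 1                          # is a single statement
        and ("values" in lowered or "bind" in lowered)  # includes 'values' or 'bind'
        and len(q.nodes) == 4                           # only four nodes
        and q.nodes.count(q.subject) == 2               # subject node appears twice
        and not q.subject.startswith("wd:")             # assumption: subject is not an entity
    )


# A 'bind' query whose subject variable is fixed to one item.
bound = ParsedQuery(
    text="SELECT ?o WHERE { BIND(wd:Q42 AS ?s) ?s wdt:P31 ?o }",
    statement_count=1,
    nodes=["?s", "?s", "wdt:P31", "?o"],
    subject="?s",
)

# A plain lookup with an open subject and no 'values' or 'bind'.
plain = ParsedQuery(
    text="SELECT ?s WHERE { ?s wdt:P31 wd:Q5 }",
    statement_count=1,
    nodes=["?s", "wdt:P31", "wd:Q5"],
    subject="?s",
)

print(mimics_inverse_statement(bound))  # True
print(mimics_inverse_statement(plain))  # False
```

Requiring all five conditions at once keeps the check conservative, which matches the note that multi-statement queries cannot be classified reliably by counting subject occurrences alone.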