Wikidata Analytics Request
This task was generated using the Wikidata Analytics request form. Please use the task template linked on our project page to create issues for the team. Thank you!
Purpose
Please provide as much context as possible as well as what the produced insights or services will be used for.
We want to understand how many simple and medium queries are run on WDQS, and of which sub-types. This will help us work on a solution to move people over and reduce the load on WDQS.
Specific Results
Please detail the specific results that the task should deliver.
Create a new data dump of WDQS queries from the last 3 months with the following classifications. In addition, provide a summary table with the total number of queries and the share of each of these types as a percentage.
SIMPLE
- Queries only requesting a label, description or alias
- Queries only requesting some or all statements from one or several entities
- Queries only checking a known relationship between two entities
MEDIUM
- Queries only checking for an inverse statement
  - Identifier look up
  - Identifier look up + retrieval
  - Other (if something is an inverse statement but does not fit the two buckets above)
- Queries trying to determine super classes or sub classes
- Queries only checking if there is any direct relation between two entities
- Queries that only get the value of a statement from an entity linked from a given entity (one hop)
COMPLEX
- Everything else
Desired Outputs
Please list the desired outputs of this task.
- CSV data dump of the queries over the last 3 months (3x30 is perfect) - A temporary table in the data lake that has the data dump
  - The queries are available in wmde.tmp_wdqs_query_segments
- Table as a comment below this task showing the % share of each of the queries in the data dump
  - Table is in the task description below
Class Explanations
The following is an explanation of the various field names within the generated dataset.
Sub-classification names use the following prefixes:
- only: All query statements are of a specific type
- single: There's only one statement of a specific type
- includes: At least one statement is of a specific type
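The three prefixes can be read as simple predicates over the statements of a query. The following is a minimal sketch only: the function name `prefix_flags` and the representation of a query as a list of statement-type labels are hypothetical, and `single` is assumed here to mean "exactly one statement of the target type", which may differ from the logic actually used to build the dataset.

```python
from typing import Iterable


def prefix_flags(statement_types: Iterable[str], target: str) -> dict[str, bool]:
    """Compute hypothetical only/single/includes flags for one statement type.

    statement_types: one label per statement in the query (assumed input shape).
    target: the statement type being tested, e.g. "term".
    """
    types = list(statement_types)
    matches = sum(1 for t in types if t == target)
    return {
        # only: every statement in the query is of the target type
        "only": bool(types) and matches == len(types),
        # single: assumed to mean exactly one statement of the target type
        "single": matches == 1,
        # includes: at least one statement is of the target type
        "includes": matches >= 1,
    }


flags = prefix_flags(["term", "term"], "term")
```

With two term statements and nothing else, `only` and `includes` hold while `single` does not.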
Small
- only_term_statements: labels, descriptions or aliases
- only_ent_subj_statements: Data from one or more explicit entities
  - Note: The following query types are explicitly not included:
    - only_term_statements
    - includes_instance_or_subclass_statement
    - single_known_relation_statement
    - single_unknown_relation_statement
- single_known_relation_statement: Checking if a relation exists
Medium
- single_unknown_relation_statement (no P31 or P279): Checking what the relationship is
  - Note: Needs to be a single statement, as the logic for more than one becomes prohibitively complex due to overlaps with other types
- single_inverse_statement: Get the subject given a predicate and object
  - Note: Needs to be a single statement, as the logic for more than one becomes prohibitively complex due to overlaps with other types
  - Note: Explicitly does not include includes_instance_or_subclass_statement queries
- includes_instance_or_subclass_statement (has P31 or P279): These properties are in the query
  - Note: Explicitly does not include single_known_relation_statement queries
- includes_subj_or_obj_link_statement (two statements): A subject or object is derived in a statement and then used in a second statement
  - Note: Explicitly does not include includes_instance_or_subclass_statement queries
Complex
- All other queries
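One way to make the sub-classifications mutually exclusive is to evaluate the boolean flags in a fixed order of precedence and take the first match. The sketch below is an illustration, not the queries actually run for this task: the `PRECEDENCE` list and `classify` function are hypothetical, and the ordering is an assumption chosen so that it respects the exclusion notes above (pairs that the notes do not mention are ordered arbitrarily).

```python
# Hypothetical precedence order: earlier entries win over later ones.
# Chosen so that each "explicitly does not include" note above holds.
PRECEDENCE = [
    ("only_term_statements", "small"),
    ("single_known_relation_statement", "small"),
    ("includes_instance_or_subclass_statement", "medium"),
    ("single_unknown_relation_statement", "medium"),
    ("single_inverse_statement", "medium"),
    ("includes_subj_or_obj_link_statement", "medium"),
    ("only_ent_subj_statements", "small"),
]


def classify(flags: dict[str, bool]) -> tuple[str, str]:
    """Return (classification, sub_classification) for one query's flags."""
    for sub_classification, classification in PRECEDENCE:
        if flags.get(sub_classification, False):
            return classification, sub_classification
    # Anything that matches no flag falls into the complex bucket.
    return "complex", "complex_query"


example = classify({
    "only_ent_subj_statements": True,
    "includes_instance_or_subclass_statement": True,
})
```

In this example the query is labelled `includes_instance_or_subclass_statement` rather than `only_ent_subj_statements`, matching the note that the latter excludes the former.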
Results Breakdown
classification | total_queries_in_classification | percent_of_total_queries |
---|---|---|
complex | 698,634,326 | 70.32 |
medium | 240,082,524 | 24.16 |
small | 54,826,605 | 5.52 |
sub_classification | total_queries_in_sub_classification | percent_of_total_queries |
---|---|---|
complex_query | 698,634,326 | 70.32 |
includes_instance_or_subclass_statement | 191,771,387 | 19.30 |
single_inverse_statement | 35,454,195 | 3.57 |
only_term_statements | 29,714,783 | 2.99 |
only_ent_subj_statements | 15,784,048 | 1.59 |
single_known_relation_statement | 9,327,774 | 0.94 |
includes_subj_or_obj_link_statement | 8,553,083 | 0.86 |
single_unknown_relation_statement | 4,303,859 | 0.43 |
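The two tables can be cross-checked against each other: the sub-classification counts should roll up to the classification counts, and the percentages follow from the grand total. The snippet below is a small sketch of that check using the counts reported above; the helper names `percent_shares` and `roll_up` are hypothetical, and the mapping of sub-classifications to classifications is taken from the Class Explanations section.

```python
# Counts copied from the sub-classification table above, keyed by
# sub-classification and paired with the parent classification.
SUB_CLASS_COUNTS = {
    "complex_query": ("complex", 698_634_326),
    "includes_instance_or_subclass_statement": ("medium", 191_771_387),
    "single_inverse_statement": ("medium", 35_454_195),
    "only_term_statements": ("small", 29_714_783),
    "only_ent_subj_statements": ("small", 15_784_048),
    "single_known_relation_statement": ("small", 9_327_774),
    "includes_subj_or_obj_link_statement": ("medium", 8_553_083),
    "single_unknown_relation_statement": ("medium", 4_303_859),
}


def percent_shares(counts: dict[str, int]) -> dict[str, float]:
    """Share of each bucket as a percentage of the total, to two decimals."""
    total = sum(counts.values())
    return {name: round(100 * n / total, 2) for name, n in counts.items()}


def roll_up(sub_counts: dict[str, tuple[str, int]]) -> dict[str, int]:
    """Sum sub-classification counts into their parent classification."""
    totals: dict[str, int] = {}
    for classification, n in sub_counts.values():
        totals[classification] = totals.get(classification, 0) + n
    return totals


class_totals = roll_up(SUB_CLASS_COUNTS)
class_shares = percent_shares(class_totals)
sub_shares = percent_shares({k: n for k, (_, n) in SUB_CLASS_COUNTS.items()})
grand_total = sum(class_totals.values())
```

The rolled-up totals and shares reproduce the classification table, which serves as the check metric that every query landed in exactly one bucket.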
Deadline
Please make the time sensitivity of this task clear with a date that it should be completed by. If there is no specific date, then the task will be triaged based on its priority.
21.02.2025
Information below this point is filled out by the task assignee.
Assignee Planning
Sub Tasks
A full breakdown of the steps to complete this task.
- Clarify exact fields requested in the CSVs
  - query
  - sub_classification
  - classification (simple, medium, complex)
- Determine relationship between new classifications and those made for the table wmde.wd_query_segments_daily
- Clarify new classifications being asked for
- Map out how to derive all classifications and all Spark SQL query flags that are needed for boolean classification breakdowns
- Derive total queries for period so we have a check metric for the classification buckets
- Set up base queries for all necessary boolean flags
- Set up downstream queries for query classifications and sub-classifications
- Create temporary tables for the results
- Run process to populate the temporary tables
- Hand off temporary tables and report classification breakdowns
Estimation
Estimate: 3-4 days
Actual: 5 days (making distinct subsets and running queries)
Data
The tables that will be referenced in this task.
- discovery.processed_external_sparql_query
Notes
Things that came up during the completion of this task, questions to be answered and follow up tasks.
- Note