Page MenuHomePhabricator

[Analytics] Create a new data dump of WDQS queries with the new buckets
Closed, ResolvedPublic

Description

Wikidata Analytics Request

This task was generated using the Wikidata Analytics request form. Please use the task template linked on our project page to create issues for the team. Thank you!

Purpose

Please provide as much context as possible as well as what the produced insights or services will be used for.

We want to get an understanding of how many simple and medium queries are run on the QS and what sub type. This will help us better work on a solution to move people over and reduce the load on the QS.

Specific Results

Please detail the specific results that the task should deliver.

Create a new data dump of WDQS queries of the last 3 months with the following classifications. In addition have a summary table with the total number of queries and the share of each of these types as a %.

SIMPLE

  • Queries only requesting a label, description or alias
  • Queries only requesting some or all statements from one or several entities
  • Queries only checking a known relationship between two entities

MEDIUM

  • Queries only checking for an inverse statement
    • Identifier look up
    • Identifier look up + retrieval
    • Other (if something is inverse statement but does not fit the two buckets above)
  • Queries trying to determine super classes or sub classes
  • Queries only checking if there is any direct relation between two entities
  • Queries that only get the value of a statement from an entity linked from a given entity (one hop)

COMPLEX

  • Everything else

Desired Outputs

Please list the desired outputs of this task.

  • CSV data dump of the queries over the last 3 months (3x30 is perfect)
  • A temporary table in the data lake that has the data dump
    • The queries are available in wmde.tmp_wdqs_query_segments
  • Table as a comment below this task showing the % share of each of the queries in the data dump
    • Table is in the task description below

Class Explanations

The following is an explanation of the various field names within the generated dataset.

Types of query sub-classification types:

  • only: All query statements are of a specific type
  • single: There's only one statement of a specific type
  • includes: At least one statement is of a specific type
Small
  • only_term_statements: labels, descriptions or aliases
  • only_ent_subj_statements: Data from one or more explicit entities
    • Note: The following query types are explicitly not included:
      • only_term_statements
      • includes_instance_or_subclass_statement
      • single_known_relation_statement
      • single_unknown_relation_statement
  • single_known_relation_statement: Checking if a relation exists
Medium
  • single_unknown_relation_statement (no P31 or P279): Checking what the relationship is
    • Note: Needs to be a single statement as the logic for this for more than one gets preventatively complex via overlaps with others
  • single_inverse_statement: Get the subject given a predicate and object
    • Note: Needs to be a single statement as the logic for this for more than one gets preventatively complex via overlaps with others
    • Note: Explicitly does not include includes_instance_or_subclass_statement queries
  • includes_instance_or_subclass_statement (has P31 or P279): These properties are in the query
    • Note: Explicitly does not include single_known_relation_statement queries
  • includes_subj_or_obj_link_statement (two statements): A subject or object is derived in a statement and then used in a second statement
    • Note: Explicitly does not include includes_instance_or_subclass_statement queries
Complex
  • All other queries

Results Breakdown

NOTE: The queries are available in wmde.tmp_wdqs_query_segments
NOTE: Work for this task can be found in T386342_wdqs_query_segmentation_dataset on the wmde/analytics repo on GitLab.
classificationtotal_queries_in_classificationpercent_of_total_queries
complex698,634,32670.32
medium240,082,52424.16
small54,826,6055.52
sub_classificationtotal_queries_in_sub_classificationpercent_of_total_queries
complex_query698,634,32670.32
includes_instance_or_subclass_statement191,771,38719.30
single_inverse_statement35,454,1953.57
only_term_statements29,714,7832.99
only_ent_subj_statements15,784,0481.59
single_known_relation_statement9,327,7740.94
includes_subj_or_obj_link_statement8,553,0830.86
single_unknown_relation_statement4,303,8590.43

Deadline

Please make the time sensitivity of this task clear with a date that it should be completed by. If there is no specific date, then the task will be triaged based on its priority.

21.02.2025


Information below this point is filled out by the task assignee.

Assignee Planning

Sub Tasks

A full breakdown of the steps to complete this task.

  • Clarify exact fields requested in the CSVs
    • query
    • sub_classification
    • classification (simple, medium, complex)
  • Determine relationship between new classifications and those made for the table wmde.wd_query_segments_daily
  • Clarify new classifications being asked for
  • Map out how to derive all classifications and all Spark QL query flags that are needed for boolean classification breakdowns
  • Derive total queries for period so we have a check metric for the classification buckets
  • Set up base queries for all necessary boolean flags
  • Set up downstream queries for query classifications and sub-classifications
  • Create a temporary tables for the results
  • Run process to populate the temporary tables
  • Hand off temporary tables and report classification breakdowns

Estimation

Estimate: 3-4 days
Actual: 5 days (making distinct subsets and running queries)

Data

The tables that will be referenced in this task.

  • discovery.processed_external_sparql_query

Notes

Things that came up during the completion of this task, questions to be answered and follow up tasks.

  • Note

Event Timeline

Hellooo, do we think it'll be possible to have this by end of next week? (or what would be feasible?)

Hellooo 👋 Sorry that this has fallen back given me being out. I'll vouch for this being the top priority for next week so that we can finalize it!

Mapping out my understanding of the classification buckets in relation to those that we made for the table wmde.wd_query_segments_daily.

Note: Editing this message with @Ifrahkhanyaree_WMDE right now :)

Columns for the CSV:

  • query
  • sub_classification
  • classification

SIMPLE

  • Queries only requesting a label, description or alias (total_single_term_statement, but do you want this to be beyond one statement?)
    • Example: Get the label for multiple items
    • The above example would be included. We're including any queries that are ONLY for "terms".
  • Queries only requesting some or all statements from one or several entities
    • We want to be explicitly defining the entities in our queries, not have them expand beyond a preset number
    • Note: Don't do labels above, just have some or all statements from one or several
    • Note: The subject needs to be defined in an explicit manner (QID)
  • Queries only checking a known relationship between two entities (total_single_known_relation_statement)

MEDIUM

  • Queries only checking if there is any direct relation between two entities (We have total_single_unknown_relation_statement, but I'd like to discuss this as we also have total_single_known_relation_statement that we're using above)
    • Note: These would be the batches to include, but specifically we need to make sure that we do not include instance/subclass of queries
    • Note: Don't include P31 and P279
    • Edit: This is only total_single_unknown_relation_statement as known is a category in simple
  • Queries only checking for an inverse statement (total_single_inverse_statement)
    • Identifier look up
    • Identifier look up + retrieval
    • Other (if something is inverse statement but does not fit the two buckets above)
  • Queries trying to determine super classes or sub classes (total_single_instance_or_subclass_statement, and it's fine that these are single statements?)
    • Note: We would want more than one statement here
  • Queries that only get the value of a statement from an entity linked from a given entity (one hop) (As I read it this is new, so we should clarify what this is exactly)
    • Note: Two statements, with the first having a defined predicate and either a subject or object, and then the undefined part is used in the second statement

Example:

SELECT ?q ?x { wd:QID wdt:PID1* ?q . ?q wdt:PID2 ?x }

COMPLEX

  • Determined by the classifications above

@Ifrahkhanyaree_WMDE: The query classification/sub classification dump can be found in wmde.tmp_wdqs_query_segments, which you have access to. The Class Explanations section in the task description details the mapping of your classes over to the identifiers for them as well as the work that was done to make sure that the queries are all in distinct classifications and sub classifications. The Results Breakdown in the task description has the requested tables.

Moving this into review :)

NOTE: Work for this task can be found in T386342_wdqs_query_segmentation_dataset on the wmde/analytics repo on GitLab.

Quick check here @Ifrahkhanyaree_WMDE: Is anything else needed for this task, or are we good to resolve? I'd be happy to make the task for normalizing the queries if you wanted me to outline the discussion that we had the other day?

Helloo sorry been meaning to leave a comment here and resolve it but 7569 things kept popping up. I'd say let's make a follow-up task like we spoke about but keep it in the backlog and refine it together once you're back?

I'd say we can close this ticket, sorry again and thank you!

Helloooo! No need to apologize whatsoever :) T390723 has been created as a follow-up task. Looking forward to looking into this more later!

Resolving this per the comment above 😊