
[Analytics] % of inverse statement look up queries
Closed, Resolved, Public

Description

Wikidata Analytics Request

This task was generated using the Wikidata Analytics request form. Please use the task template linked on our project page to create issues for the team. Thank you!

Purpose

Please provide as much context as possible as well as what the produced insights or services will be used for.

The Wikibase Product Platform team is currently working on discovering the biggest needs for reuse in Wikidata. In addition, Wikidata is trying to find ways to reduce load on the Query Service.

Specific Results

Please detail the specific results that the task should deliver.

  • % of queries across a 90 day period which are asking for inverse statement lookup (see examples listed here slides 3-6)

Desired Outputs

Please list the desired outputs of this task.

Over the last 90 days (or 3x30 days, since the data might be too much to group directly over 90 days):

  • The percentage of queries that were asking for inverse statement lookups
  • Result: 7.47%
  • Time frame: 90 days from 2024/08/24
  • Total queries: 909,166,767 (could be a bit lower due to the use of the processed queries)
  • Total inverse statement queries: 67,904,531
  • Notebook for analysis: T377704_wd_inverse_statement_queries.ipynb
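
The reported percentage can be checked from the two totals above. The following is a minimal sketch, not the code from the notebook: the function name `inverse_share` and the list-of-windows input are assumptions made for illustration, chosen so that three 30 day windows could be combined in the same way as a single 90 day window.

```python
def inverse_share(windows):
    """Return the percentage of inverse statement queries across all windows.

    Each window is a (total_queries, inverse_queries) pair. Counts are summed
    before dividing so that larger windows weigh proportionally more.
    """
    total = sum(t for t, _ in windows)
    inverse = sum(i for _, i in windows)
    if total == 0:
        raise ValueError("no queries in the given windows")
    return 100.0 * inverse / total


# Totals reported in this task, treated here as a single 90 day window.
reported = [(909_166_767, 67_904_531)]
share = inverse_share(reported)
print(f"{share:.2f}%")  # prints 7.47%
```

Summing the counts first, rather than averaging three per-window percentages, keeps the result correct when the windows contain different numbers of queries.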

Deadline

Please make the time sensitivity of this task clear with a date that it should be completed by. If there is no specific date, then the task will be triaged based on its priority.

04.11.2024


Information below this point is filled out by the task assignee.

Assignee Planning

Sub Tasks

A full breakdown of the steps to complete this task.

  • Explore potentially useful tables (DataHub)
    • discovery.processed_external_sparql_query
    • Using this so that the code for the query segmentation DAG job can be used for an initial segmentation
  • Check with stakeholders to make sure plan is in alignment
    • Q: Does "Medium Queries" in the section title of the slides mean that we do not want single statement queries?
      • A: Can be single statements
    • Q: Check on an example query
      • A: Was removed, and we need to account for this in subsetting
    • Q: Are predicates with wildcards valid for this? No such queries were in the examples?
      • A: No need for wildcards
  • Derive means of identifying queries of interest
    • Look into inverse statement queries from T370853
    • See if they cover the needed types of queries
  • Derive the total number of inverse statement queries that were made over the last 90 days

Estimation

Estimate: 2 days
Actual: 3 days (lots of delay in getting to this)

Data

The tables that will be referenced in this task.

  • discovery.processed_external_sparql_query

Notes

Things that came up during the completion of this task, questions to be answered and follow up tasks.

  • Values are not exact as we can only remove single statement values or bind queries that mimic inverse statement queries
    • We need to be able to assert that there are only four nodes and that the subject node is two of them, with one then being the values or bind
    • If there are multiple statements, then we can't guarantee that the relationships would be applied correctly solely through counting the number of instances of the subjects in the nodes
    • Exact method is:
      • Is a single statement
      • Query includes 'values' or 'bind'
      • Only four nodes
      • Subject node appears twice
      • Subject is not an entity (from another conditional)
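
The exact method above can be sketched as a single predicate. This is an illustration under stated assumptions, not the notebook's code: the `ParsedQuery` record, its fields, and the `wd:` prefix test in `is_entity` are hypothetical stand-ins for whatever the parsed query table actually exposes.

```python
from dataclasses import dataclass


@dataclass
class ParsedQuery:
    # Hypothetical record shape; the real table schema may differ.
    text: str      # raw SPARQL text
    triples: list  # list of (subject, predicate, object) tuples
    nodes: list    # every node occurrence found in the parsed query


def is_entity(node):
    # Assumption: entities are written with the wd: prefix, variables with "?".
    return node.startswith("wd:")


def mimics_inverse_statement(q):
    """True if a single statement query only looks like an inverse lookup
    because VALUES or BIND supplies the subject."""
    if len(q.triples) != 1:            # is a single statement
        return False
    lowered = q.text.lower()
    if "values" not in lowered and "bind" not in lowered:
        return False                   # query includes 'values' or 'bind'
    if len(q.nodes) != 4:              # only four nodes
        return False
    subject = q.triples[0][0]
    if q.nodes.count(subject) != 2:    # subject node appears twice
        return False
    return not is_entity(subject)      # subject is not an entity


mimic = ParsedQuery(
    text="SELECT ?o WHERE { VALUES ?s { wd:Q42 } ?s wdt:P31 ?o }",
    triples=[("?s", "wdt:P31", "?o")],
    nodes=["?s", "?s", "wdt:P31", "?o"],
)
genuine = ParsedQuery(
    text="SELECT ?s WHERE { ?s wdt:P31 wd:Q5 }",
    triples=[("?s", "wdt:P31", "wd:Q5")],
    nodes=["?s", "wdt:P31", "wd:Q5"],
)
print(mimics_inverse_statement(mimic), mimics_inverse_statement(genuine))
```

Queries for which the predicate returns True would be removed from the inverse statement count, since the subject is effectively fixed to a known QID.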

Event Timeline

@AndrewTavis_WMDE would you happen to have an update on how this is going? Thank you!

Goal is to have this done by EOD tomorrow, @Ifrahkhanyaree_WMDE :) As of now no further explanations are needed, but I'll send along something by tomorrow morning if something comes up.

AndrewTavis_WMDE updated the task description.

Hey @Ifrahkhanyaree_WMDE 👋 Result here is 7.47% of queries for 90 days from 2024/08/24 with the notebook for the work being T377704_wd_inverse_statement_queries.ipynb.

Note that the value here is going to be a bit higher than the true share, because removing values and bind queries that replace the subject is dramatically (prohibitively) more complex for inverse statement queries with more than one statement. The issues are detailed in the notes. The gist is that if the query has more than one statement, then we can't assert that the values or bind operation is being done on the subject (in which case the query should not be counted, as it actually does include a valid QID for the subject). The way that we do this for single statement queries is the following:

  • Check that the query text includes 'values' or 'bind'
  • Assert that there are only four nodes (subject, predicate, object and a fourth node for the values or bind element)
  • Assert that the subject node appears twice (i.e. that the nodes above actually are subject x2, predicate and object meaning that the missing element is the subject again and that's what's being set by values or bind)
  • That the subject is not an entity comes from another conditional

If we expand this to two or more statements, then the "simple" method above gets really messy. I did look into the above with eight nodes and checking that both subjects from a two statement query appear twice in the nodes, but we were getting back queries that did include inverse statements, meaning they shouldn't be removed.

It's hard to estimate the full tail of multi-statement values and bind queries, but with the assumption being that most single statement queries will be a lot of the subject replacement traffic, I'm hoping that the percent above would be enough for an impression here. I can of course look into it more :)

Thanks so much for this @AndrewTavis_WMDE! Since this is a little out of my zone of expertise, I need some more clarification

I did look into the above with eight nodes and checking that both subjects from a two statement query appear twice in the nodes, but we were getting back queries that did include inverse statements meaning they shouldn't be removed.

So, are you saying that I should somehow consider that inverse queries make up more than the 7% listed above, or less? I'm not 100% sure how to interpret this, apologies!

Hello hello! I'd assume that the true value is a bit less than 7%. Finding the other values and bind queries that are not actually inverse statements, so that they can be removed, is proving to be difficult right now, but can be looked into further. All of these values and bind queries would come up as inverse statements at first. I'm assuming that we've removed a good amount by removing the single statement ones, but I'm not sure about those with more than one statement. Can of course look into it if need be :)

Ah got it, thank you. I think this task with its scope is finished. I'd have to think about what other analyses we need in order to understand whether moving these out would make an impact.

I wonder - is there a relatively easy way to know what kinds of queries are run the most? without having a hypothesis?

We can also have this conversation somewhere else and we can consider this done

I wonder - is there a relatively easy way to know what kinds of queries are run the most? without having a hypothesis?

I'd say that what was done in T370853: [Analytics] [WDQS SEG M1] Identify queries that retrieve Items based on a simple statement and related tasks that lead to the values in wmde.wd_query_segments_daily is kind of the baseline that we could work from. By that I mean segmenting based on the structures that are available to us from the parsed query table (link to Data Hub) - so the number of triples, the number of nodes, the operators used, etc. Going from the query text itself is really hard as we can see that people can write queries in different ways that end up with the same results.

We can certainly discuss if there are metadata structures that could be explored further. Writing regular expressions to parse the queries themselves though should be avoided as there are so many variations that the subsets get quite messy.
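
As a rough illustration of segmenting on structure rather than on query text, the sketch below counts how often each structural signature occurs. The record fields (`n_triples`, `n_nodes`, `operators`) and the toy records are made up for this example and are not the schema of the parsed query table or of wmde.wd_query_segments_daily.

```python
from collections import Counter


def structure_key(record):
    # Signature built only from parsed structure, never from the raw text.
    return (
        record["n_triples"],
        record["n_nodes"],
        tuple(sorted(record["operators"])),
    )


def most_common_structures(records, top=3):
    """Return the `top` most frequent structural signatures with their counts."""
    return Counter(structure_key(r) for r in records).most_common(top)


# Toy records for illustration only; these are not real query data.
records = [
    {"n_triples": 1, "n_nodes": 3, "operators": ["select"]},
    {"n_triples": 1, "n_nodes": 3, "operators": ["select"]},
    {"n_triples": 2, "n_nodes": 6, "operators": ["select", "filter"]},
]
for key, count in most_common_structures(records):
    print(key, count)
```

Because the signature ignores how the query was written, two differently worded queries with the same parsed shape land in the same bucket, which is the property that text-based regular expressions struggle to provide.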