
[Analytics] [WDQS SEG M1] Create monitoring for the number of WDQS queries that retrieve Items based on a simple statement
Closed, Resolved (Public)

Description

Wikidata Analytics Request

This task was generated using the Wikidata Analytics request form. Please use the task template linked on our project page to create issues for the team. Thank you!

Purpose

Please provide as much context as possible as well as what the produced insights or services will be used for.

T370416: [Analytics] [WDQS SEG M1] Segmentation of queries sent to WDQS

Specific Results

Please detail the specific results that the task should deliver.

We would like to continuously monitor the following metrics for WDQS:

  • Number of SPARQL queries that only retrieve data of a known single entity (based on T370848)
  • Number of SPARQL queries that retrieve Items based on a simple statement (based on T370853)
  • Number of all other SPARQL queries

Desired Outputs

Please list the desired outputs of this task.

Notes

  • We do not need 100% exact numbers, so it is okay to go for a random sample (e.g. every 100th query).
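As a rough illustration of the sampling idea in the note above, here is a minimal sketch that classifies every Nth query and scales the counts back up. It is not the job's actual logic: the function name `estimate_counts`, the category labels, and the `classify` callback are all hypothetical, and the real classification rules are the ones defined in T370848 and T370853.

```python
from collections import Counter
from typing import Callable, Iterable

# Hypothetical labels for the three requested metrics.
SINGLE_ENTITY = "single_entity"
SIMPLE_STATEMENT = "simple_statement"
OTHER = "other"


def estimate_counts(
    queries: Iterable[str],
    classify: Callable[[str], str],
    step: int = 100,
) -> Counter:
    """Classify every `step`-th query and scale the sampled counts by `step`.

    `classify` is a placeholder for the real segmentation rules; it must
    return one of the labels above for a given SPARQL query string.
    """
    if step < 1:
        raise ValueError("step must be a positive integer")

    sampled = Counter()
    for index, query in enumerate(queries):
        # Systematic sample: keep queries 0, step, 2 * step, ...
        if index % step == 0:
            sampled[classify(query)] += 1

    # Each sampled query stands in for roughly `step` queries.
    return Counter({label: count * step for label, count in sampled.items()})
```

The scaled numbers are estimates only, which matches the note that exact counts are not required. A smaller `step` gives tighter estimates at the cost of classifying more queries.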

Deadline

Please make the time sensitivity of this task clear with a date that it should be completed by. If there is no specific date, then the task will be triaged based on its priority.

DD.MM.YYYY


Information below this point is filled out by the task assignee.

Assignee Planning

Sub Tasks

A full breakdown of the steps to complete this task.

  • Check what the frequency of the job should be with stakeholders
  • Modify job queries based on results from T370853
  • Test job queries on PySpark
  • Modify the Airflow DAG to run jobs with the new column included in the output
    • Set up config to load in discovery.processed_external_sparql_query
    • Query
    • Export CSV
    • Move CSV to published data directories
  • Create the output table so that the analytics user has access to it
  • Before any further steps: Get approval from WMF for public data export via new Phab task
  • Test Airflow DAG on personal Airflow instance
  • Deploy new Airflow DAG
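The "Export CSV" step listed above could look roughly like the following sketch, which turns per-day counts into CSV text. This is only an assumption about the output shape: the function name `counts_to_csv` and the column names are hypothetical, and the actual DAG, table schema, and publishing path are the ones worked out in T370851.

```python
import csv
import io
from typing import Mapping

# Hypothetical column layout: one row per day, one column per query segment.
FIELDNAMES = ["date", "single_entity", "simple_statement", "other"]


def counts_to_csv(rows: Mapping[str, Mapping[str, int]]) -> str:
    """Render {date: {segment: count}} as CSV text, sorted by date.

    Segments missing for a given day are written as 0.
    """
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDNAMES, lineterminator="\n")
    writer.writeheader()
    for date in sorted(rows):
        counts = rows[date]
        record = {"date": date}
        for segment in FIELDNAMES[1:]:
            record[segment] = counts.get(segment, 0)
        writer.writerow(record)
    return buffer.getvalue()
```

Returning the CSV as a string keeps the sketch independent of where the file ends up, so the step that moves it to the published data directories can stay separate.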

Estimation

Estimate: 2 days
Actual:

Data

The tables that will be referenced in this task.

  • discovery.processed_external_sparql_query as is used in previous tasks

Notes

Things that came up during the completion of this task, questions to be answered and follow up tasks.

  • Note

Event Timeline

Manuel moved this task from Incoming to To-Do on the Wikidata Analytics (Kanban) board.
AndrewTavis_WMDE changed the task status from Open to Stalled. Aug 7 2024, 11:40 AM

Stalled as this task is blocked by T370853.

AndrewTavis_WMDE changed the task status from Stalled to In Progress. Aug 13 2024, 3:02 PM

As with T370851, setup work for the DAG can now begin as the basic queries for these processes have been defined :)

AndrewTavis_WMDE updated the task description.
AndrewTavis_WMDE updated the task description.
karapayneWMDE renamed this task from [Analytics] Monitor the number of WDQS queries that retrieve Items based on a simple statement to [Analytics] Create the monitoring the number of WDQS queries that retrieve Items based on a simple statement. Aug 19 2024, 9:12 AM
AndrewTavis_WMDE renamed this task from [Analytics] Create the monitoring the number of WDQS queries that retrieve Items based on a simple statement to [Analytics] Create monitoring for the number of WDQS queries that retrieve Items based on a simple statement. Aug 19 2024, 10:15 AM
karapayneWMDE renamed this task from [Analytics] Create monitoring for the number of WDQS queries that retrieve Items based on a simple statement to [Analytics] [WDQS SEG M1] Create monitoring for the number of WDQS queries that retrieve Items based on a simple statement. Aug 21 2024, 3:38 PM

Moving this to in review as all the work for this is currently being done in T370851.

Task is done! The associated work is in https://phabricator.wikimedia.org/T370851. Closing this one as the main criteria are met.