
[Analytics] [WDQS SEG M1] Create monitoring for the number of WDQS queries that only retrieve data of a known single entity
Open, In Progress, High, Public

Description

Wikidata Analytics Request

This task was generated using the Wikidata Analytics request form. Please use the task template linked on our project page to create issues for the team. Thank you!

Purpose

Please provide as much context as possible as well as what the produced insights or services will be used for.

T370416: [Analytics] [WDQS SEG M1] Segmentation of queries sent to WDQS

Specific Results

Please detail the specific results that the task should deliver.

We would like to continuously monitor (e.g. daily, weekly) the following metric for WDQS:

  • Number of SPARQL queries that only retrieve data of a known single entity (based on T370848)
  • Number of all other SPARQL queries
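The two counts above can be sketched in plain Python. This is only an illustration: the `QueryRecord` shape and the `is_single_entity` flag are hypothetical stand-ins for whatever classification T370848 produces, not the schema of the actual table.

```python
from collections import Counter
from typing import Dict, Iterable, NamedTuple


class QueryRecord(NamedTuple):
    # Hypothetical record shape, not the real table schema.
    day: str                # e.g. "2024-09-01"
    is_single_entity: bool  # assumed output of the T370848 classification


def daily_counts(records: Iterable[QueryRecord]) -> Dict[str, Dict[str, int]]:
    """Count single-entity queries and all other queries per day."""
    single = Counter()
    other = Counter()
    for record in records:
        if record.is_single_entity:
            single[record.day] += 1
        else:
            other[record.day] += 1
    # Include every day seen in either counter; a missing key counts as 0.
    days = sorted(set(single) | set(other))
    return {d: {"single_entity": single[d], "other": other[d]} for d in days}
```

A weekly figure could be obtained the same way by keying on the ISO week instead of the day.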

Desired Outputs

Please list the desired outputs of this task.

Notes

  • We do not need 100% exact numbers, so it is okay to go for a random sample (e.g. every 100th query).
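As a rough sketch of the sampling idea, the snippet below keeps every 100th query and scales the sampled count back up. The function names are hypothetical. Note that taking every n-th item is systematic rather than truly random sampling, so it can be biased if the query stream has a periodic pattern.

```python
from itertools import islice
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def every_nth(items: Iterable[T], n: int = 100, offset: int = 0) -> Iterator[T]:
    """Yield every n-th item of the stream, starting at position `offset`."""
    return islice(items, offset, None, n)


def estimate_total(sample_count: int, n: int = 100) -> int:
    """Scale a count observed in the 1-in-n sample back to the full stream."""
    return sample_count * n
```

For example, if 74 queries in the sample are single-entity queries, `estimate_total(74)` gives an estimate of 7400 for the full stream.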

Open Questions

  • What is the frequency we need (e.g. daily, weekly)?

Deadline

Please make the time sensitivity of this task clear with a date that it should be completed by. If there is no specific date, then the task will be triaged based on its priority.

DD.MM.YYYY


Information below this point is filled out by the task assignee.

Assignee Planning

Sub Tasks

A full breakdown of the steps to complete this task.

  • Check what the frequency of the job should be with stakeholders
    • Daily or weekly?
    • Answer: Weekly is fine
    • Note: Because of data constraints, it might make sense to do this daily as there's so much to parse in even one day
  • Set up job queries based on results from T370848
  • Test job queries on PySpark
  • Set up Airflow DAG to run jobs
    • Set up config to load in discovery.processed_external_sparql_query
    • Query
    • Export CSV
    • Move CSV to published data directories
  • Create the output table in a way that gives the analytics user access to it
  • Test Airflow DAG without CSV export step
  • Deploy Airflow DAG without CSV export
  • Figure out the reason for no returned data for the 5th and 13th of August
    • Current theory is that there weren't any queries for a whole hour on each of those days; because the data is partitioned hourly, the DAG keeps retrying even though there is data for the rest of the day
    • Confirmed that data doesn't exist for certain hours and we need data for all hours with the current setup
    • We need to look into using ExternalTaskSensor rather than a datasets.yaml based sensor
    • Ended up needing to use the new RestExternalTaskSensor
  • Before any further steps: Get approval from WMF for public data export via new Phab task
  • Test Airflow DAG with CSV export step
  • Deploy new Airflow DAG with CSV export

Estimation

Estimate: 2-3 days
Actual:

Data

The tables that will be referenced in this task.

  • discovery.processed_external_sparql_query as is used in previous tasks

Notes

Things that came up during the completion of this task, questions to be answered and follow up tasks.

  • Note

Event Timeline

Manuel moved this task from Incoming to To-Do on the Wikidata Analytics (Kanban) board.
AndrewTavis_WMDE changed the task status from Open to Stalled. Aug 7 2024, 11:39 AM

Stalled as this task is blocked by T370848.

AndrewTavis_WMDE changed the task status from Stalled to In Progress. Aug 13 2024, 3:00 PM

Basic DAG setup work can begin, as a query has been defined that can provide the WDQS queries we need. The prior work still needs approval, so just setup for now :)

karapayneWMDE renamed this task from [Analytics] Monitor the number of WDQS queries that only retrieve data of a known single entity to [Analytics] Create the monitoring the number of WDQS queries that only retrieve data of a known single entity. Aug 19 2024, 9:11 AM
AndrewTavis_WMDE renamed this task from [Analytics] Create the monitoring the number of WDQS queries that only retrieve data of a known single entity to [Analytics] Create monitoring for the number of WDQS queries that only retrieve data of a known single entity. Aug 19 2024, 10:14 AM
karapayneWMDE renamed this task from [Analytics] Create monitoring for the number of WDQS queries that only retrieve data of a known single entity to [Analytics] [WDQS SEG M1] Create monitoring for the number of WDQS queries that only retrieve data of a known single entity. Aug 21 2024, 3:40 PM

Output of the Spark test of the DAG job queries:

Columns: day | total_wdqs_queries | total_single_entity | total_all_entity_statements | total_single_term_statement | total_single_inverse_statement | total_single_instance_or_subclass_statement | total_single_known_relation_statement | total_single_unknown_relation_statement | total_complex_queries
Row: 2024-09-01 | 883371874316171432846454379553117134494133178629827 (the nine count values are run together in the source, so the column boundaries cannot be recovered)

Stalled due to data release requirement of WMF