
[PERIODIC] [Analytics] Sample complex WDQS queries
Open, Needs Triage, Public

Description

Wikidata Analytics Request

This task was generated using the Wikidata Analytics request form. Please use the task template linked on our project page to create issues for the team. Thank you!

Purpose

Please provide as much context as possible as well as what the produced insights or services will be used for.

In T370851 we're working on an Airflow pipeline to segment Wikidata Query Service queries as a means of monitoring traffic of queries that could be better suited to other services (e.g. the Wikidata REST API). At the end of this process we have "complex queries": those for which none of the segmentation conditions are met. This task would periodically sample a subset of the queries deemed complex so that decisions can be made on whether new segments are needed or whether existing segment conditions should be broadened to include some of the queries currently deemed "complex".

Specific Results

Please detail the specific results that the task should deliver.

A Jupyter notebook that loads the main-branch version of the production query from the Analytics repo on GitLab and then generates a set of queries that would be counted as "complex". Basing the notebook on the production query means that any changes to the query are automatically reflected in the notebook.
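Loading the query straight from the repository could look like the following minimal sketch. The GitLab project path and query file name are assumptions for illustration; the URL pattern is GitLab's standard raw-file endpoint (`/-/raw/<branch>/<path>`):

```python
from urllib.parse import quote

# Hypothetical project and file path -- substitute the real Analytics
# repo and query file when building the notebook.
GITLAB_HOST = "https://gitlab.wikimedia.org"
PROJECT_PATH = "repos/wmde/analytics"          # assumption
QUERY_FILE = "queries/wdqs_segmentation.sql"   # assumption


def raw_query_url(branch: str = "main") -> str:
    """Build the GitLab raw-file URL so the notebook always tracks main."""
    return f"{GITLAB_HOST}/{PROJECT_PATH}/-/raw/{quote(branch)}/{QUERY_FILE}"


# In the notebook this URL would then be fetched, e.g. with
# urllib.request.urlopen(raw_query_url()).read().decode("utf-8")
```

Because the URL always points at the `main` branch rather than a pinned commit, the notebook picks up query changes without any manual update.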

Desired Outputs

Please list the desired outputs of this task.

  • Creation of the "complex query" set sampling notebook
  • Providing the first sample
  • ... subsequent periods where the sample will be needed will be added here

Deadline

Please make the time sensitivity of this task clear with a date that it should be completed by. If there is no specific date, then the task will be triaged based on its priority.

DD.MM.YYYY


Information below this point is filled out by the task assignee.

Assignee Planning

Sub Tasks

A full breakdown of the steps to complete this task.

  • Creation of the "complex query" set sampling notebook
    • Load in query from raw file on the main branch
    • Split query and substitute variables for the period
    • Produce a CSV that's then uploaded to Google Drive
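The substitution and CSV steps above could be sketched as follows. The query text, placeholder names (`year`, `month`), and output columns are assumptions standing in for the production query loaded from GitLab:

```python
import csv
import io
from string import Template

# Stand-in for the production query text fetched from the repo; the real
# query and its placeholder names are assumptions here.
QUERY_TEMPLATE = Template(
    "SELECT query, query_time FROM discovery.processed_external_sparql_query\n"
    "WHERE year = $year AND month = $month"
)


def render_query(year: int, month: int) -> str:
    """Substitute the sampling period into the query template."""
    return QUERY_TEMPLATE.substitute(year=year, month=month)


def sample_to_csv(rows: list[dict]) -> str:
    """Serialize sampled rows as CSV text, ready for upload to Google Drive."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["query", "query_time"])
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()
```

Keeping the period values as template variables means the same notebook can be rerun for each sampling period by changing only the `year`/`month` arguments.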

Estimation

Estimate: 1 day
Actual:

Data

The tables that will be referenced in this task.

  • discovery.processed_external_sparql_query via the query from T370851

Notes

Things that came up during the completion of this task, questions to be answered and follow up tasks.

  • Note

Event Timeline

AndrewTavis_WMDE renamed this task from [PERIODIC] [Analytics] Sample to [PERIODIC] [Analytics] Sample complex WDQS queries.Sep 20 2024, 2:19 PM