== Wikidata Analytics Request ==
> This task was generated using the [Wikidata Analytics](https://phabricator.wikimedia.org/project/profile/5408) request form. Please use the task template linked on our project page to create issues for the team. Thank you!
=== Purpose ===
> Please provide as much context as possible as well as what the produced insights or services will be used for.
In T370851 we're working on an Airflow pipeline to segment Wikidata Query Service queries as a means of monitoring traffic from queries that could be better suited to other services (Wikidata REST API, etc.). At the end of this process we have "complex queries": those for which none of the segmentation conditions are met. This task would periodically draw a subset of the queries deemed complex so that we can decide whether new segments are needed, or whether existing segment conditions should be broadened to cover some of the queries currently labeled "complex".
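The periodic subsetting step could be sketched as below. The function name, sample size, and the seeded sampling are illustrative assumptions, not the final notebook implementation:

```python
import random

def sample_complex_queries(queries, k=100, seed=42):
    """Draw a reproducible random subset of queries flagged as complex.

    Seeding makes each period's sample repeatable, which helps when
    revisiting a sample while segmentation conditions are discussed.
    """
    rng = random.Random(seed)
    # Sample without replacement; cap at the number of available queries.
    return rng.sample(queries, min(k, len(queries)))

# Illustrative input: in practice these would come from the query results.
queries = [f"SELECT ... #{i}" for i in range(1000)]
subset = sample_complex_queries(queries, k=10)
```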
=== Specific Results ===
> Please detail the specific results that the task should deliver.
A Jupyter notebook that loads the main-branch version of the testing query from the Analytics repo on GitLab and then generates a set of queries that would be counted as "complex". Basing the notebook on the testing query ensures that any changes to the query are reflected in the notebook automatically.
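Loading the main-branch query could work by fetching the raw-file URL that GitLab serves at `/<project>/-/raw/<ref>/<path>`. The project and file paths below are placeholders, not the actual repo layout:

```python
def raw_file_url(project_path: str, file_path: str, branch: str = "main") -> str:
    """Build the raw-file URL for a file hosted on Wikimedia GitLab.

    GitLab exposes raw file contents at /<project>/-/raw/<ref>/<path>,
    so pointing at the main branch picks up merged changes automatically.
    """
    return f"https://gitlab.wikimedia.org/{project_path}/-/raw/{branch}/{file_path}"

# Placeholder project/file paths for illustration only.
url = raw_file_url("repos/example/analytics", "queries/testing_query.sql")
# The notebook would then fetch this URL (e.g. with requests.get) on each run.
```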
=== Desired Outputs ===
> Please list the desired outputs of this task.
[ ] Creation of the "complex query" set sampling notebook
[ ] Providing the first sample
[ ] ... (subsequent periods when a sample is needed will be added here)
=== Deadline ===
> Please make the time sensitivity of this task clear with a date that it should be completed by. If there is no specific date, then the task will be triaged based on its priority.
DD.MM.YYYY
---
**Information below this point is filled out by the task assignee.**
== Assignee Planning ==
=== Sub Tasks ===
> A full breakdown of the steps to complete this task.
[ ] Creation of the "complex query" set sampling notebook
- Load in query from raw file on the main branch
- Split query and substitute variables for the period
- Produce a CSV that's then uploaded to Google Drive
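The substitution and CSV steps above could be sketched as follows. The `$start_date`/`$end_date` placeholder names and the result columns are assumptions; the actual variable names in the testing query may differ:

```python
import csv
import io
from string import Template

def render_query(query_text: str, start_date: str, end_date: str) -> str:
    """Substitute period variables into the query template.

    Assumes the query uses $start_date / $end_date placeholders.
    """
    return Template(query_text).substitute(start_date=start_date,
                                           end_date=end_date)

def rows_to_csv(rows, fieldnames):
    """Serialize result rows to CSV text, ready to upload (e.g. to Google Drive)."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

# Illustrative usage with a placeholder query fragment and result row.
query = render_query("... WHERE day >= '$start_date' AND day < '$end_date'",
                     "2024-07-01", "2024-08-01")
csv_text = rows_to_csv([{"query": "SELECT ...", "count": 3}], ["query", "count"])
```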
=== Estimation ===
Estimate: 1 day
Actual:
=== Data ===
> The tables that will be referenced in this task.
- `discovery.processed_external_sparql_query` via the query from T370851
=== Notes ===
> Things that came up during the completion of this task, questions to be answered and follow up tasks.
- Note