Wikidata Analytics Request
This task was generated using the Wikidata Analytics request form. Please use the task template linked on our project page to create issues for the team. Thank you!
Purpose
Please provide as much context as possible as well as what the produced insights or services will be used for.
In T370851 we're working on an Airflow pipeline that segments Wikidata Query Service queries so that we can monitor traffic from queries that could be better served by other services (Wikidata REST API, etc.). Queries that meet none of the segmentation conditions end up in a residual "complex queries" segment. This task would periodically sample the queries deemed complex so that we can decide whether new segments are needed or whether the conditions of existing segments should be changed to include some of them.
Specific Results
Please detail the specific results that the task should deliver.
A Jupyter notebook that loads the main-branch version of the production query from the Analytics repo on GitLab and then generates a set of queries that would be counted as "complex". The notebook should be based on the production query so that any changes to that query are automatically reflected in the notebook.
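One possible way to keep the notebook tied to the production query is to download the file from the main branch each time the notebook runs. The sketch below assumes GitLab's raw-file URL layout (`/-/raw/<ref>/<path>`). The host, project path and file path are hypothetical placeholders, not the real locations, and `raw_file_url` and `fetch_text` are illustrative helper names.

```python
from urllib.parse import quote
from urllib.request import urlopen


def raw_file_url(host, project_path, ref, file_path):
    """Build a GitLab raw-file URL: https://<host>/<project>/-/raw/<ref>/<path>."""
    return f"https://{host}/{quote(project_path)}/-/raw/{quote(ref)}/{quote(file_path)}"


def fetch_text(url, timeout=30):
    """Download the file at `url` and decode it as UTF-8."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


# Hypothetical host, project and file path; replace with the real ones.
query_url = raw_file_url(
    "gitlab.example.org",
    "group/analytics",
    "main",
    "queries/segment_queries.hql",
)

# In the notebook, the query text would then be loaded with:
# production_query = fetch_text(query_url)
```

Because the URL points at the `main` ref rather than a commit hash, each notebook run would pick up whatever version of the query is currently merged.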
Desired Outputs
Please list the desired outputs of this task.
- Creation of the "complex query" set sampling notebook
- Providing the first sample
- ... subsequent periods where the sample will be needed will be added
Deadline
Please make the time sensitivity of this task clear with a date that it should be completed by. If there is no specific date, then the task will be triaged based on its priority.
DD.MM.YYYY
Information below this point is filled out by the task assignee.
Assignee Planning
Sub Tasks
A full breakdown of the steps to complete this task.
- Creation of the "complex query" set sampling notebook
- Load in query from raw file on the main branch
- Split query and substitute variables for the period
- Produce a CSV that's then uploaded to Google Drive
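The split, substitute and CSV steps above could look roughly like the following. This is a minimal sketch under several assumptions: that the production query is a `;`-separated script, that it uses `${name}`-style placeholders for the period, and that the table exposes `year`, `month` and `query` columns. None of these are confirmed by this task, and the example SQL is a stand-in for the real query. The Google Drive upload is left out.

```python
import csv
import io
from string import Template


def split_statements(sql_text):
    """Split a SQL script on ';' (naive: ignores ';' inside strings or comments)."""
    return [part.strip() for part in sql_text.split(";") if part.strip()]


def substitute_variables(statement, variables):
    """Fill ${name} placeholders; raises KeyError if a placeholder has no value."""
    return Template(statement).substitute(variables)


def rows_to_csv(header, rows):
    """Serialise sampled rows to CSV text, ready to be saved and uploaded."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical stand-in for the production query text.
example_sql = """
SET hive.exec.dynamic.partition = true;
SELECT query FROM discovery.processed_external_sparql_query
WHERE year = ${year} AND month = ${month};
"""

statements = split_statements(example_sql)
period = {"year": 2024, "month": 8}  # assumed period variables
select_statement = substitute_variables(statements[-1], period)

# Placeholder rows standing in for the sampled "complex" queries.
csv_text = rows_to_csv(["query"], [["SELECT 1"]])
```

`Template.substitute` is used rather than `safe_substitute` so that a variable added to the production query but missing from the notebook fails loudly instead of producing a silently broken statement.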
Estimation
Estimate: 1 day
Actual:
Data
The tables that will be referenced in this task.
- discovery.processed_external_sparql_query via the query from T370851
Notes
Things that came up during the completion of this task, questions to be answered and follow up tasks.
- Note