Wikidata Analytics Request
This task was generated using the Wikidata Analytics request form. Please use the task template linked on our project page to create issues for the team. Thank you!
Purpose
Please provide as much context as possible as well as what the produced insights or services will be used for.
T370416: [Analytics] [WDQS SEG M1] Segmentation of queries sent to WDQS
Specific Results
Please detail the specific results that the task should deliver.
We would like to continuously monitor the following metrics for WDQS:
- Number of SPARQL queries that only retrieve data of a known single entity (based on T370848)
- Number of SPARQL queries that retrieve Items based on a simple statement (based on T370853)
- Number of all other SPARQL queries
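The three-way segmentation above could be sketched as follows. This is only an illustration: the regular expressions are hypothetical stand-ins, since the actual definitions of "single entity" and "simple statement" queries come from T370848 and T370853, and the function names are assumptions rather than part of the existing job.

```python
import re
from collections import Counter

# Hypothetical heuristics, NOT the definitions from T370848 / T370853:
# a query body consisting of one triple pattern with a fixed entity as subject ...
SINGLE_ENTITY = re.compile(r"\{\s*wd:Q\d+\s+\?\w+\s+\?\w+\s*\.?\s*\}")
# ... or one triple pattern of the form "?item wdt:Pxx wd:Qyy".
SIMPLE_STATEMENT = re.compile(r"\{\s*\?\w+\s+wdt:P\d+\s+wd:Q\d+\s*\.?\s*\}")


def classify_query(query: str) -> str:
    """Assign a SPARQL query string to one of the three segments."""
    if SINGLE_ENTITY.search(query):
        return "single_entity"
    if SIMPLE_STATEMENT.search(query):
        return "simple_statement"
    return "other"


def count_segments(queries):
    """Count queries per segment; segments without queries are reported as 0."""
    counts = Counter({"single_entity": 0, "simple_statement": 0, "other": 0})
    counts.update(classify_query(q) for q in queries)
    return dict(counts)
```

Every query falls into exactly one segment, so the three counts always sum to the number of queries inspected.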
Desired Outputs
Please list the desired outputs of this task.
- Enhance existing Airflow pipeline (see T370851) to monitor the above metrics
- Done in T370851
- Output as CSV to https://analytics.wikimedia.org/published/datasets/wmde/analytics/
- Done in T370851
Notes
- We do not need 100% exact numbers, so it is okay to go for a random sample (e.g. every 100th query).
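A minimal sketch of the sampling idea, assuming the queries can be iterated in order. The helper names are hypothetical; in the actual PySpark job, a random fraction via `DataFrame.sample(fraction=0.01)` would be one way to get a comparable 1% sample.

```python
from itertools import islice


def sample_every_nth(queries, n=100, offset=0):
    """Return an iterator over every n-th query, starting at index `offset`."""
    return islice(queries, offset, None, n)


def estimate_total(sample_count, n=100):
    """Scale a count observed in the sample back up to the full population."""
    return sample_count * n
```

Counts derived from the sample are estimates, so the published numbers would be approximate, as the note above allows.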
Deadline
Please make the time sensitivity of this task clear with a date that it should be completed by. If there is no specific date, then the task will be triaged based on its priority.
DD.MM.YYYY
Information below this point is filled out by the task assignee.
Assignee Planning
Sub Tasks
A full breakdown of the steps to complete this task.
- Check what the frequency of the job should be with stakeholders
- Same as whatever is decided in T370851
- Modify job queries based on results from T370853
- Test job queries in PySpark
- Modify the Airflow DAG so that the jobs include the new column in the output
- Set up config to load discovery.processed_external_sparql_query
- Query
- Export CSV
- Move CSV to published data directories
- Create the output table in a way that gives the analytics user access to it
- Handled in T370851
- Before any further steps: Get approval from WMF for public data export via new Phab task
- Test Airflow DAG on personal Airflow instance
- Handled in T370851
- Deploy new Airflow DAG
- Handled in T370851
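The "Export CSV" sub task above might look roughly like the sketch below. The column names and the function name are assumptions for illustration only; the real schema is whatever the pipeline from T370851 already produces, plus the new column.

```python
import csv
import io

# Hypothetical column layout; the real output schema is defined in T370851.
FIELDNAMES = ["date", "single_entity", "simple_statement", "other"]


def counts_to_csv(rows):
    """Serialize per-day segment counts (a list of dicts) to CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDNAMES, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

Writing to an in-memory buffer keeps the serialization step separate from moving the file to the published data directories.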
Estimation
Estimate: 2 days
Actual:
Data
The tables that will be referenced in this task.
- discovery.processed_external_sparql_query, as used in previous tasks
Notes
Things that came up during the completion of this task, questions to be answered and follow up tasks.
- Note