Wikidata Analytics Request
This task was generated using the Wikidata Analytics request form. Please use the task template linked on our project page to create issues for the team. Thank you!
Purpose
Please provide as much context as possible as well as what the produced insights or services will be used for.
T370416: [Analytics] [WDQS SEG M1] Segmentation of queries sent to WDQS
Specific Results
Please detail the specific results that the task should deliver.
We would like to continuously monitor (e.g. daily or weekly) the following metrics for WDQS:
- Number of SPARQL queries that only retrieve data for a single known entity (based on T370848)
- Number of all other SPARQL queries
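How "queries that only retrieve data for a single known entity" are identified is defined in T370848; the rules below are not those rules. As a minimal sketch, assuming the naive heuristic that a query counts as "single entity" when every Wikidata entity IRI it mentions refers to the same item (the regex and the prefixes it recognizes are illustrative assumptions):

```python
import re

# Assumed patterns for how entities appear in logged query text:
# the wd: prefix and full entity IRIs. Not the actual T370848 rules.
ENTITY_PATTERN = re.compile(r"(?:wd:|<http://www\.wikidata\.org/entity/)(Q\d+)")

def is_single_entity_query(sparql: str) -> bool:
    """True if the query references exactly one distinct Wikidata item."""
    entities = set(ENTITY_PATTERN.findall(sparql or ""))
    return len(entities) == 1

# Example: a query about one item's label.
print(is_single_entity_query("SELECT ?label WHERE { wd:Q42 rdfs:label ?label }"))  # True

# Caveat: this naive rule also returns True for "all humans" style queries,
# which is exactly why the real classification rules from T370848 are needed.
print(is_single_entity_query("SELECT ?s WHERE { ?s wdt:P31 wd:Q5 } LIMIT 10"))     # True
```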
Desired Outputs
Please list the desired outputs of this task.
- Airflow pipeline to monitor the above metric
- Output as CSV to https://analytics.wikimedia.org/published/datasets/wmde/analytics/
Notes
- We do not need 100% exact numbers, so it is fine to work from a sample (e.g. every 100th query).
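A sketch of what sampling could look like in PySpark, assuming the query text lives in a `query` column of discovery.processed_external_sparql_query (the column name is an assumption):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Table name is taken from this task; the `query` column is an assumed schema detail.
queries = spark.table("discovery.processed_external_sparql_query")

# Option 1: a ~1% random sample.
sampled = queries.sample(fraction=0.01, seed=42)

# Option 2: a deterministic "every 100th query" style cut based on a hash of the
# query text, which keeps the sample stable across reruns.
sampled_stable = queries.where(F.crc32(F.col("query")) % 100 == 0)
```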
Open Questions
- What frequency do we need (e.g. daily or weekly)?
Deadline
Please make the time sensitivity of this task clear with a date that it should be completed by. If there is no specific date, then the task will be triaged based on its priority.
DD.MM.YYYY
Information below this point is filled out by the task assignee.
Assignee Planning
Sub Tasks
A full breakdown of the steps to complete this task.
- Check with stakeholders what the frequency of the job should be
  - Daily or weekly?
  - Answer: Weekly is fine
  - Note: given the data volume, it may still make sense to run this daily, since even a single day of queries is a lot to parse
- Set up job queries based on the results of T370848
- Test the job queries in PySpark (a query sketch follows this list)
- Set up the Airflow DAG to run the jobs (a stock-Airflow DAG sketch also follows this list)
  - Set up the config to load discovery.processed_external_sparql_query
  - Query
  - Export CSV
  - Move the CSV to the published data directories
  - Create the output table in such a way that the analytics user has access to it
- Test Airflow DAG without CSV export step
- Deploy Airflow DAG without CSV export
- Figure out the reason for no returned data for the 5th and 13th of August
  - Current theory: there were no queries for a whole hour on each of those days, and because the data is partitioned hourly, the DAG keeps retrying even though there is data for the rest of the day
  - Confirmed that data does not exist for certain hours, and the current setup requires data for all hours
  - We need to look into using ExternalTaskSensor rather than a datasets.yaml based sensor
  - Ended up needing to use the new RestExternalTaskSensor
- Before any further steps: get approval from WMF for the public data export via a new Phabricator task
  - Task id: T372537
- Test Airflow DAG with CSV export step
- Deploy new Airflow DAG with CSV export
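For the "set up job queries" and "test the job queries in PySpark" items above, a minimal sketch of the weekly aggregation. The year/month/day partition columns and the `query` column on discovery.processed_external_sparql_query are assumptions about the schema, and the UDF is only a placeholder for the classification rules from T370848:

```python
from pyspark.sql import SparkSession, functions as F, types as T

spark = SparkSession.builder.getOrCreate()

def is_single_entity_query(sparql):
    # Placeholder for the classification rules defined in T370848.
    import re
    entities = set(re.findall(r"(?:wd:|<http://www\.wikidata\.org/entity/)(Q\d+)", sparql or ""))
    return len(entities) == 1

is_single_entity = F.udf(is_single_entity_query, T.BooleanType())

weekly_counts = (
    spark.table("discovery.processed_external_sparql_query")
    # Assumed partition layout; one week of data.
    .where("year = 2024 AND month = 8 AND day BETWEEN 5 AND 11")
    .withColumn("is_single_entity", is_single_entity(F.col("query")))
    .agg(
        F.sum(F.col("is_single_entity").cast("long")).alias("single_entity_queries"),
        F.sum((~F.col("is_single_entity")).cast("long")).alias("other_queries"),
    )
)

weekly_counts.show()
```

The resulting frame could then be written out for the CSV export step, e.g. via `weekly_counts.toPandas().to_csv(...)`.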
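For the Airflow items (DAG setup, CSV export, and the sensor discussion), a rough shape of the DAG using only stock Airflow classes. The WMF airflow-dags repo has its own conventions and sensors, including the RestExternalTaskSensor mentioned above, whose interface is not assumed here; the DAG id, upstream DAG id, schedule, and callables are all illustrative assumptions:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.sensors.external_task import ExternalTaskSensor


def run_aggregation(**context):
    # Placeholder: run the PySpark aggregation sketched above and write the
    # result as CSV to a staging location.
    ...


def publish_csv(**context):
    # Placeholder: copy the CSV into the directory that is synced to
    # https://analytics.wikimedia.org/published/datasets/wmde/analytics/
    # (the local path for that directory is deployment specific).
    ...


with DAG(
    dag_id="wdqs_query_segmentation_weekly",  # assumed name
    schedule="@weekly",                       # Airflow 2.4+ style schedule
    start_date=datetime(2024, 8, 1),
    catchup=False,
) as dag:
    # Stand-in for the sensor step: wait for the upstream DAG that produces
    # discovery.processed_external_sparql_query. With an hourly upstream and a
    # weekly schedule here, logical dates do not line up, which is part of why
    # the task ended up on RestExternalTaskSensor instead of this stock sensor.
    wait_for_sparql_queries = ExternalTaskSensor(
        task_id="wait_for_processed_sparql_queries",
        external_dag_id="process_external_sparql_query",  # assumed upstream DAG id
        external_task_id=None,  # wait for the whole upstream DAG run
        timeout=6 * 60 * 60,
    )

    aggregate = PythonOperator(
        task_id="aggregate_query_segments", python_callable=run_aggregation
    )
    publish = PythonOperator(task_id="publish_csv", python_callable=publish_csv)

    wait_for_sparql_queries >> aggregate >> publish
```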
Estimation
Estimate: 2-3 days
Actual:
Data
The tables that will be referenced in this task.
- discovery.processed_external_sparql_query, as used in previous tasks
Notes
Things that came up during the completion of this task, questions to be answered and follow up tasks.
- Note