== Wikidata Analytics Request ==
> This task was generated using the [Wikidata Analytics](https://phabricator.wikimedia.org/project/profile/5408) request form. Please use the task template linked on our project page to create issues for the team. Thank you!
=== Purpose ===
> Please provide as much context as possible as well as what the produced insights or services will be used for.
{T370416}
=== Specific Results ===
> Please detail the specific results that the task should deliver.
We would like to continuously monitor (e.g. daily, weekly) the following metrics for WDQS:
- Number of SPARQL queries that only retrieve data of a known single entity (based on T370848)
- Number of all other SPARQL queries
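The split above could be sketched with a small classifier over raw query strings. Note that the real classification rules come from T370848 and are not reproduced here; the regex and the "exactly one distinct `wd:Q`-item" heuristic below are purely illustrative assumptions.

```python
import re
from collections import Counter

# Hypothetical heuristic -- the actual rules live in T370848.
# A query counts as "single_entity" here if it references exactly one
# distinct wd:Q-item; anything else falls into "other".
ENTITY_RE = re.compile(r"\bwd:Q\d+\b")

def classify(query: str) -> str:
    entities = set(ENTITY_RE.findall(query))
    return "single_entity" if len(entities) == 1 else "other"

def count_queries(queries):
    """Tally queries into the two monitored buckets."""
    return Counter(classify(q) for q in queries)
```

In the actual job this would run as a Pyspark aggregation over `discovery.processed_external_sparql_query` rather than over an in-memory list.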
=== Desired Outputs ===
> Please list the desired outputs of this task.
[x] Airflow pipeline to monitor the above metric
[ ] Output as CSV to https://analytics.wikimedia.org/published/datasets/wmde/analytics/
=== Notes ===
- We do not need 100% exact numbers, so it is okay to go for a random sample (e.g. every 100th query).
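One way to make an "every 100th query" sample reproducible across runs is a deterministic hash-based filter rather than a random one. This is a sketch, not the chosen implementation; the `query_id` field name is an assumption.

```python
import hashlib

def in_sample(query_id: str, rate: int = 100) -> bool:
    """Keep roughly 1 in `rate` records, deterministically.

    Hashing a stable identifier (assumed field name: query_id) means the
    same record is always in or out of the sample, unlike random sampling.
    """
    digest = hashlib.sha256(query_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") % rate == 0
```

In Pyspark, the simpler non-deterministic alternative would be `df.sample(fraction=0.01)`.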
=== Open Questions ===
- What is the frequency we need (e.g. daily, weekly)?
=== Deadline ===
> Please make the time sensitivity of this task clear with a date that it should be completed by. If there is no specific date, then the task will be triaged based on its priority.
DD.MM.YYYY
---
**Information below this point is filled out by the task assignee.**
== Assignee Planning ==
=== Sub Tasks ===
> A full breakdown of the steps to complete this task.
[x] Check what the frequency of the job should be with stakeholders
- Daily or weekly?
- Answer: Weekly is fine
- Note: Given the data volume, it might make sense to run this daily, as even a single day's queries are a lot to parse
[x] Setup job queries based on results from T370848
[x] Test job queries on Pyspark
[x] Setup Airflow DAG to run jobs
- Setup config to load in `discovery.processed_external_sparql_query`
- Query
- Export CSV
- Move CSV to published data directories
[x] Have table for outputs be made in a way that the analytics user has access to it
[x] Test Airflow DAG **without** CSV export step
[x] Deploy Airflow DAG **without** CSV export
[x] Figure out the reason for no returned data for the 5th and 13th of August
- Current theory: there weren't any queries for a whole hour on each of those days; since the data is partitioned hourly, the DAG keeps retrying even though data exists for the rest of the day
- Confirmed that data doesn't exist for certain hours and we need data for all hours with the current setup
- We need to look into using `ExternalTaskSensor` rather than a `datasets.yaml` based sensor
- Ended up needing to use the new `RestExternalTaskSensor`
[x] Before any further steps: Get approval from WMF for public data export via new Phab task
- Task id: T372537
[ ] Test Airflow DAG **with** CSV export step
[ ] Deploy new Airflow DAG **with** CSV export
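The DAG steps above (query, export CSV, publish) could be wired up roughly as follows. This is a minimal sketch, not the deployed configuration: the DAG id, schedule, and operator choices are assumptions, the sensor step is elided because `RestExternalTaskSensor` is WMF-internal, and the callables are stubs.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def run_query(**_):
    # Aggregate single-entity vs. other query counts from
    # discovery.processed_external_sparql_query (Pyspark job, elided).
    ...

def export_csv(**_):
    # Write the aggregates as CSV and move them to the published
    # datasets directory (see Desired Outputs).
    ...

with DAG(
    dag_id="wdqs_single_entity_queries",  # hypothetical name
    start_date=datetime(2024, 8, 1),
    schedule="@weekly",
    catchup=False,
) as dag:
    query = PythonOperator(task_id="query", python_callable=run_query)
    export = PythonOperator(task_id="export_csv", python_callable=export_csv)
    query >> export
```

In practice a sensor on the source table's partitions would gate the `query` task, per the sub-tasks above.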
=== Estimation ===
Estimate: 2-3 days
Actual:
=== Data ===
> The tables that will be referenced in this task.
- `discovery.processed_external_sparql_query`, as used in previous tasks
=== Notes ===
> Things that came up during the completion of this task, questions to be answered and follow up tasks.
- Note