Copy-pasting from T384962:
Here's an issue I currently see: the data_quality_ops.data_quality_alerts table doesn't have a column for metadata such as tags, the way the metrics table does. This doesn't affect the actual alerting part, but it would affect any future analyses and dashboarding someone might want to do on the verification checks. For instance, if we want to alert on T388439, there currently isn't a way to differentiate records in the table that are checking monthly vs daily reconciles. Even now, there's an open question whether the source_table column in the alerts table should refer to data_quality_ops.data_quality_metrics or to the underlying table that the metrics were computed against.
To support T388439 and future use cases, before enabling alerting I'm going to work on some patches that'll allow inserting tags into the alerts table using deequ's ResultKey class, so it (kinda) aligns with the way metrics work.
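As a rough sketch of what the tags column would make possible: once alert rows carry tags, separating monthly from daily checks could be a simple map lookup. The tag key `granularity` and the value `monthly` are made up for illustration; the actual tag names aren't decided here.

```lang=sql
-- Hypothetical tag key/value; substitute whatever the jobs end up writing.
SELECT *
FROM data_quality_ops.data_quality_alerts
WHERE tags['granularity'] = 'monthly';
```

Rows written before the column exists would presumably have NULL tags and so would not match a filter like this.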
Also, it's a bit weird to call it the alerts table when it doesn't store alerts but rather the verification checks that, if failed, will trigger an alert. That's some bike shedding for some future time, maybe.
This turns out to have a few more steps than I expected.
[x] Modify `refinery-source` to support new columns in a backwards-compatible way
[x] Modify `refinery` with the new schema
[x] Modify airflow jobs that use deequ alerts to use new jars
- Hopefully we don't have to modify the actual jobs themselves, but if we do it would probably require going back to refinery-source
[x] deploy `refinery-source`
[x] deploy `refinery`
The next 3 bullets would ideally be done within an hour so the hourly dags don't break
[x] Alter table with new schema
```lang=sql
ALTER TABLE data_quality_alerts ADD COLUMNS (
dataset_date BIGINT COMMENT 'AWS Deequ resultKey: key insertion time.',
tags MAP<STRING,STRING> COMMENT 'AWS Deequ resultKey: key tags.'
);
```
[x] deploy Airflow dags that use deequ alerts with new refinery source version
- [x] webrequest/refine_webrequest_analyzer_hourly_dag
- [x] mediawiki/mediawiki_history_metrics_monthly_dag
- [x] Update airflow variables
[x] `scap deploy` Airflow (still needs to be done for artifact sync; currently only dags are automatically synced)
Since the only user of `refinery-deequ-python` is mediawiki content dumps and it doesn't have alerting yet, the next bullet points can be less rushed:
[ ] Modify `refinery-deequ-python`
- [ ] Support new columns, change refinery version
- [ ] Change the package name; it's still called `refinery-python`
[ ] deploy `refinery-deequ-python` with new refinery source version
[ ] modify `compute_metrics` script to use the prod `refinery-deequ-python` instead of pointing to Gabriele's gitlab repo
- [ ] remember to rename imports
- [ ] Update airflow variables
- At this point we can actually do T384962