
[Data Quality] Add ability to add tags to alerts
Closed, ResolvedPublic

Description

Copy-pasting from T384962:

Here's an issue I currently see: the data_quality_ops.data_quality_alerts table doesn't have a column for metadata such as tags, the way the metrics table does. This doesn't affect the actual alerting, but it would affect any future analyses and dashboarding someone might want to do on the verification checks. For instance, if we want to alert on T388439, there is currently no way to differentiate records in the table that check monthly vs. daily reconciles. Even now, there's an open question whether the source_table column in the alerts table should refer to data_quality_ops.data_quality_metrics or to the underlying table the metrics were computed against.

To support T388439 and future use cases, before enabling alerting I'm going to work on some patches that'll allow inserting tags into the alerts table using deequ's ResultKey class, so it (kinda) aligns with the way metrics work.
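For context, deequ's ResultKey (Scala) bundles an epoch-millis timestamp with a free-form string-to-string tag map. A rough Python sketch of the shape being proposed for the alerts table (names and the helper are hypothetical; the real implementation lives in refinery-source Scala):

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class ResultKey:
    """Mirrors the shape of deequ's Scala ResultKey: an epoch-millis
    timestamp plus a free-form string tag map."""
    dataset_date: int
    tags: dict = field(default_factory=dict)

def to_alert_row(result_key: ResultKey, base_row: dict) -> dict:
    """Hypothetical helper: merge a ResultKey into an alerts-table row,
    populating the two proposed columns (dataset_date, tags)."""
    row = dict(base_row)
    row["dataset_date"] = result_key.dataset_date
    row["tags"] = dict(result_key.tags)
    return row

key = ResultKey(1743781205453, {"reconcile_granularity": "daily"})
row = to_alert_row(key, {"source_table": "`wmf`.`webrequest`", "status": "Success"})
```

The point is that the tag map rides along with every check result, so a daily and a monthly reconcile writing to the same table stay distinguishable.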

Also, it's a bit weird to call it the alerts table when it doesn't store alerts but the verification checks that, if failed, will trigger an alert, but that's some bike-shedding for some future time maybe.

This turns out to have a few more steps than I expected.

  • Modify refinery-source to support new columns in a backwards-compatible way
  • Modify refinery with the new schema
  • Modify airflow jobs that use deequ alerts to use new jars
    • Hopefully we don't have to modify the actual jobs themselves, but if we do it would probably require going back to refinery-source
  • deploy refinery-source
  • deploy refinery

The next 3 bullets would ideally be done within an hour of each other so the hourly dags don't break:

  • Alter table with new schema
ALTER TABLE data_quality_alerts ADD COLUMNS (
    dataset_date BIGINT COMMENT 'AWS Deequ resultKey: key insertion time.',
    tags MAP<STRING,STRING> COMMENT 'AWS Deequ resultKey: key tags.'
);
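The ALTER TABLE is safe against rows written before the migration: ADD COLUMNS only extends the schema, and pre-existing rows surface NULL for the new columns on read. A toy Python sketch of that contract (purely illustrative; Hive/Spark handle this for real):

```python
# The two columns added by the ALTER TABLE above; absent values read as NULL.
NEW_COLUMNS = {"dataset_date": None, "tags": None}

def read_alert_row(raw_row: dict) -> dict:
    """Illustrative: rows written by old jars lack the new columns, so a
    read defaults them to NULL, which is what Hive does after
    ALTER TABLE ... ADD COLUMNS."""
    row = dict(NEW_COLUMNS)
    row.update(raw_row)
    return row

old = read_alert_row({"source_table": "`wmf`.`webrequest`", "status": "Failure"})
new = read_alert_row({"source_table": "`wmf`.`webrequest`", "status": "Success",
                      "dataset_date": 1743781205453, "tags": {}})
```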
  • deploy Airflow dags that use deequ alerts with new refinery source version
    • webrequest/refine_webrequest_analyzer_hourly_dag
    • mediawiki/mediawiki_history_metrics_monthly_dag
    • Update airflow variables
  • scap deploy Airflow (still needs to be done for artifact sync; currently only dags are automatically synced)

Since the only user of refinery-deequ-python is mediawiki content dumps, and it doesn't have alerting yet, the next bullet points can be less rushed:

  • Modify refinery-deequ-python
    • Support new columns, change refinery version
    • Change the name of the package (it's still called refinery-python)
  • deploy refinery-deequ-python with new refinery source version
  • modify compute_metrics script to use the prod refinery-deequ-python instead of pointing to Gabriele's gitlab repo
    • remember to rename imports
    • Update airflow variables
    • At this point we can actually do T384962
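The import rename in compute_metrics is mechanical; something like the following could be scripted (package names hypothetical, matching the rename mentioned above):

```python
import re

def rename_imports(source: str,
                   old_pkg: str = "refinery_python",
                   new_pkg: str = "refinery_deequ_python") -> str:
    """Hypothetical helper: rewrite import lines from the old package
    name to the new one, leaving all other code untouched."""
    pattern = re.compile(rf"\b{re.escape(old_pkg)}\b")
    return "\n".join(
        pattern.sub(new_pkg, line)
        if line.lstrip().startswith(("import ", "from "))
        else line
        for line in source.splitlines()
    )

before = "from refinery_python.deequ import compute_metrics\nx = 1"
after = rename_imports(before)
```

Running the rename only on `import`/`from` lines avoids clobbering string literals or comments that happen to mention the old name.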

Event Timeline

tchin updated the task description.

Change #1127964 had a related patch set uploaded (by TChin; author: TChin):

[analytics/refinery/source@master] Support inserting ResultKey into DeequVerificationSuiteToDataQualityAlerts

https://gerrit.wikimedia.org/r/1127964

Change #1127967 had a related patch set uploaded (by TChin; author: TChin):

[analytics/refinery@master] Add columns to data_quality_alerts to support inserting ResultKey

https://gerrit.wikimedia.org/r/1127967

tchin updated the task description.
Ahoelzl renamed this task from "Add ability to add tags to alerts" to "[Data Quality] Add ability to add tags to alerts". Apr 3 2025, 3:43 PM

Change #1127964 merged by jenkins-bot:

[analytics/refinery/source@master] Support inserting ResultKey into DeequVerificationSuiteToDataQualityAlerts

https://gerrit.wikimedia.org/r/1127964

Change #1127967 merged by TChin:

[analytics/refinery@master] Add columns to data_quality_alerts to support inserting ResultKey

https://gerrit.wikimedia.org/r/1127967

Seems to be working: the webrequest_analyzer dag runs normally and I can see the columns being filled in the table:

spark-sql (default)> select * from data_quality_alerts where source_table='`wmf`.`webrequest`' order by partition_ts desc limit 2;
source_table	partition_id	partition_ts	status	severity_level	value	constraint	error_message	pipeline_run_id	dataset_date	tags
`wmf`.`webrequest`	year=2025/month=4/day=4/hour=12	2025-04-04 12:00:00	Success	Warning	1.0	ComplianceConstraint(Compliance(x_analytics_duplicate_keys is non-negative,COALESCE(CAST(x_analytics_duplicate_keys AS DECIMAL(20,10)), 0.0) >= 0,None,List(x_analytics_duplicate_keys)))	NULL	scheduled__2025-04-04T12:00:00+00:00	1743781205453	{}
`wmf`.`webrequest`	year=2025/month=4/day=4/hour=12	2025-04-04 12:00:00	Failure	Warning	1.0	MaximumConstraint(Maximum(x_analytics_duplicate_keys,None))	Value: 1.0 does not meet the constraint requirement!	scheduled__2025-04-04T12:00:00+00:00	1743781205453	{}
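With the tags column populated, downstream analyses can slice the checks, e.g. the daily-vs-monthly reconcile case from the description. A minimal sketch of that kind of filtering (tag key and values are hypothetical, not yet written by any job):

```python
# Hypothetical alert records as they might come back from the table;
# the tags column distinguishes which reconcile produced each check.
alerts = [
    {"status": "Failure", "tags": {"reconcile": "daily"}},
    {"status": "Success", "tags": {"reconcile": "monthly"}},
    {"status": "Failure", "tags": {"reconcile": "monthly"}},
]

def failures_by_tag(rows: list, key: str, value: str) -> list:
    """Return failed checks whose tags contain key=value."""
    return [r for r in rows
            if r["status"] == "Failure" and r.get("tags", {}).get(key) == value]

monthly_failures = failures_by_tag(alerts, "reconcile", "monthly")
```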