
Implement alerting for wmf_content.mediawiki_content_history_v1
Closed, Resolved (Public)

Description

Deequ contains a VerificationSuite for validating assumptions on a dataframe. We can apply verification checks either to a mediawiki_content_history_v1 dataframe directly or to its computed data quality metrics.

We can do this on a per-wiki basis or across the entire dataset. For example:

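from pydeequ.checks import Check, CheckLevel
from pydeequ.verification import VerificationSuite

# Completeness metrics already computed for this partition of the dataset.
df = (spark.table("wmf_data_ops.data_quality_metrics")
      .where(f"partition_dt = CAST('{args.partition_dt}' AS TIMESTAMP)")
      .where("name = 'Completeness'")
      .where("tags['project'] = 'mediawiki_content_history'"))

# Every Completeness value must be 1, so the minimum must be 1.
# The check description string is illustrative.
check = Check(spark, CheckLevel.Error, "mediawiki_content_history completeness")

checkResult = (VerificationSuite(spark)
    .onData(df)
    .addCheck(check.hasMin("value", lambda x: x == 1, "mediawiki_content_history is not complete"))
    .run())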
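
To make the per-wiki versus whole-dataset choice concrete, here is a minimal plain-Python sketch of what a hasMin check with `x == 1` amounts to on Completeness metric rows. The row layout and the `wiki_id` tag key are assumptions for illustration, not the actual schema of the metrics table.

```python
from collections import defaultdict

# Hypothetical metric rows as (tags, value). The "wiki_id" tag key is an assumption.
rows = [
    ({"project": "mediawiki_content_history", "wiki_id": "enwiki"}, 1.0),
    ({"project": "mediawiki_content_history", "wiki_id": "enwiki"}, 1.0),
    ({"project": "mediawiki_content_history", "wiki_id": "dewiki"}, 0.98),
]


def min_is_one(values):
    """Same assertion as hasMin("value", lambda x: x == 1): the smallest value must be 1."""
    return min(values) == 1


def check_global(rows):
    """One pass/fail verdict across the entire dataset."""
    return min_is_one([value for _, value in rows])


def check_per_wiki(rows):
    """One pass/fail verdict per wiki, so a failure names the wiki at fault."""
    by_wiki = defaultdict(list)
    for tags, value in rows:
        by_wiki[tags["wiki_id"]].append(value)
    return {wiki: min_is_one(values) for wiki, values in by_wiki.items()}


global_ok = check_global(rows)
per_wiki_ok = check_per_wiki(rows)
```

The global verdict only says that something is incomplete, while the per-wiki verdict pinpoints where, at the cost of producing one result per wiki.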

Once the checks are in place, we need to write a file with these alerts to HDFS and then pick it up in Airflow with the HdfsEmailOperator.
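
As a rough sketch of that hand-off, failed checks could be rendered into a single plain-text body that the Spark job writes to HDFS and Airflow later emails. The `CheckOutcome` record and the `render_alerts` helper are hypothetical names for illustration. The HDFS write and the HdfsEmailOperator configuration are deliberately left out, since their details are not given here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CheckOutcome:
    """Hypothetical, simplified view of one verification constraint result."""
    constraint: str
    status: str  # "Success" or "Failure"
    message: str


def render_alerts(outcomes: List[CheckOutcome]) -> str:
    """Return the alert file body, or an empty string when nothing failed."""
    failures = [o for o in outcomes if o.status != "Success"]
    if not failures:
        return ""
    lines = [f"{len(failures)} data quality check(s) failed:"]
    lines += [f"- {o.constraint}: {o.message}" for o in failures]
    return "\n".join(lines) + "\n"


outcomes = [
    CheckOutcome("MinimumConstraint(value)", "Failure",
                 "mediawiki_content_history is not complete"),
    CheckOutcome("SizeConstraint", "Success", ""),
]
body = render_alerts(outcomes)
```

In this sketch an empty body means there is nothing to send. How the real pipeline signals "no alerts" is not specified here.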

Details

Related Changes in GitLab:

Title | Reference | Author | Source Branch | Dest Branch
Fix alert path sanitization | repos/data-engineering/airflow-dags!1324 | tchin | fix-alert-path | main
Bump mw content artifact to 0.7.0 | repos/data-engineering/airflow-dags!1314 | tchin | bump-mw-content-0.7.0 | main
Fix alerts on no data | repos/data-engineering/mediawiki-content-pipelines!67 | tchin | fix-alerts-no-data | main
Bump refinery version for mw_content_reconcile_mw_content_history_daily | repos/data-engineering/airflow-dags!1304 | tchin | bump-mw-content-refinery | main
Write all alerts to single file | repos/data-engineering/mediawiki-content-pipelines!66 | tchin | write-all-alerts | main
Add alerts to mw content history daily | repos/data-engineering/airflow-dags!1286 | tchin | mw-content-history-dq-alerts | main

Event Timeline

Here's an issue I currently see: the wmf_data_ops.data_quality_alerts table doesn't have a column for metadata such as tags, the way the metrics table does. This doesn't affect the alerting itself, but it would affect any future analyses and dashboarding someone might want to do on the verification checks. For instance, if we want to alert on T388439, there currently isn't a way to differentiate records in the table that check monthly reconciles from those that check daily ones. Even now, there's an open question whether the source_table column in the alerts table should refer to wmf_data_ops.data_quality_metrics or to the underlying table that the metrics were computed against.

To support T388439 and future use cases, before I enable alerting I'm going to work on some patches that'll allow inserting tags into the alerts table using Deequ's ResultKey class, so it (kinda) aligns with the way metrics work.

Also, it's a bit weird to call it the alerts table when it doesn't store alerts but rather the verification checks that, if failed, will trigger an alert. But that's some bikeshedding for some future time, maybe.

Change #1127964 had a related patch set uploaded (by TChin; author: TChin):

[analytics/refinery/source@master] Support inserting ResultKey into DeequVerificationSuiteToDataQualityAlerts

https://gerrit.wikimedia.org/r/1127964

Change #1127967 had a related patch set uploaded (by TChin; author: TChin):

[analytics/refinery@master] Add columns to data_quality_alerts to support inserting ResultKey

https://gerrit.wikimedia.org/r/1127967

> For instance if we want to alert on T388439 there isn't a way currently to differentiate records in the table that are checking monthly vs daily reconciles.

Makes sense, thanks for fixing this gap.

When you have a final ALTER for the table, please log it in here for completeness.

Change #1127964 merged by jenkins-bot:

[analytics/refinery/source@master] Support inserting ResultKey into DeequVerificationSuiteToDataQualityAlerts

https://gerrit.wikimedia.org/r/1127964

Change #1127967 merged by TChin:

[analytics/refinery@master] Add columns to data_quality_alerts to support inserting ResultKey

https://gerrit.wikimedia.org/r/1127967

Altered table:

ALTER TABLE wmf_data_ops.data_quality_alerts ADD COLUMNS (
    dataset_date BIGINT COMMENT 'AWS Deequ resultKey: key insertion time.',
    tags MAP<STRING,STRING> COMMENT 'AWS Deequ resultKey: key tags.'
);
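
With these two columns, alert rows can carry the same kind of tags as metrics rows, which is what makes daily and monthly reconcile checks distinguishable. The sketch below models that in plain Python. The `reconcile_type` tag key and its values are assumptions for illustration, and `dataset_date` is assumed to hold epoch milliseconds.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class AlertRow:
    """Hypothetical subset of a data_quality_alerts row after the ALTER."""
    constraint: str
    status: str
    dataset_date: int  # assumed to be epoch milliseconds
    tags: Dict[str, str] = field(default_factory=dict)


def with_tag(rows: List[AlertRow], key: str, value: str) -> List[AlertRow]:
    """Keep only the rows whose tags map contains key == value."""
    return [r for r in rows if r.tags.get(key) == value]


rows = [
    AlertRow("MinimumConstraint", "Failure", 1746900000000,
             {"reconcile_type": "daily"}),
    AlertRow("MinimumConstraint", "Success", 1746900000000,
             {"reconcile_type": "monthly"}),
    AlertRow("MinimumConstraint", "Success", 1746900000000),  # untagged legacy row
]

daily = with_tag(rows, "reconcile_type", "daily")
monthly = with_tag(rows, "reconcile_type", "monthly")
```

Rows written before the ALTER have no tags, so a lookup that tolerates a missing key (as `tags.get` does here) keeps them from breaking the filter.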

Mentioned in SAL (#wikimedia-operations) [2025-05-11T22:54:04Z] <tchin@deploy1003> Started deploy [airflow-dags/analytics@301c74b]: Deploying airflow artifacts for T384962

Mentioned in SAL (#wikimedia-operations) [2025-05-11T22:55:32Z] <tchin@deploy1003> Finished deploy [airflow-dags/analytics@301c74b]: Deploying airflow artifacts for T384962 (duration: 02m 01s)

Deployed, and updated the Airflow variables to use artifact v0.6.0.

Mentioned in SAL (#wikimedia-operations) [2025-05-13T11:40:59Z] <tchin@deploy1003> Started deploy [airflow-dags/analytics@146dab1]: Deploying airflow artifacts for T384962

Mentioned in SAL (#wikimedia-operations) [2025-05-13T11:43:20Z] <tchin@deploy1003> Finished deploy [airflow-dags/analytics@146dab1]: Deploying airflow artifacts for T384962 (duration: 02m 44s)

Manually killed zombie YARN apps via:

analytics@an-launcher1002:/home/xcollazo$ yarn application -kill application_1741864027385_1222383
Killing application application_1741864027385_1222383
25/05/13 14:14:01 INFO impl.YarnClientImpl: Killed application application_1741864027385_1222383
analytics@an-launcher1002:/home/xcollazo$ yarn application -kill application_1741864027385_1245099
Killing application application_1741864027385_1245099
25/05/13 14:14:53 INFO impl.YarnClientImpl: Killed application application_1741864027385_1245099
analytics@an-launcher1002:/home/xcollazo$ yarn application -kill application_1741864027385_1222368
Killing application application_1741864027385_1222368
25/05/13 14:16:00 INFO impl.YarnClientImpl: Killed application application_1741864027385_1222368
analytics@an-launcher1002:/home/xcollazo$ yarn application -kill application_1741864027385_1245087
Killing application application_1741864027385_1245087
25/05/13 14:16:12 INFO impl.YarnClientImpl: Killed application application_1741864027385_1245087

Mentioned in SAL (#wikimedia-operations) [2025-05-13T15:00:25Z] <tchin@deploy1003> Started deploy [airflow-dags/analytics@0550b16]: Deploying airflow artifacts for T384962

Mentioned in SAL (#wikimedia-operations) [2025-05-13T15:02:25Z] <tchin@deploy1003> Finished deploy [airflow-dags/analytics@0550b16]: Deploying airflow artifacts for T384962 (duration: 02m 22s)