
Add metrics for monthly reconciles
Closed, ResolvedPublic

Description

On T357684, we built a metrics job for our daily reconciles, also known as last-24h runs per the schema of wmf_content.inconsistencies_of_mediawiki_content_history_v1.

On this task we should extend this mechanism to also produce metrics when we do monthly reconciles (aka all-of-wiki-time).

  • Modify compute_metrics.py to accept a compute_class parameter of last-24h | all-of-wiki-time.
  • Add metrics to the mw_content_reconcile_mw_content_history_monthly DAG.
  • Modify the data being kept in the metrics table accordingly so that we can segregate between last-24h | all-of-wiki-time runs.
  • Modify dashboard to include this data.
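The first bullet above can be sketched as follows (a minimal illustration assuming compute_metrics.py uses argparse; the flag name and defaults are assumptions, not the actual script):

```python
import argparse

def parse_args(argv=None):
    """Parse CLI arguments; --compute-class distinguishes reconcile runs."""
    parser = argparse.ArgumentParser(description="Compute reconcile metrics")
    parser.add_argument(
        "--compute-class",
        choices=["last-24h", "all-of-wiki-time"],
        default="last-24h",
        help="Which kind of reconcile run these metrics belong to",
    )
    return parser.parse_args(argv)

args = parse_args(["--compute-class", "all-of-wiki-time"])
print(args.compute_class)  # all-of-wiki-time
```

Restricting the value via choices lets the monthly DAG pass the flag explicitly while daily runs keep working unchanged.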

Details

Related Changes in GitLab:
  • Add metrics for mw_content_history monthly reconcile: repos/data-engineering/airflow-dags!1492 (tchin, mw-content-history-metrics-wiki-time → main)
  • Use max_timestamp for partition_ts when computing metrics for all-of-wiki-time: repos/data-engineering/mediawiki-content-pipelines!73 (tchin, max-timestamp-for-all-of-wiki-time → main)
  • Add alerts to metrics computation: repos/data-engineering/mediawiki-content-pipelines!60 (tchin, add-alerts → main)

Event Timeline

Now that the metrics segregation is implemented, we should backfill the legacy metrics with a computation class before implementing the monthly metrics.

spark-sql (default)> SELECT COUNT(*) AS count
                   > FROM wmf_data_ops.data_quality_metrics
                   > WHERE tags['project'] = 'mediawiki_content_history'
                   >   AND (tags['computation_class'] IS NULL OR tags['computation_class'] = '')
                   > ;
count
577920

The metrics table has no unique index I can match on to update rows, so I had to match on almost every column, but it worked.

MERGE INTO wmf_data_ops.data_quality_metrics t
USING (
  SELECT
    dataset_date,
    tags,
    entity,
    instance,
    value,
    name,
    source_table,
    pipeline_run_id,
    partition_id,
    partition_ts,
    map_concat(tags, map('computation_class', 'last-24h')) AS updated_tags
  FROM wmf_data_ops.data_quality_metrics
  WHERE tags['project'] = 'mediawiki_content_history'
    AND (tags['computation_class'] IS NULL OR tags['computation_class'] = '')
    AND partition_ts <= TIMESTAMP '2025-05-10 00:00:00'
) s
ON (t.partition_id = s.partition_id
   AND t.dataset_date = s.dataset_date
   AND t.pipeline_run_id = s.pipeline_run_id
   AND t.name = s.name
   AND t.instance = s.instance
   AND t.value = s.value
   AND t.entity = s.entity
   AND t.source_table = s.source_table)
WHEN MATCHED THEN
  UPDATE SET tags = s.updated_tags;
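One caveat worth noting about matching on almost every column (an aside, not something observed in this run): standard SQL equality treats NULL = NULL as unknown, so any row with a NULL in one of the matched columns would silently fail the ON clause; Spark SQL's null-safe operator <=> avoids this. The difference, sketched in pure Python:

```python
def sql_eq(a, b):
    """Standard SQL '=': any comparison involving NULL (None) is unknown,
    which the ON clause treats as no match."""
    if a is None or b is None:
        return False
    return a == b

def null_safe_eq(a, b):
    """Spark SQL '<=>': NULL <=> NULL is true, NULL <=> value is false."""
    if a is None and b is None:
        return True
    if a is None or b is None:
        return False
    return a == b

# A row with a NULL in, say, `instance` would never match under '=',
# but would match itself under '<=>'.
print(sql_eq(None, None))        # False
print(null_safe_eq(None, None))  # True
```

If any of the matched columns can be NULL, rewriting the ON conditions with <=> would make the backfill cover those rows too.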
spark-sql (default)> SELECT COUNT(*) AS count
                   > FROM wmf_data_ops.data_quality_metrics
                   > WHERE tags['project'] = 'mediawiki_content_history'
                   >   AND (tags['computation_class'] IS NULL OR tags['computation_class'] = '')
                   > ;
count
0

Just noticed that in the metrics computation script, it deletes any duplicated metrics WHERE partition_ts = CAST('{args.min_timestamp}' AS TIMESTAMP) in case of reruns. However, for all-of-wiki-time, min_timestamp is always 2000-01-01T00:00:00, so every all-of-wiki-time run would share the same partition_ts and a rerun would wipe all previous ones. We need the partition_ts column to be the max_timestamp for this case.
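The fix can be sketched as choosing which boundary timestamp becomes partition_ts based on the computation class (an illustrative helper; the function and argument names are assumptions, not the actual pipeline code):

```python
def partition_ts_for(compute_class, min_timestamp, max_timestamp):
    """Pick the partition_ts used by the dedup-on-rerun DELETE.

    For last-24h runs min_timestamp varies per run, so it works as a
    partition key; for all-of-wiki-time it is the constant 2000-01-01,
    so max_timestamp must be used instead.
    """
    if compute_class == "all-of-wiki-time":
        return max_timestamp
    return min_timestamp

print(partition_ts_for("all-of-wiki-time",
                       "2000-01-01T00:00:00",
                       "2025-06-30T00:00:00"))  # 2025-06-30T00:00:00
```

This way each all-of-wiki-time run gets a distinct partition_ts, and a rerun only deletes its own earlier metrics rather than the whole history.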

> The metrics table has no unique index I can match on to update rows, so I had to match on almost every column, but it worked.
>
> MERGE INTO wmf_data_ops.data_quality_metrics t
> ...

Nice!

In this case I think you could have also used an UPDATE for simpler syntax, just FYI:

UPDATE wmf_data_ops.data_quality_metrics
SET tags = map_concat(tags, map('computation_class', 'last-24h'))
WHERE tags['project'] = 'mediawiki_content_history'
  AND (tags['computation_class'] IS NULL OR tags['computation_class'] = '')
  AND partition_ts <= TIMESTAMP '2025-05-10 00:00:00'

> Just noticed that in the metrics computation script, it deletes any duplicated metrics WHERE partition_ts = CAST('{args.min_timestamp}' AS TIMESTAMP) in case of reruns. However, for all-of-wiki-time, min_timestamp is always 2000-01-01T00:00:00. We need the partition_ts column to be the max_timestamp for this case.

Nice catch!

Mentioned in SAL (#wikimedia-operations) [2025-06-30T16:57:35Z] <tchin@deploy1003> Started deploy [airflow-dags/analytics@74e8d66]: Deploying artifacts for T388439

Mentioned in SAL (#wikimedia-operations) [2025-06-30T16:58:11Z] <tchin@deploy1003> Finished deploy [airflow-dags/analytics@74e8d66]: Deploying artifacts for T388439 (duration: 00m 52s)

Adjusted airflow variables to use the new conda artifact. Should be good to go now. Now the only question is how long the metrics computation will take...

@tchin are we done here? If so, can we tick the boxes and close?

tchin updated the task description.