Airflow job to orchestrate the dumps reconciliation emission mechanism
Closed, Resolved · Public · 3 Estimated Story Points

Description

See T368753 for details.

Event Timeline

xcollazo changed the task status from Open to In Progress. (Jul 29 2024, 4:53 PM)
xcollazo set the point value for this task to 3.

Mentioned in SAL (#wikimedia-operations) [2024-07-31T21:16:01Z] <xcollazo@deploy1003> Started deploy [airflow-dags/analytics@82674dc]: deploy hot airflow analytics dag hot fix T368756

Mentioned in SAL (#wikimedia-operations) [2024-07-31T21:17:07Z] <xcollazo@deploy1003> Finished deploy [airflow-dags/analytics@82674dc]: deploy hot airflow analytics dag hot fix T368756 (duration: 01m 05s)

xcollazo merged https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/785

Add a dblist include/exclude mechanism for dumps_reconcile_wikitext_raw_daily.
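
As a rough illustration of what an include/exclude mechanism like this can look like, here is a minimal sketch, assuming a dblist is simply a list of wiki database names. The function name `select_wikis`, its parameters, and its exact semantics are hypothetical and are not taken from the merge request.

```python
from typing import Iterable, List, Optional


def select_wikis(
    all_wikis: Iterable[str],
    include: Optional[Iterable[str]] = None,
    exclude: Optional[Iterable[str]] = None,
) -> List[str]:
    """Return the wikis to process, preserving input order.

    Assumed (hypothetical) semantics: if `include` is given, only those
    wikis are candidates; anything listed in `exclude` is then dropped.
    """
    include_set = set(include) if include is not None else None
    exclude_set = set(exclude or [])
    return [
        wiki
        for wiki in all_wikis
        if (include_set is None or wiki in include_set)
        and wiki not in exclude_set
    ]
```

For example, `select_wikis(["enwiki", "simplewiki", "wikidatawiki"], exclude=["wikidatawiki"])` keeps only `enwiki` and `simplewiki`, which is the kind of filter that lets a problematic wiki be skipped without touching the rest of the run.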

Mentioned in SAL (#wikimedia-operations) [2024-08-02T16:00:00Z] <xcollazo@deploy1003> Started deploy [airflow-dags/analytics@d573c40]: Deploy latest DAGs for analytics Airflow instance. T368756

Mentioned in SAL (#wikimedia-operations) [2024-08-02T16:01:02Z] <xcollazo@deploy1003> Finished deploy [airflow-dags/analytics@d573c40]: Deploy latest DAGs for analytics Airflow instance. T368756 (duration: 01m 02s)

Copying my own message from Slack:

Folks, I’m going to be OOO next week. I noticed that there are transient issues with dumps_reconcile_wikitext_raw_daily, so I am going to leave it paused for the time being. Dynamic Task Mapping’s Airflow UI is super confusing BTW…

I have now returned from OOO, deleted the old runs with mismatched operators, and reset the DAG with a start date of 2024-08-11. I will now test whether we see these issues again.

Ottomata renamed this task from Airflow job to orchestrate the emission mechanism to Airflow job to orchestrate the dumps reconciliation emission mechanism. (Aug 12 2024, 4:31 PM)
Ottomata renamed this task from Airflow job to orchestrate the dumps reconciliation emission mechanism to Airflow job to orchestrate the dumps reconcilliation emission mechanism.

Mentioned in SAL (#wikimedia-analytics) [2024-08-12T17:06:08Z] <xcollazo> Ran " ALTER TABLE wmf_dumps.wikitext_inconsistent_rows_rc1 SET TBLPROPERTIES ( 'commit.retry.num-retries' = '10' ); ". T368756.

This ALTER should resolve most of the sporadic failures caused by Iceberg exhausting its commit retries. By default Iceberg retries a commit 4 times (commit.retry.num-retries); this bumps it to 10. I will reflect this change in code shortly.
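
To illustrate why a higher retry count helps when several writers commit to the same table concurrently, here is a simplified model of an optimistic commit loop with backoff. This is not Iceberg's implementation: `commit_with_retries` and `CommitConflict` are hypothetical names, and the wait defaults are assumptions chosen only for illustration.

```python
import time
from typing import Callable


class CommitConflict(Exception):
    """Raised by the (hypothetical) commit function when another writer won."""


def commit_with_retries(
    try_commit: Callable[[], None],
    num_retries: int = 4,
    min_wait_s: float = 0.1,
    max_wait_s: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Attempt a commit, retrying on conflict with exponential backoff.

    Returns the number of attempts used. Re-raises CommitConflict once the
    initial attempt plus `num_retries` retries have all failed.
    """
    wait = min_wait_s
    for attempt in range(1, num_retries + 2):  # 1 initial try + retries
        try:
            try_commit()
            return attempt
        except CommitConflict:
            if attempt == num_retries + 1:
                raise  # retries exhausted; surface the failure to the caller
            sleep(wait)
            wait = min(wait * 2, max_wait_s)  # back off, capped at max_wait_s
    raise AssertionError("unreachable")
```

In this model, a writer that loses the race six times in a row fails with `num_retries=4` but succeeds on its seventh attempt with `num_retries=10`, which is the kind of sporadic failure the table property change is meant to absorb.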

The 2024-08-11 run took 01:21:16 and finished successfully, with what seems to be only one sporadic failure up to the point where we applied the ALTER above.

For completeness, this is what I ran in production:

ssh an-launcher1002.eqiad.wmnet
sudo -u analytics bash
kerberos-run-command analytics spark3-sql
ALTER TABLE wmf_dumps.wikitext_inconsistent_rows_rc1 SET TBLPROPERTIES ( 'commit.retry.num-retries' = '10' );