In T375402#10254097, @xcollazo wrote:
Sad news:
Even though, as per T375402#10239416, the performance of the hourly ingest seemed to be great in my testing, after merging all changes into the production table, we were not able to reproduce the performance benefits.
The revision level MERGE INTO continues to take way more time than the allotted max of 1 hour.
At this time, I am throwing in the towel. There are more things to look into, like figuring out why I was not able to reproduce the gains, but there is a lot of other work to be done for Dumps 2.0 that needs attention. Thus, I think it is best, in the interest of time, to rest this work and bite the bullet: we will have to consume at a daily cadence rather than hourly.
Description
Status | Subtype | Assigned | Task
---|---|---|---
Open | | xcollazo | T358877 Dumps 2.0 Phase II: Production intermediate table milestone
Resolved | | xcollazo | T377999 Run Dumps 2.0 main DAG at a daily cadence rather than hourly.
Event Timeline
xcollazo updated https://gitlab.wikimedia.org/repos/data-engineering/dumps/mediawiki-content-dump/-/merge_requests/41
Allow processing a whole day.
xcollazo updated https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/888
Draft: Run Dumps 2.0 main DAG at a daily cadence rather than hourly.
xcollazo merged https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/888
Run Dumps 2.0 main DAG at a daily cadence rather than hourly.
Mentioned in SAL (#wikimedia-operations) [2024-10-24T15:45:18Z] <xcollazo@deploy2002> Started deploy [airflow-dags/analytics@325d943]: Deploy latest DAGs to analytics Airflow instance. T377999.
Mentioned in SAL (#wikimedia-analytics) [2024-10-24T15:46:38Z] <xcollazo> Deploy latest DAGs to analytics Airflow instance. T377999.
Mentioned in SAL (#wikimedia-operations) [2024-10-24T15:47:12Z] <xcollazo@deploy2002> Finished deploy [airflow-dags/analytics@325d943]: Deploy latest DAGs to analytics Airflow instance. T377999. (duration: 01m 07s)
xcollazo merged https://gitlab.wikimedia.org/repos/data-engineering/dumps/mediawiki-content-dump/-/merge_requests/41
Allow processing a whole day.
xcollazo updated https://gitlab.wikimedia.org/repos/data-engineering/dumps/mediawiki-content-dump/-/merge_requests/42
Push down earliest rev_dt per wiki on the revision level MERGE INTO
xcollazo updated https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/894
Pickup 'Push down earliest rev_dt per wiki on the revision level MERGE INTO'
xcollazo merged https://gitlab.wikimedia.org/repos/data-engineering/dumps/mediawiki-content-dump/-/merge_requests/42
Push down earliest rev_dt per wiki on the revision level MERGE INTO
xcollazo merged https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/894
Pickup 'Push down earliest rev_dt per wiki on the revision level MERGE INTO'
xcollazo opened https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/906
Sync up DagProperties of dumps_merge_events_to_wikitext_raw_daily with overrides.
xcollazo merged https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/906
Sync up DagProperties of dumps_merge_events_to_wikitext_raw_daily with overrides.