Page MenuHomePhabricator

Accelerate sqoop landing for MediaWiki History private tables
Open, MediumPublic

Description

Summary

The MediaWiki History monthly DAG (mediawiki_history_denormalize) averages ~56h end-to-end wall-clock, of which ~46h (~80%) is sqoop wait, not Spark compute. The tail in every recent month is three tables from the mediawiki_private database — actor, comment, and centralauth_localuser — which consistently land 1–2 days after the public sqoop tables. Spark can't start the monthly rebuild until all sensors are green, so these three tables alone gate the entire job.

Pulling these private tables onto (or close to) the public-table timeline would save roughly a full day of DAG wall-clock for near-zero engineering cost on the history-pipeline side. This is the single highest-leverage freshness win available short of the full daily-delta work.

Findings from the last 5 monthly runs

Data pulled from Airflow URLSensor completion times in mediawiki_history_denormalize (Airflow link):

SnapshotSpark denormalize_historyDAG wall-clockSqoop waitNotes
2026-0311h 04m44h 13m~33h
2026-0214h 23m52h 53m~38htry_number=4 (3 failed Spark retries)
2026-019h 12m41h 24m~32h
2025-129h 23m66h 10m~57h
2025-118h 58m78h 46m~70hSLA miss
avg~10.6h~56.6h~46h

In every one of these five runs, the last three URLSensors to complete were actor, comment, and centralauth_localuser. Public-database tables (revision, archive, page, logging, user, user_groups, change_tag, etc.) typically land ~24–48h earlier.

Why this matters

  • Headline number: ~46h of a ~56h monthly wall-clock is sqoop wait. Even halving the Spark side (Option C in the umbrella plan) would only save ~5h — moving the private-table timeline is worth ~5× that.
  • Cadence prerequisite: as long as a monthly rebuild is 3–7 days late every month, any weekly cadence for the monthly-style full rebuild is infeasible — regardless of what we do in Spark.
  • Retry slack: 2026-02 used try 4, indicating 3 failed Spark attempts. Reducing sqoop-side jitter also reduces downstream retry slack; if the pipeline has more breathing room before the SLA, retries cost less wall-clock.

Proposed work

  1. Root-cause why actor, comment, centralauth_localuser land later than the public tables.
  2. Move these three tables onto the earlier timeline. The specific change depends on the root cause.
  3. Verify savings over two monthly runs before declaring the work done. Target: monthly DAG wall-clock drops by ~24h consistently.

Event Timeline

xcollazo reassigned this task from xcollazo to APizzata-WMF.
xcollazo triaged this task as Medium priority.

All the sqoop commands are executed by the service refinery-sqoop-whole-mediawiki.service that runs the bin /usr/local/bin/refinery-sqoop-whole-mediawiki.
The bin launches the following commands sequentially (notice the &&):

/usr/local/bin/refinery-sqoop-mediawiki-history \
  && /usr/local/bin/refinery-sqoop-mediawiki-production-history \
  && /usr/local/bin/refinery-sqoop-centralauth-production \
  && /usr/local/bin/refinery-sqoop-mediawiki-not-history \
  && /usr/local/bin/refinery-sqoop-mediawiki-production-not-history

All the public tables necessary for the DAG are ingested via refinery-sqoop-mediawiki-history (the first step of the pipeline), actor and comment through refinery-sqoop-mediawiki-production-history (the second step), and centralauth_localuser through refinery-sqoop-centralauth-production (third step).

A small win could be achieved by switching the order of the scripts (switching #3 with #1) but we would still be susceptible to delays if a node in the chain fails.

Possible options would be:

  1. split the single service in different services to parallelize and make them independent.
  2. change the logic of the service to handle parallel runs.

Questions:

  1. why the pipeline was designed in a sequential way?
  2. since two of the 3 we would parallelize are hitting the production database, would it be a hard no to the parallel solution?

After a discussion with @xcollazo and @JAllemandou we decided to create 3 parallel processes:

/usr/local/bin/refinery-sqoop-mediawiki-history \
  && /usr/local/bin/refinery-sqoop-mediawiki-not-history
/usr/local/bin/refinery-sqoop-mediawiki-production-history \
  && /usr/local/bin/refinery-sqoop-mediawiki-production-not-history
/usr/local/bin/refinery-sqoop-centralauth-production

I am currently running locally in parallel the first 3 of each group to quantify the times and the benefits of this parallel runs.

The run of /usr/local/bin/refinery-sqoop-mediawiki-production-history took about 2 hours, from 2026-05-05T09:04:38 to 2026-05-05T11:25:48.
The /usr/local/bin/refinery-sqoop-centralauth-production job completed in roughly 20 minutes, from 2026-05-05T09:10:13 to 2026-05-05T09:33:02.

/usr/local/bin/refinery-sqoop-mediawiki-history is still running, so to estimate its duration I looked at the last full monthly run triggered by the service. For the 2026-04 snapshot, that run started at midnight on the first of the month and finished around 2026-05-02 05:53 (timestamp of the _SUCCESS files for the sqooped tables), which is roughly 30 hours.

These preliminary numbers suggest we can reduce the overall Sqoop wait time by a couple of hours without touching refinery-sqoop-mediawiki-history. However, that script is by far the heaviest.
Using hdfs dfs -du -h -s on all tables it sqoops for the 2026-04 partition, I retrieved the following volumes:

  • slot_roles: 381.0 K
  • change_tag_def: 1.2 M
  • user_groups: 1.6 M
  • user: 5.7 G
  • archive: 11.2 G
  • change_tag: 16.3 G
  • page: 24.2 G
  • logging: 38.5 G
  • slots: 40.2 G
  • revision: 133.8 G
  • content: 261.6 G

Given this data, one idea is to split the tables into two groups by size and measure how much time we can save by running the smaller subset separately. I’m currently testing how long it takes to sqoop the following (smaller) set of tables: slot_roles, change_tag_def, user_groups, user, archive.

In addition, we can likely shave off more time by reducing the sensor poke_interval from 1 hour to something like 20 minutes. In the 2026-04 snapshot, the actor and comment tables finished sqooping at 2026-05-02 08:25, but the wait_for_data_in_actor sensor only detected this at 2026-05-02 09:11:18 UTC, i.e. almost 50 minutes later purely due to the current poke interval.

content: 261.6 G

Surprising size considering content actually has no content :D

we can likely shave off more time by reducing the sensor poke_interval from 1 hour to something like 20 minutes

+1

Given this data, one idea is to split the tables into two groups by size and measure how much time we can save by running the smaller subset separately. I’m currently testing how long it takes to sqoop the following (smaller) set of tables: slot_roles, change_tag_def, user_groups, user, archive.

This took roughly 5 hours:
start at 2026-05-06T09:40:45
finished at 2026-05-06T14:35:52

So if we could run this set of tables in parallel to change_tag,content,logging,page,revision,slots we could gain back 5 hours that added to the 2 hours and 20 minutes stated in T424355#11893543 could lead to almost 8 hours reduction in time.

Questions are: what are the blockers to split /usr/local/bin/refinery-sqoop-mediawiki-history in parallel runs? Stress on the clouddb?

Questions are: what are the blockers to split /usr/local/bin/refinery-sqoop-mediawiki-history in parallel runs? Stress on the clouddb?

That's right, we would put more stress on clouddb.
You can find the DB usage here: https://grafana.wikimedia.org/goto/cfl99k2j7w7b4b?orgId=1
It seems we could parallelize queries on small wikis more. The high spike we see being queries on the big wikis.

Another idea is to sqoop more tables from the analytics-replicas. We've been toalking about with @Milimetric for a long time. Maybe now could be the time?

After analyzing the graph I pasted above a bit more, I found that sqooping the content tables from clouddb is really inefficient, in the same way it was for actor and comment, because the sanitization-view requires joins: https://github.com/wikimedia/operations-puppet/blob/60639a7b7b6d954a9a47cc1a02f894e00189ddfb/modules/profile/templates/wmcs/db/wikireplicas/maintain-views.yaml#L350.
I recommend we sqoop this table from the analytics-replicas. This will make it a lot faster and will allow us to gain a few hours (expected 8h).

I recommend we sqoop this table from the analytics-replicas.

I can test this and come back with the results!

Note about what it means to change the sqooping from clouddb to analytics-replicas:

  • Change the sqoop launchers
  • Move the destination folder to /wmf/data/raw/mediawiki_private/tables/content instead of /wmf/data/raw/mediawiki/tables/content
  • Copy schema definition to have it named mediawiki_private, under the private folder
  • Apply schema changes

From the DB graphs, it seems we could parallelize the sqooping of small wikis more. Our problem will then be that an-launcher1003 is already full.

I can test this and come back with the results!

this took roughly 2 hours (started at 2026-05-06T15:55:02 ended at 2026-05-06T17:48:14).

Will now do a complete test of all the scripts in parallel with this change.

From the DB graphs, it seems we could parallelize the sqooping of small wikis more. Our problem will then be that an-launcher1003 is already full.

You mean the set of drivers running on an-launcher1003 at the beginning of the month already saturate the available RAM?

Note we had done a RAM bump recently: T411100: Bump mem of an-launcher1003.eqiad.wmnet to 32GB.

Will now do a complete test of all the scripts in parallel with this change.

Here are my results:

    • /usr/local/bin/refinery-sqoop-mediawiki-history, which sqooped the tables archive,change_tag,change_tag_def,logging,page,revision,slots,slot_roles,user,user_groups, started at 2026-05-07T07:14:28 UTC, finished at 2026-05-07T21:11:59 UTC and took 14 h.
    • /usr/local/bin/refinery-sqoop-mediawiki-production-history, which sqooped these tables actor,comment,content, started at 2026-05-07T07:14:54 UTC, finished at 2026-05-07T11:57:57 UTC and took 4h 30m
  • /usr/local/bin/refinery-sqoop-centralauth-production, which sqooped the tables centralauth_globaluser,centralauth_localuser, started at 2026-05-07T07:15:27 UTC, finished at 2026-05-07T07:34:32 UTC and took 19 m.

With these results the maximum waiting time would be 14 h if we decide to move the content table out of the main refinery-sqoop-mediawiki-history job.

Additionally this is the status of the Cloud DB during the operations.

The peaks are:

  • 07:54 UTC is connected to the enwiki.slots table.
  • from '09:06 UTC` to 09:30 UTC is connected to the wikidatawiki.slots table.
  • from '10:18 UTC` to 10:30 UTC is connected to the commonswiki.slots table.

I treated any user CPU utilization above 80% as a peak.

Created couple MR:

Once these are merged will follow @JAllemandou commentL

  • Copy schema definition to have it named mediawiki_private, under the private folder
  • Apply schema changes

Additionally:

  • move data from /wmf/data/raw/mediawiki/tables/content' to /wmf/data/raw/mediawiki_private/tables/content'
  • repair partitions

Change #1285335 had a related patch set uploaded (by A-pizzata; author: A-pizzata):

[operations/puppet@production] changes to accelerate sqoop landing for mediawiki_history_incremental_v1

https://gerrit.wikimedia.org/r/1285335

Change #1285337 had a related patch set uploaded (by A-pizzata; author: A-pizzata):

[analytics/refinery@master] change create table for mediawiki_content to become private

https://gerrit.wikimedia.org/r/1285337

With these results the maximum waiting time would be 14 h if we decide to move the content table out of the main refinery-sqoop-mediawiki-history job.

Am I reading this correctly that we would go from ~46h sqoop wait time to ~14h?

Change #1285335 had a related patch set uploaded (by A-pizzata; author: A-pizzata):

[operations/puppet@production] changes to accelerate sqoop landing for mediawiki_history_incremental_v1. This is done by changing the previous refinery-sqoop-whole-mediawiki.sh from one big sequential set of sqoops to a parallel structure: - refinery-sqoop-mediawiki-centralauth-production.sh to run the refinery-sqoop-centralauth-production to sqoop the centralauth production tables. - refinery-sqoop-mediawiki-clouddb.sh to replace refinery-sqoop-whole-mediawiki.sh and to run in sequence refinery-sqoop-mediawiki-history and refinery-sqoop-mediawiki-not-history to sqoop the cloudb tables. - refinery-sqoop-mediawiki-production.sh to run in sequence refinery-sqoop-mediawiki-production-history and refinery-sqoop-mediawiki-production-not-history to sqoop production replicas tables.

https://gerrit.wikimedia.org/r/1285335

Change #1285335 had a related patch set uploaded (by A-pizzata; author: A-pizzata):

[operations/puppet@production] changes to accelerate sqoop landing for mediawiki_history_incremental_v1

https://gerrit.wikimedia.org/r/1285335

Change #1290716 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Remove absented sqoop resources

https://gerrit.wikimedia.org/r/1290716

Change #1285335 merged by Btullis:

[operations/puppet@production] changes to accelerate sqoop landing for mediawiki_history_incremental_v1

https://gerrit.wikimedia.org/r/1285335

Change #1290716 merged by Btullis:

[operations/puppet@production] Remove absented sqoop resources

https://gerrit.wikimedia.org/r/1290716

Change #1285335 merged by Btullis:

[operations/puppet@production] changes to accelerate sqoop landing for mediawiki_history_incremental_v1

https://gerrit.wikimedia.org/r/1285335

I think we can merge the Airflow changes now?

Yes, tomorrow I will validate everything on an-launcher and merge the other MRs.

Change #1285337 had a related patch set uploaded (by A-pizzata; author: A-pizzata):

[analytics/refinery@master] change create table for mediawiki_content to become private

https://gerrit.wikimedia.org/r/1285337

Change #1285337 had a related patch set uploaded (by A-pizzata; author: A-pizzata):

[analytics/refinery@master] change create table for mediawiki_content to become private

https://gerrit.wikimedia.org/r/1285337

Change #1293659 had a related patch set uploaded (by A-pizzata; author: A-pizzata):

[analytics/refinery/source@master] Change the mediawiki_content from public to private

https://gerrit.wikimedia.org/r/1293659

Change #1293659 merged by jenkins-bot:

[analytics/refinery/source@master] Change the mediawiki_content from public to private

https://gerrit.wikimedia.org/r/1293659

Change #1285337 merged by A-pizzata:

[analytics/refinery@master] change create table for mediawiki_content to become private

https://gerrit.wikimedia.org/r/1285337

Change #1294816 had a related patch set uploaded (by A-pizzata; author: A-pizzata):

[analytics/refinery@master] Change mediawiki_content to private.

https://gerrit.wikimedia.org/r/1294816

Change #1294816 merged by A-pizzata:

[analytics/refinery@master] Change mediawiki_content to private.

https://gerrit.wikimedia.org/r/1294816

executed the following commands:

screen -S T424355-copy-content
sudo -u analytics kerberos-run-command analytics hdfs dfs -mkdir -p /wmf/data/raw/mediawiki_private/tables/content
 sudo -u analytics kerberos-run-command analytics hdfs dfs -cp /wmf/data/raw/mediawiki/tables/content/* /wmf/data/raw/mediawiki_private/tables/content


 sudo -u analytics kerberos-run-command analytics hdfs dfs -ls  /wmf/data/raw/mediawiki_private/tables/content

 sudo -u analytics kerberos-run-command analytics hdfs dfs -du -h -s  /wmf/data/raw/mediawiki_private/tables/content/*
254.8 G  /wmf/data/raw/mediawiki_private/tables/content/snapshot=2025-11
256.2 G  /wmf/data/raw/mediawiki_private/tables/content/snapshot=2025-12
257.6 G  /wmf/data/raw/mediawiki_private/tables/content/snapshot=2026-01
258.9 G  /wmf/data/raw/mediawiki_private/tables/content/snapshot=2026-02
260.2 G  /wmf/data/raw/mediawiki_private/tables/content/snapshot=2026-03
261.6 G  /wmf/data/raw/mediawiki_private/tables/content/snapshot=2026-04

sudo -u analytics kerberos-run-command analytics hdfs dfs -du -h -s  /wmf/data/raw/mediawiki/tables/content/*
254.8 G  /wmf/data/raw/mediawiki/tables/content/snapshot=2025-11
256.2 G  /wmf/data/raw/mediawiki/tables/content/snapshot=2025-12
257.6 G  /wmf/data/raw/mediawiki/tables/content/snapshot=2026-01
258.9 G  /wmf/data/raw/mediawiki/tables/content/snapshot=2026-02
260.2 G  /wmf/data/raw/mediawiki/tables/content/snapshot=2026-03
261.6 G  /wmf/data/raw/mediawiki/tables/content/snapshot=2026-04


sudo -u analytics kerberos-run-command analytics hive
msck repair table wmf_raw.mediawiki_private_content;
OK
Time taken: 20.479 seconds

Will monitor next sqooping process on Monday.

Yesterday's sqoop process terminated ingesting all the necessary tables for MWH at 2026-06-01 15:22 UTC. We are not that much far off the 14 hrs result obtained during testing!
Further optimization could be applied by increasing the parallelism of the smaller wiki, as long as it does not impact the performance of an-launcher1003.

Yesterday's sqoop process terminated ingesting all the necessary tables for MWH at 2026-06-01 15:22 UTC. We are not that much far off the 14 hrs result obtained during testing!
Further optimization could be applied by increasing the parallelism of the smaller wiki, as long as it does not impact the performance of an-launcher1003.

~14 hours is amazing. 🎉

Are we done here @APizzata-WMF ?

the task requests:

Verify savings over two monthly runs before declaring the work done. Target: monthly DAG wall-clock drops by ~24h consistently.

We can wait the beginning of July and close the task?

We can wait the beginning of July and close the task?

Ah yes, makes sense.