
Airflow jobs to do monthly XML dumps
Closed, ResolvedPublic

Description

On T346278, we implemented a basic Airflow job that will trigger the XML dumps for simplewiki.

In this task, we should expand that work to:

  • Use dynamic task mapping to generate dump jobs for all currently open, public wikis (example implementation).
  • There should be two runs: one on the 1st of the month for the 'full' XML dumps (two jobs: all revisions and current revisions), and another around the 15th for the 'partial' run (only current revisions).
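The plan above can be sketched in plain Python (this is not Airflow code). It shows which dump kinds belong to each run and how one set of job parameters per wiki could be generated, which is the shape of input that dynamic task mapping expands into individual tasks. The wiki list, the function names, the dump-kind labels, and pinning the partial run to exactly the 15th are all assumptions made for illustration.

```python
from datetime import date

# Hypothetical sample; the real list would come from the open, public wikis.
WIKIS = ["simplewiki", "enwiki", "wikidatawiki"]


def jobs_for_run(run_date: date) -> list[str]:
    """Return the dump kinds scheduled for a given day of the month."""
    if run_date.day == 1:
        # 'full' run: all revisions and current revisions
        return ["all_revisions", "current_revisions"]
    if run_date.day == 15:
        # 'partial' run: current revisions only (exact day is an assumption)
        return ["current_revisions"]
    return []


def mapped_job_kwargs(run_date: date, wikis: list[str]) -> list[dict]:
    """Build one kwargs dict per (dump kind, wiki) pair, one mapped task each."""
    return [
        {"wiki": wiki, "dump_kind": kind, "run_date": run_date.isoformat()}
        for kind in jobs_for_run(run_date)
        for wiki in wikis
    ]


if __name__ == "__main__":
    for kwargs in mapped_job_kwargs(date(2025, 10, 1), WIKIS):
        print(kwargs)
```

In an actual DAG, a list like this would typically be produced by an upstream task and passed to a mapped operator, so that adding a wiki to the source list needs no DAG change.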

Out of scope:
Moving output from HDFS to the servers from where the dumps are distributed. We will figure that out separately.

Details

Related Changes in Gerrit:
Related Changes in GitLab:
Title	Reference	Author	Source Branch	Dest Branch
main: fix import order in mw_content_xml_export_dags	repos/data-engineering/airflow-dags!1727	kevinbazira	fix_imports_in_mw_content_xml_export_dags	main
main: Add DAGs to do MW Content File Export	repos/data-engineering/airflow-dags!1652	xcollazo	file-export-dag	main
common: always set airflow_api_kerberos_enabled to true	repos/data-engineering/airflow-dags!1649	xcollazo	fix-kerberos-api-calls	main

Event Timeline

xcollazo renamed this task from Airflow job to do monthly XML dumps to Airflow jobs to do monthly XML dumps. (Aug 1 2025, 6:38 PM)
xcollazo updated the task description.
xcollazo changed the task status from Open to In Progress. (Sep 3 2025, 2:56 PM)
xcollazo claimed this task.

I have an idea for helping with the publication stage of these dumps, although I understand that this part is out-of-scope for this ticket.

It is similar to what is set out here: T366248#11152410 for the current rework of the cirrussearch dumps.
Briefly, the idea is as follows:

  • You would have a task within your DAG (or a separate downstream DAG) that would use a KubernetesPodOperator with our sync-utils container image.
  • This task then takes responsibility for publishing the generated files to the dumps distribution servers, currently clouddumps100[1-2].

The sync-utils container already has the rclone utility installed, which has built-in support for HDFS locations.

We would configure the HDFS remote in rclone using the kerberos credential cache and Hadoop config files that are already present in the Airflow environment.
Then we would configure an SSH connection to one of the clouddumps100[1-2] hosts for the other side. For this we would use an SSH private key.

Then we can use an rclone sync command to copy files from the source to the destination.
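As a rough sketch of that last step, the snippet below only assembles the argument list for an `rclone sync` invocation; it does not run anything. The remote names `hdfs_analytics` and `clouddumps` are hypothetical and would have to be defined in the rclone configuration (one HDFS remote, one SFTP remote), and the destination path is a placeholder.

```python
import shlex


def rclone_sync_command(
    hdfs_path: str,
    dest_path: str,
    hdfs_remote: str = "hdfs_analytics",  # hypothetical rclone remote name
    sftp_remote: str = "clouddumps",      # hypothetical rclone remote name
    dry_run: bool = True,
) -> list[str]:
    """Build the argv for `rclone sync <source> <destination>`."""
    cmd = [
        "rclone",
        "sync",
        f"{hdfs_remote}:{hdfs_path}",
        f"{sftp_remote}:{dest_path}",
    ]
    if dry_run:
        # Report what would be copied or deleted without changing anything.
        cmd.append("--dry-run")
    return cmd


cmd = rclone_sync_command(
    "/wmf/data/exports/mediawiki_content_current",
    "/placeholder/destination",  # placeholder, not a real path
)
print(shlex.join(cmd))
```

Because `rclone sync` makes the destination match the source, including deletions, defaulting to a dry run is a cautious choice for a first pass.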

Change #1186058 had a related patch set uploaded (by Xcollazo; author: Xcollazo):

[analytics/refinery/source@master] Remove 'partial' flag from MediawikiDumper

https://gerrit.wikimedia.org/r/1186058

Something occurs to me here. Since T352650: WE 5.4 KR - Hypothesis 5.4.4 - Q3 FY24/25 - Migrate current-generation dumps to run on kubernetes was completed, we already have Airflow DAGs that are carrying out the Dumps_v1 processes.

Instead of creating new DAGs (as per the draft in !1652), why not integrate your new tasks with the existing DAGs?
We already have the scheduling, the lists of large, regular, private wikis etc., and we have already done a lot of work to optimize the DAG parsing and execution times.

Wouldn't it be better to try to integrate these DAGs from the outset, rather than end up with 2 sets of dumps DAGs?

Something occurs to me here. Since T352650: WE 5.4 KR - Hypothesis 5.4.4 - Q3 FY24/25 - Migrate current-generation dumps to run on kubernetes was completed, we already have Airflow DAGs that are carrying out the Dumps_v1 processes.

Instead of creating new DAGs (as per the draft in !1652), why not integrate your new tasks with the existing DAGs?
We already have the scheduling, the lists of large, regular, private wikis etc., and we have already done a lot of work to optimize the DAG parsing and execution times.

The optimizations for the attached draft DAG are significantly simpler than what was done for the DumpsV1 DAGs. Here there are only 2 jobs: XML full and XML partial dumps. Both of these jobs run in Spark on YARN. Thus dynamic task mapping works well, and there is no need for the size and alphabet partitioning that was needed for DumpsV1. Identifying and rerunning failed tasks is also easier in this DAG.

Wouldn't it be better to try to integrate these DAGs from the outset, rather than end up with 2 sets of dumps DAGs?

The main goal of this phase of work is to simplify the code base and to make XML File Export more reliable. I made what I think is a reasonable argument against integrating these two codebases over at T400507#11051996, mainly to achieve that goal, and that would include the DAGs.

Now, T400507 is not done yet, and if the conclusion there is that we need to integrate, then we may just need to merge the DAGs as well. But for now, I would rather continue to treat these as two separate systems.

Change #1186058 merged by jenkins-bot:

[analytics/refinery/source@master] Remove 'partial' flag from MediawikiDumper

https://gerrit.wikimedia.org/r/1186058

OK, got it. Thanks @xcollazo for the explanation.

Change #1191153 had a related patch set uploaded (by Xcollazo; author: Xcollazo):

[analytics/refinery/source@master] Memory performance improvements for MediaWikiDumper

https://gerrit.wikimedia.org/r/1191153

Change #1191153 merged by jenkins-bot:

[analytics/refinery/source@master] Memory performance improvements for MediaWikiDumper

https://gerrit.wikimedia.org/r/1191153

From MR 1652:

This MR introduces 3 new DAGs:

mw_content_xml_export_history_monthly: exports wmf_content.mediawiki_content_history_v1 on the 1st.

mw_content_xml_export_current_monthly: exports wmf_content.mediawiki_content_current_v1 on the 1st.

mw_content_xml_export_current_mid_month: exports wmf_content.mediawiki_content_current_v1 on the 15th.

Some general notes:

  • The idea here is to honor the schedule of DumpsV1, thus 3 DAGs.
  • All these DAGs reuse the same code, but change the schedule and/or the source table to accomplish the file export.
  • We declare 3 separate tunings for huge, big, and regular wikis. Huge wikis will use ~20% of the cluster resources, big wikis ~12%, and regular wikis ~3%.
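One way to picture "same code, different schedule and/or source table" is a small table of DAG specifications driving a single factory. The sketch below uses the DAG ids and source tables listed above; the `ExportDagSpec` class, the helper function, and the cron expressions (midnight on the 1st and on the 15th) are assumptions for illustration, not the contents of the MR.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExportDagSpec:
    dag_id: str
    source_table: str
    schedule: str  # cron expression; the time of day is an assumption


HISTORY_TABLE = "wmf_content.mediawiki_content_history_v1"
CURRENT_TABLE = "wmf_content.mediawiki_content_current_v1"

EXPORT_DAGS = [
    ExportDagSpec("mw_content_xml_export_history_monthly", HISTORY_TABLE, "0 0 1 * *"),
    ExportDagSpec("mw_content_xml_export_current_monthly", CURRENT_TABLE, "0 0 1 * *"),
    ExportDagSpec("mw_content_xml_export_current_mid_month", CURRENT_TABLE, "0 0 15 * *"),
]


def dag_ids_for_table(table: str) -> list[str]:
    """List the DAG ids that export a given source table."""
    return [spec.dag_id for spec in EXPORT_DAGS if spec.source_table == table]


if __name__ == "__main__":
    for spec in EXPORT_DAGS:
        print(f"{spec.dag_id}: {spec.source_table} @ {spec.schedule}")
```

A single factory function could then loop over `EXPORT_DAGS` and build each DAG from its spec, so only this table changes when a schedule or source table does.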

A note on cluster usage:

A first implementation used a single dynamically mapped task group to do all the exports, controlled with a single Airflow pool. However, since Airflow does not support dynamically setting pool_slots, this resulted in overwhelming the cluster when multiple huge or big wikis were running in parallel. Thus, the current implementation segregates wikis into two dynamically mapped task groups: export_small_wikis, and export_big_and_huge_wikis. We declare a pool with 16 slots for the small wikis, and a pool with 1 slot for the rest.

This way, we cap resource usage quite well, with shorter bursts to ~25% of the cluster, but typically closer to ~20%.
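The two-pool split can be sketched as a simple partition of the wiki list. The group names and slot counts below come from the description above; reusing the group names as pool names, the membership of the huge and big sets, and the helper names are assumptions for illustration.

```python
# Slot counts as described above; using the group names as pool names is an assumption.
POOL_SLOTS = {
    "export_small_wikis": 16,
    "export_big_and_huge_wikis": 1,
}

# Hypothetical membership; the real size classes are defined in the DAG code.
HUGE_WIKIS = {"wikidatawiki", "enwiki"}
BIG_WIKIS = {"dewiki", "frwiki", "commonswiki"}


def task_group_for(wiki: str) -> str:
    """Pick the mapped task group (and pool) a wiki's export runs in."""
    if wiki in HUGE_WIKIS or wiki in BIG_WIKIS:
        return "export_big_and_huge_wikis"
    return "export_small_wikis"


def split_wikis(wikis: list[str]) -> dict[str, list[str]]:
    """Partition wikis into the two groups, preserving input order."""
    groups: dict[str, list[str]] = {name: [] for name in POOL_SLOTS}
    for wiki in wikis:
        groups[task_group_for(wiki)].append(wiki)
    return groups


groups = split_wikis(["loginwiki", "enwiki", "metawiki", "dewiki"])
print(groups)
```

With one slot in the second pool, the big and huge exports run one at a time, which is what keeps several of them from competing for the cluster simultaneously.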


Some statistics from a full run of mw_content_xml_export_history_monthly with the 2 pools approach:

export_small_wikis exported 855 wikis in ~11h30m. For this group, loginwiki was the fastest at 36s, while metawiki was the longest at 2h11m.

export_big_and_huge_wikis exported 26 wikis in ~5d18h35m. Here are the individual runtimes of these bigger wikis, from shortest to longest:

shwiki	0:49:29
fawiki	0:50:57
kowiki	0:57:27
svwiki	0:58:00
enwiktionary	0:59:57
ukwiki	1:00:58
frwiktionary	1:03:12
hewiki	1:03:28
arwiki	1:03:42
cebwiki	1:05:42
viwiki	1:06:29
trwiki	1:07:28
plwiki	1:10:29
cawiki	1:12:58
ptwiki	1:18:31
nlwiki	1:19:02
zhwiki	1:57:18
jawiki	2:00:05
itwiki	2:17:04
commonswiki	2:32:50
eswiki	2:42:00
ruwiki	2:52:22
frwiki	3:07:39
dewiki	4:45:46
enwiki	1d02:20:55
wikidatawiki	3d05:36:40

Side note: On a recent test run, 2 wikis (mswikiquote and thwikimedia) failed because they had been added to the open dblist quite recently and their SiteInfo data was not yet available. This will generate a bit of work for an OpsWeek person, who will need to double-check whether failures are legitimate or caused by new wikis that are not ready yet.
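A small helper along the following lines could make that triage quicker. It is purely illustrative: the function name, the output labels, and the idea of having a set of wikis with SiteInfo available are assumptions, and `examplewiki` is a made-up entry used only to show the second bucket.

```python
def triage_failures(
    failed_wikis: list[str],
    wikis_with_siteinfo: set[str],
) -> dict[str, list[str]]:
    """Separate failures likely caused by missing SiteInfo from the rest."""
    not_ready = sorted(w for w in failed_wikis if w not in wikis_with_siteinfo)
    to_check = sorted(w for w in failed_wikis if w in wikis_with_siteinfo)
    return {"not_ready_yet": not_ready, "needs_investigation": to_check}


result = triage_failures(
    failed_wikis=["thwikimedia", "mswikiquote", "examplewiki"],  # examplewiki is made up
    wikis_with_siteinfo={"examplewiki", "simplewiki"},
)
print(result)
```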

Bug: T384381

Change #1192291 had a related patch set uploaded (by Xcollazo; author: Xcollazo):

[analytics/refinery/source@master] MediawikiDumper: generate files without `-pages-meta-history` append

https://gerrit.wikimedia.org/r/1192291

Change #1192291 merged by jenkins-bot:

[analytics/refinery/source@master] MediawikiDumper: generate files without `-pages-meta-history` append

https://gerrit.wikimedia.org/r/1192291

Ran the following as hdfs:

$ whoami
hdfs

$ hostname -f
an-launcher1002.eqiad.wmnet

$ hdfs dfs -mkdir /wmf/data/exports

$ hdfs dfs -chmod 775 /wmf/data/exports

$ hdfs dfs -ls /wmf/data | grep exports
drwxrwxr-x   - hdfs               hadoop                               0 2025-10-03 16:23 /wmf/data/exports

$ hdfs dfs -mkdir /wmf/data/exports/mediawiki_content_history
$ hdfs dfs -mkdir /wmf/data/exports/mediawiki_content_current

$ hdfs dfs -chown analytics /wmf/data/exports/mediawiki_content_history
$ hdfs dfs -chown analytics /wmf/data/exports/mediawiki_content_current

$ hdfs dfs -chgrp analytics-privatedata-users /wmf/data/exports/mediawiki_content_history
$ hdfs dfs -chgrp analytics-privatedata-users /wmf/data/exports/mediawiki_content_current

$ hdfs dfs -chmod 755 /wmf/data/exports/mediawiki_content_history
$ hdfs dfs -chmod 755 /wmf/data/exports/mediawiki_content_current

$ hdfs dfs -ls /wmf/data/exports
Found 2 items
drwxr-xr-x   - analytics analytics-privatedata-users          0 2025-10-03 16:36 /wmf/data/exports/mediawiki_content_current
drwxr-xr-x   - analytics analytics-privatedata-users          0 2025-10-03 16:36 /wmf/data/exports/mediawiki_content_history
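For readers cross-checking the listing against the `chmod` calls, the octal modes map to the `ls`-style strings as shown by this small standard-library sketch (the helper name is arbitrary):

```python
import stat


def dir_mode_string(octal_mode: int) -> str:
    """Render a directory's permission bits the way `ls` displays them."""
    return stat.filemode(stat.S_IFDIR | octal_mode)


print(dir_mode_string(0o775))  # mode used for /wmf/data/exports
print(dir_mode_string(0o755))  # mode used for the two export subdirectories
```

Mode 775 keeps the parent directory group-writable, while 755 on the two subdirectories leaves write access with the owning `analytics` user only.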

Mentioned in SAL (#wikimedia-analytics) [2025-10-03T16:46:05Z] <xcollazo> ran a bunch of hdfs dfs commands as the hdfs user to setup /wmf/data/exports. Details at T384381#11242210.

Hi @xcollazo whenever you get a minute, please have a look at this MR that fixes the import order in mw_content_xml_export_dags. Thanks!

Hi @xcollazo whenever you get a minute, please have a look at this MR that fixes the import order in mw_content_xml_export_dags. Thanks!

Thanks to Joseph Allemandou, the MR has been merged.