Publish Dumps 2 to dumps.wikimedia.org and provide only monthly dumps
Closed, Resolved (Public)

Description

On January 12, 2026, a WMF-wide group meeting decided to move forward with publishing the Dumps 2 XML artifacts on dumps.wikimedia.org:

  • The XML artifacts from the new Dumps 2 process will be published on dumps.wikimedia.org next to the existing Dumps 1 artifacts.
  • The publication cadence of both Dumps 1 and Dumps 2 artifacts will be reduced to monthly, i.e. mid-month partial updates will no longer be provided.
  • The Dumps 1 XML artifacts will be marked as deprecated. Note that this deprecation affects only the XML content artifacts. All other SQL dumps of various database tables will continue, but on a monthly basis.

The changes will be accompanied by a communication to the community that the new Dumps 2 XML files are more stable and hence should be preferred for downloading.

Event Timeline

xcollazo changed the task status from Open to In Progress. (Fri, Jan 16, 3:50 PM)
xcollazo claimed this task.
xcollazo triaged this task as High priority.

xcollazo merged https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/1924

mw file export: remove mid-month file export DAG

I first paused the mid-month File Export DAG (aka Dumps2) and then merged the changes above.

I've now paused all Dumps 1 DAGs tagged as partial-dump on the test-k8s Airflow instance that runs them.

I will follow up with an MR for the test-k8s instance shortly.

Hi, is this why the mid-month dump run (20260120) has not started?

Yes. We will make an announcement shortly.

Change #1199783 had a related patch set uploaded (by Xcollazo; author: Xcollazo):

[operations/puppet@production] dumps: Release the new MW Content File Export. Deprecate legacy XML dumps.

https://gerrit.wikimedia.org/r/1199783

Change #1199783 merged by Brouberol:

[operations/puppet@production] dumps: Release the new MW Content File Export. Deprecate legacy XML dumps.

https://gerrit.wikimedia.org/r/1199783

Change #1230956 had a related patch set uploaded (by Xcollazo; author: Xcollazo):

[operations/puppet@production] dumps: Fix MW Content File Export. Remove already absented file def.

https://gerrit.wikimedia.org/r/1230956

Change #1230956 merged by Brouberol:

[operations/puppet@production] dumps: Fix MW Content File Export. Remove already absented file def.

https://gerrit.wikimedia.org/r/1230956

Change #1230965 had a related patch set uploaded (by Xcollazo; author: Xcollazo):

[operations/deployment-charts@master] dumps: Update index.html file to reflect XML dumps deprecation

https://gerrit.wikimedia.org/r/1230965

Change #1230965 merged by Brouberol:

[operations/deployment-charts@master] dumps: Update index.html file to reflect XML dumps deprecation

https://gerrit.wikimedia.org/r/1230965

TTO subscribed.

Two comments:

  • I'm a little perplexed as to why the fact that "mid-month partial updates will no longer be provided" wasn't announced before 20 January. It isn't even in Tech News 2026-05 (the 26 Jan edition). Sure, perhaps you want to announce the deprecation of Dumps 1 in favour of Dumps 2 "when the time is right" - no problem with that. But the cessation of mid-month dumps broke people's workflows (including mine). This specific aspect should have been announced in advance of it occurring. I'm boldly adding the User-notice project so that the Tech News editors can decide what to do with this.
  • Dumps 1 also contains SQL dumps of various database tables. At this time, Dumps 2 doesn't contain an equivalent. Can we assume this aspect of Dumps 1 is not deprecated? If so, could this please be made clearer in the messaging?

Two comments:

  • I'm a little perplexed as to why the fact that "mid-month partial updates will no longer be provided" wasn't announced before 20 January. It isn't even in Tech News 2026-05 (the 26 Jan edition). Sure, perhaps you want to announce the deprecation of Dumps 1 in favour of Dumps 2 "when the time is right" - no problem with that. But the cessation of mid-month dumps broke people's workflows (including mine). This specific aspect should have been announced in advance of it occurring. I'm boldly adding the User-notice project so that the Tech News editors can decide what to do with this.

We should have announced it before removing, you are right.

  • Dumps 1 also contains SQL dumps of various database tables. At this time, Dumps 2 doesn't contain an equivalent. Can we assume this aspect of Dumps 1 is not deprecated?

Correct, this decision only affects the XML content artifacts. All other SQL dumps of various database tables will continue to be done on a monthly basis.

If so, could this please be made clearer in the messaging?

I will now modify the description of this ticket to make this clear. If you have suggestions for how to communicate this more clearly at https://dumps.wikimedia.org/ we'll take them as well.

Change #1233836 had a related patch set uploaded (by Xcollazo; author: Xcollazo):

[operations/puppet@production] analytics: refinery: add data purge for File Export.

https://gerrit.wikimedia.org/r/1233836

Do Dumps 2 files, which correspond to a revision range of a single page, have correct names?

I can see that, for example, on https://dumps.wikimedia.org/other/mediawiki_content_history/enwiki/2026-01-01/xml/bzip2/ there's a file enwiki-2026-01-01-p10701605r123221892r123221892.xml.bz2 which apparently contains revisions in range 123221892 - 123221892 (both ends are the same). The same can be found in the documentation, which I think is counterintuitive and likely suggests an issue with interpretation of the file name.

I don't see a combined file for mediawiki_content_current like pages-meta-current.xml.bz2 that exists in the Dumps 1 directory. I have a workflow that utilizes the combined file: TemplateParameterBot (enwiki)
I also don't see an equivalent to pages-articles.xml.bz2 for "Articles, templates, media/file descriptions, and primary meta-pages". I have multiple workflows that utilize this file for multiple wikis: TemplateParameterBot (several wikis), WikidataClassBrowser (wikidata), CheckWiki (several wikis). These could be converted to use the full content dumps, they will take longer to run though.

Do Dumps 2 files, which correspond to a revision range of a single page, have correct names?

I can see that, for example, on https://dumps.wikimedia.org/other/mediawiki_content_history/enwiki/2026-01-01/xml/bzip2/ there's a file enwiki-2026-01-01-p10701605r123221892r123221892.xml.bz2 which apparently contains revisions in range 123221892 - 123221892 (both ends are the same). The same can be found in the documentation, which I think is counterintuitive and likely suggests an issue with interpretation of the file name.

Thanks for the report, opened T416176: File Export files with revision ranges do not show end revision in filename to track this.
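For anyone scripting against these names in the meantime, here is a minimal sketch of splitting such a file name into its fields. The pattern is inferred only from the single example quoted above and is an assumption, not a documented naming rule; the function and field names are hypothetical. It deliberately labels the two revision fields `rev_a` and `rev_b` rather than "start" and "end", since their meaning is exactly what T416176 is tracking.

```python
import re
from typing import NamedTuple, Optional

# Pattern inferred from the one example file name quoted in this thread.
# This is an assumption, not a documented naming specification.
_NAME_RE = re.compile(
    r"^(?P<wiki>[a-z0-9_]+)-(?P<date>\d{4}-\d{2}-\d{2})"
    r"-p(?P<page>\d+)r(?P<rev_a>\d+)r(?P<rev_b>\d+)\.xml\.bz2$"
)


class RevisionRangeName(NamedTuple):
    """Fields of a per-page revision-range file name (hypothetical type)."""
    wiki: str
    date: str
    page_id: int
    rev_a: int  # first revision field; interpretation is an open question
    rev_b: int  # second revision field; interpretation is an open question


def parse_revision_range_name(name: str) -> Optional[RevisionRangeName]:
    """Return the parsed fields, or None if the name does not fit the pattern."""
    m = _NAME_RE.match(name)
    if m is None:
        return None
    return RevisionRangeName(
        m["wiki"], m["date"], int(m["page"]), int(m["rev_a"]), int(m["rev_b"])
    )
```

Returning `None` for non-matching names lets a script skip other file types in the same directory instead of failing on them.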

I don't see a combined file for mediawiki_content_current like pages-meta-current.xml.bz2 that exists in the Dumps 1 directory. I have a workflow that utilizes the combined file: TemplateParameterBot (enwiki)

Right, the new system is indeed simpler. There are no combined files, and there is no recompression into .7z.
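A workflow that relied on the combined file could, in principle, treat the separate files as one logical stream. The following is only a rough sketch under stated assumptions: that the files have already been downloaded locally, and that each one contains `<page>` elements with a `<title>` child as in the MediaWiki XML export format. The function name `iter_page_titles` and the directory used in the usage note are hypothetical.

```python
import bz2
import xml.etree.ElementTree as ET
from pathlib import Path
from typing import Iterable, Iterator, Tuple


def _local(tag: str) -> str:
    """Strip an XML namespace, e.g. '{http://...}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def iter_page_titles(paths: Iterable[Path]) -> Iterator[Tuple[str, str]]:
    """Yield (file name, page title) for every <page> in every file, in order.

    Assumes each file is bzip2-compressed XML with <page>/<title> elements.
    """
    for path in paths:
        with bz2.open(path, "rb") as stream:
            # Incremental parsing avoids loading a whole file into memory.
            for _event, elem in ET.iterparse(stream, events=("end",)):
                if _local(elem.tag) == "page":
                    title = next(
                        (c.text or "" for c in elem if _local(c.tag) == "title"),
                        "",
                    )
                    yield path.name, title
                    # Drop the page's content to limit memory use.
                    elem.clear()
```

Usage would look like `iter_page_titles(sorted(Path("enwiki-2026-01-01").glob("*.xml.bz2")))`, where the directory is a placeholder for wherever the files were saved. Sorting by name gives a stable order, but it is not guaranteed to match the page order of the old combined file.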

I also don't see an equivalent to pages-articles.xml.bz2 for "Articles, templates, media/file descriptions, and primary meta-pages". I have multiple workflows that utilize this file for multiple wikis: TemplateParameterBot (several wikis), WikidataClassBrowser (wikidata), CheckWiki (several wikis). These could be converted to use the full content dumps, they will take longer to run though.

There is no plan to remove the pages-articles job from the legacy system. The only XML artifacts that are deprecated are those produced by the pages-meta-current and pages-meta-history jobs. I do see the need to be more specific about what is deprecated and what is not; I will do so via T416180: Modify legacy dumps to specify what artifacts are deprecated.

Change #1233836 merged by Brouberol:

[operations/puppet@production] analytics: refinery: add data purge for File Export.

https://gerrit.wikimedia.org/r/1233836