
Implement the data layout, UI, and documentation for the XML file export
Closed, ResolvedPublic

Description

Let's implement a simple (for now) UI for the data file export that takes into consideration learnings from T400507:

In T400507#11051996, @xcollazo wrote:

...
There are multiple options for alternatives that break the current behavior that do not suffer from the above integration issues:

i) Simple, low-effort way: we can utilize the same mechanism we have for publishing analytics assets. This would allow us to publish under, say, http://dumps.wikimedia.org/mediawiki_content_history/examplewiki/YYYY-MM-DD/xml. The effort to do this is minimal, since we would leverage the existing Apache HTTP file indexing; the result will look like other analytics assets (see the readme and HTTP file listing for Commons Impact Metrics as an example) and, in fact, quite similar to the existing Dumps v1. An able developer should be able to migrate to this easily.

ii) A more elaborate UI, still utilizing static assets that get generated as part of the Airflow job that does the file export. That is, similar to (i), but making sure that instead of an Apache HTTP file listing, we have a nicer statically generated UI that would provide basic CSS templating options.

Neither option (i) nor (ii) above would prevent us from implementing a login requirement in the future.

In this task we should:

  • Agree on a data layout that makes it clear when the dump was done, what the content is, and what format it is in.
  • Prepare documentation for an index.html that explains the new file export, its content, how to use it, etc. A great example is the one done for Commons Impact Metrics via T358701.
  • Make sure the layout allows us to add other formats (e.g. JSON Lines) in the future.
  • Validate with the DPE SRE team that the chosen mechanism is easily portable to a login wall in the future if there is a need (see T400507 for rationale). (Looks like we will not be pursuing this.)
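A layout meeting these requirements might look like the following sketch. The path components below are illustrative only, not a decided scheme:

```shell
# Hypothetical layout: dataset, wiki, date, format, and compression are
# each an explicit path component, so adding a new format (e.g. JSON
# Lines) later is just a sibling directory, not a restructuring.
base=/tmp/layout-demo
mkdir -p "$base/mediawiki_content_history/examplewiki/2025-09-15/xml/bzip2"
mkdir -p "$base/mediawiki_content_history/examplewiki/2025-09-15/ndjson/gzip"

# The date component makes it clear when the dump was done; the format
# and compression components make it clear what the content is.
find "$base" -type d | sort
```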

Event Timeline

xcollazo renamed this task from Design the data layout and the UI for the XML file export to Implement the data layout, UI, and documentation for the XML file export. Aug 1 2025, 6:56 PM
xcollazo updated the task description.

Parking this here before I forget: one thing we should consider is abandoning MD5 and SHA-1 checksums for data integrity and adopting SHA-256 instead. This newer algorithm is still cryptographically unbroken, and implementations are widely available.

See https://en.wikipedia.org/wiki/SHA-2 and https://www.gnu.org/software/coreutils/sha256sum.
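As a concrete sketch of that workflow with GNU coreutils (the file names below are made up for illustration, not the real export layout):

```shell
# Build a manifest: one "<hex digest>  <filename>" line per file.
mkdir -p /tmp/sha256-demo && cd /tmp/sha256-demo
printf 'example dump content\n' > examplewiki-2025-01-01-p1p100.xml.bz2
printf 'more dump content\n'    > examplewiki-2025-01-01-p101p200.xml.bz2
sha256sum examplewiki-*.xml.bz2 > SHA256SUMS

# Consumers verify their downloads against the manifest; any corrupted
# or truncated file is reported as FAILED and the command exits nonzero.
sha256sum --check SHA256SUMS
```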

Change #1198152 had a related patch set uploaded (by Xcollazo; author: Xcollazo):

[analytics/refinery/source@master] Add utility to create SHA256 fingerprints of the files of a particular HDFS folder.

https://gerrit.wikimedia.org/r/1198152

Change #1198152 merged by jenkins-bot:

[analytics/refinery/source@master] Add utility to create SHA256 fingerprints of the files of a particular HDFS folder.

https://gerrit.wikimedia.org/r/1198152

Changeset https://gerrit.wikimedia.org/r/1198152 implements sha256 sums.

From MR 1764:

In this MR we:

  • Add a step to each file export to create a SHA256SUMS file that is compatible with sha256sum -c for verifying files. Since this file will only be generated when the file export of a particular wiki is successful, it also serves as a completion file and a file listing for automated downloading.
  • Bump to max_active_tasks=64. This is just so that we can control the parallelization at the pool level instead of being capped by this setting.
  • Change the output path from /wmf/data/exports/{dataset}/{wiki_id}/{date} to /wmf/data/exports/{dataset}/{wiki_id}/{date}/xml/{compression} to give us more flexibility in case we later decide to export other compressions or other formats, say, ndjson.
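Because SHA256SUMS is only written on success and lists every file, a downstream consumer can use it as a completion marker, a download manifest, and an integrity check all at once. A minimal sketch of such a consumer, assuming a hypothetical published base URL (the real location may differ):

```shell
# fetch_export: download and verify one wiki's export from a base URL.
fetch_export() {
  base="$1"
  # SHA256SUMS is only generated when the export succeeded, so its
  # absence means the export is not (yet) complete.
  curl -fsSO "$base/SHA256SUMS" || { echo 'export not finished yet' >&2; return 1; }
  # Each line is "<digest>  <filename>"; field 2 is the file listing.
  for f in $(awk '{print $2}' SHA256SUMS); do
    curl -fsSO "$base/$f" || return 1
  done
  # Verify integrity of everything fetched.
  sha256sum --check SHA256SUMS
}

# Usage (hypothetical path):
# fetch_export 'https://dumps.wikimedia.org/mediawiki_content_current/commonswiki/2025-09-15/xml/bzip2'
```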

Change #1199783 had a related patch set uploaded (by Xcollazo; author: Xcollazo):

[operations/puppet@production] dumps: Link to new MW Content File Export. Deprecate legacy XML dumps.

https://gerrit.wikimedia.org/r/1199783

Ran the following to remove older test runs:

hdfs dfs -rm -r -skipTrash /wmf/data/exports/mediawiki_content_history/*
hdfs dfs -rm -r -skipTrash /wmf/data/exports/mediawiki_content_current/*

Rerunning mw_content_xml_export_current_mid_month for 2025-10-15 to double check the changes from MR 1764.

Most wikis are now done, and the output is WAD (working as designed).

Two wikis did fail:

minwikisource
pcmwikiquote

Both of these are very new wikis, and their SiteInfo data is still not available, so these failures are expected.

Confirmed SHA256SUMS is WAD for a big wiki like commonswiki:

analytics@an-launcher1002:/mnt/hdfs/wmf/data/exports/mediawiki_content_current/commonswiki/2025-09-15/xml/bzip2$ sha256sum --check SHA256SUMS 
commonswiki-2025-09-15-p100315804p100592129.xml.bz2: OK
commonswiki-2025-09-15-p100592130p100829308.xml.bz2: OK
...

xcollazo merged https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/1788

main: mw file export: apply umask so that files are readable by others

Ran the following manually to fix existing files and folders:

hadoop fs -chmod -R 755 /wmf/data/exports/mediawiki_content_current
hadoop fs -chmod -R 755 /wmf/data/exports/mediawiki_content_history
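The umask fix and the chmod repair address the same symptom from two sides: the umask controls the mode of files as they are created, while chmod repairs files already written. A minimal sketch of the difference (the paths below are throwaway examples, and the exact prior umask is an assumption):

```shell
# With a restrictive umask (e.g. 077), newly created files are not
# world-readable, so the HTTP file listing cannot serve them.
# umask 022 yields 644 files and 755 directories instead.
rm -f /tmp/export-restrictive.txt /tmp/export-readable.txt
( umask 077; touch /tmp/export-restrictive.txt )   # mode 600
( umask 022; touch /tmp/export-readable.txt )      # mode 644

stat -c '%a %n' /tmp/export-restrictive.txt /tmp/export-readable.txt
```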
xcollazo changed the task status from Open to Stalled. Nov 13 2025, 3:55 PM
xcollazo updated the task description.

Change #1199783 merged by Brouberol:

[operations/puppet@production] dumps: Release the new MW Content File Export. Deprecate legacy XML dumps.

https://gerrit.wikimedia.org/r/1199783