
Source the CirrusSearch index dumps from hadoop instead of a MW maintenance script
Closed, Resolved · Public · 5 Estimated Story Points

Description

As of today, the search index dumps available at https://dumps.wikimedia.org/other/cirrussearch/ are generated by a MW maintenance script.
This script has become slow enough that it no longer finishes before the code it relies on is cleaned up by scap.
The search indices are also exported by a separate process that populates a Hive table in the Avro format. This process is much more efficient and does not rely on MW.
The idea would be to source the data from there instead of running a long MW maint script.
I believe that the Dumps 2.0 project might share similar needs, in the sense that its data would also be sourced from Hadoop.

My current understanding of what would be needed is as follows:

  • have a Spark process that converts the Avro table into the Elasticsearch bulk format (a sketch of the idea follows this list)
  • a process that rsyncs this directory from HDFS to a host serving dumps.wikimedia.org
    • this process should rename the Spark partitions into something human-friendly like $wikiid-$snapshot-cirrussearch-$type.$counter.gz (or .bz2)
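
A minimal PySpark sketch of that first step, assuming a hypothetical source table and placeholder column names (page_id, document, with document already a JSON-encoded string); each row becomes the two-line action/source pair of the Elasticsearch bulk format:

```python
import json

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cirrus-bulk-format").getOrCreate()

# Hypothetical table and columns, for illustration only.
rows = spark.sql("SELECT page_id, document FROM discovery.cirrus_index_export")

def to_bulk_lines(row):
    # The bulk format pairs an action line with the document source line.
    yield json.dumps({"index": {"_id": row.page_id}})
    yield row.document

(rows.rdd
     .flatMap(to_bulk_lines)
     .saveAsTextFile(
         "hdfs:///tmp/cirrus-bulk",
         compressionCodecClass="org.apache.hadoop.io.compress.GzipCodec"))
```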

Details

Related Changes in Gerrit:
Related Changes in GitLab:
Title | Reference | Author | Source Branch | Dest Branch
search: cirrus import: Match permissions between outputs | repos/data-engineering/airflow-dags!1784 | ebernhardson | work/ebernhardson/dumps-umask | main
search: New job to format public cirrus dumps | repos/data-engineering/airflow-dags!1635 | ebernhardson | work/ebernhardson/cirrus-public-dump | main
Generate cirrus dumps in hadoop | repos/search-platform/discolytics!58 | ebernhardson | format-cirrus-dump | main
dumps: Format hdfs dumps as bulk insert lines | repos/search-platform/discolytics!55 | ebernhardson | work/ebernhardson/T366248-public-dumps | main

Event Timeline

Restricted Application added a subscriber: Aklapper.

Super happy to see this, and to learn that the path to get this dump away from the Dumps 1.0 infrastructure is straightforward.

this process should rename the Spark partitions into something human-friendly

FYI @Antoine_Quhen recently implemented a file rename for this very purpose for Dumps 2.0, just in case you'd like to look at it and perhaps implement something similar. So you can do this as part of the Spark process.

Gehel triaged this task as High priority. · Jun 10 2024, 3:38 PM
Gehel moved this task from needs triage to ML & Data Pipeline on the Discovery-Search board.
pfischer set the point value for this task to 5. · Aug 18 2025, 3:25 PM

It was fairly trivial to write some pyspark to reformat the hive table into a text file, but the result is awkward in a couple of ways:

  • The old dumps produce, for example, a single 4.3GB .json.gz file for enwiktionary_content and a 2.1GB .json.gz file for enwiktionary_general. The current iteration of formatting the hive table to text produces 271 .json.snappy files with an average size of ~50MB for enwiktionary.
  • The file names are now mysterious: where previously we had enwiktionary-20250630-cirrussearch-content.json.gz, we now have a directory named cirrus_index=enwiktionary_content and a bunch of numbered files such as part-00269-ae2f56d9-04af-4585-95e2-dee08cc1f5ad.c000.txt.gz

Is it worthwhile to go through and rename everything? Maybe subdirectories are "good enough".

It was fairly trivial to write some pyspark to reformat the hive table into a text file, but the result is awkward in a couple of ways:

  • The old dumps produce, for example, a single 4.3GB .json.gz file for enwiktionary_content and a 2.1GB .json.gz file for enwiktionary_general. The current iteration of formatting the hive table to text produces 271 .json.snappy files with an average size of ~50MB for enwiktionary.

You can balance this with a repartition just before the write call. You could repartition(1) and you will get one giant file like before, but you lose all parallelization since it will all be computed by a single executor. Or you could, say, repartition(20) to get 20 reasonably sized files.
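
For illustration, a minimal sketch of that repartition-before-write, where df and the output path are placeholders (and df is assumed to have a single string column, as the text writer requires):

```python
# Shuffle to a fixed partition count right before writing; each
# resulting partition becomes one output file.
(df.repartition(20)
   .write
   .mode("overwrite")
   .text("hdfs:///tmp/enwiktionary_content"))
```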

  • The file names are now mysterious: where previously we had enwiktionary-20250630-cirrussearch-content.json.gz, we now have a directory named cirrus_index=enwiktionary_content and a bunch of numbered files such as part-00269-ae2f56d9-04af-4585-95e2-dee08cc1f5ad.c000.txt.gz

Is it worthwhile to go through and rename everything? Maybe subdirectories are "good enough".

If you decide to spit out a single file, you could use our HDFSArchiveOperator as is (link) to rename the file. But even if you decided on multiple files, I bet we could modify that Airflow operator to support your use case.

It was fairly trivial to write some pyspark to reformat the hive table into a text file, but the result is awkward in a couple of ways:

  • The old dumps produce, for example, a single 4.3GB .json.gz file for enwiktionary_content and a 2.1GB .json.gz file for enwiktionary_general. The current iteration of formatting the hive table to text produces 271 .json.snappy files with an average size of ~50MB for enwiktionary.

You can balance this with a repartition just before the write call. You could repartition(1) and you will get one giant file like before, but you lose all parallelization since it will all be computed by a single executor. Or you could, say, repartition(20) to get 20 reasonably sized files.

Sadly, all that will result in is YARN killing the executor for blowing out the memory limits. The input data here is over a terabyte, and we write out all 1000+ indexes at once, not just enwiktionary. We could shuffle the data to attempt to get the partitioning better, but it's always tedious because there is no natural partitioning key. Some indices have 100M documents, and some indices have 14. So we can't just partition by the index; we have to pre-calculate some statistics over all indices and then do our own manual partitioning where some indices get 1 partition and some get 100 (roughly the approach sketched below). This is possible, but I was hoping to avoid the tedium.

The source data is actually already partitioned as well: a single file holds documents from a single index, but Spark isn't really aware of the finer details of the data, so we can't simply tell Spark to coalesce only partitions from the same index. If we issue a generic coalesce, from the 33k source partitions to 1,000 output partitions, it only very slightly affects the number of output files, because for the most part it's joining data from two separate indices and then writing them to two separate files. The only change is that this happens to run in the same task, and it requires significantly more memory tuning for the larger task sizes.
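
For concreteness, a sketch of what that manual partitioning might look like, under stated assumptions: the table name and columns (cirrus_index, page_id, document) are placeholders, and the hash-based salting is only approximate, since distinct (index, salt) pairs can still collide into one Spark partition:

```python
import math

import pyspark.sql.functions as F
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
TARGET_BYTES = 1 << 30  # aim for roughly 1GB per output file

# Hypothetical source table; column names are placeholders.
df = spark.table("discovery.cirrus_index_export")

# Pass 1: estimate each index's serialized size from document lengths.
stats = (df.groupBy("cirrus_index")
           .agg(F.sum(F.length("document")).alias("bytes"))
           .collect())
parts = [(row["cirrus_index"], max(1, math.ceil(row["bytes"] / TARGET_BYTES)))
         for row in stats]
parts_df = spark.createDataFrame(parts, ["cirrus_index", "num_parts"])

# Pass 2: salt each row into one of its index's partitions and shuffle.
# Small indices stay in a single file; the giant ones split into ~1GB pieces.
salted = (df.join(F.broadcast(parts_df), "cirrus_index")
            .withColumn("salt",
                        F.abs(F.hash("page_id")) % F.col("num_parts")))
total_parts = sum(n for _, n in parts)
(salted.repartition(total_parts, "cirrus_index", "salt")
       .select("cirrus_index", "document")
       .write.partitionBy("cirrus_index")
       .mode("overwrite")
       .text("hdfs:///tmp/cirrus-dump", compression="bzip2"))
```

The broadcast join is just to attach each row's partition budget without a second full shuffle; the statistics pass is the "tedium" referred to above.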

  • The file names are now mysterious: where previously we had enwiktionary-20250630-cirrussearch-content.json.gz, we now have a directory named cirrus_index=enwiktionary_content and a bunch of numbered files such as part-00269-ae2f56d9-04af-4585-95e2-dee08cc1f5ad.c000.txt.gz

Is it worthwhile to go through and rename everything? Maybe subdirectories are "good enough".

If you decide to spit out a single file, you could use our HDFSArchiveOperator as is (link) to rename the file. But even if you decided on multiple files, I bet we could modify that Airflow operator to support your use case.

I'm not sure if we want a single file or not. The main desire for a single file is to keep the existing "structure" of the dumps, to avoid surprising external users and anything they might have built up around these dumps. On the other hand, large indices like commonswiki_file result in dumps of approximately 50GB (compressed). I suppose what I'm indecisive about is the appropriate number? Looking at my test dump, commonswiki_file resulted in 1,332 separate dump files. That seems like too many, but a single 50GB file is also tedious to deal with. Plucking random numbers out of the air, I was going to aim for around 1GB per output file.

I see it is more complicated than I thought. We can always:

  • Generate a new version of the dump, with a different file format and/or semantics.
  • Deprecate but continue running the old dump in parallel to give consumers time to adjust.
  • Announce the new format and the deprecation of the old.
  • Come back later to sunset the old version of the dump.

Dumping to HDFS is basically ready to go:

  • Dumps are a directory per search index
  • Each directory contains one or more files. In directories with multiple files, typical file size is 0.5GB-1.5GB.
  • Files have been renamed for publication to {index_name}-{snapshot_id}-{part_num}.json.bz2 (one way such a rename can be scripted is sketched below)
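
A hypothetical sketch of that rename step, going through the Hadoop FileSystem API that PySpark exposes via its JVM gateway; the paths, the naming arguments, and the reliance on the private spark._jvm handle are all illustrative assumptions, not the production code:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

def publish_names(index_dir: str, index_name: str, snapshot_id: str) -> None:
    # Rename Spark's opaque part-* files in index_dir to the published
    # {index_name}-{snapshot_id}-{part_num}.json.bz2 pattern.
    jvm = spark._jvm
    fs = jvm.org.apache.hadoop.fs.FileSystem.get(
        spark._jsc.hadoopConfiguration())
    src_dir = jvm.org.apache.hadoop.fs.Path(index_dir)
    part_num = 0
    for status in fs.listStatus(src_dir):
        name = status.getPath().getName()
        if not name.startswith("part-"):
            continue  # skip _SUCCESS and other marker files
        dst = jvm.org.apache.hadoop.fs.Path(
            src_dir, f"{index_name}-{snapshot_id}-{part_num}.json.bz2")
        fs.rename(status.getPath(), dst)
        part_num += 1

# Hypothetical usage against one index's output directory.
publish_names("hdfs:///tmp/cirrus-dump/cirrus_index=enwiktionary_content",
              "enwiktionary_content", "20250630")
```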

To be done:

  • We need to add the step that syncs from hdfs to public. Not sure how that should be done.

I suppose a side benefit of replacing the dumps with these is that the schema is now significantly stricter. It's not exact, because we still forward un-schema'd extra_cols from the source to the output, but the columns that are schema'd are guaranteed to match the expected types. In the historical dump, for example, file_text could be a string, false, or the empty array. It's now always either a string or null. Similarly, wikibase descriptions had a different shape in wikidata vs commonswiki; in this dump they are the same.
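
As an illustration of the stricter typing (not the actual job's code), the historically loose file_text values could be coerced to string-or-null with something like the following, assuming the raw value arrives as a string-typed JSON fragment:

```python
import pyspark.sql.functions as F

# Map the historical non-string encodings ("false", "[]") to null so the
# output column is always a string or null.
normalized = df.withColumn(
    "file_text",
    F.when(F.col("file_text").isin("false", "[]"),
           F.lit(None).cast("string"))
     .otherwise(F.col("file_text")))
```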

We need to add the step that syncs from hdfs to public. Not sure how that should be done.

In HDFS, we have a folder /wmf/data/archive where you can move your files to. Let's say you do it to /wmf/data/archive/cirrus-search-index/{date}/blah.

Then you can set up an hdfs_tools::hdfs_rsync_job in puppet to rsync from that HDFS path to the clouddumps* nodes that serve the dumps (examples here).

Change #1184585 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[operations/puppet@production] dumps: Sync cirrus index dumps from hdfs

https://gerrit.wikimedia.org/r/1184585

We need to add the step that syncs from hdfs to public. Not sure how that should be done.

In HDFS, we have a folder /wmf/data/archive where you can move your files to. Let's say you do it to /wmf/data/archive/cirrus-search-index/{date}/blah.

Then you can set up an hdfs_tools::hdfs_rsync_job in puppet to rsync from that HDFS path to the clouddumps* nodes that serve the dumps (examples here).

I'd say that there are some other options to be considered, too. That puppet-based mechanism that calls hdfs-rsync will work, but it's maybe a bit of a legacy way to do it.
When we migrated the dumps v1 to Airflow recently, we needed to find a way to publish from the CephFS mount point /mnt/dumpsdata to the clouddumps hosts.

We created a sync-utils container image, and we then add specific tasks into our DAGs that are responsible for publishing the generated files.
In the case of the dumps, we found that the best option was to use parallel-rsync and specify both clouddumps1001 and clouddumps1002 as the targets.
This allows us to have one task that either successfully publishes to both target locations, or fails.

So for example, if you look at the current cirrussearch dumps, you will see that they have a sync_cirrussearch_dumps task, which calls parallel-rsync with these custom arguments.

[screenshot: the sync_cirrussearch_dumps task and its parallel-rsync arguments in the Airflow UI]

There are many options around how you would schedule and trigger these publishing tasks, so they don't all need to be sequential like the example shown here.

However, for this requirement it would be a little different, because the source files are presumably going to be created on HDFS, rather than CephFS.
This means that we won't be able to have the source directory mounted as a locally available file system.

I think that we have at least a few ways that we could tackle this, though.
One that occurs to me immediately is that we could use rclone instead of parallel-rsync.

rclone already has an hdfs remote capability built-in, so we could use this for one side of the connection and an sftp remote on the other side.
We would be able to give it access to the kerberos credential cache and Hadoop configuration files for the HDFS connection, and the SSH private key for the SFTP connection.

Then we could just execute an rclone sync command and supply the source and destination paths.
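
A hypothetical invocation, driven from Python for consistency with the rest of the pipeline; the remote names (hadoop, clouddumps) would be defined in an rclone config pointing at the HDFS and SFTP endpoints, and all paths here are placeholders:

```python
import subprocess

# rclone sync makes the destination match the source. The hdfs remote
# would authenticate via the kerberos credential cache, and the sftp
# remote via an SSH private key, as described above.
subprocess.run(
    ["rclone", "sync",
     "hadoop:/wmf/data/archive/cirrus-search-index/20250630",
     "clouddumps:/srv/dumps/other/cirrus_search_index/20250630"],
    check=True)
```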

@xcollazo - I can see this being a good option for T384381: Airflow jobs to do monthly XML dumps as well.

What do we think is the right way forward here? If SRE will be prioritizing a newer method of getting data from HDFS to the public sites in the next month or so, then it seems like this could wait, but if it's uncertain when that work will be prioritized, it seems reasonable to move forward with the existing puppet bits that invoke hdfs_tools::hdfs_rsync_job.

I like @BTullis's idea of having an Airflow operator that takes care of this. As mentioned, it would immediately have two use cases: this ticket, and T384381.

Thus I've been bold and created T405360: Implement an Airflow operator for moving data from point A to B.

We can discuss priority for this at the upcoming "DPE SRE / DE sync up" meeting?

It looks like DE are going to move forward with the existing sync mechanisms (T405360#11277591); we should probably do the same.

I've updated the airflow-dags patch so it no longer has the Draft flag; it's ready for review and deployment. discolytics has already shipped a new version containing the appropriate code.

To be reviewed (but the puppet patch should not be merged until a clean run of the DAG has been verified):

First run of the updated DAG completed; dumps were formatted and moved to the exports path in HDFS. Reviewing the output, it all looks reasonable and as expected. Next up is enabling the public sync via the puppet patch.

Change #1184585 merged by Ryan Kemper:

[operations/puppet@production] dumps: Sync cirrus index dumps from hdfs

https://gerrit.wikimedia.org/r/1184585

With puppet deployed, we should expect to see these arrive at https://dumps.wikimedia.org/other/cirrus_search_index/ after 05:00 UTC tomorrow.

@EBernhardson if you're happy with these new dumps, do you still want the "old" cirrussearch dumps to run on Airflow?

Change #1202289 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[operations/puppet@production] dumps: Fix missing trailing slash in cirrus-search-index path

https://gerrit.wikimedia.org/r/1202289

Change #1202289 merged by Ryan Kemper:

[operations/puppet@production] dumps: Fix missing trailing slash in cirrus-search-index path

https://gerrit.wikimedia.org/r/1202289

Mentioned in SAL (#wikimedia-operations) [2025-11-05T22:12:16Z] <ryankemper> T366248 sudo rm -rfv /srv/dumps/xmldatadumps/public/other/cirrus_search_index/cirrus-search-index/ on clouddumps100[1,2].wikimedia.org

@EBernhardson if you're happy with these new dumps, do you still want the "old" cirrussearch dumps to run on Airflow?

In a few weeks, but not yet. While we only know who a couple of them are, we do know there are downstream consumers of this dataset. The replacement is shaped slightly differently, so we want to give a few weeks (a month?) of transition time after we announce availability.

In the communication we went with promising dumps through November, shutting off sometime in December:

We will continue producing the old dumps through November, expecting to shut them off before the end of the year.

Should we start by disabling the legacy cirrussearch dumps in the Airflow UI?
https://airflow-test-k8s.wikimedia.org/dags/mediawiki_cirrussearch_dump/grid

[screenshot: the mediawiki_cirrussearch_dump DAG in the Airflow UI]

If nothing falls over and nobody complains after a couple of weeks, then we can remove the code from Airflow-DAGs.

Should we start by disabling the legacy cirrussearch dumps in the Airflow UI?
https://airflow-test-k8s.wikimedia.org/dags/mediawiki_cirrussearch_dump/grid

[screenshot: the mediawiki_cirrussearch_dump DAG in the Airflow UI]

If nothing falls over and nobody complains after a couple of weeks, then we can remove the code from Airflow-DAGs.

That sounds good to me; we initially said we intended to turn it off by the end of the year. This is probably the best next step to getting people moved over.

Change #1223722 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[operations/puppet@production] dumps: Repoint cirrus dumps to new location

https://gerrit.wikimedia.org/r/1223722

I went through today to verify that everything is ready to go:

  • Pulled the simplewiki_content dump from 20260104 and imported it into a local opensearch instance. The full dump loaded with no glaring issues. There is still potential for subtle issues.
  • Noticed that while the directory is dated 20260104, the individual files were dated one week prior (the airflow data_interval_start). Pushed patches to fix that up and re-triggered the 20260104 dump to verify both the fix and that a re-run of the dumps replaces the old dump (instead of co-mingling two dumps). Ran successfully; the dump was replaced with new files.
  • Put up a patch that repoints the link at https://dumps.wikimedia.org/other to the new dumps
  • Manually disabled the airflow job for the old dumps.
  • Put together a proposed deprecation document, to be placed in the old dumps directory (https://dumps.wikimedia.org/other/cirrussearch/):

The CirrusSearch dumps in this directory (other/cirrussearch/) are
no longer being updated.

NEW LOCATION:
https://dumps.wikimedia.org/other/cirrus_search_index/

WHY IT CHANGED:
These dumps have gotten slower over time, to the point where
it would take 7 or 8 days to produce a weekly dump. The replacement
orchestration is designed to better handle the hundreds of GB
dumped each week. The replacement orchestration publishes the full
dump approximately 12 hours after starting up.

WHAT CHANGED:
Dumps are now sharded into smaller files to support parallelization
of the dumps process. For example, 'commonswiki_file' is now
located in a subdirectory and split into multiple 1GB chunks
rather than a single large blob.

Please update your scrapers and automated jobs to use the new
directory structure.

Mentioned in SAL (#wikimedia-operations) [2026-01-07T21:54:26Z] <inflatador> bking@clouddumps100{1,2} created /srv/dumps/xmldatadumps/public/other/cirrussearch/DEPRECATED.txt T366248

The deprecation doc has been placed; this should be complete.

Change #1223722 abandoned by Ebernhardson:

[operations/puppet@production] dumps: Repoint cirrus dumps to new location

Reason:

Resolved in I589e2910235d928020632e0242b568d314acf708

https://gerrit.wikimedia.org/r/1223722