
Source the CirrusSearch index dumps from hadoop instead of a MW maintenance script
Open, High, Public

Description

As of today, the search index dumps accessible from https://dumps.wikimedia.org/other/cirrussearch/ are generated by a MW maintenance script.
This script has become slow enough that it does not always finish before the code it relies on is cleaned up by scap.
The search indices are also exported by a separate process that populates a Hive table in the Avro format. This process is much more efficient and does not rely on MW.
The idea would be to source the data from there instead of running a long MW maintenance script.
I believe that the Dumps 2.0 project might share similar needs, in the sense that its data would also be sourced from Hadoop.

My current understanding of what would be needed is as follows:

  • have a Spark process that converts the Avro table into the Elasticsearch bulk format (see the sketch after this list).
  • a process that rsyncs this folder from HDFS to a host serving dumps.wikimedia.org
    • this process should rename the Spark partitions into something human-friendly, like $wikiid-$snapshot-cirrussearch-$type.$counter.gz (or bz2)
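A minimal PySpark sketch of the first bullet, assuming a Hive table named discovery.cirrus_index partitioned by snapshot, with wikiid, page_id and an already-serialized JSON document column; the table name, columns and output path are illustrative assumptions, not the real schema.

```python
# Hedged sketch: turn the Avro-backed Hive table into gzip-compressed
# Elasticsearch bulk files, one output directory per wiki.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("cirrus-dump-export").getOrCreate()

snapshot = "20240610"  # hypothetical snapshot partition
df = spark.table("discovery.cirrus_index").where(F.col("snapshot") == snapshot)

# Each page becomes two lines: the bulk action line {"index": {...}} and the
# document source line (assumed to be serialized JSON already).
bulk = df.select(
    "wikiid",
    F.concat_ws(
        "\n",
        F.to_json(F.struct(
            F.struct(
                F.lit("page").alias("_type"),
                F.col("page_id").alias("_id"),
            ).alias("index")
        )),
        F.col("document"),
    ).alias("value"),
)

# partitionBy("wikiid") yields one wikiid=<wiki> directory per wiki, with
# several part files each; those part files map onto the $counter suffix
# during the later rename step. Gzip matches the .gz files published today.
(bulk.write
     .partitionBy("wikiid")
     .option("compression", "gzip")
     .mode("overwrite")
     .text("hdfs:///wmf/data/discovery/cirrus_dumps/" + snapshot))
```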

Event Timeline

Restricted Application added a subscriber: Aklapper.

Super happy to see this, and to learn that the path to get this dump away from the Dumps 1.0 infrastructure is straightforward.

this process should rename the spark partitions into something human friendly

FYI @Antoine_Quhen recently implemented a file rename for this very purpose for Dumps 2.0, just in case you'd like to look at it and perhaps implement something similar. So you can do this as part of the Spark process.
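One possible way to do the rename inside the Spark job itself, using the Hadoop FileSystem API through the driver's JVM gateway; the paths, the dump type and the partitionBy("wikiid") layout are assumptions carried over from the sketch above, not the Dumps 2.0 implementation referenced here.

```python
# Hedged sketch: rename Spark part files on HDFS into the human-friendly
# pattern $wikiid-$snapshot-cirrussearch-$type.$counter.gz.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

hadoop_conf = sc._jsc.hadoopConfiguration()
Path = sc._jvm.org.apache.hadoop.fs.Path
fs = Path("hdfs:///").getFileSystem(hadoop_conf)

snapshot = "20240610"   # hypothetical snapshot
dump_type = "general"   # hypothetical dump type label
base = Path("hdfs:///wmf/data/discovery/cirrus_dumps/%s" % snapshot)

for wiki_dir in fs.listStatus(base):
    if not wiki_dir.isDirectory():
        continue
    # Directories look like wikiid=enwiki when written with partitionBy("wikiid").
    wikiid = wiki_dir.getPath().getName().split("=", 1)[1]
    counter = 0
    for part in fs.listStatus(wiki_dir.getPath()):
        name = part.getPath().getName()
        if not name.startswith("part-"):
            continue  # skip _SUCCESS and other marker files
        target = Path(
            wiki_dir.getPath(),
            "%s-%s-cirrussearch-%s.%d.gz" % (wikiid, snapshot, dump_type, counter),
        )
        fs.rename(part.getPath(), target)
        counter += 1
```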

Gehel triaged this task as High priority.Mon, Jun 10, 3:38 PM
Gehel moved this task from needs triage to ML & Data Pipeline on the Discovery-Search board.