As of today the search index dumps accessible from https://dumps.wikimedia.org/other/cirrussearch/ are generated by a MW script.
This script has become slow enough that it no longer completes before the code it relies on is cleaned up by scap.
The search indices are also exported by a separate process that populates a Hive table in the Avro format. This process is much more efficient and does not rely on MW.
The idea would be to source the dumps from that table instead of running a long MW maintenance script.
I believe that the Dumps 2.0 project might have similar needs, in the sense that its data would also be sourced from Hadoop.
My current understanding of what would be needed is as follows:
- a Spark process that converts the Avro table into the Elasticsearch bulk format.
- a process that rsyncs the output folder from HDFS to a host serving dumps.wikimedia.org.
- this process should rename the Spark partition files into something human friendly like $wikiid-$snapshot-cirrussearch-$type.$counter.gz (or bz2).
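Two of the pieces above can be sketched in plain Python (a minimal, hypothetical sketch; the real conversion would run as a Spark job, and the function and variable names here are illustrative, not from any existing codebase):

```python
import json


def to_bulk_lines(doc_id, source):
    """Render one document in the Elasticsearch bulk format:
    an action/metadata line followed by the document source,
    each as one line of JSON (NDJSON)."""
    action = {"index": {"_id": doc_id}}
    return json.dumps(action) + "\n" + json.dumps(source) + "\n"


def dump_filename(wikiid, snapshot, dump_type, counter, ext="gz"):
    """Human-friendly name for a renamed Spark partition file,
    following the $wikiid-$snapshot-cirrussearch-$type.$counter.gz
    pattern proposed above."""
    return f"{wikiid}-{snapshot}-cirrussearch-{dump_type}.{counter}.{ext}"


# Example: one renamed partition and one bulk entry.
print(dump_filename("enwiki", "20240101", "content", 0))
print(to_bulk_lines("42", {"title": "Example"}), end="")
```

The renaming step is needed because Spark writes opaque part-file names (e.g. `part-00000-<uuid>.gz`), which are not meaningful to dump consumers.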