
Source the CirrusSearch index dumps from hadoop instead of a MW maintenance script
Closed, Resolved · Public · 5 Estimated Story Points

Description

As of today, the search index dumps available at https://dumps.wikimedia.org/other/cirrussearch/ are generated by a MW maintenance script.
This script has become slow enough that it no longer finishes before the code it relies on is cleaned up by scap.
The search indices are also exported by a separate process that populates a Hive table in the Avro format. This process is much more efficient and does not rely on MW.
The idea would be to source the data from there instead of running a long MW maint script.
I believe that the Dumps 2.0 project might share similar needs, in the sense that its data would also be sourced from Hadoop.

My current understanding of what would be needed is as follows:

  • have a Spark process that converts the Avro table into the Elasticsearch bulk format (a sketch of the idea follows this list)
  • a process that rsyncs this directory from HDFS to a host serving dumps.wikimedia.org
    • this process should rename the Spark partitions into something human-friendly like $wikiid-$snapshot-cirrussearch-$type.$counter.gz (or .bz2)
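
A minimal PySpark sketch of that first step, assuming a hypothetical source table and placeholder column names (page_id, document, with document already a JSON-encoded string); each row becomes the two-line action/source pair of the Elasticsearch bulk format:

```python
import json

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cirrus-bulk-format").getOrCreate()

# Hypothetical table and columns, for illustration only.
rows = spark.sql("SELECT page_id, document FROM discovery.cirrus_index_export")

def to_bulk_lines(row):
    # The bulk format pairs an action line with the document source line.
    yield json.dumps({"index": {"_id": row.page_id}})
    yield row.document

(rows.rdd
     .flatMap(to_bulk_lines)
     .saveAsTextFile(
         "hdfs:///tmp/cirrus-bulk",
         compressionCodecClass="org.apache.hadoop.io.compress.GzipCodec"))
```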

Details

Related Changes in Gerrit:
Related Changes in GitLab:
Title | Reference | Author | Source Branch | Dest Branch
search: cirrus import: Match permissions between outputs | repos/data-engineering/airflow-dags!1784 | ebernhardson | work/ebernhardson/dumps-umask | main
search: New job to format public cirrus dumps | repos/data-engineering/airflow-dags!1635 | ebernhardson | work/ebernhardson/cirrus-public-dump | main
Generate cirrus dumps in hadoop | repos/search-platform/discolytics!58 | ebernhardson | format-cirrus-dump | main
dumps: Format hdfs dumps as bulk insert lines | repos/search-platform/discolytics!55 | ebernhardson | work/ebernhardson/T366248-public-dumps | main

Event Timeline

Restricted Application added a subscriber: Aklapper.

Super happy to see this, and to learn that the path to get this dump away from the Dumps 1.0 infrastructure is straightforward.

this process should rename the Spark partitions into something human-friendly

FYI @Antoine_Quhen recently implemented a file rename for this very purpose for Dumps 2.0, just in case you'd like to look at it and perhaps implement something similar. So you can do this as part of the Spark process.

Gehel triaged this task as High priority. · Jun 10 2024, 3:38 PM
Gehel moved this task from needs triage to ML & Data Pipeline on the Discovery-Search board.
pfischer set the point value for this task to 5. · Aug 18 2025, 3:25 PM

It was fairly trivial to write some pyspark to reformat the hive table into a text file, but the result is awkward in a couple of ways:

  • The old dumps produce, for example, a single 4.3GB .json.gz file for enwiktionary_content and a 2.1GB .json.gz file for enwiktionary_general. The current iteration of formatting the hive table to text produces 271 .json.snappy files with an average size of ~50MB for enwiktionary.
  • The file names are now mysterious: where previously we had enwiktionary-20250630-cirrussearch-content.json.gz, we now have a directory named cirrus_index=enwiktionary_content and a bunch of numbered files such as part-00269-ae2f56d9-04af-4585-95e2-dee08cc1f5ad.c000.txt.gz

Is it worthwhile to go through and rename everything? Maybe subdirectories are "good enough".

It was fairly trivial to write some pyspark to reformat the hive table into a text file, but the result is awkward in a couple of ways:

  • The old dumps produce, for example, a single 4.3GB .json.gz file for enwiktionary_content and a 2.1GB .json.gz file for enwiktionary_general. The current iteration of formatting the hive table to text produces 271 .json.snappy files with an average size of ~50MB for enwiktionary.

You can balance this with a repartition just before the write call. You could repartition(1) and you will get one giant file like before, but you lose all parallelization since it will all be computed by a single executor. Or you could, say, repartition(20) to get 20 reasonably sized files.
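
For illustration, a minimal sketch of that repartition-before-write, where df and the output path are placeholders (and df is assumed to have a single string column, as the text writer requires):

```python
# Shuffle to a fixed partition count right before writing; each
# resulting partition becomes one output file.
(df.repartition(20)
   .write
   .mode("overwrite")
   .text("hdfs:///tmp/enwiktionary_content"))
```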

  • The file names are now mysterious: where previously we had enwiktionary-20250630-cirrussearch-content.json.gz, we now have a directory named cirrus_index=enwiktionary_content and a bunch of numbered files such as part-00269-ae2f56d9-04af-4585-95e2-dee08cc1f5ad.c000.txt.gz

Is it worthwhile to go through and rename everything? Maybe subdirectories are "good enough".

If you decide to spit out a single file, you could use our HDFSArchiveOperator as is (link) to rename the file. But even if you decided on multiple files, I bet we could modify that Airflow operator to support your use case.

It was fairly trivial to write some pyspark to reformat the hive table into a text file, but the result is awkward in a couple of ways:

  • The old dumps produce, for example, a single 4.3GB .json.gz file for enwiktionary_content and a 2.1GB .json.gz file for enwiktionary_general. The current iteration of formatting the hive table to text produces 271 .json.snappy files with an average size of ~50MB for enwiktionary.

You can balance this with a repartition just before the write call. You could repartition(1) and you will get one giant file like before, but you lose all parallelization since it will all be computed by a single executor. Or you could, say, repartition(20) to get 20 reasonably sized files.

Sadly, all that will result in is YARN killing the executor for blowing out the memory limits. The input data here is over a terabyte, and we write out all 1000+ indexes at once, not just enwiktionary. We could shuffle the data to attempt to get the partitioning better, but it's always tedious because there is no natural partitioning key. Some indices have 100M documents, and some indices have 14. So we can't just partition by the index; we have to pre-calculate some statistics over all indices and then do our own manual partitioning where some indices get 1 partition and some get 100 (roughly the approach sketched below). This is possible, but I was hoping to avoid the tedium.

The source data is actually already partitioned as well: a single file holds documents from a single index, but Spark isn't really aware of the finer details of the data, so we can't simply tell Spark to coalesce only partitions from the same index. If we issue a generic coalesce, from the 33k source partitions to 1,000 output partitions, it only very slightly affects the number of output files, because for the most part it's joining data from two separate indices and then writing them to two separate files. The only change is that this happens to run in the same task, and it requires significantly more memory tuning for the larger task sizes.
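
For concreteness, a sketch of what that manual partitioning might look like, under stated assumptions: the table name and columns (cirrus_index, page_id, document) are placeholders, and the hash-based salting is only approximate, since distinct (index, salt) pairs can still collide into one Spark partition:

```python
import math

import pyspark.sql.functions as F
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
TARGET_BYTES = 1 << 30  # aim for roughly 1GB per output file

# Hypothetical source table; column names are placeholders.
df = spark.table("discovery.cirrus_index_export")

# Pass 1: estimate each index's serialized size from document lengths.
stats = (df.groupBy("cirrus_index")
           .agg(F.sum(F.length("document")).alias("bytes"))
           .collect())
parts = [(row["cirrus_index"], max(1, math.ceil(row["bytes"] / TARGET_BYTES)))
         for row in stats]
parts_df = spark.createDataFrame(parts, ["cirrus_index", "num_parts"])

# Pass 2: salt each row into one of its index's partitions and shuffle.
# Small indices stay in a single file; the giant ones split into ~1GB pieces.
salted = (df.join(F.broadcast(parts_df), "cirrus_index")
            .withColumn("salt",
                        F.abs(F.hash("page_id")) % F.col("num_parts")))
total_parts = sum(n for _, n in parts)
(salted.repartition(total_parts, "cirrus_index", "salt")
       .select("cirrus_index", "document")
       .write.partitionBy("cirrus_index")
       .mode("overwrite")
       .text("hdfs:///tmp/cirrus-dump", compression="bzip2"))
```

The broadcast join is just to attach each row's partition budget without a second full shuffle; the statistics pass is the "tedium" referred to above.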

  • The file names are now mysterious: where previously we had enwiktionary-20250630-cirrussearch-content.json.gz, we now have a directory named cirrus_index=enwiktionary_content and a bunch of numbered files such as part-00269-ae2f56d9-04af-4585-95e2-dee08cc1f5ad.c000.txt.gz

Is it worthwhile to go through and rename everything? Maybe subdirectories are "good enough".

If you decide to spit out a single file, you could use our HDFSArchiveOperator as is (link) to rename the file. But even if you decided on multiple files, I bet we could modify that Airflow operator to support your use case.

I'm not sure if we want a single file or not. The main desire for a single file is to keep the existing "structure" of the dumps, to avoid surprising external users and anything they might have built up around these dumps. On the other hand, large indices like commonswiki_file result in dumps of approximately 50GB (compressed). I suppose what I'm indecisive about is the appropriate number? Looking at my test dump, commonswiki_file resulted in 1,332 separate dump files. That seems like too many, but a single 50GB file is also tedious to deal with. Plucking random numbers out of the air, I was going to aim for around 1GB per output file.

I see it is more complicated than I thought. We can always:

  • Generate a new version of the dump, with a different file format and/or semantics.
  • Deprecate but continue running the old dump in parallel to give consumers time to adjust.
  • Announce the new format and the deprecation of the old.
  • Come back later to sunset the old version of the dump.

Dumping to HDFS is basically ready to go:

  • Dumps are a directory per search index
  • Each directory contains one or more files. In directories with multiple files, typical file size is 0.5GB-1.5GB.
  • Files have been renamed for publication to {index_name}-{snapshot_id}-{part_num}.json.bz2 (one way such a rename can be scripted is sketched below)
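
A hypothetical sketch of that rename step, going through the Hadoop FileSystem API that PySpark exposes via its JVM gateway; the paths, the naming arguments, and the reliance on the private spark._jvm handle are all illustrative assumptions, not the production code:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

def publish_names(index_dir: str, index_name: str, snapshot_id: str) -> None:
    # Rename Spark's opaque part-* files in index_dir to the published
    # {index_name}-{snapshot_id}-{part_num}.json.bz2 pattern.
    jvm = spark._jvm
    fs = jvm.org.apache.hadoop.fs.FileSystem.get(
        spark._jsc.hadoopConfiguration())
    src_dir = jvm.org.apache.hadoop.fs.Path(index_dir)
    part_num = 0
    for status in fs.listStatus(src_dir):
        name = status.getPath().getName()
        if not name.startswith("part-"):
            continue  # skip _SUCCESS and other marker files
        dst = jvm.org.apache.hadoop.fs.Path(
            src_dir, f"{index_name}-{snapshot_id}-{part_num}.json.bz2")
        fs.rename(status.getPath(), dst)
        part_num += 1

# Hypothetical usage against one index's output directory.
publish_names("hdfs:///tmp/cirrus-dump/cirrus_index=enwiktionary_content",
              "enwiktionary_content", "20250630")
```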

To be done:

  • We need to add the step that syncs from hdfs to public. Not sure how that should be done.

I suppose a side benefit of replacing the dumps with these is that the schema is now significantly stricter. It's not exact, because we still forward un-schema'd extra_cols from the source to the output, but the columns that are schema'd are guaranteed to match the expected types. In the historical dump, for example, file_text could be a string, false, or the empty array. It's now always either a string or null. Similarly, wikibase descriptions had a different shape in wikidata vs commonswiki; in this dump they are the same.
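
As an illustration of the stricter typing (not the actual job's code), the historically loose file_text values could be coerced to string-or-null with something like the following, assuming the raw value arrives as a string-typed JSON fragment:

```python
import pyspark.sql.functions as F

# Map the historical non-string encodings ("false", "[]") to null so the
# output column is always a string or null.
normalized = df.withColumn(
    "file_text",
    F.when(F.col("file_text").isin("false", "[]"),
           F.lit(None).cast("string"))
     .otherwise(F.col("file_text")))
```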

We need to add the step that syncs from hdfs to public. Not sure how that should be done.

In HDFS, we have a folder /wmf/data/archive where you can move your files to. Let's say you do it to /wmf/data/archive/cirrus-search-index/{date}/blah.

Then you can set up an hdfs_tools::hdfs_rsync_job in puppet to rsync from that HDFS path to the clouddumps* nodes that serve the dumps (examples here).

Change #1184585 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[operations/puppet@production] dumps: Sync cirrus index dumps from hdfs

https://gerrit.wikimedia.org/r/1184585

We need to add the step that syncs from hdfs to public. Not sure how that should be done.

In HDFS, we have a folder /wmf/data/archive where you can move your files to. Let's say you do it to /wmf/data/archive/cirrus-search-index/{date}/blah.

Then you can set up an hdfs_tools::hdfs_rsync_job in puppet to rsync from that HDFS path to the clouddumps* nodes that serve the dumps (examples here).

I'd say that there are some other options to be considered, too. That puppet-based mechanism that calls hdfs-rsync will work, but it's maybe a bit of a legacy way to do it.
When we migrated the dumps v1 to Airflow recently, we needed to find a way to publish from the CephFS mount point /mnt/dumpsdata to the clouddumps hosts.

We created a sync-utils container image, and we then add specific tasks into our DAGs that are responsible for publishing the generated files.
In the case of the dumps, we found that the best option was to use parallel-rsync and specify both clouddumps1001 and clouddumps1002 as the targets.
This allows us to have one task that either successfully publishes to both target locations, or fails.

So for example, if you look at the current cirrussearch dumps, you will see that they have a sync_cirrussearch_dumps task, which calls parallel-rsync with these custom arguments.

[screenshot: the sync_cirrussearch_dumps task and its parallel-rsync arguments in the Airflow UI]

There are many options around how you would schedule and trigger these publishing tasks, so they don't all need to be sequential like the example shown here.

However, for this requirement it would be a little different, because the source files are presumably going to be created on HDFS, rather than CephFS.
This means that we won't be able to have the source directory mounted as a locally available file system.

I think that we have at least a few ways that we could tackle this, though.
One that occurs to me immediately is that we could use rclone instead of parallel-rsync.

rclone already has an hdfs remote capability built-in, so we could use this for one side of the connection and an sftp remote on the other side.
We would be able to give it access to the kerberos credential cache and Hadoop configuration files for the HDFS connection, and the SSH private key for the SFTP connection.

Then we could just execute an rclone sync command and supply the source and destination paths.
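
A hypothetical invocation, driven from Python for consistency with the rest of the pipeline; the remote names (hadoop, clouddumps) would be defined in an rclone config pointing at the HDFS and SFTP endpoints, and all paths here are placeholders:

```python
import subprocess

# rclone sync makes the destination match the source. The hdfs remote
# would authenticate via the kerberos credential cache, and the sftp
# remote via an SSH private key, as described above.
subprocess.run(
    ["rclone", "sync",
     "hadoop:/wmf/data/archive/cirrus-search-index/20250630",
     "clouddumps:/srv/dumps/other/cirrus_search_index/20250630"],
    check=True)
```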

@xcollazo - I can see this being a good option for T384381: Airflow jobs to do monthly XML dumps as well.

What do we think is the right way forward here? If SRE will be prioritizing a newer method of getting data from HDFS to the public sites in the next month or so, then it seems like this could wait, but if it's uncertain when that work will be prioritized, it seems reasonable to move forward with the existing puppet bits that invoke hdfs_tools::hdfs_rsync_job.

I like @BTullis's idea of having an Airflow operator that takes care of this. As mentioned, it would immediately have two use cases: this ticket, and T384381.

Thus I've been bold and created T405360: Implement an Airflow operator for moving data from point A to B.

We can discuss priority for this at the upcoming "DPE SRE / DE sync up" meeting?

It looks like DE are going to move forward with the existing sync mechanisms (T405360#11277591); we should probably do the same.

I've updated the airflow-dags patch so it no longer has the Draft flag; it's ready for review and deployment. discolytics has already shipped a new version containing the appropriate code.

To be reviewed (but the puppet patch should not be merged until a clean run of the DAG has been verified):

First run of the updated DAG completed; dumps were formatted and moved to the exports path in HDFS. Reviewing the output, it all looks reasonable and as expected. Next up is enabling the public sync via the puppet patch.

Change #1184585 merged by Ryan Kemper:

[operations/puppet@production] dumps: Sync cirrus index dumps from hdfs

https://gerrit.wikimedia.org/r/1184585

With puppet deployed, we should expect to see these arrive at https://dumps.wikimedia.org/other/cirrus_search_index/ after 05:00 UTC tomorrow.

@EBernhardson if you're happy with these new dumps, do you still want the "old" cirrussearch dumps to run on Airflow?

Change #1202289 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[operations/puppet@production] dumps: Fix missing trailing slash in cirrus-search-index path

https://gerrit.wikimedia.org/r/1202289

Change #1202289 merged by Ryan Kemper:

[operations/puppet@production] dumps: Fix missing trailing slash in cirrus-search-index path

https://gerrit.wikimedia.org/r/1202289

Mentioned in SAL (#wikimedia-operations) [2025-11-05T22:12:16Z] <ryankemper> T366248 sudo rm -rfv /srv/dumps/xmldatadumps/public/other/cirrus_search_index/cirrus-search-index/ on clouddumps100[1,2].wikimedia.org

@EBernhardson if you're happy with these new dumps, do you still want the "old" cirrussearch dumps to run on Airflow?

In a few weeks, but not yet. While we only know who a couple of them are, we do know there are downstream consumers of this dataset. The replacement is shaped slightly differently, so we want to give a few weeks (a month?) of transition time after we announce availability.

In the communication we went with promising dumps through November, shutting off sometime in December:

We will continue producing the old dumps through November, expecting to shut them off before the end of the year.

Should we start by disabling the legacy cirrussearch dumps in the Airflow UI?
https://airflow-test-k8s.wikimedia.org/dags/mediawiki_cirrussearch_dump/grid

[screenshot: the mediawiki_cirrussearch_dump DAG in the Airflow UI]

If nothing falls over and nobody complains after a couple of weeks, then we can remove the code from Airflow-DAGs.

Should we start by disabling the legacy cirrussearch dumps in the Airflow UI?
https://airflow-test-k8s.wikimedia.org/dags/mediawiki_cirrussearch_dump/grid

[screenshot: the mediawiki_cirrussearch_dump DAG in the Airflow UI]

If nothing falls over and nobody complains after a couple of weeks, then we can remove the code from Airflow-DAGs.

That sounds good to me; we initially said we intended to turn it off by the end of the year. This is probably the best next step to getting people moved over.

Change #1223722 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[operations/puppet@production] dumps: Repoint cirrus dumps to new location

https://gerrit.wikimedia.org/r/1223722

I went through today to verify that everything is ready to go:

  • Pulled the simplewiki_content dump from 20260104 and imported it into a local opensearch instance. The full dump loaded with no glaring issues. There is still potential for subtle issues.
  • Noticed that while the directory is dated 20260104, the individual files were dated one week prior (the airflow data_interval_start). Pushed patches to fix that up and re-triggered the 20260104 dump to verify both the fix and that a re-run of the dumps replaces the old dump (instead of co-mingling two dumps). Ran successfully; the dump was replaced with new files.
  • Put up a patch that repoints the link at https://dumps.wikimedia.org/other to the new dumps
  • Manually disabled the airflow job for the old dumps.
  • Put together a proposed deprecation document, to be placed in the old dumps directory (https://dumps.wikimedia.org/other/cirrussearch/):

The CirrusSearch dumps in this directory (other/cirrussearch/) are
no longer being updated.

NEW LOCATION:
https://dumps.wikimedia.org/other/cirrus_search_index/

WHY IT CHANGED:
These dumps have gotten slower over time, to the point where
it would take 7 or 8 days to produce a weekly dump. The replacement
orchestration is designed to better handle the hundreds of GB
dumped each week. The replacement orchestration publishes the full
dump approximately 12 hours after starting up.

WHAT CHANGED:
Dumps are now sharded into smaller files to support parallelization
of the dumps process. For example, 'commonswiki_file' is now
located in a subdirectory and split into multiple 1GB chunks
rather than a single large blob.

Please update your scrapers and automated jobs to use the new
directory structure.

Mentioned in SAL (#wikimedia-operations) [2026-01-07T21:54:26Z] <inflatador> bking@clouddumps100{1,2} created /srv/dumps/xmldatadumps/public/other/cirrussearch/DEPRECATED.txt T366248

The deprecation doc has been placed; this should be complete.

Change #1223722 abandoned by Ebernhardson:

[operations/puppet@production] dumps: Repoint cirrus dumps to new location

Reason:

Resolved in I589e2910235d928020632e0242b568d314acf708

https://gerrit.wikimedia.org/r/1223722