
Dump Cirrus index into a file
Closed, ResolvedPublic

Description

It'd be useful to have a tool to dump the Elasticsearch index to a file, ideally in the Elasticsearch bulk format. That way you could load the file with a one-line command if it is small:

curl -s -XPOST localhost:9200/_bulk --data-binary @thefile

Or if it is large you could chunk it:

rm -rf chunked
mkdir chunked
split thefile chunked/chunk
for chunk in chunked/chunk*; do
  echo $chunk
  curl -s -XPOST localhost:9200/_bulk --data-binary @$chunk
done
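For reference, a minimal sketch of what such a bulk file contains: alternating action lines and document source lines. The index name and document fields below are made up for illustration; real Cirrus documents carry many more fields.

cat > thefile <<'EOF'
{"index":{"_index":"wiki_content","_type":"page","_id":"1956"}}
{"title":"Example","text":"Example body text"}
{"index":{"_index":"wiki_content","_type":"page","_id":"1957"}}
{"title":"Another page","text":"More body text"}
EOF
# The bulk endpoint expects every line, including the last, to end with a newline.
curl -s -XPOST localhost:9200/_bulk --data-binary @thefile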

Stakeholders: Cirrus developers, external users
Benefits: Allows developers to pull an index back to their local development system for testing. We could also publish this as a dump, like we do the XML dumps, for other folks.
Estimate: David Causse intro task. Highly variable because it's for getting acclimated with the code.

Event Timeline

Manybubbles raised the priority of this task from to Medium.
Manybubbles updated the task description. (Show Details)
Manybubbles moved this task to Search on the Discovery-ARCHIVED board.
Manybubbles subscribed.
Manybubbles set Security to None.

Assigning to myself because David doesn't have a phab account that I can find yet. He'll get one and when he does he'll take this task from me.

Note: it's probably a good idea to do this as a maintenance script inside Cirrus.

Change 217716 had a related patch set uploaded (by DCausse):
Add a maintenance script to dump an index into a file

https://gerrit.wikimedia.org/r/217716

Interesting. Does this sort of dump contain private data?

> Interesting. Does this sort of dump contain private data?

I don't think so. It's just the contents of ?action=cirrusdump but for every page. It might make sense not to publish the user namespace, but that data is already available in our XML dumps.
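For a single page, the same per-page document can be inspected directly via that action; a quick sketch (the wiki and page title are arbitrary examples):

curl -s 'https://en.wikipedia.org/wiki/Albert_Einstein?action=cirrusdump'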

Change 217716 merged by jenkins-bot:
Add a maintenance script to dump an index to stdout

https://gerrit.wikimedia.org/r/217716
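With the script merged, a hedged sketch of the developer workflow from the description, i.e. pulling a production index back to a local system (the wiki name and output file are examples; the flags mirror the invocation shown later in this task):

mwscript extensions/CirrusSearch/maintenance/dumpIndex.php --wiki enwiki --indexType content > thefile
# Then load thefile into a local Elasticsearch with the one-liner or the chunked loop from the description.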

> I don't think so. It's just the contents of ?action=cirrusdump but for every page. It might make sense not to publish the user namespace, but that data is already available in our XML dumps.

Thanks. So it would be great to archive the dumps somewhere.

>> I don't think so. It's just the contents of ?action=cirrusdump but for every page. It might make sense not to publish the user namespace, but that data is already available in our XML dumps.

> Thanks. So it would be great to archive the dumps somewhere.

Indeed.

Looks like a Zend bug:

manybubbles@deployment-bastion:/srv/mediawiki-staging/php-master$ mwscript extensions/CirrusSearch/maintenance/dumpIndex.php --wiki enwiki --indexType content
Dumping 19315 documents (19315 in the index)
{"index":{"_type":"page","_id":"1956"}}

Warning: json_encode() expects parameter 2 to be long, string given in /mnt/srv/mediawiki-staging/php-master/extensions/CirrusSearch/maintenance/dumpIndex.php on line 175


Fatal error: Call to private method CirrusSearch\Maintenance\DumpIndex::outputProgress() from context '' in /mnt/srv/mediawiki-staging/php-master/extensions/CirrusSearch/maintenance/dumpIndex.php on line 153
manybubbles@deployment-bastion:/srv/mediawiki-staging/php-master$ php --version
PHP 5.3.10-1ubuntu3.18 with Suhosin-Patch (cli) (built: Apr 17 2015 15:11:25) 
Copyright (c) 1997-2012 The PHP Group
Zend Engine v2.3.0, Copyright (c) 1998-2012 Zend Technologies
manybubbles@deployment-bastion:/srv/mediawiki-staging/php-master$

Change 219902 had a related patch set uploaded (by DCausse):
Add support for PHP 5.3 to the dumpIndex maintenance script

https://gerrit.wikimedia.org/r/219902

Change 219902 merged by jenkins-bot:
Add support for PHP 5.3 to the dumpIndex maintenance script

https://gerrit.wikimedia.org/r/219902

Hydriz subscribed.

This bug may have been resolved, but please provide a download URL so that Datasets-Archiving can pick these dumps up and push them to the Internet Archive accordingly, thanks!

These dumps are not generated automatically. They contain only a subset of the data that is already available in the db dumps.
This is mostly a tool for developers, because the format is very convenient for reproducing a production index on another system (db dumps are hard to use for that).
I'd like to generate these dumps regularly, like the db dumps, but I don't know where to start (on which server, where we should store them, etc.).

Today, when I need a dump, I run the script from terbium and stash the data in my home directory.
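A hedged sketch of that manual workflow (the gzip step and output file name are assumptions; the dumpIndex.php flags mirror the earlier invocation):

mwscript extensions/CirrusSearch/maintenance/dumpIndex.php --wiki enwiki --indexType content | gzip > ~/enwiki-content-cirrus-dump.gz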