
Dump Cirrus index into a file
Closed, ResolvedPublic

Description

It'd be useful to have a tool to dump the Elasticsearch index to a file, ideally in the Elasticsearch bulk format. That way you could load the file with a one-line command if it is small:

curl -s -XPOST localhost:9200/_bulk --data-binary @thefile

Or if it is large you could chunk it:

rm -rf chunked
mkdir chunked
split thefile chunked/chunk
for chunk in chunked/chunk*; do
  echo $chunk
  curl -s -XPOST localhost:9200/_bulk --data-binary @$chunk
done
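For reference, a minimal sketch of what such a bulk file contains: alternating action lines and document source lines. The index name and document fields below are made up for illustration; real Cirrus documents carry many more fields.

cat > thefile <<'EOF'
{"index":{"_index":"wiki_content","_type":"page","_id":"1956"}}
{"title":"Example","text":"Example body text"}
{"index":{"_index":"wiki_content","_type":"page","_id":"1957"}}
{"title":"Another page","text":"More body text"}
EOF
# The bulk endpoint expects every line, including the last, to end with a newline.
curl -s -XPOST localhost:9200/_bulk --data-binary @thefile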

Stakeholders: Cirrus developers, external users
Benefits: Allows developers to pull an index back to their local development system for testing. We could also publish this as a dump, like we do the XML dumps, for other folks.
Estimate: David Causse intro task. Highly variable because it's for getting acclimated with the code.

Event Timeline

Manybubbles raised the priority of this task from to Medium.
Manybubbles updated the task description. (Show Details)
Manybubbles moved this task to Search on the Discovery-ARCHIVED board.
Manybubbles subscribed.
Manybubbles set Security to None.

Assigning to myself because David doesn't have a phab account that I can find yet. He'll get one and when he does he'll take this task from me.

Note: it's probably a good idea to do this as a maintenance script inside Cirrus.

Change 217716 had a related patch set uploaded (by DCausse):
Add a maintenance script to dump an index into a file

https://gerrit.wikimedia.org/r/217716

Interesting. Does this sort of dump contain private data?

> Interesting. Does this sort of dump contain private data?

I don't think so. It's just the contents of ?action=cirrusdump but for every page. It might make sense not to publish the user namespace, but that data is already available in our XML dumps.
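For a single page, the same per-page document can be inspected directly via that action; a quick sketch (the wiki and page title are arbitrary examples):

curl -s 'https://en.wikipedia.org/wiki/Albert_Einstein?action=cirrusdump'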

Change 217716 merged by jenkins-bot:
Add a maintenance script to dump an index to stdout

https://gerrit.wikimedia.org/r/217716
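With the script merged, a hedged sketch of the developer workflow from the description, i.e. pulling a production index back to a local system (the wiki name and output file are examples; the flags mirror the invocation shown later in this task):

mwscript extensions/CirrusSearch/maintenance/dumpIndex.php --wiki enwiki --indexType content > thefile
# Then load thefile into a local Elasticsearch with the one-liner or the chunked loop from the description.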

> I don't think so. It's just the contents of ?action=cirrusdump but for every page. It might make sense not to publish the user namespace, but that data is already available in our XML dumps.

Thanks. So it would be great to archive the dumps somewhere.

>> I don't think so. It's just the contents of ?action=cirrusdump but for every page. It might make sense not to publish the user namespace, but that data is already available in our XML dumps.

> Thanks. So it would be great to archive the dumps somewhere.

Indeed.

Looks like a Zend bug:

manybubbles@deployment-bastion:/srv/mediawiki-staging/php-master$ mwscript extensions/CirrusSearch/maintenance/dumpIndex.php --wiki enwiki --indexType content
Dumping 19315 documents (19315 in the index)
{"index":{"_type":"page","_id":"1956"}}

Warning: json_encode() expects parameter 2 to be long, string given in /mnt/srv/mediawiki-staging/php-master/extensions/CirrusSearch/maintenance/dumpIndex.php on line 175


Fatal error: Call to private method CirrusSearch\Maintenance\DumpIndex::outputProgress() from context '' in /mnt/srv/mediawiki-staging/php-master/extensions/CirrusSearch/maintenance/dumpIndex.php on line 153
manybubbles@deployment-bastion:/srv/mediawiki-staging/php-master$ php --version
PHP 5.3.10-1ubuntu3.18 with Suhosin-Patch (cli) (built: Apr 17 2015 15:11:25) 
Copyright (c) 1997-2012 The PHP Group
Zend Engine v2.3.0, Copyright (c) 1998-2012 Zend Technologies
manybubbles@deployment-bastion:/srv/mediawiki-staging/php-master$

Change 219902 had a related patch set uploaded (by DCausse):
Add support for PHP 5.3 to the dumpIndex maintenance script

https://gerrit.wikimedia.org/r/219902

Change 219902 merged by jenkins-bot:
Add support for PHP 5.3 to the dumpIndex maintenance script

https://gerrit.wikimedia.org/r/219902

Hydriz subscribed.

This bug may have been resolved, but please provide a download URL so that Datasets-Archiving can pick these dumps up and push them to the Internet Archive accordingly, thanks!

These dumps are not generated automatically. They contain only a subset of the data that is already available in the db dumps.
This is mostly a tool for developers, because the format is very convenient for reproducing a production index on another system (db dumps are hard to use for that).
I'd like to generate these dumps regularly, like the db dumps, but I don't know where to start (on which server, where we should store them, etc.).

Today, when I need a dump, I run the script from terbium and stash the data in my home directory.
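A hedged sketch of that manual workflow (the gzip step and output file name are assumptions; the dumpIndex.php flags mirror the earlier invocation):

mwscript extensions/CirrusSearch/maintenance/dumpIndex.php --wiki enwiki --indexType content | gzip > ~/enwiki-content-cirrus-dump.gz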