
Make elasticsearch exports publicly available
Closed, Resolved · Public

Description

CirrusSearch has a method to export its current search indexes to a file. The export contains one JSON string per article, formatted for use with the Elasticsearch bulk API. Because the bulk requests are just JSON, they can easily be processed with anything that reads JSON. This information is already public, but only on a per-article basis [1]. Full wiki dumps could be made publicly available and might be useful to anyone doing text-based analysis of the corpus. This is also something we could point to when throttling abusive clients.
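
For illustration, here is a minimal sketch of reading such an export, assuming it is a gzipped file of Elasticsearch bulk lines (alternating action/metadata and document lines); the filename is hypothetical:

import gzip
import json

# Hypothetical filename; assumes a gzipped Elasticsearch bulk export,
# i.e. alternating action/metadata lines and document lines.
DUMP_FILE = "enwiki-content-cirrussearch.json.gz"

with gzip.open(DUMP_FILE, "rt", encoding="utf-8") as dump:
    parsed = (json.loads(line) for line in dump if line.strip())
    count = 0
    for action, document in zip(parsed, parsed):  # consume lines in pairs
        count += 1  # each pair describes one article's index document
print(count, "documents in", DUMP_FILE)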

The total cluster size is currently 2.5 TB. Limited to only the content namespaces, it is 1.2 TB. These generally compress about 10 to 1 with gzip versus their reported size, so the content indexes would come to roughly 120 GB compressed. For reference, the ten largest indexes are:

enwiki_general        438 GB
commonswiki_file      239 GB
enwiki_content        200 GB
commonswiki_general    70 GB
frwiki_general         65 GB
dewiki_general         62 GB
jawiki_content         62 GB
dewiki_content         55 GB
frwiki_content         55 GB
metawiki_general       54 GB

[1] http://en.wikipedia.org/wiki/California?action=cirrusdump
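
For comparison, the existing per-article export from [1] can already be fetched directly; a minimal sketch using only the standard library, without assuming the exact shape of the response:

import json
from urllib.request import Request, urlopen

# The per-article export referenced in [1]; ?action=cirrusdump returns JSON.
# A User-Agent is set only because Wikipedia may reject the default one.
url = "https://en.wikipedia.org/wiki/California?action=cirrusdump"
request = Request(url, headers={"User-Agent": "cirrusdump-example/0.1"})
with urlopen(request) as response:
    dump = json.load(response)

# Peek at the beginning of the result without assuming its exact structure.
print(json.dumps(dump, indent=2)[:500])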

Event Timeline

EBernhardson raised the priority of this task from to Needs Triage.
EBernhardson updated the task description. (Show Details)
EBernhardson subscribed.
Restricted Application added a subscriber: Aklapper.
Hydriz changed the task status from Open to Stalled. Oct 16 2015, 11:01 AM

Is there any progress on this? It would be great to have a timeline or task list available for resolving this task.

Also, ideally dumps would be provided for the different types (main namespace / all namespaces), with the date of the dump included in the file name. This would simplify the process of pushing these dumps to Archive.org, thanks!

Is there any progress on this? It would be great to have a timeline or task list available for resolving this task.

This item is in the backlog, so there is no timeline at present.

In terms of actual work, this is probably a few days, maybe a week, for a Discovery engineer. The PHP code necessary is already written; the work would be focused on how the data gets from a production machine to the dumps site, how we make sure disk space is available, how we get it running in an automated manner, etc.

The work to be done isn't specific to Discovery or CirrusSearch in any way. If someone who already knows how these things work wants to take this on, I would be happy to provide any necessary information about the process of getting dumps out of CirrusSearch.

I don't see how this is stalled (waiting for further input from the reporter or a third party), hence resetting the task status. It's just that no one is working on this currently, which the corresponding team could express by setting the Priority field.

Aklapper changed the task status from Stalled to Open. Oct 17 2015, 11:12 AM

Thank you for resetting the bug status. It was stalled when I originally changed it, but now that we have gotten proper updates from the original reporter, that status is no longer relevant.

When we make this available, it should probably be one .gz file per index, so that dewiki_content.json.gz or eswiki_general.json.gz can be downloaded for research.
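
A rough sketch of producing one dated, gzipped file per index as suggested above; the naming pattern, paths, and helper name here are assumptions for illustration, not the final convention:

import gzip
import shutil
from datetime import date

# Illustrative only: compress one uncompressed per-index export into a
# dated .gz file, e.g. dewiki-20151027-cirrussearch-content.json.gz.
def compress_index_dump(wiki, dump_type, source_path):
    stamp = date.today().strftime("%Y%m%d")
    dest = f"{wiki}-{stamp}-cirrussearch-{dump_type}.json.gz"
    with open(source_path, "rb") as source, gzip.open(dest, "wb") as out:
        shutil.copyfileobj(source, out)
    return dest

# e.g. compress_index_dump("dewiki", "content", "dewiki-content.json")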

Change 248596 had a related patch set uploaded (by EBernhardson):
Generate weekly cirrussearch dumps

https://gerrit.wikimedia.org/r/248596

This job is running in a screen on snapshot1003 right now. I started it at 8:30 am UTC, it's about 16:50 now, and it's currently in the middle of cirrusdump-dewiki-20151027-cirrussearch-content.log (general log still to go). I'll report on it again just before going to bed.

Looking at the logs to get an idea of run times:

-rw-rw-r-- 1 datasets datasets 709 Oct 27 15:58 cirrusdump-commonswiki-20151027-cirrussearch-file.log
-rw-rw-r-- 1 datasets datasets 709 Oct 27 11:48 cirrusdump-commonswiki-20151027-cirrussearch-general.log
-rw-rw-r-- 1 datasets datasets 705 Oct 27 10:49 cirrusdump-commonswiki-20151027-cirrussearch-content.log
-rw-rw-r-- 1 datasets datasets 490 Oct 27 10:48 cirrusdump-cnwikimedia-20151027-cirrussearch-general.log
...

Commons takes quite a while to produce its file search index dump, so we should keep that in mind for the future.

The first dump in the list was produced on Oct 27 at 08:31; the last on Oct 28 at 22:56.
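
For anyone repeating this kind of check later, a rough sketch of estimating per-index run times from the log files' modification times; the log directory is hypothetical, and it assumes each log is written when its dump finishes:

from datetime import datetime
from pathlib import Path

# Hypothetical log directory; assumes each log's mtime marks when that
# index's dump finished, so gaps between consecutive mtimes approximate
# per-index run times.
LOG_DIR = Path("/var/log/cirrusdump")

logs = sorted(LOG_DIR.glob("cirrusdump-*.log"), key=lambda p: p.stat().st_mtime)
previous = None
for log in logs:
    finished = datetime.fromtimestamp(log.stat().st_mtime)
    if previous is not None:
        print(f"{log.name}: about {finished - previous} after the previous dump")
    previous = finished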

Change 248596 merged by ArielGlenn:
Generate weekly cirrussearch dumps

https://gerrit.wikimedia.org/r/248596

Please keep an eye on the first cron run and let me know if there are any issues.

I suggest adding a link to https://dumps.wikimedia.org/other/ first if we do intend to announce this.

User-notice/Tech News question: if you had to explain this in one or two simple sentences, how would you put it?

Change 249761 had a related patch set uploaded (by EBernhardson):
Add cirrussearch to dumps.wikimedia.org/other html page

https://gerrit.wikimedia.org/r/249761

@Johan how about the following:

JSON dumps of the production search indexes. Can be imported into Elasticsearch.

It doesn't really capture the fact that these can be used by anything that reads JSON, or that the search indices contain a good bit more broken-out information than just the wikitext (the dump contains the wikitext, the content after stripping HTML tags from the parsed result, a list of incoming links, a list of outgoing links, a list of headings on the page, and so on). But even just announcing that they exist is better than saying nothing, since I can't seem to describe the above very well. :)
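
To make the "can be imported into Elasticsearch" part a bit more concrete, here is a minimal sketch that streams a dump into a local cluster through the bulk API using only the standard library; the host, index name, filename, and chunk size are all assumptions:

import gzip
from urllib.request import Request, urlopen

# Assumptions: a local Elasticsearch on port 9200, a target index that already
# exists, and a dump in plain bulk format (alternating action and document
# lines). Depending on the Elasticsearch version, the action metadata in the
# dump may need adjusting before it is accepted.
DUMP_FILE = "dewiki-content-cirrussearch.json.gz"   # hypothetical filename
BULK_URL = "http://localhost:9200/dewiki_content/_bulk"
CHUNK_LINES = 1000  # keep this even so action/document pairs are not split

def send_chunk(lines):
    body = "".join(lines).encode("utf-8")
    request = Request(BULK_URL, data=body,
                      headers={"Content-Type": "application/x-ndjson"})
    with urlopen(request) as response:
        response.read()  # per-item results are ignored in this sketch

with gzip.open(DUMP_FILE, "rt", encoding="utf-8") as dump:
    chunk = []
    for line in dump:
        chunk.append(line)
        if len(chunk) >= CHUNK_LINES:
            send_chunk(chunk)
            chunk = []
    if chunk:
        send_chunk(chunk)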

Change 249761 merged by ArielGlenn:
Add cirrussearch to dumps.wikimedia.org/other html page

https://gerrit.wikimedia.org/r/249761

ArielGlenn claimed this task.

Looks like the cron job runs successfully. Your link is in, so closing.