
Make elasticsearch exports publicly available
Closed, Resolved · Public

Description

CirrusSearch has a method to export its current search indexes to a file. The export contains one JSON string per article, formatted for use with the Elasticsearch bulk API. Because the bulk requests are just JSON, they can easily be processed with anything that reads JSON. This information is already public, but only on a per-article basis [1]. Full wiki dumps could be made publicly available and might be useful to anyone doing text-based analysis of the corpus. This is also something we could point to when throttling abusive clients.
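
For illustration, here is a minimal sketch of reading such an export, assuming it is a gzipped file of Elasticsearch bulk lines (alternating action/metadata and document lines); the filename is hypothetical:

import gzip
import json

# Hypothetical filename; assumes a gzipped Elasticsearch bulk export,
# i.e. alternating action/metadata lines and document lines.
DUMP_FILE = "enwiki-content-cirrussearch.json.gz"

with gzip.open(DUMP_FILE, "rt", encoding="utf-8") as dump:
    parsed = (json.loads(line) for line in dump if line.strip())
    count = 0
    for action, document in zip(parsed, parsed):  # consume lines in pairs
        count += 1  # each pair describes one article's index document
print(count, "documents in", DUMP_FILE)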

The total cluster size is currently 2.5 TB. Limited to only the content namespaces, it is 1.2 TB. These generally compress about 10 to 1 with gzip versus their reported size, so the content indexes would come to roughly 120 GB compressed. For reference, the ten largest indexes are:

enwiki_general        438 GB
commonswiki_file      239 GB
enwiki_content        200 GB
commonswiki_general    70 GB
frwiki_general         65 GB
dewiki_general         62 GB
jawiki_content         62 GB
dewiki_content         55 GB
frwiki_content         55 GB
metawiki_general       54 GB

[1] http://en.wikipedia.org/wiki/California?action=cirrusdump
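
For comparison, the existing per-article export from [1] can already be fetched directly; a minimal sketch using only the standard library, without assuming the exact shape of the response:

import json
from urllib.request import Request, urlopen

# The per-article export referenced in [1]; ?action=cirrusdump returns JSON.
# A User-Agent is set only because Wikipedia may reject the default one.
url = "https://en.wikipedia.org/wiki/California?action=cirrusdump"
request = Request(url, headers={"User-Agent": "cirrusdump-example/0.1"})
with urlopen(request) as response:
    dump = json.load(response)

# Peek at the beginning of the result without assuming its exact structure.
print(json.dumps(dump, indent=2)[:500])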

Event Timeline

EBernhardson raised the priority of this task from to Needs Triage.
EBernhardson updated the task description. (Show Details)
EBernhardson subscribed.
Restricted Application added a subscriber: Aklapper.
Hydriz changed the task status from Open to Stalled. Oct 16 2015, 11:01 AM

Is there any progress on this? It would be great to have a timeline or task list available for resolving this task.

Also, ideally dumps would be provided for the different types (main namespace / all namespaces), with the date of the dump included in the file name. This would simplify the process of pushing these dumps to Archive.org, thanks!

Is there any progress on this? It would be great to have a timeline or task list available for resolving this task.

This item is in the backlog, so there is no timeline at present.

In terms of actual work, this is probably a few days, maybe a week, for a Discovery engineer. The PHP code necessary is already written; the work would be focused on how the data gets from a production machine to the dumps site, how we make sure disk space is available, how we get it running in an automated manner, etc.

The work to be done isn't specific to Discovery or CirrusSearch in any way. If someone who already knows how these things work wants to take this on, I would be happy to provide any necessary information about the process of getting dumps out of CirrusSearch.

I don't see how this is stalled (waiting for further input from the reporter or a third party), hence resetting the task status. It's just that no one is working on this currently, which the corresponding team could express by setting the Priority field.

Aklapper changed the task status from Stalled to Open. Oct 17 2015, 11:12 AM

Thank you for resetting the bug status. It was stalled when I originally changed it, but now that we have gotten proper updates from the original reporter, that status is no longer relevant.

When we make this available, it should probably be one .gz file per index, so that dewiki_content.json.gz or eswiki_general.json.gz can be downloaded for research.
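
A rough sketch of producing one dated, gzipped file per index as suggested above; the naming pattern, paths, and helper name here are assumptions for illustration, not the final convention:

import gzip
import shutil
from datetime import date

# Illustrative only: compress one uncompressed per-index export into a
# dated .gz file, e.g. dewiki-20151027-cirrussearch-content.json.gz.
def compress_index_dump(wiki, dump_type, source_path):
    stamp = date.today().strftime("%Y%m%d")
    dest = f"{wiki}-{stamp}-cirrussearch-{dump_type}.json.gz"
    with open(source_path, "rb") as source, gzip.open(dest, "wb") as out:
        shutil.copyfileobj(source, out)
    return dest

# e.g. compress_index_dump("dewiki", "content", "dewiki-content.json")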

Change 248596 had a related patch set uploaded (by EBernhardson):
Generate weekly cirrussearch dumps

https://gerrit.wikimedia.org/r/248596

This job is running in a screen on snapshot1003 right now. I started it at 8:30 am UTC, it's about 16:50 now, and it's currently in the middle of cirrusdump-dewiki-20151027-cirrussearch-content.log (general log still to go). I'll report on it again just before going to bed.

Looking at the logs to get an idea of run times:

-rw-rw-r-- 1 datasets datasets 709 Oct 27 15:58 cirrusdump-commonswiki-20151027-cirrussearch-file.log
-rw-rw-r-- 1 datasets datasets 709 Oct 27 11:48 cirrusdump-commonswiki-20151027-cirrussearch-general.log
-rw-rw-r-- 1 datasets datasets 705 Oct 27 10:49 cirrusdump-commonswiki-20151027-cirrussearch-content.log
-rw-rw-r-- 1 datasets datasets 490 Oct 27 10:48 cirrusdump-cnwikimedia-20151027-cirrussearch-general.log
...

Commons takes quite a while to produce its file search index dump, so we should keep that in mind for the future.

The first dump in the list was produced on Oct 27 at 08:31; the last on Oct 28 at 22:56.
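
For anyone repeating this kind of check later, a rough sketch of estimating per-index run times from the log files' modification times; the log directory is hypothetical, and it assumes each log is written when its dump finishes:

from datetime import datetime
from pathlib import Path

# Hypothetical log directory; assumes each log's mtime marks when that
# index's dump finished, so gaps between consecutive mtimes approximate
# per-index run times.
LOG_DIR = Path("/var/log/cirrusdump")

logs = sorted(LOG_DIR.glob("cirrusdump-*.log"), key=lambda p: p.stat().st_mtime)
previous = None
for log in logs:
    finished = datetime.fromtimestamp(log.stat().st_mtime)
    if previous is not None:
        print(f"{log.name}: about {finished - previous} after the previous dump")
    previous = finished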

Change 248596 merged by ArielGlenn:
Generate weekly cirrussearch dumps

https://gerrit.wikimedia.org/r/248596

Please keep an eye on the first cron run and let me know if there are any issues.

I suggest adding a link to https://dumps.wikimedia.org/other/ first if we do intend to announce this.

User-notice/Tech News question: if you had to explain this in one or two simple sentences, how would you put it?

Change 249761 had a related patch set uploaded (by EBernhardson):
Add cirrussearch to dumps.wikimedia.org/other html page

https://gerrit.wikimedia.org/r/249761

@Johan how about the following:

JSON dumps of the production search indexes. Can be imported into Elasticsearch.

It doesn't really capture the fact that these can be used by anything that reads JSON, or that the search indices contain a good bit more broken-out information than just the wikitext (the dump contains the wikitext, the content after stripping HTML tags from the parsed result, a list of incoming links, a list of outgoing links, a list of headings on the page, and so on). But even just announcing that they exist is better than saying nothing, since I can't seem to describe the above very well. :)
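
To make the "can be imported into Elasticsearch" part a bit more concrete, here is a minimal sketch that streams a dump into a local cluster through the bulk API using only the standard library; the host, index name, filename, and chunk size are all assumptions:

import gzip
from urllib.request import Request, urlopen

# Assumptions: a local Elasticsearch on port 9200, a target index that already
# exists, and a dump in plain bulk format (alternating action and document
# lines). Depending on the Elasticsearch version, the action metadata in the
# dump may need adjusting before it is accepted.
DUMP_FILE = "dewiki-content-cirrussearch.json.gz"   # hypothetical filename
BULK_URL = "http://localhost:9200/dewiki_content/_bulk"
CHUNK_LINES = 1000  # keep this even so action/document pairs are not split

def send_chunk(lines):
    body = "".join(lines).encode("utf-8")
    request = Request(BULK_URL, data=body,
                      headers={"Content-Type": "application/x-ndjson"})
    with urlopen(request) as response:
        response.read()  # per-item results are ignored in this sketch

with gzip.open(DUMP_FILE, "rt", encoding="utf-8") as dump:
    chunk = []
    for line in dump:
        chunk.append(line)
        if len(chunk) >= CHUNK_LINES:
            send_chunk(chunk)
            chunk = []
    if chunk:
        send_chunk(chunk)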

Change 249761 merged by ArielGlenn:
Add cirrussearch to dumps.wikimedia.org/other html page

https://gerrit.wikimedia.org/r/249761

ArielGlenn claimed this task.

Looks like the cron job runs successfully. Your link is in, so closing.