
Replicate production elasticsearch indices to labs
Closed, Resolved · Public

Description

Think of all the awesome tools people could write with search data :)

Major things to figure out:

  1. replication strategy (including skipping private wikis)
  2. making sure we've got the resources in labs to take a replica of the prod indices

Use cases:

  • Provides the community with access to query arbitrary data out of Elasticsearch. The MediaWiki search API pales in comparison to what can be done with the ES API directly.
  • The Elasticsearch query and document format is, for many tasks, much easier to use. In the MySQL labs databases, getting a page and all of its information requires a complicated join; in ES it is a single query[1], and by default the returned document contains everything we know about the page[2] (see the example requests after the footnotes).
  • Allows tools to be built by the community to take advantage of all this data in Elasticsearch.
  • Gives the Discovery department access to constantly updated indices in labs for analysis and research into potential changes.
  • Replaces the use of mwgrep (which can only be used by people with shell access).
  • Allows people to search across multiple projects for deprecated JS code.

[1] http://elasticsearch/enwiki_content/page/_search?q=title:Jimmy_Wales
[2] https://en.wikipedia.org/wiki/Jimmy_Wales?action=cirrusdump
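To make [1] and [2] concrete, these are roughly the two requests involved (the Elasticsearch hostname is a placeholder for wherever the labs replica ends up being reachable; only the en.wikipedia.org URL is real):

curl 'http://LABS-ES-HOST:9200/enwiki_content/page/_search?q=title:Jimmy_Wales&pretty'
curl 'https://en.wikipedia.org/wiki/Jimmy_Wales?action=cirrusdump'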

Related Objects

Event Timeline


Using dumps in esbulk format is certainly not the fastest or most convenient way to replicate indices, but there's one major advantage: it allows us to experiment with new mappings.
Side note (not directly related, but somewhat similar): someone asked (in T101691) for a link to upload the dumps to the Internet Archive. Would it be possible to do something similar to the XML dumps? That is, do we have a place somewhere where we could put these big dump files?
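For reference, a sketch of what an esbulk-style dump contains and how it could be replayed against a test cluster; the index/type/id, document fields and hostname below are made-up examples rather than the actual dump layout:

# One action line followed by one source line per document, newline-delimited JSON:
{"index":{"_index":"enwiki_content","_type":"page","_id":"12345"}}
{"title":"Jimmy Wales","namespace":0,"text":"...","timestamp":"2015-06-01T00:00:00Z"}

# Replaying a dump file, e.g. against a cluster with an experimental mapping already applied:
curl -s -XPOST 'http://localhost:9200/_bulk' --data-binary @enwiki_content.dump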

Woooo! :D +1 <3

Do note that for labsdb we run them on real hardware, just in the labs subnet, and we would want to do the same for this too.

> Using dumps in esbulk format is certainly not the fastest or most convenient way to replicate indices, but there's one major advantage: it allows us to experiment with new mappings.

My initial thought was either using snapshot/restore or a river.

> Side note (not directly related, but somewhat similar): someone asked (in T101691) for a link to upload the dumps to the Internet Archive. Would it be possible to do something similar to the XML dumps? That is, do we have a place somewhere where we could put these big dump files?

I don't see why not, as long as we have the space. @ArielGlenn?

> Do note that for labsdb we run them on real hardware, just in the labs subnet, and we would want to do the same for this too.

That would be ideal. In production, the total disk usage by ES is almost 8TB, and that's with 2 replicas + 1 primary. In labs we'd only really need a primary; replication would just be a waste of disk space, imho.

How much traffic can we support with just a primary? Just reads + the replication writes, I guess.

> How much traffic can we support with just a primary? Just reads + the replication writes, I guess.

Not much, but the level of traffic is many, many orders of magnitude lower than what we serve in production. The nice thing is we can always scale it later if need be.

> My initial thought was either using snapshot/restore or a river.

Rivers have been deprecated [1] (they suggest using Logstash instead); snapshots should be the most efficient way to do this task.
Another solution would be to use Logstash [2].

[1] https://www.elastic.co/blog/deprecating-rivers
[2] http://blog.sematext.com/2015/05/04/recipe-reindexing-elasticsearch-documents-with-logstash/
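For comparison, the snapshot/restore route would look roughly like this; the repository name, filesystem path and hostnames are placeholders, and it assumes a snapshot repository reachable from both clusters:

# Register a filesystem snapshot repository on the production cluster:
curl -XPUT 'http://PROD-ES:9200/_snapshot/labs_copy' -d '{"type":"fs","settings":{"location":"/srv/es-snapshots"}}'

# Snapshot the indices to be copied:
curl -XPUT 'http://PROD-ES:9200/_snapshot/labs_copy/snap1?wait_for_completion=true' -d '{"indices":"enwiki_content,enwiki_general"}'

# Restore on the labs cluster, which has the same repository registered:
curl -XPOST 'http://LABS-ES:9200/_snapshot/labs_copy/snap1/_restore'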

Logstash sounds much easier than a river, good idea.

In terms of resources, the current prod cluster is 2.5TB worth of primary shards. Elasticsearch seems quite efficient in terms of writes; the issues I've seen are with reads. I'm not sure what ratio of memory to working set will be necessary to keep queries from being stalled on disk IO. For reference, in prod we are running 1.4GB of memory for every 1GB of primary shard size. We are seeing reasonable results on a 2-node lab cluster with 32GB memory against 50GB of primary shard.
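Back-of-the-envelope, applying that prod ratio naively to the full dataset (purely illustrative; a labs replica could presumably tolerate far more disk stalls than prod):

# 1.4GB of memory per 1GB of primary shard, scaled to 2.5TB of primaries:
echo '2.5 * 1.4' | bc    # => 3.5 (TB of RAM to keep the whole working set hot)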

@EBernhardson so does this mean we can put some of the old lsearchd machines onto the labs-vnet and replicate our production cluster to them? Those boxes had 150G disks but 48G RAM

How many boxes are there? The main concern would be whether 2.5TB of data will fit.

@EBernhardson we'd need to pick and choose from wikitech.wikimedia.org/wiki/Server_Spares and then justify it :)

We'd also need to make sure that deleted / revdelled content doesn't show up.

I have actually just written the code to send CirrusSearch updates to multiple clusters in T109734. This would be a good test of that, in addition to the cluster being added in codfw. A hole would need to be opened up between the job runners and this new labs cluster of Elasticsearch machines for the update operations.

Speaking of firewalls, we also need to block off the server inside labs from general HTTP actions. By default, anyone can change/update/delete anything. We should set up a reverse proxy in front of the labs cluster that only allows GET requests through. We may need to look into rate limiting/filtering/etc. after that.

Yeah, a simple reverse proxy seems easily doable.
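Spelling out the intended behaviour of that proxy as requests (the hostname is a placeholder, and the rejections describe what we want, not anything implemented yet):

# Reads pass through to Elasticsearch:
curl -XGET 'http://LABS-ES-PROXY/enwiki_content/_search?q=title:Sandbox'

# Anything that could modify an index should be rejected at the proxy (e.g. with a 403):
curl -XDELETE 'http://LABS-ES-PROXY/enwiki_content'
curl -XPUT 'http://LABS-ES-PROXY/enwiki_content/_settings' -d '{"index":{"refresh_interval":"1s"}}'

One wrinkle to keep in mind: Elasticsearch clients often send the query DSL as a POST body, so a strictly GET-only proxy would force those queries through the URI (e.g. the q= or source= parameters) or need a narrow exception for POSTs to _search.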

> We'd also need to make sure that deleted / revdelled content doesn't show up.

Deleted content disappears from the primary index as well at deletion time so nbd.

Ok, so we have a machine approved to test for 6 weeks! Should be racked next week, I think.

Things I'd like to get out of this test:

  1. Working replication from our primary cluster of all indexes
  2. Measurements of how much traffic this can sustain on this hardware (mem/cpu/disk)
  3. An authentication / authorization / rate limiting story for access from tool labs.

@EBernhardson Can you add some of the use cases for this to the task description?

Ok, so this requires:

  1. Port 80 on nobelium be available from labs instances, for instances to be able to query this through the proxy.
  2. Port 9200 on nobelium be available from prod machines (esp. jobrunners) for them to push new content through.

If we want to keep things limited, it should be safe to only allow jobrunners + terbium. Writes are always processed in a job. I would like terbium to also have access so we can do things like update mappings from the production mediawiki configuration.

I allowed labs instances to talk to port 80 on this box as requested...

+      term nobelium-elastic {
+          from {
+              destination-address {
+                  10.64.37.14/32;
+              }
+              protocol tcp;
+              destination-port 80;
+          } 
+          then accept;
+      }

Testwiki is replicating there now! \o/

@yuvipanda: Community Tech is very interested in using this. Is there any way that we can help beta test?

@kaldari Unfortunately the CirrusSearch end of this has taken longer than anticipated, but we have finally finished up all the moving parts (we kept identifying more problems as we moved along). Next week we will be able to test sending the full stream of writes to this server and see what kind of performance characteristics we can expect.

The issue you will run into is that this server is only available for a 6-week test deployment. I'm not sure when the 6 weeks officially started; according to T112163 the machine was allocated on Mon, Sep 21. Six weeks would mean we give this server back on November 2nd.

For the most part we aren't sure (finding out next week!), but we expect that a single server of this type may not be sufficient hardware to keep up with the write load while also serving queries in a timely manner. Simply being able to point to other teams that would like access to this functionality may be a big help in any procurement that ends up being required, so just voicing your needs is perhaps the biggest thing you can do to help this progress into something available for general use.

Our current use case is basically using this as an alternative to mwgrep. We need to see what on-wiki Javascript is running where and update things that are broken and bit-rotted. Being able to do this on Tool Labs means that the community could actually do this in most cases instead of us :)
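For what it's worth, an mwgrep-style lookup against the replica could look roughly like this once it's queryable; the proxy hostname is a placeholder, and the index and field names (enwiki_general, namespace, source_text) are assumptions based on the CirrusSearch mapping rather than a copy of what mwgrep actually runs:

# Find MediaWiki-namespace (NS 8) pages whose wikitext mentions a deprecated JS function:
curl --get 'http://LABS-ES-PROXY/enwiki_general/page/_search' \
  --data-urlencode 'q=namespace:8 AND source_text:"addOnloadHook"' \
  --data-urlencode 'size=20' \
  --data-urlencode '_source=title' \
  --data-urlencode 'pretty=true'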

Just got directed here by @yuvipanda. I was hoping to be able to use Elasticsearch to index and query past answers in the Teahouse so that we can offer suggestions to newcomers. Our ultimate goal is to increase the capacity of the Teahouse for accepting more newcomers. Being able to do this in Labs would make it easier to experiment, and direct access to Elasticsearch (probably?) means we can take fuller advantage of what it has to offer.

kaldari updated the task description.

Change 255033 had a related patch set uploaded (by EBernhardson):
Enable labs ES replica for english and german

https://gerrit.wikimedia.org/r/255033

Change 255033 merged by jenkins-bot:
Enable labs ES replica for english and german

https://gerrit.wikimedia.org/r/255033

nobelium looks reasonably happy with only enwiki and dewiki turned on; disk read/write is around 20MB/s. When we did this before (with all wikis except enwiki and dewiki), jobs started timing out with nobelium doing around 30MB/s of disk activity. I'm not sure MB/s of read/write is the best metric to watch, but it seems reasonable enough.
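(For anyone watching along: one way to see that disk activity on the host, assuming sysstat is installed, is iostat; this is not necessarily how the figures above were gathered.)

iostat -xm 5    # rMB/s and wMB/s columns, refreshed every 5 seconds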

Turned on wikidatawiki and commonswiki as well. Will see what a day's worth of logs looks like and maybe turn on a few more tomorrow.

Overnight load looked quite reasonable; turned on nlwiki, frwiki and eswiki this morning.

Merges now look to be backing up, and we're getting start/stop throttling indexing messages in the logs. I've applied the following to all the indices we are currently writing to, which changes the refresh interval from the default of 1s to 1 minute. This basically means it can take up to a minute between performing a write and it becoming available for search. It should also mean we only create segments once a minute instead of once a second (if there is a write in that time period).

curl -XPUT nobelium.eqiad.wmnet:9200/commonswiki_content/_settings -d '{"index":{"refresh_interval": "1m"}}'

I've also increased the disk throughput limit from 20MB/s to 25MB/s. This will negatively impact query performance, but will help keep indexing from getting backed up.

curl -XPUT nobelium.eqiad.wmnet:9200/_cluster/settings -d '{"transient":{"indices.store.throttle.max_bytes_per_sec":"25mb"}}'
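A quick sanity check that both settings took effect (assuming the same direct access to port 9200):

curl -s 'nobelium.eqiad.wmnet:9200/commonswiki_content/_settings?pretty'
curl -s 'nobelium.eqiad.wmnet:9200/_cluster/settings?pretty'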
EBernhardson claimed this task.

This was a proof of concept, and that proof is complete. We know that a single server with spinning rust cannot handle the load of production index updates. A more complete system, partially based on this experiment, is being proposed as part of FY17-18 to expose a queryable Elasticsearch cluster with up-to-date indices in labs.

Krinkle subscribed.

> This was a proof of concept, and that proof is complete. We know that a single server with spinning rust cannot handle the load of production index updates. A more complete system, partially based on this experiment, is being proposed as part of FY17-18 to expose a queryable Elasticsearch cluster with up-to-date indices in labs.

This task is called "Replicate production elasticsearch indices to labs" and should not be resolved until there is an active Elasticsearch cluster in labs containing up-to-date indices from (most) production wikis.

I also don't see a replacement task for the FY17-18 work described above that would unblock T71489 or related tasks such as "Expose mwgrep in labs", so that volunteers may use mwgrep to perform cross-wiki searches on public wikis.

The experiment for enwiki was also reverted, as wmgCirrusSearchWriteClusters is now back to its previous default value:

'wmgCirrusSearchWriteClusters' => [
	'default' => [ 'eqiad', 'codfw' ],
],

I'm not sure there is a specific task, but I helped @bd808 spec out what hardware would be necessary, and ops helped us put together rough pricing estimates that are included in the Cloud Services budget request for FY17-18. Last I heard, the machines made it through the first round of budgeting, but I don't think we will know for sure until the final budget comes out. The experiment was indeed reverted, as the original machine from ops was a short-term loan to test out the concept.

> I'm not sure there is a specific task, but I helped @bd808 spec out what hardware would be necessary, and ops helped us put together rough pricing estimates that are included in the Cloud Services budget request for FY17-18. Last I heard, the machines made it through the first round of budgeting, but I don't think we will know for sure until the final budget comes out. The experiment was indeed reverted, as the original machine from ops was a short-term loan to test out the concept.

The capex budget for hardware to support replicas of the CirrusSearch indexes is kind of in the FY17/18 plan. We have rough quotes on the hardware that we think would be needed, but the budget allocation has been marked as something that could be requested if there is a budget underrun in other areas. Functionally this means that there is a chance we will find the money needed in Q3 or Q4 (Jan-Jul 2018) but there is no guarantee. If we don't come up with the funding and staff time I will try again in the next annual planning cycle.

Building this infrastructure with 'spares' (typically out-of-warranty or general-purpose backup hardware) is unlikely to work due to the IOPS requirements needed to keep up with the production data stream.

The servers have been purchased and racked up. Patches were going through puppet last week getting new security groups setup for accessing the cluster, installing the servers, etc. Basically, things are progressing and I'm optimistic we will have a public service ready in time for the summer hackathon.