
Replicate production elasticsearch indices to labs
Closed, Resolved · Public

Description

Think of all the awesome tools people could write with search data :)

Major things to figure out:

  1. replication strategy (including skipping private wikis)
  2. making sure we've got the resources in labs to take a replica of the prod indices

Use cases:

  • Provides the community with access to query arbitrary data out of Elasticsearch. The MediaWiki search API pales in comparison to what can be done with the ES API directly.
  • The Elasticsearch query and document format is, for many tasks, much easier to use. In the MySQL labs databases, getting a page and all of its information requires a complicated join; in ES it is a single query[1], and by default the returned document contains everything we know about the page[2] (see the example requests after the footnotes).
  • Allows tools to be built by the community to take advantage of all this data in Elasticsearch.
  • Gives the Discovery department access to constantly updated indices in labs for analysis and research into potential changes.
  • Replaces the use of mwgrep (which can only be used by people with shell access).
  • Allows people to search across multiple projects for deprecated JS code.

[1] http://elasticsearch/enwiki_content/page/_search?q=title:Jimmy_Wales
[2] https://en.wikipedia.org/wiki/Jimmy_Wales?action=cirrusdump
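To make [1] and [2] concrete, these are roughly the two requests involved (the Elasticsearch hostname is a placeholder for wherever the labs replica ends up being reachable; only the en.wikipedia.org URL is real):

curl 'http://LABS-ES-HOST:9200/enwiki_content/page/_search?q=title:Jimmy_Wales&pretty'
curl 'https://en.wikipedia.org/wiki/Jimmy_Wales?action=cirrusdump'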

Related Objects

Event Timeline


Using dumps in esbulk format is certainly not the fastest or most convenient way to replicate indices, but there's one major advantage: it allows us to experiment with new mappings.
Side note (not directly related, but somewhat similar): someone asked (in T101691) for a link to upload the dumps to the Internet Archive. Would it be possible to do something similar to the XML dumps? That is, do we have a place somewhere where we could put these big dump files?
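For reference, a sketch of what an esbulk-style dump contains and how it could be replayed against a test cluster; the index/type/id, document fields and hostname below are made-up examples rather than the actual dump layout:

# One action line followed by one source line per document, newline-delimited JSON:
{"index":{"_index":"enwiki_content","_type":"page","_id":"12345"}}
{"title":"Jimmy Wales","namespace":0,"text":"...","timestamp":"2015-06-01T00:00:00Z"}

# Replaying a dump file, e.g. against a cluster with an experimental mapping already applied:
curl -s -XPOST 'http://localhost:9200/_bulk' --data-binary @enwiki_content.dump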

Woooo! :D +1 <3

Do note that for labsdb we run them on real hardware, just in the labs subnet, and we would want to do the same for this too.

> Using dumps in esbulk format is certainly not the fastest or most convenient way to replicate indices, but there's one major advantage: it allows us to experiment with new mappings.

My initial thought was either using snapshot/restore or a river.

> Side note (not directly related, but somewhat similar): someone asked (in T101691) for a link to upload the dumps to the Internet Archive. Would it be possible to do something similar to the XML dumps? That is, do we have a place somewhere where we could put these big dump files?

I don't see why not, as long as we have the space. @ArielGlenn?

> Do note that for labsdb we run them on real hardware, just in the labs subnet, and we would want to do the same for this too.

That would be ideal. In production, the total disk usage by ES is almost 8TB, and that's with 2 replicas + 1 primary. In labs we'd only really need a primary; replication would just be a waste of disk space, imho.

How much traffic can we support with just a primary? Just reads + the replication writes, I guess.

> How much traffic can we support with just a primary? Just reads + the replication writes, I guess.

Not much, but the level of traffic is many, many orders of magnitude lower than what we serve in production. The nice thing is we can always scale it later if need be.

> My initial thought was either using snapshot/restore or a river.

Rivers have been deprecated [1] (they suggest using Logstash instead); snapshots should be the most efficient way to do this task.
Another solution would be to use Logstash [2].

[1] https://www.elastic.co/blog/deprecating-rivers
[2] http://blog.sematext.com/2015/05/04/recipe-reindexing-elasticsearch-documents-with-logstash/
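For comparison, the snapshot/restore route would look roughly like this; the repository name, filesystem path and hostnames are placeholders, and it assumes a snapshot repository reachable from both clusters:

# Register a filesystem snapshot repository on the production cluster:
curl -XPUT 'http://PROD-ES:9200/_snapshot/labs_copy' -d '{"type":"fs","settings":{"location":"/srv/es-snapshots"}}'

# Snapshot the indices to be copied:
curl -XPUT 'http://PROD-ES:9200/_snapshot/labs_copy/snap1?wait_for_completion=true' -d '{"indices":"enwiki_content,enwiki_general"}'

# Restore on the labs cluster, which has the same repository registered:
curl -XPOST 'http://LABS-ES:9200/_snapshot/labs_copy/snap1/_restore'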

Logstash sounds much easier than a river, good idea.

In terms of resources, the current prod cluster is 2.5TB worth of primary shards. Elasticsearch seems quite efficient in terms of writes; the issues I've seen are with reads. I'm not sure what ratio of memory to working set will be necessary to keep queries from being stalled on disk IO. For reference, in prod we are running 1.4GB of memory for every 1GB of primary shard size. We are seeing reasonable results on a 2-node lab cluster with 32GB memory against 50GB of primary shard.
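Back-of-the-envelope, applying that prod ratio naively to the full dataset (purely illustrative; a labs replica could presumably tolerate far more disk stalls than prod):

# 1.4GB of memory per 1GB of primary shard, scaled to 2.5TB of primaries:
echo '2.5 * 1.4' | bc    # => 3.5 (TB of RAM to keep the whole working set hot)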

@EBernhardson so does this mean we can put some of the old lsearchd machines onto the labs-vnet and replicate our production cluster to them? Those boxes had 150G disks but 48G RAM

How many boxes are there? The main concern would be whether 2.5TB of data will fit.

@EBernhardson we'd need to pick and choose from wikitech.wikimedia.org/wiki/Server_Spares and then justify it :)

We'd also need to make sure that deleted / revdelled content doesn't show up.

I have actually just written the code to send CirrusSearch updates to multiple clusters in T109734. This would be a good test of that, in addition to the cluster being added in codfw. A hole would need to be opened up between the job runners and this new labs cluster of Elasticsearch machines for the update operations.

Speaking of firewalls, we also need to block off the server inside labs from general HTTP actions. By default, anyone can change/update/delete anything. We should set up a reverse proxy in front of the labs cluster that only allows GET requests through. We may need to look into rate limiting/filtering/etc. after that.

Yeah, a simple reverse proxy seems easily doable.
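Spelling out the intended behaviour of that proxy as requests (the hostname is a placeholder, and the rejections describe what we want, not anything implemented yet):

# Reads pass through to Elasticsearch:
curl -XGET 'http://LABS-ES-PROXY/enwiki_content/_search?q=title:Sandbox'

# Anything that could modify an index should be rejected at the proxy (e.g. with a 403):
curl -XDELETE 'http://LABS-ES-PROXY/enwiki_content'
curl -XPUT 'http://LABS-ES-PROXY/enwiki_content/_settings' -d '{"index":{"refresh_interval":"1s"}}'

One wrinkle to keep in mind: Elasticsearch clients often send the query DSL as a POST body, so a strictly GET-only proxy would force those queries through the URI (e.g. the q= or source= parameters) or need a narrow exception for POSTs to _search.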

> We'd also need to make sure that deleted / revdelled content doesn't show up.

Deleted content disappears from the primary index as well at deletion time so nbd.

Ok, so we have a machine approved to test for 6 weeks! Should be racked next week, I think.

Things I'd like to get out of this test:

  1. Working replication from our primary cluster of all indexes
  2. Measurements of how much traffic this can sustain on this hardware (mem/cpu/disk)
  3. An authentication / authorization / rate limiting story for access from tool labs.

@EBernhardson Can you add some of the use cases for this to the task description?

Ok, so this requires:

  1. Port 80 on nobelium be available from labs instances, for instances to be able to query this through the proxy.
  2. Port 9200 on nobelium be available from prod machines (esp. jobrunners) for them to push new content through.

If we want to keep things limited, it should be safe to only allow jobrunners + terbium. Writes are always processed in a job. I would like terbium to also have access so we can do things like update mappings from the production mediawiki configuration.

I allowed labs instances to talk to port 80 on this box as requested...

+      term nobelium-elastic {
+          from {
+              destination-address {
+                  10.64.37.14/32;
+              }
+              protocol tcp;
+              destination-port 80;
+          } 
+          then accept;
+      }

Testwiki is replicating there now! \o/

@yuvipanda: Community Tech is very interested in using this. Is there any way that we can help beta test?

@kaldari Unfortunately the CirrusSearch end of this has taken longer than anticipated, but we have finally finished up all the moving parts (we kept identifying more problems as we moved along). Next week we will be able to test sending the full stream of writes to this server and see what kind of performance characteristics we can expect.

The issue you will run into is that this server is only available for a 6-week test deployment. I'm not sure when the 6 weeks officially started; according to T112163 the machine was allocated on Mon, Sep 21. Six weeks would mean we give this server back on November 2nd.

For the most part we aren't sure (finding out next week!), but we expect that a single server of this type may not be sufficient hardware to keep up with the write load while also serving queries in a timely manner. Simply being able to point to other teams that would like access to this functionality may be a big help in any procurement that ends up being required, so just voicing your needs is perhaps the biggest thing you can do to help this progress into something available for general use.

Our current use case is basically using this as an alternative to mwgrep. We need to see what on-wiki Javascript is running where and update things that are broken and bit-rotted. Being able to do this on Tool Labs means that the community could actually do this in most cases instead of us :)
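For what it's worth, an mwgrep-style lookup against the replica could look roughly like this once it's queryable; the proxy hostname is a placeholder, and the index and field names (enwiki_general, namespace, source_text) are assumptions based on the CirrusSearch mapping rather than a copy of what mwgrep actually runs:

# Find MediaWiki-namespace (NS 8) pages whose wikitext mentions a deprecated JS function:
curl --get 'http://LABS-ES-PROXY/enwiki_general/page/_search' \
  --data-urlencode 'q=namespace:8 AND source_text:"addOnloadHook"' \
  --data-urlencode 'size=20' \
  --data-urlencode '_source=title' \
  --data-urlencode 'pretty=true'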

Just got directed here by @yuvipanda. I was hoping to be able to use Elasticsearch to index and query past answers in the Teahouse so that we can offer suggestions to newcomers. Our ultimate goal is to increase the capacity of the Teahouse for accepting more newcomers. Being able to do this in Labs would make it easier to experiment, and direct access to Elasticsearch (probably?) means we can take fuller advantage of what it has to offer.

kaldari updated the task description.

Change 255033 had a related patch set uploaded (by EBernhardson):
Enable labs ES replica for english and german

https://gerrit.wikimedia.org/r/255033

Change 255033 merged by jenkins-bot:
Enable labs ES replica for english and german

https://gerrit.wikimedia.org/r/255033

nobelium looks reasonably happy with only enwiki and dewiki turned on; disk read/write is around 20MB/s. When we did this before (with all wikis except enwiki and dewiki), jobs started timing out with nobelium doing around 30MB/s of disk activity. I'm not sure MB/s of read/write is the best metric to watch, but it seems reasonable enough.
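(For anyone watching along: one way to see that disk activity on the host, assuming sysstat is installed, is iostat; this is not necessarily how the figures above were gathered.)

iostat -xm 5    # rMB/s and wMB/s columns, refreshed every 5 seconds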

Turned on wikidatawiki and commonswiki as well. Will see what a day's worth of logs looks like and maybe turn on a few more tomorrow.

Overnight load looked quite reasonable; turned on nlwiki, frwiki and eswiki this morning.

Merges now look to be backing up, and we're getting start/stop throttling indexing messages in the logs. I've applied the following to all the indices we are currently writing to, which changes the refresh interval from the default of 1s to 1 minute. This basically means it can take up to a minute between performing a write and it becoming available for search. It should also mean we only create segments once a minute instead of once a second (if there is a write in that time period).

curl -XPUT nobelium.eqiad.wmnet:9200/commonswiki_content/_settings -d '{"index":{"refresh_interval": "1m"}}'

I've also increased the disk throughput limit from 20MB/s to 25MB/s. This will negatively impact query performance, but will help keep indexing from getting backed up.

curl -XPUT nobelium.eqiad.wmnet:9200/_cluster/settings -d '{"transient":{"indices.store.throttle.max_bytes_per_sec":"25mb"}}'
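A quick sanity check that both settings took effect (assuming the same direct access to port 9200):

curl -s 'nobelium.eqiad.wmnet:9200/commonswiki_content/_settings?pretty'
curl -s 'nobelium.eqiad.wmnet:9200/_cluster/settings?pretty'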
EBernhardson claimed this task.

This was a proof of concept, and that proof is complete. We know that a single server with spinning rust cannot handle the load of production index updates. A more complete system, partially based on this experiment, is being proposed as part of FY17-18 to expose a queryable Elasticsearch cluster with up-to-date indices in labs.

Krinkle subscribed.

> This was a proof of concept, and that proof is complete. We know that a single server with spinning rust cannot handle the load of production index updates. A more complete system, partially based on this experiment, is being proposed as part of FY17-18 to expose a queryable Elasticsearch cluster with up-to-date indices in labs.

This task is called "Replicate production elasticsearch indices to labs" and should not be resolved until there is an active Elasticsearch cluster in labs containing up-to-date indices from (most) production wikis.

I also don't see a replacement task for the FY17-18 work described above that would unblock T71489 or related tasks such as "Expose mwgrep in labs", so that volunteers may use mwgrep to perform cross-wiki searches on public wikis.

The experiment for enwiki was also reverted, as wmgCirrusSearchWriteClusters is now back to its previous default value:

'wmgCirrusSearchWriteClusters' => [
	'default' => [ 'eqiad', 'codfw' ],
],

I'm not sure there is a specific task, but I helped @bd808 spec out what hardware would be necessary, and ops helped us put together rough pricing estimates that are included in the Cloud Services budget request for FY17-18. Last I heard, the machines made it through the first round of budgeting, but I don't think we will know for sure until the final budget comes out. The experiment was indeed reverted, as the original machine from ops was a short-term loan to test out the concept.

> I'm not sure there is a specific task, but I helped @bd808 spec out what hardware would be necessary, and ops helped us put together rough pricing estimates that are included in the Cloud Services budget request for FY17-18. Last I heard, the machines made it through the first round of budgeting, but I don't think we will know for sure until the final budget comes out. The experiment was indeed reverted, as the original machine from ops was a short-term loan to test out the concept.

The capex budget for hardware to support replicas of the CirrusSearch indexes is kind of in the FY17/18 plan. We have rough quotes on the hardware that we think would be needed, but the budget allocation has been marked as something that could be requested if there is a budget underrun in other areas. Functionally this means that there is a chance we will find the money needed in Q3 or Q4 (Jan-Jul 2018) but there is no guarantee. If we don't come up with the funding and staff time I will try again in the next annual planning cycle.

Building this infrastructure with 'spares' (typically out-of-warranty or general-purpose backup hardware) is unlikely to work due to the IOPS requirements needed to keep up with the production data stream.

The servers have been purchased and racked up. Patches were going through puppet last week getting new security groups setup for accessing the cluster, installing the servers, etc. Basically, things are progressing and I'm optimistic we will have a public service ready in time for the summer hackathon.