
Estimate hardware requirements for relevance lab elasticsearch servers
Closed, ResolvedPublic

Description

As part of being able to improve relevancy of search results we need a test cluster where we can store multiple copies of an index with different settings, and can change cluster wide settings without fear of breaking our production cluster.

This is currently served by 4 xlarge instances in labs, estest100{1..4}.eqiad.wmflabs. The use case has also been partially served by a test run with nobelium.eqiad.wmnet, which contains all production indices on spinning disks and receives updates for the top 10 wikis. The primary issue we have with the labs instances revolves around disk space. The issue with nobelium is that it is not guaranteed hardware; it is on loan from ops for testing purposes, to help plan a labs replica of elasticsearch and/or the relevance lab. If the relevance lab hardware is slow to respond to queries, that is undesirable but acceptable.

Requirements:

  • Be able to contain multiple of our largest wikis for testing relevance on more than just enwiki
  • Be able to contain multiple copies of at least 1 of our largest indexes. This generally means a 3x storage space requirement: two copies of the index, plus room to reindex one of those copies in a different manner without deleting anything.
  • The ability to run a query set (at least 1K queries, 5K would be nice) in under an hour

Nice to have (in no particular order):

  • Enough space to contain *all* production indices
  • Enough CPU and Disk throughput to receive real time updates from the prod cluster on all indices, keeping the test up to date with live data.
  • 20+ qps against larger (200GB) indices. As we expand the relevance lab to compute results like MAP, nDCG and other scores, we will want to be able to repeat a query set many times with different weighting parameters to explore the search space and optimize how scores are calculated. 20 qps would bring us to 72k queries an hour, or 14 runs through a 5k query set. Even so, exploring a 2-dimensional search space of 0 to 1 in 0.1 increments would still take around 7 hours at this rate (see the sketch just below this list).
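
To make the throughput target concrete, here is the back-of-the-envelope arithmetic behind the "around 7 hours" figure; the qps target, query-set size, and grid granularity are just the numbers from this task, used purely for illustration:

```
# Rough runtime arithmetic for the parameter sweep described above. The qps
# target, query-set size, and grid granularity are the figures from this
# task, used here purely for illustration.

QPS = 20           # hoped-for sustained rate against a ~200GB index
QUERY_SET = 5000   # queries per run
STEPS = 10         # 0.0 to 1.0 in 0.1 increments, per dimension

grid_points = STEPS * STEPS              # 100 parameter combinations
total_queries = grid_points * QUERY_SET  # 500,000 queries
hours = total_queries / (QPS * 3600)     # ~6.9 hours

print(f"{grid_points} grid points -> {total_queries} queries, ~{hours:.1f} hours")
```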

Prior tests:

  • estest100{1..4}.eqiad.wmflabs - test cluster made of 4 xlarge virtual machines in labs.
  • nobelium.eqiad.wmnet - test cluster with a single node of physical hardware

Event Timeline

First off, some references:

Current hypothesis testing cluster:

  • 4 xlarge labs instances
  • 16G memory each
  • 140G disk each
  • 8 cpu threads each

Total hypothesis testing cluster thus has:

  • 64G memory
  • 560G disk
  • 32 hardware threads

We have also been testing with nobelium.eqiad.wmnet, which has to be returned to ops:

  • 64G memory
  • 4x 3TB disks. 5.4T user visible in software raid10
  • 16 hardware threads

Then some sizing estimations. The entire set of primary data in elasticsearch is 2.75TB. The top 10 indices by size are as follows:

index name            size
enwiki_general        400G
commonswiki_file      342G
enwiki_content        216G
commonswiki_general   103G
metawiki_general      62G
frwiki_content        62G
jawiki_content        59G
dewiki_content        56G
frwiki_general        55.5G
dewiki_general        53G

These top 10 add up to ~1.4TB, or about half the total size of the cluster. Having enough space for all of these, plus space for 2 more copies of the largest index, would give a total size of roughly 2.2T. If we wanted to keep everything, and still have space for 2 more copies of the largest index, we are talking about 3.6T.

If we limit ourselves, we could say we will only have a couple of the largest indices loaded at any given time. We could also limit the extra copy-plus-reindex space to enwiki_content and smaller, leaving some extra work involved if we want to deal with commonswiki_file or enwiki_general. We could estimate 3*216G plus 300G for other indices to work with, which would be roughly 1TB of space required.
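
Spelling out the arithmetic behind these estimates (the index sizes are the figures from the table above; the rest is just addition):

```
# Back-of-the-envelope storage arithmetic (GB) behind the estimates above.
# Index sizes are the figures from the top-10 table earlier in this task.

top10 = {
    "enwiki_general": 400, "commonswiki_file": 342, "enwiki_content": 216,
    "commonswiki_general": 103, "metawiki_general": 62, "frwiki_content": 62,
    "jawiki_content": 59, "dewiki_content": 56, "frwiki_general": 55.5,
    "dewiki_general": 53,
}
all_primaries_gb = 2750                       # total primary data in the cluster

top10_gb = sum(top10.values())                                        # ~1408 GB (~1.4TB)
top10_plus_two_copies = top10_gb + 2 * top10["enwiki_general"]        # ~2.2 TB
everything_plus_two = all_primaries_gb + 2 * top10["enwiki_general"]  # ~3.6 TB
limited = 3 * top10["enwiki_content"] + 300                           # ~950 GB

print(top10_gb, top10_plus_two_copies, everything_plus_two, limited)
```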

To meet our bare minimum requirements, I think a single machine of the same specs as nobelium would be sufficient for our needs. The primary requirement is disk space, and 4x3T disks in raid10 fit this nicely. It would be nice to be able to take the full write load, so we have constantly updated indexes that match production to use as control cases, but it is not a requirement. We know from testing that nobelium can take the write load for the top 10 wikis. We also know from testing that pointing the full write load at nobelium caused it some issues (the disks couldn't keep up with writing to so many different files).

Would a pair of machines of nobelium's spec be enough to handle the write load, or would we need 3x? I'm not really sure.

In terms of timing, nobelium currently takes 1m51s to run 10 queries, an average of around 11s per query. Running these same 10 queries through the hypothesis testing cluster takes 12s. This difference is likely due less to the hardware than to the fact that the hypothesis testing cluster is optimized to have 1 shard per node, with those shards optimized down into a single segment each; nobelium currently has 128 segments for enwiki. I've paused indexing to nobelium and am optimizing that down to a single segment to see if it makes a noticeable difference in query latency. Looks like that might take a few hours though.
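
For reference, collapsing an index down to a single segment is a single call against the Elasticsearch HTTP API. A minimal sketch, not the exact command run here: the host and index name are placeholders, and the endpoint is _optimize on the Elasticsearch versions of this era, renamed _forcemerge in later releases.

```
# Sketch of forcing an index down to one segment over the HTTP API.
# Host and index name are placeholders; use /_forcemerge on newer ES versions.
import requests

ES_HOST = "http://nobelium.eqiad.wmnet:9200"   # placeholder
INDEX = "enwiki_content"                       # placeholder

resp = requests.post(
    f"{ES_HOST}/{INDEX}/_optimize",            # /_forcemerge on newer ES
    params={"max_num_segments": 1},
    timeout=None,                              # merging a 200G+ index takes hours
)
resp.raise_for_status()
print(resp.json())
```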

Overall, while it would be nice to have, I think trying to build a test cluster at this point that takes live updates might just be too much to be useful. In my opinion we should prove out the usefulness of this with an "offline" cluster where we load indices from dumps.wikimedia.org, and once the relevance lab proves to be useful and we can make a strong case for the level of hardware necessary for live updates, we can expand in another FY.

Thoughts?

Verified: pausing indexing on nobelium and optimizing enwiki_content down to 2 segments gets a similar 11s for 10 queries. Most likely the latency of searching is related to indexing load in some form or another. After resuming indexing, with enwiki still (mostly) optimized, we are looking at 30s for a similar set of 10 queries.

I also tried with query parallelization enabled, and with a sorted set of queries so they hit similar areas of the disk. nobelium does 2k queries in ~9:20, averaging 3.6 qps. The hypothesis testing cluster did a similar set in 5:30, at 6.1 qps. Disabling live indexing and re-optimizing the enwiki index brings nobelium down to 1:34 for the same query set, around 21.2 qps. The summary here is that for the relevance lab, spinning disks seem plenty capable of meeting our needs. Additionally, with spinning disks (vs SSDs) we can get enough disk space across 2 nodes to have a replica of each shard, so we don't have to be concerned with taking the service down during hardware maintenance and such.
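
For context, these timing runs amount to replaying a query file against an index and dividing by wall-clock time. A minimal single-threaded sketch, under placeholder assumptions: the host, index, query file, and the bare match query are illustrative only, and the actual runs went through the relevance lab tooling with query parallelization enabled.

```
# Minimal sketch of the kind of timing run described above: replay a file of
# queries against one index and report wall-clock qps. Host, index, query
# file, and the bare match query are placeholders for illustration.
import time
import requests

ES_HOST = "http://nobelium.eqiad.wmnet:9200"   # placeholder
INDEX = "enwiki_content"                       # placeholder

with open("queries.txt") as f:                 # one query string per line
    queries = [line.strip() for line in f if line.strip()]

start = time.monotonic()
for q in queries:
    r = requests.get(
        f"{ES_HOST}/{INDEX}/_search",
        json={"query": {"match": {"text": q}}, "size": 20},
        timeout=120,
    )
    r.raise_for_status()
elapsed = time.monotonic() - start

print(f"{len(queries)} queries in {elapsed:.0f}s, {len(queries) / elapsed:.1f} qps")
```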

For reference nobelium is almost entirely IO bound. Re-running the same 2k queries a second time with most of the relevant data still in the page cache runs at ~41qps. Indexing is also IO bound.

If we did want to aim for working with live updated indices I think we would have to go with SSDs. But I think this would be going too far budget wise, even though we don't have a particular budget to work with yet. Going with SSDs we would probably be looking at a compromise, something like 2x 800G SSDs in a raid1 on 2 servers, giving user visible disk space of around 1.3TB or so. This could hold a multitude of smaller wikis along with a few (but not all) of the top 10. A setup capable of storing the full 3.6TB estimate would be something like 4x 1.2T in raid10 on 2 machines, giving user visible space in the 4TB range.

I'm not sure what ops pays for those disks, but in the past we've been using Intel enterprise SSDs. An Intel DC S3610 2.5" 1.2TB SATA III comes up at ~$1k on newegg, and we would be asking for 8 of them. Compare to the SAS drives in nobelium, which are 3TB each and come in at $140 on newegg (note again, these are not proper quotes, just random internet numbers for ballpark reference). I'm not sure the benefit is there if the only user of the cluster is the relevance lab. Since this cluster will be available to general labs consumers, and not just discovery, there may be some additional benefit to going the extra mile, but I'm not yet convinced.

@EBernhardson—nice write up! It all sounds reasonable.

I think that 2x nobelium (rather than 3x) with spinning disks (rather than SSDs) and without live updates (rather than with) would be excellent.

Live updates are nice for obvious reasons. Even if an example of a problem was indexed just yesterday and not available in the lab, we should be able to find other similar examples—if we can't, then it can't be too big of a problem, can it? Periodic updates—probably spaced out by months—would be fine.

And the ability to reindex enwiki (say with a reverse index) and have two instances with different indexing would be great! Actually, that may be an argument against live updates. We'd have to live update any second copy, too, or else the indexing and the content would differ between the versions.

Hmm, now that I think about it, maybe live indexing isn't a good thing anyway. I'd like to be able to run a few thousand queries (1-5K each from enwiki, dewiki, and frwiki, say) against a default config, and then be able to use them a week or two later to see how a config change affects results. Live updates would introduce additional changes.

Yeah, I think I'm against live updating. David has a tool that shows scoring details for queries, and gathers info from production. Running it a few minutes apart gives slightly different scores (several decimal places in, so minor, but visible), because the TF/IDF scores for particular words can change very slightly after any edit.

One point of the Relevance Lab is repeatability, and live updates undermine that.

Okay, but yeah, something big enough to host two copies of enwiki and also host several of the other top-10 wikis at the same time would be great. Sounds like nobelium x1 could squeak by, and nobelium x2 would be comfortably commodious.

Sounds reasonable. I will put the ask for 2x nobelium-level hardware in the strategic goals portion of the discovery budget, with a note that we could also potentially run with a single node.

I've been running some more relevance lab tests against nobelium, this time with a more representative query set. This is again with all writes from mediawiki paused, so better simulating what we expect to see in the relevance lab cluster. Specifically, the previous test was run using the first 2k words from a ~100k word dictionary I had built up for another task. I've now been running a query set of 10k real user queries. Each query is the first query of a single search session, and all queries come from different search sessions. I'm seeing runtimes of around 9k queries per hour. In theory 2x servers would roughly double our capacity, to something like 16k+ queries/hr.
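
For reference, building that query set amounts to something like the following. This is a sketch only: the input format (a TSV of session id, timestamp, query) is an assumption for illustration, not the actual log format used.

```
# Sketch of assembling the 10k query set described above: keep the first
# query from each search session, one query per session. The TSV input
# format (session_id, timestamp, query) is an assumption for illustration.
import csv

first_query = {}
with open("search_log.tsv") as f:
    for session_id, _timestamp, query in csv.reader(f, delimiter="\t"):
        # rows assumed sorted by timestamp within each session
        first_query.setdefault(session_id, query)

queries = list(first_query.values())[:10000]
with open("queries.txt", "w") as out:
    out.write("\n".join(queries) + "\n")
```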

I'm thinking maybe we should aim for a slightly higher spec than nobelium, specifically in terms of memory. nobelium has 64G of memory: 32G for elasticsearch and 32G for the OS + page cache. We probably don't actually need a full 32G of JVM heap with low qps and writes batched on a once-a-week-or-less schedule; ES just doesn't need that much internal memory to keep GC from happening in this case. I'm thinking if we were to adjust our spec to 128G of memory per server, we would have around 192G for page cache across the two servers (arithmetic sketched below the spec list), which should be able to hold the entirety of our larger indexes in memory. The slowest part would probably be streaming the index off disk and into that memory. For a use case such as repeatedly running the same set of queries with different sets of parameters, which I think will end up being fairly common, that makes sense to me. Would the following server spec make sense, roughly?

  • dual socket 4 or 6 core processor, or single socket 6 core?
  • 128G memory
  • 4x 2T or 3T drives in raid10
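
For reference, the arithmetic behind the ~192G page cache figure, across the proposed two-node cluster; the 32G heap is the current nobelium setting, treated here as an upper bound:

```
# Memory budget across the proposed 2-node cluster. The 32G heap figure is
# the current nobelium setting and likely more than we actually need.
NODES = 2
RAM_PER_NODE_GB = 128
HEAP_PER_NODE_GB = 32

page_cache_gb = NODES * (RAM_PER_NODE_GB - HEAP_PER_NODE_GB)   # 192
print(f"~{page_cache_gb}G left for the OS page cache across {NODES} nodes")
```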

CPU throughput doesn't mean much here; we are entirely bottlenecked at the IO layer, at least on nobelium. I can't push user space CPU usage over ~150% when running my sample query sets.

I don't have strong opinions on the specific specs (these machines are all studly beyond my ken), but it sounds like something a little beefier than nobelium is justified.

Closing this as resolved; the decision on hardware sizing was taken in T131184, with input from this task.