
Load test the codfw elasticsearch cluster to verify it can handle production load in a switchover
Closed, Resolved · Public


The codfw cluster is now up and running. If the eqiad cluster falls over as it did a few months ago the on call ops can perform a one line mediawiki-config change to shift search traffic over. We need to make sure that codfw can handle this.

Event Timeline

EBernhardson raised the priority of this task to Needs Triage.
EBernhardson updated the task description.
EBernhardson added a subscriber: EBernhardson.

One possibility to investigate is gor. It can sniff the port elasticsearch is listening on for HTTP requests and duplicate some percentage (or all) of them to a second cluster. I don't know anything about its performance characteristics yet.

Another possibility is to not try to reproduce exact production load, but to just throw requests at the cluster. This can be done with the runSearch.php script in CirrusSearch, along with a set of queries extracted from the CirrusSearchRequests log on fluorine. It will be a bit harder to simulate requests from 900 wikis this way, though.
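If we go the log-extraction route, pulling a uniform sample from a very large log without loading it all into memory is a natural fit for reservoir sampling. This is a generic sketch; the file name and sample size are hypothetical, not the actual CirrusSearchRequests format:

```python
import random

def reservoir_sample(lines, k, rng=None):
    """Keep a uniform random sample of k items from an iterable of
    unknown length, using O(k) memory (Vitter's algorithm R)."""
    rng = rng or random.Random()
    sample = []
    for i, line in enumerate(lines):
        if i < k:
            sample.append(line)
        else:
            # Replace an existing entry with probability k/(i+1)
            j = rng.randint(0, i)
            if j < k:
                sample[j] = line
    return sample

# Usage against a (hypothetical) log dump:
# with open('CirrusSearchRequests.log') as f:
#     queries = reservoir_sample(f, 10000)
```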

This really should be done sooner rather than later; the sooner we find out if there are any issues, the less likely we are to block Ops from doing a full switchover test in Q3. @EBernhardson thinks that we can test something like this over the holidays, so I'm pulling it into the sprint.

This did not happen due to vacations. We're still planning on working on this soon.

Not having much luck with gor; it doesn't seem to correctly sniff the entire stream. I will try something more direct with libpcap (ngrep, tcpreplay, something like that).

I dug into the gor source code and found a few problems: it has to do its own TCP stream reconstruction and there are a few bugs there. I cleaned it up enough to record a reasonably complete (~80%?) sample of queries sent to the eqiad cluster between 6pm and 7pm PST. I then ran a PHP script to trim the sample to only the connections that resulted in valid requests, and shipped them from eqiad to codfw to run the load test.

Initial testing at low levels worked well. Once I sped up playback to approximate the peak loads we see in a week, things went much worse.

At around 11k prefix/s, 5k fulltext/s, and 700 morelike/s (these are shard-level numbers; it works out to ~3k mediawiki-level requests/s) the codfw cluster started pushing requests into thread pool rejection. As of 10pm PST the entire cluster had stopped responding to both query and indexing load.

I didn't immediately notice something was wrong, so the load test kept throwing requests at the cluster for about 20 more minutes before I shut it off. Fifteen minutes after turning off the load test, elasticsearch is still clearing requests out of its input queues and pushing them into thread pool rejection.
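Thread pool rejections per node can be watched with the `_cat/thread_pool` API (the exact column names are an assumption for the elasticsearch version of this era). A small sketch that parses its plain-text output:

```python
import urllib.request

def parse_thread_pool(cat_output):
    """Parse `_cat/thread_pool?h=host,search.queue,search.rejected`
    output into {host: (queue, rejected)} so rejections can be
    watched per node."""
    stats = {}
    for line in cat_output.strip().splitlines():
        host, queue, rejected = line.split()
        stats[host] = (int(queue), int(rejected))
    return stats

# Usage against a live cluster (hypothetical endpoint):
# raw = urllib.request.urlopen(
#     'http://search.svc.codfw.wmnet:9200/_cat/thread_pool'
#     '?h=host,search.queue,search.rejected').read().decode()
# for host, (queue, rejected) in sorted(parse_thread_pool(raw).items()):
#     print(host, queue, rejected)
```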

Having a load test like this does give us an opportunity, though: we can write scripts that rewrite the requests and try out different parameters, to see which ones can be tuned to let the cluster handle this load.

Another important point: when used through CirrusSearch, the PoolCounter prevents more than 600 concurrent requests being made to the cluster. This load test included no such limit; it kept spinning up HTTP workers and throwing requests at the cluster to keep up with the timing of the original requests.
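The behaviour PoolCounter provides here, rejecting rather than queueing work past a fixed concurrency cap, can be sketched with a non-blocking semaphore. This is an illustrative model of the mechanism, not the actual PoolCounter implementation:

```python
import threading

class ConcurrencyLimit:
    """Reject (rather than queue) work beyond a fixed concurrency cap,
    roughly what PoolCounter does for CirrusSearch queries."""

    def __init__(self, limit=600):
        self._sem = threading.BoundedSemaphore(limit)

    def try_run(self, fn, *args):
        # Non-blocking acquire: if all slots are busy, reject
        # immediately instead of letting requests pile up.
        if not self._sem.acquire(blocking=False):
            return ('rejected', None)
        try:
            return ('ok', fn(*args))
        finally:
            self._sem.release()
```

With 24 load-generating machines, capping each at 25 workers yields the same 24 × 25 = 600 ceiling the production PoolCounter enforces.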

When re-running the load test I will limit the number of HTTP workers, initially to 25 per machine; with 24 machines sending requests that works out to the same 600 max concurrent requests.

If I recall correctly there is a difference in how the disks are set up in eqiad and codfw, and it wouldn't surprise me if it were impactful. I came in late to the procurement process and the reasoning here, but in eqiad the OS is on a software RAID 1 array and /var/lib/elasticsearch is on a RAID 0 array. In codfw both the OS and the data volume are on RAID 1.

root@elastic1001:~# cat /proc/mdstat
Personalities : [linear] [raid1] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10]
md1 : active raid1 sdb2[1] sda2[0]
      976320 blocks super 1.2 [2/2] [UU]

md2 : active raid0 sdb3[1] sda3[0]
      525523968 blocks super 1.2 512k chunks

md0 : active raid1 sdb1[1] sda1[0]
      29279104 blocks super 1.2 [2/2] [UU]

unused devices: <none>
root@elastic1001:~# df -Th
Filesystem     Type      Size  Used Avail Use% Mounted on
udev           devtmpfs   48G  4.0K   48G   1% /dev
tmpfs          tmpfs     9.5G  812K  9.5G   1% /run
/dev/md0       ext3       28G   18G  8.8G  67% /
none           tmpfs     4.0K     0  4.0K   0% /sys/fs/cgroup
none           tmpfs     5.0M     0  5.0M   0% /run/lock
none           tmpfs      48G     0   48G   0% /run/shm
none           tmpfs     100M     0  100M   0% /run/user
/dev/md2       ext4      494G  402G   93G  82% /var/lib/elasticsearch


rush@elastic2001>cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md0 : active raid1 sdb1[1] sda1[0]
      29280128 blocks super 1.2 [2/2] [UU]

md2 : active raid1 sda3[0] sdb3[1]
      750974528 blocks super 1.2 [2/2] [UU]

md1 : active raid1 sda2[0] sdb2[1]
      976064 blocks super 1.2 [2/2] [UU]
root@elastic2001>df -Th
Filesystem     Type      Size  Used Avail Use% Mounted on
udev           devtmpfs   63G  4.0K   63G   1% /dev
tmpfs          tmpfs      13G  1.3M   13G   1% /run
/dev/md0       ext3       28G  7.8G   19G  30% /
none           tmpfs     4.0K     0  4.0K   0% /sys/fs/cgroup
none           tmpfs     5.0M     0  5.0M   0% /run/lock
none           tmpfs      63G     0   63G   0% /run/shm
none           tmpfs     100M     0  100M   0% /run/user
/dev/md2       ext4      705G  369G  301G  56% /var/lib/elasticsearch
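The capacity difference follows directly from the RAID levels: RAID 0 stripes across both disks (usable capacity is the sum), while RAID 1 mirrors (usable capacity is one disk). A quick check against the mdstat block counts above (mdstat reports 1 KiB blocks):

```python
def blocks_to_gib(blocks, block_size=1024):
    """mdstat reports capacity in 1 KiB blocks."""
    return blocks * block_size / 2**30

# eqiad md2: RAID 0 over two disks -> both disks' capacity is usable
eqiad = blocks_to_gib(525523968)
# codfw md2: RAID 1 over two larger disks -> one disk's worth is usable
codfw = blocks_to_gib(750974528)
print(round(eqiad, 1), round(codfw, 1))  # 501.2 716.2 -- matching the
# 494G / 705G ext4 sizes once filesystem overhead is subtracted
```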

Whether this setup was ever performance tested prior to this point I don't know, and I don't have the RT task at hand.

David and I spent some time looking into things this morning, and re-running the load test at a reduced load.

codfw per server query QPS:
codfw total cluster QPS by type:

For comparison, here are the same graphs, but for last weekend's peak load on eqiad:

eqiad per server query QPS:
eqiad total cluster QPS by type:

Additionally, I've re-run the test with the same query load that knocked the cluster over yesterday, but limiting the number of concurrent HTTP requests in the load testing program to match what CirrusSearch does with the PoolCounter in production. I'm fairly confident the cluster fell over only because that limit wasn't in place; in production, while we might reject some queries, the cluster's overall health would stay good.

For reference the modified version of gor I am using is here:
The script I used to clean up the resulting gor files: P2498

Requests were recorded using clusterssh to connect to elastic10{07..31} with: sudo ./gor --input-raw --http-disallow-url /_bulk --output-file ./requests.gor

These were filtered and then sent to codfw by running this on terbium:

for i in {01..24}; do
    nc -l 9999 | pv > requests.gor
    php gorfix.php > /dev/null
    pv requests.gor.out | nc -l 9999
done

And running this on my local machine:

for i in {07..31}; do
    echo copy from elastic10${i}
    ssh elastic10${i}.eqiad.wmnet 'sudo chown ebernhardson requests.gor; nc terbium.eqiad.wmnet 9999 < requests.gor'
    j=$(expr $i - 6)
    if [ $j -lt 10 ]; then j="0$j"; fi
    sleep 20s
    echo copy to elastic20${j}
    ssh elastic20${j}.codfw.wmnet 'nc terbium.eqiad.wmnet 9999 > requests.gor'
done
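The index arithmetic in that loop maps each eqiad host to a codfw host with a fixed offset of 6; the same mapping as a small python sketch:

```python
def codfw_peer(eqiad_host):
    """Map elastic10NN.eqiad.wmnet to elastic20MM.codfw.wmnet by
    subtracting the fixed offset of 6 from the host number."""
    n = int(eqiad_host.split('.')[0][len('elastic10'):])
    return 'elastic20%02d.codfw.wmnet' % (n - 6)

print(codfw_peer('elastic1007.eqiad.wmnet'))  # elastic2001.codfw.wmnet
print(codfw_peer('elastic1030.eqiad.wmnet'))  # elastic2024.codfw.wmnet
```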

Playback was done using clusterssh to connect to elastic20{01..24} with: ./gor --input-file ./requests.gor'|1500%' --input-file-repeat 3 --output-http http://search.svc.codfw.wmnet:9200 --stats --output-http-stats --output-http-workers 30

One more test, trying to get the cluster-wide QPS up to the range we see at peak in production. I wasn't able to get it there with --output-http-workers 30; I had to push it up to 35 (×24 servers). This equates to 840 concurrent searches (vs the 600 max we use in prod). This does seem to indicate the new cluster can handle our current peak load, but it needs some settings changes in CirrusSearch to do so.

codfw per server QPS:
codfw total cluster QPS per type:

The existing graphs of average query and fetch latencies reported by elasticsearch (no percentiles; ES doesn't calculate them, see T123801) show that latencies were well within our normal bounds. Indexing latency was slower than our new eqiad nodes, but faster than the old ones; likely still acceptable. Disk read/write counts along with IOPS were slightly higher than anything we currently see in production, but with everything else OK I'm willing to say this is just fine.

Re-ran the test with output-http-workers set to 40 (960), 50 (1200), and 60 (1440). At 60 we started seeing some (~20) thread pool rejections, so I stopped the test. Peak QPS reported by the cluster was at either 40 or 50; 60 showed a marked reduction in QPS. I suspect this is related to T124215, as a couple of the servers are doing almost double the work of the others. After resolving that ticket we should re-run these load tests.

This is still "In progress" because we need to run the load test again.

Sharding changes have been merged/deployed and I've kicked off the in-place reindex on codfw. This covers basically all of our biggest indices, so it could take a few days...

Completed the initial load test of codfw, results as follows:

Cluster Stats: 24 nodes, 32 cores each. 768 cores in cluster.
Based on collected data:

Peak Server QPS: 1.1 - 1.2k
Peak Cluster QPS: 26k

Unfortunately elasticsearch doesn't report percentiles, so these are average latency numbers.

req/s | concurrency | prefix | full text | more like

The current peak average fan-out of queries is 5, so this cluster can support (assuming a balance of queries similar to the hour I recorded, which currently holds) around 5.2k user requests per second.

I will be following this up with a load test that rewrites the more like queries to be more performant, to estimate an optimization we are planning. I will also load test with the rewrite plus dropping 4 out of 5 more like queries, to estimate the caching we plan to add to the more like feature.

Completed load test with more like rewritten to use opening text. The adjustment to queries is as follows. The full script used as gor middleware is P2537:

if query['stats'][0] == 'more_like':
    query['fields'] = 'opening_text.word_count'
    query['query']['more_like_this']['fields'] = ['opening_text']
    query['highlight']['fields']['opening_text'] = query['highlight']['fields']['text']
    del query['highlight']['fields']['text']
    del query['highlight']['fields']['file_text']
    query['highlight']['fields']['opening_text']['matched_fields'] = ['opening_text', 'opening_text.plain']
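As a runnable sketch, here is the same rewrite wrapped in a function and applied to a minimal hand-built query dict. The sample values are hypothetical, just enough structure for the transformation to act on:

```python
def rewrite_more_like(query):
    """Point a more_like_this query at opening_text instead of the
    full text/file_text fields (mirrors the gor middleware in P2537)."""
    if query['stats'][0] == 'more_like':
        query['fields'] = 'opening_text.word_count'
        query['query']['more_like_this']['fields'] = ['opening_text']
        hl = query['highlight']['fields']
        hl['opening_text'] = hl.pop('text')
        del hl['file_text']
        hl['opening_text']['matched_fields'] = ['opening_text', 'opening_text.plain']
    return query

# Hypothetical minimal query, not a real CirrusSearch request body
sample = {
    'stats': ['more_like'],
    'query': {'more_like_this': {'fields': ['text'], 'like_text': '...'}},
    'highlight': {'fields': {
        'text': {'matched_fields': ['text', 'text.plain']},
        'file_text': {},
    }},
}
rewritten = rewrite_more_like(sample)
print(rewritten['query']['more_like_this']['fields'])  # ['opening_text']
```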

Based on collected data:

Peak Server QPS: 1.5k-1.7k
Peak Cluster QPS: 36k

req/s | concurrency | prefix | full text | more like

With a peak average fan-out of 5, changing the more like queries could allow the cluster to serve around 7k user requests per second.

Completed final load test with more like rewritten to use opening text and dropping 4 out of 5 more like queries to estimate the results of caching.

Based on collected data:

Peak Server QPS: 1.8-2k
Peak Cluster QPS: 39k

req/s | concurrency | prefix | full text | more like

With the current peak average fan out rate of 5, this comes out to 7.8k user requests per second.
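All three user-facing capacity figures come from the same division of peak cluster-level shard QPS by the average fan-out of 5:

```python
FAN_OUT = 5  # peak average shard-level queries per user request

# (test, peak cluster QPS) from the three load tests above
tests = [
    ('baseline', 26000),
    ('opening_text rewrite', 36000),
    ('rewrite + dropping 4 of 5 more_like', 39000),
]
for name, cluster_qps in tests:
    print('%s: ~%.1fk user req/s' % (name, cluster_qps / FAN_OUT / 1000))
# baseline: ~5.2k user req/s
# opening_text rewrite: ~7.2k user req/s
# rewrite + dropping 4 of 5 more_like: ~7.8k user req/s
```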

I've also just noticed that elastic2001 isn't reporting its measurements for some reason; likely there is a problem with how diamond handles the master node. That means all the req/s numbers are about 4% higher than shown (for all load tests, not just this one).
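The "about 4%" is just the one missing node out of 24, i.e. scaling the 23 reporting nodes up to the full cluster:

```python
NODES = 24
reporting = NODES - 1        # elastic2001 is not reporting
correction = NODES / reporting

print(round((correction - 1) * 100, 1))  # 4.3 -> totals are ~4% low
print(round(39000 * correction))         # 40696: the 39k peak, corrected
```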

Thanks for the awesome work! I do get from these numbers that the ES cluster is performing as expected, right?

Yes, these numbers are great, and I'm fully confident in sending user traffic to this cluster. Unrelated to these numbers, though: doing that may require increasing the PoolCounter size for prefix search. The added round-trip latency will cause requests to hold their lock for 40-60ms rather than 10-30ms, which will affect the maximum requests served based on how many are in flight.
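The latency/throughput relationship here is Little's law: with a fixed pool of concurrent slots, the throughput ceiling is pool size divided by how long each request holds its slot. The pool size below is illustrative, not the actual production PoolCounter configuration:

```python
def max_qps(pool_size, hold_seconds):
    """Little's law: throughput ceiling for a fixed concurrency pool."""
    return pool_size / hold_seconds

# Hypothetical prefix-search pool of 200 slots:
print(max_qps(200, 0.030))  # ~6667 req/s when locks are held 10-30ms
print(max_qps(200, 0.060))  # ~3333 req/s at 40-60ms cross-DC latency,
                            # so the pool may need to grow to keep the
                            # same ceiling
```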

Sounds like we're good here.