
Investigate I/O limits on elasticsearch servers
Closed, ResolvedPublic

Description

During cluster recovery (if a node fails, or during planned cluster restarts) elasticsearch moves shards around, which increases I/O. Currently recoveries are throttled at 40MB/second. Even with this relatively low additional I/O, we see a significant increase in disk utilisation (reported as high as 20%) and an increase in elasticsearch response times. The specs for our SSDs indicate that we should be able to get more IOPS than we currently see.
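The 40MB/second throttle corresponds to the indices.recovery.max_bytes_per_sec cluster setting; as a sketch (host/port assumed), it can be adjusted at runtime with:

curl -XPUT 'http://localhost:9200/_cluster/settings' -H 'Content-Type: application/json' -d '
{
  "transient": { "indices.recovery.max_bytes_per_sec": "40mb" }
}'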

As an example, elastic2006 saw increased activity around 2016-12-12 16:15 UTC.

Tuning HP Smart Path might help increase performance (thanks @Volans for the idea). Some benchmarking tools might help.
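As a sketch of what could be tried on the HP controllers (the slot number and array name are assumptions and depend on the actual layout; exact syntax depends on the hpssacli/ssacli version):

# show the current configuration, including the HP SSD Smart Path state
hpssacli ctrl slot=0 show config detail
# enable or disable Smart Path on array A before re-running the benchmarks
hpssacli ctrl slot=0 array A modify ssdsmartpath=enable
hpssacli ctrl slot=0 array A modify ssdsmartpath=disable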

Event Timeline

Gehel triaged this task as High priority.Dec 13 2016, 3:48 PM

Mentioned in SAL (#wikimedia-operations) [2016-12-15T16:22:35Z] <gehel> shutting down elasticsearch on elastic2006 for IO benchmarking - T153083

A benchmark with the current server configuration (on elastic2006) was generated with:

bonnie++ -d /var/lib/elasticsearch/bonnie/ -n 1024 -m elastic2006_baseline 2>&1 | tee bonnie.txt | ts | tee bonnie.ts.txt

Steps:

Dec 15 19:38:56 Writing a byte at a time...done
Dec 15 19:49:47 Writing intelligently...done
Dec 15 20:13:05 Rewriting...done
Dec 15 20:13:06 Reading a byte at a time...done
Dec 15 20:23:02 Reading intelligently...done
Dec 15 20:23:05 start 'em...done...done...done...done...done...
Dec 15 20:23:33 Create files in sequential order...done.
Dec 15 20:23:34 Stat files in sequential order...done.
Dec 15 20:24:14 Delete files in sequential order...done.
Dec 15 20:24:28 Create files in random order...done.
Dec 15 20:24:29 Stat files in random order...done.
Dec 15 20:25:33 Delete files in random order...done.

Results:

Version  1.97       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
elastic2006_ba 252G  1039  99 406012  42 188991  26  4853  99 443489  29 +++++ +++
Latency             14760us     859ms     299ms    5131us    6815us    3844us
Version  1.97       ------Sequential Create------ --------Random Create--------
elastic2006_baselin -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
               1024 76335  72 1058113  99 25877  47 78336  71 1136043  99 16340  41
Latency               148ms     576us     571ms     148ms     235us     515ms
1.97,1.97,elastic2006_baseline,1,1481808043,252G,,1039,99,406012,42,188991,26,4853,99,443489,29,+++++,+++,1024,,,,,76335,72,1058113,99,25877,47,78336,71,1136043,99,16340,41,14760us,859ms,299ms,5131us,6815us,3844us,148ms,576us,571ms,148ms,235us,515ms
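The last line is the machine-readable CSV summary; if the bonnie++ helper tools are installed, it can be turned back into a readable table, e.g.:

# bon_csv2txt ships with bonnie++ and reads the CSV summary from stdin
tail -n 1 bonnie.txt | bon_csv2txt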

Changing the IO scheduler to noop instead of deadline makes essentially no difference:

Steps:

Dec 15 20:50:38 Writing a byte at a time...done
Dec 15 21:01:16 Writing intelligently...done
Dec 15 21:24:36 Rewriting...done
Dec 15 21:24:37 Reading a byte at a time...done
Dec 15 21:34:39 Reading intelligently...done
Dec 15 21:34:42 start 'em...done...done...done...done...done...
Dec 15 21:35:10 Create files in sequential order...done.
Dec 15 21:35:11 Stat files in sequential order...done.
Dec 15 21:35:51 Delete files in sequential order...done.
Dec 15 21:36:06 Create files in random order...done.
Dec 15 21:36:07 Stat files in random order...done.
Dec 15 21:37:10 Delete files in random order...done.

Results:

Version  1.97       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
elastic2006_no 252G  1004  99 414154  43 188788  26  5205  99 438854  29 +++++ +++
Latency             16942us     963ms     310ms    3981us    9268us    3313us
Version  1.97       ------Sequential Create------ --------Random Create--------
elastic2006_noop    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
               1024 76682  72 1054890  99 25894  50 74667  72 1069200  99 16453  42
Latency               134ms     293us     555ms     159ms     735us     516ms
1.97,1.97,elastic2006_noop,1,1481844092,252G,,1004,99,414154,43,188788,26,5205,99,438854,29,+++++,+++,1024,,,,,76682,72,1054890,99,25894,50,74667,72,1069200,99,16453,42,16942us,963ms,310ms,3981us,9268us,3313us,134ms,293us,555ms,159ms,735us,516ms
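For reference, the scheduler was switched at runtime through sysfs; a minimal sketch (sda assumed as the device under test, the change does not persist across reboots):

# show the available schedulers, the active one is in brackets
cat /sys/block/sda/queue/scheduler
# switch to noop (revert with: echo deadline > ...)
echo noop > /sys/block/sda/queue/scheduler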

Doing a test with multiple bonnie++ instances in parallel (see script below) gives slightly higher numbers in terms of throughput (~300-400 MB/s).

Intermediate observations:

  • these tests can fairly easily generate throughput (~300-400 MB/s) and IOPS (~12-15k) close to the spec of our SSDs
  • under high write load, latency climbs to ~0.5 second, which is pretty high, but consistent with what we see in real life (so the test seems to be meaningful)
  • our main issue is not really throughput (we don't even get close to 400 MB/s in real life)
  • the impact of write load on read latency is problematic; it is expected in the test cases here, where we saturate IO, but in real life we see significant latency impacts starting at a write load of ~40 MB/s
  • it looks very much like our IO is tuned for high throughput rather than for low latency (of course, for elasticsearch, we are much more interested in low read latency than in high overall throughput)

Next steps:

  • create a test scenario where we limit IO writes to ~40 MB/s and measure the impact on read latency (ioping? a sketch follows the raw results below)
  • create a test scenario where we slowly increase write throughput, and correlate the write load with the read latency
  • find a way to tune IO for low latency rather than for high throughput

Parallel bonnie++ script referenced above:
#!/bin/bash
# Run NB_PROCESSES bonnie++ instances in parallel against the same directory,
# synchronised through a bonnie++ semaphore so they all start together.

LABEL=elastic2006_baseline
NB_PROCESSES=16
DIR_UNDER_TEST=/var/lib/elasticsearch/bonnie
SIZE=16g:4k   # 16GB of data per instance, 4k chunks
FILES=10      # file creation tests use FILES x 1024 files
RAM=1g        # claim 1GB of RAM so the per-instance data size is accepted
BONNIE_OPTS="-d ${DIR_UNDER_TEST} -s ${SIZE} -n ${FILES} -m ${LABEL} -r ${RAM}"

# create a semaphore sized for the number of worker instances
bonnie++ -p ${NB_PROCESSES} ${BONNIE_OPTS}

for i in `seq 1 ${NB_PROCESSES}`;
do
  echo "starting bonnie ${i}"
  # -y s: synchronise on the semaphore created above
  bonnie++ -y s ${BONNIE_OPTS} 2>&1 | tee bonnie-${i}.out | ts | tee bonnie-${i}.ts.out &
done

for job in `jobs -p`
do
  echo "waiting for ${job}"
  wait $job
done

# remove the semaphore
bonnie++ -p -1 ${BONNIE_OPTS}

Approximate timings:

Dec 16 15:30:12 Writing a byte at a time...done
Dec 16 15:41:35 Writing intelligently...done
Dec 16 15:57:57 Rewriting...done
Dec 16 15:58:02 Reading a byte at a time...done
Dec 16 16:01:46 Reading intelligently...done
Dec 16 16:01:51 start 'em...done...done...done...done...done...
Dec 16 16:01:54 Create files in sequential order...done.
Dec 16 16:01:54 Stat files in sequential order...done.
Dec 16 16:01:57 Delete files in sequential order...done.
Dec 16 16:01:59 Create files in random order...done.
Dec 16 16:01:59 Stat files in random order...done.
Dec 16 16:02:02 Delete files in random order...done.

And the raw results:

1.97,1.97,elastic2006_baseline,1,1481927966,16G,4k,40,44,24571,4,17093,1,3309,99,74913,3,10446,90,10,,,,,8885,33,+++++,+++,3506,31,6755,31,+++++,+++,3018,26,228ms,548ms,517ms,5090us,18952us,28681us,37557us,1118us,107ms,44806us,638us,113ms
1.97,1.97,elastic2006_baseline,1,1481927954,16G,4k,40,44,24610,4,17126,1,3237,99,74788,3,2704,96,10,,,,,7600,31,+++++,+++,3525,33,6758,30,+++++,+++,2950,25,219ms,417ms,404ms,3767us,16649us,36273us,47543us,2382us,107ms,40565us,1277us,168ms
1.97,1.97,elastic2006_baseline,1,1481927958,16G,4k,40,43,25218,4,17126,1,4563,99,74834,3,2762,98,10,,,,,6723,30,+++++,+++,3372,30,6759,29,+++++,+++,2905,25,224ms,646ms,426ms,4131us,17354us,9086us,41482us,2480us,107ms,40652us,1029us,112ms
1.97,1.97,elastic2006_baseline,1,1481927946,16G,4k,40,43,24625,4,17046,1,5353,99,74845,3,2808,97,10,,,,,8774,33,+++++,+++,3349,31,6753,30,+++++,+++,2890,24,227ms,405ms,499ms,3601us,19710us,8098us,41986us,2197us,107ms,38264us,1534us,112ms
1.97,1.97,elastic2006_baseline,1,1481927950,16G,4k,40,44,24861,4,17023,1,3245,98,74869,3,10000,78,10,,,,,6874,30,+++++,+++,3349,30,6770,29,+++++,+++,2952,26,225ms,496ms,612ms,10469us,18537us,159ms,49615us,1935us,108ms,31717us,636us,168ms
1.97,1.97,elastic2006_baseline,1,1481927938,16G,4k,40,43,24628,4,17031,1,3231,97,74957,3,2747,95,10,,,,,8078,32,+++++,+++,3466,31,6759,32,+++++,+++,2878,25,227ms,401ms,476ms,13916us,23887us,10033us,38198us,1809us,107ms,47103us,156us,112ms
1.97,1.97,elastic2006_baseline,1,1481927942,16G,4k,618,99,38781,4,17017,1,3491,99,75125,3,2640,96,10,,,,,6613,33,+++++,+++,3388,32,6771,31,+++++,+++,2931,25,14991us,265ms,574ms,3166us,18661us,4539us,41647us,1420us,107ms,33231us,1025us,169ms
1.97,1.97,elastic2006_baseline,1,1481927994,16G,4k,40,43,24629,4,17148,1,3315,97,74729,3,2830,95,10,,,,,8153,33,+++++,+++,3345,30,6769,29,+++++,+++,3018,26,217ms,381ms,372ms,19834us,14000us,8337us,40209us,311us,107ms,31775us,1667us,168ms
1.97,1.97,elastic2006_baseline,1,1481927998,16G,4k,40,43,24647,4,17044,1,3448,99,74876,3,14402,119,10,,,,,7543,32,+++++,+++,3668,32,6751,29,+++++,+++,3286,28,224ms,385ms,482ms,4180us,22677us,2598us,41122us,1789us,107ms,29035us,910us,102ms
1.97,1.97,elastic2006_baseline,1,1481927986,16G,4k,40,44,24606,4,17114,1,3261,98,74963,3,2726,95,10,,,,,10088,35,+++++,+++,3470,32,6756,38,+++++,+++,2887,25,222ms,413ms,423ms,15676us,24568us,41458us,36498us,2195us,108ms,29640us,1001us,112ms
1.97,1.97,elastic2006_baseline,1,1481927990,16G,4k,40,43,24648,4,17170,1,3456,98,74817,3,2691,95,10,,,,,7298,31,+++++,+++,3348,29,6765,46,+++++,+++,2884,25,217ms,385ms,390ms,6294us,14455us,15900us,51012us,1806us,107ms,47021us,277us,112ms
1.97,1.97,elastic2006_baseline,1,1481927978,16G,4k,40,44,24572,4,17079,1,3257,99,74868,4,2933,98,10,,,,,6868,32,+++++,+++,3466,31,6750,36,+++++,+++,3018,26,222ms,417ms,511ms,2871us,18609us,4864us,41695us,2990us,108ms,29000us,771us,112ms
1.97,1.97,elastic2006_baseline,1,1481927982,16G,4k,40,44,24685,4,17035,1,3533,99,74919,3,2762,95,10,,,,,9954,36,+++++,+++,3372,30,6750,50,+++++,+++,2905,25,231ms,393ms,414ms,4449us,20973us,37472us,36495us,2846us,107ms,29168us,756us,112ms
1.97,1.97,elastic2006_baseline,1,1481927970,16G,4k,40,43,24572,4,17041,1,3346,99,74830,3,2672,96,10,,,,,7624,31,+++++,+++,3859,34,6751,32,+++++,+++,2905,25,229ms,413ms,612ms,4720us,22795us,5555us,37636us,958us,107ms,52181us,1400us,112ms
1.97,1.97,elastic2006_baseline,1,1481927974,16G,4k,40,43,24575,4,17160,1,3356,99,74632,3,2736,96,10,,,,,6891,31,+++++,+++,3388,30,6762,29,+++++,+++,2881,25,218ms,393ms,398ms,4855us,26648us,5784us,42315us,1562us,107ms,27498us,1535us,148ms
1.97,1.97,elastic2006_baseline,1,1481927962,16G,4k,40,44,24659,4,17039,1,3206,99,74852,3,10012,84,10,,,,,7297,33,+++++,+++,3525,32,6771,29,+++++,+++,2952,26,227ms,393ms,605ms,4384us,24369us,49252us,41098us,2825us,107ms,38341us,418us,148ms
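A possible sketch for the first next step above: generate a sustained ~40MB/s write load and sample read latency at the same time (paths and the availability of pv and ioping are assumptions):

#!/bin/bash
# throttled writer: pv -L caps the pipe at ~40MB/s, dd writes with O_DIRECT
DIR=/var/lib/elasticsearch/bonnie
dd if=/dev/zero bs=1M count=20480 2>/dev/null \
  | pv -q -L 40m \
  | dd of=${DIR}/throttled.dat bs=1M iflag=fullblock oflag=direct 2>/dev/null &

# sample read latency once per second while the writer runs
ioping -c 60 -i 1 ${DIR}

wait
rm -f ${DIR}/throttled.dat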

The read latencies we see are consistent with the current configuration of the deadline scheduler:

gehel@elastic2006:~$ cat /sys/block/sda/queue/iosched/read_expire 
500
gehel@elastic2006:~$ cat /sys/block/sda/queue/iosched/write_expire 
5000

The IO scheduler tries (and seems to succeed) to keep read latency under 0.5 seconds (read_expire is expressed in milliseconds). We probably want to reduce read_expire, which will likely have an impact on throughput. Experimentation needed...
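A minimal sketch of that experiment (the 100ms value is an arbitrary starting point, not a recommendation):

# lower deadline's read_expire (in ms) to favour read latency over throughput
echo 100 > /sys/block/sda/queue/iosched/read_expire
# leave write_expire at its default of 5000ms and re-run the benchmarks / ioping
cat /sys/block/sda/queue/iosched/write_expire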

We are using software RAID in eqiad because the controllers we have there (Intel C600/X79 and Intel C610/X99) are known to be problematic. We are also using software RAID in codfw, where we have HP P440ar controllers. The installation of the codfw cluster predates me, so I'm not sure if software RAID in codfw was done just to match what we have in eqiad or if there was a technical reason. @RobH, do you know anything about that?

Having software RAID in codfw was done to keep the configuration uniform between DCs. It might make sense to experiment with switching to hardware RAID and see if the improvement is enough to justify breaking this uniformity.
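To see the current layout on a given host (controller addressing assumed), something like:

# software RAID (md) status
cat /proc/mdstat
# hardware view from the HP P440ar (codfw hosts)
hpssacli ctrl all show config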

debt moved this task from Incoming to Needs Reporting on the Discovery-Search (Current work) board.
debt subscribed.

We've fixed the overall issue, so closing this task.