
Run load tests with the completion suggester in prod
Closed, Resolved · Public


We need to make sure that the completion suggester is able to handle our search traffic.
If performance is an issue with the current configuration we will be able to tune it by:

  • Increasing the number of shards (4 shards today vs 7 for prefix search)
  • Optimizing: the index is read-only, so merging it down to one segment will help
  • Tuning the various suggestion parameters (prefix length, fuzziness, ...)
  • Removing the 2nd pass query

Event Timeline

dcausse raised the priority of this task from to Needs Triage.
dcausse updated the task description.
dcausse added a subscriber: dcausse.
  • number of shards

I mostly guessed at 4 for the number of shards. My thought process, though, was that due to the small size of the indices we might be better off scaling by increasing the replica count. As a side bonus, ES allows changing the replication level on demand. I'm not really sure how to test that hypothesis, though.
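For what it's worth, changing the replication level on demand is a single settings call on the live index (sketch below; the local host/port and the replica value of 2 are assumptions, not tested values):

```shell
# Raise the replica count on the live suggester index; Elasticsearch
# allocates the extra copies in the background, no reindex needed.
curl -XPUT 'http://localhost:9200/enwiki_titlesuggest/_settings' -d '{
  "index": { "number_of_replicas": 2 }
}'
```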

  • optimize

Looks very useful, we should probably use it by default in prod for titlesuggest
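Since titlesuggest is rebuilt offline and then read-only, the optimize step would be a one-liner after each rebuild (ES 1.x `_optimize` API; host is an assumption):

```shell
# Merge the read-only suggester index down to a single segment.
curl -XPOST 'http://localhost:9200/enwiki_titlesuggest/_optimize?max_num_segments=1'
```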

How should we go about actually performing the load test?

EBernhardson added a subscriber: chasemp.

adding in @chasemp as load testing may also come up with the new codfw cluster

I ran some basic tests on enwiki_titlesuggest, using ab on terbium.
I was able to run 520 concurrent requests without flooding the pool counter:

| type | test | desc | qps | avg time (ms) | 50% (ms) | worst time (ms) |
| normal case | ab -k -n 100000 -c 520 '' | 2 exacts + 2 fuzzy, pref len=0, size 10 (internal 20), no 2nd pass | 2993.73 | 173.697 | 152 | 1467 |
| best case | ab -k -n 100000 -c 520 '' | 2 exacts, size 10 (internal 20), no 2nd pass | 4530.47 | 114.778 | 99 | 1247 |
| best case 2 | ab -k -n 100000 -c 520 '' | 2 exacts, size 5 (internal 10), no 2nd pass | 6322.10 | 82.251 | 68 | 1160 |
| worst case (*) | ab -k -n 100000 -c 100 '' | 2 exacts + 2 fuzzy, pref len=0, size 10 (internal 20), with 2nd pass | 764.05 | 130.881 | 91 | 1416 |

(*): When I started testing the worst case it began to flood the pool counter, and I was not able to re-run the other tests. I think it's due to normal search traffic. I'll try to re-run the tests at the same time tomorrow.

It looks like we serve between 2 and 5 million prefix queries per hour, which is about 80,000 queries per minute. The pool counter is set to block more than 600 concurrent queries.
If we want to be able to serve 100,000 queries per minute the completion suggester must support more than 1666 qps.
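The qps target is just the per-minute volume spread over 60 seconds:

```shell
# 100,000 queries per minute spread evenly over 60 seconds.
queries_per_minute=100000
echo "required qps: $((queries_per_minute / 60))"   # prints: required qps: 1666
```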

I'd like to run the same test for prefix search but I can't measure it because of varnish. The worst/best cases for prefix search are the exact opposite: a short query is the worst case for prefix search, while a long query that returns no results is the best case.

Running these tests significantly increased the load on the cluster. Old nodes like elastic1001 started to hit critical load levels and seemed to slow down all queries (it's clearly CPU-bound).

Test details: P1966

I could not re-run the test today with 520 concurrent requests without flooding the pool counter.
I stopped the worst case test halfway through, but overall it's very slow when running at full rate (updated P1966).

Running the normal case with -c 200 increased the global qps from 1500 qps to 4000 qps; running ab with -c 100 increased the global qps to 3000, so +1500.
Running the best case with -c 520 increased the global qps to 6500 qps (+5000). This confirms the numbers shown by ab, as I can see the same effect on fluorine with:

tail -f /a/mw-log/CirrusSearchRequests.log   | pv -l -i5 > /dev/null

Prefix search seems to average around 1500 qps (rough estimate at 8pm UTC). Unless I'm wrong, it means we are running around 100 concurrent prefix search queries (50 for enwiki and 50 for the rest of the world).
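That concurrency estimate follows from Little's law (in-flight requests ≈ throughput × average latency); the ~66 ms figure below is the average implied by the numbers above, not a measured one:

```shell
# Little's law: concurrent queries ~= qps * average latency (in seconds).
# ~1500 qps at an implied ~66 ms average gives roughly 100 in flight.
awk 'BEGIN { printf "concurrent: %.0f\n", 1500 * 0.066 }'   # prints: concurrent: 99
```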

At a glance it seems we have to optimize:

  • the worst case: remove the second pass query (at the cost of a bigger index).
  • the internal size: reduce it; we're running with internal = limit * 2 in order to display 10 pages in case every result matched both its title and a redirect.
  • the prefix length: increase it for small queries (done)
NOTE: running load tests against the completion suggester is very easy because nothing is cached (everything is already in memory; it's just a CPU-heavy process). Running load tests with prefix search is a bit more complex because we have to run random queries.
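A crude way to generate the random queries the note calls for, assuming a titles.txt sample file and the public opensearch endpoint (both hypothetical here, and ab can't easily vary its URL per request):

```shell
# Fire prefix queries built from random title samples so varnish and the
# lucene page cache can't serve repeats (titles.txt is an assumed sample).
shuf titles.txt | head -n 1000 | while read -r title; do
  curl -s "https://en.wikipedia.org/w/api.php?action=opensearch&search=${title:0:3}" > /dev/null
done
```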
Deskana triaged this task as Medium priority.Sep 3 2015, 4:39 PM
Deskana added a subscriber: Deskana.

Just a side note (long term):

I'm wondering whether it would be possible to handle and reroute all search-as-you-type traffic (50% of the total) to a set of 4 or 6 smaller boxes:

  • fast CPUs with lots of cores
  • 32G RAM (maybe 48)
  • small drives (128G should be OK)

This would help stop flooding our big cluster with small prefix search queries.

I can't do more tests as we are already running at 2000qps with spikes around 3000qps, running more queries would affect user experience.

Someone told me how to bypass varnish for opensearch.
I tried to simulate enwiki traffic by running 50 concurrent requests (ab -k -n 100000 -c 50).

| case | prefix search query | suggester query |
| worst case | search for m | search for einsteinal |
| normal case | search for max weber | search for maxweber |
| best case | search for doesnotexist | search for m |

  • qps is queries per second: higher is better
  • avg is the average query time: lower is better
  • tp90 is the 90th percentile time: lower is better
  • tp99 is the 99th percentile time: lower is better

| case | prefix qps | prefix avg | prefix tp90 | prefix tp99 | suggest qps | suggest avg | suggest tp90 | suggest tp99 |
| worst case | 200 qps | 248 ms | 300 ms | 397 ms | 715 qps | 69 ms | 103 ms | 199 ms |
| normal case | 464 qps | 107 ms | 143 ms | 197 ms | 943 qps | 53 ms | 81 ms | 113 ms |
| best case | 894 qps | 55 ms | 84 ms | 115 ms | 1013 qps | 49 ms | 75 ms | 106 ms |

The suggester wins in all rounds.

Unfortunately this test is too simple to draw any conclusions, because:

  • opensearch returns more information than the suggest API; it will return a text snippet with the page description (opening text?).
  • we always run the same query for opensearch, so lucene will likely hit the same portion of the disk, which has a good chance of being cached; the suggester is likely to run at the same speed for all queries, as nothing is on disk nor cached (except for the 2nd pass query in the worst case)
  • prefix search is running on 31 nodes while the suggester is running on 12 nodes
  • the suggester always returned results, even in the worst case scenario, while prefix search is optimized for returning no results
  • it's probable that opensearch suffers from API overhead, and there may be some db lookups done outside cirrus
  • the opensearch worst case is likely to be cached by varnish, since these are small queries (one/two letters)
  • prefix search is running on an index with constant writes while the suggester's index is read-only

We should re-run a test with sampled search queries, but even if we ran 100% of the time in the suggester's worst case it would still be better than the opensearch normal case...

I used codfw with the same data used in T117714 to test the completion suggester perf.

It's not easy to compare prefix vs suggest: they use different thread pools and different shard configurations.
But unless I missed something, the completion suggester is slower (or more demanding?) than prefix search. This tends to contradict the numbers seen with ab.

It was extremely hard to run at 70 workers with the slow morelike queries without rejections. I had to:

  • deploy the suggester index on all nodes for the large wikis: 5 replicas for en, es, ja, ru, zh, fr, de
  • increase search.queue to 1500 (a very bad idea)
  • reduce the suggest thread pool size to 10 (instead of 32: it defaults to the number of CPUs)
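Both of the last two knobs can be applied transiently through the cluster settings API (ES 1.x dynamic thread pool settings; host is an assumption, values mirror the ones used in this test):

```shell
# Apply the test settings without restarting nodes; "transient" means
# they are dropped again on a full cluster restart.
curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{
  "transient": {
    "threadpool.suggest.size": 10,
    "threadpool.search.queue_size": 1500
  }
}'
```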

The highest throughput was:

| workers | req/s | suggest qps | suggest avg time | fulltext qps | fulltext avg time | morelike qps | morelike avg time | data | ~Cirrus QPS |

QPS seen by cirrus (suggest fan-out of 4, and 5 for the others): 3442 qps, which is our peak.
With 16 slower nodes in eqiad I'm afraid we cannot run the completion suggester for all wikis alongside the current morelike queries.

Rewriting more like queries with opening_text helps (using 5 replicas for en, es, ja, ru, zh, fr, de, and a suggest thread_pool size of 10):

| workers | req/s | suggest qps | suggest avg time | fulltext qps | fulltext avg time | morelike qps | morelike avg time | data | ~Cirrus QPS |

I think that codfw can handle the completion suggester traffic; sadly this is 25% lower than prefix queries.
Now it's hard to guarantee that eqiad will handle the load with its slower nodes, even if we rewrite all more like queries.

It's clear that we need to increase the number of replicas for the busiest wikis (en, es, ja, ru, zh, fr, de).
We also need to tune the thread pools: with the default of 32 the number of concurrent queries rises to 81 (32+49 under full load). The tests here use a suggest thread pool size of 10.
I found elastic's thread pool info very interesting to monitor, and thread pool config and shard replication can be adjusted at runtime.
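The thread pool stats mentioned above can be watched with the _cat API, for example (host is an assumption):

```shell
# Per-node active/queued/rejected counts for the search and suggest pools.
curl -s 'http://localhost:9200/_cat/thread_pool?v&h=host,search.active,search.queue,search.rejected,suggest.active,suggest.queue,suggest.rejected'
```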

Concerning the production rollout, I'd suggest enabling one big wiki first and seeing how eqiad handles the load. I think we could start with frwiki.

I'll run another test with only frwiki queries rewritten and see.

I ran a test rewriting prefix queries into suggest queries with different profiles: instead of a fuzzy_prefix_length of 0, I tested with 2 and 1.
The speed improvements are significant: the average suggest time is 2ms for a prefix length of 2 and 3ms for a length of 1.

Running with prefix length 1 means that we won't be able to catch typos in the first letter; searching for oogle or foogle won't suggest Google.
I tend to think that the extra 10ms added by running with a prefix length of 0 is not worthwhile.
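For reference, the difference between these profiles boils down to one parameter in the suggest request (the field name `suggest` and the fuzziness value are assumptions about the titlesuggest mapping, not confirmed here):

```shell
# Fuzzy completion with prefix_length=1: the first letter must match
# exactly, so a query like "oogle" no longer reaches "Google".
curl -s -XPOST 'http://localhost:9200/enwiki_titlesuggest/_suggest' -d '{
  "titles": {
    "text": "oogle",
    "completion": {
      "field": "suggest",
      "fuzzy": { "fuzziness": 2, "prefix_length": 1 }
    }
  }
}'
```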

Prefix length 1:

| workers | req/s | suggest qps | suggest avg time | fulltext qps | fulltext avg time | morelike qps | morelike avg time | data | ~Cirrus QPS |

@Deskana are we ok to drop support for typos in the first character?
I'd prefer to start with prefix length 1 and see if the cluster can support prefix length 0.

I think we still need to deploy one big wiki at a time; tuning (2 thread pools) is slightly more complex now.

Results here look good, plenty of information to work off of. Calling this complete.

@Deskana are we ok to drop support for typos in the first character?
I'd prefer to start with prefix length 1 and see if the cluster can support prefix length 0.

Sure. We can always look to adjust this later if it seems problematic.

The verdict seems to be that we're good to go from a load perspective. Closing this as resolved. Yay!