
Run load tests with the completion suggester in prod
Closed, Resolved · Public


We need to make sure that the completion suggester is able to handle our search traffic.
If performance is an issue with the current configuration we will be able to tune it by:

  • Increasing the number of shards (4 shards today vs 7 for prefix search)
  • Optimizing: the index is read-only, so merging it down to one segment will help
  • Tuning the various suggestion parameters (prefix length, fuzziness, ...)
  • Removing the 2nd pass query

Event Timeline

dcausse raised the priority of this task from to Needs Triage.
dcausse updated the task description.
dcausse added a subscriber: dcausse.
  • number of shards

I mostly guessed at 4 for the number of shards. My thought process, though, was that due to the small size of the indices we might be better off scaling by increasing the replica count. As a side bonus, ES allows changing the replication level on demand. I'm not really sure how to test that hypothesis, though.
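For what it's worth, changing the replication level on demand is a single settings call on the live index (sketch below; the local host/port and the replica value of 2 are assumptions, not tested values):

```shell
# Raise the replica count on the live suggester index; Elasticsearch
# allocates the extra copies in the background, no reindex needed.
curl -XPUT 'http://localhost:9200/enwiki_titlesuggest/_settings' -d '{
  "index": { "number_of_replicas": 2 }
}'
```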

  • optimize

Looks very useful, we should probably use it by default in prod for titlesuggest
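Since titlesuggest is rebuilt offline and then read-only, the optimize step would be a one-liner after each rebuild (ES 1.x `_optimize` API; host is an assumption):

```shell
# Merge the read-only suggester index down to a single segment.
curl -XPOST 'http://localhost:9200/enwiki_titlesuggest/_optimize?max_num_segments=1'
```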

How should we go about actually performing the load test?

EBernhardson added a subscriber: chasemp.

adding in @chasemp as load testing may also come up with the new codfw cluster

I ran some basic tests on enwiki_titlesuggest, using ab on terbium.
I was able to run 520 concurrent requests without flooding the pool counter:

| type | test | desc | qps | avg time (ms) | 50% (ms) | worst time (ms) |
| normal case | ab -k -n 100000 -c 520 '' | 2 exacts + 2 fuzzy, pref len=0, size 10 (internal 20), no 2nd pass | 2993.73 | 173.697 | 152 | 1467 |
| best case | ab -k -n 100000 -c 520 '' | 2 exacts, size 10 (internal 20), no 2nd pass | 4530.47 | 114.778 | 99 | 1247 |
| best case 2 | ab -k -n 100000 -c 520 '' | 2 exacts, size 5 (internal 10), no 2nd pass | 6322.10 | 82.251 | 68 | 1160 |
| worst case (*) | ab -k -n 100000 -c 100 '' | 2 exacts + 2 fuzzy, pref len=0, size 10 (internal 20), with 2nd pass | 764.05 | 130.881 | 91 | 1416 |

(*): When I started testing the worst case it began to flood the pool counter, and I was not able to re-run the other tests. I think it's due to normal search traffic. I'll try to re-run the tests at the same time tomorrow.

It looks like we serve between 2 and 5 million prefix queries per hour, which is about 80,000 queries per minute. The pool counter is set to block more than 600 concurrent queries.
If we want to be able to serve 100,000 queries per minute the completion suggester must support more than 1666 qps.
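The qps target is just the per-minute volume spread over 60 seconds:

```shell
# 100,000 queries per minute spread evenly over 60 seconds.
queries_per_minute=100000
echo "required qps: $((queries_per_minute / 60))"   # prints: required qps: 1666
```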

I'd like to run the same test for prefix search but I can't measure it because of varnish. The worst/best cases for prefix search are the exact opposite: a short query is the worst case for prefix search, while a long query that returns no results is the best case.

Running these tests significantly increased the load on the cluster. Old nodes like elastic1001 started to hit critical load levels and seemed to slow down all queries (it's clearly CPU-bound).

Test details: P1966

I could not re-run the test today with 520 concurrent requests without flooding the pool counter.
I stopped the worst case test halfway through, but overall it's very slow when running at full rate (updated P1966).

Running the normal case with -c 200 increased the global qps from 1500 qps to 4000 qps; running ab with -c 100 increased the global qps to 3000, so +1500.
Running the best case with -c 520 increased the global qps to 6500 qps (+5000). This confirms the numbers shown by ab, as I can see the same effect on fluorine with:

tail -f /a/mw-log/CirrusSearchRequests.log   | pv -l -i5 > /dev/null

Prefix search seems to average around 1500 qps (rough estimate at 8pm UTC). Unless I'm wrong, it means we are running around 100 concurrent prefix search queries (50 for enwiki and 50 for the rest of the world).
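That concurrency estimate follows from Little's law (in-flight requests ≈ throughput × average latency); the ~66 ms figure below is the average implied by the numbers above, not a measured one:

```shell
# Little's law: concurrent queries ~= qps * average latency (in seconds).
# ~1500 qps at an implied ~66 ms average gives roughly 100 in flight.
awk 'BEGIN { printf "concurrent: %.0f\n", 1500 * 0.066 }'   # prints: concurrent: 99
```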

At a glance it seems we have to optimize:

  • the worst case: remove the second pass query (at the cost of a bigger index).
  • the internal size: reduce it; we're running with internal = limit * 2 in order to display 10 pages in case every result matched both its title and a redirect.
  • the prefix length: increase it for small queries (done)
NOTE: running load tests against the completion suggester is very easy because nothing is cached (everything is already in memory; it's just a CPU-heavy process). Running load tests with prefix search is a bit more complex because we have to run random queries.
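A crude way to generate the random queries the note calls for, assuming a titles.txt sample file and the public opensearch endpoint (both hypothetical here, and ab can't easily vary its URL per request):

```shell
# Fire prefix queries built from random title samples so varnish and the
# lucene page cache can't serve repeats (titles.txt is an assumed sample).
shuf titles.txt | head -n 1000 | while read -r title; do
  curl -s "https://en.wikipedia.org/w/api.php?action=opensearch&search=${title:0:3}" > /dev/null
done
```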
Deskana triaged this task as Medium priority.Sep 3 2015, 4:39 PM
Deskana added a subscriber: Deskana.

Just a side note (long term):

I'm wondering whether it would be possible to handle and reroute all search-as-you-type traffic (50% of the total) to a set of 4 or 6 smaller boxes:

  • fast CPUs with lots of cores
  • 32G RAM (maybe 48)
  • small drives (128G should be OK)

This would help stop flooding our big cluster with small prefix search queries.

I can't do more tests as we are already running at 2000qps with spikes around 3000qps, running more queries would affect user experience.

Someone told me how to bypass varnish for opensearch.
I tried to simulate enwiki traffic by running 50 concurrent requests (ab -k -n 100000 -c 50).

| case | prefix search query | suggester query |
| worst case | search for m | search for einsteinal |
| normal case | search for max weber | search for maxweber |
| best case | search for doesnotexist | search for m |

  • qps is queries per second: higher is better
  • avg is the average query time: lower is better
  • tp90 is the 90th percentile time: lower is better
  • tp99 is the 99th percentile time: lower is better

| case | prefix qps | prefix avg | prefix tp90 | prefix tp99 | suggest qps | suggest avg | suggest tp90 | suggest tp99 |
| worst case | 200 qps | 248 ms | 300 ms | 397 ms | 715 qps | 69 ms | 103 ms | 199 ms |
| normal case | 464 qps | 107 ms | 143 ms | 197 ms | 943 qps | 53 ms | 81 ms | 113 ms |
| best case | 894 qps | 55 ms | 84 ms | 115 ms | 1013 qps | 49 ms | 75 ms | 106 ms |

The suggester wins in all rounds.

Unfortunately this test is too simple to draw any conclusions, because:

  • opensearch returns more information than the suggest API; it will return a text snippet with the page description (opening text?).
  • we always run the same query for opensearch, so lucene will likely hit the same portion of the disk, which has a good chance of being cached; the suggester is likely to run at the same speed for all queries, as nothing is on disk nor cached (except for the 2nd pass query in the worst case)
  • prefix search is running on 31 nodes while the suggester is running on 12 nodes
  • the suggester always returned results, even in the worst case scenario, while prefix search is optimized for returning no results
  • it's probable that opensearch suffers from API overhead, and there may be some db lookups done outside cirrus
  • the opensearch worst case is likely to be cached by varnish, since these are small queries (one/two letters)
  • prefix search is running on an index with constant writes while the suggester's index is read-only

We should re-run a test with sampled search queries, but even if we ran 100% of the time in the suggester's worst case it would still be better than the opensearch normal case...

I used codfw with the same data used in T117714 to test the completion suggester perf.

It's not easy to compare prefix vs suggest: they use different thread pools and different shard configurations.
But unless I missed something, the completion suggester is slower (or more demanding?) than prefix search. This tends to contradict the numbers seen with ab.

It was extremely hard to run at 70 workers with the slow morelike queries without rejections. I had to:

  • deploy the suggester index on all nodes for the large wikis: 5 replicas for en, es, ja, ru, zh, fr, de
  • increase search.queue to 1500 (a very bad idea)
  • reduce the suggest thread pool size to 10 (instead of 32: it defaults to the number of CPUs)
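Both of the last two knobs can be applied transiently through the cluster settings API (ES 1.x dynamic thread pool settings; host is an assumption, values mirror the ones used in this test):

```shell
# Apply the test settings without restarting nodes; "transient" means
# they are dropped again on a full cluster restart.
curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{
  "transient": {
    "threadpool.suggest.size": 10,
    "threadpool.search.queue_size": 1500
  }
}'
```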

The highest throughput was:

| workers | req/s | suggest qps | suggest avg time | fulltext qps | fulltext avg time | morelike qps | morelike avg time | data | ~Cirrus QPS |

QPS seen by cirrus (suggest fan-out of 4, and 5 for the others): 3442 qps, which is our peak.
With 16 slower nodes in eqiad I'm afraid we cannot run the completion suggester for all wikis alongside the current morelike queries.

Rewriting more like queries with opening_text helps (using 5 replicas for en, es, ja, ru, zh, fr, de, and a suggest thread_pool size of 10):

| workers | req/s | suggest qps | suggest avg time | fulltext qps | fulltext avg time | morelike qps | morelike avg time | data | ~Cirrus QPS |

I think that codfw can handle the completion suggester traffic; sadly this is 25% lower than prefix queries.
Now it's hard to guarantee that eqiad will handle the load with its slower nodes, even if we rewrite all more like queries.

It's clear that we need to increase the number of replicas for the busiest wikis (en, es, ja, ru, zh, fr, de).
We also need to tune the thread pools: with the default of 32 the number of concurrent queries rises to 81 (32+49 under full load). The tests here use a suggest thread pool size of 10.
I found elastic's thread pool info very interesting to monitor, and thread pool config and shard replication can be adjusted at runtime.
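The thread pool stats mentioned above can be watched with the _cat API, for example (host is an assumption):

```shell
# Per-node active/queued/rejected counts for the search and suggest pools.
curl -s 'http://localhost:9200/_cat/thread_pool?v&h=host,search.active,search.queue,search.rejected,suggest.active,suggest.queue,suggest.rejected'
```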

Concerning the production rollout, I'd suggest enabling one big wiki first and seeing how eqiad handles the load. I think we could start with frwiki.

I'll run another test with only frwiki queries rewritten and see.

I ran a test rewriting prefix queries into suggest queries with different profiles: instead of a fuzzy_prefix_length of 0, I tested with 2 and 1.
The speed improvements are significant: the average suggest time is 2ms for a prefix length of 2 and 3ms for a length of 1.

Running with prefix length 1 means that we won't be able to catch typos in the first letter; searching for oogle or foogle won't suggest Google.
I tend to think that the extra 10ms added by running with a prefix length of 0 is not worthwhile.
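For reference, the difference between these profiles boils down to one parameter in the suggest request (the field name `suggest` and the fuzziness value are assumptions about the titlesuggest mapping, not confirmed here):

```shell
# Fuzzy completion with prefix_length=1: the first letter must match
# exactly, so a query like "oogle" no longer reaches "Google".
curl -s -XPOST 'http://localhost:9200/enwiki_titlesuggest/_suggest' -d '{
  "titles": {
    "text": "oogle",
    "completion": {
      "field": "suggest",
      "fuzzy": { "fuzziness": 2, "prefix_length": 1 }
    }
  }
}'
```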

Prefix length 1:

| workers | req/s | suggest qps | suggest avg time | fulltext qps | fulltext avg time | morelike qps | morelike avg time | data | ~Cirrus QPS |

@Deskana are we ok to drop support for typos in the first character?
I'd prefer to start with prefix length 1 and see if the cluster can support prefix length 0.

I think we still need to deploy one big wiki at a time; tuning (2 thread pools) is slightly more complex now.

Results here look good, plenty of information to work off of. Calling this complete.

@Deskana are we ok to drop support for typos in the first character?
I'd prefer to start with prefix length 1 and see if the cluster can support prefix length 0.

Sure. We can always look to adjust this later if it seems problematic.

The verdict seems to be that we're good to go from a load perspective. Closing this as resolved. Yay!