
Load test the ltr-query plugin
Closed, Resolved · Public

Description

To get an idea of what to expect when we roll out learning to rank, we need to run a load test. This can be based on the previous load-testing work in T117714.

Variations to run:

  • (3) original speed, 150% playback speed and 200% playback speed
  • (2) 100 tree and 500 tree models
  • (2) ltr on enwiki and dewiki, and ltr on top 10 wikis by search volume
  • (2) 1024 rescore window and 4096 rescore window
  • (2) original retrieval query and simplified retrieval query using the all field

In total that's 3*2*2*2*2 = 48 tests to run, and since each test replays 40 minutes of input data, a set of three playback speeds takes roughly 40min + 30min + 20min = 90 min = 1.5 hours. This can be mostly automated, although someone should keep an eye on things to make sure we don't overload the cluster. Call it 2 hours per set of 3 speeds, and it's about 32 hours' worth of testing. Thankfully it's mostly hands-off testing.
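
As a sanity check on the arithmetic, here is a rough sketch that enumerates the matrix above; the names and values are illustrative only, not the actual test harness.

```
# Illustrative only: enumerate the load-test combinations described above
# to confirm the 3 * 2 * 2 * 2 * 2 = 48 total.
from itertools import product

speeds = ["100%", "150%", "200%"]
trees = [100, 500]
wikis = ["enwiki+dewiki", "top-10-by-search-volume"]
rescore_windows = [1024, 4096]
retrieval_queries = ["original", "simplified-all-field"]

configs = list(product(speeds, trees, wikis, rescore_windows, retrieval_queries))
print(len(configs))  # 48
for speed, n_trees, wiki_set, window, query in configs:
    print(f"{speed} playback, {n_trees} trees, {wiki_set}, window={window}, {query} retrieval")
```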

Event Timeline


Change 362091 had a related patch set uploaded (by EBernhardson; owner: EBernhardson):
[operations/puppet@production] Monitor elasticsearch stats for load test

https://gerrit.wikimedia.org/r/362091

EBernhardson added a comment. Edited Jun 28 2017, 11:16 PM

Unfortunately I'm not able to grab the stats for only the rewritten queries right now; I'll need to merge the above puppet patch to get those. I ran a couple of tests with LTR rewrites today just to make sure it will work, but I think it will be best to make sure we can collect data specifically about the rewritten queries as well. Also, now that I think about it, it might be worthwhile to rerun the baselines with the queries that would be rewritten tagged, but not actually rewritten.
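
For the "tagged but not actually rewritten" baseline, one possible approach (purely a sketch, not the actual CirrusSearch change) is Elasticsearch's per-request stats groups, which roll up separately in the index stats API. Host, index, field, and group names below are placeholders.

```
# Hypothetical sketch: tag the queries that would be rewritten with an
# Elasticsearch "stats" group so their latencies can be separated out later.
import requests

search_body = {
    "stats": ["ltr-rewrite-candidate"],                   # group name is an assumption
    "query": {"match": {"all": "example search terms"}},  # stand-in retrieval query
}

resp = requests.post(
    "http://localhost:9200/enwiki_content/_search",       # placeholder endpoint
    json=search_body,
)
print(resp.json()["took"])
```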

EBernhardson updated the task description. Edited Jun 28 2017, 11:29 PM
EBernhardson added a subscriber: dcausse.

@dcausse I've updated the ticket description with a list of load tests to run; does that seem sufficient? Is there anything more or less we should test? More features would be nice, but I don't really have any features prepared that I can build a model with to test.

Since we are looking at load in particular, we could perhaps cut the 100% speed test and only run 150% and 200%. We could also simplify to only the more expensive tests (more trees, more wikis, bigger rescore window) and, as long as those look reasonable, assume the rest would be fine as well.


Change 362091 merged by Gehel:
[operations/puppet@production] Monitor elasticsearch stats for load test

https://gerrit.wikimedia.org/r/362091

@EBernhardson I agree with you, keeping the most expensive setup sounds good to me.
Did you keep the phrase rescore as a first pass with ltr?

@dcausse No phrase rescore as a first pass, because it would be entirely overridden by the LTR rescore.
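
For reference, a minimal sketch of roughly what the rescore clause looks like, assuming the elasticsearch-learning-to-rank plugin's sltr query; the model name and feature parameters are placeholders, not the production configuration.

```
# Hedged illustration of an LTR rescore clause; names are hypothetical.
ltr_rescore = {
    "window_size": 1024,  # 4096 in the larger-window variant
    "query": {
        "rescore_query": {
            "sltr": {
                "model": "enwiki_ltr_500_trees",                     # hypothetical model name
                "params": {"query_string": "example search terms"},  # placeholder features input
            }
        },
        # With the retrieval score weighted to zero, the LTR model alone supplies
        # the final score over the window, which is why a phrase rescore first
        # pass would add cost without affecting the result.
        "query_weight": 0,
        "rescore_query_weight": 1,
    },
}
```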

Baseline load tests of existing production deployment
  • 100%
  • 150%
  • 200%

Rewrite full text for 10 wikis with 500 trees, 1024 rescore window and original retrieval query
  • 150%
  • 200%

Rewrite full text for 10 wikis with 500 trees, 4096 rescore window and simplified retrieval query
  • 150%
  • 200%

Rewrite full text for 10 wikis with 500 trees, 4096 rescore window and original retrieval query
  • 100%
  • 150%
  • 200%
  • 250%

Load tests are run. The tl;dr is basically that the new queries are expensive, as expected, but not insane. Increasing the load from the current peak request rate to 150% and 200% of the current peak had no effect on latencies. The average per-shard latency on the most expensive query was 100ms, vs 45ms in the baseline. Prefix search and morelike response times are unaffected by running the most expensive query at 200% of the current peak load. Running the most expensive query at 250% of current peak load put the cluster under heavy load with constant thread pool rejections. Unfortunately I don't have any data on whether those thread pool rejections turned into request failures, but under normal circumstances a thread pool rejecting a query should cause it to be rescheduled on another replica.
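
For anyone repeating this, a small sketch of one way to watch for those rejections during a run, using the standard _cat/thread_pool API (the host is a placeholder).

```
# Hedged sketch: poll per-node search thread pool rejection counts while a test runs.
import requests

resp = requests.get(
    "http://localhost:9200/_cat/thread_pool/search",   # placeholder host
    params={"format": "json", "h": "node_name,name,rejected,queue,active"},
)
for row in resp.json():
    if int(row["rejected"]) > 0:
        print(f"{row['node_name']}: {row['rejected']} rejected search tasks")
```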

Overall, I think it's pretty safe to continue. Things might change a bit with more features, but we seem to have plenty of headroom to both make queries more expensive and increase request rates to 2x their current levels.

I agree with the conclusions, moving to done.
We also have some room for improvement that we could explore in the future:

  • faster tree evaluation using a representation with arrays of primitives (a rough sketch follows this list)
  • explore using a custom LTR rescore context to allow bulk scoring, which could reduce the number of iterations made over the list of features
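
To make the first bullet concrete, a hedged illustration of the "arrays of primitives" idea: a regression tree flattened into parallel arrays so evaluation is a tight loop over primitive values rather than a walk over node objects. The plugin's actual internals may differ; this is only a sketch.

```
# Illustrative flattened-tree evaluation; not the plugin's implementation.
import numpy as np

# One tiny tree: node 0 splits on feature 0 at threshold 0.5; its children are leaves.
feature = np.array([0, -1, -1], dtype=np.int32)       # -1 marks a leaf node
threshold = np.array([0.5, 0.0, 0.0], dtype=np.float32)
left = np.array([1, -1, -1], dtype=np.int32)          # left child index per node
right = np.array([2, -1, -1], dtype=np.int32)         # right child index per node
value = np.array([0.0, -1.0, 1.0], dtype=np.float32)  # leaf output values

def score(features: np.ndarray) -> float:
    node = 0
    while feature[node] != -1:
        node = left[node] if features[feature[node]] <= threshold[node] else right[node]
    return float(value[node])

print(score(np.array([0.7], dtype=np.float32)))  # 1.0
```
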
debt closed this task as Resolved. Jun 30 2017, 9:20 PM
debt added a subscriber: debt.

Well done! :)