
Load test the ltr-query plugin
Closed, Resolved · Public

Description

To get an idea of what to expect when we roll out learning to rank, we need to run a load test. This can be based on the previous load-testing work in T117714.

Variations to run:

  • (3) original speed, 150% playback speed and 200% playback speed
  • (2) 100 tree and 500 tree models
  • (2) ltr on enwiki and dewiki, and ltr on top 10 wikis by search volume
  • (2) 1024 rescore window and 4096 rescore window
  • (2) original retrieval query and simplified retrieval query using the all field

In total that's 3*2*2*2*2 = 48 tests to run, and since each test replays 40 minutes of input data, a set of three playback speeds takes roughly 40min + 30min + 20min = 90 min = 1.5 hours. This can be mostly automated, although someone should keep an eye on things to make sure we don't overload the cluster. Call it 2 hours per set of 3 speeds, and it's about 32 hours' worth of testing. Thankfully it's mostly hands-off testing.
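
As a sanity check on the arithmetic, here is a rough sketch that enumerates the matrix above; the names and values are illustrative only, not the actual test harness.

```
# Illustrative only: enumerate the load-test combinations described above
# to confirm the 3 * 2 * 2 * 2 * 2 = 48 total.
from itertools import product

speeds = ["100%", "150%", "200%"]
trees = [100, 500]
wikis = ["enwiki+dewiki", "top-10-by-search-volume"]
rescore_windows = [1024, 4096]
retrieval_queries = ["original", "simplified-all-field"]

configs = list(product(speeds, trees, wikis, rescore_windows, retrieval_queries))
print(len(configs))  # 48
for speed, n_trees, wiki_set, window, query in configs:
    print(f"{speed} playback, {n_trees} trees, {wiki_set}, window={window}, {query} retrieval")
```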

Event Timeline


Change 362091 had a related patch set uploaded (by EBernhardson; owner: EBernhardson):
[operations/puppet@production] Monitor elasticsearch stats for load test

https://gerrit.wikimedia.org/r/362091

EBernhardson added a comment. Edited Jun 28 2017, 11:16 PM

Unfortunately I'm not able to grab the stats for only the rewritten queries right now; I'll need to merge the above puppet patch to get those. I ran a couple of tests with LTR rewrites today just to make sure it will work, but I think it will be best to make sure we can collect data specifically about the rewritten queries as well. Also, now that I think about it, it might be worthwhile to rerun the baselines with the queries that would be rewritten tagged, but not actually rewritten.
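
For the "tagged but not actually rewritten" baseline, one possible approach (purely a sketch, not the actual CirrusSearch change) is Elasticsearch's per-request stats groups, which roll up separately in the index stats API. Host, index, field, and group names below are placeholders.

```
# Hypothetical sketch: tag the queries that would be rewritten with an
# Elasticsearch "stats" group so their latencies can be separated out later.
import requests

search_body = {
    "stats": ["ltr-rewrite-candidate"],                   # group name is an assumption
    "query": {"match": {"all": "example search terms"}},  # stand-in retrieval query
}

resp = requests.post(
    "http://localhost:9200/enwiki_content/_search",       # placeholder endpoint
    json=search_body,
)
print(resp.json()["took"])
```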

EBernhardson updated the task description. Edited Jun 28 2017, 11:29 PM
EBernhardson added a subscriber: dcausse.

@dcausse I've updated the ticket description with a list of load tests to run; does that seem sufficient? Is there anything more or less we should test? More features would be nice, but I don't really have any features prepared that I can build a model with to test.

Since we are looking at load in particular, we could perhaps cut the 100% speed test and only run 150% and 200%. We could also simplify to only the more expensive tests (more trees, more wikis, bigger rescore window) and, as long as those look reasonable, assume the rest would be fine as well.


Change 362091 merged by Gehel:
[operations/puppet@production] Monitor elasticsearch stats for load test

https://gerrit.wikimedia.org/r/362091

@EBernhardson I agree with you, keeping the most expensive setup sounds good to me.
Did you keep the phrase rescore as a first pass with ltr?

@dcausse No phrase rescore as a first pass, because it would be entirely overridden by the LTR rescore.
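
For reference, a minimal sketch of roughly what the rescore clause looks like, assuming the elasticsearch-learning-to-rank plugin's sltr query; the model name and feature parameters are placeholders, not the production configuration.

```
# Hedged illustration of an LTR rescore clause; names are hypothetical.
ltr_rescore = {
    "window_size": 1024,  # 4096 in the larger-window variant
    "query": {
        "rescore_query": {
            "sltr": {
                "model": "enwiki_ltr_500_trees",                     # hypothetical model name
                "params": {"query_string": "example search terms"},  # placeholder features input
            }
        },
        # With the retrieval score weighted to zero, the LTR model alone supplies
        # the final score over the window, which is why a phrase rescore first
        # pass would add cost without affecting the result.
        "query_weight": 0,
        "rescore_query_weight": 1,
    },
}
```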

Baseline load tests of existing production deployment
  • 100%
  • 150%
  • 200%

Rewrite full text for 10 wikis with 500 trees, 1024 rescore window and original retrieval query
  • 150%
  • 200%

Rewrite full text for 10 wikis with 500 trees, 4096 rescore window and simplified retrieval query
  • 150%
  • 200%

Rewrite full text for 10 wikis with 500 trees, 4096 rescore window and original retrieval query
  • 100%
  • 150%
  • 200%
  • 250%

Load tests are run. The tl;dr is basically that the new queries are expensive, as expected, but not insane. Increasing the load from the current peak request rate to 150% and 200% of the current peak had no effect on latencies. The average per-shard latency on the most expensive query was 100ms, vs 45ms in the baseline. Prefix search and morelike response times are unaffected by running the most expensive query at 200% of the current peak load. Running the most expensive query at 250% of current peak load put the cluster under heavy load with constant thread pool rejections. Unfortunately I don't have any data on whether those thread pool rejections turned into request failures, but under normal circumstances a thread pool rejecting a query should cause it to be rescheduled on another replica.
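
For anyone repeating this, a small sketch of one way to watch for those rejections during a run, using the standard _cat/thread_pool API (the host is a placeholder).

```
# Hedged sketch: poll per-node search thread pool rejection counts while a test runs.
import requests

resp = requests.get(
    "http://localhost:9200/_cat/thread_pool/search",   # placeholder host
    params={"format": "json", "h": "node_name,name,rejected,queue,active"},
)
for row in resp.json():
    if int(row["rejected"]) > 0:
        print(f"{row['node_name']}: {row['rejected']} rejected search tasks")
```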

Overall, I think it's pretty safe to continue. Things might change a bit with more features, but we seem to have plenty of headroom to both make queries more expensive and increase request rates to 2x their current levels.

I agree with the conclusions, moving to done.
We also have some room for improvement that we could explore in the future:

  • faster tree evaluation using a representation with arrays of primitives (a rough sketch follows this list)
  • explore using a custom LTR rescore context to allow bulk scoring, which could reduce the number of iterations made over the list of features
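
To make the first bullet concrete, a hedged illustration of the "arrays of primitives" idea: a regression tree flattened into parallel arrays so evaluation is a tight loop over primitive values rather than a walk over node objects. The plugin's actual internals may differ; this is only a sketch.

```
# Illustrative flattened-tree evaluation; not the plugin's implementation.
import numpy as np

# One tiny tree: node 0 splits on feature 0 at threshold 0.5; its children are leaves.
feature = np.array([0, -1, -1], dtype=np.int32)       # -1 marks a leaf node
threshold = np.array([0.5, 0.0, 0.0], dtype=np.float32)
left = np.array([1, -1, -1], dtype=np.int32)          # left child index per node
right = np.array([2, -1, -1], dtype=np.int32)         # right child index per node
value = np.array([0.0, -1.0, 1.0], dtype=np.float32)  # leaf output values

def score(features: np.ndarray) -> float:
    node = 0
    while feature[node] != -1:
        node = left[node] if features[feature[node]] <= threshold[node] else right[node]
    return float(value[node])

print(score(np.array([0.7], dtype=np.float32)))  # 1.0
```
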
debt closed this task as Resolved. Jun 30 2017, 9:20 PM
debt added a subscriber: debt.

Well done! :)