
Load test the ltr-query plugin
Closed, ResolvedPublic

Description

To get an idea of what to expect when we roll out learning to rank, we need to run a load test. This can be based on the previous load testing work in T117714.

Variations to run:

  • (3) original speed, 150% playback speed and 200% playback speed
  • (2) 100 tree and 500 tree models
  • (2) ltr on enwiki and dewiki, and ltr on top 10 wikis by search volume
  • (2) 1024 rescore window and 4096 rescore window
  • (2) original retrieval query and simplified retrieval query using the all field

In total that's 3*2*2*2*2 = 48 tests to run, and since each test replays 40 minutes of input data, a set of three speeds takes roughly 40 min + 30 min + 20 min = 90 min = 1.5 hours. This can be mostly automated, although someone should keep an eye on things to make sure we don't overload the cluster. Call it 2 hours per set of 3 speeds, and it's about 32 hours worth of testing. Thankfully it's mostly hands-off testing.
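
For illustration, a minimal Python sketch of the matrix and time budget above (the names and structure are made up, not the actual harness used for these runs):

```python
# Hypothetical enumeration of the test matrix described above.
import itertools

speeds  = ["100%", "150%", "200%"]        # playback speeds
trees   = [100, 500]                      # model sizes
wikis   = ["enwiki+dewiki", "top10"]      # wikis with LTR enabled
windows = [1024, 4096]                    # rescore window sizes
queries = ["original", "simplified"]      # retrieval query variants

matrix = list(itertools.product(speeds, trees, wikis, windows, queries))
print(len(matrix))                        # 3 * 2 * 2 * 2 * 2 = 48 tests

# 40 minutes of input data replayed at the three speeds is roughly
# 40 + 30 + 20 = 90 min per configuration; call it 2 h including setup.
configs = len(matrix) // len(speeds)      # 16 configurations
print(configs * 2, "hours")               # ~32 hours of testing
```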

Event Timeline

Change 362091 had a related patch set uploaded (by EBernhardson; owner: EBernhardson):
[operations/puppet@production] Monitor elasticsearch stats for load test

https://gerrit.wikimedia.org/r/362091

Unfortunately I'm not able to grab the stats for only the rewritten queries right now; I will need to merge the above puppet patch to get those. I ran a couple of tests with LTR rewrites today just to make sure it will work, but I think it will be best to make sure we can also collect data specifically about the rewritten queries. Now that I think about it, it might also be worthwhile to rerun the baselines with the queries that would be rewritten tagged, but not actually rewritten.

EBernhardson added a subscriber: dcausse.

@dcausse I've updated the ticket description with a list of load tests to run; does that seem sufficient? Is there anything more/less we should test? More features would be nice, but I don't really have any features prepared that I can build a model with to test.

Since we are looking at load in particular, we could perhaps cut the 100% speed test and only run 150% and 200%. We could also simplify to only the more expensive tests (more trees, more wikis, bigger rescore window) and, as long as that looks reasonable, assume the rest would be fine as well.

Change 362091 merged by Gehel:
[operations/puppet@production] Monitor elasticsearch stats for load test

https://gerrit.wikimedia.org/r/362091

@EBernhardson I agree with you, keeping the most expensive setup sounds good to me.
Did you keep the phrase rescore as a first pass with ltr?

@dcausse No phrase rescore as a first pass, because it would be entirely overridden by the LTR rescore.
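
For illustration, a hedged sketch of what a rewritten full-text query might look like, assuming the sltr query syntax from the Elasticsearch LTR plugin; the model name, params, and weights are assumptions, not the exact production query:

```python
# Illustrative query body only; model name, field names and weights are
# placeholders, not the actual CirrusSearch rewrite.
ltr_rescore_body = {
    "query": {
        # The original (or simplified) retrieval query selects candidates.
        "query_string": {"query": "example search terms", "fields": ["all"]},
    },
    "rescore": {
        "window_size": 1024,              # compared against 4096 in the tests
        "query": {
            "rescore_query": {
                # The LTR plugin's 'sltr' query scores candidates with a
                # stored model; since this score replaces the first-pass
                # score, a phrase rescore before it would be pointless.
                "sltr": {
                    "model": "enwiki_500_trees",              # hypothetical
                    "params": {"query_string": "example search terms"},
                }
            },
            # One way to let the model score fully override the retrieval
            # score (an assumption, not necessarily the production settings).
            "query_weight": 0.0,
            "rescore_query_weight": 1.0,
        },
    },
}
```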

Baseline load tests of existing production deployment

100%

baseline ltr load test (988×1 px, 242 KB)

150%

baseline ltr load test 150% (988×1 px, 232 KB)

200%

baseline ltr load test 200% (988×1 px, 234 KB)

Rewrite full text for 10 wikis with 500 trees, 1024 rescore window and original retrieval query

150%

loadtest 150% 10 wikis 500 trees 1024 rescore window orig query (988×1 px, 230 KB)

200%

loadtest 200% 10 wikis 500 trees 1024 rescore window orig query (988×1 px, 222 KB)

Rewrite full text for 10 wikis with 500 trees, 4096 rescore window and simplified retrieval query

150%

loadtest 150% 10 wikis 500 trees 4096 rescore window simplified queery (988×1 px, 244 KB)

200%

loadtest 200% 10 wikis 500 trees 4096 rescore window simplified queery (988×1 px, 236 KB)

Rewrite full text for 10 wikis with 500 trees, 4096 rescore window and original retrieval query

100%

loadtest 100% 10 wikis 500 trees 4096 rescore window orig query (988×1 px, 242 KB)

150%

loadtest 150% 10 wikis 500 trees 4096 rescore window orig query (988×1 px, 238 KB)

200%

loadtest 200% 10 wikis 500 trees 4096 rescore window orig query (988×1 px, 227 KB)

250%

loadtest 250% 10 wikis 500 trees 4096 rescore window orig query (988×1 px, 245 KB)

The load tests are done. The tl;dr is basically that the new queries are expensive, as expected, but not insanely so. Increasing the load from the current peak request rate to 150% and 200% of the current peak had no effect on latencies. The average per-shard latency on the most expensive query was 100 ms, vs 45 ms in the baseline. Prefix search and morelike response times are unaffected by running the most expensive query at 200% of the current peak load. Running the most expensive query at 250% of the current peak load put the cluster under heavy load with constant thread pool rejections. Unfortunately I don't have any data on whether those thread pool rejections turned into request failures, but under normal circumstances a thread pool rejecting a query should cause that query to be rescheduled on another replica.
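
For context, a rough sketch (not the monitoring added by the puppet patch above) of how search thread pool rejections and mean per-shard query latency can be sampled from the standard Elasticsearch nodes-stats API; the host is a placeholder:

```python
# Samples cumulative counters before and after a test window; differencing
# them gives rejections and mean per-shard query latency for that window.
import requests

NODES_STATS = "http://localhost:9200/_nodes/stats/thread_pool,indices"

def sample():
    nodes = requests.get(NODES_STATS, timeout=10).json()["nodes"]
    rejected = sum(n["thread_pool"]["search"]["rejected"] for n in nodes.values())
    query_total = sum(n["indices"]["search"]["query_total"] for n in nodes.values())
    query_ms = sum(n["indices"]["search"]["query_time_in_millis"] for n in nodes.values())
    return rejected, query_total, query_ms

r0, t0, ms0 = sample()
# ... run the load test ...
r1, t1, ms1 = sample()
print("search thread pool rejections:", r1 - r0)
if t1 > t0:
    print("mean per-shard query latency (ms):", (ms1 - ms0) / (t1 - t0))
```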

Overall, I think it's pretty safe to continue. Things might change a bit with more features, but we seem to have plenty of headroom to both make queries more expensive and increase request rates to 2x their current levels.

I agree with the conclusions, moving to done.
We also have some room for improvement that we could explore in the future:

  • faster tree evaluation using a representation with arrays of primitives (see the sketch below)
  • explore using a custom LTR rescore context to allow bulk scoring, which could reduce the number of iterations made over the list of features
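
For illustration, a toy Python sketch of the "arrays of primitives" idea from the first bullet: one tree flattened into parallel arrays and evaluated in a tight loop, rather than by walking node objects. This is not the plugin's actual data layout:

```python
from array import array

# One tiny tree: the root splits on feature 0 at 0.5; both children are leaves.
feature   = array('i', [0, -1, -1])       # feature index per node, -1 = leaf
threshold = array('d', [0.5, 0.0, 0.0])   # split value (unused for leaves)
left      = array('i', [1, 0, 0])         # index of the left child
right     = array('i', [2, 0, 0])         # index of the right child
value     = array('d', [0.0, -1.3, 2.7])  # leaf output

def score(features):
    node = 0
    while feature[node] != -1:
        node = left[node] if features[feature[node]] <= threshold[node] else right[node]
    return value[node]

# An ensemble would sum score() over a few hundred such trees.
print(score([0.2]))   # -> -1.3
print(score([0.9]))   # ->  2.7
```
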
debt subscribed.

Well done! :)