Run load tests of cross-project searching to verify its stability
Closed, ResolvedPublic

Description

We should run some load tests of our cross-project searching plan to ensure that it is performant enough for release.

Our plan is to take X% of searches (X to be determined) and run cross-project searches for those queries. The results presented to the user will not be changed during this test, as the test is meant purely to determine server load.

Event Timeline

Setting to high priority; this is a significant part of a current quarterly goal.

A couple of things have to be taken into account:

  • Interwiki search will only be on by default in the web interface; API users will need to explicitly opt in to interwiki search via an API parameter. The load test should therefore only be enabled for web requests.
  • Interwiki search will only be enabled for Wikipedias.
  • We will only be able to see some of the effect in the Grafana dashboards, so we likely also need to mark these requests in the CirrusSearchRequestSet logs so we can pull percentiles there as well.

My idea:

  • Create a new configuration variable holding a value from 0 to 1, indicating the fraction of requests against Special:Search that should have interwiki results queried, but not shown to users.
  • Turn that variable on in the mediawiki-config repository by using the special 'wikipedia' group.
  • Start at a low percentage, and slowly ramp up.
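
The idea above can be sketched as a small sampling gate. This is only a minimal Python illustration, not the actual CirrusSearch change (which lives in PHP); the function name `should_run_interwiki_load_test` and all of its parameters are hypothetical.

```python
import random


def should_run_interwiki_load_test(sample_rate, is_web_request, is_wikipedia,
                                   rng=random.random):
    """Decide whether a search request also issues hidden interwiki queries.

    sample_rate is the configured fraction in [0, 1]. Every name here is
    hypothetical and only illustrates the proposal in this task.
    """
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be between 0 and 1")
    # Only web requests on Wikipedias take part in the test.
    if not (is_web_request and is_wikipedia):
        return False
    # rng() returns a float in [0, 1), so 0 never samples and 1 always does.
    return rng() < sample_rate
```

Ramping up would then only mean raising `sample_rate` in configuration (for example 0.05, then 0.25), with no change to the gate itself.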

Caveats:

  • Not sure yet how to tell that a request is to Special:Search.

I suppose that to be able to do this, we will also need to come up with the map of wikipedia -> sister projects + prefixes to fill wgCirrusSearchInterwikiSources. Otherwise we have to delay the test and work together with @dcausse's refactoring of the core InterwikiLookup, so that we can look up the relevant sister projects.

Change 319485 had a related patch set uploaded (by EBernhardson):
[WIP] Add configuration value to run interwiki load test

https://gerrit.wikimedia.org/r/319485

Change 319498 had a related patch set uploaded (by EBernhardson):
[WIP] Script to generate wmgCirrusSearchInterwikiSources

https://gerrit.wikimedia.org/r/319498

I pulled some numbers from Hive to get an idea of the level of impact. This data is for Oct 30 19:00-20:00 UTC, which is a weekly peak on our Grafana graphs. It is restricted to queries that have at least one full_text query in the log.

type            avg qps
all             498
web             120
wikipedia web   117

So something like 24% of full-text searches, at peak load, would also start doing interwiki search. If we wanted to expand the interwiki search (at some later point) such that we were doing queries from, say, itwikisource to all its sisters, the additional load would probably be unnoticeable.
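
For reference, the roughly 24% figure is just the Wikipedia web rate divided by the overall rate. A small Python check, using only the numbers from the table above (the helper name `share_of_all` is made up for illustration):

```python
# Average queries per second from the table above (Oct 30, 19:00-20:00 UTC).
AVG_QPS = {"all": 498, "web": 120, "wikipedia web": 117}


def share_of_all(kind):
    """Return the given traffic type as a percentage of all full-text qps."""
    return 100.0 * AVG_QPS[kind] / AVG_QPS["all"]


print(f"{share_of_all('wikipedia web'):.1f}%")  # prints 23.5%
```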

Change 320220 had a related patch set uploaded (by EBernhardson):
Setup CirrusSearch interwiki load test

https://gerrit.wikimedia.org/r/320220

Change 319485 merged by jenkins-bot:
Add configuration value to run interwiki load test

https://gerrit.wikimedia.org/r/319485

Change 320220 merged by jenkins-bot:
Setup CirrusSearch interwiki load test

https://gerrit.wikimedia.org/r/320220

Change 321492 had a related patch set uploaded (by EBernhardson):
Revert "Revert "Setup CirrusSearch interwiki load test""

https://gerrit.wikimedia.org/r/321492

There were some log spamming issues, so we had to revert. We're looking at trying this again within a day or two.

Change 321492 merged by jenkins-bot:
Revert "Revert "Setup CirrusSearch interwiki load test""

https://gerrit.wikimedia.org/r/321492

Mentioned in SAL (#wikimedia-operations) [2016-11-15T19:51:49Z] <thcipriani@tin> Synchronized wmf-config/CirrusSearch-interwikiSources.php: SWAT: [[gerrit:321492|Revert "Revert "Setup CirrusSearch interwiki load test"" (T149740)]] PART I (duration: 01m 43s)

Mentioned in SAL (#wikimedia-operations) [2016-11-15T19:53:39Z] <thcipriani@tin> Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:321492|Revert "Revert "Setup CirrusSearch interwiki load test"" (T149740)]] PART II (duration: 00m 48s)

Mentioned in SAL (#wikimedia-operations) [2016-11-15T19:54:58Z] <thcipriani@tin> Synchronized wmf-config/CirrusSearch-production.php: SWAT: [[gerrit:321492|Revert "Revert "Setup CirrusSearch interwiki load test"" (T149740)]] PART III (duration: 00m 48s)

Change 321724 had a related patch set uploaded (by EBernhardson):
Increase CirrusSearch interwiki load test to 25%

https://gerrit.wikimedia.org/r/321724

The initial 5% deployment caused no noticeable change to our load or latency metrics. I will increase it to 25% in this evening's SWAT deployment. I will also work out how to query this information out of Hive, so we have data on just these requests and not the aggregate API + web that our standard metrics report.

Change 321724 merged by jenkins-bot:
Increase CirrusSearch interwiki load test to 25%

https://gerrit.wikimedia.org/r/321724

Mentioned in SAL (#wikimedia-operations) [2016-11-16T00:26:08Z] <dereckson@tin> Synchronized wmf-config/CirrusSearch-production.php: Increase CirrusSearch interwiki load test to 25% (T149740) (duration: 00m 58s)

Initial latency changes look pretty minor. They are there, but minor.

isloadtest   count    % req   p50           p75          p95
False        456005   95.8%   159.0         243.0        764.0
True         19683    4.1%    182.0         258.0        765.0
change:                       +23ms / 14%   +15ms / 6%   +1ms / 0.1%

The load test was set for 5%, so I'm not sure why this is showing 4.1%. Perhaps my selection of wikis (a lazy len(wikiid) == 6) wasn't selective enough.
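
The "change" row and the 4.1% share can be recomputed from the other figures in the table. A minimal Python sketch, with the values copied from the table above and a hypothetical helper name `latency_change`:

```python
def latency_change(baseline_ms, test_ms):
    """Return (absolute change in ms, relative change in percent)."""
    delta = test_ms - baseline_ms
    return delta, 100.0 * delta / baseline_ms


# Percentiles from the table above: regular vs. load-test requests.
baseline = {"p50": 159.0, "p75": 243.0, "p95": 764.0}
load_test = {"p50": 182.0, "p75": 258.0, "p95": 765.0}

for name in ("p50", "p75", "p95"):
    delta, pct = latency_change(baseline[name], load_test[name])
    print(f"{name}: {delta:+.0f}ms / {pct:.1f}%")

# Share of requests that were flagged as part of the load test.
flagged, unflagged = 19683, 456005
print(f"load test share: {100.0 * flagged / (flagged + unflagged):.1f}%")
```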

hive query for future reference:

select payload['interwikiLoadTest'] as isLoadTest,
       count(1),
       percentile(cast(tookms as int), 0.50) as p50,
       percentile(cast(tookms as int), 0.75) as p75,
       percentile(cast(tookms as int), 0.95) as p95
from cirrussearchrequestset
where year=2016 and month=11 and day=15 and hour=23
  and length(wikiid) = 6
  and source = 'web'
group by payload['interwikiLoadTest'];
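
For readers without Hive access, the shape of that aggregation can be approximated in memory. This is a rough Python analogue only: the row format `(is_load_test, took_ms)` and the function name are assumptions, and the interpolation used by `statistics.quantiles` may not match Hive's `percentile` exactly.

```python
from collections import defaultdict
from statistics import quantiles


def percentiles_by_flag(rows):
    """Group (is_load_test, took_ms) rows and report count, p50, p75, p95."""
    groups = defaultdict(list)
    for is_load_test, took_ms in rows:
        groups[is_load_test].append(int(took_ms))
    summary = {}
    for flag, values in groups.items():
        # quantiles(n=100) returns the 99 cut points p1..p99;
        # each group needs at least two values.
        cuts = quantiles(values, n=100, method="inclusive")
        summary[flag] = {
            "count": len(values),
            "p50": cuts[49],
            "p75": cuts[74],
            "p95": cuts[94],
        }
    return summary
```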

Traffic today looks to have peaked between 17:00 and 18:00 UTC. Stats for today:

interwiki   reqs     % reqs   p50          p75          p95
false       417707   76.5%    152.0        219.0        685.0
true        128539   23.5%    189.0        270.0        742.0
change                        37ms / 24%   51ms / 23%   57ms / 8%

The difference is a little larger and more noticeable than at 5%. That could be an effect of timing, though: the 5% numbers were taken after the daily peak, whereas these are taken from the peak hour. I'm not seeing any noticeable changes in cluster CPU or iowait load. IO usage does appear to be up. It's a little hard to see, because the weekly Hadoop dump ran from the 14th at 13:00 until the 15th at 22:00. Elasticsearch also looks to have finished most of its segment merges by 22:00. We deployed 5% interwiki on the 15th around 20:00, and 25% on the 16th at 00:20.

Some data about disk usage. The values are rough guesstimates, made by looking at the Graphite data between 17:00 and 18:00 UTC and choosing a reasonable value. Peaks are taken by looking at the 5 servers with the highest values and taking the highest peak that isn't an obvious outlier.

            iops/server   bytes/s/server   peak iops   peak bytes/s
last week   ~470          4 MB/s           1.2k        18 MB/s
current     ~470          4 MB/s           1.7k        24 MB/s

So we aren't necessarily pulling more data on average, but some servers might be hot-spotting a bit more than before. Overall, though, I think we are good to increase to 50%.

Change 321925 had a related patch set uploaded (by EBernhardson):
Increase cirrus interwiki loadtest to 50%

https://gerrit.wikimedia.org/r/321925

Change 321925 merged by jenkins-bot:
Increase cirrus interwiki loadtest to 50%

https://gerrit.wikimedia.org/r/321925

Mentioned in SAL (#wikimedia-operations) [2016-11-17T00:35:15Z] <dereckson@tin> Synchronized wmf-config/CirrusSearch-production.php: Increase cirrus interwiki loadtest to 50% (T149740) (duration: 00m 48s)

There's a question outstanding: was this load test done with all requests, including API requests, at 50%? If so, we're already past what we'd handle if we just did it for desktop, so the load test is successful already! @dcausse will check, and if it's just desktop traffic, increase it to 100%. We expect to deploy that change next week, since this week is a holiday week.

If I'm reading the code correctly, the load test should only cover desktop searches: the interwiki flag is on by default, but the API sets it to false by default. So this is basically everything except API requests. I'll try to extract some numbers to compare with what Erik already analyzed, and I'll prepare a patch to increase it to 75% accordingly.

Some data for Nov 22 at 12 UTC:

interwiki               #queries   p50     p75     p95
NULL (assuming false)   284462     149.0   228.0   840.0
true                    230105     168.0   245.0   704.0

The numbers seem stable. I'm not sure I understand why p95 is better when interwiki is enabled...
I'll take another snapshot today at 23h UTC so we can directly compare with the previous numbers, but overall this looks promising. I'll prepare a patch to increase to 75% and SWAT it tomorrow.

Change 322900 had a related patch set uploaded (by DCausse):
[cirrus] Increase interwiki loadtest to 75%

https://gerrit.wikimedia.org/r/322900

Change 322900 merged by jenkins-bot:
[cirrus] Increase interwiki loadtest to 75%

https://gerrit.wikimedia.org/r/322900

Mentioned in SAL (#wikimedia-operations) [2016-11-22T19:17:08Z] <thcipriani@tin> Synchronized wmf-config/CirrusSearch-production.php: SWAT: [[gerrit:322900|[cirrus] Increase interwiki loadtest to 75%]] (T149740) (duration: 00m 55s)

Another snapshot taken on Nov 24 at 23h UTC:

interwiki               #queries   p50     p75     p95
NULL (assuming false)   136117     171.0   281.0   982.0
true                    235698     160.0   236.0   680.0

I still don't know how to interpret the fact that interwiki times are better...

It is a bit odd that interwiki searches are coming back faster than the standard ones. Overall, though, I don't think it's a big deal. The more important questions are whether overall cluster health seems worse, and whether the timings are noticeably worse than before we started the test. The numbers don't seem to vary much from before, so this generally looks like a success. I will push a patch to make it 100% today, and I'm not expecting anything particularly bad.

I'm not going to complain if it's even faster. ^_^

Change 323868 had a related patch set uploaded (by EBernhardson):
Increase Cirrus interwiki load test to 100%

https://gerrit.wikimedia.org/r/323868

Change 323868 merged by jenkins-bot:
Increase Cirrus interwiki load test to 100%

https://gerrit.wikimedia.org/r/323868

Mentioned in SAL (#wikimedia-operations) [2016-11-28T19:34:49Z] <thcipriani@tin> Synchronized wmf-config/CirrusSearch-production.php: SWAT: [[gerrit:323868|Increase Cirrus interwiki load test to 100%]] (T149740) (duration: 00m 46s)

Change 325329 had a related patch set uploaded (by EBernhardson):
Turn off CirrusSearch interwiki load test

https://gerrit.wikimedia.org/r/325329

Final results against the busiest hour since 100% was enabled (Dec 4, 19:00 UTC):

isloadtest   count    p50     p75     p95
NULL         91853    239.0   479.0   1143.0
true         396219   175.0   249.0   616.0

Same conclusions as before: the interwiki searches are not seeing any significant degradation of performance vs. pre-test numbers.

Change 325329 merged by jenkins-bot:
Turn off CirrusSearch interwiki load test

https://gerrit.wikimedia.org/r/325329

Mentioned in SAL (#wikimedia-operations) [2016-12-05T19:41:20Z] <thcipriani@tin> Synchronized wmf-config/CirrusSearch-production.php: SWAT: [[gerrit:325329|Turn off CirrusSearch interwiki load test]] T149740 PART I (duration: 00m 44s)

Mentioned in SAL (#wikimedia-operations) [2016-12-05T19:44:48Z] <thcipriani@tin> Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:325329|Turn off CirrusSearch interwiki load test]] T149740 PART II (duration: 00m 47s)

Mentioned in SAL (#wikimedia-operations) [2016-12-05T19:45:53Z] <thcipriani@tin> Synchronized wmf-config: SWAT: [[gerrit:325329|Turn off CirrusSearch interwiki load test]] T149740 PART III (duration: 00m 46s)

Change 319498 abandoned by EBernhardson:
[WIP] Script to generate wmgCirrusSearchInterwikiSources

Reason:
Was used for interwiki load testing, but will not be necessary longer term. I6aebfa22c05 allows deprecation of the wgCirrusSearchInterwikiSources config.

https://gerrit.wikimedia.org/r/319498