
Synthetic Load Test
Open, Needs Triage · Public

Description

Run a synthetic load test using locust to determine whether the current service can stand up to production-like traffic, and to reassure ourselves that we won't cause a meltdown through some positive feedback loop.

A/C

  • Traffic should be of a similar or slightly higher load to production (1-5/s)
  • Traffic should avoid caching so as to fully stress all moving parts
  • Entities requested should be representative of entities on Wikidata
  • Languages requested should be variable
  • Languages shouldn't be so variable that we mostly request low term-incidence languages
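One way to satisfy the last two language criteria is weighted random sampling. A minimal sketch, assuming illustrative weights (the language set and numbers here are our own assumptions, not measured term-incidence data from Wikidata):

```python
import random

# Hypothetical weights biasing toward languages with high term incidence,
# so the test varies languages without mostly hitting sparse ones.
LANGUAGE_WEIGHTS = {"en": 50, "de": 15, "fr": 10, "es": 10, "ru": 8, "ja": 7}

def pick_language():
    """Pick a language at random, proportionally to its weight."""
    langs, weights = zip(*LANGUAGE_WEIGHTS.items())
    return random.choices(langs, weights=weights, k=1)[0]
```

In a locust file, a task would call `pick_language()` when building each request's query parameters.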

Event Timeline

Tarrow created this task. · Tue, Aug 6, 9:31 AM
Restricted Application added a subscriber: Aklapper. · View Herald Transcript · Tue, Aug 6, 9:31 AM
Tarrow updated the task description. (Show Details) · Tue, Aug 6, 11:29 AM

I reckon this is about a 5-point task, ish.

Tarrow updated the task description. (Show Details) · Thu, Aug 8, 2:12 PM

Change 529741 had a related patch set uploaded (by Tarrow; owner: Tarrow):
[wikibase/termbox@master] Add simple load testing for service only

https://gerrit.wikimedia.org/r/529741

Tarrow updated the task description. (Show Details) · Tue, Aug 13, 9:39 AM

Change 529741 merged by jenkins-bot:
[wikibase/termbox@master] Add simple load testing for service only

https://gerrit.wikimedia.org/r/529741

Mentioned in SAL (#wikimedia-operations) [2019-08-14T10:28:37Z] <tarrow> Starting smoketest of termbox service on eqiad: T229907

We successfully ran the load test from around 10:30–11:20 UTC on 2019-08-14, over an SSH tunnel between a developer laptop and eqiad.

The following command was run: `.venv/bin/locust -f ./service_only.py --no-web -c 5 -r 0.03 --run-time 30m --host 'http://localhost:3456'`

This approximated 5 req/s (climbing from 1 req/s over a 2.5-minute period), hitting items between Q1 and Q3600 with a random selection of languages. Notably, around 2% of requests resulted in a 500 from the service, due to a 400 from the Special:EntityData request. This happened when attempting to retrieve items that were redirects.
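The request pattern described above can be sketched as follows. This is our own illustration of the parameter selection, not the actual contents of `service_only.py`; the helper name and parameter keys are assumptions:

```python
import random

# Item range and languages as described in the test run above; the
# language list is an illustrative subset, not the one actually used.
ITEM_RANGE = (1, 3600)
LANGUAGES = ["en", "de", "fr", "es", "ru", "ar", "zh"]

def random_termbox_params():
    """Pick a random entity id (Q1..Q3600) and a random language,
    mirroring how each synthetic request was parameterised."""
    qid = "Q%d" % random.randint(*ITEM_RANGE)
    return {"entity": qid, "language": random.choice(LANGUAGES)}
```

Randomising both item and language keeps cache hit rates low, as required by the acceptance criteria. Note that some ids in this range are redirects, which is what produced the ~2% error rate observed.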

We saw a response time reported by the service of around 200 ms, which seemed perfectly acceptable to us. The latency between the client and the service was substantially higher (and variable), probably due to the distance between the laptop and the datacenter.

While double-checking that this was sufficient load, we discovered that the mobile load on wikidata.org has increased since the estimations were done last year.

For example, see https://tools.wmflabs.org/siteviews/?platform=mobile-web&source=pageviews&agent=all-agents&range=this-year&sites=wikidata.org, where there is peak traffic of 3M requests in a day on 2019-07-16. This includes all requests, so much of it may well not result in traffic to the service (e.g. Varnish cache hits, non-item pages, ParserCache hits).

To check that our numbers were still valid, a quick analysis of page requests was made.

Using Turnilo we can see the number of mobile Varnish cache misses: https://w.wiki/77e. For example, in the last 10 days there was a peak of 2.3M Varnish cache misses on mobile per day, on 2019-08-12, out of total mobile traffic of 2.5M.

Looking at https://grafana.wikimedia.org/d/FxKUKqUik/wikibase-parseroutputgenerator?orgId=1&var-entityType=item&var-summarize=1d&from=1565568690228&to=1565655909962, there are around 3.7M total ParserCache misses per day across desktop and mobile. From Turnilo we estimate the total Varnish misses to be about 13M.

Taking this rough ratio: (PCache misses / total Varnish misses) × mobile-only Varnish cache misses = (3.7M / 13M) × 2.5M ≈ 710k.

Spread equally across the day, this is about 493/min, or 8/s.
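The arithmetic above can be checked directly (a back-of-envelope sketch; the inputs are the figures quoted in the preceding comments):

```python
# Daily figures from the Grafana and Turnilo observations above.
pcache_misses = 3.7e6          # ParserCache misses, desktop + mobile
varnish_misses_total = 13e6    # estimated total Varnish misses
mobile_varnish_misses = 2.5e6  # mobile-only Varnish misses

# Estimated daily service-relevant mobile requests.
estimated_daily = pcache_misses / varnish_misses_total * mobile_varnish_misses
# ≈ 711.5k; the task rounds this to 710k, giving ~493/min.

per_minute = estimated_daily / (24 * 60)       # ≈ 494
per_second = estimated_daily / (24 * 60 * 60)  # ≈ 8.2
```

So the ~8/s worst-case estimate is roughly 1.6× the 5 req/s actually generated in the test run, which is worth bearing in mind when judging whether the test was sufficient.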