
Diagnose and fix 4.5k req/min ceiling for ores* requests
Closed, Declined · Public

Description

We ran into a strange limitation in T169246, seemingly not caused by CPU, memory, or the number of ORES celery workers. By process of elimination, our best guess is that uWSGI is the bottleneck.

Run more stress tests, maybe by depooling a data center.

See if we can increase the ceiling.

Event Timeline

awight created this task. Dec 6 2017, 9:08 PM
awight added a comment. Dec 6 2017, 9:28 PM

At their current performance, the celery workers we have should be able to handle 9 nodes * 150 workers/node * (1 request / 1.17 node-s) * (60 s/min) ≈ 69,000 req/min. In practice we would probably hit the CPU limit first: at 20% CPU for the current ~4.5k req/min, full CPU corresponds to roughly 5 × 4.5k = 22,500 req/min.

This is more evidence that there's a bottleneck upstream of celery.
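
A quick sketch of that back-of-envelope arithmetic (the 1.17 node-s per score and the 20% CPU at the current ~4.5k req/min ceiling are the figures assumed in the comment above; nothing here is measured independently):

```
# Back-of-envelope capacity estimate, using the figures quoted above.
nodes = 9
celery_workers_per_node = 150
seconds_per_score = 1.17           # measured node-seconds per request (assumption from the comment)
observed_ceiling = 4500            # req/min where we currently top out
observed_cpu_fraction = 0.20       # CPU utilisation at that ceiling

celery_capacity = nodes * celery_workers_per_node * 60 / seconds_per_score
cpu_capacity = observed_ceiling / observed_cpu_fraction

print(f"celery-bound capacity: {celery_capacity:,.0f} req/min")  # ~69,231
print(f"CPU-bound capacity:    {cpu_capacity:,.0f} req/min")     # 22,500
```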

Removing uWSGI from the tests is very easy: just submit the jobs/min you'd like directly to the celery queue and see whether the scores processed per minute exceed 4.5k. Viewing the results in a nice way might require adding some instrumentation to the celery worker (sending a message to statsd/graphite every time a score is processed), but that is probably well worth doing anyway, as it would help with operational problems in the future. Even that isn't strictly required, though; cruder methods like submitting a ton of jobs and counting the results per minute would be fine.
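
A hedged sketch of what such a direct-to-celery test could look like; the broker URL, statsd host, task name, and revision IDs are placeholders, not the real ORES configuration:

```
# Sketch: bypass uWSGI by submitting jobs straight to the celery queue and
# counting completed scores via statsd. `score_revision` is a hypothetical
# stand-in for the real ORES scoring task, not its actual name.
import time
import statsd                     # pip install statsd
from celery import Celery

app = Celery("ores_stress", broker="redis://localhost:6379/0")        # placeholder broker
metrics = statsd.StatsClient("statsd.example.org", 8125, prefix="ores.stress")  # placeholder host

@app.task(name="score_revision")
def score_revision(wiki, model, rev_id):
    # ... real scoring work would happen here ...
    metrics.incr("scores_processed")   # one tick per completed score
    return rev_id

def submit(jobs_per_minute, minutes=5):
    """Enqueue a steady stream of jobs without going through the web layer."""
    delay = 60.0 / jobs_per_minute
    for i in range(jobs_per_minute * minutes):
        score_revision.delay("enwiki", "damaging", 800000000 + i)   # made-up rev_ids
        time.sleep(delay)

if __name__ == "__main__":
    submit(jobs_per_minute=6000)   # well above the observed 4.5k ceiling
```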

Samtar added a subscriber: Samtar. Dec 7 2017, 8:16 AM
Halfak added a comment. Dec 7 2017, 3:27 PM

We do have graphite logging in the celery workers every time a score is processed. I'm not sure how that helps us in this situation. We're currently counting jobs both on the request side and via graphite, and the numbers agree.

All the evidence I have seen points to uwsgi having a queue that fills up with requests that simply block until a web worker is ready. None of our timeouts engage until a uwsgi worker starts processing a request, so requests can pile up indefinitely. This is a dangerous place to be, because it can lead to a state where we're totally overloaded and can't tell (unless we start limiting or inspecting the uwsgi queue somehow).
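
One way to inspect that queue is uWSGI's stats server, which reports the socket listen queue length. A minimal probe, assuming the stats server has been enabled on a local port (the address here is an assumption, not the production config):

```
# Sketch: read the uWSGI stats JSON (requires e.g. `stats = 127.0.0.1:1717`
# in the uwsgi config; host/port below are assumptions).
import json
import socket

def uwsgi_stats(host="127.0.0.1", port=1717):
    with socket.create_connection((host, port), timeout=5) as sock:
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return json.loads(b"".join(chunks))

stats = uwsgi_stats()
print("listen queue length:", stats.get("listen_queue"))
print("listen queue errors:", stats.get("listen_queue_errors"))
print("busy workers:", sum(1 for w in stats.get("workers", []) if w.get("status") == "busy"))
```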

I think we have a nice simple test for this: increase the uwsgi worker count and run the test again. Right now, we have 9 servers * 40 cores * 2 $workers_per_core = 720 workers. I think we'll need to bump this to 4 $workers_per_core for 1440 workers. Given that we have 9 * 150 = 1350 celery workers, we'll then be able to process 1350 scoring requests in parallel without waiting on a web worker. Currently, we can only use 720 celery workers in parallel, so the celery queue never fills up.

Change 396055 had a related patch set uploaded (by Awight; owner: Awight):
[operations/puppet@production] Refactor ORES uWSGI workers to use an absolute count

https://gerrit.wikimedia.org/r/396055

Change 396064 had a related patch set uploaded (by Halfak; owner: halfak):
[operations/puppet@production] Bumps stresstest web workers_per_core from 2 to 6.

https://gerrit.wikimedia.org/r/396064

Halfak added a comment. Dec 7 2017, 5:20 PM

OK so, I think we actually need to bump the web worker count to celery_workers + queue_size. Since queue_size is 600, it looks like we need 6 processes per core to get there: 9 servers * 40 cores * 6 $workers_per_core = 2160, and 2160 - 1350 = 810 ≥ 600. 5 per core won't quite do it: 1800 - 1350 = 450.
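
A worked version of that sizing arithmetic, using only the counts quoted above (nothing here is pulled from the actual puppet config):

```
# Web workers should cover celery_workers + queue_size so a request never
# waits for a uWSGI worker while celery still has headroom.
import math

servers, cores = 9, 40
celery_workers = servers * 150        # 1350
queue_size = 600

needed_web_workers = celery_workers + queue_size                   # 1950
min_workers_per_core = math.ceil(needed_web_workers / (servers * cores))

for wpc in (2, 4, 5, 6):
    total = servers * cores * wpc
    print(f"{wpc}/core -> {total} web workers, headroom over celery: {total - celery_workers}")
# 2/core -> 720 (original config); 5/core -> 1800 (450 headroom, short of 600);
# 6/core -> 2160 (810 headroom, covers the 600-slot queue)
print("minimum workers_per_core:", min_workers_per_core)            # 6
```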

awight triaged this task as Medium priority. Dec 11 2017, 5:51 PM
awight moved this task from Active to Review on the Machine Learning Platform (Current) board.

Change 396064 merged by Alexandros Kosiaris:
[operations/puppet@production] Refactor web workers for ORES

https://gerrit.wikimedia.org/r/396064

Mentioned in SAL (#wikimedia-operations) [2017-12-14T12:56:03Z] <awight> stress testing on ores1*.eqiad.wmnet cluster, T182249

Ran another test:
https://grafana-admin.wikimedia.org/dashboard/db/ores?orgId=1&from=1513256100000&to=1513257600000
Test conditions:

  • Only 8 of the ores* machines were online
  • Test harness was hitting 7 machines' uwsgi endpoints
  • Using a delay of 0.001s.
  • uWSGI workers: 230 per machine
  • Celery workers: 150 per machine

Observations:

  • As with earlier tests, there's a large sine wave in the performance, which is in phase across all machines. This must be produced by a centralized bottleneck.
  • Score processing was only slightly better than in earlier tests, averaging about 500 req/(min x machine). We expected much higher throughput given the additional uWSGI workers.
  • Memory bottomed out, but there were no signs of Celery crashing due to OOM.
  • Very few overload errors, which is surprising since the test harness was set to emit 60k req/min. This suggests that uWSGI was still choking the inputs to Celery.
  • CPU usage averaged about 25%
  • The busy web worker count was very level, at about 140. This is surprising since 230 were available. What were the other workers doing? Possibly returning immediately with TimeoutError... but I still don't get it.
  • T182871: ORES web worker memory usage graph is meaningless
  • 13,000 timeout errors out of 100,000 requests. Maybe the MediaWiki API is a bottleneck? Also, our graph is wrong: T182876: ORES timeout error graph is incorrect
awight added a comment (edited). Dec 14 2017, 6:10 PM

Ran a tricky test, in which I stepped up from 1 to 3 test harnesses, then back down. Note that the tests are nested, with the outer test lasting the longest and the inner test the shortest.

https://grafana-admin.wikimedia.org/dashboard/db/ores?orgId=1&from=1513270920000&to=1513275000000

Test conditions:

  • All 9 celery nodes were functional.
  • We used all 9 web endpoints, but the "scores requested" graph shows that ores1004 had no web workers.
  • tester 1 ran 17:03 – 18:09
  • tester 2 ran 17:11 – 17:51
  • tester 3 ran 17:31 – 17:40

Observations:

  • The overload experience is nice from the consumer's perspective: a 503 error is returned, with a JSON explanation.
  • The Celery queue never grows beyond about 900 elements. This probably means we're correctly rejecting overload requests without submitting them to the work queue.
  • With 2 testers, we were able to cause sustained overloading.
  • 2 testers made it possible to keep all 230 web workers busy, but this didn't increase Celery throughput.
  • The number of scores processed fell from about 540 to 325 req/(min x machine) while overloads were happening.

Based on this report, I think we should go live with this. Any follow-up stress testing can be done in a more experimental context. Do you feel the same @awight?

I'm happy with that. It looks like it's going to be difficult to break through this ceiling, and we have lots of data to compare against when we try again somewhere experimental. The capacity is fine, and overload behavior is much better than before.

Only two concerns:

  • There's something centralized throttling us, and that might be harder to find in an experimental setting, unless we set up a very realistic cluster.
  • Available memory is too close to the line, and I think ores1001 may have died from 18:05-18:10.

I think we should tune uWSGI workers down to celery + 20 or so, leaving just enough spare web workers to detect and respond to overload, and then run one more test. We might as well dial Celery workers down a bit as well, to stay clear of OOM.

Change 398316 had a related patch set uploaded (by Awight; owner: Awight):
[operations/puppet@production] Tune down ORES worker counts on the stress testing cluster

https://gerrit.wikimedia.org/r/398316

Change 398316 merged by Dzahn:
[operations/puppet@production] Tune down ORES worker counts on the stress testing cluster

https://gerrit.wikimedia.org/r/398316

awight changed the task status from Open to Stalled (edited). Dec 14 2017, 9:16 PM

https://grafana-admin.wikimedia.org/dashboard/db/ores?orgId=1&from=1513283400000&to=1513286400000

I think we've got our tuning parameters! 45 minutes of overload, and everything looks good. Available memory hasn't dropped below 8 GB. We'll be eating into this margin as we add models, so it's fine to leave things this way for now.

Cluster performance is at least 3k req/min processed during overload, and 6k req/min in bursts.

We still have some weird artifacts during overload, where the Celery worker performance fluctuates together in a sine wave, and slowly drops on average, but that's definitely a software architecture issue.

I'll put this task on hold and remove it as a blocker.

Interesting hypothesis from IRC conversation: the sine waves could be a garbage collection artifact.

Python includes some toys for debugging GC: https://docs.python.org/3.6/library/gc.html
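
A minimal sketch of what that GC instrumentation could look like, assuming we just want to log collection pauses per generation and correlate them with the throughput graphs (purely illustrative, not wired into ORES):

```
# Log a line for every garbage collection, with its generation and pause time.
import gc
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("gc-probe")

def gc_callback(phase, info):
    if phase == "start":
        gc_callback.started = time.monotonic()
    else:  # phase == "stop"
        pause_ms = (time.monotonic() - gc_callback.started) * 1000
        log.info("gc gen=%s collected=%s uncollectable=%s pause=%.1fms",
                 info["generation"], info["collected"],
                 info["uncollectable"], pause_ms)

gc_callback.started = time.monotonic()
gc.callbacks.append(gc_callback)     # available since Python 3.3
gc.set_debug(gc.DEBUG_STATS)         # optional: also print per-collection stats
```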

Change 396055 abandoned by Awight:
Refactor ORES uWSGI workers to use an absolute count

https://gerrit.wikimedia.org/r/396055

@awight it seems we'll still want that change when we get to the new cluster, won't we?

What's the status of this? Maybe because of the redis task tracker we can already process way more requests. Should we double-check everything?

awight added a comment. Nov 7 2018, 6:34 PM

What's the status of this? Maybe because of the redis task tracker we can already process way more requests. Should we double-check everything?

Interesting! It's hard to do a load test on production; maybe we could accomplish this by announcing an hour of maintenance, during which we disconnect changeprop and disallow outside traffic?

We can have traffic on eqiad only and do crazy things on codfw *shrugs*

Halfak added a comment. Nov 8 2018, 2:26 AM

I like the proposal of depooling one datacenter. What do you think @akosiaris? Is this crazy?

awight added a comment. Nov 8 2018, 2:32 AM

Another thing to consider is that, although we're all very curious about our ceiling, it doesn't really matter until we see real traffic causing overload events. And when that happens, we can study the graphs.

I like the proposal of depooling one datacenter. What do you think @akosiaris? Is this crazy?

Well, we can do it, but for what purpose? Knowing how many req/s on average our infrastructure can handle? As @awight points out, it doesn't really matter until we meet actual problems. It would be helpful for capacity planning, but we've already done some guesstimation for the next couple of years, and that should be sufficient.

Most importantly, the architecture of the infrastructure will change in the coming quarters with the advent of kubernetes. The idea is that we will be much more elastic then, able to spawn more processes to accommodate spikes in demand, at which point the ceiling becomes much more fluid, since we can easily add capacity to the service.

That being said, doing benchmarking tests to figure out our bottlenecks (and, conversely, the capacity of each component) is a worthy goal and would help with the move to kubernetes (especially for autoscaling to meet spikes in demand), but that doesn't need to happen in production; it can happen in any kind of environment.

Another thing to consider is that, although we're all very curious about our ceiling, it doesn't really matter until we see real traffic causing overload events. And when that happens, we can study the graphs.

+1 to that.

Halfak added a comment. Nov 9 2018, 5:01 PM

If we could handle twice the capacity, then we could allow researchers to query us twice as fast. :) We could lift our simultaneous connection limit. We'd achieve this by exploring how high we can push the number of celery workers we run. Such tests would require some reboots and stress testing.

We can't increase the number of celery workers due to memory constraints. Maybe we need to tackle T182350: Profile ORES code memory use, optimize pre-fork allocation first, fix things, and then come back to increasing the number of celery workers. I would advise against lifting the connection limit entirely, but making it a little higher would be nice. Also, providing all ORES scores in the analytics data lake, the brand-new hadoop/presto cluster for labs (which is being built), or on dumps.wikimedia.org would remove most of the researchers' need to query us directly.

Halfak added a comment. Nov 9 2018, 6:55 PM

It looks like we do have a bit more ceiling for memory usage. I half-remembered us tuning our worker count down due to issues with celery 3.x, but upon review, it seems we're operating within expected parameters with regard to memory usage (assuming we keep adding models over time). So I agree with @Ladsgroup: we should probably address the memory usage of individual workers before experimenting with bumping up the number of workers or substantially increasing the max parallel connections.

Change 475487 had a related patch set uploaded (by Ladsgroup; owner: Awight):
[operations/puppet@production] Refactor ORES uWSGI workers to use an absolute count

https://gerrit.wikimedia.org/r/475487

Change 475487 abandoned by Ladsgroup:
Refactor ORES uWSGI workers to use an absolute count

Reason:
It's already done: https://github.com/wikimedia/puppet/blob/production/modules/ores/manifests/web.pp#L4

https://gerrit.wikimedia.org/r/475487

We should probably decline this task in favor of T182331: [Epic] Deploy ORES in kubernetes cluster.

Halfak closed this task as Declined. Jan 17 2019, 3:29 PM

I made T214054: Stress test ORES/kubernetes (above 4.5k scores/second) a subtask of that epic so we can continue our work there.