
Shellbox resource management
Closed, ResolvedPublic

Description

Shellbox is inherently bursty in its CPU usage: it uses almost no resources, almost all the time, but in response to certain classes of request it experiences brief spikes of load which exceed its CPU limit.

At the time Shellbox was designed and deployed, we discussed how best to provision it: give it too few resources, and it won't be able to sustain those spikes; give it too many, and all that hardware will sit idle most of the time. We settled on a compromise value, which suffices most of the time but still runs into CPU exhaustion sometimes, as in T310298 last week. This task is to revisit that decision and see if we can come up with a way to serve those bursts without wasting resources. (The answer might be that the status quo is still the best we can practically do for now.)

In response to last week's incident, we doubled the number of replicas; we may or may not revert that.

A couple of thoughts from the serviceops meeting:

  • Autoscaling (e.g. on CPU utilization) is the obvious approach, but as @JMeybohm points out, we'd need to examine how long these spikes last: if they're too brief, the traffic will be gone by the time new replicas are available, so that would be no help.
  • @JMeybohm suggested this might be a good use case for running without a CPU limit, which we haven't previously done on wikikube. All our services have a CPU request defined, so Shellbox's CPU consumption shouldn't cause exhaustion for another tenant, but being able to burst into the shared capacity would give us a nice cushion. Resource overcommitment is still the risk, though: we'll need to be careful about expecting multiple services to be able to creep into the same set of resources at once, especially with mw-on-k8s. (A rough sketch of what this could look like is below.)
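
To make the two ideas above concrete, here is a rough sketch in plain Kubernetes terms, not a proposed patch: a container with a CPU request but no CPU limit, plus a CPU-utilization HPA. In production this would be expressed through the shellbox chart's values in operations/deployment-charts rather than raw manifests, and every number below is a placeholder rather than a tuned value.

```yaml
# Hypothetical sketch only: names and numbers are illustrative,
# not taken from the actual shellbox chart or deployment.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: shellbox
spec:
  # replicas intentionally omitted here; the HPA below would manage it
  template:
    spec:
      containers:
        - name: shellbox
          resources:
            requests:
              cpu: "1"        # guaranteed share; protects other tenants
              memory: 512Mi
            limits:
              memory: 512Mi   # keep a memory limit, but omit cpu:
                              # the pod may burst into idle node capacity
---
# CPU-based autoscaling; utilization is measured against the request above.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: shellbox
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: shellbox
  minReplicas: 8
  maxReplicas: 24
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 80
```

Whether the HPA half helps at all depends on the spike-duration question above: if the bursts are shorter than pod startup plus the metrics scrape interval, only the limit change buys us anything.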

Event Timeline

RLazarus triaged this task as Medium priority. Jun 13 2022, 11:03 PM

Did we determine whether the most recent spike was legitimate user traffic or malicious/DoS?

The Abstract Wikipedia team has a proposal somewhere for rendering some fragments asynchronously; we could do a simpler version of that: if Shellbox returns an error, log it but keep rendering the page, show a placeholder message like "Please wait a few minutes for this score to be rendered" in place of the <score>, and queue a delayed job (a minute or two later) to re-render the page, on the assumption that the Shellbox spike is over by then. We could also protect Shellbox with PoolCounter so that requests don't all hit Shellbox and fail.

Another idea: we could combine some of the Shellbox pools so that the overprovisioning problem isn't as bad. Downside: we lose some of the security isolation.

Also, one of the Wikisources has some Lua magic that renders each score something like four times because they're PNGs. I think if we switched to or enabled SVG rendering (T49578) we could cut that down to just one, which would hopefully reduce traffic significantly.

My 2 cents:

  • Allowing Shellbox to burst beyond its CPU limit seems like the right easy first thing to try. There's little risk in enabling this for a few services, and (AFAIK?) depending on the caller, Shellbox sits somewhere between "latency-critical user query path" and "batch processing".
  • If we can differentiate between Shellbox jobs that are latency-critical and ones that aren't, we should, and we should only allow the more batch-y jobs to use a certain fraction of the workers (probably using PoolCounter). This could also potentially make autoscaling more effective.
matmarex merged tasks: Restricted Task, Restricted Task. Jul 5 2022, 3:57 PM
matmarex added subscribers: TheresNoTime, cscott, matmarex.

Cross-referencing: https://wikitech.wikimedia.org/wiki/Incidents/2022-07-03_shellbox_request_spike (it links to this task)

(I think the tasks I merged are related to that incident)

Change 814267 had a related patch set uploaded (by RLazarus; author: RLazarus):

[operations/deployment-charts@master] shellbox: Bump replicas to 24 to support request spikes

https://gerrit.wikimedia.org/r/814267

Change 814267 merged by jenkins-bot:

[operations/deployment-charts@master] shellbox: Bump replicas to 24 to support request spikes

https://gerrit.wikimedia.org/r/814267

With https://gerrit.wikimedia.org/r/813924 we ought to see smaller bursts in utilization, so I'm going to tentatively crank the shellbox replicas back down to 8, where they were before https://gerrit.wikimedia.org/r/803953.
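For anyone following along, the change itself is just the replica count in the chart values; something along these lines, though I'm sketching the key path from memory rather than quoting the patch:

```yaml
# helmfile.d values for the shellbox release (key path assumed,
# not copied from the actual change):
resources:
  replicas: 8   # back down from the temporary 24
```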

Change 816873 had a related patch set uploaded (by RLazarus; author: RLazarus):

[operations/deployment-charts@master] shellbox: Restore replicas to 8, now that T312319 is resolved.

https://gerrit.wikimedia.org/r/816873

Change 816873 merged by jenkins-bot:

[operations/deployment-charts@master] shellbox: Restore replicas to 8, now that T312319 is resolved.

https://gerrit.wikimedia.org/r/816873

RLazarus claimed this task.