
Explore optimizations/scaling for Revise Tone Task Generator in LiftWing
Open, Needs Triage, Public

Description

The Revise Tone Task Generator is deployed on production.

According to our Istio dashboard, traffic oscillates between 1 and 2 requests per second, with response times of ~180ms at p50, ~500ms at p95 and ~1s at p99. This traffic comes from Changeprop and consists of edit events on the en, pt, fr, ar and test wikis. We don't expect any traffic other than Changeprop.

We are currently able to sustain the traffic. However, if the traffic were to at least double, we want to explore options for scaling that save resources by not requiring multiple GPUs:

  1. Explore the performance of CPU-only deployment
  2. Explore running multiple workers in one pod. This would let multiple workers share 1 GPU, but it has not yet been done with any other LiftWing model.

Event Timeline

Change #1215112 had a related patch set uploaded (by Bartosz Wójtowicz; author: Bartosz Wójtowicz):

[operations/deployment-charts@master] ml-services: Deploy experimental CPU-only revise-tone-task-generator.

https://gerrit.wikimedia.org/r/1215112

Change #1215112 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: Deploy experimental CPU-only revise-tone-task-generator.

https://gerrit.wikimedia.org/r/1215112

Change #1216754 had a related patch set uploaded (by Bartosz Wójtowicz; author: Bartosz Wójtowicz):

[machinelearning/liftwing/inference-services@main] revise-tone-task-generator: Use multiple workers in a single deployment.

https://gerrit.wikimedia.org/r/1216754

Change #1216754 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] revise-tone-task-generator: Use multiple workers in a single deployment.

https://gerrit.wikimedia.org/r/1216754

Change #1216787 had a related patch set uploaded (by Bartosz Wójtowicz; author: Bartosz Wójtowicz):

[operations/deployment-charts@master] ml-services: Update experimental revise-tone-task-generator.

https://gerrit.wikimedia.org/r/1216787

Change #1216787 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: Update experimental revise-tone-task-generator.

https://gerrit.wikimedia.org/r/1216787

Here is a small update with early experimentation results.

Currently used resources

  1. Defined container resources (copied from edit-check):
    • 4 CPUs
    • 8GB memory
    • 1 GPU with 16GB VRAM
  2. Utilization on prod (https://grafana.wikimedia.org/goto/hKuMrVGDg?orgId=1):
    • ~1% CPU usage
    • 1.9GB memory
    • GPU: No way for MLE to check? Model weighs ~2GB

CPU-only model:

Median inference time goes from ~5ms to ~620ms.
Current median response time is ~180ms.
This means that requests which run inference would be slightly over 4x slower (~180ms to ~795ms). How many requests are affected depends on how often we run inference, which happens only when the topic matching criteria are met.

Estimating percentage of requests with matching topics based on the logs from the last 24 hours:

  • Topics matching: 37.5% (54,381 logs)
  • Topics not matching: 62.5% (90,724 logs)

This means we'd add ~225ms per request on average, making the CPU-only model ~2.3x slower on average.
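The estimate above can be reproduced with a few lines of arithmetic. This is only a sketch: it assumes the extra CPU inference time (~620ms minus ~5ms) is simply added to the current ~180ms median for matching requests, and it mixes medians with averages, so treat the output as a ballpark. With the exact figures it comes out at roughly 230ms added, in the same range as the ~225ms quoted above.

```python
# Figures quoted in this update.
MATCHING_LOGS = 54_381        # requests where topics matched (inference runs)
NON_MATCHING_LOGS = 90_724    # requests where topics did not match
GPU_INFERENCE_MS = 5.0        # median inference time on GPU
CPU_INFERENCE_MS = 620.0      # median inference time on CPU
CURRENT_MEDIAN_RESPONSE_MS = 180.0


def cpu_only_estimate(matching, non_matching, gpu_ms, cpu_ms, baseline_ms):
    """Return (match_fraction, added_ms, avg_slowdown, matching_slowdown).

    Assumption: a matching request gets (cpu_ms - gpu_ms) added on top of
    the current baseline; non-matching requests are unchanged.
    """
    match_fraction = matching / (matching + non_matching)
    extra_per_inference_ms = cpu_ms - gpu_ms
    added_ms = match_fraction * extra_per_inference_ms
    avg_slowdown = (baseline_ms + added_ms) / baseline_ms
    matching_slowdown = (baseline_ms + extra_per_inference_ms) / baseline_ms
    return match_fraction, added_ms, avg_slowdown, matching_slowdown


fraction, added_ms, avg_slowdown, matching_slowdown = cpu_only_estimate(
    MATCHING_LOGS,
    NON_MATCHING_LOGS,
    GPU_INFERENCE_MS,
    CPU_INFERENCE_MS,
    CURRENT_MEDIAN_RESPONSE_MS,
)

print(f"share of requests running inference: {fraction:.1%}")
print(f"average added latency: ~{added_ms:.0f} ms")
print(f"average slowdown: ~{avg_slowdown:.1f}x")
print(f"slowdown for matching requests: ~{matching_slowdown:.1f}x")
```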


Multiple workers on GPU:

I performed early load tests on staging-experimental with the following setup modifications/assumptions:

  • Excluded sending events to Eventgate to avoid pollution
  • Used ingestion dataset as input (it only contains matching article topics, so we always run inference)
  • 10 actors sending concurrent requests
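For readers who want to picture the setup, here is a minimal sketch of a load test with a fixed number of concurrent actors. It is not the tool that was actually used: `run_load_test` and `fake_send` are hypothetical names, and `fake_send` only sleeps instead of calling the service, so the numbers it prints say nothing about the real deployment.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_load_test(send, payloads, actors=10):
    """Send every payload through `send` with at most `actors` in flight.

    Returns (completed_requests, elapsed_seconds, requests_per_second).
    """
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=actors) as pool:
        # pool.map keeps up to `actors` calls running concurrently.
        results = list(pool.map(send, payloads))
    elapsed = time.perf_counter() - start
    return len(results), elapsed, len(results) / elapsed


def fake_send(payload):
    """Hypothetical stand-in for one request to the service."""
    time.sleep(0.01)  # pretend the request takes 10 ms
    return 200


if __name__ == "__main__":
    completed, elapsed, rps = run_load_test(fake_send, range(200), actors=10)
    print(f"{completed} requests in {elapsed:.2f}s ({rps:.1f} RPS)")
```

Replacing `fake_send` with a function that posts one row of the input dataset to the service would turn this into a real measurement.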

Results under those conditions:

  • 1 worker: ~14 RPS
  • 2 workers: ~19 RPS, 3GB mem usage, ~2.5% CPU usage
  • 4 workers: ~27 RPS, 5GB mem usage, ~8% CPU usage
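To put these numbers in context, the sketch below computes the speedup relative to 1 worker, the per-worker scaling efficiency, and the headroom over a doubled version of today's peak traffic (~2 RPS, from the task description). The headroom figure is an optimistic comparison under stated assumptions: the load test always ran inference and used 10 concurrent actors, while production traffic only runs inference for part of the requests.

```python
# Load test results quoted above: workers -> requests per second.
MEASURED_RPS = {1: 14.0, 2: 19.0, 4: 27.0}
CURRENT_PEAK_RPS = 2.0   # upper end of today's 1-2 RPS
GROWTH_FACTOR = 2.0      # "traffic at least doubles"


def scaling_report(measured, current_peak_rps, growth_factor):
    """Return rows of (workers, rps, speedup, efficiency, headroom)."""
    base_rps = measured[1]
    target_rps = current_peak_rps * growth_factor
    rows = []
    for workers in sorted(measured):
        rps = measured[workers]
        speedup = rps / base_rps         # throughput relative to 1 worker
        efficiency = speedup / workers   # 1.0 would be perfect linear scaling
        headroom = rps / target_rps      # multiple of the doubled traffic
        rows.append((workers, rps, speedup, efficiency, headroom))
    return rows


rows = scaling_report(MEASURED_RPS, CURRENT_PEAK_RPS, GROWTH_FACTOR)
for workers, rps, speedup, efficiency, headroom in rows:
    print(
        f"{workers} worker(s): {rps:.0f} RPS, "
        f"speedup {speedup:.2f}x, efficiency {efficiency:.0%}, "
        f"headroom {headroom:.2f}x over doubled traffic"
    )
```

Throughput grows sub-linearly with the number of workers (about 1.9x at 4 workers), which is worth keeping in mind when weighing extra workers against extra pods.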

Next steps:

  1. Experiment with multiple workers on a CPU-only deployment. This could potentially meet our RPS demands without needing GPUs. However, the same can be achieved more safely and with better scalability by simply increasing the number of CPU-only pods.
  2. Find a good balance of CPU/MEM resources for single-worker and multi-worker edit-check deployments.
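As a starting point for the second step, the memory figures from the load tests (3GB with 2 workers, 5GB with 4 workers) can be extrapolated with a straight line. This is a rough sketch with only two data points. The linear model and the 20% safety margin are assumptions for illustration, not measured facts, and the helper names are hypothetical.

```python
# Memory usage observed in the load tests: workers -> GB.
OBSERVED_MEMORY_GB = {2: 3.0, 4: 5.0}
MEMORY_LIMIT_GB = 8.0    # currently defined container memory
SAFETY_MARGIN = 0.2      # assumed fraction of the limit to keep free


def fit_line(points):
    """Fit memory = intercept + slope * workers through two points."""
    (w1, m1), (w2, m2) = sorted(points.items())
    slope = (m2 - m1) / (w2 - w1)
    intercept = m1 - slope * w1
    return intercept, slope


def max_workers_within(limit_gb, intercept, slope, safety_margin):
    """Largest worker count whose estimated memory fits under the margin."""
    usable_gb = limit_gb * (1 - safety_margin)
    return int((usable_gb - intercept) // slope)


intercept, slope = fit_line(OBSERVED_MEMORY_GB)
estimate_one_worker = intercept + slope * 1
workers_that_fit = max_workers_within(
    MEMORY_LIMIT_GB, intercept, slope, SAFETY_MARGIN
)

print(f"base: {intercept:.1f} GB, per worker: {slope:.1f} GB")
print(f"estimated memory for 1 worker: {estimate_one_worker:.1f} GB")
print(f"workers fitting in {MEMORY_LIMIT_GB:.0f} GB: {workers_that_fit}")
```

The fit gives roughly 1GB of base usage plus 1GB per worker. The resulting 2GB estimate for a single worker is close to the 1.9GB seen on prod, which suggests the linear assumption is not far off in this range.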

Change #1229059 had a related patch set uploaded (by Bartosz Wójtowicz; author: Bartosz Wójtowicz):

[operations/deployment-charts@master] ml-services: Enable multiple workers for revise tone service.

https://gerrit.wikimedia.org/r/1229059

Change #1229059 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: Enable multiple workers for revise tone service.

https://gerrit.wikimedia.org/r/1229059