
Move the article-descriptions model server from staging to production
Closed, Resolved · Public · 4 Estimated Story Points

Description

In T343123#9380663, we migrated the article-descriptions model server from Toolforge to LiftWing staging. Both the Android and the Research teams tested the LiftWing instance and provided feedback. After the ML team resolved the outstanding issues, the Android team gave a green light to move the model server from staging to production.

Event Timeline

kevinbazira changed the task status from Open to In Progress. Feb 26 2024, 8:09 AM
kevinbazira triaged this task as High priority.
kevinbazira set the point value for this task to 4.
kevinbazira moved this task from Unsorted to In Progress on the Machine-Learning-Team board.

Change 1006194 had a related patch set uploaded (by Kevin Bazira; author: Kevin Bazira):

[operations/deployment-charts@master] ml-services: move article-descriptions to prod

https://gerrit.wikimedia.org/r/1006194

Change 1006194 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: move article-descriptions to prod

https://gerrit.wikimedia.org/r/1006194

Before deploying the article-descriptions model server in prod, I tried running helmfile diff for both ml-serve-eqiad and ml-serve-codfw and got the error below:

$ helmfile -e ml-serve-eqiad diff
skipping missing values file matching "/etc/helmfile-defaults/private/ml-serve_services/article-descriptions/ml-serve-eqiad.yaml"
skipping missing values file matching "/etc/helmfile-defaults/private/ml-serve_services/article-descriptions/ml-serve-eqiad.yaml"
skipping missing values file matching "values-ml-serve-eqiad.yaml"
skipping missing values file matching "values-main.yaml"
Comparing release=service-secrets, chart=wmf-stable/secrets
Comparing release=main, chart=wmf-stable/kserve-inference
in ./helmfile.yaml: 2 errors:
err 0: command "/usr/bin/helm3" exited with non-zero status:

PATH:
  /usr/bin/helm3

ARGS:
  0: helm3 (5 bytes)
  1: diff (4 bytes)
  2: upgrade (7 bytes)
  3: --reset-values (14 bytes)
  4: --allow-unreleased (18 bytes)
  5: main (4 bytes)
  6: wmf-stable/kserve-inference (27 bytes)
  7: --namespace (11 bytes)
  8: article-descriptions (20 bytes)
  9: --values (8 bytes)
  10: /tmp/values158807938 (20 bytes)
  11: --values (8 bytes)
  12: /tmp/values279237369 (20 bytes)
  13: --kubeconfig=/etc/kubernetes/article-descriptions-deploy-ml-serve-eqiad.config (78 bytes)

ERROR:
  exit status 1

EXIT STATUS
  1

STDERR:
  Error: Failed to get release main in namespace default: exit status 1: W0226 12:29:54.268182    7996 loader.go:221] Config not found: /etc/kubernetes/article-descriptions-deploy-ml-serve-eqiad.config
  W0226 12:29:54.268315    7996 loader.go:221] Config not found: /etc/kubernetes/article-descriptions-deploy-ml-serve-eqiad.config
  Error: Kubernetes cluster unreachable: Get "http://localhost:8080/version": dial tcp [::1]:8080: connect: connection refused
  Error: plugin "diff" exited with error

COMBINED OUTPUT:
  Error: Failed to get release main in namespace default: exit status 1: W0226 12:29:54.268182    7996 loader.go:221] Config not found: /etc/kubernetes/article-descriptions-deploy-ml-serve-eqiad.config
  W0226 12:29:54.268315    7996 loader.go:221] Config not found: /etc/kubernetes/article-descriptions-deploy-ml-serve-eqiad.config
  Error: Kubernetes cluster unreachable: Get "http://localhost:8080/version": dial tcp [::1]:8080: connect: connection refused
  Error: plugin "diff" exited with error
err 1: command "/usr/bin/helm3" exited with non-zero status:

PATH:
  /usr/bin/helm3

ARGS:
  0: helm3 (5 bytes)
  1: diff (4 bytes)
  2: upgrade (7 bytes)
  3: --reset-values (14 bytes)
  4: --allow-unreleased (18 bytes)
  5: service-secrets (15 bytes)
  6: wmf-stable/secrets (18 bytes)
  7: --namespace (11 bytes)
  8: article-descriptions (20 bytes)
  9: --kubeconfig=/etc/kubernetes/article-descriptions-deploy-ml-serve-eqiad.config (78 bytes)

ERROR:
  exit status 1

EXIT STATUS
  1

STDERR:
  Error: Failed to get release service-secrets in namespace default: exit status 1: W0226 12:29:54.270338    8002 loader.go:221] Config not found: /etc/kubernetes/article-descriptions-deploy-ml-serve-eqiad.config
  W0226 12:29:54.270445    8002 loader.go:221] Config not found: /etc/kubernetes/article-descriptions-deploy-ml-serve-eqiad.config
  Error: Kubernetes cluster unreachable: Get "http://localhost:8080/version": dial tcp [::1]:8080: connect: connection refused
  Error: plugin "diff" exited with error

COMBINED OUTPUT:
  Error: Failed to get release service-secrets in namespace default: exit status 1: W0226 12:29:54.270338    8002 loader.go:221] Config not found: /etc/kubernetes/article-descriptions-deploy-ml-serve-eqiad.config
  W0226 12:29:54.270445    8002 loader.go:221] Config not found: /etc/kubernetes/article-descriptions-deploy-ml-serve-eqiad.config
  Error: Kubernetes cluster unreachable: Get "http://localhost:8080/version": dial tcp [::1]:8080: connect: connection refused
  Error: plugin "diff" exited with error
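
For reference, the repeated "Config not found" lines point at the per-service deploy kubeconfig that helm expects on the deployment host. A minimal sanity check (illustrative) is simply to look for the file:

$ ls -l /etc/kubernetes/article-descriptions-deploy-ml-serve-eqiad.config
# If the file does not exist, helm cannot reach the cluster and falls back to
# http://localhost:8080, which matches the "Kubernetes cluster unreachable" error above.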

Change 1006522 had a related patch set uploaded (by Klausman; author: Klausman):

[labs/private@master] k8s: Add faux secrets for article-descriptions on Lift Wing

https://gerrit.wikimedia.org/r/1006522

Change 1006522 merged by Klausman:

[labs/private@master] k8s: Add faux secrets for article-descriptions on Lift Wing

https://gerrit.wikimedia.org/r/1006522

Change 1006528 had a related patch set uploaded (by Klausman; author: Klausman):

[operations/deployment-charts@master] LiftWing: add missing entry for article-desc certs

https://gerrit.wikimedia.org/r/1006528

Change 1006528 merged by jenkins-bot:

[operations/deployment-charts@master] LiftWing: add missing entry for article-desc certs

https://gerrit.wikimedia.org/r/1006528

After @klausman helped add secrets, deploy configs, and certs, we are now getting this error:

$ helmfile -e ml-serve-eqiad diff
skipping missing values file matching "values-ml-serve-eqiad.yaml"
skipping missing values file matching "values-main.yaml"
Comparing release=service-secrets, chart=wmf-stable/secrets
Comparing release=main, chart=wmf-stable/kserve-inference
in ./helmfile.yaml: 2 errors:
err 0: command "/usr/bin/helm3" exited with non-zero status:

PATH:
  /usr/bin/helm3

ARGS:
  0: helm3 (5 bytes)
  1: diff (4 bytes)
  2: upgrade (7 bytes)
  3: --reset-values (14 bytes)
  4: --allow-unreleased (18 bytes)
  5: service-secrets (15 bytes)
  6: wmf-stable/secrets (18 bytes)
  7: --namespace (11 bytes)
  8: article-descriptions (20 bytes)
  9: --values (8 bytes)
  10: /tmp/values899501062 (20 bytes)
  11: --kubeconfig=/etc/kubernetes/article-descriptions-deploy-ml-serve-eqiad.config (78 bytes)

ERROR:
  exit status 1

EXIT STATUS
  1

STDERR:
  WARNING: Kubernetes configuration file is group-readable. This is insecure. Location: /etc/kubernetes/article-descriptions-deploy-ml-serve-eqiad.config
  Error: Failed to get release service-secrets in namespace article-descriptions: exit status 1: WARNING: Kubernetes configuration file is group-readable. This is insecure. Location: /etc/kubernetes/article-descriptions-deploy-ml-serve-eqiad.config
  Error: query: failed to query with labels: secrets is forbidden: User "article-descriptions-deploy" cannot list resource "secrets" in API group "" in the namespace "article-descriptions"
  Error: plugin "diff" exited with error

COMBINED OUTPUT:
  WARNING: Kubernetes configuration file is group-readable. This is insecure. Location: /etc/kubernetes/article-descriptions-deploy-ml-serve-eqiad.config
  Error: Failed to get release service-secrets in namespace article-descriptions: exit status 1: WARNING: Kubernetes configuration file is group-readable. This is insecure. Location: /etc/kubernetes/article-descriptions-deploy-ml-serve-eqiad.config
  Error: query: failed to query with labels: secrets is forbidden: User "article-descriptions-deploy" cannot list resource "secrets" in API group "" in the namespace "article-descriptions"
  Error: plugin "diff" exited with error
err 1: command "/usr/bin/helm3" exited with non-zero status:

PATH:
  /usr/bin/helm3

ARGS:
  0: helm3 (5 bytes)
  1: diff (4 bytes)
  2: upgrade (7 bytes)
  3: --reset-values (14 bytes)
  4: --allow-unreleased (18 bytes)
  5: main (4 bytes)
  6: wmf-stable/kserve-inference (27 bytes)
  7: --namespace (11 bytes)
  8: article-descriptions (20 bytes)
  9: --values (8 bytes)
  10: /tmp/values448409517 (20 bytes)
  11: --values (8 bytes)
  12: /tmp/values693055784 (20 bytes)
  13: --values (8 bytes)
  14: /tmp/values160432231 (20 bytes)
  15: --kubeconfig=/etc/kubernetes/article-descriptions-deploy-ml-serve-eqiad.config (78 bytes)

ERROR:
  exit status 1

EXIT STATUS
  1

STDERR:
  WARNING: Kubernetes configuration file is group-readable. This is insecure. Location: /etc/kubernetes/article-descriptions-deploy-ml-serve-eqiad.config
  Error: Failed to get release main in namespace article-descriptions: exit status 1: WARNING: Kubernetes configuration file is group-readable. This is insecure. Location: /etc/kubernetes/article-descriptions-deploy-ml-serve-eqiad.config
  Error: query: failed to query with labels: secrets is forbidden: User "article-descriptions-deploy" cannot list resource "secrets" in API group "" in the namespace "article-descriptions"
  Error: plugin "diff" exited with error

COMBINED OUTPUT:
  WARNING: Kubernetes configuration file is group-readable. This is insecure. Location: /etc/kubernetes/article-descriptions-deploy-ml-serve-eqiad.config
  Error: Failed to get release main in namespace article-descriptions: exit status 1: WARNING: Kubernetes configuration file is group-readable. This is insecure. Location: /etc/kubernetes/article-descriptions-deploy-ml-serve-eqiad.config
  Error: query: failed to query with labels: secrets is forbidden: User "article-descriptions-deploy" cannot list resource "secrets" in API group "" in the namespace "article-descriptions"
  Error: plugin "diff" exited with error
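
The "secrets is forbidden" lines indicate that the RBAC for the article-descriptions-deploy user in the article-descriptions namespace is not in place yet. A quick way to confirm this (illustrative; assumes kubectl is available on the deploy host and can use the same kubeconfig as helmfile):

$ kubectl auth can-i list secrets -n article-descriptions \
    --kubeconfig=/etc/kubernetes/article-descriptions-deploy-ml-serve-eqiad.config
# "no" means the deploy user lacks the permission that helm-diff needs, matching the error above.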

I had missed pushing the admin_ng change. That has now been fixed, so pushing the model server config should work.
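
For the record, the order of operations is: apply the admin_ng change first (admin_ng is what sets up the namespace, RBAC, and quota settings), then deploy the service itself. A rough sketch of the two steps on the deployment host (directory paths are illustrative, not verified):

# 1. admin_ng: namespace, RBAC, resource quotas
$ cd /srv/deployment-charts/helmfile.d/admin_ng
$ helmfile -e ml-serve-eqiad diff
$ helmfile -e ml-serve-eqiad apply

# 2. the article-descriptions service itself
$ cd /srv/deployment-charts/helmfile.d/ml-services/article-descriptions
$ helmfile -e ml-serve-eqiad diff
$ helmfile -e ml-serve-eqiad apply

The same two steps are then repeated for ml-serve-codfw.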

Thanks @klausman. As discussed yesterday, with the current configuration a request that took <3s on staging now takes >8s in prod, as shown below:

$ time curl "https://inference.svc.codfw.wmnet:30443/v1/models/article-descriptions:predict" -X POST -d '{"lang": "en", "title": "Clandonald", "num_beams": 2, "debug": 1}' -H  "Host: article-descriptions.article-descriptions.wikimedia.org" -H "Content-Type: application/json" --http1.1
{"lang":"en","title":"Clandonald","blp":false,"num_beams":2,"groundtruth":"Hamlet in Alberta, Canada","latency":{"wikidata-info (s)":0.07287430763244629,"mwapi - first paragraphs (s)":0.27934741973876953,"total network (s)":0.3130209445953369,"model (s)":7.659532070159912,"total (s)":7.9725823402404785},"features":{"descriptions":{"fr":"hameau d'Alberta","en":"hamlet in central Alberta, Canada"},"first-paragraphs":{"en":"Clandonald is a hamlet in central Alberta, Canada within the County of Vermilion River. It is located approximately 28 kilometres (17 mi) north of Highway 16 and 58 kilometres (36 mi) northwest of Lloydminster.","fr":"Clandonald est un hameau (hamlet) du Comté de Vermilion River, situé dans la province canadienne d'Alberta."}},"prediction":["Hamlet in Alberta, Canada","human settlement in Alberta, Canada"]}
real	0m8.049s
user	0m0.014s
sys	0m0.001s

These results are close to what we saw in T353127#9398823: a model server with 4 CPUs took >8s, and with 8 CPUs the response time dropped to <4s.

Based on our findings in T353127#9421055, 16 CPUs was the configuration that brought the response time down to roughly the <3s mark, which is the Android team's requirement for this service.

Is it possible to increase the CPU allocation in prod from the current 6 CPUs to 16?
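
Once the allocation is raised, a quick way to re-check latency over a handful of identical requests is a loop like the one below (a rough sketch using the same endpoint and payload as above; it prints only the total time per request):

for i in $(seq 1 5); do
  curl -s -o /dev/null -w "%{time_total}s\n" \
    "https://inference.svc.codfw.wmnet:30443/v1/models/article-descriptions:predict" \
    -X POST -d '{"lang": "en", "title": "Clandonald", "num_beams": 2}' \
    -H "Host: article-descriptions.article-descriptions.wikimedia.org" \
    -H "Content-Type: application/json" --http1.1
done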

Change 1006855 had a related patch set uploaded (by Klausman; author: Klausman):

[operations/deployment-charts@master] admin_ng/ml-services: raise request maximums for art-desc

https://gerrit.wikimedia.org/r/1006855

Change 1006855 merged by jenkins-bot:

[operations/deployment-charts@master] admin_ng/ml-services: raise request maximums for art-desc

https://gerrit.wikimedia.org/r/1006855

Change 1006866 had a related patch set uploaded (by Kevin Bazira; author: Kevin Bazira):

[operations/deployment-charts@master] ml-services: increase article-descriptions CPUs

https://gerrit.wikimedia.org/r/1006866

Change 1006866 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: increase article-descriptions CPUs

https://gerrit.wikimedia.org/r/1006866

I just got an error when querying the service:

$ time curl "https://inference.svc.codfw.wmnet:30443/v1/models/article-descriptions:predict" -X POST -d '{"lang": "en", "title": "Clandonald", "num_beams": 2, "debug": 1}' -H  "Host: article-descriptions.article-descriptions.wikimedia.org" -H "Content-Type: application/json" --http1.1
{"error":"AttributeError : 'NoneType' object has no attribute 'shape'"}
real    0m0.317s
user    0m0.005s
sys     0m0.009s

From the service logs:

2024-02-27 10:38:50.321 uvicorn.error ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/opt/lib/python/site-packages/uvicorn/protocols/http/httptools_impl.py", line 419, in run_asgi                                                                                                 
    result = await app(  # type: ignore[func-returns-value]
  File "/opt/lib/python/site-packages/uvicorn/middleware/proxy_headers.py", line 78, in __call__
    return await self.app(scope, receive, send)
  File "/opt/lib/python/site-packages/fastapi/applications.py", line 276, in __call__
    await super().__call__(scope, receive, send)
  File "/opt/lib/python/site-packages/starlette/applications.py", line 122, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/opt/lib/python/site-packages/starlette/middleware/errors.py", line 184, in __call__
    raise exc
  File "/opt/lib/python/site-packages/starlette/middleware/errors.py", line 162, in __call__
    await self.app(scope, receive, _send)
  File "/opt/lib/python/site-packages/timing_asgi/middleware.py", line 70, in __call__
    await self.app(scope, receive, send_wrapper)
  File "/opt/lib/python/site-packages/starlette/middleware/exceptions.py", line 79, in __call__
    raise exc
  File "/opt/lib/python/site-packages/starlette/middleware/exceptions.py", line 68, in __call__
    await self.app(scope, receive, sender)
  File "/opt/lib/python/site-packages/fastapi/middleware/asyncexitstack.py", line 21, in __call__
    raise e
  File "/opt/lib/python/site-packages/fastapi/middleware/asyncexitstack.py", line 18, in __call__
    await self.app(scope, receive, send)
  File "/opt/lib/python/site-packages/starlette/routing.py", line 718, in __call__
    await route.handle(scope, receive, send)
  File "/opt/lib/python/site-packages/starlette/routing.py", line 276, in handle
    await self.app(scope, receive, send)
  File "/opt/lib/python/site-packages/starlette/routing.py", line 66, in app
    response = await func(request)
  File "/opt/lib/python/site-packages/fastapi/routing.py", line 237, in app
    raw_response = await run_endpoint_function(
  File "/opt/lib/python/site-packages/fastapi/routing.py", line 163, in run_endpoint_function
    return await dependant.call(**values)
  File "/opt/lib/python/site-packages/kserve/protocol/rest/v1_endpoints.py", line 76, in predict
    response, response_headers = await self.dataplane.infer(model_name=model_name,
  File "/opt/lib/python/site-packages/kserve/protocol/dataplane.py", line 311, in infer
    response = await model(request, headers=headers)
  File "/opt/lib/python/site-packages/kserve/model.py", line 122, in __call__
    else self.predict(payload, headers)
  File "/srv/article_descriptions/model_server/model.py", line 103, in predict
    prediction = self.model.predict(
  File "/srv/article_descriptions/model_server/utils.py", line 149, in predict
    tokens = self.model.generate(
  File "/opt/lib/python/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/srv/article_descriptions/model_server/descartes/src/models/descartes_mbart.py", line 607, in generate                                                                                         
    model_kwargs = self._prepare_encoder_decoder_kwargs_for_generation(
  File "/srv/article_descriptions/model_server/descartes/src/models/descartes_mbart.py", line 401, in _prepare_encoder_decoder_kwargs_for_generation                                                   
    lang_out = torch.ones((attention_mask[target_lang].shape[0], 1, 1),
AttributeError: 'NoneType' object has no attribute 'shape'

@klausman helped raise the caps on this model server's resource constraints. I pushed a patch that increased the number of CPUs used by the article-descriptions model server from 6 to 16 so that prod can match staging performance. The request we tested earlier in T358467#9579190 has dropped from >8s to <3s:

$ time curl "https://inference.svc.codfw.wmnet:30443/v1/models/article-descriptions:predict" -X POST -d '{"lang": "en", "title": "Clandonald", "num_beams": 2, "debug": 1}' -H  "Host: article-descriptions.article-descriptions.wikimedia.org"
{"lang":"en","title":"Clandonald","blp":false,"num_beams":2,"groundtruth":"Hamlet in Alberta, Canada","latency":{"wikidata-info (s)":0.043555498123168945,"mwapi - first paragraphs (s)":0.22956347465515137,"total network (s)":0.2606837749481201,"model (s)":2.4616105556488037,"total (s)":2.7223129272460938},"features":{"descriptions":{"fr":"hameau d'Alberta","en":"hamlet in central Alberta, Canada"},"first-paragraphs":{"en":"Clandonald is a hamlet in central Alberta, Canada within the County of Vermilion River. It is located approximately 28 kilometres (17 mi) north of Highway 16 and 58 kilometres (36 mi) northwest of Lloydminster.","fr":"Clandonald est un hameau (hamlet) du Comté de Vermilion River, situé dans la province canadienne d'Alberta."}},"prediction":["Hamlet in Alberta, Canada","human settlement in Alberta, Canada"]}
real	0m2.744s
user	0m0.000s
sys	0m0.013s

One addendum on the "'NoneType' object has no attribute 'shape'" error: it happened only once; the same request worked just fine seconds later (and earlier).
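
If we want to gauge how often it recurs, a simple repeat-and-count check against the same request would do (illustrative sketch; it appends a newline to each response and counts those containing an "error" field):

for i in $(seq 1 20); do
  curl -s -w "\n" \
    "https://inference.svc.codfw.wmnet:30443/v1/models/article-descriptions:predict" \
    -X POST -d '{"lang": "en", "title": "Clandonald", "num_beams": 2}' \
    -H "Host: article-descriptions.article-descriptions.wikimedia.org" \
    -H "Content-Type: application/json" --http1.1
done | grep -c '"error"'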

Change 1006870 had a related patch set uploaded (by Kevin Bazira; author: Kevin Bazira):

[operations/deployment-charts@master] ml-services: increase article-descriptions memory

https://gerrit.wikimedia.org/r/1006870

Change 1006870 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: increase article-descriptions memory

https://gerrit.wikimedia.org/r/1006870

The article-descriptions model server was firing InfServiceHighMemoryUsage alerts. This alert usually fires when an isvc uses >90% of its memory limit for 5 minutes. I have increased this model server's memory limit from 4Gi to 5Gi so that prod can handle more isvc requests without running out of memory.

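To keep an eye on actual usage against the new 5Gi limit, per-pod memory can be checked with something like the following (illustrative; assumes the metrics API is available and kubectl access to the namespace):

$ kubectl top pod -n article-descriptions
# compare MEMORY(bytes) against the 5Gi container limit; the alert fires when
# usage stays above ~90% of the limit for about 5 minutes.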

With the changes in https://phabricator.wikimedia.org/T358742#9601035, this should not be a problem anymore.