In T343123#9380663, we migrated the article-descriptions model server from Toolforge to LiftWing staging. Both the Android and the Research teams tested the LiftWing instance and provided feedback. After the ML team resolved the outstanding issues, the Android team gave a green light to move the model server from staging to production.
Status | Assigned | Task
---|---|---
Open | kevinbazira | T343123 Migrate Machine-generated Article Descriptions from toolforge to liftwing.
Resolved | kevinbazira | T358467 Move the article-descriptions model server from staging to production
Resolved | klausman | T358654 Create external endpoint for article-descriptions isvc hosted on LiftWing
Open | klausman | T358655 Set SLO for the article-descriptions isvc hosted on LiftWing
Event Timeline
Change 1006194 had a related patch set uploaded (by Kevin Bazira; author: Kevin Bazira):
[operations/deployment-charts@master] ml-services: move article-descriptions to prod
Change 1006194 merged by jenkins-bot:
[operations/deployment-charts@master] ml-services: move article-descriptions to prod
Before deploying the article-descriptions model server in prod, I tried running helmfile -e ml-serve-* diff for both *eqiad and *codfw and got the error below:
$ helmfile -e ml-serve-eqiad diff
skipping missing values file matching "/etc/helmfile-defaults/private/ml-serve_services/article-descriptions/ml-serve-eqiad.yaml"
skipping missing values file matching "/etc/helmfile-defaults/private/ml-serve_services/article-descriptions/ml-serve-eqiad.yaml"
skipping missing values file matching "values-ml-serve-eqiad.yaml"
skipping missing values file matching "values-main.yaml"
Comparing release=service-secrets, chart=wmf-stable/secrets
Comparing release=main, chart=wmf-stable/kserve-inference
in ./helmfile.yaml: 2 errors:
err 0: command "/usr/bin/helm3" exited with non-zero status:
PATH:
  /usr/bin/helm3
ARGS:
  0: helm3 (5 bytes)
  1: diff (4 bytes)
  2: upgrade (7 bytes)
  3: --reset-values (14 bytes)
  4: --allow-unreleased (18 bytes)
  5: main (4 bytes)
  6: wmf-stable/kserve-inference (27 bytes)
  7: --namespace (11 bytes)
  8: article-descriptions (20 bytes)
  9: --values (8 bytes)
  10: /tmp/values158807938 (20 bytes)
  11: --values (8 bytes)
  12: /tmp/values279237369 (20 bytes)
  13: --kubeconfig=/etc/kubernetes/article-descriptions-deploy-ml-serve-eqiad.config (78 bytes)
ERROR:
  exit status 1
EXIT STATUS
  1
STDERR:
  Error: Failed to get release main in namespace default: exit status 1: W0226 12:29:54.268182 7996 loader.go:221] Config not found: /etc/kubernetes/article-descriptions-deploy-ml-serve-eqiad.config
  W0226 12:29:54.268315 7996 loader.go:221] Config not found: /etc/kubernetes/article-descriptions-deploy-ml-serve-eqiad.config
  Error: Kubernetes cluster unreachable: Get "http://localhost:8080/version": dial tcp [::1]:8080: connect: connection refused
  Error: plugin "diff" exited with error
COMBINED OUTPUT:
  Error: Failed to get release main in namespace default: exit status 1: W0226 12:29:54.268182 7996 loader.go:221] Config not found: /etc/kubernetes/article-descriptions-deploy-ml-serve-eqiad.config
  W0226 12:29:54.268315 7996 loader.go:221] Config not found: /etc/kubernetes/article-descriptions-deploy-ml-serve-eqiad.config
  Error: Kubernetes cluster unreachable: Get "http://localhost:8080/version": dial tcp [::1]:8080: connect: connection refused
  Error: plugin "diff" exited with error
err 1: command "/usr/bin/helm3" exited with non-zero status:
PATH:
  /usr/bin/helm3
ARGS:
  0: helm3 (5 bytes)
  1: diff (4 bytes)
  2: upgrade (7 bytes)
  3: --reset-values (14 bytes)
  4: --allow-unreleased (18 bytes)
  5: service-secrets (15 bytes)
  6: wmf-stable/secrets (18 bytes)
  7: --namespace (11 bytes)
  8: article-descriptions (20 bytes)
  9: --kubeconfig=/etc/kubernetes/article-descriptions-deploy-ml-serve-eqiad.config (78 bytes)
ERROR:
  exit status 1
EXIT STATUS
  1
STDERR:
  Error: Failed to get release service-secrets in namespace default: exit status 1: W0226 12:29:54.270338 8002 loader.go:221] Config not found: /etc/kubernetes/article-descriptions-deploy-ml-serve-eqiad.config
  W0226 12:29:54.270445 8002 loader.go:221] Config not found: /etc/kubernetes/article-descriptions-deploy-ml-serve-eqiad.config
  Error: Kubernetes cluster unreachable: Get "http://localhost:8080/version": dial tcp [::1]:8080: connect: connection refused
  Error: plugin "diff" exited with error
COMBINED OUTPUT:
  Error: Failed to get release service-secrets in namespace default: exit status 1: W0226 12:29:54.270338 8002 loader.go:221] Config not found: /etc/kubernetes/article-descriptions-deploy-ml-serve-eqiad.config
  W0226 12:29:54.270445 8002 loader.go:221] Config not found: /etc/kubernetes/article-descriptions-deploy-ml-serve-eqiad.config
  Error: Kubernetes cluster unreachable: Get "http://localhost:8080/version": dial tcp [::1]:8080: connect: connection refused
  Error: plugin "diff" exited with error
Change rMW10065220e29c had a related patch set uploaded (by Klausman; author: Klausman):
[labs/private@master] k8s: Add faux secrest for article-descriptions on Lift Wing
Change rMW10065220e29c merged by Klausman:
[labs/private@master] k8s: Add faux secrest for article-descriptions on Lift Wing
Change 1006528 had a related patch set uploaded (by Klausman; author: Klausman):
[operations/deployment-charts@master] LiftWing: add missing entry for article-desc certs
Change 1006528 merged by jenkins-bot:
[operations/deployment-charts@master] LiftWing: add missing entry for article-desc certs
After @klausman helped add secrets, deploy configs, and certs, we are now getting this error:
$ helmfile -e ml-serve-eqiad diff
skipping missing values file matching "values-ml-serve-eqiad.yaml"
skipping missing values file matching "values-main.yaml"
Comparing release=service-secrets, chart=wmf-stable/secrets
Comparing release=main, chart=wmf-stable/kserve-inference
in ./helmfile.yaml: 2 errors:
err 0: command "/usr/bin/helm3" exited with non-zero status:
PATH:
  /usr/bin/helm3
ARGS:
  0: helm3 (5 bytes)
  1: diff (4 bytes)
  2: upgrade (7 bytes)
  3: --reset-values (14 bytes)
  4: --allow-unreleased (18 bytes)
  5: service-secrets (15 bytes)
  6: wmf-stable/secrets (18 bytes)
  7: --namespace (11 bytes)
  8: article-descriptions (20 bytes)
  9: --values (8 bytes)
  10: /tmp/values899501062 (20 bytes)
  11: --kubeconfig=/etc/kubernetes/article-descriptions-deploy-ml-serve-eqiad.config (78 bytes)
ERROR:
  exit status 1
EXIT STATUS
  1
STDERR:
  WARNING: Kubernetes configuration file is group-readable. This is insecure. Location: /etc/kubernetes/article-descriptions-deploy-ml-serve-eqiad.config
  Error: Failed to get release service-secrets in namespace article-descriptions: exit status 1: WARNING: Kubernetes configuration file is group-readable. This is insecure. Location: /etc/kubernetes/article-descriptions-deploy-ml-serve-eqiad.config
  Error: query: failed to query with labels: secrets is forbidden: User "article-descriptions-deploy" cannot list resource "secrets" in API group "" in the namespace "article-descriptions"
  Error: plugin "diff" exited with error
COMBINED OUTPUT:
  WARNING: Kubernetes configuration file is group-readable. This is insecure. Location: /etc/kubernetes/article-descriptions-deploy-ml-serve-eqiad.config
  Error: Failed to get release service-secrets in namespace article-descriptions: exit status 1: WARNING: Kubernetes configuration file is group-readable. This is insecure. Location: /etc/kubernetes/article-descriptions-deploy-ml-serve-eqiad.config
  Error: query: failed to query with labels: secrets is forbidden: User "article-descriptions-deploy" cannot list resource "secrets" in API group "" in the namespace "article-descriptions"
  Error: plugin "diff" exited with error
err 1: command "/usr/bin/helm3" exited with non-zero status:
PATH:
  /usr/bin/helm3
ARGS:
  0: helm3 (5 bytes)
  1: diff (4 bytes)
  2: upgrade (7 bytes)
  3: --reset-values (14 bytes)
  4: --allow-unreleased (18 bytes)
  5: main (4 bytes)
  6: wmf-stable/kserve-inference (27 bytes)
  7: --namespace (11 bytes)
  8: article-descriptions (20 bytes)
  9: --values (8 bytes)
  10: /tmp/values448409517 (20 bytes)
  11: --values (8 bytes)
  12: /tmp/values693055784 (20 bytes)
  13: --values (8 bytes)
  14: /tmp/values160432231 (20 bytes)
  15: --kubeconfig=/etc/kubernetes/article-descriptions-deploy-ml-serve-eqiad.config (78 bytes)
ERROR:
  exit status 1
EXIT STATUS
  1
STDERR:
  WARNING: Kubernetes configuration file is group-readable. This is insecure. Location: /etc/kubernetes/article-descriptions-deploy-ml-serve-eqiad.config
  Error: Failed to get release main in namespace article-descriptions: exit status 1: WARNING: Kubernetes configuration file is group-readable. This is insecure. Location: /etc/kubernetes/article-descriptions-deploy-ml-serve-eqiad.config
  Error: query: failed to query with labels: secrets is forbidden: User "article-descriptions-deploy" cannot list resource "secrets" in API group "" in the namespace "article-descriptions"
  Error: plugin "diff" exited with error
COMBINED OUTPUT:
  WARNING: Kubernetes configuration file is group-readable. This is insecure. Location: /etc/kubernetes/article-descriptions-deploy-ml-serve-eqiad.config
  Error: Failed to get release main in namespace article-descriptions: exit status 1: WARNING: Kubernetes configuration file is group-readable. This is insecure. Location: /etc/kubernetes/article-descriptions-deploy-ml-serve-eqiad.config
  Error: query: failed to query with labels: secrets is forbidden: User "article-descriptions-deploy" cannot list resource "secrets" in API group "" in the namespace "article-descriptions"
  Error: plugin "diff" exited with error
I had missed pushing the admin_ng change. That is fixed now, so pushing the model server config should work.
Thanks @klausman. As discussed yesterday, with the current configuration, a request that took <3s on staging now takes >8s in prod, as shown below:
$ time curl "https://inference.svc.codfw.wmnet:30443/v1/models/article-descriptions:predict" -X POST -d '{"lang": "en", "title": "Clandonald", "num_beams": 2, "debug": 1}' -H "Host: article-descriptions.article-descriptions.wikimedia.org" -H "Content-Type: application/json" --http1.1
{"lang":"en","title":"Clandonald","blp":false,"num_beams":2,"groundtruth":"Hamlet in Alberta, Canada","latency":{"wikidata-info (s)":0.07287430763244629,"mwapi - first paragraphs (s)":0.27934741973876953,"total network (s)":0.3130209445953369,"model (s)":7.659532070159912,"total (s)":7.9725823402404785},"features":{"descriptions":{"fr":"hameau d'Alberta","en":"hamlet in central Alberta, Canada"},"first-paragraphs":{"en":"Clandonald is a hamlet in central Alberta, Canada within the County of Vermilion River. It is located approximately 28 kilometres (17 mi) north of Highway 16 and 58 kilometres (36 mi) northwest of Lloydminster.","fr":"Clandonald est un hameau (hamlet) du Comté de Vermilion River, situé dans la province canadienne d'Alberta."}},"prediction":["Hamlet in Alberta, Canada","human settlement in Alberta, Canada"]}
real 0m8.049s
user 0m0.014s
sys 0m0.001s
These results are close to what we had in T353127#9398823. A model server with 4 CPUs was taking >8s and when we used 8 CPUs, the response time dropped to <4s.
Based on our findings in T353127#9421055, 16 CPUs is the configuration that brought the response time close to the <3s that the Android team requires for this service.
Is it possible to increase the CPU allocation in prod beyond the current 6?
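To make the bottleneck explicit, the `latency` block from the prod debug response above can be broken down per stage. This is a minimal sketch using the numbers from that response; the helper name `latency_shares` is hypothetical and not part of the model server, and the stages are not assumed to be disjoint, so the shares need not sum to 1.

```python
import json

# Latency block copied from the prod debug response above (seconds).
response_latency = json.loads("""
{
  "wikidata-info (s)": 0.07287430763244629,
  "mwapi - first paragraphs (s)": 0.27934741973876953,
  "total network (s)": 0.3130209445953369,
  "model (s)": 7.659532070159912,
  "total (s)": 7.9725823402404785
}
""")


def latency_shares(latency: dict) -> dict:
    """Return each stage's fraction of the reported total time."""
    total = latency["total (s)"]
    return {
        stage: seconds / total
        for stage, seconds in latency.items()
        if stage != "total (s)"
    }


shares = latency_shares(response_latency)
# Print stages from largest to smallest share of the total.
for stage, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{stage:32s} {share:6.1%}")
```

For this single request, model inference accounts for roughly 96% of the total, while the network calls add about 0.3s, which is consistent with the request being CPU-bound rather than network-bound.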
Change 1006855 had a related patch set uploaded (by Klausman; author: Klausman):
[operations/deployment-charts@master] admin_ng/ml-services: raise request maximums for art-desc
Change 1006855 merged by jenkins-bot:
[operations/deployment-charts@master] admin_ng/ml-services: raise request maximums for art-desc
Change 1006866 had a related patch set uploaded (by Kevin Bazira; author: Kevin Bazira):
[operations/deployment-charts@master] ml-services: increase article-descriptions CPUs
Change 1006866 merged by jenkins-bot:
[operations/deployment-charts@master] ml-services: increase article-descriptions CPUs
I just got an error when querying the service:
$ time curl "https://inference.svc.codfw.wmnet:30443/v1/models/article-descriptions:predict" -X POST -d '{"lang": "en", "title": "Clandonald", "num_beams": 2, "debug": 1}' -H "Host: article-descriptions.article-descriptions.wikimedia.org" -H "Content-Type: application/json" --http1.1
{"error":"AttributeError : 'NoneType' object has no attribute 'shape'"}
real 0m0.317s
user 0m0.005s
sys 0m0.009s
From the service logs:
2024-02-27 10:38:50.321 uvicorn.error ERROR: Exception in ASGI application
Traceback (most recent call last):
  File "/opt/lib/python/site-packages/uvicorn/protocols/http/httptools_impl.py", line 419, in run_asgi
    result = await app( # type: ignore[func-returns-value]
  File "/opt/lib/python/site-packages/uvicorn/middleware/proxy_headers.py", line 78, in __call__
    return await self.app(scope, receive, send)
  File "/opt/lib/python/site-packages/fastapi/applications.py", line 276, in __call__
    await super().__call__(scope, receive, send)
  File "/opt/lib/python/site-packages/starlette/applications.py", line 122, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/opt/lib/python/site-packages/starlette/middleware/errors.py", line 184, in __call__
    raise exc
  File "/opt/lib/python/site-packages/starlette/middleware/errors.py", line 162, in __call__
    await self.app(scope, receive, _send)
  File "/opt/lib/python/site-packages/timing_asgi/middleware.py", line 70, in __call__
    await self.app(scope, receive, send_wrapper)
  File "/opt/lib/python/site-packages/starlette/middleware/exceptions.py", line 79, in __call__
    raise exc
  File "/opt/lib/python/site-packages/starlette/middleware/exceptions.py", line 68, in __call__
    await self.app(scope, receive, sender)
  File "/opt/lib/python/site-packages/fastapi/middleware/asyncexitstack.py", line 21, in __call__
    raise e
  File "/opt/lib/python/site-packages/fastapi/middleware/asyncexitstack.py", line 18, in __call__
    await self.app(scope, receive, send)
  File "/opt/lib/python/site-packages/starlette/routing.py", line 718, in __call__
    await route.handle(scope, receive, send)
  File "/opt/lib/python/site-packages/starlette/routing.py", line 276, in handle
    await self.app(scope, receive, send)
  File "/opt/lib/python/site-packages/starlette/routing.py", line 66, in app
    response = await func(request)
  File "/opt/lib/python/site-packages/fastapi/routing.py", line 237, in app
    raw_response = await run_endpoint_function(
  File "/opt/lib/python/site-packages/fastapi/routing.py", line 163, in run_endpoint_function
    return await dependant.call(**values)
  File "/opt/lib/python/site-packages/kserve/protocol/rest/v1_endpoints.py", line 76, in predict
    response, response_headers = await self.dataplane.infer(model_name=model_name,
  File "/opt/lib/python/site-packages/kserve/protocol/dataplane.py", line 311, in infer
    response = await model(request, headers=headers)
  File "/opt/lib/python/site-packages/kserve/model.py", line 122, in __call__
    else self.predict(payload, headers)
  File "/srv/article_descriptions/model_server/model.py", line 103, in predict
    prediction = self.model.predict(
  File "/srv/article_descriptions/model_server/utils.py", line 149, in predict
    tokens = self.model.generate(
  File "/opt/lib/python/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/srv/article_descriptions/model_server/descartes/src/models/descartes_mbart.py", line 607, in generate
    model_kwargs = self._prepare_encoder_decoder_kwargs_for_generation(
  File "/srv/article_descriptions/model_server/descartes/src/models/descartes_mbart.py", line 401, in _prepare_encoder_decoder_kwargs_for_generation
    lang_out = torch.ones((attention_mask[target_lang].shape[0], 1, 1),
AttributeError: 'NoneType' object has no attribute 'shape'
@klausman helped increase the caps on this model server's resource constraints. I pushed a patch that increased the number of CPUs used by the article-descriptions model server from 6 to 16 so that prod can match staging performance. The previous request we tested in T358467#9579190 has dropped from >8s to <3s:
$ time curl "https://inference.svc.codfw.wmnet:30443/v1/models/article-descriptions:predict" -X POST -d '{"lang": "en", "title": "Clandonald", "num_beams": 2, "debug": 1}' -H "Host: article-descriptions.article-descriptions.wikimedia.org"
{"lang":"en","title":"Clandonald","blp":false,"num_beams":2,"groundtruth":"Hamlet in Alberta, Canada","latency":{"wikidata-info (s)":0.043555498123168945,"mwapi - first paragraphs (s)":0.22956347465515137,"total network (s)":0.2606837749481201,"model (s)":2.4616105556488037,"total (s)":2.7223129272460938},"features":{"descriptions":{"fr":"hameau d'Alberta","en":"hamlet in central Alberta, Canada"},"first-paragraphs":{"en":"Clandonald is a hamlet in central Alberta, Canada within the County of Vermilion River. It is located approximately 28 kilometres (17 mi) north of Highway 16 and 58 kilometres (36 mi) northwest of Lloydminster.","fr":"Clandonald est un hameau (hamlet) du Comté de Vermilion River, situé dans la province canadienne d'Alberta."}},"prediction":["Hamlet in Alberta, Canada","human settlement in Alberta, Canada"]}
real 0m2.744s
user 0m0.000s
sys 0m0.013s
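The effect of the CPU increase can be quantified from the two debug responses in this task (6 CPUs earlier, 16 CPUs above). The following is a small sketch of the arithmetic only; the variable names are illustrative, and each configuration is represented by a single request, so the ratios are indicative rather than a benchmark.

```python
# Latencies (seconds) reported by the service with "debug": 1.
before = {"cpus": 6, "model": 7.659532070159912, "total": 7.9725823402404785}
after = {"cpus": 16, "model": 2.4616105556488037, "total": 2.7223129272460938}

# Ratio of CPU allocation and the resulting speedups.
cpu_ratio = after["cpus"] / before["cpus"]
model_speedup = before["model"] / after["model"]
total_speedup = before["total"] / after["total"]

print(f"CPU increase:  {cpu_ratio:.2f}x")
print(f"model speedup: {model_speedup:.2f}x")
print(f"total speedup: {total_speedup:.2f}x")
```

For this request, model time drops by roughly 3.1x and total time by roughly 2.9x, bringing the total under the 3s target.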
One addendum to the "'NoneType' object has no attribute 'shape'" error: it happened only once. The same request worked fine seconds before and seconds after.
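Since the error is intermittent, one possible way to make it easier to diagnose is a guard that fails with a descriptive message before generation starts. This is only a sketch under assumptions: `require_mask` is a hypothetical helper, and treating `attention_mask` as a mapping keyed by language code is inferred from the `attention_mask[target_lang]` indexing in the traceback, not confirmed against the descartes code.

```python
from typing import Any, Mapping, Optional


def require_mask(attention_mask: Mapping[str, Optional[Any]], target_lang: str) -> Any:
    """Return the mask for target_lang, or raise a descriptive error.

    Hypothetical helper: assumes attention_mask maps language codes to
    tensors, with None (or a missing key) when no mask was built.
    """
    mask = attention_mask.get(target_lang)
    if mask is None:
        # List the languages that do have a mask to aid debugging.
        available = sorted(k for k, v in attention_mask.items() if v is not None)
        raise ValueError(
            f"no attention mask for target language {target_lang!r}; "
            f"languages with masks: {available}"
        )
    return mask
```

With such a guard, the failing request would report which language had no mask instead of the generic `AttributeError`, which might help explain why the same payload succeeds on retry.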
Change 1006870 had a related patch set uploaded (by Kevin Bazira; author: Kevin Bazira):
[operations/deployment-charts@master] ml-services: increase article-descriptions memory
Change 1006870 merged by jenkins-bot:
[operations/deployment-charts@master] ml-services: increase article-descriptions memory
The article-descriptions model server was firing InfServiceHighMemoryUsage alerts, which usually happens when an isvc uses >90% of its memory limit for 5 minutes. I have increased this model server's memory limit from 4Gi to 5Gi so that prod can handle more isvc requests without running out of memory.
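As a rough illustration of the headroom this adds, the absolute usage level at which the alert would apply can be computed for both limits. This is a minimal sketch: `alert_threshold_bytes` is a hypothetical helper, and the 90% fraction is the approximate rule described above, not the exact alert expression.

```python
GI = 1024 ** 3  # bytes in one gibibyte (the Kubernetes "Gi" suffix)


def alert_threshold_bytes(limit_gi: float, fraction: float = 0.9) -> int:
    """Memory usage (bytes) above which the high-usage alert would apply."""
    return int(limit_gi * GI * fraction)


# Compare the old (4Gi) and new (5Gi) memory limits.
for limit in (4, 5):
    threshold = alert_threshold_bytes(limit)
    print(f"limit {limit}Gi -> alert above {threshold / GI:.1f}Gi")
```

Under that assumption, the threshold moves from about 3.6Gi to about 4.5Gi of sustained usage.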
With the changes in https://phabricator.wikimedia.org/T358742#9601035, this should not be a problem anymore.