
Revertrisk multilingual predictor returning 500s
Open, Needs TriagePublic

Description

Hi folks! While on-call for SRE I noticed some recurring RevertRisk multilingual errors:

INFO:root:Received request for revision 149673768 (ru).
2025-11-09 08:16:53.192 1 kserve.trace requestId: 1fe51d39-3b89-458f-b1ae-9ce3ab45ac70, preprocess_ms: 119.79341507, explain_ms: 0, predict_ms: 184.958696365, postprocess_ms: 0.005722046
2025-11-09 08:16:53.193 uvicorn.access INFO:     127.0.0.6:0 1 - "POST /v1/models/revertrisk-multilingual%3Apredict HTTP/1.1" 200 OK
2025-11-09 08:16:53.193 1 kserve.trace kserve.io.kserve.protocol.rest.v1_endpoints.predict: 0.3053722381591797 ['http_status:200', 'http_method:POST', 'time:wall']
2025-11-09 08:16:53.193 1 kserve.trace kserve.io.kserve.protocol.rest.v1_endpoints.predict: 0.36014800000157265 ['http_status:200', 'http_method:POST', 'time:cpu']
2025-11-09 08:16:54.254 1 kserve.trace kserve.io.kserve.protocol.rest.v1_endpoints.predict: 0.00027871131896972656 ['http_status:500', 'http_method:POST', 'time:wall']
2025-11-09 08:16:54.254 1 kserve.trace kserve.io.kserve.protocol.rest.v1_endpoints.predict: 0.00026900000011664815 ['http_status:500', 'http_method:POST', 'time:cpu']
2025-11-09 08:16:54.254 1 kserve ERROR [errors.py:generic_exception_handler():143] Exception:
Traceback (most recent call last):
  File "/opt/lib/python/site-packages/starlette/middleware/errors.py", line 164, in __call__
    await self.app(scope, receive, _send)
  File "/opt/lib/python/site-packages/timing_asgi/middleware.py", line 70, in __call__
    await self.app(scope, receive, send_wrapper)
  File "/opt/lib/python/site-packages/starlette/middleware/exceptions.py", line 63, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/opt/lib/python/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    raise exc
  File "/opt/lib/python/site-packages/starlette/_exception_handler.py", line 42, in wrapped_app
    await app(scope, receive, sender)
  File "/opt/lib/python/site-packages/starlette/routing.py", line 716, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/opt/lib/python/site-packages/starlette/routing.py", line 736, in app
    await route.handle(scope, receive, send)
  File "/opt/lib/python/site-packages/starlette/routing.py", line 290, in handle
    await self.app(scope, receive, send)
  File "/opt/lib/python/site-packages/starlette/routing.py", line 78, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/opt/lib/python/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    raise exc
  File "/opt/lib/python/site-packages/starlette/_exception_handler.py", line 42, in wrapped_app
    await app(scope, receive, sender)
  File "/opt/lib/python/site-packages/starlette/routing.py", line 75, in app
    response = await f(request)
               ^^^^^^^^^^^^^^^^
  File "/opt/lib/python/site-packages/fastapi/routing.py", line 302, in app
    raw_response = await run_endpoint_function(
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/lib/python/site-packages/fastapi/routing.py", line 213, in run_endpoint_function
    return await dependant.call(**values)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-11-09 08:16:54.254 uvicorn.access INFO:     127.0.0.6:0 1 - "POST /v1/models/revertrisk-multilingual%3Apredict HTTP/1.1" 500 Internal Server Error
  File "/opt/lib/python/site-packages/kserve/protocol/rest/v1_endpoints.py", line 84, in predict
    response, response_headers = await self.dataplane.infer(
                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/lib/python/site-packages/kserve/protocol/dataplane.py", line 461, in infer
    response, res_headers = await model(request, headers=headers)
                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/lib/python/site-packages/kserve/model.py", line 259, in __call__
    await self.preprocess(body, headers)
  File "/srv/revert_risk_model/model_server/base_model.py", line 170, in preprocess
    lang = inputs.get("lang")
           ^^^^^^^^^^
AttributeError: 'list' object has no attribute 'get'

It seems that only some requests trigger the issue; you can find one at the top of the stacktrace. I also saved a pod's log in my home on deploy2002 for future inspection (/home/elukey/T409657). I restricted the task to subscribers since in theory this could be used as a nice DoS vector :)
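The failure mode in the traceback can be reproduced without the model server at all: if the JSON body deserializes to a list instead of a dict, a dict-style `inputs.get("lang")` lookup raises exactly this AttributeError. A minimal sketch (the `preprocess_lang` helper is hypothetical, a stand-in for the lookup in base_model.py's preprocess, not the actual server code):

```python
import json


def preprocess_lang(raw_body: str):
    """Hypothetical stand-in for the lookup in base_model.py's preprocess:
    it assumes the parsed payload is a dict and calls .get('lang') on it."""
    inputs = json.loads(raw_body)
    return inputs.get("lang")  # AttributeError if inputs parsed to a list


# A well-formed request body works:
assert preprocess_lang('{"lang": "ru", "rev_id": 149673768}') == "ru"

# A top-level JSON array reproduces the 500:
try:
    preprocess_lang('[{"lang": "ru", "rev_id": 149673768}]')
except AttributeError as e:
    assert "no attribute 'get'" in str(e)
```

So any client posting a top-level JSON array (a valid body as far as the JSON parser is concerned) would hit this path.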

Event Timeline

elukey created this object with visibility "Custom Policy".
elukey updated the task description.
gkyziridis changed the visibility from "Custom Policy" to "Public (No Login Required)". Nov 10 2025, 10:21 AM
Tested Locally

I downloaded the same model binary (pickle) that we currently use on staging and built the model server locally using make revertrisk-multilingual (from README.md).
The model server runs smoothly locally and the request was successful.

# Request
$ curl -i -X POST \
  localhost:8080/v1/models/revertrisk-multilingual:predict \
  -d '{"lang": "ru", "rev_id": 149673768}'

# Logs
INFO:root:Received request for revision 149673768 (ru).
2025-11-10 15:03:07.664 23177 kserve.trace requestId: N.A., preprocess_ms: 467.435121536, explain_ms: 0, predict_ms: 133.578062057, postprocess_ms: 0.010967255
2025-11-10 15:03:07.665 uvicorn.access INFO:     127.0.0.1:58958 23177 - "POST /v1/models/revertrisk-multilingual%3Apredict HTTP/1.1" 200 OK
2025-11-10 15:03:07.665 23177 kserve.trace kserve.io.kserve.protocol.rest.v1_endpoints.predict: 0.6017708778381348 ['http_status:200', 'http_method:POST', 'time:wall']
2025-11-10 15:03:07.665 23177 kserve.trace kserve.io.kserve.protocol.rest.v1_endpoints.predict: 0.20669199999999677 ['http_status:200', 'http_method:POST', 'time:cpu']

# Response
HTTP/1.1 200 OK
date: Mon, 10 Nov 2025 14:12:43 GMT
server: uvicorn
content-length: 203
content-type: application/json

{"model_name":"revertrisk-multilingual","model_version":"4","wiki_db":"ruwiki","revision_id":149673768,"output":{"prediction":true,"probabilities":{"true":0.7813304845261667,"false":0.2186695154738333}}}
Tested on Staging
# Request
$ curl -i -X POST \
  "https://inference-staging.svc.codfw.wmnet:30443/v1/models/revertrisk-multilingual:predict" \
  -H "Host: revertrisk-multilingual.revertrisk.wikimedia.org" \
  -d '{"lang": "ru", "rev_id": 149673768}'

# Logs
INFO:root:Received request for revision 149673768 (ru).
2025-11-10 14:22:14.015 1 kserve.trace requestId: 74a2ac8e-accb-4bbc-8188-a80fde1ccfaa, preprocess_ms: 105.349063873, explain_ms: 0, predict_ms: 140.214443207, postprocess_ms: 0.007152
2025-11-10 14:22:14.015 uvicorn.access INFO:     127.0.0.6:0 1 - "POST /v1/models/revertrisk-multilingual%3Apredict HTTP/1.1" 200 OK
2025-11-10 14:22:14.016 1 kserve.trace kserve.io.kserve.protocol.rest.v1_endpoints.predict: 0.24666237831115723 ['http_status:200', 'http_method:POST', 'time:wall']
2025-11-10 14:22:14.016 1 kserve.trace kserve.io.kserve.protocol.rest.v1_endpoints.predict: 0.4472849999983737 ['http_status:200', 'http_method:POST', 'time:cpu']

# Response
HTTP/2 200 
content-length: 203
content-type: application/json
date: Mon, 10 Nov 2025 14:22:46 GMT
server: istio-envoy
x-envoy-upstream-service-time: 262

{"model_name":"revertrisk-multilingual","model_version":"4","wiki_db":"ruwiki","revision_id":149673768,"output":{"prediction":true,"probabilities":{"true":0.7813304845261667,"false":0.2186695154738333}}}
Tested on Prod
# Request
$ curl -i -X POST \
  "https://inference.svc.eqiad.wmnet:30443/v1/models/revertrisk-multilingual:predict" \
  -H "Host: revertrisk-multilingual.revertrisk.wikimedia.org" \
  -d '{"lang": "ru", "rev_id": 149673768}'

# Logs
INFO:root:Received request for revision 149673768 (ru).
2025-11-10 14:27:02.904 1 kserve.trace requestId: c52b5a53-2bc3-4f6d-8ae4-3fcc9de5ffef, preprocess_ms: 76.379537582, explain_ms: 0, predict_ms: 181.194782257, postprocess_ms: 0.0092
2025-11-10 14:27:02.904 uvicorn.access INFO:     127.0.0.6:0 1 - "POST /v1/models/revertrisk-multilingual%3Apredict HTTP/1.1" 200 OK
2025-11-10 14:27:02.904 1 kserve.trace kserve.io.kserve.protocol.rest.v1_endpoints.predict: 0.2583799362182617 ['http_status:200', 'http_method:POST', 'time:wall']
2025-11-10 14:27:02.904 1 kserve.trace kserve.io.kserve.protocol.rest.v1_endpoints.predict: 0.5917550000012852 ['http_status:200', 'http_method:POST', 'time:cpu']

# Response
HTTP/2 200 
content-length: 203
content-type: application/json
date: Mon, 10 Nov 2025 14:27:01 GMT
server: istio-envoy
x-envoy-upstream-service-time: 265

{"model_name":"revertrisk-multilingual","model_version":"4","wiki_db":"ruwiki","revision_id":149673768,"output":{"prediction":true,"probabilities":{"true":0.7813304845261667,"false":0.2186695154738333}}}

@gkyziridis I am not 100% sure if the rev-id in the task's description is the problematic one, I thought it was when checking the logs but you may need to review /home/elukey/T409657 on deploy2002 to get other testing samples :(

It is not easy to get testing examples from the logs under your space (deploy2002:/home/elukey/T409657) because they do not report the rev-id.
For instance, here we have rev-id 28389742 for idwiki (id), but this is a 200-status hit.
For the 500 below it, we do not see the rev-id in these logs.

INFO:root:Received request for revision 28389742 (id).
2025-11-09 08:36:09.753 1 kserve.trace requestId: 21710ee9-378f-46af-87fa-b6559f0e79ce, preprocess_ms: 85.550785065, explain_ms: 0, predict_ms: 124.191761017, postprocess_ms: 0.008583069
2025-11-09 08:36:09.754 uvicorn.access INFO:     127.0.0.6:0 1 - "POST /v1/models/revertrisk-multilingual%3Apredict HTTP/1.1" 200 OK
2025-11-09 08:36:09.754 1 kserve.trace kserve.io.kserve.protocol.rest.v1_endpoints.predict: 0.21045875549316406 ['http_status:200', 'http_method:POST', 'time:wall']
2025-11-09 08:36:09.754 1 kserve.trace kserve.io.kserve.protocol.rest.v1_endpoints.predict: 0.3261270000002696 ['http_status:200', 'http_method:POST', 'time:cpu']
2025-11-09 08:36:09.828 1 kserve.trace kserve.io.kserve.protocol.rest.v1_endpoints.predict: 0.0003421306610107422 ['http_status:500', 'http_method:POST', 'time:wall']
2025-11-09 08:36:09.828 1 kserve.trace kserve.io.kserve.protocol.rest.v1_endpoints.predict: 0.0003370000013092067 ['http_status:500', 'http_method:POST', 'time:cpu']
2025-11-09 08:36:09.829 uvicorn.access INFO:     127.0.0.6:0 1 - "POST /v1/models/revertrisk-multilingual%3Apredict HTTP/1.1" 500 Internal Server Error

The rev-id I used for my testing was 149673768 (ru), taken from the paste you made in the description, but again that one returned a 200, so it is probably not the problematic one.
I will dig into Logstash to find more info.
I suspect something is going on with the batcher. The error indicates that the model server receives a list object and tries to call .get("lang") on it.
The batcher collects requests and appends them to a list, which is why I suspect something is going wrong there.

Update

The issue I am facing in reproducing the error is that we log the incoming request only if it is successful (status code 200); we do not log it if it is not.
We need to log it immediately after we receive it in order to reproduce the failure.
I will open a ticket for upgrading the logging on the model server side: https://phabricator.wikimedia.org/T409931
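One way the logging upgrade could look (a minimal sketch only, with a hypothetical `preprocess` wrapper, not the actual change proposed in T409931): log the raw payload before parsing, so that bodies which later fail validation are still captured.

```python
import json
import logging


def preprocess(raw_body: bytes) -> dict:
    """Hypothetical sketch: capture the raw payload before parsing, so a
    request that fails validation is still visible in the logs."""
    # Truncate to keep log lines bounded for oversized payloads.
    logging.info("Incoming payload: %s", raw_body[:1000])
    inputs = json.loads(raw_body)
    # Reject non-dict payloads with a clear error instead of letting a
    # later .get() call blow up with an AttributeError.
    if not isinstance(inputs, dict):
        raise ValueError(
            f"expected a JSON object, got {type(inputs).__name__}"
        )
    return inputs
```

Returning a 4xx-style validation error here (instead of the current unhandled AttributeError → 500) would also make the client's mistake visible to them.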

Thoughts

I can imagine two issues for the above error:

  1. The model server receives something like [{"lang": "ru", "text": "1234"}] or ["ru", "text"].
  2. Something could be going wrong with the KServe batcher, which collects individual requests into a batch and appends them to a list. However, the batcher is not enabled in the revertrisk Helm chart. The KServe batch request schema by default looks like this:
{
    "instances": [
      {
        "lang": "en",
        "rev_id": 123456
      },
      {
        "lang": "en",
        "rev_id": 23456
      }
    ]
}
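If the preprocess step ever needs to tolerate both shapes, a normalization helper could look like the sketch below (the `extract_requests` name and behavior are my assumption, not existing model-server code): accept either a single request dict or the KServe v1 batch shape and always return a list of dicts, rejecting bare lists explicitly.

```python
def extract_requests(payload):
    """Hypothetical helper: normalize either a single request dict or the
    KServe v1 batch shape {"instances": [...]} into a list of dicts."""
    if isinstance(payload, dict) and "instances" in payload:
        items = payload["instances"]
    elif isinstance(payload, dict):
        items = [payload]
    else:
        # A bare top-level list (the suspected trigger) is rejected with a
        # clear error rather than an AttributeError deep in preprocess.
        raise TypeError(f"unsupported payload type: {type(payload).__name__}")
    for item in items:
        if not isinstance(item, dict):
            raise TypeError("each instance must be a JSON object")
    return items
```

With this in place, `{"lang": "en", "rev_id": 123456}` and the batched `{"instances": [...]}` shape both normalize to the same list-of-dicts form that the rest of preprocess can iterate over.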

@gkyziridis I am still puzzled by the main exception:

I checked KServe's model.py and it seems to me that it is not always the case that inputs is a dict. This other bit is suspicious as well. Maybe there was something specific in the API calls that triggered the inputs-as-list issue?

The client seems to be a single one judging from the Istio Ingress logs, and the URI was /v1/models/revertrisk-multilingual:predict, so it is plausible that they queried the service with a JSON payload whose shape triggered the problem.