Page MenuHomePhabricator

Investigate `edit-check` returning empty responses
Closed, ResolvedPublic

Description

It was reported in the Slack that edit-check returned number of empty responses with 200 status codes, possibly due to increased model load. Link to the Slack thread: https://wikimedia.slack.com/archives/C0833GPPYNA/p1753477984327219.

This should not happen as we should either return error response if we can't generate a response or return a proper non-empty response along with 200 status code.

The incident happened on Friday, 25th of July. We are waiting for more details regarding the:

  1. Number of empty responses that have been returned
  2. More exact timespan of the incident
  3. Examples of the requests that returned the empty responses

Event Timeline

Unfortunately, it seems that we won't be able to retrieve the exact timestamps nor the number of failed requests as they were not logged in our general client-error-logging.

The only exceptions/errors I could find in our logstash were happening simultaneously with the scale-down of our replicas, but are not really informative and it's unclear if they were caused when processing a request: https://logstash.wikimedia.org/goto/a58c1d4823bdcd946d2e70783c7cc9c4.

I will try replicating the issue on our ml-staging-codfw cluster by running load tests on edit-check and recording all responses to make sure they are non-empty.

Change #1173796 had a related patch set uploaded (by Bartosz Wójtowicz; author: Bartosz Wójtowicz):

[operations/deployment-charts@master] ml-services: Increase `max_replicas` to 3 for edit-check on staging.

https://gerrit.wikimedia.org/r/1173796

I've ran a load-test on staging cluster with 10000 requests, each of them returned a proper non-empty response. The statistics are shown below.

{'total_requests': 10000, 'successful_requests': 10000, 'failed_requests': 0, 'success_rate': 100.0, 'total_test_time': 648.3256633281708, 'requests_per_second': 15.424346999724088, 'response_time_stats': {'min': 0.2191174030303955, 'max': 0.8146030902862549, 'mean': 0.64795749168396, 'median': 0.6472955942153931, 'p95': 0.6710119247436523, 'p99': 0.6840834259986878}}

Currently on staging we have only 1 replica deployed without any autoscaling. For the next test, I will bump the max_replicas to 3 to simulate our production environment and see if we experience any issues during the scale-ups and scale-downs.

Change #1173796 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: Increase `max_replicas` to 3 for edit-check on staging.

https://gerrit.wikimedia.org/r/1173796

I've updated the staging deployment of edit-check to be able to autoscale up to 3 replicas. I've re-ran the load-testing script with the statistics shown below:

{'total_requests': 10000, 'successful_requests': 10000, 'failed_requests': 0, 'success_rate': 100.0, 'total_test_time': 282.07794427871704, 'requests_per_second': 35.451194263239344, 'response_time_stats': {'min': 0.1394789218902588, 'max': 1.254246711730957, 'mean': 0.2819031985759735, 'median': 0.23579776287078857, 'p95': 0.642729651927948, 'p99': 0.7503527879714966}}

During the time of execution, we've experienced scale-up from 1 to 3 replicas. There were 0 unsuccessful requests, however I got a couple of corrupted "successful" responses:

{'read tcp 127.0.0.1:59484->127.0.0.1:8080: read: connection reset by peer\n', 'read tcp 127.0.0.1:59472->127.0.0.1:8080: read: connection reset by peer\n', 'read tcp 127.0.0.1:59678->127.0.0.1:8080: read: connection reset by peer\n', 'read tcp 127.0.0.1:36434->127.0.0.1:8080: read: connection reset by peer\n',}

There are similar error logs in logstash to what I've seen in the production logs when the empty responses appeared - https://logstash.wikimedia.org/app/dashboards#/view/7f883390-fe76-11ea-b848-090a7444f26c?_g=h@c194595&_a=h@310a63f. However, there are only 2 such error logs where we have 4 corrupt responses.


I've re-ran the load-test experiment against 3 replicas, which were running all the time without scale-ups and scale-downs.

Similar corrupted responses appeared, however there were no error logs in logstash. This might indicate that the errors shown in logstash appear during scale-ups and scale-downs, but are not connected to the corrupted responses.

This might indicate that the corrupt responses might be connected to the increased load combined with multiple replicas available.

Small update: The connection reset errors shown above are "successful" requests, but they are not returning 200 status codes - they return 502 codes as expected. This is due our service reaching max capacity during load tests of around 40 requests per second. Thus, it's most likely unrelated to the reported bug.

I've also re-ran the load test against production and I can see similar 502 connection reset errors happening, but I cannot reproduce receiving empty response from the edit-check.


I've also verified that our inference code does not return anything other than dictionary with predictions key under any conditions.


Currently we could not reproduce the error in staging and production.
To investigate further, we're waiting for some more details regarding the specific endpoint that was used and what was the environment in which the requests were sent.

We realized that the original issue happened by querying model via API Gateway at https://api.wikimedia.org/service/lw/inference/v1/models/edit-check:predict, but previous experiments were being performed by querying the service from our VPC via https://inference.svc.eqiad.wmnet:30443/v1/models/edit-check:predict.

I've re-ran the load test experiments by targeting the API Gateway, but I could not reproduce the issue of blank responses.


One idea, which came to my mind was that possibly we might be experiencing a race condition during pod initialization of a new deployment replica - my hypothesis was that potentially traffic is routed to a new replica too early during pod scale-up, when the service is not fully configured and not ready to serve predictions. This could result in the requests just pinging the service (potentially also istio sidecar or knative queue) instead of correctly hitting the prediction endpoint, which could potentially explain blank 200 responses.

To test this hypothesis, I've used load tests to trigger scale-up on the edit-check deployment and recorded all responses from API Gateway. However, I still could not replicate receiving empty 200 responses, all responses were valid.


To further investigate the issue, we might need to wait for it to re-occur and gather more details about the requests resulting in empty responses such as time of occurrence and contents of the sent request.

isarantopoulos subscribed.

Since there have been no further reports on additional information regarding the time of the requests and we weren't able to reproduce the issues, I'm resolving this task
We can reopen it if the issue reoccurs.