For the last 12 hours, ORES has not been working for ru- and ukwiki, which has affected our anti-vandalism tools: bad-edit highlighting in watchlists and anti-vandal bots.
Description
Details
Status | Subtype | Assigned | Task
---|---|---|---
Open | BUG REPORT | None | T362503 ORES doesn't work (at least for ru- and ukwiki)
Resolved | | klausman | T362661 Create basic alerts for isvcs to catch outages
Open | | None | T362663 Add slow-logs for ML isvcs
Restricted Task | | |
Event Timeline
Thanks for reporting this. We are investigating and will report back with more information.
We have restarted the associated service and its logs show no more errors. It's not fully root-caused yet, but the functionality should be back to working order now. I have confirmed this for ruwiki.
I can't reproduce the reported ukwiki failures at the moment; can you confirm whether there is still a problem there?
As the creator of the other ticket (T362506), I would add that the ORES/LiftWing infrastructure on Russian Wikipedia was quite unstable throughout the whole last week. There were numerous small outages, ranging from a few minutes to about one hour. But the most recent one was really long: 16-18 hours.
Is it perhaps possible to install some monitoring to prevent such issues in the future?
Timeline (times in UTC):
20240414 14:45:00 Rate of 200 responses from the service drops to essentially zero; average latency of requests jumps to over 5s.
20240414 15:58:45 Rate of 200s rises again, lasting until about 16:28, but the rate of error responses does not decrease.
20240414 16:52:00 Another bout of 200 responses, lasting until 17:10:20; again, the error rate does not decrease.
20240415 04:47:00 T362503 is opened
20240415 07:17:43 Iluvatar pokes ml-team on IRC
20240415 08:30:44 klausman deletes the failing pod (kubectl -n revscoring-editquality-damaging delete pod ruwiki-damaging-predictor-default-00017-deployment-64c6587v5lxx)
20240415 08:31:20 new pod is started and receives traffic; the logs show no more errors
20240415 09:37:00 T362506 is opened by Q-bit-array
20240415 10:20:00 initial spike of backlogged requests subsides and error rate is back to pre-outage levels
Something strange: in the istio gateway logs, the HTTP response code logged is 0: https://logstash.wikimedia.org/goto/9003f0bd1a3c34e303ac5fbe86eff693
Usually this means that the client dropped the connection; we didn't return any 50x response or similar.
Other relevant dashboards:
- Preprocess latency settles at a stable ~10s right after the issue started - dashboard - maybe a specific rev-id caused issues in revscoring?
- Istio sidecar INBOUND traffic to the ruwiki isvc: dashboard - downstream closes here as well
- Istio sidecar OUTBOUND traffic to api-ro (common to all damaging isvcs) - dashboard - I'd have expected to see some issues if ruwiki was failing all its calls to api-ro, but afaics nothing is visible.
- Queue proxy (sits between the istio sidecar INBOUND traffic and the kserve container): dashboard - here 502s appear, but each request seems to take ages, whereas 502 responses are usually very quick.
Errors in logstash for queue-proxy: https://logstash.wikimedia.org/app/discover#/doc/logstash-*/logstash-webrequest-1-7.0.0-1-2024.04.14?id=CUL53Y4BeyUZP0OuT_kR
error reverse proxying request; sockstat: sockets: used 285 TCP: inuse 211 orphan 3 tw 113 alloc 13542 mem 6117 UDP: inuse 0 mem 5 UDPLITE: inuse 0 RAW: inuse 0 FRAG: inuse 0 memory 0
knative.dev/pkg/network.ErrorHandler.func1
    /go/github.com/knative/serving/vendor/knative.dev/pkg/network/error_handler.go:33
net/http/httputil.(*ReverseProxy).ServeHTTP
    /usr/lib/go-1.18/src/net/http/httputil/reverseproxy.go:301
knative.dev/serving/pkg/queue.(*appRequestMetricsHandler).ServeHTTP
    /go/github.com/knative/serving/pkg/queue/request_metric.go:199
knative.dev/serving/pkg/queue.ProxyHandler.func1
    /go/github.com/knative/serving/pkg/queue/handler.go:76
net/http.HandlerFunc.ServeHTTP
    /usr/lib/go-1.18/src/net/http/server.go:2084
knative.dev/serving/pkg/queue.ForwardedShimHandler.func1
    /go/github.com/knative/serving/pkg/queue/forwarded_shim.go:54
net/http.HandlerFunc.ServeHTTP
    /usr/lib/go-1.18/src/net/http/server.go:2084
knative.dev/serving/pkg/http/handler.(*timeoutHandler).ServeHTTP.func4
    /go/github.com/knative/serving/pkg/http/handler/timeout.go:117
I think that a handful of requests spent a ton of time in preprocess(), causing the isvc to stall completely and get stuck (most probably revscoring ended up in a broken/non-working state).
Overall view of kserve logs for the pod: https://logstash.wikimedia.org/goto/b2a5d4084eac6862465c16e5f7c46a6f
First occurrence of high latency:
Apr 14, 2024 @ 14:53:09.871 revscoring-editquality-damaging ml-serve1004 INFO:root:Function get_revscoring_extractor_cache took 404.2053 seconds to execute.
Unfortunately we don't have a way to log the JSON payload of a request when preprocess/predict takes more than X seconds (we should probably add one, so we can reproduce cases like this in the future).
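A minimal sketch of that idea, assuming a wrapper around the existing preprocess() call (the function name, threshold, and logging call here are illustrative, not the actual inference-services code):

```python
import json
import logging
import time

# Hypothetical sketch of the proposed stop-gap: time preprocess() and log
# the JSON payload whenever the call runs longer than a threshold, so the
# offending rev_ids can be replayed later. Names and threshold are
# assumptions, not the real service code.
def preprocess_with_payload_logging(preprocess, payload, threshold_s=5.0):
    start = time.monotonic()
    result = preprocess(payload)
    elapsed = time.monotonic() - start
    if elapsed > threshold_s:
        logging.warning(
            "preprocess took %.1fs, payload: %s", elapsed, json.dumps(payload)
        )
    return result
```

The payload is only serialized on the slow path, so fast requests pay no logging cost.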
You are definitely right. We were focusing on SLO alerts that would have caught this, but those have been stalled for months due to issues with the Observability stack. We should add some temporary monitors/alerts to catch these cases.
Just to let you know - LiftWing/ORES has crashed again
$ curl https://api.wikimedia.org/service/lw/inference/v1/models/ruwiki-damaging:predict -X POST -d '{"rev_id": 137258595}'
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100    78  100    57    0    21      1      0  0:00:57  0:00:30  0:00:27    17
{"httpCode":504,"httpReason":"upstream request timeout"}
@Q-bit-array I just ran the above request successfully a couple of times.
$ curl https://api.wikimedia.org/service/lw/inference/v1/models/ruwiki-damaging:predict -X POST -d '{"rev_id": 137258595}'
{"ruwiki":{"models":{"damaging":{"version":"0.5.0"}},"scores":{"137258595":{"damaging":{"score":{"prediction":true,"probability":{"false":0.15409867172576575,"true":0.8459013282742343}}}}}}}
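For reference, the successful response above can be unpacked in plain Python (the JSON is copied verbatim from the curl output; this only illustrates the response structure):

```python
import json

# Parse a LiftWing damaging-model response (structure taken from the
# successful curl call above) and pull out the prediction and the
# probability that the edit is damaging.
response = json.loads("""
{"ruwiki":{"models":{"damaging":{"version":"0.5.0"}},
 "scores":{"137258595":{"damaging":{"score":{"prediction":true,
 "probability":{"false":0.15409867172576575,"true":0.8459013282742343}}}}}}}
""")

score = response["ruwiki"]["scores"]["137258595"]["damaging"]["score"]
print(score["prediction"])           # True
print(score["probability"]["true"])  # 0.8459013282742343
```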
However we will investigate this as well and we are also in the process of taking action for better monitoring and alerts to prevent this in the future. Thank you for reporting this!
Yes, now it works again.
But we had 15-20 minutes of outage just before I posted here.
Saved ruwiki's pod logs to deploy1002:/home/elukey/T362503
I noticed high latency for preprocess(), plus a lot of the following:
2024-04-17 18:18:22.227 kserve.trace requestId: 5702d375-b40e-4312-9d3a-7eb852e54875, preprocess_ms: 174.714326859, explain_ms: 0, predict_ms: 1.538276672, postprocess_ms: 0.001192093
2024-04-17 18:18:29.433 kserve.trace requestId: 05e14dff-31d2-4a79-97cc-eb1e0ae7b75f, preprocess_ms: 793.685674667, explain_ms: 0, predict_ms: 1.977443695, postprocess_ms: 0.001192093
This is likely a bug in revscoring that we are hitting, but we still don't know which rev-ids are causing it (likely some very big ones that leave revscoring in a bad state).
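To eyeball which phase dominates in kserve.trace entries like the ones above, a small parser can help (a hypothetical helper, not part of the service; the line format is copied from the logs):

```python
import re

# Extract the per-phase latencies (in milliseconds) from a kserve.trace
# log line. The "<phase>_ms: <value>" format is taken from the entries
# quoted above; this helper itself is illustrative.
TRACE_RE = re.compile(r"(\w+_ms): ([0-9.]+)")

def parse_trace(line):
    return {name: float(value) for name, value in TRACE_RE.findall(line)}

line = ("2024-04-17 18:18:22.227 kserve.trace requestId: "
        "5702d375-b40e-4312-9d3a-7eb852e54875, preprocess_ms: 174.714326859, "
        "explain_ms: 0, predict_ms: 1.538276672, postprocess_ms: 0.001192093")
print(parse_trace(line))
```

In both quoted lines, preprocess_ms dwarfs predict_ms, which is consistent with the feature-extraction (preprocess) theory.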
Killed the pod to restart the service. Thanks @Q-bit-array for the ping!
Change #1020898 had a related patch set uploaded (by Elukey; author: Elukey):
[machinelearning/liftwing/inference-services@main] revscoring: add flag to log JSON inputs
We are still seeing high latency for ruwiki's damaging model. The current theory is that some rev-ids cause trouble in the preprocessing (feature extraction, etc.) phase, leaving the service in an inconsistent state that affects other requests as well. We'll deploy https://gerrit.wikimedia.org/r/1020898 as a stop-gap to figure out which requests are causing this issue, and then we'll try to find a fix. For the moment, users may experience trouble when calling Lift Wing; apologies in advance, but please keep reporting connectivity issues if you find them.
I've noticed some autoscaling and high CPU usage in the kserve containers. I've raised the min/max replicas from 1/4 to 4/6, and with more capacity the latency is much better (at least for now).
Change #1020898 merged by jenkins-bot:
[machinelearning/liftwing/inference-services@main] revscoring: add flag to log JSON inputs
Change #1021056 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):
[operations/deployment-charts@master] ml-services: increase replicas for ruwiki damaging and log payload
Change #1021056 merged by jenkins-bot:
[operations/deployment-charts@master] ml-services: increase replicas for ruwiki damaging and log payload
Change #1021421 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):
[operations/deployment-charts@master] ml-services: enable payload logging in revscoring-damaging in ml-staging
Change #1021421 merged by jenkins-bot:
[operations/deployment-charts@master] ml-services: enable payload logging in revscoring-damaging in ml-staging
Change #1023846 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):
[operations/deployment-charts@master] ml-services: enable payload logging in all revscoring services
Change #1023846 merged by jenkins-bot:
[operations/deployment-charts@master] ml-services: enable payload logging in all revscoring services
Change #1032517 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):
[operations/deployment-charts@master] ml-services: increase viwiki-reverted replicas
Change #1032517 abandoned by Ilias Sarantopoulos:
[operations/deployment-charts@master] ml-services: increase viwiki-reverted replicas
Reason:
not needed because 4 maxreplicas is already enough
Change #1032517 restored by Ilias Sarantopoulos:
[operations/deployment-charts@master] ml-services: increase viwiki-reverted replicas
Change #1032531 had a related patch set uploaded (by Elukey; author: Elukey):
[operations/deployment-charts@master] ml-services: update autoscaling settings for editquality reverted
Change #1032531 abandoned by Elukey:
[operations/deployment-charts@master] ml-services: update autoscaling settings for editquality reverted
Reason: