
ORES doesn't work (at least for ru- and ukwiki)
Open, High · Public · BUG REPORT

Description

For the last 12 hours, ORES has not been working for ruwiki and ukwiki. This has affected our anti-vandalism tools: highlighting of bad edits in the watchlist, and anti-vandal bots.

Event Timeline

Restricted Application added subscribers: Base, Aklapper.
MBH renamed this task from "ORES doesn't work (at least for ruwiki)" to "ORES doesn't work (at least for ru- and ukwiki)". Apr 15 2024, 4:51 AM
MBH updated the task description.

Thanks for reporting this. We are investigating and will report back with more information.

We have restarted an associated service and its logs show no more errors. It's not fully root-caused yet, but the functionality should be back in working order now. I have confirmed this for ruwiki.

I can't reproduce the reported ukwiki failures at the moment; can you confirm whether there is still a problem there?

Looks like ukwiki works now.

As the creator of the other ticket (T362506), I would add that the ORES/LiftWing infrastructure in the Russian Wikipedia was quite unstable throughout the past week. There were numerous small outages, ranging from a few minutes to about an hour, but the most recent one was really long: 16-18 hours.
Would it be possible to set up some monitoring to prevent such issues in the future?

Timeline (times in UTC):

20240414 14:45:00 Rate of 200 responses from the service drops to essentially zero; average request latency jumps to over 5s.
20240414 15:58:45 Rate of 200s rises again, lasting until about 16:28, but the rate of error responses does not decrease.
20240414 16:52:00 Another bout of 200 responses, lasting until 17:10:20; again, the error rate does not decrease.
20240415 04:47:00 T362503 is opened.
20240415 07:17:43 Iluvatar pokes ml-team on IRC.
20240415 08:30:44 klausman deletes the failing pod (kubectl -n revscoring-editquality-damaging delete pod ruwiki-damaging-predictor-default-00017-deployment-64c6587v5lxx).
20240415 08:31:20 The new pod is started and receives traffic; the logs show no more errors.
20240415 09:37:00 T362506 is opened by Q-bit-array.
20240415 10:20:00 The initial spike of backlogged requests subsides and the error rate is back to pre-outage levels.

Something strange: in the istio gateway logs, the HTTP response code logged is 0: https://logstash.wikimedia.org/goto/9003f0bd1a3c34e303ac5fbe86eff693

Usually this means that the client dropped the connection; we didn't return any 50x response or similar.

Other relevant dashboards:

  • Preprocess latency settles at a stable ~10s right after the issue started - dashboard - maybe a specific rev-id caused issues in revscoring?
  • Istio sidecar INBOUND traffic to the ruwiki isvc - dashboard - downstream closes as well.
  • Istio sidecar OUTBOUND traffic to api-ro (common to all damaging isvcs) - dashboard - I'd have expected to see some issues if ruwiki was failing all its calls to api-ro, but as far as I can see nothing is visible.
  • Queue proxy (sits between the istio sidecar INBOUND traffic and the kserve container) - dashboard - here 502s appear, but every request seems to take ages, whereas 502 responses are usually returned very quickly.

Errors in logstash for queue-proxy: https://logstash.wikimedia.org/app/discover#/doc/logstash-*/logstash-webrequest-1-7.0.0-1-2024.04.14?id=CUL53Y4BeyUZP0OuT_kR

error reverse proxying request; sockstat: sockets: used 285
TCP: inuse 211 orphan 3 tw 113 alloc 13542 mem 6117
UDP: inuse 0 mem 5
UDPLITE: inuse 0
RAW: inuse 0
FRAG: inuse 0 memory 0
knative.dev/pkg/network.ErrorHandler.func1
	/go/github.com/knative/serving/vendor/knative.dev/pkg/network/error_handler.go:33
net/http/httputil.(*ReverseProxy).ServeHTTP
	/usr/lib/go-1.18/src/net/http/httputil/reverseproxy.go:301
knative.dev/serving/pkg/queue.(*appRequestMetricsHandler).ServeHTTP
	/go/github.com/knative/serving/pkg/queue/request_metric.go:199
knative.dev/serving/pkg/queue.ProxyHandler.func1
	/go/github.com/knative/serving/pkg/queue/handler.go:76
net/http.HandlerFunc.ServeHTTP
	/usr/lib/go-1.18/src/net/http/server.go:2084
knative.dev/serving/pkg/queue.ForwardedShimHandler.func1
	/go/github.com/knative/serving/pkg/queue/forwarded_shim.go:54
net/http.HandlerFunc.ServeHTTP
	/usr/lib/go-1.18/src/net/http/server.go:2084
knative.dev/serving/pkg/http/handler.(*timeoutHandler).ServeHTTP.func4
	/go/github.com/knative/serving/pkg/http/handler/timeout.go:117

I think that multiple requests caused a ton of time to be spent in preprocess(), causing the isvc to completely stall and get into a weird state (most probably revscoring ended up in a broken, non-working state).

Overall view of kserve logs for the pod: https://logstash.wikimedia.org/goto/b2a5d4084eac6862465c16e5f7c46a6f

First occurrence of high latency:

Apr 14, 2024 @ 14:53:09.871	revscoring-editquality-damaging	ml-serve1004	INFO:root:Function get_revscoring_extractor_cache took 404.2053 seconds to execute.

Unfortunately we don't have a way to log the JSON payload of a request when preprocess/predict takes more than X seconds (we should probably add one, so that we can reproduce such issues in the future).
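For illustration, a minimal sketch of what such slow-request logging could look like, assuming a decorator applied to the service's async preprocess() method; the threshold constant and logging setup are hypothetical, not an actual patch:

import functools
import json
import logging
import time

# Illustrative threshold; in practice this would come from service configuration.
SLOW_REQUEST_THRESHOLD_S = 10.0

def log_slow_payload(func):
    """Log the request's JSON payload whenever the wrapped step is slow."""
    @functools.wraps(func)
    async def wrapper(self, payload, *args, **kwargs):
        start = time.monotonic()
        try:
            return await func(self, payload, *args, **kwargs)
        finally:
            elapsed = time.monotonic() - start
            if elapsed > SLOW_REQUEST_THRESHOLD_S:
                logging.warning(
                    "%s took %.1fs, payload: %s",
                    func.__name__, elapsed, json.dumps(payload),
                )
    return wrapper

Applied to preprocess(), something like this would have captured the rev_id of the request that spent 404 seconds in get_revscoring_extractor_cache above.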

As the creator of the other ticket (T362506), I would add that the ORES/LiftWing infrastructure in the Russian Wikipedia was quite unstable throughout the past week. There were numerous small outages, ranging from a few minutes to about an hour, but the most recent one was really long: 16-18 hours.
Would it be possible to set up some monitoring to prevent such issues in the future?

Definitely, you are right. We were focusing on SLO alerts that would have caught this, but those have been stalled for months due to issues with the Observability stack. We should add some temporary monitors/alerts to catch these cases.

elukey triaged this task as High priority. Apr 15 2024, 4:24 PM

Created two follow-ups:

  • Basic alerting - T362661 (in place of the SLO dashboards etc. that we can't use right now).
  • Add slow logs - T362663 (log slow requests verbosely so we have a way to reproduce issues).

Just to let you know - LiftWing/ORES has crashed again.

curl output example
$ curl https://api.wikimedia.org/service/lw/inference/v1/models/ruwiki-damaging:predict -X POST -d '{"rev_id": 137258595}'
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100    78  100    57    0    21      1      0  0:00:57  0:00:30  0:00:27    17{"httpCode":504,"httpReason":"upstream request timeout"}
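For reference, the same check can be scripted with an explicit client-side timeout, which makes it easier to tell a genuine upstream 504 from a dropped connection; a minimal sketch using the Python requests library (the timeout value and helper name are illustrative):

import requests

LIFTWING_URL = "https://api.wikimedia.org/service/lw/inference/v1/models/ruwiki-damaging:predict"

def check_damaging(rev_id, timeout_s=30):
    """POST a rev_id to the Lift Wing endpoint and summarise the outcome."""
    try:
        resp = requests.post(LIFTWING_URL, json={"rev_id": rev_id}, timeout=timeout_s)
    except requests.exceptions.Timeout:
        return f"no response within {timeout_s}s (client-side timeout)"
    return f"HTTP {resp.status_code}: {resp.text}"

print(check_damaging(137258595))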

@Q-bit-array I just managed to run the above request successfully a couple of times.

$ curl https://api.wikimedia.org/service/lw/inference/v1/models/ruwiki-damaging:predict -X POST -d '{"rev_id": 137258595}'

{"ruwiki":{"models":{"damaging":{"version":"0.5.0"}},"scores":{"137258595":{"damaging":{"score":{"prediction":true,"probability":{"false":0.15409867172576575,"true":0.8459013282742343}}}}}}}

However, we will investigate this as well, and we are also in the process of putting better monitoring and alerting in place to prevent this in the future. Thank you for reporting it!

Yes, it works again now.
But we had 15-20 minutes of outage just before I posted here.

Saved ruwiki's pod logs to deploy1002:/home/elukey/T362503

I noticed high latency for preprocess(), plus a lot of the following:

2024-04-17 18:18:22.227 kserve.trace requestId: 5702d375-b40e-4312-9d3a-7eb852e54875, preprocess_ms: 174.714326859, explain_ms: 0, predict_ms: 1.538276672, postprocess_ms: 0.001192093
2024-04-17 18:18:29.433 kserve.trace requestId: 05e14dff-31d2-4a79-97cc-eb1e0ae7b75f, preprocess_ms: 793.685674667, explain_ms: 0, predict_ms: 1.977443695, postprocess_ms: 0.001192093

This is likely a bug in revscoring that we are hitting, but we still don't know which rev-ids are causing it (likely some very large revisions that push revscoring into a bad state).
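One way to narrow this down from the saved pod logs is to extract the requestIds of slow preprocess() calls from the kserve.trace lines above and correlate them with the payloads once payload logging is in place; a minimal sketch, with the log file name and threshold as illustrative assumptions:

import re

TRACE_RE = re.compile(
    r"kserve\.trace requestId: (?P<req>[0-9a-f-]+), "
    r"preprocess_ms: (?P<pre>[\d.]+)"
)

def slow_preprocess(path, threshold_ms=1000.0):
    """Yield (requestId, preprocess_ms) for trace lines above the threshold."""
    with open(path) as fh:
        for line in fh:
            m = TRACE_RE.search(line)
            if m and float(m.group("pre")) > threshold_ms:
                yield m.group("req"), float(m.group("pre"))

for req_id, pre_ms in slow_preprocess("ruwiki-damaging-pod.log"):
    print(f"{req_id}: preprocess took {pre_ms:.0f} ms")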

Killed the pod to restart the service. Thanks @Q-bit-array for the ping!

Change #1020898 had a related patch set uploaded (by Elukey; author: Elukey):

[machinelearning/liftwing/inference-services@main] revscoring: add flag to log JSON inputs

https://gerrit.wikimedia.org/r/1020898

We are still seeing high latency for ruwiki's damaging model. The current theory is that some rev-ids cause trouble in the preprocessing (feature extraction, etc.) phase, leaving the service in an inconsistent state that affects other requests as well. We'll deploy https://gerrit.wikimedia.org/r/1020898 as a stop-gap to figure out which requests are causing the issue, and then we'll try to find a fix. For the moment users may experience trouble when calling Lift Wing; apologies in advance, but please keep reporting connectivity issues if you find them.

I've noticed some autoscaling and high CPU usage in the kserve containers. I've raised the min/max replicas from 1/4 to 4/6, and with more capacity the latency is much better (at least for now).

Change #1020898 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] revscoring: add flag to log JSON inputs

https://gerrit.wikimedia.org/r/1020898

Change #1021056 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[operations/deployment-charts@master] ml-services: increase replicas for ruwiki damaging and log payload

https://gerrit.wikimedia.org/r/1021056

Change #1021056 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: increase replicas for ruwiki damaging and log payload

https://gerrit.wikimedia.org/r/1021056

Change #1021421 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[operations/deployment-charts@master] ml-services: enable payload logging in revscoring-damaging in ml-staging

https://gerrit.wikimedia.org/r/1021421

Change #1021421 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: enable payload logging in revscoring-damaging in ml-staging

https://gerrit.wikimedia.org/r/1021421

Change #1023846 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[operations/deployment-charts@master] ml-services: enable payload logging in all revscoring services

https://gerrit.wikimedia.org/r/1023846

Change #1023846 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: enable payload logging in all revscoring services

https://gerrit.wikimedia.org/r/1023846

Change #1032517 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[operations/deployment-charts@master] ml-services: increase viwiki-reverted replicas

https://gerrit.wikimedia.org/r/1032517

Change #1032517 abandoned by Ilias Sarantopoulos:

[operations/deployment-charts@master] ml-services: increase viwiki-reverted replicas

Reason:

not needed because 4 maxreplicas is already enough

https://gerrit.wikimedia.org/r/1032517

Change #1032517 restored by Ilias Sarantopoulos:

[operations/deployment-charts@master] ml-services: increase viwiki-reverted replicas

https://gerrit.wikimedia.org/r/1032517

Change #1032531 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] ml-services: update autoscaling settings for editquality reverted

https://gerrit.wikimedia.org/r/1032531

Change #1032531 abandoned by Elukey:

[operations/deployment-charts@master] ml-services: update autoscaling settings for editquality reverted

Reason:

https://gerrit.wikimedia.org/r/1032531