In T319214 we are evaluating whether Benthos could be a good candidate as a stream processor for Lift Wing. As a side effect, we are finding code bugs related to corner cases that we hadn't tested or taken into consideration before now. In this task we will collect the bugs found and the tests done, together with their fixes.
Description
Details
| Status | Subtype | Assigned | Task |
|---|---|---|---|
| Resolved | | elukey | T320374 Test ML model-servers with Benthos |
| Resolved | | isarantopoulos | T323624 Test revscoring model servers on Lift Wing |
Event Timeline
Change 842494 had a related patch set uploaded (by Elukey; author: Elukey):
[operations/deployment-charts@master] admin_ng: set higher circuit breaking limits for EventGate on ml-serve
Change 842695 had a related patch set uploaded (by AikoChou; author: AikoChou):
[machinelearning/liftwing/inference-services@main] articlequality: use the fetch_features shared function
Change 842695 merged by jenkins-bot:
[machinelearning/liftwing/inference-services@main] articlequality: use the fetch_features shared function
Change 842697 had a related patch set uploaded (by Elukey; author: Elukey):
[operations/deployment-charts@master] ml-services: update revscoring Docker images
Change 842697 merged by Elukey:
[operations/deployment-charts@master] ml-services: update revscoring Docker images
Change 842494 merged by Elukey:
[operations/deployment-charts@master] admin_ng: set higher circuit breaking limits for EventGate on ml-serve
Thanks to several fixes, the overall codebase now seems to work better while serving Benthos requests. Before the patches I observed a lot of HTTP 500 responses due to bugs/corner cases that we didn't handle in the code.
Now the focus should shift to performance testing, to see how many requests the production clusters can sustain while sending events to EventGate.
Change 843869 had a related patch set uploaded (by Elukey; author: Elukey):
[machinelearning/liftwing/inference-services@main] editquality: improve aiohttp and asyncio performances
Change 843869 merged by jenkins-bot:
[machinelearning/liftwing/inference-services@main] editquality: improve aiohttp and asyncio performances
Change 843951 had a related patch set uploaded (by Elukey; author: Elukey):
[operations/deployment-charts@master] ml-services: update revscoring-editquality-goodfaith's docker image
Change 843951 merged by Elukey:
[operations/deployment-charts@master] ml-services: update revscoring-editquality-goodfaith's docker image
Change 844916 had a related patch set uploaded (by Elukey; author: Elukey):
[machinelearning/liftwing/inference-services@main] articlequality: improve aiohttp client session handling
Change 844917 had a related patch set uploaded (by Elukey; author: Elukey):
[machinelearning/liftwing/inference-services@main] draftquality: improve aiohttp client session handling
Change 844919 had a related patch set uploaded (by Elukey; author: Elukey):
[machinelearning/liftwing/inference-services@main] topic: improve aiohttp client session handling
A new series of patches to avoid creating an aiohttp HTTP session for every request landing on the model server. The refactor uses aiohttp's native client connection pool, creating the session once when the model server is created.
Change 844923 had a related patch set uploaded (by Elukey; author: Elukey):
[machinelearning/liftwing/inference-services@main] editquality: run the score method in a separate thread
Change 844916 merged by jenkins-bot:
[machinelearning/liftwing/inference-services@main] articlequality: improve aiohttp client session handling
Change 844917 merged by jenkins-bot:
[machinelearning/liftwing/inference-services@main] draftquality: improve aiohttp client session handling
Change 844919 merged by jenkins-bot:
[machinelearning/liftwing/inference-services@main] topic: improve aiohttp client session handling
I am observing a weird behavior of Lift Wing when using Benthos (listening to revision-create events, and then hitting Lift Wing for a score). Sometimes I see a burst of HTTP 500s returned to Benthos (for various wiki/rev-id combinations, not predictable), and if I inspect the logs I see the following when calling the MW API:
File "/opt/lib/python/site-packages/mwapi/async_session.py", line 107, in _request
    .format(prefix))
ValueError: Could not decode as JSON:
upstream connect error or disconnect/reset before headers. reset reason: connection termination
Alongside these 500s, I also see latency peaks, up to 30s+ for p99. The curious thing is that if I hit Lift Wing with Benthos or wrk using the same wiki/rev-id combination (essentially asking for the same score over and over), I don't see any issue at all.
My current line of investigation is on two fronts:
- Try to figure out if there is a specific issue/setting between Envoy and the MW API that triggers this behavior. A similar issue in a somewhat similar environment is described in https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/627322, so it could be a good starting point.
- As https://docs.python.org/3/library/asyncio-eventloop.html#executing-code-in-thread-or-process-pools describes, we should really use a process pool to run our model.score functions in predict (that is our blocking code). I am still wondering whether certain rev-id/wiki combinations may lead to stalls due to model.score taking longer to finish (and consequently blocking the asyncio loop).
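The executor approach described in that asyncio doc looks roughly like this: the CPU-bound `score` call is handed to a `ProcessPoolExecutor` via `run_in_executor`, so the event loop stays free to serve other requests. This is a minimal sketch; the `score` function is a hypothetical stand-in for the real revscoring call.

```python
import asyncio
import concurrent.futures


# Hypothetical stand-in for the CPU-bound revscoring model.score call.
def score(rev_id: int) -> dict:
    total = sum(i * i for i in range(10_000))
    return {"rev_id": rev_id, "prediction": total % 2 == 0}


async def predict(rev_id: int,
                  pool: concurrent.futures.ProcessPoolExecutor) -> dict:
    loop = asyncio.get_running_loop()
    # Offload the blocking work to a separate process; the asyncio loop
    # keeps serving other coroutines instead of stalling on model.score.
    return await loop.run_in_executor(pool, score, rev_id)


if __name__ == "__main__":
    with concurrent.futures.ProcessPoolExecutor(max_workers=2) as pool:
        result = asyncio.run(predict(12345, pool))
        print(result["rev_id"])  # prints 12345
```

A `ThreadPoolExecutor` would not help here, since the GIL keeps pure-Python CPU-bound code serialized; a process pool sidesteps that at the cost of pickling arguments and results.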
After reading T247484 and T262490, together with https://github.com/envoyproxy/envoy/issues/14981, it seems to me that there is $something that happens at the connection level on the MW API Envoy proxy side. More specifically:
- The HTTP response header Connection: close may not be sent before stopping a TCP connection.
- Envoy may cleanup/manage incoming TCP connections in an aggressive way, causing RST or similar to downstream clients.
The ServiceOps envoy proxies have the following settings (I translated them into Istio's HTTP settings for the DestinationRule for clarity):
idleTimeout: 5s
maxRequestsPerConnection: 1000
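In context, those two knobs sit under the DestinationRule's connection pool settings; a sketch of the full resource could look like this (the host and metadata names here are illustrative, not the actual ml-serve config):

```yaml
# Hedged sketch: where idleTimeout / maxRequestsPerConnection live
# in an Istio DestinationRule. Host and name are placeholders.
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: mwapi
spec:
  host: api-ro.discovery.wmnet
  trafficPolicy:
    connectionPool:
      http:
        idleTimeout: 5s
        maxRequestsPerConnection: 1000
```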
After applying them, I see improvements, but the connection breakage reports still pop up in Benthos. Adding the following makes them go away completely (VirtualService HTTP config):
retries:
  attempts: 2
  retryOn: connect-failure,refused-stream,reset
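For reference, that retry policy attaches to an HTTP route in a VirtualService; a sketch of the surrounding resource could look like this (host and route names are illustrative placeholders, not the real ml-serve config):

```yaml
# Hedged sketch: an Istio VirtualService carrying the retry policy above.
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: mwapi
spec:
  hosts:
    - api-ro.discovery.wmnet
  http:
    - route:
        - destination:
            host: api-ro.discovery.wmnet
      retries:
        attempts: 2
        # Retry only on connection-level failures, not application errors.
        retryOn: connect-failure,refused-stream,reset
```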
I suspect that the issue may be on the Istio proxy's envoy side, or maybe on both fronts (MW API and Istio sidecar). The above config seems good enough for the moment, so I'll apply it. The idea is to revisit the config once we have a new version of Istio.
Change 848344 had a related patch set uploaded (by Elukey; author: Elukey):
[operations/deployment-charts@master] admin_ng: update Istio settings for ml-serve clusters
Change 844923 merged by jenkins-bot:
[machinelearning/liftwing/inference-services@main] editquality: run the score method in a separate process
Change 848344 merged by Elukey:
[operations/deployment-charts@master] admin_ng: update Istio settings for ml-serve clusters
Change 849016 had a related patch set uploaded (by AikoChou; author: AikoChou):
[machinelearning/liftwing/inference-services@main] editquality: move aiohttp session creation out of __init__
Change 849016 merged by jenkins-bot:
[machinelearning/liftwing/inference-services@main] editquality: move aiohttp session creation out of __init__
Change 849040 had a related patch set uploaded (by Elukey; author: Elukey):
[operations/deployment-charts@master] admin_ng: add a Istio vs and retry settings on ml-serve for eventgate
Change 849040 merged by Elukey:
[operations/deployment-charts@master] admin_ng: add a Istio vs and retry settings on ml-serve for eventgate
Change 849062 had a related patch set uploaded (by Elukey; author: Elukey):
[machinelearning/liftwing/inference-services@main] articlequality: add process pool and improve aiohttp's client session
Change 849064 had a related patch set uploaded (by Elukey; author: Elukey):
[machinelearning/liftwing/inference-services@main] draftquality: add process pool and improve aiohttp's client session
Change 849086 had a related patch set uploaded (by Elukey; author: Elukey):
[machinelearning/liftwing/inference-services@main] topic: add process pool and improve aiohttp's client session
Change 849062 merged by jenkins-bot:
[machinelearning/liftwing/inference-services@main] articlequality: add process pool and improve aiohttp's client session
Change 849064 merged by jenkins-bot:
[machinelearning/liftwing/inference-services@main] draftquality: add process pool and improve aiohttp's client session
Change 849086 merged by jenkins-bot:
[machinelearning/liftwing/inference-services@main] topic: add process pool and improve aiohttp's client session
I had to apply a similar fix for eventgate-main.discovery.wmnet's Istio config since the connection errors popped up as well for it. After a chat with Aiko we also decided to go ahead with the separate asyncio ProcessPool to offload model.score functions (CPU bound, blocking the asyncio main thread). Preliminary tests for enwiki goodfaith show a reduction in latency up to p75.
I am still testing with Benthos; the connection errors are not completely gone, so more investigation is needed.
Change 849485 had a related patch set uploaded (by Elukey; author: Elukey):
[operations/deployment-charts@master] ml-services: update revscoring's Docker images
Change 849485 merged by Elukey:
[operations/deployment-charts@master] ml-services: update revscoring's Docker images
Change 850986 had a related patch set uploaded (by Elukey; author: Elukey):
[machinelearning/liftwing/inference-services@main] Refactor multi-processing code in revscoring model servers
Something interesting I found when I ran benthos for outlink isvc.
In the kserve logs for the outlink transformer, I found sudden spikes similar to what Luca observed in the revscoring models, but the difference was that their status code was 400. HTTP 400 occurred 56 times out of a total of 651 requests in 3 minutes. Among those 56 requests, most took less than 200ms, and only 2 took more than 1s (shown below):
[I 221029 21:54:45 web:2243] 200 POST /v1/models/outlink-topic-model:predict (127.0.0.1) 90.23ms
[I 221029 21:54:45 web:2243] 200 POST /v1/models/outlink-topic-model:predict (127.0.0.1) 92.26ms
[I 221029 21:54:45 web:2243] 200 POST /v1/models/outlink-topic-model:predict (127.0.0.1) 87.34ms
[I 221029 21:54:45 web:2243] 200 POST /v1/models/outlink-topic-model:predict (127.0.0.1) 117.58ms
[W 221029 21:54:45 web:2243] 400 POST /v1/models/outlink-topic-model:predict (127.0.0.1) 2296.59ms
...
[I 221029 21:55:15 web:2243] 200 POST /v1/models/outlink-topic-model:predict (127.0.0.1) 75.70ms
[I 221029 21:55:15 web:2243] 200 POST /v1/models/outlink-topic-model:predict (127.0.0.1) 77.60ms
[I 221029 21:55:15 web:2243] 200 POST /v1/models/outlink-topic-model:predict (127.0.0.1) 104.58ms
[W 221029 21:55:15 web:2243] 400 POST /v1/models/outlink-topic-model:predict (127.0.0.1) 165.42ms
[W 221029 21:55:16 web:2243] 400 POST /v1/models/outlink-topic-model:predict (127.0.0.1) 1036.04ms
I'm not sure what's causing the 400 for these requests as there aren't any other logs or error messages. I'm wondering if I could set tornado to debug mode to see more info.
In the predictor's logs, I don't see any deviation. There were only 595 requests (meaning requests that got a 400 in the transformer didn't proceed to predict), and most of them took less than 40ms.
You can also check the latency metrics in Grafana.
Really nice finding! One thing I noticed is that in the code, IIUC, we just return a meaningful error in the HTTP response, but we never attempt to logging.debug anything when we raise BAD_REQUEST responses. Maybe we could start doing that, finding a way to switch between info and debug logging when needed?
> In the predictor's logs, I don't see any deviation. There were only 595 requests (meaning requests that got a 400 in the transformer didn't proceed to predict), and most of them took less than 40ms.
This is great, probably the separate process pool is not needed, too much overhead for this case.
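The suggestion above (log before answering 400) can be sketched like this. Note that the exception class and the validation function are hypothetical stand-ins, not the actual Lift Wing code; in the real servers the equivalent would be tornado's `HTTPError`.

```python
import logging

logger = logging.getLogger(__name__)


# Hypothetical stand-in for tornado.web.HTTPError, to keep the
# sketch self-contained.
class HTTPStatusError(Exception):
    def __init__(self, status_code: int, reason: str):
        self.status_code = status_code
        self.reason = reason
        super().__init__(f"{status_code}: {reason}")


def validate_request(payload: dict) -> int:
    rev_id = payload.get("rev_id")
    if rev_id is None:
        # Record the rejected payload before answering 400, so that
        # bursts of BAD_REQUESTs can be diagnosed from the logs by
        # flipping the logger to DEBUG.
        logger.debug("Rejecting request, missing rev_id: %r", payload)
        raise HTTPStatusError(400, "Missing rev_id in request payload")
    return rev_id
```

Switching between info and debug would then be a matter of setting the root logger level from an env variable or config flag at server start-up.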
Change 850986 merged by jenkins-bot:
[machinelearning/liftwing/inference-services@main] Refactor multi-processing code in revscoring model servers
Change 851093 had a related patch set uploaded (by Elukey; author: Elukey):
[operations/deployment-charts@master] ml-services: update docker images
Very interesting - I added more decorators to the code and after testing locally this is the result for enwiki-goodfaith:
[I 221031 16:26:34 decorators:13] Function get_revscoring_extractor_cache took 0.8577 seconds to execute.
[I 221031 16:26:36 decorators:25] Function fetch_features took 1.9728 seconds to execute.
[I 221031 16:26:36 decorators:13] Function run_in_process_pool took 0.2101 seconds to execute.
[I 221031 16:26:36 web:2243] 200 POST /v1/models/enwiki-goodfaith:predict (172.17.0.1) 3044.41ms
I checked with DEBUG logging and fetch_features doesn't call any MW API endpoint (as expected, due to the extractor cache created in get_revscoring_extractor_cache), so all of that must be cpu-bound time. Probably a good candidate for the multiprocessing pool, I'll investigate more!
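A decorator producing logs like the ones above can be sketched as follows. This is an illustrative reconstruction, not the actual decorator in the inference-services repo (the decorated function here is a placeholder):

```python
import functools
import logging
import time

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("decorators")


def elapsed_time(func):
    """Log how long a (synchronous) function takes to execute."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.monotonic()
        result = func(*args, **kwargs)
        logger.info(
            "Function %s took %.4f seconds to execute.",
            func.__name__, time.monotonic() - start,
        )
        return result
    return wrapper


@elapsed_time
def fetch_features(rev_id: int) -> list:
    # Placeholder for the real cpu-bound feature extraction.
    return [rev_id, rev_id * 2]
```

For `async def` functions the wrapper would need to be a coroutine itself (`async def wrapper(...)` awaiting `func(...)`), otherwise the timing would only measure coroutine creation, not execution.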
Change 852129 had a related patch set uploaded (by Elukey; author: Elukey):
[machinelearning/liftwing/inference-services@main] revscoring: Improve MP code and logging
Change 852129 merged by jenkins-bot:
[machinelearning/liftwing/inference-services@main] revscoring: Improve MP code and logging
Change 851093 merged by Elukey:
[operations/deployment-charts@master] ml-services: update docker images
I'll keep reporting improvements to the Istio connection termination issues in https://phabricator.wikimedia.org/T322196, that seems a more specific task.
I have been testing the ML code overnight for enwiki-goodfaith, and the results were really good: latency stayed consistent across all percentiles (except for a little blip of several seconds of latency across the board). I've set the pod up with 5 processes in the pool, and the following requests/limits:
resources:
limits:
cpu: "5"
memory: 2Gi
requests:
cpu: "2"
    memory: 2Gi
It may be a lot for a single pod, and I am not sure how it plays out when multiple pods use the same resources (do we have enough capacity?).
Change 853258 had a related patch set uploaded (by AikoChou; author: AikoChou):
[machinelearning/liftwing/inference-services@main] Add logging for BAD_REQUEST responses
Change 853258 merged by jenkins-bot:
[machinelearning/liftwing/inference-services@main] Add logging for BAD_REQUEST responses
Articlequality testing went really well; latencies are stable and acceptable without multiprocessing.
Change 853298 had a related patch set uploaded (by AikoChou; author: AikoChou):
[operations/deployment-charts@master] ml-services: update docker images
Change 853298 merged by Elukey:
[operations/deployment-charts@master] ml-services: update docker images
articletopic seems to work well without multi-processing (need to test it but I think that drafttopic will follow the same trend).
The model servers (revscoring-related) that make 3 API calls during preprocess (edit and draft quality) have most of the latency issues, while the rest work really nicely (even without MP). I am wondering if asyncio.gather plays a role (since it is used only in the former case), or if it is due to cpu-bound code with more chances to block I/O calls.
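The pattern in question looks roughly like this: three concurrent fetches joined with `asyncio.gather`. This is a hypothetical sketch (the coroutine names and `asyncio.sleep` stand in for real MW API calls), illustrating why a stall in any one call holds up the whole preprocess step:

```python
import asyncio


# Stand-ins for three MW API calls made during preprocess.
async def get_revision(rev_id: int) -> dict:
    await asyncio.sleep(0.01)
    return {"rev_id": rev_id}


async def get_parent(rev_id: int) -> dict:
    await asyncio.sleep(0.01)
    return {"rev_id": rev_id - 1}


async def get_user_info(rev_id: int) -> dict:
    await asyncio.sleep(0.01)
    return {"editor": "example"}


async def preprocess(rev_id: int) -> dict:
    # The three coroutines run concurrently, so total wall time is the
    # slowest of the three, not the sum; but if any of them stalls
    # (or the loop is blocked by cpu-bound code), the gather stalls too.
    rev, parent, user = await asyncio.gather(
        get_revision(rev_id), get_parent(rev_id), get_user_info(rev_id)
    )
    return {"revision": rev, "parent": parent, "user": user}
```

Notably, `gather` itself only interleaves I/O waits; it cannot help if the event loop is pinned by cpu-bound feature extraction between the awaits, which may be why the MP offloading matters more for these servers.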
I tested outlink with benthos for around 9 hours the other day (here are the grafana metrics); I observed it returned ~1800 Bad Request errors with "No matching article or the page has no outlinks".
I looked at these pages: some of them don't exist (deleted), but some of them actually exist, and when I tried to query the MW API locally I was able to fetch data for these pages.
I noticed that sometimes contacting the MW API might fail the first few times (returning 400) but after retrying 1-3 times it would finally get the data. In the benthos config we set "retry_period: 30s", so it retries every 30s, but I observed that after 4 failed retries it would stop retrying.
I'm wondering if there is some mechanism in the MW API that causes data fetching to fail when there is traffic. In the revert-risk batch inference test I've seen similar cases: for some rev_ids the isvc failed to fetch data from the MW API, but if I just ran a simple script locally to create a mwapi session and fetch data, it would work. That's very weird.
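For context, the relevant Benthos retry knobs live on its `http` processor; a hedged sketch of that part of the config could look like the fragment below. Only `retry_period: 30s` is confirmed by the text above; the URL is a placeholder, and the other fields and their defaults are assumptions that may differ per Benthos version.

```yaml
# Hedged sketch of the Benthos http processor retry settings
# (URL is illustrative; field defaults depend on the Benthos version).
pipeline:
  processors:
    - http:
        url: https://inference.example.wmnet/v1/models/outlink-topic-model:predict
        verb: POST
        retry_period: 30s  # wait between attempts, as noted above
        retries: 3         # a capped retry count may explain the observed cut-off
        backoff_on: [429]
        drop_on: [400]     # treat Bad Request as non-retryable
```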
Change 855963 had a related patch set uploaded (by AikoChou; author: AikoChou):
[machinelearning/liftwing/inference-services@main] outlink: log response returned from MW API
Change 855963 merged by jenkins-bot:
[machinelearning/liftwing/inference-services@main] outlink: log response returned from MW API
Change 855995 had a related patch set uploaded (by AikoChou; author: AikoChou):
[operations/deployment-charts@master] ml-services: update outlink's model binary and docker image
Change 855995 merged by Elukey:
[operations/deployment-charts@master] ml-services: update outlink's model binary and docker image
Change 856520 had a related patch set uploaded (by Elukey; author: Elukey):
[machinelearning/liftwing/inference-services@main] Improve fetch_features handling in revscoring model servers
> I tested outlink with benthos for around 9 hours the other day (here are the grafana metrics); I observed it returned ~1800 Bad Request errors with "No matching article or the page has no outlinks".
My guess is it's related to the MediaWiki API or server load like you suggested, but in case it's article-specific, here are a few other hypotheses to maybe test. If you have some example page IDs, I'm also happy to take a look and see if I notice any patterns in the articles:
- Is it possible that these are new articles for which the corresponding pagelinks table / MediaWiki API hasn't been updated yet when the initial edit goes through?
- Edits to redirects, where maybe we're handling redirects via the API inconsistently? Or articles that had some other log action, like a page move, and therefore the page the page ID points to is being updated?
Hi @Isaac the issue has been solved, but the task hasn't been updated. I'm doing it now. Thanks for the heads up.
After adding more logging, we found that the MW API sometimes returned {'batchcomplete': True} at the end of the async_generator object, which caused a KeyError even though the model server had actually fetched the data, e.g.:
[E 221111 17:08:31 outlink_transformer:91] KeyError occurs for Battle_of_Pisky_(2022) (en). Reason: 'query'. MW API returns: {'batchcomplete': True}
[D 221111 17:08:31 outlink_transformer:94] en Battle_of_Pisky_(2022) fetched 463 outlinks
We have changed the code so it doesn't raise a 400 HTTPError exception in this case!
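The gist of that fix can be sketched as follows: tolerate a trailing response document that carries only `batchcomplete` instead of indexing `'query'` unconditionally. This is an illustrative sketch, not the actual outlink transformer code (the function name and response shape are assumptions based on the MW API's query/pages format):

```python
def extract_outlinks(responses: list) -> set:
    """Collect outlink titles from a list of MW API query responses,
    skipping trailing documents like {'batchcomplete': True}."""
    outlinks = set()
    for doc in responses:
        if "query" not in doc:
            # The MW API may end a continuation sequence with a bare
            # {'batchcomplete': True} document; that's not an error,
            # so skip it instead of raising KeyError / answering 400.
            continue
        for page in doc["query"].get("pages", {}).values():
            for link in page.get("links", []):
                outlinks.add(link["title"])
    return outlinks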
> We have changed the code so it doesn't raise a 400 HTTPError exception in this case!
Great to hear! And I'd always wondered about batchcomplete and now I understand (T84977#936343).
Very nice tests on draftquality with the process pool:
elukey@wintermute:~/Wikimedia/scratch-dir/model_servers_tests$ wrk -c 5 -t 5 -s inference-draftquality.lua -d 300 --timeout 10s http://localhost:8080/v1/models/enwiki-draftquality:predict --latency
Running 5m test @ http://localhost:8080/v1/models/enwiki-draftquality:predict
5 threads and 5 connections
^C Thread Stats Avg Stdev Max +/- Stdev
Latency 8.35s 1.21s 9.98s 56.67%
Req/Sec 0.00 0.00 0.00 100.00%
Latency Distribution
50% 8.06s
75% 9.68s
90% 9.90s
99% 9.98s
41 requests in 1.38m, 17.14KB read
Socket errors: connect 0, read 0, write 0, timeout 11
Requests/sec: 0.50
Transfer/sec: 211.95B
elukey@wintermute:~/Wikimedia/scratch-dir/model_servers_tests$ wrk -c 5 -t 5 -s inference-draftquality.lua -d 300 --timeout 10s http://localhost:8080/v1/models/enwiki-draftquality:predict --latency
Running 5m test @ http://localhost:8080/v1/models/enwiki-draftquality:predict
5 threads and 5 connections
^C Thread Stats Avg Stdev Max +/- Stdev
Latency 5.79s 1.39s 9.97s 89.36%
Req/Sec 0.00 0.00 0.00 100.00%
Latency Distribution
50% 5.57s
75% 5.83s
90% 8.53s
99% 9.97s
53 requests in 1.16m, 22.15KB read
Socket errors: connect 0, read 0, write 0, timeout 6
Requests/sec: 0.76
Transfer/sec: 325.54B
elukey@wintermute:~/Wikimedia/scratch-dir/model_servers_tests$ wrk -c 5 -t 5 -s inference-draftquality.lua -d 300 --timeout 10s http://localhost:8080/v1/models/enwiki-draftquality:predict --latency
Running 5m test @ http://localhost:8080/v1/models/enwiki-draftquality:predict
5 threads and 5 connections
^C Thread Stats Avg Stdev Max +/- Stdev
Latency 4.31s 881.05ms 9.93s 84.85%
Req/Sec 0.00 0.00 0.00 100.00%
Latency Distribution
50% 4.24s
75% 4.66s
90% 4.96s
99% 9.57s
141 requests in 2.29m, 58.93KB read
Socket errors: connect 0, read 0, write 0, timeout 10
Requests/sec: 1.03
Transfer/sec:    439.19B
The first run is without the process pool, then with 2 processes, and then with 12 (auto-set for my laptop, core i7 etc.). There is a nice progression, and this seems to prove my point that the revscoring model servers run a lot of cpu-bound/blocking code.
Change 856520 merged by jenkins-bot:
[machinelearning/liftwing/inference-services@main] Refactor revscoring model servers
Change 859461 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):
[operations/deployment-charts@master] ml-services: Update docker images to use single model server
In this task we worked on several things, that now have separate subtasks:
- Refactored all the revscoring model servers to reuse code as much as possible, also adding multiprocessing support for the cpu-bound revscoring bits. We are going to work on T323624 to deploy and test the new Docker images (and possibly improve them further if needed!).
- Found a weird Istio/Envoy corner case: sometimes Envoy seems to write to broken sockets, causing immediate 503s. This is tracked in T322196.
- Tested Benthos extensively! It works nicely, and it is a good choice for our stream processor use cases. We'll decide with Data Engineering what's best to use when the time comes (most likely during the ORES deprecation).
- Tested the new model servers (non-revscoring ones) and found corner cases, all with separate subtasks (credit to Aiko for the great work!).
It makes sense to keep working on separate tasks for clarity, closing this one!
Change 859461 merged by Elukey:
[operations/deployment-charts@master] ml-services: Update docker images to use single model server