User Details
- User Since: Nov 1 2022, 12:34 PM (13 w, 5 d)
- Availability: Available
- LDAP User: Ilias Sarantopoulos
- MediaWiki User: Isarantopoulos
Fri, Feb 3
@calbon I couldn't find a user that matches you (calbon or Chris Albon). @kevinbazira any luck?
Thu, Feb 2
All revscoring model servers have been successfully upgraded to Python 3.9.2 and Debian Bullseye. 🎉
As part of this ticket we also resolved the revscoring package's security vulnerabilities.
Tue, Jan 31
@elukey I closed this task since your change has already been merged and deployed.
After discussing it during the review with @RLazarus we went with the second approach.
In the aforementioned patch the tests support a json_body field, in which we pass a JSON-serializable object, and the request is altered accordingly.
Only one of the form_body or json_body fields can be specified, which is validated when the test cases are parsed.
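To illustrate the idea, here is a minimal sketch of the mutual-exclusion check and JSON handling (illustrative Python, not httpbb's actual code; only the form_body/json_body field names come from the description above):

```python
import json


def build_request_body(test_case: dict):
    """Return (body, headers) for a parsed test case (illustrative sketch)."""
    form_body = test_case.get("form_body")
    json_body = test_case.get("json_body")
    # Only one of the two fields may be specified per test case.
    if form_body is not None and json_body is not None:
        raise ValueError("Specify either form_body or json_body, not both")
    if json_body is not None:
        # json_body holds a JSON-serializable object; send it as a JSON payload.
        return json.dumps(json_body), {"Content-Type": "application/json"}
    return form_body, {"Content-Type": "application/x-www-form-urlencoded"}
```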
I'm trying to find out whether kserve supports sharing a GPU among model servers.
What seems promising on this topic is the Model Mesh architecture, where multiple models share the same server. However, it is still in alpha, so I wouldn't count on it for the time being.
A brief description of how to enable MP has been added to LiftWing's Wikitech page, along with a link to this task: https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/KServe
Mon, Jan 30
As discussed within the team, we want to proceed with httpbb, which is a more standard tool for this purpose. The Python script has been uploaded to the inference-services repo for reference and can be used for now until we get httpbb working.
In the patch above I convert the dictionary passed in the form_body field to JSON if the header Content-Type: application/json exists in the request.
Fri, Jan 27
In the attached patch I added a Python script that hits all the deployed models in production and staging and verifies that a proper response is returned (200 status code and the word "probability" in the response text).
If both of the revision ids fail to give a proper response we log an error with the appropriate info. The reason for testing 2 revision ids is that I got some errors in editquality-damaging for plwiki when I used a single rev id, so I thought this was a good "hack" to avoid false positives.
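For reference, a rough sketch of what the check could look like (the URL template, payload format, and helper names are illustrative assumptions; the real script loads the deployed models and revision ids from separate files):

```python
import requests

# Illustrative values; the actual script reads hosts, models, and rev ids from files.
INFERENCE_URL = "https://inference-staging.svc.codfw.wmnet:30443/v1/models/{model}:predict"
REV_IDS = [123456, 789012]  # two revision ids, to reduce false positives


def check_model(model: str) -> bool:
    """Return True if the model answers properly for at least one revision id."""
    for rev_id in REV_IDS:
        resp = requests.post(
            INFERENCE_URL.format(model=model),
            json={"rev_id": rev_id},
            timeout=10,
        )
        # A healthy model server returns HTTP 200 and a body containing "probability".
        if resp.status_code == 200 and "probability" in resp.text:
            return True
    # Both revision ids failed: log an error with the relevant info.
    print(f"ERROR: {model} did not return a valid prediction for rev ids {REV_IDS}")
    return False
```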
I also added two files used by the script:
Thu, Jan 26
Summary: a set of pre-commit hooks has been added to the inference-services repository. The same hooks are run in CI through Jenkins in all the test images.
We use the following hooks:
Wed, Jan 25
Tue, Jan 24
Try to use https://wikitech.wikimedia.org/wiki/Httpbb instead of python/bash scripts.
This task has an overlap with https://phabricator.wikimedia.org/T325528.
In order to solve the errors mentioned previously we need to upgrade numpy to 1.22, which in turn requires kserve to be upgraded to 0.10...
There is a breaking change in kserve 0.10, as the headers object is made available in functions like preprocess and predict,
and we get the following error: TypeError: preprocess() takes 2 positional arguments but 3 were given
Since we extend these classes in our custom models, simply adding a headers argument to the functions seems to do the trick.
Tested with a couple of models (drafttopic - en, ar, cs - the ones we had issues with) and it works.
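A minimal sketch of the signature change (assuming the custom models subclass kserve.Model; the class name and method bodies are illustrative, not the actual inference-services code):

```python
from typing import Dict, Optional

import kserve


class RevscoringModel(kserve.Model):
    """Illustrative subclass; the real models do their work in these steps."""

    def preprocess(self, inputs: Dict, headers: Optional[Dict[str, str]] = None) -> Dict:
        # kserve 0.10 passes the request headers as an extra argument,
        # so the override must accept it even if it is unused.
        return inputs

    def predict(self, request: Dict, headers: Optional[Dict[str, str]] = None) -> Dict:
        return {"predictions": []}
```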
Mon, Jan 23
The PR has been merged and yamlconf has been updated
Added these hooks to all the images hosted in the inference-services repo.
If one wants to install the pre-commit hooks in order to run them locally upon every commit, run the following:
pre-commit install
Otherwise they can be run on an ad-hoc basis by issuing the command:
pre-commit run --all-files
Sure! Just opened a PR https://github.com/halfak/yamlconf/pull/7
There is an issue/blocker on upgrading the python kserve package to 0.9.0 that has to do with its dependencies. Let me explain the chain of dependencies:
Fri, Jan 20
From the set of load tests we ran with wrk and benthos, there seem to be mixed results.
https://phabricator.wikimedia.org/T323624#8468248
The editquality-damaging and editquality-goodfaith models seem to be the only ones that benefit significantly
from employing multi-processing, while the rest of the models (drafttopic, draftquality, articletopic, articlequality) seem to perform worse when we use MP for inference.
For the aforementioned models, when we enable MP only for preprocessing (not for inference) there is a slight improvement, which I believe doesn't justify using more resources.
My overall recommendation would be to enable MP on some of the goodfaith and damaging models that have higher traffic (if any) and leave the rest of the models as is.
Thu, Jan 19
Tue, Jan 17
Figured out a way to make the failing models work by monkey patching the utils.py of the enchant library https://gerrit.wikimedia.org/r/c/machinelearning/liftwing/inference-services/+/870517/13..14
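Roughly, the idea is the following (a minimal sketch, assuming the pickled models reference classes like enchant.utils.UTF16EnchantStr that were removed in pyenchant 3; see the Gerrit diff above for the actual patch):

```python
import enchant.utils


class UTF16EnchantStr(str):
    """Minimal stand-in for a class removed in pyenchant 3, so that models
    pickled with an older pyenchant can still be unpickled (illustrative)."""


# Monkey patch: re-attach the missing attribute before the models are loaded.
if not hasattr(enchant.utils, "UTF16EnchantStr"):
    enchant.utils.UTF16EnchantStr = UTF16EnchantStr
```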
Mon, Jan 16
I would recommend creating some simple pipelines (mlflow, airflow, argo) or just containerizing the training procedure. It seems that we may have to deal with this again in the future.
With a set of scripts we could retrain all the models. However, we would need some effort to compare the new models' performance and see whether it is equivalent to the old ones.
Fri, Jan 13
There are some models which cannot be loaded and throw the above errors.
The reasons are the following:
- An older version of the pyenchant library was used during training, which had some extra classes in the utils.py module, e.g. UTF16EnchantStr. These have all been removed from pyenchant after version 3 in this PR: https://github.com/pyenchant/pyenchant/pull/160
- We could try to use version 2.0.0, which includes them, but it has no wheels for Python 3.9.
Thu, Jan 12
I ran a final set of tests, and I am repasting here the original results for single process (SP), as it is difficult to navigate the results in this thread.
SP - Single process
isaranto@deploy1002:~/scripts$ wrk -c 1 -t 1 --timeout 2s -s inference-drafttopic.lua https://inference-staging.svc.codfw.wmnet:30443/v1/models/enwiki-drafttopic:predict --latency -d 60
Running 1m test @ https://inference-staging.svc.codfw.wmnet:30443/v1/models/enwiki-drafttopic:predict
  1 threads and 1 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   142.07ms   48.22ms 819.86ms   98.85%
    Req/Sec     8.11      2.46    10.00     62.65%
  Latency Distribution
     50%  136.92ms
     75%  139.84ms
     90%  143.65ms
     99%  262.40ms
  431 requests in 1.00m, 1.58MB read
Requests/sec:      7.18
Transfer/sec:     27.01KB

isaranto@deploy1002:~/scripts$ wrk -c 8 -t 4 --timeout 2s -s inference-drafttopic.lua https://inference-staging.svc.codfw.wmnet:30443/v1/models/enwiki-drafttopic:predict --latency -d 60
Running 1m test @ https://inference-staging.svc.codfw.wmnet:30443/v1/models/enwiki-drafttopic:predict
  4 threads and 8 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   375.22ms  111.26ms   1.30s    71.25%
    Req/Sec     6.29      2.96    20.00     39.83%
  Latency Distribution
     50%  360.21ms
     75%  440.38ms
     90%  525.37ms
     99%  664.02ms
  1277 requests in 1.00m, 4.69MB read
Requests/sec:     21.26
Transfer/sec:     80.03KB
MP
Wed, Jan 11
I ran some tests for drafttopic with MP enabled only for inference. I didn't see improvement over SP.
Tue, Jan 10
Mon, Jan 9
Jan 5 2023
I have broken RevscoringModel down into a parent class used for single processing and a child class for MP.
On deployment we need to set the following env vars: PREPROCESS_MP, INFERENCE_MP.
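A minimal sketch of how the selection could look at startup (the env var names come from the comment above; the child class name and parsing logic are illustrative, not the actual repo code):

```python
import os


class RevscoringModel:
    """Parent class: single-process preprocessing and inference (illustrative stub)."""


class RevscoringModelMP(RevscoringModel):
    """Child class that adds multiprocessing (illustrative stub)."""


def env_flag(name: str) -> bool:
    # Env vars are always strings, so compare explicitly against "True"
    # instead of relying on Python truthiness.
    return os.environ.get(name, "False") == "True"


# The two flags decide whether the MP variant is used.
use_mp = env_flag("PREPROCESS_MP") or env_flag("INFERENCE_MP")
model_class = RevscoringModelMP if use_mp else RevscoringModel
```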
Dec 23 2022
As we discussed in our team meeting, we will run some final tests with MP enabled for drafttopic only for inference (and not the preprocessing step) to see how the results look, and then we can draw some conclusions.
The GitHub action in the revscoring repo now uses commands defined in a Makefile, so that the repo and its CI can be easily transferred elsewhere (e.g. GitLab) and our efforts are not tied to the GitHub ecosystem.
make pip-install
make setup-image
make run-tests
The revscoring python package is now tested and built using python 3.9 and the bullseye image in this PR https://github.com/wikimedia/revscoring/pull/531
I closed the previous PR that was using the Ubuntu image (default in Github actions)
Dec 22 2022
Successfully built revscoring with debian bullseye and python 3.9.
The two PRs/patches below need to be merged: first the revscoring one, and then the inference-services one that will use the new revscoring version.
https://gerrit.wikimedia.org/r/c/machinelearning/liftwing/inference-services/+/870517/
https://github.com/wikimedia/revscoring/pull/527
@elukey only the test container was built successfully, but using the branch provided here: https://github.com/wikimedia/revscoring/pull/527
I changed the requirements to git+https://github.com/wikimedia/revscoring.git@feature-add-ci in order to test it. At the moment I am working on the CI in the revscoring repo to make it work with Bullseye.
The test image in the patch https://gerrit.wikimedia.org/r/c/machinelearning/liftwing/inference-services/+/870517/ is working + all tests are successful.
However, the production image still fails because it can't find numpy (?).
Dec 21 2022
Same for other revscoring model repos
https://github.com/wikimedia/draftquality/pull/44
https://github.com/wikimedia/articlequality/pull/175
https://github.com/wikimedia/editquality/pull/238
same for drafttopic repo https://github.com/wikimedia/drafttopic/pull/67
Did it the first way: publish a package whenever we merge a new version of about.py to master.
https://github.com/wikimedia/revscoring/pull/528
Dec 20 2022
The above plots show that we can enable MP for the editquality models if we see fit: it makes them much more stable and keeps latency low even at the 99th percentile.
This is an example action https://github.com/wikimedia/drafttopic/pull/67 that will push to PyPI
Dec 19 2022
Dec 16 2022
Dec 15 2022
Dec 14 2022
The plots below better explain the results of the tests. As already mentioned, they require further investigation, but at the moment it seems that MP out of the box is suitable for the editquality models.
I didn't see any events while describing the pod and the metrics also report lower memory usage than the limit https://grafana.wikimedia.org/d/-D2KNUEGk/kubernetes-pod-details?orgId=1&var-datasource=codfw%20prometheus%2Fk8s-mlstaging&var-namespace=revscoring-drafttopic&var-pod=enwiki-drafttopic-predictor-default-r8njn-deployment-65c9749bn6&from=1670171657541&to=1671037878621
this seems to work!
builder:
  command: ["python3.7", "-m", "nltk.downloader", "omw", "sentiwordnet", "stopwords", "wordnet"]
Since there is only one version of python3 installed we can use python3 instead of python3.7.
I built the revscoring image and tested it. The NLTK_DATA env var is redundant since it is set to /home/user/nltk_data by default.
Dec 12 2022
Results for MP for drafttopic with the increased resources (4GB memory instead of 2) - they don't seem to be any better.
Dec 9 2022
My suggestion to proceed would be the following:
- introduce new image, deploy and test it wherever we want
- deprecate old files and pipelines.
At the moment I have created one image for all revscoring models and managed to run inference through it. We build one image of approx. 1.5GB instead of 4 images, which should potentially speed up our CI/CD process and make it a bit easier.
As you can understand, the changes in this patch are extensive, so it requires thorough QA on our side.
Remaining things:
- merge the patch in the integration/config repo for the new deployment pipeline
- update deployment charts to use the same image for all revscoring models
Dec 6 2022
I didn't see any timeouts from benthos logs and I forgot to mention above that all these metrics are only for response code 200 as read from the kserve/pod logs. Is there someplace else I could figure this out from the logs?
There seems to be a memory usage spike around that time that reaches the pod's limit (2GB): https://grafana.wikimedia.org/d/-D2KNUEGk/kubernetes-pod-details?orgId=1&var-datasource=codfw%20prometheus%2Fk8s-mlstaging&var-namespace=revscoring-drafttopic&var-pod=enwiki-drafttopic-predictor-default-nknc9-deployment-785b6gg8fr&from=1670243139234&to=1670244344075
@elukey Thank you for the explanation. I haven't checked about ray workers but I think it is worth the effort as it seems the "standard" way to do parallel inference with kserve. I agree with your last point that we should use MP only where needed. Perhaps for now it would be sufficient to find what is the proper way to do MP/parallel inference so we can use it when needed.
Dec 5 2022
Dec 2 2022
Editquality with Benthos
Here the results are much better with Multi-processing
SP - Single process
Total duration: 0 days 00:05:00, Total No of requests: 641
50.0% 368.48ms  75.0% 550.88ms  90.0% 1002.84ms  99.0% 2956.55ms
2022-12-02 09:56:58 Minute 1, No of requests: 127
50.0% 385.16ms  75.0% 592.51ms  90.0% 1317.39ms  99.0% 3208.23ms
2022-12-02 09:57:58 Minute 2, No of requests: 128
50.0% 337.26ms  75.0% 478.46ms  90.0% 711.67ms  99.0% 9667.91ms
2022-12-02 09:58:58 Minute 3, No of requests: 74
50.0% 343.19ms  75.0% 526.78ms  90.0% 1187.21ms  99.0% 7547.48ms
2022-12-02 09:59:58 Minute 4, No of requests: 143
50.0% 371.78ms  75.0% 537.08ms  90.0% 968.13ms  99.0% 2799.92ms
2022-12-02 10:00:58 Minute 5, No of requests: 160
50.0% 366.24ms  75.0% 543.2ms  90.0% 898.81ms  99.0% 1978.13ms
MP
Total duration: 0 days 00:00:59, Total No of requests: 593
50.0% 325.58ms  75.0% 393.12ms  90.0% 456.58ms  99.0% 579.06ms
2022-12-02 16:10:02 Minute 1, No of requests: 593
50.0% 325.58ms  75.0% 393.12ms  90.0% 456.58ms  99.0% 579.06ms
2022-12-02 16:11:02 Minute 2, No of requests: 593
50.0% 325.58ms  75.0% 393.12ms  90.0% 456.58ms  99.0% 579.06ms
2022-12-02 16:12:02 Minute 3, No of requests: 593
50.0% 325.58ms  75.0% 393.12ms  90.0% 456.58ms  99.0% 579.06ms
2022-12-02 16:13:02 Minute 4, No of requests: 593
50.0% 325.58ms  75.0% 393.12ms  90.0% 456.58ms  99.0% 579.06ms
2022-12-02 16:14:02 Minute 5, No of requests: 593
50.0% 325.58ms  75.0% 393.12ms  90.0% 456.58ms  99.0% 579.06ms
- SP - Single process
- with MP
editquality-goodfaith
With MP
isaranto@deploy1002:~/scripts$ wrk -c 1 -t 1 --timeout 2s -s inference-goodfaith.lua https://inference-staging.svc.codfw.wmnet:30443/v1/models/enwiki-goodfaith:predict --latency -d 60
Running 1m test @ https://inference-staging.svc.codfw.wmnet:30443/v1/models/enwiki-goodfaith:predict
  1 threads and 1 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   312.50ms   49.80ms 531.37ms   82.81%
    Req/Sec     2.96      0.73     5.00     73.96%
  Latency Distribution
     50%  292.40ms
     75%  299.61ms
     90%  404.56ms
     99%  520.68ms
  192 requests in 1.00m, 72.19KB read
Requests/sec:      3.19
Transfer/sec:      1.20KB

isaranto@deploy1002:~/scripts$ wrk -c 4 -t 2 --timeout 2s -s inference-goodfaith.lua https://inference-staging.svc.codfw.wmnet:30443/v1/models/enwiki-goodfaith:predict --latency -d 60
Running 1m test @ https://inference-staging.svc.codfw.wmnet:30443/v1/models/enwiki-goodfaith:predict
  2 threads and 4 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   405.38ms   76.24ms 708.64ms   72.33%
    Req/Sec     5.76      2.76    10.00     66.85%
  Latency Distribution
     50%  377.24ms
     75%  456.11ms
     90%  509.77ms
     99%  660.63ms
  589 requests in 1.00m, 221.45KB read
Requests/sec:      9.82
Transfer/sec:      3.69KB
Nov 29 2022
And various tests with wrk.
Re-ran the test and edited the previous message. Much better results, and it seems that latency doesn't increase over time as it does in the non-MP version.
Here are the results for the full 20 minutes I ran it:
@elukey you are right. I put it as a boolean, but true in YAML is translated to True in Python, and the comparison is actually against a string, so True == "True" will always be false.
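For illustration (a minimal reproduction of the comparison described above; the variable name is hypothetical):

```python
# What the YAML value `true` becomes in Python vs. what the code compared it to.
mp_enabled = True            # value coming from the YAML config
print(mp_enabled == "True")  # False: a bool never equals a string
```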
I see this in the logs:
[I 221128 16:44:42 model_server:125] Will fork 1 workers
[I 221128 16:44:42 model_server:128] Setting max asyncio worker threads as 5
As I understand it, workers == threads in this case. Could you patch it once again with "True" so that we can check it out?
Regarding resources, I did not see any spikes in CPU while the test was running. The difference in performance between the two tests, though, could be explained by the number of asyncio worker threads: in the first case it is 5 and in the second 9.
Nov 28 2022
Enabled MP and ran on ml-staging with benthos for 5 minutes for revscoring-editquality-goodfaith:
for en wiki
Total duration: 0 days 00:05:00, Total No of requests: 641
50.0% 368.48ms  75.0% 550.88ms  90.0% 1002.84ms  99.0% 2956.55ms
Minute 1, No of requests: 127
50.0% 385.16ms  75.0% 592.51ms  90.0% 1317.39ms  99.0% 3208.23ms
Minute 2, No of requests: 128
50.0% 337.26ms  75.0% 478.46ms  90.0% 711.67ms  99.0% 9667.91ms
Minute 3, No of requests: 74
50.0% 343.19ms  75.0% 526.78ms  90.0% 1187.21ms  99.0% 7547.48ms
Minute 4, No of requests: 143
50.0% 371.78ms  75.0% 537.08ms  90.0% 968.13ms  99.0% 2799.92ms
Minute 5, No of requests: 160
50.0% 366.24ms  75.0% 543.2ms  90.0% 898.81ms  99.0% 1978.13ms
For zh wiki we didn't have the same increase per minute:
Total 5 minute duration:
50.0% 289.17ms  75.0% 548.68ms  90.0% 1049.54ms  99.0% 8628.5ms
Broken down by minute:
Minute 1, No of requests: 9
50.0% 232.0ms  75.0% 702.35ms  90.0% 2802.58ms  99.0% 9238.11ms
Minute 2, No of requests: 9
50.0% 397.93ms  75.0% 826.5ms  90.0% 2538.78ms  99.0% 6778.66ms
Minute 3, No of requests: 14
50.0% 248.38ms  75.0% 522.77ms  90.0% 697.5ms  99.0% 1471.2ms
Minute 4, No of requests: 5
50.0% 241.56ms  75.0% 479.28ms  90.0% 1080.67ms  99.0% 1441.51ms
Minute 5, No of requests: 12
50.0% 288.67ms  75.0% 496.1ms  90.0% 548.94ms  99.0% 919.24ms
Nov 25 2022
Checked en-wiki-revscoring-editquality-goodfaith with benthos and wrk: