
isarantopoulos (Ilias Sarantopoulos)
User

Projects

User does not belong to any projects.

User Details

User Since
Nov 1 2022, 12:34 PM (13 w, 5 d)
Availability
Available
LDAP User
Ilias Sarantopoulos
MediaWiki User
Isarantopoulos [ Global Accounts ]

Recent Activity

Fri, Feb 3

isarantopoulos added a comment to T328494: WikiGPT Experiment.

@calbon I couldn't find a user that matches you (calbon or Chris Albon). @kevinbazira any luck?

Fri, Feb 3, 3:17 PM · Machine-Learning-Team

Thu, Feb 2

isarantopoulos moved T325657: [revscoring] Upgrade python from 3.7 to 3.9 in docker images from In Progress to Done on the Machine-Learning-Team board.
Thu, Feb 2, 4:09 PM · Patch-For-Review, Machine-Learning-Team
isarantopoulos added a comment to T325657: [revscoring] Upgrade python from 3.7 to 3.9 in docker images .

All revscoring model servers have been successfully upgraded to Python 3.9.2 and Debian Bullseye. 🎉
As part of this ticket we also resolved the revscoring package's security vulnerabilities.

Thu, Feb 2, 4:09 PM · Patch-For-Review, Machine-Learning-Team

Tue, Jan 31

isarantopoulos added a comment to T328120: httpbb doesn't support integers in the POST's body.

@elukey I closed this task since your change has already been merged and deployed.

Tue, Jan 31, 3:32 PM · Machine-Learning-Team, SRE-tools, Infrastructure-Foundations
isarantopoulos moved T328280: httpbb with HTTP POSTs and json payload from Unsorted to Watching on the Machine-Learning-Team board.
Tue, Jan 31, 3:21 PM · SRE-tools, Infrastructure-Foundations, Machine-Learning-Team
isarantopoulos moved T328120: httpbb doesn't support integers in the POST's body from In Progress to Done on the Machine-Learning-Team board.
Tue, Jan 31, 3:08 PM · Machine-Learning-Team, SRE-tools, Infrastructure-Foundations
isarantopoulos reassigned T328120: httpbb doesn't support integers in the POST's body from isarantopoulos to elukey.
Tue, Jan 31, 3:08 PM · Machine-Learning-Team, SRE-tools, Infrastructure-Foundations
isarantopoulos added a comment to T328280: httpbb with HTTP POSTs and json payload.

After discussing during the review with @RLazarus we went with the second approach.
In the aforementioned patch the tests support a json_body field in which we pass a json serializable object and the request is altered.
Only one of the form_body or json_body fields can be specified, something which is validated upon parsing the test cases.
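
The rule described above can be illustrated with a small Python sketch. This is not httpbb's actual code: the function name `build_request_body` and the dictionary layout of a test case are assumptions made for illustration, and only the `form_body` and `json_body` field names come from the comment.

```python
import json
from urllib.parse import urlencode


def build_request_body(test_case: dict):
    """Return (body, content_type) for a hypothetical test-case dict.

    Raises ValueError when both form_body and json_body are present,
    mirroring the "only one of the two" rule described above.
    """
    has_form = "form_body" in test_case
    has_json = "json_body" in test_case
    if has_form and has_json:
        raise ValueError("specify only one of form_body or json_body")
    if has_json:
        # Any JSON-serializable object is accepted, so integers keep their type.
        return json.dumps(test_case["json_body"]), "application/json"
    if has_form:
        return urlencode(test_case["form_body"]), "application/x-www-form-urlencoded"
    return None, None
```

Rejecting the ambiguous case at parse time means a misconfigured test fails before any request is sent.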

Tue, Jan 31, 2:03 PM · SRE-tools, Infrastructure-Foundations, Machine-Learning-Team
isarantopoulos claimed T328120: httpbb doesn't support integers in the POST's body.
Tue, Jan 31, 1:59 PM · Machine-Learning-Team, SRE-tools, Infrastructure-Foundations
isarantopoulos moved T328120: httpbb doesn't support integers in the POST's body from Unsorted to In Progress on the Machine-Learning-Team board.
Tue, Jan 31, 1:58 PM · Machine-Learning-Team, SRE-tools, Infrastructure-Foundations
isarantopoulos claimed T328280: httpbb with HTTP POSTs and json payload.
Tue, Jan 31, 1:58 PM · SRE-tools, Infrastructure-Foundations, Machine-Learning-Team
isarantopoulos added a comment to T327923: Get a GPU on Lift Wing.

I'm trying to find out whether kserve supports sharing a GPU among model servers.
What seems promising on this topic is the Model Mesh architecture, where multiple models share the same server. However, it is still in an alpha version, so I wouldn't count on it for the time being.

Tue, Jan 31, 1:56 PM · Machine-Learning-Team
isarantopoulos moved T328438: [outlink] Upgrade python from 3.7 to 3.9 in docker images from Unsorted to Backlog/Lift Wing on the Machine-Learning-Team board.
Tue, Jan 31, 1:25 PM · Machine-Learning-Team
isarantopoulos moved T328439: [revertrisk] Upgrade python from 3.7 to 3.9 in docker images from Unsorted to Backlog/Lift Wing on the Machine-Learning-Team board.
Tue, Jan 31, 1:25 PM · Machine-Learning-Team
isarantopoulos added a project to T328439: [revertrisk] Upgrade python from 3.7 to 3.9 in docker images: Machine-Learning-Team.
Tue, Jan 31, 1:25 PM · Machine-Learning-Team
isarantopoulos created T328439: [revertrisk] Upgrade python from 3.7 to 3.9 in docker images.
Tue, Jan 31, 1:24 PM · Machine-Learning-Team
isarantopoulos updated the task description for T325657: [revscoring] Upgrade python from 3.7 to 3.9 in docker images .
Tue, Jan 31, 1:23 PM · Patch-For-Review, Machine-Learning-Team
isarantopoulos created T328438: [outlink] Upgrade python from 3.7 to 3.9 in docker images .
Tue, Jan 31, 1:23 PM · Machine-Learning-Team
isarantopoulos renamed T325657: [revscoring] Upgrade python from 3.7 to 3.9 in docker images from Upgrade python from 3.7 to 3.9 in docker images to [revscoring] Upgrade python from 3.7 to 3.9 in docker images .
Tue, Jan 31, 1:20 PM · Patch-For-Review, Machine-Learning-Team
isarantopoulos moved T323624: Test revscoring model servers on Lift Wing from In Progress to Done on the Machine-Learning-Team board.
Tue, Jan 31, 1:19 PM · Machine-Learning-Team
isarantopoulos added a comment to T323624: Test revscoring model servers on Lift Wing.

A brief description on how to enable MP has been added on LiftWing's Wikitech page along with a link to this task https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/KServe

Tue, Jan 31, 1:18 PM · Machine-Learning-Team

Mon, Jan 30

isarantopoulos added a comment to T327787: [Liftwing testing] - Post deployment testing.

As discussed within the team we want to proceed with httpbb which is a more standard tool for this purpose. The python script has been uploaded to inference services repo for reference and can be used for now until we make httpbb work.

Mon, Jan 30, 4:52 PM · Patch-For-Review, Machine-Learning-Team
isarantopoulos added a comment to T328280: httpbb with HTTP POSTs and json payload.

In the patch above I convert the dictionary passed in the form_body field to JSON if the header Content-Type: application/json exists in the request.
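
A minimal sketch of this header-driven approach, assuming the request headers are available as a plain dict. `encode_form_body` is a hypothetical helper and not a function from httpbb.

```python
import json
from urllib.parse import urlencode


def encode_form_body(form_body: dict, headers: dict) -> str:
    """Serialize form_body as JSON when the request declares a JSON content type."""
    # Header names are case-insensitive, so look the header up accordingly.
    content_type = next(
        (value for name, value in headers.items() if name.lower() == "content-type"),
        "",
    )
    # Ignore parameters such as "; charset=utf-8" when comparing.
    if content_type.split(";")[0].strip().lower() == "application/json":
        return json.dumps(form_body)
    return urlencode(form_body)
```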

Mon, Jan 30, 4:04 PM · SRE-tools, Infrastructure-Foundations, Machine-Learning-Team

Fri, Jan 27

isarantopoulos added a comment to T327787: [Liftwing testing] - Post deployment testing.

In the attached patch I added a python script that hits all the deployed models in production and staging and verifies that a proper response is returned (200 status code and the word probability in the text).
If both of the revision ids fail to give a proper response we log an error with the appropriate info. The reason for testing 2 revision ids is that I got some errors in editquality damaging pl wiki when I used one rev id, so I thought this was a good "hack" to avoid false positives.
I also added two files used by the script:
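
Independently of those files, the check itself could be sketched as below. This is an illustration under assumptions and not the uploaded script: `fetch` is a hypothetical callable that returns a status code and response text for a revision id, so the example runs without network access.

```python
from typing import Callable, Iterable, Tuple

Response = Tuple[int, str]  # (HTTP status code, response text)


def is_proper_response(status: int, text: str) -> bool:
    # A response counts as proper when it is a 200 and mentions "probability".
    return status == 200 and "probability" in text


def model_is_healthy(fetch: Callable[[int], Response], rev_ids: Iterable[int]) -> bool:
    """Return True if at least one revision id yields a proper response.

    Trying more than one revision id avoids flagging a model as broken
    because of a single problematic revision.
    """
    return any(is_proper_response(*fetch(rev_id)) for rev_id in rev_ids)
```

An error would be logged only when `model_is_healthy` returns False, that is, when every revision id failed.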

Fri, Jan 27, 4:38 PM · Patch-For-Review, Machine-Learning-Team

Thu, Jan 26

isarantopoulos moved T327787: [Liftwing testing] - Post deployment testing from Unsorted to In Progress on the Machine-Learning-Team board.
Thu, Jan 26, 5:37 PM · Patch-For-Review, Machine-Learning-Team
isarantopoulos committed rMLISb16fab5402e1: docs: add pre-commit info in README.md (authored by isarantopoulos).
docs: add pre-commit info in README.md
Thu, Jan 26, 3:14 PM
isarantopoulos moved T325198: Create a pre-commit hook for inference-services repo from In Progress to Done on the Machine-Learning-Team board.
Thu, Jan 26, 2:19 PM · Machine-Learning-Team
isarantopoulos added a comment to T325198: Create a pre-commit hook for inference-services repo.

Summary: a set of pre-commit hooks has been added to the inference-services repository. The same hooks are run in CI through Jenkins in all the test images.
We use the following hooks:

Thu, Jan 26, 10:50 AM · Machine-Learning-Team

Wed, Jan 25

isarantopoulos committed rMLISe00dd25bba7e: update nltk-dependency (authored by isarantopoulos).
update nltk-dependency
Wed, Jan 25, 12:20 PM

Tue, Jan 24

isarantopoulos added a comment to T327787: [Liftwing testing] - Post deployment testing.

Try to use https://wikitech.wikimedia.org/wiki/Httpbb instead of python/bash scripts.

Tue, Jan 24, 3:13 PM · Patch-For-Review, Machine-Learning-Team
isarantopoulos committed rMLIS7cb0e0a94d0a: ci: add pre-commit checks in all images (authored by isarantopoulos).
ci: add pre-commit checks in all images
Tue, Jan 24, 2:18 PM
isarantopoulos moved T325198: Create a pre-commit hook for inference-services repo from In Progress to Done on the Machine-Learning-Team board.
Tue, Jan 24, 2:14 PM · Machine-Learning-Team
isarantopoulos created T327787: [Liftwing testing] - Post deployment testing.
Tue, Jan 24, 1:58 PM · Patch-For-Review, Machine-Learning-Team
isarantopoulos added a comment to T325657: [revscoring] Upgrade python from 3.7 to 3.9 in docker images .

This task has an overlap with https://phabricator.wikimedia.org/T325528.
In order to solve the errors mentioned previously we need to upgrade numpy to 1.22 which in turn requires kserve to be upgraded to 0.10...

Tue, Jan 24, 1:43 PM · Patch-For-Review, Machine-Learning-Team
isarantopoulos committed rMLIS0d5c8ca4e8c5: feat: revscoring kserve upgrade to 0.10 (authored by isarantopoulos).
feat: revscoring kserve upgrade to 0.10
Tue, Jan 24, 12:52 PM
isarantopoulos added a comment to T325528: Upgrade ml clusters to kserve 0.9.

There is a breaking change in kserve 0.10: the headers object is now made available in functions like preprocess and predict,
and we get the following error: TypeError: preprocess() takes 2 positional arguments but 3 were given
Since we extend these classes through our custom models, simply adding a headers argument to the functions seems to do the trick.
Tested with a couple of models (drafttopic - en, ar, cs - the ones that we had issues with) and it works.
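
The failure mode and the fix can be reproduced without kserve. In the sketch below, `BaseModel` is a stand-in written for illustration and is not the real kserve class; it only imitates a framework that passes headers to each step.

```python
from typing import Dict, Optional


class BaseModel:
    """Stand-in for a serving framework's base model (not the real kserve class)."""

    def __call__(self, payload: dict, headers: Optional[Dict[str, str]] = None):
        # The framework forwards the request headers to each step.
        inputs = self.preprocess(payload, headers)
        return self.predict(inputs, headers)


class OldStyleModel(BaseModel):
    # Signatures written before headers were passed in.
    def preprocess(self, payload):
        return payload

    def predict(self, inputs):
        return {"echo": inputs}


class UpdatedModel(BaseModel):
    # Accepting an optional headers argument keeps the override compatible.
    def preprocess(self, payload, headers=None):
        return payload

    def predict(self, inputs, headers=None):
        return {"echo": inputs}
```

Calling `OldStyleModel()` with a payload raises a `TypeError` about the number of positional arguments, while `UpdatedModel()` handles the same call.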

Tue, Jan 24, 9:42 AM · Patch-For-Review, Machine-Learning-Team

Mon, Jan 23

isarantopoulos added a comment to T325657: [revscoring] Upgrade python from 3.7 to 3.9 in docker images .

The PR has been merged and yamlconf has been updated

Mon, Jan 23, 4:29 PM · Patch-For-Review, Machine-Learning-Team
isarantopoulos added a comment to T325198: Create a pre-commit hook for inference-services repo.

Added these hooks to all the images hosted in the inference-services repo.
To install the pre-commit hooks so that they run locally on every commit, run the following:
pre-commit install
Otherwise they can be run on an ad-hoc basis with the command:
pre-commit run --all-files

Mon, Jan 23, 4:21 PM · Machine-Learning-Team
isarantopoulos added a comment to T325528: Upgrade ml clusters to kserve 0.9.

Sure! Just opened a PR https://github.com/halfak/yamlconf/pull/7

Mon, Jan 23, 2:29 PM · Patch-For-Review, Machine-Learning-Team
isarantopoulos added a comment to T325528: Upgrade ml clusters to kserve 0.9.

There is an issue/blocker on upgrading the python kserve package to 0.9.0 that has to do with its dependencies. Let me explain the chain of dependencies:

Mon, Jan 23, 2:18 PM · Patch-For-Review, Machine-Learning-Team

Fri, Jan 20

isarantopoulos moved T323624: Test revscoring model servers on Lift Wing from In Progress to Done on the Machine-Learning-Team board.
Fri, Jan 20, 4:43 PM · Machine-Learning-Team
isarantopoulos committed rMLIS10da4ffb5c77: Upgrade the revscoring model server to Python 3.9 (authored by elukey).
Upgrade the revscoring model server to Python 3.9
Fri, Jan 20, 3:04 PM
isarantopoulos added a comment to T323624: Test revscoring model servers on Lift Wing.

Based on the set of load tests we ran with wrk and benthos, the results seem to be mixed.
https://phabricator.wikimedia.org/T323624#8468248
The editquality-damaging and editquality-goodfaith models seem to be the only ones that benefit significantly
from multi-processing, while the rest of the models (drafttopic, draftquality, articletopic, articlequality) seem to perform worse when we use MP for inference.
For the latter models, enabling MP only for preprocessing (not for inference) gives a slight improvement, which I believe doesn't justify using more resources.
My overall recommendation would be to enable MP on some of the goodfaith and damaging models that have higher traffic (if any) and leave the rest of the models as is.

Fri, Jan 20, 10:56 AM · Machine-Learning-Team

Thu, Jan 19

isarantopoulos committed rMLIS15682889a220: pre-commit: make required changes (authored by isarantopoulos).
pre-commit: make required changes
Thu, Jan 19, 5:05 PM

Tue, Jan 17

isarantopoulos added a comment to T325657: [revscoring] Upgrade python from 3.7 to 3.9 in docker images .

Figured out a way to make the failing models work by monkey patching the utils.py of the enchant library https://gerrit.wikimedia.org/r/c/machinelearning/liftwing/inference-services/+/870517/13..14
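
The general technique can be shown with a self-contained sketch. It does not reproduce the linked patch: `demo_utils` and `LegacyStr` are hypothetical stand-ins for a library module and a class that a newer release dropped, and the example assumes the failure comes from unpickling an object that references the missing class.

```python
import pickle
import sys
import types

# Hypothetical stand-in for a third-party module such as a library's utils.py.
demo_utils = types.ModuleType("demo_utils")
sys.modules["demo_utils"] = demo_utils


def make_legacy_class():
    # Build the class with type() so pickle can find it as demo_utils.LegacyStr.
    return type("LegacyStr", (str,), {"__module__": "demo_utils"})


# "Old" library version: the class exists, and a model artifact pickles it.
demo_utils.LegacyStr = make_legacy_class()
artifact = pickle.dumps(demo_utils.LegacyStr("hello"))

# "New" library version: the class has been removed, so loading fails.
del demo_utils.LegacyStr
try:
    pickle.loads(artifact)
    load_failed = False
except AttributeError:
    load_failed = True

# Monkey patch: put a compatible class back before unpickling.
demo_utils.LegacyStr = make_legacy_class()
restored = pickle.loads(artifact)
```

The patch has to run before the artifact is loaded, which is why such shims usually live at the top of the model server's entry point.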

Tue, Jan 17, 3:21 PM · Patch-For-Review, Machine-Learning-Team

Mon, Jan 16

isarantopoulos added a comment to T325657: [revscoring] Upgrade python from 3.7 to 3.9 in docker images .

I would recommend creating some simple pipelines (mlflow, airflow, argo) or just containerizing the training procedure. It seems that we may have to deal with this again in the future.
With a set of scripts we could retrain all the models. However, we need some effort in order to compare the models' performance to see if it is equivalent to the old ones.

Mon, Jan 16, 4:21 PM · Patch-For-Review, Machine-Learning-Team
isarantopoulos committed rMLISe3c12de48942: Add pre-commit hooks (authored by isarantopoulos).
Add pre-commit hooks
Mon, Jan 16, 4:46 AM

Fri, Jan 13

isarantopoulos added a comment to T325657: [revscoring] Upgrade python from 3.7 to 3.9 in docker images .

There are some models which cannot be loaded and throw the above errors.
The reasons are the following:

  • An older version of the pyenchant library was used during training, which has some extra classes in utils.py, including UTF16EnchantStr. These were all removed in an older PR on pyenchant after version 3: https://github.com/pyenchant/pyenchant/pull/160
  • We could try to use version 2.0.0, which includes them, but it has no wheels for python 3.9.
Fri, Jan 13, 5:05 PM · Patch-For-Review, Machine-Learning-Team

Thu, Jan 12

isarantopoulos added a comment to T323624: Test revscoring model servers on Lift Wing.

I ran a final set of tests and I am re-pasting here the original results for single process (SP), as it is difficult to navigate the results in this thread.
SP - Single process

isaranto@deploy1002:~/scripts$ wrk -c 1 -t 1 --timeout 2s -s inference-drafttopic.lua https://inference-staging.svc.codfw.wmnet:30443/v1/models/enwiki-drafttopic:predict --latency -d 60
Running 1m test @ https://inference-staging.svc.codfw.wmnet:30443/v1/models/enwiki-drafttopic:predict
  1 threads and 1 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   142.07ms   48.22ms 819.86ms   98.85%
    Req/Sec     8.11      2.46    10.00     62.65%
  Latency Distribution
     50%  136.92ms
     75%  139.84ms
     90%  143.65ms
     99%  262.40ms
  431 requests in 1.00m, 1.58MB read
Requests/sec:      7.18
Transfer/sec:     27.01KB
isaranto@deploy1002:~/scripts$ wrk -c 8 -t 4 --timeout 2s -s inference-drafttopic.lua https://inference-staging.svc.codfw.wmnet:30443/v1/models/enwiki-drafttopic:predict --latency -d 60
Running 1m test @ https://inference-staging.svc.codfw.wmnet:30443/v1/models/enwiki-drafttopic:predict
  4 threads and 8 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   375.22ms  111.26ms   1.30s    71.25%
    Req/Sec     6.29      2.96    20.00     39.83%
  Latency Distribution
     50%  360.21ms
     75%  440.38ms
     90%  525.37ms
     99%  664.02ms
  1277 requests in 1.00m, 4.69MB read
Requests/sec:     21.26
Transfer/sec:     80.03KB

MP

Thu, Jan 12, 10:55 AM · Machine-Learning-Team

Wed, Jan 11

isarantopoulos added a comment to T323624: Test revscoring model servers on Lift Wing.

I ran some tests for drafttopic with MP enabled only for inference. I didn't see improvement over SP.

Wed, Jan 11, 4:19 PM · Machine-Learning-Team

Tue, Jan 10

isarantopoulos committed rMLISd31f2aaa21df: Create different objects for MP/SP (authored by isarantopoulos).
Create different objects for MP/SP
Tue, Jan 10, 12:40 PM

Mon, Jan 9

isarantopoulos claimed T325657: [revscoring] Upgrade python from 3.7 to 3.9 in docker images .
Mon, Jan 9, 1:15 PM · Patch-For-Review, Machine-Learning-Team
isarantopoulos moved T325657: [revscoring] Upgrade python from 3.7 to 3.9 in docker images from Unsorted to In Progress on the Machine-Learning-Team board.
Mon, Jan 9, 1:14 PM · Patch-For-Review, Machine-Learning-Team
isarantopoulos moved T325657: [revscoring] Upgrade python from 3.7 to 3.9 in docker images from In Progress to Unsorted on the Machine-Learning-Team board.
Mon, Jan 9, 10:30 AM · Patch-For-Review, Machine-Learning-Team
isarantopoulos moved T325657: [revscoring] Upgrade python from 3.7 to 3.9 in docker images from Unsorted to In Progress on the Machine-Learning-Team board.
Mon, Jan 9, 10:30 AM · Patch-For-Review, Machine-Learning-Team

Jan 5 2023

isarantopoulos added a comment to T323624: Test revscoring model servers on Lift Wing.

I have broken down the RevscoringModel into a parent class used for single processing and a child class for MP.
On deployment we need to input the following env vars: PREPROCESS_MP, INFERENCE_MP.
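
One way such a split could look is sketched below. Only the `RevscoringModel` name and the two env var names come from the comment. The child class name, the stand-in work functions, the worker count and the assumption that each flag is passed as the literal string "True" are all hypothetical.

```python
import os
from concurrent.futures import ProcessPoolExecutor


def _extract(rev_id):
    # Stand-in for CPU-bound feature extraction.
    return [rev_id % 7, rev_id % 11]


def _score(features):
    # Stand-in for CPU-bound model scoring.
    return sum(features)


class RevscoringModel:
    """Single-process variant: both steps run in the calling process."""

    def preprocess(self, rev_id):
        return _extract(rev_id)

    def predict(self, features):
        return _score(features)


class RevscoringModelMP(RevscoringModel):
    """Multi-process variant: selected steps run in a worker pool."""

    def __init__(self, preprocess_mp, inference_mp, workers=2):
        self.preprocess_mp = preprocess_mp
        self.inference_mp = inference_mp
        self.workers = workers
        self._pool = None  # created lazily on first use

    def _run(self, fn, arg):
        if self._pool is None:
            self._pool = ProcessPoolExecutor(max_workers=self.workers)
        return self._pool.submit(fn, arg).result()

    def preprocess(self, rev_id):
        if self.preprocess_mp:
            return self._run(_extract, rev_id)
        return super().preprocess(rev_id)

    def predict(self, features):
        if self.inference_mp:
            return self._run(_score, features)
        return super().predict(features)


def env_enabled(name):
    # Assumes the flag is passed as the literal string "True".
    return os.environ.get(name) == "True"


def build_model():
    preprocess_mp = env_enabled("PREPROCESS_MP")
    inference_mp = env_enabled("INFERENCE_MP")
    if preprocess_mp or inference_mp:
        return RevscoringModelMP(preprocess_mp, inference_mp)
    return RevscoringModel()
```

Keeping the single-process path in the parent class means deployments that set neither variable never create a process pool.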

Jan 5 2023, 5:09 PM · Machine-Learning-Team

Dec 23 2022

isarantopoulos added a comment to T323624: Test revscoring model servers on Lift Wing.

As we discussed in our team meeting, we will run some final tests with MP enabled for drafttopic only for inference (and not the preprocessing step) to see what the results look like, and then we can draw some conclusions.

Dec 23 2022, 2:44 PM · Machine-Learning-Team
isarantopoulos added a comment to T325657: [revscoring] Upgrade python from 3.7 to 3.9 in docker images .

The GitHub action in the revscoring repo now uses commands defined in a Makefile, so that the repo and its CI can be easily transferred elsewhere (e.g. GitLab) and our efforts are not tied to the GitHub ecosystem.

make pip-install
make setup-image
make run-tests

The revscoring python package is now tested and built using python 3.9 and the bullseye image in this PR https://github.com/wikimedia/revscoring/pull/531
I closed the previous PR that was using the Ubuntu image (default in Github actions)

Dec 23 2022, 2:40 PM · Patch-For-Review, Machine-Learning-Team

Dec 22 2022

isarantopoulos added a comment to T325657: [revscoring] Upgrade python from 3.7 to 3.9 in docker images .

Successfully built revscoring with debian bullseye and python 3.9.
The below two PR/patches need to be merged (first the revscoring one) and then the inference-services one that will use the new revscoring version
https://gerrit.wikimedia.org/r/c/machinelearning/liftwing/inference-services/+/870517/
https://github.com/wikimedia/revscoring/pull/527

Dec 22 2022, 3:53 PM · Patch-For-Review, Machine-Learning-Team
isarantopoulos added a comment to T325657: [revscoring] Upgrade python from 3.7 to 3.9 in docker images .

@elukey only the test container was built successfully but using the branch provided in here https://github.com/wikimedia/revscoring/pull/527
I changed the requirements to git+https://github.com/wikimedia/revscoring.git@feature-add-ci in order to test it. At the moment I am working on CI in revscoring repo to make it work with bullseye

Dec 22 2022, 2:29 PM · Patch-For-Review, Machine-Learning-Team
isarantopoulos added a comment to T325657: [revscoring] Upgrade python from 3.7 to 3.9 in docker images .

The test image in the patch https://gerrit.wikimedia.org/r/c/machinelearning/liftwing/inference-services/+/870517/ is working + all tests are successful.
However production image still fails because it can't find numpy (?).

Dec 22 2022, 11:37 AM · Patch-For-Review, Machine-Learning-Team

Dec 21 2022

isarantopoulos moved T325561: Automate publishing python packages to PyPI from In Progress to Done on the Machine-Learning-Team board.
Dec 21 2022, 3:01 PM · Machine-Learning-Team
isarantopoulos claimed T325561: Automate publishing python packages to PyPI.
Dec 21 2022, 3:01 PM · Machine-Learning-Team
isarantopoulos added a comment to T325561: Automate publishing python packages to PyPI.

Same for other revscoring model repos
https://github.com/wikimedia/draftquality/pull/44
https://github.com/wikimedia/articlequality/pull/175
https://github.com/wikimedia/editquality/pull/238

Dec 21 2022, 2:54 PM · Machine-Learning-Team
isarantopoulos committed rMLISfde2a4d8f8b6: revertrisk: add torch==1.13.1+cpu (authored by achou).
revertrisk: add torch==1.13.1+cpu
Dec 21 2022, 1:50 PM
isarantopoulos committed rDRAFTTOPIC2b5265d5f850: feat: add PYPI index GH action (authored by isarantopoulos).
feat: add PYPI index GH action
Dec 21 2022, 1:25 PM
isarantopoulos added a comment to T325561: Automate publishing python packages to PyPI.

same for drafttopic repo https://github.com/wikimedia/drafttopic/pull/67

Dec 21 2022, 1:24 PM · Machine-Learning-Team
isarantopoulos added a comment to T325561: Automate publishing python packages to PyPI.

Done the first way: we publish a package whenever we merge a new version of about.py to master
https://github.com/wikimedia/revscoring/pull/528

Dec 21 2022, 11:05 AM · Machine-Learning-Team

Dec 20 2022

isarantopoulos created T325657: [revscoring] Upgrade python from 3.7 to 3.9 in docker images .
Dec 20 2022, 4:05 PM · Patch-For-Review, Machine-Learning-Team
isarantopoulos added a comment to T323624: Test revscoring model servers on Lift Wing.

The above plots show that we can enable MP for editquality models if we see fit: it makes them much more stable and keeps latency low even at the 99th percentile.

Dec 20 2022, 2:25 PM · Machine-Learning-Team
isarantopoulos added a comment to T325561: Automate publishing python packages to PyPI.

This is an example action https://github.com/wikimedia/drafttopic/pull/67 that will push to PyPI

Dec 20 2022, 1:44 PM · Machine-Learning-Team
isarantopoulos committed rDRAFTTOPIC9b327cf5aab0: test commit (authored by isarantopoulos).
test commit
Dec 20 2022, 9:59 AM
isarantopoulos committed rDRAFTTOPIC9e5f997be9c3: feat: add publish to pypi index (authored by isarantopoulos).
feat: add publish to pypi index
Dec 20 2022, 9:54 AM
isarantopoulos committed rDRAFTTOPICfd1516997c07: feat: add publish to pypi index (authored by isarantopoulos).
feat: add publish to pypi index
Dec 20 2022, 9:52 AM

Dec 19 2022

isarantopoulos created T325561: Automate publishing python packages to PyPI.
Dec 19 2022, 4:18 PM · Machine-Learning-Team

Dec 16 2022

isarantopoulos moved T323586: Reduce number of published docker images for revscoring models from In Progress to Done on the Machine-Learning-Team board.
Dec 16 2022, 2:15 PM · Machine-Learning-Team
isarantopoulos committed rMLIS4ae9054e24bd: revscoring: delete individual revscoring images (authored by isarantopoulos).
revscoring: delete individual revscoring images
Dec 16 2022, 9:48 AM

Dec 15 2022

isarantopoulos updated the task description for T325295: Enrich revertrisk image tag with model's package version.
Dec 15 2022, 2:19 PM · Machine-Learning-Team
isarantopoulos created T325295: Enrich revertrisk image tag with model's package version.
Dec 15 2022, 2:17 PM · Machine-Learning-Team
isarantopoulos committed rMLIS3b953485cc99: blubber: create universal revscoring image (authored by isarantopoulos).
blubber: create universal revscoring image
Dec 15 2022, 1:22 PM

Dec 14 2022

isarantopoulos added a comment to T323624: Test revscoring model servers on Lift Wing.

The plots below better explain the results of the tests. As already mentioned, they require further investigation, but at the moment it seems that MP out of the box is suitable for editquality models.

[Attachments: six plots, each uploaded as image.png]

Dec 14 2022, 5:20 PM · Machine-Learning-Team
isarantopoulos added a comment to T323624: Test revscoring model servers on Lift Wing.

I didn't see any events while describing the pod and the metrics also report lower memory usage than the limit https://grafana.wikimedia.org/d/-D2KNUEGk/kubernetes-pod-details?orgId=1&var-datasource=codfw%20prometheus%2Fk8s-mlstaging&var-namespace=revscoring-drafttopic&var-pod=enwiki-drafttopic-predictor-default-r8njn-deployment-65c9749bn6&from=1670171657541&to=1671037878621

Dec 14 2022, 5:14 PM · Machine-Learning-Team
isarantopoulos created T325198: Create a pre-commit hook for inference-services repo.
Dec 14 2022, 4:35 PM · Machine-Learning-Team
isarantopoulos added a comment to T324658: Remove hack from ML's blubber files.

this seems to work!

builder:
  command: ["python3.7", "-m", "nltk.downloader", "omw", "sentiwordnet", "stopwords", "wordnet"]

Since there is only one version of python3 installed, we can use python3 instead of python3.7.
I built the revscoring image and tested it. The NLTK_DATA env var is redundant since it is set to /home/user/nltk_data by default.

Dec 14 2022, 12:22 PM · Machine-Learning-Team

Dec 12 2022

isarantopoulos added a comment to T323624: Test revscoring model servers on Lift Wing.

Results for MP for drafttopic with the increased resources (4GB memory instead of 2): they don't seem to be any better

Dec 12 2022, 8:50 AM · Machine-Learning-Team

Dec 9 2022

isarantopoulos added a comment to T323586: Reduce number of published docker images for revscoring models.

My suggestion to proceed would be the following:

  • introduce new image, deploy and test it wherever we want
  • deprecate old files and pipelines.
Dec 9 2022, 2:29 PM · Machine-Learning-Team
isarantopoulos added a comment to T323586: Reduce number of published docker images for revscoring models.

At the moment I have created one image for all revscoring models and managed to run inference through that. We build one image of approx. 1.5GB instead of 4 images, which should speed up our CI/CD process and make it a bit easier.
As you can see, this patch contains many changes, so it requires extensive QA on our side.
Remaining things:

  • merge the patch in the integration/config repo for the new deployment pipeline
  • update deployment charts to use the same image for all revscoring models
Dec 9 2022, 2:19 PM · Machine-Learning-Team

Dec 6 2022

isarantopoulos added a comment to T323624: Test revscoring model servers on Lift Wing.

I didn't see any timeouts from benthos logs and I forgot to mention above that all these metrics are only for response code 200 as read from the kserve/pod logs. Is there someplace else I could figure this out from the logs?
There seems to be memory usage around that time that reaches the pod's limit (2GB) https://grafana.wikimedia.org/d/-D2KNUEGk/kubernetes-pod-details?orgId=1&var-datasource=codfw%20prometheus%2Fk8s-mlstaging&var-namespace=revscoring-drafttopic&var-pod=enwiki-drafttopic-predictor-default-nknc9-deployment-785b6gg8fr&from=1670243139234&to=1670244344075

Dec 6 2022, 2:44 PM · Machine-Learning-Team
isarantopoulos added a comment to T323624: Test revscoring model servers on Lift Wing.

@elukey Thank you for the explanation. I haven't checked about ray workers but I think it is worth the effort as it seems the "standard" way to do parallel inference with kserve. I agree with your last point that we should use MP only where needed. Perhaps for now it would be sufficient to find what is the proper way to do MP/parallel inference so we can use it when needed.

Dec 6 2022, 9:26 AM · Machine-Learning-Team

Dec 5 2022

isarantopoulos committed rMLIS7768c423964c: asyncio: cast asyncio_aux_workers env var to int on read (authored by isarantopoulos).
asyncio: cast asyncio_aux_workers env var to int on read
Dec 5 2022, 9:14 AM

Dec 2 2022

isarantopoulos added a comment to T323624: Test revscoring model servers on Lift Wing.

Editquality with Benthos

Here the results are much better with Multi-processing

SP - Single process
Total duration:0 days 00:05:00, Total No of requests: 641
	50.0% 368.48ms
	75.0% 550.88ms
	90.0% 1002.84ms
	99.0% 2956.55ms
2022-12-02 09:56:58
Minute 1, No of requests: 127
	50.0% 385.16ms
	75.0% 592.51ms
	90.0% 1317.39ms
	99.0% 3208.23ms
2022-12-02 09:57:58
Minute 2, No of requests: 128
	50.0% 337.26ms
	75.0% 478.46ms
	90.0% 711.67ms
	99.0% 9667.91ms
2022-12-02 09:58:58
Minute 3, No of requests: 74
	50.0% 343.19ms
	75.0% 526.78ms
	90.0% 1187.21ms
	99.0% 7547.48ms
2022-12-02 09:59:58
Minute 4, No of requests: 143
	50.0% 371.78ms
	75.0% 537.08ms
	90.0% 968.13ms
	99.0% 2799.92ms
2022-12-02 10:00:58
Minute 5, No of requests: 160
	50.0% 366.24ms
	75.0% 543.2ms
	90.0% 898.81ms
	99.0% 1978.13ms
MP
Total duration:0 days 00:00:59, Total No of requests: 593
	50.0% 325.58ms
	75.0% 393.12ms
	90.0% 456.58ms
	99.0% 579.06ms
2022-12-02 16:10:02
Minute 1, No of requests: 593
	50.0% 325.58ms
	75.0% 393.12ms
	90.0% 456.58ms
	99.0% 579.06ms
2022-12-02 16:11:02
Minute 2, No of requests: 593
	50.0% 325.58ms
	75.0% 393.12ms
	90.0% 456.58ms
	99.0% 579.06ms
2022-12-02 16:12:02
Minute 3, No of requests: 593
	50.0% 325.58ms
	75.0% 393.12ms
	90.0% 456.58ms
	99.0% 579.06ms
2022-12-02 16:13:02
Minute 4, No of requests: 593
	50.0% 325.58ms
	75.0% 393.12ms
	90.0% 456.58ms
	99.0% 579.06ms
2022-12-02 16:14:02
Minute 5, No of requests: 593
	50.0% 325.58ms
	75.0% 393.12ms
	90.0% 456.58ms
	99.0% 579.06ms
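
A summary of this shape can be produced from raw request timings with a few lines of Python. This is only a sketch: the tool that generated the numbers above is not shown in this feed, the `(seconds_since_start, latency_ms)` input layout is an assumption, and the nearest-rank percentile used here may differ from the interpolation behind the reported values.

```python
import math
from collections import defaultdict


def percentile(values, p):
    """Nearest-rank percentile of a non-empty list (p in (0, 100])."""
    ordered = sorted(values)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


def summarize_by_minute(samples, percentiles=(50, 75, 90, 99)):
    """Group (seconds_since_start, latency_ms) pairs into 1-based minutes."""
    buckets = defaultdict(list)
    for seconds, latency_ms in samples:
        buckets[int(seconds // 60) + 1].append(latency_ms)
    return {
        minute: {p: percentile(latencies, p) for p in percentiles}
        for minute, latencies in sorted(buckets.items())
    }
```

Reporting the request count per minute next to the percentiles, as the output above does, helps spot minutes where a high 99th percentile rests on very few samples.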
Dec 2 2022, 6:13 PM · Machine-Learning-Team
isarantopoulos added a comment to T323624: Test revscoring model servers on Lift Wing.
  1. SP - Single process
Dec 2 2022, 5:04 PM · Machine-Learning-Team
isarantopoulos added a comment to T323624: Test revscoring model servers on Lift Wing.

with MP

Dec 2 2022, 4:59 PM · Machine-Learning-Team
isarantopoulos added a comment to T323624: Test revscoring model servers on Lift Wing.

editquality-goodfaith
With MP

isaranto@deploy1002:~/scripts$ wrk -c 1 -t 1 --timeout 2s -s inference-goodfaith.lua https://inference-staging.svc.codfw.wmnet:30443/v1/models/enwiki-goodfaith:predict --latency -d 60
Running 1m test @ https://inference-staging.svc.codfw.wmnet:30443/v1/models/enwiki-goodfaith:predict
  1 threads and 1 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   312.50ms   49.80ms 531.37ms   82.81%
    Req/Sec     2.96      0.73     5.00     73.96%
  Latency Distribution
     50%  292.40ms
     75%  299.61ms
     90%  404.56ms
     99%  520.68ms
  192 requests in 1.00m, 72.19KB read
Requests/sec:      3.19
Transfer/sec:      1.20KB
isaranto@deploy1002:~/scripts$ wrk -c 4 -t 2 --timeout 2s -s inference-goodfaith.lua https://inference-staging.svc.codfw.wmnet:30443/v1/models/enwiki-goodfaith:predict --latency -d 60
Running 1m test @ https://inference-staging.svc.codfw.wmnet:30443/v1/models/enwiki-goodfaith:predict
  2 threads and 4 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   405.38ms   76.24ms 708.64ms   72.33%
    Req/Sec     5.76      2.76    10.00     66.85%
  Latency Distribution
     50%  377.24ms
     75%  456.11ms
     90%  509.77ms
     99%  660.63ms
  589 requests in 1.00m, 221.45KB read
Requests/sec:      9.82
Transfer/sec:      3.69KB
Dec 2 2022, 4:56 PM · Machine-Learning-Team

Nov 29 2022

isarantopoulos added a comment to T323624: Test revscoring model servers on Lift Wing.

And various tests with wrk

Nov 29 2022, 2:36 PM · Machine-Learning-Team
isarantopoulos added a comment to T323624: Test revscoring model servers on Lift Wing.

Re-ran the test and edited the previous message. Much better results, and it seems that latency doesn't increase over time as it does in the non-MP version.
Here are the results for the full 20 minutes I ran it:

Nov 29 2022, 11:41 AM · Machine-Learning-Team
isarantopoulos added a comment to T323624: Test revscoring model servers on Lift Wing.

@elukey you are right. I put it as boolean, but true in yaml is translated to True in python and the comparison is actually comparing strings so True=="True" will always be false.
I see this in the logs:

[I 221128 16:44:42 model_server:125] Will fork 1 workers
[I 221128 16:44:42 model_server:128] Setting max asyncio worker threads as 5

As I understand workers == threads in this case. Could you patch it once again with "True" so that we can check it out?
Regarding resources, I did not see any spikes in CPU while the test was run. The difference in performance between the two tests could be explained by the number of asyncio worker threads: in the first case it is 5 and in the second 9.
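
The boolean pitfall described at the top of this comment is easy to reproduce in plain Python. The `parse_bool` helper below is a hypothetical, more tolerant alternative and not code from the inference-services repository.

```python
# What a YAML parser yields for an unquoted `true`.
yaml_value = True

# Comparing a bool with a str is always False, so the feature stays disabled.
naive_check = yaml_value == "True"


def parse_bool(value) -> bool:
    """Accept real booleans as well as common string spellings."""
    if isinstance(value, bool):
        return value
    return str(value).strip().lower() in {"true", "1", "yes"}
```

Quoting the value as "True" in the YAML file, as requested above, fixes the comparison without code changes, while a tolerant parser makes the setting robust to either spelling.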

Nov 29 2022, 9:20 AM · Machine-Learning-Team

Nov 28 2022

isarantopoulos added a comment to T323624: Test revscoring model servers on Lift Wing.

Enabled MP and ran on ml-staging with benthos for 5 minutes for revscoring-editquality-goodfaith:
for en wiki

Total duration:0 days 00:05:00, Total No of requests: 641
	50.0% 368.48ms
	75.0% 550.88ms
	90.0% 1002.84ms
	99.0% 2956.55ms
Minute 1, No of requests: 127
	50.0% 385.16ms
	75.0% 592.51ms
	90.0% 1317.39ms
	99.0% 3208.23ms
Minute 2, No of requests: 128
	50.0% 337.26ms
	75.0% 478.46ms
	90.0% 711.67ms
	99.0% 9667.91ms
Minute 3, No of requests: 74
	50.0% 343.19ms
	75.0% 526.78ms
	90.0% 1187.21ms
	99.0% 7547.48ms
Minute 4, No of requests: 143
	50.0% 371.78ms
	75.0% 537.08ms
	90.0% 968.13ms
	99.0% 2799.92ms
Minute 5, No of requests: 160
	50.0% 366.24ms
	75.0% 543.2ms
	90.0% 898.81ms
	99.0% 1978.13ms

For zh wiki we didn't have the same increase per minute:

Total 5 minute duration:
	50.0% 289.17ms
	75.0% 548.68ms
	90.0% 1049.54ms
	99.0% 8628.5ms
Broken down by minute:
Minute 1, No of requests: 9
	50.0% 232.0ms
	75.0% 702.35ms
	90.0% 2802.58ms
	99.0% 9238.11ms
Minute 2, No of requests: 9
	50.0% 397.93ms
	75.0% 826.5ms
	90.0% 2538.78ms
	99.0% 6778.66ms
Minute 3, No of requests: 14
	50.0% 248.38ms
	75.0% 522.77ms
	90.0% 697.5ms
	99.0% 1471.2ms
Minute 4, No of requests: 5
	50.0% 241.56ms
	75.0% 479.28ms
	90.0% 1080.67ms
	99.0% 1441.51ms
Minute 5, No of requests: 12
	50.0% 288.67ms
	75.0% 496.1ms
	90.0% 548.94ms
	99.0% 919.24ms
Nov 28 2022, 3:52 PM · Machine-Learning-Team

Nov 25 2022

isarantopoulos added a comment to T323624: Test revscoring model servers on Lift Wing.

Checked en-wiki-revscoring-editquality-goodfaith with benthos and wrk:

Nov 25 2022, 4:15 PM · Machine-Learning-Team

Nov 24 2022

isarantopoulos moved T322006: Add new syntax directive to blubber.yaml files to enable users to directly use docker build with blubber.yaml. from In Progress to Done on the Machine-Learning-Team board.
Nov 24 2022, 4:03 PM · Machine-Learning-Team