isarantopoulos (Ilias Sarantopoulos)
Machine Learning/MLOps Engineer

Projects

User does not belong to any projects.

User Details

User Since
Nov 1 2022, 12:34 PM (75 w, 3 d)
Availability
Available
LDAP User
Ilias Sarantopoulos
MediaWiki User
ISarantopoulos-WMF

Recent Activity

Yesterday

isarantopoulos added a comment to T357986: Use Huggingface model server image for HF LLMs.

I managed to deploy Mistral-7B-Instruct-v0.2 on ml-staging using the GPU and 35GB of memory.
At the moment I'm testing the modifications we need to make to the huggingfaceserver so that:

Fri, Apr 12, 7:08 AM · Patch-For-Review, Machine-Learning-Team

Tue, Apr 9

isarantopoulos added a comment to T357986: Use Huggingface model server image for HF LLMs.

Deployed bert-base-uncased model on ml-staging and it works!

time curl "https://inference-staging.svc.codfw.wmnet:30443/v1/models/bert:predict" -X POST -d '{"instances": ["The capital of france is [MASK]."] }' -H  "Host: bert.experimental.wikimedia.org" -H "Content-Type: application/json"
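
For reference, the same request can be sketched in Python with only the standard library. This is a minimal sketch, not the team's client code: the helper name build_predict_request is hypothetical, and the request is only built here, not sent.

```python
import json
import urllib.request


def build_predict_request(base_url, model_name, host_header, instances):
    """Build (but do not send) a KServe v1 predict request."""
    url = f"{base_url}/v1/models/{model_name}:predict"
    body = json.dumps({"instances": instances}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        # The Host header selects the inference service behind the shared endpoint.
        headers={"Host": host_header, "Content-Type": "application/json"},
    )


request = build_predict_request(
    "https://inference-staging.svc.codfw.wmnet:30443",
    "bert",
    "bert.experimental.wikimedia.org",
    ["The capital of france is [MASK]."],
)
# Sending it would be: urllib.request.urlopen(request, timeout=30)
```

The values are copied from the curl command above; sending the request only works from a host that can reach the staging endpoint.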
Tue, Apr 9, 3:14 PM · Patch-For-Review, Machine-Learning-Team
isarantopoulos moved T355742: Assess runtime performance impact of pydantic data models in the RRLA model-server from In Progress to 2023-2024 Q4 Done on the Machine-Learning-Team board.
Tue, Apr 9, 2:57 PM · Patch-For-Review, Machine-Learning-Team
isarantopoulos moved T361238: Update and fix locust load testing for revscoring models from Unsorted to Ready To Go on the Machine-Learning-Team board.
Tue, Apr 9, 2:51 PM · Machine-Learning-Team
isarantopoulos set the point value for T361238: Update and fix locust load testing for revscoring models to 1.
Tue, Apr 9, 2:51 PM · Machine-Learning-Team
isarantopoulos moved T361370: Determine a structure for the python package repository from Unsorted to In Progress on the Machine-Learning-Team board.
Tue, Apr 9, 2:49 PM · Lift-Wing, Machine-Learning-Team
isarantopoulos set the point value for T361370: Determine a structure for the python package repository to 3.
Tue, Apr 9, 2:49 PM · Lift-Wing, Machine-Learning-Team
isarantopoulos moved T361803: Create logo-detection model-server to be hosted on LiftWing from Unsorted to In Progress on the Machine-Learning-Team board.
Tue, Apr 9, 2:45 PM · Patch-For-Review, Machine-Learning-Team
isarantopoulos set the point value for T361803: Create logo-detection model-server to be hosted on LiftWing to 4.
Tue, Apr 9, 2:45 PM · Patch-For-Review, Machine-Learning-Team
isarantopoulos moved T361881: Investigate the inconsistent load test results (locust) for revertrisk from Unsorted to Ready To Go on the Machine-Learning-Team board.
Tue, Apr 9, 2:44 PM · Machine-Learning-Team
isarantopoulos set the point value for T361881: Investigate the inconsistent load test results (locust) for revertrisk to 2.
Tue, Apr 9, 2:44 PM · Machine-Learning-Team
isarantopoulos moved T361483: Selectively disable changeprop functionality that is no longer used from Unsorted to Watching on the Machine-Learning-Team board.
Tue, Apr 9, 2:41 PM · Patch-For-Review, Machine-Learning-Team, Lift-Wing, ORES, RESTBase Sunsetting, Content-Transform-Team, serviceops, ChangeProp, API Platform (RESTbase Deprecation Roadmap)

Mon, Apr 8

isarantopoulos added a comment to T357986: Use Huggingface model server image for HF LLMs.

Proceeding with google-bert-uncased as an example model (the one we used during debugging) and Mistral-7B-Instruct-v0.2 as a bigger model.

Mon, Apr 8, 3:37 PM · Patch-For-Review, Machine-Learning-Team
isarantopoulos added a comment to T357986: Use Huggingface model server image for HF LLMs.

I tried to deploy falcon-7b-instruct using the hf image but got the following error in the kserve-container:

kubectl logs falcon-7b-instruct-gpu-predictor-00001-deployment-6776fccd6llf7 kserve-container
INFO:root:Copying contents of /mnt/models to local
The repository for /mnt/models contains custom code which must be executed to correctly load the model. You can inspect the repository content at https://hf.co//mnt/models.
You can avoid this prompt in future by passing the argument `trust_remote_code=True`.
Mon, Apr 8, 3:13 PM · Patch-For-Review, Machine-Learning-Team
isarantopoulos created P59848 (An Untitled Masterwork).
Mon, Apr 8, 2:09 PM
isarantopoulos committed rMLISa0549764cc7d: huggingface: add huggingface image.
huggingface: add huggingface image
Mon, Apr 8, 10:43 AM

Thu, Apr 4

isarantopoulos added a comment to T357986: Use Huggingface model server image for HF LLMs.

After the release of the new pytorch image I have re-verified the docker image size (11.4GB, as described above); the layers are the following (same as above):

docker history --format "Layer Size: {{.Size}} |  {{.CreatedBy}}" hf:kserve
Layer Size: 0B |  LABEL blubber.variant=production blubber.ver…
Layer Size: 0B |  ENTRYPOINT ["./entrypoint.sh"]
Layer Size: 321B |  COPY huggingface_modelserver/entrypoint.sh .…
Layer Size: 428kB |  COPY /srv/app/kserve_repo/python/huggingface…
Layer Size: 1.17GB |  COPY /opt/lib/python/site-packages /opt/lib/…
Layer Size: 0B |  ENV PATH=/opt/lib/python/site-packages/bin:/…
Layer Size: 0B |  ENV MODEL_DIR=/mnt/models PYTHONPATH=/srv/app
Layer Size: 0B |  WORKDIR /srv/app
Layer Size: 0B |  ENV HOME=/home/somebody
Layer Size: 0B |  USER 65533
Layer Size: 9.07kB |  RUN |6 LIVES_AS=somebody LIVES_UID=65533 LIV…
Layer Size: 0B |  ARG RUNS_GID=900
Layer Size: 0B |  ARG RUNS_UID=900
Layer Size: 0B |  ARG RUNS_AS=runuser
Layer Size: 0B |  RUN |3 LIVES_AS=somebody LIVES_UID=65533 LIV…
Layer Size: 0B |  ARG LIVES_GID=65533
Layer Size: 0B |  ARG LIVES_UID=65533
Layer Size: 0B |  ARG LIVES_AS=somebody
Layer Size: 0B |  ENV HOME=/root
Layer Size: 0B |  USER 0
Layer Size: 10GB |  |0 /bin/sh -c /usr/bin/pip3 install --target…
Layer Size: 0B |  /bin/sh -c #(nop)  USER 65533
Layer Size: 1.54MB |  |0 /bin/sh -c echo 'Acquire::http::Proxy "ht…
Layer Size: 69.8MB |  |0 /bin/sh -c echo 'Acquire::http::Proxy "ht…
Layer Size: 0B |  /bin/sh -c #(nop)  CMD ["/bin/bash"]
Layer Size: 0B |  /bin/sh -c #(nop)  ENV LC_ALL=C.UTF-8
Layer Size: 122MB |  /bin/sh -c #(nop) ADD file:4d8f8923252d099a4…

There is a 10GB layer but it is the one coming from the base pytorch image.
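
To compare layers without reading the listing by eye, output in the "Layer Size: ... | ..." format above could be parsed with a short script. This is a rough sketch: the helper name parse_layers is hypothetical, and it assumes docker reports sizes in decimal (base-1000) units.

```python
import re

# Assumption: docker prints sizes with decimal (base-1000) units.
_UNITS = {"B": 1, "kB": 10**3, "MB": 10**6, "GB": 10**9, "TB": 10**12}
_LINE = re.compile(r"^Layer Size:\s*([\d.]+)\s*([kMGT]?B)\s*\|\s*(.*)$")


def parse_layers(history_output):
    """Return (size_in_bytes, created_by) for each 'Layer Size: ...' line."""
    layers = []
    for line in history_output.splitlines():
        match = _LINE.match(line.strip())
        if match:
            value, unit, created_by = match.groups()
            layers.append((round(float(value) * _UNITS[unit]), created_by.strip()))
    return layers


# A few lines copied from the listing above.
sample = """\
Layer Size: 321B |  COPY huggingface_modelserver/entrypoint.sh .…
Layer Size: 1.17GB |  COPY /opt/lib/python/site-packages /opt/lib/…
Layer Size: 10GB |  |0 /bin/sh -c /usr/bin/pip3 install --target…
Layer Size: 122MB |  /bin/sh -c #(nop) ADD file:4d8f8923252d099a4…
"""

layers = parse_layers(sample)
largest_size, largest_cmd = max(layers)  # tuples compare by size first
print(largest_size, largest_cmd)
```

On these sample lines the largest entry is the pip3 install layer, which is the one inherited from the base pytorch image.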

Thu, Apr 4, 4:24 PM · Patch-For-Review, Machine-Learning-Team

Wed, Apr 3

isarantopoulos added a comment to T358676: Host a logo detection model for Commons images.

Hi @mfossati ! Thanks a lot for all this great work!
I was wondering if you had tried to train the same model using pytorch as a keras backend instead of tensorflow. The reason I'm asking is totally unrelated to the model itself but has to do with technical challenges of maintaining multiple images and backends. There is ongoing work on our side to provide better support for pytorch (related task).
This is more of a question so we can provide better support and not a request from our side as we'd be supporting keras/tensorflow models as well.

Wed, Apr 3, 10:37 AM · Structured-Data-Backlog (Current Work), Machine-Learning-Team

Tue, Apr 2

isarantopoulos added a comment to T357986: Use Huggingface model server image for HF LLMs.

I have built the "final" image that also includes vllm==0.2.7, which will be used for GPU inference optimization. The final size is 11.4GB and it contains the following layers (the largest one is 1.17GB):

IMAGE          CREATED          CREATED BY                                      SIZE      COMMENT
b992697a5b07   11 minutes ago   LABEL blubber.variant=production blubber.ver…   0B        buildkit.dockerfile.v0
<missing>      11 minutes ago   ENTRYPOINT ["./entrypoint.sh"]                  0B        buildkit.dockerfile.v0
<missing>      11 minutes ago   COPY huggingface_modelserver/entrypoint.sh .…   321B      buildkit.dockerfile.v0
<missing>      11 minutes ago   COPY /srv/app/kserve_repo/python/huggingface…   428kB     buildkit.dockerfile.v0
<missing>      11 minutes ago   COPY /opt/lib/python/site-packages /opt/lib/…   1.17GB    buildkit.dockerfile.v0
<missing>      34 minutes ago   ENV PATH=/opt/lib/python/site-packages/bin:/…   0B        buildkit.dockerfile.v0
<missing>      34 minutes ago   ENV MODEL_DIR=/mnt/models PYTHONPATH=/srv/app   0B        buildkit.dockerfile.v0
<missing>      34 minutes ago   WORKDIR /srv/app                                0B        buildkit.dockerfile.v0
<missing>      34 minutes ago   ENV HOME=/home/somebody                         0B        buildkit.dockerfile.v0
<missing>      34 minutes ago   USER 65533                                      0B        buildkit.dockerfile.v0
<missing>      34 minutes ago   RUN |6 LIVES_AS=somebody LIVES_UID=65533 LIV…   9.07kB    buildkit.dockerfile.v0
<missing>      34 minutes ago   ARG RUNS_GID=900                                0B        buildkit.dockerfile.v0
<missing>      34 minutes ago   ARG RUNS_UID=900                                0B        buildkit.dockerfile.v0
<missing>      34 minutes ago   ARG RUNS_AS=runuser                             0B        buildkit.dockerfile.v0
<missing>      34 minutes ago   RUN |3 LIVES_AS=somebody LIVES_UID=65533 LIV…   0B        buildkit.dockerfile.v0
<missing>      34 minutes ago   ARG LIVES_GID=65533                             0B        buildkit.dockerfile.v0
<missing>      34 minutes ago   ARG LIVES_UID=65533                             0B        buildkit.dockerfile.v0
<missing>      34 minutes ago   ARG LIVES_AS=somebody                           0B        buildkit.dockerfile.v0
<missing>      34 minutes ago   RUN /bin/sh -c apt-get update && apt-get ins…   0B        buildkit.dockerfile.v0
<missing>      34 minutes ago   ENV DEBIAN_FRONTEND=noninteractive              0B        buildkit.dockerfile.v0
<missing>      34 minutes ago   ENV HOME=/root                                  0B        buildkit.dockerfile.v0
<missing>      34 minutes ago   USER 0                                          0B        buildkit.dockerfile.v0
Tue, Apr 2, 6:04 PM · Patch-For-Review, Machine-Learning-Team
isarantopoulos added a comment to T357986: Use Huggingface model server image for HF LLMs.

I believe we would have the same issue even with poetry in the following scenario:

Tue, Apr 2, 4:01 PM · Patch-For-Review, Machine-Learning-Team
isarantopoulos added a comment to T357986: Use Huggingface model server image for HF LLMs.

I've built an image by explicitly defining all the requirements (instead of letting pip resolve the dependencies on its own). This resulted in a reduced image size of 10.9GB compared to 13.9GB and a really small docker layer size.
In the output of docker history we see that the biggest layer size is 622MB:

IMAGE          CREATED          CREATED BY                                      SIZE      COMMENT
f357017f3949   42 minutes ago   LABEL blubber.variant=production blubber.ver…   0B        buildkit.dockerfile.v0
<missing>      42 minutes ago   ENTRYPOINT ["./entrypoint.sh"]                  0B        buildkit.dockerfile.v0
<missing>      42 minutes ago   COPY huggingface_modelserver/entrypoint.sh .…   321B      buildkit.dockerfile.v0
<missing>      42 minutes ago   COPY /srv/app/kserve_repo/python/huggingface…   418kB     buildkit.dockerfile.v0
<missing>      42 minutes ago   COPY /opt/lib/python/site-packages /opt/lib/…   622MB     buildkit.dockerfile.v0
<missing>      4 hours ago      ENV PATH=/opt/lib/python/site-packages/bin:/…   0B        buildkit.dockerfile.v0
<missing>      4 hours ago      ENV MODEL_DIR=/mnt/models PYTHONPATH=/srv/app   0B        buildkit.dockerfile.v0
<missing>      4 hours ago      WORKDIR /srv/app                                0B        buildkit.dockerfile.v0
<missing>      4 hours ago      ENV HOME=/home/somebody                         0B        buildkit.dockerfile.v0
<missing>      4 hours ago      USER 65533                                      0B        buildkit.dockerfile.v0
<missing>      4 hours ago      RUN |6 LIVES_AS=somebody LIVES_UID=65533 LIV…   9.07kB    buildkit.dockerfile.v0
<missing>      4 hours ago      ARG RUNS_GID=900                                0B        buildkit.dockerfile.v0
<missing>      4 hours ago      ARG RUNS_UID=900                                0B        buildkit.dockerfile.v0
<missing>      4 hours ago      ARG RUNS_AS=runuser                             0B        buildkit.dockerfile.v0
<missing>      4 hours ago      RUN |3 LIVES_AS=somebody LIVES_UID=65533 LIV…   0B        buildkit.dockerfile.v0
<missing>      4 hours ago      ARG LIVES_GID=65533                             0B        buildkit.dockerfile.v0
<missing>      4 hours ago      ARG LIVES_UID=65533                             0B        buildkit.dockerfile.v0
<missing>      4 hours ago      ARG LIVES_AS=somebody                           0B        buildkit.dockerfile.v0
<missing>      4 hours ago      RUN /bin/sh -c apt-get update && apt-get ins…   1.53MB    buildkit.dockerfile.v0
<missing>      4 hours ago      ENV DEBIAN_FRONTEND=noninteractive              0B        buildkit.dockerfile.v0
<missing>      4 hours ago      ENV HOME=/root                                  0B        buildkit.dockerfile.v0
<missing>      4 hours ago      USER 0                                          0B        buildkit.dockerfile.v0
<missing>      5 hours ago      /bin/sh -c /usr/bin/pip3 install --target /o…   10GB
<missing>      5 hours ago      /bin/sh -c #(nop)  USER 65533                   0B
<missing>      5 hours ago      /bin/sh -c (groupadd -o -g 65533 -r "somebod…   8.88kB
<missing>      2 days ago       |0 /bin/sh -c echo 'Acquire::http::Proxy "ht…   69.8MB
<missing>      2 days ago       /bin/sh -c #(nop)  CMD ["/bin/bash"]            0B
<missing>      2 days ago       /bin/sh -c #(nop)  ENV LC_ALL=C.UTF-8           0B
<missing>      2 days ago       /bin/sh -c #(nop) ADD file:4d8f8923252d099a4…   122MB
Tue, Apr 2, 11:24 AM · Patch-For-Review, Machine-Learning-Team
isarantopoulos added a comment to T360638: Create a Pytorch base image.

The above "issue" with numpy does not seem to be an issue after all. Numpy was removed as a requirement after torch 1.9, but they do maintain an aggressive warning, as I read in an issue.

Tue, Apr 2, 9:30 AM · Patch-For-Review, Machine-Learning-Team
isarantopoulos closed T358195: Investigate increased preprocessing latencies on LW of article-descriptions model, a subtask of T343123: Migrate Machine-generated Article Descriptions from toolforge to liftwing., as Resolved.
Tue, Apr 2, 6:49 AM · Wikipedia-Android-App-Backlog (Android Release - FY2023-24), Machine-Learning-Team
isarantopoulos closed T358195: Investigate increased preprocessing latencies on LW of article-descriptions model as Resolved.
Tue, Apr 2, 6:49 AM · Wikipedia-Android-App-Backlog, Machine-Learning-Team
isarantopoulos moved T358195: Investigate increased preprocessing latencies on LW of article-descriptions model from In Progress to 2023-2024 Q3 Done on the Machine-Learning-Team board.
Tue, Apr 2, 6:49 AM · Wikipedia-Android-App-Backlog, Machine-Learning-Team
isarantopoulos closed T360212: Add pyopencl requirements to images that use resource_utils as Resolved.
Tue, Apr 2, 6:48 AM · Machine-Learning-Team
isarantopoulos moved T360212: Add pyopencl requirements to images that use resource_utils from In Progress to 2023-2024 Q3 Done on the Machine-Learning-Team board.
Tue, Apr 2, 6:48 AM · Machine-Learning-Team

Mon, Apr 1

isarantopoulos added a comment to T361370: Determine a structure for the python package repository.

For this project we'll be using a pyproject.toml file as described in the Python documentation page "Writing your pyproject.toml". pyproject.toml is the standard way of storing metadata for packaging-related tools, as described in PEP 621.

Mon, Apr 1, 1:54 PM · Lift-Wing, Machine-Learning-Team
isarantopoulos added a comment to T357986: Use Huggingface model server image for HF LLMs.

HF image now works! The above issue seems to have been fixed in one of the latest commits. However to avoid any future issues (breaking changes etc) we are still using a wikimedia fork of the kserve repo for huggingfaceserver package (and kserve).
Using a fork makes updating a bit mundane as a task but will allow us to have deterministic builds until things are more stable (in a new kserve release).
For now we can:

Mon, Apr 1, 1:14 PM · Patch-For-Review, Machine-Learning-Team

Fri, Mar 29

isarantopoulos created P59011 (An Untitled Masterwork).
Fri, Mar 29, 3:08 PM
isarantopoulos added a comment to T361370: Determine a structure for the python package repository.

@Mercelisvaughan Apart from the official documentation, other resources may also be useful, so feel free to add them here if you find something useful.
Another idea would be to use a cookiecutter template (more info about cookiecutter templates here), but I think that it may be too much to start with. Nevertheless, if you want, you can also explore this option, as it is a nice idea and useful to know of.

Fri, Mar 29, 2:17 PM · Lift-Wing, Machine-Learning-Team
isarantopoulos created T361370: Determine a structure for the python package repository.
Fri, Mar 29, 2:17 PM · Lift-Wing, Machine-Learning-Team
isarantopoulos added a comment to T360638: Create a Pytorch base image.

I noticed something odd in the base image.
When I import torch inside the image I get a warning about numpy missing:

/usr/local/lib/python3.11/dist-packages/torch/nn/modules/transformer.py:20: UserWarning: Failed to initialize NumPy: No module named 'numpy' (Triggered internally at ../torch/csrc/utils/tensor_numpy.cpp:84.)
  device: torch.device = torch.device(torch._C._get_default_device()),  # torch.device('cpu'),

I verified that numpy doesn't exist in the image, which is odd because it is a dependency for torch as defined in the pyproject.toml. I saw that the same happens for requests and pyyaml (and may be the case for the other dependencies as well).
The only way this can happen is if you install a package without its dependencies (--no-dependencies), which is not what we are doing.
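
A quick way to repeat this check inside an image is to ask the import system which modules it can locate. This is a minimal sketch using only the standard library; the helper name missing_modules is hypothetical.

```python
import importlib.util


def missing_modules(module_names):
    """Return the names from module_names that cannot be found for import."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]


# Import names for the packages mentioned above (PyYAML is imported as "yaml").
print(missing_modules(["numpy", "requests", "yaml"]))
```

To see what an installed package declares as requirements, importlib.metadata.requires("torch") returns the list recorded in its metadata; that call raises PackageNotFoundError if torch itself is not installed.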

Fri, Mar 29, 1:23 PM · Patch-For-Review, Machine-Learning-Team
isarantopoulos added a comment to T357986: Use Huggingface model server image for HF LLMs.

It seems that the most promising solution at the moment, one that would not require so many hacks and forks, would be to use a pytorch 2.1.2 and rocm5.5 base image.
I was able to build the image, which ended up with the following dependencies:

aiohttp==3.9.3
aiohttp-cors==0.7.0
aiosignal==1.3.1
annotated-types==0.6.0
anyio==4.3.0
async-timeout==4.0.3
attrs==23.2.0
azure-core==1.30.1
azure-identity==1.15.0
azure-storage-blob==12.19.1
azure-storage-file-share==12.15.0
boto3==1.34.73
botocore==1.34.73
cachetools==5.3.3
certifi==2024.2.2
cffi==1.16.0
charset-normalizer==3.3.2
click==8.1.7
cloudevents==1.10.1
cmake==3.29.0.1
colorful==0.5.6
cryptography==42.0.5
deprecation==2.1.0
distlib==0.3.8
fastapi==0.108.0
filelock==3.13.3
frozenlist==1.4.1
fsspec==2024.3.1
google-api-core==2.18.0
google-auth==2.29.0
google-cloud-core==2.4.1
google-cloud-storage==2.16.0
google-crc32c==1.5.0
google-resumable-media==2.7.0
googleapis-common-protos==1.63.0
grpcio==1.62.1
h11==0.14.0
httpcore==1.0.5
httptools==0.6.1
httpx==0.26.0
huggingface-hub==0.22.1
# Editable install with no version control (huggingfaceserver==0.12.0)
-e /srv/app/kserve_repo/python/huggingfaceserver
idna==3.6
isodate==0.6.1
Jinja2==3.1.3
jmespath==1.0.1
jsonschema==4.21.1
jsonschema-specifications==2023.12.1
kserve @ file:///srv/app/kserve_repo/python/kserve
kubernetes==29.0.0
lit==18.1.2
MarkupSafe==2.1.5
mpmath==1.3.0
msal==1.28.0
msal-extensions==1.1.0
msgpack==1.0.8
multidict==6.0.5
networkx==3.2.1
numpy==1.26.4
nvidia-cublas-cu12==12.1.3.1
nvidia-cuda-cupti-cu12==12.1.105
nvidia-cuda-nvrtc-cu12==12.1.105
nvidia-cuda-runtime-cu12==12.1.105
nvidia-cudnn-cu12==8.9.2.26
nvidia-cufft-cu12==11.0.2.54
nvidia-curand-cu12==10.3.2.106
nvidia-cusolver-cu12==11.4.5.107
nvidia-cusparse-cu12==12.1.0.106
nvidia-nccl-cu12==2.18.1
nvidia-nvjitlink-cu12==12.4.99
nvidia-nvtx-cu12==12.1.105
oauthlib==3.2.2
opencensus==0.11.4
opencensus-context==0.1.3
orjson==3.10.0
packaging==24.0
pandas==2.2.1
platformdirs==4.2.0
portalocker==2.8.2
prometheus-client==0.13.1
proto-plus==1.23.0
protobuf==3.20.3
psutil==5.9.8
py-spy==0.3.14
pyasn1==0.6.0
pyasn1_modules==0.4.0
pycparser==2.21
pydantic==2.6.4
pydantic_core==2.16.3
PyJWT==2.8.0
python-dateutil==2.9.0.post0
python-dotenv==1.0.1
pytorch-triton-rocm==2.0.1
pytz==2024.1
PyYAML==6.0.1
ray==2.10.0
referencing==0.34.0
regex==2023.12.25
requests==2.31.0
requests-oauthlib==2.0.0
rpds-py==0.18.0
rsa==4.9
s3transfer==0.10.1
safetensors==0.4.2
six==1.16.0
smart-open==7.0.4
sniffio==1.3.1
starlette==0.32.0.post1
sympy==1.12
tabulate==0.9.0
timing-asgi==0.3.1
tokenizers==0.15.2
torch==2.1.2
tqdm==4.66.2
transformers==4.37.2
triton==2.1.0
typing_extensions==4.10.0
tzdata==2024.1
urllib3==1.26.18
uvicorn==0.21.1
uvloop==0.19.0
virtualenv==20.25.1
watchfiles==0.21.0
websocket-client==1.7.0
websockets==12.0
wrapt==1.16.0
yarl==1.9.4

At the moment I am getting the following error with ray serve

Traceback (most recent call last):
  File "<frozen runpy>", line 189, in _run_module_as_main
  File "<frozen runpy>", line 148, in _get_module_details
  File "<frozen runpy>", line 112, in _get_module_details
  File "/srv/app/huggingfaceserver/__init__.py", line 15, in <module>
    from .model import HuggingfaceModel  # noqa # pylint: disable=unused-import
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/srv/app/huggingfaceserver/model.py", line 20, in <module>
    from kserve.model import PredictorConfig
  File "/opt/lib/python/site-packages/kserve/__init__.py", line 18, in <module>
    from .model_server import ModelServer
  File "/opt/lib/python/site-packages/kserve/model_server.py", line 27, in <module>
    from ray.serve.handle import RayServeHandle
ImportError: cannot import name 'RayServeHandle' from 'ray.serve.handle' (/opt/lib/python/site-packages/ray/serve/handle.py)

There is an open issue already for this on kserve GH. I am trying to see if I can find a workaround when using the wikimedia kserve fork.
I created the fork for 2 reasons:

Fri, Mar 29, 10:55 AM · Patch-For-Review, Machine-Learning-Team

Thu, Mar 28

isarantopoulos added a comment to T361234: Fix locust load testing for Revert Risk models.

The locust.conf was the one that is committed in the repo. I can't recall if anything was different at the time. However, since the host header wasn't there, it makes me wonder if the test was run using the API GW, but again that doesn't justify the increased request count.

Thu, Mar 28, 5:23 PM · Patch-For-Review, Machine-Learning-Team
isarantopoulos added a comment to T357986: Use Huggingface model server image for HF LLMs.

There is an open Pull Request in vllm repo to upgrade pytorch support to 2.2.1, just leaving this here as a reference.

Thu, Mar 28, 4:37 PM · Patch-For-Review, Machine-Learning-Team
isarantopoulos created P58979 docker-pkg error.
Thu, Mar 28, 1:20 PM
isarantopoulos added a comment to T357986: Use Huggingface model server image for HF LLMs.

I'm facing issues trying to update huggingfaceserver dependencies to use torch 2.2.1. I've reached a point where I'm blocked because vllm requires torch version 2.1.2 so my suggestion is to go and create another base image with torch-rocm 2.1.2 and rocm 5.5 which is available for that version.
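
The kind of clash described here can be shown with a small check over a pip freeze style list. This is a simplified sketch: it only compares exact "==" pins, whereas a real resolver also handles version ranges, and the helper names are hypothetical.

```python
def parse_pins(freeze_lines):
    """Map package name -> pinned version for 'name==version' lines only."""
    pins = {}
    for raw in freeze_lines:
        line = raw.strip()
        # Skip comments, editable installs and direct references ("name @ file://...").
        if not line or line.startswith(("#", "-e ")) or " @ " in line:
            continue
        name, sep, version = line.partition("==")
        if sep:
            pins[name.lower().replace("_", "-")] = version
    return pins


def find_conflicts(pins, required):
    """Return {name: (installed, required)} where the exact pins disagree."""
    return {
        name: (pins[name], wanted)
        for name, wanted in required.items()
        if name in pins and pins[name] != wanted
    }


installed = parse_pins(
    ["torch==2.2.1", "numpy==1.26.4", "-e /srv/app/kserve_repo/python/huggingfaceserver"]
)
conflicts = find_conflicts(installed, {"torch": "2.1.2"})
print(conflicts)
```

Here the required pin stands in for the torch version that vllm needs, so the torch entry is reported as a conflict.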

Thu, Mar 28, 11:40 AM · Patch-For-Review, Machine-Learning-Team
isarantopoulos closed T361117: Remove redundant deployments from ml-staging as Resolved.
Thu, Mar 28, 6:38 AM · Patch-For-Review, Lift-Wing, Machine-Learning-Team
isarantopoulos moved T361117: Remove redundant deployments from ml-staging from Unsorted to 2023-2024 Q3 Done on the Machine-Learning-Team board.
Thu, Mar 28, 6:38 AM · Patch-For-Review, Lift-Wing, Machine-Learning-Team
isarantopoulos added a comment to T361117: Remove redundant deployments from ml-staging.

The following deployments have been removed from ml-staging:

Thu, Mar 28, 6:37 AM · Patch-For-Review, Lift-Wing, Machine-Learning-Team
isarantopoulos set the point value for T361117: Remove redundant deployments from ml-staging to 1.
Thu, Mar 28, 6:36 AM · Patch-For-Review, Lift-Wing, Machine-Learning-Team

Wed, Mar 27

isarantopoulos added a comment to T357986: Use Huggingface model server image for HF LLMs.

After thinking about this and trying various things out (copying code or using a specific commit) I found the following 2 issues we need to resolve:

Wed, Mar 27, 5:48 PM · Patch-For-Review, Machine-Learning-Team
isarantopoulos created T361117: Remove redundant deployments from ml-staging.
Wed, Mar 27, 3:24 PM · Patch-For-Review, Lift-Wing, Machine-Learning-Team
isarantopoulos added a comment to T357986: Use Huggingface model server image for HF LLMs.

We'll be using the pytorch rocm image based on debian bookworm for this image (see #T360638).
We also need to either copy the code from the directory https://github.com/kserve/kserve/tree/master/python/huggingfaceserver/huggingfaceserver or pin to a specific commit on the master branch, because new changes are being introduced all the time. If we use the release version we don't get the fixes, and if we just use the master branch we may end up with a failed deployment (which actually just happened due to a commit from last week on kserve).
I'm exploring which of the two options would be the best for now (copy code or pin to specific commit).
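
For the "pin to a specific commit" option, pip accepts a direct reference to a git revision and a subdirectory. The sketch below only builds such a requirement string; the helper name is hypothetical and the commit hash is a placeholder, not a real kserve commit.

```python
def pinned_git_requirement(name, repo_url, commit, subdirectory):
    """Build a pip direct reference that installs one subdirectory at one commit."""
    return f"{name} @ git+{repo_url}@{commit}#subdirectory={subdirectory}"


# PLACEHOLDER_COMMIT is hypothetical; substitute a real commit hash from the repo.
PLACEHOLDER_COMMIT = "0123456789abcdef0123456789abcdef01234567"

requirement = pinned_git_requirement(
    "huggingfaceserver",
    "https://github.com/kserve/kserve.git",
    PLACEHOLDER_COMMIT,
    "python/huggingfaceserver",
)
print(requirement)
```

A line like this in a requirements file keeps builds deterministic until the pinned commit is changed on purpose.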

Wed, Mar 27, 11:06 AM · Patch-For-Review, Machine-Learning-Team

Thu, Mar 21

isarantopoulos added a comment to T360593: Create an examples directory in the repository and add a basic README.md.

@Mercelisvaughan As a follow up to the first Pull request you can create another one where you'll add data validation using pydantic. Please go through the basic documentation and then you can update the existing function.

Thu, Mar 21, 2:35 PM · Machine-Learning-Team
isarantopoulos committed rMLISad95e4bd49d6: fix: install pyopencl in llm and article-desc.
fix: install pyopencl in llm and article-desc
Thu, Mar 21, 7:19 AM
isarantopoulos created T360593: Create an examples directory in the repository and add a basic README.md.
Thu, Mar 21, 5:46 AM · Machine-Learning-Team
isarantopoulos added a comment to T359140: Q4: Lift Wing Python Package.

This is the repository where the project will be hosted: https://github.com/wikimedia/liftwing-python

Thu, Mar 21, 5:37 AM · Goal, Machine-Learning-Team

Wed, Mar 20

isarantopoulos renamed T360212: Add pyopencl requirements to images that use resource_utils from Add GPU check in all images to Add pyopencl requirements to images that use resource_utils.
Wed, Mar 20, 12:07 PM · Machine-Learning-Team
isarantopoulos moved T360212: Add pyopencl requirements to images that use resource_utils from Ready To Go to In Progress on the Machine-Learning-Team board.
Wed, Mar 20, 12:04 PM · Machine-Learning-Team

Tue, Mar 19

isarantopoulos moved T360212: Add pyopencl requirements to images that use resource_utils from Unsorted to Ready To Go on the Machine-Learning-Team board.
Tue, Mar 19, 2:49 PM · Machine-Learning-Team
isarantopoulos created P58817 (An Untitled Masterwork).
Tue, Mar 19, 1:24 PM
isarantopoulos committed rMLIS338748cb5cc4: revertrisk: remove obsolete step from README.
revertrisk: remove obsolete step from README
Tue, Mar 19, 12:00 PM
isarantopoulos moved T360352: change to WatchedItemQueryServiceExtension signature type hint causes phan error for ORES from Unsorted to 2023-2024 Q3 Done on the Machine-Learning-Team board.
Tue, Mar 19, 7:35 AM · MW-1.42-notes (1.42.0-wmf.23; 2024-03-19), Machine-Learning-Team, MediaWiki-extensions-ORES

Fri, Mar 15

isarantopoulos created T360212: Add pyopencl requirements to images that use resource_utils.
Fri, Mar 15, 4:32 PM · Machine-Learning-Team

Thu, Mar 14

isarantopoulos added a comment to T357986: Use Huggingface model server image for HF LLMs.

I've managed to make it work with a model available on disk (which means no connection to HF repo). The issues I faced were specific to the example as it was also trying to load the coreml directory as a model so it was failing. (See model directory structure here). However I successfully loaded bloom-560m and nllb model.
I see there are some open issues on GH so I understand this is going to be more stable as we move forward.
For now we will have to test each model we want to deploy/test without taking for granted that all models will work. Also when using the GPU with vllm it is important to remember that not all models are supported (list of supported models by vllm).

Thu, Mar 14, 3:51 PM · Patch-For-Review, Machine-Learning-Team

Mar 13 2024

isarantopoulos created P58780 (An Untitled Masterwork).
Mar 13 2024, 5:05 PM
isarantopoulos closed T358953: Inconsistent data type for articlequality score predictions on ptwiki as Resolved.
Mar 13 2024, 4:40 PM · Machine-Learning-Team, ORES
isarantopoulos moved T358953: Inconsistent data type for articlequality score predictions on ptwiki from In Progress to 2023-2024 Q3 Done on the Machine-Learning-Team board.
Mar 13 2024, 4:40 PM · Machine-Learning-Team, ORES
isarantopoulos closed T359871: Add httpbb tests for ores-legacy as Resolved.
Mar 13 2024, 4:39 PM · ORES, Machine-Learning-Team
isarantopoulos moved T359871: Add httpbb tests for ores-legacy from In Progress to 2023-2024 Q3 Done on the Machine-Learning-Team board.
Mar 13 2024, 4:39 PM · ORES, Machine-Learning-Team
isarantopoulos added a comment to T357986: Use Huggingface model server image for HF LLMs.

I managed to build the huggingface image with blubber and download the specified model from HF (example with bert-base-uncased). I'm currently looking at what would be the best strategy for creating the image.
At the moment I'm cloning the kserve repo and using the huggingfaceserver module directly but I'm exploring if committing these files in the repository would work better in the long run.
Next steps:

  • Create a readme and open up the attached patch for reviews
  • Try to run it with a model that exists locally, since this will be the standard way that we'll use it. Since our pods won't have access to the HF repository, we want to load the model from disk (the same way that we do with the models in the LLM image). At the moment I'm running into some errors, so I'm checking whether they are caused by me or by the kserve code (example of the errors; not really important at the moment, as I don't know if I'm using the arguments properly).
Mar 13 2024, 4:38 PM · Patch-For-Review, Machine-Learning-Team
isarantopoulos added a comment to T358953: Inconsistent data type for articlequality score predictions on ptwiki.

@He7d3r I have deployed the fix in production and it is working as expected.

Mar 13 2024, 9:26 AM · Machine-Learning-Team, ORES

Mar 12 2024

isarantopoulos moved T358344: Enable Language-agnostic revert risk model in ORES for Indonesian Wikipedia from Blocked to Watching on the Machine-Learning-Team board.
Mar 12 2024, 2:50 PM · MediaWiki-extensions-ORES, Moderator-Tools-Team, Machine-Learning-Team, Automoderator
isarantopoulos moved T355742: Assess runtime performance impact of pydantic data models in the RRLA model-server from Blocked to In Progress on the Machine-Learning-Team board.
Mar 12 2024, 2:49 PM · Patch-For-Review, Machine-Learning-Team
isarantopoulos moved T359416: Add Dragonfly to the ML k8s clusters from Unsorted to In Progress on the Machine-Learning-Team board.
Mar 12 2024, 2:43 PM · Machine-Learning-Team
isarantopoulos set the point value for T359416: Add Dragonfly to the ML k8s clusters to 5.
Mar 12 2024, 2:43 PM · Machine-Learning-Team
isarantopoulos moved T359569: Investigate if it is possible to reduce torch's package size from Unsorted to In Progress on the Machine-Learning-Team board.
Mar 12 2024, 2:41 PM · Machine-Learning-Team
isarantopoulos set the point value for T359569: Investigate if it is possible to reduce torch's package size to 10.
Mar 12 2024, 2:39 PM · Machine-Learning-Team
isarantopoulos moved T357654: Replace usage of wfGetDB() in ORES before the 1.42 cut so it can be hard-deprecated from Blocked to Watching on the Machine-Learning-Team board.
Mar 12 2024, 2:36 PM · ORES, Technical-Debt, Machine-Learning-Team
isarantopoulos moved T359793: Add a util function in python to detect GPU from Unsorted to In Progress on the Machine-Learning-Team board.
Mar 12 2024, 2:20 PM · Machine-Learning-Team
isarantopoulos claimed T359871: Add httpbb tests for ores-legacy.
Mar 12 2024, 2:10 PM · ORES, Machine-Learning-Team
isarantopoulos moved T359871: Add httpbb tests for ores-legacy from Unsorted to In Progress on the Machine-Learning-Team board.
Mar 12 2024, 2:10 PM · ORES, Machine-Learning-Team
isarantopoulos set the point value for T359871: Add httpbb tests for ores-legacy to 2.
Mar 12 2024, 2:09 PM · ORES, Machine-Learning-Team
isarantopoulos added a comment to T359871: Add httpbb tests for ores-legacy.

In the attached patch I brought some old ores httpbb tests back to life.
httpbb doesn't seem to support having a boolean, e.g. true, in the response body. I'm checking whether this is easy to implement.

Mar 12 2024, 7:46 AM · ORES, Machine-Learning-Team
isarantopoulos committed rMLISef713c8c5be5: redability: trigger new build.
redability: trigger new build
Mar 12 2024, 7:45 AM

Mar 11 2024

isarantopoulos created T359871: Add httpbb tests for ores-legacy.
Mar 11 2024, 5:15 PM · ORES, Machine-Learning-Team
isarantopoulos updated subscribers of T359793: Add a util function in python to detect GPU.

@achou suggested using pyopencl (GitHub, PyPI), which seems well supported and promising.
An alternative would be to use a specific function depending on the framework used. For example, catboost has its own function, get_gpu_device_count.
However, if a generic solution can be achieved and the installed package is small, that seems much better.
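
A minimal sketch of the framework-specific alternative, assuming the util tries each framework's own helper in turn and treats a missing package as "no GPU found". The names gpu_available, _catboost_gpu_count and _torch_gpu_count are hypothetical; only catboost.utils.get_gpu_device_count and torch.cuda.device_count are existing library calls.

```python
from typing import Callable, Iterable, Optional


def _catboost_gpu_count() -> int:
    # catboost ships its own helper for counting GPU devices.
    from catboost.utils import get_gpu_device_count

    return get_gpu_device_count()


def _torch_gpu_count() -> int:
    # ROCm builds of torch expose their devices through the same torch.cuda API.
    import torch

    return torch.cuda.device_count()


def gpu_available(probes: Optional[Iterable[Callable[[], int]]] = None) -> bool:
    """Return True if any probe reports at least one GPU device."""
    if probes is None:
        probes = (_catboost_gpu_count, _torch_gpu_count)
    for probe in probes:
        try:
            if probe() > 0:
                return True
        except Exception:
            # Framework not installed or the driver query failed: try the next probe.
            continue
    return False
```

Passing the probes as an argument keeps the function testable without a GPU, and a generic probe (for example one based on pyopencl) could be added to the same list later.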

Mar 11 2024, 9:22 AM · Machine-Learning-Team
isarantopoulos created T359793: Add a util function in python to detect GPU.
Mar 11 2024, 9:17 AM · Machine-Learning-Team

Mar 8 2024

isarantopoulos added a comment to T357986: Use Huggingface model server image for HF LLMs.

Some of the links in the pytorch repositories seem to be wrong today (at least this is when I noticed it). The links under https://download.pytorch.org/whl/rocm5.5 should resolve to https://download.pytorch.org/whl/rocm5.5/{package_name}, e.g. if you click on torch you should be redirected to https://download.pytorch.org/whl/rocm5.5/torch.
This makes the image build flaky: when the links do not resolve like this, we end up with the default pytorch version, with no rocm support and all the nvidia stuff in it.
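
One way to catch this failure mode at build time is a small guard that fails when the installed torch is not a ROCm build. This is only a sketch: it assumes that ROCm wheels carry a "+rocm…" local version suffix (for example 2.0.1+rocm5.4.2), and the function names are hypothetical.

```python
from importlib import metadata
from typing import Optional


def is_rocm_build(version: str) -> bool:
    """Return True if the local version segment (after '+') starts with 'rocm'."""
    _, _, local = version.partition("+")
    return local.startswith("rocm")


def installed_torch_version() -> Optional[str]:
    """Return the installed torch version string, or None if torch is absent."""
    try:
        return metadata.version("torch")
    except metadata.PackageNotFoundError:
        return None


def check_torch_is_rocm() -> None:
    """Raise if torch is missing or was not installed from a ROCm index."""
    version = installed_torch_version()
    if version is None or not is_rocm_build(version):
        raise RuntimeError(f"expected a ROCm build of torch, found: {version!r}")
```

Calling check_torch_is_rocm() as the last step of the image build would turn a silent fallback to the default wheel into a visible build failure.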

Mar 8 2024, 5:11 PM · Patch-For-Review, Machine-Learning-Team
isarantopoulos created P58691 (An Untitled Masterwork).
Mar 8 2024, 3:11 PM
isarantopoulos committed rMLISb4ca64f97ace: ores-legacy: fix mixed boolean and string field.
ores-legacy: fix mixed boolean and string field
Mar 8 2024, 9:39 AM

Mar 7 2024

isarantopoulos added a comment to T358953: Inconsistent data type for articlequality score predictions on ptwiki.

The attached patch solves the issue. I will deploy it to staging and add some httpbb tests that capture this behavior before I deploy to production.

Mar 7 2024, 5:10 PM · Machine-Learning-Team, ORES
isarantopoulos added a comment to T357986: Use Huggingface model server image for HF LLMs.

I'm proceeding with creating a blubber image instead of going to production-images and handling dependencies in there. After running what upstream provides locally, I saw that we don't really need anything special.

Mar 7 2024, 2:58 PM · Patch-For-Review, Machine-Learning-Team
isarantopoulos updated Other Assignee for T358953: Inconsistent data type for articlequality score predictions on ptwiki, removed: isarantopoulos.
Mar 7 2024, 2:42 PM · Machine-Learning-Team, ORES
isarantopoulos committed rMLISe5f33d0ee8c4: revertrisk-multilingual: add extra index for torch rocm.
revertrisk-multilingual: add extra index for torch rocm
Mar 7 2024, 2:26 PM
isarantopoulos added a comment to T355742: Assess runtime performance impact of pydantic data models in the RRLA model-server.

This happens because the kserve repository is not a python package and, as the error message tells us, there is no setup.py file. The python package can be found in the subdirectory python/kserve.
In order to install it, replace this line in the requirements.txt file:

kserve==0.11.2

with this:

-e "git+https://github.com/kserve/kserve.git@426fe21da0612ea6ef4a116b5114270313e02bbb#egg=kserve&subdirectory=python/kserve"

Note: the quotes are required, otherwise the subdirectory won't be used. For more info you can check the pip VCS support documentation (I got the example from there).

Mar 7 2024, 12:17 PM · Patch-For-Review, Machine-Learning-Team
isarantopoulos claimed T358953: Inconsistent data type for articlequality score predictions on ptwiki.
Mar 7 2024, 10:12 AM · Machine-Learning-Team, ORES

Mar 6 2024

isarantopoulos added a comment to T356045: Test revertrisk-multilingual with GPU.

As far as I can see, torch is being downloaded from PyPI. I don't know exactly why this happens, but it seems that the extra index (source, in terms of the pyproject.toml file) isn't respected, so pip just sees the dependency and fetches it from PyPI.
To overcome this we can add an extra index and torch to the requirements file before knowledge integrity is installed. That way torch will already be present, downloaded from the correct index.
Example:

--extra-index-url https://download.pytorch.org/whl/rocm5.4.2
torch==2.0.1
Mar 6 2024, 5:35 PM · Patch-For-Review, Machine-Learning-Team
isarantopoulos added a comment to T358953: Inconsistent data type for articlequality score predictions on ptwiki.

I found that this is caused by the mixed schema of the responses returned by ORES. The prediction field is either a boolean, a string or a list of strings, and we have the following in our schema:

class Score(BaseModel):
    prediction: Union[bool, str, List[str]]
    probability: Dict[str, float]

The prediction field in the above pydantic model also declares a priority: the types in the Union are tried in order. This means it first tries to evaluate the value as a boolean, and that is what happens here, as "1" is evaluated as true.
I'm working on a universal solution that caters for both options properly (booleans and strings).
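
The ordering effect can be reproduced without pydantic. The sketch below is a stdlib-only imitation of left-to-right union matching, not pydantic's actual implementation; every name in it is hypothetical, and exact_type_first is just one possible way around the problem, not necessarily the fix that was deployed.

```python
from typing import Any, Callable, Sequence

TRUE_STRINGS = {"1", "true", "yes", "on"}
FALSE_STRINGS = {"0", "false", "no", "off"}


def to_bool(value: Any) -> bool:
    """Lenient boolean coercion: accepts bools and a few common string spellings."""
    if isinstance(value, bool):
        return value
    if isinstance(value, str) and value.lower() in TRUE_STRINGS:
        return True
    if isinstance(value, str) and value.lower() in FALSE_STRINGS:
        return False
    raise ValueError(f"not a boolean: {value!r}")


def to_str(value: Any) -> str:
    """Strict string check: only real strings pass."""
    if isinstance(value, str):
        return value
    raise ValueError(f"not a string: {value!r}")


def first_match(value: Any, coercers: Sequence[Callable[[Any], Any]]) -> Any:
    """Try each coercer in declaration order and return the first success."""
    for coerce in coercers:
        try:
            return coerce(value)
        except ValueError:
            continue
    raise ValueError(f"no coercer accepted {value!r}")


def exact_type_first(value: Any, coercers: Sequence[Callable[[Any], Any]]) -> Any:
    """Keep values that are already a bool or str; coerce only as a fallback."""
    if isinstance(value, (bool, str)):
        return value
    return first_match(value, coercers)


print(first_match("1", [to_bool, to_str]))       # True: the bool branch swallowed the string
print(first_match("6", [to_bool, to_str]))       # 6: not boolean-like, so it stays a string
print(exact_type_first("1", [to_bool, to_str]))  # 1: the string survives
```

With declaration-order matching, "1" and "6" come back with different types even though both went in as strings, which is the kind of inconsistency reported in the task.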

Mar 6 2024, 1:04 PM · Machine-Learning-Team, ORES
isarantopoulos moved T358953: Inconsistent data type for articlequality score predictions on ptwiki from Ready To Go to In Progress on the Machine-Learning-Team board.
Mar 6 2024, 11:24 AM · Machine-Learning-Team, ORES
isarantopoulos removed a project from T357986: Use Huggingface model server image for HF LLMs: Goal.
Mar 6 2024, 11:08 AM · Patch-For-Review, Machine-Learning-Team
isarantopoulos committed rMLIS9a06b3974232: readability: bump catboost to 1.2.3.
readability: bump catboost to 1.2.3
Mar 6 2024, 9:35 AM

Mar 5 2024

isarantopoulos edited P58525 (An Untitled Masterwork).
Mar 5 2024, 5:36 PM
isarantopoulos edited P58525 (An Untitled Masterwork).
Mar 5 2024, 5:35 PM
isarantopoulos created P58525 (An Untitled Masterwork).
Mar 5 2024, 5:35 PM
isarantopoulos updated the task description for T359140: Q4: Lift Wing Python Package.
Mar 5 2024, 7:57 AM · Goal, Machine-Learning-Team
isarantopoulos created T359140: Q4: Lift Wing Python Package.
Mar 5 2024, 7:56 AM · Goal, Machine-Learning-Team

Mar 4 2024

isarantopoulos added a comment to T358195: Investigate increased preprocessing latencies on LW of article-descriptions model.

Can we investigate reducing the computation to just the language requested?

Mar 4 2024, 9:10 AM · Wikipedia-Android-App-Backlog, Machine-Learning-Team