@AgnesAbah have you managed to resolve the issue?
As Kosta mentioned, there isn't anything there related to Lift Wing; it concerns the MediaWiki Action API instead.
May 21 2024
As it turns out, the above approach won't cut it. Even without the dependencies, the compressed image with pytorch 2.3.0 and rocm 6.0 is 4.36GB.
This is the list of packages under /opt/lib/site-packages
functorch torch torch-2.3.0+rocm6.0.dist-info torchgen
Also, it seems that torch-ROCm by itself is ~12GB, so it is indeed getting bigger and bigger.
Images seem to be getting more bloated, so I am exploring the option of installing pytorch-rocm with the --no-dependencies option and handling the dependencies manually, either in the production images repo or on the inference services side. It is a long shot, but I think it is worth trying from our side, at least to rule it out if it can't be done.
Whether this approach is feasible or not will depend on:
The internal URLs also behave properly, so it seems the issue is not on the Lift Wing side but has to do with how the API Gateway translates/encodes the URL.
curl "https://inference.svc.codfw.wmnet:30443/v1/models/article-descriptions%3Apredict" -X POST -d '{"lang": "en", "title": "Clandonald", "num_beams": 2}' -H "Host: article-descriptions.article-descriptions.wikimedia.org" -H "Content-Type: application/json"
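The `%3A` in that path is just the percent-encoded colon of KServe's `:predict` verb. A quick way to double-check what the encoded path segment should look like (illustrative, stdlib only):

```python
from urllib.parse import quote, unquote

# KServe model paths end in a ":predict" verb; in a URL path segment
# the colon is percent-encoded as %3A.
path = quote("article-descriptions:predict", safe="")
print(path)           # article-descriptions%3Apredict
print(unquote(path))  # article-descriptions:predict
```

If the gateway decodes `%3A` back to `:` (or double-encodes it) before forwarding, the upstream route no longer matches, which would explain the behaviour above.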
May 20 2024
Unfortunately, the pytorch package seems to get bigger with each release. The same goes for ROCm.
May 17 2024
May 16 2024
May 15 2024
May 14 2024
As part of the task T362984: GPU errors in hf image in ml-staging we have also experimented with different versions of pytorch (2.2.1, 2.3.0) and rocm (5.6, 5.7, 6.0) and we are still hitting the same issue.
To clarify: the GPU works properly with pytorch 2.0.1 and rocm 5.4.2, but these versions are too old to be used with the huggingfaceserver.
As discussed in the team meeting this task will be restricted to providing a solution for the revertrisk-language-agnostic that currently handles these redirects by rewriting the hosts from the values available in a configuration file. We want to remove this "hack".
Also, a potential idea would be to allow only a specific number of redirects (e.g. up to 3).
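The redirect cap could look roughly like the sketch below. This is hypothetical: `fetch` stands in for whatever call returns either a final payload or a redirect target, and the 3-hop limit is just the example number from above.

```python
# Hypothetical sketch: follow redirects up to a fixed cap instead of
# rewriting hosts from a configuration file. `fetch(url)` is assumed to
# return either a redirect target (a string) or a final payload.
MAX_REDIRECTS = 3

def resolve(url, fetch, max_redirects=MAX_REDIRECTS):
    for _ in range(max_redirects + 1):
        result = fetch(url)
        if not isinstance(result, str):  # final payload, not a redirect
            return result
        url = result  # follow the redirect target
    raise RuntimeError(f"more than {max_redirects} redirects for {url}")
```

Failing loudly past the cap also guards against redirect loops, which the config-file rewrite "hack" never had to worry about.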
We have a first MVP for the package.
We concluded that we will figure out the format after the team completes the spike (accessing the image and sending a thumbnail to Lift Wing).
I'd suggest we proceed with a base64 encoded image for now. Something like this would work:
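A minimal sketch of what such a payload could look like; the field names (`filename`, `image`) are illustrative, not a settled schema:

```python
import base64
import json

# Stand-in for real file contents read from disk or the upload stash.
image_bytes = b"\x89PNG..."

# Illustrative request body: the image travels as a base64 string
# inside the usual "instances" envelope.
payload = {
    "instances": [
        {
            "filename": "Cambia_logo.png",
            "image": base64.b64encode(image_bytes).decode("ascii"),
        }
    ]
}
body = json.dumps(payload)

# The model server would simply decode it back to bytes:
decoded = base64.b64decode(payload["instances"][0]["image"])
assert decoded == image_bytes
```

Base64 adds roughly 33% to the payload size, which is another argument for sending a small resized image rather than the original file.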
Tested llm image for nllb-200 with pytorch 2.3.0 and rocm 6.0 and got the same errors:
amdgpu_device_initialize: amdgpu_get_auth (1) failed (-1) amdgpu_device_initialize: amdgpu_get_auth (1) failed (-1)
I manually changed the image in the experimental namespace in ml-staging and tested this. After the test I reverted it back to the previous one, so the deployment now still has pytorch 2.2.1 and rocm 5.7.
May 13 2024
May 10 2024
In T362749#9786161, @Ladsgroup wrote: "Yes, Upload stash shouldn't be accessed directly or indirectly. It is internal to mediawiki and private. You can do it post-upload and add a comment or something (similar to what Automoderator is planning to do with edits or what ores did)."
I got an error when trying llm image locally with bullseye-torch2.3.0-rocm5.7 (related patch):
Traceback (most recent call last):
  File "/srv/app/llm/model.py", line 9, in <module>
    import torch
  File "/opt/lib/python/site-packages/torch/__init__.py", line 237, in <module>
    from torch._C import *  # noqa: F403
ImportError: libamdhip64.so: cannot enable executable stack as shared object requires: Invalid argument
Haven't found any related open issue, so I'm currently testing different pytorch versions to see if the issue still exists.
Regarding the nllb-gpu deployment: we successfully tested it when we first obtained the MI100. The deployment was removed at some point because it wasn't being used and we wanted to start working on the GPU with the huggingface image. The same applies to the article-descriptions model server.
However, both have only been tested with torch 2.0.1-rocm5.4.2.
So let's try a different combination: KServe upstream recently updated the huggingfaceserver to support torch 2.3.0, which will also be available in kserve 0.13. This means we can test rocm 5.7 with it.
I am working towards updating this.
@mfossati is there any other way to access the images in the upload stash other than using a cookie? Using a user cookie to access an API doesn't seem like the right way for a production application, from both a design and a security point of view. An API key/token would seem more appropriate (if such an option is available).
Another option would be to allow Lift Wing IPs direct access to the stash in some way, but then again I'm unaware if that is possible.
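For comparison, token-based auth would keep the credential out of session state entirely. A sketch of what the client side could look like; the bearer scheme and token are placeholders, not an existing MediaWiki setup:

```python
import urllib.request

# Hypothetical token-based request instead of a user cookie; the
# endpoint and "Bearer <api-token>" value are placeholders.
req = urllib.request.Request(
    "https://commons.wikimedia.org/w/api.php",
    headers={"Authorization": "Bearer <api-token>"},
)

# No request is sent here; this only shows the header being attached,
# which is the part that would replace the cookie.
assert req.get_header("Authorization") == "Bearer <api-token>"
```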
May 9 2024
This has been deployed to production and can be used via the API Gateway.
I double checked and revertrisk-wikidata has already been deployed to staging/prod, so this task is done.
https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1023460
I double checked and revertrisk has already been deployed to staging/prod, so this task is done.
https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1023460
I double checked and revertrisk-multilingual has already been deployed to staging/prod, so this task is done.
https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1023460
In T363506#9782659, @mfossati wrote, quoting T363506#9757394 by @isarantopoulos: "We would need the upload wizard to send a resized image (224x224) instead of the whole file."
I can imagine we can tackle that from within the Upload Wizard with some JavaScript library. I can create a ticket to look into that if you think this would be the best solution.
May 8 2024
@mfossati We noticed that the user can define the width in the URL, as in this example: http://commons.wikimedia.org/w/index.php?title=Special:FilePath&file=Cambia_logo.png&width=224. If we can use this, it would be sufficient and we could stick with using URLs in the request.
In this case we can change the request to include just the image name and construct the rest of the URL ourselves. Do you know if the name is the unique identifier for the image?
A request would then look like this:

{
  "instances": [
    {
      "filename": "Cambia_logo.png",
      "target": "logo"
    }
  ]
}
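If we go the filename-only route, the server side could rebuild the thumbnail URL roughly like this (a sketch; the URL shape is taken from the Special:FilePath example above, and the 224 default is the model's input size):

```python
from urllib.parse import urlencode

def filepath_url(filename, width=224):
    """Reconstruct a Special:FilePath thumbnail URL from a filename.

    Sketch only: assumes the filename alone identifies the image and
    that the width parameter is honoured for all file types.
    """
    query = urlencode(
        {"title": "Special:FilePath", "file": filename, "width": width}
    )
    return f"https://commons.wikimedia.org/w/index.php?{query}"

url = filepath_url("Cambia_logo.png")
```

Using `urlencode` also takes care of escaping filenames that contain spaces or non-ASCII characters.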
We haven't thought of this yet, mainly because pre-processing logic on the model side already handles resizing. That said, I agree it'd be better to directly send the 224x224 image object.
Apr 30 2024
@mfossati I am in favor of passing the image object in some serialized form.
We would need the upload wizard to send a resized image (224x224) instead of the whole file. Is that something you are already considering, or do you think it would be easy to try?
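To make the resize step concrete, here is a toy nearest-neighbour downscale over a raw pixel grid, stdlib only. Real code would of course use an image library (or a JavaScript canvas in the Upload Wizard); this only illustrates the kind of reduction to a fixed size like 224x224 being discussed:

```python
def resize_nn(pixels, new_w, new_h):
    """Nearest-neighbour resize of a 2D pixel grid (list of rows)."""
    old_h, old_w = len(pixels), len(pixels[0])
    return [
        [pixels[y * old_h // new_h][x * old_w // new_w] for x in range(new_w)]
        for y in range(new_h)
    ]

# Downscale a 4x4 grid to 2x2 by sampling every other pixel.
big = [[x + y * 4 for x in range(4)] for y in range(4)]
small = resize_nn(big, 2, 2)  # [[0, 2], [8, 10]]
```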
Apr 26 2024
The wiki/language restrictions have been lifted. The new changes have been deployed to ml-staging for the moment, and we plan to deploy to production early next week.
curl "https://inference-staging.svc.codfw.wmnet:30443/v1/models/revertrisk-language-agnostic:predict" -X POST -d '{"lang": "dga", "rev_id": 22041}' -H "Host: revertrisk-language-agnostic.revertrisk.wikimedia.org"
Apr 25 2024
Leaving the full set of instructions in case someone wants to try/replicate
Things run fine for me outside of a container. I have successfully run the server in the past using tensorflow instead of tensorflow-cpu.
Apr 24 2024
Update: We have Mistral-7b-instruct hosted on ml-staging; it runs on CPU and uses the pytorch base image we created. A simple request takes approx. 30s (haven't run extensive tests yet).
We are facing some issues using the GPU with this docker image at the moment as documented in T362984: GPU errors in hf image in ml-staging.
Apr 23 2024
At the moment we have tried/used things in the following matrix. Success/fail refers to whether the GPU has been successfully used with pytorch.
Debian bookworm has a different version of the libdrm-amdgpu1 package, as we can see in the repository. We could try bullseye to see if that solves it. The problem is that even if it does, we will still need to solve this issue in the future when we upgrade the Debian version.
Apr 19 2024
Upgrading to keras==3.2.1 resolved the above issue. Nice catch @klausman!
Now in order for requests to work we'd need to give the model server connectivity to the commons upload stash so that the model server can download the images.
Filed a patch to test the latest keras version (3.2.1), which opens the file in "rb" mode:
https://github.com/keras-team/keras/blob/master/keras/src/saving/saving_lib.py#L151
keras 3.2.1:

with open(filepath, "rb") as f:
    return _load_model_from_fileobj(
        f, custom_objects, compile, safe_mode
    )
vs keras 3.0.4 (current version) https://github.com/keras-team/keras/blob/v3.0.4/keras/saving/saving_lib.py#L139:

with file_utils.File(filepath, mode="r+b") as gfile_handle, zipfile.ZipFile(
    gfile_handle, "r"
) as zf:
    with zf.open(_CONFIG_FILENAME, "r") as f:
        config_json = f.read()
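The difference that matters is the open mode: "r+b" asks for read+write access, which fails on a file the process can read but not write, such as a model mounted by the storage-initializer. A minimal reproduction (stdlib only; behaviour shown for a non-root process):

```python
import os
import stat
import tempfile

# Create a stand-in "model file" and make it read-only, mimicking a
# file the serving process does not own and cannot write to.
fd, path = tempfile.mkstemp()
os.write(fd, b"model-bytes")
os.close(fd)
os.chmod(path, stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH)  # 0o444

# "rb" (keras 3.2.1 behaviour): read-only open succeeds.
with open(path, "rb") as f:
    data = f.read()

# "r+b" (keras 3.0.4 behaviour): needs write permission too, so a
# non-root process gets PermissionError on a read-only file.
try:
    open(path, "r+b").close()
    writable = True  # only reachable when running with elevated rights
except PermissionError:
    writable = False

os.remove(path)
```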
The directory is being created by the storage-initializer container when the pod initializes.
I attached a shell to the running pod and checked the permissions on the file; they seem OK:
-rw-r--r-- 1 nobody daemon 71333909 Apr 19 09:10 logo_max_all.keras
Apr 18 2024
This was caused by a change in the order of the imports. RevscoringModelMP depends on RevscoringModel and RevscoringModelType.
There are 2 solutions: