@AgnesAbah have you managed to resolve the issue?
As Kosta mentioned, there isn't anything there related to Lift Wing; it concerns the MediaWiki Action API instead.
May 21 2024
As it turns out, the above approach won't cut it. Even without the dependencies, the compressed image with pytorch 2.3.0 and rocm 6.0 is 4.36GB.
This is the list of packages under /opt/lib/site-packages
functorch torch torch-2.3.0+rocm6.0.dist-info torchgen
Also, it seems that torch-ROCm by itself is ~12GB, so it is indeed getting bigger and bigger.
Images seem to be getting more bloated, so I am exploring the option of installing pytorch-rocm with the --no-dependencies option and handling the dependencies manually, either in the production images repo or on the inference services side. It is a long shot, but I think it is worth trying from our side, at least to rule it out if it can't be done.
Whether this approach is feasible or not will depend on:
The internal URLs also behave properly, so it seems the issue is not on the Lift Wing side but has to do with how the API Gateway translates/encodes the URL.
curl "https://inference.svc.codfw.wmnet:30443/v1/models/article-descriptions%3Apredict" -X POST -d '{"lang": "en", "title": "Clandonald", "num_beams": 2}' -H "Host: article-descriptions.article-descriptions.wikimedia.org" -H "Content-Type: application/json"
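The `%3A` in that path is just the percent-encoded colon of KServe's `:predict` verb. A quick way to double-check what the encoded path segment should look like (illustrative, stdlib only):

```python
from urllib.parse import quote, unquote

# KServe model paths end in a ":predict" verb; in a URL path segment
# the colon is percent-encoded as %3A.
path = quote("article-descriptions:predict", safe="")
print(path)           # article-descriptions%3Apredict
print(unquote(path))  # article-descriptions:predict
```

If the gateway decodes `%3A` back to `:` (or double-encodes it) before forwarding, the upstream route no longer matches, which would explain the behaviour above.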
May 20 2024
Unfortunately, the pytorch package seems to get bigger with each release. The same goes for ROCm.
May 17 2024
May 16 2024
May 15 2024
May 14 2024
As part of the task T362984: GPU errors in hf image in ml-staging we have also experimented with different versions of pytorch (2.2.1, 2.3.0) and rocm (5.6, 5.7, 6.0) and we are still hitting the same issue.
To clarify: the GPU works properly with pytorch 2.0.1 and rocm 5.4.2, but these versions are too old to be used with the huggingfaceserver.
As discussed in the team meeting this task will be restricted to providing a solution for the revertrisk-language-agnostic that currently handles these redirects by rewriting the hosts from the values available in a configuration file. We want to remove this "hack".
Also, a potential idea would be to allow only a specific number of redirects (e.g. up to 3).
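The redirect cap could look roughly like the sketch below. This is hypothetical: `fetch` stands in for whatever call returns either a final payload or a redirect target, and the 3-hop limit is just the example number from above.

```python
# Hypothetical sketch: follow redirects up to a fixed cap instead of
# rewriting hosts from a configuration file. `fetch(url)` is assumed to
# return either a redirect target (a string) or a final payload.
MAX_REDIRECTS = 3

def resolve(url, fetch, max_redirects=MAX_REDIRECTS):
    for _ in range(max_redirects + 1):
        result = fetch(url)
        if not isinstance(result, str):  # final payload, not a redirect
            return result
        url = result  # follow the redirect target
    raise RuntimeError(f"more than {max_redirects} redirects for {url}")
```

Failing loudly past the cap also guards against redirect loops, which the config-file rewrite "hack" never had to worry about.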
We have a first MVP for the package.
We concluded that we will figure out the format after the team completes the spike (accessing the image and sending a thumbnail to Lift Wing).
I'd suggest we proceed with a base64 encoded image for now. Something like this would work:
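A minimal sketch of what such a payload could look like; the field names (`filename`, `image`) are illustrative, not a settled schema:

```python
import base64
import json

# Stand-in for real file contents read from disk or the upload stash.
image_bytes = b"\x89PNG..."

# Illustrative request body: the image travels as a base64 string
# inside the usual "instances" envelope.
payload = {
    "instances": [
        {
            "filename": "Cambia_logo.png",
            "image": base64.b64encode(image_bytes).decode("ascii"),
        }
    ]
}
body = json.dumps(payload)

# The model server would simply decode it back to bytes:
decoded = base64.b64decode(payload["instances"][0]["image"])
assert decoded == image_bytes
```

Base64 adds roughly 33% to the payload size, which is another argument for sending a small resized image rather than the original file.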
Tested llm image for nllb-200 with pytorch 2.3.0 and rocm 6.0 and got the same errors:
amdgpu_device_initialize: amdgpu_get_auth (1) failed (-1) amdgpu_device_initialize: amdgpu_get_auth (1) failed (-1)
I manually changed the image in the experimental namespace in ml-staging and tested this. After the test I reverted it back to the previous one, so the deployment now still has pytorch 2.2.1 and rocm 5.7.
May 13 2024
May 10 2024
In T362749#9786161, @Ladsgroup wrote: "Yes, Upload stash shouldn't be accessed directly or indirectly. It is internal to mediawiki and private. You can do it post-upload and add a comment or something (similar to what Automoderator is planning to do with edits or what ores did)."
I got an error when trying llm image locally with bullseye-torch2.3.0-rocm5.7 (related patch):
Traceback (most recent call last):
  File "/srv/app/llm/model.py", line 9, in <module>
    import torch
  File "/opt/lib/python/site-packages/torch/__init__.py", line 237, in <module>
    from torch._C import *  # noqa: F403
ImportError: libamdhip64.so: cannot enable executable stack as shared object requires: Invalid argument
Haven't found any related open issue, so I'm currently testing different pytorch versions to see if the issue still exists.
Regarding the nllb-gpu deployment: we successfully tested it when we first obtained the MI100. The deployment was removed at some point because it wasn't being used and we wanted to start working on the GPU with the huggingface image. The same applies to the article-descriptions model server.
However, both have only been tested with torch 2.0.1-rocm5.4.2.
So let's try a different combination: KServe upstream recently updated the huggingfaceserver to support torch 2.3.0, which will also be available in kserve 0.13. This means we can test rocm 5.7 with it.
I am working towards updating this.
@mfossati is there any other way to access the images in the upload stash other than using a cookie? Using a user cookie to access an API doesn't seem like the right way for a production application, from both a design and a security point of view. An API key/token would seem more appropriate (if such an option is available).
Another option would be to allow Lift Wing IPs direct access to the stash in some way, but then again I'm unaware if that is possible.
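For comparison, token-based auth would keep the credential out of session state entirely. A sketch of what the client side could look like; the bearer scheme and token are placeholders, not an existing MediaWiki setup:

```python
import urllib.request

# Hypothetical token-based request instead of a user cookie; the
# endpoint and "Bearer <api-token>" value are placeholders.
req = urllib.request.Request(
    "https://commons.wikimedia.org/w/api.php",
    headers={"Authorization": "Bearer <api-token>"},
)

# No request is sent here; this only shows the header being attached,
# which is the part that would replace the cookie.
assert req.get_header("Authorization") == "Bearer <api-token>"
```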
May 9 2024
This has been deployed to production and can be used via the API Gateway.
I double checked and revertrisk-wikidata has already been deployed to staging/prod, so this task is done.
https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1023460
I double checked and revertrisk has already been deployed to staging/prod, so this task is done.
https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1023460
I double checked and revertrisk-multilingual has already been deployed to staging/prod, so this task is done.
https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1023460
In T363506#9782659, @mfossati wrote, quoting T363506#9757394 by @isarantopoulos: "We would need the upload wizard to send a resized image (224x224) instead of the whole file."
I can imagine we can tackle that from within the Upload Wizard with some JavaScript library. I can create a ticket to look into that if you think this would be the best solution.
May 8 2024
@mfossati We noticed that the user can define the width in the URL, as in this example: http://commons.wikimedia.org/w/index.php?title=Special:FilePath&file=Cambia_logo.png&width=224. If we can use this, it would be sufficient and we could stick with using URLs in the request.
In this case we can change the request to include just the image name and construct the rest of the URL ourselves. Do you know if the name is the unique identifier for the image?
A request would then look like this:

{
  "instances": [
    {
      "filename": "Cambia_logo.png",
      "target": "logo"
    }
  ]
}
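If we go the filename-only route, the server side could rebuild the thumbnail URL roughly like this (a sketch; the URL shape is taken from the Special:FilePath example above, and the 224 default is the model's input size):

```python
from urllib.parse import urlencode

def filepath_url(filename, width=224):
    """Reconstruct a Special:FilePath thumbnail URL from a filename.

    Sketch only: assumes the filename alone identifies the image and
    that the width parameter is honoured for all file types.
    """
    query = urlencode(
        {"title": "Special:FilePath", "file": filename, "width": width}
    )
    return f"https://commons.wikimedia.org/w/index.php?{query}"

url = filepath_url("Cambia_logo.png")
```

Using `urlencode` also takes care of escaping filenames that contain spaces or non-ASCII characters.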
We haven't thought of this yet, mainly because pre-processing logic on the model side already handles resizing. That said, I agree it'd be better to directly send the 224x224 image object.
Apr 30 2024
@mfossati I am in favor of passing the image object in some serialized form.
We would need the upload wizard to send a resized image (224x224) instead of the whole file. Is that something you are already considering, or do you think it would be easy to try?
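To make the resize step concrete, here is a toy nearest-neighbour downscale over a raw pixel grid, stdlib only. Real code would of course use an image library (or a JavaScript canvas in the Upload Wizard); this only illustrates the kind of reduction to a fixed size like 224x224 being discussed:

```python
def resize_nn(pixels, new_w, new_h):
    """Nearest-neighbour resize of a 2D pixel grid (list of rows)."""
    old_h, old_w = len(pixels), len(pixels[0])
    return [
        [pixels[y * old_h // new_h][x * old_w // new_w] for x in range(new_w)]
        for y in range(new_h)
    ]

# Downscale a 4x4 grid to 2x2 by sampling every other pixel.
big = [[x + y * 4 for x in range(4)] for y in range(4)]
small = resize_nn(big, 2, 2)  # [[0, 2], [8, 10]]
```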
Apr 26 2024
The wiki/language restrictions have been lifted. The new changes have been deployed to ml-staging for the moment, and we plan to deploy to production early next week.
curl "https://inference-staging.svc.codfw.wmnet:30443/v1/models/revertrisk-language-agnostic:predict" -X POST -d '{"lang": "dga", "rev_id": 22041}' -H "Host: revertrisk-language-agnostic.revertrisk.wikimedia.org"
Apr 25 2024
Leaving the full set of instructions in case someone wants to try/replicate
Things run fine for me outside of a container. I have successfully run the server in the past using tensorflow instead of tensorflow-cpu.
Apr 24 2024
Update: We have Mistral-7b-instruct hosted on ml-staging; it runs on CPU and uses the pytorch base image we created. A simple request takes approx. 30s (haven't run extensive tests yet).
We are facing some issues using the GPU with this docker image at the moment as documented in T362984: GPU errors in hf image in ml-staging.
Apr 23 2024
At the moment we have tried/used things in the following matrix. Success/fail refers to whether the GPU has been successfully used with pytorch.
Debian bookworm has a different version of the libdrm-amdgpu1 package, as we can see in the repository. We could try bullseye to see if that solves it. The problem is that even if it does, we will still need to solve this issue in the future when we upgrade the Debian version.
Apr 19 2024
Upgrading to keras==3.2.1 resolved the above issue. Nice catch @klausman!
Now in order for requests to work we'd need to give the model server connectivity to the commons upload stash so that the model server can download the images.
Filed a patch to test the latest keras version (3.2.1), which opens the file in "rb" mode:
https://github.com/keras-team/keras/blob/master/keras/src/saving/saving_lib.py#L151
keras 3.2.1:

with open(filepath, "rb") as f:
    return _load_model_from_fileobj(
        f, custom_objects, compile, safe_mode
    )
vs keras 3.0.4 (current version) https://github.com/keras-team/keras/blob/v3.0.4/keras/saving/saving_lib.py#L139:

with file_utils.File(filepath, mode="r+b") as gfile_handle, zipfile.ZipFile(
    gfile_handle, "r"
) as zf:
    with zf.open(_CONFIG_FILENAME, "r") as f:
        config_json = f.read()
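The difference that matters is the open mode: "r+b" asks for read+write access, which fails on a file the process can read but not write, such as a model mounted by the storage-initializer. A minimal reproduction (stdlib only; behaviour shown for a non-root process):

```python
import os
import stat
import tempfile

# Create a stand-in "model file" and make it read-only, mimicking a
# file the serving process does not own and cannot write to.
fd, path = tempfile.mkstemp()
os.write(fd, b"model-bytes")
os.close(fd)
os.chmod(path, stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH)  # 0o444

# "rb" (keras 3.2.1 behaviour): read-only open succeeds.
with open(path, "rb") as f:
    data = f.read()

# "r+b" (keras 3.0.4 behaviour): needs write permission too, so a
# non-root process gets PermissionError on a read-only file.
try:
    open(path, "r+b").close()
    writable = True  # only reachable when running with elevated rights
except PermissionError:
    writable = False

os.remove(path)
```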
The directory is being created by the storage-initializer container when the pod initializes.
I attached a shell to the running pod and checked the permissions on the file; they seem OK:
-rw-r--r-- 1 nobody daemon 71333909 Apr 19 09:10 logo_max_all.keras
Apr 18 2024
This was caused by a change in the order of the imports. RevscoringModelMP depends on RevscoringModel and RevscoringModelType.
There are 2 solutions: