
Reimplement the model-upload script to take into consideration new use cases
Closed, ResolvedPublic

Description

In T391958 it became clear that the model-upload script needs some improvements, since it assumes that we upload a single model binary file at a time. There are new use cases, like uploading an entire directory tree (with subdirectories, etc.), that should probably be implemented in Python rather than bash for ease of debugging, testing and readability.

The script currently lives in the puppet repository (./modules/profile/templates/statistics/explorer/ml/model_upload.sh.erb) and gets deployed to all the statxxxx nodes (with the related S3 credentials, etc.).

Event Timeline

I've started work on this ticket and have reimplemented the bash script in Python, taking advantage of boto3 to handle the connection to Swift/S3.
However, I'm facing the following questions at the moment:

  1. How can I easily test the Python script?

I'm wondering if I could test by uploading dummy files, but I'm not sure whether we have enough permissions to delete them later.
Another option would be to simply re-upload one of the currently used models.

  2. Where should the new script live? Do other teams use the model-upload script?

Currently, the model-upload bash script is part of the puppet repository, which makes it available on each stat machine. In theory, we could do the same with the Python script; however, the Python script also requires some dependencies (e.g. boto3) to work properly. Possibly we could install the required dependencies into the global Python on the stat machines to make it work easily(?).

Another option would be to put it inside the inference-services repository, similar to how we do load tests with locust: provide a Makefile that initialises the required environment and runs the script. However, this approach makes the script a little harder to use, because one needs to clone the repository first.
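For reference, the directory-upload part of the new script boils down to a recursive walk plus per-file uploads. A rough sketch, where the bucket/prefix/endpoint values are illustrative and the real script also does error handling, logging and argument parsing:

```python
import os


def iter_upload_pairs(local_dir, prefix):
    """Yield (local_path, object_key) pairs for every file under local_dir,
    preserving the subdirectory layout in the object keys."""
    for root, _dirs, files in os.walk(local_dir):
        for name in sorted(files):
            path = os.path.join(root, name)
            rel = os.path.relpath(path, local_dir)
            yield path, f"{prefix.rstrip('/')}/{rel}"


def upload_directory(local_dir, bucket, prefix, endpoint_url):
    """Upload a whole directory tree to an S3-compatible store."""
    import boto3  # deferred so the pure key logic works without boto3 installed
    s3 = boto3.client("s3", endpoint_url=endpoint_url)
    for path, key in iter_upload_pairs(local_dir, prefix):
        s3.upload_file(path, bucket, key)
```

Keeping the key computation separate from the boto3 calls also makes that part trivially unit-testable without touching Swift.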

Thanks for working on this initiative.
I will share my thoughts, which may answer some of your questions.

  1. You can always test your Python script on the stat machines, e.g. ssh stat1008.eqiad.wmnet.
    • You can indeed test it first with some dummy files (that is how I tested s3cmd when I initially wanted to upload edit-check).
    • We have permissions to delete files using: s3cmd -c /etc/s3cmd/cfg.d/ml-team.cfg del s3://wmf-ml-models/.../.../...
    • You can also test with a real model: download one using transformers and then try to upload it with your script:
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

model_name = "markussagen/xlm-roberta-longformer-base-4096"
save_dir = "xlm-roberta-longformer-base-4096"  # any local directory

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForQuestionAnswering.from_pretrained(model_name)

tokenizer.save_pretrained(save_dir)
model.save_pretrained(save_dir)
  2. I am not sure about this answer, but I think that the script should live somewhere in eqiad so that it can be run from the many machines connected to eqiad (e.g. statboxes, etc.). But an SRE can surely tell you more about it.

Hi @BWojtowicz-WMF! Thanks for working on this :)

My 2c: the script should be available only to ml-admins (namely, all members of your team), because you should be in control of what is pushed to Swift/S3 (so we can implement the protocol/workflow in https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing#Hosting_stages_for_a_model_server_on_Lift_Wing in a safe way). About the script: if it is self-contained and reasonably short, it is fine to have it as a single file in the puppet repo and install dependencies like `python3-boto3` (caveat: make sure that the deps you need are available as Debian packages). If the script is more complex, with multiple modules, etc., then it is better to create a separate repo and possibly a Debian package for it.

Thank you both for the answers, this helps a lot! @elukey @gkyziridis

I'll try to make the script as self-contained as possible; ideally the only external dependency would be python3-boto3, which would make it possible to put it in the puppet repository. I'll drop any additional dependencies and test the script. Once done, we can start putting it inside puppet :)

I've made the Python model-upload script work with just urllib3 and boto3 as external dependencies, both of which are available as the Debian packages python3-boto3 and python3-urllib3. I also successfully tested the use cases we want to support.

@elukey What would be the next steps to put it inside puppet repository? Should I create a patch to put it next to the current model-upload bash script?
I'm also wondering about:

  • Where can we define the additional debian packages we need to install: python3-boto3 and python3-urllib3?
  • How can we add the script to the PATH to make the usage easier?

I have two additional sidenotes:

  • During development, I discovered a small problem with the Swift storage that makes it quite incompatible with new versions of boto3 (> 1.35). Namely, boto3 started aggressively enforcing chunked uploads (aka multi-part uploads) in the new versions, and this is very hard to circumvent. It is problematic because Swift does not support chunked uploads: https://github.com/openstack/swift/blob/5d1dbbccbee04bf7c703538dbab7c903ea146778/swift/common/middleware/s3api/s3request.py#L920-L921. The current python3-boto3 Debian package installs the old boto3==1.26.27, which does not enforce chunked uploads and therefore works in our case. However, this might become a problem in the future.
  • The rewritten script sits at ~300 lines of Python, which is much more than the previous 73 lines of bash. The new code is mostly error handling, logging, argument parsing and boto3 configuration, and I'm wondering how big an improvement in debuggability/testability/readability it really is over the bash script we're currently using. I'm happy to discuss this on the patch.

FYI @isarantopoulos

Hello!

@elukey What would be the next steps to put it inside puppet repository? Should I create a patch to put it next to the current model-upload bash script?

Yep, I think that is a good start, namely profile::statistics::explorer::ml. You can also delete the config for the old script (and the script itself), since we are not going to use it anymore.

I'm also wondering about:

  • Where can we define the additional debian packages we need to install: python3-boto3 and python3-urllib3?

In the profile::statistics::explorer::ml class there is an ensure_packages list; you can use that one, in my opinion.

  • How can we add the script to the PATH to make the usage easier?

It should be sufficient to follow what we did for the model_upload script, namely putting it under /usr/local/bin (which is already in the default PATH).


I have two additional sidenotes:

  • During the development, I've discovered a small problem with Swift storage, which makes it quite incompatible with new versions of boto3 (> 1.35). Namely, boto3 started aggressively enforcing chunked uploads (aka multi-part uploads) with the new versions, which is very hard to circumvent. This is problematic, because Swift storage does not support the chunked uploads: https://github.com/openstack/swift/blob/5d1dbbccbee04bf7c703538dbab7c903ea146778/swift/common/middleware/s3api/s3request.py#L920-L921. The current python3-boto3 debian package is installing old boto3==1.26.27, which does not enforce chunked uploads, which makes it work in our case. However, this might become a problem in the future.

Great finding, I wasn't aware of it. In theory we are good until the stat10xx hosts migrate to Debian 13/Trixie (https://packages.debian.org/trixie/python3-boto3); they are currently on 11/Bullseye (and will likely transition to 12/Bookworm first). By that point I'd hope we'll have migrated away from Thanos Swift in favor of apus/ceph, which should support chunked uploads, so I'd say it is fine to proceed for now.

  • The re-written script sits at ~300 lines of Python code, which is much more than the previous 73 lines of bash script. The new code is mostly related to: error handling, logging, argument parsing and boto3 configuration. I'm wondering how big of improvement in debuggability/testability/readability it really is over the bash script we're currently using. I'm happy to discuss this on the patch.

It is not a little script but also not a huge one, I think we can probably allow it for puppet, let's send the patch and we'll discuss in there :)

Change #1166345 had a related patch set uploaded (by Bartosz Wójtowicz; author: Bartosz Wójtowicz):

[operations/puppet@production] statistics: Add Python script for model uploading to statistics machines.

https://gerrit.wikimedia.org/r/1166345

Change #1166345 merged by Elukey:

[operations/puppet@production] statistics: Add Python script for model uploading to statistics machines.

https://gerrit.wikimedia.org/r/1166345

At this point, I've reimplemented the model-upload script in Python, tested its functionality and merged it into the puppet repository. However, it's not currently functional, as I made one major mistake: I implemented and tested it with boto3==1.26.27, which is the version distributed as a .deb package on Debian Bookworm, whereas our stat machines are based on Debian Bullseye, which distributes the older boto3==1.13.14.

I've tried updating the script to work with boto3==1.13.14, but I believe I came across compatibility issues with our Swift storage. There may be workarounds to make it compatible (e.g. by constructing all request headers manually), but they would make the code much more complex.

I believe there are now two main paths forward:

  1. Update the stat machines to Debian Bookworm. According to our upgrade policy, Debian Bullseye is currently in its deprecation period and we should push to upgrade by September. However, the new model-upload script would not work until the upgrade happens.
  2. Refactor the created model_upload.py script to work on Bullseye. This would mean either finding workarounds to make the old boto3 work with our Swift storage, or using s3cmd via subprocess calls to bypass boto3 completely.

I think we should assess how viable option 1 is - ideally, we could wait for the stat machines to be upgraded to the new Debian release, which would also make the currently developed script work.


One more thing I'd love to hear about is the plan and timeline for changing our storage layer from Swift to Ceph. This could change how we interact with storage via boto3, which could in turn require modifications to the model-upload script and to how we'll interact with storage in our Airflow pipelines.

We've discussed the points above in our ML Team Meeting, which resulted in the following plan:

Since the current model_upload.py script can work, but only with a boto3 version that is not available as an apt package, we will provide a Makefile, which:

  1. Sets up the Python environment with required boto3 version.
  2. Runs the model_upload.py script.

This will make the script usable already, and once we migrate to Bookworm we'll be able to drop the Makefile and use the script directly.

Setting up a virtual environment on each stat10xx host may be a little painful; we don't have a good way to do that with our deployment tools (or rather, we have one but it adds a lot of overhead). For this use case we could simply modify the script to warn the user about the need for a venv (maybe with an ad-hoc quick snippet/command to create one and pip install the right package) if boto3 is not importable. This hack would be removed once the stat boxes are upgraded to Bookworm. Thoughts?

@elukey I think the idea was that we'd provide a Makefile such that everybody could run just make model-upload ..., which would install the venv and run the script. However, this Makefile would also need to be in PATH so that everyone can call it from anywhere.

I am hesitant about the idea of adding logs/snippets with instructions on how to create the venv and run the script, because it would feel like a big usability downgrade from our current bash-script approach.

Do you think the idea of providing a Makefile to automate this makes sense here?

@BWojtowicz-WMF you can definitely add another script under say /usr/local/bin/ via puppet, that is called create-model-upload-venv or similar, and add in there the commands that you'd add to a makefile (so you don't need any Makefile around, nor the make command etc..). A compromise at this point could be to try to import boto3 in model-upload.py, and if not present just suggest to run the create-model-upload-venv. What do you think?
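A rough sketch of the guard I have in mind; the helper-script name follows the comment above and the minimum version is illustrative, not a decision:

```python
import sys

MIN_BOTO3 = (1, 26)  # illustrative minimum (major, minor)


def version_tuple(version_string):
    """Turn a version string like '1.26.27' into a comparable (1, 26) tuple."""
    return tuple(int(part) for part in version_string.split(".")[:2])


def boto3_problem():
    """Return an error message if boto3 is missing or too old, else None."""
    try:
        import boto3
    except ImportError:
        return ("boto3 is not importable; run create-model-upload-venv, "
                "activate the virtualenv it creates, then retry.")
    if version_tuple(boto3.__version__) < MIN_BOTO3:
        return (f"boto3 {boto3.__version__} is too old; run "
                "create-model-upload-venv to get a compatible version.")
    return None


if __name__ == "__main__":
    problem = boto3_problem()
    if problem:
        sys.exit(problem)  # non-zero exit, message printed to stderr
```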

@elukey I think it sounds like a good compromise. I'll go this way, thank you!

Change #1180823 had a related patch set uploaded (by Bartosz Wójtowicz; author: Bartosz Wójtowicz):

[operations/puppet@production] statistics: Update model upload script to check for correct boto3 version.

https://gerrit.wikimedia.org/r/1180823

Change #1180823 merged by Elukey:

[operations/puppet@production] statistics: Update model upload script to check for correct boto3 version.

https://gerrit.wikimedia.org/r/1180823

We should update the documentation in Wikitech on how to use the new script before we resolve this task. https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Deploy#How_to_upload_a_model_to_Swift
@Batorsz shall we also remove the old one?

Yes, I would keep this task open until the documentation has been updated.

shall we also remove the old one?

I'd suggest we also remove the old script once we update the documentation, it would make sure we use the new script :D

(wrong Bartosz :) )

right, sorry for the wrong ping!

Change #1190577 had a related patch set uploaded (by Bartosz Wójtowicz; author: Bartosz Wójtowicz):

[operations/puppet@production] statistics: Update docstrings and help messages for model_upload script.

https://gerrit.wikimedia.org/r/1190577

Change #1190577 merged by Elukey:

[operations/puppet@production] statistics: Delete old model-upload script.

https://gerrit.wikimedia.org/r/1190577