
Implement model storage for enwiki-goodfaith inference service
Closed, Resolved · Public

Description

Let's try loading the model binary from Thanos Swift and injecting it into the revscoring container. A Swift account was created in T280773.
The KFServing storage initializer uses boto3, so we should be able to set a custom endpoint URL as seen here:
https://github.com/kubeflow/kfserving/blob/master/docs/samples/storage/s3/s3_secret.yaml
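
For reference, a minimal sketch of what that secret and service account could look like, adapted from the linked sample (names are placeholders, and the annotation/credential key names should be double-checked against our KFServing version):

apiVersion: v1
kind: Secret
metadata:
  name: thanos-swift-secret          # hypothetical name
  annotations:
    # Point the boto3-based storage initializer at Thanos Swift instead of AWS.
    serving.kubeflow.org/s3-endpoint: thanos-swift.discovery.wmnet
    serving.kubeflow.org/s3-usehttps: "1"
type: Opaque
stringData:
  awsAccessKeyID: REPLACE_ME         # Swift account credentials from T280773
  awsSecretAccessKey: REPLACE_ME
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: thanos-swift-sa              # hypothetical; referenced by the InferenceService
secrets:
- name: thanos-swift-secret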

Currently the model binary file is packaged inside the container. Let's do the following:

  1. Upload the model to Thanos Swift, using the timestamp naming convention mentioned in T280467.
  2. Add a storage_uri to the container service and remove the model binary file from the image.

The storage_uri should look like this: "s3://ml-models/goodfaith/enwiki/202104181735"

Event Timeline

Restricted Application added a subscriber: Aklapper.

@elukey or @klausman: whenever you have time, can you please upload the enwiki-goodfaith model to Thanos Swift?
I think you two are the only ones with access to the credentials.

The model file is here: https://github.com/wikimedia/editquality/blob/master/models/enwiki.goodfaith.gradient_boosting.model

I wrote a small bash script to upload the models if you want to use it: P15960.
It follows the naming convention discussed in T280467,
i.e. s3://wmf-ml-models/goodfaith/enwiki/202105132212/model.bin
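
For the curious, the gist of the script is roughly the following (a minimal sketch assuming s3cmd; the actual paste may differ):

#!/bin/bash
# Sketch of the upload flow; see P15960 for the real script.
MODEL_FILE="enwiki.goodfaith.gradient_boosting.model"
BUCKET="s3://wmf-ml-models"
# Timestamped prefix per the T280467 naming convention.
TIMESTAMP="$(date +%Y%m%d%H%M)"

# Create the bucket if it does not exist yet.
s3cmd info "${BUCKET}/" >/dev/null 2>&1 || s3cmd mb "${BUCKET}/"

# Upload the model under goodfaith/enwiki/<timestamp>/.
s3cmd put "${MODEL_FILE}" "${BUCKET}/goodfaith/enwiki/${TIMESTAMP}/"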

@ACraze I added the following to all our home directories on ml-serve1001:

elukey@ml-serve1001:~$ ls -l /home/accraze/.s3cfg 
-rw------- 1 accraze root 191 May 14 06:12 /home/accraze/.s3cfg

This should allow us to manage models on our swift bucket :)
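
For reference, that file is a standard s3cmd config; a minimal sketch looks like this (credentials redacted; the host values assume the Thanos Swift endpoint mentioned further down):

[default]
access_key = REPLACE_ME
secret_key = REPLACE_ME
host_base = thanos-swift.discovery.wmnet
host_bucket = thanos-swift.discovery.wmnet
use_https = True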

@ACraze

$ ./model_upload.sh 
CHECKING FOR MODEL_BUCKET
Bucket 's3://wmf-ml-models/' created
UPLOADING enwiki.goodfaith.gradient_boosting.model to s3://wmf-ml-models/goodfaith/enwiki/202105140814
upload: 'enwiki.goodfaith.gradient_boosting.model' -> 's3://wmf-ml-models/goodfaith/enwiki/202105140814/enwiki.goodfaith.gradient_boosting.model'  [1 of 1]
 110612 of 110612   100% in    0s   339.32 KB/s  done
$

Awesome, thank you both! Next I'm going to wire up the storage_uri, but I'll need to set up a custom S3 endpoint. Do either of you know what the Thanos Swift URL is? edit: never mind, I found it: https://thanos-swift.discovery.wmnet

Change 693217 had a related patch set uploaded (by Accraze; author: Accraze):

[machinelearning/liftwing/inference-services@main] [WIP] create secret for thanos swift

https://gerrit.wikimedia.org/r/693217

Did some more digging into storage today. It seems that V1alpha2CustomSpec does not support the storageUri field the way the framework specs (e.g. V1alpha2SKLearnSpec) do.

This leaves us with a couple of options:

  1. If we stay on the v1alpha2 API, we can continue packaging the model binary inside the image. The downside is that we will need to push a new version of the image to our registry every time we re-train a model.
  2. We can specify a STORAGE_URI environment variable on our container as seen here. We may have to write some custom code in our service to wire everything up correctly.
  3. Another option would be to look into a new feature called Multi-Model Serving (MMS), which addresses scalability issues when a large number of models are deployed on a cluster. I imagine we will eventually need this, since we have 100+ ORES models to deploy.

I will continue exploring these options next week.

Found a really helpful github issue today related to using STORAGE_URI with custom inference services:
https://github.com/kubeflow/kfserving/issues/1232

It seems that you can indeed use STORAGE_URI as an environment variable on a custom image with the v1alpha2 API, as long as container.name is set to 'kfserving-container'. This should trigger the same behavior as the framework images, and the model will be available at /mnt/models.

For some reason this has not been documented anywhere yet; let's test it out and, if it works, send an upstream PR with some documentation about this feature.
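
Based on that issue, here is a rough sketch of what our v1alpha2 spec could look like (image path and names are placeholders; the essential bits are container.name and the STORAGE_URI env var):

apiVersion: serving.kubeflow.org/v1alpha2
kind: InferenceService
metadata:
  name: enwiki-goodfaith
spec:
  default:
    predictor:
      # Service account bound to the secret holding the Swift credentials.
      serviceAccountName: thanos-swift-sa       # hypothetical name
      custom:
        container:
          # Must be exactly 'kfserving-container' for the storage
          # initializer to kick in, per kubeflow/kfserving#1232.
          name: kfserving-container
          image: example/enwiki-goodfaith:latest  # placeholder image
          env:
            - name: STORAGE_URI
              value: s3://wmf-ml-models/goodfaith/enwiki/202105140814

If that works, the model should show up under /mnt/models inside the container.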

To keep archives happy: https://wikitech.wikimedia.org/wiki/User:Elukey/MachineLearning/kfserving#Deploy_a_custom_InferenceService

We succeeded in configuring the storage-initializer init container, which takes care of pulling the model from our internal S3 endpoint (with authentication, etc.) and storing it under /mnt/models.

@ACraze let's review the scope of the task, and decide what actions are left (if any).

@elukey: I reviewed the docs and everything looks good re: rbac role/secret/serviceAccount. I'm going to abandon the WIP CR and will mark this task as RESOLVED.

Change 693217 abandoned by Accraze:

[machinelearning/liftwing/inference-services@main] [WIP] create secret for thanos swift

Reason:

Abandoned due to different approach for rbac/secret/serviceAccount

https://gerrit.wikimedia.org/r/693217