
Implement model storage for enwiki-goodfaith inference service
Closed, Resolved · Public

Description

Let's try loading the model binary from Thanos Swift and injecting it into the revscoring container. A Swift account was created in T280773.
The KFServing storage initializer uses boto3, so we should be able to set a custom endpoint URL as seen here:
https://github.com/kubeflow/kfserving/blob/master/docs/samples/storage/s3/s3_secret.yaml
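
For reference, a minimal sketch of what that secret and service account could look like, adapted from the linked sample (names are placeholders, and the annotation/credential key names should be double-checked against our KFServing version):

apiVersion: v1
kind: Secret
metadata:
  name: thanos-swift-secret          # hypothetical name
  annotations:
    # Point the boto3-based storage initializer at Thanos Swift instead of AWS.
    serving.kubeflow.org/s3-endpoint: thanos-swift.discovery.wmnet
    serving.kubeflow.org/s3-usehttps: "1"
type: Opaque
stringData:
  awsAccessKeyID: REPLACE_ME         # Swift account credentials from T280773
  awsSecretAccessKey: REPLACE_ME
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: thanos-swift-sa              # hypothetical; referenced by the InferenceService
secrets:
- name: thanos-swift-secret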

Currently the model binary file is packaged inside the container. Let's do the following:

  1. Upload the model to Thanos Swift, using the timestamp naming convention mentioned in T280467.
  2. Add a storage_uri to the container service and remove the model binary file from the image.

The storage_uri should look like this: "s3://ml-models/goodfaith/enwiki/202104181735"

Event Timeline

Restricted Application added a subscriber: Aklapper.

@elukey or @klausman: whenever you have time, can you please upload the enwiki-goodfaith model to Thanos Swift?
I think you two are the only ones with access to the credentials.

The model file is here: https://github.com/wikimedia/editquality/blob/master/models/enwiki.goodfaith.gradient_boosting.model

I wrote a small bash script to upload the models if you want to use it: P15960.
It follows the naming convention discussed in T280467,
i.e. s3://wmf-ml-models/goodfaith/enwiki/202105132212/model.bin
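
For the curious, the gist of the script is roughly the following (a minimal sketch assuming s3cmd; the actual paste may differ):

#!/bin/bash
# Sketch of the upload flow; see P15960 for the real script.
MODEL_FILE="enwiki.goodfaith.gradient_boosting.model"
BUCKET="s3://wmf-ml-models"
# Timestamped prefix per the T280467 naming convention.
TIMESTAMP="$(date +%Y%m%d%H%M)"

# Create the bucket if it does not exist yet.
s3cmd info "${BUCKET}/" >/dev/null 2>&1 || s3cmd mb "${BUCKET}/"

# Upload the model under goodfaith/enwiki/<timestamp>/.
s3cmd put "${MODEL_FILE}" "${BUCKET}/goodfaith/enwiki/${TIMESTAMP}/"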

@ACraze I added the following to all our home directories on ml-serve1001:

elukey@ml-serve1001:~$ ls -l /home/accraze/.s3cfg 
-rw------- 1 accraze root 191 May 14 06:12 /home/accraze/.s3cfg

This should allow us to manage models on our swift bucket :)
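
For reference, that file is a standard s3cmd config; a minimal sketch looks like this (credentials redacted; the host values assume the Thanos Swift endpoint mentioned further down):

[default]
access_key = REPLACE_ME
secret_key = REPLACE_ME
host_base = thanos-swift.discovery.wmnet
host_bucket = thanos-swift.discovery.wmnet
use_https = True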

@ACraze

$ ./model_upload.sh 
CHECKING FOR MODEL_BUCKET
Bucket 's3://wmf-ml-models/' created
UPLOADING enwiki.goodfaith.gradient_boosting.model to s3://wmf-ml-models/goodfaith/enwiki/202105140814
upload: 'enwiki.goodfaith.gradient_boosting.model' -> 's3://wmf-ml-models/goodfaith/enwiki/202105140814/enwiki.goodfaith.gradient_boosting.model'  [1 of 1]
 110612 of 110612   100% in    0s   339.32 KB/s  done
$

Awesome, thank you both! Next I'm going to wire up the storage_uri, but I'll need to set up a custom S3 endpoint. Do either of you know what the Thanos Swift URL is? edit: never mind, I found it: https://thanos-swift.discovery.wmnet

Change 693217 had a related patch set uploaded (by Accraze; author: Accraze):

[machinelearning/liftwing/inference-services@main] [WIP] create secret for thanos swift

https://gerrit.wikimedia.org/r/693217

Did some more digging into storage today. It seems that V1alpha2CustomSpec does not support the storageUri field the way the framework specs (e.g. V1alpha2SKLearnSpec) do.

This leaves us with a couple of options:

  1. If we stay on the v1alpha2 API, we can continue packaging the model binary inside the image. The downside is that we will need to push a new version of the image to our registry every time we re-train a model.
  2. We can specify a STORAGE_URI environment variable on our container as seen here. We may have to write some custom code in our service to wire everything up correctly.
  3. Another option would be to look into a new feature called Multi-Model Serving (MMS), which addresses scalability issues when a large number of models are deployed on a cluster. I imagine we will eventually need this, since we have 100+ ORES models to deploy.

I will continue exploring these options next week.

Found a really helpful github issue today related to using STORAGE_URI with custom inference services:
https://github.com/kubeflow/kfserving/issues/1232

It seems that you can indeed use STORAGE_URI as an environment variable on a custom image with the v1alpha2 API, as long as container.name is set to 'kfserving-container'. This should trigger the same behavior as the framework images, and the model will be available at /mnt/models.

For some reason this has not been documented anywhere yet; let's test it out and, if it works, send an upstream PR with some documentation about this feature.
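
Based on that issue, here is a rough sketch of what our v1alpha2 spec could look like (image path and names are placeholders; the essential bits are container.name and the STORAGE_URI env var):

apiVersion: serving.kubeflow.org/v1alpha2
kind: InferenceService
metadata:
  name: enwiki-goodfaith
spec:
  default:
    predictor:
      # Service account bound to the secret holding the Swift credentials.
      serviceAccountName: thanos-swift-sa       # hypothetical name
      custom:
        container:
          # Must be exactly 'kfserving-container' for the storage
          # initializer to kick in, per kubeflow/kfserving#1232.
          name: kfserving-container
          image: example/enwiki-goodfaith:latest  # placeholder image
          env:
            - name: STORAGE_URI
              value: s3://wmf-ml-models/goodfaith/enwiki/202105140814

If that works, the model should show up under /mnt/models inside the container.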

To keep archives happy: https://wikitech.wikimedia.org/wiki/User:Elukey/MachineLearning/kfserving#Deploy_a_custom_InferenceService

We succeeded in configuring the storage-initializer init container, which takes care of pulling the model from our internal S3 endpoint (with authentication, etc.) and storing it under /mnt/models.

@ACraze let's review the scope of the task, and decide what actions are left (if any).

@elukey: I reviewed the docs and everything looks good re: rbac role/secret/serviceAccount. I'm going to abandon the WIP CR and will mark this task as RESOLVED.

Change 693217 abandoned by Accraze:

[machinelearning/liftwing/inference-services@main] [WIP] create secret for thanos swift

Reason:

Abandoned due to different approach for rbac/secret/serviceAccount

https://gerrit.wikimedia.org/r/693217