
Find a way to store models for Kubeflow
Closed, ResolvedPublic

Description

As far as I can see in kfserving/storage.py, it seems that the best way for us to make models available to KFServing/Kubeflow is via Swift.

Some general questions to answer:

  • Does KFServing support Swift? On paper yes, since Swift supports the S3 API, but we should verify.
  • Is SRE on board with us storing models on Swift (from the capacity perspective, bandwidth, etc.)? We should reach out to SRE to ask for more info :)
  • Storing private models on Swift is surely fine, but what should we do about public models? Should we store them in Swift too, and somehow make them available via Commons or similar?
  • How do we deploy models? Should we allow people to just push them to Swift?

Event Timeline

@Theofpa - do you know if KFServing storage would support an object store that follows the S3 API? Or does it only specifically support S3?

We already use Openstack Swift for object storage at WMF and are wondering how well it would integrate with KFServing:
https://wikitech.wikimedia.org/wiki/Swift

At a quick glance, it looks like we should be able to set the boto client's endpoint_url via an env variable and point it to our object store, but we have not tried this yet.
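As a rough, untested sketch of that idea (the bucket, model path, and default hostname below are hypothetical, and this only mirrors the logic of kfserving/storage.py rather than reproducing it):

```python
import os
from urllib.parse import urlparse


def parse_model_uri(uri):
    """Split an s3:// model URI into (bucket, key), roughly the way
    the storage initializer does before handing off to boto3."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3":
        raise ValueError("expected an s3:// URI, got %r" % uri)
    return parsed.netloc, parsed.path.lstrip("/")


def s3_endpoint(default="https://thanos-swift.example.wmnet"):
    """Pick the S3 endpoint from the environment; boto3 would receive
    this value as its endpoint_url. The default host is a placeholder."""
    return os.environ.get("S3_ENDPOINT", default)


# Hypothetical model path, for illustration only.
bucket, key = parse_model_uri("s3://wmf-ml-models/goodfaith/model.bin")
```

If the env-variable override works as hoped, pointing it at the Swift S3 gateway should be the only client-side change needed.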

It should work but I've never tested it. Here is an example of how to configure the endpoint_url across all inferenceservices.
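In case the link gets lost: KFServing's usual mechanism is a Secret annotated with the S3 endpoint, attached to a ServiceAccount that the InferenceServices reference. A sketch, with all names and hosts below being hypothetical placeholders:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: mlserve-swift-secret            # hypothetical name
  annotations:
    serving.kubeflow.org/s3-endpoint: thanos-swift.example.wmnet  # placeholder host
    serving.kubeflow.org/s3-usehttps: "1"
type: Opaque
stringData:
  AWS_ACCESS_KEY_ID: "..."
  AWS_SECRET_ACCESS_KEY: "..."
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: mlserve-sa                      # referenced by each InferenceService
secrets:
  - name: mlserve-swift-secret
```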

By the way, the storage initialiser has recently switched from the MinIO client to boto3, although it shouldn't make any difference.

I can check if there is a simple way to run swift on my local cluster to validate it.

I had a chat with Filippo today (the SRE who has been following Swift closely over the past few years) and I have some very interesting news:

  • There are multiple Swift clusters at Wikimedia; the best known is the backend for Commons, but unfortunately it has no S3 API gateway enabled (it wasn't needed). Very soon the SRE team will create the Misc Object Storage Service, which should be what we need long term (T279621).
  • As an interim solution for the MVP we could use the Thanos Swift cluster. It is present in eqiad and codfw, and its S3 API gateway is enabled. It offers cross-DC replication that is currently not encrypted, so we should be careful when using it (basically, it is not great for PII models etc.).
  • The SRE team can create accounts on any given Swift cluster, and every account can create containers, which in turn contain objects (containers cannot be nested). Every container has one or more policies associated with it (restricted to certain clients, replicated X times, etc.).
  • In the case of the "Commons" Swift cluster, Varnish is instructed to fetch directly from Swift when a Commons user requests an image/video/etc. In our case, we could build a little UI, managed by us, that fetches data with read-only Swift credentials and exposes objects to public clients (maybe we could distinguish between "public" containers and private ones).

If we like the idea, it is sufficient to open a task to get an account on Thanos Swift to support the MVP use case. Long term we'd probably want two accounts in the Misc Object Storage Service: one able to upload models and one able only to read them (for example, in my opinion the KFServing cluster shouldn't be able to push models to Swift).

@ACraze (and all others reading!) lemme know your thoughts!

@Theofpa thanks as always for the great support!

@elukey - this is very insightful, thanks for finding more info re: Swift, I think the M.O.S.S. cluster is most likely a good long term approach.

As an interim solution for the MVP we could use the Thanos Swift cluster. It is present in eqiad and codfw, and its S3 API gateway is enabled. It offers cross-DC replication that is currently not encrypted, so we should be careful when using it (basically, it is not great for PII models etc.).

This sounds interesting, I don't think any of our initial models will have PII so that may not be a huge concern for the MVP. I do wonder what the migration process from Thanos to M.O.S.S. would be like though.

Another alternative I thought of for the MVP would be to use PVC storage, although we would outgrow that solution once we get Train Wing up, since we would not be able to move models across DCs.

In our case, we could build a little UI managed by us that fetches data with some swift credentials (read only) and expose objects to the public clients (maybe we could think about "public" containers and private ones).

I could see this being a 'model registry' that the community could explore & retrieve models from. Also I know that @kevinbazira is highly skilled in UI development so this might be an easy win for us.

Also thanks @Theofpa for the endpoint_url example!

@elukey - this is very insightful, thanks for finding more info re: Swift, I think the M.O.S.S. cluster is most likely a good long term approach.

As an interim solution for the MVP we could use the Thanos Swift cluster. It is present in eqiad and codfw, and its S3 API gateway is enabled. It offers cross-DC replication that is currently not encrypted, so we should be careful when using it (basically, it is not great for PII models etc.).

This sounds interesting, I don't think any of our initial models will have PII so that may not be a huge concern for the MVP. I do wonder what the migration process from Thanos to M.O.S.S. would be like though.

In theory it should be something like:

  • create all the models in the MOSS containers
  • change the docker images that we use to point to them (changing the S3 endpoint)
  • helm deploy to recycle them

This is probably super optimistic and more issues might arise, but I don't expect a ton of problems on this front (famous last words).
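The second step above could be as small as flipping one endpoint value in whatever Helm values file carries the S3 settings. A purely hypothetical sketch (neither the layout nor the hostnames reflect our real charts):

```yaml
# values.yaml sketch - structure and hosts are placeholders
storage:
  s3:
    # endpoint: https://thanos-swift.example.wmnet   # MVP (Thanos Swift)
    endpoint: https://moss.example.wmnet             # long term (MOSS)
```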

Another alternative I thought of for the MVP would be to use PVC storage, although we would outgrow that solution once we get Train Wing up, since we would not be able to move models across DCs.

Persistent volumes in Kubernetes scare me a little, but I am super ignorant about them. I suspect it might boil down to having some storage space shared via virtual volumes, but the Swift approach feels a little cleaner. We can experiment with both and decide!

In our case, we could build a little UI managed by us that fetches data with some swift credentials (read only) and expose objects to the public clients (maybe we could think about "public" containers and private ones).

I could see this being a 'model registry' that the community could explore & retrieve models from. Also I know that @kevinbazira is highly skilled in UI development so this might be an easy win for us.

+1 Kevin let us know your thoughts!

@ACraze ok if I ask SRE to create an ML account on Thanos Swift then? If so I'll create a subtask and work on it next week (it shouldn't take a lot to get one).

ok if I ask SRE to create an ML account on Thanos Swift then?

@elukey - yeah I think we should give this a try and see how it goes.

Persistent volumes in Kubernetes scare me a little, but I am super ignorant about them. I suspect it might boil down to having some storage space shared via virtual volumes, but the Swift approach feels a little cleaner.

The more I think about the PVC approach, the less appealing it becomes compared to Swift. I think we should just keep it in mind as an alternative in case Swift doesn't work for the MVP for some reason.

Swift account created in T280773 on Thanos Swift (also tested it with s3cmd). The account name is mlserve:prod and the password is stored in puppet private for the moment (so Tobias and I can retrieve it). This is a read/write account, which should be OK as a starter :)
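For anyone who wants to repeat the s3cmd test, a minimal ~/.s3cfg sketch; the host is a placeholder and the real secret stays in puppet private:

```ini
[default]
access_key = mlserve:prod
secret_key = <from puppet private>
host_base = thanos-swift.example.wmnet
host_bucket = thanos-swift.example.wmnet
use_https = True
```

With that in place, something like `s3cmd ls` or `s3cmd mb s3://some-test-container` (hypothetical container name) should exercise the S3 gateway.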

calbon claimed this task.
calbon moved this task from Parked to Completed on the Machine-Learning-Team (Active Tasks) board.