Sunset MiniKF sandboxes
Closed, ResolvedPublic

Description

The time has come to migrate off of the MiniKF sandboxes! With the work done in T293331: Migrate from Kfserving to Kserve, we are now running kserve 0.7 on the ml-serve stack, which uses our own images stored in the WMF Docker Registry.

We should no longer develop on clusters that use different distributions of istio/knative/kfserving/kserve/k8s etc.
For now, we can start working in a test namespace on ml-serve and eventually move to a staging cluster.

Todos:

  • Shut down the MiniKF 1.1 cluster
  • Shut down the MiniKF 1.3 cluster
  • Clear out the wmf-ml-models and wmfmltest s3 buckets

Event Timeline

@elukey - let us know once you have the deployment guide ready!

We are drafting a guide in https://wikitech.wikimedia.org/wiki/User:Elukey/MachineLearning/Deploy, and Kevin was able to deploy enwiki-damaging to Lift Wing successfully. This is a good sign to start the deprecation, but we can also wait for more people to complete a deployment :)

We should be ready to proceed with this; if the team agrees, let's do it ASAP :)

Don't wait; let's close down the MiniKF sandboxes and start dogfooding.

Echoing thoughts from the ML team meeting today. I'd like to deprecate the sandbox clusters this week, but I think both Kevin and I have similar questions:

  1. Where can we build our dev images to test new ideas?
  2. Can we run our dev images on ml-serve doing something like kubectl apply -f .... ?
  3. If we should not use kubectl on ml-serve, are non-SREs able to get +2 on deployment-charts repo?
  4. What should our medium-to-long-term plans for dev look like? A general ML box? Just use stat boxes? Better laptops?

  1. Where can we build our dev images to test new ideas?

See below :)

  2. Can we run our dev images on ml-serve doing something like kubectl apply -f ...?

Better to avoid any manual kubectl command; we can think about having a testing namespace or something similar (deployed via Helm). We'll also have a two-node cluster for staging; maybe that one could also become a playground for some experiments (before production).

  3. If we should not use kubectl on ml-serve, are non-SREs able to get +2 on deployment-charts repo?

Need to double-check, but I suspect that an SRE will be needed to review configs etc.
Maybe we could relax our rules and allow the use of kubectl on staging (with some guidelines, like using specific test namespaces, etc.).

  4. What should our medium-to-long-term plans for dev look like? A general ML box? Just use stat boxes? Better laptops?

The staging cluster could be an option, especially if we have to test something private. Laptops with minikube could be optimal, but I guess the ones that we have are not powerful enough, right?

As discussed in the ML Team Meeting today, I have terminated the MiniKF 1.3 sandbox, as it has diverged too much from our production stack to be useful anymore (also, I broke a bunch of things while trying to upgrade to kserve). Once we have our new Horizon VM, we can continue shutting down the other sandbox and clearing out our storage buckets.

The MiniKF 1.1 cluster has now been terminated. The last step is to figure out how to handle storing the model binaries for the new ml-sandbox and delete the older s3 buckets.

I'm investigating a couple of storage options:

  • PVC storage on the cluster
  • use minio on the new ml-cluster (might be overkill)
  • use a new bucket on WMF Swift just for dev and use the uploader script on the stat-boxes?

Not sure what the best option is long-term, but happy to discuss more in the next ML technical meeting.

I've been reading the KServe docs and found an example of using minio for storage in a local cluster:
https://github.com/kserve/website/blob/main/docs/modelserving/kafka/kafka.md

I think we can do something similar for storage on the new Cloud VPS ml-sandbox.
After that is complete, we can officially close out this task!
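
A rough sketch of what that could look like, loosely following the KServe example linked above (namespace, names, and flags here are illustrative placeholders, not a final config):

# Run a minio deployment inside the cluster and expose it on port 9000
kubectl create namespace kserve-test
kubectl -n kserve-test create deployment minio --image=minio/minio -- minio server /data
kubectl -n kserve-test expose deployment minio --port=9000 --target-port=9000

The linked doc has the complete manifests, including the credentials setup.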

I have installed a minio test instance on ml-sandbox and am able to use it for model storage. I have also configured s3cmd to use minio and can use our model_upload script.
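
For anyone reproducing this: the s3cmd side is just a matter of pointing it at the local minio endpoint. A minimal sketch of the relevant ~/.s3cfg entries (endpoint and credentials are placeholders, not our real config):

# ~/.s3cfg (excerpt): talk to the local minio instead of AWS
host_base = localhost:9000
host_bucket = localhost:9000
access_key = <minio-access-key>
secret_key = <minio-secret-key>
use_https = False

With that in place, the upload script works as usual: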

root@ml-sandbox:/srv/home/accraze/ml-sandbox-cfg# ./model_upload.sh model.bin articlequality enwiki wmf-ml-models ~/.s3cfg
s3cmd is /usr/bin/s3cmd
CHECKING FOR MODEL_BUCKET: wmf-ml-models
ERROR: S3 error: 409 (BucketAlreadyOwnedByYou): Your previous request to create the named bucket succeeded and you already own it.
BUCKET ALREADY EXISTS, SKIPPING CREATION...
UPLOADING model.bin to s3://wmf-ml-models/articlequality/enwiki/20220203162607
upload: 'model.bin' -> 's3://wmf-ml-models/articlequality/enwiki/20220203162607/model.bin'  [part 1 of 9, 5MB] [1 of 1]
 5242880 of 5242880   100% in    0s    49.58 MB/s  done
upload: 'model.bin' -> 's3://wmf-ml-models/articlequality/enwiki/20220203162607/model.bin'  [part 2 of 9, 5MB] [1 of 1]
 5242880 of 5242880   100% in    0s    61.00 MB/s  done
upload: 'model.bin' -> 's3://wmf-ml-models/articlequality/enwiki/20220203162607/model.bin'  [part 3 of 9, 5MB] [1 of 1]
 5242880 of 5242880   100% in    0s    52.22 MB/s  done
upload: 'model.bin' -> 's3://wmf-ml-models/articlequality/enwiki/20220203162607/model.bin'  [part 4 of 9, 5MB] [1 of 1]
 5242880 of 5242880   100% in    0s    43.65 MB/s  done
upload: 'model.bin' -> 's3://wmf-ml-models/articlequality/enwiki/20220203162607/model.bin'  [part 5 of 9, 5MB] [1 of 1]
 5242880 of 5242880   100% in    0s    59.89 MB/s  done
upload: 'model.bin' -> 's3://wmf-ml-models/articlequality/enwiki/20220203162607/model.bin'  [part 6 of 9, 5MB] [1 of 1]
 5242880 of 5242880   100% in    0s    69.83 MB/s  done
upload: 'model.bin' -> 's3://wmf-ml-models/articlequality/enwiki/20220203162607/model.bin'  [part 7 of 9, 5MB] [1 of 1]
 5242880 of 5242880   100% in    0s    46.45 MB/s  done
upload: 'model.bin' -> 's3://wmf-ml-models/articlequality/enwiki/20220203162607/model.bin'  [part 8 of 9, 5MB] [1 of 1]
 5242880 of 5242880   100% in    0s    46.81 MB/s  done
upload: 'model.bin' -> 's3://wmf-ml-models/articlequality/enwiki/20220203162607/model.bin'  [part 9 of 9, 2MB] [1 of 1]
 2750759 of 2750759   100% in    0s    31.55 MB/s  done

I can verify that the model binary is available in the wmf-ml-models bucket on minio:

root@ml-sandbox:/srv/home/accraze/ml-sandbox-cfg# ./mc ls myminio/wmf-ml-models/articlequality/enwiki/20220203162607
[2022-02-03 16:26:08 UTC]  43MiB STANDARD model.bin

In the Inference Service config, I can point the STORAGE_URI to this object like so:

- name: STORAGE_URI
  value: "s3://wmf-ml-models/articlequality/enwiki/20220203162607/"
ACraze changed the task status from Open to In Progress. (Feb 3 2022, 7:58 PM)

@kevinbazira - I believe model storage is now ready on ml-sandbox. Can you try these steps to see if you can upload a model binary to our minio object store?

  1. In a separate terminal, ssh to ml-sandbox and run:
kubectl port-forward $(kubectl get pod -n kserve-test --selector="app=minio" --output jsonpath='{.items[0].metadata.name}') 9000:9000 -n kserve-test

This will expose minio outside of minikube so we can use the model_upload script and/or the minio client to store model files (a quick sanity check is sketched after these steps).

  2. In another terminal, ssh to ml-sandbox and try uploading a model using the model_upload script:
model_upload model.bin articlequality enwiki wmf-ml-models ~/.s3cfg

I have a model.bin for articlequality that you can use (/srv/home/accraze/ml-sandbox-cfg), or feel free to upload any other model. Also, the last param (the .s3cfg file) is in the root home dir.

  3. Confirm that the object is available in minio using the minio client (mc):
root@ml-sandbox:/srv/home/accraze# mc ls myminio -r
[2022-02-03 19:47:56 UTC]  43MiB STANDARD wmf-ml-models/articlequality/enwiki/20220203194754/model.bin
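
As a quick sanity check for step 1: minio exposes a liveness endpoint, so something like this (hypothetical, assuming the forward is on localhost:9000) should return HTTP 200 once the port-forward is up:

curl -I http://localhost:9000/minio/health/live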

I have configured kserve storage-initializer to pull from our minio test instance when loading a model for an Inference Service, so you should be able to reference the STORAGE_URI as usual:

- name: STORAGE_URI
  value: "s3://wmf-ml-models/articlequality/enwiki/20220203194754/"

Let me know how this goes, if you feel confident with this approach, then we can clear out our old s3 buckets and close this task.

@ACraze, thank you for working on the model storage and sharing this documentation.

The model upload worked successfully:

root@ml-sandbox:/srv/home/kevinbazira# mc ls myminio -r
[2022-02-04 05:53:43 UTC]  43MiB STANDARD wmf-ml-models/articlequality/enwiki-test-upload-by-kevin/20220204055342/model.bin

Then I went ahead and tested it in an isvc.

  1. I created an enwiki-articlequality-test-by-kevin isvc that used the new model STORAGE_URI:
- name: STORAGE_URI
  value: "s3://wmf-ml-models/articlequality/enwiki-test-upload-by-kevin/20220204055342/"
  2. Checked the new isvc, but it didn't seem to be running:
root@ml-sandbox:/srv/home/kevinbazira/isvcs/articlequality-test-by-kevin# kubectl get inferenceservice -n kserve-test
NAME                                  URL                                                      READY   PREV   LATEST   PREVROLLEDOUTREVISION   LATESTREADYREVISION                             AGE
enwiki-articlequality                 http://enwiki-articlequality.kserve-test.wikimedia.org   True           100                              enwiki-articlequality-predictor-default-dqk9g   10h
enwiki-articlequality-test-by-kevin                                                            False
  3. Deleted the isvc and recreated it pointing to the old model STORAGE_URI:
- name: STORAGE_URI
  value: "s3://wmf-ml-models/articlequality/enwiki/wp10/202105271538/"
  4. Checked the new isvc, and it also was not running.

I am not sure whether the isvcs were failing because of the model storage or because of another issue on the ml-sandbox. I will continue digging to see what the cause might be.

@ACraze, were you able to create an isvc pointing to the new model STORAGE_URI? Did the isvc run successfully on the ml-sandbox?

@kevinbazira - I took a look at your isvc spec, tried to deploy it and noticed that the Knative Revisions were failing.

kubectl describe isvc enwiki-articlequality-test-by-kevin -n kserve-test
...
...
...
    Last Transition Time:  2022-02-04T18:55:54Z
    Message:               Revision "enwiki-articlequality-test-by-kevin-predictor-default-v8bkd" failed with message: 0/1 nodes are available: 1 Insufficient cpu..
    Reason:                RevisionFailed
    Severity:              Info
    Status:                False

It turns out we are running up against the CPU limit... I only provisioned 4 CPUs for minikube: https://gitlab.wikimedia.org/accraze/ml-sandbox-cfg/-/blob/main/README.org#L33
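
For the record, one way to see how close the node is to the limit (the grep window size is approximate):

# Compare per-node resource requests against allocatable capacity
kubectl describe nodes | grep -A 8 'Allocated resources'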

I just removed my enwiki-articlequality isvc and deployed yours; now things look good:

root@ml-sandbox:/srv/home/kevinbazira/isvcs/articlequality-test-by-kevin# ./test-aq.sh
enwiki-articlequality-test-by-kevin.kserve-test.wikimedia.org
* Expire in 0 ms for 6 (transfer 0x56049f504fb0)
*   Trying 192.168.49.2...
* TCP_NODELAY set
* Expire in 200 ms for 4 (transfer 0x56049f504fb0)
* Connected to 192.168.49.2 (192.168.49.2) port 30702 (#0)
> POST /v1/models/enwiki-articlequality-test-by-kevin:predict HTTP/1.1
> Host: enwiki-articlequality-test-by-kevin.kserve-test.wikimedia.org
> User-Agent: curl/7.64.0
> Accept: */*
> Content-Length: 18
> Content-Type: application/x-www-form-urlencoded
>
* upload completely sent off: 18 out of 18 bytes
< HTTP/1.1 200 OK
< content-length: 225
< content-type: application/json; charset=UTF-8
< date: Fri, 04 Feb 2022 19:05:24 GMT
< server: istio-envoy
< x-envoy-upstream-service-time: 201
<
* Connection #0 to host 192.168.49.2 left intact
{"predictions": {"prediction": "Stub", "probability": {"B": 0.017382693143129683, "C": 0.011305576384229396, "FA": 0.002078191955918339, "GA": 0.0029161293780774434, "Start": 0.05709479871741571, "Stub": 0.9092226104212294}}}

I will provision more CPU next week (memory looks OK when I do kubectl describe nodes, but I will provision more to be safe). Other than that, things look good! Feel free to experiment with deploying other models. We can probably only have one isvc deployed at a time until this is fixed, but it seems everything else works :)
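
For reference, bumping the allocation should look roughly like this (values are examples; minikube only applies cpu/memory settings at cluster creation, so this wipes and recreates the cluster):

minikube delete
minikube start --cpus=8 --memory=16g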

Thank you for clarifying on the CPU limit, @ACraze.

I removed the isvc that was running and created a new one:

root@ml-sandbox:/srv/home/kevinbazira# kubectl get inferenceservice -n kserve-test
NAME                                  URL                                                                    READY   PREV   LATEST   PREVROLLEDOUTREVISION   LATESTREADYREVISION                                           AGE
enwiki-articlequality-test-by-kevin   http://enwiki-articlequality-test-by-kevin.kserve-test.wikimedia.org   True           100                              enwiki-articlequality-test-by-kevin-predictor-default-5j8kg   4m27s

It returned a prediction successfully:

root@ml-sandbox:/srv/home/kevinbazira/isvcs/articlequality-test-by-kevin# ./test-aq.sh 
enwiki-articlequality-test-by-kevin.kserve-test.wikimedia.org
* Expire in 0 ms for 6 (transfer 0x560087523fb0)
*   Trying 192.168.49.2...
* TCP_NODELAY set
* Expire in 200 ms for 4 (transfer 0x560087523fb0)
* Connected to 192.168.49.2 (192.168.49.2) port 30702 (#0)
> POST /v1/models/enwiki-articlequality-test-by-kevin:predict HTTP/1.1
> Host: enwiki-articlequality-test-by-kevin.kserve-test.wikimedia.org
> User-Agent: curl/7.64.0
> Accept: */*
> Content-Length: 18
> Content-Type: application/x-www-form-urlencoded
> 
* upload completely sent off: 18 out of 18 bytes
< HTTP/1.1 200 OK
< content-length: 225
< content-type: application/json; charset=UTF-8
< date: Mon, 07 Feb 2022 16:27:50 GMT
< server: istio-envoy
< x-envoy-upstream-service-time: 216
< 
* Connection #0 to host 192.168.49.2 left intact
{"predictions": {"prediction": "Stub", "probability": {"B": 0.017382693143129683, "C": 0.011305576384229396, "FA": 0.002078191955918339, "GA": 0.0029161293780774434, "Start": 0.05709479871741571, "Stub": 0.9092226104212294}}}

I confirm that the minio model storage works. Thank you for setting it up @ACraze.

I have cleared out & deleted the old s3 buckets. I have also added documentation for our dev model storage: https://wikitech.wikimedia.org/wiki/User:Accraze/MachineLearning/ML-Sandbox#Model_Storage

I think we can finally close this task out :)

calbon claimed this task.