Move Revert-risk multilingual model from staging to production
Closed, Resolved (Public)

Description

Background

The ML team has decided to limit the experimental namespace to ml-staging to prevent non-production-ready model servers from being deployed to Lift Wing (production). To deploy to the production/API gateway, all requirements in T332711 must be met.

Note that the multilingual model needs more resources:

resources:
  limits:
    cpu: "4"
    memory: 6Gi
  requests:
    cpu: "4"
    memory: 6Gi

Event Timeline

I think that we could create a new generic revertrisk Kubernetes namespace and deploy the Revert Risk model servers / isvcs to it. What do you think?

@klausman do you have time to work with Aiko to push this to production during the next few days?

Yep, can do! I agree that a generic namespace as you mentioned would be fine.

Yes, let's create a new generic revertrisk Kubernetes namespace and deploy the model to it! @klausman, please let me know when you finish adding the new helmfile config (basic usernames, namespace, helmfile private settings, etc.) for the new namespace, so that I can add models to it. :) Also, since the multilingual revertrisk model requires more resources, we'll need to increase the CPU and memory limits, as specified in the description.

The model has been moved to a new bucket:

aikochou@stat1005:~$ s3cmd -c /etc/s3cmd/cfg.d/ml-team.cfg ls s3://wmf-ml-models/revertrisk/multilingual/20230320192952/
2023-05-12 16:51   2649711589  s3://wmf-ml-models/revertrisk/multilingual/20230320192952/model.pkl
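
For a sense of scale, the model file size from the listing above can be put next to the 6Gi memory limit from the description. This is only a rough sketch: the task does not say how the 6Gi figure was chosen, the helper name `quantity_to_bytes` is hypothetical, and it handles only integer quantities with binary suffixes (Ki, Mi, Gi, Ti), not the full Kubernetes quantity syntax. The on-disk size of a pickle is also not the same as its in-memory footprint once loaded.

```python
# Binary suffixes used by Kubernetes resource quantities (powers of 1024).
_BINARY_SUFFIXES = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}


def quantity_to_bytes(quantity):
    """Convert a quantity such as "6Gi" to bytes; plain integers pass through.

    Hypothetical helper: decimal suffixes (M, G) and fractional values
    are deliberately not handled.
    """
    for suffix, factor in _BINARY_SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)


MODEL_BYTES = 2649711589                # size of model.pkl from the s3cmd listing
LIMIT_BYTES = quantity_to_bytes("6Gi")  # memory limit from the task description

print(f"model.pkl:          {MODEL_BYTES / 2**30:.2f} GiB")
print(f"memory limit:       {LIMIT_BYTES / 2**30:.2f} GiB")
print(f"model file / limit: {MODEL_BYTES / LIMIT_BYTES:.0%}")
```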

Change 919842 had a related patch set uploaded (by Klausman; author: Klausman):

[operations/puppet@production] hiera/k8s/ml: add namespace permissions for revertrisk

https://gerrit.wikimedia.org/r/919842

Change 919843 had a related patch set uploaded (by Klausman; author: Tobias Klausmann):

[operations/deployment-charts@master] k8s/ml/prod: Add revertrisk namespace and permissions, plus TLS config

https://gerrit.wikimedia.org/r/919843

Change 919842 merged by Klausman:

[operations/puppet@production] admin_ng/ml-serve: add namespace permissions for revertrisk

https://gerrit.wikimedia.org/r/919842

Change 919843 merged by jenkins-bot:

[operations/deployment-charts@master] admin_ng: add revertrisk namespace and config to ml-serve clusters

https://gerrit.wikimedia.org/r/919843

Namespaces are live in both eqiad and codfw:

# kube_env admin ml-serve-eqiad 
# kubectl get namespaces revertrisk
NAME         STATUS   AGE
revertrisk   Active   2m9s
# kube_env admin ml-serve-codfw
# kubectl get namespaces revertrisk
NAME         STATUS   AGE
revertrisk   Active   6m
#
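
The check above could be scripted by parsing the pasted table output. The following is a minimal sketch, assuming the default `kubectl get namespaces` column layout shown above; `namespace_status` is a hypothetical helper, not part of any WMF tooling. For real automation, `kubectl get ... -o json` would likely be more robust than splitting on whitespace.

```python
def namespace_status(kubectl_output, name):
    """Return the STATUS column for namespace `name`, or None if not listed."""
    lines = [ln for ln in kubectl_output.splitlines() if ln.strip()]
    if not lines:
        return None
    header = lines[0].split()
    name_col = header.index("NAME")
    status_col = header.index("STATUS")
    for line in lines[1:]:
        fields = line.split()
        if len(fields) > max(name_col, status_col) and fields[name_col] == name:
            return fields[status_col]
    return None


# Output copied from the two clusters in this task.
OUTPUTS = {
    "ml-serve-eqiad": "NAME         STATUS   AGE\nrevertrisk   Active   2m9s\n",
    "ml-serve-codfw": "NAME         STATUS   AGE\nrevertrisk   Active   6m\n",
}

for cluster, output in OUTPUTS.items():
    print(cluster, namespace_status(output, "revertrisk"))
```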

Change 920208 had a related patch set uploaded (by Klausman; author: Tobias Klausmann):

[operations/deployment-charts@master] admin_ng: add revertrisk model config to ml-serve clusters

https://gerrit.wikimedia.org/r/920208

Change 920208 merged by Klausman:

[operations/deployment-charts@master] helmfile.d: add revertrisk model config to ml-serve clusters

https://gerrit.wikimedia.org/r/920208

The changes from 920208 have been deployed.

Deployed the fix for the revert risk (language agnostic) isvc name change, and I have also deployed the isvcs to staging (the idea is that we'll be able to test them in there if needed in the future).

Next steps:

  • Test the endpoints and see if everything works correctly
  • Expose the model servers via API-Gateway
  • Update the Lift Wing docs if needed :)

Tested the internal endpoint, and it works correctly:

aikochou@deploy1002:~$ time curl "https://inference.svc.eqiad.wmnet:30443/v1/models/revertrisk-multilingual:predict" -d @input.json -H "Host: revertrisk-multilingual.revertrisk.wikimedia.org" --http1.1
{"lang":"en","rev_id":1096086751,"score":{"prediction":false,"probability":{"true":0.3807116729020513,"false":0.6192883270979487}}}
real	0m5.907s
user	0m0.014s
sys	0m0.000s
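
The same internal request can be sketched in Python with the standard library. This is an illustration under stated assumptions, not the team's tooling: `build_predict_request` and `send` are hypothetical helpers, the URL and `Host` header are copied from the curl command above, and the payload fields (`rev_id`, `lang`) are assumed to match the contents of `input.json`, which is not shown in this task. Actually sending the request also assumes a host that can reach the internal endpoint and trusts its TLS certificate.

```python
import json
import urllib.request

# Copied from the curl command in this task.
INTERNAL_URL = (
    "https://inference.svc.eqiad.wmnet:30443"
    "/v1/models/revertrisk-multilingual:predict"
)
HOST_HEADER = "revertrisk-multilingual.revertrisk.wikimedia.org"


def build_predict_request(rev_id, lang, url=INTERNAL_URL, host=HOST_HEADER):
    """Build (but do not send) a POST request for the model server.

    The payload shape is an assumption about input.json. No Content-Type
    is set, so urllib falls back to application/x-www-form-urlencoded,
    which is also what `curl -d` sends by default.
    """
    body = json.dumps({"rev_id": rev_id, "lang": lang}).encode("utf-8")
    return urllib.request.Request(
        url, data=body, headers={"Host": host}, method="POST"
    )


def send(request, timeout=30):
    """Send the request and return the decoded JSON response."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

From a host with access to the internal endpoint, `send(build_predict_request(1096086751, "en"))` would be the equivalent of the curl call.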

The next step is to configure the API Gateway to enable the public endpoint.

Change 922073 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] helmfile.d: add Lift Wing's revert risk model server to api-gateway

https://gerrit.wikimedia.org/r/922073

Next steps:

  • Wait for https://gerrit.wikimedia.org/r/922073 to be reviewed, merged and deployed to the api gateway.
  • Test the new external endpoint.
  • Add documentation to api.wikimedia.org (with examples, etc.)

Change 922073 merged by Elukey:

[operations/deployment-charts@master] helmfile.d: add Lift Wing's revert risk model server to api-gateway

https://gerrit.wikimedia.org/r/922073

elukey@stat1004:~$ time curl https://api.wikimedia.org/service/lw/inference/v1/models/revertrisk-multilingual:predict -X POST -d '{"rev_id": 123456, "lang": "en"}'
{"lang":"en","rev_id":123456,"score":{"prediction":false,"probability":{"true":0.3058536847305622,"false":0.6941463152694378}}}
real	0m0.471s
user	0m0.018s
sys	0m0.009s
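
A client consuming this endpoint would typically pull the prediction and probabilities out of the JSON body. The sketch below parses the response shown above; `summarize_score` is a hypothetical helper, and reading the `true` entry as the probability that the revision is at risk of being reverted is an assumption based on the model's name, not something stated in this task.

```python
import json

# Response body copied from the curl output in this task.
SAMPLE_RESPONSE = (
    '{"lang":"en","rev_id":123456,"score":{"prediction":false,'
    '"probability":{"true":0.3058536847305622,"false":0.6941463152694378}}}'
)


def summarize_score(raw):
    """Return (rev_id, prediction, probability of the "true" class)."""
    doc = json.loads(raw)
    score = doc["score"]
    return doc["rev_id"], score["prediction"], score["probability"]["true"]


rev_id, prediction, p_true = summarize_score(SAMPLE_RESPONSE)
print(f"rev {rev_id}: prediction={prediction}, p(true)={p_true:.3f}")
```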

Model exposed via API Gateway! Now we have to add docs to the api.wikimedia.org portal and we are done. @achou do you want to do it, or would you prefer that I do it?

Change 922795 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] profile::kafka::mirror: add support for PKI certificate

https://gerrit.wikimedia.org/r/922795