
Deploy revert-risk multilingual model to production
Closed, Resolved (Public)

Description

In T323613, we tested the multilingual revert-risk model service in ml-sandbox. The next step is to deploy the service to Lift Wing. This task tracks the status of the production deployment.

Event Timeline

Change 861434 had a related patch set uploaded (by AikoChou; author: AikoChou):

[machinelearning/liftwing/inference-services@main] revertrisk: upgrade to multilingual revertrisk model

https://gerrit.wikimedia.org/r/861434

The model has been uploaded to Thanos Swift:

aikochou@stat1004:~$ s3cmd -c /etc/s3cmd/cfg.d/ml-team.cfg ls s3://wmf-ml-models/experimental/revertrisk/20221214175551/
2022-12-14 18:00   2647804395  s3://wmf-ml-models/experimental/revertrisk/20221214175551/model.pkl

The size is around 2.5G.
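
For reference, copying a model binary to that path with the shared ml-team s3cmd config would look roughly like this (a sketch; the exact upload command isn't recorded here):

# Hypothetical example: upload the model dump to the experimental revertrisk
# path in Thanos Swift, reusing the ml-team s3cmd configuration.
s3cmd -c /etc/s3cmd/cfg.d/ml-team.cfg put model.pkl \
    s3://wmf-ml-models/experimental/revertrisk/20221214175551/model.pkl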

Change 861434 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] revertrisk: upgrade to multilingual revertrisk model

https://gerrit.wikimedia.org/r/861434

Change 868442 had a related patch set uploaded (by AikoChou; author: AikoChou):

[operations/deployment-charts@master] ml-services: update revertrisk docker images

https://gerrit.wikimedia.org/r/868442

Change 868442 merged by Elukey:

[operations/deployment-charts@master] ml-services: update revertrisk docker images

https://gerrit.wikimedia.org/r/868442

Current status:

The revertrisk-multilingual model was successfully deployed to ml-staging yesterday!
Production image tag: 2022-12-22-150637-publish

For the moment, the prod image installs KI from https://gitlab.wikimedia.org/elukey/knowledge_integrity, which removed torch from the dependencies, and installs the CPU version of torch 1.13.1 via requirements.txt to avoid nvidia/cuda related dependencies. (see T325349)
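
For illustration, the CPU-only build can be pulled from the PyTorch CPU wheel index roughly like this (a sketch; the exact requirements.txt entry in the image isn't quoted here):

# Hypothetical sketch: install the CPU-only build of torch 1.13.1 so that no
# nvidia/cuda packages are pulled in as transitive dependencies.
pip install torch==1.13.1+cpu --extra-index-url https://download.pytorch.org/whl/cpu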

Next step:

@MunizaA is organizing dependency groups in the knowledge_integrity repository, so there will be a dedicated dependency group for Lift Wing. Once that is ready, we'll rebuild the images and update to the new models (which work with transformers 4.25.1).
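
As an illustration, assuming the repository uses Poetry-style dependency groups and the group ends up being named something like "liftwing" (the name is a guess), the image build could then install only what Lift Wing needs:

# Hypothetical sketch: install the main dependencies plus a Lift Wing specific
# group, skipping heavyweight research/GPU extras.
poetry install --only main,liftwing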

A new model that works with transformers 4.25.1 and torch 1.13.1 has been uploaded:
(This is mainly because of joblib serialisation specifics: the model needs to be reloaded with the new transformers version and the model dump reserialized.)

aikochou@stat1004:~$ s3cmd -c /etc/s3cmd/cfg.d/ml-team.cfg ls s3://wmf-ml-models/experimental/revertrisk/20230201095010/
2023-02-01 09:54   2647806802  s3://wmf-ml-models/experimental/revertrisk/20230201095010/model.pkl
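
The reserialization itself is roughly the following (a sketch, assuming the dump is a plain joblib pickle and the environment already has transformers 4.25.1 and torch 1.13.1 installed):

# Hypothetical sketch: reload the model under the upgraded libraries and
# re-serialize it, so the pickled objects match transformers 4.25.1 / torch 1.13.1.
python3 - <<'EOF'
import joblib
model = joblib.load("model.pkl")      # load with the new library versions
joblib.dump(model, "model_new.pkl")   # write a fresh dump for upload
EOF
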
achou renamed this task from "Deploy MultilingualRevertRiskModel to production" to "Deploy revert-risk multilingual model to production". Feb 20 2023, 8:33 AM
achou reopened this task as In Progress.

Current status:

  • the latest multilingual model was deployed in ml-staging-codfw
  • working on a separate blubberfile and pipeline for the model, so it no longer shares the pipeline with the revert-risk language-agnostic model. (see T329936)

Next steps:

  • deploy the latest multilingual model to production
    • need to adjust the memory limit range for ml-services, because this isvc needs at least 4 cpu & 6Gi memory
  • measure the latency (see the timing sketch after this list)
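
For the latency measurement, something along these lines could be used against the staging endpoint (a sketch reusing the curl invocation from the test further below; the endpoint, Host header and input.json are the same assumptions as in that test):

# Hypothetical sketch: send a few sequential requests and print the total time
# per request, to get a rough idea of the isvc latency.
for i in $(seq 1 10); do
  curl -s -o /dev/null -w "%{time_total}\n" --http1.1 \
    -H "Host: revertrisk-multilingual-predictor-default.experimental.wikimedia.org" \
    -d @input.json \
    "https://inference-staging.svc.codfw.wmnet:30443/v1/models/revertrisk-multilingual:predict"
done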

Once this task is done, along with T321594 we will have two revert-risk isvcs in production: one for the language-agnostic model and the other for the multilingual model.

Change 891252 had a related patch set uploaded (by AikoChou; author: AikoChou):

[operations/deployment-charts@master] ml-services: update revertrisk images and increase limitranges for ml-eqiad/codfw

https://gerrit.wikimedia.org/r/891252

Change 891252 merged by Elukey:

[operations/deployment-charts@master] ml-services: update revertrisk images and increase limitranges for ml-eqiad/codfw

https://gerrit.wikimedia.org/r/891252

A CI pipeline for revertrisk-multilingual has been added; the production images can be found at:
https://docker-registry.wikimedia.org/wikimedia/machinelearning-liftwing-inference-services-revertrisk-multilingual/tags/
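
For example, a specific image can be pulled from that registry like this (a sketch; <TAG> stands for one of the tags listed on the page above):

# Hypothetical example: pull one of the published revertrisk-multilingual images.
docker pull docker-registry.wikimedia.org/wikimedia/machinelearning-liftwing-inference-services-revertrisk-multilingual:<TAG>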

New images (upgraded to Debian Bullseye and Python 3.9) are currently deployed only on ml-staging; in prod there is a complication with limits etc. that will be solved when we upgrade to k8s 1.23!

Test the model after deployment:

aikochou@deploy1002:~/rrr$ time curl "https://inference-staging.svc.codfw.wmnet:30443/v1/models/revertrisk-multilingual:predict" -d @input.json -H "Host: revertrisk-multilingual-predictor-default.experimental.wikimedia.org" --http1.1
{"lang": "en", "rev_id": 1096086751, "score": {"prediction": false, "probability": {"true": 0.3770119460413965, "false": 0.6229880539586035}}}
real	0m6.514s
user	0m0.010s
sys	0m0.004s
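
(The contents of input.json aren't shown above; judging from the fields echoed back in the response, a minimal request body presumably looks like this. This is an assumption, not a copy of the actual file.)

# Hypothetical sketch of the request payload used in the test above.
cat > input.json <<'EOF'
{"lang": "en", "rev_id": 1096086751}
EOF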

Models deployed to production as well, all good!