
Deploy revert-risk-model to production
Open, Needs Triage · Public

Description

The Research team, in collaboration with the ML team, is working on a language-agnostic model to predict reverts on Wikipedia. See T314385.

We'd like to deploy early versions of the model to LiftWing's experimental namespace. This task tracks the status of the production deployment.

Event Timeline

Change 849478 had a related patch set uploaded (by AikoChou; author: AikoChou):

[machinelearning/liftwing/inference-services@main] revertrisk: add revertrisk model server and pipeline

https://gerrit.wikimedia.org/r/849478

Change 849480 had a related patch set uploaded (by AikoChou; author: AikoChou):

[integration/config@master] inference-services: add revertrisk pipelines

https://gerrit.wikimedia.org/r/849480

Change 849480 merged by jenkins-bot:

[integration/config@master] inference-services: add revertrisk pipelines

https://gerrit.wikimedia.org/r/849480

Change 849478 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] revertrisk: add revertrisk model server and pipeline

https://gerrit.wikimedia.org/r/849478

The model has been uploaded to Thanos Swift:

aikochou@stat1004:~$ s3cmd -c /etc/s3cmd/cfg.d/ml-team.cfg ls s3://wmf-ml-models/experimental/revertrisk/20221026144108/
2022-10-26 14:44       499465  s3://wmf-ml-models/experimental/revertrisk/20221026144108/model.pkl

The model file was downloaded from knowledge_integrity/pretrained_models.
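As a side note, the .pkl suffix in the listing above suggests a standard pickle artifact, so loading it back on the model server side is a plain pickle.load(). A minimal sketch, round-tripping a stand-in object since the real model file isn't available here:

```python
import os
import pickle
import tempfile

# Stand-in for the real model object; names here are illustrative only.
dummy_model = {"name": "revertrisk", "snapshot": "20221026144108"}

# Write and read back a model.pkl the same way a storage initializer
# would place it on disk for the model server to load.
path = os.path.join(tempfile.mkdtemp(), "model.pkl")
with open(path, "wb") as f:
    pickle.dump(dummy_model, f)

with open(path, "rb") as f:
    model = pickle.load(f)

print(model)
```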

Change 849627 had a related patch set uploaded (by AikoChou; author: AikoChou):

[operations/deployment-charts@master] ml-services: add revert-risk-model isvc

https://gerrit.wikimedia.org/r/849627

Change 849627 merged by Elukey:

[operations/deployment-charts@master] ml-services: add revert-risk-model isvc

https://gerrit.wikimedia.org/r/849627

Change 850408 had a related patch set uploaded (by AikoChou; author: AikoChou):

[machinelearning/liftwing/inference-services@main] revertrisk: allow access to MediaWiki API from internal endpoint

https://gerrit.wikimedia.org/r/850408

Change 850408 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] revertrisk: allow access to MediaWiki API from internal endpoint

https://gerrit.wikimedia.org/r/850408

Change 850452 had a related patch set uploaded (by AikoChou; author: AikoChou):

[operations/deployment-charts@master] ml-services: update revert-risk's docker image

https://gerrit.wikimedia.org/r/850452

Change 850452 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: update revert-risk's docker image

https://gerrit.wikimedia.org/r/850452

The revert-risk model has been deployed to production today. :)

Yeah! Thanks @achou! Please, can you write an example of how to hit the endpoint here?

@diego I added a section https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing#Usage in our documentation about how to access inference services internally.

You can adapt the code example in the doc using the following values:

  • url: https://inference.discovery.wmnet:30443/v1/models/revert-risk-model:predict
  • host: revert-risk-model.experimental.wikimedia.org
  • input data: {"lang": "en", "rev_id": 1083325118} (example)

and it should work. Let me know if there is any problem. :)
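For reference, the values above can be put together into a minimal Python sketch using only the standard library; the actual send is left commented out, since the endpoint is only reachable from inside the cluster:

```python
import json
import urllib.request

# Endpoint and Host header from the comment above: the Host header is what
# routes the request to the revert-risk-model inference service.
URL = "https://inference.discovery.wmnet:30443/v1/models/revert-risk-model:predict"
HOST = "revert-risk-model.experimental.wikimedia.org"

def build_request(lang: str, rev_id: int) -> urllib.request.Request:
    """Build the prediction request; the caller sends it from a host
    with access to the internal endpoint."""
    payload = json.dumps({"lang": lang, "rev_id": rev_id}).encode("utf-8")
    return urllib.request.Request(
        URL,
        data=payload,
        headers={"Host": HOST, "Content-Type": "application/json"},
        method="POST",
    )

req = build_request("en", 1083325118)
# To actually send it (internal network only):
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp))
```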

Some load test results:

  • 1 connection
aikochou@deploy1002:~/rrr$ wrk -c 1 -t 1 --timeout 5s -s inference.lua https://inference.discovery.wmnet:30443/v1/models/revert-risk-model:predict --latency
Running 10s test @ https://inference.discovery.wmnet:30443/v1/models/revert-risk-model:predict
  1 threads and 1 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   123.76ms   29.09ms 339.32ms   93.90%
    Req/Sec     8.89      2.20    10.00     79.01%
  Latency Distribution
     50%  117.50ms
     75%  122.46ms
     90%  141.52ms
     99%  339.32ms
  81 requests in 10.01s, 21.75KB read
Requests/sec:      8.09
Transfer/sec:      2.17KB
  • 3 connections
aikochou@deploy1002:~/rrr$ wrk -c 3 -t 3 --timeout 5s -s inference.lua https://inference.discovery.wmnet:30443/v1/models/revert-risk-model:predict --latency
Running 10s test @ https://inference.discovery.wmnet:30443/v1/models/revert-risk-model:predict
  3 threads and 3 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   129.22ms   18.57ms 316.34ms   85.34%
    Req/Sec     8.54      2.30    10.00     71.43%
  Latency Distribution
     50%  127.24ms
     75%  135.52ms
     90%  144.63ms
     99%  194.68ms
  231 requests in 10.02s, 62.04KB read
Requests/sec:     23.06
Transfer/sec:      6.19KB
  • 5 connections
aikochou@deploy1002:~/rrr$ wrk -c 5 -t 5 --timeout 5s -s inference.lua https://inference.discovery.wmnet:30443/v1/models/revert-risk-model:predict --latency
Running 10s test @ https://inference.discovery.wmnet:30443/v1/models/revert-risk-model:predict
  5 threads and 5 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   139.48ms   24.52ms 410.64ms   87.53%
    Req/Sec     8.02      2.51    10.00     61.56%
  Latency Distribution
     50%  135.93ms
     75%  148.94ms
     90%  161.25ms
     99%  212.92ms
  359 requests in 10.02s, 96.41KB read
Requests/sec:     35.83
Transfer/sec:      9.62KB
  • 10 connections
aikochou@deploy1002:~/rrr$ wrk -c 10 -t 10 --timeout 5s -s inference.lua https://inference.discovery.wmnet:30443/v1/models/revert-risk-model:predict --latency
Running 10s test @ https://inference.discovery.wmnet:30443/v1/models/revert-risk-model:predict
  10 threads and 10 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   173.77ms  156.91ms   1.38s    93.56%
    Req/Sec     8.02      2.67    10.00     63.12%
  Latency Distribution
     50%  134.02ms
     75%  150.65ms
     90%  190.85ms
     99%  993.63ms
  526 requests in 10.02s, 141.26KB read
  Socket errors: connect 2, read 0, write 0, timeout 0
Requests/sec:     52.50
Transfer/sec:     14.10KB
  • 20 connections
aikochou@deploy1002:~/rrr$ wrk -c 20 -t 20 --timeout 5s -s inference.lua https://inference.discovery.wmnet:30443/v1/models/revert-risk-model:predict --latency
Running 10s test @ https://inference.discovery.wmnet:30443/v1/models/revert-risk-model:predict
  20 threads and 20 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   157.03ms   25.39ms 330.88ms   82.81%
    Req/Sec     7.18      2.59    10.00     49.51%
  Latency Distribution
     50%  152.70ms
     75%  165.81ms
     90%  181.38ms
     99%  257.99ms
  1012 requests in 10.02s, 271.78KB read
  Socket errors: connect 4, read 0, write 0, timeout 0
Requests/sec:    101.01
Transfer/sec:     27.13KB

Overall this looks very nice compared to the revscoring models! The average latency stayed in the 100–200 ms range, and the RPS scaled as we increased the number of connections.

Grafana metrics:

Thanks a lot for sharing these results here, @achou! I do see that we get more socket connect errors as the number of connections increases. Is that something we should be concerned about? The wrk docs don't seem to say anything about these errors, but some issues on the repo mention that connect errors in particular can occur when wrk runs out of file descriptors. Those reports involve opening hundreds of connections, though, so I'm not sure that's the case here.

@MunizaA The socket connect errors seem to be caused by the per-process open-file limit on Linux, according to the article. They don't come from the model server, so I think we don't need to worry too much. Also note that deploy1002 is the deployment server not only for ML services but also for MediaWiki and all Wikimedia Kubernetes services, so wrk may hit the limit more easily there.
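For what it's worth, the ceiling in question is easy to inspect; this is just an illustration of where the limit comes from, not a measurement from deploy1002:

```python
import resource

# Each open wrk connection consumes a file descriptor. Once the soft limit
# for the process is exhausted, further connect() calls fail, which wrk
# reports as "Socket errors: connect N".
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"open-file limit: soft={soft}, hard={hard}")
```

The soft limit is what bites first; it can be raised up to the hard limit with `ulimit -n` before running the benchmark.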

The revertrisk model has been deployed to production. I'm going to mark this as RESOLVED. We'll open new tasks for future model deployments as needed.

To summarize the steps:

  1. Add the revertrisk model server and pipeline config to the inference-services repository
  2. Add the new pipeline to the integration/config repository
  3. Upload the model to Thanos Swift
  4. Add the revertrisk inference service to the deployment-charts repository and wait for an ML SRE to +2 and merge
  5. Deploy to staging (ml-staging-codfw) and test the model
  6. Deploy to production (ml-serve-eqiad & ml-serve-codfw) and test the model (simple curl and/or wrk load test)