
Deploy multilingual readability model to LiftWing
Closed, Resolved · Public

Description

We developed a multilingual model for readability. This model generates a score for Wikipedia articles capturing (some aspect of) how easy they are to read. For more details see: https://meta.wikimedia.org/wiki/Research:Multilingual_Readability_Research#An_improved_multilingual_model_for_readability

At the moment, the model lives on one of the stat-machines. The goal is to make the model's output available via Lift Wing.

Event Timeline


weekly update:

  • setting up documentation of the trained model.
    • we wrote up the summary of the results in a doc. we will be moving those to the project's meta-page.
    • we will be adding the code to the repo with an example notebook for predicting scores

weekly update:

  • started initial conversation with Diego (and Muniza and Aiko). Since the model relies on the same pipeline as the revert-risk model, the conclusion was that it should be possible to move it to LiftWing, in principle

weekly update

  • ongoing discussions between Mykola and Aiko/Muniza; getting feedback on repo
  • next step: drafting model card

weekly:

  • Mykola created a first draft of the model card. I will review and make suggestions/improvements if needed

Change 931987 had a related patch set uploaded (by AikoChou; author: AikoChou):

[machinelearning/liftwing/inference-services@main] readability: add readability model server

https://gerrit.wikimedia.org/r/931987

Change 931994 had a related patch set uploaded (by AikoChou; author: AikoChou):

[integration/config@master] inference-services: add readability pipelines

https://gerrit.wikimedia.org/r/931994

weekly update:

Change 931994 merged by jenkins-bot:

[integration/config@master] inference-services: add readability pipelines

https://gerrit.wikimedia.org/r/931994

Change 931987 merged by Ilias Sarantopoulos:

[machinelearning/liftwing/inference-services@main] readability: add readability model server

https://gerrit.wikimedia.org/r/931987

Change 934562 had a related patch set uploaded (by AikoChou; author: AikoChou):

[operations/deployment-charts@master] ml-services: add readability isvc to experimental ns

https://gerrit.wikimedia.org/r/934562

Change 934562 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: add readability isvc to experimental ns

https://gerrit.wikimedia.org/r/934562

Change 934582 had a related patch set uploaded (by AikoChou; author: AikoChou):

[operations/deployment-charts@master] ml-services: increase memory resources for readability isvc

https://gerrit.wikimedia.org/r/934582

Change 934582 abandoned by AikoChou:

[operations/deployment-charts@master] ml-services: increase memory resources for readability isvc

Reason:

not needed

https://gerrit.wikimedia.org/r/934582

Change 935068 had a related patch set uploaded (by AikoChou; author: AikoChou):

[machinelearning/liftwing/inference-services@main] readability: add nltk tokenizers download to blubber's builder

https://gerrit.wikimedia.org/r/935068

Change 935068 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] readability: add nltk tokenizers download to blubber's builder

https://gerrit.wikimedia.org/r/935068

Change 935676 had a related patch set uploaded (by AikoChou; author: AikoChou):

[operations/deployment-charts@master] ml-services: update readability docker image

https://gerrit.wikimedia.org/r/935676

Change 935676 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: update readability docker image

https://gerrit.wikimedia.org/r/935676

The readability model has been deployed to LiftWing staging. It is available via an internal endpoint.

Test the model:

aikochou@deploy1002:~$ time curl "https://inference-staging.svc.codfw.wmnet:30443/v1/models/readability:predict" -X POST -d '{"lang": "en", "rev_id": 1161100049}' -H "Host: readability.experimental.wikimedia.org" --http1.1

{"model_name":"readability","model_version":"2","wiki_db":"enwiki","revision_id":1161100049,"output":{"prediction":true,"probabilities":{"true":0.8169194640857833,"false":0.1830805359142167},"fk_score":11.953445079550391}}
real	0m1.361s
user	0m0.014s
sys	0m0.001s
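For reference, the JSON response from the staging test above can be unpacked like this. A minimal sketch using only the standard library, with the sample payload copied verbatim from the curl output (the interpretation of the boolean classes is documented on the model card, not here):

```python
import json

# Sample response from the staging curl test above, copied verbatim.
raw = ('{"model_name":"readability","model_version":"2","wiki_db":"enwiki",'
       '"revision_id":1161100049,"output":{"prediction":true,'
       '"probabilities":{"true":0.8169194640857833,"false":0.1830805359142167},'
       '"fk_score":11.953445079550391}}')

resp = json.loads(raw)
output = resp["output"]

# "prediction" is a boolean class label; "probabilities" maps both
# class labels to scores that sum to 1.
prediction = output["prediction"]
confidence = output["probabilities"]["true" if prediction else "false"]

print(resp["wiki_db"], resp["revision_id"], prediction, round(confidence, 3))
```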

@achou this is great. I tried from the stat1008 and can confirm that this works.
Would it be possible to make it available publicly? I would like to access the endpoint from toolforge for a public API.
Thanks

Hi @MGerlach! We follow https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing#Hosting_stages_for_a_model_server_on_Lift_Wing to graduate a model to production, so we can start working on it if Research has the bandwidth to meet all the criteria (especially ownership, etc.). Let us know :)

@elukey thanks for the pointer.
We created a model card for the model which describes in detail the evaluation and specifies a point of contact (me): https://meta.wikimedia.org/wiki/Machine_learning_models/Proposed/Multilingual_readability_model_card
Do you need any additional information or commitments from our side? I am unsure about some of the other requirements specified in the docs, such as the stability level, code quality, etc. Any guidance on how we can (help to) ensure that we meet those would be very helpful : )

Sure! What we are looking for is an indication of the commitment of the team (requesting the model server) to support the model in the long term. For example, say that this model server leads to bugs, HTTP 500s, etc., and the issue is in the model itself: we (as ML) would need some support from the Research team (even if you are away, etc.) to figure out what's wrong. I had a chat with @leila about this, and the idea was to have a limited list of models to publish in order to be able to better support them. From my point of view we are ready for it, but we'd need a sign-off from your team first :)

MGerlach renamed this task from (stretch) Deploy multilingual readability model to LiftWing to Deploy multilingual readability model to LiftWing. Jul 14 2023, 8:57 AM
MGerlach moved this task from In Progress to FY2023-24-Research-July-September on the Research board.

@elukey thanks for the additional context.
there are ongoing discussions in the Research Team around the level of commitment we can provide and sustain in the long run. Using this specific task as an example, we have started thinking about how to answer this question more generally for other potential models in the future as well. I would like to wait for these discussions to take place over the next week or so, and will then get back here when I have a clearer picture.

@elukey Research accepts accountability for the readability model for a period of 12 months (We will revisit then if we want to continue being accountable. If yes, we renew. If no, we let you know and you can stop the model.). Accountability means that we will assure on our end there is always someone who can pick up the work related to updating the model as prioritized, and that we triage incoming tasks relevant to our team. If this works for you, we're good to go.

We will continue working on our end and with your team to further clarify accountability details. :)

weekly update:

  • coordinating next steps with folks from ML team. my understanding is that work will be picked up by them in the next week (thanks!)

Change 951460 had a related patch set uploaded (by Klausman; author: Klausman):

[operations/puppet@production] profile::k8s::deployment_server: Add config for readability isvc

https://gerrit.wikimedia.org/r/951460

Change 951461 had a related patch set uploaded (by Klausman; author: Klausman):

[operations/deployment-charts@master] helmfile.d: Add config bits to move readability isvc to prod

https://gerrit.wikimedia.org/r/951461

weekly update:

Hi @leila! This is great, but it has a big downside - if we expose readability models to the outside community (via the API Gateway) we'll have clients that (rightfully) will base their jobs/bots/dashboards/etc.. on them, and the 12 months review time worries me a bit. We are doing a lot of work to migrate the community using the ORES API to Lift Wing, experiencing how difficult it is to ask a wide variety of projects to change/adapt their code. I am not saying that the 12 months timeline is not good, but we should also think about deprecation paths if we decide to remove a model from production after some time (so some extra work/help from Research may be needed to move the community to other models etc..). Lemme know your thoughts :)

@MGerlach I added a step in https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing#Hosting_stages_for_a_model_server_on_Lift_Wing, namely:

A basic load test is performed to figure out (indicatively) how many rps the model server can sustain (in staging). The ML team and the model owner set a target SLO for the service.

The load test part is a simple test to figure out, varying the inputs, how the model server behaves in staging (namely, how many rps it can sustain without slowing down, etc.). We don't have any clear docs about it, but I'll try to create some to help out (and I'll publish results in here).

The SLO part is newer and more difficult; we can try to discuss it in a meeting if you prefer, but it is essentially the level of availability that we want to set for the model server. Maybe we can discuss these points during the next research/ml sync?

@elukey thanks. let me know how I can help with the load test. I am happy to discuss the point about SLO and will try to attend the ML/Research meeting later today.

weekly update:

  • met with Luca and Aiko to discuss load testing and SLO T334182#9130664
  • we agreed that it makes sense to address these questions before deploying publicly
  • we will figure out details jointly along the way; ML Team will lead this investigation with input from Research

I conducted some load tests on the readability model in staging using the same input and script as we did for revert-risk (code), as they share the same input parameters. The results can be found here: P52406.

The model's latency profile is similar to the revertrisk-multilingual model's, as both use pre-trained mBERT. Therefore, we may need to set a low value for the autoscaling.knative.dev/target annotation, e.g. 3 or 4.
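The actual load test used the revert-risk script linked above; purely as an illustration of what such a test measures, here is a self-contained sketch. All names (load_test, timed_call) are illustrative, and the time.sleep stub stands in for a real HTTP POST to the :predict endpoint:

```python
import statistics
import time
from concurrent.futures import ThreadPoolExecutor

def load_test(request_fn, total_requests=100, concurrency=4):
    """Fire total_requests calls of request_fn across concurrency
    workers; report latency percentiles and effective rps."""
    def timed_call(_):
        start = time.monotonic()
        request_fn()
        return time.monotonic() - start

    wall_start = time.monotonic()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = sorted(pool.map(timed_call, range(total_requests)))
    wall = time.monotonic() - wall_start

    return {
        "rps": total_requests / wall,
        "p50": statistics.median(latencies),
        "p99": latencies[int(0.99 * (len(latencies) - 1))],
    }

# Stand-in for an HTTP call to the model server; a real test would POST
# {"lang": ..., "rev_id": ...} to the :predict endpoint instead.
stats = load_test(lambda: time.sleep(0.01), total_requests=50, concurrency=5)
print(stats)
```

Varying concurrency while watching p99 is what reveals the sustainable rps that the autoscaling target should be tuned against.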

Great results @achou!

@MGerlach before proceeding, do you have any plan for the model? I mean, are there any known consumers/clients that will use it, or is it just a new endpoint to test? I am asking since the traffic handled per second seems moderate, so we'd need to refine/improve it a little if any client/consumer has higher performance demands. If not we can proceed and move the service to production, let us know!

There are no known external consumers at the moment since it is a new endpoint to test. Currently, the model output will be used to i) provide a metric for the knowledge gaps index and ii) provide the scores for the readability-tool on Toolforge. Especially for the latter, we are planning to advertise more once this is live, so demand could rise in the future (but I assume we could then adapt depending on what happens). Let me know if I should provide more details. Thanks.

Thanks for the info @MGerlach!

In my opinion we are ok to proceed. @klausman @achou (if you agree to proceed as well) - when you have time could you please coordinate and move readability to Prod?

Change 951461 merged by jenkins-bot:

[operations/deployment-charts@master] helmfile.d: Add config bits to move readability isvc to prod

https://gerrit.wikimedia.org/r/951461

Change 951460 merged by Klausman:

[operations/puppet@production] profile::k8s::deployment_server: Add config for readability isvc

https://gerrit.wikimedia.org/r/951460

The service has been moved from the experimental namespace to the readability namespace in staging-codfw, and newly deployed to the same namespace in serve-codfw and serve-eqiad.

Queries work as expected:

$ curl -s "https://inference.svc.codfw.wmnet:30443/v1/models/readability:predict" -H Host:\ readability.readability.wikimedia.org -X POST -d @input-readability-1.json|jq .
{
  "model_name": "readability",
  "model_version": "2",
  "wiki_db": "enwiki",
  "revision_id": "123456",
  "output": {
    "prediction": false,
    "probabilities": {
      "true": 0.4793056845664978,
      "false": 0.5206943154335022
    },
    "fk_score": 8.277095534538086
  }
}
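The fk_score field in the responses above presumably corresponds to the Flesch-Kincaid grade level (an assumption based on the field name; this task does not spell it out). The standard formula is simple to reproduce; the word/sentence/syllable counts would have to come from a tokenizer (the nltk tokenizers patch above suggests NLTK is used server-side, though how the counts are produced there is not specified):

```python
def fk_grade(total_words, total_sentences, total_syllables):
    """Flesch-Kincaid grade level: higher means harder to read.
    Standard published coefficients; callers supply the counts."""
    return (0.39 * (total_words / total_sentences)
            + 11.8 * (total_syllables / total_words)
            - 15.59)

# e.g. 100 words in 5 sentences with 150 syllables:
print(round(fk_grade(100, 5, 150), 2))  # → 9.91
```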

@elukey (to capture our sync conversation results w.r.t. your comment above):

  • We understand that you/we may not be able to just take models out of Production, as models will have users (internal/external). At this moment, if we are asking you to bring an existing model that Research has developed out of Production, I commit that our team works with you to make sure the transition is done gracefully.
  • We discussed the need for a "research/experimental" environment for models that we don't want your team to spend time maintaining in Production but that we still want to expose to specific user groups, in line with your team's ask that we start working with you early on when we develop models, even experimental ones. You had multiple good suggestions and clarifications on this front. I'll continue those conversations with you outside of this task.

Thanks to you and the team for your continued collaboration.

The service is currently deployed to production! It is only available for internal clients.

Next steps:

  • Publish the service via api.wikimedia.org (API Gateway).
  • Add basic documentation to the API Portal.
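Once the API Gateway entry is live, an external client would build its request roughly as follows. This is a sketch only: the public URL is an assumption following the usual Lift Wing path pattern on api.wikimedia.org, and the authoritative endpoint is whatever the API Portal documentation ends up stating. The payload shape matches the curl examples throughout this task:

```python
import json

# Assumed public path, following the common Lift Wing pattern on the
# API Gateway; check the API Portal docs for the authoritative URL.
ENDPOINT = ("https://api.wikimedia.org/service/lw/inference/"
            "v1/models/readability:predict")

def build_request(lang, rev_id):
    """Assemble the URL and JSON body used in this task's curl examples."""
    return ENDPOINT, json.dumps({"lang": lang, "rev_id": rev_id})

url, body = build_request("en", 1161100049)
print(url)
print(body)
```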

elukey reassigned this task from achou to klausman.

@klausman Assigned the task to you since there are a couple of steps that are more related to SRE (lemme know if you don't have time, I'll take care of it).

Change 959684 had a related patch set uploaded (by Klausman; author: Klausman):

[operations/deployment-charts@master] APIGW: add entry for multilingual readability LW isvc

https://gerrit.wikimedia.org/r/959684

Change 959684 merged by jenkins-bot:

[operations/deployment-charts@master] APIGW: add entry for multilingual readability LW isvc

https://gerrit.wikimedia.org/r/959684

Change 961701 had a related patch set uploaded (by Klausman; author: Klausman):

[operations/grafana-grizzly@master] SLOs: Add SLO for Liftwing Readability isvc

https://gerrit.wikimedia.org/r/961701

Change 961701 merged by Klausman:

[operations/grafana-grizzly@master] SLOs: Add SLO for Liftwing Readability isvc

https://gerrit.wikimedia.org/r/961701

@MGerlach we are done! Let us know if we are good or if anything is missing :)

MGerlach closed this task as Resolved. Edited Oct 26 2023, 5:53 PM
MGerlach added a subscriber: AikoChou.

this is great news. I had a look and all seems to be working as expected. thanks to everyone who contributed to making this happen (@elukey @achou @klausman)

from my side the task is resolved. any other improvements or changes will be captured in follow-up tasks.