
Request to update Readability model on Lift Wing
Closed, Resolved · Public · 3 Estimated Story Points

Description

We have substantially improved the readability model by switching from a classification to a ranking approach (see TRank vs. LMC in the paper's results). Currently, the model on Lift Wing still uses the older LMC (classification) model. If possible, we would like to replace it with the new TRank model.

See below for more details following the instructions for requesting an update.

Which model needs updating?

We would like to update the Readability model: https://api.wikimedia.org/wiki/Lift_Wing_API/Reference/Readability_score_object and https://api.wikimedia.org/wiki/Lift_Wing_API/Reference/Get_readability_prediction

What changes have been made to the model? (e.g., updated training data, different approach, new features, etc.)

The model was trained using a different approach (ranking instead of classification). The FK-score approximation was changed from a nonlinear regression model to a simple linear transformation of the ranking score.
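For illustration, a minimal sketch of what such a linear transformation could look like; the slope and intercept below are made-up placeholders, not the actual fitted parameters:

import dataclasses

# Hypothetical sketch: approximate the Flesch-Kincaid score as a linear
# function of the ranking model's raw score. The coefficients are
# placeholders; the real values come from fitting against FK grade levels.
FK_SLOPE = -3.2      # placeholder coefficient
FK_INTERCEPT = 7.5   # placeholder intercept

def fk_score_proxy(score: float) -> float:
    """Approximate the Flesch-Kincaid grade level from the ranking score."""
    return FK_SLOPE * score + FK_INTERCEPT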

Do any dependent repositories/packages need updates? (e.g., knowledge integrity, sklearn, pytorch, etc.) Please provide the MR/version for reference.

The readability-liftwing package should be updated according to this MR: https://gitlab.wikimedia.org/trokhymovych/readability-liftwing/-/merge_requests/6

Is there a new model binary? What is its version?

Yes. Its version is 4 (the same as mentioned in the MR). The new model binary can be found here: https://drive.google.com/file/d/1wsmx5nw2_EtrRlA2RfDXBEiO-SivPYcU/view?usp=sharing

Does the input/output schema need any changes?

The output schema needs to be changed. The new schema is represented as

from dataclasses import dataclass

@dataclass
class ReadabilityResult:
    score: float
    fk_score_proxy: float

where score stands for the readability score provided by the ranking model (it can be used to compare articles with one another), and fk_score_proxy is a Flesch–Kincaid score approximation. The schema differs because the model is now formulated as a ranking task (assigning only a score) instead of a binary prediction task (True/False with a probability for each class).
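As a usage illustration, assuming the ReadabilityResult schema above, the ranking scores of two articles can be compared directly (the values below are made up, and the direction of the scale is defined by the model):

from dataclasses import dataclass

@dataclass
class ReadabilityResult:
    score: float
    fk_score_proxy: float

# Hypothetical results for two articles; the values are placeholders.
article_a = ReadabilityResult(score=-0.29, fk_score_proxy=8.63)
article_b = ReadabilityResult(score=0.41, fk_score_proxy=6.10)

# Ranking scores are directly comparable between articles,
# unlike the old True/False classification output.
if article_a.score > article_b.score:
    print("article_a ranks higher on the readability scale")
else:
    print("article_b ranks higher on the readability scale")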

Does the preprocessing stage require changes?
Does the prediction stage require changes?

The preprocessing and prediction stages have minor changes (due to the model change), which are reflected in the corresponding MR.

Checklist:

Additional note: the set of supported languages changed slightly because we use a different base model (see the model card).

'af', 'sq', 'am', 'ar', 'hy', 'as', 'az', 'eu', 'be', 'bn', 'bs', 'br', 'bg', 'my', 'ca', 'zh-yue', 'zh', 'zh-classical', 'hr', 'cs', 'da', 'nl', 'en', 'eo', 'et', 'tl', 'fi', 'fr', 'gl', 'ka', 'de', 'el', 'gu', 'ha', 'he', 'hi', 'hu', 'is', 'id', 'ga', 'it', 'ja', 'jv', 'kn', 'kk', 'km', 'ko', 'ku', 'ky', 'lo', 'la', 'lv', 'lt', 'mk', 'mg', 'ms', 'ml', 'mr', 'mn', 'ne', 'no', 'or', 'om', 'ps', 'fa', 'pl', 'pt', 'pa', 'ro', 'ru', 'sa', 'gd', 'sr', 'sd', 'si', 'sk', 'sl', 'so', 'es', 'su', 'sw', 'sv', 'ta', 'te', 'th', 'tr', 'uk', 'ur', 'ug', 'uz', 'vi', 'cy', 'fy', 'xh', 'yi', 'simple'
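For illustration only, a request validator might check the requested language against this set before calling the model; this is a sketch under assumed behaviour, not the actual Lift Wing service's handling of unsupported languages:

# Sketch: validate the "lang" parameter against the supported set before
# invoking the model. The error handling here is an assumption.
SUPPORTED_LANGUAGES = {
    "af", "sq", "am", "ar", "hy", "as", "az", "eu", "be", "bn",
    # ... remaining codes from the list above ...
    "cy", "fy", "xh", "yi", "simple",
}

def validate_lang(lang: str) -> None:
    if lang not in SUPPORTED_LANGUAGES:
        raise ValueError(f"Language '{lang}' is not supported by the readability model")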

Event Timeline

calbon set the point value for this task to 3.
calbon moved this task from Unsorted to Ready To Go on the Machine-Learning-Team board.

@Trokhymovych I'm starting to work on this. Is the prediction time similar to the previous model's, or does it take more/less time? I just wanted to get some numbers on how the model performs with expected inputs. Thanks!

Change #1059032 had a related patch set uploaded (by AikoChou; author: AikoChou):

[machinelearning/liftwing/inference-services@main] readability: updates according to the new TRank model

https://gerrit.wikimedia.org/r/1059032

Hi @achou! Thanks for working on this. Prediction time should be similar to the previous model. I have checked locally, and it is 2.5s per page on average (4s at the 95th percentile). However, the model should require more RAM.
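For reference, a rough sketch of how these per-page numbers could be measured locally (the endpoint URL and rev_ids are placeholders, not the actual test setup):

import time
import requests

# Placeholder values: substitute a real endpoint and a sample of rev_ids.
ENDPOINT = "http://localhost:8080/v1/models/readability:predict"
REV_IDS = [123456, 234567, 345678]

latencies = []
for rev_id in REV_IDS:
    start = time.monotonic()
    requests.post(ENDPOINT, json={"rev_id": rev_id, "lang": "en"}, timeout=60)
    latencies.append(time.monotonic() - start)

latencies.sort()
mean = sum(latencies) / len(latencies)
p95 = latencies[int(0.95 * (len(latencies) - 1))]  # rough 95th percentile
print(f"mean: {mean:.2f}s, p95: {p95:.2f}s")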

Also, I have already merged the corresponding MR.

Change #1059032 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] readability: updates according to the new TRank model

https://gerrit.wikimedia.org/r/1059032

Change #1060437 had a related patch set uploaded (by AikoChou; author: AikoChou):

[operations/deployment-charts@master] ml-services: update readability model

https://gerrit.wikimedia.org/r/1060437

Change #1060437 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: update readability model

https://gerrit.wikimedia.org/r/1060437

Change #1061948 had a related patch set uploaded (by Kevin Bazira; author: Kevin Bazira):

[machinelearning/liftwing/inference-services@main] Makefile: update readability model path for local-run

https://gerrit.wikimedia.org/r/1061948

Change #1061948 merged by Kevin Bazira:

[machinelearning/liftwing/inference-services@main] Makefile: update readability model path for local-run

https://gerrit.wikimedia.org/r/1061948

Change #1062680 had a related patch set uploaded (by AikoChou; author: AikoChou):

[machinelearning/liftwing/inference-services@main] locust: entry for readability model

https://gerrit.wikimedia.org/r/1062680

Change #1062680 merged by Ilias Sarantopoulos:

[machinelearning/liftwing/inference-services@main] locust: entry for readability model

https://gerrit.wikimedia.org/r/1062680
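For context, a minimal sketch of what a Locust entry for this model might look like; the host, endpoint path, and payload below are assumptions for illustration, not the contents of the merged patch:

from locust import HttpUser, task, between

class ReadabilityUser(HttpUser):
    # Placeholder host; load tests target the staging isvc in practice.
    host = "http://localhost:8080"
    wait_time = between(1, 2)

    @task
    def predict(self):
        # Payload values are placeholders for illustration.
        self.client.post(
            "/v1/models/readability:predict",
            json={"rev_id": 123456, "lang": "en"},
        )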

We've deployed the model to ml-staging. Initially, the service was crashlooping due to out-of-memory errors. The issue was resolved after increasing the memory to 4Gi (the patch). Mykola mentioned that the new model would require more RAM.

However, we observed higher latency compared to the old model during load tests. The old model averages 4.5s per request at 0.27 req/s, while the new model averages 8.7s per request at 0.17 req/s. Here are the load test results, performed on this input data.

On the latency dashboard, the old model's numbers vary a lot between pages, whereas the new model takes a similar prediction time for every page.

@Trokhymovych, could you run the same test you reported in T369712#10038210 with the previous model? Is the prediction time similar to the result of the new model?

Hi @achou, thanks so much for your work! I’ve run the tests and can confirm the scale of your observations. The old model averages 1.07s per item, while the new model averages 2.52s per item on the same data, meaning the new model is indeed about twice as slow. (Absolute numbers may vary depending on CPU and connection speed.) My initial assumption that their performance was "similar" was incorrect. I hope this information is helpful.

@Trokhymovych thanks for clarifying it. That's super helpful! :)

Change #1064391 had a related patch set uploaded (by AikoChou; author: AikoChou):

[operations/deployment-charts@master] ml-services: bump memory for readability isvc in prod

https://gerrit.wikimedia.org/r/1064391

Change #1064391 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: bump memory for readability isvc in prod

https://gerrit.wikimedia.org/r/1064391

The new model has been deployed to production.

$ curl https://api.wikimedia.org/service/lw/inference/v1/models/readability:predict -X POST -d '{"rev_id": 123456, "lang": "en"}' -H "Content-type: application/json" | jq '.'
{
  "model_name": "readability",
  "model_version": "4",
  "wiki_db": "enwiki",
  "revision_id": 123456,
  "output": {
    "score": -0.29161882400512695,
    "fk_score_proxy": 8.63213539862886
  }
}

Also, the API docs have been updated: https://api.wikimedia.org/wiki/Lift_Wing_API/Reference/Readability_score_object and https://api.wikimedia.org/wiki/Lift_Wing_API/Reference/Get_readability_prediction

@Trokhymovych Let me know if you have any questions. :)