
Expand language support for Revert Risk Model
Closed, Resolved (Public)

Description

Technically this model is language agnostic, but it does require some statistical values for every wiki in order to calculate quality features:

  • avg article length
  • avg number of media
  • avg number of categories
  • avg number of headings
  • avg number of wikilinks
  • avg number of references

This task involves adding these values for new languages and updating the model binary to accurately reflect the total number of supported wikis.
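As a rough illustration of the per-wiki requirement described above, the sketch below loads one row of statistics per wiki from CSV text and reports which wikis have no values yet. The column names, the `WikiStats` class, the helper functions, the wiki names, and every number in `SAMPLE_CSV` are made-up placeholders; they do not reflect the actual layout of constants.py or of the CSV referenced in this task.

```python
import csv
import io
from dataclasses import dataclass


@dataclass(frozen=True)
class WikiStats:
    """Per-wiki averages used for the quality features (hypothetical container)."""
    article_length: float
    media: float
    categories: float
    headings: float
    wikilinks: float
    references: float


# Placeholder data: the column names and all numbers are invented for this sketch.
SAMPLE_CSV = """wiki_db,article_length,media,categories,headings,wikilinks,references
examplewiki,1000,2,3,4,50,6
otherwiki,800,1,2,3,40,5
"""


def load_stats(text):
    """Parse CSV text into a {wiki_db: WikiStats} mapping."""
    stats = {}
    for row in csv.DictReader(io.StringIO(text)):
        wiki = row.pop("wiki_db")
        stats[wiki] = WikiStats(**{name: float(value) for name, value in row.items()})
    return stats


def missing_wikis(required, stats):
    """Return the wikis in `required` that have no statistics yet."""
    return [wiki for wiki in required if wiki not in stats]


STATS = load_stats(SAMPLE_CSV)
print(missing_wikis(["examplewiki", "newwiki"], STATS))
```

A check like `missing_wikis` is one way to confirm that every wiki the model claims to support actually has values before the binary is rebuilt.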

  • Add quality feature values for 35 new languages to constants.py using this new file created by @diego (See the commit message for more details on how default values were generated for wikis)
  • Update the supported_wikis attribute on the serialized RevertRiskModel
  • Bump model version from 1.0 to 2.0
  • Test the new model binary
  • Pass it on to the ML team (sha512 checksum for the serialized model: P52800)
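The "update the attribute and bump the version" steps above could look roughly like the following sketch. It assumes the serialized model is a pickle whose object exposes `supported_wikis` as a list of strings; the `model_version` attribute name, the `update_model_file` helper, the file names, and the wiki names are hypothetical, and a `SimpleNamespace` stands in for the real RevertRiskModel class.

```python
import pickle
import tempfile
from pathlib import Path
from types import SimpleNamespace


def update_model_file(src, dst, wikis, version):
    """Load a pickled model, refresh its metadata, and write it to a new file."""
    with open(src, "rb") as f:
        model = pickle.load(f)  # only unpickle files from a trusted source
    model.supported_wikis = sorted(set(wikis))
    model.model_version = version  # hypothetical attribute name
    with open(dst, "wb") as f:
        pickle.dump(model, f)
    return model


# Demo with a stand-in object instead of the real model class.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "model_v1.pkl"
    dst = Path(tmp) / "model_v2.pkl"
    stand_in = SimpleNamespace(supported_wikis=["awiki"], model_version="1.0")
    src.write_bytes(pickle.dumps(stand_in))

    update_model_file(src, dst, ["awiki", "bwiki"], "2.0")
    reloaded = pickle.loads(dst.read_bytes())

print(reloaded.supported_wikis, reloaded.model_version)
```

Writing to a new file rather than overwriting the original keeps the previous binary available for comparison while the new one is tested.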

Event Timeline

MunizaA changed the task status from Open to In Progress. (Sep 25 2023, 6:08 PM)
MunizaA created this task.
MunizaA set Due Date to Sep 28 2023, 7:00 AM. (Sep 25 2023, 6:12 PM)
MunizaA moved this task from Backlog to In Progress on the Research board.

@diego, it's possible I'm missing something but while updating these constants I noticed that the values for be-x-old and be-tarask are different and according to the information here, the former redirects to the latter. be-tarask is one of the new wikis that we're adding these values for, so I wanted to check with you if this is okay. Thanks!

Let's use the updated csv for now. Later, let's coordinate with @fkaelin to periodically update these values for both the RRLA and Article quality models.

Change 962049 had a related patch set uploaded (by AikoChou; author: AikoChou):

[operations/deployment-charts@master] ml-services: update revertrisk-language-agnostic model binary

https://gerrit.wikimedia.org/r/962049

MunizaA added a subscriber: achou.

@achou thanks again for the review! I've released v0.4.0 for Knowledge Integrity which should help you pick up these new changes. Going to close this now but please feel free to reopen if something does not look right!

Change 962066 had a related patch set uploaded (by AikoChou; author: AikoChou):

[machinelearning/liftwing/inference-services@main] revertrisk-la: bump knowledge_integrity version to v0.4.0

https://gerrit.wikimedia.org/r/962066

Change 962066 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] revertrisk-la: bump knowledge_integrity version to v0.4.0

https://gerrit.wikimedia.org/r/962066

Change 962049 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: update revertrisk-language-agnostic model binary

https://gerrit.wikimedia.org/r/962049

Thanks @MunizaA for adding the sha512 checksum for the new model binary in the task description. I have verified it and confirmed the integrity of the file that we uploaded to Swift. In the future, we will do this step before uploading to make sure the file wasn't tampered with or miscopied. :)

@achou @MunizaA thanks a lot! One nit: the paste linked in the task's description is editable, so in theory anybody can tamper with it (everything is logged, but it may not be straightforward to check for ML etc.). I would personally suggest adding the sha512 in a separate phab comment, which in theory can only be edited by its author.

Adding the sha512 here:

94ff70cbfac87565b5e04480acd7accd7d0c1f424ebfc2cb858338bc62c309b3745220223489498254010fb772266abd5498e2671eb1cddc138717e663bd3922 *revert_risk_language_agnostic_model_v2.pkl
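For anyone repeating the integrity check discussed above, here is a minimal sketch using Python's standard `hashlib`. It assumes the checksum line follows the `sha512sum` output format (hex digest, a separator, then the file name, with `*` marking binary mode); the helper names are illustrative and not part of any ML team tooling.

```python
import hashlib


def sha512_of_file(path, chunk_size=1 << 20):
    """Compute the SHA-512 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha512()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parse_checksum_line(line):
    """Split a sha512sum-style line into (hex digest, file name)."""
    expected, _, name = line.strip().partition(" ")
    return expected.lower(), name.lstrip(" *")


def verify(path, checksum_line):
    """Return True if the file at `path` matches the digest in `checksum_line`."""
    expected, _ = parse_checksum_line(checksum_line)
    return sha512_of_file(path) == expected
```

Calling `verify("revert_risk_language_agnostic_model_v2.pkl", line)` with the checksum line from the comment above returns `True` only if the local copy hashes to the same digest.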