
FY2024-25 Q4 Goal: Productionize tone check model
Open, In Progress, Needs Triage · Public

Description

As a machine learning engineer,
I want to deploy to production a model that detects peacock language by parsing the content of a paragraph and highlighting the problematic words.
I also want a reproducible way to generate and improve the model, making iteration easier and the process less error-prone by version-controlling the code and reviewing it through code reviews.

To achieve this, I’ll build an Airflow pipeline and deploy it on the ML Airflow instance. This pipeline will cover three basic steps:
• Dataset generation
• Model training
• Model evaluation
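The three steps above can be sketched as an Airflow DAG. This is a minimal illustration using the TaskFlow API; the task names, schedule, and paths are hypothetical, not the actual DAG deployed on the ML Airflow instance:

```python
# Sketch of the three-step pipeline as an Airflow DAG (hypothetical task
# names, schedule, and paths).
from datetime import datetime

from airflow.decorators import dag, task


@dag(schedule=None, start_date=datetime(2025, 4, 1), catchup=False)
def tone_check_pipeline():
    @task
    def generate_dataset() -> str:
        # Collect and clean the training data; return where it was written.
        return "hdfs://example/tone_check/dataset"

    @task
    def train_model(dataset_path: str) -> str:
        # Train the model on the generated dataset; return the model path.
        return "hdfs://example/tone_check/model"

    @task
    def evaluate_model(model_path: str) -> None:
        # Compute evaluation metrics for the trained model.
        ...

    evaluate_model(train_model(generate_dataset()))


tone_check_pipeline()
```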

If not fully automated, the process of re-deploying a re-trained model from this pipeline should still be easy and well-documented.
Last but not least, an SLO dashboard using pyrra should be created for this service using the SLO template instructions.

Related Objects

Status · Assigned
Open · None
Open · None
Open · None
Open · ppelberg
Open · Quiddity
Open · MNeisler
In Progress · Sucheta-Salgaonkar-WMF
Resolved · achou
Resolved · achou
Resolved · None
Resolved · gkyziridis
Resolved · achou
Resolved · Sucheta-Salgaonkar-WMF
Resolved · gkyziridis
Open · gkyziridis
Resolved · brouberol
Resolved · brouberol
Resolved · gkyziridis
Resolved · kevinbazira
Resolved · kevinbazira
Resolved · kevinbazira
Resolved · kevinbazira
Resolved · brouberol
Resolved · kevinbazira
Resolved · achou
Resolved · achou

Event Timeline

isarantopoulos changed the task status from In Progress to Open. Apr 15 2025, 2:53 PM
isarantopoulos changed the task status from Open to In Progress.

We are currently building an Airflow pipeline for the data generation step, as well as collecting evaluation data for human evaluation, so we can decide which languages to ship first.

Update:

I managed to run a proof of concept DAG to collect peacock data in the development instance from a stat machine (stat1011.eqiad.wmnet).

airflow-dev.png (1×2 px, 329 KB)

Required code for running the DAG:

After completing the setup, I ran the following command to initialize an Airflow dev instance on the stat machine:

sudo -u analytics-privatedata ./run_dev_instance.sh -m /tmp/aikochou_airflow -p 8796 ml

Note: you may need to change the user home folder (-m) where Airflow logs are saved, and the port number (-p) if another user is already using it.

Issues I ran into and the fixes:

  1. The dev instance failed to initialize
    • To fix this, I cleaned up the stale airflow-webserver and gunicorn processes on the stat machine
  2. Error: artifacts not found in hdfs://cache..
    • This occurs when creating a DAG for the first time. To fix it, we need to manually change the conda_env variable in Admin/Variables to the URL specified in our config/artifacts.yaml, as shown below:
      airflow-variable.png (1×2 px, 219 KB)
  3. Error: default_args is not dict type
    • I still don’t know why this error happens. A workaround is to cast default_args to dict in our DAG file (here).
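The workaround for issue 3 is just a cast before the DAG is constructed. A minimal sketch, where `MappingProxyType` is only a stand-in for whatever non-dict mapping the shared config produced:

```python
from types import MappingProxyType

# Stand-in for default_args as loaded from shared config: a mapping,
# but not a plain dict (hypothetical; the real source type is unknown).
raw_default_args = MappingProxyType({"owner": "ml-team", "retries": 1})

# Workaround: cast to a plain dict before passing it as
# DAG(default_args=...), since the validation rejects non-dict mappings.
default_args = dict(raw_default_args)
```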

Update:

  • Working on eval data collection and processing for en, ar, es, ja, pt, fr, id, pl, zh, cs, he, and tr wikis (languages prioritized by Editing Team in T388471)
    • Using only NPOV policy and jargon signals from edit summaries, as templates provided by the community in T389445 are mostly article-level templates
    • Cleaning and filtering data using metadata (e.g. revision text byte differences, number of consecutive reverted revisions)
    • Adding revert_timestamp column to enable filtering of recent data later
  • Next step: cleaning data based on content by joining with the content_diff dataset to remove revisions that only have template changes, reference additions, and other non-plaintext modifications
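The metadata-based cleaning above can be sketched in plain Python over example records; the actual pipeline applies these criteria in Spark, and the field names and thresholds here are hypothetical:

```python
# Illustrative metadata filter for candidate revisions. Field names
# (byte_diff, consecutive_reverts) and thresholds are assumed, not the
# pipeline's real schema.
MIN_BYTE_DIFF = 50           # drop trivial edits
MAX_CONSECUTIVE_REVERTS = 1  # drop revert wars

revisions = [
    {"rev_id": 1, "byte_diff": 120, "consecutive_reverts": 1,
     "revert_timestamp": "2025-04-01"},
    {"rev_id": 2, "byte_diff": 10, "consecutive_reverts": 1,
     "revert_timestamp": "2025-04-02"},
    {"rev_id": 3, "byte_diff": 300, "consecutive_reverts": 4,
     "revert_timestamp": "2025-04-03"},
]

kept = [
    r for r in revisions
    if abs(r["byte_diff"]) >= MIN_BYTE_DIFF
    and r["consecutive_reverts"] <= MAX_CONSECUTIVE_REVERTS
]
# rev 1 survives; rev 2 (too small) and rev 3 (revert war) are dropped,
# and revert_timestamp is retained for later recency filtering
```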

Update:

  • Working on negative-sample collection for French, Spanish, Japanese, Portuguese, and English for HIL evaluation.
    • Get examples of good edits on the same articles as the bad edits: select experienced users' edits (e.g. event_user_revision_count > N)
    • Control for edit size (e.g. only single-paragraph edits) so that it doesn't bias the similarity measurement: calculate the number of hunks in the parent_revision_diff and select only diffs with a single hunk
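The two selection criteria above can be sketched in plain Python. The threshold N and the record fields are illustrative; counting hunks relies on the fact that each hunk in a unified diff starts with an "@@ ... @@" header line:

```python
# Hypothetical sketch of the negative-sample filter: keep edits by
# experienced users whose diff contains exactly one hunk.
MIN_USER_REVISIONS = 500  # the "experienced" threshold N (assumed value)

def count_hunks(unified_diff: str) -> int:
    # Each hunk in a unified diff begins with an "@@ ... @@" header.
    return sum(1 for line in unified_diff.splitlines()
               if line.startswith("@@"))

edits = [
    {"rev_id": 10, "event_user_revision_count": 1200,
     "parent_revision_diff": "@@ -1,3 +1,3 @@\n-old\n+new"},
    {"rev_id": 11, "event_user_revision_count": 40,
     "parent_revision_diff": "@@ -1,2 +1,2 @@\n-a\n+b"},
    {"rev_id": 12, "event_user_revision_count": 900,
     "parent_revision_diff": "@@ -1 +1 @@\n-x\n+y\n@@ -9 +9 @@\n-p\n+q"},
]

negatives = [
    e for e in edits
    if e["event_user_revision_count"] > MIN_USER_REVISIONS
    and count_hunks(e["parent_revision_diff"]) == 1
]
# only rev 10 qualifies: experienced author, single-hunk diff
```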
Update:

Introducing ToneCheck
  • Request schema changed from peacock to tone:
    • "instances": [{"lang": "en", "check_type": "tone", "original_text": "text", "modified_text": "text with modifications"}]
  • The model is deployed using the correct drivers, which leaves us the option of using GPUs if needed in the future.
  • The ToneCheck model with the latest updates (schema/drivers) is deployed in production.
  • The ToneCheck model is also deployed to the API Gateway.
  • The SLO/ToneCheck page is in progress.
  • SREs will configure and set up the pyrra dashboard for ToneCheck.

Example of querying the ToneCheck model through the API Gateway:

## Request
$ curl -i https://api.wikimedia.org/service/lw/inference/v1/models/edit-check:predict \
-X POST \
-d '{ "instances": [{"lang": "en", "check_type": "tone", "original_text": "text", "modified_text": "this is a great example of work"}]}'

## Response
{"predictions":[{"status_code":200,"model_name":"edit-check","model_version":"v1","check_type":"tone","language":"en","prediction":true,"probability":0.882,"details":{}}]}

The necessary changes in VisualEditor for the above model have already been communicated to the team, and the patch is a work in progress.

Update:

  • HIL model evaluation for French, Spanish, Japanese, Portuguese, and English is in progress
  • Restructured the data generation notebook into a clean version for reusability

Next step:

  • Working towards HIL for the remaining 7 languages, by:
    • Doing probability distribution mapping
    • Preparing for a manual review of TRUE, high-probability samples
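One common form of the probability distribution mapping mentioned above is empirical quantile mapping: transform each score from a target language to the reference language's score at the same quantile, so the two score distributions become comparable. This is a sketch of one plausible approach, not necessarily the exact method used here:

```python
# Empirical quantile mapping: map each score to the reference-language
# score at the same quantile. Illustrative only; the task's actual
# mapping method may differ.
from bisect import bisect_right

def quantile_map(scores, reference):
    ref = sorted(reference)
    src = sorted(scores)
    n_src, n_ref = len(src), len(ref)
    out = []
    for s in scores:
        # mid-rank empirical quantile of s among the source scores
        q = (bisect_right(src, s) - 0.5) / n_src
        # reference score at that quantile
        idx = min(int(q * n_ref), n_ref - 1)
        out.append(ref[idx])
    return out

# A target language whose scores run low gets pulled up toward the
# reference distribution.
mapped = quantile_map([0.1, 0.2, 0.3], [0.5, 0.7, 0.9])
```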
isarantopoulos renamed this task from Q4 24-25 Goal: Productionize peacock detection model to Q4 24-25 Goal: Productionize tone check model. Jun 10 2025, 2:21 PM

Update:

  • Collected recent newcomers' data, ran the model, and plotted probability distributions for 13 languages (notebook)
  • Updated the proof-of-concept tone_check DAG with the updated data generation script. (I will write detailed steps in T396495)

Next:

  • After discussing with Diego today, we'll collect data on nlwiki and dewiki, so we can interpret the data better by comparing probability distributions with the previous model's precision:

peacock_precision.png (489×500 px, 69 KB)

Update:

  • Working with Diego on a methodology for analyzing peacock language detection (tone check) models in languages without enough evaluation data. The methodology is nearly complete; only the notebook needs to be finalized.
  • Working on converting the training notebooks that collect and process template-based data into Airflow jobs. Ongoing tasks include:
    • Migrating the source table from MediaWiki wikitext history to MediaWiki content history v1 (Iceberg) for better future support.
    • Rewriting the Spark SQL code in the notebooks to use PySpark functions, similar to my work with revert-based data. This makes the code more programmatic and composable (easier to parameterize, better for handling complex logic, and more maintainable).
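The SQL-to-PySpark rewrite described above can be illustrated side by side. Table and column names here are hypothetical, not the pipeline's real schema:

```python
# Illustration of rewriting a Spark SQL string as composable PySpark
# functions. Table and column names are assumed for the example.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Before: a monolithic SQL string, awkward to parameterize per wiki.
df_sql = spark.sql("""
    SELECT revision_id, revision_text
    FROM wmf_content.mediawiki_content_history_v1
    WHERE wiki_db = 'enwiki' AND revision_text_bytes > 100
""")

# After: PySpark functions, easy to parameterize and compose.
def revisions_for(wiki_db: str, min_bytes: int = 100):
    return (
        spark.table("wmf_content.mediawiki_content_history_v1")
        .where(
            (F.col("wiki_db") == wiki_db)
            & (F.col("revision_text_bytes") > min_bytes)
        )
        .select("revision_id", "revision_text")
    )
```

The functional form makes it straightforward to loop over many wiki_dbs or chain further filters, which is the composability benefit mentioned above.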

Update:

  • Completed a clean version of the notebook for collecting and processing template-based datasets. Changes compared to the original notebooks include:
    • Using MediaWiki content history v1 as the source table instead of the deprecated MediaWiki wikitext history.
    • Rewriting the extraction of positive and negative template pairs to use PySpark functions.

Next:

  • Create an Airflow job for training data generation.
  • Expand the code to handle multiple wiki_dbs at once, as the original notebooks process wikis one by one separately.
Aklapper renamed this task from Q4 24-25 Goal: Productionize tone check model to FY2024-25 Q4 Goal: Productionize tone check model.Jul 1 2025, 1:48 PM

Spillovers:

  • Publish the SLO
  • Evaluate model performance when using page_title in the input
  • Continue the work on the Airflow DAG:
    • Create a notebook that handles data generation + model training, to be used in the Airflow DAG
    • Figure out a way to pass the data from the data-generation task to the model-training task in the DAG. It will likely be an HDFS directory: we write to it during the ETL step and read from it in the training step.
    • Combine all steps and create the final DAG.
  • Work on evaluating additional languages in order to identify languages for which the model can be A/B tested safely.
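The HDFS handoff between the ETL and training tasks can be sketched with the local filesystem standing in for HDFS: the ETL step writes a dataset into a run-scoped directory, and the training step reads from the same path. All names here are illustrative:

```python
# Sketch of passing data between DAG tasks via a shared directory. A local
# temp dir stands in for HDFS; in the real DAG the path would be an
# hdfs:// directory derived from the run date.
import json
import tempfile
from pathlib import Path

def etl_step(out_dir: Path) -> None:
    # "Dataset generation": write the training data where the next
    # task can find it.
    (out_dir / "dataset.jsonl").write_text(
        json.dumps({"text": "a great example", "label": 1}) + "\n"
    )

def training_step(in_dir: Path) -> int:
    # "Model training": read the dataset written by the ETL step.
    rows = [
        json.loads(line)
        for line in (in_dir / "dataset.jsonl").read_text().splitlines()
    ]
    return len(rows)  # stand-in for actual training

# One run-scoped directory shared by both steps.
run_dir = Path(tempfile.mkdtemp()) / "tone_check" / "2025-07-01"
run_dir.mkdir(parents=True)
etl_step(run_dir)
n_rows = training_step(run_dir)
```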