Add Article Quality Model to LiftWing
Open, Needs Triage, Public

Description

We would like to add the language-agnostic quality model to LiftWing.

What use case is the model going to support/resolve?
This is in relation to a few needs:

  • SWE Intern Program
  • Enterprise interest: T346089
  • Research and Decision Science uses the model for Knowledge Gap metrics (bulk assessment of content), but there is currently no way for folks without access to the cluster to validate the model or check its predictions.
  • Currently the revscoring article quality model only supports 12 wikis and is unlikely to expand coverage, so this model offers a much more scalable approach to extending support to more wikis.

Do you have a model card? If you don't know what it is, please check https://meta.wikimedia.org/wiki/Machine_learning_models.
Yes: https://meta.wikimedia.org/wiki/Machine_learning_models/Proposed/Language-agnostic_Wikipedia_article_quality

What team created/trained/etc. the model? What tools and frameworks have you used?
The model card provides more details, but this was created internally by me (@Isaac) and is deliberately very simple -- a weighted average of a few features that are all calculated directly from an article's wikitext. The initial phase of this task will be to move this model to staging; then we will work on improvements to switch the model's features from wikitext-based to HTML-based, but the core approach of the model will remain the same.
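
As a rough illustration of that approach, a minimal sketch (the feature names and weights here are placeholders, not the model's actual values):

# Weighted average of pre-normalized features; names and weights are
# illustrative placeholders, not the model's actual values.
FEATURE_WEIGHTS = {
    "page_length": 0.4,
    "num_references": 0.3,
    "num_sections": 0.2,
    "has_infobox": 0.1,
}

def quality_score(normalized_features: dict) -> float:
    # Each feature value is assumed to already be normalized to [0, 1].
    return sum(
        FEATURE_WEIGHTS[name] * float(value)
        for name, value in normalized_features.items()
    )

# quality_score({"page_length": 0.7, "num_references": 0.4,
#                "num_sections": 0.5, "has_infobox": 1.0}) -> 0.6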

What kind of data was the model trained with, and what kind of data is the model going to need in production (for example, calls to internal/external services, special datasources for features, etc.)?

If you have a minimal codebase that you used to run the first tests with the model, could you please share it?

State what team will own the model and please share some main points of contact (see more info in Ownership of a model).
Research

What is the current latency and throughput of the model, if you have tested it? We don't need anything precise at this stage, just some ballpark numbers to figure out how the model performs with the expected inputs. For example, does the model take ms/seconds/etc. to respond to queries? How does it react when 1/10/20/etc. requests in parallel are made? If you don't have these numbers don't worry, open the task and we'll figure something out while we discuss next steps!
Anecdotally, generally under 500ms with the wikitext. It's just a simple API call and a few regexes. It might bump up slightly with the switch to HTML -- both because the HTML takes a bit longer to process and because of increased latency in requesting the Parsoid HTML for older, likely-non-cached revisions -- but the processing itself is quite fast.

Is there an expected frequency in which the model will have to be retrained with new data? What are the resources required to train the model and what was the dataset size?
Re-training should be pretty infrequent, with perhaps the normalization table being updated annually, but no discussion has been had on this yet.

Have you checked if the output of your model is safe from a human rights point of view? Is there any risk of it being offensive for somebody? Even if you have any slight worry or corner case, please tell us!
Minimal risk -- assessment of article quality. The model might be incorrect at times, but its simplicity makes it possible to understand why.

Anything else that is relevant in your opinion :)

  • The model is currently being improved, but if we get the current wikitext-based approach onto staging, it should be trivial to switch to the HTML-based approach when that is ready.

Event Timeline

Task created -- @isarantopoulos, just let me know if any details are missing or if there is anything I can do to help with next steps when you are ready!

We reviewed this task in the Research backlog refinement meeting today. @Miriam communicated that this is a task for the ML team. Moving the task to the Support Needed lane based on Miriam's assessment.

calbon removed kevinbazira as the assignee of this task.

Just adding another note of where these quality scores could be useful (filtering machine translation candidates): T293648#9816202

Adding a todo list of tasks:

  • Add a model server to inference-services -- start with dummy preprocess/predict functions, which can just be identity functions (see the sketch after this list)
  • Create blubber image
  • Add CI pipelines
  • Create entry in deployment-charts and deploy in ml-staging experimental
  • Transfer the preprocessing logic from the notebook and fill in the preprocess and predict functions
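
A rough sketch of what that first dummy model server could look like, assuming the usual KServe setup used in inference-services (class name and handler bodies are illustrative):

from kserve import Model, ModelServer

class ArticleQualityModel(Model):
    def __init__(self, name: str):
        super().__init__(name)
        self.ready = True

    async def preprocess(self, payload: dict, headers=None) -> dict:
        # Dummy identity function for now; feature extraction goes here later.
        return payload

    async def predict(self, payload: dict, headers=None) -> dict:
        # Dummy identity function for now; model scoring goes here later.
        return payload

if __name__ == "__main__":
    ModelServer(workers=1).start([ArticleQualityModel("articlequality")])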

Change #1041690 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[machinelearning/liftwing/inference-services@main] articlequality: initial commit

https://gerrit.wikimedia.org/r/1041690

Change #1042154 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[machinelearning/liftwing/inference-services@main] ci: add blubber for articlequality

https://gerrit.wikimedia.org/r/1042154

Change #1041690 merged by Ilias Sarantopoulos:

[machinelearning/liftwing/inference-services@main] articlequality: initial commit

https://gerrit.wikimedia.org/r/1041690

Change #1043644 had a related patch set uploaded (by Kevin Bazira; author: Kevin Bazira):

[machinelearning/liftwing/inference-services@main] articlequality: add CI pipelines

https://gerrit.wikimedia.org/r/1043644

Change #1043661 had a related patch set uploaded (by Kevin Bazira; author: Kevin Bazira):

[integration/config@master] inference-services: add CI jobs for articlequality model-server

https://gerrit.wikimedia.org/r/1043661

Change #1043661 merged by jenkins-bot:

[integration/config@master] inference-services: add CI jobs for articlequality model-server

https://gerrit.wikimedia.org/r/1043661

Change #1043644 merged by Ilias Sarantopoulos:

[machinelearning/liftwing/inference-services@main] articlequality: add CI pipelines

https://gerrit.wikimedia.org/r/1043644

Change #1042154 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] ci: add blubber for articlequality

https://gerrit.wikimedia.org/r/1042154

Change #1046133 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[operations/deployment-charts@master] ml-services: add dummy articlequality model

https://gerrit.wikimedia.org/r/1046133

Change #1046133 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: add dummy articlequality model

https://gerrit.wikimedia.org/r/1046133

The dummy version has been deployed on ml-staging-codfw experimental. It is just a dummy service that returns the JSON input passed in the POST request.

curl "https://inference-staging.svc.codfw.wmnet:30443/v1/models/articlequality:predict" -X POST -d '{"rev_id": 12345}' -H  "Host: articlequality.experimental.wikimedia.org"

Change #1046720 had a related patch set uploaded (by AikoChou; author: AikoChou):

[machinelearning/liftwing/inference-services@main] articlequality: add feature preprocess

https://gerrit.wikimedia.org/r/1046720

I'm trying to migrate the following example request to be used by the service:

import requests
response = requests.get("https://en.wikipedia.org/w/rest.php/v1/revision/12345/html").text

The above returns the HTML text, as used in the notebook, which is later processed to create the features. This type of request relies on RESTBase, which is going to be deprecated.
On the RESTBase service migration page I found that there is a specific endpoint that lets us query per revision as we need, more specifically:

https://en.wikipedia.org/w/rest.php/v1/revision/12345/with_html

The above returns a JSON object that contains an "html" entry which we could use. The issue is that it does not seem to be supported by the REST Gateway (e.g. curl -v -H 'Host: en.wikipedia.org' https://rest-gateway.discovery.wmnet:4113/en.wikipedia.org/v1/revision/12345/with_html doesn't work).
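
For reference, reading that entry via the public endpoint looks roughly like this:

import requests

# Fetch the JSON from the with_html endpoint (public URL shown here,
# since the REST Gateway route does not work) and read the "html" entry.
resp = requests.get("https://en.wikipedia.org/w/rest.php/v1/revision/12345/with_html")
resp.raise_for_status()
html = resp.json()["html"]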

I also explored using mwapi with the MediaWiki Action API instead, but haven't found a way to extract the HTML of a page that way (using either the query or parse actions).


@MSantos any insights on the revision-oriented endpoints for Parsoid HTML via the REST Gateway? Small amount of context on top of what Ilias mentioned above: the quality model that we're working on extracts various features about an article from its HTML as input into an ML model -- e.g., how many references there are, whether it has an infobox or not, etc. It's important that ML models on LiftWing can accept arbitrary revision IDs instead of just the current version of the page. This allows us to do things like check how the quality has changed between multiple revisions of an article, for evaluating the impact of edit campaigns, or to evaluate the accuracy of the model (our ground-truth quality data is specific to revisions that have been evaluated by editors).
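
For a concrete (if simplified) picture, the feature extraction is along these lines; the selectors below are illustrative, not the exact markup the model targets:

from bs4 import BeautifulSoup

def extract_features(html: str) -> dict:
    # Illustrative selectors only; the real extraction targets
    # Parsoid's specific output markup.
    soup = BeautifulSoup(html, "html.parser")
    return {
        "num_references": len(soup.select("sup.mw-ref, sup.reference")),
        "has_infobox": soup.select_one("table.infobox") is not None,
    }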

Change #1046720 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] articlequality: add feature preprocess

https://gerrit.wikimedia.org/r/1046720

Change #1048028 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[machinelearning/liftwing/inference-services@main] articlequality: add host header to request

https://gerrit.wikimedia.org/r/1048028

Currently (for example for PCS mobile-html output) we use the service mesh to connect directly to the MW listener, so if that's an option for you, you can reach the REST API that way.
Moving forward, when we completely remove Parsoid on RESTBase, I assume those endpoints are going to be served via the rest-gateway, but I would defer to @daniel for that.

Here is the related ticket:
https://phabricator.wikimedia.org/T367416

Change #1048028 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] articlequality: add force_http option

https://gerrit.wikimedia.org/r/1048028

Change #1048401 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[machinelearning/liftwing/inference-services@main] articlequality: add FORCE_HTTP env var

https://gerrit.wikimedia.org/r/1048401

Change #1048401 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] articlequality: add FORCE_HTTP env var

https://gerrit.wikimedia.org/r/1048401

Change #1048455 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[operations/deployment-charts@master] ml-services: use force_http in articlequality

https://gerrit.wikimedia.org/r/1048455

Preprocessing now works. For this POC we used the following endpoint: https://en.wikipedia.org/w/rest.php/v1/revision/12345/html

curl "https://inference-staging.svc.codfw.wmnet:30443/v1/models/articlequality:predict" -X POST -d '{"rev_id": 12345, "lang": "en"}' -H  "Host: articlequality.experimental.wikimedia.org"
{"rev_id":12345,"lang":"en","normalized_features":[0.25854384507523304,0.0,0.27142566329646467,0.0,0.0,0.0,0.0,false,false]}

Before we move this to production we should also figure out how to use it with the REST Gateway.
Now the only thing left is to load the model and run the above features through the predict function.
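
In essence, that predict step should just be a weighted sum over the feature vector above, something like the following (the weights are placeholders, not the trained model's values):

# Placeholder weights -- not the trained model's actual values.
WEIGHTS = [0.2, 0.1, 0.2, 0.1, 0.1, 0.1, 0.1, 0.05, 0.05]

def predict(normalized_features: list) -> float:
    # Boolean features (the trailing false values above) coerce to 0/1.
    return sum(w * float(f) for w, f in zip(WEIGHTS, normalized_features))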

Change #1048487 had a related patch set uploaded (by AikoChou; author: AikoChou):

[machinelearning/liftwing/inference-services@main] articlequality: add predict

https://gerrit.wikimedia.org/r/1048487

Change #1048455 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: use force_http in articlequality

https://gerrit.wikimedia.org/r/1048455

Change #1048487 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] articlequality: add predict

https://gerrit.wikimedia.org/r/1048487

Change #1048559 had a related patch set uploaded (by AikoChou; author: AikoChou):

[operations/deployment-charts@master] ml-services: update articlequality image and storage URI

https://gerrit.wikimedia.org/r/1048559