
Add Article Quality Model to LiftWing
Closed, ResolvedPublic

Description

We would like to add the language-agnostic quality model to LiftWing.

What use case is the model going to support/resolve?
This is in relation to a few needs:

  • SWE Intern Program
  • Enterprise interest: T346089
  • Research and Decision Science uses the model for Knowledge Gap metrics (bulk assessment of content) but there is no way to validate this model to check predictions etc. for folks without access to the cluster
  • Currently the revscoring article quality model only supports 12 wikis and it's unlikely to expand coverage so this model offers a much more scalable approach to extending support to more wikis.

Do you have a model card? If you don't know what it is, please check https://meta.wikimedia.org/wiki/Machine_learning_models.
Yes: https://meta.wikimedia.org/wiki/Machine_learning_models/Proposed/Language-agnostic_Wikipedia_article_quality

What team created/trained/etc.. the model? What tools and frameworks have you used?
The model card provides more details, but this was created internally by me (@Isaac) and is deliberately very simple -- a weighted average of a few features that are all calculated directly from an article's wikitext. The initial phase of this task will be to move this model to staging, and then we will work on improvements to switch the model's features from wikitext-based to HTML-based, but the core approach of the model will remain the same.
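For illustration only, here is a minimal sketch of that shape in Python -- a weighted average of 0-1 normalized features -- where the feature names and weights are placeholders rather than the trained values from the model card:

# Illustrative sketch: a weighted average of normalized wikitext features.
# EXAMPLE_WEIGHTS is a placeholder, not the trained model's weights.
EXAMPLE_WEIGHTS = {
    "characters": 0.3,
    "refs": 0.25,
    "wikilinks": 0.15,
    "headings": 0.15,
    "media": 0.15,
}

def quality_score(normalized_features: dict) -> float:
    """Return a 0-1 quality score from 0-1 normalized feature values."""
    total_weight = sum(EXAMPLE_WEIGHTS.values())
    weighted_sum = sum(
        weight * normalized_features.get(name, 0.0)
        for name, weight in EXAMPLE_WEIGHTS.items()
    )
    return weighted_sum / total_weight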

What kind of data was the model trained with, and what kind of data the model is going to need in production (for example, calls to internal/external services, special datasources for features, etc..) ?

If you have a minimal codebase that you used to run the first tests with the model, could you please share it?

State what team will own the model and please share some main point of contacts (see more info in Ownership of a model).
Research

What is the current latency and throughput of the model, if you have tested it? We don't need anything precise at this stage, just some ballparks numbers to figure out how the model performs with the expected inputs. For example, does the model take ms/seconds/etc.. to respond to queries? How does it react when 1/10/20/etc.. requests in parallel are made? If you don't have these numbers don't worry, open the task and we'll figure something out while we discuss about next steps!
Anecdotally, generally under 500ms with the wikitext. It's just a simple API call and a few regexes. This might bump up slightly with the switch to HTML -- both because the HTML takes a bit longer to process and because of increased latency when requesting the Parsoid HTML for older, likely-uncached revisions -- but the processing itself is quite fast.

Is there an expected frequency in which the model will have to be retrained with new data? What are the resources required to train the model and what was the dataset size?
Re-training should be pretty infrequent -- perhaps with the normalization table being updated annually -- but no discussion has been had on this yet.

Have you checked if the output of your model is safe from a human rights point of view? Is there any risk of it being offensive for somebody? Even if you have any slight worry or corner case, please tell us!
Minimal risk -- assessment of article quality. Model might be incorrect at times but its simplicity allows for understanding why.

Anything else that is relevant in your opinion :)

  • Model is currently being improved but if we get the current wikitext-based approach onto staging, it should be trivial to make changes to switch to HTML-based approach when that is ready.

Details

Other Assignee
achou
Repo | Branch | Lines +/-
operations/puppet | production | +18 -0
operations/deployment-charts | master | +22 -1
machinelearning/liftwing/inference-services | main | +13 -3
operations/deployment-charts | master | +18 -22
operations/deployment-charts | master | +14 -0
operations/deployment-charts | master | +141 -0
machinelearning/liftwing/inference-services | main | +32 -0
operations/puppet | production | +8 -0
operations/deployment-charts | master | +6 -0
operations/deployment-charts | master | +2 -4
machinelearning/liftwing/inference-services | main | +418 -32
operations/deployment-charts | master | +3 -1
machinelearning/liftwing/inference-services | main | +12 -7
operations/deployment-charts | master | +3 -1
machinelearning/liftwing/inference-services | main | +3 -1
machinelearning/liftwing/inference-services | main | +6 -4
machinelearning/liftwing/inference-services | main | +240 -1
operations/deployment-charts | master | +15 -0
machinelearning/liftwing/inference-services | main | +91 -1
machinelearning/liftwing/inference-services | main | +19 -0
integration/config | master | +15 -0
machinelearning/liftwing/inference-services | main | +42 -0

Related Objects

Event Timeline

There are a very large number of changes, so older changes are hidden.

Change #1046720 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] articlequality: add feature preprocess

https://gerrit.wikimedia.org/r/1046720

Change #1048028 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[machinelearning/liftwing/inference-services@main] articlequality: add host header to request

https://gerrit.wikimedia.org/r/1048028

Currently (for example, for PCS mobile-html output) we use the service mesh to connect directly to the MW listener, so if that's an option for you, you can reach the REST API that way.
Moving forward, when we completely remove Parsoid from RESTBase, I assume those endpoints are going to be served via the rest-gateway, but I would defer to @daniel on that.

Here is the related ticket:
https://phabricator.wikimedia.org/T367416

Change #1048028 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] articlequality: add force_http option

https://gerrit.wikimedia.org/r/1048028

Change #1048401 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[machinelearning/liftwing/inference-services@main] articlequality: add FORCE_HTTP env var

https://gerrit.wikimedia.org/r/1048401

Change #1048401 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] articlequality: add FORCE_HTTP env var

https://gerrit.wikimedia.org/r/1048401

Change #1048455 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[operations/deployment-charts@master] ml-services: use force_http in articlequality

https://gerrit.wikimedia.org/r/1048455

Preprocessing now works. For this POC we used the following endpoint https://en.wikipedia.org/w/rest.php/v1/revision/12345/html:

curl "https://inference-staging.svc.codfw.wmnet:30443/v1/models/articlequality:predict" -X POST -d '{"rev_id": 12345, "lang": "en"}' -H  "Host: articlequality.experimental.wikimedia.org"
{"rev_id":12345,"lang":"en","normalized_features":[0.25854384507523304,0.0,0.27142566329646467,0.0,0.0,0.0,0.0,false,false]}

Before we move this to production we should also figure out how to use it with the REST gateway.
Now the only thing left is to load the model and run the above features through the predict function.

Change #1048487 had a related patch set uploaded (by AikoChou; author: AikoChou):

[machinelearning/liftwing/inference-services@main] articlequality: add predict

https://gerrit.wikimedia.org/r/1048487

Change #1048455 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: use force_http in articlequality

https://gerrit.wikimedia.org/r/1048455

Change #1048487 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] articlequality: add predict

https://gerrit.wikimedia.org/r/1048487

Change #1048559 had a related patch set uploaded (by AikoChou; author: AikoChou):

[operations/deployment-charts@master] ml-services: update articlequality image and storage URI

https://gerrit.wikimedia.org/r/1048559

Change #1048559 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: update articlequality image and storage URI

https://gerrit.wikimedia.org/r/1048559

Now that Aiko has uploaded the model, we can use the model server deployed in the experimental namespace in ml-staging.

time curl "https://inference-staging.svc.codfw.wmnet:30443/v1/models/articlequality:predict" -X POST -d '{"rev_id": 12345, "lang": "en"}' -H  "Host: articlequality.experimental.wikimedia.org"
0.14454229522974138
real	0m0.244s
user	0m0.030s
sys	0m0.004s

There is no response schema at the moment and the service just returns a float with the quality score.

Just commenting that this is very exciting! I'll be working on getting you all the final model etc. in the next week or two.

Great @Isaac!
I was wondering if you'd be open to using a gradient boosting regressor model (xgboost, catboost, lightgbm) so that we don't have to do much feature preprocessing (normalization). In this case we wouldn't need to maintain the feature values (the ones we have now in the CSV) and model maintenance/updates would be easier. wdyt?

isarantopoulos moved this task from Ready To Go to Blocked on the Machine-Learning-Team board.

I was wondering if you'd be open to using a gradient boosting regressor model (xgboost, catboost, lightgbm) so that we don't have to do much feature preprocessing (normalization). In this case we wouldn't need to maintain the feature values (the ones we have now in the CSV) and model maintenance/updates would be easier. wdyt?

@isarantopoulos good question! The feature pre-processing is non-standard, which is why I've been implementing it myself as opposed to using a standard scaler or something like that. But happy to brainstorm. Here's the situation:

  • We only have training data for ~6 wikis but want the model to apply to all 300+ Wikipedia language editions.
  • I'm trying to reduce the impact of some extreme outliers -- e.g., super long list articles -- on the features.
  • I wanted the features to be maximally interpretable so that they could also be used as recommendations -- i.e. which features is an article lagging on and would potentially be of highest priority to work on.

To achieve that, I took our current approach:

  • While we train on the 6 wikis, we run a job that computes feature distributions for all 300+ wikis (I'm actually working on that for the HTML -- for the wikitext it was trivial because the data was already on the cluster, but the HTML is not yet)
  • We take 99th-percentiles (so remove the top 1% of articles for a given feature) as our "expectation" for a given feature + wiki
  • Using that "expectation", we scale each feature so it's a number between 0 and 1 with 0 = no content and 1 = reached or exceeded the expectation for that wiki -- e.g., if expectation is 3 images for Arabic Wikipedia, then an article with n images is scaled so its image feature is min(n / 3, 1.0):
    • 0 -> 0 / 3 -> 0.00
    • 1 -> 1 / 3 -> 0.33
    • 2 -> 2 / 3 -> 0.67
    • 3+ -> 3 / 3 -> 1.00
  • I think this is closest to MaxAbsScaler in sklearn (after removing the top 1% of values). In theory we could switch to something like that, but there's a challenge: the scaling is language-specific while the prediction is language-agnostic, so we'd either need to maintain separate scalers for each language edition or do some hack where we use a single scaler with language-specific coefficients -- but then I think our model goes from a dense, language-agnostic 9 coefficients to a sparse 9 * 300+ languages and we'd have to force each language's coefficients to be the same. Any ideas? (A minimal sketch of the per-wiki scaling is included after this list.)
  • There's also a caveat that I think we're going to switch to an ordinal logistic regression model, which unfortunately sklearn doesn't support (but we could still use their preprocessing, I think, so we can ignore this aspect).
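A minimal sketch (with hypothetical numbers) of the per-wiki scaling described above -- the "expectation" is the 99th percentile of a feature's per-article distribution for a wiki, and each raw count is capped at that expectation:

import numpy as np

def feature_expectation(per_article_values) -> float:
    """Per-wiki "expectation" for a feature: the 99th percentile of its values."""
    return float(np.percentile(per_article_values, 99))

def normalize(count, expectation) -> float:
    """Scale a raw count into [0, 1]: 0 = no content, 1 = met/exceeded the expectation."""
    if expectation <= 0:
        return 0.0
    return min(count / expectation, 1.0)

# e.g. an expectation of 3 images for a given wiki:
print(normalize(0, 3))  # 0.0
print(normalize(1, 3))  # ~0.33
print(normalize(2, 3))  # ~0.67
print(normalize(7, 3))  # 1.0 (outliers are capped at 1.0)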

I see you've done a lot of great work on feature engineering and preprocessing, so I don't mean to interfere with your work! My suggestion is a bit short-sighted, as I was looking at it from the perspective of deploying and updating a model. I was hoping to use a gradient boosting model and not do any normalization (we'd still have to take care of extreme outliers). This way we wouldn't have to maintain a separate CSV with the values used in preprocessing, and we could still have interpretable features using the feature importance attribute of these models.

But even if we did this, an improved model would most certainly involve more sophisticated preprocessing, so we would be in the same situation.
After you have trained the model we can discuss adding the min/max values as a model attribute instead of a CSV, as this is something that depends on/changes with model training.

I think this is closest to MaxAbsScaler in sklearn (after removing the top 1% of values). In theory we could switch to something like that, but there's a challenge: the scaling is language-specific while the prediction is language-agnostic, so we'd either need to maintain separate scalers for each language edition or do some hack where we use a single scaler with language-specific coefficients -- but then I think our model goes from a dense, language-agnostic 9 coefficients to a sparse 9 * 300+ languages and we'd have to force each language's coefficients to be the same. Any ideas?

I definitely wouldn't want to keep 9*300 coefficients! One idea could be to have some default values for these "expectations" for the unseen wikis (the ones not present in the training data), which could be a mean/median computed from the data we have for the 6 wikis -- but then again this really depends on the quality of the predictions we'd get from this.
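A purely illustrative sketch of that fallback idea, assuming we only have a per-wiki expectation for one feature for the training wikis (the numbers below are hypothetical):

import statistics

def expectation_for(wiki: str, known_expectations: dict) -> float:
    """Use the wiki's own expectation if available, else fall back to the median of known wikis."""
    if wiki in known_expectations:
        return known_expectations[wiki]
    return statistics.median(known_expectations.values())

# Hypothetical expectations learned from the 6 training wikis:
known = {"enwiki": 5, "arwiki": 3, "frwiki": 4, "eswiki": 4, "ruwiki": 4, "jawiki": 3}
print(expectation_for("cebwiki", known))  # falls back to the median, 4.0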

Ok, for the V1 of the model, I have everything ready to go! Specifically:

At some point, it might be worthwhile to set up a meeting with you, me, and @fkaelin, who I've been chatting with about this model because he has worked on similar versions for the content gap metrics. That way we can figure out what a more sustainable approach might be here, because I think your feedback is quite valid -- we just have to find a way to balance all the needs.

Thanks for the update Isaac!
Looking at the above code + model, if I understand correctly the following changes need to be introduced in Lift Wing:

  • switch from sklearn to statsmodels ordinal regression
  • change output schema to match the one on the model card

Feature preprocessing remains the same, and we are using the English (en) class labels for all languages.
I'm currently working on updating the service, so let me know if any of the above is incorrect or if there is any other piece I'm missing.

Another thing: in order to get both the labels and the raw score we need to run model.predict twice, as there is no way to do that natively in one go according to the docs. It isn't much of an issue since predictions are really fast, but I'm mentioning it because it's something we could make optional (returning the raw values) if we decide we need a faster service.
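Purely as an illustration of why two calls are needed (not the deployed code, and assuming the pickled object behaves like a statsmodels OrderedModel results instance):

import numpy as np

def label_and_second_output(results, features, class_names):
    # features: a single-row 2D array of normalized feature values (hypothetical shape).
    # First call: per-class probabilities -> argmax gives the label.
    probs = np.asarray(results.predict(features, which="prob")).ravel()
    label = class_names[int(np.argmax(probs))]
    # Second call with a different `which` value -- hence running predict twice.
    # "linpred" is only an example of another output type; the raw score the
    # service returns is whatever the model card defines.
    other = np.asarray(results.predict(features, which="linpred")).ravel()
    return label, other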

Change #1055177 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[machinelearning/liftwing/inference-services@main] (WIP) articlequality: update to ordinal regression from statsmodels

https://gerrit.wikimedia.org/r/1055177

@isarantopoulos your summary is accurate - thanks! We can adjust the model card schema if it's non-standard and you want to adjust the keys / structure. In particular, as I think about it more based on your point around latency and double-running the model, I'd be open to a very simple default configuration where we just return the quality score (a number between 0 and 1), as that's the "official" output of the model, with the optional mode you mention that also returns the feature values and class label.

I'll have an update on this next week, since this week the team is doing a focus week on LLM work. I've already done some work in the patch seen above.

isarantopoulos updated Other Assignee, added: achou.
isarantopoulos moved this task from Blocked to In Progress on the Machine-Learning-Team board.

Update: I'm having some issues while building the Lift Wing service, caused by dependencies.
I'm getting an error on model load caused by numpy. The issue is that kserve requires numpy <2.0.0, which ends up installing numpy==1.26.4. Locally I've had no issue running things in a notebook, but that was with numpy 2.0.0. After checking Isaac's notebook I found that the model has been trained using numpy 2.0.0, so ideally this would be the numpy version we would want to use while unpickling.

I'm getting this error:

Traceback (most recent call last):
  File "/srv/articlequality/model_server/model.py", line 110, in <module>
    model = ArticleQualityModel(
            ^^^^^^^^^^^^^^^^^^^^
  File "/srv/articlequality/model_server/model.py", line 50, in __init__
    self.load()
  File "/srv/articlequality/model_server/model.py", line 53, in load
    self.model = load_pickle(self.model_path)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/somebody/.local/lib/python3.11/site-packages/statsmodels/iolib/smpickle.py", line 42, in load_pickle
    return pickle.load(fin)
           ^^^^^^^^^^^^^^^^
  File "/home/somebody/.local/lib/python3.11/site-packages/numpy/random/_pickle.py", line 34, in __bit_generator_ctor
    raise ValueError(str(bit_generator_name) + ' is not a known '
ValueError: <class 'numpy.random._mt19937.MT19937'> is not a known BitGenerator module.

I'll work on this and provide an update.

I'd be open to a very simple default configuration where we just return the quality score (a number between 0 and 1), as that's the "official" output of the model, with the optional mode you mention that also returns the feature values and class label.

I'm working on a solution that returns only the score and features by default, and returns the label only if the user provides a boolean get_label in the payload, e.g. {"lang": "en", "rev_id": 12345, "get_label": "true"}. Let me know if this would work. As the default we should provide whatever is going to be used most often by latency-sensitive/demanding services.
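A minimal, hypothetical sketch of that flag handling (not the actual handler code):

def parse_bool(value) -> bool:
    """Accept JSON booleans as well as "true"/"True" strings in the payload."""
    if isinstance(value, bool):
        return value
    return str(value).strip().lower() == "true"

def build_response(payload: dict, score: float, compute_label) -> dict:
    response = {"rev_id": payload["rev_id"], "lang": payload["lang"], "score": score}
    if parse_bool(payload.get("get_label", False)):
        # Only pay for the second model run when the caller asks for the label.
        response["label"] = compute_label()
    return response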

After checking Isaac's notebook I found that the model has been trained using numpy 2.0.0, so ideally this would be the numpy version we would want to use while unpickling.

@isarantopoulos if this continues to be an issue, let me know and I can see about re-training/pickling with an earlier version of numpy. In theory it should be pretty easy to force an older version of statsmodels that depends on an earlier version of numpy (i.e., it shouldn't throw any errors or require many code changes), and I don't think it should affect the model parameters in any serious way, as the core logic should all be the same.

I'm working on a solution that returns only the score and features by default, and returns the label only if the user provides a boolean get_label in the payload, e.g. {"lang": "en", "rev_id": 12345, "get_label": "true"}. Let me know if this would work. As the default we should provide whatever is going to be used most often by latency-sensitive/demanding services.

Yeah, my default is just whatever basic page metadata is standard on LiftWing and the numeric score field. @FNavas-foundation -- any thoughts on this? A "complete" prediction might look something like:

{'score': 0.5917956435209251,
 'label': 'C',
 'features': {'raw': {'characters': 6580,
   'refs': 56,
   'wikilinks': 79,
   'categories': 8,
   'media': 3,
   'headings': 10,
   'sources': 43,
   'infobox': True,
   'messagebox': False},
  'normalized': {'characters': 0.8401079697495082,
   'refs': 1,
   'wikilinks': 0.8248730890470838,
   'categories': 0.5333333333333333,
   'media': 0.6,
   'headings': 0.7935487945263084,
   'sources': 1,
   'infobox': True,
   'messagebox': False}}}

Where there are a few different things that you may or may not want to use:

  • score: 0-1 quality score for the article. This I think is the core output of the model and lets you track fine-grained changes in an article's quality.
  • label: our best guess of how the model prediction maps to the English Wikipedia quality classes (Stub, Start, B, C, GA, FA). Though because the model is wiki-specific, a C-class article in English Wikipedia will look very different from a C-class article in Cebuano Wikipedia (this is true of score too -- e.g., a 0.6 article in English Wikipedia will look very different from a 0.6 article in Cebuano Wikipedia). I think this label is most useful for analyses (grouping articles in quality classes to look at trends) and for editors who might want to use the outputs but I assume that most of Enterprise's customers would prefer the raw score?
  • raw features are the plain counts of different things in the article
  • normalized features are those raw counts normalized to 0-1 scores (0 = none; 1 = very high quality) based on our "expectations" for that wiki

In theory we could include all of these in every response, but getting the score and label does require running the model twice, and we assume most folks don't care about all of these other features, so we were thinking of narrowing it down dramatically for the default (and letting folks request all the other details if they want).

@Isaac We're going to solve the numpy issue by relaxing the kserve restriction by using our wmf kserve fork. At some point in the near future numpy 2.0 is going to be supported upstream anyway, so we will switch to the official release then. It wouldn't make much sense to build things with an older version just to make things work. Thanks for offering to help!

@Isaac

In theory we could include all of these in every response, but getting the score and label does require running the model twice, and we assume most folks don't care about all of these other features, so we were thinking of narrowing it down dramatically for the default (and letting folks request all the other details if they want).

That makes perfect sense to me. I have the same sense that most enterprise reusers will just go for the "one number" and not bother with the rest unless they really do care about it.

That said, I will push any enterprise reusers to at least learn/understand and at most to use the more granular data (at least the normalized ones), because of "data is not neutral". I want to push our reusers to understand and implement our model of verifiability.

We're going to solve the numpy issue by relaxing the kserve restriction by using our wmf kserve fork.

@isarantopoulos thanks!

That makes perfect sense to me. I have the same sense that most enterprise reusers will just go for the "one number" and not bother with the rest unless they really do care about it.
That said, I will push any enterprise reusers to at least learn/understand and at most to use the more granular data (at least the normalized ones), because of "data is not neutral". I want to push our reusers to understand and implement our model of verifiability.

@FNavas-foundation okay, then it sounds like for this start-small approach it's okay to default to a response that just includes the 0-1 score; if external re-users ever decide they want more features, that functionality will still be available. Thanks!

Change #1059043 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[operations/deployment-charts@master] ml-services: update lang agnostic articlequality model

https://gerrit.wikimedia.org/r/1059043

I've uploaded the model to Swift and to the public analytics space.

For testing purposes, this API should be hosting the same model, so it should match LiftWing outputs: https://misalignment.wmcloud.org/api/v1/quality-revid-html?lang=en&revid=1228403723. It's coded slightly differently and there might be tiny rounding errors, but in that sense it's a nice independent verification :)

Change #1055177 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] articlequality: update to ordinal regression from statsmodels

https://gerrit.wikimedia.org/r/1055177

I've deployed the new model in the experimental namespace in ml-staging so it is now available for further testing.

time curl "https://inference-staging.svc.codfw.wmnet:30443/v1/models/articlequality:predict" -X POST -d '{"rev_id": 12345, "lang": "tr", "get_label": "True"}' -H  "Host: articlequality.experimental.wikimedia.org"
{"label":"Stub","score":0.18358280255137194,"features":{"raw":{"characters":1232,"refs":0,"wikilinks":2,"categories":0,"media":0,"headings":0,"sources":0,"infobox":false,"messagebox":false},"normalized":{"characters":0.5280114655163068,"refs":0.0,"wikilinks":0.07891242783132303,"categories":0.0,"media":0.0,"headings":0.0,"sources":0.0,"infobox":false,"messagebox":false}}}
real	0m0.210s
user	0m0.048s
sys	0m0.005s

Just a reminder that the get_label POST parameter is optional; include it if you also want the label in the response. I plan to add some schema validation in the future so that if anyone requests get_labels or anything similar they get an error instead of a silent failure.
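As a sketch of what that validation could look like -- using pydantic purely as an assumption, the deployed service may use something else -- unknown keys such as get_labels would be rejected explicitly:

from pydantic import BaseModel, ConfigDict

class ArticleQualityRequest(BaseModel):
    model_config = ConfigDict(extra="forbid")  # unknown fields raise a ValidationError
    rev_id: int
    lang: str
    get_label: bool = False

# ArticleQualityRequest(rev_id=12345, lang="tr", get_labels=True)
# -> raises a ValidationError pointing at the unexpected "get_labels" field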

@isarantopoulos excited to see this up! It helped me notice a small bug in my code that's now fixed, so my experimental API endpoint now matches the outputs from staging. Latency looks good, though I didn't do any formal testing. I'd say we're good to move to the next step and begin coordinating with Enterprise about traffic to the endpoint!

Change #1059043 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: update lang agnostic articlequality model

https://gerrit.wikimedia.org/r/1059043

Change #1062049 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[machinelearning/liftwing/inference-services@main] (WIP) locust: add articlequality model

https://gerrit.wikimedia.org/r/1062049

Change #1062709 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[operations/deployment-charts@master] admin_ng/LiftWing: add article-models namespace

https://gerrit.wikimedia.org/r/1062709

Change #1062709 merged by jenkins-bot:

[operations/deployment-charts@master] admin_ng/LiftWing: add article-models namespace

https://gerrit.wikimedia.org/r/1062709

Change #1063182 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[operations/deployment-charts@master] ml-services: deploy articlequality to prod in new ns

https://gerrit.wikimedia.org/r/1063182

Change #1063183 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[operations/puppet@production] hiera/deployment-server: create article-models config/roles

https://gerrit.wikimedia.org/r/1063183

Change #1063183 merged by Klausman:

[operations/puppet@production] hiera/deployment-server: create article-models config/roles

https://gerrit.wikimedia.org/r/1063183

Change #1062049 merged by Ilias Sarantopoulos:

[machinelearning/liftwing/inference-services@main] locust: add articlequality model

https://gerrit.wikimedia.org/r/1062049

Change #1063182 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: deploy articlequality to prod in new ns

https://gerrit.wikimedia.org/r/1063182

Change #1063213 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[operations/puppet@production] httpbb: add article-models namespace tests for articlequality

https://gerrit.wikimedia.org/r/1063213

Change #1063225 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[operations/deployment-charts@master] APIGW: Add configuration to expose LW isvc articlequality

https://gerrit.wikimedia.org/r/1063225

The model has been deployed to production!
At the moment only the internal endpoint is available but we're also working to expose it through the API Gateway.

time curl "https://inference.svc.codfw.wmnet:30443/v1/models/articlequality:predict" -X POST -d '{"rev_id": 21774755, "lang": "en"}' -H  "Host: articlequality.article-models.wikimedia.org"
{"score":0.07192803595025732,"features":{"raw":{"characters":166,"refs":0,"wikilinks":3,"categories":1,"media":0,"headings":0,"sources":0,"infobox":false,"messagebox":false},"normalized":{"characters":0.13343697927306403,"refs":0.0,"wikilinks":0.19721511695668603,"categories":0.06666666666666667,"media":0.0,"headings":0.0,"sources":0.0,"infobox":false,"messagebox":false}}}
real	0m0.314s
user	0m0.043s
sys	0m0.009s

Change #1063225 merged by jenkins-bot:

[operations/deployment-charts@master] APIGW: Add configuration to expose LW isvc articlequality

https://gerrit.wikimedia.org/r/1063225

Hi! The model has been made available through the API Gateway, along with the related API docs.
I am working on adding the following four fields at the start of the response schema, to match other models on Lift Wing:

"model_name": "articlequality", 
"model_version": "1",
"wiki_db": "enwiki",
"revision_id": 123456,

Sample request:

curl https://api.wikimedia.org/service/lw/inference/v1/models/articlequality:predict -X POST -d '{"rev_id": 21774755, "lang": "en"}' -H "Content-type: application/json"

{"score":0.07192803595025732,"features":{"raw":{"characters":166,"refs":0,"wikilinks":3,"categories":1,"media":0,"headings":0,"sources":0,"infobox":false,"messagebox":false},"normalized":{"characters":0.13343697927306403,"refs":0.0,"wikilinks":0.19721511695668603,"categories":0.06666666666666667,"media":0.0,"headings":0.0,"sources":0.0,"infobox":false,"messagebox":false}}}

I'll redeploy the latest changes and repost tomorrow.

Change #1070228 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[machinelearning/liftwing/inference-services@main] articlequality: update output schema

https://gerrit.wikimedia.org/r/1070228

@isarantopoulos very exciting thank you!

@FNavas-foundation ^^ for testing and to see what features are available.

Change #1070535 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[operations/deployment-charts@master] ml-services: temp undeploy articlequality

https://gerrit.wikimedia.org/r/1070535

Change #1070535 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: temp undeploy articlequality

https://gerrit.wikimedia.org/r/1070535

I have temporarily disabled the production deployments (available through the API Gateway) until we finalize the schema changes. I did that just to be sure nobody starts using it and then we introduce breaking changes.

Regarding the schema changes:

  • version: it is now hardcoded as version 1, as there isn't any other place to extract it from. In the future we can add the version as an attribute of the model.
  • label, features: since most users are going to be interested in just the score, we can return that by default and return the rest of the output on demand. Instead of having two flags, return_features and return_label, we can combine these into one flag named extended_output and return both if requested. That would be preferred in order not to add complexity, with the downside of having to run inference twice (to get the label) even if someone just wants the features.

Change #1070228 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] articlequality: update output schema

https://gerrit.wikimedia.org/r/1070228

Change #1071232 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[operations/deployment-charts@master] ml-services: re-deploy prod articlequality and update staging

https://gerrit.wikimedia.org/r/1071232

Change #1071232 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: re-deploy prod articlequality and update staging

https://gerrit.wikimedia.org/r/1071232

All changes are deployed and the model is ready for use:

Simple request:

% curl https://api.wikimedia.org/service/lw/inference/v1/models/articlequality:predict -X POST -d '{"rev_id": 12345, "lang": "en"}'

{
  "score": 0.10748992767177634,
  "model_name": "articlequality",
  "model_version": "1",
  "wiki_db": "enwiki",
  "revision_id": 12345
}

Request with "extended_output" : "True" to get label and feature values.

% curl https://api.wikimedia.org/service/lw/inference/v1/models/articlequality:predict -X POST -d '{"rev_id": 12345, "lang": "en", "extended_output" : "True"}'

{
  "label": "Stub",
  "features": {
    "raw": {
      "characters": 625,
      "refs": 0,
      "wikilinks": 8,
      "categories": 0,
      "media": 0,
      "headings": 0,
      "sources": 0,
      "infobox": false,
      "messagebox": false
    },
    "normalized": {
      "characters": 0.2589179540286342,
      "refs": 0,
      "wikilinks": 0.2710334973090758,
      "categories": 0,
      "media": 0,
      "headings": 0,
      "sources": 0,
      "infobox": false,
      "messagebox": false
    }
  },
  "score": 0.10748992767177634,
  "model_name": "articlequality",
  "model_version": "1",
  "wiki_db": "enwiki",
  "revision_id": 12345
}

Many thanks @isarantopoulos! I think from my end we can mark this as resolved. If anything comes up when Enterprise starts working with the model, we can always open a new task to address it.

Just to note that Enterprise is planning to integrate by November.

Change #1063213 merged by Klausman:

[operations/puppet@production] httpbb: add article-models namespace tests for articlequality

https://gerrit.wikimedia.org/r/1063213