We would like to add the language-agnostic quality model to LiftWing.
What use case is the model going to support/resolve?
This addresses a few needs:
- SWE Intern Program
- Enterprise interest: T346089
- Research and Decision Science uses the model for Knowledge Gap metrics (bulk assessment of content), but folks without access to the cluster currently have no way to validate the model or spot-check its predictions.
- The revscoring article quality model currently supports only 12 wikis and is unlikely to expand coverage, so this model offers a much more scalable approach to extending support to more wikis.
Do you have a model card? If you don't know what it is, please check https://meta.wikimedia.org/wiki/Machine_learning_models.
Yes: https://meta.wikimedia.org/wiki/Machine_learning_models/Proposed/Language-agnostic_Wikipedia_article_quality
What team created/trained/etc. the model? What tools and frameworks have you used?
The model card provides more details, but this was created internally by me (@Isaac) and is deliberately very simple -- a weighted average of a few features that are all calculated directly from an article's wikitext. The initial phase of this task will be to move this model to staging; then we will work on improvements to switch the model's features from wikitext-based to HTML-based, but the core approach of the model will remain the same.
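For illustration, here is a minimal sketch of that scoring step, assuming the features have already been normalized to [0, 1]. The weight values and the quality_score helper below are placeholders for this sketch, not the trained weights (those are in the training notebook linked below):

```python
# Minimal sketch of the model's core: a weighted average of normalized
# features. These weights are illustrative placeholders, NOT the trained
# values -- see the training notebook for the real model.
WEIGHTS = {
    "page_length": 0.30,
    "references": 0.25,
    "sections": 0.15,
    "wikilinks": 0.10,
    "images": 0.10,
    "categories": 0.10,
}

def quality_score(normalized_features: dict) -> float:
    """Weighted average over features already normalized to [0, 1]."""
    return sum(WEIGHTS[name] * normalized_features[name] for name in WEIGHTS)
```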
What kind of data was the model trained with, and what kind of data is the model going to need in production (for example, calls to internal/external services, special datasources for features, etc.)?
- Input: a Wikipedia article -- initially the wikitext, later the HTML
- Features: counts extracted from the page source -- e.g., # of references, page length, # of images, # of categories, # of sections, # of wikilinks. These are all normalized against a simple static table of max feature values per wiki -- i.e., ~300 rows (one per wiki) with one column per feature (see the sketch after this list). Example: https://analytics.wikimedia.org/published/datasets/one-off/isaacj/misalignment/quality-max-featurevalues-by-wiki.tsv.gz
- Output: a score in [0, 1] reflecting the relative quality of the article (0 = nothing; 1 = top quality).
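For concreteness, a rough sketch of the extraction and normalization steps, assuming the features are simple regex counts over the wikitext. The patterns below are simplified stand-ins for the real extraction in the training notebook:

```python
import re

def extract_features(wikitext: str) -> dict:
    # Simplified, assumed regex patterns; the production extraction (in the
    # training notebook) handles more edge cases, e.g. per-language
    # namespace aliases for "File:" and "Category:".
    return {
        "page_length": float(len(wikitext)),
        "references": float(len(re.findall(r"<ref[ >/]", wikitext))),
        "images": float(len(re.findall(r"\[\[File:", wikitext, re.IGNORECASE))),
        "categories": float(len(re.findall(r"\[\[Category:", wikitext, re.IGNORECASE))),
        "wikilinks": float(len(re.findall(r"\[\[", wikitext))),
        "sections": float(len(re.findall(r"(?m)^==+", wikitext))),
    }

def normalize(features: dict, wiki_max_values: dict) -> dict:
    # wiki_max_values is one row of the static per-wiki table (one column
    # per feature); counts above the per-wiki max are capped at 1.0.
    return {
        name: min(count / wiki_max_values[name], 1.0)
        for name, count in features.items()
    }
```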
If you have a minimal codebase that you used to run the first tests with the model, could you please share it?
- Training: https://public-paws.wmcloud.org/User:Isaac%20(WMF)/Quality/Quality_Model_Training.ipynb
- Evaluation: https://public-paws.wmcloud.org/User:Isaac%20(WMF)/Quality/Quality_Model_Evaluation.ipynb
- API: https://github.com/wikimedia/research-api-endpoint-template/blob/quality-article/model/wsgi.py
- UI to test it out: https://wiki-topic.toolforge.org/quality
State what team will own the model, and please share the main points of contact (see more info in Ownership of a model).
Research
What is the current latency and throughput of the model, if you have tested it? We don't need anything precise at this stage, just some ballpark numbers to figure out how the model performs with the expected inputs. For example, does the model take ms/seconds/etc. to respond to queries? How does it react when 1/10/20/etc. requests are made in parallel? If you don't have these numbers don't worry, open the task and we'll figure something out while we discuss next steps!
Anecdotally, generally under 500 ms with the wikitext -- it's just a simple API call and a few regexes. Latency might bump up slightly with the switch to HTML, both because the HTML takes a bit longer to process and because of increased latency in requesting the Parsoid HTML for older, likely-non-cached revisions, but the processing itself is quite fast.
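For reference, that API call looks roughly like this (a standard MediaWiki Action API request; the user agent and timeout are example values, not what a production service would use):

```python
import requests

def fetch_wikitext(lang: str, revid: int) -> str:
    # Fetch a revision's wikitext from the MediaWiki Action API.
    resp = requests.get(
        f"https://{lang}.wikipedia.org/w/api.php",
        params={
            "action": "parse",
            "oldid": revid,
            "prop": "wikitext",
            "format": "json",
            "formatversion": 2,
        },
        headers={"User-Agent": "quality-model-sketch (example)"},
        timeout=5,
    )
    resp.raise_for_status()
    return resp.json()["parse"]["wikitext"]
```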
Is there an expected frequency in which the model will have to be retrained with new data? What are the resources required to train the model and what was the dataset size?
Re-training should be quite infrequent -- perhaps the normalization table would be updated annually -- but this hasn't been discussed yet.
Have you checked if the output of your model is safe from a human rights point of view? Is there any risk of it being offensive for somebody? Even if you have any slight worry or corner case, please tell us!
Minimal risk -- the model only assesses article quality. It might be incorrect at times, but its simplicity makes it possible to understand why.
Anything else that is relevant in your opinion :)
- The model is still being improved, but if we get the current wikitext-based approach onto staging, it should be trivial to switch to the HTML-based approach when it is ready.