As a first step, let's evaluate what would be required to migrate the modeling part of the functionality to Lift Wing. This doesn't mean the full functionality of the entire recommendation application (i.e. Flask etc.) but rather specifically the modeling part (i.e. scikit-learn).
Description
Status | Subtype | Assigned | Task
---|---|---|---
Open | None | | T296994 Observations from research study for Section Translation on Thai Wikipedia
Open | None | | T293648 Content Translation Recommendations API
Resolved | | kevinbazira | T308164 Migrate Content Translation Recommendation API to Lift Wing
Declined | | calbon | T308165 Explore what would be required to migrate the content translation recommendation model to Lift Wing
Event Timeline
In order to deploy the content translation recommendation model to Lift Wing, we need to upload the model files to storage, so our Inference Services can download the binaries and mount them into the pod.
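For context: a Lift Wing Inference Service is a KServe predictor, so once the binary is in storage the serving side mostly reduces to loading the model in `load()` and scoring in `predict()`. Below is a minimal sketch of what that could look like for a scikit-learn model; the class name, storage path, and payload layout are my assumptions for illustration, not details of the actual service.

```python
import os
from typing import Dict

import joblib
import kserve


class TranslationRecommendationModel(kserve.Model):
    """Hypothetical KServe predictor wrapping a scikit-learn binary."""

    def __init__(self, name: str):
        super().__init__(name)
        # Lift Wing downloads the binary and mounts it into the pod;
        # the exact path and filename here are assumptions.
        self.model_path = os.environ.get("MODEL_PATH", "/mnt/models/model.pkl")
        self.model = None
        self.ready = False

    def load(self):
        self.model = joblib.load(self.model_path)
        self.ready = True

    def predict(self, payload: Dict, headers: Dict[str, str] = None) -> Dict:
        # Assumes the request already carries a feature matrix; the real
        # service would do its own feature extraction here.
        features = payload["instances"]
        return {"predictions": self.model.predict(features).tolist()}


if __name__ == "__main__":
    model = TranslationRecommendationModel("translation-recommendation")
    model.load()
    kserve.ModelServer().start([model])
```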
In this translation recommendation documentation, I see that Spark jobs are used for model training, but it doesn't mention where the trained model is stored.
I am working on finding out where the file for the model is located.
@kevinbazira chiming in here to try to help sort out the different services. Tagging @leila and @bmansurov too, who hopefully can correct/verify what I know:
If I go to the Content Translation tool, this is what I think happens:
- A request for recommendations is made to a research instance hosted on Cloud VPS. Here's an example of translation recommendations from English to Spanish (based on the network tab in my browser): https://recommend.wmflabs.org/types/translation/v1/articles?source=en&target=es&seed=&search=morelike&application=CX
- Sometimes the search parameter is related_articles -- Content Translation seems to switch between the two at random -- but as far as I know the related_articles endpoint is not up, so functionally the two do the same thing because the fallback is to morelike (code showing this, and I can see that log line show up occasionally in the uwsgi logs on the Cloud VPS instance).
- The seed parameter might also be set based on the editor's past translations (code).
- So that recommend.wmflabs API request triggers this code, which does one of two things depending on whether the seed parameter is set (sketched in Python after this list):
- If it is not set (no previous translations from the user): the API gets the most popular articles from the source language as the recommendations.
- If it is set (results will be personalized to be similar to those previous translations): the API uses the morelike Search API to get related articles to the seeds.
- In both cases, the API then filters the candidates down, mainly by removing articles that already exist in the target language.
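To make the flow concrete, here is a simplified sketch of that branching logic in Python. The helper names are mine, and the popularity source is a stand-in (the Wikimedia pageviews API); the morelike search and langlinks calls are standard MediaWiki APIs, but the service's actual code paths are the ones linked above.

```python
import requests


def get_popular_articles(source, limit=50):
    """Most-viewed articles on the source wiki (date hardcoded for the demo)."""
    url = (
        "https://wikimedia.org/api/rest_v1/metrics/pageviews/top/"
        f"{source}.wikipedia/all-access/2022/05/01"
    )
    articles = requests.get(url).json()["items"][0]["articles"]
    return [a["article"] for a in articles[:limit]]


def get_morelike_articles(source, seed, limit=50):
    """Articles similar to the seed, via the CirrusSearch morelike query."""
    params = {
        "action": "query",
        "list": "search",
        "srsearch": f"morelike:{seed}",
        "srlimit": limit,
        "format": "json",
    }
    results = requests.get(
        f"https://{source}.wikipedia.org/w/api.php", params=params
    ).json()["query"]["search"]
    return [r["title"] for r in results]


def exists_in_target(source, target, title):
    """True if the article already has a langlink to the target wiki."""
    params = {
        "action": "query",
        "prop": "langlinks",
        "titles": title,
        "lllang": target,
        "format": "json",
    }
    pages = requests.get(
        f"https://{source}.wikipedia.org/w/api.php", params=params
    ).json()["query"]["pages"]
    return any("langlinks" in page for page in pages.values())


def recommend(source, target, seed=None):
    # Personalized if a seed is set, otherwise fall back to popularity.
    if seed:
        candidates = get_morelike_articles(source, seed)
    else:
        candidates = get_popular_articles(source)
    # Filter out articles that already exist in the target language.
    return [t for t in candidates if not exists_in_target(source, target, t)]
```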
So nothing fancy algorithmically there -- just a layer between ContentTranslation and other APIs. This is also described in this documentation on Meta. And this is all happening in Python on a Cloud VPS instance called tool.recommendation-api. There's also a MediaWiki service that seems to do largely the same thing, but in Node.js rather than Python:
- For example, a similar service with endpoints described/testable here. @bmansurov would know more about the history of that service and why ContentTranslation is still using the Python instance on Cloud VPS instead of the Node.js version on MediaWiki.
Getting to your original question about the model/Spark jobs:
- That code seems to be for predicting how many pageviews an article might get if it were translated into a given language. This would improve the recommendation ranking so that instead of prioritizing content popular in the source language, the API prioritizes content that is likely to be popular in the target language. The vision is described in the paper here, but the code I linked to suggests that the features were scaled back to just sitelink/pageview data (a toy sketch of that scaled-back idea follows this list). I don't know if this was ever put into use outside of the experiments discussed in that paper.
- There is another model-like component that was under development, and that's the related_articles endpoint I mentioned earlier. This approach built embeddings of Wikidata items based on reader behavior and used them to find articles similar to those in the seed parameter (instead of using the morelike API); a conceptual sketch also follows this list. This is the code for the related_articles endpoint, but the endpoint here no longer exists and none of the embeddings etc. are on the recommend.wmflabs.org instance. I think the embeddings initially used were these, though I know @bmansurov was at one point a few years ago working on the jobs to produce a monthly update to them.
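On the first point: in its scaled-back form, the pageview model amounts to fitting a regressor on sitelink/pageview features and ranking untranslated candidates by predicted target-language popularity. A toy sketch with scikit-learn follows; the feature set, labels, and choice of regressor are my assumptions, not necessarily what the Spark jobs actually do.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Hypothetical training rows: articles that already exist in the target
# language, so their observed pageviews can serve as labels.
# Assumed features per row: [number of sitelinks, log source pageviews].
X = np.array([[12, 8.1], [3, 4.2], [45, 10.5], [7, 6.3], [21, 9.0]])
y = np.array([7.9, 2.1, 9.8, 4.4, 8.2])  # log target-language pageviews

model = GradientBoostingRegressor(random_state=0).fit(X, y)

# Rank not-yet-translated candidates by predicted popularity in the target.
candidates = np.array([[20, 7.5], [5, 9.0]])
ranking = np.argsort(-model.predict(candidates))
```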
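On the second point: conceptually, the related_articles endpoint is a nearest-neighbour lookup over precomputed Wikidata item embeddings, something like the sketch below. The embedding values and the in-memory dict are stand-ins; I don't know the format of the real embedding files.

```python
import numpy as np

# Toy stand-in for reader-behavior embeddings keyed by Wikidata item ID.
embeddings = {
    "Q42": np.array([0.1, 0.9, 0.3]),
    "Q1": np.array([0.8, 0.2, 0.5]),
    "Q2": np.array([0.2, 0.8, 0.4]),
}


def related_articles(seed_qid, k=2):
    """Return the k items whose embeddings are most similar to the seed."""
    seed = embeddings[seed_qid]
    seed = seed / np.linalg.norm(seed)
    scored = []
    for qid, vec in embeddings.items():
        if qid == seed_qid:
            continue
        # Cosine similarity between the seed and each candidate item.
        scored.append((float(np.dot(seed, vec / np.linalg.norm(vec))), qid))
    return [qid for _, qid in sorted(scored, reverse=True)[:k]]


print(related_articles("Q42"))  # -> ['Q2', 'Q1']
```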
A miscellaneous point about picking through the documentation -- my general understanding is that anything under the schana repos can be ignored. @bmansurov has worked on it most recently and I believe migrated the relevant code to the wikimedia organization repos. For instance, you'll see references to a missing sections endpoint and I don't believe that exists anymore (it's been superseded by another project led by Diego in collaboration with the Language team).
If I'm correct about this, replicating the existing service used by ContentTranslation might not require any model training and there may already be a production-level endpoint. However, @leila and @bmansurov should chime in about whether there are any data/model pipelines that would need porting to LiftWing and if so, where that code/data/models live.