
Explore what would be required to migrate the content translation recommendation model to Lift Wing
Closed, Declined (Public)

Description

As a first step, let's evaluate what would be required to migrate the modeling part of the functionality to Lift Wing. This doesn't mean the full functionality of the entire recommendation application (i.e., Flask, etc.) but rather specifically the modeling part (i.e., scikit-learn).

Code for the current content translation recommendation API

Event Timeline


In order to deploy the content translation recommendation model to Lift Wing, we need to upload the model files to storage, so our Inference Services can download the binaries and mount them into the pod.
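
Lift Wing serves models with KServe, so below is a minimal sketch of what the serving side could look like once the binary is in storage. This is illustrative only: the class name, the joblib serialization, the `/mnt/models/model.joblib` path, and the input format are all assumptions, not the actual service.

```lang=python
# Sketch of a KServe predictor for Lift Wing (assumptions: a
# joblib-serialized scikit-learn model, downloaded from object storage
# by the storage initializer and mounted into the pod at /mnt/models).
from typing import Dict

import joblib
import kserve


class TranslationRecommendationModel(kserve.Model):
    def __init__(self, name: str):
        super().__init__(name)
        self.model = None
        self.ready = False

    def load(self):
        # The storage initializer places the downloaded binary here
        # before the model server starts accepting requests.
        self.model = joblib.load("/mnt/models/model.joblib")
        self.ready = True

    def predict(self, payload: Dict, headers: Dict[str, str] = None) -> Dict:
        # Placeholder input format: {"instances": [[feature, ...], ...]}
        predictions = self.model.predict(payload["instances"])
        return {"predictions": predictions.tolist()}


if __name__ == "__main__":
    model = TranslationRecommendationModel("translation-recommendation")
    model.load()
    kserve.ModelServer().start([model])
```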

In this translation recommendation documentation, I see that Spark jobs are used for model training, but it doesn't mention where the trained model is stored.

I am working on finding out where the file for the model is located.

@kevinbazira chiming in here to try to help sort out the different services. Tagging @leila and @bmansurov too, who hopefully can correct/verify what I know:

If I go to the Content Translation tool, this is what I think happens:

So nothing fancy algorithmically there, just a layer between ContentTranslation and other APIs (a minimal sketch of such a call follows the bullet below). This is also described in this documentation on Meta. And this is all happening in Python on a Cloud VPS instance called tool.recommendation-api. There's also a MediaWiki service that seems to do largely the same thing, but in Node.js rather than Python:

  • For example, a similar service with endpoints described/testable here. @bmansurov would know more about the history of that service and why ContentTranslation is still using the Python instance on Cloud VPS instead of the Node.js version on MediaWiki.
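
To make that "layer over other APIs" concrete, here is a minimal sketch of the kind of upstream call involved. The morelike search is a real MediaWiki/CirrusSearch feature; whether the recommendation service calls it exactly this way is my assumption, and the seed title is just an example.

```lang=python
# Sketch: fetch articles similar to a seed title via MediaWiki's
# CirrusSearch "morelike" search. The recommendation service layers
# filtering/ranking on top of calls like this one (assumed, not verified).
import requests


def morelike(seed: str, source: str = "en", limit: int = 5):
    """Return titles of articles similar to `seed` on the source wiki."""
    resp = requests.get(
        f"https://{source}.wikipedia.org/w/api.php",
        params={
            "action": "query",
            "list": "search",
            "srsearch": f"morelike:{seed}",
            "srlimit": limit,
            "format": "json",
        },
        headers={"User-Agent": "liftwing-exploration-sketch"},
        timeout=10,
    )
    resp.raise_for_status()
    return [hit["title"] for hit in resp.json()["query"]["search"]]


print(morelike("Coffee"))
```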

Getting to your original question about the model/Spark jobs.

  • That code seems to be for predicting how many pageviews an article might get if it were translated into a given language. This would improve the recommendation ranking so that instead of prioritizing content popular in the source language, the API prioritizes content that is likely to be popular in the target language. The vision is described in the paper here, but the code I linked to suggests that the features were scaled back to just sitelink/pageview data (a toy sketch of such a model follows this list). I don't know if this was ever put into use outside of the experiments discussed in that paper.
  • There is another model-like component that was under development: the related articles endpoint that I mentioned earlier. This was an approach that built embeddings of Wikidata items from reader behavior and used them to find articles similar to those in the seed parameter (instead of the morelike API); a toy sketch of that lookup also follows this list. This is the code for the related_articles endpoint, but the endpoint here no longer exists, and none of the embeddings etc. are on the recommend.wmflabs.org instance. I think the embeddings initially being used were these, but I know @bmansurov was at one point a few years ago working on the jobs to produce a monthly update to them.
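
For the first bullet, a toy scikit-learn stand-in for the pageview-prediction model. The features, targets, and model family here are all invented for illustration; the real training runs as Spark jobs on actual wiki data.

```lang=python
# Toy stand-in for the pageview-prediction model: a scikit-learn
# regressor over made-up sitelink/pageview features. Illustrative only.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Hypothetical features per article: [number of sitelinks,
# log pageviews in the source language]; hypothetical target:
# log pageviews the article would get in the target language.
X = rng.random((1_000, 2))
y = X @ np.array([0.5, 1.2]) + rng.normal(0.0, 0.1, 1_000)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = GradientBoostingRegressor(random_state=0).fit(X_train, y_train)
print("held-out R^2:", round(model.score(X_test, y_test), 3))
```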
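
And for the second bullet, a toy version of an embedding-based related-articles lookup. The QIDs and vectors below are fabricated; the real endpoint used precomputed embeddings of Wikidata items built from reader sessions.

```lang=python
# Toy embedding-based "related articles" lookup. QIDs and vectors are
# fabricated; the real endpoint used precomputed Wikidata item embeddings.
import numpy as np
from sklearn.neighbors import NearestNeighbors

qids = ["Q42", "Q5", "Q937", "Q7251", "Q11573"]
embeddings = np.random.default_rng(1).random((len(qids), 50))

index = NearestNeighbors(n_neighbors=3, metric="cosine").fit(embeddings)


def related_articles(seed_qid: str) -> list:
    """Return the QIDs nearest to the seed in embedding space."""
    vec = embeddings[qids.index(seed_qid)].reshape(1, -1)
    _, neighbor_idx = index.kneighbors(vec)
    return [qids[i] for i in neighbor_idx[0] if qids[i] != seed_qid]


print(related_articles("Q42"))
```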

A miscellaneous point about picking through the documentation -- my general understanding is that anything under the schana repos can be ignored. @bmansurov has worked on it most recently and I believe migrated the relevant code to the wikimedia organization repos. For instance, you'll see references to a missing sections endpoint and I don't believe that exists anymore (it's been superseded by another project led by Diego in collaboration with the Language team).

If I'm correct about this, replicating the existing service used by ContentTranslation might not require any model training, and there may already be a production-level endpoint. However, @leila and @bmansurov should chime in about whether there are any data/model pipelines that would need porting to Lift Wing and, if so, where the code, data, and models live.

@Isaac thank you so much for the detailed breakdown.