
Improve the algorithm for translation recommendations
Closed, ResolvedPublic

Description

Currently, the translation recommendations in GapFinder are based on the top pageviews in the source language. In section 3.2 of https://arxiv.org/abs/1604.03235 we discuss a more advanced algorithm for ranking missing articles in the destination language. We should aim to implement that algorithm to improve recommendation quality and relevance to the destination language. Note that T158889 was designed as a first step towards this goal.

Event Timeline

leila created this task. Apr 13 2017, 4:18 PM
Restricted Application added a subscriber: Aklapper. · View Herald Transcript · Apr 13 2017, 4:18 PM
leila assigned this task to schana. Apr 13 2017, 4:19 PM
leila triaged this task as High priority.

It appears the model building code from the original research was removed from the recommendation-api repo. I have restored it in a new repo here: https://github.com/schana/model-building

After speaking with @JAllemandou, it looks like the majority of the features used in the paper are available in Hadoop. We could use Spark to gather the features and score the recommendations, and then potentially load that data for querying.
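To make that concrete, here is a rough sketch of how one of those features (pageviews, plus a rank normalized within each site) could be gathered with Spark. The source table, column names, output path, and the exact normalization are assumptions for illustration only:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, count, lit, rank, sum}

// Hypothetical sketch: sum one month of views per (site, page) and derive a
// rank normalized within each site. Table and column names are assumed.
val spark = SparkSession.builder().appName("gather-features").getOrCreate()

val monthly = spark.read.table("wmf.pageview_hourly")        // assumed source table
  .where(col("year") === 2017 && col("month") === 4)
  .groupBy(col("project").as("site"), col("page_title"))
  .agg(sum("view_count").as("pageviews"))

val bySite   = Window.partitionBy("site").orderBy(col("pageviews").desc)
val siteSize = Window.partitionBy("site")

val features = monthly
  .withColumn("rank", rank().over(bySite))
  .withColumn("normalized_rank", col("rank") / count(lit(1)).over(siteSize))

features.write.parquet("/user/example/trex/featureData")     // assumed output path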

Notes:

  • Edit info is currently refreshed on a monthly basis
  • Some features are not currently available:
    • Links
    • Topics

Questions:

  • How computationally expensive is scoring with the model?
  • Which is the best place to retrieve the recommendations from: Elasticsearch, Cassandra, or something else?
  • How will we query this data?
    • Rate
    • Response structure
    • Size of corpus

@leila Can you answer the question about the model, and whether the edit data being refreshed on a monthly basis will present a problem? Also, it seems like the majority of features don't rely on the target language. It may simplify the scores if they can be computed based only on a source language, independent of (source, target) pairs. Is this feasible, or do we need to accommodate all the permutations?

leila added a comment.May 1 2017, 1:12 PM

@schana:

  • Can you specify what you mean by "edit data"?
  • Re (source, target): it's important to compute the scores based on the (source, target) pair and not only the source. Relying on the source alone can impose what is important in the source language on the target language, without taking into account the demand and needs of the target language.

@leila:

Follow up question:

  • Instead of having the scores computed for a (source, target) pair, could it instead be only based on target? It seems like source is relevant because it's the language the editor will translate from, but not necessarily the best indicator for whether it will be read in target.
    • Ex: A is popular in languages X, Y, and Z, but only exists as a stub in S. It could be recommended for any of the popular languages, but not S. For an editor looking to translate from S to T, would it be better to translate a stub that would likely be important for T, or to perform no action?

@JAllemandou would what we discussed (the Spark job computing the scores) be feasible to do for all language pairs? Or for all languages?

leila added a comment.May 2 2017, 4:28 PM

Got it. If we want to keep the features used in the ranking consistent, then doing an update on Edit data features once a month means updating the ranking scores once a month. Correct?
Just for my understanding: what should happen if we have to do this job daily or weekly?

Follow up question:

  • Instead of having the scores computed for a (source, target) pair, could it instead be only based on target? It seems like source is relevant because it's the language the editor will translate from, but not necessarily the best indicator for whether it will be read in target.
    • Ex: A is popular in languages X, Y, and Z, but only exists as a stub in S. It could be recommended for any of the popular languages, but not S. For an editor looking to translate from S to T, would it be better to translate a stub that would likely be important for T, or to perform no action?

You and I talked about this briefly when we met just now. To capture our conversation:

  • The challenge we had was how to build a training set that includes both articles that exist in target language t and articles that don't. In the case of Wikipedia, if you look at t alone, you can't tell whether a missing article should stay missing or should be created. Considering a source language s was a trick to get around this problem. (It also enabled us to include features from the source that can predict articles in the target, which is another piece we needed.)
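A minimal sketch of that split, assuming a per-item table with exists flags and normalized ranks for s and t (all names here are illustrative, not the actual schema):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Items present in both s and t can be used for training, because the
// target's normalized rank is observable and can serve as the label.
def trainingSet(items: DataFrame): DataFrame =
  items.filter(col("exists_s") === 1.0 && col("exists_t") === 1.0)
    .withColumnRenamed("rank_t", "label")

// Items present in s but missing from t are the candidates whose would-be
// rank in t the model predicts, i.e. the recommendations to be ranked.
def candidateSet(items: DataFrame): DataFrame =
  items.filter(col("exists_s") === 1.0 && col("exists_t") === 0.0)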

In our meeting, you brought up another idea: instead of focusing on s, how about we focus on all other languages that include article A and get the features from all of them? From a scalability point of view, this approach would let us maintain about 300 models, as opposed to 30K+. Let me think about this some more and get back to you.

@JAllemandou would what we discussed (the Spark job computing the scores) be feasible to do for all language pairs? Or for all languages?

It possibly would; it really depends on how long a single pair takes.

leila added a comment.May 8 2017, 4:33 AM

In our meeting, you brought up another idea: instead of focusing on s, how about we focus on all other languages that include article A and get the features from all of them? From a scalability point of view, this approach would let us maintain about 300 models, as opposed to 30K+. Let me think about this some more and get back to you.

@schana My intuition tells me that your approach should work; however, we should compare the results of the two approaches on at least one example before making that call.

Can you focus on the (en,fr) example and generate a ranked list of articles using the original algorithm and then produce a separate ranked list using the expanded model (with the additional features from all languages)? We can then compare to see if there are major drawbacks in using the expanded model.

Note: some conversations have been happening elsewhere. The tl;dr is that I don't currently have the requisite ML knowledge to work with the codebase used in the research paper.

I was able to meet up with @EBernhardson at the hackathon and he helped to explain some of Ellery's code. I've started building out some code to learn Spark/Scala/Machine Learning and have made some progress here: https://github.com/schana/recommendation-translation

I've added subtasks corresponding to the milestones accomplished so far, for added visibility.

As a summary up to this point: the decision was made to pursue building models for all target languages, based on a subset of the features used in the research paper. The model is Spark MLlib's implementation of RandomForestRegressionModel with default parameters. Feature vectors currently consist of <pageviews, normalized rank, exists> for each site (where a site represents a target language), concatenated across all sites.

  • pageviews is read from the ez dump of the last month's views
  • normalized rank is the rank by pageviews, normalized within a given site
  • exists is whether the item exists in a given site, based on a wikidatawiki MySQL query

To build a model for a target site, the feature vectors are first filtered to only include items that exist in that target site. Then the target site's own <pageviews, normalized rank, exists> features are removed. The rest are fed into the regressor as features, with the target site's normalized rank as the label.
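A minimal sketch of what this per-target training step could look like with Spark MLlib; the column naming scheme, the assembler step, and the helper structure are assumptions for illustration (the actual implementation lives in the recommendation-translation repo linked above):

import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.regression.RandomForestRegressor
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Train one model per target site. `features` is assumed to hold one row per
// Wikidata item with columns <site>_pageviews, <site>_rank, <site>_exists.
def trainForTarget(features: DataFrame, target: String, allSites: Seq[String]) = {
  // Keep only items that already exist in the target site, so the target's
  // normalized rank can serve as the training label.
  val training = features.filter(col(s"${target}_exists") === 1.0)

  // Use every site's feature columns except the target's own.
  val inputCols = allSites.filter(_ != target).flatMap(site =>
    Seq(s"${site}_pageviews", s"${site}_rank", s"${site}_exists"))

  val assembler = new VectorAssembler()
    .setInputCols(inputCols.toArray)
    .setOutputCol("featureVector")

  new RandomForestRegressor()                     // default parameters
    .setFeaturesCol("featureVector")
    .setLabelCol(s"${target}_rank")
    .fit(assembler.transform(training))
}

// To score, the model is then applied to items that do not yet exist in the
// target site, and the missing items are ranked by the predicted value.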

I've connected the plumbing and there's now a visual interface here: http://recommend-alpha.wmflabs.org/translation_test

I've loaded the computed predictions from hadoop for en and de. Feel free to look at them using this query for de to en, this query for en to de, or through the test website.

schana added a comment. Edited Jun 30 2017, 4:00 PM

Stats from the first completed run on Hadoop (from https://yarn.wikimedia.org/cluster/app/application_1498042433999_26206):
Started: Wed Jun 28 18:27:55 +0000 2017
Elapsed: 35hrs, 43mins, 23sec
Aggregate Resource Allocation: 5925723236 MB-seconds, 2186133 vcore-seconds
Resulting file sizes:

nschaaf@stat1002:~$ hadoop fs -du -s -h /user/nschaaf/trex/output/2017-06-28-182809_models
42.6 M  127.7 M  /user/nschaaf/trex/output/2017-06-28-182809_models
nschaaf@stat1002:~$ hadoop fs -du -s -h /user/nschaaf/trex/output/2017-06-28-182809_predictions
34.8 G  104.3 G  /user/nschaaf/trex/output/2017-06-28-182809_predictions

Invocation:

feature=/user/nschaaf/trex/output/2017-06-20-205914_featureData
parsed=/user/nschaaf/trex/output/2017-06-20-205914_parsedData
./spark-2.1.1-bin-hadoop2.6/bin/spark-submit \
  --class org.wikimedia.research.recommendation.job.translation.JobRunner \
  --driver-memory 8g --master yarn --deploy-mode client \
  --num-executors 4 --executor-memory 10g --executor-cores 4 \
  ./ResearchRecommendation-assembly-1.0.jar \
  -o /user/nschaaf/trex/output -bs -p $parsed -f $feature

I've loaded some more computed predictions: enwiki, hzwiki, muswiki, krwiki, iiwiki, dewiki, frwiki, ruwiki, itwiki, eswiki, jawiki, plwiki, zhwiki, ptwiki, vecwiki, nlwiki, sahwiki, vlswiki, kvwiki, wuuwiki

All the scores are available on hadoop at /user/nschaaf/trex/output/2017-06-28-182809_predictions or stat1002 at /home/nschaaf/trex/scores

I'm currently loading all the scores - the service will be unavailable until this is complete.

schana added a comment. Edited Jul 3 2017, 5:33 PM

The scores are loaded, but without indices built, queries don't complete in a reasonable time. I'm going to start building the indices for the table.

I've built the indices alphabetically through idwiki, but am running out of disk space on the labs instance. So far the indices alone are using approx 55GB. I've stopped building them in the meantime.

This means that queries with the target specified as <=idwiki will be fast, but others will time-out.

Indices updated through scnwiki after moving missing_sections database elsewhere. Database + indices are currently using approx 126GB.

I've added the ability to specify a seed for querying the dataset.

leila added a comment.Jul 13 2017, 4:42 PM

@schana following up on the discussion we just had: please provide a couple of test examples where you have tested the changes and, based on those, assess whether the new algorithm does better than the old one. I will dig into it more deeply afterwards; I need to make sure the improvements are actual improvements from your point of view before spending more time on it. Thanks.

@leila, here are some examples to compare:

@schana if I only specify the seed in the alpha version of the app, error messages are displayed as raw JSON, e.g.

{"errors":{"source":"Missing required parameter in the JSON body or the post body or the query string"},"message":"Input payload validation failed"}

Is this intentional for testing purposes? cc @leila

@DarTar that is currently the behavior of the alpha version.

leila added a comment.Jul 25 2017, 9:39 PM

@schana a few questions as I'm testing the improved API:

As a summary up to this point, the decision was made to pursue building models for all target languages based on a subset of features used in the research paper. The model used is Spark MLlib's implementation of RandomForestRegressionModel with default parameters. Feature vectors are currently <pageviews, normalized rank, exists> * sites for each site (which represents a target language).

  • pageviews is being read from the ez dump of the last month's views

What is the logic behind moving from the 6-month window used in the original experiment to 1 month? Part of the reason for including more months was to capture some notion of stability in views (versus articles that receive high traffic only in a given month).

  • normalized rank is the normalized rank by pageviews within a given site
  • exists is whether the item exists in a given site based on a wikidatawiki mysql query

I'm curious: now that you have 'exists', why didn't you include Wikidata_count? That was one of the strong predictors in the models built for the experiment.

One question about the outcome: what is the outcome that your model is predicting?

@DarTar that is currently the behavior of the alpha version.

ok, that's cool.

schana added a comment. Edited Jul 25 2017, 9:55 PM

@schana a few questions as I'm testing the improved API:
What is the logic behind moving from the 6-month window used in the original experiment to 1 month? Part of the reason for including more months was to capture some notion of stability in views (versus articles that receive high traffic only in a given month).

The main reason was to limit the computational requirements. This is by no means set in stone, but Joseph wanted to be careful about running the job over 6 months of data, so we'll need to coordinate with Analytics before running that job.
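For what it's worth, moving back to a longer window is mostly a matter of aggregating over more monthly snapshots before the feature vectors are built; the cost grows with the number of snapshots read. A rough sketch, with assumed paths and column names:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum

// Hypothetical: aggregate views over the last six monthly snapshots instead
// of one, so short-lived traffic spikes carry less weight.
val spark = SparkSession.builder().appName("six-month-views").getOrCreate()
val months = Seq("2017-01", "2017-02", "2017-03", "2017-04", "2017-05", "2017-06")
val sixMonthViews = months
  .map(m => spark.read.parquet(s"/user/example/pageviews/$m"))  // assumed layout
  .reduce(_ union _)
  .groupBy("site", "page_title")
  .agg(sum("pageviews").as("pageviews_6mo"))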

I'm curious: now that you have 'exists', why didn't you include Wikidata_count? That was one of the strong predictors in the models built for the experiment.

Does it make a difference to have that as a distinct feature instead of letting it be inferred from the exists features?
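For illustration, if a distinct count did turn out to help, it could be derived directly from the per-site exists flags that are already part of the feature vector (names here are hypothetical):

// Hypothetical: a Wikidata sitelink count derived from the per-site exists flags.
def wikidataCount(existsFlags: Seq[Double]): Double = existsFlags.sum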

One question about the outcome: what is the outcome that your model is predicting?

It's predicting the normalized rank.

bmansurov closed this task as Resolved.Apr 25 2019, 5:16 PM
bmansurov moved this task from For Review to Done on the Recommendation-API board.
bmansurov added a subscriber: bmansurov.