
Improve the prioritization algorithm used in recommendation API
Closed, Resolved (Public)


The paper "Growing Wikipedia Across Languages via Recommendation" (pdf) has a section (section 2.2) about ranking missing articles. See T190774#4189253 for details.


  • Figure out which of the features in Section 2.2 we can compute at scale and on a regular basis.

See T190774#4192169.

  • Take the features from the previous step and use the model in Section 2.2 to build a prediction algorithm for every language.

See T190774#4319244

  • Discuss how to surface recommendations.

See T190774#4319700

Previous work


Event Timeline

leila triaged this task as High priority.
leila subscribed.

@leila, can we add a checklist of items here? I can add them if you can point me to the right direction. Thanks.

@bmansurov sorry for the long delay.

Let's focus on the translation task for the moment to stay as close as possible to the setup in the paper. In this case, and at the moment, we work with a fixed source and destination language pair and a seed article.

The basic model for the working of GapFinder at the moment is that the user fixes source_language, destination_language, and seed_article, and through more_like in ElasticSearch, the API shows a subset of the articles that are available in source_language, missing in destination_language, and similar to seed_article. (Please correct me if I'm wrong.)

We want to move away from this model as it does no real prioritization of the articles based on user interests (the latter provided by seed_article). Basically, we want an algorithm that is as close as possible to the algorithm in Section 2.2 of the paper, but we don't want to worry about Section 2.3 for now. (For this latter section, we can continue to rely on the user to provide a seed_article and continue using more_like to find what is of most interest to the user, but over a ranked list of missing articles as opposed to the space of all missing articles.)

If the above makes sense to you, what we basically need to do is figure out which of the features listed in "Features" in Section 2.2 can be computed at scale (across all languages, and assuming that we need to refresh the values every n weeks). Once we know which features we can include, we can build a prediction model using those features that predicts the number of pageviews an article would receive if it were created in destination_language. We can use the scores from this model to rank missing articles in a given language. (I see a link to Spark job feature extraction which may be the place where we started to move in this direction before you joined the team.)

To summarize:

  1. Figure out which of the features in Section 2.2 we can compute at scale and on a regular basis.
  2. Take the features from the previous step and use the model in Section 2.2 to build a prediction algorithm for every language.

Once you have the two items above, we should discuss how we want to surface these new recommendations to the user. One approach is to use the seed_article and more_like to provide a ranked list of articles that are similar to seed_article and the ranking is done by the algorithm. (Instead of the default ranking model in more_like.)
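The surfacing step described above can be sketched in a few lines. This is a hypothetical illustration, not the eventual implementation: `rerank_by_model` stands in for the reranking step, and the scores would come from the per-language pageview-prediction model rather than the literals used here.

```python
# Sketch of the proposed surfacing step: take the candidate articles that
# more_like returns for a seed_article and rerank them by the prediction
# model's score instead of ElasticSearch's default relevance ordering.

def rerank_by_model(candidates, model_scores):
    """Order more_like candidates by predicted pageviews (descending).

    candidates   -- list of article titles returned by more_like
    model_scores -- dict mapping title -> predicted pageviews if created
    Candidates without a score sink to the bottom.
    """
    return sorted(
        candidates,
        key=lambda title: model_scores.get(title, float("-inf")),
        reverse=True,
    )

# Toy illustration: the model's scores override more_like's original order.
candidates = ["Article A", "Article B", "Article C"]
scores = {"Article A": 120.0, "Article B": 940.0, "Article C": 310.0}
print(rerank_by_model(candidates, scores))  # ['Article B', 'Article C', 'Article A']
```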

Let me know if something is unclear.


Here is the list of features mentioned in the paper.

Wikidata count

Below (try it) is a SPARQL query that we can use to compute the number of Wikipedia articles that link to a specific Wikidata item.

SELECT ?item (COUNT(?sitelink) as ?count) WHERE {
  VALUES ?item { wd:Q1 wd:Q2 wd:Q3 }
  ?sitelink schema:about ?item
  FILTER REGEX(STR(?sitelink), "")
} GROUP BY ?item
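Since we'd run this for many items, the query above could be generated programmatically in batches. The sketch below only builds the query string (the function name and template are mine); the empty `FILTER REGEX` pattern is kept as a placeholder, to be filled with whatever site pattern the task settles on before POSTing to the WDQS endpoint.

```python
# Hedged sketch: build the sitelink-count SPARQL query for a batch of
# Wikidata item IDs. Braces are doubled because str.format is used.

QUERY_TEMPLATE = """\
SELECT ?item (COUNT(?sitelink) AS ?count) WHERE {{
  VALUES ?item {{ {items} }}
  ?sitelink schema:about ?item
  FILTER REGEX(STR(?sitelink), "{site_pattern}")
}} GROUP BY ?item
"""

def build_sitelink_query(qids, site_pattern=""):
    """Return a WDQS query counting sitelinks for the given Q-ids."""
    items = " ".join("wd:" + qid for qid in qids)
    return QUERY_TEMPLATE.format(items=items, site_pattern=site_pattern)

print(build_sitelink_query(["Q1", "Q2", "Q3"]))
```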

Alternatively, we can consider all sitelinks (as opposed to Wikipedia sitelinks only). The advantage is that those sitelink counts are pre-computed, so we don't have to count them ourselves. Here's (try it) a sample query:

SELECT ?item ?linkcount WHERE {
  VALUES ?item { wd:Q1 wd:Q2 wd:Q3 }
  ?item wikibase:sitelinks ?linkcount
}

The running time of Wikidata queries is limited to 60 seconds. In order to run longer-running queries, we'll have to talk to Discovery-ARCHIVED about our use case.

An alternative to querying WDQS is parsing Wikidata dumps with Wikidata Toolkit. In order to speed up the process, we'll have to copy the dumps to the Analytics cluster and run a Spark job to extract the information we need. Here's previous work done by Ellery on this. Joseph has also done some work on this. Wikidata dumps as of January 2018 are available in JSON and Parquet formats at hdfs://user/joal/wikidata.
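To illustrate the dump-parsing route: Wikidata JSON dumps carry one entity per line, and each entity has a "sitelinks" object keyed by site ID (e.g. "enwiki"). Counting Wikipedia sitelinks then reduces to counting the keys that end in "wiki", minus a few non-Wikipedia sites. This is a toy sketch over one inline entity; in practice the same per-entity function would run inside the Spark job, and the exclusion set below is an assumption that would need to be made exhaustive.

```python
import json

# Site IDs that end in "wiki" but are not Wikipedias (assumed, incomplete list).
NON_WIKIPEDIA = {"commonswiki", "specieswiki", "metawiki", "mediawikiwiki", "wikidatawiki"}

def wikipedia_sitelink_count(entity):
    """Count sitelinks of one dump entity that point at a Wikipedia."""
    return sum(
        1
        for site in entity.get("sitelinks", {})
        if site.endswith("wiki") and site not in NON_WIKIPEDIA
    )

# One dump line, abbreviated to the fields this sketch uses.
sample_line = json.dumps({
    "id": "Q42",
    "sitelinks": {
        "enwiki": {"site": "enwiki", "title": "Douglas Adams"},
        "ruwiki": {"site": "ruwiki", "title": "Адамс, Дуглас"},
        "commonswiki": {"site": "commonswiki", "title": "Douglas Adams"},
        "enwikiquote": {"site": "enwikiquote", "title": "Douglas Adams"},
    },
})
entity = json.loads(sample_line)
print(entity["id"], wikipedia_sitelink_count(entity))  # Q42 2
```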

Page views

The wmf.pageview_hourly table (available on Hive) can be used to get page views in each of the top 50 wikipedias. We'll have to agree on what top in this context means. Here is a sample query. More info.
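The linked sample query isn't reproduced here, but an aggregation over wmf.pageview_hourly would look roughly like the HiveQL below (shown as a string for reference). The column names follow the pageview_hourly schema; the specific month and the `agent_type = 'user'` filter are illustrative choices, not the task's canonical query.

```python
# Illustrative HiveQL shape for monthly per-page pageview totals,
# restricted to human traffic. Partition columns: year, month.
PAGEVIEWS_HQL = """
SELECT project, page_title, SUM(view_count) AS views
FROM wmf.pageview_hourly
WHERE year = 2018 AND month = 5
  AND agent_type = 'user'
GROUP BY project, page_title
"""

print(PAGEVIEWS_HQL.strip().splitlines()[0])
```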

Geo page views

The wmf.pageview_hourly table (see above) can be used to query pages by country too.

Source-article length

The page table contains the page_len property.

Quality and importance classes

In order to use WikiProjects, each wiki has to be looked at separately. For example, template names are localized, so we first have to identify the templates that indicate an article's importance.

Edit Activity

We can use the revision table to query the article creation and last edit dates, along with the number of editors of the article.


Links

The pagelinks table can be used to query inlinks and outlinks.


Topics

As suggested by the paper, using gensim is an option. It has two kinds of implementations of LDA: ldamodel and ldamulticore. Judging by the performance numbers given on the above links, the analytics cluster should be able to handle this workload easily.


Every feature, except for "Quality and importance classes", can be used without too much human intervention. Those features are: "page views", "geo page views", "source article length", "edit activity", "links", and "topics". The "Quality and importance classes" requires an initial human intervention of identifying wiki/language specific templates and indicators, after which it can be automated too.

I've created a script that ranks articles based on Wikidata sitelinks:

The next step is to integrate the pageviews that an article in the top 50 Wikipedias received over the last 6 months.
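One hypothetical way that integration could look: normalize each feature to [0, 1] so neither sitelink counts nor raw pageviews dominate, then average them into a single ranking score. The equal weighting is my assumption for illustration, not the model from the paper.

```python
def minmax(values):
    """Scale a list of numbers to [0, 1]; constant lists map to 0.0."""
    lo, hi = min(values), max(values)
    return [0.0 if hi == lo else (v - lo) / (hi - lo) for v in values]

def combined_scores(sitelinks, pageviews):
    """Average the normalized sitelink count and pageview count per article."""
    s, p = minmax(sitelinks), minmax(pageviews)
    return [(a + b) / 2 for a, b in zip(s, p)]

# Three hypothetical articles: feature values are made up.
sitelinks = [120, 4, 37]
pageviews = [90000, 1500, 22000]
print(combined_scores(sitelinks, pageviews))
```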

Model and Data

Analyzing results

Using the script's output, we get the article title, Wikidata ID, rank, and pageviews. We compare the rank to the normalized rank predicted by the models. The higher the rank, the more pageviews the article is expected to receive if created in the target wiki.

Test 1

Source: Russian
Target: Ukrainian
Seed article: Страна

  • Rank provided by the current algorithm indicates that these suggestions are closely related to each other. Normalized ranks also indicate the same thing.
  • Related articles don't make much sense. We need to work on improving this, probably by using the algorithm from the paper (Section 2.1).
  • Some articles don't have Wikidata IDs in the current Wikidata dumps. We need to update Wikidata dumps before training more models and making predictions.

Test 2

Source: English
Target: Russian
Seed article: Seoul


Here also we can draw the same conclusions as above.

Test 3 (Manual test)

Source: English
Target: Russian

We also sort predictions by normalized rank and look at their sitelinks count to see if the normalized ranks make sense. The current iteration of the models takes sitelinks, pageviews, normalized pageviews, and log pageviews into account, so this is an approximate comparison: we don't expect the sitelinks count to map to normalized ranks perfectly.

Article | Wikidata ID | Normalized rank
Bettina Pousttchi | Q1013862 | 0.000135394133573
Piz Dora | Q17134124 | 0.00133051918878

Surfacing recommendations

English (source) - Spanish (target) has 4,712,255 predictions, weighing in at 123 MB. The other languages have fewer predictions. In order to surface these recommendations efficiently, we depend on T193746. This assumes that we'll be developing the Node.js version of the algorithm as opposed to the Python version.

DarTar edited projects, added Research-Archive; removed Research.