
Improve the prioritization algorithm used in recommendation API
Closed, Resolved · Public

Description

The paper "Growing Wikipedia Across Languages via Recommendation" (pdf) has a section (section 2.2) about ranking missing articles. See T190774#4189253 for details.

A/C

  • Figure out which of the features in Section 2.2 we can compute at scale and on a regular basis.

See T190774#4192169.

  • Take the features from the previous step and use the model in Section 2.2 to build a prediction algorithm for every language.

See T190774#4319244.

  • Discuss how to surface recommendations.

See T190774#4319700.

Previous work

  1. https://github.com/schana/recommendation-translation

Event Timeline

leila triaged this task as High priority.
leila subscribed.

@leila, can we add a checklist of items here? I can add them if you can point me in the right direction. Thanks.

@bmansurov sorry for the long delay.

Let's focus on the translation task for the moment to stay as close as possible to the setup in https://arxiv.org/pdf/1604.03235.pdf .

The basic model for how GapFinder works at the moment is that the user fixes source_language, destination_language, and seed_article, and through more_like in Elasticsearch the API shows a subset of the articles that are available in source_language, missing in destination_language, and similar to seed_article. (Please correct me if I'm wrong.)

We want to move away from this model, as it does no real prioritization of the articles based on user interests (the latter captured by seed_article). Basically, we want an algorithm that is as close as possible to the algorithm in Section 2.2 of https://arxiv.org/pdf/1604.03235.pdf, without worrying about Section 2.3 for now. (For that latter section, we can continue to rely on the user to provide a seed_article and continue using more_like to find what is of most interest to the user, but within a ranked list of missing articles as opposed to the space of all missing articles.)

If the above makes sense to you, what we basically need to do is figure out which of the features listed under "Features" in Section 2.2 can be computed at scale (across all languages, and assuming that we need to refresh the values every n weeks). Once we know which features we can include, we can build a prediction model using those features that predicts the number of pageviews an article would receive if it were created in destination_language. We can use the scores from this model to rank missing articles in a given language. (I see a link to a Spark feature-extraction job at https://www.mediawiki.org/wiki/GapFinder/Developers, which may be where we started to move in this direction before you joined the team.)

To summarize:

  1. Figure out which of the features in Section 2.2 we can compute at scale and on a regular basis.
  2. Take the features from the previous step and use the model in Section 2.2 to build a prediction algorithm for every language.

Once you have the two items above, we should discuss how we want to surface these new recommendations to the user. One approach is to use seed_article and more_like to provide a ranked list of articles that are similar to seed_article, where the ranking is done by the new algorithm instead of the default ranking model in more_like.

Let me know if something is unclear.

Features

Here is the list of features mentioned in the paper.

Wikidata count

Below (try it) is a SPARQL query that we can use to compute the number of Wikipedia articles connected (via sitelinks) to a specific Wikidata item.

# Count, per item, the sitelinks that point to a Wikipedia
# (dots escaped so they match literally).
SELECT ?item (COUNT(?sitelink) AS ?count) WHERE {
  VALUES ?item { wd:Q1 wd:Q2 wd:Q3 }
  ?sitelink schema:about ?item .
  FILTER REGEX(STR(?sitelink), "\\.wikipedia\\.org/wiki/")
} GROUP BY ?item

Alternatively, we can consider all sitelinks (as opposed to Wikipedia sitelinks only). The advantage is that the sitelink count is precomputed on each item, so we don't have to count the links ourselves. Here's (try it) a sample query:

SELECT ?item ?linkcount WHERE {
  VALUES ?item { wd:Q1 wd:Q2 wd:Q3 }
  ?item wikibase:sitelinks ?linkcount
}

The running time of Wikidata queries is limited to 60 seconds. In order to run longer queries, we'll have to talk to Discovery-ARCHIVED about our use case.

An alternative to querying WDQS is parsing the Wikidata dumps with Wikidata toolkit. In order to speed up the process, we'll have to copy the dumps to the Analytics cluster and run a Spark job to extract the information we need. Here's previous work done by Ellery on this. Joseph has also done some work on this. Wikidata dumps as of January 2018 are available in JSON and Parquet formats at hdfs://user/joal/wikidata.
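Roughly, the extraction could look like the sketch below. The dump path comes from the sentence above; the id, siteLinks, and site column names are my assumptions about the Parquet schema, so check them against the actual dump first.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName('wikidata-sitelink-counts').getOrCreate()

# Path from above; 'id', 'siteLinks', and 'site' are assumed column
# names -- verify against the actual Parquet schema before running.
items = spark.read.parquet('hdfs://user/joal/wikidata')

sitelink_counts = (
    items
    .select('id', F.explode('siteLinks').alias('sitelink'))
    # Wikipedia site IDs look like 'enwiki', 'ruwiki', etc. This suffix
    # match is approximate: it also catches e.g. 'commonswiki'.
    .where(F.col('sitelink.site').endswith('wiki'))
    .groupBy('id')
    .count()
    .withColumnRenamed('count', 'wikipedia_sitelinks')
)

sitelink_counts.write.mode('overwrite').parquet('wikidata_sitelink_counts')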

Page views

The wmf.pageview_hourly table (available on Hive) can be used to get page views in each of the top 50 Wikipedias. We'll have to agree on what "top" means in this context. Here is a sample query. More info.
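For illustration, a minimal PySpark sketch of the kind of aggregation we'd run; the year/month values and the project are placeholders, and filtering agent_type to 'user' (to exclude spider traffic) is my assumption about how we'd want to count views.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Monthly per-article pageviews for one wiki; partition values are
# placeholders, and agent_type = 'user' excludes spider traffic.
monthly_views = spark.sql("""
    SELECT page_title, SUM(view_count) AS views
    FROM wmf.pageview_hourly
    WHERE year = 2018 AND month = 5
      AND project = 'ru.wikipedia'
      AND agent_type = 'user'
    GROUP BY page_title
""")
monthly_views.show(20)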

Geo page views

The wmf.pageview_hourly table (see above) can be used to query pages by country too.
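The same sketch grouped by country_code (placeholder values as before):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Per-article, per-country pageviews for one wiki.
geo_views = spark.sql("""
    SELECT page_title, country_code, SUM(view_count) AS views
    FROM wmf.pageview_hourly
    WHERE year = 2018 AND month = 5
      AND project = 'ru.wikipedia'
      AND agent_type = 'user'
    GROUP BY page_title, country_code
""")
geo_views.show(20)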

Source-article length

The page table contains the page_len property.
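For example (a sketch; wmf_raw.mediawiki_page and the snapshot/wiki_db values are my assumptions about the sqooped copy of the table on the cluster, and the equivalent query can be run against the MediaWiki replicas instead):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Lengths (in bytes) of main-namespace pages; table and partition
# names are assumptions, see the note above.
page_lengths = spark.sql("""
    SELECT page_title, page_len
    FROM wmf_raw.mediawiki_page
    WHERE snapshot = '2018-05' AND wiki_db = 'enwiki'
      AND page_namespace = 0
""")
page_lengths.show(20)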

Quality and importance classes

In order to use WikiProjects, each wiki has to be looked at separately. For example, template names are localized, so we first have to identify the templates that indicate an article's importance.

Edit Activity

We can use the revision table to query the article creation and last edit dates, along with the number of editors of the article.
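A sketch of that query, under the same assumption that a sqooped copy of the revision table is available on the cluster:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Creation date, last edit date, and distinct editor count per page;
# table and partition names are assumptions, as above.
edit_activity = spark.sql("""
    SELECT rev_page,
           MIN(rev_timestamp) AS created,
           MAX(rev_timestamp) AS last_edited,
           COUNT(DISTINCT rev_user) AS editors
    FROM wmf_raw.mediawiki_revision
    WHERE snapshot = '2018-05' AND wiki_db = 'enwiki'
    GROUP BY rev_page
""")
edit_activity.show(20)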

Links

The pagelinks table can be used to query inlinks and outlinks.
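For instance, inlink counts per title (a sketch; wmf_raw.mediawiki_pagelinks is my assumption about where a sqooped copy would live, and outlinks are the same query grouped by pl_from instead):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Number of main-namespace pages linking to each title; table and
# partition names are assumptions, as above.
inlink_counts = spark.sql("""
    SELECT pl_title, COUNT(*) AS inlinks
    FROM wmf_raw.mediawiki_pagelinks
    WHERE snapshot = '2018-05' AND wiki_db = 'enwiki'
      AND pl_namespace = 0
    GROUP BY pl_title
""")
inlink_counts.show(20)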

Topics

As suggested by the paper, using gensim is an option. It has two implementations of LDA: ldamodel and ldamulticore. Judging by the performance numbers given at the above links, the Analytics cluster should be able to handle this workload easily.
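A minimal gensim sketch with toy documents; in practice the corpus would be built from the article text dumps:

from gensim.corpora import Dictionary
from gensim.models import LdaMulticore

# Toy tokenized documents standing in for real article text.
docs = [
    ['wikipedia', 'article', 'language', 'translation'],
    ['recommendation', 'ranking', 'article', 'pageviews'],
    ['language', 'translation', 'ranking'],
]

dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]

# LdaMulticore parallelizes training across local worker processes.
lda = LdaMulticore(corpus, id2word=dictionary, num_topics=2, workers=2)

# Topic distribution for a new document.
print(lda[dictionary.doc2bow(['article', 'translation'])])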

Conclusion

Every feature, except for "Quality and importance classes", can be used without much human intervention. Those features are: "page views", "geo page views", "source-article length", "edit activity", "links", and "topics". The "Quality and importance classes" feature requires the initial human intervention of identifying wiki/language-specific templates and indicators, after which it can be automated too.

I've created a script that ranks articles based on Wikidata sitelinks: https://github.com/wikimedia/research-translation-recommendation-models/

The next step is to integrate the pageviews an article received in the top 50 Wikipedias over the last 6 months.
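A hypothetical sketch of what that integration could look like; the feature scaling, weights, and data below are made up for illustration, and the actual model lives in the repository linked above.

import pandas as pd

# Made-up data: sitelink counts plus six months of pageviews per item.
df = pd.DataFrame({
    'wikidata_id': ['Q1', 'Q2', 'Q3'],
    'sitelinks': [300, 12, 45],
    'pageviews_6m': [1200000, 3400, 56000],
})

# Scale each feature to [0, 1] and combine with illustrative weights.
for col in ('sitelinks', 'pageviews_6m'):
    df[col + '_norm'] = df[col] / df[col].max()
df['score'] = 0.5 * df['sitelinks_norm'] + 0.5 * df['pageviews_6m_norm']

# Percentile rank, analogous to the normalized rank reported below.
df['normalized_rank'] = df['score'].rank(pct=True)
print(df.sort_values('normalized_rank', ascending=False))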

Model and Data

Analyzing results

Using https://recommend.wmflabs.org/ we get the article title, Wikidata ID, rank, and pageviews. We compare that rank to the normalized rank predicted by the models. The higher the normalized rank, the more pageviews the article is expected to receive if created in the target wiki.

Test 1

Source: Russian
Target: Ukrainian
Seed article: Страна ("Country") https://ru.wikipedia.org/wiki/Страна

title | wikidata_id | rank | pageviews | normalized_rank
Политика_памяти | Q1422907 | 4471 | 91 | 1.57540699415e-05
Гань_Ин | Q315678 | 4484 | 9 | 1.57540699415e-05
Языки_Нидерландов | Q2525752 | 4521 | 615 | 1.57540699415e-05
Диалекты_немецкого_языка | Q2306552 | 4615 | 30 | 1.57540699415e-05
Словарь_географических_названий_зарубежных_стран | Q4423790 | 4722 | 16 | n/a
Внешняя_политика_Турции | Q373506 | 4752 | 78 | n/a
Романовская_Империя | Q7382038 | 4777 | 26 | n/a
Женевская_инициатива | Q869709 | 4787 | 2 | n/a
Базис_(политология) | Q16535129 | 4801 | 1 | 1.57540699415e-05
Россияне | Q492468 | 4811 | 182 | n/a
Большая_Лангобардия | Q2713566 | 4848 | 8 | 1.57540699415e-05
Лондонское_соглашение_(1906) | Q53844437 | 4941 | 84 | n/a
  • The ranks provided by the current algorithm are close to one another, which indicates that these suggestions are closely related to each other. The normalized ranks indicate the same thing.
  • The related articles don't make much sense. We need to work on improving this, probably by using the algorithm from the paper (Section 2.1).
  • Some articles don't have Wikidata IDs in the current Wikidata dumps. We need to update the Wikidata dumps before training more models and making predictions.

Test 2

Source: English
Target: Russian
Seed article: Seoul https://en.wikipedia.org/wiki/Seoul

title | wikidata_id | rank | pageviews | normalized_rank
Yongin_Daejanggeum_Park | Q15622253 | 4811 | 55 | 0.000135394133573
Tomb_of_Princess_Jeongseon | Q48740838 | 4831 | 4 | n/a
Sinansan_Line | Q484137 | 4848 | 7 | 0.000135394133573
Irwon_station | Q100870 | 4851 | 2 | 0.000135394133573
Cheongjin-dong | Q5091642 | 4884 | 0 | 0.000135394133573
Alone_in_Love | Q623446 | 4891 | 315 | 0.000135394133573
Beer_in_South_Korea | Q4880006 | 4912 | 536 | 0.000135394133573
Yeongcheon-dong | Q2681506 | 4937 | 0 | 0.000135394133573
Taepyeongno | Q704512 | 4943 | 9 | 0.000135394133573
Yoo_(Korean_surname) | Q699742 | 4956 | 85 | 0.000135394133573
Tears_of_the_Dragon_(TV_series) | Q624867 | 4972 | 71 | n/a
Namdaemun_Market | Q494687 | 4981 | 73 | 0.000135394133573

Here, too, we can draw the same conclusions as above.

Test 3 (Manual test)

Source: English
Target: Russian

We also sort predictions by normalized rank and look at their sitelink counts to see if the normalized ranks make sense. The current iteration of the models takes sitelinks, pageviews, normalized pageviews, and log pageviews into account, so this is an approximate comparison: we don't expect the sitelink count to map to normalized ranks perfectly.

title | wikidata_id | sitelinks | normalized_rank
Bettina Pousttchi | Q101386 | 2 | 0.000135394133573
Piz Dora | Q1713412 | 4 | 0.00133051918878
Neuvireuil | Q1000008 | 32 | 0.000135394133573

Surfacing recommendations

English (source) to Spanish (target) has 4,712,255 predictions weighing 123 MB; the other language pairs have fewer predictions. In order to surface these recommendations efficiently, we depend on T193746. This assumes that we'll be developing the Node.js version of the algorithm as opposed to the Python version.

DarTar edited projects, added Research-Archive; removed Research.