
Improve the prioritization algorithm used in recommendation API
Closed, Resolved · Public

Description

The paper "Growing Wikipedia Across Languages via Recommendation" (pdf) has a section (section 2.2) about ranking missing articles. See T190774#4189253 for details.

A/C

  • Figure out which of the features in Section 2.2 we can compute at scale and on a regular basis.

See T190774#4192169.

  • Take the features from the previous step and use the model in Section 2.2 to build a prediction algorithm for every language.

See T190774#4319244.

  • Discuss how to surface recommendations.

See T190774#4319700.

Previous work

  1. https://github.com/schana/recommendation-translation

Event Timeline

leila triaged this task as High priority.
leila subscribed.

@leila, can we add a checklist of items here? I can add them if you can point me in the right direction. Thanks.

@bmansurov sorry for the long delay.

Let's focus on the translation task for the moment to stay as close as possible to the setup in https://arxiv.org/pdf/1604.03235.pdf .

The basic model for how GapFinder works at the moment is that the user fixes source_language, destination_language, and seed_article, and through more_like in Elasticsearch the API shows a subset of the articles that are available in source_language, missing in destination_language, and similar to seed_article. (Please correct me if I'm wrong.)

We want to move away from this model, as it does no real prioritization of the articles based on user interests (the latter captured by seed_article). Basically, we want an algorithm that is as close as possible to the algorithm in Section 2.2 of https://arxiv.org/pdf/1604.03235.pdf, without worrying about Section 2.3 for now. (For that latter section, we can continue to rely on the user to provide a seed_article and continue using more_like to find what is of most interest to the user, but within a ranked list of missing articles as opposed to the space of all missing articles.)

If the above makes sense to you, what we basically need to do is figure out which of the features listed under "Features" in Section 2.2 can be computed at scale (across all languages, and assuming that we need to refresh the values every n weeks). Once we know which features we can include, we can build a prediction model using those features that predicts the number of pageviews an article would receive if it were created in destination_language. We can use the scores from this model to rank missing articles in a given language. (I see a link to a Spark feature-extraction job at https://www.mediawiki.org/wiki/GapFinder/Developers, which may be where we started to move in this direction before you joined the team.)

To summarize:

  1. Figure out which of the features in Section 2.2 we can compute at scale and on a regular basis.
  2. Take the features from the previous step and use the model in Section 2.2 to build a prediction algorithm for every language.

Once you have the two items above, we should discuss how we want to surface these new recommendations to the user. One approach is to use seed_article and more_like to provide a ranked list of articles that are similar to seed_article, where the ranking is done by the new algorithm instead of the default ranking model in more_like.

Let me know if something is unclear.

Features

Here is the list of features mentioned in the paper.

Wikidata count

Below (try it) is a SPARQL query that we can use to compute the number of Wikipedia articles connected (via sitelinks) to a specific Wikidata item.

# Count, per item, the sitelinks that point to a Wikipedia
# (dots escaped so they match literally).
SELECT ?item (COUNT(?sitelink) AS ?count) WHERE {
  VALUES ?item { wd:Q1 wd:Q2 wd:Q3 }
  ?sitelink schema:about ?item .
  FILTER REGEX(STR(?sitelink), "\\.wikipedia\\.org/wiki/")
} GROUP BY ?item

Alternatively, we can consider all sitelinks (as opposed to Wikipedia sitelinks only). The advantage is that the sitelink count is precomputed on each item, so we don't have to count the links ourselves. Here's (try it) a sample query:

SELECT ?item ?linkcount WHERE {
  VALUES ?item { wd:Q1 wd:Q2 wd:Q3 }
  ?item wikibase:sitelinks ?linkcount
}

The running time of Wikidata queries is limited to 60 seconds. In order to run longer queries, we'll have to talk to Discovery-ARCHIVED about our use case.

An alternative to querying WDQS is parsing the Wikidata dumps with Wikidata toolkit. In order to speed up the process, we'll have to copy the dumps to the Analytics cluster and run a Spark job to extract the information we need. Here's previous work done by Ellery on this. Joseph has also done some work on this. Wikidata dumps as of January 2018 are available in JSON and Parquet formats at hdfs://user/joal/wikidata.
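Roughly, the extraction could look like the sketch below. The dump path comes from the sentence above; the id, siteLinks, and site column names are my assumptions about the Parquet schema, so check them against the actual dump first.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName('wikidata-sitelink-counts').getOrCreate()

# Path from above; 'id', 'siteLinks', and 'site' are assumed column
# names -- verify against the actual Parquet schema before running.
items = spark.read.parquet('hdfs://user/joal/wikidata')

sitelink_counts = (
    items
    .select('id', F.explode('siteLinks').alias('sitelink'))
    # Wikipedia site IDs look like 'enwiki', 'ruwiki', etc. This suffix
    # match is approximate: it also catches e.g. 'commonswiki'.
    .where(F.col('sitelink.site').endswith('wiki'))
    .groupBy('id')
    .count()
    .withColumnRenamed('count', 'wikipedia_sitelinks')
)

sitelink_counts.write.mode('overwrite').parquet('wikidata_sitelink_counts')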

Page views

The wmf.pageview_hourly table (available on Hive) can be used to get page views in each of the top 50 Wikipedias. We'll have to agree on what "top" means in this context. Here is a sample query. More info.
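For illustration, a minimal PySpark sketch of the kind of aggregation we'd run; the year/month values and the project are placeholders, and filtering agent_type to 'user' (to exclude spider traffic) is my assumption about how we'd want to count views.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Monthly per-article pageviews for one wiki; partition values are
# placeholders, and agent_type = 'user' excludes spider traffic.
monthly_views = spark.sql("""
    SELECT page_title, SUM(view_count) AS views
    FROM wmf.pageview_hourly
    WHERE year = 2018 AND month = 5
      AND project = 'ru.wikipedia'
      AND agent_type = 'user'
    GROUP BY page_title
""")
monthly_views.show(20)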

Geo page views

The wmf.pageview_hourly table (see above) can be used to query pages by country too.
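The same sketch grouped by country_code (placeholder values as before):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Per-article, per-country pageviews for one wiki.
geo_views = spark.sql("""
    SELECT page_title, country_code, SUM(view_count) AS views
    FROM wmf.pageview_hourly
    WHERE year = 2018 AND month = 5
      AND project = 'ru.wikipedia'
      AND agent_type = 'user'
    GROUP BY page_title, country_code
""")
geo_views.show(20)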

Source-article length

The page table contains the page_len property.
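For example (a sketch; wmf_raw.mediawiki_page and the snapshot/wiki_db values are my assumptions about the sqooped copy of the table on the cluster, and the equivalent query can be run against the MediaWiki replicas instead):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Lengths (in bytes) of main-namespace pages; table and partition
# names are assumptions, see the note above.
page_lengths = spark.sql("""
    SELECT page_title, page_len
    FROM wmf_raw.mediawiki_page
    WHERE snapshot = '2018-05' AND wiki_db = 'enwiki'
      AND page_namespace = 0
""")
page_lengths.show(20)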

Quality and importance classes

In order to use WikiProjects, each wiki has to be looked at separately. For example, template names are localized, so we first have to identify the templates that indicate an article's importance.

Edit Activity

We can use the revision table to query the article creation and last edit dates, along with the number of editors of the article.
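A sketch of that query, under the same assumption that a sqooped copy of the revision table is available on the cluster:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Creation date, last edit date, and distinct editor count per page;
# table and partition names are assumptions, as above.
edit_activity = spark.sql("""
    SELECT rev_page,
           MIN(rev_timestamp) AS created,
           MAX(rev_timestamp) AS last_edited,
           COUNT(DISTINCT rev_user) AS editors
    FROM wmf_raw.mediawiki_revision
    WHERE snapshot = '2018-05' AND wiki_db = 'enwiki'
    GROUP BY rev_page
""")
edit_activity.show(20)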

Links

The pagelinks table can be used to query inlinks and outlinks.
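For instance, inlink counts per title (a sketch; wmf_raw.mediawiki_pagelinks is my assumption about where a sqooped copy would live, and outlinks are the same query grouped by pl_from instead):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Number of main-namespace pages linking to each title; table and
# partition names are assumptions, as above.
inlink_counts = spark.sql("""
    SELECT pl_title, COUNT(*) AS inlinks
    FROM wmf_raw.mediawiki_pagelinks
    WHERE snapshot = '2018-05' AND wiki_db = 'enwiki'
      AND pl_namespace = 0
    GROUP BY pl_title
""")
inlink_counts.show(20)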

Topics

As suggested by the paper, using gensim is an option. It has two implementations of LDA: ldamodel and ldamulticore. Judging by the performance numbers given at the above links, the Analytics cluster should be able to handle this workload easily.
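A minimal gensim sketch with toy documents; in practice the corpus would be built from the article text dumps:

from gensim.corpora import Dictionary
from gensim.models import LdaMulticore

# Toy tokenized documents standing in for real article text.
docs = [
    ['wikipedia', 'article', 'language', 'translation'],
    ['recommendation', 'ranking', 'article', 'pageviews'],
    ['language', 'translation', 'ranking'],
]

dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]

# LdaMulticore parallelizes training across local worker processes.
lda = LdaMulticore(corpus, id2word=dictionary, num_topics=2, workers=2)

# Topic distribution for a new document.
print(lda[dictionary.doc2bow(['article', 'translation'])])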

Conclusion

Every feature, except for "Quality and importance classes", can be used without much human intervention. Those features are: "page views", "geo page views", "source-article length", "edit activity", "links", and "topics". The "Quality and importance classes" feature requires the initial human intervention of identifying wiki/language-specific templates and indicators, after which it can be automated too.

I've created a script that ranks articles based on Wikidata sitelinks: https://github.com/wikimedia/research-translation-recommendation-models/

The next step is to integrate the pageviews an article received in the top 50 Wikipedias over the last 6 months.
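A hypothetical sketch of what that integration could look like; the feature scaling, weights, and data below are made up for illustration, and the actual model lives in the repository linked above.

import pandas as pd

# Made-up data: sitelink counts plus six months of pageviews per item.
df = pd.DataFrame({
    'wikidata_id': ['Q1', 'Q2', 'Q3'],
    'sitelinks': [300, 12, 45],
    'pageviews_6m': [1200000, 3400, 56000],
})

# Scale each feature to [0, 1] and combine with illustrative weights.
for col in ('sitelinks', 'pageviews_6m'):
    df[col + '_norm'] = df[col] / df[col].max()
df['score'] = 0.5 * df['sitelinks_norm'] + 0.5 * df['pageviews_6m_norm']

# Percentile rank, analogous to the normalized rank reported below.
df['normalized_rank'] = df['score'].rank(pct=True)
print(df.sort_values('normalized_rank', ascending=False))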

Model and Data

Analyzing results

Using https://recommend.wmflabs.org/ we get the article title, Wikidata ID, rank, and pageviews. We compare that rank to the normalized rank predicted by the models. The higher the normalized rank, the more pageviews the article is expected to receive if created in the target wiki.

Test 1

Source: Russian
Target: Ukrainian
Seed article: Страна ("Country") https://ru.wikipedia.org/wiki/Страна

title | wikidata_id | rank | pageviews | normalized_rank
Политика_памяти | Q1422907 | 4471 | 91 | 1.57540699415e-05
Гань_Ин | Q315678 | 4484 | 9 | 1.57540699415e-05
Языки_Нидерландов | Q2525752 | 4521 | 615 | 1.57540699415e-05
Диалекты_немецкого_языка | Q2306552 | 4615 | 30 | 1.57540699415e-05
Словарь_географических_названий_зарубежных_стран | Q4423790 | 4722 | 16 | n/a
Внешняя_политика_Турции | Q373506 | 4752 | 78 | n/a
Романовская_Империя | Q7382038 | 4777 | 26 | n/a
Женевская_инициатива | Q869709 | 4787 | 2 | n/a
Базис_(политология) | Q16535129 | 4801 | 1 | 1.57540699415e-05
Россияне | Q492468 | 4811 | 182 | n/a
Большая_Лангобардия | Q2713566 | 4848 | 8 | 1.57540699415e-05
Лондонское_соглашение_(1906) | Q53844437 | 4941 | 84 | n/a
  • The ranks provided by the current algorithm are close to one another, which indicates that these suggestions are closely related to each other. The normalized ranks indicate the same thing.
  • The related articles don't make much sense. We need to work on improving this, probably by using the algorithm from the paper (Section 2.1).
  • Some articles don't have Wikidata IDs in the current Wikidata dumps. We need to update the Wikidata dumps before training more models and making predictions.

Test 2

Source: English
Target: Russian
Seed article: Seoul https://en.wikipedia.org/wiki/Seoul

title | wikidata_id | rank | pageviews | normalized_rank
Yongin_Daejanggeum_Park | Q15622253 | 4811 | 55 | 0.000135394133573
Tomb_of_Princess_Jeongseon | Q48740838 | 4831 | 4 | n/a
Sinansan_Line | Q484137 | 4848 | 7 | 0.000135394133573
Irwon_station | Q100870 | 4851 | 2 | 0.000135394133573
Cheongjin-dong | Q5091642 | 4884 | 0 | 0.000135394133573
Alone_in_Love | Q623446 | 4891 | 315 | 0.000135394133573
Beer_in_South_Korea | Q4880006 | 4912 | 536 | 0.000135394133573
Yeongcheon-dong | Q2681506 | 4937 | 0 | 0.000135394133573
Taepyeongno | Q704512 | 4943 | 9 | 0.000135394133573
Yoo_(Korean_surname) | Q699742 | 4956 | 85 | 0.000135394133573
Tears_of_the_Dragon_(TV_series) | Q624867 | 4972 | 71 | n/a
Namdaemun_Market | Q494687 | 4981 | 73 | 0.000135394133573

Here, too, we can draw the same conclusions as above.

Test 3 (Manual test)

Source: English
Target: Russian

We also sort predictions by normalized rank and look at their sitelink counts to see if the normalized ranks make sense. The current iteration of the models takes sitelinks, pageviews, normalized pageviews, and log pageviews into account, so this is an approximate comparison: we don't expect the sitelink count to map to normalized ranks perfectly.

title | wikidata_id | sitelinks | normalized_rank
Bettina Pousttchi | Q101386 | 2 | 0.000135394133573
Piz Dora | Q1713412 | 4 | 0.00133051918878
Neuvireuil | Q1000008 | 32 | 0.000135394133573

Surfacing recommendations

English (source) to Spanish (target) has 4,712,255 predictions weighing 123 MB; the other language pairs have fewer predictions. In order to surface these recommendations efficiently, we depend on T193746. This assumes that we'll be developing the Node.js version of the algorithm as opposed to the Python version.

DarTar edited projects, added Research-Archive; removed Research.