Technical exploration to support topic-based suggestions with the current Recommendation API
Open, Medium, Public

Description

In order to support topic-based suggestions for translation (T113257) a better recommendation API is needed (T293648). Until the new API is available, we want to make progress by learning how useful users find the ability to customize suggestions by topic area.

This ticket proposes a technical exploration to find ways in which the current Recommendation API can be used to approximate the intended results. For example, the current API provides a "seed article" option that can be used to approximate topic-based suggestions. That is, we can let users select topics such as "Architecture", and the system can use some articles from the "Architecture" topic area as seed articles. The results may not match the quality of a dedicated service, because of the extra level of indirection and the small sample of articles used as seeds, but they may still be useful.

This exploration will consider approaches to, given a language pair, find articles to (a) create and (b) expand with a new section in the following cases:

  • Find articles related to given topic areas from "articletopic:" search.
  • Find articles in the intersection of two or more topic areas
  • Find articles related to a given item (Wikipedia article or Wikidata topic).
  • Find articles in the intersection of several items (Wikipedia article or Wikidata topic).
  • Find articles related to the current user location (nearby).
  • Find articles related to a given country.
  • Find articles that are part of active campaigns.
  • Find articles that are part of a specific page collection.
  • Find articles in the intersection of multiple of the above criteria.

Event Timeline

Pginer-WMF created this task.

My preference is to enhance the "new" recommendation API at https://api.wikimedia.org/wiki/Lift_Wing_API/Reference/Get_content_translation_recommendation so that it can accept a topic (for example: Chemistry, History, Africa, Music, etc.) and give recommendations. It should accept more than one topic. We can also look at the intersection of topic and article at a later stage.

The code behind this recommendation system is at https://github.com/wikimedia/research-recommendation-api and we can see how to enhance it. By using the topic classification exposed by the Wikipedia search API (example: https://en.wikipedia.org/w/api.php?action=query&list=search&srsearch=articletopic:sports&format=json) and passing the results through the existing filter that finds articles missing in the target language, we should be able to get it working.
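
A minimal sketch of that two-step idea, for illustration only. The search parameters mirror the example URL above. The function names are hypothetical, and the filter assumes the candidates' Wikidata entities have already been fetched in the `wbgetentities` shape (`sitelinks` keyed by wiki ID); it is not the code of the existing filter in the repository.

```python
def topic_search_params(topic: str, limit: int = 50) -> dict:
    """Parameters for a list=search request restricted to one topic."""
    return {
        "action": "query",
        "list": "search",
        "srsearch": f"articletopic:{topic}",
        "srlimit": limit,
        "format": "json",
    }


def missing_in_target(entities: dict, source_wiki: str, target_wiki: str) -> list[str]:
    """Return source titles whose entity has no sitelink for the target wiki.

    `entities` is assumed to map Wikidata IDs to objects with a
    `sitelinks` dict, e.g. {"Q1": {"sitelinks": {"enwiki": {"title": "A"}}}}.
    """
    titles = []
    for entity in entities.values():
        sitelinks = entity.get("sitelinks", {})
        if source_wiki in sitelinks and target_wiki not in sitelinks:
            titles.append(sitelinks[source_wiki]["title"])
    return titles
```

With this shape, `topic_search_params("sports")` produces the same query as the example URL, and `missing_in_target(entities, "enwiki", "dewiki")` keeps only the candidates that still need a German article.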

Thanks for your input, @santhosh. Some comments below.

My preference is to enhance the "new" recommendation API at https://api.wikimedia.org/wiki/Lift_Wing_API/Reference/Get_content_translation_recommendation so that it can accept a topic (for example: Chemistry, History, Africa, Music, etc.) and give recommendations. It should accept more than one topic.

This sounds good to me. I was assuming the recommendation API internals to be too complex to touch them without further support, but if it is viable to improve it, this seems a better solution than just wrapping the existing API with hacks.

We can also look at the intersection of topic and article at a later stage.

The purpose of the ticket is to check the viability for the different scenarios. So even if a cleaner solution is not implemented, I think it would be useful to have some sort of validation to check that there is an approach that could be viable. This could be a manual example demonstrating the approach proposed. The technical exploration for the different cases will help to surface earlier any adjustment that we may need to do to the product designs.

The source code at https://github.com/wikimedia/research-recommendation-api has a lot of legacy code and broken or unmaintained dependencies. The web frontend uses bower, jQuery and similarly old tooling. Recent updates by the Machine Learning team got it functional enough to be integrated into LiftWing, but adding new features requires more fixes to get a smooth local development experience. We can ignore the web frontend (also known as gapfinder) for now, as we are interested only in the API.

As primary stakeholders of this service, I would recommend that we take partial ownership of it along with the Machine Learning team.

Thanks Santhosh and Pau for kicking this off! Commenting here what I put in Slack for ease of access / transparency and I added a bit more detail. The specific updates that I'd love to see made as part of this clean-up (beyond removing unused code and general modernization/standardization):

  • Flip ranking order (it's currently "backwards") -- see: T293648#9284550
  • No longer gather each candidate article's set of claims from Wikidata as they're not being used -- see: T347475#9226750 and T347475#9239002
  • Consider reducing the number of candidates that are checked on Wikidata for inclusion from 500 to something more dynamic that cuts off the process when enough candidates have been found -- also see: T347475#9226750.
    • The current process makes a search query (params + code) to gather 500 candidate articles for translation. Then, for each of these candidates, it applies a few filters to find which already exist in the target language and to remove disambiguation/list articles (though the List filter is pretty basic). When the API was being ported to LiftWing, we experimented with reducing the number of candidate articles to check from 500 down to e.g., 250 but didn't see much change in latency because the API calls for each chunk of 50 candidates are made in parallel.
    • What I'd suggest considering instead is moving the "does this article exist in the target language" filter (which is the main one for removing candidates) to the original Search API call instead of relying on Wikibase API. So instead of just requesting articles that are morelike the seed and then making additional API calls to filter them, you could make the morelike API call a generator and use the langlinks API to filter out articles that already exist in the target language. In parallel, you could also filter out disambiguation pages with a similar API call. For example, if you were looking for articles like "Banana" to translate from English -> German, you could do something like:
    • Then you probably don't need any calls to the Wikibase API unless you want to know how many sitelinks an article already has, but at least at that point it's only for the final result set (which in practice is much smaller). You also don't necessarily need to request all 500 candidates at once if you don't want to and could easily use the Search continuation parameters to e.g., do chunks of 100 and only get more as needed.
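
One way the generator-based approach described above could look, as a hedged sketch rather than the query from the original comment. It uses the standard Action API `generator=search`, `prop=langlinks` (with `lllang`) and `prop=pageprops` (with `ppprop=disambiguation`) modules; the function names, the default limit and the choice of `formatversion=2` are assumptions made for illustration.

```python
def morelike_generator_params(seed: str, target_lang: str, limit: int = 100) -> dict:
    """Parameters for one request that searches and annotates candidates."""
    return {
        "action": "query",
        "format": "json",
        "formatversion": "2",
        # Use the search as a generator so props apply to its results.
        "generator": "search",
        "gsrsearch": f"morelike:{seed}",
        "gsrnamespace": "0",
        "gsrlimit": limit,
        # langlinks: only report links to the target language.
        # pageprops: only report the disambiguation marker.
        "prop": "langlinks|pageprops",
        "lllang": target_lang,
        "lllimit": "max",
        "ppprop": "disambiguation",
    }


def translation_candidates(response: dict) -> list[str]:
    """Keep pages with no target-language link and no disambiguation prop.

    Assumes a complete formatversion=2 response; a response that still
    carries a `continue` key may be missing props for some pages.
    """
    pages = response.get("query", {}).get("pages", [])
    # Generator results carry an `index` field with the search rank.
    ranked = sorted(pages, key=lambda page: page.get("index", 0))
    candidates = []
    for page in ranked:
        if page.get("langlinks"):
            continue  # already exists in the target language
        if "disambiguation" in page.get("pageprops", {}):
            continue  # disambiguation page
        candidates.append(page["title"])
    return candidates
```

For the Banana example, `morelike_generator_params("Banana", "de")` would be sent to the English Wikipedia API, and `translation_candidates` applied to the JSON response.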

Change #1052950 had a related patch set uploaded (by Santhosh; author: Santhosh):

[research/recommendation-api@master] Recommend articles to translate based on topic

https://gerrit.wikimedia.org/r/1052950

Commenting here rather than on the above patch because it's a high-level question that I thought Pau might have thoughts about too.

Right now the implementation is a topic search that is separate from the morelike search. So you can either find translation candidates that are similar to some example article or you can find translation candidates that fit into some set of high-level topics but you can't combine the two.

I was actually envisioning that the two were (optionally) combined, i.e. you could optionally provide an example article as well as optionally provide a set of topics (and if you don't provide either, it just reverts like usual to providing a set of high-pageview articles as candidates). Code-wise I think this would be pretty simple; the tricky thing is using morelikethis instead of morelike when combining the two filters (see the documentation). And while most people might still only ever provide an example article or provide topics (I guess that depends on what you enable in the UI), it allows for interesting use cases like Climate-change-related articles that are also films (srsearch=morelikethis:Climate_change%20articletopic:films), which non-Content-Translation users of the API might want too.
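
To make the combination concrete, here is a small sketch of how the `srsearch` string could be assembled. The function name is hypothetical, and two details are assumptions: titles are written with underscores, as in the Climate_change example above, and several topics are joined with `|`, which the search API treats as OR.

```python
from typing import Optional, Sequence


def build_srsearch(seed: Optional[str] = None,
                   topics: Optional[Sequence[str]] = None) -> Optional[str]:
    """Combine an optional seed article and optional topics into one query.

    Returns None when neither is given, so the caller can fall back to
    the usual high-pageview candidates.
    """
    parts = []
    if seed:
        # morelikethis (not morelike) so that further keywords can follow.
        parts.append("morelikethis:" + seed.strip().replace(" ", "_"))
    if topics:
        parts.append("articletopic:" + "|".join(topics))
    return " ".join(parts) or None
```

Here `build_srsearch("Climate change", ["films"])` gives `morelikethis:Climate_change articletopic:films`, which matches the URL-decoded example above.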

I cannot say much about the technical aspects, but the use case that you described (e.g., the intersection between articles similar to a given one and a set of topics) seems quite relevant, and is something we were planning to support in the UI. The idea is to let users select multiple topic areas from a menu (T369268) and also search (T369595) for articles to define a narrower knowledge gap they are interested in. So it would be great to consider how this could be supported as part of the next steps of the technical exploration.

I had thought about this topic+article mixing, and I have an idea on its implementation, but just deferred it for another patch once these are merged and tested.

the use case that you described (e.g., the intersection between articles similar to a given one and a set of topics) seems quite relevant, and is something we were planning to support in the UI.

Great to hear!

I had thought about this topic+article mixing, and I have an idea on its implementation, but just deferred it for another patch once these are merged and tested.

Ahh fair -- I'll leave it up to you then on how to proceed. I left a comment but the current code seems to be working for me.

Change #1052950 merged by jenkins-bot:

[research/recommendation-api@master] Recommend articles to translate based on topic

https://gerrit.wikimedia.org/r/1052950

eamedina updated the task description.

In preparation for T369268: Custom translation suggestions: Multiple selection, we want to figure out whether any combination of filter configurations is possible or whether some are fundamentally incompatible, so we can adapt the UX accordingly.

UNION of topics is already possible and in use today, since in several instances we combine many backend topics into a single frontend topic. For instance, when the topic tv-and-film is selected, we send articletopic:films|television to the search API and that is interpreted as "films" OR "television".

INTERSECTION of topics is also possible by specifying multiple articletopic: commands in the search query. For instance, articletopic:africa articletopic:women articletopic:books will find articles about books by African women. (results on enwiki)
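
The two query shapes can be sketched as follows. Only the tv-and-film mapping comes from this comment; the dictionary and function names are hypothetical and do not reflect the actual frontend code.

```python
# Hypothetical frontend-to-backend mapping; only this entry is from the thread.
FRONTEND_TO_BACKEND = {
    "tv-and-film": ["films", "television"],
}


def union_clause(backend_topics: list[str]) -> str:
    """One articletopic: keyword matching ANY of the given topics."""
    if not backend_topics:
        raise ValueError("at least one topic is required")
    return "articletopic:" + "|".join(backend_topics)


def intersection_query(topic_groups: list[list[str]]) -> str:
    """Several articletopic: keywords; a page must match ALL groups.

    Each group is itself a union, so a frontend topic that maps to
    several backend topics can still be intersected with others.
    """
    return " ".join(union_clause(group) for group in topic_groups)
```

For example, `intersection_query([["africa"], ["women"], ["books"]])` yields `articletopic:africa articletopic:women articletopic:books`, and `union_clause(FRONTEND_TO_BACKEND["tv-and-film"])` yields `articletopic:films|television`.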

"For you", or seed-based recommendations, can be combined with any other search queries to find articles similar to the provided seed while matching the other commands if possible. For instance, articletopic:south-america morelikethis:Lake would find articles about South American bodies of water. (results on enwiki)

"Popular" is a good default filter for when the user has no edits and hasn't selected any topics or collections. However, I'm not sure what it should do when combined with other filters. Maybe it just affects the ranking of the results by showing the popular articles (by pageviews) first? I believe pageviews are already used in the default ranking, but that might change based on what we find in T377124: Explore search parameters to increase diversity in topic-based suggestions.

Multiple selection of collections should be straightforward since we have all the collections and their article titles available in the recommendation service cache. However, if we are looking at the INTERSECTION, it might be an empty set in most cases, unless a specific collection is combined with a very generic list like the Vital Articles.
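
A rough sketch of combining cached collections with plain set operations. The cache shape (`{collection name: set of titles}`) and the function name are assumptions for illustration, not the recommendation service's actual data model.

```python
def combine_collections(collections: dict[str, set[str]],
                        selected: list[str],
                        mode: str = "intersection") -> set[str]:
    """Combine the article titles of the selected collections."""
    sets = [collections[name] for name in selected]
    if not sets:
        return set()
    if mode == "union":
        return set().union(*sets)
    if mode == "intersection":
        return set.intersection(*sets)
    raise ValueError(f"unknown mode: {mode}")
```

An empty result from the intersection mode is the case the UX would need to handle, for instance by falling back to the union or by telling the user that the selected collections do not overlap.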

Combination of collection(s) and topic(s): for this to work we may interpret it as "from the articles in this/these collections, find those that match the selected topics". This would need an API to get the articletopic for articles and build a mini index of { topic: articles } in the recommendation service.
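
The mini index could be as simple as the sketch below. It assumes the topics for each collection article have already been obtained somehow (the `article_topics` mapping is hypothetical input; how to fetch it is the open question above), and the function names are illustrative.

```python
from collections import defaultdict


def build_topic_index(article_topics: dict[str, list[str]]) -> dict[str, set[str]]:
    """Invert {article: [topics]} into {topic: {articles}}."""
    index: dict[str, set[str]] = defaultdict(set)
    for article, topics in article_topics.items():
        for topic in topics:
            index[topic].add(article)
    return dict(index)


def articles_matching_topics(index: dict[str, set[str]],
                             topics: list[str]) -> set[str]:
    """Articles in the index that match ALL of the selected topics."""
    if not topics:
        return set()
    return set.intersection(*(index.get(topic, set()) for topic in topics))
```

The index would be built only from the articles in the selected collections, so the lookup stays restricted to those collections by construction.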

Here's a table of all the combinations and how they may be interpreted

| For you | Popular | Topics | Collections | Notes |
| --- | --- | --- | --- | --- |
| x | | | | Based on seed |
| | x | | | Based on mostpopular search keyword |
| x | x | | | Based on seed, sorted by pageviews |
| | | x | | Intersection of selected topics |
| x | | x | | Intersection of selected topics and seed |
| x | x | x | | Intersection of selected topics and seed, sorted by pageviews |
| | | | x | Intersection of selected collections |
| x | | | x | Intersection of selected collections (collections win, seed is ignored) //? |
| | x | | x | Intersection of selected collections, sorted by pageviews |
| x | x | | x | Intersection of selected collections, sorted by pageviews, seed is ignored |
| | | x | x | Articles at the intersection of selected collections that match all the selected topics |
| x | | x | x | Same as above, seed is ignored |
| x | x | x | x | Same as above but sorted by pageviews |
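
The interpretations above could be encoded as a small decision function, sketched here under assumptions: the function name and the returned keys are hypothetical, and the Popular + Topics case without a seed, which the table does not list, is extrapolated as "topics, sorted by pageviews".

```python
def describe_strategy(for_you: bool = False, popular: bool = False,
                      topics: bool = False, collections: bool = False) -> dict:
    """Map the selected filters to a candidate source and ranking."""
    if collections:
        # Collections win: the seed is ignored whenever they are selected.
        source = "collections+topics" if topics else "collections"
        use_seed = False
    elif topics:
        source = "topics"
        use_seed = for_you
    elif for_you:
        source = "seed"
        use_seed = True
    else:
        # "Popular" alone, or nothing selected (Popular as the default).
        source = "popular"
        use_seed = False
    # When combined with another source, "Popular" only affects ranking.
    sort_by_pageviews = popular and source != "popular"
    return {"source": source, "use_seed": use_seed,
            "sort_by_pageviews": sort_by_pageviews}
```

Whether the seed should really be dropped when collections are selected is still marked as an open question in the table, so that branch is the one most likely to change.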