Technical exploration to support topic-based suggestions with the current Recommendation API
Open, Medium, Public

Assigned To

None

Authored By

	Pginer-WMF
	Jun 18 2024, 12:43 PM

Description

In order to support topic-based suggestions for translation (T113257), a better recommendation API is needed (T293648). Until the new API is available, we want to make progress by learning how useful users find the option to customize suggestions based on topic areas.

This ticket proposes a technical exploration to find ways in which the current Recommendation API can be used to approximate the intended results. For example, the current API provides a "seed article" option that can be used to approximate topic-based suggestions. That is, we can let users select topics such as "Architecture", and the system can use some articles from the "Architecture" topic area as seed articles. The results may not be ideal, since the level of indirection or the use of reduced samples of articles may not produce results of as high quality as a dedicated service, but they may still be useful.

This exploration will consider approaches to, given a language pair, find articles to (a) create and (b) expand with a new section in the following cases:

Find articles related to a given topic area from "articletopic:" search.
Find articles in the intersection of two or more topic areas.
Find articles related to a given item (Wikipedia article or Wikidata topic).
Find articles in the intersection of several items (Wikipedia article or Wikidata topic).
Find articles related to the current user location (nearby).
Find articles related to a given country.
Find articles that are part of active campaigns.
Find articles that are part of a specific page collection.
Find articles in the intersection of multiple of the above criteria.

Details

	Subject	Repo	Branch	Lines +/-
	Recommend articles to translate based on topic	research/recommendation-api	master	+115 -10


Related Objects

Status	Assigned	Task
		· · ·
Open	None	T113257 Custom translation suggestions: Find opportunities to translate in topic areas selected by the user
Open	None	T367873 Technical exploration to support topic-based suggestions with the current Recommendation API
Resolved	santhosh	T373961 Support basic topic & community-defined translation collections with the current Recommendation API
Resolved	SBisson	T377124 Explore search parameters to increase diversity in topic-based suggestions
		· · ·

Event Timeline

Pginer-WMF triaged this task as Medium priority. Jun 18 2024, 12:43 PM

Pginer-WMF created this task.

Pginer-WMF mentioned this in T113257: Custom translation suggestions: Find opportunities to translate in topic areas selected by the user. Jun 19 2024, 9:54 AM

Pginer-WMF moved this task from Needs Triage to Enhancements on the ContentTranslation board. Jun 20 2024, 8:34 AM

FRomeo_WMF added a subscriber: Rmaung. Jun 25 2024, 2:32 PM

GFontenelle_WMF subscribed. Jun 25 2024, 9:22 PM

My preference is to enhance the "new" recommendation API at https://api.wikimedia.org/wiki/Lift_Wing_API/Reference/Get_content_translation_recommendation so that it can accept a topic (for example: Chemistry, History, Africa, Music, etc.) and give recommendations. It should accept more than one topic. We can also look at the intersection of topic and article at a later stage.

The code behind this recommendation system is at https://github.com/wikimedia/research-recommendation-api and we can see how to enhance it. By using the topic classification from the Wikipedia search API (example: https://en.wikipedia.org/w/api.php?action=query&list=search&srsearch=articletopic:sports&format=json) and passing the results through the existing filter that finds articles missing in the target language, we should be able to get it working.

Thanks for your input, @santhosh. Some comments below.

In T367873#9943197, @santhosh wrote:

My preference is to enhance the "new" recommendation API at https://api.wikimedia.org/wiki/Lift_Wing_API/Reference/Get_content_translation_recommendation so that it can accept a topic (for example: Chemistry, History, Africa, Music, etc.) and give recommendations. It should accept more than one topic.

This sounds good to me. I had assumed the recommendation API internals were too complex to touch without further support, but if it is viable to improve them, this seems a better solution than just wrapping the existing API with hacks.

We can also look at the intersection of topic and article at a later stage.

The purpose of the ticket is to check the viability for the different scenarios. So even if a cleaner solution is not implemented, I think it would be useful to have some sort of validation to check that there is an approach that could be viable. This could be a manual example demonstrating the approach proposed. The technical exploration for the different cases will help to surface earlier any adjustment that we may need to do to the product designs.

Pginer-WMF edited projects, added LPL Hypothesis; removed Language-Team (Language-2024-April-June). Jul 2 2024, 9:46 AM

The source code at https://github.com/wikimedia/research-recommendation-api has a lot of legacy code and broken or unmaintained dependencies. The web frontend uses bower, jQuery and similarly old tooling. Recent updates by the Machine Learning team got it somewhat functional, to the extent that it is integrated into LiftWing. But adding new features requires more fixups to get a smooth local development experience. We can ignore the web frontend part (AKA gapfinder) for now as we are interested only in the API.

As primary stakeholders of this service, I would recommend that we take partial ownership of it along with the Machine Learning team.

Isaac mentioned this in T361637: Support for topic infrastructure work. Jul 2 2024, 6:53 PM

Thanks Santhosh and Pau for kicking this off! Commenting here what I put in Slack for ease of access / transparency and I added a bit more detail. The specific updates that I'd love to see made as part of this clean-up (beyond removing unused code and general modernization/standardization):

Flip ranking order (it's currently "backwards") -- see: T293648#9284550
No longer gather each candidate article's set of claims from Wikidata as they're not being used -- see: T347475#9226750 and T347475#9239002
Consider reducing down the number of candidates that are checked on Wikidata for inclusion from 500 to something more dynamic that cuts off the process when enough candidates have been found -- also see: T347475#9226750.
- The current process makes a search query (params + code) to gather 500 candidate articles for translation. Then, for each of these candidates, it applies a few filters to find which already exist in the target language and to remove disambiguation/list articles (though the List filter is pretty basic). When the API was being ported to LiftWing, we experimented with reducing the number of candidate articles to check from 500 down to e.g., 250 but didn't see much change in latency because the API calls for each chunk of 50 candidates are done in parallel.
- What I'd suggest considering instead is moving the "does this article exist in the target language" filter (which is the main one for removing candidates) to the original Search API call instead of relying on Wikibase API. So instead of just requesting articles that are morelike the seed and then making additional API calls to filter them, you could make the morelike API call a generator and use the langlinks API to filter out articles that already exist in the target language. In parallel, you could also filter out disambiguation pages with a similar API call. For example, if you were looking for articles like "Banana" to translate from English -> German, you could do something like:
  - Similar articles (and whether they already exist in German): https://en.wikipedia.org/w/api.php?action=query&generator=search&format=json&gsrnamespace=0&gsrwhat=text&gsrsearch=morelike:Banana&gsrlimit=max&prop=langlinks&lllang=de&lllimit=max
  - Same set of similar articles but check for disambiguation pages and their Wikidata ID: https://en.wikipedia.org/w/api.php?action=query&generator=search&format=json&gsrnamespace=0&gsrwhat=text&gsrsearch=morelike:Banana&gsrlimit=max&prop=pageprops&ppprop=wikibase_item|disambiguation
- Then you probably don't need any calls to the Wikibase API unless you want to know how many sitelinks an article already has, but at least at that point it's only for the final result set (which in practice is much smaller). You also don't necessarily need to request all 500 candidates at once if you don't want to and could easily use the Search continuation parameters to e.g., do chunks of 100 and only get more as needed.
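
A minimal Python sketch of the generator-based approach described above, for illustration only. The function names and the sample response are hypothetical, and folding the langlinks and pageprops lookups into a single request (instead of the two parallel calls suggested above) is an assumption made to keep the example short. The filter assumes the default `format=json` response shape, where `query.pages` is keyed by page ID and each search result carries an `index` field.

```python
from urllib.parse import urlencode

API_URL = "https://en.wikipedia.org/w/api.php"


def build_candidate_query(seed, target_lang, limit=100):
    """Build one generator=search URL that also requests langlinks and pageprops."""
    params = {
        "action": "query",
        "format": "json",
        "generator": "search",
        "gsrnamespace": 0,
        "gsrwhat": "text",
        "gsrsearch": f"morelike:{seed}",
        "gsrlimit": limit,  # smaller chunks; more can be fetched via continuation
        "prop": "langlinks|pageprops",
        "lllang": target_lang,
        "lllimit": "max",
        "ppprop": "wikibase_item|disambiguation",
    }
    return f"{API_URL}?{urlencode(params)}"


def missing_in_target(response):
    """Return (title, wikidata_id) pairs for pages that have no langlink to the
    target language and are not disambiguation pages, in search-rank order."""
    pages = response.get("query", {}).get("pages", {})
    kept = []
    for page in pages.values():
        if "langlinks" in page:
            continue  # already exists in the target language
        props = page.get("pageprops", {})
        if "disambiguation" in props:
            continue  # disambiguation pages are not translation candidates
        kept.append((page.get("index", 0), page["title"], props.get("wikibase_item")))
    kept.sort(key=lambda item: item[0])
    return [(title, qid) for _, title, qid in kept]


# Made-up sample data in the shape of an API response (IDs are not real).
sample = {"query": {"pages": {
    "1": {"title": "Plantain", "index": 2, "pageprops": {"wikibase_item": "Q1"}},
    "2": {"title": "Banana (disambiguation)", "index": 4,
          "pageprops": {"wikibase_item": "Q2", "disambiguation": ""}},
    "3": {"title": "Musa (genus)", "index": 3,
          "langlinks": [{"lang": "de", "*": "Bananen"}],
          "pageprops": {"wikibase_item": "Q3"}},
    "4": {"title": "Banana peel", "index": 1, "pageprops": {"wikibase_item": "Q4"}},
}}}
candidates = missing_in_target(sample)
```

With this shape, the Wikibase API would only be needed for the final, much smaller result set, as noted above.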

santhosh mentioned this in T369484: Modernize recommendation API. Jul 8 2024, 8:19 AM

Change #1052950 had a related patch set uploaded (by Santhosh; author: Santhosh):

[research/recommendation-api@master] Recommend articles to translate based on topic

https://gerrit.wikimedia.org/r/1052950

gerritbot added a project: Patch-For-Review. Jul 9 2024, 11:00 AM

Commenting here rather than on the above patch because it's a high-level question that I thought Pau might have thoughts about too.

Right now the implementation is a topic search that is separate from the morelike search. So you can either find translation candidates that are similar to some example article or you can find translation candidates that fit into some set of high-level topics but you can't combine the two.

I was actually envisioning that the two were (optionally) combined -- i.e. you could optionally provide an example article as well as optionally provide a set of topics (and if you provide neither, it just reverts as usual to providing a set of high-pageview articles as candidates). Code-wise I think this would be pretty simple; the tricky part is using morelikethis instead of morelike when combining the two filters (see documentation). And while most people might still only ever provide an example article or provide topics (I guess that depends on what you enable in the UI), it allows for interesting use-cases like Climate-change-related articles that are also films (srsearch=morelikethis:Climate_change%20articletopic:films) which non-Content-Translation users of the API might want too.

In T367873#9967061, @Isaac wrote:

Commenting here rather than on the above patch because it's a high-level question that I thought Pau might have thoughts about too.

Right now the implementation is a topic search that is separate from the morelike search. So you can either find translation candidates that are similar to some example article or you can find translation candidates that fit into some set of high-level topics but you can't combine the two.

I was actually envisioning that the two were (optionally) combined -- i.e. you could optionally provide an example article as well as optionally provide a set of topics (and if you provide neither, it just reverts as usual to providing a set of high-pageview articles as candidates). Code-wise I think this would be pretty simple; the tricky part is using morelikethis instead of morelike when combining the two filters (see documentation). And while most people might still only ever provide an example article or provide topics (I guess that depends on what you enable in the UI), it allows for interesting use-cases like Climate-change-related articles that are also films (srsearch=morelikethis:Climate_change%20articletopic:films) which non-Content-Translation users of the API might want too.

I cannot say much about the technical aspects, but the use case that you described (e.g., the intersection between articles similar to a given one and a set of topics) seems quite relevant, and is something we were planning to support in the UI. The idea is to let users select multiple topic areas from a menu (T369268) and also search (T369595) for articles to define a narrower knowledge gap they are interested in. So it would be great to consider how this could be supported as part of the next steps of the technical exploration.

I had thought about this topic+article mixing, and I have an idea on its implementation, but just deferred it for another patch once these are merged and tested.

the usecase that you described (e.g., intersection between similar articles to a given one, and a set of topics) seems quite relevant, and something we were planning to support in the UI.

Great to hear!

I had thought about this topic+article mixing, and I have an idea on its implementation, but just deferred it for another patch once these are merged and tested.

Ahh fair -- I'll leave it up to you then on how to proceed. I left a comment but current code seems to be working for me.

Change #1052950 merged by jenkins-bot:

[research/recommendation-api@master] Recommend articles to translate based on topic

https://gerrit.wikimedia.org/r/1052950

Maintenance_bot removed a project: Patch-For-Review. Jul 24 2024, 5:30 PM

Pginer-WMF moved this task from Backlog to Prioritized on the LPL Hypothesis board. Aug 2 2024, 12:18 PM

eamedina updated the task description. Aug 6 2024, 3:29 PM

eamedina updated the task description.

eamedina subscribed. Aug 6 2024, 3:32 PM

PWaigi-WMF moved this task from Prioritized to In-progress on the LPL Hypothesis board. Aug 13 2024, 1:31 PM

ngkountas updated the task description. Aug 23 2024, 3:34 PM

PWaigi-WMF moved this task from In-progress to Backlog on the LPL Hypothesis board. Aug 28 2024, 8:20 AM

Pginer-WMF added a project: Epic. Sep 3 2024, 1:34 PM

PWaigi-WMF mentioned this in T373961: Support basic topic & community-defined translation collections with the current Recommendation API. Sep 4 2024, 5:40 AM

PWaigi-WMF changed the status of subtask T373961: Support basic topic & community-defined translation collections with the current Recommendation API from Open to In Progress. Sep 4 2024, 5:43 AM

In preparation for T369268: Custom translation suggestions: Multiple selection, we want to figure out whether every combination of filter configurations is possible or whether some are fundamentally incompatible, so we can adapt the UX accordingly.

UNION of topics is already possible and in use today, since in several instances we combine many backend topics into a single frontend topic. For instance, when the topic tv-and-film is selected, we send articletopic:films|television to the search API and that is interpreted as "films" OR "television".

INTERSECTION of topics is also possible by specifying multiple articletopic: commands in the search query. For instance, articletopic:africa articletopic:women articletopic:books will find articles about books by African women. (results on enwiki)

"For you", or seed-based recommendations, can be combined with any other search queries to find articles similar to the provided seed while matching the other commands if possible. For instance, articletopic:south-america morelikethis:Lake would find articles about South American bodies of water. (results on enwiki)
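
The three query shapes above (UNION, INTERSECTION, and seed plus topics) can be sketched as a small query builder. This is only an illustration: `build_search_query` is a hypothetical helper, not code from the recommendation service, and writing multi-word seeds with underscores simply follows the `Climate_change` example earlier in this thread.

```python
def build_search_query(topic_groups=(), seed=None):
    """Compose a search string from topic groups and an optional seed article.

    Each group becomes one articletopic: term whose members are joined with
    "|" (OR). Separate terms are space-separated, which the search API treats
    as AND. The seed, if given, is appended as a morelikethis: term.
    """
    terms = ["articletopic:" + "|".join(group) for group in topic_groups if group]
    if seed:
        # Assumption: underscores instead of spaces, as in the thread's example.
        terms.append("morelikethis:" + seed.replace(" ", "_"))
    return " ".join(terms)


union_query = build_search_query([["films", "television"]])
intersection_query = build_search_query([["africa"], ["women"], ["books"]])
seeded_query = build_search_query([["south-america"]], seed="Lake")
```

Here `union_query`, `intersection_query`, and `seeded_query` reproduce the three example queries quoted above.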

"Popular" is a good default filter for when the user has no edits and hasn't selected any topics or collections. However, I'm not sure what it should do when combined with other filters. Maybe it just affects the ranking of the results by showing the popular articles (by pageviews) first? I believe pageviews are already used in the default ranking, but that might change based on what we find in T377124: Explore search parameters to increase diversity in topic-based suggestions.

Multiple selection of collections should be straightforward since we have all the collections and their article titles available in the recommendation service cache. However, if we are looking at the INTERSECTION, it might be an empty set in most cases, unless a specific collection is combined with a very generic list like the Vital Articles.

Combination of collection(s) and topic(s): for this to work, we may interpret it as "from the articles in this/these collections, find those that match the selected topics". This would need an API to get the articletopic for articles and build a mini index of { topic: articles } in the recommendation service.
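
A rough sketch of what such a mini index could look like, assuming the per-article topics have already been fetched somehow. The function names and the sample data are hypothetical; the actual cache layout in the recommendation service may differ.

```python
from collections import defaultdict


def build_topic_index(article_topics):
    """Invert {article: [topics]} into a {topic: {articles}} index."""
    index = defaultdict(set)
    for title, topics in article_topics.items():
        for topic in topics:
            index[topic].add(title)
    return index


def filter_collections_by_topics(collections, selected_topics, topic_index):
    """Articles present in every selected collection that match all selected topics."""
    if not collections:
        return set()
    articles = set.intersection(*(set(titles) for titles in collections))
    for topic in selected_topics:
        articles &= topic_index.get(topic, set())
    return articles


# Made-up sample data for illustration.
article_topics = {
    "Chinua Achebe": ["africa", "books"],
    "Nairobi": ["africa"],
    "Moby-Dick": ["books"],
}
topic_index = build_topic_index(article_topics)
collection_a = ["Chinua Achebe", "Nairobi", "Moby-Dick"]
collection_b = ["Chinua Achebe", "Nairobi"]
matches = filter_collections_by_topics(
    [collection_a, collection_b], ["africa", "books"], topic_index
)
```

As noted above for collections, the intersection can easily end up empty, so a caller would probably need a fallback for that case.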

Here's a table of the combinations and how they may be interpreted:

For you	Popular	Topics	Collections	Notes
x				Based on seed
	x			Based on `mostpopular` search keyword
x	x			Based on seed, sorted by pageviews
		x		Intersection of selected topics
x		x		Intersection of selected topics and seed
x	x	x		Intersection of selected topics and seed, sorted by pageviews
			x	Intersection of selected collections
x			x	Intersection of selected collections (collections win, seed is ignored) //?
	x		x	Intersection of selected collections, sorted by pageviews
x	x		x	Intersection of selected collections, sorted by pageviews, seed is ignored
		x	x	Articles at the intersection of selected collections that match all the selected topics
x		x	x	Same as above, seed is ignored
x	x	x	x	Same as above but sorted by pageviews
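
The interpretations in the table can be condensed into a small decision function. This is only a sketch of one possible reading: the `Plan` fields and the `interpret` name are hypothetical, and the rule that collections always override the seed follows the row marked `//?`, which is still an open question. "Popular" on its own is represented here simply as sorting by pageviews rather than as a dedicated search keyword.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    source: str              # "collections" or "search"
    use_seed: bool           # whether the "For you" seed influences results
    filter_by_topics: bool   # whether results must match all selected topics
    sort_by_pageviews: bool  # whether "Popular" reorders the results


def interpret(for_you=False, popular=False, topics=False, collections=False):
    """Map the four filter selections to one interpretation, per the table."""
    if collections:
        # Assumption from the table: collections win and the seed is ignored.
        return Plan("collections", False, topics, popular)
    return Plan("search", for_you, topics, popular)


seed_and_topics = interpret(for_you=True, topics=True)
everything = interpret(for_you=True, popular=True, topics=True, collections=True)
```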