
API to get article topics
Closed, Resolved (Public)

Description

Article topics are predicted after each page edit and stored in the search index to support the articletopic: search keyword. Predictions currently come from the revscoring model, but the outlink model could eventually be used instead. The process to get the article topics into the search index is described here. However, there is currently no mechanism to retrieve which topics have been predicted for a given article.
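
For context, a minimal sketch of what is possible today: building a MediaWiki search API request that uses the articletopic: keyword. It goes from a topic to matching articles, which is the reverse of what this task asks for. The helper name, the default host, and the example topic value are illustrative assumptions.

```python
from urllib.parse import urlencode


def articletopic_search_url(topic: str,
                            wiki_host: str = "en.wikipedia.org",
                            limit: int = 10) -> str:
    """Build a search API URL that lists articles predicted to match `topic`.

    Hypothetical helper: it only builds the URL and performs no request.
    """
    params = {
        "action": "query",
        "list": "search",
        "srsearch": f"articletopic:{topic}",  # keyword backed by the search index
        "srlimit": limit,
        "format": "json",
    }
    return f"https://{wiki_host}/w/api.php?{urlencode(params)}"


# Topic -> articles works; article -> topics is what is missing.
url = articletopic_search_url("physics")
```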

In the context of content translation recommendations, and T369268: Custom translation suggestions: Multiple selection specifically, the Language and Product Localization team would like to be able to get the topics for a small number of articles that are part of community-defined page collections.

In the past, bringing ML prediction results (mostly ORES damaging and goodfaith model scores) into MediaWiki was handled by the ORES MW extension for various purposes (RC, WL, Contribs, API, etc). Unlike the new article topics models, the damaging and goodfaith models required extensive per-language training, and as a result the extension is only deployed to a handful of wikis.

So what are the options to query the topics for an article? Here are some:

  1. Get the topics from the doc in elastic and expose them in a new MW API or by extending an existing MW API like query/info or query/revisions
  2. Add caching to the LW API and query it directly
  3. tbd

Event Timeline

The ORES extension keeps a DB copy of all scores for revisions which are in the recentchanges table. Originally this wasn't so much for direct lookup (ORES did have a cache) as for filtering recent changes / watchlists via joins. I think it was something of a pain point in terms of storage space use.

Article topics aren't really relevant for old revisions, so I imagine that would be less of an issue, even taking into account that a single articletopic record is way larger than a single edit quality score (since an article can have any number of topics). Not super sure about it though - no idea what the typical ratio of articles to revisions in RC is.

Also, maybe you could just add a memcache layer or a page property before the LW API, and hope things work out. As long as there are only a few articles that need scores, might be best not to overengineer it.

> Also, maybe you could just add a memcache layer or a page property before the LW API, and hope things work out. As long as there are only a few articles that need scores, might be best not to overengineer it.

I'm not sure if you mean caching on the LW or MW side, but I was thinking that adding a cache in LW and querying it directly is by far the simplest solution. No need to involve MW in this case. The service that actually needs the topics is also hosted on LW.

@isarantopoulos would it be OK if we queried the articletopic model directly from the recommendation-api, also hosted on LW? We would maintain a title-to-topics index on our side to keep API call volume at a minimum.

> @isarantopoulos would it be OK if we queried the articletopic model directly from the recommendation-api, also hosted on LW? We would maintain a title-to-topics index on our side to keep API call volume at a minimum.

@SBisson That sounds good. Do you have any rough estimates of the expected load?
In this case, requests should be made using the internal endpoint to avoid rerouting traffic through the API Gateway.
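
A minimal sketch of the title-to-topics index idea, assuming an in-memory dict is enough for a few thousand articles. `TitleTopicIndex`, `TopicFetcher` and the stub fetcher are hypothetical names; a real fetcher would call the articletopic model through the internal LW endpoint, which is deliberately not shown here.

```python
from typing import Callable, Dict, List

# Hypothetical signature: given a page title, return its predicted topics.
TopicFetcher = Callable[[str], List[str]]


class TitleTopicIndex:
    """Remember topics per title so each title is requested at most once."""

    def __init__(self, fetch: TopicFetcher) -> None:
        self._fetch = fetch
        self._topics: Dict[str, List[str]] = {}
        self.calls = 0  # number of requests actually sent to the model

    def get(self, title: str) -> List[str]:
        if title not in self._topics:
            self.calls += 1
            self._topics[title] = self._fetch(title)
        return self._topics[title]


def stub_fetch(title: str) -> List[str]:
    # Stand-in for the real model call; the returned topic is made up.
    return ["Example.Topic"]


index = TitleTopicIndex(stub_fetch)
index.get("Gravity")
index.get("Gravity")  # served from the index, no second model call
```

Injecting the fetcher keeps the index independent of which endpoint or model (revscoring or outlink) ends up being used.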

>>! In T377891#10277814, @isarantopoulos wrote:
> [...] Do you have any rough estimates on the expected load?

It will come in bursts. Initially, it will need to get topic info for the enwiki Vital articles levels 1, 2, 3. That's about 1110 articles. After that, when a community flags a page collection for translation, the service will have to get topics info for the articles in that collection too, minus those already covered by previous collections. The next one coming may be Wikiproject_Women's_Health/Vital_articles, which is about 401 articles.

It's important to note that we store those articles by their Wikidata ID, so we wouldn't rescore different language versions of the same article. We'd probably request the topics using the English version of the article 99% of the time.

While we don't expect new revisions of articles to change their topics, it has yet to be discussed whether we would try to rescore new rev ids.
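
The bookkeeping described above (articles keyed by Wikidata ID, new collections scored minus what earlier collections already covered) could look roughly like this. `pending_qids` and the sample IDs are hypothetical, and it assumes each collection is already resolved to a list of Wikidata IDs.

```python
from typing import Iterable, List, Set


def pending_qids(collection: Iterable[str], already_scored: Set[str]) -> List[str]:
    """Return the Wikidata IDs in `collection` that still need topics.

    Skips IDs covered by earlier collections and duplicates within the
    collection itself, preserving the original order.
    """
    seen = set(already_scored)
    pending: List[str] = []
    for qid in collection:
        if qid not in seen:
            seen.add(qid)
            pending.append(qid)
    return pending


# Made-up IDs: Q1 was scored with an earlier collection, Q2 appears twice.
to_score = pending_qids(["Q1", "Q2", "Q2", "Q3"], already_scored={"Q1"})
```

With the numbers above, the first two collections would need at most about 1110 + 401 = 1511 requests in total, fewer if they overlap.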

SBisson claimed this task.

No need for a new API at this point. We'll query the API in LW directly.