
Article topic model produces unexpected results for general topic areas such as art
Open, Needs Triage, Public

Description

The articletopic: keyword on search is used for querying articles on a specific topic area, and the underlying model is used for recommendations in tools such as Content Translation (T113257).

When searching for art-related topics on English Wikipedia, the top results include articles that do not seem to be the most representative/relevant articles related to art among the more than six million articles on English Wikipedia. Top results include "List of the Transformers characters", "Cosmetics", or "Brick", which seem surprising as top art-related articles. Below is a table with examples of results from some topic areas, with screenshots and notes about the unexpected elements:

| Topic | Screenshot | Notes |
| --- | --- | --- |
| visual-arts | en.wikipedia.org_w_index.php_search=articletopic%3Avisual-arts&title=Special%3ASearch&ns0=1(Wiki Tablet).png (1×768 px, 237 KB) | Frequent unrelated topics |
| biography | en.wikipedia.org_w_index.php_search=articletopic%3Abiography&title=Special_Search&profile=advanced&fulltext=1&ns0=1(Wiki Tablet).png (1×768 px, 223 KB) | Non-biographies frequently included |
| engineering | en.wikipedia.org_w_index.php_search=articletopic%3Aengineering&title=Special_Search&profile=advanced&fulltext=1&ns0=1(Wiki Tablet).png (1×768 px, 270 KB) | Results dominated by planes (part of the topic, but lack of diversity in the results) |
| entertainment | en.wikipedia.org_w_index.php_search=articletopic%3A+entertainment&title=Special_Search&profile=advanced&fulltext=1&ns0=1(Wiki Tablet).png (1×768 px, 267 KB) | Results dominated by wrestlers (part of the topic, but lack of diversity in the results) |

This is reflected in other tools such as Content Translation when users select the "Art" filter in the suggestions:

bn.m.wikipedia.org_w_index.php_title=Special_ContentTranslation&active-list=suggestions&from=en&to=bn(Wiki Tablet) (2).png (1×768 px, 208 KB)

My suspicion is that general article popularity frequently takes precedence over topic alignment. But this is just based on intuition, and it is not clear to me whether there are any parameters in the system that could help strike a better balance.
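For reference, the articletopic: queries behind the screenshots above can be issued through the standard search API. Here is a minimal sketch of the request parameters (assuming the documented list=search module; `build_articletopic_query` is a hypothetical helper, and the default ranking profile may vary per wiki):

```python
# Sketch: build api.php parameters for an articletopic: search on enwiki.
# Pair with requests.get("https://en.wikipedia.org/w/api.php", params=...)
# to run it live -- live results change over time.
def build_articletopic_query(topic, limit=10):
    """Return query parameters for the list=search module."""
    return {
        "action": "query",
        "list": "search",
        "srsearch": f"articletopic:{topic}",
        "srlimit": limit,
        "format": "json",
    }

params = build_articletopic_query("visual-arts")
print(params["srsearch"])  # articletopic:visual-arts
```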

Event Timeline

This is excellent -- thank you @Pginer-WMF for documenting! Detailed thoughts below about each example you brought up and what we might do about it. Keep sending these sorts of examples along if you notice other patterns!

I'll address the biographies one first because that's the easiest. The model will always have errors, but with something as explicit as biographies, those errors are more surprising (in a bad way). To address this, the next iteration of the model that we're working on will rely directly on Wikidata for the biography determination, and that should address it (we expect the same will be true for gender). To a lesser degree, this is also the approach I took with countries, and one we're also looking into for animals/plants/fungi and some occupations. You can explore this UI that shows the current prototype (feedback welcome): https://wiki-topic.toolforge.org/topic-prototype. There's no set timeline for rolling this out, but we're working on getting feedback from a few community stakeholders in the next few months and then implementing hopefully in Q3/Q4 if all goes smoothly.

I'll address the lack of diversity (engineering + entertainment) next. The articletopic keyword in Search has an associated score for any given article that reflects how confident the model is that the topic applies (docs). What's going on with engineering and entertainment is that the model is quite confident about these particular subtopics -- i.e. plane and wrestling articles are really easy to identify, so it assigns them high scores. If you're curious about individual articles, you can see those scores as percentages from 0-100% directly from the model (ex. of WWE), where only scores above 50% are passed on to the Search index, or you can see directly what score (now mapped to 0-1000) is being stored by the Search index (same ex. of WWE). When Content Translation queries the Search index without any other filters that might affect the ranking, the sorting is a mixture of this confidence score and some amount of weighting toward generically "more-relevant" content based on incoming links and pageviews. So, for certain topics that cover a range of sub-topics of varying confidence, we end up with these homogeneous results from the most confident sub-topic. Options to address:

  • You could consider changing the "generically more-relevant" part of the sorting to something that exerts a more randomizing effect on the results topic-wise. Here are the docs for the Search API; we could play with either srsort or srqiprofile. I think the default is srqiprofile=wsum_inclinks_pv&srsort=relevance. I looked, and I think Growth is using srqiprofile=classic_noboostlinks (code), which does mix things up. There are more options you can play with -- here are the docs for srsort and srqiprofile.
  • You could add additional ranking adjustments -- e.g., prefer-recent, though in my exploration that didn't have a major effect on the topic distribution because many wrestling articles had been recently edited, and honestly I assume you might actually want stable rather than recently-edited articles to translate.
  • Technically we could try to adjust the incoming scores to the Search index -- e.g., capping at 90% so this sort of thing is less likely to happen. But that would be an even less precise art, and it would take a lot more time to implement and propagate.
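The 50% cutoff, the 0-1000 mapping, and the proposed cap can be sketched as a small transformation. This is a hypothetical helper for illustration only; the actual indexing pipeline does not take this form:

```python
def scores_for_index(predictions, threshold=0.5, cap=None):
    """Map model probabilities (0-1) to search-index weights (0-1000).

    Only predictions above `threshold` are kept, mirroring the 50% cutoff
    described above. `cap` optionally limits the stored weight, e.g.
    cap=0.9 to reduce the dominance of very confident sub-topics.
    """
    kept = {}
    for topic, prob in predictions.items():
        if prob <= threshold:
            continue  # below-threshold topics never reach the index
        if cap is not None:
            prob = min(prob, cap)
        kept[topic] = int(round(prob * 1000))
    return kept

example = {"Sports": 0.99, "Media": 0.62, "Engineering": 0.31}
print(scores_for_index(example))            # Engineering dropped
print(scores_for_index(example, cap=0.9))   # Sports capped at 900
```

Capping compresses only the top of the range, so very confident sub-topics (planes, wrestlers) would no longer crowd out everything else quite as strongly, at the cost of losing some ranking signal among them.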

Finally, there's the case of visual-arts with unrelated topics. Like I said, the model isn't perfect, so sometimes you just will see unrelated results. I do take pride in the current model, though, and think it's actually quite good even as we work on improvements. In this case, the unrelated predictions in visual-arts actually are not from the current model. Background:

  • What you see in the Search results in almost all cases is the output from the language-agnostic articletopic outlink model. We made this model the default on 10 July 2023 (T328276#9002375). Before that, the search index had been filled with predictions from the earlier generation of ORES models (this model). We didn't flush the index at that time because it would have been expensive, so we've been allowing the outlink model to slowly overwrite the old scores. This happens when an article is edited and the outlink model makes a prediction, which means the vast majority of articles in the search index now have the new predictions. But for articles that either a) have not been edited since July 2023, or b) have been edited but don't have confident predictions from the new outlink model, the scores in the Search index are still from the old ORES model. Situation "b" happens to be the case for "List of the Transformers characters", "Cosmetics", and "Brick". The first because it's a list article, which is a confusing distribution of links for the outlink model, so it doesn't make any confident predictions; the other two because they're pretty general articles that touch on a number of topics, so the model ends up with a number of lower-confidence predictions. A few options:
  • I've been working on this a bit as well in the V2 for the articletopic model. We've been trying to tighten up the topics a little bit so that the model is better at identifying them (and they're clearer to end-users). This is no guarantee that it'll fix the problem though -- some general-purpose articles like "Brick" that touch on many topics are just always going to be a little difficult for the model.
  • I actually had been leaning toward excluding List/Disambiguation articles completely from the predictions (based on Wikidata instance-of properties) because they're a different type of article that the model isn't really trained for. If you think that would be an issue (i.e. you do expect List of... articles to show up in the results, just not in this case), let me know and I'm happy to rethink.
  • If this poses a more urgent issue, we can also talk with the Search team about actually flushing the Search index, so at least the wrong predictions will only be from the new model. The outlink model is better than the ORES model in general, but in particular I do think it is less likely to make these really confident but wrong predictions (and, as I noted above, high-confidence scores show up higher in the results, so you're more likely to run into them). Relatedly, we could look into making sure that empty predictions also overwrite existing predictions, which should be easier and would deal with this particular situation, but Search would have to tell us how much work that is.
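The list/disambiguation exclusion could be sketched as a simple check against Wikidata instance-of (P31) values. Q13406463 (Wikimedia list article) and Q4167410 (Wikimedia disambiguation page) are the standard items for those types; this is an illustrative helper, and a real implementation would read P31 values from the Wikidata dump or API:

```python
# Sketch: exclude list/disambiguation pages from topic predictions based
# on their Wikidata instance-of (P31) values.
EXCLUDED_P31 = {
    "Q13406463",  # Wikimedia list article
    "Q4167410",   # Wikimedia disambiguation page
}

def eligible_for_topics(instance_of_qids):
    """Return True if none of the article's P31 values are excluded types."""
    return not EXCLUDED_P31.intersection(instance_of_qids)

print(eligible_for_topics(["Q13406463"]))  # list article -> False
print(eligible_for_topics(["Q11424"]))     # e.g. a film -> True
```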

Thanks, @Isaac, for your detailed and useful response.

I want to emphasize that the topic-based solutions look quite useful overall. I just wanted to surface some topic areas where I was finding more unexpected results, in order to check whether there were opportunities for improvement (or to confirm we were just hitting harder limitations of the models). This is not an immediate blocker for our team, since our work on this front can continue and benefit from improvements in the suggestions as those happen. In any case, it is great to hear there are opportunities for improving the results.

Regarding biographies, I'm glad to hear that there are plans to improve the models. The use of Wikidata seems quite promising.

Regarding the lack of diversity, the adjustments to the search parameters seem promising too. I created a separate task (T377124) for the LPL team to explore some of the options that may result in better suggestions in Content Translation.

Regarding the case of visual-arts with unrelated topics, as I mentioned, it is good to know that more articles will get the new scores over time. I cannot provide much guidance on the best approach to prevent articles on general topics from getting stuck with the old scores, but having empty predictions from the new model also overwrite existing predictions makes sense. In any case, I won't consider this an urgent issue, but rather something for the Search team to consider if the issue persists in the long term and they have the capacity.

Excluding lists is not expected to be problematic in our context. These do not seem to be the most valuable type of content to emphasize for translation (especially since the articles being linked may or may not exist in the target wiki).