
Build a prototype API for recommending links to orphan articles
Closed, Resolved (Public)

Description

We completed a detailed analysis of orphan articles where we showed that:

  • orphan articles are very common in Wikipedia (15% of all 60M articles)
  • adding new incoming links from other articles leads to a statistically significant increase in the internally-referred pageviews to those articles
  • the rate of de-orphanization is low, suggesting a need for automatic tools that support editors in the task of de-orphanization
  • translating existing links from other language versions of Wikipedia provides high-quality recommendations and constitutes a scalable approach to suggesting new links for orphan articles.

In this task, we want to implement the link translation approach to build a publicly accessible prototype API that surfaces recommendations for new incoming links for orphan articles.
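
To make the link translation idea concrete, here is a minimal sketch with toy data. It assumes articles are matched across languages by a shared identifier (for example a Wikidata item ID); the function name `suggest_inlinks` and the input structures are hypothetical and do not describe the tool's actual implementation.

```python
from collections import Counter


def suggest_inlinks(orphan, target_lang, links_by_lang, articles_by_lang):
    """Suggest source articles that could link to `orphan` in `target_lang`.

    Articles are identified by language-independent IDs (an assumption),
    so a link (source, target) can be compared across language versions.
    """
    existing = links_by_lang.get(target_lang, set())
    available = articles_by_lang.get(target_lang, set())
    votes = Counter()
    for lang, links in links_by_lang.items():
        if lang == target_lang:
            continue
        for source, target in links:
            if target != orphan:
                continue  # only links pointing at the orphan are of interest
            if source not in available:
                continue  # the source article has no version in target_lang
            if (source, orphan) in existing:
                continue  # the link already exists in target_lang
            votes[source] += 1  # one vote per language that has this link
    # Most widely shared links first; ties broken by ID for determinism.
    return sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))


# Toy data: Q9 has no incoming links in "en" but is linked in "es" and "fr".
links_by_lang = {
    "en": {("Q1", "Q2")},
    "es": {("Q1", "Q9"), ("Q3", "Q9")},
    "fr": {("Q1", "Q9"), ("Q4", "Q9")},
}
articles_by_lang = {"en": {"Q1", "Q2", "Q3", "Q9"}}

suggestions = suggest_inlinks("Q9", "en", links_by_lang, articles_by_lang)
print(suggestions)  # [('Q1', 2), ('Q3', 1)]
```

Q4 is dropped because it has no English version, and Q1 ranks first because the link exists in two other languages.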

Event Timeline

weekly update:

  • I finished a first version of the tool, which suggests new incoming links for orphan articles based on links that already exist in other language versions: https://linkrec.toolforge.org/
  • It now works with articles in all languages, and I slightly improved the description of the tool
  • Next step: getting feedback from potential users (e.g. Web Team)

weekly update:

  • Discussed with Olga from the Web team how the tool could be used towards the goal of increasing internally referred pageviews. Our analysis showed that when editors add new incoming links to orphan articles, internally referred pageviews to these articles increase by a statistically significant amount due to a spillover effect. We wondered whether the tool could be adapted for readers: it would generate recommendations of related articles (similar to the RelatedArticles extension used on mobile devices) and surface relevant articles with low visibility, using the same underlying model based on link translation.
  • I will try to explore this idea to adapt the current model

weekly update:

  • Implemented a few usability improvements to the tool
  • Started some exploratory analysis to adapt the framework so that it generates reading recommendations, taking into account relevance (does the link exist in other languages?) and visibility (the target article is an orphan and thus de facto invisible)

weekly update:

  • Exploratory analysis to generate reading recommendations using the link translation approach taking into account i) Relevance: there is a link to the recommended article in many other language versions of Wikipedia; ii) (In)visibility: the recommended article has no (or few) incoming links and is thus lacking visibility for readers navigating Wikipedia.
  • This approach works well and gives at least 3 recommendations for around 40M out of 60M articles across all languages. One question is how to rank the suggestions to find a good trade-off between relevance and invisibility.
  • The open question is how to evaluate the quality of these recommendations. I started to compare them with the recommendations from RelatedArticles (visible at the bottom of the mobile version of each article). Looking at some qualitative examples (e.g. en.m:Hypatia), it seems that RelatedArticles recommends articles that already exist as links in the article, so it does not necessarily surface information that editors have not already added.
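
The ranking question is still open. As one possible way to explore the trade-off between relevance and invisibility, the sketch below uses a hypothetical score with a tunable exponent `alpha`. The formula, the field names `n_langs` and `n_inlinks`, and the function `rank_candidates` are illustrative assumptions, not what the tool implements.

```python
def rank_candidates(candidates, alpha=1.0):
    """Rank candidates by a hypothetical relevance/invisibility score.

    n_langs:   number of language versions in which the link exists (relevance)
    n_inlinks: number of incoming links to the candidate (visibility)
    alpha:     0 ranks by relevance only; larger values favor low visibility
    """
    def score(candidate):
        return candidate["n_langs"] / (1 + candidate["n_inlinks"]) ** alpha

    return sorted(candidates, key=score, reverse=True)


# Toy candidates: A is widely linked but very visible, B is nearly invisible.
candidates = [
    {"title": "A", "n_langs": 10, "n_inlinks": 1000},
    {"title": "B", "n_langs": 5, "n_inlinks": 0},
]

by_relevance = [c["title"] for c in rank_candidates(candidates, alpha=0.0)]
by_tradeoff = [c["title"] for c in rank_candidates(candidates, alpha=1.0)]
print(by_relevance)  # ['A', 'B']
print(by_tradeoff)   # ['B', 'A']
```

Varying `alpha` makes it easy to inspect how the top recommendations change for a given article before committing to a specific ranking.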

weekly update:

  • Set up an endpoint to generate reading recommendations for articles (this corresponds to generating new outgoing links instead of incoming links)
    • Example: https://linkrec.toolforge.org/api/v1/out?lang=en&title=Tiwanaku&ltrans=ca|es|fr
    • Recommendations are generated via link translation: the endpoint collects all outgoing links of the same article in other languages that do not yet exist as a blue link in the current language. This simple approach can generate recommendations for 44M articles and yields 60 recommendations on average.
    • Recommendations that are both relevant and lacking in visibility are prioritized: i) relevance: the number of language versions in which the link already exists; ii) lack of visibility: the number of incoming links to the recommended article (i.e. the fewer inlinks, the less visible the article is)
    • This pipeline is not very efficient yet, so requests might take some time. The filtering needs some additional optimization.
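
For anyone who wants to try the endpoint from a script, here is a small sketch that builds the request URL from the parameters visible in the example above (`lang`, `title`, and `ltrans` as a pipe-separated list of languages). The helper name `build_out_url` is hypothetical, and the sketch makes no assumption about the response format.

```python
from urllib.parse import urlencode

# Endpoint taken from the example URL above.
BASE_URL = "https://linkrec.toolforge.org/api/v1/out"


def build_out_url(lang, title, ltrans):
    """Build the request URL for reading recommendations.

    lang:   language of the article that receives recommendations
    title:  title of that article
    ltrans: languages from which existing links are translated
    """
    query = {"lang": lang, "title": title, "ltrans": "|".join(ltrans)}
    # Keep "|" unescaped so the URL matches the form used in the example.
    return f"{BASE_URL}?{urlencode(query, safe='|')}"


url = build_out_url("en", "Tiwanaku", ["ca", "es", "fr"])
print(url)
```

Since the pipeline can be slow, a client that actually sends this request would probably want a generous timeout.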

weekly update:

While the performance of the API calls could still be optimized, I consider this work done: it provides an experimental setup that illustrates the main idea. Further improvements will be made in other tasks after receiving specific feedback from other teams.