
Specify new task for Linking articles as a structured tasks (Q2)
Closed, Resolved (Public)

Description

We have identified a simple approach to generating link recommendations that increase the visibility of orphan articles (T288241). See https://meta.wikimedia.org/wiki/Research:Recommending_links_to_increase_visibility_of_articles#Approach_1:_Link-translation_for_orphan_articles

In this task the goal is to:

  • Generate a dataset containing the links between articles in all Wikipedia languages matched with the Wikidata-item
  • Exploratory analysis to understand how much need there is to increase the visibility of orphan articles: How many orphan articles exist in each language? How many link recommendations could we generate in principle?
  • Implement an algorithm to i) identify orphan articles and ii) recommend and prioritize incoming link recommendations
  • (stretch) Evaluate recommendations

Event Timeline

Update week 2021-10-11:

I did the first round of exploratory analysis around recommending incoming links for orphan articles based on "link translation". The analysis is contained in the notebook.

The main results are very encouraging:

  • there is a substantial number of articles in Wikipedia without any incoming links: from 57M articles in all Wikipedias, there are roughly 8.4M orphan articles without incoming links (14.7%). this varies across wikis (see the detailed table): enwiki=5% (300k articles), arwiki=20% (18k articles), viwiki=50% (122k articles)
  • the problem of orphan articles (no incoming links) seems to be much more severe than that of articles with no outgoing links (see the detailed table). for example, articles without outgoing links: enwiki=0.01% (605 articles), arwiki=0.01% (137 articles), viwiki=0.1% (1149 articles)
  • many of the orphan articles in a wiki are not orphan articles in another language. for example: in enwiki, for 27% of the orphan articles, the article exists in at least one other language where it is not an orphan article (arwiki=92%, viwiki=81%). this means that for a sizable number of orphan articles we can potentially find link-candidates by looking at the other language versions
  • the caveat is that the source article in the other language might not exist in the language of the orphan article; however, out of 8.4M orphan articles across all wikis, for 4.9M (58%) we can identify at least one link-candidate for an incoming link from the other language versions of the same article.
  • an additional advantage is that translating an existing link gives us some information about where to add the link in the absence of a clear anchor-string (e.g. in which section the link appears)

Example 1: dewiki: (Q940138, Q1791563)

Example 2: enwiki: (Q1001975, Q1107122)
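
The link-translation idea described above can be sketched roughly as follows. This is only a minimal illustration on made-up toy data, not the code from the notebook: the tuple layout `(wiki, source_qid, target_qid)`, the `articles` mapping, and the function names `find_orphans` and `translate_links` are all assumptions introduced here.

```python
from collections import defaultdict

# Toy link data, keyed by Wikidata item: (wiki, source_qid, target_qid).
# All values are made up for illustration.
links = [
    ("dewiki", "Q1", "Q2"),
    ("dewiki", "Q3", "Q2"),
    ("frwiki", "Q1", "Q2"),
    ("enwiki", "Q1", "Q3"),
]

# Which Wikidata items have an article in each wiki (also made up).
articles = {
    "enwiki": {"Q1", "Q2", "Q3"},
    "dewiki": {"Q1", "Q2", "Q3"},
    "frwiki": {"Q1", "Q2"},
}


def find_orphans(wiki, links, articles):
    """Return the articles in `wiki` that have no incoming links."""
    linked = {target for w, _, target in links if w == wiki}
    return articles[wiki] - linked


def translate_links(wiki, orphan, links, articles):
    """Collect candidate sources for an incoming link to `orphan` in `wiki`.

    A source is a candidate if the link source -> orphan exists in some
    other wiki and the source article also exists in `wiki`. Candidates are
    returned as (source_qid, number_of_other_wikis), most frequent first.
    """
    counts = defaultdict(int)
    for w, source, target in links:
        if w == wiki or target != orphan or source == orphan:
            continue
        if source in articles[wiki]:
            counts[source] += 1
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


orphans = find_orphans("enwiki", links, articles)
candidates = translate_links("enwiki", "Q2", links, articles)
```

In this toy setup, `Q2` has no incoming links in enwiki, while the link `Q1 -> Q2` exists in two other wikis and `Q3 -> Q2` in one, so `Q1` would be ranked first. The check `source in articles[wiki]` corresponds to the caveat that the source article might not exist in the language of the orphan article.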

Update week 2021-10-18:

  • I gathered some feedback on last week's analysis and on ways to extend it, which I will address in the following weeks:
    • what is the fraction of list- or disambiguation pages in orphan articles? (and among those we can de-orphanize)
    • what is the fraction of bot-created articles among orphans? this might not be straightforward to calculate, and it has no impact on the usefulness of the recommendations to de-orphanize. However, it might be important for how to frame the task. The large fraction of orphans in viwiki (a language with a large number of bot-created articles) could suggest that bots generate lots of orphans when creating new articles automatically
    • are there some topics or categories (e.g. gender) that are more or less affected by orphan-articles?
  • More generally, it will be necessary to perform an additional evaluation of the recommendations generated via link-translation:
    • there is already an implicit signal contained in the translated links -- the fact that they already exist in other languages shows that they have been vetted by one (or more) language communities. thus, it might make sense to prioritize links based on the number of language editions they already exist in
    • single snapshot-evaluation: use incoming links from non-orphan articles as ground truth
    • two snapshot-evaluation: comparing two consecutive snapshots we can identify articles that were de-orphanized and use the added links as ground truth to evaluate the link translation.
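
The two-snapshot evaluation could look roughly like the sketch below. It is a hedged illustration under assumptions, not an agreed evaluation protocol: the dictionaries mapping each article to its set of incoming-link sources, the function names, and the choice of precision/recall at k as metrics are all introduced here for illustration, and the data is made up.

```python
def deorphanized(incoming_old, incoming_new):
    """Articles without incoming links in the old snapshot that have some in the new one."""
    return {
        article
        for article, sources in incoming_new.items()
        if sources and not incoming_old.get(article)
    }


def precision_recall_at_k(ranked_sources, added_sources, k):
    """Compare the top-k recommended sources with the links that were actually added."""
    top = ranked_sources[:k]
    hits = sum(1 for source in top if source in added_sources)
    precision = hits / len(top) if top else 0.0
    recall = hits / len(added_sources) if added_sources else 0.0
    return precision, recall


# Toy snapshots (made up): article -> set of articles linking to it.
incoming_old = {"Q2": set(), "Q5": {"Q1"}}
incoming_new = {"Q2": {"Q1", "Q4"}, "Q5": {"Q1"}}

newly_linked = deorphanized(incoming_old, incoming_new)

# Hypothetical recommendations for Q2, already ranked, e.g. by the number
# of language editions in which the link exists.
ranked = ["Q1", "Q3", "Q4"]
precision, recall = precision_recall_at_k(ranked, incoming_new["Q2"], k=2)
```

Here `Q2` is the only article that was de-orphanized between the two snapshots, and the links added to it serve as ground truth for the ranked recommendations.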

Update week 2021-10-25:

  • added @Aroraakhil as a formal collaborator to the project
  • started to sketch a more detailed analysis plan due to the added capacity. with the initial exploratory results and Akhil's expertise we can go much deeper in the following aspects:
    • highlight the scope of the problem of orphan (or underlinked in terms of incoming links) articles; not only in terms of the number of articles but also in terms of the effect on readership (e.g. pageviews)
    • improving the model to generate recommendations and their evaluation
    • identifying text-regions where to insert the link in the absence of obvious anchors

Update week 2021-11-04:

  • prepared a short presentation (slide-deck) describing the project (motivation, initial findings, planned research) to discuss the details of the next steps with Bob and Akhil

Update week 2021-11-22:

  • started in-depth analysis of which type of articles are orphan articles
  • for this I collected all auxiliary tables with properties of: topic, quality, wikidata-statements related to properties P21/P31, bot-created article, age of the article
  • using wikidata-statements I investigated the relation to known biases: obtained the gender of all articles that are biographies (P31=Q5) via the P21-property in Wikidata. in almost every wiki, women are strongly overrepresented among the orphan articles compared to what one would expect from the ratio of articles on men/women. For example, one extreme case is cawiki (Catalan) in row 21: only 19% of biographies are on women, but among biographies that are orphans, women make up almost 43%. Similar patterns are observed for the other wikis (though typically not that extreme).
  • next: relation to bot-created articles. some of the previous analysis suggests that wikis typically associated with many bot-created articles have a large number of orphan articles (cebwiki, viwiki). I thus started to build a dataset to automatically identify all articles that were created by bots (ongoing).

Update week 2021-12-13:

  • built a dataset to capture all possible links (ingoing and outgoing) that could be added between articles in every wiki based on translating the link from another wiki where it already exists
  • from this I could build a simple model to recommend new incoming links for an article in a given wiki -- the recommendations are prioritized based on the number of language editions in which they already exist (this is often the case for tens of wikis)
  • systematic analysis of the extent of orphan articles with respect to: disambiguation page, bot-created articles, the gender in biography-articles, the topic of the article, the quality of the article, and the age of the article
    • I calculated the respective tables for each of these properties and we see there is considerable variation across the different language versions. they can be interpreted in the following way. As an example, let's consider the gender gap in enwiki (all other properties can be understood in the same way). we only consider biography-articles and class x=all articles on women.
      • N=1.8M, this is the number of biography articles
      • Po = 0.018, this is the fraction of orphan articles among all biography articles
      • Px = 0.19, this is the known fraction of articles on women among all biography articles
      • Px|o = 0.29, this is the fraction of articles on women if we only consider the orphan articles
      • Exo > 1 (and log Exo > 0), which simply means that Px|o > Px (here 0.29 > 0.19), i.e. among orphans we find relatively more articles on women than among all biographies (they are enriched)
    • code: https://github.com/martingerlach/link-recommendation-visibility/blob/main/exploratory-analysis_orphans-properties.ipynb
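
As a worked example of the quantities above, the sketch below computes the enrichment for the enwiki gender-gap numbers. It assumes that Exo is defined as the ratio Px|o / Px, which is consistent with "Exo > 1 means Px|o > Px" but is not spelled out in this task; the function names and the toy counts are likewise made up for illustration and are not taken from the linked notebook.

```python
import math


def enrichment(p_x_given_o, p_x):
    """Enrichment of class x among orphans, assumed to be the ratio Px|o / Px."""
    return p_x_given_o / p_x


def enrichment_from_counts(n_orphans, n_class, n_class_orphans, n_total):
    """Same quantity computed from raw article counts."""
    p_x = n_class / n_total                  # fraction of class x among all articles
    p_x_given_o = n_class_orphans / n_orphans  # fraction of class x among orphans
    return enrichment(p_x_given_o, p_x)


# enwiki biographies, using the rounded values quoted above:
# Px = 0.19 (articles on women), Px|o = 0.29 (articles on women among orphans).
e_xo = enrichment(0.29, 0.19)
log_e_xo = math.log(e_xo)

# Toy counts (made up): 1000 articles, 100 orphans, 200 in class x,
# of which 40 are orphans.
e_toy = enrichment_from_counts(n_orphans=100, n_class=200, n_class_orphans=40, n_total=1000)
```

With the rounded enwiki values the ratio is roughly 1.5, so the logarithm is positive. A value below 1 (negative logarithm) would instead indicate that the class is underrepresented among orphans.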

Wrote up a summary of the results on meta, consisting of three parts:

  • how many orphan articles are there? (the extent of orphans)
  • what types of articles are orphans? (characterize orphans)
  • how to generate recommendations for new incoming links to orphan articles based on link translation? (de-orphanize)

https://meta.wikimedia.org/wiki/Research:Recommending_links_to_increase_visibility_of_articles/Link-translation#De-orphanizing_via_link_translation