Add a link: algorithm improvements: linking to TV shows and movies
Closed, Declined (Public)

Description

We have heard multiple users report that too many links are being made to movies, songs, TV shows and other kinds of media. It seems that many mundane phrases are also the names of specific works of media, and that links to those are offered often. These are some examples (not real examples from the algorithm, but examples of the sort of thing that users are seeing):

  • "The offspring of the octopus tend to..." might link to The Offspring.
  • "Most species of the lizard can be found in the jungle south of..." might link to The Jungle.
  • "On that holiday, sixteen candles are traditionally lit..." might link to Sixteen Candles.

While it makes sense that this would happen sometimes, users are reporting that it happens with disproportionate frequency.

Event Timeline

kostajh added a subscriber: MGerlach.

@MMiller_WMF I'm moving this to "Triaged" for us as it's something Research / @MGerlach would need to work on, but if you want it moved to another column please do so.

Some ideas how to mitigate this problem:

  • Increase the threshold for the link-to-text ratio of anchors. When generating the anchor dictionary, for each anchor found in the existing links, we calculate the "link-ratio": the number of times the anchor string occurs as a link divided by the number of times it occurs in total (as a link or as plain text) across all articles. The idea is to avoid anchors that would link very common words. Currently, this threshold is set to 6.5%, based on the recommendation from previous work, "Learning to Link with Wikipedia". Increasing the threshold to, e.g., 10% might get rid of some of the links to "mundane phrases". This change would be easy to make. However, we should first analyze how many and which anchors and links such a change would remove.
  • Introduce an additional filter to avoid links to articles that already have many incoming links. For example, the article The_Offspring already has more than 1,000 incoming links in enwiki. The number of incoming links is relatively easy to calculate via the pagelinks table (see also the "what links here" page). Once we have this number for all articles, we would add another step to the filtering of the anchor dictionary. I don't have a good intuition for what a good threshold would be.
  • Remove links to articles with certain properties (such as TV shows). Following the approach in T287034 (avoid links to given names), we could extend the filter of the anchor dictionary to also remove links to articles whose Wikidata item indicates that they are an instance_of a TV show. For example, The_Offspring (Q157041) is instance_of musical_group (Q215380).
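The second and third ideas could be sketched roughly as follows. This is only an illustration, not the actual filtering code: the function name `filter_targets`, the shape of the anchor dictionary, the incoming-link counts, and the cutoff of 1,000 are all assumptions made for the example. Only the QID Q215380 (musical_group) is taken from the comment above.

```python
from typing import Dict, Set


def filter_targets(
    anchors: Dict[str, Set[str]],
    incoming_links: Dict[str, int],
    instance_of: Dict[str, Set[str]],
    max_incoming: int,
    blocked_classes: Set[str],
) -> Dict[str, Set[str]]:
    """Drop candidate targets that have too many incoming links or that
    are an instance_of a blocked Wikidata class (hypothetical helper)."""
    filtered: Dict[str, Set[str]] = {}
    for anchor, targets in anchors.items():
        kept = {
            target
            for target in targets
            if incoming_links.get(target, 0) <= max_incoming
            and not (instance_of.get(target, set()) & blocked_classes)
        }
        if kept:  # anchors with no remaining target are removed entirely
            filtered[anchor] = kept
    return filtered


# Toy data; the link counts are placeholders, not measured values.
anchors = {"the offspring": {"The_Offspring"}, "octopus": {"Octopus"}}
incoming_links = {"The_Offspring": 1200, "Octopus": 300}
instance_of = {"The_Offspring": {"Q215380"}}  # musical_group

by_popularity = filter_targets(anchors, incoming_links, instance_of, 1000, set())
by_class = filter_targets(anchors, incoming_links, instance_of, 10**9, {"Q215380"})
```

With this toy data, either filter on its own removes "the offspring" and keeps "octopus". How to choose `max_incoming` remains the open question raised above.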

I would recommend trying the first approach, as it seems to address the problem most directly (links to mundane phrases).

Adding some more context on the first approach, the link-to-text ratio (linking probability).

For the three examples mentioned at the top:

  • "The offspring of the octopus tend to..." might link to The Offspring.
  • "Most species of the lizard can be found in the jungle south of..." might link to The Jungle.
  • "On that holiday, sixteen candles are traditionally lit..." might link to Sixteen Candles.

For enwiki, I calculated the number of times the corresponding anchor occurs as a link (aslink) and as plain text (astext), and from these the link probability (linkprob = aslink / (aslink + astext)):

anchor             aslink   astext   linkprob
the jungle            514    11248   0.043700
the offspring        1455     3814   0.276143
sixteen candles       239       66   0.783607

This means that "the jungle" is currently removed as a possible anchor because it falls below the linking-probability threshold of 0.065. As a result, it will never be linked (in fact, it should not be in the anchor dictionary at all). The other two anchors occur more often as links, do not fall below the threshold, and are therefore kept in the anchor dictionary.
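As a small worked example, the calculation for the three anchors can be reproduced as below. The counts and the 0.065 threshold come from the table and text above. The function names are illustrative, not taken from the actual code, and treating "not below the threshold" as `>=` is an assumption about the boundary case.

```python
from typing import Dict, Set, Tuple

LINK_PROB_THRESHOLD = 0.065  # current threshold mentioned above


def link_prob(as_link: int, as_text: int) -> float:
    """linkprob = aslink / (aslink + astext); 0.0 if the anchor never occurs."""
    total = as_link + as_text
    return as_link / total if total else 0.0


def kept_anchors(
    counts: Dict[str, Tuple[int, int]],
    threshold: float = LINK_PROB_THRESHOLD,
) -> Set[str]:
    """Return the anchors whose link probability does not fall below the threshold."""
    return {
        anchor
        for anchor, (as_link, as_text) in counts.items()
        if link_prob(as_link, as_text) >= threshold
    }


# (aslink, astext) counts for enwiki, copied from the table above.
counts = {
    "the jungle": (514, 11248),
    "the offspring": (1455, 3814),
    "sixteen candles": (239, 66),
}

kept_now = kept_anchors(counts)          # current threshold of 0.065
kept_at_030 = kept_anchors(counts, 0.3)  # a stricter, hypothetical threshold
```

At the current threshold only "the jungle" is dropped. At 0.3, "the offspring" is dropped as well, while "sixteen candles" (about 0.78) still passes.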

One could remove those two anchors by increasing the threshold. This would decrease the overall number of allowed anchors and thus the number of links we can generate for each article. To get an idea of the effect, I calculated the number of anchors that would still be available for different values of the threshold:

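[Figure: Screenshot from 2022-02-25 15-15-26.png. Number of available anchors as a function of the link-probability threshold; the current threshold of 0.065 is marked with a red dotted line.]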

For the current threshold of 0.065 (red dotted line) we have around 7.5M anchors.
If we were to increase the threshold to, say, 0.3, so that "the offspring" would fall below it, we would still have around 6.5M anchors (a decrease of roughly 13%). The caveat is that we would have fewer recommendations overall. While it is hard to predict how much the number of recommendations would decrease in practice, I would naively expect the decrease to roughly match the decrease in the number of anchors.
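The kind of threshold sweep behind the figure could look roughly like this. It is a sketch under assumptions: `anchors_remaining` and `relative_change` are hypothetical helper names, and the input here is only the three example link probabilities from the table, not the full anchor dictionary.

```python
from bisect import bisect_left
from typing import Dict, Iterable, List


def anchors_remaining(
    link_probs: Iterable[float], thresholds: Iterable[float]
) -> Dict[float, int]:
    """For each threshold, count the anchors whose link probability
    does not fall below it."""
    ordered: List[float] = sorted(link_probs)
    total = len(ordered)
    # bisect_left gives the number of values strictly below the threshold.
    return {t: total - bisect_left(ordered, t) for t in thresholds}


def relative_change(before: float, after: float) -> float:
    """Relative change in the number of anchors, e.g. -0.13 for -13%."""
    return (after - before) / before


# Link probabilities of the three example anchors from the table above.
sample_probs = [0.043700, 0.276143, 0.783607]
remaining = anchors_remaining(sample_probs, [0.065, 0.1, 0.3])

# Approximate totals read off the figure: 7.5M anchors at 0.065, 6.5M at 0.3.
change = relative_change(7.5e6, 6.5e6)
```

Note that raising the threshold from 0.065 to 0.1 would not remove either of the two remaining example anchors. Only a threshold above roughly 0.28 affects "the offspring".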

Wouldn't the most straightforward thing be to just make the text matching case-sensitive (except for the initial character)? Granted, not all languages have the concept of title case, but for the ones that do, it clearly differentiates the names of creative works from common words.
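A minimal sketch of that matching rule, assuming anchors are stored with the casing of the article title (the function name is hypothetical, and this is not how the current matching is implemented):

```python
def matches_case_sensitive(candidate: str, anchor: str) -> bool:
    """Compare a text span with an anchor case-sensitively, except for
    the initial character, which may differ in case."""
    if not anchor or len(candidate) != len(anchor):
        return False
    same_initial = candidate[0].lower() == anchor[0].lower()
    return same_initial and candidate[1:] == anchor[1:]


# Plain-text phrases no longer match the title-cased work names.
plain_phrase = matches_case_sensitive("sixteen candles", "Sixteen Candles")
exact_title = matches_case_sensitive("Sixteen Candles", "Sixteen Candles")
lower_initial = matches_case_sensitive("sixteen Candles", "Sixteen Candles")
```

Under this rule, "sixteen candles" in running text would not be offered as a link to Sixteen Candles, while the title-cased form still would. It does nothing for languages without title case, as noted above.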

Thanks for the thoughts and analysis, @MGerlach and @Tgr. Hearing these alternatives, I think we should not work on this task now and I'm declining it. Deciding which links to make and not make is the reason there is a human in this loop, and avoiding the kinds of links described here is exactly the sort of thing human judgment is good at. I just wanted to make sure we didn't notice some obvious reason that this is happening a tremendous amount. I am comfortable leaving it alone.