
Add a link: algorithm improvements: Improve parsing of text for generating anchor-text candidates
Open, Medium, Public

Description

In the manual evaluation (T278864#6974599), some comments suggested that the generated anchor texts for the links are wrong.

The following cases were described:

  • Linking to just a portion of a larger phrase, in which the larger phrase would not be a link, such as:
    • Awards, e.g. "The Jane Smith Award for Excellence" might link just to "Jane Smith".
    • Song titles, e.g. "Un Beso Para Mi" might link just to "Un Beso".
    • Schools, e.g. "Rockville High School" might link just to "Rockville".
  • Possessive suffixes: for the text "Brazilian Navy's", the suggestion would link just the "Brazilian Navy" portion to the target, whereas we would want to include the "'s" in the link.

This will require some better parsing of the raw text to generate candidate anchors.
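As a rough illustration of the possessive case only, the sketch below widens an already-chosen anchor span so that it also covers a directly following "'s". The function name `extend_over_possessive` and the idea of post-processing character offsets are assumptions made for this example, not how the link-recommendation code is known to work.

```python
import re
from typing import Tuple

# A straight or typographic apostrophe followed by "s", not followed by
# another word character (so "Navy's" matches but "Navy'song" does not).
_POSSESSIVE = re.compile(r"['’]s(?!\w)")


def extend_over_possessive(text: str, start: int, end: int) -> Tuple[int, int]:
    """Return (start, end) widened to include a trailing possessive suffix.

    Hypothetical helper: `start` and `end` are character offsets of the
    candidate anchor inside `text`.
    """
    match = _POSSESSIVE.match(text, end)  # match only at position `end`
    if match:
        return start, match.end()
    return start, end


sentence = "The Brazilian Navy's flagship was launched in 1910."
start = sentence.index("Brazilian Navy")
end = start + len("Brazilian Navy")
new_start, new_end = extend_over_possessive(sentence, start, end)
print(sentence[new_start:new_end])
```

A bare trailing apostrophe (as in "the Smiths' house") is deliberately not handled here, since it is hard to tell apart from a closing single quote.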

Event Timeline

Restricted Application added a subscriber: Aklapper.
kostajh subscribed.

@MGerlach, I'm moving this to the post-release backlog in terms of what Growth team is working on, but please feel free to work on it in the interim.

kostajh triaged this task as Medium priority. Apr 26 2021, 7:55 PM

A similar issue to possessive suffixes happens with punctuation. E.g. Oklahoma! is linked as [[Oklahoma]]! when the correct link would be [[Oklahoma!]] (diff). Seems hard to generalize though.
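One narrow way to handle the punctuation case without generalizing: if the candidate target title is known when the anchor is generated, and the title is exactly the anchor plus the punctuation that follows it in the text, extend the anchor. This is only a sketch under that assumption; `extend_to_match_title` and its parameters are hypothetical names.

```python
from typing import Tuple


def extend_to_match_title(text: str, start: int, end: int, title: str) -> Tuple[int, int]:
    """Widen (start, end) when `title` equals the anchor plus trailing punctuation.

    Hypothetical helper: assumes the candidate link target `title` is
    already known for the anchor text[start:end].
    """
    anchor = text[start:end]
    if title.startswith(anchor) and len(title) > len(anchor):
        tail = title[len(anchor):]
        # Extend only over a punctuation-only tail that literally
        # follows the anchor in the text.
        is_punctuation = all(not ch.isalnum() and not ch.isspace() for ch in tail)
        if is_punctuation and text.startswith(tail, end):
            return start, end + len(tail)
    return start, end


sentence = "She starred in Oklahoma! on Broadway."
start = sentence.index("Oklahoma")
end = start + len("Oklahoma")
new_start, new_end = extend_to_match_title(sentence, start, end, "Oklahoma!")
print(sentence[new_start:new_end])
```

This does nothing when the target title is plain "Oklahoma", so a sentence that merely ends in an exclamation mark is left alone.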

Hard to generalize, yes, but the issues I mentioned in T299380 all have a giveaway: quotes. Maybe you could treat short quoted phrases (3-5 words) as one entity?
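A minimal sketch of the quote idea, assuming candidates are available as character offsets: find short quoted phrases and reject any candidate anchor that covers only part of one. The names `quoted_spans` and `splits_quoted_phrase` and the default `max_words=5` are choices made for this example; only straight and typographic double quotes are considered.

```python
import re
from typing import List, Tuple

# Text between straight double quotes or between typographic double quotes.
_QUOTED = re.compile(r'"([^"\n]+)"|“([^”\n]+)”')


def quoted_spans(text: str, max_words: int = 5) -> List[Tuple[int, int]]:
    """Return offsets of quoted phrases with at most `max_words` words."""
    spans = []
    for match in _QUOTED.finditer(text):
        group = 1 if match.group(1) is not None else 2
        if len(match.group(group).split()) <= max_words:
            spans.append(match.span(group))  # offsets exclude the quote marks
    return spans


def splits_quoted_phrase(start: int, end: int, spans: List[Tuple[int, int]]) -> bool:
    """True if (start, end) lies inside a quoted span without covering all of it."""
    return any(
        q_start <= start and end <= q_end and (start, end) != (q_start, q_end)
        for q_start, q_end in spans
    )


sentence = 'He recorded "Un Beso Para Mi" in 1995.'
spans = quoted_spans(sentence)
partial_start = sentence.index("Un Beso")
partial = (partial_start, partial_start + len("Un Beso"))
print(splits_quoted_phrase(partial[0], partial[1], spans))
```

A candidate flagged this way could be dropped, or replaced by the full quoted span if that span has its own link target.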