Hovercards fail to bold article title when it does not appear verbatim in the text extract
Closed, Duplicate · Public

Description

Story
As a reader, I want to see the most important part of a page preview in bold so that I know what the article is about

Description
When a user triggers a hovercard, the extract should show in bold the first instance of bolded text appearing within the article.

(Note: still some issues with this, but simplest solution atm)

Example:
In:


bold Phineas P. Gage as seen in

Background from original task description
The mechanism for generating the hovercard and displaying the article title in bold relies on the simplistic assumption that the article title is contained verbatim in the text extracted from the article intro. Consequently, it fails to highlight the article topic when this assumption is not true.
Some examples from dewiki (all picked from this page, where they can be tested by hovering over the corresponding link with hovercards enabled):

The Russian Wikipedia seems to have a naming convention where many or most biographical articles are named "[[surname, first name]]", but the first sentence contains the subject's name in the usual form "first name surname" (e.g. Obama, Barack). So this problem will be very frequent there (indeed it appears to occur in 7 out of the 9 ruwiki screenshots someone posted for different reasons at T115817#2244421).
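The failure mode described above can be illustrated with a minimal sketch. This is not the actual Hovercards code; the function name `bold_title_in_extract` and the sample extract text are assumptions made up for illustration. It only models the stated assumption that the title appears verbatim in the plain-text extract.

```python
import html


def bold_title_in_extract(title: str, extract: str) -> str:
    """Hypothetical helper: bold the first verbatim occurrence of the title.

    Mirrors the simplistic assumption from the task description, namely
    that the article title appears verbatim in the plain-text extract.
    """
    # Escape only &, < and > so the result is safe to use as HTML text.
    safe_extract = html.escape(extract, quote=False)
    safe_title = html.escape(title, quote=False)

    start = safe_extract.find(safe_title)
    if start == -1:
        # Title not found verbatim: nothing gets bolded.
        return safe_extract

    end = start + len(safe_title)
    return (
        safe_extract[:start]
        + "<b>" + safe_extract[start:end] + "</b>"
        + safe_extract[end:]
    )


# Illustrative extract text (not copied from any wiki).
extract = "Barack Obama is an American politician."

# Title matches the extract verbatim: the name is bolded.
print(bold_title_in_extract("Barack Obama", extract))

# "surname, first name" title convention: no verbatim match, no bold text.
print(bold_title_in_extract("Obama, Barack", extract))
```

With a title in the "surname, first name" form, the lookup finds nothing and the extract is returned without any bold text, which is the behaviour reported in this task.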

A solution would be to instead preserve the existing bold formatting, where the article's editors should already have taken care of all those special cases, language-specific issues, local naming conventions etc.
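As a rough sketch of what preserving the existing bold formatting could look like, the snippet below strips all markup from an intro HTML fragment except bold tags. It uses only Python's standard `html.parser` module and is not how TextExtracts or Hovercards work; the names `BoldPreservingExtractor` and `extract_with_bold`, and the sample HTML, are assumptions for illustration.

```python
from html import escape
from html.parser import HTMLParser


class BoldPreservingExtractor(HTMLParser):
    """Hypothetical extractor: drop every tag except <b>/<strong>."""

    BOLD_TAGS = ("b", "strong")

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.parts: list[str] = []
        self.bold_depth = 0  # tracks nested bold tags

    def handle_starttag(self, tag, attrs):
        if tag in self.BOLD_TAGS:
            if self.bold_depth == 0:
                self.parts.append("<b>")
            self.bold_depth += 1

    def handle_endtag(self, tag):
        if tag in self.BOLD_TAGS and self.bold_depth > 0:
            self.bold_depth -= 1
            if self.bold_depth == 0:
                self.parts.append("</b>")

    def handle_data(self, data):
        # Re-escape text so the output is still valid HTML text content.
        self.parts.append(escape(data, quote=False))


def extract_with_bold(intro_html: str) -> str:
    """Return the text of intro_html with only bold markup kept."""
    parser = BoldPreservingExtractor()
    parser.feed(intro_html)
    parser.close()
    if parser.bold_depth > 0:
        # Close a bold run left open by malformed input.
        parser.parts.append("</b>")
    return "".join(parser.parts)


# Illustrative intro fragment (not copied from any wiki).
intro = '<p><b>Barack Obama</b> is an <a href="/wiki/X">American</a> politician.</p>'
print(extract_with_bold(intro))
```

Because the bold span comes from the article markup itself, the result does not depend on the page title, so a title such as "Obama, Barack" would not affect which text is bolded.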

Event Timeline

Restricted Application added a subscriber: Aklapper. · Jul 29 2016, 7:46 PM
Tbayer updated the task description.
dr0ptp4kt triaged this task as Normal priority. · Aug 1 2016, 5:09 PM
dr0ptp4kt moved this task from Incoming to Needs Prioritization on the Readers-Web-Backlog board.
dr0ptp4kt added a project: Design.

@Tbayer The team has requested you flesh out some Acceptance Criteria. Could you lay out specifically what needs to be solved?

@Nirzar

@MBinder_WMF I'll leave it to the product owner to make the ultimate call on how to navigate the tradeoffs. But here are some possible options for an acceptance criterion:

  1. Hovercard extracts contain meaningful bold text for all 8 pages that are named in the task description.
  2. Hovercard extracts exactly reproduce the original bold text from the article for all these 8 examples.
  3. Hovercard extracts exactly reproduce the original bold text from the article for all pages on all wikis.
  4. On every wiki where the feature is deployed in production, Hovercard extracts contain bold text for 99% of those pages that contain bold text in the excerpted part.
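Option 4 above could be checked along the following lines. This is only a sketch under assumptions: the function names and the input shape (one pair of booleans per page) are hypothetical, and how those booleans would be measured on a real wiki is left open.

```python
from typing import Iterable, Optional, Tuple


def bold_coverage(pages: Iterable[Tuple[bool, bool]]) -> Optional[float]:
    """Share of eligible pages whose hovercard extract contains bold text.

    Each item is (source_has_bold, extract_has_bold). Only pages whose
    excerpted part contains bold text count as eligible.
    Returns None when no page is eligible.
    """
    eligible = [
        extract_has_bold
        for source_has_bold, extract_has_bold in pages
        if source_has_bold
    ]
    if not eligible:
        return None
    return sum(eligible) / len(eligible)


def meets_criterion(pages: Iterable[Tuple[bool, bool]], threshold: float = 0.99) -> bool:
    """True when coverage reaches the threshold (vacuously true if nothing is eligible)."""
    coverage = bold_coverage(pages)
    return coverage is None or coverage >= threshold
```

Pages without bold text in the excerpted part are excluded from the denominator, matching the wording "99% of those pages that contain bold text in the excerpted part".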
ovasileva updated the task description. · Sep 13 2016, 1:24 PM
ovasileva added a subscriber: ovasileva.

Adapted the description to select the first instance of bolded text from the article for all hovercards.

ovasileva updated the task description. · Sep 13 2016, 1:25 PM

Technically this is likely to be hard to achieve for all articles. We'll need to generate lots of test cases, and we're likely to create lots of work in the long term supporting this feature. For this reason, if this is only a nice-to-have, we may want to reconsider. Screenscraping is hard.

Aren't we already screenscraping anyway, just with the added step of converting HTML to text? How big would the effort be to have TextExtracts preserve the information about whether a character was bolded?

Besides, it seems that this is not the only bug caused by throwing away all HTML formatting, see e.g. T141766 or T59850 or T112137.