Scrub parentheses and dates from text extract
Closed, Duplicate (Public)

Description

The Wikipedia Android app is developing a "Link preview" feature that shows a page image and text extract, just like Hovercards. As I understand it, the Java code removes text in parentheses and dates, and shows either one sentence or more than one depending on how long the result from the TextExtracts API is.

It seems to me the Hovercards (Popups) extension would benefit from all these enhancements. Similar bugs have been filed, such as T93160: Hovercards sometimes has contents in brackets (parentheses) appearing in excerpts, especially at ruwiki. So the TextExtracts API should implement getSentences() and removeParens() (if API clients specify a dotherightthing=1 parameter :) ).
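For illustration only, here is a rough Python sketch of what the two proposed helpers might do. It is not the Android app's Java code or anything TextExtracts ships: the names `remove_parens` and `get_sentences`, the `min_chars` threshold, and the naive sentence splitter are all assumptions. Dates outside parentheses are deliberately left alone, since the task does not define how they would be recognised.

```python
import re


def remove_parens(text):
    """Drop parenthesized spans (nested ones too) and tidy the leftovers."""
    out = []
    depth = 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")" and depth > 0:
            depth -= 1
        elif depth == 0:
            # Outside any parentheses: keep the character as-is.
            out.append(ch)
    cleaned = "".join(out)
    # Collapse runs of whitespace left behind by the removed spans.
    cleaned = re.sub(r"\s{2,}", " ", cleaned)
    # Remove a stray space before punctuation, e.g. "thing ." -> "thing."
    cleaned = re.sub(r"\s+([,.;:])", r"\1", cleaned)
    return cleaned.strip()


def get_sentences(text, min_chars=80, max_sentences=2):
    """Return the first sentence, adding more while the result is short.

    min_chars and max_sentences are hypothetical tuning knobs.
    The splitter is naive: it breaks after '.', '!' or '?' followed by
    whitespace, so abbreviations such as "Dr. Smith" will be split wrongly.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    picked = []
    for sentence in sentences[:max_sentences]:
        picked.append(sentence)
        if len(" ".join(picked)) >= min_chars:
            break
    return " ".join(picked)


extract = (
    "Ada Lovelace (1815-1852) was an English mathematician. "
    "She worked on the Analytical Engine. She died in London."
)
print(get_sentences(remove_parens(extract)))
```

A real implementation would need language-aware sentence splitting, which is one argument for doing this once on the server rather than in every client.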

Event Timeline

Spage raised the priority of this task from to Needs Triage.
Spage updated the task description.
Spage added subscribers: Spage, Dbrant, MaxSem and 2 others.

Note that we really want to retain some of the parenthetical data in Hovercards. See T91344: Review exclude all approach to parenthetical elements in summary endpoint for details. (There is some agreement that IPA/pronunciation can be excluded, and possibly etymology (which is much rarer), but other details that appear in brackets in the first few sentences can be important. Especially when bracket-addicts like me write articles... ;)
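As a rough sketch of that selective approach (an illustration, not an agreed design), the Python below removes only parentheticals that look like pronunciation guides and keeps everything else. The function name and the marker list in `PRONUNCIATION_HINTS` are hypothetical, and the "/" marker would also match ordinary text such as "and/or", so a real rule set would need to be stricter. Nested parentheses are not handled.

```python
import re

# Hypothetical markers: the thread only says IPA/pronunciation could be
# excluded, not how such parentheticals would be detected.
PRONUNCIATION_HINTS = ("pronounced", "IPA", "/")


def remove_pronunciation_parens(text):
    """Drop non-nested parentheticals that contain a pronunciation marker."""

    def replace(match):
        inner = match.group(1)
        if any(hint in inner for hint in PRONUNCIATION_HINTS):
            # Remove the parenthetical together with its leading whitespace.
            return ""
        # Anything else (dates, clarifications, asides) is kept untouched.
        return match.group(0)

    return re.sub(r"\s*\(([^()]*)\)", replace, text)


sample = "Paris (/ˈpærɪs/) is the capital (and largest city) of France."
print(remove_pronunciation_parens(sample))
```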

This is another big reason why we need to consolidate the logic for this sort of thing. It'd be a nightmare to port all of these exceptions & such across all the platforms. I mentioned this in passing in T99561, but maybe having these snippets be editable is something we should consider more seriously? For example, use a curated snippet, or prompt the user to create one (with the text-extracted first sentence as a starting point)?

> maybe having these snippets be editable is something we should consider more seriously? For example, use a curated snippet, or prompt the user to create one (with the text-extracted first sentence as a starting point)?

This proposal has had several reincarnations over time, notably Concise Wikipedia.

Jdlrobson renamed this task from "move text cleanup and 1-2 sentence logic from the "Link preview" mobile feature into TextExtracts" to "Scrub parentheses and dates from text extract". Sep 18 2015, 6:39 PM
Jdlrobson triaged this task as Low priority.
Jdlrobson set Security to None.
Jdlrobson subscribed.

To be clear, what exactly are you guys scrubbing? Can someone summarise?

This bug needs a clearer description and title. What does scrubbing dates mean? Surely that means only scrubbing dates that are within parentheses, right? If so, it's redundant to mention dates, as the title and description already mention scrubbing everything in parentheses.