Better sentence handling needed in Hovercards for multiple full-stops
Closed, ResolvedPublic

Details

Reference
bz57669

Event Timeline

bzimport raised the priority of this task to Needs Triage.Nov 22 2014, 2:25 AM
bzimport added a project: TextExtracts.
bzimport set Reference to bz57669.
bzimport added a subscriber: Unknown Object (MLST).
MaxSem created this task.Nov 27 2013, 4:45 PM

bingle-admin wrote:

Prioritization and scheduling of this bug is tracked on Mingle card https://wikimedia.mingle.thoughtworks.com/projects/mobile/cards/1460

Gg4u added a comment.Dec 1 2013, 9:48 PM

H. P. Lovecraft: Against the World, Against Life
http://en.wikipedia.org/w/api.php?format=jsonfm&action=query&pageids=17545993&prop=extracts&exsentences=2&exintro&explaintext

Here I obtained a truncated sentence, because the dots in the name cause the extract to be cut off early.

Gg4u added a comment.Dec 2 2013, 11:21 AM

I have more comments on this bug.
It is just a guess, not yet tested on my local machine. Sorry, I am new to wikimedia-dev, and the process for reproducing the bug is still not clear to me.

My guess is that the API truncates the extract by naively counting the dots ('.').

If so, a quick improvement might be to check:
if the character before the dot is a capital letter, or the word before it consists of a single capital letter, truncate at the next dot instead.
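
A minimal Python sketch of the heuristic suggested above, for illustration only. It is not the TextExtracts implementation; the function name `first_sentences` and the sample text are hypothetical. It treats a dot as a sentence end only when the word before it is not a lone capital letter, so abbreviations such as "Co." would still be cut.

```python
import re

def first_sentences(text, count=1):
    """Return the first `count` sentences of `text`.

    A dot followed by whitespace (or end of text) ends a sentence,
    unless the word right before it is a single capital letter,
    which is assumed to be an initial such as "H." or "P.".
    """
    found = 0
    for match in re.finditer(r"\.(?=\s|$)", text):
        dot = match.start()
        # Word characters immediately preceding the dot.
        word = re.search(r"(\w*)$", text[:dot]).group(1)
        if len(word) == 1 and word.isupper():
            continue  # likely an initial, keep scanning
        found += 1
        if found == count:
            return text[:dot + 1]
    return text  # fewer than `count` sentence ends found

# Made-up sample text, loosely based on the example in this task.
sample = ("H. P. Lovecraft: Against the World, Against Life is a work of "
          "literary criticism. It was translated by Dorna Khazeni.")
print(first_sentences(sample))
```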

bingle-admin wrote:

Prioritization and scheduling of this bug is tracked on Mingle card https://wikimedia.mingle.thoughtworks.com/projects/mobile/cards/1478

*** Bug 67841 has been marked as a duplicate of this bug. ***

Per 67841, blanking out instances of the title before searching for a cutoff point would improve many of these cases.
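
A rough Python sketch of that idea, under stated assumptions: the function name `first_sentence_masking_title` is hypothetical, the cutoff search is simplified to "first dot followed by a space", and the sample text is made up. The title is masked with a same-length placeholder so that offsets found in the masked copy still apply to the original text.

```python
def first_sentence_masking_title(text, title):
    """Return the first sentence of `text`, ignoring dots inside `title`."""
    # Same-length placeholder keeps character offsets aligned with `text`.
    masked = text.replace(title, "_" * len(title)) if title else text
    end = masked.find(". ")
    if end == -1:
        return text  # no cutoff point found, return everything
    return text[:end + 1]

# Made-up sample text for illustration.
sample = "D. J. Caruso is a film director. He also produces."
print(first_sentence_masking_title(sample, "D. J. Caruso"))
```

This only helps when the dots come from the title itself; initials elsewhere in the lead sentence would still need separate handling.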

(copy from the merged task:)

The hovercard for the German Wikipedia article of "D. J. Caruso" (https://de.wikipedia.org/wiki/D._J._Caruso) only says "D. J." E.g. the link from https://de.wikipedia.org/wiki/Caruso

A similar problem can be seen at the link to "J.P. Morgan & Co." in https://en.wikipedia.org/wiki/J._P._Morgan_%28disambiguation%29

Reported/discussed at https://www.mediawiki.org/wiki/Topic:S6hl4q8uvi4ux10n

Quiddity renamed this task from Better sentence handling needed to Better sentence handling needed in Hovercards for multiple full-stops.Apr 5 2015, 5:05 PM
Quiddity updated the task description.
Quiddity set Security to None.
Quiddity removed subscribers: Maryana, Unknown Object (MLST).
Tgr added a subscriber: Tgr.Jun 26 2015, 2:33 AM

With an NLP toolkit and three lines of python:

import nltk, requests
# nltk.sent_tokenize needs NLTK's Punkt tokenizer data to be downloaded first
url = 'https://en.wikipedia.org/w/api.php?format=json&action=query&pageids=17545993&prop=extracts&exchars=10000&exintro&explaintext'
pages = requests.get(url).json()['query']['pages']
intro = next(iter(pages.values()))['extract']
print(nltk.sent_tokenize(intro)[0])

will give

H. P. Lovecraft: Against the World, Against Life (French: H. P. Lovecraft : Contre le monde, contre la vie) is a work of literary criticism by French author Michel Houellebecq regarding the works of H. P. Lovecraft.

We should find a place to store extracted summaries and set up one of the open-source toolkits to automatically process new revisions.

In T59669#1403469, @Tgr wrote:

We should find a place to store extracted summaries

That'd be https://www.mediawiki.org/wiki/Requests_for_comment/Text_extraction / T1319.

Tgr added a comment.Jun 26 2015, 7:37 PM

That stores the plaintext version of the article. I meant storing the definition (first X sentences) extracted from that by some text processing tool (which is probably not written in PHP and runs as an external service, since there aren't any serious NLP libraries in PHP). Seems similar to the use case of storing lead image focus points, probably the same solution could be used.

Neat. Do you know how well that works with languages other than English? I couldn't easily find out if NLTK supports anything else.

MaxSem added a comment.Jul 7 2015, 5:39 PM

Ideally, we could use NLTK or similar for splitting into sentences and then just store the text with sentence end markers.
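
As a sketch of what storing text with sentence end markers could look like (purely illustrative; the marker character and the function names `mark_sentences` and `first_n` are assumptions, and the sentence list would come from whatever tokenizer is chosen):

```python
# Hypothetical marker: ASCII "record separator", unlikely to occur in article text.
SENTENCE_END = "\x1e"

def mark_sentences(sentences):
    """Join already-split sentences into one string for storage."""
    return SENTENCE_END.join(sentences)

def first_n(marked, n):
    """Recover the first `n` sentences from stored, marked text."""
    return " ".join(marked.split(SENTENCE_END)[:n])

# Made-up sentences standing in for tokenizer output.
stored = mark_sentences([
    "H. P. Lovecraft was a writer.",
    "He wrote weird fiction.",
    "This is a third sentence.",
])
print(first_n(stored, 2))
```

With this layout the expensive splitting happens once per revision, and serving the first X sentences is a cheap string split.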

Restricted Application added a subscriber: Aklapper.Sep 1 2015, 10:35 PM
kaldari removed a subscriber: kaldari.Sep 9 2015, 9:29 PM
Jdlrobson triaged this task as Medium priority.Sep 18 2015, 8:24 PM
Jdlrobson added a subscriber: Jdlrobson.

@MaxSem: This seems to be fixed for me. I get the following extract ...
"H. P. Lovecraft: Against the World, Against Life (French: H. P. Lovecraft : Contre le monde, contre la vie) is a work of literary criticism by French author Michel Houellebecq regarding the works of H. P. Lovecraft. The English-language edition for the American and UK market was translated by Dorna Khazeni, and features an introduction by American novelist Stephen King."
... from https://en.wikipedia.org/w/api.php?format=jsonfm&action=query&pageids=17545993&prop=extracts&exsentences=2&exintro&explaintext

The other examples seem to be fixed as well.

Jdlrobson closed this task as Resolved.Aug 11 2016, 8:54 PM
Jdlrobson claimed this task.
Jdlrobson added a subscriber: kaldari.

@kaldari reports this as fixed. The description is very vague, so if this is still a problem, please edit the description with better replication steps when reopening.