
Implement a reasonably elegant and non-labor-intensive means of describing/summarizing pages
Closed, Duplicate (Public)

Description

It's handy to have a means of summarizing or describing page contents, so as to generate meta description tags, blurbs for inclusion in feeds, etc. Several approaches have been tried:

  1. Grabbing the first x characters of an article without regard to where sentences cut off (e.g. Extension:Blurb or Extension:TextExtracts)
  2. Using a template, e.g. {{PageSummary|'''[[Humility]]''' is a psychological state, that is the opposite of [[dominance]]. |Humility allows one to see the intrinsic value of others (as opposed to only extrinsic value), and is therefore the largest factor of [[empathy]]. A person with humility therefore sees minors as having intrinsic value, as contrasted with being objects of domination, which they are mostly regarded as being by the laws and practices of the status quo. Like dominance, humility is an innate psychological trait.}} See docs at http://childwiki.net/wiki/Template:PageSummary . This is implemented by Extension:BedellPenDragon. Note that there are two parameters here: parameter #1 for the first sentence of the lead and parameter #2 for the remainder of the lead.
  3. Adding/modifying the description separately from the article text, by means of a separate text box (Extension:Advanced_Meta), a separate page (Extension:ExplicitDescription), or Wikidata.
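
For illustration, the first approach in the list above could look roughly like the Python sketch below. It is not the code of Extension:Blurb or Extension:TextExtracts; the function names `blurb` and `meta_description_tag`, the 160-character default, and the step of backing up to a word boundary are all assumptions made for this example.

```python
import html


def blurb(text: str, max_chars: int = 160) -> str:
    """Cut text to at most max_chars characters, ignoring sentence boundaries.

    Hypothetical helper: the word-boundary back-off is an assumption of this
    sketch, not documented behaviour of any extension.
    """
    collapsed = " ".join(text.split())  # collapse runs of whitespace
    if len(collapsed) <= max_chars:
        return collapsed
    cut = collapsed[:max_chars]
    # If the cut landed inside a word, back up to the previous space.
    if not collapsed[max_chars].isspace() and " " in cut:
        cut = cut[: cut.rfind(" ")]
    return cut.rstrip(" ,;:") + "..."


def meta_description_tag(text: str, max_chars: int = 160) -> str:
    """Wrap the blurb in a meta description tag, escaping HTML special characters."""
    content = html.escape(blurb(text, max_chars), quote=True)
    return '<meta name="description" content="{}">'.format(content)


lead = "Humility is a psychological state that is the opposite of dominance."
print(blurb(lead, 30))  # Humility is a psychological...
print(meta_description_tag(lead))
```

The truncated output shows the drawback of this approach: the blurb stops wherever the character budget runs out, not at the end of a sentence.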

Ideally, we could implement a feature to automatically grab the first sentence of the lead; however, it's hard for software to detect the ends of sentences, since punctuation marks such as the period can appear in the middle of sentences ("Afterward, Mr. Brown went to the U.S. District Courthouse . . . and when he came back, everyone was gone.")
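
To make the difficulty concrete, here is a minimal Python sketch that runs the example sentence through a naive period split and through a small heuristic. The abbreviation list and the function names are assumptions invented for this illustration; the heuristic is not a complete solution and will, for instance, miss a sentence that really does end in "U.S.".

```python
import re

# Hypothetical, deliberately tiny abbreviation list (lowercase, no final period).
ABBREVIATIONS = {"mr", "mrs", "ms", "dr", "prof", "u.s", "e.g", "i.e", "etc"}

EXAMPLE = (
    "Afterward, Mr. Brown went to the U.S. District Courthouse . . . "
    "and when he came back, everyone was gone."
)


def naive_first_sentence(text: str) -> str:
    """Treat the first period followed by a space as the end of the sentence."""
    return text.split(". ")[0] + "."


def first_sentence(text: str) -> str:
    """Heuristic: skip periods after known abbreviations, isolated periods
    (as in a spaced ellipsis), and periods followed by a lowercase word."""
    for match in re.finditer(r"[.!?](?=\s|$)", text):
        before = text[: match.start()]
        # The run of non-space characters immediately before the punctuation.
        token = re.search(r"\S*$", before).group()
        following = text[match.end():].lstrip()
        if not token:
            continue  # isolated period, e.g. part of ". . ."
        if token.lower().lstrip("(\"'") in ABBREVIATIONS:
            continue  # "Mr.", "U.S.", ...
        if following and not following[0].isupper():
            continue  # the next word is lowercase, so the sentence goes on
        return text[: match.end()].strip()
    return text.strip()


print(naive_first_sentence(EXAMPLE))  # Afterward, Mr.
print(first_sentence(EXAMPLE) == EXAMPLE)  # True
print(first_sentence("Dr. Smith arrived. He sat down."))  # Dr. Smith arrived.
```

A hand-maintained abbreviation list like this does not scale across languages or writing styles, which is the core of the problem described above.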

If you have any ideas on the best way to do this, feel free to post them. Thanks.

See also: T59669

Details

Reference
bz59641

Event Timeline

bzimport raised the priority of this task to Lowest. Nov 22 2014, 2:17 AM
bzimport set Reference to bz59641.
bzimport added a subscriber: Unknown Object (MLST).
  • Bug 5335 has been marked as a duplicate of this bug.

(In reply to Nathan Larson from comment #0)

Ideally, we could implement a feature to automatically grab the first
sentence of the lead; however, it's hard for software to detect the ends of
sentences, since punctuation marks such as the period can appear in the
middle of sentences ("Afterward, Mr. Brown went to the U.S. District
Courthouse . . . and when he came back, everyone was gone.")

If you have any ideas on the best way to do this, feel free to post them.
Thanks.

For TextExtracts, that would be bug 57669. And yeah, any insights on better sentence handling would be highly appreciated :)

Tgr set Security to None.
Tgr updated the task description.

This is a common NLP problem called sentence segmentation. Instead of reinventing wheels, just grab some library like OpenNLP or NLTK and see how well it fares?

Aklapper closed this task as a duplicate of T127038: Automatic summarization of articles.
Aklapper subscribed.

Merged into T127038 to keep discussion centralized in one place.