
Add exparagraphs parameter to API
Closed, Declined · Public

Description

Currently it's possible to extract a given number of characters, a given number of sentences, or the whole intro of a wiki page. It would be natural, as well as useful, to also allow extracting a given number of paragraphs.

When querying Wikipedia, for example, I have several times wanted to extract just the first paragraph, because all I want is a short introduction to the topic. The full intro is often too long, a single sentence isn't always enough, and two sentences are sometimes too much and other times not enough. Paragraphs are natural semantic units, so if I want a concise intro to a topic, what I usually want is the first paragraph.
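For reference, the three existing options mentioned above can be sketched as query URLs. This is a minimal illustration, assuming the TextExtracts parameters exchars, exsentences and exintro on the English Wikipedia endpoint; the helper name extract_url is hypothetical, and no exparagraphs parameter exists (it is what this task asks for).

```python
from urllib.parse import urlencode

# Assumed endpoint, used here only for illustration.
API = "https://en.wikipedia.org/w/api.php"


def extract_url(title, **limits):
    """Build a prop=extracts query URL (hypothetical helper, not part of any library)."""
    params = {
        "action": "query",
        "prop": "extracts",
        "format": "json",
        "titles": title,
    }
    # Caller-supplied limits such as exchars, exsentences or exintro.
    params.update(limits)
    return API + "?" + urlencode(params)


by_chars = extract_url("Planet", exchars=300)        # a given number of characters
by_sentences = extract_url("Planet", exsentences=2)  # a given number of sentences
intro_only = extract_url("Planet", exintro=1)        # the whole intro
```

None of these maps onto "the first paragraph", which is the gap described above.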

Event Timeline

Restricted Application added a subscriber: Aklapper. · Sep 2 2016, 6:05 PM
Sophivorus updated the task description. · Sep 2 2016, 6:07 PM
Sophivorus updated the task description.
Jhernandez triaged this task as Low priority. · Sep 7 2016, 3:33 PM
Jhernandez moved this task from Incoming to Triaged but Future on the Readers-Web-Backlog board.
Jhernandez added a subscriber: Jhernandez.

Thanks for the report. It seems like a good idea.

Jdlrobson added a subscriber: Jdlrobson.

The first paragraph is sometimes not what you expect. For example, in the article for Planet the first paragraph needs the list that follows it to make any sense.

<p>A <b>planet</b> is an <a href="/wiki/Astronomical_body" class="mw-redirect" title="Astronomical body">astronomical body</a> <a href="/wiki/Orbit" title="Orbit">orbiting</a> a <a href="/wiki/Star" title="Star">star</a> or <a href="/wiki/Stellar_evolution#Stellar_remnants" title="Stellar evolution">stellar remnant</a> that</p>
<ul>
<li>is massive enough to be <a href="/wiki/Hydrostatic_equilibrium" title="Hydrostatic equilibrium">rounded</a> by its own <a href="/wiki/Gravity" title="Gravity">gravity</a>,</li>
<li>is not massive enough to cause <a href="/wiki/Thermonuclear_fusion" title="Thermonuclear fusion">thermonuclear fusion</a>, and</li>
<li>has <a href="/wiki/Clearing_the_neighbourhood" title="Clearing the neighbourhood">cleared its neighbouring region</a> of <a href="/wiki/Planetesimal" title="Planetesimal">planetesimals</a>.<sup id="cite_ref-footnoteA_1-0" class="reference"><a href="#cite_note-footnoteA-1">[a]</a></sup><sup id="cite_ref-IAU_2-0" class="reference"><a href="#cite_note-IAU-2">[1]</a></sup><sup id="cite_ref-WSGESP_3-0" class="reference"><a href="#cite_note-WSGESP-3">[2]</a></sup></li>
</ul>

@phuedx might be useful to think about as we build out the new summary endpoint?

phuedx added a comment. · Edited · May 11 2017, 12:37 PM

@Jdlrobson: 👍 So should we define the first paragraph as every element, E, that precedes the second <p> element? Should we also have a whitelist for E, e.g. [ 'p', 'ul', 'ol', 'dl' ]?
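A rough sketch of that definition, using only Python's standard html.parser: keep every whitelisted top-level element until the second top-level <p> opens. The names FirstParagraph and first_paragraph are made up for illustration, and the sketch assumes explicit closing tags (as in the Planet snippet above); it is not how any existing endpoint works.

```python
from html import escape
from html.parser import HTMLParser

ALLOWED = ("p", "ul", "ol", "dl")  # the whitelist suggested above
VOID = {"br", "hr", "img", "wbr", "input", "meta", "link"}  # tags with no end tag


class FirstParagraph(HTMLParser):
    """Collect whitelisted top-level elements that precede the second <p>."""

    def __init__(self, allowed=ALLOWED):
        super().__init__(convert_charrefs=True)
        self.allowed = allowed
        self.depth = 0      # nesting depth inside the current top-level element
        self.p_seen = 0     # top-level <p> elements opened so far
        self.keep = False   # is the current top-level element whitelisted?
        self.done = False   # set once the second top-level <p> is reached
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if self.depth == 0:
            if tag == "p":
                self.p_seen += 1
                if self.p_seen == 2:
                    self.done = True
                    return
            self.keep = tag in self.allowed
        if self.keep:
            self.parts.append(self.get_starttag_text())
        if tag not in VOID:
            self.depth += 1

    def handle_endtag(self, tag):
        if self.done or self.depth == 0 or tag in VOID:
            return
        if self.keep:
            self.parts.append(f"</{tag}>")
        self.depth -= 1

    def handle_data(self, data):
        # Text between top-level elements (depth 0) is dropped.
        if not self.done and self.depth > 0 and self.keep:
            self.parts.append(escape(data, quote=False))


def first_paragraph(html_text, allowed=ALLOWED):
    parser = FirstParagraph(allowed)
    parser.feed(html_text)
    parser.close()
    return "".join(parser.parts)
```

With the Planet markup, this would return the opening <p> together with the <ul> that completes it. Nothing here caps the size of the result, so a long list after the first <p> comes back in full.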

Jdlrobson added a comment. · Edited · Jun 20 2017, 4:33 AM

Ping @pmiazga, I think this is what you were talking about in the HTML RESTBase task?

Jdlrobson raised the priority of this task from Low to High. · Jun 22 2017, 4:05 PM
Jdlrobson moved this task from Needs Prioritization to Upcoming on the Readers-Web-Backlog board.

T168625 is high.

@phuedx made a good point that paragraphs can be large. An exparagraphs option could thus actually lead to a large API response.

Also consider the page https://en.wikipedia.org/wiki/Planet. To make sense, an exparagraphs option should also include certain non-paragraph elements such as lists (ul, ol, dl tags). In the case of a long list, e.g. on list pages, this could have undesirable consequences.

Thinking about this some more, it might be better for consumers to request a large exsentences value or to request the lead paragraph (T168625), in which case we may decline this task.
@Sophivorus, as the original submitter of this task, would being able to get only the first paragraph solve the problem you were facing?

Jdlrobson changed the task status from Open to Stalled. · Jun 27 2017, 9:09 PM

@Jdlrobson Yes, being able to extract only the first paragraph would solve my use case.

Jhernandez closed this task as Declined. · Jun 28 2017, 9:11 AM

I'm going to go ahead and decline this one given the proposed use case is being tackled in T168625: Make the Page Summary API return an "intro" for a page.

Also we have some ways of getting the intro content, a bit bigger than the 1st paragraph, but you can do:

And then parse the extract with a DOM parser and get the first P tag if you want.
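The request that produced the extract is not shown in this thread, so the sketch below starts from an HTML string that has already been fetched. It is only an illustration of the "get the first P tag" step using Python's standard html.parser; the names FirstP and first_p_text are hypothetical, and skipping whitespace-only paragraphs is an added assumption rather than something stated above.

```python
from html.parser import HTMLParser


class FirstP(HTMLParser):
    """Collect the text content of the first non-empty <p> element."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.state = "before"  # "before" -> "inside" -> "after" the first <p>
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag == "p" and self.state == "before":
            self.state = "inside"

    def handle_endtag(self, tag):
        if tag == "p" and self.state == "inside":
            if "".join(self.text).strip():
                self.state = "after"
            else:
                # Assumption: an empty or whitespace-only <p> does not count.
                self.state = "before"
                self.text = []

    def handle_data(self, data):
        if self.state == "inside":
            self.text.append(data)


def first_p_text(extract_html):
    parser = FirstP()
    parser.feed(extract_html)
    parser.close()
    return "".join(parser.text).strip()


# Made-up sample input standing in for a fetched extract.
sample_extract = '<p class="mw-empty-elt">\n</p><p>A <b>planet</b> is a body.</p><p>More.</p>'
lead = first_p_text(sample_extract)
```

As noted earlier in the thread, a first <p> on its own can be incomplete (the Planet article needs the list that follows it), so this only covers the simple case.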