
Should we add sentence Boundary Detection to the Page Summary API?
Closed, Declined (Public)

Description

NOTE: Pending discussion and resolution at https://www.mediawiki.org/wiki/Topic:Tuqlildvj3czzpyx.

Background

The apps request page summaries of 5 sentences. Page Previews requests a 525-character extract. For performance reasons, we want to keep the extract as short as possible while still satisfying pre-existing use cases.

AC

  • If the intro consists of one paragraph, then no more than N sentences of that paragraph are returned.
  • If the intro consists of a short paragraph and a list, then it's returned as is.

Notes

  1. @bmansurov mentioned that CXServer's "segmentation" module can do SBD on HTML input for multiple languages. There could be an opportunity to work across teams to make an NPM library of it for consumption by CXServer and MCS.

Event Timeline

ovasileva added a project: Web-Team-Backlog.
ovasileva moved this task from Incoming to Needs Prioritization on the Web-Team-Backlog board.

@bmansurov: In your travels around NLP-land have you found other approaches for SBD on HTML markup?

I've moved this to tracking for the time being, as I don't think it's required if we are using the lead paragraph of text summaries, which is my current plan for the new endpoint.

Can I ask that we hold off creating tasks around the new HTML endpoint until a first version of the new endpoint is in place? I'm expecting the new endpoint not to inherit a lot of the issues with the existing one...

I'd rather create the new endpoint, get it enabled on some wikis for testing, and then iterate off of that... The sentence detection approach is flawed. My feeling is that it's much more sensible to use the <p> tags already in the HTML.

> @bmansurov: In your travels around NLP-land have you found other approaches for SBD on HTML markup?

I have been mainly interested in language independent SBD on plain text. I've found that with enough training data the Punkt system yields good results. The nltk package makes it easy to use the Punkt algorithm. Here is my attempt at creating an unsupervised sentence boundary detection model for the Uzbek language.
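The Punkt idea mentioned above — learning likely abbreviations from unlabeled text so that periods after them aren't treated as sentence ends — can be illustrated with a toy, pure-Python sketch. This is not the nltk implementation or anything used by MCS; the heuristics (token length, how often a token carries a trailing period) are deliberately simplified stand-ins for Punkt's statistical tests:

```python
import re
from collections import Counter

def learn_abbreviations(corpus, threshold=0.5):
    """Toy version of the Punkt idea: a short token that usually appears
    with a trailing period is likely an abbreviation, not a sentence end."""
    with_period = Counter()
    total = Counter()
    for tok in corpus.split():
        bare = tok.rstrip(".").lower()
        if not bare:
            continue
        total[bare] += 1
        if tok.endswith("."):
            with_period[bare] += 1
    return {w for w in total
            if len(w) <= 4 and with_period[w] / total[w] >= threshold}

def split_sentences(text, abbreviations):
    """Split on '.', '!' or '?' unless the preceding token was learned
    as an abbreviation."""
    sentences, start = [], 0
    for m in re.finditer(r"[.!?]\s+", text):
        tokens = text[start:m.start()].split()
        prev = tokens[-1].rstrip(".").lower() if tokens else ""
        if prev in abbreviations:
            continue
        sentences.append(text[start:m.end()].strip())
        start = m.end()
    rest = text[start:].strip()
    if rest:
        sentences.append(rest)
    return sentences

abbrs = learn_abbreviations("Dr. Smith arrived. Mr. Jones left early. The doctor stayed.")
print(split_sentences("Dr. Smith met Mr. Brown. They talked. It was late.", abbrs))
# → ['Dr. Smith met Mr. Brown.', 'They talked.', 'It was late.']
```

The real Punkt algorithm (available via nltk's `PunktTrainer`/`PunktSentenceTokenizer`) additionally learns collocations and frequent sentence starters, which is what makes it usable across languages without hand-written rules.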

CXServer implements something that seems to work reasonably well, but language-specific fine-tuning is needed. CXServer seems to use a SAX-based approach. Here is the CXServer documentation on segmentation. (cc @KartikMistry, who may have more info to add.)

I think our use case is simpler than the use case for Content Translation, because we only have one language at a time to worry about. We don't have the problem of detecting a sentence (with HTML markup inside) and machine translating it while marking up the translated text to match the original markup. It looks like we only need to detect sentence boundaries and keep the HTML elements balanced while returning a given number of sentences. This seems like an easier problem once we agree on how to detect sentence boundaries in plain text.
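The "balance HTML elements while returning a number of sentences" half of the problem can be sketched with Python's standard-library `html.parser`. This is an illustrative sketch only, not MCS or CXServer code: the sentence detector here is a naive "count terminal punctuation" stand-in (a real implementation would plug in proper SBD), and the tag handling assumes reasonably well-formed input:

```python
from html.parser import HTMLParser

class Truncator(HTMLParser):
    """Emit HTML until `limit` sentence endings have been seen,
    then close any elements still open so the output stays balanced."""
    VOID = {"br", "img", "hr", "wbr"}  # tags with no closing counterpart

    def __init__(self, limit):
        super().__init__(convert_charrefs=True)
        self.limit = limit
        self.seen = 0
        self.open_tags = []
        self.out = []
        self.done = False

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        self.out.append(self.get_starttag_text())
        if tag not in self.VOID:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if self.done:
            return
        self.out.append(f"</{tag}>")
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()

    def handle_data(self, data):
        if self.done:
            return
        for i, ch in enumerate(data):
            # Naive boundary check; a real SBD pass would go here.
            if ch in ".!?":
                self.seen += 1
                if self.seen >= self.limit:
                    self.out.append(data[: i + 1])
                    # Close whatever is still open, innermost first.
                    self.out.extend(f"</{t}>" for t in reversed(self.open_tags))
                    self.done = True
                    return
        self.out.append(data)

def truncate_html(html, sentences):
    t = Truncator(sentences)
    t.feed(html)
    return "".join(t.out)

print(truncate_html("<p><b>Foo bar.</b> Baz qux. More text here.</p>", 2))
# → <p><b>Foo bar.</b> Baz qux.</p>
```

The point is just that once boundaries are found, re-balancing is a matter of tracking the open-tag stack at the cut-off point.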

> I've moved this to tracking for the time being, as I don't think it's required if we are using the lead paragraph of text summaries, which is my current plan for the new endpoint.

Thanks!

> Can I ask that we hold off creating tasks around the new HTML endpoint until a first version of the new endpoint is in place? I'm expecting the new endpoint not to inherit a lot of the issues with the existing one...

> I'd rather create the new endpoint, get it enabled on some wikis for testing, and then iterate off of that...

I hadn't yet marked this as blocked on the initial implementation of the API. I have now. There's also a large note at the top of the description saying that this work is pending input on the spec. Along with your changes, that should make the status of this task unambiguous.

For context, I created this task after a conversation with @bmansurov about his 10% time projects and research around NLP and SBD. After listening to him speak about his background work, I remarked that it might be applicable to this project and that I would appreciate his input on how we might approach this problem. I felt a Phab task would be the best place for a technical discussion, as that tends to be where we (Reading Web) have them.

> The sentence detection approach is flawed.

Could you expand on this? What about the approach is flawed?

> I hadn't yet marked this as blocked on the initial implementation of the API. I have now. There's also a large note at the top of the description saying that this work is pending input on the spec. Along with your changes, that should make the status of this task unambiguous.

Thanks for clarity.

> For context

This was the bit I was missing. This makes sense.

> Could you expand on this? What about the approach is flawed?

I don't think a summary should be limited by sentences. A paragraph is defined as "a distinct section of a piece of writing, usually dealing with a single theme and indicated by a new line, indentation, or numbering."

Consider the following text:
"During a trial Sam Smith was accused of stealing one thousand rice krispie squares from a factory in North-East London. The trial lasted six days. A media circus followed. Ultimately, Sam was found not guilty and acquitted of all charges."

Creating an artificial boundary in terms of sentences feels flawed when a well written paragraph is already creating that boundary for you. Imagine enforcing 2-3 sentences in the above example... poor Sam!
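The paragraph-based alternative argued for here — let the author's own `<p>` boundary define the summary instead of counting sentences — is simple to sketch. This is a hypothetical illustration, not the actual endpoint code; the regex-based extraction assumes flat, non-pathological lead-section HTML (nested `<p>` is invalid HTML anyway):

```python
import re

def lead_paragraph(html):
    """Return the content of the first non-empty <p> element, or None.
    Empty lead paragraphs (common in wiki HTML) are skipped."""
    for m in re.finditer(r"<p[^>]*>(.*?)</p>", html, re.DOTALL):
        if m.group(1).strip():
            return m.group(1).strip()
    return None

intro = ("<p></p>"
         "<p>During a trial Sam Smith was accused of stealing one thousand "
         "rice krispie squares. Ultimately, Sam was acquitted.</p>"
         "<p>Early life follows.</p>")
print(lead_paragraph(intro))
```

With this approach poor Sam's whole story survives intact, with no arbitrary sentence cut-off to tune per language.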

Jdlrobson renamed this task from "Add Sentence Boundary Detection to the Page Summary API" to "Should we add sentence Boundary Detection to the Page Summary API?". Aug 8 2017, 7:40 PM
Jdlrobson changed the task status from Open to Stalled.

On discussion

Fjalapeno subscribed.

Per the discussion on the ticket and at an RI meeting, we are going to decline this.

Currently there is no need for sentence-level control, and preserving the introductory paragraph seems to be in keeping with the author's intent.