
Should we add sentence Boundary Detection to the Page Summary API?
Closed, Declined (Public)

Description

NOTE: Pending discussion and resolution at https://www.mediawiki.org/wiki/Topic:Tuqlildvj3czzpyx.

Background

The apps request page summaries of 5 sentences. Page Previews requests a 525-character extract. For performance reasons, we want to keep the extract as short as possible while still satisfying pre-existing use cases.

AC

  • If the intro consists of one paragraph, then no more than N sentences of that paragraph are returned.
  • If the intro consists of a short paragraph and a list, then it's returned as is.

Notes

  1. @bmansurov mentioned that CXServer's "segmentation" module can do SBD on HTML input for multiple languages. There could be an opportunity to work across teams to make an NPM library of it for consumption by CXServer and MCS.

Event Timeline

ovasileva added a project: Web-Team-Backlog.
ovasileva moved this task from Incoming to Needs Prioritization on the Web-Team-Backlog board.

@bmansurov: In your travels around NLP-land have you found other approaches for SBD on HTML markup?

I've moved this to tracking for the time being, as I don't think it's required if we are using the lead paragraph of text summaries, which is my current plan for the new endpoint.

Can I ask that we hold off creating tasks around the new HTML endpoint until a first version of the new endpoint is in place? I'm expecting the new endpoint not to inherit a lot of the issues with the existing one...

I'd rather create the new endpoint, get it enabled on some wikis for testing, and then iterate off of that... The sentence detection approach is flawed. My feeling is that it's much more sensible to use the <p> tags already in the HTML.

> @bmansurov: In your travels around NLP-land have you found other approaches for SBD on HTML markup?

I have been mainly interested in language independent SBD on plain text. I've found that with enough training data the Punkt system yields good results. The nltk package makes it easy to use the Punkt algorithm. Here is my attempt at creating an unsupervised sentence boundary detection model for the Uzbek language.
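The Punkt idea mentioned above — learning likely abbreviations from unlabeled text so that periods after them aren't treated as sentence ends — can be illustrated with a toy, pure-Python sketch. This is not the nltk implementation or anything used by MCS; the heuristics (token length, how often a token carries a trailing period) are deliberately simplified stand-ins for Punkt's statistical tests:

```python
import re
from collections import Counter

def learn_abbreviations(corpus, threshold=0.5):
    """Toy version of the Punkt idea: a short token that usually appears
    with a trailing period is likely an abbreviation, not a sentence end."""
    with_period = Counter()
    total = Counter()
    for tok in corpus.split():
        bare = tok.rstrip(".").lower()
        if not bare:
            continue
        total[bare] += 1
        if tok.endswith("."):
            with_period[bare] += 1
    return {w for w in total
            if len(w) <= 4 and with_period[w] / total[w] >= threshold}

def split_sentences(text, abbreviations):
    """Split on '.', '!' or '?' unless the preceding token was learned
    as an abbreviation."""
    sentences, start = [], 0
    for m in re.finditer(r"[.!?]\s+", text):
        tokens = text[start:m.start()].split()
        prev = tokens[-1].rstrip(".").lower() if tokens else ""
        if prev in abbreviations:
            continue
        sentences.append(text[start:m.end()].strip())
        start = m.end()
    rest = text[start:].strip()
    if rest:
        sentences.append(rest)
    return sentences

abbrs = learn_abbreviations("Dr. Smith arrived. Mr. Jones left early. The doctor stayed.")
print(split_sentences("Dr. Smith met Mr. Brown. They talked. It was late.", abbrs))
# → ['Dr. Smith met Mr. Brown.', 'They talked.', 'It was late.']
```

The real Punkt algorithm (available via nltk's `PunktTrainer`/`PunktSentenceTokenizer`) additionally learns collocations and frequent sentence starters, which is what makes it usable across languages without hand-written rules.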

CXServer implements something that seems to work reasonably well, but language-specific fine-tuning is needed. CXServer seems to use a SAX-based approach. Here is the CXServer documentation on segmentation. (cc @KartikMistry, who may have more info to add.)

I think our use case is simpler than the use case for Content Translation, because we only have one language at a time to worry about. We don't have the problem of detecting a sentence (with HTML markup inside) and machine translating it while marking up the translated text to match the original markup. It looks like we only need to detect sentence boundaries and keep the HTML elements balanced while returning a given number of sentences. This seems like an easier problem once we agree on how to detect sentence boundaries in plain text.
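The "balance HTML elements while returning a number of sentences" half of the problem can be sketched with Python's standard-library `html.parser`. This is an illustrative sketch only, not MCS or CXServer code: the sentence detector here is a naive "count terminal punctuation" stand-in (a real implementation would plug in proper SBD), and the tag handling assumes reasonably well-formed input:

```python
from html.parser import HTMLParser

class Truncator(HTMLParser):
    """Emit HTML until `limit` sentence endings have been seen,
    then close any elements still open so the output stays balanced."""
    VOID = {"br", "img", "hr", "wbr"}  # tags with no closing counterpart

    def __init__(self, limit):
        super().__init__(convert_charrefs=True)
        self.limit = limit
        self.seen = 0
        self.open_tags = []
        self.out = []
        self.done = False

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        self.out.append(self.get_starttag_text())
        if tag not in self.VOID:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if self.done:
            return
        self.out.append(f"</{tag}>")
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()

    def handle_data(self, data):
        if self.done:
            return
        for i, ch in enumerate(data):
            # Naive boundary check; a real SBD pass would go here.
            if ch in ".!?":
                self.seen += 1
                if self.seen >= self.limit:
                    self.out.append(data[: i + 1])
                    # Close whatever is still open, innermost first.
                    self.out.extend(f"</{t}>" for t in reversed(self.open_tags))
                    self.done = True
                    return
        self.out.append(data)

def truncate_html(html, sentences):
    t = Truncator(sentences)
    t.feed(html)
    return "".join(t.out)

print(truncate_html("<p><b>Foo bar.</b> Baz qux. More text here.</p>", 2))
# → <p><b>Foo bar.</b> Baz qux.</p>
```

The point is just that once boundaries are found, re-balancing is a matter of tracking the open-tag stack at the cut-off point.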

> I've moved this to tracking for the time being, as I don't think it's required if we are using the lead paragraph of text summaries, which is my current plan for the new endpoint.

Thanks!

> Can I ask that we hold off creating tasks around the new HTML endpoint until a first version of the new endpoint is in place? I'm expecting the new endpoint not to inherit a lot of the issues with the existing one...

> I'd rather create the new endpoint, get it enabled on some wikis for testing, and then iterate off of that...

I hadn't yet marked this as blocked on the initial implementation of the API. I have now. There's also a large note at the top of the description saying that this work is pending input on the spec. Along with your changes, that should make the status of this task unambiguous.

For context, I created this task after a conversation with @bmansurov about his 10% time projects and research around NLP and SBD. After listening to him speak about his background work, I remarked that it might be applicable to this project and that I would appreciate his input on how we might approach this problem. I felt a Phab task would be the best place for a technical discussion, as that tends to be where we (Reading Web) have them.

> The sentence detection approach is flawed.

Could you expand on this? What about the approach is flawed?

> I hadn't yet marked this as blocked on the initial implementation of the API. I have now. There's also a large note at the top of the description saying that this work is pending input on the spec. Along with your changes, that should make the status of this task unambiguous.

Thanks for clarity.

> For context

This was the bit I was missing. This makes sense.

> Could you expand on this? What about the approach is flawed?

I don't think a summary should be limited by sentences. A paragraph is defined as "a distinct section of a piece of writing, usually dealing with a single theme and indicated by a new line, indentation, or numbering."

Consider the following text:
"During a trial Sam Smith was accused of stealing one thousand rice krispie squares from a factory in North-East London. The trial lasted six days. A media circus followed. Ultimately, Sam was found not guilty and acquitted of all charges."

Creating an artificial boundary in terms of sentences feels flawed when a well written paragraph is already creating that boundary for you. Imagine enforcing 2-3 sentences in the above example... poor Sam!
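The paragraph-based alternative argued for here — let the author's own `<p>` boundary define the summary instead of counting sentences — is simple to sketch. This is a hypothetical illustration, not the actual endpoint code; the regex-based extraction assumes flat, non-pathological lead-section HTML (nested `<p>` is invalid HTML anyway):

```python
import re

def lead_paragraph(html):
    """Return the content of the first non-empty <p> element, or None.
    Empty lead paragraphs (common in wiki HTML) are skipped."""
    for m in re.finditer(r"<p[^>]*>(.*?)</p>", html, re.DOTALL):
        if m.group(1).strip():
            return m.group(1).strip()
    return None

intro = ("<p></p>"
         "<p>During a trial Sam Smith was accused of stealing one thousand "
         "rice krispie squares. Ultimately, Sam was acquitted.</p>"
         "<p>Early life follows.</p>")
print(lead_paragraph(intro))
```

With this approach poor Sam's whole story survives intact, with no arbitrary sentence cut-off to tune per language.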

Jdlrobson renamed this task from "Add Sentence Boundary Detection to the Page Summary API" to "Should we add sentence Boundary Detection to the Page Summary API?". Aug 8 2017, 7:40 PM
Jdlrobson changed the task status from Open to Stalled.

On discussion

Fjalapeno subscribed.

Per the discussion on the ticket and at an RI meeting, we are going to decline this.

Currently there is no need for sentence-level control, and preserving the introductory paragraph seems to be in keeping with the author's intent.