Description

Background

We noticed that the preview for the Egyptian weasel article was omitting the last sentence.

The following sentence is omitted from the text extract:
"It is rated "Least Concern" by the IUCN Red List."

Problem

The TextExtracts API doesn't always return the number of sentences requested via the exsentences query parameter. Sometimes the output is empty.

Minimum Test Case

For the page https://en.wikipedia.beta.wmflabs.org/wiki/Egyptian_weasel_10, when 5 sentences are requested, only 2 are returned.

In this example, the last line is mysteriously ignored.
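
To make the test case easier to repeat, here is a minimal Python sketch that builds the kind of request described above. The parameter names (action=query, prop=extracts, exsentences, titles, format) are the standard TextExtracts query parameters; the /w/api.php path on the Beta Cluster host and the helper name build_extract_url are assumptions for illustration, not something stated in this task.

```python
from urllib.parse import urlencode

# Assumption: the Beta Cluster wiki serves the Action API at /w/api.php.
API_ENDPOINT = "https://en.wikipedia.beta.wmflabs.org/w/api.php"


def build_extract_url(title, sentences):
    """Return a TextExtracts query URL asking for `sentences` sentences of `title`."""
    params = {
        "action": "query",
        "prop": "extracts",
        "titles": title,
        "exsentences": sentences,  # the parameter this task is about
        "format": "json",
    }
    return API_ENDPOINT + "?" + urlencode(params)


url = build_extract_url("Egyptian_weasel_10", 5)
print(url)
```

Opening the printed URL and counting the sentences in the returned extract should show whether fewer than the requested 5 come back.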

Cause

The problem is in ApiQueryExtracts::getFirstSentences.
A PHPUnit test is provided: https://gerrit.wikimedia.org/r/360783

Considerations

Your fix will likely need to keep the markup valid while extracting sentences. While fixing this, consider the impact you might have on T166272: HTML version of text extracts is not balanced/well formed and naive.

Developer notes

Consider reimplementing ExtractFormatter::getFirstSentences using HtmlFormatter for more reliability.
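
To illustrate the idea of counting sentences on text nodes while keeping the markup balanced, here is a rough Python sketch. It is not HtmlFormatter and not the proposed fix: the class and function names are hypothetical, the sentence boundary rule (., ! or ? followed by whitespace or the end of a text node) is a simplifying assumption, and mismatched end tags are simply ignored.

```python
import re
from html import escape
from html.parser import HTMLParser

# Assumed boundary rule: ., ! or ? followed by whitespace or the end of a text node.
SENTENCE_END = re.compile(r"[.!?](?=\s|$)")
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class SentenceExtractor(HTMLParser):
    """Copies markup until `limit` sentence ends have been seen in text nodes."""

    def __init__(self, limit):
        super().__init__(convert_charrefs=True)
        self.limit = limit
        self.count = 0
        self.out = []
        self.open_tags = []  # tags opened but not yet closed in the output

    @property
    def done(self):
        return self.count >= self.limit

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        self.out.append(self.get_starttag_text())
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if self.done:
            return
        # Only well-nested end tags are copied; anything else is dropped.
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self.done:
            return
        for match in SENTENCE_END.finditer(data):
            self.count += 1
            if self.done:
                data = data[:match.end()]  # cut right after the last wanted sentence
                break
        self.out.append(escape(data, quote=False))


def first_sentences_html(html_text, limit):
    """Return the first `limit` sentences of `html_text` with all open tags closed."""
    parser = SentenceExtractor(limit)
    parser.feed(html_text)
    parser.close()  # flush any buffered text
    closing = "".join(f"</{tag}>" for tag in reversed(parser.open_tags))
    return "".join(parser.out) + closing


sample = ('<p>The Egyptian weasel is a species of weasel that lives in northern '
          '<a href="/wiki/Egypt">Egypt</a>. It is rated <b>"Least Concern"</b> '
          'by the IUCN Red List. Extra sentence.</p>')
print(first_sentences_html(sample, 2))
```

Because punctuation is only looked for in text nodes, periods inside attribute values are never counted, and the closing step keeps the output balanced even when the cut falls inside nested elements.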

Sign off steps

Go to: https://en.wikipedia.beta.wmflabs.org/wiki/Category:Articles_with_%27species%27_microformats
Hover over “egyptian weasel”

Observed: extract reads: The Egyptian weasel is a species of weasel that lives in northern Egypt.
Expected: entire first paragraph: The Egyptian weasel is a species of weasel that lives in northern Egypt. It is rated "Least Concern" by the IUCN Red List.
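
The observed/expected comparison can also be checked mechanically. The sketch below is only an illustration: the payload is hand-written in the default action=query response shape (pages keyed by page id) using the observed text from this task, the page id "1" is made up, and the helper names and the plain-text sentence rule are assumptions.

```python
import re


def extracts_from_response(payload):
    """Collect the `extract` strings from a parsed action=query response."""
    pages = payload.get("query", {}).get("pages", {})
    return [page.get("extract", "") for page in pages.values()]


def count_sentences(text):
    """Rough plain-text count: ., ! or ? followed by whitespace or the end of the text."""
    return len(re.findall(r"[.!?](?=\s|$)", text))


# Hand-written sample payload, not captured API output; the page id is made up.
observed = {
    "query": {
        "pages": {
            "1": {
                "title": "Egyptian weasel",
                "extract": "The Egyptian weasel is a species of weasel "
                           "that lives in northern Egypt.",
            }
        }
    }
}
expected_text = ("The Egyptian weasel is a species of weasel that lives in "
                 'northern Egypt. It is rated "Least Concern" by the IUCN Red List.')

for extract in extracts_from_response(observed):
    print(count_sentences(extract), "of", count_sentences(expected_text),
          "expected sentences returned")
```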

Subject	Repo	Branch	Lines +/-
Fix sentence counting handling when using HTML	mediawiki/extensions/TextExtracts	master	+197 -0

Status	Assigned	Task
Resolved	None	T169242 Develop Page Content Service for Reading Clients
Resolved	None	T177425 Develop General Layer of PCS
Resolved	• Jhernandez	T177426 Develop structured JSON APIs for general consumption
Resolved	• Mholloway	T177431 Develop a Summary JSON API
Resolved	Dereckson	T68374 Enable Hovercards on se.wikimedia.org (Swedish chapter wiki)
Resolved	Jdlrobson	T70860 [GOAL] Graduate Page Previews feature (Popups extension) out of Beta Feature
Resolved	ovasileva	T154635 [EPIC] Deploy page previews to English and German Wikipedia
Resolved	ovasileva	T192622 [EPIC] Page previews post-deploy cleanup
Resolved	Jdlrobson	T173952 Remove A/B testing instrumentation code
Duplicate	None	T167433 Switch all projects to the new (and yet to be built) summary-html endpoint for page previews
Duplicate	None	T167429 Make enwiki and dewiki fetch previews from the summary-html RESTBase endpoint
Resolved	ovasileva	T165018 Page previews can consume new summary-HTML endpoint
Declined	Jdlrobson	T111329 [GOAL] Page previews on mobileweb
Resolved	Jdlrobson	T164010 [EPIC] Strengthen the APIs we provide in reading web maintained extensions
Resolved	ovasileva	T113094 [EPIC] The Page Summary API needs to provide useful content for the majority of articles
Declined	Jdlrobson	T168329 exsentences does not work correctly when HTML output used

Event Timeline


Restricted Application added a subscriber: Aklapper. Jun 19 2017, 5:52 PM

ovasileva triaged this task as High priority. Jun 19 2017, 5:59 PM

ovasileva added a parent task: T113094: [EPIC] The Page Summary API needs to provide useful content for the majority of articles.

ovasileva added a subscriber: pmiazga. Jun 19 2017, 6:06 PM

ovasileva added a subscriber: phuedx.

ovasileva moved this task from Incoming to Needs Prioritization on the Web-Team-Backlog board. Jun 19 2017, 6:14 PM

The page summary endpoint on the Beta Cluster is only returning the first sentence in the extract for that page. However, it's returning two+ sentences for other pages. This leads me to believe that it's an issue with TextExtracts.

phuedx updated the task description. Jun 20 2017, 11:14 AM

ovasileva moved this task from Backlog to Next Up on the Page-Previews board. Jun 20 2017, 1:03 PM

phuedx updated the task description. Jun 20 2017, 4:34 PM

Jdlrobson updated the task description. Jun 20 2017, 4:36 PM

Jdlrobson updated the task description.

ovasileva assigned this task to Jdlrobson. Jun 20 2017, 4:39 PM

(I will investigate)

Jdlrobson updated the task description. Jun 21 2017, 9:39 PM

Jdlrobson renamed this task from Preview not displaying complete extract to Preview not displaying complete extract when Taxobox template used. Jun 21 2017, 9:44 PM

Jdlrobson updated the task description.

Jdlrobson updated the task description. Jun 21 2017, 9:49 PM

Jdlrobson updated the task description. Jun 21 2017, 10:02 PM

I'm struggling to replicate this locally.

I'm getting a bit stumped, but basically TextExtracts on the beta cluster is ignoring the last line... Could this relate to $wgUseTidy? I don't see any issues in TextExtracts itself and I'm not sure how to debug this any further.

Jdlrobson renamed this task from Preview not displaying complete extract when Taxobox template used to Preview is stripping last sentence mysteriously. Jun 21 2017, 11:39 PM

I can replicate this \o/ wooo! Adding a test case...

Jdlrobson renamed this task from Preview is stripping last sentence mysteriously to exsentences does not work correctly when HTML output used. Jun 21 2017, 11:51 PM

Jdlrobson updated the task description.

Change 360783 had a related patch set uploaded (by Jdlrobson; owner: Jdlrobson):
[mediawiki/extensions/TextExtracts@master] Test case for T168329

https://gerrit.wikimedia.org/r/360783

gerritbot added a project: Patch-For-Review. Jun 21 2017, 11:53 PM

Hopefully this is ready for estimation now. I've tried to explain the problem as thoroughly as possible.

Jdlrobson moved this task from Upcoming to Triaged but Future on the Web-Team-Backlog board. Jun 22 2017, 5:21 PM

Jdlrobson moved this task from Triaged but Future to Upcoming on the Web-Team-Backlog board.

Jdlrobson mentioned this in T166272: HTML version of text extracts is not balanced/well formed and naive. Jun 22 2017, 5:36 PM

Jdlrobson mentioned this in T113094: [EPIC] The Page Summary API needs to provide useful content for the majority of articles. Jun 22 2017, 5:46 PM

Jdlrobson mentioned this in T168332: HTML previews' layout breaks text multi-line text truncation. Jun 22 2017, 6:16 PM

phuedx updated the task description. Jun 22 2017, 6:26 PM

Jdlrobson removed Jdlrobson as the assignee of this task. Jun 22 2017, 11:28 PM

Jdlrobson subscribed.

Jdlrobson updated the task description. Jun 27 2017, 4:27 PM

ovasileva set the point value for this task to 8. Jun 27 2017, 4:34 PM

ovasileva added a project: Readers-Web-Kanbanana-Board-Old. Jun 27 2017, 4:51 PM

Regarding T168329#3383165: the Reading Web team are aware that this might require a fundamental change to sentence processing. With that in mind we bumped the estimate from a 5 to an 8 to give ourselves time to investigate/document exactly what's going on here, plan, find edge cases, plan a little more, and then make a change.

Jdlrobson moved this task from Upcoming to 2016-17 Q4 on the Web-Team-Backlog board. Jun 27 2017, 5:24 PM

Jdlrobson moved this task from To Do to Needs Design Review on the Readers-Web-Kanbanana-Board-Old board. Jun 28 2017, 5:16 PM

I'm starting this by writing some test cases. Please feel free to contribute any more in follow ups - I'll fold them in as you do!

Also wowser. This one is going to be quite the challenge;)

Change 360783 had a related patch set uploaded (by Jdlrobson; owner: Jdlrobson):
[mediawiki/extensions/TextExtracts@master] WIP: Fix sentence counting handling when using HTML

https://gerrit.wikimedia.org/r/360783

gerritbot added a project: Patch-For-Review. Jun 30 2017, 11:02 PM

Jdlrobson moved this task from Doing to Needs Code Review on the Readers-Web-Kanbanana-Board-Old board. Jul 3 2017, 8:22 PM

ovasileva moved this task from Next Up to In Development on the Page-Previews board. Jul 5 2017, 12:36 PM

We talked about this task and T168329 during prioritisation/standup/goals time and decided we were working on this a little prematurely and thus feeling pain. We plan to wait on decisions inside T113094 that will tell us how we continue maintaining TextExtracts/the new services endpoint and how we want to sustain this going forward.

Jdlrobson mentioned this in T159065: Fix formulas in HTML extracts. Jul 5 2017, 5:37 PM

Jdlrobson edited projects, added Web-Team-Backlog; removed Readers-Web-Kanbanana-Board-Old. Jul 5 2017, 11:12 PM

Jdlrobson merged a task: T73671: Sentence extraction needs to handle HTML better. Jul 5 2017, 11:37 PM

Jdlrobson added subscribers: MaxSem, Ricordisamoa, • wikibugs-l-list.

Jdlrobson merged a task: T167045: exsentences incorrectly returns empty text extract. Jul 6 2017, 7:00 PM

Jdlrobson updated the task description.

Jdlrobson added subscribers: Aschroet, Stashbot, • Pchelolo and 3 others.