Page MenuHomePhabricator

exsentences does not work correctly when HTML output used
Closed, DeclinedPublic8 Estimated Story Points

Description

Background

We noticed that the preview for the Egyptian weasel article was omitting the last sentence.

  1. https://en.wikipedia.beta.wmflabs.org/wiki/Special:ApiSandbox#action=query&format=json&prop=extracts&titles=Egyptian_weasel&exintro=1
  2. https://en.wikipedia.beta.wmflabs.org/wiki/Special:ApiSandbox#action=query&format=json&prop=extracts&titles=Egyptian_weasel&exsentences=5&exintro=1

The following text is omitted:
"It is rated "Least Concern" by the IUCN Red List." is omitted from the text extract.

Problem

There is a bug in the TextExtracts API in that it doesn't always return the number of sentences required via the exsentences query parameter. Sometimes the output is empty.

Minimum Test Case

For the page https://en.wikipedia.beta.wmflabs.org/wiki/Egyptian_weasel_10 when 5 sentences are requested only 2 are given.

In both examples, the last line is mysteriously ignored.

Cause

The problem is in ApiQueryExtracts::getFirstSentences
A PHP unit test is provided: https://gerrit.wikimedia.org/r/360783

Considerations

Your fix will likely have to take into account maintaining valid markup while extracting sentences. While fixing this, consider the impact you might have on T166272: HTML version of text extracts is not balanced/well formed and naive.


Developer notes

Consider reimplementing ExtractFormatter::getFirstSentences using HtmlFormatter for more reliability.

Sign off steps

  1. Go to: https://en.wikipedia.beta.wmflabs.org/wiki/Category:Articles_with_%27species%27_microformats
  2. Hover over “egyptian weasel”

Observed: extract reads: The Egyptian weasel is a species of weasel that lives in northern Egypt.
Expected: entire first paragraph: The Egyptian weasel is a species of weasel that lives in northern Egypt. It is rated "Least Concern" by the IUCN Red List.

Related Objects

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

The page summary endpoint on the Beta Cluster is only returning the first sentence in the extract for that page. However, it's returning two+ sentences for other pages. This leads me to believe that it's an issue with TextExtracts.

Jdlrobson updated the task description. (Show Details)
Jdlrobson renamed this task from Preview not displaying complete extract to Preview not displaying complete extract when Taxobox template used.Jun 21 2017, 9:44 PM
Jdlrobson updated the task description. (Show Details)
Jdlrobson updated the task description. (Show Details)

I'm struggling to replicate this locally.

Getting a bit stumped but basically TextExtracts on the beta cluster is ignoring the last line... Could this relate to $wgUseTidy ? I don't see any issues in TextExtracts itself and not sure how to debug this any more.

Jdlrobson renamed this task from Preview not displaying complete extract when Taxobox template used to Preview is stripping last sentence mysteriously.Jun 21 2017, 11:39 PM
Jdlrobson updated the task description. (Show Details)

I can replicate this \o/ wooo! Adding a test case...

Jdlrobson renamed this task from Preview is stripping last sentence mysteriously to exsentences does not work correctly when HTML output used.Jun 21 2017, 11:51 PM
Jdlrobson updated the task description. (Show Details)

Change 360783 had a related patch set uploaded (by Jdlrobson; owner: Jdlrobson):
[mediawiki/extensions/TextExtracts@master] Test case for T168329

https://gerrit.wikimedia.org/r/360783

Jdlrobson updated the task description. (Show Details)
Jdlrobson moved this task from Needs Prioritization to Upcoming on the Web-Team-Backlog board.

Hopefully this is ready for estimation now. I've tried to explain the problem as thoroughly as possible.

Jdlrobson subscribed.
ovasileva set the point value for this task to 8.Jun 27 2017, 4:34 PM

Regarding T168329#3383165: the Reading Web team are aware that this might require a fundamental change to sentence processing. With that in mind we bumped the estimate from a 5 to an 8 to give ourselves time to investigate/document exactly what's going on here, plan, find edge cases, plan a little more, and then make a change.

I'm starting this by writing some test cases. Please feel free to contribute any more in follow ups - I'll fold them in as you do!

Also wowser. This one is going to be quite the challenge;)

Change 360783 had a related patch set uploaded (by Jdlrobson; owner: Jdlrobson):
[mediawiki/extensions/TextExtracts@master] WIP: Fix sentence counting handling when using HTML

https://gerrit.wikimedia.org/r/360783

Jdlrobson changed the task status from Open to Stalled.Jul 5 2017, 5:37 PM
Jdlrobson removed a project: Web-Team-Backlog.
Jdlrobson moved this task from 2016-17 Q4 to Needs Prioritization on the Web-Team-Backlog board.

We talked about this task and T168329 during prioritisation/standup/goals time and decided we were working on this a little prematurely and thus feeling pain. We plan to wait on decisions inside T113094 that will tell us how we continue maintaining TextExtracts/the new services endpoint and how we want to sustain this going forward.

@phuedx and I have made the decision that this will not be fixed. T170617 will make sure that HTML behaviour has problems in these circumstances.

Change 360783 abandoned by Jdlrobson:
Fix sentence counting handling when using HTML

https://gerrit.wikimedia.org/r/360783