Background
We noticed that the preview for the Egyptian weasel article was omitting the last sentence.
- https://en.wikipedia.beta.wmflabs.org/wiki/Special:ApiSandbox#action=query&format=json&prop=extracts&titles=Egyptian_weasel&exintro=1
- https://en.wikipedia.beta.wmflabs.org/wiki/Special:ApiSandbox#action=query&format=json&prop=extracts&titles=Egyptian_weasel&exsentences=5&exintro=1
The following sentence is omitted from the text extract:
"It is rated "Least Concern" by the IUCN Red List."
Problem
The TextExtracts API does not always return the number of sentences requested via the exsentences query parameter. Sometimes the output is empty.
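For reference, the query behind this report can be assembled as in the minimal sketch below. The parameter names come from the ApiSandbox links in the Background section; the endpoint path /w/api.php and the helper name build_extract_query are assumptions made for illustration.

```python
from urllib.parse import urlencode

# Assumption: the beta cluster exposes the action API at the usual /w/api.php path.
API_ENDPOINT = "https://en.wikipedia.beta.wmflabs.org/w/api.php"


def build_extract_query(title, sentences=None, intro=True):
    """Build a TextExtracts query URL (hypothetical helper, not part of the extension)."""
    params = {
        "action": "query",
        "format": "json",
        "prop": "extracts",
        "titles": title,
    }
    if sentences is not None:
        # Number of sentences requested from the extract.
        params["exsentences"] = sentences
    if intro:
        # Restrict the extract to the content before the first section.
        params["exintro"] = 1
    return API_ENDPOINT + "?" + urlencode(params)


url = build_extract_query("Egyptian_weasel", sentences=5)
```

Fetching this URL and counting the sentences in the returned extract should show whether fewer sentences come back than were requested.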
Minimum Test Case
For the page https://en.wikipedia.beta.wmflabs.org/wiki/Egyptian_weasel_10, when 5 sentences are requested, only 2 are returned.
In both examples, the last line is mysteriously ignored.
Cause
The problem is in ApiQueryExtracts::getFirstSentences.
A PHP unit test is provided: https://gerrit.wikimedia.org/r/360783
Considerations
Any fix will likely need to keep the markup valid while extracting sentences. While fixing this, consider the impact on T166272 ("HTML version of text extracts is not balanced/well formed and naive").
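One way to guard against unbalanced output is to check that every extract closes the tags it opens. The sketch below is only an illustration using Python's standard html.parser; it is not how TextExtracts works, and the names BalanceChecker and is_balanced are hypothetical.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta", "wbr"}


class BalanceChecker(HTMLParser):
    """Track open tags on a stack and flag any mismatched closing tag."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.balanced = True

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        # A closing tag must match the most recently opened tag.
        if not self.stack or self.stack.pop() != tag:
            self.balanced = False


def is_balanced(fragment):
    """Return True if every opened tag in the fragment is closed in order."""
    checker = BalanceChecker()
    checker.feed(fragment)
    checker.close()
    return checker.balanced and not checker.stack
```

A check like this could run against the HTML extract in a unit test, so that a change to sentence extraction that cuts an element in half is caught early.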
Developer notes
Consider reimplementing ExtractFormatter::getFirstSentences using HtmlFormatter for more reliability.
Sign off steps
- Go to: https://en.wikipedia.beta.wmflabs.org/wiki/Category:Articles_with_%27species%27_microformats
- Hover over "Egyptian weasel"
Observed: extract reads: The Egyptian weasel is a species of weasel that lives in northern Egypt.
Expected: entire first paragraph: The Egyptian weasel is a species of weasel that lives in northern Egypt. It is rated "Least Concern" by the IUCN Red List.
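The observed and expected strings above can be compared mechanically during sign off. This is a minimal sketch; the helper name missing_tail is hypothetical and the two strings are copied from the steps above.

```python
EXPECTED = (
    "The Egyptian weasel is a species of weasel that lives in northern Egypt. "
    'It is rated "Least Concern" by the IUCN Red List.'
)
OBSERVED = "The Egyptian weasel is a species of weasel that lives in northern Egypt."


def missing_tail(expected, observed):
    """Return the text that was cut from the end of the extract.

    Returns None when the observed text is not a prefix of the expected
    text, since the difference is then not a simple truncation.
    """
    if not expected.startswith(observed):
        return None
    return expected[len(observed):].strip()


dropped = missing_tail(EXPECTED, OBSERVED)
```

An empty string from missing_tail means the preview shows the full first paragraph; any other value is the text that was dropped.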