exsentences incorrectly returns empty text extract
Closed, Duplicate (Public)

Description

Example to reproduce:

When visiting https://de.wikipedia.org/wiki/Benutzer:CTHOE/Fototouren with the Android app and clicking the first link (Rothenbach) in the first table, the preview window opens saying: "An error occurred". However, clicking the title of the preview window opens the article correctly.

This holds for several, but not all, of the linked articles in this list. At first glance I could not find out what the difference between these articles is.

Expected behavior: I would expect that if the article exists, the preview can also be generated.

The problem

The following request appears to provide an empty text extract:
https://de.wikipedia.org/wiki/Spezial:ApiSandbox#action=query&format=json&prop=extracts&titles=Rothenbach_(Lindenkreuz)&exsentences=5&exintro=1

Event Timeline

Dbrant subscribed.

This seems to be because the summary endpoint is returning an empty string for these articles. For example,

https://de.wikipedia.org/api/rest_v1/page/summary/Rothenbach_(Lindenkreuz)

In the above link, the response contains an empty extract_html attribute, which is what the Link Preview dialog uses to populate its contents.
This looks like it's an issue in the Content Service, because the TextExtracts API call seems to work correctly:

https://de.wikipedia.org/w/api.php?action=query&format=json&prop=extracts&titles=Rothenbach_(Lindenkreuz)&exchars=512&exlimit=1

bearND subscribed.

The summary endpoint is implemented in RESTBase directly. AFAIK, the app uses the extract property and not the new extract_html property. It looks like when the extract_html value is empty, RESTBase doesn't create the extract property at all.

mobrovac subscribed.

We will look into it on the services side. The error in the Android app seems to come from the fact that the extract field is missing in the response. It would be good for the app to be able to handle missing fields in general, though, to increase its user friendliness. In this concrete case, whether the field is missing or is empty, the net result for the user should be the same (this is not to say that it is OK for the field to be empty/missing).
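A minimal sketch of the tolerant handling suggested here, written in Python rather than the app's own Java for brevity. It assumes the summary response has already been parsed into a dict; the function name preview_text is hypothetical and is not taken from the app's code.

```python
from typing import Any, Mapping, Optional


def preview_text(summary: Mapping[str, Any]) -> Optional[str]:
    """Return the text to show in a link preview, or None if there is none.

    A missing "extract" field and an empty one are treated the same way,
    so the caller only has to handle a single "nothing to show" case.
    """
    value = summary.get("extract")
    if isinstance(value, str) and value.strip():
        return value
    return None
```

With this shape the caller can fall back to a neutral placeholder when None comes back, instead of surfacing "An error occurred".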

+1 for the app handling missing properties, but in this case the fix should be done on the RESTBase side. In the response schema we state that the extract property is required, so in this case the response doesn't conform to the schema.
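To illustrate the schema point, here is a rough Python sketch of checking a response for required properties and filling in the missing ones with an empty string. This is not the actual RESTBase change (RESTBase is not written in Python); REQUIRED_FIELDS and both function names are assumptions made up for the example, and only the extract field named in this thread is listed.

```python
from typing import Any, Dict, List, Mapping

# Assumption: only "extract" is listed because it is the one property this
# thread names as required; the real response schema may require more.
REQUIRED_FIELDS = ("extract",)


def missing_required(summary: Mapping[str, Any]) -> List[str]:
    """List the required properties that are absent from the response."""
    return [field for field in REQUIRED_FIELDS if field not in summary]


def with_required_fields(summary: Mapping[str, Any]) -> Dict[str, Any]:
    """Return a copy in which every required property is present.

    Absent properties are filled with an empty string, so the response
    conforms to the schema even when there is no extract text.
    """
    fixed = dict(summary)
    for field in REQUIRED_FIELDS:
        fixed.setdefault(field, "")
    return fixed
```

Filling in an empty string only restores schema conformance; it does not explain why the extract is empty in the first place.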

Mentioned in SAL (#wikimedia-operations) [2017-06-09T20:56:48Z] <mobrovac@tin> Started deploy [restbase/deploy@4e5cb35]: Ensure the extract field is always present in the summary response - T167045

Mentioned in SAL (#wikimedia-operations) [2017-06-09T21:01:44Z] <mobrovac@tin> Finished deploy [restbase/deploy@4e5cb35]: Ensure the extract field is always present in the summary response - T167045 (duration: 04m 57s)

Mentioned in SAL (#wikimedia-operations) [2017-06-09T21:02:18Z] <mobrovac@tin> Started deploy [restbase/deploy@4e5cb35]: Ensure the extract field is always present in the summary response - T167045 (take #2)

Mentioned in SAL (#wikimedia-operations) [2017-06-09T21:07:41Z] <mobrovac@tin> Finished deploy [restbase/deploy@4e5cb35]: Ensure the extract field is always present in the summary response - T167045 (take #2) (duration: 05m 23s)

We tried to deploy the fix for RESTBase, but ran into problems, so we will revisit this on Monday. Note that the PR above only ensures that the field is present; it does not address the question of why it is empty.

It would be helpful if you guys could compile a list of titles for which extract is missing so that we can purge those once the fix is deployed.
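One possible way to compile such a list, sketched in Python. It assumes the summary responses have already been fetched and parsed into dicts keyed by title; fetching is left out on purpose, and the function name and the include_empty flag are hypothetical.

```python
from typing import Any, List, Mapping


def titles_missing_extract(
    summaries: Mapping[str, Mapping[str, Any]],
    include_empty: bool = False,
) -> List[str]:
    """Return the sorted titles whose summary lacks the "extract" field.

    With include_empty=True, titles whose extract is present but empty
    are reported as well.
    """
    result = []
    for title, summary in summaries.items():
        if "extract" not in summary:
            result.append(title)
        elif include_empty and not summary["extract"]:
            result.append(title)
    return sorted(result)
```

For example, a response for Rothenbach_(Lindenkreuz) that carries only an empty extract_html would put that title on the list.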

Why is the extract empty?

I think the culprit is somewhere in TextExtracts when exsentences is used. It works fine when exchars is used instead, which Android uses when requesting the extract directly from MW API.
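For anyone who wants to repeat the comparison outside the API sandbox, a small Python sketch that builds the two query URLs and pulls the extract out of a parsed response. The helper names are made up for this example; the parameters are the ones used in this task, and the response shape assumed is the default JSON format, where pages are keyed by page ID under query.pages.

```python
from typing import Any, Mapping, Optional
from urllib.parse import urlencode

API_URL = "https://de.wikipedia.org/w/api.php"


def extracts_url(title: str, **limits: int) -> str:
    """Build a TextExtracts intro query, e.g. exsentences=5 or exchars=512."""
    params = {
        "action": "query",
        "format": "json",
        "prop": "extracts",
        "titles": title,
        "exintro": 1,
    }
    params.update(limits)
    return API_URL + "?" + urlencode(params)


def extract_from_response(response: Mapping[str, Any]) -> Optional[str]:
    """Return the extract of the first page in a parsed response, if any."""
    pages = response.get("query", {}).get("pages", {})
    for page in pages.values():
        return page.get("extract")
    return None


# The two variants compared in this comment.
sentences_url = extracts_url("Rothenbach_(Lindenkreuz)", exsentences=5)
chars_url = extracts_url("Rothenbach_(Lindenkreuz)", exchars=512)
```

Fetching both URLs and comparing the two extract values should show whether a given title is affected.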

https://de.wikipedia.org/wiki/Spezial:ApiSandbox#action=query&format=json&prop=extracts&titles=Rothenbach_(Lindenkreuz)&exsentences=5&exintro=1
"extract": ""

https://de.wikipedia.org/wiki/Spezial:ApiSandbox#action=query&format=json&prop=extracts&titles=Rothenbach_(Lindenkreuz)&exchars=512&exintro=1
"extract": "<p><b>Rothenbach</b> ist ein Weiler von Lindenkreuz im Landkreis Greiz in Th\u00fcringen.</p>\u2026"

I vaguely remember that the Android app team switched from exsentences to exchars since we encountered various issues with the former. It did come with some cost.
The Android app uses some Java code to turn exchars=512 into something like exsentences=2. It takes advantage of Java's BreakIterator.getSentenceInstance() method taking the Locale of the wiki project into consideration. The exchars parameter also seems to add an extra ellipsis character (\u2026) at the end sometimes.
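A very rough Python stand-in for that trimming step, to show the idea only. It is not the app's code and does not use BreakIterator: a naive regular expression splits on sentence-ending punctuation followed by whitespace, so it is not locale-aware and will mis-split abbreviations. It also assumes plain text input with any HTML already stripped; the function name is hypothetical.

```python
import re

# Naive sentence boundary: ".", "!" or "?" followed by whitespace.
# This is a swapped-in simplification, not a locale-aware iterator.
_SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def first_sentences(text: str, count: int) -> str:
    """Keep at most `count` leading sentences of a plain-text extract.

    A trailing ellipsis character (U+2026), as sometimes appended when
    exchars is used, is dropped before splitting.
    """
    cleaned = text.rstrip().rstrip("\u2026").rstrip()
    sentences = _SENTENCE_BOUNDARY.split(cleaned)
    return " ".join(sentences[:count])
```

A locale-aware sentence iterator handles abbreviations and language-specific punctuation far better, which is presumably why the app relies on one.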

Yup, @bearND, there is something funky going on in the TextExtracts extension which will need to be looked at more closely. Also, it seems as if explaintext contributes to that (which we have recently turned off in the RESTBase request to be able to get the HTML equivalent). Adding TextExtracts to that effect.

Jdlrobson renamed this task from "Preview error for several articles" to "exsentences incorrectly returns empty text extract". Jun 12 2017, 3:44 PM
Jdlrobson lowered the priority of this task from High to Medium.
Jdlrobson updated the task description.
Jdlrobson subscribed.

Since this is only impacting one page (that we know of so far), High priority seems a little extreme.