Page MenuHomePhabricator

[Bug] Page summary for lead sections without any bold tags returns "standard" and no extract
Closed, InvalidPublic

Description

The summary endpoint produced an unknown page type, "standard", and no extract for today's enwiki featured article, "Vultee Vengeance in Australian service".

Replicate locally by accessing the specific revision: http://0.0.0.0:6927/en.wikipedia.org/v1/page/summary/Vultee_Vengeance_in_Australian_service/848656723

The response is as follows:

// https://en.wikipedia.org/api/rest_v1/page/summary/Vultee_Vengeance_in_Australian_service
{
  "type": "standard",
  "title": "Vultee Vengeance in Australian service",
  "displaytitle": "Vultee Vengeance in Australian service",
  "namespace": {
    "id": 0,
    "text": ""
  },
  "titles": {
    "canonical": "Vultee_Vengeance_in_Australian_service",
    "normalized": "Vultee Vengeance in Australian service",
    "display": "Vultee Vengeance in Australian service"
  },
  "pageid": 53771831,
  "thumbnail": {
    "source": "https://upload.wikimedia.org/wikipedia/commons/thumb/c/c5/No_12_Squadron_Vengeance_dive_bomber_in_flight_during_December_1943.JPG/320px-No_12_Squadron_Vengeance_dive_bomber_in_flight_during_December_1943.JPG",
    "width": 320,
    "height": 241
  },
  "originalimage": {
    "source": "https://upload.wikimedia.org/wikipedia/commons/c/c5/No_12_Squadron_Vengeance_dive_bomber_in_flight_during_December_1943.JPG",
    "width": 460,
    "height": 347
  },
  "lang": "en",
  "dir": "ltr",
  "revision": "848656723",
  "tid": "df321e1c-7eb6-11e8-bf5b-2aeaebe33987",
  "timestamp": "2018-07-03T11:47:28Z",
  "content_urls": {
    "desktop": {
      "page": "https://en.wikipedia.org/wiki/Vultee_Vengeance_in_Australian_service",
      "revisions": "https://en.wikipedia.org/wiki/Vultee_Vengeance_in_Australian_service?action=history",
      "edit": "https://en.wikipedia.org/wiki/Vultee_Vengeance_in_Australian_service?action=edit",
      "talk": "https://en.wikipedia.org/wiki/Talk:Vultee_Vengeance_in_Australian_service"
    },
    "mobile": {
      "page": "https://en.m.wikipedia.org/wiki/Vultee_Vengeance_in_Australian_service",
      "revisions": "https://en.m.wikipedia.org/wiki/Special:History/Vultee_Vengeance_in_Australian_service",
      "edit": "https://en.m.wikipedia.org/wiki/Vultee_Vengeance_in_Australian_service?action=edit",
      "talk": "https://en.m.wikipedia.org/wiki/Talk:Vultee_Vengeance_in_Australian_service"
    }
  },
  "api_urls": {
    "summary": "https://en.wikipedia.org/api/rest_v1/page/summary/Vultee_Vengeance_in_Australian_service",
    "metadata": "https://en.wikipedia.org/api/rest_v1/page/metadata/Vultee_Vengeance_in_Australian_service",
    "references": "https://en.wikipedia.org/api/rest_v1/page/references/Vultee_Vengeance_in_Australian_service",
    "media": "https://en.wikipedia.org/api/rest_v1/page/media/Vultee_Vengeance_in_Australian_service",
    "edit_html": "https://en.wikipedia.org/api/rest_v1/page/html/Vultee_Vengeance_in_Australian_service",
    "talk_page_html": "https://en.wikipedia.org/api/rest_v1/page/html/Talk:Vultee_Vengeance_in_Australian_service"
  },
  "extract": "",
  "extract_html": ""
}

Popups only understands:

const previewTypes = {
	/** empty preview */
	TYPE_GENERIC: 'generic',
	/** standard preview */
	TYPE_PAGE: 'page',
	/** disambiguation preview */
	TYPE_DISAMBIGUATION: 'disambiguation'
};

And only defaults to TYPE_PAGE when an extract exists.

Developer notes

extractLeadIntroduction is returning an empty string. Input looks like this:

<html><head></head><body><p><span>
</span>

</p>

<p about="#mwt4">|}
The <a href="./Royal_Australian_Air_Force" title="Royal Australian Air Force">Royal Australian Air Force</a> (RAAF) operated <a href="./Vultee_A-31_Vengeance" title="Vultee A-31 Vengeance">Vultee Vengeance</a> <a href="./Dive_bomber" title="Dive bomber">dive bombers</a> during <a href="./World_War_II" title="World War II">World War II</a>. The Australian Government ordered 297 of the type in late 1941 as part of efforts to expand the RAAF. This order was later increased to 400 aircraft. A few Vengeances arrived in Australia during 1942, and large-scale deliveries commenced in early 1943; further orders were cancelled in 1944 after 342 had been delivered.

The RAAF was slow to bring its Vengeances into service, their first combat missions being flown in June 1943. The main deployment of the type took place between mid-January and early March 1944, when <a href="./Squadron_(aviation)" title="Squadron (aviation)">squadrons</a> operated in support of Australian and United States Army forces in New Guinea. This force was withdrawn after only six weeks as the Vengeance was considered inferior to other aircraft available to the Allied air forces. All of the RAAF's five Vengeance-equipped squadrons were re-equipped with <a href="./Consolidated_B-24_Liberator" title="Consolidated B-24 Liberator">Consolidated B-24 Liberator</a> heavy bombers. Vengeances continued to be used in training and support roles with the RAAF until 1946, and some were transferred to the <a href="./Royal_Australian_Navy" title="Royal Australian Navy">Royal Australian Navy</a> between 1948 and 1950 for ground training.

Historians' assessments of the Vengeance's career in Australian service differ. While there is consensus that the type was obsolete, some argue that it nevertheless proved successful. Others, including the RAAF's Air Power Development Centre, have judged that the Vengeance's performance was mixed and the type was not suited to Australia's requirements.</p></body></html>

node.querySelector('b') returns undefined as there are no child nodes of the lead that are b tags.
This seems to be an intentional check added in T186847, which seems a little misguided and Wikimedia wiki specific as it means if a paragraph has no bold tag inside the lead, the lead and thus the summary will always be empty.

Event Timeline

Jdlrobson renamed this task from [Bug] Page summary for some pages has unknown type "standard" and no extract to [Bug] Page summary for lead sections without any bold tags returns "standard" and no extract.Jul 3 2018, 8:10 PM
Jdlrobson reopened this task as Open.
Jdlrobson updated the task description. (Show Details)
Jdlrobson edited projects, added Web-Team-Backlog (Tracking); removed Web-Team-Backlog.

@bearND only just saw https://phabricator.wikimedia.org/T198730#4395150 and see you debugged it earlier. whoops.

This seems to be an intentional check added in T186847, which seems a little misguided and Wikimedia wiki specific as it means if a paragraph has no bold tag inside the lead, the lead and thus the summary will always be empty.

@bearND with respect to T186847, I would advise iterating through elements and explicitly removing any nodes with a style that matches display:none. It think this would be less error prone. Given one day we will also replace all "b" tags with "strong" tags this is also a little brittle. If that happened, all summaries would be empty

If we keep the current approach I personally don't think it's correct to return an empty string here, given a non-empty and in this situation, correct, summary can be generated.

This seems to be an intentional check added in T186847, which seems a little misguided and Wikimedia wiki specific as it means if a paragraph has no bold tag inside the lead, the lead and thus the summary will always be empty.

The check for a bold child element is not an absolute requirement. It only acts as a hint to prefer this paragraph if if includes one. After a workaround of a Parsoid bug was applied to the page mentioned in the task description it did select the correct paragraph. Note that there is still no bold tag in this paragraph.

If we keep the current approach I personally don't think it's correct to return an empty string here, given a non-empty and in this situation, correct, summary can be generated.

After a workaround of a Parsoid bug was applied to the page mentioned in the task description it did select the correct paragraph.

I guess my concern here is that this will happen again and time will be wasted debugging this again. Maybe instead we could flag this as a warning in the API? Given it seems to relate to unbalanced HTML I wonder if there is a way to flag that?

After a workaround of a Parsoid bug was applied to the page mentioned in the task description it did select the correct paragraph.

Are you saying there is an upstream Parsoid bug that if fixed will mean this will never happen again? If so, what is the number? Could we point to that in the code via a comment?

The upstream Parsoid bug I was referring to is T126834. See comment T198730#4395443.

I guess my concern here is that this will happen again and time will be wasted debugging this again. Maybe instead we could flag this as a warning in the API?

We would have to come up with a good way to detect this. I'm not seeing a good way right now.

Thanks! Sorry for missing that bug.

I guess my thinking is if no bold tag exists and no class but the lead paragraph is not empty it might be better to throw a 500 or 204 rather than resolve to an empty string and a 200 to make it clearer there is a problem on the server with the wikitext not the server itself.

Right now there is no way to distinguish this from articles which have no lead section and our error handling on the client could be better.

@Niedzielski The possible values for the type field are listed in https://www.mediawiki.org/wiki/Specs/Summary/1.3.0:
"standard", "disambiguation", "no-extract", "mainpage"
https://www.mediawiki.org/wiki/Page_Previews/API_Specification doesn't mention "generic" either. Not sure where that was coming from.

The main change for lead paragraph detection is I543c0bcee8aa5e24a4e94889d6bd0f582eb55373.
If you have ideas on how to improve the lead paragraph detection please let us know.

Restricted Application changed the subtype of this task from "Deadline" to "Task". · View Herald TranscriptJan 16 2019, 6:25 PM

Not sure what column to put this one but it's not really actionable from our side at this time, so I'm going to put it in tracking.

No point in keeping this open if there is nothing actionable. I guess the issue here was an unclosed template tag. It would be great if Parsoid could operate in a script mode where any unbalanced tags cause the whole page to error.