Wikisource Export: Remove Parsoid errors from ebooks
Open, Medium, Public, BUG REPORT

Description

As a Wikisource user, I want the bug in which Parsoid errors are displayed in ebooks to be fixed, so that readers can have a smooth ebook experience and not feel discouraged or confused by unwanted text.

What is the problem?

Error messages in Parsoid's HTML currently get included in ebooks.

I don't think we can do anything about these errors, but could they not be removed when we are parsing Parsoid output?

Steps to reproduce problem
  1. https://wsexport.wmflabs.org/?lang=en&page=The_City_of_Dreadful_Night_and_other_poems&format=epub-3&fonts=
  2. Go to the last page of content (page 117 for me)

Observed behavior: You will see the error Cite error: <ref> tags exist for a group named "errata", but no corresponding <references group="errata"/> tag was found

Acceptance Criteria
  • Investigate the bug in which Parsoid errors display in ebooks and:
    • Try to determine the cause of the issue
    • Share findings in comment in this ticket
    • If possible, issue fix or suggest next steps based on findings
Screenshots (if applicable):

Error as it appears in Parsoid (https://en.wikisource.org/api/rest_v1/page/html/The_City_of_Dreadful_Night_and_other_poems%2FThe_City_of_Dreadful_Night):


Same error as it appears in the epub:

Event Timeline

ifried added a subscriber: ifried.

I have also been able to reproduce the issue (see screenshot example below). We can discuss if there is anything we can do about it at the next estimation.

I think the bigger question here is why this error is not appearing in the wiki page itself: https://en.wikisource.org/wiki/The_City_of_Dreadful_Night_and_other_poems/The_City_of_Dreadful_Night

That page has {{smallrefs|group="errata"}} which resolves to <references group="errata" />, so everything should be fine (and is fine, on wiki).

Could this be a Parsoid-Rendering bug?

The references list is being done via the {{smallrefs}} template, which calls {{#tag:references|{{{refx|}}}|group={{{group|}}}}}

Samwilson added a subscriber: ssastry.

@ssastry this looks like another Parsoid difference in rendering.

ifried updated the task description.

Ok, so, there are two distinct issues here:

  • Legitimate wikitext issues which Parsoid flags as errors. If WS-Export doesn't want to display them, it should strip them from the output. They all have a typeof="mw:Error" attribute. See https://www.mediawiki.org/wiki/Specs/HTML/2.2.0#Error_handling for error handling in general and https://www.mediawiki.org/wiki/Specs/HTML/2.2.0/Extensions/Cite#Error_representation for Cite-specific errors. If you simply suppress nodes with typeof="mw:Error" during WS-Export, you are good to go. There is nothing to do on the Parsoid end here.
  • Parsoid bugs where some piece of wikitext is not being parsed identically to the core parser. We will take a look at errors like these. @Samwilson, do you know if the error reported here is seen on other pages as well? Or is this somehow specific to this one page?
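
For the first case, here is a rough sketch of what suppressing those nodes could look like. It is not WS-Export code: it is standalone Python using only the standard library, and the names ErrorStripper and strip_parsoid_errors are made up for illustration. It assumes the input is Parsoid's serialized HTML with every non-void element explicitly closed, and it drops the whole element carrying mw:Error (including any children), which may be more than you want when the error sits on a node that also carries real content.

```python
from html.parser import HTMLParser

# HTML void elements never get a closing tag, so they must not affect depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class ErrorStripper(HTMLParser):
    """Hypothetical helper: re-emits HTML, omitting mw:Error subtrees."""

    def __init__(self):
        # Keep entities such as &amp; untouched so they can be re-emitted as-is.
        super().__init__(convert_charrefs=False)
        self.out = []
        self.skip_depth = 0  # > 0 while inside an element being dropped

    @staticmethod
    def _is_error(attrs):
        # typeof may hold several space-separated values.
        typeof = dict(attrs).get("typeof") or ""
        return "mw:Error" in typeof.split()

    def handle_starttag(self, tag, attrs):
        if self.skip_depth:
            if tag not in VOID_TAGS:
                self.skip_depth += 1
            return
        if self._is_error(attrs):
            if tag not in VOID_TAGS:
                self.skip_depth = 1
            return
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing syntax such as <br/>: no depth change either way.
        if self.skip_depth or self._is_error(attrs):
            return
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.skip_depth:
            if tag not in VOID_TAGS:
                self.skip_depth -= 1
            return
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(data)

    def handle_entityref(self, name):
        if not self.skip_depth:
            self.out.append(f"&{name};")

    def handle_charref(self, name):
        if not self.skip_depth:
            self.out.append(f"&#{name};")

    def handle_comment(self, data):
        if not self.skip_depth:
            self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def strip_parsoid_errors(html):
    """Return html without any element whose typeof includes mw:Error."""
    parser = ErrorStripper()
    parser.feed(html)
    parser.close()  # flush any buffered trailing text
    return "".join(parser.out)
```

In a real exporter it would probably be cleaner to do the same thing on the parsed DOM (select every element whose typeof contains the mw:Error token and remove it) rather than re-serializing by hand as above.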

Hope that helps.

The error is happening in the same form on https://en.wikisource.org/wiki/Paradise_Regained/Book_1

The error is that "no corresponding <references group="errata"/> tag was found", but there's a template at the end of that page, {{smallrefs|group="errata"}}, which resolves to {{#tag:references|{{{refx|}}}|group={{{group|}}}}}. This seems like it should work, and it is working correctly with the old parser.

I've tried to replicate this on Beta Wikisource. I haven't figured out how to get the error message, but have a look at this page and its Parsoid output. The last test case shows the references list failing to be displayed — I think this is the same error, but just without showing the error message.

It looks like the error has something to do with the fact that the <ref /> is transcluded by the ProofreadPage <pages /> tag. I think PRP is just doing basically '{{:' . $pageName . '}}' so I don't know why it'd be different from the 3rd test case above, but obviously something is changed.

Arlolra triaged this task as Medium priority. (Thu, Mar 25, 7:56 PM)
Arlolra moved this task from Needs Triage to Missing Functionality on the Parsoid board.
Arlolra added a subscriber: Arlolra.

> It looks like the error has something to do with the fact that the <ref /> is transcluded by the ProofreadPage <pages /> tag. I think PRP is just doing basically '{{:' . $pageName . '}}' so I don't know why it'd be different from the 3rd test case above, but obviously something is changed.

There's no native implementation of the Proofread Page extension for Parsoid, so it calls out to the legacy parser to parse the content of the <pages> tag. Since the "errata" references group is somewhere else on the page (only visible to Parsoid), the legacy parser is throwing those reference errors. Generally, those big red user-visible errors are a sign that content is being parsed by the legacy parser.

Also, since the legacy parser is the one trying to parse those references, Parsoid doesn't have visibility into that to display them in the group, which is why it's empty.

The solution is a native implementation of Proofread Page. I've filed T278481 for that. The more general ticket was T110909.