Wikisource Ebooks: Validation error: Duplicate ID '-' [medium]
Closed, Resolved · Public · 5 Estimated Story Points

Description

As a Wikisource user, I want the Epubcheck error (as described below) fixed, so that the epub export process can be more reliable and efficient.

Background: Epubcheck is a tool to validate the conformance of EPUB publications against the EPUB specifications. Right now, Epubcheck is raising errors such as:

ERROR(RSC-005): Emma.epub/OPS/c18_Emma_Volume_1_Chapter_18.xhtml(55,122): Error while parsing file: Duplicate ID '-'

We can find similar errors from Nu Html Checker.

These errors occur because the page numbers defined in ProofreadPage pagelists are commonly used as element IDs in the MediaWiki:Proofreadpage pagenum template:

  • en: <span class="pagenum ws-pagenum" id="{{{num}}}" data-page-number="{{{num}}}" title="{{urlencode:{{{page}}}|WIKI}}">&#8203;</span>
  • fr: <span class="pagenum ws-pagenum" id="{{{num}}}" title="{{{page}}}"></span>
  • eu: <span class="pagenum" id="{{{num}}}" title="{{{page}}}"></span>
  • bn: <span class="pagenum ws-pagenum" id="{{{num}}}" data-page-number="{{{num}}}" title="{{urlencode:{{{page}}}|WIKI}}"></span>

This works really well for actual page numbers, because they can then be used to locate individual pages by URL fragment, e.g. page 3 in https://en.wikisource.org/wiki/Emma/Volume_1/Chapter_1#3. It breaks down, however, when several pages are given the same repeating label (commonly a hyphen, for pages that are outside of the pagination), because the same ID is then emitted more than once.
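To make the failure mode concrete, here is a minimal sketch that collects the id attributes from a fragment of exported markup and reports the ones that repeat. The helper names (`IdCollector`, `duplicate_ids`) and the sample `title` values are made up for illustration; this is not how Epubcheck or WS Export actually performs the check.

```python
from collections import Counter
from html.parser import HTMLParser


class IdCollector(HTMLParser):
    """Collects every id attribute value seen in the markup."""

    def __init__(self):
        super().__init__()
        self.ids = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for bare attributes.
        for name, value in attrs:
            if name == "id" and value is not None:
                self.ids.append(value)


def duplicate_ids(markup):
    """Return the id values that occur more than once, in first-seen order."""
    collector = IdCollector()
    collector.feed(markup)
    collector.close()
    counts = Counter(collector.ids)
    return [value for value in counts if counts[value] > 1]


# Hypothetical sample: two unnumbered pages labelled "-" and one real page 3.
sample = (
    '<span class="pagenum ws-pagenum" id="-" title="Page:Example.djvu/1"></span>'
    '<span class="pagenum ws-pagenum" id="-" title="Page:Example.djvu/2"></span>'
    '<span class="pagenum ws-pagenum" id="3" title="Page:Example.djvu/9"></span>'
)
print(duplicate_ids(sample))  # ['-']
```

The real page number stays usable as a fragment target, while the repeated hyphen label is what a validator would flag.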

One fix might be to pass a new id parameter to this template, that's ensured to be unique (within separate <pages /> invocations only; would that be sufficient?).

(The example error above should actually be avoided by not transcluding the hyphen pages, because they're blank anyway, but this issue is still valid I think in that we should not be producing duplicate IDs.)

Acceptance Criteria:

  • Fix the Epubcheck bug, as described above, so such errors no longer occur.
  • Look into whether the following idea fixes the issue: Pass a new id parameter to this template, that's ensured to be unique (within separate <pages /> invocations only).

Event Timeline

Restricted Application added subscribers: Liuxinyu970226, Aklapper.
ifried renamed this task from Validation error: Duplicate ID '-' to Wikisource Ebooks: Validation error: Duplicate ID '-'. Jun 11 2020, 10:42 PM
ifried updated the task description.

No, but users might notice if they download an ebook and attempt to use page anchors to link to specific places in the text (they'll only be able to get to the first occurrence of any duplicated ID).

ifried renamed this task from Wikisource Ebooks: Validation error: Duplicate ID '-' to Wikisource Ebooks: Validation error: Duplicate ID '-' [medium]. Jul 9 2020, 11:44 PM
ifried moved this task from Needs Discussion to Up Next on the Community-Tech board.

@Samwilson Pinging you so you remember to add in the technical idea we have for how to implement the fix. Thanks!

The idea is to add a new id parameter for [[mw:Proofreadpage pagenum template]], and if a page label is duplicated we'll append an integer count to the second and subsequent IDs. The initial one needs to stay as the bare label, so as not to break existing fragment links.
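A rough sketch of that numbering scheme, in Python purely for illustration. The function name `unique_page_ids` and the `_` separator are assumptions made here, not anything ProofreadPage defines. A separator of some kind seems necessary: appending a bare count to label "3" would give "32", which could clash with the real page 32.

```python
def unique_page_ids(labels, sep="_"):
    """Map page labels to unique IDs.

    The first occurrence of a label keeps the bare label, so existing
    fragment links such as #3 keep working. Later occurrences get an
    integer count appended (hypothetical format: label + sep + count).
    """
    reserved = set(labels)  # every bare label belongs to its first occurrence
    seen = set()
    next_count = {}
    result = []
    for label in labels:
        if label not in seen:
            seen.add(label)
            result.append(label)
            continue
        count = next_count.get(label, 2)
        candidate = f"{label}{sep}{count}"
        # Skip suffixed candidates that clash with a real label or an issued ID.
        while candidate in reserved:
            count += 1
            candidate = f"{label}{sep}{count}"
        reserved.add(candidate)
        next_count[label] = count + 1
        result.append(candidate)
    return result


print(unique_page_ids(["-", "-", "1", "2", "-", "2"]))
# ['-', '-_2', '1', '2', '-_3', '2_2']
```

Reserving every bare label up front means a generated ID can never steal the anchor of a genuine label that appears later in the pagelist.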

After further discussion, it seems best to also rewrite duplicate IDs during export, so that it doesn't matter whether the wikitext contains duplicates. Here's a patch to make them unique, and fix the leading underscore error at the same time: https://github.com/wsexport/tool/pull/288
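For readers who want a feel for what export-time rewriting could look like, here is a minimal Python sketch that walks well-formed XHTML and renames repeated id attributes. It is not the code in the linked patch (WS Export is a separate codebase); the function name `dedupe_ids`, the `_` separator, and the sample markup are all assumptions. It does not attempt the leading-underscore fix mentioned above.

```python
import xml.etree.ElementTree as ET


def dedupe_ids(xhtml, sep="_"):
    """Return the markup with every repeated id attribute made unique.

    The first element carrying a given id keeps it unchanged; later
    elements get a numeric suffix (hypothetical format: id + sep + count).
    """
    root = ET.fromstring(xhtml)
    # root.iter() yields elements in document order.
    elements = [el for el in root.iter() if el.get("id") is not None]
    used = {el.get("id") for el in elements}
    seen = set()
    for el in elements:
        original = el.get("id")
        if original not in seen:
            seen.add(original)
            continue
        count = 2
        while f"{original}{sep}{count}" in used:
            count += 1
        new_id = f"{original}{sep}{count}"
        used.add(new_id)
        el.set("id", new_id)
    return ET.tostring(root, encoding="unicode")


# Hypothetical fragment with three hyphen-labelled pages and one real page.
fragment = '<div><span id="-"/><span id="-"/><span id="3"/><span id="-"/></div>'
cleaned = dedupe_ids(fragment)
print([el.get("id") for el in ET.fromstring(cleaned).iter() if el.get("id")])
# ['-', '-_2', '3', '-_3']
```

Doing this on the generated markup rather than in the template means the output validates regardless of how each wiki has customised its pagenum template.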

Thank you, @Samwilson!

One question: Since this ticket was estimated as medium (many moons ago!) and we have now switched to Fibonacci numbers, how big would you say this ticket is? 3? 5? Thanks!

Hmm, I'd say 3, but I've been underestimating things lately so should call it a 5. :) There's a bit of "making sure we don't break existing links" to do.

Great, thanks, @Samwilson! I'll mark it as 5 :)

ifried set the point value for this task to 5. Dec 14 2020, 11:31 PM
dom_walden subscribed.

I have not been able to reproduce this issue.

For example, https://wsexport-test.wmflabs.org/?lang=en&page=The_Drums_of_Jeopardy%2FChapter_21&format=epub-3&fonts= does not have duplicate IDs in the epub's HTML.

As Sam says, it was not really a user-facing issue.

As noted by Dom above, the issue is no longer reproducible by our tests & this wasn't really user-facing. I'm marking this as Done.