
[BUG] wsexport cannot generate valid epub with svg images
Closed, Resolved | Public | 2 Estimated Story Points | BUG REPORT

Description

What is the problem?

I cannot open epub files of books which contain SVG images.

The XML in the content.opf of the generated epub is not valid, e.g.:

<item id="c217_c79dd6f560650b235f4970c96c113b7232e92bc9.svg" href="images/c217_c79dd6f560650b235f4970c96c113b7232e92bc9.svg" media-type="image/svg+xml; charset=utf-8; profile="https://www.mediawiki.org/wiki/Specs/SVG/1.0.0"" />

(The quoting is incorrect: the double quotes around the profile URL close the media-type attribute early.)

I suspect this is a bug in the way the strings are concatenated in Epub3Generator.php line ~95.
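A minimal sketch of the suspected failure mode, written in Python rather than the tool's PHP. The `id` and `href` values and the `is_well_formed` helper are hypothetical; only the content-type string comes from the broken `<item>` above.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import quoteattr

# Content type as it appears in the broken <item> element.
media_type = (
    'image/svg+xml; charset=utf-8; '
    'profile="https://www.mediawiki.org/wiki/Specs/SVG/1.0.0"'
)

# Naive concatenation: the inner double quotes end the attribute early.
naive = '<item id="img" href="images/img.svg" media-type="' + media_type + '" />'

# quoteattr() chooses a safe quote character and escapes what it must.
escaped = '<item id="img" href="images/img.svg" media-type=' + quoteattr(media_type) + ' />'


def is_well_formed(xml_text):
    """Return True if xml_text parses as XML (hypothetical helper)."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


print(is_well_formed(naive))    # False
print(is_well_formed(escaped))  # True
```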

Oddly, I can generate valid PDFs of the same books.

Steps to reproduce problem
  1. Go to https://tools.wmflabs.org/wsexport/tool/book.php?lang=fr&format=epub&page=Le_Vingti%C3%A8me_Si%C3%A8cle
  2. Download the epub and try to view it

Event Timeline

Niharika triaged this task as Medium priority. Jul 9 2019, 11:36 PM
Niharika moved this task from Needs Discussion to Up Next on the Community-Tech board.
Niharika set the point value for this task to 2.

You were quite right Dom, that was the issue. But even escaping the quotes didn't make epubcheck happy:

Non-standard image resource of type image/svg+xml; charset=utf-8; profile='https://www.mediawiki.org/wiki/Specs/SVG/1.0.0' found.

So I've made a PR for this (https://github.com/wsexport/tool/pull/192) that discards everything in the content-type string after the first semicolon.
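The idea can be sketched roughly like this. It is a Python illustration only, not the PHP from the PR, and `base_media_type` is a hypothetical name.

```python
def base_media_type(content_type):
    """Keep only the type/subtype part of a Content-Type value.

    Everything after the first semicolon (charset, profile, ...) is dropped.
    """
    return content_type.split(";", 1)[0].strip()


full = (
    'image/svg+xml; charset=utf-8; '
    'profile="https://www.mediawiki.org/wiki/Specs/SVG/1.0.0"'
)
print(base_media_type(full))         # image/svg+xml
print(base_media_type("image/png"))  # image/png
```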

Side-note: we shouldn't be string-building XML. :) I might make another PR for that...
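For illustration, this is what leaving the escaping to an XML library might look like. It is sketched in Python with `xml.etree.ElementTree`, and `manifest_item` is a hypothetical helper; the tool is PHP and would use its own XML API.

```python
import xml.etree.ElementTree as ET


def manifest_item(item_id, href, media_type):
    """Serialize one manifest <item>; the library escapes attribute values."""
    item = ET.Element(
        "item",
        {"id": item_id, "href": href, "media-type": media_type},
    )
    return ET.tostring(item, encoding="unicode")


raw_type = 'image/svg+xml; profile="https://www.mediawiki.org/wiki/Specs/SVG/1.0.0"'
xml_text = manifest_item("img", "images/img.svg", raw_type)

# The serialized element parses back to the same attribute value,
# even though the value contains double quotes.
assert ET.fromstring(xml_text).get("media-type") == raw_type
```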

The above patch is merged and the staging site updated. Ready for QA.

On the test environment, I can generate and view the epub in the reproduction steps. The problematic SVG displays.

It is my understanding that the tool generates the epub first and then converts this into other formats. In case our changing the epub output has any effect when converting to other formats, I also generated the same ebook in all the different formats we support.

As this change affects how we interpret image media types, I found a few ebooks on en.wikisource which have different types of image (gif, tiff, djvu). I was also able to generate and view them.

To check for regressions, I generated epubs of a large number (200+) of random ebooks from en.wikisource on both the test and production environments. I unzipped them and used diff to compare the test and prod versions. The only differences I saw were where test is no longer generating invalid XML (as in the Description).
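A comparison like this could be scripted along the following lines. This is a Python sketch under assumptions, not the procedure actually used for QA: `diff_epubs` and the file paths are hypothetical, and binary members such as images are compared crudely as text.

```python
import difflib
import zipfile


def diff_epubs(test_path, prod_path):
    """Return unified-diff lines for members that differ between two epubs."""
    lines = []
    with zipfile.ZipFile(test_path) as test_zip, zipfile.ZipFile(prod_path) as prod_zip:
        test_names = set(test_zip.namelist())
        prod_names = set(prod_zip.namelist())
        for name in sorted(test_names | prod_names):
            # A member missing from one archive is treated as empty.
            test_data = test_zip.read(name) if name in test_names else b""
            prod_data = prod_zip.read(name) if name in prod_names else b""
            if test_data == prod_data:
                continue
            lines.extend(
                difflib.unified_diff(
                    prod_data.decode("utf-8", errors="replace").splitlines(),
                    test_data.decode("utf-8", errors="replace").splitlines(),
                    fromfile="prod/" + name,
                    tofile="test/" + name,
                    lineterm="",
                )
            )
    return lines
```

Calling `diff_epubs("test/book.epub", "prod/book.epub")` on each pair and printing the returned lines gives one report per book; an empty list means the two archives have identical contents.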

I also ran an epub validator (epubcheck) over all the epubs I generated. No important problems found.

@Tpt - Can you update the production code to incorporate the fix? (or add Sam as a maintainer so he can update it.)

@kaldari Done. The Community Tech team is already a maintainer of the tool.