
PDF is missing text at end of page and beginning of next page
Closed, Resolved · Public

Description

Reported at https://fr.wikisource.org/wiki/Utilisateur:Viticulum/Anomalies_epub/english

Book: L'Île Ste. Hélène. Passé, présent et avenir/Présent

Bug 1: Missing text at end of page and beginning of next page

  • Wikisource : [[Page:Achintre, Crevier - L'Île Ste. Hélène. Passé, présent et avenir, 1876.djvu/67|page 58 in Wikisource]]
  • PDF : page 49-50 : missing text in table : "Ronces noires.", "Catherinettes,", "Low blackberry. Rubus oanadensis."
  • Kobo : Ok (begin-end page at different places)
  • Kindle : Ok (begin-end page at different places)

Another example :

  • Wikisource : [[Page:Achintre, Crevier - L'Île Ste. Hélène. Passé, présent et avenir, 1876.djvu/69|page 60 in Wikisource]]
  • PDF : page 51-52 : missing text in table : "Bouillon blanc. Bonhomme, Moléne. Common Mullein Verbascum alatum"
  • Kobo : Ok (begin-end page at different places)
  • Kindle : Ok (begin-end page at different places)

There are more examples on the following pages.

Event Timeline


I think this bug is due to Calibre's handling of tables in PDF. For example, the image below shows Calibre's rendering (left) vs Pandoc's rendering (which first turns the epub into LaTeX).

comp.png (587×1 px, 112 KB)

The table is actually cropped mid-line in the Calibre example, and I think that's what is happening with the missing text mentioned above.

If we uncover what we think might be bugs or inconsistencies with Calibre's renderings, it would be cool to create issues or bug reports for them. We probably can't fix it but we can help with a bit of research.

I've been trying to tweak the bottom margin (with Calibre's margin-bottom and pdf-page-margin-bottom parameters) but all that does is increase the space below the page number.

Likewise, adding a top margin to the footer area doesn't help with the chopped-off text.

Setting break-inside: avoid-page or page-break-inside: avoid on the captions, tables, cells etc. doesn't do anything either.

If we uncover what we think might be bugs or inconsistencies with Calibre's renderings, it would be cool to create issues or bug reports for them. We probably can't fix it but we can help with a bit of research.

Yes, great idea. I'll do so if I possibly can.

I've reported this upstream at https://bugs.launchpad.net/calibre/+bug/1862732 with a demo epub and PDF attached. It doesn't look like it has anything to do with tables; rather, whatever PDF generation system Calibre uses permits lines to be split across page breaks.

Their response: "Are you using an up to date version of calibre?" Eh....

I had a concern this would be the case. I thought we created a task for upgrading calibre in production?

The new version of Calibre does indeed work! So this is a simple fix: T244837: Upgrade Calibre on wsexport VPSs

Calibre has been upgraded, and in my testing it looks like this issue has been solved. I've asked @Viticulum to check.

Hello Sam,

I have verified 2 books: there is good and bad news.

I have verified the whole book « L'Île Ste. Hélène. Passé, présent et avenir ». I have not detected any missing text, or duplication of text.

Margins: Unfortunately there is an issue with the left and right margins. They are way too large, which reduces the space available for the text. A previous PDF from Sept 2019 had 89 pages. The current version (02/13/20) has 119 pages.

Of course, this implies that missing or duplicated text at end of pages won’t be at the same places. But I have not found any problems.

This causes some unexpected situations:

Chapter « À L’ÎLE STE. HÉLÈNE » (numbered p 6) : the poem is split on 2 lines, and it was not in the previous pdf version.

Page numbered 9: Text « A. ACHINTRE. » : should be right aligned.

Page numbered 26: the list starting with « Hudson-River » should be left-aligned. This could possibly be arranged with a different alignment technique within Wikisource.

Another issue: the font size changes between chapters. These changes do not occur in Wikisource, or in the PDF from Sept 2019.

Page numbered 34, chapter « PALÉONTOLOGIE » : font size smaller
Page numbered 50, chapter « FLORE » : return to bigger font size
Page numbered 67, chapter « GÉOGRAPHIE. » : font size smaller
Page numbered 76, chapter « FAUNE. » : return to bigger font size until end of book

Note: at page numbered 4, text « MM. A. ACHINTRE & J. A. CREVIER, M.D. » : there is something bizarre under M.D. Looks like an underline. Can this be removed ?

Second book tested: Poissons d’eau douce du Canada (Stopped verification at page 200)

I have not detected any missing text, or duplication of text.

The margins caused the PDF to grow from 868 pages (Sept 2019) to 999 pages in the current version.
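To put the page counts reported for the two books side by side, the relative growth can be computed directly. This is only a small arithmetic sketch; the function name `page_growth_percent` is made up for illustration, and the numbers are the ones quoted in this comment.

```python
def page_growth_percent(pages_before, pages_after):
    """Return the relative increase in page count, in percent (one decimal)."""
    return round((pages_after - pages_before) / pages_before * 100, 1)


# Page counts quoted in this thread (Sept 2019 export vs. current export).
ile_ste_helene = page_growth_percent(89, 119)
poissons = page_growth_percent(868, 999)

print(f"L'Île Ste. Hélène: +{ile_ste_helene}% pages")
print(f"Poissons d'eau douce du Canada: +{poissons}% pages")
```

The shorter book grows by roughly a third and the longer one by about 15%, which is consistent with wider margins affecting both exports, though the thread does not say why the two ratios differ.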

Changes of font size:
Page numbered 38, chapter « DES POISSONS » « DESCRIPTION GÉNÉRALE » font size smaller
Page numbered 104, chapter « LA PERCHAUDE » font size seems to me even smaller, but not certain
Page numbered 158, chapter « LE CRAPET CALICOT » return to bigger font size
Page numbered 182, chapter « L’ACHIGAN » font size smaller

If you need to see the previous PDF, please let me know, and tell me how to send it.

Please do not hesitate to ask me to do more testing. I am very happy to help in any way.

Thanks for such detailed notes! I'll work through things one by one.

Margins: Unfortunately there is an issue with the left and right margins. They are way too large, which reduces the space available for the text. A previous PDF from Sept 2019 had 89 pages. The current version (02/13/20) has 119 pages.

I've made a fix for this: https://github.com/wsexport/tool/pull/219

The test instance is updated with the above fix for the margins: https://wsexport-test.wmflabs.org/book.php Can you have a test and see what you think?

Let's create separate bugs for each different issue, e.g.:

It sounds like the missing text issue is fixed. After you've tested and it's been deployed to production, we can close this ticket I think.

Happy to see there is a test environment !
What format should I use?

Good news:
Situation with margin is fixed.
Bug of missing text or duplication is fixed as far as I can see.

New problem:
When opening some PDFs, I get the message « Données insuffisantes pour une image » ("Insufficient data for an image"). We have had this for a few days. It does not happen with every book.
There is also a blank page at the beginning. The cover page image seems to be missing.
Tpt seems to be aware of it, judging from a conversation I have read on French Wikisource.

Book : Dorothée, danseuse de corde

Another problem:
This is limited to my book ! Never had this problem before.

Unable to export book Poissons d’eau douce du Canada in test environment because of: exceeded the timeout of 120 seconds.

Full message:
The process "'ebook-convert' '/tmp/www-data/ws-c0_Poissons_d_eau_douce_du_Canada-19873697986395.epub' '/tmp/www-data/ws-c0_Poissons_d_eau_douce_du_Canada-198731808347599.pdf' '--page-breaks-before' '/' '--paper-size' 'letter' '--pdf-page-margin-bottom' '48' '--pdf-page-margin-top' '60' '--pdf-page-margin-left' '36' '--pdf-page-margin-right' '36' '--pdf-page-numbers' '--preserve-cover-aspect-ratio'" exceeded the timeout of 120 seconds.
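For reference, the behaviour behind this message (a conversion command killed after a fixed time limit) can be sketched in Python. This is not the export tool's own code: the helper names `build_convert_command` and `run_with_timeout` are hypothetical, and the only `ebook-convert` options used are the ones that appear verbatim in the message above.

```python
import subprocess


def build_convert_command(epub_path, pdf_path, paper_size="letter",
                          top=60, bottom=48, left=36, right=36):
    """Build the argument list seen in the error message (values from the thread)."""
    return [
        "ebook-convert", epub_path, pdf_path,
        "--page-breaks-before", "/",
        "--paper-size", paper_size,
        "--pdf-page-margin-bottom", str(bottom),
        "--pdf-page-margin-top", str(top),
        "--pdf-page-margin-left", str(left),
        "--pdf-page-margin-right", str(right),
        "--pdf-page-numbers",
        "--preserve-cover-aspect-ratio",
    ]


def run_with_timeout(command, timeout_seconds):
    """Run a command; return (finished, returncode).

    finished is False (and returncode is None) when the time limit was hit,
    in which case subprocess.run has already killed the child process.
    """
    try:
        completed = subprocess.run(command, capture_output=True,
                                   timeout=timeout_seconds)
    except subprocess.TimeoutExpired:
        return False, None
    return True, completed.returncode
```

With this sketch, a call such as `run_with_timeout(build_convert_command("book.epub", "book.pdf"), 120)` would report `(False, None)` for a book whose conversion takes longer than 120 seconds, which is the situation described here for a large book with many images.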

Unable to export same book in production:
The last export that worked is dated Feb 13, 2020

Full message:
The command "'ebook-convert' '/tmp/www-data/ws-c0_Poissons_d_eau_douce_du_Canada-22088597957127.epub' '/tmp/www-data/ws-c0_Poissons_d_eau_douce_du_Canada-22088863163988.pdf' '--page-breaks-before' '/' '--paper-size' 'a5' '--margin-bottom' '32' '--margin-top' '40' '--margin-left' '24' '--margin-right' '24' '--pdf-page-numbers' '--preserve-cover-aspect-ratio'" failed.

Exit Code: 1(General error)

Working directory: /var/www/tool/public

Output:

No write acces to /var/www/.config/calibre using a temporary dir instead

Error Output:

Traceback (most recent call last):
File "site.py", line 75, in main
File "site-packages/calibre/__init__.py", line 19, in <module>
File "site-packages/calibre/constants.py", line 277, in <module>
File "tempfile.py", line 331, in mkdtemp
File "tempfile.py", line 275, in gettempdir
File "tempfile.py", line 217, in _get_default_tempdir
IOError: [Errno 2] No usable temporary directory found in ['/tmp', '/var/tmp', '/usr/tmp', '/var/www/tool/public']
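The traceback ends in Python's `tempfile` module, which gave up after finding none of the listed directories usable. The thread does not say why (a full disk or a permissions problem are possibilities, not confirmed here). A minimal sketch for checking which candidate directories can actually hold a temporary file, using a hypothetical helper `usable_temp_dirs`:

```python
import tempfile


def usable_temp_dirs(candidates):
    """Return the candidate directories in which a file can actually be created."""
    usable = []
    for directory in candidates:
        try:
            # Creating a real (auto-deleted) file is the decisive check:
            # it fails if the directory is missing, read-only, or out of space.
            with tempfile.TemporaryFile(dir=directory):
                pass
        except OSError:
            continue
        usable.append(directory)
    return usable


# The directory list is copied from the error message above.
print(usable_temp_dirs(["/tmp", "/var/tmp", "/usr/tmp", "/var/www/tool/public"]))
```

An empty list on the server would match the "No usable temporary directory found" error; any directory that does appear in the output is one where a temporary file could be created at the time of the check.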

Thank you

We discussed this today in Estimation.

It appears the issue has been resolved. Do you agree @Viticulum?

If it is resolved, we'll close this ticket.

There are related issues, which we can tackle in separate tickets. @Samwilson has a sense of what these issues are, so perhaps he can write them up as separate issues/tasks. Thank you!

Samwilson claimed this task.

Unable to export book Poissons d’eau douce du Canada in test environment because of: exceeded the timeout of 120 seconds.

The timeout is now 240 seconds, and Poissons d’eau douce du Canada seems to be working.

When opening some PDFs, I get the message « Données insuffisantes pour une image » ("Insufficient data for an image"). We have had this for a few days. It does not happen with every book.
There is also a blank page at the beginning. The cover page image seems to be missing.
Tpt seems to be aware of it, judging from a conversation I have read on French Wikisource.
Book : Dorothée, danseuse de corde

This could have been related to the timeout issue, if not all images were retrieved. I am not able to replicate it now. Can you open a new issue if it is still happening?

L’Île Ste. Hélène. Passé, présent et avenir
Chapter « À L’ÎLE STE. HÉLÈNE » (numbered p 6) : the poem is split on 2 lines, and it was not in the previous pdf version.
Page numbered 9: Text « A. ACHINTRE. » : should be right aligned.
Page numbered 26: the list starting with « Hudson-River » should be left-aligned. This could possibly be arranged with a different alignment technique within Wikisource.

"A. Achintre." and the "Hudson-River" list are formatted like this in Wikisource, and any fix should happen there.

I don't think there are any more outstanding issues in any of the comments above.

Let's open separate tickets for other formatting issues.

Hi @Samwilson and @ifried

  • Margin: Ok
  • Missing or duplicate text: Ok
  • When opening some PDFs, I get the message « Données insuffisantes pour une image » ("Insufficient data for an image").
    • Note: this happens when opening the PDF file on a PC after exporting it with the test export tool.
    • This is still happening in the test environment.

In the test environment (https://wsexport-test.wmflabs.org/book.php) I am still getting the 120-second timeout:

The process "'ebook-convert' '/tmp/www-data/ws-c0_Poissons_d_eau_douce_du_Canada-85491881556931.epub' '/tmp/www-data/ws-c0_Poissons_d_eau_douce_du_Canada-8549120700558.pdf' '--page-breaks-before' '/' '--paper-size' 'a5' '--pdf-page-margin-bottom' '32' '--pdf-page-margin-top' '40' '--pdf-page-margin-left' '24' '--pdf-page-margin-right' '24' '--pdf-page-numbers' '--preserve-cover-aspect-ratio'" exceeded the timeout of 120 seconds.

Good catch! Only the prod timeout had been increased. I've now increased test to 240 seconds as well.

I've still not been able to reproduce the "Insufficient data for an image." error. Are you seeing it on both https://wsexport.wmflabs.org and https://wsexport-test.wmflabs.org ? If so, we should open a new ticket for it.

For the book "Poissons d’eau douce du Canada": no timeout in production, but still a timeout after 240 seconds in test. It is a big book with over 200 images, though. It is slow in production and very slow in test, so maybe the test environment is simply slower. No big deal.

Please open a new ticket for "Insufficient data for an image." This is happening in production and in test. I am trying to determine why it happens with some books and not others. I will fill in more info this weekend.

Is it possible for me to know what is being deployed to production? I would like to run final tests, and it would also allow me to inform the community. I noticed that the margins are back to normal. This is very much appreciated.