Wikisource Export: Copy/Paste of PDF Exports Shows Incorrect Text
Closed, Resolved (Public)

Description

As a Wikisource user, I want the copy/paste bug in PDF exports fixed, so that I can keep using PDF exports of Wikisource materials in a variety of ways, including copying the text to other places.

Steps to Reproduce:

  1. Go to https://hi.wikisource.org/wiki/%E0%A4%B9%E0%A4%BF%E0%A4%A8%E0%A5%8D%E0%A4%A6_%E0%A4%B8%E0%A5%8D%E0%A4%B5%E0%A4%B0%E0%A4%BE%E0%A4%9C
  2. Choose to download the PDF of the book
  3. Try to copy the left side of the first line of the title (see screenshot example below)
  4. Paste the text somewhere else. You will notice that it is not the same as the text you copied. When Satdeep and I tested, I got: िहद वराज. He got: िहᇳद ᇿवराज. Both are wrong.
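
To make "not the same text" concrete, here is a minimal Python sketch that decodes the page title from the URL in step 1 and lists the code points that went missing in a pasted result. It assumes the first line of the PDF title matches the wiki page name; `title_from_url` and `missing_chars` are illustrative helper names, not part of WS-Export or Calibre.

```python
from collections import Counter
from urllib.parse import unquote, urlsplit

# URL from step 1 of the reproduction steps.
PAGE_URL = (
    "https://hi.wikisource.org/wiki/"
    "%E0%A4%B9%E0%A4%BF%E0%A4%A8%E0%A5%8D%E0%A4%A6_"
    "%E0%A4%B8%E0%A5%8D%E0%A4%B5%E0%A4%B0%E0%A4%BE%E0%A4%9C"
)


def title_from_url(url):
    """Decode the percent-encoded page title; MediaWiki URLs use '_' for spaces."""
    encoded_title = urlsplit(url).path.rsplit("/", 1)[-1]
    return unquote(encoded_title).replace("_", " ")


def missing_chars(expected, pasted):
    """Count code points that occur in `expected` but not in `pasted` (order is ignored)."""
    return Counter(expected) - Counter(pasted)


# Assumption: the expected title text equals the decoded page name.
expected = title_from_url(PAGE_URL)
pasted = "िहद वराज"  # one of the pasted results reported in step 4

for ch, count in missing_chars(expected, pasted).items():
    print(f"U+{ord(ch):04X} missing x{count}")
```

Because the comparison ignores order, it only reports dropped code points; it will not flag characters that were merely reordered in the pasted text.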

Acceptance Criteria:

  • Fix the bug, so that when users copy and paste content from a downloaded PDF ebook, the content remains unchanged

Visual Example:

Screen Shot 2021-02-11 at 11.13.18 AM.png (580×741 px, 86 KB)

Event Timeline

Prateek saw:
िहद वराज

@ifried got:
:िहद वराज (ignore the colon; it got pasted by accident and now I can't seem to remove it)

@SGill saw:
िहᇳद ᇿवराज

My observation:

  • The half 'n' and half 's' sounds are missing
  • They are either hidden or shown as a circle or a square
  • Not sure why one of them has 'ᇳ', which is a Hangul character (Korean)
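
One way to check observations like these is to print the Unicode name of every pasted character. The sketch below uses Python's standard `unicodedata` module; as a simplifying assumption, it treats anything outside the main Devanagari block (U+0900 to U+097F) as unexpected. `describe` and `foreign_chars` are illustrative names.

```python
import unicodedata

# Main Devanagari block, U+0900..U+097F (extended blocks are ignored here).
DEVANAGARI = range(0x0900, 0x0980)


def describe(text):
    """Return (code point, Unicode name, is_devanagari) for each non-space character."""
    rows = []
    for ch in text:
        if ch.isspace():
            continue
        name = unicodedata.name(ch, "UNNAMED")
        rows.append((f"U+{ord(ch):04X}", name, ord(ch) in DEVANAGARI))
    return rows


def foreign_chars(text):
    """Keep only the rows whose character lies outside the Devanagari block."""
    return [row for row in describe(text) if not row[2]]


# Pasted results copied from the comments above.
samples = {
    "Prateek": "िहद वराज",
    "@SGill": "िहᇳद ᇿवराज",
}

for who, text in samples.items():
    print(who)
    for code_point, name, is_devanagari in describe(text):
        marker = "" if is_devanagari else "  <-- outside Devanagari"
        print(f"  {code_point} {name}{marker}")
```

The names printed for the flagged characters show which script they actually belong to, which is easier to share in an upstream bug report than a screenshot of the glyphs.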
ifried updated the task description.

@Samwilson We tried to discuss this in estimation today, but there were too many unknowns. Do you have any insight into what may be going on here? Thanks in advance!

Looks like an upstream issue.

I've lodged a bug with Calibre: https://bugs.launchpad.net/calibre/+bug/1915485

The Calibre developer replied and it's not good news:

Sadly PDF generation is not in calibre's control. It is done by Qt
WebEngine (aka Chromium). Chromium recently switched to using harfbuzz
for font shaping instead of sfntly, that might be the cause for it. You
can test it by using a version of calibre from before the change,
possibly 5.6 or 5.7.

I don't think we want to revert to an earlier version, as there are other improvements that we want to keep.

It seems like there's not much we can do about this bug at the moment, other than wait for Chromium to fix it. We could perhaps look into using another tool to render the PDF, such as Pandoc, but we get different errors when we go that route. Here's an example of the above work, made with Pandoc (EPUB → LaTeX → PDF):

Thanks for looking into this, @Samwilson! I agree that we shouldn't revert to an earlier version. One of the goals of our project is to update the infrastructure & experience of WS-Export, so a revert like that would go against our aims. However, I do think it is important that we know this information, since we can communicate to Wikisource community members that this is not directly caused by our work (pinging @SGill so he sees this). In the meantime, we can wait for Chromium to fix it. Is there a way we can file a bug with the Chromium developers, like we did with the Calibre developers? That seems like the most appropriate next step.

Some questions from the Qt WebEngine Team:

  • Does it happen only with Devanagari script?
  • Can you explain the difference? Is it a change in content characters, or a change in styling and/or grouping?

I've done my best to explain the latter, but if anyone has any more information that'd be great. (Add it here and I can copy it to the other ticket; or go and comment there directly.)

The Qt WebEngine response is that the bug is upstream, in Chromium.

I've reported a bug there: https://crbug.com/1182519

There's a minimal test file at https://ws-export-test.wmcloud.org/T274560.html

Could someone have a look at the latest comment on the above ticket, and reply? Sounds like this might be working in the latest version of Chromium.

No reply; assuming this is resolved. Please comment (and provide version info and steps to reproduce including URLs) if not. Thanks!