CJK characters not shown in PDF exports
Closed, Resolved · Public

Description

Downloading any text at enWS or zhWS (and presumably the other Wikisources) with CJK characters in it causes "boxes of doom" in the PDF export:

2021-02-17_104351_640x401_screenshot.png (401×640 px, 56 KB)

This works OK in EPUB and TXT exports:

2021-02-17_104517_584x90_screenshot.png (90×584 px, 21 KB)

Example: https://en.wikisource.org/wiki/Domicile_Ordinance

And the PDF produced from https://ws-export.wmcloud.org/?page=Domicile_Ordinance&lang=en&format=pdf:

Event Timeline

It looks like we just don't have the right fonts installed. I just installed fonts-noto-cjk on the test site, and this seems to work: https://ws-export-test.wmcloud.org/?title=Domicile_Ordinance&lang=en&format=pdf-a4&fonts=noto-sans-cjk-sc (I'm not quite sure what the different CJK fonts are for).

@Samwilson that makes sense.

The Noto fonts are:

  • SC - Simplified Chinese
  • TC - Traditional Chinese
  • KR - Korean
  • JP - Japanese

AFAIK they contain the same glyphs but with different variants:

e.g. https://commons.wikimedia.org/wiki/File:Source_Han_Sans_Version_Difference.svg

Thanks, that makes sense.

I've installed the above fonts package on production as well now, so the fix for this issue is to include fonts=noto-sans-cjk-sc or whichever.
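For anyone scripting exports, here is a minimal sketch of building such a URL with the `fonts` parameter. It only uses the query parameters that appear in the URLs in this task (`page`, `lang`, `format`, `fonts`); the helper name `export_url` and its defaults are made up for illustration, and nothing here checks which font names the tool actually accepts.

```python
from urllib.parse import urlencode

# Base URL as used in the links in this task.
EXPORT_BASE = "https://ws-export.wmcloud.org/"


def export_url(page, lang="en", fmt="pdf", fonts=None):
    """Build an export URL (hypothetical helper, not part of ws-export)."""
    params = {"page": page, "lang": lang, "format": fmt}
    if fonts:
        # e.g. "noto-sans-cjk-sc", the value tried in this task
        params["fonts"] = fonts
    return EXPORT_BASE + "?" + urlencode(params)


if __name__ == "__main__":
    print(export_url("Domicile_Ordinance", fonts="noto-sans-cjk-sc"))
```

Leaving `fonts` unset gives the plain export URL, which relies on whatever fallback fonts are installed on the server.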

Makes me wonder if we want to add a per-work way of setting a default font for export, because even once we've got a per-wiki way of doing that (T274561) it won't help works like this one.

"it won't help works like this one."

Even if the font isn't the default, at least for PDF, it'll still fall back to any font that supports the glyphs. For example, https://ws-export.wmcloud.org/?page=Domicile_Ordinance&lang=en&format=pdf is now working.

I can't immediately tell *which* fallback font it's using; pdffonts shows:

name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
LiberationSerif-Bold                 CID TrueType      Identity-H       yes no  yes     27  0
LiberationSerif                      CID TrueType      Identity-H       yes no  yes     28  0
[none]                               Type 3            Custom           yes no  yes     38  0
[none]                               Type 3            Custom           yes no  yes     39  0
[none]                               Type 3            Custom           yes no  yes     40  0
[none]                               Type 3            Custom           yes no  yes     41  0
[none]                               Type 3            Custom           yes no  yes     42  0
[none]                               Type 3            Custom           yes no  yes     43  0
[none]                               Type 3            Custom           yes no  yes     44  0
[none]                               Type 3            Custom           yes no  yes     45  0
LiberationSerif-Italic               CID TrueType      Identity-H       yes no  yes     46  0
[none]                               Type 3            Custom           yes no  yes    206  0
[none]                               Type 3            Custom           yes no  yes    207  0
[none]                               Type 3            Custom           yes no  yes    208  0
[none]                               Type 3            Custom           yes no  yes    204  0
[none]                               Type 3            Custom           yes no  yes    205  0

But I do now see CJK chars in the PDF.
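A listing like the one above can be tallied with a short script to see how many embedded fonts carry no name. This is only a rough sketch: it assumes font names and encodings contain no spaces (true for the rows shown here, not guaranteed in general), the function name `parse_pdffonts` is made up, and the sample is a few rows copied from the listing. It cannot recover the name of the fallback font; it just makes the unnamed Type 3 entries easier to count.

```python
SAMPLE = """\
name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
LiberationSerif-Bold                 CID TrueType      Identity-H       yes no  yes     27  0
LiberationSerif                      CID TrueType      Identity-H       yes no  yes     28  0
[none]                               Type 3            Custom           yes no  yes     38  0
[none]                               Type 3            Custom           yes no  yes     39  0
"""


def parse_pdffonts(text):
    """Parse pdffonts-style table text into a list of dicts.

    Assumes whitespace-free font names and encodings, so each row can be
    split on whitespace and read from both ends.
    """
    fonts = []
    seen_ruler = False
    for line in text.splitlines():
        if not seen_ruler:
            # Skip the header and the dashed ruler line.
            seen_ruler = line.startswith("---")
            continue
        tokens = line.split()
        if len(tokens) < 8:
            continue  # blank or malformed row
        fonts.append({
            "name": tokens[0],
            "type": " ".join(tokens[1:-6]),  # e.g. "CID TrueType", "Type 3"
            "encoding": tokens[-6],
            "emb": tokens[-5],
            "sub": tokens[-4],
            "uni": tokens[-3],
            "object": " ".join(tokens[-2:]),
        })
    return fonts


if __name__ == "__main__":
    fonts = parse_pdffonts(SAMPLE)
    unnamed = [f for f in fonts if f["name"] == "[none]"]
    print(f"{len(unnamed)} of {len(fonts)} fonts have no name")
```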

For EPUB, even if the font isn't bundled, the reader itself can do the fallback, and I think most readers come with a decent set of fonts.

@Samwilson I would normally bring this up in estimation today, but since we'll be having the earlier meeting that you can't attend, I'm just pinging you directly. What is the status of this ticket? It seems a fix is in progress, so I just want to make sure this doesn't fall through the cracks. Maybe we can put it on the board if you have already done some work on it? Thanks!

@ifried This is complete. New fonts were installed on the servers, and no other changes were required. Further font-selecting work will happen in T274561 I think.

ifried claimed this task.

Fantastic; thanks for the update, @Samwilson! In that case, I'll mark this ticket as Resolved.