
"Create a book" and "Download as PDF" don't wrap Chinese or Japanese lines
Closed, ResolvedPublic

Description

Author: yaoziyuan

Description:
PROBLEM:

The Chinese and Japanese languages are the only two languages in the world that don't use spaces to separate words. Instead, they just stick all words together. So it's common to see a whole Chinese/Japanese paragraph without any spaces in it.

Because of this peculiarity, MediaWiki's "Create a book" and "Download as PDF" features don't wrap Chinese/Japanese lines in the generated PDF file, resulting in a whole paragraph rendered on a single line and truncated when the line runs past the right edge of the page.

HOW TO REPRODUCE THE PROBLEM:

  1. Go to http://www.mediawiki.org/wiki/MediaWiki/zh-hans
  2. Open the page's "Download as PDF" link in a new browser tab;
  3. You will get a PDF file that doesn't wrap a long Chinese line but instead truncates it at the right edge of the page:

"MediaWiki是一个最初用于维基百科的自由wiki程序包,用PHP语言写成。现在,非营利的维基媒体基金会的其他计划、许多其他wiki网站以及本网站(MediaWiki主页)都在使用这个程序包。"

HOW TO FIX IT

  1. As a quick workaround, consider inserting special, invisible Unicode control characters into long Chinese/Japanese lines to create line-wrap opportunities;
  2. As a crude fallback, if you can't wrap a line at a space, wrap it forcibly at the right margin of the page;
  3. In principle, every Chinese/Japanese character can be considered a "word", so a line wrap is allowable before or after any Chinese/Japanese character;
  4. Experts may have a better solution than the above.

Version: unspecified
Severity: major

Details

Reference
bz33430

Event Timeline

bzimport raised the priority of this task to High. Nov 22 2014, 12:02 AM
bzimport set Reference to bz33430.
bzimport added a subscriber: Unknown Object (MLST).

(In reply to comment #0)

The Chinese and Japanese languages are the only two languages in the world that
don't use spaces to separate words.

Not true; many other languages do not use spaces either. :)

IIRC, the wrapping of Chinese paragraphs was fine around September because I used it to generate several files. Probably some changes in the config or extension itself caused this problem.

yaoziyuan wrote:

Benjamin Chen: AFAIK, only Chinese and Japanese apply. Korean uses square-like characters, but it does have spaces between words.

yaoziyuan wrote:

Enabling the Chinese Wikipedia to provide ebook creation properly can help spread Wikipedia knowledge in China freely.

Fixing this bug will probably be only a partial success. About 18 months ago we (PediaPress) were experimenting a little with Japanese, but we encountered numerous problems (text direction, layout rules, lack of support both from the community and our tools) that scared us off pursuing it further. IMO it will take a lot of determination and perseverance, as well as ongoing support from native speakers/developers, to create decent ebooks.

yaoziyuan wrote:

(In reply to comment #4)

Fixing this bug will probably be only a partial success. About 18 month ago we
(PediaPress) were experimenting a little bit with Japanese, but we encountered
numerous problems (text-direction, layout rules, lack of support both from the
community and our tools) that scared us off pursuing this further. Imo it will
take a lot of determination and perseverance as well as ongoing support from
native speakers/developers to create decent ebooks.

First, fixing this bug alone will improve the usefulness of Chinese/Japanese ebooks from 1% to 99.9%.

Second, I suggest MediaWiki reuse a mature HTML rendering engine (e.g. WebKit) or text rendering engine (e.g. Pango) instead of reinventing the wheel.

Third, MediaWiki can for now ignore complex formatting features such as "text-direction, layout rules" and just focus on drawing plain text lines and images correctly. "Keep it simple, stupid" for the first version.

yaoziyuan wrote:

I have played around with some Chinese pages on mediawiki.org, and so far the only problem I have seen is the lack of line wrapping. I don't see the problems you mentioned like text direction; note that Chinese and Japanese use left-to-right text direction just like English. Text direction is only a problem for Middle Eastern languages like Arabic and Hebrew.

I see MediaWiki can already draw the basics right (text, images and tables), except for line wrapping in Chinese/Japanese.

Here is a simple rule set for line wrapping:

IF there is a whitespace near the page's right margin THEN
    break the line at that whitespace
ELSE IF there is a Chinese/Japanese character near the page's right margin THEN
    break before or after that character
ELSE
    break forcibly at the page's right margin (and optionally draw a "soft return" character to indicate the forced break)
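The rule set above can be sketched as a small greedy wrapper. This is a hypothetical illustration, not the Collection extension's code: it assumes every character has the same width (a real renderer measures glyphs), and `is_cjk` uses only a rough approximation of the CJK ranges.

```python
def is_cjk(ch: str) -> bool:
    """Rough CJK test: CJK Unified Ideographs plus common CJK punctuation."""
    return '\u4e00' <= ch <= '\u9fff' or '\u3000' <= ch <= '\u303f'

def wrap_line(text: str, width: int) -> list[str]:
    """Greedy wrap: prefer whitespace, then a CJK boundary, then force."""
    lines = []
    while len(text) > width:
        window = text[:width + 1]
        space = max(window.rfind(' '), window.rfind('\u3000'))
        if space > 0:                      # rule 1: break at whitespace
            lines.append(text[:space])
            text = text[space + 1:]
            continue
        cjk = max((i for i in range(1, width + 1) if is_cjk(text[i - 1])),
                  default=-1)
        if cjk > 0:                        # rule 2: break after a CJK char
            lines.append(text[:cjk])
            text = text[cjk:]
        else:                              # rule 3: forced break at margin
            lines.append(text[:width])
            text = text[width:]
    lines.append(text)
    return lines
```

With a mixed string such as `"ab是cd是ef"` and a width of 4, the wrapper breaks after the CJK characters rather than cutting the Latin runs.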

yaoziyuan wrote:

Although Chinese and Japanese don't use spaces to separate words, you can actually think there is an "invisible space" before and after every Chinese/Japanese character, and this "invisible space" is always a good line-wrapping point just like normal spaces.

There is actually a Unicode control character U+200B "zero-width space" (http://en.wikipedia.org/wiki/Zero-width_space) for this "invisible space" concept.

With U+200B in mind, we can also simplify our line-wrapping rule set as:

add a U+200B after every Chinese/Japanese character;
IF there is a whitespace (including U+200B) near the page's right margin THEN
    break the line at that whitespace
ELSE
    break forcibly at the page's right margin (and optionally draw a "soft return" character to indicate the forced break)

yaoziyuan wrote:

Either of the above two rule sets can solve the line wrapping problem, although in the long run I recommend using a mature HTML-to-PDF library instead of reinventing the wheel.

yaoziyuan wrote:

I just did a little research on what FOSS PDF libraries are available. Here's a good list:

http://en.wikipedia.org/wiki/List_of_PDF_software#Development_libraries

TCPDF (http://en.wikipedia.org/wiki/TCPDF) seems to be a good candidate.

yaoziyuan wrote:

It seems currently MediaWiki's "Collection" extension uses the "ReportLab" PDF library to render PDF files (http://www.mediawiki.org/wiki/Extension:PDF_Writer#Technical).

ReportLab is one of the PDF libraries listed in the above Wikipedia reference.

Maybe we should persuade ReportLab to fix this problem first.

yaoziyuan wrote:

On ReportLab's "Samples" page (http://www.reportlab.com/software/documentation/rml-samples/), there is a "test_031_japanese.pdf" (http://www.reportlab.com/examples/rml/test/test_031_japanese.pdf) which shows that ReportLab can do Japanese text wrapping perfectly, while MediaWiki's "Download as PDF" can't wrap a long Japanese line at all. Why is that? I'm also asking this on ReportLab's mailing list (http://two.pairlist.net/mailman/listinfo/reportlab-users).

yaoziyuan wrote:

Good news, everybody! The solution to this problem has been given by ReportLab's personnel, as follows:

On 2 January 2012 11:33, Yao Ziyuan <yaoziyuan@gmail.com> wrote:

So now I'm confused. Is it MediaWiki's or ReportLab's fault for the
line wrapping problem described in the above bug report
(https://bugzilla.wikimedia.org/show_bug.cgi?id=33430)?

MediaWiki (actually PediaPress.de) decided to use our library a few years ago;
we did some work to improve inline images to support equations, but
they did not mention Asian line wrapping at the time and I did not know
about this limitation.

I guess they are simply not using our wordwrap=CJK option. Our library
needs to be told "this is Japanese/Chinese, use a different algorithm";
it does not auto-detect based on the encoding.

Also, until some time last year, we could not properly handle mixed text
in the same sentence. We have improved this now.

  • Andy

volker.haas wrote:

We need to distinguish two different cases:

  1. rendering a PDF from the chinese/japanese Wikipedia;
  2. rendering a PDF from any other wikipedia which has some chinese/japanese text embedded inside the article.

The example you (Ziyuan) give at the very top is case 2). Your last post suggests that this case can be handled correctly with a recent reportlab version.

I believe this is not true. I checked out the latest reportlab version from their Subversion repository and wrote a little test script (I'll attach it). The result seems to indicate that mixed CJK and non-CJK text can't be rendered correctly: the line breaks are correct either for the CJK or for the non-CJK text, but not both. (The line wrapping behaviour can be toggled by enabling or disabling CJK wordWrap.)

(I didn't bother to use a proper font for the CJK text, but that should not matter, except that all CJK letters are rendered as black boxes.)

Case 1) is a different matter: this should basically work. If not, please provide a minimal example / article URL.

volker.haas wrote:

test script for linebreak check for mixed cjk and non-cjk text

Attached:

yaoziyuan wrote:

First, I don't have MediaWiki installed on my computer so I can't run your test script.

If ReportLab doesn't correctly support line wrapping for mixed CJK and non-CJK text, I suggest we do the following:

Step 1: For every CJK character in the text, insert the Unicode control character U+200B "zero-width space" after it. This should cause a line wrap after a CJK character when a line is full.

Step 2: Disable CJK wordWrapping. Use Western-style word wrapping.

Step 3: Now you should see a long CJK string wrapped at the end of a line.
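Step 1 above can be sketched with a regular expression. This is a hypothetical illustration; the character classes are only an approximation of "CJK" (Hiragana/Katakana, CJK Unified Ideographs and their Extension A, and compatibility ideographs).

```python
import re

# Rough approximation of CJK characters for illustration purposes.
CJK = re.compile(r'([\u3040-\u30ff\u3400-\u4dbf\u4e00-\u9fff\uf900-\ufaff])')

def add_soft_breaks(text: str) -> str:
    """Insert U+200B (zero-width space) after every CJK character."""
    return CJK.sub('\\1\u200b', text)
```

After this preprocessing, a Western-style word wrapper sees a potential break point after every CJK character; non-CJK text is left untouched.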

yaoziyuan wrote:

The line-wrapping rule for CJK/non-CJK mixed text is actually very simple: you should either wrap the line at a whitespace (as in Western text) or after a CJK character.

So, if possible, use the above rule to pre-wrap a text before feeding it to ReportLab.

yaoziyuan wrote:

OK, now I have installed python-reportlab on my Fedora 16 and can run your test script. I understand your problem. I'll test whether I can insert U+200B after every CJK character. If U+200B fails, we can insert a normal space after every CJK character instead. This will definitely wrap a line after a CJK character, but with the drawback that all CJK characters will be separated by visible spaces (instead of sticking together).

yaoziyuan wrote:

OK. I tried. U+200B doesn't work with ReportLab:

p1 = Paragraph(u"MediaWiki\u200B是\u200B一\u200B个\u200B最\u200B初\u200B用\u200B于\u200B维\u200B基\u200B百\u200B科\u200B的\u200B自\u200B由\u200Bwiki\u200B程\u200B序\u200B包\u200B,\u200B用\u200BPHP\u200B语\u200B言\u200B写\u200B成\u200B。\u200B现\u200B在\u200B,\u200B非\u200B营\u200B利\u200B的\u200B维\u200B基\u200B媒\u200B体\u200B基\u200B金\u200B会\u200B的\u200B其\u200B他\u200B计\u200B划\u200B、\u200B许\u200B多\u200B其\u200B他\u200Bwiki\u200B网\u200B站\u200B以\u200B及\u200B本\u200B网\u200B站\u200B(\u200BMediaWiki\u200B主\u200B页\u200B)\u200B都\u200B在\u200B使\u200B用\u200B这\u200B个\u200B程\u200B序\u200B包\u200B。", s)

But normal spaces do:

p1 = Paragraph(u"MediaWiki 是 一 个 最 初 用 于 维 基 百 科 的 自 由 wiki 程 序 包 , 用 PHP 语 言 写 成 。 现 在 , 非 营 利 的 维 基 媒 体 基 金 会 的 其 他 计 划 、 许 多 其 他 wiki 网 站 以 及 本 网 站 ( MediaWiki 主 页 ) 都 在 使 用 这 个 程 序 包 。", s)

yaoziyuan wrote:

I'll write to ReportLab's mailing list, suggesting they create a new wordWrap option "mixed" so that ReportLab can directly support wrapping mixed text.

yaoziyuan wrote:

ReportLab says working on this problem is not a priority for them, so I'm trying to fix it personally in their source code.

I found (and they confirmed) that their source code is actually very old (from 2006), before Unicode went mainstream, which is why they don't support mixed-text wrapping well.

So, would it be hard for PediaPress to switch to a more modern PDF library, such as TCPDF, which I've seen people say has good Unicode support?

yaoziyuan wrote:

Cite http://en.wikipedia.org/wiki/TCPDF :

"TCPDF is currently the only PHP-based library that includes complete support for UTF-8 Unicode and right-to-left languages, including the bidirectional algorithm.[1]"

yaoziyuan wrote:

OK, Volker Haas, I have come up with a simple way to fix all this:

We will first determine whether a wiki page is "mostly Western" (then we'll use wordwrap=Western) or not (then we'll use wordwrap=CJK).

The definition of "mostly Western" can be: the longest consecutive CJK string in the page is shorter than 10 characters.
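The "mostly Western" heuristic described above can be sketched as follows. This is a hypothetical illustration: the 10-character threshold comes from the comment above, while the character ranges and function names are assumptions made for the sketch.

```python
import re

# One or more consecutive CJK characters (rough approximation:
# Hiragana/Katakana plus CJK Unified Ideographs).
CJK_RUN = re.compile(r'[\u3040-\u30ff\u4e00-\u9fff]+')

def is_mostly_western(text: str, threshold: int = 10) -> bool:
    """A page is "mostly Western" if its longest run of consecutive
    CJK characters is shorter than the threshold."""
    longest = max((len(run) for run in CJK_RUN.findall(text)), default=0)
    return longest < threshold

def pick_wordwrap(text: str) -> str:
    """Choose the wordWrap mode based on the heuristic."""
    return 'Western' if is_mostly_western(text) else 'CJK'
```

A page with only short embedded CJK snippets would thus get Western wrapping, while a page dominated by long CJK runs would get CJK wrapping.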

volker.haas wrote:

As you also found out, reportlab does not support zero-width space chars. I needed that for other purposes in the past as well. The best solution/hack I could come up with was to use a space and set the font size to the smallest possible value.

I implemented the following:

In all non-CJK wikis the text is checked for CJK characters. If CJK characters are found, fake zero-width-space chars are inserted. I tested this on a couple of articles and the strategy seems to make sense.

As for your suggestion to use another PDF framework instead of reportlab: doing this is a huge amount of work, so it is not an option at the moment.

The render servers will be updated in the next 24 hours. I'll close this as fixed.
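The "fake zero-width space" hack described above might look roughly like this. This is a hypothetical sketch, not PediaPress's actual implementation: it only builds the markup string (ReportLab's Paragraph accepts inline `<font>` tags with a size attribute); the size value, character range, and function name are assumptions.

```python
import re

# A real space rendered at the smallest font size, so it is nearly
# invisible but still gives the wrapper a break point.
TINY_SPACE = '<font size="1"> </font>'

# Rough approximation: CJK Unified Ideographs only.
CJK = re.compile(r'([\u4e00-\u9fff])')

def fake_zero_width_spaces(text: str) -> str:
    """Insert a tiny-font space after every CJK character."""
    return CJK.sub(r'\1' + TINY_SPACE, text)
```

The resulting string would then be fed to a ReportLab Paragraph with ordinary Western word wrapping, which breaks at the (nearly invisible) spaces.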

yaoziyuan wrote:

Great news. Can you give me a PDF that demonstrates your smallest-size spaces?

yaoziyuan wrote:

I like your solution for non-cjk wikis (using tiny spaces). But you didn't mention what to do with cjk wikis. I assume you will use wordwrap=CJK for them, right?

yaoziyuan wrote:

I just tried out your "tiniest space" concept in LibreOffice. Perfect! Virtually invisible spaces! You're a genius. No need to show me the PDF now.

yaoziyuan wrote:

One more question: your tiny-space idea is a universal solution that could also apply to CJK wikis, because a CJK wiki can also contain Western words (which are better wrapped at spaces).

What is the reason you don't apply it to cjk wikis?

volker.haas wrote:

For CJK wikis the built-in CJK word wrapping of reportlab is used. This probably breaks non-CJK text that is embedded... But I am pretty sure that at least for Japanese the algorithms to break lines are more sophisticated than just splitting after any letter. I am hoping that the built-in reportlab word wrapping function does that, but I am not sure...

yaoziyuan wrote:

First, using ReportLab's CJK wordWrap algorithm will break English words across two lines. This is well demonstrated by your own test script.

Second, although ReportLab's CJK wordWrap algorithm can break Japanese sentences more intelligently, that benefit is very small, while the drawback of cutting Western words in half is very significant.

In Chinese, some full-width punctuation marks such as ,。;" generally don't appear at the beginning of a line either, but as a Chinese speaker I consider this an expendable rule if it keeps Western words uncut.

yaoziyuan wrote:

Here are two Wikipedia links that talk about the so-called CJK wordwrap rules:

http://en.wikipedia.org/wiki/Word_wrap#Word_wrapping_in_text_containing_Chinese.2C_Japanese.2C_and_Korean

http://en.wikipedia.org/wiki/Line_breaking_rules_in_East_Asian_language#Line_breaking_rules_in_Japanese_text_.28Kinsoku_Shori.29

I have reviewed them all. Not a single one of them is as serious as "don't break Western words across two lines". They can be ignored altogether; most text editors and viewers don't obey these rules anyway.

yaoziyuan wrote:

Found a problem with the tiny-space approach: Chinese characters don't take up the full width of a line; there is still much space left on the right side of each line. For example, try http://www.mediawiki.org/wiki/MediaWiki/zh-hans

I guess this is caused by how ReportLab counts the length of the text already placed on a line: after placing each word, it adds that word's length plus a normal space's width. But now there are actually two kinds of space width: normal width (as between two English words) and tiny width (as between two Chinese characters). It seems ReportLab assumes all spaces have the normal width, and therefore starts a new line prematurely.

Can this be fixed? Can you let ReportLab count tiny spaces as tiny spaces, not normal spaces?

yaoziyuan wrote:

If we can't easily modify ReportLab to distinguish tiny space widths from normal space widths, I'd rather see this arrangement:

For non-cjk wikis, insert a normal-sized space after each CJK character, and then use wordwrap=Western.

For cjk wikis, use wordwrap=cjk.

volker.haas wrote:

I just found out that the latest reportlab version seems to handle non-CJK text inside CJK text (with wordWrap='CJK') correctly. Installation of the newest reportlab version had failed earlier, and I didn't realize that.
--> Merging the latest reportlab version should therefore solve this problem. I'll see if I can do this...

One problem with non-CJK inside CJK remains: the text isn't justified correctly anymore, but I'd just ignore that...

yaoziyuan wrote:

Great to hear that. Eager to see a sample PDF of your latest finding.

yaoziyuan wrote:

I confirm: I downloaded and installed the latest snapshot reportlab-20120111203740 successfully and ran your test script. It wraps both CJK and Western text correctly.

volker.haas wrote:

I updated to the latest reportlab version. The problem with mixing CJK and non-CJK text should be fixed. The render servers will be updated sometime next week.

yaoziyuan wrote:

Volker: Appreciate your hard work!

yaoziyuan wrote:

Are the render servers updated yet? I still see Chinese lines not taking up a page's full width (there's much space left on each Chinese line's right side).