PDF generation does not support Complex Script Wikis (e.g. Indic languages) and needs to be re-written
Closed, Resolved · Public

Description

While support for complex script wikis (e.g. Indic languages) is lacking, developers I spoke with are trying to add that support. Please incorporate this support.


Version: unspecified
Severity: enhancement
See Also:
http://web.archive.org/web/20111002214457/http://code.pediapress.com/wiki/ticket/857
https://bugzilla.wikimedia.org/show_bug.cgi?id=30437
https://bugzilla.wikimedia.org/show_bug.cgi?id=32317

Details

Reference
bz28206

Event Timeline

bzimport raised the priority of this task to Normal. Nov 21 2014, 11:30 PM
bzimport added projects: Collection, I18n.
bzimport set Reference to bz28206.

Apparently there's a new package for building PDFs that can handle Indic scripts, and someone built a proof-of-concept tool on top of it that generates PDFs from Wikipedia pages. See http://ultimategerardm.blogspot.com/2011/03/pdf-library-with-potential.html

mayurdce wrote:

I have also registered a similar bug, https://bugzilla.wikimedia.org/show_bug.cgi?id=30508, for Hindi script.

Assigning this to Tomasz after triage so that he can bring this up with ErikM and, hopefully, find some resources to take care of this.

I should add that, during triage, we talked about opening a new bug: "pdf export should support indic languages and needs to be re-written".

I'll just update the title of this one.

*** Bug 30508 has been marked as a duplicate of this bug. ***

*** Bug 20403 has been marked as a duplicate of this bug. ***

arjunaraoc wrote:

The PediaPress view of the book (which can only be ordered) displays Telugu text properly on all text pages except the cover page, but the downloadable PDF still has rendering problems for Telugu.

vantharith.oum wrote:

I experienced the same issue with Khmer Unicode font in Khmer Wikipedia too.
Any update on this rendering issue for complex scripts?
Thank you!

Re-assigning back to default as I am not actively working on this.

Any update on this?

Currently, nobody is actively working on this - See "Assigned To: Nobody" above.
In case anybody wants to / is able to help codewise, please check http://www.mediawiki.org/wiki/Developer_access

I would be interested in resolving this issue myself. Could someone please guide me through it?
I have checked that http://silpa.org.in/Render renders the text more correctly.

Hi Rahmanuddin, this does not render Bengali at all; it shows garbage.

(In reply to comment #14)

I would be interested in resolving this issue myself. Could someone please guide me through it?

Does the docs/download/install info on https://www.mediawiki.org/wiki/Extension:Collection help? If not, you might want to contact the developers directly to pinpoint the relevant code area(s).

I just tried saving a PDF from Hindi Wikipedia. The PDF apparently contains the text correctly, but the rendering of the Devanagari combining marks is incorrect. Copy-pasting the text into other text processors renders it correctly.

As far as I know, the Collection extension[1] uses the PDF Writer extension[2] to create PDFs, which in turn uses the mwlib.rl library[3] for the PDF creation. The library uses the GNU FreeFont project's fonts.

I think there may be a problem with the GNU FreeFont coverage[4] that is causing the rendering issues. I may be wrong in this assessment, though. It'd be nice if a developer could confirm the cause.

[1]: https://www.mediawiki.org/wiki/Extension:Collection
[2]: https://www.mediawiki.org/wiki/Extension:PDF_Writer
[3]: https://github.com/pediapress/mwlib.rl
[4]: http://www.gnu.org/software/freefont/coverage.html
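
To illustrate how the text layer can be correct while the glyphs look wrong: a Devanagari word is stored as base letters followed by combining marks, and the renderer has to shape them into the right visual form. The following Python sketch is an illustration only, not part of mwlib.rl; the helper name `describe_codepoints` is made up for this example. It lists the code points of the word for "Hindi" using only the standard library:

```python
import unicodedata


def describe_codepoints(text):
    """Return (char, codepoint, category, name) for each code point in text."""
    return [
        (ch, f"U+{ord(ch):04X}", unicodedata.category(ch),
         unicodedata.name(ch, "UNKNOWN"))
        for ch in text
    ]


# The word "Hindi" in Devanagari, written with escapes to avoid encoding issues.
word = "\u0939\u093f\u0928\u094d\u0926\u0940"

for ch, codepoint, category, name in describe_codepoints(word):
    # Categories starting with "M" are combining marks (vowel signs, virama).
    print(codepoint, category, name)
```

Rows whose category starts with M are combining marks. Copy-paste preserves this sequence as-is, which would be consistent with the text pasting correctly even when the PDF glyphs are misplaced.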

I suggest that the server on which the rendering is done have freely licensed fonts installed for each language, at least the prominent ones.

Note that there are plans to rework the current code. See https://www.mediawiki.org/wiki/PDF_rendering

Any update on this?

The new OCG renderer handles Indic scripts much better. It is now enabled on the production wikis, although you need to use the 'Create a book' function in the sidebar to access it.

No, it does not render properly! It's buggy and unreadable, with ligatures shown wide apart.

Can you provide a specific page, along with details on the specific ligature which is incorrect? Are you sure you are looking at the output of the "OCG latex renderer"? (You need to use the "Create a book" function, and specifically select the "e-book (PDF, ocg latex renderer)" format.)

On Telugu Wikipedia (te.wikipedia.org), I added the home page itself as a book. An hour ago I got wrong rendering. Now I get the following error:
Rendering failed
Generation of the document file has failed.

Status: Rendering process died with non zero code: 1

Another note: I checked on Hindi Wikipedia, and it's working fine there.
Jayantha, please confirm about Bangla. I will check the rest of the South Indian languages as well.

Tamil Wikipedia's page rendered as pdf by ocg latex renderer

For Tamil, bold and heading characters are shown as question marks.

attachment test (1).pdf ignored as obsolete

In general: Please please provide exact and clear steps with URLs to reproduce problems, otherwise we might end up with misunderstandings and trying different things. Thanks :)

On Bengali Wikipedia we found the same error as on Telugu Wikipedia:
Rendering failed
Generation of the document file has failed.

Status: Rendering process died with non zero code: 1

Please give me exact pages on the different Wikipedias; that helps me a lot and lets me add a reproducible test case. If you just say "Tamil's Wikipedia", I need to do extra work to figure out what the language code for Tamil is, and then try to find a reasonable test page without being able to read Tamil at all.

Note that the renderer currently has an issue with images in the PDF, which we are working on fixing. "Rendering process died with non zero code: 1" seems to be that image bug. So if you could find test pages without images on them, that would be helpful.

BEFORE COMMENTING HERE: Please read comment 27, comment 29, and https://www.mediawiki.org/wiki/How_to_report_a_bug . Thank you!

Created attachment 16103
Bengali Wikipedia's page rendered as pdf by ocg latex renderer

In Bengali, complex words are not rendered properly.

  1. Go to https://te.wikipedia.org with Firefox version 31.
  2. Pick a page with no images (I chose 1911).
  3. Select Book creator in the left sidebar of the page (పుస్తకం కూర్పరిని అచేతనం చెయ్యి under the ముద్రించండి/ఎగుమతి చేయండి heading).
  4. Start the book.
  5. Add the 1911 page to the book.
  6. Navigate to the book page, and give it a title and subtitle.
  7. Select ebook (PDF ocg latex renderer) in the dropdown under the దింపుకోండి heading.
  8. Click on Export.

Expected Result : Book gets rendered and download link to pdf is given

Actual result : "Rendering failed

Generation of the document file has failed.

Status: Rendering process died with non zero code: 1 "

Thanks for the good test case in comment 33. We might eventually have to split this up into separate bugs for the Tamil, Bengali, and Telugu issues, but we can keep them together for now.

If you'd like to help debug the issues at a lower level, the new PDF backend consists of a "bundler" and a "renderer" portion, which are described at https://www.npmjs.org/package/mw-ocg-bundler and https://www.npmjs.org/package/mw-ocg-latexer and can be run standalone if you're brave. I reproduced the issue described in comment 33 as follows:

$ mw-ocg-bundler -o tamil.zip -p tewiki 1911
$ mw-ocg-latexer -D -v -o tamil.pdf tamil.zip

which gave me the following error from xelatex:

! Package polyglossia Error: The current roman font does not contain the Telugu script!
(polyglossia) Please define \telugufont with \newfontfamily.

So it looks like I need to find a good font covering Telugu. Can you suggest one?

And note that my commands above named the files tamil.zip and tamil.pdf, even though the bug is really about Telugu support. Whoops!
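
The polyglossia error above asks for `\telugufont` to be defined with `\newfontfamily`. As a rough sketch of what such a declaration looks like, the Python snippet below builds the corresponding fontspec line. It is not how mw-ocg-latexer does it; the function name `fontfamily_declaration` is hypothetical, and "Lohit Telugu" is only an assumed font choice that would have to be installed on the rendering host.

```python
def fontfamily_declaration(language, script, font_name):
    """Build a fontspec declaration for the font family polyglossia asks for.

    All three arguments are supplied by the caller; nothing here checks
    that the font is installed or that it actually covers the script.
    """
    return f"\\newfontfamily\\{language}font[Script={script}]{{{font_name}}}"


# Assumed font choice for illustration: Lohit Telugu.
print(fontfamily_declaration("telugu", "Telugu", "Lohit Telugu"))
```

The printed line would go into the XeLaTeX preamble after polyglossia is loaded; the same pattern would apply to other languages with their own font and script names.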

Change 150634 had a related patch set uploaded by Cscott:
Use FAKESTYLES for FreeSerif.

https://gerrit.wikimedia.org/r/150634

For Tamil I'm looking at:
https://ta.wikipedia.org/wiki/1911
and it appears that it is using the FreeSerif font for Tamil. I've fixed the bad boldface issue -- but is there a better font for Tamil we could/should be using?

It looks like Bengali is also using FreeSerif, which probably explains the issues with "complex words". What font should we be using?

Change 150634 merged by jenkins-bot:
Use FAKESTYLES for FreeSerif.

https://gerrit.wikimedia.org/r/150634

For Bengali Wikipedia (https://bn.wikipedia.org), you can use 'Lohit Bengali' (https://fedorahosted.org/lohit/) or Siyam Rupali (both are already available in ULS).

The Language Engineering team might also have expertise in recommending fonts with sufficient support for Indic languages, based on their ULS experience. CC'ing Runa.

It's not strictly comparable, since web typography has different constraints (file size, format, etc) which don't apply to the XeLaTeX engine -- and for the OCG servers it's important that the fonts in question are well-packaged for Ubuntu. But ULS is often a good start.

For Telugu, Lohit Telugu from the Lohit set of fonts could be used. Vemana can also be included. Both fonts are available in the ttf-telugu-fonts and ttf-indic-core-fonts packages in Ubuntu/Debian.

Wikisource is untested. It should work, but it's better to use Wikipedia test cases for now, so that we're dealing with one bug at a time.

The Malayalam test case I used was http://ml.wikipedia.org/wiki/Special:Redirect/revision/1852257 ; I've attached a copy of the PDF to the bug. It uses the Rachana font.

Created attachment 16110
Malayalam page (http://ml.wikipedia.org/wiki/മലയాളം) rendered as pdf by ocg latex renderer

I spent most of yesterday working on this, but I had trouble finding appropriate fonts for Tamil, Telugu, and Bengali which also had coverage of the Latin code points. When I use Lohit, for example, all the Latin-script numbers (and list bullets) render as tofu. :(

I have a good idea how this can be worked around; I've filed that as bug 68922.
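
One general way to deal with a font that lacks Latin coverage is to split the text into runs and set each run in a font that covers it. The Python sketch below shows only that splitting step for the Telugu Unicode block; it is an illustration under that assumption and is not claimed to be the approach proposed in bug 68922. The name `split_by_block` is made up for this example.

```python
from itertools import groupby

# Telugu Unicode block, U+0C00 to U+0C7F.
TELUGU_BLOCK = range(0x0C00, 0x0C80)


def split_by_block(text, block=TELUGU_BLOCK):
    """Split text into (in_block, substring) runs.

    in_block is True for runs whose code points all fall inside `block`
    and False for everything else (Latin letters, digits, spaces, bullets).
    """
    return [
        (in_block, "".join(chars))
        for in_block, chars in groupby(text, key=lambda ch: ord(ch) in block)
    ]


# A Telugu word followed by a year in Latin digits.
sample = "\u0c24\u0c46\u0c32\u0c41\u0c17\u0c41 1911"

for in_block, run in split_by_block(sample):
    # A renderer could pick the script font for True runs and a fallback otherwise.
    print("telugu font" if in_block else "fallback font", repr(run))
```

A real implementation would also need to decide where punctuation and spaces belong, which this sketch simply sends to the fallback font.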

Thanks. Malayalam rendering in the attached PDF is good with the Rachana font, although Windows users generally prefer the Anjali font. (Also, I am not able to find the Collection extension on ml.wp.)

Tamil Wikipedia's page rendered as pdf by ocg latex renderer

Tamil now works fine, including bold and Latin characters (new file attached), although it would be better if we could find a better font. I shall check the output with Meera-Tamil. Thanks.

For Bengali Wikipedia I found the same problem: complex letters are not rendered properly, maybe due to a font issue. Could you please add Lohit Bengali or Siyam Rupali so I can test it?

Change 151360 had a related patch set uploaded by Cscott:
Use Lohit fonts when possible.

https://gerrit.wikimedia.org/r/151360

Change 151360 merged by jenkins-bot:
Use Lohit fonts when possible.

https://gerrit.wikimedia.org/r/151360

Created attachment 16140
Bengali Wikipedia's page rendered as pdf by ocg latex renderer (কমন জেন্ডার-দ্য ফিল্ম)

I checked again today, but the output has the same issue: complex letters are not rendered properly. It is still using the FreeSerif font instead of Lohit Bengali.

(In reply to C. Scott Ananian from comment #38)

It looks like Bengali is also using FreeSerif, which probably explains the
issues with "complex words". What font should we be using?

For FreeSerif see also these bugs

I did a session with Indic Wikipedians at Wikimedia. I believe I have fixed all the bugs in our Indic languages, but I have not yet patched the polyglossia package in production. (The current PDFs don't use the correct font.) Working on it.

I don't have any expertise with Burmese, however. If a native reader can volunteer to check rendering, I can probably fix any bugs which exist.

cscott added a comment. Oct 8 2014, 4:00 PM

This bug has grown unwieldy, and the new OCG renderer fixes most of these problems.

Please open new bugs for specific issues with specific languages, after confirming that they still exist.

After a week or so, I'll close this bug.

Aklapper closed this task as Resolved. Dec 21 2016, 10:33 AM
Aklapper added a subscriber: Aklapper.

This bug has grown unwieldy, and the new OCG renderer fixes most of these problems.
Please open new bugs for specific issues with specific languages, after confirming that they still exist.
After a week or so, I'll close this bug.

No comments afterwards, hence doing so.

Restricted Application added a project: Internet-Archive. Dec 21 2016, 10:33 AM