PDF export extension doesn't support some characters in Arabic script
Closed, ResolvedPublic

Assigned To
None
Authored By
Yamaha5
Aug 11 2011, 1:30 PM
Referenced Files
F8009: nazli_pak.zip
Nov 21 2014, 11:48 PM
F8007: nazlib.ttf
Nov 21 2014, 11:48 PM
F8005: farsi_fonts.zip
Nov 21 2014, 11:48 PM
F8004: test_freefarsi.pdf
Nov 21 2014, 11:48 PM
F8003: test_nazli.pdf
Nov 21 2014, 11:48 PM

Details

Reference
bz30326

Event Timeline

bzimport raised the priority of this task to High. Nov 21 2014, 11:48 PM
bzimport added projects: Collection, I18n.
bzimport set Reference to bz30326.
bzimport added a subscriber: Unknown Object (MLST).
  • Bug 17766 has been marked as a duplicate of this bug.

And PDFs are still LTR instead of RTL.

volker.haas wrote:

Could any of you please do me two favors:

  1. Provide a minimal example: a) create a page in your user space containing exactly one Arabic word that has one character missing in the PDF, OR b) post this example word right here.
  2. I also need a suggestion for a font that is suitable for the Arabic script. This font ideally completely covers the following Unicode blocks:

The most important thing is that this font is guaranteed to include the missing glyph in the example provided in 1).


I fixed the problem for fa.wikipedia.org (LTR instead of RTL) but didn't update the render servers yet. I'll do that later today probably.

(In reply to comment #4)

1) I wrote "لا" in my user space ( http://fa.wikipedia.org/wiki/%DA%A9%D8%A7%D8%B1%D8%A8%D8%B1:Reza1615/pdf ).
2) We have a problem with the ligature ( http://en.wikipedia.org/wiki/Typographic_ligature ) connection ل + ا ==> لا. It is a rendering problem, not a missing glyph.
3) The best font, which is now used in fa.wiki and fa.wikibooks for printing ( http://fa.wikipedia.org/wiki/%D9%85%D8%AF%DB%8C%D8%A7%D9%88%DB%8C%DA%A9%DB%8C:Print.css ) ( http://fa.wikibooks.org/wiki/%D9%85%D8%AF%DB%8C%D8%A7%D9%88%DB%8C%DA%A9%DB%8C:Print.css ), is Nazli, which is also used in SVG rendering. It is also beautiful for Arabic printing, but so far the Arabic wiki projects don't have any print CSS definition.
4) We have many RTL wikis, such as "mzn, glk, ckb, ur, pnb, arz, dv, ps, sd, ks, yi, ..", that need RTL support.

volker.haas wrote:

Thanks a lot for compiling the page with the problematic markup!

Regarding the font choice: Nazli is already used, so the problem with the missing glyphs is probably not caused by the font. I'll investigate. The only problem with Nazli seems to be that there is no bold variant. Can you confirm that? Wouldn't that be a problem?

I also added the above mentioned languages to the list of RTL languages [1]

Words like fetching, rendering can currently not be localized. The problem is that currently only the rendering software is localized. The component which fetches the resources and starts the rendering process is not localized.

Note: I found out that the width calculation for Arabic currently fails. That is probably the reason for the strange alignment of all text. I am currently trying to solve that.

[1] http://code.pediapress.com/git/mwlib.rl?p=mwlib.rl;a=commit;h=1abb5024b50b82bc117c730c8c255529fb3ec484

(In reply to comment #7)

Thank you for your responsive replies.
According to http://meta.wikimedia.org/wiki/SVG_fonts, Nazli has a bold variant. We also use it in http://fa.wikipedia.org/wiki/%D9%85%D8%AF%DB%8C%D8%A7%D9%88%DB%8C%DA%A9%DB%8C:Print.css as ''.wikitable caption {font-weight: bold;}'' for the wikitable caption, so it does have a bold variant.

According to bug 745, please also add 'am', 'arc', 'bcc', 'bqi', 'dz', 'ha', 'he', 'ku', 'ug' as RTL wikis.

After updating the servers, all of the bugs in fa.wiki are solved:
http://fa.wikipedia.org/wiki/%D8%AA%D9%88%DA%A9%D9%84_%D9%85%D8%A7_%D8%A8%D9%87_%D8%AE%D8%AF%D8%A7%D8%B3%D8%AA

except:

1- the problem in rendering لا
2- would you please convert page numbers and any numbers in the text, using formatnum or the following mapping:
1==>۱
2==>۲
3==>۳
4==>۴
5==>۵
6==>۶
7==>۷
8==>۸
9==>۹
0==>۰
3- rendering of reordered non-spacing marks
4- PDF export in fa.wiki doesn't support ''' as bold
5- the infobox's font is smaller than usual, and its shape is wider than usual
http://fa.wikipedia.org/wiki/%D9%85%DB%8C%D9%84%D8%A7%D9%86
6- PDF export doesn't correctly support location maps in any wiki (including en.wiki)
http://en.wikipedia.org/wiki/Ahvaz
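
The digit mapping requested in item 2 can be sketched in a few lines of Python. This is only an illustration of the mapping itself, not the code used by the PDF export; the function name `to_persian_digits` is made up for this example.

```python
# Translation table from ASCII digits 0-9 to the Extended Arabic-Indic
# (Persian) digits U+06F0..U+06F9, built from code points so the order
# cannot be scrambled by bidirectional text display.
PERSIAN_DIGITS = {ord(str(i)): chr(0x06F0 + i) for i in range(10)}


def to_persian_digits(text: str) -> str:
    """Return text with every ASCII digit replaced by its Persian form."""
    return text.translate(PERSIAN_DIGITS)


if __name__ == "__main__":
    # Non-digit characters pass through unchanged.
    print(to_persian_digits("page 42"))
```

Where such a conversion should apply (page numbers, list numbers, citation numbers) is a separate decision; the sketch only covers the character substitution.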

Maybe the problem with لا comes from the Nazli font, so I added the Roya font in MediaWiki:Print.css. Roya and Nazli are open source, GNU-based fonts. You can see the details here:
http://fa.farsiweb.ir/fawiki/Persian_Fonts

For the missing bold: Nazli has the nazlib font. Maybe you didn't use it!

I think there is another bug for books.
See:
http://fa.wikipedia.org/wiki/%DB%B4%DB%B2_(%D9%85%DB%8C%D9%84%D8%A7%D8%AF%DB%8C)
and please make a PDF from this page. The table has so many problems! :(

volker.haas wrote:

I added the above mentioned languages to the set of RTL languages. More importantly: I have fixed the bug that was responsible for the general misalignment of all Arabic text. (More specifically: all text that required character shaping.)

The missing bold font is also fixed (nazlib as suggested).

Furthermore I fixed the problem with "لا" - this should have taken care of all "missing glyph" boxes in the text.
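
To illustrate why "لا" is a shaping problem rather than a missing glyph, here is a small Python sketch. It is not the actual fix in the render software; it only shows that Unicode has a separate presentation-form code point for the lam-alef ligature, which a renderer without font-level shaping has to substitute itself. The function name is hypothetical, and a real shaper would also pick the final form (U+FEFC) when the preceding letter joins.

```python
import unicodedata

LAM = "\u0644"                # ARABIC LETTER LAM
ALEF = "\u0627"               # ARABIC LETTER ALEF
LAM_ALEF_ISOLATED = "\uFEFB"  # ARABIC LIGATURE LAM WITH ALEF ISOLATED FORM


def substitute_lam_alef(text: str) -> str:
    """Replace each lam + alef pair with the isolated ligature form.

    Simplification: the joining context is ignored, so the isolated
    form is always chosen.
    """
    return text.replace(LAM + ALEF, LAM_ALEF_ISOLATED)


if __name__ == "__main__":
    shaped = substitute_lam_alef(LAM + ALEF)
    print(unicodedata.name(shaped))
    # Compatibility decomposition maps the ligature back to lam + alef.
    print(unicodedata.normalize("NFKD", shaped) == LAM + ALEF)
```

If the font lacks a glyph at the presentation-form code point, or the renderer never performs the substitution, the two letters are drawn unjoined or as a "missing glyph" box.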

The main problem that persists is rendering of complex/stacked non-spacing marks as shown in [1]. I have no idea how to fix that - that might be complex.

Amir: I'll take a look at the article you mentioned.

[1] https://secure.wikimedia.org/wikibooks/ar/wiki/%D9%85%D8%B3%D8%AA%D8%AE%D8%AF%D9%85:Reza1615/test#rendering_reordered_non-spacing_marks

I updated http://fa.wikipedia.org/wiki/%DA%A9%D8%A7%D8%B1%D8%A8%D8%B1:Reza1615/pdf
1- It contains the bugs that Amir mentioned. This bug is with {{#expr: }} and {{formatnum: |R}}
2- This extension doesn't support Farsi numbers in text, also with #
3- For non-spacing marks, Firefox had this bug and they solved it in this report: https://bugzilla.mozilla.org/show_bug.cgi?id=635639
4- This extension forces all text to be RTL, but some text must be LTR (I mentioned it in my sample)

volker.haas wrote:

Regarding the italic problem:
I couldn't find an italic (nor bold italic) variant of the Nazli font. For this to work I need 4 font files (regular, bold, italic, bold-italic) which cover the Unicode ranges Arabic, Arabic Supplement, Arabic Presentation Forms A + B.

The zip file linked from [1] does not contain a font satisfying all criteria.

[1] http://fa.farsiweb.ir/fawiki/Persian_Fonts
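
The coverage criterion for a candidate font could be checked roughly like this. The sketch assumes the set of code points a font maps has already been extracted by some font tool (that step is not shown), and the function names are made up for this example. The block ranges are the standard Unicode ones; note that the blocks contain unassigned code points, so a non-empty "missing" list is a hint, not proof of a defect.

```python
# Unicode blocks named in the comment above, as code point ranges.
ARABIC_BLOCKS = {
    "Arabic": range(0x0600, 0x0700),
    "Arabic Supplement": range(0x0750, 0x0780),
    "Arabic Presentation Forms-A": range(0xFB50, 0xFE00),
    "Arabic Presentation Forms-B": range(0xFE70, 0xFF00),
}


def missing_by_block(covered):
    """Map each block name to the code points the font does not cover.

    `covered` is an iterable of integer code points (assumed input).
    """
    covered = set(covered)
    return {
        name: [cp for cp in block if cp not in covered]
        for name, block in ARABIC_BLOCKS.items()
    }


def covers_text(covered, text):
    """True if every character of `text` has a code point in `covered`."""
    covered = set(covered)
    return all(ord(ch) in covered for ch in text)


if __name__ == "__main__":
    # Pretend font that covers only the basic Arabic block.
    pretend_font = set(range(0x0600, 0x0700))
    for name, gaps in missing_by_block(pretend_font).items():
        print(name, len(gaps))
```

`covers_text` is the more important check for this bug: it tests whether the example word from the minimal test page is covered at all.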

volker.haas wrote:

I just stumbled over Freefarsi:

http://fpf.sourceforge.net/per/index.html

ubuntu/debian package: ttf-freefarsi

This font looks promising. I'll check if it works properly for Arabic/Farsi.

popeno2003 wrote:

I have checked this font in LibreOffice and seems to be ok.

volker.haas wrote:

I just uploaded two test documents. The one using freefarsi fixes the italic/bold-italic issue, as expected.

One more thing that seems (mostly?) fixed are the non-spacing marks: could you guys please check the "rendering reordered non-spacing marks" section and compare it to the source [1].

The only issue I can see is wrong vertical positioning of the non-spacing marks. Any hints on how to solve that are appreciated (thanks for the Firefox bug link, that already helped...).

Is the freefarsi version readable? The reason that I am asking is: completely fixing this issue might be beyond the scope currently.

[1] http://fa.wikipedia.org/w/index.php?title=%DA%A9%D8%A7%D8%B1%D8%A8%D8%B1:Reza1615/pdf&oldid=5430743

(In reply to comment #22)

> Created attachment 8934 [details]
> test pdf with freefarsi font

Source of PDF:
http://fa.wikipedia.org/w/index.php?title=%DA%A9%D8%A7%D8%B1%D8%A8%D8%B1:Reza1615/pdf&oldid=5430743

  1. The bugs with non-spacing marks are solved
  2. Italic and bold italic are now OK

New bugs that happen with Freefarsi:

  1. This font replaces the bullets of "*" wiki markup with ة
  2. According to http://fa.wikipedia.org/wiki/%D9%BE%D8%B1%D9%88%D9%86%D8%AF%D9%87:L2_versus_L3.jpg, هٔ cannot be rendered well with Freefarsi, so this font is not good enough for Persian and Arabic texts!

Please don't use this font because it is not standard.
Nazli is the best. I checked it in http://fa.wikipedia.org/w/index.php?title=%DA%A9%D8%A7%D8%B1%D8%A8%D8%B1:Reza1615/pdf&printable=yes and the browser can render it in italic form.
So, in my opinion it is better to use an italic render engine instead of an italic font.

Vertical positioning of the non-spacing marks is not so important and it is OK :) Horizontal position is important. In this sample it is correct.

volker.haas wrote:

Italic:

If freefarsi can't be used that's fine with me, but: I can't get italic to work if I use the Nazli font. The browser might render a non-italic font in an italic way - the render engine I am using to produce PDFs can't do that.

> I need a font including all its variants (normal, bold, italic, bold-italic). Everything that is missing will be rendered using the regular font.

Something completely different:

The direction of the text is sometimes broken. This happens if pretty much anything except plain text is used in conjunction with LTR text when the base direction is RTL.

Example:

C<sub>50</sub>H<sub>70</sub>... (as you pointed out Reza)

or an even more fun example:

'''توسط word one''' word two - هم‌کاری

I can't fix this. Not today, and not anywhere in the near future. If this is supposed to be fixed I need to switch the underlying PDF render engine. Since there is no real alternative to what we use now (reportlab) this is even more complicated. Therefore I am guessing that changing the render engine would take at least 6 months. Unfortunately this is completely out of scope at the moment.
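
To make the mixed-direction case more concrete, here is a rough Python sketch that splits a string into runs by the strong bidirectional class of its characters. It is a deliberate simplification of the Unicode Bidirectional Algorithm (UAX #9), not what reportlab does: neutral characters and digits are simply attached to the preceding run, and the function names are invented for this example.

```python
import unicodedata


def strong_direction(ch):
    """Return "LTR", "RTL", or None for characters without a strong class."""
    bidi_class = unicodedata.bidirectional(ch)
    if bidi_class == "L":
        return "LTR"
    if bidi_class in ("R", "AL"):
        return "RTL"
    return None


def directional_runs(text):
    """Split text into (direction, substring) runs.

    Simplification: neutrals (spaces, punctuation) and digits join the
    run before them instead of being resolved by context as in UAX #9.
    """
    runs = []
    for ch in text:
        direction = strong_direction(ch)
        if runs and (direction is None or direction == runs[-1][0]):
            runs[-1] = (runs[-1][0], runs[-1][1] + ch)
        else:
            runs.append((direction, ch))
    return runs


if __name__ == "__main__":
    sample = "\u062a\u0648\u0633\u0637 word one"
    for direction, chunk in directional_runs(sample):
        print(direction, repr(chunk))
```

A renderer has to reorder such runs for display relative to the base direction. When inline markup (bold, subscripts) cuts a paragraph into separately laid-out fragments, each fragment gets reordered in isolation, which is one plausible way the ordering described above can break.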

If the text direction problem I just described is a show stopper for right-to-left languages, then we can stop now. If the problem does not occur too often and is acceptable, we can go on. The third alternative is swapping the rendering engine, but a sponsor for 6 months of work would need to be found ;)

Created attachment 8942
updating webfarsi italic font and solving freefarsi bug

I made an italic version of Nazli and also solved the Freefarsi bullet bug. Would you please test them?
Is it possible for the PDF render engine to render Python or other language code in color? http://fa.wikipedia.org/wiki/%DA%A9%D8%A7%D8%B1%D8%A8%D8%B1:Reza1615/pdf2

Would you please solve the other bugs, such as the numbers and the infobox's size?

volker.haas wrote:

I rendered test documents with the updated fonts that you, Reza, provided. I chose to upload them to our server, which is more convenient for me and shouldn't make a difference for anybody else.

Please compare:

http://pediapress.com/files/rtl/test_nazli.pdf
http://pediapress.com/files/rtl/test_freefarsi.pdf

I spotted a possible bug in the bold variant of the nazli font: the line-heights seem to be a little too big. This might result in the missing spacing below the bold paragraph. If the Nazli font is supposed to be used then this should be fixed as well.

Reza: I'll look into the other issues. Regarding the numbers: which numbers should be converted to the Farsi variant? Only those in numbered lists, or also page numbers etc.?

Created attachment 8958
Nazli font bold New version

1- I attached a new version of Nazli bold. Please replace the old one with it.
1-1- In the Nazli version I saw that the non-spacing marks had a problem, but in Freefarsi it is OK. Sadly, the Freefarsi font is not in Persian style; some of its glyphs are in Urdu style. Is it possible to solve the non-spacing marks problem in Nazli?

2- For numbers:
2-1- page numbers
2-2- numbered lists
2-3- citation numbers (reference numbers)
2-4- all of the numbers in the infobox in my sample are in Persian, but in the PDF version only some of them are converted.

1- Syntax highlighting in the English version is colorful, but in fa.wiki it is black and white
http://en.wikipedia.org/wiki/User:Reza1615/pdf
http://fa.wikipedia.org/wiki/کاربر:Reza1615/pdf
2- The location map in both en.wiki and fa.wiki is incorrect.

volker.haas wrote:

(In reply to comment #29)

> Created attachment 8958 [details]
> Nazli font bold New version
>
> 1- I attached a new version of Nazli bold. Please replace the old one with it.
> 1-1- In the Nazli version I saw that the non-spacing marks had a problem, but in
> Freefarsi it is OK. Sadly, the Freefarsi font is not in Persian style; some of its
> glyphs are in Urdu style. Is it possible to solve the non-spacing marks problem in Nazli?

This probably has to be "fixed" in the font. My assumption is based on the Mozilla/Firefox bug report you mentioned earlier: the non-spacing marks have "wrong" width values. A rendering engine with better support for diacritics probably would not exhibit this behavior. But for the PDF rendering backend we are using, the font probably needs to be corrected: the width of the non-spacing marks has to be explicitly set (or unset?) to zero. A comparison of one non-spacing mark in the two fonts might reveal the problem.
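
The effect of non-zero widths on non-spacing marks can be sketched as follows. This is not the width code of the render engine; `glyph_widths` is a hypothetical table of advance widths (character to font units) that would have to be read from the font with some font tool, and both function names are invented for this example.

```python
import unicodedata


def text_width(text, glyph_widths, default=500):
    """Sum advance widths, treating non-spacing marks (category Mn) as zero.

    `glyph_widths` is an assumed dict mapping a character to its advance
    width; `default` is used for characters missing from the table.
    """
    total = 0
    for ch in text:
        if unicodedata.category(ch) == "Mn":
            continue  # combining marks sit on the base glyph, no advance
        total += glyph_widths.get(ch, default)
    return total


def marks_with_advance(glyph_widths):
    """Return the non-spacing marks whose font width is not zero."""
    return {
        ch: width
        for ch, width in glyph_widths.items()
        if unicodedata.category(ch) == "Mn" and width != 0
    }


if __name__ == "__main__":
    # Made-up widths: beh (U+0628) and fatha (U+064E, a non-spacing mark).
    widths = {"\u0628": 600, "\u064e": 300}
    print(text_width("\u0628\u064e", widths))
    print(marks_with_advance(widths))
```

Running `marks_with_advance` over the width tables of both fonts is one way to do the comparison suggested above: marks that show up for Nazli but not for Freefarsi would be the ones to correct.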

> 2- For numbers:
> 2-1- page numbers
> 2-2- numbered lists
> 2-3- citation numbers (reference numbers)
> 2-4- all of the numbers in the infobox in my sample are in Persian, but in the PDF
> version only some of them are converted.

I'll check that and try to fix it.

Created attachment 8959
Nazli New pak

I changed the properties of all non-spacing marks for (Nazli, Nazlib, Nazli-italic, Nazlibold-italic). Would you please test them?

volker.haas wrote:

(In reply to comment #30)

> 1- Syntax highlighting in the English version is colorful, but in fa.wiki it is black and
> white
> http://en.wikipedia.org/wiki/User:Reza1615/pdf
> http://fa.wikipedia.org/wiki/کاربر:Reza1615/pdf

fixed with:
http://code.pediapress.com/git/mwlib.rl?p=mwlib.rl;a=commit;h=59d45fe60762fa55e43eaac58d9b3ee9fed985fb

> 2- location map in both en.wiki and fa.wiki is incorrect.

This is a known issue: absolute positioned content isn't rendered correctly. Unfortunately I have no idea how to fix this.

volker.haas wrote:

(In reply to comment #32)

> Created attachment 8959 [details]
> Nazli New pak
>
> I changed the properties of all non-spacing marks for
> (Nazli, Nazlib, Nazli-italic, Nazlibold-italic). Would you please test them?

There seems to be a problem with the font. The rendering engine crashes with the following traceback:

Traceback (most recent call last):

File "/home/volker/repos/mwlib.rl/mwlib/rl/rlwriter.py", line 528, in renderBook
  self.doc.build(elements)
File "/home/volker/py26/lib/python2.6/site-packages/mwlib.ext-0.12.3-py2.6-linux-i686.egg/mwlib/ext/reportlab/platypus/doctemplate.py", line 906, in build
  self._endBuild()
File "/home/volker/py26/lib/python2.6/site-packages/mwlib.ext-0.12.3-py2.6-linux-i686.egg/mwlib/ext/reportlab/platypus/doctemplate.py", line 848, in _endBuild
  if getattr(self,'_doSave',1): self.canv.save()
File "/home/volker/py26/lib/python2.6/site-packages/mwlib.ext-0.12.3-py2.6-linux-i686.egg/mwlib/ext/reportlab/pdfgen/canvas.py", line 1123, in save
  self._doc.SaveToFile(self._filename, self)
File "/home/volker/py26/lib/python2.6/site-packages/mwlib.ext-0.12.3-py2.6-linux-i686.egg/mwlib/ext/reportlab/pdfbase/pdfdoc.py", line 235, in SaveToFile
  f.write(self.GetPDFData(canvas))
File "/home/volker/py26/lib/python2.6/site-packages/mwlib.ext-0.12.3-py2.6-linux-i686.egg/mwlib/ext/reportlab/pdfbase/pdfdoc.py", line 247, in GetPDFData
  fnt.addObjects(self)
File "/home/volker/py26/lib/python2.6/site-packages/mwlib.ext-0.12.3-py2.6-linux-i686.egg/mwlib/ext/reportlab/pdfbase/ttfonts.py", line 1126, in addObjects
  pdfFont.ToUnicode = doc.Reference(cmapStream, 'toUnicodeCMap:' + baseFontName)
File "/home/volker/py26/lib/python2.6/site-packages/mwlib.ext-0.12.3-py2.6-linux-i686.egg/mwlib/ext/reportlab/pdfbase/pdfdoc.py", line 516, in Reference
  raise ValueError, "redefining named object: "+repr(name)

ValueError: redefining named object: 'toUnicodeCMap:AAAAAA+Nazli'

(In reply to comment #29)

> Created attachment 8958 [details]
> Nazli font bold New version

Is it OK?

(In reply to comment #33)

> (In reply to comment #30)
>
> > 2- location map in both en.wiki and fa.wiki is incorrect.
>
> This is a known issue: absolute positioned content isn't rendered correctly.
> Unfortunately I have no idea how to fix this.

I updated http://fa.wikipedia.org/wiki/%DA%A9%D8%A7%D8%B1%D8%A8%D8%B1:Reza1615/pdf#test. Maybe it helps you.

volker.haas wrote:

(In reply to comment #35)

> (In reply to comment #29)
>
> > Created attachment 8958 [details]
> > Nazli font bold New version
>
> Is it OK?

This is getting a little confusing...The new font version with the fixed non-spacing marks does not render at all. The PDF export crashes due to some error in the font I guess.

volker.haas wrote:

(In reply to comment #36)

> (In reply to comment #33)
>
> > (In reply to comment #30)
> >
> > > 2- location map in both en.wiki and fa.wiki is incorrect.
> >
> > This is a known issue: absolute positioned content isn't rendered correctly.
> > Unfortunately I have no idea how to fix this.
>
> I updated
> http://fa.wikipedia.org/wiki/%DA%A9%D8%A7%D8%B1%D8%A8%D8%B1:Reza1615/pdf#test
> Maybe it helps you.

Thanks, but that does not help. I investigated the issue with absolute positioned stuff intensively in the past. The result was that for the downloadable PDFs I have no idea how to solve it.

volker.haas wrote:

Let me sum up the font situation:

Freefarsi: all four variants present, non-spacing marks rendered correctly. BUT: one character (هٔ) is not suited for Persian or Arabic.

Nazli: all four font variants present (thanks Reza!) and working. non-spacing marks are not rendered correctly. Reza tried to fix the font, but that resulted in crashes of the rendering engine. Error message below:

ValueError: redefining named object: 'toUnicodeCMap:AAAAAA+Nazli'

Rendering with the Nazli font works if I only use the regular variant. Therefore my suspicion is that there was a mistake when constructing the bold (possibly also the italic/bold-italic) variant of the font. Maybe the internal font name was accidentally changed from something like NazliBold to Nazli, therefore colliding with the regular Nazli font.

@Reza: could you double check that and if possible fix it. In that case I could use your Nazli font variants with fixed non-spacing mark positioning.
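
The suspected name collision can be reproduced with a toy model. The class below is not reportlab's code; it only mimics the behavior visible in the traceback, where a named object is registered under 'toUnicodeCMap:' plus the font name and a second registration under the same name raises ValueError. The "AAAAAA+" prefix is copied from the error message; everything else (class and function names) is made up for this sketch.

```python
class NamedObjectRegistry:
    """Toy stand-in for a PDF document that refuses duplicate object names."""

    def __init__(self):
        self._objects = {}

    def reference(self, name, obj):
        if name in self._objects:
            raise ValueError("redefining named object: " + repr(name))
        self._objects[name] = obj
        return name


def embed_fonts(internal_names):
    """Register one ToUnicode CMap per font, keyed by its internal name."""
    registry = NamedObjectRegistry()
    for internal_name in internal_names:
        registry.reference("toUnicodeCMap:AAAAAA+" + internal_name, object())
    return registry


def duplicate_names(internal_names):
    """Return internal font names that occur more than once, in order."""
    seen, duplicates = set(), []
    for name in internal_names:
        if name in seen and name not in duplicates:
            duplicates.append(name)
        seen.add(name)
    return duplicates


if __name__ == "__main__":
    embed_fonts(["Nazli", "NazliBold"])  # distinct names: fine
    try:
        embed_fonts(["Nazli", "Nazli"])  # bold file also calls itself "Nazli"
    except ValueError as exc:
        print(exc)
```

Under this model, the practical check on the font side is that each of the four files carries a distinct internal name; `duplicate_names` applied to the names read from the four files would point at the offending variant.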

sumanah wrote:

I added the internationalization keyword (i18n).

Can people still reproduce this problem? Is it still affecting people? I'm checking since we've deployed MediaWiki 1.18 to all Wikimedia Foundation wikis and that has some fixes that might have solved this problem.

Thanks to Volker Haas, who solved them in the PDF export engine.