Language conversion is not applied in documents delivered by the Collection extension
Open, Normal, Public

Description

Author: yaoziyuan

Description:
Now that T35430 has been fixed, the Chinese Wikipedia community says one more problem still prevents them from adopting the latest MediaWiki version, which provides PDF/ebook creation for the Chinese Wikipedia.

The remaining problem is that the wikitext of the Chinese Wikipedia mixes simplified and traditional Chinese: mainland contributors tend to edit in simplified Chinese, while contributors from Taiwan and Hong Kong tend to edit in traditional Chinese. The text therefore needs to be converted to all-simplified or all-traditional before it is displayed or rendered into PDFs.


Version: unspecified
Severity: major
See Also:
http://web.archive.org/web/20111002213849/http://code.pediapress.com/wiki/ticket/574

Details

Reference
bz34919


Event Timeline

bzimport raised the priority of this task to Normal. Nov 22 2014, 12:19 AM
bzimport added projects: Collection, I18n.
bzimport set Reference to bz34919.
bzimport added a subscriber: Unknown Object (MLST).
bzimport created this task. Mar 3 2012, 1:22 AM

Language converter is not only used on zhwiki.

volker.haas wrote:

Is the conversion to all-simplified or all-traditional done for "regular" display in the browser, so that it is only a problem with the PDFs at the moment? If that is the case:

  • how is the conversion done for the browser?
  • can someone provide a minimal example with simplified and traditional Chinese?
  • what would be a good starting point for reading up on the problem of simplified vs. traditional Chinese and the conversion methods?

yaoziyuan wrote:

The Chinese Wikipedia itself already has an automatic simplified <-> traditional Chinese conversion tool for display. It is explained here:

http://meta.wikimedia.org/wiki/Automatic_conversion_between_simplified_and_traditional_Chinese

An example of the conversion in action:

Simplified: http://zh.wikipedia.org/zh-cn/%E4%BA%94%E4%BB%A3%E5%8D%81%E5%9B%BD

Traditional: http://zh.wikipedia.org/zh-tw/%E4%BA%94%E4%BB%A3%E5%8D%81%E5%9B%BD
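
For anyone who wants to script this, here is a minimal sketch of how the two example URLs above are put together: the variant code (zh-cn or zh-tw) takes the place of the usual path prefix, followed by the percent-encoded page title. The helper name `variant_url` and its `host` parameter are made up for illustration; only the URL shape is taken from the links above.

```python
from urllib.parse import quote


def variant_url(title, variant, host="zh.wikipedia.org"):
    """Build the URL of a page as rendered in one language variant.

    Hypothetical helper: it only mirrors the shape of the two example
    links in this comment (http://<host>/<variant>/<encoded title>).
    """
    return f"http://{host}/{variant}/{quote(title)}"


simplified_url = variant_url("五代十国", "zh-cn")
traditional_url = variant_url("五代十国", "zh-tw")
print(simplified_url)
print(traditional_url)
```

Both calls use the same page title; only the variant segment of the path differs.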

(In reply to comment #2)

Is the conversion to all-simplified or all-traditional done for "regular"
display in the browser, so that it is only a problem with the PDFs at the
moment? If that is the case:

  • how is the conversion done for the browser?
  • can someone provide a minimal example with simplified and traditional Chinese?
  • what would be a good starting point for reading up on the problem of simplified vs. traditional Chinese and the conversion methods?

Technically, language conversion runs after the normal parsing process. This means that if you parse the article in your own way (to generate a PDF), you have to apply the conversion to your parser's output yourself. Note that the current converter (in languages/LanguageConverter.php) is designed to convert HTML only.
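
To illustrate what "apply the conversion to your parser's output" could look like on the Python side, here is a rough sketch that runs a greedy longest-match replacement over already-parsed text fragments. Everything in it is an assumption for illustration: the three-entry table is a toy and not the real MediaWiki conversion table, and `convert` / `convert_fragments` are hypothetical names, not mwlib or LanguageConverter APIs.

```python
# Toy simplified -> traditional table (an assumption for illustration;
# the real tables are far larger and partly maintained on-wiki).
TO_TRADITIONAL = {"国": "國", "头发": "頭髮", "发": "發"}


def convert(text, table):
    """Greedy longest-match replacement of table keys in text."""
    max_len = max(len(key) for key in table)
    out = []
    i = 0
    while i < len(text):
        # Try the longest candidate first so that phrase entries
        # ("头发") win over single-character entries ("发").
        for n in range(min(max_len, len(text) - i), 0, -1):
            chunk = text[i:i + n]
            if chunk in table:
                out.append(table[chunk])
                i += n
                break
        else:
            # No entry matched at this position: copy one character.
            out.append(text[i])
            i += 1
    return "".join(out)


def convert_fragments(fragments, table):
    """Convert every text fragment produced by some earlier parse step."""
    return [convert(fragment, table) for fragment in fragments]


print(convert_fragments(["五代十国", "头发", "发展"], TO_TRADITIONAL))
```

Matching phrases before single characters matters because one simplified character can map to different traditional characters depending on the word it appears in.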

yaoziyuan wrote:

I'm sure there are many PHP-based simplified/traditional Chinese conversion libraries.

(In reply to comment #5)

I'm sure there are many PHP-based simplified/traditional Chinese conversion
libraries.

mwlib (the wikitext parser & PDF generator used by Extension:Collection) is not written in PHP. Besides, you also have to handle conversion markup such as -{}-.
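
As a rough sketch of why the markup matters, the snippet below handles only the simplest form, -{text}-, by copying the enclosed text through unconverted and converting everything around it. This is an assumption-laden illustration: the two-entry character map is a toy, `convert_with_markup` is a hypothetical name, and the rule forms of the markup (per-variant alternatives, flags) are not handled at all.

```python
import re

# Toy character map (an assumption for illustration only).
TO_TRADITIONAL = str.maketrans({"国": "國", "发": "發"})

# Matches the simplest markup form: -{ ... }-
NO_CONVERT = re.compile(r"-\{(.*?)\}-", re.DOTALL)


def convert_with_markup(text):
    """Convert text, leaving the contents of -{...}- spans untouched."""
    out = []
    pos = 0
    for match in NO_CONVERT.finditer(text):
        # Convert the text before the protected span.
        out.append(text[pos:match.start()].translate(TO_TRADITIONAL))
        # Emit the protected contents verbatim, without the delimiters.
        out.append(match.group(1))
        pos = match.end()
    # Convert whatever follows the last protected span.
    out.append(text[pos:].translate(TO_TRADITIONAL))
    return "".join(out)


print(convert_with_markup("中国和-{中国}-"))
```

A generic conversion library knows nothing about these spans, so it would convert the protected text and leave the delimiters in the output.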

volker.haas wrote:

The conversion script doesn't exactly look trivial: http://svn.wikimedia.org/doc/LanguageConverter_8php_source.html

Does anybody have an idea how to get the conversion done without having to reimplement the language converter in Python for mwlib?

yaoziyuan wrote:

Google for an existing Python-based conversion library?

ralf_wikimedia wrote:

or just ask for patches?

yaoziyuan wrote:

Google Translate also offers simp. <-> trad. Chinese conversion. Maybe call its API?

(In reply to comment #10)

Google Translate also offers simp. <-> trad. Chinese conversion. Maybe call its
API?

Even in LanguageConverter.php, more code goes into tasks such as parsing conversion markup, picking out the right parts to convert, reading the on-site conversion table, and handling page links than into actually converting the text.

yaoziyuan wrote:

I increasingly believe that such features would be better implemented on the client side, e.g. as a "site to PDF ebook" program that converts a given site (a blog, a wiki, pages up to a certain depth from a start page, etc.) to a PDF.

yaoziyuan wrote:

If you do it too far toward the back end, you end up with too much processing in the middle, like this Chinese conversion issue.

volker.haas wrote:

The problem with the "client-side" approach is that every client needs to re-implement these specific features (like the simplified/traditional conversion).

If we ever use HTML as the base for PDF rendering, this problem will be solved as long as MediaWiki takes care of the transformation. In the meantime I'd happily accept a patch for the problem, but I lack the time to implement the simplified/traditional conversion myself.

yaoziyuan wrote:

(In reply to comment #14)

The problem with the "client-side" approach is that every client needs to
re-implement these specific features (like the simplified/traditional conversion).

No, because simplified/traditional conversion is already taken care of by the Chinese Wikipedia on the server side.

If we ever use HTML as the base for PDF rendering, this problem will be solved
as long as MediaWiki takes care of the transformation. In the meantime I'd
happily accept a patch for the problem, but I lack the time to implement the
simplified/traditional conversion myself.

That's exactly why I think third-party client-side or browser-side PDF/ebook creation solutions would provide what PediaPress hasn't provided.

barabbas wrote:

FYI, before LanguageConverter.php, there was a quick-and-dirty trial in LanguageZh.php: https://bugzilla.wikimedia.org/show_bug.cgi?id=5343

Created attachment 16595
Корисник:Никола Смоленски/Collection bugs.pdf

Serbian test case PDF as produced by [[mw:OCG]]/rdf2latex/new PDF rendering.

Yes, this is a side effect of the fact that Parsoid still lacks support for the language converter. But I'm working on it!

Restricted Application added a project: Internet-Archive. Jan 28 2016, 2:11 AM
Restricted Application added a subscriber: Aklapper.
Restricted Application added a subscriber: Cosine02. Dec 21 2016, 9:58 AM
Liuxinyu970226 removed a subscriber: wikibugs-l-list.

Apologies for copying this sentence here, @Aklapper; you have posted it in many OCG-related tasks:

As already announced in Tech News, OfflineContentGenerator (OCG) will not be used anymore after October 1st, 2017 on Wikimedia sites. OCG will be replaced by Electron. You can read more on mediawiki.org.

Should we focus on T167603 instead? Or will this problem still exist even after the PDF features are gone?

Amire80 moved this task from Untriaged to Script conversion on the I18n board. Feb 4 2018, 10:48 AM