
Automatically infer language from unicode character class
Closed, Declined (Public)

Description

Users are pretty bad at labeling foreign-language inclusions in their texts. This often causes incorrect rendering: the wrong font, incorrect hyphenation, and sometimes the wrong directionality (but see T73869).

We should probably add a pass which tries to infer appropriate <span lang="..."></span> tags when the Unicode character class of the characters in the text diverges from that of the expected language.
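A minimal sketch of what such a pass might look like, in Python. Everything here is an assumption for illustration, not OCG code: the function names are hypothetical, the script of a character is approximated by the first word of its Unicode name (Python's standard library does not expose the Unicode Script property), and the `SCRIPT_TO_LANG` table is a hand-written guess, since a script does not determine a language.

```python
import unicodedata
from html import escape
from itertools import groupby

# Hypothetical guesses only: a script does not uniquely identify a language.
SCRIPT_TO_LANG = {"HEBREW": "he", "GREEK": "el", "CYRILLIC": "ru"}


def script_of(ch):
    """Approximate a character's script by the first word of its Unicode name."""
    if not ch.isalpha():
        return None  # spaces, digits, punctuation, combining marks
    return unicodedata.name(ch, "UNKNOWN").split()[0]


def tag_foreign_runs(text, expected_script="LATIN"):
    """Wrap runs whose script differs from expected_script in <span lang="...">."""
    # Characters without a script of their own inherit the preceding letter's script.
    scripts, current = [], expected_script
    for ch in text:
        current = script_of(ch) or current
        scripts.append(current)

    out = []
    for script, group in groupby(zip(scripts, text), key=lambda pair: pair[0]):
        run = "".join(ch for _, ch in group)
        lang = SCRIPT_TO_LANG.get(script)
        if script != expected_script and lang:
            # Keep trailing whitespace outside the span.
            core = run.rstrip()
            tail = run[len(core):]
            out.append(f'<span lang="{lang}">{escape(core)}</span>{escape(tail)}')
        else:
            out.append(escape(run))
    return "".join(out)
```

With these assumptions, `tag_foreign_runs("The word λόγος means reason")` wraps only the Greek word in a `lang="el"` span. Runs in a script that is missing from the table are left untagged rather than guessed.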

Event Timeline

cscott raised the priority of this task from to Needs Triage.
cscott updated the task description.
cscott added a project: OCG-PDF-renderer.
cscott added a subscriber: cscott.

As a general rule, guessing language is a hard problem... Are language tags actually what you want here, or something to force the renderer to load fonts?

For instance I have no idea how to distinguish Hindi from Sanskrit given both are written in Devanagari script...
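The ambiguity can be shown with a short Python sketch. It assumes the same name-based script heuristic as a stand-in for real script detection, and the `script_tag` helper and `ISO_15924` table are hypothetical and incomplete. A Hindi word and a Sanskrit word come out identical at the script level, so the most the pass can honestly emit is a BCP 47 tag with an undetermined language and a script subtag (`und-Deva`), which may still be enough for font selection.

```python
import unicodedata

# Illustrative, incomplete mapping from Unicode name prefixes to ISO 15924 script codes.
ISO_15924 = {
    "DEVANAGARI": "Deva",
    "CYRILLIC": "Cyrl",
    "ARABIC": "Arab",
    "HEBREW": "Hebr",
    "GREEK": "Grek",
    "LATIN": "Latn",
}


def script_tag(word):
    """Return a tag such as 'und-Deva' if all letters share one known script, else None."""
    scripts = {
        unicodedata.name(ch, "UNKNOWN").split()[0]
        for ch in word
        if ch.isalpha()
    }
    if len(scripts) == 1:
        code = ISO_15924.get(scripts.pop())
        if code:
            return f"und-{code}"
    return None


hindi_word = "पानी"      # Hindi for "water"
sanskrit_word = "जलम्"   # Sanskrit for "water"

# Both words yield the same tag: the script is known, the language is not.
same_tag = script_tag(hindi_word) == script_tag(sanskrit_word)
```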

As already announced in Tech News, OfflineContentGenerator (OCG) will not be used anymore after October 1st, 2017 on Wikimedia sites. OCG will be replaced by Electron. You can read more on mediawiki.org.

Declining this task as OCG has been dead for years and superseded by Proton and Electron-PDFs on Wikimedia servers.
If this is still wanted for the currently available PDF export on Wikimedia servers, then please file a new ticket with updated information and steps to reproduce. Thanks!