
Automatically infer language from Unicode character class
Open, Needs Triage, Public

Description

Users are pretty bad at labeling foreign-language inclusions in their texts. This often causes incorrect rendering: bad font, incorrect hyphenation, and sometimes wrong directionality (but see T73869).

We should probably add a pass that tries to infer appropriate <span lang="..."></span> tags when the Unicode character class of the characters in the text diverges from what the expected language uses.
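
A minimal sketch of what such a pass could look like, in Python rather than in the renderer's own code. Everything here is an assumption for illustration: the names script_of, SCRIPT_CODES and tag_foreign_runs are hypothetical, and because Python's standard library does not expose the Unicode Script property, the first word of each character's Unicode name is used as a rough stand-in. Since a script alone does not identify a language, the sketch emits script-only tags of the form und-Cyrl ("undetermined language, Cyrillic script") instead of guessing a language.

```python
import unicodedata
from html import escape


def script_of(ch):
    """Rough script guess: first word of the Unicode character name.

    Returns None for script-neutral characters (spaces, digits,
    punctuation, combining marks) and for unnamed code points.
    """
    if not ch.isalpha():
        return None
    try:
        return unicodedata.name(ch).split()[0]
    except ValueError:
        return None


# Hypothetical, deliberately incomplete mapping from name prefix
# to ISO 15924 script code.
SCRIPT_CODES = {
    "LATIN": "Latn",
    "CYRILLIC": "Cyrl",
    "GREEK": "Grek",
    "DEVANAGARI": "Deva",
    "HEBREW": "Hebr",
    "ARABIC": "Arab",
}


def tag_foreign_runs(text, expected="LATIN"):
    """Wrap runs whose script differs from `expected` in <span lang=...>."""
    runs = []  # list of [script, chunk]
    for ch in text:
        script = script_of(ch)
        if script is None:
            # Neutral characters join the preceding run, if any.
            script = runs[-1][0] if runs else expected
        if runs and runs[-1][0] == script:
            runs[-1][1] += ch
        else:
            runs.append([script, ch])

    out = []
    for script, chunk in runs:
        code = SCRIPT_CODES.get(script)
        if script == expected or code is None:
            out.append(escape(chunk))  # expected or unknown script: leave as-is
        else:
            core = chunk.rstrip()      # keep trailing whitespace outside the span
            tail = chunk[len(core):]
            out.append(f'<span lang="und-{code}">{escape(core)}</span>{escape(tail)}')
    return "".join(out)


print(tag_foreign_runs("Hello мир world"))
```

A real implementation would more likely use the actual Unicode Script property (for example through ICU) and would need a policy for scripts shared by several languages.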

Event Timeline

cscott created this task. Jul 19 2015, 4:06 PM
cscott raised the priority of this task to Needs Triage.
cscott updated the task description.
cscott added a project: OCG-PDF-renderer.
cscott added a subscriber: cscott.
Restricted Application added a subscriber: Aklapper. Jul 19 2015, 4:06 PM
brion added a subscriber: brion. Jul 21 2015, 12:46 AM

As a general rule, guessing language is a hard problem... Are language tags actually what you want here, or do you want something that forces the renderer to load fonts?

For instance, I have no idea how to distinguish Hindi from Sanskrit, given that both are written in Devanagari script...
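
To illustrate the Hindi/Sanskrit point, here is a small sketch in Python. The helper scripts_in is hypothetical, and it approximates a character's script by the first word of its Unicode name, which is an assumption rather than the real Script property. Both sample words come out as the same single script, so a character-class check alone cannot tell the two languages apart.

```python
import unicodedata


def scripts_in(text):
    """Set of Unicode-name prefixes for the alphabetic characters in text."""
    return {unicodedata.name(ch).split()[0] for ch in text if ch.isalpha()}


hindi = "हिन्दी"        # the word "Hindi", in Hindi
sanskrit = "संस्कृतम्"  # the word "Sanskrit", in Sanskrit

# Both sets contain only the Devanagari prefix.
print(scripts_in(hindi), scripts_in(sanskrit))
```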

As already announced in Tech News, OfflineContentGenerator (OCG) will not be used anymore after October 1st, 2017 on Wikimedia sites. OCG will be replaced by Electron. You can read more on mediawiki.org.