Maniphest T191571

LanguageConverter::guessVariant should go away
Open, Medium, Public
Assigned To

None

Authored By

	cscott
	Apr 5 2018, 9:49 PM

Description

The only language which uses LanguageConverter::guessVariant is LanguageSr. It's a huge hack -- it skips conversion of the *whole page* to sr-ec if there are enough "Cyrillic-looking" letters on the page, even if there are large chunks of Latin text which need conversion. Even if you like the overall idea of guessing the variant used by a given editor, the granularity is all wrong: recursiveConvertTopLevel is first called on *the entire page contents* (HTML tags and attributes included), and guessVariant then decides whether to do conversion *at all*. It's basically an error if guessVariant ever returns true at this point, since that will cause all conversion on the rest of the page to be skipped. The next time recursiveConvertTopLevel is invoked we're at the level of individual attribute strings (typically title attributes for links) -- and if guessVariant returns true here, the title text will not be converted while the link text is converted; that's basically never the right thing to do either. There are probably other ways you can sneak back into recursiveConvertTopLevel, but I can't think of anything good that would come of it.
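To make the granularity problem concrete, here is a minimal Python sketch of a whole-string "skip if it looks Cyrillic" check. It is not the MediaWiki implementation: the function names, the 0.5 threshold, and the toy transliteration table are all assumptions made for illustration. It only shows how a single decision taken over the whole page leaves an embedded Latin chunk unconverted, while the same chunk on its own would be converted.

```python
# Hypothetical threshold; NOT taken from MediaWiki, chosen only for illustration.
CYRILLIC_RATIO_THRESHOLD = 0.5

# Toy Latin -> Cyrillic table covering just the letters used below (assumption).
TOY_TABLE = str.maketrans({
    "a": "а", "b": "б", "d": "д", "e": "е",
    "g": "г", "o": "о", "r": "р",
})


def looks_cyrillic(text: str) -> bool:
    """Toy stand-in for a guessVariant-style heuristic: compare letter counts."""
    cyrillic = sum(1 for ch in text if "\u0400" <= ch <= "\u04FF")
    latin = sum(1 for ch in text if ch.isascii() and ch.isalpha())
    total = cyrillic + latin
    return total > 0 and cyrillic / total > CYRILLIC_RATIO_THRESHOLD


def skip_or_convert(text: str) -> str:
    """Make ONE decision for the whole string: skip everything or convert everything."""
    if looks_cyrillic(text):
        return text  # conversion skipped for the entire input
    return text.translate(TOY_TABLE)


# Mostly Cyrillic page with one Latin paragraph; tag names count as Latin letters too.
page = "<p>Ово је ћирилични текст о граду.</p><p>beograd</p>"

whole_page_result = skip_or_convert(page)    # Latin paragraph is left untouched
chunk_result = skip_or_convert("beograd")    # the same chunk alone gets converted
```

In this sketch the page-level call returns the input unchanged, so the Latin paragraph is never converted, even though the identical text passed on its own would be.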

The guessVariant function didn't even seem to work right when it was first introduced (see T37076 and its comments). I'm guessing it's only mildly acceptable on srwiki because all articles on srwiki seem to be written in either one variant or the other (usually Cyrillic), so guessVariant results in conversion being completely disabled at the top level for one or the other variant.

I'm not going to implement guessVariant for Parsoid (T43716) -- it's basically impossible to do so compatibly because it depends on running an arbitrary character-counting heuristic over the exact HTML string which the PHP parser generates. I'd recommend it be deprecated and eventually removed from the PHP parser as well.
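The dependence on the exact HTML string can be sketched the same way. The snippet below is a hedged illustration, not real parser output: the counting function, the 0.5 threshold, and both markup strings (including the `example` class name) are hypothetical. It shows that the same visible text can get opposite answers from a character-counting heuristic depending only on the markup wrapped around it, which is why two parsers emitting different HTML could not be expected to agree.

```python
# Hypothetical threshold and counting rule; assumptions for illustration only.
THRESHOLD = 0.5


def guess_is_cyrillic(html: str) -> bool:
    """Count letters over the raw string, markup included, and compare scripts."""
    cyrillic = sum(1 for ch in html if "\u0400" <= ch <= "\u04FF")
    latin = sum(1 for ch in html if ch.isascii() and ch.isalpha())
    total = cyrillic + latin
    return total > 0 and cyrillic / total > THRESHOLD


visible_text = "Нови Сад grad"

# Two hypothetical serializations of the same visible text.
bare = visible_text
wrapped = '<span class="example">' + visible_text + "</span>"

bare_guess = guess_is_cyrillic(bare)        # 7 Cyrillic vs 4 Latin letters
wrapped_guess = guess_is_cyrillic(wrapped)  # tag and attribute letters tip the count
```

Here the bare string is judged Cyrillic while the wrapped one is not, although a reader would see identical text in both cases.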