Language conversion in page/html endpoints needs to fall back to old LanguageConverter classes if Parsoid doesn't support the conversion.
Closed, Resolved (Public)

Description

Parsoid's language variant conversion support is incomplete. For a conversion that Parsoid does not support, we should fall back to the old converter.

However, the old converter is designed to be called by the classic Parser between parsing stages, on text that is not yet full HTML. A comment in Parser::internalParseHalfParsed() says: "The position of the convert() call should not be changed. It assumes that the links are all replaced and the only thing left is the <nowiki> mark."

There are two choices:

  1. Just apply the old conversion logic to the full HTML. It may not work 100% initially, but the glitches could be ironed out, or we accept them until the conversion in question has been implemented in Parsoid.
  2. Don't use Parsoid HTML at all; instead, use the old parser to generate the rendering, including conversion.

The following languages currently have variant conversion implemented in Parsoid: crh, en, ku, sr, zh. This leaves the following languages, for which conversion is implemented only in MW: ban, gan, iu, kk, shi, tg, uz.
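The fallback rule described above can be sketched as a simple dispatch on the base language code. This is only an illustration in Python, not the MediaWiki implementation (which is written in PHP). The names `PARSOID_LANGUAGES`, `CORE_ONLY_LANGUAGES`, `base_language`, and `pick_converter` are hypothetical, and the two language sets are copied from the lists in this description.

```python
# Language sets copied from the task description; all names are hypothetical.
PARSOID_LANGUAGES = frozenset({"crh", "en", "ku", "sr", "zh"})
CORE_ONLY_LANGUAGES = frozenset({"ban", "gan", "iu", "kk", "shi", "tg", "uz"})


def base_language(variant_code: str) -> str:
    """Return the base language of a variant code, e.g. 'sr-ec' -> 'sr'."""
    return variant_code.split("-", 1)[0].lower()


def pick_converter(variant_code: str) -> str:
    """Decide which converter should handle a target variant.

    Returns 'parsoid' when Parsoid implements conversion for the base
    language, 'core' when only the old converter does, and 'none' when
    neither list covers the language.
    """
    base = base_language(variant_code)
    if base in PARSOID_LANGUAGES:
        return "parsoid"
    if base in CORE_ONLY_LANGUAGES:
        return "core"
    return "none"
```

Under these assumptions, a request for `sr-ec` would go to Parsoid, while a request for a `kk` variant would fall back to the old converter.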

Event Timeline

Change 852769 had a related patch set uploaded (by Abijeet Patro; author: Abijeet Patro):

[mediawiki/core@master] WIP: LanguageVariantConverter: Add fallback to core LanguageConverter

https://gerrit.wikimedia.org/r/852769

daniel triaged this task as High priority. Nov 7 2022, 8:06 PM

Change 855974 had a related patch set uploaded (by Abijeet Patro; author: Abijeet Patro):

[mediawiki/services/parsoid@master] Add method to check if variant conversion is enabled for language

https://gerrit.wikimedia.org/r/855974

Here's the difference between the HTML generated by the core language converter and by the Parsoid library for a simple HTML string:

Core language converter

Input:

<p>Siltemeniñ astın sız:</p>

Output:

<html>

<head></head>

<body>
	<p>Сілтеменің астын сыз:</p>
</body>

</html>

Parsoid library

Input (via PageBundle::html):

<p>Ово је тестна страница</p>

Output:

<html>

<head>
    <meta http-equiv="content-language" content="sr-Latn" />
    <meta http-equiv="vary" content="Accept, Accept-Language" />
</head>

<body data-mw-variant-lang="sr-ec">
    <p data-mw-variant-lang="sr-ec">Ovo je testna stranica</p>
</body>

</html>
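One visible difference between the two outputs is that the Parsoid output carries a `content-language` meta tag and `data-mw-variant-lang` attributes, while the core output does not. A minimal Python sketch for pulling those values out of an HTML string, using only the standard library, could look like this. The class and function names are hypothetical and are not part of MediaWiki or Parsoid.

```python
from html.parser import HTMLParser


class VariantMetadataParser(HTMLParser):
    """Collect variant-related metadata from an HTML document."""

    def __init__(self):
        super().__init__()
        self.content_language = None  # value of the content-language meta tag
        self.variant_langs = []       # (tag, data-mw-variant-lang) pairs

    def handle_starttag(self, tag, attrs):
        # Also invoked for self-closing tags such as <meta ... />.
        attr_map = dict(attrs)
        http_equiv = (attr_map.get("http-equiv") or "").lower()
        if tag == "meta" and http_equiv == "content-language":
            self.content_language = attr_map.get("content")
        if "data-mw-variant-lang" in attr_map:
            self.variant_langs.append((tag, attr_map["data-mw-variant-lang"]))


def variant_metadata(html):
    """Return (content_language, variant_langs) for an HTML string."""
    parser = VariantMetadataParser()
    parser.feed(html)
    parser.close()
    return parser.content_language, parser.variant_langs
```

Running `variant_metadata` on both outputs above is one way to check whether a given response was annotated in the Parsoid style.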

Change 855974 merged by jenkins-bot:

[mediawiki/services/parsoid@master] Add method to check if variant conversion is implemented for language

https://gerrit.wikimedia.org/r/855974

Change 864839 had a related patch set uploaded (by Arlolra; author: Arlolra):

[mediawiki/vendor@master] Bump parsoid to 0.17.0-a8

https://gerrit.wikimedia.org/r/864839

Change 864839 merged by jenkins-bot:

[mediawiki/vendor@master] Bump parsoid to 0.17.0-a8

https://gerrit.wikimedia.org/r/864839

Change 852769 merged by jenkins-bot:

[mediawiki/core@master] LanguageVariantConverter: Add fallback to core LanguageConverter

https://gerrit.wikimedia.org/r/852769

Change 884358 had a related patch set uploaded (by C. Scott Ananian; author: C. Scott Ananian):

[mediawiki/services/parsoid@master] Fix LanguageConverter::implementsLanguageConversion(); use Bcp47Code

https://gerrit.wikimedia.org/r/884358

Change 884358 merged by jenkins-bot:

[mediawiki/services/parsoid@master] Fix LanguageConverter::implementsLanguageConversion(); use Bcp47Code

https://gerrit.wikimedia.org/r/884358

Change 885031 had a related patch set uploaded (by C. Scott Ananian; author: C. Scott Ananian):

[mediawiki/vendor@master] Bump wikimedia/parsoid to 0.17.0-a13

https://gerrit.wikimedia.org/r/885031

Change 885031 merged by jenkins-bot:

[mediawiki/vendor@master] Bump wikimedia/parsoid to 0.17.0-a13

https://gerrit.wikimedia.org/r/885031