VisualEditor may add excessive LanguageConverter tags since 1.46.0-wmf.17
Closed, Resolved · Public · BUG REPORT

Description

Steps to replicate the issue (include links if applicable):

(Feel free to replace this with a real reproduction)

Example revisions:
91676183, 91681581, 91681790, 91681746, 91681228, 91680448, 91682546, 91682559, 91681679, 91683678, 91686264

What happens?:

LanguageConverter tags (-{}-) are sometimes sprinkled in the wikitext diff. It started with deployment of 1.46.0-wmf.17 to group2 wikis. I suspect it's related to T380517: Make Parsoid language conversion into an OutputTransform pass.

A user suspects it results from pasting content into VE (link).

What should have happened instead?:

These ugly diffs should not happen.

Software version: 1.46.0-wmf.17

Other information (browser name/version, screenshots, etc.):

Should this be a subtask of T413808: 1.46.0-wmf.17 deployment blockers or T418006: CTT tasks week of 2026-02-20?

Event Timeline

Bewfip triaged this task as Unbreak Now! priority. Feb 27 2026, 5:53 AM

Raising the priority as I believe it's as annoying as T411238#11415736. Sorry for the extra housekeeping work if I get it wrong.

This looks like the output of a cut-and-paste from Parsoid's 'old' language converter implementation, which shouldn't be enabled on zhwiki by default. This isn't the output of the 'new' language converter implementation from T380517: Make Parsoid language conversion into an OutputTransform pass. Somehow Visual Editor is getting language-converted output now for zhwiki?

This doesn't seem to be a wmf.17 issue: we have found affected edits from before wmf.17 rolled out to zhwiki at 2026-02-26T19:09:31Z, for example:

We're not certain whether this is a parsoid issue, a core issue, or something else (visual editor, an experiment, ?). It's been hard to reproduce, but we haven't found any instances outside of zhwiki. It seems like it might affect certain users more than others.

The most plausible explanation is that Visual Editor's action=visualeditor&paction=parse request is sometimes setting the htmlVariantLanguage field when invoking Parsoid, so that it is getting variant-converted HTML in response instead of "canonical" HTML (which is what should be used for editing). What is unknown is what changed recently to cause this.

Sorry for making the wrong guess on the background (the title should be corrected later).

I noticed that all the tags VE adds are of the form -{zh-hans: original; zh-xx: converted}- where xx is the user's variant. The "zh-hans" part contains zh-hant characters when the user is using zh-cn.
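
The tag shape described above can be matched mechanically. A minimal sketch in Python, assuming the suspect tags always start with -{zh-hans: and carry exactly one further zh-xx: rule. The pattern and the function name are illustrative only; this is not the regular expression used by the Parsoid cleanup script.

```python
import re

# Hypothetical pattern: -{zh-hans: <original>; zh-<variant>: <converted>}-
VE_TAG_RE = re.compile(
    r"-\{\s*zh-hans\s*:(?P<original>[^;{}]*);\s*"
    r"zh-(?P<variant>[a-z]+)\s*:(?P<converted>[^;{}]*);?\s*\}-"
)


def find_suspect_tags(wikitext: str) -> list[tuple[str, str, str]]:
    """Return (variant, original, converted) for each tag of the suspect shape."""
    return [
        (m["variant"], m["original"].strip(), m["converted"].strip())
        for m in VE_TAG_RE.finditer(wikitext)
    ]
```

Tags that do not begin with -{zh-hans: are ignored by this sketch, so it would skip tags of other shapes that editors added by hand.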

Change #1245433 had a related patch set uploaded (by C. Scott Ananian; author: C. Scott Ananian):

[mediawiki/services/parsoid@master] LanguageConverter: ensure zh language converter is actually disabled

https://gerrit.wikimedia.org/r/1245433

OK, I believe I've figured this out.

The first surprising finding was that we never actually managed to disable the zh language converter in production (T346657: Requests originating from zhwiki wikifeeds caused parsoid outage).  The first patch should fix that, and that would be sufficient to avoid the dirty diffs.

But there's still the root cause issue which is that visual editor was returning read view (language converted) html instead of canonical html.  This seems to be a cache corruption error, which is why it was so hard to reproduce.

Using a ?uselang=.... URL with Parsoid Read Views enabled would do Parsoid-side language conversion on the output via the "temporary hack" code in ParsoidParser:

			// TEMPORARY HACK
			if ( $options->getRenderReason() === 'page_view' || $options->getRenderReason() === 'page_view_old' ) {
				$langFactory = MediaWikiServices::getInstance()->getLanguageFactory();
				$lang = $langFactory->getLanguage( $langCode );
				$langConv = $this->languageConverterFactory->getLanguageConverter( $lang );
				$htmlVariantLanguage = $langFactory->getLanguage( $langConv->getPreferredVariant() );
			} else {

This is related to T267067: Make language variant a parser option in that the LanguageConverter::getPreferredVariant() call in the penultimate line is reading the request URL (a global) to act on the uselang=... and switch the variant without it being recorded in the parser options and hence not included in the parser cache key.

The REST API gets around this with the following code in HtmlOutputRendererHelper:

	private function getParserOutput(): ParserOutput {
		if ( !$this->parserOutput ) {
			$this->parserOptions->setRenderReason( __METHOD__ );

			$defaultLanguage = $this->getDefaultPageLanguage();

			if ( $this->pageLanguage
				&& $this->pageLanguage->toBcp47Code() !== $defaultLanguage->toBcp47Code()
			) {
				$languageObj = $this->languageFactory->getLanguage( $this->pageLanguage );
				$this->parserOptions->setTargetLanguage( $languageObj );
				// Ensure target language splits the parser cache, when
				// non-default; targetLanguage is not in
				// ParserOptions::$cacheVaryingOptionsHash for the legacy
				// parser.
				$this->parserOptions->addExtraKey( 'target=' . $languageObj->getCode() );
			}

But note that this extra key is only added /if the target language is not the default/.

So when VisualEditor (correctly) asks for the zh version (no variant) of the page, the extra key isn't added to the parser options and we can end up (incorrectly) fetching the corrupted zh-cn or other language-converted version from the Parsoid parser cache.

This can only happen if the language-converted versions make it into the parsoid parser cache, which only happens if users are using Parsoid Read Views on zhwiki with a variant specified (I think it could either be in their user preferences or in ?uselang=).

So this bug didn't show up until Reader Growth Team enrolled a bunch of zhwiki users in Parsoid Read Views as part of their mobile TOC experiment. There were two different communications failures here -- they had listed the affected wikis as "(ar, cn, en, fr, id, vt)" which we didn't flag as an issue because of confusion between CN (the country code for Mainland China) and ZH (the language code for the Chinese Language).

In addition, the experiment was intended for logged-out mobile users only, and Content-Transform-Team was confusing the mobile *app* (which has used Parsoid content for a long time) and the mobile *web* (which isn't fully transitioned to Parsoid).

Finally, the Reader Growth Team actually included 1.5k logged-in users in the mobile TOC experiment, which caused some momentary confusion because the vast majority of the problematic edits we were examining were from logged-in users. Now that the root cause is better understood, the logged-out users were likely contributing to the problem as well, since either could put variant-converted output in the cache.  Logged-out users would have had to visit a ?uselang=... URL or a zh.wikipedia.org/zh-{cn,tw,...}/... URL to trigger conversion, though, since they wouldn't have a variant in their "user preferences", but many page views use one of those forms.

The short term fix is to pause the mobile TOC experiment on zhwiki. Removing the old (pre-T380517: Make Parsoid language conversion into an OutputTransform pass) Parsoid language converter support is another protection. We should add a parser option to differentiate the cache when variant conversion is performed, but that will probably be done as part of T267067: Make language variant a parser option. We are also running a script on recent changes to identify pages which were affected by this bug in order to assist zhwiki in the cleanup effort.

The short term fix is to pause the mobile TOC experiment on zhwiki.

zhwiki was removed from the experiment's target wikis as of an hour ago – the experiment is off zhwiki for the remainder of the experiment

Change #1245446 had a related patch set uploaded (by C. Scott Ananian; author: C. Scott Ananian):

[mediawiki/services/parsoid@master] LanguageConverter: ensure zh language converter is actually disabled (take 2)

https://gerrit.wikimedia.org/r/1245446

Change #1245465 had a related patch set uploaded (by C. Scott Ananian; author: C. Scott Ananian):

[mediawiki/core@master] Ensure that Parsoid canonical HTML is not language converted

https://gerrit.wikimedia.org/r/1245465

The cache corruption is slightly different from what I thought -- we were in fact adding the variant to the parser cache key, via ParserOptions::optionsHash(). The problem was that we were doing so *even in the visual editor case* where we actually wanted the *non* converted output. So if the cache was empty, VE would generate canonical HTML correctly, *but store it under the wrong key*; and if a previous Parsoid Read View had stored the language-converted output correctly under the right key, VE would *incorrectly retrieve it* when it wanted the canonical output instead.
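
The corrected picture can be modelled the same way. The following Python sketch is a toy, not MediaWiki code, and every name in it is hypothetical. It assumes the user's variant always goes into the key, whether or not conversion is applied. The vary_on_convert flag illustrates the idea of differentiating the cache when conversion is performed; it is not a description of what the merged patch does.

```python
# Toy model, not MediaWiki code: all names here are hypothetical.
def options_hash(page: str, user_variant: str, convert: bool,
                 vary_on_convert: bool) -> str:
    """Build a cache key; the variant is included even for editor requests."""
    key = f"{page}!variant={user_variant}"
    if vary_on_convert:
        # Possible remedy: also record whether conversion is applied.
        key += f"!converted={int(convert)}"
    return key


def fetch(cache: dict, page: str, user_variant: str, convert: bool,
          vary_on_convert: bool = False) -> str:
    """Return cached output for the key, rendering and storing it on a miss."""
    key = options_hash(page, user_variant, convert, vary_on_convert)
    if key not in cache:
        cache[key] = f"{page}:{user_variant}" if convert else f"{page}:canonical"
    return cache[key]
```

With vary_on_convert left off, a read view (convert=True) followed by an editor request (convert=False) for the same variant hands the editor the converted output, and the reverse order hands the read view canonical output. With it on, the two requests no longer share a key.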

Gerrit change 1245465 (Ensure that Parsoid canonical HTML is not language converted) is a fix.

Change #1245472 had a related patch set uploaded (by Subramanya Sastry; author: Subramanya Sastry):

[mediawiki/services/parsoid@master] Fix the RE used to detect lang-variant corruption

https://gerrit.wikimedia.org/r/1245472

Once we backport scott's patch and we stop further dirty diffs, I'll upload a list of all pages with dirty diffs, so those pages can be appropriately fixed up. As of this moment, we have about 111, with the first one from Feb 24th.

Change #1245477 had a related patch set uploaded (by C. Scott Ananian; author: C. Scott Ananian):

[mediawiki/core@wmf/1.46.0-wmf.17] Ensure that Parsoid canonical HTML is not language converted

https://gerrit.wikimedia.org/r/1245477

Change #1245472 merged by jenkins-bot:

[mediawiki/services/parsoid@master] Fix the RE used to detect lang-variant corruption

https://gerrit.wikimedia.org/r/1245472

Change #1245477 merged by jenkins-bot:

[mediawiki/core@wmf/1.46.0-wmf.17] Ensure that Parsoid canonical HTML is not language converted

https://gerrit.wikimedia.org/r/1245477

Mentioned in SAL (#wikimedia-operations) [2026-02-27T22:53:41Z] <cscott@deploy2002> Started scap sync-world: Backport for [[gerrit:1245477|Ensure that Parsoid canonical HTML is not language converted (T418549)]]

Mentioned in SAL (#wikimedia-operations) [2026-02-27T22:55:29Z] <cscott@deploy2002> cscott: Backport for [[gerrit:1245477|Ensure that Parsoid canonical HTML is not language converted (T418549)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.

Mentioned in SAL (#wikimedia-operations) [2026-02-27T23:02:29Z] <cscott@deploy2002> Finished scap sync-world: Backport for [[gerrit:1245477|Ensure that Parsoid canonical HTML is not language converted (T418549)]] (duration: 08m 47s)

Change #1245465 merged by jenkins-bot:

[mediawiki/core@master] Ensure that Parsoid canonical HTML is not language converted

https://gerrit.wikimedia.org/r/1245465

Deployed 1245465 to production, which should ensure that Visual Editor doesn't pick up any more language-converted HTML.

Bewfip lowered the priority of this task from Unbreak Now! to High.

Once we backport scott's patch and we stop further dirty diffs, I'll upload a list of all pages with dirty diffs, so those pages can be appropriately fixed up. As of this moment, we have about 111, with the first one from Feb 24th.

Thanks! Lately there is a loose filter zh:Special:AbuseFilter/394 that tracks large additions of LC tags. The filter has not been hit for a few hours, therefore I believe the issue has stopped.
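
A loose check of this kind can be approximated offline. A minimal sketch in Python, assuming access to the wikitext before and after an edit; the function names and the threshold are hypothetical, and this is not the rule used by AbuseFilter 394.

```python
def added_lc_tags(old_wikitext: str, new_wikitext: str) -> int:
    """Net increase in '-{' openers between two revisions (never negative)."""
    return max(0, new_wikitext.count("-{") - old_wikitext.count("-{"))


def looks_like_dirty_diff(old_wikitext: str, new_wikitext: str,
                          threshold: int = 5) -> bool:
    """Flag edits that add at least `threshold` LanguageConverter tags."""
    return added_lc_tags(old_wikitext, new_wikitext) >= threshold
```

Counting openers is deliberately crude: it cannot tell tags added by hand from tags added by the editor, so any hit would still need a manual look at the diff.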

has the complete list of URLs (diffs) that were dirtied by this issue.

The LC tag in the first revision (91658373) is likely manually added by the user (it doesn't start with -{zh-hans:). I will mention the list in the wiki.

Change #1245433 merged by jenkins-bot:

[mediawiki/services/parsoid@master] LanguageConverter: ensure zh language converter is actually disabled

https://gerrit.wikimedia.org/r/1245433

Change #1247655 had a related patch set uploaded (by C. Scott Ananian; author: C. Scott Ananian):

[mediawiki/vendor@master] Bump wikimedia/parsoid to 0.23.0-a20

https://gerrit.wikimedia.org/r/1247655

Change #1247655 abandoned by C. Scott Ananian:

[mediawiki/vendor@master] Bump wikimedia/parsoid to 0.23.0-a20

Reason:

Fails serialization compatibility tests

https://gerrit.wikimedia.org/r/1247655

Change #1245446 merged by jenkins-bot:

[mediawiki/services/parsoid@master] LanguageConverter: ensure zh language converter is actually disabled (take 2)

https://gerrit.wikimedia.org/r/1245446

Change #1259169 had a related patch set uploaded (by Jgiannelos; author: Jgiannelos):

[mediawiki/vendor@master] Bump wikimedia/parsoid to 0.23.0-a23

https://gerrit.wikimedia.org/r/1259169

Change #1259169 merged by jenkins-bot:

[mediawiki/vendor@master] Bump wikimedia/parsoid to 0.23.0-a23

https://gerrit.wikimedia.org/r/1259169