Add tests to compare wikitext to HTML output between the REST API and Parser
Closed, Resolved, Public

Description

Write tests that compare the wikitext-to-HTML output, with variant conversion applied, of the REST API against that of the Parser.

Inputs can be retrieved from a file, and the output from one can be compared with the other.
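
A minimal sketch of that idea, in Python for illustration only (the actual tests live in MediaWiki core and are written in PHP). The JSON layout, the field names `wikitext` and `target`, and the two renderer callables are all hypothetical; the task does not specify an input format.

```python
import json
from typing import Callable, Dict, List

# Hypothetical renderer signature: (wikitext, target_variant) -> HTML
Renderer = Callable[[str, str], str]


def load_cases(raw: str) -> List[Dict[str, str]]:
    """Parse a JSON array of test cases (e.g. the contents of an input file)."""
    cases = json.loads(raw)
    for case in cases:
        missing = {"wikitext", "target"} - case.keys()
        if missing:
            raise ValueError(f"case is missing fields: {sorted(missing)}")
    return cases


def compare_cases(cases, render_a: Renderer, render_b: Renderer):
    """Return the cases for which the two renderers disagree."""
    mismatches = []
    for case in cases:
        out_a = render_a(case["wikitext"], case["target"])
        out_b = render_b(case["wikitext"], case["target"])
        if out_a != out_b:
            mismatches.append({**case, "a": out_a, "b": out_b})
    return mismatches
```

An exact string comparison like this is only a starting point: as the samples in this task show, the two outputs differ heavily in markup even when the converted text agrees.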

Event Timeline

abi_ changed the task status from Open to In Progress. Dec 9 2022, 4:11 PM
abi_ claimed this task.
abi_ triaged this task as Medium priority.

Change 866731 had a related patch set uploaded (by Abijeet Patro; author: Abijeet Patro):

[mediawiki/core@master] tests: compare html output between Parser and HtmlOutputRendererHelper

https://gerrit.wikimedia.org/r/866731

As part of T318401, we added a fallback to the core language variant converter in case variant conversion is not supported by Parsoid.

if ( !$this->parsoid->implementsLanguageConversion( $pageConfig, $targetVariantCode ) ) {
	$baseLanguage = $this->languageFactory->getParentLanguage( $targetVariantCode );
	$languageConverter = $this->languageConverterFactory->getLanguageConverter( $baseLanguage );

	$convertedHtml = $languageConverter->convertTo( $pageBundle->html, $targetVariantCode );

	// Hack: Pass the HTML to parsoid for variant conversion in order to add metadata that is
	// missing when we use the core LanguageConverter directly.

	// Replace the original page bundle, so Parsoid gets the converted HTML as input.
	$pageBundle = new PageBundle(
		$convertedHtml,
		[],
		[],
		$pageBundle->version,
		[ 'content-language' => $targetVariantCode ]
	);
}

$modifiedPageBundle = $this->parsoid->pb2pb(
	$pageConfig, 'variant', $pageBundle,
	[
		'variant' => [
			'source' => $sourceVariantCode,
			'target' => $targetVariantCode,
		]
	]
);

See: GitHub code browser

Here are the differences noticed with some simple inputs:

Sample #1 - Parser vs Parsoid with fallback to language variant converter

  • Wikitext: == Сілтеме астын сызу ==
  • Source: kk
  • Target: kk-latn - not supported by inbuilt Parsoid variant conversion

Parser

<h2><span
		id=".D0.A1.D1.96.D0.BB.D1.82.D0.B5.D0.BC.D0.B5_.D0.B0.D1.81.D1.82.D1.8B.D0.BD_.D1.81.D1.8B.D0.B7.D1.83"></span><span
		class="mw-headline" id="Сілтеме_астын_сызу">Silteme astın sızw</span>
	<mw:editsection page="MediaWiki\Parser\Parsoid\LanguageVariantConversionOutputTest::testCompareParserOutput"
		section="1">Silteme astın sızw</mw:editsection>
</h2>

Parsoid with fallback to language variant converter

<!DOCTYPE html>\n
<html prefix="dc: http://purl.org/dc/terms/ mw: http://mediawiki.org/rdf/"
	about="http://192.168.72.98/mediawiki/index.php/Special:Redirect/revision/2">

<head prefix="mwr: http://192.168.72.98/mediawiki/index.php/Special:Redirect/">
	<meta charset="utf-8" />
	<meta property="mw:pageId" content="2" />
	<meta property="mw:pageNamespace" content="0" />
	<link rel="dc:replaces" resource="mwr:revision/0" />
	<meta property="mw:revisionSHA1" content="6a16449be05d6165c620c6a66c3c8813fff4c75b" />
	<meta property="dc:modified" content="2022-12-15T06:47:01.000Z" />
	<meta property="mw:htmlVersion" content="2.7.0" />
	<meta property="mw:html:version" content="2.7.0" />
	<link rel="dc:isVersionOf"
		href="http://192.168.72.98/mediawiki/index.php/MediaWiki%5CParser%5CParsoid%5CLanguageVariantConversionOutputTest%3A%3AtestCompareParserOutput" />
	<base href="http://192.168.72.98/mediawiki/index.php/" />
	<title>MediaWiki\Parser\Parsoid\LanguageVariantConversionOutputTest::testCompareParserOutput</title>
	<link rel="stylesheet"
		href="/mediawiki/load.php?lang=kk&amp;modules=mediawiki.skinning.content.parsoid%7Cmediawiki.skinning.interface%7Csite.styles&amp;only=styles&amp;skin=vector" />
	<meta http-equiv="content-language" content="kk-Latn" />
	<meta http-equiv="vary" content="Accept, Accept-Language" />
</head>

<body id="mwAA" lang="kk" class="mw-content-ltr sitedir-ltr ltr mw-body-content parsoid-body mediawiki mw-parser-output"
	dir="ltr">
	<section data-mw-section-id="0" id="mwAQ"></section>
	<section data-mw-section-id="1" id="mwAg">
		<h2 id="Сілтеме_астын_сызу"><span
				id=".D0.A1.D1.96.D0.BB.D1.82.D0.B5.D0.BC.D0.B5_.D0.B0.D1.81.D1.82.D1.8B.D0.BD_.D1.81.D1.8B.D0.B7.D1.83"
				typeof="mw:FallbackId"></span>Silteme astın sızw</h2>
	</section>
</body>

</html>

Sample #2 - Parser vs Parsoid with inbuilt language variant converter

  • Wikitext: == Hello world ==
  • Source: en
  • Target: en-x-piglatin - supported by inbuilt Parsoid variant conversion

Parser

<h2><span class="mw-headline" id="Hello_world">Ellohay orldway</span>
	<mw:editsection page="MediaWiki\Parser\Parsoid\LanguageVariantConversionOutputTest::testCompareParserOutput"
		section="1">Ellohay orldway</mw:editsection>
</h2>

Parsoid with inbuilt language variant converter

<!DOCTYPE html>\n
<html prefix="dc: http://purl.org/dc/terms/ mw: http://mediawiki.org/rdf/"
	about="http://192.168.72.98/mediawiki/index.php/Special:Redirect/revision/3">

<head prefix="mwr: http://192.168.72.98/mediawiki/index.php/Special:Redirect/">
	<meta charset="utf-8" />
	<meta property="mw:pageId" content="2" />
	<meta property="mw:pageNamespace" content="0" />
	<link rel="dc:replaces" resource="mwr:revision/2" />
	<meta property="mw:revisionSHA1" content="ba1aa2e80bd448e6f17e3403869122b3411b4a14" />
	<meta property="dc:modified" content="2022-12-15T06:47:02.000Z" />
	<meta property="mw:htmlVersion" content="2.7.0" />
	<meta property="mw:html:version" content="2.7.0" />
	<link rel="dc:isVersionOf"
		href="http://192.168.72.98/mediawiki/index.php/MediaWiki%5CParser%5CParsoid%5CLanguageVariantConversionOutputTest%3A%3AtestCompareParserOutput" />
	<base href="http://192.168.72.98/mediawiki/index.php/" />
	<title>MediaWiki\Parser\Parsoid\LanguageVariantConversionOutputTest::testCompareParserOutput</title>
	<link rel="stylesheet"
		href="/mediawiki/load.php?lang=en&amp;modules=mediawiki.skinning.content.parsoid%7Cmediawiki.skinning.interface%7Csite.styles&amp;only=styles&amp;skin=vector" />
	<meta http-equiv="content-language" content="en-x-piglatin" />
	<meta http-equiv="vary" content="Accept, Accept-Language" />
</head>

<body id="mwAA" lang="en" class="mw-content-ltr sitedir-ltr ltr mw-body-content parsoid-body mediawiki mw-parser-output"
	dir="ltr" data-mw-variant-lang="en">
	<section data-mw-section-id="0" id="mwAQ"></section>
	<section data-mw-section-id="1" id="mwAg">
		<h2 id="Hello_world">Ellohay <span typeof="mw:LanguageVariant"
				data-mw-variant='{"twoway":[{"l":"en","t":"world"},{"l":"en-x-piglatin","t":"orldway"}],"rt":true}'>orldway</span>
		</h2>
	</section>
</body>

</html>

In this limited test, the primary difference between the inbuilt and fallback conversion paths in LanguageVariantConverter.php is the presence of a <span typeof="mw:LanguageVariant" ...></span> wrapper in the output of the inbuilt converter.
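
To check for that difference mechanically, something like the following could be used. This is a Python sketch with hypothetical names (`VariantSpanFinder`, `find_variant_spans`), not part of the patch; it assumes the wrapper always looks like the one in Sample #2, i.e. a span whose typeof includes mw:LanguageVariant and whose data-mw-variant attribute holds JSON.

```python
import json
from html.parser import HTMLParser


class VariantSpanFinder(HTMLParser):
    """Collect the data-mw-variant payload of every mw:LanguageVariant span."""

    def __init__(self):
        super().__init__()
        self.payloads = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        # typeof is treated as a space-separated list of types.
        types = (attr.get("typeof") or "").split()
        if tag == "span" and "mw:LanguageVariant" in types:
            raw = attr.get("data-mw-variant")
            self.payloads.append(json.loads(raw) if raw else None)


def find_variant_spans(html):
    """Return one decoded payload (or None) per wrapper span found in html."""
    finder = VariantSpanFinder()
    finder.feed(html)
    finder.close()
    return finder.payloads
```

An empty result for the fallback output and a non-empty one for the inbuilt output would match what the two samples above show.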

CC: @ssastry, @cscott

@ssastry, @cscott: any idea why the inbuilt variant converter in Parsoid wraps some content in a span like so:

<h2 id="Hello_world">Ellohay <span typeof="mw:LanguageVariant"
				data-mw-variant='{"twoway":[{"l":"en","t":"world"},{"l":"en-x-piglatin","t":"orldway"}],"rt":true}'>orldway</span>

Also wondering why the span (span typeof="mw:LanguageVariant") was wrapped around only a part of the text rather than the entire string?

Change 866731 abandoned by Abijeet Patro:

[mediawiki/core@master] tests: compare html output between Parser and HtmlOutputRendererHelper

Reason:

In favor of Idab6ee7e6e832b9937e2bd6f1bf64027ad23b668

https://gerrit.wikimedia.org/r/866731

Change 876188 had a related patch set uploaded (by Abijeet Patro; author: Abijeet Patro):

[mediawiki/core@master] Add script to compare output between Parser and Parsoid

https://gerrit.wikimedia.org/r/876188

I've added a script to compare output between Parser and Parsoid.

The script fetches the words from the output and compares them to see the differences.
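
The approach can be sketched roughly as follows. This is an illustrative Python version, not the PHP maintenance script itself; the helper names (`words`, `compare_words`) and the choice to drop tags without inserting spaces are assumptions.

```python
from html.parser import HTMLParser
from itertools import zip_longest


class _TextExtractor(HTMLParser):
    """Collect text nodes, ignoring tags and non-content elements."""

    SKIPPED = {"script", "style", "title"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def words(html):
    """Strip markup and split the remaining text on whitespace."""
    extractor = _TextExtractor()
    extractor.feed(html)
    extractor.close()
    # Tags are dropped without adding spaces, so adjacent text runs fuse.
    return "".join(extractor.chunks).split()


def compare_words(left, right):
    """Pair words by position and label each row OK or Different."""
    rows = []
    for n, (a, b) in enumerate(zip_longest(left, right, fillvalue=""), start=1):
        rows.append((n, a, b, "OK" if a == b else "Different"))
    return rows
```

Because words are paired purely by position, one extra or missing word on either side shifts every later row; this sketch makes no attempt to realign.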

I've used a simple page with some simple wikitext:

<strong>MediaWiki has been installed.</strong>

Consult the [https://www.mediawiki.org/wiki/Special:MyLanguage/Help:Contents User's Guide] for information on using the wiki software!!

== Getting started ==
* [https://www.mediawiki.org/wiki/Special:MyLanguage/Manual:Configuration_settings Configuration settings list]
* [https://www.mediawiki.org/wiki/Special:MyLanguage/Manual:FAQ MediaWiki FAQ]
* [https://lists.wikimedia.org/postorius/lists/mediawiki-announce.lists.wikimedia.org/ MediaWiki release mailing list]
* [https://www.mediawiki.org/wiki/Special:MyLanguage/Localisation#Translation_resources Localise MediaWiki for your language]
* [https://www.mediawiki.org/wiki/Special:MyLanguage/Manual:Combating_spam Learn how to combat spam on your wiki]

[[Special:MyLanguage/T62544#This_is_another_heading]]

I then ran the script on this page:

php maintenance/run.php compareHtmlParsingTest Main_Page en-x-piglatin --show-all

Word count - Parsoid: 40, Parser: 40

----------------------------------------------------------------------------------------------------
|  Line | Parsoid                             | Parser                              | Difference   |
----------------------------------------------------------------------------------------------------
|     1 | MediaWiki                           | EdiamayIkiway                       | Different    |
|     2 | ashay                               | ashay                               | OK           |
|     3 | eenbay                              | eenbay                              | OK           |
|     4 | installedway.                       | installedway.                       | OK           |
|     5 | Onsultcay                           | Onsultcay                           | OK           |
|     6 | ethay                               | ethay                               | OK           |
|     7 | User'sway                           | User'sway                           | OK           |
|     8 | Uidegay                             | Uidegay                             | OK           |
|     9 | orfay                               | orfay                               | OK           |
|    10 | informationway                      | informationway                      | OK           |
|    11 | onway                               | onway                               | OK           |
|    12 | usingway                            | usingway                            | OK           |
|    13 | ethay                               | ethay                               | OK           |
|    14 | ikiway                              | ikiway                              | OK           |
|    15 | oftwaresay!!                        | oftwaresay!!                        | OK           |
|    16 | Ettinggay                           | Ettinggay                           | OK           |
|    17 | artedstay                           | artedstay[edit]                     | Different    |
|    18 | Onfigurationcay                     | Onfigurationcay                     | OK           |
|    19 | ettingssay                          | ettingssay                          | OK           |
|    20 | istlay                              | istlay                              | OK           |
|    21 | MediaWiki                           | EdiamayIkiway                       | Different    |
|    22 | FAQ                                 | FAQ                                 | OK           |
|    23 | MediaWiki                           | EdiamayIkiway                       | Different    |
|    24 | eleaseray                           | eleaseray                           | OK           |
|    25 | ailingmay                           | ailingmay                           | OK           |
|    26 | istlay                              | istlay                              | OK           |
|    27 | Ocaliselay                          | Ocaliselay                          | OK           |
|    28 | MediaWiki                           | EdiamayIkiway                       | Different    |
|    29 | orfay                               | orfay                               | OK           |
|    30 | ouryay                              | ouryay                              | OK           |
|    31 | anguagelay                          | anguagelay                          | OK           |
|    32 | Earnlay                             | Earnlay                             | OK           |
|    33 | owhay                               | owhay                               | OK           |
|    34 | otay                                | otay                                | OK           |
|    35 | ombatcay                            | ombatcay                            | OK           |
|    36 | amspay                              | amspay                              | OK           |
|    37 | onway                               | onway                               | OK           |
|    38 | ouryay                              | ouryay                              | OK           |
|    39 | ikiway                              | ikiway                              | OK           |
|    40 | Ecialspay:MyLanguage/T62544#Isthay… | Ecialspay:YmayAnguagelay/T62544#Is… | Different    |
----------------------------------------------------------------------------------------------------

Some more testing is needed with more complicated input.

Change 876188 merged by jenkins-bot:

[mediawiki/core@master] Add script to compare output between Parser and Parsoid

https://gerrit.wikimedia.org/r/876188

Change 880929 had a related patch set uploaded (by Abijeet Patro; author: Abijeet Patro):

[mediawiki/core@master] Change the way we determine the diff for changed words

https://gerrit.wikimedia.org/r/880929

Change 880929 merged by jenkins-bot:

[mediawiki/core@master] compareLanguageConverterOutput: Use Diff and ArrayDiffFormatter

https://gerrit.wikimedia.org/r/880929
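
For illustration, a diff-based comparison of two word lists might look like the sketch below. It uses Python's difflib as a stand-in for MediaWiki's Diff and ArrayDiffFormatter classes, so it shows the general idea rather than the merged implementation; `diff_words` is a hypothetical name.

```python
from difflib import SequenceMatcher


def diff_words(left, right):
    """Return only the non-matching regions between two word lists.

    Each entry is (operation, words_from_left, words_from_right), where
    operation is one of "replace", "delete" or "insert".
    """
    matcher = SequenceMatcher(a=left, b=right, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            changes.append((tag, left[i1:i2], right[j1:j2]))
    return changes
```

Unlike a row-by-row comparison, a diff realigns after an inserted or deleted word, so a single extra word on one side is reported once instead of marking every following word as different.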

@ssastry, @cscott: any idea why the inbuilt variant converter in Parsoid wraps some content in a span like so:

<h2 id="Hello_world">Ellohay <span typeof="mw:LanguageVariant"
				data-mw-variant='{"twoway":[{"l":"en","t":"world"},{"l":"en-x-piglatin","t":"orldway"}],"rt":true}'>orldway</span>

Also wondering why the span (span typeof="mw:LanguageVariant") was wrapped around only a part of the text rather than the entire string?

I was digging around some code, ran into this Phab task, and noticed a few unanswered questions. For the sake of completeness, I am responding here now, even though the main task is unrelated to the output found by the script and has been resolved.

See https://www.mediawiki.org/wiki/Specs/HTML/2.8.0#Language_conversion_blocks for details. Parsoid adds that span wrapper with a typeof to identify language variant markup that needs to be processed by the language converter (only applicable to Parsoid-supported language converters).
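
As a rough illustration of how a consumer could read that metadata, the Python sketch below looks up the text for one variant in a data-mw-variant value. It handles only the "twoway" shape seen in the sample on this task; the linked spec defines other shapes that are not covered here, and `text_for_variant` is a hypothetical helper.

```python
import json


def text_for_variant(data_mw_variant, variant):
    """Return the text listed for `variant` in a "twoway" payload, else None."""
    payload = json.loads(data_mw_variant)
    for rule in payload.get("twoway", []):
        # Each rule pairs a language code ("l") with its text ("t").
        if rule.get("l") == variant:
            return rule.get("t")
    return None
```

With the attribute value from the sample, asking for en returns "world" and asking for en-x-piglatin returns "orldway".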