
Parsoid tokenization breaks JSON strings with embedded HTML
Closed, Declined · Public · BUG REPORT

Description

Steps to replicate the issue (include links if applicable):

  • Use Parsoid to tokenize this raw wikitext string (no templates needed): {"date":"October 20, 2023 - 10:00 <abbr data-tz=\"-07:00\" title=\"Pacific Daylight Time (UTC-7)\">PDT<\/abbr>"}
  • Run the wikitext-to-expanded-tokens pipeline.
  • Inspect the resulting token stream for the <abbr> tag and its attributes.
$input = '{"date":"October 20, 2023 - 10:00 <abbr data-tz=\\"-07:00\\" ' .
    'title=\\"Pacific Daylight Time (UTC-7)\\">PDT<\/abbr>"}';

$siteConfig = new MockSiteConfig( [] );
$pageConfig = new MockPageConfig( $siteConfig, [], null );
$env = new MockEnv( [
    'pageConfig' => $pageConfig,
    'siteConfig' => $siteConfig,
    'pageContent' => $input,
] );
$title = Title::newFromText( 'Test', $env->getSiteConfig() );
$frame = new Frame( $title, $env, [], new SourceString( $input ) );
$tokens = PipelineUtils::processContentInPipeline(
    $env,
    $frame,
    $input,
    [
        'pipelineType' => 'wikitext-to-expanded-tokens',
        'pipelineOpts' => [
            'expandTemplates' => false,
            'inTemplate' => true,
            'extTag' => null
        ],
        'sol' => true,
        'srcText' => $input,
        'srcOffsets' => new SourceRange( 0, strlen( $input ) ),
        'tplArgs' => [],
        'toplevel' => false,
    ]
);

What happens?:

  • Parsoid tokenizes <abbr> inside the JSON as HTML.
  • The <abbr> start tag becomes a TagTk, while the escaped </abbr> stays inside a string token.
  • The <abbr> token’s attributes are split incorrectly, and the JSON string is no longer intact.
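The split is easy to see by dumping each token's type (a quick debugging sketch; `$tokens` is the array returned by the pipeline call above, and plain text appears as string tokens in Parsoid's token streams):

```php
// Sketch: print the type of each token in the stream.
// String entries are literal text; object entries are Parsoid Token
// subclasses such as TagTk / EndTagTk.
foreach ( $tokens as $i => $t ) {
    echo $i, ': ', is_string( $t ) ? var_export( $t, true ) : get_class( $t ), "\n";
}
```

In our runs this shows a `TagTk` for `<abbr>` followed by string fragments of the broken JSON, with no matching `EndTagTk`.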

Tokens:

[screenshot: image.png (296×705 px, 36 KB)]

Attributes in <abbr> start tag token:

[screenshot: image.png (796×718 px, 86 KB)]

What should have happened instead?:

  • Either the JSON string should remain a single literal text token (i.e., not parsed as HTML), or
  • If HTML parsing inside JSON is intended, the <abbr> tag should be fully tokenized as a proper start/end tag pair (TagTk + EndTagTk) without splitting the JSON string.
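For reference, the second outcome would correspond to a token shape roughly like this (a hand-written sketch, not actual Parsoid output; the constructor signatures for `TagTk`, `EndTagTk`, and `KV` are assumed from their usual usage):

```php
use Wikimedia\Parsoid\Tokens\TagTk;
use Wikimedia\Parsoid\Tokens\EndTagTk;
use Wikimedia\Parsoid\Tokens\KV;

// Expected: a matched start/end tag pair with intact attributes,
// and the surrounding JSON text preserved as plain string tokens.
$expected = [
    '{"date":"October 20, 2023 - 10:00 ',
    new TagTk( 'abbr', [
        new KV( 'data-tz', '-07:00' ),
        new KV( 'title', 'Pacific Daylight Time (UTC-7)' ),
    ] ),
    'PDT',
    new EndTagTk( 'abbr' ),
    '"}',
];
```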

Software version (on Special:Version page; skip for WMF-hosted wikis like Wikipedia):

  • Parsoid: master branch (current)
  • PHP: 8.2.29
  • PHPUnit: 10.5.58

Other information (browser name/version, screenshots, etc.):
General question

What is the intended, general-purpose way to convert Parsoid token arrays back into wikitext strings for parser-function arguments? We currently use TokenUtils::tokensToString(..., true) and, when that fails, fall back to PipelineUtils::processContentInPipeline (expanded-tokens-to-fragment) followed by WikitextSerializer. Is this the recommended approach, or is there a Parsoid utility/API intended for argument stringification?

Current token→string approach (for context):

$strictResult = TokenUtils::tokensToString( $tokens, true );
if ( is_string( $strictResult ) ) {
    return trim( $strictResult );
}
// fallback: expanded-tokens-to-fragment → WikitextSerializer
if ( !( PHPUtils::lastItem( $tokens ) instanceof EOFTk ) ) {
    $tokens[] = new EOFTk();
}
$dom = PipelineUtils::processContentInPipeline( $env, $frame, $tokens, [
    'pipelineType' => 'expanded-tokens-to-fragment',
    'pipelineOpts' => [
        'attrExpansion' => true,
        'inlineContext' => true,
        'expandTemplates' => false,
        'inTemplate' => true,
    ],
    'sol' => false,
    'toplevel' => false,
] );
$ws = new WikitextSerializer( $env, [] );
return $dom instanceof DocumentFragment ? $ws->domToWikitext( [], $dom ) : '';

Event Timeline

What are you actually trying to do here? JSON is not wikitext. The mechanism to protect a non-wikitext string in a wikitext parsing context is an extension tag (<nowiki> at least).
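For example, a producer that controls the wikitext could protect the JSON before it reaches the tokenizer (a sketch; the wrapping helper is hypothetical, and it assumes the payload contains no literal `</nowiki>`):

```php
// Hypothetical helper: wrap a non-wikitext payload in <nowiki> so Parsoid
// treats it as a single literal text run instead of parsing the embedded HTML.
function protectAsNowiki( string $payload ): string {
    // Assumes $payload contains no literal "</nowiki>"; a real implementation
    // would need to escape or reject such input.
    return '<nowiki>' . $payload . '</nowiki>';
}

$arg = protectAsNowiki(
    '{"date":"October 20, 2023 - 10:00 <abbr data-tz=\"-07:00\">PDT</abbr>"}'
);
```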

The new PFragment extension/parser function interface is another way to address the issue of passing complex types to extension functions, if that is what you are trying to do.

We have a custom Lua engine integration using Parsoid’s FunctionHookHandler. By the time our handler runs, Parsoid already provides a tokenized Frame (args are token arrays), so we only control token→string conversion, not tokenization.

In our usage, JSON strings are sometimes produced by Lua modules and passed as template arguments. When the JSON contains literal HTML tags (e.g., <abbr>), Parsoid tokenizes those tags as HTML, which splits the JSON and breaks the literal string (as shown above).

Given this setup, and since we'd prefer not to modify existing Lua modules/templates if possible, what is the recommended approach for argument stringification? Is <nowiki> protection the expected solution, or is the PFragment interface intended for passing non-wikitext data like JSON through parser-function/Lua arguments?

I think <nowiki> protection is the currently recommended solution. There might be an additional option involving PFragments in the future, but I don't think that's ready for general use yet.

MSantos subscribed.

I'm declining this task since it's not our intention to support this as a new feature. Happy to discuss it further if needed.