Steps to replicate the issue (include links if applicable):
- Use Parsoid to tokenize this raw wikitext string (no templates needed): {"date":"October 20, 2023 - 10:00 <abbr data-tz=\"-07:00\" title=\"Pacific Daylight Time (UTC-7)\">PDT<\/abbr>"}
- Run the wikitext-to-expanded-tokens pipeline.
- Inspect the resulting token stream for the <abbr> tag and its attributes.
Reproduction code:
$input = '{"date":"October 20, 2023 - 10:00 <abbr data-tz=\\"-07:00\\" ' .
    'title=\\"Pacific Daylight Time (UTC-7)\\">PDT<\/abbr>"}';
$siteConfig = new MockSiteConfig( [] );
$pageConfig = new MockPageConfig( $siteConfig, [], null );
$env = new MockEnv( [
    'pageConfig' => $pageConfig,
    'siteConfig' => $siteConfig,
    'pageContent' => $input,
] );
$title = Title::newFromText( 'Test', $env->getSiteConfig() );
$frame = new Frame( $title, $env, [], new SourceString( $input ) );
$tokens = PipelineUtils::processContentInPipeline(
    $env,
    $frame,
    $input,
    [
        'pipelineType' => 'wikitext-to-expanded-tokens',
        'pipelineOpts' => [
            'expandTemplates' => false,
            'inTemplate' => true,
            'extTag' => null
        ],
        'sol' => true,
        'srcText' => $input,
        'srcOffsets' => new SourceRange( 0, strlen( $input ) ),
        'tplArgs' => [],
        'toplevel' => false,
    ]
);
What happens?:
- Parsoid tokenizes <abbr> inside the JSON as HTML.
- The <abbr> start tag becomes a TagTk, while the escaped closing tag <\/abbr> is not recognized as an end tag and stays inside a string token.
- The <abbr> token’s attributes are split incorrectly, and the JSON string is no longer intact.
[Screenshot: tokens]
[Screenshot: attributes in abbr start tag token]
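For context, the asymmetry between the start and end tag can be seen without Parsoid at all. The following is only a plain-PHP sketch (the variable names are ours, and it makes no claim about how the Parsoid tokenizer is implemented): it shows that the raw input contains a literal `<abbr ` sequence but no literal `</abbr>`, because the slash is JSON-escaped as `\/`. Only after JSON decoding does a balanced tag pair exist. Our assumption is that this is why the start tag is picked up while the end tag is not.

```php
<?php
// Raw input from the report. In a PHP single-quoted string, \\ yields one
// backslash and \/ is kept literally, so this is the exact text being tokenized.
$input = '{"date":"October 20, 2023 - 10:00 <abbr data-tz=\\"-07:00\\" ' .
    'title=\\"Pacific Daylight Time (UTC-7)\\">PDT<\/abbr>"}';

// What a tokenizer working on the raw text can see.
$rawHasStart = str_contains( $input, '<abbr ' );        // literal start-tag opener
$rawHasPlainEnd = str_contains( $input, '</abbr>' );    // literal end tag
$rawHasEscapedEnd = str_contains( $input, '<\/abbr>' ); // JSON-escaped end tag

// What a JSON consumer sees after decoding: \" and \/ are unescaped.
$decoded = json_decode( $input, true, 512, JSON_THROW_ON_ERROR );
$date = $decoded['date'];

var_dump( $rawHasStart, $rawHasPlainEnd, $rawHasEscapedEnd );
echo $date, "\n";
```

The raw-text checks describe the string handed to the wikitext-to-expanded-tokens pipeline; the decoded value is what the JSON payload is meant to carry.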
What should have happened instead?:
- Either the JSON string should remain a single literal text token (i.e., not parsed as HTML), or
- If HTML parsing inside JSON is intended, the <abbr> tag should be fully tokenized as a proper start/end tag pair (TagTk + EndTagTk) without splitting the JSON string.
Software version (on Special:Version page; skip for WMF-hosted wikis like Wikipedia):
- Parsoid: master branch (current)
- PHP: 8.2.29
- PHPUnit: 10.5.58
Other information (browser name/version, screenshots, etc.):
General question
What is the intended, general-purpose way to convert Parsoid token arrays back into wikitext strings for parser-function arguments? We currently call TokenUtils::tokensToString( ..., true ) first. If that does not return a string, we fall back to PipelineUtils::processContentInPipeline (pipeline type expanded-tokens-to-fragment) and serialize the resulting fragment with WikitextSerializer. Is this the recommended approach, or is there a Parsoid utility/API that should be used instead for argument stringification?
Current token→string approach (for context):
$strictResult = TokenUtils::tokensToString( $tokens, true );
if ( is_string( $strictResult ) ) {
    return trim( $strictResult );
}
// fallback: expanded-tokens-to-fragment → WikitextSerializer
if ( !( PHPUtils::lastItem( $tokens ) instanceof EOFTk ) ) {
    $tokens[] = new EOFTk();
}
$dom = PipelineUtils::processContentInPipeline( $env, $frame, $tokens, [
    'pipelineType' => 'expanded-tokens-to-fragment',
    'pipelineOpts' => [
        'attrExpansion' => true,
        'inlineContext' => true,
        'expandTemplates' => false,
        'inTemplate' => true,
    ],
    'sol' => false,
    'toplevel' => false,
] );
$ws = new WikitextSerializer( $env, [] );
return $dom instanceof DocumentFragment ? $ws->domToWikitext( [], $dom ) : '';
