# Description

Updated mediawiki/extensions Project: mediawiki/extensions/Parsoid 1a0b8840d69f40aff7dbb128c9df8f61a893049b

Support hierarchical parsing: Parse extension content in new scope.

• Content within extension tags should ideally not be parsed in the context of the surrounding scope, since not all extensions use wikitext for their content.
  • For those that do not (ex: <math>), we shouldn't even be tokenizing the extension content, since doing so can cause incorrect tokenizing in the surrounding context. Ex: The "}}" in "\frac{foo \frac{bar}}" should not be parsed as a template closing tag that can incorrectly close an open template in surrounding content.
  • For those that do (ex: <ref>), any errors in tokenizing should be confined to the ref-tag itself and not spill over into the surrounding scope. Ex: <ref><!--boo-></ref>, with an incorrectly closed comment, should not cause havoc outside the ref-tag.
• This patch tricks the single-pass tokenizer by stripping extension content and replacing it with harmless content that is matched right away and removed from the token stream. This trick is similar to the chunky-tokenizer trick where we modify the input stream midway, and it only works because pegjs gives us direct access to the object that holds the input being parsed. This trick also lets us leave source offsets unchanged.
• Handling unbalanced tags for uninstalled extensions:
  • The PHP parser doesn't attempt to find a matching pair for xml-tags that don't match installed extensions.
  • {{PAGESINCATEGORY:<bogus>}} parses as a template.
  • {{PAGESINCATEGORY:<math>}} parses as plain text with an error message about "}}" being invalid syntax for the math extension.
  • So, the tokenizer needs info about installed extensions to know how to handle unmatched xml-tags. Right now, this patch adds a hack with a small list of known extensions, and the tokenizer uses a utility method to query whether the tag-name is an installed extension. A future patch should probably take this info from the configuration fetched from the API.
• Also fixed a buggy helper in mediawiki.tokenizer.peg.js that parses a string with a production name passed in. It needed to handle productions that return tokens directly and those that return tokens to a callback.
• 2 more wt2wt tests green.
• This patch now eliminates many or all RT errors from several pages that use the math extension:
  • en:Voltage conversion
  • en:Van Der Waal's Bond
  • en:Regularization (mathematics)
• This patch also now parses the following snippets more accurately:
  • {{echo|<math>a=b</math>}} -- this is not treated as a KV pair with <math>a: b</math> as in master but as 1:<math>a=b</math>
  • {{echo|<includeonly>|foo|</includeonly>bla}} is also properly parsed like the previous example.
  • <ref><!--boo-></ref> -- The unclosed comment is treated as plain text within the ref-tag and doesn't spill over to surrounding context.
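
The masking trick and the installed-extension lookup described above can be illustrated with a rough sketch. This is not the code from the patch: the names `KNOWN_EXTENSIONS`, `isInstalledExtension` and `maskExtensionContent`, the `#` filler character, and the regex-based matching are all assumptions made for illustration, and the sketch works on a plain string rather than on the pegjs input object. It ignores self-closing and nested tags.

```javascript
// Hypothetical stand-in for the patch's small hard-coded list of extensions.
const KNOWN_EXTENSIONS = new Set(['math', 'ref', 'nowiki']);

// Hypothetical utility: is this tag name an installed extension?
function isInstalledExtension(name) {
  return KNOWN_EXTENSIONS.has(name.toLowerCase());
}

// Replace the body of each matched extension tag pair with filler of the
// same length, so every source offset in the masked string stays valid.
// The original bodies are stashed with their offsets for later parsing
// in their own scope.
function maskExtensionContent(src) {
  const stash = [];
  // <name attrs> body </name>, with a lazy body and a backreference to name.
  const tagPair = /<([a-zA-Z]+)([^<>]*)>([\s\S]*?)<\/\1\s*>/g;
  const masked = src.replace(tagPair, (whole, name, attrs, body, offset) => {
    if (!isInstalledExtension(name)) {
      return whole; // not an extension: leave it for normal tokenizing
    }
    const open = '<' + name + attrs + '>';
    const close = whole.slice(open.length + body.length);
    stash.push({ name: name, body: body, offset: offset + open.length });
    return open + '#'.repeat(body.length) + close;
  });
  return { masked: masked, stash: stash };
}
```

With this sketch, the "}}" inside a math body is hidden from whatever tokenizes the masked string, while a tag that is not in the list, or has no closing tag, is left untouched. Because the filler has the same length as the body, `src.substr(entry.offset, entry.body.length)` still returns the original body for each stashed entry.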

# Details

Provenance
 ssastry Authored on Gerrit Code Review Committed on Mar 5 2013, 11:01 PM
Parents
rMEXTa065734c69af: Updated mediawiki/extensions Project: mediawiki/extensions/TimedMediaHandler…
Branches
Unknown
Tags
Unknown
ChangeId