
Support hierarchical parsing: Parse extension content in new scope.

Description

  • Content within extension tags should ideally not be parsed in the context of the surrounding scope, since not all extensions use wikitext for their content.
    • For those that do not (e.g. <math>), we shouldn't even be tokenizing the extension content, since doing so can cause incorrect tokenizing in the surrounding context. Example: the "}}" in "\frac{foo \frac{bar}}" should not be parsed as a template closing tag, which could incorrectly close an open template in the surrounding content.
    • For those that do (e.g. <ref>), any errors in tokenizing should be confined to the ref-tag itself and not spill over into the surrounding scope. Example: <ref><!--boo-></ref>, with an incorrectly closed comment, should not cause havoc outside the ref-tag.
  • This patch tricks the single-pass tokenizer by stripping extension content and replacing it with harmless content that is matched right away and removed from the token stream. This is similar to the chunky-tokenizer trick, where we modify the input stream midway through parsing. It only works because pegjs gives us direct access to the object that holds the input being parsed. The trick also leaves source offsets unchanged.
  • Handling unbalanced tags for uninstalled extensions:
    • The PHP parser doesn't attempt to find a matching pair for XML tags that don't match installed extensions.
      • {{PAGESINCATEGORY:<bogus>}} parses as a template
      • {{PAGESINCATEGORY:<math>}} parses as plain text with an error message about "}}" being invalid syntax for the math extension.
    • So the tokenizer needs to know which extensions are installed in order to handle unmatched XML tags. For now, this patch adds a hack: a small hard-coded list of known extensions, plus a utility method the tokenizer uses to query whether a tag name is an installed extension. A future patch should probably get this list from the configuration fetched from the API.
  • Also fixed a buggy helper in mediawiki.tokenizer.peg.js that parses a string with a production name passed in. It needed to handle both productions that return tokens directly and those that return tokens to a callback.
  • Two more wt2wt tests are green.
  • This patch now eliminates many or all round-trip (RT) errors from several pages that use the math extension.
    • en:Voltage conversion
    • en:Van Der Waal's Bond
    • en:Regularization (mathematics)
  • This patch also now parses the following snippets more accurately:
    • {{echo|<math>a=b</math>}} -- this is no longer treated as a KV pair (key "<math>a", value "b</math>") as in master, but as the positional argument 1: <math>a=b</math>
    • {{echo|<includeonly>|foo|</includeonly>bla}} is now parsed correctly in the same way.
    • <ref><!--boo-></ref> -- The unclosed comment is treated as plain text within the ref-tag and doesn't spill over to surrounding context.
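
The stripping trick and the installed-extension check described above can be sketched roughly as follows. This is not the code from the patch: the names INSTALLED_EXTENSIONS, isInstalledExtension, maskExtensionContent and restoreExtensionContent, the "#" filler character, and the regex-based tag matching are all assumptions made for illustration. It operates on a plain string rather than on the pegjs input object, and it simply leaves unmatched tags alone, which does not reproduce the PHP parser's behavior for unclosed installed extensions.

```javascript
// Hypothetical stand-in for the patch's small hard-coded list of extensions.
const INSTALLED_EXTENSIONS = new Set(['math', 'ref']);

// Hypothetical utility: is this tag name an installed extension?
function isInstalledExtension(name) {
    return INSTALLED_EXTENSIONS.has(name.toLowerCase());
}

// Replace the body of every balanced <ext>...</ext> pair with filler of the
// same length, so that all source offsets in the masked string are unchanged.
function maskExtensionContent(src) {
    const openTag = /<([a-zA-Z][a-zA-Z0-9]*)(\s[^<>]*)?>/g;
    const saved = [];
    let out = '';
    let pos = 0;
    let m;
    while ((m = openTag.exec(src)) !== null) {
        const name = m[1];
        // Unknown tags and self-closing tags are left for the tokenizer.
        if (!isInstalledExtension(name) || m[0].endsWith('/>')) {
            continue;
        }
        const bodyStart = m.index + m[0].length;
        const closeTag = new RegExp('</' + name + '\\s*>', 'gi');
        closeTag.lastIndex = bodyStart;
        const close = closeTag.exec(src);
        if (close === null) {
            continue; // unbalanced tag: this sketch does not touch it
        }
        const bodyEnd = close.index;
        saved.push({ start: bodyStart, end: bodyEnd, content: src.slice(bodyStart, bodyEnd) });
        out += src.slice(pos, bodyStart) + '#'.repeat(bodyEnd - bodyStart);
        pos = bodyEnd;
        // Resume scanning after the closing tag.
        openTag.lastIndex = bodyEnd + close[0].length;
    }
    out += src.slice(pos);
    return { masked: out, saved: saved };
}

// Put the saved bodies back; offsets line up because lengths were preserved.
function restoreExtensionContent(masked, saved) {
    let out = masked;
    for (const s of saved) {
        out = out.slice(0, s.start) + s.content + out.slice(s.end);
    }
    return out;
}

const example = maskExtensionContent('{{echo|<math>\\frac{a}}</math>}}');
console.log(example.masked);
```

With the "}}" inside <math> masked out, only the final "}}" remains to close the template, and an input such as {{PAGESINCATEGORY:<bogus>}} passes through untouched because "bogus" is not in the list.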

Change-Id: Id67528f6527833492f431404b4dad980b8f22ed8