
Improve substitution of {{...}} in Special:ViewXML
Open, High · Public

Description

There is a bug in parsing nested templates in version 0.6.1. Example:

{{Some template1
|param=Lorem ipsum1
}}
{{Some template2
|param=Lorem ipsum2{{Template|value}} and {{Template|value having : colon}} lorem ipsum2
}}
{{Some template3
|param=Lorem ipsum3
}}

I use

  • $wgDataTransferViewXMLParseFields = False; and
  • $wgDataTransferViewXMLParseFreeText = False;

When the page is transformed (on a German wiki), the output gets mixed up (see the broken {{Template…<xmlstuff>…}}):

<Seiten>
  <Seite>
    <Kennung>3851</Kennung>
    <Titel>Testseite XML Export</Titel>
    <Some_template1><param>Lorem ipsum1</param></Some_template1>
    <Some_template2><param>Lorem ipsum2{{Template</param><Feld_1>value</Feld_1></Some_template2>
    <Freitext id="1">and</Freitext>
    <Template><Feld_1>value having : colon}} lorem ipsum2</Feld_1></Template>
    <Some_template3><param>Lorem ipsum3</param></Some_template3>
  </Seite>
</Seiten>

As far as I could dig into the PHP code, the function parsePageContents() in DataTransfer/includes/DT_PageStructure.php makes some loose regex assumptions that cause this behaviour. I'm no expert in recursive regex, so I asked on Stack Overflow (“PCRE recursive pattern with 1st level condition {{1st-level-test: anything {{ne{{s}}ted}} }}”), and I suggest rewriting those escape patterns to something like:

// escape out parser functions
// $page_contents = preg_replace( '/{{(#.+)}}/', '&#123;&#123;$1&#125;&#125;', $page_contents );
$page_contents = preg_replace( '/{{(\s*#[\w\d_]+:([^{}]*+(?:{{(?2)}}[^{}]*)*+))}}/', '&#123;&#123;$1&#125;&#125;', $page_contents );

(not sure about the naming conventions of parser functions)

and to something like:

// escape out transclusions, and calls like "DEFAULTSORT"
// $page_contents = preg_replace( '/{{(.*:.+)}}/', '&#123;&#123;$1&#125;&#125;', $page_contents );
$page_contents = preg_replace( '/{{(\s*[\w\d]+:([^{}]*+(?:{{(?2)}}[^{}]*)*+))}}/', '&#123;&#123;$1&#125;&#125;', $page_contents );

I'm not sure whether this covers all necessary cases, but at least it takes care to match the right closing brackets :-)
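To make the difference concrete, here is a small standalone sketch that runs the current and the proposed transclusion pattern over a line modelled on the example above. The variable names and the DEFAULTSORT sample are only illustrative assumptions, not code from the extension.

```php
<?php
// Sample line modelled on the report: two plain template calls, one containing a colon.
$line = '|param=Lorem ipsum2{{Template|value}} and {{Template|value having : colon}} lorem ipsum2';

$replacement = '&#123;&#123;$1&#125;&#125;';

// Current pattern: the greedy ".*" and ".+" span from the first "{{" to the last "}}" on the line.
$old = '/{{(.*:.+)}}/';
// Proposed pattern: a name followed by ":" and then brace-balanced content via the recursive (?2).
$new = '/{{(\s*[\w\d]+:([^{}]*+(?:{{(?2)}}[^{}]*)*+))}}/';

$oldResult = preg_replace( $old, $replacement, $line );
$newResult = preg_replace( $new, $replacement, $line );

echo $oldResult, "\n";
// |param=Lorem ipsum2&#123;&#123;Template|value}} and {{Template|value having : colon&#125;&#125; lorem ipsum2
echo $newResult, "\n";
// unchanged: neither template call has the form "{{Name:...}}"

// Illustrative nested call, to show the recursion picking the matching "}}".
$nested = 'a {{DEFAULTSORT:Foo {{PAGENAME}} bar}} b {{Template|x}}';
$nestedResult = preg_replace( $new, $replacement, $nested );
echo $nestedResult, "\n";
// a &#123;&#123;DEFAULTSORT:Foo {{PAGENAME}} bar&#125;&#125; b {{Template|x}}
```

With the current pattern, the two separate template calls are treated as one transclusion because of the single colon further along the line, which would explain the mixed-up XML. The proposed pattern leaves both calls alone.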

Event Timeline

Restricted Application added subscribers: Zppix, Aklapper. · May 19 2016, 9:00 PM
infinite-dao triaged this task as High priority. · May 20 2016, 8:08 AM

Or add recursive parsing à la http://us.php.net/manual/en/function.preg-replace-callback.php#example-5364?

function escapeTemplateParameterRecursive($input){
  $regex = '@{{{([^{}]*+(?:{{{(?1)}}}[^{}]*)*+)}}}@ms';
  if (is_array($input)) { $input = '&#123;&#123;&#123;'.$input[1].'&#125;&#125;&#125;'; }
  return preg_replace_callback($regex, __FUNCTION__, $input);
}

function escapeWikiParserFunktionRecursive($input){
  $regex = '@{{(\s*#[\w\d_]+:)([^{}]*+(?:{{(?2)}}[^{}]*|{{{(?2)}}}[^{}]*)*+)}}@ms';
  if (is_array($input)) { $input = '&#123;&#123;'.$input[1].$input[2].'&#125;&#125;'; }
  return preg_replace_callback($regex, __FUNCTION__, $input);
}

function escapeWikiMagickWordRecursive($input){
  $regex = '@{{(\s*[\w\d_]+:)([^{}]*+(?:{{(?2)}}[^{}]*|{{{(?2)}}}[^{}]*)*+)}}@ms';
  if (is_array($input)) { $input = '&#123;&#123;'.$input[1].$input[2].'&#125;&#125;'; }
  return preg_replace_callback($regex, __FUNCTION__, $input);
}
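For anyone who wants to try this out in isolation, below is a self-contained sketch of the parser-function variant. It uses a closure instead of __FUNCTION__. The helper name escapeParserFunctionsSketch is hypothetical, and the regex is the one from escapeWikiParserFunktionRecursive() minus the m and s modifiers, which as far as I can tell have no effect here because the pattern uses neither ^/$ nor a dot.

```php
<?php
// Hypothetical helper; not part of the DataTransfer extension.
function escapeParserFunctionsSketch( string $text ): string {
    $regex = '@{{(\s*#[\w\d_]+:)([^{}]*+(?:{{(?2)}}[^{}]*|{{{(?2)}}}[^{}]*)*+)}}@';
    $result = preg_replace_callback(
        $regex,
        function ( array $m ): string {
            // Escape the outer braces, then recurse into the arguments
            // so that nested parser functions are escaped as well.
            return '&#123;&#123;' . $m[1] . escapeParserFunctionsSketch( $m[2] ) . '&#125;&#125;';
        },
        $text
    );
    // preg_replace_callback() returns null on a PCRE error; fall back to the input.
    return $result ?? $text;
}

$in = '{{#if: {{{1|}}} | {{#expr: 1+1}} }} and {{Template|value}}';
$out = escapeParserFunctionsSketch( $in );
echo $out, "\n";
// &#123;&#123;#if: {{{1|}}} | &#123;&#123;#expr: 1+1&#125;&#125; &#125;&#125; and {{Template|value}}
```

Both the outer #if and the inner #expr get escaped, while the template parameter {{{1|}}} and the plain template call are left for the other steps.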