
Parser rule clean-up: extlink_preprocessor_text_parameterized
Closed, ResolvedPublic

Description

In extlink_preprocessor_text_parameterized rule in /lib/wt2html/pegTokenizer.pegjs, we have

 r:(
      $[^'<~[{\n\r|!\]}\-\t&="' \u00A0\u1680\u180E\u2000-\u200A\u202F\u205F\u3000]+
    / !inline_breaks s:( directive / no_punctuation_char / [&|{\-] ) { return s; }
    / $([.:,] !(space / eolf))
    / $(['] ![']) // single quotes are ok, double quotes are bad
)+ { return tu.flattenString(r);  }

On line 2, the character class $[^'..."'...] contains the single quote twice; one of them should be removed.

The line 4 alternative $([.:,] ...) is not reachable, because those three characters aren't excluded by the first test, so the first alternative already consumes them.

!inline_breaks can be thought of as &[=|!{}:;\r\n[\]\-] &{return magic_fn()}. Since [;:] aren't excluded by the first test, we can simplify it to &[=|!{}\r\n[\]\-] &{return magic_fn()}
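The reachability claims above can be checked directly in Node: any character the first alternative's class does not exclude is consumed greedily by $[^...]+ and so never reaches the later alternatives. A throwaway sketch (my own check, not Parsoid code; the class is transcribed from the rule above, with the duplicate quote dropped):

```javascript
// Characters excluded by the first alternative's class. Anything NOT in
// this set is eaten by $[^...]+ before the other alternatives are tried.
const firstAltExcluded = /[<~[{\n\r|!\]}\-\t&="' \u00A0\u1680\u180E\u2000-\u200A\u202F\u205F\u3000]/;

// [.:,] from the fourth alternative and [;:] from inline_breaks' guard:
// none of them are excluded, so they never reach those alternatives.
for (const ch of ['.', ':', ',', ';']) {
  console.log(JSON.stringify(ch), firstAltExcluded.test(ch)); // false for all four
}
```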

So the code can be simplified to

 r:(
      $[^<~[{\n\r|!\]}\-\t&="' \u00A0\u1680\u180E\u2000-\u200A\u202F\u205F\u3000]+
    / &[=|!{}\r\n[\]\-] &{return magic_fn()}
      s:(
          directive
        / no_punctuation_char
        / [&|{\-]
      ) { return s; }
    / $(['] ![']) // single quotes are ok, double quotes are bad
)+ { return tu.flattenString(r);  }

Refer to https://docs.google.com/spreadsheets/d/185Fr3AFtmPTmYQ8-WRGUKA-whryTZUgrX0E7kL0BHTw/edit?usp=sharing

(See note #2)
no_punctuation_char / [&|{\-] is really just [^"'<>,.%:\[\]\x00-\x20\x7F\u180E \u00A0\u1680\u2000-\u200A\u202F\u205F\u3000].

So the code can be simplified to

r:(
     $[^|[\]{!\n\r\-}~\t<&="' \u00A0\u1680\u180E\u2000-\u200A\u202F\u205F\u3000]+
   / &[=|!{}\r\n[\]\-] &{return magic_fn()}
      s:(
          directive
        / [^"'<>,.%:\[\]\x00-\x20\x7F\u180E \u00A0\u1680\u2000-\u200A\u202F\u205F\u3000]
      ) { return s; }
   / $(['] ![']) // single quotes are ok, double quotes are bad
)+ { return tu.flattenString(r);  }

(See note #3)
Line 6 (the third [...]) can be simplified further to [-{}|!=~&]

So the code can be simplified to

r:(
     $[^|[\]{!\n\r\-}~\t<&="'\u180E \u00A0\u1680\u2000-\u200A\u202F\u205F\u3000]+ // ⇔ $([^|[\]{!\n\r\-}~\t<&="'\u180E] / unispace)+
   / &[=|!{}:;\r\n[\]\-] &{return magic_fn()}
      s:( directive / [-{}|!=~&] ) { return s; }
   / $(['] ![']) // single quotes are ok, double quotes are bad
)+ { return tu.flattenString(r);  }

or

r:(
     $[^|[\]{!\n\r\-}~\t<&="'\u180E \u00A0\u1680\u2000-\u200A\u202F\u205F\u3000]+ // ⇔ $([^|[\]{!\n\r\-}~\t<&="'\u180E] / unispace)+
   / !inline_breaks s:( directive / [-{}|!=~&] ) { return s; }
   / $(['] ![']) // single quotes are ok, double quotes are bad
)+ { return tu.flattenString(r);  }
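The note #3 reduction (the big punctuation class collapsing to [-{}|!=~&]) can be verified by brute force: a character only reaches the second alternative if the first class rejects it, and it must then pass the punctuation class. Intersecting the two sets over the BMP, in a quick sketch of my own (not part of any patch):

```javascript
// Characters the first alternative refuses (and so may reach alternative 2).
const rejectedByFirst = /[|[\]{!\n\r\-}~\t<&="'\u180E \u00A0\u1680\u2000-\u200A\u202F\u205F\u3000]/;
// The punctuation class from the previous version of alternative 2.
const punctClass = /[^"'<>,.%:[\]\x00-\x20\x7F\u180E \u00A0\u1680\u2000-\u200A\u202F\u205F\u3000]/;

let survivors = '';
for (let cp = 0; cp <= 0xFFFF; cp++) {
  const ch = String.fromCharCode(cp);
  if (rejectedByFirst.test(ch) && punctClass.test(ch)) {
    survivors += ch;
  }
}
console.log(JSON.stringify(survivors)); // → "!&-={|}~", i.e. exactly [-{}|!=~&]
```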

Event Timeline

Dan1wang created this task.May 8 2019, 2:22 AM
Restricted Application added subscribers: Liuxinyu970226, Aklapper.May 8 2019, 2:22 AM
ssastry added a subscriber: ssastry.May 8 2019, 3:02 AM

@Dan1wang thanks very much for all the phab tasks exploring the PEG tokenizer with suggestions for possible simplifications / improvements of the rules. We appreciate the attention and effort you've been putting in here.

We are currently deep in the middle of a port of Parsoid from Node.js to PHP and are effectively in a code freeze wrt the JS codebase. The bulk of the changes to the JS version of Parsoid has been refactoring and cleanup that aids our porting. So, we are unable to act on most of the phab tasks you have filed for another 3 months. But, where and when possible, we might look at some tweaks here and there.

Feel free to continue filing phab tasks related to the PEG rules. We'll take a look at them once we are done with our port and are ready to resume actual bug fixing and improvements to Parsoid. I wanted to clarify this here so you don't think we are ignoring your suggestions or patches.

Dan1wang updated the task description. (Show Details)May 8 2019, 3:22 AM
Dan1wang updated the task description. (Show Details)May 8 2019, 3:58 AM

The rule doesn't make sense.

[^             "   <>  \x00-\x20  \x7F\uFFFD  \u00A0\u1680\u2000-\u200A\u202F\u205F\u3000]    // EXT_LINK_URL_CLASS in PHP
[^-{}|!= &[\]~'"   <   \n\r\t                 \u00A0\u1680\u2000-\u200A\u202F\u205F\u3000]+   // Parser rule
  • PHP accepts "-" without further parsing, but PEG would try to check for -{{lang_variant}}-- (T222328)
  • PHP rejects "<", but PEG would try to check for tags (T222328)
  • Makes sense that we reject "{" first. But doesn't make sense to reject "}". See url and autourl rules
  • Don't know about rejecting [=|!] first (url and autourl rules don't do this)
  • Makes sense we reject "[" and "]" (url and autourl rules do the same)
  • Makes no sense we reject "~" (the current rule will always accept it)
  • Why are we allowing [\x00-\x20] (minus [\t\n\r]) and "\x7F"?
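The bullets above can be made concrete by diffing the two exclusion sets over the BMP. A throwaway sketch of my own (classes transcribed from the comparison above):

```javascript
// Exclusion classes transcribed from the side-by-side comparison.
const phpExcl = /["<>\x00-\x20\x7F\uFFFD\u00A0\u1680\u2000-\u200A\u202F\u205F\u3000]/; // EXT_LINK_URL_CLASS
const pegExcl = /[-{}|!= &[\]~'"<\n\r\t\u00A0\u1680\u2000-\u200A\u202F\u205F\u3000]/;  // parser rule

const phpOnly = [], pegOnly = [];
for (let cp = 0; cp <= 0xFFFF; cp++) {
  const ch = String.fromCharCode(cp);
  if (phpExcl.test(ch) && !pegExcl.test(ch)) phpOnly.push(cp);
  if (!phpExcl.test(ch) && pegExcl.test(ch)) pegOnly.push(cp);
}

// Extra characters the PEG rule stops on (the [-{}|!=&[]~'] bullets):
console.log(pegOnly.map(cp => String.fromCharCode(cp)).join('')); // → "!&'-=[]{|}~"
// phpOnly holds ">", the C0 controls other than \t\n\r and space, \x7F, and \uFFFD:
// the characters PHP rejects but the PEG rule lets through.
```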

The rule should be changed to

extlink_preprocessor_text_parameterized
  = r:(
      // same as PHP's EXT_LINK_URL_CLASS, further excluding "{|!=&[]" and "'"
      $[^{|!=&[\]'"<>\x00-\x20\x7F\uFFFD \u00A0\u1680\u2000-\u200A\u202F\u205F\u3000]+
    / !inline_breaks s:( directive / [{|!=] ) { return s; }
    / $(['] ![']) // single quotes are ok, double quotes are bad
    )+ {
        return tu.flattenString(r);
    }
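The comment in the proposed rule ("same as PHP's EXT_LINK_URL_CLASS, further excluding ...") can likewise be checked mechanically: the new class's exclusions should be exactly EXT_LINK_URL_CLASS's exclusions plus {|!=&[]'. A quick sketch of my own:

```javascript
// EXT_LINK_URL_CLASS exclusions (PHP) and the proposed rule's exclusions.
const phpClass = /["<>\x00-\x20\x7F\uFFFD\u00A0\u1680\u2000-\u200A\u202F\u205F\u3000]/;
const proposedClass = /[{|!=&[\]'"<>\x00-\x20\x7F\uFFFD\u00A0\u1680\u2000-\u200A\u202F\u205F\u3000]/;

const extra = [];
let missing = 0;
for (let cp = 0; cp <= 0xFFFF; cp++) {
  const ch = String.fromCharCode(cp);
  if (proposedClass.test(ch) && !phpClass.test(ch)) extra.push(ch);
  if (phpClass.test(ch) && !proposedClass.test(ch)) missing++;
}

console.log(extra.join('')); // → "!&'=[]{|" — exactly {|!=&[] and '
console.log(missing);        // → 0 — nothing PHP rejects is readmitted
```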
ssastry triaged this task as Low priority.May 8 2019, 3:40 PM
ssastry added a comment.EditedMay 8 2019, 3:42 PM

I have marked all these tasks low priority for now. We'll revisit them post-port.

LGoto moved this task from Needs Triage to Backlog on the Parsoid board.Feb 15 2020, 12:05 AM
LGoto assigned this task to Arlolra.May 21 2020, 6:15 PM
LGoto moved this task from Backlog to Needs Investigation on the Parsoid board.

Change 599870 had a related patch set uploaded (by Arlolra; owner: Arlolra):
[mediawiki/services/parsoid@master] Cleanup extlink_preprocessor_text_parameterized

https://gerrit.wikimedia.org/r/599870

Change 599907 had a related patch set uploaded (by Arlolra; owner: Arlolra):
[mediawiki/services/parsoid@master] Add ~ to text_char

https://gerrit.wikimedia.org/r/599907

Makes no sense we reject "~" (the current rule will always accept it)

The patch in T222770#6177448 removes ~ from the grammar altogether.

The rule doesn't make sense.

[^             "   <>  \x00-\x20  \x7F\uFFFD  \u00A0\u1680\u2000-\u200A\u202F\u205F\u3000]    // EXT_LINK_URL_CLASS in PHP
[^-{}|!= &[\]~'"   <   \n\r\t                 \u00A0\u1680\u2000-\u200A\u202F\u205F\u3000]+   // Parser rule
  • PHP accepts "-" without further parsing, but PEG would try to check for -{{lang_variant}}-- (T222328)
  • PHP rejects "<", but PEG would try to check for tags (T222328)

You've filed T222328 to investigate those so we'll leave them for now.

  • Makes sense that we reject "{" first. But doesn't make sense to reject "}". See url and autourl rules
  • Don't know about rejecting [=|!] first (url and autourl rules don't do this)
  • Makes sense we reject "[" and "]" (url and autourl rules do the same)

The rest of these characters are necessary to give inline_breaks a chance. It's not enough to look at other rules or classes in the legacy parser since, as in T222328#5152953, Parsoid has a different processing model. A case-by-case analysis needs to be done if we really want to eliminate them, which is more work than this simple cleanup.

  • Why are we allowing [\x00-\x20] (minus [\t\n\r]) and "\x7F"?

This seems odd, though.

In T222770#6177034 I've pushed a patch that matches the suggestions in the description, which seem reasonable.

cscott added a subscriber: cscott.May 29 2020, 7:51 PM

The control characters below 0x20 are discussed in T106079#6046615. Parsoid, PHP, HTML5, and XML are all mutually inconsistent in odd ways (and the HTML5 spec was fairly recently changed as well), so it is known there are differences in existing code. In theory Parsoid should "do the right thing", but even that's a little hard to define precisely.

We'd like to be consistent with HTML5, but for legacy reasons (Internet Explorer bugs) Visual Editor currently parses Parsoid output with an XML parser in some situations, so it's safer to be consistent with XML. The rule of thumb "be liberal with what you accept but strict in what you output" suggests that Parsoid should probably pass through control characters from the input wikitext, but we should try to strip them in the PST so that we never save control characters in wikitext. Or maybe we always need to strip them from the output so XML parsing will always work... but that creates issues with the DSR map. Anyway, it's complicated.

Further discussion of control characters (including grammar patches) probably should go in T106079: Wikitext includes control characters that are not allowed in HTML 5.

Change 599907 merged by jenkins-bot:
[mediawiki/services/parsoid@master] Add ~ to text_char

https://gerrit.wikimedia.org/r/599907

Change 599870 merged by jenkins-bot:
[mediawiki/services/parsoid@master] Cleanup extlink_preprocessor_text_parameterized

https://gerrit.wikimedia.org/r/599870

Arlolra closed this task as Resolved.Jun 5 2020, 5:38 PM

Change 603571 had a related patch set uploaded (by Subramanya Sastry; owner: Subramanya Sastry):
[mediawiki/vendor@master] Bump Parsoid to v0.12.0-a16

https://gerrit.wikimedia.org/r/603571

Change 603571 merged by jenkins-bot:
[mediawiki/vendor@master] Bump Parsoid to v0.12.0-a16

https://gerrit.wikimedia.org/r/603571