
Remex could use some helper/utility classes
Open, Needs Triage, Public

Description

Ideally there should be helper classes for typical Remex use cases. In particular, loading HTML from a string or a file should be a one liner (currently it's 10ish lines of not-easily-discoverable code ).
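For illustration only, such a helper might look like the sketch below. `HtmlHelper`, `parseHtml()`, and `parseHtmlFile()` are invented names, not existing Remex API; only the Tokenizer → Dispatcher → TreeBuilder → DOMBuilder wiring reflects the real pipeline, and constructor signatures may differ between Remex versions.

```php
<?php
// Hypothetical one-liner helper -- class and method names invented for
// illustration; the pipeline wiring follows the usual Remex setup.
use RemexHtml\DOM\DOMBuilder;
use RemexHtml\Tokenizer\Tokenizer;
use RemexHtml\TreeBuilder\Dispatcher;
use RemexHtml\TreeBuilder\TreeBuilder;

class HtmlHelper {
	/** Parse an HTML string into a DOM tree with a single call. */
	public static function parseHtml( string $html ) {
		$domBuilder = new DOMBuilder( function ( $msg, $pos ) {
			// Ignore parse errors; a real helper might collect them.
		} );
		$treeBuilder = new TreeBuilder( $domBuilder, [] );
		$dispatcher = new Dispatcher( $treeBuilder );
		$tokenizer = new Tokenizer( $dispatcher, $html, [] );
		$tokenizer->execute( [] );
		return $domBuilder->getFragment();
	}

	/** Load and parse an HTML file. */
	public static function parseHtmlFile( string $path ) {
		return self::parseHtml( file_get_contents( $path ) );
	}
}
```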

Event Timeline

cscott created this task. Mar 7 2019, 4:45 PM
cscott updated the task description. Mar 7 2019, 4:45 PM
Tgr added a subscriber: Tgr. Mar 7 2019, 4:58 PM

I was trying to figure out the reason for the difference reported in failing test #1 in https://gerrit.wikimedia.org/r/c/mediawiki/services/parsoid/+/494253/10//COMMIT_MSG.
I suspected one of (a) file reading (b) Remex parsing (c) XML Serializer.

So, I started with an input HTML file containing <span>\r</span> (that is, the single \r character, not a backslash followed by r).

After ruling out (a) and (c), I was looking at (b). But I was getting different results with the existing test scripts, and after a bunch of experimentation I finally traced it to a difference in tokenizer options.

I started with https://github.com/wikimedia/remex-html/blob/fa8a6a6b491b2f482e4c237cc345f0439bdbf6a0/bin/test.php#L98-L110. But if I passed in the tokenizer options from https://github.com/wikimedia/remex-html/blob/fa8a6a6b491b2f482e4c237cc345f0439bdbf6a0/bin/test.php#L217-L222, the \r was not converted to a \n. More specifically, it seems to be tied to 'skipPreprocess' => true. I haven't looked at why \r handling varies with this option (i.e. whether this is a bug or a feature).

Anyway, this bug reiterates the importance of this phab task.

See psysh session below.

>>> require 'vendor/autoload.php'; '';
=> ""
>>> $error = function ( $msg, $pos ) { }; '';
=> ""
>>> $html = '<span>^M</span>'; 
=> "<span>\r</span>"
>>> $formatter = new \RemexHtml\Serializer\HtmlFormatter; '';
=> ""
>>> $domBuilder = new \RemexHtml\DOM\DOMBuilder( $error ); '';
=> ""
>>> $serializer = new \RemexHtml\DOM\DOMSerializer( $domBuilder, $formatter ); '';
=> ""
>>> $treeBuilder = new \RemexHtml\TreeBuilder\TreeBuilder( $serializer, [] ); '';
=> ""
>>> $dispatcher = new \RemexHtml\TreeBuilder\Dispatcher( $treeBuilder ); '';
=> ""
>>> $tokenizerOptions = [ 'skipPreprocess' => true ]; '';
=> ""
>>> $tokenizer = new \RemexHtml\Tokenizer\Tokenizer( $dispatcher, $html, $tokenizerOptions ); '';
=> ""
>>> $tokenizer->execute( [] ); '';
=> ""
>>> $serializer->getResult();
=> "<html><head></head><body><span>\r</span></body></html>"
>>> 
>>> $tokenizerOptions = []; '';
=> ""
>>> $tokenizer = new \RemexHtml\Tokenizer\Tokenizer( $dispatcher, $html, $tokenizerOptions ); '';
=> ""
>>> $tokenizer->execute( [] ); '';
=> ""
>>> $serializer->getResult();
=> """
   <html><head></head><body><span>\n
   </span></body></html>
   """

> More specifically, it seems to be tied to 'skipPreprocess' => true. I haven't looked at why \r handling varies with this option (i.e. whether this is a bug or a feature).

Feature. See the docs and the normalization performed in the code.


Just to complete the tangent: this is exactly what the spec mandates (https://html.spec.whatwg.org/multipage/parsing.html#preprocessing-the-input-stream), but it also underscores the benefit of utility classes for dealing with common use cases.
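The preprocessing step the spec describes amounts to newline normalization before tokenization. A minimal PHP sketch of that step (not Remex's actual implementation, which also handles other preprocessing concerns):

```php
<?php
// Input-stream preprocessing per the HTML spec's newline normalization:
// replace CRLF pairs, then any remaining lone CR, with LF.
function preprocessInputStream( string $html ): string {
	return str_replace( [ "\r\n", "\r" ], "\n", $html );
}
```

With 'skipPreprocess' => true, this step is bypassed, which is why the \r survives in the first psysh run above.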

cscott added a comment. Edited Mar 10 2019, 9:16 PM

T217849: Remex needs documentation of how to use its API as well. The spec requires \r stripping, but IIRC MW also does \r stripping, so we're guaranteed that any article we fetch from the DB already has carriage returns stripped, which is why there's an optimization in Remex to avoid unnecessary work. I wonder whether the time savings are actually significant enough to merit the developer cost of maintaining a separate option. In any case, we need to document this stuff better.

I punted on this originally, hoping that once we had some users, we would know which pipelines are most commonly used and thus need shortcuts. But last time I checked, I think everyone was using a different pipeline. Maybe we need a pipeline builder class, with chainable mutator methods and sensible defaults, so that even diverse use cases can be catered for. A possibly complementary option is to have local convenience functions, so that the kind of pipeline Parsoid generally needs would be provided by a utility class within Parsoid.
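A builder of that kind might be sketched as follows. Nothing like this exists in Remex; the class and method names are invented, and the constructor signatures inside mirror the psysh session earlier in this task.

```php
<?php
// Sketch of a hypothetical fluent pipeline builder with sensible defaults.
// None of these names exist in Remex today.
use RemexHtml\DOM\DOMBuilder;
use RemexHtml\Tokenizer\Tokenizer;
use RemexHtml\TreeBuilder\Dispatcher;
use RemexHtml\TreeBuilder\TreeBuilder;

class PipelineBuilder {
	private $tokenizerOptions = [];
	private $treeBuilderOptions = [];

	public function withTokenizerOptions( array $opts ): self {
		$this->tokenizerOptions = $opts + $this->tokenizerOptions;
		return $this;
	}

	public function withTreeBuilderOptions( array $opts ): self {
		$this->treeBuilderOptions = $opts + $this->treeBuilderOptions;
		return $this;
	}

	/** Assemble the pipeline, run it over $html, return the DOM root. */
	public function parse( string $html ) {
		$domBuilder = new DOMBuilder( function ( $msg, $pos ) {
			// No-op error callback; a real builder would make this settable.
		} );
		$treeBuilder = new TreeBuilder( $domBuilder, $this->treeBuilderOptions );
		$dispatcher = new Dispatcher( $treeBuilder );
		$tokenizer = new Tokenizer( $dispatcher, $html, $this->tokenizerOptions );
		$tokenizer->execute( [] );
		return $domBuilder->getFragment();
	}
}

// Usage sketch:
// $dom = ( new PipelineBuilder )
//     ->withTokenizerOptions( [ 'skipPreprocess' => true ] )
//     ->parse( $html );
```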

Tgr added a comment. Mar 14 2019, 12:41 AM

On the other hand, maybe there would be more users if it were easier to figure out how to use it? I think the basic use cases are fairly obvious:

  • turn a string representation of an HTML document into a DOM tree
  • replace part of a DOM tree with something given as an HTML string (i.e. do what setting innerHTML does in JavaScript; see also T217705 on that)

Those would be helpful for reusers with more complicated use cases as well, since they would serve as canonical examples of the building blocks. Currently your best bet for that is test.php, which is not particularly helpful.
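The second bullet could be sketched very roughly on top of a string-to-DOM helper using plain PHP DOM calls. `parseHtmlString()` here is a stand-in for a hypothetical string-to-DOMDocument helper (not existing API), and routing a fragment through a full-document parse and reading back <body> is only an approximation of real fragment parsing:

```php
<?php
// Rough innerHTML-style helper: replace $target's children with the nodes
// obtained by parsing $html. parseHtmlString() is a hypothetical helper
// that would return a \DOMDocument.
function setInnerHtml( \DOMElement $target, string $html ): void {
	$doc = parseHtmlString( $html ); // hypothetical helper
	while ( $target->firstChild ) {
		$target->removeChild( $target->firstChild );
	}
	// A full-document parse puts stray content into <body>; lift it out.
	$body = $doc->getElementsByTagName( 'body' )->item( 0 );
	foreach ( iterator_to_array( $body->childNodes ) as $node ) {
		$target->appendChild(
			$target->ownerDocument->importNode( $node, true )
		);
	}
}
```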

I'd say there's one other use case, and it's what Tidy does (AIUI): mutate a string representation of an HTML document in a "safe" way, without ever building the complete DOM tree in memory. That is, "safe" string-to-string transformations. There are probably lots of weird things you could do here, but I would love to see a basic "insert X into Y" (like innerHTML) or "append X to Y" utility, done in a safe way that respects tag boundaries etc. The API of https://github.com/wikimedia/html-formatter/blob/master/src/HtmlFormatter.php could be a guide; just imagine doing it string-to-string without creating an intermediate DOM.
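A string-to-string pass without a DOM is already possible by feeding tree events straight into Remex's Serializer instead of a DOMBuilder. A sketch of a round-trip helper along those lines follows; the constructor signatures match the same era of Remex as the psysh session above and should be treated as approximate:

```php
<?php
// String-to-string round trip: tokenize, tree-build, and reserialize
// without materializing a DOM. Signatures approximate; verify against
// the Remex version in use.
use RemexHtml\Serializer\HtmlFormatter;
use RemexHtml\Serializer\Serializer;
use RemexHtml\Tokenizer\Tokenizer;
use RemexHtml\TreeBuilder\Dispatcher;
use RemexHtml\TreeBuilder\TreeBuilder;

function roundTrip( string $html ): string {
	// Serializer is a TreeHandler that accumulates serialized output
	// directly, so no DOM tree is ever built.
	$serializer = new Serializer( new HtmlFormatter );
	$treeBuilder = new TreeBuilder( $serializer, [] );
	$dispatcher = new Dispatcher( $treeBuilder );
	$tokenizer = new Tokenizer( $dispatcher, $html, [] );
	$tokenizer->execute( [] );
	return $serializer->getResult();
}
```

A "safe" insert/append utility would hook into the same event stream, splicing new tokens in at the right tree position rather than doing string surgery.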

(One could also imagine a lazy "string-like" type, which was partially parsed and would let you do string append operations by feeding the right-hand side into the tokenizer. Such a thing might be helpful in incrementally porting bits of MediaWiki that do string concatenation. That was the sort of idea I was aiming at with my initial Balancer implementation.)