
Remex could use some helper/utility classes
Open, Needs Triage, Public

Description

Ideally there should be helper classes for typical Remex use cases. In particular, loading HTML from a string or a file should be a one liner (currently it's 10ish lines of not-easily-discoverable code ).
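For illustration only, such a helper might look like the sketch below. `HtmlHelper`, `parseHtml()`, and `parseHtmlFile()` are invented names, not existing Remex API; only the Tokenizer → Dispatcher → TreeBuilder → DOMBuilder wiring reflects the real pipeline, and constructor signatures may differ between Remex versions.

```php
<?php
// Hypothetical one-liner helper -- class and method names invented for
// illustration; the pipeline wiring follows the usual Remex setup.
use RemexHtml\DOM\DOMBuilder;
use RemexHtml\Tokenizer\Tokenizer;
use RemexHtml\TreeBuilder\Dispatcher;
use RemexHtml\TreeBuilder\TreeBuilder;

class HtmlHelper {
	/** Parse an HTML string into a DOM tree with a single call. */
	public static function parseHtml( string $html ) {
		$domBuilder = new DOMBuilder( function ( $msg, $pos ) {
			// Ignore parse errors; a real helper might collect them.
		} );
		$treeBuilder = new TreeBuilder( $domBuilder, [] );
		$dispatcher = new Dispatcher( $treeBuilder );
		$tokenizer = new Tokenizer( $dispatcher, $html, [] );
		$tokenizer->execute( [] );
		return $domBuilder->getFragment();
	}

	/** Load and parse an HTML file. */
	public static function parseHtmlFile( string $path ) {
		return self::parseHtml( file_get_contents( $path ) );
	}
}
```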

Event Timeline

cscott created this task. Mar 7 2019, 4:45 PM
cscott updated the task description. Mar 7 2019, 4:45 PM
Tgr added a subscriber: Tgr. Mar 7 2019, 4:58 PM

I was trying to figure out the reason for the difference reported in failing test #1 in https://gerrit.wikimedia.org/r/c/mediawiki/services/parsoid/+/494253/10//COMMIT_MSG.
I suspected one of (a) file reading (b) Remex parsing (c) XML Serializer.

So, I started with an input HTML file containing <span>\r</span> (that is, the single \r character, not a backslash followed by r).

After ruling out (a) and (c), I was looking at (b). But I was getting different results with the existing test scripts, and after a bunch of experimentation I finally traced it to a difference in tokenizer options.

I started with https://github.com/wikimedia/remex-html/blob/fa8a6a6b491b2f482e4c237cc345f0439bdbf6a0/bin/test.php#L98-L110. But if I passed in the tokenizer options from https://github.com/wikimedia/remex-html/blob/fa8a6a6b491b2f482e4c237cc345f0439bdbf6a0/bin/test.php#L217-L222, the \r was not converted to a \n. More specifically, it seems to be tied to 'skipPreprocess' => true. I haven't looked at why \r handling varies with this option (i.e. whether this is a bug or a feature).

Anyway, this bug reiterates the importance of this phab task.

See psysh session below.

>>> require 'vendor/autoload.php'; '';
=> ""
>>> $error = function ( $msg, $pos ) { }; '';
=> ""
>>> $html = '<span>^M</span>'; 
=> "<span>\r</span>"
>>> $formatter = new \RemexHtml\Serializer\HtmlFormatter; '';
=> ""
>>> $domBuilder = new \RemexHtml\DOM\DOMBuilder( $error ); '';
=> ""
>>> $serializer = new \RemexHtml\DOM\DOMSerializer( $domBuilder, $formatter ); '';
=> ""
>>> $treeBuilder = new \RemexHtml\TreeBuilder\TreeBuilder( $serializer, [] ); '';
=> ""
>>> $dispatcher = new \RemexHtml\TreeBuilder\Dispatcher( $treeBuilder ); '';
=> ""
>>> $tokenizerOptions = [ 'skipPreprocess' => true ]; '';
=> ""
>>> $tokenizer = new \RemexHtml\Tokenizer\Tokenizer( $dispatcher, $html, $tokenizerOptions ); '';
=> ""
>>> $tokenizer->execute( [] ); '';
=> ""
>>> $serializer->getResult();
=> "<html><head></head><body><span>\r</span></body></html>"
>>> 
>>> $tokenizerOptions = []; '';
=> ""
>>> $tokenizer = new \RemexHtml\Tokenizer\Tokenizer( $dispatcher, $html, $tokenizerOptions ); '';
=> ""
>>> $tokenizer->execute( [] ); '';
=> ""
>>> $serializer->getResult();
=> """
   <html><head></head><body><span>\n
   </span></body></html>
   """

> More specifically, it seems to be tied to 'skipPreprocess' => true. I haven't looked at why \r handling varies with this option (i.e. whether this is a bug or a feature).

Feature. See the docs and the normalization performed in the code.


Just to complete the tangent: this is exactly what the spec mandates (https://html.spec.whatwg.org/multipage/parsing.html#preprocessing-the-input-stream), but it also underscores the benefit of utility classes for dealing with common use cases.
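The preprocessing step the spec describes amounts to newline normalization before tokenization. A minimal PHP sketch of that step (not Remex's actual implementation, which also handles other preprocessing concerns):

```php
<?php
// Input-stream preprocessing per the HTML spec's newline normalization:
// replace CRLF pairs, then any remaining lone CR, with LF.
function preprocessInputStream( string $html ): string {
	return str_replace( [ "\r\n", "\r" ], "\n", $html );
}
```

With 'skipPreprocess' => true, this step is bypassed, which is why the \r survives in the first psysh run above.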

cscott added a comment. Edited Mar 10 2019, 9:16 PM

T217849: Remex needs documentation of how to use its API as well. The spec requires \r stripping, but IIRC MW also does \r stripping, so we're guaranteed that any article we fetch from the DB already has carriage returns stripped, which is why there's an optimization in Remex to avoid unnecessary work. I wonder whether the time savings are actually significant enough to merit the developer cost of maintaining a separate option. In any case, we need to document this stuff better.

I punted on this originally, hoping that once we had some users, we would know which pipelines are most commonly used and thus need shortcuts. But last time I checked, I think everyone was using a different pipeline. Maybe we need a pipeline builder class, with chainable mutator methods and sensible defaults, so that even diverse use cases can be catered for. A possibly complementary option is to have local convenience functions, so that the kind of pipeline Parsoid generally needs would be provided by a utility class within Parsoid.
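A builder of that kind might be sketched as follows. Nothing like this exists in Remex; the class and method names are invented, and the constructor signatures inside mirror the psysh session earlier in this task.

```php
<?php
// Sketch of a hypothetical fluent pipeline builder with sensible defaults.
// None of these names exist in Remex today.
use RemexHtml\DOM\DOMBuilder;
use RemexHtml\Tokenizer\Tokenizer;
use RemexHtml\TreeBuilder\Dispatcher;
use RemexHtml\TreeBuilder\TreeBuilder;

class PipelineBuilder {
	private $tokenizerOptions = [];
	private $treeBuilderOptions = [];

	public function withTokenizerOptions( array $opts ): self {
		$this->tokenizerOptions = $opts + $this->tokenizerOptions;
		return $this;
	}

	public function withTreeBuilderOptions( array $opts ): self {
		$this->treeBuilderOptions = $opts + $this->treeBuilderOptions;
		return $this;
	}

	/** Assemble the pipeline, run it over $html, return the DOM root. */
	public function parse( string $html ) {
		$domBuilder = new DOMBuilder( function ( $msg, $pos ) {
			// No-op error callback; a real builder would make this settable.
		} );
		$treeBuilder = new TreeBuilder( $domBuilder, $this->treeBuilderOptions );
		$dispatcher = new Dispatcher( $treeBuilder );
		$tokenizer = new Tokenizer( $dispatcher, $html, $this->tokenizerOptions );
		$tokenizer->execute( [] );
		return $domBuilder->getFragment();
	}
}

// Usage sketch:
// $dom = ( new PipelineBuilder )
//     ->withTokenizerOptions( [ 'skipPreprocess' => true ] )
//     ->parse( $html );
```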

Tgr added a comment. Mar 14 2019, 12:41 AM

On the other hand, maybe there would be more users if it were easier to figure out how to use it? I think the basic use cases are fairly obvious:

  • turn a string representation of an HTML document into a DOM tree
  • replace part of a DOM tree with something given as an HTML string (i.e. do what setting innerHTML does in JavaScript; see also T217705 on that)

Those would be helpful for reusers with more complicated use cases as well, since they would serve as canonical examples of the building blocks. Currently your best bet for that is test.php, which is not particularly helpful.
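The second bullet could be sketched very roughly on top of a string-to-DOM helper using plain PHP DOM calls. `parseHtmlString()` here is a stand-in for a hypothetical string-to-DOMDocument helper (not existing API), and routing a fragment through a full-document parse and reading back <body> is only an approximation of real fragment parsing:

```php
<?php
// Rough innerHTML-style helper: replace $target's children with the nodes
// obtained by parsing $html. parseHtmlString() is a hypothetical helper
// that would return a \DOMDocument.
function setInnerHtml( \DOMElement $target, string $html ): void {
	$doc = parseHtmlString( $html ); // hypothetical helper
	while ( $target->firstChild ) {
		$target->removeChild( $target->firstChild );
	}
	// A full-document parse puts stray content into <body>; lift it out.
	$body = $doc->getElementsByTagName( 'body' )->item( 0 );
	foreach ( iterator_to_array( $body->childNodes ) as $node ) {
		$target->appendChild(
			$target->ownerDocument->importNode( $node, true )
		);
	}
}
```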

I'd say there's one other use case, and it's what Tidy does (AIUI): mutate a string representation of an HTML document in a "safe" way, without ever building the complete DOM tree in memory. That is, "safe" string-to-string transformations. There are probably lots of weird things you could do here, but I would love to see a basic "insert X into Y" (like innerHTML) or "append X to Y" utility, done in a safe way that respects tag boundaries etc. The API of https://github.com/wikimedia/html-formatter/blob/master/src/HtmlFormatter.php could be a guide; just imagine doing it string-to-string without creating an intermediate DOM.
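A string-to-string pass without a DOM is already possible by feeding tree events straight into Remex's Serializer instead of a DOMBuilder. A sketch of a round-trip helper along those lines follows; the constructor signatures match the same era of Remex as the psysh session above and should be treated as approximate:

```php
<?php
// String-to-string round trip: tokenize, tree-build, and reserialize
// without materializing a DOM. Signatures approximate; verify against
// the Remex version in use.
use RemexHtml\Serializer\HtmlFormatter;
use RemexHtml\Serializer\Serializer;
use RemexHtml\Tokenizer\Tokenizer;
use RemexHtml\TreeBuilder\Dispatcher;
use RemexHtml\TreeBuilder\TreeBuilder;

function roundTrip( string $html ): string {
	// Serializer is a TreeHandler that accumulates serialized output
	// directly, so no DOM tree is ever built.
	$serializer = new Serializer( new HtmlFormatter );
	$treeBuilder = new TreeBuilder( $serializer, [] );
	$dispatcher = new Dispatcher( $treeBuilder );
	$tokenizer = new Tokenizer( $dispatcher, $html, [] );
	$tokenizer->execute( [] );
	return $serializer->getResult();
}
```

A "safe" insert/append utility would hook into the same event stream, splicing new tokens in at the right tree position rather than doing string surgery.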

(One could also imagine a lazy "string-like" type, which was partially parsed and would let you do string append operations by feeding the right-hand side into the tokenizer. Such a thing might be helpful in incrementally porting bits of MediaWiki that do string concatenation. That was the sort of idea I was aiming at with my initial Balancer implementation.)