Use the new Tidy in the MW parser to perform HTML 5 parsing/reserialization.
Closed, Declined · Public

Description

Broken off from T89331 to see if we can work with the developers of the newer tidy effort to get a DOM from tidy.

See https://github.com/htacg/tidy-html5/issues/277

Event Timeline

MarkAHershberger updated the task description.
MarkAHershberger raised the priority of this task to Needs Triage.
MarkAHershberger added a subscriber: MarkAHershberger.
Restricted Application added a subscriber: Aklapper. Sep 28 2015, 5:00 PM
cscott added a subscriber: cscott. Sep 28 2015, 5:19 PM

Let's see if upstream is interested in this at all, first.

There are much better solutions for PHP HTML5 parsers, and I think using a pure-PHP solution in core would be preferable. (Although we'd want to replace it with a native solution for performance in production.)

Krenair added a subscriber: Krenair.

Let's see if upstream is interested in this at all, first.

Right.

There are much better solutions for PHP HTML5 parsers, and I think using a pure-PHP solution in core would be preferable. (Although we'd want to replace it with a native solution for performance in production.)

Right. There are many cases where a pure-PHP solution is preferable, but if we can get a native solution, that would probably be easier to deploy for many people than a Java-based service.

We have a response!

But, yes, given a stream or buffer to parse tidy produces an internal DOM tree. As the tidy.h sample code dumpNode() shows, you can iterate through all the nodes... if this is what you want... and build your own tree from that...

But there are presently no internal DOM tree services - to create nodes, append or insert nodes, delete nodes, deal with attributes, create, add, delete, sort, and add a text node, etc, etc - exposed. They are all private, internal to the library...

I suppose given a use case such services could be added to the API, but this would not be an easy task, like all the text is not actually kept in the node tree, but rather as offsets into a lexer buffer... which is part of the document...
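To make the quoted point concrete, here is a minimal sketch of what "iterate through all the nodes and build your own tree" could look like when node text lives as offsets into a shared buffer. This is not libtidy's API: `LexerNode`, `OwnNode`, and `copy_tree` are hypothetical names invented for illustration, and the read-only tree is hand-built rather than produced by a real parser.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


# Hypothetical stand-in for a parser-internal, read-only node.
# Text is not stored on the node itself; a text node only records a
# (start, end) span into a shared lexer buffer.
@dataclass(frozen=True)
class LexerNode:
    name: Optional[str]                      # element name, None for text
    span: Optional[Tuple[int, int]] = None   # offsets into the buffer
    children: Tuple["LexerNode", ...] = ()


# Hypothetical caller-owned node that can be edited freely.
@dataclass
class OwnNode:
    name: Optional[str]
    text: str = ""
    children: List["OwnNode"] = field(default_factory=list)


def copy_tree(node: LexerNode, buffer: str) -> OwnNode:
    """Walk the read-only tree depth-first and build an independent copy."""
    # Resolve the offsets into real text while copying.
    text = buffer[node.span[0]:node.span[1]] if node.span else ""
    own = OwnNode(node.name, text)
    for child in node.children:
        own.children.append(copy_tree(child, buffer))
    return own


# Hand-built example tree for: <p>Hello <b>world</b></p>
buffer = "<p>Hello <b>world</b></p>"
internal = LexerNode("p", children=(
    LexerNode(None, span=(3, 9)),
    LexerNode("b", children=(LexerNode(None, span=(12, 17)),)),
))

tree = copy_tree(internal, buffer)
# The copy is mutable; the buffer and the internal tree are untouched.
tree.children[1].children[0].text = "wiki"
```

The copy step is where the cost described above shows up: any edit requires first materializing text out of the buffer into caller-owned nodes, because the internal tree offers no way to change it in place.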

Please go read the full response and my reply. https://github.com/htacg/tidy-html5/issues/277

Oh, I see: you are fighting the services war.

MarkAHershberger added a comment. Edited Sep 30 2015, 3:08 PM

Oh, I see: you are fighting the services war.

I realize responding here would not be productive. I've posted a response on my blog. Future discussion on that bit should go there, not on this task.

tstarling added a subscriber: tstarling. Edited Oct 1 2015, 3:51 AM

Oh, I see: you are fighting the services war.

I don't think that is really a helpful comment. As I reported on T89331, I investigated a number of non-service solutions and will definitely implement at least one.

Mark: I explained why I don't want to use tidy on the mailing list, and Gabriel and cscott tried to explain it to you directly on T89331. This is not about service versus non-service, this is about having a fixed, well-specified markup language versus having a markup language which is defined by its only implementation.

The reason we're not asking the tidy-html5 project to cater to our goals is because their goals conflict with our goals, and there are already a lot of HTML 5 parser implementations which meet our goals. Tidy is a nice tool for taking human-authored HTML and turning it into aesthetically-pleasing, standards-compliant HTML which roughly reflects the author's intention. That's just not the problem we're trying to solve. We need precision and stability.

tstarling writes:

The reason we're not asking the tidy-html5 project to cater to our
goals is because their goals conflict with our goals, and there are
already a lot of HTML 5 parser implementations which meet our goals.

Thanks for clarifying. I've responded in more depth on the tidy-html5
issue tracker. As I pointed out there, there is some confusion in the
communication about needs. Others have said "We need a DOM" so that was
my focus.

I understand that goals differ.

However, as a member of MW Stakeholders, I am aware of other MW users'
goals and resources. My hope was that tidy-html5 could help other MW
users meet their goals while not requiring too much of an increased
investment in their resources.

I apologize for my apparently misguided approach.

MarkAHershberger closed this task as Declined. Oct 5 2015, 5:04 PM
MarkAHershberger claimed this task.

Lack of interest in using libtidy, and lack of interest from the libtidy devs in accommodating the needs of MW.