
Create a unified HTML entity encoder/decoder
Closed, Declined (Public)

Description

Goal: create a single code base for HTML entities decoding & encoding for MediaWiki and Parsoid.

Reason: make the code easy to maintain and ensure MediaWiki and Parsoid behave the same.

Strategy: create a code generator that translates proto-code into actual JS and PHP code. The generator's code should be well documented and avoid "magic numbers". The generator should take W3C's official named-character-reference dictionary AS-IS and generate efficient code from it.

Proof of concept: https://github.com/dan1wang/wt-entities

The proof-of-concept project works as follows:

src/entities-dict.js loads W3C's official named-entity list as-is, filters out legacy entities (e.g. &amp and &quot), and constructs lookup arrays grouped by entity length (we don't want to search an associative array with over 2,000 entries!).
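The grouping step can be sketched roughly as follows. This is a minimal illustration, not the proof-of-concept's actual code: the helper name `buildTablesByLength` and the `sample` object are hypothetical, and the sample only mimics the shape of the W3C entities.json file (keys such as `&amp;` mapping to `codepoints` and `characters`).

```javascript
// Tiny hand-written sample in the shape of the W3C entities.json file.
// (Assumption: only a handful of entries, chosen for illustration.)
const sample = {
  '&amp;': { codepoints: [38], characters: '&' },
  '&amp': { codepoints: [38], characters: '&' }, // legacy form, no semicolon
  '&lt;': { codepoints: [60], characters: '<' },
  '&nbsp;': { codepoints: [160], characters: '\u00A0' },
  '&rlm;': { codepoints: [8207], characters: '\u200F' },
};

// Hypothetical helper: drop legacy (semicolon-less) entries and group the
// remaining names by their length, so a lookup only touches one small table.
function buildTablesByLength(dict) {
  const tables = new Map(); // name length -> Map(name -> characters)
  for (const [key, value] of Object.entries(dict)) {
    if (!key.endsWith(';')) continue; // skip legacy entities
    const name = key.slice(1, -1); // strip leading '&' and trailing ';'
    if (!tables.has(name.length)) tables.set(name.length, new Map());
    tables.get(name.length).set(name, value.characters);
  }
  return tables;
}

const tables = buildTablesByLength(sample);
```

With the tables built this way, a decoder that has already scanned a candidate name can go straight to `tables.get(name.length)` instead of probing one large dictionary.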

src/build-decoder.js takes the constructed arrays and generates the code for the decoding function. The decoding code uses fast incremental parsing (similar to what you'd find in WikiPEG) instead of slower regular expressions.

The resulting code is saved to the build/ directory. The PHP and JavaScript code are nearly identical (except for how numeric entities are parsed).
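To illustrate what "incremental parsing" means here, the following is a rough sketch of a regex-free scanner for named entities only. It is not the generated decoder: `NAMED`, `MAX_NAME_LENGTH`, `isAlnum` and `decodeNamed` are hypothetical names, and the table holds just a few entries.

```javascript
// Hypothetical miniature lookup table (semicolon-terminated names only).
const NAMED = new Map([
  ['amp', '&'],
  ['lt', '<'],
  ['gt', '>'],
  ['nbsp', '\u00A0'],
]);
const MAX_NAME_LENGTH = 4; // length of the longest name in NAMED

// True for ASCII letters and digits, using plain character comparisons.
function isAlnum(ch) {
  return (ch >= '0' && ch <= '9') ||
    (ch >= 'A' && ch <= 'Z') ||
    (ch >= 'a' && ch <= 'z');
}

// Walk the input once, jumping from '&' to '&' and copying everything
// else through unchanged.
function decodeNamed(text) {
  let out = '';
  let pos = 0;
  while (pos < text.length) {
    const amp = text.indexOf('&', pos);
    if (amp === -1) break;
    out += text.slice(pos, amp);
    // Scan the candidate name, giving up once it is longer than any known name.
    let end = amp + 1;
    while (end < text.length &&
        end - amp - 1 < MAX_NAME_LENGTH &&
        isAlnum(text[end])) {
      end++;
    }
    const name = text.slice(amp + 1, end);
    if (text[end] === ';' && NAMED.has(name)) {
      out += NAMED.get(name);
      pos = end + 1; // continue after the ';'
    } else {
      out += '&'; // not a known entity: keep the ampersand literally
      pos = amp + 1;
    }
  }
  return out + text.slice(pos);
}
```

For example, `decodeNamed('a &lt; b &amp; c')` yields `a < b & c`, while `&amp` without the trailing semicolon is left untouched.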

The resulting code differs from MediaWiki's current sanitizer by the following:

  • the decoder doesn't yet recognize the two &rlm; aliases MediaWiki accepts (TODO)
  • the decoder recognizes HTML5 named entities except for legacy ones (entities without trailing semicolon and >, <, &, ") (T94603)
  • according to the HTML5 standard, all numeric references whose hex value ends in FFFE or FFFF are invalid. MediaWiki's current sanitizer rejects &#xFFFE but accepts &#x1FFFE, whereas the decoder would reject &#x1FFFE as well.
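The FFFE/FFFF rule from the last bullet can be sketched as a check on the low 16 bits of the code point. This is only an illustration under stated assumptions: `endsInFFFEorFFFF` and `decodeHexReference` are hypothetical names, and the other validity checks a real sanitizer performs (control characters, surrogates, and so on) are deliberately left out.

```javascript
const MAX_CODE_POINT = 0x10FFFF; // highest valid Unicode code point

// True when the code point's low 16 bits are FFFE or FFFF,
// e.g. U+FFFE, U+FFFF, U+1FFFE, U+10FFFF.
function endsInFFFEorFFFF(codePoint) {
  return (codePoint & 0xFFFF) >= 0xFFFE;
}

// Hypothetical helper: takes the hex digits of a reference such as
// "&#x1FFFE;" (here "1FFFE") and returns the character, or null if rejected.
function decodeHexReference(hexDigits) {
  if (!/^[0-9a-fA-F]+$/.test(hexDigits)) return null;
  const cp = parseInt(hexDigits, 16);
  if (cp > MAX_CODE_POINT || endsInFFFEorFFFF(cp)) return null;
  return String.fromCodePoint(cp);
}
```

Under this check `FFFE` and `1FFFE` are both rejected, which is the behavior described for the decoder, while ordinary values such as `41` still decode.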

The project is a work in progress. I've tested the JavaScript code. I don't know PHP and I don't know how to test the PHP code. The generated code passes the PHP linter, though.

The project doesn't do any encoding yet (Parsoid needs this).

Event Timeline

Change 514648 had a related patch set uploaded (by Dan1wang; owner: Dan1wang):
[sandbox@master] unified HTML entity encoder/decoder (T225049)

https://gerrit.wikimedia.org/r/514648

Thanks @Dan1wang. This looks like a potentially interesting idea / project. Now that Parsoid itself has been ported to PHP, the JS/PHP issue has gone away and we don't have a need for this. So, I am inclined to close this out. I'm tagging a bunch of others as an FYI in case they know of any other places where this might potentially have utility. cc @Krinkle @Pchelolo @Tgr @bearND @Mholloway. I'm going to leave this open for a week.

Why can't we use domino and/or remex? They both have these already.

Note that *wikitext* entities are different from *html5* entities, and these differences go deeper than just which names are allowed. The trailing semicolon rules are very different.

For JS code, there's already the npm html-entities project, which Parsoid uses and is built directly from the W3C spec. So I'm not entirely sure which problem you're trying to solve?

Also, fwiw: the PHP entity decoder in Remex was extensively optimized; I'd be impressed if you managed to make something faster. It uses a table generated directly from https://www.w3.org/TR/2014/REC-html5-20141028/syntax.html#tokenizing-character-references

It has been a while since I looked at Remex and domino code and so if they already cover these bases, I'm just going to decline this then.

Change 514648 abandoned by Aklapper:
[sandbox@master] unified HTML entity encoder/decoder (T225049)

Reason:
No reply from patch author; unfortunately abandoning per last comment

https://gerrit.wikimedia.org/r/514648