Goal: create a single code base for HTML entity decoding and encoding, shared by MediaWiki and Parsoid.
Reason: make the code easier to maintain and guarantee that MediaWiki and Parsoid behave identically.
Strategy: create a code generator that translates proto-code into actual JS and PHP code. The generator's code should be well-documented and avoid any "magic numbers". The generator should take W3C's official named-character-reference dictionary as-is and construct efficient code from it.
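To illustrate the "no magic numbers" idea, here is a toy sketch of how such a generator can derive every constant from the dictionary and interpolate it into both target languages. The data, function names, and emitted variable names below are all illustrative, not the project's actual proto-code format:

```javascript
// Toy generator fragment: the longest-entity-name constant is computed
// from the data, never hard-coded, and the same derived value is
// emitted into both the JS and the PHP target.
const names = ["amp", "copy", "CounterClockwiseContourIntegral"];
const maxLen = Math.max(...names.map((n) => n.length));

function emitDecoderHeader(lang) {
  if (lang === "js") return `const MAX_NAME_LEN = ${maxLen};`;
  if (lang === "php") return `$maxNameLen = ${maxLen};`;
  throw new Error(`unsupported target: ${lang}`);
}
// emitDecoderHeader("js")  -> "const MAX_NAME_LEN = 31;"
// emitDecoderHeader("php") -> "$maxNameLen = 31;"
```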
Proof of concept: https://github.com/dan1wang/wt-entities
The proof-of-concept project works as follows:
src/entities-dict.js loads W3C's official named-entity list as-is, filters out the legacy entities (the semicolon-less forms such as &amp and &quot), and constructs lookup arrays keyed by entity-name length (we don't want to probe an associative array with over 2,000 entries for every lookup!).
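The bucketing step can be sketched as follows. This is a minimal sketch that inlines a few sample entries in the shape of W3C's entities.json instead of loading the full file, and buildLengthBuckets is an illustrative name, not the project's actual API:

```javascript
// A few entries in the shape of W3C's official entities.json
// (https://html.spec.whatwg.org/entities.json): keys keep the leading
// "&", and the legacy forms are the ones lacking the trailing ";".
const w3cEntities = {
  "&amp;":  { codepoints: [38],  characters: "&" },
  "&amp":   { codepoints: [38],  characters: "&" },      // legacy -- filtered out
  "&copy;": { codepoints: [169], characters: "\u00A9" },
  "&copy":  { codepoints: [169], characters: "\u00A9" }, // legacy -- filtered out
  "&CounterClockwiseContourIntegral;": { codepoints: [8755], characters: "\u2233" },
};

// Group the kept entities by name length (without "&" and ";") so the
// decoder only compares candidates of the right length instead of
// probing one 2,000+ entry map on every "&" it meets.
function buildLengthBuckets(entities) {
  const buckets = {}; // name length -> { name: replacementString }
  for (const key of Object.keys(entities)) {
    if (!key.endsWith(";")) continue; // drop legacy, semicolon-less forms
    const name = key.slice(1, -1);    // strip "&" and ";"
    if (!buckets[name.length]) buckets[name.length] = {};
    buckets[name.length][name] = entities[key].characters;
  }
  return buckets;
}

const buckets = buildLengthBuckets(w3cEntities);
// buckets[3].amp is "&", buckets[4].copy is "\u00A9", and the legacy
// "&amp" / "&copy" keys are gone.
```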
src/build-decoder.js takes the constructed arrays and generates the code for the decoding function. The generated decoder uses fast incremental parsing (similar to what you'd find in WikiPEG) instead of slower regular expressions.
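A hand-written sketch of that incremental-parsing approach: scan for "&", try a numeric reference, otherwise look the name up in a length-keyed bucket. The actual build/ output is machine-generated and embeds the full table; the tiny BUCKETS table and the function name here are illustrative only:

```javascript
// Minimal length-bucketed lookup table; the real generated decoder
// embeds the full non-legacy HTML5 set built by src/entities-dict.js.
const BUCKETS = {
  2: { lt: "<", gt: ">" },
  3: { amp: "&" },
  4: { quot: '"', copy: "\u00A9" },
};
const MAX_NAME_LEN = 4; // the full table's longest name is 31 chars

function decodeEntities(text) {
  let out = "";
  let i = 0;
  for (;;) {
    const amp = text.indexOf("&", i);
    if (amp === -1) return out + text.slice(i);
    out += text.slice(i, amp);
    i = amp + 1;
    if (text[i] === "#") {
      // Numeric reference: "#", optional "x"/"X", digits, then ";".
      let j = i + 1;
      const hex = text[j] === "x" || text[j] === "X";
      if (hex) j++;
      const start = j;
      const digitRe = hex ? /[0-9a-fA-F]/ : /[0-9]/;
      while (j < text.length && digitRe.test(text[j])) j++;
      if (j > start && text[j] === ";") {
        const cp = parseInt(text.slice(start, j), hex ? 16 : 10);
        if (cp <= 0x10ffff) { // beyond Unicode's range is never valid
          out += String.fromCodePoint(cp);
          i = j + 1;
          continue;
        }
      }
    } else {
      // Named reference: find ";" and look up by name length, so we
      // only compare against candidates of exactly the right length.
      const semi = text.indexOf(";", i);
      const len = semi - i;
      if (semi !== -1 && len > 0 && len <= MAX_NAME_LEN) {
        const bucket = BUCKETS[len];
        const repl = bucket && bucket[text.slice(i, semi)];
        if (repl !== undefined) {
          out += repl;
          i = semi + 1;
          continue;
        }
      }
    }
    out += "&"; // not a recognized reference; keep the "&" literally
  }
}
```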
The resulting code is saved to the build/ directory. The PHP and JavaScript versions are nearly identical (except for how numeric entities are parsed).
The resulting code differs from MediaWiki's current sanitizer in the following ways:
- the decoder doesn't yet recognize the two non-ASCII aliases for &rlm; that MediaWiki accepts (TODO)
- the decoder recognizes all HTML5 named entities except the legacy ones, i.e. the forms without a trailing semicolon such as &gt, &lt, &amp, and &quot (T94603)
- according to the HTML5 standard, all numeric references whose code points end in hex FFFE or FFFF are invalid (they are noncharacters). MediaWiki's current sanitizer rejects &#xFFFE; and &#xFFFF; but accepts the equivalents in higher planes (e.g. &#x1FFFE;), whereas the decoder would reject those as well.
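The noncharacter check described in the last point is cheap to express. The sketch below is illustrative (the function name is mine, not the generated code's); it also covers the U+FDD0..U+FDEF range, which the HTML spec lists as noncharacters alongside the FFFE/FFFF endings:

```javascript
// A code point is a noncharacter if it lies in U+FDD0..U+FDEF or if
// its low 16 bits are FFFE or FFFF -- that single bitmask catches
// U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, ... up through U+10FFFF,
// covering every plane, not just plane 0.
function isNoncharacter(cp) {
  if (cp >= 0xfdd0 && cp <= 0xfdef) return true;
  return (cp & 0xfffe) === 0xfffe;
}
```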
The project is a work in progress. I've tested the JavaScript code, but I don't know PHP, so I haven't been able to test the generated PHP code; it does pass a PHP linter, though.
The project doesn't do any encoding yet (Parsoid needs this).