There are two implementations of the preprocessor in the MediaWiki core code base: Preprocessor_DOM and Preprocessor_Hash. We should deprecate one of them. Why?
- With the migration to Parsoid/PHP, we are going to be first hooking Parsoid into the preprocessor and then later replacing the legacy preprocessor entirely. Maintaining two copies of the preprocessor needlessly duplicates work (and introduces the potential for subtle bugs) in code we are ultimately going to remove anyway.
- Our deprecation strategy calls for deprecating before removal. The Parsoid/PHP transition is going to be involved, and it won't necessarily provide adequate notice before certain features in one preprocessor implementation can no longer be supported (see the comment on PS2 of https://gerrit.wikimedia.org/r/418198 for an example). Deprecating one of the implementations early, in 1.33, is kinder to our downstreams and lets us identify any unnecessary use of a specific preprocessor class (like https://gerrit.wikimedia.org/r/460200) before it becomes a problem for the Parsoid port.
So, if we should deprecate one, which one should we deprecate?
- The original reason for splitting the preprocessor seems to have been to avoid a dependency on the standard dom extension to PHP. But present-day MediaWiki already depends on the dom extension in other places: Remex-based tidy, the localisation cache, and SiteImporter for example. The dom extension is standard in PHP and enabled by default.
- A secondary reason was that the hash-based implementation performed better in early experiments with HipHop (and there is a vague reference in the WMF configuration to iffy memory allocation). But HHVM will shortly end support for PHP, and MediaWiki is dropping HHVM support (T192166).
- However, the Preprocessor_DOM implementation doesn't "natively" use the DOM. Instead, it does the same string-based processing as Preprocessor_Hash and then runs DOMDocument#loadXML to construct a DOM tree at the end. This seems wasteful. Further, it has some limits not present in Preprocessor_Hash (see T216664).
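To make the "string first, DOM afterwards" shape concrete, here is a toy sketch in Python. It is not MediaWiki code: the function name `scan_to_xml`, the `<root>`/`<template>` element names, and the deliberately simplistic handling of `{{...}}` (no nesting, no arguments) are all assumptions made for illustration. It only shows the pattern described above, where a string-based pass serializes its result as XML text and a separate parse then rebuilds that text into a DOM tree.

```python
from xml.dom.minidom import parseString
from xml.sax.saxutils import escape


def scan_to_xml(wikitext: str) -> str:
    """Toy string-based pass: wrap each non-nested {{...}} span in <template>."""
    out = ["<root>"]
    pos = 0
    while True:
        start = wikitext.find("{{", pos)
        end = wikitext.find("}}", start + 2) if start != -1 else -1
        if start == -1 or end == -1:
            # No further complete template: emit the rest as escaped text.
            out.append(escape(wikitext[pos:]))
            break
        out.append(escape(wikitext[pos:start]))
        out.append("<template>" + escape(wikitext[start + 2:end]) + "</template>")
        pos = end + 2
    out.append("</root>")
    return "".join(out)


# Step 1: the scanner has already done all the structural work, as a string.
xml_text = scan_to_xml("a {{b}} c & d")

# Step 2: the same structure is parsed a second time to obtain a DOM tree.
doc = parseString(xml_text)
templates = doc.getElementsByTagName("template")
```

In this sketch the second step recovers nothing the first step did not already know, which is the sense in which building the tree directly (as a hash-style implementation does) avoids a serialize-then-parse round trip.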
Since the point of this exercise is to facilitate a future Parsoid port, we recommend keeping the Preprocessor_DOM implementation. It will play better with Parsoid (which is DOM-based), and is apparently faster on PHP 7 (which we are moving to: T176370).
EDIT: Given the limitations of Preprocessor_DOM (and of the DOM extension), my (@cscott) current recommendation is to deprecate it and keep Preprocessor_Hash as the single implementation.
Note that we aren't committed to removing code just because we've deprecated it, although Parsoid/PHP will eventually replace the preprocessor entirely with a unified tokenizer.