Page MenuHomePhabricator

Deprecate one of the Preprocessor implementations for 1.34
Closed, ResolvedPublic


There are two implementations of the preprocessor in the mediawiki core code base. We should deprecate one of them. Why?

  • With the migration to Parsoid/PHP, we are going to be first hooking Parsoid into the preprocessor and then later replacing the legacy preprocessor entirely. Maintaining two copies of the preprocessor needlessly duplicates work (and introduces the potential for subtle bugs) in code we are ultimately going to remove anyway.
  • It is good practice according to our deprecation strategy to deprecate before removal; the Parsoid/PHP transition is going to be involved and won't necessarily provide adequate notice before certain features in one preprocessor implementation can't be supported any more (see comment on PS2 for example). Deprecating one of the implementations early in 1.33 is kinder to our downstreams and lets us identify any unnecessary use of a specific preprocessor class (like before it becomes a problem with the Parsoid port.

So, if we should deprecate one, which one should we deprecate?

  • The original reason for splitting the preprocessor seems to have been to avoid a dependency on the standard dom extension to PHP. But present-day MediaWiki already depends on the dom extension in other places: Remex-based tidy, the localisation cache, and SiteImporter for example. The dom extension is standard in PHP and enabled by default.
  • A secondary reason was that the hash-based implementation performed better in early experiments with HipHop (and there is a vague reference in the WMF configuration to iffy memory allocation ). But HHVM will shortly end support for PHP, and MediaWiki is dropping HHVM support (T192166).
  • However, the Preprocessor_DOM implementation doesn't "natively" use the DOM, instead it does the same string-based processing as Preprocessor_Hash and then runs DOMDocument#loadXML to construct a DOM tree at the end. This seems wasteful. Further, it has some limits not present in Preprocessor_Hash (see T216664).

Since the point of this exercise is to facilitate a future Parsoid port, we recommend keeping the Preprocessor_DOM implementation. It will play better with Parsoid (which is DOM-based), and is apparently faster on PHP 7 (which we are moving to: T176370).

EDIT: given the limitations of Preprocessor_DOM (and the DOM extension) my (@cscott) current recommendation would be to deprecate it and keep Preprocessor_Hash as the single implementation.

Note that we aren't committed to removing code just because we've deprecated it, although Parsoid/PHP will eventually replace the preprocessor entirely with a unified tokenizer.

Event Timeline

cscott created this task.Sep 20 2018, 12:44 PM
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptSep 20 2018, 12:44 PM

Change 460032 had a related patch set uploaded (by C. Scott Ananian; owner: C. Scott Ananian):
[mediawiki/core@master] Hard deprecate Preprocessor_Hash unless running under HHVM

cscott renamed this task from Deprecate one of the Preprocessor implementations for 1.32 to Deprecate one of the Preprocessor implementations for 1.33.Oct 18 2018, 6:36 PM
cscott updated the task description. (Show Details)

It's not happening in 1.32, we just branched that. Hopefully for 1.33!

@tstarling said (in comments on which I'm copying here):

I don't want to see this merged until after we drop HHVM support. HHVM support is not deprecated. So it's -2 from me on the basis that mere changes to the patch can't make it mergeable.
The idea of something being deprecated or not depending on details of the runtime configuration seems wrong to me.
Have any benchmarks been done on Preprocessor_Hash versus Preprocessor_DOM on recent versions of PHP?
Preprocessor_Hash continues to have the advantage that its memory usage is accounted for in the memory limit, and in memory_get_usage(), and thus in memory profilers.

So I've made this task depend on T192166: Drop HHVM support from MediaWiki.

Krinkle changed the task status from Open to Stalled.Feb 4 2019, 11:59 PM
Krinkle added a project: MW-1.33-release.
Krinkle updated the task description. (Show Details)
cscott updated the task description. (Show Details)Apr 2 2019, 5:13 PM

I was wrong, Preprocessor_DOM is a lie. At least the "DOM" part of it is -- it just runs DOMDocument::loadXML to wrap a thin veneer of DOMness around the result at the end. Furthermore, it has issues (like T216664) and the DOM extension has other further issues (T215000). Let's just kill it and keep Preprocessor_Hash instead.

Change 502578 had a related patch set uploaded (by C. Scott Ananian; owner: C. Scott Ananian):
[mediawiki/core@master] Hard deprecate Preprocessor_DOM

Change 460032 abandoned by C. Scott Ananian:
Hard deprecate Preprocessor_Hash unless running under HHVM

Abandoned in favor of Ica5d1ad5b1e677542962fc36d582a793f941155e

Krinkle added a comment.EditedJun 14 2019, 2:55 PM
From the Task description

Per the above change to wmf-config, I assume this is no longer the case, and thus the task no longer stalled?

cscott updated the task description. (Show Details)Jun 14 2019, 3:55 PM

Yes, now that has been merged, this task is no longer stalled.

cscott updated the task description. (Show Details)Jun 14 2019, 4:02 PM

I updated the patch ( ) to resolve merge conflicts since it was originally written for 1.33. Would be a good idea to deploy this (or something like it) since WMF is running Preprocessor_Hash on PHP7 now but our official release still defaults to Preprocessor_DOM on that platform.

Change 502578 merged by jenkins-bot:
[mediawiki/core@master] Hard deprecate Preprocessor_DOM

Jdforrester-WMF renamed this task from Deprecate one of the Preprocessor implementations for 1.33 to Deprecate one of the Preprocessor implementations for 1.34.Jun 14 2019, 8:54 PM
Jdforrester-WMF closed this task as Resolved.
Jdforrester-WMF assigned this task to cscott.