
Parsoid resource limits and metrics use mb_strlen, which is (a) inconsistent w/ Parser.php and (b) slow
Open, Needs Triage, Public

Description

Parsoid uses mb_strlen() on the input content for metrics and resource limits.

This is O(N) in the size of the input, because it needs to scan the entire string and decode the UTF-8 code points, whereas strlen() just returns the byte length PHP already stores with the string.

However, it is more "fair" to non-Latin-script wikis, which might otherwise see their resource limits being up to 4x smaller than (say) enwiki enjoys, since a single UTF-8 code point can take up to 4 bytes.

On the gripping hand, it is inconsistent with the legacy parser, which uses strlen() for its resource limits. That means the legacy parser can parse/save pages which Parsoid then can't open, or vice-versa (depending on how the various limits are actually set).
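
To make the mismatch concrete, here is a minimal sketch of the two kinds of check side by side. The constant `MAX_SIZE` and the functions `fitsByteLimit()` and `fitsCodePointLimit()` are hypothetical names invented for this illustration; they are not the actual limits or code paths in Parser.php or Parsoid, and the limit value is deliberately tiny.

```php
<?php
// Hypothetical limit, chosen only for illustration; real limits are far larger.
const MAX_SIZE = 10;

// Legacy-parser-style check (assumed shape): counts bytes.
function fitsByteLimit(string $text): bool {
    return strlen($text) <= MAX_SIZE;
}

// Parsoid-style check (assumed shape): counts UTF-8 code points.
function fitsCodePointLimit(string $text): bool {
    return mb_strlen($text, 'UTF-8') <= MAX_SIZE;
}

$latin = 'Parsoid';                   // 7 code points, 7 bytes
$cyrillic = 'Парсоид';                // 7 code points, 14 bytes (2 bytes each)
$emoji = str_repeat("\u{1F600}", 3);  // 3 code points, 12 bytes (4 bytes each)

foreach ([$latin, $cyrillic, $emoji] as $text) {
    printf(
        "%2d bytes, %d code points: byte limit %s, code point limit %s\n",
        strlen($text),
        mb_strlen($text, 'UTF-8'),
        fitsByteLimit($text) ? 'ok' : 'exceeded',
        fitsCodePointLimit($text) ? 'ok' : 'exceeded'
    );
}
```

With the same numeric limit, the Cyrillic and emoji strings pass the code point check but fail the byte check, while the Latin string passes both. That is the case where one parser accepts a page and the other rejects it.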

At the very least we should probably do a single mb_strlen() on the expanded input and cache the result, rather than recomputing the number of Unicode code points multiple times. We might also figure out whether we can change some of the legacy parser limits to use mb_strlen(), to make Parsoid and the legacy parser more compatible.
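
One way the "compute once and cache" idea could look is sketched below. The class `MeasuredText` and its `scans` counter are hypothetical and exist only for this illustration; this is not Parsoid's actual code, just a memoization pattern under the assumption that the text is not modified after it has been measured.

```php
<?php
// Hypothetical helper, not part of Parsoid: wraps an input string and
// computes its code point length at most once.
final class MeasuredText {
    /** @var string */
    private $text;
    /** @var int|null Cached code point count, null until first requested */
    private $codePoints = null;
    /** @var int Number of O(N) scans performed (for demonstration only) */
    public $scans = 0;

    public function __construct(string $text) {
        $this->text = $text;
    }

    // O(N) on the first call, O(1) on every later call.
    public function codePointLength(): int {
        if ($this->codePoints === null) {
            $this->codePoints = mb_strlen($this->text, 'UTF-8');
            $this->scans++;
        }
        return $this->codePoints;
    }

    // Always O(1): PHP stores the byte length with the string.
    public function byteLength(): int {
        return strlen($this->text);
    }
}

$input = new MeasuredText('Парсоид');
$input->codePointLength(); // scans the string
$input->codePointLength(); // served from the cache
printf(
    "%d code points, %d bytes, %d scan(s)\n",
    $input->codePointLength(),
    $input->byteLength(),
    $input->scans
);
```

Metrics and limit checks that share one such object would pay for the scan once per input, however many of them ask for the length.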

Event Timeline

ssastry renamed this task from "Parsoid resource limits and metrics use mb_strlen, which is (a) inconsistent w/ PHP and (b) slow" to "Parsoid resource limits and metrics use mb_strlen, which is (a) inconsistent w/ Parser.php and (b) slow". Dec 4 2019, 4:53 PM

Change 554556 had a related patch set uploaded (by C. Scott Ananian; owner: C. Scott Ananian):
[mediawiki/services/parsoid@master] Add a FIXME comment to remove some slow mb_strlens in the future

https://gerrit.wikimedia.org/r/554556

Change 554556 merged by jenkins-bot:
[mediawiki/services/parsoid@master] Add a FIXME comment to remove some slow mb_strlens in the future

https://gerrit.wikimedia.org/r/554556