Parsoid resource limits and metrics use mb_strlen, which is (a) inconsistent w/ Parser.php and (b) slow
Closed, Resolved (Public)

Assigned To

Authored By

	cscott
	Dec 4 2019, 4:50 PM

Description

Parsoid uses mb_strlen() on the input content for metrics and resource limits.

This is O(N) on the size of the input, because it needs to scan the entire string and parse out the UTF-8 codepoints.

However, it is more "fair" to non-Latin-script wikis, which might otherwise see their resource limits being up to 4x smaller than (say) enwiki enjoys.

On the gripping hand, it is inconsistent with the legacy parser, which uses strlen() for its resource limits. This means that the legacy parser can parse/save pages which Parsoid then can't open, or vice versa (depending on how the various limits are actually set).
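
To make the mismatch concrete, here is a minimal sketch of how the same text can pass a codepoint-based limit and fail a byte-based one. The function names, the sample strings, and the limit of 2000 are all hypothetical and chosen only for illustration; they are not taken from Parsoid or Parser.php.

```php
<?php
// Hypothetical helpers, not actual Parsoid or Parser.php code.

// Byte-based check, in the style of a strlen()-based limit.
function exceedsByteLimit( string $text, int $limit ): bool {
	return strlen( $text ) > $limit;
}

// Codepoint-based check, in the style of an mb_strlen()-based limit.
function exceedsCodepointLimit( string $text, int $limit ): bool {
	return mb_strlen( $text, 'UTF-8' ) > $limit;
}

// Hypothetical limit; real limits are configured elsewhere.
$limit = 2000;

$samples = [
	'latin' => str_repeat( 'a', 1000 ),          // 1 byte per codepoint in UTF-8
	'devanagari' => str_repeat( 'क', 1000 ),     // 3 bytes per codepoint in UTF-8
	'emoji' => str_repeat( "\u{1F600}", 1000 ),  // 4 bytes per codepoint in UTF-8
];

foreach ( $samples as $name => $text ) {
	printf(
		"%-10s bytes=%d codepoints=%d over byte limit=%s over codepoint limit=%s\n",
		$name,
		strlen( $text ),
		mb_strlen( $text, 'UTF-8' ),
		exceedsByteLimit( $text, $limit ) ? 'yes' : 'no',
		exceedsCodepointLimit( $text, $limit ) ? 'yes' : 'no'
	);
}
```

With these inputs, all three samples have the same codepoint count, but only the Latin one stays under the byte-based limit. That is the situation where one parser accepts a page and the other rejects it.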

At the very least we should probably do a single mb_strlen on the expanded input size and cache that, rather than recomputing the number of Unicode codepoints multiple times. We might also figure out whether we can change some of the legacy parser limits to use mb_strlen to allow Parsoid and the legacy parser to be more compatible.
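
A rough sketch of the "compute once and cache" idea follows. The class name MeasuredText and its scan counter are hypothetical and exist only for this illustration; this is not how Parsoid is actually structured.

```php
<?php
// Hypothetical wrapper that computes the codepoint count lazily, at most once.
final class MeasuredText {
	/** @var string */
	private $text;

	/** @var int|null Cached codepoint count; null until first requested. */
	private $codepoints = null;

	/** @var int Number of full mb_strlen() scans performed (for demonstration). */
	public $scans = 0;

	public function __construct( string $text ) {
		$this->text = $text;
	}

	// strlen() reads the stored byte length of the string, so it is cheap.
	public function byteLength(): int {
		return strlen( $this->text );
	}

	// mb_strlen() has to walk the string, so only do that on the first call.
	public function codepointLength(): int {
		if ( $this->codepoints === null ) {
			$this->codepoints = mb_strlen( $this->text, 'UTF-8' );
			$this->scans++;
		}
		return $this->codepoints;
	}
}

$input = new MeasuredText( str_repeat( 'क', 10 ) );
$input->codepointLength(); // first call scans the string
$input->codepointLength(); // second call reuses the cached value
printf( "codepoints=%d bytes=%d scans=%d\n",
	$input->codepointLength(), $input->byteLength(), $input->scans );
```

Metrics and limit checks could then share one such object instead of each calling mb_strlen() on the same input.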

Details

Subject	Repo	Branch	Lines +/-
Bump parsoid to 0.14.0-a6	mediawiki/vendor	master	+321 K -378 K
Enforce wikitext limits like in the legacy parser	mediawiki/services/parsoid	master	+67 -9
Add a FIXME comment to remove some slow `mb_strlen`s in the future	mediawiki/services/parsoid	master	+7 -2

Related Objects

Mentioned In: T280381: 413 error while trying to fetch using desktop api
T275319: Change $wgMaxArticleSize limit from byte-based to character-based
T238456: Missing implementation to post Parsoid/PHP lints to production database
T239830: Add metrics for startup time for language variant code
T239643: Bugs in PHP port of LanguageConverter
Mentioned Here: T238456: Missing implementation to post Parsoid/PHP lints to production database
T239643: Bugs in PHP port of LanguageConverter
T239830: Add metrics for startup time for language variant code
rGPARb81bbf40c6d6: Bump wikimedia/langconv to 0.3.1

Event Timeline

cscott created this task. Dec 4 2019, 4:50 PM

Restricted Application added a subscriber: Aklapper. Dec 4 2019, 4:50 PM

ssastry renamed this task from "Parsoid resource limits and metrics use mb_strlen, which is (a) inconsistent w/ PHP and (b) slow" to "Parsoid resource limits and metrics use mb_strlen, which is (a) inconsistent w/ Parser.php and (b) slow". Dec 4 2019, 4:53 PM

Change 554556 had a related patch set uploaded (by C. Scott Ananian; owner: C. Scott Ananian):
[mediawiki/services/parsoid@master] Add a FIXME comment to remove some slow mb_strlens in the future

https://gerrit.wikimedia.org/r/554556

gerritbot added a project: Patch-For-Review. Dec 4 2019, 4:55 PM

Change 554556 merged by jenkins-bot:
[mediawiki/services/parsoid@master] Add a FIXME comment to remove some slow mb_strlens in the future

https://gerrit.wikimedia.org/r/554556

Maintenance_bot removed a project: Patch-For-Review. Dec 4 2019, 6:10 PM

Mentioned in SAL (#wikimedia-operations) [2019-12-04T18:45:12Z] <arlolra> Updated Parsoid to b81bbf4 (T239643, T239830, T238456, T239841)

Aklapper edited projects, added Parsoid; removed Parsoid-PHP. Apr 10 2020, 4:27 PM

ssastry moved this task from Needs Triage to Tech Debt / Big changes on the Parsoid board. Apr 10 2020, 4:49 PM

cscott mentioned this in T275319: Change $wgMaxArticleSize limit from byte-based to character-based. Mar 4 2021, 8:14 PM

Arlolra mentioned this in T280381: 413 error while trying to fetch using desktop api. May 8 2021, 2:38 PM

Kelson added a project: affects-Kiwix-and-openZIM. May 9 2021, 12:39 PM

Kelson subscribed.

Change 690029 had a related patch set uploaded (by Arlolra; author: Arlolra):

[mediawiki/services/parsoid@master] Enforce wikitext limits like in the legacy parser

https://gerrit.wikimedia.org/r/690029

gerritbot added a project: Patch-For-Review. Jun 4 2021, 6:49 PM

Kelson moved this task from TRIAGE to TOP on the affects-Kiwix-and-openZIM board. Jun 10 2021, 11:53 AM

Change 690029 merged by jenkins-bot:

[mediawiki/services/parsoid@master] Enforce wikitext limits like in the legacy parser

https://gerrit.wikimedia.org/r/690029

Maintenance_bot removed a project: Patch-For-Review. Jun 22 2021, 2:10 PM

Arlolra closed this task as Resolved. Jun 22 2021, 3:47 PM

Arlolra claimed this task.

Change 701949 had a related patch set uploaded (by C. Scott Ananian; author: C. Scott Ananian):

[mediawiki/vendor@master] Bump parsoid to 0.14.0-a6

https://gerrit.wikimedia.org/r/701949

gerritbot added a project: Patch-For-Review. Jun 28 2021, 5:16 PM

Change 701949 merged by jenkins-bot: