
Spurious Bad UTF-8 (full string verification) on large articles with Parsoid/PHP
Closed, Resolved | Public | BUG REPORT

Description

On large articles, Parsoid/PHP throws an assertion exception, "Bad UTF-8 (full string verification)", even though the articles contain no invalid UTF-8. The cause is that the regex used to validate the entire article as UTF-8 exceeds the pcre.recursion_limit configured in php.ini: preg_match() returns false and preg_last_error() returns PREG_RECURSION_LIMIT_ERROR.
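
To illustrate the distinction described above, here is a minimal sketch of how a caller can tell "the regex engine gave up" apart from "the string did not match". The helper name describeMatch() is hypothetical and is not Parsoid code; it uses only the standard preg_match() and preg_last_error() functions and the PREG_* error constants.

```php
<?php
// Hypothetical helper (not part of Parsoid): run a regex and report whether
// a failure came from the PCRE engine rather than from the subject string.
function describeMatch(string $pattern, string $subject): string {
    $result = preg_match($pattern, $subject);
    if ($result === false) {
        // false means the engine aborted; preg_last_error() says why.
        switch (preg_last_error()) {
            case PREG_RECURSION_LIMIT_ERROR:
                return 'engine error: recursion limit';
            case PREG_BACKTRACK_LIMIT_ERROR:
                return 'engine error: backtrack limit';
            case PREG_JIT_STACKLIMIT_ERROR:
                return 'engine error: JIT stack limit';
            case PREG_BAD_UTF8_ERROR:
                return 'engine error: bad UTF-8';
            default:
                return 'engine error: other';
        }
    }
    // 1 means the pattern matched, 0 means it did not.
    return $result === 1 ? 'match' : 'no match';
}
```

A validator that treats every false return as "invalid input" would report bad UTF-8 in the recursion-limit case, which matches the symptom in this report.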

Steps to Reproduce:

  1. Find a large article and enable the PHP version of Parsoid (the article I reproduced this with had an expanded size of 150KB)
  2. Query the rest.php endpoint to get Parsoid to parse the page: https://<wiki.url>/w/rest.php/<wiki.url>/v3/page/html/<page_name>/<revid>
  3. The request throws a failed-assertion exception with the text "Bad UTF-8 (full string verification)"

Event Timeline

Change 656596 had a related patch set uploaded (by Skizzerz; owner: Skizzerz):
[mediawiki/services/parsoid@master] Modify UTF-8 regex to use builtin PCRE validation

https://gerrit.wikimedia.org/r/656596
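
For readers unfamiliar with what "builtin PCRE validation" can look like, the following is a rough sketch and not the content of the patch above. It assumes the idea is to let the `u` modifier validate the subject instead of matching a hand-written UTF-8 pattern over the whole string; the helper name isValidUtf8() is hypothetical.

```php
<?php
// Hypothetical helper (not the merged Parsoid change): rely on PCRE's own
// UTF-8 check, which runs on the whole subject when the u modifier is set.
function isValidUtf8(string $s): bool {
    // The empty pattern matches any valid subject, so a result of 1 means
    // the subject passed validation. On invalid UTF-8, preg_match() returns
    // false and preg_last_error() returns PREG_BAD_UTF8_ERROR.
    return preg_match('//u', $s) === 1;
}
```

Because the pattern itself is empty, the outcome depends on the validity check rather than on how much matching work the subject requires. Where the mbstring extension is available, mb_check_encoding($s, 'UTF-8') is another way to ask the same question.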

ssastry triaged this task as Medium priority. Jan 25 2021, 11:43 PM
ssastry moved this task from Needs Triage to Current & Upcoming Work on the Parsoid board.

Change 656596 merged by jenkins-bot:
[mediawiki/services/parsoid@master] Modify UTF-8 regex to use builtin PCRE validation

https://gerrit.wikimedia.org/r/656596

ssastry assigned this task to Skizzerz.

Change 666213 had a related patch set uploaded (by C. Scott Ananian; owner: C. Scott Ananian):
[mediawiki/vendor@master] Bump wikimedia/parsoid to 0.13.0-a25

https://gerrit.wikimedia.org/r/666213

Change 666213 merged by jenkins-bot:
[mediawiki/vendor@master] Bump wikimedia/parsoid to 0.13.0-a25

https://gerrit.wikimedia.org/r/666213