On large articles, Parsoid/PHP throws an assertion exception, "Bad UTF-8 (full string verification)", even though the articles contain no invalid UTF-8. The cause is that the regex used to validate the entire article as UTF-8 exceeds the pcre.recursion_limit configured in php.ini: preg_match() returns false and preg_last_error() returns PREG_RECURSION_LIMIT_ERROR, so the check fails for a reason unrelated to the encoding of the text.
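
To illustrate the distinction described above, here is a minimal sketch of a check that separates "PCRE gave up" from "the text is really not UTF-8". It is not Parsoid's actual code: the function name validateUtf8() and the default pattern '//u' are assumptions made for illustration, and the fallback relies on the mbstring extension being available.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper (not part of Parsoid): validate UTF-8 with a regex,
 * but do not treat a PCRE resource-limit failure as invalid input.
 *
 * @return array{valid: bool, pcre_error: int}
 */
function validateUtf8(string $text, string $pattern = '//u'): array
{
    // With the /u modifier, PCRE checks that the subject is valid UTF-8.
    $result = preg_match($pattern, $text);

    if ($result !== false) {
        // The regex ran to completion; a match means the check passed.
        return ['valid' => $result === 1, 'pcre_error' => PREG_NO_ERROR];
    }

    $error = preg_last_error();
    if ($error === PREG_BAD_UTF8_ERROR) {
        // PCRE itself reported malformed UTF-8 in the subject.
        return ['valid' => false, 'pcre_error' => $error];
    }

    // Any other error (for example PREG_RECURSION_LIMIT_ERROR or
    // PREG_BACKTRACK_LIMIT_ERROR) says nothing about the encoding,
    // so fall back to a check that does not use the regex engine.
    return [
        'valid' => mb_check_encoding($text, 'UTF-8'),
        'pcre_error' => $error,
    ];
}
```

Returning the PCRE error code alongside the verdict lets a caller log which limit was hit instead of reporting bad UTF-8.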
Steps to Reproduce:
- Find a large article and enable the PHP version of Parsoid (the article I was able to reproduce with had an expanded size of 150KB)
- Query the rest.php endpoint to get Parsoid to parse the page: https://<wiki.url>/w/rest.php/<wiki.url>/v3/page/html/<page_name>/<revid>
- The request fails with an exception about a failed assertion, with the text "Bad UTF-8 (full string verification)"
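
The steps above can be scripted roughly as follows. This is only a sketch: the function names parsoidHtmlUrl() and fetchParsoidHtml() are hypothetical, the host, page name, and revision ID must be filled in for your own wiki, and fetching over HTTPS assumes allow_url_fopen and the openssl extension are enabled.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: build the rest.php URL from the steps above.
 * $wikiHost stands in for <wiki.url>, which appears twice in the path.
 */
function parsoidHtmlUrl(string $wikiHost, string $pageName, int $revId): string
{
    return sprintf(
        'https://%s/w/rest.php/%s/v3/page/html/%s/%d',
        $wikiHost,
        $wikiHost,
        rawurlencode($pageName),
        $revId
    );
}

/**
 * Hypothetical helper: fetch the URL and return the response body,
 * including the body of an error response so the exception text is visible.
 */
function fetchParsoidHtml(string $url): string
{
    // ignore_errors makes file_get_contents() return the body on 4xx/5xx too.
    $context = stream_context_create(['http' => ['ignore_errors' => true]]);
    $body = file_get_contents($url, false, $context);
    if ($body === false) {
        throw new RuntimeException("Request failed: $url");
    }
    return $body;
}
```

On an affected wiki, the body returned by fetchParsoidHtml() for a large page should contain the "Bad UTF-8 (full string verification)" text rather than the rendered HTML.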