CirrusSearch jobs sometimes fail with "RemexHtml\Tokenizer\Tokenizer: pcre.backtrack_limit exhausted"
Closed, Resolved · Public

Description

For https://beta.wikiversity.org/wiki/Repeating_Decimals_(1/99998999999999900001)

#0 /srv/mediawiki/php-1.32.0-wmf.15/vendor/wikimedia/remex-html/RemexHtml/Tokenizer/Tokenizer.php(866): RemexHtml\Tokenizer\Tokenizer->throwPregError()
#1 /srv/mediawiki/php-1.32.0-wmf.15/vendor/wikimedia/remex-html/RemexHtml/Tokenizer/Tokenizer.php(1011): RemexHtml\Tokenizer\Tokenizer->handleCharRefs(string, integer)
#2 /srv/mediawiki/php-1.32.0-wmf.15/vendor/wikimedia/remex-html/RemexHtml/Tokenizer/Tokenizer.php(462): RemexHtml\Tokenizer\Tokenizer->emitDataRange(integer, integer)
#3 /srv/mediawiki/php-1.32.0-wmf.15/vendor/wikimedia/remex-html/RemexHtml/Tokenizer/Tokenizer.php(310): RemexHtml\Tokenizer\Tokenizer->dataState(boolean)
#4 /srv/mediawiki/php-1.32.0-wmf.15/vendor/wikimedia/remex-html/RemexHtml/Tokenizer/Tokenizer.php(151): RemexHtml\Tokenizer\Tokenizer->executeInternal(boolean)
#5 /srv/mediawiki/php-1.32.0-wmf.15/includes/parser/Sanitizer.php(1983): RemexHtml\Tokenizer\Tokenizer->execute()
#6 /srv/mediawiki/php-1.32.0-wmf.15/includes/content/WikiTextStructure.php(179): Sanitizer::stripAllTags(string)
#7 /srv/mediawiki/php-1.32.0-wmf.15/includes/content/WikiTextStructure.php(225): WikiTextStructure->extractWikitextParts()
#8 /srv/mediawiki/php-1.32.0-wmf.15/includes/content/WikitextContentHandler.php(150): WikiTextStructure->getOpeningText()
#9 /srv/mediawiki/php-1.32.0-wmf.15/extensions/CirrusSearch/includes/Updater.php(343): WikitextContentHandler->getDataForSearchIndex(WikiPage, ParserOutput, CirrusSearch)
#10 /srv/mediawiki/php-1.32.0-wmf.15/extensions/CirrusSearch/includes/Updater.php(396): CirrusSearch\Updater::buildDocument(CirrusSearch, WikiPage, CirrusSearch\Connection, integer, integer, integer)
#11 /srv/mediawiki/php-1.32.0-wmf.15/extensions/CirrusSearch/includes/Updater.php(204): CirrusSearch\Updater->buildDocumentsForPages(array, integer)
#12 /srv/mediawiki/php-1.32.0-wmf.15/extensions/CirrusSearch/includes/Updater.php(83): CirrusSearch\Updater->updatePages(array, integer)
#13 /srv/mediawiki/php-1.32.0-wmf.15/extensions/CirrusSearch/includes/Job/LinksUpdate.php(52): CirrusSearch\Updater->updateFromTitle(Title)
#14 /srv/mediawiki/php-1.32.0-wmf.15/extensions/CirrusSearch/includes/Job/Job.php(99): CirrusSearch\Job\LinksUpdate->doJob()
#15 /srv/mediawiki/php-1.32.0-wmf.15/extensions/EventBus/includes/JobExecutor.php(67): CirrusSearch\Job\Job->run()
#16 /srv/mediawiki/rpc/RunSingleJob.php(80): JobExecutor->execute(array)
#17 {main}
ssastry created this task. Aug 3 2018, 3:24 PM
Restricted Application added a subscriber: Aklapper. Aug 3 2018, 3:24 PM
Krinkle added a subscriber: Krinkle. Edited Oct 2 2018, 9:17 PM

Still seen on 1.32.0-wmf.23. Posting a recent sample so that this task shows up in search results, and to aid investigation.

RemexHtml\Tokenizer\Tokenizer: pcre.backtrack_limit exhausted

#0 /srv/mediawiki/php-1.32.0-wmf.23/vendor/wikimedia/remex-html/RemexHtml/Tokenizer/Tokenizer.php(866): RemexHtml\Tokenizer\Tokenizer->throwPregError()
#1 /srv/mediawiki/php-1.32.0-wmf.23/vendor/wikimedia/remex-html/RemexHtml/Tokenizer/Tokenizer.php(1011): RemexHtml\Tokenizer\Tokenizer->handleCharRefs(string, integer)
#2 /srv/mediawiki/php-1.32.0-wmf.23/vendor/wikimedia/remex-html/RemexHtml/Tokenizer/Tokenizer.php(462): RemexHtml\Tokenizer\Tokenizer->emitDataRange(integer, integer)
#3 /srv/mediawiki/php-1.32.0-wmf.23/vendor/wikimedia/remex-html/RemexHtml/Tokenizer/Tokenizer.php(310): RemexHtml\Tokenizer\Tokenizer->dataState(boolean)
#4 /srv/mediawiki/php-1.32.0-wmf.23/vendor/wikimedia/remex-html/RemexHtml/Tokenizer/Tokenizer.php(151): RemexHtml\Tokenizer\Tokenizer->executeInternal(boolean)
#5 /srv/mediawiki/php-1.32.0-wmf.23/includes/parser/Sanitizer.php(1984): RemexHtml\Tokenizer\Tokenizer->execute()
#6 /srv/mediawiki/php-1.32.0-wmf.23/includes/content/WikiTextStructure.php(179): Sanitizer::stripAllTags(string)
#7 /srv/mediawiki/php-1.32.0-wmf.23/includes/content/WikiTextStructure.php(225): WikiTextStructure->extractWikitextParts()
#8 /srv/mediawiki/php-1.32.0-wmf.23/includes/content/WikitextContentHandler.php(152): WikiTextStructure->getOpeningText()
#9 /srv/mediawiki/php-1.32.0-wmf.23/extensions/CirrusSearch/includes/Updater.php(351): WikitextContentHandler->getDataForSearchIndex(WikiPage, ParserOutput, CirrusSearch)
#10 /srv/mediawiki/php-1.32.0-wmf.23/extensions/CirrusSearch/includes/Updater.php(407): CirrusSearch\Updater::buildDocument(CirrusSearch, WikiPage, CirrusSearch\Connection, integer, integer, integer)
#11 /srv/mediawiki/php-1.32.0-wmf.23/extensions/CirrusSearch/includes/Updater.php(205): CirrusSearch\Updater->buildDocumentsForPages(array, integer)
#12 /srv/mediawiki/php-1.32.0-wmf.23/extensions/CirrusSearch/includes/Updater.php(84): CirrusSearch\Updater->updatePages(array, integer)
#13 /srv/mediawiki/php-1.32.0-wmf.23/extensions/CirrusSearch/includes/Job/LinksUpdate.php(52): CirrusSearch\Updater->updateFromTitle(Title)
#14 /srv/mediawiki/php-1.32.0-wmf.23/extensions/CirrusSearch/includes/Job/Job.php(99): CirrusSearch\Job\LinksUpdate->doJob()
#15 /srv/mediawiki/php-1.32.0-wmf.23/extensions/EventBus/includes/JobExecutor.php(64): CirrusSearch\Job\Job->run()
#16 /srv/mediawiki/rpc/RunSingleJob.php(67): JobExecutor->execute(array)

Krinkle renamed this task from RemexHtml\Tokenizer\Tokenizer: pcre.backtrack_limit exhausted to CirrusSearch jobs sometimes fail with "RemexHtml\Tokenizer\Tokenizer: pcre.backtrack_limit exhausted". Oct 2 2018, 9:20 PM
Krinkle added a project: CirrusSearch.
Restricted Application added a project: Discovery-Search. Oct 2 2018, 9:21 PM
debt added a subscriber: debt.

We'll be watching this and waiting to see if the Core Platform team needs our help.

Looking over the last week of logs, these only seem to occur in cases of vandalism that create very non-standard wikitext pages. In principle, what is the desired outcome here?

  • Should CirrusSearch fall back to some other tag-stripping algorithm when Remex fails? Maybe PHP's strip_tags?
  • Should Remex be fixed to fall back from regex to something else and never fail?
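
For reference, the first option could look roughly like the sketch below. This is only an illustration under assumptions: stripTagsWithFallback() is a hypothetical helper, not something that exists in CirrusSearch or MediaWiki core, and the primary stripper is passed in as a callable so the example stays self-contained. Note that strip_tags() does not behave identically to a full HTML tokenizer (entities, for example, are left as-is), so the indexed text may differ on the fallback path.

```php
<?php
// Hypothetical helper for illustration only; not part of CirrusSearch or MediaWiki.
// $primaryStripper stands in for the Remex-based path (e.g. a wrapper around
// Sanitizer::stripAllTags) and is expected to throw when tokenizing fails.
function stripTagsWithFallback( string $html, callable $primaryStripper ): string {
	try {
		return $primaryStripper( $html );
	} catch ( \Throwable $e ) {
		// Degrade to PHP's built-in tag stripper instead of failing the whole job.
		return strip_tags( $html );
	}
}

// Usage with a stand-in stripper that always fails the way the job does.
$alwaysFails = static function ( string $html ): string {
	throw new \RuntimeException( 'pcre.backtrack_limit exhausted' );
};
echo stripTagsWithFallback( '<p>Hello <b>world</b></p>', $alwaysFails ), "\n";
// prints "Hello world"
```

The trade-off is that the job succeeds with slightly lower-fidelity text rather than throwing, which may or may not be acceptable for the search index.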

Let me chat with Tim later this week and/or look at the code to see what makes sense. Remex shouldn't fail, for sure, or at least if it fails, it should have a better failure mode. I lost track of this; otherwise we could have chatted in person when we were all in Portland.

The usual way to fix exhaustion of pcre.backtrack_limit is to just increase the limit. I documented on line 1449 of Remex's Tokenizer.php that it needs to be at least twice the length of the input string. The current limit is 1MB, which I thought would be enough, but the input to RemexHtml for this test case is 1.4MB. I confirmed with eval.php that increasing pcre.backtrack_limit to 2MB fixes the issue for this test case. But let's make it 5MB to be on the safe side.
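
As a rough worked check of the numbers above, the sketch below applies the "at least twice the input length" guideline to an input of about 1.4MB. The helper names are hypothetical (they do not exist in RemexHtml), and the byte counts are round-number assumptions standing in for "1MB", "2MB" and "5MB". By the guideline the minimum would be about 2.8MB, so the 2MB setting that worked in testing sits below it, which suggests the guideline is conservative; 5MB clears it with room to spare.

```php
<?php
// Hypothetical helpers for illustration only; not part of RemexHtml.

// Minimum pcre.backtrack_limit under the "twice the input length" guideline.
function minBacktrackLimit( int $inputBytes ): int {
	return 2 * $inputBytes;
}

function limitMeetsGuideline( int $limit, int $inputBytes ): bool {
	return $limit >= minBacktrackLimit( $inputBytes );
}

$inputBytes = 1400000; // assumed size, roughly the 1.4MB test case
foreach ( [ 1000000, 2000000, 5000000 ] as $limit ) {
	printf(
		"%d: %s\n",
		$limit,
		limitMeetsGuideline( $limit, $inputBytes ) ? 'meets guideline' : 'below guideline'
	);
}

// Per-process equivalent of raising the limit in php.ini (the setting is
// changeable at runtime).
ini_set( 'pcre.backtrack_limit', '5000000' );
```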

Change 471904 had a related patch set uploaded (by Tim Starling; owner: Tim Starling):
[operations/puppet@production] Increase pcre.backtrack_limit to 5MB

https://gerrit.wikimedia.org/r/471904

The reason I'm not concerned about increasing this limit is that the effect on CPU time is O(N). It just limits the number of characters examined by PCRE, and PCRE takes a very small amount of time for each character. The limit exists because, for certain regexes, short input strings can cause an exponential amount of backtracking. Setting the backtrack limit to some constant factor times the input size avoids this problem, bounding execution time to be approximately linear. Setting the backtrack limit to less than the input size is pointless: it implies that the goal is sublinear performance, i.e. better than O(N), which is not possible.
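
To illustrate the "constant factor times the input size" idea, here is a minimal sketch of a wrapper that scales the limit to the subject length for a single call and reports exhaustion explicitly. pregMatchWithScaledLimit() and its default factor of 2 are assumptions made for this example; this is not how RemexHtml or the puppet change handles it (the change simply raises the global setting).

```php
<?php
// Hypothetical wrapper for illustration only; not part of RemexHtml or MediaWiki.
function pregMatchWithScaledLimit( string $pattern, string $subject, int $factor = 2 ): int {
	$needed = $factor * strlen( $subject );
	$old = ini_get( 'pcre.backtrack_limit' );
	// Only ever raise the limit; never lower a larger configured value.
	if ( $needed > (int)$old ) {
		ini_set( 'pcre.backtrack_limit', (string)$needed );
	}
	try {
		$result = preg_match( $pattern, $subject );
		if ( $result === false ) {
			$code = preg_last_error();
			throw new \RuntimeException(
				$code === PREG_BACKTRACK_LIMIT_ERROR
					? 'pcre.backtrack_limit exhausted'
					: "PCRE error code $code"
			);
		}
		return $result;
	} finally {
		// Restore the previous limit so other callers are unaffected.
		ini_set( 'pcre.backtrack_limit', $old );
	}
}

// Usage: returns 1 on a match and 0 on no match, like preg_match().
var_dump( pregMatchWithScaledLimit( '/b+/', 'aaabbb' ) );
```

Because the limit grows only linearly with the input, the worst-case work stays roughly proportional to the input size, while a pathological pattern still hits the limit instead of backtracking exponentially.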

Change 471904 merged by Tim Starling:
[operations/puppet@production] Increase pcre.backtrack_limit to 5MB

https://gerrit.wikimedia.org/r/471904

tstarling closed this task as Resolved. Nov 6 2018, 3:53 AM
tstarling claimed this task.

Should be fixed now, feel free to undelete the page and try it.