extensions/CirrusSearch/includes/Sanity/Checker.php:369 Cannot fetch ids from index
Closed, Resolved · Public

Description

id: AWqeP542m2VjIW06Z09x
trace:
#0 /srv/mediawiki/php-1.34.0-wmf.4/extensions/CirrusSearch/includes/Sanity/Checker.php(122): CirrusSearch\Sanity\Checker->loadPagesFromIndex(array)
#1 /srv/mediawiki/php-1.34.0-wmf.4/extensions/CirrusSearch/includes/Job/CheckerJob.php(217): CirrusSearch\Sanity\Checker->check(array)
#2 /srv/mediawiki/php-1.34.0-wmf.4/extensions/CirrusSearch/includes/Job/Job.php(100): CirrusSearch\Job\CheckerJob->doJob()
#3 /srv/mediawiki/php-1.34.0-wmf.4/extensions/EventBus/includes/JobExecutor.php(66): CirrusSearch\Job\Job->run()
#4 /srv/mediawiki/rpc/RunSingleJob.php(77): JobExecutor->execute(array)
#5 {main}

Impact

Over 12,000 failed JobQueue jobs per day, logged as "ERROR" severity in the production exception channel.

Event Timeline

Restricted Application added a project: Discovery-Search. · May 9 2019, 8:24 PM
Restricted Application added a subscriber: Aklapper.
elukey added a subscriber: elukey. · May 18 2019, 4:16 PM

Today I have seen some alarms firing for mediawiki exceptions due to this error :)

Krinkle updated the task description. · May 19 2019, 9:38 AM
Krinkle updated the task description.
Krinkle triaged this task as High priority. · Edited · May 19 2019, 9:42 AM
Krinkle added a subscriber: Krinkle.

Seen in Logstash since 1.34-wmf.1. Tentatively triaging as High priority because it is one of the top 5 most frequent production errors, which makes it hard to reliably detect new regressions that are less frequent than this one.

If these jobs are not required to succeed (e.g. they just try something and that's it, no further action to be taken), then the Job class should presumably catch and ignore all exceptions and still return true.

If the internal failure rate is of interest to the maintainers, one could consider a Statsd metric or INFO-severity message in its stead.
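A minimal sketch of the two suggestions above, not the change that was eventually merged: the job catches the exception, records it as a counter and an INFO-level message, and still reports success. Every name here (`BestEffortJob`, `CounterStub`, `LogStub`, the `checker_job.failed` key) is hypothetical and stands in for whatever the real job, metrics client, and logger would be.

```php
<?php
// Hypothetical illustration only; these classes are not CirrusSearch or MediaWiki APIs.

/** Stand-in for a Statsd-style metrics client (assumption). */
class CounterStub {
    public $counts = [];

    public function increment(string $key): void {
        $this->counts[$key] = ($this->counts[$key] ?? 0) + 1;
    }
}

/** Stand-in for a logger that only records INFO messages (assumption). */
class LogStub {
    public $records = [];

    public function info(string $message, array $context = []): void {
        $this->records[] = ['level' => 'info', 'message' => $message, 'context' => $context];
    }
}

/** A job whose failure needs no follow-up, so run() always returns true. */
class BestEffortJob {
    private $work;
    private $stats;
    private $log;

    public function __construct(callable $work, CounterStub $stats, LogStub $log) {
        $this->work = $work;
        $this->stats = $stats;
        $this->log = $log;
    }

    public function run(): bool {
        try {
            ($this->work)();
        } catch (\Exception $e) {
            // Keep the failure rate observable without raising an ERROR-level exception.
            $this->stats->increment('checker_job.failed');
            $this->log->info('Check skipped: {reason}', ['reason' => $e->getMessage()]);
        }
        // Report success either way so the job runner does not log a failed job.
        return true;
    }
}

// Usage: the simulated index failure is swallowed, counted, and logged at INFO.
$stats = new CounterStub();
$log = new LogStub();
$job = new BestEffortJob(function () {
    throw new \RuntimeException('Cannot fetch ids from index');
}, $stats, $log);
$job->run();
```

Whether swallowing every exception is appropriate depends on whether any failure of this job needs follow-up; a narrower catch limited to the "backend unavailable" case would hide less.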

Change 513585 had a related patch set uploaded (by DCausse; owner: DCausse):
[mediawiki/extensions/CirrusSearch@master] Don't spam the logs with errors from saneitizer jobs when elastic is down

https://gerrit.wikimedia.org/r/513585

dcausse claimed this task. · May 31 2019, 12:44 PM
dcausse moved this task from In Progress to Needs review on the Discovery-Search (Current work) board.

Change 513585 merged by jenkins-bot:
[mediawiki/extensions/CirrusSearch@master] Don't spam the logs with errors from saneitizer jobs when elastic is down

https://gerrit.wikimedia.org/r/513585

Krinkle closed this task as Resolved. · Edited · Jun 17 2019, 6:33 PM

Confirmed fixed in prod on 1.34-wmf.8:

mmodell changed the subtype of this task from "Task" to "Production Error". · Aug 28 2019, 11:07 PM