Exception from CirrusSearch/Sanity/Checker: Cannot fetch ids from index
Closed, Resolved | Public | PRODUCTION ERROR

Description

Error

Request ID: 3638258ddb473017d85b8a28

message
[{exception_id}] {exception_url}   Exception from line 369 of /srv/mediawiki/php-1.33.0-wmf.22/extensions/CirrusSearch/includes/Sanity/Checker.php: Cannot fetch ids from index
trace
#0 /srv/mediawiki/php-1.33.0-wmf.22/extensions/CirrusSearch/includes/Sanity/Checker.php(122): CirrusSearch\Sanity\Checker->loadPagesFromIndex(array)
#1 /srv/mediawiki/php-1.33.0-wmf.22/extensions/CirrusSearch/includes/Job/CheckerJob.php(214): CirrusSearch\Sanity\Checker->check(array)
#2 /srv/mediawiki/php-1.33.0-wmf.22/extensions/CirrusSearch/includes/Job/Job.php(100): CirrusSearch\Job\CheckerJob->doJob()
#3 /srv/mediawiki/php-1.33.0-wmf.22/extensions/EventBus/includes/JobExecutor.php(65): CirrusSearch\Job\Job->run()
#4 /srv/mediawiki/rpc/RunSingleJob.php(77): JobExecutor->execute(array)

Impact

Uncertain. I'm not familiar with what the CirrusSearch CheckerJob does, but I assume the fatal error means the job is skipped, which usually means the intended work is not being performed, in part or in whole.

Notes

Seen for at least 30 days. Oldest currently available records show it on 1.33-wmf.18.

capture.png (1×2 px, 106 KB)

Event Timeline

debt triaged this task as Medium priority. (Mar 28 2019, 5:12 PM)
debt moved this task from needs triage to elastic / cirrus on the Discovery-Search board.

Not sure what to do here. The errors are legitimate, since Elasticsearch was unreachable. Should this be made more explicit in the error message?

Additionally this error is logged twice by the EventBus JobExecutor:

This job had retries disabled. I'll switch allowRetries() to true to work around this scenario (even though losing this job is not a problem, since it's a verification process that is constantly restarted).
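
For illustration only, here is a minimal sketch of how a retry flag like this might change what a job executor does when the index is unreachable. It is written in Python rather than the extension's PHP, and every name in it (`IndexUnreachable`, `CheckerJob`, `allow_retries`, `execute`, `max_attempts`) is a hypothetical stand-in, not the actual MediaWiki or EventBus API.

```python
class IndexUnreachable(Exception):
    """Hypothetical stand-in for the 'Cannot fetch ids from index' failure."""


class CheckerJob:
    """Hypothetical job used only for this sketch; not the CirrusSearch class."""

    def __init__(self, failures_before_success, allow_retries):
        self.remaining_failures = failures_before_success
        self._allow_retries = allow_retries
        self.attempts = 0

    def allow_retries(self):
        # The flag the executor consults before re-running a failed job.
        return self._allow_retries

    def run(self):
        self.attempts += 1
        if self.remaining_failures > 0:
            # Simulate the search index being temporarily unreachable.
            self.remaining_failures -= 1
            raise IndexUnreachable("Cannot fetch ids from index")
        return True


def execute(job, max_attempts=3):
    """Run the job, retrying on failure only if the job permits it."""
    limit = max_attempts if job.allow_retries() else 1
    for _ in range(limit):
        try:
            return job.run()
        except IndexUnreachable:
            continue  # try again if attempts remain
    return False
```

With retries disabled, a single transient outage fails the job on its first attempt. With retries enabled, the same outage is absorbed by a second attempt.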

Change 500965 had a related patch set uploaded (by DCausse; owner: DCausse):
[mediawiki/extensions/CirrusSearch@master] Allow retries for CheckerJob

https://gerrit.wikimedia.org/r/500965

@dcausse If this is something that can happen under normal operation and doesn't call for a code change to prevent or handle it, then it probably should not be a top-level uncaught exception.

It should probably instead be caught in the job and the job marked as success. Possibly with an info/warning message logged in a CirrusSearch-specific channel, if it is something you want to be able to find in Logstash. The current exception is meant to be indicative of a JobRunner or MediaWiki-level problem, and is used as such to inform automatic rollbacks during deployments and SRE pages about MediaWiki availability.
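
To make the suggestion concrete, a minimal sketch of the catch-and-log pattern follows. It is written in Python rather than the extension's PHP. The logger name, `IndexUnreachable`, `load_pages_from_index`, `do_job`, and the `index_up` switch are all assumptions made for the example, not the real CirrusSearch code.

```python
import logging

# Assumed channel name; the real logging channel may differ.
log = logging.getLogger("CirrusSearch")


class IndexUnreachable(Exception):
    """Hypothetical stand-in for the 'Cannot fetch ids from index' failure."""


def load_pages_from_index(page_ids, index_up):
    # Hypothetical lookup; raises when the search index cannot be reached.
    if not index_up:
        raise IndexUnreachable("Cannot fetch ids from index")
    return {page_id: {} for page_id in page_ids}


def do_job(page_ids, index_up=True):
    """Treat an unreachable index as an expected condition, not a fatal error."""
    try:
        pages = load_pages_from_index(page_ids, index_up)
    except IndexUnreachable as exc:
        # Log to the extension's own channel and report success, so the
        # failure does not surface as a top-level uncaught exception.
        log.warning("Sanity check skipped: %s", exc)
        return {"success": True, "checked": 0}
    # The actual sanity checks would run here; the sketch only counts pages.
    return {"success": True, "checked": len(pages)}
```

Under these assumptions, `do_job([1, 2], index_up=False)` logs a warning and still reports success, so only unexpected failures would reach the job runner.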

Awesome! I didn't realise the patch does exactly that… :)

Change 500965 merged by jenkins-bot:
[mediawiki/extensions/CirrusSearch@master] Allow retries for CheckerJob

https://gerrit.wikimedia.org/r/500965

debt claimed this task.
mmodell changed the subtype of this task from "Task" to "Production Error". (Aug 28 2019, 11:07 PM)