Page MenuHomePhabricator

Should ElasticaWrite job allow retries?
Open, Needs TriagePublic

Description

	public function allowRetries() {
		return false;
	}

This overridde has a doc comment which explains why it is disabled, namely that the job has its own retry logic.

However, if I understand correctly, this also means that the job is lost without recovery if the PHP process is killed. E.g. due to deployment and we roll over php-fpm across the fleet, or due to any other php-fpm restart (such as opcache capacity being reached, which is done by a cronjob currently), due during switch overs when we kill running processes, etc.

Disabling retries is relatively rare, and when done it is typically for jobs that exist only as optimisation or that can self-correct relatively quickly (e.g. warming up thumbnail or parser cache), or for unsafe/complex code that isn't atomic and cannot restart with e.g. a user asynchronously waiting on the other end who would notice and know to re-try at the higher level (such as upload chunk assembly).

I don't know if there is a regular and automated way by which this would self-correct for CirrusSearch. If not, then it might be worth turning this back on. Given that the code is already wrapped in a try-catch, it should be impossible for blind job queue growth to happen. The cases where a runtime error is found of the kind that you don't want to retry, the existing code will kick in as usual and signal that it should be counted as success.

It's only when the process is aborted from the outside, and thus the job runner never gets a response or gets HTTP 500, that it will thus have permission to retry/requeue upto 3 times.