Page MenuHomePhabricator

Clean up CirrusSearch job retries
Open, MediumPublic

Description

Currently CirrusSearch jobs are configured to not do any retries cause cirrus jobs manage retries internally.

Instead, cirrus jobs should report success in case they failed and have scheduled a retry, and let change-prop do overall retries in case of a catastrophic job failure.

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald Transcript

let change-prop do overall retries in case of a catastrophic job failure.

I think part of the problem is that retries aren't catastrophic failure in cirrusserach. In our jobs retries are a natural and expected part of operations.

I think part of the problem is that retries aren't catastrophic failure in cirrusserach.

I mean more like MW jobrunner hardware failure or nuclear attack on the datacenter :) Basically the ticket is about checking that cirrus jobs report success if they posted their own retry and then setting change-prop retry policy to standard 2 retries on failure - in case something unexpected happens.

let change-prop do overall retries in case of a catastrophic job failure.

I think part of the problem is that retries aren't catastrophic failure in cirrusserach. In our jobs retries are a natural and expected part of operations.

The context for this ticket is that I found the changeprop-jobqueue configuration too complicated, in particular that we use ChangeProp's own concept of retries in some cases here, and actually hardcode this for some jobs in the changeprop-jobqueue service.

The reason for this being surprising and undesirable to me is that MediaWiki already does this. Part of the job metadata submitted to changeprop-jobqueue from MediaWiki by, for example, CirussSearch is Job::allowRetries(). When a job fails, and it allows retries, MediaWiki re-queues it upto 3 times. There is no need for this to be hardcoded at the changeprop level.

For CirussSearch, I'm aware that it has some jobs that don't need retries or that feature their custom retry logic. That's fine. Either way, disabling CirrusSearch-job-retries inside changeprop config is redundant. These CirussSearch jobs always return "success" to MW/JobQueue and the CirussSearch job may or may not have internally caught and error and requeued a variant of itself by then. This unconditional "success" marking has been correctly done on the CirussSearch side for a long time now, so this changeprop config can just be removed.

The reason for this coming up now is T253673. As part of that I've been thinking about what happens when we forcefully kill/restart php-fpm across the fleet, and that includes jobrunners. As such, there is a scenario where CirussSearch won't be unable to "catch" this and requeue from within the runtime. I don't know whether you have a preference for how we handle this, but as it stands today, I suspect those jobs would be lost since changeprop-jobqueue has been hardcoded to not allow retries for CirussSearch jobs.

At the conclusion of T253673, the php-fpm restarts we do around deployments today will happen by default instead of only on some days for some servers.

Thanks for the context, that is very helpful. Everything in cirrus is a bit weird because it's trying to implement multi-cluster replication through the job queue. It does make sense that actual catastrophic failures, either because php-fpm was restarted or whatever, need to get retried at the job queue level. I'll review our jobs and make sure we are sending appropriate return codes from the job execution, then swap over the allowRetries.

Removing inactive assignee (Platform Engineering: Please unassign tasks of previous team members.)