Clean up CirrusSearch job retries
Open, MediumPublic
Actions

Assigned To

None

Authored By

	• Pchelolo
	Mar 24 2021, 7:51 PM

Description

Currently CirrusSearch jobs are configured to not do any retries cause cirrus jobs manage retries internally.

Instead, cirrus jobs should report success in case they failed and have scheduled a retry, and let change-prop do overall retries in case of a catastrophic job failure.

Related Objects

Mentioned In: T314240: Enable rolling restart for all MW servers (tracking)
Mentioned Here: T253673: Avoid php-opcache corruption in WMF production

Event Timeline

• Pchelolo created this task.Mar 24 2021, 7:51 PM

Restricted Application added a project: Discovery-Search. · View Herald TranscriptMar 24 2021, 7:51 PM

Restricted Application added a subscriber: Aklapper. · View Herald Transcript

• AMooney triaged this task as Medium priority.Mar 30 2021, 8:00 PM

• AMooney moved this task from Inbox to Later on the Platform Team Workboards (Clinic Duty Team) board.

MPhamWMF moved this task from needs triage to making others happy on the Discovery-Search board.Apr 28 2021, 4:32 PM

let change-prop do overall retries in case of a catastrophic job failure.

I think part of the problem is that retries aren't catastrophic failure in cirrusserach. In our jobs retries are a natural and expected part of operations.

I think part of the problem is that retries aren't catastrophic failure in cirrusserach.

I mean more like MW jobrunner hardware failure or nuclear attack on the datacenter :) Basically the ticket is about checking that cirrus jobs report success if they posted their own retry and then setting change-prop retry policy to standard 2 retries on failure - in case something unexpected happens.

In T278382#7042630, @EBernhardson wrote:

let change-prop do overall retries in case of a catastrophic job failure.

I think part of the problem is that retries aren't catastrophic failure in cirrusserach. In our jobs retries are a natural and expected part of operations.

The context for this ticket is that I found the changeprop-jobqueue configuration too complicated, in particular that we use ChangeProp's own concept of retries in some cases here, and actually hardcode this for some jobs in the changeprop-jobqueue service.

The reason for this being surprising and undesirable to me is that MediaWiki already does this. Part of the job metadata submitted to changeprop-jobqueue from MediaWiki by, for example, CirussSearch is Job::allowRetries(). When a job fails, and it allows retries, MediaWiki re-queues it upto 3 times. There is no need for this to be hardcoded at the changeprop level.

For CirussSearch, I'm aware that it has some jobs that don't need retries or that feature their custom retry logic. That's fine. Either way, disabling CirrusSearch-job-retries inside changeprop config is redundant. These CirussSearch jobs always return "success" to MW/JobQueue and the CirussSearch job may or may not have internally caught and error and requeued a variant of itself by then. This unconditional "success" marking has been correctly done on the CirussSearch side for a long time now, so this changeprop config can just be removed.

The reason for this coming up now is T253673. As part of that I've been thinking about what happens when we forcefully kill/restart php-fpm across the fleet, and that includes jobrunners. As such, there is a scenario where CirussSearch won't be unable to "catch" this and requeue from within the runtime. I don't know whether you have a preference for how we handle this, but as it stands today, I suspect those jobs would be lost since changeprop-jobqueue has been hardcoded to not allow retries for CirussSearch jobs.

At the conclusion of T253673, the php-fpm restarts we do around deployments today will happen by default instead of only on some days for some servers.

Thanks for the context, that is very helpful. Everything in cirrus is a bit weird because it's trying to implement multi-cluster replication through the job queue. It does make sense that actual catastrophic failures, either because php-fpm was restarted or whatever, need to get retried at the job queue level. I'll review our jobs and make sure we are sending appropriate return codes from the job execution, then swap over the allowRetries.

Removing inactive assignee (Platform Engineering: Please unassign tasks of previous team members.)

Krinkle mentioned this in T314240: Enable rolling restart for all MW servers (tracking).Aug 6 2022, 12:56 AM

Clean up CirrusSearch job retriesOpen, MediumPublicActions

Description

Related Objects

Event Timeline

Clean up CirrusSearch job retries
Open, MediumPublic
Actions