
Restart CronJobs on failure of the service mesh
Open, Low, Public

Description

Define a specific exit code in MWScript.php when the service mesh doesn't respond in time, so it can be caught by a podFailurePolicy and the Job can be retried (since it never actually started work, that is safe).
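
The idea above can be sketched roughly as follows. This is a minimal Python illustration, not the actual MWScript.php change: the exit code value `75`, the `probe` callable standing in for the mesh readiness check, and the names `wait_for_mesh` and `run_script` are all hypothetical. The `POD_FAILURE_POLICY` dict mirrors the shape of a Kubernetes Job `spec.podFailurePolicy` fragment, assuming the `Ignore` action is the one wanted here.

```python
import time
from typing import Callable

# Hypothetical value: the task does not say which exit code was chosen.
MESH_UNAVAILABLE_EXIT_CODE = 75


def wait_for_mesh(
    probe: Callable[[], bool],
    timeout_s: float,
    interval_s: float = 0.5,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `probe` until it returns True or `timeout_s` has elapsed."""
    deadline = clock() + timeout_s
    while True:
        if probe():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


def run_script(
    probe: Callable[[], bool],
    work: Callable[[], int],
    timeout_s: float,
    interval_s: float = 0.5,
) -> int:
    """Return the dedicated exit code if the mesh never became ready.

    `work` is only called after the mesh check passes, so a retry
    triggered by the dedicated exit code cannot repeat partial work.
    """
    if not wait_for_mesh(probe, timeout_s, interval_s):
        return MESH_UNAVAILABLE_EXIT_CODE
    return work()


# Shape of a Job spec.podFailurePolicy fragment, written as a Python dict.
# "Ignore" means the Pod failure is not counted against backoffLimit and a
# replacement Pod is created. The Job's Pod template must use
# restartPolicy: Never for a podFailurePolicy to be accepted.
POD_FAILURE_POLICY = {
    "rules": [
        {
            "action": "Ignore",
            "onExitCodes": {
                "operator": "In",
                "values": [MESH_UNAVAILABLE_EXIT_CODE],
            },
        }
    ]
}
```

A caller would pass the result of `run_script` to `sys.exit`, so that the container's exit code is what the policy rule matches on.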

Event Timeline

Change #1133935 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/mediawiki-config@master] MWScript.php: Specific exit code on mesh failure

https://gerrit.wikimedia.org/r/1133935

podFailurePolicy isn't available on the Kubernetes version currently running on our production wikikube clusters.

Change #1133935 merged by jenkins-bot:

[operations/mediawiki-config@master] MWScript.php: exit code on mesh, longer timeout

https://gerrit.wikimedia.org/r/1133935

Mentioned in SAL (#wikimedia-operations) [2025-04-10T10:45:56Z] <cgoubert@deploy1003> Started scap sync-world: Backport for [[gerrit:1133935|MWScript.php: exit code on mesh, longer timeout (T390972 T387208)]]

Mentioned in SAL (#wikimedia-operations) [2025-04-10T10:54:08Z] <cgoubert@deploy1003> cgoubert: Backport for [[gerrit:1133935|MWScript.php: exit code on mesh, longer timeout (T390972 T387208)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)

Mentioned in SAL (#wikimedia-operations) [2025-04-10T11:08:11Z] <cgoubert@deploy1003> Finished scap sync-world: Backport for [[gerrit:1133935|MWScript.php: exit code on mesh, longer timeout (T390972 T387208)]] (duration: 22m 15s)

MLechvien-WMF moved this task from Inbox to Backlog on the ServiceOps new board.