Define a specific exit code in MWScript.php for when the service mesh doesn't respond in time, so that a podFailurePolicy can catch it and the Job can be retried. Retrying is safe because the script never actually started work.
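As a rough illustration of the idea (not the actual MWScript.php change), here is a minimal Python sketch. The exit code value `42`, the helper names `wait_for_mesh` and `run`, and the readiness callback are all hypothetical; the task does not state which code was chosen or how readiness is checked. The `POD_FAILURE_POLICY` dict mirrors the shape of the Kubernetes Job `spec.podFailurePolicy` field, where an `Ignore` rule keeps the failure from counting against `backoffLimit`, so a replacement pod is created.

```python
import time
from typing import Callable

# Hypothetical value: the task does not say which exit code was picked.
MESH_TIMEOUT_EXIT_CODE = 42


def wait_for_mesh(is_ready: Callable[[], bool], timeout: float,
                  interval: float = 0.1) -> bool:
    """Poll is_ready() until it returns True or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while True:
        if is_ready():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


def run(is_ready: Callable[[], bool], work: Callable[[], None],
        timeout: float) -> int:
    """Return the process exit code; work() only runs once the mesh is up."""
    if not wait_for_mesh(is_ready, timeout):
        # No work has started yet, so retrying the Job is safe.
        return MESH_TIMEOUT_EXIT_CODE
    work()
    return 0


# Shape of a Job spec.podFailurePolicy that retries on that exit code
# (the Job's pod template must use restartPolicy: Never).
POD_FAILURE_POLICY = {
    "rules": [
        {
            "action": "Ignore",
            "onExitCodes": {
                "operator": "In",
                "values": [MESH_TIMEOUT_EXIT_CODE],
            },
        }
    ]
}
```

A wrapper script would end with something like `sys.exit(run(check, main, timeout))`, so the dedicated code is the only signal the policy needs to tell "mesh never came up" apart from a real script failure.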
Details
| Subject | Repo | Branch | Lines +/- |
|---|---|---|---|
| MWScript.php: exit code on mesh, longer timeout | operations/mediawiki-config | master | +6 -4 |
Event Timeline
Change #1133935 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):
[operations/mediawiki-config@master] MWScript.php: Specific exit code on mesh failure
podFailurePolicy isn't available on the Kubernetes version that our production wikikube clusters currently run.
Change #1133935 merged by jenkins-bot:
[operations/mediawiki-config@master] MWScript.php: exit code on mesh, longer timeout
Mentioned in SAL (#wikimedia-operations) [2025-04-10T10:45:56Z] <cgoubert@deploy1003> Started scap sync-world: Backport for [[gerrit:1133935|MWScript.php: exit code on mesh, longer timeout (T390972 T387208)]]
Mentioned in SAL (#wikimedia-operations) [2025-04-10T10:54:08Z] <cgoubert@deploy1003> cgoubert: Backport for [[gerrit:1133935|MWScript.php: exit code on mesh, longer timeout (T390972 T387208)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
Mentioned in SAL (#wikimedia-operations) [2025-04-10T11:08:11Z] <cgoubert@deploy1003> Finished scap sync-world: Backport for [[gerrit:1133935|MWScript.php: exit code on mesh, longer timeout (T390972 T387208)]] (duration: 22m 15s)