Partial Wikifunctions service outage: Z504s (not found) being thrown, because Z13518 was mis-cached as a Z504/error
Closed, Resolved · Public · PRODUCTION ERROR

Description

https://www.wikifunctions.org/view/en/Z16424 was edited two hours ago and seems to have worked then, but it does not work now. That edit came after this week's train, so I don't think the train is at fault, and neither we nor anyone else has made any changes to prod since then.

The first apparent failure is from 18:36, on https://www.wikifunctions.org/view/en/Z16430. In that case it looks like Type serialisation is failing: 'Could not serialize input Python object: 3'.

Details

Related Changes in Gerrit:
Related Changes in GitLab:
Title | Reference | Author | Source Branch | Dest Branch
Do not store error envelopes in memcached or retrieve ones that were stored in the past. | repos/abstract-wiki/wikifunctions/function-orchestrator!438 | apine | apine-dont-store-errors | main
Do not eagerly evaluate the keys of Z20s, which are often degenerate Z7s. | repos/abstract-wiki/wikifunctions/function-orchestrator!437 | apine | apine-fix-z20 | main

Event Timeline

Jdforrester-WMF triaged this task as Unbreak Now! priority.
Restricted Application changed the subtype of this task from "Task" to "Production Error". · Sep 3 2025, 9:31 PM

For instance, {"Z1K1":"Z7","Z7K1":"Z16409","Z16409K1":{"Z1K1":"Z13518","Z13518K1":"20"}} should return a value but instead throws a Z504 for Z13518.

Mentioned in SAL (#wikimedia-operations) [2025-09-03T21:37:52Z] <James_F> Running mwscript-k8s -f -- extensions/WikiLambda/maintenance/updateSecondaryTables.php --wiki=wikifunctionswiki --quick --zType Z4 --verbose to try to fix T403671

Similarly, {"Z1K1":"Z7","Z7K1":"Z14290","Z14290K1":"123","Z14290K2":"Z1002"} fails with "Z500K1":"Call tuples failed in returnOnFirstError. Error: TypeError: Cannot read properties of undefined (reading 'resolveEphemeral')".
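
For anyone wanting to reproduce the two failing calls quoted in this thread, here is a minimal Python sketch that rebuilds the same call objects. The helper names `build_function_call` and `to_compact_json` are hypothetical and not part of any Wikifunctions tooling. The sketch assumes positional arguments map to keys of the form `<function ZID>K<n>`, as they do in the two calls above; it only builds the JSON and does not send it anywhere.

```python
import json
from typing import Any, Dict


def build_function_call(function_zid: str, *args: Any) -> Dict[str, Any]:
    """Build a Z7 (function call) object for the given function ZID.

    Hypothetical helper: argument keys are assumed to follow the
    <function ZID>K<position> pattern seen in the calls quoted above.
    """
    call: Dict[str, Any] = {"Z1K1": "Z7", "Z7K1": function_zid}
    for position, value in enumerate(args, start=1):
        call[f"{function_zid}K{position}"] = value
    return call


def to_compact_json(zobject: Dict[str, Any]) -> str:
    """Serialise without whitespace, matching the form pasted in the task."""
    return json.dumps(zobject, separators=(",", ":"))


# The call that threw a Z504 for Z13518.
first_call = build_function_call("Z16409", {"Z1K1": "Z13518", "Z13518K1": "20"})

# The call that failed with the 'resolveEphemeral' TypeError.
second_call = build_function_call("Z14290", "123", "Z1002")

print(to_compact_json(first_call))
print(to_compact_json(second_call))
```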

jforrester merged https://gitlab.wikimedia.org/repos/abstract-wiki/wikifunctions/function-orchestrator/-/merge_requests/438

Do not store error envelopes in memcached or retrieve ones that were stored in the past.
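
The idea in that merge request description can be illustrated with a rough Python sketch. This is not the orchestrator's code: the in-memory dict stands in for memcached, `ResultCache`, `call_with_cache` and `is_error_envelope` are hypothetical names, and the error check assumes a simplified envelope whose value key `Z22K1` holds the void marker `Z24` when the call failed.

```python
from typing import Any, Callable, Dict, Optional

Envelope = Dict[str, Any]


def is_error_envelope(envelope: Envelope) -> bool:
    # Assumption: a failed call carries the void marker "Z24" as its value.
    return envelope.get("Z22K1") == "Z24"


class ResultCache:
    """In-memory stand-in for memcached that never serves or stores errors."""

    def __init__(self) -> None:
        self._store: Dict[str, Envelope] = {}

    def get(self, key: str) -> Optional[Envelope]:
        envelope = self._store.get(key)
        # Treat error envelopes stored in the past as cache misses.
        if envelope is None or is_error_envelope(envelope):
            return None
        return envelope

    def set(self, key: str, envelope: Envelope) -> None:
        # Do not store error envelopes at all.
        if is_error_envelope(envelope):
            return
        self._store[key] = envelope


def call_with_cache(
    cache: ResultCache, key: str, evaluate: Callable[[], Envelope]
) -> Envelope:
    """Return a cached success if present, otherwise evaluate and cache it."""
    cached = cache.get(key)
    if cached is not None:
        return cached
    result = evaluate()
    cache.set(key, result)
    return result
```

With this shape, an entry that was mis-cached as an error is re-evaluated on the next lookup instead of being served again, and a fresh failure is returned to the caller without being remembered.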

Change #1184620 had a related patch set uploaded (by Jforrester; author: Jforrester):

[operations/deployment-charts@master] wikifunctions: Upgrade orchestrator from 2025-09-02-205403 to 2025-09-04-003606

https://gerrit.wikimedia.org/r/1184620

Change #1184620 merged by jenkins-bot:

[operations/deployment-charts@master] wikifunctions: Upgrade orchestrator from 2025-09-02-205403 to 2025-09-04-003606

https://gerrit.wikimedia.org/r/1184620

Jdforrester-WMF lowered the priority of this task from Unbreak Now! to High. · Sep 4 2025, 12:49 AM

OK, this should now be fixed for new Function calls. Sorry for the disruption. Unfortunately, calls that are already cached will remain cached for 24 hours.
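
To illustrate why already-cached calls linger, here is a small Python sketch of a time-to-live cache. It is an assumption-laden illustration rather than the production cache: `TtlCache` and `DAY_SECONDS` are hypothetical names, and the only detail taken from this task is the 24-hour lifetime.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple

DAY_SECONDS = 24 * 60 * 60  # the 24-hour lifetime mentioned above


class TtlCache:
    """Hypothetical cache whose entries expire a fixed time after being set."""

    def __init__(
        self,
        ttl_seconds: float = DAY_SECONDS,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._store: Dict[str, Tuple[float, Any]] = {}

    def set(self, key: str, value: Any) -> None:
        # Record the expiry time alongside the value.
        self._store[key] = (self._clock() + self._ttl, value)

    def get(self, key: str) -> Optional[Any]:
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            # The entry has outlived its TTL: drop it and report a miss.
            del self._store[key]
            return None
        return value
```

Under this model, a bad result written just before the fix keeps being served until its own 24 hours run out, even though new calls are evaluated correctly.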

DSantamaria changed the task status from Open to In Progress. · Sep 4 2025, 9:46 AM
Jdforrester-WMF renamed this task from Partial Wikifunctions service outage: Z504s (not found) being thrown, plus Type serialisation is failing to Partial Wikifunctions service outage: Z504s (not found) being thrown, because Z13518 was mis-cached as a Z504/error. · Sep 4 2025, 7:18 PM

Thanks for the fix!

Since then, test evaluation details are quite often unavailable. For example, in decimal string from Rational, composition (Z27983), clicking “Details” in the first test (Z21788) does not display the details (and can make the content unscrollable on a mobile device, while leaving other interactions available). This failure recurs in edit mode, so there appears to be no way to see the error details.

Reported by @99of9 on Telegram.

Thanks, but this is a totally different issue. I'll file it.