Page MenuHomePhabricator

handle large log response
Closed, ResolvedPublic

Description

What/Why:
Mishandled responses from our logger is causing 500 outages in our Evaluator (both JS and Py).
For ex:

core-function-evaluator-javascript-1  | TypeError: Cannot read properties of undefined (reading '0')
core-function-evaluator-javascript-1  |     at WikimediaEcsFormatMiddleware.transform (/srv/service/node_modules/@wikimedia/service-utils/dist/index.js:29:39)
core-function-evaluator-javascript-1  |     at Format.transform (/srv/service/node_modules/logform/combine.js:20:24)
core-function-evaluator-javascript-1  |     at DerivedLogger._transform (/srv/service/node_modules/winston/lib/winston/logger.js:314:29)
core-function-evaluator-javascript-1  |     at Transform._read (/srv/service/node_modules/readable-stream/lib/_stream_transform.js:166:10)
core-function-evaluator-javascript-1  |     at Transform._write (/srv/service/node_modules/readable-stream/lib/_stream_transform.js:155:83)
core-function-evaluator-javascript-1  |     at doWrite (/srv/service/node_modules/readable-stream/lib/_stream_writable.js:390:139)
core-function-evaluator-javascript-1  |     at writeOrBuffer (/srv/service/node_modules/readable-stream/lib/_stream_writable.js:381:5)
core-function-evaluator-javascript-1  |     at Writable.write (/srv/service/node_modules/readable-stream/lib/_stream_writable.js:302:11)
core-function-evaluator-javascript-1  |     at DerivedLogger.value (/srv/service/node_modules/winston/lib/winston/logger.js:67:18)
core-function-evaluator-javascript-1  |     at DerivedLogger.log (/srv/service/node_modules/winston/lib/winston/logger.js:225:14)

coming from this line.

relevant bug tasks:

replicate: calling perform tests for the function Z12864 (e.g. localhost or wikifunctions)

How: tbd

Details

Related Changes in Gerrit:
Related Changes in GitLab:
TitleReferenceAuthorSource BranchDest Branch
Update function-schemata sub-module to HEAD (a293535)repos/abstract-wiki/wikifunctions/wikilambda-cli!72jforrestersync-function-schematamain
Update function-schemata sub-module to HEAD (a293535)repos/abstract-wiki/wikifunctions/function-evaluator!345jforrestersync-function-schematamain
Update function-schemata sub-module to HEAD (cf50aca)repos/abstract-wiki/wikifunctions/function-orchestrator!366apinesync-function-schematamain
Minimize types as much as possible before sending to the evaluator.repos/abstract-wiki/wikifunctions/function-orchestrator!363apineapine-smol-typesmain
Do not include expanded ZObjects in error output.repos/abstract-wiki/wikifunctions/function-orchestrator!360apineapine-tiny-errmain
initApp: Handle null configrepos/abstract-wiki/wikifunctions/function-orchestrator!359ecarggrace/T393130/log-bugmain
initApp: Handle null configrepos/abstract-wiki/wikifunctions/function-evaluator!341ecarggrace/T393130/log-bugmain
Customize query in GitLab

Event Timeline

ecarg triaged this task as High priority.
  • maybe we can start by adding a null check(s): e.g. url.path = info.request?.params?.[ 0 ];

I tried these in our Sandbox (Prod):

Notes:

  • I think this shows that those mysterious (https://phabricator.wikimedia.org/T388203) 500: [object Object] errors we've been seeing the past few weeks may also be related (and hopefully resolved alongside this)
  • However, I don't see the 504 Gateway Timeout from running these requests in SB
  • At best, I think the solution is twofold:
    1. Handle potential null case(s) and/or in service-utils
    2. Find where we could be trying to log huge JSON objects and stop doing that.

I think this is related to the huge request issue that @gengh is looking at. I am gonna try to work on point 2. (unless you were already doing it, @ecarg )?

On the service-utils side, that property should've been filled in by express:

When you use a regular expression for the route definition, capture groups are provided in the array using req.params[n], where n is the nth capture group. This rule is applied to unnamed wild card matches with string routes such as /file/*

Regex routes are used by service-template-node. There's probably some route that isn't being captured here that needs some sensible default now that it isn't technically on service-template-node.

So, we're running into an issue here with how we retrieve type converters.

In a single step, we turn a Z7 into a binary-formatted evaluator request. This means that we don't have any options about the object that actually ends up in that request--we can either resolve everything (in order to have access to, e.g., the declared type converters), or we can not to do that and break things.

We should instead resolve everything, extract the required information, then create a stripped-down version of the arguments (crucially, with the types not fully resolved). We can then assemble the evaluator request and convert it to binary form in a separate step.

As a final step, I am working to prevent us from shipping huge type definitions from the orchestrator to the evaluator.

Change #1143075 had a related patch set uploaded (by Jforrester; author: Jforrester):

[operations/deployment-charts@master] wikifunctions: Update orchestrator from 2025-04-23-134615 to 2025-05-06-142345

https://gerrit.wikimedia.org/r/1143075

Change #1143075 merged by jenkins-bot:

[operations/deployment-charts@master] wikifunctions: Update orchestrator from 2025-04-23-134615 to 2025-05-06-142345

https://gerrit.wikimedia.org/r/1143075

@cmassaro as this is causing

Mishandled responses from our logger is causing 500 outages in our Evaluator (both JS and Py).

Shouldn't it be either a "bug" or a "Production error"?

I suppose so! I wasn't the original filer of this task, but I would have no complaints about having it re-classified.

Current status: the immediate bug will be resolved once https://gitlab.wikimedia.org/repos/abstract-wiki/wikifunctions/function-orchestrator/-/merge_requests/366 and https://gitlab.wikimedia.org/repos/abstract-wiki/wikifunctions/function-orchestrator/-/merge_requests/363 are merged. The long-term solution is described in the document I linked in Slack.

Change #1144563 had a related patch set uploaded (by Jforrester; author: Jforrester):

[mediawiki/extensions/WikiLambda@master] Update function-schemata sub-module to HEAD (a293535)

https://gerrit.wikimedia.org/r/1144563

Change #1144563 merged by jenkins-bot:

[mediawiki/extensions/WikiLambda@master] Update function-schemata sub-module to HEAD (a293535)

https://gerrit.wikimedia.org/r/1144563

Change #1145863 had a related patch set uploaded (by Jforrester; author: Jforrester):

[operations/deployment-charts@master] wikifunctions: Update evaluators from 2025-05-07-003410 to 2025-05-12-235119

https://gerrit.wikimedia.org/r/1145863

Change #1145864 had a related patch set uploaded (by Jforrester; author: Jforrester):

[operations/deployment-charts@master] wikifunctions: Update orchestrator from 2025-05-06-142345 to 2025-05-14-112404

https://gerrit.wikimedia.org/r/1145864

Change #1145863 merged by jenkins-bot:

[operations/deployment-charts@master] wikifunctions: Update evaluators from 2025-05-07-003410 to 2025-05-12-235119

https://gerrit.wikimedia.org/r/1145863

Change #1145864 merged by jenkins-bot:

[operations/deployment-charts@master] wikifunctions: Update orchestrator from 2025-05-06-142345 to 2025-05-14-112404

https://gerrit.wikimedia.org/r/1145864

I think we can close this. @DSantamaria , @ecarg , @gengh , is that accurate?

Jdforrester-WMF subscribed.

The linked reproduction step still has 500s in it; is that just caching?

The linked reproduction step still has 500s in it; is that just caching?

I'm not sure! @gengh has more context on this one.

This particular issue has been resolved.
Still some underlying failures (the 500s described by James), related to typed lists handling. This is already detected and being tracked in different tasks.