Page MenuHomePhabricator

Internal server error when calling function on NLG types
Closed, ResolvedPublicBUG REPORT

Description

Steps to replicate the issue (include links if applicable):

Here is the function call through the API sandbox:

https://wikifunctions.beta.wmflabs.org/wiki/Special:ApiSandbox#action=wikilambda_function_call&format=json&wikilambda_function_call_zobject=%7B%22Z1K1%22%3A%22Z7%22%2C%22Z7K1%22%3A%22Z10143%22%2C%22Z10143K1%22%3A%7B%22Z1K1%22%3A%22Z10139%22%2C%22Z10139K1%22%3A%22111%22%2C%22Z10139K2%22%3A%5B%22Z10138%22%2C%7B%22Z1K1%22%3A%22Z10138%22%2C%22Z10138K4%22%3A%7B%22Z1K1%22%3A%22Z10015%22%2C%22Z10015K1%22%3A%22222%22%7D%2C%22Z10138K5%22%3A%22333%22%2C%22Z10138K6%22%3A%22444%22%7D%5D%7D%7D

The call simply calls a function to convert the input zobject to a string.

What happens?:

Quickly fails with "Internal server errors", without much more info. It also randomly succeeds on some simpler objects, but not always. Not sure if it is flakiness due to performance issues or if it is another problem, e.g. related to this particular input.

Event Timeline

It seems that the orchestrator is getting an invalid response from the MediaWiki API:

invalid json response body at https://wikifunctions.beta.wmflabs.org/w/api.php?action=wikilambda_fetch&format=json&zids=Z7%7CZ9%7CZ8%7CZ17%7CZ6%7CZ12%7CZ14 reason: Unexpected token < in JSON at position 0

(logstash)

If I curl the same URL from the command line on deployment-docker-wikifunctions01, I get back a valid JSON response.

Ok, I hacked in some debugging code to include the HTML body in the response, and it looks like the orchestrator is getting an error page with the message:

Error: 429, Scripted requests from your IP have been blocked, please see https://meta.wikimedia.org/wiki/User-Agent_policy

Need to check what UA string (if any) the orchestrator is sending with API requests.

OK, it looks like the default User-Agent string sent by node-fetch is blocked by Varnish:
https://github.com/wikimedia/puppet/blob/9843300dba/modules/varnish/templates/wikimedia-frontend.vcl.erb#L716-L718
We need to set a custom user-agent string for the orchestrator.

Change 829060 had a related patch set uploaded (by Ori; author: Ori):

[mediawiki/services/function-orchestrator@master] Set custom User-Agent header on requests

https://gerrit.wikimedia.org/r/829060

Change 829060 merged by jenkins-bot:

[mediawiki/services/function-orchestrator@master] Set custom User-Agent header on requests

https://gerrit.wikimedia.org/r/829060

The API Sandbox request in the task description is still failing, but the underlying error is now different:

TypeError: Cannot read properties of null (reading 'validateStatus')
    at GenericSchema.validateStatus (/srv/service/function-schemata/javascript/src/schema.js:514:48)
    at GenericSchema.validateStatus (/srv/service/function-schemata/javascript/src/schema.js:514:49)
    at async BuiltIn.BUILTIN_SCHEMA_VALIDATOR_ [as functor_] (/srv/service/src/builtins.js:449:20)
    at async BuiltIn.execute (/srv/service/src/implementation.js:30:27)
    at async executeInternal (/srv/service/src/execute.js:454:15)
    at async execute (/srv/service/src/execute.js:522:17)
    at async runValidationFunction (/srv/service/src/validation.js:68:9)
    at async returnOnFirstError (/srv/service/src/utils.js:261:17)
    at async validateAsType (/srv/service/src/execute.js:64:9)
    at async Frame.processArgument (/srv/service/src/execute.js:139:25)
    at async Frame.retrieveArgument (/srv/service/src/execute.js:190:30)
    at async Promise.all (index 0)
    at async executeInternal (/srv/service/src/execute.js:442:32)
    at async execute (/srv/service/src/execute.js:522:17)
    at async returnOnFirstError (/srv/service/src/utils.js:261:17)
    at async orchestrate (/srv/service/src/orchestrate.js:118:16)

(Logstash)

That no longer looks like an error that would be specific to the Beta cluster environment. @AAssaf-WMF , can you see if you get the same error locally?

Can we see the function call that was sent? Even just copying the ZObject from expert mode in the UI will help.

Can we see the function call that was sent? Even just copying the ZObject from expert mode in the UI will help.

It's:

{
  "Z1K1": "Z7",
  "Z7K1": "Z10143",
  "Z10143K1": {
    "Z1K1": "Z10139",
    "Z10139K1": "111",
    "Z10139K2": [
      "Z10138",
      {
        "Z1K1": "Z10138",
        "Z10138K4": {
          "Z1K1": "Z10015",
          "Z10015K1": "222"
        },
        "Z10138K5": "333",
        "Z10138K6": "444"
      }
    ]
  }
}

Ah, part of the issue would be the type definitions for Z10139 and Z10138. The numbering of the keys for Z10138 doesn't start a Z10138K1 as it ought to, among other issues. I filed bugs for this last week. You can instead use Z7139 and Z7138, respectively, which are now built-in types.

However, when I make that change locally, I still get an error.

Macro scremcat:

So I will try to debug this until I can provide a working function call.

We can close this. The remaining bugs are all due to either 1) issues with the types defined on the Beta cluster, for which separate Phabricator tasks have been filed, or 2) a bug with argument resolution in the orchestrator, for which a task has also been filed.

Thank you Ori and Cory for looking into this!

ori claimed this task.