Page MenuHomePhabricator

Consider Propagating Remaining Time among Orchestrator and Evaluator Processes
Closed, ResolvedPublic

Description

Description

The orchestrator can receive a timeout when it's first called. When it calls the evaluator, it can supply a timeout of (timeout - time_elapsed_so_far). Likewise, the evaluator can do the same when it makes reentrant calls to the orchestrator.

Desired behavior/Acceptance criteria (returned value, expected error, performance expectations, etc.)

  • have a discussion about this solution!
    • if this solution is decided on, then ...
      • allow each backend service to track time elapsed between being called and calling back to another service
      • timeout propagated should be the difference between the original timeout and time elapsed
      • if time elapsed exceeds timeout, we just straight-up shouldn't call the other service
    • if this solution is rejected, then ...
      • figure out how to obviate fork bombs with reentrant functions

Remove all the non-applicable tags from the "Tags" field, leave only the tags of the projects/repositories related to this task


Completion checklist

  • Extend the orchestrator<->evaluator API to track the remaining time
  • Change the orchestrator to use this new API to adjust time tracking
  • Change the evaluator to use this new API to adjust time tracking
  • Have the terminate evaluation when the timer runs out
  • Allow the starting timer to be configured

Event Timeline

Notes from discussion: Discussed earlier; necessary for re-entrancy, but if we disable re-entrancy in production for launch this isn't needed.

I am making the re-entrancy code dependent on this.

James's estimated effort: 3 days? (if whoever picks this up thinks that will take considerably longer, please check whether we agree on the task)

Change 904275 had a related patch set uploaded (by Cory Massaro; author: Cory Massaro):

[mediawiki/services/function-schemata@master] Add v0.0.4 of Avro schema which includes remainingTime.

https://gerrit.wikimedia.org/r/904275

Change 904383 had a related patch set uploaded (by Cory Massaro; author: Cory Massaro):

[mediawiki/services/function-orchestrator@master] Advance Avro schema semver to 0.0.4 and pass remaining time in calls to evaluator.

https://gerrit.wikimedia.org/r/904383

Change 904275 merged by jenkins-bot:

[mediawiki/services/function-schemata@master] Add v0.0.4 of Avro schema which includes remainingTime.

https://gerrit.wikimedia.org/r/904275

Change 904547 had a related patch set uploaded (by Cory Massaro; author: Cory Massaro):

[mediawiki/services/function-orchestrator@master] Update function-schemata sub-module to HEAD (2600732)

https://gerrit.wikimedia.org/r/904547

Change 904548 had a related patch set uploaded (by Cory Massaro; author: Cory Massaro):

[mediawiki/services/function-evaluator@master] Update function-schemata sub-module to HEAD (2600732)

https://gerrit.wikimedia.org/r/904548

Change 904547 merged by jenkins-bot:

[mediawiki/services/function-orchestrator@master] Update function-schemata sub-module to HEAD (2600732)

https://gerrit.wikimedia.org/r/904547

Change 904548 merged by jenkins-bot:

[mediawiki/services/function-evaluator@master] Update function-schemata sub-module to HEAD (2600732)

https://gerrit.wikimedia.org/r/904548

Change 904648 had a related patch set uploaded (by Cory Massaro; author: Cory Massaro):

[mediawiki/services/function-evaluator@master] Allow function-evaluator to process requests serialized with Avro versions 3 and 4.

https://gerrit.wikimedia.org/r/904648

SO. I think I can get this code-complete with one more day's work. As long as we pass a monotonically decreasing timeout between the orchestrator and evaluator, the function calls are guaranteed to end eventually.

However, I've realized I have no idea how we can test this!

A proper test of this would involve writing a quine or something similar--a function that would produce infinite mutual recursion (henceforth Nastyfunc) between the evaluator and the orchestrator. The cascading timeouts should ensure that, eventually, a function call is made that doesn't manage to execute before the timeout expires. Whether this works or not, all the caller will see is a timeout message. Metadata won't help here: whether we implement cascading timeouts or not, if somebody writes a Nastyfunc, all the metadata could ever tell us, at best, is how many calls will have been made before the top-level orchestrator call returns, which is bound by the top-level timeout.

What we would want is a test that calls the infinitely mutually recursive function, and then counts how many calls to the orchestrator and evaluator are made. I assume we could do this, with difficulty, in Kubernetes? @SDunlap @Jdforrester-WMF

Change 905223 had a related patch set uploaded (by Jforrester; author: Jforrester):

[mediawiki/tools/wikilambda-cli@master] Update function-schemata sub-module to HEAD (adb4e69)

https://gerrit.wikimedia.org/r/905223

Change 905226 had a related patch set uploaded (by Jforrester; author: Jforrester):

[mediawiki/extensions/WikiLambda@master] Update function-schemata sub-module to HEAD (adb4e69)

https://gerrit.wikimedia.org/r/905226

Change 905223 merged by jenkins-bot:

[mediawiki/tools/wikilambda-cli@master] Update function-schemata sub-module to HEAD (adb4e69)

https://gerrit.wikimedia.org/r/905223

Change 905226 merged by jenkins-bot:

[mediawiki/extensions/WikiLambda@master] Update function-schemata sub-module to HEAD (adb4e69)

https://gerrit.wikimedia.org/r/905226

Change 904648 merged by jenkins-bot:

[mediawiki/services/function-evaluator@master] Allow function-evaluator to process requests serialized with Avro versions 3 and 4.

https://gerrit.wikimedia.org/r/904648

Change 904383 merged by jenkins-bot:

[mediawiki/services/function-orchestrator@master] Advance Avro schema semver to 0.0.4 and pass remaining time in calls to evaluator.

https://gerrit.wikimedia.org/r/904383

Change 906629 had a related patch set uploaded (by Cory Massaro; author: Cory Massaro):

[mediawiki/services/function-orchestrator@master] Do not make a recursive call to the orchestrator if timeout has expired.

https://gerrit.wikimedia.org/r/906629

So I've looked back over the orchestrator code and I don't think we need to do anything more here. When the evaluator calls back to the orchestrator, the new function is orchestrated as part of the same original request. While it is worthwhile to pass the timeout to the evaluator to save resources, I don't think we're in any danger of fork-bombs. I've also added an extra safety measure in this patch to refrain from calling orchestrate if the timeout has already expired and will add something similar in the evaluator.

Also, please note that if we ever decide to change this for some reason (e.g., by calling the endpoint itself within the orchestrator), we WILL be at risk of this and will need to implement the cascading timeout thing.

Change 906629 merged by jenkins-bot:

[mediawiki/services/function-orchestrator@master] Do not make a recursive call to the orchestrator if timeout has expired.

https://gerrit.wikimedia.org/r/906629

That was a ride. Thank you for the explanations!