Consider Propagating Remaining Time among Orchestrator and Evaluator Processes
Closed, ResolvedPublic
Actions

Description

The orchestrator can receive a timeout when it's first called. When it calls the evaluator, it can supply a timeout of (timeout - time_elapsed_so_far). Likewise, the evaluator can do the same when it makes reentrant calls to the orchestrator.

Desired behavior/Acceptance criteria (returned value, expected error, performance expectations, etc.)

have a discussion about this solution!
- if this solution is decided on, then ...
  - allow each backend service to track time elapsed between being called and calling back to another service
  - timeout propagated should be the difference between the original timeout and time elapsed
  - if time elapsed exceeds timeout, we just straight-up shouldn't call the other service
- if this solution is rejected, then ...
  - figure out how to obviate fork bombs with reentrant functions

Remove all the non-applicable tags from the "Tags" field, leave only the tags of the projects/repositories related to this task

Completion checklist

Extend the orchestrator<->evaluator API to track the remaining time
Change the orchestrator to use this new API to adjust time tracking
Change the evaluator to use this new API to adjust time tracking
Have the terminate evaluation when the timer runs out
Allow the starting timer to be configured

Before closing this task, review one by one the checklist available here: https://www.mediawiki.org/wiki/Abstract_Wikipedia_team/Definition_of_Done#Back-end_Task/Bug_completion_checklist

Details

Subject	Repo	Branch	Lines +/-
Do not make a recursive call to the orchestrator if timeout has expired.	mediawiki/services/function-orchestrator	master	+8 -4
Advance Avro schema semver to 0.0.4 and pass remaining time in calls to evaluator.	mediawiki/services/function-orchestrator	master	+36 -17
Allow function-evaluator to process requests serialized with Avro versions 3 and 4.	mediawiki/services/function-evaluator	master	+158 -73
Update function-schemata sub-module to HEAD (adb4e69)	mediawiki/extensions/WikiLambda	master	+1 -1
Update function-schemata sub-module to HEAD (adb4e69)	mediawiki/tools/wikilambda-cli	master	+1 -1
Update function-schemata sub-module to HEAD (2600732)	mediawiki/services/function-evaluator	master	+1 -1
Update function-schemata sub-module to HEAD (2600732)	mediawiki/services/function-orchestrator	master	+1 -1
Add v0.0.4 of Avro schema which includes remainingTime.	mediawiki/services/function-schemata	master	+84 -5

Customize query in gerrit

	Title	Reference	Author	Source Branch	Dest Branch
	Parameterize timeout.	repos/abstract-wiki/wikifunctions/function-orchestrator!55	apine	apine-unhack-the-hax	main

Customize query in GitLab

Related Objects
Search...

Status	Subtype	Assigned	Task
Open	BUG REPORT	None	T344206 Special:MathStatus exception error on Wikifunctions
Resolved		Jdforrester-WMF	T342865 Post-creation work for wikifunctionswiki
Resolved		BTullis	T289316 Prepare and check storage layer for Wikifunctions.org (new public content wiki)
Resolved		Jdforrester-WMF	T275945 Create Wikifunctions.org
Resolved		Jdforrester-WMF	T299176 Phase Theta – Root task
Resolved		Jdforrester-WMF	T282913 Implement agreed security and control measures in the Wikifunctions system
Resolved		cmassaro	T323049 Consider Propagating Remaining Time among Orchestrator and Evaluator Processes

Event Timeline

cmassaro created this task.Nov 14 2022, 4:54 PM

cmassaro added a parent task: T282913: Implement agreed security and control measures in the Wikifunctions system.

DVrandecic moved this task from To triage to Phase θ – Throttling on the Abstract Wikipedia team board.Nov 15 2022, 7:36 PM

DVrandecic edited projects, added Abstract Wikipedia team (Phase θ – Throttling); removed Abstract Wikipedia team.

Jdforrester-WMF moved this task from Incoming to Ready: G6. Stable & secure on the Abstract Wikipedia team (Phase θ – Throttling) board.Nov 15 2022, 8:43 PM

Notes from discussion: Discussed earlier; necessary for re-entrancy, but if we disable re-entrancy in production for launch this isn't needed.

I am making the re-entrancy code dependent on this.

James's estimated effort: 3 days? (if whoever picks this up thinks that will take considerably longer, please check whether we agree on the task)

cmassaro claimed this task.Jan 17 2023, 7:43 PM

Change 904275 had a related patch set uploaded (by Cory Massaro; author: Cory Massaro):

[mediawiki/services/function-schemata@master] Add v0.0.4 of Avro schema which includes remainingTime.

https://gerrit.wikimedia.org/r/904275

gerritbot added a project: Patch-For-Review.Mar 29 2023, 7:15 PM

Change 904383 had a related patch set uploaded (by Cory Massaro; author: Cory Massaro):

[mediawiki/services/function-orchestrator@master] Advance Avro schema semver to 0.0.4 and pass remaining time in calls to evaluator.

https://gerrit.wikimedia.org/r/904383

Change 904275 merged by jenkins-bot:

[mediawiki/services/function-schemata@master] Add v0.0.4 of Avro schema which includes remainingTime.

https://gerrit.wikimedia.org/r/904275

cmassaro mentioned this in rMSFSde645647543d: Add v0.0.4 of Avro schema which includes remainingTime..Mar 30 2023, 10:54 AM

Change 904547 had a related patch set uploaded (by Cory Massaro; author: Cory Massaro):

[mediawiki/services/function-orchestrator@master] Update function-schemata sub-module to HEAD (2600732)

https://gerrit.wikimedia.org/r/904547

Change 904548 had a related patch set uploaded (by Cory Massaro; author: Cory Massaro):

[mediawiki/services/function-evaluator@master] Update function-schemata sub-module to HEAD (2600732)

https://gerrit.wikimedia.org/r/904548

Change 904547 merged by jenkins-bot:

[mediawiki/services/function-orchestrator@master] Update function-schemata sub-module to HEAD (2600732)

https://gerrit.wikimedia.org/r/904547

cmassaro mentioned this in rMSFO0f0f977b0a8e: Update function-schemata sub-module to HEAD (2600732).Mar 30 2023, 3:59 PM

Change 904548 merged by jenkins-bot:

[mediawiki/services/function-evaluator@master] Update function-schemata sub-module to HEAD (2600732)

https://gerrit.wikimedia.org/r/904548

cmassaro mentioned this in rMSFE241fc523c1b9: Update function-schemata sub-module to HEAD (2600732).Mar 30 2023, 4:01 PM

Change 904648 had a related patch set uploaded (by Cory Massaro; author: Cory Massaro):

[mediawiki/services/function-evaluator@master] Allow function-evaluator to process requests serialized with Avro versions 3 and 4.

https://gerrit.wikimedia.org/r/904648

SO. I think I can get this code-complete with one more day's work. As long as we pass a monotonically decreasing timeout between the orchestrator and evaluator, the function calls are guaranteed to end eventually.

However, I've realized I have no idea how we can test this!

A proper test of this would involve writing a quine or something similar--a function that would produce infinite mutual recursion (henceforth Nastyfunc) between the evaluator and the orchestrator. The cascading timeouts should ensure that, eventually, a function call is made that doesn't manage to execute before the timeout expires. Whether this works or not, all the caller will see is a timeout message. Metadata won't help here: whether we implement cascading timeouts or not, if somebody writes a Nastyfunc, all the metadata could ever tell us, at best, is how many calls will have been made before the top-level orchestrator call returns, which is bound by the top-level timeout.

What we would want is a test that calls the infinitely mutually recursive function, and then counts how many calls to the orchestrator and evaluator are made. I assume we could do this, with difficulty, in Kubernetes? @SDunlap @Jdforrester-WMF

Change 905223 had a related patch set uploaded (by Jforrester; author: Jforrester):

[mediawiki/tools/wikilambda-cli@master] Update function-schemata sub-module to HEAD (adb4e69)

https://gerrit.wikimedia.org/r/905223

Change 905226 had a related patch set uploaded (by Jforrester; author: Jforrester):

[mediawiki/extensions/WikiLambda@master] Update function-schemata sub-module to HEAD (adb4e69)

https://gerrit.wikimedia.org/r/905226

Change 905223 merged by jenkins-bot:

[mediawiki/tools/wikilambda-cli@master] Update function-schemata sub-module to HEAD (adb4e69)

https://gerrit.wikimedia.org/r/905223

Jdforrester-WMF mentioned this in rMTWC9539cf18c159: Update function-schemata sub-module to HEAD (adb4e69).Apr 3 2023, 3:01 PM

Change 905226 merged by jenkins-bot:

[mediawiki/extensions/WikiLambda@master] Update function-schemata sub-module to HEAD (adb4e69)

https://gerrit.wikimedia.org/r/905226

ReleaseTaggerBot added a project: MW-1.41-notes (1.41.0-wmf.3; 2023-04-03).Apr 3 2023, 4:00 PM

Change 904648 merged by jenkins-bot:

[mediawiki/services/function-evaluator@master] Allow function-evaluator to process requests serialized with Avro versions 3 and 4.

https://gerrit.wikimedia.org/r/904648

cmassaro mentioned this in rMSFEbecb0ff16f8a: Allow function-evaluator to process requests serialized with Avro versions 3….Apr 4 2023, 3:15 PM

Change 904383 merged by jenkins-bot:

[mediawiki/services/function-orchestrator@master] Advance Avro schema semver to 0.0.4 and pass remaining time in calls to evaluator.

https://gerrit.wikimedia.org/r/904383

Jdforrester-WMF mentioned this in rMSFOdc0c12cc3f93: Advance Avro schema semver to 0.0.4 and pass remaining time in calls to….Apr 6 2023, 1:51 PM

Maintenance_bot removed a project: Patch-For-Review.Apr 6 2023, 2:11 PM

Jdforrester-WMF updated the task description. (Show Details)Apr 6 2023, 3:12 PM

Change 906629 had a related patch set uploaded (by Cory Massaro; author: Cory Massaro):

[mediawiki/services/function-orchestrator@master] Do not make a recursive call to the orchestrator if timeout has expired.

https://gerrit.wikimedia.org/r/906629

gerritbot added a project: Patch-For-Review.Apr 6 2023, 4:13 PM

So I've looked back over the orchestrator code and I don't think we need to do anything more here. When the evaluator calls back to the orchestrator, the new function is orchestrated as part of the same original request. While it is worthwhile to pass the timeout to the evaluator to save resources, I don't think we're in any danger of fork-bombs. I've also added an extra safety measure in this patch to refrain from calling orchestrate if the timeout has already expired and will add something similar in the evaluator.

Also, please note that if we ever decide to change this for some reason (e.g., by calling the endpoint itself within the orchestrator), we WILL be at risk of this and will need to implement the cascading timeout thing.

cmassaro moved this task from Ready: G6. Stable & secure to QA / Ready to close on the Abstract Wikipedia team (Phase θ – Throttling) board.Apr 11 2023, 5:32 PM

Change 906629 merged by jenkins-bot:

[mediawiki/services/function-orchestrator@master] Do not make a recursive call to the orchestrator if timeout has expired.

https://gerrit.wikimedia.org/r/906629

cmassaro mentioned this in rMSFO145ea28eaaa3: Do not make a recursive call to the orchestrator if timeout has expired..Apr 12 2023, 8:46 AM

Maintenance_bot removed a project: Patch-For-Review.Apr 12 2023, 9:10 AM

That was a ride. Thank you for the explanations!

apine opened https://gitlab.wikimedia.org/repos/abstract-wiki/wikifunctions/function-orchestrator/-/merge_requests/55

Parameterize timeout.

jforrester merged https://gitlab.wikimedia.org/repos/abstract-wiki/wikifunctions/function-orchestrator/-/merge_requests/55

Parameterize timeout.

Maintenance_bot removed a project: Patch-For-Review.Jul 26 2023, 10:30 PM