
Performance review of Wikifunctions
Closed, Resolved (Public)

Description

This is a request for a performance review of Wikifunctions from 1-June-2021 (Tuesday) to 14-June-2021 (Monday), inclusive.

The system is composed of a JavaScript Vue client, MediaWiki PHP middleware, and backend Node.js orchestration/evaluation (n.b., user-defined functions will be supported in Node.js and Python in early releases of Wikifunctions).

https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/extensions/WikiLambda/+/refs/heads/master
https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/services/function-orchestrator/+/refs/heads/master
https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/services/function-evaluator/+/refs/heads/master

Preview environment

It is presently possible to run the MediaWiki extension WikiLambda in MediaWiki-Docker with minimal configuration, and work to wire up fuller orchestration and evaluation is in flight.

Additionally, Lucas Werkmeister has, in his spare time, been running an independent system (which, to be clear, is not the target of evaluation) - https://notwikilambda.toolforge.org/wiki/Main_Page

Which code to review

We're interested in an evaluation of performance across the tiers. Note that we expect this new Wikimedia project's user base, and therefore its computational needs, to be comparatively modest at first.

Performance assessment

Q&A:

  • What work has been done to ensure the best possible performance of the feature? So far, we haven't begun optimization for performance.
  • What are likely to be the weak areas (e.g. bottlenecks) of the code in terms of performance? A snappy first-time user experience (FTUX) in the Vue client-side app, or at least effective management of user expectations for that FTUX; type conformance checking on the JSON user-generated content; and function execution bottlenecks along with in-memory or on-disk pressure points.
  • Are there potential optimisations that haven't been performed yet? Yes.
  • Please list which performance measurements are in place for the feature and/or what you've measured ad-hoc so far. If you are unsure what to measure, ask the Performance Team for advice: performance-team@wikimedia.org. Not much. We do intend to have various profiling measures in place in the future, as profiling will likely inform different types of optimization (e.g., caching and parallelism bounds), and we'll need guards in place to monitor and address yet-to-be-defined SLOs; but this is MVP work.

Event Timeline

dr0ptp4kt updated the task description.
Gilles changed the task status from Open to Stalled. Sep 11 2020, 7:42 AM

Marking this as stalled for my own tracking of performance reviews that are currently actionable for us. Please move this back to open once there is something running for us to review.

Bumping this task. We would like to have a performance readiness review performed in Q4 of FY 2020-2021. The exact requested start date is to be determined and agreed upon within the Abstract Wikipedia team.

Heads up @Gilles, this is the request for the Q4 performance readiness review. As you probably know, the architecture has changed: the WikiLambda extension is (still) in MediaWiki, with an orchestrator/choreographer service in Node.js and programming-language evaluator components behind that. It may change in some ways, although that would likely happen after first launch, depending on real-world observed workload behavior.

We'll need to update the task Description to better reflect reality. But first, I wanted to get this in the queue for Q4. Looking forward to it!

Sounds good to me, thanks for requesting it early!

dr0ptp4kt renamed this task from Performance review of WikiLambda extension to Performance review of Wikifunctions. May 7 2021, 7:10 PM
dr0ptp4kt changed the task status from Stalled to Open.
dr0ptp4kt triaged this task as High priority.
dr0ptp4kt updated the task description.
dr0ptp4kt updated the task description.

@Gilles I touched up the Description a bit. As emailed, we'd like the review to run from the 1st of June (Tuesday) to the 14th of June (Monday), 2021, inclusive.

Thanks!

dr0ptp4kt updated the task description.
dr0ptp4kt updated the task description.

As promised, I've updated the main docker-compose instructions to hopefully cover both the main testing use case and the likely variants you'll encounter, and I've confirmed in a brand-new checkout that it works as I expect, using:

function-orchestrator:
  image: docker-registry.wikimedia.org/wikimedia/mediawiki-services-function-orchestrator:2021-05-26-191706-production
  ports:
    - 6254:6254
function-evaluator:
  image: docker-registry.wikimedia.org/wikimedia/mediawiki-services-function-evaluator:2021-05-27-170058-production
  ports:
    - 6927:6927

Special:CreateZObject page

  • A fair amount of content jumping around can be avoided by matching the no-JS placeholder's sizing to the actual final UI's. This should be optimised for the common case (the user has JS). The same is true for Special:EvaluateFunctionCall.
  • The language list query doesn't appear to be cached at all (Cache-Control: private, must-revalidate, max-age=0 response). I don't expect the list of languages to change that frequently. Even caching this for a few minutes server-side and client-side would help; see the example header after this list.
  • The loaded JS files coming from Ace are uncached as well (mode-json.js, theme-chrome.js, worker-json.js).
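
For concreteness, the kind of cached response header suggested above could look like the following (the exact max-age values are an illustrative suggestion, not a measured requirement):

Cache-Control: public, max-age=300, s-maxage=300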

I don't know why Ace needs a web worker at all, but I didn't see an obvious way to avoid it in their documentation. I guess that odd architecture choice is something we'll have to live with as long as Ace is what's used for syntax highlighting.

Orchestrator and evaluator

I benchmarked the orchestrator and evaluator by hammering the API endpoint with ab. There doesn't seem to be any memory leak in either service.

In terms of performance, the drawback of the single-threaded nature of Node starts to show pretty quickly as concurrency increases. With 10 concurrent requests, the median response time is around 600 ms; as expected for a single thread doing roughly 60 ms of work per request, with 100 concurrent requests it becomes about 6 s.

Having a dynamically sized pool of evaluator services is desirable for the production deployment. I assume that the orchestrator sends requests to the evaluator asynchronously and as a result shouldn't need to be pooled as well.

The issue might also be solvable inside the service, if the actual running of functions is done inside a Web Worker instead of directly on the main thread. But attempting to solve this in-service has downsides: less control over the concurrency, and observability becomes more complex as well.
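
For illustration only, here is a minimal sketch of what off-main-thread execution could look like; in a Node.js service the analogue of a browser Web Worker is the built-in worker_threads module. The function body and payload shape below are invented for the example and are not the service's actual code.

// run-in-worker.js (sketch only; the payload shape is invented for illustration)
const { Worker } = require('worker_threads');

function runFunctionInWorker(code, args) {
  return new Promise((resolve, reject) => {
    // eval: true lets the Worker run the inline string below as its entry script.
    const worker = new Worker(
      `const { parentPort, workerData } = require('worker_threads');
       const fn = new Function('...args', workerData.code);
       parentPort.postMessage(fn(...workerData.args));`,
      { eval: true, workerData: { code, args } }
    );
    worker.once('message', resolve);
    worker.once('error', reject);
  });
}

// CPU-bound user code now runs off the service's main event loop.
runFunctionInWorker('let s = 0; for (let i = 0; i < args[0]; i++) s += i; return s;', [1e7])
  .then(result => console.log(result));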

I would like to test the timeout abilities, but I really can't figure out the UI in order to create something that would call a custom JS function. Is there a step-by-step tutorial I could follow in order to do that?

Hi, Gilles, a few questions/comments.

I assume that the orchestrator sends requests to the evaluator asynchronously and as a result shouldn't need to be pooled as well

Correct, all communication with the orchestrator and evaluator is asynchronous. Out of curiosity, with these concurrent requests, how much time is spent in the orchestrator vs. the evaluator?
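
(Purely as a sketch of how that split could be measured; callEvaluator is a hypothetical stand-in for however the orchestrator actually dispatches to the evaluator, not the real code.)

const { performance } = require('perf_hooks');

// Sketch only: callEvaluator is a hypothetical stand-in for the real dispatch.
async function orchestrate(request, callEvaluator) {
  const t0 = performance.now();
  // ... orchestrator-side work: validation, type checking, composition ...
  const t1 = performance.now();
  const result = await callEvaluator(request); // network + evaluator time
  const t2 = performance.now();
  console.log(`orchestrator: ${(t1 - t0).toFixed(1)} ms, evaluator (incl. network): ${(t2 - t1).toFixed(1)} ms`);
  return result;
}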

The issue might be solvable inside the service as well, if the actual running of functions is done inside a Web Worker instead of directly on the main thread

I did not know about this! Are you recommending this for the evaluator or the orchestrator?

I really can't figure out the UI in order to create something that would call a custom JS function

Do you want to call this from the wiki or insert your request somewhere deeper in the stack (e.g. orchestrator, evaluator)? If you check out the API sandbox, there is an example request that will let you run JS code. As for a step-by-step tutorial ... hmm, there isn't, but there should be. At present, I am happy to discuss this with you if you like!
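
(For reference, a rough sketch of what such a request could look like from outside the sandbox UI. The list name wikilambda_function_call matches the API response shown further down in this thread; the parameter name, port, and ZObject payload are assumptions for illustration, not the documented API.)

// Sketch only: parameter name, port, and payload are assumptions, not the documented API.
const zobject = JSON.stringify({ /* a ZObject describing the function call to run */ });

const params = new URLSearchParams({
  action: 'query',
  list: 'wikilambda_function_call',
  wikilambda_function_call_zobject: zobject, // assumed parameter name
  format: 'json'
});

// Default MediaWiki-Docker URL assumed; adjust to your local setup.
fetch('http://localhost:8080/w/api.php?' + params)
  .then(r => r.json())
  .then(data => console.log(data.query.wikilambda_function_call));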

I haven't checked what's causing the single-threaded behaviour. I presume it's happening in the evaluator, as you're probably running code on the main thread there?

I've tried modifying the python3 API sandbox example, but I suspect it's actually not fully working for me, as the output values don't have the result of the sum. I get:

{
    "query": {
        "wikilambda_function_call": {
            "Orchestrated": {
                "success": "",
                "data": "{\"Z1K1\":\"Z5\",\"Z5K1\":{\"Z1K1\":\"Z402\",\"Z402K1\":{\"Z1K1\":\"Z421\",\"Z421K1\":null}}}"
            }
        }
    }
}

Is this the expected output?

And trying to modify the Python code to call Python's time.sleep() doesn't seem to sleep at all. I have a feeling that the Python code is actually not being run and there's no visible error in the output. Is there an error log somewhere from the evaluator that I could inspect?

Hmm, I also notice that nothing is working at the moment. We've made a bunch of changes to the orchestrator recently, so I assume that's the reason. I will investigate today.

probably running code on the main thread

All code is run within promises, but in the main thread, yes.
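
(As a quick illustration of why promises alone don't help here; this is not code from the service.)

// Illustration only: a promise wrapping CPU-bound work still blocks the event
// loop, because the synchronous loop below runs on the main thread.
function runUserFunction() {
  return new Promise(resolve => {
    let s = 0;
    for (let i = 0; i < 1e9; i++) s += i; // seconds of blocking work
    resolve(s);
  });
}

// While runUserFunction() executes, this timer, and every other request handled
// by the same process, is stalled until the loop finishes.
setTimeout(() => console.log('event loop free again'), 0);
runUserFunction().then(() => console.log('done'));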

error log somewhere from the evaluator

Not yet; that is something I would very much like to have.
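
(For what it's worth, a minimal sketch of the kind of logging that could surface such errors, assuming the evaluator runs user Python code in a child process; that architectural assumption, and everything below, is illustrative rather than a description of the current code.)

// Sketch only: assumes Python user code runs in a child process, which is an
// assumption about the evaluator's architecture, not its current design.
const { spawn } = require('child_process');

function runPython(code) {
  return new Promise((resolve, reject) => {
    const py = spawn('python3', ['-c', code]);
    let out = '', err = '';
    py.stdout.on('data', chunk => { out += chunk; });
    py.stderr.on('data', chunk => { err += chunk; });
    py.on('close', exitCode => {
      // Surface Python tracebacks in the service log instead of dropping them.
      if (err) console.error('[evaluator] python stderr:', err);
      if (exitCode === 0) {
        resolve(out);
      } else {
        reject(new Error(err || 'exit code ' + exitCode));
      }
    });
  });
}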

Krinkle added a subscriber: Gilles.

Following up from off-thread discussion, we'll plan on the following based on what we know today:

  1. Deploying to the Beta Cluster. Work to get there will be happening over the near term.
  2. Sometime later in Q2, a further performance readiness review.
  3. In 2022, after working through pre-launch production considerations, MVP production launch.
Krinkle subscribed.
Krinkle assigned this task to Gilles.
Krinkle edited projects, added Performance-Team; removed Performance-Team (Radar).