
Investigate why function evaluation is slow
Closed, Resolved, Public

Description

Via @DVrandecic, a way to reproduce slowness:

  • Create a positive integer type;
  • Define a function successor with an implementation in Python (which returns x+1);
  • Define a function floored predecessor with an implementation in JS (which returns x-1 or 0, whichever is larger);
  • Define an is_zero function;
  • Use these to implement addition via composition: add(x, y) := if(is_zero(y), x, add(successor(x), predecessor(y))) (a plain-JavaScript sketch follows this list)
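
For reference, here is a rough plain-JavaScript sketch of the composition (ordinary functions standing in for the on-wiki implementations; the names and the console.log call are illustrative only, not the actual ZObject definitions):

// Plain-JavaScript stand-ins for the on-wiki functions; illustrative only.
const successor = (x) => x + 1;                  // on-wiki: implemented in Python
const predecessor = (x) => Math.max(x - 1, 0);   // on-wiki: implemented in JS
const isZero = (x) => x === 0;

// add(x, y) := if(is_zero(y), x, add(successor(x), predecessor(y)))
// On-wiki, every recursive step is a full orchestrator evaluation, so the
// cost grows with y even though the plain-JavaScript equivalent is trivial.
const add = (x, y) => (isZero(y) ? x : add(successor(x), predecessor(y)));

console.log(add(2, 3)); // 5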

Event Timeline

@DVrandecic Are you able to dump the JSON literals for the various types, functions, and implementations you are using?

I tried to implement this but it's not clear to me how -- the UI is somewhat broken, and I had to rely on Geno to hand-edit JSON data, which is not ideal.

I've tried to do this again today, and was still not able to get past the first step (creating a custom type) using the UI.

Change 815318 had a related patch set uploaded (by AAssaf; author: AAssaf):

[mediawiki/services/function-orchestrator@master] Add Scott numerals unit tests in the orchestrator.

https://gerrit.wikimedia.org/r/815318

Change 815318 merged by jenkins-bot:

[mediawiki/services/function-orchestrator@master] Add Scott numerals unit tests in the orchestrator.

https://gerrit.wikimedia.org/r/815318

Findings so far:

First, I found that you can bypass npm and the mocha test runner binary using this command-line:

$ node /srv/function-orchestrator/node_modules/mocha/lib/cli/cli.js --timeout 15000 --recursive --fgrep Ackermann

This is preferable to using the wrappers because it allows us to pass command-line options to node and V8, and because it helps us be confident that we’re profiling the right process.

Next, I wanted to know what portion of the test latency is spent on-CPU (executing instructions) vs. off-CPU (blocked on I/O, waiting on locks, etc.). To do this, I used taskset to pin the task to a single CPU core, and ran it under ‘time’:

$ time taskset 0x1 node /srv/function-orchestrator/node_modules/mocha/lib/cli/cli.js --timeout 15000 --recursive --fgrep Ackermann

real	0m9.902s
user	0m9.666s
sys	0m0.228s

user+sys time is very close to real time, which means the test is CPU bound.

The next step was to capture a CPU profile. I tried three different approaches to compare the output. Documenting my process below so it's reproducible.

Getting a profile

Note: I am using nodejs v18.6.0, a newer version than what we're using in production, because it includes profiling improvements I wanted to take advantage of.

Using the built-in profiler
$ node --prof /srv/function-orchestrator/node_modules/mocha/lib/cli/cli.js --timeout 15000 --recursive --fgrep Ackermann
# This produces a log file named isolate-0xXXXXXXX-XXXXX-v8.log.

$ node --prof-process isolate-0x70eb2d0-72829-v8.log > report.txt

I later discovered 0x, a tool that wraps the built-in profiler and can generate flame graphs.

$ npm install -g 0x
$ 0x /srv/function-orchestrator/node_modules/mocha/lib/cli/cli.js --timeout 15000 --recursive --fgrep Ackermann

https://people.wikimedia.org/~ori/aw/flamegraph-0x.html

Using perf

I followed these instructions: https://nodejs.org/en/docs/guides/diagnostics-flamegraph/ , but used Brendan Gregg's flamegraph tools, since the output produced by the graphing library recommended by the nodejs docs did not render correctly.

$ sudo perf record -e cycles:u -g -- node --perf-basic-prof /srv/function-orchestrator/node_modules/mocha/lib/cli/cli.js --timeout 15000 --recursive --fgrep Ackermann
# Note: for a wall-time profile, use -e task-clock instead of cycles:u.
$ perf script > perfs.out
# See https://nodejs.org/en/docs/guides/diagnostics-flamegraph/#filtering-out-node-js-internal-functions :
$ sed -i \
  -e "/( __libc_start| LazyCompile | v8::internal::| Builtin:| Stub:| LoadIC:|\[unknown\]| LoadPolymorphicIC:)/d" \
  -e 's/ LazyCompile:[*~]\?/ /' \
  perfs.out
$ FlameGraph/stackcollapse-perf.pl perfs.out  >| collapsed.out
$ FlameGraph/flamegraph.pl --minwidth=0.3 collapsed.out >| flamegraph-perf.svg

https://people.wikimedia.org/~ori/aw/flamegraph-perf.svg

The output of perf is not very useful here, because the stack traces are a mix of native and JavaScript frames, and the way JavaScript frames are represented is inconsistent.

Using pprof
$ npm install --save pprof
$ node --require pprof /srv/function-orchestrator/node_modules/mocha/lib/cli/cli.js --timeout 15000 --recursive --fgrep Ackermann


  orchestrate
    ✔ orchestrate msw: Scott numeral Ackermann(1, 1) (6544ms)


  1 passing (7s)

Note the test case runtime of 6544ms; that is more meaningful than the total process run time since we don't care about the performance cost of bootstrapping the test runner or the function-orchestrator itself.

pprof produces a file named pprof-profile-NNN.pb.gz (where NNN is the profiled process PID).

# Start the pprof web interface:
$ pprof -http=: pprof-profile-73355.pb.gz

Note that pprof is generating a wall-time profile, but since wall time and CPU time for this test are close, it's not a problem.
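
As an aside, the pprof package also exposes a programmatic API, which can be handy for profiling a specific window rather than the whole process. A minimal sketch based on the package's documented time profiler (the duration and output file name here are arbitrary choices):

// Minimal sketch of the pprof package's programmatic time profiler.
const fs = require('fs');
const pprof = require('pprof');

async function captureProfile() {
    // Collect a wall-time profile for 10 seconds while the process is doing work.
    const profile = await pprof.time.profile({ durationMillis: 10000 });
    // Serialize to a .pb.gz that the pprof CLI / web UI can read.
    const buf = await pprof.encode(profile);
    fs.writeFileSync('wall.pb.gz', buf);
}

captureProfile();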

The disadvantage of using pprof is that its flame graph is not easily extractable from the web UI, so it's harder to share. Here is a screenshot:

flamegraph-pprof.png (screenshot, 1 MB)

I managed to scrape the contents of the web UI using wget -r -np -k --adjust-extension and uploaded the result here:

https://people.wikimedia.org/~ori/aw/pprof/ (directed graph)
https://people.wikimedia.org/~ori/aw/pprof/flamegraph.html

Analysis

pprof shows that 4091ms -- 63% of the test execution time -- is spent in object validation (validateStatus and its children), with the majority of that spent in symbols with names like 'validateN' (where N is some number) that have no source info. The absence of source (file/line) info was a clue that this is dynamically generated JavaScript code (i.e. code created via new Function or eval), which led me to realize that these are validation functions generated at runtime by Ajv.
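
For context, a minimal sketch of how these validators come to exist, using Ajv's standard compile API (the schema here is illustrative, not one of the actual function-schemata schemas):

// Illustrative only: not an actual function-schemata schema.
const Ajv = require('ajv');

const ajv = new Ajv();
const validate = ajv.compile({
    type: 'object',
    properties: { Z1K1: { type: 'string' } },
    required: ['Z1K1'],
});

// ajv.compile() generates JavaScript source at runtime and instantiates it
// via the Function constructor, so profilers see a generated name (the
// validateN symbols above) with no source file/line information.
console.log(validate.name, validate({ Z1K1: 'Z9' }));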

The numbering of the symbols is not related to ZIDs. Each Ajv instance uses a simple counter to construct names for generated functions, so validate9 simply means it's the 9th validation function generated by the Ajv instance; there is no relation to Z9. Unfortunately, each Ajv instance keeps its own counter, and we create four: one each for normal, canonical, and mixed forms, plus one more. This means that there are multiple symbols named validate8, validate9, etc. and the profiles don't distinguish between them.

I remembered that at one point, V8 (the JavaScript engine used in Node) did not optimize dynamically generated code. Ajv supports compiling schemas in advance, so I wondered whether that would produce more efficient validation code. I found this StackOverflow tip for checking whether a function is optimized and patched this code into function-schemata. I was able to confirm that the validation functions are in fact being optimized, so it is unlikely that compiling schemas in advance will improve performance. However, it may help with debugging performance, since we would get source file/line info with profiling samples.
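
For reference, the usual shape of that check relies on V8 natives syntax. This is a sketch of the common technique rather than the exact code I patched in, and the OptimizationStatus bit layout is a V8 internal that can change between versions:

// Run with: node --allow-natives-syntax check-optimized.js
// Sketch of the optimization check; the status bit layout is a V8 internal,
// so treat this as a debugging aid rather than a stable API.
function add(a, b) {
    return a + b;
}

%PrepareFunctionForOptimization(add);
add(1, 2);
%OptimizeFunctionOnNextCall(add);
add(1, 2);

// In current V8 versions, bit 4 of the status bit field indicates the
// function is optimized.
const status = %GetOptimizationStatus(add);
console.log((status & (1 << 4)) !== 0 ? 'optimized' : 'not optimized');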

The next step I plan to take is to add some debugging calls into Ajv, so I can see what types of object validate11 and validate14 are validating.