
Investigate why function evaluation is slow
Closed, Resolved, Public

Description

Via @DVrandecic, a way to reproduce slowness:

  • Create a positive integer type;
  • Define a function successor with an implementation in Python (which returns x+1);
  • Define a function floored predecessor with an implementation in JS (which returns x-1 or 0, whichever is larger);
  • Define an is_zero function;
  • Use these to implement addition via composition: add(x, y) := if(is_zero(y), x, add(successor(x), predecessor(y))) (a plain-JavaScript sketch follows this list)
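
For reference, here is a rough plain-JavaScript sketch of the composition (ordinary functions standing in for the on-wiki implementations; the names and the console.log call are illustrative only, not the actual ZObject definitions):

// Plain-JavaScript stand-ins for the on-wiki functions; illustrative only.
const successor = (x) => x + 1;                  // on-wiki: implemented in Python
const predecessor = (x) => Math.max(x - 1, 0);   // on-wiki: implemented in JS
const isZero = (x) => x === 0;

// add(x, y) := if(is_zero(y), x, add(successor(x), predecessor(y)))
// On-wiki, every recursive step is a full orchestrator evaluation, so the
// cost grows with y even though the plain-JavaScript equivalent is trivial.
const add = (x, y) => (isZero(y) ? x : add(successor(x), predecessor(y)));

console.log(add(2, 3)); // 5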

Event Timeline

@DVrandecic Are you able to dump the JSON literals for the various types, functions, and implementations you are using?

I tried to implement this but it's not clear to me how -- the UI is somewhat broken, and I had to rely on Geno to hand-edit JSON data, which is not ideal.

I've tried to do this again today, and was still not able to get past the first step (creating a custom type) using the UI.

Change 815318 had a related patch set uploaded (by AAssaf; author: AAssaf):

[mediawiki/services/function-orchestrator@master] Add Scott numerals unit tests in the orchestrator.

https://gerrit.wikimedia.org/r/815318

Change 815318 merged by jenkins-bot:

[mediawiki/services/function-orchestrator@master] Add Scott numerals unit tests in the orchestrator.

https://gerrit.wikimedia.org/r/815318

Findings so far:

First, I found that you can bypass npm and the mocha test runner binary using this command-line:

$ node /srv/function-orchestrator/node_modules/mocha/lib/cli/cli.js --timeout 15000 --recursive --fgrep Ackermann

This is preferable to using the wrappers because it allows us to pass command-line options to node and V8, and because it helps us be confident that we’re profiling the right process.

Next, I wanted to know what portion of the test latency is spent on-CPU (executing instructions) vs. off-CPU (blocked on I/O, waiting on locks, etc.). To do this, I used taskset to pin the task to a single CPU core, and ran it under ‘time’:

$ time taskset 0x1 node /srv/function-orchestrator/node_modules/mocha/lib/cli/cli.js --timeout 15000 --recursive --fgrep Ackermann

real	0m9.902s
user	0m9.666s
sys	0m0.228s

user+sys time is very close to real time, which means the test is CPU bound.

The next step was to capture a CPU profile. I tried three different approaches to compare the output. Documenting my process below so it's reproducible.

Getting a profile

Note: I am using nodejs v18.6.0, a newer version than what we're using in production, because it includes profiling improvements I wanted to take advantage of.

Using the built-in profiler
$ node --prof /srv/function-orchestrator/node_modules/mocha/lib/cli/cli.js --timeout 15000 --recursive --fgrep Ackermann
# This produces a log file named isolate-0xXXXXXXX-XXXXX-v8.log.

$ node --prof-process isolate-0x70eb2d0-72829-v8.log > report.txt

I later discovered 0x, a tool that wraps the built-in profiler and can generate flame graphs.

$ npm install -g 0x
$ 0x /srv/function-orchestrator/node_modules/mocha/lib/cli/cli.js --timeout 15000 --recursive --fgrep Ackermann

https://people.wikimedia.org/~ori/aw/flamegraph-0x.html

Using perf

I followed these instructions: https://nodejs.org/en/docs/guides/diagnostics-flamegraph/ , but used Brendan Gregg's flamegraph tools, since the output produced by the graphing library recommended by the nodejs docs did not render correctly.

$ sudo perf record -e cycles:u -g -- node --perf-basic-prof /srv/function-orchestrator/node_modules/mocha/lib/cli/cli.js --timeout 15000 --recursive --fgrep Ackermann
# Note: for a wall-time profile, use -e task-clock instead of cycles:u.
$ perf script > perfs.out
# See https://nodejs.org/en/docs/guides/diagnostics-flamegraph/#filtering-out-node-js-internal-functions :
$ sed -i \
  -e "/( __libc_start| LazyCompile | v8::internal::| Builtin:| Stub:| LoadIC:|\[unknown\]| LoadPolymorphicIC:)/d" \
  -e 's/ LazyCompile:[*~]\?/ /' \
  perfs.out
$ FlameGraph/stackcollapse-perf.pl perfs.out  >| collapsed.out
$ FlameGraph/flamegraph.pl --minwidth=0.3 collapsed.out >| flamegraph-perf.svg

https://people.wikimedia.org/~ori/aw/flamegraph-perf.svg

The output of perf is not very useful here, because the stack traces are a mix of native and JavaScript frames, and the way JavaScript frames are represented is inconsistent.

Using pprof
$ npm install --save pprof
$ node --require pprof /srv/function-orchestrator/node_modules/mocha/lib/cli/cli.js --timeout 15000 --recursive --fgrep Ackermann


  orchestrate
    ✔ orchestrate msw: Scott numeral Ackermann(1, 1) (6544ms)


  1 passing (7s)

Note the test case runtime of 6544ms; that is more meaningful than the total process run time since we don't care about the performance cost of bootstrapping the test runner or the function-orchestrator itself.

pprof produces a file named pprof-profile-NNN.pb.gz (where NNN is the profiled process PID).

# Start the pprof web interface:
$ pprof -http=: pprof-profile-73355.pb.gz

Note that pprof is generating a wall-time profile, but since wall time and CPU time for this test are close, it's not a problem.
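
As an aside, the pprof package also exposes a programmatic API, which can be handy for profiling a specific window rather than the whole process. A minimal sketch based on the package's documented time profiler (the duration and output file name here are arbitrary choices):

// Minimal sketch of the pprof package's programmatic time profiler.
const fs = require('fs');
const pprof = require('pprof');

async function captureProfile() {
    // Collect a wall-time profile for 10 seconds while the process is doing work.
    const profile = await pprof.time.profile({ durationMillis: 10000 });
    // Serialize to a .pb.gz that the pprof CLI / web UI can read.
    const buf = await pprof.encode(profile);
    fs.writeFileSync('wall.pb.gz', buf);
}

captureProfile();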

The disadvantage of using pprof is that its flame graph is not easily extractable from the web UI, so it's harder to share. Here is a screenshot:

flamegraph-pprof.png (screenshot, 1 MB)

I managed to scrape the contents of the web UI using wget -r -np -k --adjust-extension and uploaded the result here:

https://people.wikimedia.org/~ori/aw/pprof/ (directed graph)
https://people.wikimedia.org/~ori/aw/pprof/flamegraph.html

Analysis

pprof shows that 4091ms -- 63% of the test execution time -- is spent in object validation (validateStatus and its children), with the majority of that spent in symbols with names like 'validateN' (where N is some number) that have no source info. The absence of source (file/line) info was a clue that this is dynamically generated JavaScript code (i.e. code created via new Function or eval), which led me to realize that these are validation functions generated at runtime by Ajv.
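
For context, a minimal sketch of how these validators come to exist, using Ajv's standard compile API (the schema here is illustrative, not one of the actual function-schemata schemas):

// Illustrative only: not an actual function-schemata schema.
const Ajv = require('ajv');

const ajv = new Ajv();
const validate = ajv.compile({
    type: 'object',
    properties: { Z1K1: { type: 'string' } },
    required: ['Z1K1'],
});

// ajv.compile() generates JavaScript source at runtime and instantiates it
// via the Function constructor, so profilers see a generated name (the
// validateN symbols above) with no source file/line information.
console.log(validate.name, validate({ Z1K1: 'Z9' }));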

The numbering of the symbols is not related to ZIDs. Each Ajv instance uses a simple counter to construct names for generated functions, so validate9 simply means it's the 9th validation function generated by the Ajv instance; there is no relation to Z9. Unfortunately, each Ajv instance keeps its own counter, and we create four: one each for normal, canonical, and mixed forms, plus one more. This means that there are multiple symbols named validate8, validate9, etc. and the profiles don't distinguish between them.

I remembered that at one point, V8 (the JavaScript engine used in Node) did not optimize dynamically generated code. Ajv supports compiling schemas in advance, so I wondered whether that would produce more efficient validation code. I found this StackOverflow tip for checking whether a function is optimized and patched this code into function-schemata. I was able to confirm that the validation functions are in fact being optimized, so it is unlikely that compiling schemas in advance will improve performance. However, it may help with debugging performance, since we would get source file/line info with profiling samples.
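
For reference, the usual shape of that check relies on V8 natives syntax. This is a sketch of the common technique rather than the exact code I patched in, and the OptimizationStatus bit layout is a V8 internal that can change between versions:

// Run with: node --allow-natives-syntax check-optimized.js
// Sketch of the optimization check; the status bit layout is a V8 internal,
// so treat this as a debugging aid rather than a stable API.
function add(a, b) {
    return a + b;
}

%PrepareFunctionForOptimization(add);
add(1, 2);
%OptimizeFunctionOnNextCall(add);
add(1, 2);

// In current V8 versions, bit 4 of the status bit field indicates the
// function is optimized.
const status = %GetOptimizationStatus(add);
console.log((status & (1 << 4)) !== 0 ? 'optimized' : 'not optimized');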

The next step I plan to take is to add some debugging calls into Ajv, so I can see what types of object validate11 and validate14 are validating.