Investigate how to fix the performance problems caused by CPU-bound work on the MCS services
Open, High · Public

Description

Discussed in the Platform-Audiences sync as follow-up of T189526.

Because of the heavyweight DOM manipulation happening in the endpoints, some endpoints are being starved: the event loop is blocked for too long, which causes GC problems and memory pressure and worsens the response times of other endpoints that could otherwise have responded earlier.

We should ensure the event loop is never blocked for long stretches, so that the server can respond promptly to requests whose work is already done.

What

Figure out the best way to improve the performance of the MCS server and services

How

Suggested:

  • Investigate different alternatives, benchmark their performance characteristics, and document the impact on the code/development, in order to decide which one we should apply to all the code/services. Suggested approaches to look into:
    • Splitting synchronous CPU-bound algorithms to be asynchronous and not run for longer than X ms, deferring remaining work to nextTick after a certain amount of time.
    • Having a pool of threads for CPU-bound work (DOM manipulations) and a pool for network handling, with some sort of queue to send work to the CPU-bound threads.
Restricted Application added a subscriber: Aklapper.
Jhernandez triaged this task as High priority. · Aug 23 2018, 4:20 PM
Jhernandez moved this task from To Triage to Upcoming on the Readers-Web-Backlog board.
bearND updated the task description. · Aug 23 2018, 4:34 PM

Having a pool of threads for CPU-bound work (DOM manipulations) and a pool for network handling, with some sort of queue to send work to the CPU-bound threads.

That's not easy to do in the Node environment. Node has a libuv thread pool (sized via UV_THREADPOOL_SIZE), but you can't access it without native code. Some libraries exist to offload work to threads, but they require a native dependency and might not actually help with performance: for thread safety, all the data has to be copied on the way out of the main thread and again on the way back in. That would increase memory consumption and GC footprint quite significantly, since we would be copying Parsoid HTML, which can be quite large.

I think the easiest thing to start with is to break long blocking tasks into smaller portions, chaining them as promises or wrapping each in process.nextTick.
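A minimal sketch of this idea, assuming nothing about the mobileapps code (`processInSlices`, `workFn`, and `budgetMs` are illustrative names): do synchronous work until a time budget is exhausted, then yield back to the event loop. Note it uses setImmediate rather than process.nextTick, since nextTick callbacks run before pending I/O and a chain of them can therefore still starve the loop.

```javascript
// Sketch: break a long synchronous loop into time-budgeted slices that
// yield back to the event loop between slices. All names are illustrative.
function processInSlices(items, workFn, budgetMs = 10) {
    return new Promise((resolve) => {
        let i = 0;
        function slice() {
            const start = Date.now();
            // Do synchronous work until the time budget for this slice is spent
            while (i < items.length && Date.now() - start < budgetMs) {
                workFn(items[i++]);
            }
            if (i < items.length) {
                // Yield so pending I/O callbacks can run, then continue
                setImmediate(slice);
            } else {
                resolve();
            }
        }
        slice();
    });
}
```

The trade-off is slightly higher latency for the sliced request itself in exchange for other requests not being starved while it runs.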

Change 465801 had a related patch set uploaded (by Mholloway; owner: Mholloway):
[mediawiki/services/mobileapps@master] [Demo] Break long synchronous processing tasks into promise chains

https://gerrit.wikimedia.org/r/465801

Change 465802 had a related patch set uploaded (by Mholloway; owner: Mholloway):
[mediawiki/services/mobileapps@master] [Demo] Add LRU cache for parsed page DOMs

https://gerrit.wikimedia.org/r/465802

Progress update:

I spent some time going over the codebase and increasing its asynchronicity by promise-wrapping expensive operations (such as domino.createDocument) and replacing series of synchronous processing steps with promise chains. I focused on the summary endpoint, although changes ended up being rather far-reaching due to the amount of shared code involved in summary processing. In the end, this effort turned out to have no measurable impact on the endpoint's performance as measured by ab.

So, I tried another experiment. I knew from profiling that quite a bit of our processing effort is being expended constructing a domino Document object in domino.createDocument, so I tried simply caching the constructed Document with lru-cache, using the etag from the HTML request as the cache key. This turned out to improve performance considerably:

Current:

Document Path:          /en.wikipedia.org/v1/page/summary/Canada
Document Length:        3531 bytes

Concurrency Level:      10
Time taken for tests:   27.998 seconds
Complete requests:      100
Failed requests:        0
Keep-Alive requests:    100
Total transferred:      445100 bytes
HTML transferred:       353100 bytes
Requests per second:    3.57 [#/sec] (mean)
Time per request:       2799.804 [ms] (mean)
Time per request:       279.980 [ms] (mean, across all concurrent requests)
Transfer rate:          15.52 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.2      0       1
Processing:   684 2619 417.8   2715    3109
Waiting:      684 2619 417.9   2715    3109
Total:        684 2619 417.7   2715    3109

Percentage of the requests served within a certain time (ms)
  50%   2715
  66%   2777
  75%   2803
  80%   2835
  90%   2894
  95%   2992
  98%   3094
  99%   3109
 100%   3109 (longest request)

With Document object cache:

Document Path:          /en.wikipedia.org/v1/page/summary/Canada
Document Length:        3531 bytes

Concurrency Level:      10
Time taken for tests:   11.326 seconds
Complete requests:      100
Failed requests:        0
Keep-Alive requests:    100
Total transferred:      445100 bytes
HTML transferred:       353100 bytes
Requests per second:    8.83 [#/sec] (mean)
Time per request:       1132.593 [ms] (mean)
Time per request:       113.259 [ms] (mean, across all concurrent requests)
Transfer rate:          38.38 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.1      0       1
Processing:   434 1030 122.8   1051    1350
Waiting:      434 1030 122.8   1051    1350
Total:        434 1030 122.7   1051    1351

Percentage of the requests served within a certain time (ms)
  50%   1051
  66%   1082
  75%   1100
  80%   1110
  90%   1130
  95%   1159
  98%   1253
  99%   1351
 100%   1351 (longest request)
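The caching approach benchmarked above can be sketched as follows. This is an illustrative sketch, not the actual patch: a minimal Map-based LRU stands in for the lru-cache package, and `parseDocument` is a placeholder for the expensive domino.createDocument call; the etag from the HTML request serves as the cache key.

```javascript
// Minimal Map-based LRU standing in for the lru-cache package (sketch only).
class SimpleLRU {
    constructor(maxEntries) {
        this.max = maxEntries;
        this.map = new Map(); // Map preserves insertion order
    }
    get(key) {
        if (!this.map.has(key)) return undefined;
        const value = this.map.get(key);
        this.map.delete(key); // re-insert to mark as most recently used
        this.map.set(key, value);
        return value;
    }
    set(key, value) {
        if (this.map.has(key)) this.map.delete(key);
        this.map.set(key, value);
        if (this.map.size > this.max) {
            // Evict least recently used entry (first key in insertion order)
            this.map.delete(this.map.keys().next().value);
        }
    }
}

const docCache = new SimpleLRU(50); // capacity is an arbitrary placeholder

// parseDocument is a stand-in for domino.createDocument; the etag of the
// HTML response keys the cache, so a re-render (new etag) misses.
function getDocument(etag, html, parseDocument) {
    let doc = docCache.get(etag);
    if (!doc) {
        doc = parseDocument(html); // the expensive step we skip on a hit
        docCache.set(etag, doc);
    }
    return doc;
}
```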

That said, I doubt we'd see gains this dramatic in production, for a few reasons. First, of course, we wouldn't just be hammering away at a single page and enjoying a 100% cache hit rate; also, we'd expect that many of the requests that reach MCS are in response to page content changes, meaning that cached Document objects from previous renders wouldn't help us. OTOH, most MCS endpoints for both page and feed content include domino.createDocument as a processing step, so at least for popular pages we could expect cached Document objects to be reused at least several times.

Another point to consider is that these Document objects can be rather large (up to 8 or more MB for very large pages, as measured by this method), so we might need to allocate a considerable amount of memory to such a cache in order to see a benefit.

That said, I doubt we'd see gains this dramatic in production, for a few reasons. First, of course, we wouldn't just be hammering away at a single page and enjoying a 100% cache hit rate; also, we'd expect that many of the requests that reach MCS are in response to page content changes, meaning that cached Document objects from previous renders wouldn't help us. OTOH, most MCS endpoints for both page and feed content include domino.createDocument as a processing step, so at least for popular pages we could expect cached Document objects to be reused at least several times.

MCS is only touched if the HTML content has changed - and thus the etag has changed - so the cache will not help. As for sharing between different endpoints, that will probably not work either, because given the number of workers multiplied by the number of nodes, the probability of two requests landing in the same worker is negligible.

The only place where I think it could be beneficial is the feed endpoints - they're not stored, and they touch a small subset of pages over a long period of time, so the probability of a cache hit is much better.

Like @Pchelolo, I also doubt that a cache would help us in production, considering that our service architecture has many workers. Hitting the same page in your ab test is not indicative of production behavior.

@Mholloway I'd be interested to learn more about the results of the promise chains. I think we're not really after absolute performance gains but more responsiveness of workers, so that a worker's main thread is not blocked for too long when it's already processing a large page.

@Pchelolo would adding more workers to MCS be a reasonable course of action?

@Pchelolo would adding more workers to MCS be a reasonable course of action?

Don't think so. We have ncpu workers now, so they don't compete with each other for CPU at the OS level; if we add more workers, they will start competing for CPU.

Thanks @Pchelolo and @bearND. I had been meaning to check on whether each worker was a separate process. Looks like the Document cache idea is a bust.

(Maybe we could take pressure off the main workers by creating a separate, internal-only service within mobileapps, with the sole purpose of parsing pages into Document objects on demand?)

I wasn't sure how to test the effect of the promise chain approach on event loop blockage except as reflected in overall performance. I'd expect the version with the promise chains to have less of a performance hit in response to increasing concurrency, but in fact there appears to be no change.
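One way to observe event loop blockage directly, independent of overall throughput, is to sample timer lag while the benchmark runs. This is a generic sketch (`sampleEventLoopLag` is an illustrative name, not from the mobileapps repo): a timer scheduled every `intervalMs` fires on time when the loop is idle, so any extra delay approximates how long the loop was blocked.

```javascript
// Sketch: measure event-loop lag by sampling how late a repeating timer fires.
function sampleEventLoopLag(intervalMs, durationMs) {
    return new Promise((resolve) => {
        const samples = [];
        let last = Date.now();
        const timer = setInterval(() => {
            const now = Date.now();
            // Lag is how much later than scheduled the timer actually fired
            samples.push(Math.max(0, now - last - intervalMs));
            last = now;
        }, intervalMs);
        setTimeout(() => {
            clearInterval(timer);
            resolve(samples); // e.g. report max or a percentile of these
        }, durationMs);
    });
}
```

Running this inside the service process while ab hammers an endpoint would show the worst-case blockage directly; the promise-chain version should show shorter maximum lag even if mean throughput is unchanged.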

https://phabricator.wikimedia.org/P7662 (current)
https://phabricator.wikimedia.org/P7663 (with processing promise chains)

The -c 10 numbers for the promise chain version do look a bit better in this sample, but I think that only reflects a transient network connection improvement. After running multiple times, I haven't seen the numbers consistently better or worse.

The -c 10 numbers for the promise chain version do look a bit better in this sample, but I think that only reflects a transient network connection improvement. After running multiple times, I haven't seen the numbers consistently better or worse.

Actually, in the numbers you've posted the p95/p99 latency dramatically worsened, from 5s to almost 9s. However, for p95/p99 we need to perform far more than 100 requests.

I've repeated your experiments, but cached the Parsoid and MW API results in memory to eliminate networking. The variant without promisifying the long synchronous processing is faster on average, but the promisified version has a much flatter distribution of request latencies.

Change 465802 abandoned by Mholloway:
[Demo] Add LRU cache for parsed page DOMs

Reason:
Won't help in production.

https://gerrit.wikimedia.org/r/465802

It is normal to either see no change in the numbers or to see slightly higher numbers when the process is more promise-heavy. However, what we are interested in here is the performance of other requests while one request holds the event loop. In that case, results for other requests should be better. It is acceptable for the CPU-heavy request to have slightly higher latency if other requests' latencies improve.

In general, there are two types of promisification we can do: (i) inter-method and (ii) intra-method. The former is what you were testing (putting calls to the next methods on nextTick). This can help, but as you correctly note, most of the work gets done inside the manipulation methods. Hence, we need to find a (reasonable) way of implementing intra-method promisification. The logic would be something like: assess/calculate the amount of transformations/manipulations that needs to be done, then split that over a number of event loop ticks. For example, if divs are being removed, we could say that removing up to 10 of them in a single loop pass is acceptable. Since parsing is usually the most expensive step (especially for long documents), IMHO we should first try to focus on how to promisify that step in a sustainable manner.
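The intra-method idea could look roughly like this (illustrative only: `doc` stands for any DOM-like document such as a domino Document, and `removeNodesInBatches` is not actual mobileapps code). Matched nodes are removed in fixed-size batches with a yield between batches, mirroring the "up to 10 per loop pass" example above:

```javascript
// Sketch: intra-method promisification of a DOM transform. Remove matched
// nodes in fixed-size batches, yielding to the event loop between batches.
const BATCH_SIZE = 10; // mirrors the "up to 10 divs per loop pass" example

function removeNodesInBatches(doc, selector) {
    // Snapshot the matches once; a live collection would shift under removal
    const nodes = Array.from(doc.querySelectorAll(selector));
    return new Promise((resolve) => {
        let i = 0;
        function batch() {
            const end = Math.min(i + BATCH_SIZE, nodes.length);
            for (; i < end; i++) {
                const node = nodes[i];
                if (node.parentNode) node.parentNode.removeChild(node);
            }
            if (i < nodes.length) {
                setImmediate(batch); // let pending I/O run between batches
            } else {
                resolve();
            }
        }
        batch();
    });
}
```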

As for using the threaded model, I agree that using it in Node is not a good approach. We would need to use a different technology/framework to achieve this. This would naturally complicate the architecture, so we need to carefully assess the impact before moving ahead with this idea. For that reason, I don't think we should take it off the table, but we should first see how much we can improve the current situation by modifying the existing code base.

Just saw https://github.com/wikimedia/restbase/pull/1074, and am wondering if that is enough to cover also this task or should we continue promisifying CPU intensive tasks?

Just saw https://github.com/wikimedia/restbase/pull/1074, and am wondering if that is enough to cover also this task or should we continue promisifying CPU intensive tasks?

That PR is only tangentially related to this, the root cause still has to be addressed.

And the fact that it increased the performance of /onthisday 2x suggests there is a lot of room for improvement in MCS itself. We're now making 5 times more RB->MCS requests for that endpoint - so more CPU time is spent making, parsing, and routing each request in MCS and combining the results - yet just by splitting the actual processing across different nodes we get a lot of improvement.

What caused the 5x increase in requests?

We've switched from requesting all the onthisday content at once to requesting each portion separately.

Change 470744 had a related patch set uploaded (by Mholloway; owner: Mholloway):
[mediawiki/services/mobileapps@master] [Demo] [WIP] Use custom, internally promisified createDocument method

https://gerrit.wikimedia.org/r/470744

Since parsing is usually the most expensive step (especially for long documents), IMHO we should first try to focus on how to promisify that step in a sustainable manner.

First stab at internal promisification of the HTML parsing step is in the patch above. Seems fairly reasonable to me, but I'll leave that to the beholder. ;)

@Mholloway Have you produced any flame graphs? Anything worth sharing in regards to that?

Change 471358 had a related patch set uploaded (by Mholloway; owner: Mholloway):
[mediawiki/services/mobileapps@master] [Demo] [WIP] Use custom, internally promisified createDocument method

https://gerrit.wikimedia.org/r/471358

bearND added a comment. (Edited) · Nov 5 2018, 7:30 PM

We've been able to create flame graphs with:

npm install -g autocannon clinic
clinic flame --on-port='autocannon -c1 http://localhost:6927/en.wikipedia.org/v1/page/mobile-sections/Cat' -- node server.js

Overall, 20% of the time is spent in createDocument() and 40% of the time is spent in transforms.stripUnneededMarkup().

The hottest areas inside transforms.stripUnneededMarkup on master (5999b322c), with domino 2.1.0 and the request above, are:

  • *find mobileapps/node_modules/domino/lib/select.js:851:20

Top of Stack: 3.9% (149 of 3775 samples)
On Stack: 11% (414 of 3775 samples)

  • *value mobileapps/node_modules/domino/lib/FilteredElementList.js:57:30

Top of Stack: 3.3% (126 of 3775 samples)
On Stack: 8.2% (311 of 3775 samples)

  • *find mobileapps/node_modules/domino/lib/select.js:851:20

Top of Stack: 1.3% (48 of 3775 samples)
On Stack: 8.6% (325 of 3775 samples)

Change 474768 had a related patch set uploaded (by Mholloway; owner: Mholloway):
[mediawiki/services/mobileapps@master] WIP: Internally promisify page processing steps

https://gerrit.wikimedia.org/r/474768

Change 470744 abandoned by Mholloway:
[Demo] [WIP] Use custom, internally promisified createDocument method

https://gerrit.wikimedia.org/r/470744

Change 465801 abandoned by Mholloway:
[Demo] Break long synchronous processing tasks into promise chains

https://gerrit.wikimedia.org/r/465801

Change 474768 merged by jenkins-bot:
[mediawiki/services/mobileapps@master] Internally promisify page processing steps

https://gerrit.wikimedia.org/r/474768

Mentioned in SAL (#wikimedia-operations) [2018-12-06T18:17:11Z] <mholloway-shell@deploy1001> Started deploy [mobileapps/deploy@1dba3cd]: Internally promisify page processing steps (T202642)

Mentioned in SAL (#wikimedia-operations) [2018-12-06T18:21:04Z] <mholloway-shell@deploy1001> Finished deploy [mobileapps/deploy@1dba3cd]: Internally promisify page processing steps (T202642) (duration: 03m 54s)

Change 471358 merged by jenkins-bot:
[mediawiki/services/mobileapps@master] Use promisified incremental domino.createDocument method

https://gerrit.wikimedia.org/r/471358

Mentioned in SAL (#wikimedia-operations) [2018-12-10T21:14:49Z] <mholloway-shell@deploy1001> Started deploy [mobileapps/deploy@9f4b567]: More internal promisification and other performance tweaks (T202642)

Mentioned in SAL (#wikimedia-operations) [2018-12-10T21:19:06Z] <mholloway-shell@deploy1001> Finished deploy [mobileapps/deploy@9f4b567]: More internal promisification and other performance tweaks (T202642) (duration: 04m 17s)

Mentioned in SAL (#wikimedia-operations) [2018-12-18T19:38:27Z] <mholloway-shell@deploy1001> Started deploy [mobileapps/deploy@963d704]: Parse summaries from lead objects only (T202642)

Mentioned in SAL (#wikimedia-operations) [2018-12-18T19:43:53Z] <mholloway-shell@deploy1001> Finished deploy [mobileapps/deploy@963d704]: Parse summaries from lead objects only (T202642) (duration: 05m 26s)