Lower Parsoid heap limit
Closed, Resolved, Public

Description

As part of the recent switch of changeprop processing from eqiad to codfw, we noticed a temporary dip in Parsoid p99 latencies:

(screenshot: Parsoid p99 latencies around the changeprop switch to codfw)

However, p99 returned to its previous level after a few hours.

I think I have noticed this pattern before. I have also seen heap usage of specific workers grow over time, which did affect latency when usage grew beyond about 1G.

The current Parsoid heap limit is set at 800m, which is fairly high. Service-runner heap limits are only enforced after being breached for several minutes in a row, so they allow for temporary spikes. It is thus not necessary to set the limit to the absolute maximum memory a sane request is expected to consume.
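To illustrate the enforcement behaviour, here is a rough sketch of a consecutive-breach heap check. This is not the actual service-runner code; the check interval, threshold names, and restart hook are placeholders:

```js
// Hypothetical sketch of a consecutive-breach heap check, not service-runner's
// actual implementation. The limit has to be exceeded on several checks in a
// row before the worker restarts, so a short-lived spike is tolerated.
const HEAP_LIMIT_MB = 800;          // the current 800m limit
const BREACHES_BEFORE_RESTART = 3;  // "several minutes in a row"
let consecutiveBreaches = 0;

function gracefulRestart() {
  // Placeholder: a real worker would stop accepting new requests, drain
  // in-flight ones, and then exit so the master forks a fresh worker.
  console.log('heap limit exceeded repeatedly, restarting worker');
  process.exit(0);
}

setInterval(() => {
  const heapUsedMb = process.memoryUsage().heapUsed / (1024 * 1024);
  consecutiveBreaches = heapUsedMb > HEAP_LIMIT_MB ? consecutiveBreaches + 1 : 0;
  if (consecutiveBreaches >= BREACHES_BEFORE_RESTART) {
    gracefulRestart();
  }
}, 60 * 1000); // check once a minute
```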

Reducing this heap limit could reduce p99 latencies by restarting large workers sooner. It might also reduce the rate of non-graceful worker restarts. I would thus propose gradually lowering the configured heap limit, perhaps to 600m first, followed by 500m or 400m.

Event Timeline

I am not convinced about the numbers going back up permanently -- they fluctuate. Let us wait and watch.

P99 latency fluctuation is consistent with heap causing issues, as those workers are eventually killed -- either gracefully by the heap limiter, or hard on watchdog timeout.

Given the low cost of this experiment, I think it is worth trying a lower heap limit and seeing what this does to

  • the frequency of hard worker kills (watchdog timeout),
  • the frequency of graceful heap limit restarts, and
  • p99 parse latency.

If my hunch is correct, we would see

  • less frequent hard worker kills,
  • slightly more frequent graceful worker restarts from the heap limit, and
  • a lowered and more consistent p99 parse latency.

I was mostly suggesting that bad p99 values are either because of large pages or because of contention. With the move to codfw, the latter is mostly removed. But, large pages that generate large DOMs will still have bad p99 numbers.

IIRC, the heap-based restart is only set to kill workers if they exceed the heap limit continuously for N minutes (where N > 2). But request timeouts are set to 2 minutes, and right now in Parsoid we kill the worker anyway on a request timeout. So, a lower heap limit would probably not do much there as far as I can tell. Presumably GC would take care of clearing memory between requests.
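Roughly, the request-timeout path described above looks like the following sketch (hypothetical names and wiring, not the actual Parsoid code):

```js
// Hypothetical sketch of the "kill the worker on request timeout" behaviour,
// not the actual Parsoid code; the 2-minute value matches the comment above.
const REQUEST_TIMEOUT_MS = 2 * 60 * 1000;

function handleParseRequest(req, res, doParse) {
  const timer = setTimeout(() => {
    // The parse ran past the request timeout: fail the request and exit, so
    // the cluster master replaces this worker with a fresh one.
    res.statusCode = 503;
    res.end('parse timed out');
    process.exit(1);               // this is the non-graceful worker restart
  }, REQUEST_TIMEOUT_MS);

  doParse(req).then((html) => {
    clearTimeout(timer);
    res.end(html);
  }, (err) => {
    clearTimeout(timer);
    res.statusCode = 500;
    res.end(String(err));
  });
}
```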

When we switched to service-runner, we increased the limit to get rid of the annoying heap limit exceeded warnings. But, yes, we could see what a lower heap limit would do. I don't fully understand why you think heap size and p99 values are correlated here. I am not saying you are wrong. Just don't see the connection yet.

Garbage collection with large heaps uses more CPU time and causes longer GC pauses. This increases p99 by slowing requests processed in workers with such large heaps.

A lower heap limit restarts those workers gracefully before there is an appreciable slow-down from large heaps.
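As a rough standalone way to see the effect (hypothetical snippet, not Parsoid code; requires running node with --expose-gc), one can time a full collection while the retained heap grows:

```js
// Rough standalone illustration: run with `node --expose-gc` and watch the
// pause for a full collection grow as more objects stay live on the heap.
const retained = [];

function fullGcPauseMs() {
  const start = process.hrtime();
  global.gc();                      // synchronous full collection (--expose-gc)
  const [s, ns] = process.hrtime(start);
  return s * 1e3 + ns / 1e6;
}

for (let round = 1; round <= 5; round++) {
  // Retain on the order of another hundred MB of small objects each round.
  for (let i = 0; i < 1e6; i++) {
    retained.push({ id: i, payload: 'x'.repeat(64) });
  }
  const heapMb = Math.round(process.memoryUsage().heapUsed / 1048576);
  console.log(`heapUsed ~${heapMb} MB, full GC pause ~${fullGcPauseMs().toFixed(1)} ms`);
}
```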

I understand this in the general sense, but I am asking about specifics. For example, this also depends on the specific GC algorithm (I should look up what v8 uses -- I've forgotten). Secondly, I am not sure I would call 800mb a large heap on today's servers. So, I am mostly skeptical that the extra GC work from an additional 400mb of heap can lead to substantial p99 time differences. In any case, we have a non-trivial number of pages with largish DOMs where the extra heap size is going to be used.

But, come January, we could experiment with lowering this to 600mb and see what impact it has. I am not opposed to that.

Do we know why the memory usage is growing in the first place? Did we try to analyse the leak? Maybe it's something really obvious and easily fixable?

In RESTBase, memory usage grows over time too, but my attempts to identify the leaks were not really fruitful, so we never invested in getting RESTBase to stop leaking memory. Maybe Parsoid is different?

I believe the heap growth is caused more by heap fragmentation, codespace (JIT) memory growth, and incremental collection than by an actual JS code leak. This is similar to what we are seeing in RESTBase. A manual full collection tends to bring the memory back down, but is obviously expensive.

In any case, Node 6 will also bring some improvements in the v8 GC (see for example this blog post). Our initial benchmarks showed lower memory usage overall, but we'll see how this actually plays out in production.

@Pchelolo, there is no leak. Large pages can easily be 10MB or larger, whereas the p95 page size is ~200K (see here) ... in DOM form, these pages can easily occupy hundreds of MBs, especially in the earlier stages of DOM processing where we have a lot more meta information.

https://grafana.wikimedia.org/dashboard/db/restbase?panelId=9&fullscreen&from=now-30d&to=now shows that the p99 numbers have continued to fluctuate and have remained well under the 2 min timeout limit and often around 1.3 min.

Looking at https://grafana.wikimedia.org/dashboard/db/restbase?panelId=9&fullscreen&from=now-90d&to=now ... the p99 numbers have been consistently between the 1.3 and 1.7 min mark since the switch to codfw for background processing. Is there any reason to keep pursuing this request? I am inclined to close this ticket.

Sure, why not. :-) 600mb seems like a good number to start with.

Something for @Arlolra or @cscott to try out next week or later. Over the next 6 weeks, I'll be in India working half time and almost always outside the deploy timezone.

Change 345208 had a related patch set uploaded (by Arlolra):
[mediawiki/services/parsoid/deploy@master] T153798: Lower heap limit to 600m

https://gerrit.wikimedia.org/r/345208

Arlolra triaged this task as Medium priority. Mar 28 2017, 9:28 PM

Change 345208 merged by jenkins-bot:
[mediawiki/services/parsoid/deploy@master] T153798: Lower heap limit to 600m

https://gerrit.wikimedia.org/r/345208

Following up on this, so far it does not look like this change noticeably moved the needle on p99 latencies. There was a small dip, but looking at the longer term trend in https://grafana.wikimedia.org/dashboard/db/restbase?panelId=9&fullscreen&orgId=1&from=now-30d&to=now this seems to be all below the noise level.

(screenshot: Parsoid p99 latencies after the heap limit change)

The impact on 5xx (incl. permanent timeout) response rates looks uncertain as well. Lots of noise in that metric, too.

Given the large number of other factors and noise at play here, I suspect that we won't be able to clearly see any big effect even in the longer term. On the plus side, so far it looks like the change didn't hurt either.

Thanks for trying it, anyway!

ssastry claimed this task.