
Understand APC size increase after HHVM upgrade/restart
Closed, Resolved (Public)

Description

Follow-up from T167885: Regression: Backend Save Timing raised by about 50%

Restarting a server causes APC to be cleared, since APC is part of the same process. While it might be nice to preserve it somehow, I think we've gotten used to this and accepted it. It has also become an unfortunate but regular habit to restart HHVM when APC usage grows out of hand (given that HHVM doesn't do LRU garbage collection and just lets itself OOM if not enough keys expire in time).

Regardless, the recent HHVM upgrade did *not* cause a memory drop like usual, but rather an increase.

  • Verify that these events are correlated (i.e. not a coincidence). Reproduce by restarting e.g. mwdebug servers.
  • If they are, verify what happens to APC values. Are they preserved? Assuming not, where is the memory going? Why is it not being released?
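
One way to make the first check concrete is to compare average memory usage on either side of a restart. The following is a minimal sketch only: the `memory_shift` helper, the `(timestamp, value)` sample format, and the numbers are all hypothetical and are not taken from the dashboards in this task.

```python
from datetime import datetime, timezone
from statistics import mean


def memory_shift(samples, event_time):
    """Return (mean_before, mean_after, delta) around event_time.

    samples: iterable of (datetime, memory_used) pairs.
    This input format is an assumption made for illustration.
    """
    before = [value for ts, value in samples if ts < event_time]
    after = [value for ts, value in samples if ts >= event_time]
    if not before or not after:
        raise ValueError("need samples on both sides of the event")
    mean_before, mean_after = mean(before), mean(after)
    return mean_before, mean_after, mean_after - mean_before


# Made-up numbers (arbitrary units), only to show the shape of the check.
restart = datetime(2017, 6, 29, 3, 56, tzinfo=timezone.utc)
samples = [
    (datetime(2017, 6, 29, 3, 50, tzinfo=timezone.utc), 900),
    (datetime(2017, 6, 29, 3, 55, tzinfo=timezone.utc), 1100),
    (datetime(2017, 6, 29, 4, 0, tzinfo=timezone.utc), 200),
    (datetime(2017, 6, 29, 4, 5, tzinfo=timezone.utc), 300),
]
before, after, delta = memory_shift(samples, restart)
```

A negative `delta` would match the usual "restart clears APC" pattern; a positive one would match the increase described above. Repeating the comparison over several restarts is what would separate correlation from coincidence.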

Event Timeline

Mentioned in SAL (#wikimedia-operations) [2017-06-29T03:56:26Z] <Krinkle> 'service hhvm restart' on mwdebug1001 and mwdebug1002 to help investigate T168540

2017-06-15 08:02 moritzm: updating HHVM on terbium/wasat to 3.18

1GB increase in memory usage immediately following the upgrade:

Upgrade: Screen Shot 2017-06-28 at 20.30.18.png (420×1 px, 48 KB)
Month: Screen Shot 2017-06-28 at 20.32.21.png (410×1 px, 42 KB)

2017-06-13 16:35 moritzm: upgrading osmium to HHVM 3.18

Upgrade: Screen Shot 2017-06-28 at 20.37.47.png (392×1 px, 48 KB)
Month: Screen Shot 2017-06-28 at 20.37.57.png (390×1 px, 44 KB)

On the other hand, looking at HHVM APC Usage (which is what initiated this investigation) on an individual app server, I do actually see a drop (as one would expect, since an upgrade involves a restart, and as such all cache keys in memory are meant to be lost).

2017-06-13 14:11 moritzm: upgrading mw1299-mw1306 to HHVM 3.18

https://grafana-admin.wikimedia.org/dashboard/db/hhvm-apc-usage-per-server?refresh=5m&orgId=1&from=1497225600000&to=1497528000000&var-server=mw1299
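
For reference, the `from` and `to` parameters in the dashboard URL are epoch timestamps in milliseconds. The sketch below decodes them to confirm that the window brackets the mw1299 upgrade. The `grafana_window` helper is hypothetical, and treating the SAL timestamp as UTC is an assumption.

```python
from datetime import datetime, timezone
from urllib.parse import parse_qs, urlparse


def grafana_window(url):
    """Return the (start, end) of a Grafana time range as UTC datetimes."""
    query = parse_qs(urlparse(url).query)

    def to_utc(millis):
        # Grafana encodes absolute time ranges as epoch milliseconds.
        return datetime.fromtimestamp(int(millis) / 1000, tz=timezone.utc)

    return to_utc(query["from"][0]), to_utc(query["to"][0])


url = (
    "https://grafana-admin.wikimedia.org/dashboard/db/hhvm-apc-usage-per-server"
    "?refresh=5m&orgId=1&from=1497225600000&to=1497528000000&var-server=mw1299"
)
start, end = grafana_window(url)

# SAL entry for the mw1299-mw1306 upgrade, assumed to be logged in UTC.
upgrade = datetime(2017, 6, 13, 14, 11, tzinfo=timezone.utc)
window_covers_upgrade = start <= upgrade <= end
```

The window runs from 2017-06-12 00:00 to 2017-06-15 12:00 UTC, so the graph covers roughly a day and a half before the upgrade and two days after it.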

Screen Shot 2017-06-28 at 20.52.13.png (722×2 px, 121 KB)
Screen Shot 2017-06-28 at 20.52.09.png (1×2 px, 222 KB)

2017-06-29 03:56 Krinkle: 'service hhvm restart' on mwdebug1001 and mwdebug1002 to help investigate T168540
2017-06-29 04:01 Krinkle: 'service hhvm restart' on mwdebug1001 and mwdebug1002 to help investigate T168540

Screen Shot 2017-06-28 at 21.21.51.png (1×1 px, 145 KB)

This also shows clearly that APC keys are correctly cleared as part of an HHVM restart.

The overall usage across the cluster also looks to have stabilised since the upgrades/restarts.

https://grafana-admin.wikimedia.org/dashboard/db/hhvm-apc-usage?refresh=5m&orgId=1&from=1494197161210&to=1498710155693

Screen Shot 2017-06-28 at 21.24.12.png (920×2 px, 307 KB)

Screen Shot 2017-06-28 at 21.24.19.png (910×2 px, 341 KB)

Perhaps the base footprint is a bit higher in the new HHVM, or perhaps MediaWiki is using APC more in general. Either way, there doesn't appear to be a problem with restarts not clearing APC: usage certainly hasn't doubled, and previous keys are not being retained.

Resolving this task as such.

Mentioned in SAL (#wikimedia-operations) [2017-06-29T04:33:00Z] <Krinkle> 'service hhvm restart' on mwdebug1001 and mwdebug1002 (T168540)