
Understand APC size increase after HHVM upgrade/restart
Closed, Resolved, Public

Description

Follow-up from T167885: Regression: Backend Save Timing raised by about 50%

Restarting HHVM on a server clears APC, since the cache lives in the same process. While it might be nice to preserve it somehow, I think we've gotten used to this and accepted it. It has also become an unfortunate but regular habit to restart HHVM when APC usage grows out of hand (given that HHVM doesn't do LRU garbage collection and simply lets itself OOM if not enough keys expire in time).

Regardless, the recent HHVM upgrade did *not* cause a memory drop like usual, but rather an increase.

  • Verify that these events are correlated (i.e. not a coincidence). Reproduce by restarting, for example, the mwdebug servers.
  • If true, verify what happens to APC values. Are they preserved? Assuming not, where is the memory going? Why is it not being released?
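One way to make the correlation check above concrete is to compare usage shortly before and shortly after a known restart time. The following is only a minimal sketch: it assumes the memory or APC usage samples have already been exported as (unix timestamp, bytes) pairs, and the function name `restart_effect` and its `window` parameter are hypothetical, not part of HHVM or Grafana.

```python
from statistics import mean


def restart_effect(samples, restart_ts, window=600):
    """Compare mean usage in the `window` seconds before and after a restart.

    samples: iterable of (unix_timestamp, bytes_used) pairs (assumed format).
    Returns (mean_before, mean_after, delta), where delta < 0 means a drop.
    """
    samples = list(samples)
    # Samples in [restart_ts - window, restart_ts)
    before = [v for t, v in samples if restart_ts - window <= t < restart_ts]
    # Samples in [restart_ts, restart_ts + window)
    after = [v for t, v in samples if restart_ts <= t < restart_ts + window]
    if not before or not after:
        raise ValueError("need samples on both sides of the restart")
    mean_before, mean_after = mean(before), mean(after)
    return mean_before, mean_after, mean_after - mean_before
```

A negative delta around each restart would be consistent with APC being cleared; a positive delta would point at something other than retained keys, such as a larger base footprint.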

Event Timeline

Krinkle created this task. Jun 21 2017, 4:20 PM
Restricted Application added a subscriber: Aklapper. Jun 21 2017, 4:20 PM
Gilles assigned this task to Krinkle. Jun 21 2017, 6:55 PM
Gilles triaged this task as Medium priority.

Mentioned in SAL (#wikimedia-operations) [2017-06-29T03:56:26Z] <Krinkle> 'service hhvm restart' on mwdebug1001 and mwdebug1002 to help investigate T168540

Krinkle closed this task as Resolved. Jun 29 2017, 4:26 AM

2017-06-15 08:02 moritzm: updating HHVM on terbium/wasat to 3.18

1GB increase in memory usage immediately following the upgrade:

[Graphs: Upgrade, Month]

2017-06-13 16:35 moritzm: upgrading osmium to HHVM 3.18

[Graphs: Upgrade, Month]

On the other hand, looking at HHVM APC Usage (which is what initiated this investigation) on an individual app server, I do actually see a drop (as one would expect, since an upgrade involves a restart, and as such all cache keys in memory are meant to be lost).

2017-06-13 14:11 moritzm: upgrading mw1299-mw1306 to HHVM 3.18

https://grafana-admin.wikimedia.org/dashboard/db/hhvm-apc-usage-per-server?refresh=5m&orgId=1&from=1497225600000&to=1497528000000&var-server=mw1299

2017-06-29 03:56 Krinkle: 'service hhvm restart' on mwdebug1001 and mwdebug1002 to help investigate T168540
2017-06-29 04:01 Krinkle: 'service hhvm restart' on mwdebug1001 and mwdebug1002 to help investigate T168540

This also shows clearly that APC keys are correctly cleared as part of an HHVM restart.

The overall usage across the cluster also looks to have stabilised since the upgrades/restarts.

https://grafana-admin.wikimedia.org/dashboard/db/hhvm-apc-usage?refresh=5m&orgId=1&from=1494197161210&to=1498710155693

Perhaps the base footprint is a bit higher in the new HHVM, or perhaps we're using APC more in MediaWiki in general, but there doesn't appear to be a problem with restarts not clearing APC. APC usage certainly hasn't doubled, and previous keys are not otherwise being retained.

Resolving this task as such.

Mentioned in SAL (#wikimedia-operations) [2017-06-29T04:33:00Z] <Krinkle> 'service hhvm restart' on mwdebug1001 and mwdebug1002 (T168540)

Krinkle updated the task description. Jun 29 2017, 9:01 PM