Understand APC size increase after HHVM upgrade/restart
Closed, Resolved (Public)

Authored By

	Krinkle
	Jun 21 2017, 4:20 PM

Description

Follow-up from T167885: Regression: Backend Save Timing raised by about 50%

Restarting HHVM on a server clears APC, since the cache lives in the same process. While it might be nice to preserve it somehow, I think we've gotten used to this and accepted it. It has also become an unfortunate but regular habit to restart HHVM when APC usage grows out of hand (given that HHVM doesn't do LRU garbage collection and just lets itself OOM if not enough keys expire in time).
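To illustrate why a cache without LRU eviction can grow until a restart is needed, here is a toy model. It is not HHVM's APC implementation; the class names, TTL, and limits are made up for illustration. It compares a store that only drops entries once their TTL has passed with one that also evicts the least recently used entry at a size cap.

```python
from collections import OrderedDict


class TtlOnlyCache:
    """Toy cache (hypothetical): entries leave only when their TTL passes."""

    def __init__(self):
        self.entries = {}  # key -> (value, expires_at)

    def set(self, key, value, ttl, now):
        self.entries[key] = (value, now + ttl)

    def purge_expired(self, now):
        self.entries = {k: v for k, v in self.entries.items() if v[1] > now}


class LruCache:
    """Toy cache (hypothetical): evicts the least recently used entry at a cap."""

    def __init__(self, max_entries):
        self.max_entries = max_entries
        self.entries = OrderedDict()

    def set(self, key, value):
        self.entries[key] = value
        self.entries.move_to_end(key)
        while len(self.entries) > self.max_entries:
            self.entries.popitem(last=False)  # drop the oldest entry


def simulate(writes=2_000, ttl=86_400, lru_limit=500):
    """Write one new key per simulated second; return (ttl_only_size, lru_size)."""
    ttl_cache = TtlOnlyCache()
    lru_cache = LruCache(lru_limit)
    for now in range(writes):
        key = f"key-{now}"
        ttl_cache.set(key, "x", ttl, now)
        ttl_cache.purge_expired(now)
        lru_cache.set(key, "x")
    return len(ttl_cache.entries), len(lru_cache.entries)
```

With a long TTL the TTL-only store keeps every key written so far, while the capped store stays at its limit. With a short TTL both stay bounded, which matches the "if not enough keys expire in time" condition above.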

Regardless, the recent HHVM upgrade did *not* cause a memory drop like usual, but rather an increase.

Verify that these events are correlated (i.e. not a coincidence). Reproduce by restarting, for example, the mwdebug servers.
If they are, verify what happens to APC values. Are they preserved? Assuming not, where is the memory going? Why is it not being released?
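One way to answer the "are they preserved?" question is to compare the set of APC keys just before a restart with the set just after it. The sketch below assumes two snapshots are already available as dicts mapping key name to size in bytes. How they are collected is left open, and the function and field names are hypothetical. The "after" snapshot would need to be taken before traffic repopulates the cache, otherwise repopulated keys look preserved.

```python
def compare_apc_snapshots(before, after):
    """Summarise two key -> size-in-bytes snapshots taken around a restart."""
    preserved = sorted(set(before) & set(after))
    return {
        "keys_before": len(before),
        "keys_after": len(after),
        "preserved_keys": preserved,
        "bytes_before": sum(before.values()),
        "bytes_after": sum(after.values()),
    }


# Made-up example data, not taken from a real server.
before_restart = {"enwiki:messages:en": 4096, "enwiki:sites": 1024}
after_restart = {}
report = compare_apc_snapshots(before_restart, after_restart)
```

An empty `preserved_keys` list together with a lower `bytes_after` would indicate that the restart cleared the cache as expected.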

Related Objects

Mentioned In: T167885: Regression: Backend Save Timing raised by about 50%
Mentioned Here: T167885: Regression: Backend Save Timing raised by about 50%

Event Timeline

Krinkle created this task. (Jun 21 2017, 4:20 PM)

Restricted Application added a subscriber: Aklapper. (Jun 21 2017, 4:20 PM)

Krinkle mentioned this in T167885: Regression: Backend Save Timing raised by about 50%. (Jun 21 2017, 4:20 PM)

Gilles assigned this task to Krinkle. (Jun 21 2017, 6:55 PM)

Gilles triaged this task as Medium priority.

Gilles moved this task from "Inbox, needs triage" to "To-do: Goals prioritized current Quarter" on the Performance-Team board.

Nikerabbit subscribed. (Jun 22 2017, 5:59 AM)

Mentioned in SAL (#wikimedia-operations) [2017-06-29T03:56:26Z] <Krinkle> 'service hhvm restart' on mwdebug1001 and mwdebug1002 to help investigate T168540

2017-06-15 08:02 moritzm: updating HHVM on terbium/wasat to 3.18

1GB increase in memory usage immediately following the upgrade:

(Graphs: "Upgrade" and "Month" views)

2017-06-13 16:35 moritzm: upgrading osmium to HHVM 3.18

(Graphs: "Upgrade" and "Month" views)

On the other hand, looking at HHVM APC Usage (which is what initiated this investigation) on an individual app server, I do actually see a drop (as one would expect, since an upgrade involves a restart, and as such all cache keys in memory are meant to be lost).

2017-06-13 14:11 moritzm: upgrading mw1299-mw1306 to HHVM 3.18

https://grafana-admin.wikimedia.org/dashboard/db/hhvm-apc-usage-per-server?refresh=5m&orgId=1&from=1497225600000&to=1497528000000&var-server=mw1299
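The `from` and `to` parameters in the dashboard URL are epoch timestamps in milliseconds. A small sketch to decode them and check that the window covers the upgrade logged above (the helper name is made up, and the SAL timestamp is assumed to be UTC):

```python
from datetime import datetime, timezone
from urllib.parse import parse_qs, urlparse


def dashboard_window(url):
    """Return the (start, end) of a dashboard URL's time range as UTC datetimes."""
    query = parse_qs(urlparse(url).query)

    def to_utc(ms):
        return datetime.fromtimestamp(int(ms) / 1000, tz=timezone.utc)

    return to_utc(query["from"][0]), to_utc(query["to"][0])


url = (
    "https://grafana-admin.wikimedia.org/dashboard/db/hhvm-apc-usage-per-server"
    "?refresh=5m&orgId=1&from=1497225600000&to=1497528000000&var-server=mw1299"
)
start, end = dashboard_window(url)
# Upgrade of mw1299-mw1306 per the log entry above, assumed to be UTC.
upgrade = datetime(2017, 6, 13, 14, 11, tzinfo=timezone.utc)
print(start.isoformat(), end.isoformat(), start <= upgrade <= end)
```

The window runs from 2017-06-12 00:00 UTC to 2017-06-15 12:00 UTC, so it includes the mw1299 upgrade.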

Screen Shot 2017-06-28 at 20.52.13.png (722×2 px, 121 KB)

Screen Shot 2017-06-28 at 20.52.09.png (1×2 px, 222 KB)

2017-06-29 03:56 Krinkle: 'service hhvm restart' on mwdebug1001 and mwdebug1002 to help investigate T168540
2017-06-29 04:01 Krinkle: 'service hhvm restart' on mwdebug1001 and mwdebug1002 to help investigate T168540

Screen Shot 2017-06-28 at 21.21.51.png (1×1 px, 145 KB)

This also shows clearly that APC keys are correctly cleared as part of an HHVM restart.

The overall usage across the cluster also looks to have stabilised since the upgrades/restarts.

https://grafana-admin.wikimedia.org/dashboard/db/hhvm-apc-usage?refresh=5m&orgId=1&from=1494197161210&to=1498710155693

Screen Shot 2017-06-28 at 21.24.12.png (920×2 px, 307 KB)

Screen Shot 2017-06-28 at 21.24.19.png (910×2 px, 341 KB)

Perhaps the base footprint is a bit higher in the new HHVM, or perhaps we're using APC more in MediaWiki in general, but there doesn't appear to be a problem with restarts not clearing APC. Usage certainly hasn't doubled, nor is APC otherwise retaining previous keys.

Resolving this task as such.

Mentioned in SAL (#wikimedia-operations) [2017-06-29T04:33:00Z] <Krinkle> 'service hhvm restart' on mwdebug1001 and mwdebug1002 (T168540)

Krinkle updated the task description. (Jun 29 2017, 9:01 PM)

	F8555323: Screen Shot 2017-06-28 at 21.24.12.png
	Jun 29 2017, 4:26 AM

	F8555205: Screen Shot 2017-06-28 at 20.52.13.png
	Jun 29 2017, 4:26 AM

	F8555284: Screen Shot 2017-06-28 at 21.21.51.png
	Jun 29 2017, 4:26 AM

	F8555110: Screen Shot 2017-06-28 at 20.32.21.png
	Jun 29 2017, 4:26 AM

	F8555121: Screen Shot 2017-06-28 at 20.37.57.png
	Jun 29 2017, 4:26 AM

	F8555099: Screen Shot 2017-06-28 at 20.30.18.png
	Jun 29 2017, 4:26 AM

	F8555325: Screen Shot 2017-06-28 at 21.24.19.png
	Jun 29 2017, 4:26 AM

	F8555206: Screen Shot 2017-06-28 at 20.52.09.png
	Jun 29 2017, 4:26 AM
