
Translation cache exhaustion caused by changes to PHP code in file scope
Closed, Resolved (Public)

Description

(My understanding of this problem is still inexact, so my language is imprecise. Feel free to edit this task as we gain some clarity.)

Problem statement

  • The broader the scope, the harder it is for a JIT to reason about invariants and to optimize code.
  • HHVM does not handle code in file scope well: it does not optimize it thoroughly, it translates it into the cold translation cache, and it does not always pick up changes to top-scope code.
  • We have a lot of code in file scope. InitialiseSettings.php alone is five times longer than the Bhagavad-Gita.
  • HHVM's translation cache does not have an eviction mechanism, or its eviction mechanism does not work for the cold cache, or it has some unspecified bug.
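
A minimal, purely hypothetical sketch of the distinction the points above describe (the names below are invented; this is not actual wmf-config code):

```php
<?php
// File-scope version: every statement here executes in HHVM's
// "pseudo-main" unit, which is translated into the cold translation
// cache and gives the JIT few invariants to work with.
$wgExampleFeature = true;                       // hypothetical setting
$wgExampleLimit = $wgExampleFeature ? 100 : 10; // derived at file scope

// Function-scope equivalent: the body is a normal compilation unit
// that the JIT can optimize and retranslate cleanly when it changes.
function exampleLimit(bool $featureEnabled): int {
    return $featureEnabled ? 100 : 10;
}
```

The more of our configuration logic has the second shape, the less work lands in pseudo-main.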

Consequences of the above for us are:

  • We don't benefit from HHVM as much as we could, because so much of our code is in file scope. (This is a relatively minor issue.)
  • Certain code changes require that HHVM be restarted before the change is picked up. Changes to StartProfile.php seem particularly susceptible to this.
  • Changes to code in file scope are more likely than other changes to lead to a sharp increase in the size of the translation cache. When the translation cache is full, HHVM restarts with a SIGABRT, aborting any current requests, causing an outage until enough servers have recovered.

Plan of action

Short-term

  • Make it possible to restart HHVM on all app servers without running scap, as @bd808 proposed in T103008: Scap should restart HHVM.
  • Do this at least once on any day in which deployments occur.
  • Avoid deploying changes to StartProfile.php and wikitech.php in quiet hours.

Mid-term

  • Ensure that these issues are reported upstream and poke the HHVM team periodically for status updates.
  • Iterate on the graceful restart procedure until it no longer generates alerts or spikes of 5xxs.

Long-term

Event Timeline

Krenair updated the task description. Jun 25 2015, 6:31 PM
Krenair set Security to None.
Legoktm updated the task description. Jun 25 2015, 6:38 PM

Do you mean:
HHVM's translation cache does *not* have an eviction mechanism, or its eviction mechanism does not work for the cold cache, or it has some unspecified bug.

ori updated the task description. Jun 25 2015, 6:51 PM
ori added subscribers: faidon, tstarling.
Joe added a comment. Jun 25 2015, 7:11 PM

@Manybubbles it does not have an eviction mechanism at all.

Like the pcre cache didn't. I see a common theme!

Avoid deploying changes to StartProfile.php and wikitech.php in quiet hours.

I'm not sure that wikitech.php is in any way a trigger. What seems more likely to me, based on the global-scope hypothesis, is that any scap or sync-* run carries a percentage chance of setting off the TC exhaustion cascade. All sync operations include a post-rsync step that touches InitialiseSettings.php to trigger a rebuild of the local serialized settings cache. This causes each host to reconstruct the global variable caches for each version+wiki combination as they are requested.

Change 220941 had a related patch set uploaded (by BryanDavis):
Add an hhvm-graceful-all command

https://gerrit.wikimedia.org/r/220941

Quiddity updated the task description. Jun 25 2015, 9:32 PM
faidon added a comment (edited). Jun 25 2015, 10:31 PM

Iterate on the graceful restart procedure until it no longer generates alerts or spikes of 5xxs.

This is listed under mid-term, but I think it should be more of a priority. Last time we tried this, we got both internal and external alerts. The internal alert catches big spikes in 500s/503s, while the external alert catches user-visible errors (repeated failures to load a page using Chrome). In other words, this was an actual outage (albeit a short one).

Change 220941 merged by jenkins-bot:
Add an hhvm-graceful-all command

https://gerrit.wikimedia.org/r/220941

mmodell lowered the priority of this task from High to Medium. Jul 6 2015, 6:40 PM
mmodell moved this task from To Triage to In-progress on the Deployments board.
greg added a comment. Sep 24 2015, 6:23 AM

The plan of action in this task is good, but can we get subtasks for the things that still need doing? I, at least, am having a hard time figuring out where we are along this plan.

So what can be done to improve the graceful restart scenario? Does it just need to be less aggressive? Depooling?

Scap has most of the parts needed for restarting HHVM implemented for T103008, but the process for communicating with pybal by messing with the Apache process doesn't work, so we disabled it (T74366). There is a comprehensive list of steps for a safe restart in the comments on T74366.

The open task for fixing the pybal interaction with scap is T104352: Make scap able to depool/repool servers via the conftool API.

ori added a comment. Nov 25 2015, 5:09 PM

Recent releases of HHVM can reclaim TC space.

That's awesome! How difficult do you think it would be to upgrade?

Although we upgraded fleet-wide, I think the tests @ori ran showed a great deal of instability in this functionality.

This issue was fixed in July 2016, released in HHVM 3.15.0.

Can we try again with T99740: Use static php array files for l10n cache instead of CDB?

Gilles raised the priority of this task from Medium to Needs Triage. Dec 7 2016, 7:11 PM
Gilles moved this task from Backlog: Small & Maintenance to Radar on the Performance-Team board.
Krinkle triaged this task as High priority. May 26 2017, 4:59 PM
Joe added a comment. Jul 10 2017, 6:19 AM

@Krinkle sure, we can enable reusing TC in beta for now and test if the feature is stable and working as expected.

Change 364148 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/puppet@production] deployment-prep: enable reusable TC on HHVM

https://gerrit.wikimedia.org/r/364148

My understanding of the problem is as follows:

  • We generally do use the translation cache (compilation cache?), but changes to files in which most statements are in the global scope trigger a memory leak in the translation cache.
  • We do have various files like this (most things in wmf-config), but changes to them are relatively infrequent and only affect a few files at a time. That, plus our regular restarts, keeps things "fine".
  • The localisation cache was too big to be part of this cache while it was actively leaking; trying to include it resulted in OOM and crashed HHVM.

Looking at enable_reusable_tc a bit, it's unclear to me what this option is actually for. I assumed the memory leak we found was a bug, but reading about this option makes me think HHVM just unconditionally holds on to everything it has compiled, even when a newer version of the file has been compiled since. Enabling this option would make it actually start garbage-collecting stale TC data.

Questions:

  • Is the above accurate?
  • Do we (still) perform daily restarts?

That's my understanding as well, and it's also confirmed by the original announcement: http://hhvm.com/blog/9995/hhvm-3-9-0
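
For reference, the announcement suggests this is a single server-configuration switch. A sketch of what enabling it might look like, assuming HHVM's usual ini naming convention and a conventional config path (both are assumptions, not taken from this task):

```ini
; /etc/hhvm/php.ini -- path is an assumption
; Allow HHVM to garbage-collect and reuse stale translation-cache space.
hhvm.enable_reusable_tc = true
```

In production this would presumably be managed through the profile::mediawiki::hhvm Puppet profile referenced in the patches below rather than edited by hand.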

  • Do we (still) perform daily restarts?

We do; HHVM is restarted after three days of uptime or upon reaching 50% memory usage.

Change 364148 merged by Giuseppe Lavagetto:
[operations/puppet@production] deployment-prep: enable reusable TC on HHVM

https://gerrit.wikimedia.org/r/364148

demon added a comment. Oct 23 2017, 5:22 PM

We've been running the reusable TC in beta for over two months, and I don't think we've had any issues. Should we move forward with a limited production test?

Change 414876 had a related patch set uploaded (by Chad; owner: Chad):
[operations/puppet@production] mwdebug1001/1002: enable reusable TC on HHVM

https://gerrit.wikimedia.org/r/414876

Change 414876 merged by Dzahn:
[operations/puppet@production] Enable reusable TC on HHVM on canary appservers

https://gerrit.wikimedia.org/r/414876

Dzahn added a subscriber: Dzahn (edited). Mar 7 2018, 6:54 PM

Applied on mwdebug1001 via Puppet; other canary appservers to follow within 30 minutes.

Change 440823 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/puppet@production] profile::mediawiki::hhvm: enable TC garbage collection everywhere

https://gerrit.wikimedia.org/r/440823

Joe claimed this task. Jun 18 2018, 8:27 AM
Joe added a project: User-Joe.

will merge this change once we're out of the deployment freeze.

Does that actually still make sense at this point? We'll get rid of HHVM in 6-9 months, and we don't have current issues with the TC, while enabling this everywhere could actually expose some subtle bugs. Not opposing it per se, but wondering whether the benefit warrants the potential risks.

@MoritzMuehlenhoff The alternative is to postpone the switch of the MediaWiki localisation cache from CDB to PHP by 9 months (see T99740).

This change is not simple, which means delaying the start of this work would require us either to push it back more than 9 months or to ensure the same resources are available then to monitor and address issues. (Various parts of the deployment workflow may need to adapt to it, and there may be other issues we'll find and address as we go.) Also, it would be nice to finish T99740 sooner than that so that we may start reaping its benefits.

@Krinkle Ok, that piece of context was missing. Makes sense, then.

Joe added a comment. Jun 25 2018, 6:33 AM

It's been active for months on the canaries; I don't see many potential risks, to be honest.

Change 440823 merged by Giuseppe Lavagetto:
[operations/puppet@production] profile::mediawiki::hhvm: enable TC garbage collection everywhere

https://gerrit.wikimedia.org/r/440823

Krinkle closed this task as Resolved. Apr 12 2019, 2:50 PM

Seems fine. We'll find out for sure when we continue work on T99740 for localisation cache.