
Avoid php-opcache corruption in WMF production
Open, High, Public

Description

Background

The main tracking task for the php-opcache corruption we saw in production was T224491. This was closed in September after various workarounds were put in place that reduced the chances of corruption happening.

In a nutshell:

  • For PHP to perform acceptably, we need a compilation cache. This is enabled by default and is called opcache. Similar systems existed in HHVM and PHP 5 as well; it is not new. It translates the .php files from disk into a parsed and optimised machine-readable format, stored in RAM.
  • Our current deployment model is based on changing files directly on disk (Scap and rsync), and does not involve containers, new servers, or servers being depooled/re-pooled. Instead, servers remain pooled and live-serving throughout.
  • Updates to files are picked up by PHP by comparing the mtime of files on disk, if more than a configurable number of seconds have passed since the last check.
  • When it finds such an update, it recompiles the file on the fly and adds it to memory. It does not remove or replace the existing entry, as another ongoing request might still be using it.
  • In addition to not removing or replacing entries on demand, there is also no background process or other garbage-collection concept in place. Instead, the cache grows indefinitely until it runs out of memory, at which point it is forced to reset the opcache and start from scratch. When this happens, php7-opcache falls over and causes unrecoverable corruption to the interpreted source code. (Unrecoverable meaning it is not temporary or self-correcting; any corruption that occurs tends to be sticky until a human restarts the server. See the sketch below.)
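
A minimal sketch, in Python and with all specifics assumed rather than taken from our tooling, of how this memory pressure can be observed on a single server; opcache_get_status() is the real PHP API, while the CLI invocation is only a stand-in for querying the php-fpm pool itself:

```python
#!/usr/bin/env python3
"""Sketch: report opcache memory headroom on one server (assumptions noted above)."""
import json
import subprocess

PHP_SNIPPET = "echo json_encode(opcache_get_status(false));"

def opcache_status() -> dict:
    # Ask PHP itself for the opcache state; on a real app server you would query
    # the php-fpm pool rather than the CLI SAPI, which has its own cache.
    out = subprocess.run(["php", "-r", PHP_SNIPPET],
                         check=True, capture_output=True, text=True).stdout
    return json.loads(out)

if __name__ == "__main__":
    status = opcache_status()
    mem = status["memory_usage"]
    stats = status["opcache_statistics"]
    total = mem["free_memory"] + mem["used_memory"] + mem["wasted_memory"]
    print(f"free opcache memory: {100 * mem['free_memory'] / total:.1f}%")
    print(f"cached keys: {stats['num_cached_keys']}/{stats['max_cached_keys']}")
    print(f"out-of-memory resets so far: {stats['oom_restarts']}")
```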

What we did to close T224491:

  • A cronjob is active on all MW app servers that checks every few minutes whether opcache is close to running out of memory. If it is, we try to prevent corruption by automatically depooling the server, restarting php-fpm cleanly while it has no live traffic (and thus, presumably, no way to trigger the race condition that causes the corruption), and then repooling it (sketched below this list). This cronjob has Icinga alerting on it, and it is spread out so that we don't restart "too many" servers at once.
  • The Scap deployment tool invokes the same script as the cronjob to perform this restart around deployments, so that if we know we are close to running out of memory we don't wait for traffic to grow memory usage against the new source code, but catch it proactively.
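
A hedged sketch of that depool/restart/repool cycle; the depool/pool commands, the php7.2-fpm unit name, the drain time, and the 10% threshold are assumptions standing in for the real production script:

```python
#!/usr/bin/env python3
"""Sketch of the safe-restart cycle (not the real production script).

Assumed: `depool`/`pool` commands that toggle this host in the load
balancer, a `php7.2-fpm` systemd unit, and a 10% free-memory threshold.
"""
import json
import subprocess
import time

FPM_UNIT = "php7.2-fpm"    # assumed unit name
FREE_THRESHOLD_PCT = 10.0  # restart before opcache actually runs out

def free_opcache_pct() -> float:
    out = subprocess.run(
        ["php", "-r", "echo json_encode(opcache_get_status(false));"],
        check=True, capture_output=True, text=True).stdout
    mem = json.loads(out)["memory_usage"]
    total = mem["free_memory"] + mem["used_memory"] + mem["wasted_memory"]
    return 100.0 * mem["free_memory"] / total

def safe_restart() -> None:
    subprocess.run(["depool"], check=True)              # stop receiving live traffic
    time.sleep(30)                                      # let in-flight requests drain
    subprocess.run(["systemctl", "restart", FPM_UNIT], check=True)
    subprocess.run(["pool"], check=True)                # back into rotation

if __name__ == "__main__":
    if free_opcache_pct() < FREE_THRESHOLD_PCT:
        safe_restart()
```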
Status quo

We still see corruptions from time to time. New ones are now tracked at T245183.

We are somewhat stuck, because any major deployment or other significant temporary or indefinite utilisation of opcache (e.g. T99740) should involve a php-fpm restart to be safe, yet we can't easily do a rolling restart because:

  • Live traffic has to go somewhere, so we can't restart all at once.
  • If we don't restart all at once, that means we have to do a slow rolling one.
  • Which means deployments would take 15 minutes or longer. This would be a huge increase compared to the 1-2 minutes it takes today (a back-of-the-envelope calculation follows this list).
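
With assumed numbers, chosen only to show where a ~15-minute figure can come from:

```python
import math

def rolling_restart_minutes(servers: int, batch_size: int, per_batch_s: float) -> float:
    """Wall-clock time when batches run strictly one after another."""
    batches = math.ceil(servers / batch_size)
    return batches * per_batch_s / 60.0

# e.g. ~300 appservers, 10% (30) at a time, ~90s to drain + restart + repool each batch:
print(rolling_restart_minutes(servers=300, batch_size=30, per_batch_s=90))  # -> 15.0
```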
Ideas
  1. Do a restart for all deploys. Take the hit on deploy time and/or focus on ways to reduce it.
    • Method: Memory is controlled by not building up stale copies of old code.
    • Benefit: Easy to add.
    • Downside: We keep all the cronjob and scap complexity we have today.
  2. Spawn fresh php-fpm instances for each deploy.
    • Method: Some kind of socket transfer for live traffic. Automatic opcache updates would be disabled, so memory can't grow indefinitely. (One possible mechanism is sketched after this list.)
    • Benefit: Fast deployment. Relatively clean and easy to reason about.
    • Benefit: We get to remove the complexity we have today.
    • Benefit: We get to prepare and re-use what we learn here for the direction of "MW on containers".
    • Downside: Non-trivial to build.
    • Downside: More disruption to APCu lifetime. We may need to do one of T244340 or T248005 first in that case.
  3. Get the php-opcache bug(s) fixed upstream.
    • Method: Contractor?
    • Benefit: Fast deployment. Relatively clean and easy to reason about.
    • Benefit: We get to remove the complexity we have today.
    • Downside: Unsure if php-opcache is beyond fixing.
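
For idea 2, one possible reading of "socket transfer" is to let two generations of php-fpm share the same listening port via SO_REUSEPORT, so that a freshly started instance (new code, empty opcache) accepts new connections while the old one drains and exits. That is an assumption about the mechanism, not an agreed design; systemd socket activation or handing the listening fd over SCM_RIGHTS would be alternatives. A minimal Python illustration of the shared-port part:

```python
import socket

def listen_shared(port: int) -> socket.socket:
    """Open a listening socket that another process generation can also bind."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)  # both generations bind the same port
    s.bind(("127.0.0.1", port))
    s.listen(128)
    return s

# The old and new server generations each call listen_shared(9000); the kernel
# spreads incoming connections across both until the old generation stops
# listening and exits, after which all new connections land on the new one.
```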

Related Objects

Event Timeline

Krinkle created this task. · May 26 2020, 6:21 PM
Restricted Application added a subscriber: Aklapper. · May 26 2020, 6:21 PM
  1. Do a restart for all deploys. Take the hit on deploy time and/or focus on ways to reduce it.

The current estimate for the Scap rolling restart is 15 minutes. Is there low hanging fruit for reducing this?

I believe we currently do this in batches of N servers at once, where the next batch starts after the previous is fully finished. Could this be optimised by letting the batches overlap? E.g. rather than chunks of N, we'd have at most N undergoing a restart at once. More "rolling". I heard some ideas on IRC also involving PoolCounter, but a local variable on the deployment server could perhaps work as well.

We currently don't have coordination from Scap with the cronjobs (which could interfere), so PoolCounter could be used to ensure coordination with that. E.g. we'd have at most N servers in a DC undergoing restart, each server would take care of it locally, and Scap would just invoke the script on all servers, with each one waiting as needed until it's done. Another way could be to communicate with PyBal instead and base it on ensuring a minimum number of pooled servers (as opposed to ensuring a maximum of depooled servers). I suppose there can be race conditions there though, so maybe PoolCounter is the better way.
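
A sketch of the "at most N depooled per DC" coordination, run locally on each server; the status URL and its JSON shape are hypothetical, and the check-then-depool gap is exactly the race condition mentioned, which PoolCounter-style locking would close:

```python
import json
import time
import urllib.request

MAX_DEPOOLED = 10  # at most N servers in the DC restarting at any one time
STATUS_URL = "http://config-master.example/pools/appserver.json"  # hypothetical endpoint

def depooled_count() -> int:
    with urllib.request.urlopen(STATUS_URL, timeout=5) as resp:
        servers = json.load(resp)  # assumed shape: {"mw1": "pooled", "mw2": "depooled", ...}
    return sum(1 for state in servers.values() if state != "pooled")

def wait_for_slot() -> None:
    # Every server runs the same script; Scap just invokes it everywhere and
    # each host waits here until enough peers have finished restarting.
    while depooled_count() >= MAX_DEPOOLED:
        time.sleep(10)

wait_for_slot()
# ...then depool, restart php-fpm, repool, as in the Description.
# NB: the check above races with other hosts doing the same, hence the
# suggestion that PoolCounter may be the better primitive.
```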

  2. Spawn fresh php-fpm instances for each deploy.

What would it take to do this?

  3. Get the php-opcache bug(s) fixed upstream.

[…]
Downside: Unsure if php-opcache is beyond fixing.

TODO: Ref upstream tickets and determine if there is any hope.

  1. Do a restart for all deploys. Take the hit on deploy time and/or focus on ways to reduce it.

I would like to do this -- it seems like it would solve a long-tail of issues that we've whack-a-mole'd -- but I have reservations about it.

The current estimate for the Scap rolling restart is 15 minutes. Is there low hanging fruit for reducing this?

My major worry isn't that deploys themselves take 15 minutes -- it's that rollbacks would also take 15 minutes. Waiting 15 minutes for a bad deploy to propagate and then another 15 minutes for it to be fixed is problematic (cf. T244544).

The current paradigm of deployments involves rolling forward to individual groups of wikis as a way to de-risk a given deployment. We don't have another way to gain that confidence currently. Without that confidence we need a fast rollback mechanism. Maybe making rollbacks faster is easier to solve than making every deployment fast (?) -- not that I know how to do that exactly :)

I believe we currently do this in batches of N servers at once, where the next batch starts after the previous is fully finished. Could this be optimised by letting the batches overlap? E.g. rather than chunks of N, we'd have at most N undergoing a restart at once. More "rolling". I heard some ideas on IRC also involving PoolCounter, but a local variable on the deployment server could perhaps work as well.

The way this is currently implemented is all the groups of servers (jobrunner, appserver, appserver_api, testserver) are restarted in parallel using a pool of workers. Not more than 10% of a given group is restarted/depooled at the same time. @Joe could probably speak to the actual depooling piece.
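
Expressed as code, that arrangement looks roughly like the sketch below; the host lists and the restart command are placeholders, and only the per-group 10% cap and the parallel groups come from the description above:

```python
from concurrent.futures import ThreadPoolExecutor
import math
import subprocess

# Placeholder host lists; in reality these come from the server inventory.
GROUPS = {
    "appserver": ["mw1", "mw2", "mw3"],
    "appserver_api": ["mwapi1", "mwapi2"],
    "jobrunner": ["jr1"],
    "testserver": ["test1"],
}

def restart_one(host: str) -> None:
    # Stand-in for the per-server depool/restart/repool script.
    subprocess.run(["ssh", host, "sudo systemctl restart php7.2-fpm"], check=True)

def restart_group(hosts: list[str]) -> None:
    cap = max(1, math.floor(0.10 * len(hosts)))  # never more than 10% of a group at once
    with ThreadPoolExecutor(max_workers=cap) as pool:
        list(pool.map(restart_one, hosts))

if __name__ == "__main__":
    # All groups proceed in parallel, each capped independently.
    with ThreadPoolExecutor(max_workers=len(GROUPS)) as outer:
        list(outer.map(restart_group, GROUPS.values()))
```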

ori added a subscriber: ori. (Edited) · May 27 2020, 10:26 PM

A contractor is going to ask for the same thing that upstream is asking for: a reproduction case. There's not much to go on without one.

Have you tried depooling an appserver, disabling the workarounds on that server, turning on opcache.protect_memory=1, setting low limits for opcache, and hitting the server with synthetic traffic (replay GET requests from production logs using ab)?
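
A sketch of that kind of harness, assuming a depooled server configured with opcache.protect_memory=1 and deliberately low limits (e.g. opcache.memory_consumption=32 and opcache.max_accelerated_files=2000, both real INI settings), plus a file of GET paths sampled from production logs; the host header, file name, and concurrency are illustrative:

```python
import concurrent.futures
import urllib.error
import urllib.request

TARGET = "http://localhost"  # the depooled app server under test
CONCURRENCY = 50

def hit(path: str) -> int:
    req = urllib.request.Request(TARGET + path, headers={"Host": "en.wikipedia.org"})
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            resp.read()
            return resp.status
    except urllib.error.HTTPError as e:
        return e.code  # 5xx responses surface as HTTPError

def replay(log_path: str) -> None:
    with open(log_path) as f:
        paths = [line.strip() for line in f if line.startswith("/")]
    with concurrent.futures.ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
        errors = sum(1 for status in pool.map(hit, paths) if status >= 500)
    print(f"{errors} server errors out of {len(paths)} requests")

if __name__ == "__main__":
    replay("sampled-get-paths.txt")  # hypothetical file of request paths
```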

CDanis added a subscriber: CDanis. · May 27 2020, 11:09 PM
Krinkle updated the task description. · Jun 2 2020, 2:42 PM
Krinkle triaged this task as High priority. · Mon, Jun 22, 7:10 PM
Krinkle edited projects, added Performance-Team; removed Performance-Team (Radar).
Krinkle moved this task from Inbox to Blocked or Needs-CR on the Performance-Team board.
jijiki added a subscriber: jijiki. · Wed, Jul 1, 2:31 PM