
Reproduce opcache corruptions in production
Closed, ResolvedPublic

Description

We have been suffering from opcache corruptions since we started rolling out php-fpm to production. We have managed to limit how often they occur by restarting php-fpm on a server based on how much free opcache memory is available. This is working well, but sadly we still have corruptions every now and then, which appear (for now) to be random.

In T253673 we discussed that it would help if we were able to reproduce this corruption. Understanding that this is hard to do, we can at least try.

With all of our anti-corruption measures disabled, we will replay a few hundred thousand GET requests (from production logs) under the following scenarios:

  • Just let it run for X days
  • Using opcache.protect_memory = 1. The idea is that we will get a segfault should something try to modify the memory
  • Using opcache.consistency_checks = Y. Opcache will verify the cache checksums every Y requests
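To make the third scenario concrete, here is a toy model of "verify the checksums every Y requests". It is only a sketch of the idea and not how opcache is implemented: the `ChecksummedCache` class, its method names and the use of `zlib.adler32` are all assumptions made for illustration.

```python
import zlib


class ChecksummedCache:
    """Toy cache that keeps a checksum next to every stored entry (hypothetical)."""

    def __init__(self, check_every):
        self.check_every = check_every  # plays the role of Y in the list above
        self.requests_served = 0
        self.entries = {}  # key -> (mutable payload, checksum taken at store time)

    def store(self, key, payload):
        self.entries[key] = (bytearray(payload), zlib.adler32(payload))

    def corrupted_keys(self):
        """Return the keys whose payload no longer matches its stored checksum."""
        return sorted(
            key
            for key, (data, checksum) in self.entries.items()
            if zlib.adler32(bytes(data)) != checksum
        )

    def serve(self, key):
        """Serve one request; every check_every-th request verifies all entries."""
        self.requests_served += 1
        if self.check_every and self.requests_served % self.check_every == 0:
            bad = self.corrupted_keys()
            if bad:
                raise RuntimeError("corrupted cache entries: " + ", ".join(bad))
        return bytes(self.entries[key][0])
```

The point of the model is the trade-off: a smaller Y catches a corruption sooner but pays the verification cost more often, while a corruption that happens between two checks is served unnoticed until the next one.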

Extra tests:

  • Briefly enable opcache.consistency_checks = Y on a production server and check the performance penalty
  • Simulate code deployments as described here: https://phabricator.wikimedia.org/T253673#6386605
    • We have a PHP script whose contents change frequently
    • The client is constantly requesting this script

Many thanks to @ori for the ideas

Testing setup:

Server
Different scenarios will run against mwdebug1001.eqiad.wmnet
Issues: mwdebug* servers are in the appserver cluster, so they can fire alerts and their metrics are included in the overall cluster metrics (T262202)

Client

  • mwdebug2001.codfw.wmnet
  • Sample of 750k GET requests
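A minimal sketch of what the client side could look like, assuming the sample is a file of access-log lines that contain a quoted request line such as `"GET /path HTTP/1.1"`. The log format, the `extract_get_paths` and `replay` helpers and the base URL are hypothetical; the actual tooling used for the replay is not described in this task.

```python
import re
import urllib.error
import urllib.request

# Assumed log format: the request line appears quoted, e.g. "GET /wiki/Foo HTTP/1.1".
REQUEST_LINE = re.compile(r'"GET (\S+) HTTP/[\d.]+"')


def extract_get_paths(log_lines):
    """Return the request path of every GET entry, in log order."""
    paths = []
    for line in log_lines:
        match = REQUEST_LINE.search(line)
        if match:
            paths.append(match.group(1))
    return paths


def replay(base_url, paths, timeout=10.0):
    """Request every path once and count the outcomes by HTTP status."""
    counts = {}
    for path in paths:
        try:
            with urllib.request.urlopen(base_url + path, timeout=timeout) as response:
                status = response.status
        except urllib.error.HTTPError as err:
            status = err.code  # 4xx/5xx responses still count as a replayed request
        except OSError:
            status = "error"  # connection failures, timeouts, DNS problems
        counts[status] = counts.get(status, 0) + 1
    return counts
```

Usage would be along the lines of `replay("http://<server>", extract_get_paths(open("sample.log")))`, repeated for as long as the scenario runs. Counting statuses makes it easy to spot a sudden jump in 5xx responses, which is one way a corruption could surface.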

Related Dashboards

Event Timeline

Change 622179 had a related patch set uploaded (by Effie Mouzeli; owner: Effie Mouzeli):
[operations/puppet@production] hiera: change opcached settings for testing

https://gerrit.wikimedia.org/r/622179

Change 622179 merged by Effie Mouzeli:
[operations/puppet@production] hiera: change mwdebug1001 opcache settings for testing

https://gerrit.wikimedia.org/r/622179

I replayed the chunk of about 700k GET requests for about a week, but nothing stood out.

I have moved forward and enabled opcache.protect_memory=1, which has severely degraded performance and has significantly increased latency.

Change 625224 had a related patch set uploaded (by Effie Mouzeli; owner: Effie Mouzeli):
[operations/puppet@production] php::admin: export additional opcache metrics

https://gerrit.wikimedia.org/r/625224

Change 625224 merged by Effie Mouzeli:
[operations/puppet@production] php::admin: export additional opcache metrics

https://gerrit.wikimedia.org/r/625224

jijiki changed the task status from Open to Stalled. (Edited Sep 16 2020, 6:17 PM)

TL;DR: we were unable to reproduce a corruption in this iteration

  • I ran the full set of URLs a few times using opcache.protect_memory = 1. What is interesting, though, is how badly performance was degraded

image.png (678×3 px, 192 KB)

  • opcache.consistency_checks = 100

As a next step, I ran the full set with opcache.consistency_checks = 100, which didn't degrade performance much, but I am unable to emulate the traffic a real application server receives. To see how it truly impacts performance, I would have to try it on a production server (checking every 1000 requests).

  • Simulate code deployments

The testing for this scenario was as follows: on mwdebug1001 we installed the changed.php script. Another script constantly changed its contents, so as to emulate what we do during code deployments. Normal deployments were happening on the server as well. On the client side, mwdebug2001 was continuously requesting changed.php. Sadly, since the file was rather small, wasted memory wasn't increasing much.
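For reference, the server-side half of this scenario can be sketched roughly as below. This is not the script that was actually used: `render_script`, `write_atomically`, `churn` and the `padding_lines` knob are hypothetical, and replacing the file atomically via a rename is an assumption about how a deployment swaps files.

```python
import os
import tempfile
import time


def render_script(version, padding_lines=0):
    """Build the PHP source for one revision; padding makes the file larger."""
    body = ["<?php", f"// revision {version}"]
    body += [f"$pad_{i} = {version};" for i in range(padding_lines)]
    body.append(f"echo {version};")
    return "\n".join(body) + "\n"


def write_atomically(path, content):
    """Write to a temp file in the same directory, then rename it over the target."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write(content)
        os.replace(tmp_path, path)  # readers see either the old or the new file
    except BaseException:
        os.unlink(tmp_path)
        raise


def churn(path, revisions, interval_seconds=1.0, padding_lines=0):
    """Rewrite the script `revisions` times, pausing between revisions."""
    for version in range(1, revisions + 1):
        write_atomically(path, render_script(version, padding_lines))
        time.sleep(interval_seconds)
```

Each new revision invalidates the cached copy of the script, so the old entry becomes wasted memory. With a tiny file that adds up slowly, which matches what was observed; raising `padding_lines`, or churning several files at once, would be one way to make wasted memory grow faster.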

General Notes

  • If corruptions are related to code deployments, then we could possibly test that with *many* and larger self-changing scripts
  • I added metrics for wasted memory. Wasted memory in opcache consists of cache entries that have been invalidated (e.g. after a deployment), but have not been released.
  • Due to alerts firing during those tests, this task started the conversation on separating the debug* servers from the production cluster as much as possible (T262202)
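As an illustration of the wasted memory metric, the sketch below derives a wasted percentage from the `memory_usage` section of PHP's `opcache_get_status()` output, passed in as a plain dict. It assumes that `used_memory`, `free_memory` and `wasted_memory` together add up to the configured opcache size; the `wasted_percentage` helper and the sample numbers are made up and are not the exported metrics themselves.

```python
def wasted_percentage(memory_usage):
    """Share of the opcache shared memory occupied by invalidated entries."""
    used = memory_usage["used_memory"]
    free = memory_usage["free_memory"]
    wasted = memory_usage["wasted_memory"]
    total = used + free + wasted  # assumed to equal the configured cache size
    return 100.0 * wasted / total if total else 0.0


# Made-up sample: 300 MiB cache with 15 MiB wasted after a few deployments.
sample = {
    "used_memory": 200 * 1024 * 1024,
    "free_memory": 85 * 1024 * 1024,
    "wasted_memory": 15 * 1024 * 1024,
}
print(f"wasted: {wasted_percentage(sample):.1f}%")
```

Tracking this value next to free memory shows whether deployments are eating into the cache even when nothing looks full yet.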

Possible future work

  • Disable opcache restarts on all mwdebug* servers
  • Set opcache.consistency_checks = 100 on all mwdebug* servers
  • Set opcache.consistency_checks = 1000 on an app and an api server

I am marking this task as Stalled until we get any fresh ideas, or we migrate MediaWiki to Kubernetes