We have been suffering from opcache corruptions since we started rolling out php-fpm to production. We have managed to limit how often they occur by restarting php-fpm on a server based on how much free opcache memory is available. This is working well, but sadly we still see corruptions every now and then, which appear (for now) to be random.
In T253673 we discussed that it would help if we were able to reproduce this corruption. Understanding that this is hard to do, we can at least try.
With all of our anti-corruption measures disabled, we will replay a few hundred thousand GET requests (from production logs) under the following scenarios:
- just let it run for X days
- Using opcache.protect_memory = 1. The idea is that we will get a segfault should something try to modify the memory
- Using opcache.consistency_checks = Y. Opcache will verify the cache checksums every Y requests
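To keep the three scenarios apart, the php.ini overrides for each one could be generated from a small helper. This is only a sketch: the function name and scenario labels are hypothetical, and the check interval (Y) is deliberately left for the caller to supply since it has not been decided here.

```python
def opcache_overrides(scenario, check_interval=None):
    """Return the php.ini lines to add for one test scenario.

    scenario: "baseline", "protect_memory" or "consistency_checks"
              (hypothetical labels for the three scenarios above)
    check_interval: the Y value from the plan, supplied by the caller
    """
    if scenario == "baseline":
        # Just let it run: no extra directives.
        return []
    if scenario == "protect_memory":
        return ["opcache.protect_memory = 1"]
    if scenario == "consistency_checks":
        if check_interval is None:
            raise ValueError("consistency_checks needs a check interval (Y)")
        return [f"opcache.consistency_checks = {check_interval}"]
    raise ValueError(f"unknown scenario: {scenario}")
```

The returned lines would still need to be written to the php-fpm configuration on the test server, followed by a php-fpm restart, which is outside this sketch.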
Extra tests:
- Briefly enable opcache.consistency_checks = Y on a production server and check the performance penalty
- Simulate code deployments as described here: https://phabricator.wikimedia.org/T253673#6386605
  - a PHP script whose contents change frequently
  - a client that constantly requests this script
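The frequently changing script could be produced by something like the sketch below. It is an assumption that renaming a freshly written file into place is a fair stand-in for what a real deployment does; the function names and the script contents are hypothetical and are not taken from the linked comment.

```python
import os
import tempfile


def php_source(version):
    """PHP script body whose content changes with every simulated deploy."""
    return f"<?php\n// simulated deploy {version}\necho 'version {version}';\n"


def deploy(path, version):
    """Write the new script next to the target, then rename it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        handle.write(php_source(version))
    # mkstemp creates the file as 0600; make it readable by the php-fpm user.
    os.chmod(tmp, 0o644)
    # Atomic on POSIX as long as both paths are on the same filesystem.
    os.replace(tmp, path)
```

Calling `deploy(path, n)` in a loop with an increasing `n`, while the client keeps requesting the script, gives opcache a steady stream of invalidations to handle.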
Many thanks to @ori for the ideas
Testing setup:
Server
The different scenarios will run against mwdebug1001.eqiad.wmnet
Issues: mwdebug* servers are in the appserver cluster, so they can fire alerts, and their metrics are included in the overall cluster metrics (T262202)
Client
- mwdebug2001.codfw.wmnet
- Sample of 750k GET requests
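A minimal replay loop for the client side might look like the following. It assumes the sample has already been reduced to one request path per line; the function names and the base URL are hypothetical, and any extra headers the server needs to route the request are not handled.

```python
from collections import Counter
import urllib.error
import urllib.request


def http_get(url, timeout=10):
    """Fetch one URL and return the HTTP status code."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses are still responses; record their code.
        return err.code


def replay(paths, base_url, fetch=http_get):
    """Request every logged path against base_url and tally status codes."""
    statuses = Counter()
    for path in paths:
        path = path.strip()
        if not path:
            continue  # skip blank lines in the sample
        statuses[fetch(base_url + path)] += 1
    return statuses
```

For example, `replay(open("sample.txt"), "http://server.example")` would return a `Counter` of status codes, where both the file name and the host are placeholders. Connection errors and timeouts are not caught and will stop the run, so a long replay would probably want retry or logging around `fetch`.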
Related Dashboards