Reproduce opcache corruptions in production
Closed, Resolved, Public

Assigned To

Authored By

	jijiki
	Aug 21 2020, 5:42 PM

Description

We have been suffering from opcache corruptions since we started rolling out php-fpm to production. We have managed to limit how often they occur by restarting php-fpm on a server based on how much free opcache memory is available. This is working well, but sadly we still have corruptions every now and then, which appear (for now) to be random.

In T253673 we discussed that it would help if we were able to reproduce this corruption. Understanding that this is hard to do, we can at least try.

With all of our anti-corruption measures disabled, we will replay a few hundred thousand GET requests (from production logs) under the following scenarios:

Just let it run for X days
Using opcache.protect_memory = 1. The idea is that we will get a segfault should something try to modify the memory
Using opcache.consistency_checks = Y. Opcache will verify the cache checksums every Y requests

Extra tests:

Briefly enable opcache.consistency_checks = Y on a production server and check the performance penalty
Simulate code deployments as described here: https://phabricator.wikimedia.org/T253673#6386605
- We have a php script where the contents are changing frequently
- the client is constantly requesting this script

Many thanks to @ori for the ideas

Testing setup:

Server
Different scenarios will run against mwdebug1001.eqiad.wmnet
Issues: mwdebug* servers are in the appserver cluster, so they can fire alerts and their metrics are included in the overall cluster metrics (T262202)

Client

mwdebug2001.codfw.wmnet
Sample of 750k GET requests
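
As a rough illustration of the client side, the replay could look something like the sketch below. It is not the tool that was actually used; the function names, the `base_url` parameter and the one-path-per-line sample format are all assumptions made for the example.

```python
from collections import Counter
from typing import Callable, Iterable
from urllib.error import HTTPError
from urllib.request import Request, urlopen


def fetch_status(url: str, timeout: float = 10.0) -> int:
    """Issue one GET request and return the HTTP status code (0 on network error)."""
    try:
        with urlopen(Request(url, method="GET"), timeout=timeout) as response:
            return response.status
    except HTTPError as err:
        # 4xx/5xx responses still carry a status code worth counting.
        return err.code
    except OSError:
        # Connection failures and timeouts (URLError is an OSError subclass).
        return 0


def replay(paths: Iterable[str], base_url: str,
           fetch: Callable[[str], int] = fetch_status) -> Counter:
    """Replay each logged request path against base_url and tally status codes."""
    statuses = Counter()
    for path in paths:
        path = path.strip()
        if not path:
            continue  # skip blank lines in the sample file
        statuses[fetch(base_url.rstrip("/") + "/" + path.lstrip("/"))] += 1
    return statuses
```

A run would pass an open file of sampled paths as `paths` and the test server's address as `base_url`. A jump in 5xx counts between runs would be a reason to look at the server logs, not proof of a corruption.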

Related Dashboards

Details

	Subject	Repo	Branch	Lines +/-
	php::admin: export additional opcache metrics	operations/puppet	production	+13 -0
	hiera: change mwdebug1001 opcache settings for testing	operations/puppet	production	+3 -0


Related Objects

Status	Assigned	Task
Resolved	Krinkle	T212460 Adopt static array files for local disk storage of values (epic)
Open	None	T99740 Use static php array files for l10n cache at WMF (instead of CDB)
Resolved	Krinkle	T245183 PHP7 corruption reports in 2020-2022 (Call on wrong object, etc.)
Resolved	Krinkle	T253673 Avoid php-opcache corruption in WMF production
Resolved	jijiki	T261009 Reproduce opcache corruptions in production

Event Timeline

jijiki created this task.Aug 21 2020, 5:42 PM

Restricted Application added a subscriber: Aklapper. · View Herald TranscriptAug 21 2020, 5:42 PM

Change 622179 had a related patch set uploaded (by Effie Mouzeli; owner: Effie Mouzeli):
[operations/puppet@production] hiera: change opcached settings for testing

https://gerrit.wikimedia.org/r/622179

gerritbot added a project: Patch-For-Review.Aug 24 2020, 5:44 PM

Change 622179 merged by Effie Mouzeli:
[operations/puppet@production] hiera: change mwdebug1001 opcache settings for testing

https://gerrit.wikimedia.org/r/622179

Maintenance_bot removed a project: Patch-For-Review.Aug 24 2020, 10:10 PM

• Mholloway subscribed.Aug 24 2020, 11:31 PM

jijiki added a parent task: T253673: Avoid php-opcache corruption in WMF production.Sep 7 2020, 9:05 AM

jijiki updated the task description. (Show Details)Sep 7 2020, 11:31 AM

jijiki mentioned this in T262202: Create a separate 'mwdebug' cluster.Sep 7 2020, 12:15 PM

jijiki added a project: User-jijiki.Sep 8 2020, 9:54 AM

jijiki updated the task description. (Show Details)

I replayed the chunk of about 700k GET requests for about a week, but nothing stood out.

I have moved forward and enabled opcache.protect_memory=1, which has severely degraded performance and has significantly increased latency.

jijiki updated the task description. (Show Details)Sep 8 2020, 10:02 AM

jijiki moved this task from Inbox 🐅 to In Progress 🏋️‍♀️ on the User-jijiki board.Sep 8 2020, 10:15 AM

jijiki updated the task description. (Show Details)Sep 8 2020, 5:22 PM

jijiki updated the task description. (Show Details)Sep 8 2020, 5:36 PM

Change 625224 had a related patch set uploaded (by Effie Mouzeli; owner: Effie Mouzeli):
[operations/puppet@production] php::admin: export additional opcache metrics

https://gerrit.wikimedia.org/r/625224

gerritbot added a project: Patch-For-Review.Sep 8 2020, 5:50 PM

Change 625224 merged by Effie Mouzeli:
[operations/puppet@production] php::admin: export additional opcache metrics

https://gerrit.wikimedia.org/r/625224

Maintenance_bot removed a project: Patch-For-Review.Sep 9 2020, 11:10 AM

jijiki moved this task from Incoming 🐫 to Doing 😎 on the serviceops board.Sep 10 2020, 10:26 AM

jijiki updated the task description. (Show Details)Sep 16 2020, 4:43 PM

TL;DR: we were unable to reproduce a corruption in this iteration

I ran the full set of URLs a few times using opcache.protect_memory = 1. What is interesting, though, is how badly performance was degraded

opcache.consistency_checks = 100

Next, I ran the full set with opcache.consistency_checks = 100, which didn't degrade performance much. However, I am unable to emulate the traffic a real application server receives, so to see how it truly impacts performance I would have to try it on a production server (checking every 1000 requests).

Simulate code deployments

The testing for this scenario was as follows: on mwdebug1001 we installed the changed.php script. We had another script that would constantly change its contents, to emulate what we do during code deployments. Normal deployments were happening to the server as well. On the client side we had mwdebug2001 continuously requesting changed.php. Sadly, since the file was rather small, wasted memory wasn't increasing much.
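
For illustration only, a script that keeps rewriting changed.php could be sketched as below. This is not the script that ran on mwdebug1001. The function names, the atomic-rename approach and the `padding_functions` knob are assumptions; the knob is there because a small file barely moved wasted memory.

```python
import os
import tempfile
import time


def render_php(version: int, padding_functions: int = 0) -> str:
    """Build PHP source whose contents differ for every version number."""
    lines = ["<?php", f"// generated revision {version}"]
    for i in range(padding_functions):
        # Padding functions only make the compiled script bigger in opcache.
        lines.append(f"function pad_{version}_{i}() {{ return {i}; }}")
    lines.append(f"echo 'revision {version}';")
    return "\n".join(lines) + "\n"


def write_atomically(path: str, content: str) -> None:
    """Write to a temp file in the same directory, then rename over the target."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(content)
        # mkstemp creates the file as 0600; make it readable by the web server.
        os.chmod(tmp_path, 0o644)
        os.replace(tmp_path, path)
    except BaseException:
        os.unlink(tmp_path)
        raise


def mutate_forever(path: str, interval_seconds: float, padding_functions: int) -> None:
    """Rewrite the script with a new revision every interval_seconds."""
    version = 0
    while True:
        write_atomically(path, render_php(version, padding_functions))
        version += 1
        time.sleep(interval_seconds)
```

Whether opcache notices each rewrite depends on settings such as opcache.validate_timestamps and opcache.revalidate_freq, so the interval would need to be chosen with those in mind.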

General Notes

If corruptions are related to code deployments, then we could possibly test that with *many*, larger self-changing scripts
I added metrics for wasted memory. Wasted memory in opcache consists of cache entries that have been invalidated (e.g. after a deployment), but have not been released.
Due to alerts firing during those tests, this task started the conversation on separating the debug* servers from the production cluster as much as possible (T262202)
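
To make the wasted-memory note concrete, here is a minimal sketch of deriving a wasted percentage. It assumes the input is the `memory_usage` section of PHP's `opcache_get_status()` output, already decoded into a dict; how the Puppet change actually exports the metric is not shown here, and the function name is hypothetical.

```python
def wasted_percentage(memory_usage: dict) -> float:
    """Share of the opcache shared memory held by invalidated entries."""
    used = memory_usage["used_memory"]
    free = memory_usage["free_memory"]
    wasted = memory_usage["wasted_memory"]
    # Used, free and wasted bytes together make up the configured cache size.
    total = used + free + wasted
    return 100.0 * wasted / total if total else 0.0
```

Graphing this value across a deployment would show whether invalidated entries are piling up faster than they are released.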

Possible future work

Disable opcache restarts on all mwdebug* servers
Set opcache.consistency_checks = 100 on all mwdebug* servers
Set opcache.consistency_checks = 1000 on an app and an api server

I am marking this task as Stalled until we get any fresh ideas, or we migrate MediaWiki to Kubernetes

jijiki triaged this task as Low priority.Sep 16 2020, 6:18 PM

jijiki moved this task from In Progress 🏋️‍♀️ to St on the User-jijiki board.Sep 23 2020, 11:30 AM

jijiki moved this task from Doing 😎 to 🔦Unused2 on the serviceops board.Nov 2 2020, 11:48 AM

The reasons have been identified in https://phabricator.wikimedia.org/T253673#6569013

	F32352641: image.png
	Sep 16 2020, 6:17 PM
