
Investigate opcache hit rate on Buster appserver
Closed, Resolved (Public)

Description

Both mw1265 and mwdebug1003 have active opcache hit ratio alerts with status "CRITICAL: opcache cache-hit ratio is below 99.85%", but their actual hit ratio is above the critical threshold (even though mwdebug1003 should be emitting a warning):

rzl@mw1265:~$ php7adm /opcache-info | jq .
{
  "opcache_enabled": true,
  "cache_full": false,
  "restart_pending": false,
  "restart_in_progress": false,
  "memory_usage": {
    "used_memory": 519450912,
    "free_memory": 512941152,
    "wasted_memory": 41349760,
    "current_wasted_percentage": 3.850996494293213
  },
  "interned_strings_usage": {
    "buffer_size": 100663296,
    "used_memory": 35895520,
    "free_memory": 64767776,
    "number_of_strings": 798770
  },
  "opcache_statistics": {
    "num_cached_scripts": 11020,
    "num_cached_keys": 18099,
    "max_cached_keys": 32531,
    "hits": 3846219216,
    "start_time": 1608053201,
    "last_restart_time": 0,
    "oom_restarts": 0,
    "hash_restarts": 0,
    "manual_restarts": 0,
    "misses": 11108,
    "blacklist_misses": 0,
    "blacklist_miss_ratio": 0,
    "opcache_hit_rate": 99.99971119774261
  }
}
rzl@mwdebug1003:~$ php7adm /opcache-info | jq .
{
  "opcache_enabled": true,
  "cache_full": false,
  "restart_pending": false,
  "restart_in_progress": false,
  "memory_usage": {
    "used_memory": 150631392,
    "free_memory": 159394776,
    "wasted_memory": 4546632,
    "current_wasted_percentage": 1.4453353881835938
  },
  "interned_strings_usage": {
    "buffer_size": 52428800,
    "used_memory": 4966984,
    "free_memory": 47461816,
    "number_of_strings": 91130
  },
  "opcache_statistics": {
    "num_cached_scripts": 1150,
    "num_cached_keys": 1780,
    "max_cached_keys": 32531,
    "hits": 6403628,
    "start_time": 1608250368,
    "last_restart_time": 0,
    "oom_restarts": 0,
    "hash_restarts": 0,
    "manual_restarts": 0,
    "misses": 1154,
    "blacklist_misses": 0,
    "blacklist_miss_ratio": 0,
    "opcache_hit_rate": 99.98198221266547
  }
}
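
For reference, the opcache_hit_rate reported above is simply hits / (hits + misses) expressed as a percentage, so it can be re-derived from the raw counters. A minimal sketch using the same tooling as above:

# Sketch: recompute the hit rate from the raw counters
php7adm /opcache-info | jq '.opcache_statistics | 100 * .hits / (.hits + .misses)'

For mw1265 this gives 99.99971..., matching the reported opcache_hit_rate.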

The Icinga service looks like it's updating regularly, so this ought to have cleared on its own -- something funny is going on here.

Event Timeline

RLazarus triaged this task as Medium priority.Dec 18 2020, 6:37 PM
RLazarus created this task.

Is this a problem with Icinga?

legoktm@mwdebug1003:~$ /usr/local/lib/nagios/plugins/nrpe_check_opcache -w 100 -c 50
OK: opcache is healthy

Doesn't seem like a permissions issue either (assuming it runs as the nagios user):

legoktm@mwdebug1003:~$ sudo -u nagios /usr/local/lib/nagios/plugins/nrpe_check_opcache -w 100 -c 50
OK: opcache is healthy

But https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=mwdebug1003&service=PHP+opcache+health is still critical.

This always happens after reimaging a server and then disappears after it's been running for a while.

It's not consistent. For example, mw2226 is OK, but mw2224 and mw2225 have the alert, even though all three are Buster and were reimaged on the same day, 8 days ago.

mwdebug1003 was one of the first servers to be reimaged, though, and it's still critical after more than a month.

> Is this a problem with Icinga?

Yes! And it's really weird. I tracked down the NRPE command that is run from Icinga, and it behaves differently on two example hosts:

@alert1001:/etc/icinga# /usr/lib/nagios/plugins/check_nrpe -H mw2226.codfw.wmnet -c check_opcache
OK: opcache is healthy

@alert1001:/etc/icinga# /usr/lib/nagios/plugins/check_nrpe -H mw2225.codfw.wmnet -c check_opcache
CRITICAL: opcache cache-hit ratio is below 99.85%

but running the command locally, it works on both hosts:

@mw2226:/etc/nagios# /usr/local/lib/nagios/plugins/nrpe_check_opcache -w 100 -c 50
OK: opcache is healthy

@mw2225:/etc/nagios/nrpe.d# /usr/local/lib/nagios/plugins/nrpe_check_opcache -w 100 -c 50
OK: opcache is healthy

I already confirmed the NRPE config is the same, and then also enabled debug logging in NRPE. If I do that on the host where it works, I can see the check run in the logs like this:

Jan 21 01:14:43 mw2226 nrpe[39956]: Running command: /usr/local/lib/nagios/plugins/nrpe_check_opcache -w 100 -c 50
Jan 21 01:14:43 mw2226 nrpe[39956]: Command completed with return code 0 and output: OK: opcache is healthy

If I do the same on a host where it doesn't work (mw2225)... I see NOTHING in the logs... wut???
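
For the record, enabling the NRPE debug logging mentioned above is just a config flip. A sketch, assuming the stock Debian config path and service name:

# Enable NRPE debug logging and follow the daemon log (stock Debian paths assumed)
sudo sed -i 's/^debug=0/debug=1/' /etc/nagios/nrpe.cfg
sudo systemctl restart nagios-nrpe-server
sudo journalctl -u nagios-nrpe-server -f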

Upon further investigation I realized mw2226 is actually still Stretch, and I mistakenly marked it as DONE in the etherpad for appserver upgrades... somehow. Not sure how.

This means the difference is actually the NRPE version.

I tried installing nagios-nrpe-server 4.0.3-1~bpo10+1 over 3.2.1-2, but that did not fix the issue either.
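
For reference, the ~bpo10+1 suffix indicates a buster-backports build, so the install was presumably along the lines of:

sudo apt install -t buster-backports nagios-nrpe-server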

I found the issue. Changing line 28 in /usr/local/lib/nagios/plugins/nrpe_check_opcache to:

OUT=$(/usr/local/bin/php7adm /opcache-info | jq . 2>&1)

fixes it. The difference is the full path to php7adm; before it was:

OUT=$(php7adm /opcache-info | jq . 2>&1)

On Stretch it worked without the full path (presumably the PATH the NRPE daemon passes to its checks on Buster no longer includes /usr/local/bin, so the bare php7adm call failed silently).
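
A slightly more defensive variant of that line would also fail loudly if the binary ever goes missing from that path again; a sketch, not the merged patch:

# Sketch (not the merged patch): check for php7adm explicitly, return UNKNOWN if absent
PHP7ADM=/usr/local/bin/php7adm
if [ ! -x "$PHP7ADM" ]; then
    echo "UNKNOWN: php7adm not found at $PHP7ADM"
    exit 3
fi
OUT=$("$PHP7ADM" /opcache-info | jq . 2>&1)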

Change 657460 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] nrpe_check_opcache: use full path to php7adm to fix opcache monitor on buster

https://gerrit.wikimedia.org/r/657460

Change 657460 merged by Dzahn:
[operations/puppet@production] nrpe_check_opcache: use full path to php7adm to fix opcache monitor on buster

https://gerrit.wikimedia.org/r/657460

03:28 <+icinga-wm> RECOVERY - PHP opcache health on mwdebug1003 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
03:29 <+icinga-wm> RECOVERY - PHP opcache health on mw1265 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
03:29 <+icinga-wm> RECOVERY - PHP opcache health on mw2240 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
03:29 <+icinga-wm> RECOVERY - PHP opcache health on mw2363 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
03:29 <+icinga-wm> RECOVERY - PHP opcache health on mw2255 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
03:29 <+icinga-wm> RECOVERY - PHP opcache health on mw2329 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
03:29 <+icinga-wm> RECOVERY - PHP opcache health on mw2335 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
03:29 <+icinga-wm> RECOVERY - PHP opcache health on mw2357 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health

Nice find! Thanks for tracking this down.