
Investigate opcache hit rate on Buster appserver
Closed, Resolved (Public)

Description

Both mw1265 and mwdebug1003 have active opcache hit ratio alerts with status "CRITICAL: opcache cache-hit ratio is below 99.85%", but their actual hit ratio is above the critical threshold (even though mwdebug1003 should be emitting a warning):

rzl@mw1265:~$ php7adm /opcache-info | jq .
{
  "opcache_enabled": true,
  "cache_full": false,
  "restart_pending": false,
  "restart_in_progress": false,
  "memory_usage": {
    "used_memory": 519450912,
    "free_memory": 512941152,
    "wasted_memory": 41349760,
    "current_wasted_percentage": 3.850996494293213
  },
  "interned_strings_usage": {
    "buffer_size": 100663296,
    "used_memory": 35895520,
    "free_memory": 64767776,
    "number_of_strings": 798770
  },
  "opcache_statistics": {
    "num_cached_scripts": 11020,
    "num_cached_keys": 18099,
    "max_cached_keys": 32531,
    "hits": 3846219216,
    "start_time": 1608053201,
    "last_restart_time": 0,
    "oom_restarts": 0,
    "hash_restarts": 0,
    "manual_restarts": 0,
    "misses": 11108,
    "blacklist_misses": 0,
    "blacklist_miss_ratio": 0,
    "opcache_hit_rate": 99.99971119774261
  }
}
rzl@mwdebug1003:~$ php7adm /opcache-info | jq .
{
  "opcache_enabled": true,
  "cache_full": false,
  "restart_pending": false,
  "restart_in_progress": false,
  "memory_usage": {
    "used_memory": 150631392,
    "free_memory": 159394776,
    "wasted_memory": 4546632,
    "current_wasted_percentage": 1.4453353881835938
  },
  "interned_strings_usage": {
    "buffer_size": 52428800,
    "used_memory": 4966984,
    "free_memory": 47461816,
    "number_of_strings": 91130
  },
  "opcache_statistics": {
    "num_cached_scripts": 1150,
    "num_cached_keys": 1780,
    "max_cached_keys": 32531,
    "hits": 6403628,
    "start_time": 1608250368,
    "last_restart_time": 0,
    "oom_restarts": 0,
    "hash_restarts": 0,
    "manual_restarts": 0,
    "misses": 1154,
    "blacklist_misses": 0,
    "blacklist_miss_ratio": 0,
    "opcache_hit_rate": 99.98198221266547
  }
}
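
For reference, the opcache_hit_rate reported above is simply hits / (hits + misses) expressed as a percentage, so it can be re-derived from the raw counters. A minimal sketch using the same tooling as above:

# Sketch: recompute the hit rate from the raw counters
php7adm /opcache-info | jq '.opcache_statistics | 100 * .hits / (.hits + .misses)'

For mw1265 this gives 99.99971..., matching the reported opcache_hit_rate.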

The Icinga service looks like it's updating regularly, so this ought to have cleared on its own -- something funny is going on here.

Event Timeline

RLazarus triaged this task as Medium priority.Dec 18 2020, 6:37 PM
RLazarus created this task.

Is this a problem with Icinga?

legoktm@mwdebug1003:~$ /usr/local/lib/nagios/plugins/nrpe_check_opcache -w 100 -c 50
OK: opcache is healthy

Doesn't seem like a permissions issue either (assuming it runs as the nagios user):

legoktm@mwdebug1003:~$ sudo -u nagios /usr/local/lib/nagios/plugins/nrpe_check_opcache -w 100 -c 50
OK: opcache is healthy

But https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=mwdebug1003&service=PHP+opcache+health is still critical.

This always happens after reimaging a server and then disappears after it's been running for a while.

It's not consistent. For example, mw2226 is OK, but mw2224 and mw2225 have the alert, even though all three are Buster and were reimaged on the same day, 8 days ago.

mwdebug1003 was one of the first servers to be reimaged, though, and it's still critical after more than a month.

> Is this a problem with Icinga?

Yes! And it's really weird. I tracked down the NRPE command that is run from Icinga, and it behaves differently on two example hosts:

@alert1001:/etc/icinga# /usr/lib/nagios/plugins/check_nrpe -H mw2226.codfw.wmnet -c check_opcache
OK: opcache is healthy

@alert1001:/etc/icinga# /usr/lib/nagios/plugins/check_nrpe -H mw2225.codfw.wmnet -c check_opcache
CRITICAL: opcache cache-hit ratio is below 99.85%

but running the command locally, it works on both hosts:

@mw2226:/etc/nagios# /usr/local/lib/nagios/plugins/nrpe_check_opcache -w 100 -c 50
OK: opcache is healthy

@mw2225:/etc/nagios/nrpe.d# /usr/local/lib/nagios/plugins/nrpe_check_opcache -w 100 -c 50
OK: opcache is healthy

I already confirmed the NRPE config is the same, and then also enabled debug logging in NRPE. If I do that on the host where it works, I can see the check run in the logs like this:

Jan 21 01:14:43 mw2226 nrpe[39956]: Running command: /usr/local/lib/nagios/plugins/nrpe_check_opcache -w 100 -c 50
Jan 21 01:14:43 mw2226 nrpe[39956]: Command completed with return code 0 and output: OK: opcache is healthy

If I do the same on a host where it doesn't work (mw2225)... I see NOTHING in the logs... wut???
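
For the record, enabling the NRPE debug logging mentioned above is just a config flip. A sketch, assuming the stock Debian config path and service name:

# Enable NRPE debug logging and follow the daemon log (stock Debian paths assumed)
sudo sed -i 's/^debug=0/debug=1/' /etc/nagios/nrpe.cfg
sudo systemctl restart nagios-nrpe-server
sudo journalctl -u nagios-nrpe-server -f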

Upon further investigation I realized mw2226 is actually still Stretch, and I mistakenly marked it as DONE in the etherpad for appserver upgrades... somehow. Not sure how.

This means the difference is actually the NRPE version.

I tried installing nagios-nrpe-server 4.0.3-1~bpo10+1 over 3.2.1-2, but that did not fix the issue either.
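
For reference, the ~bpo10+1 suffix indicates a buster-backports build, so the install was presumably along the lines of:

sudo apt install -t buster-backports nagios-nrpe-server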

I found the issue. Changing line 28 in /usr/local/lib/nagios/plugins/nrpe_check_opcache to:

OUT=$(/usr/local/bin/php7adm /opcache-info | jq . 2>&1)

fixes it. The difference is the full path to php7adm; before it was:

OUT=$(php7adm /opcache-info | jq . 2>&1)

On Stretch it worked without the full path (presumably the PATH the NRPE daemon passes to its checks on Buster no longer includes /usr/local/bin, so the bare php7adm call failed silently).
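
A slightly more defensive variant of that line would also fail loudly if the binary ever goes missing from that path again; a sketch, not the merged patch:

# Sketch (not the merged patch): check for php7adm explicitly, return UNKNOWN if absent
PHP7ADM=/usr/local/bin/php7adm
if [ ! -x "$PHP7ADM" ]; then
    echo "UNKNOWN: php7adm not found at $PHP7ADM"
    exit 3
fi
OUT=$("$PHP7ADM" /opcache-info | jq . 2>&1)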

Change 657460 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] nrpe_check_opcache: use full path to php7adm to fix opcache monitor on buster

https://gerrit.wikimedia.org/r/657460

Change 657460 merged by Dzahn:
[operations/puppet@production] nrpe_check_opcache: use full path to php7adm to fix opcache monitor on buster

https://gerrit.wikimedia.org/r/657460

03:28 <+icinga-wm> RECOVERY - PHP opcache health on mwdebug1003 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
03:29 <+icinga-wm> RECOVERY - PHP opcache health on mw1265 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
03:29 <+icinga-wm> RECOVERY - PHP opcache health on mw2240 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
03:29 <+icinga-wm> RECOVERY - PHP opcache health on mw2363 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
03:29 <+icinga-wm> RECOVERY - PHP opcache health on mw2255 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
03:29 <+icinga-wm> RECOVERY - PHP opcache health on mw2329 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
03:29 <+icinga-wm> RECOVERY - PHP opcache health on mw2335 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
03:29 <+icinga-wm> RECOVERY - PHP opcache health on mw2357 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health

Nice find! Thanks for tracking this down.