When deploying 1.39.0-wmf.3 to group 0, the PHP opcache filled up on several application servers, triggering alarms. Scap should have restarted PHP on all the application servers to clear the cache, but it clearly did not.
Looking at the Scap (ECS) dashboard on Kibana https://logstash.wikimedia.org/goto/43acdb213090860ac826a636905e91b1 and searching for messages matching "check-and-restart", we have a history of the check-and-restart-php php7.2-fpm invocations:
Mar 22, 2022 @ 09:54:25 sync-world Running '/usr/local/sbin/check-and-restart-php php7.2-fpm 100' on 86 host(s)
Mar 22, 2022 @ 10:03:28 sync-wikiversions Running '/usr/local/sbin/check-and-restart-php php7.2-fpm 100' on 86 host(s)
86 hosts are not enough. We had 91 hosts at some point, but the baseline until the morning of March 1st was 352 hosts:
Mar 7, 2022 @ 06:49:44.164 | 86 hosts |
Mar 3, 2022 @ 21:30:19.767 | 91 hosts |
Mar 1, 2022 @ 17:23:56.331 | 91 hosts |
Mar 1, 2022 @ 08:08:21.766 | 352 hosts |
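To put the drop in perspective, here is a minimal sketch that extracts the host count from a scap message and compares it to the 352-host baseline from the history above. The helper names (`host_count`, `coverage`) are hypothetical and not part of scap; the regular expression only assumes the "on N host(s)" wording seen in the logged messages.

```python
import re

# Matches the host count in scap messages such as
# "Running '/usr/local/sbin/check-and-restart-php php7.2-fpm 100' on 86 host(s)"
HOST_COUNT = re.compile(r"on (\d+) host\(s\)")


def host_count(message):
    """Return the number of targeted hosts in a scap log message, or None."""
    match = HOST_COUNT.search(message)
    return int(match.group(1)) if match else None


def coverage(count, baseline):
    """Fraction of the baseline host count that was actually targeted."""
    return count / baseline


# Baseline taken from the Mar 1, 2022 @ 08:08:21 entry in the history above.
BASELINE = 352

message = (
    "Running '/usr/local/sbin/check-and-restart-php php7.2-fpm 100' "
    "on 86 host(s)"
)
count = host_count(message)
print(f"{count} of {BASELINE} hosts ({coverage(count, BASELINE):.0%})")
# prints "86 of 352 hosts (24%)"
```

In other words, the restart only reached roughly a quarter of the hosts it used to target.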
Scap got updated on March 1st to 4.4.1:
17:24 <dancy@deploy1002> Finished scap: testing container image build (duration: 28m 39s) [production]
16:55 <dancy@deploy1002> Started scap: testing container image build [production]
06:46 <_joe_> uploaded scap 4.4.1 to {stretch,buster,bullseye} T302464 [production]
06:46 <_joe_> uploaded scap 4.4.1 to {stretch,buster,bullseye} [production]
We had to manually restart PHP on the API app servers:
10:26:55 <_joe_> !log running check-restart-php on api appservers
10:26:57 <•stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
Log of the deployment (WMF-NDA):
P22941
10:03:28 Finished sync-apaches (duration: 00m 39s)
10:03:28 Started php-fpm-restarts
10:03:28 Running '/usr/local/sbin/check-and-restart-php php7.2-fpm 100' on 86 host(s)