scap no longer restarts php (opcache) on all servers
Closed, Resolved (Public)

Description

When deploying 1.39.0-wmf.3 to group 0, the PHP opcache filled up on several application servers, triggering alarms. scap should have restarted PHP on all the application servers to clear the cache, but it clearly did not.

On the Scap (ECS) dashboard in Kibana, https://logstash.wikimedia.org/goto/43acdb213090860ac826a636905e91b1 , a search for messages matching "check-and-restart" shows the history of check-and-restart-php php7.2-fpm invocations:

Mar 22, 2022 @ 09:54:25 sync-world Running '/usr/local/sbin/check-and-restart-php php7.2-fpm 100' on 86 host(s)
Mar 22, 2022 @ 10:03:28 sync-wikiversions Running '/usr/local/sbin/check-and-restart-php php7.2-fpm 100' on 86 host(s)

86 hosts is far too few. We had 91 hosts at some point, but the baseline before March 1st was 352 hosts:

Mar 7, 2022 @ 06:49:44.164 	86 hosts
Mar 3, 2022 @ 21:30:19.767 	91 hosts
Mar 1, 2022 @ 17:23:56.331 	91 hosts
Mar 1, 2022 @ 08:08:21.766 	352 hosts
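
A drop like this could be caught automatically. The following is only a minimal sketch of the idea, not something scap does: it pulls the host count out of the log message quoted above and compares it with a known baseline. The function names and the 0.9 threshold are assumptions made for illustration.

```python
import re

# Matches the scap log message quoted in this task, e.g.
# "Running '/usr/local/sbin/check-and-restart-php php7.2-fpm 100' on 86 host(s)"
HOST_COUNT_RE = re.compile(r"Running '(?P<cmd>[^']+)' on (?P<count>\d+) host\(s\)")


def parse_host_count(message):
    """Return the number of target hosts in a scap log message, or None."""
    match = HOST_COUNT_RE.search(message)
    return int(match.group("count")) if match else None


def is_suspicious(count, baseline, min_ratio=0.9):
    """Flag a run whose target count fell below min_ratio of the baseline.

    min_ratio is an arbitrary illustrative threshold, not a scap setting.
    """
    return count < baseline * min_ratio


message = (
    "Running '/usr/local/sbin/check-and-restart-php php7.2-fpm 100' "
    "on 86 host(s)"
)
count = parse_host_count(message)
print(count, is_suspicious(count, baseline=352))
```

With the numbers from this task (86 hosts against a baseline of 352), such a check would have flagged every run since March 1st.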

Scap got updated on March 1st to 4.4.1:

17:24 	<dancy@deploy1002> 	Finished scap: testing container image build (duration: 28m 39s) 	[production]
16:55 	<dancy@deploy1002> 	Started scap: testing container image build 	[production]
06:46 	<_joe_> 	uploaded scap 4.4.1 to {stretch,buster,bullseye} T302464 	[production]
06:46 	<_joe_> 	uploaded scap 4.4.1 to {stretch,buster,bullseye} 	[production]

T302464#7743541

We had to manually restart PHP on the API app servers:

10:26:55 <_joe_> !log running check-restart-php on api appservers
10:26:57 <•stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log

Log of the deployment (WMF-NDA):

P22941

	10:03:28 Finished sync-apaches (duration: 00m 39s)
	10:03:28 Started php-fpm-restarts
	10:03:28 Running '/usr/local/sbin/check-and-restart-php php7.2-fpm 100' on 86 host(s)

Event Timeline

I have looked at scap/main.py at version 4.4.2. AbstractSync._restart_php() gets the list of target hosts from /etc/scap.cfg key mw_web_clusters which on deploy1002 has:

mw_web_clusters: appserver,api_appserver,jobrunner,testserver,parsoid_php

File last touched on Mar 1 17:22

The groups have:

$ wc -l /etc/dsh/group/{appserver,api_appserver,jobrunner,testserver,parsoid_php}
  13 /etc/dsh/group/appserver
  13 /etc/dsh/group/api_appserver
  51 /etc/dsh/group/jobrunner
  17 /etc/dsh/group/testserver
  57 /etc/dsh/group/parsoid_php
 151 total

But those files contain empty lines and comments. After filtering them out, I find 86 hosts:

$ cat /etc/dsh/group/{appserver,api_appserver,jobrunner,testserver,parsoid_php}| grep -c -v -P '(^#|^$)'
86

The mw_web_clusters scap setting comes from Puppet modules/scap/templates/scap.cfg.erb which hasn't been changed since 2019:

edf367ad452 (Daniel Zahn             2019-10-16 08:13:46 -0700)|mw_web_clusters: appserver,api_appserver,jobrunner,testserver,parsoid_php

Host count per dsh file:

$ grep -c -v -P '(^#|^$)' /etc/dsh/group/{appserver,api_appserver,jobrunner,testserver,parsoid_php}
/etc/dsh/group/appserver:0
/etc/dsh/group/api_appserver:0
/etc/dsh/group/jobrunner:38
/etc/dsh/group/testserver:4
/etc/dsh/group/parsoid_php:44
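
The same per-file count can be expressed in Python, which would make it easy to turn into a guard that complains when a group resolves to zero hosts. This is a sketch under assumptions: the helper names are made up, and the file contents and host names below are placeholders rather than the real dsh files.

```python
def active_hosts(dsh_text):
    """Return host lines of a dsh group file, skipping blanks and '#' comments.

    Mirrors grep -v -P '(^#|^$)': only lines that are empty or start with '#'
    are dropped.
    """
    return [
        line for line in dsh_text.splitlines()
        if line and not line.startswith("#")
    ]


def summarize(groups):
    """Return (group name -> host count, list of groups with no hosts)."""
    counts = {name: len(active_hosts(text)) for name, text in groups.items()}
    empty = [name for name, n in counts.items() if n == 0]
    return counts, empty


# Hypothetical in-memory file contents; the host names are placeholders.
groups = {
    "appserver": "# DSH group appserver\n# This file is managed by puppet.\n\n",
    "testserver": "# DSH group testserver\nhost-a.example\nhost-b.example\n",
}
counts, empty = summarize(groups)
print(counts)
print("empty groups:", empty)
```

A group that is listed in mw_web_clusters but ends up in the empty list is almost certainly a configuration error rather than an intentionally empty cluster.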

Last changes:

-r--r--r-- 1 root root  318 Mar  1 15:25 /etc/dsh/group/api_appserver
-r--r--r-- 1 root root  302 Mar  1 15:25 /etc/dsh/group/appserver
-r--r--r-- 1 root root 1024 Mar  4 13:15 /etc/dsh/group/jobrunner
-r--r--r-- 1 root root 1230 Apr 15  2021 /etc/dsh/group/parsoid_php
-r--r--r-- 1 root root  402 Apr 16  2021 /etc/dsh/group/testserver

I am guessing the issue is that appserver (and likewise api_appserver) is not being populated properly. Its content is:

/etc/dsh/group/appserver
# DSH group appserver
# This file is managed by puppet.

# List of hosts defined in puppet
# Either directly in the declaration of the resource
# or via hiera scap::dsh::group::appserver


# List of hosts gathered from etcd
# etcd pool: /eqiad/appserver/apache2

# etcd pool: /codfw/appserver/apache2

That leads me to 6c9c8b3b0973940672bc6b59e3e95dea5ba7dab5 https://gerrit.wikimedia.org/r/c/operations/puppet/+/767203

commit 6c9c8b3b0973940672bc6b59e3e95dea5ba7dab5
Author:     Giuseppe Lavagetto <glavagetto@wikimedia.org>
AuthorDate: Tue Mar 1 16:40:05 2022 +0100
Commit:     Giuseppe Lavagetto <glavagetto@wikimedia.org>
CommitDate: Tue Mar 1 16:40:05 2022 +0100

    scap: fix dsh groups for mediawiki
    
    Change-Id: I8f178b8adceb05df8f73727d8e90f55fce5eeab8

diff --git a/hieradata/common/scap/dsh.yaml b/hieradata/common/scap/dsh.yaml
index 541354c4e3..6d970e1ae0 100644
--- a/hieradata/common/scap/dsh.yaml
+++ b/hieradata/common/scap/dsh.yaml
@@ -29,10 +29,10 @@ scap::dsh::groups:
       - {'cluster': 'testserver', 'service': 'apache2'}
   mediawiki-installation:
     conftool:
-      - {'cluster': 'appserver', 'service': 'apache2'}
-      - {'cluster': 'api_appserver', 'service': 'apache2'}
-      - {'cluster': 'jobrunner', 'service': 'apache2'}
-      - {'cluster': 'testserver', 'service': 'apache2'}
+      - {'cluster': 'appserver', 'service': 'nginx'}
+      - {'cluster': 'api_appserver', 'service': 'nginx'}
+      - {'cluster': 'jobrunner', 'service': 'nginx'}
+      - {'cluster': 'testserver', 'service': 'nginx'}
       - {'cluster': 'parsoid', 'service': 'parsoid-php'}
     hosts:
       - cloudweb2001-dev.wikimedia.org

And

hieradata/common/scap/dsh.yaml
scap::dsh::groups:
  appserver:
    conftool:
      - {'cluster': 'appserver', 'service': 'apache2'}

Last touched in Feb 2019.

I am guessing the service key in hiera should be nginx? It is probably worth revalidating all those filters.
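
To illustrate why a stale service name silently produces an empty group, and how the filters could be revalidated, here is a rough sketch. It assumes the apache2 service entries no longer exist in the pool, as the empty etcd sections above suggest. The pool layout, the function names and the host name are hypothetical and do not reflect conftool's actual data model.

```python
def resolve(selectors, pool):
    """Return hosts in pool matching any (cluster, service) selector.

    pool is a list of dicts with 'host', 'cluster' and 'service' keys; this
    shape is an assumption for illustration, not conftool's actual data model.
    """
    wanted = {(s["cluster"], s["service"]) for s in selectors}
    return sorted(
        {e["host"] for e in pool if (e["cluster"], e["service"]) in wanted}
    )


def unmatched(selectors, pool):
    """Return the selectors that match no pool entry at all."""
    known = {(e["cluster"], e["service"]) for e in pool}
    return [s for s in selectors if (s["cluster"], s["service"]) not in known]


# Hypothetical pool: the host name is a placeholder.
pool = [{"host": "mw-a.example", "cluster": "appserver", "service": "nginx"}]
stale = [{"cluster": "appserver", "service": "apache2"}]
fixed = [{"cluster": "appserver", "service": "nginx"}]

print(resolve(stale, pool), unmatched(stale, pool))
print(resolve(fixed, pool), unmatched(fixed, pool))
```

The stale selector resolves to no hosts without raising any error, which matches the empty sections in the generated file. Reporting unmatched selectors at Puppet compile or deploy time would surface that kind of drift.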

I am poking @Joe on IRC :)

Change 772822 had a related patch set uploaded (by Giuseppe Lavagetto; author: Giuseppe Lavagetto):

[operations/puppet@production] scap: fix dsh targets for php restarts

https://gerrit.wikimedia.org/r/772822

Thanks @hashar for the awesome analysis. Fixing now and apologies for the issue.

Change 772822 merged by Giuseppe Lavagetto:

[operations/puppet@production] scap: fix dsh targets for php restarts

https://gerrit.wikimedia.org/r/772822