
Scap can't clear opcache on mw servers in Beta Cluster
Closed, ResolvedPublic

Description

Every beta-scap-eqiad job is raising:

Job ['/usr/bin/scap', 'pull', '--no-php-restart', '--no-update-l10n', 'deployment-deploy01.deployment-prep.eqiad.wmflabs', 'deployment-deploy02.deployment-prep.eqiad.wmflabs', 'deployment-deploy01.deployment-prep.eqiad.wmflabs'] called with an empty host list.

deployment-deploy01.deployment-prep.eqiad.wmflabs failed to update opcache: HTTPConnectionPool(host='deployment-deploy01.deployment-prep.eqiad.wmflabs', port=9181): Max retries exceeded with url: /opcache-free (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x7fddf5284f90>: Failed to establish a new connection: [Errno 111] Connection refused',))

deployment-deploy02.deployment-prep.eqiad.wmflabs failed to update opcache: A timeout happened before a response was received
15:16:30 deployment-mwmaint01.deployment-prep.eqiad.wmflabs failed to update opcache: A timeout happened before a response was received

deployment-snapshot01.deployment-prep.eqiad.wmflabs failed to update opcache: HTTPConnectionPool(host='deployment-snapshot01.deployment-prep.eqiad.wmflabs', port=9181): Max retries exceeded with url: /opcache-free (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x7fddf5279f90>: Failed to establish a new connection: [Errno 111] Connection refused',))

Recent example: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/273564/console
List of all beta-scap-eqiad jobs: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/

The code is updated nonetheless, but I was wondering whether those warnings are significant and, if so, whether a fix could be attempted.

Thanks.

Event Timeline

hashar renamed this task from Scap warnings for job beta-scap-eqiad to On beta, scap can't clear opcache on some mw servers.Dec 17 2019, 7:58 PM
hashar edited projects, added Release-Engineering-Team, SRE; removed Jenkins.

Seems like there is no PHP opcache cleaner on the instances. I know nothing about that mechanism though :-\

Compare the Hiera settings for the affected hosts to hieradata/labs/deployment-prep/host/deployment-mediawiki-parsoid10.yaml:

profile::mediawiki::php::enable_fpm: true
profile::mediawiki::php::fpm_config:
  opcache.interned_strings_buffer: 96
  opcache.memory_consumption: 1024
  apc.ttl: 10
# Configure php-fpm restarts
profile::mediawiki::php::restarts::ensure: present
# We set the restart watermark at 200 MB, which is approximately how much
# opcache one full day of deployments consume.
profile::mediawiki::php::restarts::opcache_limit: 200

The affected hosts probably don't have those?
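One way to make that comparison concrete is to diff the key sets. This is only a minimal sketch: it assumes the Hiera data has already been loaded into plain dicts, and the contents of `affected_host` are hypothetical rather than read from a real instance.

```python
# Reference keys as quoted above from deployment-mediawiki-parsoid10.yaml.
reference = {
    "profile::mediawiki::php::enable_fpm": True,
    "profile::mediawiki::php::fpm_config": {
        "opcache.interned_strings_buffer": 96,
        "opcache.memory_consumption": 1024,
        "apc.ttl": 10,
    },
    "profile::mediawiki::php::restarts::ensure": "present",
    "profile::mediawiki::php::restarts::opcache_limit": 200,
}

# Hypothetical Hiera data for an affected host (assumption, for illustration).
affected_host = {
    "profile::mediawiki::php::enable_fpm": True,
}


def missing_keys(reference, candidate):
    """Return the top-level keys present in reference but absent in candidate."""
    return sorted(set(reference) - set(candidate))


for key in missing_keys(reference, affected_host):
    print("missing:", key)
```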

@hashar based on Dzahn's comment, is that something your team could handle by sending a Puppet patch for the missing Hiera keys (I can help review and deploy it)? Let me know.

Dzahn triaged this task as Medium priority.Jan 6 2020, 9:14 PM

Those settings are for the Puppet roles. Given roles are solely for production, on WMCS the Hiera lookup hierarchy does not include roles (T120165). All those settings are thus missing.

For production that is in:

hieradata/role/common/mediawiki/appserver.yaml
hieradata/role/common/mediawiki/appserver/api.yaml
hieradata/role/common/mediawiki/appserver/canary_api.yaml
hieradata/role/common/mediawiki/jobrunner.yaml

For Beta-Cluster-Infrastructure, I guess they can be applied project-wide via hieradata/labs/deployment-prep/common.yaml.

So the easiest would probably be to just move the settings that @Dzahn found above (T237033#5760043), e.g. move the content of hieradata/labs/deployment-prep/host/deployment-mediawiki-parsoid10.yaml to the common.yaml file?
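To illustrate why the role-level settings never reach WMCS instances, here is a rough sketch of a first-match hierarchical lookup. It is not how Hiera is implemented; the level names and the `hiera_lookup` helper are assumptions made up for this example.

```python
def hiera_lookup(key, hierarchy, data):
    """Return the first value found for key, walking the hierarchy in order."""
    for level in hierarchy:
        values = data.get(level, {})
        if key in values:
            return values[key]
    return None


# Level names and contents are illustrative assumptions, not real Hiera paths.
data = {
    "role/mediawiki/appserver": {
        "profile::mediawiki::php::restarts::ensure": "present",
    },
    "labs/deployment-prep/common": {},
}
production_hierarchy = ["role/mediawiki/appserver", "common"]
wmcs_hierarchy = ["labs/deployment-prep/host", "labs/deployment-prep/common"]

key = "profile::mediawiki::php::restarts::ensure"
print(hiera_lookup(key, production_hierarchy, data))  # present
print(hiera_lookup(key, wmcs_hierarchy, data))        # None

# Moving the setting into the project-wide file makes it visible on WMCS.
data["labs/deployment-prep/common"][key] = "present"
print(hiera_lookup(key, wmcs_hierarchy, data))        # present
```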

Confirmed this is still happening on every beta deploy (latest):

01:28:32 deployment-deploy01.deployment-prep.eqiad.wmflabs failed to update opcache: HTTPConnectionPool(host='deployment-deploy01.deployment-prep.eqiad.wmflabs', port=9181): Max retries exceeded with url: /opcache-free (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x7f68dd1639d0>: Failed to establish a new connection: [Errno 111] Connection refused',))
01:28:32 deployment-deploy02.deployment-prep.eqiad.wmflabs failed to update opcache: A timeout happened before a response was received
01:28:32 deployment-mwmaint01.deployment-prep.eqiad.wmflabs failed to update opcache: A timeout happened before a response was received
01:28:32 deployment-snapshot01.deployment-prep.eqiad.wmflabs failed to update opcache: HTTPConnectionPool(host='deployment-snapshot01.deployment-prep.eqiad.wmflabs', port=9181): Max retries exceeded with url: /opcache-free (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x7f68dd0f0390>: Failed to establish a new connection: [Errno 111] Connection refused',))
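The log lines above fall into two groups: connections refused on port 9181 and timeouts. A small sketch for sorting console output into those groups follows; the `classify` helper and its cause labels are assumptions for illustration, and the sample details are abbreviated from the output above.

```python
import re

# Optional leading HH:MM:SS timestamps, then "<host> failed to update opcache: <detail>".
LINE_RE = re.compile(
    r"^(?:\d{2}:\d{2}:\d{2}\s+)*(?P<host>\S+) failed to update opcache: (?P<detail>.*)$"
)


def classify(line):
    """Return (host, cause) for a 'failed to update opcache' line, else None."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    detail = match.group("detail")
    if "Connection refused" in detail:
        cause = "refused"  # Errno 111: nothing accepted the connection on that port
    elif "timeout" in detail.lower():
        cause = "timeout"  # no response arrived in time
    else:
        cause = "other"
    return match.group("host"), cause


sample = [
    "01:28:32 deployment-deploy01.deployment-prep.eqiad.wmflabs failed to update opcache: "
    "[Errno 111] Connection refused",
    "01:28:32 deployment-deploy02.deployment-prep.eqiad.wmflabs failed to update opcache: "
    "A timeout happened before a response was received",
]
for line in sample:
    print(classify(line))
```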
Krinkle renamed this task from On beta, scap can't clear opcache on some mw servers to Sap can't clear opcache on mw servers in Beta Cluster.Mar 17 2020, 1:34 AM
Krinkle renamed this task from Sap can't clear opcache on mw servers in Beta Cluster to Scap can't clear opcache on mw servers in Beta Cluster.

This one:

Job ['/usr/bin/scap', 'pull', '--no-php-restart', '--no-update-l10n', 'deployment-deploy01.deployment-prep.eqiad.wmflabs', 'deployment-deploy02.deployment-prep.eqiad.wmflabs', 'deployment-deploy01.deployment-prep.eqiad.wmflabs'] called with an empty host list.

Is not a big deal: we don't have any fan-out hosts on beta.

The remainder look like updates to do with a feature that afaik is disabled in production: that is, the feature-flag php7-admin-port is not active in production (and probably shouldn't be in beta either):

thcipriani@mw1234:~$ grep admin /etc/scap.cfg
#php7-admin-port: 9181
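The same check as the grep above can be sketched in Python with the standard-library `configparser`, which treats `#` lines as comments. The `[global]` section name and the `admin_port` helper are assumptions for this example; the real layout of /etc/scap.cfg may differ.

```python
import configparser


def admin_port(cfg_text, section="global"):
    """Return the php7-admin-port value as an int, or None if it is not set."""
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    if parser.has_option(section, "php7-admin-port"):
        return parser.getint(section, "php7-admin-port")
    return None


# Commented out, as in the production grep output above: the flag is inactive.
disabled = "[global]\n#php7-admin-port: 9181\n"
# Uncommented (hypothetical): the flag would be active.
enabled = "[global]\nphp7-admin-port: 9181\n"

print(admin_port(disabled))  # None
print(admin_port(enabled))   # 9181
```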

@thcipriani @dancy I believe the equivalent of the beta-scap-eqiad job from back then (which is now 404 Not Found) is beta-scap-sync-world, is that right?

Looking at a recent build's output (link) I see neither success nor failure with respect to any restarts, no mention of restart disablement in the scap args, and no mention of fpm anywhere either. I'm assuming then that restarts are still disabled there, is that right? (cc @Joe, ref T266055).

@thcipriani @dancy I believe the equivalent of the beta-scap-eqiad job from back then (which is now 404 Not Found) is beta-scap-sync-world, is that right?

That's correct. I renamed the job in https://gerrit.wikimedia.org/r/c/integration/config/+/678927

Looking at a recent build's output (link) I see neither success nor failure with respect to any restarts, no mention of restart disablement in the scap args, and no mention of fpm anywhere either. I'm assuming then that restarts are still disabled there, is that right? (cc @Joe, ref T266055).

php_fpm_restart_script is defined in /etc/scap.cfg only for deploy1002.eqiad.wmnet and deploy2002.codfw.wmnet, so no restarts will happen in beta as currently configured.

Noting the following settings from the deployment-prep horizon project puppet config page:

profile::mediawiki::php::restarts::ensure: absent
profile::mediawiki::php::restarts::opcache_limit: 100

I'm going to change profile::mediawiki::php::restarts::ensure to present and see how things go.

Mentioned in SAL (#wikimedia-releng) [2022-06-08T15:57:20Z] <dancy> Set profile::mediawiki::php::restarts::ensure: present in deployment-prep hiera config for T237033

Change 803955 had a related patch set uploaded (by Ahmon Dancy; author: Ahmon Dancy):

[operations/puppet@production] scap.cfg.erb: Define php_fpm restart settings for beta cluster

https://gerrit.wikimedia.org/r/803955

Change 803955 merged by Dzahn:

[operations/puppet@production] scap.cfg.erb: Define php_fpm restart settings for beta cluster

https://gerrit.wikimedia.org/r/803955

Change 804440 had a related patch set uploaded (by Ahmon Dancy; author: Ahmon Dancy):

[operations/puppet@production] Add new dsh groups for beta

https://gerrit.wikimedia.org/r/804440

Change 804441 had a related patch set uploaded (by Ahmon Dancy; author: Ahmon Dancy):

[operations/puppet@production] scap.cfg.erb: Define php_fpm restart settings for beta cluster

https://gerrit.wikimedia.org/r/804441

Change 804440 merged by Dzahn:

[operations/puppet@production] Add new dsh groups for beta

https://gerrit.wikimedia.org/r/804440

Change 804441 merged by Dzahn:

[operations/puppet@production] scap.cfg.erb: Define php_fpm restart settings for beta cluster

https://gerrit.wikimedia.org/r/804441

dancy claimed this task.