
Scap can't clear opcache on mw servers in Beta Cluster
Open, Medium, Public

Description

Every beta-scap-eqiad job is raising:

Job ['/usr/bin/scap', 'pull', '--no-php-restart', '--no-update-l10n', 'deployment-deploy01.deployment-prep.eqiad.wmflabs', 'deployment-deploy02.deployment-prep.eqiad.wmflabs', 'deployment-deploy01.deployment-prep.eqiad.wmflabs'] called with an empty host list.

deployment-deploy01.deployment-prep.eqiad.wmflabs failed to update opcache: HTTPConnectionPool(host='deployment-deploy01.deployment-prep.eqiad.wmflabs', port=9181): Max retries exceeded with url: /opcache-free (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x7fddf5284f90>: Failed to establish a new connection: [Errno 111] Connection refused',))

deployment-deploy02.deployment-prep.eqiad.wmflabs failed to update opcache: A timeout happened before a response was received
15:16:30 deployment-mwmaint01.deployment-prep.eqiad.wmflabs failed to update opcache: A timeout happened before a response was received

deployment-snapshot01.deployment-prep.eqiad.wmflabs failed to update opcache: HTTPConnectionPool(host='deployment-snapshot01.deployment-prep.eqiad.wmflabs', port=9181): Max retries exceeded with url: /opcache-free (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x7fddf5279f90>: Failed to establish a new connection: [Errno 111] Connection refused',))
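Both kinds of warning come from a per-host request to port 9181 at /opcache-free. The sketch below shows how the two failure modes differ: "Connection refused" means nothing is listening on that port, while a timeout means no answer arrived in time (for example, a firewall dropping packets or a listener that never replies). It assumes a plain HTTP GET and uses the standard library instead of the requests library visible in the traceback. It is an illustration, not scap's code: `opcache_free_url`, `probe_opcache_admin` and the result labels are hypothetical.

```python
import socket
import urllib.error
import urllib.request

# Port and path come from the warnings above; the method, timeout and
# result labels are illustrative assumptions, not scap's implementation.
ADMIN_PORT = 9181

# Talk to each host directly, ignoring any *_proxy environment variables.
_OPENER = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def opcache_free_url(host: str, port: int = ADMIN_PORT) -> str:
    """Build the per-host admin URL that the warnings mention."""
    return f"http://{host}:{port}/opcache-free"


def probe_opcache_admin(host: str, port: int = ADMIN_PORT, timeout: float = 5.0) -> str:
    """Request /opcache-free and classify the outcome.

    "refused":    nothing listens on the port (deploy01, snapshot01 above).
    "timeout":    no answer within `timeout` seconds (deploy02, mwmaint01 above).
    "http-error": a server answered, but with a 4xx/5xx status.
    """
    try:
        with _OPENER.open(opcache_free_url(host, port), timeout=timeout):
            return "ok"
    except urllib.error.HTTPError:
        return "http-error"
    except urllib.error.URLError as exc:
        # Connection-phase errors are wrapped in URLError.reason.
        if isinstance(exc.reason, ConnectionRefusedError):
            return "refused"
        if isinstance(exc.reason, socket.timeout):
            return "timeout"
        raise
    except socket.timeout:
        # Timeouts while waiting for the response are raised unwrapped.
        return "timeout"
```

Run from the deployment host against each affected instance, a "refused" result would point at a missing listener rather than at a network problem.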

Recent example: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/273564/console
List of all beta-scap-eqiad jobs: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/

The code is updated nonetheless, but I was wondering whether those warnings are significant and, if so, whether a fix could be attempted.

Thanks.

Event Timeline

hashar renamed this task from Scap warnings for job beta-scap-eqiad to On beta, scap can't clear opcache on some mw servers. Dec 17 2019, 7:58 PM
hashar edited projects, added Release-Engineering-Team, SRE; removed Jenkins.

Seems like there is no PHP opcache cleaner on the instances. I know nothing about that mechanism though :-\

Compare the Hiera settings for the affected hosts to hieradata/labs/deployment-prep/host/deployment-mediawiki-parsoid10.yaml:

profile::mediawiki::php::enable_fpm: true
profile::mediawiki::php::fpm_config:
  opcache.interned_strings_buffer: 96
  opcache.memory_consumption: 1024
  apc.ttl: 10
# Configure php-fpm restarts
profile::mediawiki::php::restarts::ensure: present
# We set the restart watermark at 200 MB, which is approximately how much
# opcache one full day of deployments consume.
profile::mediawiki::php::restarts::opcache_limit: 200
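The restart watermark can be read as a simple threshold check. The sketch below assumes that `opcache_limit` means "restart php-fpm once free opcache memory drops below this many MB"; that interpretation is inferred from the comment above, and `needs_restart` is a hypothetical helper, not the Puppet-managed restart script.

```python
# Assumed semantics of profile::mediawiki::php::restarts::opcache_limit:
# restart php-fpm once free opcache memory drops below the watermark.
MB = 1024 * 1024           # PHP's opcache "MB" settings are 1024*1024 bytes
OPCACHE_MEMORY_MB = 1024   # opcache.memory_consumption from the Hiera above
OPCACHE_LIMIT_MB = 200     # ...restarts::opcache_limit from the Hiera above


def needs_restart(free_memory_bytes: int, limit_mb: int = OPCACHE_LIMIT_MB) -> bool:
    """True when less than `limit_mb` MB of opcache memory is still free.

    `free_memory_bytes` would come from something like the
    memory_usage.free_memory field of PHP's opcache_get_status().
    """
    return free_memory_bytes < limit_mb * MB


# 850 MB used out of 1024 MB leaves 174 MB free: less than one more day of
# deployments (~200 MB), so a restart would be due.
assert needs_restart((OPCACHE_MEMORY_MB - 850) * MB)
assert not needs_restart(300 * MB)
```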

The affected hosts probably don't have those?

@hashar based on Dzahn's comment, is that something your team could handle, i.e. sending a Puppet patch for the missing Hiera keys (I can help review and deploy it)? Let me know.

Dzahn triaged this task as Medium priority. Jan 6 2020, 9:14 PM

Those settings are defined for the Puppet roles. Since roles are used solely in production, the Hiera lookup hierarchy on WMCS does not include roles (T120165), so all those settings are missing.

In production, they are defined in:

hieradata/role/common/mediawiki/appserver.yaml
hieradata/role/common/mediawiki/appserver/api.yaml
hieradata/role/common/mediawiki/appserver/canary_api.yaml
hieradata/role/common/mediawiki/jobrunner.yaml

For Beta-Cluster-Infrastructure, I guess they can be applied project-wide via hieradata/labs/deployment-prep/common.yaml.

So the easiest fix would probably be to move the settings @Dzahn found above (T237033#5760043), e.g. move the content of hieradata/labs/deployment-prep/host/deployment-mediawiki-parsoid10.yaml to the common.yaml file?
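Why only some hosts are affected can be seen with a deliberately simplified model of a first-found lookup (Hiera's default behavior: the first hierarchy level that defines a key wins). The layer contents below are hypothetical stand-ins named after the files quoted above, and the real WMCS hierarchy has more levels than "host" and "project common".

```python
from typing import Any, Mapping, Optional, Sequence

ENABLE_FPM = "profile::mediawiki::php::enable_fpm"


def lookup(key: str, hierarchy: Sequence[Mapping[str, Any]]) -> Optional[Any]:
    """Return the value from the first layer that defines `key` (None if none do)."""
    for layer in hierarchy:
        if key in layer:
            return layer[key]
    return None


# Hypothetical layer contents, named after the files quoted above.
parsoid10_host = {ENABLE_FPM: True}   # host/deployment-mediawiki-parsoid10.yaml
other_host: dict = {}                 # an affected host without those keys
common_before: dict = {}              # deployment-prep/common.yaml today
common_after = {ENABLE_FPM: True}     # after moving the settings project-wide

# On WMCS there is no role layer, only host -> project common (simplified).
assert lookup(ENABLE_FPM, [parsoid10_host, common_before]) is True
assert lookup(ENABLE_FPM, [other_host, common_before]) is None
assert lookup(ENABLE_FPM, [other_host, common_after]) is True
```

Moving the keys into the project-wide layer makes every host inherit them, while a host file can still override a value if one instance needs something different.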

Confirmed that this is still happening on every beta deploy (latest run):

01:28:32 deployment-deploy01.deployment-prep.eqiad.wmflabs failed to update opcache: HTTPConnectionPool(host='deployment-deploy01.deployment-prep.eqiad.wmflabs', port=9181): Max retries exceeded with url: /opcache-free (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x7f68dd1639d0>: Failed to establish a new connection: [Errno 111] Connection refused',))
01:28:32 deployment-deploy02.deployment-prep.eqiad.wmflabs failed to update opcache: A timeout happened before a response was received
01:28:32 deployment-mwmaint01.deployment-prep.eqiad.wmflabs failed to update opcache: A timeout happened before a response was received
01:28:32 deployment-snapshot01.deployment-prep.eqiad.wmflabs failed to update opcache: HTTPConnectionPool(host='deployment-snapshot01.deployment-prep.eqiad.wmflabs', port=9181): Max retries exceeded with url: /opcache-free (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x7f68dd0f0390>: Failed to establish a new connection: [Errno 111] Connection refused',))

Krinkle renamed this task from On beta, scap can't clear opcache on some mw servers to Sap can't clear opcache on mw servers in Beta Cluster. Mar 17 2020, 1:34 AM
Krinkle renamed this task from Sap can't clear opcache on mw servers in Beta Cluster to Scap can't clear opcache on mw servers in Beta Cluster.

This one:

Job ['/usr/bin/scap', 'pull', '--no-php-restart', '--no-update-l10n', 'deployment-deploy01.deployment-prep.eqiad.wmflabs', 'deployment-deploy02.deployment-prep.eqiad.wmflabs', 'deployment-deploy01.deployment-prep.eqiad.wmflabs'] called with an empty host list.

Is not a big deal: we don't have any fan-out hosts on beta.

The remaining warnings look related to a feature that, AFAIK, is disabled in production: the php7-admin-port feature flag is not active there (and probably shouldn't be in beta either):

thcipriani@mw1234:~$ grep admin /etc/scap.cfg
#php7-admin-port: 9181
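A deploy tool could skip the opcache step entirely when that flag is unset. The sketch below assumes /etc/scap.cfg is an INI-style file with a [global] section (only the option name comes from the grep above); `admin_port` and `should_clear_opcache` are hypothetical helpers, not scap's actual implementation.

```python
import configparser
from typing import Optional

# Assumption: /etc/scap.cfg is INI-style with a [global] section.
OPTION = "php7-admin-port"


def admin_port(cfg_text: str, section: str = "global") -> Optional[int]:
    """Return the configured admin port, or None if the option is unset.

    configparser treats lines starting with '#' as comments, so the
    production line '#php7-admin-port: 9181' counts as unset.
    """
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    if not parser.has_option(section, OPTION):
        return None
    return parser.getint(section, OPTION)


def should_clear_opcache(cfg_text: str) -> bool:
    """Only try /opcache-free when the admin port feature is enabled."""
    return admin_port(cfg_text) is not None


production_cfg = "[global]\n#php7-admin-port: 9181\n"
enabled_cfg = "[global]\nphp7-admin-port: 9181\n"
assert not should_clear_opcache(production_cfg)
assert admin_port(enabled_cfg) == 9181
```

With the flag commented out on beta as it is in production, a gate like this would never contact port 9181, so the warnings above would not be emitted.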