Scap can't clear opcache on mw servers in Beta Cluster
Closed, Resolved (Public)

Assigned To

	dancy

Authored By

	MarcoAurelio
	Oct 31 2019, 3:25 PM

Description

Every beta-scap-eqiad job is raising:

Job ['/usr/bin/scap', 'pull', '--no-php-restart', '--no-update-l10n', 'deployment-deploy01.deployment-prep.eqiad.wmflabs', 'deployment-deploy02.deployment-prep.eqiad.wmflabs', 'deployment-deploy01.deployment-prep.eqiad.wmflabs'] called with an empty host list.

deployment-deploy01.deployment-prep.eqiad.wmflabs failed to update opcache: HTTPConnectionPool(host='deployment-deploy01.deployment-prep.eqiad.wmflabs', port=9181): Max retries exceeded with url: /opcache-free (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x7fddf5284f90>: Failed to establish a new connection: [Errno 111] Connection refused',))

deployment-deploy02.deployment-prep.eqiad.wmflabs failed to update opcache: A timeout happened before a response was received
15:16:30 deployment-mwmaint01.deployment-prep.eqiad.wmflabs failed to update opcache: A timeout happened before a response was received

deployment-snapshot01.deployment-prep.eqiad.wmflabs failed to update opcache: HTTPConnectionPool(host='deployment-snapshot01.deployment-prep.eqiad.wmflabs', port=9181): Max retries exceeded with url: /opcache-free (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x7fddf5279f90>: Failed to establish a new connection: [Errno 111] Connection refused',))

Recent example: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/273564/console
List of all beta-scap-eqiad jobs: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/

The code is updated nonetheless, but I was wondering whether those warnings are significant and, if so, whether a fix could be attempted.

Thanks.

Details

Subject	Repo	Branch	Lines +/-
scap.cfg.erb: Define php_fpm restart settings for beta cluster	operations/puppet	production	+4 -0
Add new dsh groups for beta	operations/puppet	production	+19 -5
scap.cfg.erb: Define php_fpm restart settings for beta cluster	operations/puppet	production	+3 -0

Related Objects

Status	Assigned	Task
Resolved	Krinkle	T212460 Adopt static array files for local disk storage of values (epic)
Open	None	T99740 Use static php array files for l10n cache at WMF (instead of CDB)
Resolved	Krinkle	T245183 PHP7 corruption reports in 2020-2022 (Call on wrong object, etc.)
Resolved	Krinkle	T253673 Avoid php-opcache corruption in WMF production
Resolved	Joe	T266055 Update Scap to perform rolling restart for all MW deploy
Resolved	dancy	T237033 Scap can't clear opcache on mw servers in Beta Cluster

Event Timeline

MarcoAurelio created this task. Oct 31 2019, 3:25 PM

Restricted Application added a subscriber: Aklapper. Oct 31 2019, 3:25 PM

hashar renamed this task from Scap warnings for job beta-scap-eqiad to On beta, scap can't clear opcache on some mw servers. Dec 17 2019, 7:58 PM

hashar edited projects, added Release-Engineering-Team, SRE; removed Jenkins.

Seems like there is no PHP opcache cleaner on the instances. I know nothing about that mechanism though :-\

Krinkle added a project: serviceops. Dec 17 2019, 11:58 PM

Compare the Hiera settings for the affected hosts to hieradata/labs/deployment-prep/host/deployment-mediawiki-parsoid10.yaml:

profile::mediawiki::php::enable_fpm: true
profile::mediawiki::php::fpm_config:
  opcache.interned_strings_buffer: 96
  opcache.memory_consumption: 1024
  apc.ttl: 10
# Configure php-fpm restarts
profile::mediawiki::php::restarts::ensure: present
# We set the restart watermark at 200 MB, which is approximately how much
# opcache one full day of deployments consume.
profile::mediawiki::php::restarts::opcache_limit: 200

The affected hosts probably don't have those?
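To make that comparison concrete, here is a minimal sketch that lists which of the keys above are absent from a host's Hiera YAML. It is only an illustration: the helper names (`top_level_keys`, `missing_keys`) are made up for this comment, and the parser handles just flat top-level keys like the ones quoted, not general YAML.

```python
import re

# Keys taken from the parsoid10 host file quoted above.
REQUIRED_KEYS = [
    "profile::mediawiki::php::enable_fpm",
    "profile::mediawiki::php::fpm_config",
    "profile::mediawiki::php::restarts::ensure",
    "profile::mediawiki::php::restarts::opcache_limit",
]

# A top-level key starts in column 0, is not a comment, and ends at the first
# ':' that is followed by whitespace or end of line (Hiera keys contain '::').
KEY_RE = re.compile(r"^([^\s#]\S*?):(?:\s|$)")


def top_level_keys(yaml_text):
    """Collect top-level keys from simple, flat Hiera YAML text."""
    keys = set()
    for line in yaml_text.splitlines():
        match = KEY_RE.match(line)
        if match:
            keys.add(match.group(1))
    return keys


def missing_keys(yaml_text):
    """Return the required keys that the given YAML text does not define."""
    present = top_level_keys(yaml_text)
    return [key for key in REQUIRED_KEYS if key not in present]
```

Running `missing_keys()` on the contents of an affected host's file (or on the project-wide common.yaml) would show which settings still need to be added.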

Also T236275#5621970 might be related.

@hashar based on Dzahn's comment, is that something your team could handle by sending a Puppet patch for the missing Hiera keys there? I can help review and deploy it. Let me know.

Dzahn triaged this task as Medium priority. Jan 6 2020, 9:14 PM

Those settings are for the Puppet roles. Given that roles are solely for production, the Hiera lookup hierarchy on WMCS does not include roles (T120165). All those settings are thus missing.

For production that is in:

hieradata/role/common/mediawiki/appserver.yaml
hieradata/role/common/mediawiki/appserver/api.yaml
hieradata/role/common/mediawiki/appserver/canary_api.yaml
hieradata/role/common/mediawiki/jobrunner.yaml

For Beta-Cluster-Infrastructure, I guess they can be applied project-wide via hieradata/labs/deployment-prep/common.yaml.

So the easiest option would probably be to just move the settings that @Dzahn found above (T237033#5760043), e.g. move the content of hieradata/labs/deployment-prep/host/deployment-mediawiki-parsoid10.yaml to the common.yaml file?
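A rough sketch of why the keys go missing on beta and why common.yaml would fix it. This is a simplified first-match lookup, not how Hiera is actually implemented (real Hiera also supports merge strategies); the layer names and their ordering are illustrative assumptions, and `<some host>` is a placeholder.

```python
def lookup(key, layers):
    """Return (value, layer_name) from the first layer that defines key.

    Layers are ordered most specific first.
    """
    for name, data in layers:
        if key in data:
            return data[key], name
    return None, None


KEY = "profile::mediawiki::php::restarts::ensure"

# Production: a role layer exists and carries the setting.
production = [
    ("host", {}),
    ("role/common/mediawiki/appserver", {KEY: "present"}),
    ("common", {}),
]

# Beta (WMCS): no role layer, so nothing defines the key unless a host file does.
beta = [
    ("labs/deployment-prep/host/<some host>", {}),
    ("labs/deployment-prep/common", {}),
]

# Proposed fix: define the key project-wide in common.yaml.
beta_fixed = [
    ("labs/deployment-prep/host/<some host>", {}),
    ("labs/deployment-prep/common", {KEY: "present"}),
]

print(lookup(KEY, production))
print(lookup(KEY, beta))
print(lookup(KEY, beta_fixed))
```

With the setting in the common layer, every host in the project resolves it without needing a per-host file.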

greg edited projects, added Release-Engineering-Team (Deployment services); removed Release-Engineering-Team. Feb 11 2020, 4:37 PM

Confirmed this is still happening on every beta deploy (latest):

01:28:32 deployment-deploy01.deployment-prep.eqiad.wmflabs failed to update opcache: HTTPConnectionPool(host='deployment-deploy01.deployment-prep.eqiad.wmflabs', port=9181): Max retries exceeded with url: /opcache-free (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x7f68dd1639d0>: Failed to establish a new connection: [Errno 111] Connection refused',))
01:28:32 deployment-deploy02.deployment-prep.eqiad.wmflabs failed to update opcache: A timeout happened before a response was received
01:28:32 deployment-mwmaint01.deployment-prep.eqiad.wmflabs failed to update opcache: A timeout happened before a response was received
01:28:32 deployment-snapshot01.deployment-prep.eqiad.wmflabs failed to update opcache: HTTPConnectionPool(host='deployment-snapshot01.deployment-prep.eqiad.wmflabs', port=9181): Max retries exceeded with url: /opcache-free (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x7f68dd0f0390>: Failed to establish a new connection: [Errno 111] Connection refused',))
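For anyone triaging these, the warnings fall into two groups: "Connection refused" (which typically means nothing is listening on port 9181 on that host) and timeouts. A small sketch for grouping console lines by failure kind follows; the function name and the category labels are my own, and the pattern is written only against the lines quoted in this task.

```python
import re

# Matches the warning lines quoted in this task, with zero or more leading
# Jenkins console timestamps (HH:MM:SS).
LINE_RE = re.compile(
    r"^(?:\d{2}:\d{2}:\d{2}\s+)*"
    r"(?P<host>\S+) failed to update opcache: (?P<reason>.*)$"
)


def classify(line):
    """Return (host, kind) for an opcache warning line, or None otherwise."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    reason = match.group("reason")
    if "Connection refused" in reason:
        kind = "refused"  # typically: no listener on the admin port
    elif "timeout" in reason.lower():
        kind = "timeout"  # no answer in time (dropped packets or a slow service)
    else:
        kind = "other"
    return match.group("host"), kind
```

Feeding the console output through `classify()` line by line gives a per-host summary; lines that are not opcache warnings return None and can be skipped.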

Krinkle renamed this task from On beta, scap can't clear opcache on some mw servers to Sap can't clear opcache on mw servers in Beta Cluster. Mar 17 2020, 1:34 AM

Krinkle renamed this task from Sap can't clear opcache on mw servers in Beta Cluster to Scap can't clear opcache on mw servers in Beta Cluster.

This one:

Job ['/usr/bin/scap', 'pull', '--no-php-restart', '--no-update-l10n', 'deployment-deploy01.deployment-prep.eqiad.wmflabs', 'deployment-deploy02.deployment-prep.eqiad.wmflabs', 'deployment-deploy01.deployment-prep.eqiad.wmflabs'] called with an empty host list.

Is not a big deal: we don't have any fan-out hosts on beta.

The remaining warnings appear to come from a feature that, as far as I know, is disabled in production: the php7-admin-port feature flag is not active there (and probably shouldn't be in beta either):

thcipriani@mw1234:~$ grep admin /etc/scap.cfg
#php7-admin-port: 9181
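To illustrate what the commented-out line means in practice, here is a minimal sketch using Python's standard configparser, which treats lines starting with `#` as comments, so the option is simply absent. The `[global]` section name and the `admin_port` helper are assumptions for illustration; scap's own config loader may behave differently.

```python
import configparser

# Hypothetical excerpt; the section name is an assumption.
SAMPLE_DISABLED = """\
[global]
#php7-admin-port: 9181
"""

SAMPLE_ENABLED = """\
[global]
php7-admin-port: 9181
"""


def admin_port(cfg_text, section="global"):
    """Return the configured php7-admin-port as a string, or None if unset."""
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    return parser.get(section, "php7-admin-port", fallback=None)


print(admin_port(SAMPLE_DISABLED))
print(admin_port(SAMPLE_ENABLED))
```

When the option is unset, a caller would have no admin port to contact, which is consistent with production not attempting the /opcache-free request.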

jijiki moved this task from Incoming 🐫 to 🔦Unused2 on the serviceops board. Aug 17 2020, 11:46 PM

thcipriani edited projects, added Release-Engineering-Team (thcipriani-workboard-fiddling); removed Release-Engineering-Team (Deployment services). Apr 20 2021, 12:56 AM

thcipriani moved this task from thcipriani-workboard-fiddling to Seen (ARCHIVE) on the Release-Engineering-Team board. Apr 20 2021, 12:59 AM

thcipriani edited projects, added Release-Engineering-Team; removed Release-Engineering-Team (thcipriani-workboard-fiddling).

thcipriani edited projects, added Release-Engineering-Team (Seen); removed Release-Engineering-Team. Apr 20 2021, 3:23 PM

@thcipriani @dancy I believe the equivalent of the beta-scap-eqiad job from back then (which is now 404 Not Found) is beta-scap-sync-world, is that right?

Looking at a recent build's output (link), I see neither success nor failure of any restarts, no mention of restart disablement in the scap args, and no mention of fpm anywhere either. I'm assuming then that restarts are still disabled there, is that right? (cc @Joe, ref T266055).

In T237033#7975492, @Krinkle wrote:

@thcipriani @dancy I believe the equivalent of the beta-scap-eqiad job from back then (which is now 404 Not Found) is beta-scap-sync-world, is that right?

That's correct. I renamed the job in https://gerrit.wikimedia.org/r/c/integration/config/+/678927

Looking at a recent build's output (link), I see neither success nor failure of any restarts, no mention of restart disablement in the scap args, and no mention of fpm anywhere either. I'm assuming then that restarts are still disabled there, is that right? (cc @Joe, ref T266055).

php_fpm_restart_script is defined in /etc/scap.cfg only for deploy1002.eqiad.wmnet and deploy2002.codfw.wmnet, so no restarts will happen in beta as currently configured.

Noting the following settings from the deployment-prep horizon project puppet config page:

profile::mediawiki::php::restarts::ensure: absent
profile::mediawiki::php::restarts::opcache_limit: 100

I'm going to change profile::mediawiki::php::restarts::ensure to present and see how things go.

Mentioned in SAL (#wikimedia-releng) [2022-06-08T15:57:20Z] <dancy> Set profile::mediawiki::php::restarts::ensure: present in deployment-prep hiera config for T237033

Change 803955 had a related patch set uploaded (by Ahmon Dancy; author: Ahmon Dancy):

[operations/puppet@production] scap.cfg.erb: Define php_fpm restart settings for beta cluster

https://gerrit.wikimedia.org/r/803955

gerritbot added a project: Patch-For-Review. Jun 8 2022, 5:11 PM

Change 803955 merged by Dzahn:

[operations/puppet@production] scap.cfg.erb: Define php_fpm restart settings for beta cluster

https://gerrit.wikimedia.org/r/803955

Maintenance_bot removed a project: Patch-For-Review. Jun 8 2022, 7:30 PM

Change 804440 had a related patch set uploaded (by Ahmon Dancy; author: Ahmon Dancy):

[operations/puppet@production] Add new dsh groups for beta

https://gerrit.wikimedia.org/r/804440

Change 804441 had a related patch set uploaded (by Ahmon Dancy; author: Ahmon Dancy):

[operations/puppet@production] scap.cfg.erb: Define php_fpm restart settings for beta cluster

https://gerrit.wikimedia.org/r/804441

Change 804440 merged by Dzahn:

[operations/puppet@production] Add new dsh groups for beta

https://gerrit.wikimedia.org/r/804440

Change 804441 merged by Dzahn:

[operations/puppet@production] scap.cfg.erb: Define php_fpm restart settings for beta cluster

https://gerrit.wikimedia.org/r/804441

Maintenance_bot removed a project: Patch-For-Review. Jun 9 2022, 10:30 PM

dancy closed this task as Resolved. Jun 10 2022, 3:37 PM

dancy claimed this task.

Krinkle added a parent task: T266055: Update Scap to perform rolling restart for all MW deploy. Jun 16 2022, 8:59 PM
