Maniphest T241001

cp3050 depooled due to explosion in CPU usage and inuse sockets
Closed, Resolved (Public)

Assigned To

Authored By

	CDanis
	Dec 17 2019, 10:11 PM

Tags

Referenced Files

	F31479671: Screenshot 2019-12-18 at 00.54.41.png
	Dec 18 2019, 12:57 AM

Subscribers

Description

We were serving about 60 rps of 503s from esams: https://grafana.wikimedia.org/d/000000479/frontend-traffic?orgId=1&from=1576615825607&to=1576621811000&var-site=esams&var-cache_type=text&var-cache_type=upload&var-status_type=5

Logstash: https://logstash.wikimedia.org/goto/b6a2987ff6b4be14f1f8fa2305aef56c

Tracked this down to just cp3050 having some sort of backend-ATS stomachache (pop open the "CPU per host" section): https://grafana.wikimedia.org/d/000000607/cluster-overview?orgId=1&var-datasource=esams%20prometheus%2Fops&var-cluster=cache_text&var-instance=All&from=1576615825607&to=1576621811000

There were also a bunch of extra inuse sockets (about 2x): https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&refresh=5m&var-server=cp3050&var-datasource=esams%20prometheus%2Fops&var-cluster=cache_text&from=1576615825607&to=1576621811000

I gathered some atslog-backend output in an NDA'd paste: P9920

and then I depooled the host.
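
The triage described above (narrowing a site-wide 503 rate down to one host whose inuse sockets were about 2x those of its peers) can be sketched as a comparison against the cluster median. This is only an illustration, not the tooling actually used: the function name, the 1.8 threshold, the peer host names, and all sample values are hypothetical.

```python
from statistics import median


def find_outliers(metric_by_host, factor=1.8):
    """Return hosts whose metric is at least `factor` times the cluster median.

    `factor` is an assumed threshold chosen so that a host at roughly
    2x its peers is flagged; tune it for the metric in question.
    """
    baseline = median(metric_by_host.values())
    return sorted(
        host
        for host, value in metric_by_host.items()
        if baseline > 0 and value >= factor * baseline
    )


# Hypothetical sample values, NOT real measurements from the incident.
# Only cp3050 is a host named in this task; the peers are placeholders.
inuse_sockets = {
    "cp3050": 41000,
    "peer-a": 20500,
    "peer-b": 19800,
    "peer-c": 21000,
}

print(find_outliers(inuse_sockets))
```

Using the median rather than the mean keeps the baseline from being pulled upward by the misbehaving host itself, which matters in a small cluster.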

Details

	Subject	Repo	Branch	Lines +/-
	Revert "ATS: enable xdebug plugin on 3 hosts"	operations/puppet	production	+0 -2

Related Objects

Mentioned Here: T238494: 15% response start regression as of 2019-11-11 (Varnish->ATS)

Event Timeline

CDanis created this task. Dec 17 2019, 10:11 PM

Restricted Application added a subscriber: Aklapper. Dec 17 2019, 10:11 PM

Krinkle added a project: Wikimedia-Incident. Dec 17 2019, 10:13 PM

Krinkle moved this task from Active investigation to Follow-up prevention on the Wikimedia-Incident board.

Krinkle moved this task from Follow-up prevention to Draft and review incident doc on the Wikimedia-Incident board.

CDanis updated the task description. Dec 17 2019, 10:15 PM

I've added a Grafana annotation with various tags for alignment in dashboards.

The partial esams outage lasted about 22 minutes.

Screenshot 2019-12-18 at 00.54.41.png (992×1 px, 94 KB)

• ema triaged this task as Medium priority. Dec 18 2019, 9:43 AM

The host was the only one in esams running with the xdebug plugin enabled, which had been turned on to debug the following TTFB regression reported by the Performance-Team: T238494.

Suspecting that it might be the cause of this crash, @Vgutierrez and I disabled the plugin, restarted ats-be at 08:59 and repooled the host at 09:18.

Change 559457 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] Revert "ATS: enable xdebug plugin on 3 hosts"

https://gerrit.wikimedia.org/r/559457

gerritbot added a project: Patch-For-Review. Dec 19 2019, 1:09 PM

• ema moved this task from Backlog to Caching on the Traffic board. Dec 19 2019, 2:17 PM

Change 559457 merged by Ema:
[operations/puppet@production] Revert "ATS: enable xdebug plugin on 3 hosts"

https://gerrit.wikimedia.org/r/559457

Mentioned in SAL (#wikimedia-operations) [2019-12-19T14:41:10Z] <ema> cp1075, cp4028: ats-backend-restart to disable xdebug plugin T241001

Maintenance_bot removed a project: Patch-For-Review. Dec 19 2019, 3:10 PM

text@esams has had no similar issues since we disabled xdebug in December, closing.