
Increased load on graphite1003, carbon-cache not autorestarting when killed by OOM
Closed, Duplicate · Public

Description

System load on graphite1003 has gone up significantly starting on 2017-01-20 around 22:00.

As a result, the OOM killer did its thing a couple of times, with carbon-cache@c.service being the victim:

[Sat Jan 21 04:05:59 2017] Out of memory: Kill process 4879 (carbon-cache) score 62 or sacrifice child
[Sat Jan 21 04:05:59 2017] Killed process 4879 (carbon-cache) total-vm:4217132kB, anon-rss:4126184kB, file-rss:1748kB
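
For tallying such events across a longer kernel log, the "Killed process" line can be parsed mechanically. The following is a minimal sketch, assuming the line format shown above; the function name `parse_oom_kill` and the dictionary keys are made up for illustration, and other kernel versions may print additional fields that this pattern ignores.

```python
import re

# Matches the kernel's "Killed process" line in the format quoted above.
KILLED_RE = re.compile(
    r"Killed process (?P<pid>\d+) \((?P<name>[^)]+)\) "
    r"total-vm:(?P<total_vm>\d+)kB, "
    r"anon-rss:(?P<anon_rss>\d+)kB, "
    r"file-rss:(?P<file_rss>\d+)kB"
)


def parse_oom_kill(line):
    """Return a dict describing the killed process, or None if the line does not match."""
    m = KILLED_RE.search(line)
    if m is None:
        return None
    return {
        "pid": int(m["pid"]),
        "name": m["name"],
        "total_vm_kb": int(m["total_vm"]),
        "anon_rss_kb": int(m["anon_rss"]),
        "file_rss_kb": int(m["file_rss"]),
    }


line = (
    "[Sat Jan 21 04:05:59 2017] Killed process 4879 (carbon-cache) "
    "total-vm:4217132kB, anon-rss:4126184kB, file-rss:1748kB"
)
info = parse_oom_kill(line)
print(info["name"], info["pid"], info["anon_rss_kb"], "kB anon RSS")
```

For the line above, the anonymous RSS of 4126184 kB works out to roughly 3.9 GiB held by a single carbon-cache instance at the time it was killed.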

@Volans and I restarted carbon-cache@c.service by hand when that happened.

We should figure out what's going on with graphite1003's load, and perhaps consider auto-restarting the service in case of failures.
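
As a rough idea of what auto-restarting could look like, systemd's `Restart=` and `RestartSec=` directives in the `[Service]` section cover this case: `Restart=on-failure` also applies when a process is terminated by SIGKILL, which is what the OOM killer sends. The sketch below renders such a drop-in; the helper name `render_restart_dropin` and the values `on-failure` and 5 seconds are assumptions for illustration, not settings taken from this host.

```python
import configparser
import io


def render_restart_dropin(restart="on-failure", restart_sec=5):
    """Render a systemd drop-in that restarts a service after it fails.

    The default values are illustrative assumptions only.
    """
    parser = configparser.ConfigParser()
    # systemd key names are case-sensitive; stop configparser from lowercasing them.
    parser.optionxform = str
    parser["Service"] = {"Restart": restart, "RestartSec": str(restart_sec)}
    buf = io.StringIO()
    # Write "Key=value" without spaces around the delimiter, as is conventional in unit files.
    parser.write(buf, space_around_delimiters=False)
    return buf.getvalue()


print(render_restart_dropin())
```

The rendered text could be placed in a drop-in such as `/etc/systemd/system/carbon-cache@.service.d/restart.conf` (the file name is hypothetical), followed by `systemctl daemon-reload`. Note that a restart only shortens the outage; it does not address whatever is driving memory use up.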

Event Timeline

ema created this task. · Jan 21 2017, 4:32 AM
Restricted Application added a subscriber: Aklapper. · Jan 21 2017, 4:32 AM
ema triaged this task as High priority. · Jan 21 2017, 4:32 AM
ema added a project: Operations.
ema updated the task description.

Change 334364 had a related patch set uploaded (by Filippo Giunchedi):
graphite: add Restart / RestartSec for graphite daemons

https://gerrit.wikimedia.org/r/334364

See also T116767 (limit the impact of heavy/large graphite queries), which tracks heavy graphite queries; closing this task as a duplicate of it.

Change 334364 merged by Filippo Giunchedi:
graphite: add Restart / RestartSec for graphite daemons

https://gerrit.wikimedia.org/r/334364