
graphite1003 short of available RAM
Closed, Duplicate · Public


Today I had to manually restart one of the carbon-cache processes on graphite1003 because it was killed by the oom-killer.

Seems that graphite1003 is quite short on memory:

$ free -m
             total       used       free     shared    buffers     cached
Mem:         64267      64039        228       1347          1       2732
-/+ buffers/cache:      61305       2962
Swap:          255        249          6

Most of the memory is, of course, used by the carbon-cache and uwsgi-graphite-web processes.
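The `-/+ buffers/cache` row in the `free -m` output above is simple arithmetic on the first row: buffers and page cache are reclaimable, so they are subtracted from the raw "used" figure to get the memory actually held by applications. A minimal sketch (the small off-by-one versus the output above is `free`'s MB rounding):

```python
def app_used_mb(used_mb: int, buffers_mb: int, cached_mb: int) -> int:
    """The '-/+ buffers/cache' used figure: raw 'used' minus
    reclaimable buffers and page cache."""
    return used_mb - buffers_mb - cached_mb

# Values from the free -m output above:
# 64039 - 1 - 2732 = 61306 MB (free shows 61305 due to rounding)
print(app_used_mb(64039, 1, 2732))
```

With almost nothing reclaimable and swap nearly exhausted, any further allocation by carbon-cache is a likely oom-killer trigger.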

From Grafana, at 00:03 there was a spike in swap usage; no other metric seems to show a spike at that time.

Event Timeline

Volans created this task. · Jan 21 2017, 1:38 AM
Restricted Application added a subscriber: Aklapper. · Jan 21 2017, 1:38 AM

Mentioned in SAL (#wikimedia-operations) [2017-01-26T15:37:29Z] <godog> bounce uwsgi on graphite1003 with less workers - T155872
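"Bounce uwsgi with less workers" amounts to lowering the worker-process count in the graphite-web uwsgi app config; fewer workers means a smaller resident memory footprint at the cost of request concurrency. A hedged sketch (the path and the value chosen are illustrative, not taken from the task; `processes` is uwsgi's option name, with `workers` as an alias):

```ini
; /etc/uwsgi/apps-enabled/graphite-web.ini  (path illustrative)
[uwsgi]
; Fewer worker processes -> lower resident memory; the actual
; value used on graphite1003 is not recorded in this task.
processes = 4
```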

I've tracked this down to expensive queries on graphite1003 making carbon-cache explode in memory: namely, cassandra-related 99percentile SSTablesPerReadHistogram queries across all column families and all instances, which can generate >100 MB responses in pickle data. This is sort of related to T116767: limit the impact of heavy/large graphite queries, and I'm adding this case to it too.
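A back-of-the-envelope estimate shows how a wildcard query over all instances and column families reaches that size. The per-point byte cost below is a crude assumption (graphite pickles each datapoint as roughly a timestamp/value pair); the series and retention numbers are illustrative, not taken from the task:

```python
def pickle_response_mb(series: int, points_per_series: int,
                       bytes_per_point: int = 9) -> float:
    """Rough size of a graphite pickle response.

    bytes_per_point ~9 is a crude assumption for a pickled
    (timestamp, value) datapoint; real overhead varies.
    """
    return series * points_per_series * bytes_per_point / 1e6

# e.g. 2000 matching series, one week at 1-minute resolution:
print(pickle_response_mb(2000, 7 * 24 * 60))  # ~181 MB
```

At that scale, a single dashboard refresh fanning out several such queries easily explains a memory spike on a 64 GB host already running near capacity.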

See also the relationship between queries and cache size. In theory carbon-cache supports limiting its cache size; in practice, last time I tried it, carbon-cache would spin out of control in CPU usage. It is possible newer graphite versions have fixed this, though.
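For reference, the cache limit in question is carbon's `MAX_CACHE_SIZE` setting in `carbon.conf`, which bounds the number of datapoints held in memory per cache instance (the value below is illustrative; the default is unbounded):

```ini
# carbon.conf -- [cache] section (value illustrative)
[cache]
# Default is inf (unbounded). Bounding it caps carbon-cache memory,
# but as noted above, older versions could spin in CPU once the
# cache hit the limit and started rejecting/blocking writes.
MAX_CACHE_SIZE = 10000000
```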

Merging as a duplicate of T116767; we can follow up there, as heavy queries were the root cause anyway.