
graphite1003 short of available RAM
Closed, Duplicate · Public


Today I had to manually restart one of the carbon-cache processes on graphite1003 because it was killed by the oom-killer.

Seems that graphite1003 is quite short on memory:

$ free -m
             total       used       free     shared    buffers     cached
Mem:         64267      64039        228       1347          1       2732
-/+ buffers/cache:      61305       2962
Swap:          255        249          6

Most of the memory is, of course, used by the carbon-cache and uwsgi-graphite-web processes.
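The `-/+ buffers/cache` row in the `free -m` output above is simple arithmetic on the first row: buffers and page cache are reclaimable, so they are subtracted from the raw "used" figure to get the memory actually held by applications. A minimal sketch (the small off-by-one versus the output above is `free`'s MB rounding):

```python
def app_used_mb(used_mb: int, buffers_mb: int, cached_mb: int) -> int:
    """The '-/+ buffers/cache' used figure: raw 'used' minus
    reclaimable buffers and page cache."""
    return used_mb - buffers_mb - cached_mb

# Values from the free -m output above:
# 64039 - 1 - 2732 = 61306 MB (free shows 61305 due to rounding)
print(app_used_mb(64039, 1, 2732))
```

With almost nothing reclaimable and swap nearly exhausted, any further allocation by carbon-cache is a likely oom-killer trigger.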

From Grafana, at 00:03 there was a spike in swap usage; no other metric seems to show a spike at that time.

Event Timeline

Volans created this task. · Jan 21 2017, 1:38 AM
Restricted Application added a subscriber: Aklapper. · Jan 21 2017, 1:38 AM

Mentioned in SAL (#wikimedia-operations) [2017-01-26T15:37:29Z] <godog> bounce uwsgi on graphite1003 with less workers - T155872
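"Bounce uwsgi with less workers" amounts to lowering the worker-process count in the graphite-web uwsgi app config; fewer workers means a smaller resident memory footprint at the cost of request concurrency. A hedged sketch (the path and the value chosen are illustrative, not taken from the task; `processes` is uwsgi's option name, with `workers` as an alias):

```ini
; /etc/uwsgi/apps-enabled/graphite-web.ini  (path illustrative)
[uwsgi]
; Fewer worker processes -> lower resident memory; the actual
; value used on graphite1003 is not recorded in this task.
processes = 4
```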

I've tracked this down to expensive queries on graphite1003 making carbon-cache explode in memory: namely, cassandra-related 99percentile SSTablesPerReadHistogram queries across all column families and all instances, which can generate >100 MB responses in pickle data. This is sort of related to T116767: limit the impact of heavy/large graphite queries, and I'm adding this case to it too.
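A back-of-the-envelope estimate shows how a wildcard query over all instances and column families reaches that size. The per-point byte cost below is a crude assumption (graphite pickles each datapoint as roughly a timestamp/value pair); the series and retention numbers are illustrative, not taken from the task:

```python
def pickle_response_mb(series: int, points_per_series: int,
                       bytes_per_point: int = 9) -> float:
    """Rough size of a graphite pickle response.

    bytes_per_point ~9 is a crude assumption for a pickled
    (timestamp, value) datapoint; real overhead varies.
    """
    return series * points_per_series * bytes_per_point / 1e6

# e.g. 2000 matching series, one week at 1-minute resolution:
print(pickle_response_mb(2000, 7 * 24 * 60))  # ~181 MB
```

At that scale, a single dashboard refresh fanning out several such queries easily explains a memory spike on a 64 GB host already running near capacity.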

See also the relationship between queries and cache size. In theory carbon-cache supports limiting its cache size; in practice, last time I tried it, carbon-cache would spin out of control in CPU usage. It is possible newer graphite versions have fixed this, though.
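For reference, the cache limit in question is carbon's `MAX_CACHE_SIZE` setting in `carbon.conf`, which bounds the number of datapoints held in memory per cache instance (the value below is illustrative; the default is unbounded):

```ini
# carbon.conf -- [cache] section (value illustrative)
[cache]
# Default is inf (unbounded). Bounding it caps carbon-cache memory,
# but as noted above, older versions could spin in CPU once the
# cache hit the limit and started rejecting/blocking writes.
MAX_CACHE_SIZE = 10000000
```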

Merging as a duplicate of T116767; we can follow up there, as heavy queries were the root cause anyway.