
Diamond collected metrics about memory usage inaccurate until third reboot
Closed, Resolved · Public

Description

I thought the new instances were going crazy, but I couldn't find any process using a lot of memory.

It turns out it takes a couple of reboots before the metrics are reliable.

For comparison:

integration-slave1401, 1402, and 1404 were rebooted once last week (for unrelated reasons).

integration-slave1403 and slave1405, however, were already working fine, and their free and top output is nearly identical to that of slave1402. Yet their alleged memory usage was huge in graphite. I checked that the graph wasn't stuck or flat; it was still reacting to process activity. Its reporting is simply off by 8.5GB.

The numbers didn't even add up: it was reporting 9.5GB of memory usage on an instance with only 8GB of total memory.
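For reference, a minimal sketch of this kind of cross-check, reading /proc/meminfo (the same source free and top draw from). This is not Diamond's own collector code, just the usual used = total - free - buffers - cached arithmetic:

```
#!/usr/bin/env python3
# Sanity-check sketch: compute used memory from /proc/meminfo, the same
# source free(1) and top(1) read, to compare against what Diamond reports
# into graphite. Not Diamond's collector; only an illustration.

def meminfo_kb():
    """Parse /proc/meminfo into a dict of values in kB."""
    info = {}
    with open('/proc/meminfo') as f:
        for line in f:
            key, rest = line.split(':', 1)
            info[key] = int(rest.strip().split()[0])
    return info

m = meminfo_kb()
# Memory actually used by processes, excluding buffers and page cache.
used_kb = m['MemTotal'] - m['MemFree'] - m['Buffers'] - m['Cached']
print('total: %.1f GB' % (m['MemTotal'] / 1024.0 / 1024.0))
print('used (excl. buffers/cache): %.1f GB' % (used_kb / 1024.0 / 1024.0))
```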

Event Timeline

Krinkle created this task. · Mar 3 2015, 12:52 AM
Krinkle raised the priority of this task from to Needs Triage.
Krinkle updated the task description. (Show Details)
Krinkle added projects: Cloud-Services, Cloud-VPS.
Krinkle added a subscriber: Krinkle.
Restricted Application added a subscriber: Aklapper. · View Herald Transcript · Mar 3 2015, 12:52 AM
Krinkle renamed this task from Diamond collected metrics about memory inaccurate until second reboot to Diamond collected metrics about memory usage inaccurate until third reboot. · Mar 3 2015, 12:52 AM
Krinkle set Security to None.
Krinkle updated the task description. (Show Details)

A better example from the new integration-slave-trusty-1010:

The first boot is fine. Then, after a necessary second reboot, the usage climbs until it hits a flat high line (presumably it stops reporting and graphite/diamond just echoes the last known point). After the third reboot it returns to normal, fluctuating levels.

hashar added a subscriber: hashar. · Apr 9 2015, 9:41 AM

Looking at atop history, there is nothing suspicious. I guess this is an oddity in diamond / graphite. I suspect diamond stopped collecting / reporting metrics for a while, and graphite then reused the last known value instead of marking it NULL.

Looking at the raw metric by passing &format=json shows that the metric values are changing. Maybe statsd is replaying the same data over and over.
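A sketch of that &format=json check against the Graphite render API, counting how many datapoints are null or repeated. The Graphite host and metric path below are placeholders, not the actual Labs target:

```
#!/usr/bin/env python3
# Fetch raw datapoints for one metric from the Graphite render API and
# report how many are null and how many distinct values there are, to
# tell "genuinely changing" apart from "last value echoed / gaps".
import json
from urllib.request import urlopen

GRAPHITE = 'https://graphite.example.org/render'                  # placeholder host
TARGET = 'servers.integration-slave-trusty-1010.memory.MemFree'   # hypothetical metric path

url = '%s?target=%s&from=-2hours&format=json' % (GRAPHITE, TARGET)
series = json.loads(urlopen(url).read().decode('utf-8'))

for s in series:
    values = [value for value, timestamp in s['datapoints']]
    non_null = [v for v in values if v is not None]
    print(s['target'])
    print('  points   : %d (%d null)' % (len(values), values.count(None)))
    print('  distinct : %d' % len(set(non_null)))
```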

Krinkle triaged this task as Low priority. · Apr 14 2015, 2:24 PM

I rebooted the slaves and all metrics look fine.

hashar closed this task as Resolved. · Jul 27 2015, 2:44 PM
hashar claimed this task.