
Diamond collected metrics about memory usage inaccurate until third reboot
Closed, Resolved · Public

Description

I thought the new instances were going crazy, but I couldn't find any process using a lot of memory.

Turns out it takes a couple of reboots before the metrics are reliable.

For comparison:

integration-slave1401, 1402, and 1404 were rebooted once last week (for unrelated reasons):

screen.png (994×1 px, 300 KB)

integration-slave1403 and slave1405, however, were already working fine, and their free and top output is nearly identical to that of slave1402. Yet their alleged memory usage in graphite was huge. I checked that the graph wasn't stuck or flat; it was still reacting to processes. Its reporting is just off by 8.5 GB.

The total didn't even add up: it was reporting 9.5 GB of memory usage on an instance with only 8 GB of total memory.

Screen_Shot_2015-03-03_at_01.50.48.png (996×1 px, 249 KB)
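
For reference, this is roughly the cross-check I did on the instances themselves (a sketch only; it assumes the usual "used = MemTotal - MemFree - Buffers - Cached" reading of /proc/meminfo, which may not match exactly what diamond's memory collector reports):

```python
# Rough cross-check of what the instance itself reports in /proc/meminfo,
# to compare against the numbers shown in graphite. Assumes the conventional
# "used = MemTotal - MemFree - Buffers - Cached" calculation; diamond's
# memory collector may report the raw fields instead.

def read_meminfo(path='/proc/meminfo'):
    """Parse /proc/meminfo into a dict of {field: kilobytes}."""
    info = {}
    with open(path) as f:
        for line in f:
            key, value = line.split(':', 1)
            info[key] = int(value.strip().split()[0])
    return info

meminfo = read_meminfo()
total_gb = meminfo['MemTotal'] / 1024.0 / 1024.0
used_kb = (meminfo['MemTotal'] - meminfo['MemFree']
           - meminfo.get('Buffers', 0) - meminfo.get('Cached', 0))
used_gb = used_kb / 1024.0 / 1024.0

print('MemTotal: %.1f GB' % total_gb)
print('Used (excluding buffers/cache): %.1f GB' % used_gb)
# free/top on the instance agree with this and stay well below 8 GB,
# while the graphed metric claimed ~9.5 GB used.
```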

Event Timeline

Krinkle raised the priority of this task to Needs Triage.
Krinkle updated the task description.
Krinkle added projects: Cloud-Services, Cloud-VPS.
Krinkle subscribed.
Krinkle renamed this task from "Diamond collected metrics about memory inaccurate until second reboot" to "Diamond collected metrics about memory usage inaccurate until third reboot". (Mar 3 2015, 12:52 AM)
Krinkle set Security to None.
Krinkle updated the task description.

A better example from the new integration-slave-trusty-1010:

graphite.wmflabs.png (250×800 px, 18 KB)

The first boot is fine. Then, after a necessary reboot, the reported usage climbs until it hits a flat high line (presumably it stops reporting and graphite/diamond just echoes the last known point). After the third reboot it returns to normal, fluctuating levels.

Looking at the atop history, there is nothing suspicious. I guess it is an oddity in diamond / graphite. I suspect diamond stopped collecting / reporting the metric for a while, in which case graphite would reuse the last known value instead of marking it NULL.

Looking at the raw metric by passing &format=json shows that the metric values are changing. Maybe statsd is replaying the same data over and over.
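
To illustrate the kind of check I mean, here is a rough sketch of pulling the raw datapoints from the render API (the graphite host and metric path below are placeholders, not the actual labs ones):

```python
# Sketch of inspecting the raw datapoints behind a graphite graph via the
# render API's format=json output. Host and metric path are placeholders.
import json
from urllib.request import urlopen

GRAPHITE = 'https://graphite.example.org/render'
TARGET = 'servers.integration-slave-trusty-1010.memory.MemFree'

url = '%s?target=%s&from=-2hours&format=json' % (GRAPHITE, TARGET)
with urlopen(url) as resp:
    series = json.loads(resp.read().decode('utf-8'))

for s in series:
    points = s['datapoints']  # list of [value, timestamp] pairs
    values = [v for v, ts in points if v is not None]
    nulls = len(points) - len(values)
    print('%s: %d points, %d nulls, %d distinct values'
          % (s['target'], len(points), nulls, len(set(values))))
    # A healthy collector yields many distinct values and few nulls; a stuck
    # one yields a single repeated value (the flat line) or long runs of nulls.
```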

I rebooted the slaves and all metrics look fine.

hashar claimed this task.