
prometheus -> grafana stats for per-numa-node meminfo
Closed, Resolved · Public

Description

Can we get data into prometheus for /sys/devices/system/node/node*/meminfo? This is similar to (but not quite identical to!) /proc/meminfo, but broken out per NUMA node. On machines with numa_networking = isolate config, this level of detail becomes necessary to make sense of memory pressure issues. Seems like someone's done the hard bits at https://github.com/prometheus/node_exporter/blob/master/collector/meminfo_numa_linux.go .

Example from cp4021 (note that Node0 has 256GB and uses almost all of it for active malloc'd memory, while Node1 has only 128GB and uses it mostly for disk cache):

root@cp4021:~# cat /sys/devices/system/node/node0/meminfo 
Node 0 MemTotal:       264076160 kB
Node 0 MemFree:         3563420 kB
Node 0 MemUsed:        260512740 kB
Node 0 Active:         256613464 kB
Node 0 Inactive:        2187088 kB
Node 0 Active(anon):   250546696 kB
Node 0 Inactive(anon):    27468 kB
Node 0 Active(file):    6066768 kB
Node 0 Inactive(file):  2159620 kB
Node 0 Unevictable:           0 kB
Node 0 Mlocked:               0 kB
Node 0 Dirty:                56 kB
Node 0 Writeback:             0 kB
Node 0 FilePages:       9442920 kB
Node 0 Mapped:          9149340 kB
Node 0 AnonPages:      249357760 kB
Node 0 Shmem:           1216532 kB
Node 0 KernelStack:       14480 kB
Node 0 PageTables:       711808 kB
Node 0 NFS_Unstable:          0 kB
Node 0 Bounce:                0 kB
Node 0 WritebackTmp:          0 kB
Node 0 Slab:             241800 kB
Node 0 SReclaimable:      27656 kB
Node 0 SUnreclaim:       214144 kB
Node 0 AnonHugePages:         0 kB
Node 0 ShmemHugePages:        0 kB
Node 0 ShmemPmdMapped:        0 kB
Node 0 HugePages_Total:     0
Node 0 HugePages_Free:      0
Node 0 HugePages_Surp:      0
root@cp4021:~# cat /sys/devices/system/node/node1/meminfo 
Node 1 MemTotal:       132105144 kB
Node 1 MemFree:          874332 kB
Node 1 MemUsed:        131230812 kB
Node 1 Active:         118972632 kB
Node 1 Inactive:        4037628 kB
Node 1 Active(anon):    8783376 kB
Node 1 Inactive(anon):   291344 kB
Node 1 Active(file):   110189256 kB
Node 1 Inactive(file):  3746284 kB
Node 1 Unevictable:        7932 kB
Node 1 Mlocked:            7932 kB
Node 1 Dirty:             18076 kB
Node 1 Writeback:             0 kB
Node 1 FilePages:      114680792 kB
Node 1 Mapped:         113849104 kB
Node 1 AnonPages:       8337800 kB
Node 1 Shmem:            741532 kB
Node 1 KernelStack:       14648 kB
Node 1 PageTables:      2876748 kB
Node 1 NFS_Unstable:          0 kB
Node 1 Bounce:                0 kB
Node 1 WritebackTmp:          0 kB
Node 1 Slab:            4794484 kB
Node 1 SReclaimable:    4710160 kB
Node 1 SUnreclaim:        84324 kB
Node 1 AnonHugePages:         0 kB
Node 1 ShmemHugePages:        0 kB
Node 1 ShmemPmdMapped:        0 kB
Node 1 HugePages_Total:     0
Node 1 HugePages_Free:      0
Node 1 HugePages_Surp:      0
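
For reference, the meminfo_numa collector linked above exports each of these fields as a gauge with a "node" label, converting kB values to bytes. The metric names below are an assumption based on the collector source (recent node_exporter releases append a _bytes suffix), so verify against the running exporter version; the sample values are the Node 0/Node 1 figures above converted to bytes:

node_memory_numa_MemTotal{node="0"} 2.7041398784e+11
node_memory_numa_MemFree{node="0"} 3.64894208e+09
node_memory_numa_Active{node="1"} 1.21827975168e+11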

Event Timeline

Restricted Application added a subscriber: Aklapper.
ema triaged this task as Medium priority. Sep 12 2017, 8:42 AM
ema moved this task from Backlog to General on the Traffic board.

AFAICT the meminfo_numa collector was introduced in node_exporter 0.13, so it is already available across the fleet (we are running 0.14). It should be enough to add meminfo_numa to the prometheus::node_exporter::collectors_extra hiera value where needed.
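
A minimal sketch of that hiera change, assuming the key takes a plain list of collector names (the file path here is hypothetical, not taken from the actual patch):

# hieradata/role/common/cache/text.yaml (hypothetical path)
prometheus::node_exporter::collectors_extra:
  - meminfo_numa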

Change 377443 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] Add NUMA meminfo stats for cache lvs nodes

https://gerrit.wikimedia.org/r/377443

Change 377443 merged by BBlack:
[operations/puppet@production] Add NUMA meminfo stats for cache lvs nodes

https://gerrit.wikimedia.org/r/377443

@BBlack your patch to add meminfo_numa seems to be working! Anything left to do?

yeah, put it somewhere useful in grafana :)
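
As a starting point for such a Grafana panel, a hedged PromQL sketch, assuming the 0.14-era metric names used in the example above (newer releases rename these with a _bytes suffix) and an illustrative instance label:

# Used memory per NUMA node, one series per node
node_memory_numa_MemUsed{instance="cp4021:9100"}

# Share of each node's memory going to page cache vs. anonymous (malloc'd) pages
node_memory_numa_FilePages / node_memory_numa_MemTotal
node_memory_numa_AnonPages / node_memory_numa_MemTotal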

BBlack lowered the priority of this task from Medium to Low. Oct 2 2017, 3:52 PM

The swap of Traffic for Traffic-Icebox in this ticket's set of tags was based on a bulk action for all such tickets that haven't been updated in 6 months or more. This does not imply any human judgement about the validity or importance of the task, and is simply the first step in a larger task cleanup effort. Further manual triage and/or requests for updates will happen this month for all such tickets. For more detail, have a look at the extended explanation on the main page of Traffic-Icebox. Thank you!

BCornwall claimed this task.