We want to collect per-cgroup resource usage statistics; this is already achieved by cadvisor on ~23% of the fleet (cp + mw hosts).
We're looking at expanding the deployment fleetwide, enabling for example queries like "what are the top 5 (memory|cpu|etc) users, either in the fleet or in a specific cluster?".
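To illustrate the kind of query this enables, here's a minimal sketch against the Prometheus HTTP API; the endpoint URL and the cluster/instance/id label names are assumptions that depend on our scrape configuration:

# Top 5 cgroup memory (RSS) users within a hypothetical "mediawiki" cluster;
# endpoint and label names are illustrative only.
curl -sG 'http://prometheus.svc.eqiad.wmnet/ops/api/v1/query' \
  --data-urlencode 'query=topk(5, sum by (instance, id) (container_memory_rss{cluster="mediawiki"}))'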
Considerations, moving parts, etc., in no particular order:
- cadvisor seems to be working fine for our purposes, even though we had to update our version internally (cf. T325557) and the version has diverged between bullseye, buster and official Debian. OTOH some metrics are quite fine-grained, to the point of leading to significant cardinality (e.g. per-CPU × systemd units for CPU-related metrics).
- There will be significantly more load on Prometheus ops eqiad/codfw.
- Quick napkin math for eqiad: ~4k new metrics per host with the default accounting (the exact number depends on how many systemd units are running and how many cores the host has), times ~900 hosts in eqiad that don't already run cadvisor, gives ~3.7M new metrics to be collected. Scraped once a minute, that's ~62k new samples/s. Prometheus ops in eqiad currently ingests ~146k samples/s, so this would be a ~42% increase in samples ingested, with a comparable increase in disk space.
- PoPs are fine though, since cp hosts already run cadvisor; the increase in Prometheus load there will be marginal.
- Block IO and IP accounting are disabled by default in systemd (tasks/memory/cpu accounting are enabled by default on buster and later). We can consider enabling additional accounting either for specific roles or fleetwide, either pre- or post-rollout; a sketch of what that could look like follows this list.
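A minimal sketch of enabling the extra accounting globally via a systemd drop-in, assuming it is applied by hand (in practice this would presumably be managed through Puppet; the drop-in path is just an example):

# Turn on block IO and IP accounting for all units by default.
sudo mkdir -p /etc/systemd/system.conf.d
printf '[Manager]\nDefaultIOAccounting=yes\nDefaultIPAccounting=yes\n' \
  | sudo tee /etc/systemd/system.conf.d/accounting.conf
# Re-execute the manager so the new defaults are picked up; already-running
# units only start accounting once they are restarted.
sudo systemctl daemon-reexec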
cadvisor version
Our version/build of cadvisor 0.44 seems to run well on both Buster and Bullseye; therefore we should deploy a single version (i.e. upgrade the existing 0.35.0 hosts to 0.44).
Metric cardinality
With respect to 2.A, here's a breakdown of cardinality with default accounting on Buster and cadvisor 0.35:
mw1456:~$ curl -s localhost:4194/metrics | grep -v ^# | cut -d\{ -f1 | uniq -c | sort -rn | awk '$1 > 10 {print}'
   1552 container_cpu_usage_seconds_total
    410 container_tasks_state
    328 container_memory_failures_total
     82 container_start_time_seconds
     82 container_spec_memory_swap_limit_bytes
     82 container_spec_memory_reservation_limit_bytes
     82 container_spec_memory_limit_bytes
     82 container_memory_working_set_bytes
     82 container_memory_usage_bytes
     82 container_memory_swap
     82 container_memory_rss
     82 container_memory_max_usage_bytes
     82 container_memory_mapped_file
     82 container_memory_failcnt
     82 container_memory_cache
     82 container_last_seen
     82 container_cpu_user_seconds_total
     82 container_cpu_system_seconds_total
     82 container_cpu_load_average_10s
     72 container_spec_cpu_shares
     72 container_spec_cpu_period
Whereas with all accounting enabled, on Bullseye and cadvisor 0.44:
cp1075:~$ curl -s localhost:4194/metrics | grep -v ^# | cut -d\{ -f1 | uniq -c | sort -rn | awk '$1 > 10 {print}'
    440 container_tasks_state
    352 container_memory_failures_total
    302 container_blkio_device_usage_total
    106 container_fs_writes_total
    106 container_fs_reads_total
    102 container_fs_writes_bytes_total
    102 container_fs_reads_bytes_total
     88 container_start_time_seconds
     88 container_spec_memory_swap_limit_bytes
     88 container_spec_memory_reservation_limit_bytes
     88 container_spec_memory_limit_bytes
     88 container_spec_cpu_shares
     88 container_spec_cpu_period
     88 container_oom_events_total
     88 container_memory_working_set_bytes
     88 container_memory_usage_bytes
     88 container_memory_swap
     88 container_memory_rss
     88 container_memory_max_usage_bytes
     88 container_memory_mapped_file
     88 container_memory_failcnt
     88 container_memory_cache
     88 container_last_seen
     88 container_cpu_user_seconds_total
     88 container_cpu_system_seconds_total
     88 container_cpu_load_average_10s
     58 container_cpu_usage_seconds_total
We'll have to work on identifying what can realistically be disabled from the get-go; see the sketch below for a starting point.
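A minimal sketch of a trimmed-down cadvisor invocation; the flag spelling and the final list of disabled metric groups are assumptions to be validated against cadvisor 0.44:

# Disable the per-CPU and CPU-load metric groups at the source (flags illustrative).
/usr/bin/cadvisor -port 4194 -disable_metrics=percpu,cpuLoad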
Plan of action
- Run a single cadvisor version (0.44.0) fleetwide
- Trim down the default enabled metrics to reduce cardinality (at least percpu and cpuLoad; see also cadvisor's Prometheus metrics documentation)
- Further trim metrics we are not going to use by discarding them at ingestion time (see the relabelling sketch after this list)
- Evaluate whether we can ingest the new metrics in eqiad/codfw on a 100% rollout
- Complete rollout in PoPs
- Gradual rollout in eqiad/codfw (e.g. percentage-based)
- Evaluate whether we want to enable IO/IP accounting, and on what basis: https://phabricator.wikimedia.org/T345078
- Update docs/wikitech with cadvisor info and similar https://wikitech.wikimedia.org/wiki/Cadvisor
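For the ingestion-time trimming mentioned in the plan above, a minimal sketch of a Prometheus metric_relabel_configs fragment to attach to the cadvisor scrape job; the regex and the exact set of dropped metrics are placeholders:

# Drop series we don't plan to use before they are ingested (regex illustrative).
metric_relabel_configs:
  - source_labels: [__name__]
    regex: 'container_(tasks_state|memory_failures_total|spec_.*)'
    action: drop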