Collect per-cgroup cpu/mem and other system level metrics
Closed, Resolved | Public

Description

We want to collect per-cgroup resource usage statistics. This is already achieved by cadvisor on ~23% of the fleet (cp + mw hosts).

We're looking at expanding the deployment fleetwide, to enable for example queries like "what are the top 5 (memory|cpu|etc) users, either in the fleet or in a specific cluster?"

Considerations, moving parts, etc. in no particular order:

  1. cadvisor seems to be working fine for our purposes, even though we had to update our version internally (cf. T325557) and the version has diverged between bullseye, buster and official Debian. OTOH some metrics are quite fine-grained, to the point of leading to significant cardinality (e.g. per-cpu * systemd units for CPU-related metrics).
  2. There will be significantly more load on Prometheus ops eqiad/codfw.
    1. Quick napkin math for eqiad: ~4k new metrics per host with the default accounting (this depends on how many systemd units are running and how many cores the host has!), times ~900 hosts in eqiad that don't already run cadvisor, is ~3.7M new metrics to collect. Scraped once a minute, that's ~62k new samples/s. Prometheus ops in eqiad now ingests ~146k samples/s, so that's a ~42% increase in samples ingested, with a comparable figure for disk space increase.
    2. PoPs are fine though: cp hosts already run cadvisor, so the increase in Prometheus load will be marginal there.
  3. Block IO and IP accounting are disabled by default in systemd (tasks/memory/cpu accounting is enabled by default on >= buster). We can consider enabling additional accounting either for specific roles or fleetwide, either pre or post rollout.
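
The napkin math in 2.1 can be reproduced with a few lines of Python. This is only a sketch: the inputs are the rough figures quoted above (~3.7M new metrics, one scrape per minute, ~146k samples/s currently ingested), and the function name is made up for illustration.

```python
def added_ingest_load(new_series, scrape_interval_s, current_samples_per_s):
    """Return (extra samples/s, relative increase) for a set of new series."""
    # Each series yields one sample per scrape.
    extra = new_series / scrape_interval_s
    return extra, extra / current_samples_per_s


# Rough figures from the estimate above (all approximate).
extra, increase = added_ingest_load(
    new_series=3_700_000,           # the ~3.7M figure quoted above
    scrape_interval_s=60,           # "once a minute"
    current_samples_per_s=146_000,  # current eqiad ingestion rate
)
print(f"{extra / 1000:.0f}k samples/s extra, {increase:.0%} increase")
```

Plugging in a different per-host metric count or host count shows how sensitive the estimate is to trimming cardinality before the rollout.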

cadvisor version

Our version/build of cadvisor 0.44 seems to run well on both Buster and Bullseye, therefore we should deploy a single version (i.e. upgrade existing 0.35.0 hosts to 0.44).

Metric cardinality

With respect to 2.1, here's a breakdown of cardinality with default accounting on Buster and cadvisor 0.35:

mw1456:~$ curl -s localhost:4194/metrics | grep -v ^# | cut -d\{ -f1 | uniq -c | sort -rn | awk '$1 > 10 {print}'
   1552 container_cpu_usage_seconds_total
    410 container_tasks_state
    328 container_memory_failures_total
     82 container_start_time_seconds
     82 container_spec_memory_swap_limit_bytes
     82 container_spec_memory_reservation_limit_bytes
     82 container_spec_memory_limit_bytes
     82 container_memory_working_set_bytes
     82 container_memory_usage_bytes
     82 container_memory_swap
     82 container_memory_rss
     82 container_memory_max_usage_bytes
     82 container_memory_mapped_file
     82 container_memory_failcnt
     82 container_memory_cache
     82 container_last_seen
     82 container_cpu_user_seconds_total
     82 container_cpu_system_seconds_total
     82 container_cpu_load_average_10s
     72 container_spec_cpu_shares
     72 container_spec_cpu_period

Whereas with all accounting enabled, on Bullseye and cadvisor 0.44:

cp1075:~$ curl -s localhost:4194/metrics | grep -v ^# | cut -d\{ -f1 | uniq -c | sort -rn | awk '$1 > 10 {print}'
    440 container_tasks_state
    352 container_memory_failures_total
    302 container_blkio_device_usage_total
    106 container_fs_writes_total
    106 container_fs_reads_total
    102 container_fs_writes_bytes_total
    102 container_fs_reads_bytes_total
     88 container_start_time_seconds
     88 container_spec_memory_swap_limit_bytes
     88 container_spec_memory_reservation_limit_bytes
     88 container_spec_memory_limit_bytes
     88 container_spec_cpu_shares
     88 container_spec_cpu_period
     88 container_oom_events_total
     88 container_memory_working_set_bytes
     88 container_memory_usage_bytes
     88 container_memory_swap
     88 container_memory_rss
     88 container_memory_max_usage_bytes
     88 container_memory_mapped_file
     88 container_memory_failcnt
     88 container_memory_cache
     88 container_last_seen
     88 container_cpu_user_seconds_total
     88 container_cpu_system_seconds_total
     88 container_cpu_load_average_10s
     58 container_cpu_usage_seconds_total

We'll have to work on identifying what can realistically be disabled from the get-go.
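
For hosts where the shell pipeline is inconvenient, the same per-metric breakdown can be sketched in Python. This is a minimal sketch, assuming the input is Prometheus text exposition output such as what `curl -s localhost:4194/metrics` returns; the function name and the tiny sample below are made up for illustration and are not real cadvisor output.

```python
from collections import Counter


def series_per_metric(exposition_text):
    """Count series per metric name in Prometheus text exposition output."""
    counts = Counter()
    for line in exposition_text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comments.
        if not line or line.startswith("#"):
            continue
        # The metric name ends at the first "{" (labels) or space (value).
        name = line.split("{", 1)[0].split(" ", 1)[0]
        counts[name] += 1
    return counts


# Tiny made-up sample, for illustration only.
sample = """\
# HELP container_memory_rss Size of RSS in bytes.
# TYPE container_memory_rss gauge
container_memory_rss{id="/"} 1024
container_memory_rss{id="/system.slice"} 512
container_last_seen{id="/"} 1685000000
"""

for name, count in series_per_metric(sample).most_common():
    print(f"{count:7d} {name}")
```

Unlike `uniq -c`, the `Counter` does not depend on all series of one metric being adjacent in the output.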

Plan of action

  • Run a single cadvisor version (0.44.0) fleetwide
  • Trim down default enabled metrics to reduce cardinality (percpu and cpuLoad at least, see also cadvisor Prometheus metrics)
  • Further trim metrics we are not going to use by discarding them at ingestion time
  • Evaluate whether we can ingest the new metrics in eqiad/codfw on a 100% rollout
  • Complete rollout in PoPs
  • Gradual rollout in eqiad/codfw (e.g. percentage-based)
  • Evaluate whether we want to enable IO/IP accounting and on what basis (https://phabricator.wikimedia.org/T345078)
  • Update docs/wikitech with cadvisor info and similar (https://wikitech.wikimedia.org/wiki/Cadvisor)
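
To get a feel for what discarding metrics at ingestion time would save, a drop rule can be tried against per-metric series counts before touching the Prometheus configuration. The sketch below is hedged: the counts are a few per-host figures from the cp1075 listing above, while the function name and the drop regex are hypothetical and are not the rule that was actually deployed.

```python
import re


def split_by_drop_rule(series_counts, drop_regex):
    """Split per-metric series counts into (kept, dropped) by metric name."""
    # Prometheus relabelling regexes are fully anchored, hence fullmatch().
    pattern = re.compile(drop_regex)
    kept, dropped = {}, {}
    for name, count in series_counts.items():
        target = dropped if pattern.fullmatch(name) else kept
        target[name] = count
    return kept, dropped


# A few per-host counts taken from the cp1075 listing above.
counts = {
    "container_tasks_state": 440,
    "container_memory_failures_total": 352,
    "container_blkio_device_usage_total": 302,
    "container_cpu_usage_seconds_total": 58,
}
# Hypothetical drop rule, for illustration only.
kept, dropped = split_by_drop_rule(
    counts, r"container_(tasks_state|memory_failures_total)"
)
print(f"kept {sum(kept.values())} series, dropped {sum(dropped.values())}")
```

Because the match is anchored, a rule such as `container_memory` on its own would drop nothing here; it would need a trailing `.*` to cover the `container_memory_*` metrics.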

Details

Subject  Repo                     Branch      Lines +/-
         operations/puppet        production  +0 -9
         operations/puppet        production  +0 -3
         operations/puppet        production  +7 -1
         operations/puppet        production  +15 -7
         operations/puppet        production  +13 -0
         operations/puppet        production  +1 -8
         operations/puppet        production  +1 -1
         operations/puppet        production  +1 -1
         operations/puppet        production  +1 -1
         operations/puppet        production  +13 -1
         operations/puppet        production  +1 -0
         operations/puppet        production  +4 -0
         operations/puppet        production  +6 -2
         operations/homer/public  master      +3 -2
         operations/puppet        production  +1 -1
         operations/puppet        production  +4 -0
         operations/puppet        production  +0 -2
         operations/puppet        production  +52 -1
         operations/puppet        production  +12 -15
         operations/puppet        production  +60 -0
         operations/puppet        production  +311 -0
Event Timeline

Change 224093 had a related patch set uploaded (by Filippo Giunchedi):
diamond: add upstart/systemd service stats

https://gerrit.wikimedia.org/r/224093

Change 224094 had a related patch set uploaded (by Filippo Giunchedi):
diamond: service stats puppet integration

https://gerrit.wikimedia.org/r/224094

Change 224093 merged by Filippo Giunchedi:
diamond: add upstart/systemd service stats

https://gerrit.wikimedia.org/r/224093

Change 224094 merged by Filippo Giunchedi:
diamond: service stats puppet integration

https://gerrit.wikimedia.org/r/224094

fgiunchedi renamed this task from collect per-service cpu/mem and other system level metrics to Collect per-cgroup cpu/mem and other system level metrics. Dec 21 2017, 11:21 AM
fgiunchedi removed a project: Patch-For-Review.
fgiunchedi updated the task description.
fgiunchedi added a project: User-fgiunchedi.

In a Prometheus world, cadvisor seems to be doing what we want (i.e. exporting cgroup statistics, including systemd cgroups).

After enabling the following systemd options the results are available at https://phabricator.wikimedia.org/P6593

# grep -ir account /etc/systemd/system.conf 
DefaultCPUAccounting=yes
DefaultIOAccounting=yes
DefaultBlockIOAccounting=yes
DefaultMemoryAccounting=yes
DefaultTasksAccounting=yes

Reopening to track the implementation/deployment of this fleetwide

Change 908215 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] Rename cadvisor_exporter to cadvisor

https://gerrit.wikimedia.org/r/908215

Change 908215 merged by Filippo Giunchedi:

[operations/puppet@production] Rename cadvisor_exporter to cadvisor

https://gerrit.wikimedia.org/r/908215

Change 920660 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] cadvisor: add explicity metrics enable

https://gerrit.wikimedia.org/r/920660

Change 920661 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] cadvisor: disable percpu and cpuLoad metrics

https://gerrit.wikimedia.org/r/920661

Change 920660 merged by Filippo Giunchedi:

[operations/puppet@production] cadvisor: add explicity metrics enable

https://gerrit.wikimedia.org/r/920660

Change 920991 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] profile: rollout cadvisor to PoPs

https://gerrit.wikimedia.org/r/920991

Change 920661 merged by Filippo Giunchedi:

[operations/puppet@production] cadvisor: disable percpu and cpuLoad metric classes

https://gerrit.wikimedia.org/r/920661

Change 920991 merged by Filippo Giunchedi:

[operations/puppet@production] profile: rollout cadvisor to PoPs

https://gerrit.wikimedia.org/r/920991

Change 922057 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] profile: fix cadvisor deployment to PoPs

https://gerrit.wikimedia.org/r/922057

Change 922057 merged by Filippo Giunchedi:

[operations/puppet@production] profile: fix cadvisor deployment to PoPs

https://gerrit.wikimedia.org/r/922057

Change 922571 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/homer/public@master] LVS: deny cadvisor access from the world

https://gerrit.wikimedia.org/r/922571

Change 922571 merged by jenkins-bot:

[operations/homer/public@master] LVS: deny cadvisor access from the world

https://gerrit.wikimedia.org/r/922571

With the latest changes in place we have the following metrics. cp1075 has block/network accounting enabled and thus yields more metrics.

mw1456:~$ curl -s localhost:4194/metrics | grep -v ^# | cut -d\{ -f1 | uniq -c | sort -rn | awk '$1 > 10 {print}'
    324 container_memory_failures_total
     81 container_start_time_seconds
     81 container_spec_memory_swap_limit_bytes
     81 container_spec_memory_reservation_limit_bytes
     81 container_spec_memory_limit_bytes
     81 container_oom_events_total
     81 container_memory_working_set_bytes
     81 container_memory_usage_bytes
     81 container_memory_swap
     81 container_memory_rss
     81 container_memory_max_usage_bytes
     81 container_memory_mapped_file
     81 container_memory_failcnt
     81 container_memory_cache
     81 container_last_seen
     81 container_cpu_user_seconds_total
     81 container_cpu_system_seconds_total
     71 container_spec_cpu_shares
     71 container_spec_cpu_period
     38 container_cpu_usage_seconds_total
     36 container_blkio_device_usage_total

cp1075:~$ curl -s localhost:4194/metrics | grep -v ^# | cut -d\{ -f1 | uniq -c | sort -rn | awk '$1 > 10 {print}'
    352 container_memory_failures_total
    270 container_blkio_device_usage_total
     95 container_fs_writes_total
     95 container_fs_reads_total
     90 container_fs_writes_bytes_total
     90 container_fs_reads_bytes_total
     88 container_start_time_seconds
     88 container_spec_memory_swap_limit_bytes
     88 container_spec_memory_reservation_limit_bytes
     88 container_spec_memory_limit_bytes
     88 container_spec_cpu_shares
     88 container_spec_cpu_period
     88 container_oom_events_total
     88 container_memory_working_set_bytes
     88 container_memory_usage_bytes
     88 container_memory_swap
     88 container_memory_rss
     88 container_memory_max_usage_bytes
     88 container_memory_mapped_file
     88 container_memory_failcnt
     88 container_memory_cache
     88 container_last_seen
     88 container_cpu_user_seconds_total
     88 container_cpu_system_seconds_total
     58 container_cpu_usage_seconds_total

Change 923530 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] profile: add ensure for prometheus::cadvisor

https://gerrit.wikimedia.org/r/923530

Change 923531 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] profile: start cadvisor rollout in eqiad/codfw

https://gerrit.wikimedia.org/r/923531

Change 923530 merged by Filippo Giunchedi:

[operations/puppet@production] profile: add ensure for prometheus::cadvisor

https://gerrit.wikimedia.org/r/923530

Change 923531 merged by Filippo Giunchedi:

[operations/puppet@production] profile: start cadvisor rollout in eqiad/codfw

https://gerrit.wikimedia.org/r/923531

Mentioned in SAL (#wikimedia-operations) [2023-05-29T09:13:26Z] <godog> start partial rollout of cadvisor to eqiad/codfw (~10%) T108027

This added ~3k samples/s in eqiad and ~2k in codfw, and ~180k / ~140k new metrics, respectively.

Change 924106 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] cloud-vps: disable profile::prometheus::cadvisor

https://gerrit.wikimedia.org/r/924106

Change 924106 merged by Andrew Bogott:

[operations/puppet@production] cloud-vps: disable profile::prometheus::cadvisor

https://gerrit.wikimedia.org/r/924106

Change 927198 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] profile: exclude kubelet hosts from cadvisor rollout

https://gerrit.wikimedia.org/r/927198

Change 927198 merged by Filippo Giunchedi:

[operations/puppet@production] profile: exclude kubelet production hosts from cadvisor rollout

https://gerrit.wikimedia.org/r/927198

Change 927972 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] base: bump cadvisor rollout to 20% in eqiad/codfw

https://gerrit.wikimedia.org/r/927972

Change 927972 merged by Filippo Giunchedi:

[operations/puppet@production] base: bump cadvisor rollout to 20% in eqiad/codfw

https://gerrit.wikimedia.org/r/927972

I've bumped the cadvisor rollout to 20% in codfw/eqiad, for a total of ~900 hosts with cadvisor fleetwide (out of ~2k hosts). The effective rollout percentage is higher because mw/cp hosts already have cadvisor and it is fully rolled out in PoPs.

Change 938810 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] base: bump cadvisor rollout to 45% in eqiad/codfw

https://gerrit.wikimedia.org/r/938810

Change 938810 merged by Filippo Giunchedi:

[operations/puppet@production] base: bump cadvisor rollout to 45% in eqiad/codfw

https://gerrit.wikimedia.org/r/938810

Change 939236 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] base: bump cadvisor rollout to 80%

https://gerrit.wikimedia.org/r/939236

Change 939236 merged by Filippo Giunchedi:

[operations/puppet@production] base: bump cadvisor rollout to 80%

https://gerrit.wikimedia.org/r/939236

Change 940868 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] base: wrap up cadvisor rollout

https://gerrit.wikimedia.org/r/940868

Change 940868 merged by Filippo Giunchedi:

[operations/puppet@production] base: wrap up cadvisor rollout

https://gerrit.wikimedia.org/r/940868

Change 940879 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] prometheus: add recording rules for cadvisor cpu/mem

https://gerrit.wikimedia.org/r/940879

Change 940879 merged by Filippo Giunchedi:

[operations/puppet@production] prometheus: add recording rules for cadvisor cpu/mem

https://gerrit.wikimedia.org/r/940879

Change 941839 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] prometheus: tune cadvisor rules

https://gerrit.wikimedia.org/r/941839

Change 941839 merged by Filippo Giunchedi:

[operations/puppet@production] prometheus: tune cadvisor rules

https://gerrit.wikimedia.org/r/941839

Change 941855 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] prometheus: add rss/swap/cache memory aggregates

https://gerrit.wikimedia.org/r/941855

Change 941855 merged by Filippo Giunchedi:

[operations/puppet@production] prometheus: add rss/swap/cache memory aggregates

https://gerrit.wikimedia.org/r/941855

Change 942420 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] hieradata: finish cadvisor rollout on k8s-aux

https://gerrit.wikimedia.org/r/942420

Change 942420 merged by Filippo Giunchedi:

[operations/puppet@production] hieradata: finish cadvisor rollout on k8s-aux

https://gerrit.wikimedia.org/r/942420

Change 942426 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] hieradata: complete cadvisor rollout on k8s

https://gerrit.wikimedia.org/r/942426

Change 942426 merged by JMeybohm:

[operations/puppet@production] hieradata: complete cadvisor rollout on k8s

https://gerrit.wikimedia.org/r/942426

fgiunchedi updated the task description.

This is complete from my POV: the production fleet runs cadvisor and we're collecting/displaying the data.

fgiunchedi claimed this task.