We want to collect per-cgroup resource usage statistics. This is already achieved by cadvisor on ~23% of the fleet (cp + mw hosts).
We're looking at expanding the deployment fleet-wide, enabling queries such as "what are the top 5 (memory|cpu|etc) users in the fleet or in a specific cluster".
Considerations, moving parts, etc. in no particular order:
- cadvisor seems to be working fine for our purposes, even though we had to update our version internally (cf. T325557) and the version has diverged between bullseye, buster, and official Debian. On the other hand, some metrics are quite fine-grained, to the point of producing significant cardinality (e.g. per-CPU series multiplied by the number of systemd units for CPU-related metrics)
- There will be significantly more load on Prometheus ops eqiad/codfw.
- Quick napkin math for eqiad: ~4k new metrics per host with the default accounting (the exact number depends on how many systemd units are running and how many cores the host has), times ~900 hosts in eqiad that don't already run cadvisor, gives ~3.7M new metrics. Scraped once a minute, that's ~62k new samples/s. Prometheus ops in eqiad currently ingests ~146k samples/s, so this is a ~42% increase in samples ingested, with a comparable increase in disk space.
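As a sanity check, the napkin math above can be reproduced; all inputs are the rough estimates quoted in this task, not measurements:

```python
# Rough sanity check of the eqiad estimate; all inputs are the
# approximate figures from this task, not measured values.
metrics_per_host = 4_100        # "~4k"; varies with systemd units and cores
new_hosts = 900                 # eqiad hosts not yet running cadvisor
scrape_interval_s = 60          # one scrape per minute
current_samples_per_s = 146_000 # current Prometheus ops eqiad ingestion

new_metrics = metrics_per_host * new_hosts            # new time series
new_samples_per_s = new_metrics / scrape_interval_s   # added ingestion rate
increase_pct = 100 * new_samples_per_s / current_samples_per_s

print(f"{new_metrics / 1e6:.1f}M new metrics, "
      f"{new_samples_per_s / 1e3:.0f}k samples/s, "
      f"+{increase_pct:.0f}% ingestion")
# → 3.7M new metrics, 62k samples/s, +42% ingestion
```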
- PoPs are fine though: cp hosts already run cadvisor, so the increase in Prometheus load will be marginal there.
- block IO and IP accounting are disabled by default in systemd (tasks/memory/cpu accounting is enabled by default on >= buster). We can consider enabling additional accounting either for specific roles or fleet-wide, either before or after the rollout.
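If we do enable the extra accounting, one mechanism (a sketch only; the drop-in path is illustrative and should follow our puppetization conventions) is a systemd manager-level drop-in setting the defaults fleet-wide:

```ini
# /etc/systemd/system.conf.d/accounting.conf (illustrative path)
[Manager]
# Block IO accounting (cgroup v1 / buster) and IO accounting (cgroup v2 / bullseye)
DefaultBlockIOAccounting=yes
DefaultIOAccounting=yes
# Per-unit IP traffic accounting
DefaultIPAccounting=yes
```

Per-role enablement would instead set the equivalent per-unit directives (e.g. IPAccounting=) in unit overrides.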
cadvisor version
Our version/build of cadvisor 0.44 seems to run well on both Buster and Bullseye, therefore we should deploy a single version (i.e. upgrade existing 0.35.0 hosts to 0.44)
Metric cardinality
With respect to 2.A, here's a breakdown of cardinality with default accounting on Buster and cadvisor 0.35:
mw1456:~$ curl -s localhost:4194/metrics | grep -v ^# | cut -d\{ -f1 | uniq -c | sort -rn | awk '$1 > 10 {print}'
1552 container_cpu_usage_seconds_total
410 container_tasks_state
328 container_memory_failures_total
82 container_start_time_seconds
82 container_spec_memory_swap_limit_bytes
82 container_spec_memory_reservation_limit_bytes
82 container_spec_memory_limit_bytes
82 container_memory_working_set_bytes
82 container_memory_usage_bytes
82 container_memory_swap
82 container_memory_rss
82 container_memory_max_usage_bytes
82 container_memory_mapped_file
82 container_memory_failcnt
82 container_memory_cache
82 container_last_seen
82 container_cpu_user_seconds_total
82 container_cpu_system_seconds_total
82 container_cpu_load_average_10s
72 container_spec_cpu_shares
72 container_spec_cpu_period
Whereas with all accounting enabled, Bullseye and cadvisor 0.44:
cp1075:~$ curl -s localhost:4194/metrics | grep -v ^# | cut -d\{ -f1 | uniq -c | sort -rn | awk '$1 > 10 {print}'
440 container_tasks_state
352 container_memory_failures_total
302 container_blkio_device_usage_total
106 container_fs_writes_total
106 container_fs_reads_total
102 container_fs_writes_bytes_total
102 container_fs_reads_bytes_total
88 container_start_time_seconds
88 container_spec_memory_swap_limit_bytes
88 container_spec_memory_reservation_limit_bytes
88 container_spec_memory_limit_bytes
88 container_spec_cpu_shares
88 container_spec_cpu_period
88 container_oom_events_total
88 container_memory_working_set_bytes
88 container_memory_usage_bytes
88 container_memory_swap
88 container_memory_rss
88 container_memory_max_usage_bytes
88 container_memory_mapped_file
88 container_memory_failcnt
88 container_memory_cache
88 container_last_seen
88 container_cpu_user_seconds_total
88 container_cpu_system_seconds_total
88 container_cpu_load_average_10s
58 container_cpu_usage_seconds_total
We'll have to work on identifying what can realistically be disabled from the get-go.
Plan of action
- Run a single cadvisor version (0.44.0) fleetwide
- Trim down the default enabled metrics to reduce cardinality (percpu and cpuLoad at least; see also cadvisor's Prometheus metrics documentation)
- Further trim metrics we are not going to use by discarding them at ingestion time
- Evaluate whether we can ingest the new metrics in eqiad/codfw on a 100% rollout
- Complete rollout in PoPs
- Gradual rollout in eqiad/codfw (e.g. percentage-based)
- Evaluate whether we want to enable IO/IP accounting and on what basis https://phabricator.wikimedia.org/T345078
- Update docs/wikitech with cadvisor info and similar https://wikitech.wikimedia.org/wiki/Cadvisor
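For the "discard at ingestion time" step above, the usual mechanism is metric_relabel_configs on the cadvisor scrape job. A sketch (the job name and drop list are examples only; the real list should come from the cardinality breakdowns above):

```yaml
# Illustrative Prometheus scrape job fragment, not our actual config.
- job_name: cadvisor
  metric_relabel_configs:
    # Drop high-cardinality metrics we don't plan to query
    - source_labels: [__name__]
      regex: 'container_tasks_state|container_memory_failures_total'
      action: drop
```

For the trimming-at-the-source step, cadvisor also has a --disable_metrics flag taking a comma-separated list of metric groups (e.g. percpu), which avoids generating the series at all rather than discarding them at scrape time.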