We want to collect per-cgroup resource usage statistics; this is already achieved by cadvisor on ~23% of the fleet (cp + mw hosts).
We're looking at expanding the deployment fleetwide, enabling for example queries like "what are the top 5 (memory|cpu|etc) users, either in the fleet or in a specific cluster?".
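To illustrate the kind of query this enables, here's a minimal sketch against the Prometheus HTTP API; the endpoint URL and the cluster/instance/id label names are assumptions that depend on our scrape configuration:

# Top 5 cgroup memory (RSS) users within a hypothetical "mediawiki" cluster;
# endpoint and label names are illustrative only.
curl -sG 'http://prometheus.svc.eqiad.wmnet/ops/api/v1/query' \
  --data-urlencode 'query=topk(5, sum by (instance, id) (container_memory_rss{cluster="mediawiki"}))'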
Considerations, moving parts, etc., in no particular order:
- cadvisor seems to be working fine for our purposes, even though we had to update our version internally (cf. T325557) and the version has diverged between bullseye, buster and official Debian. OTOH some metrics are quite fine-grained, to the point of leading to significant cardinality (e.g. per-CPU × systemd units for CPU-related metrics).
- There will be significantly more load on Prometheus ops eqiad/codfw.
- Quick napkin math for eqiad: ~4k new metrics per host with the default accounting (the exact number depends on how many systemd units are running and how many cores the host has), times ~900 hosts in eqiad that don't already run cadvisor, gives ~3.7M new metrics to be collected. Scraped once a minute, that's ~62k new samples/s. Prometheus ops in eqiad currently ingests ~146k samples/s, so this would be a ~42% increase in samples ingested, with a comparable increase in disk space.
- PoPs are fine though, since cp hosts already run cadvisor; the increase in Prometheus load there will be marginal.
- Block IO and IP accounting are disabled by default in systemd (tasks/memory/cpu accounting are enabled by default on buster and later). We can consider enabling additional accounting either for specific roles or fleetwide, either pre- or post-rollout; a sketch of what that could look like follows this list.
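A minimal sketch of enabling the extra accounting globally via a systemd drop-in, assuming it is applied by hand (in practice this would presumably be managed through Puppet; the drop-in path is just an example):

# Turn on block IO and IP accounting for all units by default.
sudo mkdir -p /etc/systemd/system.conf.d
printf '[Manager]\nDefaultIOAccounting=yes\nDefaultIPAccounting=yes\n' \
  | sudo tee /etc/systemd/system.conf.d/accounting.conf
# Re-execute the manager so the new defaults are picked up; already-running
# units only start accounting once they are restarted.
sudo systemctl daemon-reexec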
cadvisor version
Our version/build of cadvisor 0.44 seems to run well on both Buster and Bullseye; therefore we should deploy a single version (i.e. upgrade the existing 0.35.0 hosts to 0.44).
Metric cardinality
With respect to 2.A, here's a breakdown of cardinality with default accounting on Buster and cadvisor 0.35:
mw1456:~$ curl -s localhost:4194/metrics | grep -v ^# | cut -d\{ -f1 | uniq -c | sort -rn | awk '$1 > 10 {print}'
   1552 container_cpu_usage_seconds_total
    410 container_tasks_state
    328 container_memory_failures_total
     82 container_start_time_seconds
     82 container_spec_memory_swap_limit_bytes
     82 container_spec_memory_reservation_limit_bytes
     82 container_spec_memory_limit_bytes
     82 container_memory_working_set_bytes
     82 container_memory_usage_bytes
     82 container_memory_swap
     82 container_memory_rss
     82 container_memory_max_usage_bytes
     82 container_memory_mapped_file
     82 container_memory_failcnt
     82 container_memory_cache
     82 container_last_seen
     82 container_cpu_user_seconds_total
     82 container_cpu_system_seconds_total
     82 container_cpu_load_average_10s
     72 container_spec_cpu_shares
     72 container_spec_cpu_period
Whereas with all accounting enabled, on Bullseye and cadvisor 0.44:
cp1075:~$ curl -s localhost:4194/metrics | grep -v ^# | cut -d\{ -f1 | uniq -c | sort -rn | awk '$1 > 10 {print}'
    440 container_tasks_state
    352 container_memory_failures_total
    302 container_blkio_device_usage_total
    106 container_fs_writes_total
    106 container_fs_reads_total
    102 container_fs_writes_bytes_total
    102 container_fs_reads_bytes_total
     88 container_start_time_seconds
     88 container_spec_memory_swap_limit_bytes
     88 container_spec_memory_reservation_limit_bytes
     88 container_spec_memory_limit_bytes
     88 container_spec_cpu_shares
     88 container_spec_cpu_period
     88 container_oom_events_total
     88 container_memory_working_set_bytes
     88 container_memory_usage_bytes
     88 container_memory_swap
     88 container_memory_rss
     88 container_memory_max_usage_bytes
     88 container_memory_mapped_file
     88 container_memory_failcnt
     88 container_memory_cache
     88 container_last_seen
     88 container_cpu_user_seconds_total
     88 container_cpu_system_seconds_total
     88 container_cpu_load_average_10s
     58 container_cpu_usage_seconds_total
We'll have to work on identifying what can realistically be disabled from the get-go; see the sketch below for a starting point.
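A minimal sketch of a trimmed-down cadvisor invocation; the flag spelling and the final list of disabled metric groups are assumptions to be validated against cadvisor 0.44:

# Disable the per-CPU and CPU-load metric groups at the source (flags illustrative).
/usr/bin/cadvisor -port 4194 -disable_metrics=percpu,cpuLoad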
Plan of action
- Run a single cadvisor version (0.44.0) fleetwide
- Trim down the default enabled metrics to reduce cardinality (at least percpu and cpuLoad; see also cadvisor's Prometheus metrics documentation)
- Further trim metrics we are not going to use by discarding them at ingestion time (see the relabelling sketch after this list)
- Evaluate whether we can ingest the new metrics in eqiad/codfw on a 100% rollout
- Complete rollout in PoPs
- Gradual rollout in eqiad/codfw (e.g. percentage-based)
- Evaluate whether we want to enable IO/IP accounting, and on what basis: https://phabricator.wikimedia.org/T345078
- Update docs/wikitech with cadvisor info and similar https://wikitech.wikimedia.org/wiki/Cadvisor
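For the ingestion-time trimming mentioned in the plan above, a minimal sketch of a Prometheus metric_relabel_configs fragment to attach to the cadvisor scrape job; the regex and the exact set of dropped metrics are placeholders:

# Drop series we don't plan to use before they are ingested (regex illustrative).
metric_relabel_configs:
  - source_labels: [__name__]
    regex: 'container_(tasks_state|memory_failures_total|spec_.*)'
    action: drop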