puppetdb prometheus metrics per-host metrics
Closed, Resolved · Public


While investigating something else, I discovered that puppetdb hosts are exporting a large number of metrics with the hostname embedded in the metric name. Prometheus copes badly with that, because every host adds its own set of metric names. We are already blacklisting some mbeans; the blacklist likely needs to be expanded.

puppetdb1001:~$ curl -s | grep ^puppetlabs_puppetdb_http_pdb_query_v4_catalogs | wc -l
puppetdb1001:~$ curl -s | grep ^puppetlabs_puppetdb_http_pdb_query_v4_catalogs | head -2
puppetlabs_puppetdb_http_pdb_query_v4_catalogs_mw2231_codfw_wmnet_resources_service_time_98thPercentile 480.154882
puppetlabs_puppetdb_http_pdb_query_v4_catalogs_wdqs1010_eqiad_wmnet_service_time_50thPercentile 300.70412999999996
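To gauge how many per-host names are involved, the exporter output can be grouped by a fixed prefix. This is a minimal sketch, assuming the output has been saved to a local file first; the function names and the file are hypothetical, and the exporter URL is not shown above, so it is left out here.

```shell
# Hypothetical helpers, not part of the original task.

# count_by_prefix FILE PREFIX
# Count exposition lines whose metric name starts with PREFIX.
count_by_prefix() {
    grep -c "^$2" "$1"
}

# suffixes_for_prefix FILE PREFIX
# List the distinct name suffixes that follow PREFIX, one per line.
# For the catalogs metrics these suffixes contain the hostname.
suffixes_for_prefix() {
    grep "^$2" "$1" | awk '{print $1}' | sed "s/^$2//" | sort -u
}
```

For example, `count_by_prefix metrics.txt puppetlabs_puppetdb_http_pdb_query_v4_catalogs` gives the same number as the `grep | wc -l` pipeline above, and `suffixes_for_prefix` makes the embedded hostnames visible.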

Event Timeline

Restricted Application added a subscriber: Aklapper. · Jul 18 2019, 10:06 AM

Change 524182 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] puppetmaster: blacklist per-host catalogs metrics

jbond added a subscriber: jbond. · Jul 18 2019, 10:29 AM

Below are all the hosts with the puppetdb package installed


Change 524182 merged by Filippo Giunchedi:
[operations/puppet@production] puppetmaster: blacklist per-host catalogs metrics

The change is deployed; the high-cardinality metrics already stored should also be deleted. To be able to delete metrics, we first need to pass --web.enable-admin-api to prometheus.

herron triaged this task as Normal priority. · Jul 23 2019, 2:03 PM

cc @EBernhardson this is likely one of the problems you have been experiencing with prometheus' web interface (i.e. dropdown/autocomplete is slow because of many metric names)

Mentioned in SAL (#wikimedia-operations) [2019-09-05T10:53:34Z] <godog> temporarily enable prometheus admin web api in prometheus@ops in eqiad to delete spammy metrics - T228395

prometheus1004 completed, with this process:

# depool
# stop puppet
# add --web.enable-admin-api to /lib/systemd/system/prometheus@ops.service
systemctl daemon-reload
systemctl restart prometheus@ops
curl -v -X POST -g 'http://localhost:9900/ops/api/v1/admin/tsdb/delete_series?match[]={__name__=~"^puppetlabs_puppetdb_http_pdb_query_v4_catalogs_.*wmnet.*"}'
curl -v -X POST -g 'http://localhost:9900/ops/api/v1/admin/tsdb/delete_series?match[]={__name__=~"^puppetlabs_puppetdb_http_pdb_query_v4_catalogs_.*wikimedia.*"}'
curl -XPOST http://localhost:9900/ops/api/v1/admin/tsdb/clean_tombstones
# wait for tombstones to get cleaned, the curl will return once done
# reenable puppet agent, prometheus will get restarted
# verify all is well
$ curl -s http://localhost:9900/ops/api/v1/label/__name__/values | jq '.data | length'
# repool
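
Before running the deletes, it may be worth previewing how many series a matcher selects. This is a sketch only, assuming the standard `/api/v1/series` query endpoint is reachable under the same `/ops` prefix as the admin API and that `jq` is installed; `build_matcher` and `preview_count` are hypothetical helper names.

```shell
# Hypothetical helpers, not part of the original procedure.

# build_matcher PATTERN
# Print a series selector that regex-matches PATTERN on the metric name.
build_matcher() {
    printf '{__name__=~"%s"}' "$1"
}

# preview_count BASE_URL PATTERN
# Ask Prometheus which series match and print how many there are.
# -g stops curl from interpreting the [] and {} in the URL itself.
preview_count() {
    curl -s -g "$1/api/v1/series?match[]=$(build_matcher "$2")" | jq '.data | length'
}
```

For instance, `preview_count http://localhost:9900/ops 'puppetlabs_puppetdb_http_pdb_query_v4_catalogs_.*wmnet.*'` would report how many series the first delete_series call above is expected to remove.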
fgiunchedi closed this task as Resolved. · Thu, Sep 5, 3:02 PM
fgiunchedi claimed this task.

Both prometheus1003 and prometheus1004 have been cleaned and repooled; resolving. @EBernhardson, please give the web UI another try; autocomplete should be significantly faster now.