Page MenuHomePhabricator

PuppetDB grafana graphs not matching logs
Open, MediumPublic

Description

Im trying to understand the Replace facts grafana boards for puppetdb2002 vs the logs we see on puppetdb, specifically we seem to see a lot of transactions which take more then 10 or even 100 seconds which don't seem to be represented in theses graphs

puppetdb2002 ~ % zgrep -E '[0-9]{5,} ms' /var/log/puppetdb/puppetdb-2020-10-14.0.log.gz | wc -l                                                                                                  [18:32:12]
2021
puppetdb2002 ~ % zgrep -E '[0-9]{6,} ms' /var/log/puppetdb/puppetdb-2020-10-14.0.log.gz | wc -l                                                                                                  [18:32:16]
267
puppetdb2002 ~ % zgrep -E '[0-9]{6,} ms' /var/log/puppetdb/puppetdb-2020-10-14.0.log.gz | head -10                                                                                               [18:32:23]
2020-10-14T00:41:30.441Z INFO  [p.p.command] [1371130-1602635835455] [254977 ms] 'replace facts' command processed for kubernetes2013.codfw.wmnet
2020-10-14T01:42:04.464Z INFO  [p.p.command] [1375145-1602639469947] [254508 ms] 'replace facts' command processed for kubernetes2013.codfw.wmnet
2020-10-14T01:48:49.494Z INFO  [p.p.command] [1375652-1602639921861] [207625 ms] 'replace facts' command processed for kubernetes2001.codfw.wmnet
2020-10-14T02:24:15.467Z INFO  [p.p.command] [1377912-1602641976613] [278844 ms] 'replace facts' command processed for kubernetes2003.codfw.wmnet
2020-10-14T02:48:09.310Z INFO  [p.p.command] [1379594-1602643482638] [206664 ms] 'replace facts' command processed for kubernetes2001.codfw.wmnet
2020-10-14T02:49:07.730Z INFO  [p.p.command] [1379633-1602643517425] [230300 ms] 'replace facts' command processed for kubernetes2002.codfw.wmnet
2020-10-14T03:12:15.840Z INFO  [p.p.command] [1381134-1602644881234] [254597 ms] 'replace facts' command processed for kubernetes2013.codfw.wmnet
2020-10-14T03:23:35.624Z INFO  [p.p.command] [1381858-1602645533627] [281987 ms] 'replace facts' command processed for kubernetes2003.codfw.wmnet
2020-10-14T03:48:52.127Z INFO  [p.p.command] [1383594-1602647100845] [231274 ms] 'replace facts' command processed for kubernetes2002.codfw.wmnet
2020-10-14T03:54:58.582Z INFO  [p.p.command] [1383933-1602647429824] [268749 ms] 'replace facts' command processed for kubernetes2007.codfw.wmnet

I dont know if this is just something that is lost as the data is summarized, if the grafana graph is not showing what i expect or if the logs are reporting something different to jmx (i have a feeling jmx is maybe summarising the data before it gets to grafana) but would be good to understand

it would also be nice to have graphs for 'replace catalog' & 'store report' not sure if we currently send theses via jmx but are in the logs in a simlar format to 'replace facts' entries above

Event Timeline

jbond triaged this task as Medium priority.Oct 15 2020, 6:45 PM
jbond created this task.

Looking at the panel's query, AFAICT that's the rate of replace facts operations:

rate(puppetlabs_puppetdb_mq_replace_facts_5_processed_Count{instance="$instance"}[5m]) or rate(com_puppetlabs_puppetdb_command_replace_facts_3_Count{instance="$instance"}[5m])

However percentiles for the same operations are already exported as metrics, for example puppetlabs_puppetdb_mq_replace_facts_5_processing_time_98thPercentile, the easiest way to navigate the metrics is by autocompletion on https://grafana.wikimedia.org/explore or https://thanos.wikimedia.org/

hope that helps!

Change 635264 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] puppetdb: don't export metrics for host-specific mbeans

https://gerrit.wikimedia.org/r/635264

Change 635264 merged by Filippo Giunchedi:
[operations/puppet@production] puppetdb: don't export metrics for host-specific mbeans

https://gerrit.wikimedia.org/r/635264