Page MenuHomePhabricator

vhtcpd prometheus metrics broken; prometheus-vhtcpd-stats.py out-of-date with reality
Closed, ResolvedPublic

Description

Current logging output looks like:

start:1582152620 uptime:3775380 purgers:2 recvd:7919253941 bad:0 filtered:0
Purger0: input:7919253941 failed:0 q_size:86873061 q_mem:14865036237 q_max_size:157669187 q_max_mem:28148024297
Purger1: input:7832380879 failed:0 q_size:2331 q_mem:718320 q_max_size:129967 q_max_mem:18856476

with separate stat lines for each purger thread. But the monitoring exporter script expects a single purger worker, and also expects slightly different names (e.g. queue_size).

Event Timeline

I had a chat with the author to make sure we understand the meaning of the fields:

First line:
start: this is just the *nix timestamp of daemon start
uptime: seconds since the above. Not sure if either of these is very useful (it's nice to see restarts, but the rest of the numbers will reset and make it obvious anyways)
purgers: fixed for life, just counts how many purger lines will follow
recvd: Counter of all purges received over multicast input
bad: Counter, subset of the above that were somehow faulty
filtered: Counter, subset of un-bad ones which were filtered out by a commandline regex filter

Purger lines: (0 is the backend, 1 is the frontend)
input: Counter, valid inputs to the queue
failed: Counter, failed to deliver via HTTP (unparseable response, or [45]xx response, etc)
q_size: Gauge, current amount of purges stuck in queue
q_mem: Gauge, current memory consumed by purges stuck in queue
q_max_size: Maximum value of q_size since daemon start - no need to consume this if prometheus has historical stats anyways!
q_max_mem: Maximum value of q_mem since daemon start - no need to consume this if prometheus has historical stats anyways!

Note that each PurgerN has its own separate queue. Purger0's "input" should track fairly close to recvd-bad-filtered from the top line. Purged items move through the multiple queues serially; Purger1 doesn't receive an item until Purger0 has successfully processed it. The purgers also have configurable queuing delays (I believe all are set to 1 second on the frontend, which is Purger1).

We're probably going to add multiple purger connections to fan out the per-thread load from T241232 to help with T249325 . I've poked at that already, and the likely impact here is that the overall format will stay the same, but the Purger0 labels at the start of the lines will probably change, so don't rely on the PurgerN formatting of that leading label. I'm thinking right now that I'll give the purgers names in the config and then number them for the configured fanout factor, so we're likely to see something more like:

start:1582152620 uptime:3775380 purgers:5 recvd:7919253941 bad:0 filtered:0
Backend0: input:7919253941 failed:0 q_size:86873061 q_mem:14865036237 q_max_size:157669187 q_max_mem:28148024297
Backend1: input:7919253941 failed:0 q_size:86873061 q_mem:14865036237 q_max_size:157669187 q_max_mem:28148024297
Backend2: input:7919253941 failed:0 q_size:86873061 q_mem:14865036237 q_max_size:157669187 q_max_mem:28148024297
Backend3: input:7919253941 failed:0 q_size:86873061 q_mem:14865036237 q_max_size:157669187 q_max_mem:28148024297
Frontend0: input:7832380879 failed:0 q_size:2331 q_mem:718320 q_max_size:129967 q_max_mem:18856476

And since all the BackendN will pull from the same queue, they'll each have ~1/4 the stats that the combined one does now, and would only compare to the receive-side stats in the aggregate.

Change 586111 had a related patch set uploaded (by CDanis; owner: CDanis):
[operations/puppet@production] update vhtcpd exporter to match reality

https://gerrit.wikimedia.org/r/586111

Updated version parsing your new output above:

1# HELP vhtcpd_start Start timestamp epoch seconds
2# TYPE vhtcpd_start gauge
3vhtcpd_start 1.58215262e+09
4# HELP vhtcpd_uptime Daemon uptime in seconds
5# TYPE vhtcpd_uptime gauge
6vhtcpd_uptime 3.77538e+06
7# HELP vhtcpd_input_packets_total Packets received by the shared listener
8# TYPE vhtcpd_input_packets_total gauge
9vhtcpd_input_packets_total 7.919253941e+09
10# HELP vhtcpd_input_packets_bad_total Packets received by the shared listener that were somehow faulty
11# TYPE vhtcpd_input_packets_bad_total gauge
12vhtcpd_input_packets_bad_total 0.0
13# HELP vhtcpd_input_packets_filtered_total Packets received by the listener but filtered by configuration
14# TYPE vhtcpd_input_packets_filtered_total gauge
15vhtcpd_input_packets_filtered_total 0.0
16# HELP vhtcpd_worker_input_packets_total Total packets enqueued on a given purger worker
17# TYPE vhtcpd_worker_input_packets_total gauge
18vhtcpd_worker_input_packets_total{worker="Backend0"} 7.919253941e+09
19vhtcpd_worker_input_packets_total{worker="Backend1"} 7.919253941e+09
20vhtcpd_worker_input_packets_total{worker="Backend2"} 7.919253941e+09
21vhtcpd_worker_input_packets_total{worker="Backend3"} 7.919253941e+09
22vhtcpd_worker_input_packets_total{worker="Frontend0"} 7.832380879e+09
23# HELP vhtcpd_worker_input_packets_failed_total Packets a given purger worker failed to deliver
24# TYPE vhtcpd_worker_input_packets_failed_total gauge
25vhtcpd_worker_input_packets_failed_total{worker="Backend0"} 0.0
26vhtcpd_worker_input_packets_failed_total{worker="Backend1"} 0.0
27vhtcpd_worker_input_packets_failed_total{worker="Backend2"} 0.0
28vhtcpd_worker_input_packets_failed_total{worker="Backend3"} 0.0
29vhtcpd_worker_input_packets_failed_total{worker="Frontend0"} 0.0
30# HELP vhtcpd_worker_queue_length Instantaneous queue size of each worker
31# TYPE vhtcpd_worker_queue_length gauge
32vhtcpd_worker_queue_length{worker="Backend0"} 8.6873061e+07
33vhtcpd_worker_queue_length{worker="Backend1"} 8.6873061e+07
34vhtcpd_worker_queue_length{worker="Backend2"} 8.6873061e+07
35vhtcpd_worker_queue_length{worker="Backend3"} 8.6873061e+07
36vhtcpd_worker_queue_length{worker="Frontend0"} 2331.0
37# HELP vhtcpd_worker_queue_size_bytes Number of memory bytes consumed by each worker's queue
38# TYPE vhtcpd_worker_queue_size_bytes gauge
39vhtcpd_worker_queue_size_bytes{worker="Backend0"} 1.4865036237e+010
40vhtcpd_worker_queue_size_bytes{worker="Backend1"} 1.4865036237e+010
41vhtcpd_worker_queue_size_bytes{worker="Backend2"} 1.4865036237e+010
42vhtcpd_worker_queue_size_bytes{worker="Backend3"} 1.4865036237e+010
43vhtcpd_worker_queue_size_bytes{worker="Frontend0"} 718320.0

And parsing current output from cp3052:

1# HELP vhtcpd_start Start timestamp epoch seconds
2# TYPE vhtcpd_start gauge
3vhtcpd_start 1.581504248e+09
4# HELP vhtcpd_uptime Daemon uptime in seconds
5# TYPE vhtcpd_uptime gauge
6vhtcpd_uptime 4.5957e+06
7# HELP vhtcpd_input_packets_total Packets received by the shared listener
8# TYPE vhtcpd_input_packets_total gauge
9vhtcpd_input_packets_total 9.564508919e+09
10# HELP vhtcpd_input_packets_bad_total Packets received by the shared listener that were somehow faulty
11# TYPE vhtcpd_input_packets_bad_total gauge
12vhtcpd_input_packets_bad_total 0.0
13# HELP vhtcpd_input_packets_filtered_total Packets received by the listener but filtered by configuration
14# TYPE vhtcpd_input_packets_filtered_total gauge
15vhtcpd_input_packets_filtered_total 0.0
16# HELP vhtcpd_worker_input_packets_total Total packets enqueued on a given purger worker
17# TYPE vhtcpd_worker_input_packets_total gauge
18vhtcpd_worker_input_packets_total{worker="Purger0"} 9.564508919e+09
19vhtcpd_worker_input_packets_total{worker="Purger1"} 9.456208903e+09
20# HELP vhtcpd_worker_input_packets_failed_total Packets a given purger worker failed to deliver
21# TYPE vhtcpd_worker_input_packets_failed_total gauge
22vhtcpd_worker_input_packets_failed_total{worker="Purger0"} 0.0
23vhtcpd_worker_input_packets_failed_total{worker="Purger1"} 0.0
24# HELP vhtcpd_worker_queue_length Instantaneous queue size of each worker
25# TYPE vhtcpd_worker_queue_length gauge
26vhtcpd_worker_queue_length{worker="Purger0"} 1.08300015e+08
27vhtcpd_worker_queue_length{worker="Purger1"} 1337.0
28# HELP vhtcpd_worker_queue_size_bytes Number of memory bytes consumed by each worker's queue
29# TYPE vhtcpd_worker_queue_size_bytes gauge
30vhtcpd_worker_queue_size_bytes{worker="Purger0"} 1.8318132764e+010
31vhtcpd_worker_queue_size_bytes{worker="Purger1"} 634376.0

Change 586111 merged by CDanis:
[operations/puppet@production] update vhtcpd exporter to match reality

https://gerrit.wikimedia.org/r/586111