Consider adding per-shard metrics to the prometheus mcrouter exporter
Closed, Resolved (Public)

Description

This is essentially what is being asked in https://github.com/Dev25/mcrouter_exporter/issues/9

mcrouter offers memcached shard metrics, for example:

[..]
STAT 10.64.0.81:11211:ascii:plain:notcompressed-1000 avg_latency_us:344.766 pending_reqs:0 inflight_reqs:0 avg_retrans_ratio:0 max_retrans_ratio:0 min_retrans_ratio:0 up:5; deleted:108046 touched:869669 found:1951422718 notfound:66660413 notstored:1825893 stored:51096164 exists:19914 timeout:183 remote_error:70
[..]

The information provided relates more to mcrouter itself than to a high-level breakdown of the memcached operations (get/set/cas/etc.), but in my opinion it would be useful to have per-shard/server metrics rather than relying only on aggregates.
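As a rough illustration of what an exporter would have to do with a line like the one above, here is a minimal Python sketch. It assumes that every field after the server key has the form name:value and that a trailing ';' (as in "up:5;") can simply be dropped; the function name parse_server_stat is hypothetical and is not taken from the exporter's code.

```python
from __future__ import annotations

# The sample line from the description above, copied verbatim.
SAMPLE = (
    "STAT 10.64.0.81:11211:ascii:plain:notcompressed-1000 "
    "avg_latency_us:344.766 pending_reqs:0 inflight_reqs:0 "
    "avg_retrans_ratio:0 max_retrans_ratio:0 min_retrans_ratio:0 up:5; "
    "deleted:108046 touched:869669 found:1951422718 notfound:66660413 "
    "notstored:1825893 stored:51096164 exists:19914 timeout:183 "
    "remote_error:70"
)


def parse_server_stat(line: str) -> tuple[str, dict[str, float]]:
    """Split one per-server STAT line into (server_key, {name: value})."""
    fields = line.split()
    if len(fields) < 2 or fields[0] != "STAT":
        raise ValueError(f"not a STAT line: {line!r}")
    server = fields[1]
    metrics: dict[str, float] = {}
    for field in fields[2:]:
        # Assumption: each remaining field is "name:value".
        name, _, raw = field.rpartition(":")
        if not name:
            continue  # no colon found, skip the field
        # The sample has a ';' after the state counter ("up:5;").
        metrics[name] = float(raw.rstrip(";"))
    return server, metrics


server, metrics = parse_server_stat(SAMPLE)
print(server, metrics["avg_latency_us"], metrics["remote_error"])
```

Each name in the resulting dictionary could then become one metric carrying a server label, which is the shape a per-server exporter output would need.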

Event Timeline

elukey triaged this task as Medium priority. Jun 5 2019, 7:06 AM
elukey updated the task description.
elukey added a project: observability.
elukey added a subscriber: fgiunchedi.
Krinkle renamed this task from "Consider adding per shard metrics to the prometheus mcrouter exporter" to "Consider adding per-shard metrics to the prometheus mcrouter exporter". Jun 5 2019, 1:07 PM

Created https://github.com/Dev25/mcrouter_exporter/pull/10

Example from deployment-prep:

# HELP mcrouter_server_duration_us Average time of processing a request per-server (i.e. receiving request and sending a reply).
# TYPE mcrouter_server_duration_us gauge
mcrouter_server_duration_us{server="deployment-memc04:11211:ascii:plain:notcompressed-1000"} 968.161
mcrouter_server_duration_us{server="deployment-memc05:11211:ascii:plain:notcompressed-1000"} 1219.458
# HELP mcrouter_server_memcached_connect_timeout_count Number of memcached connect timeouts (per-server metric).
# TYPE mcrouter_server_memcached_connect_timeout_count counter
mcrouter_server_memcached_connect_timeout_count{server="deployment-memc04:11211:ascii:plain:notcompressed-1000"} 0
mcrouter_server_memcached_connect_timeout_count{server="deployment-memc05:11211:ascii:plain:notcompressed-1000"} 0
# HELP mcrouter_server_memcached_deleted_count Number of memcached DELETED replies (per-server metric).
# TYPE mcrouter_server_memcached_deleted_count counter
mcrouter_server_memcached_deleted_count{server="deployment-memc04:11211:ascii:plain:notcompressed-1000"} 47669
mcrouter_server_memcached_deleted_count{server="deployment-memc05:11211:ascii:plain:notcompressed-1000"} 39955
# HELP mcrouter_server_memcached_exists_count Number of memcached EXISTS replies (per-server metric).
# TYPE mcrouter_server_memcached_exists_count counter
mcrouter_server_memcached_exists_count{server="deployment-memc04:11211:ascii:plain:notcompressed-1000"} 141
mcrouter_server_memcached_exists_count{server="deployment-memc05:11211:ascii:plain:notcompressed-1000"} 232
# HELP mcrouter_server_memcached_found_count Number of memcached FOUND replies (per-server metric).
# TYPE mcrouter_server_memcached_found_count counter
mcrouter_server_memcached_found_count{server="deployment-memc04:11211:ascii:plain:notcompressed-1000"} 1.3666515e+07
mcrouter_server_memcached_found_count{server="deployment-memc05:11211:ascii:plain:notcompressed-1000"} 1.3620475e+07
# HELP mcrouter_server_memcached_not_found_count Number of memcached NOT_FOUND replies (per-server metric).
# TYPE mcrouter_server_memcached_not_found_count counter
mcrouter_server_memcached_not_found_count{server="deployment-memc04:11211:ascii:plain:notcompressed-1000"} 0
mcrouter_server_memcached_not_found_count{server="deployment-memc05:11211:ascii:plain:notcompressed-1000"} 0
# HELP mcrouter_server_memcached_not_stored_count Number of memcached NOT_STORED replies (per-server metric).
# TYPE mcrouter_server_memcached_not_stored_count counter
mcrouter_server_memcached_not_stored_count{server="deployment-memc04:11211:ascii:plain:notcompressed-1000"} 0
mcrouter_server_memcached_not_stored_count{server="deployment-memc05:11211:ascii:plain:notcompressed-1000"} 0
# HELP mcrouter_server_memcached_remote_error_count Number of memcached remote errors (per-server metric).
# TYPE mcrouter_server_memcached_remote_error_count counter
mcrouter_server_memcached_remote_error_count{server="deployment-memc04:11211:ascii:plain:notcompressed-1000"} 0
mcrouter_server_memcached_remote_error_count{server="deployment-memc05:11211:ascii:plain:notcompressed-1000"} 15
# HELP mcrouter_server_memcached_stored_count Number of memcached STORED replies (per-server metric).
# TYPE mcrouter_server_memcached_stored_count counter
mcrouter_server_memcached_stored_count{server="deployment-memc04:11211:ascii:plain:notcompressed-1000"} 1.736137e+06
mcrouter_server_memcached_stored_count{server="deployment-memc05:11211:ascii:plain:notcompressed-1000"} 1.77941e+06
# HELP mcrouter_server_memcached_timeout_count Number of memcached timeouts (per-server metric).
# TYPE mcrouter_server_memcached_timeout_count counter
mcrouter_server_memcached_timeout_count{server="deployment-memc04:11211:ascii:plain:notcompressed-1000"} 0
mcrouter_server_memcached_timeout_count{server="deployment-memc05:11211:ascii:plain:notcompressed-1000"} 0
# HELP mcrouter_server_memcached_tko Number of times memcached has been marked as TKO (per-server metric).
# TYPE mcrouter_server_memcached_tko counter
mcrouter_server_memcached_tko{server="deployment-memc04:11211:ascii:plain:notcompressed-1000"} 0
mcrouter_server_memcached_tko{server="deployment-memc05:11211:ascii:plain:notcompressed-1000"} 0
# HELP mcrouter_server_memcached_touched_count Number of memcached TOUCHED replies (per-server metric).
# TYPE mcrouter_server_memcached_touched_count counter
mcrouter_server_memcached_touched_count{server="deployment-memc04:11211:ascii:plain:notcompressed-1000"} 7750
mcrouter_server_memcached_touched_count{server="deployment-memc05:11211:ascii:plain:notcompressed-1000"} 86956
# HELP mcrouter_server_proxy_reqs_processing Requests mcrouter started routing but didn't receive a reply yet (per-server metric)
# TYPE mcrouter_server_proxy_reqs_processing gauge
mcrouter_server_proxy_reqs_processing{server="deployment-memc04:11211:ascii:plain:notcompressed-1000"} 0
mcrouter_server_proxy_reqs_processing{server="deployment-memc05:11211:ascii:plain:notcompressed-1000"} 0
# HELP mcrouter_server_proxy_reqs_retrans_ratio Requests mcrouter started but that required retransmission.
# TYPE mcrouter_server_proxy_reqs_retrans_ratio gauge
mcrouter_server_proxy_reqs_retrans_ratio{server="deployment-memc04:11211:ascii:plain:notcompressed-1000"} 0
mcrouter_server_proxy_reqs_retrans_ratio{server="deployment-memc05:11211:ascii:plain:notcompressed-1000"} 0
# HELP mcrouter_server_proxy_reqs_waiting Requests queued up and not routed yet (per-server metric)
# TYPE mcrouter_server_proxy_reqs_waiting gauge
mcrouter_server_proxy_reqs_waiting{server="deployment-memc04:11211:ascii:plain:notcompressed-1000"} 0
mcrouter_server_proxy_reqs_waiting{server="deployment-memc05:11211:ascii:plain:notcompressed-1000"} 0
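To compare servers side by side, output like the above can be regrouped by its server label. The following is only a sketch: it handles just the single-label sample shape shown in this example rather than the full Prometheus text format, and the names SAMPLE_LINE, group_by_server and EXAMPLE are made up for illustration.

```python
from __future__ import annotations

import re
from collections import defaultdict

# Matches only the shape seen above: name{server="..."} value
SAMPLE_LINE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)'
    r'\{server="(?P<server>[^"]*)"\}\s+(?P<value>\S+)$'
)

# A few lines copied from the deployment-prep example.
EXAMPLE = """\
# TYPE mcrouter_server_memcached_found_count counter
mcrouter_server_memcached_found_count{server="deployment-memc04:11211:ascii:plain:notcompressed-1000"} 1.3666515e+07
mcrouter_server_memcached_found_count{server="deployment-memc05:11211:ascii:plain:notcompressed-1000"} 1.3620475e+07
# TYPE mcrouter_server_memcached_remote_error_count counter
mcrouter_server_memcached_remote_error_count{server="deployment-memc04:11211:ascii:plain:notcompressed-1000"} 0
mcrouter_server_memcached_remote_error_count{server="deployment-memc05:11211:ascii:plain:notcompressed-1000"} 15
"""


def group_by_server(text: str) -> dict[str, dict[str, float]]:
    """Return {server: {metric_name: value}} for the sample lines in text."""
    per_server: dict[str, dict[str, float]] = defaultdict(dict)
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_LINE.match(line)
        if match is None:
            continue  # not a single-label sample, ignore it
        per_server[match["server"]][match["name"]] = float(match["value"])
    return dict(per_server)


for server, values in group_by_server(EXAMPLE).items():
    print(server, values)
```

Grouped this way, the 15 remote errors on deployment-memc05 stand out against 0 on deployment-memc04, which is the kind of signal an aggregate-only view hides.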

@fgiunchedi let me know if the above new metrics (and code if you have time - https://github.com/Dev25/mcrouter_exporter/issues/9) are ok for you. I think that the above would add 15 * #-of-mw-hosts * #-of-mc-hosts new metrics to Prometheus:

  • 15 * 130 * 18 in eqiad (35100)
  • 15 * 157 * 18 in codfw (42390)

I am not sure whether it is ok to add all of the above metrics; I just worked out the number and it seems like a lot. Let me know!
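The estimate above can be reproduced with a few lines of Python. This is a sketch of the arithmetic only: the host counts are the ones quoted in this comment, 15 is the number of metric families in the deployment-prep example, and the function name new_series is made up for illustration.

```python
# Number of per-server metric families in the deployment-prep example.
METRICS_PER_SERVER = 15


def new_series(mw_hosts: int, mc_hosts: int,
               metrics: int = METRICS_PER_SERVER) -> int:
    """Each mw host exports every metric once per memcached server."""
    return metrics * mw_hosts * mc_hosts


eqiad = new_series(mw_hosts=130, mc_hosts=18)
codfw = new_series(mw_hosts=157, mc_hosts=18)
print(eqiad, codfw, eqiad + codfw)  # prints "35100 42390 77490"
```

Because the count is a product, adding a memcached host or a new per-server metric multiplies across every mw host, so the total grows quickly.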

Seems ok to me! That is certainly not a small number of metrics, but it seems we can get signal out of them, so go for it!

The PR is still waiting for the second upstream review. Since there is no real rush, I'd prefer to wait for the code to be merged before rolling it out on our hosts.

Mentioned in SAL (#wikimedia-operations) [2019-07-09T07:26:08Z] <elukey> upload prometheus-mcrouter-exporter 0.0.0+git20190709-1 to stretch-wikimedia - T225059

Mentioned in SAL (#wikimedia-operations) [2019-07-09T08:36:04Z] <elukey> upgrade prometheus-mcrouter-exporter to 0.0.0+git20190709-1 on mw-codfw (cumin alias) via debdeploy - T225059

Mentioned in SAL (#wikimedia-operations) [2019-07-09T08:49:24Z] <elukey> upgrade prometheus-mcrouter-exporter to 0.0.0+git20190709-1 on mw-eqiad (cumin alias) via debdeploy - T225059

Change 521442 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::prometheus::mcrouter_exporter: enable per-server metrics

https://gerrit.wikimedia.org/r/521442

Change 521442 merged by Elukey:
[operations/puppet@production] profile::prometheus::mcrouter_exporter: enable per-server metrics

https://gerrit.wikimedia.org/r/521442

Mentioned in SAL (#wikimedia-operations) [2019-07-09T09:13:05Z] <elukey> enable per-server metrics on all prometheus-mcrouter-exporter(s) via puppet - T225059

@aaron I added two new rows to https://grafana.wikimedia.org/dashboard/db/mcrouter with new per-shard metrics. Let me know what you think about it and if anything is missing.

The last row is interesting: it records the Memcached responses returned by each shard. I am wondering whether anything interesting is missing; if so, we could upgrade the exporter further. Note: sadly, get/set/cas/etc. breakdowns are not provided by mcrouter's per-shard/server metrics :(