
restbase should send metrics in batches
Closed, Resolved (Public)

Description

Looking at inbound statsd traffic, it appears RESTBase sends one metric per UDP datagram; it should instead batch multiple metric packets into a single datagram:

13:36:29.588068 IP restbase1009.eqiad.wmnet.43120 > graphite1001.eqiad.wmnet.8125: UDP, length 40
13:36:29.588071 IP restbase1009.eqiad.wmnet.43120 > graphite1001.eqiad.wmnet.8125: UDP, length 40
13:36:29.588319 IP restbase1001.eqiad.wmnet.40168 > graphite1001.eqiad.wmnet.8125: UDP, length 40
13:36:29.588328 IP restbase1001.eqiad.wmnet.40168 > graphite1001.eqiad.wmnet.8125: UDP, length 40
13:36:29.588560 IP restbase1007.eqiad.wmnet.36536 > graphite1001.eqiad.wmnet.8125: UDP, length 57
13:36:29.588564 IP restbase1007.eqiad.wmnet.36536 > graphite1001.eqiad.wmnet.8125: UDP, length 57
13:36:29.588565 IP restbase1007.eqiad.wmnet.36536 > graphite1001.eqiad.wmnet.8125: UDP, length 41
13:36:29.588580 IP restbase1007.eqiad.wmnet.36536 > graphite1001.eqiad.wmnet.8125: UDP, length 41
13:36:29.589080 IP restbase1001.eqiad.wmnet.40168 > graphite1001.eqiad.wmnet.8125: UDP, length 40
13:36:29.589087 IP restbase1001.eqiad.wmnet.40168 > graphite1001.eqiad.wmnet.8125: UDP, length 40
13:36:29.589436 IP restbase1003.eqiad.wmnet.35900 > graphite1001.eqiad.wmnet.8125: UDP, length 40
13:36:29.589440 IP restbase1003.eqiad.wmnet.35900 > graphite1001.eqiad.wmnet.8125: UDP, length 40
13:36:29.589465 IP restbase1003.eqiad.wmnet.35900 > graphite1001.eqiad.wmnet.8125: UDP, length 41
13:36:29.589469 IP restbase1003.eqiad.wmnet.35900 > graphite1001.eqiad.wmnet.8125: UDP, length 41
13:36:29.589588 IP restbase1009.eqiad.wmnet.43120 > graphite1001.eqiad.wmnet.8125: UDP, length 41
13:36:29.589590 IP restbase1009.eqiad.wmnet.43120 > graphite1001.eqiad.wmnet.8125: UDP, length 41
13:36:29.590016 IP restbase1005.eqiad.wmnet.49875 > graphite1001.eqiad.wmnet.8125: UDP, length 41
13:36:29.590021 IP restbase1005.eqiad.wmnet.49875 > graphite1001.eqiad.wmnet.8125: UDP, length 41
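For context, statsd accepts multiple newline-separated metrics in a single datagram, so the per-metric packets above could be coalesced into far fewer, larger ones. A minimal sketch of the idea, using plain Node dgram rather than the actual client, with metric names borrowed from later in this task:

// Minimal sketch (not the actual client): pack several statsd metrics
// into one UDP datagram instead of sending one datagram per metric.
const dgram = require('dgram');
const socket = dgram.createSocket('udp4');

const metrics = [
    'restbase.sys_table_-table-_.GET.200:14|ms',
    'restbase.sys_table_-table-_.GET.2xx:14|ms',
    'restbase.sys_table_-table-_.GET.ALL:14|ms'
];

// statsd parses newline-separated metrics, so one send() replaces three.
const payload = Buffer.from(metrics.join('\n'));
socket.send(payload, 0, payload.length, 8125, 'graphite1001.eqiad.wmnet', () => socket.close());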

Event Timeline

fgiunchedi raised the priority of this task from to Medium.
fgiunchedi updated the task description.
fgiunchedi added projects: SRE, Grafana, RESTBase.
fgiunchedi added subscribers: ori, fgiunchedi.

Created a PR for the statsd client we're using (https://github.com/sivy/node-statsd/pull/61) to support batching messages. If upstream doesn't respond in a reasonable amount of time, we'll just switch to our own fork.

I've added this capability to service-runner experimentally; it can be enabled by providing batch: true in the metrics config. The next step is to try it in production.
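For reference, roughly what the metrics config looks like with the flag turned on, expressed here as the equivalent JS object (in a deployment this lives in the service's config; key names other than batch are illustrative, and host/port are just the values from the dump above):

// Hypothetical service-runner metrics config fragment with experimental
// batching enabled; the batch flag is the new part.
const metricsConfig = {
    type: 'statsd',
    host: 'graphite1001.eqiad.wmnet',
    port: 8125,
    batch: true   // the experimental flag described above
};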

Addition is commutative. So why not coalesce counter updates within a batch? If a counter is incremented twice in a single update interval, send foo:2|c instead of foo:1|c twice.

@ori Great idea, though I'm not quite sure how that would work in production.

A batch contains at most 20 messages (the limit is based on an approximation of the average message length and a typical network MTU), and messages are delayed by no more than 1 second. In production use it's not clear whether there would be many opportunities to merge messages within a window of 20 messages / 1 second. In any case, I'll run an experiment tomorrow using RESTBase as an example and report the results here.
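A sketch of that flush policy under the stated assumptions (names are made up; this is not the node-statsd patch itself): buffer up to 20 metrics and send them as one newline-separated datagram, or flush earlier once a metric has been waiting for a second.

// Sketch of the batching policy described above (assumed names, not the
// actual patch): flush at 20 buffered metrics or after at most 1 second.
class MetricBatcher {
    constructor(send, maxMessages = 20, maxDelayMs = 1000) {
        this.send = send;              // e.g. a function wrapping socket.send()
        this.maxMessages = maxMessages;
        this.maxDelayMs = maxDelayMs;
        this.buffer = [];
        this.timer = null;
    }
    add(line) {
        this.buffer.push(line);
        if (this.buffer.length >= this.maxMessages) {
            this.flush();
        } else if (!this.timer) {
            this.timer = setTimeout(() => this.flush(), this.maxDelayMs);
        }
    }
    flush() {
        if (this.timer) { clearTimeout(this.timer); this.timer = null; }
        if (this.buffer.length) {
            this.send(this.buffer.join('\n'));   // one newline-separated datagram
            this.buffer = [];
        }
    }
}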

Counters are also rather rare. Most of our metrics are timers.

I've run some benchmarks with and without batching, and it makes a surprisingly large difference: on my laptop, for RESTBase's /title/{title} endpoint backed by SQLite, I get ~1135.91 req/s with batching and ~1056.86 req/s without, almost a 7.5% improvement in throughput. In real life the gain would probably not be as big, but the benchmarks are promising.

The patch for node-statsd has been reviewed upstream, and it's likely it will be accepted.

As for @ori's idea, I've implemented a very naive version of it and tried to measure the effect of counter merging using RESTBase under some sample load. I couldn't see any effect: there are far more timers firing between counters, so identical counters almost never end up in the same batch to be merged.
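Roughly what such a naive merge could look like (hypothetical helper, not the code that was actually tested): counters with the same name are summed before the batch is flushed, everything else passes through untouched.

// Hypothetical counter coalescing within a batch: two 'foo:1|c' lines
// become one 'foo:2|c'; timers and other metric types are left alone.
function coalesceCounters(lines) {
    const counts = new Map();
    const passthrough = [];
    for (const line of lines) {
        const match = /^(.+):(\d+)\|c$/.exec(line);
        if (match) {
            counts.set(match[1], (counts.get(match[1]) || 0) + Number(match[2]));
        } else {
            passthrough.push(line);
        }
    }
    const merged = [...counts].map(([name, value]) => name + ':' + value + '|c');
    return passthrough.concat(merged);
}

// coalesceCounters(['foo:1|c', 'bar:9|ms', 'foo:1|c']) -> ['bar:9|ms', 'foo:2|c']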

Thanks @Pchelolo! Another strategy I've seen is letting the user pick an MTU-based byte limit rather than a number of metrics. I think this would be more predictable, since metric length is affected by factors like the prefix, table name, etc. What do you think?

@fgiunchedi Hm, indeed that would work better. I'll amend the PR.
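A sketch of the MTU-based variant (assumed names again): instead of counting messages, flush whenever appending the next metric would push the datagram past a byte budget, so datagram size stays predictable regardless of prefix or table-name length.

// Byte-budget batching sketch: never let a datagram grow past maxBytes,
// and never split a metric across datagrams.
function makeByteLimitedBatcher(send, maxBytes = 1450) {
    let buffer = '';
    return {
        add(line) {
            // +1 accounts for the newline separator between metrics.
            if (buffer && buffer.length + 1 + line.length > maxBytes) {
                send(buffer);
                buffer = '';
            }
            buffer = buffer ? buffer + '\n' + line : line;
        },
        flush() {
            if (buffer) { send(buffer); buffer = ''; }
        }
    };
}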

A similar change was merged today (and reverted) in https://gerrit.wikimedia.org/r/#/c/267917/ , though the batched datagrams end in truncated partial metrics:

17:31:35.942170 IP restbase1001.eqiad.wmnet.56046 > graphite1001.eqiad.wmnet.8125: UDP, length 1533
E....1 .?...
@..
@ .......@.restbase.internal-update.sys_table_-table-_.GET.200:9|ms
restbase.internal-update.sys_table_-table-_.GET.2xx:9|ms
restbase.internal-update.sys_table_-table-_.GET.ALL:9|ms
restbase.internal-update.sys_key-rev-value_-bucket-_-key--_revision--_tid-.GET.200:9|ms
restbase.internal-update.sys_key-rev-value_-bucket-_-key--_revision--_tid-.GET.2xx:9|ms
restbase.internal-update.sys_key-rev-value_-bucket-_-key--_revision--_tid-.GET.ALL:9|ms
restbase.internal-update.sys_table_-table-_.GET.200:13|ms
restbase.internal-update.sys_table_-table-_.GET.2xx:13|ms
restbase.internal-update.sys_table_-table-_.GET.ALL:13|ms
restbase.internal-update.sys_key-rev-value_-bucket-_-key--_revision--_tid-.GET.200:13|ms
restbase.internal-update.sys_key-rev-value_-bucket-_-key--_revision--_tid-.GET.2xx:13|ms
restbase.internal-update.sys_key-rev-value_-bucket-_-key--_revision--_tid-.GET.ALL:13|ms
restbase.internal-update.sys_table_-table-_.GET.200:9|ms
restbase.internal-update.sys_table_-table-_.GET.2xx:9|ms
restbase.internal-update.sys_table_-table-_.GET.ALL:9|ms
restbase.internal-update.sys_page-revisions_page_-title--_revision-.GET.200:9|ms
restbase.internal-update.sys_page-revisions_page_-title--_revision-.GET.2xx:9|ms
restbase.internal-update.sys_page-revisions_page_-title--_revision-.GET.ALL:9|ms
restbase.internal-update.sys_parsoid_html_-title--_revision--_tid-.GET.200:18|ms
restbase.internal-update.sys_parsoid_html_-title--_revision--_tid-.GET.2xx:18|ms
restbase.internal-up
17:31:35.955580 IP restbase1001.eqiad.wmnet.55474 > graphite1001.eqiad.wmnet.8125: UDP, length 1513
E....4 .?...
@..
@ ........9restbase.sys_parsoid_html_-title--_revision--_tid-.GET.200:19|ms
restbase.sys_parsoid_html_-title--_revision--_tid-.GET.2xx:19|ms
restbase.sys_parsoid_html_-title--_revision--_tid-.GET.ALL:19|ms
restbase.v1_page_html_-title-_-revision--_tid-.GET.200:20|ms
restbase.v1_page_html_-title-_-revision--_tid-.GET.2xx:20|ms
restbase.v1_page_html_-title-_-revision--_tid-.GET.ALL:20|ms
restbase.ALL.ALL:20|ms
restbase.GET.200:20|ms
restbase.sys_table_-table-_.GET.200:14|ms
restbase.sys_table_-table-_.GET.2xx:14|ms
restbase.sys_table_-table-_.GET.ALL:14|ms
restbase.sys_key-rev-value_-bucket-_-key--_revision--_tid-.GET.200:14|ms
restbase.sys_key-rev-value_-bucket-_-key--_revision--_tid-.GET.2xx:14|ms
restbase.sys_key-rev-value_-bucket-_-key--_revision--_tid-.GET.ALL:14|ms
restbase.sys_table_-table-_.GET.200:6|ms
restbase.sys_table_-table-_.GET.2xx:6|ms
restbase.sys_table_-table-_.GET.ALL:6|ms
restbase.sys_page-revisions_page_-title--_revision-.GET.200:6|ms
restbase.sys_page-revisions_page_-title--_revision-.GET.2xx:6|ms
restbase.sys_page-revisions_page_-title--_revision-.GET.ALL:6|ms
restbase.sys_parsoid_data-parsoid_-title--_revision--_tid-.GET.200:20|ms
restbase.sys_parsoid_data-parsoid_-title--_revision--_tid-.GET.2xx:20|ms
restbase.sys_parsoid_data-parsoid_-title--_revision--_tid-.GET.ALL:20|ms
restbase.v1_page_data-parsoid_-title-_-revision--_tid-.GET.200:22|ms
restbase.v1_page_data-parsoid_-title-_-revision--_tid-.GET.2xx:22|ms
restbase.v1_page_data-parsoi
17:31:36.008280 IP restbase1001.eqiad.wmnet.56046 > graphite1001.eqiad.wmnet.8125: UDP, length 1532
E....A .?...
@..
@ .........restbase.internal-update.sys_table_-table-_.GET.200:7|ms
restbase.internal-update.sys_table_-table-_.GET.2xx:7|ms
restbase.internal-update.sys_table_-table-_.GET.ALL:7|ms
restbase.internal-update.sys_page-revisions_page_-title--_revision-.GET.200:7|ms
restbase.internal-update.sys_page-revisions_page_-title--_revision-.GET.2xx:7|ms
restbase.internal-update.sys_page-revisions_page_-title--_revision-.GET.ALL:7|ms
restbase.internal-update.sys_parsoid_data-parsoid_-title--_revision--_tid-.GET.200:20|ms
restbase.internal-update.sys_parsoid_data-parsoid_-title--_revision--_tid-.GET.2xx:20|ms
restbase.internal-update.sys_parsoid_data-parsoid_-title--_revision--_tid-.GET.ALL:20|ms
restbase.internal-update.sys_parsoid_pagebundle_-title--_revision--_tid-.POST.200:471|ms
restbase.internal-update.sys_parsoid_pagebundle_-title--_revision--_tid-.POST.2xx:471|ms
restbase.internal-update.sys_parsoid_pagebundle_-title--_revision--_tid-.POST.ALL:471|ms
restbase.internal-update.sys_parsoid_generateAndSave.unchanged_rev_render:1|c
restbase.internal-update.sys_parsoid_html_-title--_revision--_tid-.GET.200:517|ms
restbase.internal-update.sys_parsoid_html_-title--_revision--_tid-.GET.2xx:517|ms
restbase.internal-update.sys_parsoid_html_-title--_revision--_tid-.GET.ALL:517|ms
restbase.internal-update.ALL.ALL:519|ms
restbase.internal-update.GET.200:519|ms
restbase.internal-update.sys_parsoid_pagebundle_-title--_revision--_tid-.POST.200:1406|ms
restbase.internal-update.sys_p
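For illustration only (this is an assumption about the failure mode, not an analysis of the reverted change): output like the above is what you get when a joined buffer is cut to a fixed byte length at send time instead of being flushed before a metric would overflow it.

// Illustration of the truncation failure mode (assumed, not the actual code):
const metrics = [
    'restbase.internal-update.sys_table_-table-_.GET.200:7|ms',
    'restbase.internal-update.sys_table_-table-_.GET.2xx:7|ms',
    'restbase.internal-update.sys_table_-table-_.GET.ALL:7|ms'
];
const joined = metrics.join('\n');
const truncated = joined.slice(0, 100);   // ends mid-metric, as in the dump
// A correct batcher flushes *before* appending a metric that would exceed
// the budget (see the byte-limited sketch above), so only whole metrics
// ever end up in a datagram.
console.log(truncated);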

Instead of going with our own patched client, we decided to go with hot-shots, which supports batching out of the box. However, we have just discovered a bug in its code that we believe is responsible for the message fragmentation. @Pchelolo is going to create a PR to fix it.
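For reference, enabling batching with hot-shots looks roughly like this (host and prefix values are illustrative, taken from this task; maxBufferSize is the option tuned in the follow-up below):

// Sketch of a hot-shots client with buffering enabled (illustrative values).
const StatsD = require('hot-shots');
const statsd = new StatsD({
    host: 'graphite1001.eqiad.wmnet',
    port: 8125,
    prefix: 'restbase.',
    maxBufferSize: 1450   // batch metrics up to roughly one MTU-sized datagram
});

statsd.timing('sys_table_-table-_.GET.200', 14);
statsd.increment('sys_parsoid_generateAndSave.unchanged_rev_render');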

Created a PR for the hot-shots library, https://github.com/brightcove/hot-shots/pull/14 , and the maintainer of that package is really fast: the PR has already been merged!

Also, we need to decrease our default maxBufferSize option from 1500 to 1450 to account for UDP and IP header sizes. Created a PR for that in service-runner: https://github.com/wikimedia/service-runner/pull/81
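For the record, the arithmetic behind that number (1450 is a conservative choice rather than an exact bound):

// A 1500-byte Ethernet MTU minus the IPv4 header (20 bytes, no options)
// and the UDP header (8 bytes) leaves 1472 bytes of payload; 1450 keeps
// some extra headroom below that.
const mtu = 1500;
const maxPayload = mtu - 20 - 8;   // 1472
console.log(maxPayload, '>=', 1450);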

Change 268611 had a related patch set uploaded (by Mobrovac):
Revert "Revert "RESTBase: enable metrics batching""

https://gerrit.wikimedia.org/r/268611

Change 268611 merged by Filippo Giunchedi:
Revert "Revert "RESTBase: enable metrics batching""

https://gerrit.wikimedia.org/r/268611

Batching has been enabled in production.

@Eevans, @fgiunchedi: Is everything still looking good / can this be considered done?

@GWicke AFAICT yes, graphite is receiving ~8k fewer UDP packets/s since the switch.

As a bonus, the implementation is in the service-runner library used by all our node services, so batching can now be enabled for all of those.