
Better Benthos performances
Open, Needs Triage, Public

Assigned To
Authored By
Fabfur
Mar 19 2024, 5:09 PM
Referenced Files
F55264503: image.png
Thu, Jun 13, 9:16 AM
F55264501: image.png
Thu, Jun 13, 9:16 AM
F55264461: image.png
Thu, Jun 13, 9:16 AM
F55264342: image.png
Thu, Jun 13, 9:16 AM
F55264269: image.png
Thu, Jun 13, 9:16 AM
F55264258: image.png
Thu, Jun 13, 9:16 AM

Description

Currently, on a test host that received live requests for about 2 minutes, Benthos takes ~15 minutes to spool all the messages to Kafka:

https://grafana.wikimedia.org/goto/2vGIah1Iz?orgId=1

Benthos performance should definitely be improved, starting with identifying the bottlenecks (if any).
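
As a rough reading of the figures in this description, the sketch below turns the two durations into a slowdown factor. It assumes that the ~15 minutes are spent draining exactly the messages received during the ~2 minutes of traffic and that both rates are constant; the function name `spool_stats` is made up for illustration.

```python
def spool_stats(capture_minutes: float, spool_minutes: float) -> dict:
    """Relate how long traffic was received to how long it took to drain it."""
    slowdown = spool_minutes / capture_minutes
    return {
        # Minutes of draining needed per minute of received traffic.
        "slowdown": slowdown,
        # Output rate expressed as a fraction of the input rate.
        "relative_throughput": 1 / slowdown,
    }


# Figures from the task description: ~2 minutes of traffic, ~15 minutes to spool.
stats = spool_stats(capture_minutes=2, spool_minutes=15)
print(f"slowdown: {stats['slowdown']:.1f}x")
print(f"output rate is {stats['relative_throughput']:.1%} of the input rate")
```

Under those assumptions the output side runs at roughly one seventh to one eighth of the input rate, which is the gap any tuning has to close.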

Event Timeline

Change 1012724 had a related patch set uploaded (by Fabfur; author: Fabfur):

[operations/puppet@production] benthos: add simple kafka_franz batching

https://gerrit.wikimedia.org/r/1012724

Change 1012724 merged by Fabfur:

[operations/puppet@production] benthos: add simple kafka_franz batching

https://gerrit.wikimedia.org/r/1012724

Change 1012756 had a related patch set uploaded (by Fabfur; author: Fabfur):

[operations/puppet@production] benthos: enabled batching policy for memory buffer too

https://gerrit.wikimedia.org/r/1012756

Change 1012756 merged by Fabfur:

[operations/puppet@production] benthos: enabled batching policy for memory buffer too

https://gerrit.wikimedia.org/r/1012756

Change 1012790 had a related patch set uploaded (by Fabfur; author: Fabfur):

[operations/puppet@production] benthos: switch to unix socket for performance testing

https://gerrit.wikimedia.org/r/1012790

Change 1012790 merged by Fabfur:

[operations/puppet@production] benthos: switch to unix socket for performance testing

https://gerrit.wikimedia.org/r/1012790

Change 1013040 had a related patch set uploaded (by Fabfur; author: Fabfur):

[operations/puppet@production] benthos: moved batching as close to the input as possible

https://gerrit.wikimedia.org/r/1013040

Change 1013040 merged by Fabfur:

[operations/puppet@production] benthos: moved batching as close to the input as possible

https://gerrit.wikimedia.org/r/1013040

Change #1029480 had a related patch set uploaded (by Fabfur; author: Fabfur):

[operations/puppet@production] cache:benthos: move processors in the pipeline section

https://gerrit.wikimedia.org/r/1029480

Change #1029480 merged by Fabfur:

[operations/puppet@production] cache:benthos: move processors in the pipeline section

https://gerrit.wikimedia.org/r/1029480

Change #1034864 had a related patch set uploaded (by Fabfur; author: Fabfur):

[operations/puppet@production] benthos:cache: switch debug endpoints off for ulsfo

https://gerrit.wikimedia.org/r/1034864

Change #1034864 merged by Fabfur:

[operations/puppet@production] benthos:cache: switch debug endpoints off for ulsfo

https://gerrit.wikimedia.org/r/1034864

Mentioned in SAL (#wikimedia-operations) [2024-06-12T09:48:27Z] <fabfur> disabling puppet on cp4037 to test benthos configuration (T360454)

Update on Benthos performance.

To compare Benthos (now Redpanda Connect) with some tools we already use, I've collected some data from cp4037:

NOTE: the Benthos CPU/memory utilization metrics, as well as the HAProxy ones, obviously vary with host traffic. The examples below are taken over different time ranges because they are the result of "custom" configurations on hosts that are live in production and can't be kept in that state for long. I'm aware it's not the best way to gather this data, but I chose time windows that are significant for our goals.
  • Test #1 (current setup in ulsfo): HAProxy sends logs to the Benthos socket in rfc3164 format; Benthos parses them with grok, publishes metrics from the collected data, and sends formatted documents to the Kafka cluster.

image.png (421×1 px, 58 KB)

image.png (729×1 px, 100 KB)

In this case Benthos uses "just" ~143 MB of memory on average (compared to 319 MB for VarnishKafka), but mtail does much better with only ~40 MB. CPU utilization is the real pain point, with Benthos the worst of the group: 1.05 seconds of CPU usage on average, compared to 394 ms for HAProxy (included for comparison), 69.5 ms for mtail and 41.2 ms for VarnishKafka.

Link to the grafana dashboard
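
To put the Test #1 averages side by side, the following sketch expresses Benthos CPU and memory usage as multiples of the other tools. The numbers are the ones quoted above, in the units reported there; the helper name `ratios_to` is made up for illustration.

```python
# Average CPU usage (in ms) and memory (in MB) quoted for Test #1.
cpu_ms = {"Benthos": 1050.0, "HAProxy": 394.0, "mtail": 69.5, "VarnishKafka": 41.2}
mem_mb = {"Benthos": 143.0, "VarnishKafka": 319.0, "mtail": 40.0}


def ratios_to(baseline: str, values: dict) -> dict:
    """Return baseline / other for every other entry in `values`."""
    base = values[baseline]
    return {name: base / v for name, v in values.items() if name != baseline}


cpu_ratios = ratios_to("Benthos", cpu_ms)
mem_ratios = ratios_to("Benthos", mem_mb)

for name, ratio in cpu_ratios.items():
    print(f"Benthos CPU is {ratio:.1f}x {name}")
for name, ratio in mem_ratios.items():
    print(f"Benthos memory is {ratio:.2f}x {name}")
```

A ratio above 1 means Benthos uses more of that resource than the other tool; a ratio below 1 (as for memory against VarnishKafka) means it uses less.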

  • Test #2: HAProxy sends logs to the Benthos socket in rfc3164 format; Benthos parses them with grok and publishes metrics, but does not send logs to Kafka:

image.png (720×1 px, 83 KB)

image.png (703×1 px, 108 KB)

Even when messages are dropped, the memory stats are comparable. The CPU stats for Benthos are now much better (on average), but they are still well above HAProxy, and even further above VarnishKafka and mtail.

  • Test #3: HAProxy sends logs to the Benthos socket in rfc5424 format, so there's no need to parse the log lines with grok to extract the fields. Messages are then dropped.

image.png (737×1 px, 77 KB)

image.png (738×1 px, 84 KB)

While this setup has some issues (mainly due to some Benthos and HAProxy bugs), it shows better overall performance for Benthos, especially for CPU. Benthos is now consistently lower than HAProxy by ~70 ms, although still worse than mtail and VarnishKafka, which are an order of magnitude lower than Benthos in this case.
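
To illustrate why the rfc5424 variant needs less parsing work, here is a minimal sketch of the two approaches. It is not the Benthos or HAProxy configuration used on cp4037: both log lines, the field names and the `req@32473` structured-data ID are hypothetical. With rfc3164 the fields sit in free text, so a grok-like pattern must know the position and shape of each one; with rfc5424 they can travel as self-describing key/value pairs in the structured-data element.

```python
import re

# Hypothetical sample lines, NOT taken from production; field names are invented.
RFC3164_LINE = "<134>Jun 13 09:16:00 cp4037 haproxy[1234]: 203.0.113.7 GET /wiki/Main_Page 200 12"
RFC5424_LINE = (
    "<134>1 2024-06-13T09:16:00Z cp4037 haproxy 1234 - "
    '[req@32473 client="203.0.113.7" method="GET" path="/wiki/Main_Page" status="200" ms="12"] -'
)

# rfc3164: one format-specific pattern that encodes every field position.
RFC3164_PATTERN = re.compile(
    r"^<(?P<pri>\d+)>(?P<ts>\w{3} [ \d]\d \d{2}:\d{2}:\d{2}) (?P<host>\S+) "
    r"(?P<tag>[^:\[]+)(?:\[(?P<pid>\d+)\])?: "
    r"(?P<client>\S+) (?P<method>\S+) (?P<path>\S+) (?P<status>\d{3}) (?P<ms>\d+)$"
)

# rfc5424: a fixed header followed by generic key="value" pairs.
RFC5424_HEADER = re.compile(
    r"^<(?P<pri>\d+)>(?P<version>\d+) (?P<ts>\S+) (?P<host>\S+) "
    r"(?P<app>\S+) (?P<procid>\S+) (?P<msgid>\S+) "
)
SD_PARAM = re.compile(r'(\w+)="((?:[^"\\\]]|\\.)*)"')


def parse_rfc3164(line: str) -> dict:
    """Extract fields with a pattern tied to this exact message layout."""
    match = RFC3164_PATTERN.match(line)
    if match is None:
        raise ValueError("line does not match the expected layout")
    return match.groupdict()


def parse_rfc5424(line: str) -> dict:
    """Extract header fields, then collect whatever key/value pairs are present."""
    match = RFC5424_HEADER.match(line)
    if match is None:
        raise ValueError("line does not look like rfc5424")
    fields = match.groupdict()
    fields.update(dict(SD_PARAM.findall(line[match.end():])))
    return fields


legacy = parse_rfc3164(RFC3164_LINE)
structured = parse_rfc5424(RFC5424_LINE)
print(legacy["status"], structured["status"])
```

Adding or reordering a field in the structured line requires no change to `parse_rfc5424`, whereas the rfc3164 pattern has to be edited each time the message layout changes.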

Mentioned in SAL (#wikimedia-operations) [2024-06-13T10:09:59Z] <fabfur> cp4037 depooled && puppet disable to profile benthos configuration (T360454)

Mentioned in SAL (#wikimedia-operations) [2024-06-13T11:57:05Z] <fabfur> enabling puppet && repool cp4037 (T360454)