Better Benthos performances
Open, Needs Triage · Public

Assigned To

Authored By

	Fabfur
	Mar 19 2024, 5:09 PM

Description

Currently, for about 2 minutes of live requests received by a (test) host, Benthos takes ~15 minutes to spool all messages to Kafka:

https://grafana.wikimedia.org/goto/2vGIah1Iz?orgId=1

Benthos performance should definitely be improved, starting with identifying the bottlenecks (if any).
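As a rough back-of-the-envelope reading of the figures above: if about 2 minutes of traffic takes about 15 minutes to reach Kafka, the pipeline is sustaining only a fraction of the incoming rate. The sketch below assumes a steady input rate and treats both durations as approximate; the function name is illustrative.

```python
def required_speedup(ingest_seconds: float, drain_seconds: float) -> float:
    """How many times faster the pipeline must run to keep up with the input."""
    return drain_seconds / ingest_seconds


# Figures from the description: ~2 minutes of requests, ~15 minutes to spool.
speedup = required_speedup(ingest_seconds=2 * 60, drain_seconds=15 * 60)

# Fraction of the incoming rate the pipeline currently sustains.
throughput_fraction = 1 / speedup

print(f"required speedup: {speedup:.1f}x")
print(f"sustained fraction of input rate: {throughput_fraction:.1%}")
```

Under these assumptions Benthos would need to be about 7.5 times faster just to keep pace with this host.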

Details

Subject	Repo	Branch	Lines +/-
benthos:cache: switch debug endpoints off for ulsfo	operations/puppet	production	+25 -1
cache:benthos: move processors in the pipeline section	operations/puppet	production	+11 -11
benthos: moved batching as close to the input as possible	operations/puppet	production	+2 -5
benthos: switch to unix socket for performance testing	operations/puppet	production	+3 -3
benthos: enabled batching policy for memory buffer too	operations/puppet	production	+4 -5
benthos: add simple kafka_franz batching	operations/puppet	production	+2 -0


Related Objects

Status	Assigned	Task
In Progress	Fabfur	T351117 Move analytics log from Varnish to HAProxy
In Progress	Fabfur	T358109 Install new Benthos instance on cp hosts
Open	Fabfur	T360454 Better Benthos performances
Open	Fabfur	T364379 Benthos loses messages when under high load
Open	Fabfur	T365968 Install benthos on single esams host to check performances under higher load
Resolved	Fabfur	T365566 HAProxy should not log information we don't actually need
Open	Fabfur	T365718 Switch HAProxy/Benthos to rfc5424
Open	Fabfur	T366031 Upgrade Benthos package on cp hosts
Open	Fabfur	T367756 Upgrade hosts to haproxy 2.8.10
Resolved	Vgutierrez	T367963 Investigate increase in CD termination state after upgrading eqsin/ulsfo to HAProxy 2.8.10

Event Timeline

Fabfur created this task. Mar 19 2024, 5:09 PM

Restricted Application removed a project: Patch-For-Review. Mar 19 2024, 5:09 PM

Change 1012724 had a related patch set uploaded (by Fabfur; author: Fabfur):

[operations/puppet@production] benthos: add simple kafka_franz batching

https://gerrit.wikimedia.org/r/1012724

gerritbot added a project: Patch-For-Review. Mar 19 2024, 5:10 PM

Change 1012724 merged by Fabfur:

[operations/puppet@production] benthos: add simple kafka_franz batching

https://gerrit.wikimedia.org/r/1012724

Change 1012756 had a related patch set uploaded (by Fabfur; author: Fabfur):

[operations/puppet@production] benthos: enabled batching policy for memory buffer too

https://gerrit.wikimedia.org/r/1012756

Change 1012756 merged by Fabfur:

[operations/puppet@production] benthos: enabled batching policy for memory buffer too

https://gerrit.wikimedia.org/r/1012756

Change 1012790 had a related patch set uploaded (by Fabfur; author: Fabfur):

[operations/puppet@production] benthos: switch to unix socket for performance testing

https://gerrit.wikimedia.org/r/1012790

Change 1012790 merged by Fabfur:

[operations/puppet@production] benthos: switch to unix socket for performance testing

https://gerrit.wikimedia.org/r/1012790

Change 1013040 had a related patch set uploaded (by Fabfur; author: Fabfur):

[operations/puppet@production] benthos: moved batching as close to the input as possible

https://gerrit.wikimedia.org/r/1013040

Change 1013040 merged by Fabfur:

[operations/puppet@production] benthos: moved batching as close to the input as possible

https://gerrit.wikimedia.org/r/1013040

lbowmaker moved this task from Incoming (new tickets) to Radar (External Teams) on the Data-Engineering board. Mar 22 2024, 4:19 PM

colewhite subscribed. Mar 27 2024, 6:46 PM

CDanis subscribed. May 8 2024, 4:10 PM

Change #1029480 had a related patch set uploaded (by Fabfur; author: Fabfur):

[operations/puppet@production] cache:benthos: move processors in the pipeline section

https://gerrit.wikimedia.org/r/1029480

Change #1029480 merged by Fabfur:

[operations/puppet@production] cache:benthos: move processors in the pipeline section

https://gerrit.wikimedia.org/r/1029480

Change #1034864 had a related patch set uploaded (by Fabfur; author: Fabfur):

[operations/puppet@production] benthos:cache: switch debug endpoints off for ulsfo

https://gerrit.wikimedia.org/r/1034864

Change #1034864 merged by Fabfur:

[operations/puppet@production] benthos:cache: switch debug endpoints off for ulsfo

https://gerrit.wikimedia.org/r/1034864

Fabfur closed subtask T365566: HAProxy should not log information we don't actually need as Resolved. May 28 2024, 8:11 PM

Mentioned in SAL (#wikimedia-operations) [2024-06-12T09:48:27Z] <fabfur> disabling puppet on cp4037 to test benthos configuration (T360454)

Update on Benthos performance.

To compare Benthos (now Redpanda Connect) with some tools we already use, I've collected some data from cp4037:

NOTE: the Benthos CPU/memory utilization metrics, as well as the HAProxy ones, obviously vary with the host traffic. The examples below are taken over different time ranges because they come from "custom" configurations on hosts that are live in production and can't be kept in that state for too long. I'm aware it's not the best way to gather this data, but I chose time windows that are significant for our goals.

Test #1 (current setup in ulsfo): HAProxy sends logs to the Benthos socket using the rfc3164 format; Benthos parses them using grok, publishes metrics from the collected data, and sends formatted documents to the Kafka cluster.

In this case, while Benthos uses "just" ~143 MB of memory on average (compared to 319 MB for VarnishKafka), mtail does much better with only ~40 MB. CPU utilization is the real pain point here, with Benthos being the worst: 1.05 seconds of CPU usage on average, compared to 394 ms for HAProxy (included for comparison), 69.5 ms for mtail and 41.2 ms for VarnishKafka.

Link to the grafana dashboard
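For a sense of scale, the Test #1 averages can be expressed as ratios against Benthos. This is a minimal sketch that uses only the figures quoted above; the dictionary and function names are illustrative.

```python
# Average figures quoted for Test #1 on cp4037 (CPU time in ms, memory in MB).
cpu_ms = {"benthos": 1050.0, "haproxy": 394.0, "mtail": 69.5, "varnishkafka": 41.2}
mem_mb = {"benthos": 143.0, "varnishkafka": 319.0, "mtail": 40.0}


def ratios_vs(baseline: str, samples: dict[str, float]) -> dict[str, float]:
    """Return baseline / other for every other entry, rounded to one decimal."""
    base = samples[baseline]
    return {
        name: round(base / value, 1)
        for name, value in samples.items()
        if name != baseline
    }


cpu_ratios = ratios_vs("benthos", cpu_ms)
mem_ratios = ratios_vs("benthos", mem_mb)

print("CPU, Benthos vs others:", cpu_ratios)
print("Memory, Benthos vs others:", mem_ratios)
```

With these numbers, Benthos uses roughly 2.7 times the CPU time of HAProxy, about 15 times that of mtail and about 25 times that of VarnishKafka, while using less than half the memory of VarnishKafka.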

Test #2: HAProxy sends logs to the Benthos socket using the rfc3164 format; Benthos parses them using grok and publishes metrics, but does not send logs to Kafka:

Even when messages are dropped, the memory stats are comparable. The CPU stats are now much better (on average) for Benthos, but they are still well above HAProxy's, and even more so compared to VarnishKafka and mtail.

Test #3: HAProxy sends logs to the Benthos socket using the rfc5424 format, so there's no need to parse the log lines with grok to gather all the fields. Messages are then dropped rather than sent to Kafka.

While this has some issues (mainly due to some Benthos and HAProxy bugs), it shows better overall performance for Benthos, especially for CPU. Its CPU usage is now consistently lower than HAProxy's by ~70 ms, although still worse than mtail or VarnishKafka, which are an order of magnitude lower than Benthos in this case.
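To illustrate why rfc5424 can make grok unnecessary: assuming HAProxy carries the fields as RFC 5424 structured data (an assumption, since the ticket does not show the actual log line), each field arrives as a self-describing name="value" pair, so one generic parser covers any set of fields. The sketch below is plain Python, not Benthos configuration, and the SD-ID and field names in the sample are hypothetical.

```python
import re

# One SD-PARAM per RFC 5424: name="value"; '"', '\' and ']' are
# backslash-escaped inside the value.
_SD_PARAM = re.compile(r'([^\s=\]"]+)="((?:[^"\\]|\\.)*)"')
_ESCAPED = re.compile(r'\\(["\\\]])')


def parse_sd_element(element: str) -> tuple[str, dict[str, str]]:
    """Parse one '[SD-ID name="value" ...]' element into (sd_id, params)."""
    if not (element.startswith("[") and element.endswith("]")):
        raise ValueError("not a structured-data element")
    sd_id, _, rest = element[1:-1].partition(" ")
    params = {
        name: _ESCAPED.sub(r"\1", value)
        for name, value in _SD_PARAM.findall(rest)
    }
    return sd_id, params


# Hypothetical sample; the real SD-ID and field names may differ.
sample = '[request@0 status="200" method="GET" uri="/wiki/Main_Page"]'
sd_id, fields = parse_sd_element(sample)
print(sd_id, fields)
```

Adding or removing a logged field changes only the emitted pairs, whereas a positional rfc3164 message needs its grok pattern updated to match the new layout.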

Mentioned in SAL (#wikimedia-operations) [2024-06-13T10:09:59Z] <fabfur> cp4037 depooled && puppet disable to profile benthos configuration (T360454)

Mentioned in SAL (#wikimedia-operations) [2024-06-13T11:57:05Z] <fabfur> enabling puppet && repool cp4037 (T360454)

	F55264503: image.png
	Thu, Jun 13, 9:16 AM

	F55264501: image.png
	Thu, Jun 13, 9:16 AM

	F55264461: image.png
	Thu, Jun 13, 9:16 AM

	F55264342: image.png
	Thu, Jun 13, 9:16 AM
