
Benthos loses messages when under high load
Open, Needs Triage, Public

Description

Umbrella task to track changes related to this issue:

We noticed some dropped UDP packets (from /proc/net/udp) on the Benthos socket used by HAProxy (cp hosts) and nginx (ncredir hosts). This is somewhat expected when Benthos cannot keep up with a large volume of requests.

An ideal setup should be able to cope with 150k reqs/s (the spikes experienced during large DDoS attacks), perhaps allowing some spooling time, but without dropping any messages.

Some notes (a summary of the last few days' conversation on this topic):

  • Dropped packets/messages aren't tracked in Benthos, as they are discarded "silently" at the OS level, so we need a way to track them and eventually trigger alerts (see the sketch after this list)
  • Switching to UDS for Benthos is a quick fix to avoid dropped messages, but:
    • It must be done at the systemd level (passing the unixgram socket as stdin to Benthos) or by patching the Benthos input to support this.
    • When the UDS buffer is full, it blocks the writer, something we definitely want to avoid.
  • Interposing some kind of buffer (e.g. fifo-log-demux) between the producer and Benthos could definitely help absorb high request spikes. Plus, the new release exposes some metrics about discarded messages.
  • We experience buffer overruns in VarnishKafka too (journalctl -u varnishkafka-webrequest.service --grep "Log overrun"). Does this mean we're already losing some messages with the current setup as well?
  • A nice feature to have would be the ability to "tap" into the current setup and get some insight into the logs before (or after) they are processed by Benthos, in realtime. This is especially useful for debugging. With Benthos this is technically feasible, but we noticed a significant performance drop when doing so. With fifo-log-demux it should be as easy as connecting another client to the "read end".
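
As a starting point for tracking those silent drops, a minimal sketch (the port 10514 is an illustrative assumption, not our actual socket):

```sh
# The last column of /proc/net/udp is the per-socket drop counter;
# it only ever increases, so an alert can watch its delta over time.
awk 'NR > 1 { printf "local=%s drops=%s\n", $2, $NF }' /proc/net/udp

# Equivalent view via ss: the d<count> field inside skmem is the same counter.
ss -u -a -m 'sport = :10514'
```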

Event Timeline

Change #1028791 had a related patch set uploaded (by Fabfur; author: Fabfur):

[operations/puppet@production] cache: Use fifo-log-demux between haproxy and benthos

https://gerrit.wikimedia.org/r/1028791

@CDanis helped me a lot in this direction and found a workaround/solution for this specific issue: optimizing the Benthos configuration and using a higher UDP receive buffer to mitigate very high concurrent request volumes.

This can be summarized as:

  • Move all processing currently in the input section of the Benthos configuration to the pipeline section (apparently, the buffer values defined in the configuration don't apply to input-level processors):
```diff
--- local.yaml.orig	2024-05-08 23:29:53.201744352 +0200
+++ local.yaml	2024-05-08 23:31:12.845519835 +0200
@@ -20,17 +20,6 @@
   #     cert_file: ""
   #     key_file: ""
   #     self_signed: false
-  processors:
-    - label: "syslog_format"
-      parse_log:
-        format: "syslog_rfc3164"
-        codec: "json"
-    - label: "drop_unparsable_messages"
-      mapping: |
-        root = if this.message.contains("SSL handshake failure").catch(false) || this.message.contains("<BADREQ> <BADREQ> <BADREQ>").catch(false)
-        {
-          deleted()
-        }
 
 buffer:
  memory:
@@ -43,6 +32,16 @@
 pipeline:
   threads: -1
   processors:
+    - label: "syslog_format"
+      parse_log:
+        format: "syslog_rfc3164"
+        codec: "json"
+    - label: "drop_unparsable_messages"
+      mapping: |
+        root = if this.message.contains("SSL handshake failure").catch(false) || this.message.contains("<BADREQ> <BADREQ> <BADREQ>").catch(false)
+        {
+          deleted()
+        }
     - label: "drop_first_part"
       mapping: |
        root = this
```
  • Use a higher memory buffer limit in the Benthos configuration (buffer:memory:limit: 268435456, in bytes, i.e. 256 MiB)
  • Use systemd socket activation to define the Benthos socket (Benthos will read from stdin) and set the ReceiveBuffer option to a very high value:
ReceiveBuffer=536870912
  • Set the net.core.rmem_max sysctl to the same value (a combined sketch follows this list)
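
Putting the three settings together, a minimal sketch (file names and the listen address are illustrative assumptions; the buffer values are the ones quoted above):

```yaml
# local.yaml (Benthos): enlarge the in-memory buffer to 256 MiB.
buffer:
  memory:
    limit: 268435456
```

```ini
# benthos.socket (systemd): hand the datagram socket to Benthos as stdin.
[Socket]
ListenDatagram=127.0.0.1:10514   # illustrative address
ReceiveBuffer=536870912          # 512 MiB kernel receive buffer
```

```sh
# net.core.rmem_max normally caps socket receive buffers, so raise it to match.
sysctl -w net.core.rmem_max=536870912
```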

This way, even when feeding ~150k messages/s into the Benthos UDP socket, no datagrams are lost (confirmed by the Benthos internal metrics benthos_input_received and benthos_output_sent, and by the UDP socket stats).
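
To reproduce that check, a hedged sketch (this assumes the Benthos HTTP server on its default port 4195; combine with the socket-stats check shown earlier):

```sh
# The input and output counters should grow at the same rate under load.
curl -s http://localhost:4195/metrics | grep -E 'benthos_(input_received|output_sent)'
```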

I think this could be interesting for @fgiunchedi and @Vgutierrez, thanks again @CDanis

A little addendum: the latest Benthos release introduced the auto_replay_nacks configuration parameter in (all?) inputs: https://www.benthos.dev/docs/components/inputs/socket_server#auto_replay_nacks

Quoting:

Whether messages that are rejected (nacked) at the output level should be automatically replayed indefinitely, eventually resulting in back pressure if the cause of the rejections is persistent. If set to false these messages will instead be deleted. Disabling auto replays can greatly improve memory efficiency of high throughput streams as the original shape of the data can be discarded immediately upon consumption and mutation.

This could be useful in our scenario, but so far I haven't seen any noticeable change when setting it to false in my own tests.
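
For reference, a minimal sketch of where the parameter sits (the network and address values are illustrative, not our production config):

```yaml
input:
  socket_server:
    network: udp
    address: 0.0.0.0:10514     # illustrative listen address
    auto_replay_nacks: false   # delete nacked messages instead of replaying them
```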

Change #1029480 had a related patch set uploaded (by Fabfur; author: Fabfur):

[operations/puppet@production] cache:benthos: move processors in the pipeline section

https://gerrit.wikimedia.org/r/1029480

Change #1029508 had a related patch set uploaded (by Fabfur; author: Fabfur):

[operations/puppet@production] cache:benthos: test for socket based activation in Benthos

https://gerrit.wikimedia.org/r/1029508

I was thinking about this while falling asleep last night, and I think it's actually desirable to drop some messages under a persistently high incoming request rate, as a way to shed load.

I'm a bit worried that, given how expensive the grok processing seems to be, we could create a backlog that's hard to work our way out of while the excessive load continues.

Change #1029480 merged by Fabfur:

[operations/puppet@production] cache:benthos: move processors in the pipeline section

https://gerrit.wikimedia.org/r/1029480

Change #1029615 had a related patch set uploaded (by Fabfur; author: Fabfur):

[operations/puppet@production] cache:benthos: test for socket based activation in Benthos

https://gerrit.wikimedia.org/r/1029615

Change #1029508 abandoned by Fabfur:

[operations/puppet@production] cache:benthos: test for socket based activation in Benthos

Reason:

abandoned in favor of https://gerrit.wikimedia.org/r/c/operations/puppet/+/1029615

https://gerrit.wikimedia.org/r/1029508

Change #1030010 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):

[operations/puppet@production] ncredir,benthos: Move processors to the pipeline section

https://gerrit.wikimedia.org/r/1030010

Change #1029615 merged by Fabfur:

[operations/puppet@production] cache:benthos: test for socket based activation in Benthos

https://gerrit.wikimedia.org/r/1029615

Change #1034099 had a related patch set uploaded (by Fabfur; author: Fabfur):

[operations/puppet@production] hiera: test Benthos socket activation on cp4037

https://gerrit.wikimedia.org/r/1034099