Given the issue with cp5017 being unable to process logs in recent days (which no one noticed), we need to review how HAProxyKafka alerts are defined.
Apparently the haproxykafka process (and service) on cp5017 was up, as reported by systemd, but no messages were sent to Kafka in the intervals from 2025-06-30@03:23UTC to 2025-07-07@12:40UTC and from 2025-07-15@15:59UTC to 2025-07-21@08:44UTC, when the functionality was (inadvertently) restarted while debugging.
Debugging during the issue was hard because the process completely refused to serve pprof information (through the /debug/ endpoint), and even strace showed no network, file or process activity.
While thread backtraces were being inspected with gdb for possible deadlocks, the process resumed its usual behavior, processing logs again and sending them to the Kafka cluster.
To avoid this in the future we should add a couple of alerts:
- Check if the prometheus exporter is up
- Check that we're sending a reasonable number of messages compared to the number of requests received by HAProxy (essentially replicating the current HaproxyKafka alert for DE).
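
The logic of the two checks above could look roughly like the following. This is only a sketch to make the intent concrete, not the actual alert definition: the function name, the alert names, and the `min_ratio` / `min_requests` thresholds are all hypothetical, and the real alerts would be written as alerting rules over whatever metrics the exporter and HAProxy actually expose.

```python
def haproxykafka_alerts(exporter_up, haproxy_requests, kafka_messages,
                        min_ratio=0.9, min_requests=1000):
    """Return the names of the alerts that would fire for one host.

    All names and thresholds are illustrative assumptions:
      exporter_up      -- whether the Prometheus exporter answered the scrape
      haproxy_requests -- requests handled by HAProxy in the window
      kafka_messages   -- messages haproxykafka sent to Kafka in the same window
    """
    alerts = []

    # Check 1: the exporter must be reachable at all.
    if not exporter_up:
        alerts.append("ExporterDown")
        # Without the exporter there are no message counters to compare.
        return alerts

    # Check 2: messages sent should roughly track requests received.
    # Skip hosts with very little traffic to avoid noisy ratios.
    if haproxy_requests >= min_requests:
        if kafka_messages / haproxy_requests < min_ratio:
            alerts.append("LowMessageRatio")

    return alerts
```

With these assumed thresholds, a host that is serving traffic while sending nothing to Kafka (the situation described for cp5017) would be flagged by the second check even though systemd reports the service as up.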
Some screenshots of the issue for reference:
Availability of the Prometheus exporter on cp5017:
Requests not shown on Turnilo compared to other cluster hosts:
Update: this happened to cp3071 too:


