
Haproxykafka silently stops sending request data to kafka
Closed, Resolved (Public)

Description

Given that cp5017 was unable to process logs in recent days (and no one noticed), we need to review how HAProxyKafka alerts are defined.

Apparently the haproxykafka process (and service) on cp5017 was up, as reported by systemd, but no messages were sent to Kafka in two intervals: from 2025-06-30 03:23 UTC to 2025-07-07 12:40 UTC, and from 2025-07-15 15:59 UTC to 2025-07-21 08:44 UTC, when the functionality was (inadvertently) restarted while debugging.

Debugging during the incident was hard because the process completely refused to serve pprof information (through the /debug/ endpoint), and even strace showed no network, file, or process activity.

While we were inspecting thread backtraces with gdb for possible deadlocks, the process resumed its usual behavior, processing logs again and sending them to the Kafka cluster.

To avoid this in the future we should add a couple of alerts:

  • Check if the prometheus exporter is up
  • Check that we're sending a reasonable number of messages compared to the number of requests received by HAProxy (essentially replicating the current HaproxyKafka alert for DE).
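
As a rough illustration of the two checks above, here is a minimal Python sketch of the decision logic. It is not the rule that was merged in operations/alerts. The `HostSample` fields, the alert names `ExporterDown` and `LowMessageRatio`, the host names, and the thresholds are all hypothetical and chosen only to show the idea.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HostSample:
    """One evaluation-window sample for a single cache host (hypothetical fields)."""
    host: str
    exporter_up: bool          # did the scrape of the Prometheus exporter succeed?
    haproxy_requests: float    # requests HAProxy handled in the window
    kafka_messages: float      # messages haproxykafka delivered in the window


def evaluate(sample: HostSample,
             min_ratio: float = 0.9,
             min_requests: float = 100.0) -> Optional[str]:
    """Return a (hypothetical) alert name if the sample looks unhealthy, else None."""
    # Check 1: the exporter itself must be reachable.
    if not sample.exporter_up:
        return "ExporterDown"
    # Skip the ratio check on near-idle hosts (assumed threshold) to avoid noise.
    if sample.haproxy_requests < min_requests:
        return None
    # Check 2: messages sent should roughly track requests received.
    ratio = sample.kafka_messages / sample.haproxy_requests
    if ratio < min_ratio:
        return "LowMessageRatio"
    return None


if __name__ == "__main__":
    stuck = HostSample("cp-example-1", True, 50_000.0, 0.0)
    healthy = HostSample("cp-example-2", True, 50_000.0, 49_800.0)
    print(evaluate(stuck))    # LowMessageRatio
    print(evaluate(healthy))  # None
```

The ratio check is what would catch a process that systemd still reports as running but that has stopped producing messages. The exporter check covers the case where the metrics themselves disappear.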

Some screenshots of the issue for reference:

Availability of the prometheus exporter on cp5017:

hpk_cp5017_1.png (1,435×589 px, 57 KB)

Requests not shown on Turnilo compared to other cluster hosts:

hpk_turnilo.png (1,671×699 px, 46 KB)

Update: this happened to cp3071 too:

Screenshot from 2025-07-23 17-23-51.png (1,257×234 px, 15 KB)

Event Timeline

Fabfur renamed this task from Create alert for low haproxykafka message rate (traffic) to Create better alerts for HAProxyKafka. Jul 21 2025, 10:25 AM
Fabfur triaged this task as High priority.
Fabfur added a project: HaproxyKafka.
Fabfur updated the task description.
Vgutierrez renamed this task from Create better alerts for HAProxyKafka to Haproxykafka silently stops sending request data to kafka. Jul 21 2025, 10:34 AM

Change #1171176 had a related patch set uploaded (by Fabfur; author: Fabfur):

[operations/alerts@master] traffic: new alerts for haproxykafka

https://gerrit.wikimedia.org/r/1171176

Change #1171176 merged by Fabfur:

[operations/alerts@master] traffic: new alerts for haproxykafka

https://gerrit.wikimedia.org/r/1171176

Adding the DE team too, considering that two hosts didn't send messages for a long time and this could be impacting the analytics data.

We lost about 2 weeks of logs from cp5017 and about 10 weeks of logs from cp3071.

Change #1172059 had a related patch set uploaded (by Fabfur; author: Fabfur):

[operations/alerts@master] haproxykafka: fixed missing site in dashboard link

https://gerrit.wikimedia.org/r/1172059

Change #1172059 merged by Fabfur:

[operations/alerts@master] haproxykafka: fixed missing site in dashboard link

https://gerrit.wikimedia.org/r/1172059

Change #1172347 had a related patch set uploaded (by Fabfur; author: Fabfur):

[operations/alerts@master] haproxykafka: adding alert for unexpected restarts

https://gerrit.wikimedia.org/r/1172347

Note that this happened again around 2025-07-24 14:43 on cp3071, the same host.

Change #1173427 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):

[operations/alerts@master] traffic: Fix HaproxyKafkaNoMessages alerts

https://gerrit.wikimedia.org/r/1173427

Change #1173427 merged by Vgutierrez:

[operations/alerts@master] traffic: Fix HaproxyKafkaNoMessages alerts

https://gerrit.wikimedia.org/r/1173427

Change #1172347 merged by Fabfur:

[operations/alerts@master] haproxykafka: adding alert for unexpected restarts

https://gerrit.wikimedia.org/r/1172347

Change #1174421 had a related patch set uploaded (by Fabfur; author: Fabfur):

[operations/alerts@master] haproxykafka: fixed alert HaproxykafkaNoMessages

https://gerrit.wikimedia.org/r/1174421

Closing as per T400199 and opening a new ticket dedicated solely to the watchdog feature.

Change #1177319 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):

[operations/puppet@production] haproxykafka: Reduce socket deadline to 500ms

https://gerrit.wikimedia.org/r/1177319

Change #1177319 merged by Vgutierrez:

[operations/puppet@production] haproxykafka: Reduce socket deadline to 500ms

https://gerrit.wikimedia.org/r/1177319

Mentioned in SAL (#wikimedia-operations) [2025-08-11T08:36:41Z] <vgutierrez> reducing haproxykafka socket batch deadline to 500ms - T400039

Change #1174421 merged by jenkins-bot:

[operations/alerts@master] team-data-engineering: fixed alert HaproxykafkaNoMessages

https://gerrit.wikimedia.org/r/1174421