
apache2 cpu-stuck on logstash hosts causes kafka logging lag
Closed, ResolvedPublic

Description

The LogstashKafkaConsumerLag alert had been firing for ~6 hours as of this EU morning. While investigating via https://grafana.wikimedia.org/d/000000561/logstash I noticed logstash1032 wasn't consuming messages; it turned out an apache2 process was stuck spinning on CPU (on futexes), leaving no CPU for logstash/elasticsearch. I wanted to capture a backtrace with symbols, but unfortunately installing apache2-bin-dbgsym also restarted apache2, so the problem went away.

Filing a task for tracking purposes; however, I don't think we run into this bug very often (i.e. it will be hard to reproduce).
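
For the next occurrence, a minimal sketch of how a backtrace with symbols could be captured (the target PID and output path are assumptions; pick the CPU-stuck worker from top/ps first):

```
# Sketch only: <stuck-apache2-pid> is a placeholder for the worker spinning on futexes.
# Note: in this incident, installing apache2-bin-dbgsym itself restarted apache2.
apt install apache2-bin-dbgsym
gdb -p <stuck-apache2-pid> -batch -ex 'thread apply all bt full' > /root/apache2-backtrace.txt
```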

Event Timeline

This occurred again today on logstash1030 at 21:44Z.

Mentioned in SAL (#wikimedia-operations) [2024-01-18T11:21:02Z] <godog> bounce apache2 on logstash1025 / logstash1031 - T337818

This was happening again on two logstash hosts in eqiad. Interestingly enough, this time I was able to capture the backtraces via gdb with debug symbols (they are in /root; I had to sanitize users and passwords, so beware). It didn't seem to cause any active impact, since apache2 had apparently been in that state for multiple days AFAICS (!)
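
For the record, a hedged sketch of the kind of sanitization pass applied to the captured backtraces before leaving them in /root (the file name and patterns are assumptions, not the exact commands used):

```
# Sketch only: redact anything resembling a user or password before sharing the backtrace.
sed -i -E 's/(username|user|password|passwd)=[^" ,]*/\1=REDACTED/Ig' /root/apache2-backtrace.txt
# quick check that no unredacted credential strings remain
grep -iE '(password|passwd)=' /root/apache2-backtrace.txt | grep -v 'REDACTED' || echo "clean"
```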

Mentioned in SAL (#wikimedia-operations) [2024-03-20T11:10:56Z] <godog> bounce apache2 on logstash1031 - T337818

fgiunchedi renamed this task from "apache2 cpu-stuck on logstash1032 causes kafka logging lag" to "apache2 cpu-stuck on logstash hosts causes kafka logging lag". Mar 20 2024, 11:14 AM

IMO it'd be worth considering a split of the collector role into independent dashboard, logstash and potentially logs-api roles to provide additional isolation between services.

Mentioned in SAL (#wikimedia-operations) [2024-03-24T23:59:14Z] <denisse> restarting apache2 on logstash1023 - T337818

I think this issue triggered again today. The affected host was logstash1023, so I restarted the apache2 service.

Here are relevant graphs at the time of the incident:

Screenshot 2024-03-24 at 18-03-00 Logstash - Dashboards - Grafana.png (258×907 px, 39 KB)

I can see a drop in the Logstash input rate for logstash1023 that correlates with an increase in 'tripped in_flight_requests' errors on logstash1023.
https://grafana.wikimedia.org/goto/cHYbV9JIz
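
For reference, one way to check the circuit breaker counters directly on the affected host is sketched below (the local port and the jq filter are assumptions; this uses the standard node stats breaker API, nothing specific to our setup):

```
# Sketch: inspect the in_flight_requests breaker on the local node (port 9200 assumed).
curl -s 'http://localhost:9200/_nodes/_local/stats/breaker' \
  | jq '.nodes[].breakers.in_flight_requests | {limit_size, estimated_size, tripped}'
```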

Screenshot 2024-03-24 at 18-07-05 Kafka Consumer Lag - Dashboards - Grafana.png

https://grafana.wikimedia.org/goto/KD04S9JSz

After restarting apache2 on logstash1023, the consumer group lag graph shows the number of backlogged events decreasing. I think the issue is contained for now, but we still need a proper fix.

Screenshot 2024-03-24 at 18-31-03 Kafka Consumer Lag - Dashboards - Grafana.png

https://grafana.wikimedia.org/goto/z9u1N9JIz
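
As a complement to the dashboard, the consumer group catch-up can also be confirmed from the command line; a sketch is below, with the broker address and group name as placeholders rather than the actual production values:

```
# Sketch: describe the logstash consumer group and watch the LAG column shrink after the restart.
kafka-consumer-groups.sh --bootstrap-server <kafka-broker>:9092 \
  --describe --group <logstash-consumer-group>
```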

fgiunchedi raised the priority of this task from Low to Medium. Mar 27 2024, 2:28 PM

Change #1015045 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] Use oauth2-proxy for opensearch dashboards

https://gerrit.wikimedia.org/r/1015045

Change #1016301 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] hieradata: add logstash_oidc client

https://gerrit.wikimedia.org/r/1016301

Change #1016301 merged by Filippo Giunchedi:

[operations/puppet@production] hieradata: add logstash_oidc client

https://gerrit.wikimedia.org/r/1016301

Change #1015045 merged by Filippo Giunchedi:

[operations/puppet@production] Use oauth2-proxy for opensearch dashboards

https://gerrit.wikimedia.org/r/1015045

Change #1018647 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] hieradata: test sso for opensearch-dashboards in cloud vps

https://gerrit.wikimedia.org/r/1018647

Change #1018647 merged by Filippo Giunchedi:

[operations/puppet@production] hieradata: test sso for opensearch-dashboards in cloud vps

https://gerrit.wikimedia.org/r/1018647

Change #1018654 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] opensearch: fix sso support

https://gerrit.wikimedia.org/r/1018654

Change #1018654 merged by Filippo Giunchedi:

[operations/puppet@production] opensearch: fix sso support

https://gerrit.wikimedia.org/r/1018654

Change #1018657 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] opensearch: use Sensitive[String] for sso secrets

https://gerrit.wikimedia.org/r/1018657

Change #1018657 merged by Filippo Giunchedi:

[operations/puppet@production] opensearch: use Sensitive[String] for sso secrets

https://gerrit.wikimedia.org/r/1018657

Change #1018659 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] opensearch: move apache-auth-sso.erb to the right location

https://gerrit.wikimedia.org/r/1018659

Change #1018659 merged by Filippo Giunchedi:

[operations/puppet@production] opensearch: move apache-auth-sso.erb to the right location

https://gerrit.wikimedia.org/r/1018659

Change #1018667 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] opensearch: set vhost and issuer url for dashboards sso test

https://gerrit.wikimedia.org/r/1018667

Change #1018667 merged by Filippo Giunchedi:

[operations/puppet@production] opensearch: set vhost and issuer url for dashboards sso test

https://gerrit.wikimedia.org/r/1018667

Change #1018872 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] opensearch: switch dashboards to sso auth

https://gerrit.wikimedia.org/r/1018872

Change #1018872 merged by Filippo Giunchedi:

[operations/puppet@production] opensearch: switch dashboards to sso auth

https://gerrit.wikimedia.org/r/1018872

fgiunchedi claimed this task.

I'm optimistically resolving this since we no longer use mod auth ldap.