Description
[10:20:57] <jynus> I believe cp hosts are overloading centrallog with:
[10:21:05] <jynus> Nov 22 09:19:55 cp3066 haproxykafka[1334378]: {"level":"error","TopicPartition":{"Topic":"webrequest_frontend_text","Partition":0,"Offset":588,"Metadata":null,"Error":{}},"error":"Broker: Topic authorization failed","time":"2024-11-22T09:19:55Z","message":"Failed to publish message"}
Details
- Other Assignee
- jcrespo
| Subject | Repo | Branch | Lines +/- |
|---|---|---|---|
| hieradata,haproxykafka: Disable haproxykafka globally | operations/puppet | production | +0 -3 |
Related Objects
- Mentioned In
- T380583: Avoid logging errors per produced message
- Mentioned Here
- T380583: Avoid logging errors per produced message
Event Timeline
Change #1094376 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):
[operations/puppet@production] hieradata,haproxykafka: Disable haproxykafka globally
Change #1094376 merged by Vgutierrez:
[operations/puppet@production] hieradata,haproxykafka: Disable haproxykafka globally
Mentioned in SAL (#wikimedia-operations) [2024-11-22T10:22:13Z] <vgutierrez> manually stopping haproxykafka on A:cp-ulsfo and A:cp-eqsin - T380570
[11:37:05] <icinga-wm> RECOVERY - Disk space on centrallog2002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=centrallog2002&var-datasource=codfw+prometheus/ops
[11:43:11] <icinga-wm> RECOVERY - Disk space on centrallog1002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=centrallog1002&var-datasource=eqiad+prometheus/ops
I've freed ~500GB of logs on both centrallog1002 and centrallog2002 by deleting 24+ hours of logs from cp* hosts that contained the errors quoted in the description. The local logs on the cp* hosts have not been touched, so they can still be used for debugging.
Leaving this ticket for Traffic to decide whether to resolve it (if there is another ticket for haproxykafka) or repurpose it for researching/fixing the source of the errors. The main issue I wanted to report (log clogging) is solved.
Thanks to @Vgutierrez for the quick takeover.
Mentioned in SAL (#wikimedia-operations) [2024-11-22T14:22:30Z] <vgutierrez> restoring haproxykafka on A:cp-ulsfo and A:cp-eqsin - T380570
