
centrallog1002, centrallog2002 running out of disk space
Closed, ResolvedPublic

Description

image.png (attached screenshot, 89 KB)

[10:20:57] <jynus> I believe cp hosts are overloading centrallog with:
[10:21:05] <jynus> Nov 22 09:19:55 cp3066 haproxykafka[1334378]: {"level":"error","TopicPartition":{"Topic":"webrequest_frontend_text","Partition":0,"Offset":588,"Metadata":null,"Error":{}},"error":"Broker: Topic authorization failed","time":"2024-11-22T09:19:55Z","message":"Failed to publish message"}
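Each failed publish emits a JSON error line like the one above, so the volume is easy to quantify. A hypothetical check along those lines (the log path and the demo file are assumptions for illustration, not the actual commands run on centrallog):

```shell
# Hypothetical helper: count "Topic authorization failed" errors in a log file.
count_auth_failures() {
    grep -c 'Topic authorization failed' "$1"
}

# Demo against a throwaway file built from lines like the one quoted above.
tmp=$(mktemp)
printf '%s\n' \
  '{"level":"error","error":"Broker: Topic authorization failed","message":"Failed to publish message"}' \
  '{"level":"info","message":"published"}' > "$tmp"
count_auth_failures "$tmp"   # prints 1
rm -f "$tmp"
```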

Details

Event Timeline

jcrespo triaged this task as Unbreak Now! priority. Nov 22 2024, 10:06 AM
jcrespo updated the task description. (Show Details)

Change #1094376 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):

[operations/puppet@production] hieradata,haproxykafka: Disable haproxykafka globally

https://gerrit.wikimedia.org/r/1094376
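The patch disables haproxykafka globally via hieradata. A sketch of what such an override might look like; the key name and file path here are assumptions for illustration, not taken from the actual change:

```yaml
# hieradata/common.yaml (illustrative -- the real key in change #1094376 may differ)
profile::cache::haproxykafka::enabled: false
```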

Change #1094376 merged by Vgutierrez:

[operations/puppet@production] hieradata,haproxykafka: Disable haproxykafka globally

https://gerrit.wikimedia.org/r/1094376

Mentioned in SAL (#wikimedia-operations) [2024-11-22T10:22:13Z] <vgutierrez> manually stopping haproxykafka on A:cp-ulsfo and A:cp-eqsin - T380570

jcrespo lowered the priority of this task from Unbreak Now! to High. Nov 22 2024, 10:43 AM
[11:37:05] <icinga-wm> RECOVERY - Disk space on centrallog2002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=centrallog2002&var-datasource=codfw+prometheus/ops
[11:43:11] <icinga-wm> RECOVERY - Disk space on centrallog1002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=centrallog1002&var-datasource=eqiad+prometheus/ops

I've freed ~500GB of logs on both centrallog1002 and centrallog2002 by deleting 24+ hours of cp* host logs containing the errors quoted in the description. cp* local logs have not been touched, so they can still be used for debugging.
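For reference, a cleanup along those lines can be sketched with find; the path, name pattern, and retention threshold below are assumptions for illustration, not the exact commands run:

```shell
# Hypothetical cleanup on a centrallog host; /srv/syslog is an assumed path.
# List cp* log files older than one day before deleting anything.
find /srv/syslog -name 'cp*' -type f -mtime +1 -print
# Once reviewed, the same expression with -delete removes them:
# find /srv/syslog -name 'cp*' -type f -mtime +1 -delete
```

Listing first and deleting second keeps the destructive step deliberate, which matters when the same directory also holds logs still needed for debugging.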

Leaving this ticket to Traffic to decide whether to resolve it (if there is another ticket for haproxykafka) or repurpose it for researching/fixing the source of the errors. The main issue I wanted to report (log clogging) is solved.

Thanks to @Vgutierrez for the quick takeover.

jcrespo reassigned this task from Fabfur to Vgutierrez.
jcrespo updated Other Assignee, added: jcrespo.
jcrespo added a subscriber: Fabfur.

Resolved; the underlying haproxykafka issue is being handled by Traffic at T380583.

Mentioned in SAL (#wikimedia-operations) [2024-11-22T14:22:30Z] <vgutierrez> restoring haproxykafka on A:cp-ulsfo and A:cp-eqsin - T380570