
(OoW) Heating alerts and broken RAM on kafka1014
Closed, Declined | Public

Description

kafka1014 is logging a lot of heating alerts, both in the kernel log and in mcelog. Maybe there's a problem with one of the fans or something similar?


kafka1014 reports RAM issues -> "Memory correctable errors -EDAC-"

https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=kafka1014&service=Memory+correctable+errors+-EDAC-

Event Timeline

We can stop the host and verify the status of the thermal paste if it is worth it :)

@elukey yes please stop the host and I will apply thermal paste

@Cmjohnson it would be better to stop the host only for the time needed, so I can stop it before you are ready to apply the paste. Lemme know 10 mins beforehand and I'll shut it down. Thanks!

Dzahn renamed this task from Heating alerts on kafka1014 to Heating alerts and broken RAM on kafka1014. Jan 9 2019, 11:54 PM
Dzahn updated the task description.
Dzahn added a subscriber: Dzahn. Edited Jan 9 2019, 11:57 PM

As a new issue, kafka1014 has recently started reporting broken RAM:

https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=kafka1014&service=Memory+correctable+errors+-EDAC-

I used the existing ticket to add it, but let me know if it should have been a new one.

It is fine in here, Daniel, thanks! In theory kafka1012->23 should be decommissioned once Event Gate (part of Modern Event Platform) is up and running, since Mediawiki Avro Monolog will be migrated to it, and at that point nothing will be pushing data to the old Kafka Analytics cluster anymore. So I wouldn't spend much time or energy on this if possible. Worst case, we can shrink the cluster down to 5/4 hosts and decom the ones (like this one) that are no longer healthy.

elukey moved this task from Backlog to Stalled on the User-Elukey board. Jan 10 2019, 9:05 AM
elukey moved this task from Stalled to Waiting for others on the User-Elukey board. Jun 6 2019, 3:28 PM
wiki_willy renamed this task from Heating alerts and broken RAM on kafka1014 to (OoW) Heating alerts and broken RAM on kafka1014. Jul 2 2019, 10:05 PM
wiki_willy added a subscriber: wiki_willy.

@Dzahn - just wanted to circle back on this and see if kafka1014 can be decommissioned. Thanks, Willy

elukey closed this task as Declined. Jul 3 2019, 5:41 AM

The server will be decommed by https://phabricator.wikimedia.org/T226517, closing!