(OoW) Heating alerts and broken RAM on kafka1014
Closed, Declined, Public

Description

kafka1014 has a lot of heating alerts logged in the kernel log (and also heating alerts logged to mcelog). Maybe there's a problem with one of the fans or similar?


kafka1014 reports RAM issues -> "Memory correctable errors -EDAC-"

https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=kafka1014&service=Memory+correctable+errors+-EDAC-
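For anyone who wants to look at the counters directly on the host rather than through Icinga, here is a minimal sketch. It assumes the kernel EDAC driver is loaded and exposes per-controller counters under `/sys/devices/system/edac/mc/mc*/ce_count`; the function name `correctable_error_counts` is made up for illustration, and this is not how the Icinga check itself is implemented.

```python
from pathlib import Path


def correctable_error_counts(edac_root="/sys/devices/system/edac/mc"):
    """Return {controller_name: ce_count} for each EDAC memory controller.

    edac_root is assumed to follow the usual Linux EDAC sysfs layout
    (mc0/ce_count, mc1/ce_count, ...). Returns an empty dict if the
    directory does not exist, e.g. when no EDAC driver is loaded.
    """
    counts = {}
    root = Path(edac_root)
    if not root.is_dir():
        return counts
    for ce_file in sorted(root.glob("mc*/ce_count")):
        try:
            counts[ce_file.parent.name] = int(ce_file.read_text().strip())
        except (OSError, ValueError):
            # Skip counters that cannot be read or parsed.
            continue
    return counts


if __name__ == "__main__":
    for controller, count in correctable_error_counts().items():
        print(f"{controller}: {count} correctable errors")
```

A non-zero count only says that corrected errors have been seen since the counters were last reset; which DIMM is affected still has to be worked out separately.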

Event Timeline

We can stop the host and verify the status of the thermal paste if it is worth it :)

@elukey yes please stop the host and I will apply thermal paste

@Cmjohnson it would be better to stop the host only for the time needed, so I can stop it before you are ready to apply the paste. Lemme know 10 mins beforehand and I'll shut it down. Thanks!

Dzahn renamed this task from Heating alerts on kafka1014 to Heating alerts and broken RAM on kafka1014. Jan 9 2019, 11:54 PM
Dzahn updated the task description.

As a new issue, kafka1014 has recently started reporting broken RAM:

https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=kafka1014&service=Memory+correctable+errors+-EDAC-

I used the existing ticket to add it, but let me know if it should have been a new one.

It is fine in here, Daniel, thanks! In theory kafka1012->23 should be decommissioned once Event Gate (part of Modern Event Platform) is up and running, since Mediawiki Avro Monolog will be migrated to it and at that point nothing will be pushing data to the old Kafka Analytics cluster anymore. So I wouldn't spend much time or energy on this if possible; worst case, we can shrink the cluster down to 5/4 hosts and decom the ones (like this one) that are not healthy anymore.

wiki_willy renamed this task from Heating alerts and broken RAM on kafka1014 to (OoW) Heating alerts and broken RAM on kafka1014. Jul 2 2019, 10:05 PM
wiki_willy subscribed.

@Dzahn - just wanted to circle back around on this, and see if kafka1014 can be decommissioned. Thanks, Willy