This one crashed today too, together with cp3055. Nothing on the console.
Description
Status | Subtype | Assigned | Task
---|---|---|---
Open | | None | T238305 Servers freezing across the caching cluster
Duplicate | | None | T241306 cp3051 crashed
Event Timeline
Mentioned in SAL (#wikimedia-operations) [2019-12-21T22:45:08Z] <volans> powercycle cp3051 - T241306
Nothing in racadm, checked both getsel and lclog view. Nothing in syslog & co.
FYI, towards the end of the boot process dmesg logged a bunch of `kvm: disabled by bios` messages.
Thanks @Volans for taking care of this.
Just like all other crashes tracked in T238305 :-/
Now, I know it sounds crazy, but: this is the 6th host to crash out of the 8 cache_upload nodes in esams. So far none of the 8 cache_text nodes in esams has crashed. I don't think there's much to look at on the software configuration level, considering that a text node has crashed in eqiad (cp1077), but perhaps it's worth checking what's special about upload@esams that differentiates it from text@esams? Something at the hardware level maybe, like parts batches, or anything special related to racking? You can tell upload@esams hosts from text ones because the number in their hostname is odd: cp30(5[13579]|6[135]) vs cp30(5[02468]|6[024]).
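In case anyone wants to script the upload/text comparison, here is a minimal sketch that classifies esams cache hosts using the two hostname patterns quoted above. The function name `esams_cache_role` and the handling of fully qualified names are my own assumptions for illustration, not an existing tool.

```python
import re

# Patterns taken verbatim from the comment above.
UPLOAD_RE = re.compile(r"cp30(5[13579]|6[135])")  # odd-numbered hosts
TEXT_RE = re.compile(r"cp30(5[02468]|6[024])")    # even-numbered hosts


def esams_cache_role(hostname):
    """Return 'upload', 'text', or None for hosts outside cp3050-cp3065.

    Hypothetical helper: only the short hostname (before the first dot)
    is matched, so a fully qualified name is accepted as well.
    """
    short = hostname.split(".", 1)[0]
    if UPLOAD_RE.fullmatch(short):
        return "upload"
    if TEXT_RE.fullmatch(short):
        return "text"
    return None


# Group the whole cp3050-cp3065 range by role.
hosts = [f"cp{n}" for n in range(3050, 3066)]
by_role = {"upload": [], "text": []}
for host in hosts:
    by_role[esams_cache_role(host)].append(host)
```

With the crashed hosts from T238305 in a list, filtering them through `esams_cache_role` gives the per-role crash count directly; cp1077 (eqiad) returns `None` because the patterns only cover esams.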
> FYI, towards the end of the boot process dmesg logged a bunch of `kvm: disabled by bios` messages.
Disabled on purpose, we don't use kvm on cache nodes.