
Alert in need of triage: SmartNotHealthy (instance sretest2006:9100)
Open, Medium, Public

Description

The alert SmartNotHealthy started firing 1 month ago.

Labels
alertname=SmartNotHealthy
cluster=misc
device=sda
instance=sretest2006:9100
job=node
prometheus=ops
severity=warning
site=codfw
source=prometheus
team=sre
Annotations
Name: Content
dashboard: https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=sretest2006
description: The disk SMART status is *not* healthy, this could be an early warning before the disk fails.
runbook: https://wikitech.wikimedia.org/wiki/SMART#Alerts
summary: Disk not healthy
Links

Triage metadata. Do not delete.
fingerprint=fe1655842b4b1212

Event Timeline

If you zoom out to half a year, this alert has been active since the end of July. Could this have been a mistake during the initial setup? Would a reimage fix it? It is an OS drive, not a storage drive. I am not sure what the host is being used for at the moment.
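
The "active since the end of July" observation could also be checked outside the dashboard by querying Prometheus's built-in ALERTS series over a time range. The following is only a minimal sketch: the base URL is a placeholder (not the real address of the ops Prometheus instance), the helper names are made up for illustration, and the result can never reach further back than the queried range or the server's retention.

```python
from urllib.parse import urlencode


def build_alert_range_url(base_url, alertname, instance, start, end, step="1h"):
    """Build a Prometheus /api/v1/query_range URL for the ALERTS series.

    base_url is a placeholder for whichever Prometheus instance is queried;
    start and end are Unix timestamps (or RFC 3339 strings).
    """
    query = 'ALERTS{alertname="%s",instance="%s",alertstate="firing"}' % (
        alertname,
        instance,
    )
    params = urlencode({"query": query, "start": start, "end": end, "step": step})
    return "%s/api/v1/query_range?%s" % (base_url.rstrip("/"), params)


def first_firing_timestamp(response):
    """Return the earliest sample timestamp in a query_range JSON response.

    Returns None when the response contains no samples, i.e. the alert was
    not firing anywhere in the queried window.
    """
    timestamps = [
        float(ts)
        for series in response.get("data", {}).get("result", [])
        for ts, _value in series.get("values", [])
    ]
    return min(timestamps) if timestamps else None
```

The URL would be fetched with any HTTP client and the decoded JSON passed to first_firing_timestamp; if the earliest sample coincides with the start of the queried window, the alert may well have been firing before that.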

MoritzMuehlenhoff raised the priority of this task from Low to Medium.