smartd not starting properly on gen9 + buster
Closed, Resolved, Public

Description

I have upgraded db1078 from stretch to buster and I noticed smartd isn't starting.

root@db1078:/etc/systemd# /usr/sbin/smartd -n -d
smartd 6.6 2017-11-05 r4594 [x86_64-linux-4.19.0-8-amd64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

Opened configuration file /etc/smartd.conf
Drive: DEVICESCAN, implied '-a' Directive on line 21 of file /etc/smartd.conf
Configuration file /etc/smartd.conf was parsed, found DEVICESCAN, scanning devices
glob(3) found no matches for pattern /dev/hd[a-t]
glob(3) found no matches for pattern /dev/sd[a-c][a-z]
DEVICESCAN failed: glob(3) aborted matching pattern /dev/discs/disc*
In the system's table of devices NO devices found to scan
Unable to monitor any SMART enabled devices. Try debug (-d) option. Exiting...

root@db1078:/etc/systemd# ls -lh /dev/sda*
brw-rw---- 1 root disk 8, 0 Mar  5 15:23 /dev/sda
brw-rw---- 1 root disk 8, 1 Mar  5 15:23 /dev/sda1
brw-rw---- 1 root disk 8, 2 Mar  5 15:23 /dev/sda2
brw-rw---- 1 root disk 8, 3 Mar  5 15:23 /dev/sda3

On a gen9 running stretch this seems to be working fine:

root@db1075:~# /usr/sbin/smartd -n -d
smartd 6.6 2016-05-31 r4324 [x86_64-linux-4.9.0-11-amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

Opened configuration file /etc/smartd.conf
Drive: DEVICESCAN, implied '-a' Directive on line 21 of file /etc/smartd.conf
Configuration file /etc/smartd.conf was parsed, found DEVICESCAN, scanning devices
glob(3) found no matches for pattern /dev/hd[a-t]
glob(3) found no matches for pattern /dev/sd[a-c][a-z]
Device: /dev/sda, opened
Device: /dev/sda, [HP       LOGICAL VOLUME   3.56], lu id: 0x600508b1001c7bc3f2c4db22732695b6, S/N: PDNNF0ARH9O0FN, 4.00 TB
Device: /dev/sda, does not support SMART Self-Test Log.
Device: /dev/sda, is SMART capable. Adding to "monitor" list.
Device: /dev/sda, state read from /var/lib/smartmontools/smartd.HP-LOGICAL_VOLUME-PDNNF0ARH9O0FN.scsi.state
Monitoring 0 ATA/SATA, 1 SCSI/SAS and 0 NVMe devices
Device: /dev/sda, opened SCSI device
Device: /dev/sda, SMART health: passed
Device: /dev/sda, failed to read Temperature
Device: /dev/sda, state written to /var/lib/smartmontools/smartd.HP-LOGICAL_VOLUME-PDNNF0ARH9O0FN.scsi.state

/dev/discs/disc* doesn't seem to exist on either stretch or buster.
Creating /dev/discs manually clears the glob error, but smartd still finds no devices:

root@db1078:/etc/systemd# mkdir /dev/discs
root@db1078:/etc/systemd# /usr/sbin/smartd -n -d
smartd 6.6 2017-11-05 r4594 [x86_64-linux-4.19.0-8-amd64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

Opened configuration file /etc/smartd.conf
Drive: DEVICESCAN, implied '-a' Directive on line 21 of file /etc/smartd.conf
Configuration file /etc/smartd.conf was parsed, found DEVICESCAN, scanning devices
glob(3) found no matches for pattern /dev/hd[a-t]
glob(3) found no matches for pattern /dev/sd[a-c][a-z]
glob(3) found no matches for pattern /dev/discs/disc*
In the system's table of devices NO devices found to scan
Unable to monitor any SMART enabled devices. Try debug (-d) option. Exiting...

Event Timeline

Interesting find! It looks like db1078 is the first system running Buster that also has an HP RAID controller (so the disks are "masked" behind a single device). This looks like a "regression" in smartd: 6.6-1 from buster fails, and 7.1-1~bpo10+1 from buster-backports fails the same way. Note that in this case smartd runs mostly for logging purposes, to track attribute changes. The smart-data-dump script we use to export Prometheus metrics does support autodiscovery of hardware RAID controllers, and we alert on those SMART metrics via Prometheus.

One solution could be to not enable smartd at all, since the alerting goes through Prometheus anyway. Alternatively, we could explicitly list devices in smartd's configuration, either statically via Puppet or via enumeration/autodiscovery like smart-data-dump does.
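
For the static-listing option, here is a minimal sketch of what generating the entries could look like. It assumes the physical disks are reachable through smartmontools' `-d cciss,N` device type on the logical volume's device node. The function name `smartd_conf_lines`, the `/dev/sda` node and the disk count of 2 are illustrative assumptions, not this host's verified layout.

```shell
#!/bin/sh
# Print one smartd.conf line per physical disk behind the RAID controller.
#   $1: device node to address the controller through (assumed, e.g. /dev/sda)
#   $2: number of physical disks (assumed to be known, e.g. from Puppet)
smartd_conf_lines() {
    dev=$1
    count=$2
    i=0
    while [ "$i" -lt "$count" ]; do
        # "-d cciss,N" selects physical disk N; "-a" is the default directive set
        printf '%s -d cciss,%d -a\n' "$dev" "$i"
        i=$((i + 1))
    done
}

# Preview the lines that would replace DEVICESCAN in /etc/smartd.conf
smartd_conf_lines /dev/sda 2
```

The disk count has to come from somewhere, which is exactly the maintenance burden of this option.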

Thanks for taking a look.
Of the two options you suggest, I am more inclined toward the first, so we can get rid of a component that is superseded by Prometheus anyway, no?
The idea of having to list devices is a bit scary to me, especially considering how those can change in the future with newer OS or kernel versions, and how we'd need to maintain or adapt that.

Let's report this upstream (or in the Debian BTS, not sure if there are possible local packaging changes which might make a difference)?

> Thanks for taking a look.
> Of the two options you suggest, I am more inclined toward the first, so we can get rid of a component that is superseded by Prometheus anyway, no?
> The idea of having to list devices is a bit scary to me, especially considering how those can change in the future with newer OS or kernel versions, and how we'd need to maintain or adapt that.

Yes, I'm inclined to mask smartd too (smartmontools still needs to be installed, though), perhaps even across the board (i.e. not only as a special case on HP + RAID hosts).

> Let's report this upstream (or in the Debian BTS, not sure if there are possible local packaging changes which might make a difference)?

My understanding is that previously it was working by chance/accident, and enumerating disks behind cciss drivers isn't supported without going through specific tools (like we do in smart-data-dump). I'm not sure what we'd be reporting here, although I'm very open to suggestions!
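
As an illustration of going through a controller-specific path, here is a rough sketch that probes physical disks with smartctl's `-d cciss,N` device type. This is not necessarily how smart-data-dump does it. The function name `probe_cciss_disks`, the `SMARTCTL` override variable, the default upper bound of 16 slots, and treating a zero exit status from `smartctl -i` as "disk present" are all assumptions.

```shell
#!/bin/sh
# Allow the smartctl binary to be overridden (hypothetical knob for testing).
SMARTCTL=${SMARTCTL:-smartctl}

# Print the index of every physical disk that answers an identity query.
#   $1: device node to address the controller through (assumed, e.g. /dev/sda)
#   $2: number of slots to probe (assumed default: 16)
probe_cciss_disks() {
    dev=$1
    max=${2:-16}
    n=0
    while [ "$n" -lt "$max" ]; do
        # "-i" only requests identity information from disk N
        if "$SMARTCTL" -i -d "cciss,$n" "$dev" >/dev/null 2>&1; then
            echo "$n"
        fi
        n=$((n + 1))
    done
}

# Usage (needs root and real hardware): probe_cciss_disks /dev/sda
```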

akosiaris triaged this task as Medium priority. Mar 6 2020, 11:31 AM

> My understanding is that previously it was working by chance/accident

After digging a little further, I concur, no need for a bug report.

Change 581617 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] smart: stop smartd on Buster + hpsa

https://gerrit.wikimedia.org/r/581617

Change 581617 merged by Filippo Giunchedi:
[operations/puppet@production] smart: stop smartd on Buster + hpsa

https://gerrit.wikimedia.org/r/581617

fgiunchedi claimed this task.

smartd.service will be masked on Buster + hpsa; resolving.
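
For reference, a minimal shell sketch of the "Buster + hpsa" condition. The merged change is a Puppet patch, so this is not its implementation. The helper name `should_mask_smartd` and both detection methods (reading `VERSION_CODENAME` from /etc/os-release, checking for /sys/module/hpsa) are assumptions. The systemctl commands are printed rather than executed, so the sketch is safe to run anywhere.

```shell
#!/bin/sh
# Hypothetical helper: succeed only for the combination named in this task.
#   $1: Debian release codename, $2: storage driver name
should_mask_smartd() {
    [ "$1" = "buster" ] && [ "$2" = "hpsa" ]
}

# Detect the codename; left empty if /etc/os-release is missing.
codename=$(sed -n 's/^VERSION_CODENAME=//p' /etc/os-release 2>/dev/null || true)

# A loaded kernel module shows up as a directory under /sys/module.
if [ -d /sys/module/hpsa ]; then
    driver=hpsa
else
    driver=none
fi

if should_mask_smartd "$codename" "$driver"; then
    echo "systemctl stop smartd.service"
    echo "systemctl mask smartd.service"
fi
```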