smart-data-dump --syslog producing errors and spamming root@
Closed, Resolved (Public)

Description

Several hosts are sending a large number of emails to root@ with the following output:

/usr/local/sbin/smart-data-dump --syslog --outfile /var/lib/prometheus/node.d/device_smart.prom

Traceback (most recent call last):
  File "/usr/local/sbin/smart-data-dump", line 459, in <module>
    sys.exit(main())
  File "/usr/local/sbin/smart-data-dump", line 438, in main
    for pd in handler():
  File "/usr/local/sbin/smart-data-dump", line 182, in hpsa_list_pd
    return hpsa_parse(raw_output, lsscsi_list_dev())
  File "/usr/local/sbin/smart-data-dump", line 227, in lsscsi_list_dev
    return lsscsi_parse(_check_output('/usr/bin/lsscsi -t -g'))
  File "/usr/local/sbin/smart-data-dump", line 243, in lsscsi_parse
    output[m[1]] = m[2]
TypeError: '_sre.SRE_Match' object is not subscriptable

Event Timeline

Marostegui moved this task from Backlog to Acknowledged on the SRE board.
colewhite claimed this task.

Thanks for the report!

The updated hpsa parser shipped with a bug on initial deployment, which triggered these emails. It was caught the same day and fixed in https://gerrit.wikimedia.org/r/c/operations/puppet/+/594989.
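
The TypeError in the description is consistent with indexing a regular-expression match object on a Python version older than 3.6, where match objects did not yet support `m[1]`. Below is a minimal sketch of a parser that avoids the problem, assuming the fix amounts to calling `.group()` instead; the linked change may do something different. The pattern, the sample line, and the name `parse_devices` are hypothetical and not taken from smart-data-dump.

```python
import re

# Hypothetical pattern for one line of device-listing output:
# "[H:C:T:L]  <type>  <transport>  /dev/<block device> ..."
LINE_RE = re.compile(r'^\[([\d:]+)\]\s+\S+\s+\S+\s+(/dev/\S+)')


def parse_devices(text):
    """Map each SCSI address to its block device path."""
    devices = {}
    for line in text.splitlines():
        m = LINE_RE.match(line)
        if m is None:
            # Skip lines that do not look like a device entry.
            continue
        # m.group(n) works on every Python 3 release;
        # the m[n] shorthand requires Python 3.6 or newer.
        devices[m.group(1)] = m.group(2)
    return devices


# Hypothetical sample line, for illustration only.
sample = "[0:0:0:0]    disk    sas:0x5000c50058c3b2a1    /dev/sda   /dev/sg0\n"
print(parse_devices(sample))
```

Using `.group()` keeps the script working on hosts that still run an older interpreter, which matters when the same file is deployed across a mixed fleet.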

There are also reports from early today, for instance from dbprov2001.

I found the email you are referring to. Logs:

Traceback (most recent call last):
  File "/usr/local/sbin/smart-data-dump", line 459, in <module>
    sys.exit(main())
  File "/usr/local/sbin/smart-data-dump", line 429, in main
    raid_drivers = get_fact('raid')
  File "/usr/local/sbin/smart-data-dump", line 134, in get_fact
    facter_version = int(_check_output('/usr/bin/facter --version', stderr=subprocess.DEVNULL)
  File "/usr/local/sbin/smart-data-dump", line 123, in _check_output
    return subprocess.check_output(cmd, stderr=stderr) \
  File "/usr/lib/python3.5/subprocess.py", line 316, in check_output
    **kwargs).stdout
  File "/usr/lib/python3.5/subprocess.py", line 398, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['/usr/bin/timeout', '60', '/usr/bin/facter', '--version']' returned non-zero exit status 124

The output from dbprov2001 indicates a different issue: a timeout while fetching facter data. It is a bit surprising, though, that facter timed out just fetching its version.

There is an existing task for facter timeouts: T251293. I will copy these logs to that task as well.

Edit: At the time smart-data-dump ran, the disk was saturated (see 09:12).
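
Exit status 124 in the traceback above is what GNU `timeout` returns when the wrapped command exceeds its time limit, so this failure can be told apart from a genuine facter error. Below is a minimal sketch of that distinction; the helper names `timed_out` and `explain` are hypothetical and not part of smart-data-dump.

```python
import subprocess

# GNU timeout(1) exits with 124 when the time limit is reached.
TIMEOUT_EXIT_STATUS = 124


def timed_out(exc):
    """Return True if a CalledProcessError came from timeout(1) expiring."""
    return exc.returncode == TIMEOUT_EXIT_STATUS


def explain(exc):
    """Build a one-line description of a failed command."""
    cmd = ' '.join(exc.cmd)
    if timed_out(exc):
        return 'timed out: ' + cmd
    return 'failed with status {}: {}'.format(exc.returncode, cmd)


# Rebuild the error from the traceback rather than running facter.
err = subprocess.CalledProcessError(
    124, ['/usr/bin/timeout', '60', '/usr/bin/facter', '--version'])
print(explain(err))
```

A caller could catch `subprocess.CalledProcessError`, log the timeout case at a lower severity, and skip the run instead of letting the traceback reach root@.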

jijiki added a project: DBA.
jijiki subscribed.

Reopened the wrong task, re-closing. Nothing to see here, move along.