
smart-data-dump fails occasionally due to facter timeouts
Closed, ResolvedPublic

Description

After rolling out T251293, we've noticed an increase in alerting from smart-data-dump runs: https://logstash.wikimedia.org/goto/d786c2b219de372785760d2ecda1b71f

Sep 19 08:11:00 logstash1010 systemd[1]: Started Collect SMART information from all physical disks and report as Prometheus metrics.
Sep 19 08:12:35 logstash1010 smart-data-dump[31887]: Command '['/usr/bin/timeout', '60', '/usr/bin/facter', '--puppet', '--json', '-l', 'error', 'raid_mgmt_tools']' returned non-zero exit status 124.
                                                     Traceback (most recent call last):
                                                       File "/usr/local/sbin/smart-data-dump", line 124, in _check_output
                                                         return subprocess.check_output(cmd, stderr=stderr) \
                                                       File "/usr/lib/python3.7/subprocess.py", line 395, in check_output
                                                         **kwargs).stdout
                                                       File "/usr/lib/python3.7/subprocess.py", line 487, in run
                                                         output=stdout, stderr=stderr)
                                                     subprocess.CalledProcessError: Command '['/usr/bin/timeout', '60', '/usr/bin/facter', '--puppet', '--json', '-l', 'error', 'raid_mgmt_tools']' returned non-zero exit status 124.
Sep 19 08:12:35 logstash1010 smart-data-dump[31887]: Traceback (most recent call last):
Sep 19 08:12:35 logstash1010 smart-data-dump[31887]:   File "/usr/local/sbin/smart-data-dump", line 475, in <module>
Sep 19 08:12:35 logstash1010 smart-data-dump[31887]:     sys.exit(main())
Sep 19 08:12:35 logstash1010 smart-data-dump[31887]:   File "/usr/local/sbin/smart-data-dump", line 444, in main
Sep 19 08:12:35 logstash1010 smart-data-dump[31887]:     raid_drivers = get_fact('raid_mgmt_tools')
Sep 19 08:12:35 logstash1010 smart-data-dump[31887]:   File "/usr/local/sbin/smart-data-dump", line 137, in get_fact
Sep 19 08:12:35 logstash1010 smart-data-dump[31887]:     raw_output = _check_output(command, stderr=subprocess.DEVNULL)
Sep 19 08:12:35 logstash1010 smart-data-dump[31887]:   File "/usr/local/sbin/smart-data-dump", line 124, in _check_output
Sep 19 08:12:35 logstash1010 smart-data-dump[31887]:     return subprocess.check_output(cmd, stderr=stderr) \
Sep 19 08:12:35 logstash1010 smart-data-dump[31887]:   File "/usr/lib/python3.7/subprocess.py", line 395, in check_output
Sep 19 08:12:35 logstash1010 smart-data-dump[31887]:     **kwargs).stdout
Sep 19 08:12:35 logstash1010 smart-data-dump[31887]:   File "/usr/lib/python3.7/subprocess.py", line 487, in run
Sep 19 08:12:35 logstash1010 smart-data-dump[31887]:     output=stdout, stderr=stderr)
Sep 19 08:12:35 logstash1010 smart-data-dump[31887]: subprocess.CalledProcessError: Command '['/usr/bin/timeout', '60', '/usr/bin/facter', '--puppet', '--json', '-l', 'error', 'raid_mgmt_tools']' returned non-zero exit status 124.
Sep 19 08:12:35 logstash1010 systemd[1]: export_smart_data_dump.service: Main process exited, code=exited, status=1/FAILURE
Sep 19 08:12:35 logstash1010 systemd[1]: export_smart_data_dump.service: Failed with result 'exit-code'.
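
For context, the failing call path reconstructed from the traceback looks roughly like the sketch below. Only the command line (a 60-second /usr/bin/timeout around facter) and the function names _check_output and get_fact come from the log above; the rest is an assumption about how smart-data-dump wires them together.

```
import json
import subprocess


def _check_output(cmd, stderr=None):
    # If /usr/bin/timeout kills facter, it exits with status 124 and
    # check_output() raises CalledProcessError, which is the failure
    # shown in the journal above.
    return subprocess.check_output(cmd, stderr=stderr).decode('utf-8')


def get_fact(fact_name):
    # Ask facter for a single puppet fact as JSON, capped at 60 seconds.
    command = ['/usr/bin/timeout', '60', '/usr/bin/facter',
               '--puppet', '--json', '-l', 'error', fact_name]
    raw_output = _check_output(command, stderr=subprocess.DEVNULL)
    # facter --json prints an object keyed by the requested fact name.
    return json.loads(raw_output).get(fact_name)


raid_drivers = get_fact('raid_mgmt_tools')
```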

Related Objects

Event Timeline

Change 842416 had a related patch set uploaded (by Jbond; author: jbond):

[operations/puppet@production] smart_data_dump: call raid fact directly

https://gerrit.wikimedia.org/r/842416

This is likely still the original IPMI issue. As it's so sporadic, I'm not sure it's worth trying to track down, or it is at least a low priority; as such I have created a CR to move back to calling raid.rb directly.

Change 842471 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] Revert "smart: restore get_fact and deprecate get_raid_drivers"

https://gerrit.wikimedia.org/r/842471

Change 842471 merged by Cwhite:

[operations/puppet@production] Revert "smart: restore get_fact and deprecate get_raid_drivers"

https://gerrit.wikimedia.org/r/842471

Change 842416 merged by Cwhite:

[operations/puppet@production] raid: use raid_mgmt_tools fact

https://gerrit.wikimedia.org/r/842416

Rolled back the change that used facter; smart-data-dump is back to calling raid.rb directly. Updated raid.rb to look at the new metric.

We'll check back in a week to see if there are any more timeouts.
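
After the rollback, the same wrapper invokes the fact script directly under ruby with a 120-second timeout, roughly as sketched below. The command and the get_raid_drivers name come from the traceback reported later in this task; how the script's output is parsed is an assumption.

```
import subprocess


def get_raid_drivers():
    # Run the puppet custom-fact script directly, bypassing facter (and
    # puppet) entirely; /usr/bin/timeout enforces the 120-second cap.
    command = ['/usr/bin/timeout', '120', '/usr/bin/ruby',
               '/var/lib/puppet/lib/facter/raid.rb']
    raw_output = subprocess.check_output(
        command, stderr=subprocess.DEVNULL).decode('utf-8')
    # Any non-zero exit (124 on a timeout, or raid.rb's own failure)
    # raises CalledProcessError and fails the systemd unit just as the
    # facter path did. Splitting into whitespace-separated driver names
    # is purely illustrative here.
    return raw_output.split()
```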

We have seen another recent uptick in these smart-data-dump timeouts, so I thought I would let you know about it.
In November alone there have been 20 of these alerts across the hosts managed by Data-Engineering and Data-Platform-SRE.

image.png (screenshot attachment, 99 KB)

Nov 22 12:07:00 an-worker1129 smart-data-dump[2478582]: Command '['/usr/bin/timeout', '120', '/usr/bin/ruby', '/var/lib/puppet/lib/facter/raid.rb']' returned non-zero exit status 1.
                                                        Traceback (most recent call last):
                                                          File "/usr/local/sbin/smart-data-dump", line 124, in _check_output
                                                            return subprocess.check_output(cmd, stderr=stderr) \
                                                          File "/usr/lib/python3.9/subprocess.py", line 424, in check_output
                                                            return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
                                                          File "/usr/lib/python3.9/subprocess.py", line 528, in run
                                                            raise CalledProcessError(retcode, process.args,
                                                        subprocess.CalledProcessError: Command '['/usr/bin/timeout', '120', '/usr/bin/ruby', '/var/lib/puppet/lib/facter/raid.rb']' returned non-zero exit status >
Nov 22 12:07:00 an-worker1129 smart-data-dump[2478582]: Traceback (most recent call last):
Nov 22 12:07:00 an-worker1129 smart-data-dump[2478582]:   File "/usr/local/sbin/smart-data-dump", line 460, in <module>
Nov 22 12:07:00 an-worker1129 smart-data-dump[2478582]:     sys.exit(main())
Nov 22 12:07:00 an-worker1129 smart-data-dump[2478582]:   File "/usr/local/sbin/smart-data-dump", line 429, in main
Nov 22 12:07:00 an-worker1129 smart-data-dump[2478582]:     raid_drivers = get_raid_drivers()
Nov 22 12:07:00 an-worker1129 smart-data-dump[2478582]:   File "/usr/local/sbin/smart-data-dump", line 136, in get_raid_drivers
Nov 22 12:07:00 an-worker1129 smart-data-dump[2478582]:     raw_output = _check_output(command, timeout=120, stderr=subprocess.DEVNULL)
Nov 22 12:07:00 an-worker1129 smart-data-dump[2478582]:   File "/usr/local/sbin/smart-data-dump", line 124, in _check_output
Nov 22 12:07:00 an-worker1129 smart-data-dump[2478582]:     return subprocess.check_output(cmd, stderr=stderr) \
Nov 22 12:07:00 an-worker1129 smart-data-dump[2478582]:   File "/usr/lib/python3.9/subprocess.py", line 424, in check_output
Nov 22 12:07:00 an-worker1129 smart-data-dump[2478582]:     return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
Nov 22 12:07:00 an-worker1129 smart-data-dump[2478582]:   File "/usr/lib/python3.9/subprocess.py", line 528, in run
Nov 22 12:07:00 an-worker1129 smart-data-dump[2478582]:     raise CalledProcessError(retcode, process.args,
Nov 22 12:07:00 an-worker1129 smart-data-dump[2478582]: subprocess.CalledProcessError: Command '['/usr/bin/timeout', '120', '/usr/bin/ruby', '/var/lib/puppet/lib/facter/raid.rb']' returned non-zero exit status >
Nov 22 12:07:00 an-worker1129 systemd[1]: export_smart_data_dump.service: Main process exited, code=exited, status=1/FAILURE
Nov 22 12:07:00 an-worker1129 systemd[1]: export_smart_data_dump.service: Failed with result 'exit-code'.
Nov 22 12:07:00 an-worker1129 systemd[1]: Failed to start Collect SMART information from all physical disks and report as Prometheus metrics.

Maybe it is related to the puppet 7 upgrade in some way? I can open a new ticket if you like, but I thought it probably best to add a comment here first, given that there is a lot of context here, even if it is over a year old.

Vgutierrez subscribed.

All the impacted hosts are indeed running puppet 7:

vgutierrez@cumin1001:~$ sudo -i cumin '*' 'journalctl -u export_smart_data_dump.service --since=-7d |grep -q timeout && echo impacted || echo healthy && puppet --version'
[...]
===== NODE GROUP =====                                                                    
(117) an-conf1003.eqiad.wmnet,an-presto[1001,1010-1013,1015].eqiad.wmnet,an-test-worker1001.eqiad.wmnet,an-worker[1081,1091,1100,1105-1106,1109,1113,1116,1120-1121,1128-1129,1131,1134-1135,1142,1156].eqiad.wmnet,analytics[1074-1075].eqiad.wmnet,backup[2006-2007].codfw.wmnet,backup1001.eqiad.wmnet,bast2003.wikimedia.org,cassandra-dev2001.codfw.wmnet,clouddb101
5.eqiad.wmnet,clouddumps1001.wikimedia.org,cloudelastic[1007,1010].wikimedia.org,cloudgw2003-dev.codfw.wmnet,cloudservices2004-dev.codfw.wmnet,cloudvirt[2002,2005]-dev.codfw.wmnet,cloudvirt[1033-1034,1036-1038,1043,1048,1053,1062,1064].eqiad.wmnet,cloudvirt-wdqs1003.eqiad.wmnet,cp4037.ulsfo.wmnet,dbprov2003.codfw.wmnet,dbstore1007.eqiad.wmnet,dumpsdata1007.eq
iad.wmnet,ganeti[2010,2012].codfw.wmnet,ganeti[1011,1021,1023,1025,1032].eqiad.wmnet,ganeti[5005,5007].eqsin.wmnet,ganeti4005.ulsfo.wmnet,ganeti-test2002.codfw.wmnet,ganeti-test[1001-1002].eqiad.wmnet,gitlab-runner[2002,2004].codfw.wmnet,graphite2004.codfw.wmnet,kafka-jumbo[1011-1012].eqiad.wmnet,kafka-main[2001,2005].codfw.wmnet,krb2002.codfw.wmnet,kubernete
s[2012-2013,2028,2032,2034-2035,2037,2046,2052-2053,2055].codfw.wmnet,kubernetes[1007,1017,1024,1028,1037,1039,1045,1051].eqiad.wmnet,kubestage2002.codfw.wmnet,kubestage1004.eqiad.wmnet,logstash2001.codfw.wmnet,lvs1013.eqiad.wmnet,ml-cache2001.codfw.wmnet,ml-cache1003.eqiad.wmnet,ml-serve[2005-2006].codfw.wmnet,ml-serve[1002,1006].eqiad.wmnet,ms-backup2001.co
dfw.wmnet,ms-be2051.codfw.wmnet,pki[1001-1002].eqiad.wmnet,puppetdb2003.codfw.wmnet,puppetdb1003.eqiad.wmnet,puppetserver1002.eqiad.wmnet,rdb1014.eqiad.wmnet,relforge1003.eqiad.wmnet,sessionstore2001.codfw.wmnet,titan1001.eqiad.wmnet,wdqs1009.eqiad.wmnet
----- OUTPUT of 'journalctl -u ex...puppet --version' -----                               
impacted                                                                                  
7.23.0

Cumin also reports 763 hosts running Puppet 7.23.0 without any timeouts in the last week.

Maybe it is related to the puppet 7 upgrade in some way?

Although the timing could suggest this is caused by Puppet 7, I would suggest caution for the following reasons:

  • ~50% of servers are not running Puppet 7.
  • As the migration is still in progress, we can't be sure these servers were running Puppet 7 when the timeouts happened.
  • Most importantly: this script calls raid.rb directly, so Puppet is not involved. The only library used is facter (and even then only to a very limited extent), which has not been upgraded as part of the Puppet migration.

My feeling is we still have to track down the original issue.

From the logs, the symptoms have not reappeared since Nov. 23rd (excluding moss-be, it has other problems).

Thank you for the report! We will investigate again if the issue reappears.

Confirmed, this should have been resolved when we enabled server-side caching on the puppetserver (Puppet 7) hosts, which helped reduce load quite a bit.