
smart-data-dump fails occasionally due to facter timeouts
Closed, ResolvedPublic

Description

After rolling out T251293, we've noticed an increase in alerting from smart-data-dump runs: https://logstash.wikimedia.org/goto/d786c2b219de372785760d2ecda1b71f

Sep 19 08:11:00 logstash1010 systemd[1]: Started Collect SMART information from all physical disks and report as Prometheus metrics.
Sep 19 08:12:35 logstash1010 smart-data-dump[31887]: Command '['/usr/bin/timeout', '60', '/usr/bin/facter', '--puppet', '--json', '-l', 'error', 'raid_mgmt_tools']' returned non-zero exit status 124.
                                                     Traceback (most recent call last):
                                                       File "/usr/local/sbin/smart-data-dump", line 124, in _check_output
                                                         return subprocess.check_output(cmd, stderr=stderr) \
                                                       File "/usr/lib/python3.7/subprocess.py", line 395, in check_output
                                                         **kwargs).stdout
                                                       File "/usr/lib/python3.7/subprocess.py", line 487, in run
                                                         output=stdout, stderr=stderr)
                                                     subprocess.CalledProcessError: Command '['/usr/bin/timeout', '60', '/usr/bin/facter', '--puppet', '--json', '-l', 'error', 'raid_mgmt_tools']' returned non-zero exit status 124.
Sep 19 08:12:35 logstash1010 smart-data-dump[31887]: Traceback (most recent call last):
Sep 19 08:12:35 logstash1010 smart-data-dump[31887]:   File "/usr/local/sbin/smart-data-dump", line 475, in <module>
Sep 19 08:12:35 logstash1010 smart-data-dump[31887]:     sys.exit(main())
Sep 19 08:12:35 logstash1010 smart-data-dump[31887]:   File "/usr/local/sbin/smart-data-dump", line 444, in main
Sep 19 08:12:35 logstash1010 smart-data-dump[31887]:     raid_drivers = get_fact('raid_mgmt_tools')
Sep 19 08:12:35 logstash1010 smart-data-dump[31887]:   File "/usr/local/sbin/smart-data-dump", line 137, in get_fact
Sep 19 08:12:35 logstash1010 smart-data-dump[31887]:     raw_output = _check_output(command, stderr=subprocess.DEVNULL)
Sep 19 08:12:35 logstash1010 smart-data-dump[31887]:   File "/usr/local/sbin/smart-data-dump", line 124, in _check_output
Sep 19 08:12:35 logstash1010 smart-data-dump[31887]:     return subprocess.check_output(cmd, stderr=stderr) \
Sep 19 08:12:35 logstash1010 smart-data-dump[31887]:   File "/usr/lib/python3.7/subprocess.py", line 395, in check_output
Sep 19 08:12:35 logstash1010 smart-data-dump[31887]:     **kwargs).stdout
Sep 19 08:12:35 logstash1010 smart-data-dump[31887]:   File "/usr/lib/python3.7/subprocess.py", line 487, in run
Sep 19 08:12:35 logstash1010 smart-data-dump[31887]:     output=stdout, stderr=stderr)
Sep 19 08:12:35 logstash1010 smart-data-dump[31887]: subprocess.CalledProcessError: Command '['/usr/bin/timeout', '60', '/usr/bin/facter', '--puppet', '--json', '-l', 'error', 'raid_mgmt_tools']' returned non-zero exit status 124.
Sep 19 08:12:35 logstash1010 systemd[1]: export_smart_data_dump.service: Main process exited, code=exited, status=1/FAILURE
Sep 19 08:12:35 logstash1010 systemd[1]: export_smart_data_dump.service: Failed with result 'exit-code'.
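
For context, the failing call path reconstructed from the traceback looks roughly like the sketch below. Only the command line (a 60-second /usr/bin/timeout around facter) and the function names _check_output and get_fact come from the log above; the rest is an assumption about how smart-data-dump wires them together.

```
import json
import subprocess


def _check_output(cmd, stderr=None):
    # If /usr/bin/timeout kills facter, it exits with status 124 and
    # check_output() raises CalledProcessError, which is the failure
    # shown in the journal above.
    return subprocess.check_output(cmd, stderr=stderr).decode('utf-8')


def get_fact(fact_name):
    # Ask facter for a single puppet fact as JSON, capped at 60 seconds.
    command = ['/usr/bin/timeout', '60', '/usr/bin/facter',
               '--puppet', '--json', '-l', 'error', fact_name]
    raw_output = _check_output(command, stderr=subprocess.DEVNULL)
    # facter --json prints an object keyed by the requested fact name.
    return json.loads(raw_output).get(fact_name)


raid_drivers = get_fact('raid_mgmt_tools')
```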

Related Objects

Event Timeline

Change 842416 had a related patch set uploaded (by Jbond; author: jbond):

[operations/puppet@production] smart_data_dump: call raid fact directly

https://gerrit.wikimedia.org/r/842416

This is likely still the original IPMI issue. As it's so sporadic, I'm not sure it's worth trying to track down, or it is at least a low priority; as such I have created a CR to move back to calling raid.rb directly.

Change 842471 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] Revert "smart: restore get_fact and deprecate get_raid_drivers"

https://gerrit.wikimedia.org/r/842471

Change 842471 merged by Cwhite:

[operations/puppet@production] Revert "smart: restore get_fact and deprecate get_raid_drivers"

https://gerrit.wikimedia.org/r/842471

Change 842416 merged by Cwhite:

[operations/puppet@production] raid: use raid_mgmt_tools fact

https://gerrit.wikimedia.org/r/842416

Rolled back the change that used facter; smart-data-dump is back to calling raid.rb directly. Updated raid.rb to look at the new metric.

We'll check back in a week to see if there are any more timeouts.
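
After the rollback, the same wrapper invokes the fact script directly under ruby with a 120-second timeout, roughly as sketched below. The command and the get_raid_drivers name come from the traceback reported later in this task; how the script's output is parsed is an assumption.

```
import subprocess


def get_raid_drivers():
    # Run the puppet custom-fact script directly, bypassing facter (and
    # puppet) entirely; /usr/bin/timeout enforces the 120-second cap.
    command = ['/usr/bin/timeout', '120', '/usr/bin/ruby',
               '/var/lib/puppet/lib/facter/raid.rb']
    raw_output = subprocess.check_output(
        command, stderr=subprocess.DEVNULL).decode('utf-8')
    # Any non-zero exit (124 on a timeout, or raid.rb's own failure)
    # raises CalledProcessError and fails the systemd unit just as the
    # facter path did. Splitting into whitespace-separated driver names
    # is purely illustrative here.
    return raw_output.split()
```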

We have seen another recent uptick in these smart-data-dump timeouts, so I thought I would let you know about it.
In November alone there have been 20 of these alerts across the hosts managed by Data-Engineering and Data-Platform-SRE.

image.png (screenshot attachment, 99 KB)

Nov 22 12:07:00 an-worker1129 smart-data-dump[2478582]: Command '['/usr/bin/timeout', '120', '/usr/bin/ruby', '/var/lib/puppet/lib/facter/raid.rb']' returned non-zero exit status 1.
                                                        Traceback (most recent call last):
                                                          File "/usr/local/sbin/smart-data-dump", line 124, in _check_output
                                                            return subprocess.check_output(cmd, stderr=stderr) \
                                                          File "/usr/lib/python3.9/subprocess.py", line 424, in check_output
                                                            return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
                                                          File "/usr/lib/python3.9/subprocess.py", line 528, in run
                                                            raise CalledProcessError(retcode, process.args,
                                                        subprocess.CalledProcessError: Command '['/usr/bin/timeout', '120', '/usr/bin/ruby', '/var/lib/puppet/lib/facter/raid.rb']' returned non-zero exit status >
Nov 22 12:07:00 an-worker1129 smart-data-dump[2478582]: Traceback (most recent call last):
Nov 22 12:07:00 an-worker1129 smart-data-dump[2478582]:   File "/usr/local/sbin/smart-data-dump", line 460, in <module>
Nov 22 12:07:00 an-worker1129 smart-data-dump[2478582]:     sys.exit(main())
Nov 22 12:07:00 an-worker1129 smart-data-dump[2478582]:   File "/usr/local/sbin/smart-data-dump", line 429, in main
Nov 22 12:07:00 an-worker1129 smart-data-dump[2478582]:     raid_drivers = get_raid_drivers()
Nov 22 12:07:00 an-worker1129 smart-data-dump[2478582]:   File "/usr/local/sbin/smart-data-dump", line 136, in get_raid_drivers
Nov 22 12:07:00 an-worker1129 smart-data-dump[2478582]:     raw_output = _check_output(command, timeout=120, stderr=subprocess.DEVNULL)
Nov 22 12:07:00 an-worker1129 smart-data-dump[2478582]:   File "/usr/local/sbin/smart-data-dump", line 124, in _check_output
Nov 22 12:07:00 an-worker1129 smart-data-dump[2478582]:     return subprocess.check_output(cmd, stderr=stderr) \
Nov 22 12:07:00 an-worker1129 smart-data-dump[2478582]:   File "/usr/lib/python3.9/subprocess.py", line 424, in check_output
Nov 22 12:07:00 an-worker1129 smart-data-dump[2478582]:     return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
Nov 22 12:07:00 an-worker1129 smart-data-dump[2478582]:   File "/usr/lib/python3.9/subprocess.py", line 528, in run
Nov 22 12:07:00 an-worker1129 smart-data-dump[2478582]:     raise CalledProcessError(retcode, process.args,
Nov 22 12:07:00 an-worker1129 smart-data-dump[2478582]: subprocess.CalledProcessError: Command '['/usr/bin/timeout', '120', '/usr/bin/ruby', '/var/lib/puppet/lib/facter/raid.rb']' returned non-zero exit status >
Nov 22 12:07:00 an-worker1129 systemd[1]: export_smart_data_dump.service: Main process exited, code=exited, status=1/FAILURE
Nov 22 12:07:00 an-worker1129 systemd[1]: export_smart_data_dump.service: Failed with result 'exit-code'.
Nov 22 12:07:00 an-worker1129 systemd[1]: Failed to start Collect SMART information from all physical disks and report as Prometheus metrics.

Maybe it is related to the puppet 7 upgrade in some way? I can open a new ticket if you like, but I thought it probably best to add a comment here first, given that there is a lot of context here, even if it is over a year old.

Vgutierrez subscribed.

All the impacted hosts are indeed running puppet 7:

vgutierrez@cumin1001:~$ sudo -i cumin '*' 'journalctl -u export_smart_data_dump.service --since=-7d |grep -q timeout && echo impacted || echo healthy && puppet --version'
[...]
===== NODE GROUP =====                                                                    
(117) an-conf1003.eqiad.wmnet,an-presto[1001,1010-1013,1015].eqiad.wmnet,an-test-worker1001.eqiad.wmnet,an-worker[1081,1091,1100,1105-1106,1109,1113,1116,1120-1121,1128-1129,1131,1134-1135,1142,1156].eqiad.wmnet,analytics[1074-1075].eqiad.wmnet,backup[2006-2007].codfw.wmnet,backup1001.eqiad.wmnet,bast2003.wikimedia.org,cassandra-dev2001.codfw.wmnet,clouddb101
5.eqiad.wmnet,clouddumps1001.wikimedia.org,cloudelastic[1007,1010].wikimedia.org,cloudgw2003-dev.codfw.wmnet,cloudservices2004-dev.codfw.wmnet,cloudvirt[2002,2005]-dev.codfw.wmnet,cloudvirt[1033-1034,1036-1038,1043,1048,1053,1062,1064].eqiad.wmnet,cloudvirt-wdqs1003.eqiad.wmnet,cp4037.ulsfo.wmnet,dbprov2003.codfw.wmnet,dbstore1007.eqiad.wmnet,dumpsdata1007.eq
iad.wmnet,ganeti[2010,2012].codfw.wmnet,ganeti[1011,1021,1023,1025,1032].eqiad.wmnet,ganeti[5005,5007].eqsin.wmnet,ganeti4005.ulsfo.wmnet,ganeti-test2002.codfw.wmnet,ganeti-test[1001-1002].eqiad.wmnet,gitlab-runner[2002,2004].codfw.wmnet,graphite2004.codfw.wmnet,kafka-jumbo[1011-1012].eqiad.wmnet,kafka-main[2001,2005].codfw.wmnet,krb2002.codfw.wmnet,kubernete
s[2012-2013,2028,2032,2034-2035,2037,2046,2052-2053,2055].codfw.wmnet,kubernetes[1007,1017,1024,1028,1037,1039,1045,1051].eqiad.wmnet,kubestage2002.codfw.wmnet,kubestage1004.eqiad.wmnet,logstash2001.codfw.wmnet,lvs1013.eqiad.wmnet,ml-cache2001.codfw.wmnet,ml-cache1003.eqiad.wmnet,ml-serve[2005-2006].codfw.wmnet,ml-serve[1002,1006].eqiad.wmnet,ms-backup2001.co
dfw.wmnet,ms-be2051.codfw.wmnet,pki[1001-1002].eqiad.wmnet,puppetdb2003.codfw.wmnet,puppetdb1003.eqiad.wmnet,puppetserver1002.eqiad.wmnet,rdb1014.eqiad.wmnet,relforge1003.eqiad.wmnet,sessionstore2001.codfw.wmnet,titan1001.eqiad.wmnet,wdqs1009.eqiad.wmnet
----- OUTPUT of 'journalctl -u ex...puppet --version' -----                               
impacted                                                                                  
7.23.0

Cumin also reports 763 hosts running Puppet 7.23.0 without any timeouts in the last week.

Maybe it is related to the puppet 7 upgrade in some way?

Although the timing could suggest this is caused by Puppet 7, I would suggest caution for the following reasons:

  • ~50% of servers are not running Puppet 7.
  • As the migration is still in progress, we can't be sure these servers were running Puppet 7 when the timeouts happened.
  • Most importantly: this script calls raid.rb directly, so Puppet is not involved. The only library used is facter (and even then only to a very limited extent), which has not been upgraded as part of the Puppet migration.

My feeling is we still have to track down the original issue.

From the logs, the symptoms have not reappeared since Nov. 23rd (excluding moss-be, it has other problems).

Thank you for the report! We will investigate again if the issue reappears.

Confirmed, this should have been resolved when we enabled server-side caching on the puppetserver (Puppet 7) hosts, which helped reduce load quite a bit.