Facter is slow on a few hosts
Open, MediumPublic

Description

It appears bmc-config takes a long time to return on a few hosts. This behavior appears consistent and can be observed on ganeti1002 and mw2165.

time sudo /usr/bin/facter --puppet --json -l debug raid

2020-04-28 15:57:14.113322 INFO  puppetlabs.facter - executed with command line: --puppet --json -l debug raid.

<snip>

2020-04-28 15:57:15.508895 DEBUG leatherman.execution:93 - executing command: /bin/sh -c /usr/sbin/bmc-config -o -S Lan_Conf  <----------------- Hangs here a while
2020-04-28 15:59:00.925197 DEBUG | - #
# Section Lan_Conf Comments 
#
# In the Lan_Conf section, typical networking configuration is setup. Most users 
# will choose to set "Static" for the "IP_Address_Source" and set the 
# appropriate "IP_Address", "MAC_Address", "Subnet_Mask", etc. for the machine. 
#
Section Lan_Conf
	## Possible values: Unspecified/Static/Use_DHCP/Use_BIOS/Use_Others
	IP_Address_Source                             Static
	## Give valid IP address
	IP_Address                                    10.193.2.65
EndSection
2020-04-28 15:59:00.925497 DEBUG leatherman.execution:559 - process exited with status code 1.

<snip>

2020-04-28 15:59:01.009702 DEBUG puppetlabs.facter - no external facts were found.
{
  "raid": [
    "md"
  ]
}

real	1m46.947s
user	0m1.276s
sys	0m0.148s
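
For reference, the hang can be reproduced outside facter by timing the same IPMI call on its own; a minimal Python sketch (the bmc-config command line is copied from the debug output above, everything else is illustrative):

import subprocess
import time

# Run the same bmc-config invocation that facter executes for its IPMI
# facts; on an affected host this call alone accounts for the ~105s of
# wall-clock time seen above.
start = time.monotonic()
result = subprocess.run(
    ['/usr/sbin/bmc-config', '-o', '-S', 'Lan_Conf'],
    stdout=subprocess.PIPE, stderr=subprocess.PIPE)
print('exit status:', result.returncode)
print('elapsed: %.1fs' % (time.monotonic() - start))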

Event Timeline

crusnov added a project: Puppet.

Interestingly, dbprov2001 hit the timeout fetching its facter version earlier today:

Traceback (most recent call last):
  File "/usr/local/sbin/smart-data-dump", line 459, in <module>
    sys.exit(main())
  File "/usr/local/sbin/smart-data-dump", line 429, in main
    raid_drivers = get_fact('raid')
  File "/usr/local/sbin/smart-data-dump", line 134, in get_fact
    facter_version = int(_check_output('/usr/bin/facter --version', stderr=subprocess.DEVNULL)
  File "/usr/local/sbin/smart-data-dump", line 123, in _check_output
    return subprocess.check_output(cmd, stderr=stderr) \
  File "/usr/lib/python3.5/subprocess.py", line 316, in check_output
    **kwargs).stdout
  File "/usr/lib/python3.5/subprocess.py", line 398, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['/usr/bin/timeout', '60', '/usr/bin/facter', '--version']' returned non-zero exit status 124

Grafana indicates the disk was saturated at the time (09:12).
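
For context, the failing call is the coreutils timeout wrapper around facter; a rough sketch of the wrapper pattern implied by the traceback (the helper name _check_output appears in the traceback, but this implementation is an assumption):

import subprocess

def _check_output(cmd, timeout=60, stderr=None):
    # Prefix the command with coreutils timeout so a hung facter run is
    # killed; timeout exits with status 124 when the limit is hit, which
    # is exactly the status in the traceback above.
    command = ['/usr/bin/timeout', str(timeout)] + cmd.split()
    return subprocess.check_output(command, stderr=stderr).decode().strip()

# Parse the facter major version, mirroring the get_fact call in the
# traceback.
facter_version = int(
    _check_output('/usr/bin/facter --version',
                  stderr=subprocess.DEVNULL).split('.')[0])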

Change 626405 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/puppet@production] raid, smart: bypass facter timeout by calling facter script directly

https://gerrit.wikimedia.org/r/626405

Change 626405 merged by Cwhite:
[operations/puppet@production] raid, smart: bypass facter timeout by calling facter script directly

https://gerrit.wikimedia.org/r/626405
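
The idea behind the merged change is to skip the facter binary, and with it the slow IPMI facts, by running the fact's Ruby script on its own. A minimal sketch of the approach, assuming the script path and timeout shown in later comments on this task:

import json
import subprocess

# Bypass facter startup and unrelated fact resolution (including the slow
# IPMI facts) by executing the raid fact's script directly; the script
# prints JSON of the form {"raid": ["md"]}.
raw = subprocess.check_output(
    ['/usr/bin/timeout', '120', '/usr/bin/ruby',
     '/var/lib/puppet/lib/facter/raid.rb'])
raid_drivers = json.loads(raw.decode())['raid']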

Hi Cole, I ran into this while porting things away from the "raid" Puppet fact to the new "raid_mgmt_tools" fact. All the slowness was originally caused by IPMI, and the two hosts on which you ran into this were very old and have since been decommissioned. On dbprov2001 it only takes 2.3s in my test. Shall we revert/clean this up? @jbond offered to take a look if the IPMI slowness reappears.

Change 831114 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] smart: restore get_fact and deprecate get_raid_drivers

https://gerrit.wikimedia.org/r/831114

Change 831114 merged by Cwhite:

[operations/puppet@production] smart: restore get_fact and deprecate get_raid_drivers

https://gerrit.wikimedia.org/r/831114

raid_mgmt_tools does not detect raid on clouddb1021

cwhite@clouddb1021:~$ sudo /usr/bin/ruby /var/lib/puppet/lib/facter/raid.rb | jq .
{
  "raid": [
    "megaraid"
  ]
}
cwhite@clouddb1021:~$ sudo /usr/bin/facter --puppet --json -l error raid | jq .
{
  "raid": [
    "megaraid"
  ]
}
cwhite@clouddb1021:~$ sudo /usr/bin/facter --puppet --json -l error raid_mgmt_tools | jq .
{
  "raid_mgmt_tools": []
}
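
One quick way to find other hosts where the two facts disagree is to compare their values as sets; a hypothetical helper sketch (fact_set is not part of any existing tooling):

import json
import subprocess

def fact_set(name):
    # Ask facter for a single fact and return its values as a set.
    out = subprocess.check_output(
        ['/usr/bin/facter', '--puppet', '--json', '-l', 'error', name])
    return set(json.loads(out.decode()).get(name) or [])

if fact_set('raid') != fact_set('raid_mgmt_tools'):
    print('raid facts disagree on this host')
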
colewhite claimed this task.

Rolled back the changes and only one host experienced a regression. Created T317924 to handle that host.

Change 832570 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] smart: remove unused function get_raid_drivers

https://gerrit.wikimedia.org/r/832570

Change 842416 had a related patch set uploaded (by Jbond; author: jbond):

[operations/puppet@production] smart_data_dump: call raid fact directly

https://gerrit.wikimedia.org/r/842416

Change 842416 merged by Cwhite:

[operations/puppet@production] raid: use raid_mgmt_tools fact

https://gerrit.wikimedia.org/r/842416

Change 832570 abandoned by Cwhite:

[operations/puppet@production] smart: remove unused function get_raid_drivers

Reason:

no longer needed

https://gerrit.wikimedia.org/r/832570

Vgutierrez subscribed.

It looks like we are having some issues with the raid fact:

Nov 22 23:47:01 cp4037 smart-data-dump[748598]: Command '['/usr/bin/timeout', '120', '/usr/bin/ruby', '/var/lib/puppet/lib/facter/raid.rb']' returned non-zero exit status 1.
                                                Traceback (most recent call last):
                                                  File "/usr/local/sbin/smart-data-dump", line 124, in _check_output
                                                    return subprocess.check_output(cmd, stderr=stderr) \
                                                  File "/usr/lib/python3.9/subprocess.py", line 424, in check_output
                                                    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
                                                  File "/usr/lib/python3.9/subprocess.py", line 528, in run
                                                    raise CalledProcessError(retcode, process.args,
                                                subprocess.CalledProcessError: Command '['/usr/bin/timeout', '120', '/usr/bin/ruby', '/var/lib/puppet/lib/facter/raid.rb']' returned non-zero exit status 1.
Nov 22 23:47:01 cp4037 smart-data-dump[748598]: Traceback (most recent call last):
Nov 22 23:47:01 cp4037 smart-data-dump[748598]:   File "/usr/local/sbin/smart-data-dump", line 460, in <module>
Nov 22 23:47:01 cp4037 smart-data-dump[748598]:     sys.exit(main())
Nov 22 23:47:01 cp4037 smart-data-dump[748598]:   File "/usr/local/sbin/smart-data-dump", line 429, in main
Nov 22 23:47:01 cp4037 smart-data-dump[748598]:     raid_drivers = get_raid_drivers()
Nov 22 23:47:01 cp4037 smart-data-dump[748598]:   File "/usr/local/sbin/smart-data-dump", line 136, in get_raid_drivers
Nov 22 23:47:01 cp4037 smart-data-dump[748598]:     raw_output = _check_output(command, timeout=120, stderr=subprocess.DEVNULL)
Nov 22 23:47:01 cp4037 smart-data-dump[748598]:   File "/usr/local/sbin/smart-data-dump", line 124, in _check_output
Nov 22 23:47:01 cp4037 smart-data-dump[748598]:     return subprocess.check_output(cmd, stderr=stderr) \
Nov 22 23:47:01 cp4037 smart-data-dump[748598]:   File "/usr/lib/python3.9/subprocess.py", line 424, in check_output
Nov 22 23:47:01 cp4037 smart-data-dump[748598]:     return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
Nov 22 23:47:01 cp4037 smart-data-dump[748598]:   File "/usr/lib/python3.9/subprocess.py", line 528, in run
Nov 22 23:47:01 cp4037 smart-data-dump[748598]:     raise CalledProcessError(retcode, process.args,
Nov 22 23:47:01 cp4037 smart-data-dump[748598]: subprocess.CalledProcessError: Command '['/usr/bin/timeout', '120', '/usr/bin/ruby', '/var/lib/puppet/lib/facter/raid.rb']' returned non-zero exit status 1.
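
Since smart-data-dump invokes raid.rb with stderr=subprocess.DEVNULL, the cause of the exit status 1 is swallowed. A small diagnostic sketch that re-runs the fact script with stderr captured (paths and timeout are copied from the log above):

import subprocess

# Re-run the fact script with stderr captured, so the Ruby error behind
# the exit status 1 is visible instead of being discarded via DEVNULL.
result = subprocess.run(
    ['/usr/bin/timeout', '120', '/usr/bin/ruby',
     '/var/lib/puppet/lib/facter/raid.rb'],
    stdout=subprocess.PIPE, stderr=subprocess.PIPE)
print('exit status:', result.returncode)
print(result.stderr.decode(), end='')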

cumin shows at least 117 hosts impacted during the last week:

vgutierrez@cumin1001:~$ sudo -i cumin '*' 'journalctl -u export_smart_data_dump.service --since=-7d |grep -q timeout && echo impacted || echo healthy'
[...]
(117) an-conf1003.eqiad.wmnet,an-presto[1001,1010-1013,1015].eqiad.wmnet,an-test-worker1001.eqiad.wmnet,an-worker[1081,1091,1100,1105-1106,1109,1113,1116,1120-1121,1128-1129,1131,1134-1135,1142,1156].eqiad.wmnet,analytics[1074-1075].eqiad.wmnet,backup[2006-2007].codfw.wmnet,backup1001.eqiad.wmnet,bast2003.wikimedia.org,cassandra-dev2001.codfw.wmnet,clouddb1015.eqiad.wmnet,clouddumps1001.wikimedia.org,cloudelastic[1007,1010].wikimedia.org,cloudgw2003-dev.codfw.wmnet,cloudservices2004-dev.codfw.wmnet,cloudvirt[2002,2005]-dev.codfw.wmnet,cloudvirt[1033-1034,1036-1038,1043,1048,1053,1062,1064].eqiad.wmnet,cloudvirt-wdqs1003.eqiad.wmnet,cp4037.ulsfo.wmnet,dbprov2003.codfw.wmnet,dbstore1007.eqiad.wmnet,dumpsdata1007.eqiad.wmnet,ganeti[2010,2012].codfw.wmnet,ganeti[1011,1021,1023,1025,1032].eqiad.wmnet,ganeti[5005,5007].eqsin.wmnet,ganeti4005.ulsfo.wmnet,ganeti-test2002.codfw.wmnet,ganeti-test[1001-1002].eqiad.wmnet,gitlab-runner[2002,2004].codfw.wmnet,graphite2004.codfw.wmnet,kafka-jumbo[1011-1012].eqiad.wmnet,kafka-main[2001,2005].codfw.wmnet,krb2002.codfw.wmnet,kubernetes[2012-2013,2028,2032,2034-2035,2037,2046,2052-2053,2055].codfw.wmnet,kubernetes[1007,1017,1024,1028,1037,1039,1045,1051].eqiad.wmnet,kubestage2002.codfw.wmnet,kubestage1004.eqiad.wmnet,logstash2001.codfw.wmnet,lvs1013.eqiad.wmnet,ml-cache2001.codfw.wmnet,ml-cache1003.eqiad.wmnet,ml-serve[2005-2006].codfw.wmnet,ml-serve[1002,1006].eqiad.wmnet,ms-backup2001.codfw.wmnet,ms-be2051.codfw.wmnet,pki[1001-1002].eqiad.wmnet,puppetdb2003.codfw.wmnet,puppetdb1003.eqiad.wmnet,puppetserver1002.eqiad.wmnet,rdb1014.eqiad.wmnet,relforge1003.eqiad.wmnet,sessionstore2001.codfw.wmnet,titan1001.eqiad.wmnet,wdqs1009.eqiad.wmnet
----- OUTPUT of 'journalctl -u ex... || echo healthy' -----
impacted