Facter is slow on a few hosts
Open, MediumPublic

Description

It appears bmc-config takes a long time to return on a few hosts. This behavior appears consistent and can be observed on ganeti1002 and mw2165.

time sudo /usr/bin/facter --puppet --json -l debug raid

2020-04-28 15:57:14.113322 INFO  puppetlabs.facter - executed with command line: --puppet --json -l debug raid.

<snip>

2020-04-28 15:57:15.508895 DEBUG leatherman.execution:93 - executing command: /bin/sh -c /usr/sbin/bmc-config -o -S Lan_Conf  <----------------- Hangs here a while
2020-04-28 15:59:00.925197 DEBUG | - #
# Section Lan_Conf Comments 
#
# In the Lan_Conf section, typical networking configuration is setup. Most users 
# will choose to set "Static" for the "IP_Address_Source" and set the 
# appropriate "IP_Address", "MAC_Address", "Subnet_Mask", etc. for the machine. 
#
Section Lan_Conf
	## Possible values: Unspecified/Static/Use_DHCP/Use_BIOS/Use_Others
	IP_Address_Source                             Static
	## Give valid IP address
	IP_Address                                    10.193.2.65
EndSection
2020-04-28 15:59:00.925497 DEBUG leatherman.execution:559 - process exited with status code 1.

<snip>

2020-04-28 15:59:01.009702 DEBUG puppetlabs.facter - no external facts were found.
{
  "raid": [
    "md"
  ]
}

real	1m46.947s
user	0m1.276s
sys	0m0.148s
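
For reference, the hang can be reproduced outside facter by timing the same IPMI call on its own; a minimal Python sketch (the bmc-config command line is copied from the debug output above, everything else is illustrative):

import subprocess
import time

# Run the same bmc-config invocation that facter executes for its IPMI
# facts; on an affected host this call alone accounts for the ~105s of
# wall-clock time seen above.
start = time.monotonic()
result = subprocess.run(
    ['/usr/sbin/bmc-config', '-o', '-S', 'Lan_Conf'],
    stdout=subprocess.PIPE, stderr=subprocess.PIPE)
print('exit status:', result.returncode)
print('elapsed: %.1fs' % (time.monotonic() - start))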

Event Timeline

crusnov added a project: Puppet.

Interestingly, dbprov2001 hit the timeout fetching its facter version earlier today:

Traceback (most recent call last):
  File "/usr/local/sbin/smart-data-dump", line 459, in <module>
    sys.exit(main())
  File "/usr/local/sbin/smart-data-dump", line 429, in main
    raid_drivers = get_fact('raid')
  File "/usr/local/sbin/smart-data-dump", line 134, in get_fact
    facter_version = int(_check_output('/usr/bin/facter --version', stderr=subprocess.DEVNULL)
  File "/usr/local/sbin/smart-data-dump", line 123, in _check_output
    return subprocess.check_output(cmd, stderr=stderr) \
  File "/usr/lib/python3.5/subprocess.py", line 316, in check_output
    **kwargs).stdout
  File "/usr/lib/python3.5/subprocess.py", line 398, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['/usr/bin/timeout', '60', '/usr/bin/facter', '--version']' returned non-zero exit status 124

Grafana indicates the disk was saturated at the time (09:12).
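
For context, the failing call is the coreutils timeout wrapper around facter; a rough sketch of the wrapper pattern implied by the traceback (the helper name _check_output appears in the traceback, but this implementation is an assumption):

import subprocess

def _check_output(cmd, timeout=60, stderr=None):
    # Prefix the command with coreutils timeout so a hung facter run is
    # killed; timeout exits with status 124 when the limit is hit, which
    # is exactly the status in the traceback above.
    command = ['/usr/bin/timeout', str(timeout)] + cmd.split()
    return subprocess.check_output(command, stderr=stderr).decode().strip()

# Parse the facter major version, mirroring the get_fact call in the
# traceback.
facter_version = int(
    _check_output('/usr/bin/facter --version',
                  stderr=subprocess.DEVNULL).split('.')[0])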

Change 626405 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/puppet@production] raid, smart: bypass facter timeout by calling facter script directly

https://gerrit.wikimedia.org/r/626405

Change 626405 merged by Cwhite:
[operations/puppet@production] raid, smart: bypass facter timeout by calling facter script directly

https://gerrit.wikimedia.org/r/626405
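
The idea behind the merged change is to skip the facter binary, and with it the slow IPMI facts, by running the fact's Ruby script on its own. A minimal sketch of the approach, assuming the script path and timeout shown in later comments on this task:

import json
import subprocess

# Bypass facter startup and unrelated fact resolution (including the slow
# IPMI facts) by executing the raid fact's script directly; the script
# prints JSON of the form {"raid": ["md"]}.
raw = subprocess.check_output(
    ['/usr/bin/timeout', '120', '/usr/bin/ruby',
     '/var/lib/puppet/lib/facter/raid.rb'])
raid_drivers = json.loads(raw.decode())['raid']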

Hi Cole, I ran into this while porting things away from the "raid" Puppet fact to the new "raid_mgmt_tools" fact. All the slowness was originally caused by IPMI, and the two hosts on which you ran into this were very old and have since been decommissioned. On dbprov2001 it only takes 2.3s in my test. Shall we revert/clean this up? @jbond offered to take a look if the IPMI slowness reappears.

Change 831114 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] smart: restore get_fact and deprecate get_raid_drivers

https://gerrit.wikimedia.org/r/831114

Change 831114 merged by Cwhite:

[operations/puppet@production] smart: restore get_fact and deprecate get_raid_drivers

https://gerrit.wikimedia.org/r/831114

raid_mgmt_tools does not detect raid on clouddb1021

cwhite@clouddb1021:~$ sudo /usr/bin/ruby /var/lib/puppet/lib/facter/raid.rb | jq .
{
  "raid": [
    "megaraid"
  ]
}
cwhite@clouddb1021:~$ sudo /usr/bin/facter --puppet --json -l error raid | jq .
{
  "raid": [
    "megaraid"
  ]
}
cwhite@clouddb1021:~$ sudo /usr/bin/facter --puppet --json -l error raid_mgmt_tools | jq .
{
  "raid_mgmt_tools": []
}
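
One quick way to find other hosts where the two facts disagree is to compare their values as sets; a hypothetical helper sketch (fact_set is not part of any existing tooling):

import json
import subprocess

def fact_set(name):
    # Ask facter for a single fact and return its values as a set.
    out = subprocess.check_output(
        ['/usr/bin/facter', '--puppet', '--json', '-l', 'error', name])
    return set(json.loads(out.decode()).get(name) or [])

if fact_set('raid') != fact_set('raid_mgmt_tools'):
    print('raid facts disagree on this host')
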
colewhite claimed this task.

Rolled back the changes and only one host experienced a regression. Created T317924 to handle that host.

Change 832570 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] smart: remove unused function get_raid_drivers

https://gerrit.wikimedia.org/r/832570

Change 842416 had a related patch set uploaded (by Jbond; author: jbond):

[operations/puppet@production] smart_data_dump: call raid fact directly

https://gerrit.wikimedia.org/r/842416

Change 842416 merged by Cwhite:

[operations/puppet@production] raid: use raid_mgmt_tools fact

https://gerrit.wikimedia.org/r/842416

Change 832570 abandoned by Cwhite:

[operations/puppet@production] smart: remove unused function get_raid_drivers

Reason:

no longer needed

https://gerrit.wikimedia.org/r/832570

Vgutierrez subscribed.

It looks like we are having some issues with the raid fact:

Nov 22 23:47:01 cp4037 smart-data-dump[748598]: Command '['/usr/bin/timeout', '120', '/usr/bin/ruby', '/var/lib/puppet/lib/facter/raid.rb']' returned non-zero exit status 1.
                                                Traceback (most recent call last):
                                                  File "/usr/local/sbin/smart-data-dump", line 124, in _check_output
                                                    return subprocess.check_output(cmd, stderr=stderr) \
                                                  File "/usr/lib/python3.9/subprocess.py", line 424, in check_output
                                                    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
                                                  File "/usr/lib/python3.9/subprocess.py", line 528, in run
                                                    raise CalledProcessError(retcode, process.args,
                                                subprocess.CalledProcessError: Command '['/usr/bin/timeout', '120', '/usr/bin/ruby', '/var/lib/puppet/lib/facter/raid.rb']' returned non-zero exit status 1.
Nov 22 23:47:01 cp4037 smart-data-dump[748598]: Traceback (most recent call last):
Nov 22 23:47:01 cp4037 smart-data-dump[748598]:   File "/usr/local/sbin/smart-data-dump", line 460, in <module>
Nov 22 23:47:01 cp4037 smart-data-dump[748598]:     sys.exit(main())
Nov 22 23:47:01 cp4037 smart-data-dump[748598]:   File "/usr/local/sbin/smart-data-dump", line 429, in main
Nov 22 23:47:01 cp4037 smart-data-dump[748598]:     raid_drivers = get_raid_drivers()
Nov 22 23:47:01 cp4037 smart-data-dump[748598]:   File "/usr/local/sbin/smart-data-dump", line 136, in get_raid_drivers
Nov 22 23:47:01 cp4037 smart-data-dump[748598]:     raw_output = _check_output(command, timeout=120, stderr=subprocess.DEVNULL)
Nov 22 23:47:01 cp4037 smart-data-dump[748598]:   File "/usr/local/sbin/smart-data-dump", line 124, in _check_output
Nov 22 23:47:01 cp4037 smart-data-dump[748598]:     return subprocess.check_output(cmd, stderr=stderr) \
Nov 22 23:47:01 cp4037 smart-data-dump[748598]:   File "/usr/lib/python3.9/subprocess.py", line 424, in check_output
Nov 22 23:47:01 cp4037 smart-data-dump[748598]:     return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
Nov 22 23:47:01 cp4037 smart-data-dump[748598]:   File "/usr/lib/python3.9/subprocess.py", line 528, in run
Nov 22 23:47:01 cp4037 smart-data-dump[748598]:     raise CalledProcessError(retcode, process.args,
Nov 22 23:47:01 cp4037 smart-data-dump[748598]: subprocess.CalledProcessError: Command '['/usr/bin/timeout', '120', '/usr/bin/ruby', '/var/lib/puppet/lib/facter/raid.rb']' returned non-zero exit status 1.
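
Since smart-data-dump invokes raid.rb with stderr=subprocess.DEVNULL, the cause of the exit status 1 is swallowed. A small diagnostic sketch that re-runs the fact script with stderr captured (paths and timeout are copied from the log above):

import subprocess

# Re-run the fact script with stderr captured, so the Ruby error behind
# the exit status 1 is visible instead of being discarded via DEVNULL.
result = subprocess.run(
    ['/usr/bin/timeout', '120', '/usr/bin/ruby',
     '/var/lib/puppet/lib/facter/raid.rb'],
    stdout=subprocess.PIPE, stderr=subprocess.PIPE)
print('exit status:', result.returncode)
print(result.stderr.decode(), end='')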

cumin shows at least 117 hosts impacted during the last week:

vgutierrez@cumin1001:~$ sudo -i cumin '*' 'journalctl -u export_smart_data_dump.service --since=-7d |grep -q timeout && echo impacted || echo healthy'
[...]
(117) an-conf1003.eqiad.wmnet,an-presto[1001,1010-1013,1015].eqiad.wmnet,an-test-worker1001.eqiad.wmnet,an-worker[1081,1091,1100,1105-1106,1109,1113,1116,1120-1121,1128-1129,1131,1134-1135,1142,1156].eqiad.wmnet,analytics[1074-1075].eqiad.wmnet,backup[2006-2007].codfw.wmnet,backup1001.eqiad.wmnet,bast2003.wikimedia.org,cassandra-dev2001.codfw.wmnet,clouddb1015.eqiad.wmnet,clouddumps1001.wikimedia.org,cloudelastic[1007,1010].wikimedia.org,cloudgw2003-dev.codfw.wmnet,cloudservices2004-dev.codfw.wmnet,cloudvirt[2002,2005]-dev.codfw.wmnet,cloudvirt[1033-1034,1036-1038,1043,1048,1053,1062,1064].eqiad.wmnet,cloudvirt-wdqs1003.eqiad.wmnet,cp4037.ulsfo.wmnet,dbprov2003.codfw.wmnet,dbstore1007.eqiad.wmnet,dumpsdata1007.eqiad.wmnet,ganeti[2010,2012].codfw.wmnet,ganeti[1011,1021,1023,1025,1032].eqiad.wmnet,ganeti[5005,5007].eqsin.wmnet,ganeti4005.ulsfo.wmnet,ganeti-test2002.codfw.wmnet,ganeti-test[1001-1002].eqiad.wmnet,gitlab-runner[2002,2004].codfw.wmnet,graphite2004.codfw.wmnet,kafka-jumbo[1011-1012].eqiad.wmnet,kafka-main[2001,2005].codfw.wmnet,krb2002.codfw.wmnet,kubernetes[2012-2013,2028,2032,2034-2035,2037,2046,2052-2053,2055].codfw.wmnet,kubernetes[1007,1017,1024,1028,1037,1039,1045,1051].eqiad.wmnet,kubestage2002.codfw.wmnet,kubestage1004.eqiad.wmnet,logstash2001.codfw.wmnet,lvs1013.eqiad.wmnet,ml-cache2001.codfw.wmnet,ml-cache1003.eqiad.wmnet,ml-serve[2005-2006].codfw.wmnet,ml-serve[1002,1006].eqiad.wmnet,ms-backup2001.codfw.wmnet,ms-be2051.codfw.wmnet,pki[1001-1002].eqiad.wmnet,puppetdb2003.codfw.wmnet,puppetdb1003.eqiad.wmnet,puppetserver1002.eqiad.wmnet,rdb1014.eqiad.wmnet,relforge1003.eqiad.wmnet,sessionstore2001.codfw.wmnet,titan1001.eqiad.wmnet,wdqs1009.eqiad.wmnet
----- OUTPUT of 'journalctl -u ex... || echo healthy' -----
impacted