Tonight we got some cron spam from the k8s hosts with:
2019-11-03 03:34:05.451970 ERROR puppetlabs.facter - error while resolving custom facts in /var/lib/puppet/lib/facter/lvm_support.rb: command timed out after 60 seconds.
Traceback (most recent call last):
  File "/usr/local/sbin/smart-data-dump", line 338, in <module>
    sys.exit(main())
  File "/usr/local/sbin/smart-data-dump", line 308, in main
    raid_drivers = get_fact('raid', facter_version)
  File "/usr/local/sbin/smart-data-dump", line 94, in get_fact
    raw_output = subprocess.check_output(command)
  File "/usr/lib/python3.5/subprocess.py", line 316, in check_output
    **kwargs).stdout
  File "/usr/lib/python3.5/subprocess.py", line 398, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['/usr/bin/facter', '--puppet', '--json', 'raid', '-l', 'error']' returned non-zero exit status 1
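For context, the crash itself is smart-data-dump treating any non-zero facter exit as fatal. A minimal sketch (hypothetical, with a simplified signature, not the current code) of a get_fact that tolerates a slow or failing facter, so a transient fact timeout degrades to a warning instead of a traceback and cron spam:

import json
import logging
import subprocess

def get_fact(fact, timeout=120):
    """Query a single Puppet fact via facter; return None instead of
    raising if facter fails or exceeds the timeout."""
    command = ['/usr/bin/facter', '--puppet', '--json', fact, '-l', 'error']
    try:
        raw_output = subprocess.check_output(command, timeout=timeout)
    except (subprocess.CalledProcessError, subprocess.TimeoutExpired) as e:
        logging.warning('facter lookup for %s failed: %s', fact, e)
        return None
    # facter --json emits a dict keyed by fact name; decode for Python 3.5
    return json.loads(raw_output.decode()).get(fact)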
Being the first Sunday of the month, I'm pretty sure this is related to the mdadm data-check, as the timing of the spam emails matches it:
Nov 3 00:57:01 kubernetes2003 kernel: [14389414.942606] md: data-check of RAID array md0
Nov 3 00:57:01 kubernetes2003 kernel: [14389414.942609] md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
Nov 3 00:57:01 kubernetes2003 kernel: [14389414.942611] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for data-check.
Nov 3 00:57:01 kubernetes2003 kernel: [14389414.942620] md: using 128k window, over a total of 29279232k.
Nov 3 00:57:01 kubernetes2003 kernel: [14389415.104170] md: delaying data-check of md1 until md0 has finished (they share one or more physical units)
Nov 3 01:00:43 kubernetes2003 kernel: [14389637.262236] md: md0: data-check done.
Nov 3 01:00:43 kubernetes2003 kernel: [14389637.278157] md: data-check of RAID array md1
Nov 3 01:00:43 kubernetes2003 kernel: [14389637.278160] md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
Nov 3 01:00:43 kubernetes2003 kernel: [14389637.278162] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for data-check.
Nov 3 01:00:43 kubernetes2003 kernel: [14389637.278173] md: using 128k window, over a total of 947334144k.
Nov 3 03:46:35 kubernetes2003 kernel: [14399587.554875] md: md1: data-check done.
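If the I/O load from the check is what pushes facter past its 60-second budget, one possible mitigation (a sketch only, assuming the standard Linux md sysctls and root access; the value is a placeholder to tune per host) is to lower the check bandwidth ceiling the kernel reports above as 200000 KB/sec:

#!/usr/bin/env python3
# Cap the system-wide md resync/check speed ceiling. Must run as
# root; the setting lasts until reboot or until something writes
# the sysctl again.

SPEED_LIMIT_MAX = '/proc/sys/dev/raid/speed_limit_max'

def cap_md_check_speed(kb_per_sec):
    with open(SPEED_LIMIT_MAX, 'w') as f:
        f.write('%d\n' % kb_per_sec)

if __name__ == '__main__':
    cap_md_check_speed(50000)  # placeholder value, tune per host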
You might want to consider an approach similar to what we've done with the etcd cluster; see modules/profile/manifests/etcd.pp.