
Kubernetes hosts: RAID check makes facter fail
Open, Medium, Public

Description

Tonight we got some cron spam from the k8s hosts with:

2019-11-03 03:34:05.451970 ERROR puppetlabs.facter - error while resolving custom facts in /var/lib/puppet/lib/facter/lvm_support.rb: command timed out after 60 seconds.

Traceback (most recent call last):
  File "/usr/local/sbin/smart-data-dump", line 338, in <module>
    sys.exit(main())
  File "/usr/local/sbin/smart-data-dump", line 308, in main
    raid_drivers = get_fact('raid', facter_version)
  File "/usr/local/sbin/smart-data-dump", line 94, in get_fact
    raw_output = subprocess.check_output(command)
  File "/usr/lib/python3.5/subprocess.py", line 316, in check_output
    **kwargs).stdout
  File "/usr/lib/python3.5/subprocess.py", line 398, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['/usr/bin/facter', '--puppet', '--json', 'raid', '-l', 'error']' returned non-zero exit status 1
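The traceback shows `get_fact` calling `check_output`, so any non-zero facter exit (here caused by the custom-fact timeout) escapes as an unhandled `CalledProcessError`. A minimal sketch of a more tolerant wrapper — the facter command line is taken from the traceback, but the function signature, timeout handling, and "return None on failure" behaviour are illustrative assumptions, not the deployed smart-data-dump code:

```python
import json
import subprocess


def get_fact(fact, facter='/usr/bin/facter', timeout=120):
    """Query one facter fact as JSON, tolerating slow or failing runs.

    Hypothetical hardening sketch: names, defaults, and error policy
    are assumptions, not the actual smart-data-dump implementation.
    """
    command = [facter, '--puppet', '--json', fact, '-l', 'error']
    try:
        raw = subprocess.run(command, stdout=subprocess.PIPE,
                             check=True, timeout=timeout).stdout
    except (subprocess.TimeoutExpired, subprocess.CalledProcessError):
        return None  # slow or failing facter -> treat the fact as unavailable
    try:
        return json.loads(raw.decode()).get(fact)
    except ValueError:
        return None  # facter emitted something that is not JSON
```

With this shape, a facter run that times out (as during the data-check window) degrades to a missing fact instead of cron spam from an uncaught exception.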

Since this is the first Sunday of the month, I'm pretty sure this is related to the mdadm data-check, as the timing of the spam emails matches it:

Nov  3 00:57:01 kubernetes2003 kernel: [14389414.942606] md: data-check of RAID array md0
Nov  3 00:57:01 kubernetes2003 kernel: [14389414.942609] md: minimum _guaranteed_  speed: 1000 KB/sec/disk.
Nov  3 00:57:01 kubernetes2003 kernel: [14389414.942611] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for data-check.
Nov  3 00:57:01 kubernetes2003 kernel: [14389414.942620] md: using 128k window, over a total of 29279232k.
Nov  3 00:57:01 kubernetes2003 kernel: [14389415.104170] md: delaying data-check of md1 until md0 has finished (they share one or more physical units)
Nov  3 01:00:43 kubernetes2003 kernel: [14389637.262236] md: md0: data-check done.
Nov  3 01:00:43 kubernetes2003 kernel: [14389637.278157] md: data-check of RAID array md1
Nov  3 01:00:43 kubernetes2003 kernel: [14389637.278160] md: minimum _guaranteed_  speed: 1000 KB/sec/disk.
Nov  3 01:00:43 kubernetes2003 kernel: [14389637.278162] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for data-check.
Nov  3 01:00:43 kubernetes2003 kernel: [14389637.278173] md: using 128k window, over a total of 947334144k.
Nov  3 03:46:35 kubernetes2003 kernel: [14399587.554875] md: md1: data-check done.

You might want to consider an approach similar to what we did for the etcd cluster; see modules/profile/manifests/etcd.pp.

Event Timeline

Volans created this task. Nov 3 2019, 3:26 PM
MoritzMuehlenhoff triaged this task as Medium priority. Nov 4 2019, 9:14 AM

Change 549847 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] md: Globally set lower sync limits

https://gerrit.wikimedia.org/r/549847

Change 549847 merged by Alexandros Kosiaris:
[operations/puppet@production] md: Globally set lower sync limits

https://gerrit.wikimedia.org/r/549847
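The change title points at the kernel's global md sync throttles, which on stock Linux are the `dev.raid.speed_limit_min`/`speed_limit_max` sysctls (the kernel log above shows the defaults: 1000 KB/sec/disk minimum, 200000 KB/sec cap). A sketch of what such a fragment could look like — the values here are illustrative assumptions, not the contents of the merged patch:

```
# /etc/sysctl.d/raid-sync-limits.conf -- illustrative values only
# Cap per-disk resync/data-check bandwidth so the monthly mdadm
# data-check leaves enough I/O headroom for facter and smartctl.
dev.raid.speed_limit_min = 1000
dev.raid.speed_limit_max = 50000
```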

jijiki added a subscriber: jijiki. Dec 9 2019, 4:38 PM
jijiki moved this task from Incoming 🐫 to Unsorted on the serviceops board. Aug 17 2020, 11:46 PM