
Kubernetes hosts RAID check makes facter fail
Closed, Resolved · Public

Description

Tonight we got some cron spam from the k8s hosts with:

2019-11-03 03:34:05.451970 ERROR puppetlabs.facter - error while resolving custom facts in /var/lib/puppet/lib/facter/lvm_support.rb: command timed out after 60 seconds.

Traceback (most recent call last):
  File "/usr/local/sbin/smart-data-dump", line 338, in <module>
    sys.exit(main())
  File "/usr/local/sbin/smart-data-dump", line 308, in main
    raid_drivers = get_fact('raid', facter_version)
  File "/usr/local/sbin/smart-data-dump", line 94, in get_fact
    raw_output = subprocess.check_output(command)
  File "/usr/lib/python3.5/subprocess.py", line 316, in check_output
    **kwargs).stdout
  File "/usr/lib/python3.5/subprocess.py", line 398, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['/usr/bin/facter', '--puppet', '--json', 'raid', '-l', 'error']' returned non-zero exit status 1
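
The chain here is: the md data-check starves I/O, facter's custom lvm_support.rb fact hits its 60-second timeout, facter exits non-zero, and subprocess.check_output in smart-data-dump turns that into a CalledProcessError that kills the whole script. A hypothetical sketch of a more tolerant fact lookup (not the actual smart-data-dump code; only the facter command line is taken from the traceback above):

import json
import subprocess


def get_fact_tolerant(fact_name, timeout=120):
    """Query a single facter fact without letting a slow custom fact kill the caller.

    Hypothetical sketch: returns None instead of raising when facter
    exits non-zero or exceeds the timeout.
    """
    command = ['/usr/bin/facter', '--puppet', '--json', fact_name, '-l', 'error']
    try:
        raw_output = subprocess.check_output(command, timeout=timeout)
    except (subprocess.CalledProcessError, subprocess.TimeoutExpired):
        return None
    try:
        return json.loads(raw_output.decode('utf-8')).get(fact_name)
    except ValueError:
        return None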

Since it's the first Sunday of the month, I'm pretty sure this is related to the mdadm data-check, as the timing of the spam emails matches it:

Nov  3 00:57:01 kubernetes2003 kernel: [14389414.942606] md: data-check of RAID array md0
Nov  3 00:57:01 kubernetes2003 kernel: [14389414.942609] md: minimum _guaranteed_  speed: 1000 KB/sec/disk.
Nov  3 00:57:01 kubernetes2003 kernel: [14389414.942611] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for data-check.
Nov  3 00:57:01 kubernetes2003 kernel: [14389414.942620] md: using 128k window, over a total of 29279232k.
Nov  3 00:57:01 kubernetes2003 kernel: [14389415.104170] md: delaying data-check of md1 until md0 has finished (they share one or more physical units)
Nov  3 01:00:43 kubernetes2003 kernel: [14389637.262236] md: md0: data-check done.
Nov  3 01:00:43 kubernetes2003 kernel: [14389637.278157] md: data-check of RAID array md1
Nov  3 01:00:43 kubernetes2003 kernel: [14389637.278160] md: minimum _guaranteed_  speed: 1000 KB/sec/disk.
Nov  3 01:00:43 kubernetes2003 kernel: [14389637.278162] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for data-check.
Nov  3 01:00:43 kubernetes2003 kernel: [14389637.278173] md: using 128k window, over a total of 947334144k.
Nov  3 03:46:35 kubernetes2003 kernel: [14399587.554875] md: md1: data-check done.
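
For context, the "not more than 200000 KB/sec" ceiling and the 1000 KB/sec floor in the log come from the kernel's dev.raid.speed_limit_max / speed_limit_min sysctls. A quick way to see whether a check is in flight and what limits apply, sketched in Python over the standard procfs/sysfs md interfaces:

from pathlib import Path

# Global md sync speed limits, in KB/sec per disk (the 1000 / 200000 values
# in the kernel log above come from these).
for name in ('speed_limit_min', 'speed_limit_max'):
    print(name, Path('/proc/sys/dev/raid', name).read_text().strip())

# Per-array state: is a data-check running, and how fast is it currently going?
for md in sorted(Path('/sys/block').glob('md*')):
    action = (md / 'md' / 'sync_action').read_text().strip()
    speed = (md / 'md' / 'sync_speed').read_text().strip()  # "none" when idle
    print(md.name, action, speed)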

You might want to consider an approach similar to what we've done with the etcd cluster; see modules/profile/manifests/etcd.pp.

Event Timeline

Change 549847 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] md: Globally set lower sync limits

https://gerrit.wikimedia.org/r/549847

Change 549847 merged by Alexandros Kosiaris:
[operations/puppet@production] md: Globally set lower sync limits

https://gerrit.wikimedia.org/r/549847
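
I haven't reproduced the exact values from that change, but "lower sync limits" presumably means capping the dev.raid.speed_limit_* sysctls so a data-check can't monopolize I/O (and starve facter's custom facts). A rough sketch of that kind of tweak; the numbers are example values only, not necessarily what the patch sets:

from pathlib import Path

# Example values only (KB/sec per disk); the real limits are whatever the
# puppet change above sets. Requires root, and would normally be persisted
# via sysctl (dev.raid.speed_limit_*) rather than written directly.
LIMITS = {'speed_limit_min': '1000', 'speed_limit_max': '100000'}

for name, value in LIMITS.items():
    Path('/proc/sys/dev/raid', name).write_text(value + '\n')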

JMeybohm claimed this task.
JMeybohm subscribed.

I'm going to close this, as we don't have this particular problem anymore (AFAICT) now that the mdadm checks are spread out across the first week of each month.
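
For reference, the spreading means not every host runs its check on the first Sunday anymore. One way to derive a stable per-host slot within the first week, purely as an illustration (not the actual puppet implementation):

import hashlib

def mdcheck_day(hostname: str) -> int:
    """Deterministically pick a day (1-7) in the first week of the month
    for this host's mdadm data-check, so hosts don't all check at once."""
    digest = hashlib.sha256(hostname.encode('utf-8')).hexdigest()
    return int(digest, 16) % 7 + 1

print(mdcheck_day('kubernetes2003'))  # stable per-host value between 1 and 7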