
Kubernetes hosts RAID check makes facter fail
Closed, Resolved · Public

Description

Tonight we got some cron spam from the k8s hosts with:

2019-11-03 03:34:05.451970 ERROR puppetlabs.facter - error while resolving custom facts in /var/lib/puppet/lib/facter/lvm_support.rb: command timed out after 60 seconds.

Traceback (most recent call last):
  File "/usr/local/sbin/smart-data-dump", line 338, in <module>
    sys.exit(main())
  File "/usr/local/sbin/smart-data-dump", line 308, in main
    raid_drivers = get_fact('raid', facter_version)
  File "/usr/local/sbin/smart-data-dump", line 94, in get_fact
    raw_output = subprocess.check_output(command)
  File "/usr/lib/python3.5/subprocess.py", line 316, in check_output
    **kwargs).stdout
  File "/usr/lib/python3.5/subprocess.py", line 398, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['/usr/bin/facter', '--puppet', '--json', 'raid', '-l', 'error']' returned non-zero exit status 1
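
The chain here is: the md data-check starves I/O, facter's custom lvm_support.rb fact hits its 60-second timeout, facter exits non-zero, and subprocess.check_output in smart-data-dump turns that into a CalledProcessError that kills the whole script. A hypothetical sketch of a more tolerant fact lookup (not the actual smart-data-dump code; only the facter command line is taken from the traceback above):

import json
import subprocess


def get_fact_tolerant(fact_name, timeout=120):
    """Query a single facter fact without letting a slow custom fact kill the caller.

    Hypothetical sketch: returns None instead of raising when facter
    exits non-zero or exceeds the timeout.
    """
    command = ['/usr/bin/facter', '--puppet', '--json', fact_name, '-l', 'error']
    try:
        raw_output = subprocess.check_output(command, timeout=timeout)
    except (subprocess.CalledProcessError, subprocess.TimeoutExpired):
        return None
    try:
        return json.loads(raw_output.decode('utf-8')).get(fact_name)
    except ValueError:
        return None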

Since it's the first Sunday of the month, I'm pretty sure this is related to the mdadm data-check, as the timing of the spam emails matches it:

Nov  3 00:57:01 kubernetes2003 kernel: [14389414.942606] md: data-check of RAID array md0
Nov  3 00:57:01 kubernetes2003 kernel: [14389414.942609] md: minimum _guaranteed_  speed: 1000 KB/sec/disk.
Nov  3 00:57:01 kubernetes2003 kernel: [14389414.942611] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for data-check.
Nov  3 00:57:01 kubernetes2003 kernel: [14389414.942620] md: using 128k window, over a total of 29279232k.
Nov  3 00:57:01 kubernetes2003 kernel: [14389415.104170] md: delaying data-check of md1 until md0 has finished (they share one or more physical units)
Nov  3 01:00:43 kubernetes2003 kernel: [14389637.262236] md: md0: data-check done.
Nov  3 01:00:43 kubernetes2003 kernel: [14389637.278157] md: data-check of RAID array md1
Nov  3 01:00:43 kubernetes2003 kernel: [14389637.278160] md: minimum _guaranteed_  speed: 1000 KB/sec/disk.
Nov  3 01:00:43 kubernetes2003 kernel: [14389637.278162] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for data-check.
Nov  3 01:00:43 kubernetes2003 kernel: [14389637.278173] md: using 128k window, over a total of 947334144k.
Nov  3 03:46:35 kubernetes2003 kernel: [14399587.554875] md: md1: data-check done.
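
For context, the "not more than 200000 KB/sec" ceiling and the 1000 KB/sec floor in the log come from the kernel's dev.raid.speed_limit_max / speed_limit_min sysctls. A quick way to see whether a check is in flight and what limits apply, sketched in Python over the standard procfs/sysfs md interfaces:

from pathlib import Path

# Global md sync speed limits, in KB/sec per disk (the 1000 / 200000 values
# in the kernel log above come from these).
for name in ('speed_limit_min', 'speed_limit_max'):
    print(name, Path('/proc/sys/dev/raid', name).read_text().strip())

# Per-array state: is a data-check running, and how fast is it currently going?
for md in sorted(Path('/sys/block').glob('md*')):
    action = (md / 'md' / 'sync_action').read_text().strip()
    speed = (md / 'md' / 'sync_speed').read_text().strip()  # "none" when idle
    print(md.name, action, speed)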

You might want to consider an approach similar to what we've done with the etcd cluster; see modules/profile/manifests/etcd.pp.

Event Timeline

Change 549847 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] md: Globally set lower sync limits

https://gerrit.wikimedia.org/r/549847

Change 549847 merged by Alexandros Kosiaris:
[operations/puppet@production] md: Globally set lower sync limits

https://gerrit.wikimedia.org/r/549847
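
I haven't reproduced the exact values from that change, but "lower sync limits" presumably means capping the dev.raid.speed_limit_* sysctls so a data-check can't monopolize I/O (and starve facter's custom facts). A rough sketch of that kind of tweak; the numbers are example values only, not necessarily what the patch sets:

from pathlib import Path

# Example values only (KB/sec per disk); the real limits are whatever the
# puppet change above sets. Requires root, and would normally be persisted
# via sysctl (dev.raid.speed_limit_*) rather than written directly.
LIMITS = {'speed_limit_min': '1000', 'speed_limit_max': '100000'}

for name, value in LIMITS.items():
    Path('/proc/sys/dev/raid', name).write_text(value + '\n')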

JMeybohm claimed this task.
JMeybohm subscribed.

I'm going to close this, as we don't have this particular problem anymore (AFAICT) now that the mdadm checks are spread out across the first week of each month.
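
For reference, the spreading means not every host runs its check on the first Sunday anymore. One way to derive a stable per-host slot within the first week, purely as an illustration (not the actual puppet implementation):

import hashlib

def mdcheck_day(hostname: str) -> int:
    """Deterministically pick a day (1-7) in the first week of the month
    for this host's mdadm data-check, so hosts don't all check at once."""
    digest = hashlib.sha256(hostname.encode('utf-8')).hexdigest()
    return int(digest, 16) % 7 + 1

print(mdcheck_day('kubernetes2003'))  # stable per-host value between 1 and 7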