Puppet runs on labservices1002 are VERY slow, often taking more than 5 minutes. I noticed this while investigating a different issue with pdns and designate-mdns, where communication was timing out. These things may be related; regardless, something is amiss on labservices1002.
Description
Status   | Assigned | Task
---------|----------|-----------------------------------------------------------
Resolved | Andrew   | T164675 labservices1002 slow puppet runs and IO issues
Resolved | Andrew   | T164807 labservices1002 mostly won't boot
Event Timeline
labservices1002
Notice: Finished catalog run in 346.38 seconds
Debug: Executing '/etc/puppet/etckeeper-commit-post'
Debug: Closing connection for https://puppet:8140
Debug: Creating new connection for https://puppet:8140
Debug: Starting connection for https://puppet:8140
Debug: Caching connection for https://puppet:8140
Debug: Closing connection for https://puppet:8140

real    9m3.078s
user    7m5.693s
sys     0m39.886s
vs
labservices1001
Notice: Finished catalog run in 78.52 seconds
Debug: Executing '/etc/puppet/etckeeper-commit-post'
Debug: Using cached connection for https://puppet:8140
Debug: Caching connection for https://puppet:8140
Debug: Closing connection for https://puppet:8140

real    1m33.141s
user    0m16.704s
sys     0m2.394s
/dev/md0 ext3 9.1G 3.0G 5.7G 34% /
tune2fs -l /dev/md0
tune2fs 1.42.9 (4-Feb-2014)
Filesystem volume name:   <none>
Last mounted on:          /
Filesystem UUID:          f44de5c8-0b60-4963-9cd6-2330c3e8a00d
Filesystem magic number:  0xEF53
Filesystem revision #:    1 (dynamic)
Filesystem features:      has_journal ext_attr resize_inode dir_index filetype needs_recovery sparse_super large_file
Filesystem flags:         signed_directory_hash
Default mount options:    user_xattr acl
Filesystem state:         clean
Errors behavior:          Continue
Filesystem OS type:       Linux
Inode count:              610800
Block count:              2439136
Reserved block count:     121956
Free blocks:              1596362
Free inodes:              481743
First block:              0
Block size:               4096
Fragment size:            4096
Reserved GDT blocks:      595
Blocks per group:         32768
Fragments per group:      32768
Inodes per group:         8144
Inode blocks per group:   509
Filesystem created:       Mon May 23 22:38:22 2016
Last mount time:          Sun May  7 17:14:50 2017
Last write time:          Sun May  7 17:14:50 2017
Mount count:              8
Maximum mount count:      -1
Last checked:             Mon May 23 22:38:22 2016
Check interval:           0 (<none>)
Lifetime writes:          3388 MB
Reserved blocks uid:      0 (user root)
Reserved blocks gid:      0 (group root)
First inode:              11
Inode size:               256
Required extra isize:     28
Desired extra isize:      28
Journal inode:            8
First orphan inode:       57012
Default directory hash:   half_md4
Directory Hash Seed:      f26fe8f9-cfe8-430c-9018-91adc3f7c056
Journal backup:           inode blocks
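Worth noting in that output: "Maximum mount count: -1" and "Check interval: 0" together mean periodic fsck is disabled for this filesystem. A small shell sketch (hypothetical helper, operating on sample text rather than a real device) that reads `tune2fs -l`-style output and reports that state:

```shell
# Hypothetical helper: decide from `tune2fs -l`-style output whether
# periodic fsck is disabled (as on this host: max mount count -1,
# check interval 0). Parses sample text, not a real device.
periodic_fsck_disabled() {
  # $1: tune2fs -l output
  local max interval
  max=$(printf '%s\n' "$1" | awk -F: '/Maximum mount count/ {gsub(/ /,"",$2); print $2}')
  interval=$(printf '%s\n' "$1" | awk -F: '/Check interval/ {print $2}' | awk '{print $1}')
  if [ "$max" = "-1" ] && [ "$interval" = "0" ]; then
    echo disabled
  else
    echo enabled
  fi
}

sample='Maximum mount count:      -1
Check interval:           0 (<none>)'
periodic_fsck_disabled "$sample"   # prints "disabled"
```

With checks disabled like this, the only fsck the box ever gets is whatever the journal recovery does at mount time, which fits the recurring "recovery required on readonly filesystem" messages below.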
grep -i fix /etc/default/rcS
# automatically repair filesystems with inconsistencies during boot
#FSCKFIX=no
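So the repair option is commented out, i.e. defaulting to "no". A minimal sketch of flipping it on (shown against a scratch copy of the file, assuming Ubuntu's /etc/default/rcS layout; on the real host you'd edit the file in place):

```shell
# Flip the commented-out "#FSCKFIX=no" default to FSCKFIX=yes on a
# scratch copy, so a boot-time fsck would repair inconsistencies
# automatically instead of dropping to a prompt.
rcS=$(mktemp)
cat > "$rcS" <<'EOF'
# automatically repair filesystems with inconsistencies during boot
#FSCKFIX=no
EOF

# "#*" matches the leading comment marker if present
sed -i 's/^#*FSCKFIX=.*/FSCKFIX=yes/' "$rcS"
grep '^FSCKFIX=' "$rcS"   # FSCKFIX=yes
```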
/var/log/kern.log
May  7 17:14:58 labservices1002 kernel: [   15.242796] md/raid1:md0: active with 2 out of 2 mirrors
May  7 17:14:58 labservices1002 kernel: [   15.249241] md0: detected capacity change from 0 to 9990701056
May  7 17:14:58 labservices1002 kernel: [   15.279924] md0: unknown partition table
May  7 17:14:58 labservices1002 kernel: [   15.530367] EXT4-fs (md0): mounting ext3 file system using the ext4 subsystem
May  7 17:14:58 labservices1002 kernel: [   15.540377] EXT4-fs (md0): INFO: recovery required on readonly filesystem
May  7 17:14:58 labservices1002 kernel: [   15.548104] EXT4-fs (md0): write access will be enabled during recovery
May  7 17:14:58 labservices1002 kernel: [   15.557669] random: nonblocking pool is initialized
May  7 17:14:58 labservices1002 kernel: [   16.014382] EXT4-fs (md0): orphan cleanup on readonly fs
May  7 17:14:58 labservices1002 kernel: [   16.128187] EXT4-fs (md0): 16 orphan inodes deleted
May  7 17:14:58 labservices1002 kernel: [   16.133908] EXT4-fs (md0): recovery complete
May  7 17:14:58 labservices1002 kernel: [   16.178146] EXT4-fs (md0): mounted filesystem with ordered data mode. Opts: (null)
May  7 17:14:58 labservices1002 kernel: [   19.562754] ERST: NVRAM ERST Log Address Range not implemented yet.
May  7 17:14:58 labservices1002 kernel: [   19.598243] Adding 976316k swap on /dev/md1. Priority:-1 extents:1 across:976316k FS
May  7 17:14:58 labservices1002 kernel: [   20.590636] EXT4-fs (md0): re-mounted. Opts: errors=remount-ro
fsck seems to run fine, but it happens later in the boot process than the following, which shows up on every boot:
[   15.165191] md0: detected capacity change from 0 to 9990701056
[   15.204433] md0: unknown partition table
Begin: Running /scripts/local-premount ... done.
[   15.465476] EXT4-fs (md0): mounting ext3 file system using the ext4 subsystem
[   15.475645] EXT4-fs (md0): INFO: recovery required on readonly filesystem
[   15.483373] EXT4-fs (md0): write access will be enabled during recovery
[   15.516742] random: nonblocking pool is initialized
[   15.768993] EXT4-fs (md0): recovery complete
[   15.776569] EXT4-fs (md0): mounted filesystem with ordered data mode. Opts: (null)
I'm wondering whether there's a chicken-and-egg problem with the root fs here. Maybe I need to boot into a recovery shell.
time aptitude update
Ign http://mirrors.wikimedia.org trusty InRelease
Hit http://apt.wikimedia.org trusty-wikimedia InRelease
Hit http://mirrors.wikimedia.org trusty-updates InRelease
Hit http://mirrors.wikimedia.org trusty-backports InRelease
Hit http://security.ubuntu.com trusty-security InRelease
Hit http://mirrors.wikimedia.org trusty Release.gpg
...
Ign http://ubuntu-cloud.archive.canonical.com trusty-updates/liberty/main Translation-en
[ ERR] Reading package lists    3,963 kB/s 0s
E: Problem renaming the file /var/cache/apt/pkgcache.bin.FopEfs to /var/cache/apt/pkgcache.bin - rename (2: No such file or directory)
W: You may want to run apt-get update to correct these problems
E: Couldn't rebuild package cache
I tried dropping into the Lifecycle Controller to run diagnostics via F10.
I see
Unified Server Configurator does not support console redirection.
I updated /etc/default/rcS:
# be more verbose during the boot process
VERBOSE=yes
# automatically repair filesystems with inconsistencies during boot
FSCKFIX=yes
and
touch /forcefsck && /sbin/reboot
then I set it back to default and rebooted again for fun
Well. Now the issue has either gone dormant or fsck addressed it.
rush@labservices1002:~# time sudo puppet agent --test
Info: Retrieving plugin
Info: Loading facts
Info: Caching catalog for labservices1002.wikimedia.org
Info: Applying configuration version '1494259943'
Notice: Finished catalog run in 83.58 seconds

real    1m38.460s
user    0m21.260s
sys     0m2.836s

rush@labservices1002:~# time sudo puppet agent --test
Info: Retrieving plugin
Info: Loading facts
Info: Caching catalog for labservices1002.wikimedia.org
Info: Applying configuration version '1494259943'
Notice: Finished catalog run in 81.28 seconds

real    1m34.917s
user    0m20.751s
sys     0m2.721s

rush@labservices1002:~# time sudo puppet agent --test
Info: Retrieving plugin
Info: Loading facts
Info: Caching catalog for labservices1002.wikimedia.org
Info: Applying configuration version '1494260417'
Notice: Finished catalog run in 81.57 seconds

real    1m35.166s
user    0m20.695s
sys     0m2.906s
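For a rough before/after comparison, the `real` wall-clock values from time(1) can be converted to seconds with a small helper (hypothetical, not part of any tooling on the host):

```shell
# Convert a time(1) value like "9m3.078s" to whole seconds, to compare
# the slow run against the post-reboot runs.
to_seconds() {
  # $1: a value in the form <minutes>m<seconds>s
  printf '%s' "$1" | awk -Fm '{sub(/s$/,"",$2); printf "%.0f\n", $1*60+$2}'
}

to_seconds 9m3.078s    # slow run: 543 seconds
to_seconds 1m38.460s   # after reboot: 98 seconds
```

Roughly a 5.5x difference in wall-clock time between the bad state and the healthy runs above.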
I'm going to stall this; @Andrew has agreed to monitor for recurrence over the coming week. I spoke briefly with @Cmjohnson, who said he does have drives to put in this host if we decide replacement is the right step.
a few reference links to remember:
https://wiki.debian.org/InitramfsDebug
https://wikitech.wikimedia.org/wiki/Platform-specific_documentation/Dell_PowerEdge_RN20_Gen8#Reboot_and_boot_into_BIOS_then_console
https://askubuntu.com/questions/14740/force-fsck-ext4-on-reboot-but-really-forceful
https://linuxconfig.org/how-to-force-fsck-to-check-filesystem-after-system-reboot-on-linux
I can't imagine how an fsck would fix a slow puppet run. labservices1002 seems to be an R320; are you sure you weren't affected by T162850? There is a puppet fix committed now that only takes effect after a reboot, so it's possible this won't recur. There also seems to be a monitoring check for this, but from a quick glance it looks broken to me, so it wouldn't have caught this.
I'm not sure what the story is, other than that the issue disappeared during debugging. The pattern as I understand it: puppet was slow on Sunday, seemingly from degraded IO; we failed the host away as primary and rebooted; symptoms still persisted; I did the poking above looking for some IO-issue indicator, and when I went back to demonstrate it, the issue would no longer reproduce.
for posterity atm:
labservices1002:~# lsmod|grep -c acpi_pad
0
I audited the Icinga check we have with @Dzahn. It's not broken, at least with recent kernels, but it wouldn't fire here because the cpufreq module isn't present/loaded on the box at all. That is probably because the BIOS is configured to a setting other than OS-controlled.
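The module audit boils down to checking `lsmod` for acpi_pad (the suspect on these R320s) and a cpufreq driver. A hedged sketch of that check, operating on sample `lsmod` text so it's self-contained (on a real host you'd pipe `lsmod` output in instead):

```shell
# Given lsmod-style output, report whether a named module is loaded.
module_loaded() {
  # $1: lsmod output, $2: module name
  printf '%s\n' "$1" | awk -v m="$2" '$1 == m {found=1} END {exit !found}'
}

# Sample resembling a host where a cpufreq driver is loaded and
# acpi_pad is not (module names/sizes here are illustrative).
sample='Module                  Size  Used by
acpi_cpufreq           20936  0
ext4                  594716  2'

module_loaded "$sample" acpi_pad     || echo "acpi_pad not loaded"
module_loaded "$sample" acpi_cpufreq && echo "cpufreq driver loaded"
```

On labservices1002 the grep above showed neither: acpi_pad absent (good) but no cpufreq driver either, which is why the check stayed silent.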
We've seen weird CPU usage/performance issues with our R320s for years now (T82885, T82642, T162850, etc.). It'd be a hell of a coincidence for this to be an R320 server with unrelated CPU performance issues :)
That system is safe to reboot right now... any pointers about where I should look in the bios for that setting? (I can dig if not)
Should be System Profile Settings > System Profile -> Performance Per Watt (OS), according to our DC Ops docs (which even say "the default dell setting causes power_saving/watchdog kernel threads to spawn and inordinately consume CPU cycles, this setting fixes it", heh :)
I've changed the bios setting in question, so /maybe/ this is resolved. I'm going to keep an eye out for a few days though.
Great :) I've verified that cpufreq is indeed loaded now (and acpi_pad isn't) and that the MHz reported in /proc/cpuinfo varies with system load. If this happens again and the cause is what I suspected, grep MHz /proc/cpuinfo will list very low frequencies and our alert about this will trigger.
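A sketch of that "stuck at low frequency" symptom check, flagging when every core in /proc/cpuinfo-style text reports well below some nominal frequency (the 1000 MHz threshold is an assumption for illustration; the real alert's logic may differ):

```shell
# Return success when every "cpu MHz" line is below the threshold,
# i.e. all cores look throttled.
all_cores_slow() {
  # $1: cpuinfo text, $2: threshold in MHz
  printf '%s\n' "$1" | awk -F: -v t="$2" '
    /^cpu MHz/ { n++; if ($2+0 >= t) fast++ }
    END { exit (n > 0 && fast == 0) ? 0 : 1 }'
}

# Sample resembling the bad state: every core pinned around 400 MHz.
stuck='cpu MHz         : 399.978
cpu MHz         : 400.012'
all_cores_slow "$stuck" 1000 && echo "CPUs look throttled"
```

On the fixed host, `grep MHz /proc/cpuinfo` now shows frequencies scaling with load, so this condition no longer holds.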