labservices1002 slow puppet runs and IO issues
Closed, ResolvedPublic
Actions

Assigned To

Authored By

	Andrew
	May 7 2017, 5:31 PM

Description

Puppet runs on labservices1002 are VERY slow, often more than 5 minutes. I noticed this when investigating a different issue with pdns and designate-mdns where communication was timing out. Possibly these things are related; regardless something is amiss on labservices1002.

Related Objects
Search...

		Status	Subtype	Assigned	Task
		Resolved		Andrew	T164675 labservices1002 slow puppet runs and IO issues
		Resolved		Andrew	T164807 labservices1002 mostly won't boot

Event Timeline

Andrew created this task.May 7 2017, 5:31 PM

Restricted Application added a project: Cloud-Services. · View Herald TranscriptMay 7 2017, 5:31 PM

Restricted Application added a subscriber: Aklapper. · View Herald Transcript

Andrew added a subscriber: • MoritzMuehlenhoff.May 7 2017, 5:31 PM

Paladox subscribed.May 7 2017, 7:15 PM

labservices1002

Notice: Finished catalog run in 346.38 seconds
Debug: Executing '/etc/puppet/etckeeper-commit-post'
Debug: Closing connection for https://puppet:8140
Debug: Creating new connection for https://puppet:8140
Debug: Starting connection for https://puppet:8140
Debug: Caching connection for https://puppet:8140
Debug: Closing connection for https://puppet:8140

real	9m3.078s
user	7m5.693s
sys	0m39.886s

labservices1001

Notice: Finished catalog run in 78.52 seconds
Debug: Executing '/etc/puppet/etckeeper-commit-post'
Debug: Using cached connection for https://puppet:8140
Debug: Caching connection for https://puppet:8140
Debug: Closing connection for https://puppet:8140

real	1m33.141s
user	0m16.704s
sys	0m2.394s

/dev/md0 ext3 9.1G 3.0G 5.7G 34% /

tune2fs -l /dev/md0

tune2fs 1.42.9 (4-Feb-2014)
Filesystem volume name:   <none>
Last mounted on:          /
Filesystem UUID:          f44de5c8-0b60-4963-9cd6-2330c3e8a00d
Filesystem magic number:  0xEF53
Filesystem revision #:    1 (dynamic)
Filesystem features:      has_journal ext_attr resize_inode dir_index filetype needs_recovery sparse_super large_file
Filesystem flags:         signed_directory_hash
Default mount options:    user_xattr acl
Filesystem state:         clean
Errors behavior:          Continue
Filesystem OS type:       Linux
Inode count:              610800
Block count:              2439136
Reserved block count:     121956
Free blocks:              1596362
Free inodes:              481743
First block:              0
Block size:               4096
Fragment size:            4096
Reserved GDT blocks:      595
Blocks per group:         32768
Fragments per group:      32768
Inodes per group:         8144
Inode blocks per group:   509
Filesystem created:       Mon May 23 22:38:22 2016
Last mount time:          Sun May  7 17:14:50 2017
Last write time:          Sun May  7 17:14:50 2017
Mount count:              8
Maximum mount count:      -1
Last checked:             Mon May 23 22:38:22 2016
Check interval:           0 (<none>)
Lifetime writes:          3388 MB
Reserved blocks uid:      0 (user root)
Reserved blocks gid:      0 (group root)
First inode:              11
Inode size:	          256
Required extra isize:     28
Desired extra isize:      28
Journal inode:            8
First orphan inode:       57012
Default directory hash:   half_md4
Directory Hash Seed:      f26fe8f9-cfe8-430c-9018-91adc3f7c056
Journal backup:           inode blocks

grep -i fix /etc/default/rcS

# automatically repair filesystems with inconsistencies during boot
#FSCKFIX=no

!log labservices1002 'touch /forcefsck && sudo reboot'

/var/log/kern.log

May  7 17:14:58 labservices1002 kernel: [   15.242796] md/raid1:md0: active with 2 out of 2 mirrors
May  7 17:14:58 labservices1002 kernel: [   15.249241] md0: detected capacity change from 0 to 9990701056
May  7 17:14:58 labservices1002 kernel: [   15.279924]  md0: unknown partition table
May  7 17:14:58 labservices1002 kernel: [   15.530367] EXT4-fs (md0): mounting ext3 file system using the ext4 subsystem
May  7 17:14:58 labservices1002 kernel: [   15.540377] EXT4-fs (md0): INFO: recovery required on readonly filesystem
May  7 17:14:58 labservices1002 kernel: [   15.548104] EXT4-fs (md0): write access will be enabled during recovery
May  7 17:14:58 labservices1002 kernel: [   15.557669] random: nonblocking pool is initialized
May  7 17:14:58 labservices1002 kernel: [   16.014382] EXT4-fs (md0): orphan cleanup on readonly fs
May  7 17:14:58 labservices1002 kernel: [   16.128187] EXT4-fs (md0): 16 orphan inodes deleted
May  7 17:14:58 labservices1002 kernel: [   16.133908] EXT4-fs (md0): recovery complete
May  7 17:14:58 labservices1002 kernel: [   16.178146] EXT4-fs (md0): mounted filesystem with ordered data mode. Opts: (null)
May  7 17:14:58 labservices1002 kernel: [   19.562754] ERST: NVRAM ERST Log Address Range not implemented yet.
May  7 17:14:58 labservices1002 kernel: [   19.598243] Adding 976316k swap on /dev/md1.  Priority:-1 extents:1 across:976316k FS
May  7 17:14:58 labservices1002 kernel: [   20.590636] EXT4-fs (md0): re-mounted. Opts: errors=remount-ro

tune2fs -l /dev/md0 | grep 'Last checked'

Last checked:             Mon May  8 13:50:42 2017

fsck seems to run fine but this is later in the boot process than this which shows up every boot

[   15.165191] md0: detected capacity change from 0 to 9990701056
[   15.204433]  md0: unknown partition table
Begin: Running /scripts/local-premount ... done.
[   15.465476] EXT4-fs (md0): mounting ext3 file system using the ext4 subsystem
[   15.475645] EXT4-fs (md0): INFO: recovery required on readonly filesystem
[   15.483373] EXT4-fs (md0): write access will be enabled during recovery
[   15.516742] random: nonblocking pool is initialized
[   15.768993] EXT4-fs (md0): recovery complete
[   15.776569] EXT4-fs (md0): mounted filesystem with ordered data mode. Opts: (null)

Is there a chicken/egg thing with the root fs I'm wondering. Maybe need to boot into some recovery shell.

time aptitude update

Ign http://mirrors.wikimedia.org trusty InRelease
Hit http://apt.wikimedia.org trusty-wikimedia InRelease
Hit http://mirrors.wikimedia.org trusty-updates InRelease
Hit http://mirrors.wikimedia.org trusty-backports InRelease
Hit http://security.ubuntu.com trusty-security InRelease
Hit http://mirrors.wikimedia.org trusty Release.gpg
...
Ign http://ubuntu-cloud.archive.canonical.com trusty-updates/liberty/main Translation-en
[ ERR] Reading package lists                                                                    3,963 kB/s 0s
E: Problem renaming the file /var/cache/apt/pkgcache.bin.FopEfs to /var/cache/apt/pkgcache.bin - rename (2: No such file or directory)
W: You may want to run apt-get update to correct these problems
E: Couldn't rebuild package cache

• chasemp renamed this task from labservices1002 slow puppet runs to labservices1002 slow puppet runs and IO issues.May 8 2017, 3:08 PM

I tried dropping into LifeCycle controller to run diagnostics via F-10

https://www.dell.com/support/article/us/en/19/SLN283546/how-to-run-hardware-diagnostics-on-your-poweredge-server?lang=EN

I see

Unified Server Configurator does not support console redirection.

I updated /etc/default/rcS:

# be more verbose during the boot process
VERBOSE=yes

# automatically repair filesystems with inconsistencies during boot
FSCKFIX=yes

and

touch /forcefsck && /sbin/reboot

then I set it back to default and rebooted again for fun

Well. Now the issue has gone dormant or fsck addressed it.

rush@labservices1002:~# time sudo puppet agent --test
Info: Retrieving plugin
Info: Loading facts
Info: Caching catalog for labservices1002.wikimedia.org
Info: Applying configuration version '1494259943'
Notice: Finished catalog run in 83.58 seconds

real    1m38.460s
user    0m21.260s
sys    0m2.836s
rush@labservices1002:~# time sudo puppet agent --test
Info: Retrieving plugin
Info: Loading facts
Info: Caching catalog for labservices1002.wikimedia.org
Info: Applying configuration version '1494259943'
Notice: Finished catalog run in 81.28 seconds

real    1m34.917s
user    0m20.751s
sys    0m2.721s
rush@labservices1002:~# time sudo puppet agent --test
Info: Retrieving plugin
Info: Loading facts
Info: Caching catalog for labservices1002.wikimedia.org
Info: Applying configuration version '1494260417'
Notice: Finished catalog run in 81.57 seconds

real    1m35.166s
user    0m20.695s
sys    0m2.906s
rush@labservices1002:~#

I am going to stall this and @Andrew has agreed to monitor over the coming week for reoccurrence. I spoke with @Cmjohnson briefly who said he did have drives to put in this if we decide replacement is the right step..

a few reference links to remember:

https://wiki.debian.org/InitramfsDebug
https://wikitech.wikimedia.org/wiki/Platform-specific_documentation/Dell_PowerEdge_RN20_Gen8#Reboot_and_boot_into_BIOS_then_console
https://askubuntu.com/questions/14740/force-fsck-ext4-on-reboot-but-really-forceful
https://linuxconfig.org/how-to-force-fsck-to-check-filesystem-after-system-reboot-on-linux

• chasemp triaged this task as Medium priority.May 8 2017, 4:30 PM

Andrew claimed this task.May 8 2017, 4:30 PM

I can't imagine how an fsck would fix a slow puppet run. labservices1002 seems to be an R320, are you sure you weren't affected by T162850? There is a puppet fix committed now that would only take effect after a reboot, so it's possible it won't reoccur. There also seems to be a monitoring check for this, but from a quick glance it looks broken to me, so it wouldn't catch this.

I'm not sure what the story is other than during debugging it disappeared for now. Pattern as I understand it is: puppet slow seemingly from degraded IO sunday, failed away from the host as primary and rebooted, symptoms still persisted, I did the above poking looking for some IO issue indicator and went back to demonstrate and the issue would no longer reproduce.

for posterity atm:

labservices1002:~# lsmod|grep -c acpi_pad
0

I audited w/ @Dzahn the Icinga check that we have. It's not broken, at least with recent kernels, but it wouldn't fire off because the cpufreq module isn't present/loaded on the box at all. This is probably because it's configured in its BIOS to some other setting than OS-controlled.

We've seen weird CPU usage/performance issues with our R320 for years now (T82885, T82642, T162850 etc.). I'd be a hell of a coincidence this being a R320 server and having unrelated CPU performance issues :)

That system is safe to reboot right now... any pointers about where I should look in the bios for that setting? (I can dig if not)

Should be System Profile Settings > System Profile -> Performance Per Watt (OS), according to our DC Ops docs (which even say "the default dell setting causes power_saving/watchdog kernel threads to spawn and inordinately consume CPU cycles, this setting fixes it", heh :)

Andrew created subtask T164807: labservices1002 mostly won't boot.May 8 2017, 10:46 PM

I've changed the bios setting in question, so /maybe/ this is resolved. I'm going to keep an eye out for a few days though.

In T164675#3246074, @Andrew wrote:

I've changed the bios setting in question, so /maybe/ this is resolved. I'm going to keep an eye out for a few days though.

Great :) I've verified that cpufreq is indeed loaded now (and acpi_pad isn't) and that MHz reported in /proc/cpuinfo varies depending on the system load. If this happens again and the cause is what I was suspecting it to be, grep MHz /proc/cpuinfo will list a very low and our alert about this will trigger.

Andrew closed this task as Resolved.May 26 2017, 8:07 PM

labservices1002 slow puppet runs and IO issuesClosed, ResolvedPublicActions

Description

Related ObjectsSearch...

Event Timeline

labservices1002 slow puppet runs and IO issues
Closed, ResolvedPublic
Actions

Related Objects
Search...