Page MenuHomePhabricator

labservices1002 slow puppet runs and IO issues
Closed, ResolvedPublic

Description

Puppet runs on labservices1002 are VERY slow, often more than 5 minutes. I noticed this when investigating a different issue with pdns and designate-mdns where communication was timing out. Possibly these things are related; regardless something is amiss on labservices1002.

Event Timeline

Andrew created this task.May 7 2017, 5:31 PM
Restricted Application added a project: Cloud-Services. · View Herald TranscriptMay 7 2017, 5:31 PM
Restricted Application added a subscriber: Aklapper. · View Herald Transcript
Paladox added a subscriber: Paladox.May 7 2017, 7:15 PM
chasemp added a subscriber: chasemp.May 8 2017, 1:43 PM

labservices1002

Notice: Finished catalog run in 346.38 seconds
Debug: Executing '/etc/puppet/etckeeper-commit-post'
Debug: Closing connection for https://puppet:8140
Debug: Creating new connection for https://puppet:8140
Debug: Starting connection for https://puppet:8140
Debug: Caching connection for https://puppet:8140
Debug: Closing connection for https://puppet:8140

real	9m3.078s
user	7m5.693s
sys	0m39.886s

vs

labservices1001

Notice: Finished catalog run in 78.52 seconds
Debug: Executing '/etc/puppet/etckeeper-commit-post'
Debug: Using cached connection for https://puppet:8140
Debug: Caching connection for https://puppet:8140
Debug: Closing connection for https://puppet:8140

real	1m33.141s
user	0m16.704s
sys	0m2.394s

/dev/md0 ext3 9.1G 3.0G 5.7G 34% /

tune2fs -l /dev/md0

tune2fs 1.42.9 (4-Feb-2014)
Filesystem volume name:   <none>
Last mounted on:          /
Filesystem UUID:          f44de5c8-0b60-4963-9cd6-2330c3e8a00d
Filesystem magic number:  0xEF53
Filesystem revision #:    1 (dynamic)
Filesystem features:      has_journal ext_attr resize_inode dir_index filetype needs_recovery sparse_super large_file
Filesystem flags:         signed_directory_hash
Default mount options:    user_xattr acl
Filesystem state:         clean
Errors behavior:          Continue
Filesystem OS type:       Linux
Inode count:              610800
Block count:              2439136
Reserved block count:     121956
Free blocks:              1596362
Free inodes:              481743
First block:              0
Block size:               4096
Fragment size:            4096
Reserved GDT blocks:      595
Blocks per group:         32768
Fragments per group:      32768
Inodes per group:         8144
Inode blocks per group:   509
Filesystem created:       Mon May 23 22:38:22 2016
Last mount time:          Sun May  7 17:14:50 2017
Last write time:          Sun May  7 17:14:50 2017
Mount count:              8
Maximum mount count:      -1
Last checked:             Mon May 23 22:38:22 2016
Check interval:           0 (<none>)
Lifetime writes:          3388 MB
Reserved blocks uid:      0 (user root)
Reserved blocks gid:      0 (group root)
First inode:              11
Inode size:	          256
Required extra isize:     28
Desired extra isize:      28
Journal inode:            8
First orphan inode:       57012
Default directory hash:   half_md4
Directory Hash Seed:      f26fe8f9-cfe8-430c-9018-91adc3f7c056
Journal backup:           inode blocks

grep -i fix /etc/default/rcS

# automatically repair filesystems with inconsistencies during boot
#FSCKFIX=no

!log labservices1002 'touch /forcefsck && sudo reboot'

chasemp added a comment.EditedMay 8 2017, 1:57 PM

/var/log/kern.log

May  7 17:14:58 labservices1002 kernel: [   15.242796] md/raid1:md0: active with 2 out of 2 mirrors
May  7 17:14:58 labservices1002 kernel: [   15.249241] md0: detected capacity change from 0 to 9990701056
May  7 17:14:58 labservices1002 kernel: [   15.279924]  md0: unknown partition table
May  7 17:14:58 labservices1002 kernel: [   15.530367] EXT4-fs (md0): mounting ext3 file system using the ext4 subsystem
May  7 17:14:58 labservices1002 kernel: [   15.540377] EXT4-fs (md0): INFO: recovery required on readonly filesystem
May  7 17:14:58 labservices1002 kernel: [   15.548104] EXT4-fs (md0): write access will be enabled during recovery
May  7 17:14:58 labservices1002 kernel: [   15.557669] random: nonblocking pool is initialized
May  7 17:14:58 labservices1002 kernel: [   16.014382] EXT4-fs (md0): orphan cleanup on readonly fs
May  7 17:14:58 labservices1002 kernel: [   16.128187] EXT4-fs (md0): 16 orphan inodes deleted
May  7 17:14:58 labservices1002 kernel: [   16.133908] EXT4-fs (md0): recovery complete
May  7 17:14:58 labservices1002 kernel: [   16.178146] EXT4-fs (md0): mounted filesystem with ordered data mode. Opts: (null)
May  7 17:14:58 labservices1002 kernel: [   19.562754] ERST: NVRAM ERST Log Address Range not implemented yet.
May  7 17:14:58 labservices1002 kernel: [   19.598243] Adding 976316k swap on /dev/md1.  Priority:-1 extents:1 across:976316k FS
May  7 17:14:58 labservices1002 kernel: [   20.590636] EXT4-fs (md0): re-mounted. Opts: errors=remount-ro

tune2fs -l /dev/md0 | grep 'Last checked'

Last checked:             Mon May  8 13:50:42 2017
chasemp added a comment.EditedMay 8 2017, 2:28 PM

fsck seems to run fine but this is later in the boot process than this which shows up every boot

[   15.165191] md0: detected capacity change from 0 to 9990701056
[   15.204433]  md0: unknown partition table
Begin: Running /scripts/local-premount ... done.
[   15.465476] EXT4-fs (md0): mounting ext3 file system using the ext4 subsystem
[   15.475645] EXT4-fs (md0): INFO: recovery required on readonly filesystem
[   15.483373] EXT4-fs (md0): write access will be enabled during recovery
[   15.516742] random: nonblocking pool is initialized
[   15.768993] EXT4-fs (md0): recovery complete
[   15.776569] EXT4-fs (md0): mounted filesystem with ordered data mode. Opts: (null)

Is there a chicken/egg thing with the root fs I'm wondering. Maybe need to boot into some recovery shell.

time aptitude update

Ign http://mirrors.wikimedia.org trusty InRelease
Hit http://apt.wikimedia.org trusty-wikimedia InRelease
Hit http://mirrors.wikimedia.org trusty-updates InRelease
Hit http://mirrors.wikimedia.org trusty-backports InRelease
Hit http://security.ubuntu.com trusty-security InRelease
Hit http://mirrors.wikimedia.org trusty Release.gpg
...
Ign http://ubuntu-cloud.archive.canonical.com trusty-updates/liberty/main Translation-en
[ ERR] Reading package lists                                                                    3,963 kB/s 0s
E: Problem renaming the file /var/cache/apt/pkgcache.bin.FopEfs to /var/cache/apt/pkgcache.bin - rename (2: No such file or directory)
W: You may want to run apt-get update to correct these problems
E: Couldn't rebuild package cache
chasemp renamed this task from labservices1002 slow puppet runs to labservices1002 slow puppet runs and IO issues.May 8 2017, 3:08 PM
chasemp added a comment.EditedMay 8 2017, 3:15 PM

I tried dropping into LifeCycle controller to run diagnostics via F-10

https://www.dell.com/support/article/us/en/19/SLN283546/how-to-run-hardware-diagnostics-on-your-poweredge-server?lang=EN

I see

Unified Server Configurator does not support console redirection.

I updated /etc/default/rcS:

# be more verbose during the boot process
VERBOSE=yes

# automatically repair filesystems with inconsistencies during boot
FSCKFIX=yes

and

touch /forcefsck && /sbin/reboot

then I set it back to default and rebooted again for fun

chasemp changed the task status from Open to Stalled.May 8 2017, 4:29 PM
chasemp added a subscriber: Cmjohnson.

Well. Now the issue has gone dormant or fsck addressed it.

rush@labservices1002:~# time sudo puppet agent --test
Info: Retrieving plugin
Info: Loading facts
Info: Caching catalog for labservices1002.wikimedia.org
Info: Applying configuration version '1494259943'
Notice: Finished catalog run in 83.58 seconds

real    1m38.460s
user    0m21.260s
sys    0m2.836s
rush@labservices1002:~# time sudo puppet agent --test
Info: Retrieving plugin
Info: Loading facts
Info: Caching catalog for labservices1002.wikimedia.org
Info: Applying configuration version '1494259943'
Notice: Finished catalog run in 81.28 seconds

real    1m34.917s
user    0m20.751s
sys    0m2.721s
rush@labservices1002:~# time sudo puppet agent --test
Info: Retrieving plugin
Info: Loading facts
Info: Caching catalog for labservices1002.wikimedia.org
Info: Applying configuration version '1494260417'
Notice: Finished catalog run in 81.57 seconds

real    1m35.166s
user    0m20.695s
sys    0m2.906s
rush@labservices1002:~#

I am going to stall this and @Andrew has agreed to monitor over the coming week for reoccurrence. I spoke with @Cmjohnson briefly who said he did have drives to put in this if we decide replacement is the right step..

a few reference links to remember:

https://wiki.debian.org/InitramfsDebug
https://wikitech.wikimedia.org/wiki/Platform-specific_documentation/Dell_PowerEdge_RN20_Gen8#Reboot_and_boot_into_BIOS_then_console
https://askubuntu.com/questions/14740/force-fsck-ext4-on-reboot-but-really-forceful
https://linuxconfig.org/how-to-force-fsck-to-check-filesystem-after-system-reboot-on-linux

chasemp triaged this task as Medium priority.May 8 2017, 4:30 PM
Andrew claimed this task.May 8 2017, 4:30 PM
faidon added a subscriber: faidon.May 8 2017, 5:07 PM

I can't imagine how an fsck would fix a slow puppet run. labservices1002 seems to be an R320, are you sure you weren't affected by T162850? There is a puppet fix committed now that would only take effect after a reboot, so it's possible it won't reoccur. There also seems to be a monitoring check for this, but from a quick glance it looks broken to me, so it wouldn't catch this.

chasemp added a comment.EditedMay 8 2017, 5:48 PM

I'm not sure what the story is other than during debugging it disappeared for now. Pattern as I understand it is: puppet slow seemingly from degraded IO sunday, failed away from the host as primary and rebooted, symptoms still persisted, I did the above poking looking for some IO issue indicator and went back to demonstrate and the issue would no longer reproduce.

for posterity atm:

labservices1002:~# lsmod|grep -c acpi_pad
0

faidon added a subscriber: Dzahn.May 8 2017, 7:17 PM

I audited w/ @Dzahn the Icinga check that we have. It's not broken, at least with recent kernels, but it wouldn't fire off because the cpufreq module isn't present/loaded on the box at all. This is probably because it's configured in its BIOS to some other setting than OS-controlled.

We've seen weird CPU usage/performance issues with our R320 for years now (T82885, T82642, T162850 etc.). I'd be a hell of a coincidence this being a R320 server and having unrelated CPU performance issues :)

Andrew added a comment.May 8 2017, 7:21 PM

That system is safe to reboot right now... any pointers about where I should look in the bios for that setting? (I can dig if not)

faidon added a comment.May 8 2017, 7:32 PM

Should be System Profile Settings > System Profile -> Performance Per Watt (OS), according to our DC Ops docs (which even say "the default dell setting causes power_saving/watchdog kernel threads to spawn and inordinately consume CPU cycles, this setting fixes it", heh :)

I've changed the bios setting in question, so /maybe/ this is resolved. I'm going to keep an eye out for a few days though.

I've changed the bios setting in question, so /maybe/ this is resolved. I'm going to keep an eye out for a few days though.

Great :) I've verified that cpufreq is indeed loaded now (and acpi_pad isn't) and that MHz reported in /proc/cpuinfo varies depending on the system load. If this happens again and the cause is what I was suspecting it to be, grep MHz /proc/cpuinfo will list a very low and our alert about this will trigger.

Andrew closed this task as Resolved.May 26 2017, 8:07 PM