Rough timeline of what occurred (times still need to be added and the accuracy/order verified):
- Rebooted labstore1004 to activate the upgraded kernel
- Promoted labstore1004 to primary and failed clients over to it
- Load spiked on labstore1004
- (LDAP outage from unrelated causes)
- Rebooted labstore1005 to activate the upgraded kernel
- Promoted labstore1005 to primary and failed clients over to it
- Load spiked on labstore1005
- Load spiked on several NFS clients
- Rebooted various NFS clients in the Tools project to rule out stale NFS file handles as the cause of the load spikes
- Kubernetes nodes were unable to communicate with etcd
- Rebooted flannel and flannel etcd
- Rebooted Kubernetes etcd
- Load continued to be very high on labstore1005, the NFS primary
- Halted new pod scheduling on Kubernetes (see the cordon/uncordon sketch after this timeline)
- Rolling reboot of Kubernetes nodes
- Rolling reboot of grid engine nodes
- Tuned kernel parameters on labstore1005 (see the sysctl sketch after this timeline)
- Let things sit to see if load would settle down
- Rolled back the labstore1004 kernel to 4.4.2-3+wmf8
- Changed the I/O scheduler on labstore1005 from deadline to cfq (see the sysfs sketch after this timeline)
- Let things sit to see if load would settle down
- Re-enabled new pod scheduling on Kubernetes to restore service to clients
- Let things sit to see if load would settle down
- 2017-06-30T00:28:26 Load spiked to a new high of 165.14 (1-minute average) on labstore1005
- Promoted labstore1004 to NFS primary and failed clients over to it
- Load on labstore1004 stayed within the values expected before the kernel update
- Rolled back the labstore1005 kernel to 4.4.2-3+wmf8
- Let things sit
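
Below is a minimal sketch of one way to halt and later re-enable new pod scheduling, by cordoning and uncordoning every Kubernetes node with kubectl. The timeline does not record the exact mechanism that was used, so treat this as an illustration under that assumption; it presumes a working kubectl configuration for the cluster.

```python
"""Halt or re-enable new pod scheduling by (un)cordoning all nodes.

Illustrative only: the incident timeline does not say how scheduling was
actually halted. Assumes kubectl is installed and configured for the cluster.
"""
import subprocess


def list_node_names():
    """Return the names of all nodes in the cluster."""
    out = subprocess.run(
        ["kubectl", "get", "nodes", "-o", "jsonpath={.items[*].metadata.name}"],
        check=True, capture_output=True, text=True,
    ).stdout
    return out.split()


def set_scheduling(enabled):
    """Cordon (disable) or uncordon (re-enable) scheduling on every node."""
    verb = "uncordon" if enabled else "cordon"
    for node in list_node_names():
        # Cordoning marks the node unschedulable; running pods are untouched.
        subprocess.run(["kubectl", verb, node], check=True)


if __name__ == "__main__":
    set_scheduling(False)   # halt new pod scheduling
    # Later, once load settles: set_scheduling(True)
```

Cordoning only blocks new pods from landing on a node; it does not evict anything, which fits the goal of reducing churn while the NFS load was investigated.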
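
The timeline does not say which kernel parameters were tuned on labstore1005. The sketch below only illustrates the general mechanism (writing values under /proc/sys, the same interface sysctl uses) with a hypothetical VM writeback knob; it is not the actual change made during the incident, and it needs root to run.

```python
"""Generic sysctl-style tuning via /proc/sys.

The specific parameters tuned on labstore1005 are not recorded in this
timeline; vm.dirty_ratio below is a hypothetical example. Requires root.
"""
from pathlib import Path


def set_sysctl(name, value):
    """Set a kernel parameter (e.g. 'vm.dirty_ratio') and return the old value."""
    path = Path("/proc/sys") / name.replace(".", "/")
    previous = path.read_text().strip()
    path.write_text(str(value))
    return previous


if __name__ == "__main__":
    # Hypothetical example: lowering dirty-page writeback thresholds is a
    # common knob on a write-heavy NFS server, but is NOT confirmed as the
    # change made during this incident.
    old = set_sysctl("vm.dirty_ratio", 10)
    print(f"vm.dirty_ratio: {old} -> 10")
```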
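
The deadline-to-cfq change corresponds to writing the scheduler name into /sys/block/&lt;device&gt;/queue/scheduler. A small sketch of that operation follows; the device name is a placeholder, since the timeline does not identify the block devices backing the NFS exports on labstore1005, and the write requires root.

```python
"""Switch a block device's I/O scheduler via sysfs (deadline -> cfq).

The device name below is a placeholder; the actual devices behind the
labstore1005 NFS exports are not named in this timeline. Requires root.
"""
from pathlib import Path


def set_io_scheduler(device, scheduler="cfq"):
    """Activate `scheduler` for `device`, checking that it is actually offered."""
    path = Path(f"/sys/block/{device}/queue/scheduler")
    # The sysfs file lists the available schedulers with the active one in
    # brackets, e.g. "noop [deadline] cfq".
    offered = path.read_text().replace("[", "").replace("]", "").split()
    if scheduler not in offered:
        raise ValueError(f"{scheduler!r} not offered for {device}: {offered}")
    path.write_text(scheduler)


if __name__ == "__main__":
    set_io_scheduler("sda", "cfq")  # "sda" is a placeholder device name
```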