Things did not go as planned. We need to write up an incident report and track follow-up issues.
Roughly what occurred (needs timestamps; order and accuracy still to be verified):
* Upgraded kernel on labstore1004
* Promoted labstore1004 to primary and failed clients over to it
* Load spiked on labstore1004
* (LDAP outage from unrelated causes)
* Upgraded kernel on labstore1005
* Promoted labstore1005 to primary and failed clients over to it
* Load spiked on labstore1005
* Load spiked on several NFS clients
* Rebooted various NFS clients in the Tools project to rule out stale NFS file handles as the cause of the load spikes
* Kubernetes nodes unable to communicate with etcd
* Rebooted flannel and flannel etcd
* Rebooted Kubernetes etcd
* Load remained very high on labstore1005, the NFS primary
* Halted new pod scheduling on Kubernetes (see the scheduling sketch after this list)
* Rolling reboot of Kubernetes nodes
* Rolling reboot of grid engine nodes
* Tuned kernel parameters on labstore1005
* Changed the I/O scheduler on labstore1005 from deadline to cfq (see the tuning sketch after this list)
* Let things sit to see if the load would settle down
* Re-enabled new pod scheduling on Kubernetes
* Let things sit again to see if the load would settle down
* Load spikes hit a new high of **165.14** 1-minute average on labstore1005 (see the load-average sketch after this list)
* Rolled back the labstore1004 kernel to 4.4.2-3+wmf8
* Promoted labstore1004 to NFS primary and failed clients over to it
* Load on labstore1004 stayed within the expected pre-upgrade range
* Rolled back the labstore1005 kernel to 4.4.2-3+wmf8
* Let things sit
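
Scheduling sketch: the exact mechanism used to halt and re-enable new pod scheduling was not captured in the timeline. A minimal sketch of one way to do it, by cordoning and uncordoning worker nodes with kubectl (the node names below are hypothetical):

```python
"""Minimal sketch: pause/resume new pod scheduling by (un)cordoning nodes.

Whether kubectl cordon/uncordon was the exact mechanism used during this
incident is not recorded; the node names below are hypothetical.
"""
import subprocess


def set_schedulable(nodes, schedulable):
    # "cordon" marks a node unschedulable (running pods are untouched);
    # "uncordon" lets the scheduler place new pods on it again.
    verb = "uncordon" if schedulable else "cordon"
    for node in nodes:
        subprocess.run(["kubectl", verb, node], check=True)


if __name__ == "__main__":
    workers = ["tools-worker-01", "tools-worker-02"]  # hypothetical names
    set_schedulable(workers, False)  # halt new pod scheduling
    # ... investigation / rolling reboots ...
    set_schedulable(workers, True)   # re-enable new pod scheduling
```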
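Tuning sketch: the kernel-parameter and I/O-scheduler changes on labstore1005 come down to writes under /proc/sys and /sys/block. The device name and the sysctl shown below are illustrative only; the actual devices and parameters tuned were not recorded here.

```python
"""Minimal sketch of the labstore1005 tuning steps (run as root).

The block device name and the sysctl shown are illustrative; the actual
devices and kernel parameters tuned during the incident are not recorded.
"""

SCHED_PATH = "/sys/block/{dev}/queue/scheduler"


def current_scheduler(dev):
    # The active scheduler is bracketed, e.g. "noop [deadline] cfq".
    with open(SCHED_PATH.format(dev=dev)) as f:
        return f.read().split("[")[1].split("]")[0]


def set_scheduler(dev, name):
    # Writing a scheduler name to the sysfs file selects it for that device.
    with open(SCHED_PATH.format(dev=dev), "w") as f:
        f.write(name)


def set_sysctl(key, value):
    # Equivalent to `sysctl -w key=value`: write under /proc/sys.
    with open("/proc/sys/" + key.replace(".", "/"), "w") as f:
        f.write(str(value))


if __name__ == "__main__":
    dev = "sdb"  # illustrative device name
    print("scheduler before:", current_scheduler(dev))
    set_scheduler(dev, "cfq")
    print("scheduler after:", current_scheduler(dev))
    set_sysctl("vm.dirty_ratio", 10)  # illustrative sysctl only
```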
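Load-average sketch: the load figures quoted above (e.g. the 165.14 peak) are the standard Linux 1-minute load averages; for reference, they can be read as follows.

```python
"""Minimal sketch: read the 1/5/15-minute load averages quoted above."""


def load_averages():
    # The first three fields of /proc/loadavg are the 1, 5 and 15 minute
    # load averages, the same numbers reported by uptime(1) and top(1).
    with open("/proc/loadavg") as f:
        one, five, fifteen = f.read().split()[:3]
    return float(one), float(five), float(fifteen)


if __name__ == "__main__":
    one, five, fifteen = load_averages()
    print(f"1m={one} 5m={five} 15m={fifteen}")
```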