On 7/19/17 load on labstore1004 began increasing and icinga alerted:
[20:41:40] <icinga-wm> PROBLEM - High load average on labstore1004 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [24.0]
[20:43:40] <icinga-wm> PROBLEM - High load average on labstore1004 is CRITICAL: CRITICAL: 88.89% of data above the critical threshold [24.0]
@madhuvishy and I investigated looking for the usual signs of overzealous clients, or IO wait. My initial suspicion was that the previously days migration of non-Tools users to the new NFS cluster had come with a client who was not throttled and/or was creating contention. iftop and the like were not able to target any clients who were beyond normal usage. giftbot did appear to be a heavy user but that is pretty standard. I marked all non-Tools exports as ro briefly to see if it had an affect and it had none noticeably. (changed export.d files and exportfs -ra)
I solicited for help at 21:01:56 from available operations folks and @Volans, @ema and @Joe were able to take a look (at this point we were at 40 or so load wise and climbing where normal load is <=5). The diagnoses was that several NFSd threads were in D wait state but no other indications of disk contention or abusive clients correlated. DRBD state in /proc/drbd reported as fine and in sync. Clients (at least Tools and the few I spot checked) seemed to be operating normally at this point still.
[Jan19 20:13] INFO: task nfsd:95840 blocked for more than 120 seconds
I restarted nfs-kernel-server to see if that would effect load and it did not. This should not have been noticeable other than transient D wait state on clients. Load was climbing past 55 now and I knew soon it was going to effect normal operations. We didn't know how to mitigate so I decided to make a controlled failover to the secondary node in the pair (labstore1005) using nfs-manage while labstore1004 was still responsive and otherwise 'normal'. I figured it was either internal factors that were not diagnosable in real-time or external factors we would see surface on the new primary.
21:33 nfs-manage down on labstore1004 (load at 60+)
21:35 rebooted labstore1004 (notes on why below)
21:39 labstore1005 failover complete and clients recovering
21:40 @yuvipanda emailed labs-l to let them know we were experiencing issues
21:41<chasemp> status is, afaik labstore1005 is up and available to handle things, I'm looking through to see if labstore1005 is actually handling things in reality
21:50 no evidence of remaining issues in Tools of NFS issues
This failover had two unexpected components:
- labstore1004 even with nfs-kernel-server completely stopped would not release the devices for umount cleanly, and I saw NFSd threads persist as part of nfs-manage down DRBD kept primary state for both Tools and Misc on labstore1004. After stopping nfs-kernel-server to make sure no clients were accessing resources I rebooted labstore1004 (at 21:35) to guarantee labstore1005 would see it as unavailable.
State change failed: (-12) Device is held open by someone
Command 'drbdsetup-84 secondary 3' terminated with exit code 11
- labstore1005 did not want to take control via nfs-manage up because of the preexisting Primary status of labstore1004. That's more or less expected and good for integrity reasons, and the remedy was using drbdadm primary all --force (instead of drbdadm primary all) within the management script.
The failover delay and unavailability of NFS while I figured out to use --force caused some portion of clients to have their mounts "stuck" with Stale file handle and that affected a known subset of Tools exec hosts for sure. This did resolve without intervention once the stack was functional and available.
Icinga began complaining about labstore1004 being unavailable after all failover actions were completed and it should have been back up and stable without workload:
[22:15:57] <icinga-wm> PROBLEM - Host labstore1004 is DOWN: PING CRITICAL - Packet loss = 100%
[22:16:27] <icinga-wm> RECOVERY - Host labstore1004 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms
Looking at the icinga history this has continued since this time intermittently https://icinga.wikimedia.org/cgi-bin/icinga/history.cgi?host=labstore1004. @Volans reported seeing around 2% packet loss during a controlled ping test.
- @madhuvishy to confirm reasoning and timeline here and convert to an incident report
- We are now running on labstore1005 and it has been stable so far (https://gerrit.wikimedia.org/r/#/c/333131/)
- Need to investigate if --force should be default for nfs-manage up or if it should be an option
- @paravoid reported that the eth0 and eth1 NICs on labstore1004/1005 will cause undeterministic behavior that may be the cause of the periodic unavailability. Addressing this should mostly likely be moving the replication link for DRBD to a directly connected crossover cable.