Notification Type: PROBLEM
Service: Persistent high iowait
Host: labstore1006
Address: 208.80.154.7
State: CRITICAL
Date/Time: Sat Sept 19 06:46:50 UTC 2020
Notes URLs: https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Labstore https://grafana.wikimedia.org/dashboard/db/labs-monitoring
The alert recovered at Date/Time: Sat Sept 19 07:19:32 UTC 2020
Upon investigation, I saw that the issue had started around 03:00 UTC, with high iowait and elevated general usage. I could not correlate the iowait with any specific process at the time, but the biggest network user was stat1005.eqiad.wmnet.
This task is to figure out what triggered it and how to mitigate it, or, if that turns out to be the right approach, to change the alert thresholds.
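
When evaluating thresholds, it may help to sample iowait directly on the host. The following is a minimal sketch, not the check that produced this alert: it assumes a Linux host and the standard field order of the aggregate `cpu` line in /proc/stat (user, nice, system, idle, iowait, irq, softirq, steal, guest, guest_nice). The function names and the 10-second default interval are hypothetical choices for illustration.

```python
import time


def parse_cpu_line(line):
    """Parse the aggregate 'cpu' line of /proc/stat into integer jiffy counters."""
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line from /proc/stat")
    return [int(value) for value in fields[1:]]


def iowait_percent(before, after):
    """Return iowait as a percentage of all CPU time elapsed between two samples."""
    deltas = [b - a for a, b in zip(before, after)]
    # Only the first eight counters are summed: guest and guest_nice
    # are already accounted for inside user and nice.
    total = sum(deltas[:8])
    if total <= 0:
        return 0.0
    return 100.0 * deltas[4] / total  # index 4 is the iowait counter


def read_cpu_counters(path="/proc/stat"):
    """Read the current aggregate CPU counters (Linux only)."""
    with open(path) as stat_file:
        return parse_cpu_line(stat_file.readline())


def sample_iowait(interval_seconds=10):
    """Measure iowait over a hypothetical sampling interval."""
    before = read_cpu_counters()
    time.sleep(interval_seconds)
    after = read_cpu_counters()
    return iowait_percent(before, after)
```

Calling `sample_iowait()` repeatedly during a suspected window, alongside per-process I/O tooling, could help show whether the spikes line up with transfers from a particular client.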