
labstore1006 persistent high iowait
Closed, Resolved · Public

Description

Notification Type: PROBLEM

Service: Persistent high iowait
Host: labstore1006
Address: 208.80.154.7
State: CRITICAL

Date/Time: Sat Sept 19 06:46:50 UTC 2020

Notes URLs: https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Labstore https://grafana.wikimedia.org/dashboard/db/labs-monitoring

and it recovered at Date/Time: Sat Sept 19 07:19:32 UTC 2020

Upon investigation, I saw that the issue had started around 03:00 UTC, with high iowait and general usage. I was not able to correlate any specific processes with the iowait at the time, but the big network user was stat1005.eqiad.wmnet.

This task is to figure out what triggered it and how to mitigate it, or, if that is the right approach, to change the alert thresholds.
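
For next time, something like this could help correlate processes with the iowait (a rough sketch, not something we ran here; Linux-only, reading /proc directly, and it needs root to see every process's I/O counters):

```
#!/usr/bin/env python3
"""Sample per-process disk I/O twice and rank the biggest block-layer users."""
import os
import time

def snapshot():
    """Return {pid: (read_bytes, write_bytes, cmdline)} for live processes."""
    procs = {}
    for pid in filter(str.isdigit, os.listdir('/proc')):
        try:
            with open(f'/proc/{pid}/io') as f:
                fields = dict(line.split(': ') for line in f.read().splitlines())
            with open(f'/proc/{pid}/cmdline') as f:
                cmd = f.read().replace('\0', ' ').strip() or f'[pid {pid}]'
            procs[pid] = (int(fields['read_bytes']), int(fields['write_bytes']), cmd)
        except (OSError, KeyError):
            continue  # process exited or is unreadable; skip it
    return procs

INTERVAL = 10  # seconds between the two samples
before = snapshot()
time.sleep(INTERVAL)
after = snapshot()

# Rank by bytes that actually hit the block layer during the interval.
deltas = []
for pid, (r1, w1, cmd) in after.items():
    if pid in before:
        r0, w0, _ = before[pid]
        deltas.append((r1 - r0 + w1 - w0, cmd))

for total, cmd in sorted(deltas, reverse=True)[:10]:
    print(f'{total / INTERVAL / 1e6:8.1f} MB/s  {cmd[:80]}')
```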

Event Timeline

Bstorm triaged this task as Medium priority. Sep 19 2020, 5:01 PM
Bstorm created this task.

The dashboard links are out of date for this. Can you link a current dashboard for labstore1006? I added it to https://grafana.wikimedia.org/d/000000568/labstore1004-1005-1006 quickly, but I don't know what you referenced.

I'm guessing, though, that there should be a dumps dashboard and I should probably revert my edits :-)

> The dashboard links are out of date for this. Can you link a current dashboard for labstore1006? I added it to https://grafana.wikimedia.org/d/000000568/labstore1004-1005-1006 quickly, but I don't know what you referenced.

The labstore1004/5 dashboard was created a long time ago when there were concerns over the load averages on those specific servers (an issue we've largely given up on as cosmetic anyway). Kernel security patches caused it, and it may or may not go away when we get them on Buster. I don't personally suspect it will, for a lot of reasons, but it might. There is a need to review the large number of "labstore" and NFS dashboards and collapse them into the ones that are most useful and current. Many of those dashboards carry more history than useful information.

This looks like legitimate use, as people sometimes pull huge files over NFS at /mnt/public on the stat boxes. Assuming this kind of usage will happen again, what can we do? I remember some guidance a while back about using a rate limiter when pulling from NFS mounts?
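
Something like this is what I mean by a rate limiter (a rough sketch only; `rsync --bwlimit` does the same job, and the paths and the 40 MB/s cap here are hypothetical examples, not a recommendation):

```
#!/usr/bin/env python3
"""Bandwidth-capped copy from an NFS mount, to avoid saturating the server."""
import time

def throttled_copy(src, dst, limit_bytes_per_s=40 * 1024 * 1024,
                   chunk=1024 * 1024):
    """Copy src to dst, sleeping as needed to stay under the byte-rate cap."""
    start = time.monotonic()
    copied = 0
    with open(src, 'rb') as fin, open(dst, 'wb') as fout:
        while True:
            buf = fin.read(chunk)
            if not buf:
                break
            fout.write(buf)
            copied += len(buf)
            # If we're ahead of the allowed rate, sleep off the difference.
            expected = copied / limit_bytes_per_s
            elapsed = time.monotonic() - start
            if expected > elapsed:
                time.sleep(expected - elapsed)

# Hypothetical usage on a stat box, pulling a dump from the NFS mount:
# throttled_copy('/mnt/public/dumps/some-dump.gz', '/srv/scratch/some-dump.gz')
```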

aborrero claimed this task.
aborrero subscribed.

This happens from time to time. These aren't high-performance boxes anyway. We don't have a more specific fix today.