
labstore1006 persistent high iowait
Closed, Resolved · Public

Description

Notification Type: PROBLEM

Service: Persistent high iowait
Host: labstore1006
Address: 208.80.154.7
State: CRITICAL

Date/Time: Sat Sept 19 06:46:50 UTC 2020

Notes URLs: https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Labstore https://grafana.wikimedia.org/dashboard/db/labs-monitoring

and it recovered at Date/Time: Sat Sept 19 07:19:32 UTC 2020

Upon investigation, I saw that the issue had started around 03:00 UTC, with high iowait and general usage. I was not able to correlate any specific processes with the iowait at the time, but the big network user was stat1005.eqiad.wmnet.

This task is to figure out what triggered it and how to mitigate it, or, if that is the right approach, to change the alert thresholds.
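
For next time, something like this could help correlate processes with the iowait (a rough sketch, not something we ran here; Linux-only, reading /proc directly, and it needs root to see every process's I/O counters):

```
#!/usr/bin/env python3
"""Sample per-process disk I/O twice and rank the biggest block-layer users."""
import os
import time

def snapshot():
    """Return {pid: (read_bytes, write_bytes, cmdline)} for live processes."""
    procs = {}
    for pid in filter(str.isdigit, os.listdir('/proc')):
        try:
            with open(f'/proc/{pid}/io') as f:
                fields = dict(line.split(': ') for line in f.read().splitlines())
            with open(f'/proc/{pid}/cmdline') as f:
                cmd = f.read().replace('\0', ' ').strip() or f'[pid {pid}]'
            procs[pid] = (int(fields['read_bytes']), int(fields['write_bytes']), cmd)
        except (OSError, KeyError):
            continue  # process exited or is unreadable; skip it
    return procs

INTERVAL = 10  # seconds between the two samples
before = snapshot()
time.sleep(INTERVAL)
after = snapshot()

# Rank by bytes that actually hit the block layer during the interval.
deltas = []
for pid, (r1, w1, cmd) in after.items():
    if pid in before:
        r0, w0, _ = before[pid]
        deltas.append((r1 - r0 + w1 - w0, cmd))

for total, cmd in sorted(deltas, reverse=True)[:10]:
    print(f'{total / INTERVAL / 1e6:8.1f} MB/s  {cmd[:80]}')
```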

Event Timeline

Bstorm triaged this task as Medium priority. Sep 19 2020, 5:01 PM
Bstorm created this task.

The dashboard links are out of date for this. Can you link a current dashboard for labstore1006? I added it to https://grafana.wikimedia.org/d/000000568/labstore1004-1005-1006 quickly, but I don't know what you referenced.

I'm guessing, though, that there should be a dumps dashboard and I should probably revert my edits :-)

> The dashboard links are out of date for this. Can you link a current dashboard for labstore1006? I added it to https://grafana.wikimedia.org/d/000000568/labstore1004-1005-1006 quickly, but I don't know what you referenced.

The labstore1004/5 dashboard was created a long time ago when there were concerns over the load averages on those specific servers (an issue we've largely given up on as cosmetic anyway). Kernel security patches caused it, and it may or may not go away when we get them on Buster. I don't personally suspect it will, for a lot of reasons, but it might. There is a need to review the large number of "labstore" and NFS dashboards and collapse them into the ones that are most useful and current. Many of those dashboards carry more history than useful information.

This looks like legitimate use, as people sometimes pull huge files over NFS at /mnt/public on the stat boxes. Assuming this kind of usage will happen again, what can we do? I remember some guidance a while back about using a rate limiter when pulling from NFS mounts?
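
Something like this is what I mean by a rate limiter (a rough sketch only; `rsync --bwlimit` does the same job, and the paths and the 40 MB/s cap here are hypothetical examples, not a recommendation):

```
#!/usr/bin/env python3
"""Bandwidth-capped copy from an NFS mount, to avoid saturating the server."""
import time

def throttled_copy(src, dst, limit_bytes_per_s=40 * 1024 * 1024,
                   chunk=1024 * 1024):
    """Copy src to dst, sleeping as needed to stay under the byte-rate cap."""
    start = time.monotonic()
    copied = 0
    with open(src, 'rb') as fin, open(dst, 'wb') as fout:
        while True:
            buf = fin.read(chunk)
            if not buf:
                break
            fout.write(buf)
            copied += len(buf)
            # If we're ahead of the allowed rate, sleep off the difference.
            expected = copied / limit_bytes_per_s
            elapsed = time.monotonic() - start
            if expected > elapsed:
                time.sleep(expected - elapsed)

# Hypothetical usage on a stat box, pulling a dump from the NFS mount:
# throttled_copy('/mnt/public/dumps/some-dump.gz', '/srv/scratch/some-dump.gz')
```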

aborrero claimed this task.
aborrero subscribed.

This happens from time to time. These aren't high-performance boxes anyway. We don't have a more specific fix today.