
virt1002/labstore1001 network exhaustion
Closed, Resolved (Public)

Description

We have been getting port saturation threshold alerts every few minutes for virt1002 & labstore1001. Although the latter is (slowly) being worked on with #7282, I'm not aware of any work happening regarding virt1002.
Even if we upgrade both, there's nothing suggesting that the capacity will be enough, as I'm not aware of any data pinpointing where this capacity is being used.
This is likely a larger Labs issue that needs to be taken care of. It's been going on for many weeks now and is probably severely degrading Labs' performance, to the point that I wonder why we're not treating it with a very high priority, possibly even as a Labs outage.

Details

Reference
rt7657

Event Timeline

rtimport raised the priority of this task to Medium. Dec 18 2014, 1:56 AM
rtimport added a project: ops-core.
rtimport set Reference to rt7657.

On Tue Jun 10 04:38:13 2014, faidon wrote:

We have been getting port saturation threshold alerts every few minutes for virt1002 & labstore1001. Although the latter is (slowly) being worked on with #7282, I'm not aware of any work happening regarding virt1002.

Even if we upgrade both, there's nothing suggesting that the capacity will be enough, as I'm not aware of any data pinpointing where this capacity is being used.

This is likely a larger Labs issue that needs to be taken care of. It's been going on for many weeks now and is probably severely degrading Labs' performance, to the point that I wonder why we're not treating it with a very high priority, possibly even as a Labs outage.

Right now, virt1002 has most (almost all) of the Labs VMs on it, so it ends up being a hotspot. I'm going to look into spreading the load across more servers in the short term, and applying tc rules to cap the amount of bandwidth used by Labs.
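For illustration, a minimal sketch of what such tc rules could look like, using an HTB qdisc to cap traffic toward the Labs instance range. The interface name (eth1), the 500 Mbit/s cap, and the 10.68.0.0/16 destination range are assumptions for the example, not values taken from this task:

  # Root HTB qdisc; unclassified traffic falls into class 1:20.
  tc qdisc add dev eth1 root handle 1: htb default 20
  # Class for Labs-bound traffic, capped at 500 Mbit/s.
  tc class add dev eth1 parent 1: classid 1:10 htb rate 500mbit ceil 500mbit
  # Class for everything else, allowed up to line rate (10 Gbit/s here).
  tc class add dev eth1 parent 1: classid 1:20 htb rate 10gbit
  # Steer traffic destined for the (assumed) Labs instance range into 1:10.
  tc filter add dev eth1 parent 1: protocol ip prio 1 u32 match ip dst 10.68.0.0/16 flowid 1:10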

While you were away, I tracked this to certain CVN VMs. I prodded Krinkle (their maintainer) a couple of times and he promptly moved the file accesses to local storage, which very quickly dropped the bandwidth to saner levels again.
So, this might also be helped by user education -- Krinkle didn't know there was a penalty incurred for access to /data :)
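On the user-education point, a quick generic check for whether a given path is network-backed (and therefore carries that penalty) is to look at the filesystem type of the mount containing it; this is only a sketch, not a documented Labs procedure, and the exact mount layout under /data is an assumption:

  # Prints the filesystem type backing /data, e.g. "nfs"/"nfs4" for network
  # storage versus "ext4"/"xfs" for local disk.
  df -TP /data | awk 'NR==2 {print $2}'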

faidon changed the visibility from "WMF-NDA (Project)" to "Public (No Login Required)".
faidon changed the edit policy from "WMF-NDA (Project)" to "All Users".
faidon set Security to None.

This simply requires hunting down outliers as they occur.
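A rough bash sketch of what that hunting could look like on a virt host: sample per-interface byte counters twice and rank the busiest interfaces, which on a virtualization node roughly maps to individual VM tap devices. The 10-second window and the assumption that each VM appears as its own vnet*/tap* interface are illustrative, not taken from this task:

  # Sum RX+TX bytes per interface from /proc/net/dev, twice, 10 seconds apart.
  snap() { awk 'NR>2 {gsub(/:/," "); print $1, $2+$10}' /proc/net/dev | sort; }
  before=$(snap); sleep 10; after=$(snap)
  # Join the two samples on interface name and print throughput, busiest first.
  join <(printf '%s\n' "$before") <(printf '%s\n' "$after") \
    | awk '{printf "%-12s %8.1f Mbit/s\n", $1, ($3-$2)*8/10/1e6}' \
    | sort -k2 -rn | head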