
cloudvirt-wdqs1001 running out of space due to huge VM
Open, High, Public

Description

We seem to have given a VM a flavor that is bigger than the host it's pinned to.

According to https://openstack-browser.toolforge.org/server/wcqs-beta-01.wikidata-query.eqiad1.wikimedia.cloud the VM was requested with Storage 3400G. I wonder if that's a typo with an extra 0 character.

Event Timeline

Mentioned in SAL (#wikimedia-cloud) [2021-02-02T07:29:00Z] <dcaro> large VM wcqs-beta-01 is exhausting the hosts disk space (cloudvirt-wdqs1001) (T273579)

aborrero triaged this task as High priority.
aborrero moved this task from Inbox to Soon! on the cloud-services-team (Kanban) board.
aborrero updated the task description.

Mentioned in SAL (#wikimedia-cloud) [2021-02-02T11:00:04Z] <arturo> icinga-downtime cloudvirt-wdqs1001 for 1 week (T273579)

3400G should be the correct size for this instance. Per node_filesystem_size_bytes{instance=~"cloudvirt-wdqs1001:.*"} the disk mounted to /var/lib/nova/instances is 3.6TB. In this use case it is expected that a single VM takes all the resources of the VM host.
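To make the size comparison above concrete, here is a rough sketch of the headroom arithmetic. It assumes the flavor's 3400G is in GiB (as OpenStack flavor disk sizes usually are) and treats the unit behind the rounded "3.6TB" figure as unknown, so both interpretations are shown. The helper name `headroom_bytes` is hypothetical, and the exact byte count should come from `node_filesystem_size_bytes` rather than the rounded value used here.

```python
# Rough headroom check: host filesystem size minus the VM's maximum disk size.
# All figures are taken from the comment above; units are assumptions.
GIB = 1024 ** 3
TIB = 1024 ** 4
TB = 1000 ** 4


def headroom_bytes(host_bytes: int, flavor_disk_gib: int) -> int:
    """Bytes left on the host filesystem if the VM disk fills completely."""
    return host_bytes - flavor_disk_gib * GIB


flavor_gib = 3400  # flavor storage, assumed to be GiB

# "3.6TB" is a rounded figure; try both the binary and decimal reading.
for label, host_bytes in (("3.6 TiB", int(3.6 * TIB)), ("3.6 TB", int(3.6 * TB))):
    left = headroom_bytes(host_bytes, flavor_gib)
    print(f"host = {label}: headroom {left / GIB:.0f} GiB")
```

Under the binary reading there is some slack left on the host; under the decimal reading the fully grown VM disk would not fit, which is why the exact value from the metric matters here.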

On the other hand, we didn't expect the data inside to grow this fast. We will look into why disk usage grew so quickly, but overall we don't want to change the size of the VM.

We did free up space inside of that VM (see T273636). At the moment, there is 430G used (14% of /srv), so we have some headroom. The underlying issue of the blazegraph journal growing out of control isn't fixed, so this is going to happen again in a few weeks / months. Hopefully, we'll have a better update process before that and we'll have moved this service to a production context (see T260568).
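As a back-of-the-envelope check on the headroom mentioned above, the two reported figures (430G used, 14% of /srv) imply an approximate filesystem size. This is only a sketch: the percentage is rounded, and tools such as df may compute it against usable rather than raw space, so treat the result as an estimate. The function name `estimate_total_g` is hypothetical.

```python
# Estimate the size of /srv from the reported usage figures.
def estimate_total_g(used_g: float, pct_used: float) -> float:
    """Approximate filesystem size in G, given used space and percent used."""
    return used_g / (pct_used / 100.0)


used_g = 430.0    # reported used space on /srv
pct_used = 14.0   # reported percentage (rounded)

total_g = estimate_total_g(used_g, pct_used)
free_g = total_g - used_g
print(f"estimated /srv size: {total_g:.0f}G, estimated free: {free_g:.0f}G")
```

How long that free space lasts depends on the blazegraph journal's growth rate, which is not given in this task.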

Let us (@Gehel and @RKemper) know if there is something more we should be doing in the short term.

From the virt layer's perspective this is working as designed. We might want to figure out how to suppress the icinga disk space warnings.