cloudvirt-wdqs1001 getting out of space due to huge VM
Open, High, Public

Assigned To

None

Authored By

	dcaro
	Feb 2 2021, 7:27 AM

Description

We seem to have given a flavor for a VM that is bigger than the host it's pinned to.

According to https://openstack-browser.toolforge.org/server/wcqs-beta-01.wikidata-query.eqiad1.wikimedia.cloud the VM was requested with Storage 3400G. I wonder if that's a typo with an extra 0 character.

Related Objects

		Status	Subtype	Assigned	Task
		Resolved		Gehel	T206636 Provide a way to have test servers on real hardware, isolated from production for Wikidata Query Service
		Open		None	T273579 cloudvirt-wdqs1001 getting out of space due to huge VM

Event Timeline

dcaro created this task. · Feb 2 2021, 7:27 AM

Restricted Application added a subscriber: Aklapper. · Feb 2 2021, 7:27 AM

dcaro claimed this task. · Feb 2 2021, 7:27 AM

Mentioned in SAL (#wikimedia-cloud) [2021-02-02T07:29:00Z] <dcaro> large VM wcqs-beta-01 is exhausting the hosts disk space (cloudvirt-wdqs1001) (T273579)

RhinosF1 subscribed. · Feb 2 2021, 8:41 AM

aborrero reassigned this task from dcaro to EBernhardson. · Feb 2 2021, 10:53 AM

aborrero triaged this task as High priority.

aborrero moved this task from Inbox to Soon! on the cloud-services-team (Kanban) board.

aborrero updated the task description. · Feb 2 2021, 10:55 AM

aborrero updated the task description.

aborrero added a subscriber: dcausse. · Feb 2 2021, 10:58 AM

Mentioned in SAL (#wikimedia-cloud) [2021-02-02T11:00:04Z] <arturo> icinga-downtime cloudvirt-wdqs1001 for 1 week (T273579)

aborrero added a parent task: T206636: Provide a way to have test servers on real hardware, isolated from production for Wikidata Query Service. · Feb 2 2021, 11:00 AM

3400G should be the correct size for this instance. Per node_filesystem_size_bytes{instance=~"cloudvirt-wdqs1001:.*"} the disk mounted to /var/lib/nova/instances is 3.6TB. In this use case it is expected that a single VM takes all the resources of the VM host.

On the other hand, we didn't expect the data inside to grow quite this fast. We will look into why the disk usage has grown so quickly, but overall we don't want to change the size of the VM.
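
The capacity argument in this comment can be checked with a few lines of arithmetic. The sketch below is illustrative only: it assumes the flavor's 3400G means GiB and that the 3.6TB read from the metric means TiB. Neither unit is confirmed in this task, and `headroom_gib` is a hypothetical helper, not part of any tooling mentioned here.

```python
# Sketch only: the unit interpretations below are assumptions, not facts from the task.
GIB = 1024 ** 3  # bytes per GiB
TIB = 1024 ** 4  # bytes per TiB


def headroom_gib(flavor_disk_gib: float, host_fs_tib: float) -> float:
    """Return the host filesystem space left over after the flavor's disk, in GiB."""
    host_fs_bytes = host_fs_tib * TIB
    flavor_bytes = flavor_disk_gib * GIB
    return (host_fs_bytes - flavor_bytes) / GIB


# Figures quoted in the comment: a 3400G flavor on a 3.6TB filesystem.
spare = headroom_gib(3400, 3.6)
share = spare / (3.6 * 1024)  # spare space as a fraction of the host filesystem
print(f"headroom: {spare:.0f} GiB ({share:.1%} of the host filesystem)")
```

Under those assumptions the VM's disk fits on the host with less than a tenth of the filesystem to spare, which is consistent with a host-level disk space alert firing as the VM's disk fills up.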

bd808 added a project: VPS-Projects. · Feb 2 2021, 9:12 PM

We did free up space inside of that VM (see T273636). At the moment, there is 430G used (14% of /srv), so we have some headroom. The underlying issue of the blazegraph journal growing out of control isn't fixed, so this is going to happen again in a few weeks / months. Hopefully, we'll have a better update process before that and we'll have moved this service to a production context (see T260568).

Let us (@Gehel and @RKemper) know if there is something more we should be doing in the short term.

From the virt layer's perspective this is working as designed. We might want to figure out how to suppress the icinga disk space warnings.

fnegri edited projects, added cloud-services-team; removed cloud-services-team (Kanban). · Jan 18 2023, 7:15 PM

fnegri moved this task from Kanban to Soon! on the cloud-services-team board.

@EBernhardson: Removing task assignee as this open task has been assigned for more than two years - See the email sent to task assignee on February 22nd, 2023.
Please assign this task to yourself again if you still realistically [plan to] work on this task - it would be welcome! :)
If this task has been resolved in the meantime, or should not be worked on by anybody ("declined"), please update its task status via "Add Action… 🡒 Change Status".
Also see https://www.mediawiki.org/wiki/Bug_management/Assignee_cleanup for tips on how to best manage your individual work in Phabricator. Thanks!
