
Move castor instance to 4xiops disk flavor
Closed, Resolved · Public

Description

16:52:01 <Lucas_WMDE> I have a feeling that CI builds are “Waiting for the completion of castor-save-workspace-cache” a bit longer than usual lately… would it be possible to give castor more resources? (no idea if that even makes sense tbh)

17:13:30 <•hashar> Lucas_WMDE: the job doesn't run concurrently (I can't remember why)
17:13:39 <•hashar> and it transfers the whole cache of the job, so maybe those have grown 
17:13:58 <•hashar> and indeed there are a lot of them :
17:13:59 <•hashar> )
17:15:20 <•hashar> the caches are stored on integration-castor05.integration.eqiad1.wikimedia.cloud
17:15:33 <•hashar> and `iotop -o` shows rsync using 99% of the available disk io
17:18:02 <•hashar> and if I remember well the data is written to an attached volume
17:19:42 <•hashar> ah found it g3.cores8.ram36.disk20
17:19:57 <•hashar> it lacks the 4 x increase of disk io that other instances have
17:20:10 <•hashar> so openstack throttles the disk io made to the shared volume / Ceph
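For reference, the saturation can be checked with standard tools on the instance; a minimal sketch, assuming iotop and sysstat's iostat are installed:

  # Show only processes currently doing disk IO (what was used above)
  sudo iotop -o

  # One-off batch-mode snapshot, handy for pasting into a task
  sudo iotop -o -b -n 1

  # Per-device utilisation and throughput, refreshed every 5 seconds
  iostat -x 5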

The reason is that integration-castor05.integration.eqiad1.wikimedia.cloud uses the flavor g3.cores8.ram36.disk20, which has its disk IO rate limited (the default).

The fix is to migrate the instance to a flavor with raised disk IO throttling; those have 4xiops in their name. The available flavors are:

g3.cores8.ram24.disk20.ephemeral90.4xiops
g3.cores8.ram24.disk20.ephemeral60.4xiops
g3.cores8.ram24.disk20.ephemeral40.4xiops

However they have an ephemeral disk and less RAM (24 GB instead of 36 GB).

So I guess the easiest is to create a duplicate of the currently used flavor with the 4xiops suffix: g3.cores8.ram36.disk20.4xiops. Since neither flavor has an ephemeral disk, we can then change the flavor of the instance (which restarts it) and get the new IO throttle.
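For reference, the admin-side flavor creation would look roughly like this (a sketch; the quota:* extra specs and their values are assumptions about how the 4xiops throttling is implemented, the real definitions live with WMCS):

  # Sketch only: duplicate g3.cores8.ram36.disk20 with raised disk IO limits.
  # The quota values below are placeholders, not the real WMCS numbers.
  openstack flavor create g3.cores8.ram36.disk20.4xiops \
      --vcpus 8 --ram 36864 --disk 20 --private --project integration \
      --property quota:disk_read_iops_sec=20000 \
      --property quota:disk_write_iops_sec=20000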

Event Timeline

The g3.cores8.ram36.disk20.4xiops flavor has been created (T345925). Given that neither the current flavor nor the new one has an ephemeral disk, I should be able to resize the instance without having to recreate it.

NOTE: resizing implies an automatic restart of the instance.
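The resize itself should be the standard OpenStack client workflow, roughly (a sketch; exact syntax varies between client versions):

  # Resize to the new flavor; this reboots the instance
  openstack server resize --flavor g3.cores8.ram36.disk20.4xiops integration-castor05

  # Once the instance is back and looks healthy, confirm the resize
  # (older clients use: openstack server resize --confirm integration-castor05)
  openstack server resize confirm integration-castor05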

There is a single partition; the Castor data are in an attached volume, which will hopefully remain attached to the instance even after its flavor is changed.

I will look at it when time allows.

Also have to update the image:

jjb/castor-load-sync.bash:19:CASTOR_HOST="${CASTOR_HOST:-integration-castor03.integration.eqiad.wmflabs}"

CASTOR_HOST is set from https://integration.wikimedia.org/ci/configure and is currently set to:

CASTOR_HOST = integration-castor05.integration.eqiad1.wikimedia.cloud
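The `${CASTOR_HOST:-…}` syntax in the script is a plain Bash default: the value configured in Jenkins wins, and the hardcoded hostname is only a fallback. A minimal illustration (the echo line is just for demonstration):

  # ${VAR:-default} expands to $VAR when it is set and non-empty,
  # otherwise to the default, so the Jenkins-provided value takes precedence.
  CASTOR_HOST="${CASTOR_HOST:-integration-castor03.integration.eqiad.wmflabs}"
  echo "syncing caches from ${CASTOR_HOST}"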

Mentioned in SAL (#wikimedia-releng) [2023-09-28T14:25:11Z] <hashar> integration: resizing integration-castor05 from g3.cores8.ram36.disk20 to g3.cores8.ram36.disk20.4xiops ( +4xiops) # T345924

Change 961824 had a related patch set uploaded (by Hashar; author: Hashar):

[integration/config@master] jjb: disable castor-save-workspace-cache

https://gerrit.wikimedia.org/r/961824

Change 961824 merged by jenkins-bot:

[integration/config@master] jjb: disable castor-save-workspace-cache

https://gerrit.wikimedia.org/r/961824

16:29:49 <•hashar>     Invalid input received: Invalid volume: Volume 3f90c3f2-158d-4e45-a919-0f048f47c3b6 status must be available or downloading to reserve, but the current status is attaching. (HTTP 400) (Request-ID: req-ddd07558-b6b7-4ec6-8258-c4e5efb83a07)

I ended up deleting the instance.
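For the record, the error above means Cinder still saw the volume as `attaching` when Nova tried to reserve it for the resize. Something along these lines might have let the resize go through without recreating the instance (a sketch, untested against this particular failure; the volume ID is the one from the error):

  # Check what state Cinder thinks the volume is in
  openstack volume show 3f90c3f2-158d-4e45-a919-0f048f47c3b6 -c status

  # Detach the volume so its status goes back to "available",
  # resize, then reattach it to the instance
  openstack server remove volume integration-castor05 3f90c3f2-158d-4e45-a919-0f048f47c3b6
  openstack server resize --flavor g3.cores8.ram36.disk20.4xiops integration-castor05
  openstack server add volume integration-castor05 3f90c3f2-158d-4e45-a919-0f048f47c3b6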

Change 961831 had a related patch set uploaded (by Hashar; author: Hashar):

[integration/config@master] Revert "jjb: disable castor-save-workspace-cache"

https://gerrit.wikimedia.org/r/961831

Mentioned in SAL (#wikimedia-releng) [2023-09-28T15:26:24Z] <hashar> Reattaching integration-castor05 to Jenkins after its ssh host fingerprint got changed when I recreated the instance # T345924

Mentioned in SAL (#wikimedia-releng) [2023-09-28T15:26:59Z] <hashar> Reenabling castor-save-workspace-cache job # T345924

Change 961831 merged by jenkins-bot:

[integration/config@master] Revert "jjb: disable castor-save-workspace-cache"

https://gerrit.wikimedia.org/r/961831

Change 961844 had a related patch set uploaded (by Hashar; author: Hashar):

[operations/puppet@production] ci: manage cinder volume on Castor instance

https://gerrit.wikimedia.org/r/961844

I have recreated the instance with the same hostname integration-castor05. Since the ssh host key changed, I had to manually verify it in Jenkins.
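One way to cross-check the fingerprint accepted in Jenkins against what the rebuilt instance actually presents (a sketch using standard OpenSSH tooling):

  # Print the fingerprints of the host keys served by the new instance
  ssh-keyscan integration-castor05.integration.eqiad1.wikimedia.cloud 2>/dev/null \
      | ssh-keygen -lf -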

The new instance lacked a /srv mount for the attached storage, which is handled by profile::labs::cindermount::srv. https://gerrit.wikimedia.org/r/961844 adds it at the role level so that next time we rebuild the instance, Puppet will mount the volume (assuming the volume has been attached; otherwise Puppet errors out).

As a side effect, the Puppet configuration in Horizon solely contains role::ci::castor::server, which is a slight improvement.

The task can be marked resolved once that Puppet patch gets reviewed/merged.

To wrap up, I have added panels to the Cloud VPS Grafana instance showing the number of disk IOs in progress as well as the read/write throughput.

https://grafana.wmcloud.org/d/0g9N-7pVz/cloud-vps-project-board?orgId=1&var-project=integration&var-instance=integration-castor05&from=now-3h&to=now

Change 961844 merged by David Caro:

[operations/puppet@production] ci: manage cinder volume on Castor instance

https://gerrit.wikimedia.org/r/961844

Reads/writes should be faster now that the instance has the 4xiops flavor (which quadruples the number of IO operations it is allowed to perform).
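If anyone wants to sanity-check the effect, the raised limits should show up on the flavor, and a short random-write run on the Cinder-backed /srv gives a rough number (a sketch; the extra-spec names are an assumption and the fio flags are just one possible choice, assuming fio is installed):

  # Inspect the flavor's extra specs (IO quota property names are an assumption)
  openstack flavor show g3.cores8.ram36.disk20.4xiops -c properties

  # Rough IOPS measurement on the attached volume; delete /srv/fio-test afterwards
  fio --name=castor-iops --filename=/srv/fio-test --size=256M \
      --rw=randwrite --bs=4k --iodepth=16 --ioengine=libaio --direct=1 \
      --runtime=30 --time_based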