
Move castor instance to 4xiops disk flavor
Closed, Resolved · Public

Description

16:52:01 <Lucas_WMDE> I have a feeling that CI builds are “Waiting for the completion of castor-save-workspace-cache” a bit longer than usual lately… would it be possible to give castor more resources? (no idea if that even makes sense tbh)

17:13:30 <•hashar> Lucas_WMDE: the job doesn't run concurrently (I can't remember why)
17:13:39 <•hashar> and it transfers the whole cache of the job, so maybe those have grown 
17:13:58 <•hashar> and indeed there are a lot of them :
17:13:59 <•hashar> )
17:15:20 <•hashar> the caches are stored on integration-castor05.integration.eqiad1.wikimedia.cloud
17:15:33 <•hashar> and `iotop -o` shows rsync using 99% of the available disk io
17:18:02 <•hashar> and if I remember well the data is written to an attached volume
17:19:42 <•hashar> ah found it g3.cores8.ram36.disk20
17:19:57 <•hashar> it lacks the 4 x increase of disk io that other instances have
17:20:10 <•hashar> so openstack throttles the disk io made to the shared volume / Ceph
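For reference, the saturation can be checked with standard tools on the instance; a minimal sketch, assuming iotop and sysstat's iostat are installed:

  # Show only processes currently doing disk IO (what was used above)
  sudo iotop -o

  # One-off batch-mode snapshot, handy for pasting into a task
  sudo iotop -o -b -n 1

  # Per-device utilisation and throughput, refreshed every 5 seconds
  iostat -x 5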

The reason is that integration-castor05.integration.eqiad1.wikimedia.cloud uses the flavor g3.cores8.ram36.disk20, which has its disk IO rate limited (the default).

The fix is to migrate the instance to a flavor with raised disk IO throttling; those have 4xiops in their name. The available flavors are:

g3.cores8.ram24.disk20.ephemeral90.4xiops
g3.cores8.ram24.disk20.ephemeral60.4xiops
g3.cores8.ram24.disk20.ephemeral40.4xiops

However they have an ephemeral disk and less RAM (24 GB instead of 36 GB).

So I guess the easiest is to create a duplicate of the currently used flavor with the 4xiops suffix: g3.cores8.ram36.disk20.4xiops. Since neither flavor has an ephemeral disk, we can then change the flavor of the instance (which restarts it) and get the new IO throttle.
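For reference, the admin-side flavor creation would look roughly like this (a sketch; the quota:* extra specs and their values are assumptions about how the 4xiops throttling is implemented, the real definitions live with WMCS):

  # Sketch only: duplicate g3.cores8.ram36.disk20 with raised disk IO limits.
  # The quota values below are placeholders, not the real WMCS numbers.
  openstack flavor create g3.cores8.ram36.disk20.4xiops \
      --vcpus 8 --ram 36864 --disk 20 --private --project integration \
      --property quota:disk_read_iops_sec=20000 \
      --property quota:disk_write_iops_sec=20000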

Event Timeline

The g3.cores8.ram36.disk20.4xiops flavor has been created (T345925). Given that neither the current flavor nor the new one has an ephemeral disk, I should be able to resize the instance without having to recreate it.

NOTE: resizing implies an automatic restart of the instance.
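The resize itself should be the standard OpenStack client workflow, roughly (a sketch; exact syntax varies between client versions):

  # Resize to the new flavor; this reboots the instance
  openstack server resize --flavor g3.cores8.ram36.disk20.4xiops integration-castor05

  # Once the instance is back and looks healthy, confirm the resize
  # (older clients use: openstack server resize --confirm integration-castor05)
  openstack server resize confirm integration-castor05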

There is a single partition; the Castor data are in an attached volume, which will hopefully remain attached to the instance even after its flavor is changed.

I will look at it when time allows.

Also have to update the image:

jjb/castor-load-sync.bash:19:CASTOR_HOST="${CASTOR_HOST:-integration-castor03.integration.eqiad.wmflabs}"

CASTOR_HOST is set from https://integration.wikimedia.org/ci/configure and is currently set to:

CASTOR_HOST = integration-castor05.integration.eqiad1.wikimedia.cloud
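The `${CASTOR_HOST:-…}` syntax in the script is a plain Bash default: the value configured in Jenkins wins, and the hardcoded hostname is only a fallback. A minimal illustration (the echo line is just for demonstration):

  # ${VAR:-default} expands to $VAR when it is set and non-empty,
  # otherwise to the default, so the Jenkins-provided value takes precedence.
  CASTOR_HOST="${CASTOR_HOST:-integration-castor03.integration.eqiad.wmflabs}"
  echo "syncing caches from ${CASTOR_HOST}"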

Mentioned in SAL (#wikimedia-releng) [2023-09-28T14:25:11Z] <hashar> integration: resizing integration-castor05 from g3.cores8.ram36.disk20 to g3.cores8.ram36.disk20.4xiops ( +4xiops) # T345924

Change 961824 had a related patch set uploaded (by Hashar; author: Hashar):

[integration/config@master] jjb: disable castor-save-workspace-cache

https://gerrit.wikimedia.org/r/961824

Change 961824 merged by jenkins-bot:

[integration/config@master] jjb: disable castor-save-workspace-cache

https://gerrit.wikimedia.org/r/961824

16:29:49 <•hashar>     Invalid input received: Invalid volume: Volume 3f90c3f2-158d-4e45-a919-0f048f47c3b6 status must be available or downloading to reserve, but the current status is attaching. (HTTP 400) (Request-ID: req-ddd07558-b6b7-4ec6-8258-c4e5efb83a07)

I ended up deleting the instance.
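For the record, the error above means Cinder still saw the volume as `attaching` when Nova tried to reserve it for the resize. Something along these lines might have let the resize go through without recreating the instance (a sketch, untested against this particular failure; the volume ID is the one from the error):

  # Check what state Cinder thinks the volume is in
  openstack volume show 3f90c3f2-158d-4e45-a919-0f048f47c3b6 -c status

  # Detach the volume so its status goes back to "available",
  # resize, then reattach it to the instance
  openstack server remove volume integration-castor05 3f90c3f2-158d-4e45-a919-0f048f47c3b6
  openstack server resize --flavor g3.cores8.ram36.disk20.4xiops integration-castor05
  openstack server add volume integration-castor05 3f90c3f2-158d-4e45-a919-0f048f47c3b6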

Change 961831 had a related patch set uploaded (by Hashar; author: Hashar):

[integration/config@master] Revert "jjb: disable castor-save-workspace-cache"

https://gerrit.wikimedia.org/r/961831

Mentioned in SAL (#wikimedia-releng) [2023-09-28T15:26:24Z] <hashar> Reattaching integration-castor05 to Jenkins after its ssh host fingerprint got changed when I recreated the instance # T345924

Mentioned in SAL (#wikimedia-releng) [2023-09-28T15:26:59Z] <hashar> Reenabling castor-save-workspace-cache job # T345924

Change 961831 merged by jenkins-bot:

[integration/config@master] Revert "jjb: disable castor-save-workspace-cache"

https://gerrit.wikimedia.org/r/961831

Change 961844 had a related patch set uploaded (by Hashar; author: Hashar):

[operations/puppet@production] ci: manage cinder volume on Castor instance

https://gerrit.wikimedia.org/r/961844

I have recreated the instance with the same hostname integration-castor05. Since the ssh host key changed, I had to manually verify it in Jenkins.
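One way to cross-check the fingerprint accepted in Jenkins against what the rebuilt instance actually presents (a sketch using standard OpenSSH tooling):

  # Print the fingerprints of the host keys served by the new instance
  ssh-keyscan integration-castor05.integration.eqiad1.wikimedia.cloud 2>/dev/null \
      | ssh-keygen -lf -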

The new instance lacked a /srv mount for the attached storage, which is handled by profile::labs::cindermount::srv. https://gerrit.wikimedia.org/r/961844 adds it at the role level so that next time we rebuild the instance, Puppet will mount the volume (assuming the volume has been attached; otherwise Puppet errors out).

As a side effect, the Puppet configuration in Horizon solely contains role::ci::castor::server, which is a slight improvement.

The task can be marked resolved once that Puppet patch gets reviewed/merged.

To wrap up, I have added panels to the Cloud VPS Grafana instance showing the number of disk IOs in progress as well as the read/write throughput.

https://grafana.wmcloud.org/d/0g9N-7pVz/cloud-vps-project-board?orgId=1&var-project=integration&var-instance=integration-castor05&from=now-3h&to=now

Change 961844 merged by David Caro:

[operations/puppet@production] ci: manage cinder volume on Castor instance

https://gerrit.wikimedia.org/r/961844

Reads/writes should be faster now that the instance has the 4xiops flavor (which quadruples the number of IO operations it is allowed to perform).
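If anyone wants to sanity-check the effect, the raised limits should show up on the flavor, and a short random-write run on the Cinder-backed /srv gives a rough number (a sketch; the extra-spec names are an assumption and the fio flags are just one possible choice, assuming fio is installed):

  # Inspect the flavor's extra specs (IO quota property names are an assumption)
  openstack flavor show g3.cores8.ram36.disk20.4xiops -c properties

  # Rough IOPS measurement on the attached volume; delete /srv/fio-test afterwards
  fio --name=castor-iops --filename=/srv/fio-test --size=256M \
      --rw=randwrite --bs=4k --iodepth=16 --ioengine=libaio --direct=1 \
      --runtime=30 --time_based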