Bump memory for registry[12]00[34] VMs
Closed, ResolvedPublic

Authored By

	elukey
	Mar 21 2024, 2:02 PM

Description

In T359067, ServiceOps and Machine Learning agreed on a short/medium-term fix to allow bigger image layers to be pushed to the Docker registry nodes.

The idea is to:

Bump VM memory from 4GB to 6GB on all registry* nodes.
Increase nginx's tmpfs mountpoint on those nodes to 4GB. The setting is profile::nginx::tmpfs_size in hieradata/role/common/docker_registry_ha/registry.yaml
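
As a rough sanity check on the sizing above: tmpfs pages live in RAM (or swap), so a completely full 4GB tmpfs counts against the VM's memory. A minimal Python sketch of that arithmetic follows; the helper name `headroom_gb` is purely illustrative and not part of any existing tooling.

```python
def headroom_gb(vm_memory_gb: float, tmpfs_size_gb: float) -> float:
    """Memory left for the OS and daemons if the tmpfs fills up completely."""
    return vm_memory_gb - tmpfs_size_gb


# Values taken from the task description.
old_vm = headroom_gb(4, 4)  # old VM size combined with the new tmpfs size
new_vm = headroom_gb(6, 4)  # new VM size combined with the new tmpfs size
print(f"4GB VM: {old_vm}GB left, 6GB VM: {new_vm}GB left")
```

This is why the memory bump and the tmpfs increase go together: a 4GB tmpfs could in the worst case consume all the memory of a 4GB VM.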

The docker-registry discovery record shows eqiad depooled and codfw pooled, so I could start with eqiad and then move to codfw.

After reading https://wikitech.wikimedia.org/wiki/Docker-registry/Runbook I am boldly making these assumptions:

Shutting down the VMs in the depooled DC shouldn't require any extra steps, since the only important thing seems to be Swift replication and the VMs run stateless daemons.
Upgrading codfw may be done one VM at a time without needing a failover, but to be safe we can fail over anyway to guarantee capacity (for example, if Ganeti fails to bring the VM back up).

Does the above make sense? Is there a more complete or sound procedure to follow?

Details

	Subject	Repo	Branch	Lines +/-
	role::docker_registry_ha::registry: set nginx's tmpfs size in codfw	operations/puppet	production	+5 -2
	role::docker_registry_ha::registry: increase tmpfs size in eqiad	operations/puppet	production	+2 -0

Related Objects

		Status	Subtype	Assigned	Task
		Resolved		elukey	T359067 Find an efficient strategy to add Pytorch and ROCm packages to our Docker images
		Resolved		elukey	T360637 Bump memory for registry[12]00[34] VMs

Event Timeline

elukey created this task. Mar 21 2024, 2:02 PM

elukey added a subscriber: klausman.

Sounds good to me. I'd say you can just depool one of the active registry nodes and restart that VM for the RAM increase. No need for extra steps

In T360637#9649735, @JMeybohm wrote:

Sounds good to me. I'd say you can just depool one of the active registry nodes and restart that VM for the RAM increase. No need for extra steps

You need to reboot at the KVM level, though, i.e. via the sre.ganeti.reboot-vm cookbook; an OS reboot isn't enough.

akosiaris updated the task description. Mar 21 2024, 2:49 PM

elukey updated the task description. Mar 21 2024, 4:16 PM

Mentioned in SAL (#wikimedia-operations) [2024-03-21T16:25:58Z] <elukey> expand vram for registry100[3,4] from 4G to 6G - T360637

VM registry1003.eqiad.wmnet rebooted by elukey@cumin1002 with reason: Increase VRAM

Mentioned in SAL (#wikimedia-operations) [2024-03-21T16:35:52Z] <elukey> edit /etc/network/interfaces on registry1003 (ens5 => ens13) - T360637

VM registry1004.eqiad.wmnet rebooted by elukey@cumin1002 with reason: Increase VRAM

Mentioned in SAL (#wikimedia-operations) [2024-03-21T16:44:28Z] <elukey> edit /etc/network/interfaces on registry1004 (ens5 => ens13) - T360637

elukey@ganeti1027:~$ sudo gnt-instance list | grep registry
registry1003.eqiad.wmnet            kvm        debootstrap+default ganeti1026.eqiad.wmnet running   6.0G
registry1004.eqiad.wmnet            kvm        debootstrap+default ganeti1011.eqiad.wmnet running   6.0G
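
The check above can also be scripted. The following Python sketch parses lines in the format pasted above and confirms that every registry VM reports the new memory size. It assumes the six whitespace-separated columns seen in this output, with memory as the last one; the function name `parse_instance_memory` is hypothetical.

```python
def parse_instance_memory(listing: str) -> dict[str, str]:
    """Map instance name -> memory column for lines shaped like the output above."""
    memory = {}
    for line in listing.splitlines():
        fields = line.split()
        if len(fields) < 6:
            continue  # skip blank or malformed lines
        memory[fields[0]] = fields[-1]
    return memory


# Sample copied from the output pasted in this task.
SAMPLE = """\
registry1003.eqiad.wmnet            kvm        debootstrap+default ganeti1026.eqiad.wmnet running   6.0G
registry1004.eqiad.wmnet            kvm        debootstrap+default ganeti1011.eqiad.wmnet running   6.0G
"""

result = parse_instance_memory(SAMPLE)
all_resized = all(mem == "6.0G" for mem in result.values())
print(result, all_resized)
```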

Eqiad done, will take care of codfw as next step :)

Change #1013541 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] role::docker_registry_ha::registry: increase tmpfs size in eqiad

https://gerrit.wikimedia.org/r/1013541

gerritbot added a project: Patch-For-Review. Mar 22 2024, 1:33 PM

Icinga downtime and Alertmanager silence (ID=78701a88-bd13-4896-9ad1-88076e82347e) set by elukey@cumin1002 for 1:00:00 on 1 host(s) and their services with reason: Increase tmpfs for nginx

registry1003.eqiad.wmnet

Icinga downtime and Alertmanager silence (ID=9cabb1e2-3230-40ba-8e89-bce14ddf9042) set by elukey@cumin1002 for 1:00:00 on 1 host(s) and their services with reason: Increase tmpfs for nginx

registry1004.eqiad.wmnet

Change #1013541 merged by Elukey:

[operations/puppet@production] role::docker_registry_ha::registry: increase tmpfs size in eqiad

https://gerrit.wikimedia.org/r/1013541

Mentioned in SAL (#wikimedia-operations) [2024-03-25T14:57:54Z] <elukey> increase tmpfs for /var/lib/nginx on registry100[3,4] and restart nginx - T360637

calbon moved this task from Unsorted to Ready To Go on the Machine-Learning-Team board. Mar 26 2024, 2:27 PM

Change #1014534 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] role::docker_registry_ha::registry: set nginx's tmpfs size in codfw

https://gerrit.wikimedia.org/r/1014534

High level plan for codfw:

Book an MW infrastructure maintenance window on the deployments wikitech page.
When the time comes, disable puppet on registry2* and downtime them.
Merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/1014534
Depool registry2003, increase its memory via Ganeti's commands and reboot the VM (via the cookbook). Once the node is up, run puppet and restart nginx. Check that everything is running as expected.
Repool registry2003 and wait 10-15 minutes before proceeding, paying attention to health check probes etc.
Depool registry2004 and do the same.
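
The ordering in the plan above can be sketched as a dry-run step generator. This Python snippet only prints the steps and runs nothing; the function name `rolling_plan` is hypothetical, and the booking step is left out because it happens ahead of time.

```python
def rolling_plan(hosts: list[str], change_url: str) -> list[str]:
    """Return the ordered steps for a one-host-at-a-time upgrade (dry run only)."""
    steps = [
        "disable puppet and set downtime on: " + ", ".join(hosts),
        "merge " + change_url,
    ]
    for host in hosts:
        # Each host is fully upgraded and repooled before the next is depooled.
        steps += [
            f"depool {host}",
            f"increase memory of {host} via Ganeti",
            f"reboot {host} via the sre.ganeti.reboot-vm cookbook",
            f"run puppet and restart nginx on {host}",
            f"verify {host}, then repool it and wait 10-15 minutes",
        ]
    return steps


plan = rolling_plan(
    ["registry2003", "registry2004"],
    "https://gerrit.wikimedia.org/r/c/operations/puppet/+/1014534",
)
for number, step in enumerate(plan, start=1):
    print(f"{number}. {step}")
```

Keeping one node pooled at all times is the point of the ordering: registry2004 is only depooled after registry2003 has been verified and repooled.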

Icinga downtime and Alertmanager silence (ID=110ad5f3-e41f-4f7d-a5d0-3343dc9fca15) set by elukey@cumin1002 for 1:00:00 on 1 host(s) and their services with reason: Increase tmpfs for nginx

registry2003.codfw.wmnet

Icinga downtime and Alertmanager silence (ID=1d34de6a-7fb2-4477-984a-7dcc642d43b2) set by elukey@cumin1002 for 1:00:00 on 1 host(s) and their services with reason: Increase tmpfs for nginx

registry2004.codfw.wmnet

Mentioned in SAL (#wikimedia-operations) [2024-03-27T11:15:39Z] <elukey> expand vram for registry200[3,4] from 4G to 6G - T360637

VM registry2003.codfw.wmnet rebooted by elukey@cumin1002 with reason: Increase VRAM

Change #1014534 merged by Elukey:

[operations/puppet@production] role::docker_registry_ha::registry: set nginx's tmpfs size in codfw

https://gerrit.wikimedia.org/r/1014534

VM registry2004.codfw.wmnet rebooted by elukey@cumin1002 with reason: Increase VRAM

Everything done!

elukey moved this task from Ready To Go to 2023-2024 Q4 Done on the Machine-Learning-Team board. Thu, Apr 4, 3:57 PM
