Bump memory for registry[12]00[34] VMs
Closed, Resolved (Public)

Description

In T359067 ServiceOps and Machine Learning agreed on a short/medium fix to allow bigger image layers to be pushed to the Docker registry's nodes.

The idea is to:

  • Bump VM memory from 4GB to 6GB on all registry* nodes
  • Increase the size of nginx's tmpfs mountpoint on the nodes to 4GB. The setting is profile::nginx::tmpfs_size in hieradata/role/common/docker_registry_ha/registry.yaml

The docker-registry's discovery record shows eqiad depooled and codfw pooled, so I could probably start with eqiad and then move to codfw.

After reading https://wikitech.wikimedia.org/wiki/Docker-registry/Runbook I am boldly making these assumptions:

  1. Shutting down the VMs in the depooled DC shouldn't require any extra steps, since the only important thing seems to be Swift replication and the VMs run stateless daemons.
  2. Upgrading codfw may be done one VM at a time without a failover, but to be safe we can fail over anyway to guarantee capacity (say, if Ganeti fails to bring up the new VM).

Does the above make sense? Any more complete/sound procedures to follow?

Event Timeline

Sounds good to me. I'd say you can just depool one of the active registry nodes and restart that VM for the RAM increase. No need for extra steps

You need to reboot at the KVM level, though, so use the sre.ganeti.reboot-vm cookbook; an OS reboot isn't enough.

Mentioned in SAL (#wikimedia-operations) [2024-03-21T16:25:58Z] <elukey> expand vram for registry100[3,4] from 4G to 6G - T360637

VM registry1003.eqiad.wmnet rebooted by elukey@cumin1002 with reason: Increase VRAM

Mentioned in SAL (#wikimedia-operations) [2024-03-21T16:35:52Z] <elukey> edit /etc/network/interfaces on registry1003 (ens5 => ens13) - T360637

VM registry1004.eqiad.wmnet rebooted by elukey@cumin1002 with reason: Increase VRAM

Mentioned in SAL (#wikimedia-operations) [2024-03-21T16:44:28Z] <elukey> edit /etc/network/interfaces on registry1004 (ens5 => ens13) - T360637

elukey@ganeti1027:~$ sudo gnt-instance list | grep registry
registry1003.eqiad.wmnet            kvm        debootstrap+default ganeti1026.eqiad.wmnet running   6.0G
registry1004.eqiad.wmnet            kvm        debootstrap+default ganeti1011.eqiad.wmnet running   6.0G

Eqiad done, will take care of codfw as next step :)
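The listing above can also be checked mechanically rather than by eye. This is only a minimal sketch: check_registry_mem is a hypothetical helper name, and it assumes the memory value is the last column of the gnt-instance list output, as in the paste above.

```shell
# Hypothetical helper: reads `gnt-instance list` output on stdin and fails
# if any registry instance does not report the expected memory value.
# Assumption: memory is the last column, as in the listing above.
check_registry_mem() {
    expected="$1"
    awk -v want="$expected" '
        $1 ~ /^registry/ {
            seen++
            if ($NF != want) { print "mismatch: " $1 " has " $NF; bad++ }
        }
        # Fail on any mismatch, and also when no registry instance was listed.
        END { if (bad > 0 || seen == 0) exit 1 }
    '
}

# Usage (on a Ganeti master node):
#   sudo gnt-instance list | check_registry_mem 6.0G && echo "registry VMs OK"
```

Failing when no registry instance is listed guards against a false "OK" if the command is run against the wrong cluster.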

Change #1013541 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] role::docker_registry_ha::registry: increase tmpfs size in eqiad

https://gerrit.wikimedia.org/r/1013541

Icinga downtime and Alertmanager silence (ID=78701a88-bd13-4896-9ad1-88076e82347e) set by elukey@cumin1002 for 1:00:00 on 1 host(s) and their services with reason: Increase tmpfs for nginx

registry1003.eqiad.wmnet

Icinga downtime and Alertmanager silence (ID=9cabb1e2-3230-40ba-8e89-bce14ddf9042) set by elukey@cumin1002 for 1:00:00 on 1 host(s) and their services with reason: Increase tmpfs for nginx

registry1004.eqiad.wmnet

Change #1013541 merged by Elukey:

[operations/puppet@production] role::docker_registry_ha::registry: increase tmpfs size in eqiad

https://gerrit.wikimedia.org/r/1013541

Mentioned in SAL (#wikimedia-operations) [2024-03-25T14:57:54Z] <elukey> increase tmpfs for /var/lib/nginx on registry100[3,4] and restart nginx - T360637
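To confirm that the new tmpfs size is live after the nginx restart, one option is to read the size= mount option for /var/lib/nginx. This is a sketch under assumptions: tmpfs_size_opt is a hypothetical helper, and it assumes the mount appears in /proc/mounts as a tmpfs with an explicit size= option.

```shell
# Hypothetical helper: print the size= mount option of a tmpfs mountpoint.
# Reads /proc/mounts by default; another file can be passed as the second
# argument (useful for testing the parsing off-host).
tmpfs_size_opt() {
    mountpoint="$1"
    mounts_file="${2:-/proc/mounts}"
    awk -v mp="$mountpoint" '
        $2 == mp && $3 == "tmpfs" {
            n = split($4, opts, ",")
            for (i = 1; i <= n; i++)
                if (opts[i] ~ /^size=/) {
                    sub(/^size=/, "", opts[i])
                    print opts[i]
                }
        }
    ' "$mounts_file"
}

# Usage on a registry node:
#   tmpfs_size_opt /var/lib/nginx
```

The kernel normally reports tmpfs sizes in KiB, so a mount sized at 4 GiB would be expected to show up as 4194304k. An empty result means the path is not a tmpfs mount or has no explicit size.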

Change #1014534 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] role::docker_registry_ha::registry: set nginx's tmpfs size in codfw

https://gerrit.wikimedia.org/r/1014534

High level plan for codfw:

  • Book a mw infrastructure maintenance window on the Deployments page on Wikitech.
  • When the time comes, disable puppet on registry2* and downtime them.
  • Merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/1014534
  • Depool registry2003, increase the VM memory via Ganeti's commands and reboot the VM (via the cookbook). Once the node is up, run puppet and restart nginx. Check that everything is running as expected.
  • Repool registry2003 and wait 10-15 minutes before proceeding, paying attention to health check probes etc.
  • Depool registry2004 and repeat the same steps.
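The per-host part of the plan above can be sketched as a dry run that only prints the steps. Everything here is an assumption to verify against the runbooks before use: print_resize_steps is a hypothetical helper, and the exact confctl, gnt-instance, run-puppet-agent and cookbook invocations (and the hosts they are run from) are my guesses, not taken from this task.

```shell
# Dry-run sketch: prints the per-host steps instead of executing them.
# Assumptions (not from the task): the exact command syntax and the host
# each command runs on. Check each one before running it for real.
print_resize_steps() {
    host="$1"   # e.g. registry2003.codfw.wmnet
    mem="$2"    # e.g. 6G
    cat <<EOF
[cumin host]    sudo confctl select name=${host} set/pooled=no
[ganeti master] sudo gnt-instance modify -B memory=${mem} ${host}
[cumin host]    sudo cookbook sre.ganeti.reboot-vm ${host}
[${host}] sudo run-puppet-agent
[${host}] sudo systemctl restart nginx
[cumin host]    sudo confctl select name=${host} set/pooled=yes
EOF
}

# One host at a time, with a pause in between as described in the plan.
for h in registry2003.codfw.wmnet registry2004.codfw.wmnet; do
    print_resize_steps "$h" 6G
    echo "# wait 10-15 minutes and watch the health checks before continuing"
done
```

Printing the steps rather than running them keeps the sketch safe to paste, and makes the depool/repool bracketing around each reboot easy to review.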

Icinga downtime and Alertmanager silence (ID=110ad5f3-e41f-4f7d-a5d0-3343dc9fca15) set by elukey@cumin1002 for 1:00:00 on 1 host(s) and their services with reason: Increase tmpfs for nginx

registry2003.codfw.wmnet

Icinga downtime and Alertmanager silence (ID=1d34de6a-7fb2-4477-984a-7dcc642d43b2) set by elukey@cumin1002 for 1:00:00 on 1 host(s) and their services with reason: Increase tmpfs for nginx

registry2004.codfw.wmnet

Mentioned in SAL (#wikimedia-operations) [2024-03-27T11:15:39Z] <elukey> expand vram for registry200[3,4] from 4G to 6G - T360637

VM registry2003.codfw.wmnet rebooted by elukey@cumin1002 with reason: Increase VRAM

Change #1014534 merged by Elukey:

[operations/puppet@production] role::docker_registry_ha::registry: set nginx's tmpfs size in codfw

https://gerrit.wikimedia.org/r/1014534

VM registry2004.codfw.wmnet rebooted by elukey@cumin1002 with reason: Increase VRAM

Everything done!