
decom old people VMs / finish people host upgrade
Closed, Resolved (Public)

Description

Just realized we (still) have 4 people* VMs and we have not finished the upgrade yet.

people[2003-2004].codfw.wmnet,people[1004-1005].eqiad.wmnet

At least the old hosts are not decom'ed.

Double-check that these are all that is left and do the proper decoms.
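To make the double check concrete, the range expression above can be expanded into individual hostnames. This is a minimal sketch only: `expand_hosts` is a hypothetical helper written for this example, not part of any cookbook, and it only handles a single numeric `[start-end]` range per comma-separated part.

```python
import re


def expand_hosts(expr: str) -> list[str]:
    """Expand e.g. 'people[2003-2004].codfw.wmnet' into individual hostnames.

    Hypothetical helper: supports one numeric [start-end] range per
    comma-separated part, nothing more elaborate.
    """
    hosts = []
    for part in expr.split(","):
        part = part.strip()
        match = re.fullmatch(r"(.*)\[(\d+)-(\d+)\](.*)", part)
        if match is None:
            # No range in this part, keep the hostname as-is.
            hosts.append(part)
            continue
        prefix, start, end, suffix = match.groups()
        width = len(start)  # preserve zero padding, if any
        for number in range(int(start), int(end) + 1):
            hosts.append(f"{prefix}{number:0{width}d}{suffix}")
    return hosts


for host in expand_hosts(
    "people[2003-2004].codfw.wmnet,people[1004-1005].eqiad.wmnet"
):
    print(host)
```

The resulting list can then be compared by hand against whatever inventory is considered authoritative.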

Event Timeline

Mentioned in SAL (#wikimedia-operations) [2025-10-29T16:18:36Z] <mutante> shutting down people1004.eqiad.wmnet, people2003.codfw.wmnet - T408713 T402596

hosts shut down after sending a warning message via wall - but VMs not destroyed just yet

Dzahn changed the task status from Open to In Progress. Oct 29 2025, 4:20 PM

Someone or something booted the old VMs again that I had shut down about 8 days ago. Unclear why.

Mentioned in SAL (#wikimedia-operations) [2025-11-06T17:38:37Z] <mutante> shutting down people1004 and people2003 - had already shut them down on Oct 29 but someone or something booted them again T408713

The decom cookbook warned me that IPs of these machines appear in:

deployment-charts/helmfile.d/services/machinetranslation/values.yaml

There was another long-term ticket (where was it?) about this and people hosts being used to dump data for a production service.

We really need to stop that. Especially with hardcoded IPs in other repos.

Aborted the decom cookbook and started to ask in #wikimedia-ml
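For checking other repos for the same problem, a scan for hardcoded IPs could look roughly like the sketch below. It is an illustration under stated assumptions, not how the decom cookbook performs its check: `find_ip_references` is a hypothetical helper, it only looks at IPv4 literals, and the address in the usage line is a documentation placeholder rather than a real host IP.

```python
import ipaddress
import re
from pathlib import Path

# Anything that looks like a dotted quad; validated with ipaddress afterwards.
IPV4_CANDIDATE = re.compile(r"(?<![\d.])(?:\d{1,3}\.){3}\d{1,3}(?![\d.])")


def find_ip_references(root, addresses, pattern="*.yaml"):
    """Return (path, line number, ip) for each wanted IP found under root.

    Hypothetical helper: IPv4 only, scans files matching `pattern`.
    """
    wanted = {ipaddress.ip_address(address) for address in addresses}
    hits = []
    for path in sorted(Path(root).rglob(pattern)):
        if not path.is_file():
            continue
        text = path.read_text(encoding="utf-8", errors="replace")
        for lineno, line in enumerate(text.splitlines(), start=1):
            for candidate in IPV4_CANDIDATE.findall(line):
                try:
                    ip = ipaddress.ip_address(candidate)
                except ValueError:
                    continue  # looked like an IP but is not a valid one
                if ip in wanted:
                    hits.append((str(path), lineno, candidate))
    return hits


# Usage: "." and 192.0.2.10 are placeholders for a repo checkout and a host IP.
for path, lineno, ip in find_ip_references(".", ["192.0.2.10"]):
    print(f"{path}:{lineno}: {ip}")
```

Matching parsed addresses instead of raw substrings avoids false hits such as 192.0.2.100 when searching for 192.0.2.10.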

Dzahn changed the task status from In Progress to Stalled. Nov 6 2025, 6:00 PM

cookbooks.sre.hosts.decommission executed by dzahn@cumin2002 for hosts: people1004.eqiad.wmnet

  • people1004.eqiad.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster eqiad to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster eqiad to Netbox

Dzahn changed the task status from Stalled to In Progress. Nov 10 2025, 6:39 PM

Change #1203502 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] site: remove decom'ed people bookworm machines

https://gerrit.wikimedia.org/r/1203502

cookbooks.sre.hosts.decommission executed by dzahn@cumin2002 for hosts: people2003.codfw.wmnet

  • people2003.codfw.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster codfw to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster codfw to Netbox

Change #1203502 merged by Dzahn:

[operations/puppet@production] site: remove decom'ed people bookworm machines

https://gerrit.wikimedia.org/r/1203502