Ensure unique machine-id across Cloud VPS VMs
Closed, Resolved (Public)

Description

As a followup to T351507 (VMs in Cloud VPS share the same machine-id) and T400223 (Investigate daily disconnections of IRC bots hosted in Toolforge), we must make sure /etc/machine-id is actually unique across all Cloud VPS VMs.

The process is simple: rm /etc/machine-id, then either run systemd-machine-id-setup and restart the affected daemons, or reboot. The problematic case we have seen so far was systemd-networkd using the same DHCP client id on multiple VMs, which prevented leases from being renewed.
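
The per-host procedure can be sketched as a small shell function. This is a minimal sketch, not the exact commands run for this task: the run helper and the DRY_RUN variable are hypothetical names introduced here, and restarting only systemd-networkd is an assumption based on the one problematic case seen so far.

```shell
#!/bin/sh
# Sketch: regenerate /etc/machine-id on a single host.
# DRY_RUN and run() are illustrative helpers, not part of any existing tooling.

# Print the command when DRY_RUN=1 (the default), otherwise execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

regenerate_machine_id() {
    # Remove the duplicated id, then let systemd generate a fresh one.
    run rm -f /etc/machine-id
    run systemd-machine-id-setup
    # Restart systemd-networkd only where it is actually running, so it
    # picks up the new id for its DHCP client identifier.
    if systemctl is-active --quiet systemd-networkd 2>/dev/null; then
        run systemctl restart systemd-networkd
    fi
}
```

By default the function only prints the commands. Running it with DRY_RUN=0 as root would execute them, so the printed output is worth reviewing first.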

Action plan:

  • Audit the machine-id of all VMs and identify which ones need to be fixed.
  • Decide what to do with un-auditable VMs, i.e. those that are either shut down or up but inaccessible to cumin. See T402185: Audit and potentially fix VMs not reachable by cloudcumin root key
  • Proceed to fix the problem in batches, perhaps after a prior announcement to users. We have observed a brief bounce of network connections when systemd-networkd is restarted, and no ill side effects other than that.

Event Timeline

The initial audit for VMs with the same machine-id is at P81346. It was generated with:

root@cloudcumin1001:~# cumin 'O{*}' 'cat /etc/machine-id' --output json --force | tee filippo-T401880-all
root@cloudcumin1001:~# cat filippo-T401880-all |  sort -k2 | uniq -f1 -D
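
To see which hosts share each id, rather than only the repeated lines, the same idea can be expressed with awk. This is a sketch under a simplifying assumption: the input is one "hostname machine-id" pair per line, which is not the raw cumin output. The function name group_duplicate_ids is hypothetical.

```shell
#!/bin/sh
# Sketch: read "hostname machine-id" pairs on stdin and print every
# machine-id used by more than one host, followed by the number of hosts
# and a comma-separated list of those hosts.
group_duplicate_ids() {
    awk '
        NF >= 2 {
            count[$2]++
            hosts[$2] = (hosts[$2] == "" ? $1 : hosts[$2] "," $1)
        }
        END {
            for (id in count) {
                if (count[id] > 1) {
                    print id, count[id], hosts[id]
                }
            }
        }
    ' | sort
}
```

For example, feeding it the three lines "vm-a aaa", "vm-b bbb" and "vm-c aaa" (made-up values) prints a single line for the id aaa, listing vm-a and vm-c.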

On the Toolforge NFS side, I checked the dbus ids (/var/lib/dbus/machine-id) and they are all different. The nfs-client ids are empty, so the clients should be using the default ("Linux NFS " + hostname). If we need to change that, this explains it a bit: https://docs.kernel.org/filesystems/nfs/client-identifier.html#selecting-an-appropriate-client-identifier
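
A check for an NFS client id override could look roughly like the sketch below. The function name nfs_id_override is hypothetical, and the two paths in the usage note are my reading of the linked kernel documentation; whether they exist depends on the kernel version and on the nfs module being loaded.

```shell
#!/bin/sh
# Sketch: print the NFSv4 client id override stored in the given file,
# or "unset" if the file is missing, unreadable, or empty.
nfs_id_override() {
    # Command substitution strips the trailing newline.
    value="$(cat "$1" 2>/dev/null)"
    # Assumption: a literal "(null)" is also treated as unset, in case the
    # kernel reports an unset value that way.
    if [ -n "$value" ] && [ "$value" != "(null)" ]; then
        echo "$value"
    else
        echo "unset"
    fi
}
```

It would be called once per candidate location, for example nfs_id_override /sys/module/nfs/parameters/nfs4_unique_id and nfs_id_override /sys/fs/nfs/net/nfs_client/identifier.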

taavi triaged this task as High priority.

Mentioned in SAL (#wikimedia-cloud) [2025-08-18T07:18:34Z] <godog> ensure unique machine-id for hosts listed in task - T401880

Mentioned in SAL (#wikimedia-cloud) [2025-08-18T07:36:34Z] <godog> roll-restart systemd-networkd on hosts with new machine-id - T401880

Current status:

  • All 414 reachable hosts with duplicate /etc/machine-id have been fixed.
  • Of those hosts, 190 are running systemd-networkd, which has been restarted: P81424
  • The remaining 224 were not running systemd-networkd and thus have been left alone: P81425
  • I could not reach these hosts via cloudcumin for auditing: P81423

Also these hosts share /var/lib/dbus/machine-id:

(5) deployment-kafka-jumbo-[5,8-9].deployment-prep.eqiad1.wikimedia.cloud,deployment-kafka-main-[5-6].deployment-prep.eqiad1.wikimedia.cloud
----- OUTPUT of 'cat /var/lib/dbus/machine-id' -----
fb56aa8772094c6c8bf3009e24231e2d
(3) deployment-kafka-logging01.deployment-prep.eqiad1.wikimedia.cloud,pki-test01.pki.eqiad1.wikimedia.cloud,tools-sgebastion-10.tools.eqiad1.wikimedia.cloud
----- OUTPUT of 'cat /var/lib/dbus/machine-id' -----
db8553bf01c848ec8114c2edd72a50d0

I have fixed those with rm /var/lib/dbus/machine-id /etc/machine-id ; dbus-uuidgen --ensure ; systemd-machine-id-setup ; systemctl restart dbus. None are running systemd-networkd, so no action was needed there.
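
After a fix like this, a quick sanity check is to confirm that each file holds a well-formed id again. The sketch below is not something run for this task and the function name is_valid_machine_id is hypothetical. It relies on the documented machine-id format of 32 lowercase hexadecimal characters.

```shell
#!/bin/sh
# Sketch: succeed if the given file holds a well-formed machine id,
# i.e. exactly 32 lowercase hexadecimal characters.
is_valid_machine_id() {
    # Fail if the file cannot be read at all.
    id="$(cat "$1" 2>/dev/null)" || return 1
    # Reject any character outside 0-9 and a-f, including inner newlines.
    case "$id" in
        *[!0-9a-f]*) return 1 ;;
    esac
    # Require the full 32-character length.
    [ "${#id}" -eq 32 ]
}
```

On a fixed host, is_valid_machine_id /etc/machine-id && is_valid_machine_id /var/lib/dbus/machine-id should succeed. This only checks the format and says nothing about uniqueness across hosts.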

I'm optimistically calling this done. There is a minority of unreachable / unauditable hosts in P81423, some expected (trove, magnum VMs) and some unexpected. I will track those separately.