Page MenuHomePhabricator

wmcs-backups: handle VMs that are deleted between backups start and that VM backup
Open, MediumPublic

Description

As we are backing up a bunch of VMs, there's some chance that one of them might be removed between the time we list all
the VMs and the time we do a backup of that one, ex:

Mar 09 02:54:33 cloudvirt1024 wmcs-backup[55819]: INFO:[2021-03-09 02:54:33,369] Creating backup for vm k8splay.gitlab(f807b6cc-621d-4a17-8b00-58e93b2329d7)
Mar 09 02:54:33 cloudvirt1024 wmcs-backup[55819]: INFO:[2021-03-09 02:54:33,369] Forcing a full backup as there's no previous one with a snapshot.
Mar 09 02:54:33 cloudvirt1024 wmcs-backup[55819]: INFO:[2021-03-09 02:54:33,369] Creating full backup of pool:eqiad1-compute, image_name:f807b6cc-621d-4a17-8b00-58e93b2329d7_disk, snapshot_name:2021-03-09T02:54:33_cloudvirt1024
Mar 09 02:54:33 cloudvirt1024 wmcs-backup[55819]: rbd: error opening image f807b6cc-621d-4a17-8b00-58e93b2329d7_disk: (2) No such file or directory
Mar 09 02:54:33 cloudvirt1024 wmcs-backup[55819]: Traceback (most recent call last):
Mar 09 02:54:33 cloudvirt1024 wmcs-backup[55819]:   File "/usr/local/sbin/wmcs-backup", line 2012, in <module>
Mar 09 02:54:33 cloudvirt1024 wmcs-backup[55819]:     args.func()
Mar 09 02:54:33 cloudvirt1024 wmcs-backup[55819]:   File "/usr/local/sbin/wmcs-backup", line 1847, in <lambda>
Mar 09 02:54:33 cloudvirt1024 wmcs-backup[55819]:     ).backup_assigned_vms(from_cache=args.from_cache, noop=args.noop)
Mar 09 02:54:33 cloudvirt1024 wmcs-backup[55819]:   File "/usr/local/sbin/wmcs-backup", line 1312, in backup_assigned_vms
Mar 09 02:54:33 cloudvirt1024 wmcs-backup[55819]:     noop=noop,
Mar 09 02:54:33 cloudvirt1024 wmcs-backup[55819]:   File "/usr/local/sbin/wmcs-backup", line 1121, in create_vm_backup
Mar 09 02:54:33 cloudvirt1024 wmcs-backup[55819]:     vm_info=vm_info, noop=noop
Mar 09 02:54:33 cloudvirt1024 wmcs-backup[55819]:   File "/usr/local/sbin/wmcs-backup", line 1030, in create_vm_backup
Mar 09 02:54:33 cloudvirt1024 wmcs-backup[55819]:     new_backup = self.vms_backups[vm_id].create_next_backup(noop=noop)
Mar 09 02:54:33 cloudvirt1024 wmcs-backup[55819]:   File "/usr/local/sbin/wmcs-backup", line 709, in create_next_backup
Mar 09 02:54:33 cloudvirt1024 wmcs-backup[55819]:     noop=noop,
Mar 09 02:54:33 cloudvirt1024 wmcs-backup[55819]:   File "/usr/local/sbin/wmcs-backup", line 591, in create_full_backup
Mar 09 02:54:33 cloudvirt1024 wmcs-backup[55819]:     noop=noop,
Mar 09 02:54:33 cloudvirt1024 wmcs-backup[55819]:   File "/usr/lib/python3/dist-packages/rbd2backy2.py", line 190, in create_full_backup
Mar 09 02:54:33 cloudvirt1024 wmcs-backup[55819]:     noop=noop,
Mar 09 02:54:33 cloudvirt1024 wmcs-backup[55819]:   File "/usr/lib/python3/dist-packages/rbd2backy2.py", line 91, in create
Mar 09 02:54:33 cloudvirt1024 wmcs-backup[55819]:     run_command([RBD, "snap", "create", str(new_snapshot)], noop=noop)
Mar 09 02:54:33 cloudvirt1024 wmcs-backup[55819]:   File "/usr/lib/python3/dist-packages/rbd2backy2.py", line 31, in run_command
Mar 09 02:54:33 cloudvirt1024 wmcs-backup[55819]:     output = subprocess.check_output(args).decode("utf8")
Mar 09 02:54:33 cloudvirt1024 wmcs-backup[55819]:   File "/usr/lib/python3.7/subprocess.py", line 395, in check_output
Mar 09 02:54:33 cloudvirt1024 wmcs-backup[55819]:     **kwargs).stdout
Mar 09 02:54:33 cloudvirt1024 wmcs-backup[55819]:   File "/usr/lib/python3.7/subprocess.py", line 487, in run
Mar 09 02:54:33 cloudvirt1024 wmcs-backup[55819]:     output=stdout, stderr=stderr)
Mar 09 02:54:33 cloudvirt1024 wmcs-backup[55819]: subprocess.CalledProcessError: Command '['/usr/bin/rbd', 'snap', 'create', 'eqiad1-compute/f807b6cc-621d-4a17-8b00-58e93b2329d7_disk@2021-03-09T02:54:33_cloudvirt1024']' returned non-zero exit status 2.

That makes the service crash and stop, while it would be better to just ignore and continue (in the case the VM is
deleted).

This task is to come up with a good solution and implement it.

Event Timeline

dcaro triaged this task as Medium priority.Mar 9 2021, 9:10 AM
dcaro created this task.

Mentioned in SAL (#wikimedia-cloud) [2021-03-09T09:12:03Z] <dcaro> restarted the wmcs-backup service on cloudvirt1024 to retry the backups (failed because a VM was removed in-between, T276892)

It turns out that the VM is not deleted, but it's in "BUILD" status:

dcaro@cloudvirt1024:~$ jq '.[] | select(.tenant_id == "k8splay" and .name == "gitlab")' <backups.instances.cache
{
  "id": "f807b6cc-621d-4a17-8b00-58e93b2329d7",
  "name": "gitlab",
  "status": "BUILD",
  "tenant_id": "k8splay",
  ...
  "image": {
    "id": "e971dc0f-3b5c-4cd2-ab8b-02faf403c136",
    "links": [
      {
        "rel": "bookmark",
        "href": "http://openstack.eqiad1.wikimediacloud.org:8774/images/e971dc0f-3b5c-4cd2-ab8b-02faf403c136"
      }
    ]
  },
  ...
}

I'll open another task for skipping the 'BUILD' status with more priority (as that's blocking the backups).

This seemed to be another instance of that happenning:

Mar 23 20:53:14 cloudvirt1021 wmcs-backup[46911]: INFO:[2021-03-23 20:53:14,145] Creating backup for vm wikidata-query.wcqs-beta-01(87acbb5a-ddac-457a-9fab-19b4f8af7916)
Mar 23 20:53:14 cloudvirt1021 wmcs-backup[46911]: INFO:[2021-03-23 20:53:14,145] Forcing a full backup as there's no previous one with a snapshot.
Mar 23 20:53:14 cloudvirt1021 wmcs-backup[46911]: INFO:[2021-03-23 20:53:14,145] Creating full backup of pool:eqiad1-compute, image_name:87acbb5a-ddac-457a-9fab-19b4f8af7916_disk, snapshot_name:2021-03-23T20:53:14_cloudvirt1021
Mar 23 20:53:14 cloudvirt1021 wmcs-backup[46911]: rbd: error opening image 87acbb5a-ddac-457a-9fab-19b4f8af7916_disk: (2) No such file or directory
Mar 23 20:53:14 cloudvirt1021 wmcs-backup[46911]: WARNING:[2021-03-23 20:53:14,216] Got an error trying to backup wcqs-beta-01, try n#0 of 3: Command '['/usr/bin/rbd', 'snap', 'create', 'eqiad1-compute/87acbb5a-ddac-457a-9fab-19b4f8af7916_disk@2021-03-23T20:53:14_cloudvirt1021']' returned non-zero exit status 2.
Mar 23 20:53:14 cloudvirt1021 wmcs-backup[46911]: INFO:[2021-03-23 20:53:14,229] Creating backup for vm wikidata-query.wcqs-beta-01(87acbb5a-ddac-457a-9fab-19b4f8af7916)
Mar 23 20:53:14 cloudvirt1021 wmcs-backup[46911]: INFO:[2021-03-23 20:53:14,229] Forcing a full backup as there's no previous one with a snapshot.
Mar 23 20:53:14 cloudvirt1021 wmcs-backup[46911]: INFO:[2021-03-23 20:53:14,229] Creating full backup of pool:eqiad1-compute, image_name:87acbb5a-ddac-457a-9fab-19b4f8af7916_disk, snapshot_name:2021-03-23T20:53:14_cloudvirt1021
Mar 23 20:53:14 cloudvirt1021 wmcs-backup[46911]: rbd: error opening image 87acbb5a-ddac-457a-9fab-19b4f8af7916_disk: (2) No such file or directory
Mar 23 20:53:14 cloudvirt1021 wmcs-backup[46911]: WARNING:[2021-03-23 20:53:14,297] Got an error trying to backup wcqs-beta-01, try n#1 of 3: Command '['/usr/bin/rbd', 'snap', 'create', 'eqiad1-compute/87acbb5a-ddac-457a-9fab-19b4f8af7916_disk@2021-03-23T20:53:14_cloudvirt1021']' returned non-zero exit status 2.
Mar 23 20:53:14 cloudvirt1021 wmcs-backup[46911]: INFO:[2021-03-23 20:53:14,297] Creating backup for vm wikidata-query.wcqs-beta-01(87acbb5a-ddac-457a-9fab-19b4f8af7916)
Mar 23 20:53:14 cloudvirt1021 wmcs-backup[46911]: INFO:[2021-03-23 20:53:14,298] Forcing a full backup as there's no previous one with a snapshot.
Mar 23 20:53:14 cloudvirt1021 wmcs-backup[46911]: INFO:[2021-03-23 20:53:14,298] Creating full backup of pool:eqiad1-compute, image_name:87acbb5a-ddac-457a-9fab-19b4f8af7916_disk, snapshot_name:2021-03-23T20:53:14_cloudvirt1021
Mar 23 20:53:14 cloudvirt1021 wmcs-backup[46911]: rbd: error opening image 87acbb5a-ddac-457a-9fab-19b4f8af7916_disk: (2) No such file or directory
Mar 23 20:53:14 cloudvirt1021 wmcs-backup[46911]: WARNING:[2021-03-23 20:53:14,362] Got an error trying to backup wcqs-beta-01, try n#2 of 3: Command '['/usr/bin/rbd', 'snap', 'create', 'eqiad1-compute/87acbb5a-ddac-457a-9fab-19b4f8af7916_disk@2021-03-23T20:53:14_cloudvirt1021']' returned non-zero exit status 2.
Mar 23 20:53:14 cloudvirt1021 wmcs-backup[46911]: Traceback (most recent call last):
Mar 23 20:53:14 cloudvirt1021 wmcs-backup[46911]:   File "/usr/local/sbin/wmcs-backup", line 2028, in <module>
Mar 23 20:53:14 cloudvirt1021 wmcs-backup[46911]:     args.func()
Mar 23 20:53:14 cloudvirt1021 wmcs-backup[46911]:   File "/usr/local/sbin/wmcs-backup", line 1863, in <lambda>
Mar 23 20:53:14 cloudvirt1021 wmcs-backup[46911]:     ).backup_assigned_vms(from_cache=args.from_cache, noop=args.noop)
Mar 23 20:53:14 cloudvirt1021 wmcs-backup[46911]:   File "/usr/local/sbin/wmcs-backup", line 1318, in backup_assigned_vms
Mar 23 20:53:14 cloudvirt1021 wmcs-backup[46911]:     noop=noop,
Mar 23 20:53:14 cloudvirt1021 wmcs-backup[46911]:   File "/usr/local/sbin/wmcs-backup", line 1119, in create_vm_backup
Mar 23 20:53:14 cloudvirt1021 wmcs-backup[46911]:     vm_info=vm_info, noop=noop
Mar 23 20:53:14 cloudvirt1021 wmcs-backup[46911]:   File "/usr/local/sbin/wmcs-backup", line 1028, in create_vm_backup
Mar 23 20:53:14 cloudvirt1021 wmcs-backup[46911]:     new_backup = self.vms_backups[vm_id].create_next_backup(noop=noop)
Mar 23 20:53:14 cloudvirt1021 wmcs-backup[46911]:   File "/usr/local/sbin/wmcs-backup", line 707, in create_next_backup
Mar 23 20:53:14 cloudvirt1021 wmcs-backup[46911]:     noop=noop,
Mar 23 20:53:14 cloudvirt1021 wmcs-backup[46911]:   File "/usr/local/sbin/wmcs-backup", line 589, in create_full_backup
Mar 23 20:53:14 cloudvirt1021 wmcs-backup[46911]:     noop=noop,
Mar 23 20:53:14 cloudvirt1021 wmcs-backup[46911]:   File "/usr/lib/python3/dist-packages/rbd2backy2.py", line 190, in create_full_backup
Mar 23 20:53:14 cloudvirt1021 wmcs-backup[46911]:     noop=noop,
Mar 23 20:53:14 cloudvirt1021 wmcs-backup[46911]:   File "/usr/lib/python3/dist-packages/rbd2backy2.py", line 91, in create
Mar 23 20:53:14 cloudvirt1021 wmcs-backup[46911]:     run_command([RBD, "snap", "create", str(new_snapshot)], noop=noop)
Mar 23 20:53:14 cloudvirt1021 wmcs-backup[46911]:   File "/usr/lib/python3/dist-packages/rbd2backy2.py", line 31, in run_command
Mar 23 20:53:14 cloudvirt1021 wmcs-backup[46911]:     output = subprocess.check_output(args).decode("utf8")
Mar 23 20:53:14 cloudvirt1021 wmcs-backup[46911]:   File "/usr/lib/python3.7/subprocess.py", line 395, in check_output
Mar 23 20:53:14 cloudvirt1021 wmcs-backup[46911]:     **kwargs).stdout
Mar 23 20:53:14 cloudvirt1021 wmcs-backup[46911]:   File "/usr/lib/python3.7/subprocess.py", line 487, in run
Mar 23 20:53:14 cloudvirt1021 wmcs-backup[46911]:     output=stdout, stderr=stderr)
Mar 23 20:53:14 cloudvirt1021 wmcs-backup[46911]: subprocess.CalledProcessError: Command '['/usr/bin/rbd', 'snap', 'create', 'eqiad1-compute/87acbb5a-ddac-457a-9fab-19b4f8af7916_disk@2021-03-23T20:53:14_cloudvirt1021']' returned non-zero exit status 2.

Mentioned in SAL (#wikimedia-cloud) [2021-03-24T09:19:29Z] <dcaro> restarted wmcs-backup on cloudvirt1024 as it failed due to an image being removed while running (T276892)

dcaro removed dcaro as the assignee of this task.Aug 10 2021, 5:05 PM
dcaro raised the priority of this task from Medium to Needs Triage.
nskaggs triaged this task as Medium priority.Aug 10 2021, 10:08 PM