
wmcs-backups: failure on cloudvirt1024
Closed, Resolved · Public

Description

We got some wmcs-backups failures on cloudvirt1024 today; log:

Feb 04 06:44:21 cloudvirt1024 wmcs-backup[4882]: INFO:[2022-02-04 06:44:21,322] Cleaned up 1 expired backups for VM traffic.traffic-ncredir(0478d505-62dd-4dad-9fd3-5ba84b9ba52b)
Feb 04 06:44:21 cloudvirt1024 wmcs-backup[4882]: INFO:[2022-02-04 06:44:21,323] Creating backup for vm traffic.diffscan(54e36627-6f0c-46e7-a035-994a2526a0ab)
Feb 04 06:44:21 cloudvirt1024 wmcs-backup[4882]: INFO:[2022-02-04 06:44:21,508] Creating differential backup of pool:eqiad1-compute image_name:54e36627-6f0c-46e7-a035-994a2526a0ab_disk from rbd snapshot eqiad1-compute/54e36627-6f0c-46e7-a035-994a2526a0ab_disk@2022-02-03T06:53:00_cloudvirt1024 and backy2 version ee635e2e-84bd-11ec-95fb-b02628295df0
Feb 04 06:44:41 cloudvirt1024 python3.7[39042]: 2022-02-04T06:44:41.689+0000 7f367ebc8700 -1 asok(0x7f3678001400) AdminSocketConfigObs::init: failed: AdminSocket::bind_and_listen: failed to bind the UNIX domain socket to '/var/run/ceph/guests/ceph-client.eqiad1-compute.39042.139871918246072.asok': (13) Permission denied
Feb 04 06:44:43 cloudvirt1024 python3.7[39066]: 2022-02-04T06:44:43.373+0000 7fe5e3fff700 -1 asok(0x7fe5e4001400) AdminSocketConfigObs::init: failed: AdminSocket::bind_and_listen: failed to bind the UNIX domain socket to '/var/run/ceph/guests/ceph-client.eqiad1-compute.39066.140625349462200.asok': (13) Permission denied
Feb 04 06:44:47 cloudvirt1024 wmcs-backup[4882]: [37B blob data]
Feb 04 06:44:47 cloudvirt1024 wmcs-backup[4882]: INFO:[2022-02-04 06:44:47,943] Cleaning up expired backups for VM traffic.diffscan(54e36627-6f0c-46e7-a035-994a2526a0ab)
Feb 04 06:44:49 cloudvirt1024 wmcs-backup[4882]: INFO:[2022-02-04 06:44:49,497] Cleaned up 1 expired backups for VM traffic.diffscan(54e36627-6f0c-46e7-a035-994a2526a0ab)
Feb 04 06:44:49 cloudvirt1024 wmcs-backup[4882]: INFO:[2022-02-04 06:44:49,534] Creating backup for vm trove.maps-postgres(6f43781a-bdf9-4f72-8c65-1bee29f7ef6c)
Feb 04 06:44:49 cloudvirt1024 wmcs-backup[4882]: INFO:[2022-02-04 06:44:49,535] Forcing a full backup as there's no previous one with a snapshot.
Feb 04 06:44:49 cloudvirt1024 wmcs-backup[4882]: INFO:[2022-02-04 06:44:49,535] Creating full backup of pool:eqiad1-compute, image_name:6f43781a-bdf9-4f72-8c65-1bee29f7ef6c_disk, snapshot_name:2022-02-04T06:44:49_cloudvirt1024
Feb 04 06:44:49 cloudvirt1024 wmcs-backup[4882]: rbd: error opening image 6f43781a-bdf9-4f72-8c65-1bee29f7ef6c_disk: (2) No such file or directory
Feb 04 06:44:49 cloudvirt1024 wmcs-backup[4882]: WARNING:[2022-02-04 06:44:49,615] Got an error trying to backup maps-postgres, try n#0 of 3: Command '['/usr/bin/rbd', 'snap', 'create', 'eqiad1-compute/6f43781a-bdf9-4f72-8c65-1bee29f7ef6c_disk@2022-02-04T06:44:49_cloudvirt1024']' returned non-zero exit status 2.
Feb 04 06:44:49 cloudvirt1024 wmcs-backup[4882]: INFO:[2022-02-04 06:44:49,632] Creating backup for vm trove.maps-postgres(6f43781a-bdf9-4f72-8c65-1bee29f7ef6c)
Feb 04 06:44:49 cloudvirt1024 wmcs-backup[4882]: INFO:[2022-02-04 06:44:49,632] Forcing a full backup as there's no previous one with a snapshot.
Feb 04 06:44:49 cloudvirt1024 wmcs-backup[4882]: INFO:[2022-02-04 06:44:49,633] Creating full backup of pool:eqiad1-compute, image_name:6f43781a-bdf9-4f72-8c65-1bee29f7ef6c_disk, snapshot_name:2022-02-04T06:44:49_cloudvirt1024
Feb 04 06:44:49 cloudvirt1024 wmcs-backup[4882]: rbd: error opening image 6f43781a-bdf9-4f72-8c65-1bee29f7ef6c_disk: (2) No such file or directory
Feb 04 06:44:49 cloudvirt1024 wmcs-backup[4882]: WARNING:[2022-02-04 06:44:49,705] Got an error trying to backup maps-postgres, try n#1 of 3: Command '['/usr/bin/rbd', 'snap', 'create', 'eqiad1-compute/6f43781a-bdf9-4f72-8c65-1bee29f7ef6c_disk@2022-02-04T06:44:49_cloudvirt1024']' returned non-zero exit status 2.
Feb 04 06:44:49 cloudvirt1024 wmcs-backup[4882]: INFO:[2022-02-04 06:44:49,706] Creating backup for vm trove.maps-postgres(6f43781a-bdf9-4f72-8c65-1bee29f7ef6c)
Feb 04 06:44:49 cloudvirt1024 wmcs-backup[4882]: INFO:[2022-02-04 06:44:49,706] Forcing a full backup as there's no previous one with a snapshot.
Feb 04 06:44:49 cloudvirt1024 wmcs-backup[4882]: INFO:[2022-02-04 06:44:49,706] Creating full backup of pool:eqiad1-compute, image_name:6f43781a-bdf9-4f72-8c65-1bee29f7ef6c_disk, snapshot_name:2022-02-04T06:44:49_cloudvirt1024
Feb 04 06:44:49 cloudvirt1024 wmcs-backup[4882]: rbd: error opening image 6f43781a-bdf9-4f72-8c65-1bee29f7ef6c_disk: (2) No such file or directory
Feb 04 06:44:49 cloudvirt1024 wmcs-backup[4882]: WARNING:[2022-02-04 06:44:49,783] Got an error trying to backup maps-postgres, try n#2 of 3: Command '['/usr/bin/rbd', 'snap', 'create', 'eqiad1-compute/6f43781a-bdf9-4f72-8c65-1bee29f7ef6c_disk@2022-02-04T06:44:49_cloudvirt1024']' returned non-zero exit status 2.
Feb 04 06:44:49 cloudvirt1024 wmcs-backup[4882]: Traceback (most recent call last):
Feb 04 06:44:49 cloudvirt1024 wmcs-backup[4882]:   File "/usr/local/sbin/wmcs-backup", line 2028, in <module>
Feb 04 06:44:49 cloudvirt1024 wmcs-backup[4882]:     args.func()
Feb 04 06:44:49 cloudvirt1024 wmcs-backup[4882]:   File "/usr/local/sbin/wmcs-backup", line 1863, in <lambda>
Feb 04 06:44:49 cloudvirt1024 wmcs-backup[4882]:     ).backup_assigned_vms(from_cache=args.from_cache, noop=args.noop)
Feb 04 06:44:49 cloudvirt1024 wmcs-backup[4882]:   File "/usr/local/sbin/wmcs-backup", line 1318, in backup_assigned_vms
Feb 04 06:44:49 cloudvirt1024 wmcs-backup[4882]:     noop=noop,
Feb 04 06:44:49 cloudvirt1024 wmcs-backup[4882]:   File "/usr/local/sbin/wmcs-backup", line 1119, in create_vm_backup
Feb 04 06:44:49 cloudvirt1024 wmcs-backup[4882]:     vm_info=vm_info, noop=noop
Feb 04 06:44:49 cloudvirt1024 wmcs-backup[4882]:   File "/usr/local/sbin/wmcs-backup", line 1028, in create_vm_backup
Feb 04 06:44:49 cloudvirt1024 wmcs-backup[4882]:     new_backup = self.vms_backups[vm_id].create_next_backup(noop=noop)
Feb 04 06:44:49 cloudvirt1024 wmcs-backup[4882]:   File "/usr/local/sbin/wmcs-backup", line 707, in create_next_backup
Feb 04 06:44:49 cloudvirt1024 wmcs-backup[4882]:     noop=noop,
Feb 04 06:44:49 cloudvirt1024 wmcs-backup[4882]:   File "/usr/local/sbin/wmcs-backup", line 589, in create_full_backup
Feb 04 06:44:49 cloudvirt1024 wmcs-backup[4882]:     noop=noop,
Feb 04 06:44:49 cloudvirt1024 wmcs-backup[4882]:   File "/usr/lib/python3/dist-packages/rbd2backy2.py", line 190, in create_full_backup
Feb 04 06:44:49 cloudvirt1024 wmcs-backup[4882]:     noop=noop,
Feb 04 06:44:49 cloudvirt1024 wmcs-backup[4882]:   File "/usr/lib/python3/dist-packages/rbd2backy2.py", line 91, in create
Feb 04 06:44:49 cloudvirt1024 wmcs-backup[4882]:     run_command([RBD, "snap", "create", str(new_snapshot)], noop=noop)
Feb 04 06:44:49 cloudvirt1024 wmcs-backup[4882]:   File "/usr/lib/python3/dist-packages/rbd2backy2.py", line 31, in run_command
Feb 04 06:44:49 cloudvirt1024 wmcs-backup[4882]:     output = subprocess.check_output(args).decode("utf8")
Feb 04 06:44:49 cloudvirt1024 wmcs-backup[4882]:   File "/usr/lib/python3.7/subprocess.py", line 395, in check_output
Feb 04 06:44:49 cloudvirt1024 wmcs-backup[4882]:     **kwargs).stdout
Feb 04 06:44:49 cloudvirt1024 wmcs-backup[4882]:   File "/usr/lib/python3.7/subprocess.py", line 487, in run
Feb 04 06:44:49 cloudvirt1024 wmcs-backup[4882]:     output=stdout, stderr=stderr)
Feb 04 06:44:49 cloudvirt1024 wmcs-backup[4882]: subprocess.CalledProcessError: Command '['/usr/bin/rbd', 'snap', 'create', 'eqiad1-compute/6f43781a-bdf9-4f72-8c65-1bee29f7ef6c_disk@2022-02-04T06:44:49_cloudvirt1024']' returned non-zero exit status 2.
Feb 04 06:44:50 cloudvirt1024 systemd[1]: backup_vms.service: Main process exited, code=exited, status=1/FAILURE
Feb 04 06:44:50 cloudvirt1024 systemd[1]: backup_vms.service: Failed with result 'exit-code'.

I have the feeling this is just a transient error, but I'm creating a task anyway so we can track it.
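A quick way to double-check whether the image was really gone from the pool at that point (as opposed to, say, a permission issue like the asok warnings) would be something like this, with the pool and image name taken from the log above:

root@cloudvirt1024:~# rbd info eqiad1-compute/6f43781a-bdf9-4f72-8c65-1bee29f7ef6c_disk
root@cloudvirt1024:~# rbd ls eqiad1-compute | grep 6f43781a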

Event Timeline

Mentioned in SAL (#wikimedia-cloud) [2022-02-04T10:12:35Z] <arturo> restart backup_vms service in cloudvirt1024 (T300956)

Worth noting that the failed backup is for a Trove-managed virtual machine, in case that matters.

Moreover, that DB should be pretty big, in the TB range. Does Trove use a cinder volume to store DB data? Or is our script trying to back up a multi-TB VM disk?
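One way to answer that would be to look at the instance's attached volumes and flavor, e.g. (instance UUID from the log; this needs OpenStack admin credentials, and the exact field names vary with the client version):

openstack server show 6f43781a-bdf9-4f72-8c65-1bee29f7ef6c

and then check the volumes_attached field plus the root disk size of the flavor.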

Backups can be safely retried. When I encountered this, I believe it was due to filesystem state lag between when the list of things to back up is gathered and when they are actually processed. Try running the job again and see if it completes.
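In this case a retry is just re-running the systemd unit from the log and watching it, e.g.:

root@cloudvirt1024:~# systemctl restart backup_vms.service
root@cloudvirt1024:~# journalctl -fu backup_vms.service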

In theory that giant DB should be on a cinder volume and hence ignored by the backup system. I suspect that the backup issue is just a race with new VM creation, but I'll double-check that I didn't accidentally create a multi-TB VM.

@dcaro, can we account for this kind of failure as part of the alert or the job? Assuming we're OK with retrying once or twice for what we perceive to be racy conditions causing failures like this, it would be useful not to be alerted.

From the logs it seems that the VM was recreated, and then it worked (its ID changed):

root@cloudvirt1024:~# journalctl -u backup_vms.service | grep maps-postgres
...
Feb 04 06:44:49 cloudvirt1024 wmcs-backup[4882]: INFO:[2022-02-04 06:44:49,706] Creating backup for vm trove.maps-postgres(6f43781a-bdf9-4f72-8c65-1bee29f7ef6c)
Feb 04 06:44:49 cloudvirt1024 wmcs-backup[4882]: WARNING:[2022-02-04 06:44:49,783] Got an error trying to backup maps-postgres, try n#2 of 3: Command '['/usr/bin/rbd', 'snap', 'create', 'eqiad1-compute/6f43781a-bdf9-4f72-8c65-1bee29f7ef6c_disk@2022-02-04T06:44:49_cloudvirt1024']' returned non-zero exit status 2.
Feb 04 14:52:29 cloudvirt1024 wmcs-backup[54385]: INFO:[2022-02-04 14:52:29,660] Creating backup for vm trove.maps-postgres(685cf06d-de70-489b-9a9f-b88744bdbf61)
...

My guess is that the VM was deleted (and later rebuilt) between the time the backup run started and the time it got around to backing it up.

There's no easy way to distinguish that from an actual failure, and retrying in that case would not help (it already retries three times before giving up). Some possible improvements:

  • Try to refresh the list of VMs before every backup
    • Note that listing all VMs is kind of heavy, so this adds to the backup time.
  • Keep track of the failed ones and retry on the next run

Both actually end up being some form of "refresh the list of VMs and reconcile it with the list we have", the latter just spreading that across runs.
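A rough sketch of what that reconciliation could look like (vms_backups and create_next_backup are the names from the traceback above; list_assigned_vms and the failed-VM bookkeeping are made up here for illustration, this is not the actual implementation):

def backup_assigned_vms(backupper, noop=False):
    # Sketch only: refresh the VM list once per run and reconcile it with the
    # set of backups we were planning to create.
    current_vm_ids = {vm.id for vm in backupper.list_assigned_vms()}  # hypothetical refresh call
    failed_vm_ids = []
    for vm_id, vm_backup in backupper.vms_backups.items():
        if vm_id not in current_vm_ids:
            # The VM was deleted (or recreated under a new id) after the initial
            # listing; skip it instead of failing the whole run.
            continue
        try:
            vm_backup.create_next_backup(noop=noop)
        except Exception:
            # Remember the failure so the next run can retry it after refreshing
            # the list again.
            failed_vm_ids.append(vm_id)
    return failed_vm_ids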

This is already tracked here: T276892

Feel free to add further comments or proposals there; I'll close this task.

dcaro claimed this task.