Page MenuHomePhabricator

cloudvirt1024: CRITICAL: Status of the systemd unit backup_vms
Closed, ResolvedPublic

Description

From an alert:

Icinga/Check unit status of backup_vms
summary: CRITICAL: Status of the systemd unit backup_vms
1 hour ago
instance: cloudvirt1024
team: wmcs
@cluster: wikimedia.org
runbook

Looking at the logs it seems it was unable to find a specific image:

Oct 27 06:54:39 cloudvirt1024 wmcs-backup[51353]: INFO:[2021-10-27 06:54:39,714] Creating backup for vm wikisp.ceres-01(5b62f768-ba2a-4c0d-8ee2-0bffba99222f)
Oct 27 06:54:39 cloudvirt1024 wmcs-backup[51353]: INFO:[2021-10-27 06:54:39,714] Forcing a full backup as there's no previous one with a snapshot.
Oct 27 06:54:39 cloudvirt1024 wmcs-backup[51353]: INFO:[2021-10-27 06:54:39,714] Creating full backup of pool:eqiad1-compute, image_name:5b62f768-ba2a-4c0d-8ee2-0bffba99222f_disk, snapshot_name:2021-10-27T06:54:39_cloudvirt1024
Oct 27 06:54:39 cloudvirt1024 wmcs-backup[51353]: rbd: error opening image 5b62f768-ba2a-4c0d-8ee2-0bffba99222f_disk: (2) No such file or directory
Oct 27 06:54:39 cloudvirt1024 wmcs-backup[51353]: WARNING:[2021-10-27 06:54:39,810] Got an error trying to backup ceres-01, try n#0 of 3: Command '['/usr/bin/rbd', 'snap', 'create', 'eqiad1-compute/5b62f768-ba2a-4c0d-8ee2-0bffba99222f_disk@2021-10-27T06:54:39_cloudvirt1024']' returned non-zero exit status 2.
Oct 27 06:54:39 cloudvirt1024 wmcs-backup[51353]: INFO:[2021-10-27 06:54:39,827] Creating backup for vm wikisp.ceres-01(5b62f768-ba2a-4c0d-8ee2-0bffba99222f)
Oct 27 06:54:39 cloudvirt1024 wmcs-backup[51353]: INFO:[2021-10-27 06:54:39,827] Forcing a full backup as there's no previous one with a snapshot.
Oct 27 06:54:39 cloudvirt1024 wmcs-backup[51353]: INFO:[2021-10-27 06:54:39,828] Creating full backup of pool:eqiad1-compute, image_name:5b62f768-ba2a-4c0d-8ee2-0bffba99222f_disk, snapshot_name:2021-10-27T06:54:39_cloudvirt1024
Oct 27 06:54:39 cloudvirt1024 wmcs-backup[51353]: rbd: error opening image 5b62f768-ba2a-4c0d-8ee2-0bffba99222f_disk: (2) No such file or directory
Oct 27 06:54:39 cloudvirt1024 wmcs-backup[51353]: WARNING:[2021-10-27 06:54:39,911] Got an error trying to backup ceres-01, try n#1 of 3: Command '['/usr/bin/rbd', 'snap', 'create', 'eqiad1-compute/5b62f768-ba2a-4c0d-8ee2-0bffba99222f_disk@2021-10-27T06:54:39_cloudvirt1024']' returned non-zero exit status 2.
Oct 27 06:54:39 cloudvirt1024 wmcs-backup[51353]: INFO:[2021-10-27 06:54:39,912] Creating backup for vm wikisp.ceres-01(5b62f768-ba2a-4c0d-8ee2-0bffba99222f)
Oct 27 06:54:39 cloudvirt1024 wmcs-backup[51353]: INFO:[2021-10-27 06:54:39,912] Forcing a full backup as there's no previous one with a snapshot.
Oct 27 06:54:39 cloudvirt1024 wmcs-backup[51353]: INFO:[2021-10-27 06:54:39,912] Creating full backup of pool:eqiad1-compute, image_name:5b62f768-ba2a-4c0d-8ee2-0bffba99222f_disk, snapshot_name:2021-10-27T06:54:39_cloudvirt1024
Oct 27 06:54:39 cloudvirt1024 wmcs-backup[51353]: rbd: error opening image 5b62f768-ba2a-4c0d-8ee2-0bffba99222f_disk: (2) No such file or directory
Oct 27 06:54:39 cloudvirt1024 wmcs-backup[51353]: WARNING:[2021-10-27 06:54:39,996] Got an error trying to backup ceres-01, try n#2 of 3: Command '['/usr/bin/rbd', 'snap', 'create', 'eqiad1-compute/5b62f768-ba2a-4c0d-8ee2-0bffba99222f_disk@2021-10-27T06:54:39_cloudvirt1024']' returned non-zero exit status 2.
Oct 27 06:54:40 cloudvirt1024 wmcs-backup[51353]: Traceback (most recent call last):
Oct 27 06:54:40 cloudvirt1024 wmcs-backup[51353]:   File "/usr/local/sbin/wmcs-backup", line 2028, in <module>
Oct 27 06:54:40 cloudvirt1024 wmcs-backup[51353]:     args.func()
Oct 27 06:54:40 cloudvirt1024 wmcs-backup[51353]:   File "/usr/local/sbin/wmcs-backup", line 1863, in <lambda>
Oct 27 06:54:40 cloudvirt1024 wmcs-backup[51353]:     ).backup_assigned_vms(from_cache=args.from_cache, noop=args.noop)
Oct 27 06:54:40 cloudvirt1024 wmcs-backup[51353]:   File "/usr/local/sbin/wmcs-backup", line 1318, in backup_assigned_vms
Oct 27 06:54:40 cloudvirt1024 wmcs-backup[51353]:     noop=noop,
Oct 27 06:54:40 cloudvirt1024 wmcs-backup[51353]:   File "/usr/local/sbin/wmcs-backup", line 1119, in create_vm_backup
Oct 27 06:54:40 cloudvirt1024 wmcs-backup[51353]:     vm_info=vm_info, noop=noop
Oct 27 06:54:40 cloudvirt1024 wmcs-backup[51353]:   File "/usr/local/sbin/wmcs-backup", line 1028, in create_vm_backup
Oct 27 06:54:40 cloudvirt1024 wmcs-backup[51353]:     new_backup = self.vms_backups[vm_id].create_next_backup(noop=noop)
Oct 27 06:54:40 cloudvirt1024 wmcs-backup[51353]:   File "/usr/local/sbin/wmcs-backup", line 707, in create_next_backup
Oct 27 06:54:40 cloudvirt1024 wmcs-backup[51353]:     noop=noop,
Oct 27 06:54:40 cloudvirt1024 wmcs-backup[51353]:   File "/usr/local/sbin/wmcs-backup", line 589, in create_full_backup
Oct 27 06:54:40 cloudvirt1024 wmcs-backup[51353]:     noop=noop,
Oct 27 06:54:40 cloudvirt1024 wmcs-backup[51353]:   File "/usr/lib/python3/dist-packages/rbd2backy2.py", line 190, in create_full_backup
Oct 27 06:54:40 cloudvirt1024 wmcs-backup[51353]:     noop=noop,
Oct 27 06:54:40 cloudvirt1024 wmcs-backup[51353]:   File "/usr/lib/python3/dist-packages/rbd2backy2.py", line 91, in create
Oct 27 06:54:40 cloudvirt1024 wmcs-backup[51353]:     run_command([RBD, "snap", "create", str(new_snapshot)], noop=noop)
Oct 27 06:54:40 cloudvirt1024 wmcs-backup[51353]:   File "/usr/lib/python3/dist-packages/rbd2backy2.py", line 31, in run_command
Oct 27 06:54:40 cloudvirt1024 wmcs-backup[51353]:     output = subprocess.check_output(args).decode("utf8")
Oct 27 06:54:40 cloudvirt1024 wmcs-backup[51353]:   File "/usr/lib/python3.7/subprocess.py", line 395, in check_output
Oct 27 06:54:40 cloudvirt1024 wmcs-backup[51353]:     **kwargs).stdout
Oct 27 06:54:40 cloudvirt1024 wmcs-backup[51353]:   File "/usr/lib/python3.7/subprocess.py", line 487, in run
Oct 27 06:54:40 cloudvirt1024 wmcs-backup[51353]:     output=stdout, stderr=stderr)
Oct 27 06:54:40 cloudvirt1024 wmcs-backup[51353]: subprocess.CalledProcessError: Command '['/usr/bin/rbd', 'snap', 'create', 'eqiad1-compute/5b62f768-ba2a-4c0d-8ee2-0bffba99222f_disk@2021-10-27T06:54:39_cloudvirt1024']' returned non-zero exit status 2.
Oct 27 06:54:40 cloudvirt1024 systemd[1]: backup_vms.service: Main process exited, code=exited, status=1/FAILURE

Event Timeline

dcaro triaged this task as High priority.

It seems that the server was removed while the script was running:

root@cloudvirt1024:~# openstack server show 5b62f768-ba2a-4c0d-8ee2-0bffba99222f
No server with a name or ID of '5b62f768-ba2a-4c0d-8ee2-0bffba99222f' exists.

root@cloudvirt1024:~# openstack server list --os-project-id wikisp
+--------------------------------------+----------+--------+------------------------------------------------------+----------------------+-----------------------+
| ID                                   | Name     | Status | Networks                                             | Image                | Flavor                |
+--------------------------------------+----------+--------+------------------------------------------------------+----------------------+-----------------------+
| f424e08c-aef4-40e4-87e6-772cfb04179c | ceres-01 | ACTIVE | lan-flat-cloudinstances2b=172.16.4.47                | debian-11.0-bullseye | g3.cores1.ram2.disk20 |
| 74f56e80-7f91-4a08-a8d4-d5a1b4ff48a7 | mars     | ACTIVE | lan-flat-cloudinstances2b=172.16.4.126, 185.15.56.7  | debian-11.0-bullseye | g3.cores2.ram4.disk20 |
| b1c79b17-0839-438f-9713-3942bb43ebf9 | apollo   | ACTIVE | lan-flat-cloudinstances2b=172.16.4.20, 185.15.56.126 | debian-11.0-bullseye | g3.cores4.ram8.disk20 |
+--------------------------------------+----------+--------+------------------------------------------------------+----------------------+-----------------------+

root@cloudvirt1024:~# openstack server show f424e08c-aef4-40e4-87e6-772cfb04179c | grep launched_at
| OS-SRV-USG:launched_at              | 2021-10-27T04:43:44.000000

# the backups started earlier
root@cloudvirt1024:~# journalctl -u backup_vms.service  | grep 'Started Backup' | tail -n 1
Oct 27 02:00:02 cloudvirt1024 systemd[1]: Started Backup vms assigned to this host.

This seems to be a consequence of T276892, will close this.