
cloudcontrol1005 - Check unit status of backup_cinder_volumes
Closed, Resolved · Public

Description

From https://alerts.wikimedia.org/?q=team%3Dwmcs:

Check unit status of backup_cinder_volumes
summary: CRITICAL: Status of the systemd unit backup_cinder_volumes
33 minutes ago
instance: cloudcontrol1005
source: icinga
team: wmcs

Event Timeline

dcaro triaged this task as High priority. · Mar 2 2022, 8:57 AM
dcaro created this task.

Change 767467 had a related patch set uploaded (by David Caro; author: David Caro):

[operations/puppet@production] wmcs: add runbook url to the backup_cinder_volumes alert

https://gerrit.wikimedia.org/r/767467

dcaro changed the task status from Open to In Progress. · Mar 2 2022, 9:23 AM
dcaro moved this task from To refine to Doing on the User-dcaro board.

Change 767474 had a related patch set uploaded (by David Caro; author: David Caro):

[operations/puppet@production] wmcs-cinder-backup-manager: increase individual timeout to 30h

https://gerrit.wikimedia.org/r/767474

Change 767467 merged by David Caro:

[operations/puppet@production] wmcs: add runbook url to the backup_cinder_volumes alert

https://gerrit.wikimedia.org/r/767467

Change 767474 merged by David Caro:

[operations/puppet@production] wmcs-cinder-backup-manager: increase individual timeout to 30h

https://gerrit.wikimedia.org/r/767474

So the backup failed on Friday too; it started at 16:18 UTC:

dcaro@cloudbackup2002:~$ sudo journalctl -u cinder-backup
...
Mar 04 16:18:21 cloudbackup2002 cinder-backup[18548]: 2022-03-04 16:18:21.856 18548 INFO cinder.backup.manager [req-a83c5c2b-9c2a-4b0c-99a4-4bf6b74d7054 novaadmin admin - - -] Create backup started, backup: 0897f17e-57ed-4af1-a514-bbc71c3f7550 volume: 8d68

But lots of errors happened. Some don't look very worrisome (reconnections), like:

Mar 04 16:20:14 cloudbackup2002 cinder-backup[18548]: 2022-03-04 16:20:14.724 18548 ERROR cinder.service [req-879d9482-17c9-4140-84ef-ea8ea271e3ce - - - - -] model server went away: oslo_db.exception.DBConnectionError: (pymysql.err.OperationalError)
...

Some are a bit more worrisome:

Mar 05 01:33:51 cloudbackup2002 cinder-backup[18548]: 2022-03-05 01:33:51.214 18548 ERROR oslo.messaging._drivers.impl_rabbit [-] [625cb6ef-d7b0-4a5f-b742-3c45470bec2a] AMQP server on cloudcontrol1005.wikimedia.org:5672 is unreachable: Server unexpectedly

But this looks more impactful:

Mar 05 08:36:23 cloudbackup2002 cinder-backup[18548]: 2022-03-05 08:36:23.108 18548 ERROR oslo_messaging.rpc.server [req-a83c5c2b-9c2a-4b0c-99a4-4bf6b74d7054 novaadmin admin - - -] Exception during message handling: MemoryError

There's nothing suspicious in the journal logs around that time:

dcaro@cloudbackup2002:~$ sudo journalctl --since "2022-03-05 08:00:00"
...

Or in dmesg:

dcaro@cloudbackup2002:~$ sudo dmesg -T

Looking...

According to the stack trace, it seems it was finishing some backup (probably maps); see the _finalize_backup frame in the stack:

Mar 05 08:36:23 cloudbackup2002 cinder-backup[18548]: 2022-03-05 08:36:23.108 18548 ERROR oslo_messaging.rpc.server [req-a83c5c2b-9c2a-4b0c-99a4-4bf6b74d7054 novaadmin admin - - -] Exception during message handling: MemoryError
                                                      2022-03-05 08:36:23.108 18548 ERROR oslo_messaging.rpc.server Traceback (most recent call last):
                                                      2022-03-05 08:36:23.108 18548 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3/dist-packages/oslo_messaging/rpc/server.py", line 165, in _process_incoming
                                                      2022-03-05 08:36:23.108 18548 ERROR oslo_messaging.rpc.server     res = self.dispatcher.dispatch(message)
                                                      2022-03-05 08:36:23.108 18548 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3/dist-packages/oslo_messaging/rpc/dispatcher.py", line 309, in dispatch
                                                      2022-03-05 08:36:23.108 18548 ERROR oslo_messaging.rpc.server     return self._do_dispatch(endpoint, method, ctxt, args)
                                                      2022-03-05 08:36:23.108 18548 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3/dist-packages/oslo_messaging/rpc/dispatcher.py", line 229, in _do_dispatch
                                                      2022-03-05 08:36:23.108 18548 ERROR oslo_messaging.rpc.server     result = func(ctxt, **new_args)
                                                      2022-03-05 08:36:23.108 18548 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3/dist-packages/cinder/utils.py", line 1130, in wrapper
                                                      2022-03-05 08:36:23.108 18548 ERROR oslo_messaging.rpc.server     return func(self, *args, **kwargs)
                                                      2022-03-05 08:36:23.108 18548 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3/dist-packages/cinder/backup/manager.py", line 411, in create_backup
                                                      2022-03-05 08:36:23.108 18548 ERROR oslo_messaging.rpc.server     volume_utils.update_backup_error(backup, six.text_type(err))
                                                      2022-03-05 08:36:23.108 18548 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3/dist-packages/oslo_utils/excutils.py", line 220, in __exit__
                                                      2022-03-05 08:36:23.108 18548 ERROR oslo_messaging.rpc.server     self.force_reraise()
                                                      2022-03-05 08:36:23.108 18548 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3/dist-packages/oslo_utils/excutils.py", line 196, in force_reraise
                                                      2022-03-05 08:36:23.108 18548 ERROR oslo_messaging.rpc.server     six.reraise(self.type_, self.value, self.tb)
                                                      2022-03-05 08:36:23.108 18548 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3/dist-packages/six.py", line 693, in reraise
                                                      2022-03-05 08:36:23.108 18548 ERROR oslo_messaging.rpc.server     raise value
                                                      2022-03-05 08:36:23.108 18548 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3/dist-packages/cinder/backup/manager.py", line 400, in create_backup
                                                      2022-03-05 08:36:23.108 18548 ERROR oslo_messaging.rpc.server     updates = self._run_backup(context, backup, volume)
                                                      2022-03-05 08:36:23.108 18548 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3/dist-packages/cinder/backup/manager.py", line 488, in _run_backup
                                                      2022-03-05 08:36:23.108 18548 ERROR oslo_messaging.rpc.server     tpool.Proxy(device_path))
                                                      2022-03-05 08:36:23.108 18548 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3/dist-packages/cinder/backup/chunkeddriver.py", line 665, in backup
                                                      2022-03-05 08:36:23.108 18548 ERROR oslo_messaging.rpc.server     self._finalize_backup(backup, container, object_meta, object_sha256)
                                                      2022-03-05 08:36:23.108 18548 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3/dist-packages/cinder/backup/chunkeddriver.py", line 428, in _finalize_backup
                                                      2022-03-05 08:36:23.108 18548 ERROR oslo_messaging.rpc.server     sha256_list)
                                                      2022-03-05 08:36:23.108 18548 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3/dist-packages/cinder/backup/chunkeddriver.py", line 289, in _write_sha256file
                                                      2022-03-05 08:36:23.108 18548 ERROR oslo_messaging.rpc.server     sha256file_json = json.dumps(sha256file, sort_keys=True, indent=2)
                                                      2022-03-05 08:36:23.108 18548 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3.7/json/__init__.py", line 238, in dumps
                                                      2022-03-05 08:36:23.108 18548 ERROR oslo_messaging.rpc.server     **kw).encode(obj)
                                                      2022-03-05 08:36:23.108 18548 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3.7/json/encoder.py", line 202, in encode
                                                      2022-03-05 08:36:23.108 18548 ERROR oslo_messaging.rpc.server     return ''.join(chunks)
                                                      2022-03-05 08:36:23.108 18548 ERROR oslo_messaging.rpc.server MemoryError
                                                      2022-03-05 08:36:23.108 18548 ERROR oslo_messaging.rpc.server

Might it be that serializing the sha256 list of the file to JSON ended up using too much memory?
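
For a sense of scale: json.dumps() builds the entire document as a single Python string before anything gets written out, so peak memory grows with the full digest list. A rough back-of-envelope sketch (the 8 TiB volume size and the 32 KiB sha block size here are illustrative assumptions, not values taken from this host):

# Back-of-envelope: how big does the sha256 JSON get for a large volume?
# Both sizes below are assumptions for illustration only.
volume_bytes = 8 * 1024**4       # assume an 8 TiB volume
sha_block_bytes = 32 * 1024      # assume a 32 KiB sha block size
digests = volume_bytes // sha_block_bytes
# each entry: 64 hex chars plus quotes, comma and indentation, ~70 bytes
json_bytes = digests * 70
print(f"{digests:,} digests -> ~{json_bytes / 1024**3:.1f} GiB of JSON")
# 268,435,456 digests -> ~17.5 GiB of JSON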

Looking at the host details Grafana dashboard for cloudbackup2002, it does seem the host ran out of memory:

https://grafana.wikimedia.org/goto/yMHF_fL7k?orgId=1

That's too bad. I'll see if it happens again on the next run, but this might be a limitation of the current backup system for big volumes.

Yep, same error, same place:

Mar 08 10:42:00 cloudbackup2002 cinder-backup[18536]: 2022-03-08 10:42:00.692 18536 ERROR oslo_messaging.rpc.server [req-311182ee-4a3f-4a10-943f-7ab51d5a7832 novaadmin admin - - -] Exception during message handling: MemoryError
                                                      2022-03-08 10:42:00.692 18536 ERROR oslo_messaging.rpc.server Traceback (most recent call last):
                                                      ...
                                                      2022-03-08 10:42:00.692 18536 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3/dist-packages/cinder/backup/chunkeddriver.py", line 428, in _finalize_backup
                                                      2022-03-08 10:42:00.692 18536 ERROR oslo_messaging.rpc.server     sha256_list)
                                                      2022-03-08 10:42:00.692 18536 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3/dist-packages/cinder/backup/chunkeddriver.py", line 289, in _write_sha256file
                                                      2022-03-08 10:42:00.692 18536 ERROR oslo_messaging.rpc.server     sha256file_json = json.dumps(sha256file, sort_keys=True, indent=2)
                                                      2022-03-08 10:42:00.692 18536 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3.7/json/__init__.py", line 238, in dumps
                                                      2022-03-08 10:42:00.692 18536 ERROR oslo_messaging.rpc.server     **kw).encode(obj)
                                                      2022-03-08 10:42:00.692 18536 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3.7/json/encoder.py", line 202, in encode
                                                      2022-03-08 10:42:00.692 18536 ERROR oslo_messaging.rpc.server     return ''.join(chunks)
                                                      2022-03-08 10:42:00.692 18536 ERROR oslo_messaging.rpc.server MemoryError

I'll dig into the code a bit.

I think I found one way to improve the memory usage considerably: using json.dump instead of json.dumps. I did a small test; without the improvement:

memory_json_dumps.png (869×1 px, 148 KB)

With the improvement:

memory_json_dump_and_encoder.png (858×1 px, 209 KB)

So hopeful for now.
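
For reference, a minimal sketch of the idea (not the actual patch; the function and file names are illustrative): instead of materializing the whole document with json.dumps() and writing it in one go, iterate over json.JSONEncoder().iterencode() (which is what json.dump uses under the hood) and write each chunk as it is produced:

import json

def write_sha256file_streaming(sha256file, fileobj):
    # json.dumps() would build the full (potentially multi-GiB) string
    # in memory first; iterencode() yields small str chunks instead,
    # keeping peak memory roughly constant.
    encoder = json.JSONEncoder(sort_keys=True, indent=2)
    for chunk in encoder.iterencode(sha256file):
        fileobj.write(chunk.encode('utf-8'))

# Usage sketch with a small, made-up payload:
with open('backup_sha256file.json', 'wb') as f:
    write_sha256file_streaming({'sha256s': ['ab' * 32] * 1000}, f)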

Icinga downtime and Alertmanager silence (ID=cb3d723c-50f3-4cc8-b784-f1899cff64b2) set by dcaro@cumin1001 for 1 day, 0:00:00 on 1 host(s) and their services with reason: dcaro testing backups

cloudcontrol1005.wikimedia.org

The backup timed out this time :facepalm:, so I'm rerunning with a longer timeout...

The backup finished this time \o/. It took a while, but there were no memory issues:

root@cloudcontrol1005:~# wmcs-cinder-volume-backup --full 8d687b46-03b8-4308-9b71-13704a664290 --timeout $((60*60*24*3))
wmcs-cinder-volume-backup: 2022-03-25 09:14:30,672: INFO: Backup up volume 8d687b46-03b8-4308-9b71-13704a664290
wmcs-cinder-volume-backup: 2022-03-25 09:14:30,672: INFO: User requested full backup
wmcs-cinder-volume-backup: 2022-03-25 09:14:30,673: INFO: Backup up from snapshot
wmcs-cinder-volume-backup: 2022-03-25 09:14:31,029: INFO: current status is creating; waiting for it to change to ['available']
wmcs-cinder-volume-backup: 2022-03-25 09:14:33,168: INFO: Just made snapshot 90a702f3-43a0-4eee-a1f1-26e232a23542
wmcs-cinder-volume-backup: 2022-03-25 09:14:33,396: INFO: Generating backup maps-2022-03-25T09:14:30.672942 (ab071465-11f4-4526-8bd7-bf6f98d05b6d)
wmcs-cinder-volume-backup: 2022-03-25 09:14:33,467: INFO: current status is creating; waiting for it to change to ['available']
wmcs-cinder-volume-backup: 2022-03-26 07:52:07,151: INFO: Cleaning up snapshot 90a702f3-43a0-4eee-a1f1-26e232a23542
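
(For reference, $((60*60*24*3)) expands to 259200 seconds, i.e. a 3-day timeout; going by the timestamps above, the run itself took roughly 22.5 hours, from 09:14 UTC on Mar 25 to 07:52 UTC on Mar 26.)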

I'll do a couple of restore tests and then we're ready to go.

The memory usage was less than half of what is available, so we can still back up bigger volumes (approximately twice the size) without problems:

cloudbackup2002_mem_usage_backup_with_fix (299×948 px, 38 KB)

Change 774454 had a related patch set uploaded (by David Caro; author: David Caro):

[operations/puppet@production] openstack.cinder: Add patch to the backups chunkeddriver

https://gerrit.wikimedia.org/r/774454

Change 774454 merged by David Caro:

[operations/puppet@production] openstack.cinder: Add patch to the backups chunkeddriver

https://gerrit.wikimedia.org/r/774454

Created a patch for the OpenStack Victoria version; the Wallaby one will need to be redone (the code might have changed). Closing this for now.