
CRITICAL: Status of the systemd unit backup_cinder_volumes
Closed, Resolved · Public

Description

Cinder backups are failing a lot lately. I think two things are happening.

First, a given backup is dying because of mysql issues:

| fail_reason           | (pymysql.err.OperationalError) (1040, 'Too many connections') |

Subsequently, cinder-backup tries to base an incremental backup on that previous failed backup, which produces another failed backup... and so on.
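
For reference, a quick way to spot the cascade is to list the error-state backups and look at the fail_reason cinder recorded; a rough sketch (<backup-id> is a placeholder, and the --status filter / exact field names may vary a bit by openstackclient version):

# List backups that ended up in error state (if this openstackclient doesn't take
# --status, grepping the plain list output works too):
openstack volume backup list --status error

# Look at one of them to see the fail_reason cinder-backup recorded, e.g. the
# "(1040, 'Too many connections')" MySQL error above:
openstack volume backup show <backup-id> -c status -c fail_reason -c is_incremental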

Event Timeline

Started manually cleaning up things:

  • Deleted old maps backups (the latest was in error status)
  • Removed a snapshot stuck in 'deleting' for wikidumpparse, and re-triggered a full backup (in progress)
  • Done wikidumpparse
  • Working on f8a19a5f-a872-494e-ba22-839192c2bffd (wmde-templates-alpha), which has hit the maximum number of snapshots allowed; the per-volume cleanup / re-trigger commands are sketched just below this list.
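
Roughly what that per-volume cleanup looks like (a sketch, not a transcript; every id/name below is a placeholder):

# 1. find the backups of this volume that ended up in error state
openstack volume backup list --volume <volume-id> --status error

# 2. delete them, so the next run doesn't try to chain an incremental off a failed parent
openstack volume backup delete <backup-id>

# 3. kick off a fresh full backup (--force because the volume is attached and in use)
openstack volume backup create --force --name <volume-name>-full <volume-id>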

Went at it the other way around and got the list of snapshots stuck in 'deleting':

root@cloudcontrol1005:~# openstack volume snapshot list | grep deleting
| 0278bc9b-ce7a-4faa-9fd6-26d9b25878c6 | utrs-nfs-2022-05-11T10:32:29.187774          | None        | deleting  |   10 |
| 6c7fba9b-1c02-461e-99a9-d5900a491a97 | twl-nfs-2022-05-11T10:32:16.117514           | None        | deleting  |  100 |
| af2e06e6-9b95-4a5e-ba8a-f636a831432d | testlabs-nfs-2022-05-11T10:32:02.196544      | None        | deleting  |   40 |
| 4965d908-c202-4682-83f4-5dd03c6a06d2 | quarry-nfs-2022-05-11T10:31:48.733668        | None        | deleting  |  400 |
| 0bbd1f6c-ec80-4f65-b4f1-0f5431f07e30 | puppet-diffs-nfs-2022-05-11T10:31:35.263687  | None        | deleting  |    4 |
| 4af98564-eecb-4a71-bc65-e9ef4a7de3c8 | proxy-03-backup-2022-05-11T10:31:21.833497   | None        | deleting  |   10 |
| 388c40c6-58f2-495c-b015-27e391f32c79 | proxy-04-backup-2022-05-11T10:31:09.023390   | None        | deleting  |   10 |
| 6d4c2578-a5e1-4b81-ba02-6b6480b89a79 | huggle-nfs-2022-05-11T10:30:55.306257        | None        | deleting  |   40 |
| eac601a9-6727-4951-a319-873dcb55d417 | fastcci-nfs-2022-05-11T10:30:42.045102       | None        | deleting  |   10 |
| a934f988-597c-4517-964a-ef4f886dc6ab | dumps-nfs-2022-05-11T10:30:28.793944         | None        | deleting  |   80 |
| c81882a1-9f7d-4c50-babd-2a038e28f42b | cvn-nfs-2022-05-11T10:30:15.431877           | None        | deleting  |    8 |
| 7203ac4f-ae16-449b-a4c1-8dbb30c03a6f | scratch-2022-05-11T10:30:02.295162           | None        | deleting  | 3072 |

All are from 2022-05-11 ~10:30 UTC, so something happened around then, I guess; will clean those up.

Cleaned them all up:

root@cloudcontrol1005:~# for stuck_snapshot in $(openstack volume snapshot list | grep deleting | awk '{print $2}'); do echo "Deleting $stuck_snapshot"; rbd list -l eqiad1-cinder | grep -q "$stuck_snapshot" && echo "... There's some rbd leftovers, check manually" || { cinder snapshot-reset-state --state error "$stuck_snapshot" && cinder snapshot-delete "$stuck_snapshot"; echo ".... removed"; }; done;
Deleting 0278bc9b-ce7a-4faa-9fd6-26d9b25878c6
.... removed
Deleting 6c7fba9b-1c02-461e-99a9-d5900a491a97
.... removed
Deleting af2e06e6-9b95-4a5e-ba8a-f636a831432d
.... removed
Deleting 4965d908-c202-4682-83f4-5dd03c6a06d2
.... removed
Deleting 0bbd1f6c-ec80-4f65-b4f1-0f5431f07e30
.... removed
Deleting 4af98564-eecb-4a71-bc65-e9ef4a7de3c8
.... removed
Deleting 388c40c6-58f2-495c-b015-27e391f32c79
.... removed
Deleting 6d4c2578-a5e1-4b81-ba02-6b6480b89a79
.... removed
Deleting eac601a9-6727-4951-a319-873dcb55d417
.... removed
Deleting a934f988-597c-4517-964a-ef4f886dc6ab
.... removed
Deleting c81882a1-9f7d-4c50-babd-2a038e28f42b
.... removed
Deleting 7203ac4f-ae16-449b-a4c1-8dbb30c03a6f
.... removed
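
Same loop as above, just unrolled for readability (identical commands, no new behaviour):

for stuck_snapshot in $(openstack volume snapshot list | grep deleting | awk '{print $2}'); do
    echo "Deleting $stuck_snapshot"
    if rbd list -l eqiad1-cinder | grep -q "$stuck_snapshot"; then
        # snapshot still has leftovers in the Ceph pool: don't touch it
        echo "... There's some rbd leftovers, check manually"
    else
        # no rbd leftovers: force the snapshot out of 'deleting' and delete it
        cinder snapshot-reset-state --state error "$stuck_snapshot" \
            && cinder snapshot-delete "$stuck_snapshot"
        echo ".... removed"
    fi
done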

Will let it run again and see what happens.

I just ran an incremental backup of the maps cinder volume on the CLI so I could see how long it's really taking... it timed out after 48 hours. Our current timeout for scripted backups is 30 hours, so probably part of what's been happening is that the initial full backup of that volume works but then any subsequent (incremental) backup fails.
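
For reference, roughly how that manual timing run looks from the CLI (a sketch; <maps-volume-id> is a placeholder, and backup creation is asynchronous, so you have to poll rather than just time the create call):

# start an incremental backup of the maps volume and poll until it leaves 'creating'
backup_id=$(openstack volume backup create --incremental --force -f value -c id <maps-volume-id>)
while [ "$(openstack volume backup show -f value -c status "$backup_id")" = "creating" ]; do
    sleep 300
done
openstack volume backup show "$backup_id" -c status -c fail_reason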

So one stopgap would be to allow configuring certain volumes to do full backups only (and, presumably, fewer of them).

But we also probably need to do some serious performance analysis of what the heck cinder-backup is doing :)
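
Concretely, the only difference on the CLI is whether --incremental is passed (a sketch, placeholder ids):

# Full backup: self-contained, slower, doesn't depend on any previous backup
openstack volume backup create --force <volume-id>

# Incremental backup: faster when it works, but it chains off the latest existing
# backup, so one failed parent poisons every later attempt
openstack volume backup create --incremental --force <volume-id>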

Change 858659 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] wmcs-cinder-backup-manager: allow for less frequent backups

https://gerrit.wikimedia.org/r/858659

Change 858659 merged by Andrew Bogott:

[operations/puppet@production] wmcs-cinder-backup-manager: allow for less frequent backups

https://gerrit.wikimedia.org/r/858659

cinder-backup doesn't have much of a recovery plan for backups in progress, so there will always be some mess left any time we restart the service. But as long as I let it run without interference, it seems pretty stable these days.
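
When a restart does leave a mess, it's usually backups stuck in 'creating'; roughly the cleanup (a sketch, assuming cinder's backup-reset-state is available; <backup-id> is a placeholder):

# backups that were in flight when the service restarted stay stuck in 'creating'
openstack volume backup list --status creating

# force them into error state so they can be deleted and retried on the next run
cinder backup-reset-state --state error <backup-id>
openstack volume backup delete <backup-id>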