
cinder-backup getting OOM-killed for large volumes
Closed, Resolved, Public

Description

The tools NFS volume almost never finishes a complete backup run. The backup appears to run for about 24 hours, and then cinder-backup gets OOM-killed:

[Sat Jun  3 10:37:08 2023] Out of memory: Killed process 1381260 (cinder-backup) total-vm:132445968kB, anon-rss:127617160kB, file-rss:0kB, shmem-rss:0kB, UID:64061 pgtables:250800kB oom_score_adj:0
[Sun Jun  4 10:57:54 2023] Out of memory: Killed process 3217627 (cinder-backup) total-vm:134186300kB, anon-rss:129442948kB, file-rss:0kB, shmem-rss:0kB, UID:64061 pgtables:254376kB oom_score_adj:0
[Fri Jun 16 10:14:34 2023] Out of memory: Killed process 3110495 (cinder-backup) total-vm:133269216kB, anon-rss:129318968kB, file-rss:772kB, shmem-rss:0kB, UID:64061 pgtables:253968kB oom_score_adj:0
[Sat Jun 17 11:10:50 2023] Out of memory: Killed process 2750021 (cinder-backup) total-vm:133841328kB, anon-rss:129260376kB, file-rss:0kB, shmem-rss:0kB, UID:64061 pgtables:253776kB oom_score_adj:0
[Sun Jun 18 18:42:47 2023] Out of memory: Killed process 1673656 (cinder-backup) total-vm:133871584kB, anon-rss:129318816kB, file-rss:0kB, shmem-rss:0kB, UID:64061 pgtables:253860kB oom_score_adj:0
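
To put the kernel numbers above in more readable units, here is a minimal sketch that parses one of these OOM lines and converts the sizes to GiB. The function name and regular expression are illustrative, not part of any existing tooling. The kernel's "kB" figures are treated as units of 1024 bytes.

```python
import re

# Matches the fields of interest in a kernel "Out of memory: Killed process" line.
OOM_RE = re.compile(
    r"Killed process (?P<pid>\d+) \((?P<comm>[^)]+)\) "
    r"total-vm:(?P<total_vm>\d+)kB, anon-rss:(?P<anon_rss>\d+)kB"
)

KIB_PER_GIB = 1024 * 1024


def parse_oom_line(line):
    """Return pid, command name and memory sizes (in GiB), or None if no match."""
    m = OOM_RE.search(line)
    if m is None:
        return None
    return {
        "pid": int(m["pid"]),
        "comm": m["comm"],
        "total_vm_gib": int(m["total_vm"]) / KIB_PER_GIB,
        "anon_rss_gib": int(m["anon_rss"]) / KIB_PER_GIB,
    }


# First OOM line from the description above.
sample = (
    "[Sat Jun  3 10:37:08 2023] Out of memory: Killed process 1381260 "
    "(cinder-backup) total-vm:132445968kB, anon-rss:127617160kB, "
    "file-rss:0kB, shmem-rss:0kB, UID:64061 pgtables:250800kB oom_score_adj:0"
)
info = parse_oom_line(sample)
print(f"{info['comm']} pid={info['pid']} anon-rss={info['anon_rss_gib']:.1f} GiB")
```

By this conversion the process was holding roughly 122 GiB of anonymous memory when it was killed, and the other four kills are within a couple of GiB of that.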

In theory it should be doing full backups rather than incremental backups, but I need to double-check that.

The backup agent must be doing something silly.

Event Timeline

Change 930947 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] cinder-backup: decrease block size and # of concurrent operations

https://gerrit.wikimedia.org/r/930947

Change 930947 merged by Andrew Bogott:

[operations/puppet@production] cinder-backup: decrease block size and # of concurrent operations

https://gerrit.wikimedia.org/r/930947

Right now:

814791 cinder    20   0   66.2g  61.3g  18904 S 100.7  48.9 668:09.73 cinder-backup

Now:

814791 cinder    20   0   68.2g  63.4g  18904 S 100.0  50.5 723:40.06 cinder-backup

And now:

814791 cinder    20   0   71.1g  66.0g  18440 S 104.0  52.7 793:51.81 cinder-backup
814791 cinder    20   0   82.2g  77.0g  18516 S  38.6  61.4   1137:55 cinder-backup
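
For a rough sense of how fast the process is growing, the snapshots above can be turned into a growth rate. This is only a sketch. It assumes the columns are top's usual TIME+ (cumulative CPU time as minutes:seconds) and RES (in GiB), and it measures growth per CPU-hour rather than per wall-clock hour, since the samples carry no timestamps.

```python
def cpu_minutes(time_plus):
    """Convert a top TIME+ value such as '668:09.73' or '1137:55' to minutes."""
    minutes, seconds = time_plus.split(":")
    return int(minutes) + float(seconds) / 60.0


# (TIME+, RES in GiB) pairs copied from the four snapshots above.
snapshots = [
    ("668:09.73", 61.3),
    ("723:40.06", 63.4),
    ("793:51.81", 66.0),
    ("1137:55", 77.0),
]


def growth_gib_per_hour(first, last):
    """Average RES growth between two snapshots, in GiB per hour of CPU time."""
    elapsed_minutes = cpu_minutes(last[0]) - cpu_minutes(first[0])
    return (last[1] - first[1]) / elapsed_minutes * 60.0


rate = growth_gib_per_hour(snapshots[0], snapshots[-1])
print(f"RES grew about {rate:.1f} GiB per CPU-hour")
```

The first three samples show the process at about 100% CPU, so CPU time roughly tracks wall time there. The last sample is at 38.6%, so the wall-clock interval before it was probably longer than the CPU time suggests.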

Change 932022 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] wmcs-cinder-backup-manager.py: don't make incremental backups giant volumes

https://gerrit.wikimedia.org/r/932022

Change 932022 merged by Andrew Bogott:

[operations/puppet@production] wmcs-cinder-backup-manager.py: only full backups of giant volumes

https://gerrit.wikimedia.org/r/932022
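
The policy named in the change above ("only full backups of giant volumes") could look roughly like the sketch below. This is not the code from wmcs-cinder-backup-manager.py. The size threshold, the function name and the volume names are hypothetical, and it assumes smaller volumes still get incremental backups once a full backup exists. The only real interface used is the `--incremental` flag of `openstack volume backup create`.

```python
# Hypothetical cutoff; the real threshold used in the patch is not stated in this task.
GIANT_VOLUME_GB = 1000


def backup_command(volume, size_gb, have_full_backup, giant_gb=GIANT_VOLUME_GB):
    """Build the argv for backing up one volume.

    Giant volumes always get a full backup. Smaller volumes get an
    incremental backup once a full backup exists to base it on.
    """
    cmd = ["openstack", "volume", "backup", "create"]
    if have_full_backup and size_gb < giant_gb:
        cmd.append("--incremental")
    cmd.append(volume)
    return cmd


# Example volume names and sizes are made up for illustration.
print(backup_command("big-nfs-volume", 8000, have_full_backup=True))
print(backup_command("small-volume", 20, have_full_backup=True))
```

The function only builds the command and does not run it, which keeps the decision easy to test separately from the OpenStack API.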