
xtrabackup --prepare hits open_files_limit on buster
Closed, Resolved (Public)

Description

S3 and x1 backups get stuck in the prepare phase because of the number of files that xtrabackup attempts to open at the same time:

2021-04-26  8:50:16 0 [ERROR] InnoDB: Operating system error number 24 in a file operation.
2021-04-26  8:50:16 0 [ERROR] InnoDB: Error number 24 means 'Too many open files'

Adding "--open-files-limit=100000" fixes the issue.
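As a rough way to check whether a snapshot is likely to hit this limit, the number of tablespace files can be compared with the file descriptor limit. The following is a minimal sketch, assuming Linux, Python 3, and one `.ibd` file per table; the helper names `count_tablespace_files` and `may_exhaust_descriptors` are hypothetical and not part of wmfbackups or xtrabackup. It is only a heuristic: xtrabackup opens other files too, and the effective limit is the one it sets for itself through `--open-files-limit`, not necessarily the limit of the calling process.

```python
import errno
import os
import resource

# Error number 24 in the log above is EMFILE ("Too many open files") on Linux.
TOO_MANY_OPEN_FILES = errno.EMFILE


def count_tablespace_files(target_dir):
    """Count *.ibd files below target_dir (hypothetical helper)."""
    total = 0
    for _root, _dirs, files in os.walk(target_dir):
        total += sum(1 for name in files if name.endswith('.ibd'))
    return total


def may_exhaust_descriptors(target_dir, limit=None):
    """Return True if the number of tablespace files reaches the limit.

    When limit is None, the soft RLIMIT_NOFILE of the current process
    is used as an approximation (an assumption of this sketch).
    """
    if limit is None:
        limit, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if limit == resource.RLIM_INFINITY:
        return False
    return count_tablespace_files(target_dir) >= limit
```

Passing an explicit `limit` (for example the value given to `--open-files-limit`) makes it possible to test a candidate setting against a snapshot directory before running the prepare step.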

This didn't happen on Stretch, so it must be new InnoDB behavior in 10.4 (or a bug). While s3 contains a lot of files, I would expect xtrabackup to run almost serially (unlike normal database execution). Maybe a new optimization leads to file descriptor exhaustion.

Event Timeline

Change 682536 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/software/wmfbackups@master] Increase default memory usage of xtrabackup --prepare to 40GB

https://gerrit.wikimedia.org/r/682536

Change 682537 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/software/wmfbackups@master] Xtrabackup: Increase default open-files-limit to match production

https://gerrit.wikimedia.org/r/682537

Change 682536 merged by Jcrespo:

[operations/software/wmfbackups@master] Increase default memory usage of xtrabackup --prepare to 40GB

https://gerrit.wikimedia.org/r/682536

Change 682537 merged by Jcrespo:

[operations/software/wmfbackups@master] Xtrabackup: Increase default open-files-limit to match production

https://gerrit.wikimedia.org/r/682537

The new package version has been installed locally on dbprov2003. The run looks fine so far:

[08:43:32]: DEBUG - ['xtrabackup', '--prepare', '--target-dir', '/srv/backups/snapshots/ongoing/snapshot.s3.2021-04-24--16-00-21', '--use-memory', '40G', '--open-files-limit', '200000']
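The logged command can be reproduced with a small helper. This is a sketch only, not the actual wmfbackups code: the function name `build_prepare_command` and its parameters are hypothetical, and the defaults simply mirror the values shown in the debug line above.

```python
def build_prepare_command(target_dir, use_memory='40G', open_files_limit=200000):
    """Return the argument list for an xtrabackup --prepare run.

    Defaults mirror the values in the debug log; they are assumptions
    of this sketch, not a statement about the packaged defaults.
    """
    return [
        'xtrabackup', '--prepare',
        '--target-dir', target_dir,
        '--use-memory', use_memory,
        # Command-line arguments must be strings, so convert the integer.
        '--open-files-limit', str(open_files_limit),
    ]
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`, which avoids shell quoting issues with the snapshot path.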

But I am waiting to see whether it completes successfully.

Change 682916 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/software/wmfbackups@master] Release new v0.5 version

https://gerrit.wikimedia.org/r/682916

Change 682916 merged by Jcrespo:

[operations/software/wmfbackups@master] Release new v0.5 version

https://gerrit.wikimedia.org/r/682916

This is now fixed: I have uploaded v0.5 packages that fix the issue. For now I have only updated the buster dbprov hosts, as the others have no changes in the relevant package.

2 successful backups since deploy, everything looking good.

Not generating a 0.5 package for stretch, as the stretch hosts are scheduled to disappear soon and are not affected by the bug.
