Dumps are currently done to es2001 and, in the future, will also be done redundantly somewhere in eqiad.
Description
Details
| Status | Subtype | Assigned | Task |
|---|---|---|---|
| Resolved | | jcrespo | T151579 dbstore1001 backup jobs failed between 2016-10-19 and 2016-11-23 |
| Open | | None | T138562 Improve regular production database backups handling |
| Resolved | | jcrespo | T151999 Create script to monitor db dumps for backups are successful |
Event Timeline
So, the main issue is that we can have problems like this:
```
07-Dec 02:05 helium.eqiad.wmnet JobId 42930: Start Backup JobId 42930, Job=dbstore1001.eqiad.wmnet-Weekly-Wed-production-mysql-srv-backups.2016-12-07_02.05.00_25
07-Dec 02:05 helium.eqiad.wmnet JobId 42930: Using Device "FileStorage1"
07-Dec 02:05 dbstore1001.eqiad.wmnet-fd JobId 42930: shell command: run ClientRunBeforeJob "/etc/bacula/scripts/predump"
07-Dec 02:05 dbstore1001.eqiad.wmnet-fd JobId 42930: ClientRunBeforeJob: ERROR 2026 (HY000): SSL connection error: error:14090086:SSL routines:SSL3_GET_SERVER_CERTIFICATE:certificate verify failed
07-Dec 02:05 helium.eqiad.wmnet-fd JobId 42930: Volume "production0026" previously written, moving to end of data.
07-Dec 02:05 helium.eqiad.wmnet-fd JobId 42930: Ready to append to end of Volume "production0026" size=5903390593
07-Dec 08:01 helium.eqiad.wmnet-fd JobId 42930: Job write elapsed time = 05:56:31, Transfer rate = 55.68 M Bytes/second
07-Dec 08:01 helium.eqiad.wmnet JobId 42930: Bacula helium.eqiad.wmnet 5.2.5 (26Jan12):
  Build OS:               x86_64-pc-linux-gnu ubuntu 12.04
  JobId:                  42930
  Job:                    dbstore1001.eqiad.wmnet-Weekly-Wed-production-mysql-srv-backups.2016-12-07_02.05.00_25
  Backup Level:           Full
  Client:                 "dbstore1001.eqiad.wmnet-fd" 5.2.6 (21Feb12) x86_64-pc-linux-gnu,ubuntu,14.04
  FileSet:                "mysql-srv-backups" 2015-01-28 02:05:00
  Pool:                   "production" (From Job resource)
  Catalog:                "production" (From Client resource)
  Storage:                "helium-FileStorage1" (From Pool resource)
  Scheduled time:         07-Dec-2016 02:05:00
  Start time:             07-Dec-2016 02:05:06
  End time:               07-Dec-2016 08:01:38
  Elapsed time:           5 hours 56 mins 32 secs
  Priority:               10
  FD Files Written:       1,825
  SD Files Written:       1,825
  FD Bytes Written:       1,191,203,712,720 (1.191 TB)
  SD Bytes Written:       1,191,204,297,679 (1.191 TB)
  Rate:                   55684.5 KB/s
  Software Compression:   None
  VSS:                    no
  Encryption:             yes
  Accurate:               no
  Volume name(s):         production0026|production0027|production0029
  Volume Session Id:      914
  Volume Session Time:    1479834296
  Last Volume Bytes:      479,290,981,269 (479.2 GB)
  Non-fatal FD errors:    0
  SD Errors:              0
  FD termination status:  OK
  SD termination status:  OK
  Termination:            Backup OK
```
The job reports "Backup OK", but the predump step failed silently. Is this a case of a return value not being propagated properly?
So, Bacula does indeed honor an error return from a ClientRunBeforeJob script. From its documentation:

> This directive is not required, but if it is defined, and if the exit code of the program run is non-zero, the current Bacula job will be canceled.
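For context, the directive lives in the Job resource of the director configuration. The fragment below is a hypothetical illustration of where it sits; the names are placeholders, not the actual production configuration:

```
# Hypothetical Job resource -- names are placeholders.
Job {
  Name = "dbstore1001.eqiad.wmnet-Weekly-Wed"
  Type = Backup
  Client = "dbstore1001.eqiad.wmnet-fd"
  FileSet = "mysql-srv-backups"
  Pool = "production"
  # If this script exits non-zero, Bacula cancels the job.
  ClientRunBeforeJob = "/etc/bacula/scripts/predump"
}
```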
The predump script has a `set -e` right at the beginning for exactly this reason: https://phabricator.wikimedia.org/diffusion/OPUP/browse/production/modules/backup/templates/mysql-predump.erb;e46f56e88de3191731440c162b622c40bfaa650a$5
So, somehow we missed the error there. My take is that one of the child processes returned that error while the others did not, and the `wait` in the last line swallowed the failing exit status. I'll conduct some tests in case that is what happened.
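The suspected failure mode can be reproduced in a few lines. This is a minimal sketch, not the actual predump script: it assumes the script backgrounds several dump commands and then calls a bare `wait`:

```shell
#!/bin/bash
set -e

false &   # backgrounded child that fails (like the dump hitting the SSL error)
true  &   # backgrounded child that succeeds

# Bare `wait` (no PID argument) waits for all children but always
# exits 0, so `set -e` never sees the child's failure.
wait
rc=$?
echo "exit status after bare wait: $rc"
```

The script prints `exit status after bare wait: 0` and exits successfully even though one child failed, which matches the "Backup OK despite a predump error" symptom.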
Change 327187 had a related patch set uploaded (by Jcrespo):
mariadb: Fix quering phabricator databases by querying m3 slave
Change 327187 merged by Jcrespo:
mariadb: Fix quering phabricator databases by querying m3 slave
Change 327245 had a related patch set uploaded (by Jcrespo):
mariadb: even more small fixes for the misc backup scripts
Change 327245 merged by Jcrespo:
mariadb: even more small fixes for the misc backup scripts
Change 327568 had a related patch set uploaded (by Jcrespo):
mariadb-backups: Fix x1 backups
FWIW, while debugging something else I ran across a long-running predump script and can confirm this is the case. Namely, a bare `wait` will just return 0 once all jobs have completed, regardless of their individual exit statuses. Replacing the bash job management with e.g. `parallel` (from moreutils or GNU parallel) should do the right thing, though.
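Short of switching to `parallel`, one possible fix in plain bash is to record each child's PID and wait on them individually, since `wait <pid>` returns that child's exit status. This is a sketch, not the deployed patch; the `false` stands in for a failing dump command:

```shell
#!/bin/bash
# Collect the PID of every backgrounded job.
pids=()
true  & pids+=("$!")
false & pids+=("$!")   # stand-in for a dump command that fails
true  & pids+=("$!")

# Wait on each PID individually; `wait <pid>` returns that
# child's exit status, so the failure is not lost.
status=0
for pid in "${pids[@]}"; do
    wait "$pid" || status=$?
done
echo "aggregate status: $status"   # non-zero here: the failure propagated
```

GNU parallel achieves the same effect with less bookkeeping, since it exits non-zero when any of its jobs fails.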
This is partially implemented in T184696: if backups fail, they are not rotated. The second part, monitoring and alerting on failures, is still needed.
I am going to close this as resolved per T151999#4038297, and I am creating a task for the "second part", which is also going to be a goal.