Page MenuHomePhabricator

Create script to monitor db dumps for backups are successful
Closed, ResolvedPublic

Description

dumps are done to es2001 and in the future, somewhere in eqiad, redundantly.

Event Timeline

So, the main issue is that we can have problems like this:

07-Dec 02:05 helium.eqiad.wmnet JobId 42930: Start Backup JobId 42930, Job=dbstore1001.eqiad.wmnet-Weekly-Wed-production-mysql-srv-backups.2016-12-07_02.05.00_25
07-Dec 02:05 helium.eqiad.wmnet JobId 42930: Using Device "FileStorage1"
07-Dec 02:05 dbstore1001.eqiad.wmnet-fd JobId 42930: shell command: run ClientRunBeforeJob "/etc/bacula/scripts/predump"
07-Dec 02:05 dbstore1001.eqiad.wmnet-fd JobId 42930: ClientRunBeforeJob: ERROR 2026 (HY000): SSL connection error: error:14090086:SSL routines:SSL3_GET_SERVER_CERTIFICATE:certificate verify failed
07-Dec 02:05 helium.eqiad.wmnet-fd JobId 42930: Volume "production0026" previously written, moving to end of data.
07-Dec 02:05 helium.eqiad.wmnet-fd JobId 42930: Ready to append to end of Volume "production0026" size=5903390593
07-Dec 08:01 helium.eqiad.wmnet-fd JobId 42930: Job write elapsed time = 05:56:31, Transfer rate = 55.68 M Bytes/second
07-Dec 08:01 helium.eqiad.wmnet JobId 42930: Bacula helium.eqiad.wmnet 5.2.5 (26Jan12):
  Build OS:               x86_64-pc-linux-gnu ubuntu 12.04
  JobId:                  42930
  Job:                    dbstore1001.eqiad.wmnet-Weekly-Wed-production-mysql-srv-backups.2016-12-07_02.05.00_25
  Backup Level:           Full
  Client:                 "dbstore1001.eqiad.wmnet-fd" 5.2.6 (21Feb12) x86_64-pc-linux-gnu,ubuntu,14.04
  FileSet:                "mysql-srv-backups" 2015-01-28 02:05:00
  Pool:                   "production" (From Job resource)
  Catalog:                "production" (From Client resource)
  Storage:                "helium-FileStorage1" (From Pool resource)
  Scheduled time:         07-Dec-2016 02:05:00
  Start time:             07-Dec-2016 02:05:06
  End time:               07-Dec-2016 08:01:38
  Elapsed time:           5 hours 56 mins 32 secs
  Priority:               10
  FD Files Written:       1,825
  SD Files Written:       1,825
  FD Bytes Written:       1,191,203,712,720 (1.191 TB)
  SD Bytes Written:       1,191,204,297,679 (1.191 TB)
  Rate:                   55684.5 KB/s
  Software Compression:   None
  VSS:                    no
  Encryption:             yes
  Accurate:               no
  Volume name(s):         production0026|production0027|production0029
  Volume Session Id:      914
  Volume Session Time:    1479834296
  Last Volume Bytes:      479,290,981,269 (479.2 GB)
  Non-fatal FD errors:    0
  SD Errors:              0
  FD termination status:  OK
  SD termination status:  OK
  Termination:            Backup OK

Returning Backup OK, but the predump requirement failing silently. Is it a case of a return value not being set properly?

So, Bacula does indeed honor an error return from a ClientRunBeforeJob script.

http://www.bacula.org/5.2.x-manuals/en/main/main/Configuring_Director.html#SECTION001440000000000000000 says:

This directive is not required, but if it is defined, and if the exit code of the program run is non-zero, the current Bacula job will be canceled.

The predump scripts has a set -e right at the beginning for this exact thing https://phabricator.wikimedia.org/diffusion/OPUP/browse/production/modules/backup/templates/mysql-predump.erb;e46f56e88de3191731440c162b622c40bfaa650a$5

So, somehow we missed the error there. My take would be that one of the child processes returned that error, others did not and the wait in the last line somehow mangled all that. I 'll conduct some tests in case it's that.

Change 327187 had a related patch set uploaded (by Jcrespo):
mariadb: Fix quering phabricator databases by querying m3 slave

https://gerrit.wikimedia.org/r/327187

Change 327187 merged by Jcrespo:
mariadb: Fix quering phabricator databases by querying m3 slave

https://gerrit.wikimedia.org/r/327187

Change 327245 had a related patch set uploaded (by Jcrespo):
mariadb: even more small fixes for the misc backup scripts

https://gerrit.wikimedia.org/r/327245

Change 327245 merged by Jcrespo:
mariadb: even more small fixes for the misc backup scripts

https://gerrit.wikimedia.org/r/327245

Change 327568 had a related patch set uploaded (by Jcrespo):
mariadb-backups: Fix x1 backups

https://gerrit.wikimedia.org/r/327568

Change 327568 merged by Jcrespo:
mariadb-backups: Fix x1 backups

https://gerrit.wikimedia.org/r/327568

So, somehow we missed the error there. My take would be that one of the child processes returned that error, others did not and the wait in the last line somehow mangled all that. I 'll conduct some tests in case it's that.

FWIW while debugging something else I ran across a long-running predump script, can confirm this is the case. Namely wait will just return 0 once all jobs are completed. Replacing bash job management with e.g. parallel (from moreutils or GNU parallel) should do the right thing though.

1978Gage2001 moved this task from Triage to In progress on the DBA board.

This is partially implemented on T184696, if backups fail, they are not rotated.

The second part, monitoring and alerting of failures are still needed.

jcrespo renamed this task from Create script to monitor db dumps for backups are successful (and if not, old backups are not deleted) to Create script to monitor db dumps for backups are successful.Mar 9 2018, 3:21 PM
jcrespo updated the task description. (Show Details)
Marostegui assigned this task to jcrespo.

I am going to close this as resolved as per T151999#4038297 and going to create a task for the "second part" which is also going to be a goal. So creating a task for that.