Page MenuHomePhabricator

Bacula recovery of sql files from silver/wikitech fails
Closed, ResolvedPublic

Description

I've tried almost all files from 28-feb, and it fails to recover those, e.g.:

/a/backup/labswiki-20160301.sql.gz

both in silver and in other hosts. I thought I was doing something wrong, but recovering files from other host "just works" without problem.

The error I get every time is:

29-Mar 20:39 dbstore1001.eqiad.wmnet-fd JobId 29292: Error: restore.c:780 Failed to initialize decryption context for /srv/T130892/a/backup/labswiki-20160228.sql.gz

(this is trying to recover to a different host, but same error is shown on the original host).
The files are recognized, if there were no files, the recovery is not even started.

I think there were previous errors, but at least not anymore since I upgraded ssl/restarted the bacula-fd service.

Event Timeline

jcrespo created this task.Mar 29 2016, 8:53 PM
Restricted Application added a project: Cloud-Services. · View Herald TranscriptMar 29 2016, 8:53 PM
Restricted Application added a subscriber: Aklapper. · View Herald Transcript

I did a last try trying to recover all possible files within a month, and I got some extra errors:

29-Mar 20:39 helium.eqiad.wmnet JobId 29292: Start Restore Job RestoreFiles.2016-03-29_20.02.56_19
29-Mar 20:39 helium.eqiad.wmnet JobId 29292: Using Device "FileStorage1"
29-Mar 20:39 helium.eqiad.wmnet-fd JobId 29292: Ready to read from volume "production0011" on device "FileStorage1" (/srv/baculasd1).
29-Mar 20:39 helium.eqiad.wmnet-fd JobId 29292: Forward spacing Volume "production0011" to file:block 69:628759543.
29-Mar 20:39 dbstore1001.eqiad.wmnet-fd JobId 29292: Error: restore.c:780 Failed to initialize decryption context for /srv/T130892/a/backup/labswiki-20160228.sql.gz
29-Mar 20:41 dbstore1001.eqiad.wmnet-fd JobId 29292: Error: restore.c:780 Failed to initialize decryption context for /srv/T130892/a/backup/labswiki-20160229.sql.gz
29-Mar 20:42 helium.eqiad.wmnet-fd JobId 29292: End of Volume at file 78 on device "FileStorage1" (/srv/baculasd1), Volume "production0011"
29-Mar 20:42 helium.eqiad.wmnet-fd JobId 29292: Ready to read from volume "production0019" on device "FileStorage1" (/srv/baculasd1).
29-Mar 20:42 helium.eqiad.wmnet-fd JobId 29292: Forward spacing Volume "production0019" to file:block 33:3257075293.
29-Mar 20:42 dbstore1001.eqiad.wmnet-fd JobId 29292: Error: openssl.c:86 TLS read/write failure.: ERR=error:0406506C:rsa routines:RSA_EAY_PRIVATE_DECRYPT:data greater than mod len
29-Mar 20:42 dbstore1001.eqiad.wmnet-fd JobId 29292: Fatal error: restore.c:495 Data record error. ERR=Resource temporarily unavailable
29-Mar 20:42 helium.eqiad.wmnet-fd JobId 29292: Error: bsock.c:394 Wrote 65536 bytes to client:10.64.48.17:36643, but only 49152 accepted.
29-Mar 20:42 helium.eqiad.wmnet-fd JobId 29292: Fatal error: read.c:137 Error sending to File daemon. ERR=Connection reset by peer
29-Mar 20:42 helium.eqiad.wmnet-fd JobId 29292: Error: bsock.c:335 Socket has errors=1 on call to client:10.64.48.17:36643
29-Mar 20:42 helium.eqiad.wmnet JobId 29292: Error: Bacula helium.eqiad.wmnet 5.2.5 (26Jan12):
  Build OS:               x86_64-pc-linux-gnu ubuntu 12.04
  JobId:                  29292
  Job:                    RestoreFiles.2016-03-29_20.02.56_19
  Restore Client:         dbstore1001.eqiad.wmnet-fd
  Start time:             29-Mar-2016 20:39:41
  End time:               29-Mar-2016 20:42:41
  Files Expected:         10
  Files Restored:         3
  Bytes Restored:         0
  Rate:                   0.0 KB/s
  FD Errors:              3
  FD termination status:  Error
  SD termination status:  Error
  Termination:            *** Restore Error ***

The 3 files created were empty.

Luke081515 added a subscriber: Luke081515.
Dzahn added a subscriber: Dzahn.Mar 29 2016, 10:05 PM

I also tried this multiple times and to different restore clients. Either i got an empty file with that same error about decryption or there would just be no job started at all.

So restoring on different clients in not really possible without some mambo jumbo first. Namely http://www.bacula.org/5.2.x-manuals/en/main/main/Data_Encryption.html#SECTION003730000000000000000

The reason is that we have encryption keys per host (with a master - the puppet ca fallback).

On the topic in hand, I 've failed as well to restore that specific file. I can restore multiple others but not that specific one.

Restricted Application added a subscriber: TerraCodes. · View Herald TranscriptApr 19 2016, 9:41 PM

alex- I think I was able to recover to a different host, but not from that file. But I may be wrong. In any case, the root problem is not that- recovery failed first on the same server, only others were tried after that failed.

I wonder if I "fixed it" by upgrading and restarting the agent on the client server- it seems the backups had "stuck", but subsequent backups run fine (but that means olders were lost). If that is the case, older backups were lost, and we may want to check other datasets (I already checked main databases one on dbstore, they are fine).

Other than that we should close this.

akosiaris closed this task as Resolved.Apr 20 2016, 11:41 AM
akosiaris claimed this task.

alex- I think I was able to recover to a different host, but not from that file.

I find that doubtful.

In any case, the root problem is not that- recovery failed first on the same server, only others were tried after that failed.

True

Other than that we should close this.

OK will do but I 'll file another one since this might have actually unearthed something.