Page MenuHomePhabricator

Backups on buster hosts fail to run
Closed, ResolvedPublic

Description

After starting to implement backup monitoring, we discovered several hosts that had failed since recently its backups. Most seem to have been recently upgraded to buster.

Error is as following:

helium:/var/lib/bacula/log
11-Oct 04:41 helium.eqiad.wmnet JobId 156612: Start Backup JobId 156612, Job=puppetmaster1001.eqiad.wmnet-Monthly-1st-Sat-production-var-lib-puppet-volatile.2019-10-11_04.05.02_37
11-Oct 04:41 helium.eqiad.wmnet JobId 156612: Using Device "FileStorage1" to write.
11-Oct 04:41 helium.eqiad.wmnet-fd JobId 156612: Fatal error: bsock.c:569 Packet size=1073742965 too big from "client:10.64.16.73:9103. Terminating connection.
11-Oct 04:41 helium.eqiad.wmnet-fd JobId 156612: Elapsed time=00:00:01, Transfer rate=127  Bytes/second
11-Oct 04:41 helium.eqiad.wmnet JobId 156612: Fatal error: bsock.c:569 Packet size=1073742002 too big from "Client: puppetmaster1001.eqiad.wmnet-fd:puppetmaster1001.eqiad.wmnet:9102. Terminating connection.
11-Oct 04:41 helium.eqiad.wmnet JobId 156612: Fatal error: No Job status returned from FD.
11-Oct 04:41 helium.eqiad.wmnet JobId 156612: Error: Bacula helium.eqiad.wmnet 7.4.3 (18Jun16):
  Build OS:               x86_64-pc-linux-gnu debian 8.5
  JobId:                  156612
  Job:                    puppetmaster1001.eqiad.wmnet-Monthly-1st-Sat-production-var-lib-puppet-volatile.2019-10-11_04.05.02_37
  Backup Level:           Incremental, since=2019-10-10 04:33:01
  Client:                 "puppetmaster1001.eqiad.wmnet-fd" 9.4.2 (04Feb19) x86_64-pc-linux-gnu,debian,buster/sid
  FileSet:                "var-lib-puppet-volatile" 2014-01-23 04:05:00
  Pool:                   "production" (From Job resource)
  Catalog:                "production" (From Client resource)
  Storage:                "helium-FileStorage1" (From Pool resource)
  Scheduled time:         11-Oct-2019 04:05:02
  Start time:             11-Oct-2019 04:41:22
  End time:               11-Oct-2019 04:41:22
  Elapsed time:           0 secs
  Priority:               10
  FD Files Written:       0
  SD Files Written:       1
  FD Bytes Written:       0 (0 B)
  SD Bytes Written:       127 (127 B)
  Rate:                   0.0 KB/s
  Software Compression:   None
  Snapshot/VSS:           no
  Encryption:             no
  Accurate:               no
  Volume name(s):         
  Volume Session Id:      3683
  Volume Session Time:    1568135667
  Last Volume Bytes:      173,614,484,663 (173.6 GB)
  Non-fatal FD errors:    1
  SD Errors:              1
  FD termination status:  Error
  SD termination status:  Error
  Termination:            *** Backup Error ***

11-Oct 04:41 helium.eqiad.wmnet JobId 156613: Start Backup JobId 156613, Job=puppetmaster2001.codfw.wmnet-Monthly-1st-Fri-production-var-lib-puppet-ssl.2019-10-11_04.05.02_38
11-Oct 04:41 helium.eqiad.wmnet JobId 156613: Using Device "FileStorage1" to write.
11-Oct 04:41 helium.eqiad.wmnet-fd JobId 156613: Fatal error: bsock.c:569 Packet size=1073742965 too big from "client:10.192.0.27:9103. Terminating connection.
11-Oct 04:41 helium.eqiad.wmnet-fd JobId 156613: Elapsed time=00:00:01, Transfer rate=131  Bytes/second
11-Oct 04:41 helium.eqiad.wmnet JobId 156613: Error: getmsg.c:185 Malformed message: Jmsg JobId=156613 type=4 level=1570768884 puppetmaster2001.codfw.wmnet-fd JobId 156613: Error: bsock.c:388 Wrote 10 bytes to Storage daemon:helium.eqiad.wmnet:9103, but only 0 accepted.

11-Oct 04:41 helium.eqiad.wmnet JobId 156613: Error: getmsg.c:185 Malformed message: Jmsg JobId=156613 type=4 level=1570768884 puppetmaster2001.codfw.wmnet-fd JobId 156613: Error: bsock.c:271 Socket has errors=1 on call to Storage daemon:helium.eqiad.wmnet:9103

11-Oct 04:41 helium.eqiad.wmnet JobId 156613: Error: getmsg.c:185 Malformed message: Jmsg JobId=156613 type=4 level=1570768884 puppetmaster2001.codfw.wmnet-fd JobId 156613: Error: bsock.c:271 Socket has errors=1 on call to Storage daemon:helium.eqiad.wmnet:9103

11-Oct 04:41 helium.eqiad.wmnet JobId 156613: Error: getmsg.c:185 Malformed message: Jmsg JobId=156613 type=4 level=1570768884 puppetmaster2001.codfw.wmnet-fd JobId 156613: Error: bsock.c:271 Socket has errors=1 on call to Storage daemon:helium.eqiad.wmnet:9103

11-Oct 04:41 helium.eqiad.wmnet JobId 156613: Error: getmsg.c:185 Malformed message: Jmsg JobId=156613 type=4 level=1570768884 puppetmaster2001.codfw.wmnet-fd JobId 156613: Error: bsock.c:271 Socket has errors=1 on call to Storage daemon:helium.eqiad.wmnet:9103

11-Oct 04:41 helium.eqiad.wmnet JobId 156613: Error: getmsg.c:185 Malformed message: Jmsg JobId=156613 type=3 level=1570768884 puppetmaster2001.codfw.wmnet-fd JobId 156613: Fatal error: backup.c:929 Network send error to SD. Data=?^<EC><CF>ZZ      <98>3=<FB>^Rg/^WN^A ERR=Connection reset by peer

11-Oct 04:41 helium.eqiad.wmnet JobId 156613: Error: getmsg.c:185 Malformed message: Jmsg JobId=156613 type=4 level=1570768884 puppetmaster2001.codfw.wmnet-fd JobId 156613: Error: bsock.c:271 Socket has errors=1 on call to Storage daemon:helium.eqiad.wmnet:9103

11-Oct 04:41 helium.eqiad.wmnet JobId 156613: Fatal error: bsock.c:569 Packet size=1073741943 too big from "Client: puppetmaster2001.codfw.wmnet-fd:puppetmaster2001.codfw.wmnet:9102. Terminating connection.
11-Oct 04:41 helium.eqiad.wmnet JobId 156613: Fatal error: No Job status returned from FD.

11-Oct 04:41 helium.eqiad.wmnet JobId 156613: Error: Bacula helium.eqiad.wmnet 7.4.3 (18Jun16):
  Build OS:               x86_64-pc-linux-gnu debian 8.5
  JobId:                  156613
  Job:                    puppetmaster2001.codfw.wmnet-Monthly-1st-Fri-production-var-lib-puppet-ssl.2019-10-11_04.05.02_38
  Backup Level:           Incremental, since=2019-10-09 07:39:36
  Client:                 "puppetmaster2001.codfw.wmnet-fd" 9.4.2 (04Feb19) x86_64-pc-linux-gnu,debian,buster/sid
  FileSet:                "var-lib-puppet-ssl" 2018-05-18 03:05:00
  Pool:                   "production" (From Job resource)
  Catalog:                "production" (From Client resource)
  Storage:                "helium-FileStorage1" (From Pool resource)
  Scheduled time:         11-Oct-2019 04:05:02
  Start time:             11-Oct-2019 04:41:24
  End time:               11-Oct-2019 04:41:24
  Elapsed time:           0 secs
  Priority:               10
  FD Files Written:       0
  SD Files Written:       1
  FD Bytes Written:       0 (0 B)
  SD Bytes Written:       131 (131 B)
  Rate:                   0.0 KB/s
  Software Compression:   None
  Snapshot/VSS:           no
  Encryption:             no
  Accurate:               no
  Volume name(s):         
  Volume Session Id:      3684
  Volume Session Time:    1568135667
  Last Volume Bytes:      173,812,149,427 (173.8 GB)
  Non-fatal FD errors:    7
  SD Errors:              1
  FD termination status:  Error
  SD termination status:  Error
  Termination:            *** Backup Error ***

11-Oct 04:41 helium.eqiad.wmnet JobId 156614: Start Backup JobId 156614, Job=puppetmaster2001.codfw.wmnet-Monthly-1st-Fri-production-var-lib-puppet-volatile.2019-10-11_04.05.02_39
11-Oct 04:41 helium.eqiad.wmnet JobId 156614: Using Device "FileStorage1" to write.
11-Oct 04:41 helium.eqiad.wmnet-fd JobId 156614: Fatal error: bsock.c:569 Packet size=1073742965 too big from "client:10.192.0.27:9103. Terminating connection.
11-Oct 04:41 helium.eqiad.wmnet JobId 156614: Error: getmsg.c:185 Malformed message: Jmsg JobId=156614 type=4 level=1570768887 puppetmaster2001.codfw.wmnet-fd JobId 156614: Error: bsock.c:388 Wrote 65540 bytes to Storage daemon:helium.eqiad.wmnet:9103, but only 16384 accepted.

11-Oct 04:41 helium.eqiad.wmnet-fd JobId 156614: Elapsed time=00:00:01, Transfer rate=110  Bytes/second
11-Oct 04:41 helium.eqiad.wmnet JobId 156614: Error: getmsg.c:185 Malformed message: Jmsg JobId=156614 type=3 level=1570768887 puppetmaster2001.codfw.wmnet-fd JobId 156614: Fatal error: backup.c:843 Network send error to SD. ERR=Connection reset by peer

11-Oct 04:41 helium.eqiad.wmnet JobId 156614: Error: getmsg.c:185 Malformed message: Jmsg JobId=156614 type=4 level=1570768887 puppetmaster2001.codfw.wmnet-fd JobId 156614: Error: bsock.c:271 Socket has errors=1 on call to Storage daemon:helium.eqiad.wmnet:9103

11-Oct 04:41 helium.eqiad.wmnet JobId 156614: Fatal error: bsock.c:569 Packet size=1073741943 too big from "Client: puppetmaster2001.codfw.wmnet-fd:puppetmaster2001.codfw.wmnet:9102. Terminating connection.
11-Oct 04:41 helium.eqiad.wmnet JobId 156614: Fatal error: No Job status returned from FD.
11-Oct 04:41 helium.eqiad.wmnet JobId 156614: Error: Bacula helium.eqiad.wmnet 7.4.3 (18Jun16):

Note the Fatal error: bsock.c:569 Packet size=1073741943 too big from "Client: puppetmaster2001.codfw.wmnet-fd:puppetmaster2001.codfw.wmnet:9102. Terminating connection.

Working theory is that the buster upgrade made the client incompatible either on configuration, or on protocol. Happily, we are just about to upgrade the director and sd daemons to buster too.

Event Timeline

jcrespo created this task.Oct 18 2019, 10:08 AM
Restricted Application removed a project: Patch-For-Review. · View Herald TranscriptOct 18 2019, 10:08 AM
Restricted Application added a subscriber: Aklapper. · View Herald Transcript
jcrespo triaged this task as High priority.Oct 18 2019, 10:08 AM
jcrespo updated the task description. (Show Details)
Marostegui moved this task from Triage to In progress on the DBA board.Oct 18 2019, 10:12 AM

Bacula advice on the issue:

<<In connecting to my Client, I get ”ERR:Connection Refused. Packet Size too big from File daemon:192.>>

This is typically a communications error resulting from one of the following:
• Old versions of Bacula, usually a Win32 client, where two threads were using the same I/O packet.
Fixed in more recent versions. Please upgrade.
• Some other program such as an HP Printer using the same port (9102 in this case).
If it is neither of the above, please submit a bug report at bugs.bacula.org.
Another solution might be to run the daemon with the debug option by:
Start a DOS shell Window.
cd c:\bacula\bin
bacula-fd -d100 -c c:\bacula\bin\bacula-fd.conf
This will cause the FD to write a file bacula.trace in the current directory, which you can examine
to determine the problem.

It is not the second case:

# netstat -tlpn
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name    
tcp        0      0 0.0.0.0:9102            0.0.0.0:*               LISTEN      834/bacula-fd 

#iptables -L
Chain INPUT (policy DROP)
target     prot opt source               destination
ACCEPT     tcp  --  helium.eqiad.wmnet   anywhere             tcp dpt:bacula-fd

Most likely an incompatibility problem.

There is mix experiences on compatibility of clients and storage daemons between 5.X and higher: https://serverfault.com/questions/837241/bacula-versions-compatibility

On mailing list, with the same error:

Version mismatch?

Yes.

The rule, which is documented, but I don't recall where, is:

(DIR == SD) >= FD

I see 2 paths forward:

  • Push forward with the migration in order to fix this
  • Backport bacula 9 from the stretch backport to jessie and do the upgrade in place on the old hosts in order to restore functionality.

I did try backporting bacula9 to jessie just to see if it's possible and I do already have debs, albeit not tested.

Change 544665 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] backup: Migrate bacula director from helium to backup1001

https://gerrit.wikimedia.org/r/544665

Change 544665 merged by Jcrespo:
[operations/puppet@production] backup: Migrate bacula director from helium to backup1001

https://gerrit.wikimedia.org/r/544665

Ottomata assigned this task to jcrespo.Oct 29 2019, 3:35 PM
Ottomata added a subscriber: Ottomata.

Jaime, assigning to you, feel free to undo or reassign if this is not correct.

jcrespo lowered the priority of this task from High to Low.Oct 29 2019, 3:46 PM

We believe this is fixed after T236406. Keeping it open until all hosts run at least once.

jcrespo closed this task as Resolved.Nov 5 2019, 10:39 AM