After starting to implement backup monitoring, we discovered several hosts whose backups have recently started failing. Most of them seem to have been recently upgraded to buster.
The error is as follows:
11-Oct 04:41 helium.eqiad.wmnet JobId 156612: Start Backup JobId 156612, Job=puppetmaster1001.eqiad.wmnet-Monthly-1st-Sat-production-var-lib-puppet-volatile.2019-10-11_04.05.02_37
11-Oct 04:41 helium.eqiad.wmnet JobId 156612: Using Device "FileStorage1" to write.
11-Oct 04:41 helium.eqiad.wmnet-fd JobId 156612: Fatal error: bsock.c:569 Packet size=1073742965 too big from "client:10.64.16.73:9103. Terminating connection.
11-Oct 04:41 helium.eqiad.wmnet-fd JobId 156612: Elapsed time=00:00:01, Transfer rate=127 Bytes/second
11-Oct 04:41 helium.eqiad.wmnet JobId 156612: Fatal error: bsock.c:569 Packet size=1073742002 too big from "Client: puppetmaster1001.eqiad.wmnet-fd:puppetmaster1001.eqiad.wmnet:9102. Terminating connection.
11-Oct 04:41 helium.eqiad.wmnet JobId 156612: Fatal error: No Job status returned from FD.
11-Oct 04:41 helium.eqiad.wmnet JobId 156612: Error: Bacula helium.eqiad.wmnet 7.4.3 (18Jun16):
  Build OS: x86_64-pc-linux-gnu debian 8.5
  JobId: 156612
  Job: puppetmaster1001.eqiad.wmnet-Monthly-1st-Sat-production-var-lib-puppet-volatile.2019-10-11_04.05.02_37
  Backup Level: Incremental, since=2019-10-10 04:33:01
  Client: "puppetmaster1001.eqiad.wmnet-fd" 9.4.2 (04Feb19) x86_64-pc-linux-gnu,debian,buster/sid
  FileSet: "var-lib-puppet-volatile" 2014-01-23 04:05:00
  Pool: "production" (From Job resource)
  Catalog: "production" (From Client resource)
  Storage: "helium-FileStorage1" (From Pool resource)
  Scheduled time: 11-Oct-2019 04:05:02
  Start time: 11-Oct-2019 04:41:22
  End time: 11-Oct-2019 04:41:22
  Elapsed time: 0 secs
  Priority: 10
  FD Files Written: 0
  SD Files Written: 1
  FD Bytes Written: 0 (0 B)
  SD Bytes Written: 127 (127 B)
  Rate: 0.0 KB/s
  Software Compression: None
  Snapshot/VSS: no
  Encryption: no
  Accurate: no
  Volume name(s):
  Volume Session Id: 3683
  Volume Session Time: 1568135667
  Last Volume Bytes: 173,614,484,663 (173.6 GB)
  Non-fatal FD errors: 1
  SD Errors: 1
  FD termination status: Error
  SD termination status: Error
  Termination: *** Backup Error ***

11-Oct 04:41 helium.eqiad.wmnet JobId 156613: Start Backup JobId 156613, Job=puppetmaster2001.codfw.wmnet-Monthly-1st-Fri-production-var-lib-puppet-ssl.2019-10-11_04.05.02_38
11-Oct 04:41 helium.eqiad.wmnet JobId 156613: Using Device "FileStorage1" to write.
11-Oct 04:41 helium.eqiad.wmnet-fd JobId 156613: Fatal error: bsock.c:569 Packet size=1073742965 too big from "client:10.192.0.27:9103. Terminating connection.
11-Oct 04:41 helium.eqiad.wmnet-fd JobId 156613: Elapsed time=00:00:01, Transfer rate=131 Bytes/second
11-Oct 04:41 helium.eqiad.wmnet JobId 156613: Error: getmsg.c:185 Malformed message: Jmsg JobId=156613 type=4 level=1570768884 puppetmaster2001.codfw.wmnet-fd JobId 156613: Error: bsock.c:388 Wrote 10 bytes to Storage daemon:helium.eqiad.wmnet:9103, but only 0 accepted.
11-Oct 04:41 helium.eqiad.wmnet JobId 156613: Error: getmsg.c:185 Malformed message: Jmsg JobId=156613 type=4 level=1570768884 puppetmaster2001.codfw.wmnet-fd JobId 156613: Error: bsock.c:271 Socket has errors=1 on call to Storage daemon:helium.eqiad.wmnet:9103
11-Oct 04:41 helium.eqiad.wmnet JobId 156613: Error: getmsg.c:185 Malformed message: Jmsg JobId=156613 type=4 level=1570768884 puppetmaster2001.codfw.wmnet-fd JobId 156613: Error: bsock.c:271 Socket has errors=1 on call to Storage daemon:helium.eqiad.wmnet:9103
11-Oct 04:41 helium.eqiad.wmnet JobId 156613: Error: getmsg.c:185 Malformed message: Jmsg JobId=156613 type=4 level=1570768884 puppetmaster2001.codfw.wmnet-fd JobId 156613: Error: bsock.c:271 Socket has errors=1 on call to Storage daemon:helium.eqiad.wmnet:9103
11-Oct 04:41 helium.eqiad.wmnet JobId 156613: Error: getmsg.c:185 Malformed message: Jmsg JobId=156613 type=4 level=1570768884 puppetmaster2001.codfw.wmnet-fd JobId 156613: Error: bsock.c:271 Socket has errors=1 on call to Storage daemon:helium.eqiad.wmnet:9103
11-Oct 04:41 helium.eqiad.wmnet JobId 156613: Error: getmsg.c:185 Malformed message: Jmsg JobId=156613 type=3 level=1570768884 puppetmaster2001.codfw.wmnet-fd JobId 156613: Fatal error: backup.c:929 Network send error to SD. Data=?^<EC><CF>ZZ <98>3=<FB>^Rg/^WN^A ERR=Connection reset by peer
11-Oct 04:41 helium.eqiad.wmnet JobId 156613: Error: getmsg.c:185 Malformed message: Jmsg JobId=156613 type=4 level=1570768884 puppetmaster2001.codfw.wmnet-fd JobId 156613: Error: bsock.c:271 Socket has errors=1 on call to Storage daemon:helium.eqiad.wmnet:9103
11-Oct 04:41 helium.eqiad.wmnet JobId 156613: Fatal error: bsock.c:569 Packet size=1073741943 too big from "Client: puppetmaster2001.codfw.wmnet-fd:puppetmaster2001.codfw.wmnet:9102. Terminating connection.
11-Oct 04:41 helium.eqiad.wmnet JobId 156613: Fatal error: No Job status returned from FD.
11-Oct 04:41 helium.eqiad.wmnet JobId 156613: Error: Bacula helium.eqiad.wmnet 7.4.3 (18Jun16):
  Build OS: x86_64-pc-linux-gnu debian 8.5
  JobId: 156613
  Job: puppetmaster2001.codfw.wmnet-Monthly-1st-Fri-production-var-lib-puppet-ssl.2019-10-11_04.05.02_38
  Backup Level: Incremental, since=2019-10-09 07:39:36
  Client: "puppetmaster2001.codfw.wmnet-fd" 9.4.2 (04Feb19) x86_64-pc-linux-gnu,debian,buster/sid
  FileSet: "var-lib-puppet-ssl" 2018-05-18 03:05:00
  Pool: "production" (From Job resource)
  Catalog: "production" (From Client resource)
  Storage: "helium-FileStorage1" (From Pool resource)
  Scheduled time: 11-Oct-2019 04:05:02
  Start time: 11-Oct-2019 04:41:24
  End time: 11-Oct-2019 04:41:24
  Elapsed time: 0 secs
  Priority: 10
  FD Files Written: 0
  SD Files Written: 1
  FD Bytes Written: 0 (0 B)
  SD Bytes Written: 131 (131 B)
  Rate: 0.0 KB/s
  Software Compression: None
  Snapshot/VSS: no
  Encryption: no
  Accurate: no
  Volume name(s):
  Volume Session Id: 3684
  Volume Session Time: 1568135667
  Last Volume Bytes: 173,812,149,427 (173.8 GB)
  Non-fatal FD errors: 7
  SD Errors: 1
  FD termination status: Error
  SD termination status: Error
  Termination: *** Backup Error ***

11-Oct 04:41 helium.eqiad.wmnet JobId 156614: Start Backup JobId 156614, Job=puppetmaster2001.codfw.wmnet-Monthly-1st-Fri-production-var-lib-puppet-volatile.2019-10-11_04.05.02_39
11-Oct 04:41 helium.eqiad.wmnet JobId 156614: Using Device "FileStorage1" to write.
11-Oct 04:41 helium.eqiad.wmnet-fd JobId 156614: Fatal error: bsock.c:569 Packet size=1073742965 too big from "client:10.192.0.27:9103. Terminating connection.
11-Oct 04:41 helium.eqiad.wmnet JobId 156614: Error: getmsg.c:185 Malformed message: Jmsg JobId=156614 type=4 level=1570768887 puppetmaster2001.codfw.wmnet-fd JobId 156614: Error: bsock.c:388 Wrote 65540 bytes to Storage daemon:helium.eqiad.wmnet:9103, but only 16384 accepted.
11-Oct 04:41 helium.eqiad.wmnet-fd JobId 156614: Elapsed time=00:00:01, Transfer rate=110 Bytes/second
11-Oct 04:41 helium.eqiad.wmnet JobId 156614: Error: getmsg.c:185 Malformed message: Jmsg JobId=156614 type=3 level=1570768887 puppetmaster2001.codfw.wmnet-fd JobId 156614: Fatal error: backup.c:843 Network send error to SD. ERR=Connection reset by peer
11-Oct 04:41 helium.eqiad.wmnet JobId 156614: Error: getmsg.c:185 Malformed message: Jmsg JobId=156614 type=4 level=1570768887 puppetmaster2001.codfw.wmnet-fd JobId 156614: Error: bsock.c:271 Socket has errors=1 on call to Storage daemon:helium.eqiad.wmnet:9103
11-Oct 04:41 helium.eqiad.wmnet JobId 156614: Fatal error: bsock.c:569 Packet size=1073741943 too big from "Client: puppetmaster2001.codfw.wmnet-fd:puppetmaster2001.codfw.wmnet:9102. Terminating connection.
11-Oct 04:41 helium.eqiad.wmnet JobId 156614: Fatal error: No Job status returned from FD.
11-Oct 04:41 helium.eqiad.wmnet JobId 156614: Error: Bacula helium.eqiad.wmnet 7.4.3 (18Jun16):
Note the Fatal error: bsock.c:569 Packet size=1073741943 too big from "Client: puppetmaster2001.codfw.wmnet-fd:puppetmaster2001.codfw.wmnet:9102. Terminating connection.
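The working theory is that the buster upgrade made the client incompatible with the servers, either in configuration or in protocol: the failing clients report Bacula 9.4.2, while the director is still on 7.4.3. Happily, we are just about to upgrade the director and storage daemons to buster too.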
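
As a small check on the protocol side of that theory, the rejected packet sizes can be decomposed arithmetically. The sketch below only uses the three sizes that appear in the log; the helper name `split_high_bit` is made up for illustration. All three values turn out to be 2^30 plus a small number, which would be consistent with the newer client setting a flag in the high bits of the length header that the older daemons read as part of the length. That interpretation is an assumption and has not been checked against the Bacula source.

```python
# Packet sizes copied from the "too big" errors in the log.
PACKET_SIZES = [1073742965, 1073742002, 1073741943]


def split_high_bit(size, bit=30):
    """Return (bit_is_set, remainder) after masking off the given bit.

    Treating bit 30 as a flag is a hypothesis for this ticket,
    not a confirmed detail of the Bacula wire protocol.
    """
    mask = 1 << bit
    return bool(size & mask), size & ~mask


if __name__ == "__main__":
    for size in PACKET_SIZES:
        is_set, rest = split_high_bit(size)
        print(f"{size} = 0x{size:08x}: bit 30 set={is_set}, remainder={rest}")
```

If the remainders (a few hundred bytes up to about 1 KB) are the real payload lengths, they would be plausible sizes for the first messages of a job, which fits the jobs dying within the first second.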