labtestpuppetmaster2001 is failing to backup
Closed, ResolvedPublic

Description

The following jobs are failing to run:

labtestpuppetmaster2001.wikimedia.org-Monthly-1st-Sat-production-var-lib-puppet-ssl
labtestpuppetmaster2001.wikimedia.org-Monthly-1st-Sat-production-var-lib-puppet-volatile

There is a connection (TLS?) failure:

01-Jul 05:11 backup1001.eqiad.wmnet JobId 240917: Start Backup JobId 240917, Job=labtestpuppetmaster2001.wikimedia.org-Monthly-1st-Sat-production-var-lib-puppet-ssl.2020-07-01_04.05.01_42
01-Jul 05:11 backup1001.eqiad.wmnet JobId 240917: Using Device "FileStorageProduction" to write.
01-Jul 05:11 backup1001.eqiad.wmnet JobId 240917: Error: openssl.c:68 Connect failure: ERR=error:14094415:SSL routines:ssl3_read_bytes:sslv3 alert certificate expired
01-Jul 05:11 backup1001.eqiad.wmnet JobId 240917: Fatal error: TLS negotiation failed with FD at "labtestpuppetmaster2001.wikimedia.org:9102".
01-Jul 05:11 backup1001.eqiad.wmnet JobId 240917: Error: getmsg.c:209 Malformed message: authenticate.c:113 TLS negotiation failed.

01-Jul 05:11 backup1001.eqiad.wmnet JobId 240917: Warning: Unexpected Client Job message: 2999 Authentication failed.

01-Jul 05:11 backup1001.eqiad.wmnet JobId 240917: Fatal error: No Job status returned from FD.
01-Jul 05:11 backup1001.eqiad.wmnet JobId 240917: Error: Bacula backup1001.eqiad.wmnet 9.4.2 (04Feb19):
...
  Catalog:                "production" (From Client resource)
  Storage:                "backup1001-FileStorageProduction" (From Pool resource)
  Scheduled time:         01-Jul-2020 04:05:01
  Start time:             01-Jul-2020 05:11:52
  End time:               01-Jul-2020 05:11:53
  Elapsed time:           1 sec
  Priority:               10
  FD Files Written:       0
  SD Files Written:       0
  FD Bytes Written:       0 (0 B)
  SD Bytes Written:       0 (0 B)
  Rate:                   0.0 KB/s
  Software Compression:   None
  Comm Line Compression:  None
  Snapshot/VSS:           no
  Encryption:             no
  Accurate:               no
  Non-fatal FD errors:    2
  SD Errors:              0
  FD termination status:  Error
  SD termination status:  Waiting on FD
  Termination:            *** Backup Error ***
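
For triaging similar failures, a minimal sketch of how the TLS errors in a log like the one above could be summarized per job. The function name `summarize_tls_failures` and the two regular expressions are illustrative assumptions, not part of Bacula or of the existing monitoring; the sample lines are copied from this job log with the wrapped lines rejoined.

```python
import re

# Sample lines copied from the job log in this task (wrapped lines rejoined).
SAMPLE_LOG = """\
01-Jul 05:11 backup1001.eqiad.wmnet JobId 240917: Error: openssl.c:68 Connect failure: ERR=error:14094415:SSL routines:ssl3_read_bytes:sslv3 alert certificate expired
01-Jul 05:11 backup1001.eqiad.wmnet JobId 240917: Fatal error: TLS negotiation failed with FD at "labtestpuppetmaster2001.wikimedia.org:9102".
01-Jul 05:11 backup1001.eqiad.wmnet JobId 240917: Fatal error: No Job status returned from FD.
"""

# Hypothetical patterns, written against the message shapes seen above.
JOB_RE = re.compile(r"JobId (\d+): ")
FD_RE = re.compile(r'TLS negotiation failed with FD at "([^"]+)"')


def summarize_tls_failures(log_text):
    """Map each JobId with a TLS-related error to what the log says about it."""
    jobs = {}
    for line in log_text.splitlines():
        job = JOB_RE.search(line)
        if not job:
            continue  # not a per-job message line
        expired = "alert certificate expired" in line
        fd = FD_RE.search(line)
        if not (expired or fd):
            continue  # job message, but not TLS-related
        entry = jobs.setdefault(
            int(job.group(1)), {"cert_expired": False, "fd": None}
        )
        if expired:
            entry["cert_expired"] = True
        if fd:
            entry["fd"] = fd.group(1)
    return jobs


if __name__ == "__main__":
    print(summarize_tls_failures(SAMPLE_LOG))
```

For the sample lines this reports job 240917 with an expired-certificate alert and the file daemon address `labtestpuppetmaster2001.wikimedia.org:9102`, which points at the certificate rather than at network connectivity.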

IMPORTANT: Do not close until the 2 relevant lines in the ignorelist https://gerrit.wikimedia.org/r/c/operations/puppet/+/612167/6/modules/profile/files/backup/job_monitoring_ignorelist are removed.

Event Timeline

Restricted Application added a subscriber: Aklapper.
Marostegui triaged this task as Medium priority. Jul 6 2020, 4:55 AM
Marostegui moved this task from Triage to Backlog on the DBA board.

Change 612167 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] bacula: Add ignorelist for long-term broken backups

https://gerrit.wikimedia.org/r/612167

Change 612167 merged by Jcrespo:
[operations/puppet@production] bacula: Add ignorelist for long-term broken backups

https://gerrit.wikimedia.org/r/612167

I have added a rule to ignore labtestpuppetmaster2001 backup monitoring:
https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/1cd5aee3ff46cda2a1a5396266c24c51dc0ec2b5/modules/profile/files/backup/job_monitoring_ignorelist

The rule should be deleted before closing this ticket (after backups have been removed from configuration or they work again).

Hey, cloud people! :-D

Could I get an ack of this ticket from someone at cloud (or a redirection to the right server owners)? I don't particularly need this fixed (it is not blocking me in any way, so it could be low priority on your side if needed), but I want to make sure you are explicitly aware that backups are not running for (what I think is) one of your servers/services, before it is too late and permanent data loss occurs.

Sorry about the slow response! Since you first opened this ticket I've moved all VMs off this puppetmaster; it's now set to role::spare::system. Anything you can do to take it out of the backup loop is great; there's definitely nothing on there that needs backing up.

Thank you very much @Andrew! Indeed, the backup jobs have been automatically removed, so no further action is needed, except reverting the addition to the alert ignorelist. I will do so and resolve the ticket.

Change 621537 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] Revert labtestpuppetmaster2001 addition to backup check ignorelist

https://gerrit.wikimedia.org/r/621537

Change 621537 merged by Jcrespo:
[operations/puppet@production] Revert labtestpuppetmaster2001 addition to backup check ignorelist

https://gerrit.wikimedia.org/r/621537

jcrespo claimed this task.