cloudservices: codfw1dev: fix backups
Closed, Resolved · Public

Description

The hosts don't have backups for the couple of stateful datasets they hold (openldap and the pdns MySQL database).

Event Timeline

Change 931558 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] cloudservices: codfw1dev: enable LDAP backups

https://gerrit.wikimedia.org/r/931558

Let's try a recovery to ensure the process works for you.

Change 931558 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] cloudservices: codfw1dev: enable LDAP backups

https://gerrit.wikimedia.org/r/931558

cloudservices2005-dev seems to have run correctly (3 files backed up, 7 MB).

Terminated Jobs:
 JobId  Level      Files    Bytes   Status   Finished        Name 
====================================================================
515892  Full           3    6.901 M  OK       20-Jun-23 10:31 cloudservices2005-dev.wikimedia.org-Monthly-1st-Thu-productionEqiad-openldap

but cloudservices2004 seems to have got stuck; my guess is a network connectivity issue or no client running (?), given the logs (waiting on client / unable to connect).

515891  Full           0         0   Error    20-Jun-23 10:34 cloudservices2004-dev.codfw.wmnet-Monthly-1st-Fri-productionEqiad-openldap
20-Jun 10:31 backup1001.eqiad.wmnet JobId 515891: Start Backup JobId 515891, Job=cloudservices2004-dev.codfw.wmnet-Monthly-1st-Fri-productionEqiad-openldap.2023-06-20_10.31.07_23
20-Jun 10:31 backup1001.eqiad.wmnet JobId 515891: Using Device "FileStorageProductionEqiad" to write.
20-Jun 10:34 backup1001.eqiad.wmnet JobId 515891: Fatal error: bsockcore.c:208 Unable to connect to Client: cloudservices2004-dev.codfw.wmnet-fd on cloudservices2004-dev.codfw.wmnet:9102. ERR=Interrupted system call
20-Jun 10:34 backup1001.eqiad.wmnet JobId 515891: Fatal error: No Job status returned from FD.
20-Jun 10:34 backup1001.eqiad.wmnet JobId 515891: Error: Bacula backup1001.eqiad.wmnet 9.6.7 (10Dec20):
  Build OS:               x86_64-pc-linux-gnu debian bullseye/sid
  JobId:                  515891
  Job:                    cloudservices2004-dev.codfw.wmnet-Monthly-1st-Fri-productionEqiad-openldap.2023-06-20_10.31.07_23
  Backup Level:           Full
  Client:                 "cloudservices2004-dev.codfw.wmnet-fd" 
  FileSet:                "openldap" 2022-09-13 04:05:02
  Pool:                   "productionEqiad" (From Job resource)
  Catalog:                "production" (From Client resource)
  Storage:                "backup1009-FileStorageProductionEqiad" (From Pool resource)
  Scheduled time:         20-Jun-2023 10:30:46
  Start time:             20-Jun-2023 10:31:09
  End time:               20-Jun-2023 10:34:19
  Elapsed time:           3 mins 10 secs
  Priority:               10
  FD Files Written:       0
  SD Files Written:       0
  FD Bytes Written:       0 (0 B)
  SD Bytes Written:       0 (0 B)
  Rate:                   0.0 KB/s
  Software Compression:   None
  Comm Line Compression:  None
  Snapshot/VSS:           no
  Encryption:             no
  Accurate:               no
  Volume name(s):         
  Volume Session Id:      609
  Volume Session Time:    1686902090
  Last Volume Bytes:      282,811,304,024 (282.8 GB)
  Non-fatal FD errors:    1
  SD Errors:              0
  FD termination status:  Error
  SD termination status:  Waiting on FD
  Termination:            *** Backup Error ***

bacula-fd seems to be running on both hosts:

taavi@cloudservices2004-dev ~ $ sudo ss -tulpn|grep bacula
tcp   LISTEN 0      50                 0.0.0.0:9102       0.0.0.0:*    users:(("bacula-fd",pid=1943340,fd=3))

taavi@cloudservices2005-dev ~ $ sudo ss -tulpn|grep bacula
tcp   LISTEN 0      50                 0.0.0.0:9102       0.0.0.0:*    users:(("bacula-fd",pid=2141385,fd=3))

So I'm confused about why we're seeing connection failures only on hosts in the new setup.
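
For what it's worth, the failure can be reproduced independently of bacula with a plain TCP probe from the director (hypothetical session, assuming netcat is available; 9102 is the port from the job log above):

root@backup1001:~$ nc -vz -w 5 cloudservices2004-dev.codfw.wmnet 9102    # failing host
root@backup1001:~$ nc -vz -w 5 cloudservices2005-dev.wikimedia.org 9102  # working host, as a control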

There might indeed be firewalling going on at the CR (core router).

Oh, I think I know what's going on: operations/homer/public.git:policies/cr-labs.pol is missing rules for the return traffic. Does the backup process use any ports other than 9102/tcp?

Change 931576 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/homer/public@master] cr-labs: Permit bacula backup traffic

https://gerrit.wikimedia.org/r/931576

bacula-fd needs to accept connections from the director and to send data to the bacula storage daemon (obviously the return traffic is needed too to establish TCP; I'm just noting the direction in which connections are initiated). This may also be what is happening for https://gerrit.wikimedia.org/r/c/operations/puppet/+/927119/3/modules/profile/files/backup/job_monitoring_ignorelist
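
Summarized, the connections are initiated in these directions (a rough sketch; 9102/9103 are the Bacula defaults, and the storage host in this setup appears to be backup1009 per the job output above, so verify against the actual config):

director (backup1001)  --->  client bacula-fd, tcp/9102    # director tells the file daemon to start the job
client bacula-fd       --->  storage bacula-sd, tcp/9103   # file daemon ships the backup data to the storage daemon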

Most likely yes! This might be the fix for T338132: cloudcontrol: review connectivity with backup system too.

Change 931576 merged by Arturo Borrero Gonzalez:

[operations/homer/public@master] cr-labs: Permit bacula backup traffic

https://gerrit.wikimedia.org/r/931576

Mentioned in SAL (#wikimedia-operations) [2023-06-20T14:36:19Z] <arturo> homer run for CR eqiad/codfw to allow bacula traffic in from cloud-hosts (T338132, T339894)

The connectivity problem should be fixed now!

It is, however, not obvious to me how to do the MySQL backup.

It worked now:

515898  Full          16    54.00 M  OK       20-Jun-23 15:34 cloudcontrol2001-dev.codfw.wmnet-Monthly-1st-Wed-productionEqiad-mysql-srv-backups
515899  Full           3    6.316 M  OK       20-Jun-23 15:35 cloudservices2004-dev.codfw.wmnet-Monthly-1st-Fri-productionEqiad-openldap

Don't worry, if you are interested I can explain it, but you don't need to know how to run it. What you MUST know is how to do a recovery. I can do one for you if you tell me a filesystem path to recover to, separate from production, to test the recovery. I require a recovery test when setting up backups for the first time because it is easy to end up backing up the wrong location or making some other silly mistake. Once that is tested, my monitoring usually keeps the workflow safe.

But you should be able to do a recovery on your own afterwards.
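
For reference, this kind of test restore is typically driven from bconsole on the director; a minimal sketch (the where= scratch path is hypothetical and the exact prompt flow may differ, see the wikitech restore page for the real procedure):

root@backup1001:~$ bconsole
* restore client=cloudservices2004-dev.codfw.wmnet-fd fileset=openldap where=/tmp/restore-test current all done
* yes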

I believe I've done it in the past at least once, following https://wikitech.wikimedia.org/wiki/Bacula#Restore_(aka_Panic_mode). I'd be happy to try again to make sure the procedure is clear (and that we are backing up the expected data).

The cloudservices systems have another MySQL database in them that is currently not being backed up. Could you advise on how to puppetize the backup of that?
I saw other examples that need setting up a particular user/grant for the MySQL dump, but I'm not sure if that is what I need in this scenario.
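
For context, the dedicated dump user mentioned in those examples usually only needs read/lock privileges; a hedged sketch of what such a grant could look like (user name, password and privilege list are illustrative, not the ones from the actual puppet module):

# run on the database host; adjust to whatever the puppetized role ends up defining
sudo mysql -e "CREATE USER 'dump'@'localhost' IDENTIFIED BY 'REDACTED';"
sudo mysql -e "GRANT SELECT, LOCK TABLES, SHOW VIEW, TRIGGER, EVENT ON *.* TO 'dump'@'localhost';"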

I may not be the best person to answer that question, because the method we use for MediaWiki databases may be over-engineered for your needs.

You can see how it is set up at:

role dbbackups::content (a different role will be needed), plus a template and a YAML config that control its parameters. Documentation is at https://wikitech.wikimedia.org/wiki/MariaDB/Backups#Software_and_deployment_architecture and https://wikitech.wikimedia.org/wiki/Backup-mariadb. It indeed uses a separate user to generate the backup for security reasons (it is not *that* hard to set up).

The advantage is that it is a robust method, production-proven and maintained by myself. It is intended to handle all backup preprocessing before bacula (not only for databases, but also for things like gitlab). The disadvantage is that it is designed for large backups at scale and may have some complexities.

Let me know what you think. If you want to set it up as a (possible) first step towards unifying all db backups in cloud to a common, improved architecture (T284157), I would go for this (and I will support you, no problem). If it is a one-host (or one-role) very small backup system, I'd recommend asking how others are doing it (as opinions vary on "the right way").

If you think that's ok, start by creating a new role inside dbbackups (e.g. dbbackups::cloud), and we can go from there, discussing on a patch.

Change 931880 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] openstack: pdns: add backup for the database

https://gerrit.wikimedia.org/r/931880

In an IRC conversation @jcrespo proposed a refactor prior to introducing the backup for the PDNS db.

Meanwhile, I've stored a copy of the PDNS database in the openldap backup directory and asked for a force-run of the backup procedure, so we have a copy somewhere today.
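For the record, a stopgap copy of that kind can be as simple as a logical dump written into a directory that the openldap fileset already covers; a sketch of what that could look like (path and database name are assumptions based on this task, not the exact command used):

aborrero@cloudservices2004-dev:~ $ sudo sh -c 'mysqldump pdns | gzip > /var/run/openldap-backup/pdns-manual.sql.gz'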

Change 931940 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] dbbackups: Make mydumper backup template path configurable

https://gerrit.wikimedia.org/r/931940

Change 931940 merged by Jcrespo:

[operations/puppet@production] dbbackups: Make mydumper backup template path configurable

https://gerrit.wikimedia.org/r/931940

I don't know which hosts to run the puppet compiler on, but something like this should work: https://gerrit.wikimedia.org/r/c/operations/puppet/+/931880. You will need to fix the things I don't know, like host names or the backup schedule.

To make things explicit: backups are not yet working; that will require you to merge the above and test it. I am no longer working on this, as you seem to have abandoned work on it / no longer require my help.

Change 931880 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] openstack: pdns: add backup for the database

https://gerrit.wikimedia.org/r/931880

Change 932388 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] openstack: pdns: Change backup user for dump and make statistics configurable

https://gerrit.wikimedia.org/r/932388

Change 932388 merged by Jcrespo:

[operations/puppet@production] dbbackups: Make backups statistics optional

https://gerrit.wikimedia.org/r/932388

Change 932403 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] dbbackups: Fix small logical backup for no-stats-file case

https://gerrit.wikimedia.org/r/932403

Change 932406 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] openstack: pdns: add grants for DB backups

https://gerrit.wikimedia.org/r/932406

Change 932403 merged by Jcrespo:

[operations/puppet@production] dbbackups: Fix small logical backup for no-stats-file case

https://gerrit.wikimedia.org/r/932403

Change 932406 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] openstack: pdns: add grants for DB backups

https://gerrit.wikimedia.org/r/932406

Change 932430 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] pdns_server: db_backups: avoid <<< redirection

https://gerrit.wikimedia.org/r/932430

Change 932430 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] pdns_server: db_backups: avoid <<< redirection

https://gerrit.wikimedia.org/r/932430
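
For context on the <<< change: a here-string is a bash-specific construct that plain /bin/sh does not support, and the usual replacement is a pipe (illustrative commands only, not the actual patch content):

# bash-only here-string:
mysql pdns <<< "SHOW TABLES;"
# portable equivalent:
echo "SHOW TABLES;" | mysql pdns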

The MySQL backups are working, but the LDAP one on cloudservices2005-dev.wikimedia.org started failing on the 22nd (also looks like a network issue):

23-Jun 16:22 backup1001.eqiad.wmnet JobId 516361: Start Backup JobId 516361, Job=cloudservices2005-dev.wikimedia.org-Monthly-1st-Thu-productionEqiad-openldap.2023-06-23_16.22.32_09
23-Jun 16:22 backup1001.eqiad.wmnet JobId 516361: Using Device "FileStorageProductionEqiad" to write.
23-Jun 16:22 backup1001.eqiad.wmnet JobId 516361: Error: bsockcore.c:284 gethostbyname() for host "cloudservices2005-dev.wikimedia.org" failed: ERR=Name or service not known
23-Jun 16:22 backup1001.eqiad.wmnet JobId 516361: Fatal error: No Job status returned from FD.
23-Jun 16:22 backup1001.eqiad.wmnet JobId 516361: Error: Bacula backup1001.eqiad.wmnet 9.6.7 (10Dec20):
  Build OS:               x86_64-pc-linux-gnu debian bullseye/sid
  JobId:                  516361
  Job:                    cloudservices2005-dev.wikimedia.org-Monthly-1st-Thu-productionEqiad-openldap.2023-06-23_16.22.32_09
  Backup Level:           Full
  Client:                 "cloudservices2005-dev.wikimedia.org-fd" 9.6.7 (10Dec20) x86_64-pc-linux-gnu,debian,bullseye/sid
  FileSet:                "openldap" 2022-09-13 04:05:02
  Pool:                   "productionEqiad" (From Job resource)
  Catalog:                "production" (From Client resource)
  Storage:                "backup1009-FileStorageProductionEqiad" (From Pool resource)
  Scheduled time:         23-Jun-2023 16:22:26
  Start time:             23-Jun-2023 16:22:34
  End time:               23-Jun-2023 16:22:34
  Elapsed time:           1 sec
  Priority:               10
  FD Files Written:       0
  SD Files Written:       0
  FD Bytes Written:       0 (0 B)
  SD Bytes Written:       0 (0 B)
  Rate:                   0.0 KB/s
  Software Compression:   None
  Comm Line Compression:  None
  Snapshot/VSS:           no
  Encryption:             no
  Accurate:               no
  Volume name(s):         
  Volume Session Id:      1069
  Volume Session Time:    1686902090
  Last Volume Bytes:      27,569,946,053 (27.56 GB)
  Non-fatal FD errors:    1
  SD Errors:              0
  FD termination status:  Error
  SD termination status:  Waiting on FD
  Termination:            *** Backup Error ***

Compare to cloudservices2004-dev.codfw.wmnet:

root@backup1001:~$ check_bacula.py cloudservices2005-dev.wikimedia.org-Monthly-1st-Thu-productionEqiad-openldap
id: 515892, ts: 2023-06-20 10:31:35, type: F, status: T, bytes: 6901136
id: 515975, ts: 2023-06-21 04:11:33, type: I, status: T, bytes: 2907456
id: 516065, ts: 2023-06-21 10:43:23, type: F, status: T, bytes: 7037776
id: 516125, ts: 2023-06-22 04:11:50, type: I, status: E, bytes: 0
id: 516264, ts: 2023-06-23 04:12:12, type: I, status: E, bytes: 0
id: 516361, ts: 2023-06-23 16:22:34, type: F, status: E, bytes: 0
root@backup1001:~$ check_bacula.py cloudservices2004-dev.codfw.wmnet-Monthly-1st-Fri-productionEqiad-openldap
id: 515891, ts: 2023-06-20 10:31:09, type: F, status: f, bytes: 0
id: 515899, ts: 2023-06-20 15:35:04, type: F, status: T, bytes: 6316016
id: 515974, ts: 2023-06-21 04:11:30, type: I, status: T, bytes: 2907456
id: 516064, ts: 2023-06-21 10:43:15, type: F, status: T, bytes: 6448160
id: 516124, ts: 2023-06-22 04:11:49, type: I, status: T, bytes: 22400016
id: 516263, ts: 2023-06-23 04:12:11, type: I, status: T, bytes: 29138272
✔️

As part of T338779: cloudservices2005-dev: reimage into new network setup the server is now up and running.

When we can confirm all 4 servers (2 in eqiad, 2 in codfw) have backups that can be restored, we can declare the work here completed.

Still alerting:

Stale: 1 (cloudservices2005-dev), No backups: 2 (cloudservices2005-dev, ...),

cloudservices2005-dev was reimaged / put into service yesterday after a week of being offline. Could you please check again?

Still failing:

root@backup1001:~$ check_bacula.py cloudservices2005-dev.wikimedia.org-Monthly-1st-Thu-productionEqiad-openldap
id: 515892, ts: 2023-06-20 10:31:35, type: F, status: T, bytes: 6901136
id: 515975, ts: 2023-06-21 04:11:33, type: I, status: T, bytes: 2907456
id: 516065, ts: 2023-06-21 10:43:23, type: F, status: T, bytes: 7037776
id: 516125, ts: 2023-06-22 04:11:50, type: I, status: E, bytes: 0
id: 516264, ts: 2023-06-23 04:12:12, type: I, status: E, bytes: 0
id: 516361, ts: 2023-06-23 16:22:34, type: F, status: E, bytes: 0
id: 516409, ts: 2023-06-24 04:12:06, type: I, status: E, bytes: 0
id: 516550, ts: 2023-06-25 04:11:21, type: I, status: E, bytes: 0
id: 516693, ts: 2023-06-26 04:11:51, type: I, status: E, bytes: 0
id: 516838, ts: 2023-06-27 04:12:23, type: I, status: E, bytes: 0
✔

The MySQL one works:

$ check_bacula.py cloudservices1005.wikimedia.org-Monthly-1st-Wed-productionEqiad-mysql-srv-backups-dumps-latest
id: 516359, ts: 2023-06-23 16:20:14, type: F, status: T, bytes: 0
id: 516406, ts: 2023-06-24 04:12:05, type: I, status: T, bytes: 122592
id: 516547, ts: 2023-06-25 04:11:20, type: I, status: T, bytes: 122544
id: 516690, ts: 2023-06-26 04:11:50, type: I, status: T, bytes: 122576
id: 516833, ts: 2023-06-27 04:12:16, type: I, status: T, bytes: 123360

The LDAP one says this (the script is failing):

27-Jun 04:05 backup1001.eqiad.wmnet JobId 516837: No prior Full backup Job record found.
27-Jun 04:05 backup1001.eqiad.wmnet JobId 516837: No prior or suitable Full backup found in catalog. Doing FULL backup.
27-Jun 04:12 backup1001.eqiad.wmnet JobId 516837: Start Backup JobId 516837, Job=cloudservices2005-dev.codfw.wmnet-Monthly-1st-Sun-productionEqiad-openldap.2023-06-27_04.05.01_15
27-Jun 04:12 backup1001.eqiad.wmnet JobId 516837: Using Device "FileStorageProductionEqiad" to write.
27-Jun 04:12 cloudservices2005-dev.codfw.wmnet-fd JobId 516837: shell command: run ClientRunBeforeJob "/etc/bacula/scripts/openldap-pre"
27-Jun 04:12 cloudservices2005-dev.codfw.wmnet-fd JobId 516837: ClientRunBeforeJob: 649a61a6 /etc/ldap/slapd.conf: line 78: rootdn is always granted unlimited privileges.
27-Jun 04:12 cloudservices2005-dev.codfw.wmnet-fd JobId 516837: ClientRunBeforeJob: 649a61a6 /etc/ldap/acls.conf: line 7: rootdn is always granted unlimited privileges.
27-Jun 04:12 cloudservices2005-dev.codfw.wmnet-fd JobId 516837: ClientRunBeforeJob: 649a61a6 /etc/ldap/acls.conf: line 30: rootdn is always granted unlimited privileges.
27-Jun 04:12 cloudservices2005-dev.codfw.wmnet-fd JobId 516837: ClientRunBeforeJob: 649a61a6 /etc/ldap/acls.conf: line 41: rootdn is always granted unlimited privileges.
27-Jun 04:12 cloudservices2005-dev.codfw.wmnet-fd JobId 516837: ClientRunBeforeJob: 649a61a6 The first database does not allow slapcat; using the first available one (2)
27-Jun 04:12 cloudservices2005-dev.codfw.wmnet-fd JobId 516837:      Could not stat "/var/lib/ldap/slapd-audit.log": ERR=No such file or directory
27-Jun 04:12 cloudservices2005-dev.codfw.wmnet-fd JobId 516837: shell command: run ClientAfterJob "/etc/bacula/scripts/openldap-post"
27-Jun 04:12 backup1009.eqiad.wmnet-fd JobId 516837: Elapsed time=00:00:01, Transfer rate=1.931 K Bytes/second
27-Jun 04:12 backup1009.eqiad.wmnet-fd JobId 516837: Sending spooled attrs to the Director. Despooling 378 bytes ...
27-Jun 04:12 backup1001.eqiad.wmnet JobId 516837: Bacula backup1001.eqiad.wmnet 9.6.7 (10Dec20):

I see some mentions of cloudservices2005-dev.wikimedia.org, but it should be cloudservices2005-dev.codfw.wmnet instead. The wikimedia.org name is from the old setup that no longer exists.
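
A quick way to confirm which of the two names currently resolves (hypothetical session from the director, assuming the host utility is installed; output not reproduced here):

root@backup1001:~$ host cloudservices2005-dev.wikimedia.org   # old FQDN, expected to no longer resolve
root@backup1001:~$ host cloudservices2005-dev.codfw.wmnet     # new FQDN, expected to resolve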

Were the old puppet facts purged? If not, the resources will still be in puppet and it will still try to back up the old host.

Checking the log, it says Error: bsockcore.c:284 gethostbyname() for host "cloudservices2005-dev.wikimedia.org", which may indicate a DNS issue (aside from the above):

27-Jun 04:12 backup1001.eqiad.wmnet JobId 516838: Start Backup JobId 516838, Job=cloudservices2005-dev.wikimedia.org-Monthly-1st-Thu-productionEqiad-openldap.2023-06-27_04.05.01_16
27-Jun 04:12 backup1001.eqiad.wmnet JobId 516838: Using Device "FileStorageProductionEqiad" to write.
27-Jun 04:12 backup1001.eqiad.wmnet JobId 516838: Error: bsockcore.c:284 gethostbyname() for host "cloudservices2005-dev.wikimedia.org" failed: ERR=Name or service not known
27-Jun 04:12 backup1001.eqiad.wmnet JobId 516838: Fatal error: No Job status returned from FD.
27-Jun 04:12 backup1001.eqiad.wmnet JobId 516838: Error: Bacula backup1001.eqiad.wmnet 9.6.7 (10Dec20):
  Build OS:               x86_64-pc-linux-gnu debian bullseye/sid
  JobId:                  516838
  Job:                    cloudservices2005-dev.wikimedia.org-Monthly-1st-Thu-productionEqiad-openldap.2023-06-27_04.05.01_16
  Backup Level:           Incremental, since=2023-06-21 10:43:23
  Client:                 "cloudservices2005-dev.wikimedia.org-fd" 9.6.7 (10Dec20) x86_64-pc-linux-gnu,debian,bullseye/sid
  FileSet:                "openldap" 2022-09-13 04:05:02
  Pool:                   "productionEqiad" (From Job resource)
  Catalog:                "production" (From Client resource)
  Storage:                "backup1009-FileStorageProductionEqiad" (From Pool resource)
  Scheduled time:         27-Jun-2023 04:05:01
  Start time:             27-Jun-2023 04:12:23
  End time:               27-Jun-2023 04:12:24
  Elapsed time:           1 sec
  Priority:               10
  FD Files Written:       0
  SD Files Written:       0
  FD Bytes Written:       0 (0 B)
  SD Bytes Written:       0 (0 B)
  Rate:                   0.0 KB/s
  Software Compression:   None
  Comm Line Compression:  None
  Snapshot/VSS:           no
  Encryption:             no
  Accurate:               no
  Volume name(s):         
  Volume Session Id:      1544
  Volume Session Time:    1686902090
  Last Volume Bytes:      207,049,984,827 (207.0 GB)

The decommission cookbook may have failed to clean up the bacula configuration for cloudservices2005-dev.wikimedia.org. That FQDN no longer exists.

Please clean it up by hand. Anything (config, etc.) that mentions it is an error and should be removed. The puppet tree (including hiera) should be up to date with the new name.

I just ran this:

aborrero@puppetmaster1001:~ $ sudo puppet node deactivate cloudservices2005-dev.wikimedia.org
Submitted 'deactivate node' for cloudservices2005-dev.wikimedia.org with UUID 011b4166-2242-4e30-997a-c9a676cdcc05
aborrero@puppetmaster1001:~ $ sudo puppet node deactivate cloudservices2004-dev.wikimedia.org
Submitted 'deactivate node' for cloudservices2004-dev.wikimedia.org with UUID 111318d9-5849-4b96-b6ce-876cfdd09d66

Hopefully that will fix things.

I think that worked: all jobs are currently working and none have stale backups (old backups of old hosts are still available, but those are not checked).

[10:48] <icinga-wm> RECOVERY - Backup freshness on backup1001 is OK: Fresh: 132 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring

ok, thanks!

Could you try restoring one file in each host?

While all 4 restores worked ok (at the bacula level), the last one returned a 0-byte file:

✔ root@backup1001:/etc/bacula/jobs.d$ check_bacula.py cloudservices2005-dev.codfw.wmnet-Monthly-1st-Sun-productionEqiad-openldap
id: 516837, ts: 2023-06-27 04:12:23, type: F, status: T, bytes: 1152
id: 516989, ts: 2023-06-28 04:12:30, type: I, status: T, bytes: 1152

The other one is also suspiciously always the same size, which may mean the data has not changed, or that there is another issue:

✔ root@backup1001:/etc/bacula/jobs.d$ check_bacula.py cloudservices2004-dev.codfw.wmnet-Monthly-1st-Fri-productionEqiad-openldap
id: 515891, ts: 2023-06-20 10:31:09, type: F, status: f, bytes: 0
id: 515899, ts: 2023-06-20 15:35:04, type: F, status: T, bytes: 6316016
id: 515974, ts: 2023-06-21 04:11:30, type: I, status: T, bytes: 2907456
id: 516064, ts: 2023-06-21 10:43:15, type: F, status: T, bytes: 6448160
id: 516124, ts: 2023-06-22 04:11:49, type: I, status: T, bytes: 22400016
id: 516263, ts: 2023-06-23 04:12:11, type: I, status: T, bytes: 29138272
id: 516408, ts: 2023-06-24 04:12:12, type: I, status: T, bytes: 2907456
id: 516549, ts: 2023-06-25 04:11:26, type: I, status: T, bytes: 2907456
id: 516692, ts: 2023-06-26 04:11:57, type: I, status: T, bytes: 2907456
id: 516835, ts: 2023-06-27 04:12:22, type: I, status: T, bytes: 2907456
id: 516987, ts: 2023-06-28 04:12:30, type: I, status: T, bytes: 2907456

Please check that you can recover from the files in /tmp and/or request assistance with the MySQL recovery process (it is documented on wikitech).
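
A couple of quick sanity checks one could run on the restored copies (hypothetical commands; the paths match the restore locations shown in the listings below):

aborrero@cloudservices2004-dev:~ $ sudo grep -c '^dn:' /tmp/var/run/openldap-backup/backup.ldif              # count LDIF entries in the restored LDAP dump
aborrero@cloudservices2004-dev:~ $ sudo zcat /tmp/srv/backups/dumps/latest/dump.pdns.*/pdns.records.sql.gz | head   # eyeball the restored pdns records dump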

I see the following on cloudservices2004-dev, which seems fine to me:

aborrero@cloudservices2004-dev:~ $ sudo ls -lh /tmp/srv/backups/dumps/latest/dump.pdns.2023-06-28--00-00-04
total 52K
-rw-r--r-- 1 dump dump  173 Jun 28 00:00 metadata
-rw-r--r-- 1 dump dump  333 Jun 28 00:00 pdns.comments-schema.sql.gz
-rw-r--r-- 1 dump dump  289 Jun 28 00:00 pdns.cryptokeys-schema.sql.gz
-rw-r--r-- 1 dump dump  284 Jun 28 00:00 pdns.domainmetadata-schema.sql.gz
-rw-r--r-- 1 dump dump  337 Jun 28 00:00 pdns.domains-schema.sql.gz
-rw-r--r-- 1 dump dump 1.7K Jun 28 00:00 pdns.domains.sql.gz
-rw-r--r-- 1 dump dump  397 Jun 28 00:00 pdns.records-schema.sql.gz
-rw-r--r-- 1 dump dump 9.2K Jun 28 00:00 pdns.records.sql.gz
-rw-r--r-- 1 dump dump  102 Jun 28 00:00 pdns-schema-create.sql.gz
-rw-r--r-- 1 dump dump  254 Jun 28 00:00 pdns.supermasters-schema.sql.gz
-rw-r--r-- 1 dump dump  295 Jun 28 00:00 pdns.tsigkeys-schema.sql.gz
aborrero@cloudservices2004-dev:~ $ sudo ls -lh /tmp/var/run/openldap-backup/backup.ldif
-rw------- 1 root root 2.8M Jun 28 04:12 /tmp/var/run/openldap-backup/backup.ldif

On cloudservices2005-dev, however, the LDAP backup is indeed empty:

aborrero@cloudservices2005-dev:~ $ sudo ls -lh /tmp/srv/backups/dumps/latest/dump.pdns.2023-06-28--00-00-03
total 52K
-rw-r--r-- 1 dump dump  172 Jun 28 00:00 metadata
-rw-r--r-- 1 dump dump  333 Jun 28 00:00 pdns.comments-schema.sql.gz
-rw-r--r-- 1 dump dump  289 Jun 28 00:00 pdns.cryptokeys-schema.sql.gz
-rw-r--r-- 1 dump dump  284 Jun 28 00:00 pdns.domainmetadata-schema.sql.gz
-rw-r--r-- 1 dump dump  337 Jun 28 00:00 pdns.domains-schema.sql.gz
-rw-r--r-- 1 dump dump 1.7K Jun 28 00:00 pdns.domains.sql.gz
-rw-r--r-- 1 dump dump  398 Jun 28 00:00 pdns.records-schema.sql.gz
-rw-r--r-- 1 dump dump 9.2K Jun 28 00:00 pdns.records.sql.gz
-rw-r--r-- 1 dump dump  102 Jun 28 00:00 pdns-schema-create.sql.gz
-rw-r--r-- 1 dump dump  254 Jun 28 00:00 pdns.supermasters-schema.sql.gz
-rw-r--r-- 1 dump dump  295 Jun 28 00:00 pdns.tsigkeys-schema.sql.gz
aborrero@cloudservices2005-dev:~ $ sudo ls -lh /tmp/var/run/openldap-backup/
total 0
-rw------- 1 root root 0 Jun 28 04:12 backup.ldif

I just checked and the LDAP directory is indeed empty there. That's a different problem; the backup itself seems to be performing just fine.
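
For completeness, one way to check whether the local directory actually contains entries (hypothetical command; -n 2 follows the "first available database (2)" hint in the openldap-pre output above):

aborrero@cloudservices2005-dev:~ $ sudo slapcat -n 2 | grep -c '^dn:'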

aborrero claimed this task.
aborrero updated the task description.

This can be considered resolved, thanks!

jcrespo reopened this task as Open. (Edited Sep 20 2023, 7:43 AM)

root@backup1001:~$ check_bacula.py cloudservices2004-dev.codfw.wmnet-Monthly-1st-Fri-productionEqiad-openldap
...
id: 529431, ts: 2023-09-18 04:08:44, type: I, status: T, bytes: 2908304
id: 529581, ts: 2023-09-19 04:39:30, type: I, status: f, bytes: 0
id: 529749, ts: 2023-09-20 04:13:49, type: I, status: f, bytes: 0
✔

Backups started failing 2 days ago, FYI @aborrero.

Change 959149 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] bacula: Add cloudservices2004-dev (openldap) to the monitoring ignoring

https://gerrit.wikimedia.org/r/959149

Change 959149 merged by Jcrespo:

[operations/puppet@production] bacula: Add cloudservices2004-dev (openldap) to the monitoring ignoring

https://gerrit.wikimedia.org/r/959149

@jcrespo backups in cloudservices2004-dev are looking good now:

fnegri@backup1001:~$ sudo check_bacula.py cloudservices2004-dev.codfw.wmnet-Monthly-1st-Fri-productionEqiad-openldap

[...]

id: 531036, ts: 2023-09-29 04:11:42, type: I, status: T, bytes: 2908032
id: 531181, ts: 2023-09-30 04:11:47, type: I, status: T, bytes: 2908032
id: 531344, ts: 2023-10-01 09:47:56, type: I, status: T, bytes: 2908032
id: 531503, ts: 2023-10-02 04:38:07, type: I, status: T, bytes: 2908032

Can we revert the patch https://gerrit.wikimedia.org/r/959149?

Change 962207 had a related patch set uploaded (by FNegri; author: FNegri):

[operations/puppet@production] Revert "bacula: Add cloudservices2004-dev (openldap) to the monitoring ignoring"

https://gerrit.wikimedia.org/r/962207

Change 962207 merged by FNegri:

[operations/puppet@production] Revert "bacula: Add cloudservices2004-dev (openldap) to the monitoring ignoring"

https://gerrit.wikimedia.org/r/962207