The hosts don't have backups for the couple of stateful datasets they have:
- LDAP database. Done with https://gerrit.wikimedia.org/r/c/931558
- pdns database https://gerrit.wikimedia.org/r/c/operations/puppet/+/931880
Status | Subtype | Assigned | Task
---|---|---|---
Open | None | | T340446 Cloud VPS Designate setup improvements
Resolved | | aborrero | T339894 cloudservices: codfw1dev: fix backups
Change 931558 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):
[operations/puppet@production] cloudservices: codfw1dev: enable LDAP backups
Change 931558 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] cloudservices: codfw1dev: enable LDAP backups
cloudservices2005-dev seems to have run correctly (3 files backed up, 7MB)
Terminated Jobs:
 JobId  Level    Files      Bytes   Status   Finished         Name
====================================================================
515892  Full         3    6.901 M   OK       20-Jun-23 10:31  cloudservices2005-dev.wikimedia.org-Monthly-1st-Thu-productionEqiad-openldap
but cloudservices2004 seems to have gotten stuck; my guess is network connectivity or no client running (?), given the logs (waiting on client / unable to connect).
515891 Full 0 0 Error 20-Jun-23 10:34 cloudservices2004-dev.codfw.wmnet-Monthly-1st-Fri-productionEqiad-openldap
20-Jun 10:31 backup1001.eqiad.wmnet JobId 515891: Start Backup JobId 515891, Job=cloudservices2004-dev.codfw.wmnet-Monthly-1st-Fri-productionEqiad-openldap.2023-06-20_10.31.07_23
20-Jun 10:31 backup1001.eqiad.wmnet JobId 515891: Using Device "FileStorageProductionEqiad" to write.
20-Jun 10:34 backup1001.eqiad.wmnet JobId 515891: Fatal error: bsockcore.c:208 Unable to connect to Client: cloudservices2004-dev.codfw.wmnet-fd on cloudservices2004-dev.codfw.wmnet:9102. ERR=Interrupted system call
20-Jun 10:34 backup1001.eqiad.wmnet JobId 515891: Fatal error: No Job status returned from FD.
20-Jun 10:34 backup1001.eqiad.wmnet JobId 515891: Error: Bacula backup1001.eqiad.wmnet 9.6.7 (10Dec20):
  Build OS:               x86_64-pc-linux-gnu debian bullseye/sid
  JobId:                  515891
  Job:                    cloudservices2004-dev.codfw.wmnet-Monthly-1st-Fri-productionEqiad-openldap.2023-06-20_10.31.07_23
  Backup Level:           Full
  Client:                 "cloudservices2004-dev.codfw.wmnet-fd"
  FileSet:                "openldap" 2022-09-13 04:05:02
  Pool:                   "productionEqiad" (From Job resource)
  Catalog:                "production" (From Client resource)
  Storage:                "backup1009-FileStorageProductionEqiad" (From Pool resource)
  Scheduled time:         20-Jun-2023 10:30:46
  Start time:             20-Jun-2023 10:31:09
  End time:               20-Jun-2023 10:34:19
  Elapsed time:           3 mins 10 secs
  Priority:               10
  FD Files Written:       0
  SD Files Written:       0
  FD Bytes Written:       0 (0 B)
  SD Bytes Written:       0 (0 B)
  Rate:                   0.0 KB/s
  Software Compression:   None
  Comm Line Compression:  None
  Snapshot/VSS:           no
  Encryption:             no
  Accurate:               no
  Volume name(s):
  Volume Session Id:      609
  Volume Session Time:    1686902090
  Last Volume Bytes:      282,811,304,024 (282.8 GB)
  Non-fatal FD errors:    1
  SD Errors:              0
  FD termination status:  Error
  SD termination status:  Waiting on FD
  Termination:            *** Backup Error ***
bacula-fd seems to be running on both hosts:
taavi@cloudservices2004-dev ~ $ sudo ss -tulpn|grep bacula
tcp   LISTEN 0      50      0.0.0.0:9102      0.0.0.0:*    users:(("bacula-fd",pid=1943340,fd=3))
taavi@cloudservices2005-dev ~ $ sudo ss -tulpn|grep bacula
tcp   LISTEN 0      50      0.0.0.0:9102      0.0.0.0:*    users:(("bacula-fd",pid=2141385,fd=3))
So I'm confused as to why we're seeing connection failures only on hosts in the new setup?
Oh, I think I know what's going on: operations/homer/public.git:policies/cr-labs.pol is missing rules for the return traffic. Does the backup process use any ports other than 9102/tcp?
Change 931576 had a related patch set uploaded (by Majavah; author: Majavah):
[operations/homer/public@master] cr-labs: Permit bacula backup traffic
bacula-fd requires accepting connections from the director and sending data to the bacula storage daemon (obviously the return traffic is needed too to establish TCP; I'm just noting the direction in which connections are initiated). This may also be happening for https://gerrit.wikimedia.org/r/c/operations/puppet/+/927119/3/modules/profile/files/backup/job_monitoring_ignorelist
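As a quick way to check the connection directions described above, something like this can be used (a sketch only; 9101/9102/9103 are Bacula's standard director/fd/sd ports, and the host names are just the ones from this task):

```shell
# Probe a TCP port using bash's /dev/tcp; succeeds only if a connection
# can actually be established (so it exercises the return path too).
probe() {
  local host=$1 port=$2
  timeout 3 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null
}

# Director -> client (fd), used when the job starts:
#   probe cloudservices2004-dev.codfw.wmnet 9102
# Client (fd) -> storage daemon (sd), used when data is sent:
#   probe backup1009.eqiad.wmnet 9103
```

If the fd probe succeeds from the director but the sd probe fails from the cloud host, that would match a router policy missing the bacula rules, as suspected above.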
Most likely yes! this might be the fix for T338132: cloudcontrol: review connectivity with backup system too.
Change 931576 merged by Arturo Borrero Gonzalez:
[operations/homer/public@master] cr-labs: Permit bacula backup traffic
Mentioned in SAL (#wikimedia-operations) [2023-06-20T14:36:19Z] <arturo> homer run for CR eqiad/codfw to allow bacula traffic in from cloud-hosts (T338132, T339894)
the connectivity problem should be fixed now!
It is, however, not obvious to me how to do the mysql backup.
It worked now:
515898  Full        16   54.00 M    OK       20-Jun-23 15:34  cloudcontrol2001-dev.codfw.wmnet-Monthly-1st-Wed-productionEqiad-mysql-srv-backups
515899  Full         3    6.316 M   OK       20-Jun-23 15:35  cloudservices2004-dev.codfw.wmnet-Monthly-1st-Fri-productionEqiad-openldap
Don't worry; if you are interested in knowing it, I can explain it, but you don't need to know how to run it. What you MUST know is how to do a recovery. I can do one for you if you tell me a filesystem path (EDIT: not patch) to recover to, separate from production, to test the recovery. I require a recovery test when setting up backups for the first time, because it is easy to end up backing up the wrong location or make some other silly mistake. Once that is tested, my monitoring usually keeps the workflow safe.
But you should be able to do a recovery on your own afterwards.
I believe I've done it in the past at least once, following https://wikitech.wikimedia.org/wiki/Bacula#Restore_(aka_Panic_mode). I'd be happy to try again to make sure the procedure is clear (and that we are backing up the expected data).
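For reference, a test restore of this kind can be scripted along these lines (a sketch following the wikitech page above; the client/fileset names are taken from this task, /tmp/restore-test is an arbitrary scratch path separate from production, and bconsole's restore menu numbering can vary by version, so drive it interactively the first time):

```shell
# Emit a bconsole command sequence for a test restore. "5" selects
# "the most recent backup for a client" in the standard restore menu
# (verify against your bconsole version before piping blindly).
restore_cmds() {
  cat <<'EOF'
restore client=cloudservices2004-dev.codfw.wmnet-fd fileset=openldap where=/tmp/restore-test
5
done
yes
EOF
}

# On the director host (backup1001), something like:
#   restore_cmds | bconsole
```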
The cloudservices systems have another mysql database in them that is currently not being backed up. Could you advise on how to puppetize the backup of that?
I saw other examples that need setting up a particular user/grant for the mysql dump, but I'm not sure if this is what I need in this scenario.
> have another mysql database in them that is currently not being backed up. Could you advice on how to puppetize the backup of that
I may not be the best person to answer that question, because the method we use for MediaWiki databases may be too over-engineered for your needs.
You can see how it is setup at:
role dbbackups::content (a different role will be needed)
and a template and yaml config control its parameters. Documentation is at: https://wikitech.wikimedia.org/wiki/MariaDB/Backups#Software_and_deployment_architecture and https://wikitech.wikimedia.org/wiki/Backup-mariadb It indeed uses a separate user to generate the backup for security reasons (it is not *that* hard to set up).
The advantage is that it is a robust, production-proven method maintained by myself. It is intended to handle all backup preprocessing before bacula (not only for databases, but also for things like gitlab). The disadvantage is that it is designed for large backups at scale and may have some complexities.
Let me know what you think. If you want to set it up as a (possible) first step towards unifying all db backups in cloud under a common, improved architecture (T284157), I would go for this (and I will support you, no problem). If it is a one-host (or one-role) super-small backup system, I'd recommend asking how others are doing it (as we each do what we know as "the right way").
If you think that's ok, start by creating a new role inside dbbackups (e.g. dbbackups::cloud), and we can go from there, discussing on a patch.
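For illustration of the separate dump user mentioned above, the usual shape is something like this (a sketch only: the user name, auth method, and exact privilege list are assumptions, not what production uses; mydumper-style logical dumps typically need roughly these grants):

```shell
# Emit illustrative SQL for a dedicated, read-mostly backup user.
# unix_socket auth is a MariaDB feature that avoids storing a password.
dump_user_grants() {
  cat <<'EOF'
CREATE USER IF NOT EXISTS 'dump'@'localhost' IDENTIFIED VIA unix_socket;
GRANT SELECT, LOCK TABLES, SHOW VIEW, EVENT, TRIGGER, RELOAD ON *.* TO 'dump'@'localhost';
EOF
}

# Applied with e.g.:  dump_user_grants | mysql
```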
Change 931880 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):
[operations/puppet@production] openstack: pdns: add backup for the database
In an IRC conversation @jcrespo proposed a refactor prior to introducing the backup for the PDNS db.
Meanwhile, I've stored a copy of the PDNS database in the openldap backup directory and asked for a force-run of the backup procedure, so we have a copy somewhere today.
Change 931940 had a related patch set uploaded (by Jcrespo; author: Jcrespo):
[operations/puppet@production] dbbackups: Make mydumper backup template path configurable
Change 931940 merged by Jcrespo:
[operations/puppet@production] dbbackups: Make mydumper backup template path configurable
I don't know which hosts to run the puppet compiler on, but something like this should work: https://gerrit.wikimedia.org/r/c/operations/puppet/+/931880 You will need to fix the things I don't know, like host names or backup schedule.
To make things explicit: backups are not yet working; that will require you to merge the above and test it. I am no longer working on this, as you seem to have abandoned work on it / no longer require my help.
Change 931880 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] openstack: pdns: add backup for the database
Change 932388 had a related patch set uploaded (by Jcrespo; author: Jcrespo):
[operations/puppet@production] openstack: pdns: Change backup user for dump and make statistics configurable
Change 932388 merged by Jcrespo:
[operations/puppet@production] dbbackups: Make backups statistics optional
Change 932403 had a related patch set uploaded (by Jcrespo; author: Jcrespo):
[operations/puppet@production] dbbackups: Fix small logical backup for no-stats-file case
Change 932406 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):
[operations/puppet@production] openstack: pdns: add grants for DB backups
Change 932403 merged by Jcrespo:
[operations/puppet@production] dbbackups: Fix small logical backup for no-stats-file case
Change 932406 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] openstack: pdns: add grants for DB backups
Change 932430 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):
[operations/puppet@production] pdns_server: db_backups: avoid <<< redirection
Change 932430 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] pdns_server: db_backups: avoid <<< redirection
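For context on the <<< patch above (my reading, not necessarily the patch author's motivation): here-strings are a bashism, and some shells back them with a temporary file, so a plain pipe is both more portable and keeps the payload off the filesystem. A minimal sketch of the substitution:

```shell
# Instead of the bash-only here-string:
#   mysql pdns <<< "$sql"
# emit the statement on stdout and pipe it, which works in POSIX sh too:
sql_via_pipe() {
  printf '%s\n' "$1"
}

# usage (illustrative; the mysql invocation is an assumption):
#   sql_via_pipe "SELECT count(*) FROM records;" | mysql pdns
```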
The MySQL backups are working, but the LDAP one on cloudservices2005-dev.wikimedia.org started failing on the 22nd (also looks like network issue):
23-Jun 16:22 backup1001.eqiad.wmnet JobId 516361: Start Backup JobId 516361, Job=cloudservices2005-dev.wikimedia.org-Monthly-1st-Thu-productionEqiad-openldap.2023-06-23_16.22.32_09
23-Jun 16:22 backup1001.eqiad.wmnet JobId 516361: Using Device "FileStorageProductionEqiad" to write.
23-Jun 16:22 backup1001.eqiad.wmnet JobId 516361: Error: bsockcore.c:284 gethostbyname() for host "cloudservices2005-dev.wikimedia.org" failed: ERR=Name or service not known
23-Jun 16:22 backup1001.eqiad.wmnet JobId 516361: Fatal error: No Job status returned from FD.
23-Jun 16:22 backup1001.eqiad.wmnet JobId 516361: Error: Bacula backup1001.eqiad.wmnet 9.6.7 (10Dec20):
  Build OS:               x86_64-pc-linux-gnu debian bullseye/sid
  JobId:                  516361
  Job:                    cloudservices2005-dev.wikimedia.org-Monthly-1st-Thu-productionEqiad-openldap.2023-06-23_16.22.32_09
  Backup Level:           Full
  Client:                 "cloudservices2005-dev.wikimedia.org-fd" 9.6.7 (10Dec20) x86_64-pc-linux-gnu,debian,bullseye/sid
  FileSet:                "openldap" 2022-09-13 04:05:02
  Pool:                   "productionEqiad" (From Job resource)
  Catalog:                "production" (From Client resource)
  Storage:                "backup1009-FileStorageProductionEqiad" (From Pool resource)
  Scheduled time:         23-Jun-2023 16:22:26
  Start time:             23-Jun-2023 16:22:34
  End time:               23-Jun-2023 16:22:34
  Elapsed time:           1 sec
  Priority:               10
  FD Files Written:       0
  SD Files Written:       0
  FD Bytes Written:       0 (0 B)
  SD Bytes Written:       0 (0 B)
  Rate:                   0.0 KB/s
  Software Compression:   None
  Comm Line Compression:  None
  Snapshot/VSS:           no
  Encryption:             no
  Accurate:               no
  Volume name(s):
  Volume Session Id:      1069
  Volume Session Time:    1686902090
  Last Volume Bytes:      27,569,946,053 (27.56 GB)
  Non-fatal FD errors:    1
  SD Errors:              0
  FD termination status:  Error
  SD termination status:  Waiting on FD
  Termination:            *** Backup Error ***
Compare to cloudservices2004-dev.codfw.wmnet:
root@backup1001:~$ check_bacula.py cloudservices2005-dev.wikimedia.org-Monthly-1st-Thu-productionEqiad-openldap
id: 515892, ts: 2023-06-20 10:31:35, type: F, status: T, bytes: 6901136
id: 515975, ts: 2023-06-21 04:11:33, type: I, status: T, bytes: 2907456
id: 516065, ts: 2023-06-21 10:43:23, type: F, status: T, bytes: 7037776
id: 516125, ts: 2023-06-22 04:11:50, type: I, status: E, bytes: 0
id: 516264, ts: 2023-06-23 04:12:12, type: I, status: E, bytes: 0
id: 516361, ts: 2023-06-23 16:22:34, type: F, status: E, bytes: 0
root@backup1001:~$ check_bacula.py cloudservices2004-dev.codfw.wmnet-Monthly-1st-Fri-productionEqiad-openldap
id: 515891, ts: 2023-06-20 10:31:09, type: F, status: f, bytes: 0
id: 515899, ts: 2023-06-20 15:35:04, type: F, status: T, bytes: 6316016
id: 515974, ts: 2023-06-21 04:11:30, type: I, status: T, bytes: 2907456
id: 516064, ts: 2023-06-21 10:43:15, type: F, status: T, bytes: 6448160
id: 516124, ts: 2023-06-22 04:11:49, type: I, status: T, bytes: 22400016
id: 516263, ts: 2023-06-23 04:12:11, type: I, status: T, bytes: 29138272
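The per-job lines that check_bacula.py prints can be summarized with a small helper when comparing hosts like this (a sketch; it assumes the exact "status: X," formatting shown above):

```shell
# Count jobs per Bacula status letter (T = terminated OK, E = error,
# f = fatal) from check_bacula.py output on stdin.
tally_statuses() {
  grep -o 'status: .,' | sort | uniq -c
}

# usage:  check_bacula.py <job-name> | tally_statuses
```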
As part of T338779: cloudservices2005-dev: reimage into new network setup the server is now up and running.
When we can confirm all 4 servers (2 in eqiad, 2 in codfw) have backups -- that can be restored -- we can declare the work here completed.
Still alerting:
Stale: 1 (cloudservices2005-dev), No backups: 2 (cloudservices2005-dev, ...),
cloudservices2005-dev was reimaged / put into service yesterday after a week of being offline. Could you please check again?
Still failing:
root@backup1001:~$ check_bacula.py cloudservices2005-dev.wikimedia.org-Monthly-1st-Thu-productionEqiad-openldap
id: 515892, ts: 2023-06-20 10:31:35, type: F, status: T, bytes: 6901136
id: 515975, ts: 2023-06-21 04:11:33, type: I, status: T, bytes: 2907456
id: 516065, ts: 2023-06-21 10:43:23, type: F, status: T, bytes: 7037776
id: 516125, ts: 2023-06-22 04:11:50, type: I, status: E, bytes: 0
id: 516264, ts: 2023-06-23 04:12:12, type: I, status: E, bytes: 0
id: 516361, ts: 2023-06-23 16:22:34, type: F, status: E, bytes: 0
id: 516409, ts: 2023-06-24 04:12:06, type: I, status: E, bytes: 0
id: 516550, ts: 2023-06-25 04:11:21, type: I, status: E, bytes: 0
id: 516693, ts: 2023-06-26 04:11:51, type: I, status: E, bytes: 0
id: 516838, ts: 2023-06-27 04:12:23, type: I, status: E, bytes: 0
The mysql one works:
$ check_bacula.py cloudservices1005.wikimedia.org-Monthly-1st-Wed-productionEqiad-mysql-srv-backups-dumps-latest
id: 516359, ts: 2023-06-23 16:20:14, type: F, status: T, bytes: 0
id: 516406, ts: 2023-06-24 04:12:05, type: I, status: T, bytes: 122592
id: 516547, ts: 2023-06-25 04:11:20, type: I, status: T, bytes: 122544
id: 516690, ts: 2023-06-26 04:11:50, type: I, status: T, bytes: 122576
id: 516833, ts: 2023-06-27 04:12:16, type: I, status: T, bytes: 123360
the ldap one says this (the script is failing):
27-Jun 04:05 backup1001.eqiad.wmnet JobId 516837: No prior Full backup Job record found.
27-Jun 04:05 backup1001.eqiad.wmnet JobId 516837: No prior or suitable Full backup found in catalog. Doing FULL backup.
27-Jun 04:12 backup1001.eqiad.wmnet JobId 516837: Start Backup JobId 516837, Job=cloudservices2005-dev.codfw.wmnet-Monthly-1st-Sun-productionEqiad-openldap.2023-06-27_04.05.01_15
27-Jun 04:12 backup1001.eqiad.wmnet JobId 516837: Using Device "FileStorageProductionEqiad" to write.
27-Jun 04:12 cloudservices2005-dev.codfw.wmnet-fd JobId 516837: shell command: run ClientRunBeforeJob "/etc/bacula/scripts/openldap-pre"
27-Jun 04:12 cloudservices2005-dev.codfw.wmnet-fd JobId 516837: ClientRunBeforeJob: 649a61a6 /etc/ldap/slapd.conf: line 78: rootdn is always granted unlimited privileges.
27-Jun 04:12 cloudservices2005-dev.codfw.wmnet-fd JobId 516837: ClientRunBeforeJob: 649a61a6 /etc/ldap/acls.conf: line 7: rootdn is always granted unlimited privileges.
27-Jun 04:12 cloudservices2005-dev.codfw.wmnet-fd JobId 516837: ClientRunBeforeJob: 649a61a6 /etc/ldap/acls.conf: line 30: rootdn is always granted unlimited privileges.
27-Jun 04:12 cloudservices2005-dev.codfw.wmnet-fd JobId 516837: ClientRunBeforeJob: 649a61a6 /etc/ldap/acls.conf: line 41: rootdn is always granted unlimited privileges.
27-Jun 04:12 cloudservices2005-dev.codfw.wmnet-fd JobId 516837: ClientRunBeforeJob: 649a61a6 The first database does not allow slapcat; using the first available one (2)
27-Jun 04:12 cloudservices2005-dev.codfw.wmnet-fd JobId 516837: Could not stat "/var/lib/ldap/slapd-audit.log": ERR=No such file or directory
27-Jun 04:12 cloudservices2005-dev.codfw.wmnet-fd JobId 516837: shell command: run ClientAfterJob "/etc/bacula/scripts/openldap-post"
27-Jun 04:12 backup1009.eqiad.wmnet-fd JobId 516837: Elapsed time=00:00:01, Transfer rate=1.931 K Bytes/second
27-Jun 04:12 backup1009.eqiad.wmnet-fd JobId 516837: Sending spooled attrs to the Director. Despooling 378 bytes ...
27-Jun 04:12 backup1001.eqiad.wmnet JobId 516837: Bacula backup1001.eqiad.wmnet 9.6.7 (10Dec20):
I see some mentions of cloudservices2005-dev.wikimedia.org, but it should be cloudservices2005-dev.codfw.wmnet instead. The wikimedia.org one is from the old setup that no longer exists.
Were the old puppet facts purged? If not, the resources will still be in puppet and it will still try to back up the old host.
Checking the log, it says 'Error: bsockcore.c:284 gethostbyname() for host "cloudservices2005-dev.wikimedia.org"', which may indicate a DNS issue (aside from the above):
27-Jun 04:12 backup1001.eqiad.wmnet JobId 516838: Start Backup JobId 516838, Job=cloudservices2005-dev.wikimedia.org-Monthly-1st-Thu-productionEqiad-openldap.2023-06-27_04.05.01_16
27-Jun 04:12 backup1001.eqiad.wmnet JobId 516838: Using Device "FileStorageProductionEqiad" to write.
27-Jun 04:12 backup1001.eqiad.wmnet JobId 516838: Error: bsockcore.c:284 gethostbyname() for host "cloudservices2005-dev.wikimedia.org" failed: ERR=Name or service not known
27-Jun 04:12 backup1001.eqiad.wmnet JobId 516838: Fatal error: No Job status returned from FD.
27-Jun 04:12 backup1001.eqiad.wmnet JobId 516838: Error: Bacula backup1001.eqiad.wmnet 9.6.7 (10Dec20):
  Build OS:               x86_64-pc-linux-gnu debian bullseye/sid
  JobId:                  516838
  Job:                    cloudservices2005-dev.wikimedia.org-Monthly-1st-Thu-productionEqiad-openldap.2023-06-27_04.05.01_16
  Backup Level:           Incremental, since=2023-06-21 10:43:23
  Client:                 "cloudservices2005-dev.wikimedia.org-fd" 9.6.7 (10Dec20) x86_64-pc-linux-gnu,debian,bullseye/sid
  FileSet:                "openldap" 2022-09-13 04:05:02
  Pool:                   "productionEqiad" (From Job resource)
  Catalog:                "production" (From Client resource)
  Storage:                "backup1009-FileStorageProductionEqiad" (From Pool resource)
  Scheduled time:         27-Jun-2023 04:05:01
  Start time:             27-Jun-2023 04:12:23
  End time:               27-Jun-2023 04:12:24
  Elapsed time:           1 sec
  Priority:               10
  FD Files Written:       0
  SD Files Written:       0
  FD Bytes Written:       0 (0 B)
  SD Bytes Written:       0 (0 B)
  Rate:                   0.0 KB/s
  Software Compression:   None
  Comm Line Compression:  None
  Snapshot/VSS:           no
  Encryption:             no
  Accurate:               no
  Volume name(s):
  Volume Session Id:      1544
  Volume Session Time:    1686902090
  Last Volume Bytes:      207,049,984,827 (207.0 GB)
The decommission cookbook may have failed to clean up the bacula configuration for cloudservices2005-dev.wikimedia.org. That FQDN no longer exists.
Please clean it up by hand. Anything (config, etc.) that mentions it is an error and should be removed. The puppet tree (including hiera) should be up-to-date with the new name.
I just ran this:
aborrero@puppetmaster1001:~ $ sudo puppet node deactivate cloudservices2005-dev.wikimedia.org
Submitted 'deactivate node' for cloudservices2005-dev.wikimedia.org with UUID 011b4166-2242-4e30-997a-c9a676cdcc05
aborrero@puppetmaster1001:~ $ sudo puppet node deactivate cloudservices2004-dev.wikimedia.org
Submitted 'deactivate node' for cloudservices2004-dev.wikimedia.org with UUID 111318d9-5849-4b96-b6ce-876cfdd09d66
hopefully it'll fix things
I think that worked; all jobs at the moment are working and none have stale backups (old backups of the old hosts are still available, but those are not checked).
[10:48] <icinga-wm> RECOVERY - Backup freshness on backup1001 is OK: Fresh: 132 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
While all 4 restores worked ok (at the bacula level), the last one returned a 0-byte file:
root@backup1001:/etc/bacula/jobs.d$ check_bacula.py cloudservices2005-dev.codfw.wmnet-Monthly-1st-Sun-productionEqiad-openldap
id: 516837, ts: 2023-06-27 04:12:23, type: F, status: T, bytes: 1152
id: 516989, ts: 2023-06-28 04:12:30, type: I, status: T, bytes: 1152
The other one is also suspiciously always the same size, which may mean the data has not changed, or that it has another issue:
root@backup1001:/etc/bacula/jobs.d$ check_bacula.py cloudservices2004-dev.codfw.wmnet-Monthly-1st-Fri-productionEqiad-openldap
id: 515891, ts: 2023-06-20 10:31:09, type: F, status: f, bytes: 0
id: 515899, ts: 2023-06-20 15:35:04, type: F, status: T, bytes: 6316016
id: 515974, ts: 2023-06-21 04:11:30, type: I, status: T, bytes: 2907456
id: 516064, ts: 2023-06-21 10:43:15, type: F, status: T, bytes: 6448160
id: 516124, ts: 2023-06-22 04:11:49, type: I, status: T, bytes: 22400016
id: 516263, ts: 2023-06-23 04:12:11, type: I, status: T, bytes: 29138272
id: 516408, ts: 2023-06-24 04:12:12, type: I, status: T, bytes: 2907456
id: 516549, ts: 2023-06-25 04:11:26, type: I, status: T, bytes: 2907456
id: 516692, ts: 2023-06-26 04:11:57, type: I, status: T, bytes: 2907456
id: 516835, ts: 2023-06-27 04:12:22, type: I, status: T, bytes: 2907456
id: 516987, ts: 2023-06-28 04:12:30, type: I, status: T, bytes: 2907456
Please check you can recover from the files on /tmp and/or request assistance for the mysql recovery process (it is documented on wikitech).
I see the following in cloudservices2004-dev, which seems fine to me:
aborrero@cloudservices2004-dev:~ $ sudo ls -lh /tmp/srv/backups/dumps/latest/dump.pdns.2023-06-28--00-00-04
total 52K
-rw-r--r-- 1 dump dump  173 Jun 28 00:00 metadata
-rw-r--r-- 1 dump dump  333 Jun 28 00:00 pdns.comments-schema.sql.gz
-rw-r--r-- 1 dump dump  289 Jun 28 00:00 pdns.cryptokeys-schema.sql.gz
-rw-r--r-- 1 dump dump  284 Jun 28 00:00 pdns.domainmetadata-schema.sql.gz
-rw-r--r-- 1 dump dump  337 Jun 28 00:00 pdns.domains-schema.sql.gz
-rw-r--r-- 1 dump dump 1.7K Jun 28 00:00 pdns.domains.sql.gz
-rw-r--r-- 1 dump dump  397 Jun 28 00:00 pdns.records-schema.sql.gz
-rw-r--r-- 1 dump dump 9.2K Jun 28 00:00 pdns.records.sql.gz
-rw-r--r-- 1 dump dump  102 Jun 28 00:00 pdns-schema-create.sql.gz
-rw-r--r-- 1 dump dump  254 Jun 28 00:00 pdns.supermasters-schema.sql.gz
-rw-r--r-- 1 dump dump  295 Jun 28 00:00 pdns.tsigkeys-schema.sql.gz
aborrero@cloudservices2004-dev:~ $ sudo ls -lh /tmp/var/run/openldap-backup/backup.ldif
-rw------- 1 root root 2.8M Jun 28 04:12 /tmp/var/run/openldap-backup/backup.ldif
In cloudservices2005-dev however the LDAP backup is indeed empty:
aborrero@cloudservices2005-dev:~ $ sudo ls -lh /tmp/srv/backups/dumps/latest/dump.pdns.2023-06-28--00-00-03
total 52K
-rw-r--r-- 1 dump dump  172 Jun 28 00:00 metadata
-rw-r--r-- 1 dump dump  333 Jun 28 00:00 pdns.comments-schema.sql.gz
-rw-r--r-- 1 dump dump  289 Jun 28 00:00 pdns.cryptokeys-schema.sql.gz
-rw-r--r-- 1 dump dump  284 Jun 28 00:00 pdns.domainmetadata-schema.sql.gz
-rw-r--r-- 1 dump dump  337 Jun 28 00:00 pdns.domains-schema.sql.gz
-rw-r--r-- 1 dump dump 1.7K Jun 28 00:00 pdns.domains.sql.gz
-rw-r--r-- 1 dump dump  398 Jun 28 00:00 pdns.records-schema.sql.gz
-rw-r--r-- 1 dump dump 9.2K Jun 28 00:00 pdns.records.sql.gz
-rw-r--r-- 1 dump dump  102 Jun 28 00:00 pdns-schema-create.sql.gz
-rw-r--r-- 1 dump dump  254 Jun 28 00:00 pdns.supermasters-schema.sql.gz
-rw-r--r-- 1 dump dump  295 Jun 28 00:00 pdns.tsigkeys-schema.sql.gz
aborrero@cloudservices2005-dev:~ $ sudo ls -lh /tmp/var/run/openldap-backup/
total 0
-rw------- 1 root root 0 Jun 28 04:12 backup.ldif
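The zero-byte check done by hand above can be automated with a small helper (a sketch; the path is the one from the listing):

```shell
# Print any zero-byte regular files under the recovery target, so an
# empty artifact like the backup.ldif above is caught immediately.
find_empty() {
  find "$1" -type f -size 0 -print
}

# e.g.  find_empty /tmp/var/run/openldap-backup
# Any output means an empty backup artifact that needs investigating.
```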
I just checked and the LDAP directory is empty there. A different problem. The backup seems to be performing just fine.
root@backup1001:~$ check_bacula.py cloudservices2004-dev.codfw.wmnet-Monthly-1st-Fri-productionEqiad-openldap
...
id: 529431, ts: 2023-09-18 04:08:44, type: I, status: T, bytes: 2908304
id: 529581, ts: 2023-09-19 04:39:30, type: I, status: f, bytes: 0
id: 529749, ts: 2023-09-20 04:13:49, type: I, status: f, bytes: 0
Backups started failing 2 days ago, FYI @aborrero.
Change 959149 had a related patch set uploaded (by Jcrespo; author: Jcrespo):
[operations/puppet@production] bacula: Add cloudservices2004-dev (openldap) to the monitoring ignoring
Change 959149 merged by Jcrespo:
[operations/puppet@production] bacula: Add cloudservices2004-dev (openldap) to the monitoring ignoring
This server is under maintenance by @fnegri, see T345810: [openstack] Upgrade codfw hosts to bookworm.
@jcrespo backups in cloudservices2004-dev are looking good now:
fnegri@backup1001:~$ sudo check_bacula.py cloudservices2004-dev.codfw.wmnet-Monthly-1st-Fri-productionEqiad-openldap
[...]
id: 531036, ts: 2023-09-29 04:11:42, type: I, status: T, bytes: 2908032
id: 531181, ts: 2023-09-30 04:11:47, type: I, status: T, bytes: 2908032
id: 531344, ts: 2023-10-01 09:47:56, type: I, status: T, bytes: 2908032
id: 531503, ts: 2023-10-02 04:38:07, type: I, status: T, bytes: 2908032
Can we revert the patch https://gerrit.wikimedia.org/r/959149?
Change 962207 had a related patch set uploaded (by FNegri; author: FNegri):
[operations/puppet@production] Revert "bacula: Add cloudservices2004-dev (openldap) to the monitoring ignoring"
Change 962207 merged by FNegri:
[operations/puppet@production] Revert "bacula: Add cloudservices2004-dev (openldap) to the monitoring ignoring"
I believe the work on those tickets was done here. CC @fnegri. But please double-check if there are additional dbs that were not part of the migration.