We've just added automatic backups for LFS data in T254155. A first manual run would help us cover the time until the first automatic run on Sunday.
Running:
232157 Back Full 0 0 gerrit1001.wikimedia.org-Hourly-Sun-production-gerrit-repo-data is running
I have a few things I would like to test/ask about this.
- First, I would like to test a recovery, if possible to make sure the backup worked as intended.
- I also have a few questions regarding recovery of Gerrit, in terms of consistency and the disaster recovery model, but I'm not sure whether this task would be the right place for them.
*llist jobid=232157
JobId: 232,157
Job: gerrit1001.wikimedia.org-Hourly-Sun-production-gerrit-repo-data.2020-06-01_16.14.26_23
Name: gerrit1001.wikimedia.org-Hourly-Sun-production-gerrit-repo-data
PurgedFiles: 0
Type: B
Level: F
ClientId: 151
ClientName: gerrit1001.wikimedia.org-fd
JobStatus: T
SchedTime: 2020-06-01 16:14:07
StartTime: 2020-06-01 16:14:26
EndTime: 2020-06-01 16:29:21
RealEndTime: 2020-06-01 16:29:21
JobTDate: 1,591,028,961
VolSessionId: 10,153
VolSessionTime: 1,586,342,727
JobFiles: 147,467
JobBytes: 50,803,269,008
ReadBytes: 50,700,053,821
JobErrors: 0
JobMissingFiles: 0
PoolId: 2
PoolName: production
PriorJobId: 0
FileSetId: 71
FileSet: gerrit-repo-data
HasBase: 0
HasCache: 0
Comment: 01-Jun 16:29 backup1001.eqiad.wmnet JobId 232157: Bacula backup1001.eqiad.wmnet 9.4.2 (04Feb19):
  Build OS:               x86_64-pc-linux-gnu debian buster/sid
  JobId:                  232157
  Job:                    gerrit1001.wikimedia.org-Hourly-Sun-production-gerrit-repo-data.2020-06-01_16.14.26_23
  Backup Level:           Full
  Client:                 "gerrit1001.wikimedia.org-fd" 9.4.2 (04Feb19) x86_64-pc-linux-gnu,debian,buster/sid
  FileSet:                "gerrit-repo-data" 2020-06-01 14:00:00
  Pool:                   "production" (From Job resource)
  Catalog:                "production" (From Client resource)
  Storage:                "backup1001-FileStorageProduction" (From Pool resource)
  Scheduled time:         01-Jun-2020 16:14:07
  Start time:             01-Jun-2020 16:14:26
  End time:               01-Jun-2020 16:29:21
  Elapsed time:           14 mins 55 secs
  Priority:               10
  FD Files Written:       147,467
  SD Files Written:       147,467
  FD Bytes Written:       50,803,269,008 (50.80 GB)
  SD Bytes Written:       50,876,346,668 (50.87 GB)
  Rate:                   56763.4 KB/s
  Software Compression:   None
  Comm Line Compression:  None
  Snapshot/VSS:           no
  Encryption:             yes
  Accurate:               no
  Volume name(s):         production0089
  Volume Session Id:      10153
  Volume Session Time:    1586342727
  Last Volume Bytes:      136,812,896,732 (136.8 GB)
  Non-fatal FD errors:    0
  SD Errors:              0
  FD termination status:  OK
  SD termination status:  OK
  Termination:            Backup OK
The job ran, and it is scheduled to run an incremental automatically every hour: https://grafana.wikimedia.org/d/413r2vbWk/bacula?orgId=1&var-dc=eqiad%20prometheus%2Fops&var-job=gerrit1001.wikimedia.org-Hourly-Sun-production-gerrit-repo-data
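For reference, a weekly-full/hourly-incremental cycle like this one is typically expressed as a Bacula Schedule resource. The sketch below is only illustrative: the resource name, times, and levels are assumptions, not the actual puppet-managed configuration.

```
Schedule {
  Name = "Hourly-Sun"
  # Weekly full on Sunday, incremental at a fixed minute of every hour
  # (times are assumed for illustration)
  Run = Level=Full sun at 2:05
  Run = Level=Incremental hourly at 0:14
}
```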
We should try restoring it; give me a path you have access to where we can do so.
Mentioned in SAL (#wikimedia-operations) [2020-06-01T17:31:25Z] <mutante> backup1001 - queued job 42 - gerrit backup after renaming of the file set and addition of LFS data (T254155, T254162) it is incremental, the full one already ran
@jcrespo Thanks for the initial run!
We should try restoring it; give me a path you have access to where we can do so.
gerrit1002 (gerrit-test) is a bit scarce on disk space, so what about gerrit2001 (gerrit-replica)? On gerrit2001, /srv has >150GB free. That should easily do.
I have to temporarily block gerrit-root members' access to gerrit2001, as I need to use the global key to decrypt backups on a different host than the one they were taken on, and these members have local root on the server; otherwise this would be a security concern.
I have already tried restoring a single random file from the git-lfs directory on gerrit1001 itself and the file is here now:
root@gerrit1001:/var/tmp/bacula-restores/srv/gerrit/plugins/lfs/ff/00# file ff002a0e6e173a695d6e51f8d2941eebd14123aa17d673dbd2124746ee182e20
ff002a0e6e173a695d6e51f8d2941eebd14123aa17d673dbd2124746ee182e20: data
Would this be sufficient?
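The spot-check above can be made a bit stronger: a git-lfs object's filename is its SHA-256 OID, so a restored object can be verified by simply re-hashing it. A minimal sketch (the helper name and the demo file are mine, not from this task):

```shell
#!/bin/sh
# Verify that a git-lfs object file matches the SHA-256 OID in its name
# (LFS stores objects under paths like .../ff/00/ff00<60 more hex chars>).
verify_lfs_object() {
    want=$(basename "$1")
    got=$(sha256sum "$1" | cut -d' ' -f1)
    [ "$got" = "$want" ]
}

# Demo on a throwaway file named after its own hash:
tmpdir=$(mktemp -d)
printf 'hello' > "$tmpdir/payload"
oid=$(sha256sum "$tmpdir/payload" | cut -d' ' -f1)
mv "$tmpdir/payload" "$tmpdir/$oid"
if verify_lfs_object "$tmpdir/$oid"; then echo "OK $oid"; else echo "CORRUPT $oid"; fi
rm -rf "$tmpdir"
```

Run over the whole restored lfs tree with find, this catches any object that was truncated or corrupted in flight.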
Mentioned in SAL (#wikimedia-operations) [2020-06-02T10:12:08Z] <jynus> disable non-global root login to gerrit2001 T254162
Also not a problem. I don't think anyone needs shell access right now, especially since it's just a temporary thing.
I have scheduled the restore. If this were an emergency, I would kill the ongoing backup jobs and the restore would run immediately, but because this is a test, I will let the large ongoing backup process (it is currently doing a full backup of Phabricator) finish before the restore starts, which will delay it. It shouldn't take more than a few extra minutes.
I will ping when the restore finishes executing.
Normally yes, but I would like to do a full restore to a separate server (simulating a total loss of the primary server), as I have additional questions for the service owner about the feasibility of the recovery and its consistency.
A file (or all files) being recoverable doesn't mean that the service is recoverable.
Bacula has its own consistency model that may not be compatible with git's consistency model; I want to put that to the test and ask whether LVM snapshotting would be needed.
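The repository side of that question can at least be smoke-tested mechanically by running git fsck over every bare repository in the restored tree. A sketch, under the assumption that the repos sit as *.git directories under the restore path:

```shell
#!/bin/sh
# Fsck every bare repo under the restored tree and report per-repo status.
# RESTORE and the *.git layout are assumptions about the on-disk structure.
RESTORE=/srv/T254162_restore/srv/gerrit/git
broken=0
for repo in "$RESTORE"/*.git; do
    [ -d "$repo" ] || continue   # glob matched nothing: nothing to check
    if git --git-dir="$repo" fsck --full >/dev/null 2>&1; then
        echo "OK     $repo"
    else
        echo "BROKEN $repo"
        broken=$((broken + 1))
    fi
done
echo "repos with errors: $broken"
```

Note that this only checks each repository's internal object graph; it says nothing about application-level consistency between the repos and the database, which is exactly the part that needs the service owner's input.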
Recovery took 20 minutes; I chose the full backup we took for testing yesterday, but remember we have hourly backups: https://grafana.wikimedia.org/d/413r2vbWk/bacula?orgId=1&from=1591022207000&to=1591173396847&var-dc=eqiad%20prometheus%2Fops&var-job=gerrit1001.wikimedia.org-Hourly-Sun-production-gerrit-repo-data
02-Jun 18:03 backup1001.eqiad.wmnet JobId 233274: Bacula backup1001.eqiad.wmnet 9.4.2 (04Feb19):
  Build OS:               x86_64-pc-linux-gnu debian buster/sid
  JobId:                  233274
  Job:                    RestoreFiles.2020-06-02_10.27.00_11
  Restore Client:         gerrit2001.wikimedia.org-fd
  Where:                  /srv/T254162_restore
  Replace:                Never
  Start time:             02-Jun-2020 17:39:27
  End time:               02-Jun-2020 18:03:03
  Elapsed time:           23 mins 36 secs
  Files Expected:         147,467
  Files Restored:         147,467
  Bytes Restored:         50,700,053,821 (50.70 GB)
  Rate:                   35805.1 KB/s
  FD Errors:              0
  FD termination status:  OK
  SD termination status:  OK
  Termination:            Restore OK
Content of everything backed up on gerrit1001 was restored into gerrit2001:/srv/T254162_restore.
Please run whatever checks are appropriate to verify that the backup would work for an emergency restore.
I have re-enabled access for gerrit-roots & puppet to gerrit2001 after shredding the master key.
Nice!
Content of everything backed up on gerrit1001 was restored into gerrit2001:/srv/T254162_restore.
I went over the data by comparing the LFS files and fsck-ing each of the repos, and it looks good. We could bring Gerrit up from that data (assuming, of course, the database content is there).
There were a few inconsistencies between the currently live data on gerrit1001 and the restored data, but those are expected.
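One way to quantify those inconsistencies is a checksum-manifest diff of the two trees. In this task the two sides live on different hosts (live data on gerrit1001, the restore on gerrit2001), so in practice each manifest would be generated on its own host and then compared. The sketch below demos the idea locally with throwaway directories:

```shell
#!/bin/sh
# Build a path-sorted checksum manifest of a tree, relative to its root,
# so manifests taken under different path prefixes are directly comparable.
manifest() {
    (cd "$1" && find . -type f -exec sha256sum {} + | sort -k2)
}

# Demo: two small trees, one file identical and one drifted.
live=$(mktemp -d); restored=$(mktemp -d)
echo same > "$live/x";  echo same > "$restored/x"
echo old  > "$live/y";  echo new  > "$restored/y"

manifest "$live"     > /tmp/live.sums
manifest "$restored" > /tmp/restored.sums
if diff /tmp/live.sums /tmp/restored.sums >/dev/null; then
    result="match"
else
    result="differ"
fi
echo "trees $result"
rm -rf "$live" "$restored" /tmp/live.sums /tmp/restored.sums
```

The diff output itself (not shown here) lists exactly which files drifted, which in this case should be just the files written in the 24 hours between the backup and the comparison.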
@jcrespo: Can I just remove gerrit2001:/srv/T254162_restore again or do you still want to check something with the data?
There were a few inconsistencies between the currently live data on gerrit1001 and the restored data, but those are expected.
Of course; there is a 24-hour gap. My question is whether there would be issues with setting up the service from that dump, and whether it would be internally consistent, even if not up to date.
Can I just remove
Not only can you, you should remove any side-channel copy of the data after your checks are finished. I don't think there is any private data there, so it is not subject to the data retention policy, but it is always better to be sure by removing it. :-D
I assigned this task to you to note that everything was done on my side.
Yes, as said above: the data is good to set the service up with (assuming the database content is available).
Can I just remove
[ Yes ]
Done.