
Initial backup run for Gerrit LFS data
Closed, Resolved · Public

Description

We've just added automatic backups for LFS data in T254155. A first manual run would help us cover the time until the first automatic run on Sunday.

Related Objects

Event Timeline

+1, let's kick off an initial backup.

jcrespo triaged this task as Medium priority.
jcrespo added a project: SRE.

Running:

232157  Back Full          0         0  gerrit1001.wikimedia.org-Hourly-Sun-production-gerrit-repo-data is running
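
For reference, kicking off such a manual run from bconsole on the director looks roughly like this (job name taken from the status line above; the exact invocation used may have differed):

*run job=gerrit1001.wikimedia.org-Hourly-Sun-production-gerrit-repo-data level=Full yes

The trailing "yes" skips the confirmation prompt; leaving it off shows the job parameters for review first.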

I have a few things I would like to test/ask about this.

  • First, I would like to test a recovery, if possible, to make sure the backup worked as intended (a rough bconsole sketch of such a test follows this list).
  • I also have a few questions regarding recovery of gerrit, in terms of consistency and disaster recovery model, but I am not sure whether this task would be the right place for them.
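
As a sketch, a single-file test restore could be driven from bconsole using the interactive file selection, roughly as follows (client name from the job above; the LFS directory path and <some-lfs-object> are illustrative placeholders):

*restore client=gerrit1001.wikimedia.org-fd select current
cd /srv/gerrit/plugins/lfs
mark <some-lfs-object>
done

Bacula then restores the marked file to the configured restore location on the client, where it can be compared against the live copy.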

Probably a secondary task, yeah. But good to discuss.

*llist jobid=232157    
           JobId: 232,157
             Job: gerrit1001.wikimedia.org-Hourly-Sun-production-gerrit-repo-data.2020-06-01_16.14.26_23
            Name: gerrit1001.wikimedia.org-Hourly-Sun-production-gerrit-repo-data
     PurgedFiles: 0
            Type: B
           Level: F
        ClientId: 151
      ClientName: gerrit1001.wikimedia.org-fd
       JobStatus: T
       SchedTime: 2020-06-01 16:14:07
       StartTime: 2020-06-01 16:14:26
         EndTime: 2020-06-01 16:29:21
     RealEndTime: 2020-06-01 16:29:21
        JobTDate: 1,591,028,961
    VolSessionId: 10,153
  VolSessionTime: 1,586,342,727
        JobFiles: 147,467
        JobBytes: 50,803,269,008
       ReadBytes: 50,700,053,821
       JobErrors: 0
 JobMissingFiles: 0
          PoolId: 2
        PoolName: production
      PriorJobId: 0
       FileSetId: 71
         FileSet: gerrit-repo-data
         HasBase: 0
        HasCache: 0
         Comment:
01-Jun 16:29 backup1001.eqiad.wmnet JobId 232157: Bacula backup1001.eqiad.wmnet 9.4.2 (04Feb19):
  Build OS:               x86_64-pc-linux-gnu debian buster/sid
  JobId:                  232157
  Job:                    gerrit1001.wikimedia.org-Hourly-Sun-production-gerrit-repo-data.2020-06-01_16.14.26_23
  Backup Level:           Full
  Client:                 "gerrit1001.wikimedia.org-fd" 9.4.2 (04Feb19) x86_64-pc-linux-gnu,debian,buster/sid
  FileSet:                "gerrit-repo-data" 2020-06-01 14:00:00
  Pool:                   "production" (From Job resource)
  Catalog:                "production" (From Client resource)
  Storage:                "backup1001-FileStorageProduction" (From Pool resource)
  Scheduled time:         01-Jun-2020 16:14:07
  Start time:             01-Jun-2020 16:14:26
  End time:               01-Jun-2020 16:29:21
  Elapsed time:           14 mins 55 secs
  Priority:               10
  FD Files Written:       147,467
  SD Files Written:       147,467
  FD Bytes Written:       50,803,269,008 (50.80 GB)
  SD Bytes Written:       50,876,346,668 (50.87 GB)
  Rate:                   56763.4 KB/s
  Software Compression:   None
  Comm Line Compression:  None
  Snapshot/VSS:           no
  Encryption:             yes
  Accurate:               no
  Volume name(s):         production0089
  Volume Session Id:      10153
  Volume Session Time:    1586342727
  Last Volume Bytes:      136,812,896,732 (136.8 GB)
  Non-fatal FD errors:    0
  SD Errors:              0
  FD termination status:  OK
  SD termination status:  OK
  Termination:            Backup OK

The job ran, and it is scheduled to run an incremental automatically every hour: https://grafana.wikimedia.org/d/413r2vbWk/bacula?orgId=1&var-dc=eqiad%20prometheus%2Fops&var-job=gerrit1001.wikimedia.org-Hourly-Sun-production-gerrit-repo-data
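
For context, the hourly-incremental / Sunday-full cycle encoded in the job name maps to Bacula Director resources roughly like the following (the real definitions are generated from Puppet, so the names, times, signature algorithm and path here are only illustrative):

Schedule {
  Name = "Hourly-Sun"
  Run = Level=Full sun at 02:05
  Run = Level=Incremental hourly at 0:35
}

FileSet {
  Name = "gerrit-repo-data"
  Include {
    Options {
      signature = SHA1
    }
    File = /srv/gerrit
  }
}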

We should try restoring it; give me a path you have access to, so I can do so.

Mentioned in SAL (#wikimedia-operations) [2020-06-01T17:31:25Z] <mutante> backup1001 - queued job 42 - gerrit backup after renaming of the file set and addition of LFS data (T254155, T254162) it is incremental, the full one already ran

@jcrespo Thanks for the initial run!

We should try restoring it; give me a path you have access to, so I can do so.

gerrit1002 (gerrit-test) is a bit scarce on disk space, so what about gerrit2001 (gerrit-replica)? On gerrit2001, /srv has >150GB free. That should easily do.
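
A quick capacity check before committing to the ~50 GB restore would simply be (run on gerrit2001; output omitted):

df -h /srv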

I have to temporarily block gerrit-root members' access to gerrit2001, as I need to use the global key to decrypt backups on a different host than the one they were taken on, AND these members have local root on the server; otherwise this would be a security concern.

I have already tried restoring a single random file from the git-lfs directory on gerrit1001 itself and the file is here now:

root@gerrit1001:/var/tmp/bacula-restores/srv/gerrit/plugins/lfs/ff/00# file ff002a0e6e173a695d6e51f8d2941eebd14123aa17d673dbd2124746ee182e20 
ff002a0e6e173a695d6e51f8d2941eebd14123aa17d673dbd2124746ee182e20: data
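
Going one step beyond file(1): Git LFS names each object after the SHA-256 of its content, so the restored file can be self-verified, and compared against the live copy, with something like the following (paths as in the restore above; assuming Gerrit's LFS plugin uses the standard content-addressed layout):

sha256sum /var/tmp/bacula-restores/srv/gerrit/plugins/lfs/ff/00/ff002a0e6e173a695d6e51f8d2941eebd14123aa17d673dbd2124746ee182e20
sha256sum /srv/gerrit/plugins/lfs/ff/00/ff002a0e6e173a695d6e51f8d2941eebd14123aa17d673dbd2124746ee182e20

Both should print the same hash as the filename itself.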

Would this be sufficient?

Mentioned in SAL (#wikimedia-operations) [2020-06-02T10:12:08Z] <jynus> disable non-global root login to gerrit2001 T254162

I have to temporarily block gerrit-root members' access to gerrit2001, as I need to use the global key to decrypt backups on a different host than the one they were taken on, AND these members have local root on the server; otherwise this would be a security concern.

Also not a problem. I don't think anyone needs shell access right now, especially if that's just a temporary thing.

I have scheduled the restore. If this were an emergency, I would kill the ongoing backup jobs and the restore would run immediately, but because this is a test, I will let the large ongoing backup process (it is currently doing a full backup of Phabricator) finish before the restore starts, which will delay it. It shouldn't take more than a few extra minutes.
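
For reference, a non-interactive equivalent of this kind of cross-host restore can be expressed in bconsole roughly like this (client names and target directory as used in this task; the actual restore may well have gone through the interactive restore menu instead):

*restore client=gerrit1001.wikimedia.org-fd restoreclient=gerrit2001.wikimedia.org-fd fileset=gerrit-repo-data where=/srv/T254162_restore select all done yes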

I will ping you when the restore finishes executing.

Would this be sufficient?

Normally yes, but I would like to do a full restore to a separate server (simulating a total loss of the primary server), as I have additional questions for the service owner about the feasibility of the recovery and its consistency.

A file (or all files) being recoverable doesn't mean that the service is recoverable.

Bacula has its own consistency model that may not be compatible with git's consistency model; I want to put that to the test and enquire whether LVM snapshotting would be needed.
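
For illustration, the LVM-snapshot variant alluded to here would look roughly like this on the Gerrit host, assuming /srv sits on an LVM logical volume (volume group and LV names are made up):

lvcreate --snapshot --size 10G --name srv-backup-snap /dev/vg0/srv
mkdir -p /mnt/srv-backup-snap
mount /dev/vg0/srv-backup-snap /mnt/srv-backup-snap   # filesystem-specific options (e.g. nouuid for XFS) omitted
# Bacula would then back up /mnt/srv-backup-snap/gerrit instead of /srv/gerrit,
# getting a single point-in-time view of all repositories and LFS objects.
umount /mnt/srv-backup-snap
lvremove -f /dev/vg0/srv-backup-snap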

jcrespo added a subscriber: jcrespo.

Recovery took 20 minutes. I chose the full backup we took for testing yesterday, but remember that we have hourly backups: https://grafana.wikimedia.org/d/413r2vbWk/bacula?orgId=1&from=1591022207000&to=1591173396847&var-dc=eqiad%20prometheus%2Fops&var-job=gerrit1001.wikimedia.org-Hourly-Sun-production-gerrit-repo-data

02-Jun 18:03 backup1001.eqiad.wmnet JobId 233274: Bacula backup1001.eqiad.wmnet 9.4.2 (04Feb19):
  Build OS:               x86_64-pc-linux-gnu debian buster/sid
  JobId:                  233274
  Job:                    RestoreFiles.2020-06-02_10.27.00_11
  Restore Client:         gerrit2001.wikimedia.org-fd
  Where:                  /srv/T254162_restore
  Replace:                Never
  Start time:             02-Jun-2020 17:39:27
  End time:               02-Jun-2020 18:03:03
  Elapsed time:           23 mins 36 secs
  Files Expected:         147,467
  Files Restored:         147,467
  Bytes Restored:         50,700,053,821 (50.70 GB)
  Rate:                   35805.1 KB/s
  FD Errors:              0
  FD termination status:  OK
  SD termination status:  OK
  Termination:            Restore OK

Content of everything backed up on gerrit1001 was restored into gerrit2001:/srv/T254162_restore.

Please run whatever operations are appropriate to verify that the backup would work for an emergency restore.

I have re-enabled gerrit-roots & puppet access to gerrit2001 after shredding the master key.

Bacula has its own consistency model that may not be compatible with git's consistency model; I want to put that to the test and enquire whether LVM snapshotting would be needed.

Oh! Thanks for pointing that out and getting into the details.

Recovery took 20 minutes, [...]

Nice!

Content of everything backed up on gerrit1001 was restored into gerrit2001:/srv/T254162_restore.

I went over the data by comparing the lfs files and fsck-ing each of the repos, and it looks good. We could get gerrit up from that data (assuming, of course, that the database content is there).
There were a few inconsistencies between the currently live data on gerrit1001 and the restored data, but those are expected.
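
A sketch of what such checks can look like, run on gerrit2001 against the restored tree (illustrative, not necessarily the exact commands used; the path assumes Bacula kept the original /srv/gerrit prefix under the restore directory, as the job output above indicates):

cd /srv/T254162_restore/srv/gerrit
# fsck every bare repository found in the restored tree
find . -type d -name '*.git' -prune | while read -r repo; do
    git -C "$repo" fsck --full || echo "FSCK FAILED: $repo"
done
# LFS objects are named after the SHA-256 of their content, so they can be self-verified:
find plugins/lfs -type f -printf '%f  %p\n' | sha256sum -c --quiet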

@jcrespo: Can I just remove gerrit2001:/srv/T254162_restore again or do you still want to check something with the data?

There were a few inconsistencies between the currently live data on gerrit1001 and the restored data, but those are expected.

Of course, there is a 24-hour gap. My question is whether there would be issues with setting up the service from that dump, and whether it would be internally consistent, even if not up to date.

Can I just remove

Not only can you, you should remove any side-channel copy of the data once your checks are finished. I don't think there is any private data there, so it is not subject to the data retention policy, but it is always better to be sure by removing it. :-D

I assigned this task to you to note that everything was done on my side.

My question is whether there would be issues with setting up the service from that dump, and whether it would be internally consistent, even if not up to date.

Yes, as said above: the data is good to set the service up with (assuming the database content is available).

Can I just remove

[ Yes ]

Done.