Make sure we backup Gerrit's LFS data
Closed, Resolved · Public

Description

Judging from our puppet repo, we only have backups for the git repos themselves, but not for the LFS data that Gerrit also needs.

So if the gerrit1001 server dies, LFS data would be lost.

There is a copy of most LFS data on gerrit2001, but it's missing recent files.
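One way to see which files a partial copy lacks is to compare the two directory trees by relative path. The following is only a minimal sketch: the function names are hypothetical, and it assumes both LFS trees are reachable as local directories (for example after mounting or syncing a file listing), which this task does not state.

```python
from __future__ import annotations

from pathlib import Path


def relative_files(root: Path) -> set[str]:
    """Return the relative POSIX paths of all regular files under root."""
    return {p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file()}


def missing_on_replica(primary_root: Path, replica_root: Path) -> list[str]:
    """List files that exist under primary_root but not under replica_root.

    Only names are compared, not contents; a stale copy of an existing
    file would not be detected by this sketch.
    """
    return sorted(relative_files(primary_root) - relative_files(replica_root))
```

Called with the primary tree first and the copy second, it returns the relative paths of files that the copy does not have.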

Event Timeline

QChris created this task. Jun 1 2020, 1:10 PM
Restricted Application added a subscriber: Aklapper. Jun 1 2020, 1:10 PM

Change 601341 had a related patch set uploaded (by QChris; owner: Christian Aistleitner):
[operations/puppet@production] profile,gerrit: Add backup of Gerrit's LFS files

https://gerrit.wikimedia.org/r/601341

Change 601341 merged by Dzahn:
[operations/puppet@production] profile,gerrit: Add backup of Gerrit's LFS files

https://gerrit.wikimedia.org/r/601341

QChris added a subscriber: Dzahn. Jun 1 2020, 1:49 PM

Code to run backups is in place. Thanks @Dzahn!

We now need to wait a week for the backup to run.

jcrespo added a subscriber: jcrespo. Jun 1 2020, 1:56 PM

We now need to wait a week for the backup to run

We don't need to wait; I (or any other SRE) can run it now, but I have no idea or context about what the end goal is here. That is why I asked to be filled in on a request ticket.

QChris added a comment. Jun 1 2020, 2:34 PM

The end goal is to be able to avoid data loss if Gerrit hardware dies.

Gerrit currently stores data in three places:

  1. the main git repositories,
  2. the MariaDB database, and
  3. a dedicated storage for large files (LFS), since git is not the best place to store huge binary files.

The git repos themselves had a backup configuration in puppet.

Gerrit's MariaDB database is seemingly backed up with the other databases. (My email from a few days ago was only about that MariaDB database)

The LFS part seemingly lacked backup.

So if, for example, gerrit1001 dies catastrophically, we would lose the LFS data and could not fully recover from backups.

This ticket tries to ensure LFS data is backed up and we can recover from Gerrit hardware issues.
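The gap described above amounts to a data directory that no backup path covers. A minimal sketch of such a coverage check follows. It is an illustration only, not the actual puppet change: the function name and the `/example/...` paths are hypothetical placeholders, and the MariaDB database is left out because it is backed up separately.

```python
from pathlib import PurePosixPath


def uncovered(data_dirs, backup_paths):
    """Return the data dirs that are neither listed in backup_paths
    nor nested under one of them."""
    backups = [PurePosixPath(b) for b in backup_paths]

    def is_covered(directory):
        d = PurePosixPath(directory)
        return any(d == b or b in d.parents for b in backups)

    return [d for d in data_dirs if not is_covered(d)]


# Hypothetical placeholder paths, not the real Gerrit layout.
data_dirs = ["/example/gerrit/git", "/example/gerrit/lfs"]
print(uncovered(data_dirs, ["/example/gerrit/git"]))
```

With only the git directory in the backup list, the LFS directory is reported as uncovered; adding it (or a common parent) to the list makes the result empty.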

It's not a huge deal if the backup runs today or on Sunday, as we haven't had any backup at all since LFS got turned on some time back.
If it's tricky or hard to manually kick off a backup run, I'd wait till Sunday.
But if it's easy to manually kick off a backup run, a first run would be great. Let me file a ticket for that.

[...] Let me file a ticket for [initial backup run].

The ticket is at T254162

greg added a subscriber: greg. Jun 1 2020, 4:10 PM

We now need to wait a week for the backup to run

We don't need to wait; I (or any other SRE) can run it now, but I have no idea or context about what the end goal is here. That is why I asked to be filled in on a request ticket.

Thank you for the offer, @jcrespo. After reading @QChris's follow-up, I think we might as well kick off an initial backup now to seed the backups and get us on good footing. No need to wait :) I will comment similarly on T254162.

This comment was removed by jcrespo.

Mentioned in SAL (#wikimedia-operations) [2020-06-01T17:31:25Z] <mutante> backup1001 - queued job 42 - gerrit backup after renaming of the file set and addition of LFS data (T254155, T254162) it is incremental, the full one already ran

QChris closed this task as Resolved. Jun 3 2020, 8:08 PM
QChris claimed this task.