
Gerrit backups are growing
Closed, Resolved, Public

Description

Following up on T406762: gerrit2003 is trying to incrementally back up 3.5 million files every hour, clogging backups and filling up available disk space.

The backup size increase is probably a side effect of the local-backup cookbook which duplicates critical git data locally to ensure a local stable state snapshot is quickly available. The most obvious fix for this could be to exclude the backup directory from bacula.

Event Timeline

Question: I don't know the details of how things are organized for resilience, but shouldn't the other host have those files (and thus, backups would frequently have the same size) if reliability was the goal? I don't have a problem with backups requiring more space, but if losing those files makes failover more difficult, they shouldn't be [only] on the local host, but elsewhere; otherwise the most common need for a failover (host failure, or maybe the second most common, after service maintenance) will not work.

Maybe those files should be sent elsewhere AND backed up? Again, I don't know anything about the internals, I am just making an assumption based on what I see on backup size.

ABran-WMF triaged this task as Medium priority. Dec 8 2025, 4:43 PM
ABran-WMF moved this task from Incoming to Work in Progress on the collaboration-services board.

I think my initial assumption was misguided: I was mistaken about the fileset Bacula targets.

Given:

I think we can assume that this is not an unusual pattern. And given:

These are the top files by size:

cumin2024@db1213.eqiad.wmnet[bacula9]> select Name, lstat_size(LStat) FROM File JOIN Filename USING(FilenameId) where JobId=666670 ORDER BY lstat_size(LStat) DESC LIMIT 15;
+-----------------------------+-------------------+
| Name                        | lstat_size(LStat) |
+-----------------------------+-------------------+
| git_file_diff.h2.db         |        2925682688 |
| comment_context.h2.db       |        1925754880 |
| account_patch_reviews.h2.db |        1427044352 |
| gerrit_file_diff.h2.db      |        1319897088 |
| mergeability.h2.db          |        1156052992 |
| diff_summary.h2.db          |        1086640128 |
| diff_intraline.h2.db        |         797939712 |
| conflicts.h2.db             |         787589120 |
| web_sessions.h2.db          |         705048576 |
| change_kind.h2.db           |         637464576 |
| git_modified_files.h2.db    |         349911040 |
| modified_files.h2.db        |         238133248 |
| master                      |          69714340 |
| accounts.h2.db              |          62945280 |
| persisted_projects.h2.db    |          30328832 |
+-----------------------------+-------------------+
15 rows in set (0.071 sec)

We can see that these files are cache dbs. According to Gerrit's documentation on caches, some are expensive to compute and their computation is triggered on demand by the UI. We're running replicas with no UI enabled, which prevents the computation of these caches. @Dzahn, @hashar, are you OK with my interpretation?
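As a rough illustration of how much of the top-15 listing is cache data, the sizes from the query output above can be summed with a short Python sketch. The figures are copied from the table; treating the `.h2.db` suffix as a marker for "Gerrit cache database" is an assumption made here for illustration, and the helper names are hypothetical.

```python
# Sizes in bytes, copied from the JobId=666670 query output above.
TOP_FILES = {
    "git_file_diff.h2.db": 2925682688,
    "comment_context.h2.db": 1925754880,
    "account_patch_reviews.h2.db": 1427044352,
    "gerrit_file_diff.h2.db": 1319897088,
    "mergeability.h2.db": 1156052992,
    "diff_summary.h2.db": 1086640128,
    "diff_intraline.h2.db": 797939712,
    "conflicts.h2.db": 787589120,
    "web_sessions.h2.db": 705048576,
    "change_kind.h2.db": 637464576,
    "git_modified_files.h2.db": 349911040,
    "modified_files.h2.db": 238133248,
    "master": 69714340,
    "accounts.h2.db": 62945280,
    "persisted_projects.h2.db": 30328832,
}


def is_cache_db(name: str) -> bool:
    """Assumption: H2 files ending in '.h2.db' are Gerrit cache databases."""
    return name.endswith(".h2.db")


def cache_share(files: dict[str, int]) -> tuple[int, int]:
    """Return (bytes held by cache DBs, bytes held by all listed files)."""
    cache = sum(size for name, size in files.items() if is_cache_db(name))
    total = sum(files.values())
    return cache, total


cache_bytes, total_bytes = cache_share(TOP_FILES)
print(
    f"cache DBs: {cache_bytes / 2**30:.2f} GiB "
    f"of {total_bytes / 2**30:.2f} GiB in the top 15 files"
)
```

In this listing, every entry except `master` matches the suffix check, so the cache DBs account for nearly all of the space used by the largest files.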

> Question: I don't know the details of how things are organized for resilience, but shouldn't the other host have those files (and thus, backups would frequently have the same size) if reliability was the goal? I don't have a problem with backups requiring more space, but if losing those files makes failover more difficult, they shouldn't be [only] on the local host, but elsewhere; otherwise the most common need for a failover (host failure, or maybe the second most common, after service maintenance) will not work.
>
> Maybe those files should be sent elsewhere AND backed up? Again, I don't know anything about the internals, I am just making an assumption based on what I see on backup size.

Those files are not critical for a failover. With the current failover process, they are transferred to the new primary host. Given that Gerrit's compaction is not idempotent, I'm not sure we'd be able to use them if we just copied them without the rest of the data directory structure.

The next iteration of the process aims to avoid using rsync altogether. Given those file sizes, we would probably benefit from triggering cache creation on the promoted instance before setting it to read-write mode.

Since the backup size is not dramatically increasing over time, I'm marking this task as resolved. Feel free to reopen if needed!

> Those files are not critical for a failover. With the current failover process, they are transferred to the new primary host. Given that Gerrit's compaction is not idempotent, I'm not sure we'd be able to use them if we just copied them without the rest of the data directory structure.

Could you tell me how you do a failover if the primary host is down and unavailable? That is my main concern here. If they have to be transferred to work, but are not backed up (do not exist?) on the secondary host, that is a flaw in the backups. If they are not important for the secondary host to exist and work, then they should not be backed up (because they are not important). Either one or the other, or I am not understanding your logic.

> With the current failover process

That's a switchover process, not a failover process, hence my point/question.

Alternatively, if they are "good to backup just in case", but not "critical", they should be backed up with the normal schedule (daily instead of hourly).

>> With the current failover process
>
> That's a switchover process, not a failover process, hence my point/question.

I see where I was unclear, sorry about this!

The failover process is derived from the one I've linked, but is not really documented yet.

>> Those files are not critical for a failover. With the current failover process, they are transferred to the new primary host. Given that Gerrit's compaction is not idempotent, I'm not sure we'd be able to use them if we just copied them without the rest of the data directory structure.
>
> Could you tell me how you do a failover if the primary host is down and unavailable? That is my main concern here. If they have to be transferred to work, but are not backed up (do not exist?) on the secondary host, that is a flaw in the backups. If they are not important for the secondary host to exist and work, then they should not be backed up (because they are not important). Either one or the other, or I am not understanding your logic.

Good question! In the context of a failover, where the primary instance is unavailable for some reason, we would not have to transfer or restore anything as long as one of the replicas is still available.
The main focus here would be to mend Puppet and DNS so they point to the failover primary instance.

If we needed to restore from a backup, that would mean either we have no more Gerrit instance, or the data has somehow been corrupted.
In both cases, I think it's best to have the cache DBs on par with the dataset that they are caching.

> Alternatively, if they are "good to backup just in case", but not "critical", they should be backed up with the normal schedule (daily instead of hourly).

Given what I mention above, I'd be curious to know what @Dzahn @hashar think

@jcrespo please let me know if there is anything unclear!

So my takeaway is (simplifying):

  1. the data is critical and backups will be used for recovery in case of a failover, so it needs a full backup. Recovery from backup cannot be guaranteed to work properly.
  2. complete backups cannot be taken from replicas because they are not "complete backups"
  3. in the future, to rely less on backups, those files will be synced to the replicas

Ok, I don't think this is an ideal situation (it seems that if the primary gets corrupted or goes bad we will lose a lot of data), but this ticket can be resolved now.

marking this as resolved, feel free to reopen if needed

My thought here is just that the O'Reilly SRE book reminded me of "data recovery plans should not rely on replication".