
gerrit2003 is trying to incrementally back up 3.5 million files every hour, clogging backups and filling up available disk space
Closed, Resolved (Public)

Description

This is only happening for gerrit2003.wikimedia.org-Hourly-Mon-productionEqiad-gerrit-repo-data and not for other hosts, which back up only a few thousand files every hour.

status dir
Running Jobs:
Console connected using TLS at 07-Oct-25 12:34
 JobId  Type Level     Files     Bytes  Name              Status
======================================================================
656139  Back Incr  3,515,749    44.41 G gerrit2003.wikimedia.org-Hourly-Mon-productionEqiad-gerrit-repo-data is running
656142  Back Incr          0         0  gerrit2003.wikimedia.org-Hourly-Mon-productionEqiad-gerrit-repo-data is waiting on max Client jobs
656145  Back Incr          0         0  gerrit2003.wikimedia.org-Hourly-Mon-productionEqiad-gerrit-repo-data is waiting on max Client jobs
656148  Back Incr          0         0  gerrit2003.wikimedia.org-Hourly-Mon-productionEqiad-gerrit-repo-data is waiting on max Client jobs
656151  Back Incr          0         0  gerrit2003.wikimedia.org-Hourly-Mon-productionEqiad-gerrit-repo-data is waiting on max Client jobs
656154  Back Incr          0         0  gerrit2003.wikimedia.org-Hourly-Mon-productionEqiad-gerrit-repo-data is waiting on max Client jobs
656157  Back Incr          0         0  gerrit2003.wikimedia.org-Hourly-Mon-productionEqiad-gerrit-repo-data is waiting on max Client jobs
656160  Back Incr          0         0  gerrit2003.wikimedia.org-Hourly-Mon-productionEqiad-gerrit-repo-data is waiting on max Client jobs
656163  Back Incr          0         0  gerrit2003.wikimedia.org-Hourly-Mon-productionEqiad-gerrit-repo-data is waiting on max Client jobs
656166  Back Incr          0         0  gerrit2003.wikimedia.org-Hourly-Mon-productionEqiad-gerrit-repo-data is waiting on max Client jobs
656169  Back Incr          0         0  gerrit2003.wikimedia.org-Hourly-Mon-productionEqiad-gerrit-repo-data is waiting on max Client jobs
656172  Back Incr          0         0  gerrit2003.wikimedia.org-Hourly-Mon-productionEqiad-gerrit-repo-data is waiting on max Client jobs
656186  Back Incr          0         0  gerrit2003.wikimedia.org-Hourly-Mon-productionEqiad-gerrit-repo-data is waiting on max Client jobs
656189  Back Incr          0         0  gerrit2003.wikimedia.org-Hourly-Mon-productionEqiad-gerrit-repo-data is waiting on max Client jobs
656251  Back Incr          0         0  gerrit2003.wikimedia.org-Monthly-1st-Mon-productionEqiad-home is waiting on max Client jobs
656318  Back Incr          0         0  gerrit2003.wikimedia.org-Hourly-Mon-productionEqiad-gerrit-repo-data is waiting on max Client jobs
656321  Back Incr          0         0  gerrit2003.wikimedia.org-Hourly-Mon-productionEqiad-gerrit-repo-data is waiting on max Client jobs
656324  Back Incr          0         0  gerrit2003.wikimedia.org-Hourly-Mon-productionEqiad-gerrit-repo-data is waiting on max Client jobs
656327  Back Incr          0         0  gerrit2003.wikimedia.org-Hourly-Mon-productionEqiad-gerrit-repo-data is waiting on max Client jobs
656330  Back Incr          0         0  gerrit2003.wikimedia.org-Hourly-Mon-productionEqiad-gerrit-repo-data is waiting on max Client jobs
656333  Back Incr          0         0  gerrit2003.wikimedia.org-Hourly-Mon-productionEqiad-gerrit-repo-data is waiting on max Client jobs
656336  Back Incr          0         0  gerrit2003.wikimedia.org-Hourly-Mon-productionEqiad-gerrit-repo-data is waiting on max Client jobs
656339  Back Incr          0         0  gerrit2003.wikimedia.org-Hourly-Mon-productionEqiad-gerrit-repo-data is waiting on max Client jobs
656342  Back Incr          0         0  gerrit2003.wikimedia.org-Hourly-Mon-productionEqiad-gerrit-repo-data is waiting on max Client jobs
656345  Back Incr          0         0  gerrit2003.wikimedia.org-Hourly-Mon-productionEqiad-gerrit-repo-data is waiting on max Client jobs
656348  Back Incr          0         0  gerrit2003.wikimedia.org-Hourly-Mon-productionEqiad-gerrit-repo-data is waiting on max Client jobs
656351  Back Incr          0         0  gerrit2003.wikimedia.org-Hourly-Mon-productionEqiad-gerrit-repo-data is waiting on max Client jobs
====

Terminated Jobs:
 JobId  Level      Files    Bytes   Status   Finished        Name 
====================================================================
656338  Incr       1,857    113.9 M  OK       08-Oct-25 12:01 gerrit2002.wikimedia.org-Hourly-Mon-productionEqiad-gerrit-repo-data
656340  Incr       2,089    10.21 G  OK       08-Oct-25 13:02 gerrit1003.wikimedia.org-Hourly-Fri-productionEqiad-gerrit-repo-data
656341  Incr       2,064    123.0 M  OK       08-Oct-25 13:02 gerrit2002.wikimedia.org-Hourly-Mon-productionEqiad-gerrit-repo-data
656136  Incr    3,745,079    54.98 G  OK       08-Oct-25 13:09 gerrit2003.wikimedia.org-Hourly-Mon-productionEqiad-gerrit-repo-data
656343  Incr       2,965    10.23 G  OK       08-Oct-25 14:01 gerrit1003.wikimedia.org-Hourly-Fri-productionEqiad-gerrit-repo-data
656344  Incr       2,781    120.1 M  OK       08-Oct-25 14:02 gerrit2002.wikimedia.org-Hourly-Mon-productionEqiad-gerrit-repo-data
656346  Incr       2,416    10.43 G  OK       08-Oct-25 15:02 gerrit1003.wikimedia.org-Hourly-Fri-productionEqiad-gerrit-repo-data
656347  Incr       2,217    387.0 M  OK       08-Oct-25 15:02 gerrit2002.wikimedia.org-Hourly-Mon-productionEqiad-gerrit-repo-data
656349  Incr       2,594    10.25 G  OK       08-Oct-25 16:03 gerrit1003.wikimedia.org-Hourly-Fri-productionEqiad-gerrit-repo-data
656350  Incr       2,597    162.3 M  OK       08-Oct-25 16:03 gerrit2002.wikimedia.org-Hourly-Mon-productionEqiad-gerrit-repo-data

Started happening around UTC midday today.

Event Timeline

jcrespo renamed this task from gerrit2003 is trying to incrementally back up 2 million files every hour, clogging backups and filling up available disk space to gerrit2003 is trying to incrementally back up 3.5 million files every hour, clogging backups and filling up available disk space. Oct 8 2025, 4:43 PM
jcrespo triaged this task as Unbreak Now! priority.

I can confirm that host gerrit2003 is right now still defined as a spare_host. Based on that, please just disable its backups for now.

Change #1194697 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] gerrit: Disable gerrit2003 backups

https://gerrit.wikimedia.org/r/1194697

Change #1194697 merged by Jcrespo:

[operations/puppet@production] gerrit: Disable gerrit2003 backups

https://gerrit.wikimedia.org/r/1194697

jcrespo lowered the priority of this task from Unbreak Now! to High. Oct 8 2025, 5:11 PM
jcrespo added a subscriber: ABran-WMF.

No longer UBN once the job has been disabled, please have a look when you have the time.

jcrespo updated the task description.

Today we had scheduled a switchover of Gerrit from gerrit1003 to gerrit2003. One of the steps involves doing an rsync of /srv/gerrit/git to keep a local backup. That step ran for more than 45 minutes, which prompted me to investigate.

Something I noticed in the rsync verbose output is a large number of files such as refs/changes/YY/XXXYY... and indeed:

gerrit2003$ find /srv/gerrit/git | wc -l
5037253

That is 5 million files.

The reason is that on Monday we had to interrupt an rsync between gerrit1003 and gerrit2003, which left the file ownership in a mixed state due to T338470. To fix that, on gerrit2003 I nuked /srv/gerrit/git and started a Gerrit replication. The Gerrit replication pushes every single ref to the replicas, which write each ref down as an individual file. They are only collected into a single packed-refs file when Gerrit triggers a garbage collection. That happens on Sunday IIRC, and in the meantime there are millions of files on disk, which explains the 3,745,079 files in:

 JobId  Level      Files    Bytes   Status   Finished        Name 
====================================================================
656136  Incr    3,745,079    54.98 G  OK       08-Oct-25 13:09 gerrit2003.wikimedia.org-Hourly-Mon-productionEqiad-gerrit-repo-data
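Loose refs written by replication live as one file per ref under refs/, and git's garbage collection consolidates them via pack-refs. A minimal sketch of that effect, using a throwaway local repo (the ref names are illustrative, not Gerrit's actual change numbers):

```shell
# Create a throwaway repo and simulate replication writing loose refs,
# one file per ref, the way Gerrit replicas receive them.
cd "$(mktemp -d)"
git init -q repo && cd repo
git commit -q --allow-empty -m init
for i in 1 2 3; do
  git update-ref "refs/changes/0$i/10$i/meta" HEAD
done
echo "loose ref files: $(find .git/refs/changes -type f | wc -l)"

# Garbage collection consolidates them: all refs move into a single
# .git/packed-refs file and the loose files are removed.
git pack-refs --all
echo "after pack-refs:  $(find .git/refs/changes -type f | wc -l)"
head -5 .git/packed-refs
```

This is why the file count collapses after the weekly GC: the millions of per-ref files become a handful of lines in one packed-refs file per repository.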

Given T387833#11267438, that should not be an issue anymore. I'll send a patch to re-enable backups.

Change #1195432 had a related patch set uploaded (by Arnaudb; author: Arnaudb):

[operations/puppet@production] gerrit: re-enable backups on gerrit2003

https://gerrit.wikimedia.org/r/1195432

Change #1195432 merged by Dzahn:

[operations/puppet@production] gerrit: re-enable backups on gerrit2003

https://gerrit.wikimedia.org/r/1195432

I merged the change above to re-enable backups, then noticed that puppet is currently disabled on gerrit2003 with a message that it's for debugging by @ABran-WMF. So this will be applied when puppet gets re-enabled there.

Change #1196629 had a related patch set uploaded (by Arnaudb; author: Arnaudb):

[operations/puppet@production] gerrit: disable gerrit service to enable backups

https://gerrit.wikimedia.org/r/1196629

LSobanski lowered the priority of this task from High to Medium. Oct 16 2025, 11:18 AM
LSobanski moved this task from Incoming to Work in Progress on the collaboration-services board.

Change #1196629 merged by Dzahn:

[operations/puppet@production] gerrit: disable gerrit service to enable backups

https://gerrit.wikimedia.org/r/1196629

Change #1196792 had a related patch set uploaded (by Arnaudb; author: Arnaudb):

[operations/puppet@production] gerrit: unmask service & disable backup temporarily

https://gerrit.wikimedia.org/r/1196792

I think this can be closed now; feel free to reopen if needed!

Change #1211551 had a related patch set uploaded (by Arnaudb; author: Arnaudb):

[operations/puppet@production] gerrit: re-enable backups on gerrit2003

https://gerrit.wikimedia.org/r/1211551

gerrit1003.wikimedia.org is now backing up 13GB every hour. Is that normal?

This task was for gerrit2003 which had a 55GB backup because, as part of a migration, the git repositories have been entirely rewritten. That specific cause has been resolved for now (but will trigger again next time the repos are wiped and regenerated via parent T387833).

@jcrespo for gerrit1003, that is the primary server and that is certainly a different issue. Can you please file a new task for that? If you have examples of files being touched that would help. Thanks!

These are the top files by size:

cumin2024@db1213.eqiad.wmnet[bacula9]> select Name, lstat_size(LStat) FROM File JOIN Filename USING(FilenameId) where JobId=666670 ORDER BY lstat_size(LStat) DESC LIMIT 15;
+-----------------------------+-------------------+
| Name                        | lstat_size(LStat) |
+-----------------------------+-------------------+
| git_file_diff.h2.db         |        2925682688 |
| comment_context.h2.db       |        1925754880 |
| account_patch_reviews.h2.db |        1427044352 |
| gerrit_file_diff.h2.db      |        1319897088 |
| mergeability.h2.db          |        1156052992 |
| diff_summary.h2.db          |        1086640128 |
| diff_intraline.h2.db        |         797939712 |
| conflicts.h2.db             |         787589120 |
| web_sessions.h2.db          |         705048576 |
| change_kind.h2.db           |         637464576 |
| git_modified_files.h2.db    |         349911040 |
| modified_files.h2.db        |         238133248 |
| master                      |          69714340 |
| accounts.h2.db              |          62945280 |
| persisted_projects.h2.db    |          30328832 |
+-----------------------------+-------------------+
15 rows in set (0.071 sec)
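As an aside on the query above: the LStat column stores the stat() fields as space-separated integers in Bacula's base-64 integer encoding (standard base64 alphabet, most significant digit first, no padding), with st_size as the 8th field; the lstat_size() function used above presumably decodes that. A hedged sketch of decoding it by hand, with an illustrative LStat value where only st_size is populated (the other fields are dummies, not real catalog data):

```shell
# Hedged sketch: decode a Bacula LStat field by hand. Assumption (from
# Bacula's catalog format): each field is an integer written in base 64
# over the standard alphabet, high-order digit first, no padding.
bacula_b64_to_int() {
  local alphabet='ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/'
  local s=$1 v=0 c rest i
  for (( i = 0; i < ${#s}; i++ )); do
    c=${s:i:1}
    rest=${alphabet%%"$c"*}       # prefix before c: its length is c's digit value
    v=$(( v * 64 + ${#rest} ))
  done
  printf '%s\n' "$v"
}

# Illustrative LStat: st_size is the 8th field; "CuYmAA" encodes
# 2925682688, the git_file_diff.h2.db size in the table above.
lstat='A A A A A A A CuYmAA A A A A A A A A'
set -- $lstat
echo "st_size: $(bacula_b64_to_int "$8")"
```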
Dzahn removed hashar as the assignee of this task.

> This task was for gerrit2003 which had a 55GB backup because, as part of a migration, the git repositories have been entirely rewritten. That specific cause has been resolved for now (but will trigger again next time the repos are wiped and regenerated via parent T387833).
>
> @jcrespo for gerrit1003, that is the primary server and that is certainly a different issue. Can you please file a new task for that? If you have examples of files being touched that would help. Thanks!

I have limited insight into gerrit backups, but from looking at the Gerrit metrics an hourly backup of around ~10GB seems normal.

On the ReposEqiad bacula backend there are mostly ~10GB backups with a bigger weekly backup: https://grafana.wikimedia.org/d/413r2vbWk/bacula?orgId=1&from=2025-10-09T20:46:12.816Z&to=2025-12-02T13:50:27.684Z&timezone=utc&var-site=eqiad&var-job=gerrit1003.wikimedia.org-Hourly-Tue-ReposEqiad-gerrit-repo-data&viewPanel=panel-10

On the previous productionEqiad bacula backend the size was similar: https://grafana.wikimedia.org/d/413r2vbWk/bacula?orgId=1&from=2025-06-05T14:07:21.628Z&to=2025-10-09T12:41:20.566Z&timezone=utc&var-site=eqiad&var-job=gerrit1003.wikimedia.org-Hourly-Fri-productionEqiad-gerrit-repo-data&viewPanel=panel-10

The reason I ask is because the other hosts' backups are very small in comparison:

667888  Incr       1,828    164.1 M  OK       02-Dec-25 14:00 gerrit2002.wikimedia.org-Hourly-Tue-ReposEqiad-gerrit-repo-data
667889  Incr       1,787    55.12 M  OK       02-Dec-25 14:00 gerrit2003.wikimedia.org-Hourly-Tue-ReposEqiad-gerrit-repo-data
667887  Incr       1,889    13.93 G  OK       02-Dec-25 14:02 gerrit1003.wikimedia.org-Hourly-Tue-ReposEqiad-gerrit-repo-data

That would lead me to think they are not effectively being backed up, given the difference.

hashar claimed this task.

> These are the top files by size:
>
> cumin2024@db1213.eqiad.wmnet[bacula9]> select Name, lstat_size(LStat) FROM File JOIN Filename USING(FilenameId) where JobId=666670 ORDER BY lstat_size(LStat) DESC LIMIT 15;
> +-----------------------------+-------------------+
> | Name                        | lstat_size(LStat) |
> +-----------------------------+-------------------+
> | git_file_diff.h2.db         |        2925682688 |
> | comment_context.h2.db       |        1925754880 |
> | account_patch_reviews.h2.db |        1427044352 |
> | gerrit_file_diff.h2.db      |        1319897088 |
> ...

Thanks for the list of files. That is indeed a different issue, unrelated to the git repository references that filled gerrit2003. It is almost certainly preexisting and has most likely been around since we set up the backup.

Can you please file a different task? It is very challenging for me to deal with multiple different issues on a single task, especially when the original one was closed a while ago and, although it affects backups, has an unrelated root cause. Thanks for your understanding ;)

The ask was to open a new task.

Change #1196792 abandoned by Arnaudb:

[operations/puppet@production] gerrit: unmask service

Reason:

1217133

https://gerrit.wikimedia.org/r/1196792

Change #1211551 abandoned by Arnaudb:

[operations/puppet@production] gerrit: re-enable backups on gerrit2003

Reason:

1217133

https://gerrit.wikimedia.org/r/1211551

Change #1217134 had a related patch set uploaded (by Arnaudb; author: Arnaudb):

[operations/puppet@production] gerrit: re-enable backups and monitoring on gerrit2003

https://gerrit.wikimedia.org/r/1217134