
Daily backup job not running for gerrit1003
Closed, ResolvedPublic

Description

There were some disk space issues on gerrit1003 on Nov 20 (documented in {T351658}). The host was rebooted at 2023-11-20 17:18, at which point disk usage dropped to 83%.

According to check_bacula.py, there was a gap in the hourly backup job executions between 2023-11-20 11:00 and 2023-11-21 04:15, and then another one from 2023-11-21 05:14:07 until now.

lsobanski@backup1001:/var/log/bacula$ sudo /usr/bin/check_bacula.py gerrit1003.wikimedia.org-Hourly-Fri-productionEqiad-gerrit-repo-data
...
id: 539407, ts: 2023-11-20 11:00:02, type: I, status: T, bytes: 25990405136
id: 539408, ts: 2023-11-21 04:15:16, type: I, status: T, bytes: 19458946160
...
id: 539436, ts: 2023-11-21 05:14:07, type: I, status: T, bytes: 19465905776
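
(For illustration, a minimal Python sketch of this kind of gap check, using the timestamps from the output above; the hardcoded list and the two-hour threshold are just for this example and are not how check_bacula.py itself works.)

from datetime import datetime, timedelta

# Hypothetical: timestamps copied from the check_bacula.py output above.
runs = [
    "2023-11-20 11:00:02",
    "2023-11-21 04:15:16",
    "2023-11-21 05:14:07",
]

# Flag any interval noticeably longer than the hourly schedule.
threshold = timedelta(hours=2)
parsed = [datetime.strptime(ts, "%Y-%m-%d %H:%M:%S") for ts in runs]
for prev, curr in zip(parsed, parsed[1:]):
    if curr - prev > threshold:
        print(f"gap: {prev} -> {curr} ({curr - prev})")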

For the first gap I see the following in the Bacula log: "Job ... is waiting. Cannot find any appendable volumes", which, as far as I was able to find, would be related to the lack of disk space. There's nothing in the Bacula log for the second gap.

Primary question: Does the daily job need to be restarted?
Secondary question (if it can be established): Why did the job only run successfully at 4:15 when disk usage dropped at 18:00 the previous day?

Event Timeline

We looked at the Grafana dashboard today and could confirm the backup job completed successfully 11 hours ago.

There was a spike yesterday, and it seems it was caused by T351658, which is already linked.

I think we just need to keep an eye on it, and if it works fine for the next few days as it did before, we can close this.

"Job ... is waiting. Cannot find any appendable volumes"

That error would be on the Bacula storage daemon (somewhere on backup1*), not on the gerrit/client hosts. Are you sure this is related to T351658? It could be (if it made the client fail), but the error points to the backup storage instead.

Change 976286 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] bacula: Increase the amount of maximum volumes for regular backups to 140

https://gerrit.wikimedia.org/r/976286

Despite https://gerrit.wikimedia.org/r/c/operations/puppet/+/976286, I wonder if the backups are being executed well: the full backup is around 50GB, but right now 20GB are copied each hour. I wonder if that makes sense, whether 20GB of new files really are generated each hour, or whether there is some avoidable overhead, especially given that a few months ago it was 8GB and I would expect gerrit usage to go stale as people start using gitlab more.
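
(To make that concern concrete, a rough back-of-envelope calculation in Python with the approximate figures quoted above:)

# Approximate figures from the comment above.
full_backup_gb = 50         # size of a full backup
hourly_incremental_gb = 20  # current size of each hourly incremental
previous_hourly_gb = 8      # hourly incremental size a few months ago

# Data copied per day by the hourly incrementals alone.
daily_incremental_gb = 24 * hourly_incremental_gb
print(f"~{daily_incremental_gb} GB/day of incrementals vs a ~{full_backup_gb} GB full backup")
print(f"~{24 * (hourly_incremental_gb - previous_hourly_gb)} GB/day more than a few months ago")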

"Job ... is waiting. Cannot find any appendable volumes"

That error would be on the Bacula storage daemon (somewhere on backup1*), not on the gerrit/client hosts. Are you sure this is related to T351658? It could be (if it made the client fail), but the error points to the backup storage instead.

I'm not sure, to be honest, but it seems to mostly align with the gap in the jobs running. Here's the relevant part of the log:

20-Nov 11:04 backup1009.eqiad.wmnet-fd JobId 539407: Job gerrit1003.wikimedia.org-Hourly-Fri-productionEqiad-gerrit-repo-data.2023-11-20_11.00.00_37 is waiting. Cannot find any appendable volumes.
20-Nov 12:09 backup1009.eqiad.wmnet-fd JobId 539407: Job gerrit1003.wikimedia.org-Hourly-Fri-productionEqiad-gerrit-repo-data.2023-11-20_11.00.00_37 is waiting. Cannot find any appendable volumes.
20-Nov 14:20 backup1009.eqiad.wmnet-fd JobId 539407: Job gerrit1003.wikimedia.org-Hourly-Fri-productionEqiad-gerrit-repo-data.2023-11-20_11.00.00_37 is waiting. Cannot find any appendable volumes.
20-Nov 18:40 backup1009.eqiad.wmnet-fd JobId 539407: Job gerrit1003.wikimedia.org-Hourly-Fri-productionEqiad-gerrit-repo-data.2023-11-20_11.00.00_37 is waiting. Cannot find any appendable volumes.
21-Nov 02:05 backup1001.eqiad.wmnet JobId 539423: Copying using JobId=539406 Job=gerrit1003.wikimedia.org-Hourly-Fri-productionEqiad-gerrit-repo-data.2023-11-20_10.00.00_19
  Prev Backup Job:        gerrit1003.wikimedia.org-Hourly-Fri-productionEqiad-gerrit-repo-data.2023-11-20_10.00.00_19
21-Nov 03:23 backup1009.eqiad.wmnet-fd JobId 539407: Job gerrit1003.wikimedia.org-Hourly-Fri-productionEqiad-gerrit-repo-data.2023-11-20_11.00.00_37 is waiting. Cannot find any appendable volumes.
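
(As a side note, a small Python sketch for pulling the "waiting" timestamps out of a log excerpt like the one above, to see how long JobId 539407 sat waiting; the truncated lines and the hardcoded year are assumptions for this example, not output of any existing tool.)

from datetime import datetime

# Log excerpt as pasted above, truncated to the relevant fields.
log_lines = [
    "20-Nov 11:04 backup1009.eqiad.wmnet-fd JobId 539407: Job ... is waiting. Cannot find any appendable volumes.",
    "21-Nov 03:23 backup1009.eqiad.wmnet-fd JobId 539407: Job ... is waiting. Cannot find any appendable volumes.",
]

waiting = []
for line in log_lines:
    if "Cannot find any appendable volumes" not in line:
        continue
    # Bacula prints "DD-Mon HH:MM" without a year; assume 2023 here.
    ts = datetime.strptime(line.split(" backup")[0] + " 2023", "%d-%b %H:%M %Y")
    waiting.append(ts)

print(f"first waiting message: {waiting[0]}")
print(f"last waiting message:  {waiting[-1]}")
print(f"job sat waiting for at least {waiting[-1] - waiting[0]}")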

Change 976286 merged by Jcrespo:

[operations/puppet@production] bacula: Increase the amount of maximum volumes for regular backups to 140

https://gerrit.wikimedia.org/r/976286

Mentioned in SAL (#wikimedia-operations) [2023-11-21T18:15:19Z] <jynus> restart of bacula-sd on backup1009 T351725

jcrespo moved this task from Triage to Done on the Data-Persistence-Backup board.

I have unclogged things on my side (bacula) and backups are now flowing normally. Usually I would close this as resolved, but I'm leaving it to you in case you want to check something else from the client side (gerrit1003). Please reassign to whoever it corresponds to, especially regarding my questions at https://phabricator.wikimedia.org/T351725#9350316

Backups look fine again since 18:30 UTC yesterday: https://grafana.wikimedia.org/d/413r2vbWk/bacula?orgId=1&var-site=eqiad&var-job=gerrit1003.wikimedia.org-Hourly-Fri-productionEqiad-gerrit-repo-data&from=1700471853896&to=1700642496593. No elevated backup freshness or gaps in backups.

Despite https://gerrit.wikimedia.org/r/c/operations/puppet/+/976286, I wonder if the backups are being executed well: the full backup is around 50GB, but right now 20GB are copied each hour. I wonder if that makes sense, whether 20GB of new files really are generated each hour, or whether there is some avoidable overhead, especially given that a few months ago it was 8GB and I would expect gerrit usage to go stale as people start using gitlab more.

I'm not 100% sure how gerrit backups work, but I went over the bacula fileset:

bacula::director::fileset { 'gerrit-repo-data':
    # Both the repository data and the gerrit2 home directory are included.
    includes => [ '/srv/gerrit', '/var/lib/gerrit2' ]
}

And in the latter there are quite a lot of big .db files, which I guess are hard to back up incrementally because the complete file changes. So all of the .db files (for example in /var/lib/gerrit2/review_site/cache) are transferred every time. The total size of these .db files is over 20GB.
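
(For what it's worth, a quick way to reproduce that size estimate, assuming read access on the host and that the caches live under the path mentioned above:)

from pathlib import Path

# Sum the size of the H2 cache databases that end up in every hourly
# incremental because the whole file changes between runs.
cache_dir = Path("/var/lib/gerrit2/review_site/cache")
total_bytes = sum(p.stat().st_size for p in cache_dir.rglob("*.db"))
print(f"{total_bytes / 1e9:.1f} GB of .db files under {cache_dir}")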

/var/lib/gerrit2/review_site/cache sounds like a cache. Is that really necessary for a site recovery, or would the cache get regenerated? I'm not opposed to backing it up anyway, just in case, but maybe not hourly? Obviously, I also speak from a place of ignorance (any change should be tested), but I want to know whether the hourly backups are really being effective, given their size (and how they used to be much smaller).

Edit: In particular, this is the part that looks weird and not consistent with its history: https://grafana.wikimedia.org/goto/piRtZ3SSz?orgId=1

{F41523804}

And in the latter there are quite a lot of big .db files,

See T351658#9352274.

"Gerrit hasn't been restarted in a while and did not have an opportunity to compact the H2 databases"

and the comments after that.

Edit: In particular, this is the part that looks weird and not consistent with its history: https://grafana.wikimedia.org/goto/piRtZ3SSz?orgId=1

I think this is also related to T351658#9352667 and the comments around it.

And in the latter there are quite a lot of big .db files,

See T351658#9352274.

"Gerrit hasn't been restarted in a while and did not have an opportunity to compact the H2 databases"

and the comments after that.

Ah, that certainly could explain it! I am then personally satisfied, and this can be resolved IMHO.

LSobanski triaged this task as Medium priority.Nov 23 2023, 2:59 PM
LSobanski lowered the priority of this task from Medium to Low.Nov 23 2023, 3:03 PM

I'll leave this open for another week or two to see if the backup size changes after the reboots.

As of midday 2023-11-22 the backup size is down to around 8GB, in line with what it was before the recent increase. Resolving.