Page MenuHomePhabricator

bacula restore job waiting on higher jobs
Closed, ResolvedPublic

Description

I was trying a restore on graphite1001 of some data re: T95596 but the job seems stuck behind higher priority jobs

while creating the job

Select parameter to modify (1-13): 9
Please enter path prefix for restore (/ for none): /var/lib/carbon/bacula-restores
Run Restore job
JobName:         RestoreFiles
Bootstrap:       /var/lib/bacula/helium.eqiad.wmnet.restore.2.bsr
Where:           /var/lib/carbon/bacula-restores
Replace:         always
FileSet:         root
Backup Client:   graphite1001.eqiad.wmnet-fd
Restore Client:  graphite1001.eqiad.wmnet-fd
Storage:         helium-FileStorage1
When:            2015-04-10 15:12:37
Catalog:         production
Priority:        1
Plugin Options:  *None*
OK to run? (yes/mod/no): yes
Job queued. JobId=17582

then the director status

*status dir
helium.eqiad.wmnet Version: 5.2.5 (26 January 2012) x86_64-pc-linux-gnu ubuntu 12.04
Daemon started 27-Feb-15 11:27. Jobs: run=1050, running=264 mode=0,0
 Heap: heap=4,747,264 smbytes=2,615,508 max_bytes=150,860,886 bufs=7,958 max_bufs=10,427

Scheduled Jobs:
Level          Type     Pri  Scheduled          Name               Volume
===================================================================================
Incremental    Backup    10  11-Apr-15 04:05    antimony.wikimedia.org-Monthly-1st-Mon-production-svnroot production0001
...
Incremental    Backup    10  11-Apr-15 04:05    ytterbium.wikimedia.org-Monthly-1st-Sat-production-var-lib-gerrit2-review_site-git production0001
====

Running Jobs:
Console connected at 10-Apr-15 14:33
Console connected at 10-Apr-15 14:42
Console connected at 10-Apr-15 14:45
Console connected at 10-Apr-15 14:54
 JobId Level   Name                       Status
======================================================================
 17040 Full    dbstore1001.eqiad.wmnet-Weekly-Wed-production-mysql-srv-backups.2015-03-25_02.05.00_03 is running
 17072 Increme  caesium.eqiad.wmnet-Monthly-1st-Fri-production-srv-org-wikimedia.2015-03-26_04.05.00_35 is waiting on Storage "helium-FileStorage1"
 17073 Increme  carbon.wikimedia.org-Monthly-1st-Sat-production-srv-autoinstall.2015-03-26_04.05.00_36 is waiting on max Storage jobs
...
 17578 Full    virt1000.wikimedia.org-Monthly-1st-Tue-production-var-lib-puppet-ssl.2015-04-10_04.05.00_40 is waiting on max Storage jobs
 17579 Full    virt1000.wikimedia.org-Monthly-1st-Tue-production-var-lib-puppet-volatile.2015-04-10_04.05.00_41 is waiting on max Storage jobs
 17580 Full    ytterbium.wikimedia.org-Monthly-1st-Sat-production-var-lib-gerrit2-review_site-git.2015-04-10_04.05.00_42 is waiting on max Storage jobs
 17582         RestoreFiles.2015-04-10_15.13.25_49 is waiting for higher priority jobs to finish
====

Terminated Jobs:
 JobId  Level    Files      Bytes   Status   Finished        Name 
====================================================================
 17506  Incr          1         0   OK       08-Apr-15 14:13 stat1002.eqiad.wmnet-Monthly-1st-Thu-production-a-eventlogging
 17507  Incr      5,379    1.813 G  OK       08-Apr-15 14:14 stat1002.eqiad.wmnet-Monthly-1st-Thu-production-home
 17508  Full         54    410.7 M  OK       08-Apr-15 14:14 stat1003.eqiad.wmnet-Monthly-1st-Tue-production-a-geowiki-data-private-bare
 17528  Incr          0         0   Cancel   09-Apr-15 10:09 graphite1001.eqiad.wmnet-Monthly-1st-Thu-production-var-lib-carbon-whisper
 17545  Incr          0         0   Cancel   09-Apr-15 10:09 tungsten.eqiad.wmnet-Monthly-1st-Sat-production-var-lib-carbon-whisper
 17529  Incr          0         0   Cancel   09-Apr-15 10:09 graphite1002.eqiad.wmnet-Monthly-1st-Fri-production-var-lib-carbon-whisper
 17530  Incr          0         0   Cancel   09-Apr-15 10:09 graphite2001.codfw.wmnet-Monthly-1st-Fri-production-var-lib-carbon-whisper
 17515  Incr          0         0   Cancel   09-Apr-15 10:09 tungsten.eqiad.wmnet-Monthly-1st-Sat-production-var-lib-carbon-whisper
 17499  Incr          0         0   Cancel   09-Apr-15 10:09 graphite1002.eqiad.wmnet-Monthly-1st-Fri-production-var-lib-carbon-whisper
 17581                0         0   Cancel   10-Apr-15 15:10 RestoreFiles

====
*list jobid=17582
+--------+--------------+---------------------+------+-------+----------+----------+-----------+
| JobId  | Name         | StartTime           | Type | Level | JobFiles | JobBytes | JobStatus |
+--------+--------------+---------------------+------+-------+----------+----------+-----------+
| 17,582 | RestoreFiles | 0000-00-00 00:00:00 | R    | F     |        0 |        0 | C         |
+--------+--------------+---------------------+------+-------+----------+----------+-----------+
*

Event Timeline

fgiunchedi assigned this task to akosiaris.
fgiunchedi raised the priority of this task from to Needs Triage.
fgiunchedi updated the task description. (Show Details)
fgiunchedi added a project: acl*sre-team.
fgiunchedi subscribed.

Seems like the problem was bacula-sd on palladium locking up. bacula-sd has had a lot of problems in the past with threads locking up quite often. Most have been fixed but maybe we stumbled across one. Unfortunately I restarted the sd before thinking about getting a stack trace. Restarting also meant that the running jobs got canceled. I think the race here may be manifested due to backups running and the restore job trying to reserve a drive. I 'll try to reproduce it.

Other than that, restore jobs have the highest priority. However, bacula's scheduler is not preemptive. That is, it will not stop a running job to execute a new one with a higher priority. That AFAIK is a precaution against operations that would torture both tape drives as well as tapes.

sounds good, I've kicked off the restore again

sounds good, I've kicked off the restore again

@fgiunchedi: Any news?

let's resolve this for now, I've inquired about status in T95596

jcrespo subscribed.

I'd like to reopen this, and enable "Allow Mixed Priority" to allow jobs with different priorities to run at the same time.

If I have running backups (which is true 90% of the time now) and I request for a restore with higher priority, it is anti-intuitive that
a) the higher priority job waits for the lower priority jobs, even if it has enough concurrency to run
b) because a higher priority job has been scheduled, all new lower scheduled jobs are blocked, even if normally they would just run as concurrency permits

I believe enabling "Allow Mixed Priority" would do a less surprising scheduling setup. Thoughts?

Aklapper added a subscriber: akosiaris.

Removing task assignee due to inactivity, as this open task has been assigned for more than two years (see emails sent to assignee on May26 and Jun17, and T270544). Please assign this task to yourself again if you still realistically [plan to] work on this task - it would be very welcome!

(See https://www.mediawiki.org/wiki/Bug_management/Assignee_cleanup for tips how to best manage your individual work in Phabricator.)

@akosiaris Any thoughts on applying T95705#7118417, even if it were "do as you see fit, not managing this anymore" 0:-)

@akosiaris Any thoughts on applying T95705#7118417, even if it were "do as you see fit, not managing this anymore" 0:-)

I 've never tested "Allow Mixed Priority" so I have no experience with it, but from the docs it looks like it would solve a) in your comment. b) would not be solved from what I gathered. However even solving just a) is an improvement so +1.

I am going to try it and test it. CC @LSobanski

jcrespo raised the priority of this task from Low to Medium.Aug 25 2021, 11:22 AM

If Allow Mixed Priority wouldn't work, we could alternative set the default restore priority (RestoreFiles job) to be 10, like the regular backups. Having the same priority, counterintuitively, would lead to solution of b) (in addition to a))

Change 771551 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] bacula: Add mixed priority to all jobs

https://gerrit.wikimedia.org/r/771551

Change 771551 merged by Jcrespo:

[operations/puppet@production] bacula: Add mixed priority to all jobs

https://gerrit.wikimedia.org/r/771551

I will now run a regular priority backup (priority 10) and seconds later run a recovery (which should run by default with priority 1) and the recovery should start immediately.

Recovery could be blocked in the future if there are not enough parallel slots for it (e.g. at the start of the month, but we can always kill those in a real emergency- and potentially on the next release, pause them).

It worked, restore running despite being executed later with a different priority:

 JobId  Type Level     Files     Bytes  Name              Status
======================================================================
424549  Back Incr     12,860    1.038 G doc1001.eqiad.wmnet-Monthly-1st-Tue-production-srv-doc is running
424550  Rest Rest          0         0  RestoreFiles      is running

Considering the ticket resolved (as per title scope), more refinement and documentation may be needed (and potential issues with this in the future- e.g. how low prio migrate jobs behave with this).