During the MediaWiki datacenter switchover, the database is put into read-only mode, but the JobQueue does not respect that setting and keeps posting jobs. These jobs fail, since most of them attempt to write to the DB. The jobs are then retried, but if the read-only period is too long, the retries run out and we end up losing some jobs completely.
To avoid that, the Job Queue needs to respect MediaWiki's read-only mode. We could pause the queue for the read-only period, but we would still lose the jobs that were in flight at the moment read-only mode was enabled.
Alternatively, we could implement something like retry-after: if a job fails because of read-only mode, the JobExecutor sets a Retry-After header on the response with some large value. This fixes the in-flight jobs problem, but the queue would still spin during the read-only period and keep making doomed-to-fail requests to the executor.
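To make the retry-after idea concrete, here is a minimal sketch of how the queue side might pick the delay before re-posting a failed job. Everything in it is an assumption for illustration: the names `parse_retry_after`, `retry_delay` and `DEFAULT_BACKOFF_SECONDS` are hypothetical, the header is assumed to carry a plain number of seconds, and the fallback is a generic exponential backoff rather than whatever the queue does today.

```python
from typing import Mapping, Optional

# Assumed default for illustration only, not taken from any real config.
DEFAULT_BACKOFF_SECONDS = 60


def parse_retry_after(headers: Mapping[str, str]) -> Optional[int]:
    """Return Retry-After as whole seconds, or None if absent or not numeric."""
    for name, value in headers.items():
        if name.lower() == "retry-after":
            value = value.strip()
            # Only the delay-seconds form is handled; an HTTP-date is ignored.
            if value.isascii() and value.isdigit():
                return int(value)
    return None


def retry_delay(status: int, headers: Mapping[str, str], attempt: int) -> Optional[int]:
    """Decide how long to wait before re-posting a job.

    Returns None when the job succeeded and needs no retry.
    """
    if 200 <= status < 300:
        return None
    requested = parse_retry_after(headers)
    if requested is not None:
        # The executor asked for a specific delay, e.g. because the DB is read-only.
        return requested
    # No hint from the executor: fall back to exponential backoff.
    return DEFAULT_BACKOFF_SECONDS * (2 ** attempt)
```

With this shape, a response such as `503` with `Retry-After: 3600` would push the job an hour out instead of burning through its normal retries during the read-only window. Whether such delayed attempts should count against the retry limit is left open here.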
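We could also implement both approaches together, which would cover both problems: pausing stops the doomed requests, and retry-after saves the jobs that were already in flight. What do you think? Do we need this?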
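For discussion, a rough sketch of what combining the two could look like on the queue side. This is not how the current JobQueue is structured: `Scheduler`, `is_read_only` and `execute` are hypothetical names, and the normal retry path for ordinary failures is left out.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class Scheduler:
    # Hypothetical hooks: `is_read_only` reports the wiki's read-only state,
    # `execute` posts one job and returns (succeeded, retry_after_seconds).
    is_read_only: Callable[[], bool]
    execute: Callable[[str], Tuple[bool, Optional[int]]]
    pending: List[str] = field(default_factory=list)
    delayed: List[Tuple[int, str]] = field(default_factory=list)  # (delay, job)

    def run_once(self) -> int:
        """Post pending jobs until the queue is empty or read-only mode is on.

        Returns the number of jobs that were posted to the executor.
        """
        posted = 0
        # Pause: nothing is posted while read-only mode is on.
        while self.pending and not self.is_read_only():
            job = self.pending.pop(0)
            ok, retry_after = self.execute(job)
            posted += 1
            if not ok and retry_after is not None:
                # Retry-after: the job was in flight when read-only mode
                # started, so keep it and reschedule it for later.
                self.delayed.append((retry_after, job))
            # Ordinary failures would go through the normal retry path,
            # which this sketch omits.
        return posted
```

The read-only check runs before every job rather than once per pass, so the queue stops posting as soon as the switch happens, and only the job that was actually in flight needs the retry-after path.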