
Kafka JobQueue should respect DB readonly mode
Closed, Resolved, Public

Description

During the MediaWiki datacenter switchover, the database is put into read-only mode, but the JobQueue does not respect that setting and keeps posting jobs. These all fail, since most of the jobs attempt to write to the DB. The jobs are then retried, but if the read-only period is too long, we end up losing some jobs completely.

To avoid that, the Job Queue needs to respect MediaWiki's read-only mode. We could pause the queue for the read-only period, but we would still lose the jobs that were in progress at the moment read-only mode was enabled.

Alternatively, we could implement something like retry-after: if a job fails because of read-only mode, the JobExecutor sets a retry-after header on the response to some large value. This fixes the problem with ongoing jobs, but the queue would still spin during the read-only period and make doomed-to-fail requests to the executor.
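To make the retry-after idea concrete, here is a minimal Python sketch of how a queue consumer might derive a retry delay from the executor's response. The `ExecutorResponse` type, the `retry_delay_seconds` helper, and the 5-second default are hypothetical and chosen only for illustration; they are not taken from the actual JobQueue code. Only the delta-seconds form of the header is handled.

```python
from dataclasses import dataclass
from typing import Mapping, Optional

# Hypothetical fallback delay used when the executor gives no hint.
DEFAULT_RETRY_DELAY_S = 5.0


@dataclass
class ExecutorResponse:
    """Hypothetical, simplified view of the job executor's HTTP response."""
    status: int
    headers: Mapping[str, str]


def retry_delay_seconds(resp: ExecutorResponse) -> Optional[float]:
    """Return how long to wait before retrying, or None if no retry is needed."""
    if 200 <= resp.status < 300:
        return None  # job succeeded

    # HTTP header names are case-insensitive, so normalise them first.
    headers = {name.lower(): value for name, value in resp.headers.items()}
    raw = headers.get("retry-after")
    if raw is not None:
        try:
            # Only the "number of seconds" form is supported in this sketch.
            return max(0.0, float(raw))
        except ValueError:
            pass  # unparseable value: fall back to the default delay
    return DEFAULT_RETRY_DELAY_S
```

Note that this only spaces out the retries of a job that has already failed; the consumer would still keep sending new jobs to the executor during the read-only period, which is the drawback described above.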

We could implement both approaches together, which would be ideal. What do you think? Do we need this?

Event Timeline

What about RunSingleJob returning a header like x-readonly: true? When CP sees that, it could pause execution altogether for a predefined (and/or tunable) amount of time and then retry the same job. This way all of the workers will eventually find out that MW is in read-only mode and wait until it switches back to read-write, without losing any jobs.
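A rough Python sketch of that consumer-side behaviour, under stated assumptions: `execute` stands in for the HTTP call to RunSingleJob and returns a `(status, headers)` pair, and `pause_s` is the tunable pause. All names and the 30-second default are hypothetical; this is not CP's actual code.

```python
import time
from typing import Any, Callable, Mapping, Optional, Tuple

# Hypothetical response shape: (HTTP status, response headers).
Response = Tuple[int, Mapping[str, str]]


def is_readonly(headers: Mapping[str, str]) -> bool:
    """True if the response carries `x-readonly: true` (case-insensitive)."""
    for name, value in headers.items():
        if name.lower() == "x-readonly":
            return value.strip().lower() == "true"
    return False


def run_job_respecting_readonly(
    job: Any,
    execute: Callable[[Any], Response],
    sleep: Callable[[float], None] = time.sleep,
    pause_s: float = 30.0,
    max_pauses: Optional[int] = None,
) -> int:
    """Run `job`, pausing and retrying the same job while MW is read-only.

    Returns the status of the first response that is not flagged read-only.
    """
    pauses = 0
    while True:
        status, headers = execute(job)
        if not is_readonly(headers):
            return status
        if max_pauses is not None and pauses >= max_pauses:
            raise RuntimeError("still read-only after %d pauses" % pauses)
        # Wait, then retry the very same job so that it is not lost.
        sleep(pause_s)
        pauses += 1
```

Because each worker blocks on its own job until the header disappears, every worker ends up idle during the read-only period without any shared "paused" state being needed.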

I like x-readonly: true, since it leaves CP to decide what exactly to do, whereas retry-after splits the responsibility for controlling the event flow between MW and CP. I will follow that path.

Change 460946 had a related patch set uploaded (by Ppchelko; owner: Ppchelko):
[mediawiki/extensions/EventBus@master] Specify readonly in the job executor response.

https://gerrit.wikimedia.org/r/460946

Change 460946 merged by jenkins-bot:
[mediawiki/extensions/EventBus@master] Specify readonly in the job executor response.

https://gerrit.wikimedia.org/r/460946

Change 461229 had a related patch set uploaded (by Ppchelko; owner: Ppchelko):
[operations/mediawiki-config@master] RPC/RunSingleJob.php - send X-Readonly header.

https://gerrit.wikimedia.org/r/461229

Change 461229 merged by jenkins-bot:
[operations/mediawiki-config@master] RPC/RunSingleJob.php - send X-Readonly header.

https://gerrit.wikimedia.org/r/461229

Mentioned in SAL (#wikimedia-operations) [2018-09-20T06:20:29Z] <mobrovac@deploy1001> Synchronized rpc/RunSingleJob.php: Have RunSingleJob send the X-Readonly header - T204154 (duration: 00m 58s)

Change 463072 had a related patch set uploaded (by Mobrovac; owner: Mobrovac):
[operations/mediawiki-config@master] RunSingleJob: Delay job execution while in read-only mode

https://gerrit.wikimedia.org/r/463072

Change 463213 had a related patch set uploaded (by Mobrovac; owner: Mobrovac):
[mediawiki/extensions/EventBus@master] Minor: Return readonly status even when job creation fails

https://gerrit.wikimedia.org/r/463213

Change 463213 merged by Ppchelko:
[mediawiki/extensions/EventBus@master] Minor: Return readonly status even when job creation fails

https://gerrit.wikimedia.org/r/463213

Change 463072 merged by jenkins-bot:
[operations/mediawiki-config@master] RunSingleJob: Delay job execution while in read-only mode

https://gerrit.wikimedia.org/r/463072

Mentioned in SAL (#wikimedia-operations) [2018-10-02T10:58:17Z] <mobrovac@deploy1001> Synchronized rpc/RunSingleJob.php: RunSingleJob: Delay job execution while in read-only mode - T204154 (duration: 00m 57s)

I think all the pieces have been deployed, so I'm resolving the task. Let's see how it goes next week; I will reopen it in case of an issue.