Maniphest T204154

Kafka JobQueue should respect DB readonly mode
Closed, Resolved (Public)

Assigned To

Authored By

	• Pchelolo
	Sep 12 2018, 8:01 PM

Description

During the datacenter switchover for MediaWiki, the database is set to read-only mode, but the JobQueue does not respect that setting and continues posting jobs. These jobs fail, since most of them attempt to write to the DB. The jobs are then retried, but if the read-only period is too long, we end up losing some jobs completely.

To avoid that, the JobQueue needs to respect MediaWiki's read-only mode. We could pause the queue for the read-only period, but we would still lose the jobs that were in progress at the moment read-only mode was enabled.

Alternatively, we could implement something like retry-after: if a job fails because of read-only mode, the JobExecutor sets a retry-after header on the response with some large value. This fixes the problem of in-progress jobs, but the queue would still spin during the read-only period and keep making doomed-to-fail requests to the executor.
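
To make the retry-after idea above concrete, here is a minimal sketch of the executor side. It is not the actual JobExecutor code: `ReadOnlyError`, `Response`, `run_job`, the 503 status, and the 300-second value are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict

# Hypothetical "large value": how long the caller is asked to wait, in seconds.
READ_ONLY_RETRY_AFTER_S = 300


class ReadOnlyError(Exception):
    """Hypothetical error raised when a job tries to write to a read-only DB."""


@dataclass
class Response:
    status: int
    headers: Dict[str, str] = field(default_factory=dict)


def run_job(job: Callable[[], None]) -> Response:
    """Run a job; if it fails because of read-only mode, ask the caller to retry later."""
    try:
        job()
    except ReadOnlyError:
        # The job was doomed by read-only mode, so report that with a retry hint
        # rather than as an ordinary failure.
        return Response(503, {"Retry-After": str(READ_ONLY_RETRY_AFTER_S)})
    return Response(200)
```

Note that the job is still attempted before the header is set, which is exactly the drawback described above: the queue keeps sending requests that cannot succeed.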

We could implement both approaches together - that would be perfect. What do you think? Do we need this?

Details

Subject	Repo	Branch	Lines +/-
Specify readonly in the job executor response.	mediawiki/extensions/EventBus	master	+18 -19
RunSingleJob: Delay job execution while in read-only mode	operations/mediawiki-config	master	+8 -0
Minor: Return readonly status even when job creation fails	mediawiki/extensions/EventBus	master	+1 -0
RPC/RunSingleJob.php - send X-Readonly header.	operations/mediawiki-config	master	+3 -0


Related Objects

Mentioned In: T218692: read only on mediawiki generates "LoadBalancer.php: Cannot access the database: Unknown error"
T204138: Add 'Risk Rating' field to tasks created via advanced template

Event Timeline

• Pchelolo created this task. Sep 12 2018, 8:01 PM

Restricted Application added a subscriber: Aklapper. Sep 12 2018, 8:01 PM

Legoktm mentioned this in T204138: Add 'Risk Rating' field to tasks created via advanced template. Sep 12 2018, 8:05 PM

What about RunSingleJob returning a header like x-readonly: true? When CP sees that, it could pause the execution altogether for a predefined (and/or tunable) amount of time and retry the same job. This way all of the workers will eventually find out that MW is set to read-only and wait until it switches back to read-write, all the while not losing jobs.
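
A rough sketch of what the consumer side of this proposal could look like follows. It is not the change-prop implementation: the function name `execute_with_readonly_pause`, the `send_job` callback, and the 30-second default pause are assumptions; only the `x-readonly: true` header comes from the comment above.

```python
import time
from typing import Callable, Mapping


def execute_with_readonly_pause(
    send_job: Callable[[], Mapping[str, str]],
    pause_s: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Send the same job until the executor stops reporting read-only mode.

    `send_job` (hypothetical) posts the job and returns the response headers.
    Returns the number of attempts made.
    """
    attempts = 0
    while True:
        attempts += 1
        headers = send_job()
        # Header names are case-insensitive, so normalize before the lookup.
        normalized = {k.lower(): v.lower() for k, v in headers.items()}
        if normalized.get("x-readonly") != "true":
            return attempts
        # MW is read-only: wait, then retry the same job instead of dropping it.
        sleep(pause_s)
```

Passing `sleep` as a parameter keeps the pause tunable and makes the loop easy to exercise without actually waiting.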

I like x-readonly: true. It leaves CP to decide what exactly to do, whereas retry-after would split responsibility for controlling the event flow between MW and CP. I will follow that path.

Change 460946 had a related patch set uploaded (by Ppchelko; owner: Ppchelko):
[mediawiki/extensions/EventBus@master] Specify readonly in the job executor response.

https://gerrit.wikimedia.org/r/460946

gerritbot added a project: Patch-For-Review. Sep 17 2018, 6:55 PM

Change 460946 merged by jenkins-bot:
[mediawiki/extensions/EventBus@master] Specify readonly in the job executor response.

https://gerrit.wikimedia.org/r/460946

ReleaseTaggerBot added a project: MW-1.32-notes (WMF-deploy-2018-09-18 (1.32.0-wmf.22)). Sep 18 2018, 11:00 AM

Change 461229 had a related patch set uploaded (by Ppchelko; owner: Ppchelko):
[operations/mediawiki-config@master] RPC/RunSingleJob.php - send X-Readonly header.

https://gerrit.wikimedia.org/r/461229

Change 461229 merged by jenkins-bot:
[operations/mediawiki-config@master] RPC/RunSingleJob.php - send X-Readonly header.

https://gerrit.wikimedia.org/r/461229

Mentioned in SAL (#wikimedia-operations) [2018-09-20T06:20:29Z] <mobrovac@deploy1001> Synchronized rpc/RunSingleJob.php: Have RunSingleJob send the X-Readonly header - T204154 (duration: 00m 58s)

• mobrovac moved this task from Untriaged to EventBus infra on the WMF-JobQueue board. Sep 26 2018, 1:44 PM

• mobrovac edited projects, added Services (doing); removed Services (designing).

Change 463072 had a related patch set uploaded (by Mobrovac; owner: Mobrovac):
[operations/mediawiki-config@master] RunSingleJob: Delay job execution while in read-only mode

https://gerrit.wikimedia.org/r/463072

Krinkle added a project: Performance-Team (Radar). Sep 26 2018, 8:42 PM

Change 463213 had a related patch set uploaded (by Mobrovac; owner: Mobrovac):
[mediawiki/extensions/EventBus@master] Minor: Return readonly status even when job creation fails

https://gerrit.wikimedia.org/r/463213

Change 463213 merged by Ppchelko:
[mediawiki/extensions/EventBus@master] Minor: Return readonly status even when job creation fails

https://gerrit.wikimedia.org/r/463213

ReleaseTaggerBot edited projects, added MW-1.32-notes (WMF-deploy-2018-10-02 (1.32.0-wmf.24)); removed MW-1.32-notes (WMF-deploy-2018-09-18 (1.32.0-wmf.22)). Oct 1 2018, 4:00 PM

Related patch for change-prop https://github.com/wikimedia/change-propagation/pull/292

Change 463072 merged by jenkins-bot:
[operations/mediawiki-config@master] RunSingleJob: Delay job execution while in read-only mode

https://gerrit.wikimedia.org/r/463072

Mentioned in SAL (#wikimedia-operations) [2018-10-02T10:58:17Z] <mobrovac@deploy1001> Synchronized rpc/RunSingleJob.php: RunSingleJob: Delay job execution while in read-only mode - T204154 (duration: 00m 57s)

I think all the pieces were deployed, so I'm resolving the task. Let's see how it goes next week; I will reopen it in case of an issue.

• Pchelolo edited projects, added Services (done); removed Services (doing), Patch-For-Review. Oct 4 2018, 9:59 PM

• mobrovac mentioned this in T218692: read only on mediawiki generates "LoadBalancer.php: Cannot access the database: Unknown error". Apr 19 2019, 10:46 PM
