Exception "Job queue is read-only"
Closed, Resolved | Public | PRODUCTION ERROR

Description

This exception started appearing a little over a month ago (possibly earlier) in Logstash under mediawiki-errors.

Frequency: About 20-40 fatal exceptions each day.

Only for enwiki and only from codfw appservers and api_appservers (mw2177, mw2137, mw2144, ..). Most likely triggered by monitoring requests, but still these exceptions shouldn't happen.

Is it not possible to write to the active jobqueue from codfw? While it's fine not to support database writes (given we don't support active-active yet, so there aren't meant to be write requests there), as far as I know, we do have a small number of jobs that can be inserted from read-only requests - as indicated by these errors.

[W0lDugrAIEIAAL2owt0AAAAL] /wiki/Main_Page   JobQueueReadOnlyError from line 693 of /srv/mediawiki/php-1.32.0-wmf.12/includes/jobqueue/JobQueue.php:

Job queue is read-only: MediaWiki is in read-only mode for maintenance. Please try again in a few minutes.

#0 /srv/mediawiki/php-1.32.0-wmf.12/includes/jobqueue/JobQueue.php(316): JobQueue->assertNotReadOnly()
#1 /srv/mediawiki/php-1.32.0-wmf.12/includes/jobqueue/JobQueue.php(302): JobQueue->batchPush(array, integer)
#2 /srv/mediawiki/php-1.32.0-wmf.12/includes/jobqueue/JobQueueGroup.php(154): JobQueue->push(array)
#3 /srv/mediawiki/php-1.32.0-wmf.12/includes/jobqueue/JobQueueGroup.php(218): JobQueueGroup->push(array)
#4 /srv/mediawiki/php-1.32.0-wmf.12/includes/deferred/MWCallableUpdate.php(34): JobQueueGroup::pushLazyJobs()
#5 /srv/mediawiki/php-1.32.0-wmf.12/includes/deferred/DeferredUpdates.php(268): MWCallableUpdate->doUpdate()
#6 /srv/mediawiki/php-1.32.0-wmf.12/includes/deferred/DeferredUpdates.php(214): DeferredUpdates::runUpdate(MWCallableUpdate, Wikimedia\Rdbms\LBFactoryMulti, string, integer)
#7 /srv/mediawiki/php-1.32.0-wmf.12/includes/deferred/DeferredUpdates.php(134): DeferredUpdates::execute(array, string, integer)
#8 /srv/mediawiki/php-1.32.0-wmf.12/includes/MediaWiki.php(913): DeferredUpdates::doUpdates(string)
#9 /srv/mediawiki/php-1.32.0-wmf.12/includes/MediaWiki.php(733): MediaWiki->restInPeace(string, boolean)
#10 [internal function]: Closure$MediaWiki::doPostOutputShutdown()
#11 {main}

Event Timeline

Krinkle updated the task description.
mobrovac edited subscribers, added: Pchelolo, Joe, mobrovac; removed: Aklapper.

Thanks for unearthing this, @Krinkle. This is probably the last thing we forgot to change in the JobQueue switch. So, both the JobQueue and the DB layer in MW are controlled by the $wgReadOnly variable, which is set to a truthy value in codfw. It is read from etcd.

It seems that the way forward would be to decouple these into multiple settings, since now we have the semantics of "enqueuing a job" and "executing a job" which don't happen in the same place any longer. @Joe, @Pchelolo what do you think?

> Thanks for unearthing this, @Krinkle. This is probably the last thing we forgot to change in the JobQueue switch. So, both the JobQueue and the DB layer in MW are controlled by the $wgReadOnly variable, which is set to a truthy value in codfw. It is read from etcd.
>
> It seems that the way forward would be to decouple these into multiple settings, since now we have the semantics of "enqueuing a job" and "executing a job" which don't happen in the same place any longer. @Joe, @Pchelolo what do you think?

I guess that we could in fact decouple enqueueing and executing jobs, but I'm not sure why we would want a read-only datacenter to enqueue jobs. Is there a scenario where this makes sense? Isn't this probably some issue in the API code that tries to enqueue jobs even from a read-only dc where there is no traffic?

Do we have more info about the jobs that we failed to submit? That could help us understand which requests triggered it.

> I guess that we could in fact decouple enqueueing and executing jobs, but I'm not sure why we would want a read-only datacenter to enqueue jobs. Is there a scenario where this makes sense? Isn't this probably some issue in the API code that tries to enqueue jobs even from a read-only dc where there is no traffic?

Since the JobQueue execution path is now controlled by other entities (EventBus, ChangeProp, etc.), MW should no longer be concerned with where the job is enqueued; wherever a job is pushed into the queue, our setup ensures it will be executed in the correct DC. This would bring us one (small) step closer to having an active-active setup. Moreover, since the JobQueue transport and execution mechanism now works differently, it makes sense conceptually to distinguish between enqueuing a job and executing it.

> Do we have more info about the jobs that we failed to submit? That could help us understand which requests triggered it.

These indeed seem to be monitoring requests as their URI is /wiki/Main_Page (for enwiki) coming from einsteinium.

herron triaged this task as High priority. (Jul 18 2018, 6:32 PM)

To serve read traffic correctly, $wgReadOnly needs to be false. $wgReadOnly is mostly a UI-layer concept which shows some informative message to the user, not just on POST, but also on confirmation pages. So it's not really necessary to fix this to implement active-active support; we can just set $wgReadOnly to false. So I don't think this is high priority; it is just logspam.

That said, I don't know why pushing a job should throw an exception when wfReadOnly() is true. It was proposed in T130795 by @aaron, partly in order to have "app level read-only errors instead [of] redis error spam about read-only mode". But that seems to be an argument for configuring a read-only mode on a per-queue basis, rather than using the global read-only mode.

mobrovac lowered the priority of this task from High to Medium. (Jul 19 2018, 8:34 AM)

> To serve read traffic correctly, $wgReadOnly needs to be false. $wgReadOnly is mostly a UI-layer concept which shows some informative message to the user, not just on POST, but also on confirmation pages. So it's not really necessary to fix this to implement active-active support; we can just set $wgReadOnly to false. So I don't think this is high priority; it is just logspam.

The problem with $wgReadOnly is that it affects all operations, whereas here we just don't want it to affect the enqueuing of jobs. Hence, it makes sense to keep $wgReadOnly for execution paths that would provoke a change in the system, but remove its impact on execution paths that don't (directly/immediately) do so.

Change 446762 had a related patch set uploaded (by Mobrovac; owner: Mobrovac):
[mediawiki/core@master] JobQueue: Allow jobs to be enqueued despite $wgReadOnly

https://gerrit.wikimedia.org/r/446762

Normally, it would be odd to let jobs pile up but not execute them, though the multi-DC use case of $wgReadOnly in one of the DCs wasn't considered in T130795. Ideally, jobs enqueued on GET/HEAD wouldn't be a thing...but that's not going away anytime soon.

Change 446866 had a related patch set uploaded (by Mobrovac; owner: Mobrovac):
[mediawiki/core@master] JobQueueGroup: Allow readOnlyReason to be specified per JQ type

https://gerrit.wikimedia.org/r/446866

Change 446866 merged by jenkins-bot:
[mediawiki/core@master] JobQueueGroup: Allow readOnlyReason to be specified per JQ type

https://gerrit.wikimedia.org/r/446866

Change 446762 abandoned by Mobrovac:
JobQueue: Allow jobs to be enqueued despite $wgReadOnly

Reason:
Superseded by I8f1a57a81ea11c1c587c0057fa8bb3454b0e0b56

https://gerrit.wikimedia.org/r/446762

Change 447055 had a related patch set uploaded (by Mobrovac; owner: Mobrovac):
[operations/mediawiki-config@master] JobQueue: Signal JobQueueEventBus is never read-only

https://gerrit.wikimedia.org/r/447055

Change 447055 merged by jenkins-bot:
[operations/mediawiki-config@master] JobQueue: Signal JobQueueEventBus is never read-only

https://gerrit.wikimedia.org/r/447055

Mentioned in SAL (#wikimedia-operations) [2018-07-26T08:18:25Z] <mobrovac@deploy1001> Synchronized wmf-config/CommonSettings.php: Set readOnlyReason to false everywhere for JobQueueEventBus - T199594 (duration: 00m 55s)
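
The combined effect of change 446866 (per-type readOnlyReason) and the config change above can be modelled roughly as follows. This is a simplified Python sketch for illustration only, not MediaWiki's PHP code: the class names mirror those in the stack trace, but the constructor signatures, the `type_conf` dictionary, the fallback logic, and the `exampleJob` type are all assumptions.

```python
class JobQueueReadOnlyError(Exception):
    """Raised when a job is pushed to a queue that is marked read-only."""


class JobQueue:
    def __init__(self, job_type, read_only_reason=False):
        self.job_type = job_type
        # False means "writable"; any other value is the reason for read-only.
        self.read_only_reason = read_only_reason
        self.jobs = []

    def push(self, job):
        if self.read_only_reason is not False:
            raise JobQueueReadOnlyError(
                f"Job queue is read-only: {self.read_only_reason}"
            )
        self.jobs.append(job)


class JobQueueGroup:
    def __init__(self, global_read_only_reason, type_conf):
        # Hypothetical stand-ins for the global read-only state and the
        # per-job-type queue configuration.
        self.global_read_only_reason = global_read_only_reason
        self.type_conf = type_conf
        self._queues = {}

    def get(self, job_type):
        if job_type not in self._queues:
            conf = dict(
                self.type_conf.get(job_type, self.type_conf.get("default", {}))
            )
            # Only fall back to the global reason when the queue type
            # does not specify its own.
            conf.setdefault("read_only_reason", self.global_read_only_reason)
            self._queues[job_type] = JobQueue(job_type, conf["read_only_reason"])
        return self._queues[job_type]


# A read-only datacenter whose queue type opts out of the global setting.
codfw = JobQueueGroup(
    "MediaWiki is in read-only mode for maintenance.",
    {"default": {"read_only_reason": False}},
)
codfw.get("exampleJob").push({"title": "Main_Page"})  # no exception raised
```

In this model, a queue type that sets its own reason to false accepts pushes from a read-only datacenter, while any queue type without an override still inherits the global read-only reason and raises on push.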

These errors should disappear shortly once wmf.14 is deployed. Keeping the task open until that happens.

The errors have completely disappeared as of this morning UTC.

mmodell changed the subtype of this task from "Task" to "Production Error". (Aug 28 2019, 11:09 PM)