
Could not enqueue jobs: "Unable to deliver all events: 503: Service Unavailable"
Open, High · Public · PRODUCTION ERROR

Description

Error

MediaWiki version: 1.35.0-wmf.26

message
Could not enqueue jobs: Unable to deliver all events: 503: Service Unavailable

Impact

JobQueue pushes that fail like this could come from any core component or extension. So far it has been seen from the Linter extension, but the backend error does not appear to be specific to it.

See also:
*T235358: Could not enqueue jobs: "Unable to deliver all events: 500: Internal Server Error"
*T248602: Lots of "EventBus: Unable to deliver all events: 504: Gateway Timeout"

Details

Request ID
Xo2tgQpAMMAAAzWSURcAAAEA
Request URL
https://el.wikipedia.org/w/rest.php/el.wikipedia.org/v3/page/pagebundle/153_%CF%80.%CE%A7./8026692
Stack Trace
exception.trace
#0 /srv/mediawiki/php-1.35.0-wmf.26/includes/jobqueue/JobQueue.php(367): JobQueueEventBus->doBatchPush(array, integer)
#1 /srv/mediawiki/php-1.35.0-wmf.26/includes/jobqueue/JobQueue.php(337): JobQueue->batchPush(array, integer)
#2 /srv/mediawiki/php-1.35.0-wmf.26/includes/jobqueue/JobQueueGroup.php(171): JobQueue->push(array)
#3 /srv/mediawiki/php-1.35.0-wmf.26/extensions/Linter/includes/Hooks.php(200): JobQueueGroup->push(array)
#4 /srv/mediawiki/php-1.35.0-wmf.26/includes/Hooks.php(174): MediaWiki\Linter\Hooks::onParserLogLinterData(string, integer, array)
#5 /srv/mediawiki/php-1.35.0-wmf.26/includes/Hooks.php(234): Hooks::callHook(string, array, array, NULL, string)
#6 /srv/mediawiki/php-1.35.0-wmf.26/vendor/wikimedia/parsoid/extension/src/Config/DataAccess.php(344): Hooks::runWithoutAbort(string, array)
#7 /srv/mediawiki/php-1.35.0-wmf.26/vendor/wikimedia/parsoid/src/Logger/LintLogger.php(129): MWParsoid\Config\DataAccess->logLinterData(MWParsoid\Config\PageConfig, array)
#8 /srv/mediawiki/php-1.35.0-wmf.26/vendor/wikimedia/parsoid/src/Parsoid.php(186): Wikimedia\Parsoid\Logger\LintLogger->logLintOutput()
#9 /srv/mediawiki/php-1.35.0-wmf.26/vendor/wikimedia/parsoid/extension/src/Rest/Handler/ParsoidHandler.php(529): Wikimedia\Parsoid\Parsoid->wikitext2html(MWParsoid\Config\PageConfig, array, NULL)
#10 /srv/mediawiki/php-1.35.0-wmf.26/vendor/wikimedia/parsoid/extension/src/Rest/Handler/PageHandler.php(66): MWParsoid\Rest\Handler\ParsoidHandler->wt2html(MWParsoid\Config\PageConfig, array)
#11 /srv/mediawiki/php-1.35.0-wmf.26/includes/Rest/Router.php(353): MWParsoid\Rest\Handler\PageHandler->execute()
#12 /srv/mediawiki/php-1.35.0-wmf.26/includes/Rest/Router.php(308): MediaWiki\Rest\Router->executeHandler(MWParsoid\Rest\Handler\PageHandler)
#13 /srv/mediawiki/php-1.35.0-wmf.26/includes/Rest/EntryPoint.php(138): MediaWiki\Rest\Router->execute(MediaWiki\Rest\RequestFromGlobals)
#14 /srv/mediawiki/php-1.35.0-wmf.26/includes/Rest/EntryPoint.php(105): MediaWiki\Rest\EntryPoint->execute()
#15 /srv/mediawiki/php-1.35.0-wmf.26/rest.php(31): MediaWiki\Rest\EntryPoint::main()
#16 /srv/mediawiki/w/rest.php(3): require(string)
#17 {main}
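
For context, a rough sketch (not code from this task) of how a caller ends up on the code path in the trace above; the job parameters, placeholder page, and logging channel are illustrative only.

// Any component enqueueing a job goes through JobQueueGroup::push(), which with the
// EventBus backend ends up in JobQueueEventBus::doBatchPush() and POSTs the events
// to eventgate; a 503 from that service surfaces as the JobQueueError quoted here.
$title = Title::newMainPage(); // placeholder page
try {
	JobQueueGroup::singleton()->push(
		new JobSpecification( 'RecordLintJob', [ /* lint results */ ], [], $title )
	);
} catch ( JobQueueError $e ) {
	// "Could not enqueue jobs: Unable to deliver all events: 503: Service Unavailable"
	wfDebugLog( 'JobQueue', $e->getMessage() );
}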

Event Timeline

ssastry created this task.Apr 8 2020, 6:02 PM
Restricted Application added a subscriber: Aklapper.Apr 8 2020, 6:02 PM
cscott renamed this task from Could not enqueue jobs: Unable to deliver all events: 503: Service Unavailable to Parsoid Linter: "Could not enqueue jobs: Unable to deliver all events: 503: Service Unavailable".Apr 8 2020, 6:04 PM
Joe added a subscriber: Joe.Apr 9 2020, 4:05 AM

No, it's not. This is just a problem with the service that listens for job submissions, and is unrelated to the other task.

How often do we see this error?

ssastry triaged this task as Low priority.Apr 9 2020, 2:38 PM

https://logstash.wikimedia.org/goto/4eeab475f394eb8823426dd1fef43972 .. roughly 300 events over the last week, so not very common.

But between March 11 and March 16, there are about 18.5K events, with about 12.3K in the 2-day period between March 11 and March 13. This would explain some of the intermittent user complaints we got about linter updates getting lost.

The very first instance in logstash in the last 6 months seems to be from March 9th - https://logstash.wikimedia.org/goto/19bb1d6c8a70e50153370357c53ed35b

Krinkle renamed this task from Parsoid Linter: "Could not enqueue jobs: Unable to deliver all events: 503: Service Unavailable" to Could not enqueue jobs: "Unable to deliver all events: 503: Service Unavailable".Apr 29 2020, 6:51 PM
Krinkle raised the priority of this task from Low to High.
Krinkle updated the task description. (Show Details)
Krinkle added a subscriber: Krinkle.

Job loss seems high priority, unless we're sure the root cause is specific to the Linter extension. But I'm not sure that's the case, given it's a backend error from JobQueue/EventBus. Could CPT investigate and confirm that we're not losing any other jobs?

Pchelolo added a subscriber: Pchelolo.

In general this is happening quite a lot, ~5000 cases in the last 7 days, however, this is both for sending jobs as well as monolog events for the API.

Filtering it further to only get the errors affecting the jobs, the severity seems lower: ~900 cases in 7 days. Still fairly high though.

The error itself is coming from envoy: "upstream connect error or disconnect/reset before headers. reset reason: connection failure". Here's an individual log message. Given that @Joe has already bumped the timeout to 25s and retries to 2 in envoy as part of T248602, I'm not sure whether further increasing the numbers would help. Perhaps this case is different and retries don't kick in?

Joe added a comment.Apr 30 2020, 6:52 AM

> In general this is happening quite a lot, ~5000 cases in the last 7 days, however, this is both for sending jobs as well as monolog events for the API.
>
> Filtering it further to only get the errors affecting the jobs, the severity seems lower: ~900 cases in 7 days. Still fairly high though.
>
> The error itself is coming from envoy: "upstream connect error or disconnect/reset before headers. reset reason: connection failure". Here's an individual log message. Given that @Joe has already bumped the timeout to 25s and retries to 2 in envoy as part of T248602, I'm not sure whether further increasing the numbers would help. Perhaps this case is different and retries don't kick in?

Just to clarify: 5k errors over 7 days, for two systems that get called 17k times per second, is not a lot. More wiki page requests time out in the same period.

Having said that, the number of errors in enqueuing jobs is higher than we'd like, and decidedly higher than the last time I checked - in fact they mostly (630 out of ~900) occurred in a single event on April 24th at 16:30.

At that time, we had what looks like a problem in eventgate, not being able to submit jobs to kafka, see:

https://grafana.wikimedia.org/d/ePFPOkqiz/eventgate?panelId=46&fullscreen&orgId=1&from=1587744225109&to=1587746640392&var-dc=eqiad%20prometheus%2Fk8s&var-service=eventgate-main&var-site=eqiad&var-ops_datasource=eqiad%20prometheus%2Fops&var-kafka_topic=All&var-kafka_broker=All&var-kafka_producer_type=All

So I would assume the problem in this case has to do with eventgate-main and its relationship to Kafka, not with how MediaWiki calls it.

Joe added a comment.Apr 30 2020, 7:24 AM

In another case, we had ~ 100 errors corresponding to a spike in latency from the backend:

https://grafana.wikimedia.org/d/ePFPOkqiz/eventgate?panelId=70&fullscreen&orgId=1&from=1587874476000&to=1587879236000&var-dc=eqiad%20prometheus%2Fk8s&var-service=eventgate-main&var-site=eqiad&var-ops_datasource=eqiad%20prometheus%2Fops&var-kafka_topic=All&var-kafka_broker=All&var-kafka_producer_type=All

I think we need to look a bit more at what's going on in eventgate - it's very possible we're just exceeding the timeout for some requests there.

> a single event on April 24th at 16:30. At that time, we had what looks like a problem in eventgate, not being able to submit jobs to kafka

In the per-broker queue graph, it seems kafka-main1002 is the only broker with the "waiting for response" queue stacking up. That queue is waiting for Kafka (leader and replicas) to ACK that it has received and persisted the produce request.

I don't see anything relevant happening at the same time on the Kafka cluster. The most suspicious thing I can find is a periodic jump, every 10 minutes, in TCP connections in TIME_WAIT. I don't see this behavior on other Kafka clusters, so I'd like to figure out what is causing it. However, it happens pretty regularly and doesn't seem to cause the rdkafka produce queue to fill up at other times, so it probably isn't related.

There are some associated NodeJS GC anomalies around this event, but it is hard to say if they are the cause or just a symptom of the rdkafka queue piling up and eventually getting freed. I don't know a lot about NodeJS GC, but there is a jump in p99 major GC time just before the messages start piling up in the rdkafka queue.

There isn't any big change in the number of messages POSTed to eventgate, so I agree this one doesn't seem to be caused by MediaWiki. Is there possibly some temporary network congestion between eqiad k8s and kafka-main1002? Or could there have been network problems between kafka-main1002 and the other Kafka brokers? (The partition leader has to receive confirmation from replica followers that the message has been received before it will ACK back to the producer client.)

brennen moved this task from Backlog to Logs/Train on the User-brennen board.
jijiki moved this task from Incoming 🐫 to Unsorted on the serviceops board.Aug 17 2020, 11:45 PM

For the record, in the last 7 days there were no longer any matches for "Could not enqueue jobs: Unable to deliver all events: 503: Service Unavailable" in Logstash. Perhaps it was resolved?

Unfortunately, no. The error message got changed to

[{exception_id}] {exception_url} JobQueueError from line 117 of /srv/mediawiki/php-1.36.0-wmf.5/extensions/EventBus/includes/Adapters/JobQueue/JobQueueEventBus.php: Could not enqueue jobs from stream mediawiki.job.RecordLintJob

as a part of unrelated work.

Krinkle removed a project: Patch-For-Review.
Krinkle added subscribers: thcipriani, Clarakosi.

Still seen. Per T260274: JobQueueError: Could not enqueue jobs from stream [stream]

message
Could not enqueue jobs from stream mediawiki.job.wikibase-addUsagesForPage
exception.trace
#0 /srv/mediawiki/php-1.36.0-wmf.4/includes/jobqueue/JobQueue.php(374): MediaWiki\Extension\EventBus\Adapters\JobQueue\JobQueueEventBus->doBatchPush(array, integer)
#1 /srv/mediawiki/php-1.36.0-wmf.4/includes/jobqueue/JobQueue.php(344): JobQueue->batchPush(array, integer)
#2 /srv/mediawiki/php-1.36.0-wmf.4/includes/jobqueue/JobQueueGroup.php(171): JobQueue->push(array)
#3 /srv/mediawiki/php-1.36.0-wmf.4/extensions/Wikibase/client/includes/Hooks/DataUpdateHookHandler.php(174): JobQueueGroup->push(array)
#4 /srv/mediawiki/php-1.36.0-wmf.4/includes/deferred/MWCallableUpdate.php(38): Wikibase\Client\Hooks\DataUpdateHookHandler->Wikibase\Client\Hooks\{closure}()
#5 /srv/mediawiki/php-1.36.0-wmf.4/includes/deferred/DeferredUpdates.php(467): MWCallableUpdate->doUpdate()
#6 /srv/mediawiki/php-1.36.0-wmf.4/includes/deferred/DeferredUpdates.php(344): DeferredUpdates::attemptUpdate(MWCallableUpdate, Wikimedia\Rdbms\LBFactoryMulti)
#7 /srv/mediawiki/php-1.36.0-wmf.4/includes/deferred/DeferredUpdates.php(278): DeferredUpdates::run(MWCallableUpdate, Wikimedia\Rdbms\LBFactoryMulti, Monolog\Logger, BufferingStatsdDataFactory, string)
#8 /srv/mediawiki/php-1.36.0-wmf.4/includes/deferred/DeferredUpdates.php(194): DeferredUpdates::handleUpdateQueue(array, string, integer)
#9 /srv/mediawiki/php-1.36.0-wmf.4/includes/MediaWiki.php(1113): DeferredUpdates::doUpdates(string)
#10 /srv/mediawiki/php-1.36.0-wmf.4/includes/MediaWiki.php(849): MediaWiki->restInPeace()
#11 /srv/mediawiki/php-1.36.0-wmf.4/includes/MediaWiki.php(861): MediaWiki->{closure}()
#12 /srv/mediawiki/php-1.36.0-wmf.4/includes/MediaWiki.php(582): MediaWiki->doPostOutputShutdown()
#13 /srv/mediawiki/php-1.36.0-wmf.4/index.php(53): MediaWiki->run()
#14 /srv/mediawiki/php-1.36.0-wmf.4/index.php(46): wfIndexMain()
#15 /srv/mediawiki/w/index.php(3): require(string)
#16 {main}
  • Shows up on a handful of wikis for a handful of different files
  • No additional errors in the stack trace or message give a clue about the root cause
  • Several different streams show up: mediawiki.job.wikibase-addUsagesForPage, mediawiki.job.cirrusSearchElasticaWrite

Change 622678 had a related patch set uploaded (by Ppchelko; owner: Ppchelko):
[mediawiki/extensions/EventBus@master] Replace use of MultiHttpClient with HTTPRequestFactory

https://gerrit.wikimedia.org/r/622678

This comment was removed by Pchelolo.

Change 622678 abandoned by Ppchelko:
[mediawiki/extensions/EventBus@master] Replace use of MultiHttpClient with HTTPRequestFactory

Reason:

https://gerrit.wikimedia.org/r/622678

Ok, my little refactoring wasn't needed.

This bug has been haunting us for quite a while now; I guess it's time to fix it.

I propose adding retries to MW -> eventgate requests. Given the volume of requests, this issue occurs quite rarely, so it should be fixable by a single retry.

Retrying the requests, especially for eventgate-main, might cause event duplication, but that would not be a problem: duplicated events will share the same .meta.id, so they will be deduplicated by our most basic deduplicators.

I do not see retry support in either MWHttpRequest or MultiHttpClient, so I propose adding a generic, configurable retry policy there and using it in the EventBus extension.
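
To make the proposal concrete, here is a minimal sketch of such a retry wrapper (my illustration, not the eventual patch); postWithRetry, the $postEvents callable, $maxRetries and $backoffMs are made-up names, and a real implementation would sit in the EventBus HTTP layer rather than a free function.

// Minimal sketch of a configurable retry policy around a single eventgate POST.
// The callable is assumed to perform the HTTP request and return the status code.
function postWithRetry( callable $postEvents, int $maxRetries = 1, int $backoffMs = 100 ): int {
	$attempt = 0;
	while ( true ) {
		// Re-sending the identical payload is safe here: duplicated events keep the
		// same .meta.id, so downstream deduplication can drop the extra copies.
		$code = $postEvents();
		if ( $code < 500 || $attempt >= $maxRetries ) {
			return $code; // success, non-retryable error, or retries exhausted
		}
		$attempt++;
		usleep( $backoffMs * 1000 * $attempt ); // simple linear backoff before retrying
	}
}

Whether retrying like this lives in PHP or is delegated to envoy is exactly the trade-off discussed in the comments below.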

Joe added a comment.Aug 27 2020, 6:33 PM

> @Joe proposed an alternative to implementing retries in PHP - use envoy's retry policy: https://www.envoyproxy.io/docs/envoy/latest/configuration/http/http_filters/router_filter#x-envoy-max-retries

Just for the record, we do already retry once on 5xx errors, per envoy's configuration:

https://github.com/wikimedia/puppet/blob/production/hieradata/common/profile/services_proxy/envoy.yaml#L34

but when latencies spike a lot, that's probably not enough, given the timeout is global (i.e. it applies across all retries).

Joe added a comment.Aug 27 2020, 6:34 PM

We can raise the timeout a bit, and also increase the number of retries, probably.

Hm, the same timeout on the MW side is set to 10 seconds. At the very least we should increase it to match envoy's timeout of 25 seconds.
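
For illustration, a minimal sketch of the kind of wmf-config change being discussed (the gerrit change below is the real one). It assumes the EventBus extension's $wgEventServices map with per-service 'url' and 'timeout' entries; the URL, port and exact values shown are placeholders, not copied from the merged patch.

// Hypothetical CommonSettings.php excerpt: raise the EventBus request timeout so it
// no longer undercuts envoy's 25s overall timeout discussed above.
$wgEventServices = [
	'eventgate-main' => [
		'url' => 'http://localhost:6004/v1/events', // local envoy listener (assumed port)
		'timeout' => 25, // previously 10; matched to the envoy-side timeout
	],
];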

Change 622863 had a related patch set uploaded (by Ppchelko; owner: Ppchelko):
[operations/mediawiki-config@master] Increase timeouts for connection to eventgate to match envoy config

https://gerrit.wikimedia.org/r/622863

Change 622863 merged by jenkins-bot:
[operations/mediawiki-config@master] Increase timeouts for connection to eventgate to match envoy config

https://gerrit.wikimedia.org/r/622863

Mentioned in SAL (#wikimedia-operations) [2020-09-21T18:59:18Z] <ppchelko@deploy1001> Synchronized wmf-config/CommonSettings.php: gerrit:622863 T249745 (duration: 00m 56s)

I've increased the timeout on the MW side to match the timeout set by envoy; let's see if the situation improves.

This is still occurring as of Sept 23, 2020.

There was a spike in events related to "Could not enqueue jobs from stream" for various streams at ~12pm UTC: https://logstash.wikimedia.org/goto/7bbe517376de0be509c6a07dd8d027bc

Should this task be merged with T263132: Could not enqueue jobs from stream mediawiki.job.cirrusSearchIncomingLinkCount? Is that task a subset of errors in this task?

The priority of this task seems correct, but it's unclear to me whether any team is working on it.