
Could not enqueue jobs: "Unable to deliver all events: 503: Service Unavailable"
Open, High, Public, PRODUCTION ERROR

Description

Error

MediaWiki version: 1.35.0-wmf.26

message
Could not enqueue jobs: Unable to deliver all events: 503: Service Unavailable

Impact

JobQueue update loss could come from any core component or extension. So far it has been seen from the Linter extension, but the backend error does not appear to be specific to it.

See also:

Details

Request ID
Xo2tgQpAMMAAAzWSURcAAAEA
Request URL
https://el.wikipedia.org/w/rest.php/el.wikipedia.org/v3/page/pagebundle/153_%CF%80.%CE%A7./8026692
Stack Trace
exception.trace
#0 /srv/mediawiki/php-1.35.0-wmf.26/includes/jobqueue/JobQueue.php(367): JobQueueEventBus->doBatchPush(array, integer)
#1 /srv/mediawiki/php-1.35.0-wmf.26/includes/jobqueue/JobQueue.php(337): JobQueue->batchPush(array, integer)
#2 /srv/mediawiki/php-1.35.0-wmf.26/includes/jobqueue/JobQueueGroup.php(171): JobQueue->push(array)
#3 /srv/mediawiki/php-1.35.0-wmf.26/extensions/Linter/includes/Hooks.php(200): JobQueueGroup->push(array)
#4 /srv/mediawiki/php-1.35.0-wmf.26/includes/Hooks.php(174): MediaWiki\Linter\Hooks::onParserLogLinterData(string, integer, array)
#5 /srv/mediawiki/php-1.35.0-wmf.26/includes/Hooks.php(234): Hooks::callHook(string, array, array, NULL, string)
#6 /srv/mediawiki/php-1.35.0-wmf.26/vendor/wikimedia/parsoid/extension/src/Config/DataAccess.php(344): Hooks::runWithoutAbort(string, array)
#7 /srv/mediawiki/php-1.35.0-wmf.26/vendor/wikimedia/parsoid/src/Logger/LintLogger.php(129): MWParsoid\Config\DataAccess->logLinterData(MWParsoid\Config\PageConfig, array)
#8 /srv/mediawiki/php-1.35.0-wmf.26/vendor/wikimedia/parsoid/src/Parsoid.php(186): Wikimedia\Parsoid\Logger\LintLogger->logLintOutput()
#9 /srv/mediawiki/php-1.35.0-wmf.26/vendor/wikimedia/parsoid/extension/src/Rest/Handler/ParsoidHandler.php(529): Wikimedia\Parsoid\Parsoid->wikitext2html(MWParsoid\Config\PageConfig, array, NULL)
#10 /srv/mediawiki/php-1.35.0-wmf.26/vendor/wikimedia/parsoid/extension/src/Rest/Handler/PageHandler.php(66): MWParsoid\Rest\Handler\ParsoidHandler->wt2html(MWParsoid\Config\PageConfig, array)
#11 /srv/mediawiki/php-1.35.0-wmf.26/includes/Rest/Router.php(353): MWParsoid\Rest\Handler\PageHandler->execute()
#12 /srv/mediawiki/php-1.35.0-wmf.26/includes/Rest/Router.php(308): MediaWiki\Rest\Router->executeHandler(MWParsoid\Rest\Handler\PageHandler)
#13 /srv/mediawiki/php-1.35.0-wmf.26/includes/Rest/EntryPoint.php(138): MediaWiki\Rest\Router->execute(MediaWiki\Rest\RequestFromGlobals)
#14 /srv/mediawiki/php-1.35.0-wmf.26/includes/Rest/EntryPoint.php(105): MediaWiki\Rest\EntryPoint->execute()
#15 /srv/mediawiki/php-1.35.0-wmf.26/rest.php(31): MediaWiki\Rest\EntryPoint::main()
#16 /srv/mediawiki/w/rest.php(3): require(string)
#17 {main}

Event Timeline


Job loss seems high priority, unless we're sure the root cause is specific to the Linter extension. But I'm not sure that's the case, given it's a backend error from JobQueue/EventBus. Could CPT investigate and confirm that we're not losing any other jobs?

Pchelolo added a subscriber: Pchelolo.

In general this is happening quite a lot - ~5000 cases in the last 7 days - but this counts both jobs being sent and monolog events for the API.

Filtering it further to only get the errors affecting the jobs, the severity seems lower: ~900 cases in 7 days. Still fairly high though.

The error itself is coming from envoy: upstream connect error or disconnect/reset before headers. reset reason: connection failure. Here's an individual log message. Given that @Joe has already bumped the timeout to 25s and retries to 2 in envoy as part of T248602, I'm not sure further increasing the numbers would help. Perhaps this case is different and retries don't kick in?

Just to clarify: 5k errors over 7 days for two systems that get called 17k times per second is not a lot. More wiki pages time out in the same period.

Having said that, the number of errors in enqueuing jobs is higher than we'd like, and decidedly higher than last time I checked - in fact they're mostly (630 out of ~900) in a single event on April 24th at 16:30.

At that time, we had what looks like a problem in eventgate, not being able to submit jobs to kafka, see:

https://grafana.wikimedia.org/d/ePFPOkqiz/eventgate?panelId=46&fullscreen&orgId=1&from=1587744225109&to=1587746640392&var-dc=eqiad%20prometheus%2Fk8s&var-service=eventgate-main&var-site=eqiad&var-ops_datasource=eqiad%20prometheus%2Fops&var-kafka_topic=All&var-kafka_broker=All&var-kafka_producer_type=All

So I would assume the problem in this case has to do with eventgate-main and its relationship to kafka, not with how MediaWiki calls it.

In another case, we had ~ 100 errors corresponding to a spike in latency from the backend:

https://grafana.wikimedia.org/d/ePFPOkqiz/eventgate?panelId=70&fullscreen&orgId=1&from=1587874476000&to=1587879236000&var-dc=eqiad%20prometheus%2Fk8s&var-service=eventgate-main&var-site=eqiad&var-ops_datasource=eqiad%20prometheus%2Fops&var-kafka_topic=All&var-kafka_broker=All&var-kafka_producer_type=All

I think we need to look a bit more at what's going on in eventgate - it's very possible we're just exceeding the timeout for some requests there.

a single event on April 24th at 16:30. At that time, we had what looks like a problem in eventgate, not being able to submit jobs to kafka

In the per-broker queue graph, it seems kafka-main1002 is the only broker with the waiting-for-response queue stacking up. This queue is waiting for Kafka (leader and replicas) to ACK that it has received and persisted the produce request.

I don't see anything relevant happening at the same time on the Kafka cluster. The most suspicious thing I can find is a 10 minute periodic jump in TCP conns in TIME_WAIT. I don't see this behavior on other Kafka clusters, so I'd like to figure out what is causing this. However, this is happening pretty regularly and doesn't seem to cause the rdkafka produce queue to fill up at other times, so probably isn't related.

There are some associated NodeJS GC anomalies around this event, but it is hard to say if they are the cause or just a symptom of the rdkafka queue piling up and eventually getting freed. I don't know a lot about NodeJS GC, but there is a jump in p99 major GC time just before the messages start piling up in the rdkafka queue.

There isn't any big change in the number of messages POSTed to eventgate, so I agree this one doesn't seem to be caused by MediaWiki. Is there possibly some temporary network congestion between eqiad k8s and kafka-main1002? Or, could there have been network problems between kafka-main1002 and the other Kafka brokers? (The partition leader has to receive confirmation from replica followers that the message has been received before it will ACK back to the producer client.)
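For reference (a sketch, not the exact production settings): with guaranteed delivery the producer asks Kafka to acknowledge only after the leader and its in-sync replicas have persisted the batch, which is why inter-broker trouble around kafka-main1002 would show up as a growing waiting-for-response queue on the EventGate side. Assuming EventGate passes librdkafka producer settings through a kafka.conf-style block in its config (the surrounding key names below are illustrative), the relevant knobs look roughly like this:

# Illustrative librdkafka producer settings, not the actual eventgate-main config.
kafka:
  conf:
    request.required.acks: -1        # -1 ("all"): leader waits for in-sync replicas before ACKing
    request.timeout.ms: 30000        # how long the broker may take to ACK a produce request
    message.timeout.ms: 60000        # how long a message may sit in the local produce queue
    queue.buffering.max.ms: 5        # small linger time used to batch produce requests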

For the record, in the last 7 days there have been no more matches for Could not enqueue jobs: Unable to deliver all events: 503: Service Unavailable in Logstash. Perhaps it was resolved?

Unfortunately, no. The error message got changed to

[{exception_id}] {exception_url} JobQueueError from line 117 of /srv/mediawiki/php-1.36.0-wmf.5/extensions/EventBus/includes/Adapters/JobQueue/JobQueueEventBus.php: Could not enqueue jobs from stream mediawiki.job.RecordLintJob

as a part of unrelated work.

Krinkle removed a project: Patch-For-Review.
Krinkle added subscribers: thcipriani, Clarakosi.

Still seen. Per T260274: JobQueueError: Could not enqueue jobs from stream [stream]

message
Could not enqueue jobs from stream mediawiki.job.wikibase-addUsagesForPage
exception.trace
#0 /srv/mediawiki/php-1.36.0-wmf.4/includes/jobqueue/JobQueue.php(374): MediaWiki\Extension\EventBus\Adapters\JobQueue\JobQueueEventBus->doBatchPush(array, integer)
#1 /srv/mediawiki/php-1.36.0-wmf.4/includes/jobqueue/JobQueue.php(344): JobQueue->batchPush(array, integer)
#2 /srv/mediawiki/php-1.36.0-wmf.4/includes/jobqueue/JobQueueGroup.php(171): JobQueue->push(array)
#3 /srv/mediawiki/php-1.36.0-wmf.4/extensions/Wikibase/client/includes/Hooks/DataUpdateHookHandler.php(174): JobQueueGroup->push(array)
#4 /srv/mediawiki/php-1.36.0-wmf.4/includes/deferred/MWCallableUpdate.php(38): Wikibase\Client\Hooks\DataUpdateHookHandler->Wikibase\Client\Hooks\{closure}()
#5 /srv/mediawiki/php-1.36.0-wmf.4/includes/deferred/DeferredUpdates.php(467): MWCallableUpdate->doUpdate()
#6 /srv/mediawiki/php-1.36.0-wmf.4/includes/deferred/DeferredUpdates.php(344): DeferredUpdates::attemptUpdate(MWCallableUpdate, Wikimedia\Rdbms\LBFactoryMulti)
#7 /srv/mediawiki/php-1.36.0-wmf.4/includes/deferred/DeferredUpdates.php(278): DeferredUpdates::run(MWCallableUpdate, Wikimedia\Rdbms\LBFactoryMulti, Monolog\Logger, BufferingStatsdDataFactory, string)
#8 /srv/mediawiki/php-1.36.0-wmf.4/includes/deferred/DeferredUpdates.php(194): DeferredUpdates::handleUpdateQueue(array, string, integer)
#9 /srv/mediawiki/php-1.36.0-wmf.4/includes/MediaWiki.php(1113): DeferredUpdates::doUpdates(string)
#10 /srv/mediawiki/php-1.36.0-wmf.4/includes/MediaWiki.php(849): MediaWiki->restInPeace()
#11 /srv/mediawiki/php-1.36.0-wmf.4/includes/MediaWiki.php(861): MediaWiki->{closure}()
#12 /srv/mediawiki/php-1.36.0-wmf.4/includes/MediaWiki.php(582): MediaWiki->doPostOutputShutdown()
#13 /srv/mediawiki/php-1.36.0-wmf.4/index.php(53): MediaWiki->run()
#14 /srv/mediawiki/php-1.36.0-wmf.4/index.php(46): wfIndexMain()
#15 /srv/mediawiki/w/index.php(3): require(string)
#16 {main}
  • Shows up on a handful of wikis for a handful of different files
  • No additional errors in the stack trace or message give a clue about the root cause
  • Several different streams show up: mediawiki.job.wikibase-addUsagesForPage, mediawiki.job.cirrusSearchElasticaWrite

Change 622678 had a related patch set uploaded (by Ppchelko; owner: Ppchelko):
[mediawiki/extensions/EventBus@master] Replace use of MultiHttpClient with HTTPRequestFactory

https://gerrit.wikimedia.org/r/622678

Change 622678 abandoned by Ppchelko:
[mediawiki/extensions/EventBus@master] Replace use of MultiHttpClient with HTTPRequestFactory

Reason:

https://gerrit.wikimedia.org/r/622678

Ok, my little refactoring wasn't needed.

This bug has been haunting us for quite a while now, I guess it's time to fix it.

I propose to add retries to MW -> eventgate requests. Given the volume of requests, this issue occurs quite rarely, so it should be fixable with a single retry.

Retrying the requests, especially for eventgate-main, might cause event duplication, but that would not be a problem, since duplicated events will share the same .meta.id and will be deduplicated by even our most basic deduplicators.
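For context, deduplication here keys off the event envelope's .meta.id: a retried POST re-sends the same envelope, so consumers or intermediate deduplicators can drop the second copy. A rough sketch of a job event envelope, with made-up values and with the job-specific fields being assumptions rather than the exact mediawiki/job schema:

# Illustrative event envelope; values and job-specific fields are assumptions.
$schema: /mediawiki/job/1.0.0
meta:
  stream: mediawiki.job.RecordLintJob
  id: 00000000-0000-0000-0000-000000000001   # stable per event; reused on retry, used for dedup
  dt: "2021-04-19T18:21:20Z"
  domain: el.wikipedia.org
  request_id: Xo2tgQpAMMAAAzWSURcAAAEA
type: RecordLintJob
params: {}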

I do not see retry support in either MWHttpRequest or MultiHttpClient, so I propose adding a generic, configurable retry policy there and using it in the EventBus extension.

@Joe proposed an alternative to implementing retries in PHP - use envoy's retry policy: https://www.envoyproxy.io/docs/envoy/latest/configuration/http/http_filters/router_filter#x-envoy-max-retries

Just for the record, we do already retry once on 5xx errors, per envoy's configuration:

https://github.com/wikimedia/puppet/blob/production/hieradata/common/profile/services_proxy/envoy.yaml#L34

but when latencies spike a lot that's probably not enough, given the timeout is global (i.e. shared across all retries).
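For context, the route config in question has roughly this shape (illustrative values in envoy's RouteAction/RetryPolicy terms, not the exact services_proxy hiera): with only a route-level timeout, the 25s budget is shared between the original attempt and any retries, so one slow attempt can exhaust it before a retry ever fires.

# Sketch of the existing envoy route behaviour (example values only).
route:
  cluster: eventgate-main
  timeout: 25s              # single budget covering the original request *and* all retries
  retry_policy:
    retry_on: 5xx           # retry on upstream 5xx / connect failures
    num_retries: 2
    # no per_try_timeout: a hung first attempt can consume the whole 25s,
    # so the configured retries may never actually happen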

We can raise the timeout a bit, and also increase the number of retries, probably.

Hm, this same timeout on the MW side is set to 10 seconds. At least we should increase it to match envoy's timeout of 25 seconds.

Change 622863 had a related patch set uploaded (by Ppchelko; owner: Ppchelko):
[operations/mediawiki-config@master] Increase timeouts for connection to eventgate to match envoy config

https://gerrit.wikimedia.org/r/622863

Change 622863 merged by jenkins-bot:
[operations/mediawiki-config@master] Increase timeouts for connection to eventgate to match envoy config

https://gerrit.wikimedia.org/r/622863

Mentioned in SAL (#wikimedia-operations) [2020-09-21T18:59:18Z] <ppchelko@deploy1001> Synchronized wmf-config/CommonSettings.php: gerrit:622863 T249745 (duration: 00m 56s)

I've increased the timeout on MW side to match the timeout set by envoy, let's see if the situation improves.

This is still occurring as of Sept 23, 2020

There was a spike in events related to "Could not enqueue jobs from stream" for various streams at ~12pm UTC: https://logstash.wikimedia.org/goto/7bbe517376de0be509c6a07dd8d027bc

Should this task be merged with T263132: Could not enqueue jobs from stream mediawiki.job.cirrusSearchIncomingLinkCount? Is that task a subset of errors in this task?

The priority of this task seems correct, but I'm unclear if any team is working on it?

Change 643783 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[mediawiki/extensions/EventBus@master] Remove unrelated stream name from JobQueue backend exception

https://gerrit.wikimedia.org/r/643783

So, I just encountered a 503 service unavailable from eventgate-main in eqiad when producing a canary event for monitoring purposes.

Dec 14 14:30:05 an-launcher1002 produce_canary_events[28679]:   HttpResult(failure)  status: 503 message: Service Unavailable. body: upstream connect error or disconnect/reset before headers. reset reason: local reset

An eventgate-main container on one of the eqiad pods was restarted shortly after I got this error.

[2020-12-14T14:30:53.737Z]  INFO: eventgate-wikimedia/1 on eventgate-main-production-75576c9fc-rjg7n: master(1) initializing 1 workers (levelPath=info/service-runner)

I don't have any error logs as to why the container needed to be restarted. k8s reports that in eqiad, the containers have been restarted many times since the last deploy, 26 days ago:

NAME                                       READY  STATUS   RESTARTS  AGE
eventgate-main-production-75576c9fc-5sfld  3/3    Running  20        26d
eventgate-main-production-75576c9fc-rjg7n  3/3    Running  36        26d
eventgate-main-production-75576c9fc-w78sj  3/3    Running  23        26d

This restart and 503 also correspond with a p99 GC spike.

Screen Shot 2020-12-14 at 10.06.12.png (1×3 px, 424 KB)

There are no restarts in codfw. Perhaps we need more replica pods in eqiad?

Change 649360 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/deployment-charts@master] eventgate-main - increase replicas from 3 to 5

https://gerrit.wikimedia.org/r/649360

https://grafana.wikimedia.org/d/g0GUXaJMk/t249745?orgId=1 says that instead it might make sense to increase the memory limit. There are 3 spikes above the limit in the last 30 days, leading to extra CPU usage and probably more frequent GC cycles

Huh, interesting. Bumped from 300Mi to 600Mi in https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/649360.

(I can do that in the helmfile values.yaml to override the chart's values.yaml, right?)
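For reference, a sketch of what such an override might look like in the eventgate-main helmfile values (standard Kubernetes resources fields; the surrounding key names are assumptions about the chart layout, not the actual production file):

# Hypothetical helmfile values.yaml override for eventgate-main.
main_app:
  requests:
    memory: 300Mi
    cpu: 200m
  limits:
    memory: 600Mi   # bumped from 300Mi so GC spikes don't push the container over its limit
    cpu: 1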

https://grafana.wikimedia.org/d/g0GUXaJMk/t249745?orgId=1 says that instead it might make sense to increase the memory limit. There are 3 spikes above the limit in the last 30 days, leading to extra CPU usage and probably more frequent GC cycles

To further support this, you can see your containers being killed from time to time due to failing liveness probes (readiness probes fail on a regular basis as well): https://logstash-next.wikimedia.org/goto/1abd5531232bd291b57af64715e2af0c

Change 649360 merged by Ottomata:
[operations/deployment-charts@master] eventgate-main - increase mem limit to 600Mi

https://gerrit.wikimedia.org/r/649360

Mentioned in SAL (#wikimedia-operations) [2020-12-15T16:47:46Z] <ottomata> bumped eventate-main memory limits from 300M to 600M - T249745

Change 643783 merged by jenkins-bot:
[mediawiki/extensions/EventBus@master] Remove unrelated stream name from JobQueue backend exception

https://gerrit.wikimedia.org/r/643783

Noticed a spike of these around 2021-01-19 20:55 UTC:

Could not enqueue jobs from stream mediawiki.job.wikibase-addUsagesForPage
Could not enqueue jobs from stream mediawiki.job.RecordLintJob
Could not enqueue jobs from stream mediawiki.job.cirrusSearchElasticaWrite
Could not enqueue jobs from stream mediawiki.job.LocalGlobalUserPageCacheUpdateJob
Could not enqueue jobs from stream mediawiki.job.cirrusSearchLinksUpdate
Could not enqueue jobs from stream mediawiki.job.constraintsRunCheck

Still seen. Looks like we're consistently losing several thousand jobs a day.

That's presumably after any internal retry from Envoy middleware and EventBus client.

Screenshot 2021-04-01 at 01.09.27.png (1×2 px, 290 KB)

Ok, so the pod restarts were not the problem then:

kube_env eventgate-main eqiad; kubectl get pods
NAME                                         READY   STATUS    RESTARTS   AGE
eventgate-main-canary-7694f6b64d-swks4       3/3     Running   0          9d
eventgate-main-production-756758d58f-6p6rm   3/3     Running   0          9d
eventgate-main-production-756758d58f-bj27w   3/3     Running   0          9d
eventgate-main-production-756758d58f-xhj2w   3/3     Running   0          9d

Change 679855 had a related patch set uploaded (by Ppchelko; author: Ppchelko):

[operations/deployment-charts@master] Envoy: set per_try_timeout for eventgate-main.

https://gerrit.wikimedia.org/r/679855

Change 680372 had a related patch set uploaded (by Ppchelko; author: Ppchelko):

[operations/puppet@production] Envoy: set per_try_timeout for eventgate-main.

https://gerrit.wikimedia.org/r/680372

Change 680372 merged by Alexandros Kosiaris:

[operations/puppet@production] Envoy: set per_try_timeout for eventgate-main.

https://gerrit.wikimedia.org/r/680372

Change 679855 merged by jenkins-bot:

[operations/deployment-charts@master] Envoy: set per_try_timeout for eventgate-main.

https://gerrit.wikimedia.org/r/679855

Change 681132 had a related patch set uploaded (by Ppchelko; author: Ppchelko):

[operations/mediawiki-config@master] [EventBus] Make eventage-main timeout consistent with envoy

https://gerrit.wikimedia.org/r/681132

Change 681132 merged by jenkins-bot:

[operations/mediawiki-config@master] [EventBus] Make eventage-main timeout consistent with envoy

https://gerrit.wikimedia.org/r/681132

Mentioned in SAL (#wikimedia-operations) [2021-04-19T18:21:20Z] <ppchelko@deploy1002> Synchronized wmf-config/CommonSettings.php: T249745 [EventBus] Make eventage-main timeout consistent with envoy (duration: 00m 56s)

After deploying all the improved timeouts, we're down to 4 jobs lost in a day: https://logstash.wikimedia.org/goto/865a45a301f303638167191fa11511be

There are other events being lost, for example analytics ones, but I didn't touch the timeout settings for those.

Let's monitor for a few days and see what happens.

That's great!

Do not divide the skin of a bear we didn't kill yet. Let's monitor the situation for a few days.

After that, I want to make the following improvements:

  • Include the target event service in the logs
  • Indicate to EventBus that we're using ?hasty mode and lower the log level to 'warning' in that case - if we explicitly opted in to non-guaranteed delivery, we shouldn't scream 'error' when an event wasn't delivered.

Do not divide the skin of a bear we didn't kill yet.

٩(^‿^)۶
T120242 is the real bear

ok, a bit more time passed and we've lost 24 jobs in 7 days.

@Krinkle this is a significant improvement over the previous state, but given it's a production error - do we need to aim for 0?

@Pchelolo I'm not sure.

Failure is OK. I think in a production environment of our scale, 100% success is unrealistic and may even be damaging. As such, it seems fair to have known failure modes when our service is degraded (perhaps better than not degrading and failing fully for all requests). If we can make components of the infrastructure optional, that seems good.

Failure is not OK. However, jobs are not optional from a business-logic perspective. That's why failure to submit a job results in a fatal error that should e.g. propagate to the user and result in rollback of any write action the user has performed. Except we usually don't do that currently, because during the jobrunner-Redis era we were (possibly incorrectly) under the illusion that, with sufficient retries from the PHP side, we could very nearly guarantee job delivery. So, except for very few "really important" jobs, almost all jobs are submitted post-send from a deferred update, which means we've already committed the DB transaction and told the user everything is A-OK. The caller has passed the point of no return.

Failure is surprising. I mean this in the nicest possible way, but literally the only purpose of this stack (MW EventBus -> HTTP -> Envoy-tx -> Envoy-recv -> EventGate -> Kafka) is to store a string of JSON for (usually) less than one or two seconds. I can imagine chains this long, or longer, that scale well, stream mostly in real time, with minimal per-packet overhead. However, while I know very little about EventGate and Kafka, the sense that I'm getting is that the use of this proxy, and Node.js, and the congested buffers or unreleased stale memory problems we're seeing, are causing problems that aren't natural or inevitable results of e.g. saturating a physical resource like the node's allocated CPU or RAM or network bandwidth, but rather problems that are more related to avoidable complexity or design choices. Again, I know very little of this stack, but is there some truth to this? Or is it avoidable with exactly what we have, but there's a config problem we just haven't found yet?

Continuing for one more second in this naive mood, it feels to me like one or two round-robin'ed Redis nodes with a pubsub channel might offer lower latency, higher uptime, and almost no loss of submissions during regular operation. From there they could be persisted, offline, in Kafka to support remembering offsets, partitioning consumers, replication/switchover to other DCs, etc. This is not a serious suggestion to use Redis on top of Kafka. But can you confidently say, and explain, that the uptime / submission loss we have is better than that would be? No need to explain to me, but just to think about in theory.

Failure is still OK, but also still surprising. Let me circle back to where I started. A zero error rate is not a fair expectation of any service. As such, I think it's totally reasonable for the end of the chain (EventGate, and Kafka) to occasionally fail quickly, drop the thing completely, and require a retry from the start. But how far up the chain is the "start"? From my perspective as a consumer of JobQueue, I'd prefer that the "start" not be the user's web browser and their page view or edit request. Instead, it should perhaps be the internal JobQueue-push request. And, as I understand it, we are already doing several retries at one or multiple levels in this stack.

That begs the question: why are we still seeing delivery failures... from the MW side? Are these astronomical coincidences where the same rare failure happens multiple times in a row? If so, I guess we have a very high rate of submissions at which those coincidences can happen, and perhaps that's OK then. But I suspect that perhaps this is not the case, and rather that we are suffering from regular day-to-day occurrences (not maintenance, exceptional outages, or unusual load) during which the EventGate service is just not accepting anything from anyone for more than a second at a time. Is that true? If so, that seems like something we probably shouldn't tolerate for a service that Tier-1 production traffic depends on. If this is infeasible to avoid with the current technology, then my naive mindset would think that perhaps this needs to be a multi-node cluster with load balancing in front to avoid such regular outages. (And yes, a few seconds ago he was complaining about complexity....)


EDIT: I did not see the rate was 24 in 7 days. I guess that's quite low. I think you can ignore this comment if the mitigation that got us to that low rate involved something other than (further) increasing of timeouts/retries.

I guess I should file a follow-up task to audit our JobQueue pushes and reconsider the choice of PRESEND vs POSTSEND. In my mind, this is more or less equivalent to a DB transaction where you've called commit() and got a success, but it was never saved in the first place (vs lost due to special circumstances). If we conclude that assuming success is wrong, then that of course will make latency a more significant point. Originally, when we moved to EventGate/Kafka, latency wasn't reviewed or worried about (for end-users) since we hid push latency behind POSTSEND. With timeouts as big as we have today, improving them will likely involve the same kind of work as improving the delivery rate. So perhaps it's still basically the same problem. With low latencies it seems natural to just do them PRESEND, and then failures will be more acceptable, with end-users knowing about them and considering a retry from their perspective. (But also, they'll be more visible, and so we'll want fewer of them, but the current rate seems fine for that I guess.) To be continued...

Thanks for the detailed reply. I agree - something is still fishy here and does need to be resolved, so let's keep poking.

So, what I did was add a per-retry timeout in envoy for this request path, because without it the timeout on the 'envoy -> eventgate' request was counted against the overall envoy timeout, so retries were not actually made. This is actually something we should consider making a default setting for the envoy service proxy - nothing specific to eventgate is happening here. The per-try timeout is now slightly less than 1/2 of the overall timeout, so envoy makes 1 retry.
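Concretely, the change amounts to adding a per_try_timeout next to the existing retry policy, so each attempt gets its own budget inside the overall route timeout (illustrative values; the real settings live in the services_proxy / deployment-charts configs):

# Sketch of the envoy retry policy with a per-try timeout (example values).
route:
  cluster: eventgate-main
  timeout: 25s                # overall budget, unchanged
  retry_policy:
    retry_on: 5xx
    num_retries: 2
    per_try_timeout: 12s      # slightly less than half the overall timeout, so a hung
                              # first attempt is cut off early enough for one real retry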

Since this change caused such a dramatic decrease in the error rate, it suggests that we now only see the issue when both attempts hit a struggling eventgate node and time out - the probability of that is much, much lower than hitting one, but not impossible. I know the next workaround: make the per-try timeout in envoy 1/3 of the overall timeout, cutting the probability of failure even further.

Given that we see failures both in normal mode and in hasty mode (where we do not wait for an ack from Kafka and just fire and forget), Kafka doesn't seem to be the issue. Which leaves eventgate as the most probable cause. So, circling back to @Ottomata

ok, a bit more time passed and we've lost 24 jobs in 7 days.

@Krinkle this is a significant improvement over the previous state, but given it's a production error - do we need to aim for 0?

No, aiming for zero is pointless and harmful. If we can't live with 24 failures in 7 days it means we're less tolerant than we are with errors in responses to users, which would be absurd, as Timo already pointed out.

This doesn't mean MediaWiki shouldn't try to improve the situation, e.g. by handling a failure to submit a job by saving it somewhere (a specific DB table?) so we can replay it later. At the current failure rate, this would guarantee the jobs get executed, at a negligible cost in terms of resources.

If anything, I think we should invest resources in doing something like this, rather than adding more retries to envoyproxy. It would also help ease the problem of the reliability of the event stream (see for instance T120242).

Given that we see failures both in normal mode and in hasty mode (where we do not wait for an ack from Kafka and just fire and forget), Kafka doesn't seem to be the issue. Which leaves eventgate as the most probable cause. So, circling back to @Ottomata

I think we might want to go back and check whether the increasing volume of messages eventgate now handles is simply too much for its current resources. I'd go down the following path:

  • Up the resource limits for eventgate in the chart for eventgate-main
  • Increase the number of pods we use for it

before we go any further in the direction of investigating what to do with eventgate. It's become quite a central part of our infrastructure and I'd love to see a deeper investigation into what happens when it has those moments where it can't respond to requests.

Increase the number of pods we use for it (EventGate)

FYI, we already did a bit of this for this task after https://phabricator.wikimedia.org/T249745#6689046. Perhaps we need more??? :)

problems that are more related to avoidable complexity or design choices

EventGate exists mostly as a convenience to avoid problems that PHP has. As does envoy. Both help with connection pooling; envoy is connection pooling https, and EventGate is connection pooling kafka (and of course it does other stuff too :) )

For production services that don't have PHPs downsides, there is no need for EventGate. We do need some things (validation, etc.) to happen to events before they are produced to Kafka, but there's no reason this couldn't be handled by a client side library. This is what we are doing for JVM based event producers. If we had a performant Kafka PHP client (and could use it during a user HTTP request), I think we'd avoid EventGate altogether.

I'd love to see a deeper investigation

T264021: > ~1 request/second to intake-logging.wikimedia.org times out at the traffic/service interface could be related too.

So, 503 is solved - we have had zero 503 errors for jobs since https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/724173 was deployed. Now we only have 504 Gateway Timeout errors. I shall poke at those a little.

While the instability and latency problem never fully went away, it appears to be getting worse as of late, sometimes drowning out other regressions in the error logs.

Search "Could not enqueue jobs" on the logstash mediawiki-errors dashboard, with spikes of thousands of lost jobs - afaik without recovery.

Screenshot 2022-07-13 at 17.50.28.png (444×952 px, 44 KB)