
Could not enqueue jobs: "Unable to deliver all events: 503: Service Unavailable"
Open, High · Public · PRODUCTION ERROR

Description

Error

MediaWiki version: 1.35.0-wmf.26

message
Could not enqueue jobs: Unable to deliver all events: 503: Service Unavailable

Impact

JobQueue updates are being lost. The lost jobs could come from any core component or extension; so far this has been seen from the Linter extension, but the backend error does not appear to be specific to it.

See also:

Details

Request ID
Xo2tgQpAMMAAAzWSURcAAAEA
Request URL
https://el.wikipedia.org/w/rest.php/el.wikipedia.org/v3/page/pagebundle/153_%CF%80.%CE%A7./8026692
Stack Trace
exception.trace
#0 /srv/mediawiki/php-1.35.0-wmf.26/includes/jobqueue/JobQueue.php(367): JobQueueEventBus->doBatchPush(array, integer)
#1 /srv/mediawiki/php-1.35.0-wmf.26/includes/jobqueue/JobQueue.php(337): JobQueue->batchPush(array, integer)
#2 /srv/mediawiki/php-1.35.0-wmf.26/includes/jobqueue/JobQueueGroup.php(171): JobQueue->push(array)
#3 /srv/mediawiki/php-1.35.0-wmf.26/extensions/Linter/includes/Hooks.php(200): JobQueueGroup->push(array)
#4 /srv/mediawiki/php-1.35.0-wmf.26/includes/Hooks.php(174): MediaWiki\Linter\Hooks::onParserLogLinterData(string, integer, array)
#5 /srv/mediawiki/php-1.35.0-wmf.26/includes/Hooks.php(234): Hooks::callHook(string, array, array, NULL, string)
#6 /srv/mediawiki/php-1.35.0-wmf.26/vendor/wikimedia/parsoid/extension/src/Config/DataAccess.php(344): Hooks::runWithoutAbort(string, array)
#7 /srv/mediawiki/php-1.35.0-wmf.26/vendor/wikimedia/parsoid/src/Logger/LintLogger.php(129): MWParsoid\Config\DataAccess->logLinterData(MWParsoid\Config\PageConfig, array)
#8 /srv/mediawiki/php-1.35.0-wmf.26/vendor/wikimedia/parsoid/src/Parsoid.php(186): Wikimedia\Parsoid\Logger\LintLogger->logLintOutput()
#9 /srv/mediawiki/php-1.35.0-wmf.26/vendor/wikimedia/parsoid/extension/src/Rest/Handler/ParsoidHandler.php(529): Wikimedia\Parsoid\Parsoid->wikitext2html(MWParsoid\Config\PageConfig, array, NULL)
#10 /srv/mediawiki/php-1.35.0-wmf.26/vendor/wikimedia/parsoid/extension/src/Rest/Handler/PageHandler.php(66): MWParsoid\Rest\Handler\ParsoidHandler->wt2html(MWParsoid\Config\PageConfig, array)
#11 /srv/mediawiki/php-1.35.0-wmf.26/includes/Rest/Router.php(353): MWParsoid\Rest\Handler\PageHandler->execute()
#12 /srv/mediawiki/php-1.35.0-wmf.26/includes/Rest/Router.php(308): MediaWiki\Rest\Router->executeHandler(MWParsoid\Rest\Handler\PageHandler)
#13 /srv/mediawiki/php-1.35.0-wmf.26/includes/Rest/EntryPoint.php(138): MediaWiki\Rest\Router->execute(MediaWiki\Rest\RequestFromGlobals)
#14 /srv/mediawiki/php-1.35.0-wmf.26/includes/Rest/EntryPoint.php(105): MediaWiki\Rest\EntryPoint->execute()
#15 /srv/mediawiki/php-1.35.0-wmf.26/rest.php(31): MediaWiki\Rest\EntryPoint::main()
#16 /srv/mediawiki/w/rest.php(3): require(string)
#17 {main}

Related Objects

Mentioned In
T363587: [Event Platform] Instrument EventBus with prometheus MW Statslib
T362977: WDQS updater missed some updates
T120242: Eventually-Consistent MediaWiki state change events | MediaWiki events as source of truth
T349823: [Event Platform] Gracefully handle pod termination in eventgate Helm chart
T306181: intake-analytics is responsible for up to a 85% of varnish backend fetch errors
T304841: Page was not purged at edit
T290529: JobQueueEventBus->doBatchPush times out while trying to send Echo notifications about a Flow edit
T215001: Revisions missing from mediawiki_revision_create
T260274: JobQueueError: Could not enqueue jobs from stream [stream]
T248602: Lots of "EventBus: Unable to deliver all events: 504: Gateway Timeout"
T249705: Intermittent internal API errors with Flow
Mentioned Here
T317045: [Epic] Re-architect the Search Update Pipeline
T350180: Upgrade prom-client in NodeJS service-runner and enable collectDefaultMetrics
T349823: [Event Platform] Gracefully handle pod termination in eventgate Helm chart
T338357: Pushing jobs to jobqueue is slow again
T264021: > ~1 request/second to intake-logging.wikimedia.org times out at the traffic/service interface
T120242: Eventually-Consistent MediaWiki state change events | MediaWiki events as source of truth
T263132: Could not enqueue jobs from stream mediawiki.job.cirrusSearchIncomingLinkCount
T260274: JobQueueError: Could not enqueue jobs from stream [stream]
T235358: Could not enqueue jobs: "Unable to deliver all events: 500: Internal Server Error"
T248602: Lots of "EventBus: Unable to deliver all events: 504: Gateway Timeout"
T249705: Intermittent internal API errors with Flow

Event Timeline


Change 679855 had a related patch set uploaded (by Ppchelko; author: Ppchelko):

[operations/deployment-charts@master] Envoy: set per_try_timeout for eventgate-main.

https://gerrit.wikimedia.org/r/679855

Change 680372 had a related patch set uploaded (by Ppchelko; author: Ppchelko):

[operations/puppet@production] Envoy: set per_try_timeout for eventgate-main.

https://gerrit.wikimedia.org/r/680372

Change 680372 merged by Alexandros Kosiaris:

[operations/puppet@production] Envoy: set per_try_timeout for eventgate-main.

https://gerrit.wikimedia.org/r/680372

Change 679855 merged by jenkins-bot:

[operations/deployment-charts@master] Envoy: set per_try_timeout for eventgate-main.

https://gerrit.wikimedia.org/r/679855

Change 681132 had a related patch set uploaded (by Ppchelko; author: Ppchelko):

[operations/mediawiki-config@master] [EventBus] Make eventage-main timeout consistent with envoy

https://gerrit.wikimedia.org/r/681132

Change 681132 merged by jenkins-bot:

[operations/mediawiki-config@master] [EventBus] Make eventage-main timeout consistent with envoy

https://gerrit.wikimedia.org/r/681132

Mentioned in SAL (#wikimedia-operations) [2021-04-19T18:21:20Z] <ppchelko@deploy1002> Synchronized wmf-config/CommonSettings.php: T249745 [EventBus] Make eventage-main timeout consistent with envoy (duration: 00m 56s)
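For context, the MediaWiki-side timeout lives in the EventBus $wgEventServices setting in mediawiki-config. A minimal illustration of its shape (the URL and timeout values here are placeholders, not the production settings):

```php
<?php
// Illustrative only - not the production configuration. The EventBus extension
// reads the per-service intake URL and HTTP timeout from $wgEventServices.
$wgEventServices = [
	'eventgate-main' => [
		'url' => 'http://localhost:6005/v1/events', // placeholder local envoy listener
		'timeout' => 10, // seconds; the change above aligns this with envoy's route timeout
	],
];
```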

After deploying all the improved timeouts, we're down to 4 jobs lost in a day: https://logstash.wikimedia.org/goto/865a45a301f303638167191fa11511be

There are other events being lost as well, for example analytics ones, but I didn't touch the timeout settings for those.

Let's monitor for a few days and see what happens.

That's great!

Do not divide the skin of a bear we didn't kill yet. Let's monitor the situation for a few days.

After that, I want to make the following improvements:

  • Include the target event service into the logs
  • Indicate to EventBus that we're using ?hasty mode and lower the log level to 'warning' in that case - if we explicitly opted in to non-guaranteed delivery, we shouldn't scream "error" when an event wasn't delivered (a sketch of this follows below).
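A minimal sketch of what those two items could look like together (a hypothetical helper, not existing EventBus code; it assumes only a PSR-3 logger):

```php
<?php
use Psr\Log\LoggerInterface;
use Psr\Log\LogLevel;

/**
 * Hypothetical helper: log a delivery failure, including the target event
 * service, and downgrade to 'warning' when the caller opted into hasty
 * (fire-and-forget) delivery.
 */
function logDeliveryFailure(
	LoggerInterface $logger,
	string $eventService,
	bool $hasty,
	string $error
): void {
	$level = $hasty ? LogLevel::WARNING : LogLevel::ERROR;
	$logger->log( $level, 'Unable to deliver all events to {service}: {error}', [
		'service' => $eventService,
		'error' => $error,
	] );
}
```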

Do not divide the skin of a bear we didn't kill yet.

٩(^‿^)۶
T120242 is the real bear

ok, a bit more time passed and we've lost 24 jobs in 7 days.

@Krinkle this is a significant improvement over the previous state, but given it's a production error - do we need to aim for 0?

@Pchelolo I'm not sure.

Failure is OK. I think in a production environment of our scale, 100% success is unrealistic and may even be damaging. As such, it seems fair to have known failure modes when our service is degraded (perhaps better than not degrading and instead failing fully for all requests). If we can make components of the infrastructure optional, that seems good.

Failure is not OK. However, jobs are not optional from a business-logic perspective. That is why failure to submit a job should result in a fatal error that, for example, propagates to the user and rolls back any write action the user has performed. Except we don't (usually) do that currently, because during the jobrunner-Redis era we were (possibly incorrectly) under the illusion that with sufficient retries from the PHP side we could very nearly guarantee job delivery. So, except for a very few "really important" jobs, almost all jobs are submitted post-send from a deferred update, which means we've already committed the DB transaction and told the user everything is A-OK. The caller has passed the point of no return.
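For illustration, the two submission paths described above, as a caller sees them (a sketch only, assuming a JobQueueGroup instance and a job specification are already in hand):

```php
<?php
// Illustrative sketch, not the actual call sites.
/** @var JobQueueGroup $jobQueueGroup */
/** @var IJobSpecification $job */

// Pre-send push: JobQueueError is thrown while the request (and its DB
// transaction) is still in flight, so the write can be rolled back and an
// error shown to the user.
$jobQueueGroup->push( $job );

// Post-send lazy push: the job is queued from a deferred update after the
// response has been sent - the "point of no return" described above. A failure
// here can only be logged, never reported back to the user.
$jobQueueGroup->lazyPush( $job );
```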

Failure is surprising. I mean this in the nicest possible way, but literally the only purpose of this stack (MW EventBus -> HTTP -> Envoy-tx -> Envoy-recv -> EventGate -> Kafka) is to store a string of JSON for (usually) less than one or two seconds. I can imagine chains this long, or longer, that scale well, stream mostly in real time, with minimal per-packet overhead. However, while I know very little about EventGate and Kafka, the sense I'm getting is that the use of this proxy, and Node.js, and the congested buffers or unreleased stale memory problems we're seeing, are causing problems that aren't the natural or inevitable result of saturating a physical resource like the node's allocated CPU, RAM, or network bandwidth, but rather stem from avoidable complexity or design choices. Again, I know very little of this stack, but is there some truth to this? Or is it avoidable with exactly what we have, and there's a config problem we just haven't found yet?

Continuing for one more second in this naive mood, it feels to me like one or two round-robin'ed Redis nodes with a pubsub channel might offer lower latency, higher uptime, and almost no loss of submissions during regular operation. From there the submissions could be persisted offline into Kafka to support remembering offsets, partitioning consumers, replication/switchover to other DCs, etc. This is not a serious suggestion to use Redis on top of Kafka. But can you confidently say, and explain, that the uptime / submission loss we have is better than that would be? No need to explain to me, just something to think about in theory.

Failure is still OK, but also still surprising. Let me circle back to where I started. Zero error rate is not a fair expectation of any service. As such, I think it's totally reasonable for the end of the chain (EventGate, and Kafka) to occasionally fail quickly, drop the thing completely, and require a retry from the start. But how far up the chain is the "start"? From my perspective as a consumer of JobQueue, I'd prefer that the "start" not be the user's web browser and their page view or edit request. Instead, it should perhaps be the internal JobQueue-push request. And, as I understand it, we are already doing several retries at one or multiple levels in this stack.

That begs the question: why are we still seeing delivery failures... from the MW side? Are these astronomical coincidences where the same rare failure happens multiple times in a row? If so, I guess we have a high enough rate of submissions that those coincidences can happen, and perhaps that's OK then. But I suspect that this is not the case, and rather that we are suffering from regular day-to-day occurrences (not maintenance, exceptional outages, or unusual load) during which the EventGate service is just not accepting anything from anyone for more than a second at a time. Is that true? If so, that seems like something we probably shouldn't tolerate for a service depended on by Tier-1 production traffic. If this is infeasible to avoid with the current technology, then my naive mindset would suggest that perhaps this needs to be a multi-node cluster with load balancing in front to avoid such regular outages. (And yes, a few seconds ago he was complaining about complexity....)


EDIT: I did not see the rate was 24 in 7 days. I guess that's quite low. I think you can ignore this comment if the mitigation that got us to that low rate involved something other than (further) increasing of timeouts/retries.

I guess I should file a follow-up task to audit our JobQueue pushes and reconsider the choice of PRESEND vs POSTSEND. In my mind, this is more or less equivalent to a DB transaction where you've called commit() and got a success, but it was never saved in the first place (vs lost due to special circumstances). If we conclude that assuming success is wrong, then that will of course make latency a more significant concern. Originally, when we moved to EventGate/Kafka, push latency wasn't reviewed or worried about (for end-users) since we hid it behind POSTSEND. With timeouts as big as we have today, improving them will likely involve the same kind of work as improving the delivery rate, so perhaps it's still basically the same problem. With low latencies it would seem natural to just do pushes PRESEND, and then failures would be more acceptable, with end-users aware of them and able to consider a retry from their perspective. (But also, they'd be more visible and so we'd want fewer of them, though the current rate seems fine for that, I guess.) To be continued...

Thanks for the detailed reply. I agree - something is still fishy here and does need to be resolved, so let's keep poking.

So, what I did was add a per-retry timeout in envoy for this request path, because without it the timeout on the 'envoy -> eventgate' request was counted against the overall envoy timeout, so retries were never actually made. This is something we should consider making a default setting for the envoy service proxy - nothing specific to eventgate is happening here. The per-request timeout is now slightly less than 1/2 of the overall timeout, so envoy makes 1 retry.

Since this change caused such a dramatic decrease in the rate of errors, it suggests that we now only see the issue when both attempts hit struggling eventgate nodes and time out - the probability of that happening is much, much lower than hitting one, but it's not impossible. I know the next workaround: make the per-request timeout in envoy 1/3 of the overall timeout, cutting the probability of failure even further.
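To make the retry arithmetic concrete (illustrative figures only, and assuming the attempts fail independently - which is optimistic when a node is struggling for seconds at a time):

  P(request fails, 1 attempt)  = p
  P(request fails, 2 attempts) = p²   (per-try timeout just under 1/2 of the overall timeout)
  P(request fails, 3 attempts) = p³   (per-try timeout at 1/3 of the overall timeout)

  e.g. p = 0.01  →  p² = 0.0001, p³ = 0.000001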

Given that we see failures both in normal mode and in hasty mode (where we do not wait for an ack from Kafka and just fire events and forget), Kafka doesn't seem to be the issue. Which leaves eventgate as the most probable cause. So, circling back to @Ottomata.

ok, a bit more time passed and we've lost 24 jobs in 7 days.

@Krinkle this is a significant improvement over the previous state, but given it's a production error - do we need to aim for 0?

No, aiming for zero is pointless and harmful. If we can't live with 24 failures in 7 days, it means we're less tolerant of job-queue failures than we are of errors in responses to users, which would be absurd, as Timo already pointed out.

This doesn't mean that MediaWiki shouldn't try to improve the situation by handling a failure to submit a job by saving it somewhere (a specific DB table?) so we can replay it later. At the current failure rate, this would guarantee the jobs get executed, at a negligible cost in terms of resources.

If anything, I think we should invest resources in doing something like this, rather than adding more retries to envoyproxy. It would also help ease the problem of the reliability of the event stream (see for instance T120242).

Given that we see failures both in normal mode and in hasty mode (where we do not wait for an ack from Kafka and just fire events and forget), Kafka doesn't seem to be the issue. Which leaves eventgate as the most probable cause. So, circling back to @Ottomata.

I think we might want to go back and check whether the increasing volume of messages eventgate now handles has become too much for it. I'd go down the following path:

  • Up the resource limits for eventgate in the chart for eventgate-main
  • Increase the number of pods we use for it

before we go any further in the direction of investigating what to do with eventgate. That said, it has become quite a central part of our infrastructure and I'd love to see a deeper investigation into what happens when it has those moments of inability to respond to requests.

Increase the number of pods we use for it (EventGate)

FYI, we already did this slightly for this task after https://phabricator.wikimedia.org/T249745#6689046. Perhaps we need more??? :)

problems that are more related to avoidable complexity or design choices

EventGate exists mostly as a convenience to avoid problems that PHP has, as does envoy. Both help with connection pooling: envoy pools HTTPS connections, and EventGate pools Kafka connections (and of course it does other stuff too :) )

For production services that don't have PHP's downsides, there is no need for EventGate. We do need some things (validation, etc.) to happen to events before they are produced to Kafka, but there's no reason this couldn't be handled by a client-side library. This is what we are doing for JVM-based event producers. If we had a performant Kafka PHP client (and could use it during a user HTTP request), I think we'd avoid EventGate altogether.
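As a thought experiment only - not a proposal or existing code - a direct-to-Kafka produce from PHP might look roughly like this, assuming the php-rdkafka extension; the broker, topic, and schema values are placeholders:

```php
<?php
// Hypothetical "performant Kafka PHP client" path, skipping EventGate.
$conf = new RdKafka\Conf();
$conf->set( 'metadata.broker.list', 'kafka-main.example.org:9092' ); // placeholder broker

$producer = new RdKafka\Producer( $conf );
$topic = $producer->newTopic( 'eqiad.mediawiki.job.example' ); // placeholder topic

$event = [
	'$schema' => '/mediawiki/job/example/1.0.0', // placeholder schema URI
	'meta' => [ 'stream' => 'mediawiki.job.example' ],
];
// Client-side schema validation would happen here, replacing what EventGate
// does for us today.

$topic->produce( RD_KAFKA_PARTITION_UA, 0, json_encode( $event ) );
// Waiting for the broker ack is the latency cost that is currently hidden
// behind the HTTP hop to EventGate; 1000 ms is an arbitrary illustrative bound.
$producer->flush( 1000 );
```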

I'd love to see a deeper investigation

T264021: > ~1 request/second to intake-logging.wikimedia.org times out at the traffic/service interface could be related too.

So, 503 is solved - we have had zero 503 errors for jobs since https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/724173 was deployed. Now we only have 504 Gateway Timeout errors. I shall poke at those a little.

While the instability and latency problem never fully went away, it appears to be getting worse as of late, sometimes drowning out regressions.

Search "Could not enqueue jobs" on the logstash mediawiki-errors dashboard, with spikes of thousands of lost jobs - afaik without recovery.

Screenshot 2022-07-13 at 17.50.28.png (444×952 px, 44 KB)

Latest impression: a handful every minute throughout the day, more or less. The increase in failures may be related to the increased latency per T338357: Pushing jobs to jobqueue is slow again, which may be causing additional timeouts.

Screenshot 2023-08-07 at 19.34.03.png (1×2 px, 200 KB)

Change 971986 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/mediawiki-config@master] wgEventServices - add docs about timeout settings

https://gerrit.wikimedia.org/r/971986

Change 971986 merged by jenkins-bot:

[operations/mediawiki-config@master] wgEventServices - add docs about timeout settings

https://gerrit.wikimedia.org/r/971986

Change 971989 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/deployment-charts@master] eventgate-main - set prestop_sleep and terminiation timeouts

https://gerrit.wikimedia.org/r/971989

Change 971989 merged by jenkins-bot:

[operations/deployment-charts@master] eventgate-main - set prestop_sleep and terminiation timeouts

https://gerrit.wikimedia.org/r/971989

In T349823: [Event Platform] Gracefully handle pod termination in eventgate Helm chart, we added prestop sleep settings for the envoy tls proxy in front of eventgate. A theory is that the spikes of failed EventBus -> eventgate events were due to envoy shutting down during a deployment while some requests are in flight, causing all in flight requests to be lost.

I've since done a couple of eventgate-main deployments and watched for spikes of these failures, and haven't seen any.

Let's keep an eye out and see if things have improved.

We saw a recurrence of this issue this morning, with a large number of jobs failing with 503 messages from eventgate for a short period. Envoy also saw failures connecting to eventgate around the same time.

Change 1005974 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/deployment-charts@master] eventgate-main: Raise worker_heap_limit_mb to 500

https://gerrit.wikimedia.org/r/1005974

Clement_Goubert raised the priority of this task from High to Unbreak Now!.Feb 23 2024, 3:35 PM
Clement_Goubert subscribed.

Raising this to UBN, we're definitely losing too many jobs.

Change 1005974 merged by jenkins-bot:

[operations/deployment-charts@master] eventgate-main: Raise worker_heap_limit_mb to 500

https://gerrit.wikimedia.org/r/1005974

Mentioned in SAL (#wikimedia-operations) [2024-02-23T15:38:41Z] <claime> Deploying 1005974 to eventgate-main - T249745

Hey @Clement_Goubert ,

I was on PTO last week and trying to piece together what happened and how the UBN was mitigated.

22.8k errors in the last 3 days, for reference https://logstash.wikimedia.org/goto/fba618b1c3ed026537f7dbe02a457ea3

Looks like we are down to 384 as of 2024-02-28. It's surely better, but would you consider this acceptable?

Change 1005974 merged by jenkins-bot:

[operations/deployment-charts@master] eventgate-main: Raise worker_heap_limit_mb to 500

https://gerrit.wikimedia.org/r/1005974

There seems to be a correlation between the reduced error rate and this patch.
A couple of months ago there was a major update to the EventGate Node.js version (10 -> 18). Having to tweak the heap size makes me think of possible changes to GC policies and behavior. Hypotheses are a bit hard to validate right now, because we don't collect GC metrics for Node services. There's work in progress to enable those metrics in T350180: Upgrade prom-client in NodeJS service-runner and enable collectDefaultMetrics.

Hey @Clement_Goubert ,

I was on PTO last week and trying to piece together what happened and how the UBN was mitigated.

22.8k errors in the last 3 days, for reference https://logstash.wikimedia.org/goto/fba618b1c3ed026537f7dbe02a457ea3

Looks like we are down to 384 as of 2024-02-28. It's surely better, but would you consider this acceptable?

I'll defer to the rest of the team on that one, @Joe @akosiaris what do you think?

Change 1005974 merged by jenkins-bot:

[operations/deployment-charts@master] eventgate-main: Raise worker_heap_limit_mb to 500

https://gerrit.wikimedia.org/r/1005974

There seems to be a correlation between the reduced error rate and this patch.
A couple of months ago there was a major update to the EventGate Node.js version (10 -> 18). Having to tweak the heap size makes me think of possible changes to GC policies and behavior. Hypotheses are a bit hard to validate right now, because we don't collect GC metrics for Node services. There's work in progress to enable those metrics in T350180: Upgrade prom-client in NodeJS service-runner and enable collectDefaultMetrics.

I don't think it's only about the update, although I don't know enough about nodejs to be certain. What I saw was that some time back the container's available memory was raised to 600MB for this same issue, and during the recent spike in errors, I saw warnings in the container logs about node's single worker being restarted for heap memory exhaustion.

Looking at the graphs, I could see we were not taking advantage of the available memory in the container, and decided to raise that worker_heap_limit_mb value. I think what ends up happening is that we drop requests during the restart cycle of the single Node worker, and giving it more heap memory seems to lead to fewer restarts.

When we're talking about errors, it's always a good idea to reason in terms of error ratios and not error rates.

10 errors per day on a system that received 100 requests per day is worse than 100k errors per day on a platform that receives 100k requests per second.

In our case, given there is no SLO for eventgate-main, it's hard to pin down what "is acceptable".

384 errors over 2 days, given the average job insertion rate is around 2320/s, would mean an error rate of 0.0001%, or an availability of 99.9999% (my calculation is simply 100 - (384*100 / (2320*2*86400)))
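Spelling that back-of-the-envelope calculation out:

  384 errors / (2320 jobs/s × 2 days × 86400 s/day)
    = 384 / 400,896,000
    ≈ 0.000096% error ratio
  availability ≈ 100% − 0.0001% ≈ 99.9999%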

I can't see a scenario where a system being available with six nines isn't acceptable, but I'd want the people who work on MediaWiki to chime in. @MSantos might help with this.

A couple of things I wonder about:

  • Though the bottleneck seems to be EventGate more than Kafka, I still wonder why profile::kafka::mirror::properties doesn't blacklist all MW jobs?* Is anything making use of that extra data?
  • Are there stats on the average byte length of jobs enqueued? Maybe JobQueueEventBus could emit those to find bulky jobs (see the sketch after this list). At least cirrus search jobs are known for being bulky. Maybe there are more. I assume the bulky cirrus job problem will be resolved if the new stream-based updater is deployed (T317045).
  • I wonder if JobQueueGroup::lazyPush()/JobQueueEventBus could be rigged to make the provided jobs use "hasty" mode in EventGate?
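A rough sketch of the byte-length idea in the second bullet (hypothetical, not existing JobQueueEventBus code; the metric name and the use of getParams() for sizing are illustrative choices):

```php
<?php
use MediaWiki\MediaWikiServices;

/** @var IJobSpecification[] $jobs The batch being pushed */
$stats = MediaWikiServices::getInstance()->getStatsdDataFactory();
foreach ( $jobs as $job ) {
	// Report the serialized parameter size per job type so bulky jobs stand out.
	$bytes = strlen( json_encode( $job->getParams() ) );
	$stats->timing( 'jobqueue.push.bytes.' . $job->getType(), $bytes );
}
```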

In the last 15 days, for << channel:EventBus AND "Could not enqueue jobs for stream" >> in logstash, I see:

stream.keyword: Descending	Count
mediawiki.job.parsoidCachePrewarm	3755
mediawiki.job.cirrusSearchLinksUpdate	1890
mediawiki.job.RecordLintJob	1834
mediawiki.job.cirrusSearchElasticaWrite	1766
mediawiki.job.wikibase-addUsagesForPage	436
mediawiki.job.refreshLinks	123
mediawiki.job.htmlCacheUpdate	82
mediawiki.job.EntityChangeNotification	60
mediawiki.job.recentChangesUpdate	55
mediawiki.job.refreshLinksPrioritized	52
mediawiki.job.activityUpdateJob	49
mediawiki.job.cdnPurge	43
mediawiki.job.cirrusSearchLinksUpdatePrioritized	35
mediawiki.job.ORESFetchScoreJob	33
mediawiki.job.constraintsRunCheck	28
mediawiki.job.EchoNotificationDeleteJob	18
mediawiki.job.DispatchChanges	12
mediawiki.job.checkuserPruneCheckUserDataJob	12
mediawiki.job.newcomerTasksCacheRefreshJob	9
mediawiki.job.watchlistExpiry	6
mediawiki.job.cirrusSearchOtherIndex	5
mediawiki.job.flaggedrevs_CacheUpdate	5
mediawiki.job.refreshUserImpactJob	4
mediawiki.job.CleanTermsIfUnused	3
mediawiki.job.categoryMembershipChange	3
mediawiki.job.cirrusSearchCheckerJob	3
mediawiki.job.CognateCacheUpdateJob	1
mediawiki.job.EchoPushNotificationRequest	1
mediawiki.job.ThumbnailRender	1
mediawiki.job.enotifNotify	1
mediawiki.job.globalUsageCachePurge	1
mediawiki.job.ipinfoLogIPInfoAccess	1
mediawiki.job.newUserMessageJob	1

Given those job types, none of that should have horrific consequences. The worst would require some null edits and purges, I suppose. I wouldn't lose sleep over six nines for this. Maybe global rename jobs would be the most annoying to fail, though GlobalRenameUser blocks on JobQueueGroup::push() and does some DB row locking tricks and checks around updating ru_status to sidestep job enqueue failures and unexpected DB rollback... so that's probably fine too.

*https://wikitech.wikimedia.org/wiki/Kafka/Administration#MirrorMaker

A couple of things I wonder about:

  • Though the bottleneck seems to be EventGate more than Kafka, I still wonder why profile::kafka::mirror::properties doesn't blacklist all MW jobs?* Is anything making use of that extra data?

Replication is not only needed when something is actively using the data; it's also needed for disaster recovery.

Say our current primary datacenter has huge electrical issues and is unreachable. We just decide to switch to the other datacenter, of course. What about mediawiki jobs that were already enqueued and not processed? We don't want to lose them, I think.

I still wonder why profile::kafka::mirror::properties doesn't blacklist all MW jobs?* Is anything making use of that extra data?

I believe we do replicate the mediawiki.job.* topics to Kafka jumbo-eqiad as well. We used to import them into Hadoop (as we do pretty much everything) in case someone wanted to do some long term analysis, etc. on them, but we stopped because no one was using them (and they don't have deterministic schemas so they are hard to use). We could stop replicating them to Kafka jumbo-eqiad.

But, I doubt that is related to the problem here :)

Given the six 9's reliability that Joe cited above ( T249745#958681 ) and Aaron's logstash dive over last 15 days and analysis ( T249745#9592919 ), it seems that we could probably close this task as not additionally actionable.

However, given that there could be job types with bad consequences if they were dropped, it might be useful to at least look at all known job types, isolate those that cannot be dropped, and ensure that they have failsafe mechanisms so that the jobs are indeed queued (Aaron notes that GlobalRenameUser jobs already have such a failsafe mechanism). Do we already have this information somewhere?

Does that seem reasonable as a final audit before we close this task?

ssastry lowered the priority of this task from Unbreak Now! to High.Mar 6 2024, 8:14 PM

It would be good if we could distinguish between "critical" jobs and "not-so-critical" jobs in the code.

One way to achieve this would be to be deliberate about using JobQueueGroup::push for "critical" jobs, and JobQueueGroup::lazyPush for "normal" jobs that can be enqueued after the response has been sent to the user and no errors can be reported to them. Better perhaps, we could add a JobQueueGroup::securePush method that guarantees that the job has been successfully enqueued (or the method throws an exception). Application logic would then use either securePush or lazyPush, and plain push would be reserved for internal use.

This way, we can roll back the main transaction and inform the user when we fail to enqueue a critical job. For non-critical jobs we should of course still log any failure to enqueue, and we should aim for five nines of reliability, but we can accept some failure rate.
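A sketch of the proposed securePush semantics (hypothetical; as the next comment points out, JobQueueGroup::push() already throws JobQueueError when enqueueing fails, giving exactly this guarantee):

```php
<?php
/**
 * Proposed helper: enqueue a critical job, or throw so the caller can roll
 * back its main DB transaction and report the failure to the user.
 *
 * @throws JobQueueError if the job could not be enqueued
 */
function securePush( JobQueueGroup $jobQueueGroup, IJobSpecification $job ): void {
	$jobQueueGroup->push( $job ); // push() already throws on enqueue failure
}
```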

@daniel What you describe exists and is exactly what JobQueueGroup::push is and how it is used.

When the completion of a job is directly user-visible and self-correcting (e.g. sending a peer-to-peer email, or deleting a page via the jobqueue), we queue it via push(), and the transaction is rolled back and an error page shown if this fails.

Once the job is successfully queued, no matter how it was queued, it will eventually be run, including a number of retries if it fails or times out for any reason (unless opted out via Job::allowRetries).

Do we have a use case for a third category for which we would "try harder" or something like that? I don't see why anyone would want a less reliable queueing method. If we want to incorporate a retry for failed queuing attempts, I would expect that to be a general improvement. By default, no opt-in, no opt-out. Developers shouldn't worry about such details. If there is a high cost to re-trying (is there?) we could potentially limit it to push only, but I don't see why we'd do that. Afaik we also have automatic re-tries at the Envoy level already, so in a sense we already re-try various idempotent requests automatically as part of Varnish/ATS and service mesh routing.

I wonder if JobQueueGroup::lazyPush()/JobQueueEventBus could be rigged to make the provided jobs use "hasty" mode in EventGate?

We should do this, but I don't think it will mitigate the symptom.
If the expectation is that JobQueueGroup::lazyPush is unreliable anyway, then using EventGate's hasty mode would allow JobQueueEventBus to fire and forget. EventGate would return a 202 as soon as it can.
However, if we are seeing 503s, hasty mode probably won't matter.
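For illustration, opting into hasty mode is just a query parameter on the intake request. A sketch only (placeholder URL, and not the existing EventBus code path):

```php
<?php
use MediaWiki\MediaWikiServices;

/** @var array[] $events Event objects to submit */
$req = MediaWikiServices::getInstance()->getHttpRequestFactory()->create(
	'https://eventgate-main.example.org/v1/events?hasty=true', // placeholder URL
	[ 'method' => 'POST', 'postData' => json_encode( $events ) ],
	__METHOD__
);
$req->setHeader( 'Content-Type', 'application/json' );
$status = $req->execute();
// With hasty=true, EventGate answers 202 Accepted before the Kafka ack, so a
// slow Kafka no longer turns into a client-side timeout - but a 503 from a
// struggling EventGate worker (this task) would still surface here.
```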

This doesn't mean that MediaWiki shouldn't try to improve the situation by handling a failure to submit a job by saving it somewhere (a specific DB table?) so we can replay it later. At the current failure rate, this would guarantee the jobs get executed, at a negligible cost in terms of resources.

@Joe this sounds sort of similar to the Outbox solution described in T120242, albeit only for failed submissions instead of all of them. Functionally this sounds like a nice solution to the eventual consistency problem described there, but I'd expect it would add some latency to the user response (waiting for ACK from EventGate+Kafka). Actually it sounds more like this (discarded?) solution, except:

  1. open MariaDB transaction
  2. attempt to produce event
  3. if fail, insert event into MariaDB retry table
  4. close MariaDB transaction

Later/async: retry producing the event (a rough sketch of this flow follows below).
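A rough sketch of those steps; the event_retry table, its column, and the $produceEvent callable are hypothetical, not existing schema or code:

```php
<?php
use Wikimedia\Rdbms\IDatabase;

function produceOrStash( IDatabase $dbw, callable $produceEvent, array $event ): void {
	$dbw->startAtomic( __METHOD__ );          // 1. open MariaDB transaction
	try {
		$produceEvent( $event );              // 2. attempt to produce the event
	} catch ( Exception $e ) {
		$dbw->insert(                         // 3. on failure, stash it for replay
			'event_retry',                    // hypothetical retry table
			[ 'er_event' => json_encode( $event ) ],
			__METHOD__
		);
	}
	$dbw->endAtomic( __METHOD__ );            // 4. close MariaDB transaction
}
// Later/async: a maintenance script reads event_retry and retries production.
```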

If you think this is a potential solution, I'll add it to the list in T120242's description.

We had a bunch of errors again today. All of them were connection errors to eventgate-analytics and eventgate-main, leading to 503s after exceeding the retry limit. There were "ERROR ferm input drop default policy not set, ferm might not have been started correctly" alerts during that time, but I'm not convinced those are related.

image.png (1×3 px, 251 KB)

image.png (990×1 px, 138 KB)

This doesn't mean that MediaWiki shouldn't try to improve the situation by handling a failure to submit a job by saving it somewhere (a specific DB table?) so we can replay it later. At the current failure rate, this would guarantee the jobs get executed, at a negligible cost in terms of resources.

@Joe this sounds sort of similar to the Outbox solution described in T120242, albeit only for failed submissions instead of all of them. Functionally this sounds like a nice solution to the eventual consistency problem described there, but I'd expect it would add some latency to the user response (waiting for ACK from EventGate+Kafka). Actually it sounds more like this (discarded?) solution, except:

This is from 2021 though, specifically T249745#7040610. In those 3 years, we got to 99.9999% availability per T249745#9586819. Since reaching 100% (either by making the current solution better or creating a new solution) is probably infeasible (see the CAP theorem), never mind extremely costly, it would make sense to first see why 99.9999% isn't sufficient.

see the CAP theorem

C != eventual-C. Eventual Consistency + AP is feasible and done often.

why 99.9999% isn't sufficient.

Is 99.9999% sufficient for MariaDB replication?

But for MW jobs, perhaps this is fine? For replicating state changes (T120242) we should aim for 100% eventual consistency.

see the CAP theorem

C != eventual-C. Eventual Consistency + AP is feasible and done often.

Agreed, but the point isn't about eventual consistency. It's that it's inevitable that you'll have some losses.

why 99.9999% isn't sufficient.

Is 99.9999% sufficient for MariaDB replication?

It is not unheard of for MariaDB replication to break and for the solution to be to skip parts of the binlog. Can't answer whether that's 99.9999% or 99.999999% or even more, but it's definitely not 100%.

But for MW jobs, perhaps this is fine?

Quite possibly.

For replicating state changes (T120242) we should aim for 100% eventual consistency.

Why though? Why is 99.9999% (or 99.999999% or 99.99%) not enough? In fact, T120242#7523847, made 2.5 years ago when the task's priority was lowered, implies that it might be enough for many use cases.

For replicating state changes (T120242) [...]

Why though? Why is 99.9999% (or 99.999999% or 99.99%) not enough?

There is a "Why do we need this?" section in T120242's description. Let's keep this discussion there?

That task wants MW state propagation using events to be equally consistent with MariaDB. That way folks can trust the state they are getting to build read-model products outside of MW, where it is difficult to do so (search indexes, AI backed patrolling tools, WDQS, etc. etc.)

For replicating state changes (T120242) [...]

Why though? Why is 99.9999% (or 99.999999% or 99.99%) not enough?

There is a "Why do we need this?" section in T120242's description. Let's keep this discussion there?

That task wants MW state propagation using events to be equally consistent with MariaDB. That way folks can trust the state they are getting to build read-model products outside of MW, where it is difficult to do so (search indexes, AI backed patrolling tools, WDQS, etc. etc.)

Are you sure that's a problem? How is a search index not getting updated for 0.001% of edits a problem?

The expectation must be that such data can be regenerated from canonical data on the wiki (AI re-scoring the edit, re-indexing search indices, etc.). There shouldn't be any canonical data being handed out via jobs (or if it really has to be, it needs to at least have some backing in the database, like user rename jobs); otherwise it's working exactly as intended.

search index not getting updated in 0.001% of edits

Search is probably fine.

those data could be regenerated from canonical data on the wiki

This is very expensive, and requires complicated logic and maintenance to do.

But! It is a possible solution to this problem (reconciliation, Lambda Arch), especially if we can provide a general-purpose way for folks to do this.

If we want to go deeper right now let's continue this discussion on T120242. I agree that MW jobs might not need 100% consistency.

search index not getting updated in 0.001% of edits

Search is probably fine.

I'm not sure about the "probably" part. We want to solve user-facing problems. Do you know of a case that users started complaining search index is not updated because of the 0.001% job failure? As long as users are happy, spending resources on something that wouldn't be noticed and could be spent on something else that users actually complain about, is a waste of resources IMO. see also the concept of "error budget"

those data could be regenerated from canonical data on the wiki

This is very expensive, and requires complicated logic and maintenance to do.

If you have a way to find mismatches and redo them for a small portion of changes (e.g. even 1% of changes), it should be fine.

For replicating state changes (T120242) [...]

Why though? Why is 99.9999% (or 99.999999% or 99.99%) not enough?

There is a "Why do we need this?" section in T120242's description. Let's keep this discussion there?

Fair enough. I'll point it out there too, but I note that after reading it, there's only qualitative discussion of the problem and no quantitative requirements.

That task wants MW state propagation using events to be equally consistent with MariaDB.

I fear I read that task, the way it is written at least, differently. I see no mention of a need to have consistency equal to MariaDB. All I see is a need for events consistent enough to make their use case work - and "enough" is defined neither quantitatively nor in relation to anything else.

I fear I read that task, the way it is written at least, differently.

Yeah, that might be something that was fleshed out in the comments and never updated in the description.

And "enough" is defined neither quantitatively nor in relation to anything else.

Good point. When I get back from parental leave I'll try to nail this down in that task description better.