Maniphest T204183

EventBus error "Unable to deliver all events: (curl error: 28) Timeout was reached"
Closed, InvalidPublicPRODUCTION ERROR
Actions

Description

Error

About 7,500 hits in the last 30 days (September 2018). Here's two samples.

From api_server

Unable to deliver all events: (curl error: 28) Timeout was reached

reqId: W5mwZArAEHcAAIoQAMwAAABJ
url: http://commons.wikimedia.org/w/api.php
http_method: POST

events:
  {
  "database": "commonswiki",
  "mediawiki_signature": "###",
  "meta": {
    "dt": "###",
    "domain": "commons.wikimedia.org",
    "topic": "mediawiki.job.RecordLintJob",
    "id": "###",
    "uri": "https://commons.wikimedia.org/wiki/File:Les_%C3%89parges_(Meuse)_%C3%A9glise_(01).JPG",
    "request_id": "W5mwZArAEHcAAIoQAMwAAABJ"
  },
  "page_title": "File:Les_Éparges_(Meuse)_église_(01).JPG",
  "type": "RecordLintJob",
  "params": {
    "requestId": "W5mwZArAEHcAAIoQAMwAAABJ",
    "errors": [],
    "revision": 317205039
  },
  "page_namespace": 6
}

From jobrunner

Unable to deliver all events: (curl error: 28) Timeout was reached

reqId: W5l9ZQrAMF0AAExjtjoAAABN
server: jobrunner.discovery.wmnet
url: /rpc/RunSingleJob.php
wiki: idwiki

events:
  {
  "sha1": "###",
  "database": "idwiki",
  "mediawiki_signature": "###",
  "meta": {
    "dt": "###",
    "domain": "id.wikipedia.org",
    "topic": "mediawiki.job.cirrusSearchLinksUpdatePrioritized",
    "id": "###",
    "uri": "https://id.wikipedia.org/wiki/Kirill_Pushkin",
    "request_id": "W5l9ZQrAMF0AAExjtjoAAABN"
  },
  "page_title": "Kirill_Pushkin",
  "type": "cirrusSearchLinksUpdatePrioritized",
  "params": {
    "cluster": null,
    "removedLinks": [],
    "addedLinks": [],
    "requestId": "W5l9ZQrAMF0AAExjtjoAAABN",
    "prioritize": true
  },
  "page_namespace": 0
}

Notes

When searching on Logstash/mediawiki-errors for Unable to deliver all events the vast majority of the matches are actually collected under internal/jsonTruncated instead of mediawiki/EventBus because the error message was too large to process.

Details

	Subject	Repo	Branch	Lines +/-
	[EventBus] Decrese the maximum log size to 1kb	mediawiki/extensions/EventBus	master	+3 -3
	Increase MW -> EventBus service HTTP request timeout.	operations/mediawiki-config	master	+1 -0

Customize query in gerrit

Related Objects

Mentioned In: T180017: Timeouts on event delivery to EventBus

Event Timeline

Krinkle created this task.Sep 13 2018, 2:10 AM

Restricted Application added a project: Analytics. · View Herald TranscriptSep 13 2018, 2:10 AM

Restricted Application added a subscriber: Aklapper. · View Herald Transcript

Krinkle moved this task from Untriaged to Apr 2019 / 1.33.wmf.25+ on the Wikimedia-production-error board.Sep 13 2018, 2:10 AM

• Pchelolo closed this task as a duplicate of T180017: Timeouts on event delivery to EventBus.Sep 13 2018, 5:48 AM

• Pchelolo mentioned this in T180017: Timeouts on event delivery to EventBus.

Luca did some restarts of the main Kafka clusters recently, I wonder if these correlate to those restarts.

Still seen regularly. With bursts throughout the last 12 days with about 2,000 reports in total of one or more JobQueue submissions lost, presumably without recovery.

Tentatively triaging as high priority because it is not expected that events can be lost that easily as part of regular operations like performing a restart. If they are in fact not lost, then perhaps the error message needs to be changed or caught in some way.

Krinkle triaged this task as High priority.Dec 20 2018, 7:48 PM

Krinkle added a project: Platform Engineering.

Krinkle moved this task from Untriaged to EventBus infra on the WMF-JobQueue board.Dec 20 2018, 7:57 PM

Milimetric moved this task from Incoming to Event Platform on the Analytics board.Jan 7 2019, 4:41 PM

Hm, ok. That is a lot. The timeout issue is a known (and was difficult to debug in prod, since we can't reproduce locally). We are planning to replace this service with EventGate in the next quarter or two, so I'd prefer to focus on that then delve into this bug.

As an easy mitigation attempt, we could try to increase the HTTP timeout between MW and EventBus proxy service. Currently, it's not specified in the config and 5-second default is used. Given that all the event submissions are run as deferred updates and not tie up the client response, and that the default timeout for the MultiHTTPClient is 300 seconds, I would say it's safe to bump up the EventBus timeout significantly, to 60 seconds for example.

Not a bad idea!

Change 482722 had a related patch set uploaded (by Ppchelko; owner: Ppchelko):
[operations/mediawiki-config@master] Increase MW -> EventBus service HTTP request timeout.

https://gerrit.wikimedia.org/r/482722

gerritbot added a project: Patch-For-Review.Jan 7 2019, 10:09 PM

When searching on Logstash/mediawiki-errors for Unable to deliver all events the vast majority of the matches are actually collected under internal/jsonTruncated instead of mediawiki/EventBus because the error message was too large to process.

We have actually had mitigation for that limiting the size of the log to 8kb, which was Logstash limit. However, seems like all the jsonTruncated messages are truncated by syslog, which has 1 kb message size limit. Unless we have some custom limit I'm not aware of, I will change the maximum log size for EventBus to 1kb

Change 482722 merged by jenkins-bot:
[operations/mediawiki-config@master] Increase MW -> EventBus service HTTP request timeout.

https://gerrit.wikimedia.org/r/482722

Mentioned in SAL (#wikimedia-operations) [2019-01-08T11:46:06Z] <mobrovac@deploy1001> Synchronized wmf-config/CommonSettings.php: Increase time out on the MW side to 60s - T204183 (duration: 00m 51s)

• kchapman moved this task from Inbox to Needs Cleaning - Security, stability, performance, and scalability (TEC1) on the Platform Engineering board.Jan 15 2019, 3:29 PM

• kchapman edited projects, added Platform Engineering (Needs Cleaning - Security, stability, performance, and scalability (TEC1)); removed Platform Engineering.

CCicalese_WMF assigned this task to • Pchelolo.Jan 15 2019, 3:33 PM

CCicalese_WMF added a project: Platform Team Workboards (Doing).

Change 484575 had a related patch set uploaded (by Ppchelko; owner: Ppchelko):
[mediawiki/extensions/EventBus@master] [EventBus] Decrese the maximum log size to 1kb

https://gerrit.wikimedia.org/r/484575

Change 484575 merged by jenkins-bot:
[mediawiki/extensions/EventBus@master] [EventBus] Decrese the maximum log size to 1kb

https://gerrit.wikimedia.org/r/484575

ReleaseTaggerBot added a project: MW-1.33-notes (1.33.0-wmf.14; 2019-01-22).Jan 16 2019, 12:01 AM

Krinkle removed a project: Patch-For-Review.Jan 22 2019, 3:55 AM

Still seen currently, but will re-evaluate once wmf.14 has gone out this week.

Logstash query: https://logstash.wikimedia.org/goto/a4a92b4dea44dafc56fe590774a7c42f

The wmf.14 branch has gone out, but the error is still at roughly the same frequency as before it seems.

https://logstash.wikimedia.org/goto/ea6f33d407358a6a306de04e2a4629d5

Krinkle removed a project: MW-1.33-notes (1.33.0-wmf.14; 2019-01-22).Jan 26 2019, 12:59 AM

Krinkle moved this task from Apr 2019 / 1.33.wmf.25+ to Older on the Wikimedia-production-error board.Mar 12 2019, 9:19 PM

WDoranWMF removed a project: Platform Team Workboards (Doing).Jul 5 2019, 6:17 PM

• mmodell changed the subtype of this task from "Task" to "Production Error".Aug 28 2019, 11:09 PM

We're not using eventlogging-eventbus service anymore, it was replaced by eventgate, please create a new task if timeouts keep happening.

	F28044980: Screen Shot 2019-01-25 at 16.58.06.png
	Jan 26 2019, 12:59 AM

EventBus error "Unable to deliver all events: (curl error: 28) Timeout was reached"Closed, InvalidPublicPRODUCTION ERRORActions