Page MenuHomePhabricator

EventBus error "Unable to deliver all events: (curl error: 28) Timeout was reached"
Closed, InvalidPublicPRODUCTION ERROR

Description

Error

About 7,500 hits in the last 30 days (September 2018). Here's two samples.

From api_server
Unable to deliver all events: (curl error: 28) Timeout was reached

reqId: W5mwZArAEHcAAIoQAMwAAABJ
url: http://commons.wikimedia.org/w/api.php
http_method: POST

events:
  {
  "database": "commonswiki",
  "mediawiki_signature": "###",
  "meta": {
    "dt": "###",
    "domain": "commons.wikimedia.org",
    "topic": "mediawiki.job.RecordLintJob",
    "id": "###",
    "uri": "https://commons.wikimedia.org/wiki/File:Les_%C3%89parges_(Meuse)_%C3%A9glise_(01).JPG",
    "request_id": "W5mwZArAEHcAAIoQAMwAAABJ"
  },
  "page_title": "File:Les_Éparges_(Meuse)_église_(01).JPG",
  "type": "RecordLintJob",
  "params": {
    "requestId": "W5mwZArAEHcAAIoQAMwAAABJ",
    "errors": [],
    "revision": 317205039
  },
  "page_namespace": 6
}
From jobrunner
Unable to deliver all events: (curl error: 28) Timeout was reached

reqId: W5l9ZQrAMF0AAExjtjoAAABN
server: jobrunner.discovery.wmnet
url: /rpc/RunSingleJob.php
wiki: idwiki

events:
  {
  "sha1": "###",
  "database": "idwiki",
  "mediawiki_signature": "###",
  "meta": {
    "dt": "###",
    "domain": "id.wikipedia.org",
    "topic": "mediawiki.job.cirrusSearchLinksUpdatePrioritized",
    "id": "###",
    "uri": "https://id.wikipedia.org/wiki/Kirill_Pushkin",
    "request_id": "W5l9ZQrAMF0AAExjtjoAAABN"
  },
  "page_title": "Kirill_Pushkin",
  "type": "cirrusSearchLinksUpdatePrioritized",
  "params": {
    "cluster": null,
    "removedLinks": [],
    "addedLinks": [],
    "requestId": "W5l9ZQrAMF0AAExjtjoAAABN",
    "prioritize": true
  },
  "page_namespace": 0
}

Notes

When searching on Logstash/mediawiki-errors for Unable to deliver all events the vast majority of the matches are actually collected under internal/jsonTruncated instead of mediawiki/EventBus because the error message was too large to process.

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald Transcript

Luca did some restarts of the main Kafka clusters recently, I wonder if these correlate to those restarts.

Krinkle reopened this task as Open.EditedDec 20 2018, 7:48 PM

Still seen regularly. With bursts throughout the last 12 days with about 2,000 reports in total of one or more JobQueue submissions lost, presumably without recovery.

Tentatively triaging as high priority because it is not expected that events can be lost that easily as part of regular operations like performing a restart. If they are in fact not lost, then perhaps the error message needs to be changed or caught in some way.

Krinkle triaged this task as High priority.Dec 20 2018, 7:48 PM
Krinkle added a project: Platform Engineering.

Hm, ok. That is a lot. The timeout issue is a known (and was difficult to debug in prod, since we can't reproduce locally). We are planning to replace this service with EventGate in the next quarter or two, so I'd prefer to focus on that then delve into this bug.

As an easy mitigation attempt, we could try to increase the HTTP timeout between MW and EventBus proxy service. Currently, it's not specified in the config and 5-second default is used. Given that all the event submissions are run as deferred updates and not tie up the client response, and that the default timeout for the MultiHTTPClient is 300 seconds, I would say it's safe to bump up the EventBus timeout significantly, to 60 seconds for example.

Change 482722 had a related patch set uploaded (by Ppchelko; owner: Ppchelko):
[operations/mediawiki-config@master] Increase MW -> EventBus service HTTP request timeout.

https://gerrit.wikimedia.org/r/482722

When searching on Logstash/mediawiki-errors for Unable to deliver all events the vast majority of the matches are actually collected under internal/jsonTruncated instead of mediawiki/EventBus because the error message was too large to process.

We have actually had mitigation for that limiting the size of the log to 8kb, which was Logstash limit. However, seems like all the jsonTruncated messages are truncated by syslog, which has 1 kb message size limit. Unless we have some custom limit I'm not aware of, I will change the maximum log size for EventBus to 1kb

Change 482722 merged by jenkins-bot:
[operations/mediawiki-config@master] Increase MW -> EventBus service HTTP request timeout.

https://gerrit.wikimedia.org/r/482722

Mentioned in SAL (#wikimedia-operations) [2019-01-08T11:46:06Z] <mobrovac@deploy1001> Synchronized wmf-config/CommonSettings.php: Increase time out on the MW side to 60s - T204183 (duration: 00m 51s)

Change 484575 had a related patch set uploaded (by Ppchelko; owner: Ppchelko):
[mediawiki/extensions/EventBus@master] [EventBus] Decrese the maximum log size to 1kb

https://gerrit.wikimedia.org/r/484575

Change 484575 merged by jenkins-bot:
[mediawiki/extensions/EventBus@master] [EventBus] Decrese the maximum log size to 1kb

https://gerrit.wikimedia.org/r/484575

Still seen currently, but will re-evaluate once wmf.14 has gone out this week.

Logstash query: https://logstash.wikimedia.org/goto/a4a92b4dea44dafc56fe590774a7c42f

The wmf.14 branch has gone out, but the error is still at roughly the same frequency as before it seems.

Screen Shot 2019-01-25 at 16.58.06.png (354×2 px, 58 KB)

https://logstash.wikimedia.org/goto/ea6f33d407358a6a306de04e2a4629d5

mmodell changed the subtype of this task from "Task" to "Production Error".Aug 28 2019, 11:09 PM

We're not using eventlogging-eventbus service anymore, it was replaced by eventgate, please create a new task if timeouts keep happening.