Page MenuHomePhabricator

EventBus HTTP Proxy service does not report errors to logstash
Closed, ResolvedPublic5 Estimated Story Points

Description

The service is not reporting errors to logstash. Instead, it seems to be returning the logstash-formatted error back to the caller, as exemplified in this RESTBase log entry:

{
  "_index": "logstash-2018.04.27",
  "_type": "restbase",
  "_id": "AWMGnwseh452DCx2bntT",
  "_version": 1,
  "_score": null,
  "_source": {
    "err_root_req_headers_content-length": "0",
    "err_levelPath": "error/events/emit",
    "line": "257",
    "pid": 177,
    "type": "restbase",
    "err_message": "400",
    "err_root_req_uri": "/www.wikidata.org/v1/page/html/Q52220064/670276832",
    "file": "/srv/deployment/restbase/deploy-cache/revs/fbce5204d61e0eae55fc8ab72f1ccb3976d4a87d/node_modules/preq/index.js",
    "err_root_req_method": "get",
    "err_root_req_headers_x-request-class": "external",
    "host": "restbase1014",
    "@version": "1",
    "err_root_req_headers_user-agent": "CrossrefEventDataBot",
    "err_root_req_headers_x-client-ip": "52.214.31.76",
    "gelf_level": "3",
    "err_stack": "restbase: 400\n    at Request.P.try.bind.then.err.retry.HTTPError.status (/srv/deployment/restbase/deploy-cache/revs/fbce5204d61e0eae55fc8ab72f1ccb3976d4a87d/node_modules/preq/index.js:257:23)\n    at Request.tryCatcher (/srv/deployment/restbase/deploy-cache/revs/fbce5204d61e0eae55fc8ab72f1ccb3976d4a87d/node_modules/bluebird/js/release/util.js:16:23)\n    at Promise._settlePromiseFromHandler (/srv/deployment/restbase/deploy-cache/revs/fbce5204d61e0eae55fc8ab72f1ccb3976d4a87d/node_modules/bluebird/js/release/promise.js:512:31)\n    at Promise._settlePromise (/srv/deployment/restbase/deploy-cache/revs/fbce5204d61e0eae55fc8ab72f1ccb3976d4a87d/node_modules/bluebird/js/release/promise.js:569:18)\n    at Promise._settlePromise0 (/srv/deployment/restbase/deploy-cache/revs/fbce5204d61e0eae55fc8ab72f1ccb3976d4a87d/node_modules/bluebird/js/release/promise.js:614:10)\n    at Promise._settlePromises (/srv/deployment/restbase/deploy-cache/revs/fbce5204d61e0eae55fc8ab72f1ccb3976d4a87d/node_modules/bluebird/js/release/promise.js:693:18)\n    at Promise._fulfill (/srv/deployment/restbase/deploy-cache/revs/fbce5204d61e0eae55fc8ab72f1ccb3976d4a87d/node_modules/bluebird/js/release/promise.js:638:18)\n    at Request._callback (/srv/deployment/restbase/deploy-cache/revs/fbce5204d61e0eae55fc8ab72f1ccb3976d4a87d/node_modules/bluebird/js/release/nodeback.js:45:21)\n    at Request.self.callback (/srv/deployment/restbase/deploy-cache/revs/fbce5204d61e0eae55fc8ab72f1ccb3976d4a87d/node_modules/request/request.js:186:22)\n    at emitTwo (events.js:106:13)\n    at Request.emit (events.js:191:7)\n    at Request.<anonymous> (/srv/deployment/restbase/deploy-cache/revs/fbce5204d61e0eae55fc8ab72f1ccb3976d4a87d/node_modules/request/request.js:1163:10)\n    at emitOne (events.js:96:13)\n    at Request.emit (events.js:188:7)\n    at IncomingMessage.<anonymous> (/srv/deployment/restbase/deploy-cache/revs/fbce5204d61e0eae55fc8ab72f1ccb3976d4a87d/node_modules/request/request.js:1085:12)\n    at IncomingMessage.g (events.js:292:16)",
    "err_root_req_headers_x-forwarded-for": "52.214.31.76, 10.20.0.178, 10.20.0.178, 10.20.0.167, ::ffff:10.64.32.107",
    "source_host": "10.64.48.133",
    "level": "ERROR",
    "err_headers_server": "TornadoServer/4.4.2",
    "message": "400",
    "err_request_id": "a3a8e634-4a04-11e8-bb7c-03466a8b4324",
    "version": "1.0",
    "normalized_message": "400",
    "tags": [
      "es",
      "gelf",
      "normalized_message_untrimmed"
    ],
    "err_body": "[{\"wiki\": \"\", \"recvFrom\": \"kafka1002.eqiad.wmnet\", \"uuid\": \"a629cbb84a0411e89896141877334a4e\", \"timestamp\": 1524824446, \"revision\": 14035058, \"event\": {\"message\": \"Failed sending event <Event a3cff637-4a04-11e8-b297-8741ff9b83e8 of schema (u'resource_change', 1)>. KafkaTimeoutError: \", \"code\": \"validation\", \"revision\": 1, \"rawEvent\": \"{\\\"meta\\\": {\\\"domain\\\": \\\"www.wikidata.org\\\", \\\"uri\\\": \\\"http://www.wikidata.org/api/rest_v1/page/html/Q52220064\\\", \\\"topic\\\": \\\"resource_change\\\", \\\"request_id\\\": \\\"a3a8e634-4a04-11e8-bb7c-03466a8b4324\\\", \\\"schema_uri\\\": \\\"resource_change/1\\\", \\\"dt\\\": \\\"2018-04-27T10:20:42.343Z\\\", \\\"id\\\": \\\"a3cff637-4a04-11e8-b297-8741ff9b83e8\\\"}, \\\"tags\\\": [\\\"restbase\\\"]}\", \"schema\": \"resource_change\"}, \"schema\": \"EventError\"}, {\"wiki\": \"\", \"recvFrom\": \"kafka1002.eqiad.wmnet\", \"uuid\": \"b812c7f84a0411e89896141877334a4e\", \"timestamp\": 1524824476, \"revision\": 14035058, \"event\": {\"message\": \"Failed sending event <Event a3cff638-4a04-11e8-b3bc-d43f8df4e0e1 of schema (u'resource_change', 1)>. KafkaTimeoutError: \", \"code\": \"validation\", \"revision\": 1, \"rawEvent\": \"{\\\"meta\\\": {\\\"domain\\\": \\\"www.wikidata.org\\\", \\\"uri\\\": \\\"http://www.wikidata.org/api/rest_v1/page/html/Q52220064/670276832\\\", \\\"topic\\\": \\\"resource_change\\\", \\\"request_id\\\": \\\"a3a8e634-4a04-11e8-bb7c-03466a8b4324\\\", \\\"schema_uri\\\": \\\"resource_change/1\\\", \\\"dt\\\": \\\"2018-04-27T10:20:42.343Z\\\", \\\"id\\\": \\\"a3cff638-4a04-11e8-b3bc-d43f8df4e0e1\\\"}, \\\"tags\\\": [\\\"restbase\\\"]}\", \"schema\": \"resource_change\"}, \"schema\": \"EventError\"}]",
    "@timestamp": "2018-04-27T10:21:16.361Z",
    "err_headers_content-type": "text/html; charset=UTF-8",
    "err_name": "restbase",
    "err_status": 400,
    "err_root_req_headers_x-request-id": "a3a8e634-4a04-11e8-bb7c-03466a8b4324",
    "err_headers_date": "Fri, 27 Apr 2018 10:20:42 GMT"
  }
}

As a result, the only way to know what is going on with the service is to log onto the nodes and tail the log files which does not scale and represents a considerable loss of time in case of incidents.

Event Timeline

mobrovac triaged this task as High priority.Apr 27 2018, 10:39 AM
mobrovac created this task.
Restricted Application added a project: Analytics. · View Herald TranscriptApr 27 2018, 10:39 AM
Restricted Application added a subscriber: Aklapper. · View Herald Transcript
fdans raised the priority of this task from High to Needs Triage.Apr 30 2018, 4:30 PM
fdans moved this task from Incoming to Radar on the Analytics board.
mforns triaged this task as Medium priority.Apr 30 2018, 4:51 PM
mforns raised the priority of this task from Medium to High.Apr 30 2018, 4:52 PM

Change 429865 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Log eventlogging-service-eventbus logs to logstash

https://gerrit.wikimedia.org/r/429865

Change 430089 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[eventlogging@master] Limit length of request.body in logging output

https://gerrit.wikimedia.org/r/430089

Change 430089 merged by Ottomata:
[eventlogging@master] Remove request.body from logging

https://gerrit.wikimedia.org/r/430089

Mentioned in SAL (#wikimedia-operations) [2018-05-01T17:36:32Z] <otto@tin> Started deploy [eventlogging/eventbus@c70e8c5]: remove occasional logging of request.body in prep for T193230

Mentioned in SAL (#wikimedia-operations) [2018-05-01T17:39:01Z] <otto@tin> Finished deploy [eventlogging/eventbus@c70e8c5]: remove occasional logging of request.body in prep for T193230 (duration: 02m 29s)

Change 429865 merged by Ottomata:
[operations/puppet@production] Log eventlogging-service-eventbus logs to logstash

https://gerrit.wikimedia.org/r/429865

Change 430092 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] eventbus - Install python-logstash if using logstash logging

https://gerrit.wikimedia.org/r/430092

Change 430092 merged by Ottomata:
[operations/puppet@production] eventbus - Install python-logstash if using logstash logging

https://gerrit.wikimedia.org/r/430092

I've deployed this to prod, and restarted a few EventBus instances, but they don't send any logs to logstash there. This works exactly as is in deployment-prep, so I'm not sure what's different. I can manually setup a python logstash logger and send to logstash fine, but the eventbus.log.cfg file based logging config we use doesn't seem to do it... investigating.

Ah, this did not actually work in deployment-prep. The logs I saw in logstash there were because remote_syslog is configured to log to the deployment-prep logstash server, which is not true in production. Back to figuring out how to log to logstash...

Change 430104 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Need to set root logger to use logstash handler

https://gerrit.wikimedia.org/r/430104

Change 430104 merged by Ottomata:
[operations/puppet@production] Need to set root logger to use logstash handler

https://gerrit.wikimedia.org/r/430104

Mentioned in SAL (#wikimedia-operations) [2018-05-01T19:00:45Z] <ottomata> rolling restart of eventlogging-service-eventbus to apply logstash logging configs - T193230

Change 430111 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Add logstash tag 'eventlogging-service-eventbus'

https://gerrit.wikimedia.org/r/430111

Change 430111 merged by Ottomata:
[operations/puppet@production] Add logstash tag 'eventlogging-service-eventbus'

https://gerrit.wikimedia.org/r/430111

Mentioned in SAL (#wikimedia-operations) [2018-05-01T19:27:11Z] <ottomata> rolling restart of eventbus to apply logstash tag https://phabricator.wikimedia.org/T193230

Ottomata moved this task from Next Up to In Code Review on the Analytics-Kanban board.

Change 430120 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Use python-logstash formatter version 1 (not 0)

https://gerrit.wikimedia.org/r/430120

Change 430120 merged by Ottomata:
[operations/puppet@production] Use python-logstash formatter version 1 (not 0)

https://gerrit.wikimedia.org/r/430120

Mentioned in SAL (#wikimedia-operations) [2018-05-01T20:00:14Z] <ottomata> rolling restart of eventbus to apply new logstash formatter version T193230

@Ottomatta so we are still not getting proper logs right? At least I can't find them ;(

Should we bring in @Gehel or should we try to check out the other lib for logging?

I tried the other lib real quick too, and didn't have it work. I just need to get back into it.

Change 430801 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[eventlogging@master] setup_logging - Support using yaml config file, add configurable ExtraFilter

https://gerrit.wikimedia.org/r/430801

Change 430808 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] eventlogging service logstash with gelf

https://gerrit.wikimedia.org/r/430808

mforns set the point value for this task to 5.May 7 2018, 3:32 PM

Change 430801 merged by Ottomata:
[eventlogging@master] setup_logging - Support using yaml config file, add configurable ExtraFilter

https://gerrit.wikimedia.org/r/430801

Mentioned in SAL (#wikimedia-operations) [2018-05-09T19:35:08Z] <otto@tin> Started deploy [eventlogging/eventbus@c70e8c5]: logstash - T193230

Mentioned in SAL (#wikimedia-operations) [2018-05-09T19:38:41Z] <otto@tin> Finished deploy [eventlogging/eventbus@c70e8c5]: logstash - T193230 (duration: 03m 33s)

Change 430808 merged by Ottomata:
[operations/puppet@production] eventlogging service logstash with gelf

https://gerrit.wikimedia.org/r/430808

Change 432157 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Use gelf port for eventbus logstash

https://gerrit.wikimedia.org/r/432157

Change 432157 merged by Ottomata:
[operations/puppet@production] Use gelf port for eventbus logstash

https://gerrit.wikimedia.org/r/432157

Mentioned in SAL (#wikimedia-operations) [2018-05-09T19:48:42Z] <otto@tin> Started deploy [eventlogging/eventbus@aa9eb2c]: logstash T193230

Mentioned in SAL (#wikimedia-operations) [2018-05-09T19:48:58Z] <otto@tin> Finished deploy [eventlogging/eventbus@aa9eb2c]: logstash T193230 (duration: 00m 15s)

Mentioned in SAL (#wikimedia-operations) [2018-05-09T19:49:48Z] <otto@tin> Started deploy [eventlogging/eventbus@aa9eb2c]: logstash T193230

Mentioned in SAL (#wikimedia-operations) [2018-05-09T19:51:05Z] <otto@tin> Finished deploy [eventlogging/eventbus@aa9eb2c]: logstash T193230 (duration: 01m 17s)

HA! gelf as the solution? I've told you!!!

Haha, yup, the library was better and just easier to work with.

Thanks for the tip! :)

Change 432380 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Set default eventbus kafka log level to WARNING

https://gerrit.wikimedia.org/r/432380

Change 432380 merged by Ottomata:
[operations/puppet@production] Set default eventbus kafka log level to WARNING

https://gerrit.wikimedia.org/r/432380

We've got the logs in logstash, thank you @Ottomata

Pchelolo closed this task as Resolved.May 11 2018, 7:05 PM
Vvjjkkii renamed this task from EventBus HTTP Proxy service does not report errors to logstash to t3daaaaaaa.Jul 1 2018, 1:13 AM
Vvjjkkii reopened this task as Open.
Vvjjkkii removed Ottomata as the assignee of this task.
Vvjjkkii updated the task description. (Show Details)
Vvjjkkii removed the point value for this task.
Vvjjkkii edited subscribers, added: Ottomata; removed: gerritbot, Aklapper.
Community_Tech_bot set the point value for this task to 5.Jul 1 2018, 6:02 AM
Community_Tech_bot updated the task description. (Show Details)
Yann renamed this task from t3daaaaaaa to EventBus HTTP Proxy service does not report errors to logstash.Jul 1 2018, 12:08 PM
Yann closed this task as Resolved.
Yann assigned this task to Ottomata.
Yann edited subscribers, added: gerritbot, Aklapper; removed: Ottomata.