Use gzip for logstash
Closed, Declined · Public

Description

While looking into the upgrade of logstash to 5.x I noticed a couple of errors due to malformed GELF logging requests. This is explicitly *not* a problem with the 5.x upgrade: our 1.5.x install in production is logging the same errors. I just noticed them because I was looking over logs while preparing the upgrade.

These are basically UDP messages formatted as JSON, received on port 12201. One example message:

{"@timestamp":"2017-03-27T21:11:24","type":"ores","logger_name":"uwsgi","host":"deployment-sca03","level":"ERROR","message":"[pid: 31379] 10.68.21.68 (-) {32 vars in 521 bytes} [Mon Mar 27 21:11:24 2017] GET /scores/enwiki/goodfaith/?model_info=test_stats&format=json => generated 2060 bytes in 7 msecs (HTTP/1.1 200) 6 headers in 209 bytes (1 switches on core 0) user agent \"MediaWiki/1.29.0-alpha\""}

The problem here is that logstash can only accept compressed input over GELF; plaintext is not supported. I'm no uwsgi expert so I can't provide exact details on how to fix this, but for the logs to be accepted by logstash, saved into elasticsearch, and viewable in kibana, the uwsgi config in /etc/uwsgi/apps-available/ores.ini will need to be updated to compress the data sent out over the socket connection.
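As a rough illustration of what "compressed input over GELF" means on the wire, here is a minimal Python sketch. It is not the uwsgi configuration change itself. The helper names (`build_gelf_datagram`, `send_gelf`) and the mapping of `type` and `logger_name` to underscore-prefixed additional fields are assumptions made for illustration; only port 12201 and the example field values come from this task.

```python
import gzip
import json
import socket

GELF_PORT = 12201  # port named in the task description


def build_gelf_datagram(message: str, host: str, level: int = 3) -> bytes:
    """Return a gzip-compressed GELF 1.1 payload (level 3 is syslog "error")."""
    record = {
        "version": "1.1",
        "host": host,
        "short_message": message,
        "level": level,
        # Hypothetical mapping: GELF additional fields carry a "_" prefix.
        "_type": "ores",
        "_logger_name": "uwsgi",
    }
    return gzip.compress(json.dumps(record).encode("utf-8"))


def send_gelf(datagram: bytes, server: str, port: int = GELF_PORT) -> None:
    """Send one already-compressed datagram over UDP (fire and forget)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(datagram, (server, port))
```

A compressed datagram starts with the gzip magic bytes `1f 8b`, whereas the plaintext message quoted above starts with `{`.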

Restricted Application added a subscriber: Aklapper.

In production this error has been logged 704,874 times between 2017-03-27T06:25:20 and 2017-03-27T21:28:06.41, or just under 800 times per minute. The full cluster logs ~15k messages per second, so adding these to the set would be easy to handle. If we don't actually need all these messages though, since they have never previously been available via centralized logging, it might be worth just turning them off.
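The arithmetic behind that estimate can be checked with a short Python sketch. The count, the two timestamps, and the ~15k messages per second figure are taken from the comment above; the variable names are only illustrative.

```python
from datetime import datetime

# Figures quoted in the comment above.
error_count = 704_874
start = datetime(2017, 3, 27, 6, 25, 20)
end = datetime(2017, 3, 27, 21, 28, 6, 410_000)  # 21:28:06.41
cluster_rate_per_second = 15_000  # "~15k messages per second"

window_seconds = (end - start).total_seconds()
errors_per_second = error_count / window_seconds
errors_per_minute = errors_per_second * 60

# Share of the cluster's current volume these messages would add.
added_share = errors_per_second / cluster_rate_per_second

print(f"{errors_per_minute:.0f} per minute, {added_share:.3%} of cluster volume")
```

This works out to roughly 780 messages per minute, about 13 per second, which is well under 0.1% of the cluster's current volume.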

Halfak assigned this task to Ladsgroup. Apr 13 2017, 3:14 PM
Restricted Application added a project: User-Ladsgroup. Apr 13 2017, 3:14 PM
Halfak triaged this task as High priority. Apr 13 2017, 3:15 PM
Halfak moved this task from Backlog to Maintenance/cleanup on the Scoring-platform-team-Backlog board.

Mentioned in SAL (#wikimedia-releng) [2017-04-14T00:45:31Z] <Amir1> cherry-picking 348184/1 (T161563)

bd808 added a subscriber: bd808. Apr 14 2017, 12:55 AM

The uwsgi logging config is set up to send JSON datagrams to port 11514. It shouldn't be hitting the GELF input at all.
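For contrast with GELF, a plain JSON datagram needs no compression or special framing. The sketch below is only an illustration in Python and not the actual uwsgi logger configuration, which is not shown in this task. The function names are hypothetical; port 11514 is the one mentioned above.

```python
import json
import socket

JSON_UDP_PORT = 11514  # port named in the comment above


def encode_json_datagram(fields: dict) -> bytes:
    """Serialize one log record as a compact, uncompressed JSON datagram."""
    return json.dumps(fields, separators=(",", ":")).encode("utf-8")


def send_json_datagram(fields: dict, server: str, port: int = JSON_UDP_PORT) -> None:
    """Send one record over UDP; delivery is not acknowledged."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(encode_json_datagram(fields), (server, port))
```

The same payload sent to the GELF port instead would be what produces the malformed-request errors described in the task.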

The port issue got resolved. I made https://gerrit.wikimedia.org/r/#/c/348184/1 to send things gzipped (even if it's not needed, let's have it for faster I/O).

Ladsgroup moved this task from Active to Review on the Scoring-platform-team board.
Ladsgroup renamed this task from ORES logs not being saved to logstash to Use gzip for logstash.
Ladsgroup lowered the priority of this task from High to Normal.

Okay. My conclusion is that there are two things:

  • The beta cluster used a badly outdated puppet config, which produced the errors Erik mentioned. This was resolved simply by updating the puppetmaster.
  • Logs should be gzipped, but that's not required for logging to work: we still see logs even though they are not gzipped. See https://logstash.wikimedia.org/app/kibana#/dashboard/ORES

Mentioned in SAL (#wikimedia-releng) [2017-04-14T08:03:42Z] <hashar> beta: resetting puppetmaster to last good tag snapshot-20170414T0030 A cherry pick for T161563 end up dropping three patches which broke other parts of the infrastructure

Mentioned in SAL (#wikimedia-releng) [2017-04-14T08:17:16Z] <hashar> beta: cherry picking again 348184/4 'service: use gzip for logging in uwsgi' for T161563

Mentioned in SAL (#wikimedia-releng) [2017-04-25T06:46:58Z] <Amir1> uncherry-pick f6ce64e99a and 225b8d4e82 (T161563)

When un-cherry-picked, it works like a charm. I'll cherry-pick it again to see what happens.

Mentioned in SAL (#wikimedia-releng) [2017-04-27T07:26:54Z] <Amir1> cherry-picking 348184/4 (T161563)

Change 348184 abandoned by Ladsgroup:
service: use gzip for logging in uwsgi

Reason:
It breaks beta cluster. Let's not do it.

https://gerrit.wikimedia.org/r/348184

Ladsgroup closed this task as Declined. Jun 4 2017, 12:47 AM
awight moved this task from Review to Done on the Scoring-platform-team board. Jul 3 2017, 5:48 PM