
Use gzip for logstash
Closed, Declined · Public

Description

While checking into the upgrade of logstash to 5.x I noticed a couple of errors due to malformed GELF logging requests. This is explicitly *not* a problem with the 5.x upgrade; our 1.5.x install in production is logging the same errors. I only noticed these because I was looking over logs while preparing the upgrade.

These are basically UDP messages formatted as JSON, received over port 12201. One example message:

{"@timestamp":"2017-03-27T21:11:24","type":"ores","logger_name":"uwsgi","host":"deployment-sca03","level":"ERROR","message":"[pid: 31379] 10.68.21.68 (-) {32 vars in 521 bytes} [Mon Mar 27 21:11:24 2017] GET /scores/enwiki/goodfaith/?model_info=test_stats&format=json => generated 2060 bytes in 7 msecs (HTTP/1.1 200) 6 headers in 209 bytes (1 switches on core 0) user agent \"MediaWiki/1.29.0-alpha\""}

The problem here is that logstash can only accept compressed input over GELF; plaintext is not supported. I'm no uwsgi expert, so I can't provide exact details on how to fix this, but for the logs to be accepted by logstash, saved into elasticsearch, and viewable in kibana, the uwsgi config in /etc/uwsgi/apps-available/ores.ini will need to be updated to compress the data it sends out over the socket connection.
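
For illustration, here is a minimal sketch of what a GELF-compliant sender has to do. This is not the actual uwsgi fix; the host value and message fields are assumptions borrowed from the example above. The point is that the GELF input expects each UDP datagram to be gzip/zlib-compressed JSON, which is why the plaintext payloads are rejected.

  import gzip
  import json
  import socket

  # Sketch only: compress a GELF-style JSON record with gzip and send it as a
  # single UDP datagram to the GELF input on port 12201. Host and field values
  # are illustrative, borrowed from the example message in this task.
  GELF_HOST = "localhost"  # assumption, not the real logstash endpoint
  GELF_PORT = 12201

  record = {
      "version": "1.1",
      "host": "deployment-sca03",
      "short_message": "GET /scores/enwiki/goodfaith/ ... (HTTP/1.1 200)",
      "_type": "ores",
      "_logger_name": "uwsgi",
      "_level": "ERROR",
  }

  payload = gzip.compress(json.dumps(record).encode("utf-8"))
  sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
  sock.sendto(payload, (GELF_HOST, GELF_PORT))

The same record sent uncompressed is what currently produces the malformed-GELF errors.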


Event Timeline

Restricted Application added a subscriber: Aklapper.

In production this error has been logged 704,874 times between 2017-03-27T06:25:20 and 2017-03-27T21:28:06.41, or just under 800 times per minute. The full cluster logs ~15k messages per second, so adding these to the set would be manageable. That said, these messages have never previously been available via centralized logging, so if we don't actually need them it might be worth just turning them off.

Halfak triaged this task as High priority. · Apr 13 2017, 3:15 PM
Halfak moved this task from Unsorted to Maintenance/cleanup on the Machine-Learning-Team board.

The uwsgi logging config is set up to send JSON datagrams to port 11514. It shouldn't be hitting the GELF input at all.
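
For context, a minimal sketch of the kind of plain (uncompressed) JSON datagram the uwsgi config is meant to emit toward that non-GELF input, assuming, as the comment above implies, that the port 11514 input accepts uncompressed JSON. The host and field values are illustrative only:

  import json
  import socket

  # Sketch only: a plain JSON datagram aimed at the non-GELF logstash input
  # on port 11514, which (per the comment above) is where these uwsgi logs
  # are supposed to go. No compression is needed on this path.
  LOGSTASH_HOST = "localhost"  # assumption, not the real endpoint
  JSON_UDP_PORT = 11514

  record = {
      "type": "ores",
      "logger_name": "uwsgi",
      "host": "deployment-sca03",
      "level": "ERROR",
      "message": "GET /scores/enwiki/goodfaith/ ... (HTTP/1.1 200)",
  }

  sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
  sock.sendto(json.dumps(record).encode("utf-8"), (LOGSTASH_HOST, JSON_UDP_PORT))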

The port issue got resolved. I made https://gerrit.wikimedia.org/r/#/c/348184/1 to send things over gzip (even if it's not strictly needed, let's have it for faster I/O).

Ladsgroup renamed this task from "ORES logs not being saved to logstash" to "Use gzip for logstash". · Apr 14 2017, 1:38 AM
Ladsgroup lowered the priority of this task from High to Medium.
Ladsgroup moved this task from Incoming to Blocked on others on the User-Ladsgroup board.
Ladsgroup moved this task from Parked to Review on the Machine-Learning-Team (Active Tasks) board.

Okay. My conclusion is that there are two things:

  • The beta cluster used a badly out-of-date puppet config, which caused the errors Erik mentioned. This was resolved simply by updating the puppetmaster.
  • Logs should be gzipped, but gzip is not necessary for logging to work: we still see the logs even though they are not gzipped (https://logstash.wikimedia.org/app/kibana#/dashboard/ORES). A quick way to check for gzip on the wire is sketched below.
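
A minimal sketch of one way to verify whether the datagrams arriving at logstash are actually gzip-compressed, assuming a small Python script; in practice you would capture the traffic with tcpdump rather than bind over a port logstash already listens on, but the magic-byte check is the same. The port number is an assumption taken from the earlier comment:

  import socket

  # Sketch only: listen on the UDP port the uwsgi logs arrive on and report
  # whether each datagram starts with the gzip magic bytes (0x1f 0x8b).
  # Plaintext JSON datagrams start with '{' instead.
  PORT = 11514  # assumption: the JSON datagram port mentioned in this task

  sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
  sock.bind(("0.0.0.0", PORT))

  while True:
      data, addr = sock.recvfrom(65535)
      kind = "gzipped" if data[:2] == b"\x1f\x8b" else "plaintext"
      print(f"{addr[0]}: {kind} ({len(data)} bytes)")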

Mentioned in SAL (#wikimedia-releng) [2017-04-14T08:03:42Z] <hashar> beta: resetting puppetmaster to last good tag snapshot-20170414T0030. A cherry-pick for T161563 ended up dropping three patches which broke other parts of the infrastructure

Mentioned in SAL (#wikimedia-releng) [2017-04-14T08:17:16Z] <hashar> beta: cherry picking again 348184/4 'service: use gzip for logging in uwsgi' for T161563

Mentioned in SAL (#wikimedia-releng) [2017-04-25T06:46:58Z] <Amir1> uncherry-pick f6ce64e99a and 225b8d4e82 (T161563)

When un-cherry-picked, it works like a charm. I'm cherry-picking it again to see what happens.

Change 348184 abandoned by Ladsgroup:
service: use gzip for logging in uwsgi

Reason:
It breaks the beta cluster. Let's not do it.

https://gerrit.wikimedia.org/r/348184