
RESTBase logging broken in both production & staging
Closed, Resolved · Public

Description

Seems to have stopped on the 14th:

pasted_file (5×1 px, 2 MB)

There were no deploys that day, nor any config changes we are aware of.

@bd808, were there any changes in logstash around that time?

Event Timeline

GWicke set the priority of this task to Needs Triage.
GWicke updated the task description.
GWicke added a project: RESTBase.
GWicke subscribed.
GWicke raised the priority of this task from High to Unbreak Now!.
GWicke set Security to None.
GWicke updated the task description.

were there any changes in logstash around that time?

SAL says there were (seemingly unrelated) changes wrt ganglia on the nodes on 2015-09-13, but that's it.

I didn't find anything obviously related in the puppet git log either.

The only candidates from SAL:

  • some elasticsearch changes on the 15th
  • lots of FERM firewalling, but mostly after the 14th as well

RESTBase is configured to log to logstash1001. (I don't think we have a service IP / LVS for logstash -- fixing that should be a priority.)

On logstash1001, I see the following in /var/log/logstash/logstash.err:

Error: Your application used more memory than the safety cap of 500M.
Specify -J-Xmx####m to increase it (#### = cap size in MB).
Specify -w for full OutOfMemoryError stack trace

The message is not timestamped, but the file was last modified on Sep 14 14:37.
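
Since the error itself carries no timestamp, the file's modification time is the only clue to when it was written. A minimal Python sketch of reading it in UTC; the helper name `last_modified_utc` is made up for illustration, and the path in the comment is the one quoted above:

```python
import os
from datetime import datetime, timezone


def last_modified_utc(path):
    """Return the file's last-modification time as a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(os.path.getmtime(path), tz=timezone.utc)


# Example call (path as quoted above; only meaningful on the logstash host):
#   last_modified_utc("/var/log/logstash/logstash.err").strftime("%b %d %H:%M")
```

The mtime only dates the most recent write to the file, so it pins down the last message rather than the first sign of trouble.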

The service appears to have exited then, and it has not been restarted:

$ service logstash status
● logstash.service - LSB: Starts Logstash as a daemon.
   Loaded: loaded (/etc/init.d/logstash)
   Active: active (exited) since Thu 2015-08-27 15:59:19 UTC; 3 weeks 0 days ago

Aug 27 15:59:19 logstash1001 logstash[629]: logstash started.
Aug 27 15:59:19 logstash1001 systemd[1]: Started LSB: Starts Logstash as a daemon..
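
The status output above is easy to misread: as far as I understand it, `active (exited)` for an LSB init script only means the script ran and returned, not that a logstash process is still alive. A small Python sketch that flags this state from the text of the status output; the function names are illustrative and not part of any existing tool:

```python
import re


def parse_active_state(status_text):
    """Extract (state, substate) from the 'Active:' line of systemd status output.

    Returns None if no such line is found.
    """
    match = re.search(r"^\s*Active:\s+(\S+)\s+\(([^)]+)\)", status_text, re.MULTILINE)
    if match is None:
        return None
    return match.group(1), match.group(2)


def looks_alive(status_text):
    """True only when systemd reports the unit as active with a running process."""
    return parse_active_state(status_text) == ("active", "running")
```

With the output pasted above, `parse_active_state` yields `("active", "exited")`, so `looks_alive` is false even though the unit is nominally "active".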

logstash1001 is also the forward_syslog for MediaWiki, so I imagine MediaWiki logs are missing, too.

Change 239307 had a related patch set uploaded (by Ori.livneh):
logstash: add monitoring for logstash process

https://gerrit.wikimedia.org/r/239307

Change 239307 merged by Ori.livneh:
logstash: add monitoring for logstash process

https://gerrit.wikimedia.org/r/239307
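
The merged change is not reproduced here. As a rough illustration of what a process check does, the following Python sketch scans a Linux `/proc` tree for a command line containing a given string. The function names, the `proc_root` parameter, and the Nagios-style return codes are assumptions made for this example, not taken from the patch:

```python
import os


def find_pids(needle, proc_root="/proc"):
    """Return sorted PIDs whose command line contains `needle`."""
    pids = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self" or "meminfo"
        try:
            with open(os.path.join(proc_root, entry, "cmdline"), "rb") as fh:
                # Arguments in /proc/<pid>/cmdline are NUL-separated.
                cmdline = fh.read().replace(b"\0", b" ").decode("utf-8", "replace")
        except OSError:
            # Process exited between listdir() and open(), or access was denied.
            continue
        if needle in cmdline:
            pids.append(int(entry))
    return sorted(pids)


def check_process(needle, proc_root="/proc"):
    """Return a Nagios-style (code, message) pair: 0 = OK, 2 = CRITICAL."""
    pids = find_pids(needle, proc_root)
    if pids:
        return 0, "OK: %d process(es) matching %r" % (len(pids), needle)
    return 2, "CRITICAL: no process matching %r" % needle
```

A check like this looks at the process table instead of the init system's view, so it would have caught the `active (exited)` case. One caveat: a plain substring match also hits any other process that merely mentions the name on its command line, including the checker itself if it is invoked with the name as an argument.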

ori claimed this task.

Thanks, @ori.

I'll keep this open for now to track some follow-up work:

  • move away from logstash1001, to avoid being co-located with mw logs
  • follow up on setting up a service ip for logstash

Change 239506 had a related patch set uploaded (by GWicke):
Switch RESTBase logging to logstash1002.eqiad

https://gerrit.wikimedia.org/r/239506

Change 239506 merged by Ori.livneh:
Switch RESTBase logging to logstash1002.eqiad

https://gerrit.wikimedia.org/r/239506

Thanks, @ori.

I'll keep this open for now to track some follow-up work:

  • move away from logstash1001, to avoid being co-located with mw logs

This is now done with https://gerrit.wikimedia.org/r/239506.

  • follow up on setting up a service ip for logstash

Tracked in T113104: Set up a service IP for logstash.

Closing this task.