Notes from post mortem:
We need to reduce the heartache of inline logging to logstash: log to local rsyslog, let rsyslog queue and degrade gracefully, and MW chugs on happily in case of a flood. Without this we run an unnecessarily elevated risk of losing logs at the slightest Redis outage, and of ongoing cascading failures during periods of error or limited outage. The timeout to rsyslog should be very small. Logstash can alert on "dead" hosts, I believe.
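A minimal sketch of the "log to a local syslog daemon" idea, in Python for illustration only (MediaWiki itself is PHP, and this is not its configuration). The function name, logger name, and the default host and port are assumptions. The point is that the application hands each record to a daemon on the same host over UDP and moves on, while queueing and forwarding to logstash are left to rsyslog.

```python
import logging
import logging.handlers
import socket


def make_local_syslog_logger(name, host="127.0.0.1", port=514):
    """Return a logger that hands each record to a local syslog daemon over UDP.

    The name, host and port are illustrative assumptions, not values
    taken from the production setup.
    """
    handler = logging.handlers.SysLogHandler(
        address=(host, port),
        socktype=socket.SOCK_DGRAM,  # datagram send: no connect, no reply awaited
    )
    handler.setFormatter(logging.Formatter("%(name)s: %(levelname)s %(message)s"))
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.addHandler(handler)
    logger.propagate = False  # keep records out of the root logger's handlers
    return logger
```

With something like this in place the application never talks to Redis or logstash directly; if the downstream pipeline is unhealthy, that is rsyslog's problem to buffer or drop.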
- https://gerrit.wikimedia.org/r/#/c/181350/7/wmf-config/logging.php should have been reviewed more closely.
- Set a *much* lower timeout for Redis / Memcached, *especially* for non-critical things like logging
- Switch to UDP logging instead of Redis?
- The 0.25 second timeout setting was copied from $wgMemCachedTimeout / $wgBloomFilterStores['main']['redisConfig']['connectTimeout']. It is probably not a good value for these configuration options either.
- The timeout should be tested by making the logstash redis server actually unreachable and making sure that an app server under load isn't otherwise affected (i.e., everything but logging to logstash should work)
- A timeout of any length won't do it. Any time we have something important failing there are going to be showers of those for EACH request. Async is the only way (either natively or by using rsyslog). +1
- The Monolog stack allows us to easily change the transport layer by using a different "Handler" class. We just need to pick one (or even write our own if crazy). Rsyslog and GELF would both be possible. Either could use UDP without going back to the fragile udp2log -> log2udp -> logstash pipeline we were using before. Local syslog forwarding is working pretty well for the hhvm logs, I think. That may be the easiest to try first. One problem with that right now is that all of the hhvm rsyslog servers are pointed at the same IP in the logstash cluster instead of spreading the output across all of them. --bd808
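The "make the log target unreachable and check nothing else suffers" test from the list above can be approximated with a small timing harness. This is a Python sketch under stated assumptions, not the actual MediaWiki/Monolog code: `send_udp` and `worst_case_latency` are hypothetical helper names, and the 0.25 second budget is simply the timeout value mentioned in the notes.

```python
import socket
import time


def send_udp(message: bytes, host: str, port: int) -> None:
    """Fire-and-forget: one datagram, no connection, no reply expected."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.setblocking(False)  # never wait on a full send buffer
        try:
            sock.sendto(message, (host, port))
        except OSError:
            pass  # dropping a log line is preferable to stalling the request


def worst_case_latency(send, attempts: int = 100) -> float:
    """Call send() repeatedly and return the slowest call, in seconds."""
    worst = 0.0
    for _ in range(attempts):
        start = time.perf_counter()
        send()
        worst = max(worst, time.perf_counter() - start)
    return worst
```

Pointing `send_udp` at a port with no listener and checking that `worst_case_latency` stays far below the budget gives a quick sanity check. It is not a substitute for the real test described above, where the logstash Redis server is actually taken offline while an app server is under load.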