
Decouple logging infrastructure failures from MediaWiki logging
Closed, Resolved (Public)

Description

Notes from post mortem:

We need to reduce the heartache of inline logging to logstash: log to local rsyslog; rsyslog queues and degrades gracefully, and MW chugs on happily even during a flood. Without this we will run an unnecessarily elevated risk of losing logs at the slightest Redis outage, and of ongoing cascading failures during periods of error or limited outage. The timeout to rsyslog should be very small. Logstash can alert on "dead" hosts, I believe.

  • https://gerrit.wikimedia.org/r/#/c/181350/7/wmf-config/logging.php should have been reviewed more closely.
  • Set a *much* lower timeout for Redis / Memcached, *especially* for non-critical things like logging
  • Switch to UDP logging instead of Redis?
  • The 0.25 second timeout setting was copied from $wgMemCachedTimeout / $wgBloomFilterStores['main']['redisConfig']['connectTimeout']. It is probably not a good value for these configuration options either.
  • The timeout should be tested by making the logstash redis server actually unreachable and making sure that an app server under load isn't otherwise affected (i.e., everything but logging to logstash should work)
  • A timeout alone won't do it. Any time we have something important failing there are going to be showers of those for EACH request. Async is the only way (either natively or by using rsyslog). +1
  • Monolog stack allows us to easily change the transport layer by using a different "Handler" class. Just need to pick one (or even write our own if crazy). Rsyslog and GELF would both be possible. Either could use UDP without going back to the fragile udp2log -> log2udp -> logstash pipeline we were using before. Local syslog forwarding is working pretty well for the hhvm logs I think. That may be the easiest to try first. One problem with that right now is that all of the hhvm rsyslog servers are pointed at the same IP in the logstash cluster instead of spreading the output to all of them. --bd808

Related Objects

Event Timeline

chasemp created this task. Feb 5 2015, 8:02 PM
chasemp raised the priority of this task to Needs Triage.
chasemp updated the task description.
chasemp added a subscriber: chasemp.
Restricted Application added a subscriber: Aklapper. Feb 5 2015, 8:02 PM

Monolog logging is currently disabled in the prod cluster via https://gerrit.wikimedia.org/r/#/c/188837/. We should not re-enable it until we figure out the right transport to use for logstash shipping.

bd808 triaged this task as High priority. Feb 5 2015, 8:18 PM

The two easiest paths forward for this are to switch from Monolog\Handler\RedisHandler to either:

  • Monolog\Handler\SyslogUdpHandler and add rsyslog config to forward to logstash as is done for HHVM and Apache logs
  • Monolog\Handler\GelfHandler and send UDP packets directly to logstash as the nodejs and hadoop clusters do

These are both transports that we already have enabled for Logstash and either would shift the MediaWiki output to fire and forget UDP packets.
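For concreteness, the handler swap in the logging config might look roughly like this. This is a sketch only: the host, port, and facility values are placeholders, and the constructor shapes are Monolog 1.x / gelf-php 1.x as best I know them, not vetted production config:

```php
<?php
use Monolog\Logger;
use Monolog\Handler\SyslogUdpHandler;

$logger = new Logger( 'mediawiki' );

// Option 1: syslog over UDP; rsyslog (or Logstash's syslog input)
// receives the datagrams. Host and port here are placeholders.
$logger->pushHandler( new SyslogUdpHandler( '127.0.0.1', 10514 ) );

// Option 2: GELF over UDP straight to Logstash, using the
// bzikarsky/gelf-php library (class names per gelf-php 1.x):
// $transport = new Gelf\Transport\UdpTransport( 'logstash.example.net', 12201 );
// $logger->pushHandler( new Monolog\Handler\GelfHandler( new Gelf\Publisher( $transport ) ) );
```

Either way the rest of the Monolog stack (processors, formatters, channels) stays the same; only the handler at the bottom changes.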

The GELF handler requires the use of an additional PHP library (https://github.com/bzikarsky/gelf-php) that would need a security review to add to mediawiki/vendor. One thing I like better about it, however, is that it would allow per-request round robin selection of a destination Logstash server. The current rsyslog forwarding pins all rsyslog input to a single Logstash host. This could probably be fixed with some LVS magic in front of the Logstash aggregators.

For the love of consistency I would vote rsyslog. It's very standard and fairly easily debuggable, and we have to live with it anyway. I'm not sure what you mean by per-request round robin selection. Do you mean that before sending each syslog event it would pick a logstash host from a pool and fire it off?

The way I have distributed syslog before is per host. We could potentially use hiera to split our app server hosts into groups, with each group using a particular logstash aggregator? I would vote for that. Generally rsyslog filter conditions are robust enough that I wouldn't be surprised if we could route things in that manner.

One possible negative of rsyslog as the shipping transport for structured application logs is the hard 4K line length limit. 4K sounds like a lot of text for a log message, but stack traces can get really big pretty easily. The message format that we are using with Monolog for shipping to Logstash is a JSON structure and if it is truncated by the transport layer the event will not be decoded and stored by Logstash. I'll try to figure out how common really large stacktraces are in our PHP stack by looking at some logs for the past few weeks.

Segregating the rsyslog traffic destination by host would probably cover the same thing I was trying to do with the random selection of Redis address per request in the current PHP code. Mostly I wanted to spread the MediaWiki log events across multiple Logstash backends in a semi-fair spread. For the non-Redis transports we are currently using (GELF and syslog) we have not been balancing the traffic. All HHVM & apache syslog events are sent to logstash1001 currently and the GELF applications have each picked a single logstash100* host as well.

> One possible negative of rsyslog as the shipping transport for structured application logs is the hard 4K line length limit. 4K sounds like a lot of text for a log message, but stack traces can get really big pretty easily. The message format that we are using with Monolog for shipping to Logstash is a JSON structure and if it is truncated by the transport layer the event will not be decoded and stored by Logstash. I'll try to figure out how common really large stacktraces are in our PHP stack by looking at some logs for the past few weeks.

I think we could accommodate that with a global rsyslog directive:

http://www.rsyslog.com/doc/v8-stable/configuration/global/index.html#true-global-directives

$MaxMessageSize (32k is the current maximum needed for IHE)

This is for TCP, not UDP, but I think we would be interested in TCP as a transport mechanism if the actual offloading of the message from the box is done asynchronously from processing client requests.
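As a sketch of the directive in question (placement matters — per the rsyslog docs, $MaxMessageSize must appear before any input modules are loaded — and the value here is illustrative, not a recommendation):

```
# /etc/rsyslog.conf (sketch)
# Must come before any module(load=...) / $ModLoad lines.
$MaxMessageSize 64k
```

The default ceiling in stock rsyslog is far lower than the payloads a big JSON-encoded stack trace would produce, which is the whole point of raising it.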

faidon added a subscriber: faidon. Feb 9 2015, 5:08 PM

I think I'd prefer going with Gelf directly from MediaWiki rather than involving another component in the path. Are there any other downsides compared to syslog besides the extra PHP library dependency?

> I think I'd prefer going with Gelf directly from MediaWiki rather than involving another component in the path. Are there any other downsides compared to syslog besides the extra PHP library dependency?

Not that I've found yet but I haven't deeply investigated Monolog's GELF support. I've been exploring the syslog path in T87521 but I can trial a GELF configuration as well once I get that working.

One downside of the syslog transport is that it will require configuration changes for the rsyslog on each MediaWiki host in addition to the Monolog configuration. Rsyslog would need to be configured to listen for UDP input at least on 127.0.0.1 and a relay rule would also be needed to forward the messages either directly to Logstash or to an intermediate aggregator. Initial investigation has also shown that the built-in Monolog syslog packet formatting system is not sufficient for our needs, but the fix for this seems trivial at this point.
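The per-host rsyslog changes described above would be along these lines — illustrative only; the port, facility selector, and destination are placeholders:

```
# Accept MediaWiki syslog datagrams on loopback only.
module(load="imudp")
input(type="imudp" address="127.0.0.1" port="10514")

# Relay them on, either directly to Logstash or to an
# intermediate aggregator ("@" = UDP, "@@" = TCP).
local0.* @logstash.example.net:10514
```

That's two moving pieces per host (input + relay rule) on top of the Monolog change, which is the configuration overhead being weighed here.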

> Not that I've found yet but I haven't deeply investigated Monolog's GELF support. I've been exploring the syslog path in T87521 but I can trial a GELF configuration as well once I get that working.

I actually meant T88870. What I've found there is that we could use a syslog over UDP transport and send the datagrams either directly to Logstash or use rsyslog as an intermediate buffer/aggregator.

faidon added a subscriber: ori. Feb 10 2015, 4:14 PM

We don't really care about local logging, so I don't see why we'd involve the local rsyslog. Remote syslog over UDP and GELF over UDP sound equally good from my perspective (i.e. without taking into account @ori's concerns about embedding JSON structures).

greg added a subscriber: greg. Feb 10 2015, 4:29 PM
chasemp added a comment (edited). Feb 10 2015, 4:32 PM

Thought about this a bit and actually have had this same conversation in the past. I don't think rsyslog is clearly a better option but if I was making the call I would choose it for these reasons:

  • Sending to local rsyslog means that in the event of whacky things happening (like a cascading failure that happens to bring logstash down while being coupled with an unforeseen network error) we can route logs to local disk on specific hosts to troubleshoot trivially
  • Rsyslog is another gear in the mechanism but with pretty low overhead. I think also that listening for logs locally is more or less standard practice anyway. Not much in the configuration overhead department.
  • It is true we don't do log review in any real fashion and the vast majority of our logs pass silently, but the periods where we really care about doing reconstruction of a timeline seem to be self-selecting as periods where network errors happen and thus we increase the risk of losing the only logs we actually care about.
  • This is more or less free resiliency and flexibility in the log routing pipeline that also maintains a standard

That being said, making it so this isn't a synchronous process with client requests was the only real actionable item, so I'm good with both proposed solutions. Just wanted to add my 2 cents :)

bd808 added a comment (edited). Feb 11 2015, 1:00 AM

Input from one logging customer:

[14:44]  <    bd808>	hoo: how important is debug message capture loss to you as a consumer of the logs? Meaning how hard do you think we should be working to make sure that every wfDebugLog call ends up on fluorine and in logstash?
[14:44]  <    bd808>	Obviously no logging sucks, but is some amount of loss reasonable?
[14:45]  <      hoo>	You mean from the stuff we have set as what we want to log?
[14:45]  <      hoo>	In InitializeSettings
[14:45]  <    bd808>	Asking this in context of https://phabricator.wikimedia.org/T88732
[14:45]  <    bd808>	the redis thing was an attempt to be really lossless
[14:46]  <    bd808>	that blew up
[14:46]  <    bd808>	now trying to find real requirements to inform the next gen design
[14:46]  <      hoo>	I can't give you any numbers, but I had various occasions in the past where I needed one specific log entry
[14:47]  <      hoo>	So having loss there is a rather huge trade off
[14:47]  <    bd808>	I can say pretty confidently that udp2log loses some unquantified amount of messages
[14:48]  <      hoo>	I know :/
[14:48]  <      hoo>	Also api.log is now sampled and all things like taht :/
[14:48]  <      hoo>	Especially for the SULF stuff I had quite some occasions where I just couldn't investigate because of missing logs
[14:48]  <  legoktm>	+1 ^
[14:48]  <    bd808>	*nod*
[14:49]  <      hoo>	But before the hhvm switch it mostly just worked fine
[14:49]  <      hoo>	(and full api logs are also important, to stress that)

I took a look at GELF as a transport option and found it lacking. The Monolog GELF formatter works around GELF's lack of support for lists (as would normally be used for stack traces) by embedding JSON-encoded data in a string key that we would have to rehydrate on the Logstash side.
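Rehydrating the embedded JSON on the Logstash side would mean something like a json filter in the pipeline config — the source field name here is hypothetical, standing in for whatever key the Monolog GELF formatter actually uses:

```
filter {
  json {
    source => "exception_trace"   # hypothetical field holding embedded JSON
    target => "exception"
  }
}
```

It works, but it couples the Logstash config to an implementation detail of the PHP formatter, which is part of why GELF looks less attractive here.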

> Thought about this a bit and actually have had this same conversation in the past. I don't think rsyslog is clearly a better option but if I was making the call I would choose it for these reasons:
>
>   • Sending to local rsyslog means that in the event of whacky things happening (like a cascading failure that happens to bring logstash down while being coupled with an unforeseen network error) we can route logs to local disk on specific hosts to troubleshoot trivially
>   • Rsyslog is another gear in the mechanism but with pretty low overhead. I think also that listening for logs locally is more or less standard practice anyway. Not much in the configuration overhead department.
>   • It is true we don't do log review in any real fashion and the vast majority of our logs pass silently, but the periods where we really care about doing reconstruction of a timeline seem to be self-selecting as periods where network errors happen and thus we increase the risk of losing the only logs we actually care about.
>   • This is more or less free resiliency and flexibility in the log routing pipeline that also maintains a standard
>
> That being said, making it so this isn't a synchronous process with client requests was the only real actionable so I'm good with both proposed solutions Just wanted to add my 2 cents :)

As I see it, the debate thus far has settled down to whether or not to employ rsyslog on the localhost as an intermediary between MediaWiki's Monolog logging stack and Logstash.

Pros:

  • Log forwarding rules in a semi-central place
    • Apache2 and HHVM already relay logs to Logstash via rsyslog forwarding rules
  • Ability to tweak rsyslog rules to dump logs locally if needed in a debugging scenario
  • Ability to use TCP over datagrams for forwarding to Logstash
  • (?) Ability to buffer log events in case of Logstash or network routing failure

Cons:

  • Need to configure rsyslog to ensure that large datagrams (up to 64K) are collected and forwarded properly
  • Yet another moving part that might fail in the log stack
  • (?) Round-robin use of multiple Logstash backends for each rsyslog instance difficult

The easiest way to move forward from the MediaWiki side as I see it would be to merge my Syslog over UDP Handler and configure MediaWiki to talk directly to Logstash. Rsyslog could be inserted as an intermediary if and when Ops decide that the merits of having it outweigh the negatives. Switching at that point would only require a config change in MediaWiki to alter the host and port that Monolog was emitting datagrams to.
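In other words, the later cut-over to an rsyslog intermediary should reduce to a destination change in config, something like the following — the variable names are hypothetical, not the actual wmf-config settings:

```php
<?php
// Phase 1: Monolog emits syslog UDP datagrams straight at Logstash.
// 'logstash1001' is one of the existing aggregator hosts mentioned above.
$wmgLogstashSyslogHost = 'logstash1001'; // placeholder value
$wmgLogstashSyslogPort = 10514;          // placeholder value

// Phase 2 (if/when Ops inserts rsyslog): same handler, same config
// shape, only the destination changes.
// $wmgLogstashSyslogHost = '127.0.0.1';
```

The handler and formatter code paths stay identical across both phases, which is what makes the incremental rollout cheap.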

Does this sound like a reasonable plan or should I go back to the drawing board and look for other options?

Seems reasonable to me :). We can always adjust as you say.

I have everything working in beta to ship log events to Logstash using syslog UDP datagrams sent directly from MediaWiki to the Logstash server. See T88870 for details. We could start rolling this to production in parallel with the upcoming 1.25wmf18 branch.

Backporting all the necessary pieces to 1.25wmf17 would be kind of a pain and until we have new hardware for the Logstash Elasticsearch cluster (T89402) a full deployment would be likely to cause traffic volume related outages/gaps in Logstash anyway.

gerritbot added a subscriber: gerritbot.

Change 191259 had a related patch set uploaded (by BryanDavis):
logstash: Ship logs via syslog udp datagrams

https://gerrit.wikimedia.org/r/191259


bd808 added a project: MediaWiki-Core-Team.
bd808 moved this task from Backlog to In Dev/Progress on the MediaWiki-Core-Team board.
bd808 changed the task status from Open to Stalled. Feb 24 2015, 12:29 AM

The configuration change to begin testing this in production has been blocked in code review based on the complexity and current method of generating the Monolog logger configuration. I have neither the time nor the energy currently to rethink and rewrite the entire configuration mechanism and I'm honestly not sure that the simple configuration desired by the reviewer is possible with the current state of MediaWiki logging.

ori added a comment. Feb 24 2015, 12:45 AM

If you are not sure that using Monolog is a good solution, evaluate it further in Labs or another development environment before enabling it in production. If you are sure that Monolog is a good solution, then there is no reason to shirk the work of introducing it properly, which is by deprecating $wgDebugLogGroups and rewriting the configuration blocks that utilize it in operations/mediawiki-config to use Monolog instead.

In T88732#1061071, @ori wrote:

> If you are not sure that using Monolog is a good solution, evaluate it further in Labs or another development environment before enabling it in production. If you are sure that Monolog is a good solution, then there is no reason to shirk the work of introducing it properly, which is by deprecating $wgDebugLogGroups and rewriting the configuration blocks that utilize it in operations/mediawiki-config to use Monolog instead.

Fair enough. The specifics of how the Monolog config will be presented in mediawiki-config may be the point of debate. A direct Monolog config will look something like https://gerrit.wikimedia.org/r/#/c/175604/3/wmf-config/logging-labs.php,unified. Will that pass muster with you, or will it also be rejected for being too complex? MaxSem wasn't excited about that level of config verbosity, which in part led to the first generated config I introduced in https://gerrit.wikimedia.org/r/#/c/178977/1/wmf-config/logging-labs.php,unified.

It was also a nice thing to be able to generate the config at that point so that it would not drift by having two separate logging systems to configure during the transition from the LegacyLogger to Monolog. In some sense we are back in that position again until we are sure that the syslog UDP transport will work for us at production scale. I want to go through testing this on group0 followed by group1 before finally re-enabling it for all wikis. The volume of log messages emitted in beta really doesn't give anything a workout as far as load testing goes.

I am also not keen to re-enable all wikis logging to Logstash until we get the new hardware for the backing Elasticsearch cluster and migrate over to using it. Based on past performance I think we can just handle group0 + group1 traffic. Adding in the Wikipedia sites brought our little logging cluster to its knees last time and I see no reason to believe that it won't do it again.

Maybe what we could do for now is only configure a small number of logging channels (exceptions being most useful in my opinion and the api data that Brad needs for his extension being next on the list) across all wikis to go to Logstash and leave the rest going only to the udp2log service on fluorine.
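A partial rollout like that could be expressed as a per-channel map in the logging configuration, along these lines — the structure and variable name here are illustrative, not the actual wmf-config schema:

```php
<?php
// Only high-value channels ship to Logstash for all wikis; everything
// else continues to flow only to udp2log on fluorine.
$wmgMonologChannels = [ // hypothetical variable name
    'exception' => [ 'udp2log' => true, 'logstash' => true ],
    'api'       => [ 'udp2log' => true, 'logstash' => true ],
    'default'   => [ 'udp2log' => true, 'logstash' => false ],
];
```

Flipping a channel's 'logstash' flag then becomes a one-line config change, which makes it easy to ramp volume up as the Elasticsearch cluster allows.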

RandomDSdevel awarded a token.

I have a new and improved version of the mediawiki-config patch (https://gerrit.wikimedia.org/r/#/c/191259/) that I hope better addresses @ori's concerns than my prior attempts. The config is still more complex than I would ultimately like to see, but until all of core is converted to PSR-3 style logging with appropriate log levels for each message (T88649) I think it's about as compact as it can really be.

Change 191259 merged by jenkins-bot:
debug logging: Convert to Monolog logging

https://gerrit.wikimedia.org/r/191259

Change 209170 had a related patch set uploaded (by BryanDavis):
Send group0 group1 MediaWiki events to logstash

https://gerrit.wikimedia.org/r/209170

Change 209172 had a related patch set uploaded (by BryanDavis):
Send MediaWiki events for all wikis to Logstash

https://gerrit.wikimedia.org/r/209172

Change 209170 merged by jenkins-bot:
Send group0 group1 MediaWiki events to logstash

https://gerrit.wikimedia.org/r/209170

Change 209172 merged by jenkins-bot:
Send MediaWiki events for all wikis to Logstash

https://gerrit.wikimedia.org/r/209172

bd808 closed this task as Resolved. May 11 2015, 3:17 PM

All wikis are logging to Logstash again using the syslog UDP datagram transport.

Restricted Application added a subscriber: Matanya. Jul 6 2015, 4:01 PM
Restricted Application added a project: User-bd808. Jul 29 2016, 8:19 PM
bd808 moved this task from To Do to Archive on the User-bd808 board. Sep 1 2016, 5:14 PM