
MediaWiki errors overloading logstash
Closed, Resolved · Public

Description

Today db1114 went down (T214720) and even though the LB worked great (T180918 - thanks!!) MW still logged millions of errors (https://logstash.wikimedia.org/goto/0121ecd09eca77276c4af8bcde8dc2a2)

It caused logstash to lag behind and suffer packet loss while the issue lasted:
Dashboard: https://grafana.wikimedia.org/d/000000561/logstash?orgId=1&from=1549625552953&to=1549632990689&var-input=kafka%2Frsyslog-udp-localhost
Packet loss: https://grafana.wikimedia.org/d/000000561/logstash?orgId=1&from=1549622528006&to=1549633328006&var-input=kafka%2Frsyslog-udp-localhost
Input increase: https://grafana.wikimedia.org/d/000000561/logstash?orgId=1&from=1549622564189&to=1549633364189&var-input=kafka%2Frsyslog-udp-localhost

We should probably do something about this and try to rate-limit the volume of MW logging for a given event.
This task aims to start a discussion about how this can be better handled in the future, as unfortunately hardware can always fail and this scenario will happen again sooner or later.

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald Transcript · Feb 8 2019, 1:47 PM

To clarify "lag behind": it created at least 20 minutes of lag, which would have blocked any MediaWiki deployment and/or hidden any other ongoing error.

I don't feel nearly well-versed enough in PHP/PSR-3/Monolog or the MW codebase to suggest implementations, but it seems to me that MediaWiki itself should be doing some rate limiting/deduplication of its logging output.

There are likely things we can do to make the logging pipeline more robust, but handling spikes of ~300,000 events/minute seems like it is always going to be challenging.
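To make the rate-limiting idea concrete, here is a minimal sketch of a per-channel/message-template token bucket that would cap a flood of identical errors while letting an initial burst through. It is written in Python purely for illustration (class and parameter names are invented; this is not MediaWiki or Monolog code):

```python
import time
from collections import defaultdict

class LogRateLimiter:
    """Illustrative token-bucket limiter keyed by (channel, message template).

    Each key gets `burst` tokens up front and refills at `rate_per_sec`;
    an event is forwarded only if a token is available, so a sustained
    flood of one error is capped instead of overwhelming the pipeline.
    """

    def __init__(self, rate_per_sec, burst):
        self.rate = rate_per_sec
        self.burst = burst
        self.state = defaultdict(lambda: {"tokens": burst, "ts": time.monotonic()})

    def allow(self, channel, template):
        s = self.state[(channel, template)]
        now = time.monotonic()
        # refill proportionally to elapsed time, capped at the burst size
        s["tokens"] = min(self.burst, s["tokens"] + (now - s["ts"]) * self.rate)
        s["ts"] = now
        if s["tokens"] >= 1:
            s["tokens"] -= 1
            return True
        return False  # suppress (or merely count) the event instead of forwarding it

limiter = LogRateLimiter(rate_per_sec=10, burst=100)
sent = sum(limiter.allow("DBReplication", "server {host} is not replicating?")
           for _ in range(10000))
# roughly the burst (plus a little refill) gets through; the rest is suppressed
```

A real version would need the counters in shared storage rather than per-process memory, which is exactly the difficulty discussed below.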

fsero added a subscriber: fsero. Feb 8 2019, 4:21 PM
fsero added a comment. Feb 8 2019, 4:44 PM

My 2 cents: I think what @CDanis proposes seems the right approach. Alternatively, we could rate-limit Kafka output using some of the librdkafka parameters (https://github.com/edenhill/librdkafka/blob/master/CONFIGURATION.md), like queue.buffering.max.messages, if the installed librdkafka version and Kafka version support it (probably yes).

Reedy renamed this task from MediaWiki errors overloading logtash to MediaWiki errors overloading logstash. Feb 8 2019, 7:19 PM
Marostegui updated the task description. (Show Details) Feb 8 2019, 7:48 PM
jijiki triaged this task as High priority. Feb 11 2019, 4:35 PM

Since the spike was logging from the LoadBalancer layer, pinging @aaron because he knows that code best.

For the record https://logstash.wikimedia.org/goto/0121ecd09eca77276c4af8bcde8dc2a2 shows messages on the DBQuery, DBReplication, and DBConnection channels from 12:12:49 to 12:46:42 UTC. The "events over time" graph goes from near-zero to around 135k per 30 seconds at 12:16–12:17 and continuing at about that level to the end of the timeframe. The message summary box lists the following:

message | count
Wikimedia\Rdbms\LoadBalancer::getRandomNonLagged: server {host} is not replicating? | 7,676,280
Error connecting to {db_server}: {error} | 61,535
Wikimedia\Rdbms\LoadMonitor::getServerStates: host {db_server} is unreachable | 36,119
{fname} {db_server} {errno} {error} {sql1line} | 16
jcrespo added a comment (edited). Feb 11 2019, 6:17 PM

There was a small mention of this in the SRE meeting; while no exact decision was made, in general there are 2 separate measures:

  • Hardening the infrastructure is desirable, and some things had already been proposed (rsyslog could deduplicate messages?), among others

but, in general

  • There is no desire to support the use case of a 10x increase in logging rate, which implies there should also be changes on the MediaWiki side.

There was some discussion on the second point, but there is no clear way to do the second without hacks (emulating application scope through memcache). Please provide some ideas around this.

On a related note, I created another non-user-facing incident today by deploying a software change that pooled a new server with bad grants: almost 20-30 minutes of lag were created, and in the future this could delay detection of other concurrent issues https://logstash.wikimedia.org/goto/ac9366ef568a8ef97106f08b002eb85e Sorry about that. In this case, this was a server with 0.1% of the traffic; as anomie confirms, the issue doesn't happen due to connection handling, but due to replication control.

Edit: When I say code/mediawiki, I don't mean that literally; I mean the load balancer/ChronologyProtector/replication control logic. There are of course options for handling that outside of pure code (e.g. dynamic configuration?)

Edit2: I forgot: the SRE team in charge of monitoring/logs will create a more detailed incident report to understand the actual impact.

I wonder if this is a good use case for the circuit breaker pattern.
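For reference, the circuit breaker pattern amounts to: after N consecutive failures, stop attempting the operation (and stop generating fresh errors to log) until a cooldown elapses, then allow one probe. A toy Python sketch with invented names, not an existing MediaWiki class:

```python
import time

class CircuitBreaker:
    """Minimal circuit-breaker sketch (hypothetical, for illustration only).

    After `threshold` consecutive failures the circuit opens: calls are
    rejected immediately, with no connection attempt and no fresh error
    to log, until `cooldown` seconds pass and one probe is allowed.
    """

    def __init__(self, threshold=5, cooldown=30.0):
        self.threshold = threshold
        self.cooldown = cooldown
        self.failures = 0
        self.opened_at = None

    def call(self, func):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.cooldown:
                raise RuntimeError("circuit open: not attempting call")
            self.opened_at = None  # half-open: allow one probe through
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        return result

breaker = CircuitBreaker(threshold=3, cooldown=60)
attempts = []

def flaky():
    attempts.append(1)              # count real connection attempts
    raise ConnectionError("db1114 down")

for _ in range(100):
    try:
        breaker.call(flaky)
    except Exception:
        pass
# only 3 real attempts are made; the other 97 calls are rejected up front
```

Applied to the incident above, the win is twofold: the failing server stops being hammered, and the error volume collapses from millions of near-identical messages to a handful per cooldown window.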

Change 490119 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[mediawiki/core@master] rdbms: avoid duplicate spammy logging in LoadBalancer::getRandomNonLagged

https://gerrit.wikimedia.org/r/490119

I don't feel nearly well-versed enough in PHP/PSR-3/Monolog or the MW codebase to suggest implementations, but it seems to me that MediaWiki itself should be doing some rate limiting/deduplication of its logging output.

To do this effectively within MediaWiki, we would need shared state to check, either across requests to the same deployment or across a cluster of MediaWiki deployments, depending on how much consolidation and rate limiting is desired. Because PHP is by design a shared-nothing environment, where all application state is created and destroyed for each and every request, this shared state tracking would need to live in an external service (memcached, redis, mysql/mariadb, etc). This is a slight overgeneralization, as we do deploy PHP with APC caching enabled, which allows some storage to be shared locally across the PHP interpreter instances on a single server, but it is broadly true.

Performing stateful message inspection/de-duplication in line with the emission of log events will have some impact on the runtime of MediaWiki. How much overhead this would impose is an open question and would likely vary by implementation. I would be very surprised, however, if in practice it did not amount to a fairly significant runtime cost in the aggregate.
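As a rough illustration of where the in-process cost comes from, here is a toy benchmark (Python, purely illustrative): a local set stands in for the shared state, so this measures only the bookkeeping; a real implementation would additionally pay serialization plus a network round trip to memcached/redis per check, which would dominate.

```python
import timeit

def log_plain(events, sink):
    # baseline: forward every event untouched
    for channel, msg in events:
        sink.append((channel, msg))

def log_dedup(events, sink):
    # in-line de-duplication: one set lookup + insert per event
    seen = set()
    for channel, msg in events:
        if (channel, msg) not in seen:
            seen.add((channel, msg))
            sink.append((channel, msg))

# synthetic workload: 856 events (the per-request count measured below)
# cycling over 50 distinct messages, roughly mimicking repeated DB errors
events = [("DBConnection", "Error connecting to db%d" % (i % 50))
          for i in range(856)]

sink_plain, sink_dedup = [], []
log_plain(events, sink_plain)
log_dedup(events, sink_dedup)

plain_s = timeit.timeit(lambda: log_plain(events, []), number=200)
dedup_s = timeit.timeit(lambda: log_dedup(events, []), number=200)
```

Even the in-memory version adds work per event; swap the set for a shared store and each check becomes a cross-process (or cross-network) operation on the hot path of every request.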

Just to get an idea of the current maximum log event output from MediaWiki for a single request I did a little experiment using the X-Wikimedia-Debug header to force full logging for render of enwiki's Barack Obama article:

  • ssh mwlog1001
  • tail -0f /srv/mw-log/XWikimediaDebug.log | tee /tmp/bd808-T215611.log
  • enabled Wikimedia Debug Header extension and set it to make requests to mwdebug1002 requesting full log output
  • Navigated to https://en.wikipedia.org/wiki/Barack_Obama
  • Waited for log events to stop arriving
  • ls -lh /tmp/bd808-T215611.log
    • -rw-rw-r-- 1 bd808 wikidev 328K Feb 13 02:22 /tmp/bd808-T215611.log
  • grep -E '^2019-02-13 02:22:' /tmp/bd808-T215611.log | wc -l
    • 856

Approximately 856 log events totaling 328K of formatted messages were output. This is actually less than the output that would be made to the ELK stack, I think, because the Monolog config which routes to mwlog1001 uses human-readable formatting rather than full JSON formatting; the human formatting omits much of the metadata for each log event that the full JSON representation retains. This full debug-level logging is a perverse measurement, because we silence many log channels and discard many messages as too fine-grained in the pipeline that heads to the ELK stack, but it is useful for illustrating the number and size of log events that flow up into the Monolog layer, where we would likely attempt to implement in-app de-duplication.

I can't find the full incident notes, but T88732: Decouple logging infrastructure failures from MediaWiki logging is a partial post-mortem follow up from February 2015 when I tried to implement lossless logging from MediaWiki to ELK using Redis as the transport queue. In that incident we experienced Logstash failing to keep up with the rate of log events sent from MediaWiki which in turn filled the available Redis storage and then finally caused MediaWiki to halt when emitting log events because Redis was blocking new queue entries. The decision at that point was to return to a potentially lossy but non-blocking UDP transport. The lesson I (re)learned at that time is that you really do want your in-process logging pipeline to be as light weight as possible.

My recommendation would be that, if de-duplication or rate limiting is actually what we want, it be done out of process by some intermediate service that filters the log event stream after MediaWiki but before storage. One example of a system to do this is rsyslog's $RepeatedMsgReduction flag, although I do not think that particular implementation would work in the scenario described by this ticket, as the log events will differ from one another due to metadata about the triggering request rather than being completely identical.
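As a sketch of what such an intermediate filter would have to do differently from $RepeatedMsgReduction (which only folds exactly repeated lines): normalize away the variable parts of each message before comparing, so near-identical events collapse onto one key. This is hypothetical Python, not an existing rsyslog or logstash feature:

```python
import re
import time
from collections import OrderedDict

def fingerprint(message):
    """Collapse variable parts (here, just digits, so hostnames like db1114
    and db1115 both become dbN) so near-identical errors share one key.
    A crude heuristic; structured logging would allow keying on fields."""
    return re.sub(r"\d+", "N", message)

class DedupFilter:
    """Hypothetical out-of-process filter: forward the first event per
    fingerprint per window, swallow the rest, and keep a suppressed count
    that could be emitted as a summary when the window closes."""

    def __init__(self, window=60.0, clock=time.monotonic):
        self.window = window
        self.clock = clock
        self.seen = OrderedDict()   # fingerprint -> (first_ts, suppressed)

    def filter(self, message):
        key, now = fingerprint(message), self.clock()
        if key in self.seen and now - self.seen[key][0] < self.window:
            ts, n = self.seen[key]
            self.seen[key] = (ts, n + 1)
            return None                 # suppressed
        self.seen[key] = (now, 0)
        return message                  # forwarded unchanged

filt = DedupFilter(window=60)
out = [filt.filter("Error connecting to db%d: refused" % (1114 + i % 3))
       for i in range(1000)]
forwarded = [m for m in out if m is not None]
# all 1000 messages, across three hosts, collapse to a single forwarded event
```

The hard part in practice is exactly what the comment above notes: choosing a fingerprint that merges the request-specific metadata without also merging genuinely distinct errors.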

Change 490370 had a related patch set uploaded (by Krinkle; owner: Aaron Schulz):
[mediawiki/core@wmf/1.33.0-wmf.17] rdbms: avoid duplicate spammy logging in LoadBalancer::getRandomNonLagged

https://gerrit.wikimedia.org/r/490370

^I don't have enough context for this patch; is the configuration for regular servers set up to not produce DEBUG lines? Otherwise, I don't understand how the patch can produce the desired output.

Change 490119 merged by jenkins-bot:
[mediawiki/core@master] rdbms: avoid duplicate spammy logging in LoadBalancer::getRandomNonLagged

https://gerrit.wikimedia.org/r/490119

Change 490370 merged by jenkins-bot:
[mediawiki/core@wmf/1.33.0-wmf.17] rdbms: avoid duplicate spammy logging in LoadBalancer::getRandomNonLagged

https://gerrit.wikimedia.org/r/490370

Mentioned in SAL (#wikimedia-operations) [2019-02-13T18:00:58Z] <krinkle@deploy1001> Synchronized php-1.33.0-wmf.17/includes/libs/rdbms/loadbalancer/LoadBalancer.php: Id70fdfa62ef / T215611 (duration: 00m 55s)

aaron added a comment. Feb 14 2019, 5:38 PM

^I don't have enough context for this patch; is the configuration for regular servers set up to not produce DEBUG lines? Otherwise, I don't understand how the patch can produce the desired output.

We don't emit DEBUG type events unless explicitly configured to.

herron added a subscriber: herron. Feb 14 2019, 5:58 PM
aaron added a comment (edited). Feb 14 2019, 8:20 PM

My recommendation would be that, if de-duplication or rate limiting is actually what we want, it be done out of process by some intermediate service that filters the log event stream after MediaWiki but before storage. One example of a system to do this is rsyslog's $RepeatedMsgReduction flag, although I do not think that particular implementation would work in the scenario described by this ticket, as the log events will differ from one another due to metadata about the triggering request rather than being completely identical.

I tend to agree with that, if we are trying to check for duplication and so on. When log spam has high variation (e.g. slightly different float numbers in the messages), that makes this trickier.

For something simple and stupid (no uniqueness checking), I wonder about the following:

  • For 1 in every X log writes, add some MergeableUpdate class to the post-send deferred updates, where it would increment per channel/severity APC counters (with short TTLs not extended by apc_incr).
  • If the count reached a threshold, a blacklist key listing all throttled logs and their expiries could be atomically updated (via apc_cas) to include that channel/severity, if it's not already there or is but is near expiration. Logging the change to the blacklist would happen in such cases (for the winning apc_cas thread).
  • To do throttle checks before log writes, the blacklist key could be read once and process-cached for any MW request that has (non-DEBUG) log write calls.
  • Once a throttle is within X seconds of expiring, each MW request will sometimes allow all the log writes for that channel/severity (decided once per request), with increasing probability of doing so as the throttle expiration nears (e.g. vaguely like WANObjectCache's "lowTTL"). If that trips the counter, the list will be updated again as normal.

It could also be done without MW dependencies by using register_shutdown_function and raw APC calls (doing at most one incr per channel/severity used for non-DEBUG stuff).

A per-request upper limit could also be added if we don't have one already (in case of one thread stuck in a stupid loop or something), with logging itself of course (1 such entry per request max).

Not saying we should do this, but just thinking of simple ideas.
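For what it's worth, the counting/throttle-list idea above can be simulated outside PHP. This Python sketch mimics the proposal with a plain dict standing in for APC (all names, thresholds, and TTLs are invented; a real version would need atomicity via apc_cas and would run per server):

```python
import random
import time

apc = {}  # stand-in for APCu: per-server shared cache simulated with a dict

def apc_incr(key, ttl, now):
    """Increment a counter, resetting it once its TTL lapses; increments
    deliberately do not extend the TTL, matching the proposal above."""
    count, expiry = apc.get(key, (0, now + ttl))
    if now >= expiry:
        count, expiry = 0, now + ttl
    apc[key] = (count + 1, expiry)
    return count + 1

def should_log(channel, severity, threshold=1000, window=60.0,
               throttle_ttl=300.0, ramp=30.0, now=None, rng=random.random):
    """Hypothetical gate: count events per channel/severity; past `threshold`
    within `window` seconds, throttle the pair for `throttle_ttl` seconds.
    In the final `ramp` seconds of a throttle, allow events through with a
    probability rising toward 1, so logging resumes gradually (lowTTL-style)."""
    now = time.monotonic() if now is None else now
    tkey = ("throttle", channel, severity)
    if tkey in apc:
        expiry = apc[tkey]
        if now >= expiry:
            del apc[tkey]                  # throttle lapsed
        elif expiry - now < ramp and rng() < 1 - (expiry - now) / ramp:
            pass                           # near expiry: probabilistic pass
        else:
            return False                   # throttled
    if apc_incr(("count", channel, severity), window, now) > threshold:
        apc[tkey] = now + throttle_ttl     # trip the throttle
        return False
    return True

allowed = sum(should_log("DBReplication", "warning", threshold=100, now=0.0)
              for _ in range(10000))
# the first 100 events pass, then the channel/severity pair is throttled
```

A per-request cache of the throttle list (read once per request, as proposed) would drop the per-write cost to a dict lookup in the common case.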

We don't emit DEBUG type events unless explicitly configured to.

Thanks for the explanation, it helped me understand it. Thanks for the patch, too!

Not saying we should do this, but just thinking of simple ideas.

Let me suggest a course of action, with all the ideas proposed: we test the change by creating the same problem in the least impactful way possible (I can do that in my morning). I am guessing it may take a week to get propagated to all of production (correct me if I am wrong)?

If the problems reappear, we keep this open. If the issue is mitigated, we close it as resolved and consider the immediate followup to the incident solved. In a separate ticket or tickets, we can propose additional followups with the changes that each "owner" considers necessary/appropriate, both on the application and on the logging infrastructure. What do you think?

aaron added a comment. Feb 15 2019, 1:48 AM

I think Timo backported the change, so it should be live:

18:01 krinkle@deploy1001: Synchronized php-1.33.0-wmf.17/includes/libs/rdbms/loadbalancer/LoadBalancer.php: Id70fdfa62ef / T215611 (duration: 00m 55s)

A test next week seems fine to me, though since I don't maintain logstash itself, someone more relevant should be looped in on that. Tasks for throttling could be separate if people agree enough to explore some idea there.

Apologies for the late reply -- and thanks to all who have helped in fixing the issue!

Approximately 856 log events totaling 328K of formatted messages were output. This is actually less than the output that would be made to the ELK stack, I think, because the Monolog config which routes to mwlog1001 uses human-readable formatting rather than full JSON formatting; the human formatting omits much of the metadata for each log event that the full JSON representation retains. This full debug-level logging is a perverse measurement, because we silence many log channels and discard many messages as too fine-grained in the pipeline that heads to the ELK stack, but it is useful for illustrating the number and size of log events that flow up into the Monolog layer, where we would likely attempt to implement in-app de-duplication.

Very useful figure even if perverse! Particularly handy as we move udp2log stream to kafka (more on that below).

I can't find the full incident notes, but T88732: Decouple logging infrastructure failures from MediaWiki logging is a partial post-mortem follow up from February 2015 when I tried to implement lossless logging from MediaWiki to ELK using Redis as the transport queue. In that incident we experienced Logstash failing to keep up with the rate of log events sent from MediaWiki which in turn filled the available Redis storage and then finally caused MediaWiki to halt when emitting log events because Redis was blocking new queue entries. The decision at that point was to return to a potentially lossy but non-blocking UDP transport. The lesson I (re)learned at that time is that you really do want your in-process logging pipeline to be as light weight as possible.

Thanks for the additional context! I think there's agreement that MW's logging should be as lightweight as possible since it happens in-request; that's one of the reasons UDP was kept as the logging transport but moved from the network to localhost (and thus much more reliable in theory, since communication is all in-kernel), with rsyslog on the other end.

My recommendation would be that, if de-duplication or rate limiting is actually what we want, it be done out of process by some intermediate service that filters the log event stream after MediaWiki but before storage. One example of a system to do this is rsyslog's $RepeatedMsgReduction flag, although I do not think that particular implementation would work in the scenario described by this ticket, as the log events will differ from one another due to metadata about the triggering request rather than being completely identical.

Indeed, message deduplication would be tough as-is; another possibility would be deduplication based on (one or many) JSON fields in the case of structured logging (I haven't looked at how simple it'd be in rsyslog, nor am I suggesting we actually implement it now).

Another avenue for deduplication would be logstash (e.g. logstash lessons handling duplicates) although that means having the message already persisted in kafka.

WRT rate limiting, I think we have more room to maneuver, but blunter tools, outside MW: namely rsyslog UDP input rate limiting (lossy, of course) and rsyslog output-to-Kafka rate limiting/queueing. Logstash itself also has a "natural" limit, which as we have experienced was the bottleneck in this case and caused the UDP loss.

The rate limit will also play a part as we are onboarding udp2log into the logging pipeline (T205856), of which mw udp2log stream is quite the "firehose" compared to non-udp2log mw volume (about 15x, last I checked). Slightly OT for this task but the plan there is to also move udp2log destination to localhost, write to kafka and consume from mwlog into per-channel files.

Krinkle assigned this task to jcrespo. Feb 19 2019, 9:13 PM
Krinkle moved this task from Inbox to Blocked or Needs-CR on the Performance-Team board.
Krinkle added a subscriber: Krinkle.

@jcrespo I believe the issue is resolved, but leaving it open for you to confirm and/or test as you wish. Feel free to assign back if there's anything else, so I won't miss it :)

Thanks @herron, I would like to know more about what caused the extra logging, but I didn't find it in the incident report; do you know, or know someone who does?

jcrespo removed jcrespo as the assignee of this task. Mar 4 2019, 4:55 PM

I am not working on this.

Thanks @herron, I would like to know more about what caused the extra logging, but I didn't find it in the incident report; do you know, or know someone who does?

@hashar or @Ottomata may be able to explain in more detail the cause of mw log spike

Krinkle closed this task as Resolved. Mar 4 2019, 5:17 PM
Krinkle assigned this task to aaron.

Sadly this bit us again last week. Details outlined in https://wikitech.wikimedia.org/wiki/Incident_documentation/20190228-logstash

The screenshot from Grafana does indicate that starting around 19:40 nearly 90% of MW events were dropped. By itself this is not conclusive that MW is the cause, merely that it is a victim. MediaWiki's INFO messages decreased from ~120K/min to 40K/min, but ERROR messages increased from <100/min to ~12K/min.

Looking at type:mediawiki Logstash with level:ERROR shows:

channel | count
EventBus | 133,483
memcached | 5,473
JobExecutor | 527

count | message
133,480 | Unable to deliver all events: (curl error: 7) Couldn't connect to server

This seems unrelated to the original issue in this task, which was about MediaWiki's core logging behaviour for database issues. I'm closing this for now, and leaving to @Ottomata or @herron to file a task about this incident and/or EventBus' logging.

herron added a comment (edited). Mar 4 2019, 5:49 PM

The screenshot from Grafana does indicate that starting around 19:40 nearly 90% of MW events were dropped. By itself not conclusive that MW is the cause, merely that it is a victim.

From what I see, the order of events was: MW error logs spike, logstash can't keep up and begins queueing logs, then logstash begins dropping logs.

This seems unrelated to the original issue in this task, which was about MediaWiki's core logging behaviour for database issues. I'm closing this for now, and leaving to @Ottomata or @herron to file a task about this incident and/or EventBus' logging.

OK fair enough, and fwiw I agree that the cause of the log spike is not the same, but the effect of "mediawiki errors overloading logstash" does match. Maybe we could clarify the task description in this case.

mmodell changed the subtype of this task from "Task" to "Production Error". Aug 28 2019, 11:07 PM