Logstash dashboard mediawiki-errors lacks "error"-level messages from diagnostic channels
Closed, Declined (Public)

Description

https://logstash.wikimedia.org/app/dashboards#/view/mediawiki-errors has a filter for error channels.

Field            Value
type             mediawiki
channel.keyword  exception error jsonTruncated

That excludes ERROR-level messages from other channels, and exceptions that are in the exception channel but otherwise lack a level. The dashboard thus reflects only a partial view of all errors.

I found that out fairly recently, and removing the current filter is how I caught ExtensionDistributor throwing errors (T340483). The reported issue had:

Field    Value
channel  ExtensionDistributor
level    ERROR
type     mediawiki

A filter letting through any ERROR-level message (regardless of channel) and any message sent to the exception channel (regardless of level) would have shown the issue.
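To make the difference concrete, here is a minimal sketch in Python of the two filters as predicates over a log record. The field names and values come from the tables above; the function names are hypothetical, and the actual dashboard filters are defined in Logstash/Kibana, not in Python.

```python
# Channels accepted by the current mediawiki-errors filter
# (taken from the channel.keyword row in the table above).
ERROR_CHANNELS = {"exception", "error", "jsonTruncated"}


def current_filter(record: dict) -> bool:
    """Hypothetical model of the existing filter: match on channel only."""
    return (
        record.get("type") == "mediawiki"
        and record.get("channel") in ERROR_CHANNELS
    )


def proposed_filter(record: dict) -> bool:
    """Hypothetical model of the suggested filter: any ERROR-level message,
    or any message in the exception channel regardless of level."""
    if record.get("type") != "mediawiki":
        return False
    is_error_level = record.get("level") == "ERROR"
    in_exception_channel = record.get("channel") == "exception"
    return is_error_level or in_exception_channel


# The record from the reported ExtensionDistributor issue.
reported = {
    "type": "mediawiki",
    "channel": "ExtensionDistributor",
    "level": "ERROR",
}

print(current_filter(reported), proposed_filter(reported))
```

In this model the reported record is rejected by the current filter and accepted by the proposed one.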

Event Timeline

This is intentional. The way it is meant to work is that uncaught errors, or errors that otherwise mean the request/process did not succeed in what it wanted to do (i.e. relevant operationally in terms of service health), go to the error channel.

Diagnostic messages of interest to the maintainers, or useful when looking for early signs or secondary fallout, are logged to the channels of individual components. These have access to a range of levels (we generally use only debug, info, warning, and error) but are all confined to something that isn't relevant at the high level.

If a component is logging a fatal error to a diagnostic channel, that is a bug with that component.

The alternative suggested, which is to query diagnostic messages from any channel that may locally use the "error" level, would most likely mean the signal can no longer be used operationally. E.g. it would no longer be a reliable signal for alerts (https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts) or for Scap to validate deployments.

Perhaps we could change only the dashboard, and leave the alert and Scap signal unchanged. But then the dashboard would no longer be useful when responding to alerts, aborted deploys, or doing train triage.

If you're looking for other messages, we have these Logstash dashboards:

  1. Logstash: mediawiki, any channel, any level.
  2. Logstash: mediawiki-warnings, any channel, ignore debug/info level.
  3. Logstash: mediawiki-errors, only the "error" channel (PHP errors and fatal MediaWiki exceptions).
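As a rough illustration of how the three dashboards relate, here is a sketch in Python that lists which dashboards a given record would appear on. The function name is hypothetical, the error-channel set is taken from the filter table in the task description, and the case-insensitive level comparison is an assumption rather than something the dashboards are known to do.

```python
# Channel set from the mediawiki-errors filter table in the task description.
ERROR_CHANNELS = {"exception", "error", "jsonTruncated"}
# Levels that the mediawiki-warnings dashboard ignores.
LOW_LEVELS = {"debug", "info"}


def dashboards_showing(record: dict) -> list[str]:
    """Hypothetical helper: names of dashboards that would show this record."""
    if record.get("type") != "mediawiki":
        return []
    # 1. mediawiki: any channel, any level.
    shown = ["mediawiki"]
    # 2. mediawiki-warnings: any channel, ignoring debug/info.
    #    Assumption: levels are compared case-insensitively.
    level = str(record.get("level", "")).lower()
    if level not in LOW_LEVELS:
        shown.append("mediawiki-warnings")
    # 3. mediawiki-errors: only the operational error channels.
    if record.get("channel") in ERROR_CHANNELS:
        shown.append("mediawiki-errors")
    return shown


diagnostic = {
    "type": "mediawiki",
    "channel": "ExtensionDistributor",
    "level": "ERROR",
}
print(dashboards_showing(diagnostic))
```

Under these assumptions, an ERROR-level message on a component channel such as ExtensionDistributor shows up on the first two dashboards but not on mediawiki-errors.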

So if you're intentionally looking for diagnostic messages, the first two would make a better starting point.

Krinkle renamed this task from Kibana dashboard mediawiki-errors lacks channel errors and exceptions to Logstash dashboard mediawiki-errors lacks "error"-level messages from diagnostic channels. May 29 2025, 5:01 PM