Maintenance scripts should consistently log errors
Open, Needs Triage, Public

Description

Error logging in MediaWiki maintenance scripts is a bit of a mess:

  • Maintenance scripts call Setup.php relatively late (and even within Setup.php exception handling is set up relatively late - that's T177735: Error handling is set up too late to catch extension registration errors), so any error that's thrown within e.g. Maintenance::setup() does not get handled at all.
  • Maintenance::execute() is wrapped in a try..catch that prints the error message to stderr and otherwise suppresses the error. The default systemd config sends stderr to /var/log but doesn't seem to report it in a more useful way.
  • Errors thrown in other parts (e.g. Maintenance::validateParamsAndArgs()) are handled by the top-level exception handler in the usual way (sent to Logstash + written to stdout).

It would probably make more sense to always pass errors to MWExceptionHandler::logException().
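As a rough, language-neutral illustration of that idea (not MediaWiki code), the sketch below wraps a script's entry point so that a caught exception is always passed to a logger before the message is printed to stderr. The `run_script` helper and the `maintenance` channel name are hypothetical; in MediaWiki the logging call would presumably be MWExceptionHandler::logException().

```python
import logging
import sys

# Hypothetical log channel; stands in for the production error channel.
log = logging.getLogger("maintenance")


def run_script(execute, stderr=sys.stderr):
    """Run a maintenance-style callable and return a process exit code.

    Any exception is logged first, then reported on stderr, so the
    failure is not visible only in the console output.
    """
    try:
        execute()
    except Exception as exc:
        # Always record the exception (with traceback) in the log channel.
        log.error("maintenance script failed", exc_info=exc)
        # Still tell the operator what happened on stderr.
        print(f"{type(exc).__name__}: {exc}", file=stderr)
        return 1
    return 0
```

A caller would do something like `sys.exit(run_script(main))`, so the exit code also reflects the failure.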

Event Timeline

Tgr added a subscriber: Krinkle.

Adding RelEng for feedback as they are the main consumers of error logs. Also @Krinkle who is probably the most active in maintaining MediaWiki logging code.

Trying to deal with Setup.php being called too late seems bothersome, so that is maybe better left to some larger-scale refactoring of the maintenance system (such as T99268: RfC: Create a proper command-line runner for MediaWiki maintenance tasks). Logging caught exceptions, either to the production error channel or to a custom channel, would be trivial though. Do you have any concerns about doing that?

(FWIW the immediate motivation for this task was T306535: Mentor dashboard stopped updating, where the effect of the maintenance script failing was sufficiently user-visible that it got reported, but in general not surfacing maintenance script errors better seems like a poor state of affairs.)

Thanks for raising this. We ran into a similar issue last year as part of the ParserCache epic (T282761).

My take on this is T285896: Ingest logs from scheduled maintenance scripts at WMF in Logstash

In a nutshell: I would keep error handling as-is and recognise that for CLI scripts there are any number of early or rare fatal error scenarios that we may not be able to capture at runtime. Instead, I would ensure that the standard error stream of these processes is collected in Logstash. We could even try to tee a subset of those into a type:mediawiki feed for easy discovery in existing workflows and queries, in relation to owning/maintaining MediaWiki components in production.
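To make the "collect stderr from outside the process" idea concrete, here is a minimal sketch of a wrapper that runs a command, passes its stderr through unchanged, and forwards each line to a logger. It only illustrates the tee pattern; the `run_and_collect` helper and the `maintenance.stderr` channel name are hypothetical, and nothing here reflects how systemd or Logstash ingestion is actually configured at WMF.

```python
import logging
import subprocess
import sys

# Hypothetical log channel; stands in for a feed shipped to a log collector.
log = logging.getLogger("maintenance.stderr")


def run_and_collect(cmd):
    """Run cmd, echo its stderr, and forward every stderr line to the log.

    Because the collection happens outside the child process, it also
    captures early fatal errors that the script cannot handle itself.
    """
    proc = subprocess.Popen(cmd, stderr=subprocess.PIPE, text=True)
    for line in proc.stderr:
        sys.stderr.write(line)               # keep the original stream (the "tee")
        log.error("%s", line.rstrip("\n"))   # copy the line to the log channel
    return proc.wait()
```

The wrapper returns the child's exit code, so a scheduler can still tell success from failure.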

T308186 was another example of being bitten by this: the maintenance script was failing on some wikis only (some i18n-related string was too long and resulted in a service request URL exceeding length limits and 400-ing), and we only noticed after a few days that the feature didn't seem to be working as expected.