Maintenance scripts should consistently log errors
Open, Needs Triage, Public

Assigned To

None

Authored By

	Tgr
	May 2 2022, 10:35 PM

Description

Error logging in MediaWiki maintenance scripts is a bit of a mess:

Maintenance scripts call Setup.php relatively late (and even within Setup.php exception handling is set up relatively late - that's T177735: Error handling is set up too late to catch extension registration errors), so any error that's thrown within e.g. Maintenance::setup() does not get handled at all.
Maintenance::execute() is wrapped in a try..catch that prints the error message to stderr and otherwise suppresses the error. The default systemd config sends stderr to /var/log but doesn't seem to report it in a more useful way.
Errors thrown in other parts (e.g. Maintenance::validateParamsAndArgs()) are handled by the top-level exception handler in the usual way (sent to Logstash + written to stdout).

It would probably make more sense to always pass errors to MWExceptionHandler::logException().
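
A minimal sketch of what that could look like, runnable without MediaWiki. SketchExceptionLogger and runMaintenancePhases() are hypothetical names made up for illustration; in MediaWiki the catch block would presumably call MWExceptionHandler::logException() instead, and the sketch assumes logging is already usable when the error is thrown, which is exactly what is not guaranteed for errors before Setup.php has run.

```php
<?php
// Hypothetical stand-in for MWExceptionHandler::logException(). It records
// entries in memory so that the sketch runs without MediaWiki.
final class SketchExceptionLogger {
	/** @var string[] */
	public static array $entries = [];

	public static function logException( Throwable $e ): void {
		self::$entries[] = get_class( $e ) . ': ' . $e->getMessage();
	}
}

// Hypothetical runner: one try/catch around every phase, so an error in
// setup, validation, or execution takes the same logging path.
function runMaintenancePhases( array $phases ): int {
	try {
		foreach ( $phases as $phase ) {
			$phase();
		}
		return 0;
	} catch ( Throwable $e ) {
		// Log first, so the error reaches the central log channel.
		SketchExceptionLogger::logException( $e );
		// Keep the message on stderr for whoever runs the script by hand.
		fwrite( STDERR, $e->getMessage() . "\n" );
		return 1;
	}
}

$exitCode = runMaintenancePhases( [
	'setup' => static function (): void {
		// Nothing fails here in this example.
	},
	'execute' => static function (): void {
		throw new RuntimeException( 'service returned HTTP 400' );
	},
] );
```

The point of the single catch block is that the error is both logged and printed, instead of being printed in one phase and logged in another. The non-zero return value also lets the caller (systemd, cron) see the failure.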

Related Objects

Mentioned In: T370560: Make failures from refreshLinkRecommendation job visible in Logstash
T308186: Support long section exclusion lists for link recommendations
Mentioned Here: T308186: Support long section exclusion lists for link recommendations
T282761: purgeParserCache.php should not take over 24 hours for its daily run
T285896: Ingest logs from scheduled maintenance scripts at WMF in Logstash
T99268: RfC: Create a proper command-line runner for MediaWiki maintenance tasks
T306535: Mentor dashboard stopped updating
T177735: Error handling is set up too late to catch extension registration errors

Event Timeline

Tgr created this task. May 2 2022, 10:35 PM

Restricted Application added a subscriber: Aklapper. May 2 2022, 10:35 PM

Tgr updated the task description. May 2 2022, 10:35 PM

Adding RelEng for feedback as they are the main consumers of error logs. Also @Krinkle who is probably the most active in maintaining MediaWiki logging code.

Trying to deal with Setup.php being called too late seems bothersome, so it is maybe better left to be fixed by some larger-scale refactoring of the maintenance system (such as T99268: RfC: Create a proper command-line runner for MediaWiki maintenance tasks). Logging caught exceptions, either to the production error channel or to a custom channel, would be trivial though. Do you have any concerns about doing that?

(FWIW the immediate motivation for this task was T306535: Mentor dashboard stopped updating, where the effect of the maintenance script failing was sufficiently user-visible that it got reported, but in general not surfacing maintenance script errors better seems like a poor state of affairs.)

Thanks for raising this. We ran into a similar issue last year as part of the ParserCache epic (T282761).

My take on this is T285896: Ingest logs from scheduled maintenance scripts at WMF in Logstash

In a nutshell: I would keep error handling as-is, and recognise that for CLI scripts there are any number of early or rare fatal error scenarios possible that we may not be able to capture at runtime, and instead ensure that the standard error stream of these processes is collected in Logstash. We could even try to tee a subset of those into a type:mediawiki feed for easy discovery in existing workflows and queries, in relation to owning/maintaining MediaWiki components in production.

kostajh subscribed. May 6 2022, 12:09 PM

Tgr mentioned this in T308186: Support long section exclusion lists for link recommendations. May 12 2022, 1:49 PM

T308186 was another example of being bitten by this: the maintenance script was failing on some wikis only (some i18n-related string was too long and resulted in a service request URL exceeding length limits and 400-ing), and we only noticed after a few days that the feature wasn't working as expected.

thcipriani edited projects, added Release-Engineering-Team (Radar); removed Release-Engineering-Team. Dec 6 2022, 10:19 PM

Tgr mentioned this in T370560: Make failures from refreshLinkRecommendation job visible in Logstash. Aug 5 2024, 11:02 AM

Michael subscribed. Sep 10 2024, 8:26 AM
