
Migrate MediaWiki.errors.fatal to statslib
Open, Medium, Public

Description

Follow the migration process as outlined below.

Secure/Conduct code review(s).
Deploy the changes to production via the train (https://wikitech.wikimedia.org/wiki/Deployments/Train).
Verify that the changes have been successfully implemented.
Update the relevant dashboard(s) by replacing the old Graphite metric(s) with the new Prometheus metric(s).
Please follow the guidelines and standards outlined in the provided documentation:

https://www.mediawiki.org/wiki/Manual:Stats for detailed guidance on the conversion process.
https://drive.google.com/file/d/12yQEuOapkML1vb9MgCaX1QzbLBdXE6X2/view for a video tutorial on the conversion process.
https://docs.google.com/presentation/d/1SZWf_D3mWNX-XHN8PHYI84LDZr6GUQC2AMhZ9mQXCI0/edit#slide=id.g2795460c956_0_23 for slides on the best practices for converting metrics to statslib.

  • MediaWiki.errors.fatal

Event Timeline

colewhite changed the task status from Open to In Progress.Apr 4 2024, 8:47 PM
colewhite claimed this task.
colewhite triaged this task as Medium priority.

Change #1017078 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] wmerrors: add config and code to copy stats to dogstatsd

https://gerrit.wikimedia.org/r/1017078

Change #1017078 merged by Cwhite:

[operations/puppet@production] wmerrors: add config and code to copy stats to dogstatsd

https://gerrit.wikimedia.org/r/1017078

Change #1049625 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] mediawiki: enable forward of fatal metrics to statsd exporter

https://gerrit.wikimedia.org/r/1049625

Aklapper changed the task status from In Progress to Open.Apr 11 2025, 10:20 PM

Resetting task status from "In Progress" to "Open" as this task has been "in progress" for more than one year (see T380300). Feel free to set that status again, or rather break down into smaller subtasks.

Krinkle subscribed.

I noticed today that the "Fatal error page" impression rate at https://grafana-rw.wikimedia.org/d/000000438/mediawiki-exceptions-alerts was suspiciously similar to the total "Logged exceptions" rate from Logstash elsewhere in that dashboard.

That's unusual, because a lot of the exceptions we log to Logstash are caught errors, post-send errors, CLI errors, and other errors that can't/don't result in a Fatal error web page being shown to a user. Only web requests can do so, and only for uncaught errors, and specifically uncaught errors during the response (as opposed to post-send).

Turns out they are similar, because they are plotting the same thing. The "Fatal error page served via php-wmerrors" panel was instead plotting the Logstash doc count for logged exceptions, which is already plotted on the dashboard. It was also plotting a rate per second, whilst the legend claims it is a rate per minute.
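To illustrate the unit mismatch between the plotted series and its legend, here is a minimal sketch of how a per-second and a per-minute rate relate for the same counter. The function names and sample numbers are hypothetical and are not taken from the dashboard or from Logstash.

```python
def per_second_rate(count_start, count_end, window_seconds):
    """Average increase per second of a monotonically increasing counter."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    return (count_end - count_start) / window_seconds


def per_minute_rate(count_start, count_end, window_seconds):
    """Same increase expressed per minute: the per-second rate times 60."""
    return per_second_rate(count_start, count_end, window_seconds) * 60


# Hypothetical sample: the counter goes from 100 to 700 over a 5-minute window.
per_sec = per_second_rate(100, 700, 300)
per_min = per_minute_rate(100, 700, 300)
print(per_sec, per_min)  # prints "2.0 120.0"
```

A panel that plots the first value under a "per minute" legend under-reports by a factor of 60.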

Screenshot 2025-09-09 at 01.31.03.png (654×2 px, 204 KB)

Screenshot 2025-09-09 at 01.31.25.png (1×2 px, 226 KB)

When I remove this and plot the original statsd/Graphite metric instead, the data of course stops in April 2025, when Graphite went read-only.

Screenshot 2025-09-09 at 01.33.21.png (1×1 px, 172 KB)

A dogstatsd/Prometheus equivalent was added and merged in https://gerrit.wikimedia.org/r/c/operations/puppet/+/1017078. But... trying to plot mediawiki_fatal_errors_total yields nothing. At least, not anymore.

Screenshot 2025-09-09 at 01.34.40.png (1×2 px, 175 KB)
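For context when checking why the series is empty, a counter increment in the DogStatsD wire format generally looks like `name:value|c|#tag:value`. The sketch below only formats such a line. The helper name, the on-the-wire metric name, and the tag are assumptions for illustration, not what the wmerrors patch actually emits.

```python
def dogstatsd_counter(name, value=1, tags=None):
    """Format a DogStatsD counter datagram: <name>:<value>|c[|#k:v,k:v]."""
    line = f"{name}:{value}|c"
    if tags:
        # Sort tags so the output is deterministic.
        tag_str = ",".join(f"{k}:{v}" for k, v in sorted(tags.items()))
        line += f"|#{tag_str}"
    return line


# Hypothetical metric name and tag, for illustration only.
print(dogstatsd_counter("mediawiki.fatal_errors", 1, {"wiki": "examplewiki"}))
```

Whether such a line ever shows up in Prometheus depends on something receiving it and exporting it, which is the part that appears to be missing here.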

Change #1049625 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] mediawiki: enable forward of fatal metrics to statsd exporter

https://gerrit.wikimedia.org/r/1049625

I'm guessing this patch needs to be finished/deployed for it to work in mw-on-k8s?