
Tracking: Monitoring and alerts for "business" metrics
Open, High · Public

Description

T119736 showed that we sometimes fail to recognize the severity of bugs that have a substantial impact on users. To make sure nothing slips through the cracks, we should have monitoring and alerting for a small set of key "business" metrics. Namely:

  • Logins, Sign-ups, account creation - T146090
  • Edits
  • Exceptions / fatals
  • MediaWiki load time - T146125

Event Timeline

ori created this task. Jul 20 2016, 7:30 PM
Restricted Application added subscribers: Zppix, Aklapper. Jul 20 2016, 7:30 PM
greg added a subscriber: greg. Jul 20 2016, 7:35 PM

There are some graphite metrics for authn things: https://grafana.wikimedia.org/dashboard/db/authentication-metrics

One thing we noticed about login failures, especially during the SessionManager deployment, is that they can skew upwards to crazy levels because of a single bot retrying to log in in an infinite loop. I think the rate of login success is a bit more stable.

Tgr added a comment. Jul 20 2016, 7:58 PM

> One thing we noticed about login failures, especially during the SessionManager deployment, is that they can skew upwards to crazy levels because of a single bot retrying to log in in an infinite loop. I think the rate of login success is a bit more stable.

There is a filter for login interface (web / API). There is no guarantee some bot wouldn't use the web interface, but in practice filtering out API queries gives pretty predictable results.

In the case of T119736 the existing metrics would not have been helpful, because an exception in the middle of the login process is not counted as a failure. Filed that as T140943. (As for successful logins, there wasn't any visible dip. I suspect people tried to log in several times, so 1K login failures per day translates to far fewer than 1K missing successful logins per day, and so isn't really visible over the background of ~10K successful logins.)
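To illustrate the interface filter mentioned above, here is a minimal sketch of how one looping bot on the API can dominate the overall failure rate while the web-only rate stays readable. The event records, field values, and counts are invented for illustration and are not the actual schema behind the dashboard.

```python
from collections import Counter

# Hypothetical login events as (interface, outcome) pairs. These values are
# assumptions for illustration, not real production data.
events = (
    [("web", "success")] * 90
    + [("web", "failure")] * 10
    + [("api", "success")] * 5
    + [("api", "failure")] * 5000  # a single bot retrying in a loop
)


def failure_rate(events, interface=None):
    """Return failures / attempts, optionally restricted to one interface."""
    counts = Counter(
        outcome
        for iface, outcome in events
        if interface is None or iface == interface
    )
    attempts = counts["success"] + counts["failure"]
    return counts["failure"] / attempts if attempts else 0.0


overall = failure_rate(events)
web_only = failure_rate(events, interface="web")
print(f"overall: {overall:.3f}, web only: {web_only:.3f}")
```

With these made-up numbers the overall rate is driven almost entirely by the bot, which is why filtering to the web interface gives a steadier signal to alert on.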

Tgr added a comment. Jul 20 2016, 8:00 PM

FWIW I think the obvious thing to alert on in this case would have been the thousands of exceptions per day in production.
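A rough sketch of what alerting on exception volume could look like: count exceptions in a trailing window and compare against a threshold. The window length, the threshold of 50, and the synthetic timestamps are all assumptions, not values taken from production or from the patch linked in this task.

```python
from datetime import datetime, timedelta


def exceptions_in_window(timestamps, now, window):
    """Count exception timestamps in the half-open window (now - window, now]."""
    start = now - window
    return sum(1 for ts in timestamps if start < ts <= now)


def should_alert(timestamps, now, window=timedelta(hours=1), threshold=50):
    """Return True when the trailing-window count exceeds the (assumed) threshold."""
    return exceptions_in_window(timestamps, now, window) > threshold


now = datetime(2016, 7, 20, 20, 0)
# Synthetic data: one exception per minute over the past two hours.
stamps = [now - timedelta(minutes=m) for m in range(120)]
print(should_alert(stamps, now))
```

A fixed threshold is the simplest choice; a real rule would probably need a baseline per wiki or per error type to avoid constant noise.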

Tgr added a comment. Jul 20 2016, 8:10 PM

It seems like T117470 would have helped as well (compare frequency of T119736 and frequency of CAS errors).

Gehel triaged this task as Low priority. Jul 21 2016, 8:52 AM
Gehel added a subscriber: Gehel.

Triaging this as low priority to match T117470.

ori raised the priority of this task from Low to High. Jul 21 2016, 6:12 PM

> Triaging this as low priority to match T117470.

No, this should definitely have a higher priority.

Change 300327 had a related patch set uploaded (by Ori.livneh):
Add alerting for MediaWiki exceptions and fatals

https://gerrit.wikimedia.org/r/300327

Change 300327 merged by Ori.livneh:
Add alerting for MediaWiki exceptions and fatals

https://gerrit.wikimedia.org/r/300327

ori updated the task description. Jul 21 2016, 8:38 PM
Tgr added a comment. Jul 21 2016, 9:08 PM

Thinking about this more, I'm not sure login/signup metrics are worth the effort. One of the strengths of Wikimedia is the strong connection between developers and power users, so universal problems like login breaking completely would be reported very quickly. And less universal problems would not be detected, e.g. T119736 is not visible at all in the login stats.

What might be more helpful is marking certain pages or code sections as mission critical and alerting when exceptions are thrown from there.

> Thinking about this more, not sure if login/signup metrics are worth the effort. One of the strengths of Wikimedia is the strong connections between developers and power users so universal problems like login breaking completely would be reported very quickly. And less universal problems would not be detected, e.g. T119736 is not visible at all in the login stats.
> What might be more helpful is marking certain pages or code sections as mission critical and alerting when exceptions are thrown from there.

Paging people if there are exceptions thrown on Special:Login or Special:Preferences might be worthwhile, yes.
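As a sketch of the "mission critical pages" idea, something like the following could pick out exceptions that warrant a page. The two titles are the ones named in this thread; the record format and its "page" key are hypothetical and do not reflect how MediaWiki actually logs exceptions.

```python
# Pages treated as mission critical. The titles are taken as written in the
# thread; everything else here is an assumption for illustration.
CRITICAL_PAGES = {"Special:Login", "Special:Preferences"}


def pages_to_alert_on(exception_records, critical=CRITICAL_PAGES):
    """Return the sorted critical page titles that appear in the records.

    Each record is a dict with a hypothetical "page" key naming the page
    that was being served when the exception was thrown.
    """
    return sorted(
        {r["page"] for r in exception_records if r.get("page") in critical}
    )


# Synthetic sample records.
records = [
    {"page": "Main_Page"},
    {"page": "Special:Preferences"},
    {"page": "Special:Preferences"},
    {},  # a record with no page information is ignored
]
print(pages_to_alert_on(records))
```

Keeping the critical set explicit and small is the point: any exception from those pages pages someone, while the rest only feeds the aggregate counts.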

Tgr updated the task description. Sep 16 2016, 9:40 PM

This is really a follow-up item from a Wikimedia incident.

hashar added a subscriber: hashar.

Account creation was broken entirely for 18 hours last week even though metrics were available. I have filed T146090 to get it to send an alarm, a page, or whatever is appropriate.
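For a metric that flatlines like this, a minimal sketch of a check would be to alert once the count has stayed at zero for more than a couple of consecutive intervals. The hourly bucketing, the limit of two hours, and the sample counts are assumptions; they are not what T146090 specifies.

```python
def trailing_zero_run(counts):
    """Return how many of the most recent consecutive counts are zero."""
    run = 0
    for c in reversed(counts):
        if c != 0:
            break
        run += 1
    return run


def flatline_alert(hourly_counts, max_zero_hours=2):
    """Return True when the metric has been zero for more than max_zero_hours."""
    return trailing_zero_run(hourly_counts) > max_zero_hours


# Synthetic hourly account-creation counts: normal traffic, then three empty hours.
hourly = [40, 35, 42, 0, 0, 0]
print(trailing_zero_run(hourly), flatline_alert(hourly))
```

Evaluated every hour, a check like this would fire a few hours into an outage rather than after 18. It only suits metrics whose normal hourly value is reliably above zero.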

greg updated the task description. Sep 19 2016, 10:40 PM
hashar updated the task description. Sep 20 2016, 8:13 AM
hashar renamed this task from Monitoring and alerts for "business" metrics to Tracking: Monitoring and alerts for "business" metrics. Sep 20 2016, 8:33 AM
hashar added a project: Tracking-Neverending.
hashar updated the task description.

The alert feature introduced in the recent Grafana update (T152473) could be of interest for this. To quote some marketing copy from http://grafana.org/blog/2016/12/12/grafana-4.0-stable-release/ : "Alerting is a really revolutionary feature for Grafana. It transforms Grafana from a visualization tool into a truly mission critical monitoring tool. The alert rules are very easy to configure using your existing graph panels and threshold levels can be set simply by dragging handles to the right side of the graph. The rules will continually be evaluated by grafana-server and notifications will be sent out [e.g. via email] when the rule conditions are met." I understand the Performance team has started to try this out already.

Peter added a subscriber: Peter. Mar 2 2017, 2:12 PM
Peter added a comment. Mar 2 2017, 3:28 PM

I've been trying out alerts for a while; let me write up a summary in the coming days.

hashar removed a subscriber: hashar. Oct 16 2017, 11:50 AM