Implement "WMF Log Errors count" KPI
Closed, Resolved | Public

Description

Definition: A measure of the rate of WMF Log Errors in production.

  • Need to clearly define the source of the data (e.g., the sampling rate of group0)

Goal: a downward trend, not an upward one. Absolute numbers TBD.

Purpose: a noisy log is not a useful log. We want to be able to trust our logs and quickly revert on sight if something starts complaining in them.

Event Timeline

greg created this task. (Aug 11 2015, 9:43 PM)
greg raised the priority of this task to Medium.
greg updated the task description.
greg added a project: Release-Engineering-Team.
greg added subscribers: greg, Aklapper, demon, mmodell.
greg moved this task from INBOX to Backlog on the Release-Engineering-Team board. (Aug 11 2015, 9:58 PM)
greg added a comment. (Sep 2 2015, 12:22 AM)

Related but not the point of this KPI: Open tasks in the Wikimedia-production-error project over time:
https://phabricator.wikimedia.org/maniphest/report/burn/?project=PHID-PROJ-4uc7r7pdosfsk55qg7f6

(Unfortunately, I can't limit that query to things NOT in the "no prod impact" column.)

greg moved this task from Backlog to Next on the Release-Engineering-Team board. (Sep 24 2015, 1:24 AM)
greg assigned this task to demon. (Oct 7 2015, 5:24 PM)
demon closed this task as Resolved. (Oct 20 2015, 7:46 PM)

A little bit of explanation about what stats we're tracking here and what the goals are for them:

  • MW error logs by severity - Ideally anything WARNING/FATAL/ERROR level approaches zero (see T115630). INFO/DEBUG level stats are probably mostly ok, although we should investigate some of the louder ones to see if they're truly useful and being looked at. If not, we're just logging extra bytes.
  • HHVM error logs by severity - All should approach zero.
  • Apache syslogs - These are all ERROR level, so should be dealt with so this graph reads zero. Will likely require some Operations help.
  • OOM - We don't have a lot of these, which is good. We want to keep it that way. Anything appearing on this graph is a bad day really and needs to be tracked down. The graph can likely be tweaked further to make sure the very few OOMs we do have actually show up.
  • MW logs by channel - Mostly a useful table to see the breakdown of where the MW logs are coming from. Won't ever really be zero, not really a part of the KPI either.
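
To make the "approaches zero" and "downward trend" goals concrete, here is a minimal sketch of how such counts could be checked. It assumes log records are available as `(day, severity)` tuples, which is an illustrative shape and not the real log schema; the function names are hypothetical and not part of any dashboard or MediaWiki tooling.

```python
from collections import Counter

# Severities that the list above says should approach zero.
ERROR_LEVELS = {"WARNING", "ERROR", "FATAL"}


def count_by_severity(records):
    """Count log records per severity level.

    `records` is a list of (day, severity) tuples. This shape is an
    assumption for illustration, not the real log schema.
    """
    return Counter(severity for _day, severity in records)


def daily_error_counts(records):
    """Return error-level counts per day, ordered by day.

    Days that only have INFO/DEBUG records are kept and count as zero.
    """
    days = sorted({day for day, _severity in records})
    per_day = Counter(day for day, severity in records if severity in ERROR_LEVELS)
    return [per_day[day] for day in days]


def trend_slope(counts):
    """Least-squares slope of counts against their index (0, 1, 2, ...).

    A negative slope means the counts are trending downward.
    """
    n = len(counts)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(counts) / n
    num = sum((i - mean_x) * (c - mean_y) for i, c in enumerate(counts))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den


if __name__ == "__main__":
    # Made-up sample data, purely for demonstration.
    sample = [
        ("2015-10-01", "ERROR"),
        ("2015-10-01", "WARNING"),
        ("2015-10-02", "ERROR"),
        ("2015-10-03", "INFO"),
    ]
    print(count_by_severity(sample))
    print(trend_slope(daily_error_counts(sample)) < 0)
```

A plain slope is only one possible reading of "downward trend"; since absolute targets are still TBD, a threshold per severity could be layered on top later.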

I have edited the releng main board https://grafana.wikimedia.org/dashboard/db/releng-main-page to list all dashboards tagged releng. Such a dashboard list is a feature of Grafana 2.x, which we recently got installed.

The production logging one now shows up at https://grafana.wikimedia.org/dashboard/db/releng-main-page

I haven't done it on the boards I created, but one can add a text block to serve as an introduction, i.e. copy-paste your previous comment :-}

I am very happy to see those graphs well done!