
Implement "WMF Log Errors count" KPI
Closed, Resolved (Public)

Description

Definition: A measure of the rate of WMF Log Errors in production.

  • Need to clearly define the source of the data (e.g. the sampling rate of group0)

Goal: a downward trend, not upward... absolute numbers TBD

Purpose: a noisy log is not a useful log. We want to be able to trust our logs and quickly revert on sight if something starts complaining in them.
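
One way to check the "downward trend" goal, as a minimal sketch only: fit a least-squares line through a series of per-day error counts and look at the sign of the slope. The function name `trend_slope` and the sample numbers are made up for illustration; the real data source and sampling are still to be defined, as noted above.

```python
def trend_slope(counts):
    """Least-squares slope of counts against their index (0, 1, 2, ...).

    A negative result means the series trends downward.
    """
    n = len(counts)
    if n < 2:
        raise ValueError("need at least two data points")
    mean_x = (n - 1) / 2
    mean_y = sum(counts) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(counts))
    var = sum((x - mean_x) ** 2 for x in range(n))
    return cov / var


# Hypothetical daily error counts, not real production data.
daily_errors = [10, 8, 6, 4]
print("downward" if trend_slope(daily_errors) < 0 else "not downward")
```

A slope alone says nothing about absolute numbers, which matches the "absolute numbers TBD" caveat in the goal.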

Event Timeline

greg raised the priority of this task to Medium.
greg updated the task description.
greg added a project: Release-Engineering-Team.
greg added subscribers: greg, Aklapper, demon, mmodell.

Related but not the point of this KPI: Open tasks in the Wikimedia-production-error project over time:
https://phabricator.wikimedia.org/maniphest/report/burn/?project=PHID-PROJ-4uc7r7pdosfsk55qg7f6

(unfortunately I can't limit that query to things NOT in the "no prod impact" column....)

A little bit of explanation about what stats we're tracking here and what the goals are for them:

  • MW error logs by severity - Ideally anything WARNING/FATAL/ERROR level approaches zero (see T115630). INFO/DEBUG level stats are probably mostly ok, although we should investigate some of the louder ones to see if they're truly useful and being looked at. If not, we're just logging extra bytes.
  • HHVM error logs by severity - All should approach zero.
  • Apache syslogs - These are all ERROR level, so should be dealt with so this graph reads zero. Will likely require some SRE help.
  • OOM - We don't have a lot of these, which is good. We want to keep it that way. Anything appearing on this graph is a bad day really and needs to be tracked down. The graph can likely be tweaked further to make sure the very few OOMs we do have actually show up.
  • MW logs by channel - Mostly a useful table to see the breakdown of where the MW logs are coming from. Won't ever really be zero, not really a part of the KPI either.
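
To make the "by severity" breakdown above concrete, here is a rough sketch of tallying log records per level and summing only the levels that should approach zero. The record shape (a dict with a `level` key) and the helper names are assumptions for illustration, not how the dashboards are actually fed.

```python
from collections import Counter

# Levels the list above says should approach zero for the MW error logs.
KPI_LEVELS = {"WARNING", "ERROR", "FATAL"}


def count_by_severity(records):
    """Tally records by upper-cased severity level."""
    return Counter(str(r.get("level", "UNKNOWN")).upper() for r in records)


def kpi_error_count(counts):
    """Sum only the levels that count toward the KPI."""
    return sum(n for level, n in counts.items() if level in KPI_LEVELS)


# Hypothetical sample records, not real log data.
sample = [
    {"level": "ERROR", "channel": "exception"},
    {"level": "warning", "channel": "api"},
    {"level": "INFO", "channel": "api"},
    {"level": "DEBUG", "channel": "session"},
]
counts = count_by_severity(sample)
print(dict(counts), kpi_error_count(counts))
```

INFO and DEBUG are still counted in the tally, so the louder ones can be reviewed, but they are left out of the KPI sum.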

I have edited the releng main board https://grafana.wikimedia.org/dashboard/db/releng-main-page to list all dashboards tagged releng. This kind of dashboard list is a feature of Grafana 2.x, which we recently got installed.

The production logging one now shows up at https://grafana.wikimedia.org/dashboard/db/releng-main-page

I haven't done it on the boards I created, but one can add a text block to serve as an introduction. I.e. copy and paste your previous comment :-}

I am very happy to see those graphs well done!