Better monitoring and error reporting of Errors and Exceptions
Closed, Resolved, Public

Description

As stated by Ori after a nasty account creation bug (bug 49727):

"Errors and exceptions are currently broadcast to fluorine and vanadium via
UDP. I have code that parses the stream and generates the Ganglia graphs,
but it isn't hooked up to Icinga or any other form of monitoring. Would
anyone from ops want to pair up with me on this?"

Let's do this.
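
As a rough illustration of what "hooked up to Icinga" could mean, a check might turn an error count into a plugin status. This is only a sketch: the function name and the thresholds are hypothetical, and the one thing taken as given is the usual Nagios/Icinga plugin convention of exit codes 0 (OK), 1 (WARNING) and 2 (CRITICAL).

```python
# Exit codes follow the common Nagios/Icinga plugin convention.
OK, WARNING, CRITICAL = 0, 1, 2


def check_error_count(count, warn, crit):
    """Map an error count for some time window to an exit code and status line.

    `warn` and `crit` are hypothetical thresholds; sensible values would have
    to come from looking at real traffic.
    """
    if count >= crit:
        return CRITICAL, "CRITICAL - %d errors in window" % count
    if count >= warn:
        return WARNING, "WARNING - %d errors in window" % count
    return OK, "OK - %d errors in window" % count
```

A wrapper script would print the status line and call sys.exit() with the returned code; where the count comes from (the UDP stream, a log file, a data store) is left open here.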


Version: unspecified
Severity: normal
Whiteboard: deploysprint-13
See Also:
https://bugzilla.wikimedia.org/show_bug.cgi?id=47844

Details

Reference
bz49757
bzimport raised the priority of this task from to High.
bzimport set Reference to bz49757.
bzimport added a subscriber: Unknown Object (MLST).
greg created this task. Jun 18 2013, 5:20 PM
greg added a comment. Jul 22 2013, 9:44 PM

Ori: Do you have something started that you can share on this bug? A project page or anything?

hashar suggested you might log the fatals to a db so that we could enlist Analytics to work on a real dashboard for it.

ori added a comment. Jul 23 2013, 10:24 AM

(In reply to comment #2)

> Ori: Do you have something started that you can share on this bug? A project
> page or anything?

Not yet, but close. I need another day or two.

> hashar suggested you might log the fatals to a db so that we could enlist
> Analytics to work on a real dashboard for it.

Yes; it's a good idea :)

Change 75560 had a related patch set uploaded by Ori.livneh:
(WIP) Parse errors and write to MongoDB

https://gerrit.wikimedia.org/r/75560

ori added a comment. Jul 25 2013, 7:57 AM

Some notes about how things are currently configured:

MediaWiki can report errors to a remote host via UDP. The MediaWiki instances on the production cluster are configured to log to a host named 'fluorine'. This is done by specifying its address as the value of $wmfUdp2logDest in CommonSettings.php (in operations/mediawiki-config.git).
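
For readers unfamiliar with this kind of setup, the receiving end of UDP-based logging can be sketched in Python roughly as follows. This is a minimal sketch and not the actual logging daemon: the function names, the loopback address, and the sample message are all made up for the example.

```python
import socket


def receive_datagrams(sock, count, bufsize=65535):
    """Read `count` datagrams from a bound UDP socket and decode them as text."""
    messages = []
    for _ in range(count):
        data, _addr = sock.recvfrom(bufsize)
        messages.append(data.decode("utf-8", errors="replace"))
    return messages


def demo():
    # Hypothetical stand-in for the log host: bind to an ephemeral local port.
    receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    receiver.bind(("127.0.0.1", 0))
    receiver.settimeout(2.0)  # avoid blocking forever if nothing arrives

    # Hypothetical stand-in for a MediaWiki instance sending one report.
    sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sender.sendto(b"exception example report\n", receiver.getsockname())
        return receive_datagrams(receiver, 1)
    finally:
        sender.close()
        receiver.close()


if __name__ == "__main__":
    print(demo())
```

Because UDP is connectionless, the sender does not learn whether anyone is listening; that keeps the web servers decoupled from the log host, at the cost of possibly losing messages.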

The MediaWiki instances that power the beta cluster set $wmfUdp2logDest to 'deployment-bastion' (https://wikitech.wikimedia.org/wiki/Nova_Resource:I-00000390), a Labs instance which plays the role of fluorine. It writes log data to files in /home/wikipedia/logs. Exceptions and fatals are respectively logged to exception.log and fatal.log in that directory.

When I first started looking at these logs, I didn't want to mess with the file-based logging, since it's an important service that developers rely on. So I submitted a patch to have fluorine stream the log data, as it receives it, to another host (vanadium), in addition to writing it to disk. On vanadium I have a script that generates the Ganglia graphs at http://ur1.ca/edq1f.
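
The "write to disk and also forward a copy" idea can be sketched as below. This is an illustration under stated assumptions, not the patch itself: tee_datagram is a hypothetical helper, an in-memory buffer stands in for the log file, and a loopback socket stands in for the second host.

```python
import io
import socket


def tee_datagram(data, logfile, sock, forward_addr):
    """Append one received datagram to a log file and forward a copy over UDP."""
    logfile.write(data)
    logfile.flush()
    sock.sendto(data, forward_addr)


def demo():
    # Stand-in for the second host: a UDP socket on an ephemeral local port.
    downstream = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    downstream.bind(("127.0.0.1", 0))
    downstream.settimeout(2.0)

    out = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    logfile = io.BytesIO()  # stand-in for a file on disk
    try:
        tee_datagram(b"fatal example report\n", logfile, out,
                     downstream.getsockname())
        forwarded, _addr = downstream.recvfrom(65535)
        return logfile.getvalue(), forwarded
    finally:
        out.close()
        downstream.close()
```

The file write happens first, so a problem on the forwarding side never prevents the report from reaching the on-disk log that developers rely on.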

Yesterday I submitted change Ia0cc8de43 and Ryan merged it. That change reproduces the state of affairs described above (i.e. the duplication of the log stream to two destinations, fluorine and vanadium) on the beta cluster. It does so by having deployment-bastion forward a copy of the log data to a new instance, deployment-fluoride (https://wikitech.wikimedia.org/wiki/Nova_Resource:I-0000084c).

So the TL;DR is that there is an instance on the beta cluster (deployment-fluoride) that receives a live stream of errors and fatals being generated on the beta cluster MediaWikis, and we're free to use it as a sandbox for trying out different ways of capturing and representing this data.

I've only taken an initial step so far, which is to take the stream of exceptions and fatals (which follow an idiosyncratic format that is not easy to analyze) and transform each error report into a JSON document. This is the work done in Ia0cc8de43 (https://gerrit.wikimedia.org/r/#/c/75560/). Or "half-done", I should say, since I've discovered a couple of bugs that I haven't yet had a chance to fix.
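
To make the transformation concrete, here is a minimal sketch of turning one log line into a JSON document. The line format used here is invented purely for illustration (timestamp, host, wiki, message); the real exception.log and fatal.log formats are different and are not reproduced. The name report_to_json is likewise hypothetical.

```python
import json
import re

# Made-up line format for illustration only:
#   "<date> <time> <host> <wiki>: <message>"
LINE_RE = re.compile(
    r"^(?P<timestamp>\S+ \S+) (?P<host>\S+) (?P<wiki>\S+): (?P<message>.*)$"
)


def report_to_json(line, kind):
    """Turn one log line into a JSON document, or return None if it does not match."""
    match = LINE_RE.match(line.rstrip("\n"))
    if match is None:
        return None
    doc = match.groupdict()
    doc["type"] = kind  # e.g. "exception" or "fatal"
    return json.dumps(doc, sort_keys=True)
```

Returning None for unparseable input, rather than raising, lets a long-running stream processor count and skip malformed reports instead of dying on them.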

The nice thing about JSON is that most modern languages have built-in modules in their standard libraries for handling it. So the status quo is that, pending a couple of bugfixes, there will shortly be a streaming JSON service on deployment-fluoride that publishes MediaWiki error and exception reports as machine-readable objects.

In this state, the logs are quite easy to pipe into a data store or a visualization framework. We have to figure out what exactly we want to do, though, and then spec out some solution, ideally using solid off-the-shelf solutions where such solutions exist.
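
As one example of how little glue this takes, the sketch below loads JSON reports (one document per line) into a data store and computes a per-type count. SQLite is used only because it ships with Python; it is a stand-in, not a proposal. The table layout and the "type" and "host" field names are assumptions for the example.

```python
import json
import sqlite3


def load_reports(conn, json_lines):
    """Insert JSON error reports (one document per line) into a SQLite table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS reports (type TEXT, host TEXT, body TEXT)"
    )
    rows = []
    for line in json_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines in the stream
        doc = json.loads(line)
        rows.append((doc.get("type"), doc.get("host"), line))
    conn.executemany("INSERT INTO reports VALUES (?, ?, ?)", rows)
    conn.commit()
    return len(rows)


def counts_by_type(conn):
    """Return {type: count}, the kind of aggregate a dashboard might plot."""
    cursor = conn.execute("SELECT type, COUNT(*) FROM reports GROUP BY type")
    return dict(cursor.fetchall())
```

Usage would look like load_reports(sqlite3.connect(":memory:"), stream), where stream is any iterable of JSON lines. Keeping the raw document in the body column means fields that are not indexed today can still be queried later.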

Some ideas to get the ball rolling:
https://getsentry.com/welcome/ (packages itself as a paid service, but the software is open-source).
http://logstash.net/

We could also build our own custom UI for spelunking the data.

See also bug 52026 about documenting on wikitech our fatal/exception stuff

ori added a comment. Nov 13 2013, 10:40 AM

This kind of bug is difficult to close because there's no clear criterion for considering it resolved. The 'exception-json' log bucket on fluorine got enabled today, so let's pick that as an arbitrary marker and mark this resolved, even though we clearly need to do way more work on logging.

(In reply to comment #7)

> This kind of bug is difficult to close because there's no clear criterion for
> considering it resolved. The 'exception-json' log bucket on fluorine got
> enabled today, so let's pick that as an arbitrary marker and mark this
> resolved, even though we clearly need to do way more work on logging.

An alternative is to make this bug a tracking bug, but [[WP:OKAY]].

Change 75560 abandoned by Ori.livneh:
Parse MediaWiki fatals/exceptions and republish as JSON stream

https://gerrit.wikimedia.org/r/75560