
Add some facility to scap for custom logstash canary checks
Open, Medium, Public

Description

A facility for detecting a rate increase of a particular error following a scap of MediaWiki would be useful during re-deployments. The basic idea is:

  1. User provides logstash filter(s) via CLI argument or config file
  2. User runs scap to sync MediaWiki code or wikiversions (directly or via deploy-promote script)
  3. Scap integrates new logstash canary check(s) into the current run using provided filter(s)

Event Timeline

dduvall created this task. Jul 2 2018, 4:49 PM
Restricted Application added a subscriber: Aklapper. Jul 2 2018, 4:50 PM
thcipriani triaged this task as Medium priority. Jul 10 2018, 10:16 PM
thcipriani moved this task from Needs triage to Debt on the Scap board.

I like this idea and would like to work on it.

Currently the logstash query is fixed and is (for Reasons™) stored in operations/puppet under file: service::logstash_checker.py:

        query = ('host:("%(host)s") '
                 'AND ((type:mediawiki '
                 'AND (channel:exception '
                 'OR channel:error)) '
                 'OR type:hhvm)') % vars(self)

        return {
            "size": 0,
            "aggs": {
                "2": {
                    "date_histogram": {
                        "interval": "10s",
                        "field": "@timestamp"
                    }
                }
            },
            "query": {
                "bool": {
                    "filter": [
                        {
                            "range": {
                                "@timestamp": {
                                    "lte": "now",
                                    "gte": "now-60m"
                                }
                            }
                        },
                        {
                            "query_string": {
                                "query": query
                            }
                        }
                    ],
                    "must_not": [
                        {
                            "terms": {
                                "level": [
                                    "INFO",
                                    "DEBUG"
                                ]
                            }
                        },
                        {
                            "match": {
                                "message": {
                                    "query": "SlowTimer",
                                    "type": "phrase"
                                }
                            }
                        },
                        {
                            "match": {
                                "message": {
                                    "query": "Invalid host name",
                                    "type": "phrase"
                                }
                            }
                        },
                        {
                            "match": {
                                "message": {
                                    "query": "LuaSandbox/Engine.php",
                                    "type": "phrase"
                                }
                            }
                        }
                    ]
                }
            }
        }

Scap runs this query. The threshold is set in scap.cfg via the canary_threshold variable: the multiple by which the rate of errors returned by the above query must increase in order to fail a deployment. In this case the multiple is 10.0.
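To make the multiplier concrete, here is a minimal sketch of how such a check could work. It is not the actual logstash_checker.py logic: the function name, the per-bucket count inputs, and the floor of 1.0 on the baseline are all assumptions for illustration.

```python
from statistics import mean


def exceeds_canary_threshold(before_counts, after_counts, canary_threshold=10.0):
    """Return True if the post-deploy error rate should fail the deployment.

    before_counts / after_counts are hypothetical inputs: per-bucket error
    counts (e.g. one per 10s histogram bucket) from before and after the sync.
    """
    mean_before = mean(before_counts) if before_counts else 0.0
    mean_after = mean(after_counts) if after_counts else 0.0
    # Assumption: floor the baseline at 1 error per bucket so that a quiet
    # host does not fail the check on a single stray error.
    baseline = max(mean_before, 1.0)
    return mean_after > baseline * canary_threshold
```

With the threshold at 10.0, a host averaging 2 errors per bucket before the sync would only fail once it averages more than 20 per bucket afterwards.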

I see a few possibilities for how to implement a better logstash checker. One that might be simple is allowing customization of the must and must_not sections of the query via a file in ./scap:

scap/logstash_canary.yaml
rate_multiplier: 10.0
query: ((type:mediawiki AND (channel:exception OR channel:error)) OR type:hhvm)
matches:
  - level: Error
  - ! message: Slowtimer
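As a rough sketch of how scap might turn the matches list above into query clauses: the function below is hypothetical, assumes each entry arrives as a plain "field: value" string with a leading "!" marking an exclusion, and uses Elasticsearch's match_phrase query rather than the match/"type": "phrase" form in the current checker.

```python
def build_bool_clauses(matches):
    """Split match entries into must / must_not lists for a bool query.

    Each entry is assumed to be a string "field: value"; a leading "!"
    marks the entry as an exclusion (must_not).
    """
    must, must_not = [], []
    for entry in matches:
        entry = entry.strip()
        negated = entry.startswith("!")
        if negated:
            entry = entry[1:].strip()
        # Split on the first colon only, so values may contain colons.
        field, _, value = entry.partition(":")
        clause = {"match_phrase": {field.strip(): value.strip().strip('"')}}
        (must_not if negated else must).append(clause)
    return {"must": must, "must_not": must_not}


clauses = build_bool_clauses(["level: Error", "! message: Slowtimer"])
```

The returned dict could then be merged into the "bool" section of the existing query alongside the timestamp range filter.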

scap could maybe have some facility for trying these queries out in the logstash interface.

No opinions on this yet, just an idea; I would love feedback on it. Adding @zeljkofilipin since he and I discussed this on IRC. Adding @Krinkle since he's been helpful in the past in formulating improvements to this process.

greg awarded a token. Jul 27 2018, 7:18 PM

Moving it out of puppet and into mediawiki-config seems like a good start indeed.

query: ((type:mediawiki AND (channel:exception OR channel:error)) OR type:hhvm)
matches:
  - level: Error
  - ! message: Slowtimer

Exclusion of level:"INFO" seems problematic. PHP Notice and PHP Warning errors are often at that level. Given the already very narrow set of channels we match here, there should be no need to exclude any level of messages, as these channels contain no unimportant messages. (The Jenkins jobs for MediaWiki commits fail if any of these logs are non-empty after installing, testing, and browsing MediaWiki.)

Also, I wonder why channel:fatal is missing. That seems quite important.

Channel "exception" holds would-be fatal errors that we managed to catch at the highest level within the UI layer and respond to by outputting a normally skinned page. Channel "fatal" holds errors that we were unable to catch even there, usually because the error was thrown too early, because the UI layer itself broke, or because of limited resources (e.g. out of memory).

I've updated Fatal-Monitor on Kibana and suggest we use the same query in Scap.

query: "(type:mediawiki AND (channel:(fatal OR exception OR error))) OR type:hhvm"
matches:
- '! message:"SlowTimer"'
jijiki added a subscriber: jijiki. Mar 15 2019, 11:34 AM