Include fatal log rate check in scap canary test
Closed, ResolvedPublic
Actions

Assigned To

Authored By

	aaron
	Jan 5 2017, 5:32 AM

Description

Would it be feasible to check the rate of hhvm errors (with a few filtering criteria) for the scap canary back-out check? The following Kibana query seems useful:

https://logstash.wikimedia.org/app/kibana#/dashboard/default?_g=(refreshInterval:(display:Off,pause:!f,value:0),time:(from:'2017-01-04T22:56:57.593Z',mode:absolute,to:'2017-01-04T23:29:30.920Z'))&_a=(filters:!(),options:(darkTheme:!f),panels:!((col:1,id:Dashboards,panelIndex:1,row:1,size_x:12,size_y:2,type:visualization),(col:1,id:Events-Over-Time,panelIndex:2,row:3,size_x:9,size_y:2,type:visualization),(col:1,id:Event-Types,panelIndex:3,row:5,size_x:5,size_y:3,type:visualization),(col:6,id:Event-Level,panelIndex:4,row:5,size_x:3,size_y:3,type:visualization),(col:1,columns:!(type,level,wiki,host,message),id:Default-Events-List,panelIndex:5,row:8,size_x:12,size_y:25,sort:!('@timestamp',desc),type:search),(col:10,id:Top-20-Hosts,panelIndex:6,row:3,size_x:3,size_y:2,type:visualization),(col:9,id:Events-Over-Time-By-Channel,panelIndex:7,row:5,size_x:4,size_y:3,type:visualization)),query:(query_string:(analyze_wildcard:!t,query:'%2Btype:hhvm%20%2Blevel:ERROR%20%2Bmessage:%22Catchable%20fatal%20error%22')),title:default,uiState:(P-2:(vis:(legendOpen:!f)),P-3:(vis:(legendOpen:!f)),P-4:(vis:(legendOpen:!f))))

The filter is essentially

+type:hhvm +level:ERROR +message:"Catchable fatal error"

It doesn't seem to noisy over the weeks AFAIK. Could an equivalent filter be used by the scap or is the filtering more a priori at the moment (in which case the entries would need be tagged/counted as they come).

Details

	Subject	Repo	Branch	Lines +/-
	Include hhvm fatals and exceptions in scap canary checks	operations/puppet	production	+69 -0

Customize query in gerrit

Related Objects

Mentioned In: T121597: Implement MediaWiki pre-promote checks
T142784: Check same set of errors/warnings/fatals in scap logstash_checker.py as there is in `fatalmonitor` on fluorine
Mentioned Here: T142784: Check same set of errors/warnings/fatals in scap logstash_checker.py as there is in `fatalmonitor` on fluorine

Event Timeline

aaron created this task.Jan 5 2017, 5:32 AM

Restricted Application added a subscriber: Aklapper. · View Herald TranscriptJan 5 2017, 5:32 AM

Can incorporate that query into the logstash checker. We'll have to combine it with the current logic.

I took a crack at a patch to incorporate fatalmonitor into the logstash_checker, but it ended up showing lots of false positives (i.e., redis connection timeout, database timeout, pl proc line: 2959, etc)—I was worried that folks would get into the habit of deploying with --force if we released it, so I shelved it for a bit. Will likely dust off: https://gerrit.wikimedia.org/r/#/c/304327 for this.

Maybe if I limit hhvm to ERRORs and mediawiki to the exception channel I can get to a more robust endpoint. Still far from being able to catch everything, but definitely would have caught this: https://logstash.wikimedia.org/goto/56cefba2cda599faf09d5eb36b8bb79f

greg added a project: Wikimedia-Incident.Jan 5 2017, 5:12 PM

greg moved this task from Active investigation to Follow-up prevention on the Wikimedia-Incident board.

Change 304327 had a related patch set uploaded (by Thcipriani):
Include hhvm fatals and exceptions in scap canary checks

https://gerrit.wikimedia.org/r/304327

gerritbot added a project: Patch-For-Review.Jan 6 2017, 8:02 PM

So this is the query I came up with:

((type:mediawiki AND channel:exception) OR type:hhvm)

Filtered with:

{
  "bool": {
    "must_not": [
      {
        "terms": {
          "level": [
            "INFO",
            "NOTICE",
            "WARNING"
          ]
        }
      },
      {
        "query": {
          "match": {
            "message": {
              "query": "SlowTimer",
              "type": "phrase"
            }
          }
        }
      },
      {
        "query": {
          "match": {
            "message": {
              "query": "Invalid host name",
              "type": "phrase"
            }
          }
        }
      }
    ]
  }
}

It captures the error from the incident. I explicitly excluded some noisy errors, but more may have to be added.
https://logstash.wikimedia.org/goto/52d694f5e2690623ad0c5627cda75a8c

aaron updated the task description. (Show Details)Jan 6 2017, 8:43 PM

Krinkle mentioned this in T142784: Check same set of errors/warnings/fatals in scap logstash_checker.py as there is in `fatalmonitor` on fluorine.Jan 6 2017, 9:50 PM

Krinkle subscribed.

In T154646#2919756, @thcipriani wrote:

I took a crack at a patch to incorporate fatalmonitor into the logstash_checker (T142784), but it ended up showing lots of false positives (i.e., redis connection timeout, database timeout, pl proc line: 2959, etc)

The false positives do exist indeed, but I'm not sure I understand why that is a problem. We also have non-zero mediawiki errors/fatals (which the current logstash checker already uses to abort a deployment). Presumably the reason this isn't an issue is because the checker doesn't just fail on any error/fatal, but only if there is a significant increase in them following the deployment to the canary servers. As such, wouldn't the same approach work fine for all PHP warnings and errors from hhvm or mediawiki?

Krinkle mentioned this in T121597: Implement MediaWiki pre-promote checks.Jan 6 2017, 10:09 PM

In T154646#2924576, @Krinkle wrote:

Presumably the reason this isn't an issue is because the checker doesn't just fail on any error/fatal, but only if there is a significant increase in them following the deployment to the canary servers.

This is true. The current behavior is to wait 20 seconds, then compare the average error rate over the past 60 minutes to the error rate over the past 20 seconds. If the average error rate increases 10x then a deployment is failed. All these values are adjustable via the scap config and may yet need tweaking over time.

In T154646#2924576, @Krinkle wrote:

As such, wouldn't the same approach work fine for all PHP warnings and errors from hhvm or mediawiki?

Yep, pretty much what is done in https://gerrit.wikimedia.org/r/#/c/304327/

I didn't add the hhvm check initially since it seemed noisier and appeared to correlate less to actual errors caused by a deployment than mediawiki logs when I was testing.

When initially building canary checks, I felt that showing too many false failures would be worse than no canary check as it encourages overriding a bothersome check in the interest expediency.

One heuristic I find to be useful when watching a deployment is to look over the error messages right before deploying, then after the deploy I visually scan for new messages that weren't seen previously. It's pretty difficult to do this manually but that seems like something that would be fairly trivial to pick out of logstash and it might be able to be more sensitive than requiring a 10x increase in error rate. Maybe just weight the results to count new error messages more heavily than previously seen messages?

Change 304327 merged by Filippo Giunchedi:
Include hhvm fatals and exceptions in scap canary checks

https://gerrit.wikimedia.org/r/304327

In T154646#2965895, @gerritbot wrote:

Change 304327 merged by Filippo Giunchedi:
Include hhvm fatals and exceptions in scap canary checks

https://gerrit.wikimedia.org/r/304327

Done?

In T154646#2973234, @greg wrote:

In T154646#2965895, @gerritbot wrote:

Change 304327 merged by Filippo Giunchedi:
Include hhvm fatals and exceptions in scap canary checks

https://gerrit.wikimedia.org/r/304327

Done?

Still needs the next scap release to be complete.

thcipriani closed this task as Resolved.Jan 31 2017, 3:30 PM

thcipriani claimed this task.

Krinkle edited projects, added Sustainability (Incident Followup); removed Wikimedia-Incident, Scap.Apr 28 2020, 9:50 PM

Include fatal log rate check in scap canary testClosed, ResolvedPublicActions

Description

Details

Related Objects

Event Timeline

Include fatal log rate check in scap canary test
Closed, ResolvedPublic
Actions