Page MenuHomePhabricator

Include fatal log rate check in scap canary test
Closed, ResolvedPublic

Description

Would it be feasible to check the rate of hhvm errors (with a few filtering criteria) for the scap canary back-out check? The following Kibana query seems useful:

https://logstash.wikimedia.org/app/kibana#/dashboard/default?_g=(refreshInterval:(display:Off,pause:!f,value:0),time:(from:'2017-01-04T22:56:57.593Z',mode:absolute,to:'2017-01-04T23:29:30.920Z'))&_a=(filters:!(),options:(darkTheme:!f),panels:!((col:1,id:Dashboards,panelIndex:1,row:1,size_x:12,size_y:2,type:visualization),(col:1,id:Events-Over-Time,panelIndex:2,row:3,size_x:9,size_y:2,type:visualization),(col:1,id:Event-Types,panelIndex:3,row:5,size_x:5,size_y:3,type:visualization),(col:6,id:Event-Level,panelIndex:4,row:5,size_x:3,size_y:3,type:visualization),(col:1,columns:!(type,level,wiki,host,message),id:Default-Events-List,panelIndex:5,row:8,size_x:12,size_y:25,sort:!('@timestamp',desc),type:search),(col:10,id:Top-20-Hosts,panelIndex:6,row:3,size_x:3,size_y:2,type:visualization),(col:9,id:Events-Over-Time-By-Channel,panelIndex:7,row:5,size_x:4,size_y:3,type:visualization)),query:(query_string:(analyze_wildcard:!t,query:'%2Btype:hhvm%20%2Blevel:ERROR%20%2Bmessage:%22Catchable%20fatal%20error%22')),title:default,uiState:(P-2:(vis:(legendOpen:!f)),P-3:(vis:(legendOpen:!f)),P-4:(vis:(legendOpen:!f))))

The filter is essentially

+type:hhvm +level:ERROR +message:"Catchable fatal error"

It doesn't seem to noisy over the weeks AFAIK. Could an equivalent filter be used by the scap or is the filtering more a priori at the moment (in which case the entries would need be tagged/counted as they come).

Event Timeline

thcipriani moved this task from Needs triage to Debt on the Scap board.
thcipriani subscribed.

Can incorporate that query into the logstash checker. We'll have to combine it with the current logic.

I took a crack at a patch to incorporate fatalmonitor into the logstash_checker, but it ended up showing lots of false positives (i.e., redis connection timeout, database timeout, pl proc line: 2959, etc)—I was worried that folks would get into the habit of deploying with --force if we released it, so I shelved it for a bit. Will likely dust off: https://gerrit.wikimedia.org/r/#/c/304327 for this.

Maybe if I limit hhvm to ERRORs and mediawiki to the exception channel I can get to a more robust endpoint. Still far from being able to catch everything, but definitely would have caught this: https://logstash.wikimedia.org/goto/56cefba2cda599faf09d5eb36b8bb79f

Change 304327 had a related patch set uploaded (by Thcipriani):
Include hhvm fatals and exceptions in scap canary checks

https://gerrit.wikimedia.org/r/304327

So this is the query I came up with:

((type:mediawiki AND channel:exception) OR type:hhvm)

Filtered with:

{
  "bool": {
    "must_not": [
      {
        "terms": {
          "level": [
            "INFO",
            "NOTICE",
            "WARNING"
          ]
        }
      },
      {
        "query": {
          "match": {
            "message": {
              "query": "SlowTimer",
              "type": "phrase"
            }
          }
        }
      },
      {
        "query": {
          "match": {
            "message": {
              "query": "Invalid host name",
              "type": "phrase"
            }
          }
        }
      }
    ]
  }
}

It captures the error from the incident. I explicitly excluded some noisy errors, but more may have to be added.
https://logstash.wikimedia.org/goto/52d694f5e2690623ad0c5627cda75a8c

I took a crack at a patch to incorporate fatalmonitor into the logstash_checker (T142784), but it ended up showing lots of false positives (i.e., redis connection timeout, database timeout, pl proc line: 2959, etc)

The false positives do exist indeed, but I'm not sure I understand why that is a problem. We also have non-zero mediawiki errors/fatals (which the current logstash checker already uses to abort a deployment). Presumably the reason this isn't an issue is because the checker doesn't just fail on any error/fatal, but only if there is a significant increase in them following the deployment to the canary servers. As such, wouldn't the same approach work fine for all PHP warnings and errors from hhvm or mediawiki?

Presumably the reason this isn't an issue is because the checker doesn't just fail on any error/fatal, but only if there is a significant increase in them following the deployment to the canary servers.

This is true. The current behavior is to wait 20 seconds, then compare the average error rate over the past 60 minutes to the error rate over the past 20 seconds. If the average error rate increases 10x then a deployment is failed. All these values are adjustable via the scap config and may yet need tweaking over time.

As such, wouldn't the same approach work fine for all PHP warnings and errors from hhvm or mediawiki?

Yep, pretty much what is done in https://gerrit.wikimedia.org/r/#/c/304327/

I didn't add the hhvm check initially since it seemed noisier and appeared to correlate less to actual errors caused by a deployment than mediawiki logs when I was testing.

When initially building canary checks, I felt that showing too many false failures would be worse than no canary check as it encourages overriding a bothersome check in the interest expediency.

One heuristic I find to be useful when watching a deployment is to look over the error messages right before deploying, then after the deploy I visually scan for new messages that weren't seen previously. It's pretty difficult to do this manually but that seems like something that would be fairly trivial to pick out of logstash and it might be able to be more sensitive than requiring a 10x increase in error rate. Maybe just weight the results to count new error messages more heavily than previously seen messages?

Change 304327 merged by Filippo Giunchedi:
Include hhvm fatals and exceptions in scap canary checks

https://gerrit.wikimedia.org/r/304327

Change 304327 merged by Filippo Giunchedi:
Include hhvm fatals and exceptions in scap canary checks

https://gerrit.wikimedia.org/r/304327

Done?

Change 304327 merged by Filippo Giunchedi:
Include hhvm fatals and exceptions in scap canary checks

https://gerrit.wikimedia.org/r/304327

Done?

Still needs the next scap release to be complete.

thcipriani claimed this task.