
Check same set of errors/warnings/fatals in scap logstash_checker.py as there is in `fatalmonitor` on fluorine
Closed, Resolved | Public

Description

From https://wikitech.wikimedia.org/wiki/Incident_documentation/20160809-MediaWiki

  • At 16:00 Max prepares for SWAT but sees errors in fatalmonitor and investigates:
    • PHP Warning: Creating default object from empty value in /srv/mediawiki/wmf-config/CommonSettings.php on line 686
    • PHP Notice: Undefined variable: wgContactConfig in /srv/mediawiki/wmf-config/CommonSettings.php on line 968
  • Max sees no such errors in Logstash.

Event Timeline

greg renamed this task from Show same set of errors/warnings/fatals in logstash's fatalmonitor as there is in `fatalmonitor` on fluorine to Check same set of errors/warnings/fatals in scap logstash_checker.py as there is in `fatalmonitor` on fluorine. Aug 11 2016, 11:54 PM

Change 304327 had a related patch set uploaded (by Thcipriani):
Add the fatalmonitor query to logstash_checker

https://gerrit.wikimedia.org/r/304327

Krinkle subscribed.

What about 20170104-MonologSpi? That incident involved an unconditional fatal error on all page views, and I'm surprised scap's canary/logstash checker didn't catch it. T121597 would avoid exposing end users to these basic errors in the first place, but now that we have the canary check, it should at least have automatically aborted and rolled back the deployment.

Is it for the same reason as the discrepancy that https://gerrit.wikimedia.org/r/304327 will fix? Or does something else need to be added?


EDIT: Looks like T154646 will cover this.

Change 304327 merged by Filippo Giunchedi:
Include hhvm fatals and exceptions in scap canary checks

https://gerrit.wikimedia.org/r/304327

The new scap release now checks both hhvm and MediaWiki error rates.
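
For illustration, a canary check that covers both log sources might be sketched roughly as follows. This is not scap's actual logstash_checker.py code: the `ErrorCounts` type, the `canary_is_healthy` function, and the 10x default threshold are all hypothetical, and the counts are assumed to have already been fetched from Logstash for the canary hosts.

```python
from dataclasses import dataclass


@dataclass
class ErrorCounts:
    """Error-level log entries seen on the canary hosts in one time window.

    Hypothetical container; how the counts are queried is out of scope here.
    """
    mediawiki: int  # e.g. MediaWiki exceptions and errors
    hhvm: int       # e.g. hhvm fatals, warnings and notices


def canary_is_healthy(before: ErrorCounts, after: ErrorCounts,
                      threshold: float = 10.0) -> bool:
    """Return False if either source's count grew more than `threshold`-fold.

    The threshold value is an assumption chosen for this sketch.
    """
    for source in ("mediawiki", "hhvm"):
        old = getattr(before, source)
        new = getattr(after, source)
        # Treat a zero baseline as 1 so a quiet "before" window
        # does not cause a division by zero.
        if new / max(old, 1) > threshold:
            return False
    return True


# Made-up numbers: MediaWiki errors barely move, hhvm errors spike.
before = ErrorCounts(mediawiki=5, hhvm=0)
after = ErrorCounts(mediawiki=6, hhvm=500)
print(canary_is_healthy(before, after))  # False
```

With these made-up numbers, a check that looked only at the MediaWiki counts would pass the deployment (6 / 5 = 1.2), while the hhvm counts alone are enough to fail it. That is the kind of gap this task describes.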