
Include 5xx numbers in fluorine fatalmonitor
Closed, Resolved · Public


The fatalmonitor in fluorine does not capture all the information necessary to see whether the site is in a healthy state. 5xx information from graphite or logstash should be integrated into this monitor.

Event Timeline

EBernhardson raised the priority of this task from to Needs Triage.
EBernhardson updated the task description. (Show Details)
EBernhardson added a project: SRE.
EBernhardson added a subscriber: EBernhardson.
chasemp set Security to None.
mmodell claimed this task.
mmodell added a subscriber: mmodell.

fatalmonitor is a very rudimentary tool. Scap should include such information (See T110068: Basic scap{2,3} canary deployment process & checks), but I don't think fatalmonitor is a good place to add such functionality.

Legoktm added a subscriber: Legoktm.

Until scap3 is actually in use, this is a valid feature enhancement request for fatalmonitor. If/once we're no longer using fatalmonitor, this task can be declined.

@Legoktm: scap3 is in use and T110068 should be done in the near future.

Have you looked at the code for fatalmonitor? It's a series of unix commands piped together and wrapped in watch - it really is not an architecture that would allow this feature to be added in any straightforward manner. IMO, it's not a valid enhancement because it's not sensible to add the feature.

Maybe @bd808 can back me up here ;)

So my prediction: this task sits around for a while, doesn't get implemented, then scap3 adds the feature and we close this. Why clutter up task lists endlessly when an identical feature is already on an active project's radar?

This command line will give the number of errors in the last 30s:

NOW=$(( $(date "+%s") * 1000)); START=$(( $NOW - 30000 ))
curl localhost:9200/logstash-$(date "+%Y.%m.%d")/_search -d '{"facets":{
"0":{"date_histogram":{"field":"@timestamp","interval":"30s"},"global":true,"facet_filter":{"fquery":{"query":{"filtered":{"query":{"query_string":{"query":"type:hhvm AND message:\"request has exceeded memory limit\""}},"filter":{"bool":{"must":[{"range":{"@timestamp":{"from":'$START',"to":'$NOW'}}}],"must_not":[{"fquery":{"query":{"query_string":{"query":"level:\"Notice\" OR level:\"Warning\" OR level:\"INFO\""}},"_cache":true}},{"fquery":{"query":{"query_string":{"query":"message:\"SlowTimer\""}},"_cache":true}}]}}}}}}},
"1":{"date_histogram":{"field":"@timestamp","interval":"30s"},"global":true,"facet_filter":{"fquery":{"query":{"filtered":{"query":{"query_string":{"query":"type:mediawiki AND channel:exception"}},"filter":{"bool":{"must":[{"range":{"@timestamp":{"from":'$START',"to":'$NOW'}}}],"must_not":[{"fquery":{"query":{"query_string":{"query":"level:\"Notice\" OR level:\"Warning\" OR level:\"INFO\""}},"_cache":true}},{"fquery":{"query":{"query_string":{"query":"message:\"SlowTimer\""}},"_cache":true}}]}}}}}}},
"2":{"date_histogram":{"field":"@timestamp","interval":"30s"},"global":true,"facet_filter":{"fquery":{"query":{"filtered":{"query":{"query_string":{"query":"type:mediawiki AND channel:wfLogDBError"}},"filter":{"bool":{"must":[{"range":{"@timestamp":{"from":'$START',"to":'$NOW'}}}],"must_not":[{"fquery":{"query":{"query_string":{"query":"level:\"Notice\" OR level:\"Warning\" OR level:\"INFO\""}},"_cache":true}},{"fquery":{"query":{"query_string":{"query":"message:\"SlowTimer\""}},"_cache":true}}]}}}}}}},
"3":{"date_histogram":{"field":"@timestamp","interval":"30s"},"global":true,"facet_filter":{"fquery":{"query":{"filtered":{"query":{"query_string":{"query":"type:hhvm AND NOT message:\"request has exceeded memory limit\""}},"filter":{"bool":{"must":[{"range":{"@timestamp":{"from":'$START',"to":'$NOW'}}}],"must_not":[{"fquery":{"query":{"query_string":{"query":"level:\"Notice\" OR level:\"Warning\" OR level:\"INFO\""}},"_cache":true}},{"fquery":{"query":{"query_string":{"query":"message:\"SlowTimer\""}},"_cache":true}}]}}}}}}}
},"size":20,"query":{"filtered":{"query":{"query_string":{"query":"type:scap AND (channel.raw:scap.announce OR message:\"Started sync_wikiversions\")"}},"filter":{"bool":{"must":[{"range":{"@timestamp":{"from":'$START',"to":'$NOW'}}}]}}}},"sort":[{"@timestamp":{"order":"desc","ignore_unmapped":true}},{"@timestamp":{"order":"desc","ignore_unmapped":true}}]}' | jq '.facets | map(.entries[].count) | add'

To run this from fluorine we need to open up the firewall that currently blocks access to the logstash elasticsearch cluster (or maybe there is some proxy to talk to?).

I can also work out how to adjust the fatalmonitor watch command to report both of these.
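Something along these lines could glue the two views together; the script name, log path, and refresh interval below are assumptions for illustration, not the real tool:

```shell
# fatalmonitor-plus.sh - hypothetical sketch; run as: watch -n 30 sh fatalmonitor-plus.sh
# Compute the same 30-second window as the curl command above.
NOW=$(( $(date "+%s") * 1000 ))
START=$(( NOW - 30000 ))
echo "window: $START .. $NOW (last 30s)"
# 1) classic fatalmonitor view of hhvm.log (existing pipeline, unchanged):
#      tail -n 1000 /a/mw-log/hhvm.log | ... | sort -rn | head
# 2) logstash error count for the same window (the curl | jq pipeline above):
#      curl -s localhost:9200/logstash-.../_search -d "..." | jq '...'
```

Wrapping both steps in one script keeps the single `watch` refresh cycle that fatalmonitor users already expect.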

my $0.02:

Fatalmonitor is a hack of tail + sed + awk + watch, and it only sees things that are reported to hhvm.log.
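For context, the shape of such a pipeline is roughly this (a minimal sketch, assuming hhvm.log lines contain "Fatal error: &lt;message&gt;"; the real script on fluorine differs in detail):

```shell
# Sketch of a fatalmonitor-style aggregation; the log format and path are
# assumptions, not the exact production script.
count_fatals() {
  grep -o 'Fatal error:.*' | sort | uniq -c | sort -rn | head -20
}

# In production this would be wrapped as something like:
#   watch -n 5 'tail -n 1000 /a/mw-log/hhvm.log | count_fatals'
# Demo on canned input:
printf '%s\n' \
  'host1 hhvm: Fatal error: Out of memory' \
  'host2 hhvm: Fatal error: Out of memory' \
  'host3 hhvm: Fatal error: Call to undefined function foo()' |
  count_fatals
```

Anything that never reaches hhvm.log - including 5xx responses served by the edge - is invisible to this design, which is the limitation being discussed.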

Adding a curl + jq component to it that reports one number is not going to give deployers a much better view of the health of the production cluster.

I made the kibana fatalmonitor dashboard 18+ months ago and used it to monitor train deploys when I was running them. It works and it gives a much more complete view of how things are going than tailing hhvm.log ever will.

I actually probably did the world a disservice by making the fatalmonitor shell script work again after we switched from apache+php5 to hhvm.

Scap is going to provide a multi-paned terminal layout which monitors logstash in the same terminal window that is running scap. It will be a tremendous improvement over the fatalmonitor command. Even so, the kibana fatalmonitor dashboard will still be the preferred place to watch for php errors.

@EBernhardson: If you want to implement a logstash error count in fatalmonitor on fluorine, by all means do it; however, it's really just a stop-gap. Scap will make the fatalmonitor command obsolete very soon.

FWIW, 5xx and 4xx responses are available directly from logstash now under type:webrequest.
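A 5xx count could then be pulled with something much simpler than the facets query above. This is a hedged sketch only: it assumes the same logstash index naming as the earlier curl command, and that webrequest events carry an http_status field (the field name is an assumption - check the actual mapping):

```shell
# Build the query first so it can be inspected before sending.
# "http_status" is an assumed field name on type:webrequest events.
QUERY='{"query":{"query_string":{"query":"type:webrequest AND http_status:[500 TO 599]"}}}'
echo "$QUERY"
# From a host with access to the logstash elasticsearch cluster:
#   curl -s "localhost:9200/logstash-$(date +%Y.%m.%d)/_count" -d "$QUERY"
# The _count endpoint returns {"count": N, ...}; add a @timestamp range
# filter to narrow the window to the last 30s as in the command above.
```

Using _count instead of _search with facets avoids pulling back hits and histogram buckets when only a single number is wanted.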