One of the application servers stopped responding even though the Apache process was still up (bug 52776). We need to monitor that the application servers are actually serving something.
Version: unspecified
Severity: enhancement
Status | Subtype | Assigned | Task
---|---|---|---
Open | | None | T53494 Use Beta cluster as a true canary for code deployments (epic)
Stalled | | None | T53497 Setup monitoring for Beta Cluster (tracking)
Resolved | | yuvipanda | T54867 monitor that application servers are responding
Icinga has "Apache HTTP" monitoring in WMF production, but apparently only for hosts in PMTPA, not in EQIAD (another issue)
(In reply to comment #2)
> Icinga has "Apache HTTP" monitoring in WMF production, but apparently only for hosts in PMTPA, not in EQIAD (another issue)
Ignore the EQIAD part - Apache isn't monitored on job runners
(In reply to comment #4)
> And this bug is about the beta cluster :)
What I meant was that there should be Icinga config you can steal/hack/copy and paste or whatever ;)
Resetting severity. If it was really critical it would have been fixed long ago.
Yuvi Panda is working on integrating Shinken for labs, a drop-in replacement for Nagios/Icinga.
Hmm, so the 'ideal' way is for shinken to hit port 80 on those instances and check if they are serving content properly. This is complicated by firewall rules. We could theoretically just open up the web security group's port 80 to the shinken hosts, and that is probably the right thing to do here.
I'll work on this.
Change 181775 had a related patch set uploaded (by Yuvipanda):
beta: Add monitoring for mediawiki app servers
http://shinken.wmflabs.org/host/deployment-mediawiki03 :D
So this adds monitoring for bits (requests a static image), and the enwiki main page (checks if the string 'Wikipedia' exists). This hits the individual mediawiki machines - specifically machines with the role role::beta::appserver applied, and reports errors if any.
As soon as it got merged, it told me that mediawiki03 was failing the regular main page check (but not bits!). Restarting hhvm seems to have fixed that.
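For anyone wanting to reproduce the two checks described above by hand, here is a rough sketch in Python. It is not the code from the patch set, just an illustration of the idea: request a path from one specific instance while sending the public vhost in the Host header, then optionally look for a string in the body. The function names, the vhost names and the image path in the usage comment are assumptions made up for this example; only the instance name and the 'Wikipedia' string come from this task.

```python
import urllib.error
import urllib.request

# Exit codes used by Nagios/Icinga/Shinken check plugins.
OK, CRITICAL = 0, 2


def evaluate_response(status, body, expected=None):
    """Turn an HTTP status and body (bytes) into an (exit_code, message) pair."""
    if status != 200:
        return CRITICAL, f"CRITICAL: HTTP {status}"
    if expected is not None and expected.encode() not in body:
        return CRITICAL, f"CRITICAL: string {expected!r} not found"
    return OK, f"OK: HTTP 200, {len(body)} bytes"


def check_app_server(instance, vhost, path, expected=None, timeout=10):
    """Request http://<instance><path> on port 80 with Host set to <vhost>."""
    request = urllib.request.Request(
        f"http://{instance}{path}", headers={"Host": vhost}
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return evaluate_response(response.status, response.read(), expected)
    except urllib.error.HTTPError as error:
        # The server answered, but with a 4xx/5xx status.
        return evaluate_response(error.code, b"", expected)
    except OSError as error:
        # Connection refused, timeout, DNS failure and similar.
        return CRITICAL, f"CRITICAL: {error}"


# Hypothetical usage (vhosts and image path are placeholders, not real config):
# check_app_server("deployment-mediawiki03", "en.wikipedia.beta.wmflabs.org",
#                  "/wiki/Main_Page", expected="Wikipedia")
# check_app_server("deployment-mediawiki03", "bits.beta.wmflabs.org",
#                  "/static/some-image.png")
```

Keeping the status and string evaluation separate from the network call means the decision logic can be exercised without touching a real instance. A check like this only works once port 80 on the instances is reachable from the monitoring host, as discussed earlier.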
Change 181787 had a related patch set uploaded (by Yuvipanda):
beta: Add HHVM queue size monitoring
Change 183454 had a related patch set uploaded (by Hashar):
beta: monitor mobile main page