We need a monitoring setup that checks not only that the SPARQL endpoint responds (this already exists in Icinga) but also that it responds in a reasonable time. That would allow us to detect DoS scenarios against WDQS that do not lead to a full denial of service (a 503 code or similar) but only slow the service down to the point where it is no longer usable.
Details
Subject | Repo | Branch | Lines +/-
---|---|---|---
Add response time checks to WDQS | operations/puppet | production | +12 -0
Event Timeline
It should be noted that we just had a partial outage lasting 6-7 hours without us noticing ;)
wdqs1002 seemed to totally die but nothing pinged anyone.
wdqs1001 seemed to stop updating (resulting in a warning in icinga) but again it doesn't seem to have pinged anyone.
Monitoring should not only check the main endpoint (query.wikidata.org) but also each host itself!
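A per-host response time check could be sketched roughly as follows. This is a minimal sketch under assumptions, not the existing Icinga check: the `probe` callable, the target names passed in, and the 30 s limit are hypothetical. In practice `probe` would send a cheap SPARQL query to the given target.

```python
import time
from typing import Callable, Dict, Iterable

OK, CRITICAL = 0, 2  # conventional Nagios/Icinga plugin exit codes


def timed_check(probe: Callable[[str], None], target: str, limit_s: float,
                clock: Callable[[], float] = time.monotonic) -> int:
    """Run probe(target); CRITICAL if it raises or takes longer than limit_s."""
    start = clock()
    try:
        probe(target)
    except Exception:
        # A dead host and a slow host are both treated as failures here.
        return CRITICAL
    elapsed = clock() - start
    return CRITICAL if elapsed > limit_s else OK


def check_all(probe: Callable[[str], None], targets: Iterable[str],
              limit_s: float = 30.0) -> Dict[str, int]:
    """Check the public endpoint and every backend host separately."""
    return {t: timed_check(probe, t, limit_s) for t in targets}
```

Running `check_all` over the main endpoint plus each backend host means a single dead host shows up even when the load-balanced endpoint still answers.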
Data is being shoved into graphite to power https://grafana.wikimedia.org/dashboard/db/wikidata-query-service which could be used, but none of that is productionized yet.
I think we need to at least put a monitor on whatever "varnish latency" counts and alert, say, if it's over 30 s.
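One way such a threshold could be evaluated, as a sketch only: the function name and the "fraction of recent samples over the limit" rule are assumptions for illustration, chosen so that a single slow sample does not page anyone. Samples are assumed to be latencies in seconds.

```python
from typing import Sequence


def latency_alert(samples_s: Sequence[float], limit_s: float = 30.0,
                  max_fraction_over: float = 0.5) -> bool:
    """Return True if too many recent latency samples exceed limit_s."""
    if not samples_s:
        # No data is handled elsewhere; it is not treated as an alert here.
        return False
    over = sum(1 for s in samples_s if s > limit_s)
    return over / len(samples_s) > max_fraction_over
```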
Just looking at the other things I am recording right now, it may in fact make sense to put a monitor on the Done Rate or the Queries Per Second.
A Done Rate of 0 or a QPS of 0 for longer than the normal query timeout should shout at us, as it means no queries are being processed.
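The "zero for longer than the query timeout" rule could be expressed like this. It is a minimal sketch assuming the rate arrives as time-sorted `(unix_timestamp, rate)` pairs; the function name and the input shape are hypothetical, and the timeout value is left as a parameter.

```python
from typing import Optional, Sequence, Tuple


def stalled(points: Sequence[Tuple[float, float]], timeout_s: float) -> bool:
    """True if the rate has been 0 for a trailing window longer than timeout_s.

    points must be sorted by timestamp, oldest first.
    """
    if not points:
        return False
    latest_ts = points[-1][0]
    zero_since: Optional[float] = None
    # Walk backwards to find where the current run of zeros started.
    for ts, rate in reversed(points):
        if rate != 0:
            break
        zero_since = ts
    return zero_since is not None and latest_ts - zero_since > timeout_s
```

Only the trailing run of zeros counts, so an earlier quiet period that has since recovered does not trigger the alert.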
This includes update queries.
Thoughts?
Change 286992 had a related patch set uploaded (by Gehel):
Add response time checks to WDQS
The Icinga/graphite check "Response time for WDQS" is in status "UNKNOWN" because there are "No valid datapoints found".
This is a common problem with Icinga/Graphite checks we have seen in the past.
The UNKNOWN disappeared now that we are active/active. Previously, when no traffic was sent to codfw, we had no meaningful data about response time. This can be closed again.
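To illustrate why an idle datacenter produces UNKNOWN rather than OK, here is a sketch of how a check might classify a Graphite render response in JSON format, where each series carries `[value, timestamp]` datapoints and missing values are `null`. The function name and the "worst sample over the limit" rule are assumptions, not the logic of the actual check.

```python
import json

OK, CRITICAL, UNKNOWN = 0, 2, 3  # conventional Nagios/Icinga plugin exit codes


def status_from_render(body: str, limit: float) -> int:
    """Classify a Graphite JSON render response against an upper limit."""
    series = json.loads(body)
    values = [
        value
        for s in series
        for value, _ts in s.get("datapoints", [])
        if value is not None
    ]
    if not values:
        # With no traffic there is nothing to measure: "No valid datapoints found".
        return UNKNOWN
    return CRITICAL if max(values) > limit else OK
```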
@Smalyshev Did you mean to close and put this resolution on the globe task you closed today?