Create response time monitoring for WDQS endpoint
Closed, Resolved, Public

Description

We need a monitoring setup that checks not only that the SPARQL endpoint responds (this already exists in Icinga) but also that it responds in a reasonable time. That would allow us to detect DoS scenarios against WDQS that do not lead to a full denial of service with a 503 code or similar, but only slow the service down to the point where it is no longer usable.
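A minimal sketch of such a response time check, in Python. It is not the check that was eventually deployed; the function names and the warning/critical thresholds are assumptions made for illustration. The request itself is passed in as a callable so the timing and classification logic can be exercised without touching the network.

```python
import time
from typing import Callable

# Conventional Nagios/Icinga plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2


def timed(fetch: Callable[[], object]) -> float:
    """Run fetch() once and return the elapsed wall-clock time in seconds."""
    start = time.monotonic()
    fetch()
    return time.monotonic() - start


def classify(elapsed: float, warn: float, crit: float) -> int:
    """Map a response time to a check status using two thresholds (seconds)."""
    if elapsed >= crit:
        return CRITICAL
    if elapsed >= warn:
        return WARNING
    return OK
```

In a real check, `fetch` could be something like `lambda: urllib.request.urlopen(url, timeout=crit).read()`, where `url` is a cheap SPARQL query against the endpoint. A request that times out or raises would need to be mapped to CRITICAL by the caller.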

Event Timeline

Smalyshev raised the priority of this task from to Medium.
Smalyshev updated the task description.
Smalyshev subscribed.
Restricted Application added a subscriber: Aklapper.

It should be noted that we just had a partial outage lasting 6 to 7 hours without us noticing ;)

wdqs1002 seemed to totally die but nothing pinged anyone.
wdqs1001 seemed to stop updating (resulting in a warning in Icinga), but again it doesn't seem to have pinged anyone.

Monitoring should not only check the main endpoint (query.wikidata.org) but also each host itself!
Data is being shoved into Graphite to power https://grafana.wikimedia.org/dashboard/db/wikidata-query-service, which could be used, but none of that is productionized yet.

I think we need to at least put a monitor on whatever "varnish latency" counts, and alert if it's over, say, 30 s.

I am just looking at the other things I am recording right now, but it may in fact make sense to put a monitor on the Done Rate or the Queries Per Second.

A Done Rate of 0 or a QPS of 0 for longer than the normal query timeout should shout at us, as it means no queries are being processed.
This includes update queries.
Thoughts?
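The "rate of 0 for longer than the query timeout" idea could be sketched roughly as follows. This is only an illustration: the function name, the sample format, and the timeout value passed in are assumptions, not anything that exists in the WDQS monitoring today.

```python
from typing import Sequence, Tuple


def stalled(samples: Sequence[Tuple[float, float]], timeout: float) -> bool:
    """Return True if the rate has been zero for at least `timeout` seconds.

    samples: (unix_timestamp, rate) pairs sorted by ascending timestamp.
    Only the trailing run of zero-rate samples is considered.
    """
    if not samples:
        # No data at all is a different problem, not a confirmed stall.
        return False
    last_ts = samples[-1][0]
    zero_since = None
    # Walk backwards from the newest sample until a non-zero rate is seen.
    for ts, rate in reversed(samples):
        if rate != 0:
            break
        zero_since = ts
    return zero_since is not None and last_ts - zero_since >= timeout
```

The same function would work for either the Done Rate or the QPS series; only the trailing zeros matter, so a brief dip to 0 that has already recovered does not alert.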

Change 286992 had a related patch set uploaded (by Gehel):
Add response time checks to WDQS

https://gerrit.wikimedia.org/r/286992

Change 286992 merged by Gehel:
Add response time checks to WDQS

https://gerrit.wikimedia.org/r/286992

Dzahn subscribed.

The Icinga/graphite check "Response time for WDQS" is in status "UNKNOWN" because there are "No valid datapoints found".

https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=einsteinium&service=Response+time+of+WDQS

This is a common problem with Icinga/Graphite checks we have seen in the past.

This comment was removed by Smalyshev.

The UNKNOWN disappeared now that we are active/active. Previously, when no traffic was sent to codfw, we had no meaningful data about response time. This can be closed again.
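For context, a rough sketch of why an idle datacenter produces UNKNOWN rather than OK. Graphite returns null for intervals with no data, so a series from a site receiving no traffic contains nothing to compare against the threshold. The function below is a simplified illustration, not the actual Icinga/Graphite check logic; its name and the single critical threshold are assumptions.

```python
from typing import Optional, Sequence

# Conventional Nagios/Icinga plugin exit codes.
OK, CRITICAL, UNKNOWN = 0, 2, 3


def check_latency(datapoints: Sequence[Optional[float]], crit: float) -> int:
    """Evaluate a latency series (seconds); None marks a missing datapoint."""
    valid = [v for v in datapoints if v is not None]
    if not valid:
        # Nothing to evaluate: report UNKNOWN instead of guessing OK.
        return UNKNOWN
    return CRITICAL if max(valid) > crit else OK
```

With a series like `[None, None, None]`, as would be expected from a site with no traffic, this returns UNKNOWN regardless of the threshold.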

Implemented as geof:globe, geof:latitude & geof:longitude

@Smalyshev Did you mean to close and put this resolution on the globe task you closed today?