
Figure out monitoring/alerting setup for WDQS production
Closed, Resolved (Public)

Description

Set up a system that monitors the entry points for Blazegraph and the GUI, as well as the health of the Updater service, and alerts when any of them goes down.

What we want to check for:

  • Disk space usage on / and /var/lib/wdqs is below 90%
  • Port 80 responsive from outside
  • HTTP request (port 80) to / returns the query service page (containing the substring "Welcome" is enough)
  • HTTP request (port 80) to /bigdata/namespace/wdq/sparql?query=prefix%20schema:%20%3Chttp://schema.org/%3E%20SELECT%20*%20WHERE%20%7B%3Chttp://www.wikidata.org%3E%20schema:dateModified%20?y%7D&format=json returns a proper JSON response (check for the substring "datatype" : "xsd:dateTime")
  • One process with "java ... blazegraph-service-*-SNAPSHOT-dist.war" is running
  • One process with "java ... org.wikidata.query.rdf.tool.Update" is running
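
For illustration only, the checks listed above could be sketched as plain Python functions. This is not the Icinga configuration; the function names, the glob patterns for the "java ..." command lines, and the use of /proc (Linux only) are all assumptions made for this sketch. The paths, the SPARQL URL and the expected substrings are taken from the list.

```python
import fnmatch
import glob
import shutil
import urllib.request

# Path and expected substring copied from the checklist above.
SPARQL_PATH = (
    "/bigdata/namespace/wdq/sparql?query=prefix%20schema:%20%3Chttp://schema.org/%3E"
    "%20SELECT%20*%20WHERE%20%7B%3Chttp://www.wikidata.org%3E%20schema:dateModified%20?y%7D"
    "&format=json"
)
SPARQL_NEEDLE = '"datatype" : "xsd:dateTime"'

# Hypothetical glob patterns standing in for the two "java ..." command lines.
BLAZEGRAPH_PATTERN = "*java*blazegraph-service-*-SNAPSHOT-dist.war*"
UPDATER_PATTERN = "*java*org.wikidata.query.rdf.tool.Update*"


def disk_usage_ok(path, limit_percent=90.0):
    """Return True if used space on the filesystem holding `path` is below the limit."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total < limit_percent


def fetch(url, timeout=10):
    """Fetch a URL and return the body as text; raises if the port does not respond."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def body_contains(body, needle):
    """Substring check used for both the welcome page and the SPARQL response."""
    return needle in body


def read_cmdlines():
    """Linux only (assumption): read every /proc/<pid>/cmdline as a single string."""
    cmdlines = []
    for path in glob.glob("/proc/[0-9]*/cmdline"):
        try:
            with open(path, "rb") as handle:
                raw = handle.read()
        except OSError:
            continue  # the process exited while we were scanning
        # Arguments are NUL separated; join them with spaces.
        cmdlines.append(raw.replace(b"\0", b" ").decode("utf-8", errors="replace").strip())
    return cmdlines


def count_matching(cmdlines, pattern):
    """Count the command lines that match a shell-style pattern."""
    return sum(1 for line in cmdlines if fnmatch.fnmatchcase(line, pattern))


def run_checks(base_url):
    """Run every check once; `base_url` is e.g. "http://<host>" (port 80)."""
    cmdlines = read_cmdlines()
    return {
        "disk /": disk_usage_ok("/"),
        "disk /var/lib/wdqs": disk_usage_ok("/var/lib/wdqs"),
        "welcome page": body_contains(fetch(base_url + "/"), "Welcome"),
        "sparql": body_contains(fetch(base_url + SPARQL_PATH), SPARQL_NEEDLE),
        "blazegraph process": count_matching(cmdlines, BLAZEGRAPH_PATTERN) == 1,
        "updater process": count_matching(cmdlines, UPDATER_PATTERN) == 1,
    }
```

In this sketch the "port 80 responsive" check is implicit: fetch() raises an exception when the connection fails. A real monitoring setup would more likely use separate checks per item so that each one alerts independently.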

Details

Related Gerrit Patches:
operations/puppet : production: Add icinga monitoring for WDQS services

Event Timeline

Smalyshev raised the priority of this task to High.
Smalyshev updated the task description.
Smalyshev removed a project: Wikidata.
Smalyshev set Security to None.
Smalyshev moved this task from Needs triage to WDQS on the Discovery board. Jun 25 2015, 11:25 PM
JanZerebecki moved this task from incoming to monitoring on the Wikidata board. Jul 23 2015, 5:06 PM

Lots of unresolved questions here. @Smalyshev said he would look into it.

ksmith moved this task from WDQS to On Sprint Board on the Discovery board. Aug 27 2015, 8:25 PM
Smalyshev updated the task description. Sep 2 2015, 7:29 PM
Smalyshev added a subscriber: chasemp.
Smalyshev updated the task description. Sep 2 2015, 7:43 PM
Smalyshev updated the task description. Sep 2 2015, 7:55 PM
Smalyshev updated the task description.
Smalyshev claimed this task. Sep 2 2015, 9:23 PM
Smalyshev updated the task description.

Change 236189 had a related patch set uploaded (by Smalyshev):
Add icinga monitoring for WDQS services

https://gerrit.wikimedia.org/r/236189

Change 236189 merged by Dzahn:
Add icinga monitoring for WDQS services

https://gerrit.wikimedia.org/r/236189

Dzahn updated the task description. Sep 4 2015, 9:53 PM
Dzahn added a comment. Sep 4 2015, 9:58 PM
  • Disk space usage on / and /var/lib/wdqs is below 90%

https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=wdqs1001&service=Disk+space

  • Port 80 responsive from outside

https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=wdqs1001&service=WDQS+HTTP
https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=wdqs1001&service=WDQS+HTTP+Port

  • Request by http (port 80) to / returns query service page (having substring "Welcome" is enough)

https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=wdqs1001&service=WDQS+HTTP

  • Request by http (port 80) to /bigdata/namespace/wdq/sparql?query=prefix%20schema:%20%3Chttp://schema.org/%3E%20SELECT%20*%20WHERE%20%7B%3Chttp://www.wikidata.org%3E%20schema:dateModified%20?y%7D&format=json returns proper JSON response - check for substring "datatype" : "xsd:dateTime"

https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=wdqs1001&service=WDQS+SPARQL

  • One process with "java ... blazegraph-service-*-SNAPSHOT-dist.war" is running

https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=wdqs1001&service=Blazegraph+process
https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=wdqs1001&service=Blazegraph+Port

  • One process with "java ... org.wikidata.query.rdf.tool.Update" is running

https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=wdqs1001&service=Updater+process

Dzahn closed this task as Resolved. Sep 4 2015, 9:59 PM
Dzahn updated the task description.
Dzahn removed a project: Patch-For-Review.