
Develop the monitoring of Quarry
Open, Needs Triage, Public

Description

Quarry's availability is currently not monitored at all. We rely only on our own testing and on user feedback, which is clearly not ideal in the long term.

Something like what is already done for PAWS would be ideal. See https://github.com/wikimedia/puppet/blob/production/modules/icinga/manifests/monitor/toollabs.pp#L18.
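For illustration, a check along the lines of the PAWS one comes down to an HTTP probe that follows the Nagios plugin convention Icinga uses: print one status line and exit with 0 (OK), 1 (WARNING), 2 (CRITICAL) or 3 (UNKNOWN). The sketch below is hypothetical and is not the actual check defined in toollabs.pp; the function names `check_http` and `main` and the optional "expected text" marker are assumptions made for this example.

```python
import urllib.error
import urllib.request

# Exit codes from the Nagios plugin API, which Icinga also uses.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_http(url, timeout=10.0, expect=""):
    """Fetch `url` and map the result to an (exit code, status line) pair."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            status = resp.status
            body = resp.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as exc:
        # 4xx/5xx responses: the server answered, but with an error.
        return CRITICAL, f"CRITICAL - HTTP {exc.code} from {url}"
    except (urllib.error.URLError, OSError) as exc:
        # DNS failures, refused connections and timeouts end up here.
        return CRITICAL, f"CRITICAL - cannot reach {url}: {exc}"
    if expect and expect not in body:
        # The page loaded, but without the content we expected.
        return WARNING, f"WARNING - HTTP {status}, {expect!r} not in page"
    return OK, f"OK - HTTP {status} from {url}"


def main(argv):
    """argv[0] is the URL to check; argv[1] is optional text the page must contain."""
    if not argv:
        print("UNKNOWN - usage: check_quarry_http URL [EXPECTED_TEXT]")
        return UNKNOWN
    code, message = check_http(argv[0], expect=argv[1] if len(argv) > 1 else "")
    print(message)
    return code
```

A wrapper script would call `sys.exit(main(sys.argv[1:]))` with Quarry's public URL. Sticking to the plugin exit-code convention keeps the probe reusable by whichever Icinga instance ends up running it.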

Here are some immediate problems I see with an Icinga implementation on the production instance:

  • Site maintainers (@zhuyifei1999 and I) would need Icinga access, which requires LDAP NDA membership. Without that access, the monitoring would be of much less use to us.
  • Quarry runs in Cloud-VPS, which is itself neither part of Tools nor part of production. As far as I can see, no service monitored by Icinga has been in this position before. In practice, that means the modules/icinga/manifests/monitor/toollabs.pp configuration file doesn't fit our case, and there is no equivalent file for Cloud VPS services. Note that PAWS is a special case, as it has its own servers in the tools project. I don't know whether we can simply create a new one.
  • I don't know whether volunteers have ever been included in the Icinga notification process before.

Event Timeline

@Dzahn @bd808 your opinion would be interesting here. Thanks!

There's Shinken, but I don't know whether we want to add anything new there going forward.
I've been thinking of setting up Icinga for deployment-prep.

There is already an Icinga 2 instance in Cloud VPS that is used to monitor some things. Ask @Paladox.

Just a quick note: we should really have some metrics, including the query error rate. Users have raised concerns about it today, but we have no simple way to get numbers that would reveal a possible incident. https://www.mediawiki.org/wiki/Topic:Vhw07swro9jqy4w0
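As a rough idea of what a query error rate metric could look like, the sketch below computes the share of query runs finishing within a recent time window that ended in failure. It assumes each run record carries a final status and a finish time; the `QueryRun` class, the status names and the choice to count "killed" runs as errors are all hypothetical and are not taken from Quarry's actual schema.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

# Hypothetical status labels; Quarry's real run statuses may differ.
FAILED_STATUSES = {"failed", "killed"}
FINISHED_STATUSES = FAILED_STATUSES | {"complete"}


@dataclass
class QueryRun:
    status: str
    finished_at: Optional[datetime] = None  # None while the run is still going


def error_rate(runs, now, window=timedelta(hours=1)):
    """Fraction of runs finishing in (now - window, now] that failed.

    Returns None when nothing finished in the window, so that "no traffic"
    is not reported as a 0% error rate.
    """
    counts = Counter(
        run.status
        for run in runs
        if run.status in FINISHED_STATUSES
        and run.finished_at is not None
        and now - window < run.finished_at <= now
    )
    total = sum(counts.values())
    if total == 0:
        return None
    failed = sum(counts[status] for status in FAILED_STATUSES)
    return failed / total
```

Returning `None` for an empty window is a deliberate choice: a sudden absence of finished queries can itself indicate an incident, and folding it into a 0% error rate would hide that.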

Shouldn't WMCS be added to this?