
Add monitoring to ORES workers
Closed, Resolved (Public)

Description

Our current monitoring setup does not notify us when the ORES workers go down. This is a problem because worker outages are our most common source of downtime.

We have a dummy model for testwiki that would be perfect for this. E.g. requests to this URL should always return 200 OK:

http://ores.wmflabs.org/scores/testwiki/reverted/<revid>/

You can put any integer in place of <revid>. I think a Unix timestamp would work well, since a fresh value makes sure we don't pull from the cache. E.g. it's 1450302568 right now, so

http://ores.wmflabs.org/scores/testwiki/reverted/1450302568/

will return

{
  "1450302568": {
    "prediction": true,
    "probability": {
      "false": 0.14,
      "true": 0.86
    }
  }
}
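
A minimal sketch of such a check in Python, assuming only the URL pattern and the response shape shown above. The function names (`build_check_url`, `is_healthy`, `check_workers`) and the 10 second timeout are illustrative assumptions, not part of ORES or of any existing monitoring code.

```python
import json
import time
import urllib.request

# URL pattern taken from the task description.
BASE_URL = "http://ores.wmflabs.org/scores/testwiki/reverted/"


def build_check_url(revid):
    """Build the score URL for a revid; note the trailing slash."""
    return f"{BASE_URL}{int(revid)}/"


def is_healthy(status, body, revid):
    """True if the response is a 200 whose JSON holds a score for revid."""
    if status != 200:
        return False
    try:
        doc = json.loads(body)
    except ValueError:
        # Not JSON, e.g. an HTML error or redirect page.
        return False
    if not isinstance(doc, dict):
        return False
    score = doc.get(str(revid))
    return isinstance(score, dict) and "prediction" in score


def check_workers(timeout=10.0):
    """Request a score for a fresh revid so the answer cannot come from cache."""
    revid = int(time.time())  # Unix timestamp used as the revid
    try:
        with urllib.request.urlopen(build_check_url(revid), timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace")
            return is_healthy(resp.status, body, revid)
    except OSError:
        # Covers connection failures, timeouts and HTTP error statuses.
        return False
```

Calling `check_workers()` returns `True` only when the service answers 200 with a score for the timestamp that was just requested, so a cached response for an older revid would not pass.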

Event Timeline

Halfak raised the priority of this task from to Needs Triage.
Halfak updated the task description.
Halfak moved this task to Parked on the Machine-Learning-Team (Active Tasks) board.
Halfak subscribed.
Halfak added a subscriber: Dzahn.
Dzahn triaged this task as Medium priority. Dec 22 2015, 11:59 PM
Dzahn added projects: SRE, observability.

Change 260695 had a related patch set uploaded (by Dzahn):
ores: enhance ORES monitoring pt.2

https://gerrit.wikimedia.org/r/260695

Change 260695 merged by Dzahn:
ores: enhance ORES monitoring pt.2

https://gerrit.wikimedia.org/r/260695

I have confirmed that paging now makes it to my phone.

https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?host=ores.wmflabs.org&nostatusheader

This has been implemented as suggested. There is the check we had before for just the home page (/), and now there is the new check for the worker, using http://ores.wmflabs.org/scores/testwiki/reverted/<revid>/ with a timestamp as the revid.

@Halfak I noticed this:

curl ores.wmflabs.org/scores/testwiki/reverted/1234
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
<title>Redirecting...</title>
<h1>Redirecting...</h1>
<p>You should be redirected automatically to target URL: <a href="http://oresweb/scores/testwiki/reverted/1234/">http://oresweb/scores/testwiki/reverted/1234/</a>.  If not click the link.
root@neon:/usr/local/lib/nagios/plugins#

The "http://oresweb/scores/" part looks a bit odd. I noticed this because I kept getting a redirect and never a 200.

Like this, using check_http .. -u "http://oresweb/scores/testwiki/reverted/${timestamp}/", I'm getting a 200, so I'll change the check to that and it should be fine.
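
To make a redirect like the one above visible instead of silently following it, a probe can use a client that does not follow redirects and report the `Location` header. The sketch below is only an illustration in Python: `classify` and `probe` are hypothetical names, treating a redirect as WARNING is an assumption, and none of this is meant to describe how check_http itself behaves. The exit codes 0, 1 and 2 are the usual Nagios plugin states.

```python
import http.client
from urllib.parse import urlsplit

# Conventional Nagios plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2


def classify(status, location=None):
    """Map an HTTP status to a (state, message) pair."""
    if status == 200:
        return OK, "HTTP 200"
    if 300 <= status < 400:
        # Assumption: surface redirects rather than treating them as success.
        return WARNING, f"HTTP {status} redirect to {location}"
    return CRITICAL, f"HTTP {status}"


def probe(url, timeout=10.0):
    """Issue one GET without following redirects and classify the result."""
    parts = urlsplit(url)
    conn = http.client.HTTPConnection(parts.hostname, parts.port, timeout=timeout)
    try:
        conn.request("GET", parts.path or "/")
        resp = conn.getresponse()
        return classify(resp.status, resp.getheader("Location"))
    except OSError as exc:
        # Connection refused, DNS failure, timeout and similar.
        return CRITICAL, f"request failed: {exc}"
    finally:
        conn.close()
```

With this, a request that is answered with a redirect reports the redirect target in its message, which makes an unexpected host such as "oresweb" easy to spot in the check output.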