
Add monitoring to ORES workers
Closed, Resolved · Public

Description

Our current monitoring setup does not notify us when the ORES workers go down. This is a problem because it is our most common source of downtime.

We have a dummy model for testwiki that would be perfect for this. E.g. requests to this URL should always return 200 OK:

http://ores.wmflabs.org/scores/testwiki/reverted/<revid>/

You can put whatever integer you want for the <revid>. I think a Unix timestamp would work well, since it makes sure we don't pull from the cache. E.g. it's 1450302568 right now, so

http://ores.wmflabs.org/scores/testwiki/reverted/1450302568/

will return

{
  "1450302568": {
    "prediction": true,
    "probability": {
      "false": 0.14,
      "true": 0.86
    }
  }
}
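A minimal sketch of such a check, assuming Python and only the standard library. The function names (`build_check_url`, `evaluate_response`, `fetch`) and the Nagios-style exit codes are illustrative assumptions, not the check that ended up being deployed:

```python
import json
import time
import urllib.error
import urllib.request

# Base URL taken from the task description.
BASE_URL = "http://ores.wmflabs.org/scores/testwiki/reverted"

# Conventional Nagios plugin exit codes (assumed here for illustration).
OK, CRITICAL = 0, 2


def build_check_url(now=None, base_url=BASE_URL):
    """Return (revid, url), using a Unix timestamp as the revid to avoid the cache."""
    revid = int(time.time() if now is None else now)
    return revid, f"{base_url}/{revid}/"


def evaluate_response(status, body, revid):
    """Map an HTTP status and response body to an (exit_code, message) pair."""
    if status != 200:
        return CRITICAL, f"expected 200, got {status}"
    try:
        score = json.loads(body)[str(revid)]
    except (ValueError, KeyError, TypeError):
        return CRITICAL, "no score for the requested revid in the response"
    if not isinstance(score, dict) or "prediction" not in score:
        return CRITICAL, "score has no 'prediction' field"
    return OK, f"worker returned prediction={score['prediction']}"


def fetch(url, timeout=10):
    """Fetch url and return (status, body); HTTP error statuses are returned, not raised."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status, resp.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as err:
        return err.code, ""
```

A caller would combine these as `revid, url = build_check_url()` followed by `evaluate_response(*fetch(url), revid)`. Connection failures (`urllib.error.URLError`) are deliberately not handled in this sketch and would need to map to a critical state as well.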

Details

Related Gerrit Patches:
operations/puppet : production · ores: enhance ORES monitoring pt.2

Event Timeline

Halfak created this task. Dec 16 2015, 4:50 PM
Halfak raised the priority of this task from to Needs Triage.
Halfak updated the task description.
Halfak moved this task to Active on the Scoring-platform-team (Current) board.
Halfak added a subscriber: Halfak.
Restricted Application added subscribers: StudiesWorld, Aklapper. Dec 16 2015, 4:50 PM
Halfak set Security to None.
Halfak updated the task description. Dec 16 2015, 9:50 PM
Halfak added a subscriber: Dzahn.
Dzahn claimed this task. Dec 17 2015, 1:40 AM
Dzahn triaged this task as Medium priority. Dec 22 2015, 11:59 PM
Dzahn added projects: Operations, observability.

Change 260695 had a related patch set uploaded (by Dzahn):
ores: enhance ORES monitoring pt.2

https://gerrit.wikimedia.org/r/260695

Change 260695 merged by Dzahn:
ores: enhance ORES monitoring pt.2

https://gerrit.wikimedia.org/r/260695

I have confirmed that paging now makes it to my phone.

Dzahn closed this task as Resolved. Dec 23 2015, 1:41 AM

https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?host=ores.wmflabs.org&nostatusheader

This has been implemented as suggested. There is the check we had before for just the home page (/), and now there is also
the new check for the worker, using http://ores.wmflabs.org/scores/testwiki/reverted/<revid>/ with a timestamp as the revid.

Dzahn added a comment. Jan 5 2016, 1:01 AM

@Halfak I noticed this:

curl ores.wmflabs.org/scores/testwiki/reverted/1234
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
<title>Redirecting...</title>
<h1>Redirecting...</h1>
<p>You should be redirected automatically to target URL: <a href="http://oresweb/scores/testwiki/reverted/1234/">http://oresweb/scores/testwiki/reverted/1234/</a>.  If not click the link.
root@neon:/usr/local/lib/nagios/plugins#

The "http://oresweb/scores/" part looks a bit odd. I noticed this because I kept getting a redirect and never a 200.

Dzahn added a comment. Jan 5 2016, 1:48 AM

Like this, using check_http .. -u "http://oresweb/scores/testwiki/reverted/${timestamp}/", I'm getting a 200, so I'll change it to that and it should be fine.
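In the curl output above, the request path had no trailing slash and the redirect target did, which suggests the redirect comes from the missing slash. A small sketch, assuming Python, of normalizing the URL before the check and of telling a redirect apart from a real 200; the helper names `with_trailing_slash` and `classify_status` are hypothetical and not part of check_http:

```python
from urllib.parse import urlsplit, urlunsplit


def with_trailing_slash(url):
    """Return url with a trailing slash on its path, leaving any query string intact."""
    parts = urlsplit(url)
    path = parts.path if parts.path.endswith("/") else parts.path + "/"
    return urlunsplit(parts._replace(path=path))


def classify_status(status):
    """Classify an HTTP status as 'ok', 'redirect', or 'error' for monitoring purposes."""
    if status == 200:
        return "ok"
    if status in (301, 302, 303, 307, 308):
        return "redirect"
    return "error"
```

Treating a redirect as distinct from "ok" keeps the check from silently passing when the worker endpoint is never actually reached.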