Add monitoring to ORES workers
Closed, Resolved (Public)

Assigned To

Authored By

	Halfak
	Dec 16 2015, 4:50 PM

Description

Our current monitoring setup does not notify us when the ORES workers go down. This is a problem because worker failure is our most common source of downtime.

We have a dummy model for testwiki that would be perfect for this. E.g. requests to this URL should always return 200 OK:

http://ores.wmflabs.org/scores/testwiki/reverted/<revid>/

You can put whatever integer you want for the <revid>. I think a Unix timestamp would work well, because a fresh value each time makes sure we don't pull from the cache. E.g. it's 1450302568 right now, so

http://ores.wmflabs.org/scores/testwiki/reverted/1450302568/

will return

{
  "1450302568": {
    "prediction": true,
    "probability": {
      "false": 0.14,
      "true": 0.86
    }
  }
}
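
The check described above could be sketched roughly as follows. This is only an illustration, not the check that was deployed: the helper names `build_check_url` and `response_looks_healthy` are hypothetical, and the response shape is assumed to match the example JSON in this description.

```python
import json
import time

# Base URL taken from the task description.
BASE_URL = "http://ores.wmflabs.org/scores/testwiki/reverted"


def build_check_url(revid=None):
    """Return the testwiki scoring URL (hypothetical helper).

    With no revid given, the current Unix time is used, so each call in a
    new second asks for a revid that is unlikely to be cached already.
    """
    if revid is None:
        revid = int(time.time())
    return f"{BASE_URL}/{int(revid)}/"


def response_looks_healthy(status, body, revid):
    """Return True if the status is 200 and the body holds a score for revid.

    Assumes the body is JSON shaped like the example in the description:
    {"<revid>": {"prediction": ..., "probability": {...}}}
    """
    if status != 200:
        return False
    try:
        payload = json.loads(body)
    except ValueError:
        # Not JSON at all, e.g. an HTML error page.
        return False
    if not isinstance(payload, dict):
        return False
    score = payload.get(str(revid))
    return isinstance(score, dict) and "prediction" in score and "probability" in score
```

A monitoring script would fetch `build_check_url()` with whatever HTTP client it already uses and pass the status code and body to `response_looks_healthy`. Checking the body as well as the status is an extra step beyond the "always return 200 OK" requirement stated here.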

Details

	Subject	Repo	Branch	Lines +/-
	ores: enhance ORES monitoring pt.2	operations/puppet	production	+2 -2


Related Objects

Mentioned In: T122830: change ores monitoring to avoid icinga reload on puppet runs

Event Timeline

Halfak created this task. Dec 16 2015, 4:50 PM

Halfak raised the priority of this task from to Needs Triage.

Halfak updated the task description.

Halfak added a project: Machine-Learning-Team (Active Tasks).

Halfak moved this task to Parked on the Machine-Learning-Team (Active Tasks) board.

Halfak subscribed.

Restricted Application added subscribers: StudiesWorld, Aklapper. Dec 16 2015, 4:50 PM

Halfak added a project: ORES. Dec 16 2015, 4:50 PM

Halfak set Security to None.

Halfak updated the task description. Dec 16 2015, 9:50 PM

Halfak added a subscriber: Dzahn.

Dzahn claimed this task. Dec 17 2015, 1:40 AM

Dzahn triaged this task as Medium priority. Dec 22 2015, 11:59 PM

Dzahn added projects: SRE, observability.

Krinkle subscribed. Dec 23 2015, 12:03 AM

Change 260695 had a related patch set uploaded (by Dzahn):
ores: enhance ORES monitoring pt.2

https://gerrit.wikimedia.org/r/260695

gerritbot added a project: Patch-For-Review. Dec 23 2015, 12:06 AM

Change 260695 merged by Dzahn:
ores: enhance ORES monitoring pt.2

https://gerrit.wikimedia.org/r/260695

https://gerrit.wikimedia.org/r/#/c/260692/2

https://gerrit.wikimedia.org/r/#/c/260695/

https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?host=ores.wmflabs.org&nostatusheader

I have confirmed that paging now makes it to my phone.

https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?host=ores.wmflabs.org&nostatusheader

This has been implemented as suggested. There is the check we had before for just the home page (/), and now there is also the new check for the worker, using http://ores.wmflabs.org/scores/testwiki/reverted/<revid>/ with a timestamp as the revid.

Dzahn removed a project: Patch-For-Review. Dec 23 2015, 1:41 AM

Dzahn mentioned this in T122830: change ores monitoring to avoid icinga reload on puppet runs. Jan 5 2016, 12:04 AM

@Halfak I noticed this:

curl ores.wmflabs.org/scores/testwiki/reverted/1234
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
<title>Redirecting...</title>
<h1>Redirecting...</h1>
<p>You should be redirected automatically to target URL: <a href="http://oresweb/scores/testwiki/reverted/1234/">http://oresweb/scores/testwiki/reverted/1234/</a>.  If not click the link.
root@neon:/usr/local/lib/nagios/plugins#

The "http://oresweb/scores/" part of the redirect target looks a bit odd. I noticed this because I kept getting a redirect and never a 200.

Like this, using check_http .. -u "http://oresweb/scores/testwiki/reverted/${timestamp}/", I'm getting a 200, so I'll change the check to use that and it should be fine.
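
To make a redirect like the one above show up in the check result instead of just "not a 200", a check could classify the response explicitly. The sketch below is hypothetical and is not how check_http works internally: the function name `classify`, the `expected_host` default, and the choice to report every redirect as a warning are all assumptions made for illustration. The exit codes follow the usual Nagios plugin convention (0 OK, 1 WARNING, 2 CRITICAL).

```python
from urllib.parse import urlsplit

# Conventional Nagios plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2


def classify(status, location=None, expected_host="ores.wmflabs.org"):
    """Map an HTTP status (and optional Location header) to (exit_code, message).

    Hypothetical helper: a 200 is OK, any 3xx is reported as WARNING with the
    redirect target spelled out, and everything else is CRITICAL.
    """
    if status == 200:
        return OK, "HTTP 200"
    if 300 <= status < 400:
        host = urlsplit(location or "").hostname
        if host and host != expected_host:
            # E.g. a redirect pointing at an internal name such as "oresweb".
            return WARNING, f"redirect to unexpected host {host!r}"
        return WARNING, f"redirect to {location!r}"
    return CRITICAL, f"HTTP {status}"
```

With the redirect target from the curl output above, `classify(301, "http://oresweb/scores/testwiki/reverted/1234/")` reports the unexpected host by name. The 301 is only a placeholder, since the actual status code is not shown in this task.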

awight moved this task from Parked to Completed on the Machine-Learning-Team (Active Tasks) board. Jun 28 2017, 8:39 PM
