
ORES worker icinga message not specific enough
Open, Lowest, Public

Description

E.g. PROBLEM - ORES worker production on ores.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds

This should say which worker timed out.

Event Timeline

Halfak created this task. Nov 28 2017, 6:00 PM
Restricted Application added a subscriber: Aklapper. Nov 28 2017, 6:00 PM
Dzahn added a subscriber: Dzahn. Apr 3 2018, 4:58 PM

It can't know this. ores.wikimedia.org points to a load balancer IP (misc-web-lb.codfw.wikimedia.org.). From the outside, the check can't know which workers are behind it.

This could only be solved by adding a separate monitoring check for each ORES worker backend. It would have to connect to each backend and run a check locally (which one, exactly?).
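As a rough illustration of what a per-backend check could look like, here is a minimal sketch of an Icinga-style probe that names the host it tested in its status message. It only attempts a TCP connection, which is an assumption made for illustration; the thread leaves open which check would actually be meaningful for a worker. The function name `check_backend`, the injectable `connect` parameter, and any host and port passed to it are hypothetical.

```python
import socket

# Standard Nagios/Icinga plugin exit codes.
OK, CRITICAL = 0, 2


def check_backend(host, port, timeout=10.0, connect=socket.create_connection):
    """Try a TCP connect to one backend and name that backend in the message.

    `connect` is injectable so the function can be exercised without a network.
    """
    try:
        conn = connect((host, port), timeout=timeout)
    except socket.timeout:
        # The message names the backend, unlike the load-balancer-level check.
        return CRITICAL, f"CRITICAL - {host}:{port} socket timeout after {timeout:g} seconds"
    except OSError as exc:
        # Connection refused, unreachable host, DNS failure, and so on.
        return CRITICAL, f"CRITICAL - {host}:{port} {exc}"
    conn.close()
    return OK, f"OK - {host}:{port} accepted a connection"
```

Running one such check per backend, rather than one against the load balancer address, is what would let the alert text identify the failing host.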

Ladsgroup triaged this task as Lowest priority. Nov 27 2018, 12:47 AM
Ladsgroup added a subscriber: Ladsgroup.

Actually, we now send a header that says which ORES node served the request, but that would be the uwsgi node and not the worker node. OTOH, we can look up the error in Logstash and find out which node sent the error, so this isn't much needed anymore. Thus I am triaging this as lowest.

fgiunchedi moved this task from Inbox to Radar on the observability board. Dec 9 2019, 12:12 PM