Configure ORES load balancer to rebalance on 500 error
Closed, Resolved (Public)

Description

ores-lb doesn't detect 500 as a failure and will keep routing traffic to a node.

We need to:

  1. Set up a blank page that will always load if the node is up
  2. Configure nginx to re-route when that one page is down.
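A passive alternative to a dedicated health page is to tell nginx to treat a 500 response itself as a failed attempt. The fragment below is a minimal sketch of that idea, not the deployed ores-lb configuration; the upstream name, host names, and port are hypothetical placeholders.

```nginx
# Hypothetical upstream group; names and port are placeholders.
upstream ores_backend {
    server ores-web-01:8080;
    server ores-web-02:8080;
}

server {
    listen 80;

    location / {
        proxy_pass http://ores_backend;

        # By default only "error" and "timeout" count as failed attempts.
        # Adding http_500 makes a 500 response count as a failure too, so
        # the request is passed to the next server in the upstream group.
        proxy_next_upstream error timeout http_500;
    }
}
```

With this setting, 500 responses also count toward the per-server failure accounting, so a node that keeps returning 500 is temporarily taken out of rotation.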

Event Timeline

Halfak raised the priority of this task from to Needs Triage.
Halfak updated the task description.
Halfak moved this task to Parked on the Machine-Learning-Team (Active Tasks) board.
Halfak subscribed.
Halfak renamed this task from ORES load balancer doesn't detect 500 as error to ORES load balancer doesn't rebalance on 500 error. Sep 8 2015, 3:51 PM
Halfak set Security to None.
Halfak renamed this task from ORES load balancer doesn't rebalance on 500 error to Configure ORES load balancer to rebalance on 500 error. Mar 30 2016, 5:06 PM
Halfak added a project: ORES.
Halfak moved this task from Unsorted to Maintenance/cleanup on the Machine-Learning-Team board.

From what I've learned from the nginx load balancing manual, it does have a very basic system to account for failures and stop routing traffic to a failing node, but the timeout is set to 10 seconds, which might not be enough for us. Changing it is very easy.
@Halfak What timeout value would be good for you?

P.S. More sophisticated health checks are possible via NGINX Plus, which is proprietary software, and I'm not sure if we have it.
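For reference, the basic failure accounting described above is controlled by the `max_fails` and `fail_timeout` parameters of the `server` directive inside an `upstream` block. The sketch below writes out nginx's documented defaults explicitly; the upstream name, host names, and port are hypothetical.

```nginx
upstream ores_backend {
    # max_fails=1 and fail_timeout=10s are nginx's documented defaults,
    # written out here only to make them visible.
    # fail_timeout is both the window in which max_fails failed attempts
    # must occur and the time the server is then considered unavailable.
    server ores-web-01:8080 max_fails=1 fail_timeout=10s;
    server ores-web-02:8080 max_fails=1 fail_timeout=10s;
}
```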

We do not have any proprietary software, including nginx plus.

\o/ Like :)

Our scoring timeout is 15 seconds (see https://github.com/wiki-ai/ores-wikimedia-config/blob/master/config/00-main.yaml#L63). This timeout applies to individual jobs as they are sent to the celery queue. It's possible that someone sends a request to score a large number of rev_ids and doesn't get a response for 30 seconds because it took a few seconds just to gather the data and send it to the celery workers.

So, I'm thinking that a 60 second timeout on requests should be safe enough to catch a real timeout issue. In the past, most of the time, a server would respond with a 500 "Internal Server Error" immediately when something went wrong, so I don't think that setting a stricter timeout will be necessary.
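In nginx terms, a 60 second value could be applied in two places: the per-server `fail_timeout` and the per-request `proxy_read_timeout`. The fragment below is a sketch under that assumption and is not taken from the actual ORES configuration; the upstream name, host names, and port are hypothetical.

```nginx
upstream ores_backend {
    # A failed server is considered unavailable for 60s instead of the
    # default 10s before nginx tries it again.
    server ores-web-01:8080 fail_timeout=60s;
    server ores-web-02:8080 fail_timeout=60s;
}

server {
    listen 80;

    location / {
        proxy_pass http://ores_backend;

        # Maximum time between two successive reads from the backend.
        # 60s is already nginx's default; it is shown here for clarity.
        proxy_read_timeout 60s;
    }
}
```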

Change 287640 had a related patch set uploaded (by Ladsgroup):
ores: set nginx timeout fail time 60s

https://gerrit.wikimedia.org/r/287640

Change 287640 merged by Alexandros Kosiaris:
ores: set nginx timeout fail time 60s

https://gerrit.wikimedia.org/r/287640