Configure ORES load balancer to rebalance on 500 error
Closed, ResolvedPublic

Assigned To

Authored By

	Halfak
	Sep 8 2015, 3:51 PM

Description

ores-lb doesn't treat an HTTP 500 response as a failure, so it keeps routing traffic to a node that is returning errors.

We need to:

Set up a blank page that will always load if the node is up.
Configure nginx to route traffic away from a node when that page fails to load.
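
The first step above could be sketched roughly as follows. This is only an illustration, not the ORES implementation: the `/health` path, the `HealthHandler` class, and the `serve_in_background` and `node_is_up` helpers are all hypothetical names chosen for the example. It uses only the Python standard library.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class HealthHandler(BaseHTTPRequestHandler):
    """Serves an empty 200 on /health (an assumed path); 404 otherwise."""

    def do_GET(self):
        if self.path == "/health":
            self.send_response(200)
            self.send_header("Content-Length", "0")
            self.end_headers()
        else:
            self.send_error(404)

    def log_message(self, format, *args):
        pass  # keep the sketch quiet


def serve_in_background(host="127.0.0.1", port=0):
    """Start the blank-page server in a daemon thread and return it."""
    server = HTTPServer((host, port), HealthHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def node_is_up(host, port, path="/health", timeout=2.0):
    """Return True only if the blank page answers with HTTP 200."""
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request("GET", path)
        return conn.getresponse().status == 200
    except (OSError, http.client.HTTPException):
        # Connection refused, timeout, or a malformed reply all count as down.
        return False
    finally:
        conn.close()
```

A checker would call `node_is_up(host, port)` for each node and stop routing to any node for which it returns `False`.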

Details

	Subject	Repo	Branch	Lines +/-
	ores: set nginx timeout fail time 60s	operations/puppet	production	+1 -1


Related Objects

Mentioned In: rOPUP94f5d965074c: ores: set nginx timeout fail time 60s

Event Timeline

Halfak created this task. Sep 8 2015, 3:51 PM

Halfak raised the priority of this task from to Needs Triage.

Halfak updated the task description.

Halfak added a project: Machine-Learning-Team (Active Tasks).

Halfak moved this task to Parked on the Machine-Learning-Team (Active Tasks) board.

Halfak subscribed.

Restricted Application added a subscriber: Aklapper. Sep 8 2015, 3:51 PM

Halfak renamed this task from "ORES load balancer doesn't detect 500 as error" to "ORES load balancer doesn't rebalance on 500 error". Sep 8 2015, 3:51 PM

Halfak set Security to None.

Halfak edited projects, added Machine-Learning-Team; removed Machine-Learning-Team (Active Tasks). Mar 30 2016, 2:55 PM

Halfak renamed this task from "ORES load balancer doesn't rebalance on 500 error" to "Configure ORES load balancer to rebalance on 500 error". Mar 30 2016, 5:06 PM

Halfak added a project: ORES.

Halfak moved this task from Unsorted to Maintenance/cleanup on the Machine-Learning-Team board.

From what I've learned from the nginx load balancing manual, nginx does have a very basic system to detect failures and stop sending traffic to a failing node, but the timeout is set to 10 seconds, which might not be enough for us. Changing it is very easy.
@Halfak What timeout value would be good for you?

P.S. More sophisticated health checks are possible via nginx plus, which is proprietary software, and I'm not sure if we have it.
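
To make the "very basic system" concrete, here is a simplified model of passive failure accounting, written as a sketch rather than a description of nginx internals. It assumes the behaviour I understand from the nginx documentation: a backend is marked unavailable for `fail_timeout` seconds once `max_fails` failed attempts occur within `fail_timeout` (defaults 1 and 10 seconds), and an HTTP 500 counts as a failed attempt only when it is explicitly listed (in nginx, via `proxy_next_upstream`). The `Backend` class and its method names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Backend:
    """Toy model of one upstream server with passive failure tracking."""
    name: str
    max_fails: int = 1          # failures needed to mark the backend down
    fail_timeout: float = 10.0  # counting window and down period, in seconds
    fail_times: list = field(default_factory=list)
    down_until: float = 0.0

    def available(self, now):
        """True if the backend may receive traffic at time `now`."""
        return now >= self.down_until

    def record(self, now, status, failure_statuses=frozenset()):
        """Record one attempt; status None means connection error or timeout."""
        failed = status is None or status in failure_statuses
        if not failed:
            return
        # Keep only failures that are still inside the counting window.
        self.fail_times = [t for t in self.fail_times
                           if now - t < self.fail_timeout]
        self.fail_times.append(now)
        if len(self.fail_times) >= self.max_fails:
            self.down_until = now + self.fail_timeout
            self.fail_times.clear()
```

In this model, `Backend("node-a").record(0.0, 500)` leaves the node available, which matches the problem described in the task, whereas passing `failure_statuses={500}` takes it out of rotation for `fail_timeout` seconds.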

We do not have any proprietary software, including nginx plus.

Ladsgroup edited projects, added Machine-Learning-Team (Active Tasks); removed Machine-Learning-Team. May 9 2016, 8:01 AM

In T111806#2274575, @yuvipanda wrote:

We do not have any proprietary software, including nginx plus.

\o/ Like :)

Halfak edited projects, added Machine-Learning-Team; removed Machine-Learning-Team (Active Tasks). May 9 2016, 3:00 PM

Our scoring timeout is 15 seconds (see https://github.com/wiki-ai/ores-wikimedia-config/blob/master/config/00-main.yaml#L63). This timeout applies to individual jobs as they are sent to the celery queue. It's possible that someone sends a request to score a large number of rev_ids and doesn't get a response for 30 seconds, because it took a few seconds just to gather the data and send it to the celery workers.

So, I'm thinking that a 60 second timeout on requests should be safe enough to catch a real timeout issue. In the past, a server would usually respond immediately with a 500 "Internal Server Error" when something went wrong, so I don't think that setting a stricter timeout will be necessary.

Change 287640 had a related patch set uploaded (by Ladsgroup):
ores: set nginx timeout fail time 60s

https://gerrit.wikimedia.org/r/287640

gerritbot added a project: Patch-For-Review. May 9 2016, 3:30 PM

Change 287640 merged by Alexandros Kosiaris:
ores: set nginx timeout fail time 60s

https://gerrit.wikimedia.org/r/287640

Ladsgroup edited projects, added Machine-Learning-Team (Active Tasks); removed Machine-Learning-Team. May 9 2016, 3:38 PM

Ladsgroup moved this task from Parked to Completed on the Machine-Learning-Team (Active Tasks) board.

Ladsgroup claimed this task. May 9 2016, 3:40 PM

Halfak closed this task as Resolved. May 10 2016, 8:44 PM

Ladsgroup mentioned this in rOPUP94f5d965074c: ores: set nginx timeout fail time 60s. Jun 17 2016, 6:10 PM

• Phabricator_maintenance added a project: User-Ladsgroup. Aug 12 2016, 8:09 PM

• Phabricator_maintenance removed a subscriber: yuvipanda. Jun 7 2017, 6:47 PM
