ores-lb doesn't treat a 500 response as a failure and will keep routing traffic to the failing node.
We need to:
- Set up a blank page that will always load if the node is up
- Configure nginx to re-route when that one page is down.
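One way to approximate this with open-source nginx is passive failure detection: a 500 from a backend is counted as a failed attempt and the request is retried on another node. The fragment below is only a sketch under stated assumptions. The upstream name, hostnames, and port are hypothetical and are not taken from the actual puppet config. Actively probing a dedicated health page needs the `health_check` directive, which is not part of stock open-source nginx.

```nginx
# Hypothetical backend pool; names and port are placeholders.
upstream ores_backends {
    server ores-node1.example.org:8081;
    server ores-node2.example.org:8081;
}

server {
    listen 80;

    location / {
        proxy_pass http://ores_backends;

        # Connection errors and timeouts always count as failed attempts.
        # Adding http_500 makes a 500 response count as well, so the
        # request is passed to the next server in the pool.
        proxy_next_upstream error timeout http_500;
    }
}
```

In recent nginx versions, requests with non-idempotent methods such as POST are not retried on another server unless `non_idempotent` is also listed, so this mostly helps GET-style scoring requests.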
Subject | Repo | Branch | Lines +/-
---|---|---|---
ores: set nginx timeout fail time 60s | operations/puppet | production | +1 -1
According to the nginx load balancing manual, nginx does have a very basic mechanism to detect failures and stop sending traffic to a failed server, but its timeout is set to 10 seconds, which might not be enough for us. Changing it is very easy.
@Halfak What timeout value would be good for you?
P.S. More sophisticated health checks are possible via NGINX Plus, which is proprietary software, and I'm not sure if we have it.
Our scoring timeout is 15 seconds (see https://github.com/wiki-ai/ores-wikimedia-config/blob/master/config/00-main.yaml#L63 ). This timeout applies to individual jobs as they are sent to the celery queue. It's possible that someone sends a request to score a large number of rev_ids and doesn't get a response for 30 seconds, because it took a few seconds just to gather the data and send it to the celery workers.
So, I'm thinking that a 60 second timeout on requests should be safe enough to catch a real timeout issue. In the past, a server would usually respond with a 500 "Internal Server Error" immediately when something went wrong, so I don't think that setting a stricter timeout will be necessary.
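For reference, nginx has two separate settings that a "60 second timeout" could map to, and the sketch below shows both. It is an illustration only: the upstream name, hostnames, and port are hypothetical, and which of the two settings the puppet patch actually changes is an assumption here.

```nginx
# Hypothetical backend pool; names and port are placeholders.
upstream ores_backends {
    # fail_timeout (default 10s) is both the window in which max_fails
    # failed attempts must occur and how long the server is then
    # considered unavailable. max_fails defaults to 1.
    server ores-node1.example.org:8081 max_fails=1 fail_timeout=60s;
    server ores-node2.example.org:8081 max_fails=1 fail_timeout=60s;
}

server {
    listen 80;

    location / {
        proxy_pass http://ores_backends;

        # How long nginx waits between two successive reads from the
        # backend before treating the request as timed out.
        # 60s is also the nginx default for this directive.
        proxy_read_timeout 60s;
    }
}
```

With a 15 second per-job scoring timeout and large batches sometimes taking around 30 seconds, a 60 second read timeout leaves room for slow but healthy responses.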
Change 287640 had a related patch set uploaded (by Ladsgroup):
ores: set nginx timeout fail time 60s