Handle hanging backends gracefully; cancel internal processing on timeout
Closed, Invalid · Public

Description

In testing, a quorum of Cassandra nodes going down under heavy load causes RESTBase workers to accumulate memory until they reach the configured heap limit and are restarted by the coordinator. Before the limit is reached, they tend to get fairly slow as the GC struggles to make do with the available memory.

We probably need to be more aggressive about timing out and freeing backend connections internally when the Cassandra table storage layer is down. We currently limit the number of concurrent live connections per worker. This does not help during long backend outages, though: incoming HTTP connections time out, which lets new requests come in while the work for the timed-out ones is still held internally.
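To make the "cancel internal processing on timeout" idea concrete, here is a minimal, language-agnostic sketch written in Python with asyncio. It is not RESTBase code: `call_backend_with_timeout`, `hanging_backend`, and `demo_timeout` are hypothetical names, and the sleeping coroutine only stands in for a storage call that never answers. The point it illustrates is that the timeout cancels the pending backend call itself, so whatever it holds can be released, rather than only failing the client-facing request.

```python
import asyncio


async def call_backend_with_timeout(make_request, timeout_s):
    """Run a backend call and cancel it if it exceeds timeout_s seconds."""
    try:
        # asyncio.wait_for cancels the inner awaitable when the timeout
        # expires, so the pending call does not linger in memory.
        return await asyncio.wait_for(make_request(), timeout=timeout_s)
    except asyncio.TimeoutError:
        # Assumption: callers treat None as "backend unavailable".
        return None


async def hanging_backend(state):
    """Hypothetical stand-in for a storage call that never answers."""
    try:
        await asyncio.sleep(3600)
    except asyncio.CancelledError:
        # Cleanup point: release buffers, close sockets, and so on.
        state["cancelled"] = True
        raise


async def demo_timeout():
    state = {"cancelled": False}
    result = await call_backend_with_timeout(
        lambda: hanging_backend(state), timeout_s=0.05
    )
    return result, state["cancelled"]


if __name__ == "__main__":
    print(asyncio.run(demo_timeout()))
```

In this sketch the cleanup branch runs as soon as the timeout fires, which is the behaviour the task asks for: memory tied to a hung backend call is freed at the timeout instead of accumulating for the length of the outage.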

Event Timeline

GWicke raised the priority of this task from to Needs Triage.
GWicke triaged this task as Normal priority.
GWicke updated the task description.
GWicke added a project: RESTBase.
GWicke added a project: RESTBase-Cassandra.
GWicke set Security to None.
GWicke added a subscriber: GWicke.

Things have improved quite a bit with v2 of the Cassandra driver as well as our improved configuration. Reconnects now happen much more quickly, which keeps connections from building up over an extended period of time.

There is more to do though:

  • establish and enforce a hard limit on outstanding parallel requests
  • (possibly) tweak retry policies for the driver
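A hard limit on outstanding parallel requests could look roughly like the sketch below. It is an illustration in Python with asyncio, not the RESTBase implementation: `RequestLimiter`, `TooManyRequests`, `slow_backend`, and `demo_limit` are hypothetical names. The limiter rejects new work immediately once the cap is reached instead of queueing it, so memory use stays bounded even when the backend is slow or down.

```python
import asyncio


class TooManyRequests(Exception):
    """Raised when the cap on outstanding requests is reached."""


class RequestLimiter:
    """Hypothetical limiter that sheds load instead of queueing it."""

    def __init__(self, max_outstanding):
        self.max_outstanding = max_outstanding
        self.outstanding = 0

    async def run(self, make_request):
        if self.outstanding >= self.max_outstanding:
            # Fail fast: nothing is buffered for the rejected request.
            raise TooManyRequests("outstanding request limit reached")
        self.outstanding += 1
        try:
            return await make_request()
        finally:
            # Always release the slot, including on errors and cancellation.
            self.outstanding -= 1


async def slow_backend():
    """Hypothetical stand-in for a slow storage call."""
    await asyncio.sleep(0.05)
    return "ok"


async def demo_limit():
    limiter = RequestLimiter(max_outstanding=2)
    results = await asyncio.gather(
        *(limiter.run(slow_backend) for _ in range(5)),
        return_exceptions=True,
    )
    accepted = sum(1 for r in results if r == "ok")
    rejected = sum(1 for r in results if isinstance(r, TooManyRequests))
    return accepted, rejected, limiter.outstanding


if __name__ == "__main__":
    print(asyncio.run(demo_limit()))
```

Rejecting rather than queueing is a deliberate choice in this sketch: a queue would only move the memory build-up from backend connections to pending requests.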
Pchelolo closed this task as Invalid. (Aug 5 2019, 4:19 PM)
Pchelolo added a subscriber: Pchelolo.

We have not observed this problem in production.

Restricted Application removed a subscriber: Liuxinyu970226. (Aug 5 2019, 4:19 PM)