
stream.wikimedia.org: Uneven distribution of client connections on backends
Closed, Declined
Public

Description

It appears that client connections are not load-balanced properly, resulting in uneven load on the backends.

rcs1002:
instance rcs1002:10080: {"connected_clients": 1, "queue_size": 0}
instance rcs1002:10081: {"connected_clients": 0, "queue_size": 0}
instance rcs1002:10082: {"connected_clients": 1, "queue_size": 0}
instance rcs1002:10083: {"connected_clients": 0, "queue_size": 0}
instance rcs1002:10084: {"connected_clients": 0, "queue_size": 0}
instance rcs1002:10085: {"connected_clients": 1, "queue_size": 0}
instance rcs1002:10086: {"connected_clients": 0, "queue_size": 0}
instance rcs1002:10087: {"connected_clients": 13, "queue_size": 0}
instance rcs1002:10088: {"connected_clients": 92, "queue_size": 0}
instance rcs1002:10089: {"connected_clients": 0, "queue_size": 0}
instance rcs1002:10090: {"connected_clients": 0, "queue_size": 0}
instance rcs1002:10091: {"connected_clients": 1, "queue_size": 0}
instance rcs1002:10092: {"connected_clients": 93, "queue_size": 0}
instance rcs1002:10093: {"connected_clients": 0, "queue_size": 0}
instance rcs1002:10094: {"connected_clients": 0, "queue_size": 0}
instance rcs1002:10095: {"connected_clients": 1, "queue_size": 0}
instance rcs1002:10096: {"connected_clients": 1, "queue_size": 0}
instance rcs1002:10097: {"connected_clients": 0, "queue_size": 0}
instance rcs1002:10098: {"connected_clients": 1, "queue_size": 0}

rcs1001:
instance rcs1001:10080: {"connected_clients": 2, "queue_size": 0}
instance rcs1001:10081: {"connected_clients": 0, "queue_size": 0}
instance rcs1001:10082: {"connected_clients": 1, "queue_size": 0}
instance rcs1001:10083: {"connected_clients": 0, "queue_size": 0}
instance rcs1001:10084: {"connected_clients": 2, "queue_size": 0}
instance rcs1001:10085: {"connected_clients": 2, "queue_size": 0}
instance rcs1001:10086: {"connected_clients": 0, "queue_size": 0}
instance rcs1001:10087: {"connected_clients": 53, "queue_size": 0}
instance rcs1001:10088: {"connected_clients": 78, "queue_size": 0}
instance rcs1001:10089: {"connected_clients": 0, "queue_size": 0}
instance rcs1001:10090: {"connected_clients": 3, "queue_size": 0}
instance rcs1001:10091: {"connected_clients": 0, "queue_size": 0}
instance rcs1001:10092: {"connected_clients": 20, "queue_size": 0}
instance rcs1001:10093: {"connected_clients": 0, "queue_size": 0}
instance rcs1001:10094: {"connected_clients": 0, "queue_size": 0}
instance rcs1001:10095: {"connected_clients": 2, "queue_size": 0}
instance rcs1001:10096: {"connected_clients": 3, "queue_size": 0}
instance rcs1001:10097: {"connected_clients": 0, "queue_size": 0}
instance rcs1001:10098: {"connected_clients": 3, "queue_size": 0}
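
One way to quantify the skew in the listings above is to parse the status lines and compute what fraction of a host's clients sit on its busiest instances. The following is a minimal sketch, not part of the original report: the names `parse_status` and `top_share` are hypothetical helpers, and the line format is assumed to match the output pasted above exactly.

```python
import json
import re
from collections import defaultdict

# A few lines copied from the rcs1002 listing above; the full output has the same shape.
SAMPLE = """\
instance rcs1002:10087: {"connected_clients": 13, "queue_size": 0}
instance rcs1002:10088: {"connected_clients": 92, "queue_size": 0}
instance rcs1002:10092: {"connected_clients": 93, "queue_size": 0}
instance rcs1002:10098: {"connected_clients": 1, "queue_size": 0}
"""

# Assumed format: "instance <host>:<port>: <JSON object>"
LINE_RE = re.compile(r"^instance (?P<host>[^:]+):(?P<port>\d+): (?P<stats>\{.*\})$")


def parse_status(text):
    """Return {host: {port: connected_clients}} for every 'instance' line."""
    per_host = defaultdict(dict)
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # skip blank lines and host headers such as "rcs1002:"
        stats = json.loads(match["stats"])
        per_host[match["host"]][int(match["port"])] = stats["connected_clients"]
    return dict(per_host)


def top_share(clients_by_port, n=2):
    """Fraction of a host's clients held by its n busiest instances."""
    total = sum(clients_by_port.values())
    if total == 0:
        return 0.0
    busiest = sorted(clients_by_port.values(), reverse=True)[:n]
    return sum(busiest) / total


if __name__ == "__main__":
    for host, ports in parse_status(SAMPLE).items():
        print(f"{host}: top-2 share = {top_share(ports):.1%}")
```

A share close to 1.0 for a small `n` means a few instances carry nearly all connections; an even spread over 19 instances would give roughly 2/19 for `n=2`.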


Version: wmf-deployment
Severity: normal

Details

Reference
bz67957

Event Timeline

bzimport raised the priority of this task from to Needs Triage. Nov 22 2014, 3:27 AM
bzimport added a project: EventStreams.
bzimport set Reference to bz67957.
bzimport added a subscriber: Unknown Object (MLST).

Change 145997 had a related patch set uploaded by Ori.livneh:
rcstream: make lvs health check fetch /nginx_status

https://gerrit.wikimedia.org/r/145997

Andrew triaged this task as Medium priority. Feb 8 2015, 9:28 PM
Andrew set Security to None.

Ori, asking for feedback on the comments on your changeset so this can get unstalled.