
Packet loss from Voxel to text load balancers
Closed, Declined · Public

Description

Reported on wikimedia-operations: 10% packet loss when trying to reach text-lb.esams.wikimedia.org from the Voxel network. The issue started on 2016-12-22 at 23:01:31 UTC.

The following traceroute shows packets lost only on the Wikimedia side: http://pastebin.ubuntu.com/23671267/

This can be reproduced when reaching text-lb.eqiad.wikimedia.org too.

According to Icinga, all is fine for the text load balancers.

Event Timeline

Restricted Application added a subscriber: Aklapper. · Dec 23 2016, 2:40 AM

This traceroute does not make much sense. It reports a ping time of ~100ms for the last hop, but more like 10ms for the hop right before that. Our infrastructure definitely does not add 90ms of latency between our routers and the load balancers. The packet loss reported is also a single packet (1 out of 10). A quick check on the state of the routers around the time reported did not show anything suspicious, and Icinga does not report a single event. All of this makes me think that we have no problem on our end, and that if there is a problem, it is on the side of the reporter or the reporter's ISP. FWIW, it would be far more useful to have the IP address (or the subnet, if there are privacy concerns) shared somehow (private paste? private IRC message?) so we can check the reverse route and see if we can change our path to their ISP.
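To put the "1 out of 10" figure in perspective, here is a minimal sketch that estimates how uncertain a loss rate is when it is measured from so few probes. It uses a Wilson score interval, which is a generic statistical tool and not something used in this task; the function name `wilson_interval` and the 95% confidence level are assumptions made for illustration.

```python
import math


def wilson_interval(lost, sent, z=1.96):
    """Return (low, high) bounds for the true loss rate.

    Hypothetical helper: Wilson score interval for a binomial proportion,
    with z=1.96 corresponding to roughly 95% confidence.
    """
    if sent <= 0:
        raise ValueError("sent must be positive")
    p = lost / sent                      # observed loss rate
    denom = 1 + z * z / sent
    center = (p + z * z / (2 * sent)) / denom
    half = z * math.sqrt(p * (1 - p) / sent + z * z / (4 * sent * sent)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# One lost probe out of ten, as in the trace discussed above.
low, high = wilson_interval(lost=1, sent=10)
print(f"observed 10% loss, plausible range {low:.1%} to {high:.1%}")
```

With only ten probes the plausible range is very wide, so a longer run (for example 100 cycles or more) would be needed before reading much into the reported percentage.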

The traceroute was generated with mtr sending TCP packets to port 80, i.e. something like mtr -4 --tcp -P 80 text-lb.esams.wikimedia.org.

One explanation I can see for this packet loss is some throttling in place, and yes, this could be on their side too.

The ~100ms is probably the HTTP server answer time.

Why did they have to go through the trouble of using TCP? Maybe some kind of restriction on their home network? Even in that case, the jump from <10ms to ~100ms in the very last hop is really peculiar. Maybe a local forwarding proxy? That would be consistent with all of the above and could be the source of the problem.

faidon changed the task status from Open to Stalled. · Jan 9 2017, 1:17 AM
faidon added a subscriber: faidon.

This is impossible to debug further without more information. Can we get a complete traceroute (ICMP or UDP, although TCP in addition to those won't hurt) as well as the client's IP (the /24 is fine, privately is also fine)?
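For anyone following up, a minimal sketch of how the requested ICMP, UDP and TCP traces could be assembled as mtr command lines. It assumes mtr is installed on the client; the helper name `build_mtr_command` and the cycle count of 100 are illustrative and not taken from this task. The script only prints the commands, it does not run them.

```python
def build_mtr_command(host, protocol="icmp", port=80, cycles=100):
    """Return an mtr argument list for a report-mode trace (hypothetical helper)."""
    # -4 forces IPv4; report mode prints a summary after the given number of cycles.
    cmd = ["mtr", "-4", "--report", "--report-wide", "--report-cycles", str(cycles)]
    if protocol == "tcp":
        cmd += ["--tcp", "--port", str(port)]   # TCP SYN probes to the given port
    elif protocol == "udp":
        cmd += ["--udp"]                        # UDP probes
    elif protocol != "icmp":                    # ICMP echo is mtr's default mode
        raise ValueError(f"unsupported protocol: {protocol}")
    return cmd + [host]


# Print one command per protocol so each can be run and pasted (privately if needed).
for proto in ("icmp", "udp", "tcp"):
    print(" ".join(build_mtr_command("text-lb.esams.wikimedia.org", proto)))
```

Running all three from the affected client and attaching the reports would show whether the loss and the latency jump appear with every probe type or only with TCP to port 80.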

faidon moved this task from Backlog to In Progress on the netops board. · Jan 9 2017, 1:17 AM
faidon closed this task as Declined. · Jan 18 2017, 7:41 AM

Since this was a user on IRC I doubt we'll hear much soon. Declining for now, feel free to reopen if the issue persists and we hear back from this or another Voxel subscriber.