
TATA SKY Broadband (AS134674) issues with connecting to upload.wikimedia.org
Closed, Resolved, Public

Assigned To
Authored By
CDanis
Feb 19 2021, 6:58 PM
Referenced Files
F34707922: image.png
Oct 23 2021, 2:44 PM
F34707926: image.png
Oct 23 2021, 2:44 PM
F34449443: image.png
May 10 2021, 8:08 PM
F34449428: image.png
May 10 2021, 8:08 PM
F34449446: image.png
May 10 2021, 8:08 PM
F34449423: image.png
May 10 2021, 8:08 PM
F34449440: image.png
May 10 2021, 8:08 PM

Description

Since approx Jan 9, we've had occasional reports of TATA SKY Broadband customers being unable to successfully establish TLS sessions with upload.wikimedia.org.

Different customer locations have reported trouble at different times, with no clear pattern.

Our 'text' loadbalancer that serves wiki content (e.g. en.wikipedia.org) has not been affected, only upload (serving image/multimedia content).

As best we can tell from the debugging data available, there appears to be a misbehaving middlebox somewhere between their customers and our network that is sending TCP RSTs on TLSv1.2 connections.

This is a tracking task because, to the best of WMF's knowledge, there are no steps we can take ourselves to resolve the situation.

Event Timeline

We decided to file https://github.com/citizenlab/test-lists/pull/730 so that we can get some test data.

(Thanks Chris for suggesting the use of this specific image to help incorporate some other debugging data as well.)

@ssingh did we get results from those test data?

@ayounsi, @Aklapper: I forgot to update this ticket, apologies! Last I checked, all reports were in the green and there wasn't any debugging information. But since we have had no user updates and it's been a while since I looked, I will check again and update this ticket accordingly. Thanks!

There is evidence in our NEL data that suggests this problem still exists.

This particular issue ought to manifest as reports of type tcp.reset. tcp.reset events can occur for a variety of reasons, so there's plenty of background noise here.

Looking at all tcp.reset reports received from India, AS134674 doesn't hugely stand out as a source of these reports -- it's in line with large ISPs:

image.png (431×639 px, 67 KB)

https://logstash.wikimedia.org/goto/cf4f14573bd6e35ab8d42dfc980aec6a

However, when you look at the breakdowns by server IP address and by domain name, and how they differ between these ISPs, things get interesting. Here's an example for Jio (Airtel and Vodafone India look quite similar to this).

image.png (348×631 px, 70 KB)

https://logstash.wikimedia.org/goto/5373f6427174bccc7cb064ffd255bf95

For whatever reason, there's a higher ambient rate of resets for IPv6 connections, and said rate is fairly even between the text and upload addresses.

Here's the Jio view of resets by domain name:

image.png (419×630 px, 71 KB)

It's skewed towards upload.wikimedia.org, as users will make some number of requests to that host regardless of what wiki they're accessing -- but en.m.wikipedia.org is a close second.

Now here's the same two plots for Tata Sky Broadband:

image.png (393×608 px, 71 KB)

image.png (428×671 px, 53 KB)

https://logstash.wikimedia.org/goto/d0a28eb99fb3e62398e7d3f9469f5d06

In Tata's case, the data is wildly skewed towards upload-lb, regardless of IPv4 vs IPv6, and the per-domain data is incredibly skewed towards upload.wikimedia.org.
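As a rough illustration of how the per-domain skew could be quantified outside of Logstash, here is a minimal sketch. It assumes each report has the general shape of a Network Error Logging report (a top-level `url` and a `body` with a `type` field); the field layout in our actual index may differ, and the sample reports below are made up purely for illustration.

```python
from collections import Counter
from urllib.parse import urlsplit


def reset_share_by_host(reports):
    """Return {hostname: fraction of tcp.reset reports} for NEL-style reports."""
    hosts = Counter(
        urlsplit(report["url"]).hostname
        for report in reports
        # Only count reports whose NEL error type is tcp.reset.
        if report.get("body", {}).get("type") == "tcp.reset"
    )
    total = sum(hosts.values())
    if total == 0:
        return {}
    return {host: count / total for host, count in hosts.items()}


# Hypothetical sample reports, NOT real data from this task.
sample = [
    {"url": "https://upload.wikimedia.org/a.png", "body": {"type": "tcp.reset"}},
    {"url": "https://upload.wikimedia.org/b.png", "body": {"type": "tcp.reset"}},
    {"url": "https://upload.wikimedia.org/c.png", "body": {"type": "tcp.reset"}},
    {"url": "https://en.m.wikipedia.org/wiki/X", "body": {"type": "tcp.reset"}},
    {"url": "https://en.wikipedia.org/wiki/Y", "body": {"type": "dns.unreachable"}},
]

for host, share in sorted(reset_share_by_host(sample).items()):
    print(f"{host}: {share:.0%}")
```

A share close to 100% for a single hostname, as seen for Tata, is what separates this pattern from the background noise seen for the other ISPs.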

Additionally, the total level of reports from Tata Sky is itself suggestive of an issue -- per our webrequest data, while the rate of tcp.reset error reports from Tata users is the same order of magnitude as what we receive from Jio or Airtel, the total overall request volume from Tata Sky users is only 1-2% of the volume from either Jio or Airtel (!).
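To make the arithmetic behind that comparison explicit, here is a small sketch. It assumes the number of reset reports from the two ISPs is roughly equal (a simplification of "same order of magnitude") and uses only the 1-2% volume figure mentioned above; the function name and parameters are hypothetical.

```python
def relative_reset_rate(volume_fraction, report_ratio=1.0):
    """Estimate how many times higher ISP A's per-request reset rate is than ISP B's.

    volume_fraction: ISP A's request volume as a fraction of ISP B's (e.g. 0.02).
    report_ratio:    ISP A's reset report count divided by ISP B's (assumed ~1).
    """
    if volume_fraction <= 0:
        raise ValueError("volume_fraction must be positive")
    return report_ratio / volume_fraction


# Tata's volume is 1-2% of Jio's or Airtel's, with a similar number of reports.
for fraction in (0.01, 0.02):
    factor = relative_reset_rate(fraction)
    print(f"volume fraction {fraction:.0%}: about {factor:.0f}x the per-request reset rate")
```

Under these assumptions, a Tata user is on the order of 50 to 100 times more likely to hit a reset per request than a Jio or Airtel user.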

This issue has been reported on Znuny on Ticket#2021091710000804, so it's safe to assume the problem is still present.

Recent email exchange on this as they contacted us again:

From: Cathal Mooney
Sent: Thursday, September 16, 2021, 9:42 AM
To: TataSkyBB NOC
Cc: noc; info-en; TataSkyBB NOC; L3 Servicedesk; DL-Technology
Subject: Wikipedia images are not getting open_KOL region

Hi,

Thank you for your mail. I am sorry to hear your users are having
problems using Wikipedia.

There was a lengthy email thread between your team and ourselves
starting back last January, which I've attached to this mail.

At the time it was clear, from a supplied pcap you sent, that your
user's TLS requests to "upload.wikimedia.org" were being blocked by
some device / middlebox in the path, and not reaching us. This was
clear as the PCAP showed the client receiving responses from "our" IP
with ~10ms RTT. But when we traceroute to your IP space from the
serving Wikipedia node (in Singapore) the actual RTT is over 50ms. In
each case as soon as the TLS Client Hello was sent by your client, a
fake TCP RST packet was sent to it purporting to come from our IP,
approx 10ms after the Client Hello was sent. There is no way this is
really a response from our server, as it could not have got to us and
back in such a short time. So the packet is clearly being injected by
some device intercepting the flow. I've attached a pcap with just a
single example of this taken from the file you sent us back then.

Did you make any progress in identifying the middlebox / device
responsible for this behaviour? Is the situation now different than
at that point? If you believe the situation is different please let
us know what was done to resolve the issue the last time, and also
provide the following:

  • Public IP of the client making the request in the PCAP (if they are behind NAT).
  • Dig / DNS lookup to "dyna.wikimedia.org"
  • Dig / DNS lookup to "reflect.wikimedia.org"
  • Traceroute to "dyna.wikimedia.org"
  • PCAP of all that and of an attempt to use upload.wikimedia.org from a browser

Please be advised if the situation is the same as last time it is
completely out of our control and we will be unable to assist. All we
could say is to try to find and remove, or reconfigure, the middlebox
responsible for blocking.

Kind regards,

Cathal Mooney
SRE Infrastructure Foundations
Wikimedia Foundation.
https://wikimediafoundation.org/

On Thu, 16 Sept 2021 at 06:18, TataSkyBB NOC wrote:

Dear Team,

We are unable to open images on wikipedia from Kolkata region. Snap is attached for reference.

Kindly help to whitelist below IP’s (Ipv4 and Ipv6) on https://upload.wikimedia.org as per geographic location Kolkata to load the image properly.

IPv4 -

103.59.72.0/24

103.59.73.0/24

45.119.30.0/24

45.127.44.0/24

45.127.45.0/24

103.195.202.0/24

IPV6-

2402:e280:2400::/40

2402:e280:3d00::/40

AS No. 134674

Regards,
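The timing argument in the email exchange above (an RST arriving roughly 10ms after the Client Hello, on a path whose real RTT is over 50ms) can be sketched as a simple check. This is only an illustration: the function name is hypothetical, and the path RTT is an estimate from a traceroute, so in practice a safety margin would be needed.

```python
def rst_is_plausibly_injected(hello_sent_s, rst_received_s, path_rtt_s):
    """Return True if the RST arrived sooner than one round trip to the real server.

    hello_sent_s:   capture timestamp of the TLS Client Hello, in seconds.
    rst_received_s: capture timestamp of the TCP RST, in seconds.
    path_rtt_s:     estimated RTT between client and the real server, in seconds.
    """
    elapsed = rst_received_s - hello_sent_s
    # A genuine reply cannot arrive in less than one full round trip.
    return 0 <= elapsed < path_rtt_s


# Values taken from the email: RST ~10ms after the Client Hello, path RTT over 50ms.
print(rst_is_plausibly_injected(0.000, 0.010, 0.050))
```

A reset that arrives after at least one full round trip proves nothing either way, so this check can only confirm injection, not rule it out.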

BBlack subscribed.

Removing Traffic as I don't think this looks actionable for our team (but might still be for netops if the conversations above are still ongoing!).

ayounsi closed this task as Resolved. Oct 11 2021, 5:51 AM
ayounsi claimed this task.

No more news from Tata Sky and nothing we can do at our network layer either. To be reopened if needed.

This issue has been reported on Znuny on Ticket#2021091710000804, so it's safe to assume the problem is still present.

And this user states that they're now seeing images again! :)

Per NEL data, it looks like this issue was fixed on Sept 17th!

image.png (506×1 px, 59 KB)

These are the reported TCP resets for Tata Sky Broadband Private Limited (AS134674). As before, they were overwhelmingly for upload.wikimedia.org.

image.png (447×918 px, 57 KB)