
Route problems from some gateways of Italy to WMCloud and Toolforge
Closed, Resolved | Public | BUG REPORT

Description

Today, two active Wikimedians reported connection timeouts to me when reaching at least these resources from their city in northern Italy:

After some troubleshooting, there seem to be some routing problems. Since these users are reaching a Wikimedia-related datacenter, maybe it's (also?) a Wikimedia-related problem. Here is more info:

Problematic complete details (private)

Complete details (visible only to Paste subscribers, including @cmooney):

{P22947}

Problematic partial details (public)

This is the problematic route report:

$ nslookup chat.wmcloud.org
Address: 185.15.56.49

$ tracepath 185.15.56.49
1?: [LOCALHOST]                      pmtu 1500
 1:  <OMISSIS>                                            0.514ms 
 1:  <OMISSIS>                                            0.591ms 
 2:  172.<OMISSIS>                                        35.024ms 
 3:  172.<OMISSIS>                                        33.952ms 
 4:  172.<OMISSIS>                                        33.133ms 
 5:  172.18.<OMISSIS>                                     36.024ms 
 6:  172.17.<OMISSIS>                                     37.126ms 
 7:  172.19.177.16                                        41.468ms 
 8:  ae49.milano11.mil.seabone.net                        39.925ms 
 9:  185.100.113.147                                      71.685ms 
10:  ae2.cr2-esams.wikimedia.org                          63.067ms 
11:  no reply
12:  no reply
$ nslookup wmch.toolforge.org
Address: 185.15.56.11

$ tracepath 185.15.56.11
1?: [LOCALHOST]                      pmtu 1500
 1:  <OMISSIS>                                           0.465ms 
 1:  <OMISSIS>                                           0.415ms 
 2:  172.<OMISSIS>                                       43.586ms 
 3:  172.<OMISSIS>                                       35.325ms 
 4:  172.19.<OMISSIS>                                    36.316ms 
 5:  172.18.<OMISSIS>                                    34.530ms 
 6:  172.17.224.<OMISSIS>                                37.935ms 
 7:  172.19.177.16                                       39.786ms 
 8:  ae49.milano50.mil.seabone.net                       35.817ms asymm  9
 9:  185.100.113.151                                     57.829ms 
10:  ae2.cr2-esams.wikimedia.org                         67.808ms 
11:  no reply
12:  no reply
$ tracepath 91.198.174.192
1?: [LOCALHOST]                      pmtu 1500
 1:  <OMISSIS>                                            0.755ms 
 1:  OpenWrt.lan                                          0.381ms 
 2:  172.<OMISSIS>                                        54.475ms 
 3:  172.<OMISSIS>                                        39.743ms 
 4:  172.<OMISSIS>                                        35.398ms 
 5:  172.<OMISSIS>                                        40.415ms 
 6:  172.17.224.<OMISSIS>                                 37.630ms 
 7:  172.19.177.16                                        41.642ms 
 8:  ae49.milano50.mil.seabone.net                        45.099ms asymm  9 
 9:  185.100.113.145                                      63.158ms 
10:  no reply
11:  no reply
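
The point where each trace above stops replying can also be extracted mechanically. The following is only a minimal sketch, assuming tracepath output in the same layout as shown in this task; the function name `last_responding_hop` and the shortened sample are illustrative and not part of any existing tool.

```python
import re

# Matches hop lines such as " 9:  185.100.113.147    71.685ms" or "11:  no reply".
HOP_RE = re.compile(r"^\s*(\d+)\??:\s+(.*?)\s*$")


def last_responding_hop(tracepath_output):
    """Return (hop_number, host) of the last hop that replied, or None."""
    last = None
    for line in tracepath_output.splitlines():
        match = HOP_RE.match(line)
        if not match:
            continue  # not a hop line (e.g. "Resume: pmtu 1500")
        hop, rest = int(match.group(1)), match.group(2)
        if rest.startswith("no reply") or rest.startswith("[LOCALHOST]"):
            continue  # silent hop, or the local pmtu line
        last = (hop, rest.split()[0])
    return last


# Shortened copy of the first problematic trace in this task.
sample = """\
1?: [LOCALHOST]                      pmtu 1500
 8:  ae49.milano11.mil.seabone.net                        39.925ms
 9:  185.100.113.147                                      71.685ms
10:  ae2.cr2-esams.wikimedia.org                          63.067ms
11:  no reply
12:  no reply
"""

print(last_responding_hop(sample))  # (10, 'ae2.cr2-esams.wikimedia.org')
```

Note that the last hop to reply is not necessarily the faulty one: a trace also stops when the return path is broken, so the forward direction alone cannot settle where packets are lost.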

Working route info from a nearby geographical region

For comparison, this is a working route from Milano:

$ nslookup chat.wmcloud.org
Address: 185.15.56.49

$ tracepath 185.15.56.49
1?: [LOCALHOST]                      pmtu 1500
1:  <OMISSIS>                                             2.083ms 
1:  <OMISSIS>                                             2.041ms 
2:  <OMISSIS>                                             1.710ms 
...
8:  93-<OMISSIS>.fastwebnet.it                            4.341ms 
9:  62-<OMISSIS>.fastres.net                              5.926ms asymm  7
10:  r1fra3.core.init7.net                                13.826ms asymm  8
11:  r2fra3.core.init7.net                                13.725ms asymm  9
12:  r1lon1.core.init7.net                                25.807ms asymm 10
13:  r1lon2.core.init7.net                                25.972ms asymm 11
14:  r2ams2.core.init7.net                                25.093ms asymm 11
15:  r1ams2.core.init7.net                                24.955ms asymm 12
16:  gw-wikimedia.init7.net                               24.423ms asymm 14
17:  ae1-403.cr2-esams.wikimedia.org                      24.914ms asymm 13
18:  xe-4-1-3.cr2-eqiad.wikimedia.org                    312.716ms asymm 20
19:  no reply
20:  cloudgw1002.eqiad1.wikimediacloud.org               166.273ms asymm 13
21:  instance-proxy-03.project-proxy.wmflabs.org         201.074ms asymm 14
22:  no reply
23:  no reply
24:  no reply
25:  no reply
26:  no reply
27:  no reply
28:  no reply
29:  no reply
30:  no reply
Too many hops: pmtu 1500
Resume: pmtu 1500
$ nslookup wmch.toolforge.org
Address: 185.15.56.11

$ tracepath 185.15.56.11
1?: [LOCALHOST]                      pmtu 1500
1:  <OMISSIS>                                             2.664ms
1:  <OMISSIS>                                             1.209ms
2:  <OMISSIS>                                             3.364ms
...                                                       4.914ms asymm  9
8:  93-<OMISSIS>.fastwebnet.it                            5.613ms
9:  62-<OMISSIS>.fastres.net                              6.327ms asymm  7
10:  r1fra3.core.init7.net                                14.166ms asymm  8
11:  r2fra3.core.init7.net                                14.375ms asymm  9
12:  r1lon1.core.init7.net                                26.001ms asymm 10
13:  r1lon2.core.init7.net                                26.359ms asymm 11
14:  r2ams2.core.init7.net                                25.275ms asymm 11
15:  r1ams2.core.init7.net                                25.489ms asymm 12
16:  gw-wikimedia.init7.net                               24.874ms asymm 14
17:  ae1-403.cr2-esams.wikimedia.org                      24.945ms asymm 13
18:  xe-4-1-3.cr2-eqiad.wikimedia.org                    197.539ms asymm 20
19:  no reply
20:  cloudgw1002.eqiad1.wikimediacloud.org               167.727ms asymm 13
21:  instance-tools-proxy-06.tools.wmflabs.org           201.206ms asymm 14
22:  no reply
23:  no reply

Thank you so much for your comments here!

Event Timeline

It seems to me that the problematic gateway may be ae2.cr2-esams.wikimedia.org, so ops-esams may be interested.

Hi @valerio.bozzolan thank you for the report.

For the affected users can you confirm the source IP they are coming from? I want to validate the path back from our esams (Amsterdam) POP to the affected users.

The first non-private hops in the traceroute are shown with the hostnames ae49.milano11.mil.seabone.net and ae49.milano50.mil.seabone.net, for which there are no forward DNS entries, so the IPs of these hops cannot be found that way.
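
Both checks mentioned here (whether a hop address is private, and whether a hostname has a forward DNS entry) can be sketched with the Python standard library. This is a rough illustration only; the helper names `is_private_hop` and `forward_lookup` are made up for this example, and the `resolver` parameter exists just so the lookup can be exercised without network access.

```python
import ipaddress
import socket


def is_private_hop(address):
    """True if the address lies in a private range such as 172.16.0.0/12."""
    return ipaddress.ip_address(address).is_private


def forward_lookup(hostname, resolver=socket.gethostbyname):
    """Return the IPv4 address for hostname, or None if it does not resolve."""
    try:
        return resolver(hostname)
    except OSError:  # socket.gaierror is a subclass of OSError
        return None


# Hop 7 and hop 9 from the first trace in the task description.
for hop in ("172.19.177.16", "185.100.113.147"):
    print(hop, "private" if is_private_hop(hop) else "public")
```

Calling `forward_lookup("ae49.milano11.mil.seabone.net")` would return `None` for as long as no forward entry exists; the result depends on live DNS, so it may differ from what was observed in this task.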

Tracing back to the IP at hop 9 in the trace, that path looks to be OK:

cmooney@cloudsw1-d5-eqiad> traceroute wait 1 source 208.80.154.213 no-resolve 185.100.113.151                     
traceroute to 185.100.113.151 (185.100.113.151) from 208.80.154.213, 30 hops max, 40 byte packets
 1  208.80.154.212  0.829 ms  0.562 ms  0.733 ms
 2  80.239.132.225  1.235 ms  0.885 ms  0.926 ms
 3  80.91.248.156  11.934 ms  11.911 ms  14.104 ms
     MPLS Label=24007 CoS=0 TTL=1 S=1
 4  62.115.123.125  11.825 ms  1.377 ms  11.571 ms
 5  195.22.206.120  11.168 ms  12.344 ms  0.586 ms
 6  185.100.113.151  92.876 ms  84.922 ms  89.216 ms

So this issue is not a general problem with traffic coming from us to Seabone / Telecom Italia.

Would it be possible to tell us the source IPs of the affected users? Also if one of them could do a traceroute to 91.198.174.192 (Wikipedia IP in Amsterdam) it would be good to compare and see if they had problems reaching that also.

Thanks!

Also @valerio.bozzolan you should feel free to email the IPs to noc@wikimedia.org if you wish to avoid putting them here which is public.

I've added all the details in a nice private Paste visible to you (P22947) and added it in the Task description. Thank you for your work!

Thanks for the info @valerio.bozzolan

It seems the return traffic to that address was routing out of our network to Telia, who were handing off to Seabone/TI in the same region, but the trace died out after the first hop on the TI network:

cmooney@cloudsw1-d5-eqiad> traceroute wait 1 source 208.80.154.213 no-resolve <ip>
traceroute to <ip> from 208.80.154.213, 30 hops max, 40 byte packets
 1  208.80.154.212  0.999 ms  0.549 ms  0.495 ms
 2  80.239.132.225  1.298 ms  1.163 ms  0.930 ms
 3  80.91.248.156  11.647 ms *  11.206 ms
     MPLS Label=24007 CoS=0 TTL=1 S=1
 4  62.115.123.125  11.910 ms  11.928 ms 62.115.123.123  11.982 ms
 5  195.22.206.120  0.983 ms  22.976 ms  23.029 ms
 6  * * *

I've changed our routing preference so that this traffic now routes directly out to Seabone from Wikimedia, but unfortunately we observe the same thing: the trace dies after the first hop on the Seabone network:

cmooney@cloudsw1-d5-eqiad> traceroute wait 1 source 208.80.154.213 no-resolve <ip>
traceroute to <ip> from 208.80.154.213, 30 hops max, 40 byte packets
 1  208.80.154.212  11.680 ms  11.568 ms  0.519 ms
 2  206.126.236.6  0.678 ms  12.189 ms  0.556 ms
 3  195.22.206.1  11.237 ms  12.017 ms  12.252 ms
 4  * * *
 5  * * *
 6  * * *

For now I've set things back the way they were. I will see if I can get in contact with Seabone/TI to validate what the situation is on their side and see if they can resolve the issue.

OK, I've emailed the Seabone/TI NOC now; hopefully they come back with something meaningful. There isn't a whole lot more we can do here, as whichever way we send the traffic back from our US POP it is handed over to TI in the region, who seem to be dropping it regardless of exactly who hands off the packets to them.

@valerio.bozzolan the affected users are direct Telecom Italia customers, is that correct?

It certainly wouldn't hurt if they were to contact TI support to report the issue. Sometimes that has better effect/quicker response. They could link this task and my traceroutes as an example of what is going on.

valerio.bozzolan claimed this task.

@valerio.bozzolan the affected users are direct Telecom Italia customers, is that correct?

It seems so, yep.

It certainly wouldn't hurt if they were to contact TI support to report the issue.

Indeed.

BTW flash news: now both users are saying that everything works normally again! If we have additional information we will attach it here for future reference.

(slowly disappears)

Marking as "resolved". Thank you so much for your time and for your awesome commitment, and sorry for the trouble.

This comment was removed by cmooney.

Hmm ok. I can see in the traceroute it now makes it a few hops further:

cmooney@re0.cr2-eqiad> traceroute wait 1 no-resolve source 208.80.154.197 <ip> as-number-lookup    
traceroute to <ip> from 208.80.154.197, 30 hops max, 52 byte packets
 1  80.239.132.225 [AS  1299]  1.174 ms  0.852 ms  1.742 ms
 2  * 80.91.248.156 [AS  1299]  1.604 ms  1.301 ms
     MPLS Label=24007 CoS=0 TTL=1 S=1
 3  62.115.123.125 [AS  1299]  0.864 ms 62.115.123.123 [AS  1299]  1.365 ms 62.115.123.125 [AS  1299]  0.694 ms
 4  195.22.206.120 [AS  6762]  6.373 ms  13.179 ms  20.024 ms
 5  195.22.209.215 [AS  6762]  99.480 ms  100.747 ms  101.998 ms
 6  195.22.192.145 [AS  6762]  99.594 ms 195.22.205.99 [AS  6762]  102.571 ms 195.22.192.145 [AS  6762]  100.962 ms
 7  * * *
 8  * * *
 9  * * *
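
The hand-off between networks is visible in the bracketed AS numbers of this trace (AS 1299 for Telia, AS 6762 for Seabone/TI, as named earlier in this task). A small sketch of how such output could be scanned for the hop where the AS changes, assuming the line layout shown above; `as_handoffs` is a hypothetical helper, and only the first address on each hop line is considered.

```python
import re

# Hop number, optional "*" timeouts, then the first "address [AS  NNNN]" pair.
HOP_AS_RE = re.compile(
    r"^\s*(\d+)\s+(?:\*\s+)*(\d+\.\d+\.\d+\.\d+) \[AS\s+(\d+)\]"
)


def as_handoffs(traceroute_output):
    """Return a list of (hop, previous_as, new_as) where the AS number changes."""
    handoffs = []
    previous = None
    for line in traceroute_output.splitlines():
        match = HOP_AS_RE.match(line)
        if not match:
            continue  # MPLS label lines, "* * *" hops, headers
        hop, asn = int(match.group(1)), int(match.group(3))
        if previous is not None and asn != previous:
            handoffs.append((hop, previous, asn))
        previous = asn
    return handoffs


# Shortened copy of the trace above, with the destination omitted.
sample = """\
 1  80.239.132.225 [AS  1299]  1.174 ms  0.852 ms  1.742 ms
 2  * 80.91.248.156 [AS  1299]  1.604 ms  1.301 ms
     MPLS Label=24007 CoS=0 TTL=1 S=1
 3  62.115.123.125 [AS  1299]  0.864 ms
 4  195.22.206.120 [AS  6762]  6.373 ms  13.179 ms  20.024 ms
 5  195.22.209.215 [AS  6762]  99.480 ms  100.747 ms  101.998 ms
 7  * * *
"""

print(as_handoffs(sample))  # [(4, 1299, 6762)]
```
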

Happy days I guess! TI haven't responded to shed any light on what may have happened.

Thank you for the report, I'll close this now but please do let us know of any similar trouble :)

That wasn't sent until way after your issues started nor were fixed.