Offload pings to dedicated server
Closed, Resolved · Public

Description

After discussion with the Traffic team, this task is to track the testing and, if successful/valuable, production deployment of a system to offload ICMP pings to a dedicated host.

A large volume of ICMP echo requests toward our main IPs, usually sent by people and machines testing their connectivity to the Internet, has been causing issues. For example, it reaches the rate-limiter thresholds (set so as not to overwhelm our servers), which causes monitoring ICMP requests to be dropped.

1st part, to deploy a test instance in eqiad

  • Get a VM in a private vlan (ping1001.eqiad.wmnet)
  • Reserve a test public IP in the LVS range in DNS (208.80.154.225)
  • Assign the IP to the VM's loopback interface
  • Add a firewall rule on cr1/2-eqiad to redirect icmp requests (before term default)
set firewall family inet filter border-in4 term offload-ping4 from protocol icmp
set firewall family inet filter border-in4 term offload-ping4 from icmp-type echo-request
set firewall family inet filter border-in4 term offload-ping4 from destination-address 208.80.154.225
set firewall family inet filter border-in4 term offload-ping4 then next-ip 10.64.32.31
  • From there, pings sent to the test IP should be answered by the VM. (Confirmed)
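
As a rough illustration of the firewall step above, here is a minimal Python sketch that renders the same four `set` lines from a VIP and a next-hop. The function name `offload_ping_term` and its parameters are made up for this example; it is not the tooling actually used on the routers.

```python
def offload_ping_term(vip, next_hop, filter_name="border-in4", term="offload-ping4"):
    """Render the four Junos `set` lines for one ping-offload term.

    Hypothetical helper: it only formats text in the shape shown in the task.
    """
    prefix = f"set firewall family inet filter {filter_name} term {term}"
    return [
        f"{prefix} from protocol icmp",
        f"{prefix} from icmp-type echo-request",
        f"{prefix} from destination-address {vip}",
        f"{prefix} then next-ip {next_hop}",
    ]


# Test VIP and ping1001 address as given in the task description.
lines = offload_ping_term("208.80.154.225", "10.64.32.31")
print("\n".join(lines))
```

Keeping the VIP and next-hop as parameters would make it easier to reuse the same term for another site or another target IP.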

Monitoring
Internally, pings to an LVS VIP should be answered by a host behind the LVS.
Externally, they should be answered by the VM.

  • Add VM to standard monitoring (Icinga, Prometheus, etc)
  • Ensure external monitoring does ICMP checks against the LVS VIPs (and not against the load-balanced hostnames)
  • Ensure availability of the service hosted on the LVS VIP is externally monitored by a check other than ICMP

The previous 2 points are to prevent people (and availability stats) from concluding that the actual service (e.g. wikipedia.org) is down when only the ICMP server is.

2nd part, catch real ICMP traffic in eqiad

  • Write puppet scaffolding - https://gerrit.wikimedia.org/r/#/c/424151/
  • Assign 208.80.154.224 (text-lb.eqiad.wikimedia.org) to the VM's loopback interface
  • Update the cr1/2-eqiad firewall rule
  • Verify monitoring is happy
  • Decommission the test VIP

3rd part, duplicate in codfw

  • Get a VM in a private vlan (ping2001.codfw.wmnet)
  • Add VM to standard monitoring (Icinga, Prometheus, etc)
  • Ensure external monitoring does ICMP checks against the LVS VIPs (and not against the load-balanced hostnames)
  • Ensure availability of the service hosted on the LVS VIP is externally monitored by a check other than ICMP
  • Assign 208.80.153.224 (text-lb.codfw.wikimedia.org) to the VM's loopback interface
  • Update the cr1/2-codfw and cr1-eqdfw firewall rule
  • Verify monitoring is happy

4th part, deploy to POPs

  • Either order dedicated hardware or wait for VM solution to be available on the site.
  • Duplicate to puppet

Redundancy
If required, this can be implemented with two hosts per site, sharing a VIP using VRRP or BGP (preferred), either on day 1 or in a later iteration.

Caveats

  • Results could be considered "lying", as pings to a host would be answered by a different host (might confuse troubleshooting)
  • The list of ping targets to "catch" needs to be maintained in 2 more tools (puppet + network automation)
    • Can be alleviated with the kernel's AnyIP feature (e.g. lo listens on the whole /27 VIP range)
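
To make the AnyIP idea a bit more concrete, here is a small Python sketch that checks which addresses a single /27 would cover. It assumes the eqiad VIP range is the /27 starting at 208.80.154.224; the task only names .224 and .225, so treat the prefix as an assumption.

```python
import ipaddress

# Assumption: the eqiad VIP range is the /27 that contains text-lb (.224).
# With AnyIP, a single local route on the ping server, for example
# `ip route add local 208.80.154.224/27 dev lo` (illustrative only),
# would let the host answer for every address in that range.
vip_range = ipaddress.ip_network("208.80.154.224/27")


def covered(ip):
    """Return True if `ip` falls inside the assumed VIP range."""
    return ipaddress.ip_address(ip) in vip_range


print(vip_range.num_addresses)      # number of addresses in a /27
print(covered("208.80.154.224"))    # text-lb.eqiad
print(covered("208.80.154.225"))    # test IP
print(covered("208.80.154.223"))    # just outside the range
```

With a covering prefix on the server, only the router-side list of targets would still need per-IP maintenance.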

Event Timeline

ayounsi triaged this task as Medium priority. Mar 19 2018, 8:47 PM
ayounsi created this task.
Restricted Application added a subscriber: Aklapper.

Change 420923 had a related patch set uploaded (by Ayounsi; owner: Ayounsi):
[operations/dns@master] Ping offload test, reserve IPs

https://gerrit.wikimedia.org/r/420923

Change 420923 merged by Ayounsi:
[operations/dns@master] Ping offload test, reserve IPs

https://gerrit.wikimedia.org/r/420923

Change 420933 had a related patch set uploaded (by Ayounsi; owner: Ayounsi):
[operations/puppet@production] Add DHCP and partman for ping1001

https://gerrit.wikimedia.org/r/420933

Change 420933 merged by Ayounsi:
[operations/puppet@production] Add DHCP and partman for ping1001

https://gerrit.wikimedia.org/r/420933

Change 420949 had a related patch set uploaded (by Ayounsi; owner: Ayounsi):
[operations/puppet@production] Add ping1001 to puppet (test)

https://gerrit.wikimedia.org/r/420949

Change 420949 merged by Ayounsi:
[operations/puppet@production] Add ping1001 to puppet (test)

https://gerrit.wikimedia.org/r/420949

1st part completed; the server-specific configuration (loopback IP) is not puppetized yet.

From outside:

$ ping ping-test.eqiad.wikimedia.org
PING ping-test.eqiad.wikimedia.org (208.80.154.225) 56(84) bytes of data.
64 bytes from ping-test.eqiad.wikimedia.org (208.80.154.225): icmp_seq=1 ttl=56 time=76.7 ms

On ping1001:

$ sudo tcpdump -i ens5 icmp -nn
[...]
16:38:50.898230 IP 198.27.253.200 > 208.80.154.225: ICMP echo request, id 6753, seq 1, length 64
16:38:50.898325 IP 208.80.154.225 > 198.27.253.200: ICMP echo reply, id 6753, seq 1, length 64
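
For longer captures, a check like the one above could be automated. This is a minimal Python sketch that pairs echo requests with echo replies in `tcpdump -nn` style lines; the function name `unanswered` and the regular expression are written for this example and only handle the line format shown here.

```python
import re

# Matches lines such as:
# "... IP 198.27.253.200 > 208.80.154.225: ICMP echo request, id 6753, seq 1, length 64"
ECHO = re.compile(r"IP (\S+) > (\S+): ICMP echo (request|reply), id (\d+), seq (\d+)")


def unanswered(lines):
    """Return the set of (src, dst, id, seq) echo requests that never got a reply."""
    pending = set()
    for line in lines:
        m = ECHO.search(line)
        if not m:
            continue
        src, dst, kind, ident, seq = m.groups()
        if kind == "request":
            pending.add((src, dst, ident, seq))
        else:
            # A reply travels in the opposite direction of its request.
            pending.discard((dst, src, ident, seq))
    return pending


capture = [
    "16:38:50.898230 IP 198.27.253.200 > 208.80.154.225: ICMP echo request, id 6753, seq 1, length 64",
    "16:38:50.898325 IP 208.80.154.225 > 198.27.253.200: ICMP echo reply, id 6753, seq 1, length 64",
]
print(unanswered(capture))
```

An empty result means every request seen in the capture was answered.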

Change 424151 had a related patch set uploaded (by Ayounsi; owner: Ayounsi):
[operations/puppet@production] Puppet: add ping_offload role and profile

https://gerrit.wikimedia.org/r/424151

About kernel tuning, here are the variables we can adjust as necessary, with their defaults:

50 -- /proc/sys/net/ipv4/icmp_msgs_burst
1000 -- /proc/sys/net/ipv4/icmp_msgs_per_sec
1000 -- /proc/sys/net/ipv4/icmp_ratelimit
6168 -- /proc/sys/net/ipv4/icmp_ratemask

Note that the default icmp_ratemask doesn't include ICMP echo request or echo reply. This means the rate limiters are not "enabled" for pings.
More details on: https://www.kernel.org/doc/Documentation/networking/ip-sysctl.txt

For reference, these are the values we're currently using on the LVSs:

  • net.ipv4.icmp_ratemask: 350233
  • net.ipv4.icmp_ratelimit: 200
  • net.ipv4.icmp_msgs_per_sec: 3000
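
To see what those icmp_ratemask values mean, each bit position corresponds to an ICMP type number. The following Python sketch decodes the default mask (6168) and the LVS mask (350233); the type names come from the kernel's ip-sysctl documentation, and the helper name `rate_limited_types` is just for this example.

```python
# ICMP type numbers and names, as listed for icmp_ratemask in ip-sysctl.txt.
ICMP_TYPES = {
    0: "echo reply",
    3: "destination unreachable",
    4: "source quench",
    8: "echo request",
    11: "time exceeded",
    12: "parameter problem",
    14: "timestamp reply",
    16: "information reply",
    18: "address mask reply",
}


def rate_limited_types(mask):
    """Return the ICMP type numbers whose bit is set in an icmp_ratemask value."""
    return [t for t in range(32) if mask & (1 << t)]


for mask in (6168, 350233):
    names = [ICMP_TYPES.get(t, f"type {t}") for t in rate_limited_types(mask)]
    print(mask, names)
```

The default mask only covers types 3, 4, 11 and 12, while the LVS mask also sets bit 0 (echo reply), which is what makes ping replies subject to the rate limiter there.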

Change 424151 merged by Ayounsi:
[operations/puppet@production] Puppet: add ping_offload role and profile

https://gerrit.wikimedia.org/r/424151

Verified that external monitoring doesn't do ping checks (but http, etc. instead) to hostnames (en.wikipedia.org, etc).
Added a Watchmouse ping check for text-lb.eqiad.wikimedia.org.

Mentioned in SAL (#wikimedia-operations) [2018-04-23T20:53:43Z] <XioNoX> redirect text-lb.eqiad pings to ping1001 on cr1/2-eqiad (24h tests) - T190090

Change 429012 had a related patch set uploaded (by Ayounsi; owner: Ayounsi):
[operations/puppet@production] Ping offload: remove test VIP

https://gerrit.wikimedia.org/r/429012

Change 429013 had a related patch set uploaded (by Ayounsi; owner: Ayounsi):
[operations/dns@master] Ping offload: remove test VIP from DNS

https://gerrit.wikimedia.org/r/429013

Change 429012 merged by Ayounsi:
[operations/puppet@production] Ping offload: remove test VIP

https://gerrit.wikimedia.org/r/429012

Mentioned in SAL (#wikimedia-operations) [2018-04-25T20:21:30Z] <XioNoX> remove test VIP for eqiad ping offload server - T190090

Change 429013 merged by Ayounsi:
[operations/dns@master] Ping offload: remove test VIP from DNS

https://gerrit.wikimedia.org/r/429013

Change 429099 had a related patch set uploaded (by Ayounsi; owner: Ayounsi):
[operations/dns@master] Assign IP for ping2001.codfw.wmnet

https://gerrit.wikimedia.org/r/429099

Change 429099 merged by Ayounsi:
[operations/dns@master] Assign IP for ping2001.codfw.wmnet

https://gerrit.wikimedia.org/r/429099

Change 429106 had a related patch set uploaded (by Ayounsi; owner: Ayounsi):
[operations/puppet@production] Ping offload, dhcp, partman and puppet for ping2001

https://gerrit.wikimedia.org/r/429106

Change 429106 merged by Ayounsi:
[operations/puppet@production] Ping offload, dhcp, partman and puppet for ping2001

https://gerrit.wikimedia.org/r/429106

While preparing the firewall rule for Dallas I discovered a limitation not accounted for previously.
The rule that says "if ping to VIPs, then redirect to IP X" is applied to our external links. This works fine on routers that have direct connectivity to the pingXXXX server, but not on networking POPs (eqdfw, knams).
eqord is spared so far as we don't advertise our prefixes from there (so no inbound traffic), which is why the issue wasn't noticed during the eqiad tests.

From here I see 3 options:
1/ Keep the plan as is, and don't offload traffic from the networking POPs
Creates inconsistencies
2/ Establish a GRE tunnel from the networking POPs to the ping servers
Increases complexity
3/ Apply the filter to traffic entering the public vlan (instead of traffic entering our network)
Redirects all pings to the VIPs (including internal ones); this seems to be the best option to me,
unless internal pings to the VIPs should not be redirected.

Diff for option 3 in eqiad is:

[edit interfaces ae1 unit 1017 family inet]
+       filter {
+           output private-out4;
+       }
[edit interfaces ae2 unit 1018 family inet]
+       filter {
+           output private-out4;
+       }
[edit interfaces ae3 unit 1019 family inet]
+       filter {
+           output private-out4;
+       }
[edit interfaces ae4 unit 1020 family inet]
+       filter {
+           output private-out4;
+       }
[edit firewall family inet filter border-in4]
-      term offload-ping4 {
-          from {
-              destination-address {
-                  208.80.154.224/32;
-              }
-              protocol icmp;
-              icmp-type echo-request;
-          }
-          then {
-              next-ip 10.64.32.31/32;
-          }
-      }
[edit firewall family inet]
      filter cloud-in4 { ... }
+     /* T190090 */
+     filter private-out4 {
+         term no-offload-ping4 {
+             from {
+                 destination-address {
+                     208.80.154.224/32;
+                 }
+                 source-prefix-list {
+                     wikimedia4;
+                     trusted-space4;
+                 }
+                 protocol icmp;
+                 icmp-type echo-request;
+             }
+             then accept;
+         }
+         term offload-ping4 {
+             from {
+                 destination-address {
+                     208.80.154.224/32;
+                 }
+                 protocol icmp;
+                 icmp-type echo-request;
+             }
+             then {
+                 next-ip 10.64.32.31/32;
+             }
+         }
+         term default {
+             then accept;
+         }
+     }
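
The intended matching logic of the private-out4 filter can be sketched in Python as below: terms are evaluated in order and the first matching term decides the action. This only models which term matches, not how Junos actually forwards the packet. The `INTERNAL_PREFIXES` list is a hypothetical stand-in, since the contents of the wikimedia4 and trusted-space4 prefix lists are not given in this task.

```python
import ipaddress

VIP = ipaddress.ip_address("208.80.154.224")
PING_SERVER = "10.64.32.31"

# Hypothetical stand-in: the real contents of the wikimedia4 and
# trusted-space4 prefix lists are not given in this task.
INTERNAL_PREFIXES = [ipaddress.ip_network("10.0.0.0/8")]


def private_out4(src, dst, protocol, icmp_type=None):
    """Return the action of the first matching term, in filter order."""
    is_ping_to_vip = (
        protocol == "icmp"
        and icmp_type == "echo-request"
        and ipaddress.ip_address(dst) == VIP
    )
    from_internal = any(
        ipaddress.ip_address(src) in net for net in INTERNAL_PREFIXES
    )
    if is_ping_to_vip and from_internal:
        return "accept"                      # term no-offload-ping4
    if is_ping_to_vip:
        return f"next-ip {PING_SERVER}"      # term offload-ping4
    return "accept"                          # term default


print(private_out4("198.27.253.200", "208.80.154.224", "icmp", "echo-request"))
print(private_out4("10.64.0.1", "208.80.154.224", "icmp", "echo-request"))
print(private_out4("198.27.253.200", "208.80.154.224", "tcp"))
```

The ordering matters: if offload-ping4 came first, internal pings would be redirected before the no-offload-ping4 exception could match.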

Discussed it with Brandon and we think that option 3 is the best path forward. Over to @faidon for thoughts/review.

Discussed it with Faidon.

Updated the previous diff to not redirect pings coming from our infra.

Some questions and concerns raised:

  • There is a risk of moving the issue toward the Ganeti servers
    • To be monitored, so far no issues have been seen since the POC has been deployed in eqiad
  • Why can't it be done on the LVS?
    • Even though they are acting as routers, they are subject to the same Linux kernel limitations/rate limiters as regular servers.
  • How is the ping server monitored?
    • Watchmouse (external monitoring) pings the VIPs (and thus lands on the ping server).
    • The non VIP IP of the ping server is monitored by Icinga
    • In addition we could add a Grafana alert for InAddrErrors >0 which means the VIP is missing from the ping server.
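
As an illustration of the InAddrErrors idea, here is a minimal Python sketch that extracts one counter from text in the /proc/net/snmp layout (a header line followed by a values line per protocol). The helper name `ip_counter` and the sample numbers are made up; the counter is cumulative, so a real alert would more likely compare successive samples than test for a non-zero value.

```python
def ip_counter(snmp_text, name):
    """Return one counter from the 'Ip:' section of /proc/net/snmp-style text."""
    ip_lines = [line.split() for line in snmp_text.splitlines()
                if line.startswith("Ip:")]
    header, values = ip_lines[0], ip_lines[1]
    return int(values[header.index(name)])


# Shortened, made-up sample in the /proc/net/snmp layout.
SAMPLE = """Ip: Forwarding DefaultTTL InReceives InHdrErrors InAddrErrors
Ip: 2 64 1000 0 7
Icmp: InMsgs InErrors
Icmp: 900 0
"""

errors = ip_counter(SAMPLE, "InAddrErrors")
print("alert" if errors > 0 else "ok")
```

On a ping server, the same function could be fed the contents of /proc/net/snmp read from disk.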

Here is the diff for codfw:

[edit interfaces ae1 unit 2017 family inet]
+       filter {
+           output private-out4;
+       }
[edit interfaces ae2 unit 2018 family inet]
+       filter {
+           output private-out4;
+       }
[edit interfaces ae3 unit 2019 family inet]
+       filter {
+           output private-out4;
+       }
[edit interfaces ae4 unit 2020 family inet]
+       filter {
+           output private-out4;
+       }
[edit firewall family inet]
+     /* T190090 */
+     filter private-out4 {
+         term no-offload-ping4 {
+             from {
+                 destination-address {
+                     208.80.153.224/32;
+                 }
+                 source-prefix-list {
+                     wikimedia4;
+                     trusted-space4;
+                 }
+                 protocol icmp;
+                 icmp-type echo-request;
+             }
+             then accept;
+         }
+         term offload-ping4 {
+             from {
+                 destination-address {
+                     208.80.153.224/32;
+                 }
+                 protocol icmp;
+                 icmp-type echo-request;
+             }
+             then {
+                 next-ip 10.192.0.22/32;
+             }
+         }
+         term default {
+             then accept;
+         }
+     }
      filter border-in4 { ... }

Will deploy it 1 vlan at a time, while monitoring pings to text-lb.codfw, and Icinga overall. Will roll back at any sign of an issue.

Mentioned in SAL (#wikimedia-operations) [2019-03-06T19:14:04Z] <XioNoX> apply ping-offload redirect to private1-a-codfw - T190090

After applying it only to cr1-codfw, I noticed an increase in ICMP errors to eqiad's LVS, see https://grafana.wikimedia.org/d/000000513/ping-offload?orgId=1&from=1551899767183&to=1551900644166
and no increase in ICMP packets on ping2001.

I rolled back the change for investigation.

Redirect test with unused .225 IP

[edit interfaces ae1 unit 2017 family inet]
+       filter {
+           output private-out4;
+       }
[edit firewall family inet filter private-out4 term no-offload-ping4 from destination-address]
+        208.80.153.225/32;
-        208.80.153.224/32;
[edit firewall family inet filter private-out4 term offload-ping4 from destination-address]
+        208.80.153.225/32;
-        208.80.153.224/32;

Mentioned in SAL (#wikimedia-operations) [2019-03-06T21:23:18Z] <XioNoX> test ping-offload with unused IP 208.80.153.225 - T190090

Everything has been rolled back for now.

I also added a logging term:

then {                    
    count ping-redirected;
    next-ip 10.192.0.22/32;
}

Which does get a surprisingly high number of hits:

cr1-codfw# run show firewall counter ping-redirected filter private-out4 

Filter: private-out4                                           
Counters:
Name                                                Bytes              Packets
ping-redirected                                 145580484              1733101

But nothing makes it to ping2001 (10.192.0.22).
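
A quick sanity check on that counter: dividing bytes by packets gives the average size of the matched packets. The sketch below does the arithmetic in Python, assuming the counter counts IP-level bytes.

```python
# Values copied from the `show firewall counter` output above.
bytes_counted = 145580484
packets_counted = 1733101

avg, remainder = divmod(bytes_counted, packets_counted)
print(avg, remainder)  # prints "84 0"

# 20-byte IPv4 header + 8-byte ICMP header + 56 bytes of payload,
# i.e. the "56(84) bytes of data" reported by a default Linux ping.
default_ping_size = 20 + 8 + 56
print(avg == default_ping_size)
```

The average comes out at exactly 84 bytes, which is consistent with (though not proof of) the matched packets being ordinary default-size echo requests.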

My theory so far, until we can get confirmation from JTAC (I can't find any doc confirming or refuting it), is that the firewall action next-ip can only be applied to input filters.
We do know that it works with input filters, as it worked when applied to the transit-in4 filter.

Based on that, the next logical place to redirect ICMP coming through network POPs is on the transport links. For example, on cr1-codfw it would look like:

[edit interfaces xe-5/0/0 unit 0 family inet]
+       filter {
+           input transport-in4;
+       }
[edit interfaces xe-5/0/2 unit 0 family inet]
+       filter {
+           input transport-in4;
+       }
[edit interfaces xe-5/1/2 unit 0 family inet]
+       filter {
+           input transport-in4;
+       }
[edit interfaces xe-5/2/1 unit 0 family inet]
+       filter {
+           input transport-in4;
+       }
[edit firewall family inet]
+     /* T190090 */
+     filter transport-in4 {
+         term no-offload-ping4 {
+             from {
+                 destination-address {
+                     208.80.153.224/32;
+                 }
+                 source-prefix-list {  
+                     wikimedia4;
+                     trusted-space4;
+                 }
+                 protocol icmp;
+                 icmp-type echo-request;
+             }
+             then accept;
+         }
+         term offload-ping4 {
+             from {
+                 destination-address {
+                     208.80.153.224/32;
+                 }
+                 protocol icmp;
+                 icmp-type echo-request;
+             }
+             then {
+                 next-ip 10.192.0.22/32;
+             }
+         }
+         term default {
+             then accept;
+         }
+     }                                 
      filter border-in4 { ... }

So the same firewall rules, but applied to all transport links for consistency.
We can start by testing with the unused .225 IP here as well.

Mentioned in SAL (#wikimedia-operations) [2019-03-20T21:00:50Z] <XioNoX> apply icmp redirect on cr1-codfw:xe-5/0/2 (to cr4-ulsfo) for test IP 208.80.154.225 - T190090

Typo above, test IP is 208.80.153.225.
Successfully tested on 1 link with:
cr4-ulsfo> ping source 129.250.204.6 208.80.153.225
Pushing the change to the other transport links, then cr2-codfw.

cr2-codfw
[edit interfaces xe-5/0/0]
-   description "Core: cr2-eqdfw:xe-0/1/4 (CyrusOne wikimedia:ix2.dfw4_to_ix2.dfw5.245.0009) {#11403} [10Gbps wave]";
+   description "Transport: cr2-eqdfw:xe-0/1/4 (CyrusOne wikimedia:ix2.dfw4_to_ix2.dfw5.245.0009) {#11403} [10Gbps wave]";
[edit interfaces xe-5/0/0 unit 0 family inet]
+       filter {
+           input transport-in4;
+       }
[edit interfaces xe-5/0/1 unit 0 family inet]
+       filter {
+           input transport-in4;
+       }
[edit interfaces xe-5/2/1 unit 0 family inet]
+       filter {
+           input transport-in4;
+       }
[edit firewall family inet]
+     /* T190090 */
+     filter transport-in4 {
+         term no-offload-ping4 {
+             from {
+                 destination-address {
+                     208.80.153.225/32;
+                 }
+                 source-prefix-list {
+                     wikimedia4;
+                     trusted-space4;
+                 }
+                 protocol icmp;
+                 icmp-type echo-request;
+             }                         
+             then accept;
+         }
+         term offload-ping4 {
+             from {
+                 destination-address {
+                     208.80.153.225/32;
+                 }
+                 protocol icmp;
+                 icmp-type echo-request;
+             }
+             then {
+                 next-ip 10.192.0.22/32;
+             }
+         }
+         term default {
+             then accept;
+         }
+     }
      filter border-in4 { ... }

Tested with NTT looking glass:

Sending 5, 100-byte ICMP Echos to 208.80.153.225, timeout is 2 seconds:
!!!!!
Success rate is 100 percent (5/5), round-trip min/avg/max = 1/1/2 ms

ping2001:~$ sudo tcpdump -i ens5 icmp 
21:27:25.601709 IP ae-9.r11.dllstx09.us.bb.gin.ntt.net > 208.80.153.225: ICMP echo request, id 16772, seq 0, length 80
21:27:25.601772 IP 208.80.153.225 > ae-9.r11.dllstx09.us.bb.gin.ntt.net: ICMP echo reply, id 16772, seq 0, length 80

Mentioned in SAL (#wikimedia-operations) [2019-03-20T21:37:44Z] <XioNoX> apply transit-in4 term offload-ping4 with test IP to cr1/2-codfw - T190090

The next step is to apply the following, replacing the test IP with the codfw text-lb IP.

[edit firewall family inet filter transport-in4 term no-offload-ping4 from destination-address]
+        208.80.153.224/32;
-        208.80.153.225/32;
[edit firewall family inet filter transport-in4 term offload-ping4 from destination-address]
+        208.80.153.224/32;
-        208.80.153.225/32;
[edit firewall family inet filter border-in4 term offload-ping4 from destination-address]
+        208.80.153.224/32;
-        208.80.153.225/32;

Mentioned in SAL (#wikimedia-operations) [2019-03-21T21:39:10Z] <XioNoX> Ping offload - replace test IP with text-lb.codfw IP on cr1/2-codfw - T190090

Change 498264 had a related patch set uploaded (by Ayounsi; owner: Ayounsi):
[operations/puppet@production] Add Icinga alert to ping-offload dashboard alerts

https://gerrit.wikimedia.org/r/498264

Change 498264 merged by Ayounsi:
[operations/puppet@production] Add Icinga alert to ping-offload dashboard alerts

https://gerrit.wikimedia.org/r/498264

Mentioned in SAL (#wikimedia-operations) [2019-03-25T21:40:38Z] <XioNoX> apply transport-in4 filter to cr1/2-eqiad - T190090

ayounsi updated the task description.

Everything needed here is done.
Full doc on https://wikitech.wikimedia.org/wiki/Ping_offload
Will open a followup task once the Ganeti clusters are ready in the POPs (T96852).

CDanis added a subscriber: CDanis.

boldly re-opening this, now that the POPs have Ganeti clusters available.

Today I learned that text-lb.esams receives something like 60k+ PPS of ICMP

That sounds like a good idea to me, @BBlack for a final opinion, and I can take care of it this Q if good to go.

+1 from me, this was one of the many things we made the ganeti clusters for :)

Mentioned in SAL (#wikimedia-operations) [2020-01-15T09:42:13Z] <XioNoX> enable ping offload in esams - T190090