After discussion with the Traffic team, this task is to track the testing and, if successful/valuable, production deployment of a system to offload ICMP pings to a dedicated host.
A large volume of ICMP echo requests toward our main IPs, typically sent by people and machines testing their connectivity to the Internet, has been causing issues. For example, it hits the rate limiter thresholds (set so as not to overwhelm our servers), which causes monitoring ICMP requests to be dropped.
**1st part, deploy a test instance in eqiad**
[x] Get a VM in a private vlan (ping1001.eqiad.wmnet)
[x] Reserve a test public IP in the LVS range in DNS (220.127.116.11)
[x] Assign the IP to the VM's loopback interface
[x] Add a firewall rule on cr1/2-eqiad to redirect icmp requests (before term default)
set firewall family inet filter border-in4 term offload-ping4 from protocol icmp
set firewall family inet filter border-in4 term offload-ping4 from icmp-type echo-request
set firewall family inet filter border-in4 term offload-ping4 from destination-address 18.104.22.168
set firewall family inet filter border-in4 term offload-ping4 then next-ip 10.64.32.31
[x] From there, pings sent to the test IP should be answered by the VM. (Confirmed)
Internally, pings to an LVS VIP should be answered by a host behind the LVS.
Externally, they should be answered by the VM.
[x] Add VM to standard monitoring (Icinga, Prometheus, etc)
[ ] Ensure external monitoring does ICMP checks against the LVS VIPs (and not hostnames)
[ ] Ensure availability of the service hosted on the LVS VIP is externally monitored by a check other than ICMP
The previous 2 points are to prevent people (and availability stats) from concluding that the actual service (e.g. wikipedia.org) is down when only the ICMP server is.
[ ] Write documentation (e.g. how to disable the redirect) - WIP: https://wikitech.wikimedia.org/wiki/Ping_offload
[ ] Optional: Write an ICMP dashboard in Grafana - WIP: https://grafana.wikimedia.org/dashboard/db/ping-offload
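
For the documentation item above, one possible way to disable the redirect on cr1/2-eqiad is to deactivate the term rather than delete it, so that it can be restored quickly. This is only a sketch: it assumes the filter and term names are still `border-in4` and `offload-ping4` as in the rule above, and the 5-minute rollback window is an arbitrary choice. The procedure on the wiki page takes precedence.

```
# Assumption: filter and term names match the rule shown in the first part
deactivate firewall family inet filter border-in4 term offload-ping4
# Review the pending change before committing
show | compare
# Roll back automatically unless confirmed within 5 minutes
commit confirmed 5
# Once pings are verified to reach the hosts behind the LVS again, confirm
commit
```

Re-enabling would be the same sequence with `activate` in place of `deactivate`.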
**2nd part, catch real ICMP traffic in eqiad**
[ ] Write Puppet scaffolding
[ ] Assign 22.214.171.124 (text-lb.eqiad.wikimedia.org) to the VM's loopback interface
[ ] Update the cr1/2-eqiad firewall rule
[ ] Verify monitoring is happy
[ ] Decommission the test VIP
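
The firewall update and test VIP cleanup in the steps above could look roughly like the following. It is a sketch under assumptions: it reuses the existing `offload-ping4` term instead of creating a new one, and it copies the addresses exactly as they appear in this task (the text-lb address from the list above and the test address from the current rule).

```
# Add the text-lb address as an additional match in the existing term
set firewall family inet filter border-in4 term offload-ping4 from destination-address 22.214.171.124
show | compare
commit confirmed 5
commit

# Later, when decommissioning the test VIP, remove its match
delete firewall family inet filter border-in4 term offload-ping4 from destination-address 18.104.22.168
```

The order matters: if the test address were deleted while it was the only `destination-address` in the term, the term would match every ICMP echo request passing through the filter and redirect all of them to the VM.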
**3rd part, if the eqiad deployment is satisfactory, duplicate in codfw**
**4th part, deploy to POPs**
[ ] Either order dedicated hardware or wait for a VM solution to be available at the site.
[ ] Duplicate the setup in Puppet
If required, this can be implemented with two hosts per site sharing a VIP using VRRP or BGP (preferred), either on day 1 or in a later iteration.
* Results could be considered "lying", as pings to a host would be answered by a different host (which might confuse troubleshooting)
* The list of ping targets to "catch" needs to be maintained in 2 more tools (Puppet + network automation)