
Review analytics-in4/6 rules on cr1/cr2 eqiad
Closed, Resolved · Public · 21 Estimated Story Points

Description

After a chat with @ayounsi we decided to review some analytics-in4 terms on cr1/cr2 eqiad because they contain stale IPs.

logstash

term logstash {
    from {
        destination-address {
            10.64.32.137/32;
            10.64.0.122/32;
            10.64.48.113/32;
        }
        protocol udp;
        destination-port 12201;
    }
    then accept;
}

This one seems related to T84332 (we no longer use logstash for Hadoop) and contains 3 stale IPs (logstash100[1-3], now decommed). I propose to drop it.

eventlogging_zeromq

term eventlogging_zeromq {
    from {
        destination-address {
            10.64.32.167/32;
        }
        destination-port [ 8521-8523 8600 8421-8422 ];
    }
}

Related to an old service running on eventlog1001 (now decommed). I propose to drop it.

zookeeper

term zookeeper {
    from {
        destination-address {
            /* conf100{1,2,3} */
            10.64.0.18/32;
            10.64.32.180/32;
            10.64.48.111/32;
            /* conf100{4,5,6} */
            10.64.0.23/32;
            10.64.16.29/32;
            10.64.48.167/32;
        }
        protocol tcp;
        destination-port [ 2181 2182 2183 ];
    }
    then accept;
}

conf100[1-3] should be removed since zookeeper is not running on them anymore.

wdqs

term wdqs {
    from {
        destination-address {
            /* wdqs1001 */
            10.64.48.112/32;
            /* wdqs1002 */
            10.64.32.183/32;
            /* wdqs1003 */
            10.64.0.14/32;
            /* wdqs2001 */
            10.192.32.148/32;
            /* wdqs2002 */
            10.192.48.65/32;
            /* wdqs2003 */
            10.192.0.29/32;
        }
        protocol tcp;
        destination-port 8888;
    }
    then accept;
}

I had a chat with @Addshore and at the moment they seem to use only wdqs1003 via this code. There are some stale IPs that need to be updated, and T176875 has been filed as a follow-up. Adding @Gehel as well for the final word on which hosts are best to use. I'd propose to remove all the IPs in there and replace them with the VIP wdqs.svc.eqiad.wmnet.
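A rough sketch of what a VIP-based term could look like (the prefix below is a placeholder only; the real address of wdqs.svc.eqiad.wmnet would have to be filled in):

term wdqs {
    from {
        destination-address {
            /* wdqs.svc.eqiad.wmnet - placeholder prefix, replace with the actual VIP */
            10.2.2.0/32;
        }
        protocol tcp;
        destination-port 8888;
    }
    then accept;
}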

ipsec

term ipsec {
    from {
        protocol esp;
    }
    then accept;
}
term ipsec-ike {
    from {
        protocol udp;
        destination-port 500;
    }
    then accept;
}

This one was probably needed to allow IPsec connections from kafka10[12-23] to the cp* hosts. The Kafka hosts no longer need this connection since their webrequest traffic is now handled by Kafka Jumbo (which is not in the Analytics VLAN).

es

/* Revert this when we get a good queue to undo T120281 */
term es {
    from {
        destination-address {
            /* elastic1017 */
            10.64.48.39/32;
            /* elastic1018 */
            10.64.48.40/32;
            /* elastic1019 */
            10.64.48.41/32;
[..looong list of IPs..]

For this one we agreed with @Gehel and @dcausse that only a few hosts are needed. They are all listed in this patch:

  • elastic1017
  • elastic1051
  • elastic1052
  • elastic2010
  • elastic2035
  • elastic2036

kafka

The IPs are ok but we'd need to add port 9093 to the term's destination ports (for TLS).
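As a rough sketch only (the existing term is not shown here, and the assumption that the brokers use 9092 for plaintext is mine), the updated term would end up carrying both ports:

term kafka {
    from {
        destination-address {
            /* existing kafka broker prefixes, unchanged */
        }
        protocol tcp;
        destination-port [ 9092 9093 ];
    }
    then accept;
}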

Event Timeline


Note that some descriptions don't match the IPs anymore.
For example:

/* wdqs1001 */
10.64.48.112/32;
/* wdqs1002 */
10.64.32.183/32;

In reality, 10.64.48.112 is now flerovium.eqiad.wmnet (most likely re-purposed) and 10.64.32.183 doesn't have a PTR (most likely decommissioned).

term git_deploy
[...]
destination-address:
10.64.0.196/32 : tin decommed?
10.192.16.132 : no PTR
10.192.32.22 : no PTR

ayounsi renamed this task from "Review analytics-in4 rules con cr1/cr2 eqiad" to "Review analytics-in4 rules on cr1/cr2 eqiad". Jul 3 2018, 9:07 PM

For WDQS, we should keep access to at least 2 nodes in both eqiad and codfw. I propose:

wdqs1003: 10.64.0.14
wdqs1004: 10.64.0.17
wdqs2001: 10.192.32.148
wdqs2002: 10.192.48.65

If it is all the same to you, we could also keep access open to wdqs1005 / wdqs2003.

Note that all those servers are part of the public wdqs cluster. We don't want analytics to access the internal cluster, as the internal cluster is meant for synchronous and cheap requests.

First batch of changes:

delete firewall family inet filter analytics-in4 term logstash
delete firewall family inet filter analytics-in4 term eventlogging_zeromq
delete firewall family inet filter analytics-in4 term zookeeper from destination-address 10.64.0.18/32
delete firewall family inet filter analytics-in4 term zookeeper from destination-address 10.64.32.180/32
delete firewall family inet filter analytics-in4 term zookeeper from destination-address 10.64.48.111/32
delete firewall family inet filter analytics-in4 term ipsec
delete firewall family inet filter analytics-in4 term ipsec-ike
set firewall family inet filter analytics-in4 term kafka from destination-port 9093

The remaining ones are the es and wdqs terms.

New term es:

elukey@re0.cr1-eqiad> show configuration firewall family inet filter analytics-in4 term es
from {
    destination-address {
        /* elastic1017 */
        10.64.48.39/32;
        /* elastic1051 */
        10.64.32.21/32;
        /* elastic1052 */
        10.64.32.22/32;
        /* elastic2010 */
        10.192.16.146/32;
        /* elastic2035 */
        10.192.48.74/32;
        /* elastic2036 */
        10.192.48.75/32;
    }
    protocol tcp;
    destination-port [ 9200 9243 ];
}
then accept;

Mentioned in SAL (#wikimedia-operations) [2018-07-04T08:53:08Z] <elukey> update analytics-in4 filter rules on cr1/cr2 eqiad - T198623

New term wdqs:

elukey@re0.cr1-eqiad> show configuration firewall family inet filter analytics-in4 term wdqs
from {
    destination-address {
        /* wdqs1003 */
        10.64.0.14/32;
        /* wdqs2001 */
        10.192.32.148/32;
        /* wdqs2002 */
        10.192.48.65/32;
        /* wdqs2003 */
        10.192.0.29/32;
        /* wdqs1004 */
        10.64.0.17/32;
        /* wdqs1005 */
        10.64.48.46/32;
    }
    protocol tcp;
    destination-port 8888;
}
then accept;

term git_deploy
[...]
destination-address:
10.64.0.196/32 : tin decommed?
10.192.16.132 : no PTR
10.192.32.22 : no PTR

I am pretty sure that this is a pre-scap thing, we should drop it :)

Other thing: shall I also drop the puppet term in analytics-in4 since it is already present in common-infrastructure4?

Change 443803 had a related patch set uploaded (by Addshore; owner: Addshore):
[analytics/wmde/scripts@master] Add note about wdqs access through firewall

https://gerrit.wikimedia.org/r/443803

Change 443804 had a related patch set uploaded (by Addshore; owner: Addshore):
[analytics/wmde/scripts@production] Add note about wdqs access through firewall

https://gerrit.wikimedia.org/r/443804

Change 443803 merged by jenkins-bot:
[analytics/wmde/scripts@master] Add note about wdqs access through firewall

https://gerrit.wikimedia.org/r/443803

Change 443804 merged by jenkins-bot:
[analytics/wmde/scripts@production] Add note about wdqs access through firewall

https://gerrit.wikimedia.org/r/443804

I am pretty sure that this is a pre-scap thing, we should drop it :)

Great!

Other thing: shall I also drop the puppet term in analytics-in4 since it is already present in common-infrastructure4?

Yep.

@ayounsi are we sure that we can touch common-infrastructure4 without affecting anything else? Is there any trace of who made it? If it is used only by analytics it would make sense to just include those terms in analytics-in4 for visibility..

@ayounsi are we sure that we can touch common-infrastructure4 without affecting anything else?

Yep, the only place where common-infrastructure4 is mentioned is at the top of analytics-in4

Is there any trace of who made it?

@faidon might know

If it is used only by analytics it would make sense to just include those terms in analytics-in4 for visibility..

This is what I'm aiming for. It's a different file for Capirca, but it all ends up in the same firewall filter in the end.

Also:

show firewall family inet filter analytics-in4 term archiva    
from {
    destination-address {
        208.80.154.18/32;
    }
    protocol tcp;
    destination-port [ 80 873 443 ];
}

There is no "then permit"; is that term still needed, or should we add the permit?

show firewall family inet filter analytics-in4 term archiva    
from {
    destination-address {
        208.80.154.18/32;
    }
    protocol tcp;
    destination-port [ 80 873 443 ];
}

There is no "then permit"; is that term still needed, or should we add the permit?

Yes definitely. IIUC from this page the default is to permit, which is why it has been working so far. Good catch! Will fix it tomorrow :)
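A one-line sketch of the fix, simply making the accept explicit on the existing term:

set firewall family inet filter analytics-in4 term archiva then accept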

Fixed the archiva term and removed the puppet term in analytics-in4. The last step is to drop the Ganglia and git-deploy terms from common-infrastructure4.
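A sketch of what those remaining deletions might look like (the git_deploy term name matches the one shown earlier; the Ganglia term name is a guess):

delete firewall family inet filter common-infrastructure4 term ganglia
delete firewall family inet filter common-infrastructure4 term git_deploy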

elukey set the point value for this task to 8.

Last changes applied by Arzhel, including merging common-infrastructure4 into analytics-in4.

ayounsi renamed this task from "Review analytics-in4 rules on cr1/cr2 eqiad" to "Review analytics-in4/6 rules on cr1/cr2 eqiad". Jul 10 2018, 3:38 PM

I added the IPv6 equivalent of the v4 filter with a default "log+permit" term, so we can see if we missed anything.
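For reference, a minimal sketch of what such a catch-all term might look like, assuming the syslog action (the syslog statement is the one removed later to reduce noise):

term default {
    then {
        syslog;
        accept;
    }
}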

3 highlights:
1/ Hosts hitting that default rule were using SLAAC IPs rather than v6 IPs derived from their v4 IPs; fixed in T199180

2/ At least stat1005/1006 are talking to 2620:0:861:3:208:80:154:85 (which is gerrit.wikimedia.org) via https
Note that there was no equivalent IPv4 term.
But there is a "gerritssh" term for destination cobalt.wikimedia.org and port 29418/tcp that is not being used so far.

I added a temporary policy to allow that traffic over v4 and v6 (to keep v4/v6 similar) but the program doing that query should use the webproxies instead.

3/ At least stat1005 is talking to 2620:0:861:ed1a:0:0:0:1 (which is text-lb.eqiad) via https
I did a bit more digging here and it seems like a php script is running every minute and issuing those queries.
I saw @Addshore's name multiple times while looking at cron jobs and running processes.
Same as above, ideally those scripts should be using the webproxies.

Proxies are well known to be hard to work with, and I see some scripts already configured to use them, so it could be a misconfiguration, a bug (v4/v6, http/https), etc.

Once this is solved, we can switch the v6 filter to a default "reject+log".

The WMDE scripts have requests going to the following places not via the webproxy:

I can write a patch to send the 3 public request types through the webproxy, although perhaps for the wikidata.org requests and the query.wikidata.org requests there is some internal address that could be used? But I guess that from the analytics machines that would require more firewall rules, and the webproxy may be preferred?

The WMDE scripts have requests going to the following places not via the webproxy:

I can write a patch to send the 3 public request types through the webproxy, although perhaps for the wikidata.org requests and the query.wikidata.org requests there is some internal address that could be used? But I guess that from the analytics machines that would require more firewall rules, and the webproxy may be preferred?

If you could use the webproxy for the public queries it would surely be the easiest option :)

And more redundant, as query.wikidata.org and wikidata.org are load balanced.

Change 444907 had a related patch set uploaded (by Addshore; owner: Addshore):
[analytics/wmde/scripts@master] Always request external urls via the webproxy

https://gerrit.wikimedia.org/r/444907

Change 444911 had a related patch set uploaded (by Addshore; owner: Addshore):
[analytics/wmde/scripts@production] Always request external urls via the webproxy

https://gerrit.wikimedia.org/r/444911

Change 444907 merged by jenkins-bot:
[analytics/wmde/scripts@master] Always request external urls via the webproxy

https://gerrit.wikimedia.org/r/444907

Change 444911 merged by jenkins-bot:
[analytics/wmde/scripts@production] Always request external urls via the webproxy

https://gerrit.wikimedia.org/r/444911

After the next puppet run on stat1005, all of the wmde scripts that request external resources should go through the webproxy :)

Ran puppet and confirmed that those flows are not hitting the firewall anymore.
Thanks for your quick reply!

Edit:
As pointed out by Addshore, the flows mentioned in 2/ are Puppet/git calls:
Eg. https://github.com/wikimedia/puppet/blob/479729c712dbc4d54883d5c03238a849919ac7d5/modules/statistics/manifests/wmde/graphite.pp#L68

A fix might be: git config --global http.proxy http://webproxy.eqiad.wmnet:8080

Not sure who owns those repositories or who should approve/do the change. @elukey?

elukey added a parent task: Restricted Task. Jul 11 2018, 7:06 AM

I think there are more git::clone calls to gerrit from the Analytics VLAN; it is only a matter of finding them and possibly adding the http.proxy setting directly to puppet (maybe as an exec?).

Change 445106 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] statistics::user: set http[s] proxy in git config

https://gerrit.wikimedia.org/r/445106

Change 445106 merged by Elukey:
[operations/puppet@production] statistics::user: set http[s] proxy in git config

https://gerrit.wikimedia.org/r/445106

Change 445111 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Add profile::analytics::cluster::gitconfig to stat and notebooks

https://gerrit.wikimedia.org/r/445111

Change 445111 merged by Elukey:
[operations/puppet@production] Add profile::analytics::cluster::gitconfig to stat and notebooks

https://gerrit.wikimedia.org/r/445111

@ayounsi I tried to fix the git::clone calls to text-lb with the above changes, let me know if I fixed it or not :)

@elukey
I still see flows to gerrit (2620:0:861:3:208:80:154:85) https from:
2620:0:861:108:10:64:53:26
2620:0:861:108:10:64:53:30
2620:0:861:108:10:64:53:31
2620:0:861:106:10:64:36:116
2620:0:861:104:10:64:5:104
These seem to match a puppet run schedule.

Flows to conf1006 (2620:0:861:107:10:64:48:167) 2181/tcp from:
2620:0:861:108:10:64:53:21
2620:0:861:106:10:64:36:118
If legit, probably need a dedicated policy?
EDIT: zookeeper term added for IPv6
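A sketch of the added v6 term, mirroring the v4 zookeeper term from the description and using conf1006's address (the other conf host addresses would be added the same way):

term zookeeper {
    from {
        destination-address {
            /* conf1006 */
            2620:0:861:107:10:64:48:167/128;
        }
        next-header tcp;
        destination-port [ 2181 2182 2183 ];
    }
    then accept;
}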

@Addshore
I still see flows to text-lb.eqiad (2620:0:861:ed1a:0:0:0:1) https from:
2620:0:861:108:10:64:53:30 (stat1005)
Bursts of many connections every few minutes

As well as flows to lists.wikimedia.org (2620:0:861:1:208:80:154:21) https from:
2620:0:861:108:10:64:53:30 (stat1005)
Less frequently

Change 445618 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::analytics::gitconfig: use system level configuration for git

https://gerrit.wikimedia.org/r/445618

Change 445618 merged by Elukey:
[operations/puppet@production] profile::analytics::gitconfig: use system level configuration for git

https://gerrit.wikimedia.org/r/445618

Change 445619 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::analytics::cluster::gitconfig: fix previous pebkac

https://gerrit.wikimedia.org/r/445619

Change 445619 merged by Elukey:
[operations/puppet@production] profile::analytics::cluster::gitconfig: fix previous pebkac

https://gerrit.wikimedia.org/r/445619

So now on stat* and notebook* we have an /etc/gitconfig rule that forces all git users to use the http[s] proxy. The conf1006 flow is related to the zookeeper term, which has been added to analytics-in6.

No idea about the last two mentioned for text-lb.eqiad and lists.wikimedia.org :)

In addition to T198623#4415961
We have notebook1003 and notebook1004 sending an ICMPv6 Multicast Listener Report every 2 minutes to the multicast address ff02::202.

It's harmless but shouldn't be happening. I dug a bit to try to figure out which process was responsible for this multicast membership.
My best guess is rpcbind, as only rsyslog and rpcbind are listening on udp/v6.

Edit, more digging:
Using cumin I looked for all the hosts subscribed to ff02::202.
helium.eqiad.wmnet,labstore[1006-1007].wikimedia.org,ms1001.wikimedia.org,dataset1001.wikimedia.org,notebook[1003-1004].eqiad.wmnet

All the ones I looked at have rpcbind running.
dataset1001 being an unused spare system (spare::system), I went ahead and shut rpcbind down: sudo service rpcbind stop.
And ff02::202 disappeared from netstat -g -n. So it's indeed the culprit.
Do we use/need rpcbind? Is there a way to have it not do multicast?

https://www.ietf.org/proceedings/50/I-D/nfsv4-rpc-ipv6-00.txt

IPv6 enabled RPC service must join a well known multicast group, which is FF02::202. A IPv6 host is expected to remain in this group for it's entire life and should rejoin this group if the node leaves this multicast group for any reason. ONC RPC uses rpcbind or portmapper service to join this group early during boot phase.

@ayounsi your hunch seems right :)

Change 446067 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Add global git http[s].proxy config for thorium and an1003

https://gerrit.wikimedia.org/r/446067

Change 446067 merged by Elukey:
[operations/puppet@production] Add global git http[s].proxy config for thorium and an1003

https://gerrit.wikimedia.org/r/446067

Mentioned in SAL (#wikimedia-operations) [2018-07-17T15:42:37Z] <XioNoX> switching default analytics-in6 term to reject+log - T198623

Note that before switching to a default reject+log, I added terms to permit traffic to text-lb, misc-lb, and lists on port 443 until T198623#4415961 is addressed.

@ayounsi I am running tcpdump on stat1005 with 'ip6 and src 2620:0:861:108:10:64:53:30' but I see only traffic to install1002, as expected. Are we still having the problem mentioned before?

Never mind, I saw it; it doesn't happen very often though. Will try to figure out its origin.

Ok I think I have finally got something :)

So I left tcpdump running to capture ipv6 traffic, excluding some "known" IPs like the puppetmaster, webproxy, etc. If I grep only the "https"-tagged traffic from the result and group it by timing, I can see the following clusters:

elukey@stat1005:~$ grep https ipv6.log | cut -d ":" -f 1,2 | uniq -c
  14921 19:15
      4 21:58
      2 21:59
      9 00:48
  14980 01:15
     54 03:00
  14938 07:15
  15004 13:15
   6707 13:20
    649 13:21

So there is definitely something that runs periodically at :15, surely a cron. I dumped all the users' crontabs and found this one:

elukey@stat1005:~$ sudo -u ezachte crontab -l | grep 19
15 1,7,13,19 * * * nice /home/ezachte/wikistats/mail-lists/bash/report_mail_lists_counts.sh  >/dev/null 2>&1

At some point a perl script tries to contact http://mail.wikipedia.org/mailman/listinfo, which redirects to https://mail.wikipedia.org/mailman/listinfo, which finally redirects to https://lists.wikimedia.org/mailman/listinfo.

The owner of the script already added the http proxy, but I think what is missing is the https one. I added it to the perl script; let's see if it works or not. If it doesn't, I'll contact the owner as a follow-up :)

Tried to find all the occurrences of webproxy and added the related https configuration, let's see if things will change!

Some https calls are still being registered; I tried to add more https_proxy settings and opened T201134 to follow up with the author of the cron scripts generating the most traffic.

No flows have been logged for at least the last 3 days, so it looks fine to me. I removed the syslog statement to minimize noise while Luca is on vacation.

I think the very last items here are:

  • remove the temporary allow+log rule
  • replace the "reject" with "discard" (reject forces the router to send a reply, which consumes the router's resources)

I am back from vacation! I am still seeing some https traffic to lists.w.o and text-lb from stat1005 though, so I think I need to dig a bit more before calling it a win (sigh).

For the moment I captured only these flows:

elukey@stat1005:~$ grep https ipv6_after_changes.log| while read line; do endpoint=$(echo $line | cut -d" " -f 5); timing=$(echo $line | cut -d ":" -f1,2); echo $timing" "$endpoint; done | sort |  uniq -c
      4 00:51 text-lb.codfw.wikimedia.org.https:
      3 00:52 text-lb.codfw.wikimedia.org.https:
     52 03:00 lists.wikimedia.org.https:

I'll keep using tcpdump to see whether those are regular or not, and then track down the owners of the crons if needed.

After one day:

elukey@stat1005:~$ grep https ipv6_after_changes.log| while read line; do endpoint=$(echo $line | cut -d" " -f 5); timing=$(echo $line | cut -d ":" -f1,2); echo $timing" "$endpoint; done | sort |  uniq -c
      9 00:48 text-lb.eqiad.wikimedia.org.https:
      4 00:51 text-lb.codfw.wikimedia.org.https:
      3 00:52 text-lb.codfw.wikimedia.org.https:
    107 03:00 lists.wikimedia.org.https:

The only regular https calls that I can see now are the ones happening at 3:00 AM, so once I find that cron we should be good.

I was able to get rid of the traffic at 03:00 AM, but some traces remain:

8 00:53 text-lb.eqiad.wikimedia.org.https:
3 20:11 2a04:4e42:200::223.https:
3 20:11 2a04:4e42::223.https:
3 20:11 2a04:4e42:400::223.https:
3 20:11 2a04:4e42:600::223.https:

The 00:5X calls to text-lb are the ones in T201746; I verified this by adding logging of DNS queries to my tcpdump filters (I can clearly see a DNS call to resolve phabricator around that time).

The other https calls are new to me; they seem related to pri.authdns.ripe.net, but I guess it was a one-off?

From my tcpdumps it seems that no more https calls are made via ipv6 without going through the proxy. @ayounsi, we can proceed with the last steps for the analytics-in6 rules!

[edit firewall family inet filter analytics-in4 term default then]
-       reject;
+       discard;
[edit firewall family inet filter analytics-in4]
-      /*
-       ** Needed until all flows listed in T198623 use the proxies
-       */
-      term https-public {
-          from {
-              destination-address {
-                  /* lists */
-                  208.80.154.21/32;
-                  /* text-lb.eqiad */
-                  208.80.154.224/32;
-                  /* misc-web-lb.eqiad */
-                  208.80.154.251/32;
-              }
-              protocol tcp;
-              destination-port 443;
-          }
-          then accept;
-      }
[edit firewall family inet6 filter analytics-in6 term default then]
-       reject;
+       discard;
[edit firewall family inet6 filter analytics-in6]
-      /*
-       ** Needed until all flows listed in T198623 use the proxies
-       */
-      term https-public {
-          from {
-              destination-address {
-                  /* lists */
-                  2620:0:861:1:208:80:154:21/128;
-                  /* text-lb.eqiad */
-                  2620:0:861:ed1a::1/128;
-                  /* misc-web-lb.eqiad */
-                  2620:0:861:ed1a::3:d/128;
-              }
-              next-header tcp;
-              destination-port 443;
-          }
-          then accept;
-      }

Mentioned in SAL (#wikimedia-operations) [2018-08-27T17:37:05Z] <XioNoX> pushing the above analytics-in changes to cr1/2-eqiad - T198623

elukey changed the point value for this task from 8 to 21. Aug 28 2018, 6:24 AM
elukey moved this task from In Progress to Done on the Analytics-Kanban board.

Mentioned in SAL (#wikimedia-operations) [2018-08-29T09:33:46Z] <elukey> cr1/2-eqiad: update analytics-in4 filter with the new archiva host, add a new term 'archiva' to analytics-in6 filter - T198623

Mentioned in SAL (#wikimedia-operations) [2018-09-18T16:33:58Z] <XioNoX> delete filter common-infrastructure4 on cr1/2-eqiad, unused/obsolete after T198623