
Record traffic flows in and out of eqiad during switchover
Closed, Resolved, Public

Description

Since we'll be switched over to codfw for a few weeks, I think it would be useful to look at the traffic flows in and out of eqiad and audit them for anything unexpected.

The list of requirements is as follows (please edit/change at will):

  • Sampling rate as high as doable
  • I think the primary focus should be on flows destined to codfw, although the more coverage we have the better
  • Recording should run for as long as we feel comfortable with; I'd say no less than 24-48h

Turnilo links for Jul 21 -> Jul 26

direction      | proto      | flags                                  | link
codfw -> eqiad | v4 private | SYN by dst port                        | https://w.wiki/3gad
eqiad -> codfw | v4 private | SYN by dst port                        | https://w.wiki/3gae
codfw -> eqiad | v4 private | SYN by dst port, no thanos-swift ports | https://w.wiki/3xGz
eqiad -> codfw | v4 private | SYN by dst port, no thanos-swift ports | https://w.wiki/3xH2
codfw -> eqiad | v4 private | UDP by dst port                        | https://w.wiki/3xMX
eqiad -> codfw | v4 private | UDP by dst port                        | https://w.wiki/3xMa

Breakdown TCP SYNs codfw -> eqiad

  • kafka-ssl kafka-logging + kafka-jumbo + kafka-main (9093)
  • thanos-swift (6000-6025)
  • mcrouter (11214)
  • redis from mw towards mwlog1002 (6379)
  • (unknown) from maps2007 to kubestage1001 (4105)
  • graphite traffic from maps/webperf towards graphite1004 (2003)
  • etherpad traffic SSL (7443)
  • syslog tls towards centrallog (6514)
  • puppetdb postgres (6541)
  • traffic for analytics web towards thorium (8443)
  • plaintext kafka towards kafka-main from cp hosts (9092)
  • otrs ssl traffic (1443)
  • thanos rule metrics towards thanos-fe (17902)
  • graphite plaintext carbon traffic towards graphite1004 (1903)
  • rsync plaintext towards deploy1002/releases1002 (873)
  • ssh traffic towards a bunch of hosts like dbproxy/mw/db, I am assuming for interactive purposes

Breakdown TCP SYNs eqiad -> codfw

  • mysql traffic both on 3306 and multiinstance ports (3317 3311 3314 3312 3318 3315 3316 3313 3323 3325 3320 3321 3322)
  • tls traffic towards chartmuseum2002 and mwmaint2002 (443)
  • kafka traffic towards kafka-main (9093)
  • graphite plaintext carbon traffic towards graphite1004 (1903)
  • syslog tls towards centrallog (6514)
  • thanos rule metrics towards thanos-fe (17902)
  • ssh traffic towards a bunch of hosts like mw/ores/parse, I am assuming for interactive purposes
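For repeating the audit, the two breakdowns above can be turned into a small lookup that labels flows by destination port. This is only a sketch: the port numbers come from the lists above, while the label strings, the function names `classify` and `summarize`, the `(src, dst, dst_port)` tuple shape, and the use of port 22 for ssh are my own assumptions and not anything exported by Turnilo.

```python
from collections import Counter

# Ports copied from the breakdown lists above; label names are illustrative.
SERVICE_BY_PORT = {
    9093: "kafka-ssl",
    9092: "kafka-plaintext",
    11214: "mcrouter",
    6379: "redis",
    2003: "graphite",
    1903: "graphite-carbon",
    7443: "etherpad-ssl",
    6514: "syslog-tls",
    6541: "puppetdb-postgres",
    8443: "analytics-web",
    1443: "otrs-ssl",
    17902: "thanos-rule",
    873: "rsync",
    443: "tls",
    3306: "mysql",
    22: "ssh",  # assumption: the lists mention ssh but not its port
}

# MySQL multi-instance ports listed in the eqiad -> codfw breakdown.
MYSQL_MULTIINSTANCE = {
    3311, 3312, 3313, 3314, 3315, 3316, 3317, 3318,
    3320, 3321, 3322, 3323, 3325,
}


def classify(dst_port: int) -> str:
    """Return a service label for a destination port, or "unknown"."""
    if 6000 <= dst_port <= 6025:  # thanos-swift range from the list
        return "thanos-swift"
    if dst_port in MYSQL_MULTIINSTANCE:
        return "mysql"
    return SERVICE_BY_PORT.get(dst_port, "unknown")


def summarize(flows):
    """Count flows per service label; each flow is (src, dst, dst_port)."""
    return Counter(classify(dst_port) for _src, _dst, dst_port in flows)
```

Anything that comes back as "unknown" (for example port 4105 from maps2007 to kubestage1001) is a candidate for a closer look.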

Event Timeline

Restricted Application added a subscriber: Aklapper.

The proper fix is T263277.

However there are 2 options to get data quickly and temporarily:

Option 1, the easiest and "cleanest": enable netflow sampling on the relevant interfaces.
Data will then show up in the wmf_netflow Turnilo dashboard, where it can be sliced and diced by filtering on source and destination IPs (eqiad/codfw/private space).
The main drawback is that it's sampled at 1:1000.

Option 2: enable syslog logging on the relevant interfaces. That will log all the packets traversing the interfaces, but might overwhelm the routers or the logging infrastructure.
The plus side is that ELK parsing for those syslog messages is already configured (and the dashboard exists).

So unless the 1:1000 sampling ratio is unacceptable, I'd rather do option 1.

Agreed, option 1 seems easier and safer than option 2. The sampling isn't great, but I think it's not the end of the world if we're sampling for multiple days (could we turn the sampling up only for those interfaces?).

I think we should try option 1, for example starting on Mon 12th, and see what the results are and whether we're happy with them. If the results are not great, we can consider syslog (perhaps for a short time in the US night/morning span to observe/mitigate impact).
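To put rough numbers on why several days of 1:1000 sampling can still be useful, here is a back-of-the-envelope sketch. It assumes each packet is sampled independently with probability 1/1000, which is a simplification of how the routers actually pick samples, and the function names are made up for illustration. When the dashboard view is filtered on SYN, `packets` would be the number of SYN packets a flow sends during the recording window.

```python
import math


def detection_probability(packets: int, sampling_rate: int = 1000) -> float:
    """Chance that at least one of `packets` packets is sampled at 1:sampling_rate.

    Assumes independent per-packet sampling (a simplification).
    """
    return 1.0 - (1.0 - 1.0 / sampling_rate) ** packets


def packets_needed(target: float, sampling_rate: int = 1000) -> int:
    """Smallest packet count that reaches `target` detection probability."""
    if not 0.0 < target < 1.0:
        raise ValueError("target must be strictly between 0 and 1")
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - 1.0 / sampling_rate))


if __name__ == "__main__":
    for n in (100, 1000, 5000, 10000):
        print(f"{n:>6} packets -> {detection_probability(n):.3f}")
    print("packets for 95% detection:", packets_needed(0.95))
```

Under that assumption, low-volume flows are easy to miss in a short window, which is the argument for recording over multiple days rather than hours.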

Pushing the following (and similar on cr2) should do the trick. As it's only for a few days, and adding that "hack" to Homer would not be trivial, I'm thinking of applying the change locally rather than pushing it through Homer.
@cmooney let me know what you think.

re0.cr1-eqiad# show | compare 
[edit interfaces xe-4/2/0 unit 0 family inet filter]
+        output sample-accept4;
[edit interfaces xe-4/2/0 unit 0 family inet6]
+       filter {
+           input sample-accept6;
+           output sample-accept6;
+       }
[edit firewall family inet filter transport-in4 term default then]
+       sample;
[edit firewall family inet]
      filter sample-drop4 { ... }
+     filter sample-accept4 {
+         term default {
+             then {
+                 sample;
+                 accept;
+             }
+         }
+     }
[edit firewall family inet6]
      filter sample-drop6 { ... }
+     filter sample-accept6 {
+         term default {
+             then {
+                 sample;
+                 accept;
+             }
+         }
+     }

Looks good to me @ayounsi if you want to commit.

I would totally agree btw. Netflow is probably handled in silicon, or at least should be fairly efficient. I'm not familiar with logging flows to syslog on JunOS, but it does sound like it'd ask a lot more of the CPU. Plus we'd need to do a lot of work to pull metrics from the resulting syslogs, whereas the Netflow/Turnilo pipeline is already there.

Mentioned in SAL (#wikimedia-operations) [2021-07-21T07:44:12Z] <XioNoX> push extra sampling on cr1-eqiad - T286038

Mentioned in SAL (#wikimedia-operations) [2021-07-21T07:56:25Z] <XioNoX> push extra sampling on cr2-eqiad - T286038

re0.cr2-eqiad# show | compare 
[edit interfaces xe-3/2/2 unit 0 family inet filter]
+        output sample-accept4;
[edit interfaces xe-3/2/2 unit 0 family inet6]
+       filter {
+           input sample-accept6;
+           output sample-accept6;
+       }
[edit firewall family inet filter transport-in4 term default then]
+       sample;
[edit firewall family inet]
      filter sample-drop4 { ... }
+     filter sample-accept4 {
+         term default {
+             then {
+                 sample;
+                 accept;
+             }
+         }
+     }
[edit firewall family inet6]
      filter sample-drop6 { ... }
+     filter sample-accept6 {
+         term default {
+             then {
+                 sample;
+                 accept;
+             }
+         }
+     }

Talked to @fgiunchedi on IRC; let us know when to roll back. Ideally before the end of the week so we don't keep "hacks" for too long.

Thank you @ayounsi @cmooney! Could we keep the sampling for a week straight? I understand if you are not comfortable with it though; in that case reverting on Fri sounds good to me. I'll be posting Turnilo links with findings etc.

Legoktm triaged this task as Medium priority. (Jul 26 2021, 11:51 PM)

@fgiunchedi anything left to do for netops or is it ok to close the task?

For netops I think nothing left to do (removing tag), thanks for your help!

Perhaps the most surprising result I found so far is kafka plaintext traffic from cp hosts to kafka-main1* (continues to this day) cc Traffic

root@kafka-main1004:~# tcpdump -i any 'port 9092 and src net 10.192/16'
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on any, link-type LINUX_SLL (Linux cooked), capture size 262144 bytes
14:52:53.723841 IP cp2039.codfw.wmnet.31917 > kafka-main1004.eqiad.wmnet.9092: Flags [F.], seq 2502456435, ack 3149126616, win 83, options [nop,nop,TS val 1032650607 ecr 2124169888], length 0
14:52:53.724395 IP cp2039.codfw.wmnet.5299 > kafka-main1004.eqiad.wmnet.9092: Flags [S], seq 109786570, win 42340, options [mss 1460,sackOK,TS val 1032650608 ecr 0,nop,wscale 9], length 0
14:52:53.757454 IP cp2039.codfw.wmnet.5299 > kafka-main1004.eqiad.wmnet.9092: Flags [.], ack 3147550372, win 83, options [nop,nop,TS val 1032650641 ecr 2124169922], length 0
14:52:53.757521 IP cp2039.codfw.wmnet.5299 > kafka-main1004.eqiad.wmnet.9092: Flags [P.], seq 0:30, ack 1, win 83, options [nop,nop,TS val 1032650641 ecr 2124169922], length 30
14:52:53.791064 IP cp2039.codfw.wmnet.5299 > kafka-main1004.eqiad.wmnet.9092: Flags [.], ack 273, win 83, options [nop,nop,TS val 1032650674 ecr 2124169956], length 0
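For anyone wanting to tally which hosts are still opening plaintext connections, a capture like the one above can be summarized with a few lines of Python. This is a minimal sketch that only understands the default one-line tcpdump format shown here; the function name `syn_sources` and the idea of counting pure SYN packets per source host are my own choices, not an existing tool.

```python
import re
from collections import Counter

# Matches lines such as:
# 14:52:53.724395 IP cp2039.codfw.wmnet.5299 > kafka-main1004.eqiad.wmnet.9092: Flags [S], ...
LINE_RE = re.compile(
    r"^\S+ IP (?P<src>\S+)\.(?P<sport>\d+) > (?P<dst>\S+)\.(?P<dport>\d+): "
    r"Flags \[(?P<flags>[^\]]+)\]"
)


def syn_sources(lines, dport=9092):
    """Count new connections (pure SYN packets) to `dport`, per source host."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if match is None:
            continue  # header lines, non-IP lines, unexpected formats
        if match["flags"] == "S" and int(match["dport"]) == dport:
            counts[match["src"]] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Usage idea: pipe tcpdump output into this script.
    for host, count in syn_sources(sys.stdin).most_common():
        print(f"{count:>6} {host}")
```

Counting only packets flagged `[S]` keeps one entry per connection attempt instead of one per packet, which matches the "SYN by dst port" view used in the Turnilo links.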

Change 714795 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):

[operations/puppet@production] cache: Support TLS on kafka::statsv

https://gerrit.wikimedia.org/r/714795

Change 714796 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):

[operations/puppet@production] hieradata: Enable SSL for statsv varnishkafka@cp4032

https://gerrit.wikimedia.org/r/714796

Change 714795 merged by Vgutierrez:

[operations/puppet@production] cache: Support TLS on kafka::statsv

https://gerrit.wikimedia.org/r/714795

Change 714796 merged by Vgutierrez:

[operations/puppet@production] hieradata: Enable SSL for statsv varnishkafka@cp4032

https://gerrit.wikimedia.org/r/714796

Change 714964 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):

[operations/puppet@production] hieradata: Enable SSL cluster wide for statsv varnishkafka

https://gerrit.wikimedia.org/r/714964

Mentioned in SAL (#wikimedia-operations) [2021-08-26T09:21:39Z] <elukey> elukey@kafka-main1001:~$ kafka acls --add --allow-principal User:CN=varnishkafka --producer --topic statsv - T286038

Mentioned in SAL (#wikimedia-operations) [2021-08-26T10:07:31Z] <vgutierrez> disable puppet on cp-text to merge I52cf2a573980e33487d1f05f19b192ae7d13d717 - T286038

Change 714964 merged by Vgutierrez:

[operations/puppet@production] hieradata: Enable SSL cluster wide for statsv varnishkafka

https://gerrit.wikimedia.org/r/714964

Thanks for pointing it out. It has already been fixed. Thanks @elukey for the support :)

fgiunchedi claimed this task.

I'm boldly resolving this since we have subtasks open for plaintext traffic that needs encryption. We can repeat the audit at the next switchover.