Page MenuHomePhabricator

tcp handshake failure between pfw3-eqiad and frlog1001:6514
Closed, ResolvedPublic

Description

From T263674.

Current status:
pfw3-codfw sends security syslog (over TCP) successfully to both eqiad and codfw collectors
pfw3-eqiad sends security syslog (over TCP) successfully to only codfw

eqiad doesn't work.

pfw3-eqiad> restart security-log gracefully
Sep 24 16:04:12 pfw3-eqiad RT_SYSTEM: RTLOG_CONN_OPEN: Connection established syslog-tls-stream-codfw TLS 10.64.40.65/12432 10.195.0.76/6514
Sep 24 16:04:23 pfw3-eqiad RT_SYSTEM: RTLOG_CONN_ERROR: Connection error syslog-tls-stream Com 12429 abort
Sep 24 16:04:23 pfw3-eqiad RT_SYSTEM: RTLOG_CONN_ERROR: Connection error syslog-tls-stream Error code: major 3 minor 1 code 110, description:TCP time out after SYN is sent out
Sep 24 16:04:23 pfw3-eqiad RT_SYSTEM: RTLOG_CONN_ERROR: Connection error syslog-tls-stream status: 0, Error code: major 3 minor 1 code 110, description:TCP time out after SYN is sent out

Using tcpdump on frlog1001 (which is in the same vlan as the pfw3 interface initiating the handshake), we can see the pfw3 sending the intial SYN, then frlog1001 sending back a SYN-ACK, and then both re-trying.

On the pfw3 side, we see the following:

15:54:35.742502 In IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto: TCP (6), length: 44) 10.64.40.72.6514 > 10.64.40.65.12118: S 1037813821:1037813821(0) ack 2167186818 win 29200 <mss 1460>

So not sure why it's not registering the SYN-ACK.

Maybe a pfw3 bug?

Event Timeline

ayounsi triaged this task as High priority.Sep 25 2020, 9:31 AM
ayounsi created this task.
ayounsi created this object with visibility "WMF-NDA (Project)".
Restricted Application added a subscriber: Aklapper. · View Herald Transcript

Enabled traceoptions on pfw3-eqiad, but nothing looks interesting.

Oct 26 16:30:13 rtlogd: received 3 log entry, total size 827 bytes
Oct 26 16:30:13 rtlogd: received 0xab payload bytes in 819 bytes from log forwarder
Oct 26 16:30:13 rtlogd: receives log RTLOG_CONN_ERROR from RT_SYSTEM at severity 13
                                        
Oct 26 16:30:13 rtlogd: miscellaneous string(len=51)=Connection error syslog-tls-stream Com 12546 abort
                                        
Oct 26 16:30:13 rtlogd: attribute-list(len=66)=stream-name 17 syslog-tls-stream error-message 15 Com 12546 abort
                                        
Oct 26 16:30:13 rtlogd: forwarding 208 bytes
Oct 26 16:30:13 rtlogd: forwarded log to eventd
Oct 26 16:30:13 rtlogd: received 0x135 payload bytes in 644 bytes from log forwarder
Oct 26 16:30:13 rtlogd: receives log RTLOG_CONN_ERROR from RT_SYSTEM at severity 13
                                        
Oct 26 16:30:13 rtlogd: miscellaneous string(len=120)=Connection error syslog-tls-stream Error code: major 3 minor 1 code 110, description:TCP time out after SYN is sent out
 
Oct 26 16:30:13 rtlogd: attribute-list(len=135)=stream-name 17 syslog-tls-stream error-message 84 Error code: major 3 minor 1 code 110, description:TCP time out after SYN is sent out
 
Oct 26 16:30:13 rtlogd: forwarding 348 bytes
Oct 26 16:30:13 rtlogd: forwarded log to eventd
Oct 26 16:30:13 rtlogd: received 0x14b payload bytes in 331 bytes from log forwarder
Oct 26 16:30:13 rtlogd: receives log RTLOG_CONN_ERROR from RT_SYSTEM at severity 13
 
Oct 26 16:30:13 rtlogd: miscellaneous string(len=131)=Connection error syslog-tls-stream status: 0, Error code: major 3 minor 1 code 110, description:TCP time out after SYN is sent out
 
Oct 26 16:30:13 rtlogd: attribute-list(len=146)=stream-name 17 syslog-tls-stream error-message 95 status: 0, Error code: major 3 minor 1 code 110, description:TCP time out after SYN is sent out

I tried to remove/add the configuration as well, with not luck.

Opened Juniper case 2020-1027-0278.

Applied:

[edit interfaces reth0 unit 1134 family inet]
+       filter {
+           input PCAP;
+           output PCAP;
+       }
[edit forwarding-options]
+   packet-capture {
+       file filename syslog-capture;
+       maximum-capture-size 1500;
+   }
[edit firewall]
+   filter PCAP {
+       term 1 {
+           from {
+               source-address {
+                   10.64.40.72/32;
+               }
+               destination-address {
+                   10.64.40.65/32;
+               }
+           }
+           then {
+               sample;
+               accept;                 
+           }
+       }
+       term 2 {
+           from {
+               source-address {
+                   10.64.40.65/32;
+               }
+               destination-address {
+                   10.64.40.72/32;
+               }
+           }
+           then {
+               sample;
+               accept;
+           }
+       }
+       term allow-all-else {
+           then accept;
+       }
+   }

It created 2 files:

syslog-capture.xe-7.0.17 traffic from the SRX to frlog1001 (so pretty much only SYN)
syslog-capture.reth0 containing return traffic from frlog1001 to the SRX (And return SYN-ACK retransmits)

Followed up with JTAC, it looks to me that the PFE isn't registering the SYN-ACK and passing it on to the RE.

From JTAC:

I review the RSI provided on case 2020-1027-0278 and I can see that the cluster working on active/active mode:

root@pfw3-eqiad> show chassis cluster status
[...]
Redundancy group: 0 , Failover count: 1
node0  100      primary        no      no       None
node1  1        secondary      no      no       None
Redundancy group: 1 , Failover count: 2
node0  100      secondary      no      no       None
node1  1        primary        no      no       None

Currently TCP logging is not supported for Active-Active Cluster setups. When attempting to use Active-Active Cluster setup, the log client residing on the active node is unaware of the log client on the passive node that has a forwarding log session. This results in incorrect processing of the TCP log traffic and results in RST packets being generated.

The solution for this is to move all Redundancy Groups (RGs) to one node for Active-Passive deployment setup.

Please check the following link for more information: https://kb.juniper.net/InfoCenter/index?page=content&id=kb32393&actp=METADATA

Ideal state is node0 to be primary for both redundancy groups (0: control plane, 1: data plane).

Why is it not?

pfw3-eqiad> show chassis cluster information
[...]
Feb 13 17:38:08 primary        secondary-hold Monitor failed: IF 
Feb 13 17:38:09 secondary-hold secondary      Ready to become secondary

pfw3-eqiad> show chassis cluster interfaces
[...]
Interface Monitoring:                   
    Interface         Weight    Status    Redundancy-group
    xe-7/0/17         255       Up        1   
    xe-0/0/17         255       Up        1   

pfw3-eqiad> show interfaces xe-0/0/17
[...]
Last flapped   : 2020-02-13 17:39:41 UTC

So on Feb 13th, xe-0/0/17 (to fasw-c1a-eqiad:xe-0/2/0), flapped.
Unfortunately logs are long gone but we know that there was no maintenance and the remote side shows errors:

fasw-c-eqiad> show interfaces xe-0/2/0 extensive
[...]
Input errors:
  Framing errors: 23  (all CRC/Align errors)
Output errors:
  Carrier transitions: 45

The errors are not increasing while the interface is still seeing little broadcast traffic in the fasw->pfw direction.

In short, an optic of fiber went bad, the firewall detected the issue and failed over as expected.

According to Smokeping there was no noticeable impact of the failover, but it would be interesting to know if it had any service impact (cc @Jgreen).

As pfw3-eqiad is a critical piece of infra, we need to:

  • T271295 - Replace both optics and fiber (pfw3a:xe-0/0/17 and fasw-c1a-eqiad:xe-0/2/0)
    • No impact expected (link not used)
  • Manually fail-back to node0 being primary for RG1
    • Possible few seconds connectivity outage
  • T271298 - Add a monitoring check to alert if a failover happened
    • As it can mean loss of redundancy

@Jgreen let me know when we can schedule the manual failback now that the optics have been replaced
(I'm back on the 1st).

@Jgreen let me know when we can schedule the manual failback now that the optics have been replaced
(I'm back on the 1st).

Looking at the FR calendar for upcoming campaigns, this week (Feb 1 - 5) is good.

The good news:

  • Failover doesn't cause any issue
  • The failover fixed the log issue
    • pfw3-eqiad RT_SYSTEM: RTLOG_CONN_OPEN: Connection established syslog-tls-stream TLS 10.64.40.65/12762 10.64.40.72/6514

The bad news:

  • Interface errors are still there on the pfw-fasw link, manually failing back to node1, reopening T271295
ayounsi claimed this task.

Optics replaced again in T271295, redundancy group 1 failed back to node0 where it belong.
Everything now looks good.

ayounsi changed the visibility from "WMF-NDA (Project)" to "Public (No Login Required)".Feb 9 2021, 7:21 PM