
[Ceph] [alert] toolschecker: NFS read/writeable on Toolforge
Closed, Resolved · Public

Description

Something happened in Ceph today that triggered an NFS alert (critical, I received a page). The alert resolved on its own after a few minutes, without any action.

Timeline as recorded in #wikimedia-cloud-feed (all times UTC):

06:41 <+jinxer-wm> (CephClusterInWarning) firing: The ceph cluster in  is in warning status, that means that it's high availability is compromised, things should still be working as expected. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWar

06:45 <+wmcs-alerts> (ProbeDown) firing: (2) Service tools-k8s-haproxy-3:30000 has failed probes (http_admin_toolforge_org_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown

06:46 <+icinga-wm> PROBLEM - toolschecker: NFS read/writeable on labs instances on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 504 Gateway Time-out - string OK not found on http://checker.tools.wmflabs.org:80/nfs/home - 340 bytes in 60.008 second response time https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker

06:47 <+icinga-wm> PROBLEM - toolschecker: check mtime mod from tools cron job on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 504 Gateway Time-out - string OK not found on http://checker.tools.wmflabs.org:80/cron - 340 bytes in 60.004 second response time https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker

06:50 <+icinga-wm> RECOVERY - toolschecker: NFS read/writeable on labs instances on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 158 bytes in 52.649 second response time https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker

06:50 <+wmcs-alerts> (ProbeDown) resolved: (5) Service tools-k8s-haproxy-3:30000 has failed probes (http_admin_toolforge_org_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown

06:52 <+icinga-wm> RECOVERY - toolschecker: check mtime mod from tools cron job on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 158 bytes in 0.012 second response time https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker

06:56 <+jinxer-wm> (CephClusterInWarning) resolved: The ceph cluster in  is in warning status, that means that it's high availability is compromised, things should still be working as expected. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInW
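
For reference, the underlying Icinga check can be reproduced by hand with a plain HTTP request against the toolschecker endpoint; the alert text above suggests the check passes when the response body contains the string "OK" (a minimal sketch, with the 60-second timeout mirroring the one in the alert):

# Manually exercise the toolschecker NFS read/write check (assumption: a plain
# GET is enough, as implied by the Icinga alert text above).
curl -sS --max-time 60 http://checker.tools.wmflabs.org/nfs/home
# On success the body contains "OK"; during the incident this endpoint was
# returning HTTP 504 Gateway Time-out instead.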

Event Timeline

taavi renamed this task from [Ceph] [alert] toolschecker: NFS read/writeable on labs instances to [Ceph] [alert] toolschecker: NFS read/writeable on Toolforge. Sep 12 2023, 7:29 AM

I found some more details in /var/log/ceph/ceph.log (cloudcephmon1001). I'm pasting some lines from the log below; there are thousands of similar lines between 6:35 and 6:50, but they all mention osd.94, so it looks like the issue was limited to a single OSD. ceph osd tree shows that osd.94 is on the host cloudcephosd1012.

2023-09-12T06:35:14.948510+0000 osd.94 (osd.94) 11576 : cluster [WRN] slow request osd_op(client.621007689.0:1944674843 8.1d5 8:ab8b1602:::rbd_data.f75d5cdeabff0d.0000000000003777:head [write 2682880~4096 in=4096b] snapc 441d=[441d,2516] ondisk+write+known_if_redirected e37360578) initiated 2023-09-12T06:34:44.393880+0000 currently waiting for sub ops

2023-09-12T06:35:17.915316+0000 mon.cloudcephmon1003 (mon.0) 27307674 : cluster [WRN] Health check failed: 2 slow ops, oldest one blocked for 32 sec, osd.94 has slow ops (SLOW_OPS)

2023-09-12T06:40:00.000206+0000 mon.cloudcephmon1003 (mon.0) 27308071 : cluster [WRN] Health detail: HEALTH_WARN 14 slow ops, oldest one blocked for 313 sec, osd.94 has slow ops

2023-09-12T06:50:00.000222+0000 mon.cloudcephmon1003 (mon.0) 27308949 : cluster [WRN] Health detail: HEALTH_WARN 25 slow ops, oldest one blocked for 913 sec, osd.94 has slow ops

2023-09-12T06:50:22.616280+0000 mon.cloudcephmon1003 (mon.0) 27308980 : cluster [WRN] Health check update: 24 slow ops, oldest one blocked for 933 sec, osd.94 has slow ops (SLOW_OPS)

2023-09-12T06:50:27.619846+0000 mon.cloudcephmon1003 (mon.0) 27308986 : cluster [WRN] Health check update: 1 slow ops, oldest one blocked for 364 sec, osd.94 has slow ops (SLOW_OPS)

2023-09-12T06:50:29.939406+0000 mon.cloudcephmon1003 (mon.0) 27308988 : cluster [INF] Health check cleared: SLOW_OPS (was: 1 slow ops, oldest one blocked for 364 sec, osd.94 has slow ops)
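
(For reference, a minimal sketch of the commands used to narrow this down, assuming the standard Ceph CLI on a cloudcephmon host:)

# Count which OSDs show up in the slow-request lines for the incident window;
# here only osd.94 appears.
grep 'slow request' /var/log/ceph/ceph.log | grep '2023-09-12T06:' | grep -o 'osd\.[0-9]*' | sort | uniq -c

# Map the OSD id to its host (and CRUSH location).
ceph osd find 94
# or browse the full topology:
ceph osd tree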

I checked this Grafana dashboard showing information about the host cloudcephosd1012 and didn't see any significant change around 6:35 today.

fnegri claimed this task.

Closing this as "Resolved"; I won't investigate further unless this happens again.

I see lost jumbo frames and failed pings:

image.png (513 KB)

There's a single spike in disk latency for sdf on that host:

image.png (423 KB)

Interesting, likely a networking issue then.

I see there were more lost jumbo frames in the last 24 hours, but they didn't cause any alert.

Screenshot 2023-09-12 at 11.21.10.png (81 KB)
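
(A quick manual way to confirm whether jumbo frames make it across the cluster network is a large ping with the Don't Fragment bit set; a sketch, assuming MTU 9000 on that network and that ICMP is allowed:)

# 8972 bytes of ICMP payload + 28 bytes of headers = a 9000-byte IP packet.
# If the jumbo path is broken this fails while a default-sized ping still works.
ping -c 3 -M do -s 8972 192.168.6.2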

yep, just trying to gather as much info as I can before it vanishes xd

In the cloudcephosd1012 logs, right before the first slow op in that time window, I see:

Sep 12 06:35:12 cloudcephosd1012 ulogd[37236]: [fw-in-drop] IN=ens3f1np1 OUT= MAC=bc:97:e1:28:90:91:88:d9:8f:b7:17:e0:08:00 SRC=192.168.6.2 DST=192.168.4.12 LEN=542 TOS=00 PREC=0x00 TTL=62 ID=10368 DF PROTO=TCP SPT=6802 DPT=59820 SEQ=2315390069 ACK=3411136089 WINDOW=16384 ACK PSH URGP=0 UID=499 GID=499 MARK=0
Sep 12 06:35:12 cloudcephosd1012 ulogd[37236]: [fw-in-drop] IN=ens3f1np1 OUT= MAC=bc:97:e1:28:90:91:88:d9:8f:b7:17:e0:08:00 SRC=192.168.6.2 DST=192.168.4.12 LEN=64 TOS=00 PREC=0x00 TTL=62 ID=10369 DF PROTO=TCP SPT=6802 DPT=59820 SEQ=2315390559 ACK=3411136089 WINDOW=16384 ACK URGP=0 UID=499 GID=499 MARK=0

That's traffic between OSDs that should not be dropped :/

The full log:

Sep 12 06:35:12 cloudcephosd1012 ulogd[37236]: [fw-in-drop] IN=ens3f1np1 OUT= MAC=bc:97:e1:28:90:91:88:d9:8f:b7:17:e0:08:00 SRC=192.168.6.2 DST=192.168.4.12 LEN=542 TOS=00 PREC=0x00 TTL=62 ID=10368 DF PROTO=TCP SPT=6802 DPT=59820 SEQ=2315390069 ACK=3411136089 WINDOW=16384 ACK PSH URGP=0 UID=499 GID=499 MARK=0
Sep 12 06:35:12 cloudcephosd1012 ulogd[37236]: [fw-in-drop] IN=ens3f1np1 OUT= MAC=bc:97:e1:28:90:91:88:d9:8f:b7:17:e0:08:00 SRC=192.168.6.2 DST=192.168.4.12 LEN=64 TOS=00 PREC=0x00 TTL=62 ID=10369 DF PROTO=TCP SPT=6802 DPT=59820 SEQ=2315390559 ACK=3411136089 WINDOW=16384 ACK URGP=0 UID=499 GID=499 MARK=0
Sep 12 06:35:14 cloudcephosd1012 ceph-osd[14100]: 2023-09-12T06:35:14.945+0000 7f2864d95700  0 log_channel(cluster) log [WRN] : slow request osd_op(client.621007689.0:1944674843 8.1d5 8:ab8b1602:::rbd_data.f75d5cdeabff0d.0000000000003777:head [write 2682880~4096 in=4096b] snapc 441d=[441d,2516] ondisk+write+known_if_redirected e37360578) initiated 2023-09-12T06:34:44.393880+0000 
Sep 12 06:35:14 cloudcephosd1012 ceph-osd[14100]: 2023-09-12T06:35:14.945+0000 7f2864d95700  0 log_channel(cluster) log [WRN] : slow request osd_op(client.621007689.0:1944674848 8.1d5 8:ab8b1602:::rbd_data.f75d5cdeabff0d.0000000000003777:head [write 3747840~4096 in=4096b] snapc 441d=[441d,2516] ondisk+write+known_if_redirected e37360578) initiated 2023-09-12T06:34:44.403450+0000 
Sep 12 06:35:14 cloudcephosd1012 ceph-osd[14100]: 2023-09-12T06:35:14.945+0000 7f2864d95700 -1 osd.94 37360605 get_health_metrics reporting 2 slow ops, oldest is osd_op(client.621007689.0:1944674843 8.1d5 8:ab8b1602:::rbd_data.f75d5cdeabff0d.0000000000003777:head [write 2682880~4096 in=4096b] snapc 441d=[441d,2516] ondisk+write+known_if_redirected e37360578)
Sep 12 06:35:14 cloudcephosd1012 ceph-osd[14100]: 2023-09-12T06:35:14.945+0000 7f2864d95700 -1 osd.94 37360605 get_health_metrics reporting 2 slow ops, oldest is osd_op(client.621007689.0:1944674843 8.1d5 8:ab8b1602:::rbd_data.f75d5cdeabff0d.0000000000003777:head [write 2682880~4096 in=4096b] snapc 441d=[441d,2516] ondisk+write+known_if_redirected e37360578)
Sep 12 06:35:15 cloudcephosd1012 ceph-osd[14100]: 2023-09-12T06:35:15.925+0000 7f2864d95700  0 log_channel(cluster) log [WRN] : slow request osd_op(client.621007689.0:1944674843 8.1d5 8:ab8b1602:::rbd_data.f75d5cdeabff0d.0000000000003777:head [write 2682880~4096 in=4096b] snapc 441d=[441d,2516] ondisk+write+known_if_redirected e37360578) initiated 2023-09-12T06:34:44.393880+0000 
Sep 12 06:35:15 cloudcephosd1012 ceph-osd[14100]: 2023-09-12T06:35:15.925+0000 7f2864d95700  0 log_channel(cluster) log [WRN] : slow request osd_op(client.621007689.0:1944674848 8.1d5 8:ab8b1602:::rbd_data.f75d5cdeabff0d.0000000000003777:head [write 3747840~4096 in=4096b] snapc 441d=[441d,2516] ondisk+write+known_if_redirected e37360578) initiated 2023-09-12T06:34:44.403450+0000 
Sep 12 06:35:15 cloudcephosd1012 ceph-osd[14100]: 2023-09-12T06:35:15.925+0000 7f2864d95700 -1 osd.94 37360606 get_health_metrics reporting 2 slow ops, oldest is osd_op(client.621007689.0:1944674843 8.1d5 8:ab8b1602:::rbd_data.f75d5cdeabff0d.0000000000003777:head [write 2682880~4096 in=4096b] snapc 441d=[441d,2516] ondisk+write+known_if_redirected e37360578)
Sep 12 06:35:15 cloudcephosd1012 ceph-osd[14100]: 2023-09-12T06:35:15.925+0000 7f2864d95700 -1 osd.94 37360606 get_health_metrics reporting 2 slow ops, oldest is osd_op(client.621007689.0:1944674843 8.1d5 8:ab8b1602:::rbd_data.f75d5cdeabff0d.0000000000003777:head [write 2682880~4096 in=4096b] snapc 441d=[441d,2516] ondisk+write+known_if_redirected e37360578)
Sep 12 06:35:16 cloudcephosd1012 ceph-osd[14100]: 2023-09-12T06:35:16.937+0000 7f2864d95700  0 log_channel(cluster) log [WRN] : slow request osd_op(client.621007689.0:1944674843 8.1d5 8:ab8b1602:::rbd_data.f75d5cdeabff0d.0000000000003777:head [write 2682880~4096 in=4096b] snapc 441d=[441d,2516] ondisk+write+known_if_redirected e37360578) initiated 2023-09-12T06:34:44.393880+0000 
Sep 12 06:35:16 cloudcephosd1012 ceph-osd[14100]: 2023-09-12T06:35:16.937+0000 7f2864d95700 -1 osd.94 37360606 get_health_metrics reporting 2 slow ops, oldest is osd_op(client.621007689.0:1944674843 8.1d5 8:ab8b1602:::rbd_data.f75d5cdeabff0d.0000000000003777:head [write 2682880~4096 in=4096b] snapc 441d=[441d,2516] ondisk+write+known_if_redirected e37360578)
Sep 12 06:35:16 cloudcephosd1012 ceph-osd[14100]: 2023-09-12T06:35:16.937+0000 7f2864d95700  0 log_channel(cluster) log [WRN] : slow request osd_op(client.621007689.0:1944674848 8.1d5 8:ab8b1602:::rbd_data.f75d5cdeabff0d.0000000000003777:head [write 3747840~4096 in=4096b] snapc 441d=[441d,2516] ondisk+write+known_if_redirected e37360578) initiated 2023-09-12T06:34:44.403450+0000 
Sep 12 06:35:16 cloudcephosd1012 ceph-osd[14100]: 2023-09-12T06:35:16.937+0000 7f2864d95700 -1 osd.94 37360606 get_health_metrics reporting 2 slow ops, oldest is osd_op(client.621007689.0:1944674843 8.1d5 8:ab8b1602:::rbd_data.f75d5cdeabff0d.0000000000003777:head [write 2682880~4096 in=4096b] snapc 441d=[441d,2516] ondisk+write+known_if_redirected e37360578)
Sep 12 06:35:17 cloudcephosd1012 ceph-osd[14100]: 2023-09-12T06:35:17.941+0000 7f2864d95700  0 log_channel(cluster) log [WRN] : slow request osd_op(client.621007689.0:1944674843 8.1d5 8:ab8b1602:::rbd_data.f75d5cdeabff0d.0000000000003777:head [write 2682880~4096 in=4096b] snapc 441d=[441d,2516] ondisk+write+known_if_redirected e37360578) initiated 2023-09-12T06:34:44.393880+0000 
Sep 12 06:35:17 cloudcephosd1012 ceph-osd[14100]: 2023-09-12T06:35:17.941+0000 7f2864d95700 -1 osd.94 37360607 get_health_metrics reporting 2 slow ops, oldest is osd_op(client.621007689.0:1944674843 8.1d5 8:ab8b1602:::rbd_data.f75d5cdeabff0d.0000000000003777:head [write 2682880~4096 in=4096b] snapc 441d=[441d,2516] ondisk+write+known_if_redirected e37360578)
Sep 12 06:35:17 cloudcephosd1012 ceph-osd[14100]: 2023-09-12T06:35:17.941+0000 7f2864d95700  0 log_channel(cluster) log [WRN] : slow request osd_op(client.621007689.0:1944674848 8.1d5 8:ab8b1602:::rbd_data.f75d5cdeabff0d.0000000000003777:head [write 3747840~4096 in=4096b] snapc 441d=[441d,2516] ondisk+write+known_if_redirected e37360578) initiated 2023-09-12T06:34:44.403450+0000 
Sep 12 06:35:17 cloudcephosd1012 ceph-osd[14100]: 2023-09-12T06:35:17.941+0000 7f2864d95700 -1 osd.94 37360607 get_health_metrics reporting 2 slow ops, oldest is osd_op(client.621007689.0:1944674843 8.1d5 8:ab8b1602:::rbd_data.f75d5cdeabff0d.0000000000003777:head [write 2682880~4096 in=4096b] snapc 441d=[441d,2516] ondisk+write+known_if_redirected e37360578)
Sep 12 06:35:18 cloudcephosd1012 sudo[43043]: prometheus : TTY=unknown ; PWD=/ ; USER=root ; COMMAND=/usr/sbin/bmc-info --config-file /tmp/ipmi_exporter-898ade0b389f2fa0535ed791a639dff7
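
(For reference, these [fw-in-drop] entries can be pulled from the journal for the incident window with something like the following; a sketch, assuming ulogd logs to syslog/journald as shown above:)

journalctl -t ulogd --since '2023-09-12 06:30:00' --until '2023-09-12 06:55:00' | grep '\[fw-in-drop\]'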

Hmm, looking at iptables, it seems like that traffic should not have been dropped?

root@cloudcephosd1012:~# iptables -L -n
Chain INPUT (policy DROP)
target     prot opt source               destination         
DROP       all  --  23.226.133.0/24      0.0.0.0/0           
DROP       all  --  31.184.250.10        0.0.0.0/0           
DROP       all  --  92.200.101.215       0.0.0.0/0           
DROP       all  --  93.184.216.34        0.0.0.0/0           
DROP       all  --  169.45.120.238       0.0.0.0/0           
DROP       all  --  218.92.0.16          0.0.0.0/0           
ACCEPT     all  --  0.0.0.0/0            0.0.0.0/0            state RELATED,ESTABLISHED
ACCEPT     all  --  0.0.0.0/0            0.0.0.0/0           
ACCEPT     all  --  0.0.0.0/0            0.0.0.0/0            PKTTYPE = multicast
DROP       tcp  --  0.0.0.0/0            0.0.0.0/0            state NEW tcp flags:!0x17/0x02
ACCEPT     icmp --  0.0.0.0/0            0.0.0.0/0           
ACCEPT     tcp  --  192.168.4.6          192.168.4.12         tcp dpts:6800:7100
...
ACCEPT     tcp  --  192.168.6.2          192.168.4.12         tcp dpts:6800:7100
...
DROP       udp  --  0.0.0.0/0            255.255.255.255      udp spt:67 dpt:68
NFLOG      all  --  0.0.0.0/0            0.0.0.0/0            limit: avg 1/sec burst 5 nflog-prefix  "[fw-in-drop]"

Chain FORWARD (policy ACCEPT)
target     prot opt source               destination         

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination
Sep 12 06:35:12 cloudcephosd1012 ulogd[37236]: [fw-in-drop] IN=ens3f1np1 OUT= MAC=bc:97:e1:28:90:91:88:d9:8f:b7:17:e0:08:00 SRC=192.168.6.2 DST=192.168.4.12 LEN=542 TOS=00 PREC=0x00 TTL=62 ID=10368 DF PROTO=TCP SPT=6802 DPT=59820 SEQ=2315390069 ACK=3411136089 WINDOW=16384 ACK PSH URGP=0 UID=499 GID=499 MARK=0
Sep 12 06:35:12 cloudcephosd1012 ulogd[37236]: [fw-in-drop] IN=ens3f1np1 OUT= MAC=bc:97:e1:28:90:91:88:d9:8f:b7:17:e0:08:00 SRC=192.168.6.2 DST=192.168.4.12 LEN=64 TOS=00 PREC=0x00 TTL=62 ID=10369 DF PROTO=TCP SPT=6802 DPT=59820 SEQ=2315390559 ACK=3411136089 WINDOW=16384 ACK URGP=0 UID=499 GID=499 MARK=0

Note the port order here: 6802 is the source port and 59820 is the destination port. So, as expected, the firewall rules you linked do not apply here. However, I believe that traffic should be return traffic, in which case the "allow ESTABLISHED" rule should have allowed it.
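
(One way to check whether that flow is actually being tracked, and therefore should be matching the "state RELATED,ESTABLISHED" rule, is to look it up in the connection-tracking table; a sketch, assuming the conntrack tool is installed:)

# The connection was originated by this host (192.168.4.12, ephemeral port 59820)
# towards the remote OSD at 192.168.6.2:6802, so that is the "original" direction.
# If no entry shows up, the ESTABLISHED rule cannot match and the return packets
# fall through to the NFLOG/drop at the end of the INPUT chain.
conntrack -L -p tcp --orig-src 192.168.4.12 --orig-dst 192.168.6.2 --orig-port-dst 6802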

That's right, yes. There are not many of those in the logs overall, but during the outage there is a bunch:

Sep 12 06:34:45 cloudcephosd1012 ulogd[37236]: [fw-in-drop] IN=ens3f1np1 OUT= MAC=bc:97:e1:28:90:91:88:d9:8f:b7:17:e0:08:00 SRC=192.168.6.2 DST=192.168.4.12 LEN=52 TOS=00 PREC=0x00 TTL=62 ID=10349 DF PROTO=TCP SPT=6802 DPT=59820 SEQ=2315390069 ACK=3411129941 WINDOW=16376 ACK URGP=0 UID=499 GID=499 MARK=0
Sep 12 06:34:45 cloudcephosd1012 ulogd[37236]: [fw-in-drop] IN=ens3f1np1 OUT= MAC=bc:97:e1:28:90:91:88:d9:8f:b7:17:e0:08:00 SRC=192.168.6.2 DST=192.168.4.12 LEN=96 TOS=00 PREC=0x00 TTL=62 ID=10350 DF PROTO=TCP SPT=6802 DPT=59820 SEQ=2315390069 ACK=3411129941 WINDOW=16384 ACK PSH URGP=0 UID=499 GID=499 MARK=0
Sep 12 06:34:45 cloudcephosd1012 ulogd[37236]: [fw-in-drop] IN=ens3f1np1 OUT= MAC=bc:97:e1:28:90:91:88:d9:8f:b7:17:e0:08:00 SRC=192.168.6.2 DST=192.168.4.12 LEN=253 TOS=00 PREC=0x00 TTL=62 ID=10351 DF PROTO=TCP SPT=6802 DPT=59820 SEQ=2315390113 ACK=3411129941 WINDOW=16384 ACK PSH URGP=0 UID=499 GID=499 MARK=0
Sep 12 06:34:45 cloudcephosd1012 ulogd[37236]: [fw-in-drop] IN=ens3f1np1 OUT= MAC=bc:97:e1:28:90:91:88:d9:8f:b7:17:e0:08:00 SRC=192.168.6.2 DST=192.168.4.12 LEN=96 TOS=00 PREC=0x00 TTL=62 ID=10352 DF PROTO=TCP SPT=6802 DPT=59820 SEQ=2315390314 ACK=3411136089 WINDOW=16384 ACK PSH URGP=0 UID=499 GID=499 MARK=0
Sep 12 06:34:45 cloudcephosd1012 ulogd[37236]: [fw-in-drop] IN=ens3f1np1 OUT= MAC=bc:97:e1:28:90:91:88:d9:8f:b7:17:e0:08:00 SRC=192.168.6.2 DST=192.168.4.12 LEN=253 TOS=00 PREC=0x00 TTL=62 ID=10353 DF PROTO=TCP SPT=6802 DPT=59820 SEQ=2315390358 ACK=3411136089 WINDOW=16384 ACK PSH URGP=0 UID=499 GID=499 MARK=0
Sep 12 06:34:46 cloudcephosd1012 ulogd[37236]: [fw-in-drop] IN=ens3f1np1 OUT= MAC=bc:97:e1:28:90:91:88:d9:8f:b7:17:e0:08:00 SRC=192.168.6.2 DST=192.168.4.12 LEN=542 TOS=00 PREC=0x00 TTL=62 ID=10360 DF PROTO=TCP SPT=6802 DPT=59820 SEQ=2315390069 ACK=3411136089 WINDOW=16384 ACK PSH URGP=0 UID=499 GID=499 MARK=0
Sep 12 06:34:48 cloudcephosd1012 ulogd[37236]: [fw-in-drop] IN=ens3f1np1 OUT= MAC=bc:97:e1:28:90:91:88:d9:8f:b7:17:e0:08:00 SRC=192.168.6.2 DST=192.168.4.12 LEN=542 TOS=00 PREC=0x00 TTL=62 ID=10362 DF PROTO=TCP SPT=6802 DPT=59820 SEQ=2315390069 ACK=3411136089 WINDOW=16384 ACK PSH URGP=0 UID=499 GID=499 MARK=0
Sep 12 06:34:48 cloudcephosd1012 ulogd[37236]: [fw-in-drop] IN=ens3f1np1 OUT= MAC=bc:97:e1:28:90:91:88:d9:8f:b7:17:e0:08:00 SRC=192.168.6.2 DST=192.168.4.12 LEN=64 TOS=00 PREC=0x00 TTL=62 ID=10363 DF PROTO=TCP SPT=6802 DPT=59820 SEQ=2315390559 ACK=3411136089 WINDOW=16384 ACK URGP=0 UID=499 GID=499 MARK=0
Sep 12 06:34:52 cloudcephosd1012 ulogd[37236]: [fw-in-drop] IN=ens3f1np1 OUT= MAC=bc:97:e1:28:90:91:88:d9:8f:b7:17:e0:08:00 SRC=192.168.6.2 DST=192.168.4.12 LEN=542 TOS=00 PREC=0x00 TTL=62 ID=10364 DF PROTO=TCP SPT=6802 DPT=59820 SEQ=2315390069 ACK=3411136089 WINDOW=16384 ACK PSH URGP=0 UID=499 GID=499 MARK=0
Sep 12 06:34:52 cloudcephosd1012 ulogd[37236]: [fw-in-drop] IN=ens3f1np1 OUT= MAC=bc:97:e1:28:90:91:88:d9:8f:b7:17:e0:08:00 SRC=192.168.6.2 DST=192.168.4.12 LEN=64 TOS=00 PREC=0x00 TTL=62 ID=10365 DF PROTO=TCP SPT=6802 DPT=59820 SEQ=2315390559 ACK=3411136089 WINDOW=16384 ACK URGP=0 UID=499 GID=499 MARK=0
Sep 12 06:34:58 cloudcephosd1012 ulogd[37236]: [fw-in-drop] IN=ens3f1np1 OUT= MAC=bc:97:e1:28:90:91:88:d9:8f:b7:17:e0:08:00 SRC=192.168.6.2 DST=192.168.4.12 LEN=542 TOS=00 PREC=0x00 TTL=62 ID=10366 DF PROTO=TCP SPT=6802 DPT=59820 SEQ=2315390069 ACK=3411136089 WINDOW=16384 ACK PSH URGP=0 UID=499 GID=499 MARK=0
Sep 12 06:34:58 cloudcephosd1012 ulogd[37236]: [fw-in-drop] IN=ens3f1np1 OUT= MAC=bc:97:e1:28:90:91:88:d9:8f:b7:17:e0:08:00 SRC=192.168.6.2 DST=192.168.4.12 LEN=64 TOS=00 PREC=0x00 TTL=62 ID=10367 DF PROTO=TCP SPT=6802 DPT=59820 SEQ=2315390559 ACK=3411136089 WINDOW=16384 ACK URGP=0 UID=499 GID=499 MARK=0
Sep 12 06:35:12 cloudcephosd1012 ulogd[37236]: [fw-in-drop] IN=ens3f1np1 OUT= MAC=bc:97:e1:28:90:91:88:d9:8f:b7:17:e0:08:00 SRC=192.168.6.2 DST=192.168.4.12 LEN=542 TOS=00 PREC=0x00 TTL=62 ID=10368 DF PROTO=TCP SPT=6802 DPT=59820 SEQ=2315390069 ACK=3411136089 WINDOW=16384 ACK PSH URGP=0 UID=499 GID=499 MARK=0
Sep 12 06:35:12 cloudcephosd1012 ulogd[37236]: [fw-in-drop] IN=ens3f1np1 OUT= MAC=bc:97:e1:28:90:91:88:d9:8f:b7:17:e0:08:00 SRC=192.168.6.2 DST=192.168.4.12 LEN=64 TOS=00 PREC=0x00 TTL=62 ID=10369 DF PROTO=TCP SPT=6802 DPT=59820 SEQ=2315390559 ACK=3411136089 WINDOW=16384 ACK URGP=0 UID=499 GID=499 MARK=0
Sep 12 06:35:38 cloudcephosd1012 ulogd[37236]: [fw-in-drop] IN=ens3f1np1 OUT= MAC=bc:97:e1:28:90:91:88:d9:8f:b7:17:e0:08:00 SRC=192.168.6.2 DST=192.168.4.12 LEN=542 TOS=00 PREC=0x00 TTL=62 ID=10370 DF PROTO=TCP SPT=6802 DPT=59820 SEQ=2315390069 ACK=3411136089 WINDOW=16384 ACK PSH URGP=0 UID=499 GID=499 MARK=0
Sep 12 06:35:38 cloudcephosd1012 ulogd[37236]: [fw-in-drop] IN=ens3f1np1 OUT= MAC=bc:97:e1:28:90:91:88:d9:8f:b7:17:e0:08:00 SRC=192.168.6.2 DST=192.168.4.12 LEN=64 TOS=00 PREC=0x00 TTL=62 ID=10371 DF PROTO=TCP SPT=6802 DPT=59820 SEQ=2315390559 ACK=3411136089 WINDOW=16384 ACK URGP=0 UID=499 GID=499 MARK=0
Sep 12 06:36:32 cloudcephosd1012 ulogd[37236]: [fw-in-drop] IN=ens3f1np1 OUT= MAC=bc:97:e1:28:90:91:88:d9:8f:b7:17:e0:08:00 SRC=192.168.6.2 DST=192.168.4.12 LEN=542 TOS=00 PREC=0x00 TTL=62 ID=10372 DF PROTO=TCP SPT=6802 DPT=59820 SEQ=2315390069 ACK=3411136089 WINDOW=16384 ACK PSH URGP=0 UID=499 GID=499 MARK=0
Sep 12 06:36:34 cloudcephosd1012 ulogd[37236]: [fw-in-drop] IN=ens3f1np1 OUT= MAC=bc:97:e1:28:90:91:88:d9:8f:b7:17:e0:08:00 SRC=192.168.6.2 DST=192.168.4.12 LEN=64 TOS=00 PREC=0x00 TTL=62 ID=10373 DF PROTO=TCP SPT=6802 DPT=59820 SEQ=2315390559 ACK=3411136089 WINDOW=16384 ACK URGP=0 UID=499 GID=499 MARK=0
Sep 12 06:38:18 cloudcephosd1012 ulogd[37236]: [fw-in-drop] IN=ens3f1np1 OUT= MAC=bc:97:e1:28:90:91:88:d9:8f:b7:17:e0:08:00 SRC=192.168.6.2 DST=192.168.4.12 LEN=542 TOS=00 PREC=0x00 TTL=62 ID=10374 DF PROTO=TCP SPT=6802 DPT=59820 SEQ=2315390069 ACK=3411136089 WINDOW=16384 ACK PSH URGP=0 UID=499 GID=499 MARK=0
Sep 12 06:38:23 cloudcephosd1012 ulogd[37236]: [fw-in-drop] IN=ens3f1np1 OUT= MAC=bc:97:e1:28:90:91:88:d9:8f:b7:17:e0:08:00 SRC=192.168.6.2 DST=192.168.4.12 LEN=64 TOS=00 PREC=0x00 TTL=62 ID=10375 DF PROTO=TCP SPT=6802 DPT=59820 SEQ=2315390559 ACK=3411136089 WINDOW=16384 ACK URGP=0 UID=499 GID=499 MARK=0
Sep 12 06:40:19 cloudcephosd1012 ulogd[37236]: [fw-in-drop] IN=ens3f1np1 OUT= MAC=bc:97:e1:28:90:91:88:d9:8f:b7:17:e0:08:00 SRC=192.168.6.2 DST=192.168.4.12 LEN=542 TOS=00 PREC=0x00 TTL=62 ID=10376 DF PROTO=TCP SPT=6802 DPT=59820 SEQ=2315390069 ACK=3411136089 WINDOW=16384 ACK PSH URGP=0 UID=499 GID=499 MARK=0
Sep 12 06:40:24 cloudcephosd1012 ulogd[37236]: [fw-in-drop] IN=ens3f1np1 OUT= MAC=bc:97:e1:28:90:91:88:d9:8f:b7:17:e0:08:00 SRC=192.168.6.2 DST=192.168.4.12 LEN=64 TOS=00 PREC=0x00 TTL=62 ID=10377 DF PROTO=TCP SPT=6802 DPT=59820 SEQ=2315390559 ACK=3411136089 WINDOW=16384 ACK URGP=0 UID=499 GID=499 MARK=0
Sep 12 06:42:20 cloudcephosd1012 ulogd[37236]: [fw-in-drop] IN=ens3f1np1 OUT= MAC=bc:97:e1:28:90:91:88:d9:8f:b7:17:e0:08:00 SRC=192.168.6.2 DST=192.168.4.12 LEN=542 TOS=00 PREC=0x00 TTL=62 ID=10378 DF PROTO=TCP SPT=6802 DPT=59820 SEQ=2315390069 ACK=3411136089 WINDOW=16384 ACK PSH URGP=0 UID=499 GID=499 MARK=0
Sep 12 06:42:24 cloudcephosd1012 ulogd[37236]: [fw-in-drop] IN=ens3f1np1 OUT= MAC=bc:97:e1:28:90:91:88:d9:8f:b7:17:e0:08:00 SRC=192.168.6.2 DST=192.168.4.12 LEN=64 TOS=00 PREC=0x00 TTL=62 ID=10379 DF PROTO=TCP SPT=6802 DPT=59820 SEQ=2315390559 ACK=3411136089 WINDOW=16384 ACK URGP=0 UID=499 GID=499 MARK=0
Sep 12 06:44:21 cloudcephosd1012 ulogd[37236]: [fw-in-drop] IN=ens3f1np1 OUT= MAC=bc:97:e1:28:90:91:88:d9:8f:b7:17:e0:08:00 SRC=192.168.6.2 DST=192.168.4.12 LEN=542 TOS=00 PREC=0x00 TTL=62 ID=10380 DF PROTO=TCP SPT=6802 DPT=59820 SEQ=2315390069 ACK=3411136089 WINDOW=16384 ACK PSH URGP=0 UID=499 GID=499 MARK=0
Sep 12 06:44:25 cloudcephosd1012 ulogd[37236]: [fw-in-drop] IN=ens3f1np1 OUT= MAC=bc:97:e1:28:90:91:88:d9:8f:b7:17:e0:08:00 SRC=192.168.6.2 DST=192.168.4.12 LEN=64 TOS=00 PREC=0x00 TTL=62 ID=10381 DF PROTO=TCP SPT=6802 DPT=59820 SEQ=2315390559 ACK=3411136089 WINDOW=16384 ACK URGP=0 UID=499 GID=499 MARK=0
Sep 12 06:46:21 cloudcephosd1012 ulogd[37236]: [fw-in-drop] IN=ens3f1np1 OUT= MAC=bc:97:e1:28:90:91:88:d9:8f:b7:17:e0:08:00 SRC=192.168.6.2 DST=192.168.4.12 LEN=542 TOS=00 PREC=0x00 TTL=62 ID=10382 DF PROTO=TCP SPT=6802 DPT=59820 SEQ=2315390069 ACK=3411136089 WINDOW=16384 ACK PSH URGP=0 UID=499 GID=499 MARK=0
Sep 12 06:46:26 cloudcephosd1012 ulogd[37236]: [fw-in-drop] IN=ens3f1np1 OUT= MAC=bc:97:e1:28:90:91:88:d9:8f:b7:17:e0:08:00 SRC=192.168.6.2 DST=192.168.4.12 LEN=64 TOS=00 PREC=0x00 TTL=62 ID=10383 DF PROTO=TCP SPT=6802 DPT=59820 SEQ=2315390559 ACK=3411136089 WINDOW=16384 ACK URGP=0 UID=499 GID=499 MARK=0
Sep 12 06:48:22 cloudcephosd1012 ulogd[37236]: [fw-in-drop] IN=ens3f1np1 OUT= MAC=bc:97:e1:28:90:91:88:d9:8f:b7:17:e0:08:00 SRC=192.168.6.2 DST=192.168.4.12 LEN=542 TOS=00 PREC=0x00 TTL=62 ID=10384 DF PROTO=TCP SPT=6802 DPT=59820 SEQ=2315390069 ACK=3411136089 WINDOW=16384 ACK PSH URGP=0 UID=499 GID=499 MARK=0
Sep 12 06:48:27 cloudcephosd1012 ulogd[37236]: [fw-in-drop] IN=ens3f1np1 OUT= MAC=bc:97:e1:28:90:91:88:d9:8f:b7:17:e0:08:00 SRC=192.168.6.2 DST=192.168.4.12 LEN=64 TOS=00 PREC=0x00 TTL=62 ID=10385 DF PROTO=TCP SPT=6802 DPT=59820 SEQ=2315390559 ACK=3411136089 WINDOW=16384 ACK URGP=0 UID=499 GID=499 MARK=0

Maybe it's a consequence? Could it be that under memory pressure the OS is unable to keep the connection-tracking table populated and evicts some entries? (making the firewall then treat the traffic as a new connection, and dropping it)
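
(A way to test that hypothesis would be to check the conntrack table occupancy and its error counters; a sketch, assuming the conntrack tool is installed:)

# Current number of tracked connections vs. the configured limit.
cat /proc/sys/net/netfilter/nf_conntrack_count
cat /proc/sys/net/netfilter/nf_conntrack_max

# Per-CPU counters: growing "drop"/"early_drop" values would point at table
# pressure, "insert_failed" at insertion races.
conntrack -S

# The kernel also logs "nf_conntrack: table full, dropping packet" on overflow.
dmesg -T | grep -i conntrack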

There's definitely traffic there, with the PSH flag too:

root@cloudcephosd1012:~# tcpdump -nvvi any src port 6802 and src host 192.168.6.2
tcpdump: listening on any, link-type LINUX_SLL (Linux cooked), capture size 262144 bytes
...
15:48:22.533245 IP (tos 0x0, ttl 62, id 26909, offset 0, flags [DF], proto TCP (6), length 52)
    192.168.6.2.6802 > 192.168.4.12.59808: Flags [.], cksum 0xff0e (correct), seq 517959425, ack 2243429203, win 16372, options [nop,nop,TS val 566661525 ecr 3562932610], length 0
15:48:22.533276 IP (tos 0x0, ttl 62, id 26910, offset 0, flags [DF], proto TCP (6), length 96)
    192.168.6.2.6802 > 192.168.4.12.59808: Flags [P.], cksum 0x160c (correct), seq 0:44, ack 1, win 16384, options [nop,nop,TS val 566661526 ecr 3562932610], length 44
15:48:22.533993 IP (tos 0x0, ttl 62, id 26911, offset 0, flags [DF], proto TCP (6), length 253)
    192.168.6.2.6802 > 192.168.4.12.59808: Flags [P.], cksum 0x89ce (correct), seq 44:245, ack 1, win 16384, options [nop,nop,TS val 566661526 ecr 3562932610], length 201
...

Mentioned in SAL (#wikimedia-cloud-feed) [2023-09-15T08:48:51Z] <wm-bot2> dcaro@urcuchillay START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-78 (T346123)

Mentioned in SAL (#wikimedia-cloud-feed) [2023-09-15T08:49:54Z] <wm-bot2> dcaro@urcuchillay END (FAIL) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=99) for tools-k8s-worker-78 (T346123)

Mentioned in SAL (#wikimedia-cloud-feed) [2023-09-15T09:09:02Z] <wm-bot2> dcaro@urcuchillay START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-69 (T346123)

Mentioned in SAL (#wikimedia-cloud-feed) [2023-09-15T09:10:00Z] <wm-bot2> dcaro@urcuchillay END (FAIL) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=99) for tools-k8s-worker-69 (T346123)

Mentioned in SAL (#wikimedia-cloud-feed) [2023-09-15T09:13:02Z] <wm-bot2> dcaro@urcuchillay START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-70 (T346123)

Mentioned in SAL (#wikimedia-cloud-feed) [2023-09-15T09:21:49Z] <wm-bot2> dcaro@urcuchillay END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-70 (T346123)

Mentioned in SAL (#wikimedia-cloud) [2023-09-15T09:28:31Z] <dcaro> rebooting tools-sge-cron-2 (T346123)

Mentioned in SAL (#wikimedia-cloud-feed) [2023-09-15T09:43:36Z] <wm-bot2> dcaro@urcuchillay START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-75 (T346123)

Mentioned in SAL (#wikimedia-cloud-feed) [2023-09-15T09:52:12Z] <wm-bot2> dcaro@urcuchillay END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-75 (T346123)

Mentioned in SAL (#wikimedia-cloud-feed) [2023-09-15T09:53:46Z] <wm-bot2> dcaro@urcuchillay START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-48 (T346123)

Mentioned in SAL (#wikimedia-cloud-feed) [2023-09-15T10:01:50Z] <wm-bot2> dcaro@urcuchillay END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-48 (T346123)

Mentioned in SAL (#wikimedia-cloud-feed) [2023-09-15T10:02:02Z] <wm-bot2> dcaro@urcuchillay START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-52 (T346123)

Mentioned in SAL (#wikimedia-cloud-feed) [2023-09-15T10:10:34Z] <wm-bot2> dcaro@urcuchillay END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-52 (T346123)

Mentioned in SAL (#wikimedia-cloud-feed) [2023-09-15T11:46:30Z] <wm-bot2> dcaro@urcuchillay START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-34 (T346123)

Mentioned in SAL (#wikimedia-cloud-feed) [2023-09-15T11:55:20Z] <wm-bot2> dcaro@urcuchillay END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-34 (T346123)

Mentioned in SAL (#wikimedia-cloud-feed) [2023-09-15T11:57:58Z] <wm-bot2> dcaro@urcuchillay START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-65 (T346123)

Mentioned in SAL (#wikimedia-cloud-feed) [2023-09-15T12:06:23Z] <wm-bot2> dcaro@urcuchillay END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-65 (T346123)

Mentioned in SAL (#wikimedia-cloud-feed) [2023-09-18T07:35:57Z] <wm-bot2> dcaro@urcuchillay START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-64 (T346123)

Mentioned in SAL (#wikimedia-cloud-feed) [2023-09-18T07:37:15Z] <wm-bot2> dcaro@urcuchillay END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-64 (T346123)