Page MenuHomePhabricator

CephOSDClusterInWarning
Closed, ResolvedPublic

Description

We received an alert this morning about the cephosd Ceph cluster being in a warning state.

Checking the status:

btullis@cephosd1001:~$ sudo ceph -s
  cluster:
    id:     6d4278e1-ea45-4d29-86fe-85b44c150813
    health: HEALTH_WARN
            1 clients failing to respond to capability release
            1 MDSs report slow requests
 
  services:
    mon: 5 daemons, quorum cephosd1001,cephosd1002,cephosd1003,cephosd1004,cephosd1005 (age 3M)
    mgr: cephosd1003(active, since 8d), standbys: cephosd1002, cephosd1005, cephosd1004, cephosd1001
    mds: 3/3 daemons up, 2 standby
    osd: 100 osds: 100 up (since 3M), 100 in (since 10M)
    rgw: 5 daemons active (5 hosts, 1 zones)
 
  data:
    volumes: 3/3 healthy
    pools:   17 pools, 4481 pgs
    objects: 801.74k objects, 520 GiB
    usage:   30 TiB used, 1.1 PiB / 1.1 PiB avail
    pgs:     4481 active+clean
 
  io:
    client:   57 KiB/s rd, 448 KiB/s wr, 58 op/s rd, 82 op/s wr

One MDS (metadata server for cephfs) reports slow requests.

We should investigate this.

Event Timeline

Checking with sudo ceph -w we can see some log messages like this:

2025-03-20T10:19:26.094783+0000 mds.cephosd1004 [WRN] 1022 slow requests, 1 included below; oldest blocked for > 45297.021041 secs
2025-03-20T10:19:26.094789+0000 mds.cephosd1004 [WRN] slow request 964.039001 seconds old, received at 2025-03-20T10:03:22.055494+0000: client_request(client.17580649:51188 getattr AsLsXsFs #0x100000cae14 2025-03-20T10:03:22.051270+0000 caller_uid=900, caller_gid=900{900,}) currently joining batch getattr
2025-03-20T10:19:31.094882+0000 mds.cephosd1004 [WRN] 1022 slow requests, 2 included below; oldest blocked for > 45302.021107 secs
2025-03-20T10:19:31.094886+0000 mds.cephosd1004 [WRN] slow request 3840.041210 seconds old, received at 2025-03-20T09:15:31.053350+0000: client_request(client.17580649:51073 getattr AsLsXsFs #0x100000cae14 2025-03-20T09:15:31.049115+0000 caller_uid=900, caller_gid=900{900,}) currently joining batch getattr
2025-03-20T10:19:31.094890+0000 mds.cephosd1004 [WRN] slow request 63.396623 seconds old, received at 2025-03-20T10:18:27.697938+0000: client_request(client.17580649:51224 getattr AsLsXsFs #0x100000cae14 2025-03-20T10:18:27.696391+0000 caller_uid=900, caller_gid=900{900,}) currently joining batch getattr

The fact that it reports the uid as 900 suggests that this is a Kubernetes pod, since that is our default runtime user. This makes sense because it is the first place that we are making extensive use of cephfs.
The only other place it is mounted it on ml-lab1002.

So this is either an Airflow component or task pod, or it is part of the work that we are doing on T352650: WE 5.4 KR - Hypothesis 5.4.4 - Q3 FY24/25 - Migrate current-generation dumps to run on kubernetes.

Referring to the guidelines here: https://docs.ceph.com/en/quincy/cephfs/troubleshooting/#slow-requests-mds

I can examine the ops in flight like this:

btullis@cephosd1004:~$ sudo ceph daemon mds.cephosd1004 dump_ops_in_flight | jq .

The output seems to match the current output from ceph -w which states that there are currently 120 slow requests. They are all joining batch getattr.

Identify the stuck commands and examine why they are stuck. Usually the last “event” will have been an attempt to gather locks, or sending the operation off to the MDS log. If it is waiting on the OSDs, fix them. If operations are stuck on a specific inode, you probably have a client holding caps which prevent others from using it, either because the client is trying to flush out dirty data or because you have encountered a bug in CephFS’ distributed file lock code (the file “capabilities” [“caps”] system).

If it’s a result of a bug in the capabilities code, restarting the MDS is likely to resolve the problem.

I think that I will try restarting the MDS process.

We can see from the output of ceph fs dump that is is the dpe file system for which the mds.cephosd1004 MDS is active.

btullis@cephosd1004:~$ sudo ceph fs dump
e7793
enable_multiple, ever_enabled_multiple: 1,1
default compat: compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no anchor table,9=file layout v2,10=snaprealm v2}
legacy client fscid: 1
 
Filesystem 'dpe' (1)
fs_name	dpe
epoch	7793
flags	12 joinable allow_snaps allow_multimds_snaps
created	2024-10-07T10:00:05.297868+0000
modified	2025-03-20T11:08:04.056231+0000
tableserver	0
root	0
session_timeout	60
session_autoclose	300
max_file_size	1099511627776
required_client_features	{}
last_failure	0
last_failure_osd_epoch	66410
compat	compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,7=mds uses inline data,8=no anchor table,9=file layout v2,10=snaprealm v2}
max_mds	1
in	0
up	{0=17532364}
failed	
damaged	
stopped	
data_pools	[16]
metadata_pool	15
inline_data	disabled
balancer	
bal_rank_mask	-1
standby_count_wanted	1
[mds.cephosd1004{0:17532364} state up:active seq 189895 addr [v2:10.64.134.12:6800/961568131,v1:10.64.134.12:6801/961568131] compat {c=[1],r=[1],i=[7ff]}]
 
 
Filesystem 'dumps' (2)
fs_name	dumps
epoch	5131
flags	12 joinable allow_snaps allow_multimds_snaps
created	2024-11-21T14:51:51.460928+0000
modified	2025-03-11T16:18:34.225663+0000
tableserver	0
root	0
session_timeout	60
session_autoclose	300
max_file_size	1099511627776
required_client_features	{}
last_failure	0
last_failure_osd_epoch	66414
compat	compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,7=mds uses inline data,8=no anchor table,9=file layout v2,10=snaprealm v2}
max_mds	1
in	0
up	{0=17507916}
failed	
damaged	
stopped	
data_pools	[19]
metadata_pool	18
inline_data	disabled
balancer	
bal_rank_mask	-1
standby_count_wanted	1
[mds.cephosd1002{0:17507916} state up:active seq 70 addr [v2:10.64.131.21:6800/3148591687,v1:10.64.131.21:6801/3148591687] compat {c=[1],r=[1],i=[7ff]}]
 
 
Filesystem 'home' (3)
fs_name	home
epoch	5125
flags	12 joinable allow_snaps allow_multimds_snaps
created	2024-11-21T15:39:33.394961+0000
modified	2025-03-11T16:14:02.872132+0000
tableserver	0
root	0
session_timeout	60
session_autoclose	300
max_file_size	1099511627776
required_client_features	{}
last_failure	0
last_failure_osd_epoch	66412
compat	compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,7=mds uses inline data,8=no anchor table,9=file layout v2,10=snaprealm v2}
max_mds	1
in	0
up	{0=17555995}
failed	
damaged	
stopped	
data_pools	[21]
metadata_pool	20
inline_data	disabled
balancer	
bal_rank_mask	-1
standby_count_wanted	1
[mds.cephosd1001{0:17555995} state up:active seq 56 addr [v2:10.64.130.13:6800/801572634,v1:10.64.130.13:6801/801572634] compat {c=[1],r=[1],i=[7ff]}]
 
 
Standby daemons:
 
[mds.cephosd1003{-1:17496402} state up:standby seq 1 addr [v2:10.64.132.23:6800/3923888493,v1:10.64.132.23:6801/3923888493] compat {c=[1],r=[1],i=[7ff]}]
[mds.cephosd1005{-1:17507731} state up:standby seq 1 addr [v2:10.64.135.21:6800/2453807968,v1:10.64.135.21:6801/2453807968] compat {c=[1],r=[1],i=[7ff]}]
dumped fsmap epoch 7793

Mentioned in SAL (#wikimedia-analytics) [2025-03-20T11:13:55Z] <btullis> restarting ceph-mds@cephosd1004.service for T389470

btullis@cephosd1004:~$ sudo systemctl restart ceph-mds@cephosd1004.service 
btullis@cephosd1004:~$ echo $?
0

image.png (661×1 px, 108 KB)

So one of the standby MDS servers was promoted automatically to be the active MDS for the dpe file system. In this case, that was ceph-mds@cephosd1005.

All of the following log entries appeared within one second on the ceph -w view.

2025-03-20T11:15:23.634145+0000 mon.cephosd1001 [WRN] Health check failed: 1 filesystem is degraded (FS_DEGRADED)
2025-03-20T11:15:23.634168+0000 mon.cephosd1001 [ERR] Health check failed: 1 filesystem is offline (MDS_ALL_DOWN)
2025-03-20T11:15:23.634178+0000 mon.cephosd1001 [INF] Health check cleared: MDS_CLIENT_LATE_RELEASE (was: 1 clients failing to respond to capability release)
2025-03-20T11:15:23.634189+0000 mon.cephosd1001 [INF] Health check cleared: MDS_SLOW_REQUEST (was: 1 MDSs report slow requests)
2025-03-20T11:15:23.650332+0000 mon.cephosd1001 [INF] Standby daemon mds.cephosd1005 assigned to filesystem dpe as rank 0
2025-03-20T11:15:23.650451+0000 mon.cephosd1001 [INF] Health check cleared: MDS_ALL_DOWN (was: 1 filesystem is offline)
2025-03-20T11:15:26.732519+0000 mon.cephosd1001 [INF] daemon mds.cephosd1005 is now active in filesystem dpe as rank 0
2025-03-20T11:15:27.699629+0000 mon.cephosd1001 [INF] Health check cleared: FS_DEGRADED (was: 1 filesystem is degraded)
2025-03-20T11:15:27.699650+0000 mon.cephosd1001 [INF] Cluster is now healthy

The cluster is now healthy again.

btullis@cephosd1001:~$ sudo ceph -s
  cluster:
    id:     6d4278e1-ea45-4d29-86fe-85b44c150813
    health: HEALTH_OK
 
  services:
    mon: 5 daemons, quorum cephosd1001,cephosd1002,cephosd1003,cephosd1004,cephosd1005 (age 3M)
    mgr: cephosd1003(active, since 8d), standbys: cephosd1002, cephosd1005, cephosd1004, cephosd1001
    mds: 3/3 daemons up, 2 standby
    osd: 100 osds: 100 up (since 3M), 100 in (since 10M)
    rgw: 5 daemons active (5 hosts, 1 zones)
 
  data:
    volumes: 3/3 healthy
    pools:   17 pools, 4481 pgs
    objects: 802.63k objects, 520 GiB
    usage:   30 TiB used, 1.1 PiB / 1.1 PiB avail
    pgs:     4481 active+clean
 
  io:
    client:   57 KiB/s rd, 711 KiB/s wr, 23 op/s rd, 102 op/s wr