an-worker1176.eqiad.wmnet running out of space
Closed, Resolved (Public)

Description

Running Spark jobs and getting failures from an-worker1176.eqiad.wmnet with:

java.nio.file.FileSystemException: /var/lib/hadoop/data/m/yarn/local/usercache/analytics/appcache/application_1727783536357_364603/blockmgr-208eacee-d5fc-466b-89b1-c289e35a2386/3f: No space left on device

These failures only come from this node.

Event Timeline

My Spark jobs recover, so this is not urgent, but it seems like we need to investigate that node.

Thanks @xcollazo - I noticed this last week and made a note here: T353788#10242834 - but I don't think that @Stevemunene had a chance to look at it yet.

Could we decommission this node while we work out the issues then?

BTullis triaged this task as High priority. Oct 22 2024, 3:53 PM

I booted the machine into recovery mode, then checked the state of the file systems.

root@an-worker1176:~# lsblk
NAME                           MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda                              8:0    0   3.6T  0 disk 
├─sda1                           8:1    0   953M  0 part /var/lib/hadoop/data/m
└─sda2                           8:2    0   3.6T  0 part 
sdb                              8:16   0   3.6T  0 disk 
└─sdb1                           8:17   0   3.6T  0 part /var/lib/hadoop/data/l
sdc                              8:32   0   3.6T  0 disk 
└─sdc1                           8:33   0   3.6T  0 part /var/lib/hadoop/data/k
sdd                              8:48   0   3.6T  0 disk 
└─sdd1                           8:49   0   3.6T  0 part /var/lib/hadoop/data/j
sde                              8:64   0   3.6T  0 disk 
└─sde1                           8:65   0   3.6T  0 part /var/lib/hadoop/data/i
sdf                              8:80   0   3.6T  0 disk 
└─sdf1                           8:81   0   3.6T  0 part /var/lib/hadoop/data/h
sdg                              8:96   0   3.6T  0 disk 
└─sdg1                           8:97   0   3.6T  0 part /var/lib/hadoop/data/g
sdh                              8:112  0   3.6T  0 disk 
└─sdh1                           8:113  0   3.6T  0 part /var/lib/hadoop/data/e
sdi                              8:128  0   3.6T  0 disk 
└─sdi1                           8:129  0   3.6T  0 part /var/lib/hadoop/data/f
sdj                              8:144  0 446.6G  0 disk 
├─sdj1                           8:145  0   953M  0 part /boot
├─sdj2                           8:146  0     1K  0 part 
└─sdj5                           8:149  0 445.7G  0 part 
  ├─an--worker1176--vg-swap    254:0    0   9.3G  0 lvm  [SWAP]
  ├─an--worker1176--vg-root    254:1    0  55.9G  0 lvm  /
  └─an--worker1176--vg-journalnode
                               254:2    0    10G  0 lvm  /var/lib/hadoop/journal
sdk                              8:160  0   3.6T  0 disk 
└─sdk1                           8:161  0   3.6T  0 part /var/lib/hadoop/data/d
sdl                              8:176  0   3.6T  0 disk 
└─sdl1                           8:177  0   3.6T  0 part /var/lib/hadoop/data/c
sdm                              8:192  0   3.6T  0 disk 
└─sdm1                           8:193  0   3.6T  0 part /var/lib/hadoop/data/b
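
The listing above shows the odd one out: sda1 is only 953M on a 3.6T disk, while every other data disk has a single partition filling it. A minimal sketch of how that check could be automated, assuming the input comes from `lsblk --json --bytes` and uses the `mountpoint` key. The function name `undersized_partitions`, the 50% threshold, and the sdb sizes in the sample are illustrative assumptions, not taken from this task; the sda sizes are derived from the fdisk output further down.

```python
import json


def undersized_partitions(lsblk_json, mount_prefix, min_fraction=0.5):
    """Return (name, mountpoint, fraction) for mounted partitions that
    cover less than min_fraction of their parent disk."""
    flagged = []
    for disk in json.loads(lsblk_json)["blockdevices"]:
        disk_size = int(disk["size"])
        for part in disk.get("children", []):
            # Unmounted partitions report a null mountpoint.
            mount = part.get("mountpoint") or ""
            if not mount.startswith(mount_prefix):
                continue
            fraction = int(part["size"]) / disk_size
            if fraction < min_fraction:
                flagged.append((part["name"], mount, fraction))
    return flagged


# Hand-built sample in the assumed lsblk JSON shape.
# sda sizes follow the fdisk sector counts (sectors * 512 bytes);
# sdb sizes are hypothetical stand-ins for a healthy disk.
sample = json.dumps({
    "blockdevices": [
        {"name": "sda", "size": 4000225165312, "mountpoint": None,
         "children": [
             {"name": "sda1", "size": 999292928,
              "mountpoint": "/var/lib/hadoop/data/m"},
             {"name": "sda2", "size": 3999223775232, "mountpoint": None},
         ]},
        {"name": "sdb", "size": 4000225165312, "mountpoint": None,
         "children": [
             {"name": "sdb1", "size": 4000223068160,
              "mountpoint": "/var/lib/hadoop/data/l"},
         ]},
    ]
})

for name, mount, fraction in undersized_partitions(sample, "/var/lib/hadoop/data"):
    print(f"{name} on {mount} covers only {fraction:.4%} of its disk")
```

Only sda1 is flagged in this sample; sda2 is skipped because it has no mountpoint.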

Unmounted /dev/sda1, which was only 953M; the rest of the disk was in sda2, which was not mounted anywhere.

root@an-worker1176:~# umount /var/lib/hadoop/data/m
root@an-worker1176:~#

Executed fdisk /dev/sda and printed the partition table on /dev/sda

root@an-worker1176:~# fdisk /dev/sda

Welcome to fdisk (util-linux 2.36.1).
Changes will remain in memory only, until you decide to write them.
Be careful before using the write command.


Command (m for help): p
Disk /dev/sda: 3.64 TiB, 4000225165312 bytes, 7812939776 sectors
Disk model: PERC H750 Adp   
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 262144 bytes / 262144 bytes
Disklabel type: gpt
Disk identifier: 79324028-C10F-42E1-82D7-6FB2770FB090

Device       Start        End    Sectors  Size Type
/dev/sda1     2048    1953791    1951744  953M Linux filesystem
/dev/sda2  1953792 7812937727 7810983936  3.6T Linux LVM

Deleted both partitions

Command (m for help): d
Partition number (1,2, default 2): 2

Partition 2 has been deleted.

Command (m for help): d
Selected partition 1
Partition 1 has been deleted.

Created a new partition using all of the space, starting at the same first sector (2048) as the old sda1. Kept the existing ext4 filesystem signature.

Command (m for help): n
Partition number (1-128, default 1): 
First sector (34-7812939742, default 2048): 
Last sector, +/-sectors or +/-size{K,M,G,T,P} (2048-7812939742, default 7812939742): 

Created a new partition 1 of type 'Linux filesystem' and of size 3.6 TiB.
Partition #1 contains a ext4 signature.

Do you want to remove the signature? [Y]es/[N]o: n

Command (m for help): w

The partition table has been altered.
Calling ioctl() to re-read partition table.
Syncing disks.

Checked the filesystem.

root@an-worker1176:~# e2fsck -f /dev/sda1
e2fsck 1.46.2 (28-Feb-2021)
Pass 1: Checking inodes, blocks, and sizes
Inode 526 extent tree (at level 2) could be narrower.  Optimize<y>? yes
Inode 544 extent tree (at level 2) could be narrower.  Optimize<y>? yes
Inode 612 extent tree (at level 2) could be narrower.  Optimize<y>? yes
Inode 639 extent tree (at level 2) could be narrower.  Optimize<y>? yes
Inode 1157 extent tree (at level 2) could be narrower.  Optimize<y>? yes
Pass 1E: Optimizing extent trees
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information

hadoop-m: ***** FILE SYSTEM WAS MODIFIED *****
hadoop-m: 1051/61056 files (1.3% non-contiguous), 227064/243968 blocks
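
The e2fsck summary also explains the original "No space left on device" errors. A rough worked check, assuming the old filesystem filled the whole 1951744-sector sda1 partition reported by fdisk (variable names here are illustrative):

```python
SECTOR_BYTES = 512       # from the fdisk "Units" line
old_sectors = 1951744    # size of the old sda1 in sectors, from fdisk
total_blocks = 243968    # from the e2fsck summary line
used_blocks = 227064     # from the e2fsck summary line

# Assumption: the filesystem spanned the whole old partition, so the
# block size is the partition size divided by the block count.
block_bytes = old_sectors * SECTOR_BYTES // total_blocks
used_fraction = used_blocks / total_blocks
free_mib = (total_blocks - used_blocks) * block_bytes / 2**20

print(f"block size: {block_bytes} bytes")
print(f"used: {used_fraction:.1%}")
print(f"free: {free_mib:.1f} MiB")
```

Under that assumption the blocks are 4 KiB, and the filesystem was about 93% full with roughly 66 MiB free before the resize, which is very little headroom for Spark block manager files.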

Resized the file system to fill all available space.

root@an-worker1176:~# resize2fs /dev/sda1
resize2fs 1.46.2 (28-Feb-2021)
Resizing the filesystem on /dev/sda1 to 976617211 (4k) blocks.
The filesystem on /dev/sda1 is now 976617211 (4k) blocks long.

root@an-worker1176:~#
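
As a sanity check, the block count reported by resize2fs can be reproduced from the sector range accepted in fdisk. This is only an arithmetic sketch using the numbers in the output above:

```python
SECTOR_BYTES = 512
BLOCK_BYTES = 4096            # resize2fs reports "(4k) blocks"

first_sector = 2048           # default first sector accepted in fdisk
last_sector = 7812939742      # default last sector accepted in fdisk

sectors = last_sector - first_sector + 1
# Whole 4k blocks that fit in the partition; the partial block at the
# end of the partition is left unused.
blocks_4k = sectors * SECTOR_BYTES // BLOCK_BYTES

print(sectors, blocks_4k)
```

The result, 976617211 blocks, matches the resize2fs output, so the filesystem now fills the new partition.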

Mounted all file systems and checked that it looked correct.

root@an-worker1176:~# mount -a
root@an-worker1176:~# lsblk /dev/sda
NAME   MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
sda      8:0    0  3.6T  0 disk 
└─sda1   8:1    0  3.6T  0 part /var/lib/hadoop/data/m

Rebooting to get the node back in the cluster.