an-worker1176.eqiad.wmnet running out of space
Closed, Resolved · Public
Assigned To

	BTullis

Authored By

	xcollazo
	Oct 21 2024, 8:21 PM

Description

Running Spark jobs and getting failures from an-worker1176.eqiad.wmnet with:

java.nio.file.FileSystemException: /var/lib/hadoop/data/m/yarn/local/usercache/analytics/appcache/application_1727783536357_364603/blockmgr-208eacee-d5fc-466b-89b1-c289e35a2386/3f: No space left on device

These failures only come from this node.

Related Objects

Mentioned Here: T353788: Add kafka-stretch100[1-2] to the hadoop cluster

Event Timeline

xcollazo created this task. Oct 21 2024, 8:21 PM

Restricted Application added a subscriber: Aklapper. Oct 21 2024, 8:21 PM

My Spark jobs recover, so this is not urgent, but it seems like we need to investigate that node.

Thanks @xcollazo - I noticed this last week and made a note here: T353788#10242834 - but I don't think that @Stevemunene had a chance to look at it yet.

Could we decommission this node while we work out the issues then?

BTullis claimed this task. Oct 22 2024, 3:00 PM

BTullis edited projects, added Data-Platform-SRE (2024.10.19 - 2024.11.08); removed Data-Platform-SRE.

BTullis moved this task from Backlog - project to In Progress on the Data-Platform-SRE (2024.10.19 - 2024.11.08) board.

BTullis triaged this task as High priority. Oct 22 2024, 3:53 PM

I booted the machine into recovery mode, then checked the state of the file systems.

root@an-worker1176:~# lsblk
NAME                           MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda                              8:0    0   3.6T  0 disk 
├─sda1                           8:1    0   953M  0 part /var/lib/hadoop/data/m
└─sda2                           8:2    0   3.6T  0 part 
sdb                              8:16   0   3.6T  0 disk 
└─sdb1                           8:17   0   3.6T  0 part /var/lib/hadoop/data/l
sdc                              8:32   0   3.6T  0 disk 
└─sdc1                           8:33   0   3.6T  0 part /var/lib/hadoop/data/k
sdd                              8:48   0   3.6T  0 disk 
└─sdd1                           8:49   0   3.6T  0 part /var/lib/hadoop/data/j
sde                              8:64   0   3.6T  0 disk 
└─sde1                           8:65   0   3.6T  0 part /var/lib/hadoop/data/i
sdf                              8:80   0   3.6T  0 disk 
└─sdf1                           8:81   0   3.6T  0 part /var/lib/hadoop/data/h
sdg                              8:96   0   3.6T  0 disk 
└─sdg1                           8:97   0   3.6T  0 part /var/lib/hadoop/data/g
sdh                              8:112  0   3.6T  0 disk 
└─sdh1                           8:113  0   3.6T  0 part /var/lib/hadoop/data/e
sdi                              8:128  0   3.6T  0 disk 
└─sdi1                           8:129  0   3.6T  0 part /var/lib/hadoop/data/f
sdj                              8:144  0 446.6G  0 disk 
├─sdj1                           8:145  0   953M  0 part /boot
├─sdj2                           8:146  0     1K  0 part 
└─sdj5                           8:149  0 445.7G  0 part 
  ├─an--worker1176--vg-swap    254:0    0   9.3G  0 lvm  [SWAP]
  ├─an--worker1176--vg-root    254:1    0  55.9G  0 lvm  /
  └─an--worker1176--vg-journalnode
                               254:2    0    10G  0 lvm  /var/lib/hadoop/journal
sdk                              8:160  0   3.6T  0 disk 
└─sdk1                           8:161  0   3.6T  0 part /var/lib/hadoop/data/d
sdl                              8:176  0   3.6T  0 disk 
└─sdl1                           8:177  0   3.6T  0 part /var/lib/hadoop/data/c
sdm                              8:192  0   3.6T  0 disk 
└─sdm1                           8:193  0   3.6T  0 part /var/lib/hadoop/data/b
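
The 953M size of sda1, mounted at /var/lib/hadoop/data/m, stands out against the 3.6T partitions behind the other data directories. As a rough sketch of how that kind of mismatch could be spotted automatically, the function below reads `size mountpoint` pairs and prints any Hadoop data mount under a threshold. The function name and the 1 TiB default threshold are assumptions for illustration, not part of any existing tooling.

```shell
#!/usr/bin/env bash
# Sketch: flag Hadoop data mounts whose backing partition is suspiciously small.
# Input on stdin: one "<size-in-bytes> <mountpoint>" pair per line, for example
# the output of: lsblk -b -n -l -o SIZE,MOUNTPOINT
# The first argument is a minimum size in bytes (hypothetical default: 1 TiB).
find_small_data_mounts() {
    local min_bytes="${1:-1099511627776}"
    awk -v min="$min_bytes" '
        # Skip rows with no mountpoint (only one field) and non-data mounts.
        NF == 2 && $2 ~ /^\/var\/lib\/hadoop\/data\// && ($1 + 0) < (min + 0) {
            print $2
        }'
}

# Example usage on a worker node:
#   lsblk -b -n -l -o SIZE,MOUNTPOINT | find_small_data_mounts
```

Reading from stdin keeps the check independent of how the sizes are collected, so the same filter could be fed from a saved listing.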

Unmounted /dev/sda1

root@an-worker1176:~# umount /var/lib/hadoop/data/m
root@an-worker1176:~#

Executed fdisk /dev/sda and printed the partition table on /dev/sda

root@an-worker1176:~# fdisk /dev/sda

Welcome to fdisk (util-linux 2.36.1).
Changes will remain in memory only, until you decide to write them.
Be careful before using the write command.


Command (m for help): p
Disk /dev/sda: 3.64 TiB, 4000225165312 bytes, 7812939776 sectors
Disk model: PERC H750 Adp   
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 262144 bytes / 262144 bytes
Disklabel type: gpt
Disk identifier: 79324028-C10F-42E1-82D7-6FB2770FB090

Device       Start        End    Sectors  Size Type
/dev/sda1     2048    1953791    1951744  953M Linux filesystem
/dev/sda2  1953792 7812937727 7810983936  3.6T Linux LVM
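
The sizes in the table follow directly from the sector counts and the 512-byte sector size reported by fdisk. A small sketch of that arithmetic (the helper name is made up for illustration):

```shell
#!/usr/bin/env bash
# Sketch: convert a sector count to MiB, assuming 512-byte sectors by default
# (the logical sector size fdisk reports for /dev/sda).
sectors_to_mib() {
    local sectors="$1" sector_size="${2:-512}"
    echo $(( sectors * sector_size / 1024 / 1024 ))
}

sectors_to_mib 1951744      # /dev/sda1: prints 953
sectors_to_mib 7810983936   # /dev/sda2: prints 3813957 (about 3.6 TiB)
```

So the mounted data partition held under 1 GiB, while almost all of the disk sat in the unmounted sda2.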

Deleted both partitions

Command (m for help): d
Partition number (1,2, default 2): 2

Partition 2 has been deleted.

Command (m for help): d
Selected partition 1
Partition 1 has been deleted.

Created a new partition using all of the space, starting at the same first sector (2048) as the old sda1. Kept the existing ext4 filesystem signature.

Command (m for help): n
Partition number (1-128, default 1): 
First sector (34-7812939742, default 2048): 
Last sector, +/-sectors or +/-size{K,M,G,T,P} (2048-7812939742, default 7812939742): 

Created a new partition 1 of type 'Linux filesystem' and of size 3.6 TiB.
Partition #1 contains a ext4 signature.

Do you want to remove the signature? [Y]es/[N]o: n

Command (m for help): w

The partition table has been altered.
Calling ioctl() to re-read partition table.
Syncing disks.

Checked the filesystem.

root@an-worker1176:~# e2fsck -f /dev/sda1
e2fsck 1.46.2 (28-Feb-2021)
Pass 1: Checking inodes, blocks, and sizes
Inode 526 extent tree (at level 2) could be narrower.  Optimize<y>? yes
Inode 544 extent tree (at level 2) could be narrower.  Optimize<y>? yes
Inode 612 extent tree (at level 2) could be narrower.  Optimize<y>? yes
Inode 639 extent tree (at level 2) could be narrower.  Optimize<y>? yes
Inode 1157 extent tree (at level 2) could be narrower.  Optimize<y>? yes
Pass 1E: Optimizing extent trees
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information

hadoop-m: ***** FILE SYSTEM WAS MODIFIED *****
hadoop-m: 1051/61056 files (1.3% non-contiguous), 227064/243968 blocks

Resized the file system to fill all available space.

root@an-worker1176:~# resize2fs /dev/sda1
resize2fs 1.46.2 (28-Feb-2021)
Resizing the filesystem on /dev/sda1 to 976617211 (4k) blocks.
The filesystem on /dev/sda1 is now 976617211 (4k) blocks long.

root@an-worker1176:~#
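
As a sanity check, the block count reported by resize2fs can be derived from the partition boundaries chosen in fdisk. This is only a sketch of the arithmetic; the helper name is hypothetical and it assumes 512-byte sectors and 4 KiB filesystem blocks, as shown in the output above.

```shell
#!/usr/bin/env bash
# Sketch: number of whole filesystem blocks that fit in a partition,
# given its first and last sector (inclusive).
expected_fs_blocks() {
    local first="$1" last="$2" sector_size="${3:-512}" block_size="${4:-4096}"
    echo $(( (last - first + 1) * sector_size / block_size ))
}

expected_fs_blocks 2048 1953791      # old sda1: prints 243968
expected_fs_blocks 2048 7812939742   # new sda1: prints 976617211
```

The first value matches the 243968 total blocks in the e2fsck summary, and the second matches the size resize2fs grew the filesystem to.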

Mounted all file systems and checked that it looked correct.

root@an-worker1176:~# mount -a
root@an-worker1176:~# lsblk /dev/sda
NAME   MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
sda      8:0    0  3.6T  0 disk 
└─sda1   8:1    0  3.6T  0 part /var/lib/hadoop/data/m

Rebooting to get the node back in the cluster.

BTullis closed this task as Resolved. Oct 23 2024, 10:40 AM

BTullis moved this task from In Progress to Done on the Data-Platform-SRE (2024.10.19 - 2024.11.08) board.
