
tools-sgeexec-0907 filesystem corruption
Closed, Resolved, Public

Description

I just depooled tools-sgeexec-0907. It was alerting for grid errors and Puppet staleness, and the kernel logs suggest its filesystem is corrupted:

[10980873.951531] EXT4-fs (dm-0): error count since last fsck: 3087
[10980873.951568] EXT4-fs (dm-0): initial error at time 1620058872: ext4_lookup:1623: inode 2
[10980873.951573] EXT4-fs (dm-0): last error at time 1627557661: ext4_lookup:1623: inode 2
[11023343.448022] Process accounting resumed
[11073147.677663] EXT4-fs (dm-0): error count since last fsck: 3087
[11073147.677703] EXT4-fs (dm-0): initial error at time 1620058872: ext4_lookup:1623: inode 2
[11073147.677708] EXT4-fs (dm-0): last error at time 1627557661: ext4_lookup:1623: inode 2
[11109742.533438] Process accounting resumed
[11165421.407818] EXT4-fs (dm-0): error count since last fsck: 3087
[11165421.407822] EXT4-fs (dm-0): initial error at time 1620058872: ext4_lookup:1623: inode 2
[11165421.407824] EXT4-fs (dm-0): last error at time 1627557661: ext4_lookup:1623: inode 2
[11189905.826660] EXT4-fs (dm-0): Delayed block allocation failed for inode 790288 at logical offset 131072 with max blocks 2048 with error 117
[11189905.840635] EXT4-fs (dm-0): This should not happen!! Data will be lost

[11191417.887712] EXT4-fs (dm-0): Delayed block allocation failed for inode 278920 at logical offset 32768 with max blocks 2048 with error 117
[11191417.900359] EXT4-fs (dm-0): This should not happen!! Data will be lost

[11196141.580946] Process accounting resumed
[11250050.060176] EXT4-fs (dm-0): Delayed block allocation failed for inode 280138 at logical offset 32768 with max blocks 2048 with error 117
[11250050.064456] EXT4-fs (dm-0): This should not happen!! Data will be lost

[11250050.067979] Aborting journal on device dm-0-8.
[11250050.115162] EXT4-fs error: 425 callbacks suppressed
[11250050.115182] EXT4-fs error (device dm-0) in ext4_dx_add_entry:2355: Journal has aborted
[11250050.671391] EXT4-fs error (device dm-0): ext4_journal_check_start:61: Detected aborted journal
[11250050.676588] EXT4-fs (dm-0): Remounting filesystem read-only
[11250050.682315] EXT4-fs error (device dm-0) in ext4_evict_inode:273: Journal has aborted
[11257695.133816] EXT4-fs (dm-0): error count since last fsck: 3091
[11257695.133851] EXT4-fs (dm-0): initial error at time 1620058872: ext4_lookup:1623: inode 2
[11257695.133867] EXT4-fs (dm-0): last error at time 1631309016: ext4_evict_inode:273: inode 2
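The "error at time" values in the log above are Unix epoch seconds, and "error 117" is a Linux errno value. A minimal sketch for decoding both, assuming a Linux host with Python 3; the helper names `to_utc` and `decode` are made up for illustration and the sample lines are copied from the log above:

```python
import errno
import os
import re
from datetime import datetime, timezone

# Sample lines copied from the kernel log in this task.
LOG = """\
[11257695.133851] EXT4-fs (dm-0): initial error at time 1620058872: ext4_lookup:1623: inode 2
[11257695.133867] EXT4-fs (dm-0): last error at time 1631309016: ext4_evict_inode:273: inode 2
[11250050.060176] EXT4-fs (dm-0): Delayed block allocation failed for inode 280138 at logical offset 32768 with max blocks 2048 with error 117
"""

TIME_RE = re.compile(r"(initial|last) error at time (\d+):")
ERRNO_RE = re.compile(r"with error (\d+)$")


def to_utc(epoch: int) -> str:
    """Format Unix epoch seconds as an ISO 8601 UTC timestamp."""
    return datetime.fromtimestamp(epoch, tz=timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def decode(log: str) -> list:
    """Return human-readable notes for the epoch times and errno values in `log`."""
    notes = []
    for line in log.splitlines():
        m = TIME_RE.search(line)
        if m:
            notes.append(f"{m.group(1)} error: {to_utc(int(m.group(2)))}")
        m = ERRNO_RE.search(line)
        if m:
            code = int(m.group(1))
            # errno names and messages are platform-specific; this assumes Linux.
            name = errno.errorcode.get(code, "unknown")
            notes.append(f"errno {code}: {name} ({os.strerror(code)})")
    return notes


if __name__ == "__main__":
    for note in decode(LOG):
        print(note)
```

Decoded this way, the initial error dates to 2021-05-03T16:21:12Z and the last one to 2021-09-10T21:23:36Z. On Linux, errno 117 is EUCLEAN ("Structure needs cleaning"), which ext4 uses to report on-disk corruption.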

Event Timeline

taavi triaged this task as High priority. Sep 11 2021, 8:54 AM
taavi created this task.

integration-agent-qemu-1001 got corrupted fairly recently as well (T290615). We more or less recovered it but will definitely rebuild it from scratch since some files have been altered.

Mentioned in SAL (#wikimedia-cloud) [2021-09-13T08:14:26Z] <arturo> rebooting sgeexec-0907 (T290798)

The VM is running on cloudvirt1016 in case it matters.

Forced an fsck as follows:

  • for the root disk, add fsck.mode=force fsck.repair=yes to the kernel cmdline in /etc/default/grub, run sudo update-grub2 and reboot the VM
  • for the tmp disk, comment out its line in /etc/fstab, reboot the VM and then run sudo e2fsck /dev/mapper/vd-separate--tmp -y
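The two config edits in the list above can be sketched as pure text transformations. This is an illustration only, not what was actually run: it assumes the kernel cmdline lives in the `GRUB_CMDLINE_LINUX` variable (it may be `GRUB_CMDLINE_LINUX_DEFAULT` on some images), the function names are hypothetical, and the sample file contents other than the device path are invented. Nothing is written to disk; running update-grub2, rebooting and running e2fsck remain manual steps.

```python
import re

# Kernel parameters named in the steps above.
FSCK_ARGS = ["fsck.mode=force", "fsck.repair=yes"]


def add_fsck_args(grub_default: str, var: str = "GRUB_CMDLINE_LINUX") -> str:
    """Append the fsck arguments to the double-quoted value of `var`, if missing."""
    pattern = re.compile(rf'^({re.escape(var)}=")([^"]*)(")$', re.MULTILINE)

    def repl(m):
        args = m.group(2).split()
        for arg in FSCK_ARGS:
            if arg not in args:  # keeps the edit idempotent
                args.append(arg)
        return m.group(1) + " ".join(args) + m.group(3)

    return pattern.sub(repl, grub_default)


def comment_out_mount(fstab: str, mount_point: str) -> str:
    """Comment out the fstab entry whose second field equals `mount_point`."""
    lines = []
    for line in fstab.splitlines():
        fields = line.split()
        is_comment = line.lstrip().startswith("#")
        if len(fields) >= 2 and not is_comment and fields[1] == mount_point:
            line = "#" + line
        lines.append(line)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical file contents, for illustration only.
    grub = 'GRUB_CMDLINE_LINUX="console=ttyS0"\n'
    fstab = (
        "UUID=abc / ext4 defaults 0 1\n"
        "/dev/mapper/vd-separate--tmp /tmp ext4 defaults 0 2\n"
    )
    print(add_fsck_args(grub), end="")
    print(comment_out_mount(fstab, "/tmp"), end="")
```

Keeping the edits as string-in, string-out functions makes them easy to check on a copy of the files before touching the real ones. The same `comment_out_mount` logic, inverted, would restore the /tmp entry afterwards.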

Both disks had corruption to fix. The tmp disk had the most, with lots of dangling inodes and the like.

Had to do the same at the /dev/dm-0 level for whatever reason (uncommenting the /tmp mount in /etc/fstab and rebooting first).

Mentioned in SAL (#wikimedia-cloud) [2021-09-13T08:55:48Z] <arturo> repooling sgeexec-0907 (T290798)

aborrero claimed this task.

I think we can close this for now; we'll keep an eye on it.