potential disk issue on wdqs1010
Closed, Resolved · Public · 2 Estimated Story Points

Description

Data reload on wdqs1010 (T267927) failed with journal corruption, which could point to a disk error. SMART data does not show any failed I/O. Further investigation with badblocks (as done in T263125) would make sense.

AC:

  • we know if the disks are good or not
  • ticket is created to replace the disks if needed

Event Timeline

Restricted Application added a subscriber: Aklapper.
Gehel set the point value for this task to 2.

Running sudo badblocks -v -o badblocks -w -f /dev/mapper/vg0-srv &> badblocks.log on wdqs1010

Block 65 seems to be bad, and the kernel is throwing a number of errors similar to:

[702641.004691] EXT4-fs error (device dm-2): ext4_map_blocks:614: inode #38928414: block 155721773: comm git: lblock 0 mapped to illegal pblock 155721773 (length 1)
[702641.021024] EXT4-fs warning (device dm-2): htree_dirblock_to_tree:995: inode #38928414: lblock 0: comm git: error -117 reading directory block
[704406.562031] EXT4-fs error (device dm-2): ext4_map_blocks:614: inode #38993923: block 155722274: comm git: lblock 0 mapped to illegal pblock 155722274 (length 1)
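
To see whether the errors cluster on a few inodes or physical blocks, the "mapped to illegal pblock" lines above can be extracted with a small script. This is a minimal sketch, not part of the original investigation: the regular expression is written against the three sample lines quoted here, and the function name `parse_ext4_errors` is hypothetical. Other kernel versions may format these messages differently.

```python
import re

# Matches only the "mapped to illegal pblock" errors quoted in this task.
# The pattern is an assumption derived from those sample lines.
EXT4_ERROR = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\] EXT4-fs error \(device (?P<device>[^)]+)\): "
    r"ext4_map_blocks:\d+: inode #(?P<inode>\d+): block (?P<block>\d+): "
    r"comm (?P<comm>\S+): lblock (?P<lblock>\d+) mapped to illegal pblock "
    r"(?P<pblock>\d+)"
)


def parse_ext4_errors(lines):
    """Return one dict per matching EXT4-fs error line; other lines are skipped."""
    events = []
    for line in lines:
        match = EXT4_ERROR.search(line)
        if match:
            events.append({
                "timestamp": float(match["ts"]),
                "device": match["device"],
                "inode": int(match["inode"]),
                "pblock": int(match["pblock"]),
                "comm": match["comm"],
            })
    return events


# The three kernel log lines quoted in this task.
SAMPLE = [
    "[702641.004691] EXT4-fs error (device dm-2): ext4_map_blocks:614: inode #38928414: block 155721773: comm git: lblock 0 mapped to illegal pblock 155721773 (length 1)",
    "[702641.021024] EXT4-fs warning (device dm-2): htree_dirblock_to_tree:995: inode #38928414: lblock 0: comm git: error -117 reading directory block",
    "[704406.562031] EXT4-fs error (device dm-2): ext4_map_blocks:614: inode #38993923: block 155722274: comm git: lblock 0 mapped to illegal pblock 155722274 (length 1)",
]

events = parse_ext4_errors(SAMPLE)
for event in events:
    print(event["device"], event["inode"], event["pblock"], event["comm"])
```

In practice the input would come from `dmesg` or the kernel log rather than a hard-coded list; the warning line is deliberately ignored because it carries no physical block number.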

SMART does not report any errors.

Now let's try to find out which disk is problematic.

After discussion with @Volans:

We see no errors at the hardware level. Badblocks does find a bad block, but going through mdraid creates enough levels of indirection that this might not be hardware related. The filesystem is corrupted.
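
The indirection mentioned above can be illustrated with a rough sketch of how a block on the logical volume would map to a member disk. Everything here is an assumption for illustration: the task does not state the array geometry, so the chunk size, member count, and logical volume start offset below are hypothetical, and the function names are invented. The sketch assumes a simple striped array with equal-sized members and one contiguous logical volume; it ignores the md data offset, multi-segment LVM layouts, and arrays with mixed-size members.

```python
from typing import NamedTuple


class StripeLocation(NamedTuple):
    member_index: int   # which member disk of the striped array
    member_sector: int  # sector offset within that member's data area


def lv_block_to_array_sector(lv_block, block_size, lv_start_sector, sector_size=512):
    """Translate a block number on the logical volume to a sector on the array.

    Assumes the logical volume is one contiguous extent starting at
    lv_start_sector on the array (a simplification of real LVM layouts).
    """
    return lv_start_sector + (lv_block * block_size) // sector_size


def locate_in_stripe(array_sector, chunk_sectors, n_members):
    """Map an array sector to (member, sector) for plain striping.

    Assumes equal-sized members, so the array has a single striping zone.
    """
    if array_sector < 0 or chunk_sectors <= 0 or n_members <= 0:
        raise ValueError("geometry values must be positive")
    chunk_index, within_chunk = divmod(array_sector, chunk_sectors)
    stripe_row, member_index = divmod(chunk_index, n_members)
    return StripeLocation(member_index, stripe_row * chunk_sectors + within_chunk)


# Hypothetical geometry: 1024-byte blocks (the badblocks default),
# a logical volume starting at sector 2048, 512 KiB chunks, two members.
array_sector = lv_block_to_array_sector(65, block_size=1024, lv_start_sector=2048)
location = locate_in_stripe(array_sector, chunk_sectors=1024, n_members=2)
print(array_sector, location)
```

Even in this simplified model, a single reported block number says nothing about the disk until the real offsets and chunk size are known, which is why a bad block seen through these layers is hard to attribute to one device.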

At this point, we can blame the error on cosmic rays, reimage to get back to a sane configuration, and hope it does not happen again. There is no certainty that this will resolve the issue for good, but it seems a good compromise that avoids spending too much time on further investigation at the moment.

Next step: reimage and copy data from another server (note that wdqs2008 is running an initial import at the moment).

Script wmf-auto-reimage was launched by gehel on cumin1001.eqiad.wmnet for hosts:

wdqs1010.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202102191316_gehel_15559_wdqs1010_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['wdqs1010.eqiad.wmnet']

and were ALL successful.

Scheduled a week of downtime; we're not ready for the data import yet (see T267927).

Change 697832 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] reimage: raid0.default_layout=2 for all installers

https://gerrit.wikimedia.org/r/697832

Change 697832 merged by Ryan Kemper:

[operations/puppet@production] reimage: raid0.default_layout=2 for all installers

https://gerrit.wikimedia.org/r/697832