Some swift filesystems reporting negative disk usage
Closed, Resolved, Public

Description

It looks like some xfs filesystems on ms-be104[0123] are reporting negative usage:

/dev/sda3        94G   30G   64G  32% /srv/swift-storage/sda3
/dev/sda4       297G  340M  297G   1% /srv/swift-storage/sda4
/dev/sdb4       297G  340M  297G   1% /srv/swift-storage/sdb4
/dev/sdl1       3.7T  -13T   17T    - /srv/swift-storage/sdl1
/dev/sdj1       3.7T  -13T   17T    - /srv/swift-storage/sdj1
/dev/sdi1       3.7T  -29T   33T    - /srv/swift-storage/sdi1
/dev/sdb3        94G   35G   59G  38% /srv/swift-storage/sdb3
/dev/sdn1       3.7T  -14T   17T    - /srv/swift-storage/sdn1
/dev/sdh1       3.7T  -13T   17T    - /srv/swift-storage/sdh1
/dev/sdc1       3.7T  3.0T  674G  82% /srv/swift-storage/sdc1
/dev/sdk1       3.7T  3.1T  643G  83% /srv/swift-storage/sdk1
/dev/sdm1       3.7T  3.1T  632G  84% /srv/swift-storage/sdm1
/dev/sdf1       3.7T  3.1T  612G  84% /srv/swift-storage/sdf1
/dev/sdg1       3.7T  3.1T  623G  84% /srv/swift-storage/sdg1
/dev/sdd1       3.7T  3.1T  641G  83% /srv/swift-storage/sdd1
tmpfs            13G     0   13G   0% /run/user/4849
/dev/sde1       3.7T  -13T   17T    - /srv/swift-storage/sde1
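
The negative "Used" column follows from how df derives it: used space is total blocks minus free blocks, so a free-block count larger than the filesystem size yields a negative number. A minimal sketch of that arithmetic, assuming 4 KiB filesystem blocks (the usual XFS default, not confirmed in this task) and made-up block counts chosen to resemble the sdl1 row:

```python
TIB = 1024 ** 4
BLOCK_SIZE = 4096  # assumption: 4 KiB filesystem blocks


def df_used_bytes(total_blocks: int, free_blocks: int, block_size: int = BLOCK_SIZE) -> int:
    """Used space the way df derives it: total minus free."""
    return (total_blocks - free_blocks) * block_size


# Hypothetical numbers: a 3.7 TiB filesystem with about 0.8 TiB really free.
total_blocks = int(3.7 * TIB) // BLOCK_SIZE
real_free_blocks = int(0.8 * TIB) // BLOCK_SIZE

# A spurious extra 2**32 blocks in the free counter adds exactly 16 TiB of "free" space.
bogus_free_blocks = real_free_blocks + 2 ** 32

healthy_used = df_used_bytes(total_blocks, real_free_blocks)
corrupt_used = df_used_bytes(total_blocks, bogus_free_blocks)
print(f"healthy used: {healthy_used / TIB:.1f} TiB")
print(f"corrupt used: {corrupt_used / TIB:.1f} TiB")
```

Under these assumptions the corrupt case comes out around -13 TiB used, the same order as the affected rows above.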

These hosts are part of the last batch of additions to eqiad; their counterparts in codfw don't seem to be affected.

Looking at e.g. ms-be1040, it seems the issue started in May, at different times for different filesystems.

herron added a subscriber: herron. Jul 16 2018, 5:54 PM

Mentioned in SAL (#wikimedia-operations) [2018-07-17T07:29:31Z] <godog> run xfs_repair on filesystems reporting negative space available on ms-be1042 - T199198

Mentioned in SAL (#wikimedia-operations) [2018-07-18T08:51:44Z] <godog> run xfs_repair on filesystems reporting negative space available on ms-be1043 - T199198

Mentioned in SAL (#wikimedia-operations) [2018-07-19T09:27:12Z] <godog> run xfs_repair on filesystems reporting negative space available on ms-be1040 - T199198

fgiunchedi closed this task as Resolved. Jul 20 2018, 7:54 AM

The reported negative disk space issue has been repaired, and we have alerting on the condition itself in case it happens again. Also, xfs upstream will be adding more safeguards to avoid writing obviously wrong superblocks.

Krinkle reopened this task as Open. Edited Jul 29 2018, 2:40 AM
Krinkle added a subscriber: Krinkle.
[#wikimedia-operations] 01:55 UTC <icinga-wm>
PROBLEM - Filesystem available is greater than filesystem size on ms-be2040 is CRITICAL: cluster=swift device=/dev/sdc1 fstype=xfs instance=ms-be2040:9100 job=node mountpoint=/srv/swift-storage/sdc1 site=codfw
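
The condition named in this alert can also be checked locally on a host. The following is only a sketch of the comparison using statvfs(2), not the actual alert definition (which is not shown in this task); the function names are hypothetical and "/" stands in for a swift mountpoint such as /srv/swift-storage/sdc1:

```python
import os


def is_inconsistent(avail_blocks: int, total_blocks: int) -> bool:
    """Available space can never legitimately exceed the filesystem size."""
    return avail_blocks > total_blocks


def check_mountpoint(mountpoint: str) -> bool:
    """Apply the comparison to a mounted filesystem via statvfs(2)."""
    st = os.statvfs(mountpoint)
    return is_inconsistent(st.f_bavail, st.f_blocks)


# "/" is a placeholder; on a swift backend this would be a swift-storage mountpoint.
for mp in ["/"]:
    state = "CRITICAL" if check_mountpoint(mp) else "OK"
    print(f"{mp}: {state}")
```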

https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ms-be2040&panelId=12&fullscreen&from=1532822153571&to=1532829392024&var-datasource=codfw%20prometheus%2Fops&var-cluster=swift

Thanks @Krinkle!

Indeed, this is yet another case of bit 32 (value 2^32) flipping in the free blocks counter, interestingly on a codfw host, where we hadn't seen this yet:

sb_fdblocks 4515987426, counted 221020130
obase=2
221020130
1101001011000111111111100010
4515987426
100001101001011000111111111100010
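
The same comparison can be done with an XOR, which isolates the differing bits directly. A small sketch using the two values quoted above (the variable names simply mirror the labels in that output):

```python
sb_fdblocks = 4515987426  # free-block count as stored in the superblock
counted = 221020130       # free-block count as recounted

diff = sb_fdblocks ^ counted  # bits that differ between the two values
flipped = [i for i in range(diff.bit_length()) if (diff >> i) & 1]

print(f"{counted:033b}")
print(f"{sb_fdblocks:033b}")
print("differing bit positions (0-based):", flipped)
```

The two values differ only in bit 32, i.e. by exactly 2^32 blocks.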

I've inquired upstream; one of the suggested approaches is to run with page poisoning, which I'll do on one host in codfw. This issue will also likely be checked for and fixed in linux 4.19.

Mentioned in SAL (#wikimedia-operations) [2018-07-30T10:25:23Z] <godog> run xfs_repair on sdc1 on ms-be2040 - T199198

Mentioned in SAL (#wikimedia-operations) [2018-07-30T12:40:43Z] <godog> reboot ms-be2040 with page_poisoning=1 - T199198

fgiunchedi changed the task status from Open to Stalled. Aug 2 2018, 4:17 PM

Stalling this: it might happen again, and upstream will likely have mitigations in linux 4.19.

fgiunchedi moved this task from Doing to Blocked on the User-fgiunchedi board. Aug 8 2018, 7:41 AM

Mentioned in SAL (#wikimedia-operations) [2018-08-22T16:47:59Z] <godog> repair on ms-be2020 sdh/sdc - T199198

Mentioned in SAL (#wikimedia-operations) [2018-08-22T16:51:11Z] <godog> correction, ms-be2040 - T199198

Mentioned in SAL (#wikimedia-operations) [2018-08-22T16:51:33Z] <godog> repair on ms-be2042 sdk/sdn - T199198

Mentioned in SAL (#wikimedia-operations) [2018-08-27T11:50:43Z] <godog> repair on ms-be2042 sdd - T199198

Mentioned in SAL (#wikimedia-operations) [2018-08-28T13:15:15Z] <godog> repair sdd1 on ms-be2041 - T199198

Mentioned in SAL (#wikimedia-operations) [2018-08-29T08:31:07Z] <godog> repair sdd1 on ms-be2040 - T199198

Mentioned in SAL (#wikimedia-operations) [2018-08-29T08:31:22Z] <godog> repair sdh1 on ms-be2043 - T199198

I've also documented the repair on https://wikitech.wikimedia.org/wiki/Graphite#Repair_xfs_misreporting_free_space as it will come up again for sure.

Mentioned in SAL (#wikimedia-operations) [2018-08-30T16:59:51Z] <godog> xfs_repair on ms-be1041 sdf1 - T199198 (retroactive, started at 15:32)

Mentioned in SAL (#wikimedia-operations) [2018-08-31T08:16:57Z] <godog> repair sde1 on ms-be2041 - T199198

Mentioned in SAL (#wikimedia-operations) [2018-08-31T19:12:17Z] <mutante> ms-be2043 - following instructions at https://wikitech.wikimedia.org/wiki/Graphite#Repair_xfs_misreporting_free_space to repair xfs misreporting free space (T199198), fixing docs, icinga-downtime doesn't want fqdn but short name

Dzahn added a comment. Aug 31 2018, 7:41 PM

< icinga-wm> RECOVERY - Filesystem available is greater than filesystem size on ms-be2043 is OK: All metrics within thresholds.

but command still running in a root screen

Mentioned in SAL (#wikimedia-operations) [2018-08-31T19:46:29Z] <mutante> right when it was fixed on ms-be2043 it also broke on ms-be2040. following the same instructions to fix xfs in a root screen (T199198)

Mentioned in SAL (#wikimedia-operations) [2018-09-03T08:32:40Z] <godog> fix xfs on ms-be1041 sde - T199198

@Dzahn thanks a lot! Also I realized I've pasted the fix instructions on the wrong page (graphite vs swift). The right location is https://wikitech.wikimedia.org/wiki/Swift/How_To#Repair_xfs_free_blocks_counter_corruption

Mentioned in SAL (#wikimedia-operations) [2018-09-05T10:04:33Z] <godog> repair sdh on ms-be1043 - T199198

Mentioned in SAL (#wikimedia-operations) [2018-09-05T13:44:15Z] <godog> repair sdn1 on ms-be2041 - T199198

Krinkle removed a subscriber: Krinkle. Sep 6 2018, 2:20 AM

Mentioned in SAL (#wikimedia-operations) [2018-09-06T08:06:54Z] <godog> repair sde1 on ms-be2042 - T199198

Mentioned in SAL (#wikimedia-operations) [2018-09-07T23:25:01Z] <mutante> ms-be2041 - repairing /dev/sdh1 (T199198)

Mentioned in SAL (#wikimedia-operations) [2018-09-07T23:26:35Z] <mutante> ms-be2042 - repairing /dev/sdj1 (T199198)

Mentioned in SAL (#wikimedia-operations) [2018-09-11T08:39:32Z] <godog> repair xfs on sdh/sdc on ms-be2040 - T199198

Mentioned in SAL (#wikimedia-operations) [2018-09-11T16:16:40Z] <godog> repair sdd1 on ms-be1043 - T199198

Mentioned in SAL (#wikimedia-operations) [2018-09-11T16:17:21Z] <godog> correction, sdk1 on ms-be1041 - T199198

Mentioned in SAL (#wikimedia-operations) [2018-09-11T16:53:18Z] <godog> repair sdd on ms-be1043 - T199198

Mentioned in SAL (#wikimedia-operations) [2018-09-12T08:36:10Z] <godog> repair sdc on ms-be2041 - T199198

Mentioned in SAL (#wikimedia-operations) [2018-09-12T12:17:40Z] <godog> repair sdm / sdj on ms-be2042 - T199198

Mentioned in SAL (#wikimedia-operations) [2018-09-13T09:54:54Z] <godog> repair sdd on ms-be2043 - T199198

Mentioned in SAL (#wikimedia-operations) [2018-09-13T16:45:30Z] <godog> repair sde on ms-be1041 - T199198

Mentioned in SAL (#wikimedia-operations) [2018-09-17T07:31:20Z] <godog> repair sdj on ms-be2040 - T199198

Mentioned in SAL (#wikimedia-operations) [2018-09-17T13:40:25Z] <godog> repair sdl on ms-be2041 - T199198

Mentioned in SAL (#wikimedia-operations) [2018-09-17T23:32:15Z] <mutante> ms-be2042 - repairing xfs - (T199198)

Mentioned in SAL (#wikimedia-operations) [2018-09-18T13:32:02Z] <godog> repair sdd on ms-be2041 - T199198

Mentioned in SAL (#wikimedia-operations) [2018-09-19T09:26:42Z] <godog> repair sdf on ms-be2043 - T199198

Mentioned in SAL (#wikimedia-operations) [2018-09-20T07:31:18Z] <godog> repair sdl on ms-be2042 - T199198

Mentioned in SAL (#wikimedia-operations) [2018-09-20T14:48:57Z] <godog> repair sdn on ms-be2040 - T199198

Mentioned in SAL (#wikimedia-operations) [2018-09-20T17:24:29Z] <godog> repair sdk on ms-be2043 - T199198

Mentioned in SAL (#wikimedia-operations) [2018-09-24T06:57:07Z] <godog> repair sdn on ms-be1041 - T199198

Mentioned in SAL (#wikimedia-operations) [2018-09-24T06:57:47Z] <godog> repair sde on ms-be2041 - T199198

Mentioned in SAL (#wikimedia-operations) [2018-09-25T07:20:49Z] <godog> repair sdm sdi on ms-be2043 - T199198

Mentioned in SAL (#wikimedia-operations) [2018-09-25T12:34:25Z] <godog> repair sde on ms-be2042 - T199198

Mentioned in SAL (#wikimedia-operations) [2018-09-26T07:29:57Z] <godog> repair /dev/sdh1 on ms-be2041 - T199198

Mentioned in SAL (#wikimedia-operations) [2018-09-26T07:46:23Z] <godog> repair /dev/sdd1 on ms-be1043 - T199198

Change 463065 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] base: link documentation for negative disk space available reported

https://gerrit.wikimedia.org/r/463065

Change 463065 merged by Filippo Giunchedi:
[operations/puppet@production] base: link documentation for negative disk space available reported

https://gerrit.wikimedia.org/r/463065

Change 463071 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] base: link documentation for negative disk space available reported #2

https://gerrit.wikimedia.org/r/463071

Mentioned in SAL (#wikimedia-operations) [2018-09-26T16:18:45Z] <godog> test formatting sdd on ms-be2040 with crc=0 - T199198

Mentioned in SAL (#wikimedia-operations) [2018-09-27T01:14:18Z] <mutante> repair /dev/sde1 on ms-be0240 - T199198

Mentioned in SAL (#wikimedia-operations) [2018-09-27T01:14:42Z] <mutante> repair /dev/sdn1 on ms-be0241 - T199198

Mentioned in SAL (#wikimedia-operations) [2018-09-27T07:27:07Z] <godog> test formatting sdc on ms-be1040 with crc=0 - T199198

Mentioned in SAL (#wikimedia-operations) [2018-09-27T07:30:46Z] <godog> test formatting sdd on ms-be1040 with crc=0 - T199198

Since yesterday we've seen the bug recur on previously fixed filesystems (e.g. sdd on ms-be2040), so I've started test-reformatting filesystems on ms-be2040 and ms-be1040 (i.e. the "first" hosts of the last batch) while varying the mkfs.xfs defaults that changed between xfsprogs versions. I'm starting with crc=0, which also implies finobt=0. finobt controls the free inode btree feature, which might be related to this bug; crc controls CRCs for _metadata_ only, and finobt depends on crc being enabled.

Relevant entries from the mkfs.xfs man page:

crc=value
       This is used to create a filesystem which maintains and checks CRC
       information in all metadata objects on disk. The value is either 0
       to disable the feature, or 1 to enable the use of CRCs.

       CRCs  enable  enhanced  error  detection  due  to hardware issues,
       whilst the format changes also improves crash recovery  algorithms
       and  the  ability of various tools to validate and repair metadata
       corruptions when they  are  found.   The  CRC  algorithm  used  is
       CRC32c,  so  the overhead is dependent on CPU architecture as some
       CPUs have hardware acceleration of this algorithm.  Typically  the
       overhead of calculating and checking the CRCs is not noticeable in
       normal operation.

       By default, mkfs.xfs will enable metadata CRCs.

finobt=value
       This option enables the use of a separate free inode  btree  index
       in  each  allocation  group.  The value is either 0 to disable the
       feature, or 1 to create a free  inode  btree  in  each  allocation
       group.

       The  free  inode  btree mirrors the existing allocated inode btree
       index which indexes both used and  free  inodes.  The  free  inode
       btree does not index used inodes, allowing faster, more consistent
       inode allocation performance as filesystems age.

       By default, mkfs.xfs will create free inode btrees for filesystems
       created with the (default) -m crc=1 option set. When the option -m
       crc=0 is used, the free inode btree feature is not  supported  and
       is disabled.
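
To make the dependency between the two options concrete, here is a minimal sketch that builds an mkfs.xfs command line with both metadata options set explicitly and rejects the combination the man page rules out. The helper name and the device /dev/sdX1 are hypothetical, and nothing is executed:

```python
from typing import List


def mkfs_xfs_args(device: str, crc: int, finobt: int) -> List[str]:
    """Build an mkfs.xfs argument list with explicit metadata options."""
    if crc not in (0, 1) or finobt not in (0, 1):
        raise ValueError("crc and finobt must be 0 or 1")
    if crc == 0 and finobt == 1:
        # Per the man page excerpt, the free inode btree needs metadata CRCs.
        raise ValueError("finobt=1 requires crc=1")
    return ["mkfs.xfs", "-m", f"crc={crc},finobt={finobt}", device]


# /dev/sdX1 is a placeholder device name; the command is only printed.
print(" ".join(mkfs_xfs_args("/dev/sdX1", crc=1, finobt=0)))
```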

Mentioned in SAL (#wikimedia-operations) [2018-09-27T08:54:06Z] <godog> test formatting sde on ms-be1040 with crc=0 - T199198

Change 463071 abandoned by Filippo Giunchedi:
base: link documentation for negative disk space available reported #2

Reason:
Abandoning for now, dashboard_links should be used only for grafana.wikimedia.org

https://gerrit.wikimedia.org/r/463071

Mentioned in SAL (#wikimedia-operations) [2018-09-27T13:13:57Z] <godog> repair /dev/sdh1 on ms-be2042 - T199198

Mentioned in SAL (#wikimedia-operations) [2018-10-01T09:33:12Z] <godog> test formatting sdh and sdi on ms-be2040 with crc=0 - T199198

All filesystems on ms-be1040 have been reformatted with crc=0 and almost all have finished filling up again with data. I haven't seen the bug occur again on that host, which is promising. Filesystems on ms-be2040 are being reformatted on demand as the bug comes up, to avoid creating a lot of load on codfw since we're serving swift out of there now.

If the bug doesn't recur on either ms-be1040 or ms-be2040, we can then proceed with reformatting the filesystems on the remaining ms-be[12]04[123] hosts as well.

Mentioned in SAL (#wikimedia-operations) [2018-10-01T10:21:42Z] <godog> repair /dev/sdf1 /dev/sde1 on ms-be1041 - T199198

Mentioned in SAL (#wikimedia-operations) [2018-10-04T07:25:40Z] <godog> reformat ms-be1041 with crc=1 finobt=0 - T199198

Mentioned in SAL (#wikimedia-operations) [2018-10-08T09:51:47Z] <godog> rebuild sdc sdh sdj sdi on ms-be2041 with crc=1 finobt=0 - T199198

Mentioned in SAL (#wikimedia-operations) [2018-10-09T20:08:32Z] <mutante> repair /dev/sdg1 on ms-be2041 - T199198

Mentioned in SAL (#wikimedia-operations) [2018-10-09T22:26:10Z] <shdubsh> repairing /dev/sdl1 on ms-be2040 - T199198

Mentioned in SAL (#wikimedia-operations) [2018-10-12T15:48:31Z] <mutante> repair /dev/sdh1 on ms-be1043 - T199198

Mentioned in SAL (#wikimedia-operations) [2018-10-12T15:50:04Z] <mutante> repair /dev/sde1 on ms-be2041 - T199198

Mentioned in SAL (#wikimedia-operations) [2018-10-15T07:57:33Z] <godog> reformat ms-be2040 with crc=1 finobt=0 - T199198

Mentioned in SAL (#wikimedia-operations) [2018-10-18T14:22:31Z] <godog> begin reformat of ms-be2041 - T199198

Mentioned in SAL (#wikimedia-operations) [2018-10-22T07:24:52Z] <godog> reformat ms-be2042 - T199198

Mentioned in SAL (#wikimedia-operations) [2018-10-25T13:46:35Z] <godog> reformat ms-be2043 xfs filesystems - T199198

Mentioned in SAL (#wikimedia-operations) [2018-10-29T08:07:15Z] <godog> reformat ms-be1042 xfs filesystems - T199198

fgiunchedi moved this task from Blocked to Doing on the User-fgiunchedi board. Oct 29 2018, 1:31 PM

Mentioned in SAL (#wikimedia-operations) [2018-11-02T07:38:50Z] <godog> reformat ms-be1043 xfs filesystems - T199198

Mentioned in SAL (#wikimedia-operations) [2018-11-05T10:02:09Z] <godog> reformat xfs filesystems on ms-be1040 - T199198

Change 472415 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] swift: disable free inode btree at mkfs time

https://gerrit.wikimedia.org/r/472415

fgiunchedi added a comment. Edited Nov 8 2018, 11:17 AM

I've finished rebuilding the 12 filesystems on each of the 8 affected hosts; no recurrence has been observed since.

Change 472415 merged by Filippo Giunchedi:
[operations/puppet@production] swift: disable free inode btree at mkfs time

https://gerrit.wikimedia.org/r/472415

fgiunchedi closed this task as Resolved. Nov 12 2018, 8:15 AM

This is completed now! New xfs filesystems provisioned by puppet will also be created with the right mkfs.xfs flags.

Dzahn awarded a token. Nov 12 2018, 5:24 PM