Page MenuHomePhabricator

File System corruption on cloud-vps instances
Open, In Progress, HighPublic

Description

Recently a couple of VMs in the integration project showed up with filesystem corruption, most recently deployment-elastic07.deployment-prep.eqiad.wmflabs.

I ran a search to see if this is a widespread issue. It's not /that/ widespread, but also not totally isolated.

sudo cumin --force --timeout 500  "A:all" "dmesg | grep 'since last fsck'"

shows 11 hosts with 1 or more errors. They are:

commonsarchive-prod.commonsarchive.eqiad1.wikimedia.cloud
deployment-elastic05.deployment-prep.eqiad1.wikimedia.cloud
deployment-elastic06.deployment-prep.eqiad1.wikimedia.cloud
deployment-elastic07.deployment-prep.eqiad1.wikimedia.cloud
ores-lb-03.ores.eqiad1.wikimedia.cloud
parsing-qa-01.wikitextexp.eqiad1.wikimedia.cloud
pub2.wikiapiary.eqiad1.wikimedia.cloud
toolsbeta-sgeexec-0901.toolsbeta.eqiad1.wikimedia.cloud
toolsbeta-sgewebgrid-generic-0901.toolsbeta.eqiad1.wikimedia.cloud
toolsbeta-sgewebgrid-lighttpd-0901.toolsbeta.eqiad1.wikimedia.cloud
whgi.wikidumpparse.eqiad1.wikimedia.cloud

Two other instances had the same issue:

tools-sgeexec-0907.toolsbeta.eqiad1.wikimedia.cloudT290798
integration-agent-qemu-1001.integration.eqiad1.wikimedia.cloudT290615

Going forward, we can run sudo cumin --force --timeout 500 "A:all" "dmesg | grep -q -m 1 'since last fsck'" to see if the list is growing. With that command any host with "success" has that error in dmesg. As of 2021-10-04, the number returned from that is 9 (with one host affected down, so it doesn't report "success").

Event Timeline

Things to look for:

  • preserve dates about when corruption first appears on VMs
  • check logs on virt drivers
  • check logs on ceph OSDs to see if they're troubled
Andrew triaged this task as High priority.Sep 15 2021, 3:08 PM

On the OSDs, I'm trying to find a good keyword to grep for from dmesg or kern.log files. The word "reset" is not bad since failing SSDs will start resetting themselves instead of actually failing sometimes.

You can interrogate the controller for things it has seen as well with sudo megacli -adpeventlog -getevents -f events.log -aALL -nolog, which will regrettably make an "events.log" file that you can read. I don't know any way to make that dump to stdout off the top of my head and megacli is not self-documenting as far as I can tell.

The OSD disks aren't really used as filesystems so the ceph logs under /var/log/ceph/ might have something. Maybe we can grep for times if there is a common time when corruption started on the disks?

Today there are 10 hosts showing the dmesg errors. This gives me some hope that this isn't a growing/ongoing problem:

commonsarchive-mwtest.commonsarchive.eqiad1.wikimedia.cloud,configgrouptest.trove.eqiad1.wikimedia.cloud,gerrit-prod-1001.devtools.eqiad1.wikimedia.cloud,img-sugg-db-test.trove.eqiad1.wikimedia.cloud,locality.trove.eqiad1.wikimedia.cloud,metricsinfra-db-1.trove.eqiad1.wikimedia.cloud,mwv-builder-03.mediawiki-vagrant.eqiad1.wikimedia.cloud,osmit-pgsql.trove.eqiad1.wikimedia.cloud,pawsdb-1.trove.eqiad1.wikimedia.cloud,quarry-db-02.trove.eqiad1.wikimedia.cloud

From

sudo cumin --force --timeout 500  "A:all" "dmesg -T  | grep 'since last fsck' | head -n 3"

Unfortunately I don't see a clear day when things went wrong

===== NODE GROUP =====                                                                                                                               
(1) ores-lb-03.ores.eqiad1.wikimedia.cloud                                                                                                           
----- OUTPUT of 'dmesg -T  | grep...sck' | head -n 3' -----                                                                                          
[Sat Feb 27 00:16:56 2021] EXT4-fs (dm-0): error count since last fsck: 1                                                                            
[Sun Feb 28 01:54:11 2021] EXT4-fs (dm-0): error count since last fsck: 1                                                                            
[Mon Mar  1 03:31:58 2021] EXT4-fs (dm-0): error count since last fsck: 1
===== NODE GROUP =====                                                                                                                               
(1) pub2.wikiapiary.eqiad1.wikimedia.cloud                                                                                                           
----- OUTPUT of 'dmesg -T  | grep...sck' | head -n 3' -----                                                                                          
[Sat Feb 27 00:17:02 2021] EXT4-fs (dm-0): error count since last fsck: 3                                                                            
[Sun Feb 28 01:54:16 2021] EXT4-fs (dm-0): error count since last fsck: 3                                                                            
[Mon Mar  1 03:32:03 2021] EXT4-fs (dm-0): error count since last fsck: 3
===== NODE GROUP =====                                                                                                                               
(1) toolsbeta-sgewebgrid-generic-0901.toolsbeta.eqiad1.wikimedia.cloud                                                                               
----- OUTPUT of 'dmesg -T  | grep...sck' | head -n 3' -----                                                                                          
[Mon Jun 14 21:59:26 2021] EXT4-fs (dm-0): error count since last fsck: 1                                                                            
[Tue Jun 15 23:36:44 2021] EXT4-fs (dm-0): error count since last fsck: 1                                                                            
[Thu Jun 17 01:14:34 2021] EXT4-fs (dm-0): error count since last fsck: 1
===== NODE GROUP =====                                                                                                                               
(1) whgi.wikidumpparse.eqiad1.wikimedia.cloud                                                                                                        
----- OUTPUT of 'dmesg -T  | grep...sck' | head -n 3' -----                                                                                          
[Sat Feb 27 00:17:05 2021] EXT4-fs (dm-0): error count since last fsck: 1                                                                            
[Sun Feb 28 01:54:19 2021] EXT4-fs (dm-0): error count since last fsck: 1                                                                            
[Mon Mar  1 03:32:05 2021] EXT4-fs (dm-0): error count since last fsck: 1
===== NODE GROUP =====                                                                                                                               
(1) parsing-qa-01.wikitextexp.eqiad1.wikimedia.cloud                                                                                                 
----- OUTPUT of 'dmesg -T  | grep...sck' | head -n 3' -----                                                                                          
[Sat Sep 18 23:50:16 2021] EXT4-fs (dm-0): error count since last fsck: 1                                                                            
[Mon Sep 20 01:28:10 2021] EXT4-fs (dm-0): error count since last fsck: 1                                                                            
===== NODE GROUP =====                                                                                                                               
(1) toolsbeta-sgeexec-0901.toolsbeta.eqiad1.wikimedia.cloud                                                                                          
----- OUTPUT of 'dmesg -T  | grep...sck' | head -n 3' -----                                                                                          
[Mon Jun 14 22:01:20 2021] EXT4-fs (dm-0): error count since last fsck: 1                                                                            
[Tue Jun 15 23:38:35 2021] EXT4-fs (dm-0): error count since last fsck: 1                                                                            
[Thu Jun 17 01:16:23 2021] EXT4-fs (dm-0): error count since last fsck: 1
===== NODE GROUP =====                                                                                                                               
(1) deployment-elastic07.deployment-prep.eqiad1.wikimedia.cloud                                                                                      
----- OUTPUT of 'dmesg -T  | grep...sck' | head -n 3' -----                                                                                          
[Tue Mar  2 05:09:59 2021] EXT4-fs (dm-0): error count since last fsck: 1                                                                            
[Wed Mar  3 06:47:53 2021] EXT4-fs (dm-0): error count since last fsck: 1                                                                            
[Thu Mar  4 08:25:44 2021] EXT4-fs (dm-0): error count since last fsck: 1
===== NODE GROUP =====                                                                                                                               
(1) commonsarchive-prod.commonsarchive.eqiad1.wikimedia.cloud                                                                                        
----- OUTPUT of 'dmesg -T  | grep...sck' | head -n 3' -----                                                                                          
[Fri Feb 26 20:57:47 2021] EXT4-fs (dm-0): error count since last fsck: 1                                                                            
[Sat Feb 27 22:35:01 2021] EXT4-fs (dm-0): error count since last fsck: 1                                                                            
[Mon Mar  1 00:12:48 2021] EXT4-fs (dm-0): error count since last fsck: 1
===== NODE GROUP =====                                                                                                                               
(1) toolsbeta-sgewebgrid-lighttpd-0901.toolsbeta.eqiad1.wikimedia.cloud                                                                              
----- OUTPUT of 'dmesg -T  | grep...sck' | head -n 3' -----                                                                                          
[Mon Jun 14 21:49:10 2021] EXT4-fs (dm-0): error count since last fsck: 1                                                                            
[Tue Jun 15 23:26:31 2021] EXT4-fs (dm-0): error count since last fsck: 1                                                                            
[Thu Jun 17 01:04:24 2021] EXT4-fs (dm-0): error count since last fsck: 1
===== NODE GROUP =====                                                                                                                               
(1) deployment-elastic06.deployment-prep.eqiad1.wikimedia.cloud                                                                                      
----- OUTPUT of 'dmesg -T  | grep...sck' | head -n 3' -----                                                                                          
[Mon Mar  1 03:32:08 2021] EXT4-fs (dm-0): error count since last fsck: 1                                                                            
[Tue Mar  2 05:10:02 2021] EXT4-fs (dm-0): error count since last fsck: 1                                                                            
[Wed Mar  3 06:47:57 2021] EXT4-fs (dm-0): error count since last fsck: 1
===== NODE GROUP =====                                                                                                                               
(1) deployment-elastic05.deployment-prep.eqiad1.wikimedia.cloud                                                                                      
----- OUTPUT of 'dmesg -T  | grep...sck' | head -n 3' -----                                                                                          
[Mon Mar  1 03:31:49 2021] EXT4-fs (dm-0): error count since last fsck: 1                                                                            
[Tue Mar  2 05:09:35 2021] EXT4-fs (dm-0): error count since last fsck: 1                                                                            
[Wed Mar  3 06:47:22 2021] EXT4-fs (dm-0): error count since last fsck: 1
================

Expanding on @Andrew 's log, I created a dummy scirpt (F34649566) to parse the output, trying to get some patterns, this is the output, using the last 100 lines for each host: F34649565

Some things to note:

  • It seems to happen per VM, once a day, but not exactly at the same time (maybe the backups?)
  • It seems to affect more often VMs of the same project at a time, and one after the other (with a few min separation).

I'm going to see if I can correlate this with the backups next.

Finding out where the VM backups happen:

11:39 AM ~/Work/wikimedia/operations-puppet  (production|…1)
dcaro@vulcanus$ git grep -B 1 virt_ceph_and_backy
manifests/site.pp-node /^cloudvirt102[1-8]\.eqiad\.wmnet$/ {
manifests/site.pp:    role(wmcs::openstack::eqiad1::virt_ceph_and_backy)
...
11:39 AM ~/Work/wikimedia/operations-puppet  (production|…1)
dcaro@vulcanus$ ssh cloudvirt1021.eqiad.wmnet sudo wmcs-backup instances where toolsbeta-sgewebgrid-generic-0901.toolsbeta.eqiad1.wikimedia.cloud
INFO:[2021-09-21 09:40:30,753] cloudvirt1024

But toolsbeta is being excluded xd

### cloudvirt1024.eqiad.wmnet:/etc/wmcs_backup_instances.yaml
...
exclude_servers:
...
  toolsbeta:
    - ".*"
...

Looking at the cloudvirts/cloudcontrol logs around those times...

:facepalm: for myself xd

It most probably repeats every day because it does not get fixed, and the fsck runs every day.

This is the list of first seen errors with times:

2021-02-26 20:57:00 - count: 1
    hosts:
        commonsarchive-prod.commonsarchive.eqiad1.wikimedia.cloud
2021-02-27 00:16:00 - count: 1
    hosts:
        ores-lb-03.ores.eqiad1.wikimedia.cloud
2021-02-27 00:17:00 - count: 2
    hosts:
        whgi.wikidumpparse.eqiad1.wikimedia.cloud
        pub2.wikiapiary.eqiad1.wikimedia.cloud
2021-03-01 03:31:00 - count: 1
    hosts:
        deployment-elastic05.deployment-prep.eqiad1.wikimedia.cloud
2021-03-01 03:32:00 - count: 1
    hosts:
        deployment-elastic06.deployment-prep.eqiad1.wikimedia.cloud
2021-03-02 05:09:00 - count: 1
    hosts:
        deployment-elastic07.deployment-prep.eqiad1.wikimedia.cloud
2021-06-14 21:49:00 - count: 1
    hosts:
        toolsbeta-sgewebgrid-lighttpd-0901.toolsbeta.eqiad1.wikimedia.cloud
2021-06-14 21:59:00 - count: 1
    hosts:
        toolsbeta-sgewebgrid-generic-0901.toolsbeta.eqiad1.wikimedia.cloud
2021-06-14 22:01:00 - count: 1
    hosts:
        toolsbeta-sgeexec-0901.toolsbeta.eqiad1.wikimedia.cloud
2021-09-18 23:50:00 - count: 1
    hosts:
        parsing-qa-01.wikitextexp.eqiad1.wikimedia.cloud

So those are the dates/times we are interested in, the 24h period right before those (since last fsck).

dcaro changed the task status from Open to In Progress.Wed, Sep 29, 7:35 AM
dcaro moved this task from Backlog to Today on the cloud-services-team (zz-archived1) board.

Mentioned in SAL (#wikimedia-cloud) [2021-10-04T14:45:24Z] <dcaro> rebooting toolsbeta-sgewebgrid-generic-0901.toolsbeta.eqiad1.wikimedia.cloud to force a fsck of the dm-0 device on boot (T290970)

I don't see any new file system corruption going on, there's still one host (toolsbeta-sgewebgrid-generic-0901.toolsbeta.eqiad1.wikimedia.cloud) whose dmesg still complains about the fsystem check having 1 error, but did not find others.

I'll reboot that host to force a fsck on boot, and close the task.

Did the following:

# edit /etc/fstab to have a pass value of 2, so the mount is fsck'ed regularly
- UUID=8d24970c-1b27-4544-b346-adbfe686feb0       /tmp    ext4    rw,nosuid,nodev,noexec,relatime 0       0
+ UUID=8d24970c-1b27-4544-b346-adbfe686feb0       /tmp    ext4    rw,nosuid,nodev,noexec,relatime 0       2


# set a frequency of 1 fsck per mount 
root@toolsbeta-sgewebgrid-generic-0901:~# tune2fs -c 1 /dev/dm-0

# reboot

But something went wrong (from the logs on horizon):

[[0;1;31mFAILED[0m] Failed to listen on RPCbind Server Activation Socket.
See 'systemctl status rpcbind.socket' for details.
[[0;1;33mDEPEND[0m] Dependency failed for NFS status monitor for NFSv2/3 locking..
[[0;32m  OK  [0m] Started Preprocess NFS configuration.
         Starting Notify NFS peers of a restart...
[[0;32m  OK  [0m] Started Notify NFS peers of a restart.
[[0;1;31mFAILED[0m] Failed to listen on RPCbind Server Activation Socket.
See 'systemctl status rpcbind.socket' for details.
[[0;1;33mDEPEND[0m] Dependency failed for NFS status monitor for NFSv2/3 locking..
         Starting Preprocess NFS configuration...
[[0;32m  OK  [0m] Started Preprocess NFS configuration.
         Starting Notify NFS peers of a restart...
[[0;32m  OK  [0m] Started Notify NFS peers of a restart.
You are in emergency mode. After logging in, type "journalctl -xb" to view
system logs, "systemctl reboot" to reboot, "systemctl default" or ^D to
try again to boot into default mode.

Cannot open access to console, the root account is locked.
See sulogin(8) man page for more details.

Press Enter to continue

From virsh console, after a hard reboot:

[FAILED] Failed to start File System Check o…4970c-1b27-4544-b346-adbfe686feb0.
See 'systemctl status "systemd-fsck@dev…\x2dadbfe686feb0.service"' for details.
[DEPEND] Dependency failed for /tmp.
[DEPEND] Dependency failed for Entropy daemon using the HAVEGE algorithm.
[DEPEND] Dependency failed for Network Time Synchronization.
[DEPEND] Dependency failed for Local File Systems.

I'm guessing that /tmp is failing to get checked properly, looking...

toolsbeta-sgewebgrid-0901 is similar:

Checking in progress on 1 disk (0.0% complete)Checking in progress on 0 disks (100.0% complete)[[0;1;31mFAILED[0m] Failed to start File System Check o���4970c-1b27-4544-b346-adbfe686feb0.

Same after hard reboot. I kind of wonder if this all happened at the same time or what. Considering that host will no longer accept ssh (before I tried hard reboot, it already wouldn't), I wonder if that happened since this ticket opened.

Running sudo cumin --force --timeout 500 "A:all" "dmesg | grep 'since last fsck'" quit quite early on me. I ran it again a little different. This needs to exclude trove from the "all" set somehow. It's annoying. I ran: sudo cumin --force -x --timeout 500 "A:all" "dmesg | grep -m 1 'since last fsck'" to make sure it didn't quit and minimized output from grep.

The results were this after filtering out "cloud" to get hostnames and then removing "trove" to remove all that distraction:

(1) pub2.wikiapiary.eqiad1.wikimedia.cloud
(1) whgi.wikidumpparse.eqiad1.wikimedia.cloud
(1) deployment-elastic05.deployment-prep.eqiad1.wikimedia.cloud
(1) commonsarchive-prod.commonsarchive.eqiad1.wikimedia.cloud
(1) toolsbeta-sgewebgrid-generic-0901.toolsbeta.eqiad1.wikimedia.cloud
ssh: connect to host toolsbeta-sgewebgrid-generic-0901.toolsbeta.eqiad1.wikimedia.cloud port 22: Connection refused
(1) ores-lb-03.ores.eqiad1.wikimedia.cloud
(1) commonsarchive-mwtest.commonsarchive.eqiad1.wikimedia.cloud
ssh: connect to host commonsarchive-mwtest.commonsarchive.eqiad1.wikimedia.cloud port 22: No route to host
(1) toolsbeta-sgewebgrid-lighttpd-0901.toolsbeta.eqiad1.wikimedia.cloud
(2) gerrit-prod-1001.devtools.eqiad1.wikimedia.cloud,mwv-builder-03.mediawiki-vagrant.eqiad1.wikimedia.cloud
(1) deployment-elastic07.deployment-prep.eqiad1.wikimedia.cloud
(1) wcdo.wcdo.eqiad1.wikimedia.cloud
(1) deployment-elastic06.deployment-prep.eqiad1.wikimedia.cloud
(1) toolsbeta-sgeexec-0901.toolsbeta.eqiad1.wikimedia.cloud

That's 13 hosts, plus commonsarchive-mwtest.commonsarchive.eqiad1.wikimedia.cloud, which I'm checking now. @Andrew I don't know if that means the problem is spreading slowly or that my search was just more pervasive. It is worth noting that this is AFTER we rebuilt three hosts that were positive at this point.

Yeah, 13 hosts. commonsarchive-mwtest is just exploded (kernel panics). However, we can take two off the list because it was an ssh issue ((2) gerrit-prod-1001.devtools.eqiad1.wikimedia.cloud,mwv-builder-03.mediawiki-vagrant.eqiad1.wikimedia.cloud). I also removed wcdo because that thing just likes to have python OOM panics it looks like. I didn't see the actual issue in dmesg.

So that means there are 10 now (since I know that sgewebgrid node is down because of corruption). I don't know if cumin has any way to skip ssh failures. It would make this much easier to measure with a reproducible command.

We actually tried to exclude trove from the all alias. It didn't work: https://gerrit.wikimedia.org/r/c/operations/puppet/+/715245 should have done it, but the config file clearly doesn't include that change.

Ah...because that's actually in the instance's hiera in horizon. Changing that.

Ok! Running that again with less distractions on board: sudo cumin --force -x --timeout 500 "A:all" "dmesg | grep -m 1 'since last fsck'" > corruption-search.txt

bstorm@cloud-cumin-01:~$ grep 'error count' corruption-search.txt | wc -l
9

plus 1 for toolsbeta-sgewebgrid-generic-0901 says we are definitely at 10. I don't know if that is a growth or I just captured a couple more in my list. Since json output is possible, maybe I can try to come up with a command that actually can be rerun to show a delta. Otherwise, there's going to be a lot of guesswork in this ticket.

Here, this works (except for hosts that are hard down like toolsbeta-sgewebgrid-generic-0901):
sudo cumin --force --timeout 500 "A:all" "dmesg | grep -q -m 1 'since last fsck'", with that "success" means the filesystem had errors.

results:

1.1% (9/819) success ratio (< 100.0% threshold) of nodes successfully executed all commands. Aborting.: commonsarchive-prod.commonsarchive.eqiad1.wikimedia.cloud,deployment-elastic[05-07].deployment-prep.eqiad1.wikimedia.cloud,ores-lb-03.ores.eqiad1.wikimedia.cloud,pub2.wikiapiary.eqiad1.wikimedia.cloud,toolsbeta-sgeexec-0901.toolsbeta.eqiad1.wikimedia.cloud,toolsbeta-sgewebgrid-lighttpd-0901.toolsbeta.eqiad1.wikimedia.cloud,whgi.wikidumpparse.eqiad1.wikimedia.cloud

This seems reliable.

Nope. This one was something to do with LDAP and the stack on this instance which was upgraded in place to Buster.

Thanks @Bstorm, though that only gives you if those hosts ever had an error, not if they still have it, or when they did start having it.

Modified the script above (new version F34672816) and got the following, for first occurrences since last reboot, bucketed by minute (to try to expose correlated events):

10:13 AM ~
dcaro@vulcanus$ ./parse_fsck.py first-occurrence out
2021-02-26 20:57:00 - count: 1
    hosts:
        commonsarchive-prod.commonsarchive.eqiad1.wikimedia.cloud
2021-02-27 00:16:00 - count: 1
    hosts:
        ores-lb-03.ores.eqiad1.wikimedia.cloud
2021-02-27 00:17:00 - count: 2
    hosts:
        pub2.wikiapiary.eqiad1.wikimedia.cloud
        whgi.wikidumpparse.eqiad1.wikimedia.cloud
2021-03-01 03:31:00 - count: 1
    hosts:
        deployment-elastic05.deployment-prep.eqiad1.wikimedia.cloud
2021-03-01 03:32:00 - count: 1
    hosts:
        deployment-elastic06.deployment-prep.eqiad1.wikimedia.cloud
2021-03-02 05:09:00 - count: 1
    hosts:
        deployment-elastic07.deployment-prep.eqiad1.wikimedia.cloud
2021-10-04 17:12:00 - count: 1
    hosts:
        toolsbeta-sgewebgrid-lighttpd-0901.toolsbeta.eqiad1.wikimedia.cloud
2021-10-04 17:14:00 - count: 1
    hosts:
        toolsbeta-sgeexec-0901.toolsbeta.eqiad1.wikimedia.cloud

For last occurrences (edited, fixed a bug in the script):

10:42 AM ~
dcaro@vulcanus$ ./parse_fsck.py last-occurrence out
2021-10-04 17:12:00 - count: 1
    hosts:
        toolsbeta-sgewebgrid-lighttpd-0901.toolsbeta.eqiad1.wikimedia.cloud
2021-10-04 17:14:00 - count: 1
    hosts:
        toolsbeta-sgeexec-0901.toolsbeta.eqiad1.wikimedia.cloud
2021-10-04 20:41:00 - count: 1
    hosts:
        commonsarchive-prod.commonsarchive.eqiad1.wikimedia.cloud
2021-10-04 23:59:00 - count: 2
    hosts:
        deployment-elastic05.deployment-prep.eqiad1.wikimedia.cloud
        pub2.wikiapiary.eqiad1.wikimedia.cloud
2021-10-05 00:01:00 - count: 3
    hosts:
        deployment-elastic06.deployment-prep.eqiad1.wikimedia.cloud
        deployment-elastic07.deployment-prep.eqiad1.wikimedia.cloud
        whgi.wikidumpparse.eqiad1.wikimedia.cloud
2021-10-05 00:17:00 - count: 1
    hosts:
        ores-lb-03.ores.eqiad1.wikimedia.cloud

So it looks like the hosts got corrupted, and they did not recover (the toolsbeta two rebooted 15h ago, but in previous comments you can see they had the error in June).
I don't see any new events there in the last few months.

Also note that dmesg times are kinda unreliable, but well, good enough.

dcaro moved this task from Backlog to Today on the User-dcaro board.
dcaro moved this task from ToDo to In Progress on the cloud-services-team (FY2021/2022-Q1) board.