
cloudvirt1024 reports backup job failures
Closed, Resolved · Public

Event Timeline

Icinga just paged for 'Check the last execution of backup_vms'

Latest backy2 logs:

Jan 12 05:01:17 cloudvirt1024 wmcs-backup-instances[50951]: [37B blob data]
Jan 12 05:01:27 cloudvirt1024 wmcs-backup-instances[50951]: Traceback (most recent call last):
Jan 12 05:01:27 cloudvirt1024 wmcs-backup-instances[50951]:   File "/usr/local/sbin/wmcs-backup-instances", line 53, in <module>
Jan 12 05:01:27 cloudvirt1024 wmcs-backup-instances[50951]:     config["ceph_pool"], server.id, config["live_for_days"]
Jan 12 05:01:27 cloudvirt1024 wmcs-backup-instances[50951]:   File "/usr/lib/python3/dist-packages/rbd2backy2.py", line 560, in backup_vm
Jan 12 05:01:27 cloudvirt1024 wmcs-backup-instances[50951]:     backup_volume(pool, vm_disk, live_for_days)
Jan 12 05:01:27 cloudvirt1024 wmcs-backup-instances[50951]:   File "/usr/lib/python3/dist-packages/rbd2backy2.py", line 545, in backup_volume
Jan 12 05:01:27 cloudvirt1024 wmcs-backup-instances[50951]:     pool, volume, last_snap, backy_snap_version_uid, expire
Jan 12 05:01:27 cloudvirt1024 wmcs-backup-instances[50951]:   File "/usr/lib/python3/dist-packages/rbd2backy2.py", line 502, in _differential_backup
Jan 12 05:01:27 cloudvirt1024 wmcs-backup-instances[50951]:     noop=noop,
Jan 12 05:01:27 cloudvirt1024 wmcs-backup-instances[50951]:   File "/usr/lib/python3/dist-packages/rbd2backy2.py", line 31, in run_command
Jan 12 05:01:27 cloudvirt1024 wmcs-backup-instances[50951]:     output = subprocess.check_output(args).decode("utf8")
Jan 12 05:01:27 cloudvirt1024 wmcs-backup-instances[50951]:   File "/usr/lib/python3.7/subprocess.py", line 395, in check_output
Jan 12 05:01:27 cloudvirt1024 wmcs-backup-instances[50951]:     **kwargs).stdout
Jan 12 05:01:27 cloudvirt1024 wmcs-backup-instances[50951]:   File "/usr/lib/python3.7/subprocess.py", line 487, in run
Jan 12 05:01:27 cloudvirt1024 wmcs-backup-instances[50951]:     output=stdout, stderr=stderr)
Jan 12 05:01:27 cloudvirt1024 wmcs-backup-instances[50951]: subprocess.CalledProcessError: Command '['/usr/bin/backy2', 'backup', '-s', '2021-01-12T05:01:16', '-r', '/tmp/tmpb5jf95pl', '-f', b'771d91c4-53ca-11eb-8163-b02628295df0',  'rbd://eqiad1-compute/594b4b48-0bbb-4296-bb52-ef7861036ef6_disk@2021-01-12T05:01:16', '594b4b48-0bbb-4296-bb52-ef7861036ef6_disk', '-e', '2021-01-17', '-t', 'differential_backup']' died with <Signals.SIGSEGV: 11>.
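
For reference, the trailing 'died with <Signals.SIGSEGV: 11>' means the /usr/bin/backy2 child process itself was killed by a segfault; Python reports a child killed by a signal as a negative return code, and check_output() turns that into the CalledProcessError above. A minimal sketch of how that surfaces (not the actual rbd2backy2 code):

# Minimal sketch, not the real wmcs-backup-instances/rbd2backy2 code: a child
# killed by SIGSEGV shows up with returncode == -11, and
# subprocess.check_output() raises CalledProcessError for any nonzero code.
import signal
import subprocess

try:
    # A child that kills itself with SIGSEGV stands in for /usr/bin/backy2.
    subprocess.check_output(["bash", "-c", "kill -SEGV $$"])
except subprocess.CalledProcessError as err:
    assert err.returncode == -signal.SIGSEGV  # i.e. -11
    print("child segfaulted; the crash is in backy2/librbd, not in the wrapper")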

We might be hitting a ceph bug:

[Tue Jan 12 05:00:43 2021] safe_timer[61962]: segfault at 7fbb79a83eb8 ip 00007fbb5f796cef sp 00007fbb5eb80970 error 4 in libceph-common.so.0[7fbb5f5ef000+5d3000]
[Tue Jan 12 05:00:43 2021] Code: 1f 40 00 4d 89 f4 eb bb 0f 1f 00 41 57 41 56 4c 8d 77 08 41 55 41 54 55 53 48 83 ec 08 48 8b 5f 10 48 85 db 0f 84 d9 00 00 00 <4c> 8b 7e 08 4c 8b 2e 4d 89 f4 eb 11 0f 1f 44 00 00 49 89 dc 48 8b

and it seems to have happened at least once before:

[Thu Dec 31 03:54:29 2020] safe_timer[21682]: segfault at 7f5e34e82eb8 ip 00007f5e13796cef sp 00007f5e11b7e970 error 4 in libceph-common.so.0[7f5e135ef000+5d3000]
[Thu Dec 31 03:54:29 2020] Code: 1f 40 00 4d 89 f4 eb bb 0f 1f 00 41 57 41 56 4c 8d 77 08 41 55 41 54 55 53 48 83 ec 08 48 8b 5f 10 48 85 db 0f 84 d9 00 00 00 <4c> 8b 7e 08 4c 8b 2e 4d 89 f4 eb 11 0f 1f 44 00 00 49 89 dc 48 8b

Maybe related (though it seemed to have been fixed by upgrading to Nautilus and/or by using ceph config instead of the config file): https://tracker.ceph.com/issues/24835

The logs on the Ceph side don't seem to show much; there was some compaction going on on one of them, but it finished without issues:

journals -> P13733
osd logs -> 04:50->05:00 P13734, 05:00->05:10 P13735

Andrew triaged this task as Medium priority.Jan 12 2021, 5:03 PM
Andrew moved this task from Inbox to Doing on the cloud-services-team (Kanban) board.

This might get fixed with T272303, but I'm not sure; after the upgrade I will monitor to see if more error logs show up.

This has happened again:
https://phabricator.wikimedia.org/P14413

Feb 19 05:07:31 cloudvirt1024 wmcs-backup[56558]: subprocess.CalledProcessError: Command '['/usr/bin/backy2', 'backup', '--snapshot-name', '2021-02-19T05:07:16_cloudvirt1024', '--rbd', '/tmp/tmpbgykl78e', '--from-version', '547f81fe-71a7-11eb-bee1-b02628295df0', 'rbd://eqiad1-compute/65343972-d77a-4c93-a69f-6a0ff15d082c_disk@2021-02-19T05:07:16_cloudvirt1024', '65343972-d77a-4c93-a69f-6a0ff15d082c_disk', '--expire', '2021-02-24 05:07:16', '--tag', 'differential_backup']' died with <Signals.SIGSEGV: 11>.

looking

Hmm... there are some suspicious logs before and after the error (at least this time):
https://phabricator.wikimedia.org/P14414

Feb 19 05:07:27 cloudvirt1024 sudo[6922]:   nagios : TTY=unknown ; PWD=/ ; USER=root ; COMMAND=/usr/local/lib/nagios/plugins/check_raid megacli
...
Crash happens here
...
Feb 19 05:07:36 cloudvirt1024 nova-compute[62835]: 2021-02-19 05:07:36.410 62835 WARNING nova.compute.resource_tracker [req-1492a22f-6b0d-44b0-b15f-6a174218312a - - - - -] Instance 98b912c0-8cf7-45f2-8728-942fdcdc803b has been moved to another host cloudvirt1022(cloudvirt1022.eqiad.wmnet). There are allocations remaining against the source host that might need to be removed: {'resources': {'VCPU': 1, 'MEMORY_MB': 2048, 'DISK_GB': 80}}.
...
Feb 19 05:08:04 cloudvirt1024 systemd[7807]: gpgconf: running /usr/bin/gpg-agent failed (exitcode=2): General error
Feb 19 05:08:04 cloudvirt1024 systemd[7807]: gpgconf: fatal error (exit status 1)
...
Feb 19 05:08:32 cloudvirt1024 sudo[8264]:   nagios : TTY=unknown ; PWD=/ ; USER=root ; COMMAND=/usr/local/lib/nagios/plugins/check_raid megacli
...
Feb 19 05:09:16 cloudvirt1024 smartd[1190]: Device: /dev/bus/0 [megaraid_disk_08] [SAT], failed to read SMART Attribute Data
Feb 19 05:09:16 cloudvirt1024 smartd[1190]: Device: /dev/bus/0 [megaraid_disk_08] [SAT], Read SMART Self Test Log Failed
...

The nagios check happens rather frequently:

dcaro@cloudvirt1024:~$ sudo journalctl -S '1 day ago' | grep 'check_raid' | wc
    472    7080   68375

Though the gpg error happens only around twice an hour:

dcaro@cloudvirt1024:~$ sudo journalctl -S '1 day ago' | grep '/usr/bin/gpg-agent failed' | wc
     50     600    5842

and always right after debmonitor logs in:

dcaro@cloudvirt1024:~$ sudo journalctl -S '1 day ago' | grep 'su.*session opened for user debmonitor by' | wc
     51     663    5658
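
The counts are almost one to one (50 vs 51), which fits the 'right after' pattern. A rough way to double-check the pairing, assuming journalctl's short-iso output format and enough privileges to read the full journal (just a sketch, nothing deployed anywhere):

# Rough sketch: check how many gpg-agent failures happen within a few
# seconds of a debmonitor session being opened (run as root/sudo so the
# full journal is readable).
import subprocess
from datetime import datetime, timedelta

def timestamps(pattern):
    """Timestamps of journal lines from the last day containing pattern."""
    out = subprocess.run(
        ["journalctl", "-S", "1 day ago", "-o", "short-iso"],
        capture_output=True, text=True, check=True,
    ).stdout
    stamps = []
    for line in out.splitlines():
        if pattern not in line:
            continue
        # short-iso lines start with e.g. 2021-02-19T05:08:04+0000
        stamps.append(datetime.strptime(line.split()[0], "%Y-%m-%dT%H:%M:%S%z"))
    return stamps

gpg = timestamps("/usr/bin/gpg-agent failed")
deb = timestamps("session opened for user debmonitor by")
paired = sum(
    any(timedelta(0) <= g - d <= timedelta(seconds=30) for d in deb) for g in gpg
)
print(f"{paired}/{len(gpg)} gpg failures happen within 30s of a debmonitor login")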

Seems like an unrelated issue.

The log about the moved instance seems to happen quite frequently, >1k times/day:

dcaro@cloudvirt1024:~$ sudo journalctl -S '1 day ago' | grep '98b912c0-8cf7-45f2-8728-942fdcdc803b has been moved' | wc
   1416   65136  623040

So probably unrelated (though there might be some stale data stuck somewhere).

The one that bothers me most is the smartd error:

dcaro@cloudvirt1024:~$ sudo journalctl -S '1 day ago' | grep 'smartd.*failed to read SMART' | wc
     48     720    6000

It seems to happen with the same frequency as the gpg error though (probably a red herring?).

This does not look good, though I'm not sure if I should trust these values:

dcaro@cloudvirt1024:~$ sudo smartctl --all /dev/sda -d megaraid,0 | phaste
https://phabricator.wikimedia.org/P14417

The same command on cloudvirt1025 shows nicer values:

dcaro@cloudvirt1025:~$ sudo smartctl --all /dev/sda  -d megaraid,0 | phaste
https://phabricator.wikimedia.org/P14416
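
To compare just the attributes that usually point at real disk trouble across the two pastes, here's a rough sketch (it reuses the smartctl/megaraid addressing from the commands above; attribute names vary per drive model, so anything the drive doesn't report simply won't show up):

# Rough sketch: pull a few commonly-meaningful SMART attributes via the same
# smartctl/megaraid addressing as above (needs root). smartctl uses its exit
# code as a status bitmask, so check=True is deliberately not used.
import subprocess

INTERESTING = {
    "Reallocated_Sector_Ct",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
    "UDMA_CRC_Error_Count",
}

def smart_attributes(device="/dev/sda", megaraid_id=0):
    out = subprocess.run(
        ["smartctl", "-A", device, "-d", f"megaraid,{megaraid_id}"],
        capture_output=True, text=True,
    ).stdout
    attrs = {}
    for line in out.splitlines():
        fields = line.split()
        # Attribute table rows: ID# ATTRIBUTE_NAME FLAG VALUE ... RAW_VALUE
        if len(fields) >= 10 and fields[1] in INTERESTING:
            attrs[fields[1]] = " ".join(fields[9:])
    return attrs

print(smart_attributes())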

They seem to be different controllers though:

dcaro@cloudvirt1024:~$ lspci | grep -i megaraid
18:00.0 RAID bus controller: LSI Logic / Symbios Logic MegaRAID Tri-Mode SAS3508 (rev 01)
dcaro@cloudvirt1025:~$ lspci | grep -i megaraid
18:00.0 RAID bus controller: LSI Logic / Symbios Logic MegaRAID SAS-3 3108 [Invader] (rev 02)

The MegaRAID CLI says everything is OK though:

root@cloudvirt1024:~# megacli -ShowSummary -a0 | phaste
https://phabricator.wikimedia.org/P14418

@Andrew any insights? Can/should the smartctl numbers be ignored?

The 'has been moved to another host' issue can definitely be ignored; I've spent some time cleaning up some of those but got sick of it. It has to do with inconsistent DB state with the new placement service and is harmless as far as I can tell.

I don't know if we can ignore the smartctl issues. Icinga has a 'Device not healthy -SMART-' check, and there may be some knowledge baked into it about what to care about or not. The docs don't tell us a whole lot: https://wikitech.wikimedia.org/wiki/SMART#Alerts

The backup job failed today on cloudcontrol1005 (glance images) at 2021-03-17 17:26 UTC.

It was a segfault: cloudcontrol1005 kernel: safe_timer[19993]: segfault at 7ff9c8c66728 ip 00007ff9a777ebcf sp 00007ff9a6358970 error 4 in libceph-common.so.0[7ff9a75d2000+5e8000]

Mentioned in SAL (#wikimedia-cloud) [2021-03-17T17:28:31Z] <bstorm> restarted the backup-glance-images job to clear errors in systemd T271782

I think this is not a problem anymore.