Status | Subtype | Assigned | Task
---|---|---|---
Resolved | | dcaro | T271782 cloudvirt1024 reports backup job failures
Resolved | | Jclark-ctr | T275215 Replace sata cables for cloudvirt1024
Event Timeline
Backy2's last logs:
Jan 12 05:01:17 cloudvirt1024 wmcs-backup-instances[50951]: [37B blob data]
Jan 12 05:01:27 cloudvirt1024 wmcs-backup-instances[50951]: Traceback (most recent call last):
Jan 12 05:01:27 cloudvirt1024 wmcs-backup-instances[50951]:   File "/usr/local/sbin/wmcs-backup-instances", line 53, in <module>
Jan 12 05:01:27 cloudvirt1024 wmcs-backup-instances[50951]:     config["ceph_pool"], server.id, config["live_for_days"]
Jan 12 05:01:27 cloudvirt1024 wmcs-backup-instances[50951]:   File "/usr/lib/python3/dist-packages/rbd2backy2.py", line 560, in backup_vm
Jan 12 05:01:27 cloudvirt1024 wmcs-backup-instances[50951]:     backup_volume(pool, vm_disk, live_for_days)
Jan 12 05:01:27 cloudvirt1024 wmcs-backup-instances[50951]:   File "/usr/lib/python3/dist-packages/rbd2backy2.py", line 545, in backup_volume
Jan 12 05:01:27 cloudvirt1024 wmcs-backup-instances[50951]:     pool, volume, last_snap, backy_snap_version_uid, expire
Jan 12 05:01:27 cloudvirt1024 wmcs-backup-instances[50951]:   File "/usr/lib/python3/dist-packages/rbd2backy2.py", line 502, in _differential_backup
Jan 12 05:01:27 cloudvirt1024 wmcs-backup-instances[50951]:     noop=noop,
Jan 12 05:01:27 cloudvirt1024 wmcs-backup-instances[50951]:   File "/usr/lib/python3/dist-packages/rbd2backy2.py", line 31, in run_command
Jan 12 05:01:27 cloudvirt1024 wmcs-backup-instances[50951]:     output = subprocess.check_output(args).decode("utf8")
Jan 12 05:01:27 cloudvirt1024 wmcs-backup-instances[50951]:   File "/usr/lib/python3.7/subprocess.py", line 395, in check_output
Jan 12 05:01:27 cloudvirt1024 wmcs-backup-instances[50951]:     **kwargs).stdout
Jan 12 05:01:27 cloudvirt1024 wmcs-backup-instances[50951]:   File "/usr/lib/python3.7/subprocess.py", line 487, in run
Jan 12 05:01:27 cloudvirt1024 wmcs-backup-instances[50951]:     output=stdout, stderr=stderr)
Jan 12 05:01:27 cloudvirt1024 wmcs-backup-instances[50951]: subprocess.CalledProcessError: Command '['/usr/bin/backy2', 'backup', '-s', '2021-01-12T05:01:16', '-r', '/tmp/tmpb5jf95pl', '-f', b'771d91c4-53ca-11eb-8163-b02628295df0', 'rbd://eqiad1-compute/594b4b48-0bbb-4296-bb52-ef7861036ef6_disk@2021-01-12T05:01:16', '594b4b48-0bbb-4296-bb52-ef7861036ef6_disk', '-e', '2021-01-17', '-t', 'differential_backup']' died with <Signals.SIGSEGV: 11>.
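As a note on the mechanics: the backy2 child process is what actually dies, and Python's subprocess turns a signal death into a CalledProcessError with a negative returncode. A minimal sketch (not from the task) reproducing the same exit status from a shell:
dcaro@cloudvirt1024:~$ bash -c 'kill -SEGV $$'; echo "exit: $?"
exit: 139
139 is 128+11; subprocess reports it as returncode -11, hence "died with <Signals.SIGSEGV: 11>".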
We might be hitting a ceph bug:
[Tue Jan 12 05:00:43 2021] safe_timer[61962]: segfault at 7fbb79a83eb8 ip 00007fbb5f796cef sp 00007fbb5eb80970 error 4 in libceph-common.so.0[7fbb5f5ef000+5d3000]
[Tue Jan 12 05:00:43 2021] Code: 1f 40 00 4d 89 f4 eb bb 0f 1f 00 41 57 41 56 4c 8d 77 08 41 55 41 54 55 53 48 83 ec 08 48 8b 5f 10 48 85 db 0f 84 d9 00 00 00 <4c> 8b 7e 08 4c 8b 2e 4d 89 f4 eb 11 0f 1f 44 00 00 49 89 dc 48 8b
and it seems to have happened at least once before:
[Thu Dec 31 03:54:29 2020] safe_timer[21682]: segfault at 7f5e34e82eb8 ip 00007f5e13796cef sp 00007f5e11b7e970 error 4 in libceph-common.so.0[7f5e135ef000+5d3000]
[Thu Dec 31 03:54:29 2020] Code: 1f 40 00 4d 89 f4 eb bb 0f 1f 00 41 57 41 56 4c 8d 77 08 41 55 41 54 55 53 48 83 ec 08 48 8b 5f 10 48 85 db 0f 84 d9 00 00 00 <4c> 8b 7e 08 4c 8b 2e 4d 89 f4 eb 11 0f 1f 44 00 00 49 89 dc 48 8b
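If we want to pin down where in libceph-common it crashes, the faulting offset is the instruction pointer minus the module base from the segfault line, which addr2line can resolve to a function. A sketch, assuming the matching ceph debug symbols are installed and the usual Debian library path (both assumptions):
dcaro@cloudvirt1024:~$ printf '0x%x\n' $(( 0x7fbb5f796cef - 0x7fbb5f5ef000 ))
0x1a7cef
dcaro@cloudvirt1024:~$ addr2line -f -C -e /usr/lib/x86_64-linux-gnu/ceph/libceph-common.so.0 0x1a7cef
For what it's worth, both occurrences fault at the same module offset (0x1a7cef), which points at a deterministic bug rather than random corruption.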
Maybe related (though it seems to have been fixed either by using Nautilus and/or by using ceph config instead of the config file): https://tracker.ceph.com/issues/24835
This might get fixed with T272303, but I'm not sure; after the upgrade I will monitor to see if more error logs show up.
This has happened again:
https://phabricator.wikimedia.org/P14413
Feb 19 05:07:31 cloudvirt1024 wmcs-backup[56558]: subprocess.CalledProcessError: Command '['/usr/bin/backy2', 'backup', '--snapshot-name', '2021-02-19T05:07:16_cloudvirt1024', '--rbd', '/tmp/tmpbgykl78e', '--from-version', '547f81fe-71a7-11eb-bee1-b02628295df0', 'rbd://eqiad1-compute/65343972-d77a-4c93-a69f-6a0ff15d082c_disk@2021-02-19T05:07:16_cloudvirt1024', '65343972-d77a-4c93-a69f-6a0ff15d082c_disk', '--expire', '2021-02-24 05:07:16', '--tag', 'differential_backup']' died with <Signals.SIGSEGV: 11>.
looking
Hmm... there are some suspicious logs before and after the error (at least this time):
https://phabricator.wikimedia.org/P14414
Feb 19 05:07:27 cloudvirt1024 sudo[6922]: nagios : TTY=unknown ; PWD=/ ; USER=root ; COMMAND=/usr/local/lib/nagios/plugins/check_raid megacli
... Crash happens here ...
Feb 19 05:07:36 cloudvirt1024 nova-compute[62835]: 2021-02-19 05:07:36.410 62835 WARNING nova.compute.resource_tracker [req-1492a22f-6b0d-44b0-b15f-6a174218312a - - - - -] Instance 98b912c0-8cf7-45f2-8728-942fdcdc803b has been moved to another host cloudvirt1022(cloudvirt1022.eqiad.wmnet). There are allocations remaining against the source host that might need to be removed: {'resources': {'VCPU': 1, 'MEMORY_MB': 2048, 'DISK_GB': 80}}.
...
Feb 19 05:08:04 cloudvirt1024 systemd[7807]: gpgconf: running /usr/bin/gpg-agent failed (exitcode=2): General error
Feb 19 05:08:04 cloudvirt1024 systemd[7807]: gpgconf: fatal error (exit status 1)
...
Feb 19 05:08:32 cloudvirt1024 sudo[8264]: nagios : TTY=unknown ; PWD=/ ; USER=root ; COMMAND=/usr/local/lib/nagios/plugins/check_raid megacli
...
Feb 19 05:09:16 cloudvirt1024 smartd[1190]: Device: /dev/bus/0 [megaraid_disk_08] [SAT], failed to read SMART Attribute Data
Feb 19 05:09:16 cloudvirt1024 smartd[1190]: Device: /dev/bus/0 [megaraid_disk_08] [SAT], Read SMART Self Test Log Failed
...
The nagios check happens rather frequently:
dcaro@cloudvirt1024:~$ sudo journalctl -S '1 day ago' | grep 'check_raid' | wc
    472    7080   68375
Though the gpg error happens only around twice an hour:
dcaro@cloudvirt1024:~$ sudo journalctl -S '1 day ago' | grep '/usr/bin/gpg-agent failed' | wc
     50     600    5842
and always right after debmonitor logs in:
dcaro@cloudvirt1024:~$ sudo journalctl -S '1 day ago' | grep 'su.*session opened for user debmonitor by' | wc
     51     663    5658
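A quick way to double-check that correlation (just a sketch) is to print the journal lines immediately preceding each gpg-agent failure:
dcaro@cloudvirt1024:~$ sudo journalctl -S '1 day ago' | grep -B 2 '/usr/bin/gpg-agent failed' | head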
Seems like an unrelated issue.
The log about the moved instance seems to happen quite frequently, >1k times/day:
dcaro@cloudvirt1024:~$ sudo journalctl -S '1 day ago' | grep '98b912c0-8cf7-45f2-8728-942fdcdc803b has been moved' | wc
   1416   65136  623040
So this is probably unrelated (though there might be some stuck data somewhere).
The one that bothers me most is the error on smartd:
dcaro@cloudvirt1024:~$ sudo journalctl -S '1 day ago' | grep 'smartd.*failed to read SMART' | wc
     48     720    6000
It seems to happen with the same frequency as the gpg error though (probably a red herring?).
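One way to check whether they really co-occur (a sketch; the awk field positions assume journalctl's default short output) is to bucket each message by hour and compare the counts:
dcaro@cloudvirt1024:~$ sudo journalctl -S '1 day ago' | grep 'failed to read SMART Attribute' | awk '{print $1, $2, substr($3,1,2)":00"}' | uniq -c
dcaro@cloudvirt1024:~$ sudo journalctl -S '1 day ago' | grep '/usr/bin/gpg-agent failed' | awk '{print $1, $2, substr($3,1,2)":00"}' | uniq -c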
This does not look good, though I'm not sure if I should trust those values:
dcaro@cloudvirt1024:~$ sudo smartctl --all /dev/sda -d megaraid,0 | phaste
https://phabricator.wikimedia.org/P14417
A similar command on cloudvirt1025 shows nicer values...
dcaro@cloudvirt1025:~$ sudo smartctl --all /dev/sda -d megaraid,0 | phaste
https://phabricator.wikimedia.org/P14416
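For a faster side-by-side comparison, the attribute table alone can be filtered for the usual warning signs (a sketch; attribute names vary by drive vendor):
dcaro@cloudvirt1024:~$ sudo smartctl -A /dev/sda -d megaraid,0 | grep -iE 'reallocat|pending|uncorrect'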
They seem to be different controllers though:
dcaro@cloudvirt1024:~$ lspci | grep -i megaraid
18:00.0 RAID bus controller: LSI Logic / Symbios Logic MegaRAID Tri-Mode SAS3508 (rev 01)
dcaro@cloudvirt1025:~$ lspci | grep -i megaraid
18:00.0 RAID bus controller: LSI Logic / Symbios Logic MegaRAID SAS-3 3108 [Invader] (rev 02)
The MegaRAID CLI says everything is OK though:
root@cloudvirt1024:~# megacli -ShowSummary -a0 | phaste
https://phabricator.wikimedia.org/P14418
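The summary can mask per-disk counters; a sketch for pulling those directly (the grep patterns match field names from megacli's physical-disk listing):
root@cloudvirt1024:~# megacli -PDList -a0 | grep -E 'Slot Number|Media Error|Other Error|Predictive Failure'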
The 'has been moved to another host' issue can definitely be ignored. I've spent some time cleaning some of those up but got sick of it; it comes from inconsistent DB state with the new placement service and is harmless as far as I can tell.
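For reference, a sketch of that cleanup, assuming the osc-placement plugin and admin credentials on a host with OpenStack client access (the UUID is the stale instance from the warning above):
$ openstack resource provider allocation show 98b912c0-8cf7-45f2-8728-942fdcdc803b
$ openstack resource provider allocation delete 98b912c0-8cf7-45f2-8728-942fdcdc803b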
I don't know if we can ignore the smartctl issues. Icinga has a 'Device not healthy -SMART-' check, and there may be some knowledge in that check about what to care about and what not to. The docs don't tell us a whole lot: https://wikitech.wikimedia.org/wiki/SMART#Alerts
The backup job failed today on cloudcontrol1005 (glance images) at 2021-03-17 17:26 UTC.
It was a segfault:
cloudcontrol1005 kernel: safe_timer[19993]: segfault at 7ff9c8c66728 ip 00007ff9a777ebcf sp 00007ff9a6358970 error 4 in libceph-common.so.0[7ff9a75d2000+5e8000]
Mentioned in SAL (#wikimedia-cloud) [2021-03-17T17:28:31Z] <bstorm> restarted the backup-glance-images job to clear errors in systemd T271782