Page MenuHomePhabricator

toolforge and misc NFS share backups log errors when reading old snapshots
Open, MediumPublic

Description

[Cloud-admin] Cron <root@labstore2003> /usr/local/sbin/block_sync 10.64.37.20 tools tools-project tools-snap backup tools-project tools-project-backup 2T

2018-02-27 20:00:02,290 INFO force is enabled
2018-02-27 20:00:02,365 INFO removing tools-project-backup
2018-02-27 20:00:02,490 INFO removing tools-project-backup
2018-02-27 20:00:03,034 INFO creating tools-project-backup at 2T
2018-02-27 20:00:03,852 ERROR b'  /dev/misc/misc-snap: read failed after 0 of 4096 at 5497558073344: Input/output error\n  /dev/misc/misc-snap: read failed after 0 of 4096 at 5497558130688: Input/output error\n  /dev/misc/misc-snap: read failed after 0 of 4096 at 0: Input/output error\n  /dev/misc/misc-snap: read failed after 0 of 4096 at 4096: Input/output error\n'
2018-02-27 20:00:03,852 INFO force is enabled
2018-02-27 20:00:03,868 ERROR b'  /dev/misc/misc-snap: read failed after 0 of 4096 at 5497558073344: Input/output error\n  /dev/misc/misc-snap: read failed after 0 of 4096 at 5497558130688: Input/output error\n  /dev/misc/misc-snap: read failed after 0 of 4096 at 0: Input/output error\n  /dev/misc/misc-snap: read failed after 0 of 4096 at 4096: Input/output error\n'
2018-02-27 20:00:03,869 INFO removing tools-snap
2018-02-27 20:00:03,885 ERROR b'  /dev/misc/misc-snap: read failed after 0 of 4096 at 5497558073344: Input/output error\n  /dev/misc/misc-snap: read failed after 0 of 4096 at 5497558130688: Input/output error\n  /dev/misc/misc-snap: read failed after 0 of 4096 at 0: Input/output error\n  /dev/misc/misc-snap: read failed after 0 of 4096 at 4096: Input/output error\n'
2018-02-27 20:00:03,901 ERROR b'  /dev/misc/misc-snap: read failed after 0 of 4096 at 5497558073344: Input/output error\n  /dev/misc/misc-snap: read failed after 0 of 4096 at 5497558130688: Input/output error\n  /dev/misc/misc-snap: read failed after 0 of 4096 at 0: Input/output error\n  /dev/misc/misc-snap: read failed after 0 of 4096 at 4096: Input/output error\n'
2018-02-27 20:00:03,902 INFO removing tools-snap
2018-02-27 20:00:05,675 ERROR b'  /dev/misc/misc-snap: read failed after 0 of 4096 at 5497558073344: Input/output error\n  /dev/misc/misc-snap: read failed after 0 of 4096 at 5497558130688: Input/output error\n  /dev/misc/misc-snap: read failed after 0 of 4096 at 0: Input/output error\n  /dev/misc/misc-snap: read failed after 0 of 4096 at 4096: Input/output error\n'
2018-02-27 20:00:05,716 ERROR b'  /dev/misc/misc-snap: read failed after 0 of 4096 at 5497558073344: Input/output error\n  /dev/misc/misc-snap: read failed after 0 of 4096 at 5497558130688: Input/output error\n  /dev/misc/misc-snap: read failed after 0 of 4096 at 0: Input/output error\n  /dev/misc/misc-snap: read failed after 0 of 4096 at 4096: Input/output error\n'
2018-02-27 20:00:05,744 ERROR b'  /dev/misc/misc-snap: read failed after 0 of 4096 at 5497558073344: Input/output error\n  /dev/misc/misc-snap: read failed after 0 of 4096 at 5497558130688: Input/output error\n  /dev/misc/misc-snap: read failed after 0 of 4096 at 0: Input/output error\n  /dev/misc/misc-snap: read failed after 0 of 4096 at 4096: Input/output error\n'
2018-02-27 20:00:05,745 INFO creating tools-snap at 1T
2018-02-27 20:00:05,997 ERROR b'  /dev/misc/misc-snap: read failed after 0 of 4096 at 5497558073344: Input/output error\n  /dev/misc/misc-snap: read failed after 0 of 4096 at 5497558130688: Input/output error\n  /dev/misc/misc-snap: read failed after 0 of 4096 at 0: Input/output error\n  /dev/misc/misc-snap: read failed after 0 of 4096 at 4096: Input/output error\n'
2018-02-27 20:00:06,033 ERROR b'  /dev/misc/misc-snap: read failed after 0 of 4096 at 5497558073344: Input/output error\n  /dev/misc/misc-snap: read failed after 0 of 4096 at 5497558130688: Input/output error\n  /dev/misc/misc-snap: read failed after 0 of 4096 at 0: Input/output error\n  /dev/misc/misc-snap: read failed after 0 of 4096 at 4096: Input/output error\n'

On labstore1004 I see:

root@labstore1004:~# lvs
  /dev/misc/misc-snap: read failed after 0 of 4096 at 5497558073344: Input/output error
  /dev/misc/misc-snap: read failed after 0 of 4096 at 5497558130688: Input/output error
  /dev/misc/misc-snap: read failed after 0 of 4096 at 0: Input/output error
  /dev/misc/misc-snap: read failed after 0 of 4096 at 4096: Input/output error
  /dev/tools/tools-snap: read failed after 0 of 4096 at 8796092956672: Input/output error
  /dev/tools/tools-snap: read failed after 0 of 4096 at 8796093014016: Input/output error
  /dev/tools/tools-snap: read failed after 0 of 4096 at 0: Input/output error
  /dev/tools/tools-snap: read failed after 0 of 4096 at 4096: Input/output error
  LV            VG    Attr       LSize  Pool Origin        Data%  Meta%  Move Log Cpy%Sync Convert
  misc-project  misc  owi-aos---  5.00t
  misc-snap     misc  swi-I-s---  1.00t      misc-project  100.00
  test          misc  -wi-ao---- 10.00g
  tools-project tools owi-aos---  8.00t
  tools-snap    tools swi-I-s---  1.00t      tools-project 100.00
root@labstore1004:~#

Misc:

2018-02-28 20:00:02,835 INFO force is enabled
2018-02-28 20:00:02,893 INFO removing misc-project-backup
2018-02-28 20:00:03,001 INFO removing misc-project-backup
2018-02-28 20:00:03,932 INFO creating misc-project-backup at 2T
2018-02-28 20:00:04,783 ERROR b'  /dev/misc/misc-snap: read failed after 0 of 4096 at 5497558073344: Input/output error\n  /dev/misc/misc-snap: read failed after 0 of 4096 at 5497558130688: Input/output error\n  /dev/misc/misc-snap: read failed after 0 of 4096 at 0: Input/output error\n  /dev/misc/misc-snap: read failed after 0 of 4096 at 4096: Input/output error\n'
2018-02-28 20:00:04,784 INFO force is enabled
2018-02-28 20:00:04,813 ERROR b'  /dev/misc/misc-snap: read failed after 0 of 4096 at 5497558073344: Input/output error\n  /dev/misc/misc-snap: read failed after 0 of 4096 at 5497558130688: Input/output error\n  /dev/misc/misc-snap: read failed after 0 of 4096 at 0: Input/output error\n  /dev/misc/misc-snap: read failed after 0 of 4096 at 4096: Input/output error\n'
2018-02-28 20:00:04,813 INFO removing misc-snap
2018-02-28 20:00:04,829 ERROR b'  /dev/misc/misc-snap: read failed after 0 of 4096 at 5497558073344: Input/output error\n  /dev/misc/misc-snap: read failed after 0 of 4096 at 5497558130688: Input/output error\n  /dev/misc/misc-snap: read failed after 0 of 4096 at 0: Input/output error\n  /dev/misc/misc-snap: read failed after 0 of 4096 at 4096: Input/output error\n'
2018-02-28 20:00:04,846 ERROR b'  /dev/misc/misc-snap: read failed after 0 of 4096 at 5497558073344: Input/output error\n  /dev/misc/misc-snap: read failed after 0 of 4096 at 5497558130688: Input/output error\n  /dev/misc/misc-snap: read failed after 0 of 4096 at 0: Input/output error\n  /dev/misc/misc-snap: read failed after 0 of 4096 at 4096: Input/output error\n'
2018-02-28 20:00:04,846 INFO removing misc-snap
2018-02-28 20:00:05,434 ERROR b'  /dev/misc/misc-snap: read failed after 0 of 4096 at 5497558073344: Input/output error\n  /dev/misc/misc-snap: read failed after 0 of 4096 at 5497558130688: Input/output error\n  /dev/misc/misc-snap: read failed after 0 of 4096 at 0: Input/output error\n  /dev/misc/misc-snap: read failed after 0 of 4096 at 4096: Input/output error\n'
2018-02-28 20:00:05,497 INFO creating misc-snap at 1T

Are snapshots not being created each time or did 1T of content really fill up while sync was in progress?

Event Timeline

@madhuvishy :) I know you had a late night, can you look into this sometime today when you're feeling up to it?

chasemp renamed this task from toolforge NFS share backup failed to toolforge and misc NFS share backups failed.Mar 1 2018, 9:52 PM
chasemp updated the task description. (Show Details)

So misc and tools both failed for the same reason it seems like. That's concerning. @madhuvishy thanks for looking into it.

On labstore1004 I see:

root@labstore1004:~# lvs
  /dev/misc/misc-snap: read failed after 0 of 4096 at 5497558073344: Input/output error
  /dev/misc/misc-snap: read failed after 0 of 4096 at 5497558130688: Input/output error
  /dev/misc/misc-snap: read failed after 0 of 4096 at 0: Input/output error
  /dev/misc/misc-snap: read failed after 0 of 4096 at 4096: Input/output error
  /dev/tools/tools-snap: read failed after 0 of 4096 at 8796092956672: Input/output error
  /dev/tools/tools-snap: read failed after 0 of 4096 at 8796093014016: Input/output error
  /dev/tools/tools-snap: read failed after 0 of 4096 at 0: Input/output error
  /dev/tools/tools-snap: read failed after 0 of 4096 at 4096: Input/output error
  LV            VG    Attr       LSize  Pool Origin        Data%  Meta%  Move Log Cpy%Sync Convert
  misc-project  misc  owi-aos---  5.00t
  misc-snap     misc  swi-I-s---  1.00t      misc-project  100.00
  test          misc  -wi-ao---- 10.00g
  tools-project tools owi-aos---  8.00t
  tools-snap    tools swi-I-s---  1.00t      tools-project 100.00
root@labstore1004:~#

Are snapshots not being created each time or did 1T of content really fill up while sync was in progress?

This is a red herring for this specific problem, since backups are sourced from labstore1005. These snapshots on labstore1004 are probably around from when 1004 was the secondary host and filled up. I cleaned this up with

madhuvishy@labstore1004:~$ sudo /usr/local/sbin/snapshot-manager remove tools-snap tools/tools-project
madhuvishy@labstore1004:~$ sudo /usr/local/sbin/snapshot-manager remove misc-snap misc/misc-project

Now

madhuvishy@labstore1004:~$ sudo lvs
  LV            VG    Attr       LSize  Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  misc-project  misc  -wi-ao----  5.00t
  test          misc  -wi-ao---- 10.00g
  tools-project tools -wi-ao----  8.00t

Despite the error lines in both cron logs for reading misc-snap, both backups seem to have completed successfully. Pasting everything but the Error lines from the above logs

2018-02-27 20:00:02,290 INFO force is enabled
2018-02-27 20:00:02,365 INFO removing tools-project-backup
2018-02-27 20:00:02,490 INFO removing tools-project-backup
2018-02-27 20:00:03,034 INFO creating tools-project-backup at 2T
2018-02-27 20:00:03,852 INFO force is enabled
2018-02-27 20:00:03,869 INFO removing tools-snap
2018-02-27 20:00:03,902 INFO removing tools-snap
2018-02-27 20:00:05,745 INFO creating tools-snap at 1T

Which is the normal course of operations. I was able to compare the copies in codfw and can confirm they are uptodate.

So the only mystery is why the read errors on misc-snap occur, that I still don't have an answer for.

madhuvishy renamed this task from toolforge and misc NFS share backups failed to toolforge and misc NFS share backups log errors when reading old snapshots.Mar 1 2018, 10:57 PM
madhuvishy lowered the priority of this task from High to Medium.

Our current theory is that running when snapshot-manager runs lvs to check if a snapshot exists, it throws these read errors, potentially because the older snapshots are full or unreadable for some reason. But they will get deleted anyway so these errors are red herrings and don't affect the backups. We can either fix logging these errors, or remove the snapshots at the source server after the backup is done to avoid this problem.

bd808 added a subscriber: madhuvishy.
Bstorm subscribed.

Curious if my effort to change to a systemd timer eliminated alerts around this condition (which would be a bad thing). The systemd version only alerts if the process exits with non-zero status.

Bstorm moved this task from Doing to Inbox on the cloud-services-team (Kanban) board.