
revise/fix labstore replicate backup jobs
Closed, ResolvedPublic

Description

The tools backup failed again last night. I'm not sure why, possibly related to high overnight load. I tried to restart it manually and labstore1001 surged in load and became barely responsive. At the best of times we hover around a load of 10 or greater on an 8-core box. But whether we can handle the backup jobs under our normal overload may be secondary; there are several pieces of the replicate setup that need attention:

  • Some rsync options seem nonfunctional or errant, e.g. "--filter=._/etc/replication-rsync.conf"
  • The tools data at rest on the remote end is consistently larger than the source (why?)
  • The job is ionice'd and working now that I have switched to CFQ for disk IO scheduling, but it is not nice'd otherwise and causes load issues of its own
  • We keep snapshots locally with loose cleanup logic
  • How much history we keep on labstore1002 is not well understood. We keep some snapshots from tools, though at the moment the selection seems random:
tools20160209020010 backup swi-a-s---  1.00t      tools  65.86
tools20160219020015 backup swi-a-s---  1.00t      tools  7.03
tools20160219211007 backup swi-a-s---  1.00t      tools  0.00

We seem to keep only one day's history for non-tools backups.

  • We continually see issues during our backup process where the snapshotting/backup processes cause high load (sometimes really high load) and affect NFS operations
  • Monitoring captures that the backup job ran, not whether it actually succeeded
  • Backups are staggered by only an hour and they often end up running concurrently
  • Snapshots filling up, on the remote labstore2001 or the local labstore1001, can kill the backup jobs

Event Timeline

Various bandaids:

  1. Limit CPU and IO usage via systemd (https://www.freedesktop.org/software/systemd/man/systemd.resource-control.html) -- see the sketch after this list
  2. Change our snapshot dropper to keep only one snapshot at all times. The snapshot is primarily used as an rsync source and does not need to exist past the end of a successful rsync, so we can drop it at the end of a successful rsync (and if the rsync is unsuccessful, we do not care anyway -- so drop it no matter what?)
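A minimal sketch of the systemd limiting (the unit name replicate-tools.service and the values are assumptions; the directives are standard systemd resource-control/exec options):

# Hypothetical drop-in limiting CPU and block IO for the replication unit.
mkdir -p /etc/systemd/system/replicate-tools.service.d
cat > /etc/systemd/system/replicate-tools.service.d/limits.conf <<'EOF'
[Service]
CPUQuota=25%
BlockIOWeight=100
Nice=19
IOSchedulingClass=idle
EOF
systemctl daemon-reload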

Backups have been failing again and I had a few moments to look into things (I am also merging in a task Daniel made -- thanks Daniel -- as however we address this needs to be systemic). This is failing really often, and I believe even when it seems to be succeeding it is sometimes failing. For instance, the check will turn green in Icinga if I even start a backup, regardless of outcome. Eventually it will show failure again, but it's not great.

It's difficult to tell which previous jobs have succeeded. There are some random snapshots left on labstore2001:

tools20160209020010 backup swi-I-s---   1.00t      tools  100.00
tools20160219020015 backup swi-a-s---   1.00t      tools  49.15
tools20160219211007 backup swi-a-s---   1.00t      tools  44.04

When I look at labstore1001, a significant proportion of rsync jobs seem to carry over into the following day, so I'm not sure when certain jobs have finished, or if they have at all:

daemon.log.1:Feb 22 04:00:10 labstore1001 storage-replicate[10019]: Skipping replication; already in progress since 2016-01-21% 04:00:05
daemon.log.1:Feb 23 04:00:05 labstore1001 storage-replicate[17841]: Skipping replication; already in progress since 2016-01-21% 04:00:05
daemon.log.1:Feb 24 04:00:06 labstore1001 storage-replicate[23695]: Skipping replication; already in progress since 2016-01-21% 04:00:05
daemon.log.1:Feb 25 04:00:11 labstore1001 storage-replicate[6713]: Skipping replication; already in progress since 2016-01-21% 04:00:05
daemon.log.1:Feb 26 04:00:10 labstore1001 storage-replicate[14911]: Skipping replication; already in progress since 2016-01-21% 04:00:05

I believe Feb 26 may be the last day these actually succeeded.

The current failures are caused by an informational part of storage-replicate failing:

syslog.1:Mar 2 04:00:08 labstore1001 storage-replicate[5463]: out = ctx.run('/sbin/lvs', '--noheadings', '--options', 'lv_attr', '/dev/mapper/%s-%s' % (vg, lv))

I don't understand yet why that is.

The relevant snapshots on labstore1001 are definitely stale:

maps20160121040005   labstore swi-a-s--k   1.00t      maps   17.80
others20160224030004 labstore swi-a-s--k   1.00t      others 20.17
tools20160226020017  labstore swi-a-s--k   1.00t      tools  56.22

I'm currently looking at why things have been consistently failing and at the copy times involved, and at how we can leave backups in a more consistent state when they do fail. One difficulty I am having is that throughput seems to cap out at around 53MiB/s between the labstores, making replication tests rather lengthy.

chasemp renamed this task from revise replicate backup jobs to revise/fix labstore replicate backup jobs.Mar 2 2016, 11:23 PM

Just my 2 cents from the merged task: that output line "Last run result for unit replicate-tools was exit-code " really looks as if there is just a typo where it should be 'was $exit-code' or similar. Outputting the actual exit code would be a good first step, separate from working out why it's failing.

On Friday I found some old stale snapshots on labstore2001 from past failed backups. I cleaned things up, and the replicate jobs seem to have been running fine over the weekend. At the same time I started a dd replication of the tools snapshot tools-manual-03042016, nice'd, to try to ensure we have at least a semi-recent copy of the data either way.

Backups were failing again last night, and I'm pretty sure it was related to a full snapshot left behind on labstore2001 (which was one of the previous causes). I removed the offending snapshot.

I have been working through this a bit and trying out a few approaches for feasibility. I think I would like to transition away from using rsync as our backup method in this case; it is slow and puts some strain on labstore1001, which is frankly mostly noticeable because the host is so overloaded already. From what I can tell and what I've read, rsync is nonoptimal for large datasets (something above roughly 100G) and for transferring deltas; it seems to spend a lot of time seeking on disk. Another issue with our current approach is that we would like to ignore certain types of files for transfer, since we keep logs and other ephemeral data on NFS at the moment, and I believe that exclusion is not working as intended. I'm also more concerned with CPU/memory resources than with disk for our current model, so I think block replication is a better fit if it can work reliably.

At the moment a job runs overnight that uses snapshots locally and remotely and attempts to transfer deltas, with loose cleanup logic for the snapshots. We have seen that performance is impacted as snapshots pile up, and several times in the last few months snapshots being left behind and/or filling up have stopped backups from running reliably.

Another note is that we have pretty vague language on wikitech for the mechanics of our backups.

Here we say scratch is not backed up, and for /data/project we say it is backed up, but there are no details. Here we reference backup mechanisms that I believe are no longer current for Tool Labs specific things. Another note is that we have continually instructed users not to store their primary code on tools or NFS or labs, and to use git/revision control. This narrative will only be strengthened as we move to the container model.

Valhallasw helped me dig up old text on what the general user outline has been in the past:

"The basic rule is: there is a lot of redundancy, but no backups of labs projects beyond the filesystem's time travel feature for short-term disaster recovery. "

It seems there was an older method for backups, called "time machine", centered around allowing users to self-fulfill their backup needs, but it is not working or has not worked in some time. We do however have users asking for the correct way to back up their files, or to make sure a copy of their files will exist beyond a specific instance, and we do want to provide a solution for this that is not NFS. I hope to address this in the future by shifting the burden of our primary NFS traffic and allocating space for this purpose, but that is ancillary to this task.

My plan is to end up in basically this situation:

  • Weekly backups of NFS data
  • Multi-week historical copies as space allows
  • Low performance impacting workload for ongoing backups (our throughput on the link seems to be variable and current backups can run for long periods)
  • Ability to verify the integrity of source vs destination for backups
  • Idempotent for recovery on backup failure or one-off runs pre-maintenance
  • Not based around long-surviving snapshots, or snapshots outside of the running backup window, on user-servicing hosts
  • Non-overlapping backup windows for different block devices where possible
  • Clear language on wiki for what is backed up and what it means for users and DR

For initial block replication between devices I have been using simple dd over ssh, which I think is a well understood and straightforward method, but it takes a long time at our sizes. So once a device is created on the production labstores and replicated to the backup host, my intention is to replicate only changed blocks. I have seen two approaches to this problem:

  • Hash and compare by block
  • Use existing metadata for snapshots to transfer deltas

In the first category I have seen three solutions:

  • Zumastor ddsnap https://lkml.org/lkml/2008/4/30/94 (seems mostly defunct?)
  • bdsync, a simple hash-and-compare written in C: https://github.com/TargetHolding/bdsync
  • blocksync.py, a Python implementation of hash-and-compare: http://www.bouncybouncy.net/programs/blocksync.py

In the second I have only seen one:

  • https://github.com/mpalmer/lvmsync

The advantage of lvmsync is that it doesn't spend time going over the entire block device; it actually never reads the snapshot at all. It looks at the changes since the snapshot took place, which LVM already tracks. This solution seems mostly tailored to minimizing downtime while moving VMs between virt hosts. As I was experimenting, my impression was that it's not hard to end up in a weird situation where a backup has failed and the remote end is no longer in sync with the origin snapshot's source volume. This is tricky because the snapshot metadata is all lvmsync really cares about, and that is based on the assumption of fixed points in time for the source on both ends. This is especially not ideal given our long recovery times and data set size.

The rest are basically: hash every block, compare every block on both sides, send the blocks that don't match. I did run through Python implementations of this logic and it works fine, but it wasn't as performant as I would like. There isn't an advantage to using an interpreted language in this case, and maximizing raw throughput really is a concern. ddsnap seems to be mostly legacy, which leaves bdsync as the contender.
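For illustration, the hash-and-compare approach boils down to something like this shell sketch (device paths and chunk size are assumptions, and the per-chunk ssh round trip is exactly the overhead a real tool like bdsync avoids by pipelining hashes over one connection):

#!/bin/sh
# Naive hash-and-compare: hash each 4MiB chunk on both ends and re-copy only
# the chunks whose hashes differ. Illustrative only, not how bdsync is invoked.
chunk=$((4 * 1024 * 1024))
src=/dev/labstore/tools-manual-03102016      # source snapshot on labstore1001
dst=/dev/backup/tools-04032016               # local backup device
remote=root@labstore1001.eqiad.wmnet

size=$(blockdev --getsize64 "$dst")
chunks=$(( (size + chunk - 1) / chunk ))

i=0
while [ "$i" -lt "$chunks" ]; do
    rhash=$(ssh "$remote" "dd if=$src bs=$chunk skip=$i count=1 status=none | md5sum" | cut -d' ' -f1)
    lhash=$(dd if="$dst" bs=$chunk skip=$i count=1 status=none | md5sum | cut -d' ' -f1)
    if [ "$rhash" != "$lhash" ]; then
        ssh "$remote" "dd if=$src bs=$chunk skip=$i count=1 status=none" \
            | dd of="$dst" bs=$chunk seek=$i count=1 conv=notrunc status=none
    fi
    i=$((i + 1))
done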

bdsync is pretty simple. We can do our dual-end diff and transfer, and either immediately patch our device or generate a binary file for patching later. The patch-file approach has the same issues as lvmsync in terms of assumptions about the underlying block device state, but it is a nice option.
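As a sketch of the patch-file variant (the /srv/backups path is hypothetical; the invocation follows the same pattern as the script below), the diff is written out now and applied later, and if the local device changes in between, the patch no longer matches -- which is exactly the state assumption mentioned above:

# Generate a binary diff carrying the remote (source) data, without applying it.
/root/bdsync/bdsync --remdata \
    "ssh -i /root/.ssh/id_labstore root@labstore1001.eqiad.wmnet /root/bdsync/bdsync --server" \
    /dev/backup/tools-04032016 /dev/labstore/tools-manual-03102016 \
    > /srv/backups/tools-04032016.bdsync

# Later: apply the stored diff to bring the local backup device up to date.
/root/bdsync/bdsync --patch=/dev/backup/tools-04032016 < /srv/backups/tools-04032016.bdsync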

Procedure at the moment is basically:

  1. Create a new block device on labstore1001
  2. Snapshot this device for a consistent source
  3. Create an equal size block device on labstore2001 for population
  4. Use DD to replicate a full copy like:

# Full initial copy: pv -L40m caps the transfer rate and nice -19 runs the
# ssh transfer at the lowest CPU priority.
dd if=/dev/labstore/tools-manual-03042016 | pv -L40m | nice -19 ssh -c des -i /root/.ssh/id_labstore root@labstore2001.codfw.wmnet dd of=/dev/backup/tools-04032016

  5. Clean up snapshots
  6. Snapshot the source and replicate the diff at an interval, and patch the device to be consistent (I run this on the backup destination itself so it does as much of the hard work as possible):
#!/bin/sh
# Incremental sync: diff the remote source snapshot against the local backup
# device and patch the local device in place.
remotedev=/dev/labstore/tools-manual-03102016   # source snapshot on labstore1001
remotehost=root@labstore1001.eqiad.wmnet
localdev=/dev/backup/tools-04032016             # backup device on this host

blocksize=8192

# --remdata makes the diff carry the remote (source) data so it can be applied
# locally; the diff stream is piped straight into bdsync --patch.
/root/bdsync/bdsync --blocksize=$blocksize \
    --remdata "ssh -i /root/.ssh/id_labstore $remotehost 'nice -19 /root/bdsync/bdsync --server'" \
    $localdev \
    $remotedev | pv | sudo /root/bdsync/bdsync --patch=$localdev

And in this way we end up with 2 identical block devices.

I'm working out the best way to store historical copies on the remote end. Since we are dealing in block devices, it is simple to use dd to pipe to a compressed file, but my testing so far says it is a long operation for large devices. It is possible some form of delta using bdsync is a good idea too. I have tried a few different compression approaches and so far nothing is much better than anything else, but I think that using https://packages.debian.org/jessie/zerofree with dd and compression I can come up with something that is a good balance between space conservation and recovery time. I haven't entirely worked this out yet. Using dd to create .img files also means kpartx can map them so they can be mounted as if they were normal devices.
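A rough sketch of the kind of archival run I have in mind (the output path is hypothetical, and zerofree only works on ext filesystems that are unmounted or mounted read-only):

# Zero out unused filesystem blocks on the backup device so they compress away,
# then stream the whole device through gzip into a dated image file.
zerofree /dev/backup/tools-04032016
dd if=/dev/backup/tools-04032016 bs=1M | nice -19 gzip -1 > /srv/backups/tools-04032016.img.gz

# Recovery: decompress back onto a block device (or to an .img file that
# kpartx can map for mounting).
# gzip -d < /srv/backups/tools-04032016.img.gz | dd of=/dev/backup/tools-restore bs=1M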

Along with this I would like some mechanism for block device comparison, basically something that can hash the entire device. I have looked at md5sum and md5deep (which allows breaking up large devices into comparable chunks), and openssl with md4, but for now I'm most hopeful about https://github.com/Cyan4973/xxHash, which seems to be really, really fast and efficient and could mean much faster comparison times.
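A minimal verification sketch, assuming the xxhsum binary from that repo is built and present on both hosts and hashes stdin the way md5sum does (any whole-device hash could be substituted):

# Hash the source snapshot and the backup device end to end and compare.
src_sum=$(ssh -i /root/.ssh/id_labstore root@labstore1001.eqiad.wmnet \
    "dd if=/dev/labstore/tools-manual-03102016 bs=1M status=none | xxhsum" | cut -d' ' -f1)
dst_sum=$(dd if=/dev/backup/tools-04032016 bs=1M status=none | xxhsum | cut -d' ' -f1)

if [ "$src_sum" = "$dst_sum" ]; then
    echo "backup matches source snapshot"
else
    echo "MISMATCH between source snapshot and backup" >&2
fi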

I hope we can replicate as block devices, incur the throughput cost of only the changed blocks while getting the benefit of an extremely low overhead operation, and use dd and block device compression to keep a good recoverable history on the remote end, with durable integrity checking for verification and an idempotent operation that any operator can kick off to produce new transfers in a consistent state.

Thank you for the in-depth investigation and awesome write-up, @chasemp!

This all sounds great, but I want to challenge one assumption:

Multi-week historical copies as space allows

I don't really think we need to do this. The whole 'time travel' concept never actually worked - it was using lvm thin snapshots, which never successfully made it to production. These backups should be purely for DR, and I think it is ok to even not offer manual recovery of individual files for users - they should be using git or something similar.

Outside of that, +1 to dropping rsync.

re:

Multi-week historical copies as space allows

I'm open to whatever makes sense here, but I don't think I explained the purpose clearly above. This is not primarily intended as any kind of user-reclaimable history in the time machine sense, and I believe we are going to continue the narrative of not providing user backups via this mechanism at all; I'm not designing it around that kind of ongoing process. In the near future I hope to have a more reasonable, more tenant-self-managed option on that front. The real purpose is strictly operational: I propose we need at least one offline copy of the data, if not as many as we can store cheaply. Most of our near-term changes are logs and other ephemera (though not all), but in an end-to-end backup process I am uncomfortable with all copies of the data being "online", i.e. part of the live ongoing process. Any number of things could make us lose data that is still being operated on, and I don't consider snapshots to be offline. They are also nondeterministic with copy-on-write, since once a snapshot fills up it is junk. So in this instance I want to have:

  • Live copy on the running machines (ideally with a secondary live copy in the same environment via DRBD or some other replication mechanism)
  • Online backup in the secondary DC that is used for mirroring
  • Offline backup in the secondary DC that is the fallback for recovering from failures within the backup process itself. How many of these we keep, and how that works, is TBD at this moment. This copy would be older and more stale, but it is the safety net.

A few notes on where this is at, for Madhu to take over. We have been testing backup schemes and have settled for now on something like what is described in https://phabricator.wikimedia.org/T127567#2113829. We have had even more issues with rsync choking on large files while trying to transfer the initial copies to the new cluster.

  • bdsync has been packaged and is available on our jessie nodes (I believe all labstores are jessie -- or should be)
  • we can snapshot the logical volume underlying a DRBD array and use that snapshot to back up from the secondary in a DRBD resource pair (see the sketch after this list)
  • we had previously discussed weekly offsite backups, maybe Mondays and Fridays? It depends on what is practical replication-wise.
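A rough sketch of that flow, run from the backup host (hostnames, VG, and LV names are hypothetical; the point is that the snapshot lives only for the duration of the run):

# Snapshot the backing LV on the DRBD secondary, pull the diff with bdsync,
# then drop the snapshot so nothing long-lived is left behind.
secondary=root@labstore1005.eqiad.wmnet
ssh -i /root/.ssh/id_labstore $secondary \
    "lvcreate --snapshot --size 1T --name tools-backup-snap /dev/labstore/tools"
/root/bdsync/bdsync --remdata \
    "ssh -i /root/.ssh/id_labstore $secondary /root/bdsync/bdsync --server" \
    /dev/backup/tools /dev/labstore/tools-backup-snap \
    | /root/bdsync/bdsync --patch=/dev/backup/tools
ssh -i /root/.ssh/id_labstore $secondary \
    "lvremove -f /dev/labstore/tools-backup-snap"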

Change 321689 had a related patch set uploaded (by Rush):
labstore1001: clean out stale and bad backup jobs

https://gerrit.wikimedia.org/r/321689

Change 321689 merged by Rush:
labstore1001: clean up unused jobs and legacy

https://gerrit.wikimedia.org/r/321689

Some monitoring improvements are ongoing in T144633: Set up monitoring for secondary labstore HA cluster, but generally this is done.