
Evaluate possible solutions to back up Analytics Hadoop's HDFS data
Open, Low, Public

Description

Hi Data Persistence team :)

The Analytics team manages a big Hadoop HDFS cluster, capable of storing ~1PB of data (which gets replicated 3 times across nodes, for a grand total of max 3PB). We recently upgraded HDFS from 2.6 to 2.10.1, and before that we created a temporary backup cluster composed of new Hadoop worker nodes to hold data that we couldn't afford to lose (for example, datasets that cannot be regenerated from others, like Pageview, etc.).

The task that we used to establish what data was absolutely needed in the backup is: T260409

We ended up using 400TB of space in the backup cluster (a total of ~800TB with replication 2, on 18 nodes). After the successful upgrade, we moved all the nodes used for the backup cluster to the main cluster (as originally planned), so we don't have any backup solution in place at the moment.

The main issue now is that we'll need to upgrade to newer versions during the next fiscal year(s), and having a permanent backup solution in place would also help prevent accidental data drops or corruption during our day-to-day work (we have fences to prevent PEBCAKs, but not for all use cases). The data to back up would be a mixture of PII/data-sensitive and non-PII data.

I don't have a clear idea of the annual data growth for those 400TB; I added Joseph to the task to follow up with a more reliable source of truth than me :D

Event Timeline

LSobanski added a subscriber: LSobanski.

@elukey thanks for reaching out, a few questions:

  • Is the expectation to do backups continuously or at fixed points in time?
  • Is the cluster in eqiad only and would you expect the backups in the same location (faster) or somewhere else (meteorite-proof)?
  • What would be the estimated retention period and would it differ between data sets?
  • What would be a reasonable ETA for recovery, both full and for key data sets?
  • Does the data have to be exported in any way and if not, how can it be accessed (directly via the file system, API, local command on the data host, remote command)?
  • Is the data compressible?
  • Last but not least, what's the rough timeline for the next upgrade?

Adding my thoughts about it, then my team will be able to comment :)

  • Is the expectation to do backups continuously or at fixed points in time?

Ideally we'd need to periodically get a backup of the new data that comes in, so not only a one-off. If this is challenging, it would also be OK to have a way to store big one-off backups to support cluster upgrades.
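
To make the "periodic backup of new data" idea a bit more concrete, here is a minimal sketch of what such a run could look like with DistCp, which ships with Hadoop and only copies missing or changed files when invoked with -update. The source/destination paths and the backup cluster name below are made-up examples, not a proposal for the actual setup.

```python
# Sketch only: periodic (e.g. weekly) incremental-style copy of an HDFS directory.
# Paths and the backup NameNode address are hypothetical placeholders.
import subprocess

SRC = "hdfs://analytics-hadoop/wmf/data/archive/pageview"   # example dataset
DST = "hdfs://backup-hadoop/wmf/backups/pageview"           # hypothetical target

# `hadoop distcp -update` copies only files that are missing or differ on the
# destination, so a scheduled run approximates a backup of the newly arrived data.
subprocess.run(["hadoop", "distcp", "-update", SRC, DST], check=True)
```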

  • Is the cluster in eqiad only and would you expect the backups in the same location (faster) or somewhere else (meteorite-proof)?

The cluster is indeed eqiad only; we don't have particular requirements for the location, just that the backups live somewhere.

  • What would be the estimated retention period and would it differ between data sets?

Some datasets would need to be kept indefinitely to avoid losing history, while other things could probably be dropped after a fixed amount of time. So the latter, I'd say.

  • What would be a reasonable ETA for recovery, both full and for key data sets?

Even a day would be fine, as long as we would not lose the data. Of course the faster the better :)

  • Does the data have to be exported in any way and if not, how can it be accessed (directly via the file system, API, local command on the data host, remote command)?

Data can be copied to any host capable of Kerberos authentication plus required credentials, since it would be streamed directly from the cluster. We also support a FUSE HDFS mountpoint to allow people to access data via an OS-local mountpoint, but it is a little brittle so I wouldn't suggest using it.
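
As a purely illustrative sketch of the "any Kerberos-capable host" point above, an export could look roughly like the following; the keytab path, principal and dataset/destination paths are hypothetical placeholders, not the real configuration.

```python
# Sketch only: stream a dataset out of HDFS onto a backup host's local filesystem.
# Keytab, principal and paths below are invented examples.
import subprocess

KEYTAB = "/etc/security/keytabs/backup.keytab"   # hypothetical keytab
PRINCIPAL = "backup/backuphost@EXAMPLE.REALM"    # hypothetical principal
SRC = "/wmf/data/archive/pageview"               # example HDFS path
DST = "/srv/backups/pageview"                    # example local path

# Obtain a Kerberos ticket from the keytab, then pull the files with the HDFS CLI.
subprocess.run(["kinit", "-kt", KEYTAB, PRINCIPAL], check=True)
subprocess.run(["hdfs", "dfs", "-get", SRC, DST], check=True)
```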

  • Is the data compressible?

It is already in highly compressed formats, but for this question I'd need to ask others :)

  • Last but not least, what's the rough timeline for the next upgrade?

We don't have it planned yet; it would be awesome to do it during the next fiscal year (say Q3, if I have to pick a number). We may end up not doing it (we still need to figure out resource constraints for the team), but not having a backup solution would be a major problem. We could of course attempt the upgrade anyway, but the risk of data corruption might be too much :(

we don't have particular requirements for the location

The answer to that would mostly be driven by: how much time could you wait for the recovery to be done? We would need more info, but the difference between single-DC and multi-DC would be something along the lines of "1 week" vs "3 months". Of course, the speed-up would mean the backups wouldn't protect against all threats.

Even a day would be fine

Recovering 400 TB in a day would imply a recovery speed of roughly 5 gigabytes (not gigabits) per second, which is physically impossible with our technology (even if parallelized) :-/
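
For reference, the back-of-the-envelope numbers behind that figure (assuming a 24-hour recovery window and decimal units):

```python
# Throughput required to restore 400 TB within one day (decimal units assumed).
data_bytes = 400 * 10**12           # 400 TB
window_s = 24 * 60 * 60             # 86,400 seconds in a day
rate = data_bytes / window_s
print(f"{rate / 10**9:.1f} GB/s")   # ~4.6 GB/s, i.e. roughly 37 Gbit/s sustained
```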

Data can be copied to any host capable of Kerberos authentication plus required credentials, since it would be streamed directly from the cluster

That looks to me like an "export", not a raw filesystem copy. There is absolutely no problem with that, but usually it means a slowdown on backup and recovery.

Nothing is impossible, but I wanted to introduce some "buts" to set expectations, especially because backups are usually optimized for long-term storage, disk resources and encryption, not as much for performance.

@jcrespo thanks for the info, lemme add more notes:

  • A day was a random value that I picked and it turned out to be very wrong; I think that we can wait even one/two week(s) for a full recovery, as long as we can get the data back.
  • About the raw file system copy: HDFS is split into two kinds of daemons, Namenode and Datanode. There are two of the former, one active and one standby, and they keep all the metadata of the filesystem, like where the blocks for a certain file are located (the inodes, basically). There is one Datanode on every worker (so dozens), and they only store raw blocks on ext4 partitions, without really knowing what the data is about. There are tools like HDFS Snapshots to force the Datanodes to create a sort of copy-on-write set of blocks to store permanent versions of directories, but I'd need to do some research to see if a raw data copy is feasible (see the sketch below). In theory, if we have the Namenode's metadata (basically its inodes, on the order of some GBs) and the raw blocks, we should be able to get the Hadoop filesystem working again. Practically this might be a little problematic in a data recovery scenario, since I imagine that my team would try to restore something working as soon as possible to keep ingesting new data, while trying to recover the lost data elsewhere (the export that you mentioned).
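
To make the snapshot idea above a bit more concrete, this is roughly what the relevant HDFS commands look like. It is a sketch only: the directory is a made-up example, and whether snapshots actually fit a backup workflow still needs the research mentioned in the bullet.

```python
# Sketch only: enable and take an HDFS snapshot, and dump the NameNode fsimage.
# The data directory and the local destination are hypothetical examples.
import subprocess

DATA_DIR = "/wmf/data/archive/pageview"   # example HDFS directory

# Allow snapshots on the directory (admin operation), then create a named snapshot.
subprocess.run(["hdfs", "dfsadmin", "-allowSnapshot", DATA_DIR], check=True)
subprocess.run(["hdfs", "dfs", "-createSnapshot", DATA_DIR, "pre-upgrade"], check=True)
# The snapshot appears under <dir>/.snapshot/pre-upgrade and shares unchanged blocks
# with the live data (copy-on-write), so it costs little space until files change.

# Separately, the Namenode metadata (the "inodes") can be fetched to a local directory.
subprocess.run(["hdfs", "dfsadmin", "-fetchImage", "/srv/backups/fsimage"], check=True)
```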

Practically this might be a little problematic in a data recovery scenario

Yes, this is something that I expected, as we had a similar kind of decision with Swift. While the two technologies are very different, I think they have similar characteristics in how an ideal backup would be taken (given their cluster/distributed nature). For backups we may want a more "logical" approach rather than a filesystem one, even if that means a bit of extra complexity and slowdown, because it would give us much more flexibility and long-term usability.

I would like to know more about "incrementals", as that would be new compared to Swift (where data, once backed up, is immutable) and probably not something that open-source Bacula would support by itself (block-level incrementals).

@jcrespo quick question: if we want to move forward with this, do we need hardware planned for next fiscal? I know that the use case is very high level and there are a lot of unclear points, so any input will be appreciated :)

do we need hardware planned for next fiscal

Absolutely yes. I thought that was clear, and something you were handling on your own or with my manager, who is the budget owner for data-persistence. We don't have the space to back up what you indicated even if we wiped all existing backups. And probably we don't have the technology either.

We also cannot promise when we will attend to this, as that will depend on the priorities expressed by the organization leads (again, that is something you have to discuss with my manager). I am just the person that works on implementing the things I am told to do, and we only have around 0.5 people working on backups at the moment.

LSobanski moved this task from Refine to Backlog on the Data-Persistence-Backup board.

Blocked at least until we get a clear picture from T283261.