Hi Data Persistence team :)
The Analytics team manages a big Hadoop HDFS cluster that can store ~1PB of data (replicated 3 times across nodes, for a grand total of up to ~3PB of raw storage). We recently upgraded HDFS from 2.6 to 2.10.1, and before that we created a temporary backup cluster, composed of new Hadoop worker nodes, to hold data that we couldn't afford to lose (for example, datasets that cannot be regenerated from others, like Pageview, etc.).
The task we used to establish what data absolutely needed to be in the backup is T260409.
We ended up using 400TB of space in the backup cluster (~800TB in total with replication 2, on 18 nodes). After the successful upgrade, we moved all the nodes used for the backup cluster to the main cluster (as originally planned), so we don't have any backup solution in place at the moment.
The main issue now is that we'll need to upgrade to newer versions during the next fiscal year(s), and having a permanent backup solution in place would also be really valuable to protect against accidental data drops or corruption during our day-to-day work (we have safeguards to prevent PEBCAKs, but not for all use cases). The data to back up would be a mixture of PII/data-sensitive and non-PII data.
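Just to make the idea a bit more concrete, here is a minimal sketch of what a recurring backup job could look like if we went with something like `hadoop distcp` between the main cluster and a dedicated backup cluster. The namenode addresses, dataset paths, and settings below are purely hypothetical placeholders, not a proposal for the actual implementation:

```
#!/usr/bin/env python3
"""Hypothetical sketch: copy a list of critical HDFS datasets to a backup cluster.

Assumes `hadoop distcp` is available on the host running the job.
All hostnames and paths are placeholders.
"""
import subprocess

SOURCE_NN = "hdfs://main-cluster-nn:8020"    # placeholder namenode address
BACKUP_NN = "hdfs://backup-cluster-nn:8020"  # placeholder namenode address

# Placeholder list of datasets that cannot be regenerated (see T260409 for the real list).
CRITICAL_DATASETS = [
    "/wmf/data/wmf/pageview",
    "/wmf/data/raw/webrequest",
]

for path in CRITICAL_DATASETS:
    cmd = [
        "hadoop", "distcp",
        "-update",   # only copy files that changed since the last run
        "-p",        # preserve ownership/permissions (relevant for the PII subsets)
        "-m", "50",  # cap the number of mappers to limit load on the main cluster
        f"{SOURCE_NN}{path}",
        f"{BACKUP_NN}{path}",
    ]
    print("Running:", " ".join(cmd))
    subprocess.run(cmd, check=True)
```

In practice we'd also need to figure out how the PII/sensitive subsets should be handled on the backup side (access control, retention) and how to throttle the copies so they don't interfere with regular jobs, but hopefully this gives a feel for the shape of it.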
I don't have a clear idea of the annual data growth for those 400TB, so I added Joseph to the task to follow up and provide a more reliable source of truth than me :D