As highlighted in https://wikitech.wikimedia.org/wiki/Incident_documentation/2021-10-29_graphite, the current backfill procedure for Graphite can lose data. We therefore need to, at a minimum:
- Revise the backfill procedure to be more robust in the face of similar failures in the future (e.g. run a full rsync first, then backfill only the gap)
- Perform post-sync / post-backfill validation to check that the number of datapoints across all metric files is roughly in sync between hosts
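
The validation step could look roughly like the sketch below. It assumes that a per-metric count of non-null datapoints has already been collected on each host (how those counts are gathered from the metric files is not shown here). The function name `compare_counts`, the 5% default tolerance, and the metric names in the usage note are all hypothetical and only for illustration; this is not an existing tool.

```python
from typing import Dict, List, Tuple


def compare_counts(
    source: Dict[str, int],
    dest: Dict[str, int],
    tolerance: float = 0.05,  # assumed threshold, tune as needed
) -> List[Tuple[str, int, int]]:
    """Return (metric, source_count, dest_count) for every metric whose
    datapoint counts differ by more than `tolerance` (relative to the
    larger of the two counts). A metric missing on one host counts as 0."""
    mismatches = []
    for metric in sorted(set(source) | set(dest)):
        src = source.get(metric, 0)
        dst = dest.get(metric, 0)
        baseline = max(src, dst)
        if baseline == 0:
            # No datapoints on either host, nothing to compare.
            continue
        if abs(src - dst) / baseline > tolerance:
            mismatches.append((metric, src, dst))
    return mismatches
```

For example, with `source = {"servers.a.cpu": 1000, "servers.b.cpu": 1000, "servers.c.cpu": 500}` and `dest = {"servers.a.cpu": 990, "servers.b.cpu": 400}`, only `servers.b.cpu` and `servers.c.cpu` would be reported: the first metric differs by 1%, which is within the tolerance, while the other two are either well short or missing entirely on the destination host. A relative tolerance is used because hosts that are still receiving live traffic will rarely match exactly.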