
Revise and improve Graphite backfill procedure
Open, Needs Triage, Public

Description

As highlighted in https://wikitech.wikimedia.org/wiki/Incident_documentation/2021-10-29_graphite, the current backfill procedure for Graphite can lose data. We therefore need to at least:

  • Revise the backfill procedure to be more robust in the face of similar failures in the future (e.g. run a full rsync first, then backfill only the gap)
  • Perform validation post-sync / post-backfill to check that the number of datapoints across all metric files roughly matches between hosts
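
The validation step could look roughly like the sketch below. It assumes the non-null datapoint count of each metric file has already been collected on both hosts (for example by reading the whisper files) and is available as a mapping from metric name to count. The function name `find_mismatches`, the 5% tolerance, and the sample metric names are all hypothetical and only illustrate the comparison, not an existing tool.

```python
from typing import Dict, List, Tuple


def find_mismatches(
    source: Dict[str, int],
    target: Dict[str, int],
    tolerance: float = 0.05,  # assumed threshold: allow 5% relative difference
) -> List[Tuple[str, int, int]]:
    """Return (metric, source_count, target_count) for every metric whose
    datapoint counts differ by more than `tolerance` between the two hosts.

    A metric missing on one host is treated as having zero datapoints there.
    """
    mismatches = []
    for metric in sorted(set(source) | set(target)):
        src = source.get(metric, 0)
        dst = target.get(metric, 0)
        largest = max(src, dst)
        if largest == 0:
            # No datapoints on either host, so there is nothing to compare.
            continue
        if abs(src - dst) / largest > tolerance:
            mismatches.append((metric, src, dst))
    return mismatches


if __name__ == "__main__":
    # Hypothetical per-metric counts gathered on each host.
    source_counts = {"servers.a.cpu": 1000, "servers.b.cpu": 500, "servers.c.cpu": 200}
    target_counts = {"servers.a.cpu": 990, "servers.b.cpu": 100}
    for metric, src, dst in find_mismatches(source_counts, target_counts):
        print(f"{metric}: source={src} target={dst}")
```

A relative tolerance is used rather than exact equality because hosts keep receiving new datapoints while the sync runs, so counts are only expected to be roughly in sync.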