Revise and improve Graphite backfill procedure
Closed, Declined; Public

Description

As highlighted in https://wikitech.wikimedia.org/wiki/Incident_documentation/2021-10-29_graphite, the current backfill procedure for Graphite can lose data. We therefore need to at least:

  • Revise the backfill procedure to be more robust in the face of similar failures in the future (e.g. run a full rsync first, then backfill only the gap)
  • Perform validation post-sync / post-backfill to check the number of datapoints across all metric files is roughly in sync between hosts
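
The validation step could be sketched roughly as below. This is a minimal illustration, not the procedure actually used: it assumes the per-metric datapoint counts have already been collected on each host into a mapping of metric file path to count (how they are collected is out of scope here), and the function name `compare_datapoint_counts` and the 1% default tolerance are hypothetical choices.

```python
from typing import Dict, List, Tuple


def compare_datapoint_counts(
    source: Dict[str, int],
    target: Dict[str, int],
    tolerance: float = 0.01,  # assumed threshold for "roughly in sync"
) -> List[Tuple[str, int, int]]:
    """Return (metric, source_count, target_count) for every metric whose
    counts differ by more than `tolerance`, relative to the larger count.

    A metric missing on one host is treated as having zero datapoints there.
    """
    mismatches = []
    for metric in sorted(set(source) | set(target)):
        src = source.get(metric, 0)
        dst = target.get(metric, 0)
        largest = max(src, dst)
        if largest == 0:
            # No datapoints on either host: nothing to compare.
            continue
        if abs(src - dst) / largest > tolerance:
            mismatches.append((metric, src, dst))
    return mismatches


if __name__ == "__main__":
    # Hypothetical counts for illustration only.
    host_a = {"a.wsp": 1000, "b.wsp": 500, "c.wsp": 10}
    host_b = {"a.wsp": 995, "b.wsp": 100}
    for metric, src, dst in compare_datapoint_counts(host_a, host_b):
        print(f"{metric}: source={src} target={dst}")
```

A relative tolerance is used rather than exact equality because the hosts keep receiving datapoints while the sync runs, so small differences are expected.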

Event Timeline

lmata subscribed.

I am closing this task because Graphite is under active deprecation and existing metrics are being migrated to Prometheus. The SRE Observability team asks that no new metrics be deployed to Graphite, as the service will transition to a read-only state at the end of Q3 FY2024/25 (~Spring 2025).

For additional details, please see the Graphite/Deprecation Roadmap (https://wikitech.wikimedia.org/wiki/Graphite/Deprecation_Roadmap) or the tracking task T228380: Tech debt: sunsetting of Graphite.