Definitions of Done
This checklist ensures that all work done for Wikidata follows consistent reliability and resilience patterns, improving stability in Kubernetes deployments, reducing operational toil, and preventing cascading failures.
DoD Checklist
- APIs expose a readiness healthcheck endpoint, and Kafka consumers/producers expose a liveness healthcheck endpoint (separate endpoints).
- Services handle SIGTERM gracefully:
  - Readiness healthcheck returns “not ready” immediately on SIGTERM.
  - Ongoing tasks and goroutines are shut down gracefully.
  - Downstream connections are closed before exit.
- Retries to external services use exponential backoff, with the backoff parameters verified and tuned.
- Circuit breaker pattern is implemented to prevent retry storms when downstream systems degrade or fail.
Test Strategy
- Validate readiness/liveness endpoints in dev
- Simulate container shutdown (SIGTERM) to confirm graceful termination
- Run chaos/fault injection to validate retry + circuit breaker logic
- Verify manually where automated testing is not feasible
Things to Consider
- Do we need new alarms/metrics (e.g., readiness/liveness failures, retry exhaustion, circuit breaker open)? [x] Yes [ ] No
Which ones?
- Do we need to update runbooks/playbooks with shutdown, retry, and healthcheck details? If yes, list docs to update here.