**Definitions of Done**
This checklist ensures that all work done for Wikidata follows consistent reliability and resilience patterns, improving stability in Kubernetes deployments, reducing operational toil, and preventing cascading failures.
**DoD Checklist**
[] APIs expose a readiness healthcheck endpoint, and Kafka consumers/producers expose a liveness healthcheck endpoint (separate endpoints).
[] Services handle SIGTERM gracefully:
  [] Readiness healthcheck returns “not ready” immediately on SIGTERM
  [] Ongoing tasks and goroutines are shut down gracefully
  [] Downstream connections are closed before exit
[] Retries to external services use exponential backoff (verified/tuned).
[] Circuit breaker pattern is implemented to prevent retry storms when downstream systems degrade or fail.
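The healthcheck and SIGTERM items above can be sketched in Go roughly as follows. This is a minimal illustration, not a prescribed implementation: the `/readyz` and `/livez` paths, port `:8080`, the 20-second drain deadline, and the `health` type are all assumptions chosen for the example.

```go
package main

import (
	"context"
	"errors"
	"log"
	"net/http"
	"os"
	"os/signal"
	"sync/atomic"
	"syscall"
	"time"
)

// health tracks whether the service should receive traffic (hypothetical helper).
type health struct {
	ready atomic.Bool
}

// readyStatus returns 200 while ready and 503 once shutdown has begun.
func (h *health) readyStatus() int {
	if h.ready.Load() {
		return http.StatusOK
	}
	return http.StatusServiceUnavailable
}

// readyz is the readiness endpoint: it flips to "not ready" on SIGTERM.
func (h *health) readyz(w http.ResponseWriter, _ *http.Request) {
	w.WriteHeader(h.readyStatus())
}

// livez is the liveness endpoint: it only reports that the process can respond.
func (h *health) livez(w http.ResponseWriter, _ *http.Request) {
	w.WriteHeader(http.StatusOK)
}

func main() {
	h := &health{}
	h.ready.Store(true)

	mux := http.NewServeMux()
	mux.HandleFunc("/readyz", h.readyz) // assumed path
	mux.HandleFunc("/livez", h.livez)   // assumed path
	srv := &http.Server{Addr: ":8080", Handler: mux}

	// ctx is cancelled when SIGTERM (or Ctrl-C) arrives.
	ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGTERM, os.Interrupt)
	defer stop()

	go func() {
		if err := srv.ListenAndServe(); err != nil && !errors.Is(err, http.ErrServerClosed) {
			log.Fatalf("listen: %v", err)
		}
	}()

	<-ctx.Done()
	h.ready.Store(false) // 1. report "not ready" immediately

	// 2. Drain in-flight requests within an assumed 20s deadline.
	shutdownCtx, cancel := context.WithTimeout(context.Background(), 20*time.Second)
	defer cancel()
	if err := srv.Shutdown(shutdownCtx); err != nil {
		log.Printf("forced shutdown: %v", err)
	}

	// 3. Close downstream connections (database, Kafka clients) here before exiting.
}
```

Some teams also wait briefly between marking the service not ready and calling `Shutdown`, so the orchestrator has time to stop routing traffic. Whether that delay is needed depends on the probe interval configured for the deployment.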
**Test Strategy**
[] Validate readiness/liveness endpoints in dev
[] Simulate container shutdown (SIGTERM) to confirm graceful termination
[] Run chaos/fault injection to validate retry + circuit breaker logic
[] Perform manual verification where automated testing is not feasible
**Things to Consider**
- Do we need new alarms/metrics (e.g., readiness/liveness failures, retry exhaustion, circuit breaker open)? [x] Yes [] No
Which ones?
- Do we need to update runbooks/playbooks with shutdown, retry, and healthcheck details? If yes, list docs to update here.