
[APP: Objective 3 Term 3.1 - WME OKR TBD - Q1-Q2 FY26/26] Wikidata is available as a project
Open, High, Public

Description

Definitions of Done

This checklist ensures that all work done for Wikidata follows consistent reliability and resilience patterns, improving stability in Kubernetes deployments, reducing operational toil, and preventing cascading failures.

DoD checklist

  • APIs expose a readiness healthcheck endpoint, and Kafka consumers/producers expose a liveness healthcheck endpoint (separate endpoints).
  • Services handle SIGTERM gracefully:
      • Readiness healthcheck returns “not ready” immediately on SIGTERM
      • Ongoing tasks and goroutines are shut down gracefully
      • Downstream connections are closed before exit
  • Retries to external services use exponential backoff (verified/tuned).
  • Circuit breaker pattern is implemented to prevent retry storms when downstream systems degrade or fail.
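The healthcheck and SIGTERM items above can be sketched roughly as follows. This is a minimal illustration, not the Wikidata services' actual code: the `/readyz` and `/livez` paths, port `:8080`, the 20-second shutdown timeout, and the `worker` function are all assumptions made for the example.

```go
package main

import (
	"context"
	"errors"
	"log"
	"net/http"
	"os"
	"os/signal"
	"sync"
	"sync/atomic"
	"syscall"
	"time"
)

// ready is flipped to false as soon as a termination signal arrives.
var ready atomic.Bool

// readinessStatus returns 200 while the service accepts traffic, 503 otherwise.
func readinessStatus() int {
	if ready.Load() {
		return http.StatusOK
	}
	return http.StatusServiceUnavailable
}

// newMux wires separate readiness and liveness endpoints (paths are assumed).
func newMux() *http.ServeMux {
	mux := http.NewServeMux()
	mux.HandleFunc("/readyz", func(w http.ResponseWriter, _ *http.Request) {
		w.WriteHeader(readinessStatus())
	})
	mux.HandleFunc("/livez", func(w http.ResponseWriter, _ *http.Request) {
		w.WriteHeader(http.StatusOK) // the process is up; says nothing about traffic
	})
	return mux
}

// worker is a hypothetical background task that stops when ctx is cancelled.
func worker(ctx context.Context) {
	ticker := time.NewTicker(time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			// process one unit of work here
		}
	}
}

func main() {
	ready.Store(true)
	srv := &http.Server{Addr: ":8080", Handler: newMux()}

	// ctx is cancelled on SIGTERM (sent by Kubernetes on pod shutdown) or Ctrl-C.
	ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGTERM, os.Interrupt)
	defer stop()

	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		worker(ctx)
	}()

	go func() {
		if err := srv.ListenAndServe(); err != nil && !errors.Is(err, http.ErrServerClosed) {
			log.Fatalf("listen: %v", err)
		}
	}()

	<-ctx.Done()
	ready.Store(false) // readiness reports "not ready" immediately

	shutdownCtx, cancel := context.WithTimeout(context.Background(), 20*time.Second)
	defer cancel()
	if err := srv.Shutdown(shutdownCtx); err != nil { // drains in-flight requests
		log.Printf("shutdown: %v", err)
	}
	wg.Wait() // background goroutines have exited
	// Close downstream connections (Kafka, databases, ...) here before returning.
}
```

Keeping readiness and liveness on separate endpoints lets Kubernetes stop routing traffic to a terminating pod without restarting it.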

Test Strategy

  • Validate readiness/liveness endpoints in dev
  • Simulate container shutdown (SIGTERM) to confirm graceful termination
  • Run chaos/fault injection to validate retry + circuit breaker logic
  • Manual verification if automated testing is not feasible
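As a starting point for the retry and circuit breaker validation, a minimal sketch of the two patterns with an injected fault could look like the following. The `Breaker` type, the `Retry` helper, `errInjected`, and all thresholds are hypothetical names and values chosen for illustration; a real service would more likely use an established library and tuned settings.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// ErrCircuitOpen is returned while the breaker rejects calls.
var ErrCircuitOpen = errors.New("circuit breaker is open")

// errInjected simulates a failing downstream dependency.
var errInjected = errors.New("injected fault")

// Breaker is a minimal circuit breaker. It is not safe for concurrent use;
// a production version would guard its fields with a mutex.
type Breaker struct {
	maxFailures int           // consecutive failures that open the circuit
	cooldown    time.Duration // how long the circuit stays open
	failures    int
	openedAt    time.Time
}

// Call runs fn unless the circuit is open. After the cooldown it allows one
// trial call (half-open): success closes the circuit, failure reopens it.
func (b *Breaker) Call(fn func() error) error {
	if b.failures >= b.maxFailures && time.Since(b.openedAt) < b.cooldown {
		return ErrCircuitOpen
	}
	if err := fn(); err != nil {
		b.failures++
		if b.failures >= b.maxFailures {
			b.openedAt = time.Now()
		}
		return err
	}
	b.failures = 0
	return nil
}

// Retry calls fn through the breaker up to attempts times, doubling the delay
// after each failure. It gives up at once when the breaker is open, so retries
// do not pile onto a dependency that is already failing.
func Retry(b *Breaker, attempts int, base time.Duration, fn func() error) error {
	var err error
	delay := base
	for i := 0; i < attempts; i++ {
		if err = b.Call(fn); err == nil {
			return nil
		}
		if errors.Is(err, ErrCircuitOpen) {
			return err
		}
		if i < attempts-1 {
			time.Sleep(delay)
			delay *= 2
		}
	}
	return fmt.Errorf("retries exhausted after %d attempts: %w", attempts, err)
}

func main() {
	b := &Breaker{maxFailures: 3, cooldown: time.Second}
	calls := 0
	err := Retry(b, 5, 10*time.Millisecond, func() error {
		calls++
		return errInjected // fault injection: the dependency always fails
	})
	fmt.Println(calls, err)
}
```

With these values the injected fault is hit three times; the fourth attempt is rejected by the open breaker, and `Retry` returns `ErrCircuitOpen` instead of using its remaining attempts.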

Things to Consider

  • Do we need new alarms/metrics (e.g., readiness/liveness failures, retry exhaustion, circuit breaker open)? [x] Yes [] No

Which ones?

  • Do we need to update runbooks/playbooks with shutdown, retry, and healthcheck details? If yes, list docs to update here.

Event Timeline

JArguello-WMF renamed this task from [APP: Objective 3 Term 3.1 - WME OKR TBD - Q1 FY26/26] Wikidata is available as a project to [APP: Objective 3 Term 3.1 - WME OKR TBD - Q1-Q2 FY26/26] Wikidata is available as a project. Jul 25 2025, 4:31 PM
JArguello-WMF triaged this task as High priority.
KMontalva-WMF renamed this task from [APP: Objective 3 Term 3.1 - WME OKR TBD - Q1-Q2 FY26/26] Wikidata is available as a project to [APP: Objective 3 Term 3.1 - WME OKR TBD - Q1-Q2 FY26/26] Wikidata is available as a project. Dec 16 2025, 1:11 PM
KMontalva-WMF updated the task description.

Wikidata updater: SIGTERM handling is not done yet.
Retries in the updater need to be re-enabled.
Circuit breaker and chaos/fault-injection testing: ask Renil when he's back.