
MariaDB lifetime management system
Open, In Progress, Medium, Public

Description

Implement automation to depool/repool replicas as needed, manage role changes between master / candidate-master / replicas, and ultimately automate master failover.

Requirements:

  • Track where/when ground truth is coming from
  • Hard/soft distributed locks on section / dc to prevent clashing failover / maintenance activities
  • Support multiple candidates and select based on priority
  • Support topology-aware host selection (rack, ToR switch, etc.)
  • Automated failover/depooling with velocity checks
    • Critical path cannot depend on lower tier services
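As an illustration of the candidate-selection and velocity-check requirements above, here is a minimal sketch. Everything in it is an assumption made for the example and not part of the design: the `Candidate` fields, the "lower number is preferred" priority convention, the rule of avoiding the failed master's rack, the host names, and the sliding-window limit.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    host: str      # hypothetical host name
    priority: int  # assumed convention: lower number = preferred
    rack: str      # topology hint; could equally be a ToR switch


def pick_candidate(candidates, failed_master_rack):
    """Prefer candidates outside the failed master's rack, then by priority."""
    if not candidates:
        return None
    # False sorts before True, so hosts in a different rack win first.
    return min(
        candidates,
        key=lambda c: (c.rack == failed_master_rack, c.priority, c.host),
    )


class VelocityCheck:
    """Allow at most max_actions automated actions per sliding time window."""

    def __init__(self, max_actions, window_seconds):
        self.max_actions = max_actions
        self.window_seconds = window_seconds
        self._timestamps = deque()

    def allow(self, now):
        # Forget actions that have fallen out of the window.
        while self._timestamps and now - self._timestamps[0] >= self.window_seconds:
            self._timestamps.popleft()
        if len(self._timestamps) >= self.max_actions:
            return False  # too many recent actions: leave it to a human
        self._timestamps.append(now)
        return True
```

The limiter takes `now` as an argument so that it keeps no dependency on a clock or on any other service, in line with the requirement that the critical path cannot depend on lower tier services.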

Initial design doc: https://docs.google.com/document/d/17ApZIOSyGP2kmMbhOgXT-9--_OrRaNCuSpThFPGW6eo/edit?tab=t.0

Roadmap:

Related Objects

Status | Assigned
Open | None
In Progress | FCeratto-WMF
Open | FCeratto-WMF
In Progress | FCeratto-WMF
Resolved | FCeratto-WMF
Open | FCeratto-WMF
Resolved | None
Declined | ABran-WMF
Resolved | ABran-WMF
Resolved | ABran-WMF
Declined | FCeratto-WMF
Resolved | Marostegui
Resolved | FCeratto-WMF
Open | None
Resolved | FCeratto-WMF
Duplicate | FCeratto-WMF
Open | None
Open | None
Resolved | Kormat
Declined | None
Open | None
Resolved | Kormat
Resolved | Kormat
Open | None
Resolved | FCeratto-WMF
Declined | None
Open | FCeratto-WMF
Resolved | FCeratto-WMF
Open | None
Duplicate | FCeratto-WMF
Resolved | Marostegui
Open | FCeratto-WMF
Open | None
Resolved | Marostegui
Resolved | FCeratto-WMF
Open | None

Event Timeline


Change #1129904 had a related patch set uploaded (by Federico Ceratto; author: Federico Ceratto):

[operations/cookbooks@master] Add switchover cookbook

https://gerrit.wikimedia.org/r/1129904

I'm sharing here an updated diagram of the components based on the ongoing work and a recent discussion with @Joe.
The arrows indicate data flows.

Screenshot 2025-04-03 at 16-17-00 Graphviz Online.png (783×1 px, 102 KB)

White ellipses represent data, rectangles represent processes or functions.

  • status fetcher: extracts replication and general health status from databases at high frequency, directly and/or from Orchestrator
  • dbconfig ingestion: fetches DB-related data from e.g. https://noc.wikimedia.org/dbconfig/eqiad.json
  • daemon: performs database pooling/depooling, master/candidate flips and other activities that require changing dbcontrol values, either automatically or on SRE input
  • API: a read-only HTTP API, primarily providing T384212
  • CLI tool: used by SREs to set desired states; could be a runbook.
  • puppet ingestion: fetches DB-related data from Puppet, primarily to spot inconsistencies or new/unused hosts; see T389663, T388127, T389932
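To make the split between the CLI tool (desired state), the status fetcher (observed state) and the daemon (actions) more concrete, here is a rough sketch of the comparison the daemon could run. The state names (`pooled`, `depooled`), the action names and the host names are hypothetical and chosen only for this example; the real daemon would act on dbcontrol values.

```python
def plan_actions(desired, observed):
    """Return the sorted (action, host) pairs needed to reach the desired state.

    desired:  host -> "pooled" | "depooled", as set by SREs (e.g. via the CLI tool)
    observed: host -> current state, as reported by the status fetcher
    """
    actions = []
    for host, want in desired.items():
        have = observed.get(host)
        if have is None:
            # No data for this host: do not guess, flag it instead.
            actions.append(("investigate", host))
        elif want == "pooled" and have != "pooled":
            actions.append(("repool", host))
        elif want == "depooled" and have == "pooled":
            actions.append(("depool", host))
    return sorted(actions)
```

Keeping this step a pure function of its two inputs fits the goal of stateless processes: it can be re-run at any time, in any replica, and yields the same plan.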

All processes are running as daemons, except for the CLI tool. Most processes should be stateless and able to run in the aux k8s cluster.
(Processes could be unix processes or different functions in the same daemon)

The Puppet ingestion process is optional and/or could be implemented as a manual tool instead.

The API could also serve ancillary data e.g. generate configuration snippets.

The status fetcher also exposes replication delay as Prometheus metrics, addressing T141968 with minimal effort.
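A minimal sketch of what exposing replication delay could look like, rendering the Prometheus text exposition format by hand. The metric name `mariadb_replication_lag_seconds`, the `host` label and the host names are assumptions for illustration, not what the status fetcher actually exports; a real implementation would more likely use a Prometheus client library.

```python
def render_lag_metrics(lag_by_host, metric="mariadb_replication_lag_seconds"):
    """Render per-host replication lag as a gauge in Prometheus text format.

    Host names are assumed to contain no quotes or backslashes, so label
    values are not escaped here.
    """
    lines = [
        f"# HELP {metric} Replication lag in seconds as seen by the status fetcher.",
        f"# TYPE {metric} gauge",
    ]
    for host in sorted(lag_by_host):
        lines.append(f'{metric}{{host="{host}"}} {float(lag_by_host[host])}')
    return "\n".join(lines) + "\n"
```

For example, `render_lag_metrics({"db1001": 0.5})` yields a HELP line, a TYPE line and one sample line for `db1001`.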

(editable diagram at https://is.gd/qtV5ZJ)

This is very long term, but we need to keep in mind that if I am successful with T324965 (Clean up old gtid_domain_id), we may start using Orchestrator for master switchovers (both emergency and planned ones). Either way (coding this tool, or fixing GTID and trusting Orchestrator), this is a very long-term thing; I just wanted to leave this here.

Change #1145127 had a related patch set uploaded (by Federico Ceratto; author: Federico Ceratto):

[operations/deployment-charts@master] zarcillo: values.yaml: Fix typo, remove comment

https://gerrit.wikimedia.org/r/1145127

Change #1145127 merged by Clément Goubert:

[operations/deployment-charts@master] zarcillo: values.yaml: Fix typo, remove comment

https://gerrit.wikimedia.org/r/1145127

Change #1146018 had a related patch set uploaded (by Federico Ceratto; author: Federico Ceratto):

[operations/deployment-charts@master] zarcillo: values.yaml: Add FQDN for SNI

https://gerrit.wikimedia.org/r/1146018

Change #1146018 merged by jenkins-bot:

[operations/deployment-charts@master] zarcillo: values.yaml: Add FQDN for SNI

https://gerrit.wikimedia.org/r/1146018

Change #1156401 had a related patch set uploaded (by Federico Ceratto; author: Federico Ceratto):

[operations/deployment-charts@master] zarcillo: Allow egress to idp.wikimedia.org

https://gerrit.wikimedia.org/r/1156401

Change #1156401 merged by Federico Ceratto:

[operations/deployment-charts@master] zarcillo: Allow egress to idp.wikimedia.org

https://gerrit.wikimedia.org/r/1156401

Change #1129904 abandoned by Federico Ceratto:

[operations/cookbooks@master] Add switchover cookbook

Reason:

Discussed on IRC: the automation workflow is to be rediscussed

https://gerrit.wikimedia.org/r/1129904

Change #1129904 restored by Federico Ceratto:

[operations/cookbooks@master] Add switchover cookbook

https://gerrit.wikimedia.org/r/1129904

Change #1166227 had a related patch set uploaded (by Federico Ceratto; author: Federico Ceratto):

[operations/deployment-charts@master] zarcillo: Update egress to idp.wikimedia.org

https://gerrit.wikimedia.org/r/1166227

Change #1166227 merged by jenkins-bot:

[operations/deployment-charts@master] zarcillo: Update egress to idp.wikimedia.org

https://gerrit.wikimedia.org/r/1166227

Change #1172334 had a related patch set uploaded (by Federico Ceratto; author: Federico Ceratto):

[operations/deployment-charts@master] zarcillo: Enable egress for Alertmanager

https://gerrit.wikimedia.org/r/1172334

Change #1172334 merged by Federico Ceratto:

[operations/deployment-charts@master] zarcillo: Enable egress for Alertmanager

https://gerrit.wikimedia.org/r/1172334

Change #1172635 had a related patch set uploaded (by Federico Ceratto; author: Federico Ceratto):

[operations/deployment-charts@master] zarcillo: Add egress to Netbox

https://gerrit.wikimedia.org/r/1172635

Change #1172635 merged by Federico Ceratto:

[operations/deployment-charts@master] zarcillo: Add egress to Netbox and config-master

https://gerrit.wikimedia.org/r/1172635

Change #1196437 had a related patch set uploaded (by Federico Ceratto; author: Federico Ceratto):

[operations/deployment-charts@master] zarcillo: update egress after IDP ipaddr changes

https://gerrit.wikimedia.org/r/1196437

Change #1196437 merged by Federico Ceratto:

[operations/deployment-charts@master] zarcillo: update egress after IDP ipaddr changes

https://gerrit.wikimedia.org/r/1196437

Change #1198924 had a related patch set uploaded (by Federico Ceratto; author: Federico Ceratto):

[operations/deployment-charts@master] zarcillo: remove obsoleted IDP egress ipaddrs

https://gerrit.wikimedia.org/r/1198924

Change #1198924 merged by Federico Ceratto:

[operations/deployment-charts@master] zarcillo: remove obsoleted IDP egress ipaddrs

https://gerrit.wikimedia.org/r/1198924

Change #1211165 had a related patch set uploaded (by Federico Ceratto; author: Federico Ceratto):

[operations/deployment-charts@master] zarcillo: Allow egress to etcd to fetch dbctl values

https://gerrit.wikimedia.org/r/1211165

Change #1211165 merged by Federico Ceratto:

[operations/deployment-charts@master] zarcillo: Allow egress to etcd to fetch dbctl values

https://gerrit.wikimedia.org/r/1211165

Change #1217492 had a related patch set uploaded (by Federico Ceratto; author: Federico Ceratto):

[operations/puppet@production] prometheus-mariadb-replication-lag.py: mysql_heartbeat_lag_seconds metric

https://gerrit.wikimedia.org/r/1217492

Change #1129904 abandoned by Federico Ceratto:

[operations/cookbooks@master] Add switchover cookbook

Reason:

The tool has been moved to https://gitlab.wikimedia.org/repos/sre/wmfmariadbpy/-/merge_requests/17

https://gerrit.wikimedia.org/r/1129904

Change #1217492 merged by Federico Ceratto:

[operations/puppet@production] prometheus-mariadb-replication-lag.py: mysql_heartbeat_lag_seconds metric

https://gerrit.wikimedia.org/r/1217492